Multi-Task Learning for Sequence Tagging: An Empirical Study
We study three general multi-task learning (MTL) approaches on 11 sequence tagging tasks. Our extensive empirical results show that in about 50% of the cases, jointly learning all 11 tasks improves upon either independent or pairwise learning of the tasks. We also show that pairwise MTL can inform us which tasks can benefit others and which tasks stand to benefit if they are learned jointly. In particular, we identify tasks that can always benefit others as well as tasks that are always harmed by others. Interestingly, one of our MTL approaches yields embeddings of the tasks that reveal the natural clustering of semantic and syntactic tasks. Our inquiries open the door to further utilization of MTL in NLP.
Soravit Changpinyo, Hexiang Hu, and Fei Sha Department of Computer Science University of Southern California Los Angeles, CA 90089 schangpi,hexiangh,firstname.lastname@example.org
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/
Multi-task learning (MTL) has long been studied in the machine learning literature, cf. [Caruana, 1997]. The technique has also been popular in NLP, for example, in [Collobert and Weston, 2008, Collobert et al., 2011, Luong et al., 2016]. The main thesis underpinning MTL is that solving many tasks together provides a shared inductive bias that leads to more robust and generalizable systems. This is especially appealing for NLP as data for many tasks are scarce — shared learning thus reduces the amount of training data needed. MTL has been validated in recent work, mostly where auxiliary tasks are used to improve the performance on a target task, for example, in sequence tagging [Søgaard and Goldberg, 2016, Bjerva et al., 2016, Plank et al., 2016, Alonso and Plank, 2017, Bingel and Søgaard, 2017].
Despite those successful applications, several key issues about the effectiveness of MTL remain open. Firstly, with only a few exceptions, much existing work focuses on “pairwise” MTL where there is a target task and one or several (carefully) selected auxiliary tasks. However, can jointly learning many tasks benefit all of them together? A positive answer would significantly raise the utility of MTL. Secondly, how are tasks related such that one could benefit another? For instance, one plausible intuition is that syntactic and semantic tasks each benefit tasks within their own group, while cross-group assistance is weak or unlikely. However, such notions have not been put to the test thoroughly on a significant number of tasks.
In this paper, we address such questions. We investigate jointly learning multiple sequence tagging tasks. Besides using independent single-task learning as a baseline and a popular shared-encoder MTL framework for sequence tagging [Collobert et al., 2011], we propose two MTL variants in which both the encoder and the decoder can be shared by all tasks.
We conduct extensive empirical studies on 11 sequence tagging tasks (we defer the discussion of why we select these tasks to a later section). We demonstrate that there is a benefit to moving beyond “pairwise” MTL. We also obtain interesting pairwise relationships that reveal which tasks are beneficial or harmful to others, and which tasks are likely to benefit or be harmed. We find such information correlated with the results of MTL using more than two tasks. We also study selecting only beneficial tasks for joint training, showing that such a “greedy” approach in general improves MTL performance, highlighting the need to identify with whom to jointly learn.
The rest of the paper is organized as follows. We describe different approaches for learning from multiple tasks in Sect. 2. We describe our experimental setup and results in Sect. 3 and Sect. 4, respectively. We discuss related work in Sect. 5. Finally, we conclude with discussion and future work in Sect. 6.
2 Multi-Task Learning for Sequence Tagging
In this section, we describe general approaches to multi-task learning (MTL) for sequence tagging. We select sequence tagging tasks for several reasons. Firstly, we want to concentrate on comparing the tasks themselves without being confounded by designing specialized MTL methods for solving complicated tasks. Sequence tagging tasks are done at the word level, allowing us to focus on simpler models while still enabling varying degrees of sharing among tasks. Secondly, those tasks are often the first steps in NLP pipelines that come with extremely diverse resources. Understanding the nature of the relationships between them is likely to have a broad impact on many downstream applications.
Let T be the number of tasks and D_t be the training data of task t. A dataset for each task consists of input-output pairs. In sequence tagging, each pair corresponds to a sequence of words x = (x_1, …, x_L) and their corresponding ground-truth tags y = (y_1, …, y_L), where L is the sequence length. We note that our definition of “task” is not the same as “domain” or “dataset.” In particular, we differentiate between tasks based on whether or not they share the label space of tags. For instance, part-of-speech tagging on the weblog domain and that on the email domain are considered the same task in this paper.
Given the training data {D_t}, we describe how to learn one or more models to perform all the tasks. In general, our models follow the design of state-of-the-art sequence taggers [Reimers and Gurevych, 2017]. They have an encoder with parameters θ_enc that encodes a sequence of word tokens x into a sequence of vectors h, and a decoder with parameters θ_dec that decodes the sequence of vectors into a sequence of predicted tags ŷ. That is, h = ENC(x; θ_enc) and ŷ = DEC(h; θ_dec). The model parameters are learned by minimizing some loss function over ŷ and y. In what follows, we will use superscripts to differentiate instances from different tasks.
Figure 1: Model architectures for single-task learning, MTL (Multi-Dec), MTL (TEDec), and MTL (TEEnc).
In single-task learning (STL), we learn T models independently. For each task t, we have an encoder with parameters θ_enc^t and a decoder with parameters θ_dec^t. Clearly, information is not shared between tasks in this case.
In multi-task learning (MTL), we consider two or more tasks and train an MTL model jointly over a combined loss (the sum of the per-task losses). In this paper, we consider the following general frameworks, which differ in how the parameters of those tasks are shared.
Multi-task learning with multiple decoders (Multi-Dec) We learn a shared encoder θ_enc and T task-specific decoders θ_dec^1, …, θ_dec^T. This setting has been explored for sequence tagging in [Collobert and Weston, 2008, Collobert et al., 2011]. In the context of sequence-to-sequence learning [Sutskever et al., 2014], this is similar to the “one-to-many” MTL setting in [Luong et al., 2016].
Multi-task learning with task embeddings (TE) We learn a shared encoder θ_enc for the input sentence as well as a shared decoder θ_dec. To equip our model with the ability to perform one-to-many mapping (i.e., multiple tasks), we augment the model with “task embeddings.” Specifically, we additionally learn a function TE that maps a task ID to a vector. We explore two ways of injecting task embeddings into models. In both cases, TE is simply an embedding layer that maps the task ID to a dense vector.
One approach, denoted by TEDec, is to incorporate task embeddings into the decoder. We concatenate the task embeddings with the encoder’s outputs and then feed the result to the decoder.
The other approach, denoted by TEEnc, is to combine the task embeddings with the input sequence of words at the encoder. We implement this by prepending the task token (<<upos>>, <<chunk>>, <<mwe>>, etc.) to the input sequence and treat the task token as a word token [Johnson et al., 2017].
While the encoder in TEDec must learn to encode a general-purpose representation of the input sentence, the encoder in TEEnc knows from the start which task it is going to perform.
Fig. 1 illustrates different settings described above. Clearly, the number of model parameters is reduced significantly when we move from STL to MTL. Which MTL model is more economical depends on several factors, including the number of tasks, the dimension of the encoder output, the general architecture of the decoder, the dimension of task embeddings, how to augment the system with task embeddings, and the degree of tagset overlap.
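As a concrete (though highly simplified) sketch, the three sharing schemes can be written with plain linear maps standing in for the biRNN encoder and CRF decoder; all dimensions, weight names, and the linear stand-ins are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_h, n_tags, d_task = 3, 50, 8, 5, 4   # illustrative sizes

W_enc = rng.normal(size=(d_in, d_h))            # shared encoder (stand-in for a biRNN)

def encode(x):                                  # x: (seq_len, d_in)
    return np.tanh(x @ W_enc)

# Multi-Dec: shared encoder, one decoder per task.
W_dec = [rng.normal(size=(d_h, n_tags)) for _ in range(T)]

def multi_dec(x, task):
    return encode(x) @ W_dec[task]              # (seq_len, n_tags)

# TEDec: shared encoder and shared decoder; a learned task embedding is
# concatenated to every encoder output before decoding.
task_emb = rng.normal(size=(T, d_task))
W_te_dec = rng.normal(size=(d_h + d_task, n_tags))

def te_dec(x, task):
    h = encode(x)
    e = np.broadcast_to(task_emb[task], (h.shape[0], d_task))
    return np.concatenate([h, e], axis=1) @ W_te_dec

# TEEnc: a task token is prepended to the word sequence, so the fully
# shared encoder knows from the start which task it is performing.
task_tok = rng.normal(size=(T, d_in))
W_te_enc = rng.normal(size=(d_h, n_tags))

def te_enc(x, task):
    x_aug = np.vstack([task_tok[task], x])
    return encode(x_aug)[1:] @ W_te_enc         # drop the task position's output
```

Note how the per-task parameter count shrinks from Multi-Dec (one decoder per task) to TEDec/TEEnc (a single small task vector per task), mirroring the economy discussion above.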
3 Experimental Setup
3.1 Datasets and Tasks
Table 1: Datasets, tasks, and statistics.

Dataset                      | # sentences | Token/type | Task | # labels | Label entropy
Universal Dependencies v1.4  | 12543/16622 | 12.3/13.2  | upos | 17       | 2.5
Broadcast News 1             | 880/1370    | 5.2/6.1    | com  | 2        | 0.6
Table 1 summarizes the datasets used in our experiments, along with their corresponding tasks and important statistics. Table 2 shows an example of each task’s input-output pairs. We describe details below. For all tasks, we use the standard splits unless specified otherwise.
We perform universal and English-specific POS tagging (upos and xpos) on sentences from the English Web Treebank [Bies et al., 2012], annotated by the Universal Dependencies project [Nivre et al., 2016]. We perform syntactic chunking (chunk) on sentences from the WSJ portion of the Penn Treebank [Marcus et al., 1993], annotated by the CoNLL-2000 shared task [Tjong Kim Sang and Buchholz, 2000]. We use sections 15-18 for training. The shared task uses section 20 for testing and does not designate a development set, so we use the first 1001 sentences for development and the remaining 1011 for testing. We perform named entity recognition (ner) on sentences from the Reuters Corpus [Lewis et al., 2004], consisting of news stories from August 1996 to August 1997, annotated by the CoNLL-2003 shared task [Tjong Kim Sang and De Meulder, 2003]. For both chunk and ner, we use the IOBES tagging scheme.
We perform multi-word expression identification (mwe) and supersense tagging (supsense) on sentences from the reviews section of the English Web Treebank, annotated under the Streusle project [Schneider and Smith, 2015] (https://github.com/nert-gu/streusle). We perform supersense (sem) and semantic trait (semtr) tagging on SemCor’s sentences [Landes et al., 1998], taken from a subset of the Brown Corpus [Francis and Kučera, 1982], using the splits provided by [Alonso and Plank, 2017] for both tasks (https://github.com/bplank/multitasksemantics). For sem, the sentences are annotated with supersense tags [Miller et al., 1993] by [Ciaramita and Altun, 2006]; we consider supsense and sem as different tasks as they use different sets of supersense tags. For semtr, [Alonso and Plank, 2017] uses the EuroWordNet list of ontological types for senses [Vossen et al., 1998] to convert supersenses into coarser semantic traits.
For sentence compression (com), we identify which words to keep in a compressed version of sentences from the 1996 English Broadcast News Speech (HUB4) corpus [Graff, 1997], created by [Clarke and Lapata, 2006] (http://jamesclarke.net/research/resources/). We use the labels from the first annotator. For frame target identification (frame), we detect words that evoke frames [Das et al., 2014] in sentences from the British National Corpus, annotated under the FrameNet project [Baker et al., 1998]. For both com and frame, we use the splits provided by [Bingel and Søgaard, 2017]. For hyperlink detection (hyp), we identify which words in the sequence are marked with hyperlinks in text from Daniel Pipes’ news-style blog collected by [Spitkovsky et al., 2010] (https://nlp.stanford.edu/valentin/pubs/markup-data.tar.bz2). We use the “select” subset, which corresponds to marked, complete sentences.
Table 2: An example input-output pair for each task.

upos:     once again , thank you all for an outstanding accomplishment .
          ADV ADV PUNCT VERB PRON DET ADP DET ADJ NOUN PUNCT
xpos:     once again , thank you all for an outstanding accomplishment .
          RB RB , VBP PRP DT IN DT JJ NN .
chunk:    the carrier also seemed eager to place blame on its american counterparts .
          B-NP E-NP S-ADVP S-VP S-ADJP B-VP E-VP S-NP S-PP B-NP I-NP E-NP O
ner:      6. pier francesco chili ( italy ) ducati 17541
          O B-PER I-PER E-PER O S-LOC O S-ORG O
mwe:      had to keep in mind that the a / c broke , i feel bad it was their opening !
          B I B I I O O B I I O O O O O O O O O O
supsense: this place may have been something sometime ; but it way past it " sell by date " .
          O n.GROUP O O v.stative O O O O O O p.Time p.Gestalt O v.possession p.Time n.TIME O O
sem:      a hypothetical example will illustrate this point .
          O adj.all noun.cognition O verb.communication O noun.communication O
semtr:    he wondered if the audience would let him finish .
          O Mental O O Object O Agentive O BoundedEvent O
com:      he made the decisions in 1995 , in early 1996 , to spend at a very high rate .
          KEEP KEEP DEL KEEP DEL DEL DEL DEL DEL DEL DEL KEEP KEEP KEEP KEEP DEL KEEP KEEP KEEP
frame:    please continue our important partnership .
          O B-TARGET O B-TARGET O O
hyp:      will this incident lead to a further separation of civilizations ?
          O O O O O O O B-HTML B-HTML B-HTML O
3.2 Metrics and Score Comparison
We use the span-based micro-averaged F1 score (without the O tag) for all tasks. We run each configuration three times with different initializations and compute the mean and standard deviation of the scores. To compare two scores, we use the following strategy. Let (m_1, s_1) and (m_2, s_2) be two scores (mean and std, respectively). We say that score 1 is “higher” than score 2 if m_1 − ρ·s_1 > m_2 + ρ·s_2, where ρ is a parameter that controls how strict we want the definition to be. “Lower” is defined in the same manner with > changed to < and + switched with −. ρ is set to 1.5 in all of our experiments.
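The comparison rule above can be sketched in code; the exact form of the inequality below is our reading of the rule (an assumption), with the strictness parameter ρ set to 1.5 as in the experiments.

```python
def is_higher(m1, s1, m2, s2, rho=1.5):
    # Score 1 counts as "higher" only if the gap between the means survives
    # rho standard deviations on each side; rho controls strictness.
    return m1 - rho * s1 > m2 + rho * s2

def is_lower(m1, s1, m2, s2, rho=1.5):
    # Same rule with ">" flipped to "<" and "+" swapped with "-".
    return m1 + rho * s1 < m2 - rho * s2
```

Under this reading, 94.0 ± 0.1 counts as “higher” than 93.0 ± 0.1, but 94.0 ± 0.5 against 93.0 ± 0.5 does not, since the intervals overlap at ρ = 1.5.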
3.3 Models

We use bidirectional recurrent neural networks (biRNNs) as our encoders for both words and characters [Irsoy and Cardie, 2014, Huang et al., 2015, Lample et al., 2016, Ma and Hovy, 2016]. Our word/character sequence encoders and decoder classifiers are common in the literature and most similar to [Lample et al., 2016], but we use two-layer biRNNs (instead of one-layer) with Gated Recurrent Units (GRUs) [Cho et al., 2014] (instead of LSTMs [Hochreiter and Schmidhuber, 1997]).
Each word is represented by a 100-dimensional vector that is the concatenation of a 50-dimensional embedding vector and the 50-dimensional output of a character biRNN (whose hidden representation dimension is 25 in each direction). We feed a sequence of those 100-dimensional representations to a word biRNN, whose hidden representation dimension is 300 in each direction, resulting in a sequence of 600-dimensional vectors. In TEDec, the token encoder is also used to encode a task token (which is then concatenated to the encoder’s output), where each task is represented as a 25-dimensional vector. For decoder/classifiers, we predict a sequence of tags using a linear projection layer (to the tagset size) followed by a conditional random field (CRF) [Lafferty et al., 2001].
Implementation and training details
Words are lower-cased, but characters are not. Word embeddings are initialized with GloVe [Pennington et al., 2014] trained on Wikipedia 2014 and Gigaword 5. We use strategies suggested by [Ma and Hovy, 2016] for initializing other parameters in our networks. Character embeddings are initialized uniformly in [−√(3/d), +√(3/d)], where d is the dimension of the embeddings. Weight matrices are initialized with Xavier Uniform [Glorot and Bengio, 2010], i.e., uniformly in [−√(6/(r+c)), +√(6/(r+c))], where r and c are the numbers of rows and columns in the structure. Bias vectors are initialized with zeros.
We use Adam [Kingma and Ba, 2015] with default hyperparameters and a mini-batch size of 32. The dropout rate is 0.25 for the character encoder and 0.5 for the word encoder. We use gradient normalization [Pascanu et al., 2013] with a threshold of 5. We halve the learning rate if the validation performance does not improve for two epochs, and stop training if the validation performance does not improve for 10 epochs. We use L2 regularization with parameter 0.01 for the transition matrix of the CRF.
For the training of MTL models, we make sure that each mini-batch is balanced; the difference in numbers of examples from any pair of tasks is no more than 1. As a result, each epoch may not go through all examples of some tasks whose training set sizes are large. In a similar manner, during validation, the average F1 score is over all tasks rather than over all validation examples.
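The balanced mini-batch construction can be sketched as follows; this is a simplification under our own assumptions (smaller tasks are cycled, and the epoch length is capped by the largest task, so large tasks may be only partially covered, as noted above). All names are illustrative.

```python
import random
from itertools import cycle, islice

def balanced_batches(task_data, batch_size, seed=0):
    # task_data: one list of examples per task. Each yielded batch contains
    # per-task example counts that differ by at most one.
    rng = random.Random(seed)
    T = len(task_data)
    # Shuffle each task's examples once, then cycle so small tasks repeat.
    streams = [cycle(rng.sample(d, len(d))) for d in task_data]
    base, extra = divmod(batch_size, T)
    # Rough epoch length: stop before the largest task would be exhausted
    # at a balanced consumption rate.
    n_batches = max(len(d) for d in task_data) * T // batch_size
    for _ in range(n_batches):
        batch = []
        lucky = set(rng.sample(range(T), extra))  # tasks giving one extra example
        for t in range(T):
            k = base + (1 if t in lucky else 0)
            batch.extend((t, ex) for ex in islice(streams[t], k))
        yield batch
```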
3.4 Various Settings for Learning from Multiple Tasks
We consider the following settings: (i) “STL,” where we train each model on one task alone; (ii) “Pairwise MTL,” where we train on two tasks jointly; (iii) “All MTL,” where we train on all tasks jointly; (iv) “Oracle MTL,” where we train on the Oracle set of the testing task jointly with the testing task; (v) “All-but-one MTL,” where we train on all tasks jointly except one (as part of Sect. 4.4).
Constructing the Oracle Set of a Testing Task
The Oracle set of a task is constructed from the pairwise performances: let m(S, t) be the F1 score and s(S, t) the standard deviation of a model that is jointly trained on the tasks in a set S and tested on task t. A task t′ is considered “beneficial” to another (testing) task t if m({t, t′}, t) is “higher” than m({t}, t) (cf. Sect. 3.2). Then, the “Oracle” set of a task is the set of all of its beneficial (single) tasks. Throughout our experiments, we compute m and s by averaging over three rounds (cf. Sect. 3.2; standard deviations can be found in Appendix C).
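Given the pairwise results, constructing the Oracle set reduces to a filter over pairwise scores. A sketch under our own assumptions: f1 is a hypothetical dictionary mapping (training-task-set, test-task) to (mean, std), and the inequality is our reading of the “higher” rule of Sect. 3.2.

```python
def oracle_set(task, all_tasks, f1, rho=1.5):
    # Collect every task whose pairwise joint training with `task`
    # scores "higher" than STL on `task`.
    m0, s0 = f1[(frozenset({task}), task)]          # STL baseline on `task`
    beneficial = set()
    for t in all_tasks:
        if t == task:
            continue
        m, s = f1[(frozenset({task, t}), task)]     # pairwise MTL score
        if m - rho * s > m0 + rho * s0:             # "higher" (Sect. 3.2)
            beneficial.add(t)
    return beneficial
```

For example, with fabricated scores where chunk clearly helps upos and com clearly does not, the Oracle set of upos would be {chunk}.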
4 Results and Analysis
4.1 Main Results
Fig. 2 summarizes our main findings. We compare relative improvement over single-task learning (STL) between the various settings with different types of sharing in Sect. 3.4. Scores from the pairwise setting (“+One Task”) are represented as a vertical bar, delineating the maximum and minimum improvement over STL obtained by jointly learning a task with one of the remaining 10 tasks. The “All” setting (red triangles) indicates jointly learning all 11 tasks. The “Oracle” setting (blue rectangles) indicates joint learning using the subset of the 11 tasks deemed beneficial based on the corresponding pairwise MTL performances, as defined in Sect. 3.4.
We observe that (1) [STL vs. Pairwise/All] Neither pairwise MTL nor All always improves upon STL; (2) [STL vs. Oracle] Oracle in general outperforms, or at least does not worsen, STL; (3) [All/Oracle vs. Pairwise] All does better than Pairwise in about half of the cases, while Oracle almost always does better than Pairwise; (4) [All vs. Oracle] Consider when both All and Oracle improve upon STL. For Multi-Dec and TEEnc, Oracle generally dominates All, except on the task hyp. For TEDec, their magnitudes of improvement are mostly comparable, except on semtr (Oracle is better) and on hyp (All is better). In addition, All is better than Oracle on the task com, in which case the Oracle set is empty and Oracle reduces to STL.
In Appendix A, we compare different MTL approaches: Multi-Dec, TEDec, and TEEnc. There is no significant difference among them.
4.2 Pairwise MTL results
The summary plot in Fig. 3 gives a bird’s-eye view of the patterns in which a task might benefit or harm another. For example, mwe always benefits from jointly learning with any of the other 10 tasks, as all of its incoming edges are green; the same holds for semtr in most cases. At the other end, com seems to harm almost all of the other 10 tasks, as its outgoing edges are almost always red. chunk and u/xpos generally benefit others (or at least do no harm), as most of their outgoing edges are green.
In Tables 3-5, we report F1 scores for Multi-Dec, TEDec, and TEEnc, respectively. In each table, rows denote the settings in which we train our models, and columns correspond to the tasks we test them on. We also include the “Average” of all pairwise scores, as well as the number of positive (+) and negative (−) relationships in each row or column.
Which tasks are benefited or harmed by others in pairwise MTL?
mwe, supsense, semtr, and hyp are generally benefited by other tasks. The improvement is more significant in mwe and hyp. upos, xpos, ner, com, and frame (Multi-Dec and TEDec) are often hurt by other tasks. Finally, the results are mixed for chunk and sem.
Which tasks are beneficial or harmful?
upos, xpos, and chunk are universal helpers, beneficial in 16, 17, and 14 cases, while harmful in only 1, 3, and 0 cases, respectively. Interestingly, chunk never hurts any task, while both upos and xpos can be harmful to ner. While chunk is considered more of a syntactic task, the fact that it informs other tasks about the boundaries of phrases may aid the learning of semantic tasks (the task embeddings in Sect. 4.4 seem to support this hypothesis).
On the other hand, com, frame, and hyp are generally harmful: all are useful in 0 cases and cause performance drops in 22, 10, and 12 cases, respectively. One factor that may play a role is the training set sizes of these tasks. However, we note that both mwe and supsense (Streusle dataset) have smaller training sets than frame does, yet they can still benefit some tasks. (On the other hand, ner has the largest training set but infrequently benefits other tasks, less frequently than supsense does.) Another potential cause is that all of these harmful tasks have the smallest label set size of 2; combined with small dataset sizes, this leads to a higher chance of overfitting. Finally, it is possible that harmful tasks are simply unrelated; for example, the nature of com, frame, or hyp may be very different from that of the other tasks, requiring an entirely different kind of reasoning.
Finally, ner, mwe, sem, semtr, and supsense can be beneficial or harmful, depending on which other tasks they are trained with.
4.3 All MTL Results
In addition to pairwise MTL results, we report the performances in the All and Oracle MTL settings in the last two rows of Table 3-5. We find that their performances depend largely on the trend in their corresponding pairwise MTL. We provide examples and discussion of such observations below.
Table 6: All-but-one MTL results (Multi-Dec). Each row removes one task from All MTL training; the columns report F1 on the remaining ten tasks, following the fixed task order upos, xpos, chunk, ner, mwe, sem, semtr, supsense, com, frame, hyp (skipping the removed task), and the last two columns count significant gains (+) and losses (−) relative to All MTL.

All - upos      94.03 93.59 86.03 61.28 70.87 73.54 68.27 74.42 58.47 51.13  | + 0 | − 0
All - xpos      94.57 93.57 86.04 61.91 71.12 74.03 67.99 74.36 60.16 51.65  | + 0 | − 1
All - chunk     94.84 94.46 86.05 61.01 71.07 73.97 68.26 74.2  60.01 50.27  | + 0 | − 1
All - ner       94.81 94.3  93.59 62.69 70.82 73.51 68.16 74.08 59.17 50.86  | + 0 | − 2
All - mwe       94.93 94.45 93.71 86.21 71.01 73.61 68.18 74.7  59.23 50.83  | + 0 | − 2
All - sem       94.82 94.34 93.63 85.81 61.17 71.97 67.36 74.31 58.73 50.93  | + 0 | − 1
All - semtr     94.83 94.35 93.58 86.11 63.04 69.72 68.17 74.2  59.49 51.27  | + 0 | − 1
All - supsense  94.97 94.54 93.67 86.43 60.51 71.22 73.86 74.24 59.23 50.86  | + 0 | − 1
All - com       95.19 94.69 93.67 86.6  61.95 72.38 74.75 68.67 62.37 50.28  | + 5 | − 0
All - frame     95.15 94.57 93.7  85.9  62.62 71.48 74.24 68.47 75.03 50.89  | + 0 | − 0
All - hyp       94.93 94.53 93.78 86.31 62.04 71.22 74.02 68.46 74.62 59.69  | + 1 | − 0
How much is STL vs. Pairwise MTL predictive of STL vs. All MTL?
We find that the performance of pairwise MTL is predictive of the performance of All MTL to some degree. Below we discuss the results in more detail. Note that we would like pairwise MTL to be predictive in both the direction and the magnitude of performance (whether, and by how much, the scores improve or degrade over the baseline).
When pairwise MTL improves upon STL even slightly, All improves upon STL in all cases (mwe, semtr, supsense, and hyp). This is despite the fact that jointly learning some pairs of tasks leads to performance degradation (com and frame in the case of supsense, and com in the case of semtr). Furthermore, when pairwise MTL leads to improvement in all cases (all pairwise rows in mwe and hyp), All MTL achieves even better performance, suggesting that tasks are beneficial in a complementary manner and that there is an advantage to MTL beyond two tasks.
The converse is almost true: when pairwise MTL does not improve upon STL, most of the time All MTL does not improve upon STL either, with one exception: com. Specifically, the pairwise MTL performances of upos, xpos, ner, and frame (TEDec) are mostly negative, and so are their All MTL performances. Furthermore, tasks can also be harmful in a complementary manner. For instance, in the case of ner, All MTL achieves the lowest or second-lowest score compared to any row of the pairwise MTL settings. In addition, sem’s pairwise MTL performances are mixed, making its average score about the same as or slightly worse than STL’s. However, All MTL tested on sem achieves almost the lowest score. In other words, sem is harmed more than it is benefited, but its pairwise MTL performances cannot tell us this. This suggests that harmful tasks are complementary while beneficial tasks are not.
Our results when tested on com are the most surprising. While none of pairwise MTL settings help (with some hurting), the performance of All MTL goes in the opposite direction, outperforming that of STL. Further characterization of task interaction is needed to reveal why this happens. One hypothesis is that instances in com that are benefited by one task may be harmed by another. The joint training of all tasks thus works because tasks regularize each other.
We believe that our results open the doors to other interesting research questions. While the pairwise MTL performance is somewhat predictive of the performance direction of All MTL (except com), the magnitude of that direction is difficult to predict. It is clear that additional factors beyond pairwise performance contribute to the success or failure of the All MTL setting. It would be useful to automatically identify these factors or design a metric to capture that. There have been initial attempts along this research direction in [Alonso and Plank, 2017, Bingel and Søgaard, 2017, Bjerva, 2017], in which manually-defined task characteristics are found to be predictive of pairwise MTL’s failure or success.
Recall that a task has an “Oracle” set when the task is benefited by some other tasks according to its pairwise results (cf. Sect. 3.4). In general, our simple heuristic works well. Out of the 20 cases where Oracle MTL performances exist, 16 are better than the performance of All MTL. In sem, upos, and xpos (TEDec), Oracle MTL is able to reverse the negative results obtained by All MTL into positive ones, leading to improved scores over STL in all cases. This suggests that pairwise MTL performances are valuable knowledge if we want to go beyond two tasks. But, as mentioned previously, pairwise performance information fails in the case of com: All MTL leads to improvement, yet we do not have an Oracle set in this case.
Out of the 4 cases where Oracle MTL does not improve upon All MTL, 3 are when we test on hyp and one is when we test on mwe. These two tasks are not harmed by any tasks. This result seems to suggest that sometimes “neutral” tasks can help in MTL (but not always; for example, in Multi-Dec and TEEnc tested on mwe). This also raises the question of whether there is a more effective way to construct an Oracle set.
Task Contribution in All MTL
How much does one particular task contribute to the performance of All MTL? To investigate this, we remove one task at a time and train the rest jointly. Results are shown in Table 6 for Multi-Dec; results for the other two methods are in Appendix B, as they are qualitatively similar to Multi-Dec. We find that upos, sem, and semtr are in general sensitive to a task being removed from All MTL. Moreover, at least one task significantly contributes to the success of All MTL at some point; if we remove it, the performance drops. On the other hand, com generally affects the performance of All MTL negatively, as removing it often leads to performance improvement.
Fig. 4 shows a t-SNE visualization [Van der Maaten and Hinton, 2008] of task embeddings learned from TEDec in the All MTL setting (we observed that task embeddings learned from TEEnc are not consistent across multiple runs). The learned task embeddings reflect our knowledge about similarities between tasks: there are clusters of syntactic and semantic tasks. We also learn that sentence compression (com) is more syntactic, whereas multi-word expression identification (mwe) and hyperlink detection (hyp) are more semantic. Interestingly, chunk seems to lie in between, which may explain why it never harms any task in any setting (cf. Sect. 4.2).
In general, it is not obvious how to translate task similarities derived from task embeddings into something indicative of MTL performance. While our task embeddings could be considered as “task characteristics” vectors, they are not guaranteed to be interpretable. We thus leave a thorough investigation of information captured by task embeddings to future work.
Nevertheless, we observe that task embeddings disentangle “sentences/tags” and “actual task” to some degree. For instance, if we consider the locations of each pair of tasks that use the same set of sentences for training in Fig. 4, we see that sem and semtr (or mwe and supsense) are not neighbors, while xpos and upos are. On the other hand, mwe and ner are neighbors, even though their label set sizes and entropies are not the closest. These observations suggest that the hand-designed task features used in [Bingel and Søgaard, 2017] may not be the most informative characterization for predicting MTL performance.
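One simple way to probe such neighbor relations is cosine similarity between task-embedding rows; in the sketch below the embedding values are fabricated placeholders, not learned ones, and the function names are ours.

```python
import numpy as np

def cosine_sim(emb):
    # Pairwise cosine similarity between task-embedding rows.
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return e @ e.T

def nearest(names, emb, query):
    # Most similar task to `query`, excluding the task itself.
    i = names.index(query)
    sim = cosine_sim(emb)[i].copy()
    sim[i] = -np.inf
    return names[int(np.argmax(sim))]
```

With placeholder vectors in which the two POS tasks point in a similar direction, the nearest neighbor of upos comes out as xpos, mirroring the observation above.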
5 Related Work
MTL for NLP has been popular since a unified architecture was proposed by [Collobert and Weston, 2008, Collobert et al., 2011]. For sequence-to-sequence learning [Sutskever et al., 2014], general multi-task learning frameworks are explored in [Luong et al., 2016].
Our work is different from existing work in several aspects. First, the majority of the work focuses on two tasks, often with one being the main task and the other being the auxiliary one [Søgaard and Goldberg, 2016, Bjerva et al., 2016, Plank et al., 2016, Alonso and Plank, 2017, Bingel and Søgaard, 2017]. For example, pos is the auxiliary task in [Søgaard and Goldberg, 2016] while chunk, CCG supertagging (ccg) [Clark, 2002], ner, sem, or mwe+supsense is the main one. They find that pos benefits chunk and ccg. Another line of work considers language modeling as the auxiliary objective [Godwin et al., 2016, Rei, 2017, Liu et al., 2018]. Besides sequence tagging, some work considers two high-level tasks or one high-level task with another lower-level one. Examples are dependency parsing (dep) with pos [Zhang and Weiss, 2016], with mwe [Constant and Nivre, 2016], or with semantic role labeling (srl) [Shi et al., 2016]; machine translation (translate) with pos or dep [Niehues and Cho, 2017, Eriguchi et al., 2017]; sentence extraction and com [Martins and Smith, 2009, Berg-Kirkpatrick et al., 2011, Almeida and Martins, 2013].
Exceptions to this include the work of [Collobert et al., 2011], which considers four tasks: pos, chunk, ner, and srl; [Raganato et al., 2017], which considers three: word sense disambiguation with pos and coarse-grained semantic tagging based on WordNet lexicographer files; [Hashimoto et al., 2017], which considers five: pos, chunk, dep, semantic relatedness, and textual entailment; [Niehues and Cho, 2017, Kiperwasser and Ballesteros, 2018], which both consider three: translate with pos and ner, and translate with pos and dep, respectively. We consider as many as 11 tasks jointly.
Second, we choose to focus on model architectures that are generic enough to be shared by many tasks. Our structure is similar to [Collobert et al., 2011], but we also explore frameworks related to task embeddings and propose two variants. In contrast, recent work considers stacked architectures (mostly for sequence tagging) in which tasks can supervise at different layers of a network [Søgaard and Goldberg, 2016, Klerke et al., 2016, Plank et al., 2016, Alonso and Plank, 2017, Bingel and Søgaard, 2017, Hashimoto et al., 2017]. More complicated structures require more sophisticated MTL methods when the number of tasks grows and thus prevent us from concentrating on analyzing relationships among tasks. For this reason, we leave MTL for complicated models for future work.
6 Discussion and Future Work
We conduct an empirical study on MTL for sequence tagging, which so far has mostly been studied with two or a few tasks. We also propose two alternative frameworks that augment taggers with task embeddings. Our results provide insights regarding task relatedness and demonstrate the benefits of the MTL approaches. Nevertheless, we believe that our work only scratches the surface of MTL. The characterization of task relationships seems to go beyond the performance of pairwise MTL training or the similarities of task embeddings. We are also interested in further exploring other approaches to MTL, especially as tasks become more complicated. For example, it is not clear how best to represent task specifications or how to incorporate them into NLP systems. Finally, the definition of tasks can be relaxed to include domains or languages. Combining all of these will move us toward the goal of a single robust, generalizable NLP agent equipped with a diverse set of skills.
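Once task embeddings are learned, the embedding similarities mentioned above can be probed directly, e.g., by cosine similarity. The snippet below is a minimal sketch with random stand-in vectors; in practice, `emb` would hold a trained model's task embeddings (which the paper instead visualizes with t-SNE).

```python
import numpy as np

rng = np.random.default_rng(1)

# Random stand-ins for learned task-embedding vectors.
tasks = ["upos", "xpos", "chunk", "sem", "semtr", "supsense"]
emb = {t: rng.normal(size=16) for t in tasks}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pairwise similarity matrix: high values suggest related tasks.
sim = {(a, b): cosine(emb[a], emb[b]) for a in tasks for b in tasks}

def nearest(task):
    """The most similar other task under the learned embeddings."""
    return max((t for t in tasks if t != task), key=lambda t: sim[(task, t)])

# With random vectors this is arbitrary; with trained embeddings one would
# hope semantically related tasks (e.g., sem and semtr) surface together.
print(nearest("sem"))
```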
Acknowledgments

This work is partially supported by a USC Graduate Fellowship, NSF IIS-1065243, 1451412, 1513966/1632803/1833137, 1208500, CCF-1139148, a Google Research Award, an Alfred P. Sloan Research Fellowship, gifts from Facebook and Netflix, and ARO #W911NF-12-1-0241 and #W911NF-15-1-0484.
- [Almeida and Martins, 2013] Miguel B. Almeida and André F. T. Martins. 2013. Fast and robust compressive summarization with dual decomposition and multi-task learning. In ACL.
- [Alonso and Plank, 2017] Héctor Martínez Alonso and Barbara Plank. 2017. Multitask learning for semantic sequence prediction under varying data conditions. In EACL.
- [Baker et al., 1998] Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet Project. In COLING-ACL.
- [Berg-Kirkpatrick et al., 2011] Taylor Berg-Kirkpatrick, Daniel Gillick, and Dan Klein. 2011. Jointly learning to extract and compress. In ACL.
- [Bies et al., 2012] Ann Bies, Justin Mott, Colin Warner, and Seth Kulick. 2012. English web treebank. Technical Report LDC2012T13, Linguistic Data Consortium, Philadelphia, PA.
- [Bingel and Søgaard, 2017] Joachim Bingel and Anders Søgaard. 2017. Identifying beneficial task relations for multi-task learning in deep neural networks. In EACL.
- [Bjerva et al., 2016] Johannes Bjerva, Barbara Plank, and Johan Bos. 2016. Semantic tagging with deep residual networks. In COLING.
- [Bjerva, 2017] Johannes Bjerva. 2017. Will my auxiliary tagging task help? Estimating auxiliary tasks effectivity in multi-task learning. In NoDaLiDa.
- [Caruana, 1997] Rich Caruana. 1997. Multitask learning. Machine Learning, 28:41–75.
- [Cho et al., 2014] Kyunghyun Cho, Bart van Merrienboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
- [Ciaramita and Altun, 2006] Massimiliano Ciaramita and Yasemin Altun. 2006. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In EMNLP.
- [Clark, 2002] Stephen Clark. 2002. Supertagging for combinatory categorial grammar. In Proceedings of the Sixth International Workshop on Tree Adjoining Grammar and Related Frameworks.
- [Clarke and Lapata, 2006] James Clarke and Mirella Lapata. 2006. Constraint-based sentence compression: An integer programming approach. In ACL.
- [Collobert and Weston, 2008] Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML.
- [Collobert et al., 2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
- [Constant and Nivre, 2016] Matthieu Constant and Joakim Nivre. 2016. A transition-based system for joint lexical and syntactic analysis. In ACL.
- [Das et al., 2014] Dipanjan Das, Desai Chen, André F. T. Martins, Nathan Schneider, and Noah A. Smith. 2014. Frame-semantic parsing. Computational Linguistics, 40:9–56.
- [Eriguchi et al., 2017] Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to parse and translate improves neural machine translation. In ACL.
- [Francis and Kučera, 1982] Winthrop Nelson Francis and Henry Kučera. 1982. Frequency analysis of English usage: Lexicon and grammar. Journal of English Linguistics, 18(1):64–70.
- [Gardner et al., 2018] Matt A. Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. arXiv preprint arXiv:1803.07640.
- [Glorot and Bengio, 2010] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS.
- [Godwin et al., 2016] Jonathan Godwin, Pontus Stenetorp, and Sebastian Riedel. 2016. Deep semi-supervised learning with linguistically motivated sequence labeling task hierarchies. arXiv preprint arXiv:1612.09113.
- [Goldberg, 2017] Yoav Goldberg. 2017. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.
- [Graff, 1997] David Graff. 1997. The 1996 broadcast news speech and language-model corpus. In Proceedings of the 1997 DARPA Speech Recognition Workshop.
- [Hashimoto et al., 2017] Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017. A Joint Many-Task Model: Growing a neural network for multiple NLP tasks. In EMNLP.
- [Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
- [Huang et al., 2015] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
- [Irsoy and Cardie, 2014] Ozan Irsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In EMNLP.
- [Johnson et al., 2017] Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. TACL, 5:339–351.
- [Kingma and Ba, 2015] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
- [Kiperwasser and Ballesteros, 2018] Eliyahu Kiperwasser and Miguel Ballesteros. 2018. Scheduled multi-task learning: From syntax to translation. TACL, 6:225–240.
- [Klerke et al., 2016] Sigrid Klerke, Yoav Goldberg, and Anders Søgaard. 2016. Improving sentence compression by learning to predict gaze. In HLT-NAACL.
- [Lafferty et al., 2001] John D. Lafferty, Andrew D. McCallum, and Fernando Pereira. 2001. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. In ICML.
- [Lample et al., 2016] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In HLT-NAACL.
- [Landes et al., 1998] Shari Landes, Claudia Leacock, and Randee I. Tengi. 1998. Building semantic concordances. WordNet: An electronic lexical database, 199(216):199–216.
- [Lewis et al., 2004] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397.
- [Liu et al., 2018] Liyuan Liu, Jingbo Shang, Frank F. Xu, Xiang Ren, Huan Gui, Jian Peng, and Jiawei Han. 2018. Empower sequence labeling with task-aware neural language model. In AAAI.
- [Luong et al., 2016] Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In ICLR.
- [Ma and Hovy, 2016] Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In ACL.
- [Marcus et al., 1993] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
- [Martins and Smith, 2009] André F. T. Martins and Noah A. Smith. 2009. Summarization with a joint model for sentence extraction and compression. In Proceedings of the NAACL-HLT Workshop on Integer Linear Programming for NLP.
- [Miller et al., 1993] George A. Miller, Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Proceedings of the workshop on Human Language Technology.
- [Mou et al., 2016] Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2016. How transferable are neural networks in NLP applications? In EMNLP.
- [Niehues and Cho, 2017] Jan Niehues and Eunah Cho. 2017. Exploiting linguistic resources for neural machine translation using multi-task learning. In WMT.
- [Nivre et al., 2016] Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal dependencies v1: A multilingual treebank collection. In LREC.
- [Pascanu et al., 2013] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In ICML.
- [Paszke et al., 2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In Proceedings of the NIPS Workshop on the future of gradient-based machine learning software and techniques.
- [Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.
- [Plank et al., 2016] Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In ACL.
- [Raganato et al., 2017] Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2017. Neural sequence learning models for word sense disambiguation. In EMNLP.
- [Rei, 2017] Marek Rei. 2017. Semi-supervised multitask learning for sequence labeling. In ACL.
- [Reimers and Gurevych, 2017] Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In EMNLP.
- [Ruder, 2017] Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
- [Schneider and Smith, 2015] Nathan Schneider and Noah A. Smith. 2015. A corpus and model integrating multiword expressions and supersenses. In HLT-NAACL.
- [Shi et al., 2016] Peng Shi, Zhiyang Teng, and Yue Zhang. 2016. Exploiting mutual benefits between syntax and semantic roles using neural network. In EMNLP.
- [Søgaard and Goldberg, 2016] Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In ACL.
- [Spitkovsky et al., 2010] Valentin I. Spitkovsky, Daniel Jurafsky, and Hiyan Alshawi. 2010. Profiting from mark-up: Hyper-text annotations for guided parsing. In ACL.
- [Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
- [Tjong Kim Sang and Buchholz, 2000] Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In CoNLL.
- [Tjong Kim Sang and De Meulder, 2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL.
- [Van der Maaten and Hinton, 2008] Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.
- [Vossen et al., 1998] Piek Vossen, Laura Bloksma, Horacio Rodriguez, Salvador Climent, Nicoletta Calzolari, Adriana Roventini, Francesca Bertagna, Antonietta Alonge, and Wim Peters. 1998. The EuroWordNet Base Concepts and Top Ontology. Technical Report LE2-4003, University of Amsterdam, The Netherlands.
- [Yang et al., 2017] Zhilin Yang, Ruslan Salakhutdinov, and William W. Cohen. 2017. Transfer learning for sequence tagging with hierarchical recurrent networks. In ICLR.
- [Zhang and Weiss, 2016] Yuan Zhang and David Weiss. 2016. Stack-propagation: Improved representation learning for syntax. In ACL.
Appendix A Comparison between different MTL approaches
In Table 7, we summarize the results of the different MTL approaches; we observe no significant differences among them.
Appendix B Additional results on All-but-one settings
| | upos | xpos | chunk | ner | mwe | sem | semtr | supsense | com | frame | hyp | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| All - upos | – | 94.06 | 93.44 | 86.47 | 60.48 | 71.08 | 73.79 | 68.1 | 74.69 | 58.32 | 50.83 | 0 | 2 |
| All - xpos | 94.38 | – | 93.6 | 86.68 | 60.09 | 70.98 | 73.78 | 67.9 | 74.26 | 58.31 | 50.6 | 0 | 3 |
| All - chunk | 94.6 | 94.29 | – | 86.08 | 60.6 | 70.39 | 73.36 | 68.07 | 74.47 | 58.73 | 51.1 | 0 | 3 |
| All - ner | 94.69 | 94.31 | 93.69 | – | 60.48 | 70.64 | 73.59 | 67.51 | 74.49 | 58.19 | 50.44 | 0 | 4 |
| All - mwe | 94.93 | 94.46 | 93.72 | 86.21 | – | 71.11 | 74.04 | 67.38 | 74.49 | 57.6 | 50.5 | 0 | 2 |
| All - sem | 94.86 | 94.41 | 93.6 | 85.97 | 59.94 | – | 72.26 | 67.35 | 74.34 | 59.08 | 50.48 | 0 | 3 |
| All - semtr | 94.8 | 94.28 | 93.56 | 86.23 | 61.23 | 69.62 | – | 68.16 | 74.36 | 58.85 | 51.5 | 0 | 2 |
| All - supsense | 94.82 | 94.4 | 93.67 | 86.49 | 59.11 | 71.02 | 73.76 | – | 74.69 | 58.28 | 51.96 | 0 | 2 |
| All - com | 95.19 | 94.76 | 93.79 | 86.25 | 62.02 | 72.32 | 74.92 | 67.62 | – | 60.72 | 50.0 | 4 | 2 |
| All - frame | 95.03 | 94.6 | 93.64 | 86.68 | 60.52 | 71.11 | 73.9 | 67.69 | 74.49 | – | 51.23 | 0 | 2 |
| All - hyp | 94.94 | 94.45 | 93.69 | 86.86 | 61.07 | 71.22 | 74.04 | 68.32 | 74.4 | 58.55 | – | 0 | 1 |

| | upos | xpos | chunk | ner | mwe | sem | semtr | supsense | com | frame | hyp | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| All - upos | – | 94.0 | 93.36 | 85.98 | 59.58 | 70.68 | 73.66 | 68.19 | 74.07 | 60.51 | 50.23 | 0 | 1 |
| All - xpos | 94.24 | – | 93.29 | 85.8 | 59.81 | 70.57 | 73.64 | 68.47 | 73.94 | 60.13 | 50.39 | 0 | 4 |
| All - chunk | 94.66 | 94.3 | – | 85.73 | 61.58 | 70.78 | 73.65 | 67.87 | 73.67 | 61.73 | 50.18 | 0 | 1 |
| All - ner | 94.71 | 94.25 | 93.5 | – | 59.05 | 70.58 | 73.4 | 67.95 | 74.16 | 59.96 | 49.95 | 0 | 2 |
| All - mwe | 94.94 | 94.5 | 93.63 | 86.1 | – | 71.12 | 73.75 | 69.0 | 74.28 | 61.51 | 49.81 | 0 | 0 |
| All - sem | 94.76 | 94.32 | 93.45 | 85.58 | 59.47 | – | 72.21 | 67.77 | 74.2 | 61.76 | 50.15 | 1 | 1 |
| All - semtr | 94.68 | 94.25 | 93.54 | 86.02 | 60.59 | 69.86 | – | 67.96 | 73.81 | 61.31 | 51.72 | 1 | 2 |
| All - supsense | 94.8 | 94.27 | 93.56 | 86.04 | 59.25 | 70.53 | 73.27 | – | 74.3 | 59.98 | 50.01 | 0 | 2 |
| All - com | 95.25 | 94.72 | 93.82 | 86.23 | 60.63 | 72.38 | 75.06 | 67.94 | – | 63.55 | 48.77 | 4 | 0 |
| All - frame | 94.84 | 94.39 | 93.51 | 85.99 | 61.21 | 70.78 | 73.69 | 68.13 | 74.3 | – | 50.35 | 0 | 1 |
| All - hyp | 94.86 | 94.45 | 93.59 | 86.1 | 61.09 | 71.03 | 74.09 | 68.17 | 73.78 | 61.91 | – | 0 | 2 |
Appendix C Detailed results separated by the tasks being tested on
| Trained with | Tested on upos | | |
|---|---|---|---|
| upos only | 95.4 ± 0.08 | | |
| +xpos | 95.38 ± 0.03 | 95.4 ± 0.04 | 95.42 ± 0.07 |
| +chunk | 95.43 ± 0.11 | 95.57 ± 0.02 | 95.4 ± 0.0 |
| +ner | 95.38 ± 0.1 | 95.32 ± 0.03 | 95.29 ± 0.04 |
| +mwe | 95.15 ± 0.05 | 95.11 ± 0.07 | 95.05 ± 0.05 |
| +sem | 95.23 ± 0.14 | 95.2 ± 0.05 | 95.27 ± 0.08 |
| +semtr | 95.17 ± 0.15 | 95.21 ± 0.03 | 95.23 ± 0.13 |
| +supsense | 95.08 ± 0.08 | 95.05 ± 0.04 | 95.27 ± 0.08 |
| +com | 93.04 ± 0.77 | 94.03 ± 0.42 | 93.6 ± 0.15 |
| +frame | 94.98 ± 0.13 | 94.79 ± 0.09 | 95.0 ± 0.07 |
| +hyp | 94.84 ± 0.07 | 94.35 ± 0.21 | 94.43 ± 0.15 |
| All - xpos | 94.57 ± 0.12 | 94.38 ± 0.05 | 94.24 ± 0.24 |
| All - chunk | 94.84 ± 0.01 | 94.6 ± 0.1 | 94.66 ± 0.15 |
| All - ner | 94.81 ± 0.07 | 94.69 ± 0.05 | 94.71 ± 0.07 |
| All - mwe | 94.93 ± 0.01 | 94.93 ± 0.08 | 94.94 ± 0.04 |
| All - sem | 94.82 ± 0.17 | 94.86 ± 0.08 | 94.76 ± 0.15 |
| All - semtr | 94.83 ± 0.12 | 94.8 ± 0.03 | 94.68 ± 0.17 |
| All - supsense | 94.97 ± 0.07 | 94.82 ± 0.03 | 94.8 ± 0.07 |
| All - com | 95.19 ± 0.05 | 95.19 ± 0.04 | 95.25 ± 0.02 |
| All - frame | 95.15 ± 0.07 | 95.03 ± 0.17 | 94.84 ± 0.1 |
| All - hyp | 94.93 ± 0.18 | 94.94 ± 0.11 | 94.86 ± 0.04 |
| All | 95.04 ± 0.03 | 94.95 ± 0.08 | 94.94 ± 0.1 |
| Oracle | 95.4 ± 0.08 | 95.57 ± 0.02 | 95.4 ± 0.08 |
| Trained with | Tested on xpos | | |
|---|---|---|---|
| xpos only | 95.04 ± 0.06 | | |
| +upos | 95.01 ± 0.04 | 94.99 ± 0.03 | 94.94 ± 0.05 |
| +chunk | 95.1 ± 0.02 | 95.21 ± 0.02 | 95.1 ± 0.04 |
| +ner | 94.98 ± 0.12 | 95.09 ± 0.07 | 95.05 ± 0.13 |
| +mwe | 94.7 ± 0.16 | 94.8 ± 0.08 | 94.66 ± 0.07 |
| +sem | 94.77 ± 0.08 | 94.82 ± 0.15 | 94.93 ± 0.08 |
| +semtr | 94.86 ± 0.02 | 94.8 ± 0.09 | 94.97 ± 0.09 |
| +supsense | 94.75 ± 0.15 | 94.81 ± 0.06 | 95.0 ± 0.12 |
| +com | 93.19 ± 0.75 | 93.94 ± 0.21 | 93.12 ± 0.44 |
| +frame | 94.64 ± 0.06 | 94.66 ± 0.05 | 94.55 ± 0.06 |
| +hyp | 94.46 ± 0.3 | 94.56 ± 0.09 | 94.26 ± 0.18 |
| All - upos | 94.03 ± 0.13 | 94.06 ± 0.09 | 94.0 ± 0.26 |
| All - chunk | 94.46 ± 0.09 | 94.29 ± 0.07 | 94.3 ± 0.12 |
| All - ner | 94.3 ± 0.03 | 94.31 ± 0.02 | 94.25 ± 0.07 |
| All - mwe | 94.45 ± 0.05 | 94.46 ± 0.12 | 94.5 ± 0.09 |
| All - sem | 94.34 ± 0.09 | 94.41 ± 0.09 | 94.32 ± 0.17 |
| All - semtr | 94.35 ± 0.08 | 94.28 ± 0.07 | 94.25 ± 0.12 |
| All - supsense | 94.54 ± 0.02 | 94.4 ± 0.08 | 94.27 ± 0.03 |
| All - com | 94.69 ± 0.1 | 94.76 ± 0.08 | 94.72 ± 0.06 |
| All - frame | 94.57 ± 0.12 | 94.6 ± 0.19 | 94.39 ± 0.08 |
| All - hyp | 94.53 ± 0.07 | 94.45 ± 0.1 | 94.45 ± 0.07 |
| All | 94.31 ± 0.15 | 94.42 ± 0.07 | 94.3 ± 0.2 |
| Oracle | 95.04 ± 0.06 | 95.21 ± 0.02 | 95.04 ± 0.06 |
| Trained with | Tested on chunk | | |
|---|---|---|---|
| chunk only | 93.49 ± 0.01 | | |
| +upos | 94.18 ± 0.02 | 94.02 ± 0.08 | 94.0 ± 0.15 |
| +xpos | 93.97 ± 0.16 | 94.18 ± 0.01 | 93.98 ± 0.13 |
| +ner | 93.47 ± 0.1 | 93.64 ± 0.03 | 93.54 ± 0.1 |
| +mwe | 93.54 ± 0.13 | 93.59 ± 0.2 | 93.33 ± 0.2 |
| +sem | 93.63 ± 0.02 | 93.45 ± 0.07 | 93.52 ± 0.13 |
| +semtr | 93.61 ± 0.07 | 93.47 ± 0.03 | 93.45 ± 0.07 |
| +supsense | 93.2 ± 0.21 | 93.25 ± 0.15 | 93.13 ± 0.13 |
| +com | 91.94 ± 0.4 | 92.29 ± 0.27 | 91.86 ± 0.09 |
| +frame | 93.22 ± 0.16 | 93.23 ± 0.04 | 93.29 ± 0.13 |
| +hyp | 92.96 ± 0.08 | 92.86 ± 0.08 | 93.13 ± 0.04 |
| All - upos | 93.59 ± 0.13 | 93.44 ± 0.17 | 93.36 ± 0.17 |
| All - xpos | 93.57 ± 0.19 | 93.6 ± 0.05 | 93.29 ± 0.21 |
| All - ner | 93.59 ± 0.09 | 93.69 ± 0.14 | 93.5 ± 0.23 |
| All - mwe | 93.71 ± 0.11 | 93.72 ± 0.13 | 93.63 ± 0.04 |
| All - sem | 93.63 ± 0.08 | 93.6 ± 0.11 | 93.45 ± 0.13 |
| All - semtr | 93.58 ± 0.08 | 93.56 ± 0.14 | 93.54 ± 0.06 |
| All - supsense | 93.67 ± 0.08 | 93.67 ± 0.12 | 93.56 ± 0.12 |
| All - com | 93.67 ± 0.12 | 93.79 ± 0.14 | 93.82 ± 0.05 |
| All - frame | 93.7 ± 0.09 | 93.64 ± 0.11 | 93.51 ± 0.06 |
| All - hyp | 93.78 ± 0.12 | 93.69 ± 0.05 | 93.59 ± 0.07 |
| All | 93.44 ± 0.09 | 93.64 ± 0.21 | 93.7 ± 0.06 |
| Oracle | 94.01 ± 0.13 | 94.07 ± 0.25 | 93.93 ± 0.16 |
| Trained with | Tested on ner | | |
|---|---|---|---|
| ner only | 88.24 ± 0.09 | | |
| +upos | 87.68 ± 0.41 | 87.99 ± 0.21 | 87.43 ± 0.11 |
| +xpos | 87.61 ± 0.27 | 87.65 ± 0.14 | 87.71 ± 0.08 |
| +chunk | 87.96 ± 0.19 | 88.11 ± 0.21 | 88.07 ± 0.16 |
| +mwe | 88.15 ± 0.23 | 87.99 ± 0.15 | 88.02 ± 0.36 |
| +sem | 87.35 ± 0.16 | 87.27 ± 0.36 | 87.49 ± 0.25 |
| +semtr | 87.34 ± 0.27 | 87.75 ± 0.38 | 87.29 ± 0.17 |
| +supsense | 87.9 ± 0.24 | 87.94 ± 0.33 | 87.92 ± 0.16 |
| +com | 86.62 ± 0.72 | 86.59 ± 0.31 | 86.75 ± 0.45 |
| +frame | 88.15 ± 0.35 | 88.02 ± 0.17 | 87.99 ± 0.32 |
| +hyp | 87.98 ± 0.21 | 87.91 ± 0.4 | 87.82 ± 0.31 |
| All - upos | 86.03 ± 0.53 | 86.47 ± 0.14 | 85.98 ± 0.29 |
| All - xpos | 86.04 ± 0.15 | 86.68 ± 0.27 | 85.8 ± 0.27 |
| All - chunk | 86.05 ± 0.1 | 86.08 ± 0.49 | 85.73 ± 0.2 |
| All - mwe | 86.21 ± 0.27 | 86.21 ± 0.19 | 86.1 ± 0.37 |
| All - sem | 85.81 ± 0.32 | 85.97 ± 0.14 | 85.58 ± 0.04 |
| All - semtr | 86.11 ± 0.28 | 86.23 ± 0.23 | 86.02 ± 0.39 |
| All - supsense | 86.43 ± 0.12 | 86.49 ± 0.17 | 86.04 ± 0.14 |
| All - com | 86.6 ± 0.79 | 86.25 ± 0.06 | 86.23 ± 0.33 |
| All - frame | 85.9 ± 0.29 | 86.68 ± 0.15 | 85.99 ± 0.3 |
| All - hyp | 86.31 ± 0.18 | 86.86 ± 0.25 | 86.1 ± 0.56 |
| All | 86.38 ± 0.12 | 86.8 ± 0.08 | 86.01 ± 0.4 |
| Oracle | 88.24 ± 0.09 | 88.24 ± 0.09 | 88.24 ± 0.09 |
| Trained with | Tested on mwe | | |
|---|---|---|---|
| mwe only | 53.07 ± 0.12 | | |
| +upos | 59.99 ± 0.36 | 60.28 ± 0.24 | 57.61 ± 0.2 |
| +xpos | 58.87 ± 0.78 | 60.32 ± 0.3 | 58.26 ± 0.25 |
| +chunk | 59.18 ± 0.03 | 57.61 ± 1.53 | 58.06 ± 0.88 |
| +ner | 55.4 ± 0.52 | 55.17 ± 0.44 | 53.4 ± 0.98 |
| +sem | 60.16 ± 1.23 | 58.21 ± 0.09 | 58.62 ± 0.61 |
| +semtr | 58.84 ± 1.45 | 58.55 ± 0.28 | 58.31 ± 2.24 |
| +supsense | 58.81 ± 1.01 | 58.75 ± 0.33 | 58.05 ± 0.72 |
| +com | 53.89 ± 1.41 | 51.72 ± 1.01 | 51.71 ± 1.05 |
| +frame | 53.88 ± 0.76 | 53.05 ± 1.32 | 53.3 ± 1.15 |
| +hyp | 53.08 ± 1.72 | 52.98 ± 1.66 | 52.59 ± 1.98 |
| All - upos | 61.28 ± 0.78 | 60.48 ± 0.93 | 59.58 ± 1.14 |
| All - xpos | 61.91 ± 1.56 | 60.09 ± 0.9 | 59.81 ± 0.83 |
| All - chunk | 61.01 ± 1.61 | 60.6 ± 1.52 | 61.58 ± 1.05 |
| All - ner | 62.69 ± 0.26 | 60.48 ± 0.15 | 59.05 ± 0.4 |
| All - sem | 61.17 ± 0.86 | 59.94 ± 0.85 | 59.47 ± 0.04 |
| All - semtr | 63.04 ± 0.85 | 61.23 ± 2.05 | 60.59 ± 0.59 |
| All - supsense | 60.51 ± 0.25 | 59.11 ± 2.02 | 59.25 ± 0.74 |
| All - com | 61.95 ± 0.97 | 62.02 ± 1.73 | 60.63 ± 0.73 |
| All - frame | 62.62 ± 0.85 | 60.52 ± 0.47 | 61.21 ± 0.99 |
| All - hyp | 62.04 ± 0.6 | 61.07 ± 0.51 | 61.09 ± 1.06 |
| All | 61.43 ± 1.94 | 61.97 ± 0.5 | 59.57 ± 0.64 |
| Oracle | 62.76 ± 0.63 | 61.74 ± 1.49 | 61.92 ± 0.66 |
| Trained with | Tested on sem | | |
|---|---|---|---|
| sem only | 72.77 ± 0.04 | | |
| +upos | 73.23 ± 0.06 | 73.17 ± 0.08 | 73.11 ± 0.01 |
| +xpos | 73.34 ± 0.12 | 73.21 ± 0.04 | 73.04 ± 0.21 |
| +chunk | 73.16 ± 0.05 | 73.02 ± 0.05 | 73.13 ± 0.07 |
| +ner | 72.88 ± 0.08 | 72.77 ± 0.19 | 72.91 ± 0.08 |
| +mwe | 72.75 ± 0.09 | 72.66 ± 0.18 | 72.83 ± 0.07 |
| +semtr | 72.5 ± 0.07 | 72.5 ± 0.05 | 72.17 ± 0.06 |
| +supsense | 72.81 ± 0.04 | 72.71 ± 0.03 | 73.09 ± 0.08 |
| +com | 70.39 ± 0.46 | 70.37 ± 0.28 | 70.18 ± 0.54 |
| +frame | 72.76 ± 0.16 | 72.26 ± 0.21 | 72.49 ± 0.23 |
| +hyp | 72.47 ± 0.02 | 72.15 ± 0.1 | 71.95 ± 1.22 |
| All - upos | 70.87 ± 0.19 | 71.08 ± 0.19 | 70.68 ± 0.76 |
| All - xpos | 71.12 ± 0.1 | 70.98 ± 0.24 | 70.57 ± 0.13 |
| All - chunk | 71.07 ± 0.27 | 70.39 ± 0.39 | 70.78 ± 0.35 |
| All - ner | 70.82 ± 0.41 | 70.64 ± 0.15 | 70.58 ± 0.03 |
| All - mwe | 71.01 ± 0.14 | 71.11 ± 0.17 | 71.12 ± 0.29 |
| All - semtr | 69.72 ± 0.27 | 69.62 ± 0.37 | 69.86 ± 0.36 |
| All - supsense | 71.22 ± 0.29 | 71.02 ± 0.16 | 70.53 ± 0.19 |
| All - com | 72.38 ± 0.08 | 72.32 ± 0.23 | 72.38 ± 0.17 |
| All - frame | 71.48 ± 0.51 | 71.11 ± 0.16 | 70.78 ± 0.44 |
| All - hyp | 71.22 ± 0.25 | 71.22 ± 0.33 | 71.03 ± 0.07 |
| All | 71.53 ± 0.28 | 71.72 ± 0.21 | 71.58 ± 0.24 |
| Oracle | 73.32 ± 0.04 | 73.1 ± 0.03 | 73.14 ± 0.06 |
| Trained with | Tested on semtr | | |
|---|---|---|---|
| semtr only | 74.02 ± 0.04 | | |
| +upos | 74.93 ± 0.09 | 74.87 ± 0.1 | 74.85 ± 0.05 |
| +xpos | 74.91 ± 0.06 | 74.84 ± 0.21 | 74.66 ± 0.2 |
| +chunk | 74.79 ± 0.13 | 74.73 ± 0.12 | 74.77 ± 0.13 |
| +ner | 74.34 ± 0.08 | 74.01 ± 0.05 | 74.04 ± 0.07 |
| +mwe | 74.51 ± 0.18 | 74.63 ± 0.28 | 74.66 ± 0.21 |
| +sem | 74.73 ± 0.1 | 74.72 ± 0.14 | 74.41 ± 0.01 |
| +supsense | 74.61 ± 0.24 | 74.52 ± 0.05 | 74.94 ± 0.22 |
| +com | 72.6 ± 0.95 | 71.76 ± 0.88 | 71.35 ± 0.95 |
| +frame | 74.18 ± 0.19 | 74.21 ± 0.37 | 74.63 ± 0.11 |
| +hyp | 74.23 ± 0.27 | 74.19 ± 0.45 | 74.14 ± 0.23 |
| All - upos | 73.54 ± 0.54 | 73.79 ± 0.46 | 73.66 ± 0.97 |
| All - xpos | 74.03 ± 0.11 | 73.78 ± 0.28 | 73.64 ± 0.07 |
| All - chunk | 73.97 ± 0.22 | 73.36 ± 0.05 | 73.65 ± 0.39 |
| All - ner | 73.51 ± 0.35 | 73.59 ± 0.19 | 73.4 ± 0.19 |
| All - mwe | 73.61 ± 0.2 | 74.04 ± 0.18 | 73.75 ± 0.24 |
| All - sem | 71.97 ± 0.3 | 72.26 ± 0.28 | 72.21 ± 0.48 |
| All - supsense | 73.86 ± 0.09 | 73.76 ± 0.19 | 73.27 ± 0.2 |
| All - com | 74.75 ± 0.22 | 74.92 ± 0.1 | 75.06 ± 0.12 |
| All - frame | 74.24 ± 0.37 | 73.9 ± 0.29 | 73.69 ± 0.32 |
| All - hyp | 74.02 ± 0.12 | 74.04 ± 0.17 | 74.09 ± 0.21 |
| All | 74.26 ± 0.1 | 74.36 ± 0.03 | 74.35 ± 0.29 |
| Oracle | 75.23 ± 0.06 | 75.24 ± 0.13 | 75.09 ± 0.02 |
| Trained with | Tested on supsense | | |
|---|---|---|---|
| supsense only | 66.81 ± 0.22 | | |
| +upos | 68.25 ± 0.42 | 67.8 ± 0.29 | 67.76 ± 0.14 |
| +xpos | 67.78 ± 0.4 | 68.3 ± 0.71 | 67.77 ± 0.15 |
| +chunk | 67.39 ± 0.15 | 67.29 ± 0.33 | 67.36 ± 0.29 |
| +ner | 68.06 ± 0.16 | 67.25 ± 0.21 | 67.57 ± 0.27 |
| +mwe | 66.88 ± 0.14 | 66.88 ± 0.24 | 66.26 ± 0.9 |
| +sem | 68.29 ± 0.21 | 68.46 ± 0.38 | 68.1 ± 0.59 |
| +semtr | 68.6 ± 0.81 | 68.18 ± 0.39 | 67.64 ± 0.92 |
| +com | 65.57 ± 0.17 | 64.98 ± 0.34 | 65.55 ± 0.18 |
| +frame | 66.59 ± 0.07 | 66.2 ± 0.16 | 66.75 ± 0.22 |
| +hyp | 66.47 ± 0.24 | 66.52 ± 0.59 | 66.16 ± 0.43 |
| All - upos | 68.27 ± 0.33 | 68.1 ± 0.28 | 68.19 ± 0.55 |
| All - xpos | 67.99 ± 0.5 | 67.9 ± 0.54 | 68.47 ± 0.18 |
| All - chunk | 68.26 ± 0.48 | 68.07 ± 0.28 | 67.87 ± 0.32 |
| All - ner | 68.16 ± 0.26 | 67.51 ± 0.4 | 67.95 ± 0.24 |
| All - mwe | 68.18 ± 0.62 | 67.38 ± 0.22 | 69.0 ± 0.45 |
| All - sem | 67.36 ± 0.42 | 67.35 ± 0.18 | 67.77 ± 0.28 |
| All - semtr | 68.17 ± 0.15 | 68.16 ± 0.47 | 67.96 ± 0.73 |
| All - com | 68.67 ± 0.37 | 67.62 ± 0.6 | 67.94 ± 0.22 |
| All - frame | 68.47 ± 0.72 | 67.69 ± 0.95 | 68.13 ± 0.39 |
| All - hyp | 68.46 ± 0.37 | 68.32 ± 0.18 | 68.17 ± 0.36 |
| All | 68.1 ± 0.54 | 67.98 ± 0.29 | 68.02 ± 0.21 |
| Oracle | 68.53 ± 0.09 | 68.22 ± 0.61 | 69.04 ± 0.44 |
| Trained with | Tested on com | | |
|---|---|---|---|
| com only | 72.71 ± 0.75 | | |
| +upos | 72.46 ± 0.34 | 72.86 ± 0.12 | 72.09 ± 0.36 |
| +xpos | 72.83 ± 0.16 | 72.87 ± 0.56 | 72.41 ± 0.51 |
| +chunk | 72.44 ± 0.11 | 73.3 ± 0.15 | 72.88 ± 0.26 |
| +ner | 70.93 ± 0.73 | 71.08 ± 0.31 | 70.78 ± 0.27 |
| +mwe | 71.31 ± 0.31 | 70.93 ± 0.43 | 71.36 ± 0.42 |
| +sem | 72.72 ± 0.22 | 73.14 ± 0.08 | 72.25 ± 0.07 |
| +semtr | 71.96 ± 0.16 | 71.74 ± 0.46 | 72.15 ± 0.5 |
| +supsense | 72.24 ± 0.27 | 69.13 ± 0.19 | 72.12 ± 0.66 |
| +frame | 72.47 ± 0.08 | 72.89 ± 0.22 | 72.1 ± 0.93 |
| +hyp | 71.82 ± 0.97 | 70.47 ± 0.81 | 72.79 ± 0.97 |
| All - upos | 74.42 ± 0.24 | 74.69 ± 0.26 | 74.07 ± 0.19 |
| All - xpos | 74.36 ± 0.14 | 74.26 ± 0.64 | 73.94 ± 0.3 |
| All - chunk | 74.2 ± 0.13 | 74.47 ± 0.26 | 73.67 ± 0.23 |
| All - ner | 74.08 ± 0.07 | 74.49 ± 0.38 | 74.16 ± 0.48 |
| All - mwe | 74.7 ± 0.14 | 74.49 ± 0.13 | 74.28 ± 0.16 |
| All - sem | 74.31 ± 0.1 | 74.34 ± 0.42 | 74.2 ± 0.28 |
| All - semtr | 74.2 ± 0.24 | 74.36 ± 0.36 | 73.81 ± 0.16 |
| All - supsense | 74.24 ± 0.44 | 74.69 ± 0.52 | 74.3 ± 0.13 |
| All - frame | 75.03 ± 0.24 | 74.49 ± 0.2 | 74.3 ± 0.19 |
| All - hyp | 74.62 ± 0.14 | 74.4 ± 0.06 | 73.78 ± 0.05 |
| All | 74.54 ± 0.53 | 74.61 ± 0.24 | 74.61 ± 0.32 |
| Oracle | 72.71 ± 0.75 | 72.71 ± 0.75 | 72.71 ± 0.75 |
| Trained with | Tested on frame | | |
|---|---|---|---|
| frame only | 62.04 ± 0.74 | | |
| +upos | 62.14 ± 0.35 | 61.54 ± 0.53 | 62.27 ± 0.33 |
| +xpos | 60.77 ± 0.39 | 61.44 ± 0.06 | 61.62 ± 1.01 |
| +chunk | 62.67 ± 0.47 | 61.39 ± 0.78 | 62.98 ± 0.5 |
| +ner | 62.39 ± 0.37 | 59.25 ± 0.52 | 63.02 ± 0.39 |
| +mwe | 61.75 ± 0.21 | 56.77 ± 2.79 | 60.61 ± 0.91 |
| +sem | 61.74 ± 0.27 | 60.09 ± 0.48 | 62.17 ± 0.36 |
| +semtr | 62.03 ± 0.41 | 59.77 ± 0.81 | 62.79 ± 0.19 |
| +supsense | 61.94 ± 0.43 | 55.68 ± 0.61 | 61.96 ± 0.18 |
| +com | 56.52 ± 0.27 | 55.25 ± 2.29 | 57.65 ± 2.42 |
| +hyp | 61.02 ± 0.62 | 55.35 ± 0.5 | 61.14 ± 1.77 |
| All - upos | 58.47 ± 1.0 | 58.32 ± 0.35 | 60.51 ± 0.1 |
| All - xpos | 60.16 ± 0.42 | 58.31 ± 0.8 | 60.13 ± 1.38 |
| All - chunk | 60.01 ± 0.65 | 58.73 ± 0.68 | 61.73 ± 0.48 |
| All - ner | 59.17 ± 0.27 | 58.19 ± 0.89 | 59.96 ± 0.52 |
| All - mwe | 59.23 ± 0.33 | 57.6 ± 0.82 | 61.51 ± 0.43 |
| All - sem | 58.73 ± 0.67 | 59.08 ± 0.84 | 61.76 ± 0.52 |
| All - semtr | 59.49 ± 0.79 | 58.85 ± 0.51 | 61.31 ± 1.16 |
| All - supsense | 59.23 ± 0.64 | 58.28 ± 0.19 | 59.98 ± 1.23 |
| All - com | 62.37 ± 0.37 | 60.72 ± 0.73 | 63.55 ± 0.31 |
| All - hyp | 59.69 ± 0.41 | 58.55 ± 0.29 | 61.91 ± 0.59 |
| All | 59.71 ± 0.85 | 58.14 ± 0.23 | 61.83 ± 0.98 |
| Oracle | 62.04 ± 0.74 | 62.04 ± 0.74 | 62.04 ± 0.74 |
| Trained with | Tested on hyp | | |
|---|---|---|---|
| hyp only | 46.73 ± 0.55 | | |
| +upos | 48.02 ± 0.31 | 49.36 ± 0.36 | 48.27 ± 0.68 |
| +xpos | 48.81 ± 0.36 | 49.23 ± 0.55 | 48.06 ± 0.02 |
| +chunk | 47.85 ± 0.2 | 48.43 ± 0.3 | 47.13 ± 0.35 |
| +ner | 47.9 ± 0.67 | 48.24 ± 0.65 | 48.64 ± 1.17 |
| +mwe | 47.32 ± 0.29 | 45.83 ± 0.46 | 46.71 ± 0.64 |
| +sem | 48.15 ± 0.21 | 47.95 ± 0.75 | 47.12 ± 0.43 |
| +semtr | 47.74 ± 0.57 | 46.96 ± 0.85 | 46.1 ± 0.11 |
| +supsense | 49.23 ± 0.13 | 47.29 ± 0.41 | 47.24 ± 0.43 |
| +com | 47.41 ± 1.18 | 45.24 ± 0.46 | 47.81 ± 0.8 |
| +frame | 47.5 ± 0.46 | 46.0 ± 0.53 | 46.66 ± 0.54 |
| All - upos | 51.13 ± 0.94 | 50.83 ± 0.65 | 50.23 ± 0.73 |
| All - xpos | 51.65 ± 0.63 | 50.6 ± 0.44 | 50.39 ± 1.17 |
| All - chunk | 50.27 ± 0.76 | 51.1 ± 0.28 | 50.18 ± 0.81 |
| All - ner | 50.86 ± 0.87 | 50.44 ± 0.39 | 49.95 ± 0.38 |
| All - mwe | 50.83 ± 0.61 | 50.5 ± 0.9 | 49.81 ± 0.44 |
| All - sem | 50.93 ± 0.27 | 50.48 ± 0.53 | 50.15 ± 0.11 |
| All - semtr | 51.27 ± 0.5 | 51.5 ± 0.46 | 51.72 ± 0.15 |
| All - supsense | 50.86 ± 1.85 | 51.96 ± 0.29 | 50.01 ± 1.13 |
| All - com | 50.28 ± 1.02 | 50.0 ± 0.11 | 48.77 ± 0.54 |
| All - frame | 50.89 ± 0.64 | 51.23 ± 1.01 | 50.35 ± 0.68 |
| All | 51.41 ± 0.25 | 51.31 ± 0.55 | 49.5 ± 0.05 |
| Oracle | 50.0 ± 0.42 | 50.15 ± 0.25 | 48.06 ± 0.02 |