Unsupervised Word Polysemy Quantification with Multiresolution Grids of Contextual Embeddings

Abstract

The number of senses of a given word, or polysemy, is a very subjective notion, which varies widely across annotators and resources. We propose a novel method to estimate polysemy, based on simple geometry in the contextual embedding space. Our approach is fully unsupervised and purely data-driven. We show through rigorous experiments that our rankings are well correlated (with strong statistical significance) with 6 different rankings derived from famous human-constructed resources such as WordNet, OntoNotes, Oxford, and Wikipedia, for 6 different standard metrics. We also visualize and analyze the correlation between the human rankings. A valuable by-product of our method is the ability to sample, at no extra cost, sentences containing different senses of a given word. Finally, the fully unsupervised nature of our method makes it applicable to any language. Code and data are publicly available1.

Figure 1: Illustration of the proposed approach with $d=2$ and $L=3$.

1 Introduction

Polysemy, the number of senses that a word has, is a very subjective notion, subject to individual biases. Word sense annotation has always been one of the tasks with the lowest values of inter-annotator agreement Artstein and Poesio (2008). Yet, creating high-quality, consistent word sense inventories is a critical pre-requisite to successful word sense disambiguation.

Towards creating word sense inventories, it can be helpful to have some reliable information about word polysemy. That is, knowing which words have many senses, and which words have only a few senses. Such information can help in creating new inventories, but also in validating and interpreting existing ones. It can also help in selecting which words to include in a study (e.g., only highly polysemous words).

We propose a novel, fully unsupervised and data-driven approach to quantify word polysemy, based on basic geometry in the contextual embedding space.

Contextual word embeddings have emerged in the last few years, as part of the NLP transfer learning revolution. Now, entire deep models are pre-trained on huge amounts of unannotated data and fine-tuned on much smaller annotated datasets. Some of the most famous examples include ULMFiT Howard and Ruder (2018) and ELMo Peters et al. (2018), both based on recurrent neural networks; and GPT Radford et al. (2018) and BERT Devlin et al. (2018), based on transformers Vaswani et al. (2017). All of these models are very deep language models. During pre-training on large-scale corpora, they learn to generate powerful internal representations, including fine-grained contextual word embeddings. For instance, in a well pre-trained model, the word python will have two very different embeddings depending on whether it occurs in a programming context (as in, e.g., “python is my favorite language”) or in an ecological context (“while hiking in the rainforest, I saw a python”).

Our approach capitalizes on the contextual embeddings previously described. It does not require any external tools and does not rely on any human input or judgment. Also, thanks to its unsupervised nature, it can be applied to any language, provided that contextual embeddings are available.

The remainder of this paper is organized as follows. We detail our approach in section 2. Then, we present our experimental setup (sec. 3), evaluation metrics (sec. 4), and report and interpret our results (sec. 5). In section 6, we present an interesting by-product of our method, which allows the user to sample sentences that each contain a different sense of a given word. Finally, related work is presented in section 7.

2 Proposed approach

2.1 Basic assumption

First, by passing diverse sentences containing a given word to a pre-trained language model, we construct a representative set of vectors for that word (one vector for each occurrence of the word). The basic and intuitive assumption we make, is that the volume covered by the cloud of points in the contextual embedding space is representative of the polysemy of the associated word.

2.2 Main idea: multiresolution grids

As a proxy for the volume covered, we adopt a simple geometrical approach. As shown in Fig. 1, we construct a hierarchical discretization of the space, where, at each level, the same number of bins are drawn along each dimension. Each level corresponds to a different resolution. Our polysemy score is based on the proportion of bins covered by the vectors of a given word, at each level.

This simple binning strategy makes more sense than clustering-based approaches, because clusters do not partition the space equally and regularly. This is especially problematic since word representations are not uniformly distributed in the embedding space Ethayarajh (2019): vectors lying in a dense area of the space will always belong to one single large cluster, while outliers lying in sparser areas will be assigned to many different small clusters. Therefore, counting the number of clusters a given word belongs to is not a reliable indicator of how much of the space this word covers.

2.3 Scoring scheme

We quantify the polysemy degree of a word $w$ as:

$$\mathrm{score}(w) \;=\; \sum_{l=1}^{L} \frac{c_l^w}{2^{\,L-l}} \qquad (1)$$

where $c_l^w$ designates the proportion of bins covered by word $w$ at level $l$, between 0 and 1. At each level $l$, $2^l$ bins are drawn along each dimension (see the vertical and horizontal lines in Fig. 1). The hierarchy starts at $l=1$ since there is only one bin covering all the space at $l=0$ (so all words have equal coverage at this level). The total number of bins in the entire space, at a given level $l$, is equal to $2^{ld}$, where $d$ is the dimensionality of the space.

Consider again the example of Fig. 1. In this example, each word is associated with a set of 10 contextualized embeddings in a space of dimension $d=2$, and the hierarchy has $L=3$ levels. First, we can clearly see that word 1 (blue circles) covers a large area of the space while all the vectors of word 2 (orange squares) are grouped in the same region. Intuitively, this can be interpreted as “word 1 occurs in more different contexts than word 2”, which, per our assumption, is equivalent to saying that “word 1 is more polysemous than word 2”.

Let us now see how this is reflected by our scoring scheme. First, the penalization terms (denominators) for levels 1 to 3 are $2^{3-1}=4$, $2^{3-2}=2$, and $2^{3-3}=1$. Note that the higher the level, the exponentially more bins there are, and so the less penalized (or the more rewarded) coverage is, because getting good coverage becomes more and more difficult. Now, per Eq. 1, the score of each word is computed as the dot product of its coverage vector (coverage at each level) with the weight vector $(1/4, 1/2, 1)$, and word 1 obtains a higher score than word 2. We can thus see that our scores reflect what can be observed in Fig. 1: word 1 covers a larger area of the space than word 2.

Note that the score of a given word is only meaningful in comparison with the scores of other words, i.e., in rankings, as will be seen in the next section.

Implementation. To compute our scores, we built on the code of the pyramid match kernel from the GraKeL Python library Siglidis et al. (2018).
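For concreteness, the scoring scheme can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the GraKeL-based code we actually used: it assumes the space has already been reduced to $d$ dimensions, and that bins are obtained by splitting each dimension of a shared bounding box into $2^l$ equal-width intervals.

```python
import numpy as np

def polysemy_score(vectors, num_levels, lo, hi):
    """Coverage-based polysemy score of one word (Eq. 1).

    vectors: (n, d) array of contextual embeddings of the word, already reduced to d dims.
    num_levels: L, the number of levels in the hierarchy.
    lo, hi: (d,) arrays bounding the embedding space (shared by all words).
    """
    vectors = np.asarray(vectors, dtype=float)
    d = vectors.shape[1]
    # Rescale coordinates to [0, 1) so that binning is a simple floor operation.
    unit = np.clip((vectors - lo) / (hi - lo + 1e-12), 0.0, 1.0 - 1e-9)

    score = 0.0
    for level in range(1, num_levels + 1):
        bins_per_dim = 2 ** level                        # 2^l bins along each dimension
        cells = np.floor(unit * bins_per_dim).astype(int)
        covered = len({tuple(c) for c in cells})         # distinct bins occupied by the word
        coverage = covered / 2.0 ** (level * d)          # proportion of the 2^(l*d) bins covered
        score += coverage / 2 ** (num_levels - level)    # finer levels are penalized less
    return score

# Toy illustration in the spirit of Fig. 1 (d = 2, L = 3): word 1 is scattered over the
# whole space, word 2 is confined to one corner, so word 1 should obtain the larger score.
rng = np.random.default_rng(0)
lo, hi = np.zeros(2), np.ones(2)
word1 = rng.uniform(0.0, 1.0, size=(10, 2))
word2 = 0.1 * rng.uniform(0.0, 1.0, size=(10, 2))
print(polysemy_score(word1, 3, lo, hi), polysemy_score(word2, 3, lo, hi))
```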

3 Experiments

In this section, we describe the protocol we followed to test the extent to which our rankings match human rankings.

3.1 Word selection

The first step was to select the words to include in our analysis. To this end, we downloaded and extracted all the text from the latest available English Wikipedia dump2. We then performed tokenization, stopword, punctuation and number removal, and counted the occurrences of each token at least 3 characters long. Out of these tokens, we kept the 2000 most frequent.
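As an illustration, this selection step can be sketched as follows. The exact tokenizer and stopword list are implementation details not specified here, and `wiki_texts` is a stand-in for an iterable over the raw article texts.

```python
import re
from collections import Counter

from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

STOP = set(stopwords.words("english"))

def most_frequent_words(wiki_texts, k=2000):
    """Return the k most frequent tokens (>= 3 characters, no stopwords, punctuation, or numbers)."""
    counts = Counter()
    for text in wiki_texts:                           # wiki_texts: iterable of raw article strings
        tokens = re.findall(r"[a-z]+", text.lower())  # keeps alphabetic tokens only
        counts.update(t for t in tokens if len(t) >= 3 and t not in STOP)
    return [word for word, _ in counts.most_common(k)]
```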

3.2 Generating vector sets

For each word in the shortlist, we randomly selected 3000 sentences such that the corresponding word appeared exactly once within each sentence. The words that did not appear in at least 3000 sentences were removed from the analysis, reducing the size of the shortlist from 2000 to 1822. Then, for each word, the associated sentences were passed through a pre-trained ELMo model3 Peters et al. (2018) in test mode, and the top layer representations corresponding to the word were harvested. The advantage of using ELMo’s top layer embeddings is that they are the most contextual, as shown by Ethayarajh (2019). We ended up with a set of exactly 3000 1024-dimensional contextual embeddings for each word.
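For illustration, the extraction step can be sketched with the AllenNLP ElmoEmbedder interface (the implementation publicly released with the pre-trained weights). This is a sketch under the assumption that sentences are already tokenized; our actual pipeline may differ in details.

```python
import numpy as np
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # loads the publicly released pre-trained weights

def contextual_vectors(word, tokenized_sentences):
    """One top-layer ELMo vector per occurrence of `word` (one occurrence per sentence)."""
    vectors = []
    for tokens in tokenized_sentences:
        layers = elmo.embed_sentence(tokens)  # shape: (3 layers, n_tokens, 1024)
        position = tokens.index(word)         # the word appears exactly once per sentence
        vectors.append(layers[-1, position])  # top layer = most contextual (Ethayarajh, 2019)
    return np.stack(vectors)                  # shape: (n_sentences, 1024)
```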

3.3 Dimensionality reduction

Remember that the total number of bins in the entire space at a given level $l$ is equal to $2^{ld}$, which would have given us an intractably large number of bins even at the first level, since the ELMo representations have dimensionality $d=1024$. To reduce the dimensionality of the contextual embedding space, we applied PCA, trying 19 different output dimensionalities. Due to the quantity and high initial dimensionality of the vectors, we used the distributed4 version of PCA provided by PySpark's ML library Meng et al. (2016).
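A minimal sketch of this step with PySpark's ML library is shown below. Here `all_vectors` is a stand-in for the pooled ELMo vectors (the assumption being that a single projection is shared by all words), and the cluster configuration (executors, memory) is omitted.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("polysemy-pca").getOrCreate()

# `all_vectors`: an iterable over the 1024-dimensional ELMo vectors of all words pooled together.
df = spark.createDataFrame([(Vectors.dense(v),) for v in all_vectors], ["features"])

pca = PCA(k=3, inputCol="features", outputCol="reduced")  # k = target dimensionality d
model = pca.fit(df)
reduced = model.transform(df).select("reduced")
```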

3.4 Score computation

For each PCA output dimensionality, we computed our scores, trying 18 different hierarchies whose numbers of levels ranged from 2 to 19. In total, we thus obtained $19 \times 18 = 342$ rankings.

3.5 Ground truth rankings and baselines

We evaluated the rankings generated by our approach against several ground truth rankings that we derived from human-constructed resources.

Since the number of senses of a word is a subjective, debatable notion, and thus may vary from source to source, we included 6 ground truth rankings in our analysis, in order to minimize source-specific bias as much as possible. For sanity checking purposes, we also added two basic baseline rankings (frequency and random).

We provide more details about all rankings in what follows.

WordNet

We used WordNet Miller (1998) version 3.0 and counted the number of synonym sets or “synsets” of each word.

WordNet-Reduced

There are very subtle differences among the WordNet senses (“synsets”), which makes distinguishing between them difficult, and even irrelevant in some applications Palmer et al. (2004, 2007); Brown et al. (2010); Rumshisky (2011); Jurgens (2013). For instance, call has 41 senses in the original WordNet (28 as a verb and 13 as a noun). Even for words with fewer senses, like eating (7 senses in total), the differences between senses can be very subtle. For instance, “take in solid food” and “eat a meal; take a meal” are really close in meaning. This very fine granularity of WordNet may somewhat artificially inflate the polysemy of some words.

To reduce the granularity of the WordNet synsets, we used their sense keys5. Sense keys follow the format lemma%ss_type:lex_filenum:lex_id:head_word:head_id, where ss_type represents the synset type (part-of-speech tag such as noun, verb, adjective) and lex_filenum identifies the lexicographer file containing the synset for the sense (noun.animal, noun.event, verb.emotion, etc.). We truncated the sense keys after lex_filenum.

For instance, “take in solid food” and “eat a meal; take a meal” initially correspond to two different senses with keys eat%2:34:00:: and eat%2:34:01::, but after truncation, they both are mapped to the same sense: eat%2:34. However, coarse differences in senses are still captured. For instance, bank “sloping land” (bank%1:17:01::) and bank “financial institution” (bank%1:14:00::) are still mapped to two different senses after truncation, respectively bank%1:17 and bank%1:14.
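The truncation can be reproduced, for example, with NLTK's WordNet interface. The sketch below is illustrative; it counts distinct truncated keys for a given lemma.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def reduced_sense_count(word):
    """Number of senses after truncating sense keys to lemma%ss_type:lex_filenum."""
    truncated = set()
    for lemma in wn.lemmas(word):
        key = lemma.key()                                 # e.g. 'eat%2:34:00::'
        head, rest = key.split("%")
        ss_type, lex_filenum = rest.split(":")[:2]
        truncated.add(f"{head}%{ss_type}:{lex_filenum}")  # e.g. 'eat%2:34'
    return len(truncated)

print(len(wn.lemmas("eat")), reduced_sense_count("eat"))  # full vs. reduced granularity
```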

WordNet-Domains

WordNet Domains Bentivogli et al. (2004); Magnini and Cavaglia (2000) is a lexical resource created in a semi-automatic way to augment WordNet with domain labels. Instead of synsets, each word is associated with a number of semantic domains. The domains are areas of human knowledge (politics, economy, sports, etc.) exhibiting specific terminology and lexical coherence. As for the two previous WordNet ground truth rankings, we simply counted the number of domains associated with each word.

OntoNotes

OntoNotes Hovy et al. (2006); Weischedel et al. (2011) is a large annotated corpus comprising various genres of text (news, conversational telephone speech, weblogs, newsgroups, broadcast, talk shows) with structural information and shallow semantics.

We counted the senses in the sense inventory of each word. The senses in OntoNotes are groupings of the WordNet synsets, constructed by human annotators. As a result, the sense granularity of OntoNotes is coarser than that of WordNet Brown et al. (2010).

Oxford

We counted the number of senses returned by the Oxford dictionary6, which was, at the time of this study, the resource underlying the Google dictionary functionality.

Wikipedia

We capitalized on the Wikipedia disambiguation pages7. Such pages contain a list of the different categories under which one or more articles about the query word can be found. For example, the disambiguation page of the word bank includes categories such as geography, finance, computing (data bank) and science (blood bank). We counted the number of categories on the disambiguation page of each word to generate the ranking.

Frequency and random baselines

In the frequency baseline, we ranked words in decreasing order of their frequency in the entire Wikipedia dump (see subsection 3.1). The naive assumption made here is that words occurring the most have the most senses.

With the random baseline, on the other hand, we produced rankings by shuffling words and assigned them random scores sampled from a log-normal distribution8, to imitate the long-tail behavior of the other score distributions, as can be seen in Fig. 2. All distributions can be seen in Fig. 6. Note that to account for randomness, all results for the random baseline are averages over 30 runs.
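For illustration, one run of the random-baseline score generation can be sketched as follows, using the distribution parameters given in footnote 8 (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng()
# Log-normal scores (underlying mean 0, standard deviation 0.6; see footnote 8),
# imitating the long-tail shape of the human-derived score distributions.
random_scores = rng.lognormal(mean=0.0, sigma=0.6, size=1822)
random_ranking = np.argsort(-random_scores)  # word indices, from highest to lowest score
```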

Figure 2: Average score distribution of the 5 ground truth rankings and frequency baseline (histogram) vs. average score distribution of the random baseline (blue curve).

Not all of the 1822 words included in our analysis had an entry in each of the resources described above. The length of each ground truth ranking is shown in Table 1.

Ranking # words
WN 1535
WN-reduced 1535
WN-Domains 1420
Oxford 1536
Wikipedia 1042
OntoNotes 723
Frequency & random 1822
Table 1: Length of the ground truth rankings.

4 Evaluation metrics

As will be detailed next, we used 6 standard metrics from the fields of statistics and information retrieval to compare among methods. To ensure fair comparison, the scores in the rankings of all methods were normalized to the range $[0, 1]$ before proceeding.

Also, each method played in turn the role of candidate and ground truth. This allowed us to not only compute the similarity between our rankings and the ground truth rankings, but also the similarity among the ground truth rankings themselves, which was interesting for exploration purposes.

For each pair of evaluated and ground truth methods, only the parts of the rankings corresponding to the words in common (intersection) were compared. Thus, the rankings in each (candidate, ground truth) pair had equal length.

4.1 Similarity and correlation metrics

Cosine similarity

Cosine similarity measures the angle between the two vectors whose coordinates are given by the scores in the evaluated and ground truth rankings. What is evaluated here is the alignment between rankings, i.e., the extent to which the candidate method assigns high/low scores to the same words that receive high/low scores in the ground truth. Since all rankings have positive scores, cosine similarity lies in $[0, 1]$, where 0 indicates that the two vectors are orthogonal and 1 means that they are perfectly aligned. Since we are computing the value of an angle, only the ratios/proportions of scores matter here. E.g., two rankings whose scores are proportional, such as $(1, 2, 3)$ and $(2, 4, 6)$, would be considered perfectly aligned.

Spearman’s rho

Spearman’s rho Spearman (1904) is a measure of rank correlation. More precisely, it equals the famous Pearson product-moment correlation coefficient ($r$) computed from the ranks of the scores in the two rankings, rather than from the scores themselves.

Kendall’s tau

Kendall’s tau Kendall (1938) is another measure of rank correlation, based on the signs of rank differences. One can compute it by counting concordant and discordant pairs among the ranks of the scores in the two rankings. More precisely, given two rankings $x$ and $y$, a pair $(i, j)$ with $i < j$ is said to be concordant if $x_i - x_j$ and $y_i - y_j$ have the same sign (and discordant otherwise). Based on this notion, the metric is expressed:

$$\tau \;=\; \frac{\#\{\text{concordant pairs}\} - \#\{\text{discordant pairs}\}}{n(n-1)/2} \qquad (2)$$

Kendall’s tau can also be written:

$$\tau \;=\; \frac{2}{n(n-1)} \sum_{i<j} \mathrm{sign}(x_i - x_j)\,\mathrm{sign}(y_i - y_j) \qquad (3)$$

where $\mathrm{sign}$ designates the sign function and $n$ is the length of the two rankings.

Both Spearman’s rho and Kendall’s tau take values in $[-1, 1]$ ($-1$ for reversed and $1$ for identical rankings), and approach zero when the correlation between the two rankings is low (independence).
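For illustration, these alignment and rank-correlation statistics can be computed in Python with NumPy and SciPy as follows (a sketch; `ours` and `truth` are stand-ins for the two normalized score vectors over the intersected word list; our actual implementations are described in subsection 4.3):

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def cosine(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# `ours` and `truth`: normalized score vectors over the intersected word list.
rho, rho_pvalue = spearmanr(ours, truth)  # Spearman's rho and its p-value
tau, tau_pvalue = kendalltau(ours, truth) # Kendall's tau and its p-value
alignment = cosine(ours, truth)           # cosine similarity of the two score vectors
```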

4.2 Information retrieval metrics

p@k

Here, we simply compute the percentage of words in the top 10% of the candidate ranking that are present in the top 10% of the ground truth ranking. The idea here is to measure ranking quality for the most polysemous words.

NDCG

The Normalized Discounted Cumulative Gain or NDCG Järvelin and Kekäläinen (2002) is a standard metric in information retrieval. It is based on the Discounted Cumulative Gain (DCG):

$$\mathrm{DCG} \;=\; \sum_{i=1}^{n} \frac{rel_i}{\log_2(i+1)} \qquad (4)$$

where $rel_i$ designates the ground truth score of the word at the $i$-th position in the ranking under consideration, and $n$ denotes the length of the ranking. NDCG is then expressed as:

$$\mathrm{NDCG} \;=\; \frac{\mathrm{DCG}}{\mathrm{IDCG}} \qquad (5)$$

The denominator is called the ideal DCG, or IDCG. It is the DCG computed with the order provided by the ground truth ranking, that is, for the best possible word positioning.

Since the scores are penalized proportionally to their position in the ranking (with some concavity), the more words with high ground truth scores are placed on top of a candidate ranking, the better the NDCG of that ranking. NDCG is maximal and equal to 1 if the candidate and ground truth rankings are identical.
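A minimal implementation of Eqs. 4 and 5 could look as follows. This is a sketch: `candidate_order_relevances` stands for the ground truth scores arranged in the candidate ranking's order.

```python
import numpy as np

def dcg(relevances):
    """Discounted cumulative gain of ground truth scores listed in ranking order (Eq. 4)."""
    rel = np.asarray(relevances, dtype=float)
    positions = np.arange(1, len(rel) + 1)
    return float(np.sum(rel / np.log2(positions + 1)))

def ndcg(candidate_order_relevances, ground_truth_scores):
    """Eq. 5: DCG of the candidate ordering divided by the ideal DCG."""
    ideal = np.sort(np.asarray(ground_truth_scores, dtype=float))[::-1]  # best possible ordering
    return dcg(candidate_order_relevances) / dcg(ideal)
```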

RBO

The Rank Biased Overlap or RBO Webber et al. (2010) takes values in $[0, 1]$, where 0 means that the two rankings are independent and 1 that they match exactly. It is computed as:

$$\mathrm{RBO} \;=\; (1-p) \sum_{d=1}^{n} p^{\,d-1} A_d \qquad (6)$$

where $A_d$ is the proportion of words belonging to both rankings up to position $d$, $n$ is the length of the rankings, and $p \in [0, 1)$ is a parameter controlling how steep the decline in weights is: the smaller $p$, the more top-weighted the metric is. When $p = 0$, only the top-ranked word is considered, and the RBO is either zero or one. When $p$ is close to 1, the weights become flat, and more and more words are considered. We used $p = 0.98$ in our experiments, which means that the top 50 positions received 86% of all weight.
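The truncated sum of Eq. 6 can be sketched as follows. This is a simplified, non-extrapolated variant; the publicly available implementation we relied on (see subsection 4.3) may differ in details.

```python
def rbo(ranking_a, ranking_b, p=0.98):
    """Truncated Rank-Biased Overlap of two equal-length rankings (lists of words), per Eq. 6."""
    n = len(ranking_a)
    prefix_a, prefix_b = set(), set()
    weighted_agreement = 0.0
    for depth in range(1, n + 1):
        prefix_a.add(ranking_a[depth - 1])
        prefix_b.add(ranking_b[depth - 1])
        agreement = len(prefix_a & prefix_b) / depth  # A_d: overlap of the two top-d prefixes
        weighted_agreement += p ** (depth - 1) * agreement
    return (1 - p) * weighted_agreement
```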

4.3 Implementations

We used the base R R Core Team (2018) cor() function9 to compute the $\rho$ and $\tau$ statistics. For RBO, we relied on a publicly available Python implementation10. For all other metrics, we wrote our own implementations.

Figure 3: Pairwise similarity matrices between methods. For readability, all scores are shown as percentages. For Kendall and Spearman, * and *** denote weak and strong statistical significance, respectively. For a given metric, our configuration that best matches (on average) all other methods (except random and frequency) is always shown first. DL means that the compressed contextual embedding space has D dimensions and that the hierarchy has L levels. Rand, freq, wiki, oxf, ON, WN, WNred, and WNdom are short for random, frequency, Wikipedia, Oxford, OntoNotes, WordNet, WordNet reduced, and WordNet domains. All metrics except NDCG are symmetric, hence we only show one triangle for them. For NDCG, candidate methods are shown as columns and ground truths as rows.
sentences bin coordinates
it stars christopher lee as count dracula along with dennis waterman
the count of the new group is the sum of the separate counts of the two original groups
the first fight did not count towards the official record
five year old horatia came to live at merton in may 1805
it features various amounts of live and backstage footage while touring
first tax bills were used to pay taxes and to register bank deposits and bank credits
the ball nest is built on a bank tree stump or cavity
Table 2: Sentences containing different senses of the same word can be sampled by selecting from different bins.

5 Results

Our rankings correlate well with human rankings. Results are shown in Fig. 3, as pairwise similarity matrices, for all six metrics. For readability, all scores are shown as percentages. For a given metric, our configuration that best matches, on average, all other methods (except random and frequency) is always shown as the first column. Since all metrics except NDCG are symmetric, we only show the lower triangles of the other matrices. For NDCG, candidate methods are shown as columns and ground truths as rows.

For each of the six evaluation metrics, it can be seen that the ranking generated by our unsupervised, data-driven method is well correlated with all human-derived ground truth rankings. This means that our method is robust to how one defines and measures correlation or similarity.

In some cases, we even very closely reproduce the human rankings. For instance, our best configurations for cosine and NDCG get almost perfect scores of 86.5 and 99.72 when compared against Wikipedia. In terms of Kendall’s tau, Spearman’s rho, p@k, and RBO, we are also very close to OntoNotes (scores of 49.43, 35.23, 39.53, and 33.47, resp.).

Finally, the correlation between our rankings and the human rankings can also be observed to be, everywhere, much stronger than that between the baseline rankings (random and frequency) and the human rankings.

Statistical significance. We computed statistical significance for the Spearman’s rho and Kendall’s tau metrics. As can be seen in Fig. 3, the null hypothesis that there is no correlation between our rankings and the human-derived ground truth rankings was systematically rejected, everywhere, with very high significance.

However, against the random baseline, the same null hypothesis (no correlation) was accepted everywhere. Against frequency, the null was rejected, but only weakly, and with very low correlation coefficients (6.53 for Spearman and 4.44 for Kendall).

Finally, the correlation between the random and frequency rankings and the ground truth rankings is never statistically significant, with the exception of the frequency/OntoNotes pair, and even then only at a weak level.

Hyperparameters have a significant impact on performance, but optimal values are consistent across metrics. First, as can be observed from Fig. 4 and Fig. 5, there is a large variability in performance when $d$ (the number of PCA dimensions) and $L$ (the number of levels in the hierarchy) vary.

However, for all six evaluation metrics, the best configurations are very similar11. Given the rather large grid we explored (19 values of $d$ and 18 values of $L$, i.e., 342 combinations in total), we can say that all these optimal values belong to the same small neighborhood. This interpretation is confirmed by inspecting Fig. 4, where it can clearly be seen that the optimal area of the hyperparameter space is robust to metric selection, and consistently corresponds to small values of $d$ (around 3) and values of $L$ of at least 3 or 4, ideally around 8. For larger values of $L$, performance plateaus (keeping $d$ fixed). In other words, it is necessary to have some levels in the hierarchy, but very deep hierarchies are not required for our method to work well. A benefit of such small optimal values of $d$ and $L$ is their affordability from a computational standpoint.

All rankings derived from WordNet-based resources are highly correlated. It is interesting to note that the rankings generated from OntoNotes, WordNet, WordNet reduced, and WordNet domains are all highly similar, despite the very different sense granularities of these resources. In other words, despite their apparent differences, they all tend to assign the same number of senses to the same words. The Oxford ranking also tends to be part of this high-similarity cluster, albeit to a lesser extent.

Frequent words are not the most polysemous. A final interesting observation is that while the frequency ranking is much better than the random ones, it remains far from the human rankings. In other words, the frequency of occurrence of a word (excluding stopwords, of course) is not as good an indicator of its polysemy as one could expect.

Figure 4: Performance (color scale) as a function of the number of PCA dimensions $d$ and the number of levels in the hierarchy $L$ (the two axes).
Figure 5: Performance distributions over the 342 values in the discrete hyperparameter space (grids of Fig. 4).
Figure 6: Normalized ranking score distributions for the random and frequency rankings and the human-derived ground truth rankings.

6 Sampling diverse examples

An interesting application of our discretization strategy is that it can be used to select sentences containing different senses of the same word, as illustrated in Table 2. Provided a mapping, for a given word, between the sentences that were passed to the pre-trained language model and the vectors, we can sample vectors from different bins and retrieve the associated sentences. If the bins are distant enough, the sentences will contain different senses of the word. For instance, in Table 2, we can see that we are able to sample sentences containing three senses of the word count: (1) noble title, (2) determining the total number, and (3) taking into account. While a by-product of our approach, this sampling methodology has many useful applications in practice, e.g., in online dictionaries, dataset creation, etc.
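A sketch of this sampling procedure, reusing the same binning as in the scoring sketch of Section 2, is given below (variable names are illustrative):

```python
import numpy as np
from collections import defaultdict

def sample_per_bin(vectors, sentences, level, lo, hi, max_bins=5):
    """Return one (bin, sentence) pair per occupied bin at the given resolution level.

    vectors: (n, d) reduced embeddings of a word; sentences: the n source sentences;
    lo, hi: per-dimension bounds of the embedding space, as in the scoring sketch of Sec. 2.
    """
    unit = np.clip((np.asarray(vectors, dtype=float) - lo) / (hi - lo + 1e-12), 0.0, 1.0 - 1e-9)
    cells = np.floor(unit * 2 ** level).astype(int)

    by_bin = defaultdict(list)
    for idx, cell in enumerate(cells):
        by_bin[tuple(cell)].append(idx)

    # One representative occurrence per bin; distant bins tend to carry different senses.
    return [(bin_coords, sentences[indices[0]])
            for bin_coords, indices in list(by_bin.items())[:max_bins]]
```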

7 Related work

Task. To the best of our knowledge, this study is the first to focus purely on polysemy quantification, that is, on estimating the number of senses of words without trying to label these senses. Also, this study is, still to the best of our knowledge, the first to approach word sense disambiguation (or, to be precise, a subtask thereof) from a purely empirical and unsupervised standpoint. Indeed, except for performance evaluation, no human annotators (even non-expert ones) and no human-constructed word sense inventories or dictionaries are involved in our process.

For the reasons above, we did not find any previous work directly comparable with ours in the literature. However, several previous efforts have addressed the creation of sense inventories without human experts.

For instance, in Rumshisky (2011); Rumshisky et al. (2012) 12, Amazon Mechanical Turk (AMT) workers are given a set of sentences containing the target word, and one sentence that is randomly selected from this set as a target sentence. Workers are then asked to judge, for each sentence, whether the target word is used in the same way as in the target sentence. This creates an undirected graph of sentences. Clustering can then be applied to that graph to find senses. To label clusters with senses, one has to manually inspect the sentences in each cluster.

More recently, Jurgens (2013)13 compared three annotation methodologies for gathering word sense labels on AMT: Likert-scale ratings, a two-stage select-and-rate scheme, and the difference between the counts of times a sense was rated best and worst. Regardless of the strategy, inter-annotator agreement remains low (around 0.3).

Methodology. In the original ELMo paper, Peters et al. (2018) have shown that using contextual word representations (through nearest neighbor matching) improves word sense disambiguation. Hadiwinoto et al. (2019) showed that this technique, along with some other ones, works well for BERT too.

From a methodological point of view, our approach is related in spirit to pyramid matching Nikolentzos et al. (2017); Grauman and Darrell (2007); Lazebnik et al. (2006). This kernel-based method originated in computer vision; it computes the similarity between objects by placing a sequence of increasingly coarser grids over the feature space and taking a weighted sum of the number of matches that occur at each resolution level. Matches found at finer resolutions are weighted more highly than matches found at coarser resolutions.

8 Conclusion

We proposed a novel unsupervised, fully data-driven geometrical approach to estimate word polysemy. Our approach builds multiresolution grids in the contextual embedding space. We showed through rigorous experiments that our rankings are well correlated (with strong statistical significance) with 6 different human rankings, for 6 different metrics. Such fully data-driven rankings of words according to polysemy can help in creating new sense inventories, but also in validating and interpreting existing ones. Increasing the quality and consistency of sense inventories is a key first step of the word sense disambiguation pipeline. We also showed that our discretization can be used, at no extra cost, to sample contexts containing different senses of a given word, which has useful applications in practice. Finally, the fully unsupervised nature of our method makes it applicable to any language.

While our scores are a good proxy for polysemy, they are not equal to word sense counts. Moreover, we do not label each sense. Future work should address these challenges, by, e.g., automatically selecting bins of interest, and generating labels for them. Another direction of work is investigating how different contextual embeddings (e.g., BERT) impact our rankings. Finally, it would be interesting to test the effect on performance of basic transformations of the contextual embedding space, such as that proposed in Mu et al. (2017).

9 Acknowledgments

We thank Giannis Nikolentzos for helpful discussions about pyramid matching. The GPU that was used in this study was donated by the NVidia corporation as part of their GPU grant program. This work was supported by the LinTo project.

Footnotes

  1. https://github.com/ksipos/polysemy-assessment
  2. https://dumps.wikimedia.org/
  3. We used the implementation and pre-trained weights publicly released by the authors https://allennlp.org/elmo.
  4. 15 executors with 10 GB of RAM each.
  5. See ‘Sense Key Encoding’ here: https://wordnet.princeton.edu/documentation/senseidx5wn
  6. www.lexico.com
  7. https://en.wikipedia.org/wiki/word_(disambiguation)
  8. with mean and standard deviation 0 and 0.6 (resp.)
  9. https://stat.ethz.ch/R-manual/R-patched/library/stats/html/cor.html
  10. https://github.com/changyaochen/rbo
  11. For RBO, two configurations tied for the best score.
  12. We asked the authors to share annotations with us to use as ground truth, but they were unable to do so.
  13. same as footnote 12.

References

  1. Ron Artstein and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596.
  2. Luisa Bentivogli, Pamela Forner, Bernardo Magnini, and Emanuele Pianta. 2004. Revising the wordnet domains hierarchy: semantics, coverage and balancing. In Proceedings of the workshop on multilingual linguistic resources, pages 94–101.
  3. Susan Windisch Brown, Travis Rood, and Martha Palmer. 2010. Number or nuance: Which factors restrict reliable word sense annotation? In LREC.
  4. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  5. Kawin Ethayarajh. 2019. How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings. arXiv preprint arXiv:1909.00512.
  6. Kristen Grauman and Trevor Darrell. 2007. The pyramid match kernel: Efficient learning with sets of features. Journal of Machine Learning Research, 8(Apr):725–760.
  7. Christian Hadiwinoto, Hwee Tou Ng, and Wee Chung Gan. 2019. Improved word sense disambiguation using pre-trained contextualized word representations. arXiv preprint arXiv:1910.00194.
  8. Eduard Hovy, Mitch Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. Ontonotes: the 90% solution. In Proceedings of the human language technology conference of the NAACL, Companion Volume: Short Papers, pages 57–60.
  9. Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
  10. Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446.
  11. David Jurgens. 2013. Embracing ambiguity: A comparison of annotation methodologies for crowdsourcing word sense labels. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 556–562.
  12. Maurice G Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1/2):81–93.
  13. Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. 2006. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 2169–2178. IEEE.
  14. Bernardo Magnini and Gabriela Cavaglia. 2000. Integrating subject field codes into wordnet. In LREC, pages 1413–1418.
  15. Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. 2016. Mllib: Machine learning in apache spark. The Journal of Machine Learning Research, 17(1):1235–1241.
  16. George A Miller. 1998. WordNet: An electronic lexical database. MIT press.
  17. Jiaqi Mu, Suma Bhat, and Pramod Viswanath. 2017. All-but-the-top: Simple and effective postprocessing for word representations. arXiv preprint arXiv:1702.01417.
  18. Giannis Nikolentzos, Polykarpos Meladianos, and Michalis Vazirgiannis. 2017. Matching node embeddings for graph similarity. In Thirty-First AAAI Conference on Artificial Intelligence.
  19. Martha Palmer, Olga Babko-Malaya, and Hoa Trang Dang. 2004. Different sense granularities for different applications. In Proceedings of the 2nd International Workshop on Scalable Natural Language Understanding (ScaNaLU 2004) at HLT-NAACL 2004, pages 49–56.
  20. Martha Palmer, Hoa Trang Dang, and Christiane Fellbaum. 2007. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering, 13(2):137–163.
  21. Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  22. R Core Team. 2018. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  23. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai-assets/researchcovers/languageunsupervised/language understanding paper. pdf.
  24. Anna Rumshisky. 2011. Crowdsourcing word sense definition. In Proceedings of the 5th Linguistic Annotation Workshop, pages 74–81. Association for Computational Linguistics.
  25. Anna Rumshisky, Nick Botchan, Sophie Kushkuley, and James Pustejovsky. 2012. Word sense inventories by non-experts. In LREC, pages 4055–4059.
  26. Giannis Siglidis, Giannis Nikolentzos, Stratis Limnios, Christos Giatsidis, Konstantinos Skianis, and Michalis Vazirgiannis. 2018. Grakel: A graph kernel library in python. arXiv preprint arXiv:1806.02193.
  27. Charles Spearman. 1904. The proof and measurement of association between two things.
  28. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  29. William Webber, Alistair Moffat, and Justin Zobel. 2010. A similarity measure for indefinite rankings. ACM Transactions on Information Systems (TOIS), 28(4):1–38.
  30. Ralph Weischedel, Sameer Pradhan, Lance Ramshaw, Martha Palmer, Nianwen Xue, Mitchell Marcus, Ann Taylor, Craig Greenberg, Eduard Hovy, Robert Belvin, et al. 2011. Ontonotes release 4.0. LDC2011T03, Philadelphia, Penn.: Linguistic Data Consortium.