A Novel Neural Sequence Model with Multiple Attentions for Word Sense Disambiguation
Word sense disambiguation (WSD) is a well-researched problem in computational linguistics, and prior work has approached it in many different ways. Supervised models achieve some of the best reported accuracy on this problem, but they often fall behind flexible knowledge-based solutions, which use engineered features as well as human annotators to disambiguate every target word. This work focuses on bridging this gap using neural sequence models that incorporate the well-known attention mechanism. The core idea is to combine multiple attentions over different linguistic features through weights, and to provide a unified framework for doing so. This weighted attention lets the model disambiguate the sense of an ambiguous word by attending to a suitable portion of the sentence. Our extensive experiments show that multiple attention yields a more versatile encoder-decoder model, leading to state-of-the-art results.
Word sense disambiguation (WSD) is the task of assigning the appropriate meaning to a target word when that sense is clearly distinguishable from the word's other senses given the attributes of the target word's context. As one of the challenging problems in computational linguistics, WSD has received considerable attention over the past decade [1, 2] because of its many potential applications, such as information retrieval, text mining, machine translation, speech synthesis and question answering. Classical algorithms for WSD include the Lesk algorithm, which measures the overlap between the dictionary definitions of the words in a sentence; Naive Bayes, which estimates the conditional probability of each word sense given its context, under the assumption that the words in the bag-of-words context occur independently of each other and of their order; and neural networks, in which the words of a sentence are represented by nodes that are gradually turned on or off as training proceeds, activating neighbouring nodes on the next cycle, until the network stabilizes in a state where one sense of each input word is more activated than the others.
A large body of research continues to address this classical WSD problem, developing new algorithms [3, 4, 5] and evaluating them on well-known benchmarks [6, 7, 8]. These recent works mainly focus on two well-known WSD drawbacks: the order of the words in the context and the reliance on a series of handcrafted features. Most traditional supervised WSD methods extract features from the surrounding words and then train a classifier for each ambiguous word.
Recently, deep neural network (DNN) based approaches have achieved state-of-the-art results on many widely studied problems in computational linguistics. In the past few years, neural networks have been applied very successfully to learning word embeddings, which represent each word as a vector of fixed dimension. Mikolov et al. trained a shallow neural network to obtain a vector representation of a target word from its surrounding context, and claim that the representation captures syntactic and semantic similarity very well. Later, Levy et al. showed that Mikolov's word2vec skip-gram model with negative sampling implicitly factorizes a word-context matrix in which every cell is the point-wise mutual information (PMI) of the respective word and context pair, shifted by a global constant.
Neural machine translation (NMT) is a recent approach to machine translation proposed by [12, 13, 14]. Unlike the conventional phrase-based translation framework, which consists of numerous small sub-components that are tuned independently, NMT builds and trains a single large encoder-decoder neural network that reads a sentence and outputs a correct translation. The attention mechanism has been a breakthrough in NMT: it calculates how much attention the network has to give to each source word to generate a specific translated word. Bahdanau et al. first proposed performing translation and alignment jointly, calculating the context from the encoder outputs and the last hidden state of the decoder at each time step. They then combined this context with the word decoded at the previous time step to generate a new word at the current time step. Luong et al. introduced two new attention mechanisms, one of which looks at the global context and the other at a local context, i.e. a subset of words. They used the translated word at each time step along with the encoder outputs to calculate the context. Finally, they concatenated this context with the recurrent neural network (RNN) output of the decoder and mapped it to a translated word via a multi-layer perceptron (MLP). Sennrich et al. show that, although these Seq2Seq models capture word-level features very well, adding linguistic features such as part-of-speech tags, morphological features and syntactic dependencies is not redundant, and they report improved results on the WMT16 task.
A number of recent works have adopted this sequence-to-sequence (Seq2Seq) concept for WSD. Among them, Raganato et al. experimented with several neural sequence models, including a bidirectional long short-term memory (BLSTM) architecture in many-to-many form, with and without attention, for sequence tagging. They also experimented with a sequence-to-sequence architecture with attention for multitask learning, tagging each sentence with both senses and parts of speech. Their best-performing model is the attentive BLSTM tagger rather than the Seq2Seq architecture. Melamud et al. trained a BLSTM architecture on an unlabeled corpus to obtain a context representation for each sense annotation. Their objective function takes a sentence in which the target word is left empty or replaced by a dummy symbol, and tries to place that modified context and the target word in the same low-dimensional space. Kågebäck et al. relied on a BLSTM-based approach in which the sentence is divided into two sections, the left context and the right context, based on the position of the target word. They then applied two long short-term memories (LSTMs) from opposite directions to these contexts, concatenated their last hidden states and used an MLP to classify the corresponding sense. Yuan et al. propose a powerful neural language model to obtain a latent representation of the entire sentence containing a target word w, and then compare this representation with those of sentences containing other candidate meanings of w. Ahmad et al. proposed an architecture that calculates the cosine similarities between the sense embedding of the center word and the word embedding of every other word in the sentence. They then applied two LSTMs to this vector of similarities from the left and right directions, concatenated the results and finally applied a fully connected layer to classify the word sense as a one-class classification problem.
In this paper, we propose several encoder-decoder architectures for WSD that take the linguistic features of the surrounding context words into account and combine them, while also capturing the order of the context words. The Seq2Seq architecture for NMT is used here as a sequence tagger, and the attention mechanism tells the network how much attention to pay to each linguistic feature in order to identify the specific meaning of an ambiguous word. We choose neural sequence models because they generally perform very well on sequence data by keeping it in memory, and adding linguistic features to these models improves their performance. Although Raganato et al. show that Seq2Seq is sub-optimal for WSD, this paper revisits their finding and explores the effectiveness of various attentive encoder-decoder architectures for this task. We use a supervised attention-based method to learn a linear combination of different features and generate a final attention matrix based on linguistic features. We consider three candidate features: surrounding word vectors, surrounding context bigrams and the parts of speech (POS) sequence of the whole sentence. Adding multiple attentions helps determine the contribution of each attentional feature, and a vector representing the linear combination of those features captures each of them. Even though resource limitations prevented us from using the entire corpus, our Seq2Seq architecture with multiple attentions obtains state-of-the-art performance on some of the benchmarks.
II The Model
In this section, we describe our work in detail. We first explain our basic Seq2Seq model with attention on bigrams. We then explain how to apply multiple attentions to different features, and finally describe how to combine these attentions using weights.
II-A Attention on Bigrams
Our first architecture is based entirely on the attention-based Seq2Seq model with an encoder and a decoder, as shown in Figure 1. The encoder takes the source sentence as a sequence of words, one of which is the ambiguous word. The decoder tries to generate a sequence that matches the input at every index i, except that the target word at index t is replaced by the corresponding sense-tagged word. Adding an attention mechanism to this architecture lets the model take one word at a time from the decoder and determine which words in the input sentence are useful for generating it. However, rather than word-by-word attention, we use attention over bigrams, because the contribution of the neighbouring context words matters most when generating a particular sense of a word. We pad a start token at the beginning of the sentence and pass the modified sentence to an embedding layer. The embedding weights are initialized either randomly or from pre-trained word vectors, and are trained along with the other parameters of the network. Next, a convolution layer is initialized whose kernel goes over the bigram embeddings with a stride length of 1, as shown in Eqn. 1.
Here the convolution operates on the embedding matrix with the convolution kernel, and the sequence dimension is the maximum sequence length in the current batch. This generates a convolved embedding of bigrams, which is then fed to the gated recurrent unit (GRU) layer.
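As a rough illustration of the bigram convolution described above, the following sketch slides a kernel over pairs of adjacent token embeddings with a stride of 1. It is plain Python rather than PyTorch, and the kernel width of 2, the function name `bigram_convolve`, the filter values and the toy dimensions are all illustrative assumptions.

```python
def bigram_convolve(embeddings, kernel):
    """Slide a width-2 kernel over token embeddings with stride 1.

    embeddings: list of n token vectors, each of dimension d
    kernel:     list of filters, each a pair of d-dimensional weight rows
                (one row per position inside the bigram window)
    returns:    list of n - 1 bigram vectors, one value per filter
    """
    convolved = []
    for i in range(len(embeddings) - 1):  # stride of 1
        window = (embeddings[i], embeddings[i + 1])
        features = []
        for filt in kernel:
            total = 0.0
            for weights, token in zip(filt, window):
                total += sum(w * x for w, x in zip(weights, token))
            features.append(total)
        convolved.append(features)
    return convolved


# Toy sentence: a start token followed by three words, embedding size 2.
sentence = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Two hypothetical filters: one sums both tokens, one keeps only the second.
kernel = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [1.0, 1.0]],
]
bigrams = bigram_convolve(sentence, kernel)
```

With the start token padded in front, a three-word sentence yields three bigram vectors in this sketch, so every word is paired with its left neighbour.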
The last hidden state of the encoder GRU is the encoded representation of the sentence. Next, in the decoder, a GRU is initialized with this encoded representation as its hidden state and receives its first input. This generates a new decoder hidden state.
We then pass this hidden state and the encoder outputs at each time step to an ‘Attention’ model, which returns an attention matrix with one weight per encoder time step.
Next, we apply a batch-wise matrix multiplication of the attention matrix and the encoder outputs to compute the context.
We then concatenate the context and the decoder hidden state and pass the result to an MLP followed by a Softmax layer, which maps it back to the vocabulary size.
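The attention and context steps above can be sketched as follows for a single sentence. This is a minimal illustration that assumes the 'dot' scoring function and normalizes the energies with a softmax; the helper names and toy vectors are hypothetical, and batching is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention(decoder_hidden, encoder_outputs):
    """Score each encoder time step by its dot product with the decoder state."""
    energies = [
        sum(h * e for h, e in zip(decoder_hidden, enc_out))
        for enc_out in encoder_outputs
    ]
    return softmax(energies)


def context_vector(attention, encoder_outputs):
    """Attention-weighted sum of the encoder outputs."""
    dim = len(encoder_outputs[0])
    return [
        sum(a * enc_out[j] for a, enc_out in zip(attention, encoder_outputs))
        for j in range(dim)
    ]


encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three time steps
decoder_hidden = [2.0, 0.0]
attention = dot_attention(decoder_hidden, encoder_outputs)
context = context_vector(attention, encoder_outputs)
# Input to the output MLP: the context concatenated with the decoder state.
mlp_input = context + decoder_hidden
```

In a batched implementation the weighted sum in `context_vector` corresponds to one batch-wise matrix multiplication.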
At the next time step, the decoder GRU unfolds again, but this time it takes the previous decoder hidden state and the word generated at the previous time step into account, so Eqn. 3 is replaced by Eqn. 7.
The rest of the training is done in an end-to-end fashion.
II-B Attention on Words and Parts of Speech (POS)
In this model, we introduce a multiple attention mechanism in which we apply individual attention to different features of the data and later combine them through point-wise multiplication. We start with the traditional Seq2Seq architecture with an encoder and a decoder. As shown in Figure 2, the encoder has a GRU layer that takes word embeddings as well as POS embeddings as input. Rather than using separate GRU layers for words and POS, we use a single GRU layer whose weights are shared between the two inputs. This also allows the gradients to be shared during back-propagation. Next, using Eqn. 2, we calculate the encoded word and POS features.
This gives one sequence of hidden states for the words and one for the POS tags. From each of these, we take the hidden state at the last index as the encoded version. We then apply Eqn. 3 and 4 to these encoded hidden state vectors, which gives us two attention matrices, one for words and one for POS. We obtain the final attention matrix by point-wise multiplication of the word and POS attention matrices.
The point-wise multiplication changes the amplitude of each dimension of these vectors and also encodes the positions of the target word's neighbours according to their amplitude. If both the word and the POS attention matrices put more focus on the same word, the magnitude of the final attention vector at that word's dimension becomes comparatively large. If the two attention vectors focus on two different words, the final attention is distributed over those two words. This makes it similar to a soft-hard attention model, where the model decides which one gets activated and when. Next, to turn the final attention matrix into a probability distribution, we apply a Softmax layer to it.
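A small sketch of this combination, assuming each attention is a plain list of per-word weights for one sentence; the function name and the toy attention values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_pointwise(word_attention, pos_attention):
    """Multiply two attention vectors element-wise, then renormalize."""
    product = [w * p for w, p in zip(word_attention, pos_attention)]
    return softmax(product)


# Both attentions focus on the second word: the product peaks there.
agree = combine_pointwise([0.1, 0.8, 0.1], [0.2, 0.7, 0.1])
# The attentions focus on different words: the mass is shared between them.
disagree = combine_pointwise([0.8, 0.1, 0.1], [0.1, 0.8, 0.1])
```

Because the products lie in [0, 1], the softmax output in this toy example is fairly flat; the sketch only illustrates where the attention mass ends up, not its magnitude in a trained model.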
Finally, we use Eqn. 5 to calculate the context, followed by Eqn. 6 to generate the next predicted word. As in subsection II-A, Eqn. 3 is replaced by Eqn. 7 from the second time step onward to generate the decoder hidden state. The decoder continues to decode until it has generated all words in the target sentence or an EOS token is encountered. The rest of the training is done in an end-to-end manner.
II-C Attention on Words, Parts of Speech and Bigrams with Weighting
In this model, we combine the bigram and POS attention architectures from II-A and II-B and introduce a term called weighted attention. As shown in Figure 3, the input GRU in the encoder takes word embeddings, POS embeddings and bigram embeddings. As in our previous model, we share the GRU weights among these three inputs. We apply Eqn. 2 to the three inputs independently and calculate three encoded vectors, for words, POS and bigrams respectively. We then apply Eqn. 3 and Eqn. 4 to these encoded vectors independently and calculate three attention vectors. Next, we generate three weights, one per attention vector, and use them to perform a weighted linear addition of the three attention vectors.
Following this, we use Eqn. 10 to turn these attention weights into probabilities. Calculating the context and generating the next probable word then proceed as in II-A and II-B. The decoder continues to generate until an EOS token is found or the entire sequence has been generated.
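The weighted addition followed by the softmax can be sketched as below. The scalar weights are the values reported later for words, POS and bigrams (0.32, 3.58 and 3.45); the attention vectors and the function name are made-up illustrations.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_attention(attentions, weights):
    """Scale each attention vector by its scalar weight, add them, renormalize."""
    length = len(attentions[0])
    combined = [
        sum(w * att[j] for w, att in zip(weights, attentions))
        for j in range(length)
    ]
    return softmax(combined)


word_att = [0.7, 0.2, 0.1]    # hypothetical attention over three words
pos_att = [0.1, 0.8, 0.1]
bigram_att = [0.2, 0.7, 0.1]
final_att = weighted_attention(
    [word_att, pos_att, bigram_att],
    weights=[0.32, 3.58, 3.45],
)
```

In this toy case the word attention points at the first word, but the larger POS and bigram weights move the final attention to the second word.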
Generating weight values: Currently, most research uses attention to decide which portion of a feature the model should attend to more. To the best of our knowledge, no work has investigated putting attention on attentions. In this study, we propose a few ways to do this. First, if a model has multiple features with a separate attention for each, the simplest method is to combine all the attentions and apply a Softmax to them. Another way is to apply a local gating mechanism, in which we first pass each individual attention vector through a Sigmoid layer. This generates a value in (0, 1) for each attention vector. We then perform a point-wise multiplication of the individual attention vectors with their sigmoid values. Finally, we add up these values, apply a Softmax to them and use the result as the attention vector.
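One possible reading of the local gating mechanism is sketched below. It assumes the sigmoid is applied element-wise to each attention vector, which is an interpretation rather than a stated detail; the function names and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def local_gating(attentions):
    """Gate each attention vector by its own sigmoid, sum the results, renormalize."""
    length = len(attentions[0])
    gated_sum = [0.0] * length
    for att in attentions:
        for j, value in enumerate(att):
            gated_sum[j] += value * sigmoid(value)  # gate lies in (0, 1)
    return softmax(gated_sum)


local_att = local_gating([
    [0.7, 0.2, 0.1],   # word attention (toy values)
    [0.1, 0.8, 0.1],   # POS attention
    [0.2, 0.7, 0.1],   # bigram attention
])
```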
Apart from local gating, it is also possible to apply a global gating mechanism where we first concatenate all the attention vectors and separately apply Softmax on individual columns of this concatenated matrix. We then pick the best features for each position via argmax and make a vector from it. Finally we apply another Softmax on this vector to get the attention vector.
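A minimal sketch of the global gating idea follows. It assumes that, at each position, the feature with the highest column-wise softmax score contributes its original attention value to the new vector; this is one possible interpretation of the argmax step, and the names and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_gating(attentions):
    """Pick the winning feature per position, then renormalize the picked values."""
    length = len(attentions[0])
    picked, winners = [], []
    for j in range(length):
        # Softmax over the features (one column of the stacked matrix).
        column = softmax([att[j] for att in attentions])
        best = max(range(len(attentions)), key=lambda k: column[k])
        winners.append(best)
        picked.append(attentions[best][j])
    return softmax(picked), winners


global_att, winners = global_gating([
    [0.7, 0.2, 0.1],   # word attention (toy values)
    [0.1, 0.8, 0.1],   # POS attention
    [0.2, 0.7, 0.1],   # bigram attention
])
```

Here `winners` records which feature was selected at each position; ties go to the first feature in the list.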
Another way to generate the weights is to scale each attention vector by a scaling factor and then add the results. The factor can be a vector or a scalar. To do this, we first initialize the factor with random values and then add it to the model parameters, so that its gradient is calculated from the loss. If the factor is a vector, we can use an MLP without a bias for the transformation. In this study, we choose scalar factors as weights, initialize them, and then perform a linear weighted addition of the attention vectors. After the addition, we apply a Softmax to the resulting vector to make it a vector of probabilities. In the next section we show that, even though we start from these initial weight values, the model adjusts them through the gradients during training and ends up with a completely different set of values.
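To illustrate how scalar weights can move away from their initial values under gradient descent, the toy example below trains two scalars on a single position-prediction target with a cross-entropy loss. The loss, learning rate, initial values of 1.0 and attention vectors are all assumptions for illustration; in the full model the gradient flows from the decoder's output loss instead.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_scalar_weights(attentions, target, weights, lr=0.5, steps=50):
    """Gradient descent on scalar attention weights for one toy example."""
    weights = list(weights)
    length = len(attentions[0])
    losses = []
    for _ in range(steps):
        combined = [
            sum(w * att[j] for w, att in zip(weights, attentions))
            for j in range(length)
        ]
        probs = softmax(combined)
        losses.append(-math.log(probs[target]))
        # Gradient of cross-entropy w.r.t. the combined scores: p_j - y_j.
        grad_scores = [p - (1.0 if j == target else 0.0)
                       for j, p in enumerate(probs)]
        for k, att in enumerate(attentions):
            grad_w = sum(g * a for g, a in zip(grad_scores, att))
            weights[k] -= lr * grad_w
    return weights, losses


word_att = [0.7, 0.2, 0.1]   # points at the wrong word
pos_att = [0.1, 0.8, 0.1]    # points at the target word (index 1)
learned, losses = train_scalar_weights([word_att, pos_att], target=1,
                                       weights=[1.0, 1.0])
```

The weight on the helpful attention grows while the weight on the misleading one shrinks, which mirrors the qualitative behaviour described above.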
III Experimental Setup
In this section we describe the detailed experimental setup for the evaluation of our study. We first explain our training corpus as well as all the benchmarks used in other standard WSD methods. Following this, we explain the technical details of our proposed architectures along with their hyper-parameter settings.
Training and Evaluation Benchmarks: We use the SemCor 3.0, MASC and Senseval-3 corpora for training and evaluating our models. Many existing works have used these corpora as benchmarks [3, 20, 22, 23, 24, 25]. As our models are configured to work with WordNet senses, we use an existing mapping algorithm to map the SemCor and MASC corpora from NOAD senses to WordNet senses. We evaluate our architectures both within the same domain and across domains. Standard splits of these corpora are used when evaluating on the same domain (Train - SemCor, Test - SemCor and likewise for the other corpora). For cross-domain evaluation, the whole corpus from one domain is used for training and the entire corpus from the other domain is used for testing.
|Hyper-parameter||Value|
|Learning rate||0.01 / 0.02 / 0.001|
|Context size||50 (25 on each side)|
|Batch size||10 / 50 / 100|
|No. of GRU layers||Encoder - 2, Decoder - 2|
|Type of GRU layer||Encoder - bidirectional|
|Dropout||0.1 / 0.2 / 0.3|
|Word embedding size||100 / 200 / 300|
|Initialization of scalar weights on attentions||Random uniform|
|Decoder learning ratio||5.0|
|Gradient clipping||50 / 25 / 10|
Model selection is done using the validation set of each corpus, and the best model is then retrained on the combined training and validation sets and evaluated on the held-out test set. As we have not found any standard splits for the SemCor and MASC corpora, we split them manually into training, test and validation sets with a ratio of 80%, 10% and 10%, respectively. The hyper-parameters are tuned solely on the validation set.
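An 80/10/10 split of this kind could be produced along the following lines. The shuffling, the fixed seed and the function name are assumptions made for the sketch; the text only specifies the ratios.

```python
import random


def split_corpus(sentences, seed=13):
    """Shuffle a corpus and split it 80% / 10% / 10% into train / valid / test."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # hypothetical: fixed seed for repeatability
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


train, valid, test = split_corpus(range(100))
# After model selection, the final model is trained on train + valid.
final_train = train + valid
```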
During testing, our models calculate the probability distribution over output words given a target word. The output at each time step is fed to a Softmax layer, which gives the probability of each class. These probabilities are then used to rank the candidate senses of the target word, and the top-ranked candidates are selected as the output of the model.
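The ranking step can be sketched as follows, assuming the decoder produces one score per vocabulary entry and that the candidate senses of the target word are known in advance. The vocabulary, sense labels, scores and function name are invented for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_senses(scores, vocab, candidate_senses, top_k=1):
    """Rank only the candidate senses of the target word by probability."""
    probs = softmax(scores)
    ranked = sorted(
        (sense for sense in candidate_senses if sense in vocab),
        key=lambda sense: probs[vocab[sense]],
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical output vocabulary: plain words plus sense-tagged words.
vocab = {"bank%1": 0, "bank%2": 1, "the": 2, "river": 3}
scores = [1.0, 2.5, 3.0, 0.2]   # decoder scores at the target time step
best = rank_senses(scores, vocab, candidate_senses=["bank%1", "bank%2"])
```

Restricting the ranking to the candidate senses means a high-scoring non-sense token, such as "the" in this example, cannot be selected for the target word.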
Architecture details and network parameters: For all three architectures, we use the GRU as the basic building block. The first architecture uses a single attention, while the other two use multiple attentions. We use the ‘dot’, ‘concat’ and ‘linear’ attention models to calculate the attention energies. As all the models gave their best results with the ‘dot’ attention model, the experimental results in the next section are reported only for ‘dot’ attention.
Table I shows the detailed hyper-parameter settings used during the evaluation of all three architectures. We trained our models on a GeForce GTX 1080 GPU with both the ‘Adam’ and ‘SGD’ optimizers. All the results in the next section are reported using ‘SGD’, as it gave comparatively better results. We implemented our models in PyTorch 0.3.1 under a Linux environment.
|Seq2Seq + conv (bigrams)||59.2||-||-||-||-|
|Seq2Seq + POS (point-wise multiply)||57.1||-||-||-||-|
|Seq2Seq + POS (weighting)||67.5||49.1||58.3||61.2||68.1|
|Seq2Seq + conv + POS (weighting)||73.9||58.2||70.4||71.3||85.4|
| BLSTM (att.)||70.2||71.0||58.4||75.2||83.5|
| Seq2Seq (att.)||69.6||69.5||57.2||74.5||81.8|
| Seq2Seq (att., LEX, POS)||68.5||70.1||55.2||75.1||84.4|
| UKBgloss w2w||55.4||64.9||41.4||69.5||69.7|
| IMS + adapted CW||73.4||-||-||-||-|
| IRST - kernels||72.6||-||-||-||-|
|Seq2Seq + POS (weighting)||MASC||MASC||39.9||59.7||57.5||68.8||68.1|
|Seq2Seq + conv + POS (weighting)||MASC||MASC||46.5||65.2||67.0||74.1||72.7|
IV Experimental Results
In this section, we describe our experimental results in detail in terms of F1 (%) score. We also report the results of the top-performing WSD models on the same benchmarks. We show how multiple attention can reduce the confusion of the model by allowing it to focus on the exact context. Finally, we conclude the section by showing which linguistic feature has more impact on determining the sense of a word. For extensive evaluation, we implemented Seq2Seq with word attention as a baseline model. To see the impact of the weighted combination of different linguistic features, we also implemented a Seq2Seq + POS (weighting) model, which is the same as II-C except that the bigram attention module is removed.
|Attention weights||Initial value||1000th epoch||2000th epoch||4000th epoch||5000th epoch||3200th epoch|
Table II compares the F1 scores achieved by our proposed models against some of the existing state-of-the-art models on the Senseval task. Our Seq2Seq + conv + POS (weighting) model performs best among the five models we experimented with. It outperforms the top-performing existing neural sequence model, BLSTM (att.), on the Senseval-3 task and achieves a state-of-the-art F1 score of 73.9%. Interestingly, when the POS feature is added through point-wise multiplication, performance drops (from 66.3% to 57.1%) because it causes inconsistent scaling of the different dimensions of the attention vectors. However, adding the same feature through weighted addition gives a performance boost (from 66.3% to 67.5%), as each dimension of the final attention vector is now stretched uniformly based on the two individual attention vectors. Table II also reports the F1 scores on individual POS classes for the Senseval-3 task. Our best model achieves state-of-the-art results on two of these classes, with F1 scores of 70.4% and 85.4%. The best results on the remaining two classes (71.9% and 75.9%) are achieved with an IMS framework that uses word embeddings to generate features and a support vector machine (SVM) for classification. Apart from that, it can easily be seen that the Seq2Seq architectures perform very well against statistical and knowledge-based methods such as IMS + adapted CW, Htsa3, UKBgloss w2w, Babelfy and IRST - kernels, achieving results that are superior or equivalent to the best models mentioned above. Another interesting comparison is that the existing Seq2Seq (att.) baseline in Table II scores 69.6%, while our Seq2Seq baseline scores 66.3%. When POS is added to that existing model as one of the tasks to be learned, its performance degrades from 69.6% to 68.5%. When we added POS as a feature, our performance rose from 66.3% to 67.5%, which suggests that adding POS information as a feature helps identify the polysemy of a word.
Note that the variation in baseline model performance may be due to different hyper-parameter settings or different hardware configurations.
Table III shows an extensive evaluation of our models on the SemCor and MASC corpora under various training and testing conditions, including performance on different parts of speech classes. When testing on the same domain (Train-MASC, Test-MASC and likewise for SemCor), all the models perform quite well, with maximum F1 scores of 72.7% and 76.4% for MASC and SemCor, respectively. However, when testing on a different domain (Train-MASC, Test-SemCor and vice versa), the performance of all the models decreases. This is expected, because we do not re-tune any hyper-parameters for the new domain; they remain set according to the original training environment. Table III also shows how well our best models perform on some frequent parts of speech classes. For almost all POS classes, our top-performing model is Seq2Seq + conv + POS (weighting), which beats the other models by a good margin. The variance in performance across parts of speech is mainly due to statistically significant differences between the models. The results for training and testing in different domains are nevertheless strongly correlated with those in Table II. We have not included the results for Seq2Seq + conv (bigrams) and Seq2Seq + POS (point-wise multiply), as they are comparatively weak models.
Table IV shows how the scalar weights on the different attentions change over epochs, which indicates how the model decides how much attention to pay to each linguistic feature. We start with random scalar weights on the three features, as shown in the first column of Table IV. As the model sees more data, it gives more importance to POS, with a weight of 3.58, than to the other two (0.32 for words and 3.45 for bigrams). The weight on word attention is not stable, but the weights on the other two increase monotonically until the loss becomes very small. Finally, we select the set of weights at the epoch where the model achieves its highest F1 score.
Figure 4 shows how the model distributes attention over different linguistic features. With word attention alone, the model gets confused and attends to more than one word at a given time. This makes sense for a translation model, where a word at a specific index in one language can depend on more than one word in the other language. But as we are dealing with the same language on both ends, the attention of a decoded word should fall on the same word in the encoder; in other words, it should be a one-to-one mapping. The deviation is mainly due to a lack of context, and by building a new attention matrix from the linear weighted combination of multiple attentions on different linguistic features, we compensate for this lack. This linearly combined attention matrix is finally passed through a softmax layer to turn each attention weight into a probability.
In this paper, we adopted a new approach to WSD using neural sequence models, applying multiple attentions to different linguistic features of a sentence. The single-attention approach with sequence models is very effective for machine translation; this study instead focuses on using multiple attentions and taking their linear weighted combination. By making the weights network parameters, the model can easily fit itself to a suitable combination of them. Our best model achieves a state-of-the-art result on the Senseval-3 corpus. Our in-depth analysis of the POS classes of all the corpora also gives insight into how polysemy relates to POS. The multiple-attention approach has a large impact on reducing the mistakes made by the model, as it gives the model more flexibility to choose the right combination from a number of suitable features. This approach can easily be applied to other applications of neural sequence models, such as question answering, where one can pick the most related fact through an attention over all the fact sentences, with the answer finally predicted through another attention on the question. We are currently working on this idea as future research.
-  E. Agirre and P. Edmonds, Word sense disambiguation: Algorithms and applications. Springer Science & Business Media, 2007, vol. 33.
-  R. Navigli, “Word sense disambiguation: A survey,” ACM Computing Surveys (CSUR), vol. 41, no. 2, p. 10, 2009.
-  D. Yuan, J. Richardson, R. Doherty, C. Evans, and E. Altendorf, “Semi-supervised word sense disambiguation with neural models,” arXiv preprint arXiv:1603.07012, 2016.
-  A. M. Butnaru, R. T. Ionescu, and F. Hristea, “Shotgunwsd: An unsupervised algorithm for global word sense disambiguation inspired by dna sequencing,” arXiv preprint arXiv:1707.08084, 2017.
-  R. Tripodi and M. Pelillo, “A game-theoretic approach to word sense disambiguation,” Computational Linguistics, vol. 43, no. 1, pp. 31–70, 2017.
-  B. Snyder and M. Palmer, “The english all-words task,” in Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, 2004.
-  R. Navigli, K. C. Litkowski, and O. Hargraves, “Semeval-2007 task 07: Coarse-grained english all-words task,” in Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics, 2007, pp. 30–35.
-  A. Moro and R. Navigli, “Semeval-2015 task 13: Multilingual all-words sense disambiguation and entity linking,” in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 2015, pp. 288–297.
-  Z. Zhong and H. T. Ng, “It makes sense: A wide-coverage word sense disambiguation system for free text,” in Proceedings of the ACL 2010 system demonstrations. Association for Computational Linguistics, 2010, pp. 78–83.
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
-  O. Levy and Y. Goldberg, “Neural word embedding as implicit matrix factorization,” in Advances in neural information processing systems, 2014, pp. 2177–2185.
-  N. Kalchbrenner and P. Blunsom, “Recurrent continuous translation models,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1700–1709.
-  I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
-  P. Koehn, F. J. Och, and D. Marcu, “Statistical phrase-based translation,” in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1. Association for Computational Linguistics, 2003, pp. 48–54.
-  M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv preprint arXiv:1508.04025, 2015.
-  R. Sennrich and B. Haddow, “Linguistic input features improve neural machine translation,” arXiv preprint arXiv:1606.02892, 2016.
-  A. Raganato, C. D. Bovi, and R. Navigli, “Neural sequence learning models for word sense disambiguation,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1156–1167.
-  O. Melamud, J. Goldberger, and I. Dagan, “context2vec: Learning generic context embedding with bidirectional lstm,” in Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, 2016, pp. 51–61.
-  M. Kågebäck and H. Salomonsson, “Word sense disambiguation using a bidirectional lstm,” arXiv preprint arXiv:1606.03568, 2016.
-  A. Pesaranghader, A. Pesaranghader, S. Matwin, and M. Sokolova, “One single deep bidirectional LSTM network for word sense disambiguation of text data,” in Advances in Artificial Intelligence - 31st Canadian Conference on Artificial Intelligence, Canadian AI 2018, Toronto, ON, Canada, May 8-11, 2018, Proceedings, 2018, pp. 96–107. [Online]. Available: https://doi.org/10.1007/978-3-319-89656-4_8
-  K. Taghipour and H. T. Ng, “Semi-supervised word sense disambiguation using word embeddings in general and specific domains,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 314–323.
-  C. Grozea, “Finding optimal parameter settings for high performance word sense disambiguation,” in Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, 2004.
-  C. Strapparava, A. Gliozzo, and C. Giuliano, “Pattern abstraction and term similarity for word sense disambiguation: Irst at senseval-3,” in Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, 2004.
-  Y. K. Lee, H. T. Ng, and T. K. Chia, “Supervised word sense disambiguation with support vector machines and multiple knowledge sources,” in Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, 2004.
-  I. Iacobacci, M. T. Pilehvar, and R. Navigli, “Embeddings for word sense disambiguation: An evaluation study,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2016, pp. 897–907.
-  E. Agirre, O. López de Lacalle, and A. Soroa, “Random walks for knowledge-based word sense disambiguation,” Computational Linguistics, vol. 40, no. 1, pp. 57–84, 2014.
-  A. Moro, A. Raganato, and R. Navigli, “Entity linking meets word sense disambiguation: a unified approach,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 231–244, 2014.