A Game-Theoretic Approach to Word Sense Disambiguation
Abstract
This paper presents a new model for word sense disambiguation formulated in terms of evolutionary game theory, where each word to be disambiguated is represented as a node on a graph whose edges represent word relations and senses are represented as classes. The words simultaneously update their class membership preferences according to the senses that neighboring words are likely to choose. We use distributional information to weigh the influence that each word has on the decisions of the others and semantic similarity information to measure the strength of compatibility among the choices. With this information we can formulate the word sense disambiguation problem as a constraint satisfaction problem and solve it using tools derived from game theory, maintaining the textual coherence. The model is based on two ideas: similar words should be assigned to similar classes and the meaning of a word does not depend on all the words in a text but just on some of them. The paper provides an in-depth motivation of the idea of modeling the word sense disambiguation problem in terms of game theory, which is illustrated by an example. The conclusion presents an extensive analysis of the combination of similarity measures to use in the framework and a comparison with state-of-the-art systems. The results show that our model outperforms state-of-the-art algorithms and can be applied to different tasks and in different scenarios.
Submission received: 8th June, 2015; Revised submission received: 25th January, 2016; Accepted for publication: 24th March, 2016; will appear in either the 42:4 issue (December 2016) or the 43:1 issue (March 2017)
1 Introduction
Word Sense Disambiguation (WSD) is the task of identifying the intended meaning of a word based on the context in which it appears [Navigli 2009]. It has been studied since the beginnings of Natural Language Processing (NLP) [Weaver 1955] and today it is still a central topic of this discipline. This is because it is important for many NLP tasks such as text understanding [Kilgarriff 1997], text entailment [Dagan and Glickman 2004], machine translation [Vickrey et al. 2005], opinion mining [Smrž 2006], sentiment analysis [Rentoumi et al. 2009] and information extraction [Zhong and Ng 2012]. All these applications can benefit from the disambiguation of ambiguous words as a preliminary process; otherwise they remain on the surface of the word, compromising the coherence of the data to be analyzed [Pantel and Lin 2002].
To solve this problem, over the past few years the research community has proposed several algorithms based on supervised [Zhong and Ng 2010; Tratz et al. 2007], semi-supervised [Pham, Ng, and Lee 2005; Navigli and Velardi 2005] and unsupervised [Mihalcea 2005; McCarthy et al. 2007] learning models. Nowadays, although supervised methods perform better in general domains, unsupervised and semi-supervised models are receiving increasing attention from the research community, with performances close to the state of the art of supervised systems [Ponzetto and Navigli 2010]. In particular, knowledge-based and graph-based algorithms are emerging as promising approaches to solve the problem [Agirre et al. 2009; Sinha and Mihalcea 2007]. The peculiarities of these algorithms are that they do not require any corpus evidence and use only the structural properties of a lexical database to perform the disambiguation task. In fact, unsupervised methods are able to overcome a common problem in supervised learning: the knowledge acquisition problem, which consists in the production of large-scale resources, manually annotated with word senses.
Knowledge-based approaches exploit the information from knowledge resources such as dictionaries, thesauri or ontologies and compute sense similarity scores to disambiguate words in context [Mihalcea 2006]. Graph-based approaches model the relations among words and senses in a text with graphs, representing words and senses as nodes and the relations among them as edges. From this representation the structural properties of the graph can be extracted and the most relevant concepts in the network can be computed [Navigli and Lapata 2007; Agirre et al. 2006].
Our approach falls into these two lines of research; it uses a graph structure to model the geometry of the data points (the words in a text) and a knowledge base to extract the senses of each word and to compute the similarity among them. The most important difference between our approach and state-of-the-art graph-based approaches [Moro, Raganato, and Navigli 2014; Agirre, de Lacalle, and Soroa 2014; Navigli and Lapata 2010; Sinha and Mihalcea 2007; Véronis 2004] is that in our method the graph contains only words and not senses. This graph is used to model the pairwise interaction among words and not to rank the senses in the graph according to their relative importance.
The starting point of our research is based on two fundamental assumptions:

the meaning of a sentence emerges from the interaction of the components which are involved in it;

these interactions are different and must be weighted in order to supply the right amount of information.
We interpret language as a complex adaptive system, composed of linguistic units and their interactions [Cong and Liu 2014; Larsen-Freeman and Cameron 2008]. The interactions among units give rise to the emergence of properties, which in our case, by problem definition, can be interpreted as meanings. In our model the relations between the words are weighted by a similarity measure with a distributional approach, increasing the weights among words which share a proximity relation. Weighting the interaction of the nodes in the graph is helpful in situations in which the indiscriminate use of contextual information can deceive. Furthermore, it models the idea that the meaning of a word does not depend on all the words in a text but just on some of them [Chaplot, Bhattacharyya, and Paranjape 2015].
This problem is illustrated in the sentences below:

There is a financial institution near the river bank.

They were troubled by insects while playing cricket.
In these two sentences (a complete example of the disambiguation of the first sentence is given in Section 5.3), the meaning of the words bank and cricket can be misinterpreted by a centrality algorithm that tries to find the most important node in the graph composed of all the possible senses of the words in the sentence. This is because the meanings of the words financial and institution tend to shift the meaning of the word bank toward its financial meaning and not toward its naturalistic meaning. The same behavior can be observed for the word cricket, which is shifted by the word insect toward its insect meaning and not toward its game meaning. In our work the disambiguation task is performed by exploiting proximity relations: a stronger importance is imposed on the relation between the words bank and river in the first sentence and between cricket and play in the second.
Our approach is based on the principle that the senses of the words that share a strong relation must be similar. The idea of assigning a similar class to similar objects has been implemented in a different way by Kleinberg and Tardos (2002), within a Markov random field framework. They have shown that it is beneficial in combinatorial optimization problems. In our case, this idea can preserve the textual coherence, a characteristic that is missing in many state-of-the-art systems. In particular, it is missing in systems in which the words are disambiguated independently. On the contrary, our approach disambiguates all the words in a text concurrently, using an underlying structure of interconnected links, which models the interdependence between the words. In so doing, we model the idea that the meaning of any word depends at least implicitly on the combined meaning of all the interacting words.
In our study, we model these interactions by developing a system in which it is possible to map lexical items onto concepts exploiting contextual information in a way in which collocated words influence each other simultaneously, imposing constraints in order to preserve the textual coherence. For this reason, we have decided to use a powerful tool derived from game theory: non-cooperative games (see Section 4). In our system, the nodes of the graph are interpreted as players, in the game-theoretic sense, which play a game with the other words in the graph in order to maximize their utility; constraints are defined as similarity measures among the senses of two words that are playing a game. The concept of utility has been used in different ways in the game theory literature; in general, it refers to the satisfaction that a player derives from the outcome of a game [Szabó and Fath 2007]. From our point of view, increasing the utility of a word means increasing the textual coherence, in a distributional semantics perspective [Firth 1957]. In fact, it has been shown that collocated words tend to have a determined meaning [Gale, Church, and Yarowsky 1992; Yarowsky 1993].
Game-theoretic frameworks have been used in different ways to study language use [Pietarinen 2007a; Skyrms 2010] and evolution [Nowak, Komarova, and Niyogi 2001], but to the best of our knowledge, our method is the first attempt to use them in a specific NLP task. This choice is motivated by the fact that game-theoretic models are able to perform a consistent labeling of the data [Hummel and Zucker 1983; Pelillo 1997], taking into account contextual information. These features are of great importance for an unsupervised or semi-supervised algorithm that tries to perform a WSD task because, by assumption, the sense of a word is given by the context in which it appears. Within a game-theoretic framework we are able to cast the WSD problem as a continuous optimization problem, exploiting contextual information in a dynamic way. Furthermore, no supervision is required and the system can adapt easily to different contextual domains, which is exactly what is required for a WSD algorithm.
An additional reason for the use of a consistent labeling system is that it is able to deal with semantic drifts [Curran, Murphy, and Scholz 2007]. In fact, as shown in the two sentences above, concentrating the disambiguation of a word on highly collocated words, taking into account proximity (or even syntactic) information, allows the meaning interpretation to be guided only towards senses which are strongly related to the word which has to be disambiguated.
In this article, we provide a detailed discussion of the motivation behind our approach and a full evaluation of our algorithm, comparing it with state-of-the-art systems in WSD tasks. In a previous work we used a similar algorithm in a semi-supervised scenario [Tripodi, Pelillo, and Delmonte 2015], casting the WSD task as a graph transduction problem. Now we have extended that work, making the algorithm fully unsupervised. Furthermore, in this article we provide a complete evaluation of the algorithm, extending our previous work [Tripodi and Pelillo 2015] by exploiting proximity relations among words.
An important feature of our approach is that it is versatile. In fact, the method can adapt to different scenarios and to different tasks, and it is possible to use it as an unsupervised or a semi-supervised method. The semi-supervised approach, presented in [Tripodi, Pelillo, and Delmonte 2015], is a bootstrapping graph-based method, which propagates, over the graph, the information from labeled nodes to unlabeled ones. In this article, we also provide a new semi-supervised version of the approach, which can exploit the evidence from sense-tagged corpora or the most frequent sense heuristic and does not require labeled nodes to propagate the labeling information.
We tested our approach on different datasets, from WSD and entity linking tasks, in order to find the similarity measures that perform better, and evaluated it against unsupervised, semi-supervised and supervised state-of-the-art systems. The results of this evaluation show that our method performs well and can be considered as a valid alternative to current models.
2 Related Works
There are two major paradigms in WSD: supervised and knowledge-based. Supervised algorithms learn, from sense-labeled corpora, a computational model of the words of interest. Then, the obtained model is used to classify new instances of the same words. Knowledge-based algorithms perform the disambiguation task by using an existing lexical knowledge base, which usually is structured as a semantic network. Then, these approaches use graph algorithms to disambiguate the words of interest, based on the relations that these words' senses have in the network [Pilehvar and Navigli 2014].
A popular supervised WSD system, which has shown good performances in different WSD tasks, is It Makes Sense (IMS) [Zhong and Ng 2010]. It takes as input a text and for each content word (noun, verb, adjective, or adverb) outputs a list of possible senses ranked according to the likelihood of appearing in a determined context and extracted from a knowledge base. The training data used by this system are derived from SemCor [Miller et al. 1993], DSO [Ng and Lee 1996] and collected automatically exploiting parallel corpora [Chan and Ng 2005]. Its default classifier is LIBLINEAR (http://liblinear.bwaldvogel.de) with a linear kernel and its default parameters.
Unsupervised and knowledge-based algorithms for WSD are attracting great attention from the research community. This is because supervised systems require training data, which are difficult to obtain. In fact, producing sense-tagged data is a time-consuming process, which has to be carried out separately for each language of interest. Furthermore, as investigated by Yarowsky and Florian (2002), the performances of a supervised algorithm degrade substantially as sense entropy increases. Sense entropy refers to the distribution over the possible senses of a word, as seen in training data. Additionally, a supervised system has difficulty adapting to different contexts because it depends on prior knowledge, which makes the algorithm rigid; it therefore cannot efficiently adapt to domain-specific cases, when other optimal solutions may be available [Yarowsky and Florian 2002].
One of the most common heuristics for exploiting sense-tagged data such as SemCor [Miller et al. 1993] is the most frequent sense. It exploits the overall sense distribution for each word to be disambiguated, choosing the sense with the highest probability regardless of any other information. This simple procedure is very powerful in general domains but cannot handle senses with a low frequency, which could be found in specific domains.
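The heuristic can be illustrated with a minimal sketch. The toy corpus and the sense labels below are hypothetical and are not taken from SemCor; the function simply counts how often each sense was attached to a word and returns the most frequent one, ignoring the context entirely.

```python
from collections import Counter


def most_frequent_sense(word, tagged_corpus):
    """Return the sense most often attached to `word` in a sense-tagged corpus.

    `tagged_corpus` is an iterable of (word, sense) pairs; None is returned
    when the word never occurs in it.
    """
    counts = Counter(sense for w, sense in tagged_corpus if w == word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical sense-tagged data (illustrative labels only).
toy_corpus = [
    ("bank", "bank.financial"),
    ("bank", "bank.financial"),
    ("bank", "bank.river"),
    ("cricket", "cricket.game"),
]

print(most_frequent_sense("bank", toy_corpus))
```

Because the choice ignores context, the word bank in "near the river bank" would still receive the financial sense here, which is the limitation discussed above.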
With these observations in mind, Koeling et al. (2005) created three domain-specific corpora to evaluate WSD systems. They tested whether WSD algorithms are able to adapt to different contexts, comparing their results with the most frequent sense heuristic, computed on general-domain corpora. They used an unsupervised approach to obtain the most frequent sense for a specific domain [McCarthy et al. 2007] and demonstrated that their approach outperforms the most frequent sense heuristic derived from general-domain, labeled data.
This heuristic, for the unsupervised acquisition of the predominant sense of a word, consists in collecting all the possible senses of a word and then ranking these senses. The ranking is computed according to the information derived from a distributional thesaurus automatically produced from a large corpus and a semantic similarity measure derived from the sense inventory. Although the authors have demonstrated that this approach is able to outperform the most frequent sense heuristic computed on sense-tagged data on general domains, it is not easy to use it in real-world applications, especially when the domain of the text to be disambiguated is not known in advance.
Other unsupervised and semi-supervised approaches, instead of computing the prevalent sense of a word, try to identify the actual sense of a word in a determined phrase, exploiting the information derived from its context. This is the case of traditional algorithms, which exploit the pairwise semantic similarity among a target word and the words in its context [Lesk 1986; Resnik 1995; Patwardhan, Banerjee, and Pedersen 2003]. Our work could be considered as a continuation of this tradition, which tries to identify the intended meaning of a word given its context, using a new approach for the computation of the sense combinations.
Graph-based algorithms for WSD are gaining much attention in the NLP community. This is because graph theory is a powerful tool that can be employed both for the organization of the contextual information and for the computation of the relations among word senses. It allows the structural properties of a text to be extracted. Approaches of this kind construct a graph from all the senses of the words in a text and then use connectivity measures in order to identify the most relevant word senses in the graph [Sinha and Mihalcea 2007; Navigli and Lapata 2007]. Navigli and Lapata (2007) conducted an extensive analysis of graph connectivity measures for unsupervised WSD. Their approach uses a knowledge base, such as WordNet, to collect and organize all the possible senses of the words to be disambiguated in a graph structure; it then uses the same resource to search for a path (of predefined length) between each pair of senses in the graph and, if it exists, adds all the nodes and edges on this path to the graph. The connectivity measures analyze local and global properties of the graph. Local measures, such as degree centrality and eigenvector centrality, determine the degree of relevance of a single vertex. Global properties, such as compactness, graph entropy and edge density, analyze the structure of the graph as a whole. The results of the study show that local measures outperform global measures and, in particular, degree centrality and PageRank [Page et al. 1999] (which is a variant of the eigenvector centrality measure) achieve the best results.
PageRank [Page et al. 1999] is one of the most popular algorithms for WSD; in fact, it has been implemented in different ways by the research community [Mihalcea, Tarau, and Figa 2004; Haveliwala 2002; Agirre, de Lacalle, and Soroa 2014; De Cao et al. 2010]. It uses a knowledge base to collect the senses of the words in a text and represents them as nodes of a graph. The structure of this resource is used to connect each node with its related senses in a directed graph. The main idea of this algorithm is that whenever a link from a node to another exists, a vote is produced, increasing the rank of the voted node. It works by counting the number and quality of links to a node in order to determine an estimation of how important the node is in the network. The underlying assumption is that more important nodes are likely to receive more links from other nodes [Page et al. 1999]. Exploiting this idea, the ranking of the nodes in the graph can be computed iteratively with the following equation:
$$\mathbf{v}^{(t+1)} = c\,M\,\mathbf{v}^{(t)} + (1-c)\,\mathbf{v}^{(0)} \tag{1}$$
where $M$ is the transition matrix of the graph, $\mathbf{v}$ is a vector representing a probability distribution and $c$ is the so-called damping factor, so that $1-c$ represents the chance that the process stops, restarting from a random node. At the end of the process each word is associated with the most important concept related to it. One problem of this framework is that the labeling process is not assumed to be consistent.
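As an illustration of this iterative ranking, the following minimal sketch runs the update on a toy graph. It assumes a column-stochastic transition matrix `M`, a damping factor `c = 0.85` applied to the link-following term, and a uniform restart distribution; the graph and all names are hypothetical and are not taken from the systems cited above.

```python
def pagerank(M, c=0.85, tol=1e-10, max_iter=1000):
    """Iterate v <- c * M v + (1 - c) * restart until the L1 change is below tol.

    M[i][j] is the probability of moving from node j to node i, so every
    column of M is assumed to sum to 1.
    """
    n = len(M)
    restart = [1.0 / n] * n  # assumed uniform restart distribution
    v = restart[:]
    for _ in range(max_iter):
        new_v = [
            c * sum(M[i][j] * v[j] for j in range(n)) + (1 - c) * restart[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


# Toy directed graph with edges 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
# Column j holds the out-link probabilities of node j.
M = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]

ranks = pagerank(M)
best = max(range(len(ranks)), key=lambda i: ranks[i])
print(ranks, best)
```

In this toy graph node 2 receives links from both other nodes and obtains the highest rank. In a WSD setting the nodes would be word senses, and each word would be assigned its highest-ranked sense.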
An algorithm that tries to improve centrality algorithms is SUDOKU, introduced by Manion and Sainudiin (2014). It is an iterative approach, which simultaneously constructs the graph and disambiguates the words using a centrality function. It starts by inserting the nodes corresponding to the senses of the words with low polysemy and then iteratively inserts the more ambiguous words. The advantages of this method are that the use of small graphs, at the beginning of the process, reduces the complexity of the problem and that it can be used with different centrality measures.
Recently a new model for WSD has been introduced, based on an undirected graphical model [Chaplot, Bhattacharyya, and Paranjape 2015]. It approaches the WSD problem as a maximum a posteriori query on a Markov random field [Jordan and Weiss 2002]. The graph is constructed using the content words of a sentence as nodes and connecting them with edges if they share a relation, determined using a dependency parser. The values that each node in the graphical model can take include the senses of the corresponding word. The senses are collected using a knowledge base and weighted using a probability distribution based on the frequency of the senses in the knowledge base. Furthermore, the senses between two related words are weighted using a similarity measure. The goal of this approach is to maximize the joint probability of the senses of all the words in the sentence, given the dependency structure of the sentence, the frequency of the senses and the similarity among them.
A new graph-based, semi-supervised approach, introduced to deal with multilingual WSD [Navigli and Ponzetto 2012b] and entity linking problems, is Babelfy [Moro, Raganato, and Navigli 2014]. Multilingual WSD is an important task because traditional WSD algorithms and resources are focused on the English language. Babelfy exploits the information from large multilingual knowledge bases, such as BabelNet [Navigli and Ponzetto 2012a], to perform this task. Entity linking consists in disambiguating the named entities in a text and in finding the appropriate resources in an ontology, which correspond to the specific entities mentioned in a text. Babelfy creates the semantic signature of each word to be disambiguated, which consists in collecting, from a semantic network, all the nodes related to a particular concept, exploiting the global structure of the network. This process leads to the construction of a graph-based representation of the whole text. Then, it applies Random Walk with Restart [Tong, Faloutsos, and Pan 2006] to find the most important nodes in the network, solving the WSD problem.
Approaches which are more similar to ours in the formulation of the problem have been described by Araujo (2007). The author reviewed the literature devoted to the application of different evolutionary algorithms to several aspects of NLP: syntactical analysis, grammar induction, machine translation, text summarization, semantic analysis, document clustering and classification. Basically, these approaches are search and optimization methods inspired by biological evolution principles. A specific evolutionary approach for WSD has been introduced by Menai (2014). It uses genetic algorithms [Holland 1975] and memetic algorithms [Moscato 1989] in order to improve the performances of a gloss-based method. It assumes that there is a population of individuals, represented by all the senses of the words to be disambiguated, and that there is a selection process, which selects the best candidates in the population. The selection process is defined as a sense similarity function, which gives a higher score to candidates with specific features, increasing their fitness to the detriment of the other population members. This process is repeated until the fitness level of the population regularizes and, at the end, the candidates with higher fitness are selected as solutions of the problem. Another approach, which addresses the disambiguation problem in terms of space search, is GETALP [Schwab et al. 2013]. It uses an Ant Colony algorithm to find the best path in the weighted graph constructed by measuring the similarity of all the senses in a text, and assigns to each word to be disambiguated the sense corresponding to the node in this path.
These methods are similar to our study in the formulation of the problem; the main difference is that our approach is defined in terms of evolutionary game theory. As shown in the next section, this approach ensures that the final labeling of the data is consistent and that a solution of the problem is always found. In fact, our system always converges to the Nash equilibrium nearest to the point from which the dynamics have been started.
3 Word Sense Disambiguation as a Consistent Labeling Problem
WSD can be interpreted as a sense-labeling task [Navigli 2009], which consists in assigning a sense label to a target word. Since it is a labeling problem, we need an algorithm that performs this task in a consistent way, taking into account the context in which the target word occurs. Following this observation we can formulate the WSD task as a constraint satisfaction problem (CSP) [Tsang 1995] in which the labeling process has to satisfy some constraints in order to be consistent. This approach gives us the possibility not only to exploit the contextual information of a word but also to find the most appropriate sense association for the target word and the words in its context. This is the most important contribution of our work, which distinguishes it from existing WSD algorithms. In fact, in some cases using only contextual information without the imposition of constraints can lead to inconsistencies in the assignment of senses to related words.
As an illustrative example we can consider a binary CSP, which is defined by a set of variables representing the elements of the problem and a set of binary constraints representing the relationships among variables. The problem is considered solved if there is a solution which satisfies all the constraints. This setting can be described in a formal manner as a triple $(V, D, R)$, where $V = \{v_1, \dots, v_n\}$ is the set of variables; $D = \{D_{v_1}, \dots, D_{v_n}\}$ is the set of domains for each variable, each $D_{v_i}$ denoting a finite set of possible values for variable $v_i$; and $R = \{R_{ij} \mid R_{ij} \subseteq D_{v_i} \times D_{v_j}\}$ is a set of binary constraints, where $R_{ij}$ describes a set of compatible pairs of values for the variables $v_i$ and $v_j$. $R$ can be defined as a binary matrix of size $p \times q$, where $p$ and $q$ are the cardinalities of domains and variables respectively. Each element of the binary matrix, $R_{ij}(\lambda, \lambda')$, indicates if the assignment $v_i = \lambda$ is compatible with the assignment $v_j = \lambda'$. $R$ is used to impose constraints on the labeling so that each label assignment is consistent.
The binary case described above assumes that the constraints are completely violated or completely respected, which is restrictive; it is more appropriate, in many real-world cases, to have a weight which expresses the level of confidence about a particular assignment [Hummel and Zucker 1983]. This notion of consistency has been shown to be related to the Nash equilibrium concept in game theory [Miller and Zucker 1991]. We have adopted this method to approach the WSD task in order to perform a consistent labeling of the data. In our case, we can consider variables as words, labels as word senses and compatibility coefficients as similarity values among two word senses. To explain how the Nash equilibria are computed we need to introduce basic notions of game theory in the following sections.
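The binary case can be made concrete with a small sketch. The variables, sense labels and compatible pairs below are hypothetical toy values chosen for illustration; the code enumerates all assignments by brute force and keeps those that satisfy every binary constraint, which is only feasible for very small problems.

```python
from itertools import product

# Hypothetical variables (words) and their domains (candidate senses).
domains = {
    "bank": ["financial_institution", "sloping_land"],
    "river": ["stream"],
}

# For each constrained pair of variables, the set of compatible value pairs.
constraints = {
    ("bank", "river"): {("sloping_land", "stream")},
}


def is_consistent(assignment, constraints):
    """True if every constrained pair of variables takes a compatible pair of values."""
    return all(
        (assignment[a], assignment[b]) in allowed
        for (a, b), allowed in constraints.items()
    )


def solve(domains, constraints):
    """Enumerate every complete assignment and yield the consistent ones."""
    names = list(domains)
    for values in product(*(domains[name] for name in names)):
        assignment = dict(zip(names, values))
        if is_consistent(assignment, constraints):
            yield assignment


solutions = list(solve(domains, constraints))
print(solutions)
```

Replacing the sets of compatible pairs with real-valued similarity scores gives the weighted setting described above, where an assignment is no longer simply accepted or rejected but receives a degree of support.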
4 Game Theory
In this section, we briefly introduce the basic concepts of classical game theory and evolutionary game theory that we used in our framework; for a more detailed analysis of these topics the reader is referred to [Weibull 1997; Leyton-Brown and Shoham 2008; Sandholm 2010].
4.1 Classical Game Theory
Game theory provides predictive power in interactive decision situations. It has been introduced by Von Neumann and Morgenstern (1944) in order to develop a mathematical framework able to model the essentials of decision making in interactive situations. In its normal-form representation (which is the one we use in this article) it consists of a finite set of players $N = \{1, \dots, n\}$, a set of pure strategies $S_i = \{s_1, \dots, s_m\}$ for each player $i$, and a utility function $u_i : S_1 \times \dots \times S_n \to \mathbb{R}$, which associates strategies to payoffs. Each player can adopt a strategy in order to play a game and the utility function depends on the combination of strategies played at the same time by the players involved in the game, not just on the strategy chosen by a single player. An important assumption in game theory is that the players are rational and try to maximize the value of $u_i$; furthermore, in non-cooperative games the players choose their strategies independently, considering what the other players can play, and try to find the best strategy profile to employ in a game.
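A strategy $s_i^*$ is said to be dominant if and only if:
$$u_i(s_i^*, s_{-i}) > u_i(s_i, s_{-i}) \quad \forall s_i \neq s_i^*,\ \forall s_{-i} \in S_{-i} \tag{2}$$
where $S_{-i}$ represents all strategy sets other than player $i$'s.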
As an example, we can consider the famous Prisoner's Dilemma, whose payoff matrix is shown in Table 1. Each cell of the matrix represents a strategy profile, where the first number represents the payoff of Player 1 ($P_1$) and the second is the payoff of Player 2 ($P_2$), when both players employ the strategy associated with a specific cell. $P_1$ is called the row player because it selects its strategy according to the rows of the payoff matrix; $P_2$ is called the column player because it selects its strategy according to the columns of the payoff matrix. In this game the strategy confess is a dominant strategy for both players and this strategy combination is the Nash equilibrium of the game.
Nash equilibria represent the key concept of game theory and can be defined as those strategy profiles in which each strategy is a best response to the strategy of the co-player and no player has the incentive to unilaterally deviate from his decision, because there is no way to do better.
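These two definitions can be checked mechanically on the Prisoner's Dilemma. The sketch below assumes the usual formulation in which payoffs are negative utilities (for example, years in prison), with values -5, 0, -6 and -1; the helper names are hypothetical. It tests whether a strategy is strictly dominant and enumerates the pure-strategy Nash equilibria by checking the best-response condition for both players.

```python
STRATEGIES = ["confess", "don't confess"]

# PAYOFFS[(row strategy, column strategy)] = (payoff of Player 1, payoff of Player 2)
PAYOFFS = {
    ("confess", "confess"): (-5, -5),
    ("confess", "don't confess"): (0, -6),
    ("don't confess", "confess"): (-6, 0),
    ("don't confess", "don't confess"): (-1, -1),
}


def payoff(player, own, other):
    """Payoff of `player` (0 = row, 1 = column) playing `own` against `other`."""
    profile = (own, other) if player == 0 else (other, own)
    return PAYOFFS[profile][player]


def is_dominant(player, strategy):
    """True if `strategy` is strictly better than every alternative, whatever the opponent plays."""
    return all(
        payoff(player, strategy, other) > payoff(player, alt, other)
        for other in STRATEGIES
        for alt in STRATEGIES
        if alt != strategy
    )


def pure_nash_equilibria():
    """Profiles in which each strategy is a best response to the other."""
    return [
        (r, c)
        for r in STRATEGIES
        for c in STRATEGIES
        if all(payoff(0, r, c) >= payoff(0, alt, c) for alt in STRATEGIES)
        and all(payoff(1, c, r) >= payoff(1, alt, r) for alt in STRATEGIES)
    ]


print(is_dominant(0, "confess"), is_dominant(1, "confess"))
print(pure_nash_equilibria())
```

Under these assumed payoffs, confess is dominant for both players and the profile in which both confess is the only pure-strategy equilibrium, even though both players would be better off if neither confessed.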
In many games, the players can also play mixed strategies, which are probability distributions over their pure strategies. Within this setting, the players choose a strategy with a certain preassigned probability. A mixed strategy can be defined as a vector $x = (x_1, \dots, x_m)$, where $m$ is the number of pure strategies and each component $x_h$ denotes the probability that the player chooses its $h$th pure strategy. For each player its strategy set is defined as a standard simplex:
$$\Delta = \left\{ x \in \mathbb{R}^m : \sum_{h=1}^{m} x_h = 1,\ x_h \geq 0 \text{ for all } h \right\} \tag{3}$$
Each mixed strategy corresponds to a point on the simplex and its corners correspond to pure strategies.
In a two-player game we can define a strategy profile as a pair $(p, q)$ where $p \in \Delta_i$ and $q \in \Delta_j$. The expected payoff for this strategy profile is computed as follows: $u_i(p, q) = p \cdot A_i q$ and $u_j(p, q) = q \cdot A_j p$, where $A_i$ and $A_j$ are the payoff matrices of player $i$ and player $j$, respectively. The Nash equilibrium is computed in mixed strategies in the same way as in pure strategies. It is represented by a pair of strategies such that each is a best response to the other. The only difference is that, in this setting, the strategies are probabilities and must be computed considering the payoff matrix of each player.
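The expected payoff of a mixed strategy profile can be computed directly from this bilinear form. The sketch below is illustrative only: it assumes that each payoff matrix is indexed by its owner's strategies on the rows and the opponent's strategies on the columns, reuses Prisoner's Dilemma payoffs expressed as negative utilities, and uses hypothetical helper names.

```python
def is_on_simplex(x, tol=1e-9):
    """True if x is a probability distribution (non-negative entries summing to 1)."""
    return all(value >= 0 for value in x) and abs(sum(x) - 1.0) < tol


def expected_payoff(own, A, other):
    """Bilinear form own . (A other): expected payoff of the owner of matrix A."""
    return sum(
        own[h] * sum(A[h][k] * other[k] for k in range(len(other)))
        for h in range(len(own))
    )


# Assumed payoff matrices; rows index the owner's strategies
# (confess, don't confess) and columns index the opponent's.
A_i = [[-5, 0], [-6, -1]]
A_j = [[-5, 0], [-6, -1]]  # the game is symmetric

p = [0.5, 0.5]  # player i mixes uniformly
q = [1.0, 0.0]  # player j always confesses

assert is_on_simplex(p) and is_on_simplex(q)
u_i = expected_payoff(p, A_i, q)
u_j = expected_payoff(q, A_j, p)
print(u_i, u_j)
```

With these values player i obtains -5.5, which is worse than the -5 it would get by always confessing against a confessing opponent, so the uniform mix is not a best response to q.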
A game-theoretic framework can be considered as a solid tool in decision making situations since a fundamental theorem by Nash (1951) states that any normal-form game has at least one mixed Nash equilibrium, which can be employed as the solution of the decision problem.
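Table 1: Payoff matrix of the Prisoner's Dilemma (rows: Player 1; columns: Player 2).
|               | confess | don't confess |
| confess       | -5, -5  | 0, -6         |
| don't confess | -6, 0   | -1, -1        |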
4.2 Evolutionary Game Theory
Evolutionary game theory was introduced by Smith and Price (1973) to overcome some limitations of traditional game theory, such as the hyper-rationality imposed on the players. In fact, in real-life situations the players choose a strategy according to heuristics or social norms [\citenameSzabó and Fath2007]. It was introduced in biology to explain the evolution of species. In this context, strategies correspond to phenotypes (traits or behaviors) and payoffs correspond to offspring, allowing players with a high actual payoff (obtained thanks to their phenotype) to become more prevalent in the population. This formulation explains natural selection choices among alternative phenotypes based on their utility function. This aspect can be linked to rational choice theory, in which players make a choice that maximizes their utility, balancing costs against benefits [\citenameOkasha and Binmore2012].
This intuition introduces an inductive learning process, in which a population of agents plays games repeatedly with their neighbors. At each iteration, the players update their beliefs on the state of the game and choose their strategy according to what has and has not been effective in previous games. The strategy space of each player $i$ is defined as a mixed strategy profile $x_i$, as defined in the previous section, which lives in the mixed strategy space of the game, given by the Cartesian product:
$$\Theta = \times_{i \in I} \Delta_i \tag{4}$$
The expected payoff of a pure strategy $e^h$ in a single game is calculated as in mixed strategies. The difference in evolutionary game theory is that a player can play the game with all other players, obtaining a final payoff that is the sum of all the partial payoffs obtained during the single games. The payoff relative to a single strategy is $u_i(e_i^h) = \sum_{j=1}^{n} (A_{ij} x_j)_h$ and the average payoff is $u_i(x) = \sum_{j=1}^{n} x_i^T A_{ij} x_j$, where $n$ is the number of players with whom the games are played and $A_{ij}$ is the payoff matrix between players $i$ and $j$. Another important characteristic of evolutionary game theory is that the games are played repeatedly. In fact, at each iteration a player can update its strategy space according to the payoffs gained during the games, allocating more probability to the strategies with high payoff until an equilibrium is reached. In order to find those states that correspond to the Nash equilibria of the games, the replicator dynamic equation is used [\citenameTaylor and Jonker1978]:
$$\dot{x}^h = \big[ u(e^h) - u(x) \big] \cdot x^h \quad \forall h \in x \tag{5}$$
which allows better than average strategies (best replies) to grow at each iteration.
The following theorem states that with equation 5 it is always possible to find the Nash equilibria of the games (see [\citenameWeibull1997] for the proof).
Theorem. A point $x \in \Theta$ is the limit of a trajectory of equation 5 starting from the interior of $\Theta$ if and only if $x$ is a Nash equilibrium. Further, if point $x \in \Theta$ is a strict Nash equilibrium, then it is asymptotically stable, additionally implying that the trajectories starting from all nearby states converge to $x$.
As in [\citenameErdem and Pelillo2012], we used the discrete-time version of the replicator dynamic equation for the experiments of this paper:
$$x^h(t+1) = x^h(t) \, \frac{u(e^h)}{u(x)} \quad \forall h \in x(t) \tag{6}$$
where, at each time step $t$, the players update their strategies according to the strategic environment, until the system converges and the Nash equilibria are met. In classical evolutionary game theory these dynamics describe a stochastic evolutionary process in which the agents adapt their behaviors to the environment.
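The discrete-time update can be sketched in Python for a single population. The sketch makes several assumptions that are not in the text: the payoff matrix `A` is a hypothetical coordination game with non-negative entries, the helper names are invented for illustration, and convergence is detected from the change in the strategy vector rather than from the payoffs.

```python
def replicator_step(x, A):
    """One discrete-time replicator update: x_h <- x_h * u(e^h) / u(x)."""
    payoffs = [sum(a * xk for a, xk in zip(row, x)) for row in A]  # u(e^h)
    average = sum(xh * uh for xh, uh in zip(x, payoffs))           # u(x)
    if average == 0:
        return list(x)  # no payoff information: keep the strategy unchanged
    return [xh * uh / average for xh, uh in zip(x, payoffs)]

def run_dynamics(x, A, tol=1e-9, max_iter=10_000):
    """Iterate the update until the strategy vector stops changing."""
    for _ in range(max_iter):
        new_x = replicator_step(x, A)
        if max(abs(a - b) for a, b in zip(new_x, x)) < tol:
            return new_x
        x = new_x
    return x

# Hypothetical non-negative payoff matrix (illustrative values only).
A = [[2.0, 0.0],
     [0.0, 1.0]]
x = run_dynamics([0.5, 0.5], A)
print(x)
```

Starting from the uniform distribution, the strategy with the higher payoff gains probability at every step, while the vector stays on the simplex because each update is normalized by the average payoff.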
For example, if we analyze the Prisoner’s Dilemma within the evolutionary game theory framework, we can see that the cooperative strategy (do not confess) tends to emerge as an equilibrium of the game, and this is the best situation for both players, because this strategy gives a higher payoff than the defect strategy (confess), which is the equilibrium in the classical game theory framework. In fact, if the players play the game shown in Table 1 repeatedly and randomize their decisions in each game, assigning at the beginning a normal distribution to their strategies, their payoffs can be computed as follows:
where $T$ is the transpose operator, required for $P_2$, which chooses its strategies according to the columns of the matrix in Table 1. This operation makes the payoff matrices of the two players identical, and for this reason the distinction between the two players is not required in this case, since they get the same payoffs. Now we can compute the strategy space of a player at the next time step according to equation (5):

:

:
The game is played with the new strategy spaces until the system converges, that is, when the difference between the payoffs at two consecutive time steps is below a small threshold. In Figure 1 we can see how the cooperate strategy increases over time, reaching a stationary point, which corresponds to the equilibrium of the game.
Pietarinen \shortcitepietarinen2007invitation distinguishes two levels in
The use of game theory as a tool to explain the origin of language …
The use of game theory as a tool to study communication systems relies on the
In our work we interpret
5 WSD Games
In this section we describe how the WSD games are formulated. We assume that each player $i \in I$ that participates in the games is a particular word in a text and that each strategy is a particular word sense. The players can choose a strategy among the set of strategies $S_i = \{1, \ldots, c\}$, each expressing a certain hypothesis about its membership in a class, with $c$ being the total number of classes available. We consider $x_i$ as the mixed strategy for player $i$, as described in Section 4. The games are played between two similar words, $i$ and $j$, imposing only pairwise interaction between them. The payoff matrix $Z_{ij}$ of a single game is defined as a sense similarity matrix between the senses of word $i$ and word $j$. The payoff function for each word is additively separable and is computed as described in Section 4.2.
Formulating the problem in this way, we can apply equation (6) to compute the equilibrium state of the system, which corresponds to a consistent labeling of the data. In fact, once stability is reached, all players play the strategy with the highest payoff. Each player arrives at this state by considering not only its own strategies but also the strategies that its co-players are playing. For each player $i$, the strategy with the highest probability when the system converges is chosen:
$$\phi_i = \arg\max_{h = 1, \ldots, c} x_{ih} \tag{7}$$
In our framework a word is left undisambiguated only if it is not able to update its strategy space. This can happen when the player’s strategy space is initialized with a uniform distribution and either its payoff matrices have only zero entries, that is, when its senses are not similar to the senses of the co-players, or it is not connected with other nodes in the graph. The former case depends on the semantic measures used to calculate the payoffs (see Section 5.2.2); experimentally, we noticed that it does not happen frequently. The latter case can arise when a word is not present in a given corpus. It can be avoided by using query expansion techniques or by connecting the disconnected node with nodes in its neighborhood, exploiting proximity relations (see Section 5.1.1). With equation 7 it is guaranteed that at the end of the process each word is mapped to exactly one sense. Experimentally, we noticed that when a word is able to update its strategy space, it is not the case that two strategies in it have the same probability.
5.1 Implementation of the WSD Games
In order to run our algorithm we need the network that models the interactions among the players, the strategy space of the game and the payoff matrices. To model the data required by our framework, for each text to be disambiguated we:

1. extract from the text the list of words $I$ that have an entry in a lexical database,
2. compute, from $I$, the word similarity matrix $W$, which stores the pairwise similarities among the words and represents the players’ interactions,
3. increase the weights between two words that share a proximity relation,
4. extract, from $I$, the list $C$ of all the possible senses, which represents the strategy space of the system,
5. assign, for each word in $I$, a probability distribution over the senses in $C$, creating for each player a probability distribution over the possible strategies,
6. compute the sense similarity matrix $Z$ among each pair of senses in $C$, which is then used to compute the partial payoff matrices of each game,
7. apply the replicator dynamics equation in order to compute the Nash equilibria of the games, and
8. assign to each word $i \in I$ a strategy $s \in C$.
These steps are described in the following section. In Section 5.1.1 we describe the graph construction procedure, which we employed in order to model the geometry of the data. In Section 5.1.2 we explain how we implement the strategy space of the game, which allows each player to choose over a predetermined number of strategies. In Section 5.1.3 we describe how we compute the sense similarity matrix and how it is used to create the partial payoff matrices of the games. Finally in Section 5.1.4 we describe the system dynamics.
5.1.1 Graph Construction
In our study, we modeled the geometry of the data as a graph. The nodes of the graph correspond to the words of a text that have an entry in a lexical database. We denote the words by $I = \{i_j\}_{j=1}^{n}$, where $i_j$ is the $j$-th word and $n$ is the total number of words retrieved. From $I$ we construct a similarity matrix $W$ where each element $w_{ij}$ is the similarity value assigned by a similarity function to the words $i$ and $j$. $W$ is a useful tool for graph-based algorithms, since it can be treated as the weighted adjacency matrix of a weighted graph.
A crucial factor for the graph construction is the choice of the similarity measure used to weight the edges of the graph. In our experiments, we used similarity measures that compute the strength of co-occurrence between any two words $i$ and $j$:
$$w_{ij} = \mathrm{sim}(i, j) \quad \forall i, j \in I : i \neq j \tag{8}$$
This choice is motivated by the fact that collocated words tend to have determined meanings [\citenameGale, Church, and Yarowsky1992, \citenameYarowsky1993], and also because these similarities can be computed easily. In fact, only a corpus is required in order to compute a vast range of similarity measures. Furthermore, large corpora such as the BNC corpus [\citenameLeech1992] and the Google Web 1T corpus [\citenameBrants and Franz2006] are freely available and extensively used by the research community.
In some cases, some target words may not be present in the reference corpus, due to different text segmentation techniques or spelling differences. In this case we use query expansion techniques in order to find an appropriate substitute [\citenameCarpineto and Romano2012]. Specifically, we use WordNet to find alternative lexicalizations of a lemma, choosing the one that co-occurs most frequently with the words in its context.
The information obtained from an association measure can be enriched by taking into account the proximity of the words in the text (or the syntactic structure of the sentence). The first task can be achieved by augmenting the similarities among a target word and the $n$ words that appear on its right and on its left, where $n$ is a parameter that with small values can capture fixed expressions and with large values can detect semantic concepts [\citenameFkih and Omri2012]. The second task can be achieved using a dependency parser to obtain the syntactic relations among the words in the target sentence, but this approach is not used in this paper. In this way, the system is able to exploit local and global cues, mixing together the one sense per discourse [\citenameKelly and Stone1975] and the one sense per collocation [\citenameYarowsky1993] hypotheses.
We are not interested in all the relations in the sentence but we focus only on relations among target words. The use of a dependency/proximity structure makes the graph reflect the structure of the sentence while the use of a distributional approach allows us to exploit the relations of semantically correlated words. This is particularly useful when the proximity information is poor; for example when it connects words to auxiliary or modal verbs. Furthermore, these operations ensure that there are no disconnected nodes in the graph.
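The proximity augmentation can be sketched as follows. The sketch rests on assumptions that the text does not fix: the rows and columns of the matrix follow the order of the target words in the text, the boost equals the mean of the off-diagonal weights, and the function name and example weights are hypothetical.

```python
def add_proximity(W, n=1):
    """Return a copy of W in which words at most n positions apart get a boost."""
    size = len(W)
    off_diag = [W[i][j] for i in range(size) for j in range(size) if i != j]
    # Assumption: the boost is the mean off-diagonal weight of the graph.
    boost = sum(off_diag) / len(off_diag) if off_diag else 0.0
    out = [row[:] for row in W]
    for i in range(size):
        for j in range(size):
            if i != j and abs(i - j) <= n:
                out[i][j] += boost
    return out

# Hypothetical co-occurrence weights for three consecutive target words.
W = [[0.0, 0.2, 0.0],
     [0.2, 0.0, 0.4],
     [0.0, 0.4, 0.0]]
W_aug = add_proximity(W, n=1)
print(W_aug)
```

With a window of 1 only adjacent words are reinforced; a larger window also connects words that never co-occur in the corpus, which is one way to avoid disconnected nodes.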
5.1.2 Strategy Space Implementation
The strategy space of the game is created using a knowledge base to collect the sense inventories $M_i = \{1, \ldots, m_i\}$ of each word in a text, where $m_i$ is the number of senses associated with word $i$. Then the list $C = (1, \ldots, c)$ of all the unique concepts in the sense inventories is created, which corresponds to the strategy space of the game.
With this information we can define the strategy space of the game in matrix form as:
$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1c} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nc} \end{pmatrix}$$
where each row corresponds to the mixed strategy space of a player and each column corresponds to a specific sense. Each component $x_{ih}$ denotes the probability that player $i$ chooses to play its $h$-th pure strategy among all the strategies in its strategy profile, as described in Section 4. The initialization of each mixed strategy space can either be uniform or take into account information from sense-labeled corpora.
5.1.3 The Payoff Matrices
We encoded the payoff matrix of a WSD game as a sense similarity matrix among all the senses in the strategy spaces of the game. In this way, the higher the similarity among the senses of two words, the higher the incentive for a word to choose that sense and play the strategy associated with it.
The sense similarity matrix $Z$ is defined in equation (9).
$$z_{ij} = \mathrm{ssim}(s_i, s_j) \quad \forall i, j \in C : i \neq j \tag{9}$$
This similarity matrix can be obtained using the information derived from the same knowledge base used to construct the strategy space of the game. It is used to extract the partial payoff matrix $Z_{ij}$ for each single game played between two players $i$ and $j$. This operation is done by extracting from $Z$ the entries relative to the indices of the senses in the sense inventories $M_i$ and $M_j$. It produces an $m_i \times m_j$ payoff matrix, where $m_i$ and $m_j$ are the numbers of senses in $M_i$ and $M_j$, respectively.
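The extraction of a partial payoff matrix amounts to selecting rows and columns of the global sense similarity matrix. The following sketch assumes that each sense inventory is stored as a list of integer indices into that matrix; the function name and the numeric values are hypothetical.

```python
def partial_payoff(Z, senses_i, senses_j):
    """Rows follow the senses of word i, columns the senses of word j."""
    return [[Z[a][b] for b in senses_j] for a in senses_i]

# Hypothetical 4 x 4 sense similarity matrix (illustrative values only).
Z = [[0.0, 0.3, 0.1, 0.7],
     [0.3, 0.0, 0.5, 0.2],
     [0.1, 0.5, 0.0, 0.4],
     [0.7, 0.2, 0.4, 0.0]]

senses_i = [0, 1]      # indices of the senses of word i
senses_j = [2, 3]      # indices of the senses of word j
Z_ij = partial_payoff(Z, senses_i, senses_j)
print(Z_ij)
```

Because only index lists are stored per word, the global matrix can be computed once and reused for every pair of players.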
5.1.4 System Dynamics
Now that we have the topology of the data $W$, the strategy space of the game $X$ and the payoff matrix $Z$, we can compute the Nash equilibria of the game according to equation (6). In each iteration of the system each player plays a game with its neighbors $N_i$ according to the co-occurrence graph $W$. The payoff of the $h$-th strategy is calculated as:
$$u_i(e^h) = \sum_{j \in N_i} (w_{ij} Z_{ij} x_j)_h \tag{10}$$
and the player’s payoff as:
$$u_i(x) = \sum_{j \in N_i} x_i^T (w_{ij} Z_{ij} x_j) \tag{11}$$
In this way we can weight the influence that each word has on the choices that a particular word has to make about its meaning. We assume that the payoff of word $i$ depends on the similarity that it has with word $j$, $w_{ij}$, the similarities among its senses and those of word $j$, $Z_{ij}$, and the sense preference of word $j$, $x_j$. During each phase of the dynamics a process of selection allows strategies with higher payoff to emerge, and at the end of the process each player chooses its sense according to these constraints.
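A compact sketch of the whole game may help to see how the word similarities, the sense similarities and the sense preferences interact. It is an illustration under stated assumptions, not the authors' implementation: each player stores a distribution over its own sense inventory only, strategies start uniform, the number of steps is fixed, and the two-word example with its similarity values is hypothetical.

```python
def wsd_step(X, W, Z, inventories):
    """One simultaneous replicator update for every player (word)."""
    new_X = []
    for i, x_i in enumerate(X):
        payoff = [0.0] * len(x_i)  # payoff of each sense, as in equation (10)
        for j, x_j in enumerate(X):
            if i == j or W[i][j] == 0.0:
                continue  # only neighbors in the word graph contribute
            for a, s in enumerate(inventories[i]):
                partial = sum(Z[s][t] * x_j[b]
                              for b, t in enumerate(inventories[j]))
                payoff[a] += W[i][j] * partial
        average = sum(p * u for p, u in zip(x_i, payoff))  # as in equation (11)
        if average > 0.0:
            new_X.append([p * u / average for p, u in zip(x_i, payoff)])
        else:
            new_X.append(list(x_i))  # the player cannot update its strategy
    return new_X

def disambiguate(W, Z, inventories, steps=50):
    """Run the dynamics from uniform strategies and pick the most probable sense."""
    X = [[1.0 / len(inv)] * len(inv) for inv in inventories]
    for _ in range(steps):
        X = wsd_step(X, W, Z, inventories)
    return [inv[max(range(len(inv)), key=x.__getitem__)]
            for inv, x in zip(inventories, X)]

# Hypothetical example: word 0 has one sense (0), word 1 has two senses (1, 2).
inventories = [[0], [1, 2]]
W = [[0.0, 1.0],
     [1.0, 0.0]]
Z = [[0.0, 0.1, 0.8],
     [0.1, 0.0, 0.0],
     [0.8, 0.0, 0.0]]
chosen = disambiguate(W, Z, inventories)
print(chosen)
```

In this toy setting the second word moves its probability mass toward the sense that is more similar to the only sense of its neighbor, which is the behavior described above for compatible senses.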
The complexity of each step of the replicator dynamics is quadratic, but there are different dynamics that can be used with our framework to solve the problem more efficiently, such as the recently introduced infection and immunization dynamics [\citenameRota Buló, Pelillo, and Bomze2011], which have a linear time/space complexity per step and are known to be much faster than, and as accurate as, the replicator dynamics.
5.2 Implementation Details
In this section we describe the association measures used to weight the graph (Section 5.2.1), the semantic and relatedness measures used to compare the synsets (Section 5.2.2), the computation of the payoff matrices of the games (Section 5.2.3) and the different implementations of the system strategy space (Section 5.2.4), in the case of unsupervised, semi-supervised and coarse-grained WSD.
5.2.1 Association Measures
We evaluated our algorithm with different similarity measures in order to find the measure that performs best. The results of this evaluation are presented in Section 6.2.1. Specifically, for our experiments we used eight different measures: the Dice coefficient (dice) [\citenameDice1945], the modified Dice coefficient (mDice) [\citenameKitamura and Matsumoto1996], the pointwise mutual information (pmi) [\citenameChurch and Hanks1990], the t-score measure (tscore) [\citenameChurch and Hanks1990], the z-score measure (zscore) [\citenameBurrows2002], the odds ratio (oddsr) [\citenameBlaheta and Johnson2001], the chi-squared test (chis) [\citenameRao2002] and the chi-squared correct (chisc) [\citenameDeGroot et al.1986].
The measures that we used are presented in Figure 3 where the notation refers to the standard contingency tables [\citenameEvert2008] used to display the observed and expected frequency distribution of the variables, respectively on the left and on the right of Figure 2. All the measures for the experiments in this article have been calculated using the BNC corpus [\citenameLeech1992], since it is a well balanced general domain corpus.
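Two of these measures can be sketched directly from a contingency table. The sketch uses the usual textbook definitions, which may differ in detail from the variants shown in Figure 3; the argument names (`o11` for the joint frequency, `r1` and `c1` for the marginal frequencies of the two words, `n` for the sample size) and the counts are assumptions made for illustration.

```python
import math

def dice(o11, r1, c1):
    """Dice coefficient: 2 * joint frequency / sum of the marginal frequencies."""
    return 2.0 * o11 / (r1 + c1)

def pmi(o11, r1, c1, n):
    """Pointwise mutual information: log2 of observed over expected frequency."""
    if o11 == 0:
        return float("-inf")  # the pair never co-occurs
    e11 = r1 * c1 / n         # expected joint frequency under independence
    return math.log2(o11 / e11)

# Hypothetical counts: the pair co-occurs 8 times, the words occur 20 and 30
# times, and the sample contains 600 co-occurrence positions.
print(dice(8, 20, 30))
print(pmi(8, 20, 30, 600))
```

Dice is bounded between 0 and 1, whereas pointwise mutual information is unbounded and tends to favor rare pairs, which is one reason for comparing several measures experimentally.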
5.2.2 Semantic and Relatedness Measures
We used WordNet [\citenameMiller1995] and BabelNet [\citenameNavigli and Ponzetto2012a] as knowledge bases to collect the sense inventories of each word to be disambiguated.
Semantic and Relatedness Measures Calculated with WordNet. WordNet [\citenameMiller1995] is a lexical database where the lexicon is organized according to a psycholinguistic theory of the human lexical memory, in which the vocabulary is organized conceptually rather than alphabetically, giving prominence to word meanings rather than to lexical forms. The database is divided into five parts: nouns, verbs, adjectives, adverbs and functional words. In each part the lexical forms are mapped to the senses related to them; in this way it is possible to cluster words that share a particular meaning (synonyms) and to create the basic component of the resource: the synset. Each synset is connected in a network to other synsets that have a semantic relation with it.
The relations in WordNet are: hyponymy, hypernymy, antonymy, meronymy and holonymy. Hyponymy gives the relations from more general concepts to more specific; hypernymy gives the relations from particular concepts to more general; antonymy relates two concepts, which have an opposite meaning; meronymy connects the concept that is part of a given concept with it; and holonymy relates a concept with its constituents. Furthermore, each synset is associated to a definition and gives the morphological relations of the word forms related to it. Given the popularity of the resource many parallel projects have been developed. One of them is eXtended WordNet [\citenameMihalcea and Moldovan2001], which gives a parsed version of the glosses together with their logical form and the disambiguation of the term in it.
We have used this resource to compute similarity and relatedness measures in order to construct the payoff matrices of the games. The computation of the sense similarity measures is generally conducted using relations of likeness such as the is-a relation in a taxonomy; on the other hand, the relatedness measures are more general and take into account a wider range of relations, such as is-a-part-of or is-the-opposite-of.
The semantic similarity measures that we used are the wup similarity [\citenameWu and Palmer1994] and the jcn measure [\citenameJiang and Conrath1997]. These measures are based on the structural organization of WordNet and compute the similarity between two senses $s_i$, $s_j$ according to the depth of the two senses in the lexical database and that of the most specific ancestor node, msa, of the two senses. The wup similarity, described in equation (12), takes into account only the path length between two concepts. The jcn measure combines corpus statistics and structural properties of a knowledge base. It is computed as presented in equation (13), where $IC$ is the information content of a concept $c$ derived from a corpus and computed as $IC(c) = -\log P(c)$ (footnote 3: we used the IC files computed on SemCor [\citenameMiller et al.1993] for the experiments in this article; they are available at http://wn-similarity.sourceforge.net and are mapped to the corresponding version of WordNet of each dataset).
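$$\mathrm{ssim}_{wup}(s_i, s_j) = \frac{2 \cdot \mathrm{depth}(\mathrm{msa})}{\mathrm{depth}(s_i) + \mathrm{depth}(s_j)} \tag{12}$$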
$$\mathrm{ssim}_{jcn}(s_i, s_j) = \frac{1}{IC(s_i) + IC(s_j) - 2 \cdot IC(\mathrm{msa})} \tag{13}$$
The semantic relatedness measures that we used are based on the computation of the similarity between the definitions of two concepts in a lexical database. These definitions are derived from the glosses of the synsets in WordNet. They are used to construct a co-occurrence vector $v_i = (w_{1,i}, w_{2,i}, \ldots, w_{n,i})$ for each concept $i$, with a bag-of-words approach where $w_{k,i}$ represents the number of times word $k$ occurs in the gloss and $n$ is the total number of different words (types) in the corpus (footnote 4: in our case the corpus is composed of all the WordNet glosses). This representation allows us to project each vector into a vector space where it is possible to conduct different kinds of computations. For our experiments, we decided to calculate the similarity between two glosses using the cosine distance between two vectors, as shown in equation (14), where the numerator is the intersection of the words in the two glosses and $\|v\|$ is the norm of the vectors, which is calculated as $\|v\| = \sqrt{\sum_{k=1}^{n} w_k^2}$.
$$\cos\theta = \frac{v_i \cdot v_j}{\|v_i\| \, \|v_j\|} \tag{14}$$
This measure gives the cosine of the angle between the two vectors and in our case returns values ranging from 0 to 1, because the values in the co-occurrence vectors are all positive. Given the fact that small cosine distances indicate a high similarity, we transform this distance measure into a similarity measure.
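The cosine between two bag-of-words gloss vectors can be sketched as follows. The tokenization by whitespace, the function names and the two short glosses are assumptions made for illustration; they are not WordNet glosses.

```python
import math
from collections import Counter

def gloss_vector(gloss):
    """Bag-of-words vector: term -> number of occurrences in the gloss."""
    return Counter(gloss.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty gloss shares nothing with any other gloss
    return dot / (norm_u * norm_v)

# Hypothetical glosses (illustrative wording only).
g1 = gloss_vector("sloping land beside water")
g2 = gloss_vector("land beside a river")
print(cosine(g1, g2))
```

Since all counts are non-negative, the value lies between 0 (no shared words) and 1 (identical direction), in line with the range stated above.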
The procedure to compute the semantic relatedness of two synsets was introduced by Patwardhan and Pedersen (2006) as the Gloss Vector measure, and we used it with four different variations for our experiments. The four variations are named $tfidf$, $tfidf_{ext}$, $vec$ and $vec_{ext}$. The difference among them lies in the way the gloss vectors are constructed. Since the synset gloss is usually short, we used the concept of super-gloss as in [\citenamePatwardhan and Pedersen2006] to construct the vector of each synset. A super-gloss is the concatenation of the gloss of the synset plus the glosses of the synsets that are connected to it via some WordNet relations [\citenamePedersen2012]. We employed the WordNet version that was used to label each dataset. Specifically, the different implementations of the vector construction vary in the way in which the co-occurrence is calculated, the corpus used and the source of the relations. $tfidf$ constructs the co-occurrence vectors exploiting the term frequency-inverse document frequency weighting schema (tf-idf). $tfidf_{ext}$ uses the same information as $tfidf$ plus the relations derived from eXtended WordNet [\citenameMihalcea and Moldovan2001]. $vec$ uses a standard BoW approach to compute the co-occurrences. $vec_{ext}$ uses the same information as $vec$ plus the relations from eXtended WordNet.
Instead of considering only the raw frequency of terms in documents, the tf-idf method scales down the importance of less informative terms by taking into account the number of documents in which a term occurs. Formally, it is the product of two statistics: the term frequency and the inverse document frequency. The former is computed as the number of times a term occurs in a document (a gloss, in our case); the latter is computed as $idf_t = \log \frac{N}{df_t}$, where $N$ is the number of documents in the corpus and $df_t$ is the number of documents in which the term $t$ occurs.
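A minimal sketch of this weighting, assuming raw counts as term frequency, the natural logarithm for the inverse document frequency and whitespace tokenization; the function name and the example glosses are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(glosses):
    """Return one sparse tf-idf vector (term -> weight) per gloss."""
    docs = [Counter(g.lower().split()) for g in glosses]
    n_docs = len(docs)
    # Document frequency: number of glosses in which each term appears.
    df = Counter(term for doc in docs for term in doc)
    return [{t: tf * math.log(n_docs / df[t]) for t, tf in doc.items()}
            for doc in docs]

# Hypothetical two-gloss corpus (illustrative wording only).
vectors = tfidf_vectors(["river bank", "bank money"])
print(vectors)
```

A term that appears in every gloss receives weight 0, so it no longer contributes to the cosine between two gloss vectors.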
Relatedness Measure Calculated with BabelNet and NASARI. BabelNet [\citenameNavigli and Ponzetto2012a] is a wide-coverage multilingual semantic network. It integrates lexicographic and encyclopedic knowledge from WordNet and Wikipedia, automatically mapping the concepts shared by the two knowledge bases. This mapping generates a semantic network where millions of concepts are lexicalized in different languages. Furthermore, it allows named entities, such as Johann Sebastian Bach, to be linked to concepts, such as composer and organist.
BabelNet can be represented as a labeled directed graph $G = (V, E)$, where $V$ is the set of nodes (concepts or named entities) and $E$ is the set of edges connecting pairs of concepts or named entities. The edges are labeled with a semantic relation from $R$, such as is-a, given name or occupation. Each node contains a set of lexicalizations of the concept for different languages, which forms a BabelNet synset.
The semantic measure that we developed using BabelNet is based on NASARI [\citenameCamachoCollados, Pilehvar, and Navigli2015] (footnote 5: the resource is available at http://lcl.uniroma1.it/nasari/), a semantic representation of the concepts and named entities in BabelNet. This approach first exploits the BabelNet network to find the set of related concepts in WordNet and Wikipedia and then constructs two vectors to obtain a semantic representation of a concept. These representations are projected into two different semantic spaces, one based on words and the other on synsets. They use lexical specificity [\citenameLafon1980] (footnote 6: a statistical measure based on the hypergeometric distribution over word frequencies) to extract the most representative words to use in the first vector and the most representative synsets to use in the second vector.
In this article, we computed the similarity between two senses using the vectors (of the word-based semantic space) provided by NASARI. These semantic representations provide, for each sense, the set of words that best represent a particular concept and the score of representativeness of each word. From this representation we computed the pairwise cosine similarity between each pair of concepts, as described in the previous section for the semantic relatedness measures.
The use of NASARI is particularly useful in the case of named entity disambiguation, since it includes many entities that are not included in WordNet. On the other hand, it is difficult to use in all-words sense disambiguation tasks, since it includes only WordNet synsets that are mapped to Wikipedia pages in BabelNet. For this reason it is not possible to find the semantic representation for many verbs, adjectives and adverbs, which are common in all-words sense disambiguation tasks.
We used the SPARQL endpoint provided by BabelNet (footnote 7: http://babelnet.org/sparql/) to collect the sense inventories of each word in the texts of each dataset. For this task we filtered the first 100 resources whose label contains the lexicalization of the word to be disambiguated. This operation is required because in many cases it is possible to have indirect references to entities.
5.2.3 From similarities to payoffs
The similarity and relatedness measures are computed for all the senses of the words to be disambiguated. From this computation it is possible to obtain a similarity matrix $Z$, which incorporates the pairwise similarity among all the possible senses. This computation could have a heavy computational cost if there are many words to be disambiguated. To overcome this issue, the pairwise similarities can be computed just once on the entire knowledge base and reused in actual situations, reducing the computational cost of the algorithm. From this matrix we can obtain the partial semantic similarity matrix $Z_{ij}$ for each pair of players $i$ and $j$, which contains the entries of $Z$ whose rows and columns correspond to the senses of $i$ and $j$.
In a previous work [\citenameTripodi and Pelillo2015] we did not use this information, instead we used labeled data points to propagate the class membership information over the graph. In this new version the use of the semantic information made the algorithm completely unsupervised.
5.2.4 Strategy space implementation
Once the pairwise similarities between the words and their senses, stored in the two matrices $W$ and $Z$, are calculated, we can pass to the description of the strategy space of each player. It can be initialized with equation (15), which follows the constraints described in Section 4.2 and assigns to each sense an equal probability.
$$x_{ih} = \begin{cases} |M_i|^{-1}, & \text{if sense } h \text{ is in } M_i \\ 0, & \text{otherwise} \end{cases} \tag{15}$$
This initialization is used in the case of unsupervised WSD, since it does not use any prior knowledge about the distribution of the senses. If we want to exploit prior knowledge obtained from sense-labeled data, we can assign to each sense a probability according to its rank, concentrating a higher probability on senses that have a high frequency. To model this kind of scenario we used a geometric distribution, which gives us a decreasing probability distribution. This new initialization is defined as follows,
$$x_{ih} = \begin{cases} p(1-p)^{r_h}, & \text{if sense } h \text{ is in } M_i \\ 0, & \text{otherwise} \end{cases} \tag{16}$$
where $p$ is the parameter of the geometric distribution and determines the scale or statistical dispersion of the probability distribution, and $r_h$ is the rank of sense $h$, which ranges from 1, the rank of the most common sense, to $m$, the rank of the least frequent sense. Finally, the vector obtained from equation (16) is divided by the sum of its components in order to make the probabilities add up to 1. In our experiments, we used the ranking system provided by the Natural Language Toolkit (version 3.0) [\citenameBird2006] to rank the senses associated with each word to be disambiguated. The Natural Language Toolkit is a suite of modules and data sets covering symbolic and statistical NLP. It includes a WordNet reader that can be queried with a lemma and a part of speech to obtain the list of possible synsets associated with the specified lemma and part of speech. The returned synsets are listed in decreasing order of frequency and can be used as a ranking system by our algorithm.
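The two initializations can be sketched for a single word as follows. The sketch assumes that the senses are already ordered by rank (most frequent first) and that only the senses in the word's own inventory are stored; the function names and the value of the parameter `p` are hypothetical.

```python
def uniform_init(n_senses):
    """Equal probability for every sense in the inventory (unsupervised case)."""
    return [1.0 / n_senses] * n_senses

def geometric_init(n_senses, p=0.4):
    """Decreasing probabilities p * (1 - p)^rank, normalized to sum to 1."""
    raw = [p * (1 - p) ** rank for rank in range(1, n_senses + 1)]
    total = sum(raw)
    return [value / total for value in raw]

print(uniform_init(3))
print(geometric_init(3, p=0.5))
```

Because the vector is normalized, starting the ranks at 0 instead of 1 would yield the same distribution; larger values of `p` concentrate more probability on the most frequent sense.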
We used the method proposed by Navigli (2006) for the experiments on coarse-grained WSD. With this approach it is possible to cluster the senses of a given word according to the similarity that the senses share. In this way it is possible to obtain, for each sense inventory $M_i$, a set of disjoint clusters, which is ranked according to the information obtained with the ranking system described above. The initialization of the strategy space, in this case, is defined as follows,
(17) 
With this initialization it is possible to assign an equal probability to the senses belonging to a determined cluster and to rank the clusters according to the ranking of the senses in each of them.
5.3 An example
As an example we can consider the following sentence, which we encountered before:

There is a financial institution near the river bank.
We first tokenize, lemmatize and tag the sentence; then we extract the content words that have an entry in WordNet 3.0 [\citenameMiller1995], constructing the list of words to be disambiguated: {is, financial, institution, river, bank}. Once we have identified the target words, we compute the pairwise similarity for each pair of target words. For this task we used the Google Web 1T 5-Gram Database [\citenameBrants and Franz2006] to compute the modified Dice coefficient [\citenameKitamura and Matsumoto1996] (footnote 8: specifically, we used the service provided by the Corpus Linguistics group at FAU Erlangen-Nürnberg, with a collocation span of 4 words on the left and on the right and collocates with minimum frequency 100). With the information derived from this process we can construct a co-occurrence graph (Figure 4(a)), which indicates the strength of association between the words in the text. This information can be augmented by taking into account other sources of information, such as the dependency structure of the syntactic relations between the words (footnote 9: this aspect is not treated in this article) or the proximity information derived from a simple n-gram model (Figure 4(b)).
The operation of incrementing the weights of structurally related words is important because it prevents the system from relying only on distributional information, which could lead to a sense shift for the ambiguous word bank. In fact, its association with the words financial and institution would lead the system to interpret it as a financial institution and not as sloping land, as defined in WordNet. Furthermore, using only distributional information could exclude associations between words that do not appear in the corpus in use.
Figure 4(c) shows the final form of the graph for our target sentence, in which we have combined the information from the co-occurrence graph and from the n-gram graph. The weights in the co-occurrence graph are increased by the mean weight of the graph if a corresponding edge exists in the n-gram graph and does not include a stopword (footnote 10: a more accurate representation of the data can be obtained using the dependency structure of the sentence instead of the n-gram graph; in this case, however, the results would not have changed, since in both cases there is an edge between river and bank. In fact, in many cases a simple n-gram model can implicitly detect syntactic relations. We used the stopword list available in the Python Natural Language Toolkit, described above).
After the pairwise similarities between the words are computed, we access a lexical database to get the sense inventory of each word, so that each word can be associated with a predefined number of senses. For this task, we use WordNet 3.0 [\citenameMiller1995]. Then, for each unique sense in all the sense inventories, we compute the pairwise semantic similarity, in order to identify the affinity among all the pairwise sense combinations. This task can be done using a semantic similarity or relatedness measure (footnote 11: semantic similarity and relatedness measures are discussed in Sections 5.2.1 and 5.2.2). For this example, we used a variant of the gloss vector measure [\citenamePatwardhan and Pedersen2006], the tfidf, described in Section 5.2.2.
Having obtained the similarity information, we can initialize the strategy space of each player with a uniform distribution, since we are not considering any prior information about the sense distributions. Now the system dynamics can be started. In each iteration of the dynamics each player plays a game with each of its neighbors, obtaining a payoff for each of its strategies according to equation (10); once the players have played the games with their neighbors in the graph, the strategy space of each player is updated for the next time step according to equation (6).
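One iteration of these dynamics can be sketched in a few lines. Equations (6) and (10) are not restated in this section, so the sketch assumes the standard discrete-time replicator dynamics, with the payoff of a sense computed as the weighted sum of its similarities to the senses currently preferred by the neighbors. The function name `replicator_step`, the nested-list layout of `z` and all the numbers are hypothetical and serve only as illustration.

```python
def replicator_step(x, w, z):
    """One synchronous update of every player's mixed strategy.

    x -- x[i] is player i's probability distribution over its senses
    w -- w[i][j] is the influence (graph weight) of word j on word i
    z -- z[i][j][h][k] is the similarity between sense h of word i
         and sense k of word j (partial payoff matrix of the pair)
    """
    new_x = []
    for i, xi in enumerate(x):
        # Payoff of each pure strategy (sense) h of player i.
        payoffs = []
        for h in range(len(xi)):
            u = 0.0
            for j, xj in enumerate(x):
                if j == i or w[i][j] == 0:
                    continue
                u += w[i][j] * sum(z[i][j][h][k] * xj[k] for k in range(len(xj)))
            payoffs.append(u)
        # Expected payoff of the current mixed strategy.
        avg = sum(p * u for p, u in zip(xi, payoffs))
        if avg == 0:
            # No payoff at all: the player keeps its current distribution.
            new_x.append(list(xi))
        else:
            new_x.append([p * u / avg for p, u in zip(xi, payoffs)])
    return new_x


# Toy game: "river" has one sense, "bank" has two (financial, sloping land).
x = [[1.0], [0.5, 0.5]]          # uniform initialization
w = [[0.0, 1.0], [1.0, 0.0]]     # the two words are neighbors
z = [
    [None, [[0.1, 0.9]]],        # river sense vs. the two bank senses
    [[[0.1], [0.9]], None],      # bank senses vs. the river sense
]
for _ in range(10):
    x = replicator_step(x, w, z)
```

In this toy run the second sense of bank, which is the more similar to river, absorbs almost all of the probability mass after a few iterations. A player whose payoffs are all zero stays where it is, which for a uniformly initialized player means the central point of the simplex.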
We present the dynamics of the system created for the example sentence in Figure 5. The dynamics are shown only for the ambiguous words, at a few selected time steps up to the one at which the system converges. As we can see, at time step 1 the senses of each word are equiprobable, but as soon as the games are played some senses start to emerge. In fact, at time step 2 many senses are discarded, by virtue of two principles: a) the words in the text push the senses of the other words toward a specific sense; and b) the sense similarity values for certain senses are very low. Regarding the first principle, we can consider the word institution, which plays games with the words financial and bank and is immediately driven toward a specific sense, an organization founded and united for a specific purpose, as defined in WordNet 3.0, thus discarding the other senses. Regarding the second principle, we can consider the many senses of the word bank that are not compatible with the senses of the other words in the text, whose values therefore decrease rapidly.
The most interesting phenomenon that can be appreciated from the example is the behavior of the strategy space of the word bank. It has ten senses, according to WordNet 3.0 [\citenameMiller1995], and can be used in different contexts and domains to indicate, among other things, a financial institution or sloping land (both senses are plotted in Figure 5). When it plays a game with the words financial and institution it is directed toward its financial sense; when it plays a game with the word river, it is directed toward its naturalistic meaning. As we can see in Figure 5, at time step 2 these two meanings have almost the same value, and at time step 3 the word starts to define a precise meaning, to the detriment of one of them but not of the other. The balance of these forces toward a specific meaning is given by the similarity values, which in this case allow bank to choose its naturalistic meaning. Furthermore, we can see that the inclination toward a particular sense is given by the payoff matrix and by the strategy distribution of the neighboring word, which indicates which sense that word is going to choose, ensuring that the choice of the current word is coherent with it.
6 Experimental Evaluation
In this section we describe how the parameters of the presented method were found and how the method was tested and compared with state-of-the-art systems, in Section 6.1 and Section 6.2, respectively (footnote 12: the code of the algorithm and the datasets used are available at http://www.dsi.unive.it/ tripodi/wsd). We describe the datasets used for the tuning and for the evaluation of our model and the different settings used to test it. The results of our experiments using WordNet as knowledge base are described in Section 6.2.1, where two different implementations of the system are proposed, the unsupervised and the semi-supervised. In the same section we compare our results with state-of-the-art systems. Finally, the results of the experiments using BabelNet as knowledge base, related to WSD and entity disambiguation, are described in Section 6.2.2. The results are provided as F1, computed according to the following equation,
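\[ F1 = 2 \cdot \frac{\mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}} \tag{18} \]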
F1 is the harmonic mean of precision and recall, with the two weighted equally. Precision is defined as the number of correct answers divided by the number of provided answers, and recall is defined as the number of correct answers divided by the total number of answers to be provided.
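As a worked illustration of these definitions, the following sketch computes precision, recall and F1 from raw counts. The function name `f1_score` and the counts are hypothetical; the point is that a system which leaves some words without an answer has fewer provided answers than answers to be provided, so its recall falls below its precision.

```python
def f1_score(correct, provided, total):
    """Return (precision, recall, F1) from answer counts.

    correct  -- number of correct answers
    provided -- number of answers the system actually gave
    total    -- number of answers that had to be provided
    """
    precision = correct / provided if provided else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical counts: 100 target words, 80 answered, 60 answered correctly.
p, r, f1 = f1_score(correct=60, provided=80, total=100)
```

When every target word receives an answer, the number of provided answers equals the total, and precision, recall and F1 coincide.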
6.1 Parameter Tuning
We used two datasets to tune the parameters of our approach, SemEval-2010 task 17 (S10) [\citenameAgirre et al.2009] and SemEval-2015 task 13 (S15) [\citenameMoro and Navigli2015]. The first dataset is composed of three English texts from the ecology domain, for a total of 1398 words to be disambiguated (1032 nouns/named entities and 366 verbs). The second dataset is composed of four English documents from different domains (medical, drug, math and social issues), for a total of 1261 instances, including nouns/named entities, verbs, adjectives and adverbs. Both datasets have been manually labeled using WordNet 3.0. The key difference between them is that the target words of the first dataset belong to a specific domain, whereas all the content words of the second dataset have to be disambiguated. We used these two types of dataset to evaluate our algorithm in different scenarios. Furthermore, from each dataset we created 50 different datasets, selecting from each text a random number of sentences, and evaluated our approach on each of them to identify the parameters that on average perform better than the others. In this way it is possible to simulate a situation in which the system has to work on texts of different sizes and from different domains. This is because, as demonstrated by Søgaard et al. \shortcitesogaard2014s, the results of a given algorithm are very sensitive to sample size. The number of target words for each text in the random datasets ranges from 12 to 571. The parameters to be tuned are: the association and semantic measures used to weight the similarity among words and senses (Section 6.1.1), the parameter n of the n-gram graph used to increase the weights of nearby words (Section 6.1.2) and the parameter of the geometric distribution used by our semi-supervised system (Section 6.1.3).
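The construction of the random tuning datasets can be sketched as follows. The text does not say whether the selected sentences are contiguous or how the random number is drawn, so this sketch assumes a uniformly random number of sentences per text, sampled without replacement and kept in their original order. The function name `random_subsets` and the placeholder sentences are hypothetical.

```python
import random


def random_subsets(texts, n_subsets=50, seed=0):
    """Build n_subsets datasets, each with a random selection of sentences per text.

    texts -- list of texts, each text being a list of sentences
    """
    rng = random.Random(seed)  # fixed seed so the datasets can be regenerated
    subsets = []
    for _ in range(n_subsets):
        subset = []
        for sentences in texts:
            k = rng.randint(1, len(sentences))          # random number of sentences
            idx = sorted(rng.sample(range(len(sentences)), k))
            subset.append([sentences[i] for i in idx])  # keep the original order
        subsets.append(subset)
    return subsets


# Placeholder corpus: three texts with 5, 8 and 3 sentences.
texts = [[f"text{t}_sentence{s}" for s in range(n)] for t, n in enumerate((5, 8, 3))]
subsets = random_subsets(texts)
```

Each parameter setting would then be scored on all 50 datasets and the scores averaged, so that the selected parameters do not depend on a single sample size.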
6.1.1 Association and Semantic Measures
The first experiment that we present is aimed at finding the semantic and distributional measures with the highest performance. We recall that we used WordNet 3.0 as knowledge base and the BNC corpus [\citenameLeech1992] to compute the association measures. In Tables 2 and 3 we report the average results on the S10 and S15 datasets, respectively. From these tables it is possible to see that the performance of the system is highly influenced by the combination of measures used.
Table 2. Average results on the S10 datasets. Columns are association measures; each row corresponds to one semantic measure.

dice   mdice  pmi    tscore  zscore  oddsr  chis   chisc
55.5   56.3   50.6   45.4    50.1    49.8   39.1   54.4
56.5   55.9   50.1   45.0    49.9    49.5   39.1   54.2
54.7   54.3   49.3   44.1    49.4    53.6   39.3   50.5
55.0   54.3   48.8   43.8    48.6    53.6   39.1   49.9
51.3   50.6   40.1   50.1    47.6    52.6*  50.1   50.6
37.2   36.9   35.6   32.2    37.9    36.8   38.4   35.4
Table 3. Average results on the S15 datasets. Columns are association measures; each row corresponds to one semantic measure.

dice   mdice  pmi    tscore  zscore  oddsr  chis   chisc
64.1   64.2   63.1   59.0    61.8    65.3   63.3*  62.4
62.9   63.1   62.4   58.7    60.9    63.0   62.0   61.1
62.8   62.3   62.8   59.8    62.3    62.9   61.1   60.3
60.5   59.9   61.2   57.8    59.7    60.6   60.1   59.4
57.2   57.6   56.7   57.9    57.0    56.9   57.5   57.6
46.2   45.4   43.8   45.4    45.9    47.4   46.1   45.5
As an example of the different representations generated by the measures described in Section 5.2, we can observe Figures 6 and 7, which depict the sense similarity matrices and the adjacency matrix of the word graph, respectively. They are computed on the following three sentences from the second text of S10,
The rivers Trent and Ouse, which provide the main fresh water flow into the Humber, drain large industrial and urban areas to the south and west (River Trent), and less densely populated agricultural areas to the north and west (River Ouse). The Trent/Ouse confluence is known as Trent Falls. On the north bank of the Humber estuary the principal river is the river Hull, which flows through the city of KingstonuponHull, and has a tidal length of 32 km, up to the Hempholme Weir.
resulting in 35 content words (nouns and verbs), listed below with their part-of-speech tags, and 131 senses.
river n, Trent n, Ouse n, provide v, main n, water n, flow n, Humber n, drain v, area n, south n, west n, River n, Trent n, area n, River n, Ouse n, Trent n, Ouse n, confluence n, be v, Trent n, Falls n, bank n, Humber n, estuary n, river n, be v, river n, flow v, city n, have v, length* n, km n, Weir n
The first observation that can be made on the results concerns the semantic measures: the relatedness measures perform significantly better than the semantic similarity measures. This is because wup and jcn can be computed only on synsets that have the same part of speech. This limitation affects the results of the algorithm because the games played between two words with different parts of speech have no effect on the dynamics of the system, since the values of the resulting payoff matrices are all zeros. It affects the performance of the system in terms of recall, because in this situation these words tend to remain at the central point of the simplex, and also in terms of precision, because the choice of the meaning of a word is computed taking into account only the influence of words with the same part of speech. In fact, from Figure 6 we can see that the representations provided by wup and jcn for the text described above have many uniform areas, which means that these approaches are not able to provide a clear representation of the data. By contrast, the representations provided by the relatedness measures show a block structure on the main diagonal of the matrix, which is exactly what is required of a similarity measure. The tf-idf weighting schema seems to reduce the noise in the data representation: the weights on the left part of the matrix are reduced by tfidf and tfidfext, whereas they have high values in vec and vecext. The representations obtained with eXtended WordNet are very similar to those obtained with WordNet, and their performances are also very close, although on average WordNet outperforms eXtended WordNet.
If we observe the performance of the association measures, we can notice that on average the best measures are dice, mdice, chisc and, on S15, also oddsr; by contrast, the other measures almost always perform below the threshold of statistical significance. Observing the representations in Figure 7, we can see that dice and mdice have a similar structure; the difference between these two measures is that mdice has values on a different range and tends to differentiate the weights better, whereas in dice the values are almost uniform. Pmi tends to take high values when one word in the collocation has low frequency, but this does not imply high dependency and therefore compromises the results of the games. From its representation we can observe that its structure is different from the previous two: it concentrates its values on collocations such as river Trent and river Ouse, which unbalances the data representation, whereas dice and mdice concentrate their values on collocations such as river flow and bank estuary. Tscore and zscore have a similar structure; the only difference is in the range of the values. For these measures the distribution of the values is quite homogeneous, meaning that they are not able to balance the weights well. On oddsr we can recognize a structure similar to that of pmi; the main difference is that it works on a different range. The values obtained with chis are on a wide range, which compromises the data representation; in fact, its results are always below the threshold of statistical significance. Chisc works on a narrower range than chis and its structure resembles that of dice; in fact, its results are often high.
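The different behavior of pmi and the Dice-based measures on low-frequency words can be reproduced with a small computation. The sketch below uses the usual textbook definitions, which are assumptions here because Section 5.2 is not restated: Dice as 2 f(x,y) / (f(x) + f(y)), the modified Dice coefficient as Dice multiplied by log2 f(x,y), and pmi as log2 of the ratio between observed and expected co-occurrence. All counts are invented.

```python
import math


def dice(f_xy, f_x, f_y):
    """Dice coefficient from the joint and marginal frequencies."""
    return 2 * f_xy / (f_x + f_y)


def mdice(f_xy, f_x, f_y):
    """Modified Dice: Dice scaled by the log of the joint frequency (assumed form)."""
    return math.log2(f_xy) * dice(f_xy, f_x, f_y) if f_xy > 0 else 0.0


def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information, with n the corpus size."""
    if f_xy == 0:
        return float("-inf")
    return math.log2(f_xy * n / (f_x * f_y))


N = 1_000_000
rare = dict(f_xy=5, f_x=5, f_y=1000)         # one word is very infrequent
common = dict(f_xy=500, f_x=1000, f_y=1000)  # both words are frequent

scores = {
    "pmi": (pmi(n=N, **rare), pmi(n=N, **common)),
    "dice": (dice(**rare), dice(**common)),
    "mdice": (mdice(**rare), mdice(**common)),
}
```

With these counts pmi ranks the rare pair above the common one, although the rare pair is supported by only five observations, while dice and mdice rank the common pair higher.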
6.1.2 Ngram Graph
The association measures are able to give a good representation of the text, but in many cases a word in a specific text is not present in the corpus on which these measures are calculated; furthermore, the same words may be used with different lexicalizations. A way to overcome these problems is to increase the weights between a given word and the words near it, which ensures that the nodes in the graph are always connected. Furthermore, it allows us to exploit local information, increasing the importance of the words that share a proximity relation with a given word; in this way it is possible to give more importance to (possibly syntactically) related words, as described in Section 5.1.1.
To test the influence that the parameter n of the n-gram graph has on the performance of the algorithm, we selected the association and relatedness measures with the highest results and conducted a series of experiments on the same datasets presented above, with increasing values of n. The results of these experiments on S10 and S15 are presented in Figures 8(a) and 8(b), respectively. From the plots we can see that this approach is always beneficial for S15 and that the results increase substantially with values of n greater than 2. By contrast, on S10 this approach is not always beneficial, but in many cases it is possible to notice an improvement. In particular, we can notice that the pair of measures with the highest results on both datasets is tfidf-mdice, at one particular value of n. This also confirms our earlier experiments, in which we have seen that these two measures are particularly suited to our algorithm.
6.1.3 Geometric Distribution
Once we have identified the measures to use in our unsupervised system, we can test which parameter is best when we want to exploit information from sense-labeled corpora. To tune the parameter of the geometric distribution (described in Section 5.2.4) we used the pair of measures and the value of n selected in the previous experiments and ran the algorithm on S10 and S15 with increasing values of this parameter over a fixed interval.
The results of this experiment are presented in Figure 9(a), where we can see that the performance of the semi-supervised system on S15 is always better than that obtained with the unsupervised system. By contrast, the performance on S10 is always lower than that obtained with the unsupervised system. This behavior is not surprising, because the target words of S10 belong to a specific semantic domain. We used SemCor to obtain the information about the sense distributions, and this resource is a general-domain corpus, which is not tailored to this specific task. In fact, as pointed out by McCarthy et al. \shortcitemccarthy2007unsupervised, the distribution of word senses in specific domains is highly skewed, and for this reason the most frequent sense heuristic calculated on general-domain corpora, such as SemCor, is not beneficial for this kind of text.
From the plot we can see that on S15 the highest results are obtained within a limited range of parameter values; for the evaluation of our model we decided to use, as parameter for the geometric distribution, the value with which we obtained the highest result.
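The role of this parameter can be illustrated with a small sketch of a geometric initialization of the strategy space. Section 5.2.4 is not restated here, so the exact form is an assumption: the senses of a word are taken to be ranked by decreasing frequency in a sense-labeled corpus, the sense of rank r receives a mass proportional to p(1-p)^(r-1), and the masses are renormalized over the finite sense inventory. The function name `geometric_prior` and the value 0.4 are hypothetical.

```python
def geometric_prior(n_senses, p):
    """Initial strategy over n_senses senses ranked by decreasing frequency.

    p -- parameter of the geometric distribution, with 0 < p < 1;
         larger values concentrate more mass on the most frequent sense
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    raw = [p * (1 - p) ** rank for rank in range(n_senses)]
    total = sum(raw)  # renormalize: the sense inventory is finite
    return [value / total for value in raw]


prior = geometric_prior(n_senses=4, p=0.4)
near_uniform = geometric_prior(n_senses=4, p=0.001)
```

As p approaches zero the initialization approaches the uniform distribution used by the unsupervised system, which is one way to read the comparison between the two settings in Figure 9(a).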
6.1.4 Error Analysis
The main problems that we noticed when analyzing the results of the previous experiments are related to the semantic similarity measures. As we pointed out in Section 6.1.1, these measures can be computed only on synsets with the same part of speech, and this influences the results in terms of recall. Adverbs and adjectives are not disambiguated with these measures, because of the lack of payoffs. This happens not only for function words with low semantic content but also for verbs with rich semantic content, such as generate, prevent and obtain. The use of the relatedness measures substantially reduces the number of words that are not disambiguated. With these measures a word is not disambiguated only when the concepts it denotes are not covered well enough by the reference corpus; for example, in our experiments words such as drawn-out, dribble and catchment are not disambiguated.
To overcome this problem we have used the n-gram graph to increase the weights among neighboring words. Experimentally we noticed that, when this approach is used with the relatedness measures, it leads to the disambiguation of all the target words, in which case precision and recall coincide. The use of this approach also influences the results in terms of precision; for instance, the performance of the system on the word actor improves. This is because the numbers of relations of the two senses (synsets) of the word actor are not balanced in WordNet 3.0: actor as theatrical performer has 21 relations, whereas actor as person who acts and gets things done has only 8 relations, and this can compromise the computation of the semantic relatedness measures. It is possible to overcome this limitation using the local information given by the n-gram graph, which allows us to balance the influence of the words in the text.
Another aspect to consider is whether the polysemy of the words influences the results of the system. Analyzing the results, we noticed that the majority of the errors are made on words such as make (v), give (v), play (v), better (a), work (v), follow (v), see (v) and come (v), which have more than 20 different senses and are very frequent words that are difficult to disambiguate in fine-grained tasks. As we can see from Figure 10, this problem can be partially solved using the semi-supervised system. In fact, the use of information from sense-labeled corpora is particularly useful when the polysemy of the words is particularly high.
6.2 Evaluation Setup
We evaluated our algorithm with three fine-grained datasets, Senseval-2 English all-words (S2) [\citenamePalmer et al.2001], Senseval-3 English all-words (S3) [\citenameSnyder and Palmer2004] and SemEval-2007 all-words (S7) [\citenamePradhan et al.2007], and one coarse-grained dataset, SemEval-2007 English all-words (S7CG) [\citenameNavigli, Litkowski, and Hargraves2007], using WordNet as knowledge base (footnote 13: we downloaded S2 from www.hipposmond.com/senseval2, S3 from http://www.senseval.org/senseval3, S7 from http://nlp.cs.swarthmore.edu/semeval/tasks/index.php and S7CG from http://lcl.uniroma1.it/coarsegrainedaw). Furthermore, we evaluated our approach on two datasets, SemEval-2013 task 12 (S13) [\citenameNavigli, Jurgens, and Vannella2013] and KORE50 [\citenameHoffart et al.2012], using BabelNet as knowledge base (footnote 14: we downloaded S13 from https://www.cs.york.ac.uk/semeval2013/task12/index.html and KORE50 from http://www.mpiinf.mpg.de/departments/databasesandinformationsystems/research/yagonaga/aida/downloads/).
We describe the evaluation using WordNet as knowledge base in the next sections, and in Section 6.2.2 we present the evaluation conducted using BabelNet as knowledge base. We recall that for all the following experiments we used mdice to weight the graph, tfidf to compute the payoffs, the value of n selected in Section 6.1.2 for the n-gram graph and, in the case of semi-supervised learning, the geometric distribution parameter selected in Section 6.1.3. The results are provided as F1 for all the datasets except KORE50; for this dataset the results are provided as accuracy, as is common in the literature.
6.2.1 Experiments Using WordNet as Knowledge Base
Table 4. F1 results with WordNet as knowledge base (All = all words, N = nouns, V = verbs, A = adjectives, R = adverbs).

SemEval 2007 coarse-grained (S7CG)
Method          All    N      V      A      R
WSD unsup.      80.4   85.5   71.2   81.5   76.0
WSD semi-sup.   82.8   85.4   77.2   82.9   84.6
MFS             76.3   76.0   70.1   82.0   86.0

SemEval 2007 fine-grained (S7)
Method          All    N      V      A      R
WSD unsup.      43.3   49.7   39.9   -      -
WSD semi-sup.   56.5   62.9   53.0   -      -
MFS             54.7   60.4   51.7   -      -

Senseval 3 fine-grained (S3)
Method          All    N      V      A      R
WSD unsup.      59.1   63.3   50.7   64.5   71.4
WSD semi-sup.   64.7   70.3   54.1   70.7   85.7
MFS             62.8   69.3   51.4   68.2   100.0

Senseval 2 fine-grained (S2)
Method          All    N      V      A      R
WSD unsup.      61.2   69.8   41.7   61.9   65.1
WSD semi-sup.   66.0   72.4   43.5   71.8   75.7
MFS             65.6   72.1   42.4   71.6   76.1
Table 4 shows the results as F1 for the four datasets that we used for the experiments with WordNet. The table includes the results for the two implementations of our system, the unsupervised and the semi-supervised, and the results obtained using the most frequent sense heuristic. For the computation of the most frequent sense we assigned to each word to be disambiguated the first sense returned by the WordNet reader provided by the Natural Language Toolkit (version 3.0) [\citenameBird2006]. As we can see, the best performance of our system is obtained on nouns, on all the datasets. This is in line with state-of-the-art systems, because in general nouns have lower polysemy and higher inter-annotator agreement [\citenamePalmer et al.2001]. Furthermore, our method is particularly suited to nouns. In fact, the disambiguation of nouns benefits from a wide context and local collocations [\citenameAgirre and Edmonds2007].
We obtained low results on verbs, on all datasets. This, as pointed out by Dang \shortcitedang2004investigations, is a common problem not only for supervised and unsupervised WSD systems but also for humans, who in many cases disagree about what constitutes a different sense for a polysemous verb, compromising the sense tagging procedure.
As we anticipated in Section 6.1.3, the use of prior knowledge is beneficial for this kind of dataset. As we can see in Table 4, using a semi-supervised setting improves the results by 5% on S2 and S3 and by 12% on S7. The large improvement obtained on S7 can be explained by the fact that the results of the unsupervised system are well below the most frequent sense heuristic, so exploiting the evidence from a sense-labeled dataset is beneficial. For the same reason, the results obtained on S7CG with a semi-supervised setting are less impressive than those obtained with the unsupervised system; in fact, the structure of the datasets is different and the results obtained with the unsupervised setting are well above the most frequent sense. This series of experiments confirms that the use of prior knowledge is beneficial in general-domain datasets and that, when it is used, the system performs better than the most frequent sense heuristic computed on the same corpus.
Comparison to State-of-the-Art Algorithms
S7CG  S7CG (N)  S7  S3  S2  
unsup. 
Nav10  43.1  52.9  
PP  80.1  83.6  41.7  57.9  59.7  
WS  80.4*  85.5  43.3  59.1  61.2  
semi sup. 
IRSTDDD00  58.3  
MFS  76.3  77.4  54.7  62.8  65.6*  
MRFLP  50.6*  58.6  60.5  
Nav05  83.2  84.1  60.4  
PP  81.4  82.1  48.6  63.0  62.6  
WS  82.8  85.4  56.5  64.7*  66.0  
sup. 
Best  82.5  82.3*  59.1  65.2  68.6 
Zhong10  82.6  58.3  67.6  68.2 
Table 5 shows the results of our system and the results obtained by state-of-the-art systems on the same datasets. We compared our method with supervised, unsupervised and semi-supervised systems on four datasets. The supervised systems are It Makes Sense [\citenameZhong and Ng2010] (Zhong10), an open source WSD system based on support vector machines [\citenameSteinwart and Christmann2008], and the best system that participated in each competition (Best). The semi-supervised systems are: IRSTDDD00 [\citenameStrapparava, Gliozzo, and Giuliano2004], based on WordNet domains and on manually annotated domain corpora; MFS, which corresponds to the most frequent sense heuristic implemented using the WordNet corpus reader of the Natural Language Toolkit; MRFLP, based on Markov random fields [\citenameChaplot, Bhattacharyya, and Paranjape2015]; Nav05 [\citenameNavigli and Velardi2005], a knowledge-based method that exploits manually disambiguated word senses to enrich the knowledge base relations; and PP [\citenameAgirre, de Lacalle, and Soroa2014], a random walk method that uses contextual information and prior knowledge about sense distributions to compute the most important sense in a network given a specific word and its context. The unsupervised systems are: Nav10, a graph-based WSD algorithm that exploits connectivity measures to determine the most important node in the graph composed of all the senses of the words in a sentence; and a version of the PP algorithm that does not use sense-tagged resources.
The results show that our unsupervised system performs better than any other unsupervised algorithm on all datasets. On S7CG and S7 the difference is minimal compared with PP and Nav10, respectively; on S3 and S2 the difference is more substantial compared with both unsupervised systems. Furthermore, the performance of our system is more stable across the four datasets, showing a constant improvement on the state of the art.
The comparison with semi-supervised systems shows that our system always performs better than the most frequent sense heuristic when we use information from sense-labeled corpora. We can note a strange behavior on S7CG: when we use prior knowledge, the performance of our semi-supervised system is lower than that of our unsupervised system and of the state of the art. This is because on this dataset the performance of our unsupervised system is better than the results that can be achieved by using labeled data to initialize the strategy space of the semi-supervised system. On the other three datasets we can note a substantial improvement in the performance of our system, with stable results higher than those of state-of-the-art systems.
Finally, we can note that the results of our semi-supervised system on the fine-grained datasets are close to the performance of state-of-the-art supervised systems, with differences that are statistically relevant in only one case. We can also note that the performance of our system on the nouns of the S7CG dataset is higher than the results of the supervised systems.
6.2.2 Experiments with BabelNet
BabelNet is particularly useful when the number of named entities to disambiguate is high. In fact, it is not possible to perform this task using only WordNet, because its coverage of named entities is limited. For the experiments in this section we used BabelNet to collect the sense inventories of each word to be disambiguated, the mdice measure to weight the graph and NASARI to obtain the semantic representation of each sense. The similarities among the representations obtained with this resource are computed using the cosine similarity measure, described in Section 5.2.2. The only differences from the experiments presented in Section 6.2.1 are that we used BabelNet as knowledge base and NASARI as the resource to collect the sense representations, instead of WordNet.
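The way cosine similarity turns sense representations into a payoff matrix can be sketched as follows. The sparse dictionaries below are invented toy vectors, not real NASARI entries, and the names `cosine` and `payoff_matrix` are hypothetical; the sketch only assumes that each sense is represented by a weighted vector over some set of dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[dim] for dim, weight in u.items() if dim in v)
    norm_u = math.sqrt(sum(a * a for a in u.values()))
    norm_v = math.sqrt(sum(b * b for b in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def payoff_matrix(senses_i, senses_j):
    """Similarity of every sense of word i (rows) to every sense of word j (columns)."""
    return [[cosine(a, b) for b in senses_j] for a in senses_i]


# Toy representations, invented for illustration only.
tiger_golfer = {"golf": 0.9, "player": 0.4}
tiger_animal = {"cat": 0.8, "jungle": 0.6}
golf_sport = {"golf": 1.0, "club": 0.5}

m = payoff_matrix([tiger_golfer, tiger_animal], [golf_sport])
```

In this toy case the golfer sense of the mention obtains a positive payoff against the sport sense, while the animal sense obtains none, so the dynamics would push the mention toward the golfer.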
S13 consists of 13 documents in different domains, available in 5 languages (we used only English). All the nouns in these texts are annotated using BabelNet, with a total number of 1931 words to be disambiguated (English dataset). KORE50 consists of 50 short English sentences with a total number of 146 mentions manually annotated using YAGO2 [\citenameHoffart et al.2013]. We used the mapping between YAGO2 and Wikipedia to obtain for each mention the corresponding BabelNet concept, since there exists a mapping between Wikipedia and BabelNet. This dataset contains highly ambiguous mentions, which are difficult to capture without the use of a large and well organized knowledge base. In fact, the mentions are not explicit and require the use of common knowledge to identify their intended meaning.
We used the SPARQL endpoint provided by BabelNet (footnote 15: http://babelnet.org/sparql/) to collect the sense inventories of the words in the texts of each dataset. For this task we filtered the first 100 resources whose label contains the lexicalization of the word to be disambiguated. This operation can increase the dimensionality of the strategy space, but it is required because, especially in KORE50, there are many indirect references, such as Tiger to refer to Tiger Woods (the famous golf player) or Jones to refer to John Paul Jones (the Led Zeppelin bassist).
Comparison to State-of-the-Art Algorithms
S13  KORE50  
WS  70.8  75.7 
Babelfy  69.2  71.5 
SUDOKU  66.3  
MFS  66.5*  
PP  60.8  
KORE  63.9*  
GETALP  58.3 
The results of these experiments are shown in Table 6, where it is possible to see that the performance of our system is close to the results obtained with Babelfy on S13 and substantially higher on KORE50. This is because our approach enforces textual coherence, which is required when a sentence contains a high level of ambiguity, such as those proposed by KORE50. By contrast, PP performs poorly on this dataset. This is because, as attested in [\citenameMoro, Raganato, and Navigli2014], it disambiguates the words independently, without imposing any consistency requirements.
The good performance of our approach is also due to the good semantic representations provided by NASARI, which exploits a richer source of information, Wikipedia, providing larger coverage and a wider source of information than WordNet alone.
The results on KORE50 are presented as accuracy, following the custom of previous work on this dataset. As we anticipated, it contains decontextualized sentences, which require common knowledge to be disambiguated. This common knowledge is obtained by exploiting the relations in BabelNet that connect related entities, but in many cases this is not enough because the references to entities are too general, and in this case the system is not able to provide an answer. It is also difficult to exploit distributional information on this dataset, since the sentences are short and in many cases cryptic. For these reasons the recall on this dataset is well below the precision. The system does not provide answers for the entities in sentences such as: Jobs and Baez dated in the late 1970s, and she performed at his Stanford memorial, but it is able to disambiguate the same entities correctly in sentences where there is more contextual information.
7 Conclusions
In this article we have introduced a new method for WSD based on game theory. We have provided an extensive introduction to the WSD task and explained the motivations behind the choice to model the WSD problem as a constraint satisfaction problem. We conducted an extensive series of experiments to find the similarity measures that perform best in our framework. We have also evaluated our system with two different implementations and compared our results with state-of-the-art systems on different WSD tasks.
Our method can be considered a continuation of knowledge-based, graph-based and similarity-based approaches. We combined the methodologies of these three approaches in a game-theoretic framework. This model is used to perform a consistent labeling of senses. In our model we try to maximize the textual coherence by imposing that the meaning of each word in a text must be related to the meaning of the other words in the text. To do this we exploited distributional and proximity information to weight the influence that each word has on the others. We also exploited semantic similarity information to weight the strength of compatibility between two senses. This is of great importance because it imposes constraints on the labeling process, developing contextual coherence in the assignment of senses. The application of a game-theoretic framework guarantees that these assumptions are met. Furthermore, the use of the replicator dynamics equation allows us always to find the best labeling assignment.
Our system, in addition to having a solid mathematical and linguistic foundation, has been shown to perform well compared with state-of-the-art systems and to be extremely flexible. In fact, it is possible to implement new similarity measures, graph constructions and strategy space initializations to test it in different scenarios. It is also possible to use it as a completely unsupervised system or to use information from sense-labeled corpora.
The features that make our system competitive with state-of-the-art systems are the following: instead of finding the most important sense in a network to be associated with the meaning of a single word, our system disambiguates all the words at the same time, taking into account the influence that each word has on the others, and it enforces compatibility among the senses before assigning a meaning. We have demonstrated how our system can deal with sense shifts, where a centrality algorithm, which tries to find the most important sense in a network, can be deceived by the context. In our case, the weighting of the context ensures that the proximity structure of a sentence is respected and that each word is disambiguated according to the context in which it appears. This is because the meaning of a word in a sentence does not depend on all the words contained in the sentence but only on those that share a proximity (or syntactic) relation with it and those with which it has a high distributional similarity.
Acknowledgements.
This work was supported by the Samsung Global Research Outreach Program. We are deeply grateful to Rodolfo Delmonte for his insights on the preliminary phase of this work and to Bernadette Sharp for her help during the final part of it. We would also like to thank Phil Edmonds for providing us with the correct version of the Senseval 2 dataset.