Unveiling the Relationship Between Complex Networks Metrics and Word Senses
The automatic disambiguation of word senses (i.e., the identification of which of the meanings is used in a given context for a word that has multiple meanings) is essential for such applications as machine translation and information retrieval, and represents a key step for developing the so-called Semantic Web. Humans disambiguate words in a straightforward fashion, but this does not apply to computers. In this paper we address the problem of Word Sense Disambiguation (WSD) by treating texts as complex networks, and show that word senses can be distinguished upon characterizing the local structure around ambiguous words. Our goal was not to obtain the best possible disambiguation system, but we nevertheless found that in half of the cases our approach outperforms traditional shallow methods. We show that the hierarchical connectivity and clustering of words are usually the most relevant features for WSD. The results reported here shine light on the relationship between semantic and structural parameters of complex networks. They also indicate that when combined with traditional techniques the complex network approach may be useful to enhance the discrimination of senses in large texts.
Many statistical methods are now used to investigate language in attempts to understand empirical findings such as Zipf’s Law, and to model syntactic and semantic relationships between words or passages [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. The numerous studies encompass complex networks (CN) representing texts in several applications, including summarization, assessment of the quality of machine translators, authorship recognition, keyword extraction, topic identification and segmentation. In this paper we assess the use of complex network concepts for Word Sense Disambiguation (WSD), which is a crucial task for the Semantic Web and for machine translation. We analyze ambiguous words with topological measurements and show a strong relationship between senses and local features of complex networks. Indeed, for some of the ambiguous words the distinguishability with the CN approach is better than that obtained with the traditional analysis of neighbors. From an analysis of feature relevance, we found that the strength of connection of neighbors in higher hierarchies and the hierarchical clustering coefficient are the most efficient metrics to discriminate word senses.
Typical Approaches to Word Sense Disambiguation
The WSD problem has been widely studied by computer scientists and researchers interested in Natural Language Processing tasks. Even though humans can readily discriminate specific senses of a word, this is not the case for a computer. In fact, WSD is considered one of the most complex problems in Artificial Intelligence. The two conventional approaches to WSD are: i) the deep paradigm, based on a large amount of linguistic knowledge (e.g. dictionaries, thesauri or semantic networks); and ii) the shallow paradigm, which makes use of statistical techniques. The deep paradigm is in theory the best strategy as it mimics human thinking, but in practice methods requiring knowledge bases do not achieve the best performance because there is still no database that can cover human knowledge. Moreover, this paradigm is often impracticable because the manual creation of knowledge bases is an expensive, time-consuming endeavor. In contrast, simpler methods such as those based on the analysis of contexts surrounding ambiguous words have led to better performance. One of the most popular algorithms, referred to as Lesk, assumes that words in a given neighborhood tend to share a common topic, an assumption that is used by other algorithms. Actually, the analysis of contexts based on the recurrence of nearby components is so efficient that it has even been employed to decipher encoded manuscripts.
Networks have also been applied to the WSD task, and some of the network-based algorithms are now close to the state-of-the-art in disambiguation. One of the earliest works uses the network structure to store knowledge in the form of a semantic memory. Other examples include the application of random walks in semantic networks whose nodes are linked according to semantic relations provided by WordNet. With a different approach, the HyperLex algorithm connects words that co-occur in a given paragraph and uses the weight of edges (given by the relative frequency of occurrence of the corresponding connected nodes) to disambiguate words. Although these algorithms use the network representation in processing steps, they differ substantially from our strategy because they all consider the label of nodes, while we focus on the characterization of local structural properties.
In the experiments, we used a set of books to retrieve ambiguous words (save, note, march, present, jam, ring, just, bear, rock and close), which were manually disambiguated. The only criterion in choosing these words was the quite distinct meanings of each word, which minimizes possible inaccuracies in the manual disambiguation. The list of word senses and books are given respectively in Tables S1 and S2 of the Supplementary Information (SI). The text in the books was represented as networks, as explained below.
Modeling Texts as Complex Networks
The model used to represent text is known in the literature as co-occurrence or adjacency networks [6, 7]. Basically, words are represented as nodes, which are directionally linked according to the natural reading order. In other words, if a word $i$ appears immediately before word $j$ in the text, then there will be the edge $i \rightarrow j$ in the network. When a given association is repeated in the text, the weight of the corresponding edge is incremented. Before creating nodes and edges, stopwords (prepositions, articles and other high-frequency words with little semantic meaning) are removed. The full list of disregarded stopwords is shown in the SI. In addition, the remaining words are converted to their canonical form in order to group words with different inflections referring to the same concept.
Mathematically, the text network is defined by the matrix $W$, whose element $w_{ij}$ counts the number of times the word $i$ appeared before the word $j$. When defining some of the complex network measurements we also employed the non-oriented, non-weighted version, represented by the matrix $A$, so that $a_{ij} = 1$ if $i$ appeared at least once as a neighbor of $j$ (regardless of the position) and $a_{ij} = 0$ otherwise. When a word repeatedly appears in the same text, it is considered as the same node in the corresponding network. This procedure is not adopted, however, for the ambiguous words under analysis: each occurrence is taken as a distinct node in the network, so that it is possible to characterize each occurrence of an ambiguous word and correlate its structural features with its meanings.
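The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name, the dictionary-based storage of $W$ and $A$, the `word#position` labels for occurrences of ambiguous words, and the decision to skip an edge between two identical consecutive tokens are all assumptions made here. The input is assumed to be already free of stopwords and lemmatized.

```python
from collections import defaultdict


def build_cooccurrence_network(tokens, ambiguous=frozenset()):
    """Build a word adjacency network from a preprocessed token list.

    Returns (W, A): W[i][j] counts how often node i immediately precedes
    node j; A[i] is the set of neighbours of i, ignoring direction and weight.
    """
    W = defaultdict(lambda: defaultdict(int))
    A = defaultdict(set)

    nodes = []
    for position, word in enumerate(tokens):
        # Hypothetical labelling: every occurrence of an ambiguous word
        # becomes its own node, e.g. "ring#4" for the token at index 4.
        nodes.append(f"{word}#{position}" if word in ambiguous else word)

    for source, target in zip(nodes, nodes[1:]):
        if source == target:
            continue  # assumption: no self-loops for repeated words
        W[source][target] += 1  # repeated associations increase the weight
        A[source].add(target)
        A[target].add(source)
    return W, A


tokens = ["bell", "ring", "loud", "bell", "ring", "gold"]
W, A = build_cooccurrence_network(tokens, ambiguous={"ring"})
print(sorted(A["bell"]))  # the two occurrences of "ring" are separate nodes
```

Keeping one node per occurrence is what later allows each occurrence of an ambiguous word to receive its own feature vector.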
Characterization of Senses Through Complex Networks Features
To characterize the local structure of an ambiguous word, we used a set of complex network local measurements. The simplest measurement is the degree $k$, i.e., the number of connections (without considering the weight of the edges). In terms of the adjacency matrix $A$, the degree of node $i$ is computed as

$$k_i = \sum_j a_{ij}.$$

The weighted version of $k$, the strength $s$, considers the weight of the links: rather than counting the edges attached to a node, it sums their weights.
Extensions of these two measurements were considered through the analysis of further hierarchies, since the hierarchical expansion usually yields better network characterization [6, 28]. The expansion of a given node is made by merging the node under analysis with its neighbors into a single node, keeping the external connections of the neighbors [6, 28]. This procedure is then repeated to generate deeper expansions. This hierarchical characterization was adopted for both $k$ and $s$, where the $h$-th expansion is represented as $k_h$ and $s_h$. We have not made explicit use of $k_h$ and $s_h$ when $h = 1$ because these measurements take constant values as a consequence of considering each occurrence of an ambiguous word as a single node.
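A minimal sketch of the degree, the strength and their hierarchical expansion is given below. It rests on assumptions that the text does not settle: the degree of a merged node is taken as the number of distinct external neighbors, the strength as the total weight of the edges crossing the boundary of the merged set (in either direction), and `expansions=0` denotes the node itself before any merging. The function name and the toy network are hypothetical.

```python
def hierarchical_degree_strength(W, A, node, expansions=0):
    """Degree and strength of `node` after merging it with its neighbours.

    W: dict of dicts with directed edge weights, A: dict of neighbour sets.
    Each expansion merges the current set with all of its neighbours.
    """
    merged = {node}
    for _ in range(expansions):
        merged |= {nb for member in merged for nb in A.get(member, ())}

    # External neighbours of the merged node (assumed degree definition).
    external = {nb for member in merged for nb in A.get(member, ())} - merged
    degree = len(external)

    # Total weight of edges with exactly one endpoint inside the merged set
    # (assumed strength definition).
    strength = sum(
        weight
        for source, targets in W.items()
        for target, weight in targets.items()
        if (source in merged) != (target in merged)
    )
    return degree, strength


# Toy network: a -> x (1), x -> b (2), b -> c (1), b -> d (3)
W = {"a": {"x": 1}, "x": {"b": 2}, "b": {"c": 1, "d": 3}}
A = {"a": {"x"}, "x": {"a", "b"}, "b": {"x", "c", "d"}, "c": {"b"}, "d": {"b"}}

print(hierarchical_degree_strength(W, A, "x"))                # (2, 3)
print(hierarchical_degree_strength(W, A, "x", expansions=1))  # (2, 4)
```

In the toy network, node `x` has exactly one predecessor and one successor, which mirrors the situation of a single occurrence of an ambiguous word: its plain degree carries no information, while the expanded values depend on the surrounding structure.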
In addition to the local measurements, we quantified the connectivity of nodes to their neighbors. Analogously to the adoption of further hierarchies, the study of topological properties of neighbors also yields better network characterization. Indeed, neighbors have played a key role in many algorithms, such as the PageRank algorithm and its variations. In this paper, the following neighborhood-based measurements were employed: the average degree and the average strength of the neighbors, and their standard deviations. Another structural measurement used was the clustering coefficient ($C$), which is proportional to the fraction of triangles over the total number of connected triads.
It is known that a correlation exists between the number of semantic contexts where a word appears and its clustering coefficient. Since word senses might be related to the number of contexts (because distinct senses could appear in different contexts), this measurement may be useful in the disambiguation task. Similarly to the degree and strength measurements, we also computed this measurement at further hierarchical levels.
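The neighborhood-based measurements and the clustering coefficient can be illustrated as follows. This sketch assumes the common local definition of clustering (the fraction of pairs of neighbors of a node that are themselves connected) and the population standard deviation; the exact normalization used in the paper is not recoverable from the text, so both choices, as well as the function names and the toy network, are assumptions.

```python
from statistics import mean, pstdev


def clustering_coefficient(A, node):
    """Fraction of neighbour pairs of `node` that are linked to each other."""
    neighbours = sorted(A[node])
    k = len(neighbours)
    if k < 2:
        return 0.0  # no connected triad is centred on this node
    triangles = sum(
        1
        for idx, u in enumerate(neighbours)
        for v in neighbours[idx + 1:]
        if v in A[u]
    )
    return triangles / (k * (k - 1) / 2)


def neighbour_degree_stats(A, node):
    """Average and standard deviation of the degrees of the neighbours."""
    degrees = [len(A[nb]) for nb in A[node]]
    return mean(degrees), pstdev(degrees)


# Toy undirected network: i is linked to a, b and c; a and b are also linked.
A = {"i": {"a", "b", "c"}, "a": {"i", "b"}, "b": {"i", "a"}, "c": {"i"}}

print(clustering_coefficient(A, "i"))   # one triangle out of three triads
print(neighbour_degree_stats(A, "i"))
```

The strength-based counterparts follow the same pattern, with the degree of each neighbor replaced by the sum of the weights of its edges.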
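The local structure was also examined with shortest paths (or geodesic paths) between two nodes, which are paths whose sum of the edge weights is minimum. If $d_{ij}$ is the length of the shortest path between nodes $i$ and $j$ in the adjacency matrix $A$, then the average shortest path length $l_i$ for node $i$ is:

$$l_i = \frac{1}{N-1} \sum_{j \neq i} d_{ij},$$

where $N$ is the number of nodes in the network.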
In networks of text, the shortest path quantifies the centrality of a word according to its distance to the most frequent words. We chose to use this measurement to verify whether the distance from an ambiguous word to the core-content concepts of the books can be used to distinguish senses. Shortest paths were also employed to compute the betweenness of words ($B$). Let $\eta(u, i, v)$ be the number of shortest paths between nodes $u$ and $v$ that pass through node $i$. If $\eta(u, v)$ is the number of shortest paths between nodes $u$ and $v$, then $B_i$ is defined as:

$$B_i = \sum_{u} \sum_{v} \frac{\eta(u, i, v)}{\eta(u, v)}.$$
Even though we are aware of the correlation between $B$ and $k$ ($s$) [7, 31] in large networks, the possible distinct values of $B$ taken for ambiguous words will not reflect differences in word frequency, because the degree is constant for each occurrence of an ambiguous word. Actually, $B$ will reflect the ability of words to connect different network regions [31, 32] or different contexts.
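The two path-based measurements can be sketched with breadth-first searches on the non-weighted, non-oriented network. The betweenness below uses Brandes' accumulation scheme, counts each unordered pair of endpoints once and is left unnormalized; these choices, the function names, and the requirement that the network be connected and that every node appear as a key of `A` are assumptions of this sketch, not details taken from the paper.

```python
from collections import deque


def average_shortest_path(A, node):
    """Mean hop distance from `node` to every other reachable node."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in A[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    others = [d for target, d in dist.items() if target != node]
    return sum(others) / len(others)


def betweenness(A):
    """Unnormalized betweenness of every node (Brandes' algorithm)."""
    score = {v: 0.0 for v in A}
    for s in A:
        order = []                      # nodes in order of discovery
        preds = {v: [] for v in A}      # predecessors on shortest paths
        sigma = {v: 0 for v in A}       # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in A[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in A}
        for v in reversed(order):       # accumulate from the farthest nodes
            for u in preds[v]:
                delta[u] += sigma[u] / sigma[v] * (1 + delta[v])
            if v != s:
                score[v] += delta[v]
    # Every unordered pair (u, v) was visited from both endpoints.
    return {v: b / 2 for v, b in score.items()}


# Toy path network a - b - c - d
A = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(average_shortest_path(A, "a"))  # 2.0
print(betweenness(A))
```

In the toy path network the two inner nodes lie on every shortest path that crosses the middle, so they receive the same, non-zero betweenness, while the end nodes receive zero.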
The measurements employed to characterize the local structure of ambiguous words are summarized in Table 1, which contains the measurements classified according to their type (connectivity, clustering, neighborhood and paths), their notation and complexity.
|Connectivity||, and||O( + )|
|, and||O( + )|
|Clustering||, , and||[O(),O()]|
To verify whether the description provided by the measurements above is useful for the WSD task, we used machine learning algorithms that induce classifiers from the training set provided for each word. The quality of the results was then evaluated using the 10-fold cross-validation technique, which was chosen because it is robust in the sense that the training set is always different from the evaluation set. Thus, it prevents overfitted inductors from attaining high accuracy rates. Three inductor algorithms were used: the C4.5 algorithm, which generates trees based on the gain provided by each feature; the Naive Bayes algorithm, which uses Bayes' theorem; and the k nearest neighbor algorithm (kNN), which classifies an unknown instance according to the most similar instance of the training database in a normalized space including all features. Details regarding the algorithms and the cross-validation technique are given in the SI.
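The evaluation protocol can be illustrated with a small, dependency-free sketch of 10-fold cross-validation around a nearest-neighbor classifier. Several simplifications are assumed here and are not taken from the paper or its SI: features are min-max normalized over the whole data set before splitting, folds are formed by interleaving instances without shuffling or stratification, and similarity is the squared Euclidean distance. All function names are hypothetical.

```python
def normalize(rows):
    """Min-max normalize each feature (column) to the interval [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]  # avoid dividing by 0
    return [
        [(x - lo) / span for x, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


def nearest_neighbour_label(train_rows, train_labels, query):
    """Label of the training instance most similar to `query`."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(row, query)) for row in train_rows
    ]
    return train_labels[distances.index(min(distances))]


def cross_validated_accuracy(rows, labels, folds=10):
    """Fraction of instances classified correctly when held out."""
    rows = normalize(rows)
    correct = 0
    for fold in range(folds):
        test_idx = set(range(fold, len(rows), folds))
        train_rows = [r for i, r in enumerate(rows) if i not in test_idx]
        train_labels = [l for i, l in enumerate(labels) if i not in test_idx]
        for i in test_idx:
            predicted = nearest_neighbour_label(train_rows, train_labels, rows[i])
            correct += predicted == labels[i]
    return correct / len(rows)


# Toy data: two well-separated groups standing in for two senses of a word.
rows = [[0.01 * i, 1.0] for i in range(10)] + [[5 + 0.01 * i, 3.0] for i in range(10)]
labels = ["sense A"] * 10 + ["sense B"] * 10
print(cross_validated_accuracy(rows, labels))
```

Because every instance is classified by a model that never saw it during training, a classifier that merely memorizes the training set gains no advantage in the reported accuracy.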
Results and Discussion
The ambiguous words were characterized with complex networks measurements to verify if senses can be inferred from a topological analysis. Table 2 shows in the second and third columns, respectively, the accuracy rate and the corresponding p-value relative to a classification performed by assigning the most common (i.e. the most frequent) sense to the ambiguous word. A significant accuracy () could be observed in out of the words. An example of scatter plot depicting the discrimination obtained for the word “ring” is shown in Figure 1, where each axis represents a linear combination of the measurements provided by the Canonical Variable Analysis technique . These results confirm the relationship between local characteristics of adjacency networks and word senses, reinforcing the suitability of complex network methods to relate structure and semantics. We believe that the ability to distinguish senses is at least partially due to the fact that co-occurrence networks probably imply syntactic factors  that are reflected on the semantic relations . This relationship, however, is still difficult to establish because there is no consolidated interpretation for the measurements of word adjacency networks (see e.g. Ref. ).
|Word||Acc. Rate||Best Ind.||Minimum Set of Measurements|
|note||84.53 %||kNN||, and|
|march||86.95 %||kNN||, ,|
|present||71.14 %||kNN||, , and|
|jam||100.0 %||kNN||, and|
|ring||84.61 %||kNN||, , , and|
|bear||61.95 %||kNN||, and|
|rock||79.30 %||C4.5||, , , and|
|close||72.20 %||kNN||, , and|
Although our primary goal was not a search for the best possible disambiguation system, we compare our results with the traditional approach based on the analysis of frequency of nearby words. Classifiers were induced using attributes that represent the frequency of the , or words surrounding the ambiguous word. The lowest p-values and the top classifiers are shown in Table 3. For the first words the CN approach outperformed the traditional method. This means that the local structure can be even more relevant than the frequency analysis of neighbors. Obviously, we are not suggesting the CN approach to replace the approaches based on semantic information provided by neighbors, since complex network measurements are statistically reliable only when computed in large texts. Still, it could be valuable to combine both strategies in disambiguation systems. As for the methods and algorithms, the differences regarding the best classifier are worth noting. The CN approach performs better with the kNN algorithm, while the Naïve Bayes algorithm is better in the traditional approach. These differences occur probably because of the distinct number of attributes in each approach. Finally, in 6 out of the 10 words considered, the traditional analysis with neighbors outperformed the classifiers with larger numbers of neighbors.
The contribution from each network metric in discriminating word senses was estimated by first finding the smallest subset of measurements generating the best classifiers (see last column of Table 2). Although we used measurements to characterize nodes, the best accuracy rates were obtained with a maximum of measurements. Strikingly, in some cases only two measurements were already sufficient to provide a reasonable distinction, as indicated in the scatter plot for the word “save” in Figure 2. Quantitatively, the relevance of each metric (i.e. feature) for disambiguating the words was calculated in two ways: using the Kullback-Leibler (KL) divergence and the method based on the Mann-Whitney U (MWU) test. While in the latter features are evaluated individually, in the former the interaction between features is considered. Thus, it is possible to identify cases where features are useful only when combined with others. Details regarding the KL divergence and the MWU test are given in the SI and in Ref. .
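To give an idea of how a feature can be evaluated individually, the sketch below computes the Mann-Whitney U statistic of a single feature for two senses and turns it into a separation score between 0 (the two samples overlap completely) and 1 (every value of one sense exceeds every value of the other). The score and the ranking helper are illustrative assumptions; the actual procedure used in the paper is described in its SI and may differ.

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistic of sample_a: pairs with a > b count 1, ties count 0.5."""
    return sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in sample_a
        for b in sample_b
    )


def separation(sample_a, sample_b):
    """Assumed relevance score: distance of U from its chance value."""
    u = mann_whitney_u(sample_a, sample_b)
    return abs(2 * u / (len(sample_a) * len(sample_b)) - 1)


def rank_features(values_by_feature):
    """Sort feature names from most to least separating.

    values_by_feature maps a feature name to a pair of samples, one per sense.
    """
    return sorted(
        values_by_feature,
        key=lambda name: separation(*values_by_feature[name]),
        reverse=True,
    )


# Hypothetical measurements of two features for two senses of one word.
features = {
    "clustering": ([0.1, 0.2, 0.3], [0.6, 0.7, 0.8]),
    "betweenness": ([1, 4, 5], [2, 3, 6]),
}
print(rank_features(features))  # ['clustering', 'betweenness']
```

A score of this kind looks at one feature at a time, which is why it cannot reveal features that become useful only in combination with others.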
The rankings shown in Table 4 indicate that the relevance of a metric varies from word to word. In addition, a metric may be relevant when analyzed individually but not so if combined with other attributes because some features included in the first method may not be included in the second one. According to the KL divergence, the most frequent relevant metrics (i.e. the ones which appear among the top ) are and . For the MWU test, the clustering computed at high hierarchical level () is also a relevant feature along with and . Therefore, these results suggest that meanings are often correlated with the strength (or frequency) of higher-order neighbors and with the degree of interconnection of neighbors. Interestingly, was not so relevant when combined with other features as it appeared only a few times in the MWU ranking.
|Method||Kullback-Leibler||Mann-Whitney U test|
In this paper we have verified the suitability of the complex network model for the word sense disambiguation task in large texts. Upon characterizing the local structure of nodes representing ambiguous words, we obtained significant discrimination, which means that different senses affect the structural organization of complex networks. Strikingly, the discrimination was so effective for some words that the topological characterization outperformed traditional shallow methods. In general, the hierarchical characterizations of the clustering and connectivity measurements were the most relevant features for WSD, even though the ranking of metrics varied from word to word. The analysis here may shed light on the relationship between the structure of complex networks and semantics. From a practical standpoint, the methodology described might be useful in hybrid approaches to improve state-of-the-art disambiguation systems. Given an extensive set of texts, it is possible to obtain networks with the local characterization of nodes representing words whose meaning is known beforehand. Then, an ambiguous word of a book could be disambiguated by assigning meanings according to the semantic (traditional approach) and topological (CN approach) features provided by the training set. In future work, we plan to use wider window sizes to connect words and additional complex network measurements, such as weighted versions of the shortest path, clustering coefficient and betweenness, along with CN-based classification algorithms, to improve the performance of disambiguation systems in long texts. We shall also study how the results change when the proposed methodology is applied to other languages.
The authors acknowledge the financial support from CNPq (Brazil) and FAPESP (2010/00927-9).
-  Page S. E., Diversity and Complexity (Princeton University Press) 2010.
-  Manning C. D. Schütze H., Foundations of Statistical Natural Language Processing (MIT Press) 1999, p. 24.
-  Costa L. F., Sporns O., Antiqueira L., Nunes M. G. V. Oliveira Jr. O. N., Applied Physics Letters, 91 (2007)
-  Ferrer i Cancho R. Sole R. V., Procs. Natl. Acad. Sci. USA, 100 (2003), 788.
-  Antiqueira L., Oliveira Jr. O. N., Costa L. F. Nunes M. G. V., Information Sciences, 179 (2009), 584.
-  Amancio D. R., Nunes M. G. V., Oliveira Jr. O. N., Pardo T. A. S., Antiqueira L. Costa L. F., Physica A, 390 (2011), 131.
-  Amancio D. R., Altmann E. G., Oliveira Jr. O. N. Costa L. F., New Journal of Physics 390 (2011), 13.
-  Mihalcea R., Tarau P. Figa E., Proceedings of the Conference on Empirical Methods in Natural Language Processing, (2004), 404.
-  Kenett Y. N., Kenett D. Y., Ben-Jacob E. Faust M., PLoS ONE 6 8 (2011), e23912.
-  Sigman M. Cecchi G. A., Proc. Natl. Acad. Sci. 99 (2002), 1742-1747.
-  Steyvers, M., Tenenbaum, J. B., Cogn. Sci. 29 (2005), 41-78.
-  Alvarez-Lacalle E., Dorow B., Eckmann J.-P., Moses E., Proc. Natl. Acad. Sci. 103 (2006), 7956-7961.
-  Amancio D. R., Oliveira Jr. O. N. Costa L. F., Journal of Statistical Mechanics P01004 (2012).
-  Costa L. F., Rodrigues F. A., Travieso G. Villas Boas P. R., Advances in Physics, 9 (2007).
-  Coursey K., Mihalcea R. Moen W., Proceedings of the Conference on Natural Language Learning, (2009), 210.
-  Malioutov I. Barzilay R., Proceedings of the 21st International Conference on Computational Linguistics, (2006), 25.
-  Berners-Lee T., Hendler J. Lassila O., Scientific American Magazine, 26 (2008).
-  Weaver W., Machine Translation of Languages: Fourteen Essays (Technology Press of MIT, Cambridge, MA, and John Wiley & Sons, New York, NY. ) 1955.
-  Navigli R., ACM Computing Surveys, 41 (2009) 2, 1.
-  Mallery J. C., Thinking about foreign policy: Finding an appropriate role for artificial intelligence computers. (Ph.D. dissertation. MIT Political Science Department, Cambridge, MA.) 1988.
-  Yarowsky D., Proceedings of the ARPA Workshop on Human Language Technology, (1993).
-  Lesk M., Proceedings of the 5th annual international conference on Systems documentation, (1986), 24.
-  Knight K., Megyesi B. Schaefer C., Proceedings of the 4th Workshop on Building and Using Comparable Corpora, 2011, 2.
-  Mihalcea R. Radev D., Graph-Based Natural Language Processing and Information Retrieval. (Cambridge University Press) 2011.
-  Asztalos A. Toroczkai Z., Europhysics Letters, 92, (2010).
-  Mihalcea R., Tarau P. Figa E., Proceedings of the 20th International Conference on Computational Linguistics, (2004), 1126.
-  Veronis J., Comput. Speech Lang. 18 (2004) 3, 223.
-  Costa L.F. Silva F. N., Journal of Statistical Physics, 125 (2006), 4.
-  Costa L. F., Rodrigues F. A., Hilgetag C. C. Kaiser M., Europhysics Letters 87, (2009).
-  Page L., Brin S., Motwani R. Winograd T., The PageRank Citation Ranking: Bringing Order to the Web Stanford InfoLab. (Technical Report) 1999.
-  Barrat A., Barthelemy M. Vespignani A., Dynamical Processes on Complex Networks. (Cambridge University Press) 2008.
-  Goh K. I., Kahng B. Kim D., Phys. Rev. Lett. 87, (2001) 278701.
-  Newman M. E. J., Networks: An Introduction (Oxford University Press) 2010
-  Kohavi R., Proceedings of the International Joint Conference on Artificial Intelligence, 2 (1995), 12.
-  Bishop C. M., Pattern Recognition and Machine Learning (Springer) 2006
-  Duda R. O., Pattern Classification 2000
-  Gamallo P., Gonzalez M., Agustini A. Lopes G. Lima V. S., Workshop on Machine Learning and Natural Language Processing for Ontology Engineering (2002).
-  Romacker M., Markert K. Hahn U., International Joint Conferences on Artificial Intelligence (1999), 868.
-  Mann H. B. Whitney D. R., Annals of Mathematical Statistics 18 (1947) 1.
-  Silva T. C. Zhao L., IEEE Transactions on Neural Networks 23 (2012), 1-16.