GASC: Genre-Aware Semantic Change for Ancient Greek

Valerio Perrone
Amazon, Berlin (work done prior to joining Amazon)

Marco Palma
University of Warwick

Simon Hengchen
University of Helsinki

Alessandro Vatri
University of Oxford
The Alan Turing Institute

Jim Q. Smith
University of Warwick

Barbara McGillivray
University of Cambridge
The Alan Turing Institute

Word meaning changes over time, depending on linguistic and extra-linguistic factors. Associating a word’s correct meaning in its historical context is a critical challenge in diachronic research, and is relevant to a range of NLP tasks, including information retrieval and semantic search in historical texts. Bayesian models for semantic change have emerged as a powerful tool to address this challenge, providing explicit and interpretable representations of semantic change phenomena. However, while corpora typically come with rich metadata, existing models are limited by their inability to exploit contextual information (such as text genre) beyond the document time-stamp. This is particularly critical in the case of ancient languages, where lack of data and long diachronic span make it harder to draw a clear distinction between polysemy and semantic change, and current systems perform poorly on these languages. We develop GASC, a dynamic semantic change model that leverages categorical metadata about the texts’ genre information to boost inference and uncover the evolution of meanings in Ancient Greek corpora. In a new evaluation framework, we show that our model achieves improved predictive performance compared to the state of the art.


1 Introduction

Change and its precondition, variation, are inherent in languages. Over time, new words enter the lexicon, others become obsolete, and existing words acquire new senses. These changes are grounded in cognitive, social, and contextual factors, and can be realized in different ways. For example, in Old English thing meant ‘a public assembly’ (in the remainder of this paper, we use emphasis to refer to a word and ‘single quotes’ for any of its senses), while currently it more generally means ‘entity’. Semantic change research has a number of practical applications beyond historical linguistics research, including new sense detection in computational lexicography and information retrieval for historical texts that allows a search to be restricted to certain word senses (e.g. the old sense of the English adjective nice as ‘silly’). To take an example from recent semantic change in English, the verb tweet used to be uniquely associated with birds’ sounds and has recently acquired a new sense related to the social media platform Twitter. However, in this as in many other cases, the original sense co-exists with the new one, and specific contexts or genres will select one over the other. This is known as synchronic variation, and can be successfully modelled probabilistically, as advocated by several authors (see e.g. jenset). The close relationship between innovation and variation is well-known in historical linguistics, and critical to ancient languages, for which balanced corpora are not available due to the limited amount of data at our disposal; therefore models need to explicitly account for confounding variables like genre.

To address these challenges, we introduce GASC (Genre-Aware Semantic Change), a novel dynamic Bayesian topic model for semantic change. In this model, the evolution of word senses over time is based not only on distributional information of lexical nature, but also on additional features, specifically genre. This allows GASC to decouple sense probabilities and genre prevalence, which is critical with genre-unbalanced data such as ancient languages corpora. The value of incorporating genre information in the model goes beyond literary corpora and historical language data and can be applied to recent data spanning over a period of time where text type information is critical, for example in specialized domains. Explicitly modelling genres also makes it possible to address a number of additional questions, revealing the genre most likely associated to a given sense, the most unusual sense for a genre, and which genres have the most similar senses. Naturally, this framework can be applied to other kinds of categorical metadata about the text, such as author, geography, or style.

Figure 1: Distribution of mus ‘mouse’/‘muscle’/‘mussel’ by genre over time vs. distribution of its senses over time. The coloured lines track proportions of mus in each genre and century; the bars show the mus occurrence proportions with each sense and century. The grey lines inside the bars show the confidence intervals.

Ancient Greek is an insightful test case for several reasons. First, Ancient Greek words tend to have a particularly high number of different senses (bakker_register_2010), and the extant corpus of Ancient Greek texts displays a large number of literary genres. Second, we can use data spanning over several centuries. Third, Ancient Greek scholarship provides high-quality data to validate automatic systems. Top-quality transcribed Ancient Greek texts are available, thus eliminating the need for OCR correction.

Finally, polysemous words are particularly sensitive to register variation and the distribution of senses can vary greatly across registers (Leiwoetal2012). As most extant texts are literary and relatively conservative from a linguistic perspective, we expect genre and register to play a significant role in the variation of sense distributions in polysemous words. The word mus, for instance, can mean ‘mouse’, ‘muscle’, or ‘mussel’. As Figure 1 shows, the distribution of ‘muscle’ over time (light blue bars) closely follows the distribution of this word in technical genres over time (red line), suggesting that the effect of genre should be incorporated in semantic change models.

2 Related work

Semantic change in historical languages, especially on a large scale and over a long time period, is an under-explored but impactful research area. Previous work has mainly been qualitative in nature, due to the complexity of the phenomenon (cf. e.g. Leiwoetal2012). In recent years, NLP research has made great advances in the area of semantic change detection and modelling (for an overview of the NLP literature, see tang2018state and tahmasebi2018survey), with methods ranging from topic-based models (boydgraber2007; cook2014novel; lau2014learning; wijaya2011understanding; frermann2016bayesian), to graph-based models (mitra2014s; mitra2015automatic; tahmasebi2017finding), and word embeddings (kim2014temporal; Basile2018; kulkarni2015statistically; hamilton2016diachronic; dubossarsky2017outta; tahmasebi2018study; rudolph2018dynamic). However, such models are purely based on words’ lexical distribution information (such as bag-of-words) and do not account for language variation features such as text type, since genre-balanced corpora are typically used.

With the exception of Bamman2011 and Rodda2016, no previous work has focussed on ancient languages. Recent work on languages other than English is rare but exists: falk2014quenelle use topic models to detect changes in French, whereas cavallin2012automatic and tahmasebi2018study focus on Swedish, with the comparison of verb-object pairs and word embeddings, respectively. zampieri2016modeling use SVMs to assign a time period to text snippets in Portuguese, and tang2016semantic work on Chinese newspapers using S-shaped models. Most work in this area focusses on simply detecting the occurrence of semantic change, while frermann2016bayesian’s system, SCAN, takes into account synchronic polysemy and models how the different word senses evolve across time.

The work we present bears important connections with the topic model literature. The idea of enriching topic models with document-specific author meta-data has been explored in Rosen-Zvi2004 for the static case. Several time-dependent extensions of Bayesian topic models have been developed by the machine learning community, with a number of parametric and nonparametric approaches (bleilafferti; gamma; xing; topictime; perrone2017). In this paper we transfer such ideas to the semantic change domain, where each datapoint comes in the form of a bag of words associated to a single sense (rather than a mixture of topics). Excluding cases of intentional ambiguity, which we expect to be rare, we can safely assume that there are generally no ambiguities in a context, and each word instance maps to a single sense.

3 The model

We start with a lemmatized corpus which has been pre-processed into a set of text snippets, each containing an instance of the target word. Every snippet corresponds to a fixed-size window W, i.e. a set of words to the left and to the right of an instance of the target word. The inferential task is to detect the sense associated to the target word in the given context, and to describe the evolution of all sense proportions over time.
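
As an illustration, the snippet-extraction step might look as follows (a sketch under our assumptions; the function name and input format are ours, not part of the original pipeline):

```python
def snippets(lemmas, target, w):
    """Collect fixed-size context windows around each occurrence of `target`.

    Each snippet is the bag of (up to) `w` lemmas on either side of the
    target word, excluding the target itself.
    """
    out = []
    for i, lemma in enumerate(lemmas):
        if lemma == target:
            left = lemmas[max(0, i - w):i]       # up to w lemmas before
            right = lemmas[i + 1:i + 1 + w]      # up to w lemmas after
            out.append(left + right)
    return out
```

Each returned snippet is one datapoint for the model: a bag of context words associated with a single (latent) sense of the target.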

The generative model for GASC is presented in Algorithm 1 and illustrated by the plate diagram in Figure 2. First, suppose that throughout the corpus the target word is used with K different senses, where we define a sense k at time t as a distribution over words from the dictionary. Based on the intuition that each genre is more or less likely to feature a given sense, we assume that each of the G possible text genres determines a different distribution over senses. Each observed document snippet d is then associated with a genre-specific distribution over senses ψ^{t, g_d} at time t, where g_d is the observed genre for document d. Crucially, conditioning on the observed genre we have a specific distribution over senses, which accounts for genre-specific word usage patterns. On the other hand, to make sure senses can be uniquely identified across genres, we associate each sense to the same probability distribution over words for all genres. We let word and sense distributions evolve over time with changes drawn from a Gaussian, ensuring smooth transitions. The degree of coupling between sense probabilities over time is controlled by κ^ψ, the sense probability precision parameter: the larger κ^ψ, the stronger the coupling between the sense probabilities over time. We place a Gamma prior over κ^ψ with hyperparameters a and b, and infer it from the data. We fix κ^φ, the word probability precision parameter.

The model can be applied to different inferential goals: we can focus on the evolution of the sense probabilities or on the changes within each sense. For each of these aims, we can use several combinations of the hyperparameters a and b (which determine the prior from which κ^ψ is drawn) and κ^φ. Specifically, we consider the following three settings. Setting 1: the hyperparameter values used in frermann2016bayesian. Setting 2: a higher κ^φ, which aims at enforcing less variation within senses over time. Setting 3: the same high κ^φ, which still keeps the bag of words stable for each sense, combined with a prior on κ^ψ that induces less smoothing of the sense probabilities over time. Setting 3 allows the probabilities to vary widely from one century to another. We also expect the high value of κ^φ to reduce the likelihood of dramatic changes within the same sense across contiguous time periods and to favour the emergence of new senses. If not otherwise specified, we use Setting 3. Finally, note that an extra parameter of the model is the window size W, namely the number of words surrounding an instance of the target. While larger values increase the range of dependencies that can be captured by the model, this tends to introduce noise as wider windows can include irrelevant contextual words.

Draw κ^ψ ∼ Gamma(a, b);
for time t do
       for genre g do
             Draw sense distribution ψ^{t,g};
       end for
      for sense k do
             Draw word distribution φ^{t,k};
       end for
      for document d do
             Let g_d be the observed genre;
             Draw sense z_d ∼ Mult(ψ^{t,g_d});
             for context position i do
                    Draw word w_{d,i} ∼ Mult(φ^{t,z_d});
             end for
       end for
end for
Algorithm 1 GASC generative model

Figure 2: GASC plate diagram with 3 time periods.
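
The generative process of Algorithm 1 can be forward-simulated with a short sketch. All variable names and numeric values below are illustrative, and we realize the Gaussian dynamics in one common way, as random walks on unnormalized log-weights mapped through a softmax; the paper's exact parameterization may differ in detail:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def generate(T=3, G=2, K=3, V=50, D=4, n_ctx=5, a=1.0, b=1.0,
             kappa_phi=10.0, seed=0):
    """Forward-simulate the GASC generative model (illustrative values).

    Sense (psi) and word (phi) distributions live on the log scale and
    evolve as Gaussian random walks whose step size shrinks as the
    precision (kappa) grows, giving the smooth transitions described
    in the text.
    """
    rng = np.random.default_rng(seed)
    kappa_psi = rng.gamma(a, 1.0 / b)            # kappa_psi ~ Gamma(a, b)
    psi = np.zeros((T, G, K))                    # log sense weights per genre
    phi = np.zeros((T, K, V))                    # log word weights per sense
    corpus = []
    for t in range(T):
        prev_psi = psi[t - 1] if t > 0 else 0.0
        prev_phi = phi[t - 1] if t > 0 else 0.0
        psi[t] = prev_psi + rng.normal(0.0, kappa_psi ** -0.5, (G, K))
        phi[t] = prev_phi + rng.normal(0.0, kappa_phi ** -0.5, (K, V))
        for _ in range(D):
            g = rng.integers(G)                          # observed genre g_d
            z = rng.choice(K, p=softmax(psi[t, g]))      # sense z_d
            words = rng.choice(V, size=n_ctx, p=softmax(phi[t, z]))
            corpus.append((t, g, z, words))
    return corpus
```

Note how the genre `g` is observed (sampled uniformly here only for the simulation) while the sense `z` is drawn from the genre-specific distribution, which is exactly the decoupling of sense probabilities and genre prevalence described above.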

3.1 Inference

For posterior inference we extend the blocked Gibbs sampler proposed in frermann2016bayesian. Specifically, the full conditional is available for the snippet-sense assignment, while to sample the sense and word distributions we adopt the auxiliary variable approach from Mimno2008. The sense precision parameters are drawn from their conjugate Gamma priors. For the distribution over genres we proceed as follows. First, sample the distribution over senses for each genre following Mimno2008. Then, sample the sense assignment conditioned on the observed genre from its full conditional, p(z_d = k | w_d, g_d) ∝ ψ^{t,g_d}_k ∏_i φ^{t,k}_{w_{d,i}}. This setting easily extends to sampling genre assignments for tasks where, for example, some genre metadata are missing.
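
For illustration, sampling a snippet's sense from this full conditional might look like the following sketch (our own minimal version, assuming ψ and φ have already been mapped to the probability scale for the relevant time slice):

```python
import numpy as np

def sample_sense(words, psi_tg, phi_t, rng):
    """Sample z_d from p(z_d = k | ...) ∝ psi_tg[k] * prod_i phi_t[k, w_i].

    `words`: indices of the snippet's context words;
    `psi_tg`: (K,) sense probabilities for the snippet's genre and time;
    `phi_t`: (K, V) word probabilities per sense.
    Computed in log space for numerical stability.
    """
    logp = np.log(psi_tg) + np.log(phi_t[:, words]).sum(axis=1)
    logp -= logp.max()                 # guard against underflow
    p = np.exp(logp)
    return rng.choice(len(psi_tg), p=p / p.sum())
```

In the actual sampler this draw alternates with the Mimno2008 auxiliary-variable updates for ψ and φ and the conjugate Gamma updates for the precisions.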

4 Ancient Greek corpus

We used the Diorisis Annotated Ancient Greek Corpus (diorisis2018), which consists of 10,206,421 words and is lemmatized and part-of-speech-tagged. The corpus contains 820 texts spanning between the beginnings of the Ancient Greek literary tradition (8th century BC) and the 5th century AD. The corpus covers a number of Ancient Greek literary and technical genres: poetry (narrative, choral, epigrams, didactic), drama (tragedy, comedy), oratory, philosophy, essays, narrative (historiography, biography, mythography, novels), geography, religious texts (hymns, Jewish and Christian Scriptures, theology, homilies), technical literature (medicine, mathematics, natural science, tactics, astronomy, horsemanship, hunting, politics, art history, rhetoric, literary criticism, grammar), and letters (see Table 1). In technical texts, we expect polysemous words to have a technical sense. On the other hand, in works more closely representing general language (comedy, oratory, historiography) we expect the words to appear in their more concrete and less metaphorical senses; in a number of genres such as philosophy and tragedy, we cannot assume that this distribution holds. Whilst genre-annotated corpora are not especially common in NLP, where most tasks rely on specific genres (e.g. Twitter) or on genre-balanced corpora such as COHA (davies2002corpus), they are more prevalent within the humanities, and especially the classics. Additionally, research on automated genre identification has been flourishing for decades (e.g. kessler1997automatic), making the need for genre information in a potential corpus less of a hindrance than might be thought.

Genres Word counts per century Total
Comedy 79K 16K 95K
Essays 4K 475K 263K 361K 24K 1,127K
Letters 10K 1K 10K 164K 185K
Narrative 335K 209K 311K 661K 962K 483K 411K 98K 3,470K
Oratory 58K 529K 185K 296K 3K 56K 1,127K
Philosophy 895K 113K 213K 1,221K
Poetry 199K 16K 21K 81K 4K 6K 23K 18K 60K 127K 555K
Religion 16K 132K 463K 134K 45K 18K 808K
Technical 104K 327K 15K 386K 24K 394K 158K 4K 1,412K
Tragedy 207K 207K
Total 199K 32K 804K 1,990K 540K 467K 1,053K 1,780K 1,617K 1,174K 424K 127K 10,206K
Table 1: Genre and time distribution of texts in the Ancient Greek corpus (counts rounded to the nearest thousand).

5 Evaluation framework

Evaluating the performance of models tackling lexical semantic change is notoriously challenging. Frameworks are either lacking or focus on very specific types of sense change (schlechtweg2018durel; tahmasebi2018survey). Exceptions are kulkarni2015statistically, Basile2018 and hamilton2016diachronic, who focus on the change points of word senses. However, in the case of Ancient Greek (and other historical languages), semantic change is so closely intertwined with polysemy, and the corpora contain so many gaps and such an uneven distribution of text genres, that it is very hard to find a specific point in time when a new sense emerged in the language. Therefore, it is more appropriate to take a probabilistic approach to modelling sense distribution, and to devise an evaluation approach that fits this. Although historical dictionaries and traditional philology do describe the evolution of words’ senses across time, they do not necessarily reflect the evidence from corpora on which models can be evaluated, and often only provide insights into the appearance of a new sense, rather than the relative predominance of a word’s senses across time. These reasons led us to craft a novel evaluation dataset and framework, which has the advantage of reflecting the data on which the model is evaluated, and allows for a finer-grained evaluation of the predominance of a word’s senses across time.

5.1 Log-likelihood evaluation

First, we compared GASC against the current state of the art (SCAN) in terms of log-likelihood of held-out data. We chose 50 target words in the corpus that could be identified as polysemous (e.g. the verb legō, whose senses are ‘gather’ and ‘tell’) based on expert judgment. Seventeen words were selected from the technical vocabulary of Greek aesthetics, whose polysemy has been described in the secondary literature (pollitt_ancient_1974), and 33 words were manually selected from the highest-frequency lemmas in the Diorisis corpus. The necessity to manually identify suitable words led us to limit their overall number to 50. For each one of these target words, we randomly divided the corpus into a training (80%) and a test set (20%). Results on the 50-word dataset are reported in Section 6.1.

5.2 Expert annotation

To evaluate our method against ground truth, we proceeded as follows. First, two Ancient Greek experts determined, for each of three target words (mus ‘mouse’/‘muscle’/‘mussel’, harmonia ‘fastening’/‘agreement’/‘stringing (musical scale, melody)’, and kosmos ‘order’/‘world’/‘decoration’), the range of its possible senses. This was achieved using the standard scholarly Ancient Greek-English dictionary (Liddell:1996) and existing philological evidence (pollitt_ancient_1974). The target words were selected (a) from the vocabulary of Ancient Greek aesthetics, which includes numerous terms that are used with an abstract metaphorical sense and have a concrete counterpart in the general vocabulary, and as such has been the subject of much philological research; (b) from high-frequency polysemous terms. In addition, these words are attested in most of the time periods covered by the corpus and across different literary genres. Once the set of possible senses was created, experts manually annotated the whole corpus by tagging the senses of the target words in context, and we plan to publish this dataset in the future. Table 2 shows an example from the annotated dataset for the word kosmos.

date genre author work target word sense id
-335 Technical Aristotle De Mundo kosmos kosmos:world
Table 2: Example from the annotated dataset displaying a sentence containing the target word kosmos and its expert-assigned sense ‘world’. The date of the text is given as a negative number because it refers to the year 335 BC. This instance refers to the sentence Tou de sumpantos ouranou te kai kosmou sphairoeidous ontos kai kinoumenou kathaper eipon “The whole of the heaven, the whole cosmos, is spherical, and moves continuously, as I have said”.

The annotators also marked the cases in which the semantic annotation was purely based on the corpus context of the target words, which is the evidence base on which the model can rely (category “collocates”). Only the annotations that were based on collocates were retained in the evaluation. Using this information, the relative frequency of each sense usage for each target word in any time slice becomes computable, and was used to create ground-truth data on the diachronic predominance of a word’s senses as reflected in the corpus.
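
Given annotations filtered to the ‘collocates’ category, the ground-truth sense proportions per time slice can be computed in a few lines (the input format here is our assumption, not the published dataset's schema):

```python
from collections import Counter, defaultdict

def sense_proportions(annotations):
    """Per-century relative frequency of each expert-assigned sense.

    `annotations`: iterable of (century, sense) pairs, already filtered
    to the 'collocates' category. Returns {century: {sense: proportion}}.
    """
    counts = defaultdict(Counter)
    for century, sense in annotations:
        counts[century][sense] += 1
    return {c: {s: n / sum(cnt.values()) for s, n in cnt.items()}
            for c, cnt in counts.items()}
```

These proportions are the ground-truth curves against which the model's inferred sense distributions are compared (as in the stacked bars of Figure 4).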

5.3 Automatic sense labelling

For every time period t, inferred sense k, and genre g, GASC outputs a distribution of words with associated probabilities. For instance, the output for kosmos (‘order’, ‘world’, or ‘decoration’) in oratory at time t = 0 contains the following word distributions (only the first, second, and fifth senses are displayed for reasons of space):

Sense 1: aêr (0.069); mousikos (0.059); gê (0.056); harmonia (0.034); ouranos (0.033); logos (0.030); gignomai (0.021); sphaira (0.021); pselion (0.020); apaiteô (0.019);
Sense 2: polis (0.035); asebeia (0.014); politeia (0.012); proteros (0.012); naus (0.012); pentêkonta (0.011); aei (0.011); hama (0.011); peripeteia (0.011); oikia (0.011);
Sense 5: phulassô (0.138); eucharisteô (0.097); argureos (0.069); epimeleomai (0.068); pothen (0.038); dakruon (0.027); apophainô (0.026); chruseos (0.025); Nikê (0.025); olochrusos (0.022);

These distributions can be interpreted by experts based on the meanings of the words they group and thus associated to the meanings of the target word. In this example, the word list for the first sense includes aêr (‘air’), gê (‘earth’), ouranos (‘sky’), and sphaira (‘sphere, globe’), which point to the meaning of kosmos as ‘world’. The list for the second sense includes polis (‘city’), asebeia (‘impiety’), politeia (‘constitution’), and oikia (‘household’), which point to the meaning of kosmos as ‘order’. Lastly, the list for the fifth sense includes argureos (‘of silver’), apophainô (‘show, display’), chruseos (‘golden’), and olochrusos (‘of solid gold’), pointing to the meaning of kosmos as ‘decoration’.

On the other hand, the expert annotation provides lists of occurrences of the target words in their corpus context, each associated to a sense label. In the example displayed in Table 2, the sense label is ‘kosmos-world’ and we can associate lemmas such as ouranos ‘sky’ and sphairoeides ‘spherical’ to this sense because these lemmas occur in the corpus context of this target word.

In order to evaluate the model’s output against the expert annotation, we need a way to automatically match the lists of words associated to each sense by the model to the sense labels assigned by the annotators. To achieve this, we devised the following sense-labelling strategy that matches the word senses assigned by the annotators (denoted by s_e) with the senses outputted by the model (denoted by s_m).

First, we aimed to measure how closely each model sense s_m matches each expert sense s_e. We assigned a confidence score to every possible pair (s_m, s_e) by relying on the words associated to s_m in the model’s output and the words co-occurring with the target word in a given expert-assigned sense. In the example for kosmos, for s_m we compare words from the model output, such as ouranos ‘sky’ and sphairoeides ‘spherical’, with words from the context of the annotated sentences, such as aêr ‘air’, gê ‘earth’, and ouranos ‘sky’. We therefore considered two elements. For the words from the model output, we consider the normalized probability with which these words are associated to the model sense s_m, i.e. p(w | s_m). In the example for kosmos, aêr ‘air’ is associated to probability 0.069, gê ‘earth’ to 0.056, and ouranos ‘sky’ to 0.033. For the context words from the annotated dataset, we consider the degree by which these words are associated to an expert sense. In the example of kosmos from Table 2, this is calculated based on how many different senses a context word like ouranos ‘sky’ or sphairoeides ‘spherical’ is associated to. To measure this degree of association we define the expert score of a word as 1 divided by the number of senses assigned by the experts to this word. If the word is associated to only one sense in the annotated dataset, its expert score will be highest (1). If a word is not assigned to the sense s_e by the experts, its expert score is 0.

The formula for the confidence score of a pair of model sense s_m and expert-assigned sense s_e is as follows:

conf(s_m, s_e) = Σ_w p(w | s_m) · expert(w, s_e),

where the sum runs over the words w in the model output for s_m, p(w | s_m) is the normalized model probability, and expert(w, s_e) is the expert score of w for s_e.

The confidence score is highest when both elements, p(w | s_m) and expert(w, s_e), are highest for all words. In the extreme cases, p(w | s_m) will be 1 if the model estimated that a word is very strongly associated to sense s_m, and expert(w, s_e) is 1 if the word is only found in contexts labelled as s_e by the experts. This points to s_m and s_e being associated to the same words, and therefore being the same sense. On the other hand, the confidence score is lowest when s_m and s_e do not share any words.

In contrast with clustering overlap measures such as purity or the Rand index (which compare sets of words), we use this weighting to ensure that words with a higher model-estimated probability that are uniquely associated to a sense (and thus have an expert score of 1) weigh more than other words.
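
As a sketch, this confidence score can be implemented directly; the dictionary-based input formats below are our own assumptions:

```python
def confidence(model_words, expert_senses, s_e):
    """Confidence that a model sense matches the expert sense `s_e`.

    `model_words`: {word: normalised probability} for one model sense;
    `expert_senses`: {word: set of expert senses the word co-occurs with}.
    Each word contributes its model probability weighted by its expert
    score, 1 / (number of expert senses it is associated to), or 0 if the
    word never appears in contexts labelled `s_e`.
    """
    score = 0.0
    for w, p in model_words.items():
        senses = expert_senses.get(w, set())
        if s_e in senses:
            score += p * (1.0 / len(senses))
    return score
```

A word such as ouranos ‘sky’, which has high model probability and occurs only in ‘world’-labelled contexts, thus contributes much more than a word spread across several expert senses.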

The confidence scores were used to find the best matching pairs (s_m, s_e): for every expert sense s_e we selected the model sense(s) s_m for which the confidence score was higher than the random baseline (1 divided by the number of expert senses) and higher than the sum of the second and third best confidence scores, when possible. We consider NA as an additional expert sense whenever the expert assigned a sense based on factors other than lexical context.
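
The selection rule just described can be sketched as follows (a simplified reading that keeps only the single best candidate per expert sense; the data structures are illustrative):

```python
def match_senses(conf, n_expert_senses):
    """For each expert sense, keep the best model sense if its confidence
    beats both the random baseline (1 / number of expert senses) and, when
    at least three scores exist, the sum of the second- and third-best
    confidence scores.

    `conf`: {expert_sense: {model_sense: confidence score}}.
    """
    baseline = 1.0 / n_expert_senses
    matches = {}
    for s_e, scores in conf.items():
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        best_m, best_c = ranked[0]
        threshold = baseline
        if len(ranked) >= 3:
            threshold = max(baseline, ranked[1][1] + ranked[2][1])
        if best_c > threshold:
            matches[s_e] = best_m
    return matches
```

The second threshold guards against ambiguous matches: a model sense is only accepted when it clearly dominates the runner-up candidates.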

After matching the model-estimated senses to the expert-assigned senses, we calculated precision and recall metrics to measure whether the words associated to a given sense by the model were correct. For every target word and for every matched pair (s_m, s_e), we considered a word to be correctly assigned to a sense by the model if this word also appeared within a 5-word window of the target word in the expert annotation for s_e. In the example above for kosmos, with s_m the first sense and s_e = ‘kosmos-world’, one such word is ouranos ‘sky’, because it appears in the model output for s_m and in the context window of a sentence labelled as ‘kosmos-world’ by the annotators. Moreover, we decided to weight every word by the probability that the model assigned to it, so as to take into account the fact that some words are more strongly associated to a sense than others.

Therefore, we defined precision as the ratio between the number of words correctly assigned to s_m, weighted by their respective normalised model-estimated probabilities, and the number of words assigned to s_m by the model. Note that our precision metric is based on the distributional hypothesis whereby words occurring in similar contexts tend to exhibit similar meanings. We computed this metric after stop word removal, which limited the amount of noise by excluding uninformative contextual words. We fixed the window size to the same value as in SCAN for all methods to ensure a fair comparison. We have defined precision in terms of the words assigned to a sense by the model that also appear within a 5-word window of the target word in the expert annotation for that sense. The reason for this is that our model, like SCAN, only considers those context words when determining the target word’s senses, and that for the evaluation against the ground truth we only retained the cases in which the annotators were able to disambiguate the words based purely on their context.

We defined recall as the ratio between the number of all words correctly assigned to s_m (weighted by their respective probabilities) and the number of words assigned to sense s_e by the experts (weighted by their expert scores). For each model, the precision and recall scores for each pair (s_m, s_e) were averaged and used as the final scores. Since recall directly depends on the number of expert words, this metric can only be used to compare the performance of models for a specific target word. While the proposed assessment method focusses on evaluating dynamic topic models, it can be generalised to any probabilistic model by considering the posterior probability of the gold word sense.
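
A minimal sketch of these weighted metrics, under assumed input formats (the function and argument names are ours):

```python
def weighted_precision_recall(model_words, correct, expert_words):
    """Precision and recall for one matched (model sense, expert sense) pair.

    `model_words`: {word: model probability} for the model sense;
    `correct`: the set of model words that also appear within the context
    window of the target in the expert annotation for the matched sense;
    `expert_words`: {word: expert score} for the expert sense.
    Precision weights correctly assigned words by their normalised model
    probabilities; recall divides the same correct mass by the
    expert-score-weighted total of the expert sense.
    """
    total_p = sum(model_words.values())
    hit_mass = sum(p for w, p in model_words.items() if w in correct)
    precision = hit_mass / total_p if total_p else 0.0
    recall = hit_mass / sum(expert_words.values()) if expert_words else 0.0
    return precision, recall
```

The F1-score reported in Table 4 is then the usual harmonic mean of these two quantities.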

6 Experiments

6.1 Predictions on held-out words

Considering the 50-word dataset described in Section 5, we evaluated the predictive performance in terms of log-likelihood of held-out data for 3 models: SCAN (not using any genre information), GASC-all (GASC with all the available genres) and GASC-narr (GASC with 2 genres, Narrative vs. non-Narrative). Narrative and Technical are the genres with the highest frequency in the corpus for which all the 50 words occurred at least once in the training and test sets, and analogous results are obtained with GASC run on Technical vs. non-Technical. For each model, we compared the 3 hyperparameter settings previously reported, with higher scores indicating that a model is better at explaining unseen data.

Figure 3 shows the predictive log-likelihood scores for a range of values of K, the number of senses, averaging the results over 50 leave-one-out folds. Each time, the log-likelihood scores were averaged under the final 10 samples of the latent variables, out of 1,000 MCMC iterations (150 of which were used as burn-in). On average, GASC-narr consistently outperforms SCAN for every K and for each hyperparameter setting. On the other hand, SCAN exhibits a higher held-out log-likelihood than GASC-all. Exploiting some information on the genre yields better predictions, while using all genres attested in the corpus is not effective, as some genres are not sufficiently represented in the data.

Figure 3 also shows that the best predictions over unseen data are obtained for K between 10 and 15. Higher values tend to introduce noisy senses with no improvement in the model output. In addition, Setting 3 proved to work as well as or better than the other settings. In the next section, we fix the model hyperparameters and use a validation set of words that were not part of the 50 targets of this experiment.

Figure 3: Held-out log-likelihood scores for varying K. Larger scores indicate better predictive performance. The shaded area with dotted contour indicates 1 standard error.

6.2 Ground truth recovery

harmonia ‘agreement, harmony’ Technical (ρ = 0.888, p < 0.0001)
Narrative (ρ = 0.719, p = 0.006)
Essays (ρ = 0.561, p = 0.046)
‘fastening’ Narrative (ρ = 0.663, p = 0.013)
‘stringing, music scale’ Technical (ρ = 0.817, p = 0.001)
Philosophy (ρ = 0.632, p = 0.02)
Essays (ρ = 0.598, p = 0.031)
kosmos ‘decoration’ Narrative (ρ = 0.887, p = 0.001)
Technical (ρ = 0.705, p = 0.023)
Oratory (ρ = 0.664, p = 0.036)
‘order’ Technical (ρ = 0.875, p = 0.001)
Narrative (ρ = 0.862, p = 0.001)
‘world’ Technical (ρ = 0.792, p = 0.006)
Oratory (ρ = 0.723, p = 0.018)
mus ‘mouse’ Narrative (ρ = 0.813, p = 0.001)
Essays (ρ = 0.743, p = 0.004)
‘muscle’ Technical (ρ = 0.766, p = 0.002)
‘mussel’ Narrative (ρ = 0.736, p = 0.004)
Essays (ρ = 0.736, p = 0.004)
Poetry (ρ = 0.613, p = 0.026)
Table 3: Spearman correlations (ρ) between senses and genres for manually annotated target words.





Figure 4: Comparison of expert annotation (top) vs the output of SCAN and GASC (bottom). Each stacked bar corresponds to all occurrences of kosmos in a given time period. Colours denoting senses are matched between plots (both shades of orange map to the sense ‘order’).
Word/Model SCAN GASC-independent GASC
P R F1 P R F1 P R F1
mus 0.430 0.477 0.452 0.420 0.442 0.431 0.224 0.298 0.253
harmonia 0.527 0.708 0.603 0.582 0.729 0.646 0.497 0.481 0.484
kosmos 0.405 0.586 0.478 0.362 0.447 0.399 0.525 0.611 0.595
Table 4: SCAN vs GASC on mus (‘mouse’, ‘muscle’, ‘mussel’), harmonia (‘abstract’, ‘concrete’, ‘musical’), and kosmos (‘order’, ‘decoration’, ‘world’) in terms of precision (‘P’), recall (‘R’), and F1-score (‘F1’).

We explored the ability to recover ground truth when available. For the word mus, experts annotated 205 instances, of which 198 were assigned to one of the 3 senses ‘mouse’, ‘mussel’, and ‘muscle’; out of these 198 assignments, 114 were performed based on lexical contextual information only (category ‘collocates’) and were retained for the evaluation. For harmonia, the number of annotated occurrences was 599, of which 411 were of the type ‘collocates’. For kosmos, 1,411 occurrences were annotated, of which 1,406 were assigned to a sense, and in 1,102 cases the annotation was of the type ‘collocates’. We identified the genres that manual annotation shows to have the largest effect on the distribution of senses for each target word by calculating Spearman’s Rank Correlation Coefficient, for each word sense s, between the frequency f(s) of s across centuries and the frequency f(s, g) of s in each genre g across centuries. A significant correlation between f(s) and any f(s, g) would suggest that variation in the frequency of a word sense across centuries is not due to diachronic change, but to how frequently s is attested in g in each century (and, ultimately, to the amount of texts representing g in each century). Significance (p < 0.05) is reached for the senses and genres shown in Table 3. Given the amount of available data and the size of the correlations, we considered the genres Technical and non-Technical for mus and harmonia, and both Technical vs. non-Technical and Narrative vs. non-Narrative for kosmos (see Table 1). These target words were selected as examples of polysemous words (a) exhibiting a range of clearly distinct senses (such as mus, whose three senses are strikingly diverse), (b) attested in most, if not all, the time periods covered by the corpus, and (c) attested across a number of literary genres.
As expert annotations of semantic change in Ancient Greek corpora are virtually unavailable, this particular choice of targets also allowed us to leverage ground truth for validation.

We compared the performance of SCAN with GASC and GASC-independent, namely a simpler version of GASC that fits an independent model to each collection of documents sharing the same genre, so that parameters and senses are inferred independently across genres (while in GASC senses are shared but their probability distributions are independent across genres). The comparison was carried out with two approaches: 1) a comparison of the word senses across time against the expert-annotated data, and 2) precision, recall, and F1 scores (the harmonic mean of precision and recall), which measure how closely the words assigned to a sense by the model match the words assigned to a sense by the experts.
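For a single matched model/expert sense pair, the second evaluation reduces to set overlap between the two word assignments. A minimal sketch (the function and variable names are illustrative, not from the original code):

```python
def prf(model_words, expert_words):
    """Precision, recall and F1 for one matched model/expert sense pair."""
    model, expert = set(model_words), set(expert_words)
    tp = len(model & expert)                    # words both assign to the sense
    p = tp / len(model) if model else 0.0       # precision
    r = tp / len(expert) if expert else 0.0     # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of p and r
    return p, r, f1

# Hypothetical context words assigned to the 'muscle' sense of mus:
p, r, f1 = prf({"sinew", "body", "strength"}, {"sinew", "body", "flesh"})
```

Per-genre scores can then be averaged across the genre groups of each target word, as reported for GASC below.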

Figure 4 compares the time distribution of the senses of kosmos in the expert annotation (left) and as output by GASC run on Narrative vs. non-Narrative (right). For every matched pair, we computed precision, recall, and F1 scores. For GASC, the reported values are averages of precision, recall, and F1-score over {Technical, non-Technical} for mus and harmonia and over {Narrative, non-Narrative} for kosmos. The results reported in Table 4 indicate that, for the targets that are sufficiently represented in the corpus, incorporating genre information leads to a greater ability to recover the ground truth.

7 Discussion

We introduced GASC, a Bayesian model to study the evolution of word senses in ancient texts. Crucially, we performed this analysis conditional on the text genre, demonstrating that the ability to harness genre metadata addresses a fundamental challenge in disambiguating word senses in Ancient Greek. In experiments we showed that GASC provides interpretable representations of the evolution of word senses and achieves improved predictive performance compared to the state of the art. Further, we established a new framework to assess model accuracy against expert judgment. To our knowledge, no previous work has systematically compared the estimates from a statistical model to manual semantic annotations of ancient texts. This work can be seen as a step towards the development of richer evaluation schemes and models that can embed expert judgments. Future work in this direction could encode more structured cross-genre dependencies, or allow for change points triggered by exogenous forces such as historical events.

