Mark my Word:
A Sequence-to-Sequence Approach to Definition Modeling
Defining words in a textual context is a useful task both for practical purposes and for gaining insight into distributed word representations. Building on the distributional hypothesis, we argue here that the most natural formalization of definition modeling is to treat it as a sequence-to-sequence task, rather than a word-to-sequence task: given an input sequence with a highlighted word, generate a contextually appropriate definition for it. We implement this approach in a Transformer-based sequence-to-sequence model. Our proposal allows us to train contextualization and definition generation in an end-to-end fashion, which is a conceptual improvement over earlier works. We achieve state-of-the-art results in both contextual and non-contextual definition modeling.
The task of definition modeling, introduced by Noraset2017DefinitionML, consists in generating the dictionary definition of a specific word: for instance, given the word “monotreme” as input, the system would need to produce a definition such as “any of an order (Monotremata) of egg-laying mammals comprising the platypuses and echidnas” (definition from Merriam-Webster). Following the tradition set by lexicographers, we call the word being defined a definiendum (pl. definienda), whereas a word occurring in its definition is called a definiens (pl. definientia).
Definition modeling can prove useful in a variety of applications. Systems trained for the task may generate dictionaries for low resource languages, or extend the coverage of existing lexicographic resources where needed, e.g. of domain-specific vocabulary. Such systems may also be able to provide reading help by giving definitions for words in the text.
A major intended application of definition modeling is the explication and evaluation of distributed lexical representations, also known as word embeddings (Noraset2017DefinitionML). This evaluation procedure is based on the postulate that the meaning of a word, as captured by its embedding, should be convertible into a human-readable dictionary definition. How well the meaning is captured should impact the ability of the model to reproduce the definition, and therefore embedding architectures can be compared according to their downstream performance on definition modeling. This intended usage motivates the requirement that definition modeling architectures take as input the embedding of the definiendum and not retrain it.
From a theoretical point of view, the use of word embeddings as representations of meaning (cf. lenci2018distributional; Boleda2019DSandLT, for an overview) is motivated by the distributional hypothesis (Harris54). This framework holds that meaning can be inferred from the linguistic context of the word, usually seen as co-occurrence data. The context of usage is even more crucial for characterizing the meanings of ambiguous or polysemous words: a definition that does not take disambiguating context into account will be of limited use (Gadetsky18WordDefGen).
We argue that definition modeling should preserve the link between the definiendum and its context of occurrence. The most natural approach to this task is to treat it as a sequence-to-sequence task, rather than a word-to-sequence task: given an input sequence with a highlighted word, generate a contextually appropriate definition for it (cf. sections 3 & 4). We implement this approach in a Transformer-based sequence-to-sequence model that achieves state-of-the-art performances (sections 5 & 6).
2 Related Work
In their seminal work on definition modeling, Noraset2017DefinitionML likened systems generating definitions to language models, which can naturally be used to generate arbitrary text. They built a sequential LSTM seeded with the embedding of the definiendum; its output at each time-step was mixed, through a gating mechanism, with a feature vector derived from the definiendum.
Gadetsky18WordDefGen stressed that a definiendum outside of its specific usage context is ambiguous between all of its possible definitions. They proposed to first compute the AdaGram vector (Bartunov15Adagram) for the definiendum, to then disambiguate it using a gating mechanism learned over contextual information, and finally to run a language model over the sequence of definientia embeddings prepended with the disambiguated definiendum embedding.
In an attempt to produce a more interpretable model, Chang18xSense map the definiendum to a sparse vector representation. Their architecture comprises four modules. The first encodes the context in a sentence embedding, the second converts the definiendum into a sparse vector, the third combines the context embedding and the sparse representation, passing them on to the last module which generates the definition.
Related to these works, Yang2019Sememes4CDM specifically tackle definition modeling in the context of Chinese—whereas all previous works on definition modeling studied English. In a Transformer-based architecture, they incorporate “sememes” as part of the representation of the definiendum to generate definitions.
On a more abstract level, definition modeling is related to research on the analysis and evaluation of word embeddings (Levy2014a; Levy2014; Arora16LinearWordsenses; batchkarov2016critiquewordsim; swinger2018biases, e.g.). It also relates to other works associating definitions and embeddings, like the “reverse dictionary task” (Hill16DictRep)—retrieving the definiendum knowing its definition, which can be argued to be the opposite of definition modeling—or works that derive embeddings from definitions (Wang15LexEmbLexKno; Tissier2017Dict2vecL; Bosc18AutoencodeDefs).
3 Definition modeling as a sequence-to-sequence task
Gadetsky18WordDefGen remarked that words are often ambiguous or polysemous, and thus generating a correct definition requires that we either use sense-level representations, or that we disambiguate the word embedding of the definiendum. The disambiguation that Gadetsky18WordDefGen proposed was based on a contextual cue, i.e. a short text fragment. As Chang18xSense note, the cues in Gadetsky18WordDefGen's dataset do not necessarily contain the definiendum or even an inflected variant thereof. For instance, one training example disambiguated the word “fool” using the cue “enough horsing around—let’s get back to work!”.
Though the remark that definienda must be disambiguated is pertinent, the more natural formulation of such a setup would be to disambiguate the definiendum using its actual context of occurrence. In that respect, the definiendum and the contextual cue would form a linguistically coherent sequence, and thus it would make sense to encode the context together with the definiendum, rather than to merely rectify the definiendum embedding using a contextual cue. Therefore, definition modeling is by its nature a sequence-to-sequence task: mapping contexts of occurrence of definienda to definitions.
This remark can be linked to the distributional hypothesis (Harris54). The distributional hypothesis suggests that a word’s meaning can be inferred from its context of usage; or, more succinctly, that “you shall know a word by the company it keeps” (firth1957). When applied to definition modeling, the hypothesis can be rephrased as follows: the correct definition of a word can only be given when knowing in what linguistic context(s) it occurs. Though different kinds of linguistic contexts have been suggested throughout the literature, we remark here that sentential context may sometimes suffice to guess the meaning of a word that we don’t know (Lazaridou2017MultimodalWM). Returning to the example above, the context “enough ___ around—let’s get back to work!” sufficiently characterizes the meaning of the omitted verb to allow for an approximate definition of it, even if the blank is not filled (Taylor53Cloze; Devlin18Bert).
This reformulation can appear contrary to the original proposal by Noraset2017DefinitionML, which conceived definition modeling as a “word-to-sequence task”. They argued for an approach related to, though distinct from sequence-to-sequence architectures. Concretely, a specific encoding procedure was applied to the definiendum, so that it could be used as a feature vector during generation. In the simplest case, vector encoding of the definiendum consists in looking up its vector in a vocabulary embedding matrix.
We argue that the whole context of a word’s usage should be accessible to the generation algorithm, rather than a single vector. To take the more specific case of verb definitions, we observe that context explicitly represents argument structure, which is obviously useful when defining the verb. There is no guarantee that a single embedding, even a contextualized one, would preserve this wealth of information; that is to say, there is no guarantee that all the information pertaining to the syntactic context can be crammed into a single vector.
Despite some key differences, all of the previously proposed architectures we are aware of (Noraset2017DefinitionML; Gadetsky18WordDefGen; Chang18xSense; Yang2019Sememes4CDM) followed a pattern similar to sequence-to-sequence models. They all implicitly or explicitly used distinct submodules to encode the definiendum and to generate the definientia. In the case of Noraset2017DefinitionML, the encoding was the concatenation of the embedding of the definiendum, a vector representation of its sequence of characters derived from a character-level CNN, and its “hypernym embedding”. Gadetsky18WordDefGen used a sigmoid-based gating module to tweak the definiendum embedding. The architecture proposed by Chang18xSense comprises four modules, only one of which is used as a decoder: the remaining three are meant to convert the definiendum into a sparse embedding, select some of the sparse components of its meaning based on a provided context, and encode it into a representation adequate for the decoder.
Aside from theoretical implications, there is another clear gain in considering definition modeling as a sequence-to-sequence task. Recent advances in embedding designs have introduced contextual embeddings (McCann17CoVe; Peters18ELMo; Devlin18Bert); and these share the particularity that they are a “function of the entire sentence” (Peters18ELMo): in other words, vector representations are assigned to tokens rather than to word types, and moreover semantic information about a token can be distributed over other token representations. To extend definition modeling to contextual embeddings therefore requires that we devise architectures able to encode a word in its context; in that respect sequence-to-sequence architectures are a natural choice.
A related point is that not all definienda consist of a single word: multi-word expressions include multiple tokens, yet receive a single definition. Word embedding architectures generally require a pre-processing step to detect these expressions and merge them into a single token. However, as they come with varying degrees of semantic opacity (cordeiro2016MWECompEmbeddingsHardTime), a definition modeling system would benefit from directly accessing the tokens they are made up of. Therefore, if we are to address the entirety of the language and the entirety of existing embedding architectures in future studies, reformulating definition modeling as a sequence-to-sequence task becomes a necessity.
A sequence-to-sequence formulation of definition modeling can formally be seen as a mapping between contexts of occurrence of definienda and their corresponding definitions. It moreover requires that the definiendum be formally distinguished from the remaining context: otherwise the definition could not be linked to any particular word of the contextual sequence, and thus would need to be equally valid for any word of the contextual sequence.
We formalize definition modeling as mapping sequences of pairs ⟨w_i, m_i⟩ to sequences of definientia, where w_i is the i-th word in the input and m_i indicates whether the i-th token is to be defined. As only one element of the sequence should be highlighted, we expect the set of all indicators to contain only two elements: one, 1, to mark the definiendum, the other, 0, to mark the context; this entails that we encode this marking using one bit only. Note that multiple instances of the same definiendum within a single context should all share a single definition, and therefore could theoretically all be marked using the definiendum indicator 1; likewise, the words that make up a multi-word expression should all be marked with this indicator. In this work, however, we only mark a single item; in cases when multiple occurrences of the same definiendum were attested, we simply marked the first occurrence.
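As an illustration (a minimal sketch in plain Python; the helper name is our own), the input format just described can be produced as follows:

```python
def mark_definiendum(tokens, definiendum):
    """Pair each token with a 1-bit indicator: 1 for the first
    occurrence of the definiendum, 0 for every context token."""
    marked = False
    pairs = []
    for tok in tokens:
        if not marked and tok == definiendum:
            pairs.append((tok, 1))
            marked = True  # only the first occurrence is marked
        else:
            pairs.append((tok, 0))
    return pairs

example = mark_definiendum(["a", "monotreme", "lays", "eggs"], "monotreme")
# every pair carries the word plus its one-bit indicator
assert example == [("a", 0), ("monotreme", 1), ("lays", 0), ("eggs", 0)]
```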
To treat definition modeling as a sequence-to-sequence task, the information from each pair ⟨w_i, m_i⟩ has to be integrated into a single representation r_i:

r_i = μ(w_i, m_i)
This marking function μ can theoretically take any form. Considering that definition modeling uses the embedding e_{w_i} of the definiendum, in this work we study a multiplicative and an additive mechanism, as they are conceptually the simplest forms this marking can take in a vector space. They are given schematically in Figure 1, and formally defined as:

μ_mul(w_i, m_i) = m_i · e_{w_i}        μ_add(w_i, m_i) = e_{w_i} + u_{m_i}
The last point to take into account is where to apply the marking. Two natural choices are to set it either before or after encoded representations are obtained. We can formalize this using either of the following equations, with Enc the model’s encoder:

r = Enc(μ(w_1, m_1), …, μ(w_n, m_n))        (mark, then encode)
r_i = μ(Enc(w_1, …, w_n)_i, m_i)            (encode, then mark)
4.1 Multiplicative marking: Select
The first option we consider is to use scalar multiplication to distinguish the word to define. In such a scenario, the marked token encoding is

r_i = m_i · h_i

where h_i is the representation of the i-th token (its embedding, or its contextualized encoding, depending on where the marking is applied).
As we use single bits as indicators, this form of marking entails that only the representation of the definiendum is preserved and that all other contextual representations are set to 0: thus multiplicative marking amounts to selecting just the definiendum embedding and discarding other token embeddings. The contextualized definiendum encoding bears the trace of its context, but detailed information is irreparably lost. Hence, we refer to such an integration mechanism as a Select marking of the definiendum.
When to apply the marking, as introduced above, is crucial when using the multiplicative marking scheme Select. Should we mark the definiendum before encoding, then only the definiendum embedding is passed into the encoder: the resulting system provides out-of-context definitions, as in Noraset2017DefinitionML, where the definition is linked to the definiendum alone rather than to its context of use. For context to be taken into account under the multiplicative strategy, tokens must be encoded and contextualized before integration with the indicator m_i.
Figure 1a presents the contextual Select mechanism visually. It consists in coercing the decoder to attend only to the contextualized representation of the definiendum. To do so, we encode the full context and then select only the encoded representation of the definiendum, dropping the rest of the context, before running the decoder. In the case of the Transformer architecture, this is equivalent to using a multiplicative marking on the encoded representations: vectors that have been zeroed out are ignored during attention and thus cannot influence the behavior of the decoder.
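As a toy sketch (plain Python, with made-up 2-dimensional vectors standing in for encoder outputs; the function name is our own), the Select operation over encoded representations amounts to:

```python
def select_definiendum(encoded, indicators):
    """Multiplicative marking after encoding: keep only the encoded
    representation of the definiendum, zero out all context vectors."""
    return [
        [ind * x for x in vec]
        for vec, ind in zip(encoded, indicators)
    ]

# three encoded tokens, the second one being the definiendum
encoded = [[0.2, 0.8], [1.0, -0.5], [0.3, 0.3]]
marked = select_definiendum(encoded, [0, 1, 0])
# context vectors are zeroed out and will be ignored by attention
assert marked == [[0.0, 0.0], [1.0, -0.5], [0.0, 0.0]]
```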
This Select approach may seem intuitive and naturally interpretable, as it directly controls what information is passed to the decoder: we carefully select only the contextualized definiendum, so the only remaining zone of uncertainty is how exactly contextualization is performed. It also seems to provide a strong and reasonable bias for training the definition generation system. Such an approach, however, is not guaranteed to excel: the forcibly omitted context could contain important information that might not be easily incorporated into the definiendum embedding.
Being simple and natural, the Select approach resembles architectures like those of Gadetsky18WordDefGen and Chang18xSense: the full encoder is dedicated to altering the embedding of the definiendum on the basis of its context; in that respect, the encoder may be seen as a dedicated contextualization sub-module.
4.2 Additive marking: Add
We also study an additive mechanism, shown in Figure 1b (henceforth Add). It concretely consists in embedding the word and its indicator bit in the same vector space and adding the corresponding vectors:

r_i = e_{w_i} + u_{m_i}
In other words, under Add we distinguish the definiendum by adding a vector u_1 to the definiendum embedding, and another vector u_0 to the remaining context token embeddings; both markers u_0 and u_1 are learned during training. In our implementation, markers are added to the input of the encoder, so that the encoder has access to this information; we leave the question of whether to integrate indicators and words at other points of the encoding process, as suggested above, to future work.
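A toy illustration of this additive marking (with made-up 3-dimensional embeddings and fixed marker vectors; in the actual model both embeddings and markers are learned):

```python
def add_marking(embeddings, indicators, u0, u1):
    """Add marker u1 to the definiendum embedding and u0 to every
    context token embedding (element-wise vector addition)."""
    markers = {0: u0, 1: u1}
    return [
        [e + m for e, m in zip(emb, markers[ind])]
        for emb, ind in zip(embeddings, indicators)
    ]

# hypothetical 3-d embeddings for "the cat sat", with "cat" to be defined
embs = [[0.1, 0.0, 0.2], [0.5, 0.5, 0.5], [0.0, 0.3, 0.1]]
marked = add_marking(embs, [0, 1, 0], u0=[0.0, 0.0, 1.0], u1=[1.0, 0.0, 0.0])
assert marked[1] == [1.5, 0.5, 0.5]  # definiendum shifted by u1
```

Unlike Select, no token representation is discarded: the whole marked sequence is passed on to the encoder.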
Additive marking of substantive features has its precedents. For example, BERT embeddings (Devlin18Bert) are trained using two sentences at once as input; the sentences are distinguished with added markers called “segment encodings”. Tokens from the first sentence are all marked with one added vector, whereas tokens from the second sentence are all marked with another. The main difference here is that we mark only a single item with the definiendum marker, while all other tokens receive the context marker.
This Add marking is more expressive than the Select architecture. Sequence-to-sequence decoders typically employ attention over the input source (Bahdanau14Attention), which corresponds to a re-weighting of the encoded input sequence based on the similarity between the current state of the decoder (the ‘query’) and each member of the input sequence (the ‘keys’). This re-weighting is normalized with a softmax function, producing a probability distribution over keys. However, both non-contextual definition modeling and the Select approach produce singleton encoded sequences: in such scenarios the attention mechanism assigns a single weight of 1, devolving into a simple linear transformation of the value and rendering the attention mechanism useless. Using an additive marker, rather than a selective mechanism, prevents this behavior.
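The degenerate case is easy to verify: softmax over a single attention score always yields a weight of exactly 1, whatever the score. A quick check in plain Python:

```python
import math

def softmax(scores):
    """Normalize a list of attention scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Under Select (or non-contextual modeling), the encoded sequence is a
# singleton: the attention weight is always exactly 1, whatever the score.
for score in (-5.0, 0.0, 3.7):
    assert softmax([score]) == [1.0]

# Under Add, the full sequence is kept and weights genuinely depend on scores.
weights = softmax([1.0, 2.0, 3.0])
assert len(set(weights)) == 3  # three distinct attention weights
```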
We implement several sequence-to-sequence models with the Transformer architecture (Vaswani17), building on the OpenNMT library (opennmt2017) with adaptations and modifications when necessary. Code & data are available at the following URL: https://github.com/TimotheeMickus/onmt-selectrans. Throughout this work, we use GloVe vectors (Pennington2014) and freeze the weights of all embeddings for a fairer comparison with previous models; words not in GloVe but observed in train or validation data, as well as missing definienda in our test sets, were randomly initialized with components drawn from a normal distribution.
We train a distinct model for each dataset. We use a batch size of 8,192, with gradient accumulation to circumvent GPU memory limitations. We optimize the network using Adam, with a learning rate of 2, label smoothing of 0.1, Noam exponential decay with 2,000 warmup steps, and a dropout rate of 0.4. The parameters are initialized using Xavier initialization. Models were trained for up to 120,000 steps with checkpoints every 1,000 steps; we stopped training if perplexity on the validation dataset stopped improving. We report results from the checkpoints performing best on validation.
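For reference, a sketch of the Noam schedule in its standard formulation (an assumption on our part: we take the definition from Vaswani17, with the learning rate of 2 acting as the scaling factor and d_model matching our 300-dimensional hidden representations):

```python
def noam_lr(step, d_model=300, factor=2.0, warmup=2000):
    """Standard Noam schedule: linear warmup followed by
    inverse-square-root decay, scaled by `factor` and d_model."""
    step = max(step, 1)  # avoid step ** -0.5 at step 0
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The learning rate peaks at the end of warmup, then decays.
assert noam_lr(2000) > noam_lr(200)
assert noam_lr(2000) > noam_lr(20000)
```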
5.1 Implementation of the Non-contextual Definition Modeling System
In non-contextual definition modeling, definienda are mapped directly to definitions. As the source corresponds only to the definiendum, we conjecture that few parameters are required for the encoder. We use 1 layer for the encoder, 6 for the decoder, 300 dimensions per hidden representation and 6 heads for multi-head attention. We do not share vocabularies between the encoder and the decoder: therefore output tokens can only correspond to words attested as definientia. In our case, not sharing vocabularies prevents the model from considering rare words only used as definienda, such as “penumbra”, as potential outputs, and was found to improve performance.
The dropout rate and the number of warmup steps were set using a hyperparameter search on the dataset from Noraset2017DefinitionML, during which encoder and decoder vocabularies were merged for computational simplicity and models were stopped after 12,000 steps. We first fixed dropout to 0.1 and tested warmup step values between 1,000 and 10,000 in increments of 1,000, then focused on the most promising span (1,000–4,000 steps) and exhaustively tested dropout rates from 0.2 to 0.8 in increments of 0.1.
5.2 Implementation of Contextualized Definition Modeling Systems
To compare the effects of the two integration strategies that we discussed in section 4, we implement both the additive marking approach (Add, cf. section 4.2) and the alternative ‘encode and select’ approach (Select, cf. section 4.1). To match with the complex input source, we define encoders with 6 layers; we reemploy the set of hyperparameters previously found for the non-contextual system. Other implementation details, initialization strategies and optimization algorithms are kept the same as described above for the non-contextual version of the model.
We stress that the two approaches we compare for contextualizing the definiendum are applicable to almost any sequence-to-sequence neural architecture with an attention mechanism over the input source. (For best results, the Select mechanism requires a bi-directional encoding mechanism.) Here we chose to rely on a Transformer-based architecture (Vaswani17), which has set the state of the art in a wide range of tasks, from language modeling (dai19tfxl) to machine translation (ott18scalingNMT). It is therefore expected that the Transformer architecture will also improve performance on definition modeling, if our arguments for treating it as a sequence-to-sequence task are on the right track.
We train our models on three distinct datasets, which are all borrowed or adapted from previous works on definition modeling. As a consequence, our experiments focus on the English language. The dataset of Noraset2017DefinitionML (henceforth the Noraset dataset) maps definienda to their respective definientia, as well as additional information not used here. In the dataset of Gadetsky18WordDefGen (henceforth the Gadetsky dataset), each example consists of a definiendum, the definientia for one of its meanings, and a contextual cue sentence. The two datasets also differ in definition length, both in mean length and in standard deviation.
Chang18xSense stress that the Gadetsky dataset includes many examples where the definiendum is absent from the associated cue. About half of these cues do not contain an exact match for the corresponding definiendum, but up to 80% contain either an exact match or an inflected form of the definiendum, according to lemmatization with the NLTK toolkit (nltk). To cope with this problematic characteristic, we converted the dataset into the word-in-context format assumed by our model by concatenating the definiendum with the cue. To illustrate this, consider the actual input from the Gadetsky dataset comprising the definiendum “fool” and its associated cue “enough horsing around—let’s get back to work!”: to convert this into a single sequence, we simply prepend the definiendum to the cue, which results in the sequence “fool enough horsing around—let’s get back to work!”. Hence the input sequences of the converted dataset do not constitute linguistically coherent sequences, but this conversion does guarantee that our sequence-to-sequence variants have access to the same input as previous models; the inclusion of this dataset in our experiments is therefore intended mainly for comparison with previous architectures. We also note that this conversion procedure entails that our examples have a very regular structure: the word marked as a definiendum is always the first word in the input sequence.
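The conversion is straightforward; a sketch of how such an example might be produced (a simplification: the function name is our own, and tokenization is reduced to whitespace splitting):

```python
def convert_example(definiendum, cue):
    """Prepend the definiendum to its cue; position 0 carries the
    definiendum indicator (1), all cue tokens the context indicator (0)."""
    tokens = [definiendum] + cue.split()
    indicators = [1] + [0] * (len(tokens) - 1)
    return list(zip(tokens, indicators))

pairs = convert_example("fool", "enough horsing around let's get back to work !")
assert pairs[0] == ("fool", 1)          # definiendum always comes first
assert all(ind == 0 for _, ind in pairs[1:])  # the cue is pure context
```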
Our second strategy was to restrict the Gadetsky dataset by selecting only cues where the definiendum (or an inflected form thereof) is present. The curated dataset thus obtained contains 78,717 training examples, 9,413 validation examples and 9,812 test examples. In each example, the first occurrence of the definiendum is annotated as such. The curated dataset thus differs from the Gadetsky dataset in two ways: some definitions have been removed, and the exact citation forms of the definienda are not given. Models trained on the curated dataset implicitly need to lemmatize the definiendum, since inflected variants of a given word are to be aligned to a common representation; thus they are not directly comparable with models trained with the citation form of the definiendum that solely use context as a cue—viz. Gadetsky18WordDefGen & Chang18xSense. All this makes the curated dataset harder, but at the same time closer to a realistic application than the other two datasets, since each word appears inflected and in a specific sentential context. For applications of definition modeling, it would only be beneficial to take up these challenges; for example, the output “monotremes: plural of monotreme” (definition from Wiktionary) would not have been self-contained, necessitating a second query for “monotreme”.
We use perplexity, a standard metric in definition modeling, to evaluate and compare our models. Informally, perplexity assesses the model’s confidence in producing the ground-truth output when presented with the source input. It is formally defined as the exponentiation of the cross-entropy. We do not report BLEU or ROUGE scores because a substantial number of ground-truth definitions consist of a single word (up to roughly 25%, depending on the dataset). Single-word outputs can only be assessed as entirely correct or entirely wrong using BLEU or ROUGE. Consider, however, the word “elation”: whether it is defined as “mirth” or as “joy” should only influence our metric slightly, and not be discounted as a completely wrong prediction.
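Concretely, perplexity is the exponential of the average per-token negative log-likelihood. A minimal computation, with made-up token probabilities:

```python
import math

def perplexity(token_probs):
    """Exponential of the mean negative log-likelihood assigned by the
    model to the ground-truth tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model assigning probability 1/4 to every ground-truth token has a
# perplexity of exactly 4: it is "as confused" as a uniform 4-way choice.
assert abs(perplexity([0.25, 0.25, 0.25]) - 4.0) < 1e-9
```

Lower perplexity thus means the model concentrates more probability mass on the reference definition.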
Table 1 describes our main results in terms of perplexity. We do not compare with Chang18xSense, as they did not report the perplexity of their system and focused on a different dataset; likewise, Yang2019Sememes4CDM consider only the Chinese variant of the task. Perplexity measures for Noraset2017DefinitionML and Gadetsky18WordDefGen are taken from the authors’ respective publications.
All our models perform better than previous proposals, by a margin of 4 to 10 points, for a relative improvement of 11–23%. Part of this improvement may be due to our use of Transformer-based architectures (Vaswani17), which are known to perform well on semantic tasks (Radford2018; Cer18USE; Devlin18Bert; Radford2019, e.g.). Like Gadetsky18WordDefGen, we conclude that disambiguating the definiendum, when done correctly, improves performance: our best performing contextual model outranks the non-contextual variant by 5 to 6 points. How the definiendum is marked within its context (Add vs. Select) also impacts results. Note also that we do not rely on task-specific external resources (unlike Noraset2017DefinitionML; Yang2019Sememes4CDM) or on pre-training (unlike Gadetsky18WordDefGen).
Our contextual systems trained on the Gadetsky dataset used the concatenation of the definiendum and the contextual cue as inputs. The definiendum was always at the start of the training example. This regular structure has proven useful for the models’ performance: all models perform significantly worse on the more realistic data of the curated dataset than on the Gadetsky dataset. The curated dataset is intrinsically harder for other reasons as well: it requires some form of lemmatization in three out of every eight training examples, and contains less data than the other datasets (only half as many examples as the Noraset dataset, and 20% fewer than the Gadetsky dataset).
The surprisingly poor results of Select on the curated dataset may be partially blamed on the absence of a regular structure in that dataset. Unlike the Gadetsky dataset, where the model must only learn to contextualize the first element of the sequence, in the curated dataset the model has to single out a definiendum that may appear anywhere in the sentence. Any information stored only in representations of contextual tokens will be lost to the decoder. The Select model therefore suffers from a bottleneck, which is highly regular in the Gadetsky dataset and which the model may therefore learn to cope with; however, predicting where in the input sequence the bottleneck will appear is far from trivial in the curated dataset. We also attempted to retrain this model with various settings of hyperparameters, modifying the dropout rate, the number of warmup steps, and the number of layers in the encoder—but to no avail. An alternative explanation may be that, in the case of the Gadetsky dataset, the regular structure of the input entails that the first positional encoding is used as an additive marking device: only definienda are marked with the first positional encoding, and thus the architecture does not purely embrace a selective approach but a mixed one.
In any event, even on the Gadetsky dataset, where the margin is very small, the perplexity of the additive marking approach Add is better than that of the Select model. This lends empirical support to our claim that definition modeling is a non-trivial sequence-to-sequence task, which is better treated with sequence methods. The stability of the performance improvement over the non-contextual variant on both contextual datasets also highlights that our proposed additive marking is fairly robust, and functions equally well when confronted with somewhat artificial inputs, as in the Gadetsky dataset, or with linguistically coherent sequences, as in the curated dataset.
6 Qualitative Analysis
A manual analysis of definitions produced by our system reveals issues similar to those discussed by Noraset2017DefinitionML, namely self-reference, POS mismatches, over- and under-specificity, antonymy, and incoherence. (Self-referring definitions are those where the definiendum is used as a definiens for itself; dictionaries are expected to be free of such definitions, as readers are assumed not to know the meaning of the definiendum when looking it up.) Annotating distinct productions from the validation set for the non-contextual model trained on the Noraset dataset, we counted 9.9% self-references, 11.6% POS mismatches, and 1.3% of words defined as their antonyms. We counted POS mismatches whenever the definition seemed to fit another part of speech than that of the definiendum, regardless of the meaning of either; cf. Table 2 for examples.
Error type     | Context (definiendum in bold)                | Production
POS mismatch   | her major is linguistics                     | most important or important
Self-reference | he wrote a letter of apology to the hostess  | a formal expression of apology
For comparison, we annotated the first 1,000 productions of the validation set from our Add model trained on the curated dataset. We counted 18.4% POS mismatches and 4.4% self-referring definitions; examples are shown in Table 3. The higher rate of POS mismatches may be due to the model’s difficulty in identifying which word is to be defined, since the model is not presented with the definiendum alone: access to the full context may confuse it. On the other hand, the lower number of self-referring definitions may also be linked to this richer, more varied input: it may allow the model not to fall back on simply reusing the definiendum as its own definiens. Self-referring definitions highlight that our models equate the meaning of the definiendum with the composed meaning of its definientia. Simply masking the corresponding output embedding might suffice to prevent this specific problem; preliminary experiments in that direction suggest that this may also help decrease perplexity further.
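The masking strategy just mentioned can be sketched as follows (a simplification on our part: we mask the definiendum's own index in the output logits before the softmax, so that the model can never emit the definiendum as a definiens):

```python
import math

def masked_softmax(logits, banned_index):
    """Softmax over output-vocabulary logits, with the definiendum's own
    index masked out so it can never be produced as a definiens."""
    scores = list(logits)
    scores[banned_index] = -math.inf  # exp(-inf) == 0.0
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# toy 3-word output vocabulary; index 0 is the definiendum itself
probs = masked_softmax([2.0, 1.0, 0.5], banned_index=0)
assert probs[0] == 0.0                 # self-reference is impossible
assert abs(sum(probs) - 1.0) < 1e-9    # still a valid distribution
```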
As for POS mismatches, we note that the model of Noraset2017DefinitionML had a much lower rate of 4.29%: we suggest that this may be because they employ a learned character-level convolutional network, which arguably captures orthography and rudiments of morphology. Adding such a sub-module to our proposed architecture might diminish the number of mistagged definienda. Another possibility would be to pre-train the model, as was done by Gadetsky18WordDefGen: in our case in particular, the encoder could be trained for POS tagging or lemmatization.
Lastly, one important kind of mistake we observed is hallucination. Consider for instance this production by the Add model trained on , for the word “beta”: “the twentieth letter of the Greek alphabet (), transliterated as ‘o’.”. Nearly everything it contains is factually wrong, though the general semantics are close enough to deceive an unaware reader.888On a related note, other examples were found to contain unwanted social biases; consider the production by the same model for the word “blackface”: “relating to or characteristic of the theatre”. Part of the social bias here may be blamed on the under-specific description that omits the offensive nature of the word; however, contrast the definition of Merriam-Webster for blackface, which includes a note on the offensiveness of the term, with that of Wiktionary, which does not. Cf. Bolukbasi2016; swinger2018biases for a discussion of biases within embeddings themselves. We conjecture that filtering out hallucinatory productions will be a major challenge for future definition modeling architectures, for two main reasons. Firstly, the tools and metrics necessary to assess and handle such hallucinations have yet to be developed. Secondly, since the input given to the system consists of word embeddings, research will be faced with the problem of grounding these distributional representations: how can we ensure that “beta” is correctly defined as “the second letter of the Greek alphabet, transliterated as ‘b’”, if we only have access to a representation derived from its contexts of usage? Integration of word embeddings with structured knowledge bases might be needed for accurate treatment of such cases.
We introduced an approach to generating word definitions that allows the model to access rich contextual information about the word token to be defined. Building on the distributional hypothesis, we naturally treat definition generation as a sequence-to-sequence task of mapping the word’s context of usage (input sequence) into the context-appropriate definition (output sequence).
We showed that our approach is competitive with a more naive ‘contextualize and select’ pipeline. This was demonstrated by comparison both to the previous contextualized model by Gadetsky18WordDefGen and to the Transformer-based Select variation of our model, which differs from the proposed architecture only in the context encoding pipeline. While our results are encouraging, given the existing benchmarks, we were limited to perplexity measurements in our quantitative evaluation. A more nuanced, semantically driven methodology might be useful in the future to better assess the merits of our system in comparison to alternatives.
Our model opens several avenues of future exploration. One could straightforwardly extend it to generate definitions of multiword expressions or phrases, or to analyze vector compositionality models by generating paraphrases for vector representations produced by these algorithms. Another strength of our approach is that it can provide the basis for a standardized benchmark for contextualized and non-contextual embeddings alike: downstream evaluation tasks for embedding systems in general apply either to non-contextual embeddings (e.g. Gladkova2016) or to contextual embeddings (e.g. wang2019glue) exclusively. Redefining definition modeling as a sequence-to-sequence task will allow future work to compare models using contextual and non-contextual embeddings in a unified fashion. Lastly, we also intend to experiment on languages other than English, especially considering that the required resources for our model amount only to a set of pre-trained embeddings and a dataset of definitions, both of which are generally easy to obtain.
While there is potential for local improvements, our approach has demonstrated its ability to account for contextualized word meaning in a principled way, while training contextualized token encoding and definition generation end-to-end. Our implementation is efficient and fast, building on free open-source libraries for deep learning, and shows good empirical results. Our code, trained models, and data will be made available to the community.
We thank Quentin Gliosca for his many remarks throughout all stages of this project. We also thank Kees van Deemter, as well as the anonymous reviewers, for their thoughtful criticism of this work. The work was supported by a public grant overseen by the French National Research Agency (ANR) as part of the “Investissements d’Avenir” program: Idex Lorraine Université d’Excellence (reference: ANR-15-IDEX-0004).