Multi-channel Reverse Dictionary Model

Abstract

A reverse dictionary takes the description of a target word as input and outputs the target word together with other words that match the description. Existing reverse dictionary methods cannot successfully deal with highly variable input queries and low-frequency target words. Inspired by the description-to-word inference process of humans, we propose the multi-channel reverse dictionary model, which can mitigate the two problems simultaneously. Our model comprises a sentence encoder and multiple predictors. The predictors are expected to identify different characteristics of the target word from the input query. We evaluate our model on English and Chinese datasets including both dictionary definitions and human-written descriptions. Experimental results show that our model achieves state-of-the-art performance, and even outperforms the most popular commercial reverse dictionary system on the human-written description dataset. We also conduct quantitative analyses and a case study to demonstrate the effectiveness and robustness of our model. All the code and data of this work are available at https://github.com/thunlp/MultiRD.

Introduction

A regular (forward) dictionary maps words to definitions while a reverse dictionary [29] does the opposite and maps descriptions to corresponding words. In Figure 1, for example, a regular dictionary tells you that “expressway” is “a wide road that allows traffic to travel fast”, and when you input “a road where cars go very quickly without stopping” to a reverse dictionary, it might return “expressway” together with other semantically similar words like “freeway”.

Reverse dictionaries have great practical value. First and foremost, they can effectively address the tip-of-the-tongue problem [7], which severely afflicts many people, especially those who write a lot such as researchers, writers and students. Additionally, reverse dictionaries can assist new language learners who know a limited number of words. Moreover, reverse dictionaries are believed to help patients with word selection (or word dictionary) anomia, who can recognize and describe an object but fail to name it due to a neurological disorder [3]. In terms of natural language processing (NLP), reverse dictionaries can be used to evaluate the quality of sentence representations [10]. They are also beneficial to tasks involving text-to-entity mapping, including question answering and information retrieval [12].

There are some successful commercial reverse dictionary systems such as OneLook, the most popular one, but their architectures are usually undisclosed proprietary knowledge. Some research into building reverse dictionaries has also been conducted. Early work adopts sentence-matching methods, which use hand-engineered features to find the words whose stored definitions are most similar to the input query [4, 37, 16, 28]. But these methods cannot cope with the main difficulty of reverse dictionaries: human-written input queries might differ widely from target words’ definitions.

Figure 1: An example illustrating what a forward and a reverse dictionary are.

Hill et al. (2016) propose a new method based on a neural language model (NLM). They employ an NLM as the sentence encoder to learn the representation of the input query, and return those words whose embeddings are closest to the input query’s representation. The NLM based reverse dictionary model alleviates the above-mentioned problem of variable input queries, but its performance is heavily dependent on the quality of word embeddings. According to Zipf’s law [36], however, quite a few words are low-frequency and usually have poor embeddings, which undermines the overall performance of ordinary NLM based models.

To tackle the issue, we propose the multi-channel reverse dictionary model, which is inspired by the description-to-word inference process of humans. Taking “expressway” as an example, when we forget what word means “a road where cars go very quickly”, it may occur to us that the part-of-speech tag of the target word should be “noun” and it belongs to the category of “entity”. We might also guess that the target word probably contains the morpheme “way”. Knowing these characteristics, it is much easier for us to find the target word. Correspondingly, in our multi-channel reverse dictionary model, we employ multiple predictors to identify different characteristics of target words from input queries. By doing this, target words with poor embeddings can still be picked out by their characteristics and, moreover, words whose embeddings are close to that of the correct target word but whose characteristics contradict the given description will be filtered out.

We view each characteristic predictor as an information channel for searching the target word. We consider two types of channels: internal and external. The internal channels correspond to characteristics of the words themselves, namely the part-of-speech (POS) tag and morphemes. The external channels reflect characteristics of target words related to external knowledge bases; we consider two of them, the word category and sememes. The word category information can be obtained from word taxonomy systems and it usually corresponds to the genus words of definitions. A sememe is defined as the minimum semantic unit of human languages [5], which is similar to the concept of semantic primitive [33]. The sememes of a word depict its meaning atomically and can also be predicted from the description of the word.

More specifically, we adopt the well-established bi-directional LSTM (BiLSTM) [11] with attention [2] as the basic framework and add four feature-specific characteristic predictors to it. In experiments, we evaluate our model on English and Chinese datasets including both dictionary definitions and human-written descriptions, finding that our model achieves state-of-the-art performance. It is especially worth mentioning that for the first time OneLook is outperformed when input queries are human-written descriptions. In addition, to test our model under other real application scenarios such as crossword games, we provide our model with prior knowledge about the target word such as the initial letter, and find that it yields a substantial performance improvement. We also conduct detailed quantitative analyses and a case study to demonstrate the effectiveness of our model as well as its robustness in handling polysemous and low-frequency words.

Related Work

Reverse Dictionary Models

Most existing reverse dictionary models are based on sentence-sentence matching methods, i.e., comparing the input query with stored word definitions and returning the word whose definition is most similar to the input query [37, 4]. They usually use some hand-engineered features, e.g., tf-idf, to measure sentence similarity, and leverage well-established information retrieval techniques to search the target word [28]. Some of them utilize external knowledge bases like WordNet [17] to enhance sentence similarity measurement by finding synonyms or other pairs of related words between the input query and stored definitions [16, 13, 28].

Recent years have witnessed a growing number of reverse dictionary models which conduct sentence-word matching. Thorat and Choudhari (2016) present a node-graph architecture which can directly measure the similarity between the input query and any word in a word graph. However, it works on a small lexicon ( words) only. Hill et al. (2016) propose an NLM based reverse dictionary model, which uses a bag-of-words (BOW) model or an LSTM to embed the input query into the semantic space of word embeddings, and returns the words whose embeddings are closest to the representation of the input query.

Following the NLM model, Morinaga and Yamaguchi (2018) incorporate category inference to eliminate irrelevant results and achieve better performance; Kartsaklis et al. (2018) employ a graph of WordNet synsets and words in definitions to learn target word representations together with a multi-sense LSTM to encode input queries, and they claim to deliver state-of-the-art results; Hedderich et al. (2019) use multi-sense embeddings when encoding the queries, aiming to improve sentence representations of input queries; Pilehvar (2019) adopts sense embeddings to disambiguate senses of polysemous target words.

Our multi-channel model also uses an NLM to embed input queries. Compared with previous work, our model employs multiple predictors to identify characteristics of target words, which is consistent with the inference process of humans, and achieves significantly better performance.

Applications of Dictionary Definitions

Dictionary definitions are handy resources for NLP research. Many studies utilize dictionary definitions to improve word embeddings [20, 31, 1, 6, 26]. In addition, dictionary definitions are utilized in various applications including word sense disambiguation [15], knowledge representation learning [34], reading comprehension [14] and knowledge graph generation [30, 22].

Methodology

In this section, we first introduce some notations. Then we describe our basic framework, i.e., BiLSTM with attention. Next we detail our multi-channel model and its two internal and two external predictors. The architecture of our model is illustrated in Figure 2.

Notations

We define as the vocabulary set, as the whole morpheme set and as the whole POS tag set. For a given word , its morpheme set is , where each of its morpheme and denotes the cardinality of a set. A word may have multiple senses and each sense corresponds to a POS tag. Supposing has senses, all the POS tags of its senses form its POS tag set , where each POS tag . In subsequent sections, we use lowercase boldface symbols to stand for vectors and uppercase boldface symbols for matrices. For instance, is the word vector of and is a weight matrix.

Basic Framework

The basic framework of our model is essentially similar to a sentence classification model, composed of a sentence encoder and a classifier. We select Bidirectional LSTM (BiLSTM) [27] as the sentence encoder, which encodes an input query into a vector. Different words in a sentence contribute differently to the representation of the sentence, e.g., the genus words are more important than the modifiers in a definition. Therefore, we integrate an attention mechanism [2] into the BiLSTM to learn better sentence representations.

Formally, for an input query , we first pass the pre-trained word embeddings of its words to the BiLSTM, where is the dimension of word embeddings, and obtain two sequences of directional hidden states:

(1)

where and is the dimension of directional hidden states. Then we concatenate bi-directional hidden states to obtain non-directional hidden states:

(2)

The final sentence representation is the weighted sum of non-directional hidden states:

(3)

where is the attention item serving as the weight:

(4)

Next we map , the sentence vector of the input query, into the space of word embeddings, and calculate the confidence score of each word using dot product:

(5)

where indicates the confidence score of , is a weight matrix, is a bias vector.
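The pooling and scoring steps of the basic framework can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: hidden states are given directly as lists of floats instead of coming from a BiLSTM, the attention weights are assumed to be a softmax over dot products with a query vector (the exact attention form is not recoverable from the text here), and the names `attention_pool` and `word_scores` are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention_pool(hidden_states, query):
    """Return the attention-weighted sum of hidden states and the weights.

    Assumption: weights are a softmax over dot products with `query`.
    """
    logits = [dot(h, query) for h in hidden_states]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return pooled, weights

def word_scores(sentence_vec, W, b, embeddings):
    """Project the sentence vector (W v + b), then score each word by dot product."""
    projected = [dot(row, sentence_vec) + bias for row, bias in zip(W, b)]
    return {word: dot(projected, emb) for word, emb in embeddings.items()}

# Toy example: two hidden states and a zero query, so attention is uniform.
hidden = [[1.0, 0.0], [0.0, 1.0]]
pooled, weights = attention_pool(hidden, query=[0.0, 0.0])
identity, zero_bias = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
toy_embeddings = {"expressway": [1.0, 1.0], "car": [1.0, -1.0]}
scores = word_scores(pooled, identity, zero_bias, toy_embeddings)
```

In this toy setup the word whose embedding points in the same direction as the projected sentence vector receives the higher confidence score.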

Figure 2: Multi-channel reverse dictionary model.

Internal Channel: POS Tag Predictor

A dictionary definition or human-written description of a word is usually able to reflect the POS tag of the corresponding sense of the word. We believe that predicting the POS tag of the target word can alleviate the problem of returning words with POS tags contradictory to the input query in existing reverse dictionary models.

We simply pass the sentence vector of the input query to a single-layer perceptron:

(6)

where records the prediction score of each POS tag, is a weight matrix, and is a bias vector.

The confidence score of from the POS tag channel is the sum of the prediction scores of ’s POS tags:

(7)

where denotes the -th element of , and returns the POS tag index of .
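As a rough sketch of the POS tag channel, the snippet below applies a single linear layer to a sentence vector and sums the predicted scores over the POS tags a word can take. The tag inventory, weights and function names (`linear`, `pos_channel_score`) are hypothetical and chosen only for illustration; whether an activation follows the linear layer is not specified here and is omitted.

```python
def linear(W, b, x):
    """Single-layer perceptron without activation: W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias
            for row, bias in zip(W, b)]

def pos_channel_score(pos_scores, word_pos_tags, pos_index):
    """Sum the prediction scores of every POS tag the word can take."""
    return sum(pos_scores[pos_index[tag]] for tag in word_pos_tags)

# Hypothetical tag inventory and toy parameters.
pos_index = {"noun": 0, "verb": 1, "adjective": 2}
W_pos = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_pos = [0.0, 0.0, 0.0]
sentence_vec = [2.0, 0.5]

pos_scores = linear(W_pos, b_pos, sentence_vec)
score_expressway = pos_channel_score(pos_scores, {"noun"}, pos_index)
score_run = pos_channel_score(pos_scores, {"noun", "verb"}, pos_index)
```

A word with several senses, and hence several POS tags, accumulates the scores of all of them, which is why the score is a sum rather than a single lookup.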

Internal Channel: Morpheme Predictor

Most words are complex words consisting of more than one morpheme. We find there exists a kind of local semantic correspondence between the morphemes of a word and its definition or description. For instance, the word “expressway” has two morphemes “express” and “way” and its dictionary definition is “a wide road in a city on which cars can travel very quickly”. We can observe that the two words “road” and “quickly” semantically correspond to the two morphemes “way” and “express” respectively. By predicting morphemes of the target word from the input query, a reverse dictionary can capture compositional information of the target word, which is complementary to contextual information of word embeddings.

We design a special morpheme predictor. Different from the POS tag predictor, we allow each hidden state to be involved in morpheme prediction directly, and do max-pooling to obtain final morpheme prediction scores. Specifically, we feed each non-directional hidden state to a single-layer perceptron and obtain local morpheme prediction scores:

(8)

where measures the semantic correspondence between -th word in the input query and each morpheme, is a weight matrix, and is a bias vector. Then we do max-pooling over all the local morpheme prediction scores to obtain global morpheme prediction scores:

(9)

And the confidence score of from the morpheme channel is:

(10)

where returns the morpheme index of .
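The per-position prediction followed by max-pooling can be sketched as follows. This is an illustrative toy, not the authors' code: the morpheme inventory, the weight matrix and the helper names (`max_pool_scores`, `label_channel_score`) are assumptions, and the channel score is assumed to be the sum of the global scores of the word's morphemes.

```python
def linear(W, b, x):
    """Single-layer perceptron without activation: W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias
            for row, bias in zip(W, b)]

def max_pool_scores(hidden_states, W, b):
    """Score every label from each hidden state, then max-pool over positions."""
    local = [linear(W, b, h) for h in hidden_states]
    return [max(column) for column in zip(*local)]

def label_channel_score(global_scores, word_labels, label_index):
    """Assumed channel score: sum of global scores of the word's labels."""
    return sum(global_scores[label_index[label]] for label in word_labels)

# Hypothetical morpheme inventory; hidden states stand in for "road" and "quickly".
morpheme_index = {"express": 0, "way": 1, "post": 2}
hidden = [[1.0, 0.0],   # position of "road"
          [0.0, 1.0]]   # position of "quickly"
W_mor = [[0.0, 2.0],    # "express" responds to the second dimension
         [3.0, 0.0],    # "way" responds to the first dimension
         [0.0, 0.0]]    # "post" responds to nothing in this query
b_mor = [0.0, 0.0, 0.0]

global_scores = max_pool_scores(hidden, W_mor, b_mor)
score_expressway = label_channel_score(global_scores, ["express", "way"], morpheme_index)
score_post_word = label_channel_score(global_scores, ["post"], morpheme_index)
```

Max-pooling lets a single query word that strongly signals a morpheme determine that morpheme's score, regardless of where it appears in the query. The same pattern applies to any label set predicted from individual hidden states.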

External Channel: Word Category Predictor

Semantically related words often belong to different categories, although they have close word embeddings, e.g., “car” and “road”. Word category information is helpful in eliminating semantically related but not similar words from the results of reverse dictionaries [18]. There are many available word taxonomy systems which can provide hierarchical word category information, e.g., WordNet [17]. Some of them provide POS tag information as well, in which case the POS tag predictor can be removed.

We design a hierarchical predictor to calculate prediction scores of word categories. Specifically, each word belongs to a certain category in each layer of word hierarchy. We first compute the word category prediction score of each layer:

(11)

where is the word category prediction score distribution of -th layer, is a weight matrix, is a bias vector, and is the category number of -th layer. Then the final confidence score of from the word category channel is the weighted sum of its category prediction scores of all the layers:

(12)

where is the total layer number of the word hierarchy, is a hyper-parameter controlling the relative weights, and returns the category index of in the -th layer.
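A minimal sketch of the hierarchical category channel is given below. It assumes the per-layer prediction scores have already been computed, and simply takes the weighted sum, over layers, of the score of the category the word belongs to in each layer. The layer sizes, weights and the name `category_channel_score` are hypothetical.

```python
def category_channel_score(layer_scores, word_categories, layer_weights):
    """Weighted sum over layers of the score of the word's category in that layer.

    layer_scores[k]    : prediction scores for the categories of layer k
    word_categories[k] : index of the word's category in layer k
    layer_weights[k]   : hyper-parameter weighting layer k
    """
    return sum(weight * scores[category]
               for weight, scores, category
               in zip(layer_weights, layer_scores, word_categories))

# Toy hierarchy with two layers (2 coarse categories, 3 finer ones).
layer_scores = [[0.9, 0.1], [0.2, 0.7, 0.1]]
word_categories = [0, 1]      # the word's category index in each layer
layer_weights = [1.0, 0.5]    # hypothetical relative weights

score = category_channel_score(layer_scores, word_categories, layer_weights)
```

With a single-layer hierarchy the sum reduces to one weighted lookup.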

External Channel: Sememe Predictor

In linguistics, a sememe is the minimum semantic unit of natural languages [5]. Sememes of a word can accurately depict the meaning of the word. HowNet [8] is the most famous sememe knowledge base. It defines about sememes and uses them to annotate more than Chinese and English words by hand. HowNet and its sememe knowledge have been widely applied to various NLP tasks including sentiment analysis [9], word representation learning [19], semantic composition [23], sequence modeling [25] and textual adversarial attack [35].

Sememe annotation of a word in HowNet includes hierarchical sememe structures as well as relations between sememes. For simplicity, we extract a set of unstructured sememes for each word, in which case sememes of a word can be regarded as multiple semantic labels of the word. We find there also exists local semantic correspondence between the sememes of a word and its description. Still taking “expressway” as an example, its annotated sememes in HowNet are route and fast, which semantically correspond to the words in its definition “road” and “quickly” respectively.

Therefore, we design a sememe predictor similar to the morpheme predictor. Formally, we use to represent the set of all sememes. The sememe set of a word is . We pass each hidden state to a single-layer perceptron to calculate local sememe prediction scores:

(13)

where indicates the degree of semantic correspondence between the -th word in the input query and each sememe, is a weight matrix, and is a bias vector. Final sememe prediction scores are computed by max-pooling:

(14)

The confidence score of from the sememe channel is:

(15)

where returns the sememe index of .

Multi-channel Reverse Dictionary Model

By combining the confidence scores of direct word prediction and indirect characteristic prediction, we obtain the final confidence score of a given word in our multi-channel reverse dictionary model:

(16)

where is the channel set, and and are the hyper-parameters controlling relative weights of corresponding terms.
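The combination of direct word prediction and channel scores can be sketched as a weighted sum. The numbers below are invented for illustration, and `final_score`, `lambda_word` and `channel_weights` are hypothetical names for the score and its weighting hyper-parameters.

```python
def final_score(word_score, channel_scores, lambda_word, channel_weights):
    """Weighted sum of the direct word score and each channel's score."""
    return lambda_word * word_score + sum(
        channel_weights[name] * score for name, score in channel_scores.items())

# Invented candidate scores: "car" has the closer embedding to the query,
# but "expressway" matches the predicted morphemes and category.
candidates = {
    "expressway": (0.2, {"morpheme": 1.0, "category": 0.8}),
    "car":        (0.5, {"morpheme": 0.0, "category": 0.1}),
}
channel_weights = {"morpheme": 0.5, "category": 0.5}

totals = {word: final_score(ws, cs, 1.0, channel_weights)
          for word, (ws, cs) in candidates.items()}
ranking = sorted(totals, key=totals.get, reverse=True)
```

The toy ranking shows the intended effect: a target word with a weak embedding-based score can still be ranked first when its characteristics agree with the query.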

As for training loss, we simply adopt the one-versus-all cross-entropy loss inspired by the sentence classification models.
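As a rough illustration of such a loss, the sketch below treats the final confidence scores as logits over the vocabulary and computes a softmax cross-entropy against the target word. Reading the loss this way is an assumption, and the function name `cross_entropy` is hypothetical.

```python
import math

def cross_entropy(scores, target):
    """Softmax cross-entropy of the target word given per-word scores (logits)."""
    peak = max(scores.values())  # subtract the max for numerical stability
    log_normalizer = peak + math.log(
        sum(math.exp(s - peak) for s in scores.values()))
    return log_normalizer - scores[target]

# Two equally scored words give a loss of log(2); raising the target's
# score lowers the loss.
loss_uniform = cross_entropy({"expressway": 0.0, "car": 0.0}, "expressway")
loss_confident = cross_entropy({"expressway": 3.0, "car": 0.0}, "expressway")
```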

Model          Seen Definition           Unseen Definition         Description
               med.  acc@1/10/100  var.  med.  acc@1/10/100  var.  med.  acc@1/10/100  var.
OneLook        0     .66/.94/.95   200   -     -             -     5.5   .33/.54/.76   332
BOW            172   .03/.16/.43   414   248   .03/.13/.39   424   22    .13/.41/.69   308
RNN            134   .03/.16/.44   375   171   .03/.15/.42   404   17    .14/.40/.73   274
RDWECI         121   .06/.20/.44   420   170   .05/.19/.43   420   16    .14/.41/.74   306
SuperSense     378   .03/.15/.36   462   465   .02/.11/.31   454   115   .03/.15/.47   396
MS-LSTM        0     .92/.98/.99   65    276   .03/.14/.37   426   1000  .01/.04/.18   404
BiLSTM         25    .18/.39/.63   363   101   .07/.24/.49   401   5     .25/.60/.83   214
+Mor           24    .19/.41/.63   345   80    .08/.26/.52   399   4     .26/.62/.85   198
+Cat           19    .19/.42/.68   309   68    .08/.28/.54   362   4     .30/.62/.85   206
+Sem           19    .19/.43/.66   349   80    .08/.26/.53   393   4     .30/.64/.87   218
Multi-channel  16    .20/.44/.71   310   54    .09/.29/.58   358   2     .32/.64/.88   203
Columns for each test set: median rank (med.), accuracy@1/10/100 (acc@1/10/100), rank variance (var.).
Table 1: Overall reverse dictionary performance of all the models.

Experiments

In this section, we evaluate the performance of our multi-channel reverse dictionary model. We also conduct detailed quantitative analyses as well as a case study to explore the influencing factors in the reverse dictionary task and demonstrate the strengths and weaknesses of our model. We carry out experiments on both English and Chinese datasets. Due to limited space, we present our experiments on the Chinese dataset in the appendix.

Dataset

We use the English dictionary definition dataset created by Hill et al. (2016) as the training set. It contains about words and word-definition pairs. We have three test sets: (1) the seen definition set, which contains pairs of words and WordNet definitions existing in the training set and is used to assess the ability to recall previously encoded information; (2) the unseen definition set, which also contains pairs of words and WordNet definitions but the words together with all their definitions have been excluded from the training set; and (3) the description set, which consists of pairs of words and human-written descriptions and is a benchmark dataset also created by Hill et al. (2016).

To obtain the morpheme information our model needs, we use Morfessor [32] to segment all the words into morphemes. As for the word category information, we use the lexical names from WordNet [17]. There are lexical names and the total layer number of the word category hierarchy is . Since the lexical names have included POS tags, e.g., noun.animal, we remove the POS tag predictor from our model. We use HowNet as the source of sememes. It contains English words manually annotated with different sememes in total. We employ OpenHowNet [24], the open data accessing API of HowNet, to obtain sememes of words.

Experimental Settings

Baseline Methods

We choose the following models as the baseline methods: (1) OneLook, the most popular commercial reverse dictionary system, whose 2.0 version is used; (2) BOW and RNN with rank loss [10], both of which are NLM based and the former uses a bag-of-words model while the latter uses an LSTM; (3) RDWECI [18], which incorporates category inference and is an improved version of BOW; (4) SuperSense [21], an improved version of BOW which uses pretrained sense embeddings to substitute target word embeddings; (5) MS-LSTM [12], an improved version of RNN which uses graph-based WordNet synset embeddings together with a multi-sense LSTM to predict synsets from descriptions and claims to produce state-of-the-art performance; and (6) BiLSTM, the basic framework of our multi-channel model.

Hyper-parameters and Training

For our model, the dimension of non-directional hidden states is , the weights of different channels are equally set to , and the dropout rate is 0.5. For all the models except MS-LSTM, we use the 300-dimensional word embeddings pretrained on GoogleNews with word2vec, and the word embeddings are fixed during training. For all the other baseline methods, we use their recommended hyper-parameters. For training, we adopt Adam as the optimizer with initial learning rate , and the batch size is .

Evaluation Metrics

Following previous work, we use three evaluation metrics: the median rank of target words (lower better), the accuracy that target words appear in top 1/10/100 (acc@1/10/100, higher better) and the standard deviation of target words’ ranks (rank variance, lower better). Notice that MS-LSTM can only predict WordNet synsets. Thus, we map the target words to corresponding WordNet synsets (target synsets) and calculate the accuracy and rank variance of the target synsets.
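The three metrics can be computed from the rank of the target word for each test query. The sketch below assumes 0-based ranks (a median rank of 0 appears in the result tables) and uses the population standard deviation for the rank spread; which standard-deviation variant is intended is an assumption. The function name `evaluate` is hypothetical.

```python
import statistics

def evaluate(ranks, ks=(1, 10, 100)):
    """Compute median rank, accuracy@k and rank standard deviation.

    ranks: 0-based rank of the target word for each query
           (0 means the target word is the top result).
    """
    median_rank = statistics.median(ranks)
    accuracy = {k: sum(r < k for r in ranks) / len(ranks) for k in ks}
    rank_std = statistics.pstdev(ranks)  # population standard deviation
    return median_rank, accuracy, rank_std

# Toy example with four queries.
median_rank, accuracy, rank_std = evaluate([0, 2, 15, 250])
```

For the four toy ranks, the median is 8.5 and the accuracies at 1, 10 and 100 are 0.25, 0.5 and 0.75.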

Prior Knowledge  Seen Definition           Unseen Definition         Description
                 med.  acc@1/10/100  var.  med.  acc@1/10/100  var.  med.  acc@1/10/100  var.
None             16    .20/.44/.71   310   54    .09/.29/.58   358   2.5   .32/.64/.88   203
POS Tag          13    .21/.45/.72   290   45    .10/.31/.60   348   3     .35/.65/.91   174
Initial Letter   1     .39/.73/.90   270   4     .26/.63/.85   348   0     .62/.90/.97   160
Word Length      1     .40/.71/.90   269   6     .25/.56/.84   346   0     .55/.85/.95   163
Columns for each test set: median rank (med.), accuracy@1/10/100 (acc@1/10/100), rank variance (var.).
Table 2: Reverse dictionary performance with prior knowledge.

Overall Experimental Results

Table 1 exhibits reverse dictionary performance of all the models on the three test sets, where “Mor”, “Cat” and “Sem” represent the morpheme, word category and sememe predictors respectively. Notice that the performance of OneLook on the unseen test set is meaningless because we cannot exclude any definitions from its definition bank, hence we do not list corresponding results. From the table, we can see:

(1) Compared with all the baseline methods other than OneLook, our multi-channel model achieves substantially better performance on the unseen definition set and the description set, which shows that our model generalizes better to novel and unseen input queries.

(2) OneLook significantly outperforms our model when the input queries are dictionary definitions. This result is expected because the input dictionary definitions are already stored in the database of OneLook and even simple text matching can easily handle this situation. However, the input queries of a reverse dictionary are unlikely to be exact dictionary definitions in practice. On the description test set, our multi-channel model achieves better overall performance than OneLook. Although OneLook yields slightly higher acc@1, this has limited value in practical application, because people usually need to pick the proper word from several candidates, not to mention the fact that the acc@1 of OneLook is only .33.

(3) MS-LSTM performs very well on the seen definition set but badly on the description set, which indicates its limited generalization ability and practical value. Notice that when testing MS-LSTM, the search space is the whole synset list rather than the synset list of the test set, which causes the difference between the performance on the unseen definition set measured by us and that recorded in the original work [12].

(4) All the BiLSTM variants enhanced with different information channels (+Mor, +Cat and +Sem) perform better than vanilla BiLSTM. These results demonstrate the effectiveness of predicting characteristics of target words in the reverse dictionary task. Moreover, our multi-channel model achieves further improvement over the single-channel models, which shows the benefit of fusing characteristics and verifies the efficacy of our multi-channel model.

(5) BOW performs better than RNN, which is consistent with the findings from Hill et al. (2016). However, BiLSTM far surpasses BOW as well as RNN. This verifies the necessity for bi-directional encoding in RNN models, and also shows the potential of RNNs.

Performance with Prior Knowledge

In practical applications of reverse dictionaries, extra information about the target word may be known in addition to the description. For example, we may remember the initial letter of the word we forget, or the length of the target word may be known, as in a crossword game. In this subsection, we evaluate the performance of our model with prior knowledge of target words, including POS tag, initial letter and word length. More specifically, we extract the words satisfying given prior knowledge from the top results of our model, and then reevaluate the performance. The results are shown in Table 2.

We find that any prior knowledge improves the performance of our model to a greater or lesser extent, which is an expected result. However, the performance boost brought by the initial letter and word length information is much bigger than that brought by the POS tag information. The possible reasons are as follows. The POS tag has already been predicted in our multi-channel model, hence the improvement it brings is limited, which also suggests that our model does well in POS tag prediction. The initial letter and word length are hard to predict from a definition or description and are not considered in our model. Therefore, they can filter many candidates out and markedly increase performance.
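The filtering step used in this experiment can be sketched as a post-processing pass over the ranked output. The function name `filter_by_prior`, its parameters and the candidate list are hypothetical; a POS tag filter would work the same way but needs a word-to-POS lookup, which is omitted here.

```python
def filter_by_prior(ranked_words, initial=None, length=None):
    """Keep, in their original order, the words satisfying the given prior knowledge."""
    kept = []
    for word in ranked_words:
        if initial is not None and not word.startswith(initial):
            continue
        if length is not None and len(word) != length:
            continue
        kept.append(word)
    return kept

# Hypothetical ranked output for a query describing "expressway".
ranked = ["road", "highway", "expressway", "freeway", "engine"]

by_initial = filter_by_prior(ranked, initial="e")
by_length = filter_by_prior(ranked, length=len("expressway"))
rank_before = ranked.index("expressway")
rank_after = by_initial.index("expressway")
```

In the toy list, knowing the initial letter moves the target word from 0-based rank 2 to rank 0, which mirrors the kind of gain reported for the initial-letter and word-length settings.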

Analyses of Influencing Factors

In this subsection, we conduct quantitative analyses of the influencing factors in reverse dictionary performance. To make results more accurate, we use a larger test set consisting of words and seen pairs of words and WordNet definitions. Since we are interested in the features of target words, we exclude MS-LSTM that predicts WordNet synsets.

Figure 3: Acc@10 on words with different sense numbers. The numbers of words are , , , , and respectively.

Sense Number

Figure 3 exhibits the acc@10 of all the models on the words with different numbers of senses. The performance of all the models declines as the sense number increases, which indicates that polysemy is a difficulty in the reverse dictionary task. But our model displays outstanding robustness and its performance hardly deteriorates even on the words with the most senses.

Figure 4: Acc@10 on ranges of different word frequency rankings. The number of words in each range is , , , , and respectively.

Word Frequency

Figure 4 displays all the models’ performance on the words within different ranges of word frequency ranking. We can find that the most frequent and infrequent words are harder to predict for all the reverse dictionary models. The most infrequent words usually have poor embeddings, which may damage the performance of NLM based models. For the most frequent words, on the other hand, although their embeddings are better, they usually have more senses. We count the average sense numbers of all the ranges, which are , , , , and respectively. The first range has a much larger average sense number, which explains its bad performance. Moreover, our model also demonstrates remarkable robustness.

Query Length

The effect of query length on reverse dictionary performance is illustrated in Figure 5. When the input query has only one word, the system performance is strikingly poor, especially for our multi-channel model. This is easy to explain because the information extracted from the input query is too limited. In this case, outputting the synonyms of the query word is likely to be a better choice.

Case study

In this subsection, we give two cases in Table 3 to display the strengths and weaknesses of our reverse dictionary model. For the first word “postnuptial”, our model correctly predicts its morpheme “post” and sememe “GetMarried” from the words “after” and “marriage” in the input query. Therefore, our model easily finds the correct answer. For the second case, the input query describes a rare sense of the word “takeaway”. HowNet has no sememe annotation for this sense, and the morphemes of the word are not semantically related to any words in the query either. Our model cannot solve this kind of case, which is in fact hard to handle for all the NLM based models. In this situation, text matching methods, which return the words whose stored definitions are most similar to the input query, may help.

Figure 5: Acc@10 on ranges of different query length. The number of queries in each range is , , , , , , and respectively.
Word Input Query
postnuptial relating to events after a marriage.
takeaway concession made by a labor union to a company.
Table 3: Two reverse dictionary cases.

Conclusion and Future Work

In this paper, we propose a multi-channel reverse dictionary model, which incorporates multiple predictors to predict characteristics of target words from given input queries. Experimental results and analyses show that our model achieves state-of-the-art performance and also possesses outstanding robustness.

In the future, we will try to combine our model with text matching methods to better tackle extreme cases, e.g., single-word input query. In addition, we are considering extending our model to the cross-lingual reverse dictionary task. Moreover, we will explore the feasibility of transferring our model to related tasks such as question answering.

Acknowledgements

This work is funded by the Natural Science Foundation of China (NSFC) and the German Research Foundation (DFG) in Project Crossmodal Learning, NSFC 61621136008 / DFG TRR-169. Furthermore, we thank the anonymous reviewers for their valuable comments and suggestions.

Appendix A: Experiments on Chinese Dataset

In this section, we evaluate our multi-channel reverse dictionary model on the Chinese dataset.

Dataset

For Chinese, we build a dictionary definition dataset as the training set. It contains words and word-definition pairs, and the definitions are extracted from Modern Chinese Dictionary (6th Edition), an authoritative Chinese dictionary. There are four test sets including (1) seen and unseen definition sets of size , which are built in a similar way to the English ones; (2) description set, which is created by us and contains word-description pairs given by Chinese native speakers; and (3) question set, which collects real-world Chinese exam questions and answers from the Internet that ask for the right word given a description.

For the morpheme information, we simply cut each word into Chinese characters as morphemes. As for the word category information, we use HIT-IR Tongyici Cilin. It has five levels of word category hierarchy, and we only use the first four levels. The numbers of categories in each level are , , and , respectively. We also use HowNet as the source of sememes in the same way as English. The POS tags can be extracted from Modern Chinese Dictionary (6th Edition). We use all its POS tags.

Experimental Settings

Baseline Methods

We choose the same baseline methods as for English, except OneLook, SuperSense and MS-LSTM. We exclude OneLook because it only supports English reverse dictionary search, and there are no Chinese reverse dictionary systems. SuperSense and MS-LSTM rely on WordNet, but the Chinese version of WordNet contains too few words, so we do not compare with them either. More specifically, the baselines are (1) BOW and RNN with rank loss [10], both of which are NLM based, where the former uses a bag-of-words model and the latter uses an LSTM; (2) RDWECI [18], which incorporates category inference and is an improved version of BOW; and (3) BiLSTM, the basic framework of our multi-channel model.

Hyper-parameters and Training

For our model on the Chinese dataset, the dimension of non-directional hidden states is , which is different from that of the English model. The weights of the different channels are all set to . For the baseline methods, we use their recommended hyper-parameters. For all the models, we use 200-dimensional word embeddings pretrained on the SogouT corpus (footnote 6) with word2vec (footnote 7), and the word embeddings are fixed during training. For training, we adopt Adam as the optimizer with initial learning rate , and the batch size is , all of which are the same as in the English experiments.
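
As an illustration of what equal channel weights mean in practice, the sketch below assumes that the final score of a candidate word is its direct word score plus a weighted sum of per-channel scores. The function name, the example scores and the weight value 0.5 are all hypothetical; the actual weight value is not reproduced here.

```python
from typing import Dict


def combine_channel_scores(
    word_scores: Dict[str, float],
    channel_scores: Dict[str, Dict[str, float]],
    channel_weight: float,
) -> Dict[str, float]:
    """Add every channel's score to the direct word score, using one shared weight."""
    combined = dict(word_scores)
    for scores in channel_scores.values():
        for word, score in scores.items():
            combined[word] = combined.get(word, 0.0) + channel_weight * score
    return combined


# Hypothetical scores for two candidate words.
word_scores = {"高速公路": 0.4, "公路": 0.5}
channel_scores = {
    "morpheme": {"高速公路": 0.8, "公路": 0.2},
    "sememe": {"高速公路": 0.6, "公路": 0.4},
}
combined = combine_channel_scores(word_scores, channel_scores, channel_weight=0.5)
ranked = sorted(combined, key=combined.get, reverse=True)
```

In this toy example the channels overturn the ordering given by the direct word scores alone, which is the intended effect of adding auxiliary predictors.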

Evaluation Protocols

As in the English experiments, we use three metrics: (1) the median rank of the target words; (2) the accuracy with which the target words appear in the top 1/10/100; and (3) the standard deviation of the target words' ranks.
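
The three metrics can be computed from the list of target-word ranks as in the following minimal sketch. It assumes 0-based ranks (a rank of 0 means the target word is returned first) and uses the population standard deviation; both choices, like the function name, are assumptions made for illustration.

```python
import statistics
from typing import Dict, List, Sequence, Union


def evaluate_ranks(
    ranks: List[int], ks: Sequence[int] = (1, 10, 100)
) -> Dict[str, Union[float, Dict[int, float]]]:
    """Compute median rank, accuracy@k and rank standard deviation.

    ASSUMPTION: ranks are 0-based, so a target word counts as a top-k hit
    when its rank is strictly smaller than k.
    """
    accuracy = {k: sum(r < k for r in ranks) / len(ranks) for k in ks}
    return {
        "median_rank": statistics.median(ranks),
        "accuracy": accuracy,
        "rank_std": statistics.pstdev(ranks),  # population standard deviation
    }


# Hypothetical ranks of the target word for four test queries.
result = evaluate_ranks([0, 3, 12, 250])
```

A lower median rank and rank standard deviation are better, while a higher accuracy@k is better.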

Overall Experimental Results

Table 4 shows the reverse dictionary performance of all the models on the four test sets, where “+POS”, “+Mor”, “+Cat” and “+Sem” denote the POS tag, morpheme, word category and sememe predictors, respectively. From the table, we can see:

| Model | Seen Definition | Unseen Definition | Description | Question |
|---|---|---|---|---|
| BOW | 59 .08/.28/.56 403 | 65 .08/.28/.53 411 | 40 .07/.30/.60 357 | 42 .10/.28/.63 362 |
| RNN | 69 .05/.23/.55 379 | 103 .05/.21/.49 405 | 79 .04/.26/.53 361 | 56 .07/.27/.60 346 |
| RDWECI | 56 .09/.31/.56 423 | 83 .08/.28/.52 436 | 32 .09/.32/.59 376 | 45 .12/.32/.61 384 |
| BiLSTM | 4 .28/.58/.78 302 | 14 .15/.45/.71 343 | 13 .14/.44/.78 233 | 4 .30/.61/.82 243 |
| +POS | 4 .28/.58/.78 309 | 14 .16/.45/.71 346 | 13 .14/.44/.79 255 | 5 .25/.59/.79 271 |
| +Mor | 1 .43/.73/.87 260 | 11 .19/.47/.73 332 | 8 .22/.52/.83 251 | 1 .42/.73/.86 227 |
| +Cat | 4 .29/.58/.78 319 | 16 .14/.43/.70 356 | 13 .16/.45/.77 289 | 3 .33/.62/.82 246 |
| +Sem | 4 .29/.60/.80 298 | 14 .16/.45/.72 340 | 12 .15/.45/.75 244 | 4 .34/.61/.83 231 |
| Multi-channel | 1 .49/.78/.90 220 | 10 .18/.49/.76 310 | 5 .24/.56/.82 260 | 0 .50/.73/.90 223 |

Each cell lists, in order: median rank, accuracy@1/10/100, rank variance.

Table 4: Overall reverse dictionary performance of all the models.

| Prior Knowledge | Seen Definition | Unseen Definition | Description | Question |
|---|---|---|---|---|
| None | 1 .49/.78/.90 220 | 10 .18/.49/.76 310 | 5 .24/.56/.82 260 | 0 .50/.73/.90 223 |
| POS Tag | 1 .50/.79/.90 222 | 9 .18/.51/.77 307 | 4 .24/.61/.85 252 | 0 .50/.74/.90 223 |
| Initial Char | 0 .74/.89/.92 220 | 0 .55/.82/.86 304 | 0 .61/.88/.93 239 | 0 .84/.95/.95 213 |
| Word Length | 0 .54/.82/.91 217 | 6 .23/.57/.81 297 | 3 .32/.68/.88 242 | 0 .62/.85/.94 212 |

Each cell lists, in order: median rank, accuracy@1/10/100, rank variance.

Table 5: Reverse dictionary performance with prior knowledge.

(1) Our multi-channel model achieves substantially better performance than all the baseline methods on all four test sets, which demonstrates the superiority of our model. In addition, similar to the results of the English experiments, our model also generalizes well to novel, unseen input queries.

(2) All the BiLSTM variants enhanced with different information channels (+POS, +Mor, +Cat and +Sem) perform better than vanilla BiLSTM, except for BiLSTM+POS on the Question test set. That is because the words in the Question test set are all idioms, and most of them have no POS tags. Overall, the results support the effectiveness of all four information channels.

(3) BiLSTM’s better performance than BOW and RNN demonstrates the necessity of bi-directional encoding in RNN models, although BOW also performs better than RNN here.

(4) The results on the Question test set show that our model is also good at question-answer exercise problems in real-world exams.

Performance with Prior Knowledge

As in the English experiments, we evaluate the performance of our model on the Chinese dataset when prior knowledge of the target word is given.

The results are shown in Table 5. Again, any prior knowledge improves our model’s performance, especially the initial character information. That is presumably because Chinese words contain far fewer characters on average than English words, so knowing the initial character shrinks the search space considerably. Similar to English, the improvement given POS tag information is insignificant, which suggests that our model already does well in POS tag prediction.
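
One simple way to apply such prior knowledge is to discard every candidate that contradicts it before reading off the ranking. The sketch below assumes exactly that; the function name, its parameters and the example candidates are hypothetical, and the procedure actually used in the experiments may differ.

```python
from typing import List, Optional


def filter_by_prior(
    ranked_words: List[str],
    initial_char: Optional[str] = None,
    word_length: Optional[int] = None,
) -> List[str]:
    """Keep only candidates consistent with the known prior, preserving rank order."""
    kept = []
    for word in ranked_words:
        if initial_char is not None and not word.startswith(initial_char):
            continue
        if word_length is not None and len(word) != word_length:
            continue
        kept.append(word)
    return kept


# Hypothetical model output, best candidate first.
ranked_words = ["公路", "高速公路", "高铁", "快车道"]
by_initial = filter_by_prior(ranked_words, initial_char="高")
by_length = filter_by_prior(ranked_words, word_length=4)
```

Because most Chinese words are only a few characters long, fixing the first character removes a large share of the candidates, which is consistent with the strong gains reported for the initial character setting.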

Footnotes

  1. https://onelook.com/thesaurus/
  2. The definitions are extracted from five electronic resources: WordNet, The American Heritage Dictionary, The Collaborative International Dictionary of English, Wiktionary and Webster’s.
  3. https://code.google.com/archive/p/word2vec/
  4. http://www.cp.com.cn/book/978-7-100-08467-344.html
  5. https://github.com/yaleimeng/Final_word_Similarity/tree/master/cilin
  6. https://www.sogou.com/labs/resource/t.php
  7. https://code.google.com/archive/p/word2vec/

References

  1. D. Bahdanau, T. Bosc, S. Jastrzebski, E. Grefenstette, P. Vincent and Y. Bengio (2017) Learning to compute word embeddings on the fly. arXiv preprint arXiv:1706.00286. Cited by: Applications of Dictionary Definitions.
  2. D. Bahdanau, K. Cho and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, Cited by: Introduction, Basic Framework.
  3. D. F. Benson (1979) Neurologic correlates of anomia. In Studies in neurolinguistics, pp. 293–328. Cited by: Introduction.
  4. S. Bilac, W. Watanabe, T. Hashimoto, T. Tokunaga and H. Tanaka (2004) Dictionary search based on the target word description. In Proceedings of NLP, Cited by: Introduction, Reverse Dictionary Models.
  5. L. Bloomfield (1926) A set of postulates for the science of language. Language 2 (3), pp. 153–164. Cited by: Introduction, External Channel: Sememe Predictor.
  6. T. Bosc and P. Vincent (2018) Auto-encoding dictionary definitions into consistent word embeddings. In Proceedings of EMNLP, Cited by: Applications of Dictionary Definitions.
  7. R. Brown and D. McNeill (1966) The “tip of the tongue” phenomenon. Journal of verbal learning and verbal behavior 5 (4), pp. 325–337. Cited by: Introduction.
  8. Z. Dong and Q. Dong (2003) HowNet-a hybrid language and knowledge resource. In Proceedings of NLP-KE, Cited by: External Channel: Sememe Predictor.
  9. X. Fu, G. Liu, Y. Guo and Z. Wang (2013) Multi-aspect sentiment analysis for chinese online social reviews based on topic modeling and hownet lexicon. Knowledge-Based Systems 37, pp. 186–195. Cited by: External Channel: Sememe Predictor.
  10. F. Hill, K. Cho, A. Korhonen and Y. Bengio (2016) Learning to understand phrases by embedding the dictionary. TACL 4, pp. 17–30. Cited by: Appendix A, Introduction, Baseline Methods.
  11. S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: Introduction.
  12. D. Kartsaklis, M. T. Pilehvar and N. Collier (2018) Mapping text to knowledge graph entities using multi-sense lstms. In Proceedings of EMNLP, Cited by: Introduction, Baseline Methods, Overall Experimental Results.
  13. K. N. Lam and J. K. Kalita (2013) Creating reverse bilingual dictionaries. In Proceedings of HLT-NAACL, Cited by: Reverse Dictionary Models.
  14. T. Long, E. Bengio, R. Lowe, J. C. K. Cheung and D. Precup (2017) World knowledge for reading comprehension: rare entity prediction with hierarchical lstms using external descriptions. In Proceedings of EMNLP, Cited by: Applications of Dictionary Definitions.
  15. F. Luo, T. Liu, Q. Xia, B. Chang and Z. Sui (2018) Incorporating glosses into neural word sense disambiguation. In Proceedings of ACL, Cited by: Applications of Dictionary Definitions.
  16. O. Méndez, H. Calvo and M. A. Moreno-Armendáriz (2013) A reverse dictionary based on semantic analysis using wordnet. In Proceedings of MICAI 2013, Cited by: Introduction, Reverse Dictionary Models.
  17. G. A. Miller (1995) WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41. Cited by: Reverse Dictionary Models, External Channel: Word Category Predictor, Dataset.
  18. Y. Morinaga and K. Yamaguchi (2018) Improvement of reverse dictionary by tuning word vectors and category inference. In Proceedings of ICIST, Cited by: Appendix A, External Channel: Word Category Predictor, Baseline Methods.
  19. Y. Niu, R. Xie, Z. Liu and M. Sun (2017) Improved word representation learning with sememes. In Proceedings of ACL, Cited by: External Channel: Sememe Predictor.
  20. T. Noraset, C. Liang, L. Birnbaum and D. Downey (2017) Definition modeling: learning to define word embeddings in natural language. In Proceedings of AAAI, Cited by: Applications of Dictionary Definitions.
  21. M. T. Pilehvar (2019) On the importance of distinguishing word meaning representations: a case study on reverse dictionary mapping. In Proceedings of NAACL-HLT, Cited by: Baseline Methods.
  22. V. Prokhorov, M. T. Pilehvar and N. Collier (2019) Generating knowledge graph paths from textual definitions using sequence-to-sequence models. In Proceedings of NAACL-HLT, Cited by: Applications of Dictionary Definitions.
  23. F. Qi, J. Huang, C. Yang, Z. Liu, X. Chen, Q. Liu and M. Sun (2019) Modeling semantic compositionality with sememe knowledge. In Proceedings of ACL, Cited by: External Channel: Sememe Predictor.
  24. F. Qi, C. Yang, Z. Liu, Q. Dong, M. Sun and Z. Dong (2019) OpenHowNet: an open sememe-based lexical knowledge base. arXiv preprint arXiv:1901.09957. Cited by: Dataset.
  25. Y. Qin, F. Qi, S. Ouyang, Z. Liu, C. Yang, Y. Wang, Q. Liu and M. Sun (2019) Enhancing recurrent neural networks with sememes. arXiv preprint arXiv:1910.08910. Cited by: External Channel: Sememe Predictor.
  26. T. Scheepers, E. Kanoulas and E. Gavves (2018) Improving word embedding compositionality using lexicographic definitions. In Proceedings of WWW, Cited by: Applications of Dictionary Definitions.
  27. M. Schuster and K. K. Paliwal (1997) Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45 (11), pp. 2673–2681. Cited by: Basic Framework.
  28. R. Shaw, A. Datta, D. E. VanderMeer and K. Dutta (2013) Building a scalable database-driven reverse dictionary. TKDE 25, pp. 528–540. Cited by: Introduction, Reverse Dictionary Models.
  29. G. Sierra (2000) The onomasiological dictionary: a gap in lexicography. In Proceedings of the ninth Euralex international congress, Cited by: Introduction.
  30. V. Silva, A. Freitas and S. Handschuh (2018) Building a knowledge graph from natural language definitions for interpretable text entailment recognition. In Proceedings of LREC, Cited by: Applications of Dictionary Definitions.
  31. J. Tissier, C. Gravier and A. Habrard (2017) Dict2vec: learning word embeddings using lexical dictionaries. In Proceedings of EMNLP, Cited by: Applications of Dictionary Definitions.
  32. S. Virpioja, P. Smit, S. A. Grönroos and M. Kurimo (2013) Morfessor 2.0: python implementation and extensions for morfessor baseline. Aalto University Publication. Cited by: Dataset.
  33. A. Wierzbicka (1996) Semantics: primes and universals. Oxford University Press, UK. Cited by: Introduction.
  34. R. Xie, Z. Liu, J. Jia, H. Luan and M. Sun (2016) Representation learning of knowledge graphs with entity descriptions. In Proceedings of AAAI, Cited by: Applications of Dictionary Definitions.
  35. Y. Zang, C. Yang, F. Qi, Z. Liu, M. Zhang, Q. Liu and M. Sun (2019) Textual adversarial attack as combinatorial optimization. arXiv preprint arXiv:1910.12196. Cited by: External Channel: Sememe Predictor.
  36. G. K. Zipf (1949) Human behavior and the principle of least effort. SERBIULA (sistema Librum 2.0). Cited by: Introduction.
  37. M. Zock and S. Bilac (2004) Word lookup on the basis of associations: from an idea to a roadmap. In Proceedings of the Workshop on Enhancing and Using Electronic Dictionaries, Cited by: Introduction, Reverse Dictionary Models.