WordRep: A Benchmark for Research on Learning Word Representations

Bin Gao, Jiang Bian, Tie-Yan Liu (Microsoft Research)

Abstract: WordRep is a benchmark collection for research on learning distributed word representations (or word embeddings), released by Microsoft Research. In this paper, we describe the details of the WordRep collection and show how to use it in different types of machine learning research related to word embeddings. Specifically, we describe how the evaluation tasks in WordRep were selected, how the data were sampled, and how the evaluation tool was built. We then compare several state-of-the-art word representations on WordRep, report their evaluation performance, and discuss the results. After that, we discuss new potential research topics that WordRep can support, in addition to algorithm comparison. We hope that this paper helps people gain a deeper understanding of WordRep and enables more interesting research on learning distributed word representations and related topics.
The success of machine learning methods depends much on data representation, since different representations may encode different explanatory factors of variation behind the data. Conventional natural language processing (NLP) tasks often adopt the 1-of-N word representation, where N is the size of the entire vocabulary, and each word in the vocabulary is represented as a long vector with only one non-zero element. However, such a simple form of word representation faces several challenges. The most critical one is that the 1-of-N representation cannot indicate any relationship between different words, even when they are highly correlated semantically or syntactically. For example, while elegant and elegantly have quite similar semantics, their 1-of-N representation vectors set different indexes to the hot value, and nothing in the 1-of-N representation makes it explicit that elegant is much closer to elegantly than to other words like rough. To deal with this problem, Latent Semantic Analysis (LSA) (Dumais, 2004) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003) were proposed to learn continuous word representations. Unfortunately, it is quite difficult to train an LSA or LDA model efficiently on large-scale text data.
Recently, with the rapid development of deep learning techniques, researchers have started to train complex and deep models on large text corpora to learn distributed representations of words (also known as word embeddings) in the form of continuous vectors (Collobert & Weston, 2008; Bengio et al., 2003; Glorot et al., 2011; Mikolov, 2012; Socher et al., 2011; Tur et al., 2012). While conventional NLP techniques usually represent words as indices in a vocabulary, which carries no notion of relationships between words, word embeddings learned by deep learning approaches aim to explicitly encode semantic relationships as well as linguistic regularities and patterns into the new embedding space. For example, an early study (Bengio et al., 2003) proposed a widely used model architecture for estimating a neural network language model. Collobert et al. (Collobert & Weston, 2008; Collobert et al., 2011) introduced a unified neural network architecture that learns word representations from large amounts of unlabeled training data and handles several different natural language processing tasks. Mikolov et al. (Mikolov et al., 2013a; b) proposed the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram), which also learn distributed word representations from large amounts of unlabeled text; these two models map semantically or syntactically similar words to close positions in the embedding space, based on the intuition that similar words are likely to appear in similar contexts.
Although learning distributed word representations has attracted much attention recently, there are very few large, public datasets for evaluating word representations. In this paper, we introduce a new large benchmark collection named WordRep, which is built from several different data sources. We describe which evaluation tasks are selected for WordRep, how the data are sampled, and how the evaluation tool is built. We also compare the performance of several state-of-the-art word representations on WordRep. Moreover, we further discuss new potential research topics that WordRep can support.
The rest of the paper is organized as follows. Section 2 gives a detailed description of how WordRep is created. In Section 3, we report the performance of several state-of-the-art word representations on WordRep. We then show how to leverage WordRep to study other research topics in Section 4. Finally, the paper is concluded in Section 5.
In this section, we introduce the process of creating WordRep collection, consisting of three main steps: selecting evaluation tasks, generating evaluation samples, and finalizing datasets.
The main task for WordRep is the analogical reasoning task introduced by Mikolov et al. (Mikolov et al., 2013a). The task consists of 19,544 questions, each of which is a tuple composed of two word pairs $(a, b)$ and $(c, d)$. From the tuple, the question is of the form "$a$ is to $b$ as $c$ is to $d$", denoted as $a : b :: c : ?$. Suppose $v_w$ is the learned representation vector of word $w$, normalized to unit norm. Following (Mikolov et al., 2013a), we answer this question by finding the word $d^*$ whose representation vector is the closest to the vector $v_b - v_a + v_c$ according to cosine similarity, excluding $b$ and $c$, i.e.,

$$d^* = \operatorname*{arg\,max}_{x \in V,\; x \notin \{b, c\}} \cos\left(v_x,\, v_b - v_a + v_c\right).$$
The question is judged as correctly answered only if $d^*$ is exactly the answer word in the evaluation set. There are two categories of analogical tasks: 8,869 semantic analogies in five subtasks (e.g., England : London :: China : Beijing) and 10,675 syntactic analogies in nine subtasks (e.g., amazing : amazingly :: unfortunate : unfortunately).
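As a concrete illustration, the answer-selection rule above can be sketched as follows. The toy vectors and words below are illustrative stand-ins, not part of the WordRep data or its official evaluation tool.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def answer_analogy(emb, a, b, c):
    """Return the word d* whose (unit-norm) vector is closest, by cosine
    similarity, to v_b - v_a + v_c, excluding b and c as in the text."""
    target = normalize(emb[b] - emb[a] + emb[c])
    best_word, best_sim = None, -2.0
    for word, vec in emb.items():
        if word in (b, c):
            continue
        sim = float(np.dot(vec, target))  # vectors are unit-norm, so dot = cosine
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Toy 2-d embedding chosen so that man : king :: woman : queen holds.
emb = {w: normalize(np.array(v)) for w, v in {
    "man": (1.0, 0.0), "woman": (0.0, 1.0),
    "king": (1.0, 1.0), "queen": (-0.2, 1.7),
    "apple": (0.9, -0.1),
}.items()}

print(answer_analogy(emb, "man", "king", "woman"))  # -> queen
```

In a real evaluation, the loop over the vocabulary is typically replaced by a single matrix-vector product over all normalized embedding vectors.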
Table 1 gives a summary of Mikolov et al.'s evaluation set. The first five subtasks are semantic questions and the remaining nine are syntactic questions. The table gives one example tuple for every subtask. It also shows the number of unique word pairs and the number of tuples combined from these pairs in each subtask. Regarding the number of unique word pairs, we can see that there is actually a rather small number of meaningful pairs in this dataset. After further checking the number of tuples, we find that not all possible combinations of word pairs are used as tuple questions. For example, in the City-in-state subtask, there should be as many as 4,556 tuples combined from 68 word pairs, but only 2,467 tuples are used in the published dataset. It is not clear why or how these 2,467 tuples were sampled.
Table 1 (excerpt): Mikolov et al.'s evaluation set.

| Subtask | Word pair 1 (a, b) | Word pair 2 (c, d) | # word pairs | # tuples |
| Common capital city | Athens : Greece | Oslo : Norway | 23 | 506 |
| All capital cities | Astana : Kazakhstan | Harare : Zimbabwe | 116 | 4,524 |
| Adjective to adverb | apparent : apparently | rapid : rapidly | 32 | 992 |
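The pair-to-tuple combination implied by these counts can be sketched as follows: every ordered pair of distinct word pairs forms one tuple question, so n word pairs yield n(n-1) tuples (hence 68 pairs would give 4,556 tuples). The small pair list below is an illustrative sample.

```python
from itertools import permutations

def build_tuples(word_pairs):
    """Combine word pairs into analogy tuples (a, b, c, d): one tuple for
    every ordered pair of distinct word pairs, i.e. n * (n - 1) tuples."""
    return [(a, b, c, d) for (a, b), (c, d) in permutations(word_pairs, 2)]

pairs = [("Athens", "Greece"), ("Oslo", "Norway"), ("Harare", "Zimbabwe")]
print(len(build_tuples(pairs)))  # 3 pairs -> 3 * 2 = 6 tuples
```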
In the WordRep collection, we merged the above 14 subtasks into 12 subtasks and expanded the set of word pairs in each subtask by extracting new word pairs from Wikipedia and an English dictionary; we also added 13 new subtasks by deriving pairwise word relationships from WordNet (WordNet, 2010).
Before we describe how we generate the evaluation samples, we first introduce the scope of the vocabulary in WordRep. Since the words extracted from Wikipedia, the dictionary, and the Web knowledge bases (WordNet) may contain many rare words that are not commonly used, we filter out those words that are not covered by the vocabulary of wiki2010 (Shaoul & Westbury, 2010). This corpus is a snapshot of Wikipedia from April 2010, which contains about two million articles and 990 million tokens. The vocabulary size of wiki2010 is 731,155, which we regard as large enough for common NLP tasks. Furthermore, note that the evaluation set proposed by Mikolov et al. (Mikolov et al., 2013a) contains a question set of phrase pairs in addition to the word pairs. To deal with phrases like New York, they simply connect the tokens with an underscore and write New_York. In this release of WordRep, we only generate question sets with word pairs and leave phrase pairs for future versions.
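The vocabulary filter described above amounts to a simple membership check: a candidate word pair survives only if both words appear in the corpus vocabulary. The tiny vocabulary below is an illustrative stand-in, not the real 731,155-word wiki2010 list.

```python
def filter_pairs(word_pairs, vocab):
    """Keep only word pairs whose two words both occur in the corpus vocabulary."""
    return [(w1, w2) for (w1, w2) in word_pairs if w1 in vocab and w2 in vocab]

# Stand-in for the wiki2010 vocabulary (the real one has 731,155 entries).
wiki2010_vocab = {"apparent", "apparently", "rapid", "rapidly"}

# "rapidlee" is a deliberately misspelled rare word that the filter drops.
candidates = [("apparent", "apparently"), ("rapid", "rapidlee")]
print(filter_pairs(candidates, wiki2010_vocab))  # -> [('apparent', 'apparently')]
```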
We leverage Wikipedia knowledge to enlarge the semantic analogical tasks. For the first two subtasks in Table 1, we merged Common capital city into All capital cities for simplicity. Then, we extracted the Wikipedia pages for all countries and areas to get a full list of countries, capitals, currencies, and nationality adjectives, so that we could enlarge the question sets in the All capital cities, Currency, and Nationality adjective subtasks. For City-in-state, we found the most populous cities in the 50 states of the U.S. from Wikipedia and built the city-state pairs accordingly. Note that we only kept the word-based named entities and removed the phrase-based names.
We take advantage of dictionary knowledge to enlarge both semantic and syntactic analogical tasks. We merged the Opposite subtask into a new subtask, Antonym, extracted from WordNet, which is described below. For the remaining subtasks (Man-Woman, Adjective to adverb, Comparative, Superlative, Present Participle, Past tense, Plural nouns, Plural verbs) in Table 1, we enlarged the corresponding word pairs by extracting new candidates from the Longman Dictionaries (http://www.longmandictionariesonline.com/). For Man-Woman, we also added some word pairs of male and female animals, like cock and hen. Note that the newly extracted words are filtered by the vocabulary of wiki2010. The statistics of the subtasks enlarged by Wikipedia knowledge and dictionary knowledge are shown in Table 2.
WordNet (WordNet, 2010) is a large lexical database of English. It contains both words and the relations among them, which makes it a gold mine for enlarging both semantic and syntactic analogical tasks. We extracted 13 new subtasks based on the WordNet relations: Antonym, MemberOf, MadeOf, IsA, SimilarTo, PartOf, InstanceOf, DerivedFrom, HasContext, RelatedTo, Attribute, Causes, and Entails. Note that we merged the original Opposite subtask in Table 1 into the new Antonym subtask, and we also filtered the words in the 13 new subtasks by the vocabulary of wiki2010. The statistics and examples of the 13 WordNet subtasks are shown in Table 3.
From Table 2 and Table 3, we can see that WordRep has a much larger number of word pairs as well as word tuples. The word pairs and word tuples can be downloaded at http://research.microsoft.com/en-us/um/beijing/events/kpdltm2014/WordRep-1.0.zip; the size of the compressed package is 1.61 GB.
In this section, we report the performance of several state-of-the-art distributed word representations on WordRep, including CW08 (Collobert & Weston, 2008), RNNLM (Mikolov, 2012), and CBOW (Mikolov et al., 2013a). In particular, we downloaded the public word representations of CW08 (http://ml.nec-labs.com/senna/), whose dimension is 50; the public word representations of RNNLM were obtained from its site (http://www.fit.vutbr.cz/~imikolov/rnnlm/), which provides three word representation models with dimensions 80, 640, and 1600, respectively; we obtained the CBOW models by using its online tool (https://code.google.com/p/word2vec/) to train word representations directly on the wiki2010 dataset, with dimensions set to 100, 200, and 300, respectively.
Table 4 shows the accuracy of each word representation on the enlarged evaluation set of analogical reasoning tasks. From this table, we can see that different word representations yield quite different accuracies on the analogical reasoning tasks. In particular, CW08 with dimension 50 achieves relatively low accuracy compared with RNNLM and CBOW, which may be due to differences in the dimension of the word representations, the training data, or the training algorithms. We can also see that, for the same training method, a larger dimension is more likely to result in better performance.
Table 5 shows the accuracy of each word representation on the WordNet evaluation set of analogical reasoning tasks. The observations from this table are analogous to those from Table 4. From these two tables, we can also see that different subtasks can yield quite diverse accuracies even for the same word representation model.
Note that the above evaluation experiments can be done using the evaluation tool provided by word2vec (https://code.google.com/p/word2vec/). The only difference is that we treat all tuple questions as seen questions: if a tuple question is unseen by the word embedding being evaluated (i.e., at least one word in the tuple question is not in the vocabulary of the word embedding), we regard it as answered incorrectly.
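The scoring rule described above (unseen questions stay in the denominator and count as wrong) can be sketched as follows. Here `predict` stands in for any answer-selection routine; all names are illustrative rather than taken from the word2vec tool.

```python
def accuracy(questions, vocab, predict):
    """questions: list of (a, b, c, d) tuples; vocab: the embedding's
    vocabulary; predict(a, b, c) -> predicted answer word.
    A question with any out-of-vocabulary word counts as answered
    incorrectly, but still appears in the denominator."""
    correct = 0
    for a, b, c, d in questions:
        if not all(w in vocab for w in (a, b, c, d)):
            continue  # unseen question: kept in the denominator as incorrect
        if predict(a, b, c) == d:
            correct += 1
    return correct / len(questions)

# Illustrative check: one seen question answered correctly, one unseen.
questions = [("amazing", "amazingly", "rapid", "rapidly"),
             ("amazing", "amazingly", "rare", "rarely")]
vocab = {"amazing", "amazingly", "rapid", "rapidly"}
print(accuracy(questions, vocab, lambda a, b, c: "rapidly"))  # -> 0.5
```

Under the alternative convention of dropping unseen questions entirely, the same predictor would instead score 1.0, which is why the treatment of unseen questions must be fixed before comparing embeddings.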
Besides measuring the quality of word representations, we can also use WordRep for, but not limited to, the following tasks.
WordRep can be used to evaluate relation embeddings. Recently, some researchers have attempted to learn word embeddings and relation embeddings simultaneously. WordRep contains 25 subtasks in total, i.e., 25 types of relations that can be used to evaluate relation embeddings.
WordRep can be used to evaluate relation prediction. The 25 types of relations can be regarded as labels of word pairs. Researchers can test their relation prediction methods using these labels as ground truth.
WordRep provides several good word lists for general NLP tasks. For example, there are lists for different syntactic forms of nouns, verbs, adjectives, and adverbs, and there are lists for commonly used relations.
In this paper, we have introduced a new data collection called WordRep, which can be used for the evaluation of distributed word representations. We described how we built the data collection and reported the evaluation performance of several state-of-the-art word representations on it. We also discussed the possible research topics that WordRep may support.
In future work, we plan to expand the evaluation set to phrase pairs, and to enrich the collection by incorporating other Web knowledge bases like Freebase (Bollacker et al., 2008).
We would like to thank Siyu Qiu, Chang Xu, Yalong Bai, Hongfei Xue, and Rui Zhang for their contributions in the preparation of the collection and the evaluation of the state-of-the-art word embeddings.
Table 2 (excerpt): Statistics of the subtasks enlarged by Wikipedia and dictionary knowledge.

| Subtask | Word pair 1 (a, b) | Word pair 2 (c, d) | # word pairs | # tuples |
| All capital cities | Astana : Kazakhstan | Harare : Zimbabwe | 131 | 17,030 |
| Adjective to adverb | apparent : apparently | rapid : rapidly | 689 | 474,032 |
Table 4 (excerpt): Accuracy on the enlarged analogical reasoning subtasks, with the models ordered as introduced above (CW08 with dimension 50; RNNLM with dimensions 80, 640, 1600; CBOW with dimensions 100, 200, 300).

| Subtask | CW08-50 | RNNLM-80 | RNNLM-640 | RNNLM-1600 | CBOW-100 | CBOW-200 | CBOW-300 |
| All capital cities | 0.62% | 0.76% | 1.23% | 1.81% | 6.62% | 9.04% | 11.28% |
| Adjective to adverb | 1.40% | 0.93% | 1.17% | 2.01% | 3.45% | 3.44% | 3.23% |
- Bengio et al. (2003) Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.
- Blei et al. (2003) Blei, David M., Ng, Andrew Y., and Jordan, Michael I. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003. ISSN 1532-4435.
- Bollacker et al. (2008) Bollacker, K., Evans, C., Paritosh, P., Sturge, T., and Taylor, J. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 1247–1250. ACM, 2008.
- Collobert & Weston (2008) Collobert, R. and Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pp. 160–167, New York, NY, USA, 2008. ACM.
- Collobert et al. (2011) Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, November 2011. ISSN 1532-4435.
- Dumais (2004) Dumais, Susan T. Latent semantic analysis. Annual Review of Information Science and Technology, 38(1), 2004. ISSN 1550-8382.
- Glorot et al. (2011) Glorot, X., Bordes, A., and Bengio, Y. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the Twenty-Eighth International Conference on Machine Learning, ICML, 2011.
- Mikolov (2012) Mikolov, T. Statistical Language Models Based on Neural Networks. PhD thesis, Brno University of Technology, 2012.
- Mikolov et al. (2013a) Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013a.
- Mikolov et al. (2013b) Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Burges, Christopher J. C., Bottou, Léon, Ghahramani, Zoubin, and Weinberger, Kilian Q. (eds.), NIPS, pp. 3111–3119, 2013b.
- Shaoul & Westbury (2010) Shaoul, C. and Westbury, C. The Westbury Lab Wikipedia Corpus. Edmonton, AB: University of Alberta, 2010.
- Socher et al. (2011) Socher, R., Lin, C. C., Ng, A. Y., and Manning, C. D. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2011.
- Tur et al. (2012) Tur, G., Deng, L., Hakkani-Tur, D., and He, X. Towards deeper understanding: Deep convex networks for semantic utterance classification. In ICASSP, pp. 5045–5048, 2012.
- WordNet (2010) WordNet. "About WordNet." Princeton University, 2010. URL http://wordnet.princeton.edu.