An Ensemble Method to Produce High-Quality Word Embeddings



A currently successful approach to computational semantics is to represent words as embeddings in a machine-learned vector space. We present an ensemble method that combines embeddings produced by GloVe [Pennington et al., 2014] and word2vec [Mikolov et al., 2013] with structured knowledge from the semantic networks ConceptNet [Speer and Havasi, 2012] and PPDB [Ganitkevitch et al., 2013], merging their information into a common representation with a large, multilingual vocabulary. The embeddings it produces achieve state-of-the-art performance on many word-similarity evaluations. Its score of ρ = .596 on an evaluation of rare words [Luong et al., 2013] is 16% higher than the previous best known system.


1 Introduction

Vector space models are an effective way to express the meanings of natural-language terms in a computational system. These models are created using machine-learning techniques that represent words or phrases as vectors in a high-dimensional space, such that the cosine similarity of any two terms corresponds to their semantic similarity.

These vectors, referred to as the embeddings of the terms in the vector space, can also be used as an input to further steps of machine learning. When algorithms expect dense vectors as input, embeddings provide a representation that is both more compact and more informative than the “one-hot” representation in which every term in the vocabulary gets its own dimension.

This kind of vector space has been used in applications such as search, topic detection, and text classification, dating back to the introduction of latent semantic analysis [Deerwester et al., 1990]. In recent years, there has been a surge of interest in natural-language embeddings, as machine-learning techniques such as word2vec [Mikolov et al., 2013] and GloVe [Pennington et al., 2014] have begun to show dramatic improvements. Word embeddings are often suggested as an initialization for more complex methods, such as the sentence encodings of Kiros et al. (2015).


Faruqui et al. (2015) introduced a technique known as “retrofitting”, which combines embeddings learned from the distributional semantics of unstructured text with a source of structured connections between words. The combined embedding achieves performance on word-similarity evaluations superior to either source individually.

Here, we build on the retrofitting process to produce a high-quality space of word embeddings. We extend existing techniques in the following ways:

  • We modify the retrofitting algorithm so that it does not depend on the row order of its input matrix, and so that it propagates over the union of the vocabularies. This allows retrofitting to benefit from structured links outside the original vocabulary, such as translations into other languages. We call this procedure “expanded retrofitting”.

  • We include ConceptNet, a Linked Open Data semantic network that expresses many kinds of relationships between words in many languages, as a source of structured connections between words.

  • We align English terms from different sources using a lemmatizer and a heuristic for merging together multiple term vectors.

  • We fill gaps when aligning the two distributional-semantics sources (GloVe and word2vec) using a locally linear interpolation.

  • We re-scale the distributional-semantics features using ℓ1 normalization.

When we use this process to combine word2vec, GloVe, PPDB, and ConceptNet, it produces a space of multilingual term embeddings we call the “ConceptNet vector ensemble”, which achieves state-of-the-art performance on word-similarity evaluations over both common and rare words.

1.1 Related Work


Agirre et al. (2009) observe that distributional similarity and structured knowledge can be combined for a benefit exceeding what each would achieve alone, particularly by extending the vocabulary. Their system uses a similarity measure over WordNet, and uses distributional similarities to recognize words outside of WordNet’s vocabulary.


Levy et al. (2015) survey modern methods of distributional similarity and experiment with training them on specific data while varying their parameters. They compare word2vec and GloVe, tune their hyperparameters in a way that particularly improves word2vec, then propose a method based on the SVD of the pointwise mutual information matrix that outperforms both. We use Levy’s results as a point of comparison here.

AutoExtend [Rothe and Schütze, 2015] is a system with similar methods to ours: it extends word2vec embeddings to cover all the word senses and synsets of WordNet by propagating information over edges, thus combining distributional and structured data after the fact. The primary goal of AutoExtend is word sense disambiguation, and as such it is optimized for and evaluated on WSD tasks. Our ensemble aims to extend and improve a vocabulary of undisambiguated words, so there is no direct comparison between AutoExtend’s results and ours.

2 Knowledge Sources

2.1 ConceptNet and PPDB

ConceptNet [Speer and Havasi, 2012] is a semantic network of terms connected by labeled relations. Its terms are words or multiple-word phrases in a variety of natural languages. For continuity with previous work, these terms are often referred to as concepts.

ConceptNet originated as a machine-parsed version of the early crowd-sourcing project called Open Mind Common Sense (OMCS) [Singh et al., 2002], and has expanded to include several other data sources, both crowd-sourced and expert-created, by unifying their vocabularies into a single representation. ConceptNet now includes representations of WordNet [Miller et al., 1998], Wiktionary [Wiktionary, 2014], and JMDict [Breen, 2004], as well as data from “games with a purpose” in multiple languages [von Ahn et al., 2006; Kuo et al., 2009; Nakahara and Yamada, 2011]. We choose not to include ConceptNet’s alignment to DBPedia [Auer et al., 2007] here, as DBPedia focuses on relations between specific named entities, which do not help with general word similarity.

PPDB [Ganitkevitch et al., 2013] is another resource that is useful for learning about word similarity, providing different information from ConceptNet. It lists pairs of words that are translated to the same word in parallel corpora, particularly in documents of the European Parliament. PPDB is used as an external knowledge source by Faruqui et al. (2015), so we have evaluated the effect of adding it to our ensemble as well. As it seems to have a small beneficial effect, we include it as part of the full ensemble.

2.2 word2vec and GloVe

word2vec and GloVe are two current systems that learn vector representations of words according to their distributional semantics. Given a large text corpus, they produce vectors representing similarities in how the words co-occur with other words.


Mikolov et al. (2013) described a system of distributional word embeddings called Skip-Grams with Negative Sampling (SGNS), which is more popularly known by the name of its software implementation, word2vec. (The word2vec software also implements another representation, Continuous Bag-of-Words or CBOW, which is less often used for word similarity.)

In SGNS, a neural network with one hidden layer is trained to recognize words that are likely to appear near each other. Its goal is to output a high value when given examples of co-occurrences that appear in the data, and a low value for negative examples where one word is replaced by a random word. The loss function is weighted by the frequencies of the words involved and the distance between them in the data. The word2vec software comes with SGNS embeddings of text from Google News.

GloVe [Pennington et al., 2014] is an unsupervised learning algorithm that learns a set of word embeddings such that the dot product of two words’ embeddings is approximately equal to the logarithm of their co-occurrence count. The algorithm operates on a global word-word co-occurrence matrix, and solves an optimization problem to learn a vector for each word, a separate vector for each context (although the contexts are also words), and a bias value for each word and each context. Only the word vectors are used for computing similarity.

The embeddings that GloVe learns from data sources such as the Common Crawl are distributed on the GloVe web page. Here we evaluate two downloadable sets of GloVe 1.2 embeddings, built from 42 billion and 840 billion tokens of the Common Crawl, respectively.

There is some debate about whether GloVe or word2vec is better at representing word meanings in general. Pennington et al. (2014) present GloVe as performing better than word2vec on word-similarity tasks, but Levy et al. (2015) find that word2vec, when retrained on a particular corpus with an optimized setting of hyperparameters, performs better than GloVe.

In this paper, we focus only on the downloadable sets of term embeddings that the GloVe and word2vec projects provide, not on re-running them with tuned hyperparameters. Using this data makes it possible to reproduce their results and compare directly to them, even when their preferred input data is not available. We find that we can get very good results derived from the downloadable embeddings, and that GloVe’s downloadable embeddings outperform word2vec’s in this case, but a combination of them can perform even better.

3 Methods

Figure 1: The flow of data in building the ConceptNet vector ensemble from its data sources.

Faruqui et al. (2015) introduced the “retrofitting” procedure, which adjusts dense matrices of embeddings (such as the GloVe output) to take into account external knowledge from a sparse semantic network. They tried various sources of external knowledge, and the one that was most helpful to GloVe was PPDB. We found using ConceptNet to be more effective, and that further marginal improvements could be achieved on some evaluations by combining ConceptNet and PPDB.

Our goal is to create a 300-dimensional vector space that represents terms based on a combination of GloVe and word2vec’s downloadable embeddings, and structured data from ConceptNet and PPDB. The resulting vector space allows information to be shared among these various representations, including words that were not in the vocabulary of the original representations. This includes low-frequency words and even words that are not in English.

The complete process of building this vector space, whose steps will be explained throughout this paper, appears in Figure 1.

As Levy et al. (2015) note, “[…] much of the performance gains of word embeddings are due to certain system design choices and hyperparameter optimizations, rather than the embedding algorithms themselves.” While it is presented as a negative result, this simply emphasizes the importance of these system design choices.

Indeed, we have found that choices about how to handle terms and their embeddings have a significant impact on evaluation results. One of these choices involves how to pre-process words and phrases before looking them up, and another involves the scale of the various features in the embeddings.

3.1 Transforming and Aligning Vocabularies

Figure 2: A proportional-area diagram showing the overlap of vocabularies among ConceptNet and the available embeddings for word2vec and GloVe.

Different representations apply different pre-processing steps, placing strings in different equivalence classes. We can only properly combine these resources if these string representations are comparable to each other.

Pre-processing steps that various resources apply include:

  • tokenizing text to separate words from punctuation (all inputs except GloVe 840B);

  • joining multi-word phrases with underscores (ConceptNet and word2vec);

  • removing a small list of stopwords from multi-word phrases (ConceptNet only);

  • folding the text to lowercase (ConceptNet and GloVe 42B);

  • replacing multiple digits with the character # (word2vec only);

  • lemmatizing English words to their root form using a modification of WordNet’s Morphy algorithm (ConceptNet only).

We adapt a text pre-processing function from ConceptNet to apply a combination of all of these processes, yielding a set of standardized, language-tagged labels. As an example, the text “Giving an example” becomes the standardized form /c/en/give_example. Applying this combined pre-processing function to all labels increases the alignment of the various resources while reducing the size of the combined vocabulary.
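The overall shape of this standardization step can be sketched as follows. This is an illustrative toy, not ConceptNet’s actual code: the real pipeline uses ConceptNet’s tokenizer and a Morphy-based lemmatizer, while here a tiny invented lemma table and stopword list stand in for them.

```python
# Toy sketch of the combined pre-processing step: tokenize, lowercase,
# drop stopwords, lemmatize, join with underscores, and tag the language.
# STOPWORDS and LEMMAS are illustrative stand-ins for the real resources.

STOPWORDS = {"a", "an", "the"}                 # toy stopword list (assumption)
LEMMAS = {"giving": "give", "examples": "example"}  # toy lemma table (assumption)

def toy_lemmatize(word):
    """Crude stand-in for WordNet's Morphy algorithm."""
    return LEMMAS.get(word, word)

def standardize(text, lang="en"):
    """Map raw text to a language-tagged, ConceptNet-style label."""
    tokens = text.lower().split()                       # tokenize + lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]  # drop stopwords
    tokens = [toy_lemmatize(t) for t in tokens]         # lemmatize
    return "/c/%s/%s" % (lang, "_".join(tokens))        # underscore-joined label

print(standardize("Giving an example"))  # -> /c/en/give_example
```

Because the mapping is many-to-one, distinct raw strings (e.g. “polish” and “Polish”) collapse to the same label, which is exactly the merging behavior discussed next.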

Because the transformations are many-to-one, this has the effect that a single transformed term can become associated with multiple embeddings in a single vector space. We considered a few options for dealing with these merged terms, such as keeping only the highest-frequency term, averaging the vectors together, or taking a weighted average based on their word frequency.

We found in preliminary evaluations that the weighted average was the best approach. The multiple rows contain valuable data that should not simply be discarded, but lower-frequency rows tend to have lower-quality data.

When using pre-trained vectors, the intermediate computations that produced them (such as word frequencies) are often unavailable. What we do instead is to infer approximate word frequencies from the fact that both GloVe and word2vec output their vocabularies in descending order of frequency. We approximate the frequency distribution by assuming that the tokens are distributed according to Zipf’s law [Zipf, 1949]: the nth token in rank order has a frequency proportional to 1/n. We use these proportions in the weighted average when combining multiple embeddings.
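The Zipf-weighted merge can be sketched in a few lines of plain Python. The function names and two-dimensional vectors are illustrative; the released code is not reproduced here.

```python
# Sketch of merging multiple embedding rows that standardize to the same
# label, weighting each row by a Zipf pseudo-frequency of 1/rank.

def zipf_weight(rank):
    """Zipf's law: the nth token by rank has frequency proportional to 1/n."""
    return 1.0 / rank

def merge_rows(rows):
    """Weighted average over (rank, vector) pairs for one standardized label."""
    total = sum(zipf_weight(rank) for rank, _ in rows)
    dim = len(rows[0][1])
    merged = [0.0] * dim
    for rank, vec in rows:
        w = zipf_weight(rank) / total       # normalized Zipf weight
        for i in range(dim):
            merged[i] += w * vec[i]
    return merged

# A token at rank 1 and a variant at rank 100 map to the same label;
# the merged vector stays close to the higher-frequency row.
merged = merge_rows([(1, [1.0, 0.0]), (100, [0.0, 1.0])])
```

The weighting makes the highest-frequency form dominate while still letting rarer forms contribute, matching the preference for a weighted average over discarding rows.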

This process alone is a benefit on word-similarity evaluations, even without combining any resources. For example, the Rare Words (RW) dataset [Luong et al., 2013] tends to encounter terms that are poorly represented or out-of-vocabulary in most word embeddings. Lemmatizing them before looking them up, and combining them with more frequently observed representations, improves the evaluation results on these words, even though the process loses the ability to distinguish some word forms. The raw GloVe 840B data gets a Spearman correlation of ρ = .146 on the RW dataset, which increases to ρ = .494 when its embeddings are standardized and combined in this way.

Figure 2 shows the size of the vocabularies of ConceptNet, GloVe, and word2vec after this transformation, and the sizes of the overlaps among them, using a proportional-area Venn diagram produced using eulerAPE [Micallef and Rodgers, 2014].

3.2 Feature Normalization

As briefly mentioned by Pennington et al. (2014), normalization of the columns (that is, the 300 features) of the GloVe matrix provides a notable increase in performance. One effect of normalization is to increase the weight of distinguishing features and reduce the impact of noisy features. Features are more distinguishing for the purpose of cosine similarity when they contain a few large values and many small ones.

We find that ℓ1 normalization of GloVe performs even better than ℓ2 normalization. ℓ1 causes occasional large values to have a smaller impact on the norm than ℓ2 normalization does. When a learning method such as GloVe has provided highly selective features, ℓ1 normalization allows us to use them more effectively in measuring similarity.
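Column-wise normalization can be sketched in numpy as follows. This is a generic sketch, not the released code; the `p` parameter selects which norm is used.

```python
import numpy as np

# Sketch of per-column (feature) normalization of an embedding matrix.
# Under an l1 norm, an occasional large value contributes linearly to a
# column's norm; under l2 it contributes quadratically, so l1 normalization
# leaves selective columns with relatively more weight.

def normalize_columns(M, p=1):
    """Scale each column of M to unit norm (l1 by default)."""
    norms = np.linalg.norm(M, ord=p, axis=0)
    norms[norms == 0] = 1.0          # leave all-zero columns untouched
    return M / norms

E = normalize_columns(np.array([[3.0, 0.0],
                                [1.0, 2.0]]))
```

With `p=1` each column's absolute values sum to 1; swapping in `p=2` gives the ℓ2 variant for comparison.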

3.3 Retrofitting

Retrofitting [Faruqui et al., 2015] is a process of combining existing word vectors with a semantic lexicon. While the original formulation expresses the problem in terms of updates that propagate over a set of edges, we have found it more convenient to express it and implement it in terms of an update to a matrix.

The inputs to retrofitting are an initial dense matrix of term embeddings, M_0, and a list of known semantic relationships.

Faruqui et al.’s retrofitting procedure aims to minimize a sum of a word’s distance from its neighbors in the semantic network and its distance from its original vector. Its implementation in code takes steps along the gradient toward this minimum by iteratively updating one vector at a time to be a linear combination between its original position and the average of its neighbors in the semantic network.

The advantage of this iterative update is that it only requires two copies of M in memory (M_0 and the current state) and converges quickly. A disadvantage is that the results depend on the order in which the nodes of the graph are iterated, which is arbitrary.

We instead choose to update the embeddings all at once by multiplying them by a sparse matrix of semantic connections. Letting N be the size of the merged vocabulary, A is an N × N sparse matrix containing positive weighted values for pairs of terms that are known to be semantically related, and 0 otherwise. The rows of A are scaled to have a sum of 1. We then add 1 to its diagonal to help new terms converge on a single vector, as described in more detail below.

Let M_0 be an N × k matrix whose rows come from the original embeddings if available, and are all zeroes for terms outside the vocabulary of the original embeddings. D is an N × N diagonal matrix of weights in which D_ii is 1 if term i is in the original vocabulary, and 0 otherwise (allowing us to keep terms near their original embeddings without also keeping out-of-vocabulary terms near the zero vector). We can now update M iteratively so that the next iteration of M is a combination of its product with A and its weighted original state, followed by normalization of its non-zero rows:

    M_{i+1} = normalize(A M_i + D M_0)

The diagonal of the matrix A relates each term to itself. We have found that adding 1 to the diagonal – effectively adding “self-loops” to the semantic network – helps the expanded retrofitting process converge. Without this diagonal, terms that only appear in the semantic network, and not in the original embedding space, would get their value only from their neighbors at every step, because their “original position” is the zero vector. This causes large oscillations that prevent convergence. With the diagonal, each term vector is influenced by the vector it had in the previous step.
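The expanded retrofitting update described above can be sketched with dense numpy matrices and a three-term toy vocabulary. This is an illustration under the assumptions just stated (rows of the association matrix sum to 1 before self-loops are added); the actual implementation uses a large sparse matrix.

```python
import numpy as np

# Dense sketch of expanded retrofitting: repeat M <- A M + D M0, then
# normalize the non-zero rows, with self-loops added to A's diagonal.

def expanded_retrofit(A, M0, in_vocab, steps=10):
    """A: N x N association matrix with rows summing to 1 (no self-loops yet).
    M0: N x k original embeddings, zero rows for out-of-vocabulary terms.
    in_vocab: length-N booleans marking terms with original vectors."""
    A = A + np.eye(A.shape[0])             # self-loops: add 1 to the diagonal
    D = np.diag(in_vocab.astype(float))    # weight of the original embedding
    M = M0.copy()
    for _ in range(steps):
        M = A @ M + D @ M0                 # neighbors + weighted original state
        norms = np.linalg.norm(M, axis=1)
        nonzero = norms > 0
        M[nonzero] = M[nonzero] / norms[nonzero][:, None]  # normalize rows
    return M

# Term 2 is absent from the original embeddings but linked to terms 0 and 1;
# expanded retrofitting gives it a vector near its neighbors.
A = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
M0 = np.array([[1.0, 0.0],
               [0.8, 0.6],
               [0.0, 0.0]])
M = expanded_retrofit(A, M0, np.array([True, True, False]))
```

Dropping the `np.eye` line reproduces the no-self-loop variant, whose oscillation problem the text describes.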

As a practical effect of this, Section 4.5 (Varying the System) will show that expanded retrofitting with self-loops added to the diagonal performs better on word-similarity evaluations than it does without, when allowed to run for 10 steps of retrofitting.

3.4 ConceptNet as an Association Matrix

In order to apply the expanded retrofitting method, we need to consider the data in ConceptNet as a sparse, symmetric matrix of associations between terms. What ConceptNet provides is more complex than that, as it connects terms with a variety of not-necessarily-symmetric, labeled relations.


Havasi et al. (2010) introduced a vector space embedding of ConceptNet, “spectral association”, that disregarded the relation labels for the purpose of measuring the relatedness of terms. Previous embeddings of ConceptNet, such as that of Speer et al. (2008), preserved the relations but were suited mostly for direct similarity and inference, not for relatedness. Because most evaluation data for word similarity is also evaluating relatedness, unless there has been a specific effort to separate them [Agirre et al., 2009], we erase the labels as in spectral association.

Each assertion in ConceptNet corresponds to two entries in a sparse association matrix A. ConceptNet assigns a confidence score, or weight, to each assertion. These weights are not entirely comparable between the data sources that comprise ConceptNet, so we re-scaled them so that the average weight from each data source is 1.

An assertion that relates term i to term j with adjusted weight w will contribute w to the values of A_ij and A_ji. If another assertion relates the same terms with a different relation, it will add to that value. This constructs a symmetric matrix A, but the matrix we actually use in retrofitting is the asymmetric Â, whose rows have been ℓ1-normalized to prevent high-frequency concepts from overwhelming the results.
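This construction can be sketched with a toy three-term vocabulary. The terms, relations, and weights below are invented for illustration; only the shape of the computation (drop labels, accumulate symmetric weights, row-normalize) follows the text.

```python
import numpy as np

# Sketch of building the association matrix from ConceptNet-style
# assertions: relation labels are discarded, weights for repeated term
# pairs accumulate, and rows are then scaled to sum to 1.

vocab = {"/c/en/cat": 0, "/c/en/pet": 1, "/c/en/dog": 2}   # toy vocabulary
assertions = [                      # (term1, relation, term2, adjusted weight)
    ("/c/en/cat", "IsA", "/c/en/pet", 1.0),
    ("/c/en/dog", "IsA", "/c/en/pet", 1.0),
    ("/c/en/cat", "RelatedTo", "/c/en/pet", 0.5),  # same pair, new relation
]

N = len(vocab)
A = np.zeros((N, N))
for t1, _rel, t2, w in assertions:  # the relation label is discarded
    i, j = vocab[t1], vocab[t2]
    A[i, j] += w                    # symmetric contribution
    A[j, i] += w

row_sums = A.sum(axis=1, keepdims=True)
row_sums[row_sums == 0] = 1.0
A_hat = A / row_sums                # row-normalized; no longer symmetric
```

Note how row normalization breaks the symmetry: a well-connected term like /c/en/pet spreads its outgoing weight thinner than its neighbors do.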

Due to the structure of ConceptNet, there exists a large fringe of terms that are poorly connected to other nodes. To make the sparse matrix and the size of the overall vocabulary more manageable, we filter ConceptNet when building its association matrix: we exclude all terms that appear fewer than 3 times, English terms that appear fewer than 4 times, and terms with more than 3 words in them.

3.5 Locally Linear Alignment

In order to use both word2vec and GloVe at the same time, we need to align their partially-overlapping vocabularies and merge their features. This is straightforward to do on the terms that are shared between the two vocabularies, but we would rather not lose the other terms, if we can later benefit from learning more about those terms from ConceptNet.

Before merging features, we need to compute GloVe representations for terms represented in word2vec but not GloVe, and vice versa. The way we do this is inspired by Zhao et al. (2015), who infer translations between languages of unknown phrases using a locally-linear projection of known translations of similar phrases. Instead of known translations, we have the terms that overlap between word2vec and GloVe. Given a non-overlapping term, we calculate its vector as the average of the vectors of the nearest overlapping terms, weighted by their cosine similarity.
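The locally linear interpolation can be sketched as below. The helper name `infer_vector`, the value of k, and the toy data are all illustrative; the paper does not specify how many nearest neighbors are used.

```python
import numpy as np

# Sketch of inferring a GloVe vector for a term known only to word2vec:
# average the GloVe vectors of the k nearest overlapping terms, weighted
# by their cosine similarity to the term in the word2vec space.

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def infer_vector(term_vec, overlap_w2v, overlap_glove, k=2):
    """term_vec: the term's vector in the source (word2vec) space.
    overlap_w2v, overlap_glove: aligned rows for terms both spaces share."""
    sims = np.array([cosine(term_vec, v) for v in overlap_w2v])
    nearest = np.argsort(-sims)[:k]     # indices of the k most similar terms
    weights = sims[nearest]
    return (weights[:, None] * overlap_glove[nearest]).sum(axis=0) / weights.sum()
```

The same function, with the argument roles swapped, fills GloVe-only terms into the word2vec space.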

To combine the features of word2vec and GloVe, we first concatenate their vectors into 600-dimensional vectors. We then discount redundancy among the features by transforming these 600-dimensional vectors with a singular value decomposition (SVD). We factor the matrix M of concatenated vectors as M = UΣV^T, then compute the new joint features as UΣ^(1/2).

UΣ would be an orthogonal rotation of the original features; UΣ^(1/2) reduces the effect of its largest singular values, making over-represented features relatively smaller.
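The concatenate-and-factor step can be sketched in numpy. This is a minimal illustration of the technique, not the released code; the input matrices stand for aligned rows from the two embedding spaces.

```python
import numpy as np

# Sketch of the redundancy-discounting step: concatenate the two spaces'
# features, factor M = U S V^T, and keep U S^(1/2) as the joint features.

def combine_features(M_w2v, M_glove):
    """Rows are aligned terms; columns are the features from each source."""
    M = np.hstack([M_w2v, M_glove])                # 600-dim concatenation
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U * np.sqrt(s)                          # U @ diag(sqrt(s))

J = combine_features(np.eye(2), np.eye(2))
```

Truncating `U` and `s` to the largest r singular values before the final product gives the 450- or 300-dimensional reductions evaluated later.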

As with many decisions we make in preparing this data, we evaluated the benefit of this step on our development data sets. Discounting redundancy by replacing Σ with Σ^(1/2) provides a benefit on two out of three data sets for evaluating word similarity, as shown in Section 4.5.

It is common to use SVD as a form of dimensionality reduction, by discarding the smallest singular values and truncating the matrix accordingly. Section 4.5 shows that we can reduce the interpolated matrix from 600 dimensions to 450 or 300 dimensions without much loss in performance.

4 Evaluation

4.1 Word-Similarity Datasets

We evaluate our model’s performance at identifying similar words using a variety of word-similarity gold standards:

In striving to maximize an evaluation metric, it is important to hold out some data, to avoid overfitting to the data by modifying the algorithm and its parameters. The metrics we focused on improving were our rank correlation with MEN-3000, which emphasizes having high-quality representations of common words, and RW, which emphasizes having a broad vocabulary.

MEN-3000 comes with a development/test split, where 1000 of the 3000 word pairs are held out for testing. We applied a similar split to RW, setting aside a sample of one-third of its word pairs for testing. In particular, we set aside every third row, starting from row 3, using the Unix command split -un r/3/3 rw.txt. Similarly, we split on r/1/3 and r/2/3 and concatenated the results to get the development data.

We did not apply a development/test split to WordSim-353 or RG-65, as they are already much smaller than MEN and RW.

For the resources where we applied a development/test split, we evaluated decisions we made in the code – such as those described in Section 4.5 – using only the development set, to preserve the integrity of the test set and avoid “overfitting via code”. We then evaluated the final ensemble, with various pieces enabled, all at once on the held-out test data to produce the results in this paper.

4.2 Results

Ensemble components    RW    MEN   WS-353
g                      .448  .816  .759
g ℓ1                   .457  .820  .766
G                      .146  .787  .672
G ℓ1                   .148  .789  .676
W                      .371  .732  .624
W ℓ1                   .374  .732  .622
St g                   .492  .815  .765
St g ℓ1                .513  .834  .794
St G                   .494  .814  .763
St G ℓ1                .513  .840  .798
St W                   .453  .778  .731
St W ℓ1                .452  .777  .732
St W G ℓ1              .525  .832  .778
PP St G ℓ1             .561  .852  .806
PP St W ℓ1             .481  .800  .750
PP St W G ℓ1           .543  .847  .782
CN St G ℓ1             .581  .860  .818
CN St W ℓ1             .541  .813  .771
CN St W G ℓ1           .598  .862  .802
CN PP St G ℓ1          .584  .860  .818
CN PP St W ℓ1          .543  .812  .775
CN PP St W G ℓ1        .601  .861  .802
Table 1: Word-similarity results as various components of the ensemble are enabled. The results are the Spearman rank correlation (ρ) with the held-out test sets of RW and MEN-3000 and with WordSim-353. The components are: CN = ConceptNet, PP = PPDB, St = standardized and lemmatized terms, W = word2vec SGNS vectors, g = GloVe 42B vectors, G = GloVe 1.2 840B vectors, ℓ1 = ℓ1-normalized features.
                 Rare Words          MEN-3000
System           dev   test  all    dev   test  all
GloVe 42B        .489  .448  .477   .813  .816  .814
Mod. GloVe       .536  .513  .528   .841  .840  .841
Full ensemble    .593  .601  .596   .858  .861  .859
Omit CN          .533  .543  .536   .842  .847  .844
Omit PPDB        .590  .598  .592   .858  .862  .860
Omit GloVe       .545  .543  .545   .807  .812  .808
Omit word2vec    .591  .583  .588   .857  .859  .858
Table 2: A comparison of evaluation results between the “dev” datasets that were used in development, and the held-out “test” datasets, for selected systems.

Table 1 shows the performance of the ensemble as various components of it are enabled. G and g indicate that the initial embeddings come from GloVe (840B or 42B respectively), and W indicates that they are word2vec’s SGNS embeddings built from Google News. When both W and G are present, the embeddings are combined as in Section 3.5. ℓ1 indicates that the columns of features were ℓ1-normalized; otherwise we used the existing scale of the features.

St indicates that the labels were transformed and rows combined using the method of Section 3.1, which is a prerequisite to combining multiple data sources. CN and PP indicate adding data from ConceptNet, PPDB, or both using expanded retrofitting. Note that the row labeled with g alone is simply an evaluation of GloVe 42B that reproduces the evaluation of Pennington et al. (2014). Our results here match the published results to within .001.

While Table 1 shows our correlation with only the test data on these evaluations, Table 2 compares our results on development and test data.

4.3 Benefits of Lemmatization

Decisions about how to process the data, even after the fact, make a very large difference in word-similarity evaluations. Comparing the GloVe 42B results to the 840B results (Table 1), we see that GloVe 42B works better “out of the box”, and 840B contains messy data that particularly causes problems on the Rare Words evaluation. However, our strategy to standardize and lemmatize the term labels of GloVe 840B, combining its rows using the Zipf estimate, makes it perform better than GloVe 42B and other published results, as seen in the St G ℓ1 row. We call this configuration “Modified GloVe”, and similarly, our best configuration of word2vec is “Modified word2vec”.

The fact that Modified GloVe performs better than GloVe 42B, GloVe 840B, and many other systems, even before retrofitting any additional data onto it, highlights the unexpectedly large benefit of lemmatization: some of the improvements from this paper can be realized without introducing any additional data.

It’s important to note that we are not changing the evaluation data by using a lemmatizer; we are only changing the way we look it up as embeddings in the vector space that we are evaluating. For example, if an evaluation requires similarities for the words “dry” and “dried” to be ranked differently, or the words “polish” and “Polish”, the lemmatized system will rank them the same, and will be penalized in its Spearman correlation. However, the benefits of lemmatization when evaluating semantic similarity appear to far outweigh the drawbacks.

4.4 Comparisons to Other Published Results

Method                       RW [all]  MEN   WS
word2vec SGNS (Levy)         .470      .774  .733*
Modified word2vec (ours)     .476      .778  .731
GloVe (Pennington)           .477      .816  .759
Modified GloVe (ours)        .528      .840  .798
SVD (Levy)                   .514      .778  .736*
Retrofitting (Faruqui)       n/a       .796  .741
ConceptNet vector ensemble   .596      .859  .821
Table 3: Comparison between our ensemble word embeddings and previous results, on the complete RW data, the MEN-3000 test data, and WordSim-353. Asterisks indicate estimated overall results for WordSim.
Figure 3: Systems discussed in this paper, plotted according to their Spearman correlation with MEN-3000 and RW. Hollow nodes are previously-reported results; filled nodes are systems created for this paper.
                                     RG-65 language
Method                               en    de    fr
Faruqui et al.: SG + Retrofitting    .739  .603  .606
ConceptNet vector ensemble           .891  .645  .789
Table 4: Evaluation results comparing Faruqui et al.’s multilingual retrofitting of Wikipedia skip-grams to our ensemble, on RG-65 and its translations.

In Table 3, we compare our results on the RW and MEN-3000 datasets to the best published results that we know of. Levy et al. (2015) present results including an SVD-based method that scores .514 on the RW evaluation, as well as an implementation of skip-grams with negative sampling (SGNS), originally introduced by Mikolov et al. (2013), with optimized hyperparameters. We also compare to the original results from GloVe 42B, and the best MEN-3000 result from Faruqui et al. (2015). We use the complete RW data, not our test set, so that we can compare directly to previous results.

Levy’s evaluation uses a version of WordSim-353 that is split into separate sets for similarity and relatedness. We estimate the overall score using a weighted average based on the size of the split datasets.

The RW and MEN data is also plotted in Figure 3. Error bars indicate 95% confidence intervals based on the Fisher transformation of ρ [Fisher, 1915], supposing that each evaluation is randomly sampled from a hypothetical larger data set.
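For reference, such an interval is computed by mapping the correlation through the Fisher transformation, where it is approximately normal with standard error 1/sqrt(n − 3). The sample size below is illustrative, not the actual size of any evaluation set here.

```python
import math

# Sketch of a 95% confidence interval for a correlation via the Fisher
# transformation: z = artanh(rho) is approximately normal with standard
# error 1 / sqrt(n - 3) for n sampled pairs.

def fisher_ci(rho, n, z_crit=1.96):
    z = math.atanh(rho)                  # Fisher transformation
    half = z_crit / math.sqrt(n - 3)     # half-width in z space
    return math.tanh(z - half), math.tanh(z + half)

low, high = fisher_ci(0.601, 678)        # illustrative n, not the real set size
```

Because tanh is concave above zero, the interval is asymmetric around the observed correlation, as visible in error-bar plots like Figure 3.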

The ConceptNet vector ensemble with all six components performs better than the previously published systems: we can reject the null hypothesis that the ensemble performs the same as one of these published systems with p < .05. It is inconclusive whether it is better to include or exclude PPDB in the ensemble, as the results with and without it are very close.

Table 4 shows the performance of these systems on gold standards that have been translated to other languages, in comparison to the multilingual results published by Faruqui et al. (2015). Our system performs well in non-English languages even though the vocabularies of word2vec and GloVe are assumed to be English only. The representations of non-English words come from expanded retrofitting, which allows information to propagate over the inter-language links in ConceptNet.

4.5 Varying the System

Modification                            RW [dev]  MEN [dev]  WS-353
Unmodified                              .593      .858       .802
Use first row instead of row-merging    .572      .816       .764
Unweighted row-merging                  .582      .856       .779
No self-loops in retrofitting           .563      .857       .764
Interpolate without SVD                 .579      .856       .813
Reduce from 600 to 450 dimensions       .586      .858       .797
Reduce from 600 to 300 dimensions       .583      .859       .791
Table 5: The effects of various modifications to the full ensemble. RW and MEN-3000 were evaluated using their development sets here, not the held-out test data.

Some of the procedures we implemented in creating the ensemble require some justification. To show the benefits of certain decisions, such as adding self-loops or the way we choose to merge rows of GloVe, we have evaluated what happens to the system in the absence of each decision. These evaluations appear in Table 5. These evaluations were part of how we decided on the best configuration of the system, so they were run on the development sets of RW and MEN-3000, not the held-out test sets.

In Table 5, we can see that the choice of how to merge rows of GloVe that receive the same standardized label makes a large difference. Recall that the method we ultimately used was to assign each row a pseudo-frequency based on Zipf's law, and then take a weighted average of the rows using those frequencies as weights. The results drop noticeably under either of the other proposed methods: taking only the first (most frequent) row that appears, or taking the unweighted average of the rows.
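The weighted merge can be sketched in a few lines. This is a minimal sketch, not our exact implementation: `merge_rows` is a hypothetical name, and we assume the duplicate rows arrive in descending frequency order, as in GloVe's vocabulary, so that the k-th row gets a Zipfian pseudo-frequency proportional to 1/k.

```python
import numpy as np

def merge_rows(rows):
    """Merge embedding rows that share a standardized label.

    rows: vectors in descending frequency order. Following Zipf's law,
    the k-th row is assigned a pseudo-frequency proportional to 1/k,
    and the merged vector is the weighted average under those weights.
    """
    weights = np.array([1.0 / k for k in range(1, len(rows) + 1)])
    weights /= weights.sum()        # normalize pseudo-frequencies
    return weights @ np.vstack(rows)
```

Merging two rows this way weights the more frequent surface form twice as heavily as the less frequent one, which is why it behaves between the "first row only" and "unweighted average" extremes in Table 5.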

The same table shows that we can save some computation time and space by reducing the dimensionality of the feature vectors while building the matrix that combines word2vec and GloVe. Reducing the dimensionality seems to cause a small degradation in RW score, but the MEN and WordSim scores stay around the same value, even increasing by an inconclusive amount in some cases.

If we skip applying the SVD entirely in the interpolation step (that is, when a term has features in both word2vec and GloVe, we simply concatenate those features with all their redundancy), the RW score drops somewhat, but the WordSim-353 score rises.
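The interpolation step can be sketched as follows, under some simplifying assumptions: `interpolate` is our name for it, the two matrices are taken to be row-aligned over a shared vocabulary, and the details of scaling and rank selection in our actual system may differ.

```python
import numpy as np

def interpolate(w2v_mat, glove_mat, k=300):
    """Combine two row-aligned embedding matrices (one row per term,
    same vocabulary order) by concatenating their features, then use
    a truncated SVD to remove redundancy between the feature sets.

    k: number of dimensions to keep in the reduced space.
    """
    combined = np.hstack([w2v_mat, glove_mat])
    # Economy-size SVD: combined = U @ diag(S) @ Vt
    U, S, Vt = np.linalg.svd(combined, full_matrices=False)
    # Rows of U[:, :k] * S[:k] are the reduced term embeddings
    return U[:, :k] * S[:k]
```

When k is at least the rank of the concatenated matrix, this reduction preserves all pairwise dot products between terms exactly; smaller k discards only the least-significant shared variation, which matches the small degradations seen in Table 5.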

In the retrofitting procedure, we made the decision to add “self-loops” that connect each term to itself, because this helps stabilize the representations of terms that are outside the original vocabulary of GloVe. Reversing this decision (and running for the same number of steps) causes a noticeable drop in performance on RW, the evaluation that is most likely to involve words that were poorly represented or unrepresented in GloVe.
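The role of self-loops is easiest to see in a simplified sketch of retrofitting, in the style of Faruqui et al. (2015). This sketch uses uniform neighbor weights rather than the weights of our expanded retrofitting, and the function and argument names are ours: each iteration replaces a term's vector with the average of its neighbors, its original vector (if it has one), and, when self-loops are enabled, its own current vector.

```python
import numpy as np

def retrofit(orig, edges, dim, n_iter=10, self_loops=True):
    """Simplified retrofitting sketch with uniform weights.

    orig:  dict mapping a term to its original embedding; terms that
           appear only in the knowledge graph are absent from it.
    edges: dict mapping each term to a list of neighboring terms.
    """
    # Start each term at its original vector, or at zero if it is
    # outside the original vocabulary.
    vecs = {t: orig.get(t, np.zeros(dim)).copy() for t in edges}
    for _ in range(n_iter):
        for term, neighbors in edges.items():
            parts = [vecs[n] for n in neighbors if n in vecs]
            if term in orig:
                parts.append(orig[term])   # pull back toward the original
            if self_loops:
                parts.append(vecs[term])   # self-loop: keep the current vector
            if parts:
                vecs[term] = np.mean(parts, axis=0)
    return vecs
```

For a term with no original vector, the self-loop is what lets its representation accumulate gradually from its neighbors instead of being overwritten wholesale at every step, which is why removing self-loops hurts RW most.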

In separate experimentation, we found that when we separate ConceptNet into its component datasets and drop each one in turn, the effects on the evaluation results are mostly quite small. There is no single dataset that acts as the “keystone” without which the system falls apart, but one dataset, Wiktionary, unsurprisingly has a larger effect than the others, because it is responsible for most of the assertions. Of the 5,631,250 filtered assertions that come from ConceptNet, 4,244,410 are credited to Wiktionary.

Without Wiktionary, the score on RW drops from .587 to .541. However, at the same time, the MEN-3000 score increases from .858 to .865. The system continues to do what it is designed to do without Wiktionary, but there seems to be a tradeoff between performance on rare words and performance on common words.

5 Conclusions

The work we have presented here involves building on many previous techniques, while adding some new techniques and new sources of knowledge.

As \newcitelevy2015embeddings found, high-level choices about how to use a system can significantly affect its performance. While Levy found settings of hyperparameters that made word2vec outperform GloVe, we found that we can make GloVe outperform Levy’s tuned word2vec by pre-processing the words in GloVe with case-folding and lemmatization, and re-weighting the features using normalization.

We also showed that it isn’t necessary to choose just one of word2vec or GloVe as a starting point. Instead, we can benefit from both of them by using a locally linear interpolation between them.

We showed that ConceptNet is a useful source of structured knowledge that was not considered in previous work on retrofitting distributional semantics with structured knowledge, especially when retrofitting is generalized into our expanded retrofitting technique, which can benefit from links that are outside the original vocabulary.

5.1 Future Work

One aspect of our method that clearly has room for improvement is the fact that we disregard the labels on relations in ConceptNet. There is valuable knowledge there that we might be able to take into account with a more sophisticated extension of retrofitting, one that goes beyond simply knowing that particular words should be related, to handling them differently based on how they are related, as in the RESCAL representation [\citenameNickel et al.2011]. This seems particularly important for antonyms, which indicate that words are similar overall but different in a key aspect, such as forming two ends of the same scale.

Lemmatization is clearly a useful component of our word-similarity representation, but it loses information. Representing morphological relationships as operations in the vector space, as in \newcitesoricut2015unsupervised, could yield a better representation of similarities between forms of words.

We believe that the variety of data sources represented in ConceptNet helped to improve evaluation scores by expanding the domain of the system’s knowledge. There’s no reason the improvement needs to stop here. It is quite likely that there are more sources of linked open data that could be included, or further standardizations that could be applied to the text to align more data. An appropriate representation of Universal WordNet [\citenameDe Melo and Weikum2009] could improve the multilingual performance, for example, as could embeddings built from the distribution of words in non-English text. Adapting Rothe and Schütze’s AutoExtend representation could provide a representation of word senses and a more specific understanding of words to the system.

6 Reproducing These Results

We aim for these results to be reproducible and reusable by others. The code that we ran to produce the results in this paper is available in our GitHub repository; we have tagged this revision as submitted-20160406.

The README of that repository also points to URLs where the computed results can be downloaded.

(As this paper is being updated in 2019, the exact data that went into building the embeddings used in this paper has sadly been lost. However, the repository points to newer data and newer code that implements an updated version of this process.)


Acknowledgments

We thank Catherine Havasi for overseeing and reviewing this work, and Avril Kenney, Dennis Clark, and Alice Kaanta for providing feedback on drafts of this paper.

We thank the word2vec, GloVe, and PPDB teams for opening their data so that new techniques can be built on top of their results. We also thank our collaborators who have contributed code and data to ConceptNet over the years, and the tens of thousands of pseudonymous contributors to Wiktionary, Open Mind Common Sense, and related projects, for their frequently uncredited work in providing freely-available lexical knowledge.




  1. Some methods and evaluations [\citenameAgirre et al.2009] distinguish word similarity from word relatedness. “Coffee” and “mug”, for example, are quite related, but not actually similar because coffee is not like a mug. In this paper, however, we conflate similarity and relatedness into the same metric, as most evaluations do.
  2. Given different goals – such as achieving a high score on \newcitemikolov2013word2vec’s analogy evaluation that tests for implicit relations such as “A is the CEO of company B” – including an appropriate representation of DBPedia would of course be helpful.
  6. We maintain ℓ2 normalization so that minimizing Euclidean distance and maximizing cosine similarity are always linked.
  7. When GloVe is not -normalized, it is -normalized instead, following \newcitepennington2014glove.
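The link in footnote 6 between distance and cosine similarity can be checked directly: for unit-length vectors u and v, the squared Euclidean distance satisfies ||u − v||² = 2 − 2·cos(u, v), so ranking neighbors by distance and by cosine similarity gives the same order. A minimal numpy check (the vectors here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(300)
v = rng.standard_normal(300)
# Unit-normalize both vectors, as in the embedding matrix
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)

# For unit vectors: squared distance = 2 - 2 * cosine similarity
dist_sq = np.sum((u - v) ** 2)
cos = u @ v
assert np.isclose(dist_sq, 2.0 - 2.0 * cos)
```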


  1. Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 19–27. Association for Computational Linguistics.
  2. Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. Springer.
  3. James Breen. 2004. JMDict: a Japanese-multilingual dictionary. In Proceedings of the Workshop on Multilingual Linguistic Resources, pages 71–79. Association for Computational Linguistics.
  4. Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. J. Artif. Intell. Res. (JAIR), 49:1–47.
  5. Gerard De Melo and Gerhard Weikum. 2009. Towards a universal wordnet by learning from combined evidence. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 513–522. ACM.
  6. Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. 1990. Indexing by latent semantic analysis. JASIS, 41(6):391–407.
  7. Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.
  8. Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web, pages 406–414. ACM.
  9. Ronald A Fisher. 1915. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, pages 507–521.
  10. Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In HLT-NAACL, pages 758–764.
  11. Iryna Gurevych. 2005. Using the structure of a conceptual network in computing semantic relatedness. In Natural Language Processing–IJCNLP 2005, pages 767–778. Springer.
  12. Catherine Havasi, Robyn Speer, and Justin Holmgren. 2010. Automated color selection using semantic knowledge. In AAAI Fall Symposium: Commonsense Knowledge.
  13. Colette Joubarne and Diana Inkpen. 2011. Comparison of semantic similarity for different languages using the Google N-gram corpus and second-order co-occurrence measures. In Advances in Artificial Intelligence, pages 216–221. Springer.
  14. Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3276–3284.
  15. Yen-ling Kuo, Jong-Chuan Lee, Kai-yang Chiang, Rex Wang, Edward Shen, Cheng-wei Chan, and Jane Yung-jen Hsu. 2009. Community-based game design: experiments on social games for commonsense data collection. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pages 15–22. ACM.
  16. Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.
  17. Minh-Thang Luong, Richard Socher, and Christopher D Manning. 2013. Better word representations with recursive neural networks for morphology. CoNLL-2013, 104.
  18. Luana Micallef and Peter Rodgers. 2014. eulerAPE: Drawing area-proportional 3-venn diagrams using ellipses. PLoS ONE, 9(7):e101717.
  19. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
  20. George Miller, Christiane Fellbaum, Randee Tengi, P Wakefield, H Langone, and BR Haskell. 1998. WordNet. MIT Press Cambridge.
  21. Kazuhiro Nakahara and Shigeo Yamada. 2011. 日本でのコモンセンス知識獲得を目的とした Web ゲームの開発と評価 [Development and evaluation of a Web-based game for common-sense knowledge acquisition in Japan]. Unisys 技報 [Unisys Technical Review], 30(4):295–305.
  22. Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 809–816.
  23. Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. Proceedings of the Empiricial Methods in Natural Language Processing (EMNLP 2014), 12:1532–1543.
  24. Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1793–1803, Beijing, China, July. Association for Computational Linguistics.
  25. Herbert Rubenstein and John B Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.
  26. Push Singh, Thomas Lin, Erik T Mueller, Grace Lim, Travell Perkins, and Wan Li Zhu. 2002. Open Mind Common Sense: Knowledge acquisition from the general public. In On the move to meaningful internet systems 2002: CoopIS, DOA, and ODBASE, pages 1223–1237. Springer.
  27. Radu Soricut and Franz Och. 2015. Unsupervised morphology induction using word embeddings. In Proc. NAACL.
  28. Robyn Speer and Catherine Havasi. 2012. Representing general relational knowledge in ConceptNet 5. In LREC, pages 3679–3686.
  29. Robyn Speer, Catherine Havasi, and Henry Lieberman. 2008. AnalogySpace: Reducing the dimensionality of common sense knowledge. In AAAI, volume 8, pages 548–553.
  30. Luis von Ahn, Mihir Kedia, and Manuel Blum. 2006. Verbosity: a game for collecting common-sense facts. In Proceedings of the SIGCHI conference on Human Factors in computing systems, pages 75–78. ACM.
  31. Wiktionary. 2014. Wiktionary, the free dictionary — English data export. (A collaborative project with thousands of authors.) Retrieved 2014-08-26.
  32. Kai Zhao, Hany Hassan, and Michael Auli. 2015. Learning translation models from monolingual continuous representations. In Proceedings of NAACL.
  33. G.K. Zipf. 1949. Human behavior and the principle of least effort: an introduction to human ecology. Addison-Wesley Press.