Learning Multi-Sense Word Distributions using Approximate Kullback-Leibler Divergence
Learning word representations has garnered increasing attention in the recent past due to its diverse text applications. Word embeddings encapsulate the syntactic and semantic regularities of sentences. Modelling word embeddings as multi-sense Gaussian mixture distributions additionally captures uncertainty and polysemy of words. We propose to learn the Gaussian mixture representation of words using a Kullback-Leibler (KL) divergence based objective function. The KL divergence based energy function provides a better distance metric which can effectively capture entailment and distributional similarity among words. Since the KL divergence between Gaussian mixtures is intractable, we use an approximation of the KL divergence between Gaussian mixtures. We perform qualitative and quantitative experiments on benchmark word similarity and entailment datasets which demonstrate the effectiveness of the proposed approach.
Language modelling in its inception used one-hot encoding of words. However, one-hot encoding captures only the ordering of the vocabulary, not the semantic similarity between words. Vector space models help to learn word representations in a lower-dimensional space that also captures semantic similarity. Learning word embeddings aids natural language processing tasks such as question answering and reasoning (Choi et al., 2018), stance detection (Augenstein et al., 2016), and claim verification (Hanselowski et al., 2018).
Recent models (Mikolov et al., 2013a; Bengio et al., 2003) work on the premise that words with similar contexts share semantic similarity. Bengio et al. (2003) propose a neural probabilistic model which models the probability of a target word conditioned on the previous words using a neural network. Word2Vec models (Mikolov et al., 2013a) such as continuous bag-of-words (CBOW) predict the target word given its context, while the skip-gram model works in reverse, predicting the context given the target word. GloVe embeddings, in contrast, are based on global matrix factorization over local contexts (Pennington et al., 2014). However, the aforementioned models do not handle words with multiple meanings (polysemy).
Huang et al. (2012) propose a neural network approach that considers both local and global contexts in learning word embeddings (point estimates). Their multiple-prototype model handles polysemous words by providing a priori heuristics about word senses in the dataset. Tian et al. (2014) propose an alternative for handling polysemous words using a modified skip-gram model and the EM algorithm. Neelakantan et al. (2015) present a non-parametric alternative for handling polysemy. However, these approaches fail to consider entailment relations among words.
Vilnis and McCallum (2014) learn a Gaussian distribution per word using the expected likelihood kernel. However, for polysemous words, this may lead to word distributions with larger variances as it may have to cover various senses.
Athiwaratkun and Wilson (2017) propose a multimodal word distribution approach which captures polysemy. However, their energy based objective function fails to consider asymmetry, and hence entailment. Textual entailment recognition is necessary to capture lexical inference relations such as causality (for example, mosquito ⊨ malaria) and hypernymy (for example, dog ⊨ animal).
In this paper, we propose to learn multi-sense word embedding distributions using a variant of the max-margin objective with an asymmetric KL divergence based energy function, in order to capture textual entailment. Multi-sense distributions are advantageous in capturing the polysemous nature of words and in reducing the uncertainty per word by distributing it across senses. However, computing the KL divergence between mixtures of Gaussians is intractable, and we therefore use a KL divergence approximation based on stricter upper and lower bounds. While capturing textual entailment (asymmetry), we do not compromise on capturing symmetric similarity between words (for example, funny and hilarious), as elucidated in Section 3.1. We also show the effectiveness of the proposed approach on benchmark word similarity and entailment datasets in the experimental section.
2.1 Word Representation
Probabilistic representation of words helps one model uncertainty in word representation, as well as polysemy. Given a corpus $D$ containing a list of words, the probability density for a word $w$ can be represented as a mixture of Gaussians with $K$ components (Athiwaratkun and Wilson, 2017):

$$f_w(x) = \sum_{i=1}^{K} p_{w,i} \, \mathcal{N}(x; \mu_{w,i}, \Sigma_{w,i})$$

Here, $p_{w,i}$ represents the probability of word $w$ belonging to component $i$, $\mu_{w,i}$ represents the $D$-dimensional word representation corresponding to the $i$-th component sense of word $w$, and $\Sigma_{w,i}$ represents the uncertainty in the representation of word $w$ belonging to component $i$.
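As an illustration, the mixture density above can be evaluated directly. The following is a minimal NumPy sketch assuming diagonal covariances (as used later in the experiments); the function names are ours, not from any released implementation.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian N(x; mean, diag(var))."""
    d = x.shape[0]
    norm = (2 * np.pi) ** (-d / 2) * np.prod(var) ** -0.5
    return norm * np.exp(-0.5 * np.sum((x - mean) ** 2 / var))

def mixture_density(x, weights, means, variances):
    """f_w(x) = sum_i p_{w,i} N(x; mu_{w,i}, Sigma_{w,i})."""
    return sum(p * gaussian_pdf(x, mu, var)
               for p, mu, var in zip(weights, means, variances))
```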
3 Objective function
The model parameters (means, covariances and mixture weights) can be learnt using a variant of the max-margin objective (Joachims, 2002):

$$L(w, c_p, c_n) = \max(0, \, m - E(w, c_p) + E(w, c_n))$$

Here $E(\cdot, \cdot)$ represents an energy function which assigns a score to a pair of words, $w$ is the word under consideration, $c_p$ its positive context (same context), and $c_n$ a negative context. The objective aims to push the energy of a word with its positive context above its energy with the negative context by a margin of $m$. Thus, word pairs in the same context get a higher energy than word pairs in dissimilar contexts. Athiwaratkun and Wilson (2017) consider the energy function to be the expected likelihood kernel, which is defined as follows:

$$E(f, g) = \int f(x) \, g(x) \, dx = \sum_{i,j} p_i \, q_j \, \mathcal{N}(0; \mu_{f,i} - \mu_{g,j}, \Sigma_{f,i} + \Sigma_{g,j})$$
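The margin objective above can be sketched as follows; the energy function is passed in as a callable, and the margin value and all names are illustrative.

```python
def max_margin_loss(energy, w, c_pos, c_neg, margin=1.0):
    """max(0, m - E(w, c_p) + E(w, c_n)): pushes the energy of the
    positive-context pair above the negative-context pair by the margin."""
    return max(0.0, margin - energy(w, c_pos) + energy(w, c_neg))
```

The loss is zero once the positive pair's energy exceeds the negative pair's by at least the margin, so well-separated pairs contribute no gradient.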
This is similar to the cosine similarity metric over vectors, and the energy between two words is maximal when they have similar distributions. However, the expected likelihood kernel is a symmetric metric, which makes it unsuitable for capturing ordering among words, and hence entailment.
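For two Gaussian components with diagonal covariances, the expected likelihood kernel has the closed form $\mathcal{N}(0; \mu_1 - \mu_2, \Sigma_1 + \Sigma_2)$. A small sketch of its log value, illustrating the symmetry noted above (names are ours):

```python
import numpy as np

def log_elk(mu1, var1, mu2, var2):
    """log ∫ N(x; mu1, Σ1) N(x; mu2, Σ2) dx = log N(0; mu1-mu2, Σ1+Σ2),
    for diagonal covariances given as variance vectors."""
    v = var1 + var2
    d = mu1.shape[0]
    return -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(v))
                   + np.sum((mu1 - mu2) ** 2 / v))
```

Since `log_elk(a, b)` equals `log_elk(b, a)` by construction, this kernel cannot encode the direction of an entailment pair.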
3.1 Proposed Energy function
As each word is represented by a mixture of Gaussian distributions, KL divergence is a better choice of energy function to capture the distance between distributions. Since the KL divergence is minimal when the distributions are similar and maximal when they are dissimilar, the energy function is taken as the exponentiated negative KL divergence:

$$E(f, g) = e^{-KL(f \| g)}$$
However, computing the KL divergence between Gaussian mixtures is intractable, and obtaining an exact KL value is not possible. One way of approximating the KL divergence is Monte-Carlo approximation, but it requires a large number of samples to obtain a good approximation and is computationally expensive in high-dimensional embedding spaces.
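For reference, the Monte-Carlo estimate mentioned above can be sketched as follows (diagonal covariances assumed; each mixture is a `(weights, means, variances)` tuple and all names are illustrative). Its cost grows with the number of samples needed for a stable estimate, which motivates the closed-form approximation discussed next.

```python
import numpy as np

def sample_gmm(rng, weights, means, variances, n):
    """Draw n samples from a diagonal-covariance Gaussian mixture."""
    comps = rng.choice(len(weights), size=n, p=weights)
    eps = rng.standard_normal((n, means.shape[1]))
    return means[comps] + eps * np.sqrt(variances[comps])

def log_gmm_pdf(x, weights, means, variances):
    """log f(x) for each row of x via log-sum-exp over components."""
    diff = x[:, None, :] - means[None, :, :]
    log_comp = -0.5 * (np.log(2 * np.pi * variances)[None].sum(-1)
                       + (diff ** 2 / variances[None]).sum(-1))
    log_comp += np.log(weights)[None]
    m = log_comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True)))[:, 0]

def mc_kl(rng, f, g, n=100_000):
    """KL(f || g) ≈ (1/n) Σ [log f(x_i) - log g(x_i)], with x_i ~ f."""
    x = sample_gmm(rng, *f, n)
    return np.mean(log_gmm_pdf(x, *f) - log_gmm_pdf(x, *g))
```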
Alternatively, Hershey and Olsen (2007) present approximations of the KL divergence between Gaussian mixtures, where they obtain an upper bound through the product-of-Gaussians approximation method and a lower bound through the variational approximation method. Durrieu et al. (2012) combine the lower and upper bounds from the approximation methods of Hershey and Olsen (2007) to provide a stricter bound on the KL divergence between Gaussian mixtures. Let us consider Gaussian mixtures for the words $w_f$ and $w_g$ as follows:

$$f(x) = \sum_{i=1}^{K} p_i \, \mathcal{N}(x; \mu_{f,i}, \Sigma_{f,i}), \qquad g(x) = \sum_{j=1}^{K} q_j \, \mathcal{N}(x; \mu_{g,j}, \Sigma_{g,j})$$
The approximate KL divergence between the Gaussian mixture representations of the words $w_f$ and $w_g$ is shown in equation 5. More details on the approximation are included in the Supplementary Material.

$$KL_{approx}(f \| g) \approx \frac{1}{2} \sum_{i=1}^{K} p_i \log \frac{\sum_{j=1}^{K} p_j \, e^{-KL(f_i \| f_j)}}{\sum_{j=1}^{K} q_j \, e^{-KL(f_i \| g_j)}} + \frac{1}{2} \sum_{i=1}^{K} p_i \log \frac{\sum_{j=1}^{K} p_j \, ELK(f_i, f_j)}{\sum_{j=1}^{K} q_j \, ELK(f_i, g_j)} \quad (5)$$

where $f_i = \mathcal{N}(x; \mu_{f,i}, \Sigma_{f,i})$, $g_j = \mathcal{N}(x; \mu_{g,j}, \Sigma_{g,j})$, and $ELK(f_i, g_j) = \int f_i(x) \, g_j(x) \, dx$. Note that the expected likelihood kernel appears component-wise inside the approximate KL divergence derivation.
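Computationally, the approximation can be sketched as below: a minimal NumPy sketch under the assumption that the combined estimate is the mean of the two Hershey and Olsen (2007) approximations, the variational one built from component-wise $e^{-KL}$ terms and the product-of-Gaussians one built from component-wise expected likelihood kernel terms. Diagonal covariances are assumed; each mixture is a `(weights, means, variances)` tuple and all function names are ours.

```python
import numpy as np

def diag_kl(mu1, v1, mu2, v2):
    """Closed-form KL between two diagonal-covariance Gaussians."""
    return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (mu1 - mu2) ** 2) / v2 - 1.0)

def log_elk(mu1, v1, mu2, v2):
    """log expected likelihood kernel: log N(0; mu1-mu2, Σ1+Σ2)."""
    v = v1 + v2
    return -0.5 * np.sum(np.log(2 * np.pi * v) + (mu1 - mu2) ** 2 / v)

def approx_kl(f, g):
    """Mean of the variational and product-of-Gaussians approximations
    of KL(f || g) for Gaussian mixtures f and g."""
    (pw, mw, vw), (qw, mg, vg) = f, g
    def ratio(log_pair):
        total = 0.0
        for pa, ma, va in zip(pw, mw, vw):
            num = sum(pb * np.exp(log_pair(ma, va, mb, vb))
                      for pb, mb, vb in zip(pw, mw, vw))
            den = sum(qb * np.exp(log_pair(ma, va, mb, vb))
                      for qb, mb, vb in zip(qw, mg, vg))
            total += pa * np.log(num / den)
        return total
    d_var = ratio(lambda m1, v1, m2, v2: -diag_kl(m1, v1, m2, v2))
    d_prod = ratio(lambda m1, v1, m2, v2: log_elk(m1, v1, m2, v2))
    return 0.5 * (d_var + d_prod)
```

Unlike the expected likelihood kernel alone, this quantity is asymmetric in `f` and `g`, which is what allows it to score entailment direction.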
One advantage of using KL as the energy function is that it enables capturing the asymmetry present in entailment datasets. For example, let us consider the word 'chair' with the two senses 'bench' and 'sling', and the word 'wood' with the two senses 'trees' and 'furniture'. The word chair ($f$) is entailed within wood ($g$), i.e. chair ⊨ wood. Now, minimizing the KL divergence necessitates maximizing the $\sum_j q_j \, e^{-KL(f_i \| g_j)}$ terms, which in turn minimizes $KL(f_i \| g_j)$. This results in the support of component $i$ of $f$ lying within component $j$ of $g$, and since this holds for all component pairs it leads to the entailment of $f$ within $g$. Consequently, we can see that bench ⊨ trees, bench ⊨ furniture, sling ⊨ trees, and sling ⊨ furniture. Thus, it introduces lexical relationships between the senses of the child word and those of the parent word. Minimizing the KL also necessitates maximizing the $ELK(f_i, g_j)$ terms for all component pairs of $f$ and $g$. This is similar to maximizing the expected likelihood kernel, which brings the means of $f$ and $g$ closer (weighted by their covariances), as discussed in (Athiwaratkun and Wilson, 2017). Hence, the proposed approach captures the best of both worlds, thereby catering to both word similarity and entailment.
We also note that minimizing the KL divergence necessitates minimizing $\sum_j p_j \, e^{-KL(f_i \| f_j)}$, which in turn maximizes $KL(f_i \| f_j)$ between the components of the same word. This prevents the different mixture components of a word from converging to a single Gaussian and encourages capturing the different possible senses of the word. The same is also achieved by minimizing the $\sum_j p_j \, ELK(f_i, f_j)$ term; both act as regularization terms which promote diversity in the learnt senses of a word.
4 Experimentation and Results
We train our proposed model GM_KL (Gaussian Mixture using KL divergence) on the Text8 dataset (Mikolov et al., 2014), which is pre-processed Wikipedia text. Unique and frequent words are chosen from it using the subsampling trick in Mikolov et al. (2013b). We compare GM_KL with the previous approaches w2g (Vilnis and McCallum, 2014) (single Gaussian model) and w2gm (Athiwaratkun and Wilson, 2017) (mixture-of-Gaussians model with the expected likelihood kernel). The same embedding size, number of mixture components, context window length, and batch size were used for all the models in the experiments. The word embeddings were initialized using a uniform distribution, with the range chosen to control the expected mean and variance following (Cun et al., 1998). One could also consider initializing the word embeddings using other contextual representations such as BERT (Devlin et al., 2018) and ELMo (Peters et al., 2018) in the proposed approach. In order to purely analyze the performance of GM_KL over the other models, we have chosen initialization using the uniform distribution for our experiments. For computational benefits, diagonal covariance matrices are used, similar to (Athiwaratkun and Wilson, 2017). Each mixture probability is constrained to the range [0, 1], summing to 1, by optimizing over unconstrained scores and converting the scores to probabilities using the softmax function. The mixture scores are initialized to equal values to ensure fairness among all the components. The threshold for negative sampling was set as recommended in Mikolov et al. (2013a). Mini-batch gradient descent with the Adagrad optimizer (Duchi et al., 2011) was used.
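The score-to-probability constraint described above amounts to a softmax over per-word component scores; a minimal sketch (names are ours):

```python
import numpy as np

def mixture_probs(scores):
    """Map unconstrained component scores to probabilities on the simplex
    via a numerically stable softmax."""
    z = np.exp(scores - scores.max())
    return z / z.sum()
```

Optimizing the raw scores keeps the problem unconstrained while the softmax guarantees a valid mixture; initializing all scores to the same value yields uniform component probabilities.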
Table 1 shows the qualitative results of GM_KL. Given a query word and a component id, the set of nearest neighbours along with their respective component ids is listed. For example, one component of the word 'plane' captures the 'geometry' sense, as do its neighbours, while another component captures the 'vehicle' sense, along with its corresponding neighbours. Other words behave similarly: 'rock' captures both the 'metal' and 'music' senses, 'star' captures the 'celebrity' and 'astronomical' senses, and 'phone' captures the 'telephony' and 'internet' senses.
We quantitatively compare the performance of the GM_KL, w2g, and w2gm approaches on the SCWS dataset (Huang et al., 2012). The dataset consists of word pairs of polysemous and homonymous words, with labels obtained by averaging human scores. The Spearman correlation between the human scores and the model scores is computed. To obtain the model score, the following metrics are used:
MaxCos: Maximum cosine similarity among all component pairs of words $w_f$ and $w_g$: $\max_{i,j} \cos(\mu_{f,i}, \mu_{g,j})$.
AvgCos: Average component-wise cosine similarity between the words $w_f$ and $w_g$: $\frac{1}{K^2} \sum_{i,j} \cos(\mu_{f,i}, \mu_{g,j})$.
KLapprox: Formulated as shown in (5) between the words $w_f$ and $w_g$.
KLcomp: Maximum component-wise negative KL between words $w_f$ and $w_g$: $\max_{i,j} \left( -KL(f_i \| g_j) \right)$.
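The two cosine-based metrics can be sketched as follows, operating on the lists of component means of the two words; function names are illustrative:

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def max_cos(means_a, means_b):
    """Maximum cosine similarity over all component pairs."""
    return max(cos(ma, mb) for ma in means_a for mb in means_b)

def avg_cos(means_a, means_b):
    """Average cosine similarity over all component pairs."""
    return float(np.mean([cos(ma, mb) for ma in means_a for mb in means_b]))
```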
Table 3 shows the Spearman correlation values of the GM_KL model evaluated on the benchmark word similarity datasets: SL (Hill et al., 2015), WS, WS-R, WS-S (Finkelstein et al., 2002), MEN (Bruni et al., 2014), MC (Miller and Charles, 1991), RG (Rubenstein and Goodenough, 1965), YP (Yang and Powers, 2006), MTurk-287 and MTurk-771 (Radinsky et al., 2011; Halawi et al., 2012), and RW (Luong et al., 2013). The metric used for comparison is 'AvgCos'. It can be seen that for most of the datasets GM_KL achieves a significantly better correlation score than the w2g and w2gm approaches. Datasets such as MC and RW consist of words with only a single sense, and hence the w2g model performs better, with GM_KL achieving the next best performance. The YP dataset has multiple senses but does not contain entailed data, and hence could not make use of the entailment benefits of GM_KL.
Table 4 shows the evaluation results of the GM_KL model on entailment datasets: the entailment pairs dataset (Baroni et al., 2012) created from WordNet with both positive and negative labels, a crowdsourced dataset (Turney and Mohammad, 2015) of semantic relations labelled as entailed or not, and an annotated dataset of distributionally similar nouns (Kotlerman et al., 2010). The 'MaxCos' similarity metric is used for evaluation, and the best precision and best F1-score are reported, obtained by picking the optimal threshold. Overall, GM_KL performs better than both the w2g and w2gm approaches.
We proposed a KL divergence based energy function for learning multi-sense word embedding distributions modelled as Gaussian mixtures. Due to the intractability of computing the KL divergence between Gaussian mixtures, we used an approximate KL divergence function. We also demonstrated that the proposed GM_KL approach performs better than previous approaches on benchmark word similarity and entailment datasets.
- Multimodal word distributions. arXiv preprint arXiv:1704.08424. Cited by: §1, §2.1, §3.1, §3, §4.
- Stance detection with bidirectional conditional encoding. arXiv preprint arXiv:1606.05464. Cited by: §1.
- Entailment above the word level in distributional semantics. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 23–32. Cited by: Table 4, §4.
- A neural probabilistic language model. Journal of machine learning research 3 (Feb), pp. 1137–1155. Cited by: §1.
- Multimodal distributional semantics. Journal of Artificial Intelligence Research 49, pp. 1–47. Cited by: §4.
- Quac: question answering in context. arXiv preprint arXiv:1808.07036. Cited by: §1.
- Efficient backprop, neural networks: tricks of the trade. Lecture notes in computer sciences 1524, pp. 5–50. Cited by: §4.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §4.
- Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §4.
- Lower and upper bounds for approximation of the kullback-leibler divergence between gaussian mixture models. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4833–4836. Cited by: Appendix A, §3.1.
- Placing search in context: the concept revisited. ACM Transactions on information systems 20 (1), pp. 116–131. Cited by: §4.
- Large-scale learning of word relatedness with constraints. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1406–1414. Cited by: §4.
- UKP-athene: multi-sentence textual entailment for claim verification. arXiv preprint arXiv:1809.01479. Cited by: §1.
- Approximating the kullback leibler divergence between gaussian mixture models. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07, Vol. 4, pp. IV–317. Cited by: Appendix A, Appendix A, §3.1.
- Simlex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41 (4), pp. 665–695. Cited by: §4.
- Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pp. 873–882. Cited by: §1, §4.
- Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 133–142. Cited by: §3.
- Directional distributional similarity for lexical inference. Natural Language Engineering 16 (4), pp. 359–389. Cited by: Table 4, §4.
- Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 104–113. Cited by: §4.
- Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §1, §4.
- Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753. Cited by: §4.
- Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §4.
- Contextual correlates of semantic similarity. Language and cognitive processes 6 (1), pp. 1–28. Cited by: §4.
- Efficient non-parametric estimation of multiple embeddings per word in vector space. arXiv preprint arXiv:1504.06654. Cited by: §1.
- Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §1.
- Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §4.
- A word at a time: computing word relatedness using temporal semantic analysis. In Proceedings of the 20th international conference on World wide web, pp. 337–346. Cited by: §4.
- Contextual correlates of synonymy. Communications of the ACM 8 (10), pp. 627–633. Cited by: §4.
- A probabilistic model for learning multi-prototype word embeddings. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 151–160. Cited by: §1.
- Experiments with three approaches to recognizing lexical entailment. Natural Language Engineering 21 (3), pp. 437–476. Cited by: Table 4, §4.
- Word representations via gaussian embedding. arXiv preprint arXiv:1412.6623. Cited by: §1, §4.
- Verb similarity on the taxonomy of wordnet. Masaryk University. Cited by: §4.
Appendix A Approximation for KL divergence between mixtures of Gaussians
The KL divergence between Gaussian mixtures $f$ and $g$ can be decomposed as:

$$KL(f \| g) = \int f(x) \log f(x) \, dx - \int f(x) \log g(x) \, dx = \sum_{i=1}^{K} p_i \left( \int f_i(x) \log f(x) \, dx - \int f_i(x) \log g(x) \, dx \right)$$

Hershey and Olsen (2007) present approximations of the KL divergence between Gaussian mixtures using:
- the product-of-Gaussians approximation method, where the KL is approximated using products of the component Gaussians, bounding each term $\int f_i(x) \log g(x) \, dx$ above by $\log \sum_{j} q_j \, ELK(f_i, g_j)$ via Jensen's inequality, and
- the variational approximation method, where the KL is approximated by introducing variational parameters, bounding the same term below by $\log \sum_{j} q_j \, e^{-KL(f_i \| g_j)} - H(f_i)$,

where $H(f_i)$ represents the entropy term, and the entropy of component $i$ of word $w$ with dimension $D$ is given as

$$H(f_i) = \frac{1}{2} \log \left( (2 \pi e)^D \, |\Sigma_{w,i}| \right)$$
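The closed-form entropy of a diagonal-covariance Gaussian component can be checked numerically; a minimal sketch (names are ours):

```python
import numpy as np

def gaussian_entropy(var):
    """H = 0.5 * log((2*pi*e)^D * |Sigma|) for Sigma = diag(var)."""
    d = len(var)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + np.sum(np.log(var)))
```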