Efficient Sentence Embedding via Semantic Subspace Analysis
A novel sentence embedding method built upon semantic subspace analysis, called semantic subspace sentence embedding (S3E), is proposed in this work. Given that word embeddings capture semantic relationships and semantically similar words tend to form semantic groups in a high-dimensional embedding space, we develop a sentence representation scheme by analyzing the semantic subspaces of a sentence's constituent words. Specifically, we construct a sentence model from two aspects. First, we represent words that lie in the same semantic group using an intra-group descriptor. Second, we characterize the interaction between multiple semantic groups with an inter-group descriptor. The proposed S3E method is evaluated on both textual similarity tasks and supervised tasks. Experimental results show that it offers comparable or better performance than the state-of-the-art. The complexity of the S3E method is also much lower than that of other parameterized models.
Word embedding techniques are widely used in natural language processing (NLP) tasks. For example, they improve downstream tasks such as machine translation , syntactic parsing , and text classification . Yet, many NLP applications operate at the level of sentences or longer pieces of text. Although sentence embedding has received a lot of attention recently, encoding a sentence into a fixed-length vector that captures different linguistic properties remains a challenge.
Universal sentence embedding aims to compute a sentence representation that can be applied to any task. It can be categorized into two types: i) parameterized models and ii) non-parameterized models. Parameterized models are mainly based on deep neural networks and require training to update their parameters. Inspired by the famous word2vec model , the skip-thought model  adopts an encoder-decoder architecture to predict context sentences in an unsupervised manner. The InferSent model  is trained on high-quality supervised data, namely, Natural Language Inference data. It shows that supervised training objectives can outperform unsupervised ones. USE  combines supervised and unsupervised objectives and employs a transformer architecture. The STN model  leverages a multi-tasking framework for sentence embedding to provide better generalizability. With the recent success of deep contextualized word models, SBERT  and SBERT-WK  were proposed to leverage the power of self-supervised learning from large unlabeled corpora. Different parameterized models attempt to capture semantic and syntactic meanings from different aspects. Even though their performance is better than that of non-parameterized models, parameterized ones are more complex and computationally expensive. Since it is challenging to deploy parameterized models on mobile or terminal devices, finding effective and efficient sentence embedding models is necessary.
Non-parameterized sentence embedding methods rely on high-quality word embeddings. The simplest idea is to average individual word embeddings, which already offers a tough-to-beat baseline. Along this line, several weighted averaging methods have been proposed, including tf-idf, SIF , and GEM . Concatenating vector representations from different resources yields another family of methods; examples include SCDV  and p-means . To better capture sequential information, DCT  and EigenSent  were proposed from a signal processing perspective.
Here, we propose a novel non-parameterized sentence embedding method based on semantic subspace analysis. It is called semantic subspace sentence embedding (S3E) (see Fig. 1). The S3E method is motivated by the following observation. Semantically similar words tend to form semantic groups in a high-dimensional embedding space. Thus, we can embed a sentence by analyzing semantic subspaces of its constituent words. Specifically, we use the intra- and inter-group descriptors to represent words in the same semantic group and characterize interactions between multiple semantic groups, respectively.
This work has three main contributions.
The proposed S3E method contains three steps: 1) semantic group construction, 2) intra-group descriptor and 3) inter-group descriptor. The algorithms inside each step are flexible and, as a result, previous work can be easily incorporated.
To the best of our knowledge, this is the first work that leverages correlations between semantic groups to provide a sentence descriptor. Previous work using the covariance descriptor  yields super-high embedding dimension (e.g. 45K dimensions). In contrast, the S3E method can choose the embedding dimension flexibly.
The effectiveness of the proposed S3E method in textual similarity and supervised tasks is shown experimentally. Its performance is as competitive as that of very complicated parameterized models.
II Related Previous Work
Vector of Locally Aggregated Descriptors (VLAD) is a famous algorithm in the image retrieval field. Like the bag-of-words method, VLAD trains a codebook based on clustering techniques and concatenates the features within each cluster as the final representation. A recent work called VLAWE (vector of locally-aggregated word embeddings)  introduced this idea into document representation. However, the VLAWE method suffers from a high-dimensionality problem, which is not favored by machine learning models. In this work, a novel clustering method is proposed by taking word frequency into consideration. At the same time, a covariance matrix is used to tackle the dimensionality explosion problem of the VLAWE method.
Recently, a novel document distance metric called Word Mover's Distance (WMD)  was proposed and achieved good performance in classification tasks. Based on the fact that semantically similar words have close vector representations, the distance between two sentences is modeled as the minimal 'travel' cost of moving the embedded words from one sentence to another. WMD focuses on modeling the distance between sentences in the shared word embedding space. It is natural to consider computing the sentence representation directly from the word embedding space with semantic distance measures.
A few works have tried to obtain sentence/document representations based on Word Mover's Distance. D2KE (distances to kernels and embeddings) and WME (word mover's embedding) convert the distance measure into positive definite kernels and have better theoretical guarantees. However, both methods assume that Word Mover's Distance is a good standard for sentence representation. In our work, we borrow the 'travel' concept of embedded words from WMD and use a covariance matrix to model the interaction between semantic concepts in a discrete way.
III Proposed S3E Method
As illustrated in Fig. 1, the S3E method contains three steps: 1) constructing semantic groups based on word vectors; 2) using the intra-group descriptor to find the subspace representation; and 3) using correlations between semantic groups to yield the covariance descriptor. These are detailed below.
Semantic Group Construction. Given a word w_i in the vocabulary, its uni-gram probability and its vector are denoted by p(w_i) and v_{w_i}, respectively. We assign weights to words based on p(w_i):

weight(w_i) = ε / (ε + p(w_i)),

where ε is a small pre-selected parameter, which is added to avoid the explosion of the weight when p(w_i) is too small. Clearly, 0 < weight(w_i) < 1, and rarer words receive larger weights. Words are clustered into K groups using the K-means++ algorithm , and the weights are incorporated in the clustering process. This is needed since some words of higher frequency (e.g. 'a', 'and', 'the') are less discriminative by nature. They should be assigned lower weights in the semantic group construction process.
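As an illustration, the weighting and the weighted clustering steps can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: the function names, the Lloyd-style refinement, and the default ε are our own choices; only the weighting formula and the K-means++ seeding follow the text.

```python
import numpy as np

def word_weights(probs, eps=1e-3):
    """weight(w) = eps / (eps + p(w)): frequent words get small weights."""
    probs = np.asarray(probs, dtype=float)
    return eps / (eps + probs)

def _kmeanspp_init(vectors, weights, k, rng):
    """K-means++ seeding, with word weights folded into the sampling probabilities."""
    n = len(vectors)
    centroids = [vectors[rng.choice(n, p=weights / weights.sum())]]
    for _ in range(k - 1):
        # squared distance from each word to its nearest chosen centroid
        d2 = ((vectors[:, None, :] - np.asarray(centroids)[None, :, :]) ** 2).sum(-1).min(axis=1)
        p = weights * d2
        centroids.append(vectors[rng.choice(n, p=p / p.sum())])
    return np.array(centroids)

def weighted_kmeans(vectors, weights, k, iters=20, seed=0):
    """Cluster word vectors into k semantic groups; centroids are weighted means."""
    rng = np.random.default_rng(seed)
    vectors = np.asarray(vectors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    centroids = _kmeanspp_init(vectors, weights, k, rng)
    for _ in range(iters):
        # assign every word to its nearest centroid
        d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = labels == j
            if members.any():
                w = weights[members][:, None]
                centroids[j] = (w * vectors[members]).sum(axis=0) / w.sum()
    return labels, centroids
```

In this sketch, down-weighting frequent words both in the seeding and in the centroid updates keeps stop-word-like vectors from dominating group centers.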
Intra-group Descriptor. After constructing semantic groups, we find the centroid of each group by computing the weighted average of the word vectors in that group. That is, for the j-th group G_j, we learn its representation g_j as

g_j = (1 / |G_j|) Σ_{w_i ∈ G_j} weight(w_i) · v_{w_i},

where |G_j| is the number of words in group G_j. For a sentence S, we allocate the words in S to their semantic groups. To obtain the intra-group descriptor, we compute the cumulative residual between the word vectors and their centroid g_j in the same group. Then, the representation of sentence S in the j-th semantic group can be written as

v_j(S) = Σ_{w_i ∈ S ∩ G_j} weight(w_i) · (v_{w_i} − g_j).

If there are K semantic groups in total, we can represent sentence S with the following K × d matrix:

Φ(S) = [v_1(S)^T, v_2(S)^T, …, v_K(S)^T]^T,

where d is the dimension of the word embedding.
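For concreteness, the residual computation can be sketched as follows. This is a minimal NumPy sketch under our own naming; `sent_labels` holds the precomputed group index of each word in the sentence.

```python
import numpy as np

def intra_group_descriptor(sent_vecs, sent_weights, sent_labels, centroids):
    """Build the K x d matrix Phi(S): row j is the weighted sum of residuals
    (v_w - g_j) over the sentence's words in group j (zero if the sentence
    has no word in that group)."""
    k, d = centroids.shape
    phi = np.zeros((k, d))
    for v, w, j in zip(sent_vecs, sent_weights, sent_labels):
        phi[j] += w * (v - centroids[j])
    return phi
```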
Inter-group Descriptor. After obtaining the intra-group descriptors, we measure interactions between semantic groups with covariance coefficients. We can interpret the rows of Φ(S) in (4) as K observations of d-dimensional random variables, and use μ_i to denote the mean of the i-th row of Φ(S). Then, the inter-group covariance can be computed as

C_{i,j} = cov(v_i(S), v_j(S)) = (1/d) Σ_{k=1}^{d} (v_i(S)[k] − μ_i) (v_j(S)[k] − μ_j),

where C_{i,j} is the covariance between groups i and j. Thus, the covariance matrix can be written as

C = [C_{i,j}]_{1≤i,j≤K} ∈ R^{K×K}.

Since the covariance matrix C is symmetric, we can vectorize its upper triangular part and use it as the representation of sentence S, with the off-diagonal entries multiplied by √2 so that the Frobenius norm of the original matrix equals the Euclidean norm of the vectorized matrix. This process produces an embedding of dimension K(K+1)/2. Then, the embedding of sentence S becomes

v(S) = vect(C) = (C_{1,1}, √2 C_{1,2}, …, √2 C_{1,K}, C_{2,2}, …, C_{K,K})^T.

Finally, the sentence embedding in Eq. (8) is of dimension K(K+1)/2.
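Under the same assumptions as above, the covariance descriptor and its norm-preserving vectorization can be sketched as follows (the function name is ours; `np.cov` with `bias=True` divides by d, matching the definition of the covariance above):

```python
import numpy as np

def inter_group_descriptor(phi):
    """Vectorize the upper-triangular half of the K x K covariance of the rows
    of Phi(S); off-diagonal entries are scaled by sqrt(2) so the Euclidean
    norm of the result equals the Frobenius norm of C."""
    c = np.cov(phi, bias=True)        # rows of phi are the K group descriptors
    rows, cols = np.triu_indices(c.shape[0])
    scale = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return c[rows, cols] * scale      # dimension K(K+1)/2
```

The √2 scaling works because each off-diagonal entry appears twice in the symmetric matrix: ||v(S)||² = Σ_i C_ii² + 2 Σ_{i<j} C_ij² = ||C||_F².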
The semantic group construction process can be pre-computed for efficiency. The runtime complexity of S3E is O(Nd + K²d), where N is the length of a sentence, K is the number of semantic groups, and d is the dimension of the word embedding in use: the intra-group descriptor costs O(Nd) and the inter-group covariance costs O(K²d). The algorithm is thus linear in the sentence length. The S3E method is much faster than all parameterized models and most non-parameterized methods such as , where a singular value decomposition is needed during inference. The runtime comparison is discussed in Sec. IV-C.
We evaluate our method on two sentence embedding evaluation tasks to verify the generalizability of S3E. Semantic textual similarity tasks are used to test the clustering and retrieval property of our sentence embedding. Discriminative power of sentence embedding is evaluated by supervised tasks.
For performance benchmarking, we compare S3E with a series of other methods including parameterized and non-parameterized ones.
Avg. GloVe embedding;
SIF : Derived from an improved random-walk model. It consists of two parts: weighted averaging of word vectors and removal of the first principal component;
p-means : Concatenates different word embedding models and different power ratios;
DCT : Introduces the discrete cosine transform into sentence sequential modeling;
VLAWE : Introduces VLAD (vector of locally aggregated descriptors) into the sentence embedding field;
Skip-thought : Extends the word2vec unsupervised training objective from the word level to the sentence level;
InferSent : A bi-directional LSTM encoder trained on high-quality sentence inference data;
Sent2Vec : Learns n-gram word representations and uses their average as the sentence representation;
FastSent : An improved skip-thought model for fast training on large corpora; it simplifies the recurrent neural network into a bag-of-words representation;
ELMo : Deep contextualized word embeddings; the sentence embedding is computed by averaging all LSTM outputs;
Avg. BERT embedding : Averages the last-layer word representations of the BERT model;
SBERT-WK : A fusion method that combines representations across layers of deep contextualized word models.
IV-A Textual Similarity Tasks
We evaluate the performance of the S3E method on the SemEval semantic textual similarity tasks from 2012 to 2016, the STS Benchmark, and the SICK-Relatedness dataset. The goal is to predict the similarity between sentence pairs. The sentence pairs carry labels between 0 and 5, which indicate their semantic relatedness. The Pearson correlation coefficient between predicted and human-labeled similarities is reported as the performance measure. For the STS 2012 to 2016 datasets, the similarity prediction is computed using cosine similarity. The STS Benchmark and SICK-R datasets are under a supervised setting, where the goal is to predict the probability distribution of relatedness scores. We adopt the same setting as  for these two datasets and also report the Pearson correlation coefficient.
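The unsupervised STS protocol described above reduces to two small functions, sketched here with our own names (`pred` and `gold` stand for predicted and human-labeled similarity scores):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two sentence embeddings."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson(pred, gold):
    """Pearson correlation between predicted and human-labeled similarities."""
    return float(np.corrcoef(pred, gold)[0, 1])
```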
The S3E method can be applied to any static word embedding method. Here, we report results with three of them, namely, GloVe , FastText , and L.F.P.
Experimental results on textual similarity tasks are shown in Table I, where both non-parameterized and parameterized models are compared. The recent parameterized method SBERT-WK provides the best performance and outperforms other methods by a large margin. The S3E method with the L.F.P word embedding is the second best method on average among both parameterized and non-parameterized methods. As mentioned, our work is compatible with any weight-based method; with better weighting schemes, the S3E method has the potential to perform even better. As for the choice of word embedding, L.F.P performs better than FastText, and FastText is better than GloVe in Table I, which is consistent with previous findings . Therefore, choosing a more powerful word embedding model can help boost performance.
IV-B Supervised Tasks
MR: Sentiment classification on movie reviews.
CR: Sentiment classification on product reviews.
SUBJ: Subjectivity/objective classification.
MPQA: Opinion polarity classification.
SST2: Stanford sentiment treebank for sentiment classification.
TREC: Question type classification.
MRPC: Paraphrase identification.
SICK-Entailment: Entailment classification on SICK dataset.
The details of each dataset are shown in Table III. For all tasks, we train a simple MLP classifier that contains one hidden layer of 50 neurons, the same setting as in , and only tune the regularization term on the validation sets. The hyper-parameter setting of S3E is kept the same as in the textual similarity tasks. The batch size is set to 64 and the Adam optimizer is employed. For the MR, CR, SUBJ, MPQA and MRPC datasets, we use nested 10-fold cross validation. For TREC and SICK-E, we use cross validation. For SST2, the standard validation split is utilized. All experiments are trained for 4 epochs.
Experimental results on supervised tasks are shown in Table II. The S3E method outperforms all non-parameterized models, including DCT , VLAWE  and p-means . The S3E method adopts a smaller word embedding dimension than p-means and VLAWE and is also flexible in choosing the embedding dimension. Like other weight-based methods, the S3E method does not consider the order of words in a sentence but splits a sentence into different semantic groups. The S3E method performs the best on the paraphrase identification (MRPC) dataset among all non-parameterized and parameterized methods except SBERT-WK. This may be attributed to the fact that word order is less important for paraphrases, since words are often swapped. In this context, the correlation between semantic components plays an important role in determining the similarity between a pair of sentences.
Compared with parameterized methods, S3E also outperforms a series of them, including Skip-thought, FastSent and Sent2Vec. In general, parameterized methods perform better than non-parameterized ones on downstream tasks. The best performance is achieved by the recently proposed SBERT-WK method, which incorporates a pre-trained deep contextualized word model. However, even though good performance is witnessed, deep models require much more computational resources, which makes them hard to integrate into mobile or terminal devices. Therefore, the S3E method has its own strength in its efficiency and good performance.
IV-C Inference Speed
We compare the inference speed of S3E with other models, including non-parameterized and parameterized ones. For a fair comparison, the batch size is set to 1 and all sentences from the STSB dataset (17,256 sentences) are used for evaluation. All benchmarks are run on the CPU and GPU specified in the footnotes.
Compared with other methods, S3E is very efficient in inference speed, which is important for sentence embedding. Without the acceleration of a powerful GPU, deep contextualized models take about 1 hour to complete a comparison task of 10,000 sentence pairs, while S3E only requires 13 seconds.
IV-D Sensitivity to Cluster Numbers
We test the sensitivity of S3E to the setting of the cluster number. The cluster number is varied from 5 to 60 with an interval of 5. Results for the STS-Benchmark, SICK-Entailment and SST2 datasets are reported. As shown in Figure 2, the performance of S3E is quite robust to the choice of the cluster number; it varies by less than 1% in accuracy or correlation.
Averaging word embeddings provides a simple baseline for sentence embedding, and a weighted sum of word embeddings intuitively offers an improvement. Some methods try to improve upon averaging by concatenating word embeddings in several forms, such as p-means  and VLAWE. However, concatenating word embeddings usually encounters the dimension explosion problem, so the number of concatenated components cannot be too large.
Our S3E method is compatible with existing models, and its performance can be further improved by replacing each module with a stronger one. First, we use word weights in constructing semantically similar groups and can incorporate different weighting schemes into our model, such as SIF  and GEM . Second, different clustering schemes, such as the Gaussian mixture model and dictionary learning, can be utilized to construct semantically similar groups [13, 31]. Finally, the intra-group descriptor can be replaced by methods like VLAWE  and p-means . In the inter-group descriptor, the correlation between semantic groups can also be modeled in a non-linear way by applying different kernel functions. Another future direction is to add sequential information to the current S3E method.
A sentence embedding method based on semantic subspace analysis was proposed. The proposed S3E method has three building modules: semantic group construction, intra-group description and inter-group description. The S3E method can be integrated with many other existing models. Experimental results show that the proposed S3E method offers state-of-the-art performance among non-parameterized models and stands out for its effectiveness at low computational complexity.
- Our code is available at github.
- concatenated LexVec, FastText and PSL
- Intel i7-5930 of 3.50GHz with 12 cores
- Nvidia GeForce GTX TITAN X
- G. Lample, M. Ott, A. Conneau, L. Denoyer, and M. Ranzato, “Phrase-based & neural unsupervised machine translation,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.
- T. Dozat and C. D. Manning, “Simpler but more accurate semantic dependency parsing,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 484–490. [Online]. Available: https://www.aclweb.org/anthology/P18-2077
- D. Shen, G. Wang, W. Wang, M. Renqiang Min, Q. Su, Y. Zhang, C. Li, R. Henao, and L. Carin, “Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms,” in ACL, 2018.
- T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
- R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, “Skip-thought vectors,” in Advances in neural information processing systems, 2015, pp. 3294–3302.
- A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, “Supervised learning of universal sentence representations from natural language inference data,” arXiv preprint arXiv:1705.02364, 2017.
- D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, B. Strope, and R. Kurzweil, “Universal sentence encoder for English,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Brussels, Belgium: Association for Computational Linguistics, Nov. 2018, pp. 169–174. [Online]. Available: https://www.aclweb.org/anthology/D18-2029
- S. Subramanian, A. Trischler, Y. Bengio, and C. J. Pal, “Learning general purpose distributed sentence representations via large scale multi-task learning,” ICLR, 2018.
- N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019. [Online]. Available: http://arxiv.org/abs/1908.10084
- B. Wang and C.-C. J. Kuo, “SBERT-WK: A sentence embedding method by dissecting bert-based word models,” arXiv preprint arXiv:2002.06652, 2020.
- S. Arora, Y. Liang, and T. Ma, “A simple but tough-to-beat baseline for sentence embeddings,” 2017.
- Z. Yang, C. Zhu, and W. Chen, “Parameter-free sentence embedding via orthogonal basis,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 638–648.
- D. Mekala, V. Gupta, B. Paranjape, and H. Karnick, “Scdv: Sparse composite document vectors using soft clustering over distributional representations,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 659–669.
- A. Rücklé, S. Eger, M. Peyrard, and I. Gurevych, “Concatenated power mean word embeddings as universal cross-lingual sentence representations,” arXiv preprint arXiv:1803.01400, 2018.
- N. Almarwani, H. Aldarmaki, and M. Diab, “Efficient sentence embedding using discrete cosine transform,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 3663–3669. [Online]. Available: https://www.aclweb.org/anthology/D19-1380
- S. Kayal and G. Tsatsaronis, “Eigensent: Spectral sentence embeddings using higher-order dynamic mode decomposition,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4536–4546.
- M. Torki, “A document descriptor using covariance of word vectors,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 527–532. [Online]. Available: https://www.aclweb.org/anthology/P18-2084
- R. T. Ionescu and A. Butnaru, “Vector of locally-aggregated word embeddings (vlawe): A novel document-level representation,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 363–369.
- M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger, “From word embeddings to document distances,” in International conference on machine learning, 2015, pp. 957–966.
- D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” in Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035.
- M. Pagliardini, P. Gupta, and M. Jaggi, “Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features,” in NAACL 2018 - Conference of the North American Chapter of the Association for Computational Linguistics, 2018.
- F. Hill, K. Cho, and A. Korhonen, “Learning distributed representations of sentences from unlabelled data,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California: Association for Computational Linguistics, Jun. 2016, pp. 1367–1377. [Online]. Available: https://www.aclweb.org/anthology/N16-1162
- M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proc. of NAACL, 2018.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- K. S. Tai, R. Socher, and C. D. Manning, “Improved semantic representations from tree-structured long short-term memory networks,” arXiv preprint arXiv:1503.00075, 2015.
- J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.
- P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
- B. Wang, F. Chen, A. Wang, and C.-C. J. Kuo, “Post-processing of word representations via variance normalization and dynamic embedding,” in 2019 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2019, pp. 718–723.
- B. Wang, A. Wang, F. Chen, Y. Wang, and C.-C. J. Kuo, “Evaluating word embedding models: methods and experimental results,” APSIPA Transactions on Signal and Information Processing, vol. 8, 2019.
- A. Conneau and D. Kiela, “Senteval: An evaluation toolkit for universal sentence representations,” arXiv preprint arXiv:1803.05449, 2018.
- V. Gupta, A. Saw, P. Nokhiz, P. Netrapalli, P. Rai, and P. Talukdar, “P-sif: Document embeddings using partition averaging.”