Efficient Sentence Embedding via Semantic Subspace Analysis


A novel sentence embedding method built upon semantic subspace analysis, called semantic subspace sentence embedding (S3E), is proposed in this work. Given that word embeddings capture semantic relationships and that semantically similar words tend to form semantic groups in a high-dimensional embedding space, we develop a sentence representation scheme by analyzing the semantic subspaces of a sentence's constituent words. Specifically, we construct a sentence model from two aspects. First, we represent words that lie in the same semantic group using the intra-group descriptor. Second, we characterize the interaction between multiple semantic groups with the inter-group descriptor. The proposed S3E method is evaluated on both textual similarity tasks and supervised tasks. Experimental results show that it offers comparable or better performance than the state-of-the-art. The complexity of our S3E method is also much lower than that of parameterized models.

I Introduction

Word embedding techniques are widely used in natural language processing (NLP) tasks. For example, they improve downstream tasks such as machine translation [1], syntactic parsing [2], and text classification [3]. Yet, many NLP applications operate at the level of sentences or longer pieces of text. Although sentence embedding has received a lot of attention recently, encoding a sentence into a fixed-length vector that captures different linguistic properties remains a challenge.

Universal sentence embedding aims to compute a sentence representation that can be applied to any task. Methods fall into two types: i) parameterized models and ii) non-parameterized models. Parameterized models are mainly based on deep neural networks and require training to update their parameters. Inspired by the well-known word2vec model [4], the skip-thought model [5] adopts an encoder-decoder model to predict context sentences in an unsupervised manner. The InferSent model [6] is trained on high-quality supervised data, namely the Natural Language Inference data. It shows that a supervised training objective can outperform unsupervised ones. USE [7] combines supervised and unsupervised objectives and employs the transformer architecture. The STN model [8] leverages a multi-task framework for sentence embedding to provide better generalizability. With the recent success of deep contextualized word models, SBERT [9] and SBERT-WK [10] were proposed to leverage the power of self-supervised learning from large unlabeled corpora. Different parameterized models attempt to capture semantic and syntactic meanings from different aspects. Even though they perform better than non-parameterized models, parameterized ones are more complex and computationally expensive. Since it is challenging to deploy parameterized models on mobile or terminal devices, finding effective and efficient sentence embedding models is necessary.

Fig. 1: Overview of the proposed S3E method.

Non-parameterized sentence embedding methods rely on high-quality word embeddings. The simplest idea is to average individual word embeddings, which already offers a tough-to-beat baseline. Along this line, several weighted averaging methods have been proposed, including tf-idf, SIF [11], and GEM [12]. Concatenating vector representations from different resources yields another family of methods. Examples include SCDV [13] and p-mean [14]. To better capture sequential information, DCT [15] and EigenSent [16] were proposed from a signal processing perspective.

Here, we propose a novel non-parameterized sentence embedding method based on semantic subspace analysis. It is called semantic subspace sentence embedding (S3E) (see Fig. 1). The S3E method is motivated by the following observation. Semantically similar words tend to form semantic groups in a high-dimensional embedding space. Thus, we can embed a sentence by analyzing semantic subspaces of its constituent words. Specifically, we use the intra- and inter-group descriptors to represent words in the same semantic group and characterize interactions between multiple semantic groups, respectively.

This work has three main contributions.

  1. The proposed S3E method contains three steps: 1) semantic group construction, 2) intra-group descriptor and 3) inter-group descriptor. The algorithms inside each step are flexible and, as a result, previous work can be easily incorporated.

  2. To the best of our knowledge, this is the first work that leverages correlations between semantic groups to provide a sentence descriptor. Previous work using the covariance descriptor [17] yields a very high embedding dimension (e.g., 45K dimensions). In contrast, the S3E method can choose the embedding dimension flexibly.

  3. The effectiveness of the proposed S3E method in textual similarity and supervised tasks is shown experimentally. Its performance is competitive with that of much more complex parameterized models (see footnote 1).

II Related Previous Work

Vector of Locally Aggregated Descriptors (VLAD) is a well-known algorithm in the image retrieval field. As in the bag-of-words method, VLAD trains a codebook with clustering techniques and concatenates the features within each cluster as the final representation. A recent work called VLAWE (vector of locally-aggregated word embeddings) [18] introduced this idea into document representation. However, the VLAWE method suffers from a high-dimensionality problem, which is not favored by machine learning models. In this work, a novel clustering method is proposed by taking word frequency into consideration. At the same time, the covariance matrix is used to tackle the dimensionality explosion problem of the VLAWE method.

Recently, a novel document distance metric called Word Mover's Distance (WMD) [19] was proposed and achieved good performance in classification tasks. Based on the fact that semantically similar words have close vector representations, the distance between two sentences is modeled as the minimal 'travel' cost of moving the embedded words from one sentence to the other. WMD targets modeling the distance between sentences in the shared word embedding space. It is natural to consider the possibility of computing the sentence representation directly from the word embedding space using semantic distance measures.

A few works try to obtain sentence/document representations based on Word Mover's Distance. D2KE (distances to kernels and embeddings) and WME (word mover's embedding) convert the distance measure into positive definite kernels and have better theoretical guarantees. However, both methods are proposed under the assumption that Word Mover's Distance is a good standard for sentence representation. In our work, we borrow the 'travel' concept of embedded words from WMD and use the covariance matrix to model the interaction between semantic concepts in a discrete way.

III Proposed S3E Method

As illustrated in Fig. 1, the S3E method contains three steps: 1) constructing semantic groups based on word vectors; 2) using the intra-group descriptor to find the subspace representation; and 3) using correlations between semantic groups to yield the covariance descriptor (the inter-group descriptor). These steps are detailed below.

Semantic Group Construction. Given word $w$ in the vocabulary, $V$, its uni-gram probability and vector are represented by $p(w)$ and $v_w \in \mathbb{R}^d$, respectively. We assign weights to words based on $p(w)$:

$$\mathrm{weight}(w) = \frac{\epsilon}{\epsilon + p(w)}, \tag{1}$$

where $\epsilon$ is a small pre-selected parameter, which is added to avoid the explosion of the weight when $p(w)$ is too small. Clearly, $0 < \mathrm{weight}(w) < 1$. Words are clustered into $K$ groups using the K-means++ algorithm [20], and weights are incorporated in the clustering process. This is needed since some words of higher frequencies (e.g., 'a', 'and', 'the') are less discriminative by nature. They should be assigned lower weights in the semantic group construction process.
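
The group construction step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the weight takes the form eps / (eps + p(w)), uses an illustrative default eps = 1e-3, and replaces a full K-means++ library routine with a small hand-written weighted variant. The names `word_weights` and `weighted_kmeans` are hypothetical.

```python
import numpy as np

def word_weights(unigram_prob, eps=1e-3):
    """Down-weight frequent words: weight(w) = eps / (eps + p(w)) (assumed form)."""
    p = np.asarray(unigram_prob, dtype=float)
    return eps / (eps + p)

def weighted_kmeans(vectors, weights, k, n_iter=20, seed=0):
    """Weighted Lloyd iterations with k-means++-style seeding (toy version)."""
    rng = np.random.default_rng(seed)
    n = vectors.shape[0]
    # Seeding: first center drawn by weight, later ones by weight * squared distance.
    centers = [vectors[rng.choice(n, p=weights / weights.sum())]]
    for _ in range(1, k):
        d2 = np.min([((vectors - c) ** 2).sum(axis=1) for c in centers], axis=0)
        prob = weights * d2
        prob = prob / prob.sum() if prob.sum() > 0 else np.full(n, 1.0 / n)
        centers.append(vectors[rng.choice(n, p=prob)])
    centers = np.stack(centers).astype(float)

    def assign(c):
        # Squared Euclidean distance from every word vector to every center.
        dist = ((vectors[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return dist.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        for j in range(k):
            mask = labels == j
            if mask.any():  # keep the old center if a cluster becomes empty
                centers[j] = np.average(vectors[mask], axis=0, weights=weights[mask])
    return assign(centers), centers

# Toy vocabulary: four 2-d "word vectors" and made-up unigram probabilities.
vecs = np.array([[0.0, 0.0], [0.0, 0.0], [10.0, 10.0], [10.0, 10.0]])
w = word_weights([0.05, 0.001, 0.0001, 0.02])
labels, centers = weighted_kmeans(vecs, w, k=2)
```

Because the clustering depends only on the vocabulary, it can be run once offline and the resulting word-to-group assignment stored for inference.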

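Intra-group Descriptor. After constructing semantic groups, we find the centroid of each group by computing the weighted average of word vectors in that group. That is, for the $i$-th group, $G_i$, we learn its representation $g_i$ by

$$g_i = \frac{1}{|G_i|} \sum_{w \in G_i} \mathrm{weight}(w)\, v_w, \tag{2}$$

where $|G_i|$ is the number of words in group $G_i$. For sentence $S$, we allocate words in $S$ to their semantic groups. To obtain the intra-group descriptor, we compute the cumulative residual between word vectors and their centroid ($g_i$) in the same group. Then, the representation of sentence $S$ in the $i$-th semantic group can be written as

$$v_i = \sum_{w \in S \cap G_i} \mathrm{weight}(w)\,(v_w - g_i). \tag{3}$$

If there are $K$ semantic groups in total, we can represent sentence $S$ with the following matrix:

$$\Phi(S) = \begin{bmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_K^T \end{bmatrix} \in \mathbb{R}^{K \times d}, \tag{4}$$

where $d$ is the dimension of word embedding.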
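
A minimal sketch of the intra-group descriptor follows. It assumes that each centroid is the weighted sum of the group's word vectors divided by the group size, and that the sentence representation in a group is the weighted cumulative residual of its words to that centroid. The function names and the toy data are hypothetical.

```python
import numpy as np

def group_centroids(vectors, weights, labels, k):
    """Centroid g_i: weighted sum of the vectors in group i divided by |G_i|."""
    g = np.zeros((k, vectors.shape[1]))
    for i in range(k):
        mask = labels == i
        if mask.any():
            g[i] = (weights[mask, None] * vectors[mask]).sum(axis=0) / mask.sum()
    return g

def intra_group_descriptor(word_ids, vectors, weights, labels, centroids):
    """Stack the per-group cumulative weighted residuals into a K x d matrix."""
    phi = np.zeros_like(centroids)
    for w in word_ids:
        i = labels[w]  # semantic group of word w
        phi[i] += weights[w] * (vectors[w] - centroids[i])
    return phi

# Toy vocabulary of three words in two groups; the sentence uses words 0 and 2.
vectors = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
weights = np.array([1.0, 1.0, 0.5])
labels = np.array([0, 0, 1])
g = group_centroids(vectors, weights, labels, k=2)
phi = intra_group_descriptor([0, 2], vectors, weights, labels, g)
```

Groups that contain no word of the sentence keep an all-zero row, so the matrix always has one row per semantic group regardless of sentence length.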

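Inter-group Descriptor. After obtaining the intra-group descriptor, we measure interactions between semantic groups with covariance coefficients. We can interpret $\Phi(S)$ in Eq. (4) as $d$ observations of $K$-dimensional random variables, and use $\mu \in \mathbb{R}^{K \times 1}$ to denote the mean of each row in $\Phi$. Then, the inter-group covariance matrix can be computed as

$$C = [C_{i,j}]_{K \times K} = \frac{1}{d} (\Phi - \mu \mathbf{1}^T)(\Phi - \mu \mathbf{1}^T)^T \in \mathbb{R}^{K \times K}, \tag{5}$$

where $\mathbf{1}$ is the all-ones vector of length $d$ and

$$\sigma_{i,j} = C_{i,j} = \frac{(v_i - \mu_i \mathbf{1})^T (v_j - \mu_j \mathbf{1})}{d} \tag{6}$$

is the covariance between groups $i$ and $j$. Thus, matrix $C$ can be written as

$$C = \begin{bmatrix} \sigma_1^2 & \sigma_{1,2} & \cdots & \sigma_{1,K} \\ \sigma_{1,2} & \sigma_2^2 & \cdots & \sigma_{2,K} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1,K} & \sigma_{2,K} & \cdots & \sigma_K^2 \end{bmatrix}. \tag{7}$$

Since the covariance matrix $C$ is symmetric, we can vectorize its upper triangular part and use it as the representation for sentence $S$. The off-diagonal entries are scaled so that the Frobenius norm of the original matrix equals the Euclidean norm of the vectorized matrix. This process produces an embedding of dimension $K(K+1)/2$. Then, the embedding of sentence $S$ becomes

$$v(S) = \mathrm{vect}(C) = \begin{cases} \sqrt{2}\,\sigma_{i,j}, & \text{if } i < j, \\ \sigma_{i,i}, & \text{if } i = j. \end{cases} \tag{8}$$

Finally, the sentence embedding in Eq. (8) is $L_2$-normalized.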
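
The inter-group step can be sketched as below. This is an illustration under stated assumptions, not the reference implementation: the covariance is taken across the rows of the K x d intra-group matrix with a 1/d factor, off-diagonal entries are scaled by sqrt(2) so that the vector's Euclidean norm equals the matrix's Frobenius norm, and the result is L2-normalized. The name `inter_group_descriptor` is hypothetical.

```python
import numpy as np

def inter_group_descriptor(phi):
    """Vectorize the K x K covariance of the rows of phi into K(K+1)/2 values."""
    k, d = phi.shape
    centered = phi - phi.mean(axis=1, keepdims=True)  # subtract each row mean
    cov = centered @ centered.T / d                   # K x K covariance matrix
    rows, cols = np.triu_indices(k)                   # upper triangle incl. diagonal
    # Off-diagonal entries appear twice in the full matrix, hence the sqrt(2).
    scale = np.where(rows == cols, 1.0, np.sqrt(2.0))
    vec = cov[rows, cols] * scale
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Toy intra-group matrix with K = 2 groups and d = 2 dimensions.
phi = np.array([[1.0, -1.0], [2.0, 0.0]])
embedding = inter_group_descriptor(phi)
```

For this toy input both centered rows are (1, -1), so every covariance entry is 1 and the unnormalized vector is (1, sqrt(2), 1), whose norm 2 matches the Frobenius norm of the 2 x 2 all-ones matrix.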


The semantic group construction process can be pre-computed for efficiency. Our runtime complexity is $O(dN + Kd^2)$, where $N$ is the length of a sentence, $K$ is the number of semantic groups, and $d$ is the dimension of the word embedding in use. Our algorithm is linear with respect to the sentence length. The S3E method is much faster than all parameterized models and most non-parameterized methods, such as GEM [12], where singular value decomposition is needed during inference. The run-time comparison is discussed in Sec. IV-C.

IV Experiments

We evaluate our method on two sentence embedding evaluation tasks to verify the generalizability of S3E. Semantic textual similarity tasks are used to test the clustering and retrieval property of our sentence embedding. Discriminative power of sentence embedding is evaluated by supervised tasks.

For performance benchmarking, we compare S3E with a series of other methods including parameterized and non-parameterized ones.

  1. Non-parameterized Models

    1. Avg. GloVe embedding;

    2. SIF [11]: Derived from an improved random-walk model. It consists of two parts: weighted averaging of word vectors and first principal component removal;

    3. p-means [14]: Concatenates different word embedding models and different power means;

    4. DCT [15]: Introduces the discrete cosine transform into sentence sequential modeling;

    5. VLAWE [18]: Introduces VLAD (vector of locally aggregated descriptors) into the sentence embedding field;

  2. Parameterized Models

    1. Skip-thought [5]: Extend word2vec unsupervised training objectives from word level into sentence level;

    2. InferSent [6]: Bi-directional LSTM encoder trained on high quality sentence inference data.

    3. Sent2Vec [21]: Learn n-gram word representation and use average as the sentence representation.

    4. FastSent [22]: An improved Skip-thought model for fast training on large corpus. Simplify the recurrent neural network as bag-of-words representation.

    5. ELMo [23]: Deep contextualized word embedding. Sentence embedding is computed by averaging all LSTM outputs.

    6. Avg. BERT embedding [24]: Average the last layer word representation of BERT model.

    7. SBERT-WK [10]: A fusion method to combine representations across layers of deep contextualized word models.

IV-A Textual Similarity Tasks

| Model | Dim | STS12 | STS13 | STS14 | STS15 | STS16 | STSB | SICK-R | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Parameterized models | | | | | | | | | |
| skip-thought [5] | 4800 | 30.8 | 24.8 | 31.4 | 31.0 | - | - | 86.0 | 40.80 |
| InferSent [6] | 4096 | 58.6 | 51.5 | 67.8 | 68.3 | 70.4 | 74.7 | 88.3 | 68.51 |
| ELMo [23] | 3072 | 55.0 | 51.0 | 63.0 | 69.0 | 64.0 | 65.0 | 84.0 | 64.43 |
| Avg. BERT [24] | 768 | 46.9 | 52.8 | 57.2 | 63.5 | 64.5 | 65.2 | 80.5 | 61.51 |
| SBERT-WK [10] | 768 | 70.2 | 68.1 | 75.5 | 76.9 | 74.5 | 80.0 | 87.4 | 76.09 |
| Non-parameterized models | | | | | | | | | |
| Avg. GloVe | 300 | 52.3 | 50.5 | 55.2 | 56.7 | 54.9 | 65.8 | 80.0 | 59.34 |
| SIF [11] | 300 | 56.2 | 56.6 | 68.5 | 71.7 | - | 72.0 | 86.0 | 68.50 |
| p-mean [14] | 3600 | 54.0 | 52.0 | 63.0 | 66.0 | 67.0 | 72.0 | 86.0 | 65.71 |
| S3E (GloVe) | 355-1575 | 59.5 | 62.4 | 68.5 | 72.3 | 70.9 | 75.5 | 82.7 | 69.59 |
| S3E (FastText) | 355-1575 | 62.5 | 67.8 | 70.2 | 76.1 | 74.3 | 77.5 | 84.7 | 72.64 |
| S3E (L.F.P.) | 955-2175 | 61.0 | 69.3 | 73.2 | 76.1 | 74.4 | 78.6 | 84.7 | 73.90 |

TABLE I: Experimental results on textual similarity tasks in terms of the Pearson correlation coefficients (%), where the best results for parameterized and non-parameterized models are in bold, respectively.

We evaluate the performance of the S3E method on the SemEval semantic textual similarity tasks from 2012 to 2016, the STS Benchmark and the SICK-Relatedness dataset. The goal is to predict the similarity between sentence pairs. Each sentence pair is labeled with a score between 0 and 5 that indicates its semantic relatedness. The Pearson correlation coefficients between predicted and human-labeled similarities are reported as the performance measure. For the STS 2012 to 2016 datasets, the similarity prediction is computed using the cosine similarity. The STS Benchmark and SICK-R datasets are evaluated under a supervised setting, which aims to predict the probability distribution of relatedness scores. We adopt the same setting as [25] for these two datasets and also report the Pearson correlation coefficient.
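
The unsupervised part of this protocol (cosine similarity followed by Pearson correlation) can be sketched as below. The helper names and the toy scores are hypothetical, and the supervised setting used for STS Benchmark and SICK-R is not covered.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two sentence embeddings."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_percent(predicted, gold):
    """Pearson correlation between predicted and gold scores, in percent."""
    return 100.0 * float(np.corrcoef(predicted, gold)[0, 1])

# Toy data: three sentence pairs with made-up embeddings and 0-5 gold labels.
pairs = [([1.0, 0.0], [1.0, 0.1]), ([1.0, 0.0], [0.5, 0.5]), ([1.0, 0.0], [0.0, 1.0])]
gold = [4.8, 2.5, 0.2]
predicted = [cosine_similarity(a, b) for a, b in pairs]
score = pearson_percent(predicted, gold)
```

Pearson correlation is invariant to affine rescaling of the predictions, so the cosine values do not need to be mapped onto the 0 to 5 label range before scoring.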

The S3E method can be applied to any static word embedding method. Here, we report three of them, namely GloVe [26], FastText [27] and L.F.P. (see footnote 2). Word embeddings are normalized using [28]. Parameter $\epsilon$ in Eq. (1) is set to $10^{-3}$ for all experiments. The word frequency, $p(w)$, is estimated from the wiki dataset (see footnote 3). The number of semantic groups, $K$, is chosen from the set $\{10, 20, 30, 40, 50\}$ and the best performance is reported.

Experimental results on textual similarity tasks are shown in Table I, where both non-parameterized and parameterized models are compared. The recent parameterized method SBERT-WK provides the best performance and outperforms the other methods by a large margin. The S3E method using L.F.P. word embeddings is the second best on average among both parameterized and non-parameterized methods. As mentioned, our work is compatible with any weight-based method. With better weighting schemes, the S3E method has the potential to perform even better. As for the choice of word embedding, L.F.P. performs better than FastText, and FastText is better than GloVe in Table I, which is consistent with previous findings [29]. Therefore, choosing more powerful word embedding models can help boost performance.

IV-B Supervised Tasks

| Model | Dim | MR | CR | SUBJ | MPQA | SST2 | TREC | MRPC | SICK-E | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Parameterized models | | | | | | | | | | |
| skip-thought [5] | 4800 | 76.6 | 81.0 | 93.3 | 87.1 | 81.8 | 91.0 | 73.2 | 84.3 | 83.54 |
| FastSent [22] | 300 | 70.8 | 78.4 | 88.7 | 80.6 | - | 76.8 | 72.2 | - | 77.92 |
| InferSent [6] | 4096 | 79.3 | 85.5 | 92.3 | 90.0 | 83.2 | 87.6 | 75.5 | 85.1 | 84.81 |
| Sent2Vec [21] | 700 | 75.8 | 80.3 | 91.1 | 85.9 | - | 86.4 | 72.5 | - | 82.00 |
| USE [7] | 512 | 80.2 | 86.0 | 93.7 | 87.0 | 86.1 | 93.8 | 72.3 | 83.3 | 85.30 |
| ELMo [23] | 3072 | 80.9 | 84.0 | 94.6 | 91.0 | 86.7 | 93.6 | 72.9 | 82.4 | 85.76 |
| SBERT-WK [10] | 768 | 83.0 | 89.1 | 95.2 | 90.6 | 89.2 | 93.2 | 77.4 | 85.5 | 87.90 |
| Non-parameterized models | | | | | | | | | | |
| GloVe (Ave) | 300 | 77.6 | 78.5 | 91.5 | 87.9 | 79.8 | 83.6 | 72.1 | 79.0 | 81.25 |
| SIF [11] | 300 | 77.3 | 78.6 | 90.5 | 87.0 | 82.2 | 78.0 | - | 84.6 | 82.60 |
| p-mean [14] | 3600 | 78.3 | 80.8 | 92.6 | 89.1 | 84.0 | 88.4 | 73.2 | 83.5 | 83.74 |
| DCT [15] | 300-1800 | 78.5 | 80.1 | 92.8 | 88.4 | 83.7 | 89.8 | 75.0 | 80.6 | 83.61 |
| VLAWE [18] | 3000 | 77.7 | 79.2 | 91.7 | 88.1 | 80.8 | 87.0 | 72.8 | 81.2 | 82.31 |
| S3E (GloVe) | 355-1575 | 78.3 | 80.4 | 92.5 | 89.4 | 82.0 | 88.2 | 74.9 | 82.0 | 83.46 |
| S3E (FastText) | 355-1575 | 78.8 | 81.4 | 92.9 | 88.5 | 83.5 | 87.0 | 75.7 | 81.4 | 83.65 |
| S3E (L.F.P.) | 955-2175 | 79.4 | 81.4 | 92.9 | 89.4 | 83.5 | 89.0 | 75.6 | 82.6 | 84.23 |

TABLE II: Experimental results on supervised tasks, where sentence embeddings are fixed during the training process and the best results for parameterized and non-parameterized models are marked in bold, respectively.

The SentEval toolkit [30] (see footnote 4) is used for evaluation on eight supervised tasks:

  1. MR: Sentiment classification on movie reviews.

  2. CR: Sentiment classification on product reviews.

  3. SUBJ: Subjectivity/objectivity classification.

  4. MPQA: Opinion polarity classification.

  5. SST2: Stanford sentiment treebank for sentiment classification.

  6. TREC: Question type classification.

  7. MRPC: Paraphrase identification.

  8. SICK-Entailment: Entailment classification on SICK dataset.

| Dataset | # Samples | Task | Classes |
|---|---|---|---|
| MR | 11k | movie review | 2 |
| CR | 4k | product review | 2 |
| SUBJ | 10k | subjectivity/objectivity | 2 |
| MPQA | 11k | opinion polarity | 2 |
| SST2 | 70k | sentiment | 2 |
| TREC | 6k | question-type | 6 |
| MRPC | 5.7k | paraphrase detection | 2 |
| SICK-E | 10k | entailment | 3 |

TABLE III: Examples in downstream tasks

The details of each dataset are shown in Table III. For all tasks, we train a simple MLP classifier that contains one hidden layer of 50 neurons. This is the same setting as in [15], and only the regularization term is tuned on validation sets. The hyper-parameter setting of S3E is kept the same as that in textual similarity tasks. The batch size is set to 64 and the Adam optimizer is employed. For the MR, CR, SUBJ, MPQA and MRPC datasets, we use nested 10-fold cross validation. For TREC and SICK-E, we use cross validation. For SST2, the standard validation is utilized. All models are trained for 4 epochs.

Experimental results on supervised tasks are shown in Table II. The S3E method outperforms all non-parameterized models, including DCT [15], VLAWE [18] and p-means [14]. The S3E method uses a smaller embedding dimension than p-means and VLAWE and is also flexible in choosing the embedding dimension. As in other weight-based methods, the S3E method does not consider the order of words in a sentence but splits a sentence into different semantic groups. The S3E method performs the best on the paraphrase identification (MRPC) dataset among all non-parameterized and parameterized methods excluding SBERT-WK. We attribute this to the fact that, in paraphrasing, word order is less important since words are usually swapped. In this context, the correlation between semantic components plays an important role in determining the similarity between a pair of sentences and paraphrases.

Compared with parameterized methods, S3E also outperforms several of them, including Skip-thought, FastSent and Sent2Vec. In general, parameterized methods perform better than non-parameterized ones on downstream tasks. The best performance is achieved by the recently proposed SBERT-WK method, which incorporates a pre-trained deep contextualized word model. However, even though they perform well, deep models require much more computational resources, which makes them hard to integrate into mobile or terminal devices. Therefore, the S3E method has its own strength in combining efficiency with good performance.

IV-C Inference Speed

| Model | CPU inference time (ms) | GPU inference time (ms) |
|---|---|---|
| InferSent | 53.07 | 15.23 |
| SBERT-WK | 179.27 | 42.79 |
| GEM | 26.54 | - |
| SIF | 1.56 | - |
| Proposed S3E | 0.69 | - |

TABLE IV: Inference time comparison. Data are collected from 5 trials.

We compare the inference speed of S3E with that of other models, including both non-parameterized and parameterized ones. For a fair comparison, the batch size is set to 1 and all sentences from the STSB dataset are used for evaluation (17256 sentences). All benchmarks are run on the CPU (see footnote 5) and the GPU (see footnote 6). The results are shown in Table IV.
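
A timing harness in the spirit of this setup might look like the sketch below. It assumes the reported figure is the average time per sentence with batch size 1, averaged over several trials; `mean_inference_ms` and `dummy_embed` are hypothetical names, and the stand-in embedding function is not S3E.

```python
import time

def mean_inference_ms(embed_fn, sentences, trials=5):
    """Average per-sentence latency in milliseconds, one sentence per call."""
    per_trial = []
    for _ in range(trials):
        start = time.perf_counter()
        for sentence in sentences:  # batch size 1: embed sentences one at a time
            embed_fn(sentence)
        elapsed = time.perf_counter() - start
        per_trial.append(1000.0 * elapsed / len(sentences))
    return sum(per_trial) / len(per_trial)

def dummy_embed(sentence):
    """Stand-in for a real sentence encoder: returns the token lengths."""
    return [len(token) for token in sentence.split()]

latency_ms = mean_inference_ms(dummy_embed, ["a short sentence", "another one"], trials=5)
```

Using `time.perf_counter` rather than wall-clock time avoids jumps from system clock adjustments, and averaging over trials dampens warm-up effects.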

Compared with the other methods, S3E is very efficient in inference speed, which is very important for sentence embedding. Without the acceleration of a powerful GPU, when comparing 10,000 sentence pairs, deep contextualized models take about 1 hour to finish, whereas S3E requires only 13 seconds.
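
As a rough consistency check, these figures can be related to the CPU times in Table IV. The calculation below assumes that the table reports time per sentence and that each of the 10,000 pairs requires embedding two sentences one at a time; neither assumption is stated explicitly in the text.

```latex
\begin{align*}
\text{SBERT-WK:}\quad & 2 \times 10{,}000 \times 179.27\ \text{ms} = 3{,}585{,}400\ \text{ms} \approx 3585\ \text{s} \approx 60\ \text{min}, \\
\text{S3E:}\quad & 2 \times 10{,}000 \times 0.69\ \text{ms} = 13{,}800\ \text{ms} = 13.8\ \text{s}.
\end{align*}
```

Under these assumptions the two estimates are in line with the quoted values of about 1 hour and 13 seconds.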

IV-D Sensitivity to Cluster Numbers

Fig. 2: Comparing results with different settings on cluster numbers. STSB result is presented in Pearson Correlation Coefficients (%). SICK-E and SST2 are presented in accuracy.

We test the sensitivity of S3E to the number of clusters. The cluster number is varied from 5 to 60 in intervals of 5. Results for the STS-Benchmark, SICK-Entailment and SST2 datasets are reported. As we can see from Fig. 2, the performance of S3E is quite robust to the choice of cluster number. The performance varies by less than 1% in accuracy or correlation.

V Discussion

Averaging word embeddings provides a simple baseline for sentence embedding. Intuitively, a weighted sum of word embeddings should offer an improvement. Some methods that try to improve on averaging concatenate word embeddings in several forms, such as p-means [14] and VLAWE [18]. Concatenating word embeddings usually encounters the dimension explosion problem, so the number of concatenated components cannot be too large.

Our S3E method is compatible with existing models, and its performance can be further improved by replacing each module with a stronger one. First, we use word weights in constructing semantically similar groups and can incorporate different weighting schemes in our model, such as SIF [11] and GEM [12]. Second, different clustering schemes such as the Gaussian mixture model and dictionary learning can be utilized to construct semantically similar groups [13, 31]. Finally, the intra-group descriptor can be replaced by methods like VLAWE [18] and p-means [14]. In the inter-group descriptor, correlation between semantic groups can also be modeled in a non-linear way by applying different kernel functions. Another future direction is to add sequential information into the current S3E method.

VI Conclusion

A sentence embedding method based on semantic subspace analysis was proposed. The proposed S3E method has three building modules: semantic group construction, intra-group description and inter-group description. The S3E method can be integrated with many other existing models. It was shown by experimental results that the proposed S3E method offers state-of-the-art performance among non-parameterized models. S3E is outstanding for its effectiveness with low computational complexity.


  1. Our code is available at github.
  2. concatenated LexVec, FastText and PSL
  3. https://dumps.wikimedia.org/
  4. https://github.com/facebookresearch/SentEval
  5. Intel i7-5930 of 3.50GHz with 12 cores
  6. Nvidia GeForce GTX TITAN X


  1. G. Lample, M. Ott, A. Conneau, L. Denoyer, and M. Ranzato, “Phrase-based & neural unsupervised machine translation,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.
  2. T. Dozat and C. D. Manning, “Simpler but more accurate semantic dependency parsing,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).   Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 484–490. [Online]. Available: https://www.aclweb.org/anthology/P18-2077
  3. D. Shen, G. Wang, W. Wang, M. Renqiang Min, Q. Su, Y. Zhang, C. Li, R. Henao, and L. Carin, “Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms,” in ACL, 2018.
  4. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
  5. R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, “Skip-thought vectors,” in Advances in neural information processing systems, 2015, pp. 3294–3302.
  6. A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, “Supervised learning of universal sentence representations from natural language inference data,” arXiv preprint arXiv:1705.02364, 2017.
  7. D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, B. Strope, and R. Kurzweil, “Universal sentence encoder for English,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.   Brussels, Belgium: Association for Computational Linguistics, Nov. 2018, pp. 169–174. [Online]. Available: https://www.aclweb.org/anthology/D18-2029
  8. S. Subramanian, A. Trischler, Y. Bengio, and C. J. Pal, “Learning general purpose distributed sentence representations via large scale multi-task learning,” ICLR, 2018.
  9. N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.   Association for Computational Linguistics, 11 2019. [Online]. Available: http://arxiv.org/abs/1908.10084
  10. B. Wang and C.-C. J. Kuo, “SBERT-WK: A sentence embedding method by dissecting bert-based word models,” arXiv preprint arXiv:2002.06652, 2020.
  11. S. Arora, Y. Liang, and T. Ma, “A simple but tough-to-beat baseline for sentence embeddings,” 2017.
  12. Z. Yang, C. Zhu, and W. Chen, “Parameter-free sentence embedding via orthogonal basis,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 638–648.
  13. D. Mekala, V. Gupta, B. Paranjape, and H. Karnick, “Scdv: Sparse composite document vectors using soft clustering over distributional representations,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 659–669.
  14. A. Rücklé, S. Eger, M. Peyrard, and I. Gurevych, “Concatenated power mean word embeddings as universal cross-lingual sentence representations,” arXiv preprint arXiv:1803.01400, 2018.
  15. N. Almarwani, H. Aldarmaki, and M. Diab, “Efficient sentence embedding using discrete cosine transform,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).   Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 3663–3669. [Online]. Available: https://www.aclweb.org/anthology/D19-1380
  16. S. Kayal and G. Tsatsaronis, “Eigensent: Spectral sentence embeddings using higher-order dynamic mode decomposition,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4536–4546.
  17. M. Torki, “A document descriptor using covariance of word vectors,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).   Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 527–532. [Online]. Available: https://www.aclweb.org/anthology/P18-2084
  18. R. T. Ionescu and A. Butnaru, “Vector of locally-aggregated word embeddings (vlawe): A novel document-level representation,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 363–369.
  19. M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger, “From word embeddings to document distances,” in International conference on machine learning, 2015, pp. 957–966.
  20. D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” in Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms.   Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035.
  21. M. Pagliardini, P. Gupta, and M. Jaggi, “Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features,” in NAACL 2018 - Conference of the North American Chapter of the Association for Computational Linguistics, 2018.
  22. F. Hill, K. Cho, and A. Korhonen, “Learning distributed representations of sentences from unlabelled data,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.   San Diego, California: Association for Computational Linguistics, Jun. 2016, pp. 1367–1377. [Online]. Available: https://www.aclweb.org/anthology/N16-1162
  23. M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proc. of NAACL, 2018.
  24. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  25. K. S. Tai, R. Socher, and C. D. Manning, “Improved semantic representations from tree-structured long short-term memory networks,” arXiv preprint arXiv:1503.00075, 2015.
  26. J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.
  27. P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
  28. B. Wang, F. Chen, A. Wang, and C.-C. J. Kuo, “Post-processing of word representations via variance normalization and dynamic embedding,” in 2019 IEEE International Conference on Multimedia and Expo (ICME).   IEEE, 2019, pp. 718–723.
  29. B. Wang, A. Wang, F. Chen, Y. Wang, and C.-C. J. Kuo, “Evaluating word embedding models: methods and experimental results,” APSIPA Transactions on Signal and Information Processing, vol. 8, 2019.
  30. A. Conneau and D. Kiela, “Senteval: An evaluation toolkit for universal sentence representations,” arXiv preprint arXiv:1803.05449, 2018.
  31. V. Gupta, A. Saw, P. Nokhiz, P. Netrapalli, P. Rai, and P. Talukdar, “P-sif: Document embeddings using partition averaging.”