Sentiment Analysis of Citations Using Word2vec

Haixia Liu, School of Computer Science, University of Nottingham Malaysia Campus, Jalan Broga, 43500 Semenyih, Selangor Darul Ehsan.

Citation sentiment analysis is an important task in scientific paper analysis. Existing machine learning techniques for citation sentiment analysis focus on labor-intensive feature engineering, which requires a large annotated corpus. As an automatic feature extraction tool, word2vec has been successfully applied to sentiment analysis of short texts. In this work, I conducted empirical research on the question: how well does word2vec work for the sentiment analysis of citations? The proposed method constructs sentence vectors (sent2vec) by averaging word embeddings learned from the ACL Anthology collection (ACL-Embeddings). I also investigated polarity-specific word embeddings (PS-Embeddings) for classifying positive and negative citations. The sentence vectors form a feature space onto which each examined citation sentence is mapped. These features were input into classifiers (support vector machines) for supervised classification. Evaluation was conducted on a set of annotated citations using a 10-fold cross-validation scheme. The results showed that word embeddings are effective for classifying positive and negative citations. However, hand-crafted features performed better for the overall classification.

sentiment analysis, word2vec

1 Introduction

The evolution of scientific ideas happens when old ideas are replaced by new ones. Researchers usually conduct scientific experiments based on previous publications. They either make use of others' work as a solution to their specific problem, or they improve on the results documented in previous publications by introducing new solutions. I refer to the former as a positive citation and the latter as a negative citation. Randomly selected examples of citation sentences with different sentiment polarity are shown in Table 1.

Citing Cited Polarity Examples
A1 A0 Positive One of the most effective taggers based on a pure HMM is that developed at Xerox (Cutting et al. , 1992).
A2 A0 Negative Brill’s results demonstrate that this approach can outperform the Hidden Markov Model approaches that are frequently used for part-of-speech tagging (Jelinek, 1985; Church, 1988; DeRose, 1988; Cutting et al. , 1992; Weischedel et al., 1993), as well as showing promise for other applications.
Table 1: Examples of positive and negative citations.

Sentiment analysis of citations plays an important role in plotting the flow of scientific ideas. As Table 1 shows, one of the ideas introduced in paper A0 is Hidden Markov Model (HMM) based part-of-speech (POS) tagging, which is referenced positively in paper A1. In paper A2, however, a better approach is put forward, which makes the idea in paper A0 (HMM-based POS tagging) negative. Such citation sentiment analysis could guide future work: the new approaches mentioned in paper A2 could be recommended to other papers that cited A0 positively, provided that the citations share a similar topic (in this case, HMM-based POS tagging).

Analyzing citation sentences during a literature review is time consuming. Recently, researchers have developed algorithms to analyze citation sentiment automatically. For example, [1] extracted several features for citation purpose and polarity classification, such as reference count, contrary expressions and dependency relations. Jochim et al. tried to improve the result by using unigram and bigram features [2]. [3] used word-level features, contextual polarity features, and sentence-structure-based features to detect citation sentiment. Although these approaches generated good results using combinations of features, they required a lot of engineering work and a large amount of annotated data to obtain the features. Furthermore, capturing accurate features relies on other NLP techniques, such as part-of-speech (POS) tagging and sentence parsing. Therefore, it is necessary to explore other techniques that are free from hand-crafted features. With the development of neural networks and deep learning, it is possible to learn representations of concepts from an unlabeled text corpus automatically. These representations can be treated as concept features for classification. An important advance in this area is the development of the word2vec technique [4], which has proved to be an effective approach in Twitter sentiment classification [5].

In this work, I explored the word2vec technique for sentiment analysis of citations and compared word embeddings trained on different corpora.

2 Related Work

Mikolov et al. introduced the word2vec technique [4], which obtains word vectors by training on a text corpus. The idea of word2vec (word embeddings) originated from the concept of distributed representations of words [6]. The common method to derive the vectors is to use a neural probabilistic language model [7]. Word embeddings have proved to be effective representations in the tasks of sentiment analysis [5, 8, 9] and text classification [10]. Sadeghian and Sharafat [11] extended word embeddings to sentence embeddings by averaging the word vectors in a sentiment review statement. Their results showed that word embeddings outperformed the bag-of-words model in sentiment classification. In this work, I aim to evaluate word embeddings for sentiment analysis of citations. The research questions are:

  1. How well does word2vec work on classifying positive and negative citations?

  2. Can sentiment-specific word embeddings improve the classification result?

  3. How well does word2vec work on classifying implicit citations?

  4. In general, how well does word2vec work on classifying positive, negative and objective citations in comparison with hand-crafted features?

3 Methodology

3.1 Pre-processing

The SentenceModel provided by LingPipe was used to segment raw text into its constituent sentences. The data I used to train the vectors is noisy. For example, some incomplete sentences were mistakenly detected (e.g. Publication Year.). To address this issue, I eliminated sentences with fewer than three words.
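The length filter described above can be sketched in a few lines of Python. This is only an illustration: it assumes the sentences have already been segmented (the paper used LingPipe for that step) and that words are counted by whitespace splitting, which the paper does not specify. The function name `filter_short_sentences` and the sample sentences are hypothetical.

```python
def filter_short_sentences(sentences, min_words=3):
    """Keep only sentences with at least `min_words` whitespace-separated tokens.

    Assumption: words are counted by splitting on whitespace; the original
    work may have tokenized differently.
    """
    return [s for s in sentences if len(s.split()) >= min_words]


# Hypothetical segmenter output, including noise such as "Publication Year."
sentences = [
    "Publication Year.",
    "We propose a new tagger.",
    "Abstract",
    "Results are shown in Table 2.",
]
kept = filter_short_sentences(sentences)
print(kept)
```

Sentences with exactly three words are kept, matching the "fewer than three words" rule.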

3.2 Overall Sent2vec Training

In this work, I constructed sentence embeddings based on word embeddings. I simply averaged the vectors of the words in one sentence to obtain sentence embeddings (sent2vec). The main process in this step is to learn the word embedding matrix \(W\):

\[ v_{\text{sent2vec}}(w_1, w_2, \ldots, w_n) = \frac{1}{n} \sum_{i=1}^{n} W_{w_i} \]

where \(W_{w_i}\) (\(i = 1, \ldots, n\)) is the word embedding for word \(w_i\), which could be learned by the classical word2vec algorithm [4]. The parameters that I used to train the word embeddings are the same as in the work of Sadeghian and Sharafat [11].
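The averaging step of sent2vec can be illustrated with a small, dependency-free sketch. The embedding table below is a toy dictionary with made-up three-dimensional vectors, not the ACL-Embeddings. Two details are assumptions the paper does not state: words missing from the vocabulary are skipped, and a sentence with no known words maps to the zero vector.

```python
def sent2vec(tokens, embeddings, dim):
    """Average the embeddings of the in-vocabulary tokens of one sentence.

    Assumptions (not stated in the paper): out-of-vocabulary tokens are
    skipped, and a sentence with no known token maps to the zero vector.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(dim)]


# Toy word embeddings with made-up values, for illustration only.
toy_embeddings = {
    "effective": [0.5, 0.25, 0.0],
    "tagger": [0.0, 0.25, 1.0],
    "hmm": [1.0, -0.5, 0.5],
}

# "an" is out of vocabulary and is ignored in the average.
vec = sent2vec(["an", "effective", "hmm", "tagger"], toy_embeddings, dim=3)
print(vec)
```

The resulting fixed-length vector is what would be passed to a classifier as the feature representation of the sentence.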

3.3 Polarity-Specific Word Representation Training

To improve citation sentiment classification results, I trained polarity-specific word embeddings (PS-Embeddings), which were inspired by the Sentiment-Specific Word Embedding [5]. After obtaining the PS-Embeddings, I averaged the word vectors in each sentence in the same way as in the sent2vec model.

4 Experiment

4.1 Training Dataset

The ACL-Embeddings (300 and 100 dimensions) were trained on the ACL collection. The ACL Anthology Reference Corpus contains the canonical 10,921 computational linguistics papers, from which I generated 622,144 sentences after filtering out lower-quality sentences.

For training polarity-specific word embeddings (PS-Embeddings, 100 dimensions), I selected 17,538 sentences (8,769 positive and 8,769 negative) from the ACL collection by matching sentences against polar phrases.
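One way such a phrase-based selection might look is sketched below. Everything specific here is an assumption made for illustration: the phrase lists are hypothetical (the paper's actual polar phrases are not reproduced in this text), matching is done by case-insensitive substring search, sentences matching both polarities are discarded, and the two classes are balanced by random downsampling.

```python
import random

# Hypothetical polar phrases, for illustration only; not the paper's list.
POSITIVE_PHRASES = ["state-of-the-art", "outperform", "effective"]
NEGATIVE_PHRASES = ["fail to", "suffer from", "limited"]


def polarity_of(sentence):
    """Return 'positive', 'negative', or None (no match or conflicting matches)."""
    text = sentence.lower()
    has_pos = any(p in text for p in POSITIVE_PHRASES)
    has_neg = any(p in text for p in NEGATIVE_PHRASES)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None


def select_balanced(sentences, seed=0):
    """Pick equally many positive and negative sentences (assumed downsampling)."""
    pos = [s for s in sentences if polarity_of(s) == "positive"]
    neg = [s for s in sentences if polarity_of(s) == "negative"]
    k = min(len(pos), len(neg))
    rng = random.Random(seed)
    return rng.sample(pos, k), rng.sample(neg, k)


corpus = [
    "This tagger is effective.",
    "Their parser outperforms ours.",
    "These methods suffer from sparsity.",
    "We describe the corpus.",
    "An effective method with limited coverage.",
]
pos_sel, neg_sel = select_balanced(corpus)
print(pos_sel, neg_sel)
```

In this toy corpus the last sentence matches both lists and is dropped, so one sentence of each polarity is returned.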

The pre-trained Brown-Embeddings (100 dimensions), learned from the Brown corpus, were also used for comparison.

4.2 Test Dataset

To evaluate the sent2vec performance on citation sentiment detection, I conducted experiments on three datasets. The first one (dataset-basic) was originally taken from the ACL Anthology [12]. Athar [3] manually annotated 8,736 citations from 310 publications in the ACL Anthology. I used all of the labeled sentences (830 positive, 280 negative and 7,626 objective) for testing. (In the work of [3], 244 negative, 743 positive and 6,277 objective citations were used for testing.)

The second dataset (dataset-implicit) was used for evaluating implicit citation classification, containing 200,222 excluded (x), 282 positive (p), 419 negative (n) and 2,880 objective (o) annotated sentences. Every sentence that does not contain any direct or indirect mention of the citation is labeled as excluded (x).

The third dataset (dataset-pn) is a subset of dataset-basic, containing 828 positive and 280 negative citations. Dataset-pn was used for (1) evaluating binary classification (positive versus negative) performance using sent2vec, and (2) comparing the sentiment classification ability of PS-Embeddings with that of the other embeddings.

4.3 Evaluation Strategy

The One-vs-the-Rest strategy was adopted for the task of multi-class classification (see the scikit-learn documentation, modules/multiclass.html), and I report the F-score, micro-F, macro-F and weighted-F scores (generated/sklearn.metrics.f1_score.html) using 10-fold cross-validation. The F1 score is the harmonic mean of precision and recall. In the multi-class case, the per-class F1 scores are combined by averaging, and several types of averaging can be performed: Micro-F calculates metrics globally by counting the total true positives, false negatives and false positives. Macro-F calculates metrics for each label and finds their unweighted mean; it does not take label imbalance into account. Weighted-F calculates metrics for each label and finds their average, weighted by support (the number of true instances for each label); it alters macro-F to account for label imbalance.
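To make the three averaging schemes concrete, the following sketch computes micro-F, macro-F and weighted-F from scratch. It is a plain-Python illustration of the definitions above, not the scikit-learn implementation used in the experiments. The example applies it to a hypothetical classifier that labels every citation as objective, using the class counts of dataset-basic from Section 4.2; the function names are illustrative.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f_scores(y_true, y_pred):
    """Return (micro-F, macro-F, weighted-F) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    per_label, support = {}, {}
    tp_all = fp_all = fn_all = 0
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        per_label[label] = f1(tp, fp, fn)
        support[label] = tp + fn  # number of true instances of this label
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    micro = f1(tp_all, fp_all, fn_all)                # global counts
    macro = sum(per_label.values()) / len(labels)     # unweighted mean
    weighted = sum(per_label[l] * support[l] for l in labels) / len(y_true)
    return micro, macro, weighted


# Class counts of dataset-basic: 7,626 objective, 830 positive, 280 negative.
y_true = ["o"] * 7626 + ["p"] * 830 + ["n"] * 280
# Hypothetical classifier that predicts "objective" for every citation.
y_pred = ["o"] * len(y_true)

micro, macro, weighted = f_scores(y_true, y_pred)
print(round(micro, 2), round(macro, 2), round(weighted, 2))
```

For this hypothetical classifier, micro-F equals the share of objective citations (7,626 / 8,736), while macro-F is one third of the objective-class F1 because the other two classes score zero. This is the sense in which macro-F ignores label imbalance and weighted-F accounts for it.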

4.4 Results

The performance of citation sentiment classification on dataset-basic and dataset-implicit is shown in Table 2 and Table 3, respectively. The result of classifying positive and negative citations is shown in Table 4. To compare with the outcomes in the work of [3] (whose test dataset is slightly smaller than the one used here), I selected two records from their results: the best one (based on the features n-gram + dependencies + negation) and the baseline (based on 1-3 grams). Table 2 shows that the features extracted by [3] performed far better than word embeddings in terms of macro-F (their best macro-F is 0.90; the best in this work is 0.33). However, the higher micro-F score (the highest micro-F in this work is 0.88; theirs is 0.78) and the weighted-F scores indicate that this method may achieve better performance if the evaluation is conducted on a balanced dataset. Among the embeddings, ACL-Embeddings performed better than Brown-Embeddings in terms of macro-F and weighted-F (I did not perform a significance test for this comparison). Regarding the dimensionality of the word embeddings, ACL300 gave a higher micro-F score than ACL100, but there is no difference between the 300- and 100-dimensional ACL-Embeddings in terms of macro-F and weighted-F scores.

Methods Micro-F Macro-F Weighted-F
ACL300 0.88 0.33 0.82
ACL100 0.87 0.33 0.82
Brown100 0.87 0.31 0.81
n-grams 0.60 0.87 -
n-grams+dep+neg 0.76 0.90 -
Table 2: Performance of citation sentiment classification.

Table 3 shows the sent2vec performance on classifying implicit citations into four categories: objective, negative, positive and excluded. The method performed poorly on detecting positive citations, but it was comparable with both the baseline and the sentence structure method [13] for the category of objective citations. For negative citations, this method was not as good as sentence structure features, but it outperformed the baseline. The results of separating category X from the rest show that this method and the sentence structure method perform almost equally.

Sentiment Baseline Athar ACL300
O (F-score) 0.86 0.89 0.84
N (F-score) 0.14 0.62 0.44
P (F-score) 0.40 0.55 0.27
Macro-F 0.47 0.69 0.44
Weighted-F - - 0.77
X vs O,N,P (F-score) 0.990 0.996 0.997
Table 3: Performance of implicit citation sentiment classification.

Table 4 shows the results of classifying positive and negative citations using different word embeddings. The macro-F score of 0.85 and the weighted-F score of 0.86 indicate that word2vec is effective for classifying positive and negative citations. However, unlike in [5], where sentiment-specific word embeddings were found to perform best, integrating polarity information did not improve the result in this experiment.

Training Corpus Macro-F Weighted-F
Brown100 0.84 0.85
ACL300 0.85 0.86
ACL100 0.85 0.85
PS-ACL300 0.84 0.85
Table 4: Performance of classifying positive and negative citations.

5 Discussion and Conclusion

In this paper, I reported citation sentiment classification results based on word embeddings. The binary classification results in Table 4 show that word2vec is a promising tool for distinguishing positive and negative citations. Table 4 also shows that there are no big differences between the scores generated by ACL100 and Brown100, even though they have different vocabulary sizes (ACL100 has 14,325 words; Brown100 has 56,057 words). The polarity-specific word embeddings did not show an advantage in the task of binary classification. For the task of classifying implicit citations (Table 3), sent2vec (macro-F 0.44) was in general comparable with the baseline (macro-F 0.47). It was effective for detecting objective sentences (F-score 0.84) and for separating X sentences from the rest (F-score 0.997), but it did not work well for distinguishing positive citations from the rest. For the overall classification (Table 2), however, this method was not as good as hand-crafted features, such as n-grams and sentence structure features. I conclude from this experiment that the word2vec technique has the potential to capture sentiment information in citations, but hand-crafted features perform better.


References

  • [1] A. Abu-Jbara, J. Ezra, and D. R. Radev, “Purpose and polarity of citation: Towards nlp-based bibliometrics.” in HLT-NAACL, 2013, pp. 596–606.
  • [2] C. Jochim and H. Schütze, “Improving citation polarity classification with product reviews.” in ACL (2), 2014, pp. 42–48.
  • [3] A. Athar, “Sentiment analysis of citations using sentence structure-based features,” in Proceedings of the ACL 2011 student session.   Association for Computational Linguistics, 2011, pp. 81–87.
  • [4] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
  • [5] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin, “Learning sentiment-specific word embedding for twitter sentiment classification,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 1, 2014, pp. 1555–1565.
  • [6] G. E. Hinton, J. L. McClelland, and D. E. Rumelhart, “Distributed representations,” 1986.
  • [7] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilistic language model,” The Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.
  • [8] B. Xue, C. Fu, and Z. Shaobin, “A study on sentiment computing and classification of sina weibo with word2vec,” in Big Data (BigData Congress), 2014 IEEE International Congress on.   IEEE, 2014, pp. 358–363.
  • [9] D. Zhang, H. Xu, Z. Su, and Y. Xu, “Chinese comments sentiment classification based on word2vec and svm perf,” Expert Systems with Applications, vol. 42, no. 4, pp. 1857–1863, 2015.
  • [10] J. Lilleberg, Y. Zhu, and Y. Zhang, “Support vector machines and word2vec for text classification with semantic features,” in Cognitive Informatics & Cognitive Computing (ICCI* CC), 2015 IEEE 14th International Conference on.   IEEE, 2015, pp. 136–140.
  • [11] A. Sadeghian and A. R. Sharafat, “Bag of words meets bags of popcorn,” 2015.
  • [12] S. Bird, R. Dale, B. J. Dorr, B. R. Gibson, M. Joseph, M.-Y. Kan, D. Lee, B. Powley, D. R. Radev, and Y. F. Tan, “The acl anthology reference corpus: A reference dataset for bibliographic research in computational linguistics.” in LREC, 2008.
  • [13] A. Athar and S. Teufel, “Detection of implicit citations for sentiment detection,” in Proceedings of the Workshop on Detecting Structure in Scholarly Discourse, ser. ACL ’12.   Stroudsburg, PA, USA: Association for Computational Linguistics, 2012, pp. 18–26.