SentiLR: Linguistic Knowledge Enhanced Language Representation for Sentiment Analysis
Most existing pre-trained language representation models neglect the linguistic knowledge of texts, whereas we argue that such knowledge can promote language understanding in various NLP tasks. In this paper, we propose a novel language representation model called SentiLR, which introduces word-level linguistic knowledge, including part-of-speech tags and prior sentiment polarity from SentiWordNet, to benefit downstream sentiment analysis tasks. During pre-training, we first acquire the prior sentiment polarity of each word by querying the SentiWordNet dictionary with its part-of-speech tag. Then, we devise a new pre-training task called label-aware masked language model (LA-MLM), consisting of two subtasks: 1) word knowledge recovery given the sentence-level label; 2) sentence-level label prediction with linguistic-knowledge-enhanced context. Experiments show that SentiLR achieves state-of-the-art performance on several sentence-level / aspect-level sentiment analysis tasks after fine-tuning, and also obtains competitive results on general language understanding tasks.
Recently, pre-trained language representation models such as GPT Radford et al. (2018, 2019), ELMo Peters et al. (2018), BERT Devlin et al. (2019) and XLNet Yang et al. (2019) have achieved promising results in NLP tasks, including reading comprehension Rajpurkar et al. (2016), natural language inference Bowman et al. (2015); Williams et al. (2018) and sentiment classification Socher et al. (2013). These models capture contextual information from large-scale unlabelled corpora via well-designed pre-training tasks. The literature has commonly reported that pre-trained models can be used as effective feature extractors and achieve state-of-the-art performance on various downstream tasks Wang et al. (2019).
Although pre-trained language representation models have achieved transformative performance, pre-training tasks such as masked language model and next sentence prediction Devlin et al. (2019) neglect linguistic knowledge. We argue that such knowledge is important for some NLP tasks, particularly sentiment analysis. For instance, existing work has shown that linguistic knowledge, including the part-of-speech tag Qian et al. (2015); Huang et al. (2017) and prior sentiment polarity Qian et al. (2017) of each word, is closely related to the sentiment of longer texts such as sentences and paragraphs. We therefore expect that pre-trained models enriched with the linguistic knowledge of words will benefit the understanding of the sentiment of whole texts, resulting in better performance on sentiment analysis.
Although directly introducing linguistic knowledge from external linguistic resources is feasible, it remains a challenge for the model to learn a beneficial knowledge-aware representation that promotes the downstream tasks in sentiment analysis. Linguistic knowledge roughly reflects the different impacts of individual words on the sentiment of a whole sentence. Some of these words may act as sentiment shifters: negation words typically flip the sentiment to the opposite polarity Zhu et al. (2014), while intensity words modify the valence degree, i.e., the sentiment intensity, of the text Qian et al. (2017). However, the sentiment labels of sentences are commonly derived from multiple word-induced sentiment shifts, and modeling the complex relationship between sentence-level sentiment labels and word-level sentiment shifts remains underexplored. Thus, the goal of our research is to fully exploit linguistic knowledge to obtain a language representation that captures the connection between high-level labels and words, thereby improving performance on sentiment analysis tasks.
In this paper, we propose a novel pre-trained language representation model called SentiLR to deal with this challenge. First, to acquire the linguistic knowledge of each word, we utilize SentiWordNet 3.0 Baccianella et al. (2010) as our linguistic resource. Specifically, we look up the sentiment scores of words with corresponding part-of-speech tags in SentiWordNet. Since we cannot accurately match the meaning of each word with a sense in SentiWordNet, we compute a weighted sum of the sentiment scores of all the senses as the prior sentiment polarity for each word Guerini et al. (2013). Then, to capture the relationship between sentence-level labels and word-level sentiment shifts using linguistic knowledge, we devise a novel pre-training task called label-aware masked language model. This task contains two sub-tasks: 1) predicting the masked word, part-of-speech tag, and sentiment polarity at masked positions given the sentence-level sentiment label; 2) predicting the sentence-level label, the masked word, and its linguistic knowledge, including part-of-speech tag and sentiment polarity, simultaneously. These two sub-tasks are expected to encourage the model to utilize linguistic knowledge to build the connection between high-level sentiment labels and low-level sentiment shifts. Our contributions are threefold:
We analyze the importance of incorporating linguistic knowledge into pre-trained language representation models, and we observe that effectively leveraging linguistic knowledge benefits the sentiment analysis tasks.
We propose a novel pre-trained language representation model called SentiLR, which acquires word-level sentiment polarity from SentiWordNet and adopts label-aware masked language model to capture the relationship between sentence-level sentiment labels and word-level sentiment shifts.
We conduct experiments on sentence-level / aspect-level sentiment classification tasks and show that SentiLR can outperform state-of-the-art pre-trained language representation models such as BERT and XLNet.
2 Related Work
2.1 Pre-trained Language Representation Model
Early work on pre-trained language representation models mainly focuses on distributed word representations, such as word2vec Mikolov et al. (2013) and GloVe Pennington et al. (2014). Since distributed word representations are independent of context, it is challenging for them to model complex word characteristics under different contexts. Thus, contextual language representations based on pre-trained models, including CoVe McCann et al. (2017), ELMo Peters et al. (2018), GPT Radford et al. (2018, 2019) and BERT Devlin et al. (2019), have recently become prevalent. These models use deep LSTMs Hochreiter and Schmidhuber (1997) or Transformers Vaswani et al. (2017) as the encoder to acquire contextual language representations. Various pre-training tasks have been explored, including traditional NLP tasks like machine translation McCann et al. (2017) and language modeling Peters et al. (2018); Radford et al. (2018, 2019), and other tasks such as masked language model and next sentence prediction Devlin et al. (2019).
With the advent of BERT Devlin et al. (2019), which achieves state-of-the-art performance on various NLP tasks, many variants of BERT have been proposed. Due to the important role of entities in language understanding, two heuristic ways have been studied to make pre-trained models aware of entities, i.e., explicitly introducing a knowledge graph Zhang et al. (2019) / knowledge base Peters et al. (2019), and designing entity-specific masking strategies during pre-training Sun et al. (2019a, b). Considering the implicit relationship among different NLP tasks, post-training approaches Xu et al. (2019); Li et al. (2019) conduct supervised training of the pre-trained BERT on transfer tasks related to the target tasks, in order to get a better initialization for the target tasks. The model structure and pre-training tasks of BERT are also worth exploring. Some researchers measure the impact of key hyper-parameters to improve the under-trained BERT Liu et al. (2019), while others improve the masked language model by masking contiguous random spans Joshi et al. (2019) or by decomposing the training objective into an auto-regressive language model Yang et al. (2019).
Other work proposes task-specific pre-training strategies to acquire task-specific language representations for the corresponding tasks, such as data augmentation Wu et al. (2019), cross-lingual analysis Lample and Conneau (2019), relation extraction Alt et al. (2019); Soares et al. (2019) and language generation Song et al. (2019); Dong et al. (2019). To the best of our knowledge, SentiLR is the first work to explore a sentiment-specific pre-trained language representation model for downstream sentiment analysis tasks.
2.2 Linguistic Knowledge for Sentiment Analysis
Linguistic knowledge such as part of speech and word-level sentiment polarity is commonly used as external features in sentiment analysis. Part of speech can facilitate the understanding of the syntactic structure of texts by improving the parsing performance Socher et al. (2013). It can also be incorporated into all layers of RNN as tag embeddings Qian et al. (2015). Huang et al. (2017) shows that part of speech can help to learn sentiment-favorable representations.
Word-level sentiment polarity is mostly derived from sentiment lexicons Hu and Liu (2004); Wilson et al. (2005). Guerini et al. (2013) obtains the prior sentiment polarity by weighting the sentiment scores over all the senses of words in SentiWordNet Esuli and Sebastiani (2006); Baccianella et al. (2010). Teng et al. (2016) proposes a lexicon-based weighted sum model, which weights the prior sentiment scores of sentiment words to get the sentiment label of the whole sentence. Qian et al. (2017) models the linguistic role of sentiment, negation and intensity words via linguistic regularizers in the training objective.
3.1 Task Definition and Model Overview
Our task is formulated as follows: given a text sequence X = (x_1, x_2, …, x_n) of length n, our goal is to acquire a representation H ∈ ℝ^{n×d} of the whole sequence that captures the contextual information and the linguistic knowledge simultaneously, where d indicates the dimension of the representation vector.
Figure 1 shows the overview of our model pipeline which contains three stages: 1) Acquiring the prior sentiment polarity for each word with its corresponding part-of-speech tag; 2) Conducting pre-training via two tasks i.e. label-aware masked language modeling and next sentence prediction; 3) Fine-tuning on sentiment analysis tasks with different settings. Compared with the vanilla pre-trained models like BERT Devlin et al. (2019), our model enriches the input sequence with its linguistic knowledge including part-of-speech tags and sentiment polarity labels, and utilizes a modified masked language model to capture the relationship between sentence-level sentiment labels and word-level knowledge in addition to context dependency.
3.2 Linguistic Knowledge Acquisition
This module obtains the sentiment polarity of each word given its part-of-speech tag. The input of this module is a sequence of tuples (w_i, p_i) containing words and part-of-speech labels tagged by external tools such as NLTK (http://www.nltk.org/). Assume that, due to ambiguity, for each tuple (w_i, p_i) we can find K different senses with their sense numbers k and positive / negative scores (s_k^+, s_k^-) in SentiWordNet, where k indicates the order of the senses and s_k^+ / s_k^- is the positive / negative score assigned by SentiWordNet. Since we cannot accurately match the meaning of each word in the sequence with a sense in SentiWordNet, we follow Guerini et al. (2013) and convert the scores of all the senses into a prior sentiment label via the sense-number-weighted average

score(w_i, p_i) = ( Σ_{k=1}^{K} (s_k^+ − s_k^−) / k ) / ( Σ_{k=1}^{K} 1 / k ),

where the reciprocal of the sense number weights the respective score, since in SentiWordNet a smaller sense number indicates more frequent use of that sense in natural language; the sign of the averaged score determines the prior sentiment label. Note that if we cannot find any sense for (w_i, p_i) in SentiWordNet, the neutral label will be assigned.
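As an illustration, the conversion above can be sketched as follows. The in-memory lexicon, the decision threshold, and the {+1, 0, −1} label encoding are hypothetical stand-ins for this sketch; a real pipeline would query SentiWordNet itself (e.g., via NLTK's sentiwordnet corpus) with the word and its part-of-speech tag.

```python
# Hypothetical stand-in for SentiWordNet: (word, POS) -> list of
# (pos_score, neg_score) pairs, ordered by sense number (sense 1 = most frequent).
LEXICON = {
    ("good", "a"): [(0.75, 0.0), (0.5, 0.0), (0.0, 0.0)],
    ("terrible", "a"): [(0.0, 0.625), (0.125, 0.5)],
}

def prior_polarity(word, pos, threshold=0.1):
    """Weight each sense's (pos - neg) score by 1/sense_number, normalize,
    and map the result to a discrete prior label: +1 / -1 / 0 (neutral)."""
    senses = LEXICON.get((word, pos))
    if not senses:
        return 0  # no sense found in the lexicon: assign the neutral label
    weighted = sum((p - n) / k for k, (p, n) in enumerate(senses, start=1))
    norm = sum(1.0 / k for k in range(1, len(senses) + 1))
    score = weighted / norm
    if score > threshold:
        return 1
    if score < -threshold:
        return -1
    return 0

print(prior_polarity("good", "a"))      # positive: 1
print(prior_polarity("terrible", "a"))  # negative: -1
print(prior_polarity("unknown", "n"))   # neutral fallback: 0
```

The 1/k weighting means a word's most frequent sense dominates its prior polarity, which matches the intuition that rare senses should not flip a word's typical sentiment.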
3.3 Pre-training Tasks
During pre-training, label-aware masked language model (LA-MLM) and next sentence prediction (NSP) are adopted as the pre-training tasks, where the setting of NSP is identical to that proposed by Devlin et al. (2019). Label-aware masked language model is designed to utilize the linguistic knowledge to grasp the implicit dependency between sentence-level sentiment labels and words, in addition to context dependency. It contains two separate sub-tasks, both of which take the position embedding, token embedding, and segment embedding as input. The position embedding introduces position information into the model, while the segment embedding marks the boundaries of different sentences; both are implemented in the same setting as BERT. Besides the original word embedding, the token embedding additionally includes the part-of-speech embedding and the word-level sentiment polarity embedding obtained in Section 3.2.
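As a sketch of this input composition (with toy dimensions and randomly initialized lookup tables, not the released implementation or BERT-Base sizes), the model input at each position can be formed by summing the embeddings described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration; SentiLR inherits BERT-Base dimensions in practice.
d, vocab, n_pos_tags, n_polarity, n_segments, max_len = 8, 100, 5, 3, 2, 16
E_word = rng.normal(size=(vocab, d))
E_pos_tag = rng.normal(size=(n_pos_tags, d))   # part-of-speech embedding
E_polarity = rng.normal(size=(n_polarity, d))  # word-level sentiment polarity embedding
E_segment = rng.normal(size=(n_segments, d))
E_position = rng.normal(size=(max_len, d))

def lamlm_input(word_ids, pos_ids, polarity_ids, segment_ids):
    """Token embedding = word + POS + polarity embeddings; segment and
    position embeddings are added on top, as in BERT."""
    positions = np.arange(len(word_ids))
    return (E_word[word_ids] + E_pos_tag[pos_ids] + E_polarity[polarity_ids]
            + E_segment[segment_ids] + E_position[positions])

x = lamlm_input([1, 2, 3], [0, 1, 2], [1, 0, 2], [0, 0, 0])
print(x.shape)  # (3, 8): one d-dimensional vector per input position
```

Summation (rather than concatenation) keeps the input dimension identical to vanilla BERT, so pre-trained BERT parameters can initialize the model.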
The goal of sub-task#1 is to recover the masked sequence conditioned on the sentence-level label, as shown in Figure 2. In this setting, we add the sentence-level sentiment embedding to the inputs and the model is required to predict the word, part-of-speech tag and word-level sentiment polarity individually using the hidden states at the masked positions. This sub-task explicitly exerts the impact of the high-level sentiment label on the words and the linguistic knowledge of words, enhancing the ability of our model to explore the complex connection among them.
The purpose of sub-task#2 is to predict the sentence-level label and the word information based on the hidden states at [CLS] and masked positions respectively. From Figure 3, we can see that the label is used as the supervision signal, which is different from sub-task#1. The simultaneous prediction of labels, words and linguistic knowledge of words enables our model to capture the implicit relationship among them.
Since the two sub-tasks are separate, we empirically set the proportion of pre-training data provided for the two sub-tasks to 4:1. As for the masking strategy, we increase the masking probability of words with positive / negative sentiment polarity from 15% in the setting of BERT to 30%, because they are more likely to cause sentiment shifts in the whole sentence.
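A minimal sketch of this masking scheme follows. The function name and interface are our own, and the real pre-processing would additionally handle WordPiece tokenization and BERT's 80/10/10 replacement rule, which we omit here:

```python
import random

def mask_tokens(tokens, polarities, base_p=0.15, senti_p=0.30, mask="[MASK]"):
    """Mask tokens for LA-MLM: words with a non-neutral prior polarity are
    masked with doubled probability (0.30 vs. 0.15), per Section 3.3."""
    masked, targets = [], []
    for tok, pol in zip(tokens, polarities):
        p = senti_p if pol != 0 else base_p
        if random.random() < p:
            masked.append(mask)
            targets.append(tok)   # prediction target at this position
        else:
            masked.append(tok)
            targets.append(None)  # no loss at unmasked positions
    return masked, targets
```

For example, calling `mask_tokens(["the", "movie", "is", "great"], [0, 0, 0, 1])` masks "great" twice as often as the neutral words.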
3.4 Fine-tuning Setting
Equipped with the ability to utilize linguistic knowledge via pre-training, our model can be fine-tuned to different sentiment analysis tasks, including sentence-level / aspect-level sentiment classification. We follow the fine-tuning setting of the existing work Devlin et al. (2019); Xu et al. (2019):
Sentence-level Sentiment Classification: The input of this task is a text sequence X = (x_1, x_2, …, x_n). The sentiment label is obtained based on the hidden state of [CLS].
Aspect-level Sentiment Classification: In addition to the text sequence, the input contains an aspect term / aspect category sequence A = (a_1, a_2, …, a_m). The sentiment label is also acquired based on the hidden state of [CLS]. Figure 4 illustrates the fine-tuning settings.
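The two input formats above can be sketched with a small helper following BERT's sentence-pair packing. The helper name is hypothetical, and the exact ordering (text segment first, aspect segment second) is our assumption based on Devlin et al.'s format:

```python
def build_input(text_tokens, aspect_tokens=None):
    """Pack tokens in BERT style: sentence-level uses [CLS] text [SEP];
    aspect-level appends the aspect term / category as a second segment.
    The classifier reads the final hidden state at [CLS] in both settings."""
    seq = ["[CLS]"] + text_tokens + ["[SEP]"]
    segments = [0] * len(seq)  # segment 0: the review text
    if aspect_tokens is not None:
        seq += aspect_tokens + ["[SEP]"]
        segments += [1] * (len(aspect_tokens) + 1)  # segment 1: the aspect
    return seq, segments

# Sentence-level setting: text only.
print(build_input(["great", "food"]))
# Aspect-level setting: text plus an aspect category.
print(build_input(["great", "food"], ["service"]))
```

Reusing [CLS] for both settings means the same classification head design transfers between sentence-level and aspect-level fine-tuning.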
4.1 Pre-training Dataset and Implementation
We adopted the Yelp Dataset Challenge 2019 (https://www.yelp.com/dataset/challenge) as our pre-training dataset. This dataset contains 6,685,900 reviews with 5-class review-level sentiment labels. Each review contains 8.1 sentences and 127.8 words on average.
Since our method can adapt to any BERT-style pre-training model, we used vanilla BERT Devlin et al. (2019) as the base framework to construct Transformer blocks in this paper and leave the exploration of other models like RoBERTa Liu et al. (2019) as future work. The hyper-parameters of the Transformer blocks were set to be the same as BERT-Base due to limited computational power. Considering the large cost of training from scratch, we utilized the parameters of pre-trained BERT (https://github.com/google-research/bert) to initialize our model. We also followed BERT in using the WordPiece vocabulary Wu et al. (2016) with a vocabulary size of 30,522. The maximum sequence length in the pre-training phase was 128, and the batch size was 512. We used Adam Kingma and Ba (2015) as the optimizer with a learning rate of 5e-5. Our model was pre-trained on Yelp Dataset Challenge 2019 for 1 epoch with label-aware masked language model and next sentence prediction. Note that we will release all the data, code, and model parameters.
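For reference, the pre-training hyper-parameters listed above can be collected into a configuration sketch (the dict keys and layout are our own; only the values are from the paper):

```python
# Pre-training configuration reported in Section 4.1 (hypothetical key names).
PRETRAIN_CONFIG = {
    "base_model": "BERT-Base",       # Transformer blocks match BERT-Base
    "init_from": "pre-trained BERT", # avoids the cost of training from scratch
    "vocab": "WordPiece",
    "vocab_size": 30522,
    "max_seq_length": 128,
    "batch_size": 512,
    "optimizer": "Adam",
    "learning_rate": 5e-5,
    "epochs": 1,
    "pretrain_tasks": ["LA-MLM", "NSP"],
    "corpus": "Yelp Dataset Challenge 2019",
}

print(PRETRAIN_CONFIG["learning_rate"])  # 5e-05
```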
We compared SentiLR with several state-of-the-art pre-trained language representation models:
BERT: The pre-trained model based on masked language model and next sentence prediction Devlin et al. (2019).
XLNet: The variant of BERT which autoregressively recovers the masked tokens with permutation language model Yang et al. (2019).
For a fair comparison, all the baselines in this paper use the base version. The number of parameters in each model is listed in Table 1. Since SentiLR adopts the same Transformer block architecture as BERT, the number of parameters in these two models is almost the same, and smaller than that of XLNet.
4.3 Sentence-level Sentiment Classification
| Dataset | Train / Valid / Test | Avg. Length | Classes |
| SST | 8,544 / 1,101 / 2,210 | 19.2 | 5 |
| MR | 8,534 / 1,078 / 1,050 | 21.7 | 2 |
| IMDB | 24,749 / 249 / 24,999 | 279.2 | 2 |
| Yelp-2 | 504,000 / 56,000 / 38,000 | 155.3 | 2 |
| Yelp-5 | 594,000 / 56,000 / 50,000 | 156.6 | 5 |
The goal of sentence-level sentiment classification is to predict the sentiment labels of sentences or paragraphs, which examines the model's ability to understand the whole text. We evaluated our model on several sentence-level sentiment classification benchmarks, including Stanford Sentiment Treebank (SST) Socher et al. (2013), Movie Review (MR) Pang and Lee (2005), IMDB Maas et al. (2011) and Yelp-2/5 Zhang et al. (2015), which cover widely used datasets at different scales. We report the statistics of the datasets in Table 2, including the sizes of the training / validation / test sets, the average length, and the number of classes. Since MR, IMDB and Yelp-2/5 do not have validation sets, we randomly sampled subsets from the training sets as validation sets and tested all the models with the same data split.
The results are shown in Table 3. We can observe that SentiLR performs better than or on par with the other baselines on MR, Yelp-2 and Yelp-5. As for SST and IMDB, our model clearly surpasses BERT and shows competitive performance with XLNet. This demonstrates that our model can derive sentence-level labels based on the sentiment shifts within sentences and gain a better understanding of the sentiment of the whole text.
4.4 Aspect-level Sentiment Classification
| Task | Aspect Term Sentiment Classification |
| Dataset | SemEval14 (Laptop) | SemEval14 (Restaurant) |
| Amount | 2,163 / 150 / 638 | 3,452 / 150 / 1,120 |
| Task | Aspect Category Sentiment Classification |
| Dataset | SemEval14 (Restaurant) | SemEval16 (Restaurant) |
| Amount | 3,366 / 150 / 973 | 2,150 / 150 / 751 |
Aspect-level sentiment classification is an important task in sentiment analysis. Given the aspect term / aspect category and the corresponding review, this task aims to predict the sentiment of the aspect based on the review, which evaluates the ability to capture the sentiment of specific content. The difference between an aspect term and an aspect category is that the former is a specific term (e.g., fish) while the latter is a coarse-grained category (e.g., food). For aspect term sentiment classification, we chose SemEval2014 Task 4 (laptop and restaurant domains) as the benchmarks, while for aspect category sentiment classification, we used SemEval2014 Task 4 (restaurant domain) and SemEval2016 Task 5 (restaurant domain). The statistics of these benchmarks, including the sizes of the training / validation / test sets, the number of classes, and the number of aspect terms / aspect categories, are shown in Table 4. We followed the existing work Xu et al. (2019) in holding out 150 examples from the training sets for validation.
| Task | Aspect Term Sentiment Classification |
| Dataset | SemEval14 (Laptop) | SemEval14 (Restaurant) |
| Task | Aspect Category Sentiment Classification |
| Dataset | SemEval14 (Restaurant) | SemEval16 (Restaurant) |
We present the results of aspect-level sentiment classification in Table 5. SentiLR outperforms the baselines in both accuracy and Macro-F1 on these datasets, indicating that our model successfully grasps the sentiment of the given aspects. Since the improvement in Macro-F1 is more notable than that in accuracy, our model appears to perform better across all three sentiment classes. Because aspect terms are sparser than aspect categories, our model improves by a larger margin on aspect category sentiment classification than on aspect term sentiment classification.
4.5 General Language Understanding Tasks
To explore whether the performance of SentiLR on common NLP tasks will improve or degrade, we evaluated our model on General Language Understanding Evaluation (GLUE) benchmark Wang et al. (2019), which collects diverse language understanding tasks. We fine-tuned SentiLR on each task of GLUE respectively, and compared its performance with vanilla BERT. Since the test sets of GLUE are not publicly available, we reported the results on development sets in Table 6. Note that we directly used the results of BERT on SST-2, MNLI, QNLI and MRPC which are reported by Devlin et al. (2019) and re-implemented the BERT model fine-tuned on the rest of the tasks by ourselves.
From Table 6, SentiLR indeed achieves better results on sentiment analysis tasks such as SST-2. We also observe that our model outperforms BERT on the CoLA, MRPC, and QNLI tasks, and achieves competitive results on the other datasets. Among these datasets, CoLA requires fine-grained grammaticality judgments for complex syntactic structures Warstadt and Bowman (2019), which may be aided by part-of-speech information. Similarly, QNLI has also been reported to improve with external part-of-speech features Rajpurkar et al. (2016). Thereby, our model, which is able to utilize linguistic knowledge, achieves better performance accordingly.
4.6 Ablation Study
To study the effectiveness of the introduced linguistic knowledge and the label-aware masked language model, we remove the linguistic knowledge and the two sub-tasks of the label-aware masked language model respectively, and present the results in Table 7. Since the two sub-tasks are separate, the -subtask#1/2 setting in Table 7 indicates that all the pre-training data are fed into the other sub-task. Additionally, the -knowledge setting means that we remove the part-of-speech and sentiment polarity embeddings from the input as well as the linguistic knowledge supervision signals in the two sub-tasks.
| Input Sentence | The movie is of [MASK] quality with [MASK] good comments about it. |
| Sentence-level Label | Predicted Sentence |
| 0 | The movie is of poor quality with no good comments about it. |
| 1 | The movie is of low quality with few good comments about it. |
| 2 | The movie is of decent quality with some good comments about it. |
| 3 | The movie is of good quality with several good comments about it. |
| 4 | The movie is of excellent quality with many good comments about it. |
The results in Table 7 show that both the linguistic knowledge and the pre-training task contribute to the final performance. In terms of the effects of the two sub-tasks, they perform comparably on sentence-level classification and aspect term sentiment classification. Nevertheless, sub-task#2 appears more important for aspect category sentiment classification, as the performance degrades severely on SemEval14 (Restaurant) when sub-task#2 is ablated. Considering the impact of knowledge, the performance of SentiLR does not degrade as much as under the setting that removes the pre-training task. This result indicates that SentiLR does not merely depend on the external knowledge from SentiWordNet: the well-designed pre-training task enables the model to exploit the information within contexts even without explicit knowledge and to build a deep connection between labels and words.
4.7 Further Analysis on Label-aware Masked Language Model
Label-aware masked language model plays an important part in SentiLR: it makes our model learn to capture not only the context dependency but also the linguistic knowledge of words. To gain a deeper understanding of how this pre-training task captures the context dependency and the relationship between sentence-level labels and word-level knowledge, we provide some generated cases from the label-aware masked language model after pre-training.
| Input Sentence | This restaurant is really [MASK] regarding its serve. |
| Negative Words | Neutral Words | Positive Words |
Firstly, we show that our pre-trained model can capture the deep relationship between sentence-level labels and sentiment words. Given the same input sentence with one masked word and different sentence-level labels in the form of sentence-level embeddings, our model recovers the masked word with respect to the global sentiment. We calculated the weighted sentiment score via s = Σ_w P(w) · pol(w, p_w), where P(w) is the probability of word w at the [MASK] position computed by SentiLR, p_w indicates the part-of-speech tag predicted by SentiLR, and pol(w, p_w) is obtained from SentiWordNet via the equation in Section 3.2. As this weighted score reveals the sentiment polarity of the model's prediction, we can see from Table 9 that it gradually shifts from negative to positive as the sentence-level label goes from 0 to 4. We also calculated the accumulated generation probabilities of negative, neutral, and positive words, defined by their prior sentiment labels, to show the changes in word usage under fine-grained sentiment settings.
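The weighted score above can be sketched as follows. The toy distribution and polarity function are hypothetical; real values would come from SentiLR's output distribution at the [MASK] position and the SentiWordNet-derived prior polarity of Section 3.2:

```python
def weighted_sentiment_score(word_probs, pos_tags, polarity):
    """Expected sentiment of the [MASK] prediction: sum over candidate words w
    of P(w) times the prior polarity of (w, predicted POS tag of w)."""
    return sum(p * polarity(w, pos_tags[w]) for w, p in word_probs.items())

# Toy example: a distribution leaning toward a positive word.
probs = {"good": 0.75, "bad": 0.25}
tags = {"good": "a", "bad": "a"}
prior = lambda w, pos: {"good": 1.0, "bad": -1.0}[w]
print(weighted_sentiment_score(probs, tags, prior))  # 0.5
```

Because the score is an expectation over the whole candidate distribution, it moves smoothly from negative to positive as the conditioning sentence-level label increases, matching the trend reported in Table 9.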
Secondly, we demonstrate that our model can simultaneously capture the context dependency and the sentiment-related linguistic knowledge. We can see from Table 8 that our model chooses different words at the first [MASK] to satisfy the fine-grained sentence-level labels. Then, our model infers the relationship between the amount of positive comments and the quality of the movie via context dependency and fills the second [MASK] with reasonable quantifiers.
In this paper, we propose a novel pre-trained language representation model called SentiLR, which captures not only the context dependency but also the linguistic knowledge of each word. We introduce linguistic knowledge from SentiWordNet and design a label-aware masked language model to enable our model to utilize this knowledge in sentiment analysis tasks. Experiments show that our model outperforms several state-of-the-art pre-trained language representation models on sentiment analysis tasks.
This work was supported by the National Science Foundation of China key project with grant No. 61936010 and regular project with grant No. 61876096, and the National Key R&D Program of China (Grant No. 2018YFC0830200). This work was also supported by Beijing Academy of Artificial Intelligence, BAAI.
- Alt et al. (2019) Christoph Alt, Marc Hübner, and Leonhard Hennig. 2019. Fine-tuning pre-trained transformer language models to distantly supervised relation extraction. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 1388–1398.
- Baccianella et al. (2010) Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the International Conference on Language Resources and Evaluation.
- Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 632–642.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.
- Dong et al. (2019) Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv: 1905.03197.
- Esuli and Sebastiani (2006) Andrea Esuli and Fabrizio Sebastiani. 2006. SENTIWORDNET: A publicly available lexical resource for opinion mining. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, pages 417–422.
- Guerini et al. (2013) Marco Guerini, Lorenzo Gatti, and Marco Turchi. 2013. Sentiment analysis: How to derive prior polarities from sentiwordnet. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1259–1269.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
- Hu and Liu (2004) Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177.
- Huang et al. (2017) Minlie Huang, Qiao Qian, and Xiaoyan Zhu. 2017. Encoding syntactic knowledge in neural networks for sentiment classification. ACM Transactions on Information Systems, 35(3):26:1–26:27.
- Joshi et al. (2019) Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of International Conference on Learning Representations.
- Lample and Conneau (2019) Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv: 1901.07291.
- Li et al. (2019) Zhongyang Li, Xiao Ding, and Ting Liu. 2019. Story ending prediction by transferable BERT. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pages 1800–1806.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Maas et al. (2011) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150.
- McCann et al. (2017) Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
- Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In 43rd Annual Meeting of the Association for Computational Linguistics, pages 115–124.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.
- Peters et al. (2019) Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. arXiv preprint arXiv: 1909.04164.
- Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2227–2237.
- Qian et al. (2017) Qiao Qian, Minlie Huang, Jinhao Lei, and Xiaoyan Zhu. 2017. Linguistically regularized LSTM for sentiment classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1679–1689.
- Qian et al. (2015) Qiao Qian, Bo Tian, Minlie Huang, Yang Liu, Xuan Zhu, and Xiaoyan Zhu. 2015. Learning tag embeddings and tag-specific composition functions in recursive neural network. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, pages 1365–1374.
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.
- Soares et al. (2019) Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 2895–2905.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.
- Song et al. (2019) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, pages 5926–5936.
- Sun et al. (2019a) Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019a. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223.
- Sun et al. (2019b) Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2019b. ERNIE 2.0: A continual pre-training framework for language understanding. arXiv preprint arXiv:1907.12412.
- Teng et al. (2016) Zhiyang Teng, Duy-Tin Vo, and Yue Zhang. 2016. Context-sensitive lexicon features for neural sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1629–1638.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
- Wang et al. (2019) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
- Warstadt and Bowman (2019) Alex Warstadt and Samuel R. Bowman. 2019. Grammatical analysis of pretrained sentence encoders with acceptability judgments. arXiv preprint arXiv:1901.03438.
- Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1112–1122.
- Wilson et al. (2005) Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 347–354.
- Wu et al. (2019) Xing Wu, Shangwen Lv, Liangjun Zang, Jizhong Han, and Songlin Hu. 2019. Conditional BERT contextual augmentation. In 19th International Conference on Computational Science, pages 84–95.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Xu et al. (2019) Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2019. BERT post-training for review reading comprehension and aspect-based sentiment analysis. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2324–2335.
- Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
- Zhang et al. (2015) Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.
- Zhang et al. (2019) Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 1441–1451.
- Zhu et al. (2014) Xiaodan Zhu, Hongyu Guo, Saif Mohammad, and Svetlana Kiritchenko. 2014. An empirical study on the effect of negation words on sentiment. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 304–313.