Fine-grained Sentiment Classification using BERT
Sentiment classification is an important process in understanding people’s perception towards a product, service, or topic. Many natural language processing models have been proposed to solve the sentiment classification problem. However, most of them have focused on binary sentiment classification. In this paper, we use a promising deep learning model called BERT to solve the fine-grained sentiment classification task. Experiments show that our model outperforms other popular models for this task without sophisticated architecture. We also demonstrate the effectiveness of transfer learning in natural language processing in the process.
Sentiment classification is a form of text classification in which a piece of text has to be classified into one of the predefined sentiment classes. It is a supervised machine learning problem. In binary sentiment classification, the possible classes are positive and negative. In fine-grained sentiment classification, there are five classes (very negative, negative, neutral, positive, and very positive). Fig 1 shows a black-box view of a fine-grained sentiment classifier model.
A sentiment classification model, like any other machine learning model, requires its input to be a fixed-size vector of numbers. Therefore, we need to convert a text (a sequence of words represented as ASCII or Unicode) into a fixed-size vector that encodes the meaningful information of the text. Many statistical and deep learning NLP models have been proposed just for that. Recently, there has been an explosion of developments in NLP as well as in other deep learning architectures.
While transfer learning (pretraining and finetuning) has become the de-facto standard in computer vision, NLP is yet to utilize this concept fully. However, neural language models such as word vectors, paragraph vectors, and GloVe have started the transfer learning revolution in NLP. Recently, Google researchers published BERT (Bidirectional Encoder Representations from Transformers), a deep bidirectional language model based on the Transformer architecture, and advanced the state-of-the-art in many popular NLP tasks. In this paper, we use the pretrained BERT model and fine-tune it for the fine-grained sentiment classification task on the Stanford Sentiment Treebank (SST) dataset.
The rest of the paper is organized into six sections. In Section II, we mention our motivation for this work. In Section III, we discuss related works. In Section IV, we describe the dataset we performed our experiments on. We explain our model architecture and methodology in detail in Section V. Then we present and analyze our results in Section VI. Finally, we provide our concluding remarks in Section VII.
We have been working on replicating results from different research papers on sentiment analysis, especially on the fine-grained Stanford Sentiment Treebank (SST) dataset. Since BERT became popular, researchers have applied it to different NLP tasks, including binary sentiment classification on the SST-2 (binary) dataset, and they were able to obtain state-of-the-art results as well. But we haven’t yet found any experimentation done using BERT on the SST-5 (fine-grained) dataset. Because BERT is so powerful, fast, and easy to use for downstream tasks, it is likely to give promising results on the SST-5 dataset as well. This became the main motivation for pursuing this work.
III Related Work
Sentiment classification is one of the most popular tasks in NLP, and so there has been a lot of research and progress in solving this task accurately. Most of the approaches have focused on binary sentiment classification, most probably because there are large public datasets for it such as the IMDb movie review dataset. In this section, we only discuss some significant deep learning NLP approaches applied to sentiment classification.
The first step in sentiment classification of a text is the embedding, where a text is converted into a fixed-size vector. Since the number of words in the vocabulary after tokenization and stemming is limited, researchers first tackled the problem of learning word embeddings. The first promising language model was proposed by Mikolov et al. They trained continuous semantic representations of words from large unlabeled text that could be fine-tuned for downstream tasks. Pennington et al. used a co-occurrence matrix and only trained on non-zero elements to efficiently learn semantic word embeddings. Bojanowski et al. broke words into character n-grams for a smaller vocabulary size and faster training.
The next step is to combine a variable number of word vectors into a single fixed-size document vector. The trivial way is to take the sum or the average, but these operations lose the ordering information of words and thus don’t give good results. Socher et al. used recursive neural networks to compute vector representations of sentences by utilizing the intrinsic tree structure of natural language sentences. Socher et al. introduced a tensor-based compositionality function for better interaction between child nodes in recursive networks. They also introduced the Stanford Sentiment Treebank (SST) dataset for fine-grained sentiment classification. Tai et al. applied various forms of long short-term memory (LSTM) networks and Kim applied convolutional neural networks (CNN) towards sentiment classification.
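To make the point about lost word order concrete, here is a minimal sketch. The three-dimensional word vectors are made-up values chosen only for illustration, and `average_vector` is a hypothetical helper, not part of any model discussed in this paper.

```python
# Toy 3-dimensional word vectors; the values are invented for illustration.
EMBEDDINGS = {
    "dog": [1.0, 0.0, 2.0],
    "bites": [0.0, 3.0, 1.0],
    "man": [2.0, 1.0, 0.0],
}


def average_vector(sentence):
    """Average the word vectors of a whitespace-tokenized sentence."""
    vectors = [EMBEDDINGS[word] for word in sentence.split()]
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


a = average_vector("dog bites man")
b = average_vector("man bites dog")
print(a == b)  # the two sentences receive the same document vector
```

Because averaging is symmetric in its inputs, any reordering of the same words yields the same vector, which is why order-aware composition functions are needed.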
All of the approaches mentioned above are context-free, i.e., they generate a single word embedding for each word in the vocabulary. For instance, “bank” would have the same representation in “bank deposit” and “river bank”. Recent language model research has been trying to train contextual embeddings. Peters et al. extracted context-sensitive features from left-to-right and right-to-left LSTM-based language models. Devlin et al. proposed BERT (Bidirectional Encoder Representations from Transformers), an attention-based Transformer architecture, to train deep bidirectional representations from unlabeled texts. Their architecture not only obtains state-of-the-art results on many NLP tasks but also allows a high degree of parallelism since it is not based on sequential or recurrent connections.
The Stanford Sentiment Treebank (SST) is one of the most popular publicly available datasets for the fine-grained sentiment classification task. It contains 11,855 one-sentence movie reviews extracted from Rotten Tomatoes. Each sentence is also parsed by the Stanford constituency parser into a tree structure with the whole sentence as the root node and the individual words as leaf nodes. Moreover, each node is labeled by at least three humans. In total, SST contains 215,154 unique manually labeled texts of varying lengths. Fig 2 shows a sample review from the SST dataset in a parse-tree structure with all its nodes labeled. Therefore, this dataset can be used to train models to learn the sentiment of words, phrases, and sentences together.
There are five sentiment labels in SST: 0 (very negative), 1 (negative), 2 (neutral), 3 (positive), and 4 (very positive). If we only consider positivity and negativity, we get the binary SST-2 dataset. If we consider all five labels, we get SST-5. For this research, we evaluate the performance of various models on all nodes as well as on just the root nodes, and on both SST-2 and SST-5.
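The relation between the two label sets can be sketched as follows. This assumes the common convention that neutral samples are dropped and the two negative (or positive) grades are merged; the text above only says that positivity and negativity are considered, so the exact rule is an assumption, and the sample texts are invented.

```python
def to_binary_label(fine_label):
    """Map an SST-5 label (0..4) to 0 (negative) or 1 (positive); None for neutral."""
    if fine_label not in range(5):
        raise ValueError(f"expected a label in 0..4, got {fine_label}")
    if fine_label == 2:
        return None  # assumption: neutral samples are excluded from the binary task
    return 0 if fine_label < 2 else 1


# Invented (text, SST-5 label) pairs, for illustration only.
samples = [("a masterpiece", 4), ("it is fine", 2), ("dull and lifeless", 0)]
binary = [
    (text, to_binary_label(label))
    for text, label in samples
    if to_binary_label(label) is not None
]
print(binary)
```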
Sentiment classification takes a natural language text as input and outputs a sentiment score. Our method has three stages from input sentence to output score, which are described below. We use the pretrained BERT model to build a sentiment classifier. Therefore, in this section, we briefly explain BERT and then describe our model architecture.
BERT (Bidirectional Encoder Representations from Transformers) is an embedding layer designed to train deep bidirectional representations from unlabeled texts by jointly conditioning on both left and right context in all layers. It is pretrained on a large unlabeled text corpus (such as a Wikipedia dump or BookCorpus) using the following objectives:
Masked word prediction: In this task, 15% of the words in the input sequence are masked out, the entire sequence is fed to a deep bidirectional Transformer encoder, and then the model learns to predict the masked words.
Next sentence prediction: To learn the relationship between sentences, BERT takes two sentences A and B as inputs and learns to classify whether B actually follows A or is just a random sentence.
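As a rough illustration of the first objective, the sketch below builds one masked training example. It is a simplification under stated assumptions: every selected word is replaced by a `[MASK]` placeholder, the function name `mask_tokens` and its parameters are hypothetical, and the actual pretraining procedure may treat the selected positions differently.

```python
import random

MASK = "[MASK]"  # placeholder name assumed for this sketch


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, targets); targets maps position -> original token."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_to_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # what the model must predict
        masked[pos] = MASK          # what the model actually sees
    return masked, targets


sentence = "the film is a charming and often affecting journey".split()
masked, targets = mask_tokens(sentence)
print(masked)
print(targets)
```

The model receives `masked` as input and is trained to recover the entries of `targets`, so it must use context on both sides of each masked position.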
Unlike traditional sequential or recurrent models, the attention architecture processes the whole input sequence at once, enabling all input tokens to be processed in parallel. The layers of the BERT architecture are visualized in Fig 3. A pretrained BERT model can be fine-tuned with just one additional layer to obtain state-of-the-art results in a wide range of NLP tasks.
There are two variants of BERT models: BERT_BASE and BERT_LARGE. The differences between them are listed in Table I.
|Property||BERT_BASE||BERT_LARGE|
|No. of layers (Transformer blocks)||12||24|
|No. of hidden units||768||1024|
|No. of self-attention heads||12||16|
|Total trainable parameters||110M||340M|
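The parameter totals in Table I can be roughly reproduced from the layer count and hidden size. The sketch below is a back-of-the-envelope estimate, not an official accounting: the vocabulary size (30522), maximum sequence length (512), two segment types, a feed-forward width of four times the hidden size, and a final pooling layer are all assumptions that do not appear in this paper.

```python
def bert_parameter_count(num_layers, hidden_size,
                         vocab_size=30522, max_positions=512, type_vocab=2):
    """Estimate trainable parameters of a BERT-style encoder (assumed layout)."""
    h = hidden_size
    ffn = 4 * h          # assumed feed-forward width
    layer_norm = 2 * h   # scale and bias

    # Token, position, and segment embeddings followed by a layer norm.
    embeddings = (vocab_size + max_positions + type_vocab) * h + layer_norm

    # Query, key, value, and output projections (weights + biases) and a layer norm.
    attention = 4 * (h * h + h) + layer_norm
    # Two dense layers (weights + biases) and a layer norm.
    feed_forward = (h * ffn + ffn) + (ffn * h + h) + layer_norm

    pooler = h * h + h   # assumed dense layer over the [CLS] output
    return embeddings + num_layers * (attention + feed_forward) + pooler


base = bert_parameter_count(num_layers=12, hidden_size=768)
large = bert_parameter_count(num_layers=24, hidden_size=1024)
print(f"base: {base / 1e6:.1f}M, large: {large / 1e6:.1f}M")
```

Under these assumptions the estimates land close to the rounded figures in the table. The number of attention heads does not appear in the formula because the heads split the hidden dimension rather than adding to it.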
V-A1 Input format
BERT requires its input token sequence to have a certain format. The first token of every sequence should be [CLS] (classification token) and there should be a [SEP] token (separation token) after every sentence. The output embedding corresponding to the [CLS] token is the sequence embedding that can be used for classifying the whole sequence.
We perform the following preprocessing steps on the review text before we feed it into our model.
First, we remove all the digits, punctuation symbols and accent marks, and convert everything to lowercase.
We then tokenize the text using the WordPiece tokenizer. It breaks words down into their prefix, root, and suffix to handle unseen words better. For example, playing → play + ##ing.
V-B3 Special token addition
Finally, we add the [CLS] and [SEP] tokens at the appropriate positions.
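The three preprocessing steps above can be sketched end to end as follows. This is an illustrative approximation rather than the tokenizer actually used: `VOCAB` is a tiny hypothetical vocabulary, the greedy longest-match rule is a common way to apply a WordPiece vocabulary, and the `[UNK]` fallback token is an assumption.

```python
import string
import unicodedata

# Hypothetical toy vocabulary; a real WordPiece vocabulary is far larger.
VOCAB = {"play", "##ing", "##ed", "the", "movie", "was", "great"}


def canonicalize(text):
    """Lowercase the text and drop digits, punctuation, and accent marks."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    kept = []
    for ch in decomposed:
        if unicodedata.combining(ch):  # accent marks separated by NFD
            continue
        if ch.isdigit() or ch in string.punctuation:
            continue
        kept.append(ch)
    return "".join(kept)


def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # assumed fallback for words that cannot be split
        pieces.append(piece)
        start = end
    return pieces


def preprocess(text):
    """Canonicalize, tokenize, and wrap the sequence in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in canonicalize(text).split():
        tokens.extend(wordpiece(word))
    tokens.append("[SEP]")
    return tokens


print(preprocess("The movie was GREAT, playing 2 times!"))
```

With this toy vocabulary, "playing" is split into play and ##ing as in the example above, while a word with no matching pieces falls back to the assumed `[UNK]` token.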
V-C Proposed Architecture
We build a simple architecture with just a dropout regularization layer and a softmax classifier layer on top of the pretrained BERT layer to demonstrate that BERT can produce great results even without any sophisticated task-specific architecture.
Fig 4 shows the overall architecture of our model. There are four main stages. The first is the preprocessing step as described earlier. Then we compute the sequence embedding from BERT. We then apply dropout with a fixed probability factor to regularize and prevent overfitting. Dropout is only applied in the training phase and not in the inference phase. Finally, the softmax classification layer outputs the probabilities of the input text belonging to each of the class labels such that the sum of the probabilities is 1. The softmax layer is just a fully connected neural network layer with the softmax activation function. The softmax function is given in (1).
$$\mathrm{softmax}(a_i) = \frac{e^{a_i}}{\sum_{j} e^{a_j}} \tag{1}$$
where $a_i$ is the $i$-th intermediate output of the softmax layer (also called logits). The output node with the highest probability is then chosen as the predicted label for the input.
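The classification head described above can be sketched in plain Python as follows. This is a toy illustration, not our training code: the embedding, weights, and dropout rate are made-up values, the function names are hypothetical, and the rescaling of kept units during training ("inverted" dropout) is a common convention assumed here.

```python
import math
import random


def dropout(vector, p, training, rng=random):
    """Zero each unit with probability p while training; identity at inference."""
    if not training or p == 0.0:
        return list(vector)
    keep = 1.0 - p
    # Kept units are rescaled so the expected activation is unchanged.
    return [x / keep if rng.random() < keep else 0.0 for x in vector]


def softmax(logits):
    """Softmax over a list of logits, shifted by the max for numerical stability."""
    shift = max(logits)
    exps = [math.exp(a - shift) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embedding, weights, biases, p, training=False):
    """Dropout, then a fully connected layer, then softmax and argmax."""
    hidden = dropout(embedding, p, training)
    logits = [sum(w * x for w, x in zip(row, hidden)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs


# Toy 3-dimensional "sequence embedding" and a 5-class layer (invented values).
embedding = [0.2, -0.1, 0.4]
weights = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1]]
biases = [0.0] * 5
label, probs = classify(embedding, weights, biases, p=0.5, training=False)
print(label, round(sum(probs), 6))
```

Subtracting the largest logit before exponentiating does not change the result of the softmax but avoids overflow for large logits.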
VI Experiments and Results
In this section, we discuss the results of our model and compare it with some of the popular models that solve the same problem, i.e., sentiment classification on the SST dataset.
VI-A Comparison Models
VI-A1 Word embeddings
In this method, the word vectors pretrained on large text corpus such as Wikipedia dump are averaged to get the document vector, which is then fed to the sentiment classifier to compute the sentiment score.
VI-A2 Recursive networks
Various types of recursive neural networks (RNN) have been applied on SST. We compare our results with the standard RNN and the more sophisticated RNTN. Both of them were trained on SST from scratch, without pretraining.
VI-A3 Recurrent networks
Sophisticated recurrent networks such as left-to-right and bidirectional LSTM networks have also been applied to SST.
VI-A4 Convolutional networks
In this approach, the input sequences were passed through a 1-dimensional convolutional neural network as feature extractors.
VI-B Evaluation Metric
Since the dataset has a roughly balanced number of samples across all classes, we directly use the accuracy measure to evaluate the performance of our model and compare it with other models. The accuracy is defined simply as follows:
$$\text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$$
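In code, the accuracy measure amounts to counting matches between predicted and true labels. The helper name `accuracy` and the example labels below are illustrative only.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Invented SST-5 style labels: three of the four predictions are correct.
print(accuracy([4, 2, 0, 3], [4, 1, 0, 3]))  # 0.75
```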
The results and comparisons are shown in Table II. It shows the accuracy of various models on SST-2 and SST-5. It includes results for all phrases as well as for just the root (whole review). We can see that our model, despite being a simple architecture, performs better in terms of accuracy than many popular and sophisticated NLP models.
|Model||SST-2 All||SST-2 Root||SST-5 All||SST-5 Root|
|Avg word vectors||85.1||80.1||73.3||32.7|
Some values are blank in the “All” columns because the original authors of those papers did not publish their results on all phrases.
In this paper, we used the pretrained BERT model and fine-tuned it for the fine-grained sentiment classification task on the SST dataset. Even with such a simple downstream architecture, our model was able to outperform complicated architectures like recursive, recurrent, and convolutional neural networks. Thus, we have demonstrated the transfer learning capability in NLP enabled by deep contextual language models like BERT.
We would like to express our gratitude towards Prof. Dr. Shashidhar Ram Joshi for his invaluable advice and guidance on this paper. We also thank all the helpers and reviewers for their valuable input to this work.
- P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146.
- D. Chen and C. D. Manning (2014) A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 740–750.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
- Y. Kim (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
- Q. Le and T. Mikolov (2014) Distributed representations of sentences and documents. In International Conference on Machine Learning, pp. 1188–1196.
- A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts (2011) Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150.
- T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. CoRR abs/1301.3781.
- J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
- M. Schuster and K. Nakajima (2012) Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149–5152.
- R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning (2011) Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 151–161.
- R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1631–1642.
- N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- K. S. Tai, R. Socher, and C. D. Manning (2015) Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1556–1566.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.