Aspect and Opinion Term Extraction for Aspect Based Sentiment Analysis of Hotel Reviews Using Transfer Learning
One of the tasks in aspect-based sentiment analysis is to extract aspect and opinion terms from review text. Our study focuses on evaluating transfer learning with BERT (Devlin et al., 2019) to classify tokens from hotel reviews in bahasa Indonesia. We show that the default BERT model fails to outperform a simple argmax method. However, replacing the default BERT tokenizer with our custom one improves the scores on our labels of interest by at least 5%. For I-ASPECT and B-SENTIMENT, it even increases the scores by 11%. On entity-level evaluation, our tokenizer tweak achieves scores of 87% and 89% for the ASPECT and SENTIMENT labels respectively. These scores are only 2% away from the best model by Fernando et al. (2019), but with much less training effort (8 vs 200 epochs).
Sentiment analysis (Pang et al., 2008) of review text usually involves multiple aspects. For instance, the following review talks about the location, room, and staff aspects of a hotel: “Excellent location to the Tower of London. We also walked to several other areas of interest; albeit a bit of a trek if you don’t mind walking. The room was a typical hotel room in need of a refresh, however clean. The staff couldn’t have been more professional, they really helped get us a taxi when our pre arranged pickup ran late.” In this review, some of the sentiment terms are “excellent”, “typical”, “clean”, and “professional”.
In this study, we focus on aspect and opinion term extraction from reviews as a step towards aspect-based sentiment analysis (Liu and Zhang, 2012). While some work has been done on this task (Wang et al., 2017; Fernando et al., 2019; Xue and Li, 2018), we have not seen a transfer learning approach (Ruder, 2019) employed, which should require much less training effort. Transfer learning is especially helpful for low-resource languages (Kocmi and Bojar, 2018), such as bahasa Indonesia.
Our main contribution is evaluating BERT (Devlin et al., 2019), a pretrained transformer model, on this token classification task on hotel reviews in bahasa Indonesia. We also found that the pretrained BERT tokenizer encodes bahasa Indonesia poorly, so we propose our own custom tokenizer. Finally, we provide simpler baselines for comparison, namely an argmax method and logistic regression on word embeddings.
For this aspect and opinion term extraction task, we use tokenized and annotated hotel reviews from Airy Rooms (https://www.airyrooms.com/) provided by Fernando et al. (2019) (https://github.com/jordhy97/final_project). The dataset consists of 5000 reviews in bahasa Indonesia, divided into training and test sets of 4000 and 1000 reviews respectively. The label distribution of the tokens in the BIO scheme can be seen in Table 1. In addition, we also consider the task at the entity level, i.e. with ASPECT, SENTIMENT, and OTHER labels.
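To illustrate the BIO scheme and its entity-level view, consider the following sketch; the example sentence and its labels are hypothetical, written in the style of the dataset rather than taken from it:

```python
# Hypothetical BIO-labeled review fragment: "kamar mandi nya bersih"
# ("the bathroom is clean"). B-/I- mark the beginning/inside of a term,
# O marks tokens outside any aspect or sentiment term.
tokens = ["kamar", "mandi", "nya", "bersih"]
bio_labels = ["B-ASPECT", "I-ASPECT", "O", "B-SENTIMENT"]

# Collapse BIO labels to entity-level labels (ASPECT / SENTIMENT / OTHER).
entity_labels = [
    lab.split("-", 1)[1] if "-" in lab else "OTHER" for lab in bio_labels
]
print(entity_labels)  # ['ASPECT', 'ASPECT', 'OTHER', 'SENTIMENT']
```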
We found that there are 1643 and 809 unique tokens in the training and test sets respectively. Moreover, 75.4% of the unique tokens in the test set can be found in the training set.
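The overlap statistic above can be computed with a few lines of Python; the token lists below are toy stand-ins for the flattened token sequences of the training and test reviews:

```python
# Sketch of the vocabulary-overlap computation: what fraction of unique
# test tokens also appears in the training vocabulary?
train_tokens = ["kamar", "bersih", "ac", "dingin", "kamar", "wifi"]
test_tokens = ["kamar", "bersih", "handuk", "wifi"]

train_vocab = set(train_tokens)
test_vocab = set(test_tokens)

overlap = len(test_vocab & train_vocab) / len(test_vocab)
print(f"{overlap:.1%} of unique test tokens appear in the training set")
```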
As baselines, we employed two methods: a simple argmax method and logistic regression on word embeddings from fastText (Bojanowski et al., 2017). The argmax method classifies a token as the label it most frequently takes in the training set. For fastText, we used the skip-gram model to produce 100-dimensional vectors.
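The argmax baseline can be sketched as follows; the training pairs here are toy examples, while in practice the counter is filled from the annotated training set:

```python
from collections import Counter, defaultdict

# Count how often each token appears with each label in the training data
# (toy training pairs; real counts come from the annotated corpus).
train_pairs = [
    ("kamar", "B-ASPECT"), ("kamar", "B-ASPECT"), ("kamar", "O"),
    ("bersih", "B-SENTIMENT"), ("dan", "O"),
]
label_counts = defaultdict(Counter)
for token, label in train_pairs:
    label_counts[token][label] += 1

def argmax_label(token, fallback="O"):
    """Predict the most frequent training label; unseen tokens get the fallback."""
    counts = label_counts.get(token)
    return counts.most_common(1)[0][0] if counts else fallback

print([argmax_label(t) for t in ["kamar", "bersih", "handuk"]])
# ['B-ASPECT', 'B-SENTIMENT', 'O']
```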
We propose to use transfer learning from the pretrained BERT-Base, Multilingual Cased model (Devlin et al., 2019) for this token classification problem, using the PyTorch implementation by Hugging Face (2019) (https://github.com/huggingface/pytorch-transformers). We found that the multilingual cased tokenizer of BERT does not recognize some common terms in our dataset, such as “kamar” (room), “kendala” (issue), “wifi”, “koneksi” (connection), “bagus” (good), and “bersih” (clean). In the training and validation sets, we found 24,370 unknown tokens. Thus, we encode the tokens ourselves so that there are no unknown tokens. For the rest of this paper, we call this model BERT-custom. Since the labels are imbalanced, we use the F1-score as the evaluation metric, which is defined as:

F1 = (2 · precision · recall) / (precision + recall)
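Our custom encoding can be sketched as a vocabulary built directly from the already-tokenized training corpus, so that every training token maps to its own id instead of an unknown-token placeholder; this is a simplified illustration, not the exact implementation:

```python
# Simplified sketch of a custom token encoder: build the vocabulary from the
# tokenized training corpus so no training token is mapped to [UNK].
PAD, UNK = "[PAD]", "[UNK]"

def build_vocab(token_lists):
    vocab = {PAD: 0, UNK: 1}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(tokens, vocab):
    return [vocab.get(token, vocab[UNK]) for token in tokens]

vocab = build_vocab([["kamar", "bersih"], ["kendala", "wifi", "kamar"]])
print(encode(["kamar", "bagus"], vocab))  # "bagus" is out of vocabulary: [2, 1]
```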
Our experiment setup for BERT and BERT-custom uses the Adam optimizer (Kingma and Ba, 2015) for 5 epochs. The batch size is 32 and we optimize the cross-entropy loss. We split the training set 70:30 into training and validation subsets to tune the hyperparameters, then train on the whole training set before applying the model to the test set.
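The 70:30 split for hyperparameter tuning can be sketched as a plain shuffled split; the seed and the integer stand-ins for reviews are arbitrary choices for illustration:

```python
import random

def train_val_split(examples, val_ratio=0.3, seed=42):
    """Shuffle and split examples into training and validation subsets."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    return shuffled[n_val:], shuffled[:n_val]

reviews = list(range(4000))  # stand-ins for the 4000 training reviews
train, val = train_val_split(reviews)
print(len(train), len(val))  # 2800 1200
```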
3 Results and discussion
The results of our experiments with BIO scheme labels are summarized in Table 2. Using the default tokenizer cannot beat the baseline scores for the B-ASPECT and B-SENTIMENT labels. However, changing the tokenizer improves the scores by at least 5%; for I-ASPECT and B-SENTIMENT, it increases the scores by 11%. Note that Fernando et al. (2019) trained their model for 200 epochs, while we only use 5. We also found that simply using word embeddings (fastText) is not suitable for this task, since it fails to outperform even the simple argmax method. Furthermore, we can see in Figure 1 that the model overfits after about 12 iterations (mini-batches).
[Table 2 also lists the model of Fernando et al. (2019), with scores of 0.916, 0.873, 0.939, 0.886, and 0.957.]
Unlike conditional random fields (CRF), BERT does not constrain the output labels. Thus, one might see I-ASPECT or I-SENTIMENT without a preceding B-ASPECT or B-SENTIMENT. In our case, we found 735 invalid BIO cases when using the default BERT tokenizer and 12 invalid BIO cases when using our custom tokenizer. Some examples of sentences with invalid token labels are “…termasuk(O) kamar(O) mandi(I-ASPECT) nya(I-ASPECT)…” (“…including the bathroom…”) and “…lantai(O) 3(O) tidak(I-ASPECT) ada(I-ASPECT)…” (“…3rd floor does not have…”).
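The invalid-transition count can be obtained with a simple check: an I-X tag is invalid unless the previous tag is B-X or I-X of the same entity type. A sketch under that definition:

```python
def invalid_bio_positions(labels):
    """Return indices where an I- tag is not preceded by a matching B-/I- tag."""
    invalid = []
    prev = "O"
    for i, label in enumerate(labels):
        if label.startswith("I-"):
            entity = label[2:]
            if prev not in (f"B-{entity}", f"I-{entity}"):
                invalid.append(i)
        prev = label
    return invalid

# First invalid example from the text: "termasuk kamar mandi nya"
labels = ["O", "O", "I-ASPECT", "I-ASPECT"]
print(invalid_bio_positions(labels))  # [2]
```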
[Table 3 also lists the model of Fernando et al. (2019), with scores of 0.89 and 0.91.]
Table 3 shows the performance at the entity level. We are only interested in evaluating the ASPECT and SENTIMENT labels, although the models were actually trained with all three labels. In this case, we increased the number of epochs to 8 since it yielded higher scores.
It is interesting that BERT does not even beat argmax in this simplified setting. Nevertheless, changing the default BERT tokenizer is beneficial here as well: the BERT-custom model outperforms argmax by more than 5% on our labels of interest and falls only 2% short of the results by Fernando et al. (2019).
4 Related work
Wang et al. (2017) summarized several studies on aspect and opinion terms extraction. Some of the methods used are association rule mining (Hu and Liu, 2004), dependency rule parsers (Qiu et al., 2011), conditional random fields (CRF) and hidden Markov model (HMM) (Li et al., 2010; Jin et al., 2009), topic modelling (Chen et al., 2014; Zhao et al., 2010), and deep learning (Fernando et al., 2019; Wang et al., 2017; Xue et al., 2017; Xue and Li, 2018).
Fernando et al. (2019) combine the idea of coupled multi-layer attentions (CMLA) by Wang et al. (2017) and the double embeddings by Xue and Li (2018) for aspect and opinion term extraction on SemEval. The work by Xue and Li (2018) is itself an improvement over their prior work on the same task (Xue et al., 2017). Thus, we only compare against Fernando et al. (2019), since they show that the best results come from combining the approaches of Wang et al. (2017) and Xue and Li (2018).
In their paper, Devlin et al. (2019) show that they can achieve state-of-the-art performance not only on sentence-level but also on token-level tasks, such as named entity recognition (NER). This motivated us to explore BERT in our study, since it requires no dependency parsers or feature engineering.
5 Conclusions and future work
Our work shows that BERT can achieve scores of more than 80% on the aspect and opinion term extraction task with the BIO scheme on noisy bahasa Indonesia text, simply by changing the default tokenizer to produce fewer unknown tokens. For both the BIO scheme and the aspect/sentiment/other labels, this simple tweak results in a more than 5% absolute increase in scores on our labels of interest. On entity-level evaluation, changing the default tokenizer yields around an 8% absolute increase in scores.
In the future, we are aiming to compare several transformer-based models, such as XLNet (Yang et al., 2019), XLM (Lample and Conneau, 2019), and RoBERTa (Liu et al., 2019) when they are trained using multilingual datasets that include text in bahasa Indonesia as well. We also plan to fine-tune those models with richer text in bahasa Indonesia to reduce the number of unknown tokens. Furthermore, it is also necessary to evaluate the same task on different datasets.
- Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
- Chen et al. (2014) Zhiyuan Chen, Arjun Mukherjee, and Bing Liu. 2014. Aspect extraction with automated prior knowledge learning. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 347–358.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Fernando et al. (2019) Jordhy Fernando, Masayu Leylia Khodra, and Ali Akbar Septiandri. 2019. Aspect and Opinion Terms Extraction Using Double Embeddings and Attention Mechanism for Indonesian Hotel Reviews. arXiv preprint arXiv:1908.04899.
- Hu and Liu (2004) Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, pages 168–177, New York, NY, USA. ACM.
- Hugging Face (2019) Hugging Face. 2019. PyTorch-Transformers.
- Jin et al. (2009) Wei Jin, Hung Hay Ho, and Rohini K Srihari. 2009. A novel lexicalized HMM-based learning framework for web opinion mining. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 465–472. Citeseer.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR 2015.
- Kocmi and Bojar (2018) Tom Kocmi and Ondřej Bojar. 2018. Trivial transfer learning for low-resource neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 244–252, Belgium, Brussels. Association for Computational Linguistics.
- Lample and Conneau (2019) Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
- Li et al. (2010) Fangtao Li, Chao Han, Minlie Huang, Xiaoyan Zhu, Ying-Ju Xia, Shu Zhang, and Hao Yu. 2010. Structure-aware review mining and summarization. In Proceedings of the 23rd international conference on computational linguistics, pages 653–661. Association for Computational Linguistics.
- Liu and Zhang (2012) Bing Liu and Lei Zhang. 2012. A survey of opinion mining and sentiment analysis. In Mining text data, pages 415–463. Springer.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Pang et al. (2008) Bo Pang, Lillian Lee, et al. 2008. Opinion mining and sentiment analysis. Foundations and Trends® in Information Retrieval, 2(1–2):1–135.
- Qiu et al. (2011) Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. 2011. Opinion word expansion and target extraction through double propagation. Computational linguistics, 37(1):9–27.
- Ruder (2019) Sebastian Ruder. 2019. Neural Transfer Learning for Natural Language Processing. Ph.D. thesis, National University of Ireland, Galway.
- Wang et al. (2017) Wenya Wang, Sinno Jialin Pan, Daniel Dahlmeier, and Xiaokui Xiao. 2017. Coupled multi-layer attentions for co-extraction of aspect and opinion terms. In Thirty-First AAAI Conference on Artificial Intelligence.
- Xue and Li (2018) Wei Xue and Tao Li. 2018. Aspect based sentiment analysis with gated convolutional networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2514–2523, Melbourne, Australia. Association for Computational Linguistics.
- Xue et al. (2017) Wei Xue, Wubai Zhou, Tao Li, and Qing Wang. 2017. MTNA: A neural multi-task model for aspect category classification and aspect term extraction on restaurant reviews. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 151–156.
- Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
- Zhao et al. (2010) Wayne Xin Zhao, Jing Jiang, Hongfei Yan, and Xiaoming Li. 2010. Jointly modeling aspects and opinions with a maxent-lda hybrid. In Proceedings of the 2010 conference on empirical methods in natural language processing, pages 56–65. Association for Computational Linguistics.