Semi-Unsupervised Lifelong Learning for Sentiment Classification: Less Manual Data Annotation and More Self-Studying
Lifelong machine learning is a novel machine learning paradigm that continually accumulates knowledge during learning. Its ability to extract and reuse knowledge enables it to solve related problems. Traditional approaches such as Naïve Bayes and some neural-network-based approaches aim only to achieve the best performance on a single task. In contrast, the lifelong machine learning studied in this paper focuses on how to accumulate knowledge during learning and leverage it for further tasks. Meanwhile, knowledge reuse also significantly decreases the demand for labeled training data. This paper suggests that the aim of lifelong learning is to use less labeled data and less computation to achieve performance as good as, or even better than, supervised learning.
Over the past 30 years, machine learning has developed significantly. However, we are still in an era of “Weak AI” rather than “Strong AI”. Current machine learning algorithms only know how to solve a specific problem and have no idea what to do when they meet related problems. Hence, lifelong machine learning (referred to as lifelong learning or “LML” below) (Thrun, 1998) was proposed to solve an infinite sequence of related tasks by accumulating and reusing knowledge. For related problems, an integrated model with knowledge reuse could decrease the cost of sample annotation.
For instance, in sentiment classification we need to predict the sentiment (positive or negative) of a sentence or a document. For different sentiment classification tasks, traditional approaches need to train an independent model on each domain to obtain the best performance. Hence, for each domain we need to collect labeled data for supervised learning. In this way, the algorithm will never know how to solve a problem without new labeled data. This is typical “weak AI”.
To achieve the goal of “strong AI”, we need to change our learning goal to truly understanding the sentiment of words, which means that the algorithm should know how each word influences the sentiment of a document in different tasks. If we can achieve this learning goal, the algorithm will be able to solve new tasks without being taught. Chen et al. (2015) proposed an approach that moves toward this goal. They made great progress, but supervised learning is still needed. Lv et al. (2019) extended the work of Chen et al. (2015) with a neural-network-based approach. However, supervised learning is still necessary under their setting, and a huge volume of labeled data is required. Hence, this paper aims to decrease the use of labeled data while maintaining performance.
2. Lifelong Machine Learning
The term lifelong machine learning was first used by Thrun in 1995 (Thrun and Mitchell, 1995; Thrun, 1996). Ruvolo and Eaton later proposed the Efficient Lifelong Learning Algorithm (ELLA) (Ruvolo and Eaton, 2013). Compared with multi-task learning (Caruana, 1997), ELLA is much more efficient. Chen et al. (2015) improved sentiment classification by incorporating knowledge. The objective function was modified with two penalty terms that correspond to previous tasks.
2.1. Components of LML
The knowledge system contains the following components:
Knowledge Base (KB): The Knowledge Base (Chen et al., 2015) is mainly used to maintain previous knowledge. Based on the type of knowledge, it can be divided into the Past Information Store (PIS), the Meta-Knowledge Miner (MKM) and the Meta-Knowledge Store (MKS).
Knowledge Reasoner (KR): The Knowledge Reasoner is designed to generate new knowledge from the archived knowledge by logical inference. It requires a strict logic design, so most LML algorithms lack this component.
Knowledge-Based Learner (KBL): The Knowledge-Based Learner (Chen et al., 2015) aims to retrieve previous knowledge and transfer it to the current task. Hence, it contains two parts: a task knowledge miner and a learner. The miner seeks and determines which knowledge can be reused, and the learner transfers that knowledge to the current task.
2.2. Sentiment Classification
Hong et al. (2018) argued that the NLP field is the most suitable for lifelong machine learning research because its knowledge is easy to extract and easy for humans to understand. The classic previous paper (Chen et al., 2015) chose sentiment classification as the learning target because it can be regarded both as one large task and as a group of related sub-tasks in different domains. Although these sub-tasks are related to each other, a model trained on only a single sub-task is unable to perform well on the remaining sub-tasks. This requires the algorithm to know when knowledge can be reused and when it cannot, because the distribution of each sub-task is different. Knowing this, an algorithm can be called “lifelong” because it is able to transfer previous knowledge to new tasks to improve performance.
Although deep learning has already been applied to sentiment classification, it still cannot leverage past knowledge well. This is because the complexity of neural networks limits researchers' ability to define and extract knowledge from the data. Following previous work (Chen et al., 2015), this paper also uses Naïve Bayes, because the knowledge can be represented by probabilities. In this way, we need to know the probability that each word appears in positive or negative content. We also need to be aware that some words may have sentiment polarity only in specific domains (a domain is equivalent to a task in this paper). “Lifelong Sentiment Classification” (“LSC” for short below) (Chen et al., 2015) records the domains in which a word has a sentiment orientation. If a word always has sentiment polarity, or has significant polarity in the current domain, it is assigned a higher weight than other words. This approach contains a knowledge transfer operation and a knowledge validation operation.
3. Contribution of This Paper
Although LSC (Chen et al., 2015) already proposed a lifelong approach, it aims only to improve classification accuracy. It still operates under the supervised learning setting and is unable to deliver explicit knowledge to guide further learning.
Based on LSC, this paper advances lifelong learning in sentiment classification and makes two main contributions:
An improved lifelong learning paradigm is proposed to solve the sentiment classification problem under the unsupervised learning setting with previous knowledge.
We introduce a novel approach to discover and store the words with sentiment polarity for reuse.
4. Sentiment Polarity Words
4.1. Naïve Bayesian Text Classification
In this paper, we decide whether a word has sentiment polarity by calculating the probability that it appears in positive or negative content (a sentence or document). If a word has a high probability of carrying a sentiment polarity, it also leads the document to have a higher probability of that polarity under the Naïve Bayesian (NB) formula. Hence, determining the words with polarity is the key to predicting the sentiment.
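The Naïve Bayesian (NB) classifier (McCallum and Nigam, 1999) calculates the probability of each word in a document and then predicts the sentiment polarity (positive or negative). We use the same formula as in LSC (Chen et al., 2015). $P(w \mid c_j)$ is the probability that a word $w$ appears in class $c_j$:
$$P(w \mid c_j) = \frac{\lambda + N_{c_j,w}}{\lambda |V| + \sum_{v=1}^{|V|} N_{c_j,v}}$$
where $c_j$ is either positive (+) or negative (-) sentiment polarity, $N_{c_j,w}$ is the frequency of word $w$ in documents of class $c_j$, $|V|$ is the size of the vocabulary $V$, and $\lambda$ is used for smoothing ($\lambda$ is set to 1 for Laplace smoothing in this paper).
Given a document, we can calculate its probability for each class by:
$$P(c_j \mid d_i) = \frac{P(c_j) \prod_{w \in d_i} P(w \mid c_j)^{n_{w,d_i}}}{\sum_{r} P(c_r) \prod_{w \in d_i} P(w \mid c_r)^{n_{w,d_i}}}$$
where $d_i$ is the given document and $n_{w,d_i}$ is the frequency of word $w$ in this document.
To predict the class of a document, we only need to calculate $P(c_+ \mid d_i) - P(c_- \mid d_i)$. If the difference is larger than 0, the document should be predicted as positive:
$$P(c_+ \mid d_i) - P(c_- \mid d_i) = \frac{P(c_+) \prod_{w \in d_i} P(w \mid c_+)^{n_{w,d_i}} - P(c_-) \prod_{w \in d_i} P(w \mid c_-)^{n_{w,d_i}}}{\sum_{r} P(c_r) \prod_{w \in d_i} P(w \mid c_r)^{n_{w,d_i}}}$$
As we only need to know whether $P(c_+ \mid d_i) - P(c_- \mid d_i)$ is larger than 0, and the denominator is always positive, the formula can be simplified to:
$$P(c_+) \prod_{w \in d_i} P(w \mid c_+)^{n_{w,d_i}} - P(c_-) \prod_{w \in d_i} P(w \mid c_-)^{n_{w,d_i}} > 0$$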
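As an illustration only (not the authors' implementation), the Naïve Bayes decision rule of this section can be sketched in a few lines of Python. The sketch assumes that documents are already tokenized into lists of words, that both classes occur in the training data, and that the smoothed word probability is (λ + N_{c,w}) / (λ|V| + Σ_v N_{c,v}); the names `train_nb` and `predict` are hypothetical. Scores are compared in log space, which gives the same sign as the difference of the two class probabilities.

```python
import math
from collections import Counter

CLASSES = ("+", "-")

def train_nb(docs, labels, lam=1.0):
    """Estimate log P(c) and smoothed log P(w|c); lam=1 is Laplace smoothing.

    Assumes every class in CLASSES has at least one training document.
    """
    vocab = {w for doc in docs for w in doc}
    counts = {c: Counter() for c in CLASSES}
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
    n_docs = Counter(labels)
    log_prior = {c: math.log(n_docs[c] / len(docs)) for c in CLASSES}
    log_like = {}
    for c in CLASSES:
        # Denominator: lam * |V| plus the total word count of class c.
        denom = lam * len(vocab) + sum(counts[c].values())
        log_like[c] = {w: math.log((lam + counts[c][w]) / denom) for w in vocab}
    return log_prior, log_like

def predict(doc, log_prior, log_like):
    """Return '+' when the positive log-score is larger; unseen words are skipped."""
    score = {
        # Summing once per token realizes the exponent n_{w,d} of the formula.
        c: log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
        for c in CLASSES
    }
    return "+" if score["+"] > score["-"] else "-"

if __name__ == "__main__":
    docs = [["good", "great"], ["bad", "awful"], ["good", "fine"], ["bad", "poor"]]
    labels = ["+", "-", "+", "-"]
    log_prior, log_like = train_nb(docs, labels)
    print(predict(["good", "fine"], log_prior, log_like))
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on long documents and does not change the predicted class.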
4.2. Discover Words with Sentiment Polarity
Ideally, if we knew the three key components of the formula above for all words, we could predict the sentiment polarity of all documents. However, these three components differ from domain to domain. LSC (Chen et al., 2015) proposed a possible solution, but it uses all words, which carries a high risk of overfitting. As we know, not all words have sentiment polarity (for example, “a” and “one”), while some words always have polarity, such as “good”, “hate” and “excellent”. In addition, some words have sentiment polarity only in specific domains. For example, “tough” in diamond reviews indicates that the diamond is of good quality, while in the food domain it means hard to chew. Hence, to achieve the goal of lifelong learning, we need to find the words that always have sentiment polarity and be careful with words that show polarity only in specific domains.
5. Lifelong Semi-supervised Learning for Sentiment Classification
Although LSC (Chen et al., 2015) considered the differences among domains, it is still a typical supervised learning approach. In this paper, we propose to learn in two stages:
Initial Learning Stage: explore a basic set of sentiment words. After that, the model should be able to classify a new domain with good performance.
Self-study Stage: use the knowledge accumulated in the initial stage to handle new domains, and also fine-tune and consolidate the knowledge generated in the initial learning stage.
5.1. Initial Learning Stage
In this stage, we need to train the model to remember some sentiment polarity words. This requires us to find the words with sentiment polarity in each domain. We need to answer two questions here:
How to determine the polarity of a word?
How many domains do we need for the initial learning stage?
For the first question, we need to find which words appear mainly in positive or negative documents. This means that, for a word $w$ with positive polarity, $P(+ \mid w) > P(- \mid w)$ or $P(w \mid +) > P(w \mid -)$. In this paper, we use $P(c \mid w)$ to represent the polarity, because $P(c \mid w)$ is easy to extend to multi-class classification problems. According to the Bayesian formula, $P(c \mid w) = \frac{P(w \mid c)\,P(c)}{P(w)}$.
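The word-level use of the Bayesian formula can be illustrated with a short sketch. It assumes that the polarity of a word is measured by the posterior probability of the positive class given the word, that the word likelihoods use the same additive smoothing as the Naïve Bayes classifier, and that P(w) is obtained by summing P(w|c)P(c) over both classes. The function name `word_posteriors` is hypothetical and the code is not taken from the authors.

```python
from collections import Counter

CLASSES = ("+", "-")

def word_posteriors(docs, labels, lam=1.0):
    """Return {word: P(+|w)} via Bayes' rule over smoothed P(w|c) and P(c)."""
    vocab = {w for doc in docs for w in doc}
    counts = {c: Counter() for c in CLASSES}
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
    n_docs = Counter(labels)
    prior = {c: n_docs[c] / len(docs) for c in CLASSES}
    totals = {c: sum(counts[c].values()) for c in CLASSES}
    posteriors = {}
    for w in vocab:
        # Joint P(w|c) * P(c) for each class, with additive smoothing.
        joint = {
            c: prior[c] * (lam + counts[c][w]) / (lam * len(vocab) + totals[c])
            for c in CLASSES
        }
        # P(w) is the sum of the joints, so the ratio below is P(+|w).
        posteriors[w] = joint["+"] / (joint["+"] + joint["-"])
    return posteriors

if __name__ == "__main__":
    docs = [["good", "great"], ["bad", "awful"], ["good", "fine"], ["bad", "poor"]]
    labels = ["+", "-", "+", "-"]
    for word, p in sorted(word_posteriors(docs, labels).items()):
        print(word, round(p, 3))
```

A value close to 1 suggests a positive word, a value close to 0 a negative word, and a value near 0.5 a word without clear polarity. With more than two classes the same computation yields one posterior per class.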
5.2. Self-study Stage
In this stage, our main task is to explore which words have polarity. We mainly use these words to predict new domains and assign pseudo-labels to them. With the pseudo-labels, we are able to discover new words with polarity. The procedure for self-study is as follows:
Use the sentiment words accumulated from previous tasks to predict a new domain, then assign the prediction results as pseudo-labels.
Use the reviews and pseudo-labels of this new domain as new training data to run the Naïve Bayes model.
Update the sentiment words knowledge base.
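The three steps above can be sketched as one self-study round. This is a simplified illustration under several assumptions that are not stated in the paper: the knowledge base is stored as per-class word counts plus a set of sentiment words, class priors are uniform, a word counts as polar when its smoothed posterior is at least `min_gap` away from 0.5, and ties are resolved as positive. All function names are hypothetical.

```python
import math
from collections import Counter

CLASSES = ("+", "-")

def fit_counts(docs, labels):
    """Count word occurrences per class from (pseudo-)labeled token lists."""
    counts = {c: Counter() for c in CLASSES}
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
    return counts

def polar_words(counts, min_gap=0.15, lam=1.0):
    """Words whose smoothed P(+|w) lies at least min_gap from 0.5.

    Uniform class priors and the min_gap threshold are assumptions.
    """
    vocab = set(counts["+"]) | set(counts["-"])
    denom = {c: lam * len(vocab) + sum(counts[c].values()) for c in CLASSES}
    keep = set()
    for w in vocab:
        like = {c: (lam + counts[c][w]) / denom[c] for c in CLASSES}
        if abs(like["+"] / (like["+"] + like["-"]) - 0.5) >= min_gap:
            keep.add(w)
    return keep

def nb_predict(doc, counts, words, lam=1.0):
    """Naive Bayes decision using only tokens found in `words`; ties go to '+'."""
    score = {}
    for c in CLASSES:
        denom = lam * len(words) + sum(counts[c][w] for w in words)
        score[c] = sum(
            math.log((lam + counts[c][w]) / denom) for w in doc if w in words
        )
    return "+" if score["+"] >= score["-"] else "-"

def self_study(kb_counts, kb_words, new_docs):
    """One self-study round on an unlabeled domain."""
    # Step 1: predict the new domain with the accumulated sentiment words.
    pseudo = [nb_predict(doc, kb_counts, kb_words) for doc in new_docs]
    # Step 2: treat the pseudo-labels as labels and recount on the new domain.
    new_counts = fit_counts(new_docs, pseudo)
    # Step 3: update the knowledge base with the new counts and polar words.
    merged = {c: kb_counts[c] + new_counts[c] for c in CLASSES}
    return pseudo, merged, kb_words | polar_words(new_counts)

if __name__ == "__main__":
    src_docs = [["good", "great"], ["bad", "awful"], ["good", "fine"], ["bad", "poor"]]
    kb_counts = fit_counts(src_docs, ["+", "-", "+", "-"])
    kb_words = polar_words(kb_counts)
    new_docs = [["good", "tasty"], ["bad", "stale"], ["great", "tasty"], ["awful", "stale"]]
    pseudo, kb_counts, kb_words = self_study(kb_counts, kb_words, new_docs)
    print(pseudo, sorted(kb_words))
```

In this toy run the domain-specific words "tasty" and "stale" never occur in the labeled source data, yet they enter the knowledge base because they co-occur with already known sentiment words in the pseudo-labeled reviews.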
In the experiment, we use the same datasets as LSC (Chen et al., 2015). They contain reviews from 20 domains crawled from Amazon.com, and each domain has 1,000 reviews (the distribution of positive and negative reviews is imbalanced).
6.2. Word Polarity Analysis
To answer the first question for the initial learning stage, we need to know which words actually influence the sentiment classification. First, we calculate the probabilities of the positive and negative classes for each word. Then we define a polarity degree from these two probabilities. Finally, we use only a specific percentage of the words for prediction and check whether the performance decreases. In addition, we consider only the words that appear on average at least 5 times per domain. This is because we did not delete the symbols and numbers in the data, and these characters may be noise in the training data.
We first sorted the words or symbols (no data pre-processing was applied to the corpus in this paper) by polarity and then selected a specific percentage of them, ranging from all words down to only 10%. From Table 1 we can see that keeping no less than 30% of the words obtains the best average result. This means that most words and symbols do not have an obvious sentiment orientation.
Hence, we keep only 30% of the words for the Naïve Bayes model and even obtain a better F1 score. Although the performance on an individual domain may decrease, better global performance can be achieved with only the sentiment words.
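The word selection described in this subsection can be sketched as follows. The sketch assumes that the polarity degree of a word is the absolute difference between its positive and negative posterior, which equals |2·P(+|w) − 1| in the two-class case; this definition, the function name `select_polar_words` and its parameters are assumptions made for illustration, not the authors' definition.

```python
def select_polar_words(pos_posterior, total_counts, n_domains,
                       keep_ratio=0.3, min_avg_count=5):
    """Keep the most polar fraction of sufficiently frequent words.

    pos_posterior: {word: P(+|w)}
    total_counts:  {word: occurrences summed over all training domains}
    """
    # Frequency filter: average occurrences per domain must reach the minimum.
    frequent = [
        w for w in pos_posterior
        if total_counts.get(w, 0) / n_domains >= min_avg_count
    ]
    # Assumed polarity degree: |P(+|w) - P(-|w)| = |2 * P(+|w) - 1|.
    ranked = sorted(frequent, key=lambda w: abs(2 * pos_posterior[w] - 1),
                    reverse=True)
    k = max(1, round(keep_ratio * len(ranked)))
    return ranked[:k]

if __name__ == "__main__":
    posterior = {"awful": 0.05, "great": 0.95, "poor": 0.10, "the": 0.50,
                 "a": 0.52, "of": 0.48, "one": 0.55, "it": 0.45,
                 "box": 0.60, "day": 0.40, "zzz": 0.99}
    counts = {w: 20 for w in posterior}
    counts["zzz"] = 4  # too rare: on average 2 occurrences per domain
    print(select_polar_words(posterior, counts, n_domains=2))
```

The frequency filter is applied before ranking, so a rare token with an extreme posterior (such as "zzz" in the example) cannot displace a frequent sentiment word.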
6.3. Requirement for the Initial Learning
For the second question of the initial learning stage, the answer depends on the task. In practice, all available labeled data should certainly be used for training. The only question to consider is how much labeled data meets the minimum requirement. For this sentiment classification task, one domain is clearly insufficient. Based on the experimental results, the initial learning stage needs at least two domains and achieves much better performance with three domains. Adding more domains does not significantly influence the performance. Hence, three domains are enough for this task. For other tasks, at least two labeled domains are necessary, and we suggest continuing to collect labeled domains until the performance on new domains becomes steady.
6.4. Self-study Learning
In the self-study learning stage, learning is designed under the unsupervised learning setting. In this stage, there is no labeled data. Instead, we use the model generated in the initial learning stage to predict each domain and assign pseudo-labels to it. After that, the model regards the pseudo-labels as the real labels and continues training on the new domain. With this method, the self-study learning stage can learn new domains well without any labeled data.
Table 2 shows the F1 scores of three models on 17 domains. The first three domains were used for the initial learning stage. We use the Macro-F1 score because the datasets are imbalanced and it reflects our performance on the minority classes. We compared our model (Semi-Unsupervised Learning, SU-LML for short) with a Naïve Bayes model trained only on the first three (source) domains (NB-S) and a Naïve Bayes model trained on each domain with labels using 5-fold cross-validation (NB-T). We can see that our approach is significantly better than the other two approaches. It even performs better than NB-T, a typical supervised learning approach. Figure 2 shows the result more clearly. Comparisons with LSC and neural-network-based lifelong learning (Lv et al., 2019) are not shown here, first because their code is still unavailable and second because their approaches are fully supervised.
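For reference, the Macro-F1 score is the unweighted mean of the per-class F1 scores, so the minority class counts as much as the majority class. A minimal sketch, with the hypothetical function name `macro_f1` and the assumption that a class with no true or predicted instances contributes an F1 of 0:

```python
def macro_f1(y_true, y_pred, classes=("+", "-")):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

if __name__ == "__main__":
    y_true = ["+", "+", "+", "-"]
    y_pred = ["+", "+", "+", "+"]  # always predicts the majority class
    print(macro_f1(y_true, y_pred))
```

In this example a classifier that always predicts the majority class reaches 75% accuracy, but its Macro-F1 is only 3/7 because the F1 of the negative class is 0.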
| Word | Degree for Negative Sentiment |
6.5. Knowledge Generated during Learning
Another important result of this paper is that we discovered which words have sentiment polarity. If a word is regarded as having sentiment polarity, we increase its polarity score by one. In addition, we add an extra score from 0 to 1 based on its rank. Table 3 lists the top words with negative sentiment, and most of them make sense.
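The scoring rule can be sketched as follows. The text only states that the rank-based bonus lies between 0 and 1, so the linear mapping from rank to bonus used here (1 for the top-ranked word, 0 for the last) is an assumption, and the function name `update_polarity_scores` is hypothetical.

```python
from collections import defaultdict

def update_polarity_scores(scores, ranked_words):
    """Add 1 plus a rank-based bonus in [0, 1] for each polar word of a domain.

    ranked_words: polar words of one domain, most polar first.
    The linear rank-to-bonus mapping is an assumption.
    """
    n = len(ranked_words)
    for rank, word in enumerate(ranked_words):
        bonus = 1.0 if n == 1 else 1.0 - rank / (n - 1)
        scores[word] += 1.0 + bonus
    return scores

if __name__ == "__main__":
    scores = defaultdict(float)
    update_polarity_scores(scores, ["bad", "poor", "slow"])  # first domain
    update_polarity_scores(scores, ["poor", "bad"])          # second domain
    print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Under this rule a word that is polar in many domains accumulates a high score, and the bonus separates words that are polar in the same number of domains.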
7. Conclusion and Outlook
We proposed a semi-unsupervised lifelong sentiment classification approach in this paper. It can accumulate knowledge from previous learning and then turn to self-study. Our approach requires very little labeled data, so it is well suited to industrial scenarios. Its performance even exceeds that of supervised learning, which shows that the knowledge reuse of lifelong learning is useful.
Although we only show two-class classification here, the idea is also suitable for multi-class classification. The approach can be used for text classification in general, not only sentiment classification. Our model classifies documents using knowledge of the sentiment polarity of words, which is the same approach that we humans use. We show that focusing on the goal behind the learning tasks is more meaningful than just finding a solution. Understanding the words is much more important than solving a single sentiment classification task. We should learn knowledge and skills for all tasks rather than a solution for a single task.
This research is supported by the Research Institute of Big Data Analytics, Xi’an Jiaotong-Liverpool University and the CERNET Innovation Project under Grant NGII20161010.
- Caruana (1997) Rich Caruana. 1997. Multitask learning. Machine learning 28, 1 (1997), 41–75.
- Chen et al. (2015) Zhiyuan Chen, Nianzu Ma, and Bing Liu. 2015. Lifelong learning for sentiment classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Vol. 2. 750–756.
- Hong et al. (2018) Xianbin Hong, Prudence Wong, Dawei Liu, Sheng-Uei Guan, Ka Lok Man, and Xin Huang. 2018. Lifelong Machine Learning: Outlook and Direction. In Proceedings of the 2nd International Conference on Big Data Research. ACM, 76–79.
- Lv et al. (2019) Guangyi Lv, Shuai Wang, Bing Liu, Enhong Chen, and Kun Zhang. 2019. Sentiment Classification by Leveraging the Shared Knowledge from a Sequence of Domains. In International Conference on Database Systems for Advanced Applications. Springer, 795–811.
- McCallum and Nigam (1999) Andrew McCallum and Kamal Nigam. 1999. Text classification by bootstrapping with keywords, EM and shrinkage. Unsupervised Learning in Natural Language Processing (1999).
- Ruvolo and Eaton (2013) Paul Ruvolo and Eric Eaton. 2013. ELLA: An efficient lifelong learning algorithm. In International Conference on Machine Learning. 507–515.
- Thrun (1996) Sebastian Thrun. 1996. Is learning the n-th thing any easier than learning the first?. In Advances in neural information processing systems. 640–646.
- Thrun (1998) Sebastian Thrun. 1998. Lifelong learning algorithms. In Learning to learn. Springer, 181–209.
- Thrun and Mitchell (1995) Sebastian Thrun and Tom M Mitchell. 1995. Lifelong robot learning. Robotics and autonomous systems 15, 1-2 (1995), 25–46.