Label-guided Learning for Text Classification
Text classification is one of the most important and fundamental tasks in natural language processing. Performance on this task depends mainly on text representation learning. Most existing learning frameworks focus on encoding local contextual information between words. These methods typically neglect global clues, such as label information, when encoding text. In this study, we propose a label-guided learning framework, LguidedLearn, for text representation and classification. Our method is novel but simple: we only insert a label-guided encoding layer into commonly used text representation learning schemas. The label-guided layer performs label-based attentive encoding to map the universal text embedding (encoded by a contextual information learner) into different label spaces, resulting in label-wise embeddings. In our proposed framework, the label-guided layer can be easily and directly combined with a contextual encoding method to perform joint learning. Text information is thus encoded based on both local contextual information and global label clues, so the obtained text embeddings are more robust and discriminative for text classification. Extensive experiments are conducted on benchmark datasets to illustrate the effectiveness of our proposed method.
Text classification can be simply described as the task that given a sequence of text (usually a sentence, paragraph, or document
The essential step in text classification is to obtain a text representation. In earlier studies, a given piece of text was usually represented with a hand-crafted feature vector. Recently, inspired by the success of word embedding learning, a piece of text (a sentence, paragraph, or document) is also represented with an embedding, which is learned automatically from the raw text by neural networks. These learning methods mainly include sequence-based learning models [10, 25, 5, 13] and graph-based learning models [11, 4, 2, 24]. All of these methods model local contextual information between words to encode a piece of text into a universal embedding, without considering the differences between labels. More recently, some research has suggested that global label clues are also important for text representation learning [18, 1, 15, 21, 16].
In this study, we exploit label constraints/clues to guide text information encoding and propose a label-guided learning framework, LguidedLearn, for text classification. In our framework, each label is represented by an embedding matrix. A label-guided layer is proposed to map universal context-based embeddings into different label spaces, resulting in label-wise text embeddings. LguidedLearn jointly learns word-word contextual encoding and label-word attentive encoding. The resulting text embeddings are informative and discriminative for text classification. A series of comprehensive experiments is conducted to illustrate the effectiveness of our proposed model.
2 Related Work
In the computer vision community, many studies have exploited label information for image classification [1, 7]. All of these models jointly encode label description text and image information to enhance the performance of image classification. Recently, several studies have applied label embedding learning to natural language processing tasks. For example, Nam et al. (2016) proposed a model to learn the label and word embeddings jointly. Pappas et al. (2019) also presented a model, GILE, to encode input-label embeddings. However, all these models require a piece of description text for each label, and the learning performance depends on the quality of that description text. This requirement limits the models' applicability.
In this section, we first describe intuitively how to incorporate labels into text encoding and the main layers needed for text representation learning. Then, we formally present the proposed framework LguidedLearn (Label-guided Learning) for text classification.
Given a piece of text, what kinds of information/clues should be encoded in text representation learning for classification? First, local contextual information is essential for text embedding learning. Second, we notice that not all words/characters in the given text are equally useful for assigning the correct label. Therefore, we also need global label constraints/information to guide text encoding. Based on the above considerations, a learning framework for text classification should include:
Pre-trained encoding layer: get pre-trained word or character embeddings;
Contextual encoding layer: encode contextual information between words into the text embeddings;
Label-guided encoding layer: perform label attentive learning to encode global information (constraints) into the text embeddings;
Classifying layer: conduct feature compression and text classification.
3.2 The Framework: LguidedLearn
The proposed label-guided learning framework is shown in Figure 1. In this section, we introduce the framework in detail. Given a sequence of text (), we apply the following learning layers successively.
Pre-trained encoding layer
The aim of the pre-trained encoding layer is to obtain low-dimensional continuous embeddings for the words in the sequence of text.
where (where is the pre-trained embedding size) is a pre-trained embedding of word , and is a kind of word embedding learner, such as GloVe [17].
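As a minimal sketch of what the pre-trained encoding layer does, the lookup below maps each word to a row of an embedding table. The toy vocabulary, the random table standing in for GloVe vectors, and the names `lookup_embeddings`, `vocab`, `table`, and `unk_index` are all assumptions made for illustration; they do not come from the paper.

```python
import numpy as np

def lookup_embeddings(tokens, vocab, table, unk_index=0):
    """Map each token to its row in a pre-trained embedding table."""
    # Unknown words fall back to a shared <unk> row (an assumption of this sketch).
    ids = [vocab.get(tok, unk_index) for tok in tokens]
    return table[ids]  # shape: (number of tokens, embedding size)

# Toy stand-in for a pre-trained table such as GloVe (random values, not real vectors).
rng = np.random.default_rng(0)
vocab = {"<unk>": 0, "a": 1, "funny": 2, "movie": 3}
table = rng.normal(size=(len(vocab), 4))

X = lookup_embeddings(["a", "funny", "film"], vocab, table)
```

Here `X` has one row per input word, and the out-of-vocabulary word "film" receives the `<unk>` row.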
Contextual encoding layer
The contextual layer further encodes words’ dynamic contextual information in the current text sequence.
where (where is the contextual embedding size) is a contextual-encoded embedding corresponding to the word . In supervised learning tasks, can be effectively implemented with an LSTM or BiLSTM network.
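To illustrate the shape of the computation, the sketch below runs a bidirectional pass over the word embeddings and concatenates the two directions at every position. It deliberately swaps in a plain tanh recurrent cell for the LSTM cell to stay short, so it is a simplified stand-in and not the BiLSTM used in the paper. All names (`rnn_pass`, `bidirectional_encode`, `init_params`) and sizes are hypothetical.

```python
import numpy as np

def rnn_pass(X, W_in, W_rec, b):
    """Run a tanh recurrent pass over the rows of X and return every hidden state."""
    h = np.zeros(W_rec.shape[0])
    states = []
    for x in X:
        h = np.tanh(W_in @ x + W_rec @ h + b)
        states.append(h)
    return np.stack(states)

def bidirectional_encode(X, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states for each position."""
    forward = rnn_pass(X, *fwd_params)
    # Read the sequence right to left, then flip back so rows align with positions.
    backward = rnn_pass(X[::-1], *bwd_params)[::-1]
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
n, d_in, d_hid = 6, 4, 3  # illustrative sizes only

def init_params():
    return (rng.normal(size=(d_hid, d_in)),
            rng.normal(size=(d_hid, d_hid)),
            np.zeros(d_hid))

fwd, bwd = init_params(), init_params()
X = rng.normal(size=(n, d_in))
H = bidirectional_encode(X, fwd, bwd)  # shape: (n, 2 * d_hid)
```

Each row of `H` combines left context (first `d_hid` entries) and right context (last `d_hid` entries) for one word, which is the property the label-guided layer relies on.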
Label-guided encoding layer
Let be the label set, where is the number of labels. Each label () is represented with an embedding matrix consisting of embeddings (), where () is the th embedding in the embedding matrix . The label-guided layer jointly encodes label information and contextual information by projecting contextual-encoded embeddings into the label space. Take the th () label for example:
where are the contextual-encoded embeddings of the given text sequence, and is a label-wise embedding specified by the th label. can be implemented in the following simple way:
where is a label attentive weight to measure the compatibility of the pair , where is the contextual embedding of the th word in the given text sequence, and is the embedding matrix of the label . To get the compatibility weight, the cosine similarities between and each embedding in = () are computed
Considering the formula (4), the collected values () should be normalized
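The weight computation described above can be sketched as follows. The text states that cosine similarities between each contextual embedding and each embedding of a label are computed and then normalized; how the similarities for one label are combined, and the exact normalization, are not recoverable here. This sketch therefore assumes (1) the maximum similarity over a label's embeddings is kept and (2) a softmax over the words is used. The names `H`, `C`, and `label_attention_weights` are introduced for illustration only.

```python
import numpy as np

def label_attention_weights(H, C):
    """
    H: (n, d) contextual embeddings for n words.
    C: (l, k, d) label embeddings, k per label.
    Returns an (n, l) matrix; each column sums to 1 over the words.
    """
    H_unit = H / np.linalg.norm(H, axis=1, keepdims=True)
    C_unit = C / np.linalg.norm(C, axis=2, keepdims=True)
    # Cosine similarity of every word with every embedding of every label.
    cos = np.einsum("nd,lkd->nlk", H_unit, C_unit)
    # ASSUMPTION: keep the best-matching embedding for each (word, label) pair.
    scores = cos.max(axis=2)
    # ASSUMPTION: normalize with a softmax over the words, per label.
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))      # 5 words, dimension 8
C = rng.normal(size=(3, 2, 8))   # 3 labels, 2 embeddings each
C[0, 0] = H[2]                   # make word 2 a perfect match for label 0
B = label_attention_weights(H, C)
```

Note that the word and label embeddings share the same dimension `d`, which is what makes the cosine similarity well defined.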
According to the label-guided encoding formula (4), we can obtain label-wise embeddings (), where () is the text embedding encoded with the guidance of the label . All the label-wise embeddings are concatenated
where is the ultimate embedding used to represent the given sequence of text.
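A minimal sketch of the aggregation and concatenation steps, assuming each label-wise embedding is the attention-weighted sum of the contextual embeddings (the text describes the weights as measuring word-label compatibility, but the equation itself is not reproduced here). The names `H`, `B`, and `label_wise_text_embedding` are hypothetical.

```python
import numpy as np

def label_wise_text_embedding(H, B):
    """
    H: (n, d) contextual embeddings.
    B: (n, l) label attentive weights, one column per label.
    Returns the concatenation of the l label-wise embeddings, shape (l * d,).
    """
    Z = B.T @ H           # (l, d): row j is the weighted sum of words for label j
    return Z.reshape(-1)  # concatenate the l label-wise embeddings

H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
B = np.array([[1 / 3, 0.0],   # label 0 attends uniformly to all words
              [1 / 3, 0.0],
              [1 / 3, 1.0]])  # label 1 attends only to the last word
z = label_wise_text_embedding(H, B)
```

With these toy inputs, the first half of `z` is the mean of the word embeddings and the second half is the last word's embedding, so the same text yields a different view for each label.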
Classifying layer
In the classifying layer, we first employ an MLP to compress the large text embedding into a more compact one
where . In this study, we empirically set (10 times the number of labels). Then the label distribution of the given text sequence is obtained by
In the above formula, an MLP is used to encode the embedding into a vector whose dimension equals the number of labels.
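The two steps of the classifying layer can be sketched as below: compress the concatenated embedding to 10 times the number of labels, then map it to one score per label and normalize. The depth of each MLP, the ReLU activation, and the softmax output are assumptions of this sketch, and the weights are random placeholders rather than trained parameters.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def classify(z, W1, b1, W2, b2):
    """Compress z, then map it to a probability distribution over labels."""
    # ASSUMPTION: a single ReLU layer plays the role of the compressing MLP.
    compressed = np.maximum(0.0, W1 @ z + b1)
    # Output dimension equals the number of labels.
    logits = W2 @ compressed + b2
    return softmax(logits)

rng = np.random.default_rng(0)
num_labels, d = 4, 8                    # illustrative sizes only
z = rng.normal(size=num_labels * d)     # concatenated label-wise embedding
hidden = 10 * num_labels                # 10 times the number of labels, as in the text
W1 = rng.normal(scale=0.1, size=(hidden, z.size))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(num_labels, hidden))
b2 = np.zeros(num_labels)
probs = classify(z, W1, b1, W2, b2)
```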
4 Experiments and results
In this study, we employ two text datasets to investigate the effectiveness of our proposed method. For convenience, the two datasets are denoted by TCD-1 (TCD: Text Classification Dataset) and TCD-2, respectively. Summary statistics of the two datasets are presented in Table 1.
| Dataset | Name | # Doc | # Word | # Class | # Len |
TCD-1 We select TCD-1 as an experimental dataset because it is used by [21] to evaluate their joint text and label learning model LEAM, which is a main baseline for our comparative analysis. TCD-1 includes five sub-datasets:
AGNews: news topic classification over four categories (world, entertainment, sports and business). Each document is composed of an internet news article.
Yelp-F: sentiment classification of polarity star labels ranging from 1 to 5. The dataset is obtained from the Yelp Review Dataset Challenge in 2015.
Yelp-B: sentiment classification of polarity labels (negative or positive). The data are the same as in Yelp-F; only the labels differ: star ratings 1 and 2 are treated as negative, and 4 and 5 as positive.
DBPedia: ontology classification over fourteen classes. The dataset is picked from DBpedia 2014 (Wikipedia).
YahooQA: QA topic classification over ten categories. The dataset is collected from version 1.0 of the Yahoo! Answers Comprehensive Questions and Answers.
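The relationship between Yelp-F and Yelp-B described in the list above amounts to a relabeling of star ratings, which can be sketched as follows. The function name `yelp_polarity` is hypothetical, and the handling of 3-star reviews (returned as `None`, to be filtered out) is an assumption, since the text only states how ratings 1, 2, 4 and 5 are mapped.

```python
def yelp_polarity(stars):
    """Map a 1-5 star rating to a polarity label, or None for 3 stars."""
    if stars in (1, 2):
        return "negative"
    if stars in (4, 5):
        return "positive"
    if stars == 3:
        return None  # ASSUMPTION: 3-star reviews are left out of the binary task
    raise ValueError(f"expected a rating from 1 to 5, got {stars!r}")

ratings = [1, 2, 3, 4, 5]
labels = [yelp_polarity(s) for s in ratings]
binary_labels = [label for label in labels if label is not None]
```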
TCD-2 We select TCD-2 to test our model because it is a very comprehensive dataset, covering many different domains, and is widely used by more recent work [24]. TCD-2 also consists of five sub-datasets:
20NG: document topic classification over 20 different categories.
R52: document topic classification over 52 categories. The R52 dataset is a subset of the Reuters 21578 dataset.
R8: document topic classification over 8 categories. The R8 dataset is also a subset of the Reuters 21578 dataset.
Ohsumed: disease classification over 23 categories. The Ohsumed corpus is from the MEDLINE database.
MR: binary sentiment classification. The MR dataset is a movie review dataset, in which each review only contains one sentence.
Though 20NG, R52 and R8 are all document topic classification datasets, the three sub-datasets differ greatly in their labels, as presented in Figure 2. For example, 20NG mainly covers sports, politics and computers, while R52 and R8 mainly contain life-related issues and finance-related topics. R8 has a coarser categorization than R52.
Baselines on TCD-1 Baselines reported on TCD-1 mainly contain five categories:
Baselines on TCD-2 Baselines conducted on TCD-2 are:
Traditional models: TF-IDF + LR;
Sequential deep learning models: CNN-rand, CNN-non-static and LSTM;
Label embedding based model LEAM [21] and our proposed LguidedLearn.
TCD-2 is a very comprehensive dataset used for text classification. Besides sequential deep learning models, some recent graph-based neural network models have also been reported on TCD-2.
4.3 Experimental settings
Model setting In the pre-trained layer, we use GloVe to obtain pre-trained word embeddings with dimension of . The contextual layer is implemented by a BiLSTM with dimension of . In the label-guided layer, each label is represented by an embedding matrix with size of . That is, each label is represented by five embeddings.
Learning setting In training, the batch size is 25 and the learning rate is 0.001. All experiments are repeated 10 times.
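For reference, the settings stated above can be collected in a small configuration object. This is only a sketch: the class and field names are hypothetical, and the embedding dimensions are left as a caller-supplied argument because their values are not given in this section.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    batch_size: int = 25            # stated in the learning setting
    learning_rate: float = 0.001    # stated in the learning setting
    num_repeats: int = 10           # each experiment is repeated 10 times
    embeddings_per_label: int = 5   # each label is represented by five embeddings

def label_matrix_shape(config, contextual_dim):
    """Shape of one label's embedding matrix for a given contextual embedding size."""
    # The label embeddings share the contextual embedding dimension so that
    # cosine similarities between them can be computed.
    return (config.embeddings_per_label, contextual_dim)

config = ExperimentConfig()
shape = label_matrix_shape(config, contextual_dim=7)  # 7 is a placeholder value
```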
4.4 Results and analysis
A series of comparison experiments are conducted on the datasets of TCD-1 and TCD-2. A summary of text classification performance and simple analysis are presented.
- Results on TCD-1 The comparison results on TCD-1 are presented in Table 2. The results show that our proposed framework LguidedLearn achieves the best performance on all the test datasets of TCD-1. Even compared to some recently published strong text classification algorithms (such as fastText, SWEM, Deep CNN (29 layers), Bi-BloSAN, and LEAM), LguidedLearn obtains stable and prominent gains, especially on the AGNews, Yelp-B, and Yelp-F datasets.
- Results on TCD-2 The results on TCD-2 are presented in Table 3. From the results we can see that, except on 20NG, LguidedLearn obtains the best classification accuracy on all datasets. The recently popular graph neural network (GNN) models show strong ability in classifying text documents. We notice that even the very strong GNN-based models (such as Graph-CNN-C, Graph-CNN-S, Graph-CNN-F, and Text GCN) are surpassed by our proposed framework LguidedLearn by a large margin. Documents in 20NG are very long (the average length is 221.26 words and about 18% of documents have more than 400 words) and exceed the maximum encoding length of most models. This is why, on 20NG, non-sequential encoding models (TF-IDF+LR, SWEM, and Text GCN) achieve better performance than sequential encoding models (such as LSTM and CNN based models). LguidedLearn (which has a BiLSTM contextual layer) is also slightly affected by this factor.
| TF-IDF + LR | 83.19 | 93.74 | 86.95 | 54.66 | 74.59 |
Analysis of Label-guided encoding
The main difference between LguidedLearn and traditional deep learning models is the label-guided encoding layer, which performs label attentive learning. Compared to previous label attentive learning models, such as LEAM, LguidedLearn 1) extends the label embedding to a label embedding space (represented by a series of embeddings) and 2) jointly learns contextual encoding and label-guided encoding. We present a series of analyses of the label-guided layer based on these two points.
- Label attentive learning analysis The results of LguidedLearn without the label-guided encoding layer, denoted by Label-guided (w/o), are presented in Table 4. After removing the label-guided layer, LguidedLearn reduces to a BiLSTM-based text classification model. The comparison (Label-guided (w/o) vs. LguidedLearn) shows that the label-guided layer brings large accuracy gains, especially on the complex datasets 20NG, R52, Ohsumed and MR.
The MR dataset is a sentiment classification task, which requires a model to capture detailed sentiment-related information. An example taken from MR is presented in Figure 3. The example shows that the label-guided layer can extract label-related (sentiment-related) information from the input words by performing label attentive learning. For example, the words “funny” and “also” are likely to be projected into the Positive space with large attentive weights, and the words “dark” and “disturbing” are more likely to be projected into the Negative space.
The 20NG, R52 and Ohsumed datasets have many labels, and their documents are very long. These are very challenging text classification tasks, which require a model to correctly extract label-related information from highly redundant and even noisy input. A medical text example taken from the Ohsumed dataset is presented in Figure 4. The example illustrates the effectiveness of label attentive learning. The medical text has more than 200 words, but only some pieces of text (shown in red) are projected into the correct label space with large label attentive weights. These pieces of text are strongly related to the sample label Digestive System Disease, such as “percutaneous cholangioplasty” and “balloon cholangioplasty of 17 patients with 28 benign biliary strictures”.
- Label embedding space analysis Unlike a document or sentence sample, each label is a class, which should contain all kinds of representative information from the samples belonging to it. It is therefore not reasonable to represent a label with only one embedding. In this study, we extend the label embedding to a label embedding space represented by a series of embeddings (an embedding matrix). An example experiment on the 20NG dataset is presented in Figure 5 to show how accuracy varies with the size of the label embedding matrix. The results show that performance first improves as the number of embeddings per label increases from one to five, and then decreases when more than ten embeddings per label are used. Increasing the label embedding size at first increases the labels' representation ability, but using too many embeddings reduces the labels' discriminative ability. The best size of the label embedding matrix depends on the dataset. For convenience, and to avoid excessive parameter tuning, we simply use five embeddings to represent each label in all experiments (see the Experimental settings section). The results of a comprehensive experiment are presented in Table 5. The comparison illustrates that using an embedding matrix (as in LguidedLearn) yields stable and substantial improvements over using a single label embedding (as is usual in previous label embedding models, such as LEAM).
- Contextual and label attentive jointly learning analysis The LEAM model also performs label attentive learning (through a joint learning process over labels and words) and is reported to be effective on TCD-1 [21]. In our additional experiments on other datasets (such as TCD-2), the results show that LEAM is not always effective and is sometimes much worse than other strong baselines (see Table 3). Besides public datasets, experimental results on our real application dataset (details are omitted due to space limitations) also illustrate the problem. The main reason is that the LEAM framework is ill-suited to encoding contextual information between words. An important merit of our proposed LguidedLearn framework is the joint learning of contextual encoding and label attentive encoding. In the framework, the label-guided layer (which performs label attentive learning) can be easily and directly combined with an effective contextual learning model (BiLSTM) to jointly encode contextual information and label attentive constraints. Comparison results are presented in Table 6 to illustrate the importance of this joint learning. To control for the effect of the label embedding matrix, we also run LguidedLearn with one embedding per label (as in LEAM), denoted by LguidedLearn-1. The results in Table 6 show that even LguidedLearn-1 surpasses LEAM by a large margin. In some cases, such as on the R8 and MR datasets, the performance of LEAM is worse than that of a BiLSTM model, because it does not encode contextual information between words.
4.5 Preliminary experiments with BERT
Though we employ BiLSTM to encode contextual information in the framework LguidedLearn, our proposed label-guided encoding layer can be applied with any other contextual information learner. Results of preliminary experiments with BERT are reported in Table 7. In the experiments, we employ pre-trained BERT
In this study, we propose a universal framework, LguidedLearn, to exploit global label information for text representation and classification. In the framework, a label-guided encoding layer can be easily and directly combined with a contextual information encoding module to bring substantial gains in text classification. A series of extensive experiments and analyses is presented to illustrate the effectiveness of our proposed learning schema.
- We will not specifically differentiate “sentence”, “document”, and “paragraph”. These terms can be used interchangeably.
- In this study, we only consider the single-label (not multi-label) classification problem.
- For the cosine similarities to be computable, the dimension of the contextual embeddings must be equal to the dimension of the embeddings in the label embedding matrix.
References
- [1] (2013) Label-embedding for attribute-based classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 819–826.
- [2] (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
- [3] (2013) Spectral networks and locally connected networks on graphs.
- [4] (2018) A comprehensive survey of graph embedding: problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering 30 (9), pp. 1616–1637.
- [5] (2016) Very deep convolutional networks for text classification.
- [6] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pp. 3844–3852.
- [7] (2013) Devise: a deep visual-semantic embedding model. In Advances in neural information processing systems, pp. 2121–2129.
- [8] (2015) Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.
- [9] (2016) Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
- [10] (2014) Convolutional neural networks for sentence classification.
- [11] (2016) Semi-supervised classification with graph convolutional networks.
- [12] (2014) Distributed representations of sentences and documents. In International conference on machine learning, pp. 1188–1196.
- [13] (2016) Recurrent neural network for text classification with multi-task learning. pp. 2873–2879.
- [14] (2016) Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Scientific reports 6, pp. 26094.
- [15] (2016) All-in text: learning document, label, and word representations jointly. In Thirtieth AAAI Conference on Artificial Intelligence.
- [16] (2019) GILE: a generalized input-label embedding for text classification. Transactions of the Association for Computational Linguistics 7, pp. 139–155.
- [17] (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543.
- [18] (2013) Label embedding for text recognition. In BMVC, pp. 5–1.
- [19] (2018) Baseline needs more love: on simple word-embedding-based models and associated pooling mechanisms.
- [20] (2018) Bi-directional block self-attention for fast and memory-efficient sequence modeling. ICLR.
- [21] (2018) Joint embedding of words and labels for text classification. arXiv preprint arXiv:1805.04174.
- [22] (2012) Baselines and bigrams: simple, good sentiment and topic classification. In Proceedings of the 50th annual meeting of the association for computational linguistics: Short papers-volume 2, pp. 90–94.
- [23] (2016) Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pp. 1480–1489.
- [24] (2019) Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7370–7377.
- [25] (2015) Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657.