Label-guided Learning for Text Classification


Text classification is one of the most important and fundamental tasks in natural language processing. Performance on this task depends mainly on text representation learning. Most existing learning frameworks focus on encoding local contextual information between words and neglect global clues, such as label information, when encoding text. In this study, we propose a label-guided learning framework, LguidedLearn, for text representation and classification. Our method is novel but simple: we only insert a label-guided encoding layer into commonly used text representation learning schemas. The label-guided layer performs label-based attentive encoding to map the universal text embedding (encoded by a contextual information learner) into different label spaces, resulting in label-wise embeddings. In our framework, the label-guided layer can be easily and directly combined with a contextual encoding method for joint learning. Text information is thus encoded based on both local contextual information and global label clues, so the obtained text embeddings are more robust and discriminative for text classification. Extensive experiments on benchmark datasets illustrate the effectiveness of the proposed method.

1 Introduction

Text classification can be described as follows: given a sequence of text (usually a sentence, paragraph, or document; see footnote 1), we need to build a learning system that outputs a one-hot vector (see footnote 2) indicating the category/class of the input. It is an important and fundamental task in the natural language processing community. In practice, many real applications can be cast as text classification tasks, such as document organization, news topic categorization, sentiment classification, and text-based disease diagnosis [13, 14, 21, 16, 24].

The essential step in text classification is to obtain a text representation. In earlier studies, a given piece of text was usually represented with a hand-crafted feature vector [22]. Recently, inspired by the success of word embedding learning, a piece of text (a sentence/paragraph/document) is also represented with an embedding, which is learned automatically from the raw text by neural networks. These learning methods mainly include sequential-based learning models [10, 25, 5, 13] and graph-based learning models [11, 4, 2, 24]. All of these methods model local contextual information between words to encode a piece of text into a universal embedding, without considering the differences between labels. More recently, some research suggests that global label clues are also important for text representation learning [18, 1, 15, 21, 16].

In this study, we exploit label constraints/clues to guide text information encoding and propose a label-guided learning framework, LguidedLearn, for text classification. In our framework, each label is represented by an embedding matrix. A label-guided layer maps universal contextual-based embeddings into different label spaces, resulting in label-wise text embeddings. LguidedLearn jointly learns word-word contextual encoding and label-word attentive encoding. The resulting text embeddings are informative and discriminative for text classification. A series of comprehensive experiments illustrates the effectiveness of the proposed model.

2 Related Work

In the computer vision community, many studies have exploited label information for image classification [1, 7]. These models jointly encode label description text and image information to enhance image classification performance. Recently, several studies have introduced label embedding learning into natural language processing tasks. For example, Nam et al. [15] proposed a model to learn label and word embeddings jointly. Pappas and Henderson [16] also presented a model, GILE, to encode input-label embeddings. However, all these models require a piece of description text for each label, and the learning performance depends on the quality of that text. This requirement limits the models’ applicability.

3 Method

In this section, we first describe intuitively how to incorporate labels into text encoding and which layers are needed for text representation learning. Then, we formally present the proposed framework LguidedLearn (Label-guided Learning) for text classification.

Figure 1: The framework of our proposed label-guided learning, LguidedLearn, for text classification.

3.1 Intuition

Given a piece of text, what kinds of information/clues should be encoded in text representation learning for classification? First, local contextual information is essential for text embedding learning. Second, we notice that not all words/characters in the given text are equally useful for correct labeling. Therefore, we also need global label constraints/information to guide text encoding. Based on the above considerations, a learning framework for text classification should include:

  • Pre-trained encoding layer: get pre-trained word or character embeddings;

  • Contextual encoding layer: encode contextual information between words into the text embeddings;

  • Label-guided encoding layer: perform label attentive learning to encode global information (constraints) into the text embeddings;

  • Classifying layer: conduct feature compression and text classification.

3.2 The Framework: LguidedLearn

The proposed label-guided learning framework is shown in Figure 1. In this section, we introduce the framework in detail. Given a sequence of text $S = (w_1, w_2, \dots, w_n)$ with $n$ words, we apply the following learning layers successively.

Pre-trained encoding layer

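The aim of the pre-trained layer is to obtain low-dimensional continuous embeddings for the words in the sequence of text:

$$(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n) = f^{(0)}(w_1, w_2, \dots, w_n), \tag{1}$$

where $\mathbf{x}_i \in \mathbb{R}^{d_0}$ ($d_0$ is the pre-trained embedding size) is a pre-trained embedding of word $w_i$, and $f^{(0)}$ is a word embedding learner, such as GloVe [17].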

Contextual encoding layer

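The contextual layer further encodes the words’ dynamic contextual information in the current text sequence:

$$(\mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_n) = f^{(1)}(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n), \tag{2}$$

where $\mathbf{h}_i \in \mathbb{R}^{d}$ ($d$ is the contextual embedding size) is the contextual-encoded embedding corresponding to word $w_i$, and $\mathbf{x}_i$ is the pre-trained embedding of $w_i$. In supervised learning tasks, $f^{(1)}$ can be effectively implemented with an LSTM or BiLSTM network.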

Label-guided encoding layer

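Let $\mathcal{Y} = \{y_1, y_2, \dots, y_C\}$ be the label set, where $C$ is the number of labels. Each label $y_l$ ($l = 1, \dots, C$) is represented with an embedding matrix $\mathbf{E}^{(l)}$ consisting of $k$ embeddings $(\mathbf{e}^{(l)}_1, \dots, \mathbf{e}^{(l)}_k)$, where $\mathbf{e}^{(l)}_j$ ($j = 1, \dots, k$) is the $j$th embedding in the embedding matrix $\mathbf{E}^{(l)}$. The label-guided layer jointly encodes label information and contextual information by projecting contextual-encoded embeddings into the label space. Take the $l$th ($1 \le l \le C$) label for example:

$$\mathbf{z}^{(l)} = f^{(l)}(\mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_n), \tag{3}$$

where $\mathbf{h}_1, \dots, \mathbf{h}_n$ are the contextual-encoded embeddings of the given text sequence and $\mathbf{z}^{(l)}$ is a label-wise embedding specific to the $l$th label. $f^{(l)}$ can be implemented in the following simple way:

$$\mathbf{z}^{(l)} = \sum_{i=1}^{n} \beta^{(l)}_i \mathbf{h}_i, \tag{4}$$

where $\beta^{(l)}_i$ is a label attentive weight that measures the compatibility of the pair $(\mathbf{h}_i, \mathbf{E}^{(l)})$, $\mathbf{h}_i$ is the contextual embedding of the $i$th word in the given text sequence, and $\mathbf{E}^{(l)}$ is the embedding matrix of label $y_l$. To get the compatibility weight, the cosine similarities between $\mathbf{h}_i$ and each embedding in $\mathbf{E}^{(l)} = (\mathbf{e}^{(l)}_1, \dots, \mathbf{e}^{(l)}_k)$ are computed (see footnote 3), resulting in a similarity vector $(s^{(l)}_{i1}, \dots, s^{(l)}_{ik})$. The largest similarity value is collected by

$$\alpha^{(l)}_i = \max_{1 \le j \le k} s^{(l)}_{ij}. \tag{5}$$

Considering formula (4), the collected values $(\alpha^{(l)}_1, \dots, \alpha^{(l)}_n)$ should be normalized over the words of the text, for instance with a softmax:

$$\beta^{(l)}_i = \frac{\exp(\alpha^{(l)}_i)}{\sum_{i'=1}^{n} \exp(\alpha^{(l)}_{i'})}. \tag{6}$$

According to the label-guided encoding formula (4), we obtain the label-wise embeddings $(\mathbf{z}^{(1)}, \dots, \mathbf{z}^{(C)})$, where $\mathbf{z}^{(l)}$ is the text embedding encoded with the guidance of label $y_l$. All the label-wise embeddings are concatenated:

$$\mathbf{z} = [\mathbf{z}^{(1)}; \mathbf{z}^{(2)}; \dots; \mathbf{z}^{(C)}], \tag{7}$$

where $\mathbf{z}$ is the ultimate embedding used to represent the given sequence of text.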
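The label-guided encoding steps described above (cosine similarity, maximum over a label's embeddings, normalization, weighted sum, concatenation) can be sketched in a few lines of dependency-free Python. This is only an illustrative sketch, not the authors' implementation: the function names, the toy vectors, and the choice of a softmax for the normalization step are assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def label_wise_embedding(H, E_l):
    """Encode a text under the guidance of one label.

    H   : list of n contextual word embeddings (each a list of d floats)
    E_l : list of k embeddings representing this label (each d floats)
    Returns the label-wise text embedding and the attentive weights.
    """
    # Largest cosine similarity between each word and the label's embeddings.
    alpha = [max(cosine(h, e) for e in E_l) for h in H]
    # Normalize over the words (softmax is an assumption of this sketch).
    beta = softmax(alpha)
    d = len(H[0])
    # Attention-weighted sum of the contextual embeddings.
    z_l = [sum(b * h[t] for b, h in zip(beta, H)) for t in range(d)]
    return z_l, beta

def label_guided_layer(H, label_matrices):
    """Concatenate the label-wise embeddings of all labels."""
    z = []
    for E_l in label_matrices:
        z_l, _ = label_wise_embedding(H, E_l)
        z.extend(z_l)
    return z

# Toy input: 3 words with d = 2, and two labels with k = 2 embeddings each.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [
    [[1.0, 0.0], [2.0, 0.1]],  # label 1 points mostly along the first axis
    [[0.0, 1.0], [0.1, 2.0]],  # label 2 points mostly along the second axis
]
z = label_guided_layer(H, labels)          # length = number of labels * d
_, beta_1 = label_wise_embedding(H, labels[0])
```

In this toy input the first word is aligned with label 1, so it receives the largest weight in `beta_1`; the concatenated embedding `z` has one block of size d per label.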

Classifying layer

In the classifying layer, we first employ an MLP to compress the large text embedding $\mathbf{z}$ into a smaller one:

$$\mathbf{z}' = \mathrm{MLP}(\mathbf{z}), \tag{8}$$

where $\mathbf{z}' \in \mathbb{R}^{d'}$. In this study, we empirically set $d' = 10C$ (10 times the number of labels $C$). Then the predicted label distribution of the given sequence of text is obtained by

$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathrm{MLP}(\mathbf{z}')). \tag{9}$$

In the above formula, the MLP encodes the embedding $\mathbf{z}'$ into a vector whose dimension equals the number of labels.
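As a rough illustration of the classifying layer, the sketch below compresses a concatenated text embedding to 10 times the number of labels and then maps it to a label distribution. The single-layer MLPs, the ReLU activation, the random weights, and all names are assumptions of this example; the text does not specify the depth or activation of the MLPs.

```python
import math
import random

def dense(x, W, b):
    """Affine map W x + b, where W is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (untrained, for illustration)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def classify(z, params):
    """Compress the text embedding, then produce a label distribution."""
    (W1, b1), (W2, b2) = params
    z_small = relu(dense(z, W1, b1))   # compression step
    logits = dense(z_small, W2, b2)    # one score per label
    return softmax(logits)

num_labels = 4   # hypothetical number of labels
d = 6            # hypothetical contextual embedding size
rng = random.Random(0)
params = (
    init_layer(10 * num_labels, num_labels * d, rng),  # compress to 10 * C
    init_layer(num_labels, 10 * num_labels, rng),      # map to C scores
)
z = [rng.uniform(-1.0, 1.0) for _ in range(num_labels * d)]
probs = classify(z, params)
predicted = max(range(num_labels), key=probs.__getitem__)
```

With untrained weights the distribution is close to uniform; in a real system both layers would be trained jointly with the encoding layers.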

4 Experiments and results

4.1 Datasets

In this study, we employ two text datasets to investigate the effectiveness of our proposed method. For convenience, the two datasets are denoted by TCD-1 (TCD: Text Classification Dataset) and TCD-2, respectively. Summary statistics of the two datasets are presented in Table 1.

Group   Dataset   # Doc       # Word    # Class   # Len
TCD-1   DBPedia   630,343     21,666    14        57.31
TCD-1   Yelp-B    598,000     25,709    2         137.97
TCD-1   Yelp-F    700,000     22,768    5         151.83
TCD-1   YahooQA   1,460,000   607,519   10        53.40
TCD-1   AGNews    127,600     13,009    4         43.84
TCD-2   20NG      18,846      42,757    20        221.26
TCD-2   R8        7,674       7,688     8         65.72
TCD-2   R52       9,100       8,892     52        69.82
TCD-2   Ohsumed   7,400       14,157    23        135.82
TCD-2   MR        10,662      18,764    2         20.39
Table 1: Summary statistics of the datasets TCD-1 and TCD-2. Note: #Doc is the number of documents and #Len is the average length of each document.

TCD-1 We select TCD-1 as an experimental dataset because it is used in [21] to evaluate LEAM, a model that jointly learns text and label embeddings and is a main baseline in our comparative analysis. TCD-1 includes five sub-datasets:

  • AGNews: news topic classification over four categories (world, entertainment, sports and business). Each document is composed of an internet news article.

  • Yelp-F: sentiment classification of polarity star labels ranging from 1 to 5. The dataset is obtained from the Yelp Review Dataset Challenge in 2015.

  • Yelp-B: sentiment classification of polarity labels (negative or positive). The data is the same as in Yelp-F; only the labels differ: star ratings 1 and 2 are treated as negative, and 4 and 5 as positive.

  • DBPedia: ontology classification over fourteen classes. The dataset is picked from DBpedia 2014 (Wikipedia).

  • YahooQA: QA topic classification over ten categories. The dataset is collected from the version 1.0 of the Yahoo! Answers Comprehensive Questions and Answers.
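The Yelp-B labeling rule in the list above can be written as a small mapping. This is a minimal sketch: the function name and the sample reviews are hypothetical, and excluding 3-star reviews is an assumption, since the description only says how 1, 2, 4 and 5 stars are mapped.

```python
def star_to_polarity(stars):
    """Map a 1-5 star rating to a binary polarity label.

    Returns None for 3 stars, which this sketch assumes are excluded.
    """
    if stars in (1, 2):
        return "negative"
    if stars in (4, 5):
        return "positive"
    if stars == 3:
        return None
    raise ValueError(f"star rating out of range: {stars}")

# Hypothetical reviews as (stars, text) pairs.
reviews = [(5, "great food"), (1, "never again"), (3, "it was fine")]
labeled = [
    (text, star_to_polarity(stars))
    for stars, text in reviews
    if star_to_polarity(stars) is not None
]
```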

TCD-2 We select TCD-2 to test our model because it is a comprehensive dataset covering many different domains and is widely used in more recent work [24]. TCD-2 also consists of five sub-datasets:

  • 20NG: document topic classification over 20 different categories.

  • R52: document topic classification over 52 categories. The R52 dataset is a subset of the Reuters 21578 dataset.

  • R8: document topic classification over 8 categories. The R8 dataset is also a subset of the Reuters 21578 dataset.

  • Ohsumed: disease classification over 23 categories. The Ohsumed corpus is from the MEDLINE database.

  • MR: binary sentiment classification. The MR dataset is a movie review dataset, in which each review only contains one sentence.

Though 20NG, R52 and R8 are all document topic classification datasets, the three sub-datasets differ considerably in their labels, as presented in Figure 2. For example, 20NG mainly covers sports, politics and computers, while R52 and R8 mainly contain life-related and finance-related topics. R8 has a coarser categorization than R52.

Figure 2: Labels of the document topic classification datasets: 20NG, R52 and R8.

4.2 Baselines

Baselines on TCD-1 Baselines reported on TCD-1 mainly contain five categories:

  • Powerful traditional language feature models, such as BOW (bag-of-words) [25];

  • Effective embedding based models, including fastText [9] and SWEM [19];

  • Deep learning models, mainly including commonly used CNN-based models and LSTM-based models, such as Small CNN [25], Large CNN [25], Deep CNN [5], LSTM, and SA-LSTM;

  • Attention based models: HAN [23] and Bi-BloSAN [20];

  • Label embedding based models: LEAM [21] and our proposed LguidedLearn.

Baselines on TCD-2 Baselines conducted on TCD-2 are:

  • Traditional models: TF-IDF + LR;

  • Embedding based models: PV-DBOW [12], PV-DM [12], fastText [9], and SWEM [19];

  • Sequential deep learning models: CNN-rand, CNN-non-static and LSTM;

  • Graph deep learning models: Graph-CNN-C [6], Graph-CNN-S [3], Graph-CNN-F [8] and Text GCN [24];

  • Attention based models: HAN [23] and Bi-BloSAN [20];

  • Label embedding based models: LEAM [21] and our proposed LguidedLearn.

TCD-2 is a very comprehensive dataset used for text classification. Besides sequential deep learning models, some recent graph-based neural network models have also been reported on TCD-2.

4.3 Experimental settings

Dataset setting The training/testing splits of TCD-1 and TCD-2 are the same as those used in [21] and [24], respectively.

Model setting In the pre-trained layer, we use GloVe to obtain pre-trained word embeddings. The contextual layer is implemented by a BiLSTM. In the label-guided layer, each label is represented by an embedding matrix consisting of five embeddings.

Learning setting In our training, batch size is 25 and learning rate is 0.001. All experiments are repeated 10 times.

4.4 Results and analysis

Performance comparison

A series of comparison experiments are conducted on the datasets of TCD-1 and TCD-2. A summary of text classification performance and simple analysis are presented.

- Results on TCD-1 The comparison results on TCD-1 are presented in Table 2. The results show that our proposed framework LguidedLearn achieves the best performance on all the test datasets of TCD-1. Even compared to recently published strong text classification algorithms (such as fastText, SWEM, Deep CNN (29 layers), Bi-BloSAN, and LEAM), LguidedLearn obtains stable gains, most notably on the AGNews, Yelp-B, and Yelp-F datasets.

Model YahooQA DBPedia AGNews Yelp-B  Yelp-F
BOW 68.90 96.60 88.80 92.20 58.00
SWEM 73.53 98.42 92.24 93.76 61.11
fastText 72.30 98.60 92.50 95.70 63.90
Small CNN 69.98 98.15 89.13 94.46 58.59
Large CNN 70.94 98.28 91.45 95.11 59.48
Deep CNN 73.43 98.71 91.27 95.72 64.26
LSTM 70.84 98.55 86.06 94.74 58.17
SA-LSTM - 98.60 - - -
HAN 75.80 - - - -
Bi-BloSAN 76.28 98.77 93.32 94.56 62.13
LEAM-linear 75.22 98.32 91.75 93.43 61.03
LEAM 77.42 99.02 92.45 95.31 64.09
LguidedLearn 77.61 99.08 93.67 96.80 68.08
Table 2: The results of text classification accuracy (%) on TCD-1.

- Results on TCD-2 The results on TCD-2 are presented in Table 3. Except on 20NG, LguidedLearn obtains the best classification accuracy on all datasets. The recently popular graph neural network (GNN) models show strong ability in classifying text documents, yet even the strong GNN-based models (Graph-CNN-C, Graph-CNN-S, Graph-CNN-F, and Text GCN) are surpassed by LguidedLearn on R8, R52, Ohsumed and MR, by 0.79 to 4.85 accuracy points over the best GNN result on each dataset. Documents in 20NG are very long (the average length is 221.26 words and about 18% of documents have more than 400 words) and exceed the maximum encoding length of most models. This is why, on 20NG, non-sequential encoding models (TF-IDF + LR, SWEM, and Text GCN) achieve better performance than sequential encoding models (such as LSTM and CNN based models). LguidedLearn, whose contextual layer is a BiLSTM, is also slightly affected by this factor.

Model             20NG    R8      R52     Ohsumed   MR
TF-IDF + LR       83.19   93.74   86.95   54.66     74.59
PV-DBOW           74.36   85.87   78.29   46.65     61.09
PV-DM             51.14   52.07   44.92   29.50     59.47
fastText          11.38   86.04   71.55   14.59     72.17
SWEM              85.16   95.32   92.94   63.12     76.65
CNN-rand          76.93   94.02   85.37   43.87     74.98
CNN-non-static    82.15   95.71   87.59   58.44     77.75
LSTM              65.71   93.68   85.54   41.13     75.06
LSTM (pretrain)   75.43   96.09   90.48   51.10     77.33
Graph-CNN-C       81.42   96.99   92.75   63.86     77.22
Graph-CNN-S       -       96.80   92.74   62.82     76.99
Graph-CNN-F       -       96.89   93.20   63.04     76.74
Text GCN          86.34   97.07   93.56   68.36     76.74
LEAM              81.91   93.31   91.84   58.58     76.95
LguidedLearn      85.58   97.86   96.12   70.45     82.07
Table 3: The results of text classification accuracy (%) on TCD-2.

Analysis of Label-guided encoding

The main difference between LguidedLearn and traditional deep learning models is the label-guided encoding layer, which performs label attentive learning. Compared to previous label attentive learning models, such as LEAM, LguidedLearn 1) extends a single label embedding to a label embedding space (represented by a series of embeddings) and 2) jointly learns contextual encoding and label-guided encoding. We analyze the label-guided layer along these lines.

- Label attentive learning analysis The results of LguidedLearn without the label-guided encoding layer, denoted by Label-guided (w/o), are presented in Table 4. After removing the label-guided layer, LguidedLearn reduces to a BiLSTM-based text classification model. The comparison (Label-guided (w/o) vs. LguidedLearn) shows that the label-guided layer brings large accuracy gains, especially on the complex task datasets 20NG, R52, Ohsumed and MR (for example, from 49.27% to 70.45% on Ohsumed and from 73.18% to 85.58% on 20NG).

The MR dataset is a sentiment classification task, which requires a model to capture detailed sentiment-related information. An example taken from MR is presented in Figure 3. The example shows that the label-guided layer can extract label-related (sentiment-related) information from the input words by performing label attentive learning. For example, the words “funny” and “also” are likely to be projected into the Positive space with large attentive weights, and the words “dark” and “disturbing” are more likely to be projected into the Negative space.

The 20NG, R52 and Ohsumed datasets have many labels, and their documents are very long. These are challenging text classification tasks that require a model to correctly extract label-related information from highly redundant and even noisy input. A medical text example taken from the Ohsumed dataset is presented in Figure 4. The example illustrates the effectiveness of label attentive learning: the text has more than 200 words, but only some pieces of it (shown in red) are projected into the correct label space with large label attentive weights. These pieces are strongly related to the sample label Digestive System Disease, such as “percutaneous cholangioplasty” and “balloon cholangioplasty of 17 patients with 28 benign biliary strictures”.

Model   20NG R8 R52 Ohsumed MR
Label-guided (w/o)   73.18 96.31 90.54 49.27 77.68
LguidedLearn 85.58 97.86 96.12 70.45 82.07
Table 4: Ablation analysis of the LguidedLearn framework. Label-guided (w/o): LguidedLearn without label-guided encoding layer.
Figure 3: Visualization of the learned attentive weights between words and labels. The sample is taken from the MR dataset which has two kinds of label: “Positive” and “Negative” .
Figure 4: An example of medical text document taken from the Ohsumed dataset which has 23 different labels. The correct label of the example is Digestive System Disease . Words denoted by red color have large attentive weights corresponding to the correct label. The sample actually has more than 200 words (words with small attentive weights are omitted by “…”).
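The per-dataset effect of the label-guided layer can be read off Table 4 with simple arithmetic. The short sketch below recomputes the accuracy gains from the two rows of that table; the variable names are chosen for this example only.

```python
# Accuracy (%) from Table 4: (without label-guided layer, full LguidedLearn).
table4 = {
    "20NG": (73.18, 85.58),
    "R8": (96.31, 97.86),
    "R52": (90.54, 96.12),
    "Ohsumed": (49.27, 70.45),
    "MR": (77.68, 82.07),
}

# Absolute gain in accuracy points contributed by the label-guided layer.
gains = {name: round(full - base, 2) for name, (base, full) in table4.items()}

largest = max(gains, key=gains.get)
smallest = min(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```

The gain is largest on Ohsumed and smallest on R8, where the BiLSTM baseline is already above 96%.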

- Label embedding space analysis Unlike a document/sentence sample, each label is a class, which should contain all kinds of representative information from the samples belonging to it. Therefore, it is not reasonable to use only one embedding to represent a label. In this study, we extend the label embedding to a label embedding space, represented by a series of embeddings (an embedding matrix). An example experiment on the 20NG dataset is presented in Figure 5 to show how accuracy varies with the size of the label embedding matrix. The performance first improves as the number of embeddings per label increases from one to five, and then decreases when more than ten embeddings per label are used. Increasing the label embedding size at first increases the labels’ representation ability, but using too many embeddings decreases their discriminative ability. The best size of the label embedding matrix depends on the dataset. For convenience, and to avoid excessive parameter tuning, we simply use five embeddings to represent each label in all experiments (see Section 4.3, Experimental settings). The results of a comprehensive experiment are presented in Table 5. The comparison shows that using an embedding matrix (as in LguidedLearn) yields consistent improvements over using one label embedding (as in previous label embedding models, such as LEAM).

Figure 5: Results (on the 20NG dataset) of LguidedLearn with using varying number of label embeddings.
Model   20NG R8 R52 Ohsumed MR
LguidedLearn-1 84.01 97.72 95.13 69.89 80.41
LguidedLearn 85.58 97.86 96.12 70.45 82.07
Table 5: Comparison results of one label embedding and label embedding space (represented by five embeddings by default in this study). LguidedLearn-1: LguidedLearn with using only one embedding to represent each label.

- Contextual and label attentive joint learning analysis The LEAM model also performs label attentive learning (through joint learning of labels and words) and is reported to be effective on TCD-1 [21]. In our additional experiments on other datasets (such as TCD-2), the results show that LEAM is not always effective and is sometimes much worse than other strong baselines (see Table 3). Besides public datasets, experimental results on our real application dataset (details are not presented due to space limitations) also illustrate the problem. The main reason is that the LEAM framework is not well suited to encoding contextual information between words. An important merit of our proposed LguidedLearn framework is the joint learning of contextual encoding and label attentive encoding. In the framework, the label-guided layer (performing label attentive learning) can be easily and directly combined with an effective contextual learning model (BiLSTM) to jointly encode contextual information and label attentive constraints. Comparison results are presented in Table 6 to illustrate the importance of this joint learning. To control for the effect of using a label embedding matrix, we make LguidedLearn also use one embedding for each label (as in LEAM), denoted by LguidedLearn-1. The results in Table 6 show that even LguidedLearn-1 surpasses LEAM by a large margin. On some datasets, such as R8 and MR, LEAM performs worse than a BiLSTM model because it lacks contextual encoding between words.

Model   20NG R8 R52 Ohsumed MR
BiLSTM   73.18 96.31 90.54 49.27 77.68
LEAM 81.91 93.31 91.84 58.58 76.95
LguidedLearn-1 84.01 97.72 95.13 69.89 80.41
Table 6: Compared to previous label embedding attentive learning model LEAM. LguidedLearn-1: LguidedLearn with using only one embedding to represent each label.

4.5 Preliminary experiments with BERT

Though we employ a BiLSTM to encode contextual information in the LguidedLearn framework, the proposed label-guided encoding layer can be applied with any other contextual information learner. Results of preliminary experiments with BERT are reported in Table 7. In these experiments, we employ pre-trained BERT to produce contextual embeddings for each word in an input document/sentence. The preliminary results in Table 7 show that the label-guided layer improves accuracy over BERT on all five datasets by performing label attentive learning.

Model   20NG R8 R52 Ohsumed MR
BERT   67.90 96.02 89.66 51.17 79.24
Lguided-BERT-1 76.09 97.49 94.26 59.41 81.03
Lguided-BERT-3 78.87 98.28 94.32 62.37 81.06
Table 7: Label guided encoding layer is applied with BERT. Lguided-BERT-1: label guided encoding applied with the last layer of BERT. Lguided-BERT-3: label guided encoding applied with the last three layers of BERT, where each layer uses different label embedding matrix.

5 Conclusion

In this study, we propose a universal framework, LguidedLearn, to exploit global label information for text representation and classification. In the framework, a label-guided encoding layer can be easily and directly combined with a contextual information encoding module to improve text classification. A series of extensive experiments and analyses illustrates the effectiveness of the proposed learning schema.


  1. We will not specifically differentiate “sentence”, “document”, and “paragraph”. These terms can be used interchangeably.
  2. In this study we only consider the single-label (not multi-label) classification problem.
  3. For the cosine similarities to be computable, the dimension of the contextual embeddings must equal the dimension of the embeddings in the label embedding matrix.


References

  1. Z. Akata, F. Perronnin, Z. Harchaoui and C. Schmid (2013) Label-embedding for attribute-based classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 819–826.
  2. P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro and R. Faulkner (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
  3. J. Bruna, W. Zaremba, A. Szlam and Y. LeCun (2013) Spectral networks and locally connected networks on graphs.
  4. H. Cai, V. W. Zheng and K. C. Chang (2018) A comprehensive survey of graph embedding: problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering 30 (9), pp. 1616–1637.
  5. A. Conneau, H. Schwenk, L. Barrault and Y. LeCun (2016) Very deep convolutional networks for text classification.
  6. M. Defferrard, X. Bresson and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pp. 3844–3852.
  7. A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato and T. Mikolov (2013) Devise: a deep visual-semantic embedding model. In Advances in neural information processing systems, pp. 2121–2129.
  8. M. Henaff, J. Bruna and Y. LeCun (2015) Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.
  9. A. Joulin, E. Grave, P. Bojanowski and T. Mikolov (2016) Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
  10. Y. Kim (2014) Convolutional neural networks for sentence classification.
  11. T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks.
  12. Q. Le and T. Mikolov (2014) Distributed representations of sentences and documents. In International conference on machine learning, pp. 1188–1196.
  13. P. Liu, X. Qiu and X. Huang (2016) Recurrent neural network for text classification with multi-task learning. pp. 2873–2879.
  14. R. Miotto, L. Li, B. A. Kidd and J. T. Dudley (2016) Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Scientific reports 6, pp. 26094.
  15. J. Nam, E. L. Mencía and J. Fürnkranz (2016) All-in text: learning document, label, and word representations jointly. In Thirtieth AAAI Conference on Artificial Intelligence.
  16. N. Pappas and J. Henderson (2019) GILE: a generalized input-label embedding for text classification. Transactions of the Association for Computational Linguistics 7, pp. 139–155.
  17. J. Pennington, R. Socher and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543.
  18. J. A. Rodriguez-Serrano, F. Perronnin and F. Meylan (2013) Label embedding for text recognition. In BMVC, pp. 5–1.
  19. D. Shen, G. Wang, W. Wang, M. R. Min, Q. Su, Y. Zhang, C. Li, R. Henao and L. Carin (2018) Baseline needs more love: on simple word-embedding-based models and associated pooling mechanisms.
  20. T. Shen, T. Zhou, G. Long, J. Jiang and C. Zhang (2018) Bi-directional block self-attention for fast and memory-efficient sequence modeling. ICLR.
  21. G. Wang, C. Li, W. Wang, Y. Zhang, D. Shen, X. Zhang, R. Henao and L. Carin (2018) Joint embedding of words and labels for text classification. arXiv preprint arXiv:1805.04174.
  22. S. Wang and C. D. Manning (2012) Baselines and bigrams: simple, good sentiment and topic classification. In Proceedings of the 50th annual meeting of the association for computational linguistics: Short papers-volume 2, pp. 90–94.
  23. Z. Yang, D. Yang, C. Dyer, X. He, A. Smola and E. Hovy (2016) Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pp. 1480–1489.
  24. L. Yao, C. Mao and Y. Luo (2019) Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7370–7377.
  25. X. Zhang, J. Zhao and Y. LeCun (2015) Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657.