Efficient Path Prediction for Semi-Supervised and Weakly Supervised Hierarchical Text Classification
Hierarchical text classification has many real-world applications. However, labeling a large number of documents is costly. In practice, we can use semi-supervised learning or weakly supervised learning (e.g., dataless classification) to reduce the labeling cost. In this paper, we propose a path cost-sensitive learning algorithm that utilizes the structural information and makes further use of unlabeled and weakly labeled data. We use a generative model to leverage the large amount of unlabeled data and introduce path constraints into the learning algorithm to incorporate the structural information of the class hierarchy. The posterior probabilities of both unlabeled and weakly labeled data can be incorporated with path-dependent scores. Because we add a structure-sensitive cost to the learning algorithm to keep the classification consistent with the class hierarchy, and do not need to reconstruct the feature vectors for different structures, we significantly reduce the computational cost compared to structural output learning. Experimental results on two hierarchical text classification benchmarks show that our approach is both effective and efficient for semi-supervised and weakly supervised hierarchical text classification.
Text classification has always been an important task, particularly given the vast growth of text data on the Web that needs to be classified. Applications include news classification (Dagan et al., 1997), product review classification (Pang et al., 2002), spam detection (Mccord and Chuah, 2011), and so on. Hierarchical classification (HC) and structured prediction are involved since the classes are usually organized as a hierarchy. In recent decades, many approaches have been proposed for HC. For example, top-down classification (Sun and Lim, 2001) classifies documents at the top layer and then propagates the results to the next layer until it reaches the leaves. This greedy strategy propagates classification errors along the hierarchy. In contrast, bottom-up classification (Bennett and Nguyen, 2009) backpropagates the labels from the leaves to the top layer, so leaves that have less training data, even though they share some similarities with their parents and siblings, may not be well considered and trained. Moreover, structural output learning, such as the structural perceptron (Collins, 2002) and structural SVM (Tsochantaridis et al., 2005), can leverage the structural information in the class hierarchy well, but it requires Kesler construction (Nilsson, 1965; Duda and Hart, 1973), where for each sub-structure new features are constructed from the existing features and the class dependencies. That is why structural output learning usually takes more time to train than top-down and bottom-up approaches. All the above approaches are supervised methods. When more of the data are unlabeled, it is more challenging to account for both class dependencies and efficiency in the practical use of hierarchical text classification.
There exist several ways to use the large amount of unlabeled data, among which semi-supervised learning (SSL) (Chapelle et al., 2010) and weakly supervised learning such as dataless classification (Chang et al., 2008; Song and Roth, 2014) are two representative ones. An example of SSL is the work of Nigam et al. (2006), which uses a mixture of multinomials to estimate the posterior probabilities of unlabeled data; the mixture shares its parameters with the naive Bayes model for the labeled data. More parameters can be introduced to model the hierarchical structure, but this makes the model redundant and still not accurate enough. As for weakly supervised learning, dataless classification (Song and Roth, 2014) uses the semantic similarities between label descriptions and document contents to provide weak labels for documents. When applying the weak labels, current approaches simply treat each label similarity independently and do not consider the path constraints in the label hierarchy.
To tackle the above problems for semi-supervised and weakly supervised hierarchical text classification, we propose a path cost-sensitive learning algorithm based on a generative model for text. When estimating the path posterior distribution, path-dependent scores are incorporated to make the posterior path-sensitive. The path-dependent score evaluates how accurately the current model classifies a document among the paths in the class hierarchy. Then, during inference, classification is constrained to stay consistent with the hierarchy. With this mechanism, we develop a simple model with fewer parameters than existing approaches while maintaining consistency with the class dependencies in the hierarchy.
The contributions of our paper are as follows:
- We propose a new approach for hierarchical text classification based on a probabilistic framework, and we interpret it in terms of cost-sensitive learning and constraint learning.
- We show significant improvements on two widely used hierarchical text classification benchmarks and demonstrate our algorithm's effectiveness in semi-supervised and weakly supervised learning settings.
- Our approach reduces the complexity of traditional methods. We achieve a speedup of tens of times while outperforming the state-of-the-art discriminative baseline.
The code and data used in the paper are available at https://github.com/HKUST-KnowComp/PathPredictionForTextClassification.
2. Related Work
There are only a few studies on semi-supervised hierarchical (text) classification (Nigam et al., 2006; Dalvi et al., 2016), partly because of the difficulty of evaluating the class dependencies for unlabeled data and the time cost of using more complicated algorithms such as structural output learning (Mann and McCallum, 2008). Most semi-supervised hierarchical text classification works are based on the EM algorithm introduced by Nigam et al. (2006). Some are related to ours (see Section 2.1), while others are not; e.g., Dalvi et al. (2016) used the EM algorithm to deal with the incomplete hierarchy problem, which is not the same setting as ours. In this section, we start with a review of general hierarchical text classification and then explain the uniqueness and significance of our work.
Hierarchical text classification has been studied for several decades. Flat multi-label classification methods (Tikk and Biró, 2004) ignore the hierarchy and thus perform poorly for HC. Early works (Koller and Sahami, 1997; Dumais and Chen, 2000; Liu et al., 2005) often used “pachinko-machine models”, which assigned a local classifier to each node and classified documents recursively. Top-down and bottom-up approaches build on the local classifier idea, but top-down is a greedy strategy, so it may not find optimal solutions, while the bottom-up approach does not adequately consider and train the classes with less training data.
To better exploit the class hierarchy, algorithms particularly designed for trees can assist. In practice, both generative and discriminative models are used. In the following, we will review the related work of these two categories.
2.1. Generative Models
(Nigam et al., 2006) summarized the text generative model and provided the naive Bayes classifier and Expectation-Maximization (EM) algorithm for flat classification. As for HC, it introduced more parameters to account for the class dependencies. (McCallum et al., 1998) remodeled the framework in another way. They applied shrinkage to smooth parameter estimates using the class hierarchy. (Cong et al., 2004) also used the same generative framework but proposed a clustering-based partitioning technique. These generative hierarchical methods can bring some structural information to the model, but they do not make full use of the hierarchy and have difficulties scaling to large hierarchies.
2.2. Discriminative Models
Discriminative methods are also popular for HC. Orthogonal Transfer (Xiao et al., 2011) borrowed the idea of top-down classification, where each node had a regularized classifier and the normal vector of each node's classifying hyperplane was encouraged to be orthogonal to those of its ancestors. Hierarchical Bayesian Logistic Regression (Gopal et al., 2012) leveraged the hierarchical dependencies by giving each child node a prior centered on the parameters of its parent. The idea was further developed in Hierarchically Regularized SVM and Logistic Regression (Gopal and Yang, 2013), where the hierarchical dependencies were incorporated into the parameter regularization structure. More recently, the idea of hierarchical regularization has been applied to deep models and also showed some improvements (Peng et al., 2018). Charuvaka and Rangwala (2015) simplified the construction of the classifier by building a binary classifier on each tree node and applying cost-sensitive learning (HierCost). All the above approaches are still based on top-down or greedy classification, which can result in non-optimal solutions. Another work similar to ours is the hierarchical loss for classification of Wu et al. (2017), which defined the hierarchical loss or win as the weighted sum of the probabilities of the nodes along the path. In contrast to their work, we use the sum of the (weakly) labeled instances along a path as the score to perform path cost-sensitive learning.
To find solutions with stronger theoretical guarantees, some algorithms were developed based on structural output learning (Lafferty et al., 2001; Collins, 2002; Taskar et al., 2003; Tsochantaridis et al., 2005), which can be proved to be globally optimal for HC. Hierarchical SVM (HSVM) (Cai and Hofmann, 2004), one example of structural SVM, generalized SVM to structured data with a path-dependent discriminant function. In general, when performing structural output learning, Kesler construction is used to construct the feature vectors for comparing different structures (Nilsson, 1965; Duda and Hart, 1973), which adds much more computation than top-down or bottom-up classification approaches.
In summary, generative and discriminative models can both be adapted to HC problems. Discriminative models achieve better performance with adequate labeled data (Ng and Jordan, 2001), especially if a better representation for text can be found, e.g., using deep learning (Peng et al., 2018). Generative models, in contrast, have an advantage in handling more uncertainty (Ng and Jordan, 2001) with limited labeled data and under noisy supervision. Our work is based on a generative model yet has the same parameter size as flat classification. We find that it significantly boosts the performance of semi-supervised and weakly supervised learning while also reducing the computational cost.
3. Path Prediction for Hierarchical Classification
In HC, the classes constitute a hierarchy, denoted as $\mathcal{T}$. $\mathcal{T}$ is a tree whose depth is $d$, with the root node in depth $0$. The classes are then distributed from depth $1$ to depth $d$. We suppose that all leaf nodes are in depth $d$. This can always be satisfied by expanding a shallower leaf node (i.e., giving it a child) until it reaches depth $d$. When evaluating models, these dummy nodes can easily be removed from $\mathcal{T}$ so that they do not affect the performance measure.
Let $C_1, \dots, C_d$ be the class sets in depth $1$, ..., depth $d$ respectively, with sizes $M_1, \dots, M_d$. To classify a document, we assign a label in each depth, i.e., the document gets labels $c_1 \in C_1, \dots, c_d \in C_d$. These form a path in $\mathcal{T}$ if the classification results in each depth are consistent with the other depths. We want to maintain the consistency of the hierarchy, so we classify the documents by paths instead of by multi-label classes. Once a document is assigned a path, its classes are the nodes lying on that path. This is similar to structured prediction, since a path can be regarded as a structured object that contains more information than a set of multi-label classes without path constraints.
To sum up, path prediction aims at making use of the structural information in the class hierarchy to train the classifier. Note that the classifier is for paths in the hierarchy instead of classes. The details of path prediction algorithm are given in the next section.
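The path set used as the label space can be enumerated directly from the class tree. The following is a minimal sketch under stated assumptions: the hierarchy is given as a hypothetical `children` dictionary, node names are unique, and dummy children are named by appending `*` to the shallow leaf. None of these names come from the paper's released code.

```python
from typing import Dict, List


def root_to_leaf_paths(children: Dict[str, List[str]], root: str, depth: int) -> List[List[str]]:
    """List every root-to-leaf path as the labels at depths 1..depth (root excluded)."""
    paths: List[List[str]] = []

    def walk(node: str, prefix: List[str]) -> None:
        kids = children.get(node, [])
        if kids:
            for kid in kids:
                walk(kid, prefix + [kid])
            return
        # Leaf reached: pad a shallow leaf with dummy children up to the full depth.
        padded, last = list(prefix), node
        while len(padded) < depth:
            last = last + "*"
            padded.append(last)
        paths.append(padded)

    walk(root, [])
    return paths


# Hypothetical two-level hierarchy in which "rec" is a shallow leaf.
children = {"root": ["sci", "rec"], "sci": ["sci.med", "sci.space"]}
paths = root_to_leaf_paths(children, "root", depth=2)
```

Each returned path has exactly `depth` labels, so a classifier over paths assigns one consistent label per depth by construction.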
4. Path Cost-Sensitive Learning
In this section, we introduce our method, which utilizes the structural information to learn the classifier, and interpret it in terms of cost-sensitive learning and constraint learning.
4.1. Path-Generated Probabilistic Framework
We base our work on a widely-used probabilistic framework, which constructs a generative model for text. In the framework, text data are assumed to be generated from a mixture of multinomial distributions over words. Previous works (Nigam et al., 2006; McCallum et al., 1998) assumed that the mixture components in the generative model have a one-to-one correspondence with the classes. However, in order to perform path prediction, we presume that the mixture components have a one-to-one correspondence with the paths.
Define $\mathcal{P}$ to be the set of all paths that start from the root node and end in some leaf node of the class hierarchy $\mathcal{T}$, so the size of $\mathcal{P}$ equals the number of leaf nodes $M_d$. Let $\mathcal{V}$ be the vocabulary. Denote by $\theta$ the parameters of the mixture multinomial model. For a document $x_i$ with length $|x_i|$, let $x_{it}$ be the frequency of word $w_t$ in $x_i$, which is the document feature in the vector space model (Liu and Yang, 2012). The generative process then runs as follows.
First, select a mixture component, or equivalently a path $p_j$, from $P(p_j \mid \theta)$ (the prior of $p_j$). Next, generate the document $x_i$ by selecting the length $|x_i|$ and picking $|x_i|$ words from $P(w_t \mid p_j; \theta)$. According to the law of total probability and the naive Bayes assumption that, given the labels, the occurrences of each word in a document are conditionally independent of its position and of the other words, the probability of generating $x_i$ is

$$P(x_i \mid \theta) \propto P(|x_i|) \sum_{j=1}^{M_d} P(p_j \mid \theta) \prod_{t=1}^{|\mathcal{V}|} P(w_t \mid p_j; \theta)^{x_{it}} \tag{1}$$

In general, document lengths are assumed to be independent of classes, and thus independent of paths. The model parameters therefore include the path prior $\theta_{p_j} = P(p_j \mid \theta)$ and the multinomial distribution over words for each path, $\theta_{w_t \mid p_j} = P(w_t \mid p_j; \theta)$.
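The generative story can be made concrete with a small sampler. This is only an illustrative sketch: the two paths, the three-word vocabulary, and all probabilities are made-up values, and the document length is passed in rather than sampled, since the text treats it as independent of the path.

```python
import random
from collections import Counter


def generate_document(path_prior, word_probs, length, rng):
    """Sample one document: first a path from the prior, then `length` words from that path."""
    paths = list(path_prior)
    path = rng.choices(paths, weights=[path_prior[p] for p in paths], k=1)[0]
    vocab = list(word_probs[path])
    words = rng.choices(vocab, weights=[word_probs[path][w] for w in vocab], k=length)
    return path, Counter(words)  # bag-of-words counts, one entry per observed word


# Hypothetical parameters: one multinomial over words per root-to-leaf path.
path_prior = {"sci/med": 0.5, "sci/space": 0.5}
word_probs = {
    "sci/med": {"doctor": 0.7, "orbit": 0.1, "study": 0.2},
    "sci/space": {"doctor": 0.1, "orbit": 0.7, "study": 0.2},
}
path, counts = generate_document(path_prior, word_probs, length=20, rng=random.Random(0))
```

Note that the mixture components are indexed by paths, not by individual classes, which is the only change relative to the flat generative model.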
4.2. Path-Dependent Scores
We are given a data set $\mathcal{D} = \mathcal{D}_l \cup \mathcal{D}_u$, consisting of the labeled documents $\mathcal{D}_l$ and the unlabeled documents $\mathcal{D}_u$. We first derive the parameter estimation in a supervised manner. With only labeled data considered, we maximize $P(\theta \mid \mathcal{D}_l)$, which can be done by counting the corresponding occurrences of events. For flat classification, the event counts are usually hard counts. Here we substitute a path-dependent score for them.
First we define the score of a node in $\mathcal{T}$ for a document. Suppose $x_i \in \mathcal{D}$ and $c_j^k$ is the depth-$k$ node in path $p_j$. The node score of $c_j^k$ for $x_i$, denoted as $S(x_i, c_j^k)$, indicates the label of $x_i$. When $x_i$ is labeled with the ground truth labels, $S(x_i, c_j^k) = 1$ if and only if $c_j^k$ is one of $x_i$'s labels. We also consider the weakly supervised case. In the dataless text classification of Song and Roth (2014), $x_i$ is weakly labeled by its semantic similarities with the classes. We assign $S(x_i, c_j^k) = 1$ if $c_j^k$ has the largest similarity with $x_i$ among all classes in depth $k$, and $0$ otherwise.
Next we introduce the path score. For $x_i$, the score of path $p_j$, denoted as $S(x_i, p_j)$, is the sum of the scores of the nodes lying on $p_j$, excluding the root node since it carries no information for classification:

$$S(x_i, p_j) = \sum_{k=1}^{d} S(x_i, c_j^k) \tag{2}$$
Take the hierarchy in Figure 1 as an example. is labeled as , then , , while other paths score 0. If is weakly labeled by the similarities, then we label it with the classes having the maximum similarity in each depth and obtain the path scores in the same way.
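To illustrate how path scores are computed, here is a small worked sketch. The hierarchy below (paths `A/A1`, `A/A2`, `B/B1`) and the similarity values are hypothetical and are not the hierarchy of Figure 1; node names are assumed to be unique across depths.

```python
def weak_labels(similarities_by_depth):
    """Pick, at every depth, the class with the largest similarity to the document."""
    return {max(sims, key=sims.get) for sims in similarities_by_depth}


def path_score(path, labels):
    """Count the nodes on `path` (root excluded) that are among the document's labels."""
    return sum(1 for node in path if node in labels)


# Hypothetical hierarchy with three root-to-leaf paths of depth 2.
paths = [("A", "A1"), ("A", "A2"), ("B", "B1")]

# Ground-truth case: the document is labeled with A (depth 1) and A1 (depth 2).
true_scores = [path_score(p, {"A", "A1"}) for p in paths]  # [2, 1, 0]

# Weakly labeled case: one similarity table per depth (made-up values).
sims = [{"A": 0.6, "B": 0.4}, {"A1": 0.2, "A2": 0.3, "B1": 0.5}]
weak_scores = [path_score(p, weak_labels(sims)) for p in paths]  # [1, 1, 1]
```

The second case shows that per-depth maxima need not lie on a single path; the path score then spreads partial credit over every path that contains one of the selected nodes.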
4.3. Path Cost-Sensitive Naive Bayes Classifier
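When computing the empirical counts, Laplace smoothing is often applied by adding one count to each event, to avoid zero probabilities and shrink the estimator. Combining the event counts (i.e., the path scores) and the smoothing term, the parameter estimates are:

$$\hat{\theta}_{p_j} = P(p_j \mid \hat{\theta}) = \frac{1 + \sum_{x_i \in \mathcal{D}_l} S(x_i, p_j)}{M_d + \sum_{j'=1}^{M_d} \sum_{x_i \in \mathcal{D}_l} S(x_i, p_{j'})} \tag{3}$$

$$\hat{\theta}_{w_t \mid p_j} = P(w_t \mid p_j; \hat{\theta}) = \frac{1 + \sum_{x_i \in \mathcal{D}_l} x_{it}\, S(x_i, p_j)}{|\mathcal{V}| + \sum_{s=1}^{|\mathcal{V}|} \sum_{x_i \in \mathcal{D}_l} x_{is}\, S(x_i, p_j)} \tag{4}$$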
There are two aspects of using the path scores as event counts:
Cost-sensitive performance measures are considered since different data samples are given different weights. In Figure 1, is counted twice for , once in and once in , thus obtaining more weights. and are not right paths for , but they still classify correctly in depth , thus get one count, less than but larger than other paths who have no correct labels at all. This path cost-sensitive learning behavior helps the model to maintain structural information.
The path scores act as indicators over paths, enabling the model to classify the documents by paths. Path prediction in effect puts constraints on the classifier: the prediction results must be consistent with the class hierarchy. Furthermore, this constraint learning reduces the search space and improves efficiency.
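The counting view of the estimator can be sketched in a few lines. This is a minimal illustration, assuming the usual add-one (Laplace) form in which path scores replace hard event counts; the function name and the toy data are hypothetical.

```python
def estimate_parameters(docs, scores, num_paths, vocab_size):
    """Smoothed estimates with path scores as soft event counts.

    docs[i][t]   : count of word t in document i
    scores[i][j] : path score of path j for document i
    """
    # Path prior: add-one smoothing over the total score mass of each path.
    path_mass = [sum(s[j] for s in scores) for j in range(num_paths)]
    total = sum(path_mass)
    prior = [(1.0 + m) / (num_paths + total) for m in path_mass]

    # Word distributions: word counts weighted by the document's score for the path.
    word_probs = []
    for j in range(num_paths):
        weighted = [sum(x[t] * s[j] for x, s in zip(docs, scores)) for t in range(vocab_size)]
        denom = vocab_size + sum(weighted)
        word_probs.append([(1.0 + w) / denom for w in weighted])
    return prior, word_probs


# Two toy documents over a 2-word vocabulary and three paths of depth 2.
docs = [[2, 0], [0, 3]]
scores = [[2, 1, 0], [0, 0, 2]]
prior, word_probs = estimate_parameters(docs, scores, num_paths=3, vocab_size=2)
```

In this toy run the first document contributes to its fully correct path with weight 2 and to the partially correct path with weight 1, which is the cost-sensitive behavior described above.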
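After estimating $\hat{\theta}$ from $\mathcal{D}_l$, for any test document $x_i$, the posterior probability distribution $P(p_j \mid x_i; \hat{\theta})$ can be obtained by Bayes' rule:

$$P(p_j \mid x_i; \hat{\theta}) = \frac{P(p_j \mid \hat{\theta}) \prod_{t=1}^{|\mathcal{V}|} P(w_t \mid p_j; \hat{\theta})^{x_{it}}}{\sum_{j'=1}^{M_d} P(p_{j'} \mid \hat{\theta}) \prod_{t=1}^{|\mathcal{V}|} P(w_t \mid p_{j'}; \hat{\theta})^{x_{it}}} \tag{5}$$

Then $x_i$ will be classified into $\arg\max_{p_j \in \mathcal{P}} P(p_j \mid x_i; \hat{\theta})$.
The path cost-sensitive naive Bayes classifier (PCNB) for the generative model is introduced above. Next we present the semi-supervised path cost-sensitive learning algorithm.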
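Inference over paths can be sketched as follows. The code works in log space for numerical stability, which is an implementation choice of this sketch rather than something stated in the text; the parameters are made-up values and are assumed to be strictly positive (as they are after smoothing).

```python
import math


def path_posterior(counts, prior, word_probs):
    """Posterior over paths for one document, computed in log space."""
    log_joint = [
        math.log(pj) + sum(x * math.log(pw) for x, pw in zip(counts, pws))
        for pj, pws in zip(prior, word_probs)
    ]
    m = max(log_joint)  # subtract the maximum before exponentiating to avoid underflow
    unnorm = [math.exp(v - m) for v in log_joint]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def predict_path(counts, prior, word_probs):
    """Index of the path with the largest posterior probability."""
    post = path_posterior(counts, prior, word_probs)
    return max(range(len(post)), key=post.__getitem__)


# Hypothetical parameters: two paths, two-word vocabulary.
prior = [0.5, 0.5]
word_probs = [[0.9, 0.1], [0.1, 0.9]]
posterior = path_posterior([2, 0], prior, word_probs)
best = predict_path([2, 0], prior, word_probs)
```

Because the prediction is a single path, the labels it implies at every depth are consistent with the hierarchy without any post-processing.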
4.4. Semi-Supervised Path Cost-Sensitive Learning
Until now, only the labeled data have been used during training, but we want to make use of the unlabeled data to improve the classifier. We follow Nigam et al. (2006) and apply the EM technique for SSL.
When the initial parameters $\theta$ are given, the posterior probabilities $P(p_j \mid x_i; \theta)$ of an unlabeled document $x_i \in \mathcal{D}_u$, computed by Bayes' rule as in Section 4.3, can act as the path scores $S(x_i, p_j)$ for $x_i$. Combining the labeled and unlabeled data, the parameter estimates become

$$\hat{\theta}_{p_j} = \frac{1 + \sum_{x_i \in \mathcal{D}} S(x_i, p_j)}{M_d + \sum_{j'=1}^{M_d} \sum_{x_i \in \mathcal{D}} S(x_i, p_{j'})} \tag{6}$$

$$\hat{\theta}_{w_t \mid p_j} = \frac{1 + \sum_{x_i \in \mathcal{D}} x_{it}\, S(x_i, p_j)}{|\mathcal{V}| + \sum_{s=1}^{|\mathcal{V}|} \sum_{x_i \in \mathcal{D}} x_{is}\, S(x_i, p_j)} \tag{7}$$

Note that the value of $S(x_i, p_j)$ for $x_i \in \mathcal{D}_u$ lies in $[0, 1]$ since it is a posterior probability. Therefore, the unlabeled data weigh less than the labeled data when estimating the parameters. This is reasonable because the labeled data are more reliable than the inference results on unlabeled data, especially in the early iterations, when the model has not yet converged.
The new $\theta$ obtained via Eqs. (6) and (7) is then used to compute the posterior probabilities of the unlabeled documents again, which in turn update $\theta$. The iterative process keeps increasing the likelihood of the dataset $\mathcal{D}$, or equivalently its log-likelihood.
According to Dempster et al. (1977), the convergence of EM is guaranteed, but only to a local maximum. To help the algorithm find a good local maximum, we initialize $\theta$ with the parameters obtained by the naive Bayes classifier on $\mathcal{D}_l$. Algorithm 1 presents the EM algorithm for path cost-sensitive classification (PCEM).
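The EM procedure described in this section can be sketched as below. This is not a transcription of Algorithm 1: it assumes add-one smoothed estimates with path scores as soft counts, runs a fixed number of iterations instead of testing convergence, and uses hypothetical function names and toy data.

```python
import math


def m_step(docs, scores, num_paths, vocab_size):
    """Smoothed parameter estimates using path scores as soft counts."""
    mass = [sum(s[j] for s in scores) for j in range(num_paths)]
    prior = [(1.0 + m) / (num_paths + sum(mass)) for m in mass]
    word_probs = []
    for j in range(num_paths):
        w = [sum(x[t] * s[j] for x, s in zip(docs, scores)) for t in range(vocab_size)]
        denom = vocab_size + sum(w)
        word_probs.append([(1.0 + v) / denom for v in w])
    return prior, word_probs


def e_step(counts, prior, word_probs):
    """Posterior over paths for one document; used as the path scores of unlabeled data."""
    logs = [math.log(pj) + sum(x * math.log(pw) for x, pw in zip(counts, pws))
            for pj, pws in zip(prior, word_probs)]
    m = max(logs)
    unnorm = [math.exp(v - m) for v in logs]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def pcem(labeled_docs, labeled_scores, unlabeled_docs, num_paths, vocab_size, iterations=10):
    # Initialize with the supervised estimates from the labeled data only.
    prior, word_probs = m_step(labeled_docs, labeled_scores, num_paths, vocab_size)
    for _ in range(iterations):
        # E-step: posteriors of unlabeled documents become their path scores.
        unlabeled_scores = [e_step(x, prior, word_probs) for x in unlabeled_docs]
        # M-step: re-estimate from labeled and unlabeled data together.
        prior, word_probs = m_step(labeled_docs + unlabeled_docs,
                                   labeled_scores + unlabeled_scores,
                                   num_paths, vocab_size)
    return prior, word_probs


# Toy data: two paths of depth 2, two-word vocabulary.
prior, word_probs = pcem([[3, 0], [0, 3]], [[2, 0], [0, 2]], [[2, 0], [0, 4]],
                         num_paths=2, vocab_size=2)
```

Labeled scores range up to the depth of the hierarchy while unlabeled scores are probabilities, so the labeled documents automatically carry more weight in each M-step.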
To evaluate the effectiveness and efficiency of our approach empirically, we design experiments on semi-supervised and weakly supervised hierarchical text classification tasks and compare against representative and state-of-the-art baselines.
5.1. Experimental Design
We use two datasets, both of which have semi-supervised and weakly supervised versions. The statistics are listed in Table 1.
We compare our path cost-sensitive algorithms (PCNB and PCEM) with the following baselines:
- Flat naive Bayes classifier (Flat-NB) and Flat-EM algorithm: the flat classifiers introduced in (Nigam et al., 2006).
- Naive Bayes classifier with multiple components (NBMC) and EMMC: a more expressive model proposed by (Nigam et al., 2006).
- Top-down naive Bayes classifier (TDNB) and TDEM: the classifiers run in the top-down way.
- Win-driven naive Bayes classifier (WDNB) and WDEM: the modified hierarchical loss for classification (Wu et al., 2017).
- Logistic regression (LR) and SVM: two classical discriminative methods. Our experiments use LIBLINEAR (https://www.csie.ntu.edu.tw/) (Fan et al., 2008) to train and test the corresponding models. During the experiments, we found that the dual solvers were much faster than the primal solvers and even performed better, so we chose the dual solvers.
5.1.3. Evaluation Metrics
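We use $F_1$ scores (Yang, 1999) to evaluate the performance of all methods. Denote by $TP_c$, $FP_c$, and $FN_c$ the numbers of true-positive, false-positive, and false-negative instances for class $c$. Let $\mathcal{C}$ be the set of all classes except the root node. Two conventional $F_1$ scores are defined as:

$$\text{Micro-}F_1 = \frac{2PR}{P + R}$$

where $P = \frac{\sum_{c \in \mathcal{C}} TP_c}{\sum_{c \in \mathcal{C}} (TP_c + FP_c)}$ is the averaged precision and $R = \frac{\sum_{c \in \mathcal{C}} TP_c}{\sum_{c \in \mathcal{C}} (TP_c + FN_c)}$ is the averaged recall.

$$\text{Macro-}F_1 = \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \frac{2 P_c R_c}{P_c + R_c}$$

where $P_c = \frac{TP_c}{TP_c + FP_c}$ and $R_c = \frac{TP_c}{TP_c + FN_c}$ are the precision and the recall for class $c$.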
For the two scores, we measure the overall performance of all classes in the hierarchy in our experiments.
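Both scores can be computed from per-class confusion counts. The sketch below uses the identity that the harmonic mean of precision and recall equals 2TP / (2TP + FP + FN); returning 0 for a class with no positives at all is a convention assumed here, and the class names and counts are made up.

```python
def micro_macro_f1(counts):
    """counts maps each class to (tp, fp, fn); returns (micro_f1, macro_f1)."""

    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0  # assumed convention for empty classes

    # Micro: pool the counts over all classes, then compute a single score.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)

    # Macro: compute one score per class, then average with equal class weights.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro


# Hypothetical counts for a frequent class "a" and a rare class "b".
micro, macro = micro_macro_f1({"a": (8, 2, 2), "b": (1, 1, 3)})
```

The macro score is pulled down by the rare class, which is why the two measures are reported together for hierarchies with unbalanced classes.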
We compare our algorithms with the baselines on semi-supervised and weakly supervised hierarchical text classification. Results on all datasets under a fixed label rate are summarized in Table 2, where the label rate is the fraction of the training data that is labeled or weakly labeled; the rate used there is a common setting for semi-supervised text classification. To show that our approach (PCEM) indeed leverages unlabeled and weakly labeled data well, we present results under different label rates, compared with other EM methods, in Figures 2 and 3. For each experiment, we randomly split the training data into labeled and unlabeled parts according to the label rate and then run the experiment on the split training data. Each experiment is run 5 times and the mean scores are reported. Next we analyze the results and also discuss time efficiency.
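The splitting protocol can be sketched as follows. The corpus size, the label rate of 0.1, and the seeds are placeholders chosen for illustration only and are not the paper's settings.

```python
import random


def split_by_label_rate(num_docs, label_rate, seed):
    """Randomly split document indices into labeled and unlabeled parts."""
    rng = random.Random(seed)
    indices = list(range(num_docs))
    rng.shuffle(indices)
    n_labeled = round(num_docs * label_rate)
    return sorted(indices[:n_labeled]), sorted(indices[n_labeled:])


# Five independent splits, one per repeated run; scores would be averaged over them.
splits = [split_by_label_rate(100, 0.1, seed=s) for s in range(5)]
```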
5.2.1. Semi-Supervised Classification with True Labels
Table 2 shows that when the training data are partly labeled with the ground truth labels, PCEM consistently and clearly outperforms the other methods. The discriminative baselines show no advantage in the semi-supervised and weakly supervised settings. Compared with the generative baselines, our approaches, both naive Bayes (PCNB) and semi-supervised (PCEM), are the best among the corresponding methods. This demonstrates that our algorithms make good use of the structural information to improve hierarchical classification.
As expected, EM approaches outperform the corresponding naive Bayes classifier under label rate, which reveals the benefits from the unlabeled data. However, we also noticed that EM may be surpassed by NB when the label rate gets larger. This depends on whether the ratio between labeled and unlabeled data is suitable for SSL, as well as on the bias of the unlabeled data. This issue has been discussed in previous work (Fox-Roberts and Rosten, 2014).
To see the performance in SSL, we compare PCEM with other EM methods in Figure 2 while varying the label rate. We find that PCEM steadily outperforms the others. The other hierarchical EM methods are close to Flat-EM, showing that they take little advantage of the class hierarchy. The results reveal the effectiveness of PCEM under all tested label rates for semi-supervised classification.
5.2.2. Weakly-Supervised Classification on Dataless Setting
We also compare with the baselines on dataless text classification. The experimental setting is the same as for semi-supervised classification, except that the training documents do not have labels. Instead, some of them are ‘labeled’ with the classes that have the maximal semantic similarities. We use the dataless 20NG and RCV1 datasets provided by (Song and Roth, 2014). Results are presented in Table 2 and Figure 3.
The results are consistent with the semi-supervised setting. PCEM always beats the baselines with significant improvements, and PCNB is also better than the other NB methods. It is worth noting that the gaps between our algorithms (PCNB and PCEM) and the baselines are bigger than those in the semi-supervised setting. We think the reason is that, for this weakly labeled dataset, the similarities can be seen as noisy labels for the documents. In this noisy setting, our path cost-sensitive learning algorithm with the probabilistic framework is effective at using the structural information and the features of unlabeled data to recover the true generative distribution.
5.2.3. Efficiency Comparison
We also evaluate the time efficiency of our algorithms. PCNB is highly efficient: it is faster than all of the other methods except Flat-NB and is even competitive with Flat-NB. PCEM is slightly slower than LR and SVM, but that is because EM methods leverage the unlabeled data, which discriminative methods cannot use. The trade-off is acceptable, especially considering the excellent performance of PCEM. Furthermore, PCEM also achieves a speedup of tens of times compared to HierCost.
6. Conclusions and Future Work
We present an effective and efficient approach for hierarchical text classification. Our path cost-sensitive learning algorithm replaces the traditional generative model of text with a path-generated model to constrain the classifier by the class hierarchy. We show that our algorithm outperforms the baselines on semi-supervised and weakly supervised learning. In addition, our model can potentially be extended to other models, not only generative ones, if the path-dependent scores are incorporated appropriately. In future work, we will convert the current framework into a discriminative learning framework following (Collins, 2002) and apply deep neural models to learn a better representation for text (Meng et al., 2019, 2018). A discriminative framework should further improve learning when more labeled data are available, and deep neural models are more powerful for handling different kinds of weak supervision.
This paper was supported by HKUST-WeChat WHAT Lab and the Early Career Scheme (ECS, No. 26206717) from Research Grants Council in Hong Kong.
- Bennett and Nguyen (2009) Paul N. Bennett and Nam Nguyen. 2009. Refined experts: improving classification in large taxonomies. In SIGIR. ACM, 11–18.
- Cai and Hofmann (2004) Lijuan Cai and Thomas Hofmann. 2004. Hierarchical document categorization with support vector machines. In CIKM. ACM, 78–87.
- Chang et al. (2008) Ming-Wei Chang, Lev-Arie Ratinov, Dan Roth, and Vivek Srikumar. 2008. Importance of Semantic Representation: Dataless Classification. In AAAI. AAAI Press, 830–835.
- Chapelle et al. (2010) Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. 2010. Semi-Supervised Learning (1st ed.). The MIT Press.
- Charuvaka and Rangwala (2015) Anveshi Charuvaka and Huzefa Rangwala. 2015. HierCost: Improving Large Scale Hierarchical Classification with Cost Sensitive Learning. In ECML/PKDD (1) (Lecture Notes in Computer Science), Vol. 9284. Springer, 675–690.
- Collins (2002) Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In EMNLP.
- Cong et al. (2004) Gao Cong, Wee Sun Lee, Haoran Wu, and Bing Liu. 2004. Semi-supervised Text Classification Using Partitioned EM. In DASFAA (Lecture Notes in Computer Science), Vol. 2973. Springer, 482–493.
- Dagan et al. (1997) Ido Dagan, Yael Karov, and Dan Roth. 1997. Mistake-Driven Learning in Text Categorization. In EMNLP. ACL.
- Dalvi et al. (2016) Bhavana Dalvi, Aditya Kumar Mishra, and William W. Cohen. 2016. Hierarchical Semi-supervised Classification with Incomplete Class Hierarchies. In WSDM. ACM, 193–202.
- Dempster et al. (1977) Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological) (1977), 1–38.
- Duda and Hart (1973) Richard O. Duda and Peter E. Hart. 1973. Pattern Classification and Scene Analysis. Wiley-Interscience, New York.
- Dumais and Chen (2000) Susan T. Dumais and Hao Chen. 2000. Hierarchical classification of Web content. In SIGIR. ACM, 256–263.
- Fan et al. (2008) Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research 9 (2008), 1871–1874.
- Fox-Roberts and Rosten (2014) Patrick Fox-Roberts and Edward Rosten. 2014. Unbiased generative semi-supervised learning. Journal of Machine Learning Research 15, 1 (2014), 367–443.
- Gopal and Yang (2013) Siddharth Gopal and Yiming Yang. 2013. Recursive regularization for large-scale classification with hierarchical and graphical dependencies. In KDD. ACM, 257–265.
- Gopal et al. (2012) Siddharth Gopal, Yiming Yang, Bing Bai, and Alexandru Niculescu-Mizil. 2012. Bayesian models for Large-scale Hierarchical Classification. In NIPS. 2420–2428.
- Koller and Sahami (1997) Daphne Koller and Mehran Sahami. 1997. Hierarchically Classifying Documents Using Very Few Words. In ICML. Morgan Kaufmann, 170–178.
- Lafferty et al. (2001) John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In ICML. 282–289.
- Lang (1995) Ken Lang. 1995. NewsWeeder: Learning to Filter Netnews. In ICML. Morgan Kaufmann, 331–339.
- Lewis et al. (2004) David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research 5 (2004), 361–397.
- Liu and Yang (2012) Mingyong Liu and Jiangang Yang. 2012. An improvement of TFIDF weighting in text categorization. International Proceedings of Computer Science and Information Technology (2012), 44–47.
- Liu et al. (2005) Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma. 2005. Support vector machines classification with a very large-scale taxonomy. SIGKDD Explorations 7, 1 (2005), 36–43.
- Mann and McCallum (2008) Gideon S. Mann and Andrew McCallum. 2008. Generalized Expectation Criteria for Semi-Supervised Learning of Conditional Random Fields. In ACL. The Association for Computer Linguistics, 870–878.
- McCallum et al. (1998) Andrew McCallum, Ronald Rosenfeld, Tom M. Mitchell, and Andrew Y. Ng. 1998. Improving Text Classification by Shrinkage in a Hierarchy of Classes. In ICML. Morgan Kaufmann, 359–367.
- Mccord and Chuah (2011) Michael Mccord and M Chuah. 2011. Spam detection on twitter using traditional classifiers. In international conference on Autonomic and trusted computing. Springer, 175–186.
- Meng et al. (2018) Yu Meng, Jiaming Shen, Chao Zhang, and Jiawei Han. 2018. Weakly-Supervised Neural Text Classification. In CIKM. ACM, 983–992.
- Meng et al. (2019) Yu Meng, Jiaming Shen, Chao Zhang, and Jiawei Han. 2019. Weakly-Supervised Hierarchical Text Classification. In AAAI. AAAI Press.
- Ng and Jordan (2001) Andrew Y. Ng and Michael I. Jordan. 2001. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes. In NIPS. MIT Press, 841–848.
- Nigam et al. (2006) Kamal Nigam, Andrew McCallum, and Tom Mitchell. 2006. Semi-supervised text classification using EM. Semi-Supervised Learning (2006), 33–56.
- Nilsson (1965) N. J. Nilsson. 1965. Learning machines: Foundations of Trainable Pattern-Classifying Systems (1st ed.). McGraw-Hill.
- Pang et al. (2002) Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. In EMNLP.
- Peng et al. (2018) Hao Peng, Jianxin Li, Yu He, Yaopeng Liu, Mengjiao Bao, Lihong Wang, Yangqiu Song, and Qiang Yang. 2018. Large-Scale Hierarchical Text Classification with Recursively Regularized Deep Graph-CNN. In WWW. ACM, 1063–1072.
- Song and Roth (2014) Yangqiu Song and Dan Roth. 2014. On Dataless Hierarchical Text Classification. In AAAI. AAAI Press, 1579–1585.
- Sun and Lim (2001) Aixin Sun and Ee-Peng Lim. 2001. Hierarchical Text Classification and Evaluation. In ICDM. IEEE Computer Society, 521–528.
- Taskar et al. (2003) Benjamin Taskar, Carlos Guestrin, and Daphne Koller. 2003. Max-Margin Markov Networks. In NIPS. 25–32.
- Tikk and Biró (2004) Domonkos Tikk and György Biró. 2004. A hierarchical text categorization approach and its application to FRT expansion. Austr. J. Intelligent Information Processing Systems 8, 3 (2004), 123–131.
- Tsochantaridis et al. (2005) Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. 2005. Large Margin Methods for Structured and Interdependent Output Variables. Journal of Machine Learning Research 6 (2005), 1453–1484.
- Wu et al. (2017) Cinna Wu, Mark Tygert, and Yann LeCun. 2017. Hierarchical loss for classification. CoRR abs/1709.01062 (2017).
- Xiao et al. (2011) Lin Xiao, Dengyong Zhou, and Mingrui Wu. 2011. Hierarchical Classification via Orthogonal Transfer. In ICML. Omnipress, 801–808.
- Yang (1999) Yiming Yang. 1999. An Evaluation of Statistical Approaches to Text Categorization. Inf. Retr. 1, 1-2 (1999), 69–90.