Efficient Path Prediction for Semi-Supervised and Weakly Supervised Hierarchical Text Classification
Abstract.
Hierarchical text classification has many real-world applications. However, labeling a large number of documents is costly. In practice, we can use semi-supervised learning or weakly supervised learning (e.g., dataless classification) to reduce the labeling cost. In this paper, we propose a path cost-sensitive learning algorithm that utilizes the structural information and further makes use of unlabeled and weakly-labeled data. We use a generative model to leverage the large amount of unlabeled data and introduce path constraints into the learning algorithm to incorporate the structural information of the class hierarchy. The posterior probabilities of both unlabeled and weakly labeled data can be incorporated with path-dependent scores. Since we put a structure-sensitive cost into the learning algorithm to constrain the classification to be consistent with the class hierarchy, and do not need to reconstruct the feature vectors for different structures, we can significantly reduce the computational cost compared to structural output learning. Experimental results on two hierarchical text classification benchmarks show that our approach is not only effective but also efficient in handling semi-supervised and weakly supervised hierarchical text classification.
1. Introduction
Text classification has always been an important task, particularly with the vast growth of text data on the Web that needs to be classified. Applications include news classification (Dagan et al., 1997), product review classification (Pang et al., 2002), spam detection (Mccord and Chuah, 2011), and so on. Hierarchical classification (HC) and structured prediction are involved since the classes are usually organized as a hierarchy. In recent decades, many approaches have been proposed for HC. For example, top-down classification (Sun and Lim, 2001) classifies documents at the top layer and then propagates the results to the next layer until the leaves. This greedy strategy propagates classification errors along the hierarchy. Conversely, bottom-up classification (Bennett and Nguyen, 2009) back-propagates the labels from the leaves to the top layer, so leaf classes with little training data, which nevertheless share similarities with their parents and siblings, may not be well considered and trained. Moreover, structural output learning, such as the structural perceptron (Collins, 2002) and structural SVM (Tsochantaridis et al., 2005), can leverage the structural information in the class hierarchy well, but it requires Kesler construction (Nilsson, 1965; Duda and Hart, 1973), where for each substructure new features are constructed from the existing features and the class dependencies. That is why structural output learning usually takes more time to train than top-down and bottom-up approaches. All of the above approaches are supervised methods. When there are more unlabeled data, it is even more challenging to address both class dependencies and efficiency in the practical use of hierarchical text classification.
There exist several ways to use the large amount of unlabeled data, among which semi-supervised learning (SSL) (Chapelle et al., 2010) and weakly supervised learning, such as dataless classification (Chang et al., 2008; Song and Roth, 2014), are two representative ones. An example of SSL is (Nigam et al., 2006). It uses a mixture multinomial model to estimate the posterior probabilities of unlabeled data, which shares the same parameters with the naive Bayes model for the labeled data. More parameters can be introduced to model the hierarchical structure, but this makes the model redundant and, at the same time, not accurate enough. As for weakly supervised learning, dataless classification (Song and Roth, 2014) uses the semantic similarities between label descriptions and document contents to provide weak labels for documents. When applying the weak labels, current approaches simply treat each label similarity independently and do not consider the path constraints in the label hierarchy.
To tackle the above problems for semi-supervised and weakly supervised hierarchical text classification, we propose a path cost-sensitive learning algorithm based on a generative model for text. When estimating the path posterior distribution, path-dependent scores are incorporated to make the posterior path-sensitive. The path-dependent score evaluates how accurate the current model is at classifying a document among the paths in the class hierarchy. During inference, classification is then constrained to keep the consistency of the hierarchy. By this mechanism, we develop a simple model with fewer parameters than existing approaches, while maintaining the consistency property for the class dependencies in the hierarchy.
The contributions of our paper are as follows:

We propose a new approach for hierarchical text classification based on a probabilistic framework. We highlight its interpretation in terms of cost-sensitive learning and constraint learning.

We show significant improvements on two widely used hierarchical text classification benchmarks and demonstrate our algorithm’s effectiveness in semi-supervised and weakly-supervised learning settings.

Our approach reduces the complexity of traditional methods. We achieve a speedup of tens of times while outperforming the state-of-the-art discriminative baseline.
The code and data used in the paper are available at https://github.com/HKUSTKnowComp/PathPredictionForTextClassification.
2. Related Work
There are only a few studies on semi-supervised hierarchical (text) classification (Nigam et al., 2006; Dalvi et al., 2016), partially because of the difficulty of evaluating class dependencies for unlabeled data and the time cost of using more complicated algorithms such as structural output learning (Mann and McCallum, 2008). Most semi-supervised hierarchical text classification work is based on the EM algorithm introduced by (Nigam et al., 2006). Some of it is related to ours (see Section 2.1), while other work is not; e.g., (Dalvi et al., 2016) used the EM algorithm to deal with the incomplete hierarchy problem, which is not the same setting as ours. In this section, we start with a review of general hierarchical text classification and then explain the uniqueness and significance of our work.
Hierarchical text classification has been studied for several decades. Flat multi-label classification methods (Tikk and Biró, 2004) ignore the hierarchy and thus perform poorly for HC. Early works (Koller and Sahami, 1997; Dumais and Chen, 2000; Liu et al., 2005) often used “pachinko-machine models,” which assign a local classifier to each node and classify documents recursively. Top-down and bottom-up approaches build on this local-classifier idea, but top-down is a greedy strategy and may not find optimal solutions, while the bottom-up approach may fail to adequately consider and train the classes with little training data.
To better exploit the class hierarchy, algorithms particularly designed for trees can assist. In practice, both generative and discriminative models are used. In the following, we will review the related work of these two categories.
2.1. Generative Models
(Nigam et al., 2006) summarized the text generative model and provided the naive Bayes classifier and the Expectation-Maximization (EM) algorithm for flat classification. For HC, they introduced more parameters to account for the class dependencies. (McCallum et al., 1998) remodeled the framework in another way: they applied shrinkage to smooth parameter estimates using the class hierarchy. (Cong et al., 2004) also used the same generative framework but proposed a clustering-based partitioning technique. These generative hierarchical methods can bring some structural information into the model, but they do not make full use of the hierarchy and have difficulty scaling to large hierarchies.
2.2. Discriminative Models
Discriminative methods are also popular for HC. Orthogonal Transfer (Xiao et al., 2011) borrowed the idea of top-down classification: each node had a regularized classifier, and the normal vector of each node’s classifying hyperplane was encouraged to be orthogonal to its ancestors’. Hierarchical Bayesian Logistic Regression (Gopal et al., 2012) leveraged the hierarchical dependencies by giving the children nodes a prior centered on the parameters of their parents. The idea was further developed in Hierarchically Regularized SVM and Logistic Regression (Gopal and Yang, 2013), where the hierarchical dependencies were incorporated into the parameter regularization structure. More recently, the idea of hierarchical regularization has been applied to deep models and also showed some improvements (Peng et al., 2018). (Charuvaka and Rangwala, 2015) simplified classifier construction by building a binary classifier on each tree node and applying cost-sensitive learning (HierCost). All of the above approaches are still based on top-down or greedy classification, which can result in non-optimal solutions. Another work similar to ours is (Wu et al., 2017)’s hierarchical loss for classification, which defined the hierarchical loss (or win) as the weighted sum of the probabilities of the nodes along the path. In contrast to their work, we use the sum of the (weakly) labeled instances along a path as the score to perform path cost-sensitive learning.
To find solutions with stronger theoretical guarantees, some algorithms were developed based on structural output learning (Lafferty et al., 2001; Collins, 2002; Taskar et al., 2003; Tsochantaridis et al., 2005), which can be proved to be globally optimal for HC. Hierarchical SVM (HSVM) (Cai and Hofmann, 2004), an instance of structural SVM, generalized SVM to structured data with a path-dependent discriminant function. In general, when performing structural output learning, Kesler construction is used to build the feature vectors for comparing different structures (Nilsson, 1965; Duda and Hart, 1973), which adds much more computation than top-down or bottom-up classification approaches.
In summary, both generative and discriminative models can be adapted to HC problems. Discriminative models achieve better performance with adequate labeled data (Ng and Jordan, 2001), especially if a better representation for text can be found, e.g., using deep learning (Peng et al., 2018). Generative models, in contrast, have an advantage in handling more uncertainty (Ng and Jordan, 2001) with limited labeled data and under noisy supervision. Our work is based on a generative model yet has the same parameter size as flat classification. We find that it significantly boosts the performance of semi-supervised and weakly supervised learning while also reducing the computational cost.
3. Path Prediction for Hierarchical Classification
In HC, the classes constitute a hierarchy, denoted as $\mathcal{T}$. $\mathcal{T}$ is a tree of depth $H$, with the root node at depth $0$. The classes are then distributed from depth $1$ to $H$. We suppose that all leaf nodes are at depth $H$. This can always be satisfied by expanding a shallower leaf node (i.e., giving it a child) until it reaches depth $H$. When evaluating models, these dummy nodes from the expansion can easily be removed to avoid affecting the performance measure.
Let $\mathcal{C}_1,\ldots,\mathcal{C}_H$ be the class sets at depth $1,\ldots,H$ accordingly, with sizes $|\mathcal{C}_1|,\ldots,|\mathcal{C}_H|$. To classify a document, we assign a label at each depth, i.e., the document gets labels $(c_1,\ldots,c_H)$ with $c_j\in\mathcal{C}_j$. These labels form a path in $\mathcal{T}$ if the classification results at each depth are consistent with those at the other depths. We want to maintain the consistency of the hierarchy; therefore, we classify documents by paths instead of by multi-label classes. After being assigned a path, the document’s classes are the nodes lying on the path. This is similar to structured prediction, since a path can be regarded as a structured object, which contains more information than a set of multi-label classes without path constraints.
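To make the path notion concrete, the following is a minimal Python sketch on a hypothetical three-leaf hierarchy of our own (the class names and dictionary layout are illustrative, not from the paper) that enumerates all root-to-leaf paths:

```python
# Hypothetical toy hierarchy: root -> {A, B}, A -> {A1, A2}, B -> {B1}.
# Each key maps a node to its children; leaves do not appear as keys.
hierarchy = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1"]}

def enumerate_paths(tree, node="root"):
    """Return all root-to-leaf paths, excluding the root node itself."""
    children = tree.get(node, [])
    if not children:                      # leaf node: the path ends here
        return [[node]]
    paths = []
    for child in children:
        for sub in enumerate_paths(tree, child):
            paths.append(([] if node == "root" else [node]) + sub)
    return paths

paths = enumerate_paths(hierarchy)
# One path per leaf, so the number of paths equals the number of leaves.
```

Classifying by one of these paths automatically yields one class per depth, consistent with the hierarchy.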
To sum up, path prediction aims to make use of the structural information in the class hierarchy to train the classifier. Note that the classifier is for paths in the hierarchy instead of classes. The details of the path prediction algorithm are given in the next section.
4. Path Cost-Sensitive Learning
In this section, we introduce our method, which utilizes the structural information to learn the classifier, and we discuss its interpretation in terms of cost-sensitive learning and constraint learning.
4.1. Path-Generated Probabilistic Framework
We base our work on a widely-used probabilistic framework that constructs a generative model for text. In the framework, text data are assumed to be generated from a mixture of multinomial distributions over words. Previous works (Nigam et al., 2006; McCallum et al., 1998) assumed that the mixture components in the generative model have a one-to-one correspondence with the classes. However, in order to perform path prediction, we instead assume that the mixture components have a one-to-one correspondence with the paths.
Define $\mathcal{P}$ to be the set of all paths that start from the root node and end at some leaf node in the class hierarchy $\mathcal{T}$, so the size of $\mathcal{P}$ equals the number of leaf nodes. Let $V$ be the vocabulary. Denote by $\theta$ the parameters of the mixture multinomial model. For a document $d$ with length $|d|$, suppose $t_{w,d}$ is the frequency of word $w$ in $d$, which is the document feature represented by the vector space model (Liu and Yang, 2012). The generative process then runs as follows.
First, select a mixture component, or equivalently a path $p$, from $P(p\mid\theta)$ (the prior of $p$). Next, generate the document $d$ by selecting its length and picking words from $p$’s word distribution. By the law of total probability and the naive Bayes assumption that, given the labels, the number of occurrences of each word in a document is conditionally independent of its position and of the other words, the probability of generating $d$ is
$$P(d\mid\theta)=P(|d|)\sum_{p\in\mathcal{P}}P(p\mid\theta)\prod_{w\in V}P(w\mid p,\theta)^{t_{w,d}} \quad (1)$$
In general, document lengths are assumed to be independent of classes, and thus independent of paths. So the model parameters $\theta$ consist of the path prior $P(p\mid\theta)$ and the multinomial distribution over words $P(w\mid p,\theta)$ for each path $p$.
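As a worked illustration of Eq. (1), the following sketch evaluates the mixture-multinomial likelihood of one document under assumed toy parameters (two paths, three words; all numbers are made up, and the constant length term is dropped since it is path-independent):

```python
import math

# Assumed toy parameters, not learned ones.
prior = {"p1": 0.5, "p2": 0.5}                        # P(p | theta)
word_probs = {"p1": {"a": 0.7, "b": 0.2, "c": 0.1},
              "p2": {"a": 0.1, "b": 0.2, "c": 0.7}}   # P(w | p, theta)
t = {"a": 2, "b": 1, "c": 0}                          # word counts t_{w,d}

# P(d | theta) is proportional to sum_p P(p) * prod_w P(w|p)^t_w
likelihood = sum(prior[p] * math.prod(word_probs[p][w] ** t[w] for w in t)
                 for p in prior)
```

Each mixture component contributes its prior weight times the multinomial probability of the observed word counts; words with zero count contribute a factor of one.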
4.2. Path-Dependent Scores
Given a data set $\mathcal{D}$ consisting of the labeled documents $\mathcal{D}_l$ and the unlabeled documents $\mathcal{D}_u$, we first derive the parameter estimation in a supervised manner. With only the labeled data considered, we maximize $P(\mathcal{D}_l\mid\theta)$, which can be done by counting the corresponding occurrences of events. For flat classification, the event counts are usually hard counts. Here we substitute them with a path-dependent score.
First we define the score of a node in $\mathcal{T}$ for a document. Suppose $p=(c^p_1,\ldots,c^p_H)$, where $c^p_j$ is the node at depth $j$ in path $p$. The node score of $c$ for $d$, denoted as $\mathrm{score}(c,d)$, indicates whether $c$ is a label of $d$. When $d$ is labeled with ground truth labels, $\mathrm{score}(c,d)=1$ if and only if $c$ is one of $d$’s labels. We also consider the weakly supervised case. In (Song and Roth, 2014)’s dataless text classification, $d$ is weakly labeled by its semantic similarities with the classes. We assign the value $1$ to $\mathrm{score}(c,d)$ if $d$ has the largest similarity with $c$ among all classes at $c$’s depth, and $0$ otherwise.
Next we introduce the path score. For $d$, the score of path $p$, denoted as $\mathrm{score}(p,d)$, is the sum of the scores of the nodes lying on $p$, excluding the root node since it carries no information for classification.
$$\mathrm{score}(p,d)=\sum_{j=1}^{H}\mathrm{score}(c^p_j,d) \quad (2)$$
Take the hierarchy in Figure 1 as an example. If $d$ is labeled with a depth-1 node and one of its children, then the path passing through both labeled nodes scores $2$, a path passing through only the labeled depth-1 node scores $1$, while all other paths score $0$. If $d$ is weakly labeled by similarities, then we label it with the classes having the maximum similarity at each depth and obtain the path scores in the same way.
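The node and path scores of Eq. (2) can be sketched as follows, on an assumed toy hierarchy (the class names are illustrative, not the ones in Figure 1):

```python
# Paths are node tuples with the root excluded; a document's per-depth labels
# induce 0/1 node scores.
paths = [("A", "A1"), ("A", "A2"), ("B", "B1")]
doc_labels = {"A", "A1"}          # ground-truth labels of document d, one per depth

def path_score(path, labels):
    """Sum of node scores along the path (Eq. 2)."""
    return sum(1 for node in path if node in labels)

scores = {p: path_score(p, doc_labels) for p in paths}
# The fully correct path scores 2; a path sharing only the depth-1 label
# scores 1; unrelated paths score 0.
```

In the weakly supervised case, `doc_labels` would instead hold the most similar class at each depth, and the same function applies.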
4.3. Path Cost-Sensitive Naive Bayes Classifier
When computing the empirical counts, Laplace smoothing is often applied, adding one count to each event to avoid zero probabilities and to shrink the estimator. Combining the event counts (i.e., the path scores) and the smoothing term, the parameter estimates are:
$$P(p\mid\theta)=\frac{1+\sum_{d\in\mathcal{D}_l}\mathrm{score}(p,d)}{|\mathcal{P}|+\sum_{p'\in\mathcal{P}}\sum_{d\in\mathcal{D}_l}\mathrm{score}(p',d)} \quad (3)$$

$$P(w\mid p,\theta)=\frac{1+\sum_{d\in\mathcal{D}_l}t_{w,d}\,\mathrm{score}(p,d)}{|V|+\sum_{w'\in V}\sum_{d\in\mathcal{D}_l}t_{w',d}\,\mathrm{score}(p,d)} \quad (4)$$
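A minimal sketch of the smoothed estimates in Eqs. (3) and (4), using path scores as soft event counts over assumed toy data (all names and numbers are our own, not the paper's code):

```python
# Two paths, two words, two documents with known path scores.
paths = ["p1", "p2"]
vocab = ["a", "b"]
docs = [{"a": 2, "b": 1}, {"a": 0, "b": 3}]          # t_{w,d}
score = [{"p1": 2, "p2": 0}, {"p1": 1, "p2": 1}]     # score(p, d) per document

# Path prior: P(p) = (1 + sum_d score(p,d)) / (|P| + sum_{p'} sum_d score(p',d))
total = sum(s[p] for s in score for p in paths)
prior = {p: (1 + sum(s[p] for s in score)) / (len(paths) + total) for p in paths}

# Word model: P(w|p) = (1 + sum_d t_{w,d} score(p,d))
#                      / (|V| + sum_{w'} sum_d t_{w',d} score(p,d))
word_probs = {}
for p in paths:
    counts = {w: sum(d[w] * s[p] for d, s in zip(docs, score)) for w in vocab}
    denom = len(vocab) + sum(counts.values())
    word_probs[p] = {w: (1 + counts[w]) / denom for w in vocab}
```

Both estimates are proper distributions by construction, and the add-one terms keep every probability strictly positive even for paths with zero score mass.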
There are two aspects of using the path scores as event counts:

Cost-sensitive performance measures are considered, since different data samples are given different weights. In Figure 1, a document is counted twice for its true path, once at depth 1 and once at depth 2, thus obtaining more weight. Paths that are not the true path but share its depth-1 node still classify the document correctly at depth 1, and thus get one count, less than the true path but more than paths with no correct labels at all. This path cost-sensitive learning behavior helps the model maintain structural information.

The path scores function as measuring indicators of paths, enabling the model to classify documents by paths. Path prediction effectively puts constraints on the classifier: the prediction results must be consistent with the class hierarchy. Furthermore, this constraint learning reduces the search space and improves efficiency.
After estimating $\theta$ from $\mathcal{D}_l$, for any test document $d$, the posterior probability distribution over paths can be obtained by Bayes’ rule:
$$P(p\mid d,\theta)=\frac{P(p\mid\theta)\prod_{w\in V}P(w\mid p,\theta)^{t_{w,d}}}{\sum_{p'\in\mathcal{P}}P(p'\mid\theta)\prod_{w\in V}P(w\mid p',\theta)^{t_{w,d}}} \quad (5)$$
Then $d$ is classified into $\arg\max_{p\in\mathcal{P}}P(p\mid d,\theta)$.
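The inference step of Eq. (5) can be sketched as follows, computed in log space for numerical stability; the parameters here are assumed toy values rather than learned ones:

```python
import math

prior = {"p1": 0.6, "p2": 0.4}                        # P(p | theta)
word_probs = {"p1": {"a": 0.8, "b": 0.2},
              "p2": {"a": 0.3, "b": 0.7}}             # P(w | p, theta)
t = {"a": 1, "b": 2}                                  # word counts of document d

# Unnormalized log joint: log P(p) + sum_w t_w * log P(w|p)
log_joint = {p: math.log(prior[p]) +
                sum(t[w] * math.log(word_probs[p][w]) for w in t)
             for p in prior}
# Normalize to get the posterior P(p | d, theta)
log_norm = math.log(sum(math.exp(v) for v in log_joint.values()))
posterior = {p: math.exp(v - log_norm) for p, v in log_joint.items()}
prediction = max(posterior, key=posterior.get)        # argmax_p P(p | d, theta)
```

In practice one would subtract the maximum log joint before exponentiating to avoid underflow on long documents; it is omitted here for brevity.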
The path cost-sensitive naive Bayes classifier (PC-NB) for the generative model is introduced above. Next we present the semi-supervised path cost-sensitive learning algorithm.
4.4. Semi-Supervised Path Cost-Sensitive Learning
So far, only the labeled data have been used during training, but we want to make use of the unlabeled data to improve the classifier. We follow (Nigam et al., 2006) in applying the EM technique for SSL.
Given initial parameters, the posterior probabilities of the unlabeled documents $\mathcal{D}_u$, computed through Eq. (5), can act as their path scores. Combining the labeled and unlabeled data, the parameter estimates become
$$P(p\mid\theta)=\frac{1+\sum_{d\in\mathcal{D}}\mathrm{score}(p,d)}{|\mathcal{P}|+\sum_{p'\in\mathcal{P}}\sum_{d\in\mathcal{D}}\mathrm{score}(p',d)} \quad (6)$$

$$P(w\mid p,\theta)=\frac{1+\sum_{d\in\mathcal{D}}t_{w,d}\,\mathrm{score}(p,d)}{|V|+\sum_{w'\in V}\sum_{d\in\mathcal{D}}t_{w',d}\,\mathrm{score}(p,d)} \quad (7)$$

where $\mathrm{score}(p,d)=P(p\mid d,\theta)$ for $d\in\mathcal{D}_u$.
Note that the numerical value of $\mathrm{score}(p,d)$ for $d\in\mathcal{D}_u$ lies in $[0,1]$, since it is a posterior probability. Therefore, the unlabeled data weigh less than the labeled data when estimating the parameters. This is reasonable because the labeled data are more reliable than the inference results on unlabeled data, especially in the early iterations before the model converges.
The new $\theta$ obtained via Eqs. (6) and (7) is then used to recompute the posterior probabilities of $\mathcal{D}_u$, which in turn update $\theta$. This iterative process keeps increasing the likelihood of the data set $\mathcal{D}$, equivalent to maximizing the log-likelihood:
$$\ell(\theta\mid\mathcal{D})=\sum_{d\in\mathcal{D}_l}\log\bigl(P(p_d\mid\theta)\,P(d\mid p_d,\theta)\bigr)+\sum_{d\in\mathcal{D}_u}\log\sum_{p\in\mathcal{P}}P(p\mid\theta)\,P(d\mid p,\theta) \quad (8)$$

where $p_d$ denotes the labeled path of $d\in\mathcal{D}_l$.
Following (Dempster et al., 1977), the convergence of EM is guaranteed, but it may reach a local maximum. To help the algorithm find a good local maximum, we initialize $\theta$ with the estimates obtained by the naive Bayes classifier on $\mathcal{D}_l$. Algorithm 1 presents the EM algorithm for path cost-sensitive classification (PC-EM).
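A runnable end-to-end sketch of the PC-EM loop on assumed toy data (two paths, two words, one labeled and two unlabeled documents; all function names and numbers are illustrative, not the paper's code):

```python
import math

paths, vocab = ["p1", "p2"], ["a", "b"]

def estimate(docs, scores):
    """Eqs. (3)-(4) / (6)-(7): Laplace-smoothed estimates from (soft) path scores."""
    total = sum(s[p] for s in scores for p in paths)
    prior = {p: (1 + sum(s[p] for s in scores)) / (len(paths) + total)
             for p in paths}
    word_probs = {}
    for p in paths:
        counts = {w: sum(d[w] * s[p] for d, s in zip(docs, scores)) for w in vocab}
        denom = len(vocab) + sum(counts.values())
        word_probs[p] = {w: (1 + counts[w]) / denom for w in vocab}
    return prior, word_probs

def posterior(d, prior, word_probs):
    """Eq. (5): path posterior of document d."""
    joint = {p: prior[p] * math.prod(word_probs[p][w] ** d[w] for w in vocab)
             for p in paths}
    z = sum(joint.values())
    return {p: v / z for p, v in joint.items()}

labeled = [{"a": 3, "b": 0}]
labeled_scores = [{"p1": 2, "p2": 0}]      # hard path scores of the labeled doc
unlabeled = [{"a": 2, "b": 0}, {"a": 0, "b": 2}]

prior, word_probs = estimate(labeled, labeled_scores)   # PC-NB initialization
for _ in range(10):
    # E-step: posteriors act as soft path scores for the unlabeled documents.
    soft = [posterior(d, prior, word_probs) for d in unlabeled]
    # M-step: re-estimate from labeled hard scores plus unlabeled soft scores.
    prior, word_probs = estimate(labeled + unlabeled, labeled_scores + soft)
```

After a few iterations, the mixture component anchored to the labeled "a"-heavy document keeps attracting similar unlabeled documents, while the other component absorbs the "b"-heavy one.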
5. Experiments
To empirically evaluate the effectiveness and efficiency of our approach, we design experiments on semi-supervised and weakly supervised hierarchical text classification tasks, comparing against representative and state-of-the-art baselines.
5.1. Experimental Design
5.1.1. Datasets
We use two datasets, both of which have semi-supervised and weakly-supervised versions. The statistics are listed in Table 1.
Table 1. Dataset statistics.

Dataset   #Training   #Test   #Features   #Leaves   #Nodes   Depth
20NG      15,077      3,769   103,363     20        27       2
RCV1      6,395       1,733   26,888      35        56       3
20NG (Lang, 1995) (http://qwone.com/~jason/20Newsgroups/): 20 Newsgroups is a widely-used text classification dataset. To experiment in the weakly-supervised setting and compare with semi-supervised baselines, we use the dataless 20NG provided in (Song and Roth, 2014). For RCV1 (Lewis et al., 2004), a hierarchical news categorization benchmark, we likewise use the dataless version provided by (Song and Roth, 2014).
5.1.2. Baselines
We compare our path cost-sensitive algorithms (PC-NB and PC-EM) with the following baselines:

Generative baselines

Flat naive Bayes classifier (Flat-NB) and Flat-EM: the flat classifiers introduced in (Nigam et al., 2006).

Naive Bayes classifier with multiple components (NB-MC) and EM-MC: a more expressive model proposed by (Nigam et al., 2006).

Top-down naive Bayes classifier (TD-NB) and TD-EM: the classifiers run in a top-down manner.

Win-driven naive Bayes classifier (WD-NB) and WD-EM: the modified hierarchical loss for classification (Wu et al., 2017).


Discriminative baselines

Logistic regression (LR) and SVM: two classical discriminative methods. Our experiments use LibLinear (https://www.csie.ntu.edu.tw/~cjlin/liblinear/) (Fan et al., 2008) to train and test the corresponding models. During the experiments, we found that dual solvers were much faster than, and even better in performance than, primal solvers, so we chose dual solvers.

HierCost (https://cs.gmu.edu/~mlbio/HierCost/) (Charuvaka and Rangwala, 2015): the state-of-the-art discriminative method for hierarchical text classification.

Table 2. Micro-F1 and Macro-F1 (%) on 20NG and RCV1 under the labeled and dataless settings.

Method      20NG labeled      20NG dataless     RCV1 labeled      RCV1 dataless
            Micro    Macro    Micro    Macro    Micro    Macro    Micro    Macro
LR          ‡52.02   ‡42.41   ‡44.16   ‡31.51   ‡69.59   †24.43   †33.54   ‡9.00
SVM         ‡48.33   ‡39.73   ‡41.70   ‡30.24   ‡68.78   †23.97   †34.15   †9.72
HierCost    ‡48.12   ‡40.89   ‡43.26   ‡32.30   ‡69.22   †24.98   †31.07   ‡8.84
Flat-NB     ‡53.39   ‡39.94   ‡47.29   ‡30.67   ‡70.68   †24.48   †33.29   ‡8.39
NB-MC       ‡46.99   ‡38.02   ‡43.24   ‡28.82   †69.84   †23.52   †28.91   ‡6.91
TD-NB       ‡55.50   ‡42.16   ‡48.06   ‡31.02   ‡70.37   †24.65    33.67   †8.40
WD-NB       ‡53.66   ‡41.53   ‡47.19   ‡31.02   ‡70.89   †25.04    34.24   ‡9.38
PC-NB       ‡58.33   ‡48.04   ‡52.14   †38.50   †73.63    29.95    37.06   12.47
Flat-EM     †63.21   †49.30   †55.13   †37.40    75.38    28.32    38.05   †10.76
EM-MC       †66.56   †52.95    59.50   †41.56    74.86    28.04    32.79   †10.42
TD-EM       †62.14   ‡46.89   †55.62   †37.14   †74.48   †26.88    40.76   †10.91
WD-EM       †62.71   ‡48.85   ‡47.19   ‡31.02    76.35    28.66    34.24   ‡9.38
PC-EM        70.73    60.02    63.54    48.56    77.83    33.49    40.96   14.96
5.1.3. Evaluation Metrics
We use F1 scores (Yang, 1999) to evaluate the performance of all methods. Denote $TP_c$, $FP_c$, $FN_c$ as the numbers of true-positive, false-positive, and false-negative instances for class $c$. Let $\mathcal{C}$ be the set of all classes except the root node. Two conventional F1 scores are defined as:

$$\text{Micro-}F_1=\frac{2PR}{P+R},$$

where $P=\frac{\sum_{c\in\mathcal{C}}TP_c}{\sum_{c\in\mathcal{C}}(TP_c+FP_c)}$ is the averaged precision and $R=\frac{\sum_{c\in\mathcal{C}}TP_c}{\sum_{c\in\mathcal{C}}(TP_c+FN_c)}$ is the averaged recall.

$$\text{Macro-}F_1=\frac{1}{|\mathcal{C}|}\sum_{c\in\mathcal{C}}\frac{2P_cR_c}{P_c+R_c},$$

where $P_c$ and $R_c$ are the precision and recall for class $c$.

With these two scores, we measure the overall performance of all classes in the hierarchy in our experiments.
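A small sketch of the two scores computed from assumed per-class counts (the numbers are illustrative, not results from the paper):

```python
# Per-class true-positive / false-positive / false-negative counts (toy values).
counts = {"A": {"tp": 8, "fp": 2, "fn": 4}, "B": {"tp": 3, "fp": 3, "fn": 1}}

def f1(p, r):
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

# Micro-F1: pool counts over all classes, then compute precision/recall once.
tp = sum(c["tp"] for c in counts.values())
fp = sum(c["fp"] for c in counts.values())
fn = sum(c["fn"] for c in counts.values())
micro_f1 = f1(tp / (tp + fp), tp / (tp + fn))

# Macro-F1: compute per-class F1, then average over classes.
macro_f1 = sum(f1(c["tp"] / (c["tp"] + c["fp"]),
                  c["tp"] / (c["tp"] + c["fn"]))
               for c in counts.values()) / len(counts)
```

Micro-F1 weights every instance equally, so frequent classes dominate; Macro-F1 weights every class equally, so it is more sensitive to rare classes in the hierarchy.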
5.2. Results
To evaluate our algorithms, we compare them with the baselines on semi-supervised and weakly supervised hierarchical text classification. Results on both datasets under a fixed label rate are summarized in Table 2, where a label rate of x% means that x% of the training set is labeled or weakly-labeled, a common setting for semi-supervised text classification. To show that our approach (PC-EM) indeed leverages unlabeled and weakly labeled data well, we present results under different label rates, compared with the other EM methods, in Figures 2 and 3. For each experiment, we randomly split the training data into labeled and unlabeled parts according to the label rate, then run the experiment on the split training data. Each run is executed 5 times and the mean scores are reported. We analyze the results below; time efficiency is also discussed.
5.2.1. Semi-Supervised Classification with True Labels
Table 2 shows that when the training data are partly labeled with ground truth labels, PC-EM consistently has a remarkable advantage over the other methods. The discriminative baselines lose their edge in the semi-supervised and weakly supervised settings. Compared with the generative baselines, our approaches, whether naive Bayes (PC-NB) or semi-supervised (PC-EM), are the best among the corresponding methods. This demonstrates that our algorithms make good use of the structural information to improve hierarchical classification.
As expected, the EM approaches outperform the corresponding naive Bayes classifiers at this label rate, which reveals the benefit of the unlabeled data. However, we also notice that EM may be surpassed by NB when the label rate gets larger. This relates to whether the ratio between labeled and unlabeled data is suitable for SSL, as well as to the bias of the unlabeled data. This issue has been discussed in previous work (Fox-Roberts and Rosten, 2014).
To examine the performance in SSL, we compare PC-EM with the other EM methods in Figure 2 over a range of label rates. We find that PC-EM outperforms the others steadily. The other hierarchical EM methods perform close to Flat-EM, showing that they take little advantage of the class hierarchy. The results reveal the effectiveness of PC-EM across all label rates for semi-supervised classification.
5.2.2. Weakly-Supervised Classification in the Dataless Setting
We also compare with the baselines on dataless text classification. The experimental setting is the same as in semi-supervised classification, except that the training documents have no gold labels. Instead, some of them are ‘labeled’ with the classes of maximal semantic similarity. We use the dataless 20NG and RCV1 datasets provided by (Song and Roth, 2014). Results are presented in Table 2 and Figure 3.
The results are consistent with the semi-supervised setting. PC-EM always beats the baselines with significant improvements, and PC-NB is also better than the other NB methods. It is worth noting that the gaps between our algorithms (PC-NB and PC-EM) and the baselines are larger than in the semi-supervised setting. We believe the reason is that for these weakly-labeled datasets, the similarities can be seen as noisy labels for documents. In this noisy circumstance, our path cost-sensitive learning algorithm with the probabilistic framework is particularly good at using the structural information and the features of unlabeled data to recover the true generative distribution.
5.2.3. Efficiency Comparison
Time complexity is also considered in evaluating our algorithms. PC-NB is highly efficient: it is faster than all the other methods except Flat-NB, and is even competitive with Flat-NB. PC-EM is slightly slower than LR and SVM, but that is because the EM methods leverage the unlabeled data, which the discriminative methods cannot use. The trade-off is acceptable, especially considering the excellent performance of PC-EM. Furthermore, PC-EM achieves a speedup of tens of times compared to HierCost.
6. Conclusions and Future Work
We present an effective and efficient approach to hierarchical text classification. Our path cost-sensitive learning algorithm alters the traditional generative model of text with a path-generated model to constrain the classifier by the class hierarchy. We show that our algorithm outperforms the baselines in semi-supervised and weakly supervised learning. In addition, our model can potentially be extended to other models, not limited to the generative one, if the path-dependent scores are incorporated appropriately. As future work, we will convert the current framework into a discriminative learning framework following (Collins, 2002) and apply deep neural models to learn better representations for text (Meng et al., 2019, 2018). A discriminative framework will further improve learning when there are more labeled data, and deep neural models are more powerful for handling different kinds of weak supervision.
7. Acknowledgement
This paper was supported by the HKUST-WeChat WHAT Lab and the Early Career Scheme (ECS, No. 26206717) from the Research Grants Council in Hong Kong.
References
 Bennett and Nguyen (2009) Paul N. Bennett and Nam Nguyen. 2009. Refined experts: improving classification in large taxonomies. In SIGIR. ACM, 11–18.
 Cai and Hofmann (2004) Lijuan Cai and Thomas Hofmann. 2004. Hierarchical document categorization with support vector machines. In CIKM. ACM, 78–87.
 Chang et al. (2008) Ming-Wei Chang, Lev-Arie Ratinov, Dan Roth, and Vivek Srikumar. 2008. Importance of Semantic Representation: Dataless Classification. In AAAI. AAAI Press, 830–835.
 Chapelle et al. (2010) Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. 2010. Semi-Supervised Learning (1st ed.). The MIT Press.
 Charuvaka and Rangwala (2015) Anveshi Charuvaka and Huzefa Rangwala. 2015. HierCost: Improving Large Scale Hierarchical Classification with Cost Sensitive Learning. In ECML/PKDD (1) (Lecture Notes in Computer Science), Vol. 9284. Springer, 675–690.
 Collins (2002) Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In EMNLP.
 Cong et al. (2004) Gao Cong, Wee Sun Lee, Haoran Wu, and Bing Liu. 2004. Semisupervised Text Classification Using Partitioned EM. In DASFAA (Lecture Notes in Computer Science), Vol. 2973. Springer, 482–493.
 Dagan et al. (1997) Ido Dagan, Yael Karov, and Dan Roth. 1997. Mistake-Driven Learning in Text Categorization. In EMNLP. ACL.
 Dalvi et al. (2016) Bhavana Dalvi, Aditya Kumar Mishra, and William W. Cohen. 2016. Hierarchical Semi-supervised Classification with Incomplete Class Hierarchies. In WSDM. ACM, 193–202.
 Dempster et al. (1977) Arthur P Dempster, Nan M Laird, and Donald B Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B (methodological) (1977), 1–38.
 Duda and Hart (1973) Richard O Duda and Peter E Hart. 1973. Pattern Classification and Scene Analysis. A Wiley-Interscience Publication, New York: Wiley.
 Dumais and Chen (2000) Susan T. Dumais and Hao Chen. 2000. Hierarchical classification of Web content. In SIGIR. ACM, 256–263.
 Fan et al. (2008) Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research 9 (2008), 1871–1874.
 Fox-Roberts and Rosten (2014) Patrick Fox-Roberts and Edward Rosten. 2014. Unbiased generative semi-supervised learning. Journal of Machine Learning Research 15, 1 (2014), 367–443.
 Gopal and Yang (2013) Siddharth Gopal and Yiming Yang. 2013. Recursive regularization for large-scale classification with hierarchical and graphical dependencies. In KDD. ACM, 257–265.
 Gopal et al. (2012) Siddharth Gopal, Yiming Yang, Bing Bai, and Alexandru Niculescu-Mizil. 2012. Bayesian models for Large-scale Hierarchical Classification. In NIPS. 2420–2428.
 Koller and Sahami (1997) Daphne Koller and Mehran Sahami. 1997. Hierarchically Classifying Documents Using Very Few Words. In ICML. Morgan Kaufmann, 170–178.
 Lafferty et al. (2001) John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In ICML. 282–289.
 Lang (1995) Ken Lang. 1995. NewsWeeder: Learning to Filter Netnews. In ICML. Morgan Kaufmann, 331–339.
 Lewis et al. (2004) David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research 5 (2004), 361–397.
 Liu and Yang (2012) Mingyong Liu and Jiangang Yang. 2012. An improvement of TFIDF weighting in text categorization. International Proceedings of Computer Science and Information Technology (2012), 44–47.
 Liu et al. (2005) Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma. 2005. Support vector machines classification with a very large-scale taxonomy. SIGKDD Explorations 7, 1 (2005), 36–43.
 Mann and McCallum (2008) Gideon S. Mann and Andrew McCallum. 2008. Generalized Expectation Criteria for SemiSupervised Learning of Conditional Random Fields. In ACL. The Association for Computer Linguistics, 870–878.
 McCallum et al. (1998) Andrew McCallum, Ronald Rosenfeld, Tom M. Mitchell, and Andrew Y. Ng. 1998. Improving Text Classification by Shrinkage in a Hierarchy of Classes. In ICML. Morgan Kaufmann, 359–367.
 Mccord and Chuah (2011) Michael McCord and M. Chuah. 2011. Spam detection on Twitter using traditional classifiers. In International Conference on Autonomic and Trusted Computing. Springer, 175–186.
 Meng et al. (2018) Yu Meng, Jiaming Shen, Chao Zhang, and Jiawei Han. 2018. Weakly-Supervised Neural Text Classification. In CIKM. ACM, 983–992.
 Meng et al. (2019) Yu Meng, Jiaming Shen, Chao Zhang, and Jiawei Han. 2019. Weakly-Supervised Hierarchical Text Classification. In AAAI. AAAI Press.
 Ng and Jordan (2001) Andrew Y. Ng and Michael I. Jordan. 2001. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes. In NIPS. MIT Press, 841–848.
 Nigam et al. (2006) Kamal Nigam, Andrew McCallum, and Tom Mitchell. 2006. Semi-supervised text classification using EM. Semi-Supervised Learning (2006), 33–56.
 Nilsson (1965) N. J. Nilsson. 1965. Learning Machines: Foundations of Trainable Pattern-Classifying Systems (1st ed.). McGraw-Hill.
 Pang et al. (2002) Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. In EMNLP.
 Peng et al. (2018) Hao Peng, Jianxin Li, Yu He, Yaopeng Liu, Mengjiao Bao, Lihong Wang, Yangqiu Song, and Qiang Yang. 2018. Large-Scale Hierarchical Text Classification with Recursively Regularized Deep Graph-CNN. In WWW. ACM, 1063–1072.
 Song and Roth (2014) Yangqiu Song and Dan Roth. 2014. On Dataless Hierarchical Text Classification. In AAAI. AAAI Press, 1579–1585.
 Sun and Lim (2001) Aixin Sun and Ee-Peng Lim. 2001. Hierarchical Text Classification and Evaluation. In ICDM. IEEE Computer Society, 521–528.
 Taskar et al. (2003) Benjamin Taskar, Carlos Guestrin, and Daphne Koller. 2003. Max-Margin Markov Networks. In NIPS. 25–32.
 Tikk and Biró (2004) Domonkos Tikk and György Biró. 2004. A hierarchical text categorization approach and its application to FRT expansion. Austr. J. Intelligent Information Processing Systems 8, 3 (2004), 123–131.
 Tsochantaridis et al. (2005) Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. 2005. Large Margin Methods for Structured and Interdependent Output Variables. Journal of Machine Learning Research 6 (2005), 1453–1484.
 Wu et al. (2017) Cinna Wu, Mark Tygert, and Yann LeCun. 2017. Hierarchical loss for classification. CoRR abs/1709.01062 (2017).
 Xiao et al. (2011) Lin Xiao, Dengyong Zhou, and Mingrui Wu. 2011. Hierarchical Classification via Orthogonal Transfer. In ICML. Omnipress, 801–808.
 Yang (1999) Yiming Yang. 1999. An Evaluation of Statistical Approaches to Text Categorization. Inf. Retr. 1, 1-2 (1999), 69–90.