Inconsistent Node Flattening for Improving Top-down Hierarchical Classification
Large-scale classification of data where classes are structurally organized in a hierarchy is an important area of research. Top-down approaches that exploit the hierarchy during the learning and prediction phases are efficient for large-scale hierarchical classification. However, the accuracy of top-down approaches is poor due to error propagation: prediction errors made at higher levels in the hierarchy cannot be corrected at lower levels. One of the main reasons behind errors at the higher levels is the presence of inconsistent nodes, which are introduced by the arbitrary process domain experts follow when creating these hierarchies. In this paper, we propose two different data-driven approaches (local and global) for hierarchical structure modification that identify and flatten inconsistent nodes present within the hierarchy. Our extensive empirical evaluation of the proposed approaches on several image and text datasets with varying distributions of features, classes and training instances per class shows improved classification performance over competing hierarchical modification approaches. Specifically, we see an improvement of up to 7% in Macro-F1 score with our approach over the best top-down (TD) baseline. SOURCE CODE: http://www.cs.gmu.edu/mlbio/InconsistentNodeFlattening
Hierarchies (taxonomies) are the most commonly used data structure for organizing large volumes of data in various domains, including bioinformatics (http://geneontology.org/) for organizing gene sequences, ImageNet (http://image-net.org/) for organizing images and Yahoo! (http://dir.yahoo.com/) web directories for organizing web documents. Given the hierarchy of classes, the Hierarchical Classification (HC) problem deals with learning models by utilizing (or ignoring) the hierarchical structure to automatically classify unlabeled test instances (examples) into relevant classes (categories). Over the years, HC has gained immense interest among researchers, as is evident from the various large-scale online prediction challenges such as LSHTC (http://lshtc.iit.demokritos.gr/), BioASQ (http://www.bioasq.org/) and ILSVRC (http://image-net.org/challenges/LSVRC/).
In the past, various methods have been developed to improve HC performance. One of the simplest methods is to learn a binary one-versus-rest classifier for each of the leaf categories, ignoring the hierarchical relationships. This method is referred to as flat classification. Other methods use the hierarchy (see Section II) during the learning and prediction process. Hierarchies provide useful structural relationships (such as parent-child and siblings) among different classes that can be exploited for learning generalized classification models. Previously, researchers have demonstrated the usefulness of hierarchies for classification and have obtained promising results [2, 3, 4, 5, 6, 7, 8]. However, in many situations the hierarchies used for learning models are not consistent due to the presence of inconsistent nodes (and links), resulting in excessive error propagation. In such cases, HC approaches are outperformed by flat classifiers that completely ignore the hierarchy [9, 10].
Flat classifiers, though effective in some cases, suffer from two major issues: (i) During the prediction phase, flat classifiers invoke the models of all leaf categories and are considerably slower than top-down HC approaches, in which only the models on the relevant path are invoked. (ii) For large-scale HC problems, it is challenging to learn effective classification models that can discriminate between a large number of classes. This problem is worse for datasets with skewed class distributions, where many classes have very few training examples (the rare categories problem). Large-scale datasets show a power-law distribution of examples per category. Considering these issues, the focus of this paper is to improve top-down HC approaches, which are computationally feasible for large-scale datasets and handle the imbalance problem by utilizing structural relationships.
The main drawback of top-down HC approaches that contributes to their inferior classification performance is error propagation: the compounding of errors from misclassifications at higher levels, which cannot be rectified at the lower levels. This problem can be alleviated to a certain extent by restructuring (modifying) the hierarchy to remove the inconsistent nodes that cause performance deterioration. In this paper, our main contribution is the development of data-driven approaches for removing inconsistencies in the expert-defined (original) hierarchy, leading to a hierarchy that achieves higher classification performance irrespective of the HC approach used for training. We propose a flattening approach where inconsistent nodes are selectively removed from the hierarchy. The criterion for flattening a node is the regularized risk minimization objective value that the model trained for that node attains on a separate validation set. If the objective value for a node is above a certain threshold, then we flatten the node, i.e., we remove it from the hierarchy and add its children to its parent node. Based on the strategy adopted for identifying inconsistent nodes, we propose two different approaches for inconsistent node flattening (INF): (i) a local approach (Level-INF) that computes a level-wise cutoff threshold and (ii) a global approach (Global-INF) that computes a single threshold for the entire hierarchy.
Experimental comparisons of the top-down HC approach on our proposed modified hierarchy show statistically significant performance improvement in comparison to the baseline hierarchy (expert-defined or original) and other comparative methods for hierarchy modification [13, 14]. We also performed a detailed analysis to show that the reduction in misclassification error at higher levels with our proposed hierarchy modification approach leads to reduced error propagation and hence better classification performance. In comparison to flat classification, our approach is more accurate for classes with fewer training instances (rare categories) and is computationally efficient for large-scale HC problems during the prediction phase.
II Literature Review
There has been a large body of research focusing on the HC problem. Besides completely ignoring the hierarchy (flat classifiers), one class of HC methods solves various local subproblems by training individual classifier(s) for each of the nodes (or parent nodes) in the hierarchy, or by learning classifiers for each of the levels. These methods are referred to as local classification because only local structural relationship information is used while training these classifiers. To predict the labels of instances, top-down local hierarchical methods proceed by selecting the most relevant node at the topmost level and then recursively selecting the best node until a leaf category is reached, which is the final predicted label. Local approaches are more popular due to their computational benefits. In contrast to local classification, global classification methods learn a single complex model over all the nodes in the hierarchy and are computationally more expensive than flat and local methods. Therefore, in this paper we focus on top-down local classification methods for training models and predicting labels.
Some of the earlier studies focus on exploiting hierarchies among categories for the purpose of classification [3, 16, 17, 18], but the number of categories is limited to a few hundred. One of the earlier breakthroughs in the field of hierarchical text categorization was by Koller et al. This approach used a divide-and-conquer paradigm for solving the HC problem, which can easily be adopted in large-scale settings. Following this, numerous approaches have been developed to improve HC for larger datasets. Liu et al. studied the classification performance of an SVM-based method that scales to millions of categories. Gopal et al. used a regularization term within the optimization function to capture the parent-child relationships in the hierarchy. This approach, referred to as HR-LR and HR-SVM, shows improved classification performance, but the training procedure for large-scale problems requires a distributed implementation and map-reduce-supported infrastructure. Xue et al. proposed a two-stage approach. For each test document, the first stage retrieves a set of candidate categories based on similarity to the test document. The second stage then builds a classifier on the hierarchy restricted to the categories fetched in the first stage and classifies the test document using the restricted hierarchy. Although pruning reduces the hierarchy to a manageable size, one severe drawback of this approach is that a different classifier must be trained for each test document, which is expensive. Other works in the field of HC can be found in a detailed survey by Silla et al.
II-A Hierarchy Modification
As discussed earlier, most HC approaches rely on hierarchical relationships for learning complex models to improve the classification performance. However, the performance can be negatively impacted if the hierarchy used while learning the models is inconsistent. Therefore, it is of utmost importance to generate an improved hierarchical representation that is suitable for classification prior to learning models. Inconsistencies in the hierarchy are due to the following reasons:
Hierarchies are designed for efficient search and navigation without considering HC performance.
Hierarchical grouping of categories is done based on semantics, whereas classification depends on data characteristics such as term frequency.
Multiple hierarchies are possible for the same dataset (such as SCOP and CATH for protein structures). However, there is no consensus regarding which hierarchy is better for classification.
Consistent hierarchy design for datasets with large number of categories is prone to errors.
Several approaches that restructure the hierarchy have been developed in the past. Level flattening is one of the approaches used in earlier works on hierarchy modification, where some of the levels are flattened (removed) from the original hierarchy prior to learning models. Different methods of modification exist depending on which levels are flattened. Top Level Flattening (TLF), shown in Figure 1(c), modifies the hierarchy by flattening the top level of the original hierarchy. Model learning and prediction for the flattened hierarchy are done in the same fashion as for top-down methods. Bottom Level Flattening (BLF) and Multiple Level Flattening (MLF), shown in Figures 1(d) and 1(e), are similar methods of hierarchy modification where the bottom level and multiple levels are removed, respectively. As done in Wang et al., we removed the first and third levels for evaluating the MLF approach.
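The level flattening operation described above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: it assumes the hierarchy is stored as a child-to-parent dictionary in which the root has no entry, and the names `depth`, `flatten_level` and `toy` are hypothetical. Only internal nodes are removed, so leaf categories are preserved.

```python
from typing import Dict


def depth(node: str, parent: Dict[str, str]) -> int:
    """Number of edges from the root to `node` (the root has no parent entry)."""
    d = 0
    while node in parent:
        node = parent[node]
        d += 1
    return d


def flatten_level(parent: Dict[str, str], level: int) -> Dict[str, str]:
    """Remove every internal node at `level` and attach its children to its parent."""
    internal = set(parent.values())
    removed = {n for n in parent if n in internal and depth(n, parent) == level}
    new_parent = {}
    for node, par in parent.items():
        if node in removed:
            continue
        # Climb past removed nodes until a surviving ancestor is found.
        while par in removed:
            par = parent[par]
        new_parent[node] = par
    return new_parent


# Toy hierarchy (assumed for illustration): root -> {A, B}, A -> {a1, a2}, B -> {b1}
toy = {"A": "root", "B": "root", "a1": "A", "a2": "A", "b1": "B"}
tlf = flatten_level(toy, level=1)  # Top Level Flattening on the toy hierarchy
print(tlf)
```

On this toy hierarchy, flattening level 1 leaves all three leaves directly under the root, which coincides with the flat setting.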
Babbar et al. proposed a maximum-margin based strategy for hierarchy modification. This method selectively removes some of the inconsistent nodes in the hierarchy based on margins rather than removing complete levels. Hierarchy modification using this approach (shown in Figure 1(f)) is beneficial for classification and has been theoretically justified. We followed a similar approach for hierarchy modification. However, our method differs in the following two aspects: (i) We developed a more systematic approach for selecting the threshold used to identify the inconsistent nodes for flattening. Our approach is based on the deviation from the mean, which is empirically tuned for each dataset. (ii) We also considered a global perspective of the hierarchy (Global-INF) for threshold selection to identify the set of inconsistent nodes. This is a more intuitive and realistic way of selecting the threshold because it prevents the excessive flattening of nodes that results from purely local decisions, thereby retaining the benefits of leveraging the hierarchy during model learning and classification, especially for rare categories (see Section VI for justification).
|Symbol||Description|
|$\mathcal{H}$||original (expert-defined) hierarchy|
|$\mathbb{R}$||set of real numbers|
|$D$||dimensionality of input vector|
|$N$||total number of training examples|
|$\mathcal{L}$||set of leaf categories|
|$\mathbf{x}_i \in \mathbb{R}^D$||input vector for $i$-th training example|
|$y_i \in \mathcal{L}$||true label for $i$-th training example|
|$\hat{y}_i \in \mathcal{L}$||predicted label for $i$-th test example|
|$y_i^n \in \{\pm 1\}$||binary label used for $i$-th training example to learn weight vector for $n$-th node in $\mathcal{H}$, $y_i^n$ = 1 iff $y_i$ = $n$, -1 otherwise|
|$\hat{y}_i^n \in \{\pm 1\}$||predicted label for $i$-th test example at $n$-th node in the hierarchy, $\hat{y}_i^n$ = 1 iff prediction is positive, -1 otherwise|
|$\boldsymbol{\Theta}_n$||weight vector for $n$-th node|
|$C$||mis-classification penalty parameter|
|$f_n^*$||optimal objective function value for $n$-th node obtained using validation dataset. We have dropped the subscript $n$ at some places for ease of description|
|$\mathcal{H}_L$||modified hierarchy after level-wise inconsistent node removal|
|$\mathcal{H}_G$||modified hierarchy after global inconsistent node removal|
|$\mathcal{N}_l$||set of nodes at $l$-th level in $\mathcal{H}$|
|$\mathcal{N}$||set of nodes (except root) in $\mathcal{H}$|
|$\Phi_L$||set of inconsistent nodes using level-wise INF method|
|$\Phi_G$||set of inconsistent nodes using global-INF method|
|$\mu(S)$||mean of samples in set $S$|
|$\sigma(S)$||standard deviation of samples in set $S$|
|$F_l$||set of $f^*$ values for nodes at $l$-th level in $\mathcal{H}$|
|$F$||set of $f^*$ values for all nodes (except root) in $\mathcal{H}$|
|$\tau_l$||threshold limit for identifying inconsistent nodes at $l$-th level in $\mathcal{H}$|
|$\tau$||threshold limit for identifying inconsistent nodes in $\mathcal{H}$|
|$k_l$||fitness parameter for level-wise threshold selection at $l$-th level in $\mathcal{H}$|
|$k$||fitness parameter for global threshold selection in $\mathcal{H}$|
Hierarchy modification using supervised learning approaches has also been proposed in the literature [11, 17], where the hierarchy is gradually modified to obtain a hierarchy that improves the classification performance. These methods require an expensive evaluation after each modification, making them computationally infeasible for large-scale settings. Hence, we do not compare our approach to these methods. Other competitive methods that involve restructuring the hierarchy have been developed by us and appear in an arXiv publication.
Table I summarizes the common notations used in this paper. We use bold letters to indicate vector variables.
III-A Problem Setup
Given a hierarchy $\mathcal{H}$, we train a binary one-vs-rest classifier for each node $n \in \mathcal{N}$ to discriminate its positive examples from the examples of other nodes (i.e., negative examples) in the hierarchy. We followed the ‘inclusive policy’ for training classifiers, where the examples of all descendant categories of node $n$ (including node $n$ itself) are considered positive examples and those of the remaining categories negative examples. In this paper, we have used logistic regression (LR) as the underlying base model for training. The LR objective uses the logistic loss to minimize the empirical risk and an $\ell_2$-norm term (denoted by $\|\cdot\|_2^2$) to control the model complexity and prevent overfitting. The objective function for training a model corresponding to node $n$ is provided in eq. (1).
For each node $n \in \mathcal{N}$, we solve eq. (1) to obtain the optimal weight vector, denoted by $\boldsymbol{\Theta}_n$. The complete set of parameters for all the nodes constitutes the learned model for the hierarchical top-down classifier. For LR models, the conditional probability of the binary label $y_i^n$ given the feature vector $\mathbf{x}_i$ and the weight vector $\boldsymbol{\Theta}_n$ is given by eq. (2), and the classification decision function by eq. (3).
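The per-node model described above can be sketched in a few lines. This is a hedged illustration that assumes the common liblinear-style form of the objective, $C \sum_i \log(1 + \exp(-y_i^n \boldsymbol{\Theta}_n^\top \mathbf{x}_i)) + \frac{1}{2}\|\boldsymbol{\Theta}_n\|_2^2$, together with the standard logistic conditional probability and a sign-based decision rule; the exact form of eqs. (1) to (3) may differ in constants. The function names are hypothetical.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def log1p_exp(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))


def lr_objective(theta, X, y, C: float) -> float:
    """Assumed objective: C * sum of logistic losses + 0.5 * squared L2 norm."""
    loss = sum(log1p_exp(-yi * dot(theta, xi)) for xi, yi in zip(X, y))
    return C * loss + 0.5 * dot(theta, theta)


def prob(theta, x, y: int = 1) -> float:
    """P(y | x, theta) = 1 / (1 + exp(-y * theta.x)), computed stably."""
    return math.exp(-log1p_exp(-y * dot(theta, x)))


def decide(theta, x) -> int:
    """Binary decision: +1 if the score is non-negative, -1 otherwise."""
    return 1 if dot(theta, x) >= 0 else -1


# Two toy examples for one node: the first is positive, the second negative.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, -1]
obj_at_zero = lr_objective([0.0, 0.0], X, y, C=1.0)
print(obj_at_zero, prob([0.0, 0.0], X[0]), decide([1.0, -1.0], X[0]))
```

Evaluating `lr_objective` with the trained weights on held-out examples gives the kind of per-node validation objective value that the flattening criterion relies on.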
For a test example with feature vector $\mathbf{x}_i$, the top-down classifier predicts the class label $\hat{y}_i$ as shown in eq. (4), where $\mathcal{C}(n)$ denotes the set of children of node $n$. Essentially, the algorithm starts at the root node and recursively selects the best child node until it reaches a terminal node belonging to the set of leaf categories $\mathcal{L}$.
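The top-down prediction rule can be illustrated with a short sketch. It assumes each non-root node has a linear scoring vector and that the "best" child is the one with the highest score; the dictionary layout and the names `predict_top_down`, `children` and `weights` are illustrative only.

```python
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def predict_top_down(x: Sequence[float],
                     children: Dict[str, List[str]],
                     weights: Dict[str, Sequence[float]],
                     root: str = "root") -> str:
    """Start at the root and repeatedly move to the best-scoring child
    until a node without children (a leaf category) is reached."""
    node = root
    while children.get(node):
        node = max(children[node], key=lambda c: dot(weights[c], x))
    return node


# Toy hierarchy and weights (assumed values for illustration).
children = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1"]}
weights = {
    "A": [1.0, 0.0], "B": [0.0, 1.0],
    "a1": [1.0, -1.0], "a2": [-1.0, 1.0], "b1": [0.0, 0.0],
}
label = predict_top_down([2.0, 0.5], children, weights)
print(label)
```

Only the models along the chosen path are evaluated, which is the source of the prediction-time efficiency of top-down methods. It also shows why an early wrong turn cannot be undone further down the path.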
III-B Inconsistent Node Flattening
Gao et al. showed that for any classifier that correctly classifies random input-output pairs using a set of decision nodes, the generalization error is, with probability greater than $1 - \delta$, less than the expression shown in eq. (5).
The bound in eq. (5) depends on the margin at each decision node, a constant term, and the radius of the ball containing the distribution’s support.
This provides two significant strategies for reducing the generalization error in our approach: (i) increase the margin of the learned models at the nodes in the hierarchy, or (ii) decrease the number of decision nodes involved in making a prediction. To achieve optimum classification performance, we need to balance the trade-off between the margin and the number of decision nodes. The two extreme cases for learning hierarchical classifiers are the flat and top-down methods. For flat classifiers, we make a single decision (i.e., there is one decision node), but the margin is presumably small because of the large number of leaf categories that need to be distinguished, which makes it difficult to obtain a large margin. For top-down hierarchical classifiers, we make a series of decisions from the root to a leaf node (i.e., there is more than one decision node), but the margin is larger because fewer categories need to be distinguished at each decision node. Motivated by this trade-off, in this paper we propose a method that removes some of the inconsistent nodes in the hierarchy $\mathcal{H}$, thereby increasing the margin of the learned models at the nodes in the hierarchy while reducing the number of decision nodes needed to classify an unlabeled test instance.
In order to improve the effectiveness of classification, we need to identify these inconsistent nodes and flatten them. We mark a node within the hierarchy as inconsistent if the value of its objective function is greater than a chosen threshold value. To get a more reliable estimate of the objective value, we first train the regularized LR models on a training set locally for each node and then compute the objective function on a separate validation set, which is different from the training set. The objective value on the validation set for node $n$ is denoted by $f_n^*$. We develop the following approaches for setting the threshold for flattening.
Level-wise Inconsistent Node Flattening: In this approach, referred to as Level-INF, we select a different threshold locally for each level in the hierarchy. Algorithm 1 presents the level-wise approach, which selects inconsistent nodes at each level in a top-down manner. The threshold $\tau_l$ for level $l$ is computed as the sum of the mean and $k_l$ times the standard deviation of the set of values $F_l = \{f_n^* : n \in \mathcal{N}_l\}$, i.e., $\tau_l = \mu(F_l) + k_l\,\sigma(F_l)$, where $k_l$ is a fitness parameter at level $l$ that is empirically estimated for each dataset based on the $f^*$ values (see Section VI-B) and $\mathcal{N}_l$ represents the set of nodes in level $l$. All nodes $n \in \mathcal{N}_l$ that satisfy $f_n^* > \tau_l$ are marked as inconsistent and added to the set of inconsistent nodes, denoted by $\Phi_L$. This procedure is repeated for all levels of the hierarchy. Finally, we flatten every node $n$ in the set $\Phi_L$: we remove $n$ and its corresponding edges, and add edges from the children of $n$ to $n$’s parent node. The modified hierarchy thus obtained is denoted by $\mathcal{H}_L$. Using the modified hierarchy, we re-train a top-down classifier.
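A minimal sketch of the level-wise procedure is given below. It is not the authors' Algorithm 1: it assumes a child-to-parent dictionary for the hierarchy, precomputed validation objective values for the internal (non-root) nodes, and the population standard deviation, since the text does not state which estimator is meant. The names `level_inf_nodes`, `flatten`, `f_star`, `level_of` and `k` are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, Set


def level_inf_nodes(f_star: Dict[str, float],
                    level_of: Dict[str, int],
                    k: Dict[int, float]) -> Set[str]:
    """Mark node n at level l as inconsistent when f*_n > mean + k_l * std,
    with mean and std taken over the f* values of that level only."""
    by_level: Dict[int, list] = {}
    for node, lvl in level_of.items():
        by_level.setdefault(lvl, []).append(node)
    inconsistent: Set[str] = set()
    for lvl, nodes in by_level.items():
        vals = [f_star[n] for n in nodes]
        tau = mean(vals) + k[lvl] * pstdev(vals)
        inconsistent |= {n for n in nodes if f_star[n] > tau}
    return inconsistent


def flatten(parent: Dict[str, str], inconsistent: Set[str]) -> Dict[str, str]:
    """Remove inconsistent nodes and attach their children to the nearest
    surviving ancestor (the root is never removed)."""
    new_parent = {}
    for node, par in parent.items():
        if node in inconsistent:
            continue
        while par in inconsistent:
            par = parent[par]
        new_parent[node] = par
    return new_parent


# Toy example (assumed values): node C has a much larger objective value.
parent = {"A": "root", "B": "root", "C": "root",
          "a1": "A", "a2": "A", "b1": "B", "c1": "C"}
f_star = {"A": 1.0, "B": 1.0, "C": 4.0}
level_of = {"A": 1, "B": 1, "C": 1}
phi_L = level_inf_nodes(f_star, level_of, k={1: 1.0})
H_L = flatten(parent, phi_L)
print(phi_L, H_L)
```

Here the level-1 threshold is roughly 3.41 (mean 2.0 plus one standard deviation of about 1.41), so only `C` is flattened and its child `c1` moves under the root.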
Global Inconsistent Node Flattening: Unlike the Level-INF approach, which sets a different threshold for each level, the global method shown in Algorithm 2 computes a single threshold value for all levels. The threshold $\tau$ is computed as the sum of the mean and $k$ times the standard deviation of the set of values $F = \{f_n^* : n \in \mathcal{N}\}$, i.e., $\tau = \mu(F) + k\,\sigma(F)$, where $k$ is a fitness parameter that is empirically estimated for each dataset considering the $f^*$ values of all nodes. $\tau$ is used to identify the set of inconsistent nodes $\Phi_G$ in the hierarchy (i.e., all nodes $n$ with $f_n^* > \tau$). The hierarchy obtained by flattening the nodes present in $\Phi_G$ is denoted by $\mathcal{H}_G$. Using the modified hierarchy, we re-train a top-down classifier. In this paper we refer to this approach as Global-INF.
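The global thresholding step can be sketched in the same spirit. Again this is only an illustration under stated assumptions: the objective values are made up, the population standard deviation is an assumed choice, and `global_inf_nodes` is a hypothetical name rather than the authors' Algorithm 2.

```python
from statistics import mean, pstdev
from typing import Dict, Set


def global_inf_nodes(f_star: Dict[str, float], k: float) -> Set[str]:
    """Mark a node as inconsistent when its validation objective value
    exceeds mean + k * std computed over all (non-root) nodes at once."""
    vals = list(f_star.values())
    tau = mean(vals) + k * pstdev(vals)
    return {n for n, v in f_star.items() if v > tau}


# Assumed objective values for nodes drawn from several levels.
f_star = {"A": 1.0, "B": 1.2, "A1": 5.0, "A2": 5.2, "B1": 9.0}
phi_G = global_inf_nodes(f_star, k=1.0)
print(phi_G)
```

With these values the mean is 4.28 and the standard deviation is about 2.96, so the threshold for `k=1.0` is about 7.24 and only `B1` is marked. Because a single threshold is shared, a level is not forced to give up any node merely because that node is the largest within its own level.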
IV Experimental Evaluations
We have used text and image datasets for evaluating the performance of our proposed approaches. Various statistics of the datasets used in our experiments are listed in Table II. All these datasets are single-labeled and the examples are mandatorily assigned to the leaf nodes in the hierarchy (although our proposed approaches are trivially extendable to datasets with multi-label and non-mandatory leaf node label assignments). For all text datasets, we have applied the tf-idf transformation with $\ell_2$-norm normalization to the word-frequency feature vector. The datasets are described as follows:
CLEF
Medical images annotated with medical applications codes. Each image is represented by 80 features that are extracted using the local distribution of edges.
DIATOMS
Diatom images that were created as part of the ADIAC project. Features for each image are created using various feature extraction techniques. Further, we have preprocessed the original dataset by removing the examples that belong to the internal nodes.
|Dataset||#Total Node||#Leaf Node||Depth||#Training||#Testing||#Features|
IPC
Collection of patent documents organized in the International Patent Classification hierarchy.
DMOZ-SMALL, DMOZ-2010 and DMOZ-2012 (http://dmoz.org)
Multiple web documents organized into various classes using the hierarchical structure. These datasets are subsets of the web pages from the Open Directory Project and have been released as part of the LSHTC (http://lshtc.iit.demokritos.gr/) challenge. For evaluating the DMOZ-2010 and DMOZ-2012 datasets we have used the provided test split, and prediction scores are obtained using the web-portal interfaces (http://lshtc.iit.demokritos.gr/node/81 and http://lshtc.iit.demokritos.gr/LSHTC3_oracleUpload) that were used during the competition.
IV-B Evaluation Metrics
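Flat Evaluation Measures: We have used the standard metrics micro-$F_1$ ($\mu F_1$) and macro-$F_1$ ($MF_1$) for evaluating the performance of various methods. To compute $\mu F_1$, we sum up the category-specific true positives ($TP_l$), false positives ($FP_l$) and false negatives ($FN_l$) for the different categories and compute the score as:
$$P = \frac{\sum_{l \in \mathcal{L}} TP_l}{\sum_{l \in \mathcal{L}} (TP_l + FP_l)}, \qquad R = \frac{\sum_{l \in \mathcal{L}} TP_l}{\sum_{l \in \mathcal{L}} (TP_l + FN_l)}, \qquad \mu F_1 = \frac{2PR}{P + R}$$
Unlike $\mu F_1$, $MF_1$ gives equal weight to all the categories so that the average score is not skewed in favor of the larger categories. It is defined as follows:
$$P_l = \frac{TP_l}{TP_l + FP_l}, \qquad R_l = \frac{TP_l}{TP_l + FN_l}, \qquad MF_1 = \frac{1}{|\mathcal{L}|} \sum_{l \in \mathcal{L}} \frac{2 P_l R_l}{P_l + R_l}$$
where $|\mathcal{L}|$ is the number of leaf categories.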
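As a worked illustration of the two flat measures, the sketch below computes micro- and macro-averaged $F_1$ for single-label predictions. The function name and the toy labels are assumptions made for the example; categories with no true positives, false positives or false negatives are given an $F_1$ of 0 here, which is one common convention.

```python
from collections import Counter
from typing import Sequence, Tuple


def micro_macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
                   labels: Sequence[str]) -> Tuple[float, float]:
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted category gains a false positive
            fn[t] += 1  # true category gains a false negative
    # Micro: pool the counts over all categories, then compute F1 once.
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * TP / (2 * TP + FP + FN) if TP else 0.0
    # Macro: compute F1 per category, then average with equal weight.
    per_label = []
    for l in labels:
        denom = 2 * tp[l] + fp[l] + fn[l]
        per_label.append(2 * tp[l] / denom if denom else 0.0)
    macro = sum(per_label) / len(labels)
    return micro, macro


y_true = ["a", "a", "a", "b", "c"]
y_pred = ["a", "a", "b", "b", "c"]
micro, macro = micro_macro_f1(y_true, y_pred, labels=["a", "b", "c"])
print(micro, macro)
```

The identity $2PR/(P+R) = 2TP/(2TP + FP + FN)$ is used to avoid dividing by a zero precision or recall. In this example the micro score is 0.8, while the macro score (about 0.82) weights the single-example category `c` as heavily as the three-example category `a`.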
Hierarchical Evaluation Measures: Unlike flat measures, which penalize each misclassified example equally, hierarchical measures take into consideration the hierarchical distance between the true label and the predicted label when evaluating classifier performance. In general, misclassifications that are closer to the actual class are less severe than misclassifications that are farther from the true class with respect to the hierarchy. The hierarchy-based measures include hierarchical $F_1$ ($hF_1$) (the harmonic mean of hierarchical precision ($hP$) and hierarchical recall ($hR$)) and tree-induced error ($TE$), which are defined as follows:
$$hP = \frac{\sum_{i} |A(\hat{y}_i) \cap A(y_i)|}{\sum_{i} |A(\hat{y}_i)|}, \qquad hR = \frac{\sum_{i} |A(\hat{y}_i) \cap A(y_i)|}{\sum_{i} |A(y_i)|}, \qquad hF_1 = \frac{2\, hP\, hR}{hP + hR}$$
$$TE = \frac{1}{N_t} \sum_{i=1}^{N_t} \mathrm{dist}(\hat{y}_i, y_i)$$
where $A(\hat{y}_i)$ and $A(y_i)$ are respectively the sets of ancestors of the predicted and true labels, which include the label itself but do not include the root node, $N_t$ is the number of test examples, and $\mathrm{dist}(a, b)$ gives the length of the undirected path between categories $a$ and $b$ in the graph. For $TE$ lower scores are better, whereas for all other measures higher scores are better.
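The hierarchical measures can be illustrated on a toy tree. The sketch assumes a child-to-parent dictionary in which the root has no entry, and it computes the undirected path length from ancestor sets (valid for trees); the names `ancestors`, `hierarchical_f1` and `tree_error` are hypothetical.

```python
from typing import Dict, Sequence, Set


def ancestors(node: str, parent: Dict[str, str]) -> Set[str]:
    """Ancestors of `node`, including the node itself and excluding the root."""
    out = set()
    while node in parent:  # the root has no parent entry, so it is never added
        out.add(node)
        node = parent[node]
    return out


def hierarchical_f1(y_true: Sequence[str], y_pred: Sequence[str],
                    parent: Dict[str, str]) -> float:
    inter = pred_total = true_total = 0
    for t, p in zip(y_true, y_pred):
        a_t, a_p = ancestors(t, parent), ancestors(p, parent)
        inter += len(a_t & a_p)
        pred_total += len(a_p)
        true_total += len(a_t)
    hp, hr = inter / pred_total, inter / true_total
    return 2 * hp * hr / (hp + hr) if hp + hr else 0.0


def tree_error(y_true: Sequence[str], y_pred: Sequence[str],
               parent: Dict[str, str]) -> float:
    """Average undirected path length between true and predicted labels."""
    total = 0
    for t, p in zip(y_true, y_pred):
        a_t, a_p = ancestors(t, parent), ancestors(p, parent)
        # In a tree: depth(t) + depth(p) - 2 * depth(lowest common ancestor).
        total += len(a_t) + len(a_p) - 2 * len(a_t & a_p)
    return total / len(y_true)


parent = {"A": "root", "B": "root", "a1": "A", "a2": "A", "b1": "B"}
y_true = ["a1", "a1", "b1"]
y_pred = ["a1", "a2", "a1"]
hf1 = hierarchical_f1(y_true, y_pred, parent)
te = tree_error(y_true, y_pred, parent)
print(hf1, te)
```

In this example the sibling confusion (`a1` predicted as `a2`) contributes a path length of 2, while the cross-branch confusion contributes 4, so the two errors are penalized differently even though a flat measure would count them equally.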
Note that for consistent evaluation of hierarchical measures we have used the original hierarchy for all methods unless otherwise noted.
|Dataset||Metric||TD-LR||TLF||BLF||MLF||MTA||Level-INF (proposed)||Global-INF (proposed)|
|CLEF||$\mu F_1$||72.74 (0.43)||75.84 (0.32)||73.76 (0.32)||X||74.48 (0.42)||75.25 (0.55)||77.14 (0.01)|
|CLEF||$MF_1$||35.92 (0.01)||38.45 (0.65)||40.93 (0.19)||X||39.53 (0.74)||39.89 (0.24)||46.54 (0.01)|
|DIATOMS||$\mu F_1$||53.27 (0.32)||56.93 (0.28)||53.27 (0.24)||X||58.36 (0.64)||58.32 (0.64)||61.31 (0.53)|
|DIATOMS||$MF_1$||44.46 (0.24)||45.17 (0.62)||44.30 (0.64)||X||45.21 (0.65)||48.77 (0.12)||51.85 (0.23)|
|IPC||$\mu F_1$||49.32 (0.32)||51.28 (0.61)||50.36 (0.64)||X||51.36 (0.32)||50.40 (0.32)||52.30 (0.12)|
|IPC||$MF_1$||42.51 (0.94)||44.99 (0.43)||43.74 (0.81)||X||42.80 (0.94)||43.26 (0.43)||45.65 (0.11)|
|DMOZ-SMALL||$\mu F_1$||45.10 (0.23)||45.48 (0.19)||44.34 (0.32)||45.80 (0.64)||46.01 (0.74)||45.43 (0.21)||46.61 (0.28)|
|DMOZ-SMALL||$MF_1$||30.65 (0.43)||30.60 (0.54)||30.94 (0.53)||30.62 (0.32)||30.82 (0.63)||30.34 (0.12)||31.86 (0.64)|
|DMOZ-2010||$\mu F_1$||40.22 (0.55)||41.32 (0.32)||40.34 (0.24)||41.77 (0.56)||41.82 (0.42)||40.71 (0.83)||42.37 (0.27)|
|DMOZ-2010||$MF_1$||28.37 (0.46)||29.05 (0.84)||28.41 (0.57)||29.11 (0.13)||29.18 (0.54)||28.66 (0.53)||30.41 (0.64)|
|DMOZ-2012||$\mu F_1$||50.13 (0.28)||50.32 (0.42)||50.11 (0.32)||48.05 (0.39)||50.31 (0.48)||49.90 (0.92)||50.64 (0.22)|
|DMOZ-2012||$MF_1$||29.89 (0.23)||29.89 (0.23)||29.73 (0.14)||27.65 (0.48)||30.04 (0.57)||30.52 (0.74)||30.58 (0.28)|
Table shows the mean, with the standard deviation in parentheses, across five runs. ’X’ denotes that MLF is not possible. Significance-test results are reported at p-value thresholds of 0.05 and 0.01. We have used the sign test and the Wilcoxon rank test for statistical evaluation of the $\mu F_1$ and $MF_1$ scores, respectively. Tests are between our best proposed approach, Global-INF, and the best baseline approach, MTA, for a single run. These statistical tests are not performed on the DMOZ-2010 and DMOZ-2012 datasets because we do not have access to the true labels from the online evaluation system.
|Metric||Dataset||Hierarchy||TD-LR||TLF||BLF||MLF||MTA||Level-INF (proposed)||Global-INF (proposed)|
|$hF_1$||CLEF||Original||74.52 (0.01)||78.24 (0.75)||75.13 (0.46)||X||76.01 (0.74)||76.81 (0.59)||79.06 (0.01)|
|$hF_1$||CLEF||Modified||-||77.78 (0.65)||78.08 (0.13)||X||77.50 (0.23)||78.28 (0.24)||80.87 (0.13)|
|$hF_1$||DIATOMS||Original||56.15 (0.21)||62.53 (0.43)||56.14 (0.17)||X||59.60 (0.28)||60.03 (0.24)||62.80 (0.04)|
|$hF_1$||DIATOMS||Modified||-||63.38 (0.24)||57.02 (0.62)||X||59.70 (0.14)||59.98 (0.28)||63.88 (0.13)|
|$hF_1$||IPC||Original||62.57 (0.32)||64.39 (0.38)||63.00 (0.10)||X||63.42 (0.54)||63.26 (0.34)||64.73 (0.12)|
|$hF_1$||IPC||Modified||-||65.48 (0.32)||63.24 (0.41)||X||63.14 (0.54)||62.52 (0.38)||66.29 (0.28)|
|$hF_1$||DMOZ-SMALL||Original||63.14 (0.54)||63.17 (0.43)||63.26 (0.52)||63.32 (0.64)||63.20 (0.54)||61.98 (0.56)||63.37 (0.44)|
|$hF_1$||DMOZ-SMALL||Modified||-||64.32 (0.50)||63.94 (0.38)||63.39 (0.19)||63.82 (0.42)||58.02 (0.14)||64.97 (0.75)|
|$hF_1$||DMOZ-2012||Original||73.04 (0.21)||72.70 (0.17)||73.04 (0.28)||70.49 (0.03)||73.03 (0.11)||71.41 (0.38)||73.19 (0.02)|
|$TE$||CLEF||Original||1.26 (0.01)||1.08 (0.08)||1.23 (0.03)||X||1.13 (0.09)||1.15 (0.05)||1.04 (0.03)|
|$TE$||CLEF||Modified||-||0.89 (0.07)||0.88 (0.04)||X||0.90 (0.04)||0.94 (0.01)||0.71 (0.09)|
|$TE$||DIATOMS||Original||1.76 (0.01)||1.49 (0.01)||1.76 (0.03)||X||1.60 (0.03)||1.60 (0.06)||1.49 (0.02)|
|$TE$||DIATOMS||Modified||-||1.28 (0.02)||1.32 (0.02)||X||1.14 (0.06)||1.16 (0.02)||1.08 (0.08)|
|$TE$||IPC||Original||2.23 (0.02)||2.12 (0.04)||2.20 (0.01)||X||2.22 (0.06)||2.19 (0.01)||2.10 (0.02)|
|$TE$||IPC||Modified||-||1.64 (0.01)||1.58 (0.04)||X||1.80 (0.06)||1.83 (0.03)||1.38 (0.02)|
|$TE$||DMOZ-SMALL||Original||3.55 (0.04)||3.55 (0.02)||3.53 (0.03)||3.53 (0.06)||3.51 (0.08)||3.65 (0.06)||3.50 (0.02)|
|$TE$||DMOZ-SMALL||Modified||-||2.96 (0.05)||2.90 (0.01)||2.62 (0.03)||2.68 (0.03)||2.82 (0.02)||2.37 (0.03)|
|$TE$||DMOZ-2010||Original||3.69 (0.03)||3.58 (0.01)||3.68 (0.10)||3.56 (0.08)||3.61 (0.02)||3.74 (0.04)||3.53 (0.01)|
Table shows the mean, with the standard deviation in parentheses, across five runs. ’X’ denotes that MLF is not possible. Evaluations for the DMOZ-2010 and DMOZ-2012 datasets cannot be performed on the new hierarchy as it is not supported by the web portal. Further, the $hF_1$ score for DMOZ-2010 and the $TE$ score for DMOZ-2012 are not available from the online evaluation system.
IV-C Experimental Protocol
In all the experiments, we have divided the training dataset into a train set and a small validation set in the ratio 90:10. Each experiment was run five times with different randomly chosen train and validation splits. Testing is done on an independent held-out dataset as provided by these benchmarks. The model is trained by choosing the mis-classification penalty parameter ($C$) from a set of seven candidate values. The best parameter is selected using the validation set. The best parameters are used to re-train the models on the entire training set and the performance is measured on the held-out test set. For the INF methods, we compute and save the $f^*$ value for each node in the hierarchy using the validation set. Setting the threshold as $\tau = \mu(F) + k\,\sigma(F)$ (or $\tau_l = \mu(F_l) + k_l\,\sigma(F_l)$ for the $l$-th level in the Level-INF approach), we remove the inconsistent nodes, where the best value of the fitness parameter $k$ (or $k_l$) is computed empirically for each dataset (see Section VI-B). All experiments were conducted using a modified version of the liblinear (http://www.csie.ntu.edu.tw/cjlin/liblinear/) software and were run on ARGO, a research computing cluster provided by the Office of Research Computing (URL: http://orc.gmu.edu), at George Mason University, VA.
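The repeated 90:10 split in the protocol above can be sketched as follows. This is an illustrative helper only: the function name, the use of run indices as seeds, and the example size of 1000 training examples are assumptions, not details from the paper.

```python
import random
from typing import List, Tuple


def train_val_split(n_examples: int, seed: int,
                    val_fraction: float = 0.1) -> Tuple[List[int], List[int]]:
    """Shuffle example indices with a fixed seed and hold out a fraction
    of them for validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(n_examples * val_fraction))
    return indices[n_val:], indices[:n_val]


# Five runs, each with its own random train/validation split.
splits = [train_val_split(1000, seed=run) for run in range(5)]
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

In each run, the validation indices would be used both to pick the penalty parameter and to compute the per-node objective values, after which the models are re-trained on all training examples.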
V Comparative Approaches
V-A Flat Methods
Logistic Regression (LR)
We train binary one-versus-rest regularized LR classifiers for each of the leaf categories, ignoring the hierarchical structure. The prediction decision for unlabeled test instance x is based on the maximum prediction score achieved when compared across the one-versus-rest classifiers as shown in eq. (10).
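For contrast with the top-down rule, the flat prediction step can be sketched as an arg-max over all leaf classifiers. The weights below are made-up values and `predict_flat` is a hypothetical name; the sketch assumes the prediction score of each one-versus-rest classifier is its linear score.

```python
from typing import Dict, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def predict_flat(x: Sequence[float],
                 leaf_weights: Dict[str, Sequence[float]]) -> str:
    """Score x with every leaf classifier and return the best-scoring leaf."""
    return max(leaf_weights, key=lambda leaf: dot(leaf_weights[leaf], x))


leaf_weights = {"a1": [1.0, -1.0], "a2": [-1.0, 1.0], "b1": [0.5, 0.5]}
flat_label = predict_flat([2.0, 0.5], leaf_weights)
print(flat_label)
```

Every leaf model is evaluated for every test instance, which is why flat prediction becomes slow when the number of leaf categories is large.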
Error Correcting Output Codes (ECOC) 
This approach combines binary classifiers to exploit correlations and correct errors. Codewords are generated randomly, with bits assigned for representing the hierarchical information between the categories. Experiments were done with codeword lengths varying from 32 to 1024 bits depending on the dataset. For testing an unlabeled example, the output codeword is compared to the codeword of each category, and the category with the minimum Hamming distance is selected as the class label for that example.
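The decoding step of ECOC can be illustrated with a small sketch. The 5-bit codebook is made up for the example (the experiments use 32 to 1024 bits), and the names `hamming` and `decode` are hypothetical.

```python
from typing import Dict, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length codewords differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(output_bits: Sequence[int],
           codebook: Dict[str, Sequence[int]]) -> str:
    """Return the category whose codeword is closest to the predicted bits."""
    return min(codebook, key=lambda c: hamming(codebook[c], output_bits))


# Assumed toy codebook: one codeword per category.
codebook = {
    "a1": (0, 0, 1, 1, 0),
    "a2": (0, 1, 0, 1, 1),
    "b1": (1, 0, 0, 0, 1),
}
# Bits produced by the binary classifiers for one test example (assumed).
decoded = decode((0, 1, 0, 1, 0), codebook)
print(decoded)
```

The predicted bit string matches no codeword exactly, but it is one bit away from the codeword of `a2`, so a single binary classifier error is corrected.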
V-B Top-Down (TD) Hierarchical Methods
For all TD hierarchical baselines, we train a binary one-vs-rest classifier for each node (except the root) in the hierarchy, and predictions are made by starting from the root node and recursively selecting the best-scoring child node until a leaf node is reached (see eq. (4)). Depending upon the hierarchy that we use during the training and prediction process, we compare with the following baselines.
Top-Down Logistic Regression (TD-LR)
The original hierarchy provided by the domain experts is used for classifier training and label prediction.
Level flattening: a modified hierarchy obtained by flattening one or more levels is used instead of the original hierarchy. Depending on the level(s) flattened, we have the Top Level Flattening (TLF), Bottom Level Flattening (BLF) and Multiple Level Flattening (MLF) hierarchies, as discussed in Section II-A.
Maximum-margin based Taxonomy Adaptation (MTA)
The original hierarchy is modified using the margin value computed at each node in the hierarchy, as described in Babbar et al.
VI Results and Discussion
VI-A Comparison to Top-down (TD) Hierarchical Baselines
Performance based on Flat Metrics: Table III presents the $\mu F_1$ and $MF_1$ performance comparison of our proposed hierarchical modification approaches with TD-LR (which involves no hierarchy modification) and comparative TD hierarchy modification approaches as baselines. We see that our proposed approach, Global-INF, consistently outperforms all other approaches on the different datasets across all metrics. For the image datasets we see a relative performance improvement of up to 7% in $MF_1$ on comparing Global-INF with the best TD modification baseline, MTA. To validate the performance improvement, we conducted pairwise statistical significance tests between our best approach, Global-INF, and the best TD baseline for all datasets except DMOZ-2010 and DMOZ-2012, where the true test labels (and class-wise performance) are not available from the online evaluation. Specifically, we compute the sign test for $\mu F_1$ and the non-parametric Wilcoxon rank test for $MF_1$ scores (it should be noted that the significance tests are between two approaches for a single run). In Figure 3 we present the pairwise statistical comparisons for the different approaches studied here on the DMOZ-SMALL dataset. Global-INF consistently outperforms the other baseline approaches studied here.
Overall, the results of the statistical tests show that the Global-INF approach significantly outperforms the best baseline (and hence the other baselines) for all the datasets (see Table III).
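The sign test mentioned above can be illustrated with a short, self-contained computation. This sketch assumes the usual two-sided form: count the test examples on which exactly one of the two systems is correct, then compare the split of those "wins" against a fair coin with an exact binomial tail. The win counts below are made up for illustration and are not results from the paper.

```python
from math import comb


def sign_test_p(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test p-value; ties are assumed to be dropped
    before counting wins."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Assumed counts: system A is right where B is wrong on 9 examples,
# and B is right where A is wrong on 1 example.
p_value = sign_test_p(9, 1)
print(p_value)
```

With these hypothetical counts the p-value is 22/1024, about 0.021, which would be significant at the 0.05 level but not at the 0.01 level.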
On comparing our two proposed approaches, Global-INF performs better than Level-INF. This is because the Level-INF approach forces some of the nodes to be flattened at each level even though their $f^*$ values may be much lower than those of nodes at other levels in the hierarchy, and vice versa. In contrast, the Global-INF approach takes the $f^*$ values of all nodes into consideration while making a decision and hence determines a better set of inconsistent nodes. The MTA approach has poor performance due to issues similar to those of the Level-INF approach. The performance of the level flattening approaches, viz., TLF, BLF and MLF, suffers because these methods remove entire level(s) of the hierarchy and do not take into consideration whether any node in that level is important for HC. The TD-LR approach has the worst performance because the inconsistent nodes present in the original hierarchy negatively impact the generalization capabilities of the learned models at the higher levels (see Section VI-C), which results in error propagation.
Performance based on Hierarchical Metrics: The hierarchical evaluation metrics (described in Section IV-B), which include TE, compute errors for misclassified examples based on the structure of a given hierarchy. Accordingly, Table IV presents the hierarchical scores, including TE, for all TD approaches evaluated over both the original hierarchy and the modified hierarchy (obtained by flattening). Our proposed approach, Global-INF, outperforms the other approaches because it is able to identify a better set of inconsistent nodes. Comparing classification performance over the original and the modified hierarchy, we see that for most approaches classification on the modified hierarchy shows improved performance. This is because flattening a hierarchy shortens the hierarchical path between the true and predicted classes of misclassified examples, which contributes to the performance improvement.
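The effect of flattening on path-based errors can be seen in a small sketch. It assumes, for illustration only, that the error of a misclassified example is the number of edges between its true and predicted class in the tree; the precise metric definitions are those of Section IV-B. The toy hierarchies and helper names are hypothetical.

```python
def path_to_root(node, parent):
    """List of nodes from `node` up to the root (inclusive)."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path


def tree_distance(u, v, parent):
    """Number of edges on the path between u and v in the tree."""
    steps_from_u = {n: i for i, n in enumerate(path_to_root(u, parent))}
    for j, n in enumerate(path_to_root(v, parent)):
        if n in steps_from_u:  # first common ancestor
            return steps_from_u[n] + j
    raise ValueError("nodes are not in the same tree")


# Toy hierarchy as a child -> parent map (root has parent None).
original = {"root": None, "A": "root", "B": "root",
            "A1": "A", "A2": "A", "a": "A1", "b": "A2", "c": "B"}
# Same hierarchy after flattening A1 and A2: leaves a and b hang off A.
flattened = {"root": None, "A": "root", "B": "root",
             "a": "A", "b": "A", "c": "B"}

print(tree_distance("a", "b", original))   # 4
print(tree_distance("a", "b", flattened))  # 2
```

The same confusion between leaves a and b costs four edges in the original tree but only two after flattening, which is why scores computed on a modified hierarchy tend to look better.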
VI-B Empirical Study for Threshold Selection
Figure 4 shows the performance comparison of the flat LR approach against our best TD approach, Global-INF, with the threshold varied over an interval in steps of 0.1 (performance deteriorates beyond the upper end of this interval) for the CLEF and DMOZ-SMALL datasets. We chose these datasets for evaluation because they have different data characteristics. The CLEF dataset is well balanced and does not suffer from the rare categories issue, whereas the DMOZ-SMALL dataset is highly imbalanced and the majority of its classes are rare categories (i.e., having 10 or fewer examples), as shown in Figure 2(a). To identify the set of inconsistent nodes in the hierarchy, we compare the computed value of each internal node with the chosen threshold and mark the node as inconsistent iff its value exceeds the threshold. For the CLEF dataset, performance improves as the threshold decreases, which suggests that the threshold should be kept small: removing more internal nodes from the hierarchy (enforcing a flatter structure) works better for this dataset. However, for the DMOZ-SMALL dataset, performance first increases and then decreases, with the maximum achieved at an intermediate threshold. This behavior suggests that for imbalanced data distributions with a potentially large number of rare categories, we should generally keep the threshold higher, which helps to leverage the hierarchical information while still reducing error propagation by removing inconsistent nodes. The best threshold for a specific dataset can be chosen empirically using a small validation set, as done in this study.
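The thresholding step can be sketched as follows. This is a minimal illustration, not the released implementation: the per-node `score` values are hypothetical stand-ins for the computed inconsistency value of each internal node, and a node is treated as inconsistent when its score exceeds the threshold. Flattening a node removes it and attaches its children to its parent.

```python
def flatten_inconsistent(children, root, score, threshold):
    """Return a new parent -> children map with inconsistent internal nodes removed."""
    def expand(node):
        kept = []
        for child in children.get(node, []):
            is_internal = bool(children.get(child))
            if is_internal and score[child] > threshold:
                # Inconsistent: drop the node and promote its (expanded) children.
                kept.extend(expand(child))
            else:
                kept.append(child)
        return kept

    new_children = {}
    stack = [root]  # the root itself is never removed
    while stack:
        node = stack.pop()
        kids = expand(node)
        if kids:
            new_children[node] = kids
            stack.extend(kids)
    return new_children


# Toy hierarchy and hypothetical scores for its internal nodes.
children = {"root": ["A", "B"], "A": ["A1", "A2"],
            "A1": ["a", "b"], "A2": ["c"], "B": ["d", "e"]}
score = {"A": 0.2, "A1": 0.9, "A2": 0.4, "B": 0.7}

print(flatten_inconsistent(children, "root", score, 0.5))
# {'root': ['A', 'd', 'e'], 'A': ['a', 'b', 'A2'], 'A2': ['c']}
```

Lowering the threshold flags more nodes; with a threshold of 0.1 every internal node in this toy example is removed and the result is a flat structure under the root.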
To further understand the behavior of the hierarchy modified by the Global-INF approach, we analyzed the datasets in terms of level-wise fan-out (number of children) in the hierarchy, before and after removing inconsistent nodes. Figure 5 shows that for both datasets most of the flattening takes place at the higher levels of the hierarchy, which results in an increased fan-out value. The reason for inconsistencies at higher levels is the presence of many dissimilar classes beneath each node, which makes it comparatively difficult to learn generalized classifiers and results in higher values (i.e., inconsistent nodes marked for removal).
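A level-wise fan-out statistic of this kind can be computed with a breadth-first pass over the hierarchy. The sketch below assumes fan-out is reported as the average number of children per internal node at each level; the exact aggregation behind Figure 5 is not specified here, so this choice and the toy trees are illustrative.

```python
from collections import defaultdict


def levelwise_fanout(children, root):
    """Average number of children of internal nodes at each level (root = level 0)."""
    totals = defaultdict(lambda: [0, 0])  # level -> [sum of children, internal nodes]
    frontier, level = [root], 0
    while frontier:
        next_frontier = []
        for node in frontier:
            kids = children.get(node, [])
            if kids:
                totals[level][0] += len(kids)
                totals[level][1] += 1
                next_frontier.extend(kids)
        frontier, level = next_frontier, level + 1
    return {lvl: s / n for lvl, (s, n) in totals.items()}


before = {"root": ["A", "B"], "A": ["a", "b"], "B": ["c", "d"]}
after = {"root": ["A", "c", "d"], "A": ["a", "b"]}  # node B flattened

print(levelwise_fanout(before, "root"))  # {0: 2.0, 1: 2.0}
print(levelwise_fanout(after, "root"))   # {0: 3.0, 1: 2.0}
```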
Table V: Level-wise error analysis for TD-LR and Global-INF. Entries show the mean (standard deviation) of the error rate across five runs; #ME denotes the average number of misclassified examples up to that level.

| Dataset | Level | TD-LR error rate | TD-LR #ME | Global-INF error rate | Global-INF #ME |
|---|---|---|---|---|---|
| CLEF | L-1 | 21.27 (0.63) | 214 | 20.18 (0.26) | 203 |
| | L-2 | 07.71 (0.42) | 240 | 07.34 (0.29) | 224 |
| | L-3 | 11.30 (0.16) | 274 | 05.66 (0.13) | 227 |
| DMOZ-SMALL | L-1 | 42.47 (0.32) | 789 | 39.83 (0.17) | 740 |
| | L-2 | 14.45 (0.62) | 921 | 12.91 (0.21) | 855 |
| | L-3 | 15.14 (0.34) | 972 | 17.99 (0.14) | 968 |
| | L-4 | 12.32 (0.02) | 1001 | 07.57 (0.10) | 991 |
| | L-5 | 15.66 (0.05) | 1020 | 33.33 (0.04) | 992 |
VI-C Level-wise Misclassification Error
Table V shows the level-wise error analysis obtained for TD-LR and our best approach, Global-INF, on the CLEF and DMOZ-SMALL datasets. At the higher levels, the Global-INF approach misclassifies fewer examples (shown in the #ME column), which results in less error propagation down the levels and hence better overall performance. This experiment supports our hypothesis that the Global-INF approach identifies a better set of inconsistent nodes, which helps to minimize error propagation. Results for the other TD baselines are omitted for brevity.
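The cumulative count of misclassified examples per level can be sketched as below. The sketch assumes each example is described by its true and predicted root-to-leaf paths of equal length, and that an example counts as misclassified up to a level if its predicted path deviates at that level or any level above it; the function name and toy paths are illustrative, and the per-level error rate is not reproduced here.

```python
def cumulative_misclassified(true_paths, pred_paths):
    """For each level, count examples with an error at that level or above."""
    depth = max(len(p) for p in true_paths)
    counts = []
    for level in range(1, depth + 1):
        # A prefix mismatch means the top-down decision went wrong by this level.
        wrong = sum(1 for t, p in zip(true_paths, pred_paths)
                    if t[:level] != p[:level])
        counts.append(wrong)
    return counts


true_paths = [["A", "A1"], ["A", "A2"], ["B", "B1"], ["B", "B1"]]
pred_paths = [["A", "A1"], ["B", "B1"], ["B", "B2"], ["B", "B1"]]

print(cumulative_misclassified(true_paths, pred_paths))  # [1, 2]
```

Because an error at a higher level can never be undone lower down, the counts are non-decreasing with depth, which is the error propagation effect discussed above.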
Table VI (columns: # Train, Best Proposed, Flat Baselines): The CLEF, DIATOMS and IPC datasets do not have any categories with 5 or fewer examples (10 or fewer for IPC). Significance-test results are marked at two levels: a p-value less than 0.05 and a p-value less than 0.01. The Wilcoxon rank test is used for statistical evaluation of the scores. Tests are between Global-INF and the best flat approach, LR.
VI-D Comparison to Flat Baselines
Table VI shows the performance comparison, in terms of Macro-F1 and a hierarchical metric, of our best approach, Global-INF, against the flat baselines, namely the LR and ECOC approaches. For easier analysis, we show the results for datasets separated by varying distribution of training size (for evaluating the DMOZ-2010 and DMOZ-2012 datasets we used a separate held-out dataset because the actual labels of the test dataset are not available from the online evaluation). We show the results for Macro-F1 because it gives equal importance to all classes during evaluation and hence better reflects the results for datasets with skewed distributions. For computing the hierarchical metric, we used the original hierarchy for consistent evaluation. As the table shows, the LR approach outperforms the Global-INF approach for the CLEF, DIATOMS and IPC datasets because these datasets are well balanced and have a smaller number of categories. However, for the DMOZ datasets, our approach Global-INF performs better because the hierarchical structure provides useful information for categorizing classes with rare categories. Within the DMOZ datasets, rare categories make up more than 80% of the classes, as shown in Figure 2. The ECOC approach has the worst performance because the codewords used in our experiments are chosen randomly and the merging of categories may require non-linear discriminants instead of the linear classifiers used in this paper.
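The difference between micro- and macro-averaging on skewed data can be made concrete with a short sketch for single-label classification. It averages the per-class F1 over the union of true and predicted classes, which is an assumption of this example; the toy labels are hypothetical.

```python
from collections import Counter


def micro_macro_f1(y_true, y_pred):
    """Micro- and macro-averaged F1 for single-label multi-class predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    classes = set(y_true) | set(y_pred)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return micro, macro


# One frequent class and two rare classes that are always missed.
y_true = ["big"] * 8 + ["rare1", "rare2"]
y_pred = ["big"] * 10

micro, macro = micro_macro_f1(y_true, y_pred)
print(round(micro, 4), round(macro, 4))  # 0.8 0.2963
```

A classifier that ignores the rare classes still reaches a micro-averaged score of 0.8, whereas the macro-averaged score drops to about 0.30 because each rare class contributes as much as the frequent one.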
VI-E Computational Run Time
Although the flat LR approach outperforms the Global-INF approach on some datasets in terms of classification performance, its prediction runtime is significantly higher and can be untenable for large-scale problems [2, 19]. The prediction runtime comparison of the Global-INF and LR approaches is shown in Figure 6. As expected, the Global-INF approach has a comparatively lower prediction runtime (up to 4x improvement). The difference is significant for the large-scale datasets (DMOZ-2010 and DMOZ-2012). For completeness, we also report the total training runtime in Table VII. The Global-INF approach has a higher training runtime due to the overhead of re-training classifiers after hierarchy modification; it also involves training one-vs-rest binary classifiers for internal nodes in addition to leaf categories. Nevertheless, both flat and TD approaches are trivially parallelizable due to decoupling (i.e., no interactions) between the classifiers learned at different nodes in the hierarchy. For reporting training runtime, we trained classifiers in parallel across multiple compute nodes in the cluster and summed the time taken at each node. In our experiments, we chose the expensive one-vs-rest binary classifiers over the comparatively cheaper one-vs-sibling binary classifiers because our preliminary experiments showed better results with the one-vs-rest approach. It should also be noted that there is no significant difference between the prediction and training runtimes of the different TD approaches, and hence we do not report them here.
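The prediction-time advantage of top-down classification comes from evaluating only the classifiers along one root-to-leaf path. The following sketch counts classifier evaluations under the assumption of greedy top-down prediction that picks the highest-scoring child at each node; `toy_score` and the toy hierarchy are hypothetical stand-ins for trained per-node classifiers.

```python
def top_down_predict(x, children, root, score_fn):
    """Greedy top-down prediction; returns (leaf, number of classifier evaluations)."""
    node, evaluations = root, 0
    while children.get(node):
        kids = children[node]
        evaluations += len(kids)  # one classifier evaluation per child
        node = max(kids, key=lambda c: score_fn(c, x))
    return node, evaluations


# Toy 3-level hierarchy with fan-out 4: node names are digit strings,
# so the 64 leaves are "000", "001", ..., "333".
children = {"root": list("0123")}
for a in "0123":
    children[a] = [a + b for b in "0123"]
    for b in "0123":
        children[a + b] = [a + b + c for c in "0123"]


def toy_score(node, x):
    # Hypothetical classifier: fires when the node lies on the path to x.
    return 1.0 if x.startswith(node) else 0.0


leaf, td_evals = top_down_predict("123", children, "root", toy_score)
num_leaves = sum(1 for kids in children.values() for k in kids if k not in children)
print(leaf, td_evals, num_leaves)  # 123 12 64
```

In this toy setting the top-down pass evaluates 12 classifiers, whereas a flat approach would score all 64 leaf classes for every test example. Flattening nodes increases fan-out, so it trades some of this saving for fewer error-prone decisions.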
VII Conclusion and Future Work
In this paper, we proposed two different approaches for hierarchy modification that restructure the hierarchy by flattening the most prominent set of inconsistent nodes, thereby yielding a hierarchy representation that is better suited for HC. Performance evaluation on a wide range of datasets over the proposed modified hierarchy shows improved classification results because fewer examples are misclassified at higher levels, resulting in less error propagation. Comparison of our proposed approach with competing hierarchy modification approaches from the literature showed a significant performance improvement, supporting the hypothesis that our approach identifies a better set of inconsistent nodes. We also performed experiments comparing our approach with the flat approach under varying distributions of training examples per category. The results demonstrated the usefulness of leveraging hierarchical information for classifying classes with fewer training examples.
In the future, we plan to study the effect of our modified hierarchy on various state-of-the-art HC approaches.
This work was supported by NSF grant #202882 to HR.
[1] C. N. Silla Jr and A. A. Freitas, “A survey of hierarchical classification across different application domains,” Data Mining and Knowledge Discovery, vol. 22, no. 1-2, pp. 31–72, 2011.
[2] S. Gopal and Y. Yang, “Recursive regularization for large-scale classification with hierarchical and graphical dependencies,” in ACM SIGKDD, 2013, pp. 257–265.
[3] L. Cai and T. Hofmann, “Hierarchical document categorization with support vector machines,” in CIKM, 2004, pp. 78–87.
[4] D. Koller and M. Sahami, “Hierarchically classifying documents using very few words,” in ICML, 1997, pp. 170–178.
[5] A. McCallum, R. Rosenfeld, T. Mitchell, and A. Ng, “Improving text classification by shrinkage in a hierarchy of classes,” in ICML, 1998, pp. 359–367.
[6] S. Dumais and H. Chen, “Hierarchical classification of web content,” in ACM SIGIR, 2000, pp. 256–263.
[7] A. Sun and E.-P. Lim, “Hierarchical text classification and evaluation,” in ICDM, 2001, pp. 521–528.
[8] G. Xue, D. Xing, Q. Yang, and Y. Yu, “Deep classification in large-scale text hierarchies,” in ACM SIGIR, 2008, pp. 619–626.
[9] L. Xiao, D. Zhou, and M. Wu, “Hierarchical classification via orthogonal transfer,” in ICML, 2011, pp. 801–808.
[10] A. Zimek, F. Buchwald, E. Frank, and S. Kramer, “A study of hierarchical and flat classification of proteins,” IEEE/ACM TCBB, vol. 7, no. 3, pp. 563–571, 2010.
[11] R. Babbar, I. Partalas, E. Gaussier, and M. Amini, “On flat versus hierarchical classification in large-scale taxonomies,” in NIPS, 2013, pp. 1824–1832.
[12] T. Liu, Y. Yang, H. Wan, Q. Zhou, B. Gao, H. Zeng, Z. Chen, and W. Ma, “An experimental study on large-scale web categorization,” in WWW special interest tracks and posters, 2005, pp. 1106–1107.
[13] X.-L. Wang and B.-L. Lu, “Flatten hierarchies for large-scale hierarchical text categorization,” in ICDIM, 2010, pp. 139–144.
[14] R. Babbar, I. Partalas, E. Gaussier, and M. Amini, “Maximum-margin framework for training data synchronization in large-scale hierarchical classification,” in Neural Information Processing, 2013, pp. 336–343.
[15] C. Vens, J. Struyf, L. Schietgat, S. Džeroski, and H. Blockeel, “Decision trees for hierarchical multi-label classification,” Machine Learning, vol. 73, no. 2, pp. 185–214, 2008.
[16] O. Dekel, “Distribution-calibrated hierarchical classification,” in NIPS, 2009, pp. 450–458.
[17] L. Tang, J. Zhang, and H. Liu, “Acclimatizing taxonomic semantics for hierarchical content classification,” in ACM SIGKDD, 2006, pp. 384–393.
[18] O. Dekel, J. Keshet, and Y. Singer, “Large margin hierarchical classification,” in ICML, 2004, p. 27.
[19] T. Liu, Y. Yang, H. Wan, H. Zeng, Z. Chen, and W. Ma, “Support vector machines classification with a very large-scale taxonomy,” ACM SIGKDD Exp. Newsletter, vol. 7, no. 1, pp. 36–43, 2005.
[20] C. Hadley and D. Jones, “A systematic comparison of protein structure classifications: Scop, cath and fssp,” Structure, vol. 7, no. 9, pp. 1099–1112, 1999.
[21] T. Gao and D. Koller, “Discriminative learning of relaxed hierarchy for large-scale visual recognition,” in ICCV, 2011, pp. 2072–2079.
[22] A. Naik and H. Rangwala, “Filter based taxonomy modification for improving hierarchical classification,” http://arxiv.org/abs/1603.00772, 2016.
[23] R. Eisner, B. Poulin, D. Szafron, P. Lu, and R. Greiner, “Improving protein function prediction using the hierarchical structure of the gene ontology,” in Proceedings of the IEEE Symposium on Computational Intelligence in Bioinformatics and Comp. Bio., 2005, pp. 1–10.
[24] A. Naik, A. Charuvaka, and H. Rangwala, “Classifying documents within multiple hierarchical datasets using multi-task learning,” in ICTAI, 2013, pp. 390–397.
[25] I. Dimitrovski, D. Kocev, S. Loskovska, and S. Džeroski, “Hierarchical annotation of medical images,” Pattern Recognition, vol. 44, no. 10, pp. 2436–2449, 2011.
[26] I. Dimitrovski, D. Kocev, S. Loskovska, and S. Džeroski, “Hierarchical classification of diatom images using predictive clustering trees,” Ecological Informatics, vol. 7, pp. 19–29, 2012.
[27] Y. Yang, “An evaluation of statistical approaches to text categorization,” Information Retrieval, vol. 1, no. 1-2, pp. 69–90, 1999.
[28] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin, “LIBLINEAR: A library for large linear classification,” JMLR, vol. 9, pp. 1871–1874, 2008.
[29] R. Ghani, “Using error-correcting codes for text classification,” in ICML, 2000, pp. 303–310.
[30] Y. Yang and X. Liu, “A re-examination of text categorization methods,” in ACM SIGIR, 1999, pp. 42–49.