Embedding Feature Selection for Large-scale Hierarchical Classification

Azad Naik and Huzefa Rangwala
Department of Computer Science
George Mason University
Fairfax, VA, United States
Email: anaik3@gmu.edu, rangwala@cs.gmu.edu

Large-scale Hierarchical Classification (HC) involves datasets consisting of thousands of classes and millions of training instances with high-dimensional features, posing several big data challenges. Feature selection, which aims to select a subset of discriminant features, is an effective strategy for dealing with the large-scale HC problem. It speeds up the training process, reduces the prediction time and minimizes the memory requirements by compressing the total size of the learned model weight vectors. The majority of studies have also shown feature selection to be competent and successful in improving classification accuracy by removing irrelevant features. In this work, we investigate various filter-based feature selection methods for dimensionality reduction to solve the large-scale HC problem. Our experimental evaluation on text and image datasets with varying distributions of features, classes and instances shows up to 3x speed-up on massive datasets and up to 45% lower memory requirements for storing the weight vectors of the learned model, without any significant loss (and with improvement for some datasets) in classification accuracy. Source Code: https://cs.gmu.edu/mlbio/featureselection.

Feature Selection, Top-down Hierarchical Classification, Logistic Regression, Scalability

I Introduction

Hierarchies (taxonomies) are popular for organizing large volumes of data in various application domains [1, 2]. Several large-scale online prediction challenges such as LSHTC (http://lshtc.iit.demokritos.gr/; webpage classification), BioASQ (http://bioasq.org/; PubMed document classification) and ILSVRC (http://www.image-net.org/challenges/LSVRC/2016/; image classification) revolve around the HC problem. Although a substantial amount of data with inter-class dependency information is beneficial for improving HC, scalability is one of the major challenges in dealing with datasets that comprise a large number of categories (classes), high-dimensional features and a large number of training instances (examples).

Many large-scale HC approaches have been developed in the past to deal with the various “big data” challenges by: (i) training models faster, (ii) predicting class labels quickly and (iii) minimizing memory usage. For example, Gopal et al. [3] proposed a log-concavity bound that allows parallel training of model weight vectors across multiple computing units. This achieves significant speed-up along with the added flexibility of storing model weight vectors at different units. However, the memory requirements are still large (26 GB for the DMOZ-2010 dataset, refer to Table III), which requires complex distributed hardware for storage and implementation. Alternatively, Map-Reduce based formulations of the learning model have been introduced [4, 5]; these are scalable but have software/hardware dependencies that limit their applicability.

To minimize the memory requirements, one popular strategy is to incorporate feature selection in conjunction with model training [6, 7]. The main intuition behind these approaches is to squeeze the high-dimensional features into lower dimensions. This allows the model to be trained on low-dimensional features only, significantly reducing the memory usage while retaining (or improving) the classification accuracy. This is possible because only a subset of features is beneficial for discriminating between classes at each node in the hierarchy. For example, to distinguish between the sub-classes ‘Chemistry’ and ‘Physics’ that belong to the class ‘Science’, features like chemical, reactions and acceleration are important whereas features like coach, memory and processor are irrelevant. HC methods that leverage the structural relationships show improved classification performance but are computationally expensive [8, 5, 9].

Fig. 1: Figure demonstrating the importance of feature selection for HC. Green color (sticky note) represents the top five best features selected using gini-index feature selection method at each internal node. Internal nodes are represented by orange color (elliptical shape) and leaf nodes are represented by blue color (rectangular shape).

In this paper, we study different filter-based feature selection methods for solving the large-scale HC problem. Feature selection serves as a preprocessing step in our learning framework prior to training models. Any method developed for solving the HC problem can be combined with the selected features, which provides flexibility in the choice of HC algorithm along with computational efficiency and storage benefits. Our proposed “adaptive feature selection” also shows an improvement of 2% in classification accuracy. Experiments on various real-world datasets across different domains demonstrate the utility of feature selection over the full set of high-dimensional features. We also investigate the effect of feature selection on classification performance when the number of labeled instances per class is low.

II Literature Review

II-A Hierarchical Classification

Several methods have been developed to address the hierarchical classification problem [4, 5, 10, 11, 12]. These methods can be broadly divided into three major categories. (i) The flat approach is one of the simplest and most straightforward methods for solving the HC problem. In this method, the hierarchical structure is completely ignored and an independent one-vs-rest or multi-class classifier is trained for each leaf category to discriminate it from the remaining leaf categories. To predict the label of an instance, the flat method invokes the classifiers corresponding to all leaf categories and selects the leaf category with the highest prediction score. As such, the flat approach has expensive training and prediction runtimes for datasets with a large number of classes. (ii) Local classification uses local hierarchical relationships during the model training process. Depending on how the hierarchical relationships are leveraged, various local methods exist [13]. In this paper, we use the most popular “local classifier per parent node” method as the baseline for evaluations. Specifically, we train a multi-class classifier at each parent node to maximize the discrimination between its children nodes. For making predictions, a top-down method (discussed in Section II-B) is followed. (iii) Global classification learns a single complex model where all relevant hierarchical relationships are explored jointly during the optimization process, making these approaches expensive to train. Label prediction is done using an approach similar to that followed for the flat or local methods.

II-B Top-Down Hierarchical Classification

One of the most efficient approaches for solving the large-scale HC problem is the top-down method [8, 14]. In this method, a local or global classification method is used for model training and the unlabeled instances are recursively classified in a top-down fashion. At each step, the best child of the current node is picked based on the computed prediction scores of its children nodes. The process repeats until a leaf node representing a certain category (or class label) is reached, which is the final predicted label (refer to eq. (3)).

Top-Down (TD) methods are popular for large-scale problems owing to their computational benefits: only the subset of classes on the relevant path is considered during the prediction phase. For example, to make the second-level prediction given that the first-level prediction is ‘Sci’ (shown in Figure 1), we only need to consider the children of the ‘Sci’ class (i.e., electronics, med, crypt and space), thereby avoiding the large number of second-level classes such as ‘Sys’, ‘Politics’, ‘Sports’, ‘graphics’ and ‘autos’. In the past, top-down methods have been successfully used to solve HC problems [11, 15, 16]. Liu et al. [8] performed classification on a large-scale Yahoo! dataset and analyzed the complexity of the top-down approach. In [17], a selective classifier top-down method is proposed where the classifier to train at a particular node is chosen in a data-driven manner.

II-C Feature Selection

There have been several studies focused on feature selection methods for the flat classification problem [18, 19, 20, 21, 22, 23]. However, very few works focus on feature selection for the HC problem, and these are limited to a small number of categories [24, 25]. Figure 1 demonstrates the importance of feature selection in hierarchical settings, where only the relevant features are chosen at each of the decision (internal) nodes. More details about the figure are discussed in Section V (Case Study).

Feature selection aims to find a subset of highly discriminant features that minimizes the error rate and improves classifier performance. Based on the approach adopted for selecting features, two broad categories of feature selection exist, namely wrapper and filter-based methods. Wrapper approaches evaluate the fitness of the selected features using the intended classifier. Although many different wrapper-based approaches have been proposed, these methods are not suitable for large-scale problems due to the expensive evaluation needed to select the subset of features [18]. In contrast, filter approaches select the subset of features based on certain measures or statistical properties that do not require expensive evaluations. This makes filter-based approaches a natural choice for large-scale problems. Hence, in this paper we focus on various filter-based approaches for solving the HC problem (discussed in Section III-C). In the literature, a third category referred to as embedded approaches has also been proposed; these are a hybrid of the wrapper and filter methods. However, these approaches have not been shown to be efficient for large-scale classification [18] and hence we do not focus on hybrid methods.

To the best of our knowledge, this is the first paper that performs a broad study of filter-based feature selection methods for the HC problem.

III Methods

III-A Definitions and Notations

In this paper, we use bold lower-case and upper-case letters to indicate vector and matrix variables, respectively. Symbol $\mathcal{N}$ denotes the set of internal nodes in the hierarchy, where for each node $n \in \mathcal{N}$ we learn the multi-class classifier denoted by $\mathbf{W}_n = [\mathbf{w}_c^n]_{c \in \mathcal{C}(n)}$ to discriminate between its children nodes $\mathcal{C}(n)$. $\mathbf{w}_c^n$ represents the optimal model weight vector for the $c$-th child of node $n$. $\mathcal{L}$ denotes the set of leaf nodes (categories) to which instances are assigned. The total number of training instances is denoted by $N$, and $N_n$ denotes the total number of training instances considered at node $n$, which corresponds to all instances of the descendant categories of node $n$. $\mathcal{F}$ denotes the set of all features (dimensionality) for each instance, where the $i$-th feature is denoted by $f_i$. $\mathcal{S} \subseteq \mathcal{F}$ denotes the subset of relevant features selected using a feature selection algorithm. $\mathcal{D} = \{(\mathbf{x}(i), y(i))\}_{i=1}^{N}$ denotes the training dataset, where $\mathbf{x}(i) \in \mathbb{R}^{|\mathcal{F}|}$ and $y(i) \in \mathcal{L}$. For training the optimal model corresponding to the $c$-th child at node $n$, we use the binary label $y_c^n(i) \in \{\pm 1\}$ for the $i$-th training instance, where $y_c^n(i) = 1$ iff $y(i) = c$ and $y_c^n(i) = -1$ otherwise. The predicted label for the $i$-th test instance $\hat{\mathbf{x}}(i)$ is denoted by $\hat{y}(i)$.

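III-B Hierarchical Classification

Given a hierarchy $\mathcal{H}$, we train multi-class classifiers for each of the internal nodes $n \in \mathcal{N}$ in the hierarchy to discriminate between its children nodes $\mathcal{C}(n)$. In this paper, we use Logistic Regression (LR) as the underlying base model for training [5, 26]. The LR objective uses the logistic loss to minimize the empirical risk and an $\ell_1$-norm term (denoted by $\|\cdot\|_1$) or a squared $\ell_2$-norm term (denoted by $\|\cdot\|_2^2$) to control model complexity and prevent overfitting. Usually, the $\ell_1$-norm encourages a sparse solution by randomly choosing a single parameter amongst highly correlated parameters, whereas the $\ell_2$-norm jointly shrinks the correlated parameters. The objective function for training the model corresponding to the $c$-th child of node $n$ is provided in eq. (1).

$$\min_{\mathbf{w}_c^n} \; C \sum_{i=1}^{N_n} \log\left(1 + \exp\left(-y_c^n(i)\, (\mathbf{w}_c^n)^{T} \mathbf{x}(i)\right)\right) + \mathcal{R}(\mathbf{w}_c^n) \qquad (1)$$

where $\mathbf{x}(i)$ and $y_c^n(i) \in \{\pm 1\}$ are the feature vector and binary label of the $i$-th of the $N_n$ training instances at node $n$, $C > 0$ is a mis-classification penalty parameter and $\mathcal{R}(\cdot)$ denotes the regularization term given by eq. (2).

$$\mathcal{R}(\mathbf{w}_c^n) = \begin{cases} \|\mathbf{w}_c^n\|_1 & \ell_1\text{-norm} \\ \|\mathbf{w}_c^n\|_2^2 & \ell_2\text{-norm} \end{cases} \qquad (2)$$

For each child $c$ of node $n$ within the hierarchy, we solve eq. (1) to obtain the optimal weight vector denoted by $\mathbf{w}_c^n$. The complete set of parameters for all the children nodes, $\mathbf{W}_n = [\mathbf{w}_c^n]_{c \in \mathcal{C}(n)}$, constitutes the learned multi-class classifier at node $n$, whereas the total parameters for all internal nodes, $\mathbf{W} = [\mathbf{W}_n]_{n \in \mathcal{N}}$, constitute the learned model for the Top-Down (TD) classifier.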
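As a rough illustration of the objective described above, the following sketch evaluates the regularized logistic loss for a single child-vs-siblings model. The function name `lr_objective`, its argument names and the use of plain Python lists are assumptions made for this example; it is not the authors' implementation, and it only evaluates the objective rather than minimizing it.

```python
import math

def lr_objective(w, X, y, C=1.0, norm="l1"):
    """Regularized logistic loss for one binary (child-vs-siblings) model.

    w: weight vector, X: list of feature vectors, y: labels in {+1, -1}.
    All names are illustrative assumptions.
    """
    loss = 0.0
    for x_i, y_i in zip(X, y):
        margin = y_i * sum(w_j * x_j for w_j, x_j in zip(w, x_i))
        # log(1 + exp(-margin)), written so that large |margin| cannot overflow
        loss += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    if norm == "l1":
        reg = sum(abs(w_j) for w_j in w)      # l1-norm of w
    elif norm == "l2":
        reg = sum(w_j * w_j for w_j in w)     # squared l2-norm of w
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return C * loss + reg
```

Larger values of `C` put more weight on fitting the training labels relative to the regularizer, which is why `C` is the quantity tuned on validation data.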

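For a test instance $\hat{\mathbf{x}}$, the TD classifier predicts the class label $\hat{y}$ as shown in eq. (3). Essentially, the algorithm starts at the root node and recursively selects the best child node until it reaches a terminal node belonging to the set of leaf nodes $\mathcal{L}$.

$$\hat{y} = \begin{cases} \text{initialize } p := \text{root} \\ \text{while } p \notin \mathcal{L}: \quad p := \arg\max_{q \in \mathcal{C}(p)} (\mathbf{w}_q^p)^{T} \hat{\mathbf{x}} \\ \text{return } p \end{cases} \qquad (3)$$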
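The recursive descent described above can be sketched in a few lines. The dictionaries `children` and `weights`, the node names and the toy weight values below are hypothetical and only illustrate the control flow; scores are plain dot products, as for a linear model.

```python
def predict_top_down(x, root, children, weights):
    """Walk from the root to a leaf, following the best-scoring child.

    children: dict mapping each internal node to a list of its children.
    weights:  dict mapping each non-root node to its weight vector.
    A node that is absent from `children` is treated as a leaf.
    """
    node = root
    while node in children:
        node = max(
            children[node],
            key=lambda c: sum(w * v for w, v in zip(weights[c], x)),
        )
    return node

# Toy hierarchy (hypothetical values): root -> {sci, rec}, sci -> {med, space}
toy_children = {"root": ["sci", "rec"], "sci": ["med", "space"], "rec": ["autos"]}
toy_weights = {
    "sci": [1.0, 0.0], "rec": [0.0, 1.0],
    "med": [1.0, 1.0], "space": [2.0, -1.0], "autos": [0.0, 0.0],
}
label = predict_top_down([1.0, 0.2], "root", toy_children, toy_weights)
```

Only the classifiers on the chosen path are evaluated, which is the source of the prediction-time savings of the top-down approach.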


III-C Feature Selection

The focus of our study in this paper is on filter-based feature selection methods which are scalable for large-scale datasets. In this section, we present four feature selection approaches that are used for evaluation purposes.

Gini-Index - It is one of the most widely used methods to compute the best split (ordered feature) in the decision tree induction algorithm [27]. Realizing its importance, it was extended to the multi-class classification problem [28]. In our case, it measures the feature’s ability to distinguish between different leaf categories (classes). The Gini-Index of the $i$-th feature $f_i$ with $K$ classes can be computed as shown in eq. (4).

$$\text{Gini-Index}(f_i) = 1 - \sum_{k=1}^{K} \left(p(k \mid f_i)\right)^2 \qquad (4)$$

where $p(k \mid f_i)$ is the conditional probability of class $k$ given feature $f_i$.

The smaller the value of the Gini-Index, the more relevant and useful the feature is for classification. For the HC problem, we compute the Gini-Index of all features independently at each internal node and select the best subset of features ($\mathcal{S}$) using a held-out validation dataset.
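A minimal sketch of Gini-Index scoring is given below. It assumes that the conditional class probability of a feature is estimated as the fraction of instances containing the feature (non-zero value) that belong to each class; the text does not spell out the estimator, so this choice, the function names, and the convention of scoring a never-occurring feature as 1.0 are assumptions.

```python
from collections import Counter

def gini_index(feature_values, labels):
    """1 - sum_k p(k|f)^2, with p(k|f) estimated from instances where f is non-zero."""
    counts = Counter(l for v, l in zip(feature_values, labels) if v != 0)
    total = sum(counts.values())
    if total == 0:
        return 1.0  # assumption: a feature that never occurs is ranked last
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def select_top_k(X, labels, k):
    """Indices of the k features with the smallest Gini-Index."""
    n_features = len(X[0])
    scores = [gini_index([row[j] for row in X], labels) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j])[:k]
```

A feature that occurs in a single class gets a score of 0 (most relevant), while a feature spread evenly over all classes gets the largest score.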

Minimal Redundancy Maximal Relevance (MRMR) - This method incorporates the following two conditions for feature subset selection that are beneficial for classification.

  1. Identify features that are mutually maximally dissimilar to capture better representation of entire dataset and

  2. Select features to maximize the discrimination between different classes.

The first criterion, referred to as “minimal redundancy”, selects features that carry distinct information by eliminating redundant features. The main intuition behind this criterion is that selecting two similar features adds no new information that can assist in better classification. The redundancy of a feature set $\mathcal{S}$ can be computed using eq. (5).

$$W_I(\mathcal{S}) = \frac{1}{|\mathcal{S}|^2} \sum_{f_i, f_j \in \mathcal{S}} I(f_i, f_j) \qquad (5)$$

where $I(f_i, f_j)$ is the mutual information that measures the level of similarity between features $f_i$ and $f_j$ [29].

The second criterion, referred to as “maximum relevance”, requires the selected features to have maximum discriminatory power for classification between different classes. The relevance of a feature set $\mathcal{S}$ can be formulated using eq. (6).

$$V_I(\mathcal{S}) = \frac{1}{|\mathcal{S}|} \sum_{f_i \in \mathcal{S}} I(f_i, \mathcal{L}) \qquad (6)$$

where $I(f_i, \mathcal{L})$ is the mutual information between the feature $f_i$ and the leaf categories $\mathcal{L}$, which captures how well the feature can discriminate between different classes [20].

The combined optimization of eq. (5) and eq. (6) leads to a feature set with maximum discriminatory power and minimum correlation among features. Depending on the strategy adopted for optimizing these two objectives, different flavors exist. The first one, referred to as “mutual information difference (MRMR-D)”, formulates the optimization problem as the difference between the two objectives, as shown in eq. (7). The second one, referred to as “mutual information quotient (MRMR-Q)”, formulates the problem as the ratio between the two objectives and can be computed using eq. (8).

$$\text{MRMR-D}: \quad \max_{\mathcal{S} \subseteq \mathcal{F}} \left(V_I(\mathcal{S}) - W_I(\mathcal{S})\right) \qquad (7)$$

$$\text{MRMR-Q}: \quad \max_{\mathcal{S} \subseteq \mathcal{F}} \left(V_I(\mathcal{S}) / W_I(\mathcal{S})\right) \qquad (8)$$

For the HC problem, we again select the top $k$ features (with $k$ tuned using a validation dataset) for evaluating these methods.
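The sketch below illustrates the two MRMR flavors on discrete features. It uses the common greedy, one-feature-at-a-time approximation rather than an exact search over all subsets, which is an assumption and may differ from the procedure used in the experiments; the function names and the small constant guarding the division are likewise illustrative.

```python
import math
from collections import Counter

def mutual_information(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (pa[u] * pb[v]))
        for (u, v), c in pab.items()
    )

def mrmr(columns, labels, k, variant="D"):
    """Greedy mRMR: repeatedly add the feature with the best relevance/redundancy trade-off.

    columns: list of discrete feature columns.
    variant: "D" uses relevance - redundancy, "Q" uses relevance / redundancy.
    """
    relevance = [mutual_information(col, labels) for col in columns]
    selected = [max(range(len(columns)), key=lambda j: relevance[j])]
    while len(selected) < min(k, len(columns)):
        def score(j):
            redundancy = sum(
                mutual_information(columns[j], columns[s]) for s in selected
            ) / len(selected)
            if variant == "D":
                return relevance[j] - redundancy
            return relevance[j] / (redundancy + 1e-12)  # guard against division by zero
        rest = [j for j in range(len(columns)) if j not in selected]
        selected.append(max(rest, key=score))
    return selected
```

The nested loop over already-selected features is what makes MRMR noticeably more expensive than methods that score each feature in isolation.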

Kruskal-Wallis - This is a non-parametric statistical test that ranks the importance of each feature. As a first step, this method ranks all instances across all leaf categories and computes the feature importance metric as shown in eq. (9):

$$KW = (N - 1) \frac{\sum_{j=1}^{K} n_j (\bar{r}_j - \bar{r})^2}{\sum_{j=1}^{K} \sum_{i=1}^{n_j} (r_{ij} - \bar{r})^2} \qquad (9)$$

where $n_j$ is the number of instances in the $j$-th category, $r_{ij}$ is the rank of the $i$-th instance in the $j$-th category, $\bar{r}_j$ is the average rank of the instances in the $j$-th category and $\bar{r}$ denotes the average rank across all instances.

It should be noted that using a different feature results in a different ranking and hence a different feature importance. The lower the value of the computed score, the more relevant the feature is for classification.
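For illustration, the standard Kruskal-Wallis statistic of a single feature can be computed as follows. This is a plain-Python sketch that assigns tied values their average rank; the function names are assumptions, and the sketch only computes the statistic, without reproducing how the score is subsequently turned into a feature ranking.

```python
from collections import defaultdict

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def kruskal_wallis(values, labels):
    """Kruskal-Wallis statistic of one feature across class labels."""
    ranks = average_ranks(values)
    n = len(ranks)
    mean_rank = (n + 1) / 2
    groups = defaultdict(list)
    for r, l in zip(ranks, labels):
        groups[l].append(r)
    between = sum(
        len(g) * (sum(g) / len(g) - mean_rank) ** 2 for g in groups.values()
    )
    total = sum((r - mean_rank) ** 2 for r in ranks)
    return (n - 1) * between / total if total > 0 else 0.0
```

Each feature requires its own sort of all instances, which is the ranking overhead associated with this method.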

Data: Hierarchy $\mathcal{H}$, input-output pairs $(\mathbf{x}(i), y(i))$
Result: Learned model weight vectors $\mathbf{W} = [\mathbf{W}_1, \mathbf{W}_2, \ldots, \mathbf{W}_{|\mathcal{N}|}]$
$\mathbf{W} = \emptyset$;
/* 1st subroutine: Feature Selection */
for $f_i \in \mathcal{F}$ do
       Compute the score (relevance) of feature $f_i$ using a feature selection algorithm from Section III-C;
end for
Select the top $k$ features based on the scores (and correlations) amongst features, where the best value of $k$ is tuned using a validation dataset;
/* 2nd subroutine: Model Learning using Reduced Feature Set */
for $n \in \mathcal{N}$ do
       /* learn models for discriminating the children of node $n$ */
       Train optimal multi-class classifiers $\mathbf{W}_n$ at node $n$ using the reduced feature set as shown in eq. (1);
       /* update model weight vectors */
       $\mathbf{W} = [\mathbf{W}, \mathbf{W}_n]$;
end for
Algorithm 1: Feature Selection (FS) based Model Learning for Hierarchical Classification (HC)
| Dataset | Domain | # Leaf Nodes | # Internal Nodes | Height | # Training | # Testing | # Features | Avg. # children (per internal node) |
|---|---|---|---|---|---|---|---|---|
| NG | Text | 20 | 8 | 4 | 11,269 | 7,505 | 61,188 | 3.38 |
| CLEF | Image | 63 | 34 | 4 | 10,000 | 1,006 | 80 | 2.56 |
| IPC | Text | 451 | 102 | 4 | 46,324 | 28,926 | 1,123,497 | 5.41 |
| DMOZ-SMALL | Text | 1,139 | 1,249 | 6 | 6,323 | 1,858 | 51,033 | 1.91 |
| DMOZ-2010 | Text | 12,294 | 4,928 | 6 | 128,710 | 34,880 | 381,580 | 3.49 |
| DMOZ-2012 | Text | 11,947 | 2,016 | 6 | 383,408 | 103,435 | 348,548 | 6.93 |

TABLE I: Dataset Statistics

III-D Proposed Framework

Algorithm 1 presents our proposed method for embedding feature selection into the HC framework. It consists of two independent main subroutines: (i) a feature selection algorithm (discussed in Section III-C) for deciding the appropriate set of features at each decision (internal) node and (ii) a supervised learning algorithm (discussed in Section III-B) for constructing a TD hierarchical classifier using the reduced feature set. Feature selection serves as a preprocessing step in our framework, which provides flexibility in choosing any HC algorithm.
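The two-subroutine structure described above can be sketched as a small driver function. The callbacks `score_feature` and `train_node` are hypothetical placeholders for a feature scoring method and a per-node classifier trainer; the sketch assumes that a smaller score means a more relevant feature (as for the Gini-Index) and that the number of features `k` has already been chosen.

```python
def fs_based_training(internal_nodes, features, score_feature, k, train_node):
    """Two-stage sketch: rank features first, then train every internal node.

    score_feature(f) -> float, smaller means more relevant (assumption).
    train_node(node, selected) -> model for that node.
    Both callbacks are hypothetical and supplied by the caller.
    """
    # 1st subroutine: feature selection
    ranked = sorted(features, key=score_feature)
    selected = ranked[:k]
    # 2nd subroutine: model learning on the reduced feature set
    models = {}
    for node in internal_nodes:
        models[node] = train_node(node, selected)
    return selected, models
```

Because the trainer only receives the list of selected features, any node-level learner can be plugged in without changing the selection step.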

We propose two different approaches for choosing the number of relevant features at each internal node $n \in \mathcal{N}$. The first approach, which we refer to as “global feature selection (Global FS)”, selects the same number of features for all internal nodes in the hierarchy, where the number of features is determined based on performance on the entire validation dataset. The second approach, referred to as “adaptive feature selection (Adaptive FS)”, selects a different number of features at each internal node to maximize the performance at that node. It should be noted that the adaptive method only uses the validation data that belongs exclusively to the internal node $n$ (i.e., the descendant categories of node $n$). Computationally, both approaches are almost identical because model tuning and optimization, which account for the major fraction of the computation, require similar runtime.
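The difference between the two strategies comes down to where the maximization over candidate feature counts happens. The sketch below assumes validation scores have already been computed for each candidate fraction of features; the function names and the dictionary layout are illustrative assumptions.

```python
def choose_global(overall_score):
    """One feature fraction for the whole hierarchy.

    overall_score: dict mapping a candidate fraction to the validation
    score of the complete top-down classifier.
    """
    return max(overall_score, key=overall_score.get)

def choose_adaptive(node_score):
    """A separate feature fraction for every internal node.

    node_score: dict node -> {fraction: validation score at that node},
    computed only on validation instances that reach that node.
    """
    return {node: max(s, key=s.get) for node, s in node_score.items()}
```

With the adaptive strategy, a data-sparse node can keep a small feature set while a node near the root keeps a larger one.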

IV Experimental Evaluations

IV-A Dataset Description

We have performed an extensive evaluation of various feature selection methods on a wide range of hierarchical text and image datasets. Key characteristics of the datasets used in our experiments are shown in Table I. All these datasets are single-labeled and the instances are assigned to the leaf nodes in the hierarchy. For text datasets, we use the word-frequency representation and apply the tf-idf transformation with $\ell_2$-norm normalization to the word-frequency feature vector.
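As an illustration of the text preprocessing, the sketch below converts raw term counts into unit-length tf-idf vectors. The exact tf-idf variant is not specified in the text, so the weighting used here (raw count times the logarithm of the inverse document frequency, followed by division by the Euclidean norm) and the function name are assumptions.

```python
import math

def tfidf_l2(docs):
    """docs: list of {term: raw count}. Returns unit-length tf-idf vectors."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    out = []
    for doc in docs:
        vec = {t: c * math.log(n_docs / df[t]) for t, c in doc.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        # leave all-zero vectors unchanged to avoid dividing by zero
        out.append({t: v / norm for t, v in vec.items()} if norm > 0 else vec)
    return out
```

Normalizing to unit length keeps long and short documents on a comparable scale before the classifiers are trained.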
Text Datasets
    NEWSGROUP (NG) (http://qwone.com/~jason/20Newsgroups/) - It is a collection of approximately 20,000 news documents partitioned (nearly) evenly across twenty different topics such as ‘baseball’, ‘electronics’ and ‘graphics’ (refer to Figure 1).
    IPC (http://www.wipo.int/classifications/ipc/en/) - Collection of patent documents organized in the International Patent Classification (IPC) hierarchy.
    DMOZ-SMALL, DMOZ-2010 and DMOZ-2012 (http://dmoz.org) - Collections of web documents organized into various classes using a hierarchical structure. The datasets were released as part of the LSHTC (http://lshtc.iit.demokritos.gr/) challenge in the years 2010 and 2012. For evaluating the DMOZ-2010 and DMOZ-2012 datasets we use the provided test split, and the results reported for this benchmark are blind predictions obtained from the web-portal interface (http://lshtc.iit.demokritos.gr/node/81).
Image Datasets
    CLEF [30] - Dataset contains medical images annotated with Information Retrieval in Medical Applications (IRMA) codes. Each image is represented by 80 features that are extracted using the local distribution of edges method.

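IV-B Evaluation Metrics

We use the standard set-based performance measures Micro-$F_1$ and Macro-$F_1$ [31] for evaluating the performance of learned models.
    Micro-$F_1$ ($\mu F_1$) - To compute $\mu F_1$, we sum up the category-specific true positives $(TP_l)$, false positives $(FP_l)$ and false negatives $(FN_l)$ for the different leaf categories and compute the score as follows:

$$P = \frac{\sum_{l \in \mathcal{L}} TP_l}{\sum_{l \in \mathcal{L}} (TP_l + FP_l)}, \qquad R = \frac{\sum_{l \in \mathcal{L}} TP_l}{\sum_{l \in \mathcal{L}} (TP_l + FN_l)}, \qquad \mu F_1 = \frac{2PR}{P + R}$$

where $P$ and $R$ are the overall precision and recall values for all the classes taken together.
    Macro-$F_1$ ($MF_1$) - Unlike $\mu F_1$, $MF_1$ gives equal weight to all the categories so that the average score is not skewed in favor of the larger categories. It is defined as follows:

$$P_l = \frac{TP_l}{TP_l + FP_l}, \qquad R_l = \frac{TP_l}{TP_l + FN_l}, \qquad MF_1 = \frac{1}{|\mathcal{L}|} \sum_{l \in \mathcal{L}} \frac{2 P_l R_l}{P_l + R_l}$$

where $\mathcal{L}$ denotes the set of leaf categories, and $P_l$ and $R_l$ are the precision and recall values for leaf category $l \in \mathcal{L}$.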
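Both measures can be computed from per-category counts of true positives, false positives and false negatives. The sketch below assumes single-label predictions and treats an undefined precision, recall or F1 (zero denominator) as 0; that convention and the function names are assumptions made for the example.

```python
def f1(p, r):
    """Harmonic mean of precision and recall, 0 when both are 0."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def micro_macro_f1(y_true, y_pred, labels):
    """Return (micro-F1, macro-F1) for single-label predictions over `labels`."""
    tp = {l: 0 for l in labels}
    fp = {l: 0 for l in labels}
    fn = {l: 0 for l in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted category gains a false positive
            fn[t] += 1   # true category gains a false negative

    def ratio(a, b):
        return a / b if b > 0 else 0.0

    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = f1(ratio(TP, TP + FP), ratio(TP, TP + FN))
    macro = sum(
        f1(ratio(tp[l], tp[l] + fp[l]), ratio(tp[l], tp[l] + fn[l]))
        for l in labels
    ) / len(labels)
    return micro, macro
```

Because the macro average weights every category equally, it is the measure most affected by rare categories with few training instances.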

Fig. 2: Performance comparison of LR + $\ell_1$-norm models with varying percentage (%) of features selected using different feature selection (global) methods on text and image datasets.
Fig. 3: Performance comparison of LR + $\ell_2$-norm models with varying percentage (%) of features selected using different feature selection (global) methods on text and image datasets.

IV-C Experimental Details

For all the experiments, we divide the training dataset into a train set and a small validation set in the ratio 90:10. The train set is used to train TD classifiers whereas the validation set is used to tune the parameters. The model is trained for a range of mis-classification penalty parameter ($C$) values in the set {0.001, 0.01, 0.1, 1, 10, 100, 1000}, with the best value selected using the validation set. Adopting the best parameter, we retrain the models on the entire training dataset and measure the performance on a separate held-out test dataset. For feature selection, we choose the best set of features using the validation set by varying the number of features between 1% and 75% of all the features. Our preliminary experiments showed no significant improvement beyond 75%, hence we bound the upper limit to this value. We performed all the experiments on the ARGO cluster (http://orc.gmu.edu) with dual Intel Xeon E5-2670 8-core CPUs and 64 GB memory. The source code of the proposed algorithm discussed in this paper is made available at our website (https://cs.gmu.edu/mlbio/featureselection) for repeatability and future use.
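The tuning protocol described above can be sketched as a split followed by a grid search. The callbacks `fit` and `score` are hypothetical stand-ins for model training and validation scoring, and the random shuffling with a fixed seed is an assumption; the text does not state how the 90:10 split was drawn.

```python
import random

C_GRID = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

def split_90_10(instances, seed=0):
    """Shuffle and split into 90% train and 10% validation."""
    items = list(instances)
    random.Random(seed).shuffle(items)
    cut = int(round(0.9 * len(items)))
    return items[:cut], items[cut:]

def tune_penalty(train, valid, fit, score, grid=C_GRID):
    """Return the C whose model scores best on the validation split.

    fit(train, C) -> model and score(model, valid) -> float are
    caller-supplied, hypothetical callbacks.
    """
    return max(grid, key=lambda c: score(fit(train, c), valid))
```

After the best value is found, the model would be retrained on the full training set with that value before being evaluated on the test set.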

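| Dataset | Metric | Adaptive FS | Global FS | All Features |
|---|---|---|---|---|
| NG | $\mu F_1$ | 76.16 | 76.39 | 74.94 |
| | $MF_1$ | 76.10 | 76.07 | 74.56 |
| CLEF | $\mu F_1$ | 72.66 | 72.27 | 72.17 |
| | $MF_1$ | 36.73 | 35.07 | 33.14 |
| IPC | $\mu F_1$ | 48.23 | 46.35 | 46.14 |
| | $MF_1$ | 41.54 | 39.52 | 39.43 |
| DMOZ-SMALL | $\mu F_1$ | 40.32 | 39.52 | 38.86 |
| | $MF_1$ | 26.12 | 25.07 | 24.77 |
| DMOZ-2010 | $\mu F_1$ | 35.94 | 35.40 | 34.32 |
| | $MF_1$ | 23.01 | 21.32 | 21.26 |
| DMOZ-2012 | $\mu F_1$ | 44.12 | 43.94 | 43.92 |
| | $MF_1$ | 23.65 | 22.18 | 22.13 |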
  • (and ) indicates that improvements are statistically significant with 0.05 (and 0.1) significance level.

TABLE II: Performance comparison of the adaptive and global approaches for feature selection based on Gini-Index with all features. The LR + $\ell_1$-norm model is used for evaluation.

V Results Discussion

V-A Case Study

To understand the quality of the features selected at different internal nodes in the hierarchy, we perform a case study on the NG dataset. We choose this dataset because we have full access to the feature information. Figure 1 shows the top five features selected using the best feature selection method, i.e., Gini-Index (refer to Figures 2 and 3). We can see from the figure that the selected features correspond to distinctive attributes which help in better discrimination at a particular node. For example, features like Dod (Day of defeat or Department of defense), Car, Bike and Team are important at node ‘Rec’ to distinguish between the sub-classes ‘autos’, ‘motorcycles’ and ‘Sports’, whereas other features like Windows, God and Encryption are irrelevant. This analysis illustrates the importance of feature selection for the TD HC problem.

One important observation from our study is that some features, like Windows, God and Team, are useful for discrimination at multiple nodes in the hierarchy (associated with parent-child relationships). This observation conflicts with the assumption made in the work by Xiao et al. [6], which attempts to optimize the objective function by requiring the child node features to be different from the features selected at the parent node.

| Dataset | Adaptive FS: # parameters | Adaptive FS: size | Global FS: # parameters | Global FS: size | All Features: # parameters | All Features: size |
|---|---|---|---|---|---|---|
| NG | 982,805 | 4.97 MB | 908,820 | 3.64 MB | 1,652,076 | 6.61 MB |
| CLEF | 4,715 | 18.86 KB | 5,220 | 20.89 KB | 6,960 | 27.84 KB |
| IPC | 306,628,256 | 1.23 GB | 331,200,000 | 1.32 GB | 620,170,344 | 2.48 GB |
| DMOZ-SMALL | 74,582,625 | 0.30 GB | 85,270,801 | 0.34 GB | 121,815,771 | 0.49 GB |
| DMOZ-2010 | 4,035,382,592 | 16.14 GB | 4,271,272,967 | 17.08 GB | 6,571,189,180 | 26.28 GB |
| DMOZ-2012 | 3,453,646,353 | 13.81 GB | 3,649,820,382 | 14.60 GB | 4,866,427,176 | 19.47 GB |
TABLE III: Comparison of memory requirements for the LR + $\ell_1$-norm model

V-B Classification Performance Comparison

Global FS - Figures 2 and 3 show the $\mu F_1$ and $MF_1$ comparison of LR models with $\ell_1$-norm and $\ell_2$-norm regularization, respectively, combined with the various feature selection methods discussed in Section III-C. We can see that all feature selection methods (except Kruskal-Wallis) show competitive performance in comparison to the full set of features for all the datasets. Overall, the Gini-Index feature selection method performs slightly better than the other methods. MRMR methods have a tendency to remove some important features as redundant based on the minimization objective obtained from data-sparse leaf categories, which may not be optimal and negatively influences the performance. The Kruskal-Wallis method shows poor performance because of the statistical properties that are obtained from data-sparse nodes [32].

On comparing the $\ell_1$-norm and $\ell_2$-norm regularized models of the best feature selection method (Gini-Index) with all features, we can see that $\ell_1$-norm models show more performance improvement (especially for $MF_1$ scores) for all datasets, whereas for $\ell_2$-norm models performance is almost similar without any significant loss. This is because the $\ell_1$-norm assigns higher weight to the important predictor variables, which results in more performance gain.

Since feature selection based on Gini-Index gives the best performance, in the rest of the experiments we use Gini-Index as the baseline for comparison purposes. Also, we consider the $\ell_1$-norm model only due to space constraints.

Adaptive FS - Table II compares the performance of LR + $\ell_1$-norm models using the adaptive and global approaches for feature selection against all features. We can see from the table that adaptive feature selection gives the best performance for all the datasets (except the $\mu F_1$ score of the NG dataset, which has very few categories). To evaluate the performance improvement of the models we perform statistical significance tests. Specifically, we perform the sign test for $\mu F_1$ [33] and the non-parametric Wilcoxon rank test for $MF_1$. Results significant at the 0.05 (0.1) level are marked in the table. Tests are between models obtained using feature selection methods and the full set of features. We cannot perform the tests on the DMOZ-2010 and DMOZ-2012 datasets because the true predictions and class-wise performance scores are not available from the online web-portal.

Statistical evaluation shows that although the global approach is slightly better in comparison to the full set of features, the improvements are not statistically significant. In contrast, the adaptive approach is much better, with an improvement of 2% in $\mu F_1$ and $MF_1$ scores, which is statistically significant.
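To make the significance testing concrete, the sketch below implements one common exact form of the sign test on per-instance correctness. Whether this matches the precise variant of [33] is not verified here, and the function name and the one-sided formulation are assumptions.

```python
from math import comb

def sign_test_p_value(correct_a, correct_b):
    """One-sided exact sign test that system A beats system B.

    correct_a, correct_b: per-instance booleans (True if the prediction
    of that system is correct). Instances where both systems agree in
    correctness are ignored.
    """
    wins = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    losses = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = wins + losses
    if n == 0:
        return 1.0
    # probability of at least `wins` successes under Binomial(n, 0.5)
    return sum(comb(n, i) for i in range(wins, n + 1)) / 2 ** n
```

A test of this kind needs the per-instance predictions of both systems, which is why it cannot be run when only aggregate scores are returned by an evaluation server.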

V-C Memory Requirements

Table III shows the memory requirements for the various models with the full set of features and with the best set of features selected using global and adaptive feature selection. Up to 45% reduction in the memory needed to store the learned models is observed across the datasets. This is a huge margin in terms of memory requirements considering that the models for large-scale datasets (such as DMOZ-2010 and DMOZ-2012) are difficult to fit in memory.

It should be noted that the optimal set of features is different for the global and adaptive methods of feature selection, hence they have different memory requirements. Overall, adaptive FS is slightly better because it selects a small set of features that are relevant for distinguishing data-sparse nodes present in the CLEF, IPC and DMOZ datasets. Also, we would like to point out that Table III represents the memory required to store the learned model parameters only. In practice, 2-4 times more memory is required for temporarily storing the gradient values of the model parameters that are obtained during the optimization process.
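The relation between parameter counts and storage can be checked with simple arithmetic. The sizes reported for the full-feature models in Table III appear consistent with 4 bytes per parameter; this is an inference from the table rather than something stated in the text, and the helper names below are illustrative.

```python
BYTES_PER_PARAMETER = 4  # assumption inferred from Table III (e.g. 32-bit floats)

def model_size_bytes(n_parameters, bytes_per_parameter=BYTES_PER_PARAMETER):
    """Storage needed for the weight vectors alone."""
    return n_parameters * bytes_per_parameter

def reduction(n_selected, n_all):
    """Fraction of storage saved relative to the full model."""
    return 1.0 - n_selected / n_all

# NG dataset, Global FS versus All Features (parameter counts from Table III)
ng_all, ng_global = 1_652_076, 908_820
ng_all_megabytes = model_size_bytes(ng_all) / 1e6
ng_saving = reduction(ng_global, ng_all)
```

For the NG dataset this gives roughly 6.61 MB for the full model and a saving of about 45% with global feature selection.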

| Dataset | Gini-Index | MRMR-D | MRMR-Q | Kruskal-Wallis |
|---|---|---|---|---|
| NG | 2.10 | 5.33 | 5.35 | 5.42 |
| CLEF | 0.02 | 0.46 | 0.54 | 0.70 |
| IPC | 15.24 | 27.42 | 27.00 | 23.24 |
| DMOZ-SMALL | 23.65 | 45.24 | 45.42 | 34.65 |
| DMOZ-2010 | 614 | 1524 | 1535 | 1314 |
| DMOZ-2012 | 818 | 1824 | 1848 | 1268 |

TABLE IV: Feature selection preprocessing time (in minutes) for each feature selection method

V-D Runtime Comparison

Preprocessing Time - Table IV shows the preprocessing time needed to compute the feature importance using the different feature selection methods. The Gini-Index method takes the least amount of time since it does not require interactions between different features to rank the features. The MRMR methods are computationally expensive due to the large number of pairwise comparisons between all the features needed to identify the redundancy information. On the other hand, the Kruskal-Wallis method has overhead associated with determining the ranking of each feature across different classes.

Model Training - Table V shows the total training time needed for learning models. As expected, feature selection requires less training time because fewer features need to be considered during learning. For smaller datasets such as NG and CLEF the improvement is not noticeable. However, for larger datasets with high dimensionality such as IPC, DMOZ-2010 and DMOZ-2012 the improvement is much higher (up to 3x speed-up). For example, the DMOZ-2010 training time reduces from 6524 minutes to a mere 2258 minutes.
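The speed-up factors follow directly from the training times in Table V. The snippet below computes them for the three large datasets, using the first row reported for each dataset; the variable and function names are illustrative.

```python
# (with feature selection, all features) training minutes, first row per dataset in Table V
training_minutes = {
    "IPC": (24.38, 74.10),
    "DMOZ-2010": (2258, 6524),
    "DMOZ-2012": (8024, 19374),
}

def speed_up(with_fs, without_fs):
    """How many times faster training is with feature selection."""
    return without_fs / with_fs

speed_ups = {
    name: speed_up(fs, full) for name, (fs, full) in training_minutes.items()
}
```

The resulting factors are roughly 3.0 for IPC, 2.9 for DMOZ-2010 and 2.4 for DMOZ-2012, which is where the figure of up to 3x comes from.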

Prediction Time - For the dataset with the largest number of test instances, DMOZ-2012, it takes 37 minutes to make predictions with feature selection as opposed to 48.24 minutes with all features using the TD HC approach.

In Figure 4 we compare the training and prediction time on the large datasets (DMOZ-2010 and DMOZ-2012) between flat LR and the TD HC approach with (and without) feature selection. The flat method is comparatively more expensive than the TD approach (6.5 times for training and 5 times for prediction).

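| Dataset | Model | Feature Selection | All Features |
|---|---|---|---|
| NG | LR + $\ell_1$ | 0.75 | 0.94 |
| | LR + $\ell_2$ | 0.44 | 0.69 |
| CLEF | LR + $\ell_1$ | 0.50 | 0.74 |
| | LR + $\ell_2$ | 0.10 | 0.28 |
| IPC | LR + $\ell_1$ | 24.38 | 74.10 |
| | LR + $\ell_2$ | 20.92 | 68.58 |
| DMOZ-SMALL | LR + $\ell_1$ | 3.25 | 4.60 |
| | LR + $\ell_2$ | 2.46 | 3.17 |
| DMOZ-2010 | LR + $\ell_1$ | 2258 | 6524 |
| | LR + $\ell_2$ | 2132 | 6418 |
| DMOZ-2012 | LR + $\ell_1$ | 8024 | 19374 |
| | LR + $\ell_2$ | 7908 | 19193 |

TABLE V: Total training time (in minutes)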
Fig. 4: Training and prediction runtime comparison of the LR + $\ell_1$-norm model (in minutes).
| Dataset Distribution | Train size (per class) | Feature Selection (Gini-Index): $\mu F_1$ | Feature Selection (Gini-Index): $MF_1$ | All Features: $\mu F_1$ | All Features: $MF_1$ |
|---|---|---|---|---|---|
| Low Distribution | 5 | 27.44 (0.4723) | 26.45 (0.4415) | 25.74 (0.5811) | 24.33 (0.6868) |
| | 10 | 37.69 (0.2124) | 37.51 (0.2772) | 36.59 (0.5661) | 35.86 (0.3471) |
| | 15 | 43.14 (0.3274) | 43.80 (0.3301) | 42.49 (0.1517) | 42.99 (0.7196) |
| | 25 | 52.12 (0.3962) | 52.04 (0.3011) | 50.33 (0.4486) | 50.56 (0.5766) |
| High Distribution | 50 | 59.55 (0.4649) | 59.46 (0.1953) | 59.52 (0.3391) | 59.59 (0.1641) |
| | 100 | 66.53 (0.0346) | 66.42 (0.0566) | 66.69 (0.7321) | 66.60 (0.8412) |
| | 200 | 70.60 (0.6068) | 70.53 (0.5164) | 70.83 (0.7123) | 70.70 (0.6330) |
| | 250 | 72.37 (0.4285) | 72.24 (0.4293) | 73.06 (0.4732) | 72.86 (0.4898) |
  • Table shows mean and (standard deviation) in bracket across five runs. (and ) indicates that improvements are statistically significant with 0.05 (and 0.1) significance level.

TABLE VI: Performance comparison of LR + -norm model with varying training size (# instances) per class on NG dataset

V-E Additional Results

Effect of Varying Training Size - Table VI shows the classification performance on the NG dataset with varying training data distribution. We tested the models by varying the training size (instances) per class between 5 and 250. Each experiment is repeated five times, randomly choosing the required number of instances per class each time. The adaptive method with Gini-Index feature selection is used for these experiments. To evaluate the performance improvement of the models, we perform statistical significance tests (a sign test and a Wilcoxon rank test, one for each reported metric). Results that are significant at the 0.05 (0.1) level are marked in the table.
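As a rough illustration of the sign test mentioned above, the sketch below computes an exact two-sided p-value from paired observations. The exact pairing used in the experiments is not stated in this section, so the per-instance correctness indicators and the function name are assumptions made purely for illustration.

```python
from math import comb


def sign_test_p_value(baseline, candidate):
    """Exact two-sided sign test on paired observations; ties are dropped."""
    wins = sum(c > b for b, c in zip(baseline, candidate))
    losses = sum(c < b for b, c in zip(baseline, candidate))
    n = wins + losses
    if n == 0:
        return 1.0  # no informative pairs
    k = min(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical indicators (1 = correct, 0 = wrong) for twelve test instances.
baseline = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
candidate = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0]

p_value = sign_test_p_value(baseline, candidate)
print(f"p = {p_value:.4f}, significant at 0.05: {p_value < 0.05}")
```

Only the pairs on which the two models disagree carry information, which is why the two tied instances are discarded before the binomial tail is computed.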

We can see from Table VI that for low distribution datasets, the feature selection method performs well and shows statistically significant improvements of up to about 2 points over the baseline method. The reason behind this improvement is that with a low data distribution, feature selection prevents the models from overfitting by selecting only the important features that help in discriminating between the models of the various classes. For datasets with high distribution, no significant performance gain is observed, because the number of available training instances is sufficient to learn the models without overfitting even when all the features are used.

Levelwise Analysis - Figure 5 shows the level-wise error analysis for the CLEF, IPC and DMOZ-SMALL datasets with and without feature selection. We can see that more errors are committed at the topmost level than at the lower levels. This is because at higher levels each child node that needs to be discriminated is a combination of multiple leaf categories, which cannot be modeled accurately using linear classifiers. Another observation is that adaptive feature selection gives the best results at all levels for all datasets, which demonstrates its ability to extract a relevant number of features at each internal node (belonging to different levels) in the hierarchy.
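One plausible way to compute such a level-wise breakdown is to charge each misclassified instance to the first level at which its predicted path leaves the true path, since in top-down prediction every later decision is then already off track. The sketch below follows that assumption; the exact definition behind Figure 5 is not given in this section, and the function name, path representation and toy labels are hypothetical.

```python
from collections import Counter


def levelwise_error(true_paths, predicted_paths):
    """Fraction of instances whose first wrong decision occurs at each level.

    Each path lists node labels from the first level below the root down to
    the leaf, and level 1 is the topmost decision. Paths are assumed to have
    equal depth for each instance.
    """
    first_errors = Counter()
    for true_path, predicted_path in zip(true_paths, predicted_paths):
        for level, (t, p) in enumerate(zip(true_path, predicted_path), start=1):
            if t != p:
                first_errors[level] += 1
                break  # later decisions are already off the true path
    total = len(true_paths)
    return {level: count / total for level, count in sorted(first_errors.items())}


# Toy two-level hierarchy with four test instances (hypothetical labels).
true_paths = [("A", "A1"), ("A", "A2"), ("B", "B1"), ("B", "B2")]
predicted_paths = [("A", "A1"), ("B", "B1"), ("B", "B2"), ("B", "B2")]

print(levelwise_error(true_paths, predicted_paths))
```

Here the second instance is charged to level 1 and the third to level 2, so each level accounts for a quarter of the test instances.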

Fig. 5: Level-wise error analysis of LR + -norm model for CLEF, IPC and DMOZ-SMALL datasets.

VI Conclusion and Future Work

In this paper we compared various feature selection methods for solving the large-scale HC problem. Experimental evaluation shows that with feature selection we are able to achieve significant improvement in runtime performance (training and prediction) without affecting the accuracy of the learned classification models. We also showed that feature selection can be beneficial in terms of memory requirements, especially for the larger datasets. This paper presents the first study of various information-theoretic feature selection methods for large-scale HC.

In the future, we plan to extend our work by learning more complex models at each of the decision nodes. Specifically, we plan to use multi-task learning methods, where related tasks can be learned jointly to improve the performance on each task. Because it reduces the dimensionality of the features, feature selection gives us the flexibility to learn complex models that would otherwise incur longer runtimes and larger memory requirements.


NSF Grant #1252318 and #1447489 to Huzefa Rangwala and Summer Research Fellowship from the office of provost, George Mason University to Azad Naik.


  • [1] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig et al., “Gene ontology: tool for the unification of biology,” Nature genetics, vol. 25, no. 1, pp. 25–29, 2000.
  • [2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009, pp. 248–255.
  • [3] S. Gopal and Y. Yang, “Distributed training of large-scale logistic models.” in ICML, 2013, pp. 289–297.
  • [4] A. Naik and H. Rangwala, “A ranking-based approach for hierarchical classification,” in IEEE DSAA, 2015, pp. 1–10.
  • [5] S. Gopal and Y. Yang, “Recursive regularization for large-scale classification with hierarchical and graphical dependencies,” in ACM SIGKDD, 2013, pp. 257–265.
  • [6] L. Xiao, D. Zhou, and M. Wu, “Hierarchical classification via orthogonal transfer,” in ICML, 2011, pp. 801–808.
  • [7] B. Heisele, T. Serre, S. Prentice, and T. Poggio, “Hierarchical classification and feature reduction for fast face detection with support vector machines,” Pattern Recognition, vol. 36, no. 9, pp. 2007–2017, 2003.
  • [8] T. Liu, Y. Yang, H. Wan, H. Zeng, Z. Chen, and W. Ma, “Support vector machines classification with a very large-scale taxonomy,” ACM SIGKDD Explorations Newsletter, vol. 7, no. 1, pp. 36–43, 2005.
  • [9] L. Cai and T. Hofmann, “Hierarchical document categorization with support vector machines,” in CIKM, 2004, pp. 78–87.
  • [10] A. Naik and H. Rangwala, “Inconsistent node flattening for improving top-down hierarchical classification,” in IEEE DSAA, 2016.
  • [11] S. Dumais and H. Chen, “Hierarchical classification of web content,” in ACM SIGIR, 2000, pp. 256–263.
  • [12] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni, “Incremental algorithms for hierarchical classification,” Journal of Machine Learning Research, vol. 7, no. Jan, pp. 31–54, 2006.
  • [13] C. Silla Jr and A. Freitas, “A survey of hierarchical classification across different application domains,” Data Mining and Knowledge Discovery, vol. 22, no. 1-2, pp. 31–72, 2011.
  • [14] D. Koller and M. Sahami, “Hierarchically classifying documents using very few words,” in ICML, 1997, pp. 170–178.
  • [15] N. Holden and A. A. Freitas, “A hybrid particle swarm/ant colony algorithm for the classification of hierarchical biological data,” in SIS, 2005, pp. 100–107.
  • [16] T. Li and M. Ogihara, “Music genre classification with taxonomy,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, 2005, pp. v–197.
  • [17] A. Secker, M. N. Davies, A. A. Freitas, J. Timmis, M. Mendao, and D. R. Flower, “An experimental comparison of classification algorithms for the hierarchical prediction of protein function,” Expert Update (the BCS-SGAI Magazine), vol. 9, no. 3, pp. 17–22.
  • [18] J. Tang, S. Alelyani, and H. Liu, “Feature selection for classification: A review,” Data Classification: Algorithms and Applications, p. 37, 2014.
  • [19] M. Dash and H. Liu, “Feature selection for classification,” Intelligent data analysis, vol. 1, no. 3, pp. 131–156, 1997.
  • [20] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy,” TPAMI, vol. 27, no. 8, pp. 1226–1238, 2005.
  • [21] Z. Zheng, X. Wu, and R. Srihari, “Feature selection for text categorization on imbalanced data,” ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 80–89, 2004.
  • [22] T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in ECML, 1998, pp. 137–142.
  • [23] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial intelligence, vol. 97, no. 1, pp. 273–324, 1997.
  • [24] P. Ristoski and H. Paulheim, “Feature selection in hierarchical feature spaces,” in DS.   Springer, 2014, pp. 288–300.
  • [25] W. Wibowo and H. E. Williams, “Simple and accurate feature selection for hierarchical categorisation,” in Proceedings of the ACM symposium on Document engineering, 2002, pp. 111–118.
  • [26] A. Naik, A. Charuvaka, and H. Rangwala, “Classifying documents within multiple hierarchical datasets using multi-task learning,” in ICTAI, 2013, pp. 390–397.
  • [27] H. Ogura, H. Amano, and M. Kondo, “Feature selection with a measure of deviations from poisson in text categorization,” Expert Systems with Applications, vol. 36, no. 3, pp. 6826–6832, 2009.
  • [28] W. Shang, H. Huang, H. Zhu, Y. Lin, Y. Qu, and Z. Wang, “A novel feature selection algorithm for text categorization,” Expert Systems with Applications, vol. 33, no. 1, pp. 1–5, 2007.
  • [29] C. Ding and H. Peng, “Minimum redundancy feature selection from microarray gene expression data,” Journal of bioinformatics and computational biology, vol. 3, no. 02, pp. 185–205, 2005.
  • [30] I. Dimitrovski, D. Kocev, S. Loskovska, and S. Džeroski, “Hierarchical annotation of medical images,” Pattern Recognition, vol. 44, no. 10, pp. 2436–2449, 2011.
  • [31] Y. Yang, “An evaluation of statistical approaches to text categorization,” Information retrieval, vol. 1, no. 1-2, pp. 69–90, 1999.
  • [32] C. Strobl and A. Zeileis, “Danger: High power! – exploring the statistical properties of a test for random forest variable importance,” in COMPSTAT, 2008.
  • [33] Y. Yang and X. Liu, “A re-examination of text categorization methods,” in ACM SIGIR, 1999, pp. 42–49.