L1 logistic regression as a feature selection step for training stable classification trees for the prediction of severity criteria in imported malaria
Abstract
Multivariate classification methods using explanatory and predictive models are necessary for characterizing subgroups of patients according to their risk profiles. Popular methods include logistic regression and classification trees, with performances that vary according to the nature and the characteristics of the dataset. In the context of imported malaria, we aimed at classifying severity criteria based on a heterogeneous patient population. We investigated these approaches by implementing two different strategies: L1 logistic regression (L1LR), which models a single global solution, and classification trees, which model multiple local solutions corresponding to discriminant subregions of the feature space. For each strategy, we built a standard model and a sparser version of it. As an alternative to pruning, we explore a promising approach that first constrains the tree model with an L1LR-based feature selection, an approach we called L1LR-Tree. The objective is to decrease its vulnerability to small data variations by removing variables corresponding to unstable local phenomena. Our study is twofold: i) from a methodological perspective, comparing the performances and the stability of the three previous methods, i.e., L1LR, classification trees and L1LR-Tree, for the classification of severe forms of imported malaria, and ii) from an applied perspective, improving the actual classification of severe forms of imported malaria by identifying more personalized profiles predictive of several clinical criteria based on variables dismissed for the clinical definition of the disease. The main methodological results show that the combined method L1LR-Tree builds sparse and stable models that significantly predict the different severity criteria and outperform all the other methods in terms of accuracy. The study shows that new biological and epidemiological factors may be integrated in the current clinico-biological picture to improve diagnosis and patient treatment.
1 Introduction
For the purpose of diagnosis, the use of explanatory multivariate classification tools is essential to efficiently characterize groups of patients with a high risk of developing the disease [2]. Two off-the-shelf classifiers are linear logistic regression and decision trees. Based on two different model learning strategies, both methods investigate the relationship between a binary response variable and a set of heterogeneous explanatory variables. To remove non-relevant variables and/or limit the complexity of the solution, both methods allow the integration of a penalization term in their objective function. The choice of the L1 penalization for logistic regression aims to reduce the risk of overfitting induced by potential collinearity and the combinatorial exploration of all possible two-way interactions [11]. L1 logistic regression uses linear combinations of explanatory variables to learn a single decision boundary and build an easily understandable linear model. It selects a subset of discriminant features and assesses the predictive contribution of each of them in the model. However, it only considers linear interactions between features and their global variation related to the binary outcome. Moreover, it does not take into account missing data. The decision tree approach, non-parametric and non-linear, is particularly helpful to explore which feature subspaces are predictive of a class of subjects [5]. It learns multiple decision boundaries parallel to the feature axes and builds easy-to-interpret models in the form of a set of if-then-else decision rules. It also handles complex data, including missing values, numerical and categorical variables, multicollinear variables, outliers and local relationships among variables. However, decision trees can generate over-complex and locally optimal solutions, which increases the risk of overfitting, and they can be unstable under small variations in the data. Pruning is generally applied to avoid the overfitting phenomenon.
Some studies have compared these two approaches, showing that their relative performances and stability depend on the nature of the signal (e.g., the signal-to-noise ratio) and the characteristics of the dataset (size) [20]. Therefore, recent studies have tried to combine them, particularly by applying a logistic regression model at the leaves of the decision trees in order to smooth the final model as an alternative to the standard pruning [15]. Another popular and efficient way to produce more accurate and stable classifiers is feature selection [10]. In this study, we propose and test a novel approach, called L1LR-Tree, that combines the two previous strategies. The objective of this novel approach is to fit more robust and simple decision tree learners by first applying a feature selection resulting from the L1 logistic regression model.
We carried out this methodological study in the particular context of a thorough understanding of the mechanisms of severe forms of imported malaria. Despite the decrease of malaria cases in endemic areas since 2000 [24], the increasing number of travelers between endemic regions and western countries promotes imported malaria cases in non-endemic areas. Metropolitan France is the most concerned European country and the mortality rate of imported malaria is strongly related to severe malaria form favored by delays in access to health care. The World Health Organization (WHO) defined the different clinical and biological criteria for severe malaria in order to speed up the diagnosis and health care of patients that require urgent and intensive care units [24]. This clinico-biological picture inferring the diagnosis is multi-criteria and complex [22]. It also does not take into account epidemiological information which could provide further insight. Indeed, contrary to endemic regions in Africa, the populations of patients with imported malaria is heterogeneous and composed of first generation migrants (born in endemic regions and living in France), second generation migrants (children of first generation migrants born and living in France) and African or European travelers, adults or children with a different history of malaria and genetic background. We also observed an epidemiological evolution in the clinical presentation of severe forms of imported malaria with an increase of older patients from a migrant background and a decrease in the number of patients having neurological disorders [21].
Therefore, in this applied research work, we explored the influence of factors (demographic, epidemiological, clinical, biological and transcriptomic) dismissed in the current clinico-biological picture on both the diagnosis of severe forms of imported malaria and some clinical observations of acute malaria attacks (hematological syndrome, visceral failure, neurological disorders and parasitaemia level). Risk factors for developing severe malaria have been investigated in this context of heterogeneous populations of patients with classical univariate statistical methods [21]. However, these methods revealed their limits as they only assess the statistical associations between each factor and the severe criteria of imported malaria independently of each other and without prediction assumptions. Hence, the use of explanatory multivariate classification tools is essential to efficiently characterize groups with a high risk of developing complicated imported malaria and to reduce the mortality of Plasmodium falciparum infection in France based on multi-source data [12]. Our comparative study of the different learning strategies is especially interesting when considering the complexity of the data. Indeed, the available datasets corresponding to the different case studies present several difficulties that need to be overcome: heterogeneous populations of patients with under-represented subgroups, local phenomena with a weak signal-to-noise ratio, missing data, bias in the current classification, small dataset, etc.
In Section 2, we first present the data and the six case studies grouped into two experiments, and then we briefly explain the three classification methods, the model selection and the evaluation methodologies. In Section 3, we describe the different performance results and the learned models. Finally, in Section 4, we conclude on the benefit of the combined method L1LR-Tree, the clinical aspects and the perspectives of this work.
2 Materials and Methods
2.1 Dataset
The French National Reference Center of Malaria (FNRCM
Data type | Variables |
Demographic | Age, Sex, Caucasian (dichotomous), African (dichotomous), Chemoprophylaxis taken (dichotomous) |
Epidemiological | Vis West Africa (Visit in West Africa, dichotomous), Vis Central Africa (Visit in Central Africa, dichotomous), Vis Other (Visit in another endemic country, dichotomous), Res France or other non-endemic country (Resident in France or in another non-endemic country, dichotomous) |
Clinical | ATCD (history of the disease), Delay 2 (days from symptoms to recovery), Immunodependency (dichotomous) |
Biological | GB (white blood cells count), Platelets (platelets count), Serology, Serological Interpretation, Titration |
Transcriptomic | A1, A2, A3, B1, B2, C1, C2, BC1, BC2, Var1, Var2csa, Var3 (parasite genome, i.e. expression of var genes) |
We defined six case studies, called cs in this paper, grouped into two experiments. The first experiment, composed of two case studies, explores the influence of demographic, epidemiological, clinical, biological and transcriptomic factors dismissed in the current clinico-biological picture (see Table ?) on the diagnosis of severe forms of imported malaria. The second experiment is composed of four case studies and explores the influence of the same factors on four clinical observations of acute malaria attacks: hematological syndrome, visceral failure, neurological disorders and parasitaemia level. Note that for these two experiments, we removed from the input variables those that are directly used to infer the target, which is the malaria severity degree, such as organ or metabolic dysfunctions and blood smear measures.
First experiment The first case study focuses on the current diagnosis of severe imported malaria. For the second case study, we distinguished two subgroups of patients among those with severe malaria according to the existence of neurological and multi-organ clinical dysfunctions. The first one is called serious imported malaria and the second one is called critical imported malaria because this last form of the disease has a high probability of being fatal.
Finally, the studied dataset is composed of 353 patients diagnosed with three severity levels of imported malaria: moderate, serious or critical. For each patient, we have a total of 29 features. 12 of them concern the parasite’s genome, giving information of different nature and sources.
We define two case studies, each one comparing two groups of subjects:
- The first case study includes the whole dataset by comparing subjects having moderate imported malaria to those having a severe form of the disease (i.e. serious and critical). 353 subjects are included in this experiment, with 202 patients having a moderate form and 151 having a severe form. The objective of this experiment is to identify risk factors predictive of severe malaria in a heterogeneous population of patients.
- The second case study compares subjects suffering from serious imported malaria to those having a critical form of the disease. 151 subjects are included in this experiment, with 88 having a serious form and 63 displaying the critical form. The objective of this experiment is to characterize and to validate the relevance of these two subgroups among patients suffering from severe malaria.
Second experiment The 4 different case studies consist in discriminating between two clinical states used for the definition of severe forms of imported malaria among a set of 343 patients.
- The third case study compares patients suffering from Hematological Syndrome with patients not affected by this condition. 49 patients suffer from this syndrome in the dataset.
- The fourth case study compares patients suffering from Visceral Failure with patients not affected by this condition. 271 patients suffer from this failure in the dataset.
- The fifth case study compares patients suffering from Neurological Disorders with patients not affected by this condition. 32 patients suffer from these disorders in the dataset.
- The sixth case study compares patients displaying Parasitaemia greater than 4% with patients not affected by this condition. 113 patients display this condition in the dataset.
2.2 Methods
L1 logistic regression
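For the classification step, we used an L1-regularized logistic regression:
$$P(y = 1 \mid x) = \frac{1}{1 + \exp\left(-(\beta_0 + x^\top \beta)\right)}$$
where $y \in \{0, 1\}$ is the binary target, $x$ are the explicative variables, $\beta_0$ is the intercept and $\beta$ is the regressor vector.
The penalization parameter $\lambda$ is introduced in the model to shrink the estimates of the regression coefficients towards zero and set some of them to zero relative to the maximum likelihood estimates:
$$(\hat{\beta}_0, \hat{\beta}) = \arg\max_{\beta_0, \beta} \; \ell(\beta_0, \beta) - \lambda \lVert \beta \rVert_1$$
where $\ell$ is the log-likelihood function:
$$\ell(\beta_0, \beta) = \sum_{i=1}^{N} \left[ y_i (\beta_0 + x_i^\top \beta) - \log\left(1 + \exp(\beta_0 + x_i^\top \beta)\right) \right]$$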
Note that the parasite genome data were not included in the L1LR method, as it requires features without missing values and many subjects have missing values for these features.
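To illustrate how the L1 penalty sets some coefficients exactly to zero, the following is a minimal sketch of a penalized logistic regression fitted by proximal gradient descent. It is not the glmnet implementation used in this study: the function names, the toy data, the step size and the scaling of the log-likelihood by the sample size are all assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(z, t):
    """Proximal operator of t*|.|: shrinks z toward zero, exactly zero if |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def fit_l1_logistic(X, y, lam, step=0.1, n_iter=2000):
    """Minimise mean negative log-likelihood + lam * ||beta||_1 (illustrative only)."""
    n, d = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * d
    for _ in range(n_iter):
        # Residuals p_i - y_i under the current model.
        resid = [sigmoid(beta0 + sum(b * x for b, x in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        grad0 = sum(resid) / n
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(d)]
        beta0 -= step * grad0  # the intercept is not penalized
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return beta0, beta

# Hypothetical toy data: feature 0 is informative, feature 1 carries no signal.
X = [[-2.0, 1.0], [-1.0, -1.0], [-1.0, 0.0], [1.0, 0.0], [1.0, -1.0], [2.0, 1.0]]
y = [0, 0, 0, 1, 1, 1]

_, beta_small = fit_l1_logistic(X, y, lam=0.1)
_, beta_large = fit_l1_logistic(X, y, lam=1.0)
selected = [j for j, b in enumerate(beta_small) if b != 0.0]
```

With a moderate penalty only the informative feature keeps a non-zero coefficient, and with a strong penalty every coefficient is shrunk to zero. The non-zero coefficients are what a feature selection step would retain.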
Decision Trees
Decision tree analysis
L1LR-Tree
The combined model consists of two steps:
Select a subset of features by fitting an L1 logistic regression.
Build a decision tree on the selected features.
The purpose of this combined approach is the same as that of tree pruning: to limit the appearance of an over-complex solution and lower the overfitting risk. However, the two approaches penalize the model in different ways. Instead of pruning the learned decision tree to limit its size, the L1LR-Tree approach constrains the model beforehand by reducing the dimension of the input feature space.
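The two steps can be sketched as follows. This is a simplified illustration, not the rpart-based implementation of the study: the tree is a small hand-written Gini tree, the coefficient vector passed to `l1lr_tree` stands for the output of a previously fitted L1 logistic regression, and all names and toy values are hypothetical.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Return (feature, threshold) minimising the weighted Gini impurity, or None."""
    best, best_score = None, gini(labels)
    for j in features:
        values = sorted({r[j] for r in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [l for r, l in zip(rows, labels) if r[j] <= thr]
            right = [l for r, l in zip(rows, labels) if r[j] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score - 1e-12:
                best, best_score = (j, thr), score
    return best

def build_tree(rows, labels, features, depth=0, max_depth=3):
    split = best_split(rows, labels, features) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    j, thr = split
    left = [(r, l) for r, l in zip(rows, labels) if r[j] <= thr]
    right = [(r, l) for r, l in zip(rows, labels) if r[j] > thr]
    return (j, thr,
            build_tree([r for r, _ in left], [l for _, l in left],
                       features, depth + 1, max_depth),
            build_tree([r for r, _ in right], [l for _, l in right],
                       features, depth + 1, max_depth))

def used_features(tree):
    if not isinstance(tree, tuple):
        return set()
    j, _, left, right = tree
    return {j} | used_features(left) | used_features(right)

def l1lr_tree(rows, labels, l1lr_coefs, max_depth=3):
    # Step 1: keep only the features with a non-zero L1LR coefficient.
    selected = [j for j, b in enumerate(l1lr_coefs) if b != 0.0]
    # Step 2: grow the tree on the selected features only.
    return build_tree(rows, labels, selected, max_depth=max_depth)

# Hypothetical data: feature 0 separates this small sample perfectly but was
# dropped by the (assumed) L1LR fit; features 1 and 2 were kept.
rows = [[5.0, 1.0, 0.2], [6.0, 2.0, 0.9], [7.0, 4.0, 0.4],
        [1.0, 3.0, 0.1], [2.0, 5.0, 0.7], [3.0, 6.0, 0.3]]
labels = [0, 0, 0, 1, 1, 1]
coefs = [0.0, 0.8, -0.3]  # assumed output of step 1

full_tree = build_tree(rows, labels, [0, 1, 2])
combined_tree = l1lr_tree(rows, labels, coefs)
```

In this toy case the unconstrained tree splits on feature 0 alone, whereas the combined model can only split on the features retained in step 1.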
Model selection
For both methods, we applied a model selection to limit the complexity of the solution with a $K$-fold cross-validation. We chose $K$ to ensure both a bias-variance trade-off of the test error estimates and a sufficient representation of the two groups within the test sets [13]. In each experimental dataset, we kept the original proportion of the two classes within each fold. For L1 logistic regression, we optimized the penalization coefficient $\lambda$ so that we capture two levels of model complexity. Therefore, we selected two values of $\lambda$: $\lambda_{\min}$, such that L1LR-$\lambda_{\min}$ is the best model minimizing the $K$-fold cross-validation mean squared error, and $\lambda_{1se}$, such that L1LR-$\lambda_{1se}$ corresponds to the simplest model which is no more than one standard error worse than the best model according to the one-standard-error rule [8].
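The selection of the two penalization values can be sketched as below, assuming that the cross-validation has already produced a mean error and a standard error for each candidate value. The function name and the numbers are hypothetical and only serve to illustrate the one-standard-error rule.

```python
def select_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from per-lambda CV error means and standard errors."""
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    bound = cv_mean[i_min] + cv_se[i_min]
    # A larger lambda means a stronger penalty, hence a simpler model.
    i_1se = max((i for i in range(len(lambdas)) if cv_mean[i] <= bound),
                key=lambda i: lambdas[i])
    return lambdas[i_min], lambdas[i_1se]

# Hypothetical cross-validation results for five candidate values.
lambdas = [0.001, 0.01, 0.05, 0.1, 0.5]
cv_mean = [0.30, 0.25, 0.26, 0.28, 0.40]
cv_se = [0.03, 0.02, 0.02, 0.03, 0.04]

lambda_min, lambda_1se = select_lambdas(lambdas, cv_mean, cv_se)
```

Here the lowest error is reached at 0.01, and 0.05 is the largest value whose error stays within one standard error of that minimum, so it yields the sparser model.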
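For decision trees, we applied a cost-complexity minimization to limit the size of the tree, and we called this simplified tree the pruned tree (Prune):
$$R_\alpha(T) = R(T) + \alpha \lvert T \rvert$$
where $R(T)$ is the risk of the tree $T$ and $\lvert T \rvert$ is its number of terminal nodes. The parameter $\alpha$ defines the cost of adding another split to the model. This parameter is optimized within the $K$-fold cross-validation.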
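As a small numerical illustration of the cost-complexity trade-off, the sketch below scores a few candidate subtrees, each summarized by its risk and its number of terminal nodes, and returns the one with the lowest penalized cost. The subtrees and their values are hypothetical, and this is not how rpart enumerates the pruning sequence.

```python
def cost_complexity(risk, n_leaves, alpha):
    """Penalized cost: risk of the tree plus alpha times its number of terminal nodes."""
    return risk + alpha * n_leaves

def prune_choice(subtrees, alpha):
    """subtrees: list of (name, risk, n_leaves). Return the name with the lowest cost."""
    return min(subtrees, key=lambda t: cost_complexity(t[1], t[2], alpha))[0]

# Hypothetical nested subtrees of one fully grown tree.
subtrees = [("full", 0.10, 8), ("medium", 0.14, 4), ("stump", 0.25, 2), ("root", 0.43, 1)]

choice_no_penalty = prune_choice(subtrees, alpha=0.0)
choice_moderate = prune_choice(subtrees, alpha=0.02)
choice_strong = prune_choice(subtrees, alpha=0.2)
```

Without penalty the full tree is kept, and increasing the cost per terminal node progressively favors smaller subtrees.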
Evaluation
As the output is dichotomous and the two methods, L1LR and decision trees, estimate the class membership probability $p(x) = P(y = 1 \mid x)$, we define the following decision function to classify our samples:
$$\hat{y}(x) = \begin{cases} 1 & \text{if } p(x) \geq \tau \\ 0 & \text{otherwise} \end{cases}$$
We fixed the threshold $\tau$ of the decision function with respect to the distribution of the two classes in the six case studies of the two experiments in order to avoid biased models due to imbalanced classes. For all the case studies “cs 1 to 6”, the threshold is equal to the proportion of symptomatic patients. Hence, the patients are less frequently classified in the predominant class in a way that more strongly penalizes the misclassification of this class.
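The effect of this threshold can be sketched with the class sizes of the first case study (151 severe vs 202 moderate). The function names and the three example probabilities are hypothetical, and treating a probability equal to the threshold as positive is an assumption.

```python
def class_proportion_threshold(y):
    """Threshold set to the proportion of positive (symptomatic) samples."""
    return sum(y) / len(y)

def classify(probabilities, threshold):
    return [1 if p >= threshold else 0 for p in probabilities]

# Class sizes of case study 1: 151 severe (1) and 202 moderate (0) patients.
y_train = [1] * 151 + [0] * 202
tau = class_proportion_threshold(y_train)  # 151 / 353

probs = [0.30, 0.45, 0.60]                 # hypothetical predicted probabilities
with_tau = classify(probs, tau)
with_half = classify(probs, 0.5)
```

A patient with a predicted probability of 0.45 is assigned to the severe class with the class-proportion threshold but not with the default threshold of 0.5.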
We checked the predictive power of our constructed classifiers, based on the L1LR, the classification trees and the L1LR-Tree methods, through three performance indicators:
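Recall: $\mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$
Specificity: $\mathrm{TN} / (\mathrm{TN} + \mathrm{FP})$
Accuracy: $(\mathrm{TP} + \mathrm{TN}) / (\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN})$
where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, the second class being the positive class.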
The Recall (resp. Specificity) score quantifies the rate of samples correctly classified for the second (resp. first) class. These two scores give complementary insights into the classification performance of the different methods. Indeed, from a medical point of view, we aim to discriminate and correctly classify the two groups of patients and not only the predominant class. We also assessed the statistical significance of the recall and specificity scores with a binomial test and we defined three significance levels: *, ** and ***.
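The three indicators and the significance test can be sketched as follows. The labels and predictions are hypothetical, and the form of the binomial test (one-sided, against a chance level of 0.5) is an assumption, since the exact null hypothesis is not specified here.

```python
from math import comb

def confusion(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def scores(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    recall = tp / (tp + fn)              # rate of correctly classified positives
    specificity = tn / (tn + fp)         # rate of correctly classified negatives
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return recall, specificity, accuracy

def binomial_p_value(successes, trials, p0=0.5):
    """One-sided P(X >= successes) for X ~ Binomial(trials, p0) (assumed null)."""
    return sum(comb(trials, k) * p0 ** k * (1 - p0) ** (trials - k)
               for k in range(successes, trials + 1))

# Hypothetical labels (1 = second class) and predictions for ten patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

recall, specificity, accuracy = scores(y_true, y_pred)
p_recall = binomial_p_value(successes=3, trials=4)  # 3 of the 4 positives recovered
```

On such a small sample a recall of 0.75 is not significant, which shows why a significance level is reported alongside each score.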
These performance indicators are computed with leave-one-out validation, classifying each patient one time, in order to generate stable learning models (highly correlated). This choice is due to the high heterogeneity of the patients.
3 Results
In the first part (First experiment), we present the results of our methodological objective, that is, the comparative study of the different learning strategies, applied to the two case studies of the first experiment (moderate vs severe malaria and serious vs critical). We compared the results between the standard methods, i.e. L1LR and classification trees, and their sparse forms, L1LR-$\lambda_{\min}$ vs L1LR-$\lambda_{1se}$ and Tree vs Prune. Then, we selected the best form of each standard method in terms of performance scores and sparsity, and we combined them to build an L1LR-Tree model. In the second part (Second experiment), we applied the combined model to the four case studies of the second experiment to validate this approach and point out some clinical insights given by the selected variables of the models. For all the case studies of the two experiments, we report and explain the models obtained with the combined method.
3.1 First experiment
L1 logistic regression- and classification tree-based models
Performance scores Classification tree-based models (i.e. Tree and Prune) have a better accuracy than L1LR-based models (i.e. L1LR-$\lambda_{\min}$ and L1LR-$\lambda_{1se}$) for both case studies “cs 1 and 2” (see Fig. ? and ?). The Tree method outperforms the other methods with an accuracy score of (resp. ) to discriminate moderate and severe (resp. serious and critical) forms of imported malaria. It is also the only method to have both highly significant recall and specificity scores (p-value 0.001) for the two case studies. Indeed, L1LR-based models tend to classify the second class of the case studies well, that is, the least represented one, composed of the patients with severe (resp. critical) imported malaria in the first (resp. second) case study, displaying significant recall scores. On the other hand, they tend to fail to correctly classify the first class. Conversely, the Prune models classify the first class of the case studies well, that is, the most represented class, composed of the patients with moderate (resp. serious) imported malaria in the first (resp. second) case study, and have the best significant specificity scores. On the other hand, they fail to classify the second class correctly. We can assume that the Tree method is more robust to unbalanced groups of samples and therefore extracts discriminant decision rules that generalize well to predict both classes.
For both case studies, we also observed that the simpler forms of the standard methods, namely the L1LR-$\lambda_{1se}$ and Prune models, achieve the best significant recall and specificity scores, respectively. Conversely, they achieve the worst, non-significant specificity and recall scores, respectively. Therefore, the sparsest approaches seem to be more sensitive to unbalanced groups of samples, tending to over-classify one class more than the other.
Selected variables As expected, the simpler models, namely L1LR-$\lambda_{1se}$ and Prune, include fewer features than the standard models, namely L1LR-$\lambda_{\min}$ and Tree. For both case studies of the first experiment, and particularly the first one, the Tree-based models are on average sparser but less stable than the L1LR-based models (see Fig. ? and ?). Indeed, they capture on average fewer features, but some of them are selected only a few times, probably corresponding to locally optimal solutions. A common pattern of selected stable features (i.e. selected almost systematically over the leave-one-out models) for all the methods is composed of the white blood cells count (GB), platelets count, serological status and titration variables for “cs 1”. We observed the same common pattern of selected stable features plus the age for “cs 2”. Note that the serological status is a discrete feature derived from the titration values, so they are considered similar features.
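The notion of a stable feature can be made concrete by counting how often each variable is selected across the leave-one-out models. The sketch below assumes each fitted model is summarized by the list of variables it uses. The four example models and the 0.9 frequency cut-off are hypothetical.

```python
from collections import Counter

def selection_frequency(selected_per_model):
    """Fraction of models in which each feature is selected."""
    counts = Counter(f for feats in selected_per_model for f in set(feats))
    n_models = len(selected_per_model)
    return {f: c / n_models for f, c in counts.items()}

def stable_features(selected_per_model, min_freq=0.9):
    """Features selected in at least min_freq of the models (assumed cut-off)."""
    freq = selection_frequency(selected_per_model)
    return sorted(f for f, v in freq.items() if v >= min_freq)

# Hypothetical selections from four leave-one-out models.
models = [
    ["GB", "Platelets", "Serology"],
    ["GB", "Platelets", "Titration"],
    ["GB", "Platelets", "Serology", "Sex"],
    ["GB", "Platelets", "Serology"],
]

freq = selection_frequency(models)
stable = stable_features(models)
```

Variables that appear in only a few models, such as Sex or Titration in this toy example, are the ones the text associates with locally optimal solutions.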
In the following, we focused on the Tree method, since it is the most powerful approach and it gives meaningful information on the models through the learned classification rules. Indeed, these rules characterize the different discriminant subregions of the feature space specific to subgroups of subjects. In addition to the common pattern, the Tree models capture the following stable features: the immunodependency and sex variables for “cs 1”, and the log-transformed expression of a sub-group of the var gene family A and the visit in West Africa variable for “cs 2”.
Some of these results confirmed the observations of previous studies on the potential interactions of Plasmodium falciparum during acute malaria with negative hematological changes [17] like an increase of GB [3] and a decrease of platelets count [14] and at the same time with an immunological protection represented by serological status [4] on the development of the different severity forms of imported malaria. Furthermore, being older is a well-known risk factor for developing the acute form of imported malaria [6]. In [6], a statistical relationship has been reported between visiting West Africa, especially Gambia, and the risk of fatal malaria. Concerning the impact of gender on the discrimination between moderate and severe malaria, no statistical relation has been proven between gender and malaria severity. Nevertheless, one study showed that women are more susceptible to cerebral complications than men [16]. Concerning the expression of group A var gene, some studies have highlighted the role of the var gene family in cerebral malaria [1].
Combined L1LR-Tree-based models
To effectively penalize the Tree method with a prior -based feature selection step, we used the method.
Performance scores The combined L1LR-Tree method achieves performances similar to or higher than those of the Tree method, except for the recall score of the first case study, which is lower (see Fig. ? and ?).
Selected variables As the combined method builds the classification tree based on the features selected with L1LR, it efficiently reduces the set of input features. The set of stable variables selected by the combined models corresponds to the previously observed common patterns for both case studies: GB, platelets count and serological status/titration (resp. plus age) variables for “cs 1” (resp. “cs 2”) (see Fig. ? and ?). Note that for the second case study the 151 combined models have selected either serology or titration, leading to a total frequency of 103 for both variables. Therefore, the combined method led to sparser, more stable and more discriminant (in terms of accuracy) models than those achieved by the Tree method. Tables ? and ? show examples of rule sets derived from stable L1LR-Tree models for each case study. From these classification rules, we can easily point out the subregions of the feature space predictive of the severe forms of imported malaria.
3.2 Second experiment
Given the results of the methodological comparative study obtained on the two case studies of the first experiment, we applied the L1LR-Tree method to the four case studies of the second experiment.
Performance scores For all the case studies, the combined method discriminates with good accuracy scores between the two clinical states of the four clinical severity criteria (figure ?): hematological syndrome (), visceral failure (), neurological disorders () and parasitaemia level (). We can also conclude that for all these case studies, we significantly classify the two classes, except for the recall score of “cs 5” which can be explained by the low frequency of the class “Neurological disorders” (i.e. ).
Selected variables The selected variables presented on Figure ? and the classification rules (see Tables ? to ?) give medical insights on the influence of unused factors (demographic, epidemiological, clinical, biological and transcriptomic) on some clinical observations of acute malaria attacks. As previously observed in the first experiment, the variables platelets count and white blood cells count are strongly involved in the prediction of neurological disorders, hyper-parasitaemia and hematological syndrome. This could reflect the parasite sequestration in Plasmodium falciparum malaria. Concerning “cs 5”, the models showed that caucasian patients seem to be more affected by neurological disorders. Moreover, the corresponding classification rule set pointed out an interesting insight, that is the patients probably not previously affected by malaria (Caucasian, low titration/negative serology) are more sensitive to Neuro-malaria which indicates the presence of more severe forms of malaria [9]. On the other hand, patients with a history of malaria indicated by a positive serology display visceral failures. We currently observed more frequently the moderate malaria form with visceral failures and without neurological disorders. Furthermore, the trees models of “cs 4” captured only the variable Serology as a predictive factor of visceral failures. This can be explained by the fact that these symptoms may arise more from an inflammatory or immunological response than from a parasite sequestration. The gender seems also to have an impact on the presence of the hematological syndrome and the hyper-parasitaemia. Indeed, the rule sets (see Tables ? and ?) show that for a given range of platelets count and a given GB threshold, male patients develop these clinical symptoms while women do not. This may be due to the fact that men travel more frequently than women in endemic areas.
4 Conclusion and discussion
Among the standard approaches, i.e. L1 logistic regression and classification trees, only the Tree method classifies the two classes of patients well for both the first and the second case studies. However, the Tree models are not sparse and stable enough, providing locally optimal solutions that reflect the intrinsic heterogeneity of the studied dataset. The pruning method drastically simplifies the Tree models but leads to poor, non-significant recall scores. This phenomenon could be explained by the fact that pruning tends to eliminate unstable branches, corresponding to variables with a great variance on threshold values and positions across cross-validation trees. Therefore, a pre-selection of the input features can be a good alternative to pruning in order to constrain the complexity of the decision trees and to increase their robustness to small data variations by removing under-represented phenomena in the studied population.
Our new method, called L1LR-Tree, significantly discriminates the two classes for both experiments, and we show that it outperforms all the other methods in terms of accuracy for the first two case studies. Moreover, it efficiently leads to sparser and more stable models than the Tree ones. We can conclude that our combined method is a relevant sparse tree-based method for classification problems, even when the classes are strongly unbalanced, as is the case for the classes of the second experiment.
Concerning the prediction of the severe criteria of imported malaria, the combined method classifies around of the patients (until for the visceral failures) for both studied experiments. Hence, concerning the case study 2, we can conclude that the subclassification of severe imported malaria in serious and critical classes is valid. Moreover, the combined method produces explanatory and easily understandable models which can be represented under the form of rule sets. These rule sets confirm the predictive power of epidemiological and biological variables discarded from the current classification, such as platelets count, age, gender, white blood cells count and serology. They also provide meaningful information about the discriminant subregions of the selected features specifying for example the threshold or range of values of the selected biological measures.
However, these models did not capture some local phenomena in a stable way (cf variables captured with a low frequency over the leave-one-out models), probably due to their low representation in the dataset. This may explain a part of the misclassification of patients. A solution would be to expand the sample size, while ensuring the diversity of the population surveyed, in order to increase the statistical reliability of these phenomena. For the first experiment, a part of the classification error may also result from a bias in the definition of the classes based on the current clinico-biological picture. Indeed, as explained in the introduction, the diagnosis of severe imported malaria is multi-criteria, complex and does not take into account the heterogeneity of the individual profiles.
It is also important to mention that the use of L1LR as a feature selection step prior to fitting the decision tree may be challenged, given the limitations of the method (linear interactions, no missing data, etc.). In future work, it would be interesting to investigate other penalized approaches.
Appendix
Moderate Form | |
Serious Form | |
No Hematological Syndrome | |
No Visceral Failure | |
No Neurological Disorders | |
Parasitology 4% | |
Footnotes
- http://www.cnrpalu-france.org/
- Computed in R with the package glmnet, https://cran.r-project.org/web/packages/glmnet/glmnet.pdf
- Computed in R with the package rpart, https://cran.r-project.org/web/packages/rpart/rpart.pdf
References
[1] Argy, N and Houzé, S. Paludisme grave: de la physiopathologie aux nouveautés thérapeutiques. Journal des Anti-infectieux.
[2] Austin, P C et al. Using methods from the data-mining and machine-learning literature for disease classification and prediction: a case study examining classification of heart failure subtypes. Journal of Clinical Epidemiology.
[3] Berens-Riha, N et al. Evidence for significant influence of host immunity on changes in differential blood count during malaria. Malaria Journal.
[4] Bouchaud, O et al. Do African immigrants living in France have long-term malarial immunity? The American Journal of Tropical Medicine and Hygiene.
[5] Breiman, L et al. Classification and regression trees.
[6] Checkley, A M et al. Risk factors for mortality from imported falciparum malaria in the United Kingdom over 20 years: an observational study. BMJ.
[7] CNR. Rapport annuel d’activité. Centre national de référence du Paludisme.
[8] Friedman, J H et al. The elements of statistical learning, 2nd edition.
[9] Gupta, S et al. Immunity to non-cerebral severe malaria is acquired after one or two infections. Nature Medicine.
[10] Guyon, I and Elisseeff, A. An introduction to variable and feature selection. The Journal of Machine Learning Research.
[11] Hosmer Jr, D W et al. Applied logistic regression.
[12] Kajungu, D K et al. Using classification tree modelling to investigate drug prescription practices at health facilities in rural Tanzania. Malaria Journal.
[13] Kohavi, R et al. A study of cross-validation and bootstrap for accuracy estimation and model selection. 14th International Joint Conference on Artificial Intelligence.
[14] Lampah, D A et al. Severe malarial thrombocytopenia: a risk factor for mortality in Papua, Indonesia. Journal of Infectious Diseases.
[15] Landwehr, N et al. Logistic model trees. Machine Learning.
[16] Mühlberger, N et al. Age as a risk factor for severe manifestations and fatal outcome of falciparum malaria in European patients: observations from TropNetEurop and SIMPID surveillance data. Clinical Infectious Diseases.
[17] Muwonge, H et al. How reliable are hematological parameters in predicting uncomplicated Plasmodium falciparum malaria in an endemic region? ISRN Tropical Medicine.
[18] OMS. Directives pour le traitement du paludisme. Organisation mondiale de la Santé.
[19] Park, S Y and Liu, Y. Robust penalized logistic regression with truncated loss functions. Canadian Journal of Statistics.
[20] Perlich, C et al. Tree induction vs. logistic regression: a learning-curve analysis. The Journal of Machine Learning Research.
[21] Seringe, E et al. Severe imported Plasmodium falciparum malaria, France, 1996–2003. Emerging Infectious Diseases.
[22] SPILF. Prise en charge et prévention du paludisme d’importation à Plasmodium falciparum: recommandations pour la pratique clinique. Société de Pathologie Infectieuse de Langue Française.
[23] Therneau, T M et al. An introduction to recursive partitioning using the rpart routines. Technical report, CRAN R-project.
[24] WHO. World malaria report. World Health Organization.