Semiparametric Classification of Forest Graphical Models
Abstract
We propose a new semiparametric approach to binary classification that exploits the modeling flexibility of sparse graphical models. Specifically, we assume that each class can be represented by a foreststructured graphical model. Under this assumption, the optimal classifier is linear in the log of the one and twodimensional marginal densities. Our proposed procedure nonparametrically estimates the univariate and bivariate marginal densities, maps each sample to the logarithm of these estimated densities and constructs a linear SVM in the transformed space. We prove convergence of the resulting classifier to an oracle SVM classifier and give finite sample bounds on its excess risk. Experiments with simulated and real data indicate that the resulting classifier is competitive with several popular methods across a range of applications.
*2018****/***/**DornMary Frances Dorn, Amit Moscovich, Boaz Nadler and Clifford Spiegelman \ShortHeadingsClassification of Forest Graphical ModelsDorn, Moscovich, Nadler and Spiegelman \firstpageno1
TBD
binary classification, feature engineering, undirected graphical model, risk consistency, density estimation
1 Introduction
In this work, we consider binary classification problems where the vector of explanatory variables belongs to some dimensional space , and the response variable takes values in the set . Given a set of labeled samples , the problem is to construct a classifier , with a small misclassification error rate on future unlabeled instances of . It is well known that the optimal classifier is a threshold of the Likelihood Ratio statistic,
(1) 
The challenge in applying this result is that the class conditional densities are typically unknown. Accurate nonparametric dimensional density estimation requires a number of labeled samples that is exponential in . This is known as the curse of dimensionality (see for example chapter 2 of Tsybakov (2009)). Hence, in high dimensional settings, classification methods based on nonparametric multivariate density estimation may perform poorly compared to other methods.
Instead of a generative approach that estimates the conditional densities , many discriminative approaches to binary classification have been developed, that do not require estimation of class densities. Notable examples include Support Vector Machines (Cortes and Vapnik, 1995; Lee et al., 2004), Linear Discriminant Analysis and its regularized variants (Friedman, 1989; Bickel and Levina, 2004; Guo et al., 2006; Witten and Tibshirani, 2011), Artificial Neural Networks (Rosenblatt, 1958; Ripley, 1994; LeCun et al., 2015), Classification and Regression Trees (Breiman et al., 1984), Random Forests (Breiman, 2001) and Gradient Boosted Trees (Friedman, 2001).
In this work, we propose a new semiparametric approach to classification that incorporates a sparsity assumption on the dependency structure of the data. As we review in Section 2, our approach is related to the graphical model approaches taken by Lafferty et al. (2012), Pérez et al. (2009), Park et al. (2011), Tan et al. (2010) and Fan et al. (2016). Specifically, we assume that each class can be represented by a foreststructured graphical model. As described in Lemma 3, a key observation is that under this assumption, the loglikelihood ratio is a linear function of the univariate and bivariate logtransformed densities. The coefficients of this linear function are integers that depend on the unknown forest structured graphical models of the two classes. One approach to build a classifier is thus to approximate this linear function by estimating the densities and then attempting to recover the forest structure of the two classes. In contrast, we view the underlying graphical models as nuisance parameters and do not attempt to estimate them. Instead, our approach is simpler: it nonparametrically estimates the univariate and bivariate marginal densities, maps each sample to the logarithm of these estimated densities and constructs a linear SVM in the transformed space. On the theoretical front, in Section 4 we prove that the risk of the resulting classifier converges to that of an oracle SVM classifier, based on the exact marginal densities. In Section 5 we present several experiments with simulated and real datasets. These experiments demonstrate that our classifier is competitive with other leading methods across a wide range of applications.
Our proposed approach, in which only onedimensional and twodimensional densities are estimated, has several appealing advantages: (i) it circumvents the curse of dimensionality problem of high dimensional density estimation; (ii) it avoids the challenging problem of learning the underlying graphical models; (iii) it is computationally simple to implement than forest density estimators, and (iv) it is competitive with other popular classification methods. Further discussion and possible extensions appear in Section 6.
2 Learning Forest Graphical Models
Our proposed approach assumes that the conditional distribution of given for both classes can be well approximated by foreststructured graphical models. We begin with a brief description of probabilistic graphical models (Lauritzen, 1996; Jordan, 1998, 2004). Then we review some works concerned with density estimation and classification of several graphical model families.
Graphical models define a family of multivariate probability distributions that factorize according to some graph . The vertices represent the coordinates of the random vector and the edges in encode the conditional dependence relationships.
In undirected graphical models,
also known as Markov random fields, the absence of an edge indicates that the corresponding variables are conditionally independent given all the other variables.
In this paper we focus on undirected graphical models with no cycles, also called forests as they can be described as a set of trees. See Fig. 1 for a simple example of a forest composed of two trees,
which together describe the dependency structure of a 5dimensional random vector.
Assuming that the dependencies among the variables in a given dataset can be well approximated by a graph structure, several works considered the problem of recovering the underlying graph and its associated distribution. We focus instead on the construction of classifiers that sensibly use this modeling assumption, and treat the graph structure as a nuisance parameter, thus avoiding the challenging task of estimating the underlying graphical model of each class. Yet, a common theme is that both approaches nonparametrically estimate only univariate and bivariate densities.
2.1 Unsupervised Learning
We first briefly review some key prior works on graph learning. Several methods were developed to learn the structure of a graphical model. Some methods assume that the distribution is multivariate Gaussian. In this case, finding the graph structure amounts to locating the zeros in the inverse covariance matrix. In a high dimensional setting with a foresttype graph, this inverse matrix is sparse. Several works proposed regularized methods to estimate the zeros of an inverse covariance matrix (Meinshausen and Bühlmann, 2006; Rothman et al., 2008; Friedman et al., 2007). To relax the Gaussian assumption, while still taking advantage of the efficient computational procedures developed under it, Liu et al. (2012) and Xue and Zou (2012) proposed the nonparanormal model, or semiparametric Gaussian copula model.
A different approach is to assume that the graph structure is an undirected tree. In this case, the probability density factorizes as
where is a permutation of and is the parent of in the dependency tree. For a discrete multivariate distribution, Chow and Liu (1968) proposed a computationally efficient algorithm to find the treebased distribution with minimal KullbackLeibler (KL) divergence from . When itself has a tree structure, Chow and Wagner (1973) proved that as the sample size tends to infinity . Tan et al. (2011) extended this approach to allow a more sparse structure, by removing weak edges from the learned tree. The result is a forest composed of a union of disjoint trees. Such a foreststructured distribution admits the factorization
(2) 
where and are the edge and vertex sets of the forest. Lafferty et al. (2012) extended the Chow and Liu approach to the case of multivariate continuous data by using kernel density estimates of the univariate and bivariate densities in Eq. (2).
2.2 Classification
The various works mentioned in the previous section all considered an unsupervised setting. We now return to the binary classification problem, assuming that the data from each class has a foreststructured distribution of the form of Eq. (2). We denote the density for class by , and the classconditional univariate and bivariate densities by
Any density estimator can be used to build a generative classifier as follows: first estimate the class densities and then classify samples by their likelihood ratio as in Eq. (1). Examples of this approach for directed graphical models include the works of John and Langley (1995); Friedman et al. (1997) and Pérez et al. (2009).
Several discriminative methods have been proposed which consider special cases of forest models. Specifically, the following three nested families of distributions were studied:

Edgeless graph: The simplest graphical model is the graph with no edges. Its density is the product distribution . Under this model, the log likelihood ratio statistic yields the classical Naïve Bayes classifier . Based on this observation, Fan et al. (2016) proposed to nonparametrically estimate all the univariate densities, and transform each sample to the vector of its estimated log ratios . They then generalized NaïveBayes by considering a linear classifier of the form with the coefficients estimated by a penalized logistic regression. They also proposed a variant where they retain the original features as well and construct a linear classifier .

Chain graph: Park et al. (2011) proposed a different discriminative approach. Let be a permutation of . They assumed that there is a Markov chain according to which both classes are distributed. Namely, for and all , Under this assumption, the log likelihood ratio is simply the sum of where They nonparametrically estimate all univariate and bivariate densities and seek the permutation that provides the lowest misclassification error. Their approach is computationally intensive as they iterate over all possible Markov chains. To reduce the computational load, they proposed another variant whereby they apply Fisher’s linear discriminant analysis on all univariate and bivariate densities.

Tree: Tan et al. (2010) proposed to learn trees that are specifically tailored for classification. Their procedure attempts to fit each tree distribution to the observed data from that class and to simultaneously maximize the distance between the two distributions.
Note that these three cases form a nested family of graphical models since an edgeless graph can be described as a chain of independent variables whose bivariate densities satisfy . Also, every chain graph is a tree.
3 The Forest Density Classifier
We now describe our proposed method, which allows for a general forest dependency structure, thus generalizing the models discussed in the previous subsection. A key observation behind our method is that if the distributions of both classes are foreststructured as in Eq. (2), possibly with different graphs , then their likelihoodratio has a particularly simple form given by the following lemma.
If both classes are foreststructured then their log likelihood ratio is
(3)  
where is the degree of in the graph of class . The proof follows directly from Eq. (2) and is thus omitted.
Let be the oracle transformation that maps a feature vector to a vector containing all of ’s univariate and bivariate log density terms,
(4) 
where is an alias for . In this notation, Lemma 3 can be restated to say that the log likelihood ratio is given by
where the coefficients of depend on the graph structure of both classes. It follows that the Bayes classifier has the form
(5) 
where depends on the class imbalance and misclassification costs. The forest density classifier attempts to mimic the Bayes classifier using two approximations:

Replacing the oracle feature map , which is typically unknown, by a map of estimated densities,
Unlike the full dimensional distribution , which typically cannot be accurately estimated even for moderate values of , univariate and bivariate marginal densities can be accurately estimated in practice, without requiring huge sample sizes.

Replacing the unknown optimal classifier by a linear classifier
whose coefficients are learned from the data, so as to minimize the empirical misclassification rate.
The coefficients corresponding to the bivariate densities in the oracle linear separator are all either or , depending on the existence of edges in the respective graphical models, whereas the coefficients multiplying the univariate densities are integers that depend on the degrees of the corresponding variables in the graphical model. Constructing a linear classifier with such integertype constraints is a challenging combinatorial problem and may not be the best choice for the datadependent feature mapping . Hence, similar to the approach of Fan et al. (2016), we relax the integer constraints and allow the coefficients to be all real valued. With this relaxation, we can use a linear SVM to find a good weight vector . The construction of the forest density classifier is summarized in Figure 2.
innerleftmargin=20pt, innerrightmargin=20pt, innertopmargin=20pt, innerbottommargin=20pt, middlelinewidth=1pt, backgroundcolor=matlab_blue!20, roundcorner=5pt
This approach avoids the challenging task of fitting forest type distributions to each class. In the next section we prove that under certain conditions, the risk of the linear SVM solution based on the estimated densities converges to that of the optimal SVM classifier based on the exact densities .
4 Theory
Let be a set of i.i.d. labeled training samples from a probability distribution over the product space . To simplify the theoretical analysis we assume that the training samples were split into two disjoint sets and of sizes and , respectively, with . The labeled samples in are used to construct the univariate and bivariate density estimates and for each class . We then apply the feature map of Eq. (4) to the samples in and construct a linear classifier using the transformed samples . For simplicity, we consider feature maps that include an additional coefficient equal to 1, so that the intercept can be subsumed into . The class label predicted by a linear classifier applied to a featuremapped sample is thus sign.
In the rest of this section, we define several risk functions and then formulate the forest density classifier as an empirical risk minimizer. The classification error rate, or 01 risk, of the classifier is
From Eq. (5) it follows that minimizes . The empirical 01 risk is the average loss over the samples in ,
Minimizing this risk with respect to is computationally difficult (BenDavid et al., 2003). To circumvent this difficulty, one popular approach is to consider instead the risk and empirical risk with respect to the hinge loss ,
The hinge loss is a convex function and therefore can be minimized efficiently. Since the hingeloss bounds the 01 loss from above, minimizing the hinge loss is a way to control for the misclassification error. A common formulation of the SVM classifier minimizes the empirical hinge loss with an additional Tikhonov regularization term. For the transformed data set this is
(6) 
where controls the regularization strength (see for example Section 15.2 of BenDavid and ShalevShwartz (2014)). For our forest density classifier, we consider instead the SVM solution with Ivanov regularization,
(7) 
where the radius controls the complexity of the hypothesis class. Both the Tikhonov form in Eq. (6) and the Ivanov form in Eq. (7) share the same regularization path (Oneto et al., 2015). Namely, for every choice of Tikhonov regularization parameter there is an Ivanov regularizer which yields the same solution. Finally, we denote by the SVM population minimizer using the oracle feature map
(8) 
In the rest of this section, we prove that as the sample size tends to infinity, the risk of the forest density classifier converges to the oracle risk and give high probability bounds on their difference.
4.1 Convergence of the Datadependent Feature Map
We begin by studying the convergence rate of the estimated feature map to the oracle map . For simplicity, we use the bivariate notation to also denote univariate density estimates . For our theoretical analysis we need 3 assumptions on the class conditional densities and their estimation procedures:
Assumption 1
There are constants such that for all and all we have that .
Assumption 2
There exists a rate decay function that satisfies such that for every , the following bound holds with high probability (w.h.p.)
(9) 
Assumption 3
The following expectation is finite,
(10) 
and .
Remark 1
Assumption 2 holds under various conditions on the underlying density and method of estimation. For Hölder bivariate densities, the minimax bound holds (up to log factors) with high probability, both for kernel density estimation (Liu et al., 2011; Jiang, 2017) and for knearestneighbor density estimation (Dasgupta and Kpotufe, 2014).
Remark 2
A potential problem with Assumption 3 is that the terms may diverge due to being equal or close to zero on a nonnegligible subset of . This problem is easy to avoid by constructing an adjusted density estimator that satisfies for some small constant . In that case, if the Mean Integrated Square Error (MISE) is finite for all then Eq. (10) with the adjusted density estimator is guaranteed to be finite as well. The MISE has been studied extensively for various density estimators (see for example Tsybakov (2009)). In particular, when the bivariate densities are Hölder, kernel density estimation achieves the finitesample minimax bound for dimension 2, In this case, it can be shown that the adjusted kernel density estimator satisfies
4.2 Convergence of the SVM Risk to the Oracle Risk
In this section we present our main theoretical result. The proofs appear in the appendix. We begin with a technical lemma that bounds the difference in risks of using the exact log densities compared to using the estimated log densities . {lemma} Under Assumption 3, for any ,
This holds for the empirical risk as well,
We now show that for the hinge loss, the empirical risk of the forest density classifier defined in Eq. (7), converges in probability to the population risk of the oracle SVM solution defined in Eq. (8). {theorem} Let be the feature map constructed using samples and let be the forest density SVM classifier, trained using samples. Then, under Assumptions 1,2 and 3, the following bound holds w.h.p.
The terms and are due to errors in the univariate and bivariate density estimates, whereas the term is an SVM generalization term.
5 Experimental Results
We illustrate the predictive accuracy of our classifier on both simulated data that follows forest distributions as well as on several publicly available datasets. As described above, our proposed method (denoted ForestDens) trains a linear support vector machine classifier in the transformed space of univariate and bivariate log densities. In these experiments, the univariate densities are estimated using Matlab’s ksdensity kernel smoothing function with default parameters. The bivariate densities use a Gaussian product kernel. Our method may be tuned to improve performance on a specific domain by considering such things as different costs of misclassification and prior probabilities. However, we refrain from optimizing each classifier on every domain, and instead demonstrate the validity of our approach and its competitiveness across a broad range of problems.
5.1 Synthetic Data Experiments
In this section we consider synthetic datasets for which both classes have a foreststructured distribution. The experiments follow a factorial design:

Sample size of the training data, .

Class imbalance. The problem is either balanced, with half of the training samples belonging to each class, or unbalanced, whereby of the training samples belonging to the minority class.

Sparsity of the forest structure, measured as the number of edges in the two forests. The structures are either fully connected trees with edges or sparser forests having roughly twothirds as many edges.

Complexity of the marginal distributions. In the simple case, the joint distribution for each forest is multivariate normal. In the more complex setting, the marginals and are either tdistributed with 3 degrees of freedom or follow a mixture of two Gaussian distributions with different means and variances.

Partial common structure between the two classes. In the simpler setting, the two forests are constructed independently of each other without constraining them to be similar. In the more challenging setting, roughly twothirds of the structural features (edges and isolated nodes) are identical for the two classes. Figure 3 displays an example of a model in which two forest distributions were generated in this fashion.
We compare the forest density classifier with several popular classifiers: Linear Discriminant Analysis (LDA), the 5NearestNeighbor classifier (5NN), Naïve Bayes (NB), kernel Support Vector Machines with Radial Basis Functions (SVM RBF), and a Random Forest (RF) classifier. These methods were implemented using the builtin functions in Matlab with default parameters. Naïve Bayes was implemented using kernel smoothing density estimates with a Gaussian kernel. For each predictor and class, the bandwidth is automatically chosen to be a value that is optimal for a Gaussian distribution. For SVM (RBF), the scale of the RBF is automatically chosen using a heuristic procedure implemented in Matlab’s fitcsvm function. Random Forest was implemented using Matlab’s bagged decision tree function TreeBagger, with the number of trees set to 50. The number of variables randomly selected for each split is set to the default value of the square root of the total number of variables.
For each combination of factor levels, forestdistributed models were generated. A training dataset drawn from each model was used to construct the different classifiers, and the predictive accuracy of these classifiers was evaluated on an independent test set of samples generated from the same forestdistributed models. In these test sets, half of the observations were generated from each class. Average misclassification error rates from the replicates are illustrated in Figure 4. The full results, including standard deviations, are displayed in Tables 25 in the appendix. For each experiment, the classifiers with the best performance according to a Wilcoxon signedrank test at the level are printed in boldface. Due to the highly nonlinear separating boundary in each model, LDA performed no better than a random guess and therefore we omitted it from the tables.
The models generated in this study represent a wide variety of foreststructured distributions, and the results demonstrate that our proposed classifier generally outperformed the other classifiers when the forest density assumption holds. For each of the 64 factor level combinations, we made a pairwise comparison between the performance of the forest density classifier and each of the other classifiers. A onesided Wilcoxon signedrank test was used to test the null hypothesis that the misclassification error rate of the forest density classifier is no better than that of a competing classifier. Significance was tested at the Bonferronicorrected level. Figure 5 displays the proportion of the models for which the forest density classifier was found to perform significantly better. See Appendix B for more details.
5.2 Real Data Experiments
To test our method in cases where the forest assumption does not necessarily hold, we applied it to several publicly available binary classification datasets from different domains, all of which are available from the UCI Machine Learning Database (Dheeru and Karra Taniskidou, 2017). The chosen datasets have mostly continuous predictor variables, but range in sample size and dimensionality. We removed instances with missing values. Additionally, 35 samples were removed from the Pima diabetes dataset that are miscoded in one variable.
In the Ionosphere dataset, two ordinal variables are treated as categorical when constructing the Naïve Bayes classifier, so that the corresponding marginals are estimated as a multinomial distribution. Where features are not commensurate, such as the Pima diabetes dataset in which one feature is the diastolic blood pressure and another is the body mass index, they were standardized prior to applying 5NN and SVM (RBF), both of which are sensitive to the scale of the data.
Dataset  LDA  5NN  NB  SVM (RBF)  RF  ForestDens  

Ionosphere  34  351  
Liver disorder  6  345  
Ozone  72  1,847  
Parkinsons  22  195  
Pima diabetes  8  733  
Sonar  60  208  
SPECT heart  44  267  
Vertebral column  6  310  
WI breast (diag)  32  198  
WI breast (prog)  30  569 
The misclassification error was estimated by 10fold crossvalidation, with the folds sampled in a stratified manner so that they have approximately the same proportions of class labels as the full dataset. The crossvalidation was repeated 10 times to account for the variance in the error estimates due to taking a different random partition of the data. For each dataset, the various classifiers were learned on the same training sets and their performance evaluated on the same test sets. In particular, the crossvalidation folds were the same for all the experiments on each dataset. Since some datasets had a large class imbalance, we evaluated each classifier by its Balanced Error Rate (BER),
Sensitivity is the true positive rate, equal to the proportion of predicted positives from the positive set. Specificity is the true negative rate, equal to the proportion of predicted negatives from the negative set. Note that the misclassification error rates for the simulated datasets in the previous subsection were estimated on a test set with an equal number of samples in each class, and hence are equivalent to the BER.
The empirical mean and standard deviation of the BER taken across the 10 runs and crossvalidation folds are given in Table 1. For each dataset, the classifier with the smallest balanced error rate is underlined. Classifiers which are not significantly different from it according to a twosided paired test at the Bonferronicorrected level appear in bold. In half of the cases, more than four classifiers were found to not be significantly different from each other and none were bolded. The tests were performed with means and variances estimated using the 100 individual accuracies (from the 10 repeated 10fold crossvalidations) and with degrees of freedom.
The performance of the forest density classifier was better than SVM (RBF) in six of the ten datasets, including the two most highly imbalanced datasets Ozone and SPECT, and outperformed naïve Bayes and Random Forest in seven of the ten domains. Only for the Parkinsons dataset the forest density classifier performed worse than all of the other classifiers tested. These experiments demonstrate that the forest density classifier has the flexibility to perform well, even when the forest density assumption does not hold.
6 Discussion
In this work, we proposed the Forest Density Classifier. A semiparametric procedure for classifying vectors with foreststructured classconditional densities. A key advantage of this approach is that it requires only the estimation of univariate and bivariate densities. Empirical results on simulated and real datasets demonstrate that this method is competitive with popular nonparametric classifiers on a broad set of examples.
In recent years, the winners of many data modeling competitions have made use of extensive feature engineering, whereby many new features are generated from the original features in the dataset (for example, Koren (2009); Narayanan et al. (2011)). Our results suggest that the estimated log densities and are interesting features to consider for a wide range of classification problems. In our experiments and theoretical analysis, we construct a linear SVM that uses only the estimated log densities. However, one may choose to also keep the original features as did Fan et al. (2016) or to combine the log density features with entirely different features and classification methods.
The running time of our approach is typically dominated by the computation of the estimated log densities for all data points. Using a standard kernel density estimator, this cost is . However, since the computation involves only 1dimensional and 2dimensional density estimates, much more efficient methods exist that can reduce the total running time to (Ram et al., 2009). Regardless of the method of density estimation, the proposed procedure incurs computational and memory costs that are quadratic in the dimension. In very highdimensional settings, feature selection or sparsityinducing penalty terms may be added, as did Neumann et al. (2005) and Zhu et al. (2004). To reduce the memory footprint one may avoid storing the transformed samples, and instead to apply kernel SVM with a datadependent kernel .
This work was supported by a Texas A&MWeizmann research grant from Paul and Tina Gardner.
Appendix A Proofs
{proof}of Lemma 4.1. The first part follows immediately from Assumption 1, since
We now turn to the second part of the lemma.
By the mean value theorem, for any there exists some such that . It follows that
(11) 
Hence,
By Assumptions 1 and 2, w.h.p. , therefore
of Lemma 4.2. By definition,
Using the triangle inequality we have
Since the hingeloss is 1Lipschitz and , we can simplify the bound further to
By the CauchySchwarz inequality and Assumption 3,
The bound for the empirical risk is proved in the same manner.
of Theorem 4.2. We decompose the difference of risks into 4 terms,
(12)  
We now bound each of these terms separately. To bound the first and fourth terms we apply a generalization bound for the softmargin SVM. First recall that by Lemma 4.1, . It follows from Theorem 26.12 of BenDavid and ShalevShwartz (2014), that the following bound is satisfied w.h.p. over .
(13) 
In particular, this gives a high probability bound on the fourth term of Eq. (12). Using Corollary 4.1 we can similarly bound the first term of Eq. (12). w.h.p over ,
(14) 
To bound the second term of Eq. (12), note that is the minimizer of , hence
The proof concludes by bounding the third term of Eq. (12) using Lemma 4.2,
Appendix B Additional Experiments
Tables 25 provide the misclassification error rates, in percentage points of the different classification methods considered under the various foreststructured distributions. As expected, with sufficient sample size so that the univariate and bivariate density estimates are accurate, our proposed method gives the best results.
Common  

structure  Class imbalance  5NN  NB  SVM (RBF)  RF  ForestDens  
None  Unbalanced  
Balanced  
2/3 Shared  Unbalanced  
Balanced  
Common  

structure  Class imbalance  5NN  NB  SVM (RBF)  RF  ForestDens  
None  Unbalanced  
Balanced  
2/3 Shared  Unbalanced  
Balanced  
Common  

structure  Class imbalance  5NN  NB  SVM (RBF)  RF  ForestDens  
None  Unbalanced 