# An efficient model-free estimation of multiclass conditional probability

## Abstract

Conventional multiclass conditional probability estimation methods, such as Fisher’s discriminant analysis and logistic regression, often require restrictive distributional model assumptions. In this paper, a model-free estimation method is proposed to estimate multiclass conditional probability through a series of conditional quantile regression functions. Specifically, the conditional class probability is formulated as the difference of the corresponding cumulative distribution functions, where the cumulative distribution functions can be converted from the estimated conditional quantile regression functions. The proposed estimation method is also efficient as its computation cost does not increase exponentially with the number of classes. The theoretical and numerical studies demonstrate that the proposed estimation method is highly competitive against the existing competitors, especially when the number of classes is relatively large.

## 1 Introduction

Estimation of conditional class probability is important in statistical machine learning since the conditional class probability measures the strength and confidence of the classification outcomes. It also provides supplemental information to the classification labels, such as hazard reduction in “evidence-based” medication (Wahba, 2002) and pixel spectrum in remote sensing (Xu, 2005). In multiclass classification, a training sample $(x_i, y_i)_{i=1}^n$ is available with covariate $x_i$ and class label $y_i \in \{1, \ldots, K\}$, where $K$ is the number of classes. Because the response is discrete, the conditional distribution of $Y$ given $X = x$ can be fully characterized by the conditional class probability $p_k(x) = P(Y = k \mid X = x)$. Estimation of $p_k(x)$ is the primary goal of this paper, which is also known as soft classification (Wahba, 2002; Liu, Zhang and Wu, 2011), as opposed to hard classification, which mainly focuses on predicting the class labels without estimating probability.

In the literature, many classical probability estimation methods have been developed based on certain distributional model assumptions. For instance, Fisher’s discriminant analysis assumes that the covariates within each class follow multivariate Gaussian distributions with homogeneous or heteroscedastic covariance matrices. Relaxing the Gaussian distribution assumption, multiple logistic regression takes one class as the baseline and assumes that the logarithms of all the odds ratios are linear functions of the covariates. Although these estimation methods have been widely used in practice, their distributional model assumptions are generally difficult to verify, and the methods may perform suboptimally when the assumptions are violated.

To circumvent the restrictive distributional assumptions, various model-free probability estimation methods have been proposed and have gained popularity among practitioners. The classification tree is a popular model-free classification method that produces probabilistic outputs; however, it can be over-sensitive to the training set and thus suffers from over-fitting and instability (Breiman, 1996). Wang, Shen and Liu (2008) propose a model-free binary conditional probability estimation method that brackets the conditional probability through a series of weighted binary large-margin classifiers with various weights $\pi$. The method is based on the property that a consistent weighted binary large-margin classifier aims at estimating $\mathrm{sign}(p(x) - \pi)$, where $p(x)$ denotes the binary conditional probability, and hence a small bracket containing $p(x)$ can be obtained from the estimated signs for different $\pi$'s. A number of attempts have been made to extend the binary estimation method to the multiclass case. Hastie and Tibshirani (1998) and Wu, Lin and Weng (2004) develop the pairwise coupling method, which converts multiclass probability estimation into estimating multiple one-vs-one binary conditional probabilities. Wu, Zhang and Liu (2010) directly extend the idea of Wang, Shen and Liu (2008) by designing a way of assigning weights to the multiple classes, and then produce the estimated conditional probability by searching for the $K$-vertex polyhedron that contains it. However, both methods are computationally intensive, as the number of one-vs-one binary classifications is of order $K^2$ and the number of $K$-vertex polyhedrons increases exponentially with $K$, where $K$ is the number of classes.

In this paper, an efficient bracketing scheme is proposed for estimating the multiclass conditional probability via a series of estimated conditional quantile functions (Koenker and Bassett, 1978; Koenker, 2005). The key idea is that the conditional class probability can be formulated as the difference of the corresponding cumulative distribution functions, which can be obtained through a series of estimated conditional quantiles of the class label given the covariates. Compared with other model-free estimation methods, the proposed estimation method is computationally efficient in that its computational cost does not increase exponentially with the number of classes, which is desirable especially when the number of classes is large. The solution surface of the regularized quantile regression estimation (Rosset, 2009) can further alleviate the computational burden. More importantly, the asymptotic property of the proposed estimation method is established, which shows that the proposed estimator achieves a fast convergence rate to the true conditional class probability. The simulation studies and real data analysis also demonstrate that the proposed method is highly competitive against the existing competitors.

The rest of the paper is organized as follows. Section 2 presents the proposed multiclass conditional probability estimation method along with its computational implementation. A tuning parameter selection criterion is also introduced. Section 3 establishes the asymptotic convergence property of the proposed method. Section 4 examines the numerical performance of the proposed estimation method in both simulated examples and real applications. Section 5 contains some discussion, and the appendix is devoted to technical proofs.

## 2 Multiclass probability estimation via quantile estimation

This section presents the novel model-free estimation method for multiclass conditional probability and its computational implementation.

### 2.1 Multiclass probability estimation via quantile regression

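In multiclass classification with $Y \in \{1, \ldots, K\}$, estimation of the conditional class probability $p_k(x) = P(Y = k \mid X = x)$ is equivalent to estimation of $P(Y \le k \mid x)$ due to the following decomposition,

$$p_k(x) = P(Y \le k \mid x) - P(Y \le k - 1 \mid x),$$

where $P(Y \le k \mid x)$ is the conditional cumulative distribution function of $Y$ given $X = x$. Furthermore, the estimated $P(Y \le k \mid x)$ can be constructed through a series of estimated quantile regression functions, since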

where represents the -th conditional quantile of given , defined as

Since is discrete and only takes values in , estimating can encounter various difficulties such as discontinuity as discussed in Machado et al. (2005) and Chen et al. (2010). A simple treatment is to jitter the discrete response by adding some continuous noise. Specifically, denote the jittered response , where follows a uniform distribution on and is independent of , and denote as the -th quantile of given . With jittering, becomes continuous, is strictly increasing in , and thus is also continuous and strictly increasing in . More importantly, , and

Combining the results, can be explicitly connected with as in Lemma 1.

*Lemma 1.* The -th quantile of given is where is set to be 0 for simplicity.
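The effect of jittering can be checked numerically at a single fixed covariate value. The sketch below is illustrative only: the jitter direction ($Y^* = Y - U$ with $U$ uniform on $[0, 1)$), the probability vector `p`, and the function name `jittered_quantile` are assumptions, not taken from the paper. Under that assumption, for $\tau$ between the cumulative probabilities of classes $k-1$ and $k$, the quantile of the jittered response is $k - 1 + (\tau - \sum_{j<k} p_j)/p_k$, and the code compares this closed form with Monte Carlo quantiles.

```python
import numpy as np

def jittered_quantile(tau, p):
    """tau-th quantile of Y* = Y - U at a fixed x (assumed jitter: U ~ Uniform[0, 1)).

    p[k-1] is the probability of class k, for k = 1, ..., K.
    """
    cdf = np.concatenate(([0.0], np.cumsum(p)))      # cdf[k] = P(Y <= k), cdf[0] = 0
    k = int(np.searchsorted(cdf, tau, side="left"))  # class with cdf[k-1] < tau <= cdf[k]
    k = min(max(k, 1), len(p))                       # guard against rounding at the ends
    return k - 1 + (tau - cdf[k - 1]) / p[k - 1]

rng = np.random.default_rng(0)
p = np.array([0.23, 0.31, 0.46])                     # hypothetical class probabilities
y = rng.choice(np.arange(1, 4), size=200_000, p=p)   # discrete labels
y_star = y - rng.uniform(0.0, 1.0, size=y.size)      # jittered, continuous response

taus = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
closed_form = np.array([jittered_quantile(t, p) for t in taus])
monte_carlo = np.quantile(y_star, taus)
max_gap = np.max(np.abs(closed_form - monte_carlo))
print(closed_form, monte_carlo, max_gap)
```

The quantile is piecewise linear and strictly increasing in $\tau$, with a kink at each cumulative class probability, which is what makes the inversion back to class probabilities possible.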

By Lemma 1, , and then

Therefore, estimation of boils down to estimating quantile regression function for various ’s. Specifically, let be a sequence of ’s, and , be the estimated ’s. According to (Equation 3), can be estimated as

where and for simplicity.
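The bracketing step can be sketched as follows for a single covariate value. This is one plausible reading rather than the paper's exact estimator: it assumes the jitter $Y^* = Y - U$ with $U$ uniform on $[0, 1)$, pads the grid with 0 and 1, and takes the midpoint of the bracket that contains each cumulative probability. Sample quantiles of simulated data stand in for fitted quantile regression functions, and the names `probs_from_quantiles`, `p_true`, and `m` are hypothetical.

```python
import numpy as np

def probs_from_quantiles(q_hat, taus, n_classes):
    """Class probabilities at one x from estimated quantiles of the jittered response.

    q_hat[j] estimates the taus[j]-th quantile; taus must be increasing in (0, 1).
    Assumes P(Y* <= k | x) = P(Y <= k | x) and takes the midpoint of each bracket.
    """
    grid = np.concatenate(([0.0], taus, [1.0]))   # pad the grid with 0 and 1
    cdf = np.zeros(n_classes + 1)
    cdf[n_classes] = 1.0
    for k in range(1, n_classes):
        j = int(np.sum(np.asarray(q_hat) <= k))   # number of grid quantiles at or below k
        cdf[k] = 0.5 * (grid[j] + grid[j + 1])    # midpoint of the bracket around P(Y <= k)
    return np.diff(cdf)                           # differences of successive CDF values

rng = np.random.default_rng(1)
p_true = np.array([0.23, 0.31, 0.46])             # hypothetical class probabilities
y = rng.choice(np.arange(1, 4), size=100_000, p=p_true)
y_star = y - rng.uniform(0.0, 1.0, size=y.size)   # assumed jitter direction
m = 19
taus = np.arange(1, m + 1) / (m + 1)              # equally spaced grid
q_hat = np.quantile(y_star, taus)                 # stand-in for fitted quantile regressions
p_hat = probs_from_quantiles(q_hat, taus, n_classes=3)
print(p_hat)
```

With an equally spaced grid of `m` points, each cumulative probability is located to within half a grid spacing, so the resolution of the estimate is governed by the grid as well as by the accuracy of the fitted quantiles.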

Note that can be estimated by any existing quantile regression estimation method, such as He, Ng and Portnoy (1998), Li, Liu and Zhu (2007), Wang, Zhu and Zhou (2009), Yang and He (2012), and many others. For illustration, we adopt the nonparametric method in Li, Liu and Zhu (2007), which is formulated as

where is a reproducing kernel Hilbert space (RKHS; Wahba 1990) induced by a pre-specified kernel function , is the check loss function and is the associated RKHS norm. It is shown in Li, Liu and Zhu (2007) that the estimated based on (Equation 5) converges to in terms of for any , where .
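For readers less familiar with the check loss, the sketch below uses its standard form from the quantile regression literature, $\rho_\tau(u) = u\,(\tau - I(u < 0))$, and verifies numerically that the constant minimizing the empirical check loss is a sample quantile. It omits the covariates, the RKHS, and the penalty term of (Equation 5); the simulated data and variable names are illustrative assumptions.

```python
import numpy as np

def check_loss(u, tau):
    """Check (pinball) loss: tau * u for u >= 0 and (tau - 1) * u for u < 0."""
    u = np.asarray(u, dtype=float)
    return np.where(u < 0, tau - 1.0, tau) * u

rng = np.random.default_rng(2)
y = rng.normal(size=2001)                          # hypothetical continuous response
tau = 0.3

# Evaluate the empirical risk at every observed value and keep the best constant.
candidates = np.sort(y)
risk = np.array([check_loss(y - c, tau).mean() for c in candidates])
best_constant = candidates[np.argmin(risk)]
sample_quantile = np.quantile(y, tau)
print(best_constant, sample_quantile)
```

The asymmetry of the loss is what targets the $\tau$-th quantile: under-prediction is weighted by $\tau$ and over-prediction by $1 - \tau$.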

As computational remarks, the proposed estimation method in (Equation 4) only requires fitting conditional quantile functions. The optimal value of , as shown in Section 3, only relies on the asymptotic behavior of the quantile regression estimation. The grid points can be simply set as equally spaced points on , and more sophisticated adaptive designs can be employed as well. For comparison, when the number of grid points along each direction is , the computational complexity of the proposed method is , whereas the complexity of the method in Wu et al. (2010) is . It is clear that the proposed method is computationally more efficient as its complexity does not increase exponentially with . Furthermore, although the true is strictly increasing in , the fitted quantile regression functions may cross each other and thus become inconsistent with the order of (He, 1997), leading to suboptimal estimation of in practice. To prevent this, non-crossing constraints as in Wu and Liu (2009), Bondell, Reich and Wang (2010) and Liu and Wu (2011) can be enforced. Finally, the estimation performance of (Equation 5) largely depends on the choice of the tuning parameter , which needs to be appropriately determined.
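A simple way to see what quantile crossing looks like, and one way to repair it after fitting, is sketched below. It uses monotone rearrangement (sorting the fitted quantiles along the quantile-level axis), which is a different and simpler technique than the constrained estimators cited above; the function name and the toy numbers are assumptions for illustration.

```python
import numpy as np

def rearrange_quantiles(q_hat):
    """Sort fitted quantiles along the tau axis so they are nondecreasing in tau.

    q_hat has shape (n_points, n_taus), with columns ordered by increasing tau.
    """
    return np.sort(np.asarray(q_hat, dtype=float), axis=1)

# Hypothetical fitted quantiles at two covariate values and four quantile levels.
q_hat = np.array([[0.4, 0.9, 0.8, 1.6],    # crossing between the 2nd and 3rd level
                  [1.1, 1.3, 2.2, 2.0]])   # crossing between the 3rd and 4th level
q_fixed = rearrange_quantiles(q_hat)
print(q_fixed)
```

Rearrangement leaves already monotone rows unchanged, so it can be applied unconditionally as a post-processing step.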

### 2.2 Model tuning and solution surface

In this section, a data adaptive model tuning method for multiclass conditional probability estimation is developed. To indicate the dependency on the tuning parameter , we denote the estimated conditional probability as and the quantile regression function as . The overall performance of in estimating is evaluated by the generalized Kullback-Leibler (GKL) loss between and ,

The corresponding comparative GKL loss, after omitting -unrelated terms in (Equation 6), is

It is natural to estimate by its empirical version,

where is an indicator function. However, often underestimates especially when the estimation model is over-complicated.

To remedy the underestimation bias, can be estimated similarly as in Wang, Shen and Liu (2008) by searching for the optimal correction terms for . Specifically, minimizing the distance between and a class of candidate estimators of form -dependent penalty with yields that

where . Here, evaluates the accuracy of estimating on , which is similar to the covariance penalty in Efron (2004) and the generalized degrees of freedom in Shen and Huang (2006), and the term is a correction term that adjusts for the effect of random covariates on prediction and needs to be estimated; cf. Breiman and Spector (1992) and Breiman (1992).

To construct the estimated and , the data perturbation technique (Wang and Shen, 2006) can be adopted. The key idea is to evaluate the generalization ability of the probability estimation method by its sensitivity to the local perturbations of and . The estimation formula can be derived via derivative estimation and approximated through a Monte Carlo approximation. The exact expressions are similar to (11) and (12) in Wang, Shen and Liu (2008) and are thus omitted here.

Note that the data perturbation technique requires fitting the quantile regression function multiple times for various ’s and ’s, and thus can be computationally expensive. To further reduce the computation cost, the solution surface of the coefficient of with respect to and can be constructed following Rosset (2009). In particular, Li et al. (2007) and Takeuchi et al. (2009) show that the solution path of is piecewise linear with respect to (or ) when (or ) is fixed; Rosset (2009) explores the bi-level path of regularized quantile regression and shows that the solution surface of can be efficiently constructed with respect to both and . The solution surface is mapped as a piecewise linear function of or and the possible locations of the bi-level optima can be found in one run of the base algorithm. In other words, the coefficient of for various ’s and ’s can be obtained at essentially the same computation cost as a single run of the base algorithm. Figure 1 displays for a fixed as a function of and in a randomly selected replication of the simulated Example 1.

Figure 1 here.

## 3 Statistical learning theory

This section establishes the asymptotic convergence of the proposed multiclass conditional probability estimation method, measured by

The convergence rate is quantified in terms of the tuning parameter , the number of brackets , sample size , and the cardinality of .

### 3.1 Asymptotic theory

The following technical assumptions are made.

*Assumption 1.* For any , there exists , such that for some positive sequence as .

This is analogous to Assumption 1 in Wang et al. (2008) and ensures that the true quantile regression function can be well approximated by .

*Assumption 2.* For any and , there exist constants and such that

Assumption 2 describes the local smoothness of within the neighborhood of . Note that with

by Lemma 4 in Li et al. (2007), so Assumption 2 is the same as Assumption A in Li et al. (2007).

Next, we measure the cardinality of by the -metric entropy with bracketing. Given any , is an -bracketing function set of if for any there exists an such that , and for all . The -metric entropy with bracketing is then defined as the logarithm of the cardinality of the smallest -bracketing function set of . Denote , and .

*Assumption 3.* For some positive constants and , there exists some such that

where and .

*Theorem 1.* Suppose Assumptions 1-3 are met, and there exists such that for any . For obtained as in (), provided that , where .

*Corollary 1.* Under the assumptions in Theorem 1, provided that diverges as .

Theorem 1 and Corollary 1 provide probability and risk bounds for . They also suggest the ideal to be of order , yielding the fast rate of for .

### 3.2 A theoretic example

To illustrate the asymptotic theory, a simple theoretic example is considered. Let be sampled from a uniform distribution on and be sampled according to if and 0.1 otherwise. Let , where is the Gaussian kernel.

To verify Assumption 1, note that for any , is continuous in except at and . For given , define

then is a continuous function of , and . Furthermore, as is continuous, Steinwart (2001) shows that there exists a such that . Therefore, . Since then

To verify Assumption 2, note that , and

where if Therefore, Assumption 2 is satisfied with and .

To verify Assumption 3, since (Zhou, 2002) for any given and is nonincreasing in , there exist positive constants , such that

Without loss of generality, assume , and then . Solving (Equation 8) yields that , when .

Finally, by Corollary 1, . This implies that when is set as .

## 4 Numerical experiments

This section examines the effectiveness of the proposed multiclass probability estimation method in simulated and real examples. The numerical performance of the proposed method (OUR) is compared against three popular competitors: baseline logistic model (BLM), classification tree (TREE) and weighted multiclass classification (WMC; Wu et al., 2010). For illustration, the number of quantiles in our method is set as . The kernel function used in each method is set as the Gaussian kernel , where the scale parameter is set as the median of pairwise Euclidean distances within the training set. To optimize the performance of each estimation method, a grid search is employed to select the tuning parameter as in Section 2.2. The grid used in all examples is set as . A more refined grid search can be employed to further improve the numerical performance at the cost of increased computation burden.
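The kernel choice described above can be sketched as follows. The parametrization $\exp(-\|x - x'\|^2 / (2\sigma^2))$ is an assumption, since the exact form of the kernel is not recoverable from the text; only the rule of setting the scale to the median pairwise Euclidean distance within the training set is taken from the paper. The function name and the simulated covariates are hypothetical.

```python
import numpy as np

def gaussian_kernel_median(X):
    """Gaussian kernel matrix with the scale set to the median pairwise distance.

    Assumed parametrization: K(x, x') = exp(-||x - x'||^2 / (2 * sigma^2)).
    """
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)  # squared distances
    dist = np.sqrt(d2[np.triu_indices(len(X), k=1)])                 # distinct pairs only
    sigma = np.median(dist)
    return np.exp(-d2 / (2.0 * sigma ** 2)), sigma

rng = np.random.default_rng(3)
X_train = rng.normal(size=(50, 2))     # hypothetical training covariates
K_train, sigma = gaussian_kernel_median(X_train)
print(K_train.shape, sigma)
```

Tying the scale to the data in this way removes one tuning parameter from the grid search, leaving only the regularization parameter to be selected.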

In simulated examples where the true conditional probability is known, the performance of each estimation method is measured by its distance to . Various distance measures between and are computed based on the testing set,

where denotes the testing set, and is the cardinality of . To avoid degeneration in computing GKL loss and CEE, a small correction constant is added to when necessary.
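Two of the distance measures named above can be sketched as follows. The exact definitions used in the paper are not recoverable from the text, so the forms below are assumptions: the GKL loss is taken as the test-set average of $\sum_k p_k \log(p_k / \hat{p}_k)$, and CEE as the average of $-\log \hat{p}_{y_i}(x_i)$. The correction constant `eps` and all names are hypothetical.

```python
import numpy as np

def gkl_loss(p_true, p_hat, eps=1e-8):
    """Average Kullback-Leibler divergence from p_true to p_hat over the rows."""
    p_true = np.asarray(p_true, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float) + eps          # small correction constant
    ratio = np.where(p_true > 0, p_true / p_hat, 1.0)     # 0 * log(0 / q) treated as 0
    return float(np.mean(np.sum(p_true * np.log(ratio), axis=1)))

def cee(y, p_hat, eps=1e-8):
    """Cross-entropy error: average of -log p_hat at the observed class (labels 1..K)."""
    p_hat = np.asarray(p_hat, dtype=float) + eps
    idx = np.asarray(y) - 1
    return float(-np.mean(np.log(p_hat[np.arange(len(idx)), idx])))

p_true = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.1, 0.8]])   # hypothetical true probabilities on two test points
p_unif = np.full((2, 3), 1.0 / 3.0)    # an uninformative estimate
y_test = np.array([1, 3])
print(gkl_loss(p_true, p_unif), cee(y_test, p_unif))
```

Unlike the GKL loss, CEE needs only the observed labels, which is why it remains computable when the true conditional probability is unknown.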

### 4.1 Simulated examples

Five simulated examples are generated for comparison.

*Example 1.* First, is generated uniformly over . Next, given , the covariates are generated from , a multivariate $t$-distribution with , and 2 degrees of freedom. The training size is 400, and the testing size is 2600.

*Example 2.* First, is generated uniformly over . Next, given , the covariates are generated from , where and . The training size is 400, and the testing size is 2600.

*Example 3.* First, is generated uniformly over . Next, given , the covariates are generated from , where and . The training size is 400, and the testing size is 2600.

*Example 4.* First, is generated uniformly over . Next, given , the covariates are generated from , where , and if is odd and if is even. The training size is 400, and the testing size is 2600.

*Example 5.* First, is generated uniformly over . Next, given , the covariates are generated from , where , and if is odd and if is even. The training size is 400, and the testing size is 2600.

Examples 1-3 are generated similarly, but with different numbers of classes and different mean vectors . When the number of classes gets larger, the data generated from different classes overlap more and thus the resultant classification becomes more difficult. Examples 4 and 5 include additional noise variables and heteroscedastic covariance matrices. Each simulated example is repeated 50 times, and the averaged test errors and the corresponding standard deviations are reported in Table 1.

Table 1 here.

Evidently, the proposed estimation method delivers superior numerical performance, and outperforms BLM, TREE and WMC in all the examples. As a model-free method, WMC yields competitive performance in Example 1 and Example 4 with where the data from different classes are relatively far apart, leading to a clear-cut classification boundary. However, when gets larger, WMC requires much more intensive computing power, and its numerical performance appears to be less satisfactory in Examples 2 and 5 with . Furthermore, the performance of WMC in Example 3 with is not reported in Table 1, since it is computationally expensive to achieve reasonably good estimation accuracy.

### 4.2 Real applications

In this section, the proposed multiclass probability estimation method is applied to the iris data, the white wine quality data and the abalone data. All datasets are publicly available at the University of California Irvine Machine Learning Repository (*http://archive.ics.uci.edu/ml/*).

The iris data has 4 continuous attributes: sepal length, sepal width, petal length, and petal width, and three classes: Setosa, Versicolour, and Virginica. The size of the iris dataset is 150, and each class has 50 observations. We randomly select 30 observations from each class as the training set, and the remaining 60 observations are used for testing. The white wine quality data has 11 attributes, which characterize various aspects of the white wines, and the response ranges from 0 to 10, representing quality scores given by wine experts. For illustration, we focus only on three classes with quality scores 5, 6 and 7, and a total of 4535 white wines are selected, where 1457, 2198 and 880 white wines score 5, 6 and 7, respectively. We randomly select 100 white wines from each class as the training set, and the remaining 4235 white wines are used for testing. The abalone data has 8 attributes on various physical measurements of an abalone, and 29 classes representing different ages of an abalone. Since some extreme classes have very few abalones, we only focus on the largest classes with . Specifically, for , classes are selected with a total of 2768 abalones; for , classes are selected with a total of 3498 abalones; for , classes are selected with a total of 3739 abalones. In all scenarios, we randomly select 50 abalones from each class as the training set, and keep the remaining abalones for testing.
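The per-class sampling scheme described above can be sketched as follows, using the class sizes of the iris data (three classes of 50, with 30 per class for training). The function name `stratified_split` and the seed are hypothetical; the sketch only produces index sets and does not load any of the datasets.

```python
import numpy as np

def stratified_split(y, n_per_class, rng):
    """Indices of a training set with n_per_class points per class, and of the rest."""
    y = np.asarray(y)
    train = np.concatenate([
        rng.choice(np.flatnonzero(y == k), size=n_per_class, replace=False)
        for k in np.unique(y)
    ])
    test = np.setdiff1d(np.arange(len(y)), train)   # everything not drawn for training
    return train, test

rng = np.random.default_rng(4)
y = np.repeat([1, 2, 3], 50)                         # class sizes of the iris data
train_idx, test_idx = stratified_split(y, n_per_class=30, rng=rng)
print(len(train_idx), len(test_idx))
```

Repeating the split with a fresh random draw in each replication gives the 50 replications over which the errors are averaged.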

Note that the true conditional probability is not available in the real applications, so only CEE is computed and used for comparison. In addition, we also compare the averaged misclassification error (MCE) of each probability estimation method on the testing set, where the classification label is predicted as , and MCE is defined as

The averaged CEE and MCE over 50 replications are reported in Table 2.
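The misclassification error can be sketched as below, assuming that the predicted label is the class with the largest estimated probability and that MCE is the fraction of test points whose predicted label differs from the observed one. The toy probabilities and labels are hypothetical.

```python
import numpy as np

def predict_labels(p_hat):
    """Predicted class (labels 1..K) with the largest estimated probability."""
    return np.argmax(np.asarray(p_hat), axis=1) + 1

def mce(y, p_hat):
    """Fraction of test points whose predicted label differs from the observed one."""
    return float(np.mean(predict_labels(p_hat) != np.asarray(y)))

p_hat = np.array([[0.60, 0.30, 0.10],
                  [0.20, 0.50, 0.30],
                  [0.10, 0.20, 0.70],
                  [0.40, 0.35, 0.25]])   # hypothetical estimated probabilities
y_test = np.array([1, 2, 3, 2])          # hypothetical observed labels
error = mce(y_test, p_hat)
print(error)                             # 0.25: only the last point is misclassified
```

MCE depends on the estimated probabilities only through their ranking, so it is a coarser summary than CEE.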

Table 2 here.

It is evident that the proposed probability estimation method delivers competitive results against other competitors. It yields the smallest CEE and MCE in all real examples, except that WMC produces slightly smaller CEE in the iris example. The performance of WMC is not reported for the abalone example with and due to the computational burden.

## 5 Summary

This paper proposes an efficient model-free multiclass conditional probability estimation method, where the estimated probabilities are constructed via a series of estimated conditional quantile regression functions. The proposed method does not require any distributional model assumption, and it is computationally efficient as its computation cost does not increase exponentially with the number of classes. The asymptotic convergence rate of the proposed method is established, and the numerical experiments with both simulated examples and real applications demonstrate the advantage of the proposed method, especially when the number of classes is large. In addition, the conditional class probability can be regarded as the conditional density of the discrete response, and thus the proposed method can be naturally extended to a general framework of conditional density estimation (Hansen, 2004).

## Appendix: technical proofs

**Proof of Lemma 1.** When ,

The desired result follows immediately.

**Proof of Theorem 1.** First, note that with , and then

Therefore, it suffices to bound for any .

Next, for simplicity, denote , , and . Simple calculation yields that

where the first inequality follows from the fact that is bounded by 1. Therefore, bounding boils down to bounding .

Based on the estimation method in (Equation 4), there exists , such that , and then and . Let , and we will show the relationship in the following four cases.

*Case 1.* If and , then . Based on Lemma 1,

which implies that .

*Case 2.* If and , then by Lemma 1,