Pointwise adaptation via stagewise aggregation of local estimates for multiclass classification
Financial support by the Russian Academic Excellence Project and by the German Research Foundation (DFG) through the Collaborative Research Center 1294 is gratefully acknowledged.
We consider a problem of multiclass classification in which the training sample is generated from a model whose conditional class probabilities are unknown Lipschitz functions of the feature vector. Given a test point, our goal is to estimate the conditional class probabilities at this point. An approach based on nonparametric smoothing uses a localization technique: the weight of an observation depends on its distance to the test point. However, local estimates strongly depend on the localizing scheme. In our solution we fix several schemes, compute the corresponding local estimates for each of them and apply an aggregation procedure. We propose an algorithm that constructs a convex combination of the estimates such that the aggregated estimate behaves approximately as well as the best one from the collection. We also study theoretical properties of the procedure, prove oracle results and establish rates of convergence under mild assumptions.
MSC subject classifications: Primary 62H30; secondary 62G08, 62G10.
- 1 Introduction
- 2 Setup and notations
- 3 Algorithm and numerical experiments
- 4 Theoretical properties
- A Proof of Lemma 4
Multiclass classification is a natural generalization of the well-studied problem of binary classification with a wide range of applications. It is a problem of supervised learning in which one observes a sample of feature–label pairs, where each feature vector lies in a subset of a Euclidean space and each label takes one of $M$ values. The pairs are generated independently according to some distribution over the feature–label space. The learner's task is to find a classification rule in order to make the probability of misclassification
as small as possible. For a given class of admissible functions, one is often interested in the excess risk
which shows how far the classifier is from the best one in the class. Note that in this setting the rule may be chosen outside of the class.
Concerning the multiclass learning problem, one can distinguish between two main approaches. The first one reduces the problem to binary classification. The most popular and straightforward examples of such techniques are One-vs-All (OvA) and One-vs-One (OvO). Another example of reduction to the binary case is given by error correcting output codes (ECOC) . In  this approach was generalized to margin classifiers. A similar approach uses tree-based classifiers. Methods of the second type solve a single problem, as is done in multiclass SVM  and the multiclass one-inclusion graph strategy . One can refer to  for a brief overview of multiclass classification methods. Daniely, Sabato and Shalev-Shwartz in  compared OvA, OvO, ECOC, tree-based classifiers and multiclass SVM for linear discrimination rules in a finite-dimensional space. According to their theoretical study, multiclass SVM outperforms the OvA method. In  Crammer and Singer also showed the superiority of multiclass SVM on several datasets. Nevertheless, in our work we will use One-vs-All for two reasons. First, we will consider a broad nonparametric class of functions, and the results in  do not cover this case. Second, in  Rifkin and Klautau showed that OvA behaves comparably to multiclass SVM if the binary classifier in OvA is strong.
For each class $m$, we construct binary labels indicating whether an observation belongs to class $m$ and assume that, given the feature vector, the conditional distribution of such a binary label is Bernoulli with a class-dependent success probability. This model is very general and covers all possible distributions of the label given the features. We must put some restrictions on the conditional class probability functions. We will provide learning guarantees for the class of Lipschitz functions, i.e. we assume that there exists a constant such that for all pairs of points and for all $m$ from $1$ to $M$ it holds
For this model, the optimal classifier can be found analytically
Unfortunately, the true values are unknown; therefore we study a plug-in rule
where the plugged-in values are estimates of the conditional class probabilities. This reduces the problem of classification to a regression problem. In  it was shown that in general the regression problem is more difficult than classification. Fortunately, for some classes (including the class of Lipschitz functions), classification and regression have similar complexities, as was shown in .
For problems of nonparametric regression, different localization techniques are often used. Namely, one considers an estimate defined by maximization of a localized log-likelihood
where the summand is the log-likelihood of the $i$-th observation, the collection of weights is called a localizing scheme, and the localizing weights depend on the test point and the observations. Particular examples of this technique are the Nadaraya–Watson estimator, local polynomial estimators and nearest-neighbor-based estimators.
Note that the estimate strongly depends on the localizing scheme, and its choice determines the performance of the classifier. Moreover, in multiclass learning there is a common problem of class imbalance, i.e. some classes may not be represented in a small vicinity of a given point. Obviously, one localizing scheme is not enough for such a situation. To solve this problem, we consider several localizing schemes, compute local likelihood estimates (also called weak estimates) for each of them and use a plug-in classifier based on a convex combination of these estimates. The aggregation of weak estimates is a key feature of our procedure.
The aggregation of estimators takes its origins in model selection and was generalized to convex and linear aggregation in . In  and  optimal rates of aggregation were derived. Aggregation procedures have a wide range of applications and can be used in regression problems (, ), density estimation (, , ) and classification problems (, , ). They often solve an optimization problem in order to find the aggregating coefficients (, , , ). In some cases, such as exponential weighting (, ), the solution of the optimization problem can be written explicitly. Aggregation under the KL-loss was also studied in  and , where optimal rates of aggregation and exponential bounds were obtained. However, most of the existing aggregation procedures and results concern global aggregation. This means that the aggregating coefficients are universal and do not depend on the point where the classification rule is applied.
Our approach is based on local aggregation, yielding a point-dependent aggregation scheme. Moreover, the proposed procedure does not require solving an optimization problem. Instead, it sequentially finds a convex combination of the weak estimates which mimics the best possible choice of a model under the Kullback-Leibler loss at a given test point. The idea of the approach originates from , where an aggregation of binary classifiers was studied.
Finally, it is worth mentioning that nonparametric estimates have slow rates of convergence, especially in the case of large dimension. It was shown in  and then in  that plug-in classifiers can achieve fast learning rates under certain assumptions in both binary and multiclass classification problems. We will use a similar technique to derive fast learning rates for the plug-in classifier based on the aggregated estimate.
The main contributions of this paper are the following:
we propose an algorithm for multiclass classification, based on aggregation of local likelihood estimates, which works for a broad class of admissible functions;
the procedure is robust against class imbalance and outliers;
the computational time of the procedure grows moderately with the sizes of the train and test datasets, which makes it scalable to large problems;
theoretical guarantees claim optimal accuracy of classification with only a logarithmic payment for the number of classes and the number of aggregated estimates.
The paper is organized as follows. In Section 2 we introduce definitions and notation. In Section 3 we formulate the multiclass classification procedure and demonstrate its performance on both artificial and real-world datasets. Finally, in Section 4 we study theoretical properties of the procedure. In particular, we derive oracle results for model selection, establish rates of convergence for the problem of nonparametric estimation and provide bounds for the excess risk.
2 Setup and notations
Given a training sample, we apply the following probabilistic model. Suppose that, given the feature vectors, the labels have a conditional distribution
where the conditional class probabilities are non-negative and sum to one. The optimal classifier is the Bayes rule defined by
Unfortunately, the true values are unknown; therefore we fix a test point and consider a plug-in classifier
where the plugged-in values stand for estimates of the conditional class probabilities.
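As a minimal illustration (the function name is ours, not from the paper), the plug-in rule simply picks the class with the largest estimated conditional probability:

```python
import numpy as np

def plug_in_classify(theta_hat):
    """Plug-in rule: predict the class whose estimated conditional
    probability is largest (ties broken by the smallest index)."""
    return int(np.argmax(theta_hat))
```

For instance, estimates `[0.2, 0.5, 0.3]` yield class 1.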
Now, the problem is how to estimate the conditional class probabilities. Fix a class $m$ and transform the labels to binary:
It is clear that
where the success probability is the conditional probability of class $m$. This approach is nothing but the One-vs-All procedure for multiclass classification.
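The label transformation can be sketched as follows (a hypothetical helper, not taken from the paper):

```python
import numpy as np

def one_vs_all_labels(y, m):
    """One-vs-All transformation: the binary label equals 1 when the
    original label equals class m, and 0 otherwise."""
    return (np.asarray(y) == m).astype(int)
```

For example, for labels `[0, 2, 1, 2]` and class `m = 2` this yields `[0, 1, 0, 1]`.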
For a fixed test point and class, denote
One of the ways to estimate the success probability is to consider a localized log-likelihood
where the summand is the log-likelihood of one observation and the coefficients are some non-negative localizing weights. The local maximum likelihood estimate can be found explicitly.
For the log-likelihood function of the form (6), the estimate is given by the formula
where the two quantities denote the weighted sum of the binary labels and the total weight, respectively. Moreover, for any value of the parameter it holds
The proof of the proposition is straightforward: it only requires computing the derivative and setting it to zero.
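The resulting estimator is just a weighted mean of the binary labels; a minimal sketch (the zero-weight fallback is our assumption, not from the paper):

```python
import numpy as np

def local_bernoulli_mle(y_bin, w):
    """Maximizer of the localized Bernoulli log-likelihood: the weighted
    mean of the binary labels, sum_i w_i * Y_i / sum_i w_i."""
    w = np.asarray(w, dtype=float)
    y_bin = np.asarray(y_bin, dtype=float)
    total = w.sum()
    if total == 0.0:
        # no mass in the vicinity of the test point; return a neutral
        # value (our assumed fallback for this degenerate case)
        return 0.5
    return float(np.dot(w, y_bin) / total)
```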
We proceed with two examples of such weights.
Example 2.1: k nearest neighbors
For k-NN estimates we have
where the set consists of the k observations nearest to the test point. Then
Example 2.2: bandwidth-based kernel estimates
For bandwidth-based kernel estimates, the localizing weights are defined by the formula
where the norm is arbitrary, the scaling parameter is called the bandwidth, and the function is a localizing kernel. Standard examples of such kernels are the following:
We will use the Euclidean norm in the examples in Section 3.
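Both localizing schemes above can be sketched as follows (Euclidean norm, as in Section 3; the function names and the Gaussian fallback are our illustrative choices):

```python
import numpy as np

def knn_weights(X, x, k):
    """k-NN scheme: weight 1 for the k observations nearest to x, else 0."""
    d = np.linalg.norm(np.asarray(X, float) - np.asarray(x, float), axis=1)
    w = np.zeros(len(X))
    w[np.argsort(d)[:k]] = 1.0
    return w

def kernel_weights(X, x, h, kernel="epanechnikov"):
    """Bandwidth-based scheme: w_i = K(||x - X_i|| / h) for bandwidth h."""
    u = np.linalg.norm(np.asarray(X, float) - np.asarray(x, float), axis=1) / h
    if kernel == "epanechnikov":
        return np.maximum(1.0 - u ** 2, 0.0)
    return np.exp(-u ** 2 / 2.0)  # Gaussian kernel
```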
Both the k-NN and bandwidth-based localizing schemes require a proper choice of the smoothing parameter (the number of neighbors and the bandwidth, respectively). Fortunately, the Bernoulli distribution belongs to an exponential family. Such distributions are quite well studied. In particular, in  an adaptive procedure for choosing the smoothing parameter was proposed. We will refer to that procedure as SSA (Spatial Stagewise Aggregation), as it is called in the paper .
Let a set of localizing schemes be given. Each localizing scheme induces a set of estimates defined by (7). Using the SSA procedure, we can get aggregated estimates. It was shown in  that each aggregated estimate behaves almost like the best estimate from the corresponding collection. Next, we can use the plug-in rule (3). We will only require that, for each pair of consecutive schemes, it holds
For the two examples considered above, this condition means that the candidates for the best model should be ordered by the number of nearest neighbors or by the bandwidth. A detailed description of the procedure for multiclass classification is given in Section 3. We will refer to it as MSSA (Multiclass Spatial Stagewise Aggregation).
To show consistency of the MSSA procedure, we will derive convergence rates under certain assumptions, measuring the estimation error by the Kullback–Leibler (KL) divergence between two distributions. For two Bernoulli distributions with parameters $\theta$ and $\eta$, the KL-divergence is defined by $\mathcal{KL}(\theta, \eta) = \theta \log\frac{\theta}{\eta} + (1-\theta)\log\frac{1-\theta}{1-\eta}$,
and it is more informative than the squared error. In our theoretical study, we will require a regularity of the KL-divergence. Namely, we assume that there exist constants such that
More precise statements will be discussed in Section 4.
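For numerical work, the Bernoulli KL-divergence can be computed directly; clipping the parameters away from 0 and 1 (our safeguard, not part of the definition) avoids infinite values:

```python
import numpy as np

def kl_bernoulli(theta, eta, eps=1e-12):
    """KL divergence between Bernoulli(theta) and Bernoulli(eta):
    theta*log(theta/eta) + (1-theta)*log((1-theta)/(1-eta))."""
    theta = np.clip(theta, eps, 1.0 - eps)
    eta = np.clip(eta, eps, 1.0 - eps)
    return float(theta * np.log(theta / eta)
                 + (1.0 - theta) * np.log((1.0 - theta) / (1.0 - eta)))
```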
Besides bounds on the estimation error, we also derive bounds on the classification error. Assume that the feature–label pairs are such that the feature vector is drawn according to some distribution and the conditional distribution of the label is given by (1). For every rule, define the risk
and the excess risk
where the baseline is the risk of the Bayes classifier.
Since we deal with a problem of nonparametric estimation, even the optimal estimator can show poor performance in the case of large dimension. Low noise assumptions are usually used to speed up rates of convergence and allow plug-in classifiers to achieve fast rates. We can rewrite
In the case of binary classification, a misclassification often occurs when the conditional probability of the positive class is close to 1/2 with high probability. The well-known Mammen-Tsybakov noise condition ensures that such a situation appears with low probability. Namely, it assumes that there exist universal non-negative constants such that for all positive thresholds it holds
This assumption can be generalized to the multiclass case. Suppose that at the test point we have
for some non-negative constants. We will use this assumption to establish fast rates for the constructed plug-in classifier in Section 4.
3 Algorithm and numerical experiments
Suppose one has several candidate localizing schemes for the optimal one. Each set of weights induces weak estimates for every class from 1 to M. Denote
We assume that any two collections of weights differ in at least one element. The idea of the procedure is simple. For each class, on the first step we choose the most local estimate. This estimate has the smallest bias and the largest variance. Next, we try to enlarge the vicinity of averaging under the condition that the bias does not change dramatically. For this purpose we run a likelihood-ratio test of homogeneity: if the hypothesis is correct, then the difference between the current estimate and the next weak estimate will not be significant. Otherwise, the function changes quickly in the vicinity of the test point and it is better to keep the more local estimate. We fix a critical value and construct the new estimate as a convex combination, whose mixing coefficient is close to one if the test statistic is much smaller than the critical value, and close to zero if the statistic exceeds it. On each subsequent step, we repeat this test for the next weak estimate and the estimate constructed on the previous step.
The choice of the critical values is the same as in . The MSSA procedure returns an aggregated estimate for each class. The classification rule is defined as
The choice of the critical values is crucial for the performance of the procedure. We tune them according to the propagation condition. The propagation condition means that in the homogeneous case the procedure must return the estimate corresponding to the broadest localizing scheme. If this does not happen, the situation is called an early stopping. The chosen values must ensure that early stopping occurs with a small probability (e.g. 0.05). In the experiments described further, the critical values were tuned only at one point and then used for all test points.
In all numerical experiments, we choose the localizing weights according to nearest-neighbor-based schemes. Namely, for a number of neighbors k we set the bandwidth equal to the distance to the k-th nearest neighbor of the test point. The weight of each observation is then defined by the formula
where the kernel is either the Gaussian or the Epanechnikov kernel. In our experiments, computing the local estimates for one class and aggregating them is fast, and so is computing the estimates at one test point.
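The stagewise aggregation step for one class can be sketched as follows. This is a schematic illustration only: the exact form of the test statistic and of the mixing coefficient follows the SSA paper, and a simple clipped-linear mixing function is assumed here.

```python
import numpy as np

def kl_bernoulli(t, e, eps=1e-12):
    """KL divergence between Bernoulli(t) and Bernoulli(e)."""
    t = np.clip(t, eps, 1.0 - eps)
    e = np.clip(e, eps, 1.0 - eps)
    return float(t * np.log(t / e) + (1.0 - t) * np.log((1.0 - t) / (1.0 - e)))

def ssa_aggregate(weak, n_eff, z):
    """Stagewise aggregation of weak estimates for one class, ordered
    from the most local to the broadest scheme (a sketch of the idea).

    weak  : weak estimates, length K
    n_eff : effective sample sizes (total weights) of the K schemes
    z     : critical values, length K - 1
    """
    est = weak[0]  # start from the most local estimate
    for k in range(1, len(weak)):
        # likelihood-ratio-type homogeneity statistic
        t_stat = n_eff[k] * kl_bernoulli(est, weak[k])
        # mixing coefficient: close to 1 when the statistic is small,
        # close to 0 when it exceeds the critical value (assumed form)
        gamma = float(np.clip(1.0 - t_stat / z[k - 1], 0.0, 1.0))
        est = gamma * weak[k] + (1.0 - gamma) * est
    return est
```

In the homogeneous case all weak estimates are close, the statistic stays small, and the procedure propagates to the broadest scheme, as required by the propagation condition.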
3.2 Experiments on artificial datasets
We start by presenting the performance of MSSA on artificial datasets. We generate points from a mixture model:
Then the density of the feature vector is given by the formula
The Bayes rule for this case is given by the formula
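Under the mixture model, the Bayes rule at a point can be evaluated directly; the sketch below assumes isotropic Gaussian components (our illustrative choice, matching the Gaussian mixtures used in the experiments):

```python
import numpy as np

def gaussian_density(x, mu, sigma2):
    """Density of N(mu, sigma2 * I) at the point x."""
    x = np.asarray(x, float)
    mu = np.asarray(mu, float)
    d = x.size
    norm_const = (2.0 * np.pi * sigma2) ** (d / 2.0)
    return float(np.exp(-np.sum((x - mu) ** 2) / (2.0 * sigma2)) / norm_const)

def bayes_rule(x, priors, means, variances):
    """Bayes classifier for a Gaussian mixture: argmax over classes of
    the prior probability times the class-conditional density."""
    scores = [p * gaussian_density(x, mu, s2)
              for p, mu, s2 in zip(priors, means, variances)]
    return int(np.argmax(scores))
```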
Below we provide results for three different experiments.
Typical sample realizations in all three experiments are shown in Figure 1. In each experiment, we took an increasing sequence of integers and considered the corresponding nearest-neighbor-based localizing schemes with the Epanechnikov and Gaussian kernels. We computed average leave-one-out cross-validation errors over repeated sample realizations.
In the first experiment, we took several classes and sample points with equal prior class probabilities and considered a mixture of the form (13) with
where the displayed notation stands for the density of a Gaussian random vector with the corresponding mean and variance.
Misclassification errors for each weak estimate and the SSA estimate for both k-NN and bandwidth-based localizing schemes are shown in Figure 2.
In the second experiment, we took several classes and sample points with equal prior class probabilities and considered a mixture (13) with
where the displayed notation stands for the density of a Gaussian random vector with the corresponding mean and variance. Misclassification errors for each weak estimate and the SSA estimate for both k-NN and bandwidth-based localizing schemes are shown in Figure 3.
Finally, in the third experiment, we took several classes and sample points with equal prior class probabilities and considered a mixture (13) with
where the displayed notation stands for the density of a Gaussian random vector with the corresponding mean and variance. Misclassification errors for each weak estimate and the SSA estimate for both k-NN and bandwidth-based localizing schemes are shown in Figure 4.
3.3 Experiments on real-world datasets
| Dataset | Train | Test | Attributes | Classes | Class distribution (in %) |
|---|---|---|---|---|---|
| Ecoli | 336 | – | 7 | 8 | 42.6, 22.9, 15.5, 10.4, 5.9, 1.5, 0.6, 0.6 |
| Iris | 150 | – | 4 | 3 | 33.3, 33.3, 33.3 |
| Glass | 214 | – | 9 | 6 | 32.7, 35.5, 7.9, 6.1, 4.2, 13.6 |
| Pendigits | 7494 | 3498 | 16 | 10 | 10.4, 10.4, 10.4, 9.6, 10.4, 9.6, 9.6, 9.6, 10.4, 9.6, 9.6 |
| Satimage | 4435 | 2000 | 36 | 6 | 24.1, 11.1, 20.3, 9.7, 11.1, 23.7 |
| Seeds | 210 | – | 7 | 3 | 33.3, 33.3, 33.3 |
| Wine | 178 | – | 13 | 3 | 33.1, 39.8, 26.9 |
| Yeast | 1484 | – | 8 | 10 | 16.4, 28.1, 31.2, 2.9, 2.3, 3.4, 10.1, 2.0, 1.3, 0.3 |
We compare the performance of our algorithm with boosting of k-NN classifiers considered in  and with SVM . For the Pendigits and Satimage datasets we calculated the misclassification error on the test dataset; for all other datasets we used leave-one-out cross-validation. The results of our experiments are shown in Table 2, with the best ones boldfaced.
| Dataset | EK MSSA | GK MSSA | Boost-NN, | SVM,  (table 2) |
|---|---|---|---|---|
| Ecoli | 12.8 ± 1.8 | 12.5 ± 1.8 | – | 13.0 ± 5.3 |
| Glass | 27.5 ± 3.1 | 26.6 ± 3.0 | 24.4 ± 1.7 | – |
| Pendigits | 2.6 ± 0.3 | 2.5 ± 0.3 | 0.5 ± 0.1 | – |
| Satimage | 9.6 ± 0.7 | 9.6 ± 0.7 | 9.6 ± 0.3 | 11.0 ± 0.7 |
| Seeds | 5.7 ± 1.6 | 5.7 ± 1.6 | – | 4.8 ± 2.4 |
| Wine | 2.2 ± 1.1 | 2.2 ± 1.1 | – | 1.7 ± 1.5 |
| Yeast | 40.5 ± 1.3 | 40.4 ± 1.3 | – | – |
From Table 2, one can observe that the localizing schemes with the Gaussian kernel behave slightly better than those with the Epanechnikov kernel, and that MSSA with both kernels is comparable with SVM.
4 Theoretical properties
4.1 Main results
Before we formulate the main theoretical properties of the procedure, we introduce an additional assumption. Namely, we assume that there exist constants such that
The choice of models fulfilling assumption (A3) is up to the statistician. Note that this assumption is quite reasonable: if the smallest and largest effective sample sizes are of the orders indicated there, then the number of models we aggregate is not huge.
The main theoretical properties of the MSSA procedure are formulated in the following theorems. The first two results concern the accuracy of estimation.
The result of Theorem 1 improves the results in . However, note that this theorem does not imply similar results in expectation, since the choice of parameters depends on the predetermined confidence level. Note that the logarithmic dependence on the number of models in (17) is usual for problems of model selection and cannot be improved.
The next result establishes rates of convergence for the procedure.
The rate is optimal for the estimation of Lipschitz functions under regularity of the design. The MSSA procedure attains the optimal rate up to a logarithmic factor, which can be considered as a payment for adaptation.
Note that condition (A) implies that the KL-divergence is bounded. This allows obtaining bounds in expectation for moments of the KL-loss. Indeed, fix an arbitrary moment order and choose the confidence level accordingly. Using the result of Theorem 2, we immediately obtain
Bounds in expectation can be easily improved by a simple modification of the procedure. Namely, fix some values and define the corresponding parameters. For each of them, let the associated estimate stand for a MSSA estimate with parameters defined by the formula (15). Finally, denote
A rigorous result is formulated in the next theorem.
Under the assumptions of Theorem 2, this choice ensures that
for all .
The proof of this result is given in Section 4.4. Note that the modified procedure requires running the MSSA algorithm several times but does not have a significant influence on the computational time.
With guarantees on the performance of estimation at hand, we are ready to provide bounds on the excess risk of misclassification. Now we assume that the test point is drawn randomly according to the marginal distribution and that the label has the conditional distribution (1).
Let the training sample have independent entries, and let the test point be generated from the marginal distribution with its label given by the model (1). Let the multiclass low noise assumption (11) be fulfilled and suppose that for each realization a collection of localizing schemes is chosen in a way that ensures (A1) and (A3) with probability 1. Suppose that the regularity condition holds. Choose a constant from condition (14) and set the parameters according to (15). Let
Suppose that for each class from 1 to M there exists a localizing scheme such that
for some positive constants. Then for the excess risk one has
for some positive constant.
4.2 Proof of Theorem 1
We also use a reparametrization
throughout the proof.