Pointwise adaptation via stagewise aggregation of local estimates for multiclass classification

Financial support by the Russian Academic Excellence Project 5-100 and by the German Research Foundation (DFG) through the Collaborative Research Center 1294 is gratefully acknowledged.
Abstract
We consider the problem of multiclass classification, where the training sample is generated from a model with unknown Lipschitz functions. Given a test point, our goal is to estimate the conditional class probabilities. An approach based on nonparametric smoothing uses a localization technique, i.e. the weight of an observation depends on its distance to the test point. However, local estimates strongly depend on the localizing scheme. In our solution we fix several schemes, compute the corresponding local estimates for each of them and apply an aggregation procedure. We propose an algorithm which constructs a convex combination of the estimates such that the aggregated estimate behaves approximately as well as the best one from the collection. We also study theoretical properties of the procedure, prove oracle results and establish rates of convergence under mild assumptions.
MSC subject classifications: Primary 62H30; secondary 62G08, 62G10.
1 Introduction
Multiclass classification is a natural generalization of the well-studied problem of binary classification with a wide range of applications. It is a problem of supervised learning in which one observes a sample , where , , , . Pairs are generated independently according to some distribution over . The learner's task is to find a rule in order to make the probability of misclassification
as small as possible. For a given class of admissible functions , one is often interested in the excess risk
which shows how far the classifier is from the best one in the class . Note that in this setting may be chosen outside of .
Concerning the multiclass learning problem, one can distinguish between two main approaches. The first one is a reduction to binary classification. The most popular and straightforward examples of these techniques are One-vs-All (OvA) and One-vs-One (OvO). Another example of reduction to the binary case is given by error-correcting output codes (ECOC) [13]. In [2] this approach was generalized to margin classifiers. A similar approach uses tree-based classifiers. Methods of the second type solve a single problem, as is done in multiclass SVM [8] and the multiclass one-inclusion graph strategy [27]. One can refer to [3] for a brief overview of multiclass classification methods. Daniely, Sabato and Shalev-Shwartz in [10] compared OvA, OvO, ECOC, tree-based classifiers and multiclass SVM for linear discrimination rules in a finite-dimensional space. According to their theoretical study, multiclass SVM outperforms the OvA method. In [8] Crammer and Singer also showed the superiority of multiclass SVM on several datasets. Nevertheless, in our work we will use One-vs-All for two reasons. First, we will consider a broad nonparametric class of functions, and the results in [10] do not cover this case. Second, in [23] Rifkin and Klautau showed that OvA behaves comparably to multiclass SVM if the binary classifier in OvA is strong.
For each class , we construct binary labels and assume that, given , the conditional distribution of is , where , and . This model is very general and covers all possible distributions of on points. We must put some restrictions on the functions , . We will provide learning guarantees for the class of Lipschitz functions, i.e. we assume that there exists a constant such that for all and for all from to it holds
For this model, the optimal classifier can be found analytically
Unfortunately, the true values are unknown; therefore we study a plug-in rule
where stands for an estimate of , . This reduces the problem of classification to a regression problem. In [11] it was shown that in general the regression problem is more difficult than classification. Fortunately, for some classes (including the class of Lipschitz functions), classification and regression have similar complexities, as shown in [30].
For problems of nonparametric regression, different localization techniques are often used. Namely, one considers an estimate defined by maximization of the localized log-likelihood
where is the log-likelihood of the th observation, is called a localizing scheme and the localizing weights depend on and . Particular examples of this technique are the Nadaraya-Watson estimator, local polynomial estimators and nearest-neighbor-based estimators.
Note that the estimate strongly depends on the localizing scheme, and its choice determines the performance of the classifier . Moreover, in multiclass learning there is a common problem of class imbalance, i.e. some classes may not be represented in a small vicinity of a given point. Obviously, one localizing scheme is not enough for such a situation. To solve this problem, we consider several localizing schemes , compute local likelihood estimates (also called weak estimates) , , for each of them and use a plug-in classifier based on a convex combination of these estimates. The aggregation of weak estimates is a key feature of our procedure.
The aggregation of estimators takes its origins in model selection and was generalized to convex and linear aggregation in [15]. In [28] and [31] optimal rates of aggregation were derived. Aggregation procedures have a wide range of applications and can be used in regression problems ([28], [31]), density estimation ([21], [25], [17]) and classification problems ([29], [32], [21]). They often solve an optimization problem in order to find the aggregating coefficients ([16], [9], [19], [20]). In some cases, such as exponential weighting ([18], [26]), the solution of the optimization problem can be written explicitly. Aggregation under the KL-loss was also studied in [24] and [7], where optimal rates of aggregation and exponential bounds were obtained. However, most of the existing aggregation procedures and results concern global aggregation. This means that the aggregating coefficients are universal and do not depend on the point where the classification rule is applied.
Our approach is based on local aggregation yielding a point-dependent aggregation scheme. However, the proposed procedure does not require solving an optimization problem. Instead, it sequentially finds a convex combination of weak estimates which mimics the best possible choice of a model under the Kullback-Leibler loss for a given test point . The idea of the approach originates from [6], where aggregation of binary classifiers was studied.
Finally, it is worth mentioning that nonparametric estimates have slow rates of convergence, especially in the case of high dimension . It was shown in [5] and then in [14] that plug-in classifiers can achieve fast learning rates under certain assumptions in both binary and multiclass classification problems. We will use a similar technique to derive fast learning rates for the plug-in classifier based on the aggregated estimate.
The main contributions of this paper are the following:

- we propose an algorithm for multiclass classification, based on aggregation of local likelihood estimates, which works for a broad class of admissible functions;

- the procedure is robust against class imbalance and outliers;

- the computational time of the procedure is , where and stand for the sizes of the train and test datasets respectively, which makes it scalable to large problems;

- theoretical guarantees claim optimal accuracy of classification with only a logarithmic payment for the number of classes and aggregated estimates.
The paper is organized as follows. In Section 2 we introduce definitions and notations. In Section 3 we formulate the multiclass classification procedure and demonstrate its performance on both artificial and real-world datasets. Finally, in Section 4 we study theoretical properties of the procedure. In particular, we derive oracle results for model selection, establish rates of convergence for the problem of nonparametric estimation and provide bounds for the excess risk .
2 Setup and notations
Given a training sample , we adopt the following probabilistic model. Suppose that, given , the labels have a conditional distribution
(1) 
, and . The optimal classifier is the Bayes rule defined by
(2) 
Unfortunately, the true values are unknown; therefore we fix a test point and consider a plug-in classifier
(3) 
where stands for an estimate of , .
Now the problem is how to estimate , . Fix some and transform the labels to binary:
(4) 
It is clear that
(5) 
where . This approach is nothing but the One-vs-All procedure for multiclass classification.
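In code, the One-vs-All transformation (4) and the plug-in rule (3) can be sketched as follows (a minimal pure-Python illustration; the function names are ours):

```python
def one_vs_all_labels(y, m):
    """Transform multiclass labels y taking values in {0, ..., m-1} into
    m binary label vectors, as in (4): z[c][i] = 1 if y[i] == c, else 0."""
    return [[1 if label == c else 0 for label in y] for c in range(m)]

def plug_in_classify(theta_hat):
    """Plug-in rule (3): predict the class with the largest estimated
    conditional probability."""
    return max(range(len(theta_hat)), key=lambda c: theta_hat[c])
```

Each binary vector then serves as the response for one binary estimation problem, and the plug-in rule combines the resulting estimates.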
For fixed and , denote
and
One of the ways to estimate is to consider a localized log-likelihood
(6) 
where is the log-likelihood of one observation and are some nonnegative localizing weights. The local maximum likelihood estimate can be found explicitly.
Proposition 1.
For the log-likelihood function of the form (6) the estimate is given by the formula
(7) 
where , . Moreover, for any it holds
where .
The proof of the proposition is straightforward: it only requires computing the derivative and setting it to zero.
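For Bernoulli observations, formula (7) reduces to a weighted average of the binary labels. A minimal sketch (the fallback value 1/2 for an empty vicinity is our convention, not part of the proposition):

```python
def local_mle(weights, z):
    """Local maximum likelihood estimate for binary labels z with
    nonnegative localizing weights, cf. formula (7): the weighted
    average of the labels."""
    n_eff = sum(weights)          # effective local sample size
    if n_eff == 0:
        return 0.5                # no local information: fall back to 1/2
    return sum(w * zi for w, zi in zip(weights, z)) / n_eff
```

The same quantity appears below both for kNN weights and for kernel weights; only the weight construction differs.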
We proceed with two examples of such weights.
Example 2.1: k nearest neighbors
For kNN estimates we have
where is the set of the k points nearest to over . Then
Example 2.2: bandwidth-based kernel estimates
For bandwidth-based kernel estimates the localizing weights are defined by the formula
(8) 
where stands for some norm, is called the bandwidth and is a localizing kernel. Standard examples of such kernels are the following:

- rectangular kernel:

- triangular kernel:

- Epanechnikov kernel:

- Gaussian kernel:
We will use the Euclidean norm in the examples in Section 3.
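The four standard kernels listed above, and the weights (8) with the Euclidean norm, can be written as follows (a sketch; the function names are ours):

```python
import math

# Standard localizing kernels K(t), evaluated at t = ||x_i - x|| / h.
def rectangular(t):  return 1.0 if abs(t) <= 1.0 else 0.0
def triangular(t):   return max(1.0 - abs(t), 0.0)
def epanechnikov(t): return max(1.0 - t * t, 0.0)
def gaussian(t):     return math.exp(-t * t / 2.0)

def kernel_weights(X, x, h, kernel=epanechnikov):
    """Bandwidth-based localizing weights (8): w_i = K(||X_i - x|| / h),
    with the Euclidean norm."""
    weights = []
    for xi in X:
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, x)))
        weights.append(kernel(dist / h))
    return weights
```

Note that the Epanechnikov kernel here is taken without its usual normalizing constant; for localizing weights only the shape of the kernel matters, since the weighted average (7) is scale-invariant in the weights.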
Both kNN and bandwidth-based localizing schemes require a proper choice of the smoothing parameter ( and respectively). Fortunately, the Bernoulli distribution belongs to an exponential family, and such distributions are quite well studied. In particular, in [6] an adaptive procedure for choosing the smoothing parameter was proposed. We will refer to that procedure as SSA (Spatial Stagewise Aggregation), as it is called in [6].
Let be a set of localizing schemes, i.e. for each . Each localizing scheme induces a set of estimates , defined by (7). Using the SSA procedure, we can get aggregated estimates . It was shown in [6] that for each the aggregated estimate behaves almost like the best estimate from . Next, we can use the plug-in rule (3). We will only require that for each from to , it holds
(A1) 
For two examples considered above, this condition means that candidates for the best model should be ordered by the number of nearest neighbors or by the bandwidth. The detailed description of the procedure for multiclass classification is given in Section 3. We will refer to it as MSSA (Multiclass Spatial Stagewise Aggregation).
To show consistency of the MSSA procedure we will derive convergence rates for and , where stands for the KL-divergence between two distributions, under certain assumptions. For two Bernoulli distributions with parameters and , the KL-divergence is defined by
(9) 
and it is more informative than the squared error. In the theoretical study, we will require regularity of the KL-divergence. Namely, we assume that there exist constants such that
(A2) 
or, equivalently,
(A) 
where and
is the Fisher information. These regularity conditions are commonly used (cf. [6], [24]). One may easily notice that is bounded above by .
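Formula (9) can be implemented directly; the clipping of the parameters away from 0 and 1 below is our numerical safeguard, not part of the definition:

```python
import math

def kl_bernoulli(theta, eta, eps=1e-12):
    """KL-divergence (9) between Bernoulli(theta) and Bernoulli(eta).
    Parameters are clipped to [eps, 1 - eps] for numerical stability."""
    theta = min(max(theta, eps), 1.0 - eps)
    eta = min(max(eta, eps), 1.0 - eps)
    return (theta * math.log(theta / eta)
            + (1.0 - theta) * math.log((1.0 - theta) / (1.0 - eta)))
```

The claim that the KL-loss is more informative than the squared error is consistent with Pinsker-type inequalities: the Bernoulli KL-divergence dominates a constant multiple of the squared difference of the parameters.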
Assumptions (A2), (A) endow the KL-divergence with metric-like properties; under these assumptions
More precise statements will be discussed in Section 4.
Besides bounds on the estimation error, we also derive bounds on the classification error. Assume that pairs are such that is drawn according to some distribution and conditional distribution of is given by (1). For every rule define the risk
and the excess risk
where stands for the Bayes classifier.
Since we deal with the problem of nonparametric estimation, even the optimal estimator can show poor performance in the case of large dimension . Low-noise assumptions are usually used to speed up rates of convergence and allow plug-in classifiers to achieve fast rates. We can rewrite
In the case of binary classification, misclassification often occurs when is close to . The well-known Mammen-Tsybakov noise condition ensures that such a situation appears with low probability. Namely, it assumes that there exist universal nonnegative constants and such that for all it holds
(10) 
This assumption can be generalized to the multiclass case. Suppose that at the point we have
Let be the ordered values of . Then condition (10) for the multiclass case can be formulated as follows (cf. [1], [14])
(11) 
for some nonnegative constants and . We will use this assumption to establish fast rates for the constructed plug-in classifier in Section 4.
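For reference, the standard forms of the binary condition (10) and its multiclass analogue (11) read as follows; the concrete notation below is our reconstruction of the intended display (cf. [1], [14]), with $\theta_{(1)}(x) \ge \dots \ge \theta_{(M)}(x)$ denoting the ordered conditional class probabilities:

```latex
% Binary Mammen--Tsybakov low-noise condition (10), standard form:
\mathbb{P}\bigl( 0 < |\eta(X) - 1/2| \le t \bigr) \le C_0\, t^{\alpha},
\qquad 0 < t \le t_0 .

% Multiclass analogue (11): the two largest conditional class
% probabilities are rarely close to each other:
\mathbb{P}\bigl( \theta_{(1)}(X) - \theta_{(2)}(X) \le t \bigr)
\le C_0\, t^{\alpha}, \qquad 0 < t \le t_0 .
```

The larger the exponent $\alpha$, the rarer ambiguous points are, and the faster the excess risk of a plug-in classifier can converge.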
3 Algorithm and numerical experiments
3.1 Algorithm
Suppose one has candidates , , for an optimal localizing scheme. The set of weights induces weak estimates for each class from to . Denote
(12) 
We assume that any two collections of weights differ in at least one element. The idea of the procedure is simple. For each class, on the first step we choose the estimate . This estimate is very local and therefore has the smallest bias and the largest variance of order . Next, we try to enlarge the vicinity of averaging under the condition that the bias does not change dramatically. For this purpose we run a likelihood-ratio test of homogeneity: if the hypothesis is correct, then the difference will not be significant. Otherwise, the function changes quickly in the vicinity of the test point and it is better to utilize . Fix a critical value and construct the estimate , where the coefficient is close to if is much less than , and close to if exceeds the value . On step , , we repeat this test for a new estimate and the estimate constructed on the previous step.
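The stagewise loop just described can be sketched as follows. This is a simplified illustration: the sigmoid mapping from the test statistic to the mixing coefficient is our stand-in for the aggregation kernel of [6], whose exact form is not reproduced here.

```python
import math

def stagewise_aggregate(weak_estimates, test_stats, z_crit, slope=10.0):
    """Sketch of SSA-style stagewise aggregation.

    weak_estimates : local estimates ordered from the most local scheme
                     to the broadest one (assumption (A1)).
    test_stats     : test_stats[k-1] is a homogeneity test statistic
                     comparing the k-th weak estimate with the current
                     aggregate.
    z_crit         : critical values for each aggregation step.
    """
    theta = weak_estimates[0]          # start from the most local estimate
    for k in range(1, len(weak_estimates)):
        t_k, z_k = test_stats[k - 1], z_crit[k - 1]
        # Smooth accept/reject gate: gamma is close to 1 when the
        # statistic is well below the critical value, close to 0 when
        # it exceeds it (illustrative sigmoid, not the form of [6]).
        gamma = 1.0 / (1.0 + math.exp(slope * (t_k - z_k)))
        theta = gamma * weak_estimates[k] + (1.0 - gamma) * theta
    return theta
```

In the homogeneous case all statistics stay below the critical values, every gate is close to 1, and the procedure ends near the broadest estimate; a large statistic freezes the aggregate at the current locality.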
The choice of is the same as in [6]. The MSSA procedure returns aggregated estimates for each class. The classification rule is defined as
The choice of the parameters is crucial for the performance of the procedure. We tune the values according to the propagation condition. The propagation condition means that in the homogeneous case the procedure must return the estimate corresponding to the broadest localizing scheme . If this does not happen, the situation is called early stopping. The chosen values must ensure that early stopping occurs with a small probability (e.g. 0.05). In the experiments described below, the values were tuned at a single point and then used for all test points.
In all numerical experiments, we choose the localizing weights according to nearest-neighbor-based schemes. Namely, for a number of neighbors we set equal to the distance to the th nearest neighbor of the test point . The weight is then defined by the formula
where is either the Gaussian or the Epanechnikov kernel. Typically in our experiments, and then it requires time to compute the local estimates for one class and time to aggregate them. As a result, it takes time to compute the estimates at one test point.
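The nearest-neighbor-based schemes of this subsection can be sketched as follows (a simplified illustration; for k = 1 with the test point inside the training set the bandwidth would vanish, a degenerate case that leave-one-out evaluation avoids):

```python
import math

def knn_localizing_weights(X_train, x, ks, kernel):
    """For each number of neighbors k in ks, set the bandwidth h_k to
    the distance from the test point x to its k-th nearest neighbor in
    X_train, and return the weights kernel(||X_i - x|| / h_k)."""
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, x)))
             for xi in X_train]
    sorted_dists = sorted(dists)
    schemes = []
    for k in ks:
        h_k = sorted_dists[k - 1]   # distance to the k-th nearest neighbor
        schemes.append([kernel(d / h_k) for d in dists])
    return schemes
```

Choosing an increasing sequence of values of k produces nested localizing schemes ordered from the most local to the broadest, as required by assumption (A1).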
3.2 Experiments on artificial datasets
We start by presenting the performance of MSSA on artificial datasets. We generate points from a mixture model:
and
Then the density of is given by the formula
(13) 
The Bayes rule for this case is given by the formula
Below we provide results for three different experiments.
Typical sample realizations in all three experiments are shown in Figure 1. In each experiment, we took a sequence of integers , and considered nearest-neighbor-based localizing schemes with the Epanechnikov and Gaussian kernels. We computed average leave-one-out cross-validation errors over sample realizations.
In the first experiment, we took classes, points, equal prior class probabilities and considered a mixture of the form (13) with
where stands for the density of a Gaussian random vector with mean and variance .
Misclassification errors for each weak estimate and for the SSA estimate, for both kNN and bandwidth-based localizing schemes, are shown in Figure 2.
In the second experiment, we took classes, points, equal prior class probabilities and considered a mixture (13) with
where stands for the density of a Gaussian random vector with mean and variance . Misclassification errors for each weak estimate and for the SSA estimate, for both kNN and bandwidth-based localizing schemes, are shown in Figure 3.
Finally, in the third experiment, we took classes, points, equal prior class probabilities and considered a mixture (13) with
where stands for the density of a Gaussian random vector with mean and variance . Misclassification errors for each weak estimate and for the SSA estimate, for both kNN and bandwidth-based localizing schemes, are shown in Figure 4.
3.3 Experiments on real-world datasets
We proceed with experiments on datasets from the UCI repository [12]: Ecoli, Iris, Glass, Pendigits, Satimage, Seeds, Wine and Yeast. Short information about these datasets is given in Table 1.
Dataset    Train  Test  Attributes  Classes  Class distribution (in %)
Ecoli        336     –           7        8  42.6, 22.9, 15.5, 10.4, 5.9, 1.5, 0.6, 0.6
Iris         150     –           4        3  33.3, 33.3, 33.3
Glass        214     –           9        6  32.7, 35.5, 7.9, 6.1, 4.2, 13.6
Pendigits   7494  3498          16       10  10.4, 10.4, 10.4, 9.6, 10.4, 9.6, 9.6, 10.4, 9.6, 9.6
Satimage    4435  2000          36        6  24.1, 11.1, 20.3, 9.7, 11.1, 23.7
Seeds        210     –           7        3  33.3, 33.3, 33.3
Wine         178     –          13        3  33.1, 39.8, 26.9
Yeast       1484     –           8       10  16.4, 28.1, 31.2, 2.9, 2.3, 3.4, 10.1, 2.0, 1.3, 0.3
We compare the performance of our algorithm with boosting of kNN classifiers considered in [4] and with SVM [22]. For the Pendigits and Satimage datasets we calculated the misclassification error on the test dataset; for all other datasets we used leave-one-out cross-validation. The results of our experiments are shown in Table 2; the best ones are boldfaced.
Dataset    EK MSSA     GK MSSA     BoostNN, [4]  SVM, [22] (table 2)
Ecoli      12.8 ± 1.8  12.5 ± 1.8  –             13.0 ± 5.3
Iris        0.0         0.0        –
Glass      27.5 ± 3.1  26.6 ± 3.0  24.4 ± 1.7
Pendigits   2.6 ± 0.3   2.5 ± 0.3   0.5 ± 0.1
Satimage    9.6 ± 0.7   9.6 ± 0.7   9.6 ± 0.3    11.0 ± 0.7
Seeds       5.7 ± 1.6   5.7 ± 1.6  –              4.8 ± 2.4
Wine        2.2 ± 1.1   2.2 ± 1.1  –              1.7 ± 1.5
Yeast      40.5 ± 1.3  40.4 ± 1.3
From Table 2, one can observe that the localizing schemes with the Gaussian kernel behave slightly better than those with the Epanechnikov kernel, and that MSSA with both kernels is comparable to SVM.
4 Theoretical properties
4.1 Main results
Before we formulate the main theoretical properties of the procedure, we introduce an additional assumption. Namely, we assume that there exist constants and such that
(A3) 
The choice of models fulfilling assumption (A3) is up to the statistician. Note that this assumption is quite reasonable in the sense that if is of order and is of order , then is of order , and thus the number of models we aggregate is not huge.
The main theoretical properties of the MSSA procedure are formulated in the following theorems. The first two results concern the accuracy of estimation.
Theorem 1.
The result of Theorem 1 improves the results in [6]. However, note that this theorem does not imply similar results in expectation, since the choice of the parameters depends on the predetermined confidence level . Note also that the logarithmic dependence on the number of models in (17) is usual for model selection problems and cannot be improved.
The next result establishes rates of convergence for the procedure.
Theorem 2.
The rate is optimal for estimation of Lipschitz functions under regularity of the design. The MSSA procedure attains the optimal rate up to a logarithmic factor, which can be considered a payment for adaptation.
Note that condition (A) implies that the KL-divergence is bounded by . This allows us to obtain bounds in expectation for the th moment of the KL-loss. Indeed, fix an arbitrary and choose . Using the result of Theorem 2, we immediately obtain
Bounds in expectation can easily be improved by a simple modification of the procedure. Namely, fix some and define , . For each , let stand for the MSSA estimate with parameters defined by formula (15). Finally, denote
(21) 
A rigorous result is formulated in the next theorem.
Theorem 3.
The proof of this result is given in Section 4.4. Note that the modified procedure requires running the MSSA algorithm times, which does not significantly affect the computational time.
With these guarantees on the estimation performance, we are ready to provide bounds on the excess risk of misclassification. Now we assume that the test point is drawn randomly according to the distribution and has the conditional distribution (1).
Theorem 4.
Let be a training sample with independent entries, let be a test point generated from the distribution , and let be given by . Let the multiclass low-noise assumption (11) be fulfilled, and suppose that for each realization of and , a collection of localizing schemes is chosen so as to ensure (A1) and (A3) with probability 1. Suppose that holds. Choose a constant from condition (14) and set the parameters according to (15). Let
and select
(22) 
Suppose that for each from to there exists a localizing scheme such that
and
(23) 
for some positive constants . Then for the excess risk one has
for some positive constant .
4.2 Proof of Theorem 1
Lemma 1.
We also use a reparametrization
(24) 
throughout the proof.