Pointwise adaptation via stagewise aggregation of local estimates for multiclass classification

Financial support by the Russian Academic Excellence Project 5-100 and by the German Research Foundation (DFG) through the Collaborative Research Center 1294 is gratefully acknowledged.

Nikita Puchkin (npuchkin@gmail.com) and Vladimir Spokoiny (spokoiny@wias-berlin.de)

Institute for Information Transmission Problems RAS, Bolshoy Karetny per. 19, 127051, Moscow, RF.
Weierstrass Institute and Humboldt University, Mohrenstr. 39, 10117 Berlin, Germany.
National Research University Higher School of Economics, 20 Myasnitskaya ulitsa, 101000, Moscow, RF.
Moscow Institute of Physics and Technology, 141701, Dolgoprudny, RF.
Skolkovo Institute of Science and Technology (Skoltech), 143026, Moscow, RF.
Abstract

We consider the problem of multiclass classification, where the training sample $(X_1, Y_1), \dots, (X_n, Y_n)$ is generated from the model $\mathbb{P}(Y = m \mid X = x) = \theta_m(x)$, $1 \le m \le M$, and $\theta_1(x), \dots, \theta_M(x)$ are unknown Lipschitz functions. Given a test point $x_0$, our goal is to estimate $\theta_1(x_0), \dots, \theta_M(x_0)$. An approach based on nonparametric smoothing uses a localization technique, i.e. the weight of an observation depends on the distance between $X_i$ and $x_0$. However, local estimates strongly depend on the localizing scheme. In our solution we fix several schemes, compute the corresponding local estimates for each of them and apply an aggregation procedure. We propose an algorithm that constructs a convex combination of the estimates such that the aggregated estimate behaves approximately as well as the best one from the collection. We also study theoretical properties of the procedure, prove oracle results and establish rates of convergence under mild assumptions.

Running title: Pointwise adaptation for multiclass classification

MSC 2010 subject classifications: Primary 62H30; secondary 62G08, 62G10.

1 Introduction

Multiclass classification is a natural generalization of the well-studied problem of binary classification, with a wide range of applications. It is a problem of supervised learning, where one observes a sample $(X_1, Y_1), \dots, (X_n, Y_n)$ with feature vectors $X_i \in \mathbb{R}^d$ and labels $Y_i \in \{1, \dots, M\}$. The pairs $(X_i, Y_i)$ are generated independently according to some distribution over $\mathbb{R}^d \times \{1, \dots, M\}$. The learner's task is to find a rule $f$ in order to make the probability of misclassification

$$R(f) = \mathbb{P}\big(f(X) \ne Y\big)$$

as small as possible. For a given class of admissible functions $\mathcal{F}$, one is often interested in the excess risk

$$R(f) - \inf_{g \in \mathcal{F}} R(g),$$

which shows how far the classifier $f$ is from the best one in the class $\mathcal{F}$. Note that in this setting $f$ may be chosen outside of $\mathcal{F}$.

Concerning the multiclass learning problem, one can distinguish between two main approaches. The first one reduces the problem to binary classification. The most popular and straightforward examples of such techniques are One-vs-All (OvA) and One-vs-One (OvO). Another example of reduction to the binary case is given by error correcting output codes (ECOC) [13]. In [2] this approach was generalized to margin classifiers. A similar approach uses tree-based classifiers. Methods of the second type solve a single problem, as is done in multiclass SVM [8] and the multiclass one-inclusion graph strategy [27]. One can refer to [3] for a brief overview of multiclass classification methods. Daniely, Sabato and Shalev-Shwartz in [10] compared OvA, OvO, ECOC, tree-based classifiers and multiclass SVM for linear discrimination rules in a finite-dimensional space. According to their theoretical study, multiclass SVM outperforms the OvA method. In [8] Crammer and Singer also showed the superiority of multiclass SVM on several datasets. Nevertheless, in our work we will use One-vs-All for two reasons. First, we consider a broad nonparametric class of functions, and the results in [10] do not cover this case. Second, in [23] Rifkin and Klautau showed that OvA behaves comparably to multiclass SVM if the binary classifier used in OvA is strong.

For each class $m$, we construct binary labels $Y_i^m = \mathbb{1}(Y_i = m)$ and assume that, given $X_i = x$, the conditional distribution of $Y_i^m$ is Bernoulli with parameter $\theta_m(x) = \mathbb{P}(Y = m \mid X = x)$, where $0 \le \theta_m(x) \le 1$ and $\sum_{m=1}^M \theta_m(x) = 1$. This model is very general and covers all possible distributions of $Y$ supported on $M$ points. We must put some restrictions on the functions $\theta_m(\cdot)$, $1 \le m \le M$. We will provide learning guarantees for the class of Lipschitz functions, i.e. we assume that there exists a constant $L > 0$ such that for all $x, x'$ and for all $m$ from $1$ to $M$ it holds

$$|\theta_m(x) - \theta_m(x')| \le L \|x - x'\|.$$
For this model, the optimal classifier can be found analytically:

$$f^*(x) = \arg\max_{1 \le m \le M} \theta_m(x).$$

Unfortunately, the true values $\theta_m(x)$ are unknown; therefore we study a plug-in rule

$$\hat f(x) = \arg\max_{1 \le m \le M} \widehat\theta_m(x),$$

where $\widehat\theta_m(x)$ stands for an estimate of $\theta_m(x)$, $1 \le m \le M$. This reduces the problem of classification to a regression problem. In [11] it was shown that the regression problem is in general more difficult than classification. Fortunately, for some classes (including the class of Lipschitz functions), classification and regression have similar complexities, as was shown in [30].

For problems of nonparametric regression, different localization techniques are often used. Namely, one considers an estimate defined by maximization of a localized log-likelihood,

$$\widetilde\theta(x_0) = \arg\max_{\theta} \sum_{i=1}^n w_i(x_0)\, \ell_i(\theta),$$

where $\ell_i(\theta)$ is the log-likelihood of the $i$-th observation, the collection $W(x_0) = \big(w_1(x_0), \dots, w_n(x_0)\big)$ is called a localizing scheme and the localizing weights $w_i(x_0)$ depend on $X_i$ and $x_0$. Particular examples of such a technique are the Nadaraya-Watson estimator, local polynomial estimators and nearest-neighbor-based estimators.

Note that the estimate strongly depends on the localizing scheme, and its choice determines the performance of the resulting classifier. Moreover, in multiclass learning there is a common problem of class imbalance, i.e. some classes may not be represented in a small vicinity of a given point. Obviously, one localizing scheme is not enough in such a situation. To solve this problem, we consider several localizing schemes $W_1, \dots, W_K$, compute local likelihood estimates (also called weak estimates) for each of them and use a plug-in classifier based on a convex combination of these estimates. The aggregation of weak estimates is the key feature of our procedure.

The aggregation of estimators takes its origin in model selection and was generalized to convex and linear aggregation in [15]. In [28] and [31] optimal rates of aggregation were derived. Aggregation procedures have a wide range of applications and can be used in regression problems ([28], [31]), density estimation ([21], [25], [17]) and classification problems ([29], [32], [21]). They often solve an optimization problem in order to find the aggregating coefficients ([16], [9], [19], [20]). In some cases, such as exponential weighting ([18], [26]), the solution of the optimization problem can be written explicitly. Aggregation under the KL-loss was also studied in [24] and [7], where optimal rates of aggregation and exponential bounds were obtained. However, most of the existing aggregation procedures and results concern global aggregation. This means that the aggregating coefficients are universal and do not depend on the point where the classification rule is applied.

Our approach is based on local aggregation, yielding a point-dependent aggregation scheme. However, the proposed procedure does not require solving an optimization problem. Instead, it sequentially finds a convex combination of weak estimates which mimics the best possible choice of a model under the Kullback-Leibler loss at the given test point $x_0$. The idea of the approach originates from [6], where an aggregation of binary classifiers was studied.

Finally, it is worth mentioning that nonparametric estimates have slow rates of convergence, especially in the case of high dimension $d$. It was shown in [5], and then in [14], that plug-in classifiers can achieve fast learning rates under certain assumptions in both binary and multiclass classification problems. We will use a similar technique to derive fast learning rates for the plug-in classifier based on the aggregated estimate.

The main contributions of this paper are the following:

  • we propose a multiclass classification algorithm, based on the aggregation of local likelihood estimates, which works for a broad class of admissible functions;

  • the procedure is robust against class imbalance and outliers;

  • the computational time of the procedure grows only moderately with the sizes of the train and test datasets, which makes it scalable to large problems;

  • the theoretical guarantees claim optimal accuracy of classification with only a logarithmic price for the number of classes and aggregated estimates.

The paper is organized as follows. In Section 2 we introduce definitions and notations. In Section 3 we formulate the multiclass classification procedure and demonstrate its performance on both artificial and real-world datasets. Finally, in Section 4 we study theoretical properties of the procedure. In particular, we derive oracle results for model selection, establish rates of convergence for the problem of nonparametric estimation and provide bounds for the excess risk.

2 Setup and notations

Given a training sample $(X_1, Y_1), \dots, (X_n, Y_n)$, we apply the following probabilistic model. Suppose that, given $X_i = x$, the labels $Y_i$ have the conditional distribution

$$\mathbb{P}(Y_i = m \mid X_i = x) = \theta_m(x), \quad 1 \le m \le M, \qquad (1)$$

where $0 \le \theta_m(x) \le 1$ and $\sum_{m=1}^M \theta_m(x) = 1$. The optimal classifier is the Bayes rule defined by

$$f^*(x) = \arg\max_{1 \le m \le M} \theta_m(x). \qquad (2)$$

Unfortunately, the true values $\theta_m(\cdot)$ are unknown; therefore we fix a test point $x_0$ and consider a plug-in classifier

$$\hat f(x_0) = \arg\max_{1 \le m \le M} \widehat\theta_m(x_0), \qquad (3)$$

where $\widehat\theta_m(x_0)$ stands for an estimate of $\theta_m(x_0)$, $1 \le m \le M$.

Now, the problem is how to estimate $\theta_m(x_0)$, $1 \le m \le M$. Fix some $m$ and transform the labels to binary ones:

$$Y_i^m = \mathbb{1}(Y_i = m), \quad 1 \le i \le n. \qquad (4)$$

It is clear that

$$Y_i^m \mid X_i = x \ \sim\ \mathrm{Bernoulli}\big(\theta_m(x)\big), \qquad (5)$$

where $\theta_m(x) = \mathbb{P}(Y = m \mid X = x)$. This approach is nothing but the One-vs-All procedure for multiclass classification.
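For illustration, a minimal numpy sketch of this binarization; the function name and the 0-based class indexing are our own conventions, not taken from the paper:

```python
import numpy as np

def one_vs_all_labels(y, n_classes):
    """One-vs-All binary labels Y_i^m = 1{Y_i = m}, as in (4).

    y : integer array of shape (n,) with class labels in {0, ..., n_classes - 1}
        (the paper indexes classes from 1 to M; here we use 0-based indices).
    Returns an (n, n_classes) array of zeros and ones; column m is the binary
    sample used to estimate theta_m.
    """
    return (y[:, None] == np.arange(n_classes)[None, :]).astype(float)
```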

For a fixed class $m$ and the test point $x_0$, write $\theta = \theta_m(x_0)$ for the target value and let $W = (w_1, \dots, w_n)$ be a collection of non-negative localizing weights with $N = \sum_{i=1}^n w_i$. One of the ways to estimate $\theta$ is to consider a localized log-likelihood

$$L(W, \theta) = \sum_{i=1}^n w_i \big( Y_i^m \log\theta + (1 - Y_i^m)\log(1 - \theta) \big), \qquad (6)$$

where $Y_i^m \log\theta + (1 - Y_i^m)\log(1 - \theta)$ is the log-likelihood of one observation and $w_1, \dots, w_n$ are the localizing weights. The local maximum likelihood estimate can be found explicitly.

Proposition 1.

For the log-likelihood function of the form (6) the maximizer is given by the formula

$$\widetilde\theta = \frac{1}{N}\sum_{i=1}^n w_i Y_i^m, \qquad (7)$$

where $N = \sum_{i=1}^n w_i$. Moreover, for any $\theta \in (0, 1)$ it holds

$$L(W, \widetilde\theta) - L(W, \theta) = N\, \mathcal{KL}(\widetilde\theta, \theta),$$

where $\mathcal{KL}(\cdot, \cdot)$ stands for the Kullback-Leibler divergence between two Bernoulli distributions, defined in (9) below.

The proof of the proposition is straightforward: it suffices to compute the derivative of (6) with respect to $\theta$ and set it to zero.
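A direct numpy sketch of the estimate (7) follows; the variable names and the fallback value $1/2$ for an empty neighbourhood are our own choices and only illustrate the formula:

```python
import numpy as np

def local_mle(weights, y_binary, eps=1e-12):
    """Local maximum likelihood estimate (7) for one class at a fixed point x_0.

    weights  : (n,) non-negative localizing weights w_i,
    y_binary : (n,) One-vs-All labels Y_i^m in {0, 1}.
    Returns the weighted mean sum_i w_i Y_i^m / sum_i w_i together with the
    effective local sample size N = sum_i w_i (used later for aggregation).
    """
    n_local = float(np.sum(weights))
    if n_local <= eps:                      # empty neighbourhood: no information
        return 0.5, 0.0
    return float(np.dot(weights, y_binary)) / n_local, n_local
```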

We proceed with two examples of such weights.


Example 2.1: k nearest neighbors

For k-NN estimates we have

$$w_i = \mathbb{1}\big(X_i \in \mathcal{N}_k(x_0)\big),$$

where $\mathcal{N}_k(x_0)$ is the set of the $k$ points nearest to $x_0$ among $X_1, \dots, X_n$. Then

$$\widetilde\theta = \frac{1}{k}\sum_{i:\, X_i \in \mathcal{N}_k(x_0)} Y_i^m.$$

Example 2.2: bandwidth-based kernel estimates

For bandwidth-based kernel estimates the localizing weights are defined by the formula

$$w_i = K\left(\frac{\|X_i - x_0\|}{h}\right), \qquad (8)$$

where $\|\cdot\|$ stands for some norm, $h > 0$ is called the bandwidth and $K(\cdot)$ is a localizing kernel. Standard examples of such kernels are the following:

  • rectangular kernel: $K(u) = \mathbb{1}(|u| \le 1)$;

  • triangular kernel: $K(u) = (1 - |u|)_+$;

  • Epanechnikov kernel: $K(u) = (1 - u^2)_+$;

  • Gaussian kernel: $K(u) = e^{-u^2/2}$.

We will use the Euclidean norm in the examples in Section 3.
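The two localizing schemes above can be sketched in numpy as follows (Euclidean norm, as used in Section 3; the kernels are left unnormalized, since normalizing constants cancel in (7); the function names are ours):

```python
import numpy as np

def knn_weights(X, x0, k):
    """Indicator weights of the k nearest neighbours of x0 (Example 2.1)."""
    dist = np.linalg.norm(X - x0, axis=1)
    w = np.zeros(len(X))
    w[np.argsort(dist)[:k]] = 1.0
    return w

def kernel_weights(X, x0, h, kernel="epanechnikov"):
    """Bandwidth-based weights w_i = K(||X_i - x0|| / h), cf. (8)."""
    u = np.linalg.norm(X - x0, axis=1) / h
    if kernel == "rectangular":
        return (u <= 1.0).astype(float)
    if kernel == "triangular":
        return np.maximum(1.0 - u, 0.0)
    if kernel == "epanechnikov":
        return np.maximum(1.0 - u ** 2, 0.0)
    if kernel == "gaussian":
        return np.exp(-u ** 2 / 2.0)
    raise ValueError("unknown kernel")
```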


Both k-NN and bandwidth-based localizing schemes require a proper choice of the smoothing parameter ($k$ and $h$, respectively). Fortunately, the Bernoulli distribution belongs to an exponential family, and such distributions are quite well studied. In particular, in [6] an adaptive procedure for choosing the smoothing parameter was proposed. We will refer to that procedure as SSA (Spatial Stagewise Aggregation), as it is called in the paper [6].

Let $W_1, \dots, W_K$ be a set of localizing schemes, i.e. $W_k = \big(w_1^{(k)}(x_0), \dots, w_n^{(k)}(x_0)\big)$ for each $k$. Each localizing scheme $W_k$ induces a set of estimates $\widetilde\theta_k^{(1)}(x_0), \dots, \widetilde\theta_k^{(M)}(x_0)$, defined by (7). Using the SSA procedure, we can get aggregated estimates $\widehat\theta^{(1)}(x_0), \dots, \widehat\theta^{(M)}(x_0)$. It was shown in [6] that for each $m$ the aggregated estimate $\widehat\theta^{(m)}(x_0)$ behaves almost like the best estimate from the collection $\widetilde\theta_1^{(m)}(x_0), \dots, \widetilde\theta_K^{(m)}(x_0)$. Next, we can use the plug-in rule (3). We will only require that for each $k$ from $1$ to $K - 1$ it holds

(A1)    $w_i^{(k)}(x_0) \le w_i^{(k+1)}(x_0)$ for all $1 \le i \le n$.

For the two examples considered above, this condition means that the candidates for the best model should be ordered by the number of nearest neighbors or by the bandwidth. A detailed description of the procedure for multiclass classification is given in Section 3. We will refer to it as MSSA (Multiclass Spatial Stagewise Aggregation).

To show consistency of the MSSA procedure we will derive, under certain assumptions, convergence rates for the aggregated estimates measured in the Kullback-Leibler (KL) loss. For two Bernoulli distributions with parameters $\theta$ and $\eta$ the KL-divergence is defined by

$$\mathcal{KL}(\theta, \eta) = \theta \log\frac{\theta}{\eta} + (1 - \theta)\log\frac{1 - \theta}{1 - \eta}, \qquad (9)$$

and it is more informative than the squared error. In the theoretical study, we will require regularity of the KL-divergence. Namely, we assume that there exist constants such that

(A2)

or, equivalently,

(A)

where $\upsilon = \log\frac{\theta}{1 - \theta}$ is the natural parameter of the Bernoulli distribution and

$$I(\upsilon) = \frac{e^{\upsilon}}{(1 + e^{\upsilon})^2} = \theta(1 - \theta)$$

is the Fisher information. These regularity conditions are standard (cf. [6], [24]). One may easily notice that $I(\upsilon)$ is bounded above by $1/4$.

Assumptions (A2) and (A) describe metric-like properties of the KL-divergence: under these assumptions the KL-divergence is equivalent, up to constants, to a squared distance between the parameters. More precise statements will be discussed in Section 4.
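Since the Bernoulli KL-divergence (9) is used repeatedly below, here is a small numpy helper that we reuse in the sketches of Section 3; the clipping constant eps is our own safeguard against infinite values at the boundary of the parameter interval:

```python
import numpy as np

def bernoulli_kl(theta, eta, eps=1e-12):
    """Kullback-Leibler divergence KL(Bernoulli(theta) || Bernoulli(eta)), cf. (9).

    Parameters are clipped away from {0, 1} to keep the logarithms finite.
    """
    theta = np.clip(theta, eps, 1.0 - eps)
    eta = np.clip(eta, eps, 1.0 - eps)
    return theta * np.log(theta / eta) + (1.0 - theta) * np.log((1.0 - theta) / (1.0 - eta))
```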

Besides bounds on the estimation error, we also derive bounds on the classification error. Assume that the pairs $(X, Y)$ are such that $X$ is drawn according to some distribution $\Pi$ and the conditional distribution of $Y$ given $X$ is given by (1). For every rule $f$ define the risk

$$R(f) = \mathbb{P}\big(f(X) \ne Y\big)$$

and the excess risk

$$\mathcal{E}(f) = R(f) - R(f^*),$$

where $f^*$ stands for the Bayes classifier (2).

Since we deal with a problem of nonparametric estimation, even the optimal estimator can show poor performance in the case of large dimension $d$. Low noise assumptions are usually used to speed up rates of convergence and allow plug-in classifiers to achieve fast rates. We can rewrite the excess risk as

$$\mathcal{E}(f) = \mathbb{E}\Big[\big(\theta_{f^*(X)}(X) - \theta_{f(X)}(X)\big)\,\mathbb{1}\big(f(X) \ne f^*(X)\big)\Big].$$

In the case of binary classification, a misclassification often occurs when $\theta_1(X)$ is close to $1/2$ with high probability. The well-known Mammen-Tsybakov noise condition ensures that such a situation appears with low probability. Namely, it assumes that there exist universal non-negative constants $B$ and $\beta$ such that for all $t > 0$ it holds

$$\mathbb{P}\big(0 < |2\theta_1(X) - 1| \le t\big) \le B t^{\beta}. \qquad (10)$$

This assumption can be generalized to the multiclass case. Suppose that at the point $x$ we have the values $\theta_1(x), \dots, \theta_M(x)$. Let $\theta_{(1)}(x) \ge \theta_{(2)}(x) \ge \dots \ge \theta_{(M)}(x)$ be the ordered values of $\theta_1(x), \dots, \theta_M(x)$. Then the condition (10) for the multiclass case can be formulated as follows (cf. [1], [14]):

$$\mathbb{P}\big(0 < \theta_{(1)}(X) - \theta_{(2)}(X) \le t\big) \le B t^{\beta} \qquad (11)$$

for some non-negative constants $B$ and $\beta$. We will use this assumption to establish fast rates for the constructed plug-in classifier in Section 4.

3 Algorithm and numerical experiments

3.1 Algorithm

Suppose one has candidates $W_1, \dots, W_K$ for an optimal localizing scheme, where $W_k = \big(w_1^{(k)}(x_0), \dots, w_n^{(k)}(x_0)\big)$. The set of weights $W_k$ induces weak estimates $\widetilde\theta_k^{(m)}(x_0)$ for each class $m$ from $1$ to $M$. Denote

$$N_k = \sum_{i=1}^n w_i^{(k)}(x_0), \quad 1 \le k \le K. \qquad (12)$$

We assume that any two collections of weights differ in at least one element. The idea of the procedure is simple. For each class $m$, on the first step we choose the estimate $\widetilde\theta_1^{(m)}$. This estimate is very local and therefore has the smallest bias and the largest variance, of order $1/N_1$. Next, we try to enlarge the vicinity of averaging under the condition that the bias does not change dramatically. For this purpose we run a likelihood-ratio test of homogeneity: if the hypothesis of homogeneity is correct, then the difference between $\widetilde\theta_2^{(m)}$ and $\widehat\theta_1^{(m)}$ will not be significant. Otherwise, the function $\theta_m(\cdot)$ changes quickly in the vicinity of the test point and it is better to keep the more local estimate. Fix a critical value $z_2$ and construct the estimate $\widehat\theta_2^{(m)} = \gamma_2 \widetilde\theta_2^{(m)} + (1 - \gamma_2)\widehat\theta_1^{(m)}$, where the coefficient $\gamma_2$ is close to $1$ if the test statistic is much less than $z_2$, and close to $0$ if it exceeds the value $z_2$. On step $k$, $2 \le k \le K$, we repeat this test for the new weak estimate $\widetilde\theta_k^{(m)}$ and the estimate $\widehat\theta_{k-1}^{(m)}$ constructed on the previous step.

1: procedure MSSA($W_1, \dots, W_K$ – a given collection of localizing schemes, $z_2, \dots, z_K$ – a given set of critical values)
2:     for $m = 1, \dots, M$ do
3:         for each $k$ calculate the weak estimate $\widetilde\theta_k^{(m)}$ and the local sample size $N_k$
4:         $\widehat\theta_1^{(m)} \leftarrow \widetilde\theta_1^{(m)}$
5:         for $k = 2, \dots, K$ do
6:             compute the test statistic $T_k^{(m)}$ comparing $\widehat\theta_{k-1}^{(m)}$ with $\widetilde\theta_k^{(m)}$
7:             compute the mixing coefficient $\gamma_k \in [0, 1]$ from $T_k^{(m)}$ and the critical value $z_k$
8:             $\widehat\theta_k^{(m)} \leftarrow \gamma_k \widetilde\theta_k^{(m)} + (1 - \gamma_k)\,\widehat\theta_{k-1}^{(m)}$
9:     return $\widehat\theta_K^{(1)}, \dots, \widehat\theta_K^{(M)}$

Algorithm 1: Multiclass Spatial Stagewise Aggregation (MSSA)
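The following Python sketch mirrors the structure of Algorithm 1 for a single class at a single test point, reusing the bernoulli_kl helper from Section 2. The piecewise-linear aggregation kernel and the exact form of the test statistic are our own modelling choices in the spirit of the SSA construction of [6] and may differ in detail from the authors' implementation:

```python
import numpy as np

def agg_kernel(t):
    """Mixing coefficient in [0, 1]: equal to 1 for t <= 1, equal to 0 for
    t >= 2 and linearly interpolated in between. Any decreasing map from
    [0, inf) to [0, 1] could be used; this particular choice is illustrative.
    """
    return float(np.clip(2.0 - t, 0.0, 1.0))

def mssa_one_class(weak_estimates, local_sizes, z):
    """Stagewise aggregation of K weak estimates for one class at one test point.

    weak_estimates : (K,) local MLEs ordered from the most local scheme to the
                     broadest one, cf. assumption (A1),
    local_sizes    : (K,) effective local sample sizes N_k = sum_i w_i^{(k)},
    z              : (K,) critical values; z[0] is never used.
    Returns the aggregated estimate produced after the last step.
    """
    theta_hat = weak_estimates[0]            # start from the most local estimate
    for k in range(1, len(weak_estimates)):
        # likelihood-ratio-type statistic comparing the previous aggregated
        # estimate with the current, less local, weak estimate
        t_stat = local_sizes[k - 1] * bernoulli_kl(theta_hat, weak_estimates[k])
        gamma = agg_kernel(t_stat / z[k])    # close to 1 if homogeneity is accepted
        theta_hat = gamma * weak_estimates[k] + (1.0 - gamma) * theta_hat
    return theta_hat
```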

The choice of the mixing coefficients $\gamma_k$ is the same as in [6]. The MSSA procedure returns aggregated estimates $\widehat\theta_K^{(1)}(x_0), \dots, \widehat\theta_K^{(M)}(x_0)$, one for each class. The classification rule is defined as

$$\hat f(x_0) = \arg\max_{1 \le m \le M} \widehat\theta_K^{(m)}(x_0).$$

The choice of the parameters $z_2, \dots, z_K$ is crucial for the performance of the procedure. We tune the values $z_k$ according to the propagation condition. The propagation condition means that, in the homogeneous case, the procedure must return the estimate corresponding to the broadest localizing scheme $W_K$. If this does not happen, the situation is called early stopping. The chosen values $z_k$ must ensure that early stopping occurs with a small probability (e.g. 0.05). In the experiments described below, the values $z_k$ were tuned at a single point and then used for all test points.
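One simple way to calibrate the critical values by Monte Carlo under the propagation condition is sketched below. It assumes k-NN indicator weights (so that the effective sample sizes are the neighbour counts themselves) and replaces the full aggregation step by its propagation limit, so it only illustrates the idea rather than reproducing the exact tuning used in the paper:

```python
import numpy as np

def tune_critical_values(local_sizes, theta0=0.5, n_rep=1000, level=0.05, seed=0):
    """Monte Carlo calibration under the propagation condition: on homogeneous
    data with constant parameter theta0 the test statistics should rarely
    exceed the chosen critical values.

    local_sizes : increasing integer neighbour counts k_1 < ... < k_K
                  (k-NN indicator weights are assumed, so N_k = k).
    Returns an array z of length K; z[k] is the (1 - level)-quantile of the
    simulated statistic at step k, and z[0] is a placeholder that is never used.
    """
    rng = np.random.default_rng(seed)
    sizes = np.asarray(local_sizes, dtype=float)
    K = len(sizes)
    n = int(sizes[-1])
    stats = np.zeros((n_rep, K))
    for r in range(n_rep):
        y = rng.binomial(1, theta0, size=n)            # homogeneous binary labels
        csum = np.cumsum(y)
        weak = csum[sizes.astype(int) - 1] / sizes     # nested k-NN averages
        theta_hat = weak[0]
        for k in range(1, K):
            stats[r, k] = sizes[k - 1] * bernoulli_kl(theta_hat, weak[k])
            # under propagation the broader estimate is (approximately) accepted
            theta_hat = weak[k]
    return np.quantile(stats, 1.0 - level, axis=0)
```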

In all numerical experiments, we choose localizing weights according to nearest-neighbor-based schemes. Namely, for a number of neighbors $k$ we set the bandwidth $h_k$ equal to the distance from the test point $x_0$ to its $k$-th nearest neighbor. The weight of the $i$-th observation is then defined by the formula

$$w_i^{(k)} = K\left(\frac{\|X_i - x_0\|}{h_k}\right),$$

where $K(\cdot)$ is either the Gaussian or the Epanechnikov kernel. With such schemes, computing the weak estimates for one class and aggregating them is cheap, so the estimates at a test point are obtained quickly.
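Putting the pieces together, here is a sketch of the whole prediction step at one test point, reusing the helper functions from the earlier sketches (one_vs_all_labels, kernel_weights, local_mle and mssa_one_class):

```python
import numpy as np

def mssa_predict(X_train, y_train, x0, ks, z, n_classes, kernel="epanechnikov"):
    """MSSA prediction at a single test point x0 (illustrative pipeline).

    ks : increasing neighbour counts k_1 < ... < k_K; the bandwidth of the k-th
         scheme is the distance from x0 to its k-th nearest neighbour.
    z  : critical values of length len(ks), e.g. from tune_critical_values.
    """
    dist = np.sort(np.linalg.norm(X_train - x0, axis=1))
    y_bin = one_vs_all_labels(y_train, n_classes)
    theta_hat = np.zeros(n_classes)
    for m in range(n_classes):
        weak, sizes = [], []
        for k in ks:
            w = kernel_weights(X_train, x0, h=dist[k - 1], kernel=kernel)
            est, n_local = local_mle(w, y_bin[:, m])
            weak.append(est)
            sizes.append(n_local)
        theta_hat[m] = mssa_one_class(np.array(weak), np.array(sizes), z)
    return int(np.argmax(theta_hat)), theta_hat        # plug-in rule (3)
```

For instance, one may take a geometric grid of neighbour counts such as ks = np.unique(np.geomspace(5, len(X_train) // 2, num=10).astype(int)) and calibrate z with the tune_critical_values sketch above; both choices are illustrative and not prescribed by the paper.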

3.2 Experiments on artificial datasets

We start with presenting the performance of MSSA on artificial datasets. We generate points from a mixture model: first, a label $Y$ is drawn with prior class probabilities

$$\mathbb{P}(Y = m) = p_m, \quad 1 \le m \le M,$$

and then, given $Y = m$, the feature vector $X$ is drawn from a density $f_m(x)$. Then the density of $X$ is given by the formula

$$p(x) = \sum_{m=1}^{M} p_m f_m(x). \qquad (13)$$

The Bayes rule for this case is given by the formula

$$f^*(x) = \arg\max_{1 \le m \le M} p_m f_m(x).$$

Below we provide results for three different experiments.
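As an illustration, the following sketch shows how such mixture data can be generated and how the Bayes rule is evaluated for isotropic Gaussian components; the particular priors, means and variances are left to the caller, since the values used in the experiments are specified in the text:

```python
import numpy as np

def sample_mixture(n, priors, means, sigmas, seed=0):
    """Draw n points from the mixture (13): first a class label with the prior
    probabilities, then X | Y = m from an isotropic Gaussian N(mu_m, sigma_m^2 I)."""
    rng = np.random.default_rng(seed)
    priors, means, sigmas = np.asarray(priors), np.asarray(means), np.asarray(sigmas)
    y = rng.choice(len(priors), size=n, p=priors)
    X = means[y] + sigmas[y, None] * rng.standard_normal((n, means.shape[1]))
    return X, y

def bayes_rule(x, priors, means, sigmas):
    """Bayes classifier: the class maximizing prior times Gaussian density at x."""
    priors, means, sigmas = np.asarray(priors), np.asarray(means), np.asarray(sigmas)
    d = means.shape[1]
    sq_dist = ((x[None, :] - means) ** 2).sum(axis=1)
    log_post = np.log(priors) - d * np.log(sigmas) - sq_dist / (2.0 * sigmas ** 2)
    return int(np.argmax(log_post))
```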

Typical sample realizations in all three experiments are shown in Figure 1. In each experiment, we took an increasing sequence of integers $k_1 < k_2 < \dots < k_K$ and considered $k_j$-nearest-neighbor-based localizing schemes with Epanechnikov and Gaussian kernels. We computed average leave-one-out cross-validation errors over repeated sample realizations.

In the first experiment, we took equal prior class probabilities and considered a mixture of the form (13) whose components are densities of Gaussian random vectors with given means and variances.

Figure 1: Sample realizations in the first (left), second (center) and third (right) experiments with artificial datasets.

Misclassification errors for each weak estimate and for the aggregated estimate, for both k-NN and bandwidth-based localizing schemes, are shown in Figure 2.

Figure 2: Misclassification errors for nearest-neighbor-based classifiers with Epanechnikov (left) and Gaussian (right) kernels. The solid line corresponds to the MSSA misclassification error.

In the second experiment, we took equal prior class probabilities and considered a mixture of the form (13) whose components are densities of Gaussian random vectors with given means and variances. Misclassification errors for each weak estimate and for the aggregated estimate, for both k-NN and bandwidth-based localizing schemes, are shown in Figure 3.

Figure 3: Misclassification errors for nearest-neighbor-based classifiers with Epanechnikov (left) and Gaussian (right) kernels. The solid line corresponds to the MSSA misclassification error.

Finally, in the third experiment, we took equal prior class probabilities and considered a mixture of the form (13) whose components are densities of Gaussian random vectors with given means and variances. Misclassification errors for each weak estimate and for the aggregated estimate, for both k-NN and bandwidth-based localizing schemes, are shown in Figure 4.

Figure 4: Misclassification errors for nearest-neighbor-based classifiers with Epanechnikov (left) and Gaussian (right) kernels. The solid line corresponds to the MSSA misclassification error.

3.3 Experiments on real-world datasets

We proceed with experiments on datasets from the UCI repository [12]: Ecoli, Iris, Glass, Pendigits, Satimage, Seeds, Wine and Yeast. Brief information about these datasets is given in Table 1.

Dataset     Train   Test   Attributes   Classes   Class distribution (in %)
Ecoli         336      –        7           8     42.6, 22.9, 15.5, 10.4, 5.9, 1.5, 0.6, 0.6
Iris          150      –        4           3     33.3, 33.3, 33.3
Glass         214      –        9           6     32.7, 35.5, 7.9, 6.1, 4.2, 13.6
Pendigits    7494   3498       16          10     10.4, 10.4, 10.4, 9.6, 10.4, 9.6, 9.6, 9.6, 10.4, 9.6
Satimage     4435   2000       36           6     24.1, 11.1, 20.3, 9.7, 11.1, 23.7
Seeds         210      –        7           3     33.3, 33.3, 33.3
Wine          178      –       13           3     33.1, 39.8, 26.9
Yeast        1484      –        8          10     16.4, 28.1, 31.2, 2.9, 2.3, 3.4, 10.1, 2.0, 1.3, 0.3
Table 1: Information about datasets from the UCI repository [12]. A dash in the Test column means that no separate test set is provided; for these datasets the errors are estimated by leave-one-out cross-validation.

We compare the performance of our algorithm with the boosting of k-NN classifiers considered in [4] and with SVM [22]. For the Pendigits and Satimage datasets we calculated the misclassification error on the test dataset; for all other datasets we used leave-one-out cross-validation. The results of our experiments are shown in Table 2; the best ones are boldfaced.

Dataset     EK MSSA       GK MSSA       Boost-NN [4]   SVM [22] (table 2)
Ecoli       12.8 ± 1.8    12.5 ± 1.8    13.0 ± 5.3
Iris         0.0           0.0
Glass       27.5 ± 3.1    26.6 ± 3.0    24.4 ± 1.7
Pendigits    2.6 ± 0.3     2.5 ± 0.3     0.5 ± 0.1
Satimage     9.6 ± 0.7     9.6 ± 0.7     9.6 ± 0.3      11.0 ± 0.7
Seeds        5.7 ± 1.6     5.7 ± 1.6     4.8 ± 2.4
Wine         2.2 ± 1.1     2.2 ± 1.1     1.7 ± 1.5
Yeast       40.5 ± 1.3    40.4 ± 1.3
Table 2: Misclassification errors (in %) with standard deviations for datasets from the UCI repository. Best results are boldfaced.

From Table 2, one can observe that localizing schemes with the Gaussian kernel behave slightly better than those with the Epanechnikov kernel, and that MSSA with both kernels is comparable with SVM.

4 Theoretical properties

4.1 Main results

Before we formulate the main theoretical properties of the procedure, we introduce an additional assumption. Namely, we assume that there exist constants such that

(A3)

The choice of models fulfilling assumption (A3) is up to the statistician. Note that this assumption is quite reasonable in the sense that if the smallest effective sample size is of logarithmic order and the largest one is of order $n$, then the number of models $K$ we aggregate is of order $\log n$ and thus not huge.

The main theoretical properties of the MSSA procedure are formulated in the following theorems. The first two results concern the accuracy of estimation.

Theorem 1.

Let assumptions (A1)–(A3) be fulfilled. Fix an arbitrary confidence parameter and define a constant from the condition

(14)

Set

(15)

Then, with probability at least the prescribed confidence level, simultaneously for all $m$ from $1$ to $M$ it holds

(16)

and for all $k$ from $1$ to $K$

(17)
(18)

where the leading constant is universal and the remaining constants depend only (polynomially) on the constants in the assumptions.

The result of Theorem 1 improves the results in [6]. However, note that this theorem does not imply similar results in expectation, since the choice of parameters depends on the predetermined confidence level. Note also that the logarithmic term in the number of models $K$ in (17) is usual for problems of model selection and cannot be improved.

The next result establishes rates of convergence for the procedure.

Theorem 2.

Let the conditions of Theorem 1 be fulfilled and set the parameters as in (15). Let

Select

(19)

Suppose that for each $m$ from $1$ to $M$ there exists a localizing scheme such that

and

(20)

for some positive constants. Then, with probability at least the prescribed confidence level, it holds

and

for some absolute constants.

The rate is optimal for the estimation of Lipschitz functions under regularity of the design. The MSSA procedure attains the optimal rate up to a logarithmic factor, which can be considered as the price for adaptation.

Note that condition (A) implies that the KL-divergence is bounded. This allows one to obtain bounds in expectation for moments of the KL-loss. Indeed, fix an arbitrary moment order and choose the confidence parameter appropriately. Using the result of Theorem 2 we immediately obtain

Bounds in expectation can be easily improved by a simple modification of the procedure. Namely, fix several confidence levels and, for each of them, compute the MSSA estimate with the parameters defined by formula (15). Finally, denote

(21)

A rigorous result is formulated in the next theorem.

Theorem 3.

Under the assumptions of Theorem 2, the above choice of the confidence levels ensures that

for all moment orders considered above.

The proof of this result is given in Section 4.4. Note that the modified procedure requires running the MSSA algorithm several times and does not have a significant influence on the computational time.

With guarantees on the estimation performance at hand, we are ready to provide bounds on the excess risk of misclassification. Now we assume that the test point $X$ is drawn randomly according to the distribution $\Pi$ and the label $Y$ has the conditional distribution (1).

Theorem 4.

Let the training sample have independent entries, let the test point $X$ be generated from the distribution $\Pi$ and let the label $Y$ be given by (1). Let the multiclass low noise assumption (11) be fulfilled and suppose that, for each realization of the training sample and the test point, a collection of localizing schemes is chosen in a way that ensures (A1) and (A3) with probability 1. Suppose also that (A2) holds. Choose a constant from the condition (14) and set the parameters according to (15). Let

and select

(22)

Suppose that for each $m$ from $1$ to $M$ there exists a localizing scheme such that

and

(23)

for some positive constants. Then for the excess risk one has

for some positive constant.

4.2 Proof of Theorem 1

We start with two auxiliary results, which were obtained in [21] and [6], respectively.

Lemma 1.

Let (A1) and (A2) be fulfilled. Then for the likelihood estimate of the form (7) and an arbitrary positive level it holds

Lemma 2.

Under assumptions (A1)–(A3), it holds

Moreover,

where the involved quantities are defined above.

We also use the reparametrization

$$\upsilon = \log\frac{\theta}{1 - \theta} \qquad (24)$$

throughout the proof.

Proof of Theorem 1.

Let us fix an arbitrary $m$ from $1$ to $M$ and an arbitrary $k$ from $1$ to $K$.

First, note that (14) and (15) imply

Now we bound the latter quantity. Note that