
# Pointwise adaptation via stagewise aggregation of local estimates for multiclass classification

Financial support by the Russian Academic Excellence Project 5-100 and by the German Research Foundation (DFG) through the Collaborative Research Center 1294 is gratefully acknowledged.

Nikita Puchkin (npuchkin@gmail.com) and Vladimir Spokoiny (spokoiny@wias-berlin.de)

Affiliations: Institute for Information Transmission Problems RAS, Bolshoy Karetny per. 19, 127051, Moscow, RF; Weierstrass Institute and Humboldt University, Mohrenstr. 39, 10117 Berlin, Germany; National Research University Higher School of Economics, 20 Myasnitskaya ulitsa, 101000, Moscow, RF; Moscow Institute of Physics and Technology, 141701, Dolgoprudny, RF; Skolkovo Institute of Science and Technology (Skoltech), 143026, Moscow, RF.
###### Abstract

We consider a problem of multiclass classification, where the training sample is generated from the model P(Y = m | X) = θ∗m(X), 1 ⩽ m ⩽ M, and θ∗1(x), …, θ∗M(x) are unknown Lipschitz functions. Given a test point x, our goal is to estimate θ∗1(x), …, θ∗M(x). An approach based on nonparametric smoothing uses a localization technique, i.e. the weight of an observation depends on the distance between its design point and x. However, local estimates strongly depend on the localizing scheme. In our solution we fix several schemes W(1), …, W(K), compute corresponding local estimates ˜θ(1), …, ˜θ(K) for each of them and apply an aggregation procedure. We propose an algorithm, which constructs a convex combination of the estimates ˜θ(1), …, ˜θ(K) such that the aggregated estimate behaves approximately as well as the best one from the collection. We also study theoretical properties of the procedure, prove oracle results and establish rates of convergence under mild assumptions.


MSC 2010 subject classifications: Primary 62H30; secondary 62G08, 62G10.

## 1 Introduction

Multiclass classification is a natural generalization of the well-studied problem of binary classification with a wide range of applications. It is a problem of supervised learning, where one observes a sample (X1, Y1), …, (Xn, Yn) with Xi ∈ ℝd and Yi ∈ {1, …, M}. The pairs (Xi, Yi) are generated independently according to some distribution over ℝd × {1, …, M}. The learner’s task is to find a rule f in order to make the probability of misclassification

 P(Y≠f(X))

as small as possible. For a given class F of admissible functions, one is often interested in the excess risk

 E(f)=P(Y≠f(X))−minf′∈FP(Y≠f′(X)),

which shows how far the classifier f is from the best one in the class F. Note that in this setting f may be chosen outside of F.

Concerning the multiclass learning problem, one can distinguish between two main approaches. The first one reduces the problem to binary classification. The most popular and straightforward examples of these techniques are One-vs-All (OvA) and One-vs-One (OvO). Another example of reduction to the binary case is given by error correcting output codes (ECOC) [13]. In [2] this approach was generalized to margin classifiers. A similar approach uses tree-based classifiers. Methods of the second type solve a single problem, as is done in multiclass SVM [8] and the multiclass one-inclusion graph strategy [27]. One can refer to [3] for a brief overview of multiclass classification methods. Daniely, Sabato and Shalev-Shwartz in [10] compared OvA, OvO, ECOC, tree-based classifiers and multiclass SVM for linear discrimination rules in a finite-dimensional space. From their theoretical study, multiclass SVM outperforms the OvA method. In [8] Crammer and Singer also showed a superiority of multiclass SVM on several datasets. Nevertheless, in our work, we will use One-vs-All for two reasons. First, we will consider a broad nonparametric class of functions and the results in [10] do not cover this case. Second, in [23] Rifkin and Klautau showed that OvA behaves comparably to multiclass SVM provided that the binary classifier used in OvA is strong.

For each class m, we construct binary labels Y′m,i = \mathbbm1(Yi = m) and assume that, given X, the conditional distribution of Y′m is Bernoulli Be(θ∗m(X)), where 0 ⩽ θ∗m(X) ⩽ 1 and θ∗1(X) + … + θ∗M(X) = 1. This model is very general and covers all possible distributions of Y on M points. We must put some restrictions on the functions θ∗m(·), 1 ⩽ m ⩽ M. We will provide learning guarantees for the class of Lipschitz functions, i.e. we assume there exists a constant L > 0 such that for all x, x′ and for all m from 1 to M it holds

 |θ∗m(x)−θ∗m(x′)|⩽L|x−x′|.

For this model, the optimal classifier can be found analytically: f∗(x) = argmax1⩽m⩽M θ∗m(x).

Unfortunately, the true values θ∗m(x) are unknown, therefore we study a plug-in rule ˆf(x) = argmax1⩽m⩽M ˆθm(x),

where ˆθm(x) stands for an estimate of θ∗m(x), 1 ⩽ m ⩽ M. This reduces the problem of classification to a regression problem. In [11] it was shown that in general the regression problem is more difficult than classification. Fortunately, for some classes (including the class of Lipschitz functions), classification and regression have similar complexities, as was shown in [30].

For problems of nonparametric regression, different localization techniques are often used. Namely, one considers an estimate ˜θ defined by maximization of the localized log-likelihood

 L(W,θ)=n∑i=1wiℓ(Yi,θ),

where ℓ(Yi, θ) is the log-likelihood of the i-th observation, W = (w1, …, wn) is called a localizing scheme and the localizing weights wi depend on Xi and the test point x. Particular examples of this technique are the Nadaraya-Watson estimator, local polynomial estimators and nearest-neighbor-based estimators.

Note that the estimate ˜θ strongly depends on the localizing scheme W, and its choice determines the performance of the plug-in classifier. Moreover, in multiclass learning there is a common problem of class imbalance, i.e. some classes may not be represented in a small vicinity of a given point. Obviously, one localizing scheme is not enough in such a situation. To solve this problem, we consider several localizing schemes W(1), …, W(K), compute local likelihood estimates (also called weak estimates) ˜θ(k), 1 ⩽ k ⩽ K, for each of them and use a plug-in classifier based on a convex combination of these estimates. The aggregation of weak estimates is a key feature of our procedure.

The aggregation of estimators takes its origins in model selection and was generalized to convex and linear aggregation in [15]. In [28] and [31] optimal rates of aggregation were derived. Aggregation procedures have a wide range of applications and can be used in regression problems ([28], [31]), density estimation ([21], [25], [17]) and classification problems ([29], [32], [21]). They often solve an optimization problem in order to find the aggregating coefficients ([16], [9], [19], [20]). In some cases, such as exponential weighting ([18], [26]), the solution of the optimization problem can be written explicitly. Aggregation under the KL-loss was also studied in [24] and [7], where optimal rates of aggregation and exponential bounds were obtained. However, most of the existing aggregation procedures and results concern global aggregation. This means that the aggregating coefficients are universal and do not depend on the point where the classification rule is applied.

Our approach is based on local aggregation, yielding a point-dependent aggregation scheme. However, the proposed procedure does not require solving an optimization problem. Instead, it sequentially finds a convex combination of weak estimates, which mimics the best possible choice of a model under the Kullback-Leibler loss for a given test point x. The idea of the approach originates from [6], where an aggregation of binary classifiers was studied.

Finally, it is worth mentioning that nonparametric estimates have slow rates of convergence, especially in the case of high dimension d. It was shown in [5] and then in [14] that plug-in classifiers can achieve fast learning rates under certain assumptions in both binary and multiclass classification problems. We will use a similar technique to derive fast learning rates for the plug-in classifier based on the aggregated estimate.

The main contributions of this paper are the following:

• we propose an algorithm of multiclass classification, based on aggregation of local likelihood estimates, which works for a broad class of admissible functions;

• the procedure is robust against class imbalance and outliers;

• the computational time of the procedure grows only moderately with the sizes of the train and test datasets (see Section 3.1), which makes it scalable to large problems;

• theoretical guarantees claim optimal accuracy of classification with only a logarithmic payment for the number of classes and aggregated estimates.

The paper is organized as follows. In Section 2 we introduce definitions and notations. In Section 3 we formulate the multiclass classification procedure and demonstrate its performance on both artificial and real-world datasets. Finally, in Section 4 we study theoretical properties of the procedure. In particular, we derive oracle results for model selection, establish rates of convergence for the problem of nonparametric estimation and provide bounds for the excess risk E(ˆf).

## 2 Setup and notations

Given a training sample (X1, Y1), …, (Xn, Yn), we apply the following probabilistic model. Suppose that, given X, the labels have a conditional distribution

 P(Y=m|X)=θ∗m(X),1⩽m⩽M, (1)

where 0 ⩽ θ∗m(X) ⩽ 1, 1 ⩽ m ⩽ M, and θ∗1(X) + … + θ∗M(X) = 1. The optimal classifier is the Bayes rule f∗ defined by

 f∗(X)=argmax1⩽m⩽Mθ∗m(X) (2)

Unfortunately, the true values θ∗m(X) are unknown, therefore we fix a test point X and consider a plug-in classifier

 ˆf(X)=argmax1⩽m⩽Mˆθm(X) (3)

where ˆθm(X) stands for an estimate of θ∗m(X), 1 ⩽ m ⩽ M.

Now, the problem is how to estimate θ∗m(X), 1 ⩽ m ⩽ M. Fix some m and transform the labels to binary:

 Y′m,i=\mathbbm1(Yi=m) (4)

It is clear that

 (Y′m,i|Xi)∼Be(θ∗m(Xi)), (5)

where Be(θ) denotes the Bernoulli distribution with parameter θ. This approach is nothing but the One-vs-All procedure for multiclass classification.
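The transformation (4) is a one-liner; as a quick illustration (using NumPy, with arbitrary example labels):

```python
import numpy as np

def one_vs_all_labels(y, m):
    """Binary labels Y'_{m,i} = 1(Y_i = m) for class m, as in (4)."""
    return (np.asarray(y) == m).astype(float)

y = np.array([1, 3, 2, 3, 1])
print(one_vs_all_labels(y, 3))  # [0. 1. 0. 1. 0.]
```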

For fixed and , denote

 θ∗m,i=θ∗m(Xi),1⩽i⩽n, ˆθm,i=ˆθm(Xi),1⩽i⩽n,

and

 θ∗m=θ∗m(X), ˆθm=ˆθm(X).

One of the ways to estimate θ∗m(X) is to consider a localized log-likelihood

 Lm(θm)=n∑i=1wiℓ(Y′m,i,θm), (6)

where ℓ(y, θ) = y log θ + (1 − y) log(1 − θ) is the log-likelihood of one Bernoulli observation and w1, …, wn are non-negative localizing weights. The local maximum likelihood estimate ˜θm can be found explicitly.

###### Proposition 1.

For the log-likelihood function of the form (6) the estimate ˜θm is given by the formula

 ˜θm=Sm/N, (7)

where Sm=n∑i=1wiY′m,i and N=n∑i=1wi. Moreover, for any θm it holds

 Lm(W,˜θm)−Lm(W,θm)=NK(˜θm,θm),

where K(·,·) is the Kullback-Leibler divergence between two Bernoulli distributions, defined in (9) below.

The proof of the proposition is straightforward: compute the derivative and set it to zero.
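Formula (7) is a weighted average of the binary labels; a minimal sketch (the weights and labels below are arbitrary examples):

```python
import numpy as np

def local_mle(w, y_bin):
    """Weighted local ML estimate (7): theta = S / N with
    S = sum_i w_i * Y'_i and N = sum_i w_i."""
    w = np.asarray(w, dtype=float)
    S = np.dot(w, y_bin)
    N = w.sum()
    return S / N

w = np.array([1.0, 0.5, 0.0, 0.25])
y_bin = np.array([1.0, 0.0, 1.0, 1.0])
print(local_mle(w, y_bin))  # 1.25 / 1.75 ≈ 0.714
```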

We proceed with two examples of such weights.

Example 2.1: k nearest neighbors

For k-NN estimates we have

 wi=\mathbbm1(Xi∈Dk(X)),

where Dk(X) is the set of the k points nearest to X among X1, …, Xn. Then

 ˜θm=1k∑i:Xi∈Dk(X)Y′m,i
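In the k-NN case the weighted average reduces to a plain label average over the neighborhood; a small illustration on toy one-dimensional data:

```python
import numpy as np

def knn_estimate(X, y_bin, x, k):
    """k-NN local estimate: the average of the binary labels over the
    k nearest neighbors D_k(x), i.e. indicator weights w_i."""
    d = np.linalg.norm(X - x, axis=1)
    idx = np.argsort(d)[:k]
    return y_bin[idx].mean()

X = np.array([[0.0], [1.0], [2.0], [10.0]])
y_bin = np.array([1.0, 1.0, 0.0, 0.0])
print(knn_estimate(X, y_bin, np.array([0.0]), 2))  # 1.0
```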

Example 2.2: bandwidth-based kernel estimates

For bandwidth-based kernel estimates the localizing weights are defined by the formula

 wi=K(∥Xi−X∥h),1⩽i⩽n, (8)

where ∥·∥ stands for some norm, h > 0 is called the bandwidth and K(·) is a localizing kernel. Standard examples of such kernels are the following:

• rectangular kernel: K(t) = \mathbbm1(|t| ⩽ 1);

• triangular kernel: K(t) = max(1 − |t|, 0);

• Epanechnikov kernel: K(t) = max(1 − t2, 0);

• Gaussian kernel: K(t) = e−t2/2.

We will use the Euclidean norm in the examples in Section 3.
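Assuming the standard unnormalized forms of the Epanechnikov and Gaussian kernels listed above, the weights (8) can be computed as follows (the data points and bandwidth are arbitrary examples):

```python
import numpy as np

def epanechnikov(t):
    # unnormalized Epanechnikov kernel (constants do not matter for weights)
    return np.maximum(1.0 - t ** 2, 0.0)

def gaussian(t):
    return np.exp(-t ** 2 / 2.0)

def kernel_weights(X, x, h, kernel):
    """Localizing weights (8): w_i = K(||X_i - x|| / h), Euclidean norm."""
    t = np.linalg.norm(X - x, axis=1) / h
    return kernel(t)

X = np.array([[0.0, 0.0], [0.5, 0.0], [2.0, 0.0]])
w = kernel_weights(X, np.zeros(2), h=1.0, kernel=epanechnikov)
print(w)  # [1.   0.75 0.  ]
```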

Both k-NN and bandwidth-based localizing schemes require a proper choice of the smoothing parameter (k and h respectively). Fortunately, the Bernoulli distribution belongs to an exponential family. Such distributions are quite well studied. In particular, in [6] an adaptive procedure for choosing the smoothing parameter was proposed. We will refer to that procedure as SSA (Spatial Stagewise Aggregation), as it is called in the paper [6].

Let W(1), …, W(K) be a set of localizing schemes, i.e. W(k) = (w(k)1, …, w(k)n) for each k. Each localizing scheme W(k) induces a set of estimates ˜θ(k)m, 1 ⩽ m ⩽ M, defined by (7). Using the SSA procedure, we can get aggregated estimates ˆθm, 1 ⩽ m ⩽ M. It was shown in [6] that for each m the estimate ˆθm behaves like almost the best estimate from ˜θ(1)m, …, ˜θ(K)m. Next, we can use the plug-in rule (3). We will only require that for each k from 2 to K and each i from 1 to n, it holds

 0⩽w(k−1)i⩽w(k)i⩽1. (A1)

For the two examples considered above, this condition means that the candidates for the best model should be ordered by the number of nearest neighbors or by the bandwidth. A detailed description of the procedure for multiclass classification is given in Section 3. We will refer to it as MSSA (Multiclass Spatial Stagewise Aggregation).

To show consistency of the MSSA procedure, we will derive convergence rates for K(θ∗m, ˆθm) and (θ∗m − ˆθm)2, where K(·, ·) stands for the KL-divergence between two distributions, under certain assumptions. For two Bernoulli distributions with parameters θ and θ′ the KL-divergence is defined by

 K(θ,θ′)=θlogθθ′+(1−θ)log1−θ1−θ′ (9)

and it is more informative than the squared error. In the theoretical study, we will require a regularity of the KL-divergence. Namely, we assume that there exist constants 0 < κ1 ⩽ κ2 such that

 κ1⩽θ(1−θ)⩽κ2,∀θ∈Θ (A2)

or, equivalently,

 supθ1,θ2∈ΘI(θ1)/I(θ2)⩽ϰ2 (A2′)

where ϰ ⩾ 1 and

 I(θ)=Eθ(∂ℓ(Y′,θ)∂θ)2=1θ(1−θ)

is the Fisher information. These regularity conditions are commonly used (cf. [6], [24]). One may easily notice that ϰ2 is bounded above by κ2/κ1.

Assumptions (A2) and (A2′) define metric-like properties of the KL-divergence; under these assumptions,

 K1/2(˜θ,ˆθ)≍|˜θ−ˆθ|

More precise statements will be discussed in Section 4.
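For concreteness, the Bernoulli KL-divergence (9) and its metric-like behavior can be checked numerically; the probe points below are arbitrary and lie well inside (0, 1), as assumption (A2) requires:

```python
import math

def bernoulli_kl(p, q):
    """Kullback-Leibler divergence (9) between Be(p) and Be(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# On a set Theta bounded away from 0 and 1, K^{1/2}(p, q)
# behaves like |p - q| up to constants:
for p, q in [(0.4, 0.5), (0.3, 0.35), (0.2, 0.25)]:
    ratio = math.sqrt(bernoulli_kl(p, q)) / abs(p - q)
    print(round(ratio, 3))
```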

Besides bounds on the estimation error, we also derive bounds on the classification error. Assume that the pairs (X, Y) are such that X is drawn according to some distribution PX and the conditional distribution of Y is given by (1). For every rule f define the risk

 R(f)=P(Y≠f(X))

and the excess risk

 E(ˆf)=R(ˆf)−inffR(f)=R(ˆf)−R(f∗),

where f∗ stands for the Bayes classifier.

Since we deal with a problem of nonparametric estimation, even the optimal estimator can show poor performance in the case of large dimension d. Low noise assumptions are usually used to speed up rates of convergence and allow plug-in classifiers to achieve fast rates. We can rewrite

 R(f)=1−E\mathbbm1(Y=f(X))=1−EXP(Y=f(X)|X)=1−EXθ∗f(X)

In the case of binary classification, a misclassification often occurs when θ∗(X) is close to 1/2 with high probability. The well-known Mammen-Tsybakov noise condition ensures that such a situation appears with low probability. Namely, it assumes that there exist universal non-negative constants B and α such that for all t > 0 it holds

 PX(|2θ∗(X)−1|⩽t)⩽Btα (10)

This assumption can be generalized to the multiclass case. Suppose that at the point X we have

 P(Y=m|X)=θ∗m=θ∗m(X),1⩽m⩽M

Let θ∗(1)(X) ⩾ θ∗(2)(X) ⩾ … ⩾ θ∗(M)(X) be the ordered values of θ∗1(X), …, θ∗M(X). Then the condition (10) for the multiclass case can be formulated as follows (cf. [1], [14])

 PX(|θ∗(1)(X)−θ∗(2)(X)|⩽t)⩽Btα (11)

for some non-negative constants B and α. We will use this assumption to establish fast rates for the plug-in classifier in Section 4.

## 3 Algorithm and numerical experiments

### 3.1 Algorithm

Suppose one has K candidates W(k), 1 ⩽ k ⩽ K, for an optimal localizing scheme. The set of weights W(k) induces weak estimates ˜θ(k)m for each class m from 1 to M. Denote

 Nk=n∑i=1w(k)i (12)

We assume that any two collections of weights differ in at least one element. The idea of the procedure is simple. For each class m, on the first step we choose the estimate ˆθ(1)m = ˜θ(1)m. This estimate is very local and therefore has the smallest bias and the largest variance, of order 1/N1. Next, we try to enlarge the vicinity of averaging under the condition that the bias does not change dramatically. For this purpose we run a likelihood-ratio test of homogeneity: if the hypothesis of homogeneity is correct, then the difference Lm(W(2), ˜θ(2)m) − Lm(W(2), ˆθ(1)m) = N2K(˜θ(2)m, ˆθ(1)m) won't be significant. Otherwise, the function θ∗m(·) changes quickly in the vicinity of the test point and it is better to utilize ˆθ(1)m. Fix a critical value z2 and construct the estimate ˆθ(2)m = γ˜θ(2)m + (1 − γ)ˆθ(1)m, where the coefficient γ is close to 1 if N2K(˜θ(2)m, ˆθ(1)m) is much less than z2, and close to 0 if it exceeds the value z2. On the step k, 3 ⩽ k ⩽ K, we repeat this test for a new estimate ˜θ(k)m and the estimate ˆθ(k−1)m constructed on the previous step.

The choice of γ is the same as in [6]. The MSSA procedure returns M aggregated estimates ˆθ1(X), …, ˆθM(X), one for each class. The classification rule is defined as

 ˆf(X)=argmax1⩽m⩽Mˆθm(X)

The choice of the parameters z2, …, zK is crucial for the performance of the procedure. We tune these values according to the propagation condition. The propagation condition means that in the homogeneous case the procedure must return the estimate corresponding to the broadest localizing scheme W(K). If this does not happen, the situation is called an early stopping. The chosen values must ensure that early stopping occurs with a small probability (e.g. 0.05). In the experiments described further, the values z2, …, zK were tuned only at one point and then used for all test points.

In all numerical experiments, we choose the localizing weights according to nearest-neighbor-based schemes. Namely, for a given number of neighbors we set the bandwidth hk equal to the distance from the test point X to the corresponding nearest neighbor, so that larger numbers of neighbors yield larger bandwidths. The weights are then defined by the formula

 w(k)i=K(∥Xi−X∥hk),

where K(·) is either the Gaussian or the Epanechnikov kernel. Typically, in our experiments the number of schemes K is much smaller than n; it then requires O(nK) time to compute the local estimates for one class and O(K) time to aggregate them. As a result, it takes O(nKM) time to compute the estimates at one test point.
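The stagewise aggregation loop described above can be sketched as follows. This is a simplified illustration, not the authors' exact procedure: the mixing coefficient γ from [6] is replaced by a hypothetical hard accept/reject rule, and the critical values z_k are taken as given rather than tuned by the propagation condition.

```python
import numpy as np

def bernoulli_kl(p, q):
    """KL divergence (9) between Be(p) and Be(q), clipped for stability."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def mssa_one_class(weights, y_bin, z):
    """Stagewise aggregation for one class m.

    weights : list of K weight vectors W^(1), ..., W^(K),
              ordered as in (A1), from most local to broadest
    y_bin   : binary labels Y'_{m,1}, ..., Y'_{m,n}
    z       : critical values z_2, ..., z_K

    The coefficient gamma is a hard 0/1 rule here (a sketch only);
    the actual procedure of [6] uses a smooth transition.
    """
    theta_hat = None
    for k, w in enumerate(weights):
        N_k = w.sum()
        theta_tilde = np.dot(w, y_bin) / N_k           # weak estimate (7)
        if theta_hat is None:
            theta_hat = theta_tilde                    # first, most local step
            continue
        # likelihood-ratio test statistic N_k * K(theta_tilde, theta_hat)
        T = N_k * bernoulli_kl(theta_tilde, theta_hat)
        gamma = 1.0 if T <= z[k - 1] else 0.0          # accept or reject
        theta_hat = gamma * theta_tilde + (1.0 - gamma) * theta_hat
    return theta_hat
```

In a homogeneous region the statistic T stays small at every step, so the sketch propagates to the broadest scheme W^(K), as the propagation condition requires.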

### 3.2 Experiments on artificial datasets

We start by presenting the performance of MSSA on artificial datasets. We generate points from a mixture model:

 p(x|Y=m)=pm(x)

and

 P(Y=m)=πm

Then the density of X is given by the formula

 p(x)=M∑m=1πmpm(x) (13)

The Bayes rule for this case is given by the formula f∗(x) = argmax1⩽m⩽M πmpm(x).

Below we provide results for three different experiments.

Typical sample realizations in all three experiments are shown in Figure 1. In each experiment, we took an increasing sequence of numbers of neighbors and considered the corresponding nearest-neighbor-based localizing schemes with the Epanechnikov and Gaussian kernels. We computed average leave-one-out cross-validation errors over independent sample realizations.

In the first experiment, we took three classes, equal prior class probabilities and considered a mixture of the form (13) with

 p1(x)=ϕ(x,[0,−1],0.5I2) p2(x)=ϕ(x,[√3/2,0],0.5I2) p3(x)=ϕ(x,[−√3/2,0],0.5I2),

where ϕ(x, μ, Σ) stands for the density of a Gaussian random vector with mean μ and covariance matrix Σ.

Misclassification errors for each weak estimate and for the SSA estimate, for both k-NN and bandwidth-based localizing schemes, are shown in Figure 2.
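The data for the first experiment can be generated as follows; this is a sketch, and the sample size n = 500 is an arbitrary placeholder (the original value did not survive extraction):

```python
import numpy as np

def sample_experiment1(n, rng):
    """Sample from the three-component Gaussian mixture of the first
    experiment: means [0, -1], [sqrt(3)/2, 0], [-sqrt(3)/2, 0],
    covariance 0.5 * I_2, equal priors pi_m = 1/3."""
    means = np.array([[0.0, -1.0],
                      [np.sqrt(3) / 2, 0.0],
                      [-np.sqrt(3) / 2, 0.0]])
    y = rng.integers(0, 3, size=n)                         # equal priors
    X = means[y] + rng.normal(scale=np.sqrt(0.5), size=(n, 2))
    return X, y

X, y = sample_experiment1(500, np.random.default_rng(0))
print(X.shape)  # (500, 2)
```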

In the second experiment, we took four classes, equal prior class probabilities and considered a mixture (13) with

 p1(x)=ϕ(x,[−1,−1],0.7I2) p2(x)=ϕ(x,[1,−1],0.7I2) p3(x)=ϕ(x,[−1,1],0.7I2) p4(x)=ϕ(x,[1,1],0.7I2),

where ϕ(x, μ, Σ) stands for the density of a Gaussian random vector with mean μ and covariance matrix Σ. Misclassification errors for each weak estimate and for the SSA estimate, for both k-NN and bandwidth-based localizing schemes, are shown in Figure 3.

Finally, in the third experiment, we took three classes, equal prior class probabilities and considered a mixture (13) with

 p1(x)=0.5ϕ(x,[−1,0],0.5I2)+0.5ϕ(x,[1,0],0.5I2) p2(x)=0.5ϕ(x,[0.5,√3/2],0.5I2)+0.5ϕ(x,[−0.5,−√3/2],0.5I2) p3(x)=0.5ϕ(x,[−0.5,√3/2],0.5I2)+0.5ϕ(x,[0.5,−√3/2],0.5I2),

where ϕ(x, μ, Σ) stands for the density of a Gaussian random vector with mean μ and covariance matrix Σ. Misclassification errors for each weak estimate and for the SSA estimate, for both k-NN and bandwidth-based localizing schemes, are shown in Figure 4.

### 3.3 Experiments on the real world datasets

We proceed with experiments on datasets from the UCI repository [12]: Ecoli, Iris, Glass, Pendigits, Satimage, Seeds, Wine and Yeast. Short information about these datasets is given in Table 1.

We compare the performance of our algorithm with boosting of k-NN classifiers, considered in [4], and with SVM [22]. For the Pendigits and Satimage datasets we calculated the misclassification error on the test dataset; for all other datasets we used leave-one-out cross-validation. The results of our experiments are shown in Table 2, with the best ones boldfaced.

From Table 2, one can observe that the localizing schemes with the Gaussian kernel behave slightly better than those with the Epanechnikov kernel, and that MSSA with either kernel is comparable with SVM.

## 4 Theoretical properties

### 4.1 Main results

Before we formulate the main theoretical properties of the procedure, we introduce an additional assumption. Namely, we assume that there exist constants u0 and u such that

 0<u0⩽Nk−1/Nk⩽u<1,2⩽k⩽K (A3)

The choice of models fulfilling assumption (A3) is up to the statistician. Note that this assumption is quite reasonable in the sense that if N1 is of order log n and NK is of order n, then K is of order log n, and thus the number of models we aggregate is not huge.

Main theoretical properties of the MSSA procedure can be formulated in the following theorems. First two results concern accuracy of estimation.

###### Theorem 1.

Let assumptions (A1)–(A3) be fulfilled. Fix an arbitrary δ ∈ (0, 1) and define a constant b from the condition

 NkK(¯θ(k)m,¯θ(k−1)m)⩽((1−α)/α)2⋅4ϰ4u0⋅log(4KM/δ),m=1,…,M,k=2,…,K (14)

Set

 zk=(4ϰ4/(α2u0b))log(4KM/δ) (15)

Then with probability at least 1 − δ, simultaneously for all m from 1 to M it holds

 K1/2(θ∗m,ˆθm)⩽min1⩽k⩽K{K1/2(θ∗m,˜θ(k)m)+(C1/√Nk)log1/2(4KM/δ)} (16)

and for all ε > 0

 K(θ∗m,ˆθm)⩽min1⩽k⩽K{(1+ε)K(θ∗m,˜θ(k)m)+(C2(ε)/Nk)log(4KM/δ)} (17)
 (θ∗m−ˆθm)2⩽min1⩽k⩽K{(1+ε)(θ∗m−˜θ(k)m)2+(C3(ε)/Nk)log(4KM/δ)} (18)

where C1 is a universal constant and C2(ε), C3(ε) depend only on ε (polynomially).

The result of Theorem 1 improves the results in [6]. However, note that this theorem does not imply similar results in expectation, since the choice of the parameters zk depends on the predetermined confidence level δ. Note that the logarithmic dependence on the number of models in (17) is usual for problems of model selection and cannot be improved.

The next result establishes rates of convergence for the procedure.

###### Theorem 2.

Let the conditions of Theorem 1 be fulfilled and set zk as in (15). Let

 |θ∗m(x+t)−θ∗m(x)|⩽Lh,∀|t|⩽h,1⩽m⩽M

Select

 h=C4(log(4KM/δ)/(L2n))1/(d+2) (19)

Suppose that for each m from 1 to M there exists a localizing scheme W(k∗(m)) such that

 w(k∗(m))i=0,∀i:|Xi−x|>h

and

 α1nhd⩽Nk∗(m)⩽α2nhd (20)

for some positive constants α1, α2. Then with probability at least 1 − δ it holds

 max1⩽m⩽MK(θ∗m,ˆθm)⩽C5(Ldlog(4KM/δ)/n)2/(d+2)

and

 max1⩽m⩽M(θ∗m−ˆθm)2⩽C6(Ldlog(4KM/δ)/n)2/(d+2)

for some absolute constants C5, C6.

The rate n−2/(d+2) is optimal for estimation of Lipschitz functions under regularity of the design. The MSSA procedure attains this optimal rate up to a logarithmic factor, which can be considered as a payment for adaptation.

Note that condition (A2) implies that the KL-divergence is bounded on Θ. This allows obtaining bounds in expectation for the r-th moment of the KL-loss. Indeed, fix an arbitrary r ⩾ 1 and choose δ of order (log(Mn)/n)2r/(d+2). Using the result of Theorem 2, we immediately obtain

 EKr(θ∗m,ˆθm)⩽(2rϰ2r/κ1r)δ+C5r(Ldlog(8KM/δ)/n)2r/(d+2)≲(log(Mn)/n)2r/(d+2)

Bounds in expectation can be easily improved by a simple modification of the procedure. Namely, fix some J and confidence levels δ1, …, δJ. For each j, let ˆθ[j]m stand for the MSSA estimate with the parameters zk defined by formula (15) with δ = δj. Finally, denote

 ˆθ⟨J⟩m=(2J+1−1)−1J∑j=12jˆθ[j]m (21)

A rigorous result is formulated in the next theorem.

###### Theorem 3.

Under the assumptions of Theorem 2, an appropriate choice of J ensures that

 Emax1⩽m⩽MKr(θ∗m,ˆθ⟨J⟩m)⩽C7r(2r/(d+2))1+2r/(d+2)(log(8KMn)/n)2r/(d+2)

for all r ⩾ 1.

The proof of this result is given in Section 4.4. Note that the modified procedure requires running the MSSA algorithm J times, which does not have a significant influence on the computational time.

With guarantees on the performance of estimation at hand, we are ready to provide bounds on the excess risk of misclassification. Now we assume that the test point X is drawn randomly according to the distribution PX and Y has the conditional distribution (1).

###### Theorem 4.

Let (X1, Y1), …, (Xn, Yn) be a training sample with independent entries, let X be a test point generated from the distribution PX and let Y be given by (1). Let the multiclass low noise assumption (11) be fulfilled and suppose that for each realization of X and the training sample a collection of localizing schemes is chosen in a way to ensure (A1) and (A3) with probability 1. Suppose that (A2) holds. Choose a constant b from the condition (14) and set the parameters zk according to (15). Let

 |θ∗m(x+t)−θ∗m(x)|⩽Lh,∀|t|⩽h,1⩽m⩽M

and select

 h=C8(logn/(L2n))1/(d+2) (22)

Suppose that for each m from 1 to M there exists a localizing scheme W(k∗(m)) such that

 w(k∗(m))i=0,∀i:|Xi−x|>h

and

 α1nhd⩽Nk∗(m)⩽α2nhd (23)

for some positive constants α1, α2. Then for the excess risk one has

 E(ˆf)⩽BC91+α(13+12α)((2+α)/(d+2))1+(1+α)/(d+2)(log(8KMn)/n)(1+α)/(d+2)

for some positive constant C9.

### 4.2 Proof of Theorem 1

We start with two auxiliary results, which were obtained in [21] and [6] respectively.

###### Lemma 1.

Let (A1) and (A2) be fulfilled. Then for the likelihood estimate of the form (7) and an arbitrary z > 0 it holds

 P(NK(˜θm,¯¯¯θm)>z)⩽2e−z/ϰ2
###### Lemma 2.

Under assumptions (A1)–(A3), it holds

 NkK(ˆθ(k)m,ˆθ(k−1)m)⩽(1+b)zk

Moreover,

 Nk′K(ˆθ(k′)m,ˆθ(k)m)⩽(1+b)ϰ2c2ūzk,∀k′>k

where ū and c are positive constants depending only on u0 and u from assumption (A3).

We also use a reparametrization

 ν=logθ1−θ (24)

throughout the proof.

###### Proof of Theorem 1.

Let us fix an arbitrary m from 1 to M and δ ∈ (0, 1).

First, note that (14) and (15) imply

 NkK(¯θ(k)m,¯θ(k−1)m)⩽(1−α)2bzk,m=1,…,M,k=2,…,K

Now we bound K1/2(˜θ(k)m, ˜θ(k−1)m). Note that

 K1/2(˜θ(k)m,˜θ(k−1)m)⩽√κ22∣∣˜ν(k)m