Adaptation to anisotropy and inhomogeneity via dyadic piecewise polynomial selection

# Adaptation to anisotropy and inhomogeneity via dyadic piecewise polynomial selection

Nathalie Akakpo
LPMA, Université Pierre et Marie Curie
###### Abstract

This article is devoted to nonlinear approximation and estimation via piecewise polynomials built on partitions into dyadic rectangles. The approximation rate is studied over possibly inhomogeneous and anisotropic smoothness classes that contain Besov classes. Highlighting the interest of such a result in statistics, adaptation in the minimax sense to both inhomogeneity and anisotropy of a related multivariate density estimator is proved. Besides, that estimation procedure can be implemented with a computational complexity simply linear in the sample size.

## 1 Introduction

When estimating a multivariate function, it seems natural to consider that its smoothness is likely to vary either spatially, or with the direction, or both. We will refer to the first feature as (spatial) inhomogeneity. If the risk is measured in a -norm, measuring the smoothness in a -norm with allows to take into account such an inhomogeneity – all the greater as is smaller – in the sense that functions with some localized singularities and otherwise flat parts may thus keep a high smoothness index. For the second feature, we will talk about anisotropy, which is usually described by different indices of smoothness according to the coordinate directions. Yet, statistical procedures that adapt both to possible inhomogeneity and anisotropy remain rather scarce. Indeed, the existing literature seems to amount to the following references. Neumann and Von Sachs [NvS97], for estimating the evolutionary spectrum of a locally stationary time series, and Neumann [Neu00], in the Gaussian white noise framework, study thresholding procedures in a tensor product wavelet basis. In a Gaussian regression framework, Donoho [Don97] proposes the dyadic CART procedure, a selection procedure among histograms built on partitions into dyadic rectangles, extended to the density estimation framework by Klemelä [Kle09]. Last, Kerkyacharian, Lepski and Picard [KLP01] introduce a kernel estimator with adaptive bandwidth in the Gaussian white noise model. These authors study the performance of their procedures for the -risk, apart from the latter who consider any -risk for . Neumann and Von Sachs [NvS97] measure the smoothness of the function to estimate in the Sobolev scale, whereas the others consider the finer Besov scale. Besides, the -norm in which the smoothness is measured is allowed to vary with the direction, except in [Don97], but always constrained to be greater than 1. Common to those few procedures is the ability to reach the minimax rate over a wide range of possibly inhomogeneous and anisotropic classes, up to a logarithmic factor, the unknown smoothness being as usually limited by the a priori fixed smoothness of the underlying wavelets, piecewise polynomials or kernel.

Adaptation results of the aforementioned type rely as much on Statistics as on Approximation Theory, oracle-type inequalities reflecting the interplay between both domains. Assume for instance that the function to estimate lies in the set of all real-valued functions defined over the unit cube , let be a given family of linear subspaces of and a statistical procedure somehow based on that family. An oracle-type inequality in the -norm roughly takes the form

 Es[∥s−~s∥qq]≤Cinfm∈M{inft∈Sm∥s−t∥qq+(dim(Sm)/n)q/2}, (1)

where is some positive constant, indicating that is able to choose a model in the family that approximately realizes the best compromise between the approximation error and the dimension of the model. Equivalently, it may be written as

 Es[∥s−~s∥qq]≤CinfD∈N⋆{inft∈∪m∈MDSm∥s−t∥qq+(D/n)q/2}, (2)

where On the other hand, the collection should be chosen so as to have good approximation properties over various classes of functions with smoothness measured in a -norm and with semi-norm smaller than . Otherwise said, each approximating space – typically nonlinear to deal with inhomogeneous functions– should satisfy, for a wide range of values of and ,

 sups∈S(α,p,R)inft∈∪m∈MDSm∥s−t∥q≤C(α,p)RD−α/d, (3)

for some positive real that only depends on and . Combining the oracle-type inequality (2) and the approximation result (3) then provides an estimator with rate at most of order over each class , which is usually the minimax rate. Having at one’s disposal spaces that do no depend on any a priori knowledge about the smoothness of the function to estimate – other than the scale of spaces it belongs to – and reaching the approximation rate (3) is thus a real issue for statisticians. In order to deal with inhomogeneity only, in a multivariate framework, such results appear for instance in the following references. DeVore, Jawerth and Popov [DJP92], Birgé and Massart [BM00] or Cohen, Dahmen, Daubechies and DeVore [CDDD01] propose wavelet based approximation algorithms aimed in particular at Besov type smoothness. Applications of the approximation result of [BM00] to statistical estimation may be found in Birgé and Massart [BM97] or Massart [Mas07] for instance. DeVore and Yu [DeV98] are concerned with piecewise polynomials built on partitions into dyadic cubes, notably for functions with Besov type smoothness. But their result will wait until Birgé [Bir06] to be used in Statistics. More generally, such results are in fact hidden behind all adaptive procedures. Thus, for both inhomogeneous and anisotropic functions, we refer in particular to the articles cited in the first paragraph. Let us underline that the procedure studied by Donoho [Don97] and Klemelä [Kle09], though based on dyadic rectangles instead of cubes, does not rely on a nonlinear approximation result via piecewise polynomials such as [DY90]. Indeed, the adaptivity of that estimator follows from its characterization as a wavelet selection procedure among some tree-structured subfamily of the Haar basis. Other nonlinear wavelet based approximation results are proved in Hochmuth [Hoc02b] or [Lei03] for anisotropic Besov spaces. Last, piecewise constant approximation based on dyadic rectangles is studied in Cohen and Mirebeau [CM09] for nonstandard smoothness spaces under the constraint of continuous differentiability.

Our aim here is to provide an approximation result tailored for statisticians, whose interest is illustrated by a new statistical procedure. The first part of the article is devoted to piecewise polynomial approximation based on partitions into dyadic rectangles. Thanks to an approximation algorithm inspired from DeVore and Yu [DY90], we obtain approximation rates akin to (3) over possibly inhomogeneous and anisotropic smoothness classes that contain for instance the more traditional Besov classes. The approximation rate can be measured in any -norm, for , and we allow an arbitrarily high inhomogeneity in the sense that we measure the smoothness in a -norm with allowed to be arbitrarily close to 0. Besides, we take into account a possible restriction on the minimal size of the dyadic rectangles, which may arise in statistical applications. For estimating a multivariate function, we then introduce a selection procedure that chooses from the data the best partition into dyadic rectangles and the best piecewise polynomial built on that partition thanks to a penalized least-squares type criterion. The degree of the polynomial may vary from one rectangle to another, and also according to the coordinate directions, so as to provide a good adaptation both to inhomogeneity and anisotropy. Thus, our procedure extends the dyadic histogram selection procedures of Donoho [Don97], Klemelä [Kle09] or Blanchard, Schäfer, Rozenholc and Müller [BSRM07], and the dyadic piecewise polynomial estimation procedure proposed in a univariate or isotropic framework by Willett and Nowak [WN07]. We study the theoretical performance of the procedure – with no need to resort to the "wavelet trick" used in [Don97, Kle09] – for the -risk in the density estimation framework, as [Kle09], but we propose a more refined form of penalty than [Kle09]. For such a penalty, we provide an oracle-type inequality and adaptivity results in the minimax sense over a wide range of possibly inhomogeneous and anisotropic smoothness classes that contain Besov type classes. We emphasize that, if the maximal degree of the polynomials does not depend on the sample size, we reach the minimax rate up to a constant factor only, contrary to all the previously mentioned estimators. This results not only from the good approximation properties of dyadic piecewise polynomials, but also from the moderate number of dyadic partitions of the same size. We can also allow the maximal degree of the polynomials to grow logarithmically with the sample size, in which case we reach the minimax rate on a growing range of smoothness classes, up to a logarithmic factor. Moreover, our procedure can be implemented with a computational complexity only linear in the sample size, possibly up to a logarithmic factor, depending on the way we choose the maximal degree.

The plan of the paper is as follows. Section 2 is devoted to piecewise polynomial approximation based on partitions into dyadic rectangles. In Section 3, we are concerned with density estimation based on a data-driven choice of a best dyadic piecewise polynomial. We study there the theoretical properties of the procedure and briefly describe the algorithm to implement it. Most proofs of Sections 2 and 3 are deferred respectively to Section 4 and to Sections 5 and 6.

## 2 Adaptive approximation by dyadic piecewise polynomials

In this section, we present an approximation algorithm by piecewise polynomials built on partitions into dyadic rectangles. We study its rate of approximation over some classes of functions that may present at the same time anisotropic and inhomogeneous smoothness.

### 2.1 Notation

Throughout the article, we fix , and throughout this section, we fix some -uple of nonnegative integers that represent the maximal degree of polynomial approximation in each direction. We call dyadic rectangle of any set of the form where, for all ,

 Il=[0,2−jl] or Il=]kl2−jl,(kl+1)2−jl]

with and . Otherwise said, a dyadic rectangle of is defined as a product of dyadic intervals of that may have different lengths. For a partition of into dyadic rectangles, we denote by the number of rectangles in and by the space of all piecewise polynomial functions on which are polynomial with degree in the -th direction, , over each rectangle of . Besides, for , we denote by the set of all real-valued and measurable functions on such that the (quasi-)norm

 ∥s∥p=⎧⎨⎩(∫[0,1]d|s(x)|pdλd(x))1/p if 0

is finite, where is the Lebesgue measure on . Last, , or , stand for a positive reals that only depend on the parameter . Their values may change from one line to another, unless otherwise said.

### 2.2 Approximation algorithm

Let us fix . In order to approximate a possibly anisotropic and inhomogeneous function in the -norm, we propose an approximation algorithm inspired from [DY90]. We shall construct an adequate piecewise polynomial approximation on a partition into dyadic rectangles adapted to , beginning with the trivial partition of the unit square and proceeding to successive refinements. For doing so, we consider the criterion

 Er(s,K)q=infP∈Pr∥(s−P)1IK∥q (4)

measuring the error in approximating on a rectangle by some element from the set of all polynomials on with degree in the -th direction. We also fix some threshold – to be chosen later, according to the smoothness assumptions on . But contrary to [DY90], we allow the degrees of smoothness of to vary with the directions and describe them by a multi-index , in a sense that will be made precise in the next subsection. Thus, our algorithm is based on a special subcollection of dyadic rectangles adapted to an anisotropic smoothness measured by . Indeed, for , we define as the set of all dyadic rectangles such that, for all ,

 Il=[0,2−⌊jσ––/σl⌋] or Il=]kl2−⌊jσ––/σl⌋,(kl+1)2−⌊jσ––/σl⌋],

with and , and we set It should be noticed that, for all , any can be partitioned into dyadic rectangles of , that we call children of . For and for instance, a partition of into dyadic rectangles from will thus be roughly twice as fine in the first direction, as illustrated by Figure 1.

The algorithm begins with the set that only contains . If , then the algorithm stops. Else, is replaced with his children in , hence a new partition . In the same way, the -th step begins with a partition of into dyadic rectangles that belong to . If then the algorithm stops. Else, a dyadic rectangle such that is chosen and replaced with his children in , hence a new partition . Since , tends to 0 when the Lebesgue measure of tends to 0, so the algorithm finally stops. The final partition only contains dyadic rectangles that belong to and such that For all , we approximate on by , a polynomial function with degree in the -th direction such that Otherwise said, we approximate on the unit cube by

 A(s,ϵ)=∑K∈I(s,ϵ)QK(s),

thus committing the error

 ∥s−A(s,ϵ)∥q=⎛⎝∑K∈I(s,ϵ)∥(s−QK(s))1IK∥qq⎞⎠1/q<|I(s,ϵ)|1/qϵ (5)

if , and

 ∥s−A(s,ϵ)∥∞=maxK∈I(s,ϵ)∥(s−QK(s))1IK∥∞<ϵ (6)

if

### 2.3 Approximation rate over anisotropic function classes

In order to study the approximation rate of the previous algorithm, we introduce function spaces that arise naturally from the way the algorithm proceeds. Let us fix and . For and , we set

 er,σ,p,k(s)=infP∈Πr,σk∥s−P∥p (7)

where is the set of all piecewise polynomial functions on that are polynomial with degree in the -th direction over each rectangle in . Then, we define as the set of all functions such that the quantity

 Nr,σ,p,p′(s)=⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩(∑k∈N(2kσ––er,σ,p,k(s))p′)1/p′ % if 0

is finite. One can easily verify that is a (quasi-)semi-norm on , and that gets larger as increases since

 Nr,σ,p,p′2(s)≤Nr,σ,p,p′1(s) for 0

If , then is obviously embedded in the space in which we measure the quality of approximation. The same property still holds for smaller than , under adequate assumptions on the harmonic mean of , i.e.

 H(σ)=(1dd∑l=11σl)−1.

Indeed, denoting by for any real , we prove in Section 4 the following continuous embedding.

###### Proposition 1

Let , and . If

 H(σ)/d>(1/p−1/q)+,

then and, for all ,

 ∥s∥q≤C(d,r,σ,p,p′,q)(∥s∥p+Nr,σ,p,p′(s)).

The reader familiar with classical function spaces will have noted the similarity between the definition and the embedding properties of spaces and those of Besov spaces. Before going further, let us recall the definition of the latter according to [ST87], for instance. We denote by the canonical basis of and set . For all , , , and , we define

 R(σl,h)={x∈[0,1]d s.t. x,x+hbl,…,x+(⌊σl⌋+1)hbl∈R},
 Δσlhbls(x)=⌊σl⌋+1∑k=0(⌊σl⌋+1k)(−1)⌊σl⌋+1−ks(x+khbl), for x∈R(σl,h),
 ω(l)σl(s,y,R)p=sup0
 |s|σ,p,p′=⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩∑dl=1(∫∞0[y−σlω(l)σl(s,y,R)p]p′dyy)1/p′ if 00y−σlω(l)σl(s,y,R)p) if p′=∞.

For , , we denote by the space of all measurable functions such that is finite. According to the proposition below, Besov spaces are embedded in spaces

###### Proposition 2

Let , and . For all ,

 Nr,σ,p,p′(s)≤C(d,r,σ,p,p′)|s|σ,p,p′.

We shall not give a proof of that proposition here, since it relies exactly on the same arguments as those used by [Hoc02a] in the proof of Theorem 4.1 (beginning of page 197) combined with Inequality (14) in the same reference. It should be noticed that the space is in general larger than . Indeed, contrary to , the space contains discontinuous functions (piecewise polynomials, for instance) even for .

We are now able to state approximation rates over anisotropic classes of the form

 S(r,σ,p,p′,R)={s∈Lp([0,1]d) s.t. Nr,σ,p,p′(s)≤R},

where , and , thus extending the result of DeVore and Yu [DY90] (Corollary 3.3), which is only devoted to functions with isotropic smoothness. The approximation rate is related to the harmonic mean of , which in case of isotropic smoothness of order , i.e. if , reduces to .

###### Theorem 1

Let , , and such that

 H(σ)/d>(1/p−1/q)+.

Assume that , where if or , and if . Then, for all , there exists some partition of into dyadic rectangles, that may depend on and , such that

 |m|≤C1(d,σ,p)2kd

and

 inft∈S(m,r)∥s−t∥q≤C2(d,r,σ,p,q)R2−kH(σ). (9)

The same result still holds whatever if or , and whatever if , as a straightforward consequence of Theorem 1 and Inequality (8). Denoting by , , the set of all the partitions of into dyadic rectangles, we obtain uniform approximation rates simultaneously over a wide range of classes by considering the nonlinear approximating space . That property is stated more precisely in Corollary 1 below, which can be immediately derived from Theorem 1.

###### Corollary 1

Let , , , and satisfying the assumptions of Theorem 1. For all , where is given by Theorem 1,

 sups∈S(r,σ,p,p′,R)inft∈∪m∈MDS(m,r)∥s−t∥q≤C′2(d,r,σ,p,q)RD−H(σ)/d.

We also propose of a more refined version of Theorem 1 that allows to take into account constraints on the minimal dimensions of the dyadic rectangles, which will prove most useful for estimation purpose in the next section. We recall that

###### Theorem 2

Let , , , , and such that

 H(σ)/d>(1/p−1/q)+.

Assume that , where if or , and if . Then, for all , there exists some partition of , that may depend on and , only contains dyadic rectangles with sidelength at least in the -th direction, , and satisfies both

 |m|≤C1(d,σ,p)2kd

and

 inft∈S(m,r)∥s−t∥q≤C3(d,r,σ,p,q)R(2−Jd(H(σ)/d−(1/p−1/q)+)σ––/H(σ)+2−kH(σ)). (10)

Remark: Given , that theorem relies on applying the approximation algorithm of Section 2.2 to an approximation of from , where is the partition of into the dyadic rectangles from Thus, the term in (10), which is of order , corresponds with an upper-bound for the linear approximation error The upper-bound (10) is of the same order as (9) – up to a real that only depends on – as long as

 k≤Jσ––H(σ)(H(σ)d−(1p−1q)+)dH(σ). (11)

If and , i.e. if has homogeneous and isotropic smoothness, then that condition simply amounts to . Otherwise, Condition (11) is all the more stringent as is small by comparison with or as is small by comparison with , i.e. all the more stringent as inhomogeneity or anisotropy are pronounced.

Given , let us denote by the set of all the partitions into dyadic rectangles with sidelengths , for . We can still obtain uniform approximation rates simultaneously over a wide range of classes under the constraint that the piecewise polynomial approximations are built over dyadic rectangles with sidelengths , by introducing this time the nonlinear approximation space . Indeed, as for all and , , a straightforward consequence of Theorem 2 is Corollary 2 below.

###### Corollary 2

Let , , , , and satisfying the assumptions of Theorem 2. For all , where is given by Theorem 2,

 sups∈S(r,σ,p,p′,R)inft∈∪m∈MJDS(m,r)∥s−t∥q ≤C′3(d,r,σ,p,q)R(2−Jd(H(σ)/d−(1/p−1/q)+)σ––/H(σ)+2−kH(σ)).

## 3 Application to density estimation

This section aims at illustrating the interest of the previous approximation results in statistics. More precisely, placing ourselves in the density estimation framework, we show that combining estimation via dyadic piecewise polynomial selection and the aforementioned approximation results leads to a new density estimator which is able to adapt to the unknown smoothness of the function to estimate, even though it is both anisotropic and inhomogeneous. Besides, we explain how such a procedure can be implemented efficiently.

### 3.1 Framework and notation

Let , , we observe independent and identically distributed random variables defined on the same measurable space and taking values in . We assume that admit the same density with respect to the Lebesgue measure on and that . We denote by the joint distribution of , that is the probability measure with density

 dPsdλ⊗nd:(y1,…,yn)∈[0,1]d×…×[0,1]d⟼n∏i=1s(yi),

while stands for the underlying probability measure on , so that for all product of rectangles of

 Ps(B)=Ps({ω∈Ω s.t. (Y1(ω),…,Yn(ω))∈B}).

The expectation and variance associated with are denoted by and .

### 3.2 Dyadic piecewise polynomial estimators

Let be some partition of into dyadic rectangles and a sequence such that, for all , . We denote by the space of all functions such that, for all , is polynomial with degree in the -th direction on the rectangle . In particular, if is constant and equal to , then coincides with the space introduced in Section 2. Let be the usual scalar product on . We recall that minimizes over

 ∥s−t∥22−∥s∥22=∥t∥22−2⟨t,s⟩=Es[γ(t)],

where

 γ(t)=∥t∥22−2nn∑i=1t(Yi)

only depends on the observed variables. Thus, a natural estimator of with values in is

 ^s(m,ρ)=argmin t∈S(m,ρ)γ(t),

that we will call a dyadic piecewise polynomial estimator. Such an estimator is just a projection estimator of on . Indeed, if for each dyadic rectangle we set and denote by an orthonormal basis of the space of polynomial functions over with degree in the -th direction, then simple computations lead to

 ^s(m,ρ)=∑K∈m∑k∈Λ(ρK)(1nn∑i=1ΦK,k(Yi))ΦK,k.

For theoretical reasons, we shall choose in the remaining of the article an orthonormal basis derived from the Legendre polynomials in the following way. Let be the orthogonal family of the Legendre polynomials in . For rectangle of , and , we set

 π(k)=d∏l=1(2k(l)+1)

and

 ΦK,k(x)=√π(k)λd(K)d∏l=1Qk(l)(2xl−ul−vlvl−ul)1IK(x).

We recall that, for all , satisfies

 ∥Qj∥∞=1 and ∥Qj∥22=2(2j+1).

Therefore, for rectangle in and , is a basis of the space of piecewise polynomial functions with support and degree in the -th direction, which is orthonormal for the norm and satisfies

 ∥ΦK,k∥2∞=π(k)λd(K). (12)

For each partition of into dyadic rectangles and each , we can evaluate the performance of by giving an upper-bound for its quadratic risk. For that purpose, we introduce the orthogonal projection of on , the dimension of , i.e.

 dim(S(m,ρ))=∑K∈m|Λ(ρK)|=∑K∈md∏l=1(ρK(l)+1),

and define by

 ρmax(l)=maxK∈mρK(l),l=1,…,d. (13)
###### Proposition 3

Let be a partition of into dyadic rectangles and . If , then

 Es[∥s−^s(m,ρ)∥22]=∥s−s(m,ρ)∥22+1n∑K∈m∑k∈Λ(ρK){Var}s(ΦK,k(Y1)).

If is finite, then

 Es[∥s−^s(m,ρ)∥22]≤∥s−s(m,ρ)∥22+π(ρmax)∥s∥∞dim(S(m,ρ))n.

Proof: Pythagoras’ Equality gives

 Es[∥s−^s(m,ρ)∥22]=∥s−s(m,ρ)∥22+Es[∥s(m,ρ)−^s(m,ρ)∥22].

Then, we deduce the first equality in Proposition 3 from the expressions of and in the orthonormal basis of and the fact that are independent and identically distributed.

If is bounded, we deduce from (12) that, for all and ,

 Es[Φ2K,k