On deconvolution of distribution functions

# On deconvolution of distribution functions

[ [    [ [    [ [ University of Haifa, University of Haifa and Université Grenoble I I. Dattner
A. Goldenshluger
Department of Statistics
University of Haifa
31905 Haifa
Israel
A. Juditsky
LMC, B. P. 53
Université Grenoble I
38041 Grenoble Cedex 9
France
\smonth6 \syear2010\smonth5 \syear2011
\smonth6 \syear2010\smonth5 \syear2011
\smonth6 \syear2010\smonth5 \syear2011
###### Abstract

The subject of this paper is the problem of nonparametric estimation of a continuous distribution function from observations with measurement errors. We study minimax complexity of this problem when unknown distribution has a density belonging to the Sobolev class, and the error density is ordinary smooth. We develop rate optimal estimators based on direct inversion of empirical characteristic function. We also derive minimax affine estimators of the distribution function which are given by an explicit convex optimization problem. Adaptive versions of these estimators are proposed, and some numerical results demonstrating good practical behavior of the developed procedures are presented.

[
\kwd
\doi

10.1214/11-AOS907 \volume39 \issue5 2011 \firstpage2477 \lastpage2501 \newproclaimdefinitionDefinition[section] \newproclaimproblemProblem

\runtitle

On deconvolution of distribution functions

{aug}

A]\fnmsI. \snmDattner\thanksreft1label=e1]idattner@stat.haifa.ac.il, A]\fnmsA. \snmGoldenshluger\corref\thanksreft1label=e2]goldensh@stat.haifa.ac.il and B]\fnmsA. \snmJuditskylabel=e3]anatoli.iouditski@imag.fr

\thankstext

t1Supported by BSF Grant 2006075.

class=AMS] \kwd62G05 \kwd62G20. Adaptive estimator \kwddeconvolution \kwdminimax risk \kwdrates of convergence \kwddistribution function.

## 1 Introduction

In this paper we study the problem of estimating a distribution function in the presence of measurement errors.

Let be a sequence of independent, identically distributed random variables with common distribution . Suppose that we observe random variables given by

 Yj=Xj+ζj,j=1,…,n, (1)

where are i.i.d. random variables, independent of ’s with the density w.r.t. the Lebesgue measure on the real line. The objective is to estimate the value of the distribution function of at a given point from the observations .

By an estimator we mean any measurable function of the observations . We adopt the minimax approach for measuring estimation accuracy. Let be a given family of probability distributions on . Given an estimator of , we consider two types of maximal over risks:

 Risk2[˜F;F]:=supF∈F{E|˜F−F(t0)|2}1/2.
• -risk: given a tolerance level we define

 Riskϵ[˜F;F]:=min{δ\dvtxsupF∈FP[|˜F−F(t0)|>δ]≤ϵ}.

An estimator is said to be rate optimal or optimal in order with respect to if

 Risk[˜F∗;F]≤Cinf˜FRisk[˜F;F],

where is taken over all possible estimators of , and is independent of . We will be particularly interested in the classes of distributions having density with respect to the Lebesgue measure on the real line.

The outlined problem is closely related to the density deconvolution problem that has been extensively studied in the literature; see, for example, zhang (), stefanski (), Fan (), butucea-tsy12 (), butucea-tsy (), johannes (), meister () and references therein. In these works the minimax rates of convergence have been derived under different assumptions on the error density and on the smoothness of the density to be estimated. Depending on the tail behavior of the characteristic function of the following two cases are usually distinguished: {longlist}[(ii)]

ordinary smooth errors, when the tails of are polynomial, that is,

 |ˆfζ(ω)|≍|ω|−β,|ω|→∞,

for some ;

supersmooth errors, when the tails are exponential, that is,

 |ˆfζ(ω)|≍exp{−c|ω|β},|ω|→∞,

for some and . The afore cited papers derive minimax rates of convergence for different functional classes under ordinary smooth and supersmooth errors.

In contrast to existence of the voluminous literature on density deconvolution, the problem of deconvolution of the distribution function has attracted much less attention and has been studied in very few papers (see meister (), Section 2.7.2, for a recent review of corresponding contributions). A consistent estimator of a distribution function from observations with additive Gaussian measurement errors was developed by gaffey (). A “plug-in” estimator based on integration of the density estimator in the density deconvolution problem has been studied under moment conditions on in zhang (). The paper Fan () also considered the estimator based on integration of the density deconvolution estimator. It was shown there that under a tail condition on the estimator achieves optimal rates of convergence provided that the errors are supersmooth. For the case of ordinary smooth errors there is a gap between the upper and lower bounds reported in Fan () which leaves open the question of constructing optimal estimators. More recently, some minimax rates of estimation of distribution functions in models with measurement errors were reported in hall (). Note also that butucea-comte () considered a general problem of optimal and adaptive estimation of linear functionals in the model (1). However, their results hold only for representative which is clearly not the case in the problem of recovery of distribution function.

The objective of this paper is to develop optimal methods of minimax deconvolution of distribution functions and to answer several questions raised by known results on this problem: Is a smoothness assumption alone on sufficient in order to secure minimax rates of estimation of the sort for in the case of ordinary smooth errors? Do we need tail or moment conditions on ?

Our contribution is two-fold. First, we characterize the minimax rates of convergence in the case when the unknown distribution belongs to a Sobolev ball, and the observation errors are ordinary smooth. The rates of convergence depend crucially on the relation between the smoothness index of the Sobolev ball and the parameter [the rate at which the characteristic function of errors tends to zero; see (i) above]. In contrast to the density deconvolution problem, it turns out that there are different regions in the -plane where different rates of convergence are attained. We show that in some regions of the -plane the minimax rates of convergence are attained by a linear estimator, which is based on direct inversion of the distribution function from the corresponding characteristic function; cf. hall (). It is worth noting that we do not require any additional tail or moment conditions on the unknown distribution. In the case when the parameters of the regularity class of the distribution are unknown, we also construct an adaptive estimator based on Lepski’s adaptation scheme lepski (). The -risk of this estimator is within a -factor of the minimax -risk.

Second, using recent results on estimating linear functionals developed in JudNem (), we propose minimax and adaptive affine estimators of the cumulative distribution function for a discrete distribution deconvolution problem; see also DonLiuI (), DonLiuII (), DonLiuIII (), CaiLow2003 (), Donoho () for the general theory of affine estimation. These estimators can be applied to the original deconvolution problem provided that it can be efficiently discretized. By efficient discretization we mean that: {longlist}[(2)]

the support of the distributions of () can be “compactified” [one can point out a compact subset of such that the probability of () being outside this set is “small”] and binned into small intervals;

the class of discrete distributions, obtained by the corresponding finite-dimensional cross-section of the class of continuous distributions is a computationally tractable convex closed set.222Roughly speaking, a computationally tractable set can be interpreted as a set given by a finite system of inequalities , , where are convex polynomials; see, for example, BN (), Chapter 4. Under these conditions one can efficiently implement the minimax affine estimator for based on the approach proposed in JudNem (). This estimator is rate minimax with respect to (within a factor for small ) whatever are the noise distribution and a convex and closed class .

We describe construction of the minimax affine estimator of when the class is known and provide an adaptive version of the estimation procedure when the available information allows us to construct an embedded family of classes.

The rest of the paper is structured as follows. We present our results on estimation over the Sobolev classes in Section 2. Section 3 deals with minimax and adaptive affine estimation. Section 4 presents a numerical study of proposed adaptive estimators and discusses their relative merits. Proofs of all results are given in the supplementary article DGJ-suppl ().

## 2 Estimation over Sobolev classes

### 2.1 Notation

We denote by and the densities of random variables and ; with certain abuse of notation we simply denote by the density of unknown distribution of .

Let be a function on ; we denote by the Fourier transform of ,

 ˆg(ω)=∫∞−∞g(x)eiωxdx,ω∈R.

We consider the classes of absolutely continuous distributions. {definition} Let , . We say that belongs to the class if it has a density with respect to the Lebesgue measure on , and

 12π∫∞−∞|ˆf(ω)|2(1+ω2)αdω≤L2.

The set with contains absolutely continuous distributions. If , then the distributions from have bounded continuous densities. Usually is referred to as the Sobolev class.

We use extensively the following inversion formula: for a continuous distribution one has

 F(x)=12−1π∫∞0ω−1I{e−iωxˆf(ω)}dω,x∈R, (2)

where stands for the imaginary part, and the above integral is interpreted as an improper Riemann integral For the proof of (2) see gurland (), gil-pelaez () and kendall (), Section 4.3.

Throughout this section we assume that the error characteristic function does not vanish:

 |ˆfζ(ω)|≠0∀ω∈R.

This is a standard assumption in deconvolution problems.

### 2.2 Minimax rates of estimation

In model (1) we have , and can be easily estimated by the empirical characteristic function of the observations . This motivates the following construction: for we define the estimator of by

 ˜Fλ=12−1nn∑j=11π∫λ01ωI{eiω(Yj−t0)ˆfζ(ω)}dω. (3)

Here is the design parameter to be specified. Note that if the density is symmetric around the origin, then is real, and the estimator takes the form (cf. hall ())

 ˜Fλ=12−1nn∑j=11π∫λ0sin{ω(Yj−t0)}ˆfζ(ω)ωdω.

Note that may be truncated to the interval ; obviously, the risk of such a “projected” estimator is smaller than that of .

Our current goal is to establish an upper bound on the risk of the estimator over the classes . We need the following assumptions on the distribution of the measurement errors :

{longlist}

There exist real numbers , and such that

 cζ(1+ω2)−β/2≤|ˆfζ(ω)|≤Cζ(1+ω2)−β/2∀ω∈R.
{longlist}

There exist positive real numbers , and such that

 |ˆfζ(ω)|≥1−bζ|ω|τ∀|ω|≤ω0.

Assumption 2.2 characterizes the case of the ordinary smooth errors. Assumption 2.2 describes the local behavior of near the origin. It is well known that for any distribution of a nondegenerate random variable there exist positive constants and such that for all (see, e.g., petrov (), Lemma 1.5). Thus in 2.2 we have . Typical examples of distributions satisfying 2.2 and 2.2 are the Laplace and Gamma distributions. For example, for the Laplace distribution 2.2 holds with , and 2.2 holds with . The Gamma distribution provides an example of the distribution satisfying 2.2 with being the shape parameter of the distribution.

As we will see in the sequel, the rates of convergence of the risks ; and are mainly determined by the relationship between parameters and . Consider the following two subsets of the parameter set for the pair :

 Θr:={(α,β)∈Θ\dvtxα+β>1/2},Θs:={(α,β)∈Θ\dvtxα+β<1/2}.

If , then necessarily ; in addition, because , the density can be discontinuous. That is why we will refer to as the singular zone, while the subset will be called the regular zone. We denote by the border zone between and :

 Θb:={(α,β)∈Θ\dvtxα+β=1/2}.

Division of the parameter set into zones , and is displayed in Figure 1.

The figure also shows the sub-regions and , , that are defined by the following formulas:

 Θr,1 := {(α,β)∈Θr\dvtxβ>1/2},Θr,2:={(α,β)∈Θr\dvtxβ<1/2}, Θs,1 := {(α,β)∈Θs\dvtxα+3β≥1/2},Θs,2:={(α,β)∈Θs\dvtxα+3β<1/2}.

The next two theorems present bounds on the risks in the regular zone: Theorem 2.1 states upper bounds on the risks of , while Theorem 2.2 contains the corresponding lower bounds on the minimax risks.

For define

###### Theorem 2.1

Let assumptions 2.2 and 2.2 hold, and suppose that . If is estimator (3) associated with , then for all and large enough ,

 Risk2[˜Fλ⋆;Fα(L)]≤ψn(α,L):=C2(α,L)ψ(n).

In addition, if , then for all and large enough ,

 Riskϵ[˜Fλ⋆;Fα(L)]≤ψn,ϵ(α,L):=C3(α,L)ψ(n/ln[2ϵ−1]),

provided that . The constants , , are specified in the proof of the theorem (see (A.15)–(A.22) in DGJ-suppl ()).

Theorem 2.1 shows that if is in the regular zone and , then the estimator attains the parametric rate of convergence. In the case this rate is within a logarithmic factor of the parametric rate. The natural question is if the estimator is rate optimal whenever , and . The answer is provided by Theorem 2.2.

We need the following assumption. {longlist}

The characteristic function is twice differentiable, and there exist real numbers , and such that

 (1+ω2)β/2maxj=0,1,2{∣∣ˆf(j)ζ(ω)∣∣}≤Cζ∀|ω|≥ω∗.

Assumption 2.2 is rather standard in derivations of lower bounds for deconvolution problems. This assumption should be compared to condition (G3) in Fan (); it is assumed there that for one has as . Note that 2.2 is a weaker assumption.

###### Theorem 2.2

Let assumption 2.2 hold. Suppose that the class is such that and . Then there exist constants and depending on , and only such that, for all large enough,

 inf˜FRisk2[˜F;Fα(L)] ≥ c1L(2β−1)/(2α+2β)ϕn, inf˜FRiskϵ[˜F;Fα(L)] ≥ c2L(2β−1)/(2α+2β)ϕn,ϵ,

where , , , and is taken over all possible estimators of .

The results of Theorems 2.1 and 2.2 deal with the regular zone. While we do not present the lower bound for the case of we expect that the bounds of Theorem 2.2 hold for the whole regular zone.

It is important to realize that the risks of converge to zero for all , and, in particular, for and . The next statement establishes upper bounds on in the singular and border zones, and .

###### Theorem 2.3

Let assumptions 2.2 and 2.2 hold. If is the estimator (3) associated with , then for all and large enough

 Risk2[˜Fλ⋆;Fα(L)]≤C2(α,L)φ(n),

where the sequences and are given in Table 1, and constants and are specified in the proof (see (A.15)–(A.22) in DGJ-suppl ()).

In addition, if , then for large enough

 Riskϵ[˜Fλ⋆;Fα(L)]≤C4(α,L)φ(n/ln[2ϵ−1]).

Several remarks on the results of Theorems 2.12.3 are in order.

#### Remarks

(1) Theorem 2.1 shows that the regular zone is decomposed into three disjoint regions with respect to the upper bounds on the risks of . In the zone where , the rates of convergence are parametric; because of roughness of the error density, here the estimation problem is essentially a parametric one. The region is characterized by nonparametric rates, while in the border zone between and () the rate of convergence differs from the parametric one by a -factor.

(2) The condition on stated in Theorem 2.2 is purely technical; it requires that the family is rich enough. It follows from Theorems 2.1 and 2.2 that the estimator is optimal in order in the regular zone if .

(3) The subdivision of the singular zone into two zones and is a consequence of two types of upper bounds that we have on the variance term; see (14) in DGJ-suppl (). In the border zone the upper bounds on the risk differ from those in the regular zone only by logarithmic in factors. We do not know if the estimator is rate optimal in the singular and border zones.

(4) Note that the results of Theorems 2.1 and 2.3, when put together, allow us to establish risk bounds for any pair from the parameter set . In particular, for any fixed , the rate of convergence of the maximal risk approaches the parametric rate when approaches zero. We would like to stress the fact that no tails or moment conditions on are required to obtain these results; such conditions were systematically imposed in the previous work on deconvolution of distribution functions.

The choice of the smoothing parameter in (3) is crucial in order to achieve the optimal estimation accuracy. As Theorems 2.1 and 2.2 show, if parameters and of the class are known, then one can choose in such a way that the resulting estimator is optimal in order. In practice the functional class is hardly known; in these situations the estimator of Section 2 cannot be implemented. Note, however, that this does not pose a serious problem in the regular zone when . Indeed, here if we choose , then the resulting estimator will be optimal in order for any functional class satisfying , where is defined in Theorem 2.1.

The situation is completely different in the case . In this section we develop an estimator that is nearly optimal for the -risk over a scale of classes . The construction of our adaptive estimator is based on the general scheme by lepski ().

#### 2.3.1 Estimator construction

Consider the family of estimators , where is defined in (3), with , , and , . The adaptive estimator is obtained by selection from the family according to the following rule.

Let

 ω1:=min{ω0,(4bζ)−1/τ},c∗:=2π−2[2+(1/τ)]2, (4)

where constants , and appear in assumption 2.2. For any we define

 ˜σ2λ := (5) ˜Σ2λ := maxμ∈Λ\dvtxμ≤λ˜σ2μ.

Note that can be computed from the data (the parameters and are determined completely by ; hence they are known). In fact, is a plug-in estimator of an upper bound on the variance of , while is a “monotonization” of with respect to .

Define

 ˜v2λ:=˜Σ2λ+11¯m2λ2βn−1ln(4N2ϵ−1),λ∈Λ,

where

 ¯m:=√2c∗+(πcζβ)−121+(β/2−1)+[2+βln+(1/ω1)], (6)

and constant appears in assumption 2.2.

Let ; then with every estimator , we associate the interval

 Qλ:=[˜Fλ−ϑ˜vλn−1/2,˜Fλ+ϑ˜vλn−1/2]. (7)

Define

 ˜λ:=min{λ∈Λ\dvtx⋂μ≥λ,μ∈ΛQμ≠∅}, (8)

and set finally

 ˜F:=˜F˜λ. (9)

Note that is well defined: the intersection in (8) is nonempty for .

#### 2.3.2 Oracle inequality

We will show that the estimator mimics the oracle estimator which is defined as follows:

Let

 σ2λ := c∗+2π2E[∫λω11ωI{eiω(Yj−t0)ˆfζ(ω)}dω]2, Σ2λ := maxμ∈Λ\dvtxμ≤λσ2μ,λ∈Λ.

It is shown in the proof of Lemma 5.2 (see Section A.1.2 in DGJ-suppl ()) that is an upper bound on the variance of the estimator associated with parameter . Note that defined in (5) is the empirical counterpart of the quantity . Define

 v2λ:=Σ2λ+11¯m2λ2βn−1ln(4N2ϵ−1).

Given and let

 λo=λo(α,L):=min{λ∈Λ\dvtxvλn−1/2≥2√2π−1/2Lλ−α−1/2}

and define .

The oracle estimator has attractive minimax properties over classes . In particular, it is easily verified that for any class such that one has

 Riskϵ[˜Fo;Fα(L)]≤2vλon−1/2≤ϰ1ψn,ϵ(α,L)+ϰ2ϕn,ϵ.

Here is the upper bound of Theorem 2.1 on the risk of the estimator that “knows” and , is defined in Theorem 2.2, and and are constants independent of and . Thus, the risk of the oracle estimator admits the same upper bound as the risk of the estimator that is based on the knowledge of the class parameters and .

Now we are in a position to state a bound on the risk of the estimator .

###### Theorem 2.4

Suppose that assumptions 2.2, 2.2 hold, and let

 λmax=[11¯m2ln(4Nϵ−1)]−1n.

If is the estimator defined in (7)–(9) then for any class with such that , one has

 Riskϵ[˜F˜λ;Fα(L)]≤(3−1/√2)ϑvλon−1/2.

Estimator (7)–(9) attains the optimal rates of convergence with respect to -risk within a -factor over the collection of functional classes . In particular, if is chosen to be a constant, and for some , then , and the -risk of the adaptive estimator is within a -factor of the minimax -risk for a scale of Sobolev classes. It can be shown that this -factor is unavoidable price for adaptation when the accuracy is measured by the -risk; see, for example, spokoiny ().

## 3 Minimax and adaptive affine estimation in discrete deconvolution model

The results of Section 2 imply that in the regular zone the minimax rates of convergence on the Sobolev classes are attained by linear estimator (3). It seems interesting to compare the performance of estimator (3) and its adaptive version in Section 2.3 with that of the minimax linear estimator.

Consider the estimation problem as follows; cf. JudNem (), Problem 2.2: {problem} We observe independent realizations of a random variable , taking values in . The distribution of is identified with a vector from the -dimensional simplex by setting , . Suppose that vector is affinely parameterized by an -dimensional “signal”-vector of unknown “parameters” . Here is the linear mapping with , and stands for the th element of . Our goal is to estimate a given linear form at the point underlying the observation .

It is obvious that if distributions of and are compactly supported, or can be “compactified” (i.e., for any one can point out bounded intervals of probability for and ), then under very minor regularity conditions on and , the Problem 3 approximates the initial distribution deconvolution problem with “arbitrary accuracy.” The latter means that given we can compile the discretized problem such that its -solution is the solution to the initial continuous problem with the accuracy with probability .

We consider the following discretization of the deconvolution problem: {longlist}[(4)]

Let be the (finite) observation domain, and let . We split into intervals . We denote , .

Suppose that the (finite) interval contains the support of all . Let , we partition into intervals . We denote , .

Denote . Define the matrix with elements

and the vector , with , . The elements of are the approximations of conditional probabilities , and is an approximation of .

Consider discrete observations as follows:

 ηi=\mathbh1(a0≤Yi≤a1)+m∑j=2j⋅\mathbh1(aj−1

If the sets and are selected so that , for any , if is the class of “regular distributions” and the noise distribution possesses some regularity, and if the partitions of and are “fine enough,” then solving Problem 3 with being the corresponding -dimensional cross-section of will provide us with an estimation of in the continuous deconvolution problem.

We now concentrate on solving the deconvolution problem in the discrete model.

### 3.1 Minimax estimation in the discrete model

An estimate of —a candidate solution to our problem—is a measurable function . Given tolerance , we define the -risk of such an estimate on as

 Riskϵ(˜g;X)=inf{δ\dvtxsupx∈XPx{|˜g(ηn)−gTx|>δ}<ϵ},

where stands for the distribution of observations associated with the “signal” . The minimax optimal -risk is

 Risk∗ϵ(X)=inf˜g(⋅)Riskϵ(˜g;X).

We are particularly interested in the family of estimators of the following structure:

 ˜gφ,c(ηn)=1nn∑i=1φ(ηi)+c=1nn∑i=1m∑k=1φk\mathbh1(ηi=k)+c.

We refer to such estimators as affine. In other words, is an affine function of empirical distribution: for some and ,

 ˜gφ,c(ηn)=m∑k=1φk˜Pn(k)+c,

where is the empirical distribution of the observation sample An important property of the class of affine estimators, when applied to Problem 3 with convex set , is that one can choose an estimator from the class such that its -risk attains (up to a moderate ; see Theorem 3.1 below) the minimax -risk .

From now on let us assume that is a convex closed (and, being a subset of an -dimensional simplex, compact) set.

Let us consider the affine estimator of

 ˜gϵ(ηn)≡˜g¯φ,¯c(ηn)=m∑k=1¯φk˜Pn(k)+¯c,

in which the parameters and of are defined as follows.

Consider the optimization problem

 ¯¯¯¯S(ϵ) = maxx,y∈X{12gT(y−x), maxx,y∈X{h(x,y;ϵ)≡nln(m∑j=1√[Ax]j[Ay]j)+ln(2/ϵ)≥0}.

Let be an optimal solution to (3.1), and let be the Lagrange multiplier of the constraint . We set

 ¯c=12gT[¯y+¯x]and¯φj=νnln[√[A¯y]j[A¯x]j],j=1,…,m.

We have the following result.

###### Theorem 3.1

Let . Then the -risk of the estimator satisfies

 Riskϵ(˜gϵ;X)≤¯¯¯¯S(ϵ)≤ϑ(ϵ)Risk∗ϵ(X),ϑ(ϵ)=2ln(2/ϵ)ln[1/(4ϵ)]. (11)

Note that as ; thus for small tolerance levels the -risk of the estimator is within factor of the minimax -risk. It is important to emphasize that is readily given by a solution to the explicit convex program (3.1), and as such, it can be found in a computationally efficient fashion, provided that is computationally tractable.

In the “historical perspective” the affine estimator represents an alternative to the binary search estimator , proposed in DonLiuII () for the case of “direct” observations. It can be shown that the -risk of that estimator satisfies