Cartoon Approximation with \alpha-Curvelets

# Cartoon Approximation with α-Curvelets

## Abstract

It is well-known that curvelets provide optimal approximations for so-called cartoon images which are defined as piecewise -functions, separated by a singularity curve. In this paper, we consider the more general case of piecewise -functions, separated by a singularity curve for . We first prove a benchmark result for the possibly achievable best -term approximation rate for this more general signal model. Then we introduce what we call -curvelets, which are systems that interpolate between wavelet systems on the one hand () and curvelet systems on the other hand (). Our main result states that those frames achieve this optimal rate for , up to -factors.

## 1 Introduction

In various applications in signal processing it has proven useful to decompose a given signal in a multiscale dictionary, for instance to achieve compression by coefficient thresholding or to solve inverse problems. The most popular family of such dictionaries are undoubtedly wavelets, which have had a tremendous impact in applied mathematics since Daubechies’ construction of orthonormal wavelet bases with compact support in the 1980s. While wavelets are now a well-established tool in numerical signal processing (for instance the JPEG2000 coding standard is based on a wavelet transform), it has been recognized in the past decades that they also possess several shortcomings, in particular with respect to the treatment of multidimensional data where anisotropic structures such as edges in images are typically present. This deficiency of wavelets has given birth to the research area of geometric multiscale analysis where frame constructions which are optimally adapted to anisotropic structures are sought.

### 1.1 Geometric Multiscale Analysis

A milestone in this area has been the construction of curvelet and shearlet frames which are indeed capable of optimally resolving curved singularities in multidimensional data. To be more precise, the landmark paper [3] has shown that simple coefficient thresholding in a curvelet frame yields an, up to (negligible) -factors, optimal -term approximation rate for the class of so-called cartoon images which are roughly defined as compactly supported, bivariate, piecewise functions, separated by a discontinuity curve. After this breakthrough result new constructions of anisotropic frame systems have been introduced which achieve the same optimal approximation rates for cartoon images. Among those we would like to single out shearlets which are better suited for digital implementation than the original curvelets. They were first introduced in [12] and a comprehensive summary of their properties can be found in the survey [14]. The recent work [11] introduced the framework of parabolic molecules which subsumes all the earlier constructions mentioned above and which established a transference principle of approximation results between any two systems of parabolic molecules, provided that one of them constitutes a tight frame for .

### 1.2 General Image Models

By now the model of such cartoon-images is widely recognized as a useful mathematical model for image data. However, for certain applications it might appear still too restrictive. For instance one could imagine signals which consist of piecewise smooth parts separated by a singularity curve which might not necessarily be exactly but of higher or lower smoothness.

The approximation of such a generalized signal class by hybrid shearlets has been studied in [15] for three-dimensional data. In that paper the best -term approximation results which could be established are suboptimal by a small margin. In addition, hybrid shearlets are compactly supported, but do not form a tight frame.

### 1.3 Our Contribution

In the present paper we study optimal approximation schemes for a more general model, namely bivariate piecewise functions, separated by a discontinuity curve, where . We establish a benchmark result which states that one cannot in general achieve an -term approximation rate higher than for such functions. We then introduce the notion of -curvelet frames which generalize the construction of [3], where describes the degree of anisotropy in the scaling operation used for the frame construction. The parameter corresponds to ridgelets [2], the case to second-generation curvelets as in [3], and the case to wavelets. -curvelets form a tight frame for any parameter in contrast to the systems considered in [15]. Our main result then states that all intermediate cases provide optimal (up to -factors) -term approximation rates for the above mentioned signal class with . This is particularly surprising, since for the result in the 3D situation ([15]) for intermediate parameters only a suboptimal rate could be proven. Our proof techniques rely somehow on [3], however we wish to mention that the technical details are actually quite different from those used there. In particular, we have to deal with smoothness spaces of fractional orders which forces us to replace estimates for derivatives in [3] by estimates on certain moduli of smoothness.

Having such an approximation result for a tight frame system at hand will allow us to transform our approximation result to arbitrary systems of frames of -molecules (including compactly supported hybrid shearlets as special case), as first introduced in [9]. This will be the subject of the forthcoming work [10], generalizing the framework of parabolic molecules [11].

### 1.4 Outline

After introducing the general class of image models in Section 2 and deriving the benchmark result in Theorem 2.2, Section 3 is devoted to the construction of -curvelets. Our main result on the optimal approximation properties of -curvelets for any parameter is then stated in the next section as Theorem 4.1. This section as well as the appendix do contain the lengthy, quite technical proof of this result.

### 1.5 Preliminaries

Let us fix some notational conventions used throughout this work. For we denote the Euclidean distance by and the -norm by . Further, we put for .

For a function the forward difference operator , where , is given by

 Δ(h1,…,hd)f(x1,…,xd):=f(x1+h1,…,xd+hd)−f(x1,…,xd).

In the one-dimensional case it takes the simple form . We will often apply the forward difference operator to a bivariate function . To simplify notation, the operator shall exlusively act on variables denoted or , e.g., the symbol denotes the function .

We denote by the derivative in the -th coordinate direction, , and we let for .

We write for the respective norms in the spaces and , where is a discrete set.

For , , the Fourier transform is defined by

 Ff(ξ)=∫Rdf(x)exp(−2πix⋅ξ)dx.

The short-hand notation for is .

We shall also make extensive use of the Radon transform given by

 RF(t,η):=∫Lt,ηF(x)dl(x),

with and .

The Hölder spaces for are defined as follows:

 Cγ(Rd):={f∈C(Rd):dmaxi=1\sl Höl(∂if,γ−1)<∞}

where for a function we we use the notation

 \sl Höl(f,α)=supx,y∈Rd|f(x)−f(y)||x−y|α,α∈[0,1],

for the Hölder constant. The notation refers to compactly supported functions in .

In the sequel, for two quantities , which may depend on several parameters we shall write , if there exists a constant such that , uniformly in the parameters. If the converse inequality holds true, we write and if both inequalities hold we shall write .

## 2 Cartoon Approximation

Many applications require efficient encoding of multivariate data in the sense of optimal (sparse) approximation rates by a suitable representation system. This is typically phrased as a problem of best -term approximation as explained in Subsection 2.1. The performance of an approximation scheme is then analyzed with respect to certain subclasses of the Hilbert space , which is the standard continuum domain model for -dimensional data, in particular in imaging science. The key feature of most multivariate data is the appearance of anisotropic phenomena. Hence such a subclass of is required to provide a suitable model for this fact, which is fulfilled by the subclass of so-called cartoon-like images as defined in Subsection 2.2. The main result of this section, Theorem 2.8, will provide an upper bound for the maximal achievable approximation rate for this class of cartoon images. It serves as a benchmark for the approximation performance of concrete representation systems.

### 2.1 Sparse Approximation

Let us start with a short introduction to some aspects of approximation theory. The standard continuum model for -dimensional data, in particular in imaging science, is the Hilbert space . From a practical standpoint however, a function is a rather intractable object. Therefore, in order to analyze , the most common approach is to represent it with respect to some representation system , i.e., to expand as

 f=∑λ∈Λcλφλ, (1)

and then consider the coefficients .

Moreover, since in real world applications infinitely many coefficients are infeasible, the function has to be approximated by a finite subset of this system. Letting be the number of elements allowed in this approximation, we obtain what is called an -term approximation for with respect to . The best -term approximation, usually denoted by , is optimal among those in terms of a minimal approximation error and is defined by

 fN=argming=∑λ∈ΛNcλφλ∥f−g∥22s.t.#ΛN≤N.

The approximation performance of a system for a subclass is typically measured by the asymptotic approximation rate, i.e., the decay of the -error of the best -term approximation as .

In a very general form the representation system can just be an arbitrary dictionary, i.e. we require only . General dictionaries can be rather pathological, and for there might not even exist a best -term approximation. For example, since is separable, we can choose to be a countable dense subset of . This choice would yield arbitrarily good -term approximations since the elements of can come arbitrarily close to .

Therefore, without reasonable restrictions on the search depth, the investigation of best -term approximation with respect to a given dictionary may not even make sense. A commonly used and suitable restriction is to impose polynomial depth search, which requires that the terms of the -term approximation have to come from the first elements of the dictionary, where is some fixed polynomial [6].

#### Polynomial Depth Search in a Dictionary

Let us get more specific and assume that we have a countable dictionary indexed by the natural numbers, and a fixed polynomial specifying the search depth. A non-linear -term approximation scheme can then be described by a set-valued selection function , which determines for given and the selected dictionary elements, i.e. with . The function shall obey the polynomial depth search constraint, i.e. , and the dependence on allows for adaptive approximation schemes.

Let us recall a benchmark concerning the optimality rate of approximation in a dictionary with polynomial depth search, as derived in [6]. Before, we have to define what it means for a class of functions to contain a copy of .

###### Definition 2.1.
1. A class of functions is said to contain an embedded orthogonal hypercube of dimension and sidelength if there exist and orthogonal functions for with such that the collection of hypercube vertices

 H(m;f0,(ψi)i)={h=f0+m∑i=1εiψi : εi∈{0,1}}

is contained in . It should be noted that just consists of its vertices.

2. A class of functions is said to contain a copy of , , if there exists a sequence of orthogonal hypercubes , embedded in , which have dimensions and side-lengths , such that and for some constant

 mk≥Cδ−pkfor all k∈N. (2)

Note, that if contains a copy of , then it also contains a copy of for all . It was shown in [6] that if a function class contains a copy of there exists an upper bound on the maximal achievable approximation rate via reconstruction in a fixed dictionary.

We state a reformulation of this landmark result, which in its original form [6] was stated in terms of the coefficient decay. The original proof however can be adapted to lead to the following formulation in terms of the best -term approximation, which is more appropriate for our needs. We remark, that this theorem is also valid in a general Hilbert space setting.

###### Theorem 2.2.

Suppose, that a class of functions is uniformly -bounded and contains a copy of . Then, allowing only polynomial depth search in a given dictionary, there is a constant such that for every there is a function and an , such that

 ∥f−fN∥22≥C(Nlog2(N))−(2−p)/p,

where denotes the best -term approximation under the polynomial depth search constraint.

###### Proof.

Let be a given dictionary and the polynomial, specifying the search depth. The best -term approximation of obtained in this setting shall be denoted by , the corresponding selection rule, as described above, by .

Each system can be orthonormalized by the Gram-Schmidt procedure (starting from lower indices to higher indices), giving rise to an orthonormal basis of (with the exception of some possible zero vectors). Therefore we can represent each by the unique set of coefficients obtained from an expansion in this basis. (If a basis element is zero, the corresponding coefficient is chosen to be zero.)

In order to apply information theoretic arguments, we consider the following coding procedure. For we select the dictionary elements and quantize the coefficients of obtained as above by rounding to multiples of the quantity .

We need bits of information to encode the locations of the selected elements and bits for the coefficients themselves, where is the -bound for the elements of . Hence, in this procedure we are encoding with at most

 R(N)=N(C1+C2log2(N)),C1,C2>0,

bits, and for we have for some constant . To decode, we simply reconstruct the rounded values of the coefficients and then synthesize using the selected dictionary elements.

Let be a hypercube in of dimension and sidelength . Starting with a vertex the coding-decoding procedure (for some fixed ) yields some . By passing to the -closest vertex , we again obtain an element of the hypercube .

Every vertex can be represented as a word of bits, each bit corresponding to one side of the cube. Thus the above coding procedure gives a map of the bits, which specify the vertex , to bits. The decoding then reconstructs the bits specifying the vertex . Since at the intermediate step we just have bits of information we unavoidably loose information if .

Now we can apply an information theoretic argument. By rate-distortion theory [6, 1] there must be at least one vertex , where the number of false reconstructed bits is larger than . Here is the so-called -letter distortion rate function. Since each bit determines a side of the cube, the error we make for this vertex obeys

 ∥h−^h∥22≥δ2⋅Dm(R).

Since by construction we have It follows

 ∥~h−h∥22≥14δ2⋅Dm(R).

By assumption, contains a copy of . Therefore we can find a sequence of hypercubes with sidelengths as and dimensions . We pick such that and as . For large we then obey the inequality , in fact is sufficient.

Here we can apply another result from rate-distortion theory. If for some it holds , where is the so-called single-letter distortion-rate function. Hence, if , we have

 ∥~h−h∥22≥14D1(13)δ2m.

Let denote the vertices with maximal reconstruction error at each hypercube . Then we can conclude for large

 ∥~hk−hk∥22≥14D1(13)δ2kmk≳δ2kmk≳(Nklog2(Nk))−(2−p)/p.

Finally we have to take care of the rounding errors. The Euclidean distance between the best -term approximation differs from by at most , i.e.

 ∥~hk−h′k∥2≤q√Nk,

since the coefficients belong to an orthonormal basis. It follows, with some constant ,

 ∥hk−h′k∥2≥∥~hk−hk∥2−∥~hk−h′k∥2≥C(Nklog2(Nk))12−1p−N1/2−2/pk≳(Nklog2(Nk))−(2−p)/(2p).

This finishes the proof. ∎

#### Frame Approximation

In practice one needs representations, which are robust and stable with respect to noise. This leads to the notion of a frame [4], which is a special kind of dictionary.

A system is called a frame for , if there exist constants such that

 C1∥f∥22≤∑λ∈Λ|⟨f,φλ⟩|2≤C2∥f∥22for all f∈L2(R2).

A frame is called tight, if is possible, and Parseval, if . Since the frame operator defined by is always invertible, it follows that one particular sequence of coefficients in the expansion (1) can be computed as

 cλ=⟨f,S−1φλ⟩,λ∈Λ, (3)

where is usually referred to as the canonical dual frame. Note, that in general for a redundant system the expansion coefficients in (1) are not unique. The particular coefficients (3), however, are usually referred to as the frame coefficients. They have the distinct property that they minimize the -norm.

Even in the frame setting, the computation of the best -term approximation is not yet well-understood. The delicacy of this problem can for instance be seen in [7]. A typical approach to circumvent this problem is to consider instead the -term approximation obtained by choosing the largest frame coefficients. It is evident that the associated error then also provides a bound for the error of best -term approximation.

There exists a close relation between the -term approximation rate achieved by a frame, and the decay rate of the corresponding frame coefficients. A typical measure for the sparsity of a sequence are the -(quasi-)norms, for defined by

 ∥(cλ)λ∥ωℓp:=(supε>0εp⋅#{λ:|cλ|>ε})1/p.

Equivalently, for a zero sequence indexed by , these (quasi-)norms can be characterized by , where denotes the non-increasing rearrangement of .

The following lemma shows that membership of the coefficient sequence to an -space for small implies good -term approximation rates. For a proof, we refer to [5, 16].

###### Lemma 2.3.

Let be an expansion of with respect to a frame . Further, assume that the coefficients satisfy for some . Then the best -term approximation rate is at least of order , i.e.

 ∥f−fN∥2≲N−k.

Next, we apply Theorem 2.2 to obtain an upper bound on the approximation rate of cartoon-like images.

### 2.2 Cartoon-like Images

To model the fact that multivariate data appearing in applications is typically governed by anisotropic features – in the 2-dimensional case curvilinear structures –, the so-called cartoon-like functions were introduced in [6]. This class is by now a widely used standard model in particular for natural images. It mimics the fact that natural images often consist of nearly smooth parts separated by discontinuities.

The first rigorous mathematical definition was given in [6] and extensively employed starting from the work in [3]. It postulates that images consist of -regions separated by piecewise smooth -curves. Since its introduction several extensions of this model have been introduced and studied, starting with the extended model in [15]. We consider images that contain two smooth -regions separated by a piecewise smooth -curve, where .

Without loss of generality we will subsequently assume the smoothness of the edge curve. In addition, to avoid technicalities, we restrict our considerations to star-shaped discontinuity curves, which allow an easy parametrization. This is a classical simplification also used in [6, 3, 16, 15] . We remark however that this restriction to star-shaped curves is artificial. In fact, we could work with the class of regular -Jordan domains contained in , where and the Hölder constants in the canonical parametrization are bounded.

We begin by introducing the class of star-shaped sets for and . For that we take a -function defined on the torus , where the boundary points are identified. Additionally, we assume that there exists such that for all . Then we define the subset by

 B={x∈R2 : x=(|x|2,η) in polar coordinates with η∈T,|x|2≤ρ(η)}, (4)

such that the boundary of is a closed regular -Jordan curve in parameterized by

 b(η)=(ρ(η)cos(η)ρ(η)sin(η)),η∈T. (5)

Furthermore, we require the Hölder constant of the radius function to be bounded by , i.e.,

 \sl Höl(ρ′,γ−1)=supη,~η∈T|ρ′(η)−ρ′(~η)||η−~η|γ−1≤ν, (6)

where the distance is measured on the torus .

###### Definition 2.4.

For the set is defined as the collection of all subsets , which are translates of sets of the form (4) with a boundary obeying (5) and (6).

The class of cartoon-like functions is then defined as follows.

###### Definition 2.5.
1. Let , . The cartoon-like functions are defined as the set of functions of the form

 f=f0+f1χB,

where and with and for each . To simplify notation we let .

2. denotes the class of binary cartoon-like images, i.e., functions , where and .

We aim for a benchmark for sparse approximation of functions belonging to . For this we can use the results from the previous section. In order to apply these results to our model of cartoon-like functions we investigate for which the class of cartoon-like images contains a copy of following [15, 13].

###### Theorem 2.6.

Let be fixed, and .

1. The class of binary cartoon-like images contains a copy of for .

2. The class of Hölder smooth functions with contains a copy of for .

###### Proof.

ad 1) This was proved by Donoho in [6].

ad 2) To show that contains a copy of we have to find a sequence of embedded orthogonal hypercubes of dimension and size such that and (2) holds.

Let with and and put for . Then with . By possibly rescaling we can ensure . Further, we define for and the functions

.

These functions are dilated and translated versions of satisfying and . Moreover, it holds , and since the functions and are orthogonal in for .

The hypercubes of dimensions given by

 Hk:=H(k2;0,(ψi,k)) ={h=k∑i1=1k∑i2=1εi1,i2ϕ(k⋅−i1)ϕ(k⋅−i2)|εi1,i2∈{0,1}} ={h=m2∑i=1εiψi,k|εi∈{0,1}}

with side-lengths are clearly contained in . We have to check (2). It holds

 mk=k2 =(δk∥ψ∥2)−2β+1=∥ψ∥2β+12⋅(δk)−2β+1,

and the sequence obeys . This finishes the proof. ∎

An immediate corollary is the following result.

###### Corollary 2.7.

The function class contains a copy of for .

As a direct consequence of Theorem 2.2 we now state the main result of this subsection, which has also been proved in [15].

###### Theorem 2.8.

Given an arbitrary dictionary and allowing only polynomial depth search, for the class of cartoon-like images the decay rate of the error of the best -term approximation cannot exceed .

We observe that the upper bound on the approximation rate for is the same as for the larger class . Therefore we restrict our investigation to classes of the form . Since we are able to establish a lower bound for differing only by a log-factor from this upper bound, we do not loose much.

## 3 Hybrid Curvelets

Second generation curvelets, which are nowadays simply referred to as curvelets, were introduced in 2004 by Candès and Donoho [3]. The construction involves a parabolic scaling law, which can be viewed as a natural compromise between isotropic scaling, as utilized for wavelets, and scaling in only one coordinate direction, as utilized for ridgelets [8].

A more general anisotropic scaling law is realized by so-called -scaling, where the parameter is used to control the degree of anisotropy in the scaling. The associated -scaling matrix is defined by

 Ds=(s00sα),s∈R+.

Isotropic scaling corresponds to , and scaling in one coordinate direction to . Parabolic scaling is the special case for . In this sense, curvelets can be viewed as lying in between ridgelets and wavelets.

Adapting the construction of curvelets to -scaling yields, what we will call hybrid curvelets, or more specifically -curvelets. In the following we will construct bandlimited tight frames of -curvelets for every . Thus, we obtain a whole scale of representation systems, which interpolate between wavelets for on the one end and ridgelets for on the other end.

The construction follows the same recipe used for the tight ridgelet frames in [8]. As mentioned above it can also be seen as a variation of the classical second generation curvelet frame from [3].

The construction is simplified by treating the radial and angular components separately. For the construction of the radial functions at each scale we start with -functions and with the following properties

 \rm supp ˜W(0)⊂[0,2],˜W(0)(r)=1 for all r∈[0,32], \rm supp ˜W⊂[12,2],˜W(r)=1 for all r∈[34,32].

Then we define for the functions on . Finally, we rescale for every (to obtain an integer grid later)

 W(j)(r):=˜W(j)(8πr),r∈R+.

Notice, that it holds .

Next, we define the angular functions on the unit circle , where specifies the scale and the orientation, running through with

 Lj=2⌊j(1−α)⌋,j∈N.

This time we start with a -function , living on the whole of , satisfying

 \rm supp V⊂[−34π,34π] and V(t)=1 for all t∈[−π2,π2].

Since the interval can be identified via with , we obtain -functions for every by restricting the scaled functions to . In order to end up with real-valued curvelets, we then symmetrize

 V(j,0)(ξ):=˜V(j,0)(ξ)+˜V(j,0)(−ξ),ξ∈S1.

At each scale we define the characteristic angle , and for we let

 Rj,ℓ=(cos(ωj,ℓ)−sin(ωj,ℓ)sin(ωj,ℓ)cos(ωj,ℓ)) (7)

be the rotation matrix by the angle . By rotating we finally get ,

 V(j,ℓ)(ξ):=V(j,0)(Rj,ℓξ) for ξ∈S1.

In order to secure the tightness of the frame we need to renormalize with the function

which satisfies for all . Bringing the radial and angular components together, we obtain the functions

 χ0(ξ):=W(0)(|ξ|))√Ψ(ξ),χj,ℓ(ξ)=W(j)(|ξ|)V(j,ℓ)(ξ|ξ|)√Ψ(ξ)for ξ∈R2. (8)

It is obvious that , . Moreover, these functions are real-valued, non-negative, compactly supported, and -bounded by . Let be the first unit vector and put

 Sj:={x∈S1 : |⟨x,e1⟩|≥cos(ωj/2)}.

Indeed, we have and , where for , the sets

 Wj,ℓ:={ξ∈R2 : 2j−1≤8π|ξ|≤2j+1,ξ|ξ|∈R−1j,ℓSj} (9)

are antipodal pairs of symmetric wedges.

After this preparation, we are ready to define the functions and on the Fourier side by

 ˆψ0(ξ):=χ0(ξ) and ˆψj,ℓ(ξ):=χj,ℓ(ξ) for ξ∈R2.

They have the following important property.

###### Lemma 3.1.

For every it holds

 ∥f∥22=∥f∗ψ0∥22+∑j,ℓ∥f∗ψj,ℓ∥22.
###### Proof.

By construction the system satisfies the discrete Caldéron condition, i.e.,

 |ˆψ0(ξ)|2+∑j,ℓ|ˆψj,ℓ(ξ)|2=1 for all ξ∈R2.

This yields for all

 ∥f∥22 =∫R2|ˆf(ξ)|2dξ =∫R2|ˆψ0(ξ)|2|ˆf(ξ)|2dξ+∫R2∑j,ℓ|ˆψj,ℓ(ξ)|2|ˆf(ξ)|2dξ =∥f∗ψ0∥22+∑j,ℓ∥f∗ψj,ℓ∥22.

The full frame of -curvelets is obtained by taking translates of and in the spatial domain and -normalizing afterwards. Accordingly, for , , and we define and

 ψj,ℓ,k:=2−j(1+α)/2⋅ψj,ℓ(⋅−xj,ℓ,k)withxj,ℓ,k=R−1j,ℓD−12jk.

The corresponding set of curvelet indices will henceforth be denoted by . Further, we will use the following notation for this system

 Cα(W(0),W,V)=(ψμ)μ∈M={ψ0,k : k∈Z2}∪{ψj,ℓ,k : j∈N,k∈Z2,ℓ∈{0,1,…,Lj