Cartoon Approximation with $\alpha$-Curvelets
Abstract
It is well known that curvelets provide optimal approximations for so-called cartoon images, which are defined as piecewise $C^2$ functions, separated by a $C^2$ singularity curve. In this paper, we consider the more general case of piecewise $C^\beta$ functions, separated by a $C^\beta$ singularity curve for $\beta \in (1,2]$. We first prove a benchmark result for the best possible achievable $N$-term approximation rate for this more general signal model. Then we introduce what we call $\alpha$-curvelets, which are systems that interpolate between wavelet systems on the one hand ($\alpha = 1$) and curvelet systems on the other hand ($\alpha = \tfrac12$). Our main result states that those frames achieve this optimal rate for $\alpha = \beta^{-1}$, up to $\log$-factors.
1 Introduction
In various applications in signal processing it has proven useful to decompose a given signal in a multiscale dictionary, for instance to achieve compression by coefficient thresholding or to solve inverse problems. The most popular family of such dictionaries is undoubtedly wavelets, which have had a tremendous impact in applied mathematics since Daubechies' construction of orthonormal wavelet bases with compact support in the 1980s. While wavelets are now a well-established tool in numerical signal processing (for instance the JPEG2000 coding standard is based on a wavelet transform), it has been recognized in the past decades that they also possess several shortcomings, in particular with respect to the treatment of multidimensional data where anisotropic structures such as edges in images are typically present. This deficiency of wavelets has given birth to the research area of geometric multiscale analysis, where frame constructions which are optimally adapted to anisotropic structures are sought.
1.1 Geometric Multiscale Analysis
A milestone in this area has been the construction of curvelet and shearlet frames which are indeed capable of optimally resolving curved singularities in multidimensional data. To be more precise, the landmark paper [3] has shown that simple coefficient thresholding in a curvelet frame yields an, up to (negligible) $\log$-factors, optimal $N$-term approximation rate for the class of so-called cartoon images, which are roughly defined as compactly supported, bivariate, piecewise $C^2$ functions, separated by a $C^2$ discontinuity curve. After this breakthrough result new constructions of anisotropic frame systems have been introduced which achieve the same optimal approximation rates for cartoon images. Among those we would like to single out shearlets, which are better suited for digital implementation than the original curvelets. They were first introduced in [12] and a comprehensive summary of their properties can be found in the survey [14]. The recent work [11] introduced the framework of parabolic molecules which subsumes all the earlier constructions mentioned above and which established a transference principle of approximation results between any two systems of parabolic molecules, provided that one of them constitutes a tight frame for $L^2(\mathbb{R}^2)$.
1.2 General Image Models
By now the model of such cartoon images is widely recognized as a useful mathematical model for image data. However, for certain applications it might still appear too restrictive. For instance one could imagine signals which consist of piecewise smooth parts separated by a singularity curve which might not necessarily be exactly $C^2$ but of higher or lower smoothness.
The approximation of such a generalized signal class by hybrid shearlets has been studied in [15] for three-dimensional data. In that paper the best $N$-term approximation results which could be established are suboptimal by a small margin. In addition, hybrid shearlets are compactly supported, but do not form a tight frame.
1.3 Our Contribution
In the present paper we study optimal approximation schemes for a more general model, namely bivariate piecewise $C^\beta$ functions, separated by a $C^\beta$ discontinuity curve, where $\beta \in (1,2]$. We establish a benchmark result which states that one cannot in general achieve an $N$-term approximation rate higher than $N^{-\beta}$ for such functions. We then introduce the notion of $\alpha$-curvelet frames, which generalize the construction of [3], where $\alpha \in [0,1]$ describes the degree of anisotropy in the scaling operation used for the frame construction. The parameter $\alpha = 0$ corresponds to ridgelets [2], the case $\alpha = \tfrac12$ to second-generation curvelets as in [3], and the case $\alpha = 1$ to wavelets. $\alpha$-curvelets form a tight frame for any parameter $\alpha \in [0,1]$, in contrast to the systems considered in [15]. Our main result then states that all intermediate cases $\alpha \in [\tfrac12, 1)$ provide optimal (up to $\log$-factors) $N$-term approximation rates for the above-mentioned signal class with $\beta = \alpha^{-1}$. This is particularly surprising, since in the 3D situation ([15]) only a suboptimal rate could be proven for intermediate parameters. Our proof techniques rely in part on [3]; however, we wish to mention that the technical details are actually quite different from those used there. In particular, we have to deal with smoothness spaces of fractional orders, which forces us to replace estimates for derivatives in [3] by estimates on certain moduli of smoothness.
Having such an approximation result for a tight frame system at hand will allow us to transfer our approximation result to arbitrary systems of frames of $\alpha$-molecules (including compactly supported hybrid shearlets as a special case), as first introduced in [9]. This will be the subject of the forthcoming work [10], generalizing the framework of parabolic molecules [11].
1.4 Outline
After introducing the general class of image models in Section 2 and deriving the benchmark result in Theorem 2.2, Section 3 is devoted to the construction of $\alpha$-curvelets. Our main result on the optimal approximation properties of $\alpha$-curvelets for any admissible parameter $\alpha$ is then stated in the next section as Theorem 4.1. This section, as well as the appendix, contains the lengthy, quite technical proof of this result.
1.5 Preliminaries
Let us fix some notational conventions used throughout this work. For $x, y \in \mathbb{R}^d$ we denote the Euclidean distance by $|x - y|$ and the norm by $|x|$. Further, we put $\langle x \rangle := (1 + |x|^2)^{1/2}$ for $x \in \mathbb{R}^d$.
For a function $f : \mathbb{R}^d \to \mathbb{C}$ the forward difference operator $\Delta_h^k$, where $h \in \mathbb{R}^d$ and $k \in \mathbb{N}$, is given by
$$\Delta_h^k f(x) := \sum_{i=0}^{k} (-1)^{k-i} \binom{k}{i} f(x + ih).$$
In the case $k = 1$ it takes the simple form $\Delta_h f = f(\cdot + h) - f$. We will often apply the forward difference operator to a bivariate function. To simplify notation, the operator shall exclusively act on the variables denoted $x$ or $y$, e.g., the symbol $\Delta_h^k f(x, \xi)$ denotes the function $x \mapsto \Delta_h^k \big[f(\cdot, \xi)\big](x)$.
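As a quick sanity check, the forward difference operator above can be sketched numerically; the function name `forward_difference` is our own choice for illustration:

```python
from math import comb

def forward_difference(f, x, h, k=1):
    """k-th order forward difference (Delta_h^k f)(x) with scalar step h:
    sum_{i=0}^{k} (-1)^(k-i) * C(k, i) * f(x + i*h)."""
    return sum((-1) ** (k - i) * comb(k, i) * f(x + i * h) for i in range(k + 1))

# First-order difference of f(x) = x^2 at x = 0 with h = 1: f(1) - f(0) = 1.
print(forward_difference(lambda x: x * x, 0.0, 1.0, k=1))  # 1.0
# The second-order difference of x^2 is the constant 2*h^2: it annihilates
# the linear part, just as Delta_h^k annihilates polynomials of degree < k.
print(forward_difference(lambda x: x * x, 3.0, 1.0, k=2))  # 2.0
```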
We denote by $\partial_i$ the derivative in the $i$-th coordinate direction, $i \in \{1, \ldots, d\}$, and we let $\partial^\gamma := \partial_1^{\gamma_1} \cdots \partial_d^{\gamma_d}$ for a multi-index $\gamma \in \mathbb{N}_0^d$.
We write $\|\cdot\|_p$ for the respective norms in the spaces $L^p(\mathbb{R}^d)$ and $\ell^p(\Lambda)$, where $\Lambda$ is a discrete index set.
For $f \in L^1(\mathbb{R}^d)$, the Fourier transform is defined by
$$\mathcal{F}f(\xi) := \int_{\mathbb{R}^d} f(x) e^{-2\pi i \langle x, \xi \rangle} \, dx.$$
The shorthand notation for $\mathcal{F}f$ is $\hat{f}$.
We shall also make extensive use of the Radon transform, given by
$$\mathcal{R}f(\theta, s) := \int_{\mathbb{R}} f(s\cos\theta - t\sin\theta,\; s\sin\theta + t\cos\theta) \, dt,$$
with $\theta \in [0, 2\pi)$ and $s \in \mathbb{R}$.
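As an illustration, the line integral defining the Radon transform can be approximated by a Riemann sum; the helper `radon_line_integral` and its parameters are our own, not from the paper. For the indicator of the unit disc the exact value $2\sqrt{1 - s^2}$ is known, which makes the quadrature easy to check:

```python
import numpy as np

def radon_line_integral(f, theta, s, t_max=2.0, n=4001):
    """Approximate R f(theta, s): the integral of f along the line with
    unit normal (cos theta, sin theta) at signed distance s from the origin."""
    t = np.linspace(-t_max, t_max, n)
    x = s * np.cos(theta) - t * np.sin(theta)
    y = s * np.sin(theta) + t * np.cos(theta)
    return float(np.sum(f(x, y)) * (t[1] - t[0]))  # Riemann sum over t

# Indicator of the unit disc: R f(theta, s) = 2*sqrt(1 - s^2) for |s| < 1,
# independently of theta (rotation invariance).
disc = lambda x, y: (x**2 + y**2 <= 1.0).astype(float)
val = radon_line_integral(disc, theta=0.7, s=0.5)
print(val)  # close to 2*sqrt(0.75)
```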
The Hölder spaces $C^\beta(\mathbb{R}^d)$ for $\beta = k + \gamma$ with $k \in \mathbb{N}_0$ and $\gamma \in (0, 1]$ are defined as follows:
$$C^\beta(\mathbb{R}^d) := \big\{ f \in C^k(\mathbb{R}^d) \;:\; \mathrm{H\ddot{o}l}(\partial^\eta f, \gamma) < \infty \text{ for all } |\eta| = k \big\},$$
where for a function $g$ we use the notation
$$\mathrm{H\ddot{o}l}(g, \gamma) := \sup_{x \neq y} \frac{|g(x) - g(y)|}{|x - y|^\gamma}$$
for the Hölder constant. The notation $C_c^\beta(\mathbb{R}^d)$ refers to compactly supported functions in $C^\beta(\mathbb{R}^d)$.
In the sequel, for two quantities $A, B \ge 0$, which may depend on several parameters, we shall write $A \lesssim B$ if there exists a constant $C > 0$ such that $A \le C B$, uniformly in the parameters. If the converse inequality holds true, we write $A \gtrsim B$, and if both inequalities hold we shall write $A \sim B$.
2 Cartoon Approximation
Many applications require efficient encoding of multivariate data in the sense of optimal (sparse) approximation rates by a suitable representation system. This is typically phrased as a problem of best $N$-term approximation, as explained in Subsection 2.1. The performance of an approximation scheme is then analyzed with respect to certain subclasses of the Hilbert space $L^2(\mathbb{R}^d)$, which is the standard continuum domain model for $d$-dimensional data, in particular in imaging science. The key feature of most multivariate data is the appearance of anisotropic phenomena. Hence such a subclass of $L^2(\mathbb{R}^d)$ is required to provide a suitable model for this fact, which is fulfilled by the subclass of so-called cartoon-like images as defined in Subsection 2.2. The main result of this section, Theorem 2.8, will provide an upper bound for the maximal achievable approximation rate for this class of cartoon images. It serves as a benchmark for the approximation performance of concrete representation systems.
2.1 Sparse Approximation
Let us start with a short introduction to some aspects of approximation theory. The standard continuum model for $d$-dimensional data, in particular in imaging science, is the Hilbert space $L^2(\mathbb{R}^d)$. From a practical standpoint, however, a function $f \in L^2(\mathbb{R}^d)$ is a rather intractable object. Therefore, in order to analyze $f$, the most common approach is to represent it with respect to some representation system $(\psi_\lambda)_{\lambda \in \Lambda} \subseteq L^2(\mathbb{R}^d)$, i.e., to expand $f$ as
$$f = \sum_{\lambda \in \Lambda} c_\lambda \psi_\lambda, \qquad (1)$$
and then consider the coefficients $(c_\lambda)_{\lambda \in \Lambda}$.
Moreover, since in real-world applications infinitely many coefficients are infeasible, the function $f$ has to be approximated by a finite subset of this system. Letting $N \in \mathbb{N}$ be the number of elements allowed in this approximation, we obtain what is called an $N$-term approximation for $f$ with respect to $(\psi_\lambda)_{\lambda \in \Lambda}$. The best $N$-term approximation, usually denoted by $f_N$, is optimal among those in terms of a minimal approximation error and is defined by
$$\|f - f_N\|_2 = \inf \Big\{ \Big\| f - \sum_{\lambda \in \Lambda_N} d_\lambda \psi_\lambda \Big\|_2 \;:\; \Lambda_N \subseteq \Lambda,\ \#\Lambda_N \le N \Big\}.$$
The approximation performance of a system for a subclass $\mathcal{C} \subseteq L^2(\mathbb{R}^d)$ is typically measured by the asymptotic approximation rate, i.e., the decay of the error $\|f - f_N\|_2$ of the best $N$-term approximation as $N \to \infty$.
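For an orthonormal basis the best $N$-term approximation is obtained simply by keeping the $N$ largest coefficients in magnitude, and the squared error equals the sum of squares of the discarded coefficients. A minimal numerical sketch (the helper name is our own choice):

```python
import numpy as np

def best_n_term_error(coeffs, N):
    """Squared L2 error of the best N-term approximation in an
    orthonormal basis: keep the N largest coefficients in magnitude,
    so the error is the sum of squares of the discarded ones."""
    c = np.sort(np.abs(np.asarray(coeffs, dtype=float)))[::-1]
    return float(np.sum(c[N:] ** 2))

# Coefficients decaying like n^(-1): the squared error of the best
# N-term approximation behaves like the tail sum  sum_{n>N} n^(-2) ~ 1/N.
c = 1.0 / np.arange(1, 100001)
print(best_n_term_error(c, 10))   # roughly 1/10
print(best_n_term_error(c, 100))  # roughly 1/100
```

This tail-sum picture is exactly what connects coefficient decay to approximation rates in Lemma 2.3 below.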
In a very general form the representation system can just be an arbitrary dictionary $\Phi \subseteq L^2(\mathbb{R}^d)$, i.e., we only require that its linear span is dense in $L^2(\mathbb{R}^d)$. General dictionaries can be rather pathological, and a best $N$-term approximation might not even exist. For example, since $L^2(\mathbb{R}^d)$ is separable, we can choose $\Phi$ to be a countable dense subset of $L^2(\mathbb{R}^d)$. This choice would yield arbitrarily good $1$-term approximations, since the elements of $\Phi$ can come arbitrarily close to $f$.
Therefore, without reasonable restrictions on the search depth, the investigation of best $N$-term approximation with respect to a given dictionary may not even make sense. A commonly used and suitable restriction is to impose polynomial depth search, which requires that the terms of the $N$-term approximation have to come from the first $\pi(N)$ elements of the dictionary, where $\pi$ is some fixed polynomial [6].
Polynomial Depth Search in a Dictionary
Let us get more specific and assume that we have a countable dictionary $\Phi = (\varphi_i)_{i \in \mathbb{N}}$ indexed by the natural numbers, and a fixed polynomial $\pi$ specifying the search depth. A nonlinear $N$-term approximation scheme can then be described by a set-valued selection function $\sigma$, which determines for given $f$ and $N$ the selected dictionary elements, i.e., $\sigma(f, N) \subseteq \Phi$ with $\#\sigma(f, N) \le N$. The function $\sigma$ shall obey the polynomial depth search constraint, i.e., $\sigma(f, N) \subseteq \{\varphi_1, \ldots, \varphi_{\pi(N)}\}$, and the dependence on $f$ allows for adaptive approximation schemes.
Let us recall a benchmark concerning the optimality rate of approximation in a dictionary with polynomial depth search, as derived in [6]. Before, we have to define what it means for a class of functions to contain a copy of $\ell_0^p$.
Definition 2.1.

1. A class of functions $\mathcal{C} \subseteq L^2(\mathbb{R}^d)$ is said to contain an embedded orthogonal hypercube of dimension $m$ and sidelength $\delta$ if there exist a function $f_0 \in L^2(\mathbb{R}^d)$ and orthogonal functions $\psi_i \in L^2(\mathbb{R}^d)$, $i = 1, \ldots, m$, with $\|\psi_i\|_2 = \delta$ such that the collection of hypercube vertices
$$\mathcal{H} := \Big\{ f_0 + \sum_{i=1}^{m} \varepsilon_i \psi_i \;:\; \varepsilon_i \in \{0, 1\} \Big\}$$
is contained in $\mathcal{C}$. It should be noted that $\mathcal{H}$ just consists of its vertices.

2. A class of functions $\mathcal{C} \subseteq L^2(\mathbb{R}^d)$ is said to contain a copy of $\ell_0^p$, $p > 0$, if there exists a sequence of orthogonal hypercubes $(\mathcal{H}_k)_{k \in \mathbb{N}}$, embedded in $\mathcal{C}$, which have dimensions $m_k$ and sidelengths $\delta_k$, such that $\delta_k \to 0$ and for some constant $C > 0$
$$m_k \ge C \delta_k^{-p}. \qquad (2)$$
Note that if $\mathcal{C}$ contains a copy of $\ell_0^p$, then it also contains a copy of $\ell_0^{p'}$ for all $p' \le p$. It was shown in [6] that if a function class contains a copy of $\ell_0^p$, there exists an upper bound on the maximal achievable approximation rate via reconstruction in a fixed dictionary.
We state a reformulation of this landmark result, which in its original form [6] was stated in terms of the coefficient decay. The original proof, however, can be adapted to yield the following formulation in terms of the best $N$-term approximation, which is more appropriate for our needs. We remark that this theorem is also valid in a general Hilbert space setting.
Theorem 2.2.
Suppose that a class of functions $\mathcal{C} \subseteq L^2(\mathbb{R}^d)$ is uniformly bounded and contains a copy of $\ell_0^p$. Then, allowing only polynomial depth search in a given dictionary, there is a constant $C > 0$ such that for every $N_0 \in \mathbb{N}$ there is a function $f \in \mathcal{C}$ and an $N \ge N_0$, such that
$$\|f - f_N\|_2^2 \ge C N^{-(2/p - 1)},$$
where $f_N$ denotes the best $N$-term approximation under the polynomial depth search constraint.
Proof.
Let $\Phi = (\varphi_i)_{i \in \mathbb{N}}$ be a given dictionary and $\pi$ the polynomial specifying the search depth. The best $N$-term approximation of $f$ obtained in this setting shall be denoted by $f_N$, the corresponding selection rule, as described above, by $\sigma$.
Each system $\sigma(f, N)$ can be orthonormalized by the Gram–Schmidt procedure (proceeding from lower indices to higher indices), giving rise to an orthonormal basis of its span (with the exception of some possible zero vectors). Therefore we can represent each $f_N$ by the unique set of coefficients obtained from an expansion in this basis. (If a basis element is zero, the corresponding coefficient is chosen to be zero.)
In order to apply information-theoretic arguments, we consider the following coding procedure. For $f \in \mathcal{C}$ we select the dictionary elements $\sigma(f, N)$ and quantize the coefficients of $f_N$ obtained as above by rounding them to multiples of a fixed quantity.
We need $N \log_2 \pi(N)$ bits of information to encode the locations of the selected elements and $\mathcal{O}(N)$ bits for the coefficients themselves, where the implicit constant depends on the uniform bound for the elements of $\mathcal{C}$. Hence, in this procedure we are encoding $f$ with at most
$R(N)$ bits, and for large $N$ we have $R(N) \le C N \log_2 N$ for some constant $C > 0$. To decode, we simply reconstruct the rounded values of the coefficients and then synthesize using the selected dictionary elements.
Let $\mathcal{H}$ be a hypercube in $\mathcal{C}$ of dimension $m$ and sidelength $\delta$. Starting with a vertex $h$ the coding–decoding procedure (for some fixed $N$) yields some $\tilde{h} \in L^2(\mathbb{R}^d)$. By passing to the vertex $h'$ closest to $\tilde{h}$, we again obtain an element of the hypercube $\mathcal{H}$.
Every vertex can be represented as a word of $m$ bits, each bit corresponding to one side of the cube. Thus the above coding procedure gives a map of the $m$ bits, which specify the vertex $h$, to $R(N)$ bits. The decoding then reconstructs the $m$ bits specifying the vertex $h'$. Since at the intermediate step we just have $R(N)$ bits of information, we unavoidably lose information if $R(N) < m$.
Now we can apply an information-theoretic argument. By rate–distortion theory [6, 1] there must be at least one vertex $h$ for which the number of falsely reconstructed bits is larger than $m D(R(N)/m)$, where $D$ is the so-called per-letter distortion-rate function. Since each bit determines a side of the cube, the error we make for this vertex obeys
Since by construction $h'$ is the vertex closest to $\tilde{h}$, we have $\|h' - \tilde{h}\|_2 \le \|h - \tilde{h}\|_2$. It follows
By assumption, $\mathcal{C}$ contains a copy of $\ell_0^p$. Therefore we can find a sequence of hypercubes $(\mathcal{H}_k)_k$ with sidelengths $\delta_k \to 0$ as $k \to \infty$ and dimensions $m_k \gtrsim \delta_k^{-p}$. We pick $k = k(N)$ such that $m_{k(N)} > R(N)$ and $\delta_{k(N)} \to 0$ as $N \to \infty$. For large $N$ the inequality $R(N) < m_{k(N)}$ is then satisfied.
Here we can apply another result from rate–distortion theory. If $R/m \le c$ for some $c < 1$, it holds $D(R/m) \ge d > 0$, where $D$ is the so-called single-letter distortion-rate function. Hence, if $R(N) \le c\, m_{k(N)}$, we have
Let $h_k$ denote the vertices with maximal reconstruction error in each hypercube $\mathcal{H}_k$. Then we can conclude for large $N$
Finally we have to take care of the rounding errors. The best $N$-term approximation $f_N$ differs from its coded–decoded version by at most the accumulated quantization error,
since the coefficients belong to an orthonormal basis. It follows, with some constant $\tilde{C} > 0$,
This finishes the proof. ∎
Frame Approximation
In practice one needs representations which are robust and stable with respect to noise. This leads to the notion of a frame [4], which is a special kind of dictionary.
A system $(\psi_\lambda)_{\lambda \in \Lambda} \subseteq L^2(\mathbb{R}^d)$ is called a frame for $L^2(\mathbb{R}^d)$ if there exist constants $0 < A \le B < \infty$ such that
$$A \|f\|_2^2 \le \sum_{\lambda \in \Lambda} |\langle f, \psi_\lambda \rangle|^2 \le B \|f\|_2^2 \quad \text{for all } f \in L^2(\mathbb{R}^d).$$
A frame is called tight if $A = B$ is possible, and Parseval if $A = B = 1$. Since the frame operator $S$ defined by $Sf := \sum_{\lambda \in \Lambda} \langle f, \psi_\lambda \rangle \psi_\lambda$ is always invertible, it follows that one particular sequence of coefficients in the expansion (1) can be computed as
$$c_\lambda = \langle f, S^{-1} \psi_\lambda \rangle, \qquad (3)$$
where $(S^{-1}\psi_\lambda)_{\lambda \in \Lambda}$ is usually referred to as the canonical dual frame. Note that in general, for a redundant system, the expansion coefficients in (1) are not unique. The particular coefficients (3), however, are usually referred to as the frame coefficients. They have the distinct property that they minimize the $\ell^2$ norm among all valid coefficient sequences.
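A finite-dimensional sketch may help: for a frame of $\mathbb{R}^2$ the frame operator is a small matrix, the canonical dual is obtained by applying its inverse, and the minimal-$\ell^2$ property of the frame coefficients can be observed directly. The Mercedes-Benz frame used here is a standard textbook example, not taken from the paper:

```python
import numpy as np

# A redundant tight frame for R^2: three unit vectors at 120-degree spacing.
# Rows of Psi are the frame vectors psi_i.
angles = np.pi / 2 + np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
Psi = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (3, 2)

S = Psi.T @ Psi                 # frame operator; here S = (3/2) * Identity
f = np.array([1.0, 2.0])
dual = Psi @ np.linalg.inv(S)   # rows are the canonical dual vectors S^{-1} psi_i
c = dual @ f                    # frame coefficients <f, S^{-1} psi_i>
f_rec = Psi.T @ c               # synthesis: sum_i c_i psi_i recovers f

# The expansion is not unique: since the three frame vectors sum to zero,
# d = c + t*(1,1,1) also reconstructs f, but with a larger l2 norm.
d = c + 0.5 * np.ones(3)
print(np.linalg.norm(c), "<", np.linalg.norm(d))
```

The strict inequality in the last line illustrates that the frame coefficients (3) are the minimal-$\ell^2$ choice among all coefficient sequences reconstructing $f$.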
Even in the frame setting, the computation of the best $N$-term approximation is not yet well understood. The delicacy of this problem can for instance be seen in [7]. A typical approach to circumvent this problem is to consider instead the $N$-term approximation obtained by choosing the $N$ largest frame coefficients. The associated error then also provides an upper bound for the error of the best $N$-term approximation.
There exists a close relation between the $N$-term approximation rate achieved by a frame and the decay rate of the corresponding frame coefficients. A typical measure for the sparsity of a sequence $c = (c_\lambda)_{\lambda \in \Lambda}$ are the $\ell^p$ (quasi-)norms, for $0 < p < \infty$ defined by
$$\|c\|_{\ell^p} := \Big( \sum_{\lambda \in \Lambda} |c_\lambda|^p \Big)^{1/p}.$$
Equivalently, for a null sequence $c$ indexed by $\mathbb{N}$, these (quasi-)norms can be characterized by $\|c\|_{\ell^p} = \|(c_n^*)_n\|_{\ell^p}$, where $(c_n^*)_{n \in \mathbb{N}}$ denotes the nonincreasing rearrangement of $(|c_n|)_n$.
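A short numerical sketch of these notions (the helper names are our own): the $\ell^p$ quasi-norm depends only on the rearrangement, and it dominates the weak-type quantity $\sup_n n^{1/p} c_n^*$:

```python
import numpy as np

def lp_quasinorm(c, p):
    """l^p (quasi-)norm of a finite sequence c, for 0 < p < infinity."""
    return float(np.sum(np.abs(np.asarray(c, dtype=float)) ** p) ** (1.0 / p))

def rearranged(c):
    """Nonincreasing rearrangement c* of the absolute values of c."""
    return np.sort(np.abs(np.asarray(c, dtype=float)))[::-1]

c = np.array([0.1, -2.0, 0.5, 1.0, -0.25])
cs = rearranged(c)          # [2.0, 1.0, 0.5, 0.25, 0.1]

p = 0.5
# The quasi-norm is invariant under rearrangement ...
print(lp_quasinorm(c, p) - lp_quasinorm(cs, p))  # essentially 0
# ... and since n * (c*_n)^p <= sum_{k<=n} (c*_k)^p, it dominates
# the weak-type expression sup_n n^(1/p) * c*_n:
n = np.arange(1, len(c) + 1)
weak = float(np.max(n ** (1.0 / p) * cs))
print(weak, "<=", lp_quasinorm(c, p))
```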
The following lemma shows that membership of the coefficient sequence to an space for small implies good term approximation rates. For a proof, we refer to [5, 16].
Lemma 2.3.
Let $f = \sum_{\lambda \in \Lambda} c_\lambda \psi_\lambda$ be an expansion of $f \in L^2(\mathbb{R}^d)$ with respect to a frame $(\psi_\lambda)_{\lambda \in \Lambda}$. Further, assume that the coefficients satisfy $(c_\lambda)_\lambda \in \ell^p(\Lambda)$ for some $0 < p < 2$. Then the best $N$-term approximation rate is at least of order $N^{-(2/p - 1)}$, i.e.,
$$\|f - f_N\|_2^2 \lesssim N^{-(2/p - 1)}.$$
Next, we apply Theorem 2.2 to obtain an upper bound on the approximation rate for cartoon-like images.
2.2 Cartoonlike Images
To model the fact that multivariate data appearing in applications is typically governed by anisotropic features – in the 2-dimensional case curvilinear structures –, the so-called cartoon-like functions were introduced in [6]. This class is by now a widely used standard model, in particular for natural images. It mimics the fact that natural images often consist of nearly smooth parts separated by discontinuities.
The first rigorous mathematical definition was given in [6] and extensively employed starting from the work in [3]. It postulates that images consist of regions separated by piecewise smooth curves. Since its introduction several extensions of this model have been introduced and studied, starting with the extended model in [15]. We consider images that contain two smooth regions separated by a piecewise $C^\beta$ curve, where $\beta \in (1, 2]$.
Without loss of generality we will subsequently assume $C^\beta$ smoothness of the entire edge curve. In addition, to avoid technicalities, we restrict our considerations to star-shaped discontinuity curves, which allow an easy parametrization. This is a classical simplification also used in [6, 3, 16, 15]. We remark, however, that this restriction to star-shaped curves is artificial. In fact, we could work with the class of regular $C^\beta$ Jordan domains contained in $[0,1]^2$, whose Hölder constants in the canonical parametrization are bounded.
We begin by introducing the class $\mathrm{STAR}^\beta(\nu)$ of star-shaped sets for $\beta \in (1, 2]$ and $\nu > 0$. For that we take a radius function $\rho : \mathbb{T} \to [0, \infty)$ defined on the torus $\mathbb{T} = [0, 2\pi]$, where the boundary points are identified. Additionally, we assume that there exists $\rho_0 < \tfrac12$ such that $\rho(\theta) \le \rho_0$ for all $\theta \in \mathbb{T}$. Then we define the subset $B \subseteq [0,1]^2$ by
$$B := \big\{ x \in \mathbb{R}^2 \;:\; |x - x_0| \le \rho(\theta(x)) \big\}, \qquad (4)$$
where $x_0 \in [0,1]^2$ denotes a fixed center and $\theta(x)$ the angular component of $x - x_0$, such that the boundary of $B$ is a closed regular Jordan curve in $\mathbb{R}^2$ parameterized by
$$b(\theta) = x_0 + \rho(\theta) (\cos\theta, \sin\theta)^T, \qquad \theta \in \mathbb{T}. \qquad (5)$$
Furthermore, we require $\rho \in C^\beta(\mathbb{T})$ with the Hölder constant of the radius function bounded by $\nu$, i.e.,
$$\mathrm{H\ddot{o}l}(\rho', \beta - 1) \le \nu, \qquad (6)$$
where the distance $|\theta - \theta'|$ is measured on the torus $\mathbb{T}$.
Definition 2.4.
The class of cartoon-like functions is then defined as follows.
Definition 2.5.

1. Let $\nu > 0$ and $\beta \in (1, 2]$. The cartoon-like functions $\mathcal{E}^\beta(\mathbb{R}^2; \nu)$ are defined as the set of functions $f : \mathbb{R}^2 \to \mathbb{C}$ of the form
$$f = f_1 + \chi_B f_2,$$
where $B \in \mathrm{STAR}^\beta(\nu)$ and $f_1, f_2 \in C^\beta(\mathbb{R}^2)$ with $\mathrm{supp}\, f_i \subseteq [0,1]^2$ and $\|f_i\|_{C^\beta} \le 1$ for each $i \in \{1, 2\}$. To simplify notation we let $\mathcal{E}^\beta(\mathbb{R}^2) := \mathcal{E}^\beta(\mathbb{R}^2; 1)$.

2. $\mathcal{E}^\beta_{\mathrm{bin}}(\mathbb{R}^2; \nu)$ denotes the class of binary cartoon-like images, i.e., functions $f = \chi_B$, where $B \in \mathrm{STAR}^\beta(\nu)$ and $\chi_B$ is the characteristic function of $B$.
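For illustration, a binary cartoon $\chi_B$ with a star-shaped $B$ given by a radius function can be rendered on a grid; the radius function and all helper names below are our own illustrative choices:

```python
import numpy as np

def cartoon_indicator(rho, center, X, Y):
    """Binary cartoon chi_B on a grid, for a star-shaped set B given in
    polar coordinates by a radius function rho(theta) around `center`."""
    dx, dy = X - center[0], Y - center[1]
    theta = np.arctan2(dy, dx) % (2 * np.pi)      # angle on the torus [0, 2*pi)
    return (np.hypot(dx, dy) <= rho(theta)).astype(float)

# Example: a smooth, 2*pi-periodic radius function (a wavy disc).
rho = lambda t: 0.25 + 0.05 * np.cos(3 * t)
xs = np.linspace(0.0, 1.0, 256)
X, Y = np.meshgrid(xs, xs)
img = cartoon_indicator(rho, (0.5, 0.5), X, Y)
print(img.shape)  # (256, 256), values in {0, 1}
```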
We aim for a benchmark for sparse approximation of functions belonging to $\mathcal{E}^\beta(\mathbb{R}^2; \nu)$. For this we can use the results from the previous subsection. In order to apply these results to our model of cartoon-like functions, we investigate for which $p$ the class of cartoon-like images contains a copy of $\ell_0^p$, following [15, 13].
Theorem 2.6.
Let $\nu > 0$ be fixed, and $\beta \in (1, 2]$.

1. The class of binary cartoon-like images $\mathcal{E}^\beta_{\mathrm{bin}}(\mathbb{R}^2; \nu)$ contains a copy of $\ell_0^p$ for $p = \frac{2}{\beta + 1}$.

2. The class of Hölder smooth functions $f \in C_c^\beta(\mathbb{R}^2)$ with $\|f\|_{C^\beta} \le 1$ and $\mathrm{supp}\, f \subseteq [0,1]^2$ contains a copy of $\ell_0^p$ for $p = \frac{2}{\beta + 1}$.
Proof.
ad 1) This was proved by Donoho in [6].
ad 2) To show that the class of Hölder smooth functions contains a copy of $\ell_0^p$ we have to find a sequence of embedded orthogonal hypercubes of dimensions $m_k$ and sidelengths $\delta_k$ such that $\delta_k \to 0$ and (2) holds.
Let $\psi \in C_c^\infty(\mathbb{R}^2)$ with $\psi \not\equiv 0$ and $\mathrm{supp}\, \psi \subseteq [0,1]^2$, and put $\psi_k := k^{-\beta} \psi(k \cdot)$ for $k \in \mathbb{N}$. Then $\mathrm{supp}\, \psi_k \subseteq [0, 1/k]^2$ with $\|\psi_k\|_{C^\beta} \lesssim \|\psi\|_{C^\beta}$. By possibly rescaling $\psi$ we can ensure $\|\psi_k\|_{C^\beta} \le 1$. Further, we define for $k \in \mathbb{N}$ and $i = (i_1, i_2) \in \{0, 1, \ldots, k-1\}^2$ the functions
$$\psi_{k,i}(x) := \psi_k(x - i/k) = k^{-\beta} \psi(kx - i).$$
These functions are dilated and translated versions of $\psi$ satisfying $\mathrm{supp}\, \psi_{k,i} \subseteq [0,1]^2$ and $\|\psi_{k,i}\|_{C^\beta} \le 1$. Moreover, it holds $\|\psi_{k,i}\|_2 = k^{-(\beta+1)} \|\psi\|_2$, and since their supports are essentially disjoint, the functions $\psi_{k,i}$ and $\psi_{k,i'}$ are orthogonal in $L^2(\mathbb{R}^2)$ for $i \neq i'$.
The hypercubes of dimensions $m_k = k^2$ given by
$$\mathcal{H}_k := \Big\{ \sum_{i} \varepsilon_i \psi_{k,i} \;:\; \varepsilon_i \in \{0, 1\} \Big\}$$
with sidelengths $\delta_k = k^{-(\beta+1)} \|\psi\|_2$ are clearly contained in the class. We have to check (2). It holds
$$\delta_k^{-p} = \big( k^{\beta+1} \|\psi\|_2^{-1} \big)^{2/(\beta+1)} \asymp k^2 = m_k,$$
and the sequence obeys $\delta_k \to 0$. This finishes the proof. ∎
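The scaling in this computation can be verified numerically: with the assumed choices $m_k = k^2$ and $\delta_k \asymp k^{-(\beta+1)}$, condition (2) holds with equality up to constants precisely for $p = 2/(\beta+1)$:

```python
import numpy as np

beta = 1.5                     # smoothness parameter in (1, 2]
p = 2.0 / (beta + 1.0)         # the claimed copy of l^p_0
k = np.arange(1, 200, dtype=float)
m = k ** 2                     # hypercube dimensions m_k = k^2
delta = k ** (-(beta + 1.0))   # sidelengths delta_k ~ k^(-(beta+1))

# Condition (2): m_k >= C * delta_k^(-p); here delta_k^(-p) = k^2 exactly,
# since p*(beta+1) = 2.
print(np.max(np.abs(delta ** (-p) - m)))  # ~ 0 up to rounding
```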
An immediate corollary is the following result.
Corollary 2.7.
The function class $\mathcal{E}^\beta(\mathbb{R}^2)$ contains a copy of $\ell_0^p$ for $p = \frac{2}{\beta + 1}$.
As a direct consequence of Theorem 2.2 we now state the main result of this subsection, which has also been proved in [15].
Theorem 2.8.
Given an arbitrary dictionary and allowing only polynomial depth search, for the class $\mathcal{E}^\beta(\mathbb{R}^2)$ of cartoon-like images the decay rate of the squared error of the best $N$-term approximation cannot exceed the order $N^{-\beta}$.
We observe that the upper bound on the approximation rate for $\mathcal{E}^\beta_{\mathrm{bin}}(\mathbb{R}^2; \nu)$ is the same as for the larger class $\mathcal{E}^\beta(\mathbb{R}^2; \nu)$. Therefore we restrict our investigation to classes of the form $\mathcal{E}^\beta(\mathbb{R}^2)$. Since we are able to establish an approximation rate differing only by a log-factor from this upper bound, we do not lose much.
3 Hybrid Curvelets
Second generation curvelets, which are nowadays simply referred to as curvelets, were introduced in 2004 by Candès and Donoho [3]. The construction involves a parabolic scaling law, which can be viewed as a natural compromise between isotropic scaling, as utilized for wavelets, and scaling in only one coordinate direction, as utilized for ridgelets [8].
A more general anisotropic scaling law is realized by so-called $\alpha$-scaling, where the parameter $\alpha \in [0, 1]$ is used to control the degree of anisotropy in the scaling. The associated scaling matrix $A_{\alpha, s}$ is defined by
$$A_{\alpha, s} := \begin{pmatrix} s & 0 \\ 0 & s^\alpha \end{pmatrix}, \qquad s > 0.$$
Isotropic scaling corresponds to $\alpha = 1$, and scaling in only one coordinate direction to $\alpha = 0$. Parabolic scaling is the special case $\alpha = \tfrac12$. In this sense, curvelets can be viewed as lying in between ridgelets and wavelets.
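The $\alpha$-scaling matrix is easy to sketch; the three special cases mentioned above then read as follows (the function name is our own choice):

```python
import numpy as np

def scaling_matrix(alpha, s):
    """alpha-scaling matrix A_{alpha,s} = diag(s, s^alpha) for s > 0."""
    return np.diag([s, s ** alpha])

s = 4.0
A_ridge = scaling_matrix(0.0, s)   # ridgelets: scaling in one direction only
A_curve = scaling_matrix(0.5, s)   # parabolic scaling (classical curvelets)
A_wave  = scaling_matrix(1.0, s)   # wavelets: isotropic scaling
print(A_curve)   # diag(4, 2): width scales like the square root of the length
```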
Adapting the construction of curvelets to $\alpha$-scaling yields what we will call hybrid curvelets, or more specifically $\alpha$-curvelets. In the following we will construct bandlimited tight frames of $\alpha$-curvelets for every $\alpha \in [0, 1]$. Thus, we obtain a whole scale of representation systems which interpolate between wavelets for $\alpha = 1$ on the one end and ridgelets for $\alpha = 0$ on the other end.
The construction follows the same recipe used for the tight ridgelet frames in [8]. As mentioned above it can also be seen as a variation of the classical second generation curvelet frame from [3].
The construction is simplified by treating the radial and angular components separately. For the construction of the radial functions at each scale we start with a low-pass window $W_0 \in C_c^\infty(\mathbb{R})$ and a corona window $W \in C_c^\infty(\mathbb{R})$ with the property
$$W_0(r)^2 + \sum_{j \ge 1} W(2^{-j} r)^2 = 1 \quad \text{for all } r > 0.$$
Then we define for $j \ge 1$ the functions $W_j := W(2^{-j} \cdot)$ on $\mathbb{R}$. Finally, we rescale each $W_j$ for every $j \in \mathbb{N}_0$ (to obtain an integer grid later).
Notice that the rescaled family still satisfies this discrete Calderón-type condition.
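A radial window family satisfying a discrete Calderón condition can be sketched with a Meyer-type telescoping construction; the concrete windows below are our own illustrative choice and not the specific functions used in the paper:

```python
import numpy as np

def smooth_step(t):
    """C^1 ramp: 0 for t <= 0, 1 for t >= 1, monotone in between."""
    t = np.clip(t, 0.0, 1.0)
    return t * t * (3.0 - 2.0 * t)

def low_pass(r):
    """phi(r): equal to 1 on [0, 1], decaying smoothly to 0 on [1, 2]."""
    return np.cos(0.5 * np.pi * smooth_step(np.abs(r) - 1.0))

def W(r):
    """Corona window with W(r)^2 = phi(r/2)^2 - phi(r)^2 >= 0."""
    return np.sqrt(np.maximum(low_pass(r / 2.0) ** 2 - low_pass(r) ** 2, 0.0))

# Discrete Calderon condition: phi(r)^2 + sum_j W(2^-j r)^2 telescopes to 1.
r = np.linspace(0.01, 100.0, 5000)
total = low_pass(r) ** 2 + sum(W(2.0 ** (-j) * r) ** 2 for j in range(12))
print(np.max(np.abs(total - 1.0)))   # ~ 0 (up to rounding)
```

The key point is the telescoping: $\sum_{j=0}^{J} W(2^{-j}r)^2 = \varphi(2^{-J-1}r)^2 - \varphi(r)^2$, and $\varphi(2^{-J-1}r)^2 \to 1$ for every fixed $r > 0$.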
Next, we define the angular functions $V_{j,\ell}$ on the unit circle $S^1$, where $j \in \mathbb{N}_0$ specifies the scale and $\ell$ the orientation, running through a finite index set at each scale.
This time we start with a function $V \in C^\infty(\mathbb{R})$, living on the whole of $\mathbb{R}$, satisfying
$$\mathrm{supp}\, V \subseteq [-1, 1] \quad \text{and} \quad \sum_{\ell \in \mathbb{Z}} V(\cdot - \ell)^2 = 1.$$
Since the interval $[0, 2\pi)$ can be identified via $\theta \mapsto (\cos\theta, \sin\theta)$ with $S^1$, we obtain functions on $S^1$ for every scale $j$ by restricting the scaled functions to $[0, 2\pi)$. In order to end up with real-valued curvelets, we then symmetrize.
At each scale $j$ we define the characteristic angle $\omega_j := \pi \cdot 2^{-\lfloor j(1-\alpha) \rfloor}$, and for each orientation $\ell$ we let
$$R_{j,\ell} := \begin{pmatrix} \cos(\ell \omega_j) & -\sin(\ell \omega_j) \\ \sin(\ell \omega_j) & \cos(\ell \omega_j) \end{pmatrix} \qquad (7)$$
be the rotation matrix by the angle $\ell \omega_j$. By rotating we finally get the angular windows $V_{j,\ell}$.
In order to secure the tightness of the frame we need to renormalize with the function
$$\Phi(\xi) := \Big( \sum_{j, \ell} \big( W_j(|\xi|) V_{j,\ell}(\xi / |\xi|) \big)^2 \Big)^{1/2},$$
which satisfies $\Phi(\xi) \asymp 1$ for all $\xi \neq 0$. Bringing the radial and angular components together, we obtain the functions
(8) 
It is obvious that these functions are well-defined for all $j$ and $\ell$. Moreover, they are real-valued, nonnegative, compactly supported, and bounded by $1$. Let $e_1$ be the first unit vector and put
Indeed, we have the required support and covering properties, where for each $j$ and $\ell$ the sets
(9) 
are antipodal pairs of symmetric wedges.
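Under the assumption that the wedges at scale $j$ have angular width of order $2^{-j(1-\alpha)}$, so that their number is of order $2^{\lfloor j(1-\alpha) \rfloor}$ (the exact constants in the paper may differ), the orientation counts for the three landmark parameters behave as follows:

```python
import numpy as np

def num_orientations(j, alpha):
    """Assumed number of angular sectors at scale j for alpha-scaling:
    wedges of angular width ~ 2^(-j(1-alpha)) tiling a half-circle."""
    return 2 ** int(np.floor(j * (1.0 - alpha)))

# alpha = 1 (wavelets):  a single orientation at every scale.
# alpha = 1/2 (curvelets): the count doubles every other scale.
# alpha = 0 (ridgelets): the count doubles at every scale.
print([num_orientations(j, 1.0) for j in range(5)])  # [1, 1, 1, 1, 1]
print([num_orientations(j, 0.5) for j in range(5)])  # [1, 1, 2, 2, 4]
print([num_orientations(j, 0.0) for j in range(5)])  # [1, 2, 4, 8, 16]
```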
After this preparation, we are ready to define the curvelet generators on the Fourier side by
They have the following important property.
Lemma 3.1.
For every $f \in L^2(\mathbb{R}^2)$ it holds
Proof.
By construction the system satisfies the discrete Calderón condition, i.e.,
This yields for all $\xi \in \mathbb{R}^2 \setminus \{0\}$
∎
The full frame of $\alpha$-curvelets is obtained by taking translates of the generators in the spatial domain and normalizing afterwards. Accordingly, for each scale $j$, orientation $\ell$, and translation index $k \in \mathbb{Z}^2$ we define the corresponding curvelet.
The resulting set of curvelet indices will henceforth be denoted by $\Lambda$. Further, we will use the following notation for this system: