# Deep Convolutional Neural Networks on Cartoon Functions

###### Abstract

Wiatowski and Bölcskei, 2015, proved that deformation stability and vertical translation invariance of deep convolutional neural network-based feature extractors are guaranteed by the network structure per se rather than the specific convolution kernels and non-linearities. While the translation invariance result applies to square-integrable functions, the deformation stability bound holds for band-limited functions only. Many signals of practical relevance (such as natural images) exhibit, however, sharp and curved discontinuities and are hence not band-limited. The main contribution of this paper is a deformation stability result that takes these structural properties into account. Specifically, we establish deformation stability bounds for the class of cartoon functions introduced by Donoho, 2001.



Philipp Grohs^{∗},
Thomas Wiatowski^{†}, and Helmut Bölcskei^{†}
^{∗}Dept. Math., ETH Zurich, Switzerland, and Dept. Math., University of Vienna, Austria
^{†}Dept. IT & EE, ETH Zurich, Switzerland,
^{∗}philipp.grohs@sam.math.ethz.ch, ^{†}{withomas, boelcskei}@nari.ee.ethz.ch

## I Introduction

Feature extractors based on so-called deep convolutional neural networks have been applied with tremendous success in a wide range of practical signal classification tasks [1, 2, 3]. These networks are composed of multiple layers, each of which computes convolutional transforms, followed by the application of non-linearities and pooling operations.

The mathematical analysis of feature extractors generated by deep convolutional neural networks was initiated in a seminal paper by Mallat [4]. Specifically, Mallat analyzes so-called scattering networks, where signals are propagated through layers that compute semi-discrete wavelet transforms (i.e., convolutional transforms with pre-specified filters obtained from a mother wavelet through scaling operations), followed by modulus non-linearities. It was shown in [4] that the resulting wavelet-modulus feature extractor is horizontally translation-invariant [5] and deformation-stable, with the stability result applying to a function space that depends on the underlying mother wavelet.

Recently, Wiatowski and Bölcskei [5] extended Mallat’s theory to incorporate convolutional transforms with filters that are (i) pre-specified and potentially structured such as Weyl-Heisenberg (Gabor) functions [6], wavelets [7], curvelets [8], shearlets [9], and ridgelets [10], (ii) pre-specified and unstructured such as random filters [11], and (iii) learned in a supervised [12] or unsupervised [13] fashion. Furthermore, the networks in [5] may employ general Lipschitz-continuous non-linearities (e.g., rectified linear units, shifted logistic sigmoids, hyperbolic tangents, and the modulus function) and pooling through sub-sampling. The essence of the results in [5] is that vertical translation invariance and deformation stability are induced by the network structure per se rather than the specific choice of filters and non-linearities. While the vertical translation invariance result in [5] is general in the sense of applying to the function space $L^2(\mathbb{R}^d)$, the deformation stability result in [5] pertains to square-integrable band-limited functions. Moreover, the corresponding deformation stability bound depends linearly on the bandwidth.

Many signals of practical relevance (such as natural images) can be modeled as square-integrable functions that are, however, not band-limited or have large bandwidth. Large bandwidths render the deformation stability bound in [5] void as a consequence of its linear dependence on bandwidth.

### Contributions

The question considered in this paper is whether taking structural properties of natural images into account can lead to stronger deformation stability bounds. We show that the answer is in the affirmative by analyzing the class of cartoon functions introduced in [14]. Cartoon functions satisfy mild decay properties and are piecewise continuously differentiable apart from curved discontinuities along Lipschitz-continuous hypersurfaces. Moreover, they provide a good model for natural images such as those in the MNIST [15], Caltech-256 [16], and CIFAR-100 [17] datasets as well as for images of geometric objects of different shapes, sizes, and colors [18, 19]. The proof of our main result is based on the decoupling technique introduced in [5]. The essence of decoupling is that contractivity of the feature extractor combined with deformation stability of the signal class under consideration—under smoothness conditions on the deformation—establishes deformation stability for the feature extractor. Our main technical contribution here is to prove deformation stability for the class of cartoon functions. Moreover, we show that the decay rate of the resulting deformation stability bound is best possible. The results we obtain further underpin the observation made in [5] of deformation stability and vertical translation invariance being induced by the network structure per se.

### Notation

We refer the reader to [5, Sec. 1] for the general notation employed in this paper. In addition, we will need the following notation. For $x \in \mathbb{R}^d$, we set $\langle x \rangle := (1 + |x|^2)^{1/2}$. The Minkowski sum of sets $A, B \subseteq \mathbb{R}^d$ is $A + B := \{a + b \,:\, a \in A,\ b \in B\}$. A Lipschitz domain is a set $B \subseteq \mathbb{R}^d$ whose boundary $\partial B$ is “sufficiently regular” to be thought of as locally being the graph of a Lipschitz-continuous function; for a formal definition see [20, Def. 1.40]. The indicator function of a set $B \subseteq \mathbb{R}^d$ is defined as $\mathbb{1}_B(x) := 1$, for $x \in B$, and $\mathbb{1}_B(x) := 0$, for $x \notin B$. For a measurable set $B \subseteq \mathbb{R}^d$, we let $\mathrm{vol}(B) := \int_{\mathbb{R}^d} \mathbb{1}_B(x)\, dx$.

## II Deep convolutional neural network-based feature extractors

We set the stage by briefly reviewing the deep convolutional feature extraction network presented in [5], the basis of which is a sequence of triplets

$$\Omega := \big( (\Psi_n, M_n, R_n) \big)_{n \in \mathbb{N}},$$

referred to as a module-sequence. The triplet $(\Psi_n, M_n, R_n)$—associated with the $n$-th network layer—consists of (i) a collection $\Psi_n := \{ g_{\lambda_n} \}_{\lambda_n \in \Lambda_n}$ of so-called atoms $g_{\lambda_n} \in L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$, indexed by a countable set $\Lambda_n$ and satisfying the Bessel condition $\sum_{\lambda_n \in \Lambda_n} \| f * g_{\lambda_n} \|_2^2 \le B_n \| f \|_2^2$, for all $f \in L^2(\mathbb{R}^d)$, for some $B_n > 0$, (ii) an operator $M_n : L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ satisfying the Lipschitz property $\| M_n f - M_n h \|_2 \le L_n \| f - h \|_2$, for all $f, h \in L^2(\mathbb{R}^d)$, with $M_n f = 0$ for $f = 0$, and (iii) a sub-sampling factor $R_n \ge 1$. Associated with $(\Psi_n, M_n, R_n)$, we define the operator

$$U_n[\lambda_n] f := R_n^{d/2}\, \big( M_n(f * g_{\lambda_n}) \big)(R_n \cdot) \tag{1}$$

and extend it to paths on index sets $q = (\lambda_1, \lambda_2, \dots, \lambda_n) \in \Lambda_1 \times \Lambda_2 \times \dots \times \Lambda_n =: \Lambda_1^n$, $n \in \mathbb{N}$, according to

$$U[q] f = U[(\lambda_1, \lambda_2, \dots, \lambda_n)] f := U_n[\lambda_n] \cdots U_2[\lambda_2]\, U_1[\lambda_1] f,$$

where for the empty path $e := \emptyset$ we set $\Lambda_1^0 := \{ e \}$ and $U[e] f := f$, for all $f \in L^2(\mathbb{R}^d)$.
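For intuition, the layer operator (1) and its composition along a path can be sketched in a discrete 1-D setting. The following is an illustrative toy implementation, not the construction of [5]: we use circular convolution via the FFT, the modulus as the non-linearity, and sub-sampling by a factor `R`; the names `layer_op` and `path_op` are ours.

```python
import numpy as np

def layer_op(f, g, nonlin=np.abs, R=2):
    """One layer: circular convolution with atom g (via FFT), pointwise
    non-linearity, then sub-sampling by R -- a discrete 1-D toy analogue
    of U_n[lambda_n] f = M_n(f * g_{lambda_n})(R_n .)."""
    conv = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real  # f * g (circular)
    return nonlin(conv)[::R]                                # M_n, then sub-sample

def path_op(f, atoms, R=2):
    """Compose layer operators along a path q = (lambda_1, ..., lambda_n);
    the empty path returns f unchanged (U[e] f = f)."""
    for g in atoms:  # each atom must match the current (sub-sampled) length
        f = layer_op(f, g, R=R)
    return f
```

With a delta atom `g = [1, 0, ..., 0]`, the convolution is the identity and `layer_op` reduces to a sub-sampled modulus, which makes the layer-by-layer structure easy to inspect.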

###### Remark 1.

The Bessel condition $\sum_{\lambda_n \in \Lambda_n} \| f * g_{\lambda_n} \|_2^2 \le B_n \| f \|_2^2$ on the atoms $\{ g_{\lambda_n} \}_{\lambda_n \in \Lambda_n}$ is equivalent to $\sum_{\lambda_n \in \Lambda_n} |\widehat{g_{\lambda_n}}(\omega)|^2 \le B_n$, for a.e. $\omega \in \mathbb{R}^d$ (see [5, Prop. 2]), and is hence easily satisfied even by learned filters [5, Remark 2]. An overview of collections of structured atoms (such as, e.g., Weyl-Heisenberg (Gabor) functions, wavelets, curvelets, shearlets, and ridgelets) and non-linearities widely used in the deep learning literature (e.g., hyperbolic tangent, shifted logistic sigmoid, rectified linear unit, and modulus function) is provided in [5, App. B-D].

For every $n \in \mathbb{N}$, we designate one of the atoms $\{ g_{\lambda_n} \}_{\lambda_n \in \Lambda_n}$ as the output-generating atom $\chi_{n-1} := g_{\lambda_n^*}$, $\lambda_n^* \in \Lambda_n$, of the $(n-1)$-th layer. The atoms $\{ g_{\lambda_n} \}_{\lambda_n \in \Lambda_n}$ are thus used across two consecutive layers in the sense of $\chi_{n-1} = g_{\lambda_n^*}$ generating the output in the $(n-1)$-th layer, and the remaining atoms $\{ g_{\lambda_n} \}_{\lambda_n \in \Lambda_n \setminus \{\lambda_n^*\}}$ propagating signals to the $n$-th layer according to (1), see Fig. 1. From now on, with slight abuse of notation, we write $\Lambda_n$ for $\Lambda_n \setminus \{\lambda_n^*\}$ as well.

The extracted features $\Phi_\Omega(f)$ of a signal $f \in L^2(\mathbb{R}^d)$ are defined as [5, Def. 3]

$$\Phi_\Omega(f) := \bigcup_{n=0}^{\infty} \big\{ (U[q] f) * \chi_n \big\}_{q \in \Lambda_1^n}, \tag{2}$$

where $(U[q] f) * \chi_n$, $q \in \Lambda_1^n$, is a feature generated in the $n$-th layer of the network, see Fig. 1. It is shown in [5, Thm. 2] that for all $f \in L^2(\mathbb{R}^d)$ the feature extractor $\Phi_\Omega$ is vertically translation-invariant in the sense of the layer depth $n$ determining the extent to which the features $(U[q] f) * \chi_n$, $q \in \Lambda_1^n$, are translation-invariant. Furthermore, under the condition

$$\max\{ B_n,\, B_n L_n^2 \} \le 1, \quad \text{for all } n \in \mathbb{N}, \tag{3}$$

referred to as weak admissibility condition in [5, Def. 4] and satisfied by a wide variety of module-sequences (see [5, Sec. 3]), the following result is established in [5, Thm. 1]: The feature extractor $\Phi_\Omega$ is stable on the space of $R$-band-limited functions $L^2_R(\mathbb{R}^d)$ w.r.t. deformations $(F_\tau f)(x) := f(x - \tau(x))$, i.e., there exists a universal constant $C > 0$ (that does not depend on $\Omega$) such that for all $f \in L^2_R(\mathbb{R}^d)$ and all (possibly non-linear) $\tau \in C^1(\mathbb{R}^d, \mathbb{R}^d)$ with $\| D\tau \|_\infty \le \frac{1}{2d}$, it holds that

$$|\!|\!|\, \Phi_\Omega(F_\tau f) - \Phi_\Omega(f) \,|\!|\!| \le C R\, \|\tau\|_\infty\, \|f\|_2. \tag{4}$$

Here, the feature space norm is defined as $|\!|\!|\, \Phi_\Omega(f) - \Phi_\Omega(h) \,|\!|\!|^2 := \sum_{n=0}^{\infty} \sum_{q \in \Lambda_1^n} \| (U[q] f) * \chi_n - (U[q] h) * \chi_n \|_2^2$.

For practical classification tasks, we can think of the deformation $F_\tau$ as follows. Let $f$ be a representative of a certain signal class, e.g., $f$ is an image of a handwritten digit (see Fig. 2, right). Then, $\{ F_\tau f \,:\, \| D\tau \|_\infty \le \frac{1}{2d} \}$ is a collection of images of the same handwritten digit, where each $F_\tau f$ may be generated, e.g., based on a different handwriting style. The bound $\| D\tau \|_\infty \le \frac{1}{2d}$ on the Jacobian matrix of $\tau$ imposes a quantitative limit on the amount of deformation tolerated, so that the bound (4) implicitly depends on $D\tau$. The stability bound (4) now guarantees that the features corresponding to the images in this set do not differ too much.
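The deformation $(F_\tau f)(x) = f(x - \tau(x))$ can be illustrated numerically for sampled 1-D signals. The sketch below is our own illustration; the function `deform` and the linear-interpolation discretization are assumptions made for the example, not part of the theory.

```python
import numpy as np

def deform(f_samples, tau, dx=1.0):
    """Numerical sketch of (F_tau f)(x) = f(x - tau(x)) for a sampled 1-D
    signal: evaluate f at the warped positions x - tau(x) via linear
    interpolation. `tau` maps an array of positions to displacements."""
    x = np.arange(len(f_samples)) * dx
    return np.interp(x - tau(x), x, f_samples)  # clamps at the boundary
```

A constant displacement $\tau(x) = e$ recovers a plain translation, while a spatially varying $\tau$ produces the handwriting-style warpings described above.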

## III Cartoon functions

The bound in (4) applies to the space of square-integrable $R$-band-limited functions. Many signals of practical significance (e.g., natural images) are, however, not band-limited (due to the presence of sharp and possibly curved edges, see Fig. 2) or exhibit large bandwidths. In the latter case, the deformation stability bound (4) becomes void as it depends linearly on the bandwidth $R$.

The goal of this paper is to take structural properties of natural images into account by considering the class of cartoon functions introduced in [14]. These functions satisfy mild decay properties and are piecewise continuously differentiable apart from curved discontinuities along Lipschitz-continuous hypersurfaces. Cartoon functions provide a good model for natural images (see Fig. 2, left) such as those in the Caltech-256 [16] and CIFAR-100 [17] data sets, for images of handwritten digits [15] (see Fig. 2, right), and for images of geometric objects of different shapes, sizes, and colors [18, 19].

We proceed to the formal definition of cartoon functions.

###### Definition 1.

The function $f \in L^2(\mathbb{R}^d)$ is referred to as a cartoon function if it can be written as $f = f_1 + \mathbb{1}_B f_2$, where $B \subseteq \mathbb{R}^d$ is a compact Lipschitz domain with boundary $\partial B$ of finite length, i.e., $\mathrm{vol}^{d-1}(\partial B) < \infty$, and $f_i \in L^2(\mathbb{R}^d) \cap C^1(\mathbb{R}^d, \mathbb{C})$, $i = 1, 2$, satisfies the decay condition

$$|\nabla f_i(x)| \le C \langle x \rangle^{-d}, \tag{5}$$

for some $C > 0$ (that does not depend on $x$ and $i$). Furthermore, we denote by

$$\mathcal{C}^K_{\mathrm{CART}} := \left\{ f_1 + \mathbb{1}_B f_2 \ \middle|\ \mathrm{vol}^{d-1}(\partial B) \le K,\ \| f_2 \|_\infty \le K,\ |\nabla f_i(x)| \le K \langle x \rangle^{-d},\ i = 1, 2 \right\}$$

the class of cartoon functions of maximal size $K > 0$.

We chose the term size to indicate the length $\mathrm{vol}^{d-1}(\partial B)$ of the boundary of the Lipschitz domain $B$. Furthermore, $\| f \|_2 < \infty$, for all $f \in \mathcal{C}^K_{\mathrm{CART}}$; this simply follows from the triangle inequality according to $\| f_1 + \mathbb{1}_B f_2 \|_2 \le \| f_1 \|_2 + \| \mathbb{1}_B f_2 \|_2 \le \| f_1 \|_2 + \mathrm{vol}(B)^{1/2} \| f_2 \|_\infty$, where in the last step we used $\| \mathbb{1}_B f_2 \|_2^2 = \int_B |f_2(x)|^2\, dx \le \mathrm{vol}(B)\, \| f_2 \|_\infty^2$. Finally, we note that our main results—presented in the next section—can easily be generalized to finite linear combinations of cartoon functions, but this is not done here for simplicity of exposition.
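As a concrete illustration, a sampled 1-D cartoon function and the triangle-inequality bound above can be checked numerically. The construction below is ours: the grid, the choice $B = [-1, 1]$, and the specific $f_1$, $f_2$ are assumptions made for the example.

```python
import numpy as np

# A sampled 1-D cartoon function f = f1 + 1_B * f2: a smooth, decaying
# component plus a component switched on inside B = [-1, 1], creating
# jumps at the boundary of B.
x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]
f1 = np.exp(-x**2)                      # smooth, rapidly decaying
f2 = 0.5 * np.exp(-x**2 / 4.0)          # smooth; active only inside B
ind_B = ((x >= -1.0) & (x <= 1.0)).astype(float)
f = f1 + ind_B * f2

l2 = lambda g: np.sqrt(np.sum(np.abs(g) ** 2) * dx)  # discrete L2 norm

# ||f||_2 <= ||f1||_2 + vol(B)^{1/2} * ||f2||_inf, with vol(B) = 2 here
bound = l2(f1) + np.sqrt(2.0) * np.max(np.abs(f2))
assert l2(f) <= bound
```

The discontinuity at $\partial B$ is exactly what prevents such signals from being band-limited, which is the motivation for the analysis in this section.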

## IV Main results

We start by reviewing the decoupling technique introduced in [5] to prove deformation stability bounds for band-limited functions. The proof of the deformation stability bound (4) for band-limited functions in [5] is based on two key ingredients. The first one is a contractivity property of $\Phi_\Omega$ (see [5, Prop. 4]), namely $|\!|\!|\, \Phi_\Omega(f) - \Phi_\Omega(h) \,|\!|\!| \le \| f - h \|_2$, for all $f, h \in L^2(\mathbb{R}^d)$. Contractivity guarantees that pairwise distances of input signals do not increase through feature extraction. The second ingredient is an upper bound on the deformation error $\| F_\tau f - f \|_2$ (see [5, Prop. 5]), specific to the signal class considered in [5], namely band-limited functions. Recognizing that the combination of these two ingredients yields a simple proof of deformation stability is interesting as it shows that whenever a signal class exhibits inherent stability w.r.t. deformations of the form $(F_\tau f)(x) = f(x - \tau(x))$, we automatically obtain deformation stability for the feature extractor $\Phi_\Omega$. The present paper employs this decoupling technique and establishes deformation stability for the class of cartoon functions by deriving an upper bound on the deformation error $\| F_\tau f - f \|_2$ for $f \in \mathcal{C}^K_{\mathrm{CART}}$.
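Under the assumptions of Proposition 1 below, the decoupling argument can be summarized in a single chain of inequalities (a sketch; $C_K$ denotes the constant from Proposition 1):

```latex
\underbrace{|\!|\!|\,\Phi_\Omega(F_\tau f) - \Phi_\Omega(f)\,|\!|\!|
  \;\le\; \|F_\tau f - f\|_2}_{\text{contractivity of } \Phi_\Omega}
\;\;\underbrace{\le\; C_K\,\|\tau\|_\infty^{1/2}}_{\text{stability of the signal class}}.
```

The left inequality depends only on the network (via the weak admissibility condition), the right one only on the signal class; this is why the two can be established independently.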

###### Proposition 1.

For every $K > 0$ there exists a constant $C_K > 0$ such that for all $f \in \mathcal{C}^K_{\mathrm{CART}}$ and all (possibly non-linear) $\tau \in C^1(\mathbb{R}^d, \mathbb{R}^d)$ with $\| \tau \|_\infty \le 1$, it holds that

$$\| F_\tau f - f \|_2 \le C_K\, \| \tau \|_\infty^{1/2}. \tag{6}$$

###### Proof.

See Appendix A. ∎

The Lipschitz exponent $\alpha = \frac{1}{2}$ on the right-hand side (RHS) of (6) determines the decay rate of the deformation error $\| F_\tau f - f \|_2$ as $\| \tau \|_\infty \to 0$. Clearly, a larger Lipschitz exponent results in the deformation error decaying faster as the deformation becomes smaller. The following simple example shows that the Lipschitz exponent $\alpha = \frac{1}{2}$ in (6) is best possible, i.e., it cannot be larger. Consider $d = 1$ and $\tau_e(x) := e$, for a fixed $e$ satisfying $0 < e \le 1$; the corresponding deformation $F_{\tau_e}$ amounts to a simple translation by $e$ with $\| \tau_e \|_\infty = e$. Let $f := \mathbb{1}_{[-1,1]}$. Then $f \in \mathcal{C}^K_{\mathrm{CART}}$ for some $K > 0$ (take $f_1 = 0$ and $f_2$ smooth and decaying with $f_2 = 1$ on $[-1,1]$), and $\| F_{\tau_e} f - f \|_2 = \sqrt{2e} = \sqrt{2}\, \| \tau_e \|_\infty^{1/2}$, so no bound of the form (6) can hold with an exponent $\alpha > \frac{1}{2}$.
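The square-root rate in this example is readily checked numerically (our discretized illustration of the translated-indicator example above):

```python
import numpy as np

# f = 1_{[-1,1]} translated by e: the L2 deformation error equals
# sqrt(2e), i.e., it decays like ||tau||^{1/2}, not like ||tau||.
x = np.linspace(-3.0, 3.0, 600001)
dx = x[1] - x[0]
f = ((x >= -1.0) & (x <= 1.0)).astype(float)
for e in (0.1, 0.01):
    f_shift = ((x - e >= -1.0) & (x - e <= 1.0)).astype(float)  # f(. - e)
    err = np.sqrt(np.sum((f_shift - f) ** 2) * dx)
    assert abs(err - np.sqrt(2.0 * e)) < 1e-2
```

Shrinking $e$ by a factor of $10$ shrinks the error only by a factor of $\sqrt{10}$, which is precisely the obstruction to any exponent larger than $\frac{1}{2}$.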

###### Remark 2.

It is interesting to note that in order to obtain bounds of the form $\| F_\tau f - f \|_2 \le C\, \| \tau \|_\infty^{\alpha}$, for some $C > 0$ (that does not depend on $f$, $\tau$) and some $\alpha > 0$, we need to impose non-trivial constraints on the set of functions the bound shall apply to. Indeed, consider, again, $d = 1$ and $\tau_e(x) = e$, for small $e > 0$. Let $f_e \in L^2(\mathbb{R})$ be a function of unit norm that has its energy concentrated in a small interval according to $\mathrm{supp}(f_e) \subseteq [0, e]$. Then, $F_{\tau_e} f_e$ and $f_e$ have disjoint support sets and hence $\| F_{\tau_e} f_e - f_e \|_2 = \sqrt{2}$, which does not decay with $\| \tau_e \|_\infty^{\alpha}$ for any $\alpha > 0$. More generally, the amount of deformation induced by a given function $\tau$ depends strongly on the signal (class) it is applied to. Concretely, a deformation $\tau$ that is small and concentrated around the origin will lead to a small bump around the origin only when applied to a low-pass function, whereas the function $f_e$ above will experience a significant deformation.
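The disjoint-support phenomenon in Remark 2 can also be verified numerically (our illustration; the specific rectangular bump is an assumption made for the example):

```python
import numpy as np

# A unit-energy function supported on [0, e), translated by e: the two
# supports are disjoint, so the L2 error is sqrt(2) however small e is --
# no bound C * ||tau||^alpha can hold uniformly over all of L2.
x = np.linspace(0.0, 1.0, 1000001)
dx = x[1] - x[0]
for e in (0.1, 0.01):
    h = 1.0 / np.sqrt(e)                                  # makes ||f_e||_2 = 1
    f_e = np.where((x >= 0.0) & (x < e), h, 0.0)
    f_e_shift = np.where((x >= e) & (x < 2 * e), h, 0.0)  # f_e(. - e)
    err = np.sqrt(np.sum((f_e_shift - f_e) ** 2) * dx)
    assert abs(err - np.sqrt(2.0)) < 1e-2
```

The error is $\sqrt{2}$ for every $e$, no matter how small, which is why uniform deformation stability requires restricting the signal class.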

We are now ready to state our main result.

###### Theorem 1.

Let $\Omega = \big( (\Psi_n, M_n, R_n) \big)_{n \in \mathbb{N}}$ be a module-sequence satisfying the weak admissibility condition (3). For every size $K > 0$, the feature extractor $\Phi_\Omega$ is stable on the space of cartoon functions $\mathcal{C}^K_{\mathrm{CART}}$ w.r.t. deformations $(F_\tau f)(x) = f(x - \tau(x))$, i.e., for every $K > 0$ there exists a constant $C_K > 0$ (that does not depend on $\Omega$) such that for all $f \in \mathcal{C}^K_{\mathrm{CART}}$ and all (possibly non-linear) $\tau \in C^1(\mathbb{R}^d, \mathbb{R}^d)$ with $\| \tau \|_\infty \le 1$ and $\| D\tau \|_\infty \le \frac{1}{2d}$, it holds that

$$|\!|\!|\, \Phi_\Omega(F_\tau f) - \Phi_\Omega(f) \,|\!|\!| \le C_K\, \| \tau \|_\infty^{1/2}. \tag{7}$$

###### Proof.

Combining the contractivity property $|\!|\!|\, \Phi_\Omega(F_\tau f) - \Phi_\Omega(f) \,|\!|\!| \le \| F_\tau f - f \|_2$ (which holds under the weak admissibility condition (3), see [5, Prop. 4]) with the deformation stability bound (6) of Proposition 1 yields (7). ∎

The strength of the deformation stability result in Theorem 1 derives from the fact that the only condition we need to impose on the underlying module-sequence $\Omega$ is the weak admissibility condition (3), which, as argued in [5, Sec. 3], can easily be met by normalizing the elements in $\Psi_n$, for all $n \in \mathbb{N}$, appropriately. We emphasize that this normalization does not have an impact on the constant $C_K$ in (7), which is shown in Appendix A to be independent of $\Omega$. The dependence of $C_K$ on the size $K$ does, however, reflect the intuition that the deformation stability bound should depend on the description complexity of the signal class. For band-limited signals, this dependence is exhibited by the RHS in (4) being linear in the bandwidth $R$. Finally, we note that the vertical translation invariance result [5, Thm. 2] applies to all $f \in L^2(\mathbb{R}^d)$ and, thanks to $\mathcal{C}^K_{\mathrm{CART}} \subseteq L^2(\mathbb{R}^d)$, carries over to cartoon functions.

###### Remark 3.

We note that, thanks to the decoupling technique underlying our arguments, the deformation stability bounds (4) and (7) are very general in the sense of applying to every contractive (linear or non-linear) mapping $\Phi$. Specifically, the identity mapping $\Phi = \mathrm{Id}$ also leads to deformation stability on the class of cartoon functions (and on the class of band-limited functions). This is interesting as it was recently demonstrated that employing the identity mapping as a so-called shortcut connection in a subset of layers of a very deep convolutional neural network yields state-of-the-art classification performance on the ImageNet dataset [22]. Our deformation stability result is hence general in the sense of applying to a broad class of network architectures used in practice.

For functions that do not exhibit discontinuities along Lipschitz-continuous hypersurfaces, but otherwise satisfy the decay condition (5), we can improve the decay rate of the deformation error from $\| \tau \|_\infty^{1/2}$ to $\| \tau \|_\infty$.

###### Corollary 1.

Let $\Omega = \big( (\Psi_n, M_n, R_n) \big)_{n \in \mathbb{N}}$ be a module-sequence satisfying the weak admissibility condition (3). For every $K > 0$, the feature extractor $\Phi_\Omega$ is stable on the space of functions $f \in L^2(\mathbb{R}^d) \cap C^1(\mathbb{R}^d, \mathbb{C})$ with $|\nabla f(x)| \le K \langle x \rangle^{-d}$ w.r.t. deformations $(F_\tau f)(x) = f(x - \tau(x))$, i.e., for every $K > 0$ there exists a constant $C_K > 0$ (that does not depend on $\Omega$) such that for all such $f$ and all (possibly non-linear) $\tau \in C^1(\mathbb{R}^d, \mathbb{R}^d)$ with $\| \tau \|_\infty \le 1$ and $\| D\tau \|_\infty \le \frac{1}{2d}$, it holds that

$$|\!|\!|\, \Phi_\Omega(F_\tau f) - \Phi_\Omega(f) \,|\!|\!| \le C_K\, \| \tau \|_\infty.$$

## A Proof of Proposition 1

The proof of (6) is based on judiciously combining deformation stability bounds for the components $f_1$, $f_2$ in $f = f_1 + \mathbb{1}_B f_2 \in \mathcal{C}^K_{\mathrm{CART}}$ and for the indicator function $\mathbb{1}_B$. The first bound, stated in Lemma 1 below, reads

$$\| F_\tau f_i - f_i \|_2 \le D\, \| \tau \|_\infty, \quad i = 1, 2, \tag{8}$$

and applies to functions satisfying the decay condition (11), with the constant $D$ (see (14)) not depending on $\tau$. The bound in (8) needs the assumption $\| \tau \|_\infty \le 1$. The second bound, stated in Lemma 2 below, is

$$\| F_\tau \mathbb{1}_B - \mathbb{1}_B \|_2^2 \le C\, \mathrm{vol}^{d-1}(\partial B)\, \| \tau \|_\infty. \tag{9}$$

We now show how (8) and (9) can be combined to establish (6). For $f = f_1 + \mathbb{1}_B f_2 \in \mathcal{C}^K_{\mathrm{CART}}$, we have

$$\begin{aligned} \| F_\tau f - f \|_2 &\le \| F_\tau f_1 - f_1 \|_2 + \| F_\tau(\mathbb{1}_B f_2) - \mathbb{1}_B f_2 \|_2 \\ &\le \| F_\tau f_1 - f_1 \|_2 + \| (F_\tau \mathbb{1}_B)(F_\tau f_2 - f_2) \|_2 + \| (F_\tau \mathbb{1}_B - \mathbb{1}_B) f_2 \|_2 \\ &\le \| F_\tau f_1 - f_1 \|_2 + \| F_\tau f_2 - f_2 \|_2 + \| f_2 \|_\infty\, \| F_\tau \mathbb{1}_B - \mathbb{1}_B \|_2, \end{aligned} \tag{10}$$

where in (10) we used $F_\tau(\mathbb{1}_B f_2) = (F_\tau \mathbb{1}_B)(F_\tau f_2)$ and $\| F_\tau \mathbb{1}_B \|_\infty \le 1$. With the upper bounds (8) and (9), invoking properties of the class of cartoon functions (namely, (i) $\| f_2 \|_\infty \le K$, (ii) $f_1$, $f_2$ satisfy (11) and thus (8) with a constant $D = D_K$ depending on $K$ only, and (iii) $\mathrm{vol}^{d-1}(\partial B) \le K$), this yields

$$\| F_\tau f - f \|_2 \le 2 D_K\, \| \tau \|_\infty + K\, (C K)^{1/2}\, \| \tau \|_\infty^{1/2} \le C_K\, \| \tau \|_\infty^{1/2},$$

where the last step uses $\| \tau \|_\infty \le \| \tau \|_\infty^{1/2}$, owing to $\| \tau \|_\infty \le 1$, which completes the proof of (6).

###### Lemma 1.

Let $f \in L^2(\mathbb{R}^d) \cap C^1(\mathbb{R}^d, \mathbb{C})$ be such that

$$|\nabla f(x)| \le C \langle x \rangle^{-d}, \tag{11}$$

for some constant $C > 0$, and let $\tau \in C^1(\mathbb{R}^d, \mathbb{R}^d)$ with $\| \tau \|_\infty \le 1$. Then,

$$\| F_\tau f - f \|_2 \le D\, \| \tau \|_\infty, \tag{12}$$

for a constant $D > 0$ (specified in (14)) that depends on $C$ and $d$ only, and in particular not on $\tau$.

###### Proof.

We first upper-bound the integrand in

$$\| F_\tau f - f \|_2^2 = \int_{\mathbb{R}^d} |f(x - \tau(x)) - f(x)|^2\, dx.$$

Owing to the mean value theorem [23, Thm. 3.7.5], we have

$$|f(x - \tau(x)) - f(x)| \le \| \tau \|_\infty \sup_{y \in B_{\| \tau \|_\infty}(x)} |\nabla f(y)| \le C\, \| \tau \|_\infty \sup_{y \in B_{\| \tau \|_\infty}(x)} \langle y \rangle^{-d},$$

where $B_r(x) := \{ y \in \mathbb{R}^d : |y - x| \le r \}$, for $r > 0$, and the last inequality follows by assumption (11). The idea is now to split the integral into integrals over the sets $A := \{ x \in \mathbb{R}^d : |x| \le 2 \| \tau \|_\infty \}$ and $A^c := \mathbb{R}^d \setminus A$. For $x \in A$, the monotonicity of the function $r \mapsto (1 + r^2)^{-d/2}$ implies $\sup_{y \in B_{\| \tau \|_\infty}(x)} \langle y \rangle^{-d} \le 1$, and for $x \in A^c$, we have $|y| \ge |x| - \| \tau \|_\infty \ge |x|/2$, for all $y \in B_{\| \tau \|_\infty}(x)$, which together with the monotonicity of $r \mapsto (1 + r^2)^{-d/2}$ yields $\sup_{y \in B_{\| \tau \|_\infty}(x)} \langle y \rangle^{-d} \le \langle x/2 \rangle^{-d}$. Putting things together, we hence get

$$\| F_\tau f - f \|_2^2 \le C^2 \| \tau \|_\infty^2 \Big( \mathrm{vol}(A) + \int_{A^c} \langle x/2 \rangle^{-2d}\, dx \Big) \tag{13}$$

$$\le C^2 \| \tau \|_\infty^2 \Big( \mathrm{vol}(B_2(0)) + 2^d \int_{\mathbb{R}^d} \langle u \rangle^{-2d}\, du \Big) =: D^2\, \| \tau \|_\infty^2, \tag{14}$$

where in (14) we used the change of variables $u = x/2$, together with

$$\mathrm{vol}(A) = (2 \| \tau \|_\infty)^d\, \mathrm{vol}(B_1(0)) \le 2^d\, \mathrm{vol}(B_1(0)) = \mathrm{vol}(B_2(0)). \tag{15}$$

The inequality in (15) follows from $\| \tau \|_\infty \le 1$, which is by assumption. Since $\int_{\mathbb{R}^d} \langle u \rangle^{-2d}\, du < \infty$, owing to $2d > d$ (see, e.g., [24, Sec. 1]), and, obviously, $\mathrm{vol}(B_2(0)) < \infty$, it follows that $D < \infty$, which completes the proof. ∎

We continue with a deformation stability result for indicator functions $\mathbb{1}_B$.

###### Lemma 2.

Let $B \subseteq \mathbb{R}^d$ be a compact Lipschitz domain with boundary $\partial B$ of finite length, i.e., $\mathrm{vol}^{d-1}(\partial B) < \infty$. Then, there exists a constant $C > 0$ such that for all $\tau \in C^1(\mathbb{R}^d, \mathbb{R}^d)$ with $\| \tau \|_\infty \le 1$,

$$\| F_\tau \mathbb{1}_B - \mathbb{1}_B \|_2^2 \le C\, \mathrm{vol}^{d-1}(\partial B)\, \| \tau \|_\infty.$$

###### Proof.

In order to upper-bound $\| F_\tau \mathbb{1}_B - \mathbb{1}_B \|_2^2 = \int_{\mathbb{R}^d} |\mathbb{1}_B(x - \tau(x)) - \mathbb{1}_B(x)|^2\, dx$, we first note that the integrand satisfies $|\mathbb{1}_B(x - \tau(x)) - \mathbb{1}_B(x)| \le \mathbb{1}_{M_\tau}(x)$, for $x \in \mathbb{R}^d$, where $M_\tau := \partial B + B_{\| \tau \|_\infty}(0)$ and $B_r(0) := \{ y \in \mathbb{R}^d : |y| \le r \}$, for $r > 0$; indeed, $\mathbb{1}_B(x - \tau(x)) \ne \mathbb{1}_B(x)$ implies that the line segment connecting $x$ and $x - \tau(x)$ crosses the boundary $\partial B$, and hence $x$ lies within distance $\| \tau \|_\infty$ of $\partial B$. Since $M_\tau$ is a tubular neighborhood of width $\| \tau \|_\infty$ around the boundary $\partial B$ of the Lipschitz domain $B$, its volume satisfies $\mathrm{vol}(M_\tau) \le C\, \mathrm{vol}^{d-1}(\partial B)\, \| \tau \|_\infty$, and we have $\| F_\tau \mathbb{1}_B - \mathbb{1}_B \|_2^2 \le \mathrm{vol}(M_\tau) \le C\, \mathrm{vol}^{d-1}(\partial B)\, \| \tau \|_\infty$, which completes the proof.

∎
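The tube argument can be illustrated numerically in $d = 2$. The Monte-Carlo sketch below is our own illustration for $B$ the unit disk and a pure translation $\tau(x) = (e, 0)$; the sampling setup is an assumption made for the example.

```python
import numpy as np

# Points where 1_B(x - tau) != 1_B(x) lie within distance e of the circle
# dB, i.e., in an annulus of area pi((1+e)^2 - (1-e)^2) = 4*pi*e: the
# symmetric-difference area is linear in ||tau||_inf = e, matching (9).
rng = np.random.default_rng(1)
pts = rng.uniform(-2.0, 2.0, size=(2_000_000, 2))  # Monte Carlo on [-2,2]^2
box_area = 16.0
for e in (0.2, 0.05):
    in_B = pts[:, 0] ** 2 + pts[:, 1] ** 2 <= 1.0
    in_B_shifted = (pts[:, 0] - e) ** 2 + pts[:, 1] ** 2 <= 1.0
    sym_diff_area = box_area * np.mean(in_B != in_B_shifted)
    assert sym_diff_area <= 4.0 * np.pi * e  # tube (annulus) bound
```

Since $\| F_\tau \mathbb{1}_B - \mathbb{1}_B \|_2^2$ equals the area of this symmetric difference, its linear decay in $e$ is exactly the $\| \tau \|_\infty^{1/2}$ rate for the $L^2$ norm itself.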

## References

- [1] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, 2013.
- [2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- [3] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015.
- [4] S. Mallat, “Group invariant scattering,” Comm. Pure Appl. Math., vol. 65, no. 10, pp. 1331–1398, 2012.
- [5] T. Wiatowski and H. Bölcskei, “A mathematical theory of deep convolutional neural networks for feature extraction,” arXiv:1512.06293, 2015.
- [6] K. Gröchenig, Foundations of time-frequency analysis. Birkhäuser, 2001.
- [7] I. Daubechies, Ten lectures on wavelets. Society for Industrial and Applied Mathematics, 1992.
- [8] E. J. Candès and D. L. Donoho, “Continuous curvelet transform: II. Discretization and frames,” Appl. Comput. Harmon. Anal., vol. 19, no. 2, pp. 198–222, 2005.
- [9] G. Kutyniok and D. Labate, Eds., Shearlets: Multiscale analysis for multivariate data. Birkhäuser, 2012.
- [10] E. J. Candès, “Ridgelets: Theory and applications,” Ph.D. dissertation, Stanford University, 1998.
- [11] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multi-stage architecture for object recognition?” in Proc. of IEEE International Conference on Computer Vision (ICCV), 2009, pp. 2146–2153.
- [12] F. J. Huang and Y. LeCun, “Large-scale learning with SVM and convolutional nets for generic object categorization,” in Proc. of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2006, pp. 284–291.
- [13] M. A. Ranzato, F. J. Huang, Y. Boureau, and Y. LeCun, “Unsupervised learning of invariant feature hierarchies with applications to object recognition,” in Proc. of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.
- [14] D. L. Donoho, “Sparse components of images and optimal atomic decompositions,” Constructive Approximation, vol. 17, no. 3, pp. 353–382, 2001.
- [15] Y. LeCun and C. Cortes, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/, 1998.
- [16] G. Griffin, A. Holub, and P. Perona, “Caltech-256 object category dataset,” http://authors.library.caltech.edu/7694/, 2007.
- [17] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Master’s thesis, University of Toronto, 2009.
- [18] “The baby AI school dataset,” http://www.iro.umontreal.ca/%7Elisa/twiki/bin/view.cgi/Public/BabyAISchool, 2007.
- [19] “The rectangles dataset,” http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/RectanglesData, 2007.
- [20] B. Dacorogna, Introduction to the calculus of variations. Imperial College Press, 2004.
- [21] G. Kutyniok and D. Labate, “Introduction to shearlets,” in Shearlets: Multiscale analysis for multivariate data, G. Kutyniok and D. Labate, Eds. Birkhäuser, 2012, pp. 1–38.
- [22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
- [23] M. Comenetz, Calculus: The elements. World Scientific, 2002.
- [24] L. Grafakos, Classical Fourier analysis, 2nd ed. Springer, 2008.