A Mathematical Theory of Deep Convolutional Neural Networks for Feature Extraction
Deep convolutional neural networks have led to breakthrough results in numerous practical machine learning tasks such as classification of images in the ImageNet data set, control-policy-learning to play Atari games or the board game Go, and image captioning. Many of these applications first perform feature extraction and then feed the results thereof into a trainable classifier. The mathematical analysis of deep convolutional neural networks for feature extraction was initiated by Mallat, 2012. Specifically, Mallat considered so-called scattering networks based on a wavelet transform followed by the modulus non-linearity in each network layer, and proved translation invariance (asymptotically in the wavelet scale parameter) and deformation stability of the corresponding feature extractor. This paper complements Mallat’s results by developing a theory that encompasses general convolutional transforms, or in more technical parlance, general semi-discrete frames (including Weyl-Heisenberg filters, curvelets, shearlets, ridgelets, wavelets, and learned filters), general Lipschitz-continuous non-linearities (e.g., rectified linear units, shifted logistic sigmoids, hyperbolic tangents, and modulus functions), and general Lipschitz-continuous pooling operators emulating, e.g., sub-sampling and averaging. In addition, all of these elements can be different in different network layers. For the resulting feature extractor we prove a translation invariance result of vertical nature in the sense of the features becoming progressively more translation-invariant with increasing network depth, and we establish deformation sensitivity bounds that apply to signal classes such as, e.g., band-limited functions, cartoon functions, and Lipschitz functions.
A central task in machine learning is feature extraction as, e.g., in the context of handwritten digit classification. The features to be extracted in this case correspond, for example, to the edges of the digits. The idea behind feature extraction is that feeding characteristic features of the signals, rather than the signals themselves, to a trainable classifier (such as, e.g., a support vector machine (SVM)) improves classification performance. Specifically, non-linear feature extractors (obtained, e.g., through the use of a so-called kernel in the context of SVMs) can map input signal space dichotomies that are not linearly separable into linearly separable feature space dichotomies. Sticking to the example of handwritten digit classification, we would, moreover, want the feature extractor to be invariant to the digits’ spatial location within the image, which leads to the requirement of translation invariance. In addition, it is desirable that the feature extractor be robust with respect to (w.r.t.) handwriting styles. This can be accomplished by demanding limited sensitivity of the features to certain non-linear deformations of the signals to be classified.
Spectacular success in practical machine learning tasks has been reported for feature extractors generated by so-called deep convolutional neural networks (DCNNs). These networks are composed of multiple layers, each of which computes convolutional transforms, followed by non-linearities and pooling operators.
The mathematical analysis of feature extractors generated by DCNNs was pioneered by Mallat (2012). Mallat’s theory applies to so-called scattering networks, where signals are propagated through layers that compute a semi-discrete wavelet transform (i.e., convolutions with filters that are obtained from a mother wavelet through scaling and rotation operations), followed by the modulus non-linearity, without subsequent pooling. The resulting feature extractor is shown to be translation-invariant (asymptotically in the scale parameter of the underlying wavelet transform) and stable w.r.t. certain non-linear deformations. Moreover, Mallat’s scattering networks lead to state-of-the-art results in various classification tasks.
Contributions. DCNN-based feature extractors that were found to work well in practice employ a wide range of i) filters, namely pre-specified structured filters such as wavelets, pre-specified unstructured filters such as random filters, and filters that are learned in a supervised or an unsupervised fashion, ii) non-linearities beyond the modulus function, namely hyperbolic tangents, rectified linear units, and logistic sigmoids, and iii) pooling operators, namely sub-sampling, average pooling, and max-pooling. In addition, the filters, non-linearities, and pooling operators can be different in different network layers. The goal of this paper is to develop a mathematical theory that encompasses all these elements (apart from max-pooling) in full generality.
Convolutional transforms as employed in DCNNs can be interpreted as semi-discrete signal transforms (i.e., convolutional transforms with filters that are countably parametrized). Corresponding prominent representatives are curvelet and shearlet transforms, both of which are known to be highly effective in extracting features characterized by curved edges in images. Our theory allows for general semi-discrete signal transforms, general Lipschitz-continuous non-linearities (e.g., rectified linear units, shifted logistic sigmoids, hyperbolic tangents, and modulus functions), and incorporates continuous-time Lipschitz pooling operators that emulate discrete-time sub-sampling and averaging. Finally, different network layers may be equipped with different convolutional transforms, different (Lipschitz-continuous) non-linearities, and different (Lipschitz-continuous) pooling operators.
Regarding translation invariance, it has been argued in the deep learning literature that in practice invariance of the features is crucially governed by network depth and by the presence of pooling operators (such as, e.g., sub-sampling, average-pooling, or max-pooling). We show that the general feature extractor considered in this paper, indeed, exhibits such a vertical translation invariance and that pooling plays a crucial role in achieving it. Specifically, we prove that the depth of the network determines the extent to which the extracted features are translation-invariant. We also show that pooling is necessary to obtain vertical translation invariance as otherwise the features remain fully translation-covariant irrespective of network depth. We furthermore establish a deformation sensitivity bound valid for signal classes such as, e.g., band-limited functions, cartoon functions, and Lipschitz functions. This bound shows that small non-linear deformations of the input signal lead to small changes in the corresponding feature vector.
In terms of mathematical techniques, we draw heavily from continuous frame theory . We develop a proof machinery that is completely detached from the structures
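Notation. The complex conjugate of $z \in \mathbb{C}$ is denoted by $\overline{z}$. We write $\mathrm{Re}(z)$ for the real, and $\mathrm{Im}(z)$ for the imaginary part of $z \in \mathbb{C}$. The Euclidean inner product of $x, y \in \mathbb{C}^d$ is $\langle x, y \rangle := \sum_{i=1}^{d} x_i \overline{y_i}$, with associated norm $|x| := \sqrt{\langle x, x \rangle}$. We denote the identity matrix by $E \in \mathbb{R}^{d \times d}$. For the matrix $M \in \mathbb{R}^{d \times d}$, $M_{i,j}$ designates the entry in its $i$-th row and $j$-th column, and for a tensor $T \in \mathbb{R}^{d \times d \times d}$, $T_{i,j,k}$ refers to its $(i,j,k)$-th component. The supremum norm of the matrix $M$ is defined as $|M|_\infty := \sup_{i,j} |M_{i,j}|$, and the supremum norm of the tensor $T$ is $|T|_\infty := \sup_{i,j,k} |T_{i,j,k}|$. We write $B_r(x) \subseteq \mathbb{R}^d$ for the open ball of radius $r > 0$ centered at $x \in \mathbb{R}^d$. $O(d)$ stands for the orthogonal group of dimension $d \in \mathbb{N}$, and $SO(d)$ for the special orthogonal group.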
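For a Lebesgue-measurable function $f: \mathbb{R}^d \to \mathbb{C}$, we write $\int_{\mathbb{R}^d} f(x)\,\mathrm{d}x$ for the integral of $f$ w.r.t. Lebesgue measure $\mu_L$. For $p \in [1, \infty)$, $L^p(\mathbb{R}^d)$ stands for the space of Lebesgue-measurable functions $f: \mathbb{R}^d \to \mathbb{C}$ satisfying $\|f\|_p := \big(\int_{\mathbb{R}^d} |f(x)|^p\,\mathrm{d}x\big)^{1/p} < \infty$. $L^\infty(\mathbb{R}^d)$ denotes the space of Lebesgue-measurable functions $f: \mathbb{R}^d \to \mathbb{C}$ such that $\|f\|_\infty := \inf\{\alpha > 0 \mid |f(x)| \le \alpha \text{ for a.e. } x \in \mathbb{R}^d\} < \infty$. For $f, g \in L^2(\mathbb{R}^d)$ we set $\langle f, g \rangle := \int_{\mathbb{R}^d} f(x)\overline{g(x)}\,\mathrm{d}x$. For $R > 0$, the space of $R$-band-limited functions is denoted as $L^2_R(\mathbb{R}^d) := \{f \in L^2(\mathbb{R}^d) \mid \mathrm{supp}(\widehat{f}) \subseteq B_R(0)\}$. For a countable set $Q$, $(L^2(\mathbb{R}^d))^Q$ stands for the space of sets $s := \{s_q\}_{q \in Q}$, $s_q \in L^2(\mathbb{R}^d)$, for all $q \in Q$, satisfying $|||s||| := \big(\sum_{q \in Q} \|s_q\|_2^2\big)^{1/2} < \infty$.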
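$\mathrm{Id}: L^p(\mathbb{R}^d) \to L^p(\mathbb{R}^d)$ denotes the identity operator on $L^p(\mathbb{R}^d)$. The tensor product of functions $f, g: \mathbb{R}^d \to \mathbb{C}$ is $(f \otimes g)(x, y) := f(x)g(y)$, $(x, y) \in \mathbb{R}^d \times \mathbb{R}^d$. The operator norm of the bounded linear operator $A: L^p(\mathbb{R}^d) \to L^q(\mathbb{R}^d)$ is defined as $\|A\|_{p,q} := \sup_{\|f\|_p = 1} \|Af\|_q$. We denote the Fourier transform of $f \in L^1(\mathbb{R}^d)$ by $\widehat{f}(\omega) := \int_{\mathbb{R}^d} f(x) e^{-2\pi i \langle x, \omega \rangle}\,\mathrm{d}x$ and extend it in the usual way to $L^2(\mathbb{R}^d)$. The convolution of $f \in L^2(\mathbb{R}^d)$ and $g \in L^1(\mathbb{R}^d)$ is $(f * g)(y) := \int_{\mathbb{R}^d} f(x) g(y - x)\,\mathrm{d}x$. We write $(T_t f)(x) := f(x - t)$, $t \in \mathbb{R}^d$, for the translation operator, and $(M_\omega f)(x) := e^{2\pi i \langle x, \omega \rangle} f(x)$, $\omega \in \mathbb{R}^d$, for the modulation operator. Involution is defined by $(If)(x) := \overline{f(-x)}$.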
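A multi-index $\alpha = (\alpha_1, \dots, \alpha_d) \in \mathbb{N}_0^d$ is an ordered $d$-tuple of non-negative integers $\alpha_i \in \mathbb{N}_0$. For a multi-index $\alpha \in \mathbb{N}_0^d$, $D^\alpha$ denotes the differential operator $D^\alpha := (\partial/\partial x_1)^{\alpha_1} \cdots (\partial/\partial x_d)^{\alpha_d}$, with order $|\alpha| := \sum_{i=1}^{d} \alpha_i$. If $|\alpha| = 0$, $D^\alpha f := f$, for $f: \mathbb{R}^d \to \mathbb{C}$. The space of functions $f: \mathbb{R}^d \to \mathbb{C}$ whose derivatives $D^\alpha f$ of order at most $N \in \mathbb{N}_0$ are continuous is designated by $C^N(\mathbb{R}^d, \mathbb{C})$, and the space of infinitely differentiable functions is $C^\infty(\mathbb{R}^d, \mathbb{C})$. $S(\mathbb{R}^d, \mathbb{C})$ stands for the Schwartz space, i.e., the space of functions $f \in C^\infty(\mathbb{R}^d, \mathbb{C})$ whose derivatives $D^\alpha f$ along with the function itself are rapidly decaying in the sense of $\sup_{|\alpha| \le N} \sup_{x \in \mathbb{R}^d} (1 + |x|^2)^N |(D^\alpha f)(x)| < \infty$, for all $N \in \mathbb{N}_0$. We denote the gradient of a function $f: \mathbb{R}^d \to \mathbb{C}$ as $\nabla f$. The space of continuous mappings $v: \mathbb{R}^p \to \mathbb{R}^q$ is $C(\mathbb{R}^p, \mathbb{R}^q)$, and for $k, p, q \in \mathbb{N}$, the space of $k$-times continuously differentiable mappings $v: \mathbb{R}^p \to \mathbb{R}^q$ is written as $C^k(\mathbb{R}^p, \mathbb{R}^q)$. For a mapping $v: \mathbb{R}^d \to \mathbb{R}^d$, we let $Dv$ be its Jacobian matrix, and $D^2 v$ its Jacobian tensor, with associated norms $\|v\|_\infty := \sup_{x \in \mathbb{R}^d} |v(x)|$, $\|Dv\|_\infty := \sup_{x \in \mathbb{R}^d} |(Dv)(x)|_\infty$, and $\|D^2 v\|_\infty := \sup_{x \in \mathbb{R}^d} |(D^2 v)(x)|_\infty$.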
We set the stage by reviewing scattering networks as introduced in , the basis of which is a multi-layer architecture that involves a wavelet transform followed by the modulus non-linearity, without subsequent pooling. Specifically,  defines the feature vector of the signal as the set
where , and
for all , with
Here, the index set contains pairs of scales and directions (in fact, is the index of the direction described by the rotation matrix ), and
where are directional wavelets  with (complex-valued) mother wavelet . The , , are elements of a finite rotation group (if is even, is a subgroup of ; if is odd, is a subgroup of ). The index is associated with the low-pass filter , and corresponds to the coarsest scale resolved by the directional wavelets .
The family of functions is taken to form a semi-discrete Parseval frame
for  and hence satisfies
for all , where are the underlying frame coefficients. Note that for given , we actually have a continuum of frame coefficients as the translation parameter is left unsampled. We refer to Figure 1 for a frequency-domain illustration of a semi-discrete directional wavelet frame. In Appendix Section 6, we give a brief review of the general theory of semi-discrete frames, and in Appendix Sections 7 and 8 we collect structured example frames in 1-D and 2-D, respectively.
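The Parseval property can be illustrated numerically in a discrete, periodic setting. The following is a minimal sketch, not the construction used in the paper: it assumes circular convolution on a finite grid and a hypothetical bank of Gaussian-shaped filters (`parseval_filter_bank` is an illustrative name) that are rescaled per frequency bin so that their squared magnitudes sum to one. Under that assumption the energy of the frame coefficients equals the energy of the signal.

```python
import numpy as np

def parseval_filter_bank(n_samples, n_filters):
    """DFT-domain filters g_hat[j, k] with sum_j |g_hat[j, k]|^2 = 1 for every bin k."""
    k = np.arange(n_samples)
    centers = np.linspace(0, n_samples, n_filters, endpoint=False)
    # Circular distance between every frequency bin and every band center.
    dist = np.abs(k[None, :] - centers[:, None])
    dist = np.minimum(dist, n_samples - dist)
    raw = np.exp(-0.5 * (dist / (n_samples / n_filters)) ** 2)  # Gaussian bumps
    # Dividing by the per-bin l2 norm enforces the Parseval condition.
    return raw / np.sqrt(np.sum(raw ** 2, axis=0, keepdims=True))

def frame_coefficients(f, g_hat):
    """Circular convolutions f * g_j for all j, computed through the FFT."""
    return np.fft.ifft(np.fft.fft(f)[None, :] * g_hat, axis=1)

rng = np.random.default_rng(0)
f = rng.standard_normal(256)
g_hat = parseval_filter_bank(n_samples=256, n_filters=6)
coeffs = frame_coefficients(f, g_hat)

signal_energy = float(np.sum(np.abs(f) ** 2))
coeff_energy = float(np.sum(np.abs(coeffs) ** 2))
print(signal_energy, coeff_energy)  # the two energies agree up to rounding
```

Note that the translation parameter is kept at full resolution here (every coefficient sequence has as many samples as the signal), which mirrors the unsampled translation parameter of a semi-discrete frame.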
The architecture corresponding to the feature extractor in , illustrated in Figure 2, is known as scattering network , and employs the frame and the modulus non-linearity in every network layer, but does not include pooling. For given , the set in corresponds to the features of the function generated in the -th network layer, see Figure 2.
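The cascade described above can be sketched in a few lines. This is an illustrative discrete analogue under stated assumptions, not Mallat's implementation: the filters are a hypothetical Gaussian-shaped Parseval bank in the DFT domain (the first one, centered at frequency zero, plays the role of the low-pass output filter), convolutions are circular, and the function names are made up for this example.

```python
import numpy as np

def parseval_filter_bank(n_samples, n_filters):
    """Hypothetical DFT-domain filter bank with sum_j |g_hat[j, k]|^2 = 1."""
    k = np.arange(n_samples)
    centers = np.linspace(0, n_samples, n_filters, endpoint=False)
    dist = np.abs(k[None, :] - centers[:, None])
    dist = np.minimum(dist, n_samples - dist)
    raw = np.exp(-0.5 * (dist / (n_samples / n_filters)) ** 2)
    return raw / np.sqrt(np.sum(raw ** 2, axis=0, keepdims=True))

def scattering_features(f, g_hat, depth):
    """Return a list whose n-th entry holds the layer-n features, one array per path."""
    chi_hat, psi_hat = g_hat[0], g_hat[1:]   # low-pass output filter, band-pass filters

    def conv(u, h_hat):
        return np.fft.ifft(np.fft.fft(u) * h_hat)

    signals = [np.asarray(f, dtype=complex)]  # the root of the network is f itself
    features = []
    for layer in range(depth + 1):
        # Output of this layer: low-pass filtering of every propagated signal.
        features.append([conv(u, chi_hat) for u in signals])
        if layer < depth:
            # Propagation: band-pass filtering followed by the modulus, no pooling.
            signals = [np.abs(conv(u, h_hat)) for u in signals for h_hat in psi_hat]
    return features

rng = np.random.default_rng(0)
f = rng.standard_normal(128)
features = scattering_features(f, parseval_filter_bank(128, 4), depth=2)
paths_per_layer = [len(layer) for layer in features]
feature_energy = sum(float(np.sum(np.abs(v) ** 2)) for layer in features for v in layer)
print(paths_per_layer, feature_energy, float(np.sum(f ** 2)))
```

With three band-pass filters the number of paths grows by a factor of three per layer, which is the tree structure the figure refers to.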
It is shown in  that the feature extractor is translation-invariant in the sense of
for all and . This invariance result is asymptotic in the scale parameter and does not depend on the network depth, i.e., it guarantees full translation invariance in every network layer. Furthermore,  establishes that is stable w.r.t. deformations of the form More formally, for the function space defined in , it is shown in  that there exists a constant such that for all , and with
Note that this upper bound goes to infinity as translation invariance through is induced. In practice signal classification based on scattering networks is performed as follows. First, the function and the wavelet frame atoms are discretized to finite-dimensional vectors. The resulting scattering network then computes the finite-dimensional feature vector , whose dimension is typically reduced through an orthogonal least squares step, and then feeds the result into a trainable classifier such as, e.g., an SVM. State-of-the-art results for scattering networks were reported for various classification tasks such as handwritten digit recognition, texture discrimination, and musical genre classification.
3 General deep convolutional feature extractors
As already mentioned, scattering networks follow the architecture of DCNNs  in the sense of cascading convolutions (with atoms of the wavelet frame ) and non-linearities, namely the modulus function, but without pooling. General DCNNs as studied in the literature exhibit a number of additional features:
the filters, non-linearities, and pooling operators are allowed to be different in different network layers .
As already mentioned, the purpose of this paper is to develop a mathematical theory of DCNNs for feature extraction that encompasses all of the aspects above (apart from max-pooling) with the proviso that the pooling operators we analyze are continuous-time emulations of discrete-time pooling operators. Formally, compared to scattering networks, in the -th network layer, we replace the wavelet-modulus operation by a convolution with the atoms of a general semi-discrete frame for with countable index set (see Appendix Section 6 for a brief review of the theory of semi-discrete frames), followed by a non-linearity that satisfies the Lipschitz property , for all , and for . The output of this non-linearity, , is then pooled according to
where is the pooling factor and satisfies the Lipschitz property , for all , and for . We next comment on the individual elements in our network architecture in more detail. The frame atoms are arbitrary and can, therefore, also be taken to be structured, e.g., Weyl-Heisenberg functions, curvelets, shearlets, ridgelets, or wavelets as considered in  (where the atoms are obtained from a mother wavelet through scaling and rotation operations, see Section 2). The corresponding semi-discrete signal transforms
and amounts to simply retaining every -th sample of . The discrete-time Fourier transform of is given by a summation over translated and dilated copies of according to 
The translated copies of in are a consequence of the -periodicity of the discrete-time Fourier transform. We therefore emulate the discrete-time sub-sampling operation in continuous time through the dilation operation
which in the frequency domain amounts to dilation according to . The scaling by in ensures unitarity of the continuous-time sub-sampling operation. The overall operation in fits into our general definition of pooling as it can be recovered from simply by taking to equal the identity mapping (which is, of course, Lipschitz-continuous with Lipschitz constant and satisfies for ). Next, we consider average pooling. In discrete time average pooling is defined by
for the (typically compactly supported) “averaging kernel” and the averaging factor . Taking to be a box function of length amounts to computing local averages of consecutive samples. Weighted averages are obtained by identifying the desired weights with the averaging kernel . The operation can be emulated in continuous time according to
with the averaging window . We note that can be recovered from by taking , , and noting that convolution with is Lipschitz-continuous with Lipschitz constant (thanks to Young’s inequality ) and trivially satisfies for . In the remainder of the paper, we refer to the operation in as Lipschitz pooling through dilation to indicate that essentially amounts to the application of a Lipschitz-continuous mapping followed by a continuous-time dilation. Note, however, that the operation in will not be unitary in general.
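The two discrete-time pooling operations discussed above can be checked numerically. The sketch below is an illustration under simplifying assumptions (a finite periodic signal, circular convolution, and a box kernel of length equal to the pooling factor); the variable names are chosen for this example only. It verifies that sub-sampling produces a sum of translated spectral copies, and that averaging with a kernel of unit $\ell^1$ norm does not increase the $\ell^2$ norm, as Young's inequality predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
N, S = 64, 4                      # signal length and pooling factor (S divides N)
x = rng.standard_normal(N)

# Sub-sampling: keep every S-th sample.
sub = x[::S]
# Its DFT is the average of S translated copies of the DFT of x (aliasing).
aliased = np.fft.fft(x).reshape(S, N // S).sum(axis=0) / S

# Average pooling: circular convolution with a box kernel, then sub-sampling.
phi = np.zeros(N)
phi[:S] = 1.0 / S                 # box of length S with unit l1 norm
averaged = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(phi)))
avg_pooled = averaged[::S]

# Young's inequality: ||x * phi||_2 <= ||phi||_1 ||x||_2.
lhs = float(np.linalg.norm(averaged))
rhs = float(np.linalg.norm(phi, 1) * np.linalg.norm(x))
print(np.allclose(np.fft.fft(sub), aliased), lhs <= rhs)
```

Each entry of `avg_pooled` is the mean of `S` consecutive samples of `x`, so choosing other kernel weights in `phi` would yield weighted averages in the same way.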
We next state definitions and collect preliminary results needed for the analysis of the general DCNN feature extractor considered. The basic building blocks of this network are the triplets associated with individual network layers and referred to as modules.
The following definition introduces the concept of paths on index sets, which will prove useful in formalizing the feature extraction network. The idea for this formalism is due to .
The operator is well-defined, i.e., , for all , thanks to
For the inequality in we used the Lipschitz continuity of according to , together with for to get . Similar arguments lead to the first inequality in . The last step in is thanks to
which follows from the frame condition on . We will also need the extension of the operator to paths according to
with . Note that the multi-stage operation is again well-defined thanks to
for and , which follows by repeated application of .
In scattering networks one atom , , in the wavelet frame , namely the low-pass filter , is singled out to generate the extracted features according to , see also Figure 2. We follow this construction and designate one of the atoms in each frame in the module-sequence as the output-generating atom , , of the -th layer. The atoms in are thus used across two consecutive layers in the sense of generating the output in the -th layer, and the propagating signals from the -th layer to the -th layer according to , see Figure 3. Note, however, that our theory does not require the output-generating atoms to be low-pass filters
We are now ready to define the feature extractor based on the module-sequence .
The set in corresponds to the features of the function generated in the -th network layer, see Figure 3, where corresponds to the root of the network. The feature extractor , with , is well-defined, i.e., , for all , under a technical condition on the module-sequence formalized as follows.
The proof is given in Appendix Section 10.
As condition is of central importance, we formalize it as follows.
We emphasize that condition is easily met in practice. To see this, first note that is determined through the frame (e.g., the directional wavelet frame introduced in Section 2 has ), is set through the non-linearity (e.g., the modulus function has , see Appendix Section 9), and depends on the operator in (e.g., pooling by sub-sampling amounts to and has ). Obviously, condition is met if
which can be satisfied by simply normalizing the frame elements of accordingly. We refer to Proposition ? in Appendix Section 6 for corresponding normalization techniques, which, as explained in Section 4, affect neither our translation invariance result nor our deformation sensitivity bounds.
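As an illustration of such a normalization, the following sketch rescales a filter bank by one common factor. It assumes that the condition takes the form $\max\{B, B L^2 R^2\} \le 1$, with $B$ the frame upper bound, $L$ the Lipschitz constant of the non-linearity, and $R$ that of the pooling operator; these symbol names, the discrete periodic setting, the random filters, and the values `L = 2.0`, `R = 1.0` are all assumptions made for this example.

```python
import numpy as np

def frame_upper_bound(g_hat):
    """Smallest B with sum_j |g_hat[j, k]|^2 <= B for all frequency bins k."""
    return float(np.max(np.sum(np.abs(g_hat) ** 2, axis=0)))

def normalize_filters(g_hat, lipschitz_nonlinearity, lipschitz_pooling):
    """Rescale all filters by a common factor so that B <= min(1, 1 / (L R)^2)."""
    target = min(1.0, 1.0 / (lipschitz_nonlinearity * lipschitz_pooling) ** 2)
    bound = frame_upper_bound(g_hat)
    if bound <= target:
        return g_hat              # already satisfies the assumed condition
    return g_hat * np.sqrt(target / bound)

rng = np.random.default_rng(0)
# Hypothetical unstructured filters, specified directly in the DFT domain.
g_hat = rng.standard_normal((5, 128)) + 1j * rng.standard_normal((5, 128))
L, R = 2.0, 1.0                   # assumed Lipschitz constants
g_norm = normalize_filters(g_hat, L, R)
B = frame_upper_bound(g_norm)
print(B, max(B, B * L ** 2 * R ** 2))
```

Because every filter is multiplied by the same scalar, the relative frequency content of the filters is unchanged; only the overall gain of the layer is reduced.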
4 Properties of the feature extractor
4.1 Vertical translation invariance
The following theorem states that under very mild decay conditions on the Fourier transforms of the output-generating atoms , the feature extractor exhibits vertical translation invariance in the sense of the features becoming more translation-invariant with increasing network depth. This result is in line with observations made in the deep learning literature, e.g., in , where it is informally argued that the network outputs generated at deeper layers tend to be more translation-invariant.
The proof is given in Appendix Section 11.
We start by noting that all pointwise (also referred to as memoryless in the signal processing literature) non-linearities satisfy the commutation relation in . A large class of non-linearities widely used in the deep learning literature, such as rectified linear units, hyperbolic tangents, shifted logistic sigmoids, and the modulus function as employed in , are, indeed, pointwise and hence covered by Theorem ?. Moreover, as in pooling by sub-sampling trivially satisfies . Pooling by averaging , with , satisfies as a consequence of the convolution operator commuting with the translation operator .
Note that can easily be met by taking the output-generating atoms either to satisfy
see, e.g., , or to be uniformly band-limited in the sense of , for all , with an that is independent of (see, e.g., ). The bound in shows that we can explicitly control the amount of translation invariance via the pooling factors . This result is in line with observations made in the deep learning literature, e.g., in , where it is informally argued that pooling is crucial to get translation invariance of the extracted features. Furthermore, the condition (easily met by taking , for all ) guarantees, thanks to , asymptotically full translation invariance according to
for all and . This means that the features corresponding to the shifted versions of the handwritten digit in Figure 6 (b) and (c) with increasing network depth increasingly “look like” the features corresponding to the unshifted handwritten digit in Figure 6 (a). Casually speaking, the shift operator is increasingly absorbed by as , with the upper bound quantifying this absorption.
In contrast, the translation invariance result in  is asymptotic in the wavelet scale parameter , and does not depend on the network depth, i.e., it guarantees full translation invariance in every network layer. We honor this difference by referring to as horizontal translation invariance and to as vertical translation invariance.
We emphasize that vertical translation invariance is a structural property. Specifically, if is unitary (such as, e.g., in the case of pooling by sub-sampling where simply equals the identity mapping), then so is the pooling operation in owing to
where we employed the change of variables , . Regarding average pooling, as already mentioned, the operators , , , are, in general, not unitary, but we still get translation invariance as a consequence of structural properties, namely translation covariance of the convolution operator combined with unitary dilation according to .
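The structural mechanism behind this, namely that dilation by the pooling factor shrinks a translation by the same factor, has a simple discrete counterpart. The sketch below is an assumption-laden illustration (periodic signals, plain sub-sampling as the pooling operation, shifts that are multiples of the accumulated pooling factor); it shows a shift of 8 samples at the input being reduced to a shift of 1 sample after three pooling stages with factor 2.

```python
import numpy as np

rng = np.random.default_rng(0)
S, n_layers, k = 2, 3, 1          # pooling factor, number of layers, final shift
N = 64
x = rng.standard_normal(N)

def pool(u, factor):
    """Discrete sub-sampling: keep every `factor`-th sample."""
    return u[::factor]

shift = k * S ** n_layers          # shift applied at the input (8 samples)
a, b = x, np.roll(x, shift)        # unshifted and shifted signal
remaining_shift = shift
shifts_match = []
for _ in range(n_layers):
    a, b = pool(a, S), pool(b, S)
    remaining_shift //= S          # each pooling stage divides the shift by S
    shifts_match.append(bool(np.allclose(b, np.roll(a, remaining_shift))))

print(shifts_match, remaining_shift)
```

In continuous time the same effect holds for arbitrary shifts, which is why the residual translation decreases with the product of the pooling factors.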
Finally, we note that in practice in certain applications it is actually translation covariance in the sense of , for all and , that is desirable, for example, in facial landmark detection where the goal is to estimate the absolute position of facial landmarks in images. In such applications features in the layers closer to the root of the network are more relevant as they are less translation-invariant and more translation-covariant. The reader is referred to  where corresponding numerical evidence is provided. We proceed to the formal statement of our translation covariance result.
The proof is given in Appendix Section 12.
Corollary ? shows that in the absence of pooling, i.e., taking , for all , leads to full translation covariance in every network layer. This proves that pooling is necessary to get vertical translation invariance as otherwise the features remain fully translation-covariant irrespective of the network depth. Finally, we note that scattering networks  (which do not employ pooling operators, see Section 2) are rendered horizontally translation-invariant by letting the wavelet scale parameter .
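Translation covariance in the absence of pooling can be observed directly in a discrete toy network. The sketch below is illustrative only: it assumes circular convolutions, hypothetical random filters (`filters`, `chi`), and the rectified linear unit as pointwise non-linearity. Shifting the input by `t` samples shifts every feature map by exactly `t` samples, regardless of depth.

```python
import numpy as np

rng = np.random.default_rng(1)
N, depth, n_filters = 64, 3, 2
# Hypothetical random filters, one set per layer, plus one output-generating filter.
filters = rng.standard_normal((depth, n_filters, N)) / N
chi = rng.standard_normal(N) / N

def cconv(u, h):
    """Circular convolution of two real sequences via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(u) * np.fft.fft(h)))

def features_no_pooling(f):
    """Feature maps of all layers, stacked row-wise; no pooling is applied."""
    signals, out = [f], []
    for layer in filters:
        out.extend(cconv(u, chi) for u in signals)
        # Pointwise non-linearity (ReLU) applied after each convolution.
        signals = [np.maximum(cconv(u, h), 0.0) for u in signals for h in layer]
    out.extend(cconv(u, chi) for u in signals)
    return np.array(out)

f = rng.standard_normal(N)
t = 5
shifted_then_extracted = features_no_pooling(np.roll(f, t))
extracted_then_shifted = np.roll(features_no_pooling(f), t, axis=1)
print(np.allclose(shifted_then_extracted, extracted_then_shifted))
```

The check relies only on two facts used in the text: convolution commutes with translation, and a pointwise non-linearity commutes with translation.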
4.2 Deformation sensitivity bound
The next result provides a bound—for band-limited signals —on the sensitivity of the feature extractor w.r.t. time-frequency deformations of the form
This class of deformations encompasses non-linear distortions as illustrated in Figure 9, and modulation-like deformations which occur, e.g., if the signal is subject to an undesired modulation and we therefore have access to a bandpass version of only.
The deformation sensitivity bound we derive is signal-class specific in the sense of applying to input signals belonging to a particular class, here band-limited functions. The proof technique we develop applies, however, to all signal classes that exhibit “inherent” deformation insensitivity in the following sense.
The constant and the exponents in depend on the particular signal class . Examples of deformation-insensitive signal classes are the class of -band-limited functions (see Proposition ? in Appendix Section 15), the class of cartoon functions , and the class of Lipschitz functions . While a deformation sensitivity bound that applies to all would be desirable, the example in Figure 11 illustrates the difficulty underlying this desideratum. Specifically, we can see in Figure 11 that for given and the impact of the deformation induced by can depend drastically on the function itself. The deformation stability bound for scattering networks reported in  applies to a signal class as well, characterized, albeit implicitly, through  and depending on the mother wavelet and the (modulus) non-linearity.
Our signal-class specific deformation sensitivity bound is based on the following two ingredients. First, we establish (in Proposition ? in Appendix Section 14) that the feature extractor is Lipschitz-continuous with Lipschitz constant $1$, i.e., $|||\Phi_\Omega(f) - \Phi_\Omega(h)||| \le \|f - h\|_2$ for all $f, h \in L^2(\mathbb{R}^d)$,
where, thanks to the admissibility condition , the Lipschitz constant in is completely independent of the frame upper bounds and the Lipschitz-constants and of and , respectively. Second, we derive—in Proposition ? in Appendix Section 15—an upper bound on the deformation error for -band-limited functions, i.e., , according to
The deformation sensitivity bound for the feature extractor is then obtained by setting in and using (see Appendix Section 13 for the corresponding technical details). This “decoupling” into Lipschitz continuity of and a deformation sensitivity bound for the signal class under consideration (here, band-limited functions) has important practical ramifications as it shows that whenever we have a deformation sensitivity bound for the signal class, we automatically get a deformation sensitivity bound for the feature extractor thanks to its Lipschitz continuity. The same approach was used in  to derive deformation sensitivity bounds for cartoon functions and for Lipschitz functions.
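The Lipschitz continuity ingredient can be probed numerically. The following sketch is a discrete analogue under stated assumptions, not the network analyzed in the appendix: it uses circular convolutions, hypothetical random filters normalized per frequency bin so that their squared magnitudes sum to one, the modulus non-linearity, and no pooling. In this setting the distance between two feature vectors never exceeds the distance between the inputs.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n_filters, depth = 64, 4, 3

# Hypothetical unstructured filters, drawn at random in the DFT domain and
# rescaled per frequency bin so that sum_j |g_hat[j, k]|^2 = 1.
g_hat = rng.standard_normal((n_filters, N)) + 1j * rng.standard_normal((n_filters, N))
g_hat /= np.sqrt(np.sum(np.abs(g_hat) ** 2, axis=0, keepdims=True))
chi_hat, psi_hat = g_hat[0], g_hat[1:]   # output-generating filter, propagating filters

def conv(u, h_hat):
    return np.fft.ifft(np.fft.fft(u) * h_hat)

def feature_vector(f):
    """Concatenation of the features of layers 0, ..., depth."""
    signals, out = [np.asarray(f, dtype=complex)], []
    for layer in range(depth + 1):
        out.extend(conv(u, chi_hat) for u in signals)
        if layer < depth:
            signals = [np.abs(conv(u, h)) for u in signals for h in psi_hat]
    return np.concatenate(out)

f = rng.standard_normal(N)
noise = 0.1 * rng.standard_normal(N)
feature_distance = float(np.linalg.norm(feature_vector(f + noise) - feature_vector(f)))
input_distance = float(np.linalg.norm(noise))
print(feature_distance, input_distance)
```

Replacing `noise` by the difference between a deformed and an undeformed signal gives the decoupling argument in miniature: any bound on the deformation error of the input carries over to the features.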
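Lipschitz continuity of the feature extractor also guarantees that pairwise distances in the input signal space do not increase through feature extraction. An immediate consequence is robustness of the feature extractor w.r.t. additive noise $\eta \in L^2(\mathbb{R}^d)$ in the sense of $|||\Phi_\Omega(f + \eta) - \Phi_\Omega(f)||| \le \|\eta\|_2$, for all $f \in L^2(\mathbb{R}^d)$.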
We proceed to the formal statement of our deformation sensitivity result.
The proof is given in Appendix Section 13.
First, we note that the bound in holds for with sufficiently “small” Jacobian matrix, i.e., as long as . We can think of this condition on the Jacobian matrix as follows
The bound for scattering networks reported in  depends upon first-order and second-order derivatives of . In contrast, our bound depends on implicitly only as we need to impose the condition for the bound to hold
The dependence of the upper bound in on the bandwidth reflects the intuition that the deformation sensitivity bound should depend on the input signal class “description complexity”. Many signals of practical significance (e.g., natural images) are, however, either not band-limited due to the presence of sharp (and possibly curved) edges or exhibit large bandwidths. In the latter case, the bound is effectively rendered void owing to its linear dependence on . We refer the reader to  where deformation sensitivity bounds for non-smooth signals were established. Specifically, the main contributions in  are deformation sensitivity bounds—again obtained through decoupling—for non-linear deformations according to
for the signal classes of cartoon functions  and for Lipschitz-continuous functions. The constant and the exponent in depend on the particular signal class and are specified in . As the vertical translation invariance result in Theorem ? applies to all , the results established in the present paper and in  taken together show that vertical translation invariance and limited sensitivity to deformations—for signal classes with inherent deformation insensitivity—are guaranteed by the feature extraction network structure per se rather than the specific convolution kernels, non-linearities, and pooling operators.
Finally, the deformation stability bound for scattering networks reported in  applies to the space
and denotes the set of paths of length with . While  cites numerical evidence on the series being finite for a large class of signals , it seems difficult to establish this analytically, let alone to show that
In contrast, the deformation sensitivity bound applies provably to the space of -band-limited functions . Finally, the space in depends on the wavelet frame atoms and the (modulus) non-linearity, and thereby on the underlying signal transform, whereas is, trivially, independent of the module-sequence .
5 Final remarks and outlook
It is interesting to note that the frame lower bounds of the semi-discrete frames affect neither the vertical translation invariance result in Theorem ? nor the deformation sensitivity bound in Theorem ?. In fact, the entire theory in this paper carries through as long as the collections , for all , satisfy the Bessel property
for all for some , which, by Proposition ?, is equivalent to
Pre-specified unstructured filters  and learned filters  are therefore covered by our theory as long as is satisfied. In classical frame theory guarantees completeness of the set for the signal space under consideration, here . The absence of a frame lower bound therefore translates into a lack of completeness of , which may result in the frame coefficients , , not containing all essential features of the signal . This will, in general, have a (possibly significant) impact on practical feature extraction performance which is why ensuring the entire frame property is prudent. Interestingly, satisfying the frame property for all , , does, however, not guarantee that the feature extractor has a trivial null-space, i.e., if and only if . We refer the reader to  for an example of a feature extractor with non-trivial null-space.
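The effect of a missing frame lower bound can be made concrete with a small example. The sketch below is illustrative and rests on assumptions: a discrete periodic setting and three hypothetical ideal band-pass filters that leave part of the frequency axis uncovered. The collection satisfies the Bessel property with upper bound 1, its lower bound is 0, and a signal living entirely in the uncovered band produces vanishing coefficients.

```python
import numpy as np

N = 128
n = np.arange(N)

# Three ideal band-pass filters (DFT domain) covering bins 0..47 only;
# bins 48..127 are not covered by any filter.
g_hat = np.zeros((3, N))
for j in range(3):
    g_hat[j, 16 * j:16 * (j + 1)] = 1.0

# Per-bin sum of squared filter magnitudes; its minimum and maximum are the
# best lower and upper bounds A and B.
littlewood_paley = np.sum(np.abs(g_hat) ** 2, axis=0)
A, B = float(littlewood_paley.min()), float(littlewood_paley.max())

f = np.cos(2 * np.pi * 50 * n / N)      # energy only in bins 50 and 78
coeffs = np.fft.ifft(np.fft.fft(f)[None, :] * g_hat, axis=1)
print(A, B, float(np.linalg.norm(f)), float(np.linalg.norm(coeffs)))
```

A feature extractor built on such filters cannot distinguish `f` from the zero signal, which is the practical reason given above for preferring collections that satisfy the full frame property.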
This appendix gives a brief review of the theory of semi-discrete frames. A list of structured example frames of interest in the context of this paper is provided in Appendix Section 7 for the 1-D case, and in Appendix Section 8 for the 2-D case. Semi-discrete frames are instances of continuous frames, and appear in the literature, e.g., in the context of translation-covariant signal decompositions, and as an intermediate step in the construction of various fully-discrete frames. We first collect some basic results on semi-discrete frames.
The frame operator associated with the semi-discrete frame is defined in the weak sense as ,