
# Deep Convolutional Framelets: A General Deep Learning Framework for Inverse Problems††thanks: The authors would like to thank Dr. Cynthia McCollough, the Mayo Clinic, the American Association of Physicists in Medicine (AAPM), and grants EB01705 and EB01785 from the National Institute of Biomedical Imaging and Bioengineering for providing the Low-Dose CT Grand Challenge data set. This work is supported by the National Research Foundation of Korea, grant numbers NRF-2016R1A2B3008104, NRF-2015M3A9A7029734, and NRF-2017M3C7A1047904.

Jong Chul Ye, Yoseob Han, Eunju Cha
Bio Imaging and Signal Processing Lab., Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea (jong.ye@kaist.ac.kr; hanyoseob@kaist.ac.kr; eunju.cha@kaist.ac.kr). Address all correspondence to J. C. Ye at jong.ye@kaist.ac.kr, Ph.: +82-42-3504320, Fax: +82-42-3504310.
###### Abstract

Recently, deep learning approaches with various network architectures have achieved significant performance improvement over existing iterative reconstruction methods in various imaging problems. However, it is still unclear why these deep learning architectures work for specific inverse problems. Moreover, in contrast to the usual evolution of signal processing theory around classical theories, the link between deep learning and classical signal processing approaches, such as wavelets, non-local processing, and compressed sensing, is not yet well understood. To address these issues, here we show that the long-searched-for missing link is the convolution framelets for representing a signal by convolving local and non-local bases. The convolution framelets were originally developed to generalize the theory of low-rank Hankel matrix approaches for inverse problems, and this paper further extends the idea so that we can obtain a deep neural network using multilayer convolution framelets with perfect reconstruction (PR) under rectified linear unit (ReLU) nonlinearity. Our analysis also shows that popular deep network components such as the residual block, redundant filter channels, and concatenated ReLU (CReLU) do indeed help to achieve the PR, while the pooling and unpooling layers should be augmented with high-pass branches to meet the PR condition. Moreover, by changing the number of filter channels and bias, we can control the shrinkage behaviors of the neural network. This discovery reveals the limitations of many existing deep learning architectures for inverse problems, and leads us to propose a novel theory for a deep convolutional framelets neural network. Using numerical experiments with various inverse problems, we demonstrate that our deep convolutional framelets network shows consistent improvement over existing deep architectures.
This discovery suggests that the success of deep learning is not from a magical power of a black-box, but rather comes from the power of a novel signal representation using non-local basis combined with data-driven local basis, which is indeed a natural extension of classical signal processing theory.

Key words. Convolutional neural network, framelets, deep learning, inverse problems, ReLU, perfect reconstruction condition

AMS subject classifications. Primary, 94A08, 97R40, 94A12, 92C55, 65T60, 42C40; Secondary, 44A12

## 1 Introduction

Deep learning approaches have achieved tremendous success in classification problems [44] as well as in low-level computer vision problems such as segmentation [59], denoising [76], super-resolution [42, 61], etc. The theoretical origin of this success has been investigated [58, 63], and the exponential expressivity under a given network complexity (in terms of VC dimension [3] or Rademacher complexity [5]) has often been attributed to it. A deep network is also known to learn high-level abstractions/features of the data, similar to visual processing in the human brain, using multiple layers of neurons with non-linearity [47].

Inspired by the success of deep learning in low-level computer vision, several machine learning approaches have recently been proposed for image reconstruction problems. In X-ray computed tomography (CT), Kang et al [39] provided the first systematic study of deep convolutional neural networks (CNN) for low-dose CT and showed that a deep CNN using directional wavelets is more efficient in removing low-dose related CT noise. Unlike these low-dose artifacts from reduced tube currents, the streaking artifacts originating from sparse projection views are global in nature and difficult to remove using conventional denoising CNNs [15, 52]. Han et al [29] and Jin et al [35] independently proposed residual learning using U-Net [59] to remove the global streaking artifacts caused by sparse projection views. In MRI, Wang et al [67] were the first to apply deep learning to compressed sensing MRI (CS-MRI). They trained a deep neural network on downsampled reconstruction images to learn a fully sampled reconstruction. Then, they used the deep learning result either as an initialization or as a regularization term in classical CS approaches. A multilayer perceptron was developed for accelerated parallel MRI [46, 45]. A deep network architecture using an unfolded iterative compressed sensing (CS) algorithm was also proposed [25]. Instead of using handcrafted regularizers, the authors in [25] tried to learn a set of optimal regularizers. Domain adaptation from a sparse-view CT network to projection-reconstruction MRI was also proposed [30]. These pioneering works have consistently demonstrated impressive reconstruction performance, often superior to existing iterative approaches. However, the more we observe impressive empirical results in image reconstruction problems, the more unanswered questions we encounter. For example, to the best of our knowledge, we do not have complete answers to the following questions that are critical to network design:

1. What is the role of the filter channels in convolutional layers?

2. Why do some networks need fully connected layers whereas others do not?

3. What is the role of nonlinearities such as the rectified linear unit (ReLU)?

4. Why do we need pooling and unpooling in some architectures?

5. What is the role of the by-pass connection or residual network?

6. How many layers do we need?

Furthermore, the most troubling issue for the signal processing community is that the link to classical signal processing theory is still not fully understood. For example, wavelets [17] have been extensively investigated as an efficient signal representation theory for many image processing applications by exploiting the energy compaction property of wavelet bases. Compressed sensing theory [19, 14] has further extended the idea to demonstrate that accurate recovery is possible from undersampled data, if the signal is sparse in some frame and the sensing matrix is incoherent. Non-local image processing techniques such as non-local means [8], BM3D [16], etc. have also demonstrated impressive performance for many image processing applications. The links between these algorithms have been extensively studied over the last few years using various mathematical tools from harmonic analysis, convex optimization, etc. However, recent years have witnessed that a blind application of deep learning toolboxes sometimes provides even better performance than mathematics-driven classical signal processing approaches. Does this imply the dark age of signal processing, or a new opportunity?

Therefore, the main goal of this paper is to address these open questions. In fact, our paper is not the only attempt to address these issues. For instance, Papyan et al [56] showed that once ReLU nonlinearity is employed, the forward pass of a network can be interpreted as a deep sparse coding algorithm. Wiatowski et al [69] discuss the importance of pooling for networks, proving that it leads to translation invariance. Moreover, several works including [23] provided explanations for residual networks. The interpretation of a deep network in terms of unfolded (or unrolled) sparse recovery is another prevailing view in the research community [24, 71, 25, 35]. However, this interpretation still does not answer several key questions: for example, why do we need multichannel filters? In this paper, we therefore depart from these existing views and propose a new interpretation of a deep network as a novel signal representation scheme. In fact, signal representation theories such as wavelets and frames have been active areas of research for many years [50], and Mallat [51] and Bruna et al [7] proposed the wavelet scattering network as a translation-invariant and deformation-robust image representation. However, this approach does not have learning components as in existing deep learning networks.

Then, what is missing here? One of the most important contributions of our work is to show that the geometry of deep learning can be revealed by lifting a signal to a high dimensional space using a Hankel structured matrix. More specifically, many types of input signals that occur in signal processing can be factored into left and right bases as well as a sparse matrix with energy compaction properties when lifted into a Hankel structured matrix. This results in a frame representation of the signal using the left and right bases, referred to as the non-local and local base matrices, respectively. The origin of this nomenclature will become clear later. One of our novel contributions is the realization that the non-local base determines the network architecture such as pooling/unpooling, while the local basis allows the network to learn convolutional filters. More specifically, application-specific domain knowledge leads to a better choice of a non-local basis, on which the local basis is learned to maximize the performance.

In fact, the idea of exploiting the two bases via the so-called convolution framelets was originally proposed by Yin et al [74]. However, the aforementioned close link to deep neural networks was not revealed in [74]. Most importantly, we demonstrate for the first time that the convolution framelet representation can be equivalently represented as an encoder-decoder convolution layer, and that a multi-layer convolution framelet expansion is also feasible by relaxing the conditions in [74]. Furthermore, we derive the perfect reconstruction (PR) condition under the rectified linear unit (ReLU). The mysterious role of the redundant multichannel filters can then be easily understood as an important tool to meet the PR condition. Moreover, by augmenting local filters with paired filters of opposite phase, the ReLU nonlinearity disappears and the deep convolutional framelet becomes a linear signal representation. However, in order for the deep network to satisfy the PR condition, the number of channels should increase exponentially along the layers, which is difficult to achieve in practice. Interestingly, we can show that an insufficient number of filter channels results in a shrinkage behavior via a low-rank approximation of an extended Hankel matrix, and this shrinkage behavior can be exploited to maximize network performance. Finally, to overcome the limitation of the pooling and unpooling layers, we introduce a multi-resolution analysis (MRA) for convolution framelets using a wavelet non-local basis as a generalized pooling/unpooling. We call this new class of deep networks using convolution framelets the deep convolutional framelets.

### 1.1 Notations

For a matrix A, R(A) denotes the range space of A and N(A) refers to the null space of A. PR(A) denotes the projection onto the range space of A, whereas P⊥R(A) denotes the projection onto its orthogonal complement. The notation 1n denotes an n-dimensional vector of 1's. The n×n identity matrix is referred to as In×n. For a given matrix A, the notation A† refers to its generalized inverse. The superscript H of AH denotes the Hermitian transpose. Because we are mainly interested in real valued cases, AH is equivalent to the transpose A⊤. The inner product in matrix space is defined by ⟨A,B⟩=Tr(A⊤B), where A,B∈Rn×m. For a matrix A, ∥A∥F denotes its Frobenius norm. For a given matrix A∈Rn×m, aj denotes its j-th column, and aij is the (i,j)-th element of A. If a matrix Ψ∈Rpd×q is partitioned as Ψ⊤=[Ψ⊤1⋯Ψ⊤p] with sub-matrices Ψi∈Rd×q, then ψij refers to the j-th column of Ψi. A vector ¯¯¯v is referred to as the flipped version of a vector v, i.e. its indices are reversed. Similarly, for a given matrix Ψ, the notation ¯¯¯¯Ψ refers to a matrix composed of flipped column vectors. For a block structured matrix Ψ⊤=[Ψ⊤1⋯Ψ⊤p], with a slight abuse of notation, we define ¯¯¯¯Ψ as

 ¯¯¯¯Ψ:=⎡⎢ ⎢⎣¯¯¯¯Ψ1⋮¯¯¯¯Ψp⎤⎥ ⎥⎦ (1)

Finally, Table LABEL:tbl:notation summarizes the notation used throughout the paper.

## 2 Mathematics of Hankel matrix

Since the Hankel structured matrix is the key component in our theory, this section discusses various properties of the Hankel matrix that will be extensively used throughout the paper.

### 2.1 Hankel matrix representation of convolution

Hankel matrices arise repeatedly in many different contexts in signal processing and control theory, such as system identification [21], harmonic retrieval, array signal processing [33], subspace-based channel identification [64], etc. A Hankel matrix can also be obtained from a convolution operation [72], which is of particular interest in this paper. Here, to avoid special treatment of the boundary condition, our theory is mainly derived using the circular convolution.

Let f=[f[1]⋯f[n]]⊤∈Rn denote an input signal and ψ∈Rd a filter. Then, a single-input single-output (SISO) convolution of the input f and the filter ¯¯¯ψ can be represented in a matrix form:

 y=f⊛¯¯¯¯ψ = Hd(f)ψ , (2)

where Hd(f)∈Rn×d is a wrap-around Hankel matrix:

 Hd(f)=⎡⎢ ⎢ ⎢ ⎢ ⎢⎣f[1]f[2]⋯f[d]f[2]f[3]⋯f[d+1]⋮⋮⋱⋮f[n]f[1]⋯f[d−1]⎤⎥ ⎥ ⎥ ⎥ ⎥⎦ (3)
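As a numerical sanity check (ours, not part of the original derivation), the identity y = f⊛¯¯¯ψ = Hd(f)ψ can be verified directly; the helper `hankel_wrap` below is an assumed implementation of the wrap-around Hankel matrix above:

```python
import numpy as np

def hankel_wrap(f, d):
    # wrap-around Hankel matrix H_d(f): row i holds (f[i], f[i+1], ..., f[i+d-1]) mod n
    n = len(f)
    return np.array([[f[(i + j) % n] for j in range(d)] for i in range(n)])

n, d = 8, 3
rng = np.random.default_rng(0)
f = rng.standard_normal(n)
psi = rng.standard_normal(d)

# matrix form of the SISO convolution: y = H_d(f) psi
y_hankel = hankel_wrap(f, d) @ psi

# same output as the circular convolution with the flipped filter,
# computed here via the FFT correlation theorem
psi_pad = np.zeros(n)
psi_pad[:d] = psi
y_conv = np.real(np.fft.ifft(np.fft.fft(f) * np.conj(np.fft.fft(psi_pad))))

assert np.allclose(y_hankel, y_conv)
```

The wrap-around rows implement exactly the circular convolution convention adopted above.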

Similarly, a single-input multi-output (SIMO) convolution using q filters ψ1,⋯,ψq∈Rd can be represented by

 Y=f⊛¯¯¯¯Ψ=Hd(f)Ψ (4)

where

 Y:=[y1⋯yq]∈Rn×q, Ψ:=[ψ1⋯ψq]∈Rd×q.

On the other hand, a multi-input multi-output (MIMO) convolution for the p-channel input Z=[z1⋯zp]∈Rn×p can be represented by

 yi=p∑j=1zj⊛¯¯¯¯ψji,i=1,⋯,q (5)

where p and q are the number of input and output channels, respectively, and ψji∈Rd denotes the length-d filter that convolves the j-th channel input to compute its contribution to the i-th output channel. By defining the MIMO filter kernel Ψ as follows:

 Ψ=⎡⎢ ⎢⎣Ψ1⋮Ψp⎤⎥ ⎥⎦whereΨj=[ψj1⋯ψjq]∈Rd×q (6)

the corresponding matrix representation of the MIMO convolution is then given by

 Y = Z⊛¯¯¯¯Ψ (7) = p∑j=1Hd(zj)Ψj (8) = Hd|p(Z)Ψ (9)

where ¯¯¯¯Ψ is a flipped block structured matrix in the sense of (LABEL:eq:block), and Hd|p(Z) is an extended Hankel matrix obtained by stacking p Hankel matrices side by side:

 Hd|p(Z):=[Hd(z1)Hd(z2)⋯Hd(zp)] . (10)

For notational simplicity, we denote Hd|1(Z)=Hd(z) when p=1. Fig. LABEL:fig:hankel illustrates the procedure to construct an extended Hankel matrix when the convolution filter length is 2.

Finally, as a special case of the MIMO convolution with q=1, the multi-input single-output (MISO) convolution is defined by

 y = p∑j=1zj⊛¯¯¯¯ψj=Z⊛Ψ=Hd|p(Z)Ψ (11)

where

 Ψ=⎡⎢ ⎢⎣ψ1⋮ψp⎤⎥ ⎥⎦.

The SISO, SIMO, MIMO, and MISO convolutional operations are illustrated in Fig. LABEL:fig:conv(a)-(d).
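The MIMO identity above can likewise be checked numerically. The following sketch (our own; helper names such as `hankel_wrap` and `circ_corr` are assumptions, not the paper's notation) forms the extended Hankel matrix by stacking per-channel Hankel blocks and compares it against the direct channel-wise convolution:

```python
import numpy as np

def hankel_wrap(f, d):
    n = len(f)
    return np.array([[f[(i + j) % n] for j in range(d)] for i in range(n)])

def circ_corr(f, psi):
    # y[k] = sum_j f[(k+j) % n] * psi[j]: circular convolution with the flipped filter
    n, d = len(f), len(psi)
    return np.array([sum(f[(k + j) % n] * psi[j] for j in range(d)) for k in range(n)])

n, d, p, q = 10, 3, 2, 4
rng = np.random.default_rng(2)
Z = rng.standard_normal((n, p))        # p input channels
Psi = rng.standard_normal((p, d, q))   # Psi[j] maps channel j to the q outputs

# direct MIMO convolution: y_i = sum_j z_j (*) flip(psi_ji)
Y_direct = np.zeros((n, q))
for i in range(q):
    for j in range(p):
        Y_direct[:, i] += circ_corr(Z[:, j], Psi[j, :, i])

# extended Hankel form: Y = H_{d|p}(Z) [Psi_1; ...; Psi_p]
H_ext = np.hstack([hankel_wrap(Z[:, j], d) for j in range(p)])
Psi_stacked = np.vstack([Psi[j] for j in range(p)])

assert np.allclose(Y_direct, H_ext @ Psi_stacked)
```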

The extension to multi-channel 2-D convolution operations for image domain CNNs (and multi-dimensional convolutions in general) is straightforward, since similar matrix-vector operations can also be used. The only required change is the definition of the (extended) Hankel matrices, which now become block Hankel matrices. Specifically, for a 2-D input X=[x1⋯xn2]∈Rn1×n2, the block Hankel matrix associated with filtering by a d1×d2 filter is given by

 Hd1,d2(X)=⎡⎢ ⎢ ⎢ ⎢ ⎢⎣Hd1(x1)Hd1(x2)⋯Hd1(xd2)Hd1(x2)Hd1(x3)⋯Hd1(xd2+1)⋮⋮⋱⋮Hd1(xn2)Hd1(x1)⋯Hd1(xd2−1)⎤⎥ ⎥ ⎥ ⎥ ⎥⎦∈Rn1n2×d1d2. (12)

Similarly, an extended block Hankel matrix from the p-channel input images X(1),⋯,X(p)∈Rn1×n2 is defined by

 Hd1,d2|p([X(1)⋯X(p)])=[Hd1,d2(X(1))⋯Hd1,d2(X(p))]∈Rn1n2×d1d2p. (13)

Then, the output Y of the 2-D SISO convolution of a given image X with a 2-D filter K can be represented in matrix-vector form:

 Vec(Y)=Hd1,d2(X)Vec(K)

where Vec(Y) denotes the vectorization operation obtained by stacking the column vectors of the 2-D matrix Y. Similarly, the 2-D MIMO convolution of p given input images with 2-D filters K(j)(i) can be represented in matrix-vector form:

 Vec(Y(i)) = p∑j=1Hd1,d2(X(j))Vec(K(j)(i)),i=1,⋯,q (14)

Therefore, by defining

 Y=[Vec(Y(1))⋯Vec(Y(q))] (15)
 K=⎡⎢ ⎢ ⎢⎣Vec(K(1)(1))⋯Vec(K(1)(q))⋮⋱⋮Vec(K(p)(1))⋯Vec(K(p)(q))⎤⎥ ⎥ ⎥⎦ (16)

the 2-D MIMO convolution can be represented by

 Y=Hd1,d2|p([X(1)⋯X(p)])K. (17)

Due to these similarities between 1-D and 2-D convolutions, we will therefore use the 1-D notation throughout the paper for the sake of simplicity; however, readers are advised that the same theory applies to 2-D cases.

In convolutional neural networks (CNNs), a particular form of multi-dimensional convolution is used. Specifically, to generate q output channels from p input channels, each channel output is computed by first convolving p 2-D filters with the p input channel images, and then applying a weighted sum to the outputs (which is often referred to as a 1×1 convolution). For 1-D signals, this operation can be written as

 yi = p∑j=1wj(zj⊛¯¯¯¯ψji),i=1,⋯,q (18)

where wj denotes the 1-D weighting. Note that this is equivalent to a MIMO convolution, since we have

 Y = p∑j=1wjHd(zj)Ψj (19) = p∑j=1Hd(zj)Ψwj = Hd|p(Z)Ψw=Z⊛¯¯¯¯Ψw

where

 ¯¯¯¯Ψw=⎡⎢ ⎢ ⎢⎣w1¯¯¯¯Ψ1⋮wp¯¯¯¯Ψp⎤⎥ ⎥ ⎥⎦ . (20)

The aforementioned matrix vector operations using the extended Hankel matrix also describe the filtering operation (LABEL:eq:2dConv) in 2-D CNNs as shown in Fig. LABEL:fig:cnnConv.

Throughout the paper, we denote the space of wrap-around Hankel structured matrices of the form (LABEL:eq:hank) by H(n,d), and the space of extended Hankel matrices composed of p Hankel matrices of the form (LABEL:eq:ehank) by H(n,d;p). The basic properties of Hankel matrices used in this paper are described in Lemma LABEL:lem:calculus in Appendix LABEL:ap1. In the next section, we describe advanced properties of the Hankel matrix that will be used extensively in this paper.

### 2.2 Low-rank property of Hankel Matrices

One of the most intriguing features of the Hankel matrix is that it often has a low-rank structure and its low-rankness is related to the sparsity in the Fourier domain (for the case of Fourier samples, it is related to the sparsity in the spatial domain)[72, 37].

Note that many types of image patches have sparsely distributed Fourier spectra. For example, as shown in Fig. LABEL:fig:flowchart(a), a smoothly varying patch usually has spectral content concentrated in the low-frequency regions, while the other frequency regions have very few spectral components. Similar spectral domain sparsity can be observed in the texture patch shown in Fig. LABEL:fig:flowchart(b), where the spectral components of the patch are determined by the spectrum of the patterns. For the case of an abrupt transition along an edge, as shown in Fig. LABEL:fig:flowchart(c), the spectral components are mostly localized along an axis. In these cases, if we construct a Hankel matrix using the corresponding image patch, the resulting Hankel matrix is low-ranked [72]. This property is extremely useful, as demonstrated by many applications [37, 34, 55, 48, 49, 36]. For example, this idea can be used for image denoising [38] and deconvolution [53] by modeling the underlying intact signals to have a low-rank Hankel structure, from which the artifacts or blur components can easily be removed.

In order to understand this intriguing relationship, consider a 1-D signal whose spectrum in the Fourier domain is sparse and can be modelled as a sum of Diracs:

 ^f(ω)=2πr−1∑j=0cjδ(ω−ωj)ωj∈[0,2π], (21)

where cj denote the amplitudes of the corresponding harmonic components in the Fourier domain. Then, the corresponding discrete time-domain signal is given by:

 f[k]=r−1∑j=0cje−ikωj . (22)

Suppose that we have an (r+1)-length filter h which has the following z-transform representation [66]:

 ^h(z) = r∑l=0h[l]z−l=r−1∏j=0(1−e−iωjz−1) . (23)

Then, it is easy to see that

 (f⊛h)[k]=0,∀k, (24)

because

 (h∗f)[k] = r∑l=0h[l]f[k−l] (25) = r∑l=0r−1∑j=0cjh[l]uk−lj = r−1∑j=0cj(r∑l=0h[l]u−lj)ukj = r−1∑j=0cj^h(uj)ukj=0

where uj=e−iωj and the last equality comes from (LABEL:eq:afilter) [66]. Thus, the filter h annihilates the signal f, so it is referred to as an annihilating filter. Moreover, using the notation in (LABEL:eq:SISO), Eq. (LABEL:eq:annf) can be represented by

 Hd(f)¯¯¯h=0 .

This implies that the Hankel matrix Hd(f) is rank-deficient. In fact, the rank of the Hankel matrix can be explicitly calculated, as shown in the following theorem:

###### Theorem 1.

[72] Let r+1 denote the minimum length of the annihilating filters that annihilate the signal f. Then, for a given Hankel structured matrix Hd(f)∈H(n,d) with d>r, we have

 rankHd(f)=r, (26)

where rank(⋅) denotes the matrix rank.

Thus, if we choose a sufficiently large d, the resulting Hankel matrix is low-ranked. This relationship is quite general, and Ye et al [72] further showed that the rank of the associated Hankel matrix is r if and only if f can be represented by

 f[k]=p−1∑j=0mj−1∑l=0cj,l kl λkj ,where r=p−1∑j=0mj

for some λj∈C. If |λj|=1, then it is directly related to signals with a finite rate of innovation (FRI) [66]. Thus, the low-rank Hankel matrix provides an important link between FRI sampling theory and compressed sensing, such that a sparse recovery problem can be solved using the measurement domain low-rank interpolation [72].
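Theorem 1 is easy to verify numerically. The sketch below (ours, not from the paper) builds a signal with r = 2 spectral Diracs, so the theorem predicts rank Hd(f) = 2 whenever d > 2; on-grid frequencies are assumed so that the wrap-around (circular) convention holds exactly:

```python
import numpy as np

def hankel_wrap(f, d):
    # wrap-around Hankel matrix, as used throughout this section
    n = len(f)
    return np.array([[f[(i + j) % n] for j in range(d)] for i in range(n)])

n, d, r = 32, 6, 2
k = np.arange(n)
# r = 2 on-grid harmonics (two spectral Diracs at frequencies 3/n and 10/n)
f = np.exp(2j * np.pi * 3 * k / n) + 0.5 * np.exp(2j * np.pi * 10 * k / n)

H = hankel_wrap(f, d)          # n x d Hankel matrix with d > r
assert np.linalg.matrix_rank(H) == r
```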

In [34], we also showed that the rank of the extended Hankel matrix in (LABEL:eq:ehank) is low when the multiple signals zi have the following structure:

 zi=f⊛¯¯¯hi,i=1,⋯,p (28)

such that each Hankel matrix has the following decomposition:

 Hd(zi)=Hn(f)Cd(hi),i=1,⋯,p (29)

where Hn(f)∈Rn×n is a wrap-around Hankel matrix, and Cd(h) for any h∈Rm with m≤n is defined by

 Cd(h)=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣h[1]⋯0⋮⋱0h[m]⋱h[1]0⋱⋮⋮⋮h[m]⋮⋮⋮000⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦∈Cn×d . (30)

Accordingly, the extended Hankel matrix has the following decomposition:

 Hd|p(Z)=Hn(f)[Cd(h1)⋯Cd(hp)]. (31)

Due to the rank inequality rank(AB)≤min{rank(A),rank(B)}, we therefore have the following rank bound:

 rankHd|p(Z) ≤ min{rankHn(f),rank[Cd(h1)⋯Cd(hp)]} (32) = min{r,pd}.

Therefore, if the filter length d is chosen such that the number of columns pd of the extended matrix is sufficiently large, i.e. pd>r, then the concatenated matrix becomes low-ranked.

Note that low-rank Hankel matrix algorithms are usually performed in a patch-by-patch manner [38, 37]. It is also remarkable that this is similar to the current practice of deep CNNs for low-level computer vision applications, where the network input is usually given as a patch. Later, we will show that this is not a coincidence; rather, it suggests an important link between the low-rank Hankel matrix approach and a CNN.

### 2.3 Hankel matrix decomposition and the convolution framelets

The last but not least important property of the Hankel matrix is that its decomposition results in a framelet representation whose bases are constructed by the convolution of so-called local and non-local bases [74]. More specifically, for a given input vector f∈Rn, suppose that the Hankel matrix Hd(f) with rank r has the following singular value decomposition:

 Hd(f)=UΣV⊤ (33)

where U=[u1⋯ur]∈Rn×r and V=[v1⋯vr]∈Rd×r denote the left and right singular vector basis matrices, respectively, and Σ∈Rr×r is the diagonal matrix whose diagonal components contain the singular values. Then, by multiplying U⊤ and V to the left and right of the Hankel matrix, we have

 Σ = U⊤Hd(f)V . (34)

Note that the (i,j)-th element of Σ is given by

 σij=u⊤iHd(f)vj=⟨f,ui⊛vj⟩,1≤i,j≤r , (35)

where the last equality comes from (LABEL:eq:inner). Since the numbers of rows and columns of Hd(f) are n and d, respectively, the right-multiplied vector vj interacts locally with d neighboring elements of the f vector, whereas the left-multiplied vector ui interacts globally with all n elements of the f vector. Accordingly, (LABEL:eq:sigma) represents the strength of the simultaneous global and local interaction of the signal with the bases. Thus, we call ui and vj non-local and local bases, respectively.
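The energy compaction achieved by the SVD bases can be illustrated numerically. In the sketch below (ours; `hankel_wrap` is an assumed helper), a smooth signal yields a rank-2 Hankel matrix, and projecting onto the leading singular vectors gives a diagonal coefficient matrix Σ:

```python
import numpy as np

def hankel_wrap(f, d):
    n = len(f)
    return np.array([[f[(i + j) % n] for j in range(d)] for i in range(n)])

n, d = 16, 4
k = np.arange(n)
f = np.cos(2 * np.pi * 2 * k / n)   # smooth signal -> rank-2 Hankel matrix

H = hankel_wrap(f, d)
U, s, Vt = np.linalg.svd(H)
r = int(np.sum(s > 1e-8 * s[0]))    # numerical rank

# Sigma = U^T H_d(f) V restricted to the r leading singular vectors is diagonal:
# the SVD bases maximally compact the framelet coefficients
Sigma = U[:, :r].T @ H @ Vt[:r].T
assert r == 2
assert np.allclose(Sigma, np.diag(s[:r]))
```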

This relation holds for arbitrary basis matrices Φ and Ψ that are multiplied to the left and right of the Hankel matrix, respectively, to yield the coefficient matrix:

 cij=ϕ⊤iHd(f)ψj=⟨f,ϕi⊛ψj⟩,i=1,⋯,n, j=1,⋯,d, (36)

which represents the interaction of f with the non-local basis ϕi and the local basis ψj. Using (LABEL:eq:C0) as expansion coefficients, Yin et al derived the following signal expansion, which they call the convolution framelet expansion [74]:

###### Proposition 2 ([74]).

Let ϕi and ψj denote the i-th and j-th columns of the orthonormal matrices Φ∈Rn×n and Ψ∈Rd×d, respectively. Then, for any n-dimensional vector f,

 f = 1dn∑i=1d∑j=1⟨f,ϕi⊛ψj⟩ϕi⊛ψj (37)

Furthermore, the functions ϕi⊛ψj with i=1,⋯,n, j=1,⋯,d form a tight frame for Rn with the frame constant d.

This implies that any input signal f can be expanded using the convolution frame ϕi⊛ψj and the expansion coefficients ⟨f,ϕi⊛ψj⟩. Although the framelet coefficient matrix C in (LABEL:eq:C0) for general non-local and local bases is not as sparse as the matrix (LABEL:eq:Sigma) from SVD bases, Yin et al [74] showed that the framelet coefficients can be made sufficiently sparse by optimally learning Ψ for a given non-local basis Φ. Therefore, the choice of the non-local basis is one of the key factors determining the efficiency of the framelet expansion. In the following, several examples of non-local bases in [74] are discussed.
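Proposition 2 admits a direct numerical check. The following sketch is our own: `hankel_pinv` implements the unlifting Hd† by averaging the d wrap-around copies of each sample, and perfect reconstruction is verified for random orthonormal Φ and Ψ:

```python
import numpy as np

def hankel_wrap(f, d):
    n = len(f)
    return np.array([[f[(i + j) % n] for j in range(d)] for i in range(n)])

def hankel_pinv(Y):
    # unlifting: each sample f[k] appears d times in H_d(f); average those copies
    n, d = Y.shape
    return np.array([np.mean([Y[(k - j) % n, j] for j in range(d)]) for k in range(n)])

n, d = 16, 4
rng = np.random.default_rng(1)
f = rng.standard_normal(n)

# random orthonormal non-local basis Phi (n x n) and local basis Psi (d x d)
Phi = np.linalg.qr(rng.standard_normal((n, n)))[0]
Psi = np.linalg.qr(rng.standard_normal((d, d)))[0]

C = Phi.T @ hankel_wrap(f, d) @ Psi    # framelet coefficients (encoder)
f_rec = hankel_pinv(Phi @ C @ Psi.T)   # framelet synthesis (decoder)

assert np.allclose(f, f_rec)
```

The encoder/decoder structure of this check is precisely the two-layer pattern developed in Section 3.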

• SVD: From the singular value decomposition in (LABEL:eq:svd0), the SVD basis is constructed by augmenting the left singular vector basis U with an orthogonal complement matrix Uext:

 ΦSVD=[U Uext]

such that ΦSVD becomes an orthonormal matrix. Thanks to (LABEL:eq:sigma), this is the most energy-compacting basis. However, the SVD basis is input-signal dependent, and the calculation of the SVD is computationally expensive.

• Haar: The Haar basis comes from the Haar wavelet transform and is constructed as follows:

 Φ=[ΦlowΦhigh] ,

where the low-pass and high-pass operators are defined by

 Φlow=1√2⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣10⋯010⋯001⋯001⋯0⋮⋮⋱⋮00⋯100⋮1⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦,Φhigh=1√2⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣10⋯0−10⋯001⋯00−1⋯0⋮⋮⋱⋮00⋯100⋮−1⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦

Note that the number of non-zero elements in each column of the Haar basis is two, so one level of Haar decomposition does not represent a global interaction. However, by cascading the Haar bases, the interaction becomes global, resulting in a multi-resolution decomposition of the input signal. Moreover, the Haar basis is a useful global basis because it can sparsify piecewise-constant signals. Later, we will show that the average pooling operation is closely related to the Haar basis.

• DCT: The discrete cosine transform (DCT) basis is an interesting global basis proposed by Yin et al [74] due to its energy compaction property, as proven by the JPEG image compression standard. The DCT basis matrix is a fully populated dense matrix, which clearly represents a global interaction. To the best of our knowledge, the DCT basis has never been used in deep CNNs, which could be an interesting direction of research.

In addition to the non-local bases used in [74], we will also investigate the following non-local bases:

• Identity matrix: In this case, Φ=In×n, so there is no global interaction between the basis and the signal. Interestingly, this non-local basis is quite often used in CNNs that do not have a pooling layer. In this case, it is believed that the local structure of the signal is more important, and the local bases are trained such that they maximally capture the local correlation structure of the signal.

• Learned basis: In the extreme case where we do not have specific knowledge of the signal, the non-local basis can also be learned. However, care must be taken, since a learned non-local basis has size n×n, which quickly becomes very large for image processing applications; the required memory grows quadratically with the number of pixels, making the basis impractical to store or estimate for large images. However, if the input patch size is sufficiently small, this may be another interesting direction of research in deep CNNs.
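To make the Haar/pooling connection above concrete, the following sketch (ours, not from the paper) builds the one-level Haar basis from the low-pass and high-pass blocks shown earlier and checks that the low-pass coefficients are, up to the √2 scale, an average pooling of the signal:

```python
import numpy as np

n = 8
# one-level Haar basis: the low-pass block averages adjacent pairs,
# the high-pass block takes their differences (both scaled by 1/sqrt(2))
Phi_low  = np.kron(np.eye(n // 2), np.array([[1.0], [1.0]])) / np.sqrt(2)
Phi_high = np.kron(np.eye(n // 2), np.array([[1.0], [-1.0]])) / np.sqrt(2)
Phi = np.hstack([Phi_low, Phi_high])     # n x n orthonormal Haar basis

f = np.arange(n, dtype=float)

# low-pass Haar coefficients = sqrt(2) * average pooling with window 2
assert np.allclose(Phi_low.T @ f, np.sqrt(2) * f.reshape(-1, 2).mean(axis=1))
# frame condition: Phi Phi^T = I
assert np.allclose(Phi @ Phi.T, np.eye(n))
```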

## 3 Main Contributions: Deep Convolutional Framelets Neural Networks

In this section, which contains our main theoretical contribution, we show that the convolution framelets by Yin et al [74] are directly related to deep neural networks if we relax the conditions of the original convolution framelets to allow a multilayer implementation. The multi-layer extension of convolution framelets, which we call the deep convolutional framelets, can explain many important components of deep learning.

### 3.1 Deep Convolutional Framelet Expansion

While the original convolution framelets by Yin et al [74] exploit the advantages of low-rank Hankel matrix approaches using two bases, there are several limitations. First, their convolution framelets use only orthonormal bases. Second, the significance of a multi-layer implementation was not noticed. Here, we discuss an extension that relaxes these limitations. As will become clear, this is a basic building block of the deep convolutional framelets neural network.

###### Proposition 3.

Let Φ=[ϕ1⋯ϕm]∈Rn×m and Ψ=[ψ1⋯ψq]∈Rd×q denote the non-local and local basis matrices, respectively. Suppose, furthermore, that ~Φ=[~ϕ1⋯~ϕm]∈Rn×m and ~Ψ=[~ψ1⋯~ψq]∈Rd×q denote their dual basis matrices such that they satisfy the frame conditions:

 ~ΦΦ⊤ = m∑i=1~ϕiϕ⊤i=In×n, (38) Ψ~Ψ⊤ = q∑j=1ψj~ψ⊤j=Id×d . (39)

Then, for any input signal f∈Rn, we have

 f = 1dm∑i=1q∑j=1⟨f,ϕi⊛ψj⟩~ϕi⊛~ψj , (40)

or equivalently,

 f = 1dq∑j=1(~Φcj)⊛~ψj , (41)

where cj is the j-th column of the framelet coefficient matrix

 C = Φ⊤(f⊛¯¯¯¯Ψ) (42) = ⎡⎢ ⎢ ⎢⎣⟨f,ϕ1⊛ψ1⟩⋯⟨f,ϕ1⊛ψq⟩⋮⋱⋮⟨f,ϕm⊛ψ1⟩⋯⟨f,ϕm⊛ψq⟩⎤⎥ ⎥ ⎥⎦∈Rm×q . (43)

###### Proof.

Using the frame condition (LABEL:eq:phi0) and (LABEL:eq:ri0), we have

 Hd(f) = ~ΦΦ⊤Hd(f)Ψ~Ψ⊤=~ΦC~Ψ⊤ ,

where C denotes the framelet coefficient matrix computed by

 C=Φ⊤Hd(f)Ψ=Φ⊤(f⊛¯¯¯¯Ψ)

and its (i,j)-th element is given by

 cij=ϕ⊤iHd(f)ψj =⟨f,ϕi⊛ψj⟩

where we use (LABEL:eq:inner) for the last equality. Furthermore, using (LABEL:eq:recon1) and (LABEL:eq:invfilter), we have

 f=H†d(Hd(f)) = H†d(~ΦC~Ψ⊤) = 1dq∑j=1(~Φcj)⊛~ψj = 1dm∑i=1q∑j=1⟨f,ϕi⊛ψj⟩~ϕi⊛~ψj

This concludes the proof.

Note that the so-called perfect reconstruction (PR) condition represented by (LABEL:eq:frame) can be equivalently studied using:

 f=H†d(~Φ(Φ⊤Hd(f)Ψ)~Ψ⊤) . (44)
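The PR condition above also holds for redundant, non-orthonormal bases as long as the frame conditions are met. The sketch below (our own, not from the paper) constructs dual bases via one convenient choice, ~Φ=(ΦΦ⊤)⁻¹Φ and ~Ψ=(ΨΨ⊤)⁻¹Ψ, and verifies perfect reconstruction:

```python
import numpy as np

def hankel_wrap(f, d):
    n = len(f)
    return np.array([[f[(i + j) % n] for j in range(d)] for i in range(n)])

def hankel_pinv(Y):
    # unlifting: average the d wrap-around copies of each sample
    n, d = Y.shape
    return np.array([np.mean([Y[(k - j) % n, j] for j in range(d)]) for k in range(n)])

n, d, m, q = 12, 3, 16, 5            # redundant bases: m > n and q > d
rng = np.random.default_rng(3)
f = rng.standard_normal(n)

Phi = rng.standard_normal((n, m))
Psi = rng.standard_normal((d, q))
Phi_dual = np.linalg.inv(Phi @ Phi.T) @ Phi   # so that Phi_dual @ Phi.T = I
Psi_dual = np.linalg.inv(Psi @ Psi.T) @ Psi   # so that Psi @ Psi_dual.T = I

C = hankel_wrap(f, d)                          # lift the signal
C = Phi.T @ C @ Psi                            # encoder: framelet coefficients
f_rec = hankel_pinv(Phi_dual @ C @ Psi_dual.T) # decoder: perfect reconstruction
assert np.allclose(f, f_rec)
```

Here the choice of duals is one canonical option; any pair satisfying the frame conditions works equally well.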

Similarly, for a given matrix input Z∈Rn×p, the perfect reconstruction condition can be given by

 Z=H†d|p(~Φ(Φ⊤Hd|p(Z)Ψ)~Ψ⊤) . (45)

This is represented explicitly in the following proposition:

###### Proposition 4.

Let Φ,~Φ∈Rn×m denote the non-local basis and its dual, and Ψ,~Ψ∈Rpd×q denote the local basis and its dual, respectively, which satisfy the frame conditions:

 ~ΦΦ⊤ = m∑i=1~ϕiϕ⊤i=In×n, (46) Ψ~Ψ⊤ = q∑j=1ψj~ψ⊤j=Ipd×pd . (47)

Suppose, furthermore, that the local bases matrix have block structure:

 Ψ⊤=[Ψ⊤1⋯Ψ⊤p],~Ψ⊤=[~Ψ⊤1⋯~Ψ⊤p] (48)

with Ψk,~Ψk∈Rd×q, whose j-th columns are denoted by ψkj and ~ψkj, respectively. Then, for any matrix Z∈Rn×p, we have

 Z = 1dm∑i=1q∑j=1cij[~ϕi⊛~ψ1j⋯~ϕi⊛~ψpj] (49)

or equivalently,

 Z = 1d[∑qj=1(~Φcj)⊛~ψ1j⋯∑qj=1(~Φcj)⊛~ψpj] (50)

where cj is the j-th column of the framelet coefficient matrix

 C = Φ⊤(Z⊛¯¯¯¯Ψ) = p∑k=1⎡⎢ ⎢ ⎢⎣⟨zk,ϕ1⊛ψk1⟩⋯⟨zk,ϕ1⊛ψkq⟩⋮⋱⋮⟨zk,ϕm⊛ψk1⟩⋯⟨zk,ϕm⊛ψkq⟩⎤⎥ ⎥ ⎥⎦∈Rm×q .

###### Proof.

For a given , using the frame condition (LABEL:eq:phi0) and (LABEL:eq:ri0), we have

 Hd|p(Z) = ~ΦΦ⊤Hd|p(Z)Ψ~Ψ⊤=~ΦC~Ψ⊤ .

where C denotes the framelet coefficient matrix computed by

 C=Φ⊤Hd|p(Z)Ψ=Φ⊤(Z⊛¯¯¯¯Ψ)

and its (i,j)-th element is given by

 cij=ϕ⊤iHd|p(Z)ψj=p∑k=1⟨zk,ϕi⊛ψkj⟩

Furthermore, using (LABEL:eq:recon1), (LABEL:eq:invfilter) and (LABEL:eq:recon2), we have

 Z=H†d|p(Hd|p(Z)) = H†d|p(~ΦC~Ψ⊤) = [H†d(~ΦC~Ψ⊤1)⋯H†d(~ΦC~Ψ⊤p)] = 1d[∑qj=1(~Φcj)⊛~ψ1j⋯∑qj=1(~Φcj)⊛~ψpj] =

This concludes the proof.

###### Remark 1.

Compared to Proposition LABEL:prp:yin, Propositions LABEL:prp:1 and LABEL:prp:2 are more general, since they consider redundant and non-orthonormal non-local and local bases by allowing the relaxed conditions m≥n or q≥pd. The specific reason for allowing q≥pd is to investigate existing CNNs that have a large number of filter channels at lower layers. The redundant global basis with m≥n is also believed to be useful for future research, so Proposition LABEL:prp:1 is derived with this further extension in mind. However, since most existing deep networks use the condition m=n, we will mainly focus on this special case for the rest of the paper.

###### Remark 2.

For the given SVD in (LABEL:eq:svd0), the frame conditions (LABEL:eq:phi0) and (LABEL:eq:ri0) can be further relaxed to the following conditions:

 ~ΦΦ⊤=PR(U) , Ψ~Ψ⊤=PR(V)

due to the following matrix identity:

 Hd(f)=PR(U)Hd(f)PR(V)=~Φ(Φ⊤Hd(f)Ψ)~Ψ⊤.

In this case, the number of bases in the non-local and local basis matrices can be smaller than in Proposition LABEL:prp:1 and Proposition LABEL:prp:2, i.e. m≥r and q≥r. Therefore, a smaller number of bases still suffices for PR.

Finally, using Propositions LABEL:prp:1 and LABEL:prp:2, we will show that the convolution framelet expansion can be realized by two matched convolution layers, which have a striking similarity to neural networks with encoder-decoder structure [54]. Our main contribution is summarized in the following theorem.

###### Theorem 5 (Deep Convolutional Framelets Expansion).

Under the assumptions of Proposition LABEL:prp:2, we have the following decomposition of input