Deep Convolutional Framelets: A General Deep Learning Framework for Inverse Problems††thanks: The authors would like to thank Dr. Cynthia McCollough, the Mayo Clinic, the American Association of Physicists in Medicine (AAPM), and grants EB01705 and EB01785 from the National Institute of Biomedical Imaging and Bioengineering for providing the Low-Dose CT Grand Challenge data set. This work is supported by the National Research Foundation of Korea, grant numbers NRF-2016R1A2B3008104, NRF-2015M3A9A7029734, and NRF-2017M3C7A1047904.
Recently, deep learning approaches with various network architectures have achieved significant performance improvement over existing iterative reconstruction methods in various imaging problems. However, it is still unclear why these deep learning architectures work for specific inverse problems. Moreover, in contrast to the usual evolution of signal processing theory around classical theories, the links between deep learning and classical signal processing approaches such as wavelets, non-local processing, compressed sensing, etc, are not yet well understood. To address these issues, here we show that the long-searched-for missing link is the convolution framelets for representing a signal by convolving local and non-local bases. Convolution framelets were originally developed to generalize the theory of low-rank Hankel matrix approaches for inverse problems, and this paper further extends the idea so that we can obtain a deep neural network using multilayer convolution framelets with perfect reconstruction (PR) under rectified linear unit (ReLU) nonlinearity. Our analysis also shows that popular deep network components such as the residual block, redundant filter channels, and concatenated ReLU (CReLU) do indeed help to achieve PR, while the pooling and unpooling layers should be augmented with high-pass branches to meet the PR condition. Moreover, by changing the number of filter channels and biases, we can control the shrinkage behavior of the neural network. This discovery reveals the limitations of many existing deep learning architectures for inverse problems, and leads us to propose a novel theory for deep convolutional framelets neural networks. Using numerical experiments with various inverse problems, we demonstrate that our deep convolutional framelets network shows consistent improvement over existing deep architectures.
This discovery suggests that the success of deep learning comes not from the magical power of a black box, but rather from the power of a novel signal representation using a non-local basis combined with a data-driven local basis, which is indeed a natural extension of classical signal processing theory.
Key words. Convolutional neural network, framelets, deep learning, inverse problems, ReLU, perfect reconstruction condition
AMS subject classifications. Primary, 94A08, 97R40, 94A12, 92C55, 65T60, 42C40 ; Secondary, 44A12
Deep learning approaches have achieved tremendous success in classification problems as well as low-level computer vision problems such as segmentation, denoising, super-resolution [42, 61], etc. The theoretical origin of this success has been investigated [58, 63], and the exponential expressivity under a given network complexity (in terms of VC dimension or Rademacher complexity) has often been attributed to it. A deep network is also known to learn high-level abstractions/features of the data, similar to visual processing in the human brain, using multiple layers of neurons with non-linearity.
Inspired by the success of deep learning in low-level computer vision, several machine learning approaches have recently been proposed for image reconstruction problems. In X-ray computed tomography (CT), Kang et al provided the first systematic study of deep convolutional neural networks (CNN) for low-dose CT and showed that a deep CNN using directional wavelets is more efficient in removing low-dose related CT noise. Unlike these low-dose artifacts from reduced tube currents, the streaking artifacts originating from sparse projection views are global and therefore difficult to remove using conventional denoising CNNs [15, 52]. Han et al and Jin et al independently proposed residual learning using U-Net to remove the global streaking artifacts caused by sparse projection views. In MRI, Wang et al were the first to apply deep learning to compressed sensing MRI (CS-MRI). They trained a deep neural network from downsampled reconstruction images to learn a fully sampled reconstruction, and then used the deep learning result either as an initialization or as a regularization term in classical CS approaches. A multilayer perceptron was developed for accelerated parallel MRI [46, 45]. A deep network architecture using an unfolded iterative compressed sensing (CS) algorithm was also proposed . Instead of using handcrafted regularizers, the authors in  tried to learn a set of optimal regularizers. Domain adaptation from a sparse view CT network to projection reconstruction MRI was also proposed . These pioneering works have consistently demonstrated impressive reconstruction performance, often superior to existing iterative approaches. However, the more impressive empirical results we observe in image reconstruction problems, the more unanswered questions we encounter. For example, to the best of our knowledge, we do not have complete answers to the following questions that are critical to network design:
What is the role of the filter channels in convolutional layers?
Why do some networks need fully connected layers whereas others do not?
What is the role of nonlinearities such as the rectified linear unit (ReLU)?
Why do we need pooling and unpooling in some architectures?
What is the role of the by-pass connection or residual network?
How many layers do we need?
Furthermore, the most troubling issue for the signal processing community is that the link to classical signal processing theory is still not fully understood. For example, wavelets have been extensively investigated as an efficient signal representation theory for many image processing applications by exploiting the energy compaction property of wavelet bases. Compressed sensing theory [19, 14] has further extended the idea to demonstrate that accurate recovery is possible from undersampled data, if the signal is sparse in some frame and the sensing matrix is incoherent. Non-local image processing techniques such as non-local means, BM3D, etc, have also demonstrated impressive performance in many image processing applications. The links between these algorithms have been extensively studied over the last few years using various mathematical tools from harmonic analysis, convex optimization, etc. However, recent years have witnessed that a blind application of deep learning toolboxes sometimes provides even better performance than mathematics-driven classical signal processing approaches. Does this imply a dark age for signal processing, or a new opportunity?
Therefore, the main goal of this paper is to address these open questions. In fact, our paper is not the only attempt to address these issues. For instance, Papyan et al showed that once ReLU nonlinearity is employed, the forward pass of a network can be interpreted as a deep sparse coding algorithm. Wiatowski et al discuss the importance of pooling for networks, proving that it leads to translation invariance. Moreover, several works including  provided explanations for residual networks. The interpretation of a deep network in terms of an unfolded (or unrolled) sparse recovery algorithm is another prevailing view in the research community [24, 71, 25, 35]. However, this interpretation still does not answer several key questions: for example, why do we need multichannel filters? In this paper, we therefore depart from these existing views and propose a new interpretation of a deep network as a novel signal representation scheme. In fact, signal representation theories such as wavelets and frames have been active areas of research for many years, and Mallat and Bruna et al proposed the wavelet scattering network as a translation invariant and deformation-robust image representation. However, this approach does not have learning components as in existing deep learning networks.
Then, what is missing here? One of the most important contributions of our work is to show that the geometry of deep learning can be revealed by lifting a signal to a high dimensional space using a Hankel structured matrix. More specifically, many types of input signals that occur in signal processing can be factored into left and right bases as well as a sparse matrix with energy compaction properties when lifted into a Hankel structured matrix. This results in a frame representation of the signal using the left and right bases, referred to as the non-local and local base matrices, respectively. The origin of this nomenclature will become clear later. One of our novel contributions is the realization that the non-local basis determines the network architecture such as pooling/unpooling, while the local basis allows the network to learn convolutional filters. More specifically, application-specific domain knowledge leads to a better choice of a non-local basis, on which the local basis is learned to maximize the performance.
In fact, the idea of exploiting two bases via the so-called convolution framelets was originally proposed by Yin et al . However, the aforementioned close link to deep neural networks was not revealed in . Most importantly, we demonstrate for the first time that the convolution framelet representation can be equivalently represented as an encoder-decoder convolution layer, and that a multi-layer convolution framelet expansion is also feasible by relaxing the conditions in . Furthermore, we derive the perfect reconstruction (PR) condition under the rectified linear unit (ReLU). The mysterious role of the redundant multichannel filters can then be easily understood as an important tool to meet the PR condition. Moreover, by augmenting the local filters with paired filters of opposite phase, the ReLU nonlinearity disappears and the deep convolutional framelet becomes a linear signal representation. However, in order for the deep network to satisfy the PR condition, the number of channels should increase exponentially along the layers, which is difficult to achieve in practice. Interestingly, we can show that an insufficient number of filter channels results in a shrinkage behavior via a low rank approximation of an extended Hankel matrix, and this shrinkage behavior can be exploited to maximize network performance. Finally, to overcome the limitation of the pooling and unpooling layers, we introduce a multi-resolution analysis (MRA) for convolution framelets using a wavelet non-local basis as a generalized pooling/unpooling. We call this new class of deep networks using convolution framelets the deep convolutional framelets.
For a matrix , denotes the range space of and refers to the null space of . denotes the projection onto the range space of , whereas denotes the projection onto its orthogonal complement. The notation denotes a -dimensional vector of 1's. The identity matrix is referred to as . For a given matrix , the notation refers to the generalized inverse. The superscript of denotes the Hermitian transpose. Because we are mainly interested in real valued cases, is equivalent to the transpose . The inner product in matrix space is defined by where . For a matrix , denotes its Frobenius norm. For a given matrix , denotes its -th column, and is the -th element of . If a matrix is partitioned as with sub-matrices , then refers to the -th column of . A vector is referred to as the flipped version of a vector , i.e. its indices are reversed. Similarly, for a given matrix , the notation refers to a matrix composed of flipped vectors, i.e. For a block structured matrix , with a slight abuse of notation, we define as
Finally, Table LABEL:tbl:notation summarizes the notation used throughout the paper.
|non-local basis matrix at the encoder|
|non-local basis matrix at the decoder|
|local basis matrix at the encoder|
|local basis matrix at the decoder|
|encoder and decoder biases|
|-th non-local basis or filter at the encoder|
|-th non-local basis or filter at the decoder|
|-th local basis or filter at the encoder|
|-th local basis or filter at the decoder|
|convolutional framelet coefficients at the encoder|
|convolutional framelet coefficients at the decoder|
|convolutional filter length|
|number of input channels|
|number of output channels|
|single channel input signal, i.e.|
|a -channel input signal, i.e.|
|Hankel operator, i.e.|
|extended Hankel operator, i.e.|
|generalized inverse of Hankel operator, i.e.|
|generalized inverse of an extended Hankel operator, i.e.|
|left singular vector matrix of an (extended) Hankel matrix|
|right singular vector matrix of an (extended) Hankel matrix|
|singular value matrix of an (extended) Hankel matrix|
2 Mathematics of Hankel matrix
Since the Hankel structured matrix is the key component in our theory, this section discusses various properties of the Hankel matrix that will be extensively used throughout the paper.
2.1 Hankel matrix representation of convolution
Hankel matrices arise repeatedly from many different contexts in signal processing and control theory, such as system identification , harmonic retrieval, array signal processing , subspace-based channel identification , etc. A Hankel matrix can also be obtained from a convolution operation , which is of particular interest in this paper. Here, to avoid special treatment of the boundary condition, our theory is mainly derived using the circular convolution.
Let and . Then, a single-input single-output (SISO) convolution of the input and the filter can be represented in a matrix form:
where is a wrap-around Hankel matrix:
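The construction above can be sketched numerically. The following Python/NumPy snippet (the helper name `hankel_circ` and the exact index convention are our own assumptions for illustration, not reproduced from the paper) builds a wrap-around Hankel matrix and verifies that multiplying it by the flipped filter reproduces a circular convolution up to a fixed index shift:

```python
import numpy as np

def hankel_circ(x, d):
    """Wrap-around Hankel matrix: row i holds x[i], x[i+1], ..., x[i+d-1] (indices mod n)."""
    n = len(x)
    return np.array([[x[(i + j) % n] for j in range(d)] for i in range(n)])

n, d = 8, 3
rng = np.random.default_rng(0)
x, psi = rng.standard_normal(n), rng.standard_normal(d)

H = hankel_circ(x, d)            # n x d lifting of the signal
y_mat = H @ psi[::-1]            # Hankel matrix times the flipped filter
# Reference: circular convolution (x * psi)[i] = sum_j x[(i - j) mod n] psi[j]
y_ref = np.array([sum(x[(i - j) % n] * psi[j] for j in range(d)) for i in range(n)])
# The matrix product reproduces the convolution up to a circular shift of d - 1 samples:
assert np.allclose(y_mat, np.roll(y_ref, -(d - 1)))
```

The residual shift is purely a bookkeeping artifact of where the Hankel window starts; other indexing conventions absorb it entirely.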
Similarly, a single-input multi-output (SIMO) convolution using filters can be represented by
On the other hand, multi-input multi-output (MIMO) convolution for the -channel input can be represented by
where and are the number of input and output channels, respectively; denotes the length - filter that convolves the -th channel input to compute its contribution to the -th output channel. By defining the MIMO filter kernel as follows:
the corresponding matrix representation of the MIMO convolution is then given by
where is a flipped block structured matrix in the sense of (LABEL:eq:block), and is an extended Hankel matrix by stacking Hankel matrices side by side:
For notational simplicity, we denote . Fig. LABEL:fig:hankel illustrates the procedure to construct an extended Hankel matrix from when the convolution filter length is 2.
Finally, as a special case of MIMO convolution for , the multi-input single-output (MISO) convolution is defined by
The SISO, SIMO, MIMO, and MISO convolutional operations are illustrated in Fig. LABEL:fig:conv(a)-(d).
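The MIMO case can be checked the same way: stacking the per-channel Hankel blocks side by side turns the entire multi-channel filtering into a single matrix product. A minimal sketch (variable names and the cross-correlation convention below are our own choices; circular convolution is the same computation with each filter flipped):

```python
import numpy as np

def hankel_circ(x, d):
    n = len(x)
    return np.array([[x[(i + j) % n] for j in range(d)] for i in range(n)])

n, d, p, q = 8, 3, 2, 4       # signal length, filter taps, input/output channels (toy sizes)
rng = np.random.default_rng(1)
X = rng.standard_normal((n, p))           # p-channel input signal
V = rng.standard_normal((p, q, d))        # V[i, j] filters input channel i into output j

# Reference: MIMO filtering channel by channel (written as circular cross-correlation)
Y_ref = np.zeros((n, q))
for j in range(q):
    for i in range(p):
        for t in range(n):
            Y_ref[t, j] += sum(X[(t + s) % n, i] * V[i, j, s] for s in range(d))

# The same output as one product with the extended Hankel matrix:
He = np.hstack([hankel_circ(X[:, i], d) for i in range(p)])    # n x (d * p)
W = np.vstack([V[i].T for i in range(p)])                      # (d * p) x q, stacked filters
assert np.allclose(He @ W, Y_ref)
```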
The extension to the multi-channel 2-D convolution operation for an image domain CNN (and multi-dimensional convolutions in general) is straightforward, since similar matrix vector operations can also be used. The only required change is the definition of the (extended) Hankel matrices, which are now defined as block Hankel matrices. Specifically, for a 2-D input with , the block Hankel matrix associated with filtering with filter is given by
Similarly, an extended block Hankel matrix from the -channel input image is defined by
Then, the output from the 2-D SISO convolution for a given image with 2-D filter can be represented by a matrix vector form:
where denotes the vectorization operation by stacking the column vectors of the 2-D matrix . Similarly, 2-D MIMO convolution for given input images with 2-D filter can be represented by a matrix vector form:
Therefore, by defining
the 2-D MIMO convolution can be represented by
Due to these similarities between 1-D and 2-D convolutions, we will therefore use the 1-D notation throughout the paper for the sake of simplicity; however, readers are advised that the same theory applies to 2-D cases.
In convolutional neural networks (CNNs), a particular form of multi-dimensional convolution is used. Specifically, to generate output channels from input channels, each channel output is computed by first convolving - 2D filters and - input channel images, and then applying the weighted sum to the outputs (which is often referred to as convolution). For 1-D signals, this operation can be written as
where denotes the 1-D weighting. Note that this is equivalent to a MIMO convolution, since we have
The aforementioned matrix vector operations using the extended Hankel matrix also describe the filtering operation (LABEL:eq:2dConv) in 2-D CNNs as shown in Fig. LABEL:fig:cnnConv.
Throughout the paper, we denote the space of the wrap-around Hankel structure matrices of the form in (LABEL:eq:hank) as , and an extended Hankel matrix composed of Hankel matrices of the form in (LABEL:eq:ehank) as . The basic properties of Hankel matrix used in this paper are described in Lemma LABEL:lem:calculus in Appendix LABEL:ap1. In the next section, we describe advanced properties of the Hankel matrix that will be extensively used in this paper.
2.2 Low-rank property of Hankel Matrices
One of the most intriguing features of the Hankel matrix is that it often has a low-rank structure, and its low-rankness is related to the sparsity in the Fourier domain (for the case of Fourier samples, it is related to the sparsity in the spatial domain) [72, 37].
Note that many types of image patches have sparsely distributed Fourier spectra. For example, as shown in Fig. LABEL:fig:flowchart(a), a smoothly varying patch usually has spectral content in the low-frequency regions, while the other frequency regions have very few spectral components. Similar spectral domain sparsity can be observed in the texture patch shown in Fig. LABEL:fig:flowchart(b), where the spectral components of the patch are determined by the spectrum of the patterns. For the case of an abrupt transition along an edge as shown in Fig. LABEL:fig:flowchart(c), the spectral components are mostly localized along the axis. In these cases, if we construct a Hankel matrix using the corresponding image patch, the resulting Hankel matrix is low-ranked . This property is extremely useful, as demonstrated by many applications [37, 34, 55, 48, 49, 36]. For example, this idea can be used for image denoising and deconvolution by modeling the underlying intact signals to have a low-rank Hankel structure, from which artifacts or blur components can be easily removed.
In order to understand this intriguing relationship, consider a 1-D signal, whose spectrum in the Fourier domain is sparse and can be modelled as the sum of Diracs:
where refer to the corresponding harmonic components in the Fourier domain. Then, the corresponding discrete time-domain signal is given by:
Suppose that we have a length- filter with the following z-transform representation:
Then, it is easy to see that
where and the last equality comes from (LABEL:eq:afilter) . Thus, the filter annihilates the signal , so it is referred to as the annihilating filter. Moreover, using the notation in (LABEL:eq:SISO), Eq. (LABEL:eq:annf) can be represented by
This implies that Hankel matrix is rank-deficient. In fact, the rank of the Hankel matrix can be explicitly calculated as shown in the following theorem:
Let denote the minimum length of the annihilating filters that annihilate the signal . Then, for a given Hankel structured matrix with , we have
where denotes a matrix rank.
Thus, if we choose a sufficiently large , the resulting Hankel matrix is low-ranked. This relationship is quite general, and Ye et al  further showed that the rank of the associated Hankel matrix is if and only if can be represented by
for some . If , then it is directly related to signals with a finite rate of innovation (FRI) . Thus, the low-rank Hankel matrix provides an important link between FRI sampling theory and compressed sensing, such that a sparse recovery problem can be solved using measurement domain low-rank interpolation .
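The low-rank property is easy to verify numerically. In the sketch below (the on-grid frequencies are arbitrary example values of our own), a signal whose spectrum contains only k = 3 harmonics yields a wrap-around Hankel matrix of rank exactly 3, even though the matrix itself is 32 x 8:

```python
import numpy as np

n, d, k = 32, 8, 3
freqs = np.array([2, 5, 11]) / n          # k distinct on-grid frequencies (example values)
t = np.arange(n)
x = sum(np.exp(2j * np.pi * f * t) for f in freqs)   # spectrum is a sum of k Diracs

# Wrap-around Hankel lifting of the signal
H = np.array([[x[(i + j) % n] for j in range(d)] for i in range(n)])
assert np.linalg.matrix_rank(H, tol=1e-8) == k       # rank k, although H is 32 x 8
```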
In , we also showed that the rank of the extended Hankel matrix in (LABEL:eq:ehank) is low when the multiple signals have the following structure:
such that the Hankel matrix has the following decomposition:
where is a wrap-around Hankel matrix, and for any with is defined by
Accordingly, the extended Hankel matrix has the following decomposition:
Due to the rank inequality , we therefore have the following rank bound:
Therefore, if the filter length is chosen such that the number of columns of the extended matrix is sufficiently large, i.e. , then the concatenated matrix becomes low-ranked.
Note that the low-rank Hankel matrix algorithms are usually performed in a patch-by-patch manner [38, 37]. It is also remarkable that this is similar to the current practice of deep CNNs for low-level computer vision applications, where the network input is usually given as a patch. Later, we will show that this is not a coincidence; rather, it suggests an important link between the low-rank Hankel matrix approach and a CNN.
2.3 Hankel matrix decomposition and the convolution framelets
A last but not least important property of the Hankel matrix is that its decomposition results in a framelet representation whose bases are constructed by the convolution of so-called local and non-local bases . More specifically, for a given input vector , suppose that the Hankel matrix with rank has the following singular value decomposition:
where and denote the left and right singular vector basis matrices, respectively; and is the diagonal matrix whose diagonal components contain the singular values. Then, by multiplying and to the left and right of the Hankel matrix, we have
Note that the -th element of is given by
where the last equality comes from (LABEL:eq:inner). Since the number of rows and columns of are and , respectively, the right-multiplied vector interacts locally with a neighborhood of the vector, whereas the left-multiplied vector has a global interaction with the entire -elements of the vector. Accordingly, (LABEL:eq:sigma) represents the strength of the simultaneous global and local interaction of the signal with the bases. Thus, we call and the non-local and local bases, respectively.
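The energy compaction of the SVD bases can be made concrete: for a signal with a 2-sparse spectrum, multiplying the Hankel matrix by the left and right singular vector matrices on the two sides leaves a diagonal coefficient matrix with only two non-zero entries. A small numerical sketch (the toy signal is our own example):

```python
import numpy as np

def hankel_circ(x, d):
    n = len(x)
    return np.array([[x[(i + j) % n] for j in range(d)] for i in range(n)])

n, d = 16, 4
t = np.arange(n)
x = np.cos(2 * np.pi * 2 * t / n)         # single on-grid sinusoid: Hankel rank 2

H = hankel_circ(x, d)
U, s, Vt = np.linalg.svd(H, full_matrices=False)
# Multiplying U^T and V on the two sides exposes the diagonal coefficient matrix:
C = U.T @ H @ Vt.T
assert np.allclose(C, np.diag(s), atol=1e-8)
assert int(np.sum(s > 1e-8)) == 2         # energy compaction: only 2 non-zero singular values
```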
This relation holds for arbitrary basis matrices and that are multiplied to the left and right of the Hankel matrix, respectively, to yield the coefficient matrix:
which represents the interaction of with the non-local basis and local basis . Using (LABEL:eq:C0) as expansion coefficients, Yin et al derived the following signal expansion, which they called the convolution framelet expansion :
Proposition 2 ().
Let and denote the -th and -th columns of the orthonormal matrices and , respectively. Then, for any -dimensional vector ,
Furthermore, with form a tight frame for with the frame constant .
This implies that any input signal can be expanded using the convolution frame and the expansion coefficients . Although the framelet coefficient matrix in (LABEL:eq:C0) for general non-local and local bases is not as sparse as (LABEL:eq:Sigma) from the SVD bases, Yin et al  showed that the framelet coefficients can be made sufficiently sparse by optimally learning for a given non-local basis . Therefore, the choice of the non-local basis is one of the key factors determining the efficiency of the framelet expansion. In the following, several examples of non-local bases in  are discussed.
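The perfect reconstruction claim behind this expansion can be checked end to end: lift the signal, compute the framelet coefficients with arbitrary orthonormal non-local and local bases, re-synthesize the Hankel matrix, and un-lift by averaging the wrap-around copies of each sample. The sketch below is ours (in particular, the `hankel_unlift` helper implementing the generalized inverse of the lifting):

```python
import numpy as np

def hankel_circ(x, d):
    n = len(x)
    return np.array([[x[(i + j) % n] for j in range(d)] for i in range(n)])

def hankel_unlift(H):
    """Generalized inverse of the lifting: average the d wrap-around copies of each sample."""
    n, d = H.shape
    x = np.zeros(n)
    for i in range(n):
        for j in range(d):
            x[(i + j) % n] += H[i, j]
    return x / d

n, d = 12, 4
rng = np.random.default_rng(2)
x = rng.standard_normal(n)

# Arbitrary orthonormal non-local basis Phi (n x n) and local basis Psi (d x d)
Phi = np.linalg.qr(rng.standard_normal((n, n)))[0]
Psi = np.linalg.qr(rng.standard_normal((d, d)))[0]

C = Phi.T @ hankel_circ(x, d) @ Psi       # framelet coefficients (encoder)
x_rec = hankel_unlift(Phi @ C @ Psi.T)    # synthesis and un-lifting (decoder)
assert np.allclose(x_rec, x)              # perfect reconstruction
```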
SVD: From the singular value decomposition in (LABEL:eq:svd0), the SVD basis is constructed by augmenting the left singular vector basis with an orthogonal matrix :
such that . Thanks to (LABEL:eq:sigma), this is the most energy compacting basis. However, the SVD basis is input-signal dependent and the calculation of the SVD is computationally expensive.
Haar: The Haar basis comes from the Haar wavelet transform and is constructed as follows:
where the low-pass and high-pass operators are defined by
Note that each column of the Haar basis has only two non-zero elements, so one level of Haar decomposition does not represent a global interaction. However, by cascading the Haar bases, the interaction becomes global, resulting in a multi-resolution decomposition of the input signal. Moreover, the Haar basis is a useful global basis because it can sparsify piecewise constant signals. Later, we will show that the average pooling operation is closely related to the Haar basis.
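The Haar construction and its link to average pooling can be made explicit. In the sketch below (helper names are ours), the one-level Haar basis is orthonormal, and its low-pass branch alone reproduces average pooling up to a factor of sqrt(2):

```python
import numpy as np

def haar_basis(n):
    """One-level Haar basis: [low-pass | high-pass] columns, each block n x n/2."""
    m = n // 2
    L = np.zeros((n, m))
    Hh = np.zeros((n, m))
    for k in range(m):
        L[2 * k, k] = L[2 * k + 1, k] = 1 / np.sqrt(2)
        Hh[2 * k, k], Hh[2 * k + 1, k] = 1 / np.sqrt(2), -1 / np.sqrt(2)
    return np.hstack([L, Hh])

Phi = haar_basis(8)
assert np.allclose(Phi @ Phi.T, np.eye(8))        # orthonormal non-local basis

# The low-pass branch alone is average pooling (up to the sqrt(2) scaling):
x = np.arange(8.0)
pooled = Phi[:, :4].T @ x
assert np.allclose(pooled / np.sqrt(2), x.reshape(4, 2).mean(axis=1))
```

Discarding the high-pass columns is exactly what a pooling layer without a high-pass branch does, which is why such layers cannot satisfy the PR condition.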
DCT: The discrete cosine transform (DCT) basis is an interesting global basis proposed by Yin et al due to its energy compaction property, demonstrated by the JPEG image compression standard. The DCT basis matrix is a fully populated dense matrix, which clearly represents a global interaction. To the best of our knowledge, the DCT basis has never been used in deep CNNs, which could be an interesting direction of research.
In addition to the non-local bases used in , we will also investigate the following non-local bases:
Identity matrix: In this case, , so there is no global interaction between the basis and the signal. Interestingly, this non-local basis is quite often used in CNNs that do not have a pooling layer. In this case, it is believed that the local structure of the signal is more important, and the local bases are trained so that they maximally capture the local correlation structure of the signal.
Learned basis: In the extreme case where we do not have specific knowledge of the signal, the non-local bases can also be learned. However, care must be taken, since the learned non-local basis has size , which quickly becomes very large for image processing applications. For example, if one is interested in processing a (i.e. ) image, the required memory to store the learnable non-local basis becomes , which is not possible to store or estimate. However, if the input patch size is sufficiently small, this may be another interesting direction of research in deep CNNs.
3 Main Contributions: Deep Convolutional Framelets Neural Networks
In this section, which presents our main theoretical contribution, we will show that the convolution framelets by Yin et al  are directly related to deep neural networks if we relax the conditions of the original convolution framelets to allow a multilayer implementation. The multi-layer extension of convolution framelets, which we call the deep convolutional framelets, can explain many important components of deep learning.
3.1 Deep Convolutional Framelet Expansion
While the original convolution framelets by Yin et al  exploit the advantages of the low-rank Hankel matrix approaches using two bases, there are several limitations. First, their convolution framelets use only orthonormal bases. Second, the significance of a multi-layer implementation was not noticed. Here, we discuss an extension that relaxes these limitations. As will become clear, this is a basic building step toward a deep convolutional framelets neural network.
Let and denote the non-local and local bases matrices, respectively. Suppose, furthermore, that and denote their dual bases matrices such that they satisfy the frame condition:
Then, for any input signal , we have
where is the -th column of the framelet coefficient matrix
Using the frame condition (LABEL:eq:phi0) and (LABEL:eq:ri0), we have
where denotes the framelet coefficient matrix computed by
and its -th element is given by
where we use (LABEL:eq:inner) for the last equality. Furthermore, using (LABEL:eq:recon1) and (LABEL:eq:invfilter), we have
This concludes the proof.
Note that the so-called perfect reconstruction (PR) condition represented by (LABEL:eq:frame) can be equivalently studied using:
Similarly, for a given matrix input , the perfect reconstruction condition can be given by
which is explicitly represented in the following proposition:
Let denote the non-local basis and its dual, and denote the local basis and its dual, respectively, which satisfy the frame condition:
Suppose, furthermore, that the local basis matrices have the block structure:
with whose -th column is represented by and , respectively. Then, for any matrix , we have
where is the -th column of the framelet coefficient matrix
For a given , using the frame condition (LABEL:eq:phi0) and (LABEL:eq:ri0), we have
where denotes the framelet coefficient matrix computed by
and its -th element is given by
Furthermore, using (LABEL:eq:recon1), (LABEL:eq:invfilter) and (LABEL:eq:recon2), we have
This concludes the proof.
Compared to Proposition LABEL:prp:yin, Propositions LABEL:prp:1 and LABEL:prp:2 are more general, since they consider redundant and non-orthonormal non-local and local bases by allowing the relaxed conditions or . The specific reason for is to investigate existing CNNs that have a large number of filter channels in the lower layers. The redundant global basis with is also believed to be useful for future research, so Proposition LABEL:prp:1 is derived in consideration of this further extension. However, since most existing deep networks use the condition , we will mainly focus on this special case for the rest of the paper.
For the given SVD in (LABEL:eq:svd0), the frame conditions (LABEL:eq:phi0) and (LABEL:eq:ri0) can be further relaxed to the following conditions:
due to the following matrix identity:
In this case, the number of bases for the non-local and local basis matrices can be smaller than in Proposition LABEL:prp:1 and Proposition LABEL:prp:2, i.e. and . Therefore, a smaller number of bases still suffices for PR.
Finally, using Propositions LABEL:prp:1 and LABEL:prp:2, we will show that the convolution framelet expansion can be realized by two matched convolution layers, which bears a striking similarity to neural networks with an encoder-decoder structure . Our main contribution is summarized in the following theorem.
Theorem 5 (Deep Convolutional Framelets Expansion).
Under the assumptions of Proposition LABEL:prp:2, we have the following decomposition of input