# Enhanced Signal Recovery via Sparsity Inducing Image Priors

Hojjat Seyed Mousavi

The Pennsylvania State University


A Dissertation  in

Electrical Engineering

by

Hojjat Seyed Mousavi

Submitted in Partial Fulfillment

of the Requirements

for the Degree of

Doctor of Philosophy

December 2017

The dissertation of Hojjat Seyed Mousavi was reviewed and approved by the following:
• Vishal Monga, Associate Professor of Electrical Engineering, Dissertation Advisor, Chair of Committee

• William E. Higgins, Distinguished Professor of Electrical Engineering

• Constantino Lagoa, Professor of Electrical Engineering

• Robert T. Collins, Associate Professor of Computer Science and Engineering

• Kultegin Aydin, Professor of Electrical Engineering and Department Head

Signatures are on file in the Graduate School.

## Abstract

Parsimony in signal representation is a topic of active research. Sparse signal processing and representation is the outcome of this line of research; it has many applications in information processing and has shown significant improvement in real-world applications such as recovery, classification, clustering, super-resolution, etc. This vast influence of sparse signal processing on real-world problems raises a significant need for novel sparse signal representation algorithms that yield more robust systems. In such algorithms, a few open challenges remain: (a) efficiently imposing sparsity on signals in a way that captures the structure of the underlying signal, and (b) the design of tractable algorithms that can recover signals under the aforementioned sparse models.

In this dissertation, we view the signal recovery problem from these viewpoints. First, we address the sparse signal recovery problem from a Bayesian perspective, where sparsity is enforced on reconstruction coefficients via probabilistic priors. In particular, we focus on a variant of the spike and slab prior, which is known to be the gold standard for encouraging sparsity. The optimization problem resulting from this model has broad applicability in recovery and regression problems and is known to be a hard non-convex problem whose existing solutions involve simplifying assumptions and/or relaxations. We propose an approach called Iterative Convex Refinement (ICR) that aims to solve the aforementioned optimization problem directly, allowing for greater generality in the sparse structure. Essentially, ICR solves a sequence of convex optimization problems such that the sequence of solutions converges to a sub-optimal solution of the original hard optimization problem. We propose two versions of our algorithm: (a) an unconstrained version, and (b) a version with a non-negativity constraint on the sparse coefficients, which may be required in some real-world problems. Many signal processing problems in computer vision and recognition can benefit from ICR, including face recognition in surveillance applications, object detection and classification in video, image compression and recovery, image quality enhancement, etc.

On the other hand, one of the most significant challenges in image processing is the enhancement of image quality. To address this challenge, we aim to recover signals by using prior structural knowledge about them. In particular, we pose physically meaningful probabilistic priors to promote sparsity on reconstruction coefficients or design parameters of the problem. This has a variety of applications in signal and image processing including, but not limited to, regression, denoising, inverse problems, demosaicking and super-resolution. In particular, sparsity constrained single image super-resolution (SR) has been of much recent interest. A typical approach involves sparsely representing patches in a low resolution input image via a dictionary of example LR patches, and then using the coefficients of this representation to generate the high-resolution output via an analogous HR dictionary.

In this dissertation, we propose a twofold extension of the SR problem: (1) We extend sparsity-based SR to multiple color channels by taking prior knowledge about the color information into account. Edge similarities amongst RGB color bands are exploited as cross-channel correlation constraints. These additional constraints lead to a new optimization problem, which is not easily solvable; however, a tractable solution is proposed to solve it efficiently. Moreover, to fully exploit the complementary information among color channels, a dictionary learning method is also proposed specifically to learn color dictionaries that encourage edge similarities. (2) We tackle the super-resolution problem from a deep learning standpoint and provide deep network structures designed for super-resolution. A step further in this line of research is to integrate sparsifying priors into deep networks and investigate their impact on performance, especially in the absence of abundant training data.


## Acknowledgments

I am grateful to my parents, who have provided me with moral and emotional support throughout my life. Without their support I wouldn’t be here, and I am very grateful to have them by my side. I am also grateful to my siblings, especially my brother, Mehdi, who was a role model for me and has supported me along the way.

With special gratitude to Vishal Monga, my advisor, who helped me through my PhD. It was fantastic to have the opportunity to work with him.

And finally, last but by no means least, thanks to everyone in the iPAL, especially Mahesh, Tiep, Tiantong and John, with whom I worked closely and enjoyed every moment of it. It was great sharing the laboratory with all of you during the last five years.

## Dedication

To my beloved parents

## Chapter 1 Introduction

In many domains of data processing, a commonly encountered problem is that of signal representation. This is a well studied problem and has a vast number of applications in information processing, in general. In this dissertation, we are particularly interested in signal recovery under sparsity inducing image priors and its application to some real world problems.

In recent years, sparse modeling and sparse representation of signals have received a lot of attention and are very well studied by researchers in the signal processing and statistics communities. This chapter aims to review some of the relevant ideas in sparse signal representation and give a comprehensive understanding of its applications in signal and image processing, machine learning and computer vision. Finally, we will motivate the contributions of this dissertation.

### 1.1 Sparsity in Signal Processing: Overview

Parsimony in signal representation has been shown to have value in many applications. It is now a well-known fact that a large class of signals can be represented in a compact manner with respect to some basis. Among those bases, the most widely applicable ones are the Fourier, wavelet and cosine bases. In fact, this idea has been leveraged successfully in commercial signal compression algorithms such as JPEG 2000 [2]. Motivated by the prevalence of sparsity in human perception [3], research has been conducted on sparse coding of signals and images for a variety of applications, namely signal and image acquisition [4], data compression [2] and modeling [5].

A sparse vector or matrix is essentially one with only a small number of non-zero elements; the number of non-zero elements is called the sparsity level. Signals or their corresponding vector/matrix representations can be sparse in a variety of different ways, each of which has its own structure and can be applied in real-world problems to induce different kinds of structure and priors on the signals. Fig. 1.1 shows a few examples of how differently sparse signals can be structured. These structures have different motivations and backgrounds that make each of them suitable for a specific scenario.

#### 1.1.1 Compressive Sensing

This section surveys the theory of compressive sensing (CS) in signal processing. CS theory [4] asserts that certain signals and images are recoverable from far fewer samples or measurements than traditional methods use. Compressive sensing aims to acquire and reconstruct a signal through optimization, by finding solutions to underdetermined linear systems. To make this possible, CS relies significantly on the fact that the underlying signal is sparse in some proper basis.

In fact, compressive sensing, also known as sparse sensing, can be viewed as a formalization of the quest for economical signal acquisition and representation, and it has witnessed a surge of activity over the past decade [4, 6]. The sparse sensing problem is described as follows:

 y=Φx+n, (1.1)

where x ∈ ℝ^n is the signal, y ∈ ℝ^m is the measurement (m ≪ n), Φ ∈ ℝ^{m×n} is the measurement matrix, and n is additive noise. The sparse signal recovery problem is then posed as an optimization problem of the form:

 x̂ = argmin_x ∥x∥_0  s.t.  ∥y − Φx∥_2 ≤ ϵ, (1.2)

where ∥x∥_0 is the ℓ_0 pseudo-norm, equal to the number of non-zero elements of the vector x. Under certain conditions, the ℓ_0-norm can be relaxed to the ℓ_1-norm, leading to a more tractable, convex optimization formulation. It is well known that additional structure among the coefficients of a sparse signal can be incorporated and leveraged for better sparse recovery [7]. For instance, a power-law relationship can be determined among the sorted coefficients of a compressible signal. Or, as ascertained in [7], specific probability distributions such as the Student’s t-distribution and the generalized Pareto distribution are well known to be compressible. In fact, incorporating such compressible priors in a probabilistic framework for the sparse signal recovery problem has already been established and will be reviewed in later chapters.
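The ℓ_1 relaxation above can be solved with simple first-order methods. As an illustrative sketch (not code from this dissertation), the following iterative shrinkage-thresholding (ISTA) routine solves the Lagrangian form min_x ½∥y − Φx∥_2² + λ∥x∥_1; the function name and all parameter values are hypothetical choices for the demo:

```python
import numpy as np

def ista(y, Phi, lam=0.1, step=None, n_iter=500):
    """ISTA for min_x 0.5*||y - Phi @ x||_2^2 + lam*||x||_1 (sketch)."""
    m, n = Phi.shape
    if step is None:
        # 1/L, where L = ||Phi||_2^2 is the Lipschitz constant of the gradient
        step = 1.0 / np.linalg.norm(Phi, 2) ** 2
    x = np.zeros(n)
    for _ in range(n_iter):
        # gradient step on the data-fidelity term
        x = x - step * Phi.T @ (Phi @ x - y)
        # soft-thresholding: proximal operator of lam*||.||_1
        x = np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)
    return x

# hypothetical demo: recover a 5-sparse vector from 40 noiseless measurements
rng = np.random.default_rng(0)
n, m, k = 100, 40, 5
Phi = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = Phi @ x_true
x_hat = ista(y, Phi, lam=0.01, n_iter=2000)
```

With a small λ and noiseless measurements, the recovered vector closely matches the true sparse signal, illustrating why the convex relaxation is used in place of (1.2).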

The work in this area so far has focused on the application of sparsity for signal reconstruction problems, where a signal is recovered from a set of fewer measurements. This idea can also be extended to classification by learning class-specific dictionaries, the novelty arising from the careful choice of dictionaries that best capture the inherent sparsity in a signal representation whilst having discriminative power.

#### 1.1.2 Sparse Representation-based Classification

Classification is a commonly encountered problem in the information processing domain. Typically in a classification problem, we have access to data belonging to two or more different classes or groups, and the challenge is to develop an effective automatic scheme for assigning a new object to its correct group. An indicative, non-exhaustive list of classification problems includes: object recognition, medical imaging, video tracking, pedestrian detection in video sequences, bioinformatics, biometric applications for security (fingerprint/face recognition), communications, document classification and automatic target recognition.

A few common characteristics of real-world classification problems, some of which we aim to address throughout this dissertation, are as follows:

• high-dimensional data,

• data acquisition in the presence of noise,

• incorporating prior knowledge into the training model.

A significant contribution to the development of algorithms for image classification that addressed some of the above-mentioned challenges to some extent is the recent sparse representation-based classification (SRC) framework [8], which exploits the discriminative capability of sparse representations. The key idea is that of designing class-specific dictionaries in combination with the analytical framework of compressive sensing. Given a sufficiently diverse collection of training images from each class, any image from a specific class can be approximately represented as a linear combination of training images from the same class. Therefore, if we have training images of all classes and form a basis or dictionary based on them, any new and unseen test image has a sparse representation with respect to such an overcomplete dictionary. It is worth mentioning that the sparsity assumption holds due to the class-specific design of dictionaries as well as the assumption of the linear representation model.

Suppose we have sets of training image vectors (vectorized images) from multiple classes, collected into dictionaries A_1, …, A_K. Let there be N_i training images (each in ℝ^n) corresponding to each of the K predefined classes. The collection of all training images is expressed using the matrix A, called the dictionary,

 A = [A_1 A_2 … A_K], (1.3)

where A_i ∈ ℝ^{n×N_i}, with N = Σ_i N_i the total number of training images. A new test image y ∈ ℝ^n can now be represented as a sparse linear combination of the training atoms as

 y ≃ A_1 x_1 + A_2 x_2 + … + A_K x_K = Ax, (1.4)

where x = [x_1; x_2; …; x_K] is ideally expected to be a sparse vector (i.e., only a few entries in x are nonzero). Motivated by the CS framework, the classifier seeks the sparsest representation by solving the following problem:

 x̂ = argmin_x ∥x∥_1  subject to  ∥y − Ax∥_2 ≤ ϵ, (1.5)

and then class assignment is simply carried out using the partial reconstruction error terms as follows:

 identity(y) = argmin_i ∥y − A δ_{C_i}(x̂)∥_2, (1.6)

where δ_{C_i}(x̂) keeps all the elements of x̂ that are associated with class C_i and zeroes out everything else.

Experiments have shown that even in the presence of severe distortions such as occlusion and pixel noise, the resulting sparse representation reveals the class-association of y with a high degree of accuracy [8, 9]. The robustness of the SRC method to real-world image distortions has resulted in its widespread application in practical classification tasks.
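The SRC pipeline described above can be sketched compactly. The following illustrative implementation (not the authors' code) solves (1.5) in its Lagrangian form via iterative soft-thresholding and then applies the partial-reconstruction rule (1.6); `src_classify` and its parameter values are hypothetical:

```python
import numpy as np

def src_classify(y, dicts, lam=0.05, n_iter=500):
    """SRC sketch: l1-regularized coding over stacked class dictionaries,
    then assignment by smallest partial reconstruction error."""
    A = np.hstack(dicts)                      # A = [A_1 A_2 ... A_K]
    step = 1.0 / np.linalg.norm(A, 2) ** 2    # safe step size for ISTA
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = x - step * A.T @ (A @ x - y)
        x = np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)
    # partial reconstruction error per class, keeping only that class's coefficients
    errors, start = [], 0
    for Ai in dicts:
        xi = np.zeros_like(x)
        xi[start:start + Ai.shape[1]] = x[start:start + Ai.shape[1]]
        errors.append(np.linalg.norm(y - A @ xi))
        start += Ai.shape[1]
    return int(np.argmin(errors))

# hypothetical demo: two random classes; the test vector lies in class 1's span
rng = np.random.default_rng(1)
A1 = rng.standard_normal((30, 10)); A1 /= np.linalg.norm(A1, axis=0)
A2 = rng.standard_normal((30, 10)); A2 /= np.linalg.norm(A2, axis=0)
y = A1[:, :3] @ np.array([1.0, -0.5, 0.8])
predicted = src_classify(y, [A1, A2])
```

Since y is built from class 1's atoms, the sparse code concentrates on that block of the dictionary and the class-1 partial error is the smaller of the two.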

With the emergence of SRC as a powerful and robust classifier, modifications to the original framework have been considered. One simple extension minimizes the sum of the ℓ_2-norms of the sub-vectors x_i, which results in a group sparse regularizer [10]. This idea is very similar to the group Lasso proposed in [11]. Along the same line, to enforce sparsity within each group, the hierarchical Lasso was proposed in [12]. Some other regularizers that promote group structures and extend SRC have been proposed in [13, 14].

#### 1.1.3 Sparse Priors for Signal Representation

Prior information about the structure of signals often leads to significant performance improvements in many information analysis and signal processing applications. In fact, introducing regularizers in the analytical formulation of problems is motivated by prior knowledge about the signal and the fact that it can help in obtaining solutions to ill-posed problems [15]. Prior information also manifests itself in other forms, such as constraints or probability distributions on signals. Further, the success of sparse representation-based methods and their many extensions rests on the validity of the linear representation model, which itself relies on the assumption that enough diversity is captured by a large enough set of training samples. In practice, however, most real-world applications have the limitation that rich training data is not available.

An attempt to alleviate this problem is to use prior information about the underlying signal. The main idea is that the sparse optimization problem in (1.5) can be interpreted as maximizing the posterior probability of x given the observation y, under the prior assumption that the coefficients of x are i.i.d. Laplacian. The Laplacian prior encourages coefficients to take values close to zero, thereby approximating sparsity. This framework motivates the interpretation of sparsity as prior information for signal reconstruction. In fact, this is a particular example of a broader framework: the Bayesian one. In the Bayesian perspective, signal comprehension can be enhanced by incorporating contextual information as priors. One great benefit of leveraging the Bayesian framework is the probabilistic prediction of sparse coefficients and the automatic estimation of model parameters [15].
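Under the stated assumptions (Gaussian noise of variance σ² and i.i.d. Laplacian coefficients p(x_i) ∝ exp(−λ|x_i|)), the MAP interpretation sketched above can be made explicit:

```latex
\hat{\mathbf{x}}
  = \arg\max_{\mathbf{x}} \, p(\mathbf{x}\mid\mathbf{y})
  = \arg\min_{\mathbf{x}} \big[ -\log p(\mathbf{y}\mid\mathbf{x}) - \log p(\mathbf{x}) \big]
  = \arg\min_{\mathbf{x}} \, \frac{1}{2\sigma^{2}}\,
      \lVert \mathbf{y} - A\mathbf{x} \rVert_2^2
      + \lambda \lVert \mathbf{x} \rVert_1 ,
```

so the Laplacian rate λ plays the role of the ℓ_1 regularization weight, and the resulting problem is the unconstrained counterpart of (1.5).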

Sparsity is in fact a first-order description of structure in signals. However, often there is a priori structure inherent to sparse signals that can be exploited for better representation, compression or modeling. In fact, we can categorize sparse recovery approaches into two different categories. One category uses sparsity inducing regularizers in conjunction with a reconstruction error term to obtain a sparse approximation of the signal; examples of such techniques are [16, 17, 18, 8]. The other falls into the category of model-based compressive sensing [19], where a set of priors is introduced on top of the sparse signal to capture both sparsity and structure [19, 20, 21, 22, 23].

As an illustration, a connected tree structure can be enforced on wavelet coefficients to capture their multi-scale dependence [19]. Other such structured prior models have also been integrated into the CS framework [21, 20, 24]. The wavelet-based Bayesian approach in [21] employs a “spike-and-slab” prior [25, 26, 27, 28], which is a mixture model of two components representing the zero and nonzero coefficients, and dependencies are encouraged in the mixing weights across resolution scales.

Introducing priors for capturing sparsity is a particular example of Bayesian inference where the signal recovery can be enhanced by exploiting contextual and prior information. As suggested by [29, 30], sparsity can be induced via solving the following optimization problem:

 max_x P_x(x)  subject to  ∥y − Ax∥_2 < ϵ, (1.7)

where P_x(x) is the probability density function of x that simultaneously captures the sparsity and structure (joint distribution of coefficients) of x. In comparison, the standard CS recovery (1.2) captures only the sparse nature of x. The most common example is the i.i.d. Laplacian prior, which is equivalent to ℓ_1-norm minimization [30, 22]. The choice of sparsity promoting priors that can capture the joint distribution of coefficients (both structure and sparsity) is a challenging task. Examples of such priors are the Laplacian [22], generalized Pareto [7] and spike and slab [31] priors.

#### 1.1.4 Other Applications

Sparse representation methods have a vast number of applications, and we mentioned a few of them in the previous subsections. However, applications of sparsity-based methods are not limited to the aforementioned ones. They have also been widely used in Automatic Target Recognition (ATR) in many different modalities, such as Synthetic Aperture Radar (SAR) imagery, HyperSpectral Imaging (HSI), etc. [32, 33, 34, 35, 36].

Sparsity has been playing an important role in many fields such as acoustic signal processing [37], image processing [38] and recognition [39]. It has also been widely used for clustering and subspace selection [40]. Recently, sparsity has been applied to the problem of single image super-resolution as well [41, 42]. In this problem, a single low-resolution image is provided, and with the aid of sparse representation methods a high-resolution image is reconstructed. We will discuss this area later in this dissertation.

### 1.2 Information Fusion

Advances in sensing technology have facilitated the easy acquisition of multiple different measurements of the same underlying physical phenomena. Often there is complementary information embedded in these different measurements which can be exploited for improved performance. For example, in face recognition or action recognition we could have different views of a person’s face captured under different illumination conditions or with different facial postures [43, 44, 45, 46, 47]. In automatic target recognition, multiple SAR (synthetic aperture radar) views are acquired [34]. The use of complementary information from different color image channels in medical imaging has been demonstrated in [48, 49]. In border security applications, multi-modal sensor data such as voice sensor measurements, infrared images and seismic measurements are fused [50] for activity classification tasks. The prevalence of such a rich variety of applications where multi-sensor information manifests in different ways is a key motivation for one of our contributions in this dissertation.

#### 1.2.1 Collaborative Data

Motivated by the availability of information from different sources, many algorithms have been proposed to use the joint and complementary information from different sources or observations. For instance, in classification scenarios such as multi-view face recognition, multi-view SAR ATR or hyperspectral classification, this joint information is readily available and one can exploit it for better classification accuracy. The SRC model is extended to incorporate this additional information by enforcing a common support set of training images for the T correlated test images y_1, …, y_T:

 Y = [y_1 y_2 ⋯ y_T] = [Ax_1 Ax_2 ⋯ Ax_T] = A [x_1 x_2 ⋯ x_T] = AX. (1.8)

The simplest way of enforcing structure on X is to assume that the vectors x_1, …, x_T all have non-zero entries at the same locations, albeit with different weights. This leads to the recovery of a sparse matrix X with only a few nonzero rows,

 X̂ = argmin_X ∥Y − AX∥_F  subject to  ∥X∥_{row,0} ≤ K_0, (1.9)

where ∥X∥_{row,0} denotes the number of non-zero rows of X and ∥·∥_F is the Frobenius norm. The greedy Simultaneous Orthogonal Matching Pursuit (SOMP) [18, 51] algorithm has been proposed to solve this non-convex problem. A more general version of such joint sparsity models has been proposed by Zhang et al. in [43]. In their so-called Joint Dynamic Sparse Representation-based Classification (JDSRC), they propose a joint dynamic sparsity model instead of the rigid row-sparsity model discussed previously. Yuan et al. proposed a joint classification framework using information from different modalities in [44]. Recently, Srinivas et al. also proposed a simultaneous sparsity model for histopathological image classification that uses the complementary color information in RGB channels [49]. Other joint or simultaneous sparse representation methods have been proposed for regression and multi-view ATR [34, 11].
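The greedy SOMP recursion for (1.9) is short enough to sketch. The following is an illustrative implementation under the noiseless model Y = AX (not the reference code from [18, 51]); the function name and demo parameters are hypothetical:

```python
import numpy as np

def somp(Y, A, K0):
    """Simultaneous Orthogonal Matching Pursuit (sketch).
    Greedily selects K0 atoms of A shared by all columns of Y, so the
    recovered coefficient matrix X has at most K0 nonzero rows."""
    residual = Y.copy()
    support = []
    for _ in range(K0):
        # atom with the largest aggregate correlation across all observations
        correlations = np.linalg.norm(A.T @ residual, axis=1)
        correlations[support] = 0.0  # never reselect an atom
        support.append(int(np.argmax(correlations)))
        # joint least-squares refit on the current support
        X_s, *_ = np.linalg.lstsq(A[:, support], Y, rcond=None)
        residual = Y - A[:, support] @ X_s
    X = np.zeros((A.shape[1], Y.shape[1]))
    X[support, :] = X_s
    return X

# hypothetical demo: a 3-row-sparse X observed through a random dictionary
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 50))
A /= np.linalg.norm(A, axis=0)
rows = rng.choice(50, size=3, replace=False)
X_true = np.zeros((50, 4))
signs = np.where(rng.random((3, 4)) < 0.5, -1.0, 1.0)
X_true[rows] = rng.uniform(2.0, 3.0, size=(3, 4)) * signs
Y = A @ X_true
X_hat = somp(Y, A, K0=3)
```

Because each iteration refits all columns of Y jointly on the shared support, the recovered X is row-sparse by construction, which is exactly the structure that constraint (1.9) encodes.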

Collaborative data may come from different sources depending on the type of problem we are dealing with. For instance, in signal recovery, multiple measurements of the same phenomenon can provide different sources of data, or multiple sensor measurements may provide complementary data that can be used for recovery. One typical form of collaborative data that provides complementary information is multi-spectral, and in particular color, information. Color information from separate RGB channels has been successfully applied to medical image classification [49] and image enhancement [52]. Later in this dissertation we exploit color information as strong prior information for image super-resolution and illustrate its benefits on this task.

### 1.3 Goals and Contributions

Following this brief overview of the key ideas that will constitute this thesis, it is now appropriate to state the goals of this research as follows:

• Use prior information on sparse signals and model parameters for boosting performance of both recovery and enhancement problems.

• Extend the application of joint information to broader applications and exploit color information in optical images for super-resolution.

• Extend the application of image priors to deep learning frameworks by exploiting signals’ structure using wavelets and regularizing network structures using prior knowledge.

A snapshot of the main contributions of this dissertation is presented next. Each contribution approaches the issue of signal recovery from different viewpoints using multiple instances of prior knowledge, with the deployment of a rich variety of algorithmic tools from signal processing and machine learning.

Chapter 2 addresses the sparse signal recovery problem in a Bayesian framework where sparsity is enforced on reconstruction coefficients via probabilistic priors. In particular, we focus on a variant of the spike and slab prior to encourage sparsity. The optimization problem resulting from this model has broad applicability in recovery and regression problems and is known to be a hard non-convex problem whose existing solutions involve simplifying assumptions and/or relaxations. We propose an approach called Iterative Convex Refinement (ICR) that aims to solve the aforementioned optimization problem directly, allowing for greater generality in the sparse structure. Essentially, ICR solves a sequence of convex optimization problems such that the sequence of solutions converges to a sub-optimal solution of the original hard optimization problem. We propose two versions of our algorithm: (a) an unconstrained version, and (b) a version with a non-negativity constraint on the sparse coefficients, which may be required in some real-world problems. Experimental validation is performed both on synthetic data and on a real-world image recovery problem, illustrating the merits of ICR over state-of-the-art alternatives.

Chapter 3 serves as an introduction to single image super-resolution for the rest of the dissertation. Sparsity constrained single image super-resolution (SR) has been of much recent interest. A typical approach involves sparsely representing patches in a low-resolution (LR) input image via a dictionary of example LR patches, and then using the coefficients of this representation to generate the high-resolution (HR) output via an analogous HR dictionary. However, most existing sparse representation methods for super-resolution focus on the luminance channel information and do not capture interactions between color channels. In this contribution, we extend sparsity-based super-resolution to multiple color channels by taking color information into account as strong prior information. Edge similarities amongst RGB color bands are exploited as cross-channel correlation constraints. These additional constraints lead to a new optimization problem which is not easily solvable; however, a tractable solution is proposed to solve it efficiently. Moreover, to fully exploit the complementary information among color channels, a dictionary learning method is also proposed specifically to learn color dictionaries that encourage edge similarities. Merits of the proposed method over the state-of-the-art are demonstrated both visually and quantitatively using image quality metrics.

Finally, Chapter 4 extends the single image super-resolution problem to deep learning methods, where recent advances have seen a surge of such approaches. Invariably, a network, e.g. a deep convolutional neural network (CNN) or auto-encoder, is trained to learn the relationship between low- and high-resolution image patches. Most deep learning based image super-resolution methods work on spatial domain data and aim to reconstruct pixel values as the output of the network. In the first part of this chapter, we explore the advantages of exploiting transform domain data in the SR task, especially for capturing more structural information in images to avoid artifacts. Motivated by this, and by the promising performance of VDSR and residual nets in the super-resolution task, we propose our Deep Wavelet network for Super-Resolution (DWSR). Using wavelet coefficients encourages activation sparsity in the middle layers as well as the output layer. In addition, wavelet coefficients decompose the image into sub-bands which provide structural information depending on the types of wavelets used. For example, Haar wavelets provide vertical, horizontal and diagonal edges in wavelet sub-bands, which can be used to infer more structural information about the image. Essentially, our network uses complementary structural information from other sub-bands to predict the desired high-resolution structure in each sub-band.

On the other hand, deep learning methods have shown promising performance in super-resolution and many other tasks in the presence of abundant training data, meaning thousands or millions of training data points are available. However, they suffer in cases where training data is not readily available. In the second part of this chapter, and as our final contribution in this dissertation, we investigate the performance of such deep structures in low training data scenarios and show that their performance drops significantly. We look for remedies to this performance degradation by exploiting prior knowledge about the problem. This could be in terms of prior knowledge about the structure of images, or inter-pixel dependencies. In particular, we propose to use natural image priors for image super-resolution and demonstrate that, in low training data scenarios, image priors enhance the recovery of high-resolution images despite much less training data being available.

## Chapter 2 Contribution I: Iterative Convex Refinement for Sparse Signal Recovery

### 2.1 Introduction

Sparse signal approximation and compressive sensing (CS) have recently gained considerable interest both in signal and image processing and in statistics. Sparsity is often a natural assumption in inverse problems, and sparse reconstruction or representation has a variety of applications in image/signal classification [8, 49, 53, 48, 54, 47, 55, 56], dictionary learning [57, 58, 59, 60, 61, 36], signal recovery [62, 17], image denoising and inpainting [63] and MRI image reconstruction [23]. Typically, sparse models assume that a signal can be efficiently represented as a sparse linear combination of atoms in a given or learned dictionary [8, 12]. In other words, from the CS viewpoint, a sparse signal can be recovered from a smaller number of observations [21, 20].

A typical sparse reconstruction algorithm aims to recover a sparse signal x ∈ ℝ^n from a set of fewer measurements y ∈ ℝ^m (m < n) according to the following model:

 y=Ax+n, (2.1)

where A ∈ ℝ^{m×n} is the measurement matrix (dictionary) and n models the additive Gaussian noise with variance σ².

In recent years, many sparse recovery algorithms have been proposed, including but not limited to: sparsity promoting optimization problems involving different regularizers such as the ℓ_1 norm or the ℓ_0 pseudo-norm, greedy algorithms [17, 64, 16, 65, 66], Bayesian methods [20, 67, 68], and general sparse approximation algorithms such as SpaRSA and ADMM [62, 69, 70, 71].

In Bayesian sparse recovery, the choice of priors plays a key role in promoting sparsity and improving performance. Examples of such priors are the Laplacian [22], generalized Pareto [30] and spike and slab [31] priors. In particular, we focus on the setup of Yen et al. [72], who employ a variant of the spike and slab prior to encourage sparsity. The optimization problem resulting from this model has broad applicability in recovery and regression problems and is known to be a hard non-convex problem whose existing solutions involve simplifying assumptions and/or relaxations [72, 53, 23]. However, in this work we aim to solve the resulting optimization problem directly in its general form. Our approach can be seen as a logical evolution of reweighted methods [73, 74]. Motivated by this, the main contributions of our work are as follows: (1) We propose a novel Iterative Convex Refinement (ICR) for sparse signal recovery. Essentially, the sequence of solutions from these convex problems approaches a sub-optimal solution of the hard non-convex problem. (2) We propose two versions of ICR: (a) an unconstrained version, and (b) a version with a non-negativity constraint on the sparse coefficients, which may be required in some real-world problems such as image recovery. (3) Finally, we perform experimental validation on both synthetic data and a realistic image recovery problem, which reveals the benefits of ICR over other state-of-the-art sparse recovery methods. Further, we compare the solutions of various sparse recovery methods against the global solution for a small-scale problem, and remarkably the proposed ICR finds the most agreement with the global solution. Finally, convergence analysis is provided in support of the proposed ICR algorithm.

### 2.2 Proposed Setup for Sparse Signal Recovery

Introducing priors for capturing sparsity is a particular example of Bayesian inference, where signal recovery can be enhanced by exploiting contextual and prior information. As suggested by [29, 30], sparsity can be induced by solving the following optimization problem:

 max_x P_x(x)   subject to   ||y − Ax||_2 < ϵ   (2.2)

where P_x(x) is the probability density function of x that captures sparsity. The most common example is the i.i.d. Laplacian prior, which is equivalent to ℓ_1-norm minimization [30, 22]. A well-suited sparsity-promoting prior is the spike and slab prior, which is widely used in sparse recovery and Bayesian inference for variable selection and regression [25, 21, 23, 75]. In fact, it is acknowledged that the spike and slab prior is the gold standard for inducing sparsity in Bayesian inference [76]. Using this prior, every coefficient x_i is modeled as a mixture of two densities as follows:

 x_i ∼ (1 − w_i) δ_0 + w_i P_i(x_i)   (2.3)

where δ_0 is the Dirac delta function at zero (the spike) and P_i(x_i) (the slab) is an appropriate prior distribution for the nonzero values of x_i (e.g., Gaussian). The weight w_i controls the structural sparsity of the signal: if w_i is chosen close to zero, x_i tends to remain zero; on the contrary, choosing w_i close to 1 makes P_i the dominant distribution, encouraging x_i to take a non-zero value.
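To make the role of w_i concrete, the mixture in (2.3) can be sampled directly. The sketch below (NumPy; a unit-variance Gaussian slab and the specific w values are illustrative assumptions, not the chapter's settings) shows how the fraction of nonzero coefficients tracks w:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_spike_slab(w, slab_std=1.0, size=10000, rng=rng):
    """Draw samples from (1 - w) * delta_0 + w * N(0, slab_std^2)."""
    active = rng.random(size) < w                 # Bernoulli(w) selector
    values = rng.normal(0.0, slab_std, size)      # slab draws
    return np.where(active, values, 0.0)          # spike forces exact zeros

for w in (0.05, 0.5, 0.95):
    x = sample_spike_slab(w)
    print(f"w = {w:.2f}: fraction nonzero = {np.mean(x != 0):.3f}")
```

The empirical fraction of nonzero samples concentrates around w, which is exactly the sense in which w controls structural sparsity.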

Optimization Problem (Hierarchical Bayesian Framework): Any inference from the posterior density of this model will be ill-defined because the Dirac delta function is unbounded. Ways to handle this issue include approximations [76], such as approximating the spike term with a narrow Gaussian [26] or approximating the whole posterior with a product of Gaussian and Bernoulli density functions [77, 78, 23, 79, 80]. In this work, we focus on the setup of Yen et al. [72], which uses an approximate spike and slab prior for inducing sparsity on x. Inspired by Bayesian compressive sensing (CS) [20, 75], we employ a hierarchical Bayesian framework for signal recovery. More precisely, the Bayesian formulation is as follows:

 y | A, x, γ, σ² ∼ N(Ax, σ²I)   (2.4)
 x | γ, λ, σ² ∼ ∏_{i=1}^p [ γ_i N(0, σ²λ⁻¹) + (1 − γ_i) I(x_i = 0) ]   (2.5)
 γ | κ ∼ ∏_{i=1}^p Bernoulli(κ_i)   (2.6)

where N denotes the Gaussian distribution. Also note that in (2.5) each coefficient of x is modeled based on the framework proposed in [72]. Since γ_i is a binary variable, conditioned on γ_i = 0, x_i is equal to 0 with probability one; on the other hand, conditioned on γ_i = 1, x_i follows a normal distribution with mean 0 and variance σ²λ⁻¹. Motivated by Yen et al.'s maximum a posteriori (MAP) estimation technique [72, 53], the optimal (x*, γ*) are obtained by the following MAP estimate:

 (x*, γ*) = argmax_{x,γ} f(x, γ | A, y, κ, λ, σ²).   (2.7)
###### Proposition 1.

The MAP estimation above is equivalent to the following minimization problem:

 (x*, γ*) = argmin_{x,γ}  ||y − Ax||²_2 + λ||x||²_2 + ∑_{i=1}^p ρ_i γ_i   (2.8)

where ρ_i = σ² log(2πσ²λ⁻¹) + σ² log((1 − κ_i)/κ_i)².

###### Proof.

To perform the MAP estimation, note that the posterior probability is given by:

 f(x, γ | A, y, λ, κ) ∝ f(y | A, x, γ, σ²) f(x | γ, σ², λ) f(γ | κ).   (2.9)

The optimal (x*, γ*) are obtained by MAP estimation as:

 (x*, γ*) = argmin_{x,γ} { −2 log f(x, γ | A, y, λ, κ) }.   (2.10)

We now separately evaluate each term on the right hand side of (2.9). According to (2.4) we have:

 f(y | A, x, γ, σ²) = (2πσ²)^(−q/2) exp{ −(1/(2σ²)) (y − Ax)^T (y − Ax) }
 ⇒ −2 log f(y | A, x, γ, σ²) = q log σ² + q log(2π) + (1/σ²) ||y − Ax||².

Since γ_i is an indicator variable and only takes the values 0 and 1, we can rewrite (2.5) in the following form:

 x | γ, λ, σ² ∼ ∏_{i=1}^p ( N(0, σ²λ⁻¹) )^{γ_i} · ( I(x_i = 0) )^{1−γ_i}

Therefore

 f(x | γ, σ², λ) = ∏_{i=1}^p ( (2πσ²λ⁻¹)^(−1/2) )^{γ_i} exp( −γ_i x_i² / (2σ²λ⁻¹) ) ( I(x_i = 0) )^{1−γ_i}
 = (2πσ²λ⁻¹)^(−(1/2) ∑_{i=1}^p γ_i) exp{ −(1/(2σ²λ⁻¹)) ∑_{i=1}^p γ_i x_i² } ∏_{i=1}^p I(x_i = 0)^{1−γ_i}
 = (2πσ²λ⁻¹)^(−(1/2) ∑_{i=1}^p γ_i) exp( −||x||²_2 / (2σ²λ⁻¹) ) ∏_{i=1}^p I(x_i = 0)^{1−γ_i}
 ⇒ −2 log f(x | γ, σ², λ) = ||x||²_2 / (σ²λ⁻¹) + log(2πσ²λ⁻¹) ∑_{i=1}^p γ_i − 2 ∑_{i=1}^p (1 − γ_i) log I(x_i = 0),

where the third equality uses ∑_i γ_i x_i² = ||x||²_2, since x_i = 0 whenever γ_i = 0.

In fact, the final term on the right-hand side evaluates to zero: when γ_i = 1 the exponent 1 − γ_i is zero, and when γ_i = 0 we have x_i = 0, so I(x_i = 0) = 1 and log I(x_i = 0) = 0.

Finally (2.6) implies that

 f(γ | κ) = ∏_{i=1}^p κ_i^{γ_i} (1 − κ_i)^{1−γ_i}
 ⇒ −2 log f(γ | κ) = −2 ∑_{i=1}^p [ log κ_i^{γ_i} + log(1 − κ_i)^{1−γ_i} ]
 = −2 ∑_{i=1}^p [ γ_i log κ_i + (1 − γ_i) log(1 − κ_i) ]
 = −2 ∑_{i=1}^p [ γ_i log( κ_i / (1 − κ_i) ) + log(1 − κ_i) ]
 = ∑_{i=1}^p γ_i log( (1 − κ_i)/κ_i )² − 2 ∑_{i=1}^p log(1 − κ_i).

Plugging all these expressions back into (2.10) and neglecting constant terms, we obtain:

 (x*, γ*) = argmin_{x,γ}  q log σ² + (1/σ²) ||y − Ax||² + ||x||²_2 / (σ²λ⁻¹)   (2.11)
 + log(2πσ²λ⁻¹) ∑_{i=1}^p γ_i + ∑_{i=1}^p γ_i log( (1 − κ_i)/κ_i )²

Essentially, for fixed σ², multiplying through by σ² and neglecting constant terms, the cost function reduces to:

 L(x, γ) = ||y − Ax||²_2 + λ||x||²_2 + ∑_{i=1}^p ρ_i γ_i   (2.12)

where ρ_i = σ² log(2πσ²λ⁻¹) + σ² log((1 − κ_i)/κ_i)². ∎
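As a numerical sanity check on Proposition 1, σ² · (−2 log f(x, γ | ·)) and L(x, γ) should differ only by a constant independent of (x, γ). The sketch below verifies this on random feasible pairs; the problem sizes, parameter values, and the expression ρ_i = σ²[log(2πσ²/λ) + 2 log((1 − κ_i)/κ_i)] are taken from the derivation above, but the specific numbers are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
q, p = 8, 5
sigma2, lam = 0.5, 2.0
A = rng.normal(size=(q, p))
y = rng.normal(size=q)
kappa = rng.uniform(0.1, 0.9, size=p)

# rho_i as suggested by the proof (slab normalizer (2*pi*sigma^2/lambda)^(1/2))
rho = sigma2 * (np.log(2 * np.pi * sigma2 / lam)
                + 2 * np.log((1 - kappa) / kappa))

def neg2_log_posterior(x, g):
    """-2 log f(x, gamma | A, y, ...) up to the proportionality constant in (2.9)."""
    val = q * np.log(sigma2) + q * np.log(2 * np.pi)
    val += np.sum((y - A @ x) ** 2) / sigma2          # likelihood term
    val += lam * np.sum(x ** 2) / sigma2              # slab (Gaussian) term
    val += np.log(2 * np.pi * sigma2 / lam) * np.sum(g)
    val += np.sum(g * 2 * np.log((1 - kappa) / kappa))
    val -= 2 * np.sum(np.log(1 - kappa))              # Bernoulli constant
    return val

def cost_L(x, g):
    """The MAP cost (2.12)."""
    return np.sum((y - A @ x) ** 2) + lam * np.sum(x ** 2) + np.sum(rho * g)

# The gap sigma^2 * (-2 log f) - L must be the same constant for every (x, gamma)
gaps = []
for _ in range(5):
    g = (rng.random(p) < 0.5).astype(float)
    x = g * rng.normal(size=p)          # feasible: x_i = 0 wherever gamma_i = 0
    gaps.append(sigma2 * neg2_log_posterior(x, g) - cost_L(x, g))
print("spread of the gaps (should be ~0):", np.ptp(gaps))
```

A spread near machine precision confirms that the two objectives have the same minimizers.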

Remark: Note that we are particularly interested in solving (2.8), which has broad applicability in recovery and regression [72], image classification and restoration [81, 53], and sparse coding [67, 82]. This is a non-convex mixed-integer program involving the binary indicator variables γ_i and is not easily solvable using conventional optimization algorithms. It is worth mentioning that this is a more general formulation than the frameworks proposed in [53] or [72], where the authors simplified the optimization problem by assuming the same ρ_i = ρ for every coefficient x_i. This assumption changes the last term in (2.8) to ρ ∑_i γ_i, and the resulting optimization is solved in [72] using majorization-minimization methods. Further, a relaxation of this term to the ℓ_1 norm reduces the problem to the well-known Elastic Net [83]. The framework in (2.8) therefore offers greater generality in capturing the sparsity of x. As an example, consider the scenario in a reconstruction or classification problem where some dictionary (training) columns are more important than others [84]. It is then possible to encourage their contribution to the linear model by assigning higher values to the corresponding κ_i's (equivalently, smaller ρ_i's), which in turn makes it more likely that the corresponding coefficients become activated.
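Because (2.8) is a mixed-integer program, its global solution can be found for tiny problems by enumerating all 2^p support patterns γ and solving a ridge problem on each support, which is essentially what a branch-and-bound solver automates. The sketch below makes that combinatorial structure explicit; the dimensions, noise level, and penalty values are illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
q, p, lam = 20, 6, 0.05
A = rng.normal(size=(q, p)) / np.sqrt(q)        # roughly unit-norm columns
x_true = np.zeros(p); x_true[[1, 4]] = [1.0, -0.8]
y = A @ x_true + 0.01 * rng.normal(size=q)
rho = np.full(p, 0.02)                          # per-coefficient sparsity penalties

best_cost, best_x, best_g = np.inf, None, None
for g in itertools.product([0, 1], repeat=p):   # enumerate all 2^p supports
    S = [i for i in range(p) if g[i] == 1]
    x = np.zeros(p)
    if S:
        AS = A[:, S]                            # given the support, (2.8) is ridge
        x[S] = np.linalg.solve(AS.T @ AS + lam * np.eye(len(S)), AS.T @ y)
    cost = np.sum((y - A @ x) ** 2) + lam * np.sum(x ** 2) + np.dot(rho, g)
    if cost < best_cost:
        best_cost, best_x, best_g = cost, x, np.array(g)

print("global support:", np.flatnonzero(best_g))
```

The exponential cost of this enumeration is exactly why direct, scalable approaches such as ICR are needed.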

We also emphasize that the goal of ICR is not to recover the sparsest possible solution but to recover a meaningful sparse signal. The word "meaningful" can be interpreted differently in different contexts; for images, for example, ICR is required to generate outputs that have structure. Note that the signal we recover is not only sparse, as captured by the spike and slab prior, but may also satisfy additional, physically meaningful properties required by the application. For example, a smoothness regularizer (prior) on images yields outputs that are visually pleasing.

### 2.3 Iterative Convex Refinement (ICR)

We first develop a solution to (2.8) for the case when the entries of x are non-negative. Then, we propose our method in its general form with no constraints.

The central idea of the proposed Iterative Convex Refinement (ICR) algorithm (see Algorithm 1) is to generate a sequence of optimization problems that refines the solution of the previous iteration by solving a modified convex problem. At iteration n of ICR, the indicator variable γ_i is replaced with the normalized ratio |x_i| / |μ_i^(n−1)|, and the convex optimization problem in (2.13) is solved, which is a simple quadratic program with a non-negativity constraint. Note that μ_i^(n) is intuitively the average value of the optimal x_i's obtained from iteration 1 up to n and is rigorously defined in (2.15). The motivation for this substitution is that, if the sequence of solutions converges to a point x*, we also expect μ^(n) to converge to x*. Essentially, ICR solves a sequence of convex quadratic programs whose solutions converge to a sub-optimal solution of (2.8).

To generalize ICR to the unconstrained case, a simple modification is needed: at each iteration, (2.14) is solved instead of (2.13). Note that (2.14) is still convex, and we solve it by the alternating direction method of multipliers [70]. Again, we expect the ratio |x_i| / |μ_i^(n)| to converge to the value of the optimal γ_i and the result of ICR to be a sub-optimal solution of (2.8). ICR in both its versions is summarized in Algorithm 1. (The Matlab code for ICR is made available online at http://signal.ee.psu.edu/ICR/ICRpage.htm.)
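Since (2.13)–(2.15) are not reproduced here, the following sketch fills in one plausible instantiation of the unconstrained ICR loop: each subproblem is a quadratic objective plus a weighted ℓ_1 term, solved below by ISTA rather than the ADMM used in the text, with μ maintained as a running average of the iterates. The ridge initialization and all numeric constants are assumptions, not the authors' exact choices.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the weighted l1 term."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def icr(A, y, rho, lam=0.1, n_outer=50, n_inner=200, tol=1e-8):
    """Unconstrained ICR sketch: at outer iteration n, gamma_i is replaced by
    |x_i| / |mu_i^(n-1)| and the resulting convex subproblem is minimized."""
    p = A.shape[1]
    Q = A.T @ A + lam * np.eye(p)
    Aty = A.T @ y
    step = 1.0 / (2.0 * np.linalg.eigvalsh(Q)[-1])  # 1 / Lipschitz of smooth part
    x = np.linalg.solve(Q, Aty)                     # ridge initialization (assumption)
    mu = x.copy()                                   # running average of iterates
    for n in range(1, n_outer + 1):
        w = rho / np.maximum(np.abs(mu), 1e-12)     # per-coefficient l1 weights
        for _ in range(n_inner):                    # ISTA on the convex subproblem
            x = soft_threshold(x - step * 2.0 * (Q @ x - Aty), step * w)
        mu_new = (n * mu + x) / (n + 1)             # refine the average
        if np.linalg.norm(mu_new - mu) < tol:       # stopping criterion (Sec. 2.3)
            return x
        mu = mu_new
    return x

# Toy demo: recover a 2-sparse vector.
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 10)) / np.sqrt(30)
x_true = np.zeros(10); x_true[[2, 7]] = [1.0, -1.0]
y = A @ x_true + 0.01 * rng.normal(size=30)
x_hat = icr(A, y, rho=np.full(10, 0.02))
print("estimated support:", np.flatnonzero(np.abs(x_hat) > 1e-3))
```

Note how the reweighting is self-reinforcing: a coefficient whose average μ_i shrinks receives a larger ℓ_1 weight and is pushed further toward zero, mirroring the pruning behavior analyzed in Lemma 1 below.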

To analyze the convergence properties of ICR, we first define the function f_n(x) as follows:

 f_n(x) = x^T (A^T A + λI) x − 2 y^T A x + ∑_{i=1}^p ( ρ_i / |μ_i^(n−1)| ) |x_i|   (2.16)

which is another form of the function minimized at each iteration of ICR. For the rest of our analysis, without loss of generality we assume that |x_i| ≤ 1, |y_i| ≤ 1, and the columns of A have unit norm. With this definition, and assuming ϵ is a constant such that ϵ ≤ ρ_j / (2(√q + p)) for all j, we propose the following two lemmas:

###### Lemma 1.

If |μ_j^(n₀)| < ϵ for some iteration n₀, then x_j^(n₀+1) = 0.

###### Proof.

Assume that for a specific index j and iteration n₀ we have |μ_j^(n₀)| < ϵ. Then at the next iteration the cost function to be minimized is:

 f_{n₀+1}(x) = x^T (A^T A + λI) x − 2 y^T A x + ∑_{i=1}^p ( ρ_i / |μ_i^(n₀)| ) |x_i|   (2.17)

Assume that the minimizer of (2.17) is x^(n₀+1); we can rewrite it in the following form:

 x^(n₀+1) = x_b + x_j e_j   (2.18)

where e_j is the standard basis vector with a one at component j and zeros elsewhere, x_j is the j-th element of x^(n₀+1), and x_b is equal to x^(n₀+1) except at element j, where it is zero. We prove that if |μ_j^(n₀)| < ϵ, then x_j = 0.

 f_{n₀+1}(x_b) = x_b^T (A^T A + λI) x_b − 2 y^T A x_b + ∑_{i=1}^p ( ρ_i / |μ_i^(n₀)| ) |x_{b,i}|
 f_{n₀+1}(x^(n₀+1)) = (x_b + x_j e_j)^T (A^T A + λI) (x_b + x_j e_j) − 2 y^T A (x_b + x_j e_j) + ∑_{i≠j} ( ρ_i / |μ_i^(n₀)| ) |x_{b,i}| + ( ρ_j / |μ_j^(n₀)| ) |x_j|

Therefore, their difference is:

 f_{n₀+1}(x^(n₀+1)) − f_{n₀+1}(x_b) = x_j² e_j^T (A^T A + λI) e_j + 2 x_j x_b^T (A^T A + λI) e_j   (2.19)
 − 2 x_j y^T A e_j + ( ρ_j / |μ_j^(n₀)| ) |x_j|
 = |x_j| ( |x_j| (A^T A + λI)_{jj} + ρ_j / |μ_j^(n₀)| )
 − 2 x_j ( y^T A e_j − x_b^T (A^T A + λI) e_j )

We want to show that this difference is always positive unless x_j = 0, which means x_j must be zero in order for f_{n₀+1}(x^(n₀+1)) to be the minimum. To do so, we show that the following statements are true for nonzero x_j:

 | 2 x_j ( y^T A e_j − x_b^T (A^T A + λI) e_j ) | < |x_j| ( |x_j| (A^T A + λI)_{jj} + ρ_j / |μ_j^(n₀)| )
 ⇔ 2 | y^T A e_j − x_b^T A^T A e_j − λ x_b^T e_j | < |x_j| (A^T A + λI)_{jj} + ρ_j / |μ_j^(n₀)|
 ⇔ 2 | y^T A e_j − x_b^T A^T A e_j | < (1 + λ) |x_j| + ρ_j / |μ_j^(n₀)|
 ⇔ 2 | (y − A x_b)^T A e_j | < (1 + λ) |x_j| + ρ_j / |μ_j^(n₀)|   (2.20)

In the above derivations, we used the facts that x_b^T e_j = 0 and (A^T A + λI)_{jj} = 1 + λ, since the columns of A have unit norm and the j-th entry of x_b is zero. On the other hand, the Cauchy–Schwarz inequality implies that

 2 | (y − A x_b)^T A e_j | ≤ 2 ||y − A x_b|| · ||A e_j|| = 2 ||y − A x_b|| ≤ 2 ( ||y|| + ||A x_b|| ) ≤ 2 ( √q + p )

The last inequality holds because we assumed that the magnitudes of the entries of x and y do not exceed one. Also, since we assumed |μ_j^(n₀)| < ϵ, by the definition of ϵ we have:

 ρ_j / |μ_j^(n₀)| > ρ_j / ϵ ≥ 2 ( √q + p )

Therefore, (2.20) is always true, since the right-hand side is always greater than the left-hand side. This implies that (2.19) is positive for nonzero x_j, and hence we must have x_j = 0; otherwise, it would contradict the fact that f_{n₀+1}(x^(n₀+1)) is the minimum. Note that these are loose bounds and in practice they are easily satisfied; for example, ϵ is practically very small. ∎

This lemma also implies that if |μ_j^(n)| < ϵ for some n, then x_j will remain zero for all following iterations. Remark: This is a very interesting result and is potentially useful in updating x. During the iterations, the moment one of the |μ_j^(n)|'s drops below ϵ, x_j stays zero for all following iterations. We can use this property to remove the j-th component of x and the corresponding column of the matrix A from the model, thereby reducing its size. This can significantly expedite the ICR algorithm.
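The remark above amounts to a simple screening step inside the ICR loop. The sketch below (hypothetical variable names; ϵ and the toy state are made up for illustration) removes the coefficients whose running averages have fallen below ϵ:

```python
import numpy as np

def prune_model(A, x, mu, eps):
    """Drop coefficients whose running average fell below eps (Lemma 1):
    they stay zero forever, so the corresponding columns of A can be removed."""
    keep = np.abs(mu) >= eps
    return A[:, keep], x[keep], mu[keep], keep

# Toy usage with made-up state
A = np.arange(12.0).reshape(3, 4)
mu = np.array([0.9, 1e-8, 0.4, 1e-9])
x = np.array([1.0, 0.0, -0.5, 0.0])
A2, x2, mu2, keep = prune_model(A, x, mu, eps=1e-6)
print(A2.shape, np.flatnonzero(keep))
```

Each subsequent subproblem is then solved over the surviving columns only, which shrinks the quadratic program as the iterations proceed.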

###### Lemma 2.

If |μ_j^(n)| ≥ αρ_j ≥ ϵ for all n (for some constant α > 0), then there exists N_j such that for all n > N_j we have

 | 1/|μ_j^(n+1)| − 1/|μ_j^(n)| | ≤ c/(n+1)   (2.21)

where c is some positive constant.

Another interpretation of this lemma is that as the number of iterations grows, the cost functions at consecutive iterations of ICR get closer to each other. In view of these two lemmas, we can show that the sequence of optimal cost function values obtained from the ICR algorithm forms a quasi-Cauchy sequence [85]; in other words, a sequence of bounded values whose difference at two consecutive iterations becomes smaller.

###### Proof.

Assume |μ_j^(n)| ≥ αρ_j for all n. First, note that it is straightforward to see that the difference of consecutive average values satisfies | |μ_j^(n+1)| − |μ_j^(n)| | ≤ 1/(n+1), since μ^(n) is a running average of bounded quantities. Now, let N_j be such that 1/N_j = αρ_j/2; then for all n > N_j we have:

 |μ_j^(n)| − 1/(n+1) ≤ |μ_j^(n+1)| ≤ |μ_j^(n)| + 1/(n+1)   (2.22)

where the left-hand side is positive, since

 |μ_j^(n)| − 1/(n+1) ≥ αρ_j − 1/N_j = αρ_j/2 =: δ > 0

Using this fact and (2.22) we infer that:

 1/( |μ_j^(n)| + 1/(n+1) ) ≤ 1/|μ_j^(n+1)| ≤ 1/( |μ_j^(n)| − 1/(n+1) )
 ⇒ −1 / ( (n+1) ( |μ_j^(n)| + 1/(n+1) ) |μ_j^(n)| ) ≤ 1/|μ_j^(n+1)| − 1/|μ_j^(n)| ≤ 1 / ( (n+1) ( |μ_j^(n)| − 1/(n+1) ) |μ_j^(n)| )

In the last expression, we have:

 RHS = 1 / ( (n+1) ( |μ_j^(n)| − 1/(n+1) ) |μ_j^(n)| ) ≤ 1 / ( (n+1) ϵ δ )
 LHS = −1 / ( (n+1) ( |μ_j^(n)| + 1/(n+1) ) |μ_j^(n)| ) ≥ −1 / ( (n+1) ϵ δ )

Therefore,

 | 1/|μ_j^(n+1)| − 1/|μ_j^(n)| | ≤ 1 / ( (n+1) ϵ δ ),   n > N_j. ∎

###### Theorem 1.

After a sufficiently large N, the sequence of optimal cost function values obtained from ICR forms a quasi-Cauchy sequence; i.e., for n > N, { f_n(x^(n)) } satisfies

 | f_{n+1}(x^(n+1)) − f_n(x^(n)) | ≤ c′/n.   (2.23)
###### Proof.

Before proceeding with the proof, note that for a sufficiently large n we may assume that each |μ_j^(n)| is either always less than ϵ or always at least ϵ: according to Lemma 1, once |μ_j^(n)| becomes smaller than ϵ for some n, it remains smaller for all following iterations. Therefore, let N′ be the iteration index such that for all n > N′ and every j, |μ_j^(n)| stays on one side of ϵ (some indices may never fall below ϵ). For those j with |μ_j^(n)| ≥ ϵ, let N_j be as defined in the proof of Lemma 2. With these definitions, we now proceed to prove the theorem. We first show that for n > N := max(N′, max_j N_j), the sequence of functions f_n has the following property:

 | f_{n+1}(x) − f_n(x) | = | ∑_{i=1}^p ρ_i ( 1/|μ_i^(n)| − 1/|μ_i^(n−1)| ) |x_i| |   (2.24)
 ≤ ∑_{|μ_i^(n−1)|<ϵ} ρ_i | 1/|μ_i^(n)| − 1/|μ_i^(n−1)| | |x_i| + ∑_{|μ_i^(n−1)|≥ϵ} ρ_i | 1/|μ_i^(n)| − 1/|μ_i^(n−1)| | |x_i|
 ≤ ∑_{|μ_i^(n−1)|≥ϵ} ρ_i (c/n) |x_i|
 ≤ p max_i{ρ_i} c/n
 ≤ c′/n.

This property holds in particular at x = x^(n) and x = x^(n+1), whose components with |μ_i^(n−1)| < ϵ are zero by Lemma 1, so the first sum above vanishes. Finally, we show that for n > N, f_n(x^(n)) is quasi-Cauchy. Since the minimum value f_{n+1}(x^(n+1)) is smaller than f_{n+1}(x^(n)), we can write:

 f_{n+1}(x^(n+1)) − f_n(x^(n)) ≤ f_{n+1}(x^(n)) − f_n(x^(n)) ≤ c′/n

where we used (2.24) for x = x^(n). With the same reasoning for x = x^(n+1), we have:

 f_{n+1}(x^(n+1)) − f_n(x^(n)) ≥ f_{n+1}(x^(n+1)) − f_n(x^(n+1)) ≥ −c′/n

Therefore,

 | f_{n+1}(x^(n+1)) − f_n(x^(n)) | ≤ c′/n

for n > N. ∎

Remark: Although the analysis shows only an O(1/n) decay in the difference between consecutive optimal cost function values, ICR exhibits much faster convergence in practice. Combining this theorem with a reasonable stopping criterion guarantees termination of the ICR algorithm. The stopping criterion used here is the norm of the difference between the solutions at consecutive iterations. At termination, where the solution converges, the ratio |x_i| / |μ_i^(n)| is zero for zero coefficients and approaches 1 for nonzero coefficients, which matches the value of γ_i in both cases.
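The stopping criterion matters because a quasi-Cauchy sequence alone need not converge. A standard illustration (a textbook example, not from this chapter) is the partial sums of the harmonic series: consecutive differences shrink like 1/n, yet widely separated terms stay far apart.

```python
import numpy as np

# Partial sums of the harmonic series: quasi-Cauchy but not Cauchy.
n = np.arange(1, 100001)
s = np.cumsum(1.0 / n)

print("last consecutive difference:", s[-1] - s[-2])  # shrinks like 1/n
print("s[100000] - s[1000]:", s[-1] - s[999])         # still large (~ln(100))
```

This is why Theorem 1 is paired with the norm-difference stopping rule above to guarantee termination.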

### 2.4 Experimental Validation

We now apply the ICR method to the sparse recovery problem. Two experimental scenarios are considered: 1) synthetic data and 2) a real-world image recovery problem. In each case, comparisons are made against state-of-the-art alternatives.

Synthetic data: We set up a typical sparse recovery experiment as in [16, 72] with a randomly generated Gaussian matrix A and a sparse vector x. Based on A and x, we form the observation vector y according to the additive noise model y = Ax + n, with n ∼ N(0, σ²I). The competitive state-of-the-art methods for sparse recovery that we compare against are: 1) SpaRSA [62, 86], a powerful method for solving problems of the form (2.8); 2) the majorization-minimization (MM) algorithm of Yen et al. [72]; 3) the Elastic Net [83]; 4) the FOCUSS algorithm [73], a reweighted algorithm for sparse recovery [74]; 5) the expectation propagation approach for spike and slab recovery (SS-EP) [77]; and finally 6) the Variational Garrote (VG) [79]. Initialization for all methods is consistent, as suggested in [86].
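The synthetic setup described above can be reproduced in a few lines. The sketch below generates (A, x, y) with unit-norm columns, as assumed in the analysis of Section 2.3; the dimensions and noise level are placeholders, not the chapter's exact settings.

```python
import numpy as np

def make_sparse_problem(q, p, k, sigma=0.05, rng=None):
    """Generate (A, x, y) as in the synthetic experiment: Gaussian A,
    k-sparse x, and y = A x + n with n ~ N(0, sigma^2 I)."""
    if rng is None:
        rng = np.random.default_rng()
    A = rng.normal(size=(q, p))
    A /= np.linalg.norm(A, axis=0)              # unit-norm columns
    x = np.zeros(p)
    support = rng.choice(p, size=k, replace=False)
    x[support] = rng.normal(size=k)             # Gaussian nonzero amplitudes
    y = A @ x + sigma * rng.normal(size=q)
    return A, x, y

A, x, y = make_sparse_problem(q=50, p=100, k=5, rng=np.random.default_rng(3))
print(A.shape, np.count_nonzero(x))
```

Each competing method is then run on the same (A, y) pair, and recovery quality is measured against the ground-truth x.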

Table 2.1 reports the experimental results for a small-scale problem. We chose to first report results on a small-scale problem in order to be able to use the IBM ILOG CPLEX optimizer [87], a powerful optimization toolbox for many classes of problems that can also find the global solution of non-convex and mixed-integer programs. We used this feature of CPLEX to compare ICR's solution with the global minimizer. For the results in Table 2.1, we choose the dimensions q and p and the sparsity level of x to be small. We generated realizations of and