Sparse Signal Estimation by Maximally Sparse Convex Optimization
Abstract
This paper addresses the problem of sparsity penalized least squares for applications in sparse signal processing, e.g., sparse deconvolution. This paper aims to induce sparsity more strongly than L1 norm regularization, while avoiding nonconvex optimization. For this purpose, this paper describes the design and use of nonconvex penalty functions (regularizers) constrained so as to ensure the convexity of the total cost function, F, to be minimized. The method is based on parametric penalty functions, the parameters of which are constrained to ensure convexity of F. It is shown that optimal parameters can be obtained by semidefinite programming (SDP). This maximally sparse convex (MSC) approach yields maximally nonconvex sparsity-inducing penalty functions constrained such that the total cost function, F, is convex. It is demonstrated that iterative MSC (IMSC) can yield solutions substantially more sparse than the standard convex sparsity-inducing approach, i.e., L1 norm minimization.
I. Introduction
In sparse signal processing, the L1 norm has special significance [5, 4]: it is the convex proxy for sparsity. Given the relative ease with which convex problems can be reliably solved, the L1 norm is a basic tool in sparse signal processing. However, penalty functions that promote sparsity more strongly than the L1 norm yield more accurate results in many sparse signal estimation/reconstruction problems. Hence, numerous algorithms have been devised to solve nonconvex formulations of the sparse signal estimation problem. In the nonconvex case, generally only a locally optimal solution can be ensured; hence solutions are sensitive to algorithmic details.
This paper aims to develop an approach that promotes sparsity more strongly than the L1 norm, but which attempts to avoid nonconvex optimization as far as possible. In particular, the paper addresses ill-posed linear inverse problems of the form
(1)  arg min_x { F(x) = (1/2) ||y - H x||_2^2 + Σ_n λ_n φ_n(x_n) }
where λ_n > 0 and the φ_n are sparsity-inducing regularizers (penalty functions) for x_n. Problems of this form arise in denoising, deconvolution, compressed sensing, etc. Specific motivating applications include nanoparticle detection for biosensing and near infrared spectroscopic time series imaging [61, 62].
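To make the objective in (1) concrete, a least squares data-fidelity term plus separable penalties can be evaluated in a few lines of NumPy. The matrices and values below are placeholder data for illustration only, with the L1 penalty standing in for a generic φ_n:

```python
import numpy as np

def cost(x, y, H, lam, phi):
    """Evaluate F(x) = 0.5*||y - H x||^2 + sum_n lam[n]*phi(x[n])."""
    r = y - H @ x
    return 0.5 * r @ r + np.sum(lam * phi(x))

# Small illustrative problem; phi = |.| is the convex (L1) baseline.
H = np.array([[1.0, 0.5],
              [0.0, 1.0]])
y = np.array([1.0, 2.0])
lam = np.array([0.1, 0.1])
x = np.array([0.5, 2.0])

F = cost(x, y, H, lam, np.abs)   # F == 0.375 here
```

Swapping `np.abs` for a nonconvex penalty changes only the `phi` argument; the rest of the objective is unchanged.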
This paper explores the use of nonconvex penalty functions φ_n, under the constraint that the total cost function F is convex and therefore reliably minimized. This idea, introduced by Blake and Zisserman [6], is carried out
…by balancing the positive second derivatives in the first term [quadratic fidelity term] against the negative second derivatives in the [penalty] terms [6, page 132].
This idea is also proposed by Nikolova in Ref. [49], where it is used in the denoising of binary images. In this work, to carry out this idea, we employ penalty functions parameterized by variables a_n, i.e., φ_n(x) = φ(x; a_n), wherein the parameters a_n are selected so as to ensure convexity of the total cost function F. We note that in [6], the proposed family of penalty functions is quadratic around the origin and all the a_n are equal. On the other hand, the penalty functions we utilize in this work are nondifferentiable at the origin, as in [52, 54] (so as to promote sparsity), and the a_n are not constrained to be equal.
A key idea is that the parameters a_n can be optimized to make the penalty functions maximally nonconvex (i.e., maximally sparsity-inducing), subject to the constraint that F is convex. We refer to this as the 'maximally sparse convex' (MSC) approach. In this paper, the allowed interval for the parameters a_n, to ensure F is convex, is obtained by formulating a semidefinite program (SDP) [2], which is itself a convex optimization problem. Hence, in the proposed MSC approach, the cost function to be minimized depends itself on the solution to a convex problem. This paper also describes an iterative MSC (IMSC) approach that boosts the applicability and effectiveness of the MSC approach. In particular, IMSC extends MSC to the case where H is rank deficient or ill conditioned; e.g., overcomplete dictionaries and deconvolution of near singular systems.
The proposed MSC approach requires a suitable parametric penalty function φ(x; a), where a controls the degree to which φ is nonconvex. Therefore, this paper also addresses the choice of parameterized nonconvex penalty functions so as to enable the approach. The paper proposes suitable penalty functions and describes their relevant properties.
I-A. Related Work (Threshold Functions)
When H in (1) is the identity operator, the problem is one of denoising and is separable in x_n. In this case, a sparse solution is usually obtained by some type of threshold function, θ. The most widely used threshold functions are the soft and hard threshold functions [21]. Each has its disadvantages, and many other threshold functions that provide a compromise between soft and hard thresholding have been proposed, for example: the firm threshold [32], the nonnegative (nn) garrote [31, 26], the SCAD threshold function [24, 73], and the proximity operator of the Lp quasi-norm (0 < p < 1) [44]. Several penalty functions are unified by the two-parameter formulas given in [3, 35], wherein threshold functions are derived as proximity operators [19]. (Table 1.2 of [19] lists the proximity operators of numerous functions.) Further threshold functions are defined directly by their functional form [71, 70, 72].
Sparsity-based nonlinear estimation algorithms can also be developed by formulating suitable non-Gaussian probability models that reflect sparse behavior, and by applying Bayesian estimation techniques [17, 39, 48, 23, 1, 56, 40]. We note that the approach we take below is essentially a deterministic one; we do not explore its formulation from a Bayesian perspective.
This paper develops a specific threshold function designed so as to have the three properties advocated in [24]: unbiasedness (of large coefficients), sparsity, and continuity. Further, the threshold function θ and its corresponding penalty function φ are parameterized by two parameters: the threshold T and the right-sided derivative of θ at the threshold, i.e., θ′(T+), a measure of the threshold function's sensitivity. Like other threshold functions, the proposed threshold function biases large values less than does the soft threshold function, but is continuous, unlike the hard threshold function. As will be shown below, the proposed function is most similar to the threshold function (proximity operator) corresponding to the logarithmic penalty, but it is designed to have less bias. It is also particularly convenient in algorithms for solving (1) that do not call on the threshold function directly, but instead call on the derivative of the penalty function, φ′, due to its simple functional form. Such algorithms include iterative reweighted least squares (IRLS) [38], iterative reweighted L1 [69, 11], FOCUSS [58], and algorithms derived using majorization-minimization (MM) [25] wherein the penalty function is upper bounded (e.g., by a quadratic or linear function).
I-B. Related Work (Sparsity Penalized Least Squares)
Numerous problem formulations and algorithms to obtain sparse solutions to the general ill-posed linear inverse problem (1) have been proposed. The L1 norm penalty (i.e., φ(x) = |x|) has been proposed for sparse deconvolution [41, 16, 66, 10] and more generally for sparse signal processing [15] and statistics [67]. For the L1 norm and other nondifferentiable convex penalties, efficient algorithms for large scale problems of the form (1) and similar (including convex constraints) have been developed based on proximal splitting methods [18, 19], the alternating direction method of multipliers (ADMM) [9], majorization-minimization (MM) [25], primal-dual gradient descent [22], and Bregman iterations [36].
Several approaches aim to obtain solutions to (1) that are more sparse than the L1 norm solution. Some of these methods proceed first by selecting a nonconvex penalty function that induces sparsity more strongly than the L1 norm, and second by developing nonconvex optimization algorithms for the minimization of F; for example, iterative reweighted least squares (IRLS) [38, 69], FOCUSS [37, 58], extensions thereof [65, 47], half-quadratic minimization [12, 34], graduated nonconvexity (GNC) [6], and its extensions [51, 50, 52, 54].
The GNC approach for minimizing a nonconvex function F proceeds by minimizing a sequence of approximate functions, starting with a convex approximation of F and ending with F itself. While GNC was originally formulated for image segmentation with smooth penalties, it has been extended to general ill-posed linear inverse problems [51] and nonsmooth penalties [52, 54, 46].
With the availability of fast reliable algorithms for L1 norm minimization, reweighted L1 norm minimization is a suitable approach for the nonconvex problem [11, 69]: the tighter upper bound of the nonconvex penalty provided by the weighted L1 norm, as compared to the weighted L2 norm, reduces the chance of convergence to poor local minima. Other algorithmic approaches include 'difference of convex' (DC) programming [33] and operator splitting [13].
In contrast to these works, in this paper the penalties are constrained by the operator H and by the λ_n. This approach (MSC) deviates from the usual approach, wherein the penalty is chosen based on prior knowledge of the signal x. We also note that, by design, the proposed approach leads to a convex optimization problem; hence, it differs from approaches that pursue nonconvex optimization. It also differs from the usual convex approaches for sparse signal estimation/recovery, which utilize convex penalties. In this paper, the aim is precisely to utilize nonconvex penalties that induce sparsity more strongly than a convex penalty possibly can.
The proposed MSC approach is most similar to the generalizations of GNC to nonsmooth penalties [50, 52, 54] that have proven effective for fast image reconstruction with accurate edge reproduction. In GNC, the convex approximation of F is based on the minimum eigenvalue of H^T H. The MSC approach is similar but more general: not all a_n are equal. This more general formulation leads to an SDP, not an eigenvalue problem. In addition, GNC comprises a sequence of nonconvex optimizations, whereas the proposed approach (IMSC) leads to a sequence of convex problems. The GNC approach can be seen as a continuation method, wherein a convex approximation of F is gradually transformed to F in a predetermined manner. In contrast, in the proposed approach, each optimization problem is defined by the output of an SDP which depends on the support of the previous solution. In a sense, F is redefined at each iteration, to obtain progressively sparse solutions.
By not constraining all a_n to be equal, the MSC approach allows a more general parametric form for the penalty, and as such, it can be more nonconvex (i.e., more sparsity promoting) than if all a_n are constrained to be equal. The example in Sec. III-F compares the two cases (with and without the simplification that all a_n are equal) and shows that the simplified version gives inferior results. (The simplified form is denoted IMSC/S in Table I and Fig. 9 below.)
If the measurement matrix H is rank deficient, and if all a_n were constrained to be equal, then the only solution in the proposed approach would have a_n = 0 for all n; i.e., the penalty function would be convex. In this case, it is not possible to gain anything by allowing the penalty function to be nonconvex subject to the constraint that the total cost function is convex. On the other hand, the proposed MSC approach, depending on H, can still have all or some a_n > 0 and hence can admit nonconvex penalties (in turn, promoting sparsity more strongly).
L0 minimization: A distinct approach to obtain sparse solutions to (1) is to find an approximate solution minimizing the L0 quasi-norm or satisfying an L0 constraint. Examples of such algorithms include: matching pursuit (MP) and orthogonal MP (OMP) [45], greedy methods [43], iterative hard thresholding (IHT) [8, 7, 42, 55], hard thresholding pursuit [28], smoothed L0 [46], iterative support detection (ISD) [68], single best replacement (SBR) [63], and ECME thresholding [57].
Compared to algorithms aiming to solve the L0 quasi-norm problem, the proposed approach again differs. First, the L0 problem is highly nonconvex, while the proposed approach defines a convex problem. Second, methods for L0 seek the correct support (index set of nonzero elements) of x and do not regularize (penalize) any element in the calculated support. In contrast, the design of the regularizer (penalty) is at the center of the proposed approach, and no x_n is left unregularized.
II. Scalar Threshold Functions
The proposed threshold function and corresponding penalty function are intended to serve as a compromise between the soft and hard threshold functions, and as a parameterized family of functions for use with the proposed MSC method for ill-posed linear inverse problems, to be described in Sect. III.
First, we note the high sensitivity of the hard threshold function to small changes in its input. If the input is slightly less than the threshold T, then a small positive perturbation produces a large change in the output, i.e., hard(T - ε) = 0 and hard(T + ε) ≈ T, where hard(·) denotes the hard threshold function. Due to this discontinuity, spurious noise peaks/bursts often appear as a result of hard-threshold denoising. For this reason, a continuous threshold function is often preferred. The susceptibility of a threshold function θ to the phenomenon of spurious noise peaks can be roughly quantified by the maximum value its derivative attains, i.e., max_y θ′(y), provided θ is continuous. For the threshold functions considered below, θ′ attains its maximum value at the threshold; hence, the value of θ′(T+) will be noted. The soft threshold function has θ′(T+) = 1, reflecting its insensitivity. However, soft thresholding substantially biases (attenuates) large values of its input; i.e., soft(y) = y - T for y > T.
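The sensitivity contrast just described is easy to check numerically. Below is a minimal sketch using the standard definitions of the hard and soft threshold functions (not code from the paper):

```python
import numpy as np

def hard(y, T):
    """Hard threshold: identity above T in magnitude, zero below."""
    return np.where(np.abs(y) > T, y, 0.0)

def soft(y, T):
    """Soft threshold: shrink toward zero by T."""
    return np.sign(y) * np.maximum(np.abs(y) - T, 0.0)

T, eps = 1.0, 1e-3
# Crossing the threshold changes the hard-threshold output by about T...
jump_hard = hard(T + eps, T) - hard(T - eps, T)
# ...but changes the soft-threshold output only by about 2*eps.
jump_soft = soft(T + eps, T) - soft(T - eps, T)
# The price of soft thresholding is bias: large inputs lose T.
bias_soft = 5.0 - soft(5.0, T)
```

The jump of roughly T across the threshold is exactly the mechanism behind spurious noise bursts in hard-threshold denoising.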
II-A. Problem Statement
In this section, we seek a threshold function θ and corresponding penalty φ (i) for which the 'sensitivity' θ′(T+) can be readily tuned from 1 to infinity and (ii) that does not substantially bias large y, i.e., the gap y - θ(y) decays to zero rapidly as y increases.
For a given penalty function φ, the proximity operator [19], denoted θ, is defined by
(2)  θ(y) = arg min_x { F(x) = (1/2)(y - x)^2 + λ φ(x) }
where λ > 0. For uniqueness of the minimizer, we assume in the definition of θ that F in (2) is strictly convex. Common sparsity-inducing penalties include
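As a sanity check, the proximity operator (2) can be evaluated by brute force on a grid. The sketch below is purely illustrative (a grid search, not a closed form); with φ(x) = |x| it reproduces the soft threshold:

```python
import numpy as np

def prox_numeric(y, lam, phi, grid):
    """Brute-force proximity operator (2):
    argmin over the grid of 0.5*(y - x)**2 + lam*phi(x)."""
    cost = 0.5 * (y - grid) ** 2 + lam * phi(grid)
    return grid[np.argmin(cost)]

grid = np.linspace(-10.0, 10.0, 200001)   # spacing 1e-4
# With phi(x) = |x|, the minimizer is the soft threshold: theta(4) = 4 - 1.5.
theta = prox_numeric(4.0, 1.5, np.abs, grid)
```

The same routine can be pointed at any candidate penalty to visualize its threshold function before a closed form is derived.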
(3) 
We similarly assume in the following that φ is three times continuously differentiable for all x except x = 0, and that φ is symmetric, i.e., φ(-x) = φ(x).
If θ(y) = 0 for all |y| ≤ T for some T > 0, and T is the maximum such value, then the function θ is a threshold function and T is the threshold.
It is often beneficial in practice if θ admits a simple functional form. However, as noted above, a number of algorithms for solving (1) do not use θ directly, but use φ′ instead. In that case, it is beneficial if φ′ has a simple functional form. This is relevant in Sec. III, where such algorithms will be used.
In order that the gap y - θ(y) approaches zero, the penalty function must be nonconvex, as shown by the following.
Proposition 1.
Suppose φ is a convex function and θ denotes the proximity operator associated with φ, defined in (2). If 0 ≤ y1 ≤ y2, then
(4)  y1 - θ(y1) ≤ y2 - θ(y2).
Proof.
Let u_i = θ(y_i) for i = 1, 2. By the optimality conditions for (2), we have
(5)  y_i = u_i + λ s_i,  s_i ∈ ∂φ(u_i),  i = 1, 2,
where ∂φ denotes the subdifferential of φ. Since y1 ≤ y2, by the monotonicity of both of the terms on the right-hand side of (5), it follows that u1 ≤ u2.
If u2 = 0, (4) holds with equality since u1 = 0.
Suppose now that u2 > 0. Note that the subdifferential ∂φ is also a monotone mapping since φ is a convex function. Therefore it follows that if u2 ≥ u1, we should have s2 ≥ s1. Since y_i - u_i = λ s_i, the claim follows. ∎
According to the proposition, if the penalty φ is convex, then the gap y - θ(y) between the input and the output does not decrease as the magnitude of the input increases. The larger y is, the greater the bias (attenuation) is. The soft threshold function is an extreme case that keeps this gap constant (beyond the threshold T, the gap is equal to T). Hence, in order to avoid attenuation of large values, the penalty function must be nonconvex.
II-B. Properties
As detailed in the Appendix, the proximity operator (threshold function) θ defined in (2) can be expressed as
(6)  θ(y) = 0 for |y| ≤ T,  θ(y) = f^{-1}(y) for |y| > T,
where the threshold, T, is given by
(7)  T = λ φ′(0+)
and f is defined as
(8)  f(x) = x + λ φ′(x),  x > 0
(by the symmetry of φ, θ is an odd function).
As noted in the Appendix, F in (2) is strictly convex if
(9)  φ″(x) > -1/λ for all x > 0.
In addition, we have
(10)  θ′(T+) = 1 / (1 + λ φ″(0+))
and
(11)  θ″(T+) = -λ φ‴(0+) / (1 + λ φ″(0+))^3.
Equations (10) and (11) will be used in the following. As noted above, θ′(T+) reflects the maximum sensitivity of θ. The value θ″(T+) is also relevant; it will be set in Sec. II-D so as to induce the gap y - θ(y) to decay rapidly to zero.
II-C. The Logarithmic Penalty Function
The logarithmic penalty function can be used for the MSC method to be described in Sec. III. It also serves as the model for the penalty function developed in Sec. II-D below, designed to have less bias. The logarithmic penalty is given by
(12)  φ(x) = (1/a) log(1 + a|x|),  a > 0,
which is differentiable except at x = 0. For x ≠ 0, the derivative of φ is given by
(13)  φ′(x) = sign(x) / (1 + a|x|),
as illustrated in Fig. 1a. The function f in (8) is illustrated in Fig. 1b. The threshold function θ, given by (6), is illustrated in Fig. 1c.
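For reference, a small sketch of the logarithmic penalty and its derivative, assuming the form φ(x; a) = (1/a) log(1 + a|x|); its slope is 1 at the origin and decays to zero as |x| grows, which is the source of its reduced bias:

```python
import numpy as np

# Assumed form of the logarithmic penalty: phi(x; a) = (1/a)*log(1 + a*|x|).
def phi_log(x, a):
    return np.log(1.0 + a * np.abs(x)) / a

def dphi_log(x, a):
    """Derivative magnitude 1/(1 + a*|x|) (the x > 0 branch of (13))."""
    return 1.0 / (1.0 + a * np.abs(x))

a = 0.25
slope_origin = dphi_log(0.0, a)    # slope 1 at the origin, like |x|
slope_far = dphi_log(100.0, a)     # decays toward 0: less bias on large x
```

The simple closed form of `dphi_log` is what makes this penalty convenient in IRLS/MM-type algorithms, which call φ′ rather than θ.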
Let us find the range of a for which F in (2) is convex. Note that
φ″(x) = -a / (1 + a x)^2
for x > 0. Using the condition (9), it is deduced that if 0 < a ≤ 1/λ, then f is increasing, the cost function F in (2) is convex, and the threshold function θ is continuous.
Using (7), the threshold is given by T = λ. To find θ′(T+) and θ″(T+), note that φ″(0+) = -a and φ‴(0+) = 2a^2. Using (10) and (11), we then have
(14)  θ′(T+) = 1 / (1 - λa),  θ″(T+) = -2λa^2 / (1 - λa)^3.
As a varies between 0 and 1/λ, the derivative θ′(T+) varies between 1 and infinity. As a approaches zero, θ approaches the soft-threshold function. We can set a so as to specify θ′(T+). Solving (14) for a gives
(15)  a = (1/λ)(1 - 1/θ′(T+)).
Therefore, T and θ′(T+) can be directly specified by setting the parameters λ and a (i.e., λ = T and a is given by (15)).
Note that θ″(T+) given in (14) is strictly negative except when a = 0, which corresponds to the soft threshold function. The negativity of θ″(T+) inhibits the rapid approach of θ to the identity function.
The threshold function θ is obtained by solving y = f(x) for x, leading to
(16)  x + λ / (1 + a x) = y,  x ≥ 0,
which leads in turn to the explicit formula
θ(y) = sign(y) [ (|y|/2 - 1/(2a)) + sqrt( (|y|/2 + 1/(2a))^2 - λ/a ) ],  |y| ≥ T,
as illustrated in Fig. 1c. As shown, the gap y - θ(y) goes to zero for large y. By increasing a up to 1/λ, the gap goes to zero more rapidly; however, increasing a also changes θ′(T+). The single parameter a affects both the derivative at the threshold and the convergence rate to the identity.
The next section derives a penalty function for which the gap y - θ(y) goes to zero more rapidly, for the same value of θ′(T+). This will be achieved by setting θ″(T+) = 0.
II-D. The Arctangent Penalty Function
To obtain a penalty approaching the identity more rapidly than the logarithmic penalty, we use equation (13) as a model, and define a new penalty by means of its derivative as
(17)  φ′(x) = sign(x) / (1 + a|x| + b x^2),  a > 0, b > 0.
Using (7), the corresponding threshold function has threshold T = λ. In order to use (10) and (11), we note the derivatives at zero:
(18)  φ″(0+) = -a,  φ‴(0+) = 2(a^2 - b).
Using (10), (11), and (18), we have
(19)  θ′(T+) = 1 / (1 - λa),  θ″(T+) = -2λ(a^2 - b) / (1 - λa)^3.
We may set a so as to specify θ′(T+). Solving (19) for a gives (15), the same as for the logarithmic penalty function.
In order that the threshold function increases rapidly toward the identity function, we use the parameter θ″(T+). To this end, we set θ″(T+) = 0 so that θ is approximately linear in the vicinity of the threshold. Setting θ″(T+) = 0 in (19) gives b = a^2. Therefore, the proposed penalty function is given by
(20)  φ′(x) = sign(x) / (1 + a|x| + a^2 x^2).
From the condition (9), we find that if 0 < a ≤ 1/λ, then f is strictly increasing, F is strictly convex, and θ is continuous. The parameters λ and a can be set as for the logarithmic penalty function; namely, λ = T and a is given by (15).
To find the threshold function θ, we solve y = f(x) for x, which leads to
(21)  x + λ / (1 + a x + a^2 x^2) = y,  x ≥ 0,
for |y| ≥ T. The value of θ(y) can be found by solving the cubic polynomial equation (21) for x, and multiplying the real root by sign(y). Although θ does not have a simple functional form, the derivative φ′ does. Therefore, algorithms such as MM and IRLS, which use φ′ instead of θ, can be readily used in conjunction with this penalty function.
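A numerical sketch of this computation, assuming the derivative form φ′(x) = 1/(1 + a|x| + a^2 x^2) for x > 0, so that clearing the denominator in y = x + λφ′(x) yields a cubic in x:

```python
import numpy as np

def atan_threshold(y, lam, a):
    """Threshold function for the penalty with assumed derivative
    dphi(x) = 1/(1 + a*x + (a*x)**2), x > 0.
    Solves y = x + lam*dphi(x) via the equivalent cubic; the
    threshold is T = lam since dphi(0+) = 1."""
    if abs(y) <= lam:
        return 0.0
    s, z = np.sign(y), abs(y)
    # Clearing the denominator in z = x + lam/(1 + a*x + a**2*x**2):
    # a^2 x^3 + a(1 - a z) x^2 + (1 - a z) x + (lam - z) = 0
    roots = np.roots([a**2, a * (1.0 - a * z), 1.0 - a * z, lam - z])
    x = max(r.real for r in roots if abs(r.imag) < 1e-9 and r.real > 0.0)
    return s * x

lam, a = 1.0, 0.5
x = atan_threshold(3.0, lam, a)   # between soft (2.0) and identity (3.0)
```

In practice, IRLS/MM algorithms never call this routine; they use φ′ directly, which is the point made in the text.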
The penalty function itself, φ, can be found by integrating its derivative:
(22)  φ(x) = ∫_0^{|x|} dt / (1 + a t + a^2 t^2)
(23)       = (2 / (a sqrt(3))) ( arctan( (1 + 2a|x|) / sqrt(3) ) - π/6 ).
We refer to this as the arctangent penalty function.
The threshold function θ is illustrated in Fig. 2 for a fixed threshold T and three values of a. For each value 0 < a < 1/λ, the cost function F is strictly convex and θ is continuous. As a approaches 0, θ′(T+) approaches 1 and θ approaches the soft-threshold function. As a approaches its maximal value 1/λ, θ′(T+) grows without bound and θ converges more rapidly to the identity function, but may be more sensitive than desired in the vicinity of the threshold.
Figure 3 compares the logarithmic and arctangent threshold functions, where the parameters for each function are set so that T and θ′(T+) are the same; specifically, θ′(T+) = 2. It can be seen that the arctangent threshold function converges more rapidly to the identity function than the logarithmic threshold function. To illustrate the difference more clearly, the lower panel in Fig. 3 shows the gap y - θ(y) between the identity function and the threshold function. For the arctangent threshold function, this gap goes to zero more rapidly. Yet, for both threshold functions, θ′ has a maximum value of 2. The faster convergence of the arctangent threshold function is due to its φ′(x) going to zero like 1/(a^2 x^2), whereas φ′(x) for the logarithmic penalty goes to zero like 1/(a x).
Figure 4 compares the logarithmic and arctangent penalty functions. Both functions grow more slowly than |x| and thus induce less bias than the L1 norm for large x. Moreover, while the logarithmic penalty tends to infinity, the arctangent penalty tends to a constant. Hence, the arctangent penalty leads to less bias than the logarithmic penalty. All three penalties have the same slope (of 1) at x = 0+; furthermore, the logarithmic and arctangent penalties have the same second derivative (of -a) at x = 0+. But the logarithmic and arctangent penalties have different third-order derivatives at x = 0+ (2a^2 and zero, respectively). That is, the arctangent penalty is more concave at the origin than the logarithmic penalty.
II-E. Other Penalty Functions
The firm threshold function [32] and the smoothly clipped absolute deviation (SCAD) threshold function [24, 73] also provide a compromise between hard and soft thresholding. Both the firm and SCAD threshold functions are continuous and equal to the identity function for large |y| (the corresponding φ′(x) is equal to zero for |x| above some value). Some algorithms, such as IRLS, MM, etc., involve dividing by φ′, and for these algorithms divide-by-zero issues arise. Hence, the penalty functions corresponding to these threshold functions are unsuitable for such algorithms.
A widely used penalty function is the Lp pseudo-norm, 0 < p < 1, for which φ(x) = |x|^p. However, using (9), it can be seen that for this penalty function the cost function F is not convex for any λ > 0 (the second derivative of |x|^p is unbounded below near x = 0). As our current interest is in nonconvex penalty functions for which F is convex, we do not further discuss the Lp penalty. The reader is referred to [44, 53] for in-depth analysis of this and several other penalty functions.
II-F. Denoising Example
To illustrate the trade-off between θ′(T+) and the bias introduced by thresholding, we consider the denoising of the noisy signal illustrated in Fig. 5. Wavelet-domain thresholding is performed with several threshold functions.
Each threshold function is applied with the same threshold, T = 3σ. Most of the noise (cf. the 'three-sigma rule') will fall below the threshold and will be eliminated. The RMSE-optimal choice of threshold is usually lower than 3σ, so this represents a larger threshold than that usually used. However, a larger threshold reduces the number of spurious noise peaks produced by hard thresholding.
The hard threshold achieves the best RMSE, but the output signal exhibits spurious noise bursts due to noisy wavelet coefficients exceeding the threshold. The soft threshold function reduces the spurious noise bursts, but attenuates the peaks and results in a higher RMSE. The arctangent threshold function suppresses the noise bursts, with modest attenuation of peaks, and results in an RMSE closer to that of hard thresholding.
In this example, the signal is 'bumps' from WaveLab [20], with length 2048. The noise is additive white Gaussian noise with standard deviation σ. The wavelet is the orthonormal Daubechies wavelet with 3 vanishing moments.
III. Sparsity Penalized Least Squares
Consider the linear model
(24)  y = H x + w,
where x is a sparse signal, y is the observed signal, H is a linear operator (e.g., convolution), and w is additive white Gaussian noise (AWGN). The n-th component of the vector x is denoted x_n.
Under the assumption that x is sparse, we consider the linear inverse problem:
(25)  arg min_x { F(x) = (1/2) ||y - H x||_2^2 + Σ_n λ_n φ(x_n; a_n) },
where φ(·; a) is a sparsity-promoting penalty function with parameter a, such as the logarithmic or arctangent penalty function. In many applications, all λ_n are equal, i.e., λ_n = λ. For generality, we let this regularization parameter depend on the index n.
In the following, we address the question of how to constrain the regularization parameters λ_n and a_n so as to ensure F is convex, even when the penalty φ(·; a_n) is not convex. A problem of this form is addressed in GNC [6, 50], where the a_n are constrained to be equal.
III-A. Convexity Condition
Let φ(·; a) denote a penalty function with parameter a. Consider the function v, defined as
(26)  v(x) = (1/2) x^2 + λ φ(x; a).
Assume v can be made strictly convex for special choices of λ and a. We give a name to the set of all such choices.
Definition 1.
Let S be the set of pairs (λ, a) for which v in (26) is strictly convex. We refer to S as the 'parameter set associated with φ'.
For the logarithmic and arctangent penalty functions described above, the set S is given by
(27)  S = { (λ, a) : λ > 0, 0 ≤ a < 1/λ }.
Now, consider the function F, defined in (25). The following proposition provides a sufficient condition on (λ_n, a_n) ensuring the strict convexity of F.
Proposition 2.
Suppose R is a positive definite diagonal matrix such that H^T H - R is positive semidefinite. Let r_n denote the n-th diagonal entry of R, i.e., R = diag(r_1, …, r_N). Also, let S be the parameter set associated with φ. If (λ_n / r_n, a_n) ∈ S for each n, then F in (25) is strictly convex.
Proof.
The function F can be written as
(28)  F(x) = g(x) + (1/2) x^T R x + Σ_n λ_n φ(x_n; a_n),
where
(29)  g(x) = (1/2) ||y - H x||_2^2 - (1/2) x^T R x.
Note that g is convex since H^T H - R is positive semidefinite. Now, since R is diagonal, we can rewrite F as
(30)  F(x) = g(x) + Σ_n [ (r_n / 2) x_n^2 + λ_n φ(x_n; a_n) ]
(31)       = g(x) + Σ_n r_n [ (1/2) x_n^2 + (λ_n / r_n) φ(x_n; a_n) ].
From (31), it follows that if (λ_n / r_n, a_n) ∈ S for each n, then each term of the sum is strictly convex. Under this condition, F, being a sum of a convex and a strictly convex function, is strictly convex. ∎
The proposition states that constraints on the penalty parameters a_n ensuring strict convexity of F can be obtained using a positive definite diagonal matrix R lower bounding H^T H. If H does not have full rank, then strict convexity of F is precluded. In that case, R will be positive semidefinite, with some r_n equal to zero. For those indices n where r_n = 0, the quadratic term in (30) vanishes. In that case, we can still ensure the convexity of F in (25) by ensuring that λ_n φ(x_n; a_n) is convex in x_n. For the logarithmic and arctangent penalties proposed in this paper, we have φ(x; a) → |x| as a → 0. Therefore, we define φ(x; 0) = |x| for the log and atan penalties.
In view of (27), the following is a corollary of this result.
Corollary 1.
Suppose R is a positive definite diagonal matrix, R = diag(r_1, …, r_N), such that H^T H - R is positive semidefinite. For the logarithmic and arctangent penalty functions, if
(32)  0 ≤ a_n < r_n / λ_n for each n,
then F in (25) is strictly convex.
We illustrate condition (32) with a simple example in two variables, with R = diag(r_1, r_2) a positive diagonal matrix such that H^T H - R is positive semidefinite. According to (32), F is strictly convex if a_n < r_n / λ_n for n = 1, 2. Figure 6 illustrates the contours of the logarithmic penalty function and the cost function F for three values of a (with a_1 = a_2 = a). For a = 0, the penalty function reduces to the L1 norm; both the penalty function and F are convex. For an intermediate value of a satisfying (32), the penalty function is nonconvex but F is convex. The nonconvexity of the penalty is apparent in the figure (its contours do not enclose convex regions). The nonconvex 'star' shaped contours induce sparsity more strongly than the diamond shaped contours of the L1 norm. For a larger value of a violating (32), both the penalty function and F are nonconvex. The nonconvexity of F is apparent in the figure (a convex function can not have more than one stationary point, while the figure shows two). In this case, the star shape is too pronounced for F to be convex. In this example, the largest value of a maintaining convexity of F yields the maximally sparse convex (MSC) problem.
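The convexity condition can be checked numerically for a small example. The sketch below uses hypothetical values of H and λ, the simple scalar lower bound r_n = α_min (the minimum eigenvalue of H^T H), and assumes, as for the log and atan penalties, that values of a_n up to r_n/λ_n preserve convexity of the total cost:

```python
import numpy as np

# Hypothetical small problem: verify that R = alpha_min * I lower-bounds
# H^T H, then form the most nonconvex admissible penalty parameters,
# assuming (as for the log/atan penalties) that a_n < r_n / lam_n
# keeps the total cost function convex.
H = np.array([[1.0, 0.3],
              [0.3, 1.0]])
lam = np.array([1.0, 1.0])

G = H.T @ H
alpha_min = np.linalg.eigvalsh(G).min()
r = np.full(2, alpha_min)                 # simple (GNC-style) diagonal bound
psd = np.linalg.eigvalsh(G - np.diag(r)).min() >= -1e-12
a_max = r / lam                           # upper limits for the a_n
```

Replacing the scalar bound with a per-coordinate diagonal bound (the subject of the next subsection) can only enlarge the admissible a_n.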
Can a suitable R be obtained by variational principles? Let us denote the minimal eigenvalue of H^T H by α_min. Then R = α_min I is a positive semidefinite diagonal lower bound, as needed. However, this is a suboptimal lower bound in general. For example, if H^T H is a nonconstant diagonal matrix, then a tighter lower bound is H^T H itself, which is very different from α_min I. A tighter lower bound is of interest because the tighter the bound, the more nonconvex the penalty function can be, while maintaining convexity of F. In turn, sparser solutions can be obtained without sacrificing convexity of the cost function. A tighter lower bound can be found as the solution to an optimization problem, as described in the following.
III-B. Diagonal Lower Bound Matrix Computation
Given H, the convexity conditions above call for a positive semidefinite diagonal matrix R lower bounding H^T H. In order to find a reasonably tight lower bound, each r_n should be maximized. However, these parameters must be chosen jointly to ensure H^T H - R is positive semidefinite. We formulate the calculation of R as an optimization problem:
(33)  maximize Σ_n r_n  subject to  r_n ≥ 0 and H^T H - R ⪰ 0,
where R is the diagonal matrix R = diag(r_1, …, r_N). The inequality H^T H - R ⪰ 0 expresses the constraint that H^T H - R is positive semidefinite (all its eigenvalues nonnegative). Note that the problem is feasible, because R = 0 satisfies the constraints. We remark that problem (33) is not the only approach to derive a matrix R satisfying Proposition 2. For example, the objective function could be a weighted sum or other norm of (r_1, …, r_N). One convenient aspect of (33) is that it has the form of a standard convex problem.
Problem (33) can be recognized as a semidefinite optimization problem, a type of convex optimization problem for which algorithms have been developed and for which software is available [2]. The cost function in (33) is a linear function of the variables r_n, and the constraints are linear matrix inequalities (LMIs). To solve (33) and obtain R, we have used the MATLAB software package 'SeDuMi' [64].
Often, inverse problems arising in signal processing involve large data sets (e.g., speech, EEG, and images). Practical algorithms must be efficient in terms of memory and computation. In particular, they should be 'matrix-free', i.e., the operator H is not explicitly stored as a matrix, nor are individual rows or columns of H accessed or modified. However, optimization algorithms for semidefinite programming usually involve row/column matrix operations and are not 'matrix-free'. Hence, solving problem (33) will likely be a bottleneck for large scale problems. (In the deconvolution example below, the MSC solution using SDP takes from 35 to 55 times longer to compute than the L1 norm solution.) This motivates the development of semidefinite programming algorithms to solve (33) where H is not explicitly available, but for which multiplications by H and H^T are fast (this is not addressed in this paper).
Nevertheless, for 1D problems of 'medium' size (arising for example in biomedical applications [61]), (33) is readily solved via existing software. In case (33) is too computationally demanding, then the suboptimal choice R = α_min I can be used, as in GNC [6, 50]. Furthermore, we describe below a multistage algorithm whereby the proposed MSC approach is applied iteratively.
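To make the role of (33) concrete without an SDP solver, the following NumPy-only sketch greedily enlarges each diagonal entry by bisection while keeping H^T H minus the diagonal positive semidefinite. It is a feasible heuristic only; unlike a true SDP solver such as SeDuMi, it can stall at a suboptimal point once the semidefinite boundary is reached:

```python
import numpy as np

def diag_lower_bound(G, sweeps=10, tol=1e-9):
    """Greedy sketch of problem (33): grow diagonal entries r_n >= 0,
    one at a time by bisection, keeping G - diag(r) positive
    semidefinite. A heuristic stand-in for a true SDP solver."""
    n = G.shape[0]
    r = np.zeros(n)

    def feasible(v):
        return np.linalg.eigvalsh(G - np.diag(v)).min() >= -1e-10

    for _ in range(sweeps):
        for i in range(n):
            # Bisection: lo is feasible, hi is infeasible by construction.
            lo, hi = 0.0, float(G[i, i]) - r[i] + 1.0
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                trial = r.copy()
                trial[i] += mid
                if feasible(trial):
                    lo = mid
                else:
                    hi = mid
            r[i] += lo
    return r

G = np.array([[1.0, 0.5],
              [0.5, 2.0]])
r = diag_lower_bound(G)   # feasible: G - diag(r) stays PSD
```

Because each coordinate is maximized with the others fixed, the result generally undershoots the SDP optimum of (33); it is meant only to illustrate the trade-off the SDP resolves jointly.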
III-C. Optimality Conditions and Threshold Selection
When the cost function F in (25) is strictly convex, its minimizer must satisfy specific conditions [29], [4, Prop. 1.3]. These conditions can be used to verify the optimality of a solution produced by a numerical algorithm. The conditions also aid in setting the regularization parameters λ_n.
If F in (25) is strictly convex, and φ(·; a) is differentiable except at zero, then x minimizes F if
(34)  (1/λ_n) [H^T (y - H x)]_n = φ′(x_n; a_n) if x_n ≠ 0, and
      (1/λ_n) [H^T (y - H x)]_n ∈ [-φ′(0+; a_n), φ′(0+; a_n)] if x_n = 0,
where [v]_n denotes the n-th component of the vector v.
The optimality of a numerically obtained solution can be illustrated by a scatter plot of (1/λ_n)[H^T (y - H x)]_n versus x_n, for all n. For the example below, Fig. 8 illustrates such a scatter plot, wherein the points lie on the graph of φ′. We remark that the scatter plot representation as in Fig. 8 makes sense only when the points for all n lie on the graph of a single function, as they do for the log and atan penalties, (12) and (23). This might not be the case for other parametric penalties.
The condition (34) can be used to set the regularization parameters λ_n, as in Ref. [30]. Suppose y follows the model (24) where x is sparse. One approach for setting λ_n is to ask that the solution to (25) be all-zero when x is all-zero in the model (24). Note that if x = 0, then y consists of noise only (i.e., y = w). In this case, (34) suggests that λ_n be chosen such that
(35)  |[H^T w]_n| ≤ λ_n φ′(0+; a_n).
For the L1 norm, logarithmic, and arctangent penalty functions, φ′(0+; a) = 1, so (35) can be written as
(36)  λ_n ≥ |[H^T w]_n|.
However, the larger is, the more will be attenuated. Hence, it is reasonable to set to the smallest value satisfying (36), namely,
(37) 
where w is the additive noise. Although (37) assumes availability of the noise signal w, which is unknown in practice, (37) can often be estimated based on knowledge of statistics of the noise w. For example, based on the 'three-sigma rule', we obtain

(38)   λᵢ ≈ 3 std([Hᵀw]ᵢ)

If w is white Gaussian noise with variance σ², then

(39)   std([Hᵀw]ᵢ) = σ ‖hᵢ‖₂

where hᵢ denotes column i of H. For example, if H denotes linear convolution, then all columns of H have equal norm and (38) becomes

(40)   λ = 3 σ ‖h‖₂

where h is the impulse response of the convolution system.
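This rule is easy to apply in code. The following NumPy sketch (illustrative; the impulse response h, the noise level σ, and the signal length are arbitrary choices, not the values of the example below) computes λ = 3σ‖h‖₂ and checks empirically that std([Hᵀw]ᵢ) matches σ‖h‖₂ for white noise:

```python
import numpy as np

def lam_three_sigma(h, sigma):
    # 'Three-sigma rule' threshold for convolution: lam = 3 * sigma * ||h||_2
    return 3.0 * sigma * np.linalg.norm(h)

rng = np.random.default_rng(1)
h = np.array([1.0, 0.5, -0.25])    # hypothetical impulse response
sigma = 0.5                        # hypothetical noise standard deviation

# [H^T w]_i correlates the noise with shifted copies of h
w = rng.normal(scale=sigma, size=200_000)
HTw = np.correlate(w, h, mode="valid")

# Empirically, std([H^T w]_i) matches sigma * ||h||_2
assert abs(HTw.std() - sigma * np.linalg.norm(h)) < 0.01

lam = lam_three_sigma(h, sigma)    # threshold per the rule above
```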
III-D Usage of Method
We summarize the foregoing approach, MSC, to sparsity-penalized least squares, cf. (25). We assume the parameters λᵢ are fixed (e.g., set according to the additive noise variance).
The penalty function φ need not be the logarithmic or arctangent penalty function discussed above. Another parametric penalty function can be used, but it must have the property that F in (26) is convex when the parameters lie in a suitable set. Note that φ(x; p) = |x|^p with 0 < p < 1 does not qualify, because the resulting F is nonconvex for every such p. On the other hand, the firm penalty function [32] could be used.
In step (3), for the logarithmic and arctangent penalty functions, one can use

(41)   aᵢ = β [R]ᵢᵢ / λᵢ,   0 ≤ β ≤ 1

When β = 0, the penalty function is simply the L1 norm; in this case, the proposed method offers no advantage relative to L1 norm penalized least squares (BPD/lasso). When β = 1, the penalty function is maximally nonconvex (maximally sparsity-inducing) subject to F being convex. Hence, as it is not an arbitrary choice, β = 1 can be taken as a recommended default value. We have used β = 1 in the examples below.
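For reference, the logarithmic and arctangent penalties, (12) and (23), take the forms φ(x; a) = (1/a) log(1 + a|x|) and φ(x; a) = (2/(a√3)) (atan((1 + 2a|x|)/√3) − π/6). The following NumPy sketch (an illustration, not the authors' code) checks the properties used throughout: φ(0; a) = 0, φ′(0⁺; a) = 1, and φ(x; a) → |x| as a → 0, which is the β = 0 (L1) limit:

```python
import numpy as np

def phi_log(x, a):
    # Logarithmic penalty: phi(x; a) = (1/a) * log(1 + a|x|)
    return np.log(1.0 + a * np.abs(x)) / a

def phi_atan(x, a):
    # Arctangent penalty: phi(x; a) = 2/(a*sqrt(3)) * (atan((1+2a|x|)/sqrt(3)) - pi/6)
    s3 = np.sqrt(3.0)
    return (2.0 / (a * s3)) * (np.arctan((1.0 + 2.0 * a * np.abs(x)) / s3) - np.pi / 6.0)

x = np.linspace(-5.0, 5.0, 101)
for phi in (phi_log, phi_atan):
    assert abs(phi(0.0, 1.0)) < 1e-9                      # phi(0; a) = 0
    assert abs(phi(1e-8, 1.0) / 1e-8 - 1.0) < 1e-6        # phi'(0+; a) = 1
    assert np.allclose(phi(x, 1e-6), np.abs(x), atol=1e-4)  # a -> 0 gives |x|
```

The arctangent penalty is the more strongly concave of the two on the positive axis, which is the source of its stronger sparsity-inducing behavior.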
III-E Iterative MSC (IMSC)
An apparent limitation of the proposed approach, MSC, is that for some problems of interest, the parameters aᵢ are either equal to zero or nearly equal to zero for all i, i.e., aᵢ ≈ 0. In this case, the method requires that F be convex or practically convex. For example, for the logarithmic and arctangent penalty functions, aᵢ → 0 leads to φ(x; aᵢ) → |x|. As a consequence, the penalty function is practically the L1 norm. In this case, the method offers no advantage in comparison with L1 norm penalized least squares (BPD/lasso).
The situation wherein aᵢ ≈ 0 arises in two standard sparse signal processing problems: basis pursuit denoising and deconvolution. In deconvolution, if the system is noninvertible or nearly singular (i.e., the frequency response has a null or approximate null at one or more frequencies), then the lower bound R will be R ≈ 0. In BPD, the matrix H often represents the inverse of an overcomplete frame (or dictionary), in which case the lower bound R is again close to zero.
In order to broaden the applicability of MSC, we describe iterative MSC (IMSC), wherein MSC is applied several times. On each iteration, MSC is applied only to the nonzero elements of the sparse solution obtained at the previous iteration. Each iteration involves only those columns of H corresponding to the previously identified nonzero components. As the number of active columns of H diminishes over the iterations, problem (33) produces a sequence of increasingly positive diagonal matrices R. Hence, as the iterations progress, the penalty functions become increasingly nonconvex. The procedure is repeated until there is no change in the index set of nonzero elements.
The IMSC algorithm can be initialized with the L1 norm solution, i.e., using aᵢ = 0 for all i. (For the logarithmic and arctangent penalties, φ(x; 0) = |x|.) We assume the L1 norm solution is reasonably sparse; otherwise, sparsity is likely not useful for the problem at hand. The algorithm should be terminated when there is no change (or only insignificant change) in the active set from one iteration to the next.
The IMSC procedure is described as follows, where k denotes the iteration index.

1) Initialization. Find the L1 norm solution:

(42)   x⁽¹⁾ = arg min over x of ½‖y − Hx‖₂² + Σᵢ λᵢ |xᵢ|

Set k = 1 and K⁽⁰⁾ = M. Note that H is of size N × M.

2) Identify the nonzero elements of x⁽ᵏ⁾, and record their indices in the set S⁽ᵏ⁾,

(43)   S⁽ᵏ⁾ = { i : xᵢ⁽ᵏ⁾ ≠ 0 }

This is the support of x⁽ᵏ⁾. Let K⁽ᵏ⁾ be the number of nonzero elements of x⁽ᵏ⁾, i.e., K⁽ᵏ⁾ = |S⁽ᵏ⁾|.

3) Check the termination condition: if K⁽ᵏ⁾ is not less than K⁽ᵏ⁻¹⁾, then terminate. The output is x⁽ᵏ⁾.

4) Define H⁽ᵏ⁾ as the submatrix of H containing only the columns indexed by S⁽ᵏ⁾. The matrix H⁽ᵏ⁾ is of size N × K⁽ᵏ⁾. Find a positive semidefinite diagonal matrix R⁽ᵏ⁾ lower bounding (H⁽ᵏ⁾)ᵀH⁽ᵏ⁾, i.e., solve problem (33), or use R⁽ᵏ⁾ = α_min I. The matrix R⁽ᵏ⁾ is of size K⁽ᵏ⁾ × K⁽ᵏ⁾.

5) Set the parameters aᵢ⁽ᵏ⁾ such that the cost function in (45) is convex. For example, with the logarithmic and arctangent penalties, one may set

(44)   aᵢ⁽ᵏ⁾ = β [R⁽ᵏ⁾]ᵢᵢ / λᵢ,   i ∈ S⁽ᵏ⁾

for some 0 ≤ β ≤ 1.

6) Solve the K⁽ᵏ⁾-dimensional convex problem:

(45)   u⁽ᵏ⁾ = arg min over u of ½‖y − H⁽ᵏ⁾u‖₂² + Σᵢ λᵢ φ(uᵢ; aᵢ⁽ᵏ⁾)

Set x⁽ᵏ⁺¹⁾ as

(46)   xᵢ⁽ᵏ⁺¹⁾ = uᵢ⁽ᵏ⁾ for i ∈ S⁽ᵏ⁾, and xᵢ⁽ᵏ⁺¹⁾ = 0 for i ∉ S⁽ᵏ⁾

Set k ← k + 1 and go to step 2).
In the IMSC algorithm, the support of x⁽ᵏ⁾ can only shrink from one iteration to the next, i.e., S⁽ᵏ⁺¹⁾ ⊆ S⁽ᵏ⁾ and K⁽ᵏ⁺¹⁾ ≤ K⁽ᵏ⁾. Once there is no further change in S⁽ᵏ⁾, each subsequent iteration will produce exactly the same result, i.e.,

(47)   S⁽ᵏ⁺¹⁾ = S⁽ᵏ⁾  ⟹  x⁽ᵏ⁺²⁾ = x⁽ᵏ⁺¹⁾
For this reason, the procedure should be terminated when S⁽ᵏ⁾ ceases to shrink. In the 1D sparse deconvolution example below, the IMSC procedure terminates after only three or four iterations.
Note that problem (33) in step 4) reduces in size as the algorithm progresses; hence, each instance of (33) requires less computation than the previous one. More importantly, each matrix H⁽ᵏ⁺¹⁾ comprises a subset of the columns of H⁽ᵏ⁾. Hence, R⁽ᵏ⁺¹⁾ is less constrained than R⁽ᵏ⁾, and the penalty functions become more nonconvex (more strongly sparsity-inducing) as the iterations progress. Therefore, the IMSC algorithm produces a sequence of successively sparser solutions x⁽ᵏ⁾.
Initializing the IMSC procedure with the L1 norm solution substantially reduces the computational cost of the algorithm. Note that if the L1 norm solution is sparse, i.e., K⁽¹⁾ ≪ M, then all the semidefinite optimization problems (33) have far fewer variables than M, i.e., K⁽ᵏ⁾ ≤ K⁽¹⁾ ≪ M. Hence, IMSC can be applied to larger data sets than would otherwise be computationally practical, due to the computational cost of (33).
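To make the structure of IMSC concrete, the following NumPy sketch implements the outer support-shrinking loop. For simplicity, the inner problem (45) is replaced here by a plain L1 (ISTA) solve and the SDP step (33) is omitted, so this illustrates only the iteration structure, not the full MSC penalty design:

```python
import numpy as np

def ista_l1(H, y, lam, n_iter=500):
    # ISTA for 0.5*||y - Hx||^2 + lam*||x||_1; a stand-in for the inner solve (45)
    L = np.linalg.norm(H, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(H.shape[1])
    for _ in range(n_iter):
        z = x + H.T @ (y - H @ x) / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return x

def imsc_outer(H, y, lam, tol=1e-6, max_iter=10):
    # Outer IMSC loop: re-solve on the shrinking support until it stabilizes
    x = ista_l1(H, y, lam)                         # step 1: L1 initialization
    support = np.flatnonzero(np.abs(x) > tol)      # step 2: active set S
    for _ in range(max_iter):
        if support.size == 0:
            break
        Hk = H[:, support]                         # step 4: active columns only
        u = ista_l1(Hk, y, lam)                    # stand-in for (45)
        x = np.zeros(H.shape[1])
        x[support] = u                             # step 6: embed u into x
        new_support = np.flatnonzero(np.abs(x) > tol)
        if new_support.size >= support.size:       # step 3: support stopped shrinking
            break
        support = new_support
    return x
```

In the full method, each inner solve would use the penalty φ(·; aᵢ⁽ᵏ⁾) with parameters chosen as in step 5, making each stage more strongly sparsity-inducing than the last.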
III-F Deconvolution Example
A sparse signal x of length 1000 is generated so that (i) the inter-spike interval is uniformly random between 5 and 35 samples, and (ii) the amplitude of each spike is drawn from a uniform distribution. The signal is illustrated in Fig. 7.
The spike signal is then used as the input to a linear time-invariant (LTI) system, the output of which is contaminated by AWGN, w. The observed data, y, is written as y = h ∗ x + w, where h denotes the impulse response of the system. It can also be written as y = A⁻¹Bx + w, where A and B are banded Toeplitz matrices [60]. The observed data, y, is illustrated in Fig. 7.
Several algorithms for estimating the sparse signal x will be compared. The estimated signal is denoted x̂. The accuracy of the estimation is quantified by the L2 and L1 norms of the error signal and by the support error, denoted L2E, L1E, and SE, respectively:

L2E = ‖x̂ − x‖₂

L1E = ‖x̂ − x‖₁

SE = FZ + FN (defined below)

The support error, SE, is computed using S(x̂), the support of x̂. Namely, S(x̂) is defined as

(48)   S(x̂) = { i : |x̂ᵢ| > ε }

where ε is a small value to accommodate negligible nonzeros. The support error, SE, counts both the false zeros and the false nonzeros of x̂. The numbers of false zeros and false nonzeros are denoted FZ and FN, respectively.
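In code, these measures are straightforward to compute (a NumPy sketch; the value used here for ε is an arbitrary illustrative choice, not the value used for Table I):

```python
import numpy as np

def error_metrics(x_hat, x_true, eps=1e-3):
    # L2E and L1E: norms of the error signal
    l2e = np.linalg.norm(x_hat - x_true, 2)
    l1e = np.linalg.norm(x_hat - x_true, 1)
    # Supports per (48): entries larger than eps in magnitude
    s_hat = np.abs(x_hat) > eps
    s_true = np.abs(x_true) > eps
    fz = int(np.sum(s_true & ~s_hat))   # false zeros: missed true nonzeros
    fn = int(np.sum(~s_true & s_hat))   # false nonzeros: spurious detections
    return l2e, l1e, fz + fn, fz, fn    # SE = FZ + FN

x_true = np.array([0.0, 1.0, 0.0, -2.0])
x_hat = np.array([0.0005, 0.9, 0.5, 0.0])
l2e, l1e, se, fz, fn = error_metrics(x_hat, x_true)
```

Here the 0.0005 entry falls below ε and is treated as a zero, while the spurious 0.5 entry counts as a false nonzero.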
First, the sparse L1 norm solutions, i.e., φ(x) = |x| in (25), with and without debiasing, are computed. (Debiasing is a post-processing step wherein least squares is performed over the obtained support set [27].) We set λ according to (40), i.e., λ = 3σ‖h‖₂. The estimated signals are illustrated in Fig. 7. The errors L2E, L1E, and SE are noted in the figure. As expected, debiasing substantially improves the L2E and L1E errors of the L1 norm solution; however, it does not improve the support error, SE, because debiasing does not make the solution more sparse. The errors, averaged over 200 trials, are shown in Table I. Each trial consists of independently generated sparse and noise signals.
Algorithm                 |  L2E  |  L1E  |  SE   (FZ,  FN)
--------------------------|-------|-------|--------------------
L1 norm                   | 1.443 | 10.01 | 37.60 (10.3, 27.3)
L1 norm + debiasing       | 0.989 |  7.14 | 37.57 (10.3, 27.2)
AIHT [7]                  | 1.073 |  6.37 | 24.90 (12.4, 12.5)
ISD [68]                  | 0.911 |  5.19 | 19.67 (11.6,  8.1)
SBR [63]                  | 0.788 |  4.05 | 13.62 (12.0,  1.6)
IRL2                      | 0.993 |  5.80 | 16.32 (12.9,  3.4)
IRL2 + debiasing          | 0.924 |  4.82 | 16.32 (12.9,  3.4)
IRL1                      | 0.884 |  5.29 | 14.43 (11.5,  2.9)
IRL1 + debiasing          | 0.774 |  4.18 | 14.43 (11.5,  2.9)
IMSC (log)                | 0.864 |  5.08 | 17.98 ( 9.8,  8.2)
IMSC (log) + debiasing    | 0.817 |  4.83 | 17.98 ( 9.8,  8.2)
IMSC (atan)               | 0.768 |  4.29 | 15.43 (10.0,  5.5)
IMSC (atan) + debiasing   | 0.769 |  4.35 | 15.42 (10.0,  5.5)
IMSC/S (atan)             | 0.910 |  5.45 | 17.93 ( 9.8,  8.1)
IMSC/S (atan) + debiasing | 0.800 |  4.73 | 17.92 ( 9.8,  8.1)
Next, sparse deconvolution is performed using three algorithms developed to solve the highly nonconvex L0 quasi-norm problem: the Iterative Support Detection (ISD) algorithm [68] (http://www.caam.rice.edu/~optimization/L1/ISD/), the Accelerated Iterative Hard Thresholding (AIHT) algorithm [7] (http://users.fmrib.ox.ac.uk/~tblumens/sparsify/sparsify.html), and the Single Best Replacement (SBR) algorithm [63]. In each case, we used software by the respective authors. The ISD and SBR algorithms require regularization parameters, which we tuned to be approximately optimal. The AIHT algorithm requires that the number of nonzeros be specified; we used the number of nonzeros in the true sparse signal. Each of ISD, AIHT, and SBR significantly improves the accuracy of the result in comparison with the L1 norm solutions, with SBR being the most accurate. These algorithms essentially seek the correct support. They do not penalize the values in the detected support, so debiasing does not alter the signals produced by these algorithms.
The Lp quasi-norm, with 0 < p < 1, i.e., φ(x) = |x|^p, also substantially improves upon the L1 norm result. Several methods exist to minimize the cost function in this case. We implement two methods: IRL2 and IRL1 (iterative reweighted L2 and L1 norm minimization, respectively), with and without debiasing in each case. We used a value of p that we found to be about optimal on average for this deconvolution problem. As revealed in Table I, IRL1 is more accurate than IRL2. Note that IRL2 and IRL1 seek to minimize exactly the same cost function; hence the inferiority of IRL2 compared to IRL1 is due to the convergence of IRL2 to a local minimizer of F. Also note that debiasing substantially improves L2E and L1E (with no effect on SE) for both IRL2 and IRL1. The results demonstrate both the value of a nonconvex regularizer and the vulnerability of nonconvex optimization to local minimizers.
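To sketch the IRL1 idea (an illustration following the standard majorize-minimize derivation; not the exact implementation or parameter values used for Table I), each outer pass solves a weighted L1 problem whose weights wᵢ = p(|xᵢ| + ε)^(p−1) are the slopes of a linear majorizer of |x|^p at the previous iterate:

```python
import numpy as np

def irl1(H, y, lam, p=0.7, eps=1e-6, n_outer=8, n_inner=300):
    # Iterative reweighted L1 for 0.5*||y - Hx||^2 + lam * sum_i |x_i|^p
    L = np.linalg.norm(H, 2) ** 2           # Lipschitz constant of the gradient
    x = np.zeros(H.shape[1])
    w = np.ones(H.shape[1])                 # first pass is a plain L1 solve
    for _ in range(n_outer):
        for _ in range(n_inner):            # weighted-L1 ISTA pass
            z = x + H.T @ (y - H @ x) / L
            x = np.sign(z) * np.maximum(np.abs(z) - lam * w / L, 0.0)
        w = p * (np.abs(x) + eps) ** (p - 1.0)   # reweight at the new iterate
    return x
```

IRL2 follows the same pattern, with a reweighted quadratic (ridge) inner problem in place of the weighted L1 one; each pass drives small entries toward zero more aggressively than plain L1.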
The results of the proposed iterative MSC (IMSC) algorithm, with and without debiasing, are shown in Table I. We used β = 1 and set λ in accordance with (40). Results using the logarithmic (log) and arctangent (atan) penalty functions are tabulated; they show the improvement provided by the latter penalty in terms of L2E, L1E, and SE. While debiasing reduces the error (bias) of the logarithmic penalty, it has negligible effect on the arctangent penalty. The simplified form of the algorithm, wherein R = α_min I is used instead of the R computed via SDP, is also tabulated in Table I, denoted by IMSC/S. IMSC/S is more computationally efficient than IMSC due to the omission of the SDP; however, it does lead to an increase in the error measures.
The IMSC algorithm ran for three iterations on average. For example, the IMSC solution illustrated in Fig. 7 ran with K⁽¹⁾ = 61, K⁽²⁾ = 40, and K⁽³⁾ = 38. Therefore, even though the signal is of length 1000, the SDPs that had to be solved were much smaller: of sizes 61, 40, and 38 only.
The optimality of the MSC solution at each stage can be verified using (34). Specifically, a scatter plot of