Sparse and Smooth Signal Estimation: Convexification of L0 Formulations
Abstract.
Signal estimation problems with smoothness and sparsity priors can be naturally modeled as quadratic optimization with L0-"norm" constraints. Since such problems are nonconvex and hard to solve, the standard approach is, instead, to tackle their convex surrogates based on L1-norm relaxations.
In this paper, we propose new iterative conic quadratic relaxations that
exploit not only the L0-"norm" terms, but also the fitness and smoothness functions. The iterative convexification approach substantially closes the gap between the L0-"norm" and its L1 surrogate.
Experiments using an off-the-shelf conic quadratic solver on synthetic as well as real datasets indicate that the proposed iterative convex relaxations lead to significantly better estimators than the L1 norm while preserving computational efficiency. In addition, the parameters of the model and the resulting estimators are easily interpretable.
Keywords: Mixed-integer quadratic optimization, conic quadratic optimization, perspective formulation, lasso, sparsity.
A. Gómez, S. Han: Department of Industrial Engineering, Swanson School of Engineering, University of Pittsburgh, PA 15261. agomez@pitt.edu, shaoning.han@pitt.edu
18.05
November 2018
1. Introduction
Given nonnegative data y corresponding to a noisy realization of an underlying signal, we consider the problem of removing the noise and recovering the original, uncorrupted signal. A successful recovery of the signal requires effectively exploiting prior knowledge on the structure and characteristics of the signal.
A common prior on the underlying signal is smoothness. Smoothing considerations can be incorporated in denoising problems through quadratic penalties for deviations in successive estimates [62]. In particular, denoising of a smooth signal can be done by solving an optimization problem of the form
(1) min_{x} ‖x − y‖₂² + λ‖Lx‖₂²
where x corresponds to the estimation for y, λ ≥ 0 is a smoothing regularization parameter, L is a linear operator, the estimation error term ‖x − y‖₂² measures the fitness to data, and the quadratic penalty term ‖Lx‖₂² models the smoothness considerations. In its simplest form,
(2) ‖Lx‖₂² = Σ_{(i,j)∈A} (x_i − x_j)²,
where A encodes the notion of adjacency, e.g., consecutive observations in a time series or adjacent pixels in an image. If ‖Lx‖₂² is given according to (2), then problem (1) is a convex Markov Random Fields problem [41] or metric labeling problem [47], commonly used in the image segmentation context [16, 48], for which efficient combinatorial algorithms exist. Another well-known special case of (1), arising in time-series analysis, corresponds to
(3) ‖Lx‖₂² = Σ_{i=2}^{n−1} (x_{i−1} − 2x_i + x_{i+1})²;
in this case, (1) corresponds to the Hodrick–Prescott filter [?, ?, ?], popular in the economics literature, supported by standard packages for data analysis such as SAS and R, and solvable in O(n) operations. Even in its general form, (1) is a convex quadratic optimization problem, for which a plethora of efficient algorithms exist.
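Since (1) is an unconstrained convex quadratic, its minimizer solves the linear system (I + λL᷀ᵀL)x = y. The sketch below is illustrative, not from the paper: it assumes L is the first-difference operator of (2) on a path, with made-up data, and recovers a smoothed estimate with plain NumPy.

```python
import numpy as np

def smooth_denoise(y, lam):
    """Solve min_x ||x - y||^2 + lam * ||L x||^2 for the first-difference
    operator L; the optimality condition is (I + lam * L'L) x = y."""
    n = len(y)
    # First-difference operator: (L x)_i = x_{i+1} - x_i
    L = np.diff(np.eye(n), axis=0)
    return np.linalg.solve(np.eye(n) + lam * L.T @ L, y)

# Illustrative noisy data (not from the paper)
y = np.array([1.0, 1.2, 0.9, 4.0, 1.1, 0.8])
x_hat = smooth_denoise(y, lam=2.0)
```

For tridiagonal systems such as this one, a banded solver would give the O(n) complexity mentioned above; the dense solve is used here only for brevity.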
Another naturally occurring signal characteristic is sparsity, i.e., the underlying signal differs from a base value in only a small proportion of the indexes. Sparsity arises in diverse application domains including medical imaging [51], genomic studies [43], and face recognition [80], and is at the core of compressed sensing methods [25]. In fact, the "bet on sparsity" principle [35] calls for systematically assuming sparsity in high-dimensional statistical inference problems. Sparsity constraints can be modeled using the L0-"norm":
(4) min_{x} ‖x − y‖₂² + λ‖Lx‖₂² s.t. ‖x‖₀ ≤ k,
where k is a target sparsity and ‖x‖₀ = Σ_{i=1}^n 1[x_i ≠ 0], with 1[·] the indicator function equal to 1 if its argument is true and equal to 0 otherwise. Several generalizations of (4) can also be envisioned; see Section 5.3 for details.
Unlike (1), problem (4) is nonconvex and hard to solve exactly. The Lagrangean relaxation of (4), given by
(5) min_{x} ‖x − y‖₂² + λ‖Lx‖₂² + μ‖x‖₀
with μ ≥ 0, has received (slightly) more attention. Problem (5) corresponds to a Markov Random Fields problem with nonconvex deviation functions [see 1, 42], for which a pseudo-polynomial combinatorial algorithm exists, whose complexity depends on a precision parameter; to the best of our knowledge, this algorithm has not been implemented to date. More recently, in the context of signal denoising, Bach [7] proposed another pseudo-polynomial algorithm and demonstrated its performance on moderate-size instances. The aforementioned algorithms rely on a discretization of the continuous variables, and their performance depends on how precise the discretization is. Finally, a recent result of Atamtürk and Gómez [5] on quadratic optimization with M-matrices and indicators implies that (5) is equivalent to a submodular minimization problem, which leads to a strongly polynomial-time algorithm. The high complexity of black-box submodular minimization, however, precludes its use except on small instances. No polynomial-time algorithm is known for the constrained problem (4).
In fact, problems (4) and (5) are rarely tackled directly. One of the most popular techniques used to tackle signal estimation problems with sparsity is lasso, consisting of replacing the nonconvex term ‖x‖₀ with the convex L1 norm ‖x‖₁; see Section 2.1 for details. The resulting optimization problems with the L1 norm can be solved very efficiently, even for large instances; however, they are often weak relaxations of the exact problem (4), and the estimators obtained may be poor as a consequence. Alternatively, there is an increasing effort toward solving the mixed-integer optimization (MIO) problem (4) exactly using enumerative techniques; see Section 2.2. While the recovered signals are indeed of high quality, MIO approaches to date are unable to handle problems with thousands of variables, and are inadequate for many realistic instances as a consequence.
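To make the difficulty concrete: for very small n, problem (4) can be solved exactly by enumerating supports, since once the set S of nonzero entries is fixed the problem reduces to a convex quadratic in x_S. A minimal brute-force sketch, with illustrative data and the first-difference smoothing of (2) assumed:

```python
import itertools
import numpy as np

def l0_denoise_bruteforce(y, lam, k):
    """Exactly solve min ||x - y||^2 + lam*||Lx||^2 s.t. ||x||_0 <= k
    by enumerating every support of size <= k (tractable only for tiny n)."""
    n = len(y)
    L = np.diff(np.eye(n), axis=0)          # first-difference operator
    Q = np.eye(n) + lam * L.T @ L           # objective = x'Qx - 2y'x + y'y
    best_val, best_x = y @ y, np.zeros(n)   # x = 0 is always feasible
    for r in range(1, k + 1):
        for S in itertools.combinations(range(n), r):
            idx = list(S)
            x = np.zeros(n)
            # restricted problem is convex: x_S = Q_SS^{-1} y_S
            x[idx] = np.linalg.solve(Q[np.ix_(idx, idx)], y[idx])
            val = x @ Q @ x - 2 * y @ x + y @ y
            if val < best_val:
                best_val, best_x = val, x
    return best_x, best_val

y = np.array([0.1, 2.0, 2.1, 0.0, 1.9])
x_star, v_star = l0_denoise_bruteforce(y, lam=0.5, k=3)
```

The number of supports grows combinatorially in n and k, which is precisely why the relaxations discussed next are needed.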
Contributions and outline
In this paper, we discuss how to bridge the gap between the easy-to-solve L1 approximations and the often intractable L0 problems in a convex optimization framework. Specifically, we construct a set of iterative convex relaxations for problems (4) and (5) with increasing strength. These convex relaxations are considerably stronger than the usual L1 relaxation, and also significantly improve and generalize other existing convex relaxations in the literature, including the perspective relaxation (see Section 2.3) and recent convex relaxations obtained from simple pairwise quadratic terms (see Section 2.4). Moreover, the strong convex relaxations can be used as a standalone convex optimization method to obtain high-quality, if not optimal, solutions for (4)–(5), or can be embedded in MIO branch-and-bound methods to solve (4)–(5) to optimality; in both cases, the proposed approach results in better performance than the existing methods. Finally, all the proposed formulations are amenable to conic quadratic optimization techniques and can thus be tackled using off-the-shelf solvers, resulting in several benefits: (i) there is no need to develop specialized codes to use the formulations; (ii) the proposed approach is flexible, as it can be used to tackle either (4) or (5), and additional priors can easily be included by changing the objective or adding constraints (see Section 5.3); (iii) the formulations can be used either as convex surrogates for (4)–(5), resulting in better-quality estimators than L1-norm approximations with little additional computational cost, or can be embedded in branch-and-bound solvers to solve (4)–(5) to optimality, resulting in significant speedups over branch-and-bound methods with a "standard" formulation.
Figure 5 illustrates the performance of the classical L1-norm estimators and one of the proposed estimators, Msep; estimator Msep is obtained by solving a convex optimization problem and is computed in a few seconds for the instance shown.
The rest of the paper is organized as follows. In Section 2 we review the relevant background for the paper. In Section 3 we introduce the strong iterative convex formulations for (4)–(5). In Section 4 we discuss the implementation of the proposed formulations using conic quadratic solvers. In Section 5 we test the performance of the methods on instances with synthetic and real data, and in Section 6 we conclude the paper with a few final remarks.
Notation
Throughout the paper, we adopt the following convention for division by 0: given a ≥ 0, a/0 = +∞ if a > 0 and a/0 = 0 if a = 0. For a set X ⊆ ℝⁿ, cl conv(X) denotes the closure of the convex hull of X.
2. Background
In this section, we review formulations relevant to our discussion. First we review the usual L1-norm approximation (Section 2.1), next we discuss MIO formulations (Section 2.2), then we review the perspective reformulation, a standard technique in the MIO literature (Section 2.3), and finally we review recently proposed pairwise convex relaxations (Section 2.4).
2.1. L1-norm approximations
A standard technique for signal estimation problems with sparsity is to replace the L0 "norm" with the L1 norm in (4), leading to the convex optimization problem
(6) min_{x} ‖x − y‖₂² + λ‖Lx‖₂² s.t. ‖x‖₁ ≤ k.
The L1-norm approximation was proposed by Tibshirani [69] in the context of sparse linear regression, and is often referred to as lasso. The main motivation for lasso is that the L1 norm is the convex norm closest to the L0 "norm". In fact, for ‖x‖_∞ ≤ 1, it is easy to show that the L1 norm is the convex envelope of the L0 "norm"; therefore, the L1-norm approximation is considered to be the best possible convex relaxation of the L0 "norm".
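The envelope claim can be verified in one dimension; by separability it extends coordinatewise. A short sketch:

```latex
% Claim: on [-1,1], |t| is the convex envelope of 1[t != 0].
% (i) |t| is convex and |t| <= 1[t != 0] for all t in [-1,1].
% (ii) Let h be any convex underestimator of 1[t != 0] on [-1,1].
%      For t in [0,1], write t = (1-t)*0 + t*1; convexity gives
h(t) \;\le\; (1-t)\,h(0) + t\,h(1) \;\le\; (1-t)\cdot 0 + t\cdot 1 \;=\; t,
% and symmetrically h(t) <= -t on [-1,0], so h <= |.|.
% Hence |t| is the largest convex underestimator, i.e., the envelope.
```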
Lasso is currently the most commonly used approach for sparsity [37]. It has been applied to a variety of signal estimation problems including signal decomposition and spike detection [e.g., 21, 30, 76, 49], and is pervasive in the compressed sensing literature [17, 18, 26]. A common variant of lasso is fused lasso [71], which involves a sparsity-inducing term of the form Σ_i |x_{i+1} − x_i|; fused lasso was further studied in the context of signal estimation [65], and is often used in digital image processing under the name of total variation denoising [66, 75, 59]. Several other generalizations of lasso exist [70], including elastic net [85, 58], adaptive lasso [84], group lasso [8, 63] and smooth lasso [38]; related L1-norm techniques have also been proposed for signal estimation, see [46, 53, 73]. The generalized lasso [72] utilizes a regularization term of the form ‖Dx‖₁ for a matrix D, and has also been studied in the context of signal approximation.
Despite its widespread adoption, lasso has several drawbacks. First, the L1-norm term may result in excessive shrinkage of the estimated signal, which is undesirable in many contexts [81]. Additionally, lasso may struggle to achieve sparse estimators; in fact, solutions to (6) are often dense, and achieving a target sparsity of k requires a substantially more restrictive regularization parameter, inducing additional bias on the estimators. As a consequence, desirable theoretical performance of the lasso can only be established under stringent conditions [65, 67], which may not be satisfied in practice. Indeed, L1 approximations have been shown to perform rather poorly in a variety of contexts, e.g., see [45, 56]. To overcome the aforementioned drawbacks, several nonconvex approximations have been proposed [28, 54, 82, 83]; more recently, there is also an increasing effort devoted to enforcing sparsity directly with L0 regularization using enumerative MIO approaches.
2.2. Mixed-integer optimization
Signal estimation problems with sparsity can be naturally modeled as mixed-integer quadratic optimization (MIQO) problems. Using indicator variables z ∈ {0,1}ⁿ such that x_i = 0 whenever z_i = 0, problem (4) can be formulated as
(7a) min Σ_{i=1}^n (x_i − y_i)² + λ‖Lx‖₂²
s.t. (7b) Σ_{i=1}^n z_i ≤ k
(7c) x_i(1 − z_i) = 0, i = 1, …, n
(7d) x ∈ ℝⁿ, z ∈ {0,1}ⁿ.
In this formulation, the nonconvexity of the L0 regularizer is captured by the complementarity constraints (7c) and the binary constraints z ∈ {0,1}ⁿ. Constraints (7c) can alternatively be formulated with the so-called "big-M" constraints, for a sufficiently large positive number M:
(8) −Mz_i ≤ x_i ≤ Mz_i, i = 1, …, n.
For the signal estimation problem (7), M = max_i y_i is a valid upper bound on |x_i| for all i. Problem (7) is a convex MIQO problem, which can be tackled using off-the-shelf MIO solvers. Estimation problems with a few hundred variables can be comfortably solved to optimality using such solvers, e.g., see [11, 22, 32, 78]. For high Signal-to-Noise Ratios (SNR), the estimators obtained from solving the exact L0 problems indeed result in superior statistical performance when compared with the L1 approximations [12]. For low SNR, however, the lack of shrinkage may hurt the estimators obtained from optimal solutions of the L0 problems [36]; nonetheless, if necessary, shrinkage can easily be added to (7) via conic quadratic regularization terms [55], resulting again in superior statistical performance over the corresponding L1 approximations. Unfortunately, current MIO solvers are unable to solve larger problems with thousands of variables.
The standard technique for solving MIO problems is the branch-and-bound method, which recursively partitions the search space and solves convex relaxations for each partition to produce lower bounds for fathoming sections of the search space. Simple convex relaxations of MIO problems can be obtained by replacing the integrality constraints z ∈ {0,1}ⁿ with the bound constraints z ∈ [0,1]ⁿ. While proving optimality may require an exponential number of iterations, bounds from strong convex relaxations can reduce the number of iterations by many orders of magnitude, and in some cases even eliminate the need for enumeration completely in practice. Branch-and-bound methods have been tremendously successful for mixed-integer linear optimization [13], and practical problems with thousands of variables are routinely solved to optimality or near-optimality nowadays. The progress for solving estimation problems such as (7), however, has so far been limited, because most solvers are ill-equipped for constructing and handling strong convex relaxations of problems involving nonlinear functions. In addition, big-M constraints (8) are often weak and result in poor relaxations of the discrete problems.
Finally, we point out the relationship between the lasso approximation (6) and the MIO formulation (7). It can easily be verified that the simple convex relaxation with big-M constraints (8) has an optimal solution with z_i = |x_i|/M for all i. Therefore, constraint (7b) reduces to ‖x‖₁ ≤ kM, and we find that lasso is in fact the natural convex relaxation of (7) (for a suitable sparsity parameter). This relaxation is often weak and can be improved substantially.
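The reduction can be seen directly: in the continuous relaxation, z_i enters only through the cardinality constraint and the big-M bounds, so it is optimal to take z_i as small as (8) allows,

```latex
z_i^\star \;=\; \frac{|x_i|}{M}
\quad\Longrightarrow\quad
\sum_{i=1}^n z_i^\star \le k
\;\Longleftrightarrow\;
\|x\|_1 \le kM,
```

which is exactly a lasso-type constraint with radius kM.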
2.3. The perspective reformulation
A simple strengthening technique to improve the convex relaxation of (7) is the perspective reformulation [27], referred to as persp. in the remainder of the paper for brevity. This reformulation technique can be applied to the estimation error terms in (7a) as follows:
(9) (x_i − y_i)² = x_i² − 2y_i x_i + y_i² → x_i²/z_i − 2y_i x_i + y_i².
The term x_i²/z_i is the closure of the perspective function of the quadratic function x_i², and is therefore convex, see p. 160 of [40]. Reformulation (9) is in fact the best possible for separable quadratic functions with indicator variables. The perspective terms can be replaced with an auxiliary variable t_i along with the rotated cone constraints x_i² ≤ t_i z_i [2, 34]. Therefore, persp. relaxations can easily be solved with conic quadratic solvers, and the reformulation is by now a standard technique for mixed-integer quadratic optimization [15, 39, 52, 79]. Additionally, relationships between the persp. and the sparsity-inducing nonconvex penalty functions minimax concave penalty [81] and reverse Huber penalty [60] have recently been established [24]. In the context of the signal estimation problem (4), the persp. yields the convex relaxation obtained by replacing each term (x_i − y_i)² in (7a) with x_i²/z_i − 2y_i x_i + y_i² and relaxing z ∈ {0,1}ⁿ to z ∈ [0,1]ⁿ.
The lasso approximation, as discussed in Section 2.1, is the best convex relaxation that considers only the indicator variables for the x_i terms. The persp. approximation is the best convex relaxation that exploits both the indicator variables and the separable quadratic estimation error terms; thus, the resulting relaxation is stronger than lasso. However, the persp. cannot be applied to the nonseparable quadratic smoothness terms (x_i − x_j)², as the resulting function is nonconvex due to the bilinear term −2x_i x_j.
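The strengthening is visible even in one dimension. The sketch below uses an illustrative instance (y = 1, μ = 1, M = 1, not from the paper) and compares, by direct grid search, the lower bound of the natural big-M relaxation with that of the perspective relaxation; the perspective bound is strictly larger.

```python
import numpy as np

y, mu, M = 1.0, 1.0, 1.0  # illustrative 1-D instance: min (x - y)^2 + mu*z

# Natural big-M relaxation: z = |x|/M is optimal, giving (x - y)^2 + mu*|x|/M.
xs = np.linspace(-M, M, 2001)
nat_bound = np.min((xs - y) ** 2 + mu * np.abs(xs) / M)

# Perspective relaxation: min over z in (0,1], |x| <= M*z of
#   x^2/z - 2*y*x + y^2 + mu*z   (the persp. substitution of (9))
persp_bound = np.inf
for z in np.linspace(1e-6, 1.0, 2001):
    x = np.clip(y * z, -M * z, M * z)  # minimizer of x^2/z - 2*y*x over the box
    persp_bound = min(persp_bound, x ** 2 / z - 2 * y * x + y ** 2 + mu * z)
```

On this instance the mixed-integer optimal value is 1 (attained at both z = 0 and z = 1), so the perspective relaxation is tight here while the big-M relaxation leaves a gap.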
2.4. Strong formulations for pairwise quadratic terms
Recently, Jeon et al. [44] gave strong relaxations for the mixed-integer epigraphs of nonseparable convex quadratic functions of two variables with indicator variables. Atamtürk and Gómez [5] further strengthened the relaxations for quadratic functions of the form (x_i − x_j)², corresponding to the smoothness terms in (7). Specifically, let
and define the function as
Proposition 1 (Atamtürk and Gómez [5]).
The function is convex and
Using persp. and Proposition 1, one obtains the following stronger pairwise convex relaxation of (7):
(10a)  
(10b)  
(10c)  
(10d) 
Note that the function is not differentiable everywhere and is defined piecewise. Therefore, it cannot be used directly with most convex optimization solvers. Atamtürk and Gómez [5] implement (10) using linear outer approximations of the function: the resulting method performs adequately on small instances, but was ineffective on larger instances, as strong linear outer approximations require the addition of a large number of constraints. Moreover, as Example 1 below shows, formulation (10) can be further improved even on small instances.
Example 1.
Consider the following signal estimation problem in Lagrangean form:
(11a)  
s.t.  (11b)  
(11c) 
The optimal solution of (11) and the optimal solutions of its convex relaxations are as follows:
 lasso:

Obtained by replacing with . The corresponding optimal solution is , and we find that .
 persp.:

The optimal solution is , and .
 pairwise :

The optimal solution is , and .
Thus, although persp. and pairwise substantially improve upon the lasso relaxation, the resulting solutions are not necessarily integral in z.
In this paper, we show how to further improve the pairwise formulation to obtain a stronger relaxation of (7). Additionally, we show how to implement pairwise and the stronger formulations derived in the paper in a conic quadratic optimization framework. Therefore, all the proposed convex relaxations benefit from a vast and growing literature on conic quadratic optimization, e.g., see [3, 4, 6, 50, 57], can easily be implemented with offtheshelf solvers, and scale to large instances.
3. Strong convex formulations for signal estimation
In the pairwise formulation, each single- and two-variable quadratic term is strengthened independently; consequently, the formulation fails to fully exploit the relationships between different pairs of variables. Observe that problem (7) can be stated as
(12) min { x᷀ᵀQx − 2y᷀ᵀx + y᷀ᵀy : Σ_{i=1}^n z_i ≤ k, x_i(1 − z_i) = 0 ∀i, x ∈ ℝⁿ, z ∈ {0,1}ⁿ },
where, for i ≠ j, Q_ij = −λ if (i,j) ∈ A and Q_ij = 0 otherwise, and where Q_ii = 1 + λ·deg(i), with deg(i) the number of arcs of A incident to i. In particular, Q is a symmetric M-matrix, i.e., Q_ij ≤ 0 for all i ≠ j and Q is positive semidefinite. In this section we derive convex relaxations of (7) that better exploit the M-matrix structure. We briefly review properties of M-matrices and refer the reader to [10, 31, 61, 74] and the references therein for an in-depth discussion of M-matrices.
Proposition 2 (Plemmons [61], characterization 37).
An M-matrix Q is generalized diagonally dominant, i.e., there exists a positive diagonal matrix D such that DQD is (weakly) diagonally dominant.
Generalized diagonally dominant matrices are also called scaled diagonally dominant matrices in the literature.
Proposition 3 (Boman et al. [14]).
A symmetric matrix Q is generalized diagonally dominant iff it has factor width at most two, i.e., there exists a real matrix V such that Q = VV᷀ᵀ and each column of V contains at most two nonzeros.
Proposition 3 implies that if Q is an M-matrix, then the quadratic function x᷀ᵀQx can be written as a sum of quadratic functions of at most two variables each, i.e., x᷀ᵀQx = Σ_k (v_k᷀ᵀx)², where each v_k has at most two nonzero entries. Thus, to derive stronger formulations for (12), we first study the mixed-integer epigraphs of parametric pairwise quadratic functions with indicators.
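For the matrix Q = I + λL᷀ᵀL of (12), which is already diagonally dominant, such a decomposition can be written explicitly: each negative off-diagonal entry yields a pairwise term (x_i − x_j)², and the leftover diagonal dominance yields separable terms x_i². A NumPy sketch (path-graph instance assumed for illustration):

```python
import numpy as np

def pairwise_decomposition(Q):
    """Write x'Qx as sum_{i<j} w_ij (x_i - x_j)^2 + sum_i r_i x_i^2 for a
    symmetric, (weakly) diagonally dominant M-matrix Q, with w_ij = -Q_ij
    and residuals r_i = Q_ii - sum_{j != i} |Q_ij| >= 0."""
    n = Q.shape[0]
    pairs = [(i, j, -Q[i, j])
             for i in range(n) for j in range(i + 1, n) if Q[i, j] < 0]
    # Q_ii - sum_{j != i} |Q_ij|  ==  2*Q_ii - sum_j |Q_ij|  (since Q_ii > 0)
    resid = np.array([2 * Q[i, i] - np.sum(np.abs(Q[i])) for i in range(n)])
    return pairs, resid

# Q arising from (12) with a path graph: Q = I + lam * L'L
lam, n = 0.7, 5
L = np.diff(np.eye(n), axis=0)
Q = np.eye(n) + lam * L.T @ L

pairs, resid = pairwise_decomposition(Q)

# Reconstruct Q from the two-variable and one-variable pieces
R = np.diag(resid)
for i, j, w in pairs:
    e = np.zeros(n)
    e[i], e[j] = 1.0, -1.0
    R += w * np.outer(e, e)
```

The columns sqrt(w_ij)(e_i − e_j) and sqrt(r_i)e_i form exactly the factor-width-two matrix V of Proposition 3; a general M-matrix would first require the diagonal scaling of Proposition 2.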
3.1. Convexification of the parametric pairwise terms
Consider the mixedinteger epigraph of a parametric pairwise quadratic term (with parameters )
where and , which is the necessary and sufficient condition for convexity of the function . Note that one may assume the product coefficient equals , as otherwise the continuous variables and coefficients can be scaled. Clearly, if , then reduces to .
Consider the decompositions of the twovariable quadratic function in the definition of given by
Intuitively, the decompositions above are obtained by extracting a separable term from the quadratic function so that its coefficient is as large as possible while the remaining quadratic term is still convex. Then, applying persp. and Proposition 1 to the separable and pairwise quadratic terms, respectively, one obtains two valid inequalities for the set:
(13)  
(14) 
Theorem 1 below shows that inequalities (13)–(14), along with the bound constraints, are sufficient to describe the convex hull of the set.
Theorem 1.
Proof.
Consider the mixedinteger optimization problem
(15) 
and the corresponding convex optimization
(16a)  
s.t.  (16b)  
(16c)  
(16d) 
To prove the result it suffices to show that, for any value of the parameters, either (15) and (16) are both unbounded, or (16) has an optimal solution that is also optimal for (15). We assume, without loss of generality, that (if , the result follows from Proposition 1 by scaling), (if , both problems are unbounded by letting , and if , problem (16) reduces to linear optimization over an integral polytope and optimal solutions are integral in ), and (by scaling). Moreover, since , there exists an optimal solution for both (15) and (16).
Let be an optimal solution of (16); we show how to construct from it a feasible solution for (15) with the same objective value, which is thus optimal for both problems. Observe that for , . Thus, if , then is also feasible for (16) with objective value . In particular, either there exists an (integral) optimal solution with by setting , or there exists an optimal solution with one of the variables equal to one by increasing . Thus, assume without loss of generality that . Now consider the optimization problem
(17a)  
(17b) 
obtained from (16) by fixing , dropping constraint (16c), and eliminating variable since (16b) holds at equality in optimal solutions. An integer optimal solution for (17) is also optimal for (15) and (16). Let be an optimal solution for (17), and consider the two cases:
Case 1: : If , then the point with is feasible for (17) with objective value
Therefore, there exists an optimal solution where .
Case 2: : In this case, is an optimal solution of
(18a)  
(18b) 
The condition implies that , thus the optimal value can be found by taking the derivative and setting it to 0. We find
Replacing the variable with its optimal value in (18) and removing constant terms, we find that (18) is equivalent to
(19a)  
(19b) 
If , then the point with is feasible for (19) with objective value
Therefore, there exists an optimal solution where .
Example 1 (continued).
Given , define the function as
(20) 
For any parameters with , the function is the pointwise maximum of two convex functions and is therefore convex. Using this convex function, Theorem 1 can be restated as
Finally, it is easy to verify that if , then the maximum in (20) corresponds to the first term; if , the maximum corresponds to the second term. Thus, an explicit expression for the function is
3.2. Convexifications for general Mmatrices
Consider now the set
where Q is an M-matrix. In this section, we show how the convex hull descriptions for the pairwise sets can be used to construct strong convex relaxations for this set. We start with the following motivating example.
Example 2.
Consider the signal estimation problem in Lagrangean form with , , and ,
(21a)  
s.t.  (21b)  
(21c)  
(21d) 
The optimal solution of (21) is with objective value . The optimal solutions and the corresponding objective values of the convex relaxations of (21) are as follows:
 lasso:

The opt. solution is with value , and .
 persp.:

The opt. solution is with value , and .
 pairwise :

The opt. solution with value , and .
 decomp.1:

The quadratic constraint (21b) can be decomposed and strengthened as follows:
leading to the solution with value , and .
 decomp.2:
As Example 2 shows, strong convex relaxations can be obtained by decomposing the quadratic form x᷀ᵀQx into sums of two-variable quadratic terms (as Q is an M-matrix) and convexifying each term. However, such a decomposition is not unique, and the strength of the relaxation depends on the decomposition chosen. We now discuss how to best decompose the matrix Q to derive the strongest lower bound possible.
Consider the separation problem: given a point, find a decomposition of Q such that, after strengthening each two-variable term, one obtains a most violated inequality. This problem is formulated as follows: