Sparse and Smooth Signal Estimation: Convexification of L0 Formulations
Signal estimation problems with smoothness and sparsity priors can be naturally modeled as quadratic optimization with -“norm” constraints. Since such problems are non-convex and hard-to-solve, the standard approach is, instead, to tackle their convex surrogates based on -norm relaxations.
In this paper, we propose new iterative conic quadratic relaxations that
exploit not only the -“norm” terms, but also the fitness and smoothness functions. The iterative convexification approach substantially closes the gap between the -“norm” and its surrogate.
Experiments using an off-the-shelf conic quadratic solver on synthetic as well as real datasets indicate that the proposed iterative convex relaxations lead to significantly better estimators than -norm while preserving the computational efficiency. In addition, the parameters of the model and the resulting estimators are easily interpretable.
Keywords Mixed-integer quadratic optimization, conic quadratic optimization, perspective formulation, lasso, sparsity.
A. Gómez, S. Han: Department of Industrial Engineering, Swanson School of Engineering, University of Pittsburgh, PA 15261. firstname.lastname@example.org, email@example.com
Given nonnegative data corresponding to a noisy realization of an underlying signal, we consider the problem of removing the noise and recovering the original, uncorrupted signal . A successful recovery of the signal requires exploiting prior knowledge on the structure and characteristics of the signal effectively.
A common prior knowledge on the underlying signal is smoothness. Smoothing considerations can be incorporated in denoising problems through quadratic penalties for deviations in successive estimates . In particular, denoising of a smooth signal can be done by solving an optimization problem of the form
where corresponds to the estimation for , is a smoothing regularization parameter, is a linear operator, the estimation error term measures the fitness to data, and the quadratic penalty term models the smoothness considerations. In its simplest form
where encodes the notion of adjacency, e.g., consecutive observations in a time series or adjacent pixels in an image. If is given according to (2), then problem (1) is a convex Markov Random Fields problem  or metric labeling problem , commonly used in the image segmentation context [16, 48], for which efficient combinatorial algorithms exist.\ignore Another well-known special case of (1), arising in time-series analysis, corresponds to
in this case, (1) corresponds to the Hodrick-Prescott filter [?, ?, ?], popular in the economics literature, supported by standard packages for data analysis such as SAS and R, and solvable in operations. Even in its general form, (1) is a convex quadratic optimization, for which a plethora of efficient algorithms exist.
Another naturally occurring signal characteristic is sparsity, i.e., the underlying signal differs from a base value in only a small proportion of the indexes. Sparsity arises in diverse application domains including medical imaging , genomic studies , face recognition , and is at the core of compressed sensing methods . In fact, the “bet on sparsity” principle  calls for systematically assuming sparsity in high-dimensional statistical inference problems. Sparsity constraints can be modeled using the -“norm”
with , has received (slightly) more attention. Problem (5) corresponds to a Markov Random Fields problem with non-convex deviation functions [see 1, 42], for which a pseudo-polynomial combinatorial algorithm of complexity exists, where is a precision parameter; to the best of our knowledge, this algorithm has not been implemented to date. More recently, in the context of signal denoising, Bach  proposed another pseudo-polynomial algorithm of complexity , and demonstrated its performance for instances with . The aforementioned algorithms rely on a discretization of the variables, and their performance depends on how precise the discretization (given by the parameter ) is. Finally, a recent result of Atamtürk and Gómez  on quadratic optimization with M-matrices and indicators imply that (5) is equivalent to a submodular minimization problem, which leads to a strongly polynomial-time algorithm of complexity . The high complexity by a blackbox submodular minimization algorithm precludes its uses except for small instances. No polynomial-time algorithm is known for the constrained problem (4).
In fact, problems (4) and (5) are rarely tackled directly. One of the most popular techniques used to tackle signal estimation problems with sparsity is lasso, consisting of replacing the non-convex term with the convex -norm, , see Section 2.1 for details. The resulting optimization problems with the -norm can be solved very efficiently, even for large instances; however, the problems are often weak relaxations of the exact problem (4), and the estimators obtained may be poor, as a consequence. Alternatively, there is a increasing effort for solving the mixed-integer optimization (MIO) (4) exactly using enumerative techniques, see Section 2.2. While the recovered signals are indeed high quality, MIO approaches to-date are unable to handle problems with , and are inadequate to tackle many realistic instances as a consequence.
Contributions and outline
In this paper, we discuss how to bridge the gap between the easy-to-solve approximations and the often intractable problems in a convex optimization framework. Specifically, we construct a set of iterative convex relaxations for problems (4) and (5) with increasing strength. These convex relaxations are considerably stronger than the usual relaxation, and also significantly improve and generalize other existing convex relations in the literature, including the perspective relaxation (see Section 2.3) and recent convex relaxations obtained from simple pairwise quadratic terms (see Section 2.4). Moreover, the strong convex relaxations can be used as a standalone convex optimization method to obtain high quality, if not optimal, solutions for (4)–(5), or can be embedded in MIO branch-and-bound methods to solve (4)–(5) to optimality; in both cases, the proposed approach results in better performance than the existing methods. Finally, all the proposed formulations are amenable to conic quadratic optimization techniques, thus can be tackled using off-the-shelf solvers, resulting in several benefits: (i) there is no need to develop specialized codes to use the formulations; (ii) the proposed approach is flexible, as it can be used to tackle either (4) or (5), and additional priors can easily be included by changing the objective or adding constraints (see Section 5.3); (iii) the formulations can be used either as convex surrogates for (4)–(5), resulting in better-quality estimators than -norm approximations with little computational cost, or can be embedded in branch-and-bound solvers to solve (4)–(5) to optimality, resulting in significant speedups over branch-and-bound methods with a “standard” formulation.
Figure 5 illustrates the performance of the classical -norm estimators and one of the proposed estimators, M-sep; estimator M-sep is obtained by solving a convex optimization problem and is computed in a few seconds for the instance shown with .
The rest of the paper is organized as follows. In Section 2 we review the relevant background for the paper. In Section 3 we introduce the strong iterative convex formulations for (4)–(5). In Section 4 we discuss the implementation of the proposed formulations using conic quadratic solvers. In Section 5 we test the performance of the methods on instances with synthetic and real data, and in Section 6 we conclude the paper with few final remarks.
Throughout the paper, we adopt the following convention for division by : given , if and if . For a set , denotes the closure of the convex hull of .
In this section, we review formulations relevant to our discussion. First we review the usual -norm approximation (Section 2.1), next we discuss MIO formulations (Section 2.2), then we review the perspective reformulation, a standard technique in the MIO literature, (Section 2.3), and finally pairwise convex relaxations that were recently proposed (Section 2.4).
2.1. L1-norm approximations
A standard technique for signal estimation problems with sparsity is to replace the -norm with the -norm in (4), leading to the convex optimization problem
The -norm approximation was proposed by Tibshirani  in the context of sparse linear regression, and is often referred to as lasso . The main motivation for lasso is that the -norm is the convex -norm closer to the -norm. In fact, for , it is easy to show that ; therefore, the -norm approximation is considered to be the best possible convex relaxation of the -norm.
Lasso is currently the most commonly used approach for sparsity . It has been applied to a variety of signal estimation problems including signal decomposition and spike detection [e.g., 21, 30, 76, 49], and pervasive in the compressed sensing literature [17, 18, 26]. A common variant of lasso is fused lasso , which involves a sparsity-inducing term of the form ; fused lasso was further studied in the context of signal estimation , and is often used for digital imaging processing under the name of total variation denoising [66, 75, 59]. Several other generalizations of lasso exist , including elastic net [85, 58], adaptive lasso , group lasso [8, 63] and smooth lasso ; related -norm techniques have also been proposed for signal estimation, see [46, 53, 73]. The generalized lasso  utilizes the regularization term and is also studied in the context of signal approximation.
Despite its widespread adoption, lasso has several drawbacks. First, the -norm term may result in excessive shrinkage of the estimated signal, which is undesirable in many contexts . Additionally, lasso may struggle to achieve sparse estimators — in fact, solutions to (6) are often dense, and achieving a target sparsity of requires using a parameter , inducing additional bias on the estimators. As a consequence, desirable theoretical performance of the lasso can only be established under stringent conditions [65, 67], which may not be satisfied in practice. Indeed, approximations have been shown to perform rather poorly in a variety of contexts, e.g., see [45, 56]. To overcome the aforementioned drawbacks, several non-convex approximations have been proposed [28, 54, 82, 83]; more recently, there is also an increasing effort devoted to enforcing sparsity directly with regularization using enumerative MIO approaches.
2.2. Mixed-integer optimization
Signal estimation problems with sparsity can be naturally modeled as a mixed-integer quadratic optimization (MIQO) problem. Using indicator variables such that for all , problem (4) can be formulated as
In this formulation, the non-convexity of the regularizer is captured by the complementary constraints (7c) and the binary constraints . Constraints (7c) can be alternatively formulated with the so-called “big-” constraints with a sufficiently large positive number ,
For the signal estimation problem (7), is a valid upper bound for , . Problem (7) is a convex MIQO problem, which can be tackled using off-the-shelf MIO solvers. Estimation problems with a few hundred of variables can be comfortably solved to optimality using such solvers, e.g., see [11, 22, 32, 78]. For high Signal-to-Noise Ratios (SNR), the estimators obtained from solving the exact problems indeed result in superior statistical performance when compared with the approximations . For low SNR, however, the lack of shrinkage may hamper the estimators obtained from optimal solutions of the problems ; nonetheless, if necessary, shrinkage can be easily added to (7) via conic quadratic regularizations terms , resulting again in superior statistical performance over corresponding approximations. Unfortunately, current MIO solvers are unable to solve larger problems with thousands of variables.
The standard technique for solving MIO problems is the branch-and-bound method, which recursively partitions the search space and solves convex relaxations for each partition to produce lower bounds for fathoming sections of the search space. Simple convex relaxations of MIO problems can be obtained by replacing the integrality constraints with convex bound constraints, i.e., replacing with . While proving optimality may require an exponential number of iterations, bounds from strong convex relaxations can help reduce the iterations by many orders of magnitude and even eliminate the need of enumeration completely in some cases in practice. Branch-and-bound methods have been tremendously successful for mixed-integer linear optimization , and practical problems with thousands of variables are routinely solved to optimality or near-optimality nowadays. The progress for solving estimation problems such as (7), however, has been so far limited. \ignorebecause most solvers are ill-equipped for constructing and handling strong convex relaxations of problems involving nonlinear functions. In addition, big- constraints (8) are often weak and result in poor relaxations of the discrete problems.
Finally, we point out the relationship between the lasso approximation (6) and the MIO formulation (7). It can be verified easily that there exists an optimal solution to the simple convex relaxation with big- constraint, where for all . Therefore, the constraint (7b) reduces to , and we find that lasso is in fact the natural convex relaxation of (7) (for a suitable sparsity parameter). This relaxation is often weak and can be improved substantially.
2.3. The perspective reformulation
A simple strengthening technique to improve the convex relaxation of (7) is the perspective reformulation , which will be referred to as persp. in the remainder of the paper for brevity. This reformulation technique can be applied to the estimation error terms in (7a) as follows:
The term is the closure of the perspective function of the quadratic function , and is therefore convex, see p. 160 of . Reformulation (9) is in fact the best possible for separable quadratic functions with indicator variables. The perspective terms can be replaced with an auxiliary variable along with rotated cone constraints [2, 34]. Therefore, persp. relaxations can be easily solved with conic quadratic solvers and is by now a standard technique for mixed-integer quadratic optimization [15, 39, 52, 79]. Additionally, relationships between the persp. and the sparsity-inducing non-convex penalty functions minimax concave penalty  and reverse Huber penalty  have recently been established . In the context of the signal estimation problem (4), the persp. yields the convex relaxation
The lasso approximation, as discussed in Section 2.1, is the best convex relaxation that considers only the indicators for the terms. The persp. approximation is the best convex relaxation that exploits both the indicator variables and the separable quadratic estimation error terms; thus, the resulting relaxation is stronger than lasso. However, the persp. cannot be applied to non-separable quadratic smoothness terms , as the function is non-convex due to the bilinear term.
2.4. Strong formulations for pairwise quadratic terms
Recently, Jeon et al.  gave strong relaxations for the mixed-integer epigraphs of non-separable convex quadratic functions with two variables and indicator variables. Atamtürk and Gómez  further strengthened the relaxations for quadratic functions of the form corresponding to the smoothness terms in (7). Specifically, let
and define the function as
Proposition 1 (Atamtürk and Gómez ).
The function is convex and
Note that is not differentiable everywhere and it is defined by pieces. Therefore, it cannot be used directly with most convex optimization solvers. Atamtürk and Gómez  implement (10) using linear outer approximations of function : the resulting method performs adequately for instances with , but was ineffective in instances with as strong linear outer approximations require the addition of a large number of constraints. Moreover, as Example 1 below shows, formulation (10) can be further improved even for .
Consider the signal estimation problem with in Lagrangean form,
Obtained by replacing with . The corresponding optimal solution is , and we find that .
The optimal solution is , and .
- pairwise :
The optimal solution is , and .
Thus, although persp. and pairwise substantially improve upon the lasso relaxation, the resulting solutions are not necessarily integral in .
The optimal solutions and objective values of the convex relaxations of (11) are as follows: lasso, obtained by replacing with , has optimal solution , with optimal objective value ; persp. has optimal solution , with objective value ; and convex relaxation (10) has optimal solution , with objective value . The optimal solution of (11) is , with optimal objective value . Thus, although persp. and (10) substantially improve upon the lasso relaxation, the resulting solutions are not integral in and there is a gap between the optimal values of such relaxations and .
In this paper, we show how to further improve the pairwise formulation to obtain a stronger relaxation of (7). Additionally, we show how to implement pairwise and the stronger formulations derived in the paper in a conic quadratic optimization framework. Therefore, all the proposed convex relaxations benefit from a vast and growing literature on conic quadratic optimization, e.g., see [3, 4, 6, 50, 57], can easily be implemented with off-the-shelf solvers, and scale to large instances.
3. Strong convex formulations for signal estimation
In the pairwise formulation each single- and two-variable quadratic term is strengthened independently and, consequently, the formulation fails to fully exploit the relationships between different pairs of variables. Observe that problem (7) can be stated as
where, for , if and otherwise, and where . In particular, is a symmetric M-matrix, i.e., and . In this section we derive convex relaxations of (7) that better exploit the M-matrix structure. We briefly review properties of M-matrices and refer the reader to [10, 31, 61, 74] and the references therein for an in-depth discussion on M-matrices.
Proposition 2 (Plemmons , characterization 37).
An M-matrix is generalized diagonally dominant, i.e., there exists a positive diagonal matrix such that is (weakly) diagonally dominant.
Generalized diagonally dominant matrices are also called scaled diagonally dominant matrices in the literature.
Proposition 3 (Boman et al. ).
A matrix is generalized diagonally dominant iff it has factor width at most two, i.e., there exists a real matrix such that and each column of contains at most two non-zeros.
Proposition 3 implies that if is an M-matrix, then the quadratic function can be written as sums of quadratic functions of at most two variables each, i.e., where for any at most two entries are non-zero. Thus, to derive stronger formulations for (12), we first study the mixed-integer epigraphs of parametric pairwise quadratic functions with indicators.
3.1. Convexification of the parametric pairwise terms
Consider the mixed-integer epigraph of a parametric pairwise quadratic term (with parameters )
where and , which is the necessary and sufficient condition for convexity of the function . Note that one may assume the product coefficient equals , as otherwise the continuous variables and coefficients can be scaled. Clearly, if , then reduces to .
Consider the decompositions of the two-variable quadratic function in the definition of given by
Intuitively, the decompositions above are obtained by extracting a term from the quadratic function such that is as large as possible and the remainder quadratic term is still convex. Then, applying persp. and Proposition 1 to the separable and pairwise quadratic terms, respectively, one obtains two valid inequalities for :
Consider the mixed-integer optimization problem
and the corresponding convex optimization
To prove the result it suffices to show that, for any value of , either (15) and (16) are both unbounded, or that (16) has an optimal solution that is also optimal for (15). We assume, without loss of generality, that (if , the result follows from Proposition 1 by scaling), (if , both problems are unbounded by letting , and if , problem (16) reduces to linear optimization over a integral polytope and optimal solutions are integral in ), and (by scaling). Moreover, since , there exists an optimal solution for both (15) and (16).
Let be an optimal solution of (16); we show how to construct from a feasible solution for (15) with same objective value, thus optimal for both problems. Observe that for , . Thus, if , then is also feasible for (16) with objective value . In particular, either there exists an (integral) optimal solution with by setting , or there exists an optimal solution with one of the variables equal to one by increasing . Thus, assume without loss of generality that . Now consider the optimization problem
obtained from (16) by fixing , dropping constraint (16c), and eliminating variable since (16b) holds at equality in optimal solutions. An integer optimal solution for (17) is also optimal for (15) and (16). Let be an optimal solution for (17), and consider the two cases:
Case 1: : If , then the point with is feasible for (17) with objective value
Therefore, there exists an optimal solution where .
Case 2: : In this case, is an optimal solution of
The condition implies that , thus the optimal value of can be found by taking derivatives and setting to . We find
If , then the point with is feasible for (19) with objective value
Therefore, there exists an optimal solution where .
Example 1 (continued).
Given , define the function as
For any with , function is the point-wise maximum of two convex functions and is therefore convex. Using the convex function , Theorem 1 can be restated as
Finally, it is easy to verity that if , then the maximum in (3.1) corresponds to the first term; if , the maximum corresponds to the second term. Thus, an explicit expression of is
3.2. Convexifications for general M-matrices
Consider now the set
where is an M-matrix. In this section, we will show how the convex hull descriptions for can be used to construct strong convex relaxations for . We start with the following motivating example.
Consider the signal estimation in Lagrangean form with , , and ,
The opt. solution is with value , and .
The opt. solution is with value , and .
- pairwise :
The opt. solution with value , and .
The quadratic constraint (21b) can be decomposed and strengthened as follows:
leading to solution is with value , and .
As Example 2 shows, strong convex relaxations of can be obtained by decomposing into sums of two-variable quadratic terms (as is an M-matrix) and convexifying each term. However, such a decomposition is not unique and the strength of the relaxation depends on the decomposition chosen. We now discuss how to best decompose the matrix to derive the strongest lower bound possible.
Consider the separation problem: given a point , find a decomposition of such that, after strengthening each two-variable term, results in a most violated inequality, which is formulated as follows: