Efficient First-Order Algorithms for Adaptive Signal Denoising

Dmitrii Ostrovskii
INRIA – SIERRA project-team
dmitrii.ostrovskii@inria.fr
   Zaid Harchaoui
University of Washington
zaid@uw.edu
Abstract

We consider the problem of discrete-time signal denoising, focusing on a specific family of non-linear convolution-type estimators. Each such estimator is associated with a time-invariant filter which is obtained adaptively, by solving a certain convex optimization problem. Adaptive convolution-type estimators have been shown to have favorable statistical properties; see [JN09, JN10, HJNO15, OHJN16a]. Our first contribution is an efficient algorithmic implementation of these estimators via well-known first-order proximal algorithms. Our second contribution is a computational complexity analysis of the proposed procedures, which takes into account their statistical nature and the related notion of statistical accuracy. The proposed procedures and their analysis are illustrated on a simulated data benchmark.

1 Introduction

We consider the problem of discrete-time signal denoising. The goal is to estimate a complex discrete-time signal on , observed in complex Gaussian noise on :

(1)

Here, are i.i.d. random variables with standard complex Gaussian distribution , that is, and are independent standard Gaussian random variables.

Signal denoising is a classical problem in statistical estimation and signal processing; see [IK81, Nem00, Tsy08, Was06, Hay91, Kay93]. The conventional approach is to assume that comes from a known set with a simple structure that can be exploited to build the estimator. For example, one might consider signals belonging to linear subspaces or signals whose spectral representation, as given by the Discrete Fourier or Discrete Wavelet transform, comes from a linearly transformed -ball, see [Tsy08, Joh11]. In all these cases, estimators with near-optimal statistical performance can be found explicitly, and correspond to linear functionals of the observations , hence the name linear estimators.

We focus here on a family of non-linear estimators, studied in [Nem91, JN09, JN10, HJNO15, OHJN16a], with broader applicability and strong theoretical guarantees, in particular when the structure of the signal is unknown beforehand. Assuming for convenience that the goal is to estimate on , these estimators can be written as:

(2)

here is called a filter and is supported on which we write as . For each estimator in this family, the filter is then obtained as an optimal solution of a convex optimization problem. For instance, the Penalized Least-Squares estimator [OHJN16a] is defined as

(3)

where is the Discrete Fourier transform (DFT) on , and is the -norm on . We shall give a summary of the other estimators at the end of this section. The optimization problems associated with all of them rest upon a common principle – minimization of the residual

(4)

regularized via the -norm in the Fourier domain .

While the statistical properties of adaptive convolution-type estimators have been extensively studied [Nem91, JN09, JN10, HJNO15, OHJN16a], the question of their algorithmic implementation remains largely unexplored. In fact, we are not aware of any publicly available implementation of these estimators. Our goal here is to close this gap.

Note that problems similar to (3) belong to the general class of second-order cone problems, and hence can in principle be solved to high numerical accuracy in polynomial time via interior-point methods [BTN01]. However, the computational complexity of interior-point methods grows polynomially with the problem dimension, and quickly becomes prohibitive for signal and image denoising problems (for example, in image denoising the dimension is proportional to the number of pixels, which might be as large as ). Furthermore, it is unclear whether high-accuracy solutions are necessary for statistical recovery purposes; the level of accuracy sought should be adjusted to the noise level of the problem at hand.

On the other hand, one can leverage several favorable properties of these problems:

  • Easily accessible first-order information. The objective value and gradient at a given point can be computed in via Fast Fourier Transform (FFT).

  • Simple geometry. After a straightforward re-parametrization, one is left with -norm penalty or -ball as a feasible set in the constrained formulation. Prox-mappings for such problems, with respect to both the Euclidean distance and the “-adapted” distance-generating functions, can be computed efficiently.

  • Medium accuracy is sufficient. Approximate solutions with specified (and medium) accuracy still preserve the statistical performance of the exact solutions.

All these properties make first-order optimization algorithms the tools of choice to deal with (3) and similar problems.

Outline.

In Section 2, we recall two general classes of optimization problems, composite minimization problems [BT09, NN13] and composite saddle-point problems [JN11, NN13], and the corresponding algorithms suitable for their numerical solution. In Section 3, we present efficient first-order optimization algorithms based on these general approaches. In particular, we show how to compute first-order oracles in quasi-linear time in the signal length using FFT. In Section 4, we establish worst-case complexity bounds for the proposed algorithms that explicitly depend on the quantities controlling the statistical difficulty of the problem: the signal length , the noise variance , and the filter norm parameter  for which must be an upper bound. These bounds show that iterations of a suitable first-order algorithm are sufficient to match the statistical properties of an exact estimator, where is the peak signal-to-noise ratio in the Fourier domain. This gives a sharp and rigorous characterization (in the present context) of the performance of “early stopping” strategies that allow one to stop an optimization algorithm much earlier than dictated by the pure optimization analysis. In Section 5, we present numerical illustrations of the proposed algorithms on simulated data, which complement our theoretical analysis.

Notation.

We denote the space of all complex-valued signals on , or, simply, the space of all two-sided complex sequences. We call the finite-dimensional subspace of consisting of signals supported on :

its counterpart consists of all signals supported on . The unknown signal will be assumed to come from one of such subspaces, which corresponds to a finite signal length. Note that the signals from can be naturally mapped to column vectors by means of the index-restriction operator defined for any such that :

In particular, and define one-to-one mappings and . For convenience, column-vectors in and will be indexed starting from zero. We define the scaled -seminorms on :

We use the “Matlab notation” for matrix concatenation: is the vertical, and the horizontal concatenation of two matrices with compatible dimensions. We introduce the unitary Discrete Fourier Transform (DFT) operator on , defined by

The unitarity of implies that its inverse coincides with its conjugate transpose . Slightly abusing the notation, we will occasionally shorten to . In other words, is a map , and the adjoint map simply sends to via zero-padding. We use the “Big-O” notation: for two non-negative functions , on the same domain, means that there is a generic constant such that for any admissible value of the argument; means that is replaced with for some ; hereinafter is the natural logarithm, and is a generic constant.
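As a side note, this unitary normalization corresponds, for instance, to the norm="ortho" convention of NumPy's FFT routines. The snippet below is only a minimal check of that convention and does not reproduce the indexing used in the paper:

```python
import numpy as np

def unitary_dft(x):
    """Unitary DFT: scales by 1/sqrt(n) so that the l2 norm is preserved."""
    return np.fft.fft(x, norm="ortho")

def unitary_idft(z):
    """Inverse of the unitary DFT (equal to its conjugate transpose)."""
    return np.fft.ifft(z, norm="ortho")

x = np.random.randn(8) + 1j * np.random.randn(8)
assert np.allclose(np.linalg.norm(unitary_dft(x)), np.linalg.norm(x))  # Parseval
assert np.allclose(unitary_idft(unitary_dft(x)), x)                    # unitarity
```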

Estimators.

We now summarize all the estimators that are of interest in this paper:

  • Constrained Uniform-Fit estimator, given for by

    (Con-UF)
  • Constrained Least-Squares estimator:

    (Con-LS)
  • Penalized Uniform-Fit estimator, given for by

    (Pen-UF)
  • Penalized Least-Squares estimator:

    (Pen-LS)

2 Tools from convex optimization

In this section, we recall the tools from first-order optimization to be used later. We describe two general types of optimization problems, composite minimization and composite saddle-point problems, together with efficient first-order algorithms for their solution. Following [NN13], we begin by introducing the concept of proximal setup which underlies these algorithms.

2.1 Proximal setup

Let the domain be a closed convex set in a Euclidean space . A proximal setup for is given by a norm on , not necessarily the Euclidean one, and a distance-generating function (d.-g. f.) , such that is continuous and convex on , admits a continuous selection of subgradients on the set , and is -strongly convex w.r.t. .

The concept of proximal setup gives rise to several notions (see [NN13] for a detailed exposition): the -center , the Bregman divergence , the -radius , and the prox-mapping ; their standard forms are sketched below.
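In generic notation (and up to the exact normalization of the radius used in [NN13]), these objects take the standard form

\[
z_\omega = \operatorname*{argmin}_{z \in Z} \omega(z),
\qquad
V_z(w) = \omega(w) - \omega(z) - \langle \omega'(z),\, w - z \rangle,
\qquad
\operatorname{Prox}_z(\xi) = \operatorname*{argmin}_{w \in Z} \big\{ \langle \xi, w \rangle + V_z(w) \big\},
\]

with the \(\omega\)-radius measuring the worst-case Bregman divergence from the center, e.g. \(\Omega^2 = 2 \sup_{w \in Z} V_{z_\omega}(w)\).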

Blockwise proximal setups.

We now describe a specific family of proximal setups which proves to be useful in our situation. Let with ; note that we can identify this space with via (Hermitian) vectorization map ,

(5)

Now, supposing that for some non-negative integers , let us split into blocks of size , and equip with the group -norm:

(6)

We also define the balls .

Theorem 2.1 ([NN13]).

Given  as above, defined by

(7)

is a d.-g. f. for any ball of the norm (6) with -center . Moreover, for some constant and any and , the -radius of is bounded as

(8)

We will use two particular cases of the above construction.

  1. Case , corresponds to the -norm on , and specifies the complex -setup.

  2. Case , corresponds to the -norm on , and specifies the -setup .

It is convenient to introduce the following norms on :

(9)

Note that gives the norm in the complex -setup, while coincides with the standard -norm on .

2.2 Composite minimization problems

In the general problem of composite minimization, the goal is to solve the convex program

(10)

Here, is a domain in equipped with , is convex and continuously differentiable on , and is convex, lower-semicontinuous, finite on the relative interior of , and can be non-smooth. Assuming that is equipped with a proximal setup , let us define the composite prox-mapping [BT09]:

(11)

Fast Gradient Method.

Fast Gradient Method (FGM), summarized as Algorithm 1, was introduced in [N07] as an extension of the celebrated Nesterov algorithm for smooth minimization [Nes83] to the case of constrained problems with non-Euclidean proximal setups. It is guaranteed to find an approximate solution of (10) with accuracy after iterations. We defer the rigorous statement of this accuracy bound to Sec. 4.

[Algorithm 1: Fast Gradient Method — pseudocode with stepsize input]
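For concreteness, here is a minimal Euclidean-setup sketch of an accelerated composite gradient scheme of this family (essentially FISTA with a generic composite prox-mapping). The precise variant of Algorithm 1 employs non-Euclidean prox-mappings and its own averaging weights, so the sketch below is illustrative rather than a faithful transcription:

```python
import numpy as np

def fgm(grad_f, prox_h, x0, L, n_iter=100):
    """Sketch of an accelerated composite gradient method for min_x f(x) + h(x).

    grad_f : callable, gradient of the smooth part f
    prox_h : callable (v, step) -> argmin_u { h(u) + ||u - v||^2 / (2 * step) }
    x0     : starting point (NumPy array)
    L      : Lipschitz constant of grad_f; the stepsize is 1/L
    """
    x, z, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iter):
        x_new = prox_h(z - grad_f(z) / L, 1.0 / L)          # composite gradient step
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t ** 2))   # momentum schedule
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)       # extrapolation
        x, t = x_new, t_new
    return x
```

For the problems of Sec. 3, grad_f would be evaluated via FFT, and prox_h would be the soft-thresholding map described in Sec. 3.1.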

2.3 Composite saddle-point problems

We also consider another class of optimization problems:

(12)

Here, and are convex subsets of the corresponding Euclidean spaces, and is compact; function is convex in , concave in , and differentiable on ; function is convex, lower-semicontinuous, can be non-smooth, and is such that is easily computable. We can associate with a smooth vector field ,

Saddle-point problem (12) specifies two convex optimization problems: minimization of , or the primal problem, and maximization of , the dual problem. Under some general conditions which hold in the described setting [Sio58], (12) possesses an optimal solution , called a saddle point, such that the value of (12) is , and are optimal solutions to the primal and dual problems. The quality of a candidate solution can be evaluated via the duality gap – the sum of the primal and dual accuracies:

Constructing the joint setup.

When dealing with a saddle-point problem, one usually begins with “partial” proximal setups for  and for , and has to construct a proximal setup on . Let us introduce the segment , where is the -component of the -center of . Following [NN13], let us assume that both the dual -radius and the “effective primal radius”

are known (note that the primal radius can be infinite but cannot). One can then construct a proximal setup

(13)

Note that the corresponding “joint” prox-mapping is reduced to the prox-mappings for the primal and dual setups.

Composite Mirror Prox.

Composite Mirror Prox (CMP), introduced in [NN13] and summarized here as Algorithm 2, solves the general composite saddle-point problem (12). When applied with proximal setup (13), this algorithm admits an accuracy bound after iterations; the formal statement is deferred to Sec. 4.

[Algorithm 2: Composite Mirror Prox — pseudocode with stepsize input]
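As an illustration of the underlying extragradient mechanism, here is a minimal Euclidean-setup sketch; the composite variant of [NN13] additionally handles the non-smooth term through a composite prox-mapping and works with non-Euclidean setups, which are omitted here, and all names below are illustrative:

```python
import numpy as np

def mirror_prox(F, project, w0, gamma, n_iter=100):
    """Extragradient sketch for a saddle-point problem with a monotone field F.

    F       : callable, F(w) stacks (grad_x phi(x, y), -grad_y phi(x, y)) as one vector
    project : Euclidean projection onto the feasible set (the Euclidean prox-mapping)
    w0      : starting point (prox-center), a float NumPy array
    gamma   : stepsize, typically 1 / (Lipschitz constant of F)
    """
    w = w0.copy()
    w_avg = np.zeros_like(w0)
    for t in range(1, n_iter + 1):
        v = project(w - gamma * F(w))   # extrapolation point
        w = project(w - gamma * F(v))   # update using the field at the extrapolation point
        w_avg += (v - w_avg) / t        # running average of the extrapolation points
    return w_avg
```

As is standard for Mirror Prox, the accuracy guarantee applies to the averaged point rather than to the last iterate.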

3 Algorithmic implementation

We now describe the efficient implementations of the proximal first-order algorithms discussed in Sec. 2, as applied to the recovery procedures introduced in Sec. 1.

Change of variables.

Our first step is to pass to the Fourier domain, so that, first, convolution is represented as an efficiently computable linear operator, and second, the feasible set and the penalization term become quasi-separable in the new variables. Noting that the adjoint map of , cf. (5), is given by

consider the transformation

(14)

Note that , and hence

where is defined by

(15)

We are about to demonstrate that all recovery procedures can be recast in one of the “canonical” forms (10), (12). Moreover, the gradient computation is then reduced to evaluating the convolution-type operator and its adjoint .

Problem reformulation.

After the change of variables (14), problems (Con-LS) and (Pen-LS) take form (10):

(16)

where is defined in (9). In particular, (Con-LS) is obtained from (16) by setting and  and (Pen-LS) is obtained by setting . Note that

On the other hand, problems (Con-UF), (Pen-UF), and (Con-LS) (after taking the square root of the objective in the last case) can be recast as saddle-point problems (12) using that the dual norm of is with , whence

As a result, (Con-UF), (Pen-UF) and (Con-LS) are reduced to

(17)

where for (Con-UF) and (Pen-UF), and for (Con-LS) after taking the square root. Note that is bilinear, and

We are now in a position to apply the algorithms described in Sec. 2. One iteration of either of them reduces to a constant number of computations of the gradient (which, in turn, reduces to evaluating and ) and of prox-mappings. We now show how to evaluate the convolution-type operator and its adjoint in .

Evaluation of and .

Operator , cf. (15), can be evaluated in time via FFT. The key fact is that the convolution is contained in the first coordinates of the circular convolution of with a zero-padded filter

Via the DFT diagonalization property, this is expressed as

where the operator on can be constructed in by FFT, and evaluated in . Let project to the first coordinates of ; note that its adjoint is the zero-padding operator which appends trailing zeros. Then,

(18)

where all operators in the right-hand side can be evaluated in . Moreover, we arrive at a similar representation for by formally taking the adjoint of (18).
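To make the key fact concrete, here is a minimal NumPy sketch: a linear convolution can be read off a circular convolution on a zero-padded grid computed with the FFT, in O(N log N) time. The function below is illustrative and does not reproduce the Fourier-domain operator of (18) or its projection and zero-padding factors exactly:

```python
import numpy as np

def conv_via_fft(phi, y):
    """Linear convolution of a filter phi with observations y, computed as a
    circular convolution on a zero-padded grid via the FFT."""
    N = len(phi) + len(y) - 1                        # padded grid size
    return np.fft.ifft(np.fft.fft(phi, N) * np.fft.fft(y, N))

# Sanity check against direct convolution.
phi = np.random.randn(4) + 1j * np.random.randn(4)
y = np.random.randn(9) + 1j * np.random.randn(9)
assert np.allclose(conv_via_fft(phi, y), np.convolve(phi, y))
```

The adjoint operator is applied in the same spirit, by conjugating the spectrum of the fixed factor and reversing the projection and zero-padding steps.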

3.1 Computation of prox-mappings

The composite prox-mappings corresponding to both basic proximal setups considered in Sec. 2 can be computed in . In the penalized case this computation can be done explicitly, and in the constrained case the prox-mappings can be computed with a root-finding algorithm. Below we describe these calculations since they are interesting in their own right in the context of proximal algorithms applied to signal processing problems.

It suffices to consider partial proximal setups separately; the case of joint setup in saddle-point problems can be treated using that the joint prox-mapping is separable in and , cf. Sec. 2.3. Recall that the possible partial setups comprise the -setup with and the (complex) -setup with ; in both cases, is given by (7). Computing , cf. (11), amounts to solving

(19)

in the constrained case, and

(20)

in the penalized case (for the purpose of future reference, we also consider the case of a squared -norm penalty); in both cases, . In the constrained case with -setup, the task is reduced to the Euclidean projection onto the -ball if , and onto the -ball if ; the latter can be done (exactly) in via the algorithm from [DSSSC08] – for that, one first solves (19) for the complex phases corresponding to the pairs of components of . The constrained case with -setup is reduced to the penalized case by passing to the Lagrangian dual problem. Evaluation of the dual function amounts to solving a problem equivalent to (20) with , and (19) can be solved by a simple root-finding procedure if one is able to solve (20). As for (20), below we show how to solve it explicitly when , and reduce it to a one-dimensional root search (so that it can be solved in to numerical tolerance) when . Indeed, (20) can be recast in terms of the complex variable :

(21)

where and , cf. (7), whence

(22)

with . Now, (21) can be minimized first with respect to the complex arguments, and then to the absolute values of the components of . Denoting a (unique) optimal solution of (21), the first minimization results in and it remains to compute the absolute values

Case .

The first-order optimality condition implies

(23)

Denoting , and using the soft-thresholding operator

we obtain the explicit solution:

In the case of -setup this reduces to .
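For illustration, here is a minimal NumPy sketch of this soft-thresholding step in the Euclidean setup: each complex component keeps its phase while its magnitude is shrunk. The threshold parameter below absorbs the stepsize and regularization constants appearing in (20), so the scaling is illustrative:

```python
import numpy as np

def complex_soft_threshold(w, lam):
    """Componentwise shrinkage of a complex vector: each entry keeps its phase,
    and its magnitude is reduced by lam (set to zero if below the threshold)."""
    mag = np.abs(w)
    scale = np.maximum(mag - lam, 0.0) / np.maximum(mag, 1e-300)
    return w * scale
```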

Case .

Instead of (23), we arrive at

(24)

which we cannot solve explicitly. However, note that a counterpart of (24), in which is replaced with parameter , can be solved explicitly similarly to (23). Let denote the corresponding solution for a fixed , which can be obtained in time. Clearly, is a non-decreasing function on . Hence, (24) can be solved, up to numerical tolerance, by any one-dimensional root search procedure, in evaluations of .
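A hedged sketch of the one-dimensional root search mentioned above: given a routine that solves the counterpart of (24) explicitly for a fixed value of the scalar parameter, bisection recovers a consistent value to numerical tolerance in a logarithmic number of evaluations. The function and argument names are illustrative, and monotonicity of the residual is assumed:

```python
def solve_constrained_prox(solve_penalized, norm_of, tol=1e-10, t_max=1.0):
    """Find t >= 0 with norm_of(solve_penalized(t)) == t by bisection, assuming
    the residual norm_of(solve_penalized(t)) - t is non-increasing in t.

    solve_penalized : explicit solver of the surrogate problem for a fixed scalar t
    norm_of         : scalar quantity of the solution that must be consistent with t
    """
    residual = lambda t: norm_of(solve_penalized(t)) - t
    lo, hi = 0.0, t_max
    while residual(hi) > 0:      # expand the bracket until the residual changes sign
        hi *= 2.0
    while hi - lo > tol:         # plain bisection on the bracketed root
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0:
            lo = mid
        else:
            hi = mid
    return solve_penalized(0.5 * (lo + hi))
```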

4 Theoretical analysis

We begin this section by recalling the worst-case bounds on the absolute accuracy of the objective,

(For saddle-point problems, must be replaced with , and is upper-bounded by the duality gap.) These bounds are applicable when solving an arbitrary optimization problem in one of the two general classes described in Sec. 2 with the corresponding first-order algorithm, cf. Theorems 4.1-4.2; they are expressed in terms of “optimization” parameters such as the regularity of the objective and the radius of the feasible set. Our first theoretical contribution, cf. Proposition 4.1, is to express these “optimization” parameters, and hence the bounds on , in terms of purely statistical quantities: the norm of the exact estimator and the -norm of the observations in the Fourier domain.

4.1 Bounding the absolute accuracy

The following results are taken from [NN13], modulo notation adaptations.

Theorem 4.1.

Suppose that has -Lipschitz gradient:

where is the dual norm to , and let be generated by iterations of Algorithm 1 with stepsize . Then,

Theorem 4.2.

Let be as in (17) (for simplicity, we only state the bound for bilinear ), and assume that the vector field is -Lipschitz on :

Let be generated by iterations of Algorithm 2 with joint setup (13) and . Then,

First, note that in the case where the partial domain (for or ) is an -norm ball, we will use the -setup in that variable (then the domain coincides with , cf. (8)), while if the domain is an -norm ball, we can choose between the - or -setups (in the latter case, the domain is contained in ). The radii can then be bounded as follows, cf. (8):

(25)

where

(26)

is the scaled norm of an optimal solution (note that ).

Another observation concerns the Lipschitz constants. Let

(27)

depending on whether one uses the Euclidean setup () or the complex -setup () in each variable ; besides, let . Introducing the complex counterpart of , operator given by

we can express the Lipschitz constants , in terms of the subordinate norms :

(28)

The operator norm corresponds to the partial -setups (in both variables), and the norm  to the complex -setups in both variables; note that . Now, itself can be bounded as follows:

Lemma 4.1.

One has

The proof of this lemma appears in the supplementary material. Together with (25), Lemma 4.1 results in the following

Proposition 4.1.

Solving (Con-LS) or (Pen-LS) by Algorithm 1 with proximal setup as described above, one has

(29)

Similarly, solving (Con-LS) (with square root of the objective), (Con-UF), or (Pen-UF) by Algorithm 2, one has

(30)

Note that Proposition 4.1 gives the same upper bound on the accuracy irrespective of which proximal setup is chosen among the ones described in the premise of (25). This is because we used the operator norm as an upper bound for  and , while these quantities can be as small as or even when one uses the “geometry-adapted” proximal setup in which partial -setups are used for the variables “measured” in the -norm. For a general linear operator on , the gaps between and or can be as large as or , and hence one might expect Proposition 4.1 to be suboptimal for the “geometry-adapted” setup. However, as can be seen from the representation (45), in our case the operator has a special “almost diagonal” structure, and it is unlikely that its different subordinate norms scale differently with . This intuition can be made precise:

Proposition 4.2.

Assume that , and is -periodic: Then, one has

4.2 Statistical accuracy and algorithmic complexity

In this section, we characterize the statistical accuracy of adaptive recovery procedures. It can be informally defined as the accuracy of minimizing the residual that suffices for the corresponding approximate estimator to admit the same theoretical risk bound, up to a constant factor, as the exact estimator . We obtain the following two results.

Theorem 4.3.

An -accurate solution to (Con-UF) with or to (Pen-UF) with

in both cases with , with prob.  satisfies

(31)

While Theorem 4.3 controls the pointwise loss of uniform-fit estimators, the next theorem controls the -loss for least-squares estimators. To state it, we recall that a linear subspace of is called shift-invariant if it is an invariant subspace of the lag operator : on .

Theorem 4.4.

Assume that belongs to a shift-invariant subspace with . Then, an -accurate solution to (Con-LS) with or to (Pen-LS) with

in all cases with , with prob.  satisfies