Online Hyperparameter-Free Sparse Estimation Method

Online Hyperparameter-Free Sparse Estimation Method

Abstract

In this paper we derive an online estimator for sparse parameter vectors which, unlike the LASSO approach, does not require the tuning of any hyperparameters. The algorithm is based on a covariance matching approach and is equivalent to a weighted version of the square-root LASSO. The computational complexity of the estimator is of the same order as that of the online versions of regularized least-squares (RLS) and LASSO. We provide a numerical comparison with feasible and infeasible implementations of the LASSO and RLS to illustrate the advantage of the proposed online hyperparameter-free estimator.

1 Introduction

Estimating a high-dimensional sparse vector of parameters with a few dominant or nonzero elements has become an important topic in statistics and signal processing. Applications of sparse estimation include spectral analysis [1, 2, 3, 4], array processing [5, 6, 7], biomedical analysis [8, 9, 10], magnetic resonance imaging [11, 12], system identification [13, 14, 15, 16, 17] and synthetic aperture radar imaging [18, 19].

Many sparse estimation approaches can be implemented using various computational methods and it is relevant to formulate estimators that scale well with the size of the data. Furthermore, in several applications data is obtained as a stream of measurements and it is desirable to process them accordingly. Both reasons motivate developing estimation methods that perform ‘online’ processing, that is, successively refining the estimate of the sparse parameter vector for each obtained data sample. Another common issue with sparse estimation methods is the need for the user to select or tune critical hyperparameters to strike a balance between data fidelity and sparsity so as to fit a particular measurement setup [20, 21]. This selection is, however, rarely feasible in online scenarios. Furthermore, when the user has to tune hyperparameters the outcomes become more arbitrary and the reproducibility of the method is reduced. Finally, many convex relaxation-based sparse estimation methods are not well adapted for complex-valued data and parameters and thus they must separate the data into real and imaginary parts. This separating approach requires enforcing pairwise sparsity constraints to avoid performance loss and effectively doubles the size of computed quantities [22, 23].

In this paper we develop a sparse estimation method that addresses the aforementioned issues. Specifically,

  • the estimator is implemented online with the same complexity order as the best existing online methods for sparse estimation.

  • It automatically adapts to the signal model via a covariance matching approach and in this way obviates the need for tuning hyperparameters.

  • The method can estimate complex-valued parameters as simply as real-valued ones.

Notation: , and denote the , and Frobenius norms, respectively. Unless otherwise stated, will denote the -norm and where is a positive definite matrix. is the th column of matrix and is the Moore-Penrose pseudoinverse.

Abbreviations: Least squares (LS), regularized least-squares (Rls), least absolute shrinkage and selector operator (Lasso), sparse iterative covariance-based estimation (Spice), mean-square error (MSE), online (Ol).

2 Background

Consider a sequence of scalar measurements:

(1)

where the regressor vector is given, the unknown sparse parameter vector is and is zero-mean noise with variance . For the sake of generality we consider complex-valued variables; any differences that occur in the real-valued case will be addressed below.

Suppose we have obtained measurements. Then in vector form we can write

(2)

where

To avoid notational clutter we will omit the superindex for the columns and simply write . In the following sub-sections we review a few estimators of in (2), based on regularizations of the least-squares approach, and their online formulations that compute from thus eliminating the need for re-calculating the estimate from scratch.

2.1 LS and RLS

The LS approach is based on solving the quadratic problem [24, 25, 26]

(3)

which has the following minimum -norm solution . If has full column-rank, then the estimator admits a simple closed-form solution that can be computed by recursive updates for . This obviates the need for choosing an initial estimate or any hyperparameter (see, e.g. [27]).

It is more common to consider a regularized LS problem

(4)

with an initial estimate that is well-motivated for sparse parameter vectors and with , where is a hyperparameter chosen by the user to bias the estimate towards with the aim of reducing its variance. This estimator admits an online form , where is a matrix determined from the regressors and [26]. The computational complexity of this -regularized least-squares algorithm is of the order per sample.

One approach that takes sparsity into account would be to perform online Rls estimation only on the nonzero components of , if these were known. In [28] the components are successively detected in a greedy manner using information theoretic criteria at each sample . Since the detection process is subject to errors, the resulting online sparse least-squares estimate is only an approximation of (4) applied to the subvector of nonzero coefficients.

2.2 Lasso

A substantially different approach than Rls consists of replacing the -norm regularization term in (4) with alternative forms that promote sparsity directly in the objective itself [29]. In doing so, sparse solutions can be obtained without the need for concomitantly detecting the nonzero components of . This approach to sparse parameter estimation was popularized in [8, 30]. The Lasso estimator solves the following convex problem

(5)

While the solution does not have a closed-form expression, it can be computed using various numerical methods. Among the more computationally elegant and scalable methods is the cyclic minimization strategy of coordinate descent which updates one element of at a time in an iterative manner, cf. [31, 32] and references therein.

One way of formulating an online solution is to interpret (5) as a penalized maximum likelihood estimator, assuming Gaussian noise in (1). Then it is possible to formulate an iterative expectation maximization algorithm with recursively updated quantities using auxiliarly variables [33]. The drawback, however, is that an additional hyperparameter, besides in (5), needs to be tuned. Another way of dynamically updating the estimate from is the method of homotopy [34, 35], whereby the cost function in (5) with a fixed is modified into . As the scalar parameter is varied from 0 to 1, the transition from to can be computed more efficiently than recalculating from scratch thereby enabling an online formulation.

For the real-valued case, an elegant online formulation is found in [36], which is based on the cyclic minimization strategy mentioned above. The cost function in (5) can be written equivalently as , ignoring any constant, where and can be computed recursively. Then, starting from an initial estimate , the elements of are updated for each sample by solving

in closed form for , where and denotes the current estimate. The complexity of the full online cyclic minimization Lasso is per sample.

Under certain conditions on the regressors, sparsity of , and noise, it is possible to prove that the Lasso estimator possesses ‘oracle’ properties. That is, asymptotically it can identify the support set of and perform as well as Rls applied to the nonzero coefficients of the parameter vector, cf. [37, 38, 39, 36]. This, however, requires selecting the hyperparameter based on the knowledge of the noise variance which is rarely feasible in practical (online) scenarios.

2.3 Square-root Lasso

To circumvent the need to know in the Lasso, a subtle modification of (5) was proposed in [40],

(6)

where the first term, containing the residuals, is the square-root of that in (5). As argued in [40], near-oracle performance for both (5) and (6) can be achieved when is chosen as the smallest value that dominates the gradient of the first term, when evaluated at the true . At this point, the gradient captures the estimation errors arising from noise alone. However, by re-parameterizing (2) as , where , it is seen that the gradients of the first terms in (5) and (6) differ in one crucial respect; namely the latter does not depend on thus rendering the choice of for (6) invariant to the noise level.

Another way to address the dependence on is to estimate it [41]. The square-root Lasso estimator in (6) can in fact be interpreted as an -penalized joint estimator of and used in robust regression. Suppose is a convex loss function of the normalized residuals . Then the concomitant M-estimator of location and scale, and , is given by [42, ch.7]

(7)

where is a user-defined parameter. In robust regression, various loss functions are considered to mitigate noise outliers. For a squared-error loss , we obtain the minimizer in closed form. Penalizing the M-estimator in (7) by and concentrating out the minimizing with yields (6).

While an efficient choice of in (6) is independent of , the user input is still required; furthermore, the choices of in [40] are predicated on the assumption that each column of has unit norm. Such a rescaling of the regressors may not be practical in an online scenario. Note that a cyclic minimization algorithm for the convex square-root Lasso has been presented in the supplementary material of [40] (albeit only for the real-valued case and without any derivation) but an online implementation has not yet been formulated.

2.4 Spice as weighted square-root Lasso

Let us now consider the estimation problem from a statistical point of view. Suppose is a zero-mean random variable with covariance matrix . Then the linear estimator that minimizes the mean square error is obtained by solving

(8)

and can be written in closed form as [43, 25, 26]

(9)

In the problem under consideration, however, neither nor is known. By treating them as unknown parameters, they can be estimated by a covariance-matching approach (e.g., [44, 45]) and then used in (9).

For reasons of parsimony and tractability we do not model any correlations between the elements of and hence is a diagonal matrix. Now consider the covariance matrix of the data , which is a function of and . We choose these nonnegative parameters to match the covariance of the observed data, by minimizing the criterion

with respect to and . This criterion is the basis of the sparse iterative covariance-based estimation (Spice) framework.

Using this covariance-matching approach is equivalent to solving for the parameters jointly in the following augmented problem

(10)

which is similar in form to (8) but contains the additional term . (See Appendix A for a proof of this equivalence.) Furthermore, following [46, 47, 48] it can be shown that solving for and , and concentrating them out from (10), results in

(11)

where

Eq. (11) can be interpreted as a weighted, hyperparameter-free square-root Lasso. As is the case with the square-root Lasso, online formulations of (11) have not appeared in the literature.

2.5 Problem formulation

We have reviewed several approaches to sparse parameter estimation as well as some of their interconnections and limitations. Note that all of the estimators considered above involve convex minimization problems. The and -penalized forms of (3) in (4) and (5) have concise online formulations but require the careful selection of hyperparameters. Furthermore, an efficient choice depends on the unknown noise power . The hyperparameters choice is rendered invariant to by the change in (6). Moreover, this selection is entirely avoided in (11) using the Spice approach.

The goal of the remainder of the paper is to formulate an online Spice estimator for the sparse vector (see (11)) given data . This estimator, denoted ’Ol-Spice’, obviates the need for user-defined hyperparameters, treats the complex-valued case as simply as the real-valued one, and is of the same complexity order as the online solutions of (4) and (5). In the numerical example section we provide results comparing the aforementioned online estimators, viz. Ol-Rls, Ol-Lasso and Ol-Spice.

3 Online Spice

First we formulate a low-complexity cyclic minimization algorithm for the cost function in (11). Then, using this result we derive an online estimator which sequentially processes a stream of data with complexity per sample.

3.1 Cyclic minimization

Let the cost function in (11) be denoted as . We begin by minimizing with respect to one component at a time. Let (omitting the index to lighten the notation); then the cost function can be re-written as

(12)

where is the th diagonal element of and is a constant. To tackle this scalar minimization problem we reparameterize the th variable in polar form where and (or when is real-valued). This enables the following reformulation of the quadratic term in (12):

(13)

Inserting (13) into and noting that , we obtain the following criterion as a function of and ,

(14)

The minimizing is simply

(15)

whether the data is complex or real-valued.

Next, let

(16)

so that we can write (14) as

(17)

Note that by the Cauchy-Schwarz inequality

(18)

Equality in (18) occurs only when is colinear with . Inserting (2) into one obtains , where denote estimation errors when holding the remaining coefficients constant. Due to the random noise , and will not be colinear, making the inequality (18) strict, with probability 1.

We now show that (17) is convex and derive the minimizing of this function (dropping the index , in what follows, for notational simplicity). The first-order derivative is

(19)

where the quadratic expression in the denominator can be factored as

(20)

Given the strict inequality in (18) it follows that the right-hand side of (20), and therefore the denominator of (19), is strictly positive. The second-order derivative can be expressed as

Note that the above equation is positive, in view of (18). Thus the function (17) is convex and the minimizer is given by its stationary point; or else .

The stationary point, for which , can be found by solving (see (19)):

which leads to the condition given that both sides must be negative. By squaring both sides of the above expression we can write

using (20), or

where . Therefore for , we can write the solution more compactly as (reinstating the dependence on ):

(21)

given the fact that .

Finally, we summarize the element-wise minimizer of (12) as

(22)

using (21) and (15). Updating each element while holding the remaining elements constant will monotonically reduce the convex cost function (12). Thus repeating (22) for results in a computationally simple cyclic minimizer.

3.2 Online formulation

We now derive a method for efficiently computing (22), given the current estimate which we denote by , at any , for notational simplicity. At , the estimate is initialized as . We note that the variables in (15) and (16) depend on quantities whose dimensions grow with ; namely, and . By introducing recursively computed variables we derive an estimate update that keeps the complexity and memory storage constant at each sample and is of the same complexity order as online Rls and Lasso.

First, we introduce the auxiliary variable , which will subsequently be eliminated as we proceed in the derivation. Then we can write the following identity , which enables the variables in (16) to be expressed as

(23)

Next, introduce the auxiliary variables

(24)

and the recursively computed variables

(25)

that are initialized as . Then (23) can be simplified as follows:

(26)

where denotes the th element of . Similarly, (15) can be expressed as

(27)

Therefore the computation of (22) can be expressed in terms of (24), (25) and the current estimate .

Once has been computed, the current estimate must be updated along with the auxiliary variables to compute the subsequent coefficients of . The variable can easily be updated as , and it follows that the update of (24) equals

which involves a small number of scalar operations and an addition of two vectors. The variable can now be eliminated, initializing the auxiliary variables (24) for each sample as and . We summarize Ol-Spice in Algorithm 1. The algorithm specifies the update of the estimate for each new sample and is initialized at by . The cyclic computation of , is terminated after repetitions per sample, cf. line 14 in Algorithm 1.

In sum, by introducing the auxiliary variables we can maintain constant storage and a computational complexity of order per sample. Since is a constant independent of , this is the same complexity order as online Rls and Lasso. As reported below, performs well in practice. Other update strategies, akin to those considered in [36], can be explored in online applications where complexity needs to be further reduced.

1:Input: , and
2:
3:
4:
5:
6:
7:repeat
8:      
9:      Compute (26) and (27)
10:      Compute using (22)
11:      
12:      
13:      
14:until  number of iterations equals
15:Output:
Algorithm 1 : Online Spice
Remark 1.

At any , the output of the algorithm converges to the global minimizer (11) as which follows from the above analysis of the convex minimization problem. A convergence analysis for finite and is, however, nontrivial, cf. [25, ch. 9].

Remark 2.

The original Spice batch algorithms [3, 7], with uniform noise variance, and the above online formulation solve the same convex problem iteratively. The former uses an initial batch estimate whereas the latter is initialized by setting . A more important difference, however, is that the former requires repeated inversions of matrices, each of which is of complexity , whereas the latter requires none. This renders batch Spice intractable when takes on large values (such as for a regular PC) and precludes its use in scenarios considered in this work.

Remark 3.

We note that the approach employed to derive Ol-Spice also enables an alternative formulation of Ol-Lasso that treats the complex-valued case as simply as the real-valued one. See Appendix B for a derivation.

4 Numerical evaluation

In this section we compare the derived Ol-Spice with feasible and infeasible versions of the Ol-Rls and Ol-Lasso [36].

The infeasible Ol-Rls is implemented by processing only the (unknown) subset of nonzero coefficients in , whereas the feasible Ol-Rls processes the entire vector, with the regularization parameter set arbitrarily to . The infeasible Ol-Lasso is implemented by setting , which is proportional to the (unknown) noise level [36], whereas for the feasible Ol-Lasso we set .

The performance of the estimators was evaluated using the normalized mean-square error

(28)

When is an unknown deterministic variable, the expectation with respect to it in (28) should be be removed. Note that an NMSE value below 0 dB quantifies the error reduction from the initial guess . The NMSE was evaluated using 100 Monte Carlo simulations. We used a PC with Intel i7 3.4 GHz CPU and 16 GB RAM. The algorithms were implemented in Matlab without any special code optimization or hardware acceleration.

Remark: In the interest of reproducible research we have made the codes for Ol-Spice, as well as for the presented numerical experiments, available at https://www.it.uu.se/katalog/davza513.

4.1 Real-valued example: random regressors

To illustrate the performance of the estimators we consider a scenario with the real-valued regressor elements in (1) drawn from identical and independent Gaussian distributions (i.i.d.) with zero mean and unit variance. The signal to noise ratio is defined as

where is the support set of and . We use Gaussian noise throughout all experiments.

We first consider to be a deterministic parameter. In the first experiment we set nonzero elements, , and . Note that since the regressors are drawn independently, the chosen support set for will not affect the performance. When the number of samples is very small, the estimates for Ol-Lasso and Ol-Spice have a higher variance than Ol-Rls which biases its estimate more strongly towards . For clarity we therefore set during for all estimators. The results are shown in Fig. 1 for SNR=20 dB. We observe a significant performance gap between the feasible and infeasible Ol-Lasso, which illustrates how critical it is to tune the hyperparameter to the unknown noise variance . Both Ol-Lasso and Ol-Spice quickly prune out many nonzero coefficient estimates, the effect of which is visible in the transition phase of the NMSE plot. The performance of Ol-Rls becomes better than that of Ol-Lasso when . Ol-Spice outperforms the feasible Ol-Lasso after about samples and is closer to the infeasible version which uses an optimally tuned .

Fig. 2 presents the variance and bias of the estimators by decomposing the mean square error in (28). Both versions of Ol-Lasso exhibit much lower variance than square-bias whereas Ol-Spice has a more balanced variance-bias composition and noticeably the lowest bias among the considered estimators.

In Fig. 3 we see that the NMSE for feasible Ol-Rls and Ol-Lasso, which is dominated by the bias, remains virtually unaffected by increasing SNR for a fixed number of samples . By contrast, the errors for Ol-Spice and the infeasible Ol-Lasso decrease as the signal conditions improve.

Figure 1: IID regressors and deterministic . NMSE versus . Left: to . Right: to . SNR=20 dB and number of nonzero parameters . The asterisk denotes the infeasible Ol-Lasso.
Figure 2: IID regressors and deterministic . Left: variance versus . Right: square-bias versus . SNR=20 dB and .
Figure 3: IID regressors and deterministic . NMSE versus SNR. samples and . The asterisk denotes the infeasible Ol-Lasso.

Next, we study the effect of the number of iteration cycles per sample in Ol-Spice. The results are illustrated in Fig. 4. We note that the performance characteristics for , and , are very similar. For , a larger leads to slightly faster decrease of the NMSE but the differences rapidly diminish as increases and the NMSE curves almost coincide for . For reference we included the infeasible Ol-Rls, which provides a lower bound on the NMSE in Fig. 4 and can be compared to Fig. 1.

Figure 4: IID regressors and deterministic . NMSE versus . SNR=20 dB and . The asterisk denotes the infeasible Ol-Rls.

In the next experimental setup we set the number of nonzero elements to , , thus increasing element density of to 10%. The results in Fig. 5 show that Ol-Spice can better cope with less sparse parameter vectors than Ol-Lasso. For it exhibits lower NMSE than the rest, owing to a lower bias (not shown here but observed by us in the numerical evaluation).

Figure 5: IID regressors and deterministic . NMSE versus . Left: to . Right: to . SNR=20 dB and .

Finally, we consider a setup where is a random parameter. Since the support set is unimportant in the present case we generate the elements , and using independent Gaussian variables with zero-mean and unit variance, resulting in a wider dynamic range than in the previous experiments. Nevertheless, the results presented in Fig. 6 show performance characteristics similar to the deterministic case presented in Fig. 1.

Figure 6: IID regressors and stochastic . NMSE versus . Left: to . Right: to . SNR=20 dB and .

4.2 Real-valued example: sinusoids in noise

In contrast to the previous example, we now present a case where the regressor columns in (2) are highly correlated. Specifically, as a further example with real-valued parameters, we consider the identification of a sum of sinusoids at given frequencies with unknown phases and amplitudes (most of which are zero). In the following we will consider possible sinusoids on a uniform grid of frequencies. We set two nonzero amplitudes as and for two slowly-varying sinusoids, narrowly spaced with , and for a high-frequency sinusoid. The phases of the three sinusoids were set to 0.

We define the signal to noise ratio as

where is the set of nonzero amplitudes, and parameterize the signal as

where the unknown parameter vector is and . The regressor vector is .

We set SNR=20 dB. First, Ol-Spice is compared with the feasible and infeasible Ol-Lasso which perform substantially different from one another but achieve the same rate of NMSE decrease. The results are presented in Fig. 7. For , Ol-Spice overtakes the feasible Ol-Lasso at about . Notably, the NMSE of Ol-Spice decreases until it reaches a plateau where the estimation errors are very small but where the noise level cannot be properly identified. This interesting transition characteristic still awaits a satisfactory explanation. For , Ol-Spice approaches the infeasible Ol-Lasso as time progresses. For reference, we have also added the feasible Ol-Rls which illustrates the degradation when not taking the parameter sparsity into account. Note, however, that Ol-Rls eventually outperforms the Ol-Lasso estimator for which the hyperparameter has not been finely tuned to the noise level.

Figure 7: Sinusoidal parameters . NMSE versus time. SNR=20 dB. The asterisk denotes the infeasible Ol-Lasso.

Next, Fig. 8 illustrates how affects Ol-Spice. We see that the performance characteristics for , and , are very similar as was the case with weakly correlated regressor columns in Fig. 4. Setting , however, requires slightly more samples to reach the plateau resulting in a gap in NMSE compared to until about .

Figure 8: Sinusoidal parameters . NMSE versus time. SNR=20 dB. The asterisk denotes the infeasible Ol-Rls.

4.3 Complex-valued example: synthetic aperture radar imaging

Finally, we illustrate how Ol-Spice performs in a complex-valued case, and compare it with Ol-Rls and a novel form of Ol-Lasso for this scenario, cf. Appendix B.

We consider a setup similar to that of synthetic aperture radar imaging where an antenna transmits an electromagnetic pulse and the reflected signal carries information about potential scatterers in the scene of interest, cf. [49]. Let be a position coordinate in the scene and the reflection coefficient at . The observed signal is in the spatial frequency domain, where each sample corresponds to a particular angle . If we grid the space of the scene, the signal at sample can be modeled as

where is the vectorized image of reflection coefficients. For simplicity, we consider and such that the observation corresponds to a coefficient of the two-dimensional discrete Fourier transform. Here we consider the discretized scene image to be such that . The true image used in this example is shown in Fig. 9, which comprises point scatterers with amplitudes equal to .

Figure 9: True intensity or reflection coefficient, , as a function of . 10 ideal point-scatterers are present.

The observations at each sample were taken at a randomly chosen angle (corresponding to randomly chosen discrete spatial frequencies). The signal to noise ratio was set to  dB. In Fig. 10 we compare the estimated images using Ol-Rls, Ol-Lasso and Ol-Spice. Note that in this signal setup the hyperparameter in the infeasible Ol-Lasso overpenalizes the -norm of which results in no visible scatterers. To produce some meaningful plots for Ol-Lasso, the hyperparameter is adjusted to , which illustrates the difficulty of selecting it in practical applications. For close to , three methods estimate the locations and intensities of the point scatterers accurately, but Ol-Spice is capable of producing accurate images with far fewer samples than the other two methods which would require fine-tuning. Indeed, the scatterer pattern is already visible at samples in the Ol-Spice image, without any user input.

Figure 10: Estimated images at various time instants for a randomly chosen noise realization. The estimates for Ol-Rls and Ol-Spice are shown in the first and fourth columns, respectively. For infeasible Ol-Lasso with and a user-adjusted version , the estimates are shown in the second and third columns. SNR= dB.

5 Conclusions

We have derived an online sparse estimator, called Ol-Spice, that obviates the need for tuning hyperparameters. Its computational simplicity and adaptability to complex-valued parameters render it suitable for large-scale inference problems as well as real-time applications, such as system identification and synthetic aperture radar imaging. The code for Ol-Spice has been made available to facilitate its use in applications.

Appendix A: Linear minimum mean-square estimator and covariance matching

Here we prove that the minimizer of (10) is equivalent to using the linear minimum mean-square estimator (9) with covariance parameters set through covariance matching.

For notational simplicity, let denote the covariance parameters, namely the diagonal elements of and , and drop subindex . Further, let so that . Now (9) can be written as

We note that (9) is invariant to any scaling of the covariance parameters. That is,

(29)

for any , which follows from . Finally note that since (9) minimizes (8) it is therefore the minimizer of the augmented problem (10) as well.

We proceed by inserting (9) in (10); this will lead to a concentrated cost function that is equivalent to using the covariance-matching criterion. First, using the matrix inversion lemma, note that:

so that

Thus after concentrating out , (10) can be written as

(30)

Now expand the covariance-matching criterion,

where is a constant. The covariance matching problem can thus be written equivalently as

(31)

which is similar to (30). Let the cost functions in (30) and (31) be denoted as and , respectively. We now show that their respective minimizers differ only by a scaling constant. That is, , where . This follows from

so that for the minimizer we have . It follows that for all , and therefore the minimizers for the concentrated cost function (30) and the covariance-matching crierion (31) differ only by a factor . From (29) we know that the linear minimum mean-square estimator is invariant to uniform scaling of the covariance parameters. This concludes the proof. (See also [48] for other details of this result.)

Appendix B: Online Lasso for the complex-valued case

An online cyclic Lasso algorithm that covers both the real and complex-valued case can be derived using the same reparametrization employed in Ol-Spice. Analogous to (12) and the derivation of (17), the cost function can be written as

and in concentrated form,