Online Hyperparameter-Free Sparse Estimation Method
Abstract
In this paper we derive an online estimator for sparse parameter vectors which, unlike the LASSO approach, does not require the tuning of any hyperparameters. The algorithm is based on a covariance matching approach and is equivalent to a weighted version of the square-root LASSO. The computational complexity of the estimator is of the same order as that of the online versions of regularized least-squares (RLS) and LASSO. We provide a numerical comparison with feasible and infeasible implementations of the LASSO and RLS to illustrate the advantage of the proposed online hyperparameter-free estimator.
1 Introduction
Estimating a high-dimensional sparse vector of parameters with a few dominant or nonzero elements has become an important topic in statistics and signal processing. Applications of sparse estimation include spectral analysis [1, 2, 3, 4], array processing [5, 6, 7], biomedical analysis [8, 9, 10], magnetic resonance imaging [11, 12], system identification [13, 14, 15, 16, 17] and synthetic aperture radar imaging [18, 19].
Many sparse estimation approaches can be implemented using various computational methods, and it is relevant to formulate estimators that scale well with the size of the data. Furthermore, in several applications data is obtained as a stream of measurements and it is desirable to process it accordingly. Both reasons motivate developing estimation methods that perform 'online' processing, that is, successively refining the estimate of the sparse parameter vector for each obtained data sample. Another common issue with sparse estimation methods is the need for the user to select or tune critical hyperparameters to strike a balance between data fidelity and sparsity so as to fit a particular measurement setup [20, 21]. This selection is, however, rarely feasible in online scenarios. Furthermore, when the user has to tune hyperparameters, the outcomes become more arbitrary and the reproducibility of the method is reduced. Finally, many convex relaxation-based sparse estimation methods are not well adapted to complex-valued data and parameters, and thus they must separate the data into real and imaginary parts. This separation requires enforcing pairwise sparsity constraints to avoid performance loss and effectively doubles the size of the computed quantities [22, 23].
In this paper we develop a sparse estimation method that addresses the aforementioned issues. Specifically,

The estimator is implemented online with the same complexity order as the best existing online methods for sparse estimation.

It automatically adapts to the signal model via a covariance matching approach and in this way obviates the need for tuning hyperparameters.

The method can estimate complexvalued parameters as simply as realvalued ones.
Notation: $\|\cdot\|_1$, $\|\cdot\|_2$ and $\|\cdot\|_F$ denote the $\ell_1$, $\ell_2$ and Frobenius norms, respectively. Unless otherwise stated, $\|\cdot\|$ will denote the $\ell_2$ norm, and $\|\mathbf{x}\|^2_{\mathbf{W}} = \mathbf{x}^*\mathbf{W}\mathbf{x}$ where $\mathbf{W}$ is a positive definite matrix. $\mathbf{X}_k$ is the $k$th column of matrix $\mathbf{X}$ and $\mathbf{X}^\dagger$ is the Moore-Penrose pseudoinverse.
Abbreviations: Least squares (LS), regularized least-squares (Rls), least absolute shrinkage and selection operator (Lasso), sparse iterative covariance-based estimation (Spice), mean-square error (MSE), online (Ol).
2 Background
Consider a sequence of scalar measurements:
$$y_t = \mathbf{x}_t^* \boldsymbol{\theta} + w_t, \qquad t = 1, 2, \dots \qquad (1)$$
where the regressor vector $\mathbf{x}_t \in \mathbb{C}^p$ is given, $\boldsymbol{\theta} \in \mathbb{C}^p$ is the unknown sparse parameter vector, and $w_t$ is zero-mean noise with variance $\sigma^2$. For the sake of generality we consider complex-valued variables; any differences that occur in the real-valued case will be addressed below.
Suppose we have obtained $n$ measurements. Then in vector form we can write
$$\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \mathbf{w}, \qquad (2)$$
where $\mathbf{y} = [y_1, \dots, y_n]^\top$, $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_n]^*$ and $\mathbf{w} = [w_1, \dots, w_n]^\top$.
To avoid notational clutter we will omit the superindex for the columns and simply write $\mathbf{X}_k$ for the $k$th column. In the following subsections we review a few estimators of $\boldsymbol{\theta}$ in (2), based on regularizations of the least-squares approach, together with their online formulations, which update the estimate from the previous one as each sample arrives, thus eliminating the need for recalculating the estimate from scratch.
2.1 LS and RLS
The LS approach is based on solving the quadratic problem [24, 25, 26]
$$\min_{\boldsymbol{\theta}} \ \|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2 \qquad (3)$$
which has the minimum-norm solution $\hat{\boldsymbol{\theta}} = \mathbf{X}^\dagger \mathbf{y}$. If $\mathbf{X}$ has full column rank, then the estimator admits a simple closed-form solution that can be computed by recursive updates as samples arrive. This obviates the need for choosing an initial estimate or any hyperparameter (see, e.g., [27]).
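The LS solution can be sketched numerically. The snippet below (with made-up dimensions and variable names) computes the minimum-norm solution via the Moore-Penrose pseudoinverse and checks it against the normal-equations form, assuming the standard linear model $\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \mathbf{w}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5                      # samples, parameters (illustrative sizes)
X = rng.standard_normal((n, p))   # regressor matrix, full column rank w.p. 1
theta = rng.standard_normal(p)    # true parameter vector
y = X @ theta                     # noiseless measurements for illustration

# Minimum-norm LS solution via the Moore-Penrose pseudoinverse
theta_ls = np.linalg.pinv(X) @ y

# With full column rank this matches the normal-equations solution
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)
```

With full column rank the two expressions coincide, and in the noiseless case both recover the true parameter vector exactly.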
It is more common to consider a regularized LS problem
$$\min_{\boldsymbol{\theta}} \ \|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2 + \lambda\|\boldsymbol{\theta} - \bar{\boldsymbol{\theta}}\|^2 \qquad (4)$$
with an initial estimate $\bar{\boldsymbol{\theta}} = \mathbf{0}$, which is well-motivated for sparse parameter vectors, and with $\lambda > 0$, a hyperparameter chosen by the user to bias the estimate towards $\bar{\boldsymbol{\theta}}$ with the aim of reducing its variance. This estimator admits an online form in which the estimate is updated recursively using a gain matrix determined from the regressors and $\lambda$ [26]. The computational complexity of this regularized least-squares algorithm is of order $O(p^2)$ per sample, where $p$ is the parameter dimension.
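The online form described above can be sketched as a textbook RLS recursion. This is a minimal real-valued illustration, assuming the common formulation in which initializing the inverse-covariance matrix as $\mathbf{I}/\lambda$ makes the recursion coincide with the $\lambda$-regularized solution at every step; the function and variable names are ours, not the paper's:

```python
import numpy as np

def rls_update(theta, P, x, y):
    """One recursive least-squares step for y_t = x_t^T theta + w_t.

    P tracks (lambda*I + sum_s x_s x_s^T)^{-1} via the Sherman-Morrison
    identity; with theta = 0 and P = I/lambda at the start, the recursion
    equals the ridge-regularized LS solution at every sample.
    """
    Px = P @ x
    k = Px / (1.0 + x @ Px)              # gain vector, O(p^2) per sample
    theta = theta + k * (y - x @ theta)  # correct by the prediction error
    P = P - np.outer(k, Px)              # rank-one downdate of P
    return theta, P

# usage sketch on a synthetic sparse vector
rng = np.random.default_rng(1)
p, lam = 4, 0.1
theta_true = np.array([1.0, 0.0, -2.0, 0.0])
theta_hat, P = np.zeros(p), np.eye(p) / lam
for t in range(200):
    x = rng.standard_normal(p)
    y = x @ theta_true + 0.01 * rng.standard_normal()
    theta_hat, P = rls_update(theta_hat, P, x, y)
```

Each step costs $O(p^2)$ operations and constant memory, which is the complexity order referred to in the text.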
One approach that takes sparsity into account would be to perform online Rls estimation only on the nonzero components of the parameter vector, if these were known. In [28] the components are successively detected in a greedy manner using information-theoretic criteria at each sample. Since the detection process is subject to errors, the resulting online sparse least-squares estimate is only an approximation of (4) applied to the subvector of nonzero coefficients.
2.2 Lasso
A substantially different approach from Rls consists of replacing the $\ell_2$-norm regularization term in (4) with alternative forms that promote sparsity directly in the objective itself [29]. In doing so, sparse solutions can be obtained without the need for concomitantly detecting the nonzero components of the parameter vector. This approach to sparse parameter estimation was popularized in [8, 30]. The Lasso estimator solves the following convex problem
$$\min_{\boldsymbol{\theta}} \ \|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2 + \lambda\|\boldsymbol{\theta}\|_1 \qquad (5)$$
While the solution does not have a closed-form expression, it can be computed using various numerical methods. Among the more computationally elegant and scalable methods is the cyclic minimization strategy of coordinate descent, which updates one element of the parameter vector at a time in an iterative manner, cf. [31, 32] and references therein.
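As an illustration of the cyclic minimization strategy, here is a minimal batch coordinate-descent sketch for the Lasso cost $\|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2 + \lambda\|\boldsymbol{\theta}\|_1$; the soft-thresholding update is the standard coordinate minimizer, and all names are illustrative rather than taken from the cited references:

```python
import numpy as np

def soft_threshold(z, t):
    """Scalar soft-thresholding operator, the Lasso coordinate minimizer."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for min ||y - X theta||^2 + lam*||theta||_1.

    Each sweep minimizes the convex cost exactly over one coordinate at a
    time, so the objective is non-increasing across sweeps.
    """
    n, p = X.shape
    theta = np.zeros(p)
    r = y.copy()                      # running residual y - X @ theta
    col_sq = np.sum(X**2, axis=0)     # squared column norms ||X_k||^2
    for _ in range(n_sweeps):
        for k in range(p):
            r += X[:, k] * theta[k]   # remove coordinate k's contribution
            z = X[:, k] @ r
            theta[k] = soft_threshold(z, lam / 2.0) / col_sq[k]
            r -= X[:, k] * theta[k]
    return theta

# usage sketch: a sparse, noiseless problem
rng = np.random.default_rng(5)
n, p = 50, 10
X = rng.standard_normal((n, p))
theta_true = np.zeros(p)
theta_true[1], theta_true[6] = 2.0, -1.0
y = X @ theta_true
theta_cd = lasso_cd(X, y, lam=1e-3)
```

For $\lambda \ge 2\max_k |\mathbf{X}_k^*\mathbf{y}|$ the zero vector satisfies the optimality conditions, illustrating how the hyperparameter trades off data fidelity against sparsity.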
One way of formulating an online solution is to interpret (5) as a penalized maximum likelihood estimator, assuming Gaussian noise in (1). Then it is possible to formulate an iterative expectation-maximization algorithm with recursively updated quantities using auxiliary variables [33]. The drawback, however, is that an additional hyperparameter, besides $\lambda$ in (5), needs to be tuned. Another way of dynamically updating the estimate is the method of homotopy [34, 35], whereby the cost function in (5) with a fixed $\lambda$ is modified by a homotopy parameter. As this scalar parameter is varied from 0 to 1, the transition from the previous solution to the new one can be computed more efficiently than recalculating it from scratch, thereby enabling an online formulation.
For the real-valued case, an elegant online formulation is found in [36], which is based on the cyclic minimization strategy mentioned above. The cost function in (5) can be written equivalently, up to an additive constant, in terms of quantities that can be computed recursively. Then, starting from an initial estimate, the elements of the parameter vector are updated for each new sample by solving a scalar subproblem in closed form, one coordinate at a time, given the current estimate of the remaining coordinates. The complexity of the full online cyclic minimization Lasso is $O(p^2)$ per sample.
Under certain conditions on the regressors, the sparsity of the parameter vector, and the noise, it is possible to prove that the Lasso estimator possesses 'oracle' properties. That is, asymptotically it can identify the support set and perform as well as Rls applied to the nonzero coefficients of the parameter vector, cf. [37, 38, 39, 36]. This, however, requires selecting the hyperparameter $\lambda$ based on knowledge of the noise variance, which is rarely feasible in practical (online) scenarios.
2.3 Square-root Lasso
To circumvent the need to know the noise variance in the Lasso, a subtle modification of (5) was proposed in [40],
$$\min_{\boldsymbol{\theta}} \ \|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|_2 + \lambda\|\boldsymbol{\theta}\|_1 \qquad (6)$$
where the first term, containing the residuals, is the square-root of that in (5). As argued in [40], near-oracle performance for both (5) and (6) can be achieved when $\lambda$ is chosen as the smallest value that dominates the gradient of the first term, evaluated at the true $\boldsymbol{\theta}$. At this point, the gradient captures the estimation errors arising from noise alone. However, by reparameterizing (2) as $\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \sigma\tilde{\mathbf{w}}$, where $\tilde{\mathbf{w}}$ has unit-variance elements, it is seen that the gradients of the first terms in (5) and (6) differ in one crucial respect; namely, the latter does not depend on $\sigma$, thus rendering the choice of $\lambda$ for (6) invariant to the noise level.
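The invariance argument can be checked numerically. Assuming the square-root Lasso cost $\|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|_2 + \lambda\|\boldsymbol{\theta}\|_1$, the gradient of the first term at the true $\boldsymbol{\theta}$ is $\mathbf{X}^\top\mathbf{w}/\|\mathbf{w}\|$, whose largest entry does not change when the noise is rescaled (dimensions below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 20
X = rng.standard_normal((n, p))
w = rng.standard_normal(n)        # residual at the true theta is the noise

def sqrt_loss_grad_inf_norm(X, w):
    """max_k |X_k^T w| / ||w||_2: the value lambda must dominate in (6)."""
    return np.max(np.abs(X.T @ w)) / np.linalg.norm(w)

g1 = sqrt_loss_grad_inf_norm(X, w)
g2 = sqrt_loss_grad_inf_norm(X, 10.0 * w)   # 10x larger noise level
# g1 == g2: the required lambda is invariant to the noise scale, whereas
# the plain Lasso gradient X^T w grows linearly with sigma.
```

The equality is exact, since the scaling factor cancels between the numerator and the denominator.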
Another way to address the dependence on $\sigma$ is to estimate it [41]. The square-root Lasso estimator in (6) can in fact be interpreted as an $\ell_1$-penalized joint estimator of location and scale used in robust regression. Suppose the loss is a convex function of the normalized residuals. Then the concomitant M-estimator of location and scale is given by [42, ch. 7]
(7) 
where the constant appearing in (7) is a user-defined parameter. In robust regression, various loss functions are considered to mitigate noise outliers. For a squared-error loss, the minimizing scale is obtained in closed form. Penalizing the M-estimator in (7) by the $\ell_1$ norm and concentrating out the minimizing scale yields (6).
While an efficient choice of $\lambda$ in (6) is independent of $\sigma$, user input is still required; furthermore, the choices of $\lambda$ in [40] are predicated on the assumption that each column of $\mathbf{X}$ has unit norm. Such a rescaling of the regressors may not be practical in an online scenario. Note that a cyclic minimization algorithm for the convex square-root Lasso has been presented in the supplementary material of [40] (albeit only for the real-valued case and without any derivation), but an online implementation has not yet been formulated.
2.4 Spice as weighted square-root Lasso
Let us now consider the estimation problem from a statistical point of view. Suppose $\boldsymbol{\theta}$ is a zero-mean random variable with covariance matrix $\boldsymbol{\Lambda} = \mathrm{E}[\boldsymbol{\theta}\boldsymbol{\theta}^*]$. Then the linear estimator that minimizes the mean square error is obtained by solving
$$\min_{\hat{\boldsymbol{\theta}} = \mathbf{G}\mathbf{y}} \ \mathrm{E}\big[\|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\|^2\big] \qquad (8)$$
and can be written in closed form as [43, 25, 26]
$$\hat{\boldsymbol{\theta}} = \boldsymbol{\Lambda}\mathbf{X}^*\big(\mathbf{X}\boldsymbol{\Lambda}\mathbf{X}^* + \sigma^2\mathbf{I}\big)^{-1}\mathbf{y} \qquad (9)$$
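A small numerical sketch of the estimator in (9), assuming the standard LMMSE form $\hat{\boldsymbol{\theta}} = \boldsymbol{\Lambda}\mathbf{X}^*(\mathbf{X}\boldsymbol{\Lambda}\mathbf{X}^* + \sigma^2\mathbf{I})^{-1}\mathbf{y}$ with a diagonal prior covariance; the second expression verifies the equivalent "information" form obtained via the matrix inversion lemma (symbols and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 8
X = rng.standard_normal((n, p))
Lam = np.diag(rng.uniform(0.1, 2.0, p))   # diagonal prior covariance of theta
sigma2 = 0.5                              # noise variance

# data drawn from the assumed model
y = X @ (np.linalg.cholesky(Lam) @ rng.standard_normal(p)) \
    + np.sqrt(sigma2) * rng.standard_normal(n)

# LMMSE estimate: Lam X^T (X Lam X^T + sigma2 I)^{-1} y
Gamma = X @ Lam @ X.T + sigma2 * np.eye(n)
theta_hat = Lam @ X.T @ np.linalg.solve(Gamma, y)

# Equivalent form via the matrix inversion lemma:
# (X^T X / sigma2 + Lam^{-1})^{-1} X^T y / sigma2
theta_alt = np.linalg.solve(X.T @ X / sigma2 + np.linalg.inv(Lam),
                            X.T @ y / sigma2)
```

The two forms agree to numerical precision; the first inverts an $n \times n$ matrix, the second a $p \times p$ matrix.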
In the problem under consideration, however, neither the covariance matrix of $\boldsymbol{\theta}$ nor the noise variance is known. By treating them as unknown parameters, they can be estimated by a covariance-matching approach (e.g., [44, 45]) and then used in (9).
For reasons of parsimony and tractability we do not model any correlations between the elements of $\boldsymbol{\theta}$, and hence $\boldsymbol{\Lambda}$ is a diagonal matrix. Now consider the covariance matrix of the data, $\boldsymbol{\Gamma} = \mathbf{X}\boldsymbol{\Lambda}\mathbf{X}^* + \sigma^2\mathbf{I}$, which is a function of $\boldsymbol{\Lambda}$ and $\sigma^2$. We choose these nonnegative parameters to match the covariance of the observed data, by minimizing the criterion
$$\big\|\boldsymbol{\Gamma}^{-1/2}\big(\mathbf{y}\mathbf{y}^* - \boldsymbol{\Gamma}\big)\big\|_F^2$$
with respect to $\boldsymbol{\Lambda}$ and $\sigma^2$. This criterion is the basis of the sparse iterative covariance-based estimation (Spice) framework.
Using this covariance-matching approach is equivalent to solving for the parameters jointly in the following augmented problem
(10) 
which is similar in form to (8) but contains an additional penalty term. (See Appendix A for a proof of this equivalence.) Furthermore, following [46, 47, 48] it can be shown that solving for $\boldsymbol{\Lambda}$ and $\sigma^2$, and concentrating them out from (10), results in
$$\min_{\boldsymbol{\theta}} \ \|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|_2 + \sum_{k} w_k |\theta_k| \qquad (11)$$
where $w_k = \|\mathbf{X}_k\|_2/\sqrt{n}$ are data-dependent weights.
Eq. (11) can be interpreted as a weighted, hyperparameter-free square-root Lasso. As is the case with the square-root Lasso, online formulations of (11) have not appeared in the literature.
2.5 Problem formulation
We have reviewed several approaches to sparse parameter estimation as well as some of their interconnections and limitations. Note that all of the estimators considered above involve convex minimization problems. The $\ell_2$- and $\ell_1$-penalized forms of (3), given in (4) and (5), have concise online formulations but require the careful selection of hyperparameters; furthermore, an efficient choice depends on the unknown noise power $\sigma^2$. The hyperparameter choice is rendered invariant to $\sigma$ by the modification in (6). Moreover, this selection is entirely avoided in (11) using the Spice approach.
The goal of the remainder of the paper is to formulate an online Spice estimator for the sparse vector $\boldsymbol{\theta}$ (see (11)) given streaming data. This estimator, denoted 'OlSpice', obviates the need for user-defined hyperparameters, treats the complex-valued case as simply as the real-valued one, and is of the same complexity order as the online solutions of (4) and (5). In the numerical example section we provide results comparing the aforementioned online estimators, viz. OlRls, OlLasso and OlSpice.
3 Online Spice
First we formulate a low-complexity cyclic minimization algorithm for the cost function in (11). Then, using this result, we derive an online estimator which sequentially processes a stream of data with a fixed computational complexity per sample.
3.1 Cyclic minimization
Let the cost function in (11) be denoted compactly. We begin by minimizing it with respect to one component at a time (omitting the sample index to lighten the notation); the cost function can then be rewritten as
(12) 
where the indicated quantity is the $k$th diagonal element of the corresponding matrix and the remaining term is a constant. To tackle this scalar minimization problem we reparameterize the $k$th variable in polar form $\theta_k = \rho e^{i\varphi}$, where $\rho \ge 0$ and $\varphi \in [0, 2\pi)$ (or $e^{i\varphi} = \pm 1$ when $\theta_k$ is real-valued). This enables the following reformulation of the quadratic term in (12):
(13) 
Inserting (13) into the cost function and noting that the modulus of the $k$th variable equals its magnitude, we obtain the following criterion as a function of the magnitude and phase,
(14) 
The minimizing phase is simply
(15) 
whether the data is complex- or real-valued.
Next, let
(16) 
so that we can write (14) as
(17) 
Note that by the Cauchy-Schwarz inequality
(18) 
Equality in (18) occurs only when the two vectors involved are collinear. Inserting (2) into the expressions, the residual terms represent estimation errors incurred when holding the remaining coefficients constant. Due to the random noise, the vectors will not be collinear, and hence the inequality (18) is strict with probability 1.
We now show that (17) is convex and derive the minimizing magnitude (dropping the coordinate index in what follows, for notational simplicity). The first-order derivative is
(19) 
where the quadratic expression in the denominator can be factored as
(20) 
Given the strict inequality in (18) it follows that the right-hand side of (20), and therefore the denominator of (19), is strictly positive. The second-order derivative can be expressed as
Note that the above expression is positive, in view of (18). Thus the function (17) is convex and the minimizer is given by its stationary point when one exists; otherwise the minimizer is zero.
The stationary point, at which the derivative vanishes, can be found by solving (see (19)):
which leads to the condition given that both sides must be negative. By squaring both sides of the above expression we can write
using (20), or
Therefore we can write the solution more compactly as (reinstating the dependence on the coordinate index):
(21) 
in view of the facts established above.
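The derivation above can be summarized in code. Writing the scalar cost generically as $f(t) = \sqrt{a t^2 - 2bt + c} + wt$ with $ac > b^2$ (strict Cauchy-Schwarz) and $a > w^2$, squaring the stationarity condition and keeping the admissible root gives the closed-form minimizer; the names $a$, $b$, $c$, $w$ are generic placeholders, not the paper's notation:

```python
import numpy as np

def scalar_minimizer(a, b, c, w):
    """Minimize f(t) = sqrt(a*t**2 - 2*b*t + c) + w*t over t >= 0.

    Assumes a*c > b**2 (strict Cauchy-Schwarz) and a > w**2, so f is
    convex; squaring f'(t) = 0 and keeping the root with a*t - b < 0
    gives the closed form, clipped at zero when no interior stationary
    point exists.
    """
    t = (b - w * np.sqrt((a * c - b * b) / (a - w * w))) / a
    return max(t, 0.0)

# numerical check against a dense grid
a, b, c, w = 4.0, 1.0, 3.0, 0.5
t_star = scalar_minimizer(a, b, c, w)
grid = np.linspace(0.0, 2.0, 200001)
f = np.sqrt(a * grid**2 - 2 * b * grid + c) + w * grid
t_grid = grid[np.argmin(f)]
```

When the unclipped root is negative (e.g. for $b \le 0$) the clipped solution is zero, matching the boundary case noted in the text.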
3.2 Online formulation
We now derive a method for efficiently computing the coordinate update (22), given the current estimate at each sample. At the start, the estimate is initialized as the zero vector. We note that the variables in (15) and (16) depend on quantities whose dimensions grow with the number of samples. By introducing recursively computed variables we derive an estimate update that keeps the complexity and memory storage constant at each sample and is of the same complexity order as online Rls and Lasso.
First, we introduce an auxiliary variable, which will subsequently be eliminated as we proceed in the derivation. This yields an identity which enables the variables in (16) to be expressed as
(23) 
Next, introduce the auxiliary variables
(24) 
and the recursively computed variables
(25) 
that are initialized at zero. Then (23) can be simplified as follows:
(26) 
where the subscript denotes the corresponding element of the vector. Similarly, (15) can be expressed as
(27) 
Therefore the computation of (22) can be expressed in terms of (24), (25) and the current estimate.
Once a coefficient has been computed, the current estimate must be updated along with the auxiliary variables in order to compute the subsequent coefficients. The estimate is easily updated, and it follows that the update of (24) equals
which involves a small number of scalar operations and an addition of two vectors. The auxiliary variable can now be eliminated by re-initializing the variables in (24) for each new sample. We summarize OlSpice in Algorithm 1. The algorithm specifies the update of the estimate for each new sample and is initialized with the zero estimate. The cyclic computation of the coefficients is terminated after a fixed number of repetitions per sample, cf. line 14 in Algorithm 1.
In sum, by introducing the auxiliary variables we can maintain constant storage and a computational complexity of order $O(p^2)$ per sample. Since the number of cycles is a constant independent of the sample index, this is the same complexity order as online Rls and Lasso. As reported below, a small number of cycles performs well in practice. Other update strategies, akin to those considered in [36], can be explored in online applications where complexity needs to be reduced further.
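To make the overall scheme concrete, here is a simplified real-valued sketch of an OlSpice-style estimator, under the assumption that the concentrated cost takes the weighted square-root-Lasso form $\min_{\boldsymbol{\theta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|_2 + \sum_k w_k|\theta_k|$ with column-norm-based weights. It maintains $\mathbf{R} = \mathbf{X}^\top\mathbf{X}$, $\mathbf{r} = \mathbf{X}^\top\mathbf{y}$ and $\|\mathbf{y}\|^2$ recursively; for clarity, each coordinate's scalar constants are recomputed naively rather than with the paper's constant-cost recursions, so this sketch trades efficiency for readability. All names are ours, not the paper's:

```python
import numpy as np

def scalar_min(a, b, c, w):
    """Minimize sqrt(a*t**2 - 2*b*t + c) + w*t over t >= 0 (cf. Sec. 3.1)."""
    if a <= w * w or b <= 0.0:
        return 0.0
    disc = max(a * c - b * b, 0.0)   # nonnegative by Cauchy-Schwarz
    t = (b - w * np.sqrt(disc / (a - w * w))) / a
    return max(t, 0.0)

def olspice_like(stream, p, n_cycles=3):
    """Online weighted square-root-Lasso sketch (assumed OlSpice-style form)."""
    R = np.zeros((p, p))     # running X^T X
    r = np.zeros(p)          # running X^T y
    yy = 0.0                 # running ||y||^2
    theta = np.zeros(p)
    for t, (x, y) in enumerate(stream, start=1):
        R += np.outer(x, x)
        r += x * y
        yy += y * y
        for _ in range(n_cycles):
            for k in range(p):
                theta[k] = 0.0
                b = r[k] - R[k] @ theta           # linear term with theta_k = 0
                c = yy - 2.0 * (r @ theta) + theta @ R @ theta
                w = np.sqrt(R[k, k] / t)          # assumed column-norm weight
                theta[k] = np.sign(b) * scalar_min(R[k, k], abs(b), c, w)
    return theta

# illustrative sparse stream
rng = np.random.default_rng(6)
p = 10
theta_true = np.zeros(p)
theta_true[0], theta_true[5] = 2.0, -1.0
stream = []
for _ in range(300):
    x = rng.standard_normal(p)
    stream.append((x, x @ theta_true + 0.05 * rng.standard_normal()))
theta_hat = olspice_like(stream, p)
```

No hyperparameter appears anywhere: the weights are computed from the data itself, which is the key property the section establishes.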
Remark 1.
The original Spice batch algorithms [3, 7], with uniform noise variance, and the above online formulation solve the same convex problem iteratively. The former uses an initial batch estimate whereas the latter is initialized by setting the estimate to zero. A more important difference, however, is that the former requires repeated matrix inversions, which are computationally demanding, whereas the latter requires none. This renders batch Spice intractable when the problem dimensions take on large values (such as on a regular PC) and precludes its use in the scenarios considered in this work.
Remark 2.
We note that the approach employed to derive OlSpice also enables an alternative formulation of OlLasso that treats the complex-valued case as simply as the real-valued one. See Appendix B for a derivation.
4 Numerical evaluation
In this section we compare the derived OlSpice with feasible and infeasible versions of the OlRls and OlLasso [36].
The infeasible OlRls is implemented by processing only the (unknown) subset of nonzero coefficients in the parameter vector, whereas the feasible OlRls processes the entire vector, with the regularization parameter set to an arbitrary value. The infeasible OlLasso is implemented by setting the hyperparameter proportional to the (unknown) noise level [36], whereas for the feasible OlLasso a fixed value is used.
The performance of the estimators was evaluated using the normalized mean-square error
$$\mathrm{NMSE} = 10\log_{10}\frac{\mathrm{E}\big[\|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\|^2\big]}{\mathrm{E}\big[\|\boldsymbol{\theta}\|^2\big]} \ \text{[dB]}. \qquad (28)$$
When the parameter vector is an unknown deterministic variable, the expectation with respect to it in (28) should be removed. Note that an NMSE value below 0 dB quantifies the error reduction from the initial guess. The NMSE was evaluated using 100 Monte Carlo simulations. We used a PC with an Intel i7 3.4 GHz CPU and 16 GB RAM. The algorithms were implemented in Matlab without any special code optimization or hardware acceleration.
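The NMSE evaluation can be sketched as follows (a hypothetical helper, not the paper's Matlab code; for a deterministic parameter vector the outer averages reduce to a single run):

```python
import numpy as np

def nmse_db(theta_hats, theta_trues):
    """Normalized MSE in dB over Monte Carlo runs:
    10*log10( E||theta_hat - theta||^2 / E||theta||^2 ).
    0 dB corresponds to the initial guess theta_hat = 0."""
    err = np.mean([np.sum(np.abs(h - t) ** 2)
                   for h, t in zip(theta_hats, theta_trues)])
    ref = np.mean([np.sum(np.abs(t) ** 2) for t in theta_trues])
    return 10.0 * np.log10(err / ref)
```

For the zero estimate the numerator equals the denominator, giving exactly 0 dB, and halving the error vector lowers the NMSE by about 6 dB.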
Remark: In the interest of reproducible research we have made the codes for OlSpice, as well as for the presented numerical experiments, available at https://www.it.uu.se/katalog/davza513.
4.1 Real-valued example: random regressors
To illustrate the performance of the estimators we consider a scenario with the real-valued regressor elements in (1) drawn independently and identically distributed (i.i.d.) from a Gaussian distribution with zero mean and unit variance. The signal-to-noise ratio is defined as
where the sum in the numerator is taken over the support set of the parameter vector. We use Gaussian noise throughout all experiments.
We first consider the parameter vector to be deterministic, with a few nonzero elements. Note that since the regressors are drawn independently, the chosen support set will not affect the performance. When the number of samples is very small, the estimates from OlLasso and OlSpice have a higher variance than that of OlRls, which biases its estimate more strongly towards the initial value. For clarity we therefore use the same initial estimate for all estimators during the first samples. The results are shown in Fig. 1 for SNR = 20 dB. We observe a significant performance gap between the feasible and infeasible OlLasso, which illustrates how critical it is to tune the hyperparameter to the unknown noise variance. Both OlLasso and OlSpice quickly prune out many nonzero coefficient estimates, the effect of which is visible in the transition phase of the NMSE plot. The performance of OlRls eventually becomes better than that of the feasible OlLasso as the number of samples grows. OlSpice outperforms the feasible OlLasso after a moderate number of samples and is closer to the infeasible version, which uses an optimally tuned hyperparameter.
Fig. 2 presents the variance and bias of the estimators by decomposing the mean-square error in (28). Both versions of OlLasso exhibit much lower variance than squared bias, whereas OlSpice has a more balanced variance-bias composition and, noticeably, the lowest bias among the considered estimators.
In Fig. 3 we see that the NMSE for the feasible OlRls and OlLasso, which is dominated by the bias, remains virtually unaffected by increasing SNR for a fixed number of samples. By contrast, the errors for OlSpice and the infeasible OlLasso decrease as the signal conditions improve.
Next, we study the effect of the number of iteration cycles per sample in OlSpice. The results are illustrated in Fig. 4. We note that the performance characteristics for the different settings are very similar. For small sample sizes, a larger number of cycles leads to a slightly faster decrease of the NMSE, but the differences rapidly diminish as the number of samples increases and the NMSE curves almost coincide. For reference we included the infeasible OlRls, which provides a lower bound on the NMSE in Fig. 4 and can be compared to Fig. 1.
In the next experimental setup we increase the number of nonzero elements, raising the element density of the parameter vector to 10%. The results in Fig. 5 show that OlSpice copes better with less sparse parameter vectors than OlLasso. For sufficiently many samples it exhibits a lower NMSE than the rest, owing to a lower bias (not shown here but observed by us in the numerical evaluation).
Finally, we consider a setup where the parameter vector is random. Since the support set is unimportant in the present case, we generate the nonzero elements using independent Gaussian variables with zero mean and unit variance, resulting in a wider dynamic range than in the previous experiments. Nevertheless, the results presented in Fig. 6 show performance characteristics similar to the deterministic case presented in Fig. 1.
4.2 Real-valued example: sinusoids in noise
In contrast to the previous example, we now present a case where the regressor columns in (2) are highly correlated. Specifically, as a further example with real-valued parameters, we consider the identification of a sum of sinusoids at given frequencies with unknown phases and amplitudes (most of which are zero). In the following we consider candidate sinusoids on a uniform grid of frequencies. We set nonzero amplitudes for two slowly-varying, narrowly spaced sinusoids and for one high-frequency sinusoid. The phases of the three sinusoids were set to 0.
We define the signal-to-noise ratio as
where the sum is taken over the set of nonzero amplitudes. The signal is parameterized as a linear combination of cosine and sine terms at the grid frequencies, so that the model is linear in the unknown amplitude parameters, and the regressor vector stacks the corresponding cosine and sine components.
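The linear-in-parameters sinusoidal model can be sketched as follows, with an illustrative frequency grid and amplitudes (not the paper's exact settings): stacking cosine and sine terms absorbs the unknown phases into amplitude pairs, keeping the model linear.

```python
import numpy as np

def sinusoid_regressor(t, freqs):
    """Regressor for y_t = sum_k a_k*cos(w_k*t) + b_k*sin(w_k*t).

    Stacking cosine and sine terms makes the amplitude/phase model linear
    in the 2K unknowns (a_1..a_K, b_1..b_K)."""
    return np.concatenate([np.cos(freqs * t), np.sin(freqs * t)])

# a sparse two-sinusoid signal on a grid of K candidate frequencies
K = 100
freqs = np.pi * np.arange(1, K + 1) / K       # uniform grid in (0, pi]
theta = np.zeros(2 * K)                       # sparse amplitude vector
theta[2] = 1.0                                # cosine amplitude, 3rd grid freq
theta[K + 70] = 0.5                           # sine amplitude, 71st grid freq
y = np.array([sinusoid_regressor(t, freqs) @ theta for t in range(10)])
```

Because neighboring grid frequencies produce nearly identical regressor columns, the columns of the resulting matrix are highly correlated, which is precisely the difficulty this example probes.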
We set SNR = 20 dB. First, OlSpice is compared with the feasible and infeasible OlLasso, which perform substantially differently from one another but achieve the same rate of NMSE decrease. The results are presented in Fig. 7. OlSpice overtakes the feasible OlLasso after a moderate number of samples. Notably, the NMSE of OlSpice decreases until it reaches a plateau where the estimation errors are very small but where the noise level cannot be properly identified. This interesting transition characteristic still awaits a satisfactory explanation. Subsequently, OlSpice approaches the infeasible OlLasso as time progresses. For reference, we have also added the feasible OlRls, which illustrates the degradation incurred when the parameter sparsity is not taken into account. Note, however, that OlRls eventually outperforms the OlLasso estimator whose hyperparameter has not been finely tuned to the noise level.
Next, Fig. 8 illustrates how the number of cycles per sample affects OlSpice. We see that the performance characteristics for the different settings are very similar, as was the case with weakly correlated regressor columns in Fig. 4. The smallest setting, however, requires slightly more samples to reach the plateau, resulting in a temporary gap in NMSE relative to the larger settings.
4.3 Complex-valued example: synthetic aperture radar imaging
Finally, we illustrate how OlSpice performs in a complex-valued case, and compare it with OlRls and a novel form of OlLasso for this scenario, cf. Appendix B.
We consider a setup similar to that of synthetic aperture radar imaging, where an antenna transmits an electromagnetic pulse and the reflected signal carries information about potential scatterers in the scene of interest, cf. [49]. Each position coordinate in the scene has an associated reflection coefficient. The observed signal is in the spatial frequency domain, where each sample corresponds to a particular angle. If we grid the space of the scene, the signal at each sample can be modeled linearly in terms of the vectorized image of reflection coefficients. For simplicity, we consider sampling angles such that each observation corresponds to a coefficient of the two-dimensional discrete Fourier transform. The true image used in this example is shown in Fig. 9, which comprises a few point scatterers of equal amplitude.
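The 2-D DFT observation model can be sketched as follows (an illustrative 8x8 scene, not the paper's setup); each sample is a single coefficient of the two-dimensional DFT of the complex reflectivity image, which can be checked against np.fft.fft2:

```python
import numpy as np

def dft2_sample(scene, k1, k2):
    """Observation at discrete spatial frequency (k1, k2): one coefficient
    of the 2-D DFT of the reflectivity image."""
    n = scene.shape[0]
    p = np.arange(n)
    e1 = np.exp(-2j * np.pi * k1 * p / n)
    e2 = np.exp(-2j * np.pi * k2 * p / n)
    # the regressor is the outer product e1 e2^T, flattened to match
    # the row-major vectorization of the scene
    x = np.outer(e1, e2).ravel()
    return x @ scene.ravel()

# an illustrative scene with two point scatterers
scene = np.zeros((8, 8), dtype=complex)
scene[2, 5] = 1.0
scene[6, 1] = 0.5j
```

Each randomly chosen frequency pair thus defines one complex-valued regressor row, and the sparse unknown is the vectorized scene.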
The observations at each sample were taken at a randomly chosen angle (corresponding to randomly chosen discrete spatial frequencies). The signal to noise ratio was set to dB. In Fig. 10 we compare the estimated images using OlRls, OlLasso and OlSpice. Note that in this signal setup the hyperparameter in the infeasible OlLasso over-penalizes the $\ell_1$ norm of the parameter vector, which results in no visible scatterers. To produce meaningful plots for OlLasso, the hyperparameter was adjusted downward, which illustrates the difficulty of selecting it in practical applications. When the number of samples approaches the number of unknowns, all three methods estimate the locations and intensities of the point scatterers accurately, but OlSpice is capable of producing accurate images with far fewer samples than the other two methods, which would require fine-tuning. Indeed, the scatterer pattern is already visible after comparatively few samples in the OlSpice image, without any user input.
5 Conclusions
We have derived an online sparse estimator, called OlSpice, that obviates the need for tuning hyperparameters. Its computational simplicity and adaptability to complex-valued parameters render it suitable for large-scale inference problems as well as real-time applications, such as system identification and synthetic aperture radar imaging. The code for OlSpice has been made available to facilitate its use in applications.
Appendix A: Linear minimum mean-square estimator and covariance matching
Here we prove that the minimizer of (10) is equivalent to using the linear minimum mean-square estimator (9) with covariance parameters set through covariance matching.
For notational simplicity, let the covariance parameters collect the diagonal elements of $\boldsymbol{\Lambda}$ and $\sigma^2$, and drop the sample subindex. Now (9) can be written as
We note that (9) is invariant to any scaling of the covariance parameters. That is,
(29) 
for any positive scaling factor, as can be verified by direct substitution. Finally, note that since (9) minimizes (8), it is therefore the minimizer of the augmented problem (10) as well.
We proceed by inserting (9) in (10); this will lead to a concentrated cost function that is equivalent to using the covariance-matching criterion. First, using the matrix inversion lemma, note that:
so that
Thus, after concentrating out $\boldsymbol{\theta}$, (10) can be written as
(30) 
Now expand the covariance-matching criterion,
where the omitted term is a constant. The covariance-matching problem can thus be written equivalently as
(31) 
which is similar to (30). Denote the cost functions in (30) and (31) accordingly. We now show that their respective minimizers differ only by a scaling constant. This follows from
so that the minimizers satisfy the corresponding relation. It follows that the minimizers of the concentrated cost function (30) and the covariance-matching criterion (31) differ only by a scaling factor. From (29) we know that the linear minimum mean-square estimator is invariant to uniform scaling of the covariance parameters. This concludes the proof. (See also [48] for further details of this result.)