Group-Lasso on Splines for Spectrum Cartography

Juan A. Bazerque, Gonzalo Mateos and Georgios B. Giannakis (contact author)
Abstract

The unceasing demand for continuous situational awareness calls for innovative and large-scale signal processing algorithms, complemented by collaborative and adaptive sensing platforms to accomplish the objectives of layered sensing and control. Towards this goal, the present paper develops a spline-based approach to field estimation, which relies on a basis expansion model of the field of interest. The model entails known bases, weighted by generic functions estimated from the field’s noisy samples. A novel field estimator is developed based on a regularized variational least-squares (LS) criterion that yields finitely-parameterized (function) estimates spanned by thin-plate splines. Robustness considerations motivate well the adoption of an overcomplete set of (possibly overlapping) basis functions, while a sparsifying regularizer augmenting the LS cost endows the estimator with the ability to select a few of these bases that “better” explain the data. This parsimonious field representation becomes possible, because the sparsity-aware spline-based method of this paper induces a group-Lasso estimator for the coefficients of the thin-plate spline expansions per basis. A distributed algorithm is also developed to obtain the group-Lasso estimator using a network of wireless sensors, or, using multiple processors to balance the load of a single computational unit. The novel spline-based approach is motivated by a spectrum cartography application, in which a set of sensing cognitive radios collaborate to estimate the distribution of RF power in space and frequency. Simulated tests corroborate that the estimated power spectrum density atlas yields the desired RF state awareness, since the maps reveal spatial locations where idle frequency bands can be reused for transmission, even when fading and shadowing effects are pronounced.

Work in this paper was supported by the NSF grants CCF-0830480 and ECCS-0824007. Part of the paper appeared in the Proc. of the 43rd Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, Nov. 1-4, 2009. The authors are with the Dept. of Electrical and Computer Engineering, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455. Tel/fax: (612)626-7781/625-4583; Emails: {bazer002,mate0058,georgios}@umn.edu


Keywords: Sparsity, splines, (group-)Lasso, field estimation, cognitive radio sensing, optimization.

I Introduction

Well-appreciated as a tool for field estimation, thin-plate (smoothing) splines find application in areas as diverse as climatology [wahba_weather], image processing [Chui00anew], and neurophysiology [Perrin87]. Spline-based field estimation involves approximating a deterministic map from a finite number of its noisy data samples, by minimizing a variational least-squares (LS) criterion regularized with a smoothness-controlling functional. In the dilemma of trusting a model versus trusting the data, splines favor the latter, since only a mild regularity condition is imposed on the derivatives of the map, which is otherwise treated as a generic function. While this generality is inherent to the variational formulation, the smoothness penalty renders the estimated map unique and finitely parameterized [duchon, p. 85], [wahba, p. 31]. With the variational problem solution expressible by polynomials and specific kernels, the aforementioned map approximation task reduces to a parameter estimation problem. Consequently, thin-plate splines operate as a reproducing kernel Hilbert space (RKHS) learning machine in a suitably defined (Sobolev) space [wahba, p. 34].

Although splines emerge as variational LS estimators of deterministic fields, they are also connected to classes of estimators for random fields. The first class assumes that estimators are linearly related to the measured samples, while the second one assumes that fields are Gaussian distributed. The first corresponds to the Kriging method while the second to the Gaussian process model; but in both cases one deals with a best linear unbiased estimator (BLUE) [Stein_Kriging]. Typically, wide sense stationarity is assumed for the field’s spatial correlation needed to form the BLUE. The so-termed generalized covariance model adds a parametric nonstationary term comprising known functions specified a priori [matheron]. Inspection of the BLUE reveals that if the nonstationary part is selected to comprise polynomials, and the spatial correlation is chosen to be the splines kernel, then the Kriging, Gaussian process, and spline-based estimators coincide [wahba, p. 35].

Bearing in mind this unifying treatment of deterministic and random fields, the main subjects of this paper are spline-based estimation, and the practically motivated sparse (and thus parsimonious) description of the wanted field. Toward these goals, the following basis expansion model (BEM) is adopted for the target map $\Phi(\mathbf{x},f)$

$$\Phi(\mathbf{x},f)=\sum_{\nu=1}^{N_b} b_\nu(f)\, g_\nu(\mathbf{x}) \qquad (1)$$

with $\mathbf{x}$ denoting spatial position, $f$ frequency, and the norms $\|b_\nu\|$ normalized to unity.

The bases $\{b_\nu(f)\}_{\nu=1}^{N_b}$ are preselected, and the functions $\{g_\nu(\mathbf{x})\}_{\nu=1}^{N_b}$ are to be estimated based on noisy samples of $\Phi$. This way, the model-versus-data balance is calibrated by introducing a priori knowledge on the dependence of the map with respect to (w.r.t.) the variable $f$, or more generally a group of variables, while trusting the data to dictate the functions of the remaining variables $\mathbf{x}$.
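
To fix ideas, the following minimal sketch (in Python with NumPy) evaluates a toy map of the form (1); the Gaussian-shaped frequency bases and exponential spatial weights used here are illustrative placeholders, not the bases or estimates employed later in the paper.

```python
import numpy as np

def bem_psd(x, f, bases, g_funcs):
    """Evaluate the BEM (1): Phi(x, f) = sum_nu b_nu(f) * g_nu(x).

    x       : (2,) spatial position
    f       : scalar frequency
    bases   : list of callables b_nu(f)   (known, fixed)
    g_funcs : list of callables g_nu(x)   (to be estimated in practice)
    """
    return sum(b(f) * g(x) for b, g in zip(bases, g_funcs))

# Toy example: two Gaussian-shaped frequency bases and two spatial weighting functions.
bases = [lambda f, c=c: np.exp(-0.5 * ((f - c) / 5.0) ** 2) for c in (20.0, 40.0)]
g_funcs = [lambda x: np.exp(-np.linalg.norm(x - np.array([0.0, 0.0]))),
           lambda x: np.exp(-np.linalg.norm(x - np.array([10.0, 5.0])))]

print(bem_psd(np.array([1.0, 2.0]), 25.0, bases, g_funcs))
```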

Consider selecting the basis functions using the basis pursuit approach [basis_pursuit], which entails an extensive set of bases, thus rendering $N_b$ overly large and the model overcomplete. This motivates augmenting the variational LS problem with a suitable sparsity-encouraging penalty, which endows the map estimator with the ability to discard factors in (1), only keeping a few bases that “better” explain the data. This attribute is inherited because the novel sparsity-aware spline-based method of this paper induces a group-Lasso estimator for the coefficients of the optimal finitely-parameterized functions $\{\hat g_\nu\}$. Group-Lasso estimators are known to set groups of weak coefficients to zero (here the groups associated with the coefficients per basis $b_\nu$), and outperform the sparsity-agnostic LS estimator by capitalizing on the sparsity present [yuan_group_lasso], [puig_hero]. An iterative group-Lasso algorithm is developed that yields closed-form estimates per iteration. A distributed version of this algorithm is also introduced for data collected by cooperating sensors, or for computational load-balancing of multiprocessor architectures. A related approach to model selection in nonparametric regression is the component selection and smoothing operator (COSSO) [cosso]. Different from the approach followed here, COSSO is limited to smoothing-spline, analysis-of-variance models, where the target function is assumed to be expressible by a superposition of orthogonal component functions. Compared to the single group-Lasso estimate here, COSSO entails an iterative algorithm, which alternates through a sequence of smoothing spline [elements_of_statistics, p. 151] and nonnegative garrote [Breiman_garrote_1995] subproblems.

The motivation behind the BEM in (1) comes from our interest in spectrum cartography for wireless cognitive radio (CR) networks, a sensing application that serves as an illustrating paradigm throughout the paper. CR technology holds great promise to address fruitfully the perceived dilemma of bandwidth under-utilization versus spectrum scarcity, which has rendered fixed-access communication networks inefficient. Sensing the ambient interference spectrum is of paramount importance to the operation of CR networks, since it enables spatial frequency reuse and allows for dynamic spectrum allocation; see, e.g., [ganesan08][mishra06] and references therein. Collaboration among CRs can markedly improve the sensing performance [collaborative_tutorial], and is key to revealing opportunities for spatial frequency reuse [popovski]. Pertinent existing approaches have mostly relied on detecting spectrum occupancy per radio, and do not account for spatio-temporal changes in the radio frequency (RF) ambiance, especially at intended receiver(s) which may reside several hops away from the sensed area.

The impact of this paper’s novel field estimators on CR networks is a collaborative sensing scheme whereby receiving CRs cooperate to estimate the distribution of power in space $\mathbf{x}$ and frequency $f$, namely the power spectrum density (PSD) map $\Phi(\mathbf{x},f)$ in (1), from local periodogram measurements. The estimator need not be extremely accurate, but precise enough to identify spectrum holes. This justifies adopting the known bases $\{b_\nu(f)\}$ to capture the PSD frequency dependence in (1). As far as the spatial dependence is concerned, the model must account for path loss, fading, mobility, and shadowing effects, all of which vary with the propagation medium. For this reason, it is prudent to let the data dictate the spatial component of (1). Knowing the spectrum at any location allows remote CRs to dynamically reuse idle bands. It also enables CRs to adapt their transmit-power so as to minimally interfere with licensed transmitters. The spline-based PSD map here provides an alternative to [bazerque], where known bases are used both in space and frequency. Different from [cartography] and [bazerque], the field estimator here does not presume a spatial covariance model or pathloss channel model. Moreover, it captures general propagation characteristics including both shadowing and fading; see also [Kim09].

Notation: Bold uppercase letters will denote matrices, whereas bold lowercase letters will stand for column vectors. Operators $\otimes$, $(\cdot)'$, $\mathrm{tr}(\cdot)$, $\mathrm{rank}(\cdot)$, $\mathrm{blkdiag}(\cdot)$, $E[\cdot]$ will denote Kronecker product, transposition, matrix trace, rank, block diagonal matrix, and expectation, respectively; $|\cdot|$ will be used for the cardinality of a set, and the magnitude of a scalar. The $\mathcal{L}_2$ norm of a function $g$ is $\|g\| := (\int g^2(\mathbf{x})\,d\mathbf{x})^{1/2}$, while the $\ell_p$ norm of vector $\mathbf{x}$ is $\|\mathbf{x}\|_p := (\sum_i |x_i|^p)^{1/p}$ for $p \ge 1$; and $\|\cdot\|_F$ is the matrix Frobenius norm. Positive definite matrices will be denoted by $\mathbf{M} \succ \mathbf{0}$. The $n \times n$ identity matrix will be represented by $\mathbf{I}_n$, while $\mathbf{0}_n$ will denote the $n \times 1$ vector of all zeros, and $\mathbf{0}_{n\times m} := \mathbf{0}_n\mathbf{0}_m'$. The $i$-th vector in the canonical basis for $\mathbb{R}^n$ will be denoted by $\mathbf{e}_i$, $i = 1, \ldots, n$.

II BEM for Spectrum Cartography

Consider a set of $N_s$ sources transmitting signals using portions of the overall bandwidth $B$. The objective of revealing which of these portions (sub-bands) are available for new systems to transmit suggests that the PSD estimate sought does not need to be highly accurate. This motivates modeling the transmit-PSD of each source $s$ as

$$\Phi_s(f)=\sum_{\nu=1}^{N_b} \theta_{s\nu}\, b_\nu(f) \qquad (2)$$

where the basis $b_\nu(f)$ is centered at frequency $f_\nu$, and $\theta_{s\nu} \ge 0$ is the corresponding expansion coefficient. The example depicted in Fig. 1 involves (generally overlapping) raised cosine bases with support $[f_\nu - (1+\rho)/(2T_s),\ f_\nu + (1+\rho)/(2T_s)]$, where $T_s$ is the symbol period, and $\rho$ stands for the roll-off factor. Such bases can model transmit-spectra of, e.g., multicarrier systems. In other situations, power spectral masks may dictate sharp transitions between contiguous sub-bands, cases in which non-overlapping rectangular bases may be more appropriate. All in all, the set of bases should be selected to accommodate a priori knowledge about the PSD.
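
As an illustration, the sketch below generates an overlapping set of raised-cosine spectral bases; the center frequencies, symbol period, and roll-off factor are arbitrary example values, not parameters prescribed by the paper.

```python
import numpy as np

def raised_cosine(f, f_center, Ts, rho):
    """Raised-cosine spectral pulse centered at f_center (symbol period Ts, roll-off rho)."""
    fa = np.abs(f - f_center)
    f1 = (1.0 - rho) / (2.0 * Ts)           # edge of the flat region
    f2 = (1.0 + rho) / (2.0 * Ts)           # edge of the support
    out = np.zeros_like(fa)
    out[fa <= f1] = 1.0
    roll = (fa > f1) & (fa <= f2)
    out[roll] = 0.5 * (1.0 + np.cos(np.pi * Ts / rho * (fa[roll] - f1)))
    return out

# An overlapping set of bases covering a band of interest (illustrative numbers).
freqs = np.linspace(0.0, 100.0, 1001)           # frequency grid
centers = np.arange(10.0, 100.0, 10.0)          # basis center frequencies
B = np.stack([raised_cosine(freqs, c, Ts=0.1, rho=0.5) for c in centers], axis=1)
B /= np.linalg.norm(B, axis=0)                  # normalize each basis to unit norm
print(B.shape)                                  # (number of grid points, Nb)
```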

The power transmitted by source $s$ will propagate to the location $\mathbf{x}$ according to a generally unknown spatial loss function $l_s(\mathbf{x})$. The propagation model $l_s(\mathbf{x})$ not only captures frequency-flat deterministic pathloss, but also stationary, block-fading and even frequency-selective Rayleigh channel effects, since their statistical moments do not depend on the frequency variable. In this case, the following vanishing memory assumption is required on the transmitted signals for the spatial receive-PSD to be factorizable as $l_s(\mathbf{x})\Phi_s(f)$; see [bazerque] for further details.

(as) Sources are stationary, mutually uncorrelated, independent of the channels, and have vanishing correlation per coherence interval; i.e., , where and represent the coherence interval and delay spread of the channels, respectively.

Under (as), the contribution of source $s$ to the PSD at point $\mathbf{x}$ is $l_s(\mathbf{x})\Phi_s(f)$; and the PSD due to all sources received at $\mathbf{x}$ will be given by $\Phi(\mathbf{x},f)=\sum_{s=1}^{N_s} l_s(\mathbf{x})\Phi_s(f)$. Such a model can be simplified by defining the function $g_\nu(\mathbf{x}) := \sum_{s=1}^{N_s}\theta_{s\nu}\, l_s(\mathbf{x})$. With this definition and upon exchanging the order of summation, the spatial PSD model takes the form in (1), where the functions $g_\nu(\mathbf{x})$ are to be estimated. They represent the aggregate distribution of power across space corresponding to the frequencies spanned by the bases $b_\nu(f)$. Observe that the sources are not explicitly present in (1). Even though this model could have been postulated directly for the cartography task at hand, the previous discussion justifies the factorization of the map per band $\nu$ into factors depending on each of the variables $\mathbf{x}$ and $f$.
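
The following toy sketch verifies numerically that summing per-source contributions and exchanging the order of summation yield the same PSD value; the loss functions, bases, and coefficients are synthetic stand-ins chosen only for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
Ns, Nb = 3, 5                               # number of sources and of frequency bases
theta = rng.uniform(0.0, 1.0, (Ns, Nb))     # per-source basis coefficients (cf. (2))

def loss(s, x):                             # synthetic spatial loss l_s(x): inverse-distance gain
    src = np.array([[0.0, 0.0], [8.0, 3.0], [2.0, 9.0]])[s]
    return 1.0 / (1.0 + np.linalg.norm(x - src) ** 2)

def basis(nu, f):                           # synthetic frequency basis b_nu(f)
    return np.exp(-0.5 * ((f - 10.0 * (nu + 1)) / 4.0) ** 2)

x, f = np.array([1.0, 2.0]), 22.0

# PSD as a sum over sources: sum_s l_s(x) * Phi_s(f), with Phi_s(f) = sum_nu theta[s, nu] b_nu(f)
psd_sources = sum(loss(s, x) * sum(theta[s, nu] * basis(nu, f) for nu in range(Nb))
                  for s in range(Ns))

# Same PSD after exchanging summations: sum_nu b_nu(f) * g_nu(x), g_nu(x) = sum_s theta[s, nu] l_s(x)
g = [sum(theta[s, nu] * loss(s, x) for s in range(Ns)) for nu in range(Nb)]
psd_bem = sum(basis(nu, f) * g[nu] for nu in range(Nb))

print(np.isclose(psd_sources, psd_bem))     # True: the two forms coincide
```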

III Cooperative Spline-Based PSD Field Estimation

The sensing strategy will rely on the periodogram estimate $\hat\Phi_{\mathbf{x}_r}(t, f_n)$ obtained at a set of receiving (sampling) locations $\{\mathbf{x}_r\}_{r=1}^{N_r}$, frequencies $\{f_n\}_{n=1}^{N}$, and time-slots $t$. In order to reduce the periodogram variance and mitigate fading effects, $\hat\Phi_{\mathbf{x}_r}(t, f_n)$ is averaged across a window of $\tau$ time-slots [bazerque], to obtain

$$\varphi_{rn} := \frac{1}{\tau}\sum_{t=1}^{\tau}\hat\Phi_{\mathbf{x}_r}(t, f_n). \qquad (3)$$

Hence, the envisioned setup consists of $N_r$ receiving CRs, which collaborate to construct the PSD map based on the PSD observations $\{\varphi_{rn}\}$. The bulk of the processing is performed centrally at a fusion center (FC), which is assumed to know the position vectors $\{\mathbf{x}_r\}$ of all CRs, and the sensed tones $\{f_n\}_{n=1}^N$. The FC receives over a dedicated control channel the vector of samples $\boldsymbol{\varphi}_r := [\varphi_{r1}, \ldots, \varphi_{rN}]'$ taken by node $r$, for all $r = 1, \ldots, N_r$.
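
A simple way to form such averaged periodogram samples at one CR is sketched below; the FFT-based periodogram and its normalization are generic choices, and windowing or other refinements used in practice are omitted.

```python
import numpy as np

def averaged_periodogram(slots, fs):
    """Average per-slot periodograms over a window of time-slots (cf. (3)).

    slots : (tau, L) array with one length-L baseband record per time-slot
    fs    : sampling rate
    Returns (freqs, phi) with phi[n] the averaged periodogram at frequency freqs[n].
    """
    tau, L = slots.shape
    pgrams = np.abs(np.fft.rfft(slots, axis=1)) ** 2 / (L * fs)   # per-slot periodograms
    phi = pgrams.mean(axis=0)                                     # average across the window
    freqs = np.fft.rfftfreq(L, d=1.0 / fs)
    return freqs, phi

# Example: tau = 50 slots of a noisy tone observed by one CR (synthetic data).
rng = np.random.default_rng(1)
fs, L, tau = 1e3, 256, 50
t = np.arange(L) / fs
slots = np.array([np.cos(2 * np.pi * 100.0 * t + rng.uniform(0, 2 * np.pi))
                  + rng.normal(scale=0.5, size=L) for _ in range(tau)])
freqs, phi = averaged_periodogram(slots, fs)
print(freqs[np.argmax(phi)])   # peak near 100 Hz
```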

While a BEM could be introduced for the spatial loss function as well [bazerque], the uncertainty on the source locations and obstructions in the propagation medium may render such a model imprecise. This will happen, e.g., when shadowing is present. The alternative approach followed here relies on estimating the functions $\{g_\nu\}$ based on the data $\{\varphi_{rn}\}$. To capture the smooth portions of $\Phi(\mathbf{x},f)$, the criterion for selecting $\{g_\nu\}$ will be regularized using a so-termed thin-plate penalty [wahba, p. 30]. This penalty extends to two dimensions the one-dimensional roughness regularization used in smoothing spline models. Accordingly, the functions $\{g_\nu\}$ are estimated as

$$\{\hat g_\nu\}_{\nu=1}^{N_b} := \arg\min_{\{g_\nu \in \mathcal{S}\}} \sum_{r=1}^{N_r}\sum_{n=1}^{N}\Big[\varphi_{rn} - \sum_{\nu=1}^{N_b} b_\nu(f_n)\, g_\nu(\mathbf{x}_r)\Big]^2 + \lambda \sum_{\nu=1}^{N_b}\int_{\mathbb{R}^2} \big\|\nabla^2 g_\nu(\mathbf{x})\big\|_F^2\, d\mathbf{x} \qquad (4)$$

where $\|\nabla^2 g_\nu(\mathbf{x})\|_F$ denotes the Frobenius norm of the Hessian of $g_\nu$.

The optimization is over $\mathcal{S}$, the space of Sobolev functions for which the penalty in (4) is well defined [duchon, p. 85]. The parameter $\lambda \ge 0$ controls the degree of smoothing. Specifically, for $\lambda = 0$ the estimates in (4) correspond to rough functions interpolating the data; while as $\lambda \to \infty$ the estimates yield linear functions (cf. $\nabla^2 g_\nu(\mathbf{x}) \equiv \mathbf{0}$). A smoothing parameter in between these limiting values will be selected using a leave-one-out cross-validation (CV) approach, as discussed later.

III-A Thin-plate splines solution

The optimization problem (4) is variational in nature, and in principle requires searching over the infinite-dimensional functional space $\mathcal{S}$. It turns out that (4) admits closed-form, finite-dimensional minimizers $\hat g_\nu$, as presented in the following proposition, which generalizes standard thin-plate splines results (see, e.g., [wahba, p. 31]) to the multi-dimensional BEM (1).

Proposition 1: The estimates in (4) are thin-plate splines expressible in closed form as

$$\hat g_\nu(\mathbf{x}) = \sum_{r=1}^{N_r} \beta_{\nu r}\, K(\|\mathbf{x} - \mathbf{x}_r\|_2) + \boldsymbol{\alpha}_{\nu 1}'\mathbf{x} + \alpha_{\nu 0} \qquad (5)$$

where $K(\rho) := \rho^2\log\rho$, and $\boldsymbol{\beta}_\nu := [\beta_{\nu 1},\ldots,\beta_{\nu N_r}]'$ is constrained to the linear subspace defined by $\sum_{r=1}^{N_r}\beta_{\nu r} = 0$ and $\sum_{r=1}^{N_r}\beta_{\nu r}\mathbf{x}_r = \mathbf{0}_2$, for $\nu = 1,\ldots,N_b$.

The proof of this proposition is given in Appendix A.

Remark 1 (Overlapping frequency bases). If the basis functions have finite supports which do not overlap, then (4) decouples per basis $\nu$, and thus the results in [wahba, duchon] can be applied directly. The novelty of Proposition 1 is that the basis functions weighted by the spatial spline expansions in (1) are allowed to be overlapping. The implication of Proposition 1 is the finite parametrization of the PSD map [cf. (5)]. This is particularly important for non-FDMA based CR networks. In the forthcoming Section IV, an overcomplete set is adopted in (1), and overlapping bases naturally arise therein.
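
The sketch below evaluates one spatial function of the form (5) using the standard thin-plate kernel $K(\rho)=\rho^2\log\rho$; the coefficients are placeholders, since in practice they would be obtained by solving (6)-(8) under the subspace constraints of Proposition 1.

```python
import numpy as np

def tps_kernel(rho):
    """Standard thin-plate kernel K(rho) = rho^2 * log(rho), with K(0) = 0."""
    out = np.zeros_like(rho)
    pos = rho > 0
    out[pos] = rho[pos] ** 2 * np.log(rho[pos])
    return out

def eval_g(x, knots, beta, alpha1, alpha0):
    """Evaluate one spatial function of the form (5):
       g(x) = sum_r beta[r] K(||x - x_r||) + alpha1' x + alpha0."""
    rho = np.linalg.norm(knots - x, axis=1)
    return beta @ tps_kernel(rho) + alpha1 @ x + alpha0

# Placeholder coefficients for Nr = 4 knots (in practice, solved for via (6)-(8)).
knots = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
beta = np.array([0.3, -0.1, -0.1, -0.1])
beta -= beta.mean()    # enforce sum_r beta_r = 0 (a solver would also enforce the moment constraints)
alpha1, alpha0 = np.array([0.05, -0.02]), 1.0
print(eval_g(np.array([0.4, 0.6]), knots, beta, alpha1, alpha0))
```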

What is left to determine are the parameters $\{\beta_{\nu r}\}$, $\{\boldsymbol{\alpha}_{\nu 1}\}$, and $\{\alpha_{\nu 0}\}$ in (5). To this end, define the vector $\boldsymbol{\varphi} := [\boldsymbol{\varphi}_1',\ldots,\boldsymbol{\varphi}_{N_r}']'$ containing the network-wide data obtained at all frequencies in $\{f_n\}_{n=1}^N$. Three matrices are also introduced collecting the regression inputs: i) $\mathbf{B} \in \mathbb{R}^{N\times N_b}$ with $(n,\nu)$-th entry $b_\nu(f_n)$; ii) $\mathbf{T} \in \mathbb{R}^{N_r\times 3}$ with $r$-th row $[1\ \ \mathbf{x}_r']$; and iii) $\mathbf{K} \in \mathbb{R}^{N_r\times N_r}$ with $(r,r')$-th entry $K(\|\mathbf{x}_r - \mathbf{x}_{r'}\|_2)$. Consider also the QR decompositions of $\mathbf{B}$ and $\mathbf{T}$.

Upon plugging (5) into (4), it is shown in Appendix B that the optimal coefficients $\{\boldsymbol{\beta}_\nu, \boldsymbol{\alpha}_{\nu 1}, \alpha_{\nu 0}\}$ satisfy the following system of equations

(6)
(7)
(8)

Matrix $\mathbf{K}$ is positive definite on the constraint subspace of Proposition 1; see e.g., [minka]. It thus follows that (6)-(7) admit a unique solution if and only if $\mathbf{B}'\mathbf{B}$ and $\mathbf{T}'\mathbf{T}$ are invertible (correspondingly, $\mathbf{B}$ and $\mathbf{T}$ have full column rank). These conditions place practical constraints that should be taken into account at the system design stage. Specifically, $\mathbf{T}$ has full column rank if and only if the points in $\{\mathbf{x}_r\}_{r=1}^{N_r}$, i.e., the CR locations, are not aligned. Furthermore, $\mathbf{B}$ will have linearly independent columns provided the basis functions comprise a linearly independent and complete set, i.e., one whose aggregate support covers the sensed band. Note that completeness precludes all frequencies from falling outside the aggregate support of the basis set, hence preventing undesired all-zero columns in $\mathbf{B}$.
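
A quick numerical check of these uniqueness conditions, with $\mathbf{B}$ and $\mathbf{T}$ defined as above, might look as follows; the example locations and basis matrix are arbitrary.

```python
import numpy as np

def uniqueness_checks(locations, B):
    """Check the full-column-rank conditions discussed above.

    locations : (Nr, 2) CR positions
    B         : (N, Nb) matrix with entries b_nu(f_n)
    """
    T = np.hstack([np.ones((locations.shape[0], 1)), locations])    # rows [1, x_r']
    t_ok = np.linalg.matrix_rank(T) == 3           # fails iff the CR locations are aligned
    b_ok = np.linalg.matrix_rank(B) == B.shape[1]  # fails if bases are linearly dependent
    zero_cols = np.where(~B.any(axis=0))[0]        # bases that vanish at every sensed frequency
    return t_ok, b_ok, zero_cols

locs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 3.0]])
B = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])   # N = 3 frequencies, Nb = 2 bases
print(uniqueness_checks(locs, B))
```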

Remark 2 (Practicality of uniqueness conditions). The condition on $\mathbf{T}$ does not introduce an actual limitation, as it can be easily satisfied in practice, especially when the CRs are randomly deployed. Likewise, the basis set is part of the system design, and can be chosen to satisfy the conditions on $\mathbf{B}$. Nonetheless, these conditions will be bypassed in Section IV by allowing for an overcomplete set of functions $\{b_\nu\}$.

The combined results in this section can be summarized in the following steps, which constitute the spline-based spectrum cartography algorithm for estimating $\Phi(\mathbf{x},f)$:

S1.

Given the data $\{\varphi_{rn}\}$, solve (6)-(8) for $\{\boldsymbol{\beta}_\nu, \boldsymbol{\alpha}_{\nu 1}, \alpha_{\nu 0}\}$, after selecting $\lambda$ as detailed in Appendix D.

S2.

Substitute $\{\boldsymbol{\beta}_\nu\}$ and $\{\boldsymbol{\alpha}_{\nu 1}, \alpha_{\nu 0}\}$ into (5) to obtain $\{\hat g_\nu\}$.

S3.

Use $\{\hat g_\nu\}$ in (1) to estimate $\hat\Phi(\mathbf{x},f)$.

III-B PSD tracker

The real-time requirements on the sensing radios, and the desirability of an estimator that adapts to changes in the spectrum map, motivate the PSD tracker introduced in this section. The spectrum map estimator will be henceforth denoted by $\hat\Phi_\tau(\mathbf{x},f)$, to make its time dependence explicit.

Define the vector $\hat{\boldsymbol{\phi}}_n(\tau)$ of periodogram samples taken at frequency $f_n$ and time-slot $\tau$ by all CRs, and form the supervector $\hat{\boldsymbol{\phi}}(\tau) := [\hat{\boldsymbol{\phi}}_1'(\tau),\ldots,\hat{\boldsymbol{\phi}}_N'(\tau)]'$. Per time-slot $\tau$, the periodogram is averaged using the following adaptive counterpart of (3):

$$\bar{\boldsymbol{\varphi}}(\tau) = \delta\,\bar{\boldsymbol{\varphi}}(\tau-1) + (1-\delta)\,\hat{\boldsymbol{\phi}}(\tau) \qquad (9)$$

which implements an exponentially weighted moving average operation with forgetting factor $\delta \in [0,1)$. For every $\tau$, the online estimator $\hat\Phi_\tau(\mathbf{x},f)$ is obtained by plugging in (1) the solution of (4), after replacing $\varphi_{rn}$ with the corresponding entries of the vector $\bar{\boldsymbol{\varphi}}(\tau)$ in (9). In addition to mitigating fading effects, this adaptive approach can track slowly time-varying PSDs, because the averaging in (9) exponentially discards past data.
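
The exponential averaging in (9) amounts to the simple recursion sketched below; the synthetic drifting PSD level is only meant to illustrate the tracking behavior, and the choice of forgetting factor is arbitrary.

```python
import numpy as np

def ewma_update(phi_bar, phi_new, delta):
    """One exponentially weighted moving-average step (cf. (9)).

    phi_bar : running average of periodogram samples (None on the first slot)
    phi_new : raw periodogram samples received in the current time-slot
    delta   : forgetting factor in [0, 1); larger values average over longer windows
    """
    if phi_bar is None:
        return phi_new.copy()
    return delta * phi_bar + (1.0 - delta) * phi_new

# Track a slowly drifting PSD level from noisy per-slot periodograms (synthetic).
rng = np.random.default_rng(2)
phi_bar, delta = None, 0.9
for tau in range(200):
    true_level = 1.0 + 0.005 * tau                        # slow variation to be tracked
    phi_new = true_level + rng.normal(scale=0.3, size=64)
    phi_bar = ewma_update(phi_bar, phi_new, delta)
print(phi_bar.mean())   # close to the final true level (about 2.0)
```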

Suppose that per time-slot $\tau$, the FC receives the raw periodogram samples $\hat{\boldsymbol{\phi}}(\tau)$ from the CRs in order to update $\hat\Phi_\tau(\mathbf{x},f)$. The results of Section III apply for every $\tau$, meaning that the estimates $\hat g_\nu$ are given by (5), while the optimum coefficients are found after solving (6)-(8). Capitalizing on (9), straightforward manipulations of (6)-(8) show that these coefficients are recursively given for all $\tau$ by

(10)
(11)

where the time-invariant matrices and are

Recursions (10)-(11) provide a means to update the coefficient estimates sequentially in time, by incorporating the newly acquired data from the CRs in $\hat{\boldsymbol{\phi}}(\tau)$. There is no need to separately update $\bar{\boldsymbol{\varphi}}(\tau)$ as in (9), yet the desired averaging takes place. Furthermore, the time-invariant matrices in (10)-(11) need to be computed only once, during the startup phase of the network.

IV Group-Lasso on Splines

An improved spline-based PSD estimator is developed in this section to fit the unknown spatial functions $g_\nu(\mathbf{x})$ in the model (1), with a large number of bases $N_b$ (possibly $N_b > N$), and a possibly overcomplete set of known basis functions $\{b_\nu\}_{\nu=1}^{N_b}$. These models are particularly attractive when there is an inherent uncertainty on the transmitters’ parameters, such as central frequency and bandwidth of the pulse shapers; or, e.g., the roll-off factor when raised-cosine pulses are employed. In particular, adaptive communication schemes rely on frequently adjusting these parameters [goldsmith, Ch. 9]. A sizeable collection of bases that effectively accommodates most of the possible cases provides the desirable robustness. Still, prior knowledge available on the incumbent communication technologies being sensed should be exploited to choose the most descriptive classes of basis functions; e.g., a large set of raised-cosine pulses. This knowledge justifies why known bases are selected to describe frequency characteristics of the PSD map, while a variational approach is preferred to capture spatial dependencies.

In this context, the envisioned estimation method should provide the CRs with the capability of selecting a few bases that “better explain” the actual transmitted signals. As a result, most functions $g_\nu$ are expected to be identically zero; hence, there is an inherent form of sparsity present that can be exploited to improve estimation. The rationale behind the proposed approach can be rooted in the basis pursuit principle, a term coined in [basis_pursuit] for finding the most parsimonious sparse signal expansion using an overcomplete basis set. A major differentiating aspect, however, is that while the sparse coefficients in the basis expansions treated in [basis_pursuit] are scalars, model (1) here entails bases weighted by functions $g_\nu(\mathbf{x})$.

The proposed approach to sparsity-aware spline-based field estimation from the space-frequency power spectrum measurements $\{\varphi_{rn}\}$ [cf. (3)] is to obtain $\{\hat g_\nu\}$ as

$$\{\hat g_\nu\} := \arg\min_{\{g_\nu \in \mathcal{S}\}} \sum_{r=1}^{N_r}\sum_{n=1}^{N}\Big[\varphi_{rn} - \sum_{\nu=1}^{N_b} b_\nu(f_n)\, g_\nu(\mathbf{x}_r)\Big]^2 + \lambda \sum_{\nu=1}^{N_b}\int_{\mathbb{R}^2} \big\|\nabla^2 g_\nu(\mathbf{x})\big\|_F^2\, d\mathbf{x} + \mu \sum_{\nu=1}^{N_b}\sqrt{\sum_{r=1}^{N_r} g_\nu^2(\mathbf{x}_r)}. \qquad (12)$$

Relative to (4), the cost here is augmented with an additional regularization term weighted by a tuning parameter $\mu \ge 0$. Clearly, if $\mu = 0$ then (12) boils down to (4). To appreciate the role of the new penalty term, note that minimizing the sum of $\ell_2$ norms of the pointwise values $[g_\nu(\mathbf{x}_1),\ldots,g_\nu(\mathbf{x}_{N_r})]$ intuitively shrinks all of these values to zero for sufficiently large $\mu$. Interestingly, it will be shown in the ensuing section that this is enough to guarantee that $\hat g_\nu \equiv 0$, for $\mu$ large enough.

IV-A Estimation using the group-Lasso

Consider the classical problem of linear regression; see, e.g., [elements_of_statistics, p. 11], where a vector $\mathbf{y}$ of observations is available, along with a matrix $\mathbf{X}$ of inputs. The group Lasso estimate for the vector of features $\boldsymbol{\zeta} := [\boldsymbol{\zeta}_1',\ldots,\boldsymbol{\zeta}_{N_b}']'$, grouped into factors $\boldsymbol{\zeta}_\nu$, is defined as the solution to [bakin_group_lasso, yuan_group_lasso]

$$\hat{\boldsymbol{\zeta}} := \arg\min_{\boldsymbol{\zeta}}\ \|\mathbf{y} - \mathbf{X}\boldsymbol{\zeta}\|_2^2 + \mu\sum_{\nu=1}^{N_b}\|\boldsymbol{\zeta}_\nu\|_2. \qquad (13)$$

This criterion achieves model selection by retaining relevant factors in which the features are grouped. In other words, group-Lasso encourages sparsity at the factor level, either by shrinking to zero all variables within a factor, or by retaining them altogether, depending on the value of the tuning parameter $\mu \ge 0$. As $\mu$ is increased, more sub-vector estimates $\hat{\boldsymbol{\zeta}}_\nu$ become zero, and the corresponding factors drop out of the model. It can be shown from the Karush-Kuhn-Tucker optimality conditions that only for $\mu \ge \mu_{\max} := 2\max_\nu\|\mathbf{X}_\nu'\mathbf{y}\|_2$ (with $\mathbf{X}_\nu$ collecting the columns of $\mathbf{X}$ associated with factor $\nu$) it holds that $\hat{\boldsymbol{\zeta}} = \mathbf{0}$, so that the values of interest are $\mu \in [0, \mu_{\max})$ [danio_GL].
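
Any group-Lasso solver can be used for (13); the sketch below implements a basic proximal-gradient (ISTA-type) iteration with block soft-thresholding, which is one standard option rather than the specific algorithm developed in this paper.

```python
import numpy as np

def block_soft(v, thr):
    """Block (vector) soft-thresholding: shrink v toward 0 by thr in l2 norm."""
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= thr else (1.0 - thr / nrm) * v

def group_lasso(y, X, groups, mu, n_iter=5000):
    """Proximal-gradient solver for  min_z ||y - X z||_2^2 + mu * sum_g ||z_g||_2  (cf. (13)).

    groups : list of index arrays, one per factor
    """
    z = np.zeros(X.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)     # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ z - y)
        w = z - step * grad
        for g in groups:                               # prox of the group penalty, block by block
            w[g] = block_soft(w[g], step * mu)
        z = w
    return z

# Synthetic test: 10 factors of size 3, only two of them active.
rng = np.random.default_rng(3)
groups = [np.arange(3 * k, 3 * (k + 1)) for k in range(10)]
X = rng.normal(size=(60, 30))
z_true = np.zeros(30); z_true[groups[1]] = 2.0; z_true[groups[7]] = -1.5
y = X @ z_true + 0.1 * rng.normal(size=60)
z_hat = group_lasso(y, X, groups, mu=20.0)
print([k for k, g in enumerate(groups) if np.linalg.norm(z_hat[g]) > 1e-6])  # expect [1, 7]
```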

The connection between (13) and the spline-based field estimator (12) builds on Proposition 1, which still holds in this context. That is, even though criteria (4) and (12) purposely differ, their respective solutions have the same form in (5). Indeed, the adaptation of the proof in Appendix A to the new case is straightforward, since the additional penalty term in (12) depends on $g_\nu$ only through its values at the knots $\{\mathbf{x}_r\}$. The essential difference manifested by this penalty is revealed when estimating the parameters $\{\boldsymbol{\beta}_\nu, \boldsymbol{\alpha}_{\nu 1}, \alpha_{\nu 0}\}$ in (5), as presented in the following proposition.

Proposition 2: The spline-based field estimator (12) is equivalent to group-Lasso (13), under the identities

(14)

with their respective solutions related by

(15)
(16)

where and .

The factors $\boldsymbol{\zeta}_\nu$ in (13) are in one-to-one correspondence with the spline coefficient vectors in (5) through the linear mapping (16). This implies that whenever a factor $\hat{\boldsymbol{\zeta}}_\nu$ is dropped from the linear regression model obtained after solving (13), then $\hat g_\nu \equiv 0$, and the term corresponding to $b_\nu$ does not contribute to (1). Hence, by appropriately selecting the value of $\mu$, criterion (12) has the potential of retaining only the most significant terms in the expansion (1), and thus yields parsimonious PSD map estimates. All in all, the motivation behind the variational problem (12) is now unravelled. The additional penalty term not present in (4) renders (12) equivalent to a group-Lasso problem. This enforces sparsity in the parameters of the spline expansions for $\{\hat g_\nu\}$ at the factor level, which is exactly what is needed to potentially null the less descriptive functions.

Remark 3 (Comparison with the PSD map estimator in Section III). The sparsity-agnostic LS problem (4) will not give rise to identically zero functions $\hat g_\nu$, for any value of $\lambda$. Even when $N_b$ is not large, a sparsity-aware estimator will perform better if the underlying PSD is generated by a few basis functions. This is expected since the out-of-band residual error will increase when all basis functions enter the model (1); see also [bazerque] for a related assessment. What is more, when the number of bases is sufficiently large ($N_b > N$), matrix $\mathbf{B}$ is fat, and the approach in Section III is not applicable. On the other hand, it is admittedly more complex computationally to solve (13) than the system of linear equations (6)-(8). Because (12) is not a linear smoother, a leave-one-out (bi-) CV approach to select the tuning parameters $\lambda$ and $\mu$ does not enjoy the computational savings detailed in Appendix D. $K$-fold CV can be utilized instead, with typical choices of $K = 5$ or $K = 10$, as suggested in [elements_of_statistics, p. 242].
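
A $K$-fold CV loop over a grid of $(\lambda, \mu)$ values can be organized as sketched below; the `ridge_fit`/`ridge_predict` pair is only a runnable stand-in, and in practice the group-Lassoed splines estimator of this section would be plugged in instead.

```python
import numpy as np
from itertools import product

def kfold_cv(y, X, fit, predict, lambdas, mus, K=5, seed=0):
    """K-fold cross-validation over a grid of (lambda, mu) pairs.

    fit(y_train, X_train, lam, mu) -> model;  predict(model, X_test) -> y_hat.
    Returns the (lambda, mu) pair with the smallest average held-out squared error.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    scores = {}
    for lam, mu in product(lambdas, mus):
        err = 0.0
        for k in range(K):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            model = fit(y[train], X[train], lam, mu)
            err += np.mean((y[test] - predict(model, X[test])) ** 2)
        scores[(lam, mu)] = err / K
    return min(scores, key=scores.get), scores

# Stand-in estimator: ridge regression ignoring mu (a real run would call the
# group-Lassoed splines estimator of Section IV here instead).
def ridge_fit(y, X, lam, mu):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def ridge_predict(w, X):
    return X @ w

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10)); y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=100)
best, _ = kfold_cv(y, X, ridge_fit, ridge_predict, lambdas=[0.01, 0.1, 1.0], mus=[0.0])
print(best)
```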

The group-Lassoed splines-based approach to spectrum cartography developed in this section can be summarized in the following steps to estimate the global PSD map $\Phi(\mathbf{x},f)$:

S1.

Given the data $\{\varphi_{rn}\}$ and utilizing any group-Lasso solver, obtain $\hat{\boldsymbol{\zeta}}$ by solving (13).

S2.

Form the spline coefficient estimates using the change of variables (16), for $\nu = 1, \ldots, N_b$.

S3.

Substitute the coefficient estimates into (15) to obtain $\{\hat g_\nu\}$.

S4.

Use $\{\hat g_\nu\}$ in (1) to estimate $\hat\Phi(\mathbf{x},f)$.

Implementing S1-S4 presumes that CRs communicate their local PSD estimates to a fusion center, which uses their aggregation to estimate the field. But what if an FC is not available for centrally running S1-S4? In certain cases, forgoing an FC is reasonable when the designer wishes to avoid an isolated point of failure, or aims at a network topology which scales well with an increasing number of CRs based on power considerations (CRs located far away from the FC will drain their batteries more to reach the FC). These reasons motivate well a fully distributed counterpart of S1-S4, which is pursued next.

V Distributed Group-Lasso for In-Network Spectrum Cartography

Consider $N_r$ networked CRs that are capable of sensing the ambient RF spectrum, performing some local computations, as well as exchanging messages among neighbors via dedicated control channels. In lieu of a fusion center, the CR network is naturally modeled as an undirected graph $\mathcal{G}(\mathcal{V},\mathcal{E})$, where the vertex set $\mathcal{V} := \{1,\ldots,N_r\}$ corresponds to the sensing radios, and the edges in $\mathcal{E}$ represent pairs of CRs that can communicate. Radio $r$ communicates with its single-hop neighbors in $\mathcal{N}_r$, and the size of the neighborhood is denoted by $|\mathcal{N}_r|$. The locations of the sensing radios are assumed known to the CR network. To ensure that the measured data from an arbitrary CR can eventually percolate throughout the entire network, it is assumed that the graph $\mathcal{G}$ is connected; i.e., there exists a (possibly) multi-hop communication path connecting any two CRs.
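
Connectivity of the CR communication graph can be verified with a breadth-first search, as in the following sketch; the example edge set is arbitrary.

```python
from collections import deque

def is_connected(num_nodes, edges):
    """Breadth-first search check that the undirected graph is connected."""
    adj = {r: set() for r in range(num_nodes)}
    for a, b in edges:
        adj[a].add(b); adj[b].add(a)
    seen, queue = {0}, deque([0])
    while queue:
        r = queue.popleft()
        for nb in adj[r]:
            if nb not in seen:
                seen.add(nb); queue.append(nb)
    return len(seen) == num_nodes

# Example: 5 CRs in a chain plus one shortcut edge (arbitrary topology).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 3)]
print(is_connected(5, edges))   # True: measurements can percolate network-wide
```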

For the purpose of estimating an unknown vector $\boldsymbol{\zeta}$, each radio $r \in \mathcal{V}$ has available a local vector of observations $\mathbf{y}_r$ as well as its own matrix of inputs $\mathbf{X}_r$. Radios collaborate to form the wanted group-Lasso estimator (13) in a distributed fashion, using

$$\hat{\boldsymbol{\zeta}} := \arg\min_{\boldsymbol{\zeta}}\ \sum_{r=1}^{N_r}\|\mathbf{y}_r - \mathbf{X}_r\boldsymbol{\zeta}\|_2^2 + \mu\sum_{\nu=1}^{N_b}\|\boldsymbol{\zeta}_\nu\|_2 \qquad (17)$$

where $\mathbf{y} := [\mathbf{y}_1',\ldots,\mathbf{y}_{N_r}']'$ and $\mathbf{X} := [\mathbf{X}_1',\ldots,\mathbf{X}_{N_r}']'$. The motivation behind developing a distributed solver of (17) is to tackle (12) based on in-network processing of the local observations available per radio [cf. (3)]. Indeed, it readily follows that (17) can be used instead of (13) in Proposition 2 when the local data and regression matrices are identified with the per-CR blocks of $\mathbf{y}$ and $\mathbf{X}$ in (14). Note that because the locations $\{\mathbf{x}_r\}$ are assumed known to the entire network, CR $r$ can form the matrices $\mathbf{B}$, $\mathbf{K}$, and $\mathbf{T}$, and thus the local regression matrix $\mathbf{X}_r$.

V-A Consensus-based reformulation of the group-Lasso

To distribute the cost in (17), replace the global variable $\boldsymbol{\zeta}$, which couples the per-agent summands, with local variables $\{\boldsymbol{\zeta}_r\}_{r=1}^{N_r}$ representing candidate estimates of $\boldsymbol{\zeta}$ per sensing radio. It is now possible to reformulate (17) as a convex constrained minimization problem

(18)

The equality constraints directly effect local agreement across each CR’s neighborhood. Since the communication graph is assumed connected, these constraints also ensure global consensus a fortiori, meaning that $\boldsymbol{\zeta}_1 = \cdots = \boldsymbol{\zeta}_{N_r}$ at the optimum. Indeed, let $\{r_1, r_2, \ldots, r_m\}$ denote a path on $\mathcal{G}$ that joins an arbitrary pair of CRs $(r_1, r_m)$. Because contiguous radios in the path are neighbors by definition, the corresponding chain of equalities dictated by the constraints in (18) implies $\boldsymbol{\zeta}_{r_1} = \boldsymbol{\zeta}_{r_m}$, as desired. Thus, the constraints can be eliminated by replacing all the $\{\boldsymbol{\zeta}_r\}$ with a common $\boldsymbol{\zeta}$, in which case the cost in (18) reduces to the one in (17). This argument establishes the following result.

Lemma 1: If $\mathcal{G}$ is a connected graph, then (17) and (18) are equivalent optimization problems, in the sense that the minimizers of (18) all coincide with $\hat{\boldsymbol{\zeta}}$ in (17).

Problem (18) will be modified further for the purpose of reducing the computational complexity of the resulting algorithm. To this end, for a given vector, consider the problem

(19)

and notice that it is separable in the subproblems

(20)

Interestingly, each of these subproblems admits a closed-form solution as given in the following lemma.

Lemma 2: The minimizer of (20) is obtained via the vector soft-thresholding operator defined by

(21)

where .

Problem (19) is an instance of the group-Lasso (13) obtained for a suitable choice of the regression matrix and grouping. As such, result (21) can be viewed as a particular case of the operators in [puig_hero] and [wright_novak_figueredo]. However, it is worthwhile to prove Lemma 2 directly, since in this case the special form of (20) renders the proof neat in its simplicity.

Proof: It will be argued that the solver of (20) takes the form of a nonnegative scaling of the vector being thresholded. This is because, among all vectors with the same $\ell_2$-norm, the Cauchy-Schwarz inequality implies that the one maximizing the relevant inner product is collinear with (and in the same direction as) that vector. Substituting this scaled form into (20) renders the problem scalar in the scaling factor, whose closed-form solution yields (21), which completes the proof.
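
In code, the vector soft-thresholding operator of Lemma 2 can be sketched as follows; the normalization of the threshold may differ from (20)-(21) by constant factors depending on how the quadratic term is scaled.

```python
import numpy as np

def vector_soft_threshold(v, mu):
    """Vector soft-thresholding: scale v toward the origin, and set it to zero
    once its l2 norm drops below the threshold mu (cf. Lemma 2)."""
    nrm = np.linalg.norm(v)
    if nrm <= mu:
        return np.zeros_like(v)
    return (1.0 - mu / nrm) * v

v = np.array([3.0, 4.0])                    # ||v|| = 5
print(vector_soft_threshold(v, 2.0))        # shrunk to norm 3, same direction
print(vector_soft_threshold(v, 6.0))        # thresholded to the all-zero vector
```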

In order to take advantage of Lemma 2, auxiliary variables are introduced as copies of the local variables $\{\boldsymbol{\zeta}_r\}$. Upon introducing appropriate constraints that guarantee the equivalence of the formulations along the lines of Lemma 1, problem (18) can be recast as

(22)

The dummy variables are inserted for technical reasons that will become apparent in the ensuing section, and will be eventually eliminated.

V-B Distributed group-Lasso algorithm

The distributed group-Lasso algorithm is constructed by optimizing (22) using the alternating direction method of multipliers (AD-MoM) [Bertsekas_Book_Distr]. In this direction, associate Lagrange multipliers with the consensus and copy constraints in (22), respectively, and consider the augmented Lagrangian with a positive penalty parameter

(23)

where for notational convenience we group the primal variables and the multipliers into single stacked vectors.

Application of the AD-MoM to the problem at hand consists of a cycle of minimizations of (23) in a block-coordinate fashion, first w.r.t. one group of primal variables and then w.r.t. the other, together with an update of the multipliers per iteration. Deferring the details to Appendix E, the four main properties of this procedure that are instrumental to the resulting algorithm can be highlighted as follows.

  1. Thanks to the introduction of the local copies $\{\boldsymbol{\zeta}_r\}$ and the dummy variables, the minimizations of (23) w.r.t. both groups of primal variables decouple per CR $r$, thus enabling distribution of the algorithm. Moreover, the constraints in (22) involve variables of neighboring CRs only, which allows the required communications to be local within each CR’s neighborhood.

  2. Introduction of the auxiliary copies separates the quadratic cost from the group-Lasso penalty. As a result, minimization of (23) w.r.t. these copies takes the form of (19), which admits a closed-form solution via the vector soft-thresholding operator in Lemma 2.

  3. Minimization of (23) w.r.t. the remaining primal variables consists of an unconstrained quadratic problem, which can also be solved in closed form. In particular, the optimal dummy variables at each iteration take a closed-form value, and thus can be eliminated.

  4. It turns out that it is not necessary to carry out updates of the Lagrange multipliers separately, but only of their sums. Hence, there is a single price vector per CR $r$, which can be updated locally.

Building on these four features, it is established in Appendix E that the proposed AD-MoM scheme boils down to four parallel recursions run locally per CR:

(24)
(25)