Sparsity-Promoting Sensor Selection for Non-linear Measurement Models

# Sparsity-Promoting Sensor Selection for Non-linear Measurement Models

Sundeep Prabhakar Chepuri,  and Geert Leus,  This work was supported in part by STW under the FASTCOM project (10551) and in part by NWO-STW under the VICI program (10382).All the authors are with the Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, The Netherlands. Email: {s.p.chepuri;g.j.t.leus}@tudelft.nl.A conference precursor of this manuscript has been published in [Eusipco13chepuri].
###### Abstract

Sensor selection is an important design problem in large-scale sensor networks. Sensor selection can be interpreted as the problem of selecting the best subset of sensors that guarantees a certain estimation performance. We focus on observations that are related to a general non-linear model. The proposed framework is valid as long as the observations are independent, and its likelihood satisfies the regularity conditions. We use several functions of the Cramér-Rao bound (CRB) as a performance measure. We formulate the sensor selection problem as the design of a selection vector, which in its original form is a nonconvex \ell_{0}-(quasi) norm optimization problem. We present relaxed sensor selection solvers that can be efficiently solved in polynomial time. We also propose a projected subgradient algorithm that is attractive for large-scale problems and also show how the algorithm can be easily distributed. The proposed framework is illustrated with a number of examples related to sensor placement design for localization.

{keywords}

Sensor selection, sensor placement, Cramér-Rao bound, selection vector, sparsity, non-linear models, statistical inference, projected subgradient algorithm, convex optimization, sensor networks.

## I Introduction

Advances in sensor technology have enabled a large spectrum of applications and services related to safety and security, surveillance, environmental and climate monitoring, to list a few. The sensor nodes are spatially deployed and operate as a network, with each sensor node capable of sensing, processing, and communicating to other nodes or a central processing unit. As a network, their fundamental task is distributed data sampling (i.e., to sense the environment) from which we seek to extract relevant information. The sensors provide a prohibitively large dataset which is usually gathered at a fusion center. This gathered data has to be optimally processed, rejecting the redundant, identical, or faulty measurements.

Sensor selection is a fundamental design task in sensor networks. The number of sensors are often limited either by economical constraints (hardware costs), or the availability of physical or storage space. In order to reduce the hardware costs, as well as the resulting communications and processing overhead, one would like to smartly deploy the sensors. Sensor selection also enables the design of spatio-temporal sensing patterns that guarantee a certain performance measure such as energy-efficiency, information measure, estimation accuracy, or detection probability. The sensor placement problem can also be interpreted as a sensor selection problem in which the best subset of the available sensor locations are selected subject to a specific performance constraint. Sensor selection is pertinent to various diverse fields, especially to applications dealing with large-scale networks like network monitoring [NetworkCartography, giannakis2013monitoring], location-aware services like target localization and tracking [Gusta05SPM, GinnakisLoc, localizationSPM], field estimation [ToonField, MouraField], and environment monitoring in general. The fundamental questions of interest are:

1. Where to deploy the limited sensors available?

2. Do we need to process all the acquired measurements?

To this end, we focus on processing only the most informative sensors for a general non-linear statistical inference problem.

### I-A Related prior works

A large volume of literature exists on sensor selection [sensel1, and references therein]. The sensor selection problem is often formulated as an optimization problem based on some well-known performance measures from experimental design [pukelsheim1993optimal, sensel1][Boyd, Pg. 384]. The sensor selection problem is expressed as the following optimization problem:

where {\bf w} is a selection vector of length M, and f({\bf E}({\bf w})) is a scalar cost function related to the mean squared error (MSE) covariance matrix {\bf E}. The MSE covariance matrix is optimized to select the best subset of K sensors out of M available sensors such that K\ll M. Different functions f({\bf E}({\bf w})) can be used, and the typical choices for f({\bf E}({\bf w})) are related to:

1.

A-optimality: minimizes the sum of eigenvalues of {\bf E} with f({\bf w}):={\rm tr}\{{\bf E}({\bf w})\}.

2.

E-optimality: minimizes the maximum eigenvalue of {\bf E} with f({\bf w}):=\lambda_{\max}\{{\bf E}({\bf w})\}.

3.

D-optimality: minimizes the determinant of {\bf E} with f({\bf w}):=\ln{\rm det}\{{\bf E}({\bf w})\}.

This is a combinatorial optimization problem involving \binom{K}{M} searches, and it is clearly intractable even for small-scale problems with K=10 and M=100. To simplify this problem, the nonconvex Boolean constraint {\bf w}\in\{0,1\}^{M} is relaxed to a convex box constraint {\bf w}\in[0,1]^{M}. The relaxed optimization problem has been studied in [sensel1] for additive Gaussian linear models, where the matrix {\bf E} is available in closed form, and more importantly, where the above listed performance measures are independent of the unknown parameter. Moreover, in practice, the exact number of sensors {K} to select might not be known. However, this number K can always be tuned to achieve a desired performance.

The above selection problem is applied to sensor placement for power grid monitoring in [PMUplacementGG2012]. Alternative approaches exploiting the submodularity of the objective function [krause2008submodularity, krause2007near, krause2008robust], heuristics based on genetic algorithms [geneticselection], and greedy algorithms [shamaiah2010greedy] are also proposed to solve the sensor selection problem. Sensor selection for dynamical systems often referred to as sensor polling or scheduling, is studied in [varshneySPL, Carmischeduling, GerrySensormanagement2012]. In [vertterlisensorplacement], the sensor placement problem for linear models is addressed as the design of a sensing matrix that optimizes a measure related to the orthogonality of its rows. All the above literature (in general) deals with measurements that are related to additive Gaussian linear models. Experimental design for non-linear models within the Bayesian and sequential design frameworks is discussed in [ExptdesignNon]. In [varshneySPL], sensor selection for target tracking based on extended Kalman filtering (EKF) has been proposed, in which the selection is performed by designing an appropriate gain matrix. Although a non-linear measurement model in additive Gaussian noise is used in [varshneySPL], the past state estimate (not the true state) is used to compute the error covariance matrix leading to a suboptimal solution. Sensor selection for detection problems is studied in [SenselDetection]. In [kekatos2011sparse], reliable sensor selection based on the actual measurements to identify the outliers is presented. A different problem, yet related to sensor selection, is the problem of identifying source-informative sensors, which is studied in [Schizas2013TSP].

### I-B Contributions

The sensor selection problem can be interpreted as the problem to select the best sensors out of M available sensors. The selected sensors are deemed as the best subset of sensors if they guarantee a certain specified estimation accuracy. We consider general scenarios where the measurements of the unknown parameter follow a non-linear model (unlike [sensel1] for instance). Non-linear measurement models are frequently encountered in applications like source localization, tracking, field estimation, or phase retrieval, to list a few. The error covariance matrix for non-linear models is not always available in closed form, and more importantly it depends on the unknown parameter. Our first contribution in the context of sensor selection is to use the Cramér-Rao bound (CRB) as a performance measure. The CRB is a rigorous performance measure for optimality, and it generalizes very well for non-linear measurement models (not necessarily in additive Gaussian noise). Moreover, we do not need the actual measurements, and hence, our framework is also well-suited for solving offline design problems. In addition to this, the number of sensors that have to be selected, i.e., K, is generally not known in practice. Hence, instead of fixing K as in (1), we pose sensor selection as a cardinality minimization problem that provides the number of selected sensors as a byproduct. In order to do this, we use different thresholds that specify the required accuracy.

The proposed sensor selection framework is very generic and can be applied to any non-linear statistical inference problem (linear being a special case). The selection problem is formulated as the design of a selection vector which is an \ell_{0}-(quasi) norm nonconvex Boolean optimization problem. It requires a brute-force evaluation over all the 2^{M} choices. For example, with M=100 available potential sensors, there are in the order of 10^{30} possible choices whose direct enumeration is clearly impossible. The nonconvex sensor selection problem is relaxed using standard convex relaxation techniques which can then be efficiently solved in polynomial time.

To cope with large-scale problems, we further present a projected subgradient algorithm. It is worth mentioning that the projected subgradient algorithm allows a very easy distributed implementation.

A sparsity-enhancing concave surrogate for the \ell_{0}-(quasi) norm is also proposed for sensor selection as an alternative to the traditional best convex relaxation. This is particularly advantageous when there are multiple (nearly) identical sensor measurements. We illustrate the sensor selection problem using examples of sensor placement for source localization.

### I-C Outline and notations

The remainder of the paper is organized as follows. In Section II, we present the non-linear measurement model. In Section III, we show the problem formulation, and we present the algorithms that solve the relaxed optimization problem in Section IV. In Section V, we derive the dual problem, and provide some extensions. In Section LABEL:sec:example, the proposed framework is applied to a number of different models related to sensor selection for localization. The paper finally concludes with Section LABEL:sec:conclusion.

The notations used in this paper can be described as follows. Upper (lower) bold face letters are used for matrices (column vectors). (\cdot)^{T} denotes transposition. \mathrm{diag}(\cdot) refers to a block diagonal matrix with the elements in its argument on the main diagonal. \mathbf{1}_{N} (\mathbf{0}_{N}) denotes the N\times 1 vector of ones (zeros). \mathbf{I}_{N} is an identity matrix of size N. \mathbb{E}\{\cdot\} denotes the expectation operation. {\rm tr}\{\cdot\} is the matrix trace operator. {\rm det}\{\cdot\} is the matrix determinant. \lambda_{\rm min}\{{\bf A}\} (\lambda_{\rm max}\{{\bf A}\}) denotes the minimum (maximum) eigenvalue of a symmetric matrix {\bf A}. {\bf A}\succeq{\bf B} means that {\bf A}-{\bf B} is a positive semidefinite matrix. \mathbb{S}^{N} (\mathbb{S}^{N}_{+}) denotes the set of symmetric (symmetric positive semi-definite) matrices of size N\times N. |\mathcal{U}| denotes the cardinality of the set \mathcal{U}.

## II Non-linear measurement model

In this paper, we consider a generic non-linear measurement model

 {y}_{m}={h}_{m}({\boldsymbol{\theta}},{n}_{m}),\,m=1,2,\ldots,M, (2)

where {y}_{m} is the mth spatial or temporal sensor measurement, {\boldsymbol{\theta}}\in\mathbb{R}^{N} is the unknown parameter, {n}_{m} for m=1,2,\ldots,M, is the noise process, and the regressors h_{m} for m=1,2,\ldots,M, are (in general) non-linear functionals. Let the vector {\bf y}=[y_{1},y_{2},\ldots,y_{M}]^{T}\in\mathbb{R}^{M} collect the measurements. The likelihood of the measurements p({\bf y};{\boldsymbol{\theta}}) is the probability density function (pdf) of {\bf y} parameterized by the unknown vector {\boldsymbol{\theta}}.

We make the following assumptions:

a1.

Regularity conditions: The log-likelihood of the measurements satisfies the regularity condition \mathbb{E}\{\frac{\partial\ln p({\bf y};{\boldsymbol{\theta}})}{\partial{% \boldsymbol{\theta}}}\}=0. This is a well-known condition for the CRB to exist [SKayestimation].

a2.

Independent observations: The measurements y_{m} for m=1,2,\ldots,M, are a sequence of independent random variables.

The proposed framework for sensor selection is valid as long as the above two assumptions hold.

Assuming (a1) holds, the covariance of any unbiased estimate \hat{\boldsymbol{\theta}}\in\mathbb{R}^{N} of the unknown parameter satisfies the well-known inequality [SKayestimation]

 \mathbb{E}\{({\boldsymbol{\theta}}-\hat{\boldsymbol{\theta}})({\boldsymbol{% \theta}}-\hat{\boldsymbol{\theta}})^{T}\}\geq{\bf C}({\boldsymbol{\theta}})={% \bf F}^{-1}({\boldsymbol{\theta}}),

where the Fisher information matrix (FIM) is given by

 {\bf F}({\boldsymbol{\theta}})=\mathbb{E}\left\{\left(\frac{\partial\ln p({\bf y% };{\boldsymbol{\theta}})}{\partial{\boldsymbol{\theta}}}\right)\left(\frac{% \partial\ln p({\bf y};{\boldsymbol{\theta}})}{\partial{\boldsymbol{\theta}}}% \right)^{T}\right\}\in\mathbb{R}^{N\times N},

and {\bf C}({\boldsymbol{\theta}}) is the CRB matrix. An important property of the Fisher information is that it is additive for independent observations, which follows from the fact that

 \ln p({\bf y};{\boldsymbol{\theta}})=\ln\prod_{m=1}^{M}p(y_{m};{\boldsymbol{% \theta}})=\sum_{m=1}^{M}\ln p(y_{m};{\boldsymbol{\theta}}), (3)

where we assume that condition (a2) holds. Using (3), the FIM {\bf F}({\boldsymbol{\theta}}) can be alternatively expressed as

 \displaystyle{\bf F}({\boldsymbol{\theta}}) \displaystyle=\sum_{m=1}^{M}\mathbb{E}\left\{\left(\frac{\partial\ln p(y_{m};{% \boldsymbol{\theta}})}{\partial{\boldsymbol{\theta}}}\right)\left(\frac{% \partial\ln p(y_{m};{\boldsymbol{\theta}})}{\partial{\boldsymbol{\theta}}}% \right)^{T}\right\}

which can be further simplified to

 {\bf F}({\boldsymbol{\theta}})=\sum_{m=1}^{M}{\bf F}_{m}({\boldsymbol{\theta}}), (4)

where

 {\bf F}_{m}({\boldsymbol{\theta}})=\mathbb{E}\left\{\left(\frac{\partial\ln p(% y_{m};{\boldsymbol{\theta}})}{\partial{\boldsymbol{\theta}}}\right)\left(\frac% {\partial\ln p(y_{m};{\boldsymbol{\theta}})}{\partial{\boldsymbol{\theta}}}% \right)^{T}\right\} (5)

is the N\times N FIM of the mth measurement. In other words, (4) means that every independent measurement contributes to the information measure. Note that the FIM for non-linear models depends on the unknown vector {\boldsymbol{\theta}}.

Assume for instance that the observations belong to the family of exponential distributions. The log-likelihood of the observations can then be expressed in the form

 \ln\,p({y}_{m};{\boldsymbol{\theta}})=\ln r({y}_{m})+a_{m}({\boldsymbol{\theta% }})b({y}_{m})-c({\boldsymbol{\theta}}), (6)

where r({y}_{m}) and b({y}_{m}) are known functions of the observations only, while a_{m}({\boldsymbol{\theta}}) and c(\boldsymbol{\theta}) depend only on the unknown parameter. The regularity conditions in general hold for observations that belong to the family of exponential pdfs, and it already includes a large number of distributions.

One specific example that often occurs in practice is the case where the observations y_{m},m=1,2,\ldots,M, are related through the following additive Gaussian non-linear model

 {y}_{m}={h}_{m}({\boldsymbol{\theta}})+{n}_{m},\,m=1,2,\ldots,M, (7)

where {h}_{m}(\cdot) is a non-linear function, and {n}_{m} is a zero-mean Gaussian random variable with variance \sigma_{m}^{2}. The log-likelihood of y_{m} is then given by (6) with

 \displaystyle r({y}_{m}) \displaystyle=\frac{1}{\sqrt{2\pi\sigma_{m}^{2}}}\exp(-\frac{1}{2\sigma_{m}^{2% }}y_{m}^{2}), \displaystyle b(y_{m}) \displaystyle=y_{m}/\sigma_{m}^{2}, \displaystyle a_{m}({\boldsymbol{\theta}}) \displaystyle=h_{m}({\boldsymbol{\theta}}), \displaystyle\text{and}\quad c({\boldsymbol{\theta}}) \displaystyle=\frac{1}{2\sigma_{m}^{2}}h_{m}^{2}({\boldsymbol{\theta}}).

Assuming (a2) holds, it is then easy to verify that (5) simplifies to

 {\bf F}_{m}({\boldsymbol{\theta}})=\frac{1}{\sigma^{2}_{m}}\left(\frac{% \partial h_{m}({\boldsymbol{\theta}})}{\partial{\boldsymbol{\theta}}}\right)% \left(\frac{\partial h_{m}({\boldsymbol{\theta}})}{\partial{\boldsymbol{\theta% }}}\right)^{T}.
###### Remark 1 (Additive Gaussian linear model).

As a special case, when the measurement process is linear, we have y_{m}={\bf h}_{m}^{T}{\boldsymbol{\theta}}+n_{m},m=1,2,\ldots,M, i.e., h_{m}({\boldsymbol{\theta}},n_{m}):={\bf h}_{m}^{T}{\boldsymbol{\theta}}+n_{m} with {\bf h}_{m}\in\mathbb{R}^{N} being the regressor. The computation of the FIM for a linear model is straightforward, and is given by

 {\bf F}=\sum_{m=1}^{M}\frac{1}{\sigma^{2}_{m}}{\bf h}_{m}{\bf h}_{m}^{T}.

The CRB for linear models in additive Gaussian noise is also the MSE, and more importantly it is independent of the unknown vector.

## III Problem formulation

Our goal is now to select the best subset (\geq N) of the available M sensor measurements such that a certain accuracy on the estimate \hat{\boldsymbol{\theta}} is guaranteed. We next mathematically formulate this sensor selection problem.

### III-A Sensor selection

In order to select the sensors, we introduce a selection vector

 {\bf w}=[w_{1},w_{2},\ldots,w_{M}]^{T}\in\{0,1\}^{M},

where w_{m}=1(0) indicates that the mth sensor measurement is (not) selected. The measurement model including the virtual hard selection parameter can be visualized as

where the selection vector modifies the log-likelihood of the measurements as \ln\prod_{m=1}^{M}p(y_{m};{\boldsymbol{\theta}})^{w_{m}}=\sum_{m=1}^{M}w_{m}% \ln p(y_{m};{\boldsymbol{\theta}}). The corresponding FIM matrix in (4) can then be expressed as

 \displaystyle{\bf F}({\bf w},{\boldsymbol{\theta}}) \displaystyle=\sum_{m=1}^{M}w_{m}{\bf F}_{m}({\boldsymbol{\theta}}). (9)

### III-B Performance measures

We do not restrict ourselves to any specific estimator, however, we use the CRB as a performance measure. The motivation behind using the CRB is as follows:

• The CRB is a measure for the (local) identifiability of the problem [rothenberg1971identification]. More specifically, a non-singular FIM implies (local) solvability and a unique estimate of {\boldsymbol{\theta}}, however, the converse is not necessarily true. The sensor selection problem presented in this paper seeks a subset of sensors for which the FIM has full rank in some domain such that the solvability of the problem in that domain is always ensured.

• Typically, the subset of selected sensors that yields a lower CRB also yields a lower MSE, and thus improves the performance of any practical system.

The CRB also has a very attractive mathematical structure resulting in a selection problem that can be efficiently solved using convex optimization techniques.

We next use the consistency assumption of the estimator to derive thresholds for the performance measures. We constrain the estimation error {\boldsymbol{\varepsilon}}=\hat{\boldsymbol{\theta}}-{\boldsymbol{\theta}} to be within an origin-centered circle of radius R_{e} with a probability higher than P_{e}, i.e.,

 \mathrm{Pr}({\|\boldsymbol{\varepsilon}\|}_{2}\leq R_{e})\geq P_{e}, (10)

where \mathrm{Pr}(\cdot) denotes probability, and the values of R_{e} and P_{e} define the accuracy required and are assumed to be known. A higher accuracy level is obtained by reducing R_{e} and/or increasing P_{e}. This metric is used in several occasions as an accuracy measure (e.g., see [cover2012elements, Gusta05SPM, Tao09TSP]). We next discuss two popular performance measures that satisfy the above requirement.

#### III-B1 Trace constraint

A sufficient condition to satisfy the accuracy requirement in (10) is (see Appendix LABEL:app:RSPconstraints)

 {\rm tr}\{{\bf C}({\bf w},{\boldsymbol{\theta}})\}={\rm tr}\{(\sum_{i=1}^{M}w_% {m}{\bf F}_{m}(\boldsymbol{\theta}))^{-1}\}\leq\lambda_{\rm tr}=(1-P_{e})R_{e}% ^{2}.

This measure is related to the A-optimality.

#### III-B2 Minimum eigenvalue constraint

Another popular sufficient condition that also satisfies the accuracy requirement in (10) is

 \lambda_{\rm min}\{{\bf F}({\bf w},{\boldsymbol{\theta}})\}\geq\lambda_{\rm eig% }=\frac{N}{R_{e}^{2}}\left(\frac{1}{1-P_{e}}\right),

where \lambda_{\rm eig} is derived in [Tao09TSP] (see also Appendix LABEL:app:RSPconstraints). This measure is related to the E-optimality. The inequality constraint \lambda_{\rm min}\{{\bf F}\}\geq\lambda_{\rm eig} can be equivalently expressed as the following linear matrix inequality (LMI):

 \sum_{m=1}^{M}w_{m}{\bf F}_{m}({\boldsymbol{\theta}})-\lambda_{\rm eig}{\bf I}% _{N}\succeq{\bf 0}_{N}. (11)

In other words, we put a lower bound on each eigenvalue of the matrix {\bf F}. The solution set of {\bf w} satisfying this LMI is convex as {\bf F}_{m}({\boldsymbol{\theta}})\in\mathbb{S}^{N},m=1,2,\ldots,M and \lambda_{\rm eig}{\bf I}_{N}\in\mathbb{S}^{N} [Boyd, Pg. 38].

The trace constraint has a larger feasible set as compared to the minimum eigenvalue constraint. However, although the trace constraint is a sufficient condition, the resulting sensor selection problem is computationally less attractive compared to the minimum eigenvalue constraint (as we show later). Moreover, LMIs can be used to also represent the trace constraint. For these reasons, we focus on the minimum eigenvalue (LMI) constraints from now on. However, without loss of generality (w.l.o.g.) either one of the two performance constraints can be used.

The above performance measures depend on the unknown parameter. In practice, the unknown parameter {\boldsymbol{\theta}} has a physical meaning and takes values within a certain domain denoted by \mathcal{U}. For example, in the case of direction-of-arrival estimation, \mathcal{U} is the sector where the source is expected or for target localization it is the surveillance area where the target resides. Since the FIM for non-linear models depends on the unknown {\boldsymbol{\theta}}, we propose to constrain every point within the domain \mathcal{U}.

###### Remark 2 (Bayesian CRB constraint).

In a Bayesian setting, when prior information of the unknown parameter {\boldsymbol{\theta}} is available, this additional knowledge typically yields a lower CRB, and the related information matrix is often called the Bayesian information matrix (BIM). The BIM is given by {\bf F}_{\rm B}({\bf w},{\boldsymbol{\theta}})={\bf F}({\bf w},{\boldsymbol{% \theta}})+{\bf J}_{\rm p}, where {\bf J}_{\rm p} is some prior information matrix {\bf J}_{\rm p}=-\mathbb{E}_{\boldsymbol{\theta}}\left\{\frac{\partial}{% \partial{\boldsymbol{\theta}}}\left(\frac{\ln p({\boldsymbol{\theta}})}{% \partial{\boldsymbol{\theta}}}\right)^{T}\right\} with the (log) prior \ln p({\boldsymbol{\theta}}). The LMI constraint in (11) for the Bayesian setting will then be

 {\bf J}_{\rm p}+\sum_{m=1}^{M}w_{m}{\bf F}_{m}({\boldsymbol{\theta}})\succeq% \lambda_{\rm eig}{\bf I}_{N}. (12)

The prior information typically comes from the dynamics, previous measurements, or combining other available measurements.

### III-C Problem statement

Having introduced the selection vector as well as the performance measure we can now formally state the problem.

###### Problem statement (Sensor selection).

Given the likelihoods p(y_{m};{\boldsymbol{\theta}}),m=1,2,\ldots,M, of the measurements, and assuming (a1) and (a2) hold, find a vector {\bf w}\in\{0,1\}^{M} that selects the minimum number of most informative sensors satisfying the performance measure \sum_{m=1}^{M}w_{m}{\bf F}_{m}({\boldsymbol{\theta}})-\lambda_{\rm eig}{\bf I}% _{N}\succeq{\bf 0}_{N},\,\;\forall{\boldsymbol{\theta}}\,\in\,\mathcal{U}.

In order to reduce the hardware costs, storage, processing, and communication overhead, we minimize the number of selected sensors. This can be achieved by minimizing the cardinality of the selection vector, i.e., by minimizing the number of non-zero entries of the selection vector. Mathematically, the sensor selection problem is formulated as the design of a selection vector which can be expressed as the following optimization problem

 \displaystyle{\bf w}^{\ast}= \displaystyle\operatorname*{arg\,min}_{\bf w}\quad{\|{\bf w}\|}_{0} (13a) \displaystyle {\rm s.t.}\quad\sum_{m=1}^{M}w_{m}{\bf F}_{m}({\boldsymbol{% \theta}})-\lambda_{\rm eig}{\bf I}_{N}\succeq{\bf 0}_{N},\quad\forall{% \boldsymbol{\theta}}\,\in\,\mathcal{U}, (13b) \displaystyle         {{\bf w}\,\in\,\{0,1\}^{M}}, (13c)

where the \ell_{0}-(quasi) norm refers to the number of non-zero entries in {\bf w}, i.e., {\|{\bf w}\|}_{0}:=|\{m\,:\,w_{m}\neq 0\}|. The threshold \lambda_{\rm eig} imposes the accuracy requirement. The threshold \lambda_{\rm eig} is also the sparsity-inducing parameter, where \lambda_{\rm eig}\rightarrow 0 implies a sparser solution.

Suppose the domain \mathcal{U} consists of D points, obtained by gridding the entire domain at a certain resolution. The resulting multiple LMI constraints can be stacked together as a single LMI constraint. Let us consider the domain \mathcal{U}=\{{\boldsymbol{\theta}}_{1},{\boldsymbol{\theta}}_{2},\ldots,{% \boldsymbol{\theta}}_{D}\} with |\mathcal{U}|=D. The constraints in (13b) can then be equivalently expressed as a single LMI constraint written as \sum_{m=1}^{M}w_{m}{\bf F}_{m}-\lambda_{\rm eig}{\bf I}_{DN}\succeq{\bf 0}_{DN}, where {\bf F}_{m}={\rm diag}({\bf F}_{m}({\boldsymbol{\theta}}_{1}),{\bf F}_{m}({% \boldsymbol{\theta}}_{2}),\ldots,{\bf F}_{m}({\boldsymbol{\theta}}_{D}))\in% \mathbb{S}^{DN} for m=1,2,\ldots,M. Note that the FIM after gridding is independent of {\boldsymbol{\theta}}, and we denote this simply by {\bf F}_{m} (not explicitly as a function of {\boldsymbol{\theta}}).

###### Remark 3 (Worst-case constraints).

If there exists some \tilde{\boldsymbol{\theta}}\in\mathcal{U}_{w}\subset\mathcal{U} such that \lambda_{\rm min}({\bf F}({\bf w},\tilde{\boldsymbol{\theta}}))\leq\lambda_{% \rm min}({\bf F}({\bf w},{\boldsymbol{\theta}})),\,\forall{\bf w}\in\{0,1\}^{M} and \forall{\boldsymbol{\theta}}\in\mathcal{U}_{w}, then it is sufficient to constrain the performance for only the worst-case \tilde{\boldsymbol{\theta}}\in\mathcal{U}_{w} instead of \forall{\boldsymbol{\theta}}\in\mathcal{U}_{w}. This property can be used a guideline for gridding.

## IV Sensor selection solvers

It is well known that the \ell_{0}-(quasi) norm optimization is NP-hard and nonconvex. More specifically, the original sensor selection problem in (13) is NP-hard. The Boolean constraint in (13c) is non-convex and incurs a combinatorial complexity. We next present a number of solvers with which the relaxed convex problem can be solved efficiently in polynomial time.

### IV-A Convex approximation based on \ell_{1}-norm

A computationally tractable (suboptimal) solution is to use the traditional best convex surrogate for the \ell_{0}-(quasi) norm namely the \ell_{1}-norm heuristic. The \ell_{1}-norm is known to represent an efficient heuristic for the \ell_{0}-(quasi) norm optimization with convex constraints especially when the solution is sparse [Polyakl1convex2013]. Such relaxations are well-studied for problems with linear constraints in the context of compressed sensing (CS) and sparse signal recovery [donoho2006most]. The non-convex Boolean constraint in (13c) is further relaxed to the convex box constraint [0,1]^{M}.

The relaxed optimization problem is given as the following SDP problem

 \displaystyle\hat{\bf w}= \displaystyle\operatorname*{arg\,min}_{{\bf w}\,\in\,\mathbb{R}^{M}}\quad{\|{% \bf w}\|}_{1} (14a) \displaystyle {\rm s.t.}\,\sum_{m=1}^{M}w_{m}{\bf F}_{m}-\lambda_{\rm eig}{\bf I% }_{DN}\succeq{\bf 0}_{DN}, (14b) \displaystyle       {0}\leq{w}_{m}\leq{1},\quad m=1,2,\ldots,M, (14c)

where {\|{\bf w}\|}_{1}=\sum_{m=1}^{M}|w_{m}| denotes the \ell_{1}-norm. Due to the positivity constraint, the objective function \|{\bf w}\|_{1} will simply be an affine function {\bf 1}_{M}^{T}{\bf w}. The optimization problem in (14) is a standard SDP problem in the inequality form, which can be efficiently solved in polynomial time using interior-point methods [Boyd]. An implementation of the interior-point method for solving SDP problems in the inequality form is typically based on Newton’s method using an approximating barrier function. A brief description of the projected Newton’s method is provided in Appendix LABEL:app:Newton's which is used to analyze the computational complexity of the relaxed sensor selection problem.

###### Remark 4 (Complexity per iteration).

The computational cost involved during each iteration is as follows [Boyd, Pg. 619]. The matrices {\bf F}_{m},m=1,2,\ldots,M, have a block-diagonal structure with D blocks. Forming the matrix {\bf S}=\sum_{m=1}^{M}w_{m}{\bf F}_{m}-\lambda_{\rm eig}{\bf I}_{DN} costs O(DMN^{2}) flops; computing {\bf S}^{-1}{\bf F}_{i}\;\forall i via Cholesky factorization costs O(MDN^{3}) flops; the Hessian matrix is computed via the inner product of the matrices {\bf S}^{-1}{\bf F}_{i} and {\bf S}^{-1}{\bf F}_{j}, which costs O(DM^{2}N^{2})\;\forall i,j. Finally, the Newton step is computed via Cholesky factorization costing O(M^{3}) flops, and the projection costs O(M) flops. Assuming that M\gg N, the overall computational complexity per iteration of the projected Newton’s algorithm is then O(M^{3}).

Implementations of the interior-point methods are easily available in the form of well-known toolboxes like Yalmip [YALMIP], SeDuMi [Sturm98usingsedumi], and CVX [cvx].

The second-order Newton’s method (cf. Appendix LABEL:app:Newton's) is typically intractable when the number of sensors is very large (M\gg 1000 for example). To circumvent this problem, we propose a subgradient based algorithm. The projected subgradient algorithm is a first-order method which is attractive for large-scale problems as each iteration is much cheaper to process.

The subgradient method is typically used for optimizations involving non-differentiable functions [boyd2003subgradient, bertsekas1999nonlinear]. The subgradient method is a generalization of the gradient method for non-smooth and non-differentiable functions, such as, the \ell_{1}-norm and the minimum eigenvalue constraint functions. We next derive the projected subgradient algorithm.

The relaxed sensor selection problem in (14) can be equivalently expressed as

 \displaystyle\operatorname*{arg\,min}_{{\bf w}}\quad{\|{\bf w}\|}_{1} (15a) \displaystyle {\rm s.t.}\,f_{\rm eig}({\bf w})\geq\lambda_{\rm eig}, (15b) \displaystyle       {\bf w}\in\mathcal{W}, (15c)

where f_{\rm eig}({\bf w}):=\lambda_{\rm min}\{\sum_{m=1}^{M}w_{m}{\bf F}_{m}\} is the constraint function in (14b), and the set \mathcal{W}=\{{\bf w}\in\mathbb{R}^{M}\mid 0\leq w_{m}\leq 1,m=1,2,\ldots,M\} denotes the box constraints in (14c).

The objective {\bf 1}_{M}^{T}{\bf w} is affine, so a subgradient of the objective is the all-one vector {\bf 1}_{M}. Let {\bf g}^{k}\in\partial f_{\rm eig}({\bf w}^{k}) denote a subgradient of the constraint function f_{\rm eig}({\bf w}) at {\bf w}={\bf w}^{k}. Here, the set \partial f_{\rm eig}({\bf w}^{k}) denotes the subdifferential of f_{\rm eig}({\bf w}) evaluated at {\bf w}={\bf w}^{k}. To compute {\bf g}^{k}, we express the constraint function f_{\rm eig}({\bf w}^{k}) as

 f_{\rm eig}({\bf w}^{k})=\inf_{\|{\bf v}\|\leq 1}\quad{\bf v}^{T}\left(\sum_{m% =1}^{M}w_{m}^{k}{\bf F}_{m}\right){\bf v}.

The computation of a subgradient is straightforward, and is given by

 {\bf g}^{k}=[({\bf v}^{k}_{\rm min})^{T}{\bf F}_{1}{\bf v}^{k}_{\rm min},% \ldots,({\bf v}^{k}_{\rm min})^{T}{\bf F}_{m}{\bf v}^{k}_{\rm min}]^{T}\in% \partial f_{\rm eig}({\bf w}^{k}),

where {\bf v}^{k}_{\rm min} is the eigenvector corresponding to the minimum eigenvalue \lambda_{\rm min}\{\sum_{m=1}^{M}w_{m}^{k}{\bf F}_{m}\}. The minimum eigenvalue and the corresponding eigenvector can be computed using a low-complexity iterative algorithm called the power method (see Appendix LABEL:app:poweriter) or using the standard eigenvalue decomposition [golub1996matrix]. Let the projection of a point onto the set \mathcal{W} be denoted by \mathcal{P}_{\mathcal{W}}(\cdot), which can be expressed elementwise as

 [\mathcal{P}_{\mathcal{W}}({\bf w})]_{m}=\begin{cases}0&\mbox{if }w_{m}\leq 0,% \\ w_{m}&\mbox{if }0

The projected subgradient algorithm then proceeds as follows:

 \displaystyle{\bf w}^{k+1}=\begin{cases}\mathcal{P}_{\mathcal{W}}({\bf w}^{k}-% \alpha^{k}{\bf 1}_{M})&\mbox{if }f_{\rm eig}({\bf w}^{k})\geq\lambda_{\rm eig}% ,\\ \mathcal{P}_{\mathcal{W}}({\bf w}^{k}+\alpha^{k}{\bf g}^{k})&\mbox{if }f_{\rm eig% }({\bf w}^{k})<\lambda_{\rm eig}.\end{cases} (17)

In other words, if the current iterate {\bf w}^{k} is feasible (i.e., f_{\rm eig}({\bf w}^{k})\geq\lambda_{\rm eig}), we update {\bf w} in the direction of a negative objective subgradient, as if the LMI constraints were absent; If the current iterate {\bf w}^{k} is infeasible (i.e., f_{\rm eig}({\bf w}^{k})<\lambda_{\rm eig}), we update {\bf w} in the direction of a subgradient {\bf g}^{k} associated with the LMI constraints. After the update is computed, the iterate is projected onto the constraint set \mathcal{W} using \mathcal{P}_{\mathcal{W}}(\cdot).

When the kth iterate is feasible, a diminishing non-summable step size \alpha^{k}=1/\sqrt{k} is used. When the iterate is not feasible Polyak’s step size \alpha^{k}=\frac{f_{\rm eig}({\bf w}^{k})+\epsilon}{\|{\bf g}^{k}\|_{2}^{2}} is used, where we adopt the optimal value for \epsilon:={\bf 1}_{M}^{T}{\bf w}^{\ast} when \|{\bf w}\|_{0} known (i.e., the number of sensors to be selected is known). If this is not known, then we approximate it with \epsilon:=f_{\rm best}^{k}+\gamma, where \gamma=10/(10+k), and f_{\rm best}^{k}=\min\{f_{\rm best}^{k-1},{\bf 1}_{M}^{T}{\bf w}^{k}\} [boyd2003subgradient]. The algorithm is terminated after a specified maximum number of iterations k_{\rm max}. Finally, the estimate is denoted by \hat{\bf w}={\bf w}^{k_{\rm max}}.

The convergence results of the subgradient method for the constrained optimization (i.e., without the projection step) are derived in [boyd2003subgradient]. Since the projection onto a convex set is non-expansive [bertsekas1999nonlinear], it does not affect the convergence. The projected subgradient algorithm is summarized as Algorithm 1.

###### Remark 5 (Complexity per iteration).

We first form the matrix \sum_{m=1}^{M}w_{m}{\bf F}_{m}, which costs O(DMN^{2}) flops. The minimum eigenvalue and the corresponding eigenvector can be computed using the power method at a cost of O(DN^{2}) flops [golub1996matrix]. Forming the vector {\bf g} costs O(DMN^{2}) flops, computing its norm costs O(M) flops, and the update and projection together cost O(M) flops. Assuming that M\gg N as earlier, the computational cost of the projected subgradient algorithm is O(DMN^{2}) which is much lower than the complexity of the projected Newton’s method.

A distributed implementation of the projected subgradient algorithm is very easy. A simple distributed averaging algorithm (e.g., [xiao2004fast]) can be used to compute the sum of matrices \sum_{m=1}^{M}w_{m}{\bf F}_{m}. The minimum eigenvalue and the corresponding eigenvector can then be computed using power iterations at each node independently. The update equation (17), the subgradient vector {\bf g}, and the projection are computed coordinatewise and are already distributed.

Subgradient methods are typically very slow compared to the interior-point method involving Newton iterations, and subgradient methods typically require a few hundred iterations. Newton’s method typically requires in the order of ten steps. On the other hand, unlike the projected subgradient method, Newton’s method cannot be easily distributed, and requires a relatively high complexity per iteration due to the computation and storage of up to second-order derivatives. Depending on the scale of the problem and the resources available for processing one could choose between the subgradient or Newton’s algorithm.

### IV-C Concave surrogate: sparsity-enhancing iterative algorithm

The \ell_{1}-norm is customarily used as the best convex relaxation for the \ell_{0}-norm. However, the intersection of the \ell_{1}-norm ball (or an affine subspace) with the positive semi-definite cone (i.e., the LMI constraint) is not always a unique point as shown in the following Theorem.

###### Theorem 1 (Uniqueness).

The projection of a point {\bf w}\in[0,1]^{M} onto a convex LMI constraint set \sum_{m=1}^{M}w_{m}{\bf F}_{m}-\lambda_{\rm eig}{\bf I}_{DN}\succeq{\bf 0}_{DN} under the \ell_{1}-norm is not always unique.

###### Proof:

The proof follows from the fact that the \ell_{1}-norm is not strictly convex, and from the linearity of the constraint set. Let us consider an example with M=2 (w.l.o.g.), and {\bf F}_{1}={\bf F}_{2}\succeq\lambda_{\rm eig}{\bf I}_{DN}. In other words, the observations are identical. In this case, the extreme points of the \ell_{1}-norm ball, i.e., \hat{\bf w}_{1}=(1,0) and \hat{\bf w}_{2}=(0,1) are two example solutions. Moreover, since the solution set of a convex minimization problem is convex, \tau\hat{\bf w}_{1}+(1-\tau)\hat{\bf w}_{2} is also a solution for any 0<\tau<1, which gives an infinite number of solutions to the relaxed optimization problem (14). For such cases, the \ell_{1}-norm relaxation will typically not result in a sparse solution. ∎

To improve upon the \ell_{1}-norm solution due to its non-uniqueness following from Theorem 1, we propose an alternative relaxation for the original sensor selection problem which also results in fewer selected sensors. Instead of relaxing the \ell_{0}-(quasi) norm with the \ell_{1}-norm, using a nonconvex surrogate function can yield a better approximation. It is motivated in [candes08] that the logarithm of the geometric mean of its elements can be used as an alternative surrogate function for linear inverse problems in CS. Adapting this to our sensor selection problem, we arrive at the optimization problem

 \displaystyle\operatorname*{arg\,min}_{{\bf w}\,\in\,\mathbb{R}^{M}}\quad\sum_% {m=1}^{M}\ln\,(w_{m}+\delta) (18a) \displaystyle {\rm s.t.}\,\sum_{m=1}^{M}w_{m}{\bf F}_{m}-\lambda_{\rm eig}{\bf I% }_{DN}\succeq{\bf 0}_{DN}, (18b) \displaystyle       {0}\leq{w}_{m}\leq{1},\quad m=1,2,\ldots,M, (18c)

where \delta>0 is a small constant that prevents the cost from tending to -\infty. The cost (18a) is concave, but since it is smooth w.r.t. {\bf w}, iterative linearization can be performed to obtain a local minimum [candes08]. The first-order approximation of \ln\,(w_{m}+\delta) around (w_{m}[i-1]+\delta) results in

 \ln\,(w_{m}+\delta)\leq\ln\,(w_{m}[i-1]+\delta)+\frac{(w_{m}-w_{m}[i-1])}{(w_{% m}[i-1]+\delta)}.

Instead of minimizing the original cost, the majorizing cost (second term on the right-hand side of the above inequality) can be optimized to attain a local minima. More specifically, the optimization problem (18) can be iteratively driven to a local minimum using the iterations

 \displaystyle\hat{\bf w}[i]= \displaystyle\operatorname*{arg\,min}_{{\bf w}\,\in\,\mathbb{R}^{M}}\quad\sum_% {m=1}^{M}\frac{w_{m}}{\hat{w}_{m}[i-1]+\delta} (19a) \displaystyle {\rm s.t.}\,\sum_{m=1}^{M}w_{m}{\bf F}_{m}-\lambda_{\rm eig}{\bf I% }_{DN}\succeq{\bf 0}_{DN}, (19b) \displaystyle       {0}\leq{w}_{m}\leq{1},\quad m=1,2,\ldots,M. (19c)

The iterative algorithm is summarized as Algorithm 2. Each iteration in (19) solves a weighted \ell_{1}-norm optimization problem. The weight updates force the small entries of the vector \hat{\bf w}[i] to zero and avoid inappropriate suppression of larger entries. The parameter \delta provides stability, and guarantees that the zero-valued entries of \hat{\bf w}[i] do not strictly prohibit a nonzero estimate at the next step. Finally, the estimate is given by \hat{\bf w}=\hat{\bf w}[i_{\rm max}], where i_{\rm max} is the specified maximum number of iterations.

###### Remark 6 (Sparsity-enhancing projected subgradient algorithm).

The projected subgradient algorithm can be adapted to fit into the sparsity-enhancing iterative algorithm as well. The optimization problem (20) is then replaced with the following update equations:

 \displaystyle{\bf w}^{k+1}[i]=\begin{cases}\mathcal{P}_{\mathcal{W}}({\bf w}^{% k}[i]-\alpha^{k}{\bf u}[i])&\mbox{if }f_{\rm eig}({\bf w}^{k}[i])\geq\lambda_{% \rm eig},\\ \mathcal{P}_{\mathcal{W}}({\bf w}^{k}[i]+\alpha^{k}{\bf g}^{k}[i])&\mbox{if }f% _{\rm eig}({\bf w}^{k}[i])<\lambda_{\rm eig},\end{cases}

where we solve a number of iterations (inner loop) of the projected subgradient algorithm within the ith iteration (outer loop) of Algorithm 2. Here, the kth iterate of the inner loop in the ith outer loop is denoted as (\cdot)^{k}[i].

From the solution of the relaxed optimization problem, the approximate Boolean solution to {\bf w}\in\{0,1\}^{M} can be obtained using randomization techniques, as described next.

### IV-D Randomized rounding

The solution of the relaxed optimization problem is used to compute the suboptimal Boolean solution for the selection problem. A straightforward technique that is often used is the simple rounding technique, in which the Boolean estimate is given by {\rm round}(\hat{w}_{m}),\;m=1,2,\ldots,M, where we define \hat{\bf w}\triangleq[\hat{w}_{1},\hat{w}_{2},\ldots,\hat{w}_{M}]^{T}, and the {\rm round}(.) operator rounds its arguments towards the nearest integer. However, there is no guarantee that the Boolean estimates obtained from the rounding technique always satisfy the LMI constraint. Hence, we propose a randomized rounding technique, where the suboptimal Boolean estimates are computed based on random experiments guided by the solution from the SDP problem in (14) or the iterative version in (19). The randomized rounding technique is summarized as Algorithm 3.

## V Extensions

### V-A The dual problem

The dual of the relaxed primal optimization problem has an interesting relation to the diameter of the confidence ellipsoids, and is closely related to the dual of the E-optimal design [Boyd, Pg. 388]. The dual problem of (14) is given as follows

 \displaystyle\hskip-2.845276pt(\hat{\bf Z},\hat{\boldsymbol{\mu}})= \displaystyle\operatorname*{arg\,max}_{{\bf Z},\,{\boldsymbol{\mu}}}\quad% \lambda_{\rm eig}\,{\rm tr}\{{\bf Z}\}-{\bf 1}_{M}^{T}{\boldsymbol{\mu}} (21) \displaystyle{\rm s.t.}\quad\mathbb{E}\{{\bf s}_{m}^{T}{\bf Z}{\bf s}_{m}\}% \leq 1+\mu_{m},m=1,2,\ldots,M, \displaystyle\hskip 28.452756pt{\bf Z}\succeq{\bf 0},\,\mu_{m}\geq 0,\,m=1,2,% \ldots,M,

where {\bf Z}\in\mathbb{S}^{DN}_{+} and {\boldsymbol{\mu}}\in\mathbb{R}^{N} are the dual variables, and we use \mathbb{E}\{{\bf s}_{m}^{T}{\bf Z}{\bf s}_{m}\}={\rm tr}\{{\bf F}_{m}{\bf Z}\} with {\bf s}_{m}=[\frac{\partial h_{m}(\tilde{\boldsymbol{\theta}}_{1},n_{m})}{% \partial\tilde{\boldsymbol{\theta}}_{1}^{T}},\ldots,\frac{\partial h_{m}(% \tilde{\boldsymbol{\theta}}_{D},n_{m})}{\partial\tilde{\boldsymbol{\theta}}_{D% }^{T}}]^{T}\in\mathbb{R}^{DN}. For a detailed derivation of the dual problem, see Appendix LABEL:app:dual. The dual problem can be interpreted as the problem of maximizing the (average) diameter of the confidence ellipsoid. If we set \mu_{m}=0,m=1,2,\ldots,M, the optimal solution \hat{\bf Z} to the problem (21) is also the solution to the dual of the E-optimal design problem [Boyd, Pg. 388], which maximizes the diameter of the confidence ellipsoid centered around the origin.

The dual formulation is often easier to solve and has only M inequality constraints. The dual problem can be solved using Yalmip, SeDuMi, or CVX as earlier. Suppose {\bf Z} and {\boldsymbol{\mu}} are dual feasible, and {\bf w} is primal feasible, then the dual problem yields the following bound on the primal SDP problem: \lambda_{\rm eig}\,{\rm tr}\{{\bf Z}\}-{\bf 1}_{M}^{T}{\boldsymbol{\mu}}\leq{% \bf 1}_{M}^{T}{\bf w}.

### V-B Scalar constraints

#### Trace constraint

The relaxed sensor selection problem with the scalar trace constraint is given as follows

The trace constraint in (22) is convex in {\bf w}; this is easier to verify when the above trace constraint is expressed as an LMI [Boyd, Pg. 387]. The optimization problem in (22) is a convex problem, and can be cast as an SDP:

 \begin{aligned} &\displaystyle\operatorname*{arg\,min}_{{\bf w}\in\mathbb{R}^{% M},\,{\bf x}\in\mathbb{R}^{N}}\quad{\|{\bf w}\|}_{1}\\ &\displaystyle\hskip 2.845276pt{\rm s.t.}\,\left[\begin{array}[]{cc}\sum_{m=1}% ^{M}w_{m}{\bf F}_{m}({\boldsymbol{\theta}})&{\boldsymbol{\delta}}_{n}\\ {\boldsymbol{\delta}}_{n}^{T}&x_{n}\cr\omit\span\@@LTX@noalign{ }\omit\\ \end{array}\end{aligned}
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters