
# On the Fundamental Limits of Recovering Tree Sparse Vectors from Noisy Linear Measurements

Akshay Soni and Jarvis Haupt. Manuscript submitted June 18, 2013; revised October 15, 2013. The authors are with the Department of Electrical and Computer Engineering, University of Minnesota – Twin Cities, Minneapolis, MN, 55455 USA; emails: {sonix022, jdhaupt}@umn.edu. A portion of this work appeared at the 2011 IEEE Asilomar Conference on Signals, Systems, and Computers, and a shorter summary version of this paper appeared at the 2013 IEEE Global Conference on Signal and Information Processing. This work was supported by DARPA/ONR Award No. N66001-11-1-4090. Copyright 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending request to pubs-permissions@ieee.org.
###### Index Terms

Adaptive sensing, compressive sensing, minimax lower bounds, sparse support recovery, structured sparsity, tree sparsity.

## I Introduction

In recent years, the development and analysis of new sampling and inference methods that make efficient use of measurement resources has received a renewed and concentrated focus. Many of the compelling new investigations in this area share a unifying theme – they leverage the phenomenon of sparsity as a means for describing inherently simple (i.e., low-dimensional) structure that is often present in many signals of interest.

Consider the task of inferring a (perhaps very high-dimensional) vector $x \in \mathbb{R}^n$. Compressive sensing (CS) prescribes collecting $m$ non-adaptive linear measurements of $x$ by “projecting” it onto a collection of $n$-dimensional “measurement vectors.” Formally, CS observations may be modeled as

$$y_j = \langle a_j, x \rangle + w_j = a_j^T x + w_j, \quad \text{for } j = 1, 2, \ldots, m, \tag{1}$$

where $a_j \in \mathbb{R}^n$ is the $j$-th measurement vector and $w_j$ describes the additive error associated with the $j$-th measurement, which may be due to modeling error or stochastic noise. Initial breakthrough results in CS established that sparse vectors having no more than $k$ nonzero elements can be exactly recovered (in noise-free settings) or reliably estimated (in noisy settings) from a collection of only $m = O(k \log(n/k))$ measurements of the form (1) using, for example, ensembles of randomly generated measurement vectors whose entries are iid realizations of certain zero-mean random variables (e.g., Gaussian) – see, for example, [1] as well as numerous CS-related efforts at dsp.rice.edu/cs.
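As a concrete illustration, the following Python sketch generates noisy compressive measurements according to the model (1). All specific values here (`n`, `k`, `m`, `sigma`) are illustrative choices, not parameters taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m, sigma = 256, 5, 64, 0.1  # illustrative problem sizes

# A k-sparse signal with unit-amplitude nonzeros at random locations
x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = 1.0

# Random Gaussian measurement vectors (rows of A), iid N(0, 1/n) entries,
# so each row has unit norm in expectation
A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(m, n))

# y_j = <a_j, x> + w_j, with additive white Gaussian noise w_j
w = rng.normal(0.0, sigma, size=m)
y = A @ x + w

print(y.shape)  # (64,)
```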

While many of the initial efforts in CS focused on purely randomized measurement vector designs and considered recovery of arbitrary sparse vectors, several powerful extensions to the original CS paradigm have been investigated in the literature. One such extension allows for additional flexibility in the measurement process, so that information gleaned from previous observations may be employed in the design of future measurement vectors. Formally, such adaptive sensing strategies are those for which the -th measurement vector is obtained as a (deterministic or randomized) function of previous measurement vectors and observations , for each . Non-adaptive sensing strategies, by contrast, are those for which each measurement vector is independent of all past (and future) observations. The randomized measurement vectors typically employed in CS settings comprise an example of a non-adaptive sensing strategy. Adaptive sensing techniques have been shown beneficial in sparse inference tasks, enabling an improved resilience to measurement noise relative to techniques based on non-adaptive measurements (see, for example, [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] as well as the summary article [17] and the references therein) and further reductions in the number of compressive measurements required for recovering sparse vectors in noise-free settings [18, 19].

Another powerful extension to the canonical CS framework corresponds to the exploitation of additional structure that may be present in the locations of the nonzeros of $x$. To formalize this notion, we first define the support of a vector $x$ as

$$\mathcal{S}(x) \triangleq \{\, i : x_i \neq 0 \,\}, \tag{2}$$

and note that, in general, the support of a $k$-sparse $n$-dimensional vector corresponds to one of the $\binom{n}{k}$ distinct subsets of $\{1, 2, \ldots, n\}$ of cardinality $k$. The term structured sparsity describes a restricted class of sparse signals whose supports may occur only on a (known) subset of these distinct subsets. Generally speaking, knowledge of the particular structure present in the object being inferred can be incorporated into sparse inference procedures, and for certain types of structure this can result either in a reduction in the number of measurements required for accurate inference, or improved estimation error guarantees, or both (see, e.g., [20, 21, 22], as well as the recent survey article [23] on structured sparsity in compressive sensing).
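The definition in (2) and the count of unstructured supports can be made concrete in a few lines of Python; the example vector and dimensions below are illustrative.

```python
import math

def support(x, tol=0.0):
    """Support set S(x) = {i : x_i != 0}, as defined in (2)."""
    return {i for i, xi in enumerate(x) if abs(xi) > tol}

# An unstructured k-sparse n-dimensional vector has one of C(n, k)
# possible supports; structured sparsity restricts this to a known subset.
n, k = 10, 3
x = [0, 2.5, 0, -1.0, 0, 0, 0.3, 0, 0, 0]
print(len(support(x)))   # 3
print(math.comb(n, k))   # 120
```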

To the best of our knowledge, our own previous work [24] was the first to identify and quantify the benefits of using adaptive sensing strategies that are tailored to certain types of structured sparsity, in noisy sparse inference tasks. Specifically, the work [24] established that a simple adaptive compressive sensing strategy for tree-sparse vectors could successfully identify the support of much weaker signals than what could be recovered using non-adaptive or adaptive sensing strategies that were agnostic to the structure present in the signal being acquired. Subsequent efforts by other authors have similarly identified benefits of adaptive sensing techniques tailored to other forms of structured sparsity in noisy sparse inference tasks [14, 15, 25].

The primary aim of this effort is to establish the optimality of the strategy analyzed in [24], by identifying the fundamental performance limits associated with the task of support recovery of tree-sparse signals from noisy measurements that may be obtained adaptively. For completeness, and in an effort to put these results into a broader context, we also identify here the performance limits associated with the same support recovery task in settings where measurements are obtained non-adaptively using randomized (Gaussian) measurement vector ensembles, as in the initial efforts in CS. We begin by formalizing the notion of tree-structured sparsity, and reviewing the results of [24].

### I-A Adaptive Sensing of Tree Sparse Signals

Tree sparsity essentially describes the phenomenon where the nonzero elements of the signal being inferred exhibit clustering along paths in some known underlying tree. For the purposes of our investigation here, we formalize the notion of tree sparsity as follows. Suppose that the set $\{1, 2, \ldots, n\}$ that indexes the elements of $x$ is put into a one-to-one correspondence with the nodes of a known tree having $n$ nodes, which we refer to as the underlying tree. We say that a vector $x$ is $k$-tree sparse (with respect to the underlying tree) when the indices of the support set $\mathcal{S}(x)$ correspond, collectively, to a rooted connected subtree of the underlying tree. In the sequel we restrict our attention to $n$-dimensional signals that are tree sparse in a known underlying binary tree, though our approach and main results can be extended, in a relatively straightforward manner, to underlying trees of higher degree. For illustration, Figure 1 depicts a graphical representation of a signal that is tree sparse in an underlying complete binary tree.
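The rooted-connected-subtree condition is simple to check in code. The sketch below assumes the common 1-based "heap" indexing convention for a nearly complete binary tree (children of node $i$ at $2i$ and $2i+1$); this indexing is an illustrative assumption, not a convention fixed by the paper.

```python
def is_tree_sparse(support, n):
    """Check that `support` (a set of 1-based node indices of a nearly
    complete binary tree, with children of node i at 2i and 2i+1) forms
    a rooted connected subtree: the root (node 1) is included, and every
    other support node's parent is also in the support."""
    if not support:
        return True  # treat the empty support as trivially tree sparse
    if 1 not in support:
        return False
    return all(i == 1 or (i // 2) in support for i in support)

print(is_tree_sparse({1, 2, 3, 5}, 15))  # True: node 5's parent (2) is present
print(is_tree_sparse({1, 3, 4}, 15))     # False: node 4's parent (2) is missing
```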

Tree sparsity arises naturally in the wavelet coefficients of many signals including, in particular, natural images (see, for example, [26, 27, 28]), and this fact has motivated several investigations into CS inference techniques that exploit or leverage underlying tree structure in the signals being acquired [29, 30, 20, 21, 31]. More aligned with our focus here are several prior efforts that have examined specialized sensing techniques, designed to exploit the inherent tree-based structure present in the wavelet-domain representations of certain signals in various application domains. The work [32], for example, examined dynamic MRI applications where non-Fourier (in this case, wavelet domain) encoding is employed along one of the spatial dimensions, and proposed a sequential sensing strategy that acquires observations of the wavelet coefficients of the object being observed in a “coarse-to-fine” (i.e., top-down, in the wavelet representation) manner. The work [33] compared a coarse-to-fine direct wavelet coefficient sensing approach to a sensing approach based on Bayesian experimental design in the context of an imaging application. More recently, [34] proposed a top-down adaptive wavelet sensing strategy in the context of compressive imaging and provided an analysis of the sample complexity of such strategies in noise-free settings, but did not investigate how such procedures would perform in noisy scenarios; see also [35]. Motivated by these existing efforts, the essential aim of the authors’ own prior work [24] was to assess the performance of such strategies in noisy settings; for completeness, we summarize the approach and main results of that work here.

Let us assume, for simplicity, that the signal $x$ being acquired is tree sparse in the canonical (identity) basis, though extensions to signals that are tree sparse in any other orthonormal basis (e.g., a wavelet basis) are straightforward. Noisy observations of $x$ are obtained according to (1) by projecting $x$ onto a sequence of adaptively designed measurement vectors, each of which corresponds to a basis vector of the canonical basis, and we assume that each measurement vector has unit norm. Now, to simplify the description of the procedure, we introduce some slightly different notation to index the individual observations. Specifically, rather than indexing observations by the order in which they were obtained as in (1), we instead index each measurement according to the index of the basis vector onto which $x$ is projected, or equivalently here, according to the location of $x$ that was observed. To that end, let us denote by $y_i$ the measurement obtained by projecting $x$ onto the basis vector having a single nonzero in the $i$-th location, for any $i \in \{1, 2, \ldots, n\}$.

Now, begin by specifying a threshold $\tau > 0$, and by initializing a support estimate $\widehat{\mathcal{S}} = \emptyset$ and a data structure $\mathcal{D}$ (which could be a stack, queue, or simply a set) to contain the index corresponding to the root of the underlying tree. While the data structure $\mathcal{D}$ is nonempty, remove an element $i$ from $\mathcal{D}$, collect a noisy measurement $y_i$ by projecting $x$ onto the $i$-th canonical basis vector, and perform the following hypothesis test. If $|y_i| > \tau$, add the indices corresponding to the children of node $i$ in the underlying tree to the data structure and update the support estimate to include the index $i$; on the other hand, if $|y_i| \leq \tau$, then keep $\widehat{\mathcal{S}}$ and $\mathcal{D}$ unchanged. Continue in this fashion, at each step obtaining a new measurement and performing a corresponding hypothesis test to determine whether the amplitude of the coefficient measured in that step was significant. When the overall procedure terminates it outputs its final support estimate $\widehat{\mathcal{S}}$, which essentially corresponds to the set of locations of $x$ for which the corresponding measurements exceeded $\tau$ in amplitude.
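A minimal Python sketch of the procedure just described is given below, under two illustrative assumptions: 1-based heap indexing of the binary tree (children of node $i$ at $2i$, $2i+1$), and direct noisy coordinate measurements $y_i = x_i + \text{noise}$ (i.e., projection onto canonical basis vectors). The stack here is one of the admissible choices of data structure.

```python
import numpy as np

def tree_sense(x, tau, sigma, rng):
    """Adaptive tree sensing sketch: top-down traversal with a per-node
    hypothesis test |y_i| > tau; children are explored only when their
    parent's measurement is deemed significant."""
    n = len(x)
    S_hat = set()          # support estimate
    D = [1]                # data structure (a stack), initialized with the root
    num_measurements = 0
    while D:
        i = D.pop()
        y_i = x[i - 1] + rng.normal(0.0, sigma)  # noisy measurement of x_i
        num_measurements += 1
        if abs(y_i) > tau:                       # hypothesis test
            S_hat.add(i)
            for child in (2 * i, 2 * i + 1):     # push children of node i
                if child <= n:
                    D.append(child)
    return S_hat, num_measurements

rng = np.random.default_rng(1)
x = np.zeros(15)
for i in (1, 2, 5):        # a 3-tree sparse signal with support {1, 2, 5}
    x[i - 1] = 5.0
S_hat, used = tree_sense(x, tau=2.0, sigma=0.1, rng=rng)
print(sorted(S_hat), used)
```

With a strong signal relative to the noise (amplitude 5, $\sigma = 0.1$, $\tau = 2$), the procedure recovers the support $\{1, 2, 5\}$ and uses $2k+1 = 7$ measurements: the $k$ support nodes plus their out-of-subtree children.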

The main result of [24] quantifies the performance of this type of sensing strategy for acquiring tree-sparse signals in settings where each measurement is corrupted by additive white Gaussian noise; the overall approach in this context is depicted as Algorithm 1. We provide a restatement of the main result of [24] here as a Lemma, and provide a proof in the appendix, for completeness. (We note that we have not attempted to optimize constants in our derivation of Lemma I.1, opting instead for simple expressions that better illustrate the scaling behavior with respect to the problem parameters.) It is worth noting that the choice of data structure in the procedure implicitly determines the order in which measurements are obtained; our analysis, however, is applicable regardless of which particular data structure is used.

###### Lemma I.1.

Specify a sparsity parameter $k'$, intended to be an upper bound on the true sparsity level of the signal being acquired, and choose any $\delta \in (0,1)$. Set the threshold in Algorithm 1 to be

$$\tau = \sqrt{2\sigma^2 \log\!\left(\frac{4k'}{\delta}\right)}. \tag{3}$$

Now, if the signal $x$ being acquired by the procedure is $k$-tree sparse for some $k$, the specified sparsity parameter satisfies $k' = \beta k$ for some $\beta \geq 1$, and the nonzero components of $x$ satisfy

$$|x_i| \geq \sqrt{8\left[1 + \log\!\left(\frac{4\beta}{\delta}\right)\right]} \cdot \sqrt{\sigma^2 \log k}, \tag{4}$$

for every $i \in \mathcal{S}(x)$, then with probability at least $1 - \delta$ the following are true: the algorithm terminates after collecting at most $2k+1$ measurements, and the support estimate $\widehat{\mathcal{S}}$ produced by the procedure satisfies $\widehat{\mathcal{S}} = \mathcal{S}(x)$.

In words, this result ensures that when the magnitudes of the nonzero signal components are sufficiently large – satisfying the condition specified in (4) – the procedure depicted in Algorithm 1 will correctly identify the support of the tree sparse vector (with high probability), and will do so using only $O(k)$ measurements.

Now, as a simple extension, suppose that we seek to identify the support of a $k$-tree sparse vector and are equipped with a budget of $m$ measurements that allows for $r$ repeated measurements at each step of the procedure, for some integer $r$. In this setting, the procedure described above may be easily modified to obtain a total of $r$ measurements (each with its own independent additive noise) at each step. If these replicated measurements are averaged prior to performing the hypothesis test at each step, the results of Lemma I.1 can be extended directly to this setting. We formalize this extension here as a corollary.
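The benefit of averaging is the familiar variance reduction: averaging $r$ independent noisy measurements of the same quantity reduces the effective noise variance from $\sigma^2$ to $\sigma^2/r$, which is why the modified threshold in (5) replaces $\sigma^2$ with $\sigma^2/r$. A quick Monte Carlo check (with illustrative parameter values):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, r, trials = 1.0, 16, 200_000

# Single noisy measurements vs. averages of r independent measurements
single = rng.normal(0.0, sigma, size=trials)
averaged = rng.normal(0.0, sigma, size=(trials, r)).mean(axis=1)

print(round(single.var(), 2))    # ~ 1.0    (sigma^2)
print(round(averaged.var(), 3))  # ~ 0.062  (sigma^2 / r = 1/16)
```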

###### Corollary I.1.

Let $x$ be as in Lemma I.1, and consider acquiring $x$ using a variant of the adaptive tree sensing procedure described in Algorithm 1, where $r$ measurements are obtained in each step and averaged to reduce the effective measurement noise prior to each hypothesis test. Choose any $\delta \in (0,1)$ and sparsity parameter $k'$, and set the threshold as

$$\tau = \sqrt{2\left(\frac{\sigma^2}{r}\right)\log\!\left(\frac{4k'}{\delta}\right)}. \tag{5}$$

If $x$ is $k$-tree sparse for some $k$, the sparsity parameter satisfies $k' = \beta k$ for some $\beta \geq 1$, and the amplitudes of the nonzero components of $x$ satisfy

$$|x_i| \geq \sqrt{8\left[1 + \log\!\left(\frac{4\beta}{\delta}\right)\right]} \cdot \sqrt{\left(\frac{\sigma^2}{r}\right)\log k}, \tag{6}$$

for every $i \in \mathcal{S}(x)$, then with probability at least $1 - \delta$ the following are true: the algorithm terminates after collecting at most $r(2k+1)$ measurements, and the support estimate $\widehat{\mathcal{S}}$ produced by the procedure satisfies $\widehat{\mathcal{S}} = \mathcal{S}(x)$.

Note that since the procedure performs at most $2k+1 \leq 3k$ hypothesis tests, we have $\sigma^2/r \leq 3\sigma^2 k/m$ provided $k \geq 1$. It follows from the corollary that when the sparsity parameter $k'$ does not overestimate the true sparsity level by more than a constant factor (i.e., when $\beta$ is a constant), a sufficient condition to ensure that the support estimate produced by the repeated-measurements variant of the tree sensing procedure is correct with probability at least $1 - \delta$ is that the nonzero components of $x$ satisfy

$$|x_i| \geq \sqrt{24\left[1 + \log\!\left(\frac{4\beta}{\delta}\right)\right]} \cdot \sqrt{\sigma^2\left(\frac{k}{m}\right)\log k}, \tag{7}$$

for all $i \in \mathcal{S}(x)$. Identifying whether any other procedure can accurately recover the support of tree-sparse signals having fundamentally weaker amplitudes is the motivation for our present effort.

### I-B Problem Statement

As stated above, the essential aim of this work is to establish whether the adaptive sensing procedure for tree-sparse signals analyzed by the authors in the previous work [24], and summarized above as Algorithm 1, is optimal. Our specific focus here is on establishing fundamental performance limits for the support recovery task – that of identifying the locations of the nonzeros of $x$ – in settings where $x$ is tree-sparse, and when observations may be designed either non-adaptively (e.g., measurement vectors whose elements are random and iid, as in traditional CS) or adaptively based on previous observations. We formalize this problem here.

#### I-B1 Signal Model

Let $\mathcal{T}_{n,k}$ denote the set of all unique supports for $n$-dimensional vectors that are $k$-tree sparse in the same underlying binary tree with $n$ nodes. For technical reasons, we further assume that the underlying trees are nearly complete, meaning that all levels of the underlying tree are full with the possible exception of the last (i.e., the bottom) level, and all nodes in any partially full level are as far to the left as possible.

Our specific focus will be on classes of $k$-tree sparse signals, where each $k$-sparse signal has support $\mathcal{T} \in \mathcal{T}_{n,k}$, and for which the amplitudes of all nonzero signal components are greater than or equal to some non-negative quantity $\mu$. Formally, for a given underlying tree, fixed sparsity level $k$, and $\mu$ as described above, we define the signal class

$$\mathcal{X}_{\mu;\mathcal{T}_{n,k}} \triangleq \left\{\, x \in \mathbb{R}^n : x_i = \alpha_i \mathbf{1}_{\{i \in \mathcal{T}\}},\ |\alpha_i| \geq \mu > 0,\ \mathcal{T} \in \mathcal{T}_{n,k} \,\right\}, \tag{8}$$

where $\mathbf{1}_{\{i \in \mathcal{T}\}}$ denotes the indicator function of the event $i \in \mathcal{T}$. In the sequel, we choose to simplify the exposition by denoting the signal class using the shorthand notation $\mathcal{X}_{\mu,k}$, effectively leaving the problem dimension and specification of the underlying tree (and corresponding set of allowable $k$-tree sparse supports) implicit. As we will see, the conditions required for accurate support recovery of $k$-tree sparse signals as defined above are directly related to the signal amplitude parameter $\mu$.
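The signal class in (8) can be sampled from directly: pick a random rooted connected subtree of size $k$ and assign each support index an amplitude of magnitude at least $\mu$. The sketch below again assumes 1-based heap indexing of the binary tree (an illustrative convention) and an arbitrary amplitude distribution on $[\mu, \mu+1)$, chosen only to satisfy the $|\alpha_i| \geq \mu$ constraint.

```python
import numpy as np

def random_tree_sparse_signal(n, k, mu, rng):
    """Draw a signal from the class in (8): grow a random rooted connected
    subtree of size k (children of node i at 2i, 2i+1), then assign each
    support index a random sign and a magnitude of at least mu."""
    support = {1}                               # rooted: must contain the root
    frontier = [c for c in (2, 3) if c <= n]    # candidate children to add
    while len(support) < k and frontier:
        idx = rng.integers(len(frontier))
        node = frontier.pop(idx)                # grow the subtree by one node
        support.add(node)
        frontier += [c for c in (2 * node, 2 * node + 1)
                     if c <= n and c not in support]
    x = np.zeros(n)
    for i in support:
        x[i - 1] = rng.choice([-1, 1]) * (mu + rng.random())  # |x_i| >= mu
    return x, support

rng = np.random.default_rng(3)
x, T = random_tree_sparse_signal(n=31, k=6, mu=0.5, rng=rng)
print(len(T), min(abs(x[i - 1]) for i in T) >= 0.5)  # 6 True
```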

#### I-B2 Sensing Strategies

We examine the support recovery task under both adaptive and non-adaptive sensing strategies. The non-adaptive sensing strategies that we examine here are motivated by initial efforts in CS, which prescribe collecting observations using ensembles of randomly generated measurement vectors. Here, when considering performance limits of non-adaptive sensing, we consider observations obtained according to the model (1), where each $a_j$, $j = 1, \ldots, m$, is an independent random vector whose elements are iid $\mathcal{N}(0, 1/n)$ random variables. This normalization ensures that each measurement vector has unit norm in expectation; that is, $\mathbb{E}[\|a_j\|_2^2] = 1$ for all $j$. Our investigation of adaptive sensing strategies focuses on observations obtained according to (1), using measurement vectors satisfying $\|a_j\|_2 = 1$ for $j = 1, \ldots, m$, and for which each $a_j$ is allowed to explicitly depend on $\{a_\ell, y_\ell\}_{\ell=1}^{j-1}$, as described above.
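The normalization claim is easy to verify numerically: with iid $\mathcal{N}(0, 1/n)$ entries, each measurement vector $a_j$ satisfies $\mathbb{E}[\|a_j\|_2^2] = n \cdot (1/n) = 1$. A quick check with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 512, 2000

# Rows of A are measurement vectors with iid N(0, 1/n) entries
A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(m, n))
sq_norms = (A ** 2).sum(axis=1)   # ||a_j||^2 for each row

print(round(sq_norms.mean(), 2))  # ~ 1.0, i.e. unit norm in expectation
```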

Overall, as noted in [16], we can essentially view any (non-adaptive or adaptive) sensing strategy in terms of a collection of conditional distributions of the measurement vectors $a_j$ given $\{a_\ell, y_\ell\}_{\ell=1}^{j-1}$, for $j = 1, \ldots, m$. We adopt this interpretation here, denoting by $\mathcal{M}_{m,\mathrm{na}}$ the specific sensing strategy based on $m$ non-adaptive Gaussian random measurements described above, and by $\mathcal{M}_m$ the collection of all adaptive (or non-adaptive) sensing strategies based on $m$ measurements, where each measurement vector has exactly unit norm (with probability one).

#### I-B3 Observation Noise

In each case, we model the noises $\{w_j\}_{j=1}^{m}$ associated with the linear measurements as a sequence of independent $\mathcal{N}(0, \sigma^2)$ random variables. We further assume that each noise $w_j$ is independent of the present and all past measurement vectors $\{a_\ell\}_{\ell \leq j}$. For the non-adaptive sensing strategies we examine here, noises will also be independent of future measurement vectors, though by design, future measurement vectors generally will not be independent of present noises when adaptive sensing strategies are employed.

#### I-B4 The Support Estimation Task

We define a support estimator to be a (measurable) function from the space of measurement vectors and associated observations to the power set of $\{1, 2, \ldots, n\}$. In other words, an estimator takes as its input a collection of measurement vectors and associated observations, $\{a_j, y_j\}_{j=1}^{m}$, denoted by $(A_m, y_m)$ in the sequel (for shorthand), and outputs a subset of the index set $\{1, 2, \ldots, n\}$. We note that any estimator can, in general, have knowledge of the sensing strategy that was employed during the measurement process, and we make that dependence explicit here. Overall, we denote a support estimate based on observations obtained using sensing strategy $M$ by $\psi(A_m, y_m; M)$.

Now, under the $0/1$ loss function $d(\mathcal{S}_1, \mathcal{S}_2) = \mathbf{1}_{\{\mathcal{S}_1 \neq \mathcal{S}_2\}}$ defined on pairs of subsets of $\{1, 2, \ldots, n\}$, the (maximum) risk of an estimator $\psi$ based on sensing strategy $M$ over the set $\mathcal{X}_{\mu,k}$ is given by

$$\begin{aligned} \mathcal{R}_{\mathcal{X}_{\mu,k}}(\psi, M) &\triangleq \sup_{x \in \mathcal{X}_{\mu,k}} \mathbb{E}_x\!\left[d\big(\psi(A_m, y_m; M), \mathcal{S}(x)\big)\right] \\ &= \sup_{x \in \mathcal{X}_{\mu,k}} \mathbb{E}_x\!\left[\mathbf{1}_{\{\psi(A_m, y_m; M) \neq \mathcal{S}(x)\}}\right] \\ &= \sup_{x \in \mathcal{X}_{\mu,k}} \Pr\nolimits_x\!\big(\psi(A_m, y_m; M) \neq \mathcal{S}(x)\big), \end{aligned} \tag{9}$$

where $\mathbb{E}_x$ and $\Pr_x$ denote, respectively, expectation and probability with respect to the joint distribution of the quantities $(A_m, y_m)$ that is induced when $x$ is the true signal being observed. In words, the (maximum) risk essentially quantifies the worst-case performance of a specified estimator when estimating the “most difficult” element of $\mathcal{X}_{\mu,k}$ (here, the element whose support is most difficult to accurately estimate) from observations obtained via sensing strategy $M$.
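The $0/1$-loss risk in (9) can be estimated by Monte Carlo for any concrete estimator. The toy sketch below (all parameter values illustrative, and a deliberately simple problem: a $1$-sparse signal observed directly at every coordinate, estimated by picking the largest magnitude) shows how $\Pr_x(\psi \neq \mathcal{S}(x))$ would be approximated empirically for one fixed $x$-distribution; the worst case over the class would then be approximated by repeating this over candidate signals.

```python
import numpy as np

rng = np.random.default_rng(5)
n, mu, sigma, trials = 64, 2.0, 1.0, 5000

errors = 0
for _ in range(trials):
    # 1-sparse signal with a single nonzero of amplitude mu
    x = np.zeros(n)
    true_idx = rng.integers(n)
    x[true_idx] = mu
    # Direct noisy observations of every coordinate
    y = x + rng.normal(0.0, sigma, size=n)
    # Estimator psi: location of the largest-magnitude observation
    if np.argmax(np.abs(y)) != true_idx:
        errors += 1

print(errors / trials)  # empirical Pr(psi != S(x)); grows as mu shrinks
```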

Now, we define the minimax risk associated with the class of distributions induced by elements $x \in \mathcal{X}_{\mu,k}$ and the class of allowable sensing strategies $\mathcal{M}$ as the infimum of the (maximum) risk over all estimators $\psi$ and sensing strategies $M \in \mathcal{M}$; that is,

$$\begin{aligned} \mathcal{R}^*_{\mathcal{X}_{\mu,k},\mathcal{M}} &\triangleq \inf_{\psi;\, M \in \mathcal{M}} \mathcal{R}_{\mathcal{X}_{\mu,k}}(\psi, M) \\ &= \inf_{\psi;\, M \in \mathcal{M}} \sup_{x \in \mathcal{X}_{\mu,k}} \Pr\nolimits_x\!\big(\psi(A_m, y_m; M) \neq \mathcal{S}(x)\big). \end{aligned} \tag{10}$$

In words, the minimax risk quantifies the error incurred by the best possible estimator when estimating the support of the “most difficult” element of $\mathcal{X}_{\mu,k}$ using observations obtained via any sensing strategy $M \in \mathcal{M}$.

Note that when the minimax risk is bounded away from zero, so that $\mathcal{R}^*_{\mathcal{X}_{\mu,k},\mathcal{M}} \geq \gamma$ for some $\gamma > 0$, it follows that regardless of the particular estimator and sensing strategy employed, there will always be at least one signal $x \in \mathcal{X}_{\mu,k}$ for which $\Pr_x(\psi(A_m, y_m; M) \neq \mathcal{S}(x)) \geq \gamma$. Clearly, such settings may be undesirable in practice, since in this case we can make no uniform guarantees regarding accurate support recovery – there will always be some worst-case scenario for which the support recovery error probability will be at least $\gamma$. Our aim here is to identify these problematic scenarios; formally, we aim to identify signal classes of the form (8), parameterized by their corresponding signal amplitude parameters $\mu$, for which the minimax risk will necessarily be bounded away from zero.

### I-C Summary of Our Contributions

Our first main result analyzes the support recovery task for tree-sparse signals in a non-adaptive sensing scenario motivated by the randomized sensing strategies typically employed in compressive sensing. We state the result here as a theorem, and provide a proof in the next section.

###### Theorem I.1.

Let $\mathcal{X}_{\mu,k}$ be the class of $k$-tree sparse $n$-dimensional signals defined in (8), and consider acquiring $m$ measurements of $x \in \mathcal{X}_{\mu,k}$ using the non-adaptive (random, Gaussian) sensing strategy $\mathcal{M}_{m,\mathrm{na}}$. If

 (11)

for some $\gamma \in (0, 1/2)$, then the minimax risk defined in (10) obeys the bound

$$\mathcal{R}^*_{\mathcal{X}_{\mu,k},\,\mathcal{M}_{m,\mathrm{na}}} \geq \gamma. \tag{12}$$

As alluded to above, the direct implication of Theorem I.1 is that no uniform recovery guarantees can be made for any estimation procedure for recovering the support of tree-sparse signals when the signal amplitude parameter $\mu$ is “too small.”

Our second main result concerns support recovery in scenarios where adaptive sensing strategies may be employed. We state this result as Theorem I.2, and provide a proof in the next section.

###### Theorem I.2.

Let $\mathcal{X}_{\mu,k}$ be the class of $k$-tree sparse $n$-dimensional signals defined in (8), and consider acquiring $m$ measurements of $x \in \mathcal{X}_{\mu,k}$ using any sensing strategy $M \in \mathcal{M}_m$. If

$$\mu \leq (1 - 2\gamma)\sqrt{\sigma^2\left(\frac{k}{m}\right)}, \tag{13}$$

for some $\gamma \in (0, 1/2)$, then the minimax risk defined in (10) obeys the bound

$$\mathcal{R}^*_{\mathcal{X}_{\mu,k},\,\mathcal{M}_m} \geq \gamma. \tag{14}$$

Similar to the discussion following the statement of Theorem I.1 above, here we have that no uniform guarantees can be made regarding accurate support recovery of signals $x \in \mathcal{X}_{\mu,k}$ for small $\mu$.

Table I depicts a summary of our main results in a broader context. Overall, we compare four distinct scenarios corresponding to a taxonomy of adaptive and non-adaptive sensing strategies for recovering $k$-sparse signals under assumptions of unstructured sparsity and tree sparsity. For each, we identify (up to an unstated constant) a critical value of the signal amplitude parameter, say $\mu^*$, such that for the support recovery task the minimax risk over the class $\mathcal{X}_{\mu,k}$ will necessarily be bounded away from zero when $\mu \leq \mu^*$. The conditions for support estimation of unstructured sparse vectors listed in Table I are a restatement of some known results, and are provided here (with references) for comparison. (Necessary conditions on the signal amplitude parameter required for exact support recovery from non-adaptive compressive samples, and for unstructured sparse signals, were provided in [36]; related efforts along these lines include [37, 38, 39, 40, 41]. Necessary conditions for exact support recovery using adaptive sensing strategies were provided in [42] for the case where the number of measurements exceeds the signal dimension ($m \geq n$), while to the best of our knowledge results of this flavor have not yet been established for the compressive regime, where $m < n$. Finally, we note that several related efforts have established necessary conditions for weaker metrics of approximate support recovery using non-adaptive sensing [43, 44] and adaptive sensing strategies [11, 16].) Our main contributions here are depicted in the bottom row of the table, which corresponds to the values identified in equations (11) and (13), respectively (with the leading multiplicative factors suppressed).

Two salient points are worth noting when comparing the necessary conditions summarized in Table I with the sufficient condition (7) for the repeated-measurement variant of the adaptive tree sensing procedure of Algorithm 1. First, the results of Theorem I.2, summarized in the lower-right corner of Table I, address our overall question – the simple adaptive tree sensing procedure described above is indeed nearly optimal for estimating the support of $k$-tree sparse vectors, in the following sense: Corollary I.1 describes a technique that accurately recovers (with probability at least $1-\delta$, where $\delta$ can be made arbitrarily small) the support of any $k$-tree sparse signal from $m$ measurements, provided the amplitudes of the nonzero signal components all exceed $c\sqrt{\sigma^2 (k/m) \log k}$ for some constant $c > 0$. On the other hand, for any estimation strategy based on any adaptive or non-adaptive sensing method, support recovery will fail (with probability at least $\gamma$) for some signal or signals in a class comprised of $k$-tree sparse vectors whose nonzero components exceed $c'\sqrt{\sigma^2 (k/m)}$ in amplitude, for a constant $c' > 0$.

The second noteworthy point here concerns the relative performances of the four strategies summarized in Table I. Overall, we see that techniques that either employ adaptive sensing strategies or exploit tree structure in the signal being inferred (but not both) may indeed outperform non-adaptive sensing techniques that do not exploit structure, in the sense that either may succeed in recovering signals whose nonzero components are weaker. That said, the potential improvement arises only in the logarithmic factor present in the amplitudes, implying that either of these improvements by itself can recover signals whose amplitudes are weaker by a factor that is (at best) a constant multiple of $\sqrt{\log n}$. On the other hand, techniques that leverage both adaptivity and structure, such as the adaptive tree sensing strategy analyzed above, can provably recover signals whose nonzero component amplitudes are significantly weaker than those that can be recovered via any of the other strategies depicted in the table. Specifically, in this case the relative difference in amplitudes is on the order of a constant times $\sqrt{n/k}$, which could be much more significant, especially in high-dimensional settings. The experimental evaluation in Section III provides some additional empirical evidence along these lines.

### I-D Relations to Existing Works

As alluded above, several recent efforts have proposed (e.g., [29, 30, 31]) and analyzed (e.g., [20, 21]) specialized techniques for estimating tree-sparse signals from non-adaptive compressive samples, each of which are designed to exploit the fundamental connectivity structure present in the underlying signal during the inference task. The work [33] was among the first to propose and experimentally evaluate a direct wavelet sensing approach for acquiring and estimating wavelet sparse signals (there, images) in the context of compressive sensing tasks, and the sample complexity of a similar procedure in noise free settings was analyzed in [34],[35]. These works served as the motivation for our initial investigation [24] into the performance of such approaches in noisy settings.

Since our work [24] appeared, several related efforts in the literature have investigated adaptive sensing strategies for structured sparse signals. The work [14], for example, examined the problem of localizing block-structured activations in matrices from noisy measurements, and established fundamental limits for this task using proof techniques based on [45]. We adopt a similar approach based on [45] below in the proof of one of our main results. A follow-on work [15] examined a more general setting, that of support recovery of signals whose supports correspond to (unions of) smaller clusters in some underlying graph. That work assumed that the clusters comprising the signal model were such that they could be organized into a (nearly balanced) hierarchical clustering having relatively few levels. While this model is quite general, we note that the class of tree sparse signals we consider here comprise a particularly difficult (in fact, nearly pathological!) scenario for the strategy of [15]; indeed, the tree-sparse case comprises one example of a problematic scenario identified in [15] where that approach “does not significantly help when distinguishing clusters that differ only by a few vertices.”

It is interesting to note that different structure models can give rise to different thresholds for localization from non-adaptive measurements. We note, for example, that the thresholds identified in [14] for localizing block-sparse signals using non-adaptive compressive measurements are weaker than the corresponding threshold we identify in Theorem I.1 here for localizing tree-sparse signals. (Specifically, the results of [14] imply, adapted to the notation we employ here, that accurate localization of block-sparse signals is impossible only when the nonzero signal components have amplitudes smaller than a correspondingly smaller threshold.) This difference arises as a direct result of the different signal models, and in particular, how these differences manifest themselves in the reduction strategy inherent in the proofs based on the ideas of [45]. For the analysis of block-sparse signals in [14], the reduction to hypotheses that are difficult to distinguish leads to consideration of block-sparse signals that either differ on a large number of locations or do not overlap at all; in contrast, the performance limits in our case are dictated by tree sparse signals that can differ on as few as two locations. Stated another way, the tree-sparse signal model we consider here contains subsets of signals that are necessarily more difficult to discern than does the block-sparse model analyzed in [14], and this gives rise to the higher signal amplitude thresholds necessary for localization using non-adaptive compressive measurements for the tree-sparse model we examine here, as compared with the block-sparse model examined in [14].

We also note a recent related work which proposed a technique for sensing signals that are “almost” tree-sparse in a wavelet representation, in the sense that their supports may correspond to disconnected subtrees of some underlying tree [25]. While the sensing strategy proposed in that work was demonstrated experimentally to be effective for acquiring natural images, only a partial analysis of the procedure was provided. Specifically, [25] analyzed their procedure only for the case where the signal supports do correspond to connected subtrees of some underlying tree, which was effectively the case analyzed in [24]. Further, the analysis in [25] did not explicitly quantify sufficient conditions on the signal component amplitudes under which the procedure would successfully recover the signal support, stating instead only that the collected measurements were sufficient to recover the support provided the SNR was “sufficiently large.”

While our focus here is specifically on the support recovery task, we note that the related prior work [14] also identified fundamental limits for the task of detecting the presence of block-structured activations in matrices using adaptive or non-adaptive measurements, and established that signals whose nonzero components are essentially “too weak” cannot be reliably detected by any method. Analogous fundamental limits for the detection of certain tree-sparse signals have also been established in the literature. Specifically, in the context of our effort here, the problem examined in [46] may be viewed in terms of identifying the support of (a subset of) tree sparse signals whose nonzero elements all have the same amplitude $\mu$, from a total of $n$ noisy measurements, corresponding to one measurement per node of the underlying tree. Interestingly, that work established that all detection approaches (for simple trees with no branching) are unreliable when $\mu$ falls below a specified constant threshold. This threshold differs from the lower bound we establish for the support recovery task by only a logarithmic factor. This slight difference may arise from the fact that our tree-sparse model contains many more allowable supports (and therefore, more signal candidates) than the path-based model examined in [46], or it may be that (at least for the “full-measurement” scenario where $m = n$) the support recovery task is slightly more difficult than the detection task. A full characterization of this type of detection problem for general tree-sparse signals, in settings where measurements may be compressive ($m < n$) as well as adaptive or non-adaptive, is beyond the scope of our effort here, and remains an (as yet) open problem.

Finally, while our focus here was specifically on the adaptive tree-sensing strategy and fundamental recovery limits for tree-sparse signals, we note that previous results have established that the necessary conditions for recovery of unstructured sparse signals in the top row of Table I are essentially tight, in the sense that there exist sensing strategies and associated estimation procedures in each case that are capable of accurate support recovery of sparse signals whose nonzero components exceed a constant times the specified quantity – see, for example, [37, 47, 36, 48], which consider the identification of necessary conditions for support recovery of (unstructured) sparse signals from non-adaptive measurements, and [42], which analyzes an adaptive sensing strategy for recovering (unstructured) sparse vectors in noisy settings. Support recovery of (group) structured sparse signals was also examined recently in [49, 50, 51].

### I-E Organization

The remainder of this paper is organized as follows. The proofs of our main results, Theorems I.1 and I.2, are presented in Section II. In Section III we provide an experimental evaluation of the support recovery task for tree sparse signals. Specifically, we compare the performance of the tree sensing procedure described above with an inference procedure based on non-adaptive (compressive) sensing that is designed to exploit the tree structure, as well as with adaptive and non-adaptive CS techniques that are agnostic to the underlying tree structure. We also provide experimental evidence to validate the scaling behavior predicted in (7) for a fixed measurement budget. We discuss some natural extensions of this effort, and provide a few concluding remarks, in Section IV. Several auxiliary results, as well as a proof of Lemma I.1, are relegated to the Appendix.

## II Proofs of Main Results

Our first main result, Theorem I.1, concerns the support recovery task for tree-sparse signals in a non-adaptive sensing scenario motivated by the randomized sensing strategies typically employed in compressive sensing. Our analysis here follows a similar strategy as in a recent related effort [14], which is based on the general reduction strategy described by Tsybakov [45]. Our second main result, Theorem I.2, concerns support recovery for tree-sparse vectors in scenarios where adaptive sensing strategies may be employed. Our proof approach in this scenario is again based on a reduction strategy – we argue (formally) that the support recovery task in this case is at least as difficult as the task of localizing a single nonzero signal component of a vector of reduced dimension, and leverage a result of the recent work [11] which examined support recovery from non-adaptive measurements for general (unstructured) sparse signals.

Before we proceed, we first introduce some notation that will be used throughout the proofs here. For any $T \in \mathcal{T}_{n,\ell}$ with $\ell < n$, corresponding to the support of a rooted connected subtree with $\ell$ nodes (in some underlying nearly complete binary tree with $n$ nodes), we define $\mathcal{N}(T)$ to be the set of locations in the underlying tree such that, for any $j \in \mathcal{N}(T)$, the augmented set $T \cup \{j\}$ corresponds to a tree with $\ell+1$ nodes that is itself another rooted connected subtree of the same underlying tree. Formally, for $\ell < n$ we define

$$\mathcal{N}(T) \triangleq \left\{\, j \in \{1,2,\ldots,n\} \,:\, T \cup \{j\} \in \mathcal{T}_{n,\ell+1} \,\right\}. \tag{15}$$

With this, we are in position to proceed with the proofs of Theorems I.1 and I.2.
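For intuition, the construction in (15) is easy to compute directly. The following sketch (our own illustration, not part of the paper's formal development) enumerates $\mathcal{N}(T)$ for a nearly complete binary tree under the standard heap indexing, in which node $i$ has children $2i$ and $2i+1$ and the root is node $1$:

```python
def neighbors(T, n):
    """Compute N(T): the nodes j whose addition to T yields another
    rooted connected subtree of the n-node binary tree (heap indexing:
    node i has children 2i and 2i+1; the root is node 1)."""
    out = set()
    for i in T:
        for c in (2 * i, 2 * i + 1):
            if c <= n and c not in T:
                out.add(c)
    return out

# The root alone can be extended by either of its two children,
# matching the degenerate k = 2 scenario discussed in the proof below.
print(sorted(neighbors({1}, n=7)))
```

Note that a rooted subtree with $\ell$ nodes has at most $\ell+1$ such extension slots, which is consistent with the cardinality claims used in the proofs below.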

### II-A Proof of Theorem I.1

The result of Theorem I.1 quantifies the limits of support recovery for tree sparse signals using non-adaptive randomized sensing strategies. Our analysis is based on the general reduction strategy proposed by Tsybakov [45], and follows a similar approach as that in a recent, related effort that identified performance limits for estimating block-structured matrices from noisy measurements [14].

Recall the problem formulation and notation introduced in the previous section, and note that for any subset $\mathcal{X}'_{\mu,k} \subseteq \mathcal{X}_{\mu,k}$, any estimator $\psi$, and any measurement strategy $M$, we have that

$$\sup_{x\in\mathcal{X}_{\mu,k}} \Pr_x\!\left(\psi(A^m,y^m;M) \neq \mathcal{S}(x)\right) \;\ge\; \sup_{x\in\mathcal{X}'_{\mu,k}} \Pr_x\!\left(\psi(A^m,y^m;M) \neq \mathcal{S}(x)\right), \tag{16}$$

where, as described above, the notation $\Pr_x$ denotes probability with respect to the joint distribution of the quantities $A^m$ and $y^m$ that is induced when $x$ is the true signal being observed. This implies, in particular, that

$$\mathcal{R}^*_{\mathcal{X}_{\mu,k},M} \;\ge\; \mathcal{R}^*_{\mathcal{X}'_{\mu,k},M} \tag{17}$$

and it follows that we can obtain valid lower bounds on $\mathcal{R}^*_{\mathcal{X}_{\mu,k},M}$ by instead seeking lower bounds on the minimax risk over any restricted signal class $\mathcal{X}'_{\mu,k} \subseteq \mathcal{X}_{\mu,k}$. This is the strategy we employ here.

For technical reasons we address the cases $k=2$ and $3 \le k \le (n+1)/2$ separately, but the essential approach is similar in both cases. Namely, in each case we construct a set of signals whose nonzero components have the same amplitude $\mu$, and whose supports are “close” in the sense that the symmetric difference between the supports of any pair of distinct signals in the class is a set of cardinality two. In each case these signal classes are of the form

$$\mathcal{X}'_{\mu,k}(T^*) \triangleq \left\{\, x\in\mathbb{R}^n \,:\, x_i = \mu\,\mathbf{1}_{\{i\in T\}},\ T = T^*\cup\{j\},\ j\in\mathcal{N}(T^*) \,\right\}, \tag{18}$$

for some (specific) $T^* \in \mathcal{T}_{n,k-1}$, where $\mathcal{N}(T^*)$ is as defined above. This allows us to reduce our problem to the consideration of a hypothesis testing problem over a countable (and finite) number of elements.

#### II-A1 Case 1: k = 2

We begin by choosing $T^*$ to be an element of $\mathcal{T}_{n,1}$ for which $|\mathcal{N}(T^*)| = 2$, and for this $T^*$ we form the set $\mathcal{X}'_{\mu,2}(T^*)$ of the form (18) above. (Note that this is a somewhat degenerate scenario – here, $T^*$ can be chosen to be the set that contains only the index of the root node of the underlying tree. Further, $k=2$ implies $\ell=1$ here, and since the underlying tree is assumed nearly complete, it follows that the root node has two descendants in the underlying tree.) It follows from the definition of $\mathcal{N}(T^*)$ that $\mathcal{X}'_{\mu,2}(T^*)$ is a set of signals whose supports are each an element of $\mathcal{T}_{n,2}$, and since each nonzero element has amplitude exactly equal to $\mu$, it follows that every $x \in \mathcal{X}'_{\mu,2}(T^*)$ is also an element of the class of signals defined in (8) when $k=2$. Thus, we have overall that $\mathcal{X}'_{\mu,2}(T^*) \subset \mathcal{X}_{\mu,2}$. Now, our approach is to obtain lower bounds on the minimax risk when $k=2$ by considering the minimax risk over the set $\mathcal{X}'_{\mu,2}(T^*)$, which ultimately corresponds to assessing the error performance of a hypothesis testing problem with two simple hypotheses.

Our analysis relies on a result of Tsybakov [45, Theorem 2.2], which provides lower bounds on the minimax probability of error for a binary hypothesis testing problem. We state that result here as a lemma.

###### Lemma II.1 (Tsybakov).

Let $P_0$ and $P_1$ be probability distributions (on a common measurable space) for which the Kullback-Leibler (KL) divergence satisfies $K(P_1,P_0) \le \alpha$. Then, the minimax probability of error over all (measurable) tests $\psi$ that map observations to an element of the set $\{0,1\}$, given by

$$p_{e,1} \triangleq \inf_{\psi}\,\max_{j=0,1}\,\Pr_j(\psi \neq j), \tag{19}$$

where $\Pr_j$ denotes probability with respect to the distribution induced on the observations when hypothesis $j$ is the correct hypothesis, obeys the bound

$$p_{e,1} \;\ge\; \max\left\{ \frac{1}{4}\exp(-\alpha),\; \frac{1-\sqrt{\alpha/2}}{2} \right\} \;\ge\; \frac{1-\sqrt{\alpha/2}}{2}. \tag{20}$$

In order to apply this result in our setting, we first need to evaluate the KL divergence $K(P_1,P_0)$, where $P_0$ and $P_1$ are the distributions that characterize our testing problem of identifying which of the two unique elements $x_0, x_1 \in \mathcal{X}'_{\mu,2}(T^*)$, respectively, was observed. Now, under the assumption here that the elements of each measurement vector are (iid) Gaussian distributed, we have that the KL divergence can be expressed in terms of the corresponding probability densities $f_0$ and $f_1$ as

$$K(P_1,P_0) = \mathbb{E}_1\left[\log\left(\frac{f_1(A^m,y^m)}{f_0(A^m,y^m)}\right)\right], \tag{21}$$

which is just the expectation of the log-likelihood ratio with respect to the distribution $P_1$.

It follows from the assumptions of our measurement model, specifically that the measurement vectors and noises are mutually independent, that each of the densities $f_p$, $p\in\{0,1\}$, can be factored in the form

$$f_p(A^m,y^m) = \prod_{i=1}^{m} f(a_i)\, f_p(y_i \,|\, a_i), \tag{22}$$

where each $f(a_i)$ is a multivariate Gaussian density and $f_p(y_i\,|\,a_i)$ is a (signal-dependent) conditional density of the observation $y_i$ given the measurement vector $a_i$. Note that the conditional densities of $y_i$ given $a_i$ are also Gaussian because of the additive noise modeling assumptions. Overall, the log-likelihood ratio in (21) can be simplified as

$$\begin{aligned}\log\left(\frac{f_1(A^m,y^m)}{f_0(A^m,y^m)}\right) &= \sum_{i=1}^{m}\log\left(\frac{f_1(y_i\,|\,a_i)}{f_0(y_i\,|\,a_i)}\right) \\ &= \sum_{i=1}^{m}\frac{(y_i-a_i^Tx_0)^2-(y_i-a_i^Tx_1)^2}{2\sigma^2} \\ &= \sum_{i=1}^{m}\frac{(a_i^Tx_0)^2-2y_ia_i^Tx_0-(a_i^Tx_1)^2+2y_ia_i^Tx_1}{2\sigma^2}. \end{aligned} \tag{23}$$

Now, using the fact that under the distribution $P_1$ we have $y_i = a_i^Tx_1 + w_i$ for $i=1,\ldots,m$, and that the noise is zero mean and independent of $a_i$, we can simplify the expression (21) as

$$K(P_1,P_0) = \mathbb{E}_1\left[\sum_{i=1}^{m}\frac{\left(a_i^T(x_1-x_0)\right)^2}{2\sigma^2}\right].$$

Note that by the construction of $\mathcal{X}'_{\mu,2}(T^*)$, the vector $x_1-x_0$ has exactly two nonzero elements, each having amplitude $\mu$ (but with different signs). It follows that $\mathbb{E}_1\left[\left(a_i^T(x_1-x_0)\right)^2\right] = 2\mu^2/n$ for each $i=1,\ldots,m$, and thus the KL divergence can be expressed simply as

$$K(P_1,P_0) = \frac{m\mu^2}{n\sigma^2}. \tag{24}$$

Letting $\alpha = K(P_1,P_0) = m\mu^2/(n\sigma^2)$, it is easy to see from (20) that if $\alpha \le 2(1-2\gamma)^2$, or equivalently, if

$$\mu \;\le\; \sqrt{2(1-2\gamma)^2}\cdot\sqrt{\sigma^2\left(\frac{n}{m}\right)} \;=\; \sqrt{\frac{2(1-2\gamma)^2}{\log 2}}\cdot\sqrt{\sigma^2\left(\frac{n}{m}\right)\log 2}, \tag{25}$$

for any $\gamma \in (0,1/2)$, then $p_{e,1} \ge \gamma$.
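The closed-form KL divergence (24) is easy to sanity-check numerically. The sketch below (our own illustration, under the assumption that the measurement vector entries are iid $N(0,1/n)$, so that $\mathbb{E}[(a_i^Tv)^2] = \|v\|^2/n$) averages the log-likelihood ratio (23) under $P_1$ by Monte Carlo and compares it to $m\mu^2/(n\sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, mu, sigma = 8, 200, 1.0, 0.5

# Two candidate signals whose supports differ in one location
# (symmetric difference of cardinality two), as in the construction (18):
# support {root, left child} vs. {root, right child}.
x0, x1 = np.zeros(n), np.zeros(n)
x0[[0, 1]] = mu
x1[[0, 2]] = mu

# Empirically average the log-likelihood ratio under P1, with
# measurement vector entries assumed iid N(0, 1/n).
trials, total = 2000, 0.0
for _ in range(trials):
    A = rng.normal(0.0, np.sqrt(1.0 / n), size=(m, n))
    y = A @ x1 + rng.normal(0.0, sigma, size=m)
    total += np.sum((y - A @ x0) ** 2 - (y - A @ x1) ** 2) / (2 * sigma ** 2)

kl_mc = total / trials
kl_closed = m * mu ** 2 / (n * sigma ** 2)
print(kl_mc, kl_closed)  # the two should agree closely
```

The variance convention for the sensing vectors is our assumption; it is the one under which the stated expectation $2\mu^2/n$ (and hence (24)) holds.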

#### II-A2 Case 2: 3 ≤ k ≤ (n+1)/2

Analogously to the $k=2$ case, we begin by choosing $T^*$ to be an element of $\mathcal{T}_{n,k-1}$ for which $|\mathcal{N}(T^*)| = k$ (the existence of such an element is established by Lemma A.2 in the appendix) and constructing the set $\mathcal{X}'_{\mu,k}(T^*)$ to be of the form (18). As in the previous case, it follows here that $\mathcal{X}'_{\mu,k}(T^*) \subset \mathcal{X}_{\mu,k}$, so our approach here ultimately corresponds to assessing the error performance of a multiple hypothesis testing problem with $k$ simple hypotheses.

We again employ a result of Tsybakov [45, Proposition 2.3], which provides lower bounds on the minimax probability of error for a hypothesis testing problem deciding among $L+1$ hypotheses. We state that result here as a lemma.

###### Lemma II.2 (Tsybakov).

Let $P_0, P_1, \ldots, P_L$ be probability distributions (on a common measurable space) satisfying

$$\frac{1}{L}\sum_{j=1}^{L}K(P_j,P_0) \;\le\; \alpha \tag{26}$$

with $0 < \alpha < \infty$. Then, the minimax probability of error over all (measurable) tests $\psi$ that map observations to an element of the set $\{0,1,\ldots,L\}$, given by

$$p_{e,L} \triangleq \inf_{\psi}\,\max_{0\le j\le L}\,\Pr_j(\psi \neq j), \tag{27}$$

obeys the bound

$$p_{e,L} \;\ge\; \sup_{0<\tau<1}\left[\frac{\tau L}{1+\tau L}\left(1+\frac{\alpha+\sqrt{\alpha/2}}{\log\tau}\right)\right]. \tag{28}$$

As in the previous case, we again need to evaluate KL divergences, this time for pairs of distributions $P_j$ and $P_0$, $j=1,\ldots,L$, induced by signals in $\mathcal{X}'_{\mu,k}(T^*)$. The computation of each KL divergence mirrors the derivation in the previous case; overall, it is straightforward to show that

$$\frac{1}{L}\sum_{j=1}^{L}K(P_j,P_0) = \frac{m\mu^2}{n\sigma^2}. \tag{29}$$

Now, note that we can lower-bound the supremum term in the minimum probability of error expression (28) by evaluating the right-hand side at any fixed $\tau \in (0,1)$. Since our test is over $k$ hypotheses we let $L = k-1$ here. Further, since we consider the case $k \ge 3$ here, we have that $L \ge 2$, so we can choose $\tau = 1/\sqrt{L}$ to obtain that under the conditions of Lemma II.2,

$$p_{e,L} \;\ge\; \frac{\sqrt{L}}{1+\sqrt{L}}\left(1+\frac{\alpha+\sqrt{\alpha/2}}{\log(1/\sqrt{L})}\right) \;\ge\; \frac{1}{2}\left(1-\frac{2\alpha+\sqrt{2\alpha}}{\log L}\right) \;=\; \frac{1}{2}\left(1-\frac{2\alpha+\sqrt{2\alpha}}{\log(k-1)}\right). \tag{30}$$
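The chain of inequalities above can be spot-checked numerically in the regime where the bound is nontrivial (i.e., where the final expression is positive). A small sketch, using the choices $\tau = 1/\sqrt{L}$ and $L = k-1$ from the surrounding argument (the function name is our own):

```python
import math

def tsybakov_bound(tau, L, alpha):
    """Bracketed expression in (28) evaluated at a particular tau in (0,1)."""
    return (tau * L / (1 + tau * L)) * (
        1 + (alpha + math.sqrt(alpha / 2)) / math.log(tau)
    )

# With tau = 1/sqrt(L), the first expression in (30) dominates the
# simplified bound (1/2)(1 - (2*alpha + sqrt(2*alpha)) / log(L)).
for k in (3, 5, 50):
    L = k - 1
    for alpha in (0.01, 0.05, 0.1):
        lhs = tsybakov_bound(1 / math.sqrt(L), L, alpha)
        rhs = 0.5 * (1 - (2 * alpha + math.sqrt(2 * alpha)) / math.log(L))
        assert lhs >= rhs - 1e-12
print("chain of inequalities in (30) verified on sample points")
```

The check uses the identity $2(\alpha + \sqrt{\alpha/2}) = 2\alpha + \sqrt{2\alpha}$ together with $\sqrt{L}/(1+\sqrt{L}) \ge 1/2$ for $L \ge 1$.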

Now, note that for any $\gamma \in (0,1/2)$, we have $p_{e,L} \ge \gamma$ whenever $(2\alpha+\sqrt{2\alpha})/\log(k-1) \le 1-2\gamma$, or equivalently, whenever $\alpha$ satisfies

$$0 \;\le\; \sqrt{\alpha} \;\le\; \frac{\sqrt{2+8(1-2\gamma)\log(k-1)}-\sqrt{2}}{4}, \tag{31}$$

which follows from the monotonicity of the function $\alpha \mapsto 2\alpha+\sqrt{2\alpha}$ and a straightforward application of the quadratic formula (in the variable $\sqrt{\alpha}$).
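This quadratic-formula step can likewise be verified numerically: with $\sqrt{\alpha}$ set to the right-hand side of (31), the constraint $2\alpha+\sqrt{2\alpha} \le (1-2\gamma)\log(k-1)$ holds with equality. A quick check (the helper name is our own):

```python
import math

def alpha_threshold(gamma, k):
    """Right-hand side of (31): the largest sqrt(alpha) for which
    2*alpha + sqrt(2*alpha) <= (1 - 2*gamma) * log(k - 1)."""
    c = (1 - 2 * gamma) * math.log(k - 1)
    return (math.sqrt(2 + 8 * c) - math.sqrt(2)) / 4

# At the threshold, the constraint holds with equality (up to rounding).
for gamma, k in [(0.1, 3), (0.25, 10), (0.4, 100)]:
    s = alpha_threshold(gamma, k)
    alpha = s ** 2
    c = (1 - 2 * gamma) * math.log(k - 1)
    assert abs(2 * alpha + math.sqrt(2 * alpha) - c) < 1e-9
print("threshold in (31) attains equality on sample points")
```

Algebraically, substituting $t = \sqrt{2\alpha}$ turns the constraint into $t^2 + t \le (1-2\gamma)\log(k-1)$, whose positive root gives exactly the expression in (31).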

As in the previous case, we let $\alpha = m\mu^2/(n\sigma^2)$ and simplify to obtain that $p_{e,L} \ge \gamma$ whenever

$$\mu \;\le\; \left[\sqrt{1+\frac{1}{f_{\gamma,k}}}-\sqrt{\frac{1}{f_{\gamma,k}}}\right]\sqrt{\frac{1-2\gamma}{2}}\cdot\sqrt{\sigma^2\left(\frac{n}{m}\right)\log(k-1)}, \tag{32}$$

where $f_{\gamma,k} \triangleq 4(1-2\gamma)\log(k-1)$. Now, for the range of $\gamma$ and $k$ values we consider here we have that $f_{\gamma,k} \ge 1$, implying (after a straightforward calculation) that the term in square brackets in (32) is always greater than $2/5$. Thus, we see that $p_{e,L} \ge \gamma$ whenever

$$\mu \;\le\; \sqrt{\frac{2(1-2\gamma)}{25}}\cdot\sqrt{\sigma^2\left(\frac{n}{m}\right)\log(k-1)}. \tag{33}$$

We make one more simplification, using the fact that $\log(k-1) \ge \frac{1}{2}\log k$ when $k \ge 3$, to claim that if

$$\mu \;\le\; \sqrt{\frac{1-2\gamma}{25}}\cdot\sqrt{\sigma^2\left(\frac{n}{m}\right)\log k}, \tag{34}$$

then $p_{e,L} \ge \gamma$.

#### II-A3 Putting the Results Together

In order to combine the results from the previous two cases into one concise form, we first note that for the range of $\gamma$ considered here,

$$\sqrt{\frac{1-2\gamma}{25}} \;<\; \sqrt{\frac{2(1-2\gamma)^2}{\log 2}}. \tag{35}$$

With this, we can claim overall that for any $k$ with $2 \le k \le (n+1)/2$, if for some $\gamma$ in the specified range,

$$\mu \;\le\; \sqrt{\frac{1-2\gamma}{25}}\cdot\sqrt{\sigma^2\left(\frac{n}{m}\right)\log k}, \tag{36}$$

then the minimax risk over the class of $k$-tree sparse signals defined in (8) satisfies $\mathcal{R}^*_{\mathcal{X}_{\mu,k},M} \ge \gamma$, as claimed.

### II-B Proof of Theorem I.2

Our proof approach in this scenario leverages an essential result from recent efforts characterizing the fundamental limits of support recovery for one-sparse $n$-dimensional vectors [11]. In order to put the results of that work into context here, let us define a class of one-sparse $n$-dimensional vectors as

$$\mathcal{X}^{(1)}_{\mu} \triangleq \left\{\, x\in\mathbb{R}^n \,:\, x_i = \alpha_i\,\mathbf{1}_{\{i\in T\}},\ |\alpha_i| \ge \mu > 0,\ T \in [n] \,\right\}, \tag{37}$$

where $[n] \triangleq \{1,2,\ldots,n\}$. Note that we use slightly different notation for this signal class to distinguish it from the tree-sparse classes described above. In particular, signals in the class (37) could have their single nonzero at any location in $[n]$, while in contrast, one-sparse signals that are also tree-sparse must be such that their single nonzero occurs at the root of the underlying tree.

In terms of the definition (37) above, the results of [11] (see also the discussion following [52, Theorem 2]) can be summarized as a lemma.

###### Lemma II.3.

The minimax risk

$$\mathcal{R}^*_{\mathcal{X}^{(1)}_{\mu},\mathcal{M}_m} = \inf_{\psi;\,M\in\mathcal{M}_m}\ \sup_{x\in\mathcal{X}^{(1)}_{\mu}} \Pr_x\!\left(\psi(A^m,y^m;M) \neq \mathcal{S}(x)\right) \tag{38}$$

over all support estimators $\psi$ and sensing strategies $M \in \mathcal{M}_m$ satisfies the bound

$$\mathcal{R}^*_{\mathcal{X}^{(1)}_{\mu},\mathcal{M}_m} \;\ge\; \frac{1}{2}\left(1-\sqrt{\frac{m\mu^2}{n\sigma^2}}\right). \tag{39}$$

It follows directly from this result that if

$$\mu \;\le\; (1-2\gamma)\sqrt{\sigma^2\left(\frac{n}{m}\right)} \tag{40}$$

for some $\gamma \in (0,1/2)$, then $\mathcal{R}^*_{\mathcal{X}^{(1)}_{\mu},\mathcal{M}_m} \ge \gamma$. We proceed here by showing (formally) that our problem of interest – recovering the support of a $k$-tree sparse $n$-dimensional vector using any estimator and any adaptive sensing strategy – is at least as difficult as recovering the support of a one-sparse vector in some $k$-dimensional space using any estimator and any sensing strategy. Then, we adapt the result of Lemma II.3 to establish Theorem I.2.
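As a quick algebraic check of this claim: substituting the threshold value of $\mu$ from (40) into the right-hand side of (39) yields exactly $\gamma$. A minimal sketch (the helper name is our own):

```python
import math

def risk_lower_bound(mu, m, n, sigma):
    """Right-hand side of (39): the minimax risk lower bound
    for the one-sparse class, as a function of mu."""
    return 0.5 * (1 - math.sqrt(m * mu ** 2 / (n * sigma ** 2)))

# With mu at the threshold (40), the bound (39) evaluates to gamma:
# m*mu^2/(n*sigma^2) = (1 - 2*gamma)^2, so the risk bound is
# (1/2)(1 - (1 - 2*gamma)) = gamma.
m, n, sigma = 64, 1024, 1.0
for gamma in (0.05, 0.25, 0.45):
    mu = (1 - 2 * gamma) * math.sqrt(sigma ** 2 * n / m)
    assert abs(risk_lower_bound(mu, m, n, sigma) - gamma) < 1e-9
print("threshold (40) recovers risk level gamma in (39)")
```

The bound decreases monotonically in $\mu$, so any $\mu$ below the threshold (40) gives risk at least $\gamma$, which is the form in which the lemma is used below.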

We will find it useful in the derivation that follows to introduce an alternative, but equivalent, notation to describe the support estimators and signal supports. Namely, we associate with any support estimator $\psi$ a corresponding $n$-dimensional vector-valued function $\varphi$, such that each support estimate corresponds to a vector whose elements are given by

$$\varphi_i(A^m,y^m;M) = \mathbf{1}_{\{i\in\psi(A^m,y^m;M)\}}, \tag{41}$$

for $i=1,\ldots,n$. Similarly, we can interpret the signal support of any vector $x$ in terms of an $n$-dimensional binary vector with elements

$$\mathcal{S}_i(x) = \mathbf{1}_{\{i\in\mathcal{S}(x)\}}, \tag{42}$$

for $i=1,\ldots,n$.

As in the proof of Theorem I.1, for any fixed $k$ we choose $T^*$ to be an element of $\mathcal{T}_{n,k-1}$ for which $|\mathcal{N}(T^*)| = k$, and let $\mathcal{X}'_{\mu,k}(T^*)$ be of the form (18). Now, observe

$$\begin{aligned}\sup_{x\in\mathcal{X}_{\mu,k}} \Pr_x\!\left(\psi(A^m,y^m;M)\neq\mathcal{S}(x)\right) &\ge \sup_{x\in\mathcal{X}'_{\mu,k}(T^*)} \Pr_x\!\left(\psi(A^m,y^m;M)\neq\mathcal{S}(x)\right) \\ &= \sup_{x\in\mathcal{X}'_{\mu,k}(T^*)} \Pr_x\!\left(\bigcup_{i=1}^{n}\left\{\varphi_i(A^m,y^m;M)\neq\mathcal{S}_i(x)\right\}\right) \\ &\ge \sup_{x\in\mathcal{X}'_{\mu,k}(T^*)} \Pr_x\!\left(\bigcup_{i\in\mathcal{I}}\left\{\varphi_i(A^m,y^m;M)\neq\mathcal{S}_i(x)\right\}\right), \end{aligned} \tag{43}$$

where $\mathcal{I}$ in the last line is any subset of $\{1,2,\ldots,n\}$. In particular, this implies that

$$\mathcal{R}^*_{\mathcal{X}_{\mu,k},\mathcal{M}_m} \;\ge\; \inf_{\varphi;\,M\in\mathcal{M}_m}\ \sup_{x\in\mathcal{X}'_{\mu,k}(T^*)} \Pr_x\!\left(\bigcup_{i\in\mathcal{N}(T^*)}E_i\right), \tag{44}$$

where $E_i$ denotes the event $\{\varphi_i(A^m,y^m;M)\neq\mathcal{S}_i(x)\}$. Now, since for any signal $x \in \mathcal{X}'_{\mu,k}(T^*)$ the collection $\{\mathcal{S}_i(x)\}_{i\in\mathcal{N}(T^*)}$ contains exactly one ‘$1$’ and $k-1$ zeros, it follows that the right-hand side of (44) is equivalent to the minimax risk associated with the task of recovering the support of a one-sparse $k$-dimensional vector whose single nonzero element has amplitude $\mu$, in settings where measurements can be obtained via any (possibly adaptive) sensing strategy $M \in \mathcal{M}_m$. Thus, we can employ the result of Lemma II.3 to conclude that

 R∗Xμ,k,Mm