High Dimensional Structured Superposition Models

Abstract

High dimensional superposition models characterize observations using parameters which can be written as a sum of multiple component parameters, each with its own structure, e.g., sum of low rank and sparse matrices, sum of sparse and rotated sparse vectors, etc. In this paper, we consider general superposition models which allow sum of any number of component parameters, and each component structure can be characterized by any norm. We present a simple estimator for such models, give a geometric condition under which the components can be accurately estimated, characterize sample complexity of the estimator, and give high probability non-asymptotic bounds on the componentwise estimation error. We use tools from empirical processes and generic chaining for the statistical analysis, and our results, which substantially generalize prior work on superposition models, are in terms of Gaussian widths of suitable sets.

1 Introduction

For high-dimensional structured estimation problems [7, 27], considerable advances have been made in accurately estimating a sparse or structured parameter $\theta^* \in \mathbb{R}^p$ even when the sample size $n$ is far smaller than the ambient dimensionality $p$ of $\theta^*$, i.e., $n \ll p$. Instead of a single structure, such as sparsity or low rank, recent years have seen interest in parameter estimation when the parameter is a superposition or sum of multiple different structures, i.e., $\theta^* = \sum_{i=1}^k \theta_i^*$, where $\theta_1^*$ may be sparse, $\theta_2^*$ may be low rank, and so on [1, 8, 9, 11, 15, 16, 18, 19, 31, 32].

In this paper, we substantially generalize the non-asymptotic estimation error analysis for such superposition models such that (i) the parameter $\theta^*$ can be the superposition of any number $k$ of component parameters $\theta_i^*$, and (ii) the structure in each $\theta_i^*$ can be captured by any suitable norm $\|\cdot\|_{(i)}$. We will analyze the following linear measurement based superposition model:

$$y \;=\; X \sum_{i=1}^{k} \theta_i^* + \omega~, \qquad (1)$$

where $X \in \mathbb{R}^{n \times p}$ is a random sub-Gaussian design or compressive matrix, $k$ is the number of components, $\theta_i^* \in \mathbb{R}^p$ is one component of the unknown parameter, $y \in \mathbb{R}^n$ is the response vector, and $\omega \in \mathbb{R}^n$ is random noise independent of $X$. The structure in each component $\theta_i^*$ is captured by a suitable norm $\|\cdot\|_{(i)}$, chosen so that $\|\theta_i^*\|_{(i)}$ has a small value, e.g., sparsity captured by the $\ell_1$ norm, low rank (for a matrix component) captured by the nuclear norm, etc. Popular models such as Morphological Component Analysis (MCA) [14] and Robust PCA [8, 11] can be viewed as special cases of this framework (see Section 9).

The superposition estimation problem can be posed as follows: given $(y, X)$ generated following (1), estimate component parameters $\hat{\theta}_i$ such that all the componentwise estimation errors $\|\hat{\theta}_i - \theta_i^*\|_2$, where the $\theta_i^*$ are the true (population) parameters, are small. Ideally, we want to obtain high-probability non-asymptotic bounds on the total componentwise error measured as $\sum_{i=1}^k \|\hat{\theta}_i - \theta_i^*\|_2$, with the bound improving (getting smaller) as the number of samples $n$ increases.
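
To make the setup concrete, the following Python sketch generates data from model (1) for the common $k = 2$ case of a sparse component plus a component that is sparse under a rotation; the dimensions, sparsity levels, noise scale, and the orthogonal matrix R are illustrative choices rather than values prescribed by the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 200, 500            # sample size and ambient dimension (illustrative)
    s1, s2 = 5, 5              # sparsity of the two components (illustrative)

    # Component 1: an s1-sparse vector.
    theta1 = np.zeros(p)
    theta1[rng.choice(p, s1, replace=False)] = rng.standard_normal(s1)

    # Component 2: sparse under an orthogonal rotation R, i.e. theta2 = R @ (sparse vector).
    R, _ = np.linalg.qr(rng.standard_normal((p, p)))
    z = np.zeros(p)
    z[rng.choice(p, s2, replace=False)] = rng.standard_normal(s2)
    theta2 = R @ z

    # Sub-Gaussian (here Gaussian) design and noise, following model (1).
    X = rng.standard_normal((n, p))
    omega = 0.1 * rng.standard_normal(n)
    y = X @ (theta1 + theta2) + omega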

We propose the following estimator for the superposition model in (1):

$$\{\hat{\theta}_i\} \;=\; \operatorname*{argmin}_{\theta_1, \ldots, \theta_k} \ \Big\| y - X \sum_{i=1}^{k} \theta_i \Big\|_2^2 \quad \text{s.t.} \quad \|\theta_i\|_{(i)} \le \alpha_i~, \ \ i = 1, \ldots, k, \qquad (2)$$

where the $\alpha_i$ are suitable constants. In this paper, we focus on the case where $\alpha_i = \|\theta_i^*\|_{(i)}$, noting that recent advances [22] can be used to extend our results to more general settings.

The superposition estimator in (2) succeeds if a certain geometric condition, which we call structural coherence (SC), is satisfied by certain sets (cones) associated with the component norms $\|\cdot\|_{(i)}$. Since the estimate $\hat{\theta}_i$ is in the feasible set of the optimization problem (2), the error vector $\Delta_i = \hat{\theta}_i - \theta_i^*$ satisfies the constraint $\|\theta_i^* + \Delta_i\|_{(i)} \le \alpha_i$, where $\alpha_i = \|\theta_i^*\|_{(i)}$. The SC condition is a geometric relationship between the corresponding error cones (see Section 3).

If SC is satisfied, then we can show that the sum of componentwise estimation errors can be bounded with high probability, and the bound takes the form:

$$\sum_{i=1}^{k} \|\hat{\theta}_i - \theta_i^*\|_2 \;\le\; C\, \frac{\max_{i} w(\mathcal{C}_i \cap \mathbb{B}^p) + \sqrt{\log k}}{\sqrt{n}}~, \qquad (3)$$

where $n$ is the sample size, $k$ is the number of components, and $w(\mathcal{C}_i \cap \mathbb{B}^p)$ is the Gaussian width [3, 10, 30] of the intersection of the error cone $\mathcal{C}_i$ (formally defined in Section 2) with the unit Euclidean ball $\mathbb{B}^p$. Interestingly, the estimation error converges at the rate $1/\sqrt{n}$, similar to the case of single-parameter estimators [21, 3], and depends only logarithmically on the number of components $k$. Further, while the dependence of the error on the Gaussian width of the error cone has been established in recent results involving a single parameter [3, 30], the bound in (3) depends on the maximum of the Gaussian widths of the individual error cones, not their sum. The analysis thus gives a general way to construct estimators for superposition problems along with high-probability non-asymptotic upper bounds on the sum of componentwise errors. To show the generality of our work, we provide a detailed review and comparison with related work in Section 8.

Notation: We use $\|\cdot\|$ to denote a generic vector norm, with separate notation for operator norms. For example, $\|\cdot\|_2$ is the Euclidean norm of a vector (or the Frobenius norm of a matrix), and $\|\cdot\|_*$ is the nuclear norm of a matrix. We write $\mathrm{cone}(E)$ for the smallest closed cone that contains a given set $E$, and $\langle \cdot, \cdot \rangle$ for the inner product.

The rest of this paper is organized as follows: We start with a deterministic estimation error bound in Section 2, laying down the key geometric and statistical quantities involved in the analysis. In Section 3, we discuss the geometry of the structural coherence (SC) condition, and in Section 4 we show that the geometric SC condition implies the statistical restricted eigenvalue (RE) condition with high probability. In Section 5, we develop the main error bound on the sum of componentwise errors, which holds with high probability for sub-Gaussian designs and noise. In Section 6, we present a general-purpose optimization algorithm for the estimator (2). In Section 7, we compare an estimator based on the "infimal convolution" [25] of norms with our estimator (2) in the noiseless case. We discuss related work in Section 8, apply our error bound to practical problems in Section 9, present experimental results in Section 10, and conclude in Section 11. The proofs of all technical results are in the Appendix.

2 Error Structure and Recovery Guarantees

In this section, we start with some basic results and, under suitable assumptions, provide a deterministic bound for the componentwise estimation error in superposition models. Subsequently, we will show that the assumptions made here hold with high probability as long as a purely geometric non-probabilistic condition characterized by structural coherence (SC) is satisfied.

Let $\{\hat{\theta}_i\}$ be a solution to the superposition estimation problem in (2), and let $\{\theta_i^*\}$ be the optimal (population) parameters involved in the true data generation process. Let $\Delta_i = \hat{\theta}_i - \theta_i^*$ be the error vector for component $i$ of the superposition. Our goal is to provide a preliminary understanding of the structure of the error sets where the $\Delta_i$ live, identify conditions under which a bound on the total componentwise error will hold, and provide a preliminary version of such a bound, which will be subsequently refined to the form in (3) in Section 5. Since $\{\hat{\theta}_i\}$ lies in the feasible set of (2), as discussed in Section 1, the error vectors $\Delta_i$ will lie in the error sets $E_i = \{\Delta : \|\theta_i^* + \Delta\|_{(i)} \le \|\theta_i^*\|_{(i)}\}$ respectively. For the analysis, we will focus on the cones of such error sets, given by

$$\mathcal{C}_i \;=\; \mathrm{cone}(E_i) \;=\; \mathrm{cone}\big\{\Delta : \|\theta_i^* + \Delta\|_{(i)} \le \|\theta_i^*\|_{(i)}\big\}~, \qquad i = 1, \ldots, k. \qquad (4)$$

Let $\Delta = \sum_{i=1}^k \Delta_i$, $\theta^* = \sum_{i=1}^k \theta_i^*$, and $\hat{\theta} = \sum_{i=1}^k \hat{\theta}_i$, so that $\hat{\theta} = \theta^* + \Delta$. From the optimality of $\{\hat{\theta}_i\}$ as a solution to (2), we have

$$\|X\Delta\|_2^2 \;\le\; 2\,\langle \omega, X\Delta \rangle~, \qquad (5)$$

using $y = X\theta^* + \omega$ and $\hat{\theta} = \theta^* + \Delta$. In order to establish recovery guarantees, under suitable assumptions we construct a lower bound on $\|X\Delta\|_2^2$, the left hand side of (5). The lower bound is a generalized form of the restricted eigenvalue (RE) condition studied in the literature [5, 7, 24]. We also construct an upper bound on $\langle \omega, X\Delta \rangle$, the right hand side of (5), which requires a careful analysis of the noise-design (ND) interaction, i.e., between the noise $\omega$ and the design $X$.

We start by assuming that a generalized form of the RE condition is satisfied by the superposition of errors: there exists a constant $\kappa > 0$ such that for all $\Delta_i \in \mathcal{C}_i$, $i = 1, \ldots, k$:

$$\frac{1}{\sqrt{n}}\,\Big\| X \sum_{i=1}^{k} \Delta_i \Big\|_2 \;\ge\; \kappa \sum_{i=1}^{k} \|\Delta_i\|_2~. \qquad (6)$$

The above RE condition considers the following set:

(7)

which involves all the error cones; the lower bound is over the sum of the norms of the componentwise errors. If $k = 1$, the RE condition in (6) simplifies to the widely studied RE condition in the current literature on Lasso-type and Dantzig-type estimators [5, 24, 3], where only one error cone is involved. If we set all components but the $i$-th to zero, then (6) becomes the RE condition for component $i$ alone. We also note that the general RE condition as explicitly stated in (6) has been implicitly used in [1] and [32]. For subsequent analysis, we introduce a second, closely related set defined as

(8)

noting the relationship between this set and the one in (7), which will be used in the sequel.
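
As an empirical sanity check on the RE condition (6), the sketch below evaluates the ratio $\frac{1}{\sqrt{n}}\|X(\Delta_1+\Delta_2)\|_2 \,/\, (\|\Delta_1\|_2+\|\Delta_2\|_2)$ over randomly drawn members of the $\ell_1$ error sets for a sparse-plus-sparse instance. The sampling scheme (random points inside the $\ell_1$ balls of radius $\|\theta_i^*\|_1$), the problem sizes, and the number of trials are all illustrative choices, and the sampled minimum only upper bounds the true infimum over the error cones.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, s = 100, 300, 4
    X = rng.standard_normal((n, p))

    def sparse_vec():
        v = np.zeros(p)
        v[rng.choice(p, s, replace=False)] = rng.standard_normal(s)
        return v

    def random_l1_ball_point(radius):
        # Random signs times a point of the probability simplex, scaled into the l1 ball.
        w = rng.dirichlet(np.ones(p)) * radius * rng.uniform()
        return rng.choice([-1.0, 1.0], size=p) * w

    theta1, theta2 = sparse_vec(), sparse_vec()

    ratios = []
    for _ in range(2000):
        # Delta_i lies in the error set E_i = {Delta : ||theta_i + Delta||_1 <= ||theta_i||_1}.
        d1 = random_l1_ball_point(np.abs(theta1).sum()) - theta1
        d2 = random_l1_ball_point(np.abs(theta2).sum()) - theta2
        num = np.linalg.norm(X @ (d1 + d2)) / np.sqrt(n)
        den = np.linalg.norm(d1) + np.linalg.norm(d2)
        ratios.append(num / den)

    print("smallest sampled RE ratio (an upper bound on kappa):", min(ratios))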

The general RE condition in (6) depends on the random design matrix $X$, and is hence an inequality which holds with a certain probability depending on $X$ and the set involved. For superposition problems, the probabilistic RE condition in (6) is intimately related to the following deterministic structural coherence (SC) condition on the interaction of the different component cones $\mathcal{C}_i$, without any explicit reference to the random design matrix $X$: there is a constant $\lambda > 0$ such that for all $\Delta_i \in \mathcal{C}_i$,

$$\Big\| \sum_{i=1}^{k} \Delta_i \Big\|_2 \;\ge\; \lambda \sum_{i=1}^{k} \|\Delta_i\|_2~. \qquad (9)$$

If $k = 1$, the SC condition is trivially satisfied with $\lambda = 1$. Since most of the existing literature on high-dimensional structured models focuses on the $k = 1$ setting [5, 24, 3], there was no reason to study the SC condition carefully. For $k \ge 2$, the SC condition (9) implies a non-trivial relationship among the component cones. In particular, if the SC condition is true, then the sum $\sum_i \Delta_i$ being zero implies that each component $\Delta_i$ must also be zero. As presented in (9), the SC condition comes across as an algebraic condition. In Section 3, we present a geometric characterization of the SC condition [18], and illustrate that the condition is both necessary and sufficient for accurate recovery of each component. In Section 4, we show that for sub-Gaussian design matrices $X$, the SC condition in (9) in fact implies that the RE condition in (6) holds with high probability once the number of samples crosses a certain sample complexity, which depends on the Gaussian widths of the component cones. For now, we assume the RE condition in (6) to hold, and proceed with the error bound analysis.

To establish the recovery guarantee, following (5), we need an upper bound on the interaction between the noise and the design [3, 20]. In particular, we consider the noise-design (ND) interaction

(10)

where is a constant, and is the scaled version of where the scaling factor is . Here, denotes the minimal scaling needed on such that one obtains a uniform bound over of the form: . Then, from the basic inequality in (5), with the bounds implied by the RE condition and the ND interaction, we have

(11)

which implies a bound on the component-wise error. The main deterministic bound below states the result formally:

Theorem 1 (Deterministic bound)

Assume that the RE condition in (6) is satisfied in with parameter . Then, if , we have .

The above bound is deterministic, and holds only when the RE condition in (6) is satisfied with a suitable constant $\kappa > 0$. In the sequel, we first give a geometric characterization of the SC condition in Section 3, and show that the SC condition implies the RE condition with high probability in Section 4. Further, in Section 5, we give a high-probability characterization of the ND interaction, based on the noise and the design, in terms of the Gaussian widths of the component cones, and illustrate how the associated constants can be chosen. With these characterizations, we will obtain the desired componentwise error bound of the form (3).

3 Geometry of Structural Coherence

In this section, we give a geometric characterization of the structural coherence (SC) condition in (9). We start with the simplest case of two vectors $x_1$ and $x_2$. If they are not reflections of each other, i.e., $x_1/\|x_1\|_2 \neq -\,x_2/\|x_2\|_2$, then the following relationship holds:

Proposition 2

If there exists a $\rho < 1$ such that $\langle -x_1, x_2 \rangle \le \rho\, \|x_1\|_2 \|x_2\|_2$, then

$$\|x_1 + x_2\|_2 \;\ge\; \sqrt{\frac{1-\rho}{2}}\, \big( \|x_1\|_2 + \|x_2\|_2 \big)~. \qquad (12)$$
Figure 1: Geometry of the SC condition when $k = 2$. The error sets $E_1$ and $E_2$ are respectively shown as blue and green squares, and the corresponding error cones are $\mathcal{C}_1$ and $\mathcal{C}_2$ respectively. $-\mathcal{C}_1$ is the reflection of error cone $\mathcal{C}_1$. If $-\mathcal{C}_1$ and $\mathcal{C}_2$ do not share a ray, i.e., the minimal angle between the cones is strictly positive, then $\rho < 1$, and the SC condition will hold.

Next, we generalize the condition of Proposition 2 to vectors in two different cones $\mathcal{C}_1$ and $\mathcal{C}_2$. Given the cones, define

$$\rho \;=\; \sup\Big\{ \frac{\langle -x_1, x_2 \rangle}{\|x_1\|_2\, \|x_2\|_2} \ :\ x_1 \in \mathcal{C}_1 \setminus \{0\},\ x_2 \in \mathcal{C}_2 \setminus \{0\} \Big\}~. \qquad (13)$$

By construction, $\langle -x_1, x_2 \rangle \le \rho\, \|x_1\|_2 \|x_2\|_2$ for all $x_1 \in \mathcal{C}_1$ and $x_2 \in \mathcal{C}_2$. If $\rho < 1$, then (12) continues to hold for all $x_1 \in \mathcal{C}_1$ and $x_2 \in \mathcal{C}_2$ with constant $\sqrt{(1-\rho)/2}$. Note that this corresponds to the SC condition with $k = 2$ and $\lambda = \sqrt{(1-\rho)/2}$. We can interpret this geometrically as follows: first reflect cone $\mathcal{C}_1$ to get $-\mathcal{C}_1$; then $\rho$ is the cosine of the minimum angle between $-\mathcal{C}_1$ and $\mathcal{C}_2$. If $\rho = 1$, then $-\mathcal{C}_1$ and $\mathcal{C}_2$ share a ray, and structural coherence does not hold. Otherwise $\rho < 1$, implying $-\mathcal{C}_1 \cap \mathcal{C}_2 = \{0\}$, i.e., the two cones intersect only at the origin, and structural coherence holds.
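
The quantity $\rho$ in (13) can be estimated numerically when the two cones are specified through finite generator sets (nonnegative combinations of given vectors); the sketch below samples random rays from two such hypothetical cones and reports the largest observed value of $\langle -u_1, u_2 \rangle$ over unit vectors $u_1 \in \mathcal{C}_1$, $u_2 \in \mathcal{C}_2$. The generator matrices are illustrative, and the sampled maximum only lower bounds the true supremum, so the implied SC constant is an optimistic estimate.

    import numpy as np

    rng = np.random.default_rng(2)
    p, m = 50, 10

    # Hypothetical cones: nonnegative combinations of the columns of G1 and G2.
    G1 = rng.standard_normal((p, m))
    G2 = rng.standard_normal((p, m))

    def random_ray(G):
        v = G @ rng.uniform(size=G.shape[1])   # a random nonnegative combination
        return v / np.linalg.norm(v)

    rho_hat = max(np.dot(-random_ray(G1), random_ray(G2)) for _ in range(20000))
    print("sampled estimate of rho (cosine of the minimal angle between -C1 and C2):", rho_hat)
    print("implied (optimistic) SC constant sqrt((1 - rho)/2):", np.sqrt((1 - rho_hat) / 2))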

For the general case involving $k$ cones, denote

$$\rho_i \;=\; \sup\Big\{ \frac{\langle -x_i, v \rangle}{\|x_i\|_2\, \|v\|_2} \ :\ x_i \in \mathcal{C}_i \setminus \{0\},\ v \in \textstyle\sum_{j \neq i} \mathcal{C}_j \setminus \{0\} \Big\}~, \qquad i = 1, \ldots, k. \qquad (14)$$

In recent work, [18] concluded that if $\rho_i < 1$ for each $i$, then $-\mathcal{C}_i$ and $\sum_{j \neq i} \mathcal{C}_j$ do not share a ray, and the original signal can be recovered in the noiseless case. We show that this condition in fact yields a strictly positive constant for the SC condition in (9), which is sufficient for accurate recovery even in the noisy case. In particular, with $\bar{\rho} = \max_i \rho_i$, we have the following result:

Theorem 3 (Structural Coherence (SC) Condition)

Let $\bar{\rho} = \max_{1 \le i \le k} \rho_i$ with $\rho_i$ as defined in (14). If $\bar{\rho} < 1$, there exists a $\lambda > 0$ such that for any $\Delta_i \in \mathcal{C}_i$, $i = 1, \ldots, k$, the SC condition in (9) holds, i.e.,

$$\Big\| \sum_{i=1}^{k} \Delta_i \Big\|_2 \;\ge\; \lambda \sum_{i=1}^{k} \|\Delta_i\|_2~. \qquad (15)$$

Thus, the SC condition is satisfied in the general case as long as the reflection of any cone does not intersect (i.e., share a ray with) the Minkowski sum of the other cones.

4 Restricted Eigenvalue Condition for Superposition Models

Assuming that the SC condition is satisfied by the error cones $\mathcal{C}_i$, in this section we show that the general RE condition in (6) is satisfied with high probability once the number of samples $n$ in the sub-Gaussian design matrix $X$ crosses a certain sample complexity. We give a precise characterization of this sample complexity in terms of Gaussian widths of sets built from the error cones.

Our analysis is based on the results and techniques in [28, 20], and we note that [3] has related results using mildly different techniques. We start with a restricted eigenvalue condition on the set defined in (7). For a random vector $x \in \mathbb{R}^p$, we define the marginal tail function of an arbitrary set $A \subseteq \mathbb{R}^p$ as

$$Q_{\xi}(A; x) \;=\; \inf_{u \in A}\ \mathbb{P}\big( |\langle x, u \rangle| \ge \xi \big)~, \qquad (16)$$

noting that it is deterministic given the set $A$. Let $\epsilon_1, \ldots, \epsilon_n$ be independent Rademacher random variables, i.e., random variables taking the values $+1$ and $-1$ with probability $1/2$ each, and let $x_1, \ldots, x_n$ be independent copies of $x$. We define the empirical width of $A$ as

$$W_n(A; x) \;=\; \sup_{u \in A}\ \Big\langle \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \epsilon_i x_i,\ u \Big\rangle~. \qquad (17)$$

With this notation, we recall the following result from [28, Proposition 5.1]:

Lemma 1

Let $X$ be a random design matrix with each row an independent copy of the sub-Gaussian random vector $x$. Then, for any set $A$ and any $\xi, t > 0$, we have

$$\inf_{u \in A}\ \Big( \sum_{i=1}^{n} \langle x_i, u \rangle^2 \Big)^{1/2} \;\ge\; \xi \sqrt{n}\, Q_{2\xi}(A; x) \;-\; 2\, W_n(A; x) \;-\; \xi t \qquad (18)$$

with probability at least $1 - e^{-t^2/2}$.

From Lemma 1, in order to obtain the lower bound with constant $\kappa$ in the RE condition (6), we need to lower bound the marginal tail function $Q_{2\xi}$ and upper bound the empirical width $W_n$. To lower bound $Q_{2\xi}$, we consider the spherical cap

(19)

From [28, 20], one can obtain a lower bound on $Q_{2\xi}$ based on the Paley-Zygmund inequality, which lower bounds the tail distribution of a non-negative random variable in terms of its first and second moments. Letting $u$ be an arbitrary vector, we use the following version of the inequality:

(20)
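
For reference, the generic Paley-Zygmund inequality states that for a non-negative random variable $Z$ and $\epsilon \in [0, 1]$, $\mathbb{P}(Z > \epsilon\, \mathbb{E}Z) \ge (1-\epsilon)^2 (\mathbb{E}Z)^2 / \mathbb{E}[Z^2]$. Applying it to $Z = \langle x, u \rangle^2$ with threshold $4\xi^2$ gives a bound of the type used here:

$$\mathbb{P}\big(|\langle x, u \rangle| \ge 2\xi\big) \;=\; \mathbb{P}\big(\langle x, u \rangle^2 \ge 4\xi^2\big) \;\ge\; \frac{\big[\,\mathbb{E}\langle x, u \rangle^2 - 4\xi^2\,\big]_+^2}{\mathbb{E}\langle x, u \rangle^4}~.$$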

In the current context, the following result is a direct consequence of the SC condition: it shows that the marginal tail function of the set in (7) is lower bounded in terms of that of the spherical cap in (19), which in turn is strictly bounded away from 0. The proof of Lemma 2 is given in Appendix C.1.

Lemma 2

Let the two sets be as defined in (7) and (19) respectively. If the SC condition in (9) holds, then the marginal tail functions of the two sets satisfy the following relationship:

(21)

Next we discuss how to upper bound the empirical width $W_n$. Let $A \subseteq \mathbb{R}^p$ be an arbitrary set, and let $g$ be a standard Gaussian random vector in $\mathbb{R}^p$. The Gaussian width [3] of $A$ is defined as

$$w(A) \;=\; \mathbb{E}_g\Big[\, \sup_{u \in A}\ \langle g, u \rangle \Big]~. \qquad (22)$$

The empirical width can be seen as the supremum of a stochastic process. One way to upper bound the supremum of a stochastic process is generic chaining [26, 3, 28]; using generic chaining, the empirical width $W_n(A; x)$ can be upper bounded by a constant multiple of the Gaussian width $w(A)$.
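
Gaussian widths of the sets appearing in these bounds can also be estimated numerically. As an illustration, the sketch below approximates $w(\mathcal{C} \cap \mathbb{B}^p)^2$ for the $\ell_1$ descent cone at an $s$-sparse vector, using the standard fact that this quantity is bounded by the expected squared distance of a standard Gaussian vector to the polar cone; for the $\ell_1$ norm the distance computation reduces to a one-dimensional minimization over the subgradient scaling $\tau$, approximated here by a grid search. All sizes are illustrative, and the comparison value $2 s \log(p/s)$ is only the well-known rough scale of this width.

    import numpy as np

    rng = np.random.default_rng(3)
    p, s = 1000, 10
    support = rng.choice(p, s, replace=False)
    signs = rng.choice([-1.0, 1.0], size=s)          # sign pattern of the s-sparse vector
    off = np.ones(p, dtype=bool)
    off[support] = False

    def dist_sq_to_polar(g, taus):
        # min over tau >= 0 of dist(g, tau * subdifferential of ||.||_1 at the sparse vector)^2
        best = np.inf
        for tau in taus:
            d_on = np.sum((g[support] - tau * signs) ** 2)
            d_off = np.sum(np.maximum(np.abs(g[off]) - tau, 0.0) ** 2)
            best = min(best, d_on + d_off)
        return best

    taus = np.linspace(0.0, 10.0, 200)
    vals = [dist_sq_to_polar(rng.standard_normal(p), taus) for _ in range(200)]
    print("estimated upper bound on w(C ∩ B)^2:", np.mean(vals))
    print("rough theoretical scale 2 s log(p/s):", 2 * s * np.log(p / s))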

Having bounded the marginal tail function and the empirical width, we can now state our conclusion on the RE condition. Let $X$ be a random matrix where each row is an independent copy of a sub-Gaussian random vector $x$ with sub-Gaussian norm $\|x\|_{\psi_2}$ [29]; the constants in Lemma 1 can then be chosen following [20, 28]. We have the following lower bound establishing the RE condition. The proof of Theorem 4 is based on the proof of [28, Theorem 6.3], and we give it in Appendix C.2.

Theorem 4 (Restricted Eigenvalue Condition)

Let $X$ be a sub-Gaussian design matrix satisfying the assumptions above. If the SC condition (9) holds with a constant $\lambda > 0$, then, with high probability, we have

(23)

where the constants involved are positive and are determined by the sub-Gaussian norm of $x$ and the SC constant $\lambda$.

To obtain a valid constant $\kappa$ in (6), one can choose it proportional to the lower bound in (23). Then, as long as the sample size $n$ exceeds the sample complexity implied by (23), the RE condition (6) holds with high probability.

From the discussion above, if the SC condition holds and the sample size is large enough, then the RE condition holds with high probability for a sub-Gaussian design matrix $X$. Conversely, if there is a matrix $X$ such that the RE condition holds, then the SC condition must also be true. The proof is given in Appendix C.3.

Proposition 5

If $X$ is a matrix such that the RE condition (6) holds with some constant $\kappa > 0$, then the SC condition (9) holds.

Proposition 5 demonstrates that the SC condition is necessary for the RE condition to be possible. If the SC condition does not hold, then there exist $\Delta_i \in \mathcal{C}_i$, not all zero, such that $\sum_i \Delta_i = 0$, which implies $X\sum_i \Delta_i = 0$. Then, for every matrix $X$, we have $\frac{1}{\sqrt{n}}\|X\sum_i \Delta_i\|_2 = 0 < \kappa \sum_i \|\Delta_i\|_2$, and the RE condition is not possible.

5 General Error Bound

Recall that the error bound in Theorem 1 is given in terms of the noise-design (ND) interaction

(24)

In this section, we give a characterization of the ND interaction, which yields the final bound on the componentwise error as long as the sample size exceeds the required sample complexity.

Let $\omega$ be a centered sub-Gaussian random vector with sub-Gaussian norm $\|\omega\|_{\psi_2}$, and let $X$ be a row-wise i.i.d. sub-Gaussian random matrix whose rows $x$ have sub-Gaussian norm $\|x\|_{\psi_2}$. The ND interaction can be bounded by the following result; the proof of Lemma 3 is given in Appendix D.1.

Lemma 3

Let the design $X$ be a row-wise i.i.d. sub-Gaussian random matrix, and let the noise $\omega$ be a centered sub-Gaussian random vector. Then, with high probability, the ND interaction in (24) is bounded by a constant times the Gaussian width of the set in (8) divided by $\sqrt{n}$, where the constant depends on the sub-Gaussian norms of $x$ and $\omega$.

In Lemma 3 and Theorem 6, we need the Gaussian widths of the sets defined in (7) and (8). By definition, both sets are related to unions of the different error cones; therefore, bounding their widths directly may be difficult. We have the following bound on both widths in terms of the widths of the component spherical caps. The proof of Lemma 4 is given in Appendix D.2.

Lemma 4 (Gaussian width bound)

Let the two sets be as defined in (7) and (8) respectively. Then the Gaussian width of each set is bounded, up to absolute constants, by $\max_{1 \le i \le k} w(\mathcal{C}_i \cap \mathbb{B}^p) + \sqrt{\log k}$.

By applying Lemma 4, we can derive the error bound using the Gaussian widths of the individual error cones. From the deterministic bound in Theorem 1, we can choose the constants appropriately so that its condition is satisfied. Then, by combining the results of Theorem 1, Theorem 4, Lemma 3 and Lemma 4, we obtain the final form of the bound, as previewed in (3):

Theorem 6

For the estimator (2), let $\Delta_i = \hat{\theta}_i - \theta_i^*$, let the design $X$ be a random matrix with each row an independent copy of a sub-Gaussian random vector $x$, let the noise $\omega$ be a centered sub-Gaussian random vector, and let $\mathbb{B}^p$ be the centered unit Euclidean ball. Suppose the SC condition (9) holds

for any $\Delta_i \in \mathcal{C}_i$ and a constant $\lambda > 0$. If the sample size $n$ is sufficiently large, then, with high probability,

$$\sum_{i=1}^{k} \|\hat{\theta}_i - \theta_i^*\|_2 \;\le\; C\, \frac{\max_{i} w(\mathcal{C}_i \cap \mathbb{B}^p) + \sqrt{\log k}}{\sqrt{n}}~, \qquad (25)$$

where the constant $C$ depends on the sub-Gaussian norms of $x$ and $\omega$ and on the SC constant $\lambda$.

Thus, assuming the SC condition in (9) is satisfied, the sample complexity and the error bound of the estimator depend on the largest Gaussian width among the components, rather than on the sum of the Gaussian widths. The result can be viewed as a direct generalization of existing results for $k = 1$, where the SC condition is always satisfied, and the sample complexity and error are given by $O(w^2(\mathcal{C} \cap \mathbb{B}^p))$ and $O(w(\mathcal{C} \cap \mathbb{B}^p)/\sqrt{n})$ respectively [3, 10].
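
As a concrete reading of the result, consider the common sparse-plus-low-rank case ($k = 2$), with an $s$-sparse vector component in $\mathbb{R}^p$ and a rank-$r$ matrix component of size $d_1 \times d_2$ (vectorized, so $p = d_1 d_2$). Using well-known order-of-magnitude bounds on the Gaussian widths of the $\ell_1$ and nuclear-norm error cones (stated here only up to absolute constants, as an illustration rather than a claim from this paper):

$$w(\mathcal{C}_{\ell_1} \cap \mathbb{B}^p) \;\lesssim\; \sqrt{s \log(p/s)}~, \qquad w(\mathcal{C}_{*} \cap \mathbb{B}^p) \;\lesssim\; \sqrt{r (d_1 + d_2)}~,$$

a bound of the form (3) then scales as $\max\{\sqrt{s\log(p/s)},\ \sqrt{r(d_1+d_2)}\}/\sqrt{n}$, with sample complexity $n \gtrsim \max\{s\log(p/s),\ r(d_1+d_2)\}$, rather than involving the sum of the two width terms.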

6 Accelerated Proximal Algorithm

  Inputs: , .
  Initialize: , , , .
  for  do
     Set .
     while true do
        for   do
           
        end for
        if   then
           break
        end if
        
     end while
     
  end for
Algorithm 1 Accelerated Proximal Algorithm

In this section, we propose a general-purpose algorithm for solving problem (2). For convenience, we collect the components into $\theta = (\theta_1, \ldots, \theta_k)$ and write $f(\theta)$ for the smooth loss in (2). While the norms $\|\cdot\|_{(i)}$ may be non-smooth, one can design a general algorithm as long as the proximal operator (here, the Euclidean projection) for each constraint set $\{\theta_i : \|\theta_i\|_{(i)} \le \alpha_i\}$ can be efficiently computed. The algorithm is simply the proximal gradient method [23], where each component is cyclically updated in each iteration (see Algorithm 1):

$$\theta_i \;\leftarrow\; \Pi_{\{\|\cdot\|_{(i)} \le \alpha_i\}}\Big( \theta_i - \eta\, \nabla_{\theta_i} f(\theta) \Big)~, \qquad i = 1, \ldots, k, \qquad (26)$$

where $\eta$ is the learning rate. To determine a proper $\eta$, we use a backtracking step [4]. Starting from a constant $\eta_0$, in each step we first perform the update (26) to obtain a candidate iterate $\theta^{+}$; then we check whether $\eta$ satisfies the condition:

$$f(\theta^{+}) \;\le\; f(\theta) + \langle \nabla f(\theta),\, \theta^{+} - \theta \rangle + \frac{1}{2\eta}\, \|\theta^{+} - \theta\|_2^2~. \qquad (27)$$

If condition (27) does not hold, then we decrease $\eta$ until (27) is satisfied. Based on existing results [4], the basic method can be accelerated by setting the starting point of the next iteration to a proper combination of the current iterate and the previous one. Following [4], one can use the updates:

$$t_{j+1} \;=\; \frac{1 + \sqrt{1 + 4 t_j^2}}{2}~, \qquad z^{j+1} \;=\; \theta^{j} + \frac{t_j - 1}{t_{j+1}}\,\big(\theta^{j} - \theta^{j-1}\big)~. \qquad (28)$$

The convergence of Algorithm 1 has been studied in [4]. The backtracking step ensures the convergence of Algorithm 1, and the analysis in [4] also gives its convergence rate, which is $O(1/T^2)$ after $T$ iterations. Therefore, we can always reach a stationary point of problem (2) using Algorithm 1.
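
A minimal runnable sketch of this scheme is given below for constraints of the form $\|Q_i^\top \theta_i\|_1 \le \alpha_i$ (plain $\ell_1$ balls when $Q_i$ is the identity), combining projected gradient steps, backtracking in the spirit of (27), and the momentum step (28). The problem sizes, radii, step-size constants, and the sort-based $\ell_1$-ball projection are illustrative choices, not components prescribed by the paper.

    import numpy as np

    def project_l1_ball(v, radius):
        # Euclidean projection of v onto {x : ||x||_1 <= radius} (standard sort-based routine).
        if np.abs(v).sum() <= radius:
            return v.copy()
        u = np.sort(np.abs(v))[::-1]
        css = np.cumsum(u)
        k = np.nonzero(u * np.arange(1, len(v) + 1) > (css - radius))[0][-1]
        tau = (css[k] - radius) / (k + 1.0)
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    def superposition_apg(y, X, radii, bases=None, iters=300, eta0=1.0, shrink=0.5):
        # Accelerated projected-gradient sketch for (2) with constraints ||Q_i^T theta_i||_1 <= radii[i].
        n, p = X.shape
        k = len(radii)
        bases = bases or [None] * k                      # None means the identity basis
        def project(i, v):
            Q = bases[i]
            if Q is None:
                return project_l1_ball(v, radii[i])
            return Q @ project_l1_ball(Q.T @ v, radii[i])
        def loss(th):
            r = y - X @ th.sum(axis=0)
            return 0.5 * r @ r
        theta = np.zeros((k, p))
        theta_prev = theta.copy()
        z = theta.copy()
        t = 1.0
        for _ in range(iters):
            grad = -X.T @ (y - X @ z.sum(axis=0))        # same gradient for every component
            eta = eta0
            while True:
                cand = np.array([project(i, z[i] - eta * grad) for i in range(k)])
                diff = cand - z
                # Backtracking condition in the spirit of (27).
                ok = loss(cand) <= (loss(z) + np.sum(grad * diff.sum(axis=0))
                                    + (0.5 / eta) * np.sum(diff ** 2))
                if ok:
                    break
                eta *= shrink
            theta_prev, theta = theta, cand
            t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
            z = theta + ((t - 1.0) / t_next) * (theta - theta_prev)
            t = t_next
        return theta

With the data generated in the earlier sketch, a call such as superposition_apg(y, X, radii=[np.abs(theta1).sum(), np.abs(z).sum()], bases=[None, R]) would estimate the sparse component and the rotated-sparse component, under the assumption that the component norms $\alpha_i = \|\theta_i^*\|_{(i)}$ are known.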

7 Noiseless Case: Comparing Estimators

In this section, we present a comparative analysis of the estimator

$$\{\tilde{\theta}_i\} \;=\; \operatorname*{argmin}_{\theta_1, \ldots, \theta_k} \ \sum_{i=1}^{k} \lambda_i \|\theta_i\|_{(i)} \quad \text{s.t.} \quad y = X \sum_{i=1}^{k} \theta_i \qquad (29)$$

with the proposed estimator (2) in the noiseless case, i.e., $\omega = 0$. In essence, we show that the two estimators have similar recovery conditions, but the existing estimator (29) needs additional structure for a unique decomposition of the estimate into the components.

Figure 2: The relationship between the different norm balls when $k = 2$. The blue and purple polygons are the unit balls of the two component norms. The red line is the outline of the infimal convolution norm ball. Note that any point on the red line can be decomposed in terms of the two vertices adjacent to it.

The estimator (29) needs to consider the so-called "infimal convolution" [25, 32] over the different norms to get a (unique) decomposition of the estimate in terms of the components. Denote

$$\|\theta\|_{\mathrm{inf}} \;=\; \inf_{\theta_1 + \cdots + \theta_k = \theta} \ \sum_{i=1}^{k} \lambda_i \|\theta_i\|_{(i)}~. \qquad (30)$$

Results in [25] show that (30) is also a norm. Thus estimator (29) can be rewritten as

$$\tilde{\theta} \;=\; \operatorname*{argmin}_{\theta} \ \|\theta\|_{\mathrm{inf}} \quad \text{s.t.} \quad y = X\theta~. \qquad (31)$$

Interestingly, the above discussion separates the estimation problem into two parts: solving (31) to get the overall estimate $\tilde{\theta}$, and then solving (30) to get the components. The problem (31) is a simple structured recovery problem and is well studied [10, 28]. Using the infimal convolution based decomposition problem (30) to get the components will be our focus in the sequel.
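
For intuition, the decomposition step (30) is itself a small convex program and can be solved directly with an off-the-shelf solver. The sketch below uses the cvxpy package (an assumption of this illustration, not a dependency of the paper) to compute the infimal convolution decomposition of a given vector into an $\ell_1$-sparse part and a part that is sparse under a hypothetical orthogonal rotation R; the weights and sizes are illustrative.

    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(4)
    p = 100
    R, _ = np.linalg.qr(rng.standard_normal((p, p)))   # illustrative orthogonal rotation
    theta = rng.standard_normal(p)                      # vector to decompose
    lam1, lam2 = 1.0, 1.0                               # illustrative weights

    t1, t2 = cp.Variable(p), cp.Variable(p)
    objective = cp.Minimize(lam1 * cp.norm1(t1) + lam2 * cp.norm1(R.T @ t2))
    problem = cp.Problem(objective, [t1 + t2 == theta])
    problem.solve()

    print("infimal convolution value:", problem.value)
    print("component norms:", np.abs(t1.value).sum(), np.abs(R.T @ t2.value).sum())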

To understand some properties of the decomposition (30), we consider the unit norm balls of the infimal convolution norm and of the component norms $\|\cdot\|_{(i)}$.

The norm balls are related by the following result; we give the proof in Appendix F.1.

Lemma 5

The unit ball of the infimal convolution norm in (30) is the convex hull of the union of the (suitably scaled) unit balls of the component norms.

Figure 3: Consider the case $k = 2$. Panel (a) shows the structure of the error around the true value. The green segment is a subspace determined by the two components and the weights. For the superposition in (a), the error is composed of three parts: the two error cones and the green segment. In (b), we move the green segment and the two error cones to the origin; the unique recovery condition is that if we reflect any one of the three structures, their intersection remains $\{0\}$.

Lemma 5 illustrates what the decomposition (30) should look like. If a point lies on the surface of the infimal convolution norm ball, then it is a convex combination of points on the surfaces of the (scaled) component norm balls. Hence, if $\theta$ can be successfully decomposed into different components along the direction of $\theta$, then we should be able to connect the component points by a face of the norm ball, or, informally, they have to be "close". Interestingly, the above intuition of "closeness" between different components can be described in the language of cones, in a way similar to the structural coherence property discussed in Section 3.

Given the intuition above, we state the main result of this section below. Its proof is given in Appendix F.2.

Theorem 7

Given the true components $\theta_i^*$ and the weights $\lambda_i$, define

(32)

Suppose , then there exist such that are unique solutions of (30) if and only if there are with and such that for the corresponding error cone of and defined above, for .

Theorem 7 illustrates that a successful decomposition via (30) requires an additional condition beyond what is needed by the SC condition (see Section 3), and this additional condition requires us to choose the parameters $\lambda_i$ properly. Theorem 7 also shows that the condition depends on both the component structures and the weights $\lambda_i$; for appropriate problems, there may be a range of $\lambda_i$ for which the solution is unique. Therefore, in the noiseless situation, the choice between the two estimators depends on what is known a priori: (29) avoids the need to know the constants $\alpha_i = \|\theta_i^*\|_{(i)}$ required by (2), but requires the additional condition of Theorem 7 and a proper choice of the weights $\lambda_i$.

8 Related Work

Structured superposition models have been studied in the recent literature. Early work focused on the case $k = 2$ with no noise, and assumed specific structures such as sparse + sparse [14] and low-rank + sparse [11]. [16] analyze error bounds for low-rank plus sparse matrix decomposition with noise. Recent work has considered more general models and structures. [1] analyze the decomposition of a low-rank matrix plus another matrix with a general structure. [15] propose an estimator for the decomposition of two generally structured matrices, one of which has undergone a random rotation. Because of the growing practical applications and the non-trivial nature of such problems, unified frameworks for superposition models have begun to emerge. In [31], the authors generalize the noiseless matrix decomposition problem to an arbitrary number of components under random orthogonal measurements. [32] consider the superposition of structures captured by decomposable norms, while [18] consider general norms but with a different measurement model, involving componentwise random rotations. These two papers are closest in spirit to our work, so we briefly discuss and differentiate our work from them.

[32] consider a general framework for superposition models, and give a high-probability bound for the following estimation problem:

(33)

They assume each component norm to be a special kind of norm called a decomposable norm. The authors use a different approach to establish the RE condition: they decompose it into two parts. One is

(34)

which characterizes the restricted eigenvalue of each error cone. The other is

(35)

which characterizes the interaction between different error cones. (35) is a strong assumption, and the RE condition can hold without it. If the components are positively correlated, then large interaction terms will in fact make our RE condition stronger. Therefore, their results are more restricted.

[19] consider an estimator like (2), which is

(36)

where the rotations are known random rotations. Problem (36) is then transformed into a geometric problem: whether randomly rotated cones intersect. The componentwise random rotations ensure that any combination of structures can be recovered with high probability. However, in practical problems, such random rotations need not be available as part of the measurements. Further, their analysis is primarily focused on the noiseless case.

9 Application of General Bound

In this section, we instantiate the general error bounds for Morphological Component Analysis (MCA) and for low-rank plus sparse matrix decomposition. The proofs are provided in Appendix E.

9.1 Morphological Component Analysis Using the $\ell_1$ Norm

In Morphological Component Analysis [14], we consider the following linear model

where the vector $\theta_1$ is sparse and the vector $\theta_2$ is sparse under a rotation $R$, i.e., $R^\top \theta_2$ is sparse. In [14], the authors introduced a quantity

$$\mu \;=\; \max_{i, j}\ \big| \langle e_i, R\, e_j \rangle \big|~, \qquad (37)$$

where $e_i$ denotes the $i$-th standard basis vector.

For small enough $\mu$, if the sum of the sparsity levels of the two components is lower than a constant related to $\mu$, we can recover them. We show that for two given sparse vectors, our SC condition is more general.
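
Assuming the quantity in (37) is the usual mutual coherence between the standard basis and the columns of the orthogonal matrix $R$ (an assumption of this illustration), it is simply the largest absolute entry of $R$ and is trivial to compute; the value $1/\sqrt{p}$ printed below is the smallest possible coherence for a pair of orthonormal bases in $\mathbb{R}^p$.

    import numpy as np

    rng = np.random.default_rng(5)
    p = 256

    # Illustrative stand-in for the rotation R in the MCA model: a random orthogonal matrix.
    R, _ = np.linalg.qr(rng.standard_normal((p, p)))

    # Mutual coherence between the standard basis and the columns of R:
    # mu = max_{i,j} |<e_i, R e_j>| = largest absolute entry of R.
    mu = np.abs(R).max()
    print("mutual coherence mu:", mu)
    print("lower bound 1/sqrt(p):", 1 / np.sqrt(p))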

Consider the following estimator

$$(\hat{\theta}_1, \hat{\theta}_2) \;=\; \operatorname*{argmin}_{\theta_1, \theta_2} \ \big\| y - X(\theta_1 + \theta_2) \big\|_2^2 \quad \text{s.t.} \quad \|\theta_1\|_1 \le \alpha_1, \ \ \|R^{\top}\theta_2\|_1 \le \alpha_2~, \qquad (38)$$

where the vector $y$ is the observation, the vectors $\theta_1, \theta_2$ are the parameters we want to estimate, the matrix $X$ is a sub-Gaussian random design, and the matrix $R$ is orthogonal. We assume $\theta_1$ and $R^{\top}\theta_2$ are $s_1$-sparse and $s_2$-sparse vectors respectively. The function $\theta \mapsto \|R^{\top}\theta\|_1$ is still a norm.

Suppose , , and the i-th entry of and the j-th entry of are non-zero. If

then we have

(39)

Thus we will have a chance to separate $\theta_1$ and $\theta_2$ successfully. It is easy to see that $\mu$ is lower bounded by $1/\sqrt{p}$. There is a trade-off among the quantities involved: increasing one of them worsens one term of the bound but improves another, which helps in separating $\theta_1$ and $\theta_2$. The proof of the above bound is given in Appendix E.1.

In general, it is difficult to derive a lower bound like (39). Instead, we can derive the following sufficient condition in terms of $\mu$:

Theorem 8

If , then for problem (38) with high probability

In many cases, this condition is much stronger than the SC-based one above, because every entry of $R$ has to be small.

9.2 Morphological Component Analysis Using the $k$-support Norm

The $k$-support norm [2] is another way to induce sparse solutions, as an alternative to the $\ell_1$ norm. Recent work [2, 12] has shown that the $k$-support norm can have better statistical guarantees than the $\ell_1$ norm. For an arbitrary vector, its $k$-support norm