
# Fast Algorithms for Demixing Sparse Signals from Nonlinear Observations

Electrical and Computer Engineering Department
Iowa State University
This work was supported in part by the National Science Foundation under grant CCF-1566281. Parts of this work also appear in an Iowa State University technical report [1] and in a conference paper to be presented at the 2016 Asilomar Conference in November 2016 [2].
###### Abstract

We study the problem of demixing a pair of sparse signals from nonlinear observations of their superposition. Mathematically, we consider a nonlinear signal observation model, yᵢ = g(⟨aᵢ, x⟩) + eᵢ, i = 1, …, m, where x = Φw + Ψz denotes the superposition signal, Φ and Ψ are orthonormal bases in ℝⁿ, and w, z ∈ ℝⁿ are sparse coefficient vectors of the constituent signals. Further, we assume that the observations are corrupted by subgaussian additive noise. Within this model, g represents a nonlinear link function, and aᵢ ∈ ℝⁿ is the i-th row of the measurement matrix A ∈ ℝ^{m×n}. Problems of this nature arise in several applications, ranging from astronomy and computer vision to machine learning.

In this paper, we make some concrete algorithmic progress on the above demixing problem. Specifically, we consider two scenarios: (i) the case when the demixing procedure has no knowledge of the link function, and (ii) the case when the demixing algorithm has perfect knowledge of the link function. In both cases, we provide fast algorithms for recovery of the constituents w and z from the observations y. Moreover, we support these algorithms with a rigorous theoretical analysis, and derive (nearly) tight upper bounds on the sample complexity of the proposed algorithms for achieving stable recovery of the component signals. Our analysis also shows that the running time of our algorithms is essentially as good as the best possible.

We also provide a range of numerical simulations to illustrate the performance of the proposed algorithms on both real and synthetic signals and images. Our simulations show the superior performance of our algorithms compared to existing methods for demixing signals and images based on convex optimization. In particular, our proposed methods yield demonstrably better sample complexities as well as improved running times, thereby enabling their applicability to large-scale problems.

## 1 Introduction

### 1.1 Setup

In numerous signal processing applications, the problem of demixing is of special interest. In simple terms, demixing involves disentangling two (or more) constituent signals from observations of their linear superposition. Formally, consider a discrete-time signal x ∈ ℝⁿ that can be expressed as the superposition of two signals:

 x=Φw+Ψz,

where Φ and Ψ are orthonormal bases of ℝⁿ, and w, z ∈ ℝⁿ are the corresponding basis coefficients. The goal of signal demixing, in this context, is to reliably recover the constituent signals (equivalently, their basis representations w and z) from the superposition signal x.

Demixing suffers from a fundamental identifiability issue since the number of unknowns (2n) is greater than the number of observations (n). This is easy to see: suppose for simplicity that Φ = Ψ = Iₙ, the canonical basis of ℝⁿ, and therefore x = w + z. Now, suppose that both w and z have only one nonzero entry, located in the first coordinate. Then, there is an infinite number of pairs (w, z) that are consistent with the observations x, and any hope of recovering the true components is lost. Therefore, for the demixing problem to have an identifiable solution, one inevitably has to assume some type of incoherence between the constituent signals (or more specifically, between the corresponding bases Φ and Ψ) [3, 4]. Such an incoherence assumption certifies that the components are sufficiently “distinct” and that the recovery problem is well-posed. Please see Section 3 for a formal definition of incoherence.

However, even if we assume that the signal components are sufficiently incoherent, demixing poses additional challenges under stringent observation models. Suppose, now, that we only have access to undersampled linear measurements of the signal x, i.e., we record:

 y=Ax, (1.1)

where A ∈ ℝ^{m×n} denotes the measurement operator and m ≪ n. In this scenario, the demixing problem is further confounded by the fact that A possesses a nontrivial null space, and it might seem impossible to recover the components w and z. Once again, this problem is highly ill-posed, and further structural assumptions on the constituent signals are necessary. Under-determined problems of this kind have recently received significant attention in signal processing, machine learning, and high-dimensional statistics. In particular, the emergent field of compressive sensing [5, 6, 7] shows that it is indeed possible to exactly reconstruct the underlying signals under certain structural assumptions on x, provided the measurement operator is designed carefully. This intuition has enabled the design of a wide range of efficient architectures for signal acquisition and processing [8, 9].

In this paper, we address an even more challenging question in the demixing context. Mathematically, we consider a noisy, nonlinear signal observation model, formulated as follows:

 yi=g(⟨ai,Φw+Ψz⟩)+ei, i=1,…,m. (1.2)

Here, as before, the superposition signal is modeled as x = Φw + Ψz. Each observation yᵢ is generated by the composition of a linear functional of the signal, ⟨aᵢ, x⟩, with a (scalar) nonlinear function g. Here, g is sometimes called a link or transfer function, and aᵢ denotes the i-th row of a linear measurement matrix A ∈ ℝ^{m×n}. For full generality, in (1.2) we assume that each observation is corrupted by additive noise eᵢ; the noiseless case is realized by setting eᵢ = 0. We will exclusively consider the “measurement-poor” regime where the number of observations m is much smaller than the ambient dimension n.
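To make the observation model (1.2) concrete, the following sketch generates synthetic nonlinear observations. The choice g(x) = tanh(x), the identity/DCT basis pair, the noise level, and all dimensions are illustrative assumptions, not prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, s = 128, 64, 4

# Orthonormal basis pair: identity and an orthonormal DCT-II matrix.
Phi = np.eye(n)
j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
Psi = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * j + 1) * k / (2 * n))
Psi[:, 0] = 1.0 / np.sqrt(n)

# s-sparse coefficient vectors w and z.
w = np.zeros(n); w[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
z = np.zeros(n); z[rng.choice(n, s, replace=False)] = rng.standard_normal(s)

x = Phi @ w + Psi @ z               # superposition signal
A = rng.standard_normal((m, n))     # measurement matrix, rows a_i
e = 0.01 * rng.standard_normal(m)   # additive noise
y = np.tanh(A @ x) + e              # y_i = g(<a_i, x>) + e_i
```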

For all the reasons detailed above, the problem of recovering the coefficient vectors w and z from the measurements y seems daunting. Therefore, we make some structural assumptions. Specifically, we assume that w and z are s-sparse (i.e., each contains no more than s nonzero entries). Further, we will assume perfect knowledge of the bases Φ and Ψ, as well as of the measurement matrix A. The noise vector e is assumed to be stochastic, zero mean, and bounded. Under these assumptions, we will see that it is indeed possible to stably recover the coefficient vectors, with a number of observations proportional to the sparsity level s, as opposed to the ambient dimension n.

The nonlinear link function g plays a crucial role in our algorithm development and analysis. In signal processing applications, such nonlinearities may arise due to imperfections caused during a measurement process, inherent limitations of the measurement system, or quantization or calibration errors. We discuss such practical implications in more detail below. On an abstract level, we consider two distinct scenarios. In the first scenario, the link function may be non-smooth, non-invertible, or even unknown to the recovery procedure. This is the more challenging case, but we will show that recovery of the components is possible even without knowledge of g. In the second scenario, the link function is a known, smooth, and strictly monotonic function. This is the somewhat simpler case, and we will see that it leads to significant improvements in recovery performance both in terms of theory and practice.

### 1.2 Our Contributions

In this paper, we make some concrete algorithmic progress in the demixing problem under nonlinear observations. In particular, we study the following scenarios depending on certain additional assumptions made on (1.2):

1. Unknown g. We first consider the (arguably, more general) scenario where the nonlinear link function g may be non-smooth, non-invertible, or even unknown. In this setting, we do not explicitly model the additive noise term in (1.2). For such settings, we introduce a novel demixing algorithm that is non-iterative, does not require explicit knowledge of the link function g, and produces an estimate of the signal components. We call this algorithm OneShot to emphasize its non-iterative nature. It is assumed that OneShot possesses oracle knowledge of the measurement matrix A and the orthonormal bases Φ and Ψ.

We supplement our proposed algorithm with a rigorous theoretical analysis and derive upper bounds on the sample complexity of demixing with nonlinear observations. In particular, we prove that the sample complexity of OneShot to achieve an estimation error δ is m = O((s/δ²) log(n/s)), provided that the entries of the measurement matrix A are i.i.d. standard normal random variables.

2. Known g. Next, we consider the case where the nonlinear link function g is known, smooth, and monotonic. In this setting, the additive noise term in (1.2) is assumed to be bounded either absolutely, or with high probability. For such (arguably, milder) settings, we provide an iterative algorithm for demixing the constituent signals in (1.2) given the nonlinear observations y. We call this algorithm Demixing with Hard Thresholding, or DHT for short. In addition to knowledge of g, we assume that DHT possesses oracle knowledge of A, Φ, and Ψ.

Within this scenario, we also analyze two special sub-cases:

Case 2a: Isotropic measurements. We assume that the measurement vectors aᵢ are independent, isotropic random vectors that are incoherent with the bases Φ and Ψ. This assumption is more general than the i.i.d. standard normal assumption on the measurement matrix made in the first scenario, and is applicable to a wider range of measurement models. For this case, we derive an upper bound on the sample complexity of DHT that is independent of the estimation error δ.

Case 2b: Subgaussian measurements. We assume that the rows of the matrix A are independent subgaussian isotropic random vectors. This is also a generalization of the i.i.d. standard normal assumption made above, but it is more restrictive than Case 2a. In this setting, we obtain a somewhat better sample complexity. More precisely, we show that the sample complexity of DHT is O(s log(n/s)), matching the best known sample complexity bounds for recovering a superposition of s-sparse signals from linear observations [10, 11].

In both of the above cases, the underlying assumption is that the bases Φ and Ψ are sufficiently incoherent, and that the sparsity level s is small relative to the ambient dimension n. In this regime, we show that DHT exhibits a linear rate of convergence, and therefore the computational complexity of DHT is only a logarithmic factor higher than that of OneShot. Table 1 provides a summary of the above contributions for the specific case where Φ is the identity (canonical) basis and Ψ is the discrete cosine transform (DCT) basis, and places them in the context of the existing literature on some nonlinear recovery methods [12, 13, 14]. We stress that these previous works do not explicitly consider the demixing problem, but in principle the algorithms of [12, 13, 14] can be extended to the demixing setting as well.

### 1.3 Techniques

At a high level, our recovery algorithms are based on the now-classical method of greedy iterative thresholding. In both methods, the idea is to first form a proxy of the signal components, followed by hard thresholding to promote sparsity of the final estimates of the coefficient vectors w and z. The key distinguishing factor from existing methods is that the greedy thresholding procedures used to estimate w and z are deliberately myopic, in the sense that each thresholding step operates as if the other component did not exist at all. Despite this apparent shortcoming, we are still able to derive bounds on recovery performance when the signal components are sufficiently incoherent.

Our first algorithm, OneShot, is based on the recent, pioneering approach of [12], which describes a simple (but effective) method to estimate a high-dimensional signal from unknown nonlinear observations. The first main contribution of this paper is to extend this idea to the nonlinear demixing problem, and to precisely characterize the role of incoherence in the recovery process. Indeed, a variation of the approach of [12] (described in Section 5) can be used to solve the nonlinear demixing problem as stated above, with a similar two-step method of first forming a proxy and then performing a convex estimation procedure (such as the LASSO [15]) to produce the final signal estimates. However, as we show below in our analysis and experiments, OneShot offers superior performance compared to this approach. The analysis of OneShot is based on a geometric argument, and leverages the Gaussian mean width of the set of sparse vectors, a statistical measure of the complexity of a set of points in a given space.

While OneShot is simple and effective, one can potentially do much better if the link function g is available at recovery time. Our second algorithm, DHT, leverages precisely this intuition. First, we formulate the nonlinear demixing problem as an optimization problem with respect to a specially defined loss function that depends on the nonlinearity g. Next, we propose an iterative method that solves this optimization problem up to an additive approximation factor. Each iteration of DHT involves a proxy calculation formed by computing the gradient of the loss function, followed by (myopic) projections onto the constraint sets. Again, somewhat interestingly, this method can be shown to be linearly convergent, and therefore it only incurs a small (logarithmic) overhead in terms of running time. The analysis of DHT is based on bounding certain parameters of the loss function known as the restricted strong convexity (RSC) and restricted strong smoothness (RSS) constants.¹Quantifying algorithm performance by bounding the RSC and RSS constants of a given loss function is quite widespread in the machine learning literature [16, 17, 18, 19], but these techniques have not been studied in the context of signal demixing.
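The loss function used by DHT is defined later in the paper; the sketch below only illustrates the generic iteration structure just described — a gradient step on the stacked coefficient vector t = [w; z], followed by myopic s-sparse projections applied to each half independently. The gradient callable `grad_f`, the step size, and the iteration count are placeholders.

```python
import numpy as np

def hard_threshold(v, s):
    """Keep the s largest-magnitude entries of v; zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    out[keep] = v[keep]
    return out

def dht_style_iterations(grad_f, t0, s, n, step=1.0, iters=100):
    """Generic projected-gradient loop in the spirit of DHT: each iteration
    forms a proxy via the loss gradient, then myopically projects the two
    halves of t = [w; z] onto their s-sparse constraint sets."""
    t = t0.copy()
    for _ in range(iters):
        t = t - step * grad_f(t)
        t = np.concatenate([hard_threshold(t[:n], s),
                            hard_threshold(t[n:], s)])
    return t
```

As a sanity check, with the toy quadratic loss f(t) = ½∥t − t*∥₂² (gradient t − t*), a single unit-step iteration recovers any t* whose halves are s-sparse.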

Finally, we provide a wide range of simulations to empirically verify our claims on both synthetic and real data. We first compare the performance of OneShot with the convex optimization method of [12] for nonlinear demixing via a series of phase transition diagrams. Our simulation results show that OneShot outperforms this convex method significantly in both demixing efficiency and running time, making it an attractive choice for large-scale problems. However, as discussed below, the absence of knowledge of the link function induces an inevitable scale ambiguity in the final estimates.²Indeed, following the discussion in [12], any demixing algorithm that does not leverage knowledge of g is susceptible to such a scale ambiguity. For situations where we know the link function precisely, our simulation results show that DHT offers much better statistical performance compared to OneShot, and is even able to recover the scale of the signal components explicitly. We also provide simulation results on real-world natural images and astronomical data to demonstrate the robustness of our approaches.

### 1.4 Organization

The rest of this paper is organized as follows. Section 2 describes several potential applications of our proposed approach, and relationship with prior art. Section 3 introduces some key notions that are used throughout the paper. Section 4 contains our proposed algorithms, accompanied by analysis of their performance; complete proofs are deferred to Section 6. Section 5 lists the results of a series of numerical experiments on both synthetic and real data, and Section 7 provides concluding remarks.

## 2 Applications and Related Work

Demixing problems of various flavors have long been studied in research areas spanning signal processing, statistics, and physics, and we only present a small subset of relevant related work. In particular, demixing methods have been the focus of significant research over the past fifteen years, dating back at least to [20]. The work of Elad et al. [3] and Bobin et al. [21] posed the demixing problem as an instance of morphological component analysis (MCA), and formalized the observation model (1.1). Specifically, these approaches posed the recovery problem in terms of a convex optimization procedure, such as the LASSO [15]. The work of Pope et al. [22] analyzed somewhat more general conditions under which stable demixing can be achieved.

More recently, the work of [23] showed a curious phase transition behavior in the performance of convex optimization methods. Specifically, they demonstrated a sharp statistical characterization of the achievable and non-achievable parameters for which successful demixing of the signal components can be achieved. Moreover, they extended the demixing problem to a large variety of signal structures beyond sparsity via the use of general atomic norms in place of the ℓ₁-norm in the above optimization. See [24] for an in-depth discussion of atomic norms, their statistical and geometric properties, and their applications to demixing.

Approaches for (linear) demixing have also considered a variety of signal models beyond sparsity. The robust PCA problem [25, 26, 27] involves the separation of low-rank and sparse matrices from their sum. This idea has been used in several applications ranging from video surveillance to sensor network monitoring. In machine learning applications, the separation of low-rank and sparse matrices has been used for latent variable model selection [28] as well as the robust alignment of multiple occluded images [29]. Another type of signal model is the low-dimensional manifold model. In [10, 11], the authors proposed a greedy iterative method for demixing signals arising from a mixture of known low-dimensional manifolds, via iterative projections onto the component manifolds.

The problem of signal demixing from linear measurements belongs to a class of linear inverse problems that underpin compressive sensing [5, 6]; see [7] for an excellent introduction. There, the overarching goal is to recover signals from (possibly randomized) linear measurements of the form (1.1). More recently, it has been shown that compressive sensing techniques can also be extended to inverse problems where the available observations are manifestly nonlinear. For instance, in 1-bit compressive sensing [30, 31], the linear measurements of a given signal are quantized in an extreme fashion such that the measurements are binary (±1) and comprise only the sign of the linear observation. Therefore, the amplitude of the signal is completely discarded by the quantization operator. Another class of nonlinear recovery techniques applies to the classical signal processing problem of phase retrieval [32], which is somewhat more challenging than 1-bit compressive sensing. In this problem, the phase information of the signal measurements may be irrecoverably lost, and we only have access to the amplitude information of the signal [32]. Therefore, the recovery task is to reconstruct the signal from such phaseless random observations. Other related works include approaches for recovering low-rank matrices from nonlinear observations [33, 34]. We mention in passing that inverse problems involving nonlinear observations have also long been studied in the statistical learning theory literature; see [35, 36, 37, 38] for recent work in this area. Analogous to our scenarios above, these works consider both known as well as unknown link functions; the two classes of approaches are respectively dubbed Generalized Linear Model (GLM) learning methods and Single Index Model (SIM) learning methods.

For our algorithmic development, we build upon a recent line of efficient, iterative methods for signal estimation in high dimensions [39, 17, 18, 12, 19, 40]. The basic idea is to pose the recovery as a (non-convex) optimization problem in which an objective function is minimized over the set of s-sparse vectors. Essentially, these algorithms are based on well-known iterative thresholding methods proposed in the context of sparse recovery and compressive sensing [41, 42]. The analysis of these methods heavily depends on the assumption that the objective function satisfies certain (restricted) regularity conditions; see Sections 3 and 6 for details. Crucially, we adopt the approach of [16], which introduces the restricted strong convexity (RSC) and restricted strong smoothness (RSS) constants of a loss function. Bounding these constants in terms of the problem parameters s and n, as well as the level of incoherence across the components, enables explicit characterization of both sample complexity and convergence rates.

## 3 Preliminaries

In this section, we introduce some notation and key definitions. Throughout this paper, ∥·∥₂ denotes the ℓ₂-norm of a vector in ℝⁿ, and ∥A∥ denotes the spectral norm of the matrix A. Let Φ and Ψ be orthonormal bases of ℝⁿ. Define the sets of sparse vectors in the bases Φ and Ψ as follows:

 K₁ = {Φa : ∥a∥₀ ≤ s₁},  K₂ = {Ψa : ∥a∥₀ ≤ s₂},

and define K = K₁ + K₂. We use Bⁿ₂ to denote the unit ℓ₂-ball in ℝⁿ. Whenever we use the notation t = [w; z], the vector t ∈ ℝ²ⁿ is comprised by stacking the column vectors w and z.

In order to bound the sample complexity of our proposed algorithms, we will need some concepts from high-dimensional geometry. First, we define a statistical measure of complexity of a set of signals, following [12].

###### Definition 3.1.

(Local gaussian mean width.) For a given set K ⊂ ℝⁿ and scale t > 0, the local gaussian mean width (or simply, local mean width) is defined as follows:

 W_t(K) = E sup{ ⟨g, x − y⟩ : x, y ∈ K, ∥x − y∥₂ ≤ t },

where g ∼ N(0, I_{n×n}).
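For intuition, when K is the set of s-sparse vectors in the unit ball, the inner supremum has a closed form (align the vector with the largest-magnitude entries of g), which makes a Monte Carlo estimate of the mean width easy. This is a numerical illustration only, not part of the analysis.

```python
import numpy as np

def sparse_mean_width(n, s, trials=2000, seed=0):
    """Monte Carlo estimate of E sup <g, u> over s-sparse u, ||u||_2 <= 1.
    For fixed g, the supremum equals the l2-norm of the s largest-magnitude
    entries of g."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(trials):
        g = rng.standard_normal(n)
        vals.append(np.linalg.norm(np.sort(np.abs(g))[-s:]))
    return float(np.mean(vals))
```

Since any difference x − y of points in K with ∥x − y∥₂ ≤ t is 2s-sparse with norm at most t, the local width obeys W_t(K) ≤ t · sparse_mean_width(n, 2s), consistent with the √(s log(2n/s)) scaling quoted after Theorem 4.2.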

Next, we define the notion of a polar norm with respect to a given subset of the signal space:

###### Definition 3.2.

(Polar norm.) For a given x ∈ ℝⁿ and a subset Q of ℝⁿ, the polar norm of x with respect to Q is defined as follows:

 ∥x∥_{Q°} = sup_{u∈Q} ⟨x, u⟩.

Furthermore, for a given subset K of ℝⁿ, we define K_t = (K − K) ∩ tBⁿ₂. Since K_t is a symmetric set, one can show that the polar norm with respect to K_t defines a semi-norm. Next, we use the following standard notions from random matrix theory [43]:

###### Definition 3.3.

(Subgaussian random variable.) A random variable X is called subgaussian if it satisfies the following:

 E exp(cX²/∥X∥²_{ψ₂}) ≤ 2,

where c > 0 is an absolute constant and ∥·∥_{ψ₂} denotes the ψ₂-norm, which is defined as follows:

 ∥X∥_{ψ₂} = sup_{p≥1} p^{−1/2} (E|X|ᵖ)^{1/p}.
###### Definition 3.4.

(Isotropic random vectors.) A random vector x ∈ ℝⁿ is said to be isotropic if E[xxᵀ] = I_{n×n}.

In order to analyze the computational aspects of our proposed algorithms (in particular, DHT), we will need the following definition from [16]:

###### Definition 3.5.

A loss function f satisfies Restricted Strong Convexity/Smoothness (RSC/RSS) if:

 m_{4s} ≤ ∥∇²_ξ f(t)∥ ≤ M_{4s},

where ξ = supp(t₁) ∪ supp(t₂) for all t₁, t₂ with ∥t₁∥₀ ≤ 2s and ∥t₂∥₀ ≤ 2s. Also, m_{4s} and M_{4s} are (respectively) called the RSC and RSS constants. Here, ∇²_ξ f(t) denotes the sub-matrix of the Hessian matrix ∇²f(t) comprised of the row/column indices in ξ.

As discussed earlier, the underlying assumption in all demixing problems of the form (3.4) is that the constituent bases are sufficiently incoherent as per the following definition:

###### Definition 3.6.

(ε-incoherence.) The orthonormal bases Φ and Ψ are said to be ε-incoherent if:

 ε = sup{ |⟨Φu, Ψv⟩| : ∥u∥₀ ≤ s, ∥v∥₀ ≤ s, ∥u∥₂ = 1, ∥v∥₂ = 1 }. (3.1)

The parameter ε is related to the so-called mutual coherence parameter of a matrix. Indeed, if we consider the (overcomplete) dictionary Γ = [Φ Ψ], then the mutual coherence of Γ is given by γ = max_{i≠j} |⟨Γᵢ, Γⱼ⟩|. Moreover, one can show that ε ≤ sγ [7].
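As a concrete illustration of these quantities, the mutual coherence γ of the dictionary Γ = [Φ Ψ] can be computed directly for the identity/DCT pair used elsewhere in the paper; the dimension n = 64 here is arbitrary.

```python
import numpy as np

n = 64
Phi = np.eye(n)
jj, kk = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
Psi = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * jj + 1) * kk / (2 * n))
Psi[:, 0] = 1.0 / np.sqrt(n)       # orthonormal DCT-II basis

Gamma = np.hstack([Phi, Psi])      # overcomplete dictionary [Phi Psi]
G = np.abs(Gamma.T @ Gamma)        # all pairwise inner products
np.fill_diagonal(G, 0.0)
gamma = G.max()                    # mutual coherence; close to sqrt(2/n)
```

Within each orthonormal block the off-diagonal inner products vanish, so γ is attained by a cross term ⟨eᵢ, Ψⱼ⟩, whose magnitude is bounded by √(2/n).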

We now formally establish our signal model. Consider a signal that is the superposition of a pair of sparse vectors in different bases, i.e.,

 x=Φw+Ψz, (3.2)

where Φ, Ψ ∈ ℝ^{n×n} are orthonormal bases, and w, z ∈ ℝⁿ are such that ∥w∥₀ ≤ s and ∥z∥₀ ≤ s. We define the following quantities:

 x̄ = (Φw̄ + Ψz̄)/∥Φw̄ + Ψz̄∥₂ = α(Φw̄ + Ψz̄), (3.3)

where α = 1/∥Φw̄ + Ψz̄∥₂. Also, define the coefficient vector t = [w̄; z̄] ∈ ℝ²ⁿ as the vector obtained by stacking the individual coefficient vectors w̄ and z̄ of the component signals.

We now state our measurement model. Consider the nonlinear observation model:

 yi=g(aTix)+ei, i=1…m, (3.4)

where x ∈ ℝⁿ is the superposition signal given in (3.2), and g : ℝ → ℝ represents a nonlinear link function. We denote the derivative of g by g′, i.e., g′(x) = dg(x)/dx. As mentioned above, depending on the knowledge of the link function g, we consider two scenarios:

1. In the first scenario, the nonlinear link function g may be non-smooth, non-invertible, or even unknown. In this setting, we assume the noiseless observation model, i.e., eᵢ = 0. In addition, we assume that the measurement matrix A is populated by i.i.d. unit normal random variables.

2. In the second scenario, g represents a known nonlinear, differentiable, and strictly monotonic function. Further, in this scenario, we assume that each observation is corrupted by subgaussian additive noise eᵢ with ∥eᵢ∥_{ψ₂} ≤ τ for i = 1, …, m. We also assume that the additive noise has zero mean and is independent of aᵢ, i.e., E[eᵢ] = 0 for i = 1, …, m. In addition, we assume that the measurement matrix A consists of either (2a) isotropic random vectors that are incoherent with Φ and Ψ, or (2b) rows populated with subgaussian random variables.

We highlight some additional clarifications for the second case. In particular, we make the following assumption:

###### Assumption 3.7.

There exist nonnegative constants l₁ and l₂ such that l₁ ≤ |g′(x)| ≤ l₂ for all x ∈ ℝ.

In words, the derivative of the link function is strictly bounded either within a positive interval or within a negative interval. In this paper, we focus on the case when 0 < l₁ ≤ g′(x) ≤ l₂. The analysis of the complementary case is similar.

The lower bound on g′ guarantees that g is a monotonic function, i.e., if x₁ < x₂ then g(x₁) < g(x₂). Moreover, the upper bound on g′ guarantees that g is Lipschitz with constant l₂. Such assumptions are common in the nonlinear recovery literature [16, 40].³Using the monotonicity property of g that arises from Assumption 3.7, one might be tempted to simply apply the inverse of the link function to the measurements in (3.4), converting the nonlinear demixing problem into the more amenable case of linear demixing, and then use any standard algorithm (e.g., [11]) to recover the constituent signals. However, this naïve approach can result in a large error in the estimation of the components, particularly in the presence of the noise in (3.4). This issue has also been considered in [40] for generic nonlinear recovery, from both a theoretical and an empirical standpoint.
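As a hypothetical example of a link satisfying Assumption 3.7 (our illustration, not one prescribed by the paper), take g(x) = x + 0.5 sin(x), whose derivative g′(x) = 1 + 0.5 cos(x) lies in [l₁, l₂] = [0.5, 1.5]:

```python
import numpy as np

def g(x):
    # g'(x) = 1 + 0.5*cos(x), so 0.5 <= g'(x) <= 1.5 everywhere:
    # g is strictly monotone and 1.5-Lipschitz, as Assumption 3.7 requires.
    return x + 0.5 * np.sin(x)

xs = np.linspace(-10.0, 10.0, 10001)
vals = g(xs)
deriv = np.gradient(vals, xs)   # numerical check of the derivative bounds
```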

In Case 2a, the vectors aᵢ (i.e., the rows of A) are independent isotropic random vectors. For this case, in addition to incoherence between the component bases, we also need a measure of cross-coherence between the measurement matrix A and the dictionary Γ = [Φ Ψ]. The following notion of cross-coherence was introduced in the early literature of compressive sensing [44]:

###### Definition 3.8.

(Cross-coherence.) The cross-coherence parameter between the measurement matrix A and the dictionary Γ is defined as follows:

 ϑ = max_{i,j} |aᵢᵀΓⱼ|/∥aᵢ∥₂, (3.5)

where aᵢ denotes the i-th row of the measurement matrix A and Γⱼ denotes the j-th column of the dictionary Γ.

The cross-coherence assumption implies that ∥Γ_Sᵀaᵢ∥₂ ≤ ϑ√(2s)∥aᵢ∥₂ for i = 1, …, m, where Γ_S denotes the restriction of the columns of the dictionary Γ to a set S with |S| ≤ 2s, such that at most s columns are selected from each basis Φ and Ψ.
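Definition 3.8 translates directly into code. The sketch below estimates ϑ for a Gaussian measurement matrix and the identity/DCT dictionary; the sizes and the random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 64, 32
A = rng.standard_normal((m, n))    # rows a_i

Phi = np.eye(n)
jj, kk = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
Psi = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * jj + 1) * kk / (2 * n))
Psi[:, 0] = 1.0 / np.sqrt(n)
Gamma = np.hstack([Phi, Psi])      # dictionary [Phi Psi]

# theta = max_{i,j} |a_i^T Gamma_j| / ||a_i||_2
row_norms = np.linalg.norm(A, axis=1, keepdims=True)
theta = np.max(np.abs(A @ Gamma) / row_norms)
```

Since every column of Γ has unit norm, each normalized inner product aᵢᵀΓⱼ/∥aᵢ∥₂ behaves like a Gaussian of standard deviation roughly 1/√n, so ϑ concentrates well below 1 in this regime.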

## 4 Algorithms and Theoretical Results

Having defined the above quantities, we now present our main results. As per the previous section, we study two distinct scenarios:

### 4.1 When the link function g is unknown

Recall that we wish to recover the components w and z given the nonlinear measurements y and the matrix A. Here and below, for simplicity we assume that the sparsity levels s₁ and s₂, specifying the sets K₁ and K₂, are equal, i.e., s₁ = s₂ = s. The algorithm (and analysis) effortlessly extends to the case of unequal sparsity levels. Our proposed algorithm, which we call OneShot, is described in pseudocode form below as Algorithm 1.

The mechanism of OneShot is simple, and deliberately myopic. At a high level, OneShot first constructs a linear estimator x̂_lin of the target superposition signal. Then, it performs independent projections of x̂_lin onto the constraint sets K₁ and K₂. Finally, it combines these two projections to obtain the final estimate x̂ of the target superposition signal.

In the above description of OneShot, we have used the following projection operators:

 ŵ = P_s(Φ*x̂_lin),  ẑ = P_s(Ψ*x̂_lin).

Here, P_s denotes the projection onto the set of (canonical) s-sparse signals, and can be implemented by hard thresholding, i.e., any procedure that retains the s largest coefficients of a vector (in terms of absolute value) and sets the others to zero.⁴The typical way is to sort the coefficients by magnitude and retain the s largest entries, but other methods such as randomized selection can also be used. Ties between coefficients are broken arbitrarily. Observe that OneShot is not an iterative algorithm, and this in fact enables it to achieve a fast running time.
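Putting the pieces together, here is a minimal sketch of OneShot, assuming the linear proxy x̂_lin = (1/m)Aᵀy of Lemma 4.1 below; the basis choice and sizes in any usage are up to the caller.

```python
import numpy as np

def hard_threshold(v, s):
    """P_s: keep the s largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    out[keep] = v[keep]
    return out

def oneshot(y, A, Phi, Psi, s):
    """Non-iterative demixing: linear proxy, then myopic projections."""
    m = A.shape[0]
    x_lin = A.T @ y / m                     # linear estimator of mu * xbar
    w_hat = hard_threshold(Phi.T @ x_lin, s)
    z_hat = hard_threshold(Psi.T @ x_lin, s)
    return w_hat, z_hat, Phi @ w_hat + Psi @ z_hat
```

For example, with binary observations yᵢ = sign(⟨aᵢ, x⟩), the output x̂ correlates strongly with the normalized superposition x̄ once m is a modest multiple of s log(n/s), up to the scale factor μ discussed below.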

We now provide a rigorous performance analysis of OneShot. Our proofs follow the geometric approach of [12], specialized to the demixing problem. In particular, we derive an upper bound on the estimation error of the component signals w and z, modulo scaling factors. In our proofs, we use the following result from [12], restated here for completeness.

###### Lemma 4.1.

(Quality of linear estimator.) Given the model in Equation (3.2), the linear estimator x̂_lin = (1/m)Aᵀy is an unbiased estimator of x̄ (defined in (3.3)) up to constants. That is, E[x̂_lin] = μx̄, where μ = E[g(X)X], σ² = E[(g(X) − μX)²], and η² = E[g(X)²] for X ∼ N(0, 1).

We now state our first main theoretical result, with the full proof provided below in Section 6.

###### Theorem 4.2.

Let y ∈ ℝᵐ be the set of measurements generated using a nonlinear function g that satisfies the conditions of Lemma 4.9 in [12].⁵Based on this lemma, the nonlinear function g is odd, nondecreasing, and sub-multiplicative on ℝ. Let A ∈ ℝ^{m×n} be a random matrix with i.i.d. standard normal entries. Also, let Φ and Ψ be bases with incoherence parameter ε, where ε is as defined in Definition 3.6. If we use OneShot to recover estimates of w and z (modulo a scaling) as described in equations (3.2) and (3.3), then the estimation error for ŵ (similarly, ẑ) satisfies the following upper bound in expectation:

 E∥ŵ − μαw∥₂ ≤ ρ + (2/√m)(4σ + η·W_ρ(K)/ρ) + 8με. (4.1)

The constants in this bound are chosen for convenience and can be sharpened. The authors of [45, 12] provide upper bounds on the local mean width of the set of s-sparse vectors. In particular, for any t > 0 they show that W_t(K) ≤ Ct√(s log(2n/s)) for some absolute constant C. By plugging in this bound and letting ρ → 0, we can combine the error bounds for the components w and z, which gives the following:

###### Corollary 4.3.

With the same assumptions as in Theorem 4.2, the nonlinear estimation error incurred by the final output x̂ satisfies the upper bound:

 E∥x̂ − μx̄∥₂ ≤ (4/√m)(4σ + Cη√(s·log(2n/s))) + 16με. (4.2)
###### Corollary 4.4.

(Example quantitative result.) The constants μ, σ, and η depend on the nature of the nonlinear function g, and are often rather mild. For example, if g(x) = sign(x), then we may substitute

 μ = √(2/π) ≈ 0.8,  σ = √(1 − 2/π) ≈ 0.6,  η = 1,

in the above statement. Hence, the bound in (4.2) becomes:

 E∥x̂ − μx̄∥₂ ≤ (4/√m)(3.1 + C√(s·log(2n/s))) + 13ε. (4.3)
###### Proof.

Using Lemma 4.1, μ = E[g(X)X], where X = ⟨a₁, x̄⟩. Since a₁ is standard normal and x̄ has unit norm, X ∼ N(0, 1). Thus, μ = E[sign(X)X] = E|X| = √(2/π). Moreover, we can write σ² = E[(sign(X) − μX)²] = 1 − 2μE|X| + μ²E[X²] = 1 − 2/π; here, we have used the fact that X² obeys the χ² distribution with mean 1. Finally, η² = E[sign(X)²] = 1. ∎
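These constants can be checked numerically. The sketch below estimates μ, σ, and η for g = sign directly from their definitions, μ = E[g(X)X], σ² = E[(g(X) − μX)²], and η² = E[g(X)²], with X ∼ N(0, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(1_000_000)
gX = np.sign(X)

mu = np.mean(gX * X)                          # approx sqrt(2/pi) ~ 0.8
sigma = np.sqrt(np.mean((gX - mu * X) ** 2))  # approx sqrt(1 - 2/pi) ~ 0.6
eta = np.sqrt(np.mean(gX ** 2))               # exactly 1 for g = sign
```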

In contrast with demixing algorithms for traditional (linear) observation models, the signal estimate output by OneShot can differ from the true signal by a scale factor. Next, suppose we fix δ > 0 as a small constant, suppose that the incoherence parameter satisfies ε ≤ cδ for some constant c, and suppose that the number of measurements scales as:

 m = O((s/δ²)·log(n/s)). (4.4)

Then, the (expected) estimation error in (4.2) is O(δ). In other words, the sample complexity of OneShot is m = O((s/δ²) log(n/s)), which resembles results for the linear observation case [11, 12].⁶Here, we use the term “sample complexity” to mean the number of measurements required by a given algorithm to achieve a target estimation error δ. However, we must mention that algorithms for the linear observation model are able to achieve stronger sample complexity bounds that are independent of δ.

We observe that the estimation error in (4.2) contains the additive term 16με, which does not vanish as m grows. The bound is therefore meaningful only when ε is sufficiently small, or equivalently (since ε ≤ sγ) when sγ is small. Per the Welch bound [7], the mutual coherence satisfies γ = Ω(1/√n). Therefore, Theorem 4.2 provides non-trivial results only when s = O(√n). This is consistent with the square-root bottleneck that is often observed in demixing problems; see [46] for detailed discussions.

The above theorem bounds the expected value of the estimation error. We can derive a similar upper bound that holds with high probability; in this case, we assume that the measurements have a sub-gaussian distribution (per Definition 3.3). We obtain the following result, with the full proof deferred to Section 6.

###### Theorem 4.5.

(High-probability version of Thm. 4.2.) Let y₁, …, y_m be a set of measurements with a sub-gaussian distribution. Assume that A is a random matrix with standard normal entries, and that Φ and Ψ are two bases with incoherence ε as in Definition 3.6. If we use OneShot to recover w and z (up to scaling) as described in (3.2) and (3.3), then the estimation error of the output of OneShot satisfies the following:

$$\|\hat{x}-\mu\bar{x}\|_2 \;\le\; \frac{4\eta}{\sqrt{m}}\Big(3s' + C'\sqrt{s\log\frac{2n}{s}}\Big) + 16\mu\varepsilon, \qquad (4.5)$$

with high probability, where the precise tail bound and the absolute constants involved are given in Section 6. The coefficients μ, σ, and η are given in Lemma 4.1, and the ψ₂-norm of the first measurement is as in Definition 3.3.

In Theorem 4.5, we stated the tail probability bound on the estimation error for the superposition signal. As with Theorem 4.2, one can derive a completely analogous tail probability bound in terms of the constituent signals.

### 4.2 When the link function g is known

The advantage of OneShot is that it enables fast demixing and can handle even unknown, non-differentiable link functions. Its primary weakness, however, is that the sparse components are recovered only up to an arbitrary scale factor; this can lead to high estimation errors in practice and can be unsatisfactory in applications. Moreover, even for reliable recovery up to a scale factor, its sample complexity depends inversely on the estimation error. To address these issues, we propose a different, iterative algorithm for recovering the signal components. The main difference is that this algorithm is assumed to possess (perfect) knowledge of the nonlinear link function g.

Recall that we define Γ = [Φ Ψ] and t = [w; z] ∈ ℝ^{2n}. First, we formulate our demixing problem as the minimization of a special loss function F(t):

$$\min_{t\in\mathbb{R}^{2n}} \; F(t) = \frac{1}{m}\sum_{i=1}^m \Theta(a_i^T\Gamma t) - y_i\, a_i^T\Gamma t \qquad \text{s.t.}\quad \|t\|_0 \le 2s. \qquad (4.6)$$

Observe that the loss function is not the typical squared-error function commonly encountered in statistics and signal processing applications; rather, it depends heavily on the nonlinear link function g (via its integral Θ). Such loss functions are commonly used for GLM and SIM estimation in the statistics literature [16]. In fact, the objective function in (4.6) can be viewed as the sample version of the problem:

$$\min_{t\in\mathbb{R}^{2n}} \; \mathbb{E}\left[\Theta(a^T\Gamma t) - y\, a^T\Gamma t\right],$$

where a is a standard normal random vector and y satisfies the model (3.4). It is not hard to show that the minimizer of this population problem coincides with the true parameter vector. We note that the gradient of the loss function can be calculated in closed form:

$$\nabla F(t) = \frac{1}{m}\sum_{i=1}^m \Gamma^T a_i\, g(a_i^T\Gamma t) - y_i\,\Gamma^T a_i = \frac{1}{m}\Gamma^T A^T\big(g(A\Gamma t) - y\big). \qquad (4.7)$$
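The closed-form gradient (4.7) can be checked against finite differences. The sketch below is a minimal plain-Python instance under assumptions of our own choosing: a smooth link g = tanh (so Θ = log cosh) and, purely for simplicity, Γ taken to be the identity; all names are illustrative:

```python
import math
import random

random.seed(1)
m, d = 30, 4                                  # d plays the role of 2n
g = math.tanh                                 # assumed smooth link; Theta' = g
Theta = lambda u: math.log(math.cosh(u))

A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
t_true = [1.0, -2.0, 0.0, 0.5]
y = [g(sum(ai[j] * t_true[j] for j in range(d))) for ai in A]  # Gamma = I

def F(t):
    """Sample loss (4.6): (1/m) sum_i Theta(a_i^T t) - y_i a_i^T t."""
    total = 0.0
    for ai, yi in zip(A, y):
        u = sum(ai[j] * t[j] for j in range(d))
        total += Theta(u) - yi * u
    return total / m

def gradF(t):
    """Closed-form gradient (4.7) with Gamma = I: (1/m) A^T (g(A t) - y)."""
    r = [g(sum(ai[j] * t[j] for j in range(d))) - yi for ai, yi in zip(A, y)]
    return [sum(A[i][j] * r[i] for i in range(m)) / m for j in range(d)]

# Central finite-difference check of the analytic gradient
t0 = [0.3, 0.1, -0.2, 0.0]
eps = 1e-6
fd = []
for j in range(d):
    tp, tm = list(t0), list(t0)
    tp[j] += eps
    tm[j] -= eps
    fd.append((F(tp) - F(tm)) / (2 * eps))
analytic = gradF(t0)
```

The analytic and finite-difference gradients should agree to within the finite-difference error (roughly 1e-9 here).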

We now propose an iterative algorithm for solving (4.6), which we call Demixing with Hard Thresholding (DHT). The method is detailed in Algorithm 2. At a high level, DHT iteratively refines its estimates of the constituent signals (and of the superposition signal). In each iteration, it constructs the gradient using (4.7), updates the current estimate according to the update rule specified in Algorithm 2, and then performs hard thresholding to obtain new estimates of the components. This procedure is repeated until a stopping criterion is met; see Section 5 for the choice of stopping criterion and other details. We mention that the initialization step in Algorithm 2 is arbitrary and can be implemented (for example) by running OneShot and using its estimates as the initial points. We use this initialization in our simulation results.

Implicitly, we have again assumed that both component vectors w and z are s-sparse; however, as above, Algorithm 2 and the corresponding analysis easily extend to differing levels of sparsity in the two components. In Algorithm 2, the projection operator maps a vector onto the set of sparse vectors, and is again implemented via hard thresholding.
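The two primitives described above, a gradient step followed by hard thresholding, can be sketched in a few lines. The following is our illustrative plain-Python rendering of a DHT-style loop (Algorithm 2 itself is specified in the paper); for the self-contained demo we use the identity link, under which the loss reduces to least squares:

```python
import random

def hard_threshold(t, k):
    """Project t onto the set of k-sparse vectors: keep the k largest-magnitude entries."""
    keep = sorted(range(len(t)), key=lambda i: -abs(t[i]))[:k]
    out = [0.0] * len(t)
    for i in keep:
        out[i] = t[i]
    return out

def dht(grad, t0, k, step, iters):
    """DHT-style iteration: gradient update, then hard thresholding."""
    t = list(t0)
    for _ in range(iters):
        gvec = grad(t)
        t = hard_threshold([ti - step * gi for ti, gi in zip(t, gvec)], k)
    return t

# Demo with the identity link, so the loss gradient is (1/m) A^T (A t - y).
random.seed(2)
m, d, s = 40, 10, 2
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
t_true = [0.0] * d
t_true[1], t_true[7] = 1.0, -0.5
y = [sum(ai[j] * t_true[j] for j in range(d)) for ai in A]

def grad(t):
    r = [sum(ai[j] * t[j] for j in range(d)) - yi for ai, yi in zip(A, y)]
    return [sum(A[i][j] * r[i] for i in range(m)) / m for j in range(d)]

t_hat = dht(grad, [0.0] * d, s, 0.5, 300)
err = max(abs(a - b) for a, b in zip(t_hat, t_true))
```

With m well above the sparsity level, the iterates lock onto the correct support and converge geometrically, so `err` is tiny.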

We now provide our second main theoretical result, supporting the convergence analysis of DHT. In particular, we derive an upper bound on the estimation error of the stacked vector t (and therefore of the component vectors w and z). The proofs of Theorems 4.6, 4.7, and 4.8 are deferred to Section 6.

###### Theorem 4.6.

(Performance of DHT) Consider the measurement model (3.4) with all the assumptions stated for the second scenario in Section 3. Suppose that the corresponding objective function satisfies the RSS/RSC properties, with the respective constants, on the set of sparse vectors (here k denotes the iteration count), with the ratio of the RSS and RSC constants suitably bounded. Choose a step size parameter accordingly. Then, DHT outputs a sequence of estimates t^k that satisfies the following upper bound (in expectation) for every iteration:

$$\|t^{k+1}-t^*\|_2 \;\le\; (2q)^k\,\|t^0-t^*\|_2 + C\tau\sqrt{\frac{s}{m}}, \qquad (4.8)$$

where q is the convergence rate and C is a constant that depends on the step size and on q. Also, t* = [w*; z*], where w* and z* are the true (unknown) vectors in the model (1.2).

Equation (4.8) indicates that Algorithm 2 (DHT) enjoys a linear rate of convergence. In particular, in the noiseless case, this implies that Alg. 2 returns a solution with accuracy κ after O(log(1/κ)) iterations. The proof of Theorem 4.6 leverages the fact that the objective function in (4.6) satisfies the RSC/RSS conditions specified in Definition 3.5; please refer to Section 6 for a more detailed discussion. Moreover, we observe that, in contrast with OneShot, DHT can recover the components w and z without any scaling ambiguity, as reflected in the bound (4.8). We also verify this observation empirically in our simulation results in Section 5.
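The iteration count implied by the geometric decay in (4.8) can be read off directly: in the noiseless case the error after k iterations is (2q)^k times the initial error, so reaching accuracy κ takes about log(e₀/κ)/log(1/(2q)) iterations. A toy numeric check (the values q = 0.2 and κ = 10⁻⁶ are our arbitrary choices):

```python
import math

q = 0.2                    # contraction parameter; need 2q < 1
e0, kappa = 1.0, 1e-6      # initial error and target accuracy

# Simulate the geometric decay from (4.8), noiseless case
err, iters = e0, 0
while err > kappa:
    err *= 2 * q
    iters += 1

# Closed-form prediction of the iteration count
predicted = math.ceil(math.log(e0 / kappa) / math.log(1 / (2 * q)))
print(iters, predicted)
```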

Echoing our discussion in Section 3, we consider two different models for the measurement matrix A and derive upper bounds on the sample complexity of DHT corresponding to each case. First, we present the sample complexity of Alg. 2 when the measurement vectors are isotropic random vectors, corresponding to Case (2a) described in the introduction:

###### Theorem 4.7.

(Sample complexity when the rows of A are isotropic.) Suppose that the rows of the measurement matrix A are independent isotropic random vectors. Then, in order to achieve the requisite RSS/RSC properties of Theorem 4.6, the number of samples needs to scale as:

$$m = O\big(s\,\log n\,\log^2 s\,\log(s\log n)\big),$$

provided that the bases Φ and Ψ are sufficiently incoherent.

The sample complexity in Theorem 4.7 incurs an extra (possibly parasitic) poly-logarithmic factor relative to the sample complexity of OneShot stated in (4.4). However, the drawback of OneShot is that its sample complexity depends inversely on the estimation error, so a very small target error incurs a high overhead in terms of the number of samples.

Removing all the extra logarithmic factors remains an open problem in general (although some improvements can be obtained using the method of [47]). However, if we assume additional structure in the measurement matrix A, we can decrease the sample complexity even further. This corresponds to Case (2b).

###### Theorem 4.8.

(Sample complexity when the elements of A are subgaussian.) Assume that all assumptions and definitions of Theorem 4.6 hold, except that the rows of the matrix A are independent subgaussian isotropic random vectors. Then, in order to achieve the requisite RSS/RSC properties of Theorem 4.6, the number of samples needs to scale as:

$$m = O\Big(s\log\frac{n}{s}\Big),$$

provided that the bases Φ and Ψ are sufficiently incoherent.

The leading big-Oh constant in the expressions for m in Theorems 4.7 and 4.8 is somewhat complicated, and hides the dependence on the incoherence parameter ε, the mutual coherence γ, the RSC/RSS constants, and the growth parameters of the link function. Please see Section 6 for more details.

In Theorem 4.6, we expressed the upper bound on the estimation error in terms of the stacked vector t. It is easy to translate these results into bounds on the component vectors w and z using the triangle inequality:

$$\max\{\|w^0-w^*\|_2,\ \|z^0-z^*\|_2\} \;\le\; \|t^0-t^*\|_2 \;\le\; \|w^0-w^*\|_2 + \|z^0-z^*\|_2.$$
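The displayed inequality is simply how Euclidean norms behave under stacking: for t = [w; z] we have ‖t‖₂² = ‖w‖₂² + ‖z‖₂², which sandwiches ‖t‖₂ between the maximum and the sum of the two block norms. A quick plain-Python check on random error vectors (dimensions and seed are arbitrary):

```python
import math
import random

random.seed(3)
w_err = [random.gauss(0, 1) for _ in range(5)]   # stands for w^0 - w^*
z_err = [random.gauss(0, 1) for _ in range(5)]   # stands for z^0 - z^*

norm = lambda v: math.sqrt(sum(x * x for x in v))
nw, nz = norm(w_err), norm(z_err)
nt = norm(w_err + z_err)  # list concatenation = the stacked error vector

print(max(nw, nz) <= nt <= nw + nz)
```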

See Section 6 for proofs and further details.

## 5 Experimental Results

In this section, we provide a range of numerical experiments for our proposed algorithms on both synthetic and real data. We compare the performance of OneShot and DHT with a LASSO-type technique for demixing, as well as with a heuristic variant of OneShot based on soft thresholding (inspired by the approach proposed in [48]). We call these methods Nonlinear Convex Demixing with LASSO (NlcdLASSO) and Demixing with Soft Thresholding (DST), respectively. Before describing our simulation results, we briefly describe these two methods.

NlcdLASSO is a heuristic method motivated by [12], although it was not explicitly developed there in the demixing context. Using our notation from Sections 3 and 4, NlcdLASSO solves the following convex problem:

$$\min_{w,z} \; \big\|\hat{x}_{\mathrm{lin}} - [\Phi\ \Psi][w;z]\big\|_2 \qquad \text{subject to}\quad \|w\|_1 \le \sqrt{s},\ \|z\|_1 \le \sqrt{s}. \qquad (5.1)$$

Here, x̂_lin denotes the linear proxy of x, and s denotes the sparsity level of the constituent signals in the bases Φ and Ψ, respectively. The constraints in problem (5.1) are convex penalties reflecting the knowledge that w and z are s-sparse and have unit ℓ₂-norm (since the nonlinearity is unknown, we have a scale ambiguity, so w.l.o.g. we may assume that the underlying signals lie in the unit ball). The outputs of this algorithm are the estimates ŵ and ẑ, together with the corresponding superposition estimate.

To solve the optimization problem (5.1), we use the SPGL1 solver [49, 50], which can handle the large-scale problems arising in our experimental evaluations. We impose a joint ℓ₁ constraint on [w; z], which is a slight relaxation of the constraints in (5.1). The upper bound of √s in the constraints is a worst-case criterion; therefore, for a fairer comparison, we also include simulation results with a tighter ℓ₁ constraint whose radius has been tuned to the best of our ability.

DST, on the other hand, solves the optimization problem (4.6) via a convex relaxation of the sparsity constraint. In other words, it attempts to solve the following relaxed version of (4.6):

$$\min_{t} \; \frac{1}{m}\sum_{i=1}^m \Theta(a_i^T\Gamma t) - y_i\, a_i^T\Gamma t + \beta'\|t\|_1, \qquad (5.2)$$

where ‖t‖₁ represents the ℓ₁-norm of the stacked vector t, and β′ denotes the tuning parameter. The solution of this problem at iteration k is given by the soft thresholding operator as follows:

$$t^{k+1} = \mathcal{S}_{\beta'\eta'}\big(t^k - \eta'\nabla F(t^k)\big),$$

where η′ denotes the step size, and the soft thresholding operator 𝒮_λ(·) is given by:

$$\mathcal{S}_{\lambda}(y) = \begin{cases} y-\lambda, & \text{if } y > \lambda\\ 0, & \text{if } |y| \le \lambda\\ y+\lambda, & \text{if } y < -\lambda. \end{cases}$$
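A direct plain-Python transcription of this operator, together with one DST update step (function names are our illustrative choices):

```python
def soft_threshold(v, lam):
    """Entrywise soft-thresholding operator S_lambda."""
    out = []
    for y in v:
        if y > lam:
            out.append(y - lam)
        elif y < -lam:
            out.append(y + lam)
        else:
            out.append(0.0)
    return out

def dst_step(t, gradF, step, beta):
    """One DST iteration: t <- S_{beta*step}(t - step * gradF(t))."""
    u = [ti - step * gi for ti, gi in zip(t, gradF(t))]
    return soft_threshold(u, beta * step)
```

For example, `soft_threshold([2.0, -0.5, 0.3], 1.0)` shrinks the large entry to 1.0 and zeroes the two entries inside the threshold band.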

Neither OneShot nor NlcdLASSO assumes knowledge of the link function, and consequently each returns a solution only up to a scalar ambiguity. Therefore, to compare performance across algorithms, we use the (scale-invariant) cosine similarity between the original superposition signal x and the output x̂ of a given algorithm, defined as:

$$\cos(x,\hat{x}) = \frac{x^T\hat{x}}{\|x\|_2\,\|\hat{x}\|_2}.$$
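In code, this metric is one line of arithmetic; a plain-Python version (the function name is ours):

```python
import math

def cosine_similarity(x, x_hat):
    """Scale-invariant similarity: x^T x_hat / (||x||_2 ||x_hat||_2)."""
    dot = sum(a * b for a, b in zip(x, x_hat))
    nx = math.sqrt(sum(a * a for a in x))
    nh = math.sqrt(sum(b * b for b in x_hat))
    return dot / (nx * nh)

# Invariant to positive rescaling -- exactly why it suits OneShot's output:
print(cosine_similarity([1.0, 2.0, 0.0], [0.5, 1.0, 0.0]))  # ~1.0
```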

### 5.1 Synthetic Data

As discussed above, successful recovery requires the constituent signals to be sufficiently incoherent. To achieve this, we choose Φ to be the 1D Haar wavelet basis and Ψ to be the noiselet basis (these bases are known to be maximally incoherent relative to each other [51]). For the measurement operator A, we choose a partial DFT matrix; such matrices are known to offer recovery performance similar to random Gaussian matrices while enabling fast numerical operations [52]. We present experiments with both non-smooth and differentiable link functions. For the non-smooth case, we choose g(x) = sign(x); here, we only present recovery results using OneShot and NlcdLASSO, since our analysis of DHT and DST only handles smooth link functions.

The results of our first experiment are shown in Figure 1(a) and Figure 1(b). The test signal is generated as follows: we fix the signal length n, and generate the vectors w and z by randomly selecting a support with s nonzero elements and populating the nonzero entries with random coefficients. The plots illustrate the performance of OneShot and NlcdLASSO, measured by cosine similarity, for different choices of the sparsity level s; the nonlinear link function is set to g(x) = sign(x), and we use both variants of the ℓ₁ constraint. The horizontal axis denotes an increasing number of measurements. Each data point in the plot is obtained by a Monte Carlo experiment in which a new random measurement matrix is generated, the cosine similarity between the true signal and the reconstructed estimate is recorded, and the results are averaged over the trials.

Notably, the performance of NlcdLASSO is worse than that of OneShot for any fixed choice of s, no matter which upper bound on the ℓ₁-norm is used. Even when the number of measurements is high (for example, at the right edge of plot (b)), OneShot outperforms NlcdLASSO by a significant margin: NlcdLASSO is markedly worse in terms of signal estimation quality, while OneShot recovers the (normalized) signal perfectly. This result indicates the inefficiency of NlcdLASSO for nonlinear demixing.

Next, we contrast the running times of the two algorithms, illustrated in Figure 1(c). In this experiment, we measure the wall-clock running time of the two recovery algorithms (OneShot and NlcdLASSO) while varying the signal size n; the sparsity level, the number of measurements, and the number of Monte Carlo trials are held fixed, and the same nonlinear link function is used. As the plot shows, OneShot is substantially faster than NlcdLASSO at the largest signal sizes. Overall, OneShot is efficient even for large-scale nonlinear demixing problems. We mention that in the above setup, the main computational costs incurred by OneShot involve a matrix-vector multiplication followed by a thresholding step, both of which can be performed in time that is nearly linear in the signal length for certain choices of the sparsifying bases.

Next, we turn to differentiable link functions. In this case, we generate the constituent signal coefficient vectors as before and compare the performance of the four algorithms described above. The nonlinear link function is chosen to be smooth and monotone; it is easy to check that its derivative is strictly bounded between two positive constants. The maximal number of iterations for both DHT and DST is fixed, with an early stopping criterion if convergence is detected. The step size is hard to estimate in practice, and is therefore chosen by manual tuning so that both DHT and DST achieve their best respective performance.

Figure 2 illustrates the performance of the four algorithms in terms of phase transition plots, following [23]. In these plots, we vary both the sparsity level s and the number of measurements m. For each pair (s, m), as above, we randomly generate the test superposition signal by choosing both the support and the coefficients at random, as well as the measurement matrix, and repeat this experiment over multiple Monte Carlo trials. We calculate the empirical probability of successful recovery as the fraction of trials in which the output cosine similarity exceeds a fixed threshold. Pixel intensities in each figure are normalized to lie between 0 and 1, indicating the probability of successful recovery.

As we observe in Fig. 2, DHT has the best performance among the different methods and, in particular, outperforms both convex-relaxation based methods. The algorithm closest to DHT in terms of signal recovery is DST, while the LASSO-based method fails to recover the superposition signal (and consequently the constituent signals). The improvement over OneShot is to be expected since, as discussed before, OneShot neither leverages knowledge of the link function nor is iterative.

In Fig. 3, we fix the sparsity level and plot the probability of recovery for the different algorithms with a varying number of measurements. The number of Monte Carlo trials is set to 20, and the empirical probability of successful recovery is defined as the fraction of trials in which the output cosine similarity exceeds a fixed threshold. Different nonlinear link functions are used in panels (a) and (b). As we can see, DHT has the best performance, while NlcdLASSO in panel (a), and both OneShot and NlcdLASSO in panel (b), fail to recover the superposition signal even with the maximum number of measurements.

### 5.2 Real Data

In this section, we provide representative results on real-world 2D image data, using OneShot and NlcdLASSO for the non-smooth link function g(x) = sign(x). In addition, we illustrate results for all four algorithms using a smooth link function.

We begin with a test image. First, we obtain its 2D Haar wavelet decomposition and retain the largest coefficients, yielding a sparse coefficient vector; we then reconstruct the image based on these largest coefficients. Similar to the synthetic case, we generate a noise component in our superposition model based on 500 noiselet coefficients. In addition, we consider a parameter that controls the strength of the noiselet component contributing to the superposition model, and set it to 0.1. Our test image is the sum of the wavelet-sparse image and the scaled noiselet component.

Figure 4 illustrates both the true and the reconstructed images using OneShot and NlcdLASSO. The measurements are taken using a subsampled Fourier matrix. From visual inspection, the image reconstructed by OneShot is better than that reconstructed by NlcdLASSO. Quantitatively, we also calculate the peak signal-to-noise ratio (PSNR) of the reconstructed images relative to the test image: we obtain a PSNR of 19.8335 dB using OneShot and a PSNR of 17.9092 dB using NlcdLASSO, again illustrating the better performance of OneShot compared to NlcdLASSO.

Next, we show our results using a differentiable link function. For this experiment, we consider the astronomical image illustrated in Fig. 5. This image includes two components: the "stars" component, which can be considered sparse in the identity basis, and the "galaxy" component, which is sparse when expressed in the discrete cosine transform basis. The superposition image is observed using a subsampled Fourier matrix multiplied with a diagonal matrix with random entries [53]. Further, each measurement is nonlinearly transformed by applying a (shifted) logistic function as the link function. In the recovery procedure using DHT, we fix the number of iterations and the step size. As is visually evident, our proposed DHT method is able to reliably recover the component signals.

## 6 Proofs

In this section, we derive the proofs of our theoretical results stated in Section 4.

### 6.1 Analysis of OneShot

Our analysis mostly follows the techniques of [12]; however, several additional complications arise due to the structure of the demixing problem. As a precursor, we need the following lemma from geometric functional analysis, restated from [12].

###### Lemma 6.1.

Assume that K is a closed star-shaped set. Then, for any point u ∈ K and any t > 0, one has the following result:

$$\|P_K(a) - u\|_2 \le \max\left(t,\ \frac{2}{t}\,\|a-u\|_{K_t^\circ}\right). \qquad (6.1)$$

We also use the following result of [12].

###### Claim 6.2.

(Orthogonal decomposition of the rows of A.) Suppose we decompose the rows a_i of A as:

$$a_i = \langle a_i, \bar{x}\rangle\,\bar{x} + b_i, \qquad (6.2)$$

where b_i is orthogonal to x̄, so that ⟨b_i, x̄⟩ = 0. Moreover, the measurements y_i in equation (3.4) and the orthogonal components b_i are statistically independent.
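The decomposition (6.2) is elementary to verify numerically: project a row onto the unit vector x̄ and check that the remainder is orthogonal and that the two parts sum back to the row. A plain-Python check (dimension and seed are arbitrary):

```python
import math
import random

random.seed(4)
n = 6
v = [random.gauss(0, 1) for _ in range(n)]
nrm = math.sqrt(sum(x * x for x in v))
xbar = [x / nrm for x in v]                       # unit-norm direction x-bar

a = [random.gauss(0, 1) for _ in range(n)]        # a measurement row a_i
c = sum(ai * xi for ai, xi in zip(a, xbar))       # <a_i, x-bar>
b = [ai - c * xi for ai, xi in zip(a, xbar)]      # orthogonal remainder b_i

dot_b_xbar = sum(bi * xi for bi, xi in zip(b, xbar))
recon_err = max(abs(ai - (c * xi + bi)) for ai, xi, bi in zip(a, xbar, b))
```

Both `dot_b_xbar` and `recon_err` are zero up to floating-point error, confirming (6.2).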

###### Proof of Theorem 4.2.

Observe that the magnitude of the signal may be lost due to the action of the nonlinear measurement function (such as the sign function). Therefore, our recovered signal approximates the true signal only modulo a scaling factor. Indeed, for μ defined in Lemma 4.1, we have:

$$\begin{aligned}
\|\hat{x}-\mu\bar{x}\|_2 &= \|\Phi\hat{w} + \Psi\hat{z} - \alpha\mu\Phi w - \alpha\mu\Psi z\|_2\\
&\le \|\Phi\|\,\|\hat{w}-\mu\alpha w\|_2 + \|\Psi\|\,\|\hat{z}-\mu\alpha z\|_2\\
&\le \left(\rho + \frac{2}{\rho}\,\|\Phi^*\hat{x}_{\mathrm{lin}} - \mu\alpha w\|_{K_\rho^\circ}\right) + \left(\rho + \frac{2}{\rho}\,\|\Psi^*\hat{x}_{\mathrm{lin}} - \mu\alpha z\|_{K_\rho^\circ}\right). \qquad (6.3)
\end{aligned}$$

The equality comes from the definition of x̂. The first inequality results from an application of the triangle inequality and the definition of the operator norm of a matrix, while the second inequality follows from Lemma 6.1. This proves the first part of Theorem 4.2. Now, it suffices to derive a bound on the first term in the above expression (since a similar bound holds for the second term). We have:

$$\begin{aligned}
\|\Phi^*\hat{x}_{\mathrm{lin}} - \mu\alpha w\|_{K_\rho^\circ}
&= \Big\|\Phi^*\tfrac{1}{m}\textstyle\sum_i\big(y_i\langle a_i,\bar{x}\rangle\bar{x} + y_i b_i\big) - \mu\alpha w\Big\|_{K_\rho^\circ}\\
&\le \Big\|\Phi^*\tfrac{1}{m}\textstyle\sum_i y_i\langle a_i,\bar{x}\rangle\bar{x} - \mu\alpha w\Big\|_{K_\rho^\circ} + \Big\|\Phi^*\tfrac{1}{m}\textstyle\sum_i y_i b_i\Big\|_{K_\rho^\circ}\\
&\le \underbrace{\Big\|\Phi^*\tfrac{1}{m}\textstyle\sum_i y_i\langle a_i,\bar{x}\rangle\bar{x} - \mu\Phi^*\bar{x}\Big\|_{K_\rho^\circ}}_{S_1} + \underbrace{\big\|\mu\alpha\Phi^*\Psi z\big\|_{K_\rho^\circ}}_{S_2} + \underbrace{\Big\|\Phi^*\tfrac{1}{m}\textstyle\sum_i y_i b_i\Big\|_{K_\rho^\circ}}_{S_3}. \qquad (6.4)
\end{aligned}$$

The first equality follows from Claim 6.2, while the two inequalities result from the triangle inequality. Also:

 S1 =∥Φ∗1mΣi(yi⟨ai,¯x⟩¯x)−μΦ∗¯x∥Koρ =∥(1mΣi(yi⟨ai