
Regularized Gradient Descent: A Nonconvex Recipe for Fast Joint Blind Deconvolution and Demixing

The authors acknowledge support from the NSF via grants DTRA-DMS 1322393 and DMS 1620455.

Shuyang Ling (Courant Institute of Mathematical Sciences, New York University; email: sling@cims.nyu.edu)
Thomas Strohmer (Department of Mathematics, University of California at Davis; email: strohmer@math.ucdavis.edu)
Abstract

We study the question of extracting a sequence of functions $\{f_i, g_i\}_{i=1}^{s}$ from observing only the sum of their convolutions, i.e., from $y = \sum_{i=1}^{s} f_i \ast g_i$. While convex optimization techniques can solve this joint blind deconvolution-demixing problem provably and robustly under certain conditions, for medium-size or large-size problems we need computationally faster methods that do not sacrifice the benefits of mathematical rigor that come with convex methods. In this paper we present a nonconvex algorithm which guarantees exact recovery under conditions that are competitive with convex optimization methods, with the additional advantage of being computationally much more efficient. Our two-step algorithm converges linearly to the global minimum and is also robust in the presence of additive noise. While the derived performance bounds are suboptimal in terms of the information-theoretic limit, numerical simulations show remarkable performance even if the number of measurements is close to the number of degrees of freedom. We discuss an application of the proposed framework in wireless communications in connection with the Internet-of-Things.

1 Introduction

Blind deconvolution is the task of estimating two unknown functions from their convolution. While it is a highly ill-posed bilinear inverse problem, blind deconvolution is also an extremely important problem in signal processing [1], communications engineering [39], image processing [5], audio processing [24], etc. In this paper, we deal with an even more difficult and more general variation of the blind deconvolution problem, in which we have to extract multiple convolved signals mixed together in one observation signal. This joint blind deconvolution-demixing problem arises in a range of applications such as acoustics [24], dictionary learning [2], and wireless communications [39].

We briefly discuss one such application in more detail. Blind deconvolution/demixing problems are expected to play a vital role in the future Internet-of-Things. The Internet-of-Things will connect billions of wireless devices, which is far more than current wireless systems can technically and economically accommodate. One of the many challenges in the design of the Internet-of-Things will be its ability to manage the massive number of devices that generate only sporadic traffic: they are inactive most of the time, but regularly access the network for minor updates with no human interaction [43]. Among other things, this means that the overhead caused by the exchange of certain types of information between transmitter and receiver, such as channel estimation, assignment of data slots, etc., has to be avoided as much as possible [36, 27].

Focusing on the underlying mathematical challenges, we consider a multi-user communication scenario where many different users/devices communicate with a common base station, as illustrated in Figure 1. Suppose we have $s$ users and each of them sends a signal $f_i$ through an unknown channel (which differs from user to user) to a common base station. We assume that the $i$-th channel, represented by its impulse response $g_i$, does not change during the transmission of the signal $f_i$. Therefore $g_i$ acts as a convolution operator, i.e., the signal transmitted by the $i$-th user arriving at the base station becomes $f_i \ast g_i$, where “$\ast$” denotes convolution.

Figure 1: Single-antenna multi-user communication scenario without explicit channel estimation: Each of the $s$ users sends a signal $f_i$ through an unknown channel $g_i$ to a common base station. The base station measures the superposition of all those signals, namely, $y = \sum_{i=1}^{s} f_i \ast g_i$ (plus noise). The goal is to extract all pairs $\{(f_i, g_i)\}_{i=1}^{s}$ simultaneously from $y$.

The antenna at the base station, instead of receiving each individual component $f_i \ast g_i$, is only able to record the superposition of all those signals, namely,

$y = \sum_{i=1}^{s} f_i \ast g_i + n, \qquad (1.1)$

where $n$ represents noise. We aim to develop a fast algorithm to simultaneously extract all pairs $\{(f_i, g_i)\}_{i=1}^{s}$ from $y$ (i.e., to estimate the channels/impulse responses $g_i$ and the signals $f_i$ jointly) in a numerically efficient and robust way, while keeping the number of required measurements as small as possible.
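To make the measurement model concrete, the following Python sketch simulates the noisy superposition of circular convolutions for a toy problem. The dimensions, the noise level, and the use of circular convolution (see Section 2) are illustrative assumptions rather than the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
L, s = 256, 4                                   # signal length and number of users (toy sizes)

# each user i transmits a signal f_i through an unknown channel with impulse response g_i
f = rng.standard_normal((s, L)) + 1j * rng.standard_normal((s, L))
g = rng.standard_normal((s, L)) + 1j * rng.standard_normal((s, L))

def cconv(a, b):
    # circular convolution, computed via the FFT
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))

# the base station only records the sum of all convolutions plus noise, cf. (1.1)
noise = 0.01 * (rng.standard_normal(L) + 1j * rng.standard_normal(L))
y = sum(cconv(f[i], g[i]) for i in range(s)) + noise
```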

1.1 State of the art and contributions of this paper

A thorough theoretical analysis concerning the solvability of demixing problems via convex optimization can be found in [26]. There, the authors derive explicit sharp bounds and phase transitions regarding the number of measurements required to successfully demix structured signals (such as sparse signals or low-rank matrices) from a single measurement vector. In principle we could recast the blind deconvolution/demixing problem as the demixing of a sum of rank-one matrices, see (2.3). As such, it seems to fit into the framework analyzed by McCoy and Tropp. However, the setup in [26] differs from ours in a crucial manner: McCoy and Tropp consider full-rank random matrices as measurement matrices (see the matrices in (2.3)), while in our setting the measurement matrices are rank-one. This difference fundamentally changes the theoretical analysis. The findings in [26] are therefore not applicable to the problem of joint blind deconvolution/demixing. The compressive principal component analysis in [42] is also a form of demixing problem, but its setting is only vaguely related to ours. There is a large body of literature on demixing problems, but the vast majority does not have a “blind deconvolution component”; therefore this body of work is only marginally related to the topic of our paper.

Blind deconvolution/demixing problems also appear in convolutional dictionary learning, see e.g. [2]. There, the aim is to factorize an ensemble of input vectors into a linear combination of overcomplete basis elements which are modeled as shift-invariant—the latter property is why the factorization turns into a convolution. The setup is similar to (1.1), but with an additional penalty term to enforce sparsity of the convolving filters. The existing literature on convolutional dictionary learning is mainly focused on empirical results, therefore there is little overlap with our work. But it is an interesting challenge for future research to see whether the approach in this paper can be modified to provide a fast and theoretically sound solver for the sparse convolutional coding problem.

There are numerous papers concerned with blind deconvolution/demixing problems in the area of wireless communications [31, 37, 20]. But the majority of these papers assumes the availability of multiple measurement vectors, which makes the problem significantly easier. Those methods however cannot be applied to the case of a single measurement vector, which is the focus of this paper. Thus there is essentially no overlap of those papers with our work.

Our previous paper [23] solves (1.1) under subspace conditions, i.e., assuming that the $f_i$ and $g_i$ belong to known linear subspaces. It generalizes the pioneering work by Ahmed, Recht, and Romberg [1] from the “single-user” scenario to the “multi-user” scenario. Both [1] and [23] employ a two-step convex approach: first “lifting” [9] is used, and then the lifted version of the original bilinear inverse problem is relaxed into a semidefinite program. An improvement of the theoretical bounds in [23] was announced in [29].

While the convex approach is certainly effective and elegant, it can hardly handle large-scale problems. This motivates us to apply a nonconvex optimization approach [8, 21] to the joint blind deconvolution-demixing problem. The mathematical challenge, when using nonconvex methods, is to derive a rigorous convergence framework with conditions that are competitive with those of a convex framework.

In the last few years several excellent articles have appeared on provably convergent nonconvex optimization applied to various problems in signal processing and machine learning, e.g., matrix completion [17, 16, 34], phase retrieval [8, 11, 33, 3], blind deconvolution [19, 4, 21], dictionary learning [32], super-resolution [12] and low-rank matrix recovery [35, 41]. In this paper we derive the first nonconvex optimization algorithm that solves (1.1) fast and with rigorous theoretical guarantees concerning exact recovery, convergence rates, and robustness to noisy data. Our work can be viewed as a generalization of the blind deconvolution framework of [21] to the multi-user scenario $s > 1$.

The idea behind our approach is strongly motivated by the nonconvex optimization algorithm for phase retrieval proposed in [8]. In this foundational paper, the authors use a two-step approach: (i) construct a good initial guess with a numerically efficient algorithm; (ii) starting with this initial guess, prove that simple gradient descent will converge to the true solution. Our paper follows a similar two-step scheme. However, the techniques used here are quite different from [8]. As in the matrix completion problem [7], the performance of the algorithm relies heavily and inherently on how much the ground truth signals are aligned with the design matrix. Due to this so-called “incoherence” issue, we need to impose extra constraints, which results in a different construction of the so-called basin of attraction. Therefore, influenced by [17, 34, 21], we add penalty terms to control the incoherence, and this leads to the regularized gradient descent method, which forms the core of our proposed algorithm.

To the best of our knowledge, our algorithm is the first algorithm for the blind deconvolution/blind demixing problem that is numerically efficient, robust against noise, and comes with rigorous recovery guarantees.

1.2 Notation

For a matrix $Z$, $\|Z\|$ denotes its operator norm and $\|Z\|_F$ its Frobenius norm. For a vector $z$, $\|z\|$ is its Euclidean norm and $\|z\|_{\infty}$ its $\ell_{\infty}$-norm. For both matrices and vectors, $Z^*$ and $z^*$ denote their complex conjugate transpose. $\bar{z}$ is the complex conjugate of $z$. We equip the matrix space $\mathbb{C}^{K\times N}$ with the inner product defined by $\langle U, V \rangle := \operatorname{Tr}(U^* V)$. For a given vector $z$, $\operatorname{diag}(z)$ represents the diagonal matrix whose diagonal entries are given by $z$. For any $z \in \mathbb{R}$, we let $z_+ := \max(z, 0)$ denote its positive part.

2 Preliminaries

Obviously, without any further assumption, it is impossible to solve (1.1). Therefore, we impose the following subspace assumptions throughout our discussion [1, 23].

  • Channel subspace assumption: Each finite impulse response $g_i \in \mathbb{C}^{L}$ is assumed to have maximum delay spread $K$, i.e.,

    $g_i = \begin{bmatrix} h_i \\ \mathbf{0} \end{bmatrix}, \qquad h_i \in \mathbb{C}^{K}.$

    Here $h_i$ is the nonzero part of $g_i$ and $K \le L$.

  • Signal subspace assumption: Let $f_i = C_i \bar{x}_i$ be the outcome of the signal $x_i \in \mathbb{C}^{N}$ encoded by a matrix $C_i \in \mathbb{C}^{L\times N}$ with $L > N$, where the encoding matrix $C_i$ is known and assumed to have full rank. (Here we use the conjugate $\bar{x}_i$ instead of $x_i$ because it will simplify our notation in later derivations.)

Remark 2.1.

Both subspace assumptions are common in various applications. For instance, in wireless communications the channel impulse response can always be modeled to have finite support (or maximum delay spread, as it is called in engineering jargon) due to the physical properties of wave propagation [14]; and the signal subspace assumption is a standard feature of many current communication systems [14], including CDMA, where $C_i$ is known as the spreading matrix, and OFDM, where $C_i$ is known as the precoding matrix.

The specific choice of the encoding matrices depends on a variety of conditions. In this paper, we derive our theory by assuming that $C_i$ is a complex Gaussian random matrix, i.e., each entry of $C_i$ is i.i.d. complex Gaussian. This assumption, while sometimes imposed in the wireless communications literature, is somewhat unrealistic in practice, due to the lack of a fast algorithm for applying $C_i$ and due to storage requirements. In practice one would rather choose $C_i$ to be something like the product of a Hadamard matrix and a diagonal matrix with random binary entries. We hope to address such more structured encoding matrices in future research. Our numerical simulations (see Section 4) show no difference in the performance of our algorithm for either choice.

Under the two assumptions above, the model actually takes a simpler form in the frequency domain. We assume throughout the paper that the convolution of finite sequences is circular convolution. (This circular convolution assumption can often be enforced directly (for example, in wireless communications the use of a cyclic prefix in OFDM renders the convolution circular) or indirectly (e.g. via zero-padding). In the first case replacing regular convolution by circular convolution does not introduce any errors at all. In the latter case one introduces an additional approximation error in the inversion which is negligible, since it decays exponentially for impulse responses of finite length [30].) By applying the Discrete Fourier Transform (DFT) to (1.1) along with the two assumptions, we have, up to a fixed normalization constant,

$F y = \sum_{i=1}^{s} \operatorname{diag}(F g_i)\,(F f_i) + F n,$

where $F$ denotes the $L\times L$ normalized unitary DFT matrix. The noise is assumed to be additive white complex Gaussian, and $\{(h_{i0}, x_{i0})\}_{i=1}^{s}$ denotes the ground truth. We define $d_{i0} := \|h_{i0}\|\,\|x_{i0}\|$ and assume without loss of generality that $h_{i0}$ and $x_{i0}$ are of the same norm, i.e., $\|h_{i0}\| = \|x_{i0}\| = \sqrt{d_{i0}}$, which is justified by the scaling ambiguity (namely, if the pair $(h_i, x_i)$ is a solution, then so is $(\alpha h_i, \alpha^{-1} x_i)$ for any nonzero $\alpha$). In that way, the inverse noise variance is a measure of the SNR (signal-to-noise ratio).
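The frequency-domain reformulation rests on the standard fact that the DFT diagonalizes circular convolution. The following short check (purely illustrative) verifies numerically that the DFT of a circular convolution equals the entrywise product of the DFTs.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 128
f = rng.standard_normal(L) + 1j * rng.standard_normal(L)
g = rng.standard_normal(L) + 1j * rng.standard_normal(L)

def cconv_direct(a, b):
    # circular convolution evaluated directly from its definition
    return np.array([sum(a[k] * b[(n - k) % L] for k in range(L)) for n in range(L)])

# convolution in time <=> entrywise multiplication in frequency
assert np.allclose(np.fft.fft(cconv_direct(f, g)), np.fft.fft(f) * np.fft.fft(g))
```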

Let $h_i \in \mathbb{C}^{K}$ be the first $K$ (nonzero) entries of $g_i$ and let $B \in \mathbb{C}^{L\times K}$ be a low-frequency DFT matrix (the first $K$ columns of the $L\times L$ unitary DFT matrix $F$). Then the simple relation $F g_i = B h_i$ holds.

We also denote $A_i := F C_i \in \mathbb{C}^{L\times N}$ and write $\bar{A}_i$ for its entrywise complex conjugate. Due to the Gaussianity of $C_i$ and the unitarity of $F$, $A_i$ also possesses a complex Gaussian distribution, and so does $\bar{A}_i$. From now on, instead of focusing on the original model, we consider (with a slight abuse of notation) the following equivalent formulation throughout our discussion:

$y = \sum_{i=1}^{s} \operatorname{diag}(B h_i)\,\bar{A}_i x_i + e, \qquad (2.1)$

where $h_i \in \mathbb{C}^{K}$, $x_i \in \mathbb{C}^{N}$, and $e$ denotes the noise. Our goal here is to estimate all $\{(h_i, x_i)\}_{i=1}^{s}$ from $y$, $B$, and $\{A_i\}_{i=1}^{s}$. Obviously, this is a bilinear inverse problem: if all $h_i$ are given, recovering all $x_i$ is a linear inverse problem (the ordinary demixing problem), and vice versa. We note that there is a scaling ambiguity in all blind deconvolution problems that cannot be resolved by any reconstruction method without further information. Therefore, when we talk about exact recovery in the following, this is understood modulo such a trivial scaling ambiguity.

Before proceeding to our proposed algorithm, we introduce some notation to facilitate a more convenient presentation of our approach. Let $b_l$ denote the $l$-th column of $B^*$ and $a_{il}$ the $l$-th column of $A_i^*$. Based on our assumptions the following properties hold: $\sum_{l=1}^{L} b_l b_l^* = I_K$, $\|b_l\|^2 = K/L$, and the $a_{il}$ are i.i.d. standard complex Gaussian vectors.

Moreover, inspired by the well-known lifting idea [9, 1, 6, 22], we define the useful matrix-valued linear operator $\mathcal{A}_i : \mathbb{C}^{K\times N} \to \mathbb{C}^{L}$ and its adjoint $\mathcal{A}_i^* : \mathbb{C}^{L} \to \mathbb{C}^{K\times N}$ by

$\mathcal{A}_i(Z) := \{ b_l^* Z a_{il} \}_{l=1}^{L}, \qquad \mathcal{A}_i^*(z) := \sum_{l=1}^{L} z_l\, b_l a_{il}^*, \qquad (2.2)$

for each $1 \le i \le s$, under the canonical inner product over $\mathbb{C}^{K\times N}$. Therefore, (2.1) can be written in the following equivalent form:

$y = \sum_{i=1}^{s} \mathcal{A}_i(h_{i0} x_{i0}^*) + e. \qquad (2.3)$

Hence, we can think of $y$ as the observation vector obtained from taking linear measurements of the rank-one matrices $\{h_{i0} x_{i0}^*\}_{i=1}^{s}$ with respect to a set of rank-one measurement matrices $\{b_l a_{il}^*\}$. In fact, with a bit of linear algebra (and ignoring the noise term for the moment), the $l$-th entry of $y$ in (2.3) equals the inner product of two block-diagonal matrices:

$y_l = \left\langle \operatorname{blkdiag}\big(b_l a_{1l}^*, \dots, b_l a_{sl}^*\big),\; X_0 \right\rangle, \qquad (2.4)$

where $X_0 := \operatorname{blkdiag}\big(h_{10}x_{10}^*, \dots, h_{s0}x_{s0}^*\big) \in \mathbb{C}^{Ks\times Ns}$ is defined as the ground truth matrix. In other words, we aim to recover such a block-diagonal matrix $X_0$ from linear measurements with the corresponding block-diagonal structure.
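A minimal numpy sketch of the lifted operator pair may help to fix ideas: it implements a map of the form $\mathcal{A}_i(Z) = \{b_l^* Z a_{il}\}_{l=1}^{L}$ together with its adjoint and checks the adjoint relation numerically. The vectors $b_l$ and $a_{il}$ are generated randomly here purely for illustration; this is an assumption-laden sketch, not the paper's construction of $B$ and $A_i$.

```python
import numpy as np

rng = np.random.default_rng(2)
L, K, N = 64, 8, 8

b = rng.standard_normal((L, K)) + 1j * rng.standard_normal((L, K))   # b[l] plays the role of b_l
a = rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))   # a[l] plays the role of a_il

def calA(Z):
    # A_i(Z) = { b_l^* Z a_il }_{l=1}^L, a linear map from C^{K x N} to C^L
    return np.array([b[l].conj() @ Z @ a[l] for l in range(L)])

def calA_adj(z):
    # A_i^*(z) = sum_l z_l b_l a_il^*, a linear map from C^L to C^{K x N}
    return sum(z[l] * np.outer(b[l], a[l].conj()) for l in range(L))

# numerical check of the adjoint relation <A_i(Z), z> = <Z, A_i^*(z)>
Z = rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))
z = rng.standard_normal(L) + 1j * rng.standard_normal(L)
assert np.isclose(np.vdot(calA(Z), z), np.vdot(Z, calA_adj(z)))
```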

By stacking all $h_i$ (and all $x_i$) into one long column vector, we let

$\boldsymbol{h} := \begin{bmatrix} h_1 \\ \vdots \\ h_s \end{bmatrix} \in \mathbb{C}^{Ks}, \qquad \boldsymbol{x} := \begin{bmatrix} x_1 \\ \vdots \\ x_s \end{bmatrix} \in \mathbb{C}^{Ns}. \qquad (2.5)$

We define $\mathcal{H}$ as a bilinear operator which maps a pair $(\boldsymbol{h}, \boldsymbol{x}) \in \mathbb{C}^{Ks}\times\mathbb{C}^{Ns}$ into a block-diagonal matrix in $\mathbb{C}^{Ks\times Ns}$, i.e.,

$\mathcal{H}(\boldsymbol{h}, \boldsymbol{x}) := \operatorname{blkdiag}\big(h_1 x_1^*, \dots, h_s x_s^*\big). \qquad (2.6)$

Let $X := \mathcal{H}(\boldsymbol{h}, \boldsymbol{x})$ and $X_0 := \mathcal{H}(\boldsymbol{h}_0, \boldsymbol{x}_0)$, where $X_0$ is the ground truth matrix illustrated in (2.4). Define the linear operator $\mathcal{A} : \mathbb{C}^{Ks\times Ns} \to \mathbb{C}^{L}$ as

$\mathcal{A}(Z) := \Big\{ \big\langle \operatorname{blkdiag}\big(b_l a_{1l}^*, \dots, b_l a_{sl}^*\big),\, Z \big\rangle \Big\}_{l=1}^{L}, \qquad (2.7)$

where blkdiag is the standard MATLAB function to construct a block-diagonal matrix. Therefore, $y = \mathcal{A}(X_0) + e$ and $\mathcal{A}(\mathcal{H}(\boldsymbol{h}, \boldsymbol{x})) = \sum_{i=1}^{s} \mathcal{A}_i(h_i x_i^*)$. The adjoint operator $\mathcal{A}^*$ is defined naturally as

$\mathcal{A}^*(z) := \operatorname{blkdiag}\big(\mathcal{A}_1^*(z), \dots, \mathcal{A}_s^*(z)\big), \qquad (2.8)$

which is a linear map from $\mathbb{C}^{L}$ to $\mathbb{C}^{Ks\times Ns}$. To measure how well $(\boldsymbol{h}, \boldsymbol{x})$ approximates the ground truth, we define $\delta$ as the global relative error:

$\delta^2 := \frac{\sum_{i=1}^{s} \delta_i^2\, d_{i0}^2}{\sum_{i=1}^{s} d_{i0}^2}, \qquad (2.9)$

where $\delta_i := \|h_i x_i^* - h_{i0} x_{i0}^*\|_F / d_{i0}$ is the relative error within each component.

Note that $\delta$ and $\delta_i$ are functions of $(\boldsymbol{h}, \boldsymbol{x})$ and $(h_i, x_i)$ respectively; in most cases we simply write $\delta$ and $\delta_i$ if there is no possibility of confusion.
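For illustration, the block-diagonal lifting $\mathcal{H}(\boldsymbol{h}, \boldsymbol{x})$ and the relative errors can be computed as follows; scipy.linalg.block_diag plays the role of MATLAB's blkdiag, and the particular normalization of $\delta$ follows the reconstruction given above, so it should be read as an assumption.

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(3)
s, K, N = 3, 8, 8
h  = [rng.standard_normal(K) + 1j * rng.standard_normal(K) for _ in range(s)]
x  = [rng.standard_normal(N) + 1j * rng.standard_normal(N) for _ in range(s)]
h0 = [rng.standard_normal(K) + 1j * rng.standard_normal(K) for _ in range(s)]
x0 = [rng.standard_normal(N) + 1j * rng.standard_normal(N) for _ in range(s)]

def H(h, x):
    # H(h, x) = blkdiag(h_1 x_1^*, ..., h_s x_s^*)
    return block_diag(*[np.outer(h_i, x_i.conj()) for h_i, x_i in zip(h, x)])

# per-component relative errors delta_i and the global relative error delta
d0 = [np.linalg.norm(h0[i]) * np.linalg.norm(x0[i]) for i in range(s)]
delta_i = [np.linalg.norm(np.outer(h[i], x[i].conj()) - np.outer(h0[i], x0[i].conj()), 'fro') / d0[i]
           for i in range(s)]
delta = np.sqrt(sum(delta_i[i] ** 2 * d0[i] ** 2 for i in range(s)) / sum(d ** 2 for d in d0))
```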

2.1 Convex versus nonconvex approaches

As indicated in (2.4), joint blind deconvolution-demixing can be recast as the task of recovering a block-diagonal matrix $X_0$, whose diagonal blocks are rank-one, from linear measurements. In general, such a low-rank matrix recovery problem is NP-hard. In order to take advantage of the low-rank property of the ground truth, it is natural to adopt a convex relaxation and solve a nuclear norm minimization program, i.e.,

$\min_{\{Z_i\}} \; \sum_{i=1}^{s} \|Z_i\|_* \quad \text{subject to} \quad \sum_{i=1}^{s} \mathcal{A}_i(Z_i) = y. \qquad (2.10)$

The question of when the solution of (2.10) yields exact recovery was first answered in our previous work [23]. Later, [29, 15] improved this result to a near-optimal bound (up to log-factors); the main theoretical result is informally summarized in the following theorem.

Theorem 2.2 (Theorem 1.1 in [15]).

Suppose that the $A_i$ are i.i.d. complex Gaussian matrices and $B$ is an $L\times K$ partial DFT matrix with $B^*B = I_K$. Then solving (2.10) gives exact recovery if the number of measurements $L$ satisfies

with probability at least where is an absolute scalar only depending on linearly.

While the SDP relaxation is definitely effective and has theoretical performance guarantees, the computational cost of solving an SDP already becomes prohibitive for moderate-size problems, let alone for large-scale problems. Therefore, we look for a more efficient nonconvex approach such as gradient descent, ideally one that is also backed by theory. It seems natural to achieve this goal by minimizing the following nonlinear least squares objective function with respect to $(\boldsymbol{h}, \boldsymbol{x})$:

$F(\boldsymbol{h}, \boldsymbol{x}) := \big\|\mathcal{A}(\mathcal{H}(\boldsymbol{h}, \boldsymbol{x})) - y\big\|^2 = \Big\|\sum_{i=1}^{s} \mathcal{A}_i(h_i x_i^*) - y\Big\|^2. \qquad (2.11)$

In particular, if we write

(2.12)

As also pointed out in [21], this is a highly nonconvex optimization problem. Many commonly used algorithms, such as gradient descent or alternating minimization, do not necessarily converge to the global minimum, so we cannot always hope to obtain the desired solution; often these simple algorithms get stuck in local minima.
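To see why (2.11) is easy to evaluate yet nonconvex, it helps to write it out in code. The sketch below evaluates the least squares misfit for a frequency-domain model of the form $y \approx \sum_i \operatorname{diag}(Bh_i)\,A_i x_i$; for simplicity the conjugates appearing in the paper's formulation are dropped, so this is an illustrative assumption rather than the exact objective.

```python
import numpy as np

def least_squares_objective(h_list, x_list, B, A_list, y):
    # F(h, x) = || sum_i diag(B h_i) A_i x_i - y ||^2; bilinear in each (h_i, x_i), hence nonconvex
    fit = sum((B @ h_i) * (A_i @ x_i) for h_i, x_i, A_i in zip(h_list, x_list, A_list))
    return np.linalg.norm(fit - y) ** 2
```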

2.2 The basin of attraction

Motivated by several excellent recent papers on nonconvex optimization for various signal processing and machine learning problems, we propose our two-step algorithm: (i) compute an initial guess carefully; (ii) apply gradient descent to the objective function, starting with the carefully chosen initial guess. One difficulty in understanding nonconvex optimization lies in how to construct the so-called basin of attraction, i.e., a region such that if the starting point is inside it, the iterates will always stay inside and converge to the global minimum. The construction of the basin of attraction varies for different problems [8, 3, 34]. For this problem, similar to [21], the construction follows from the following three observations. Each of these observations suggests the definition of a certain neighborhood, and the basin of attraction is then defined as the intersection of these three neighborhood sets.

  1. Ambiguity of solution: in fact, we can only recover each pair $(h_{i0}, x_{i0})$ up to a scalar, since $(h_{i0}, x_{i0})$ and $(\alpha h_{i0}, \alpha^{-1} x_{i0})$ are both solutions for any $\alpha \neq 0$. From a numerical perspective, we want to avoid the scenario where $\|h_i\| \to 0$ and $\|x_i\| \to \infty$ while their product stays fixed, which potentially leads to numerical instability. To balance the norms of $h_i$ and $x_i$ for all $i$, we define the neighborhood $\mathcal{N}_{d_0}$,

    which is a convex set.

  2. Incoherence: the performance depends on how large/small the incoherence parameter $\mu_h^2$ is, where $\mu_h^2$ is defined by

    The idea is that the smaller $\mu_h^2$ is, the better the performance. Let us consider an extreme case: if $B h_{i0}$ is highly sparse or spiky, we lose much of the information carried by the zero/small entries and cannot hope to obtain satisfactory recovered signals. In other words, we need the ground truth channels to exhibit a certain “spectral flatness”, i.e., they must not be highly localized in the Fourier domain.

    A similar quantity is also introduced in the matrix completion problem [7, 34]. The larger $\mu_h^2$ is, the more $h_{i0}$ is aligned with one particular row of $B$. To control the incoherence between the iterates and $B$, we define the second neighborhood $\mathcal{N}_{\mu}$,

    (2.13)

    where $\mu$ is a parameter with $\mu \geq \mu_h$. Note that $\mathcal{N}_{\mu}$ is also a convex set.

  3. Close to the ground truth: we also want to construct an initial guess that is close to the ground truth, i.e., one that lies in the neighborhood $\mathcal{N}_{\varepsilon}$,

    $\mathcal{N}_{\varepsilon} := \big\{ (\boldsymbol{h}, \boldsymbol{x}) : \delta(\boldsymbol{h}, \boldsymbol{x}) \leq \varepsilon \big\}, \qquad (2.14)$

    where $\varepsilon$ is a predetermined parameter in $(0, 1/15]$.

Remark 2.3.

To ensure , it suffices to ensure where . This is because

which implies

Remark 2.4.

When we say or , it means for all we have , or respectively. In particular, where and are defined in (2.5).

2.3 Objective function and Wirtinger derivative

To implement the first two observations, we introduce the regularizer $G(\boldsymbol{h}, \boldsymbol{x})$, defined as the sum of $s$ components,

$G(\boldsymbol{h}, \boldsymbol{x}) := \sum_{i=1}^{s} G_i(h_i, x_i). \qquad (2.15)$

For each component $G_i$, we let $G_0(z) := \max(z - 1, 0)^2$ and choose data-driven scale parameters $d_i$ (approximating $d_{i0}$) for all $1 \le i \le s$, and define

$G_i(h_i, x_i) := \rho\left[ G_0\!\left(\frac{\|h_i\|^2}{2 d_i}\right) + G_0\!\left(\frac{\|x_i\|^2}{2 d_i}\right) + \sum_{l=1}^{L} G_0\!\left(\frac{L\,|b_l^* h_i|^2}{8 d_i \mu^2}\right) \right], \qquad (2.16)$

where $\rho$ is a penalty weight. Here both $d_i$ and the overall scale are data-driven and well approximated by our spectral initialization procedure; and $\mu$ is a tuning parameter which could be estimated if we assume a specific statistical model for the channel (for example, in the widely used Rayleigh fading model, the channel coefficients are assumed to be complex Gaussian). The idea behind $G$ is quite straightforward even though the formulation looks complicated: for each $G_i$ in (2.16), the first two terms try to force the iterates to lie in $\mathcal{N}_{d_0}$ and the third term tries to encourage the iterates to lie in $\mathcal{N}_{\mu}$. What about the neighborhood $\mathcal{N}_{\varepsilon}$? A proper choice of the initialization, followed by gradient descent which keeps the objective function decreasing, will ensure that the iterates stay in $\mathcal{N}_{\varepsilon}$.
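The building block of the regularizer is the hinge-squared penalty $G_0(z) = \max(z-1, 0)^2$, which vanishes inside the constraint region and grows quadratically outside of it. The per-component penalty below mirrors the reconstruction in (2.16); its weights and thresholds should therefore be treated as assumptions, not as the definitive constants.

```python
import numpy as np

def G0(z):
    # G0(z) = max(z - 1, 0)^2: inactive inside the feasible region, quadratic outside
    return np.maximum(z - 1.0, 0.0) ** 2

def regularizer_component(h_i, x_i, B, d_i, mu, rho):
    # first two terms push the iterates towards N_{d0}, the last towards N_mu (constants assumed)
    L = B.shape[0]
    norm_terms = G0(np.linalg.norm(h_i) ** 2 / (2 * d_i)) + G0(np.linalg.norm(x_i) ** 2 / (2 * d_i))
    incoherence_terms = G0(L * np.abs(B @ h_i) ** 2 / (8 * d_i * mu ** 2)).sum()
    return rho * (norm_terms + incoherence_terms)
```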

Finally, we consider the objective function $\widetilde{F}$, defined as the sum of the nonlinear least squares objective function $F$ in (2.11) and the regularizer $G$,

$\widetilde{F}(\boldsymbol{h}, \boldsymbol{x}) := F(\boldsymbol{h}, \boldsymbol{x}) + G(\boldsymbol{h}, \boldsymbol{x}). \qquad (2.17)$

Note that the input of the function $\widetilde{F}$ consists of complex variables but its output is real-valued. As a result, the following simple relations hold:

$\frac{\partial \widetilde{F}}{\partial \bar{h}_i} = \overline{\left(\frac{\partial \widetilde{F}}{\partial h_i}\right)}, \qquad \frac{\partial \widetilde{F}}{\partial \bar{x}_i} = \overline{\left(\frac{\partial \widetilde{F}}{\partial x_i}\right)}.$

Similar properties also apply to both $F$ and $G$.

Therefore, to minimize this function, it suffices to consider only the gradient of $\widetilde{F}$ with respect to $\bar{h}_i$ and $\bar{x}_i$, which is also called the Wirtinger derivative [8]. The Wirtinger derivatives of $F$ and $G$ with respect to $\bar{h}_i$ and $\bar{x}_i$ can be easily computed as follows:

(2.18)
(2.19)
(2.20)
(2.21)

where the adjoint operators are defined in (2.2) and (2.8). In short, we denote

$\nabla\widetilde{F}_{\boldsymbol{h}} := \begin{bmatrix} \nabla\widetilde{F}_{h_1} \\ \vdots \\ \nabla\widetilde{F}_{h_s} \end{bmatrix}, \qquad \nabla\widetilde{F}_{\boldsymbol{x}} := \begin{bmatrix} \nabla\widetilde{F}_{x_1} \\ \vdots \\ \nabla\widetilde{F}_{x_s} \end{bmatrix}. \qquad (2.22)$

Similar definitions hold for $\nabla F_{\boldsymbol{h}}, \nabla F_{\boldsymbol{x}}$ and $\nabla G_{\boldsymbol{h}}, \nabla G_{\boldsymbol{x}}$. It is easy to see that $\nabla\widetilde{F}_{\boldsymbol{h}} = \nabla F_{\boldsymbol{h}} + \nabla G_{\boldsymbol{h}}$ and $\nabla\widetilde{F}_{\boldsymbol{x}} = \nabla F_{\boldsymbol{x}} + \nabla G_{\boldsymbol{x}}$.
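For readers who prefer code over formulas, the Wirtinger gradients of the least squares part can be written directly in terms of the residual. The sketch below differentiates the simplified objective from the earlier sketch (conjugates dropped); it is consistent with that objective but is not claimed to reproduce the paper's exact expressions.

```python
import numpy as np

def wirtinger_grad_F(h_list, x_list, B, A_list, y):
    # residual of the simplified frequency-domain model
    r = sum((B @ h_i) * (A_i @ x_i) for h_i, x_i, A_i in zip(h_list, x_list, A_list)) - y
    grad_h, grad_x = [], []
    for h_i, x_i, A_i in zip(h_list, x_list, A_list):
        grad_h.append(B.conj().T @ (r * np.conj(A_i @ x_i)))   # dF / d(conj h_i)
        grad_x.append(A_i.conj().T @ (r * np.conj(B @ h_i)))   # dF / d(conj x_i)
    return grad_h, grad_x
```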

3 Algorithm and Theory

3.1 Two-step algorithm

As mentioned before, the first step is to find a good initial guess that lies inside the basin of attraction. The initialization rests on the following key fact:

$\mathbb{E}\big[\mathcal{A}_i^*(y)\big] = \mathbb{E}\Big[\mathcal{A}_i^*\Big(\sum_{j=1}^{s}\mathcal{A}_j(h_{j0}x_{j0}^*) + e\Big)\Big] = h_{i0}x_{i0}^*,$

where we use $\mathbb{E}[\mathcal{A}_i^*\mathcal{A}_i] = \operatorname{Id}$, $\mathbb{E}[\mathcal{A}_i^*\mathcal{A}_j] = 0$ for $j \neq i$, and $\mathbb{E}[\mathcal{A}_i^*(e)] = 0$.

Therefore, it is natural to extract the leading singular value and the associated left and right singular vectors from each $\mathcal{A}_i^*(y)$ and use them as (hopefully good) approximations to $d_{i0}$, $h_{i0}$, and $x_{i0}$. This idea leads to Algorithm 1, the theoretical guarantees of which are given in Section 6.5. The second step of the algorithm is simply to apply gradient descent to $\widetilde{F}$ with the initial guess $\{(u_i^{(0)}, v_i^{(0)})\}_{i=1}^{s}$ or, equivalently, $(\boldsymbol{u}^{(0)}, \boldsymbol{v}^{(0)})$, where $\boldsymbol{u}^{(0)}$ stems from stacking all $u_i^{(0)}$ into one long vector (and similarly for $\boldsymbol{v}^{(0)}$). (It is clear that instead of gradient descent one could also use a second-order method to achieve faster convergence at the tradeoff of increased computational cost per iteration. The theoretical convergence analysis for a second-order method would require a very different approach from the one developed in this paper.)

1: for $i = 1, 2, \dots, s$ do
2:     Compute $\mathcal{A}_i^*(y)$.
3:     Find the leading singular value, left and right singular vectors of $\mathcal{A}_i^*(y)$, denoted by $(d_i, \hat{h}_{i0}, \hat{x}_{i0})$.
4:     Solve the following optimization problem for $u_i^{(0)}$: $u_i^{(0)} := \operatorname{argmin}_{z} \|z - \sqrt{d_i}\,\hat{h}_{i0}\|$ subject to $\sqrt{L}\,\|Bz\|_{\infty} \le 2\sqrt{d_i}\,\mu$.
5:     Set $v_i^{(0)} := \sqrt{d_i}\,\hat{x}_{i0}$.
6: end for
7: Output: $\{(u_i^{(0)}, v_i^{(0)})\}_{i=1}^{s}$ or, equivalently, $(\boldsymbol{u}^{(0)}, \boldsymbol{v}^{(0)})$.
Algorithm 1 Initialization via spectral method and projection
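The spirit of Algorithm 1 is captured by the following sketch: for each user, form a matrix built linearly from $y$ whose expectation is (proportional to) a rank-one matrix carrying $h_{i0}$ and $x_{i0}$, and read off its leading singular pair. It uses the same simplified model (conjugates dropped) as the earlier sketches, and the projection step of Algorithm 1 is omitted for brevity; both simplifications are assumptions.

```python
import numpy as np

def spectral_init(y, B, A_list):
    u0, v0 = [], []
    for A_i in A_list:
        # matrix built linearly from y; in expectation it is rank-one and aligned with (h_i0, x_i0)
        M = B.conj().T @ (y[:, None] * A_i.conj())
        U, S, Vh = np.linalg.svd(M)
        d_i = S[0]                              # leading singular value, approximates d_i0
        u0.append(np.sqrt(d_i) * U[:, 0])       # scaled leading left singular vector  (channel)
        v0.append(np.sqrt(d_i) * Vh[0])         # scaled leading right singular vector (signal)
    return u0, v0
```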
1: Initialization: obtain $(\boldsymbol{u}^{(0)}, \boldsymbol{v}^{(0)})$ via Algorithm 1.
2: for $t = 1, 2, \dots$ do
3:     for $i = 1, 2, \dots, s$ do
4:         $u_i^{(t)} = u_i^{(t-1)} - \eta\,\nabla\widetilde{F}_{h_i}\big(\boldsymbol{u}^{(t-1)}, \boldsymbol{v}^{(t-1)}\big)$,
5:         $v_i^{(t)} = v_i^{(t-1)} - \eta\,\nabla\widetilde{F}_{x_i}\big(\boldsymbol{u}^{(t-1)}, \boldsymbol{v}^{(t-1)}\big)$,
6:     end for
7: end for
Algorithm 2 Wirtinger gradient descent with constant stepsize $\eta$
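Combining the two steps, the overall procedure is a plain gradient descent with constant stepsize started from the spectral initialization. The sketch below reuses spectral_init and wirtinger_grad_F from the earlier sketches; adding the gradient of the regularizer $G$ at the commented line would turn it into the regularized variant. Stepsize and iteration count are illustrative assumptions.

```python
def wirtinger_gradient_descent(y, B, A_list, eta=0.01, n_iter=500):
    # Step 1: spectral initialization (Algorithm 1)
    u, v = spectral_init(y, B, A_list)
    # Step 2: Wirtinger gradient descent with constant stepsize (Algorithm 2)
    for _ in range(n_iter):
        grad_h, grad_x = wirtinger_grad_F(u, v, B, A_list, y)   # add the gradient of G here to regularize
        u = [u_i - eta * gh for u_i, gh in zip(u, grad_h)]
        v = [v_i - eta * gx for v_i, gx in zip(v, grad_x)]
    return u, v
```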
Remark 3.1.

For Algorithm 2, we can rewrite each iteration in the compact form

$\boldsymbol{u}^{(t)} = \boldsymbol{u}^{(t-1)} - \eta\,\nabla\widetilde{F}_{\boldsymbol{h}}\big(\boldsymbol{u}^{(t-1)}, \boldsymbol{v}^{(t-1)}\big), \qquad \boldsymbol{v}^{(t)} = \boldsymbol{v}^{(t-1)} - \eta\,\nabla\widetilde{F}_{\boldsymbol{x}}\big(\boldsymbol{u}^{(t-1)}, \boldsymbol{v}^{(t-1)}\big),$

where $\nabla\widetilde{F}_{\boldsymbol{h}}$ and $\nabla\widetilde{F}_{\boldsymbol{x}}$ are defined in (2.22), and $\eta$ is the constant stepsize.

3.2 Main results

Our main findings are summarized as follows: Theorem 3.2 shows that the initial guess given by Algorithm 1 indeed belongs to the basin of attraction. Moreover, $d_i$ also serves as a good approximation of $d_{i0}$ for each $i$. Theorem 3.3 demonstrates that the regularized Wirtinger gradient descent guarantees linear convergence of the iterates; the recovery is exact in the noise-free case and stable in the presence of noise.

Theorem 3.2.

The initialization obtained via Algorithm 1 satisfies

(3.1)

and

(3.2)

hold with high probability if the number of measurements satisfies

(3.3)

Here is any predetermined constant in , and is a constant only linearly depending on with .

Theorem 3.3.

Starting with an initial value satisfying (3.1), Algorithm 2 creates a sequence of iterates which converges linearly to the global minimum,

(3.4)

with probability at least where and

if the number of measurements satisfies

(3.5)
Remark 3.4.

Our previous work [23] shows that the convex approach via semidefinite programming (see (2.10)) requires a certain number of measurements to ensure exact recovery. Later, [15] improved this result to a near-optimal bound up to log-factors. The difference between nonconvex and convex methods lies in the appearance of the condition number in (3.5). This is not just an artifact of the proof: empirically we also observe that the value of the condition number affects the convergence rate of our nonconvex algorithm, see Figure 5.

Remark 3.5.

Our theory suggests an $s^2$-dependence for the number of measurements $L$, although numerically $L$ in fact depends on $s$ linearly, as shown in Section 4. The reason for the $s^2$-dependence will be addressed in detail in Section 5.2.

Remark 3.6.

In the theoretical analysis, we assume that $C_i$ (or equivalently $A_i$) is a Gaussian random matrix. Numerical simulations suggest that this assumption is clearly not necessary. For example, $C_i$ may be chosen to be a Hadamard-type matrix, which is more appropriate and favorable for communications.

Remark 3.7.

In the noise-free case, (3.4) shows that the iterates converge to the ground truth at a linear rate. On the other hand, if noise is present, the iterates are guaranteed to converge to a point within a small neighborhood of the ground truth. More importantly, this error bound decays as the number of measurements $L$ gets larger.

4 Numerical simulations

In this section we present a range of numerical simulations to illustrate and complement different aspects of our theoretical framework. We empirically analyze the number of measurements needed for perfect joint deconvolution/demixing and compare it to our theoretical bounds. We also study the robustness to noisy data. In our simulations we use Gaussian encoding matrices, as in our theorems, but we also try more realistic structured encoding matrices that are more reminiscent of what one might come across in wireless communications.

While Theorem 3.3 says that the number of measurements $L$ depends quadratically on the number of sources $s$, numerical simulations suggest near-optimal performance. Figure 2 demonstrates that $L$ actually depends linearly on $s$, i.e., the boundary between success (white) and failure (black) is approximately a linear function of $s$. In the experiment, $K$ and $N$ are fixed, all encoding matrices are complex Gaussian, and all channels and signals are standard complex Gaussian vectors. For each pair $(s, L)$, 25 experiments are performed and we treat the recovery as a success if the relative error of the reconstruction falls below a prescribed threshold. For our algorithm, we use backtracking to determine the stepsize, and the iteration stops either once a prescribed tolerance is reached or once the number of iterations reaches 500. The backtracking is based on the Armijo-Goldstein condition [25]. The initial stepsize is chosen to be a fixed constant; if the Armijo-Goldstein condition is not satisfied, we simply divide the stepsize by two and use the smaller stepsize.
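The backtracking rule used in these experiments can be realized with a generic Armijo-Goldstein test: start from some initial stepsize and halve it until the objective has decreased sufficiently. The constants below are common generic choices and are not taken from the paper.

```python
def backtracking_stepsize(objective, current_value, grad_sq_norm, take_step,
                          eta0=1.0, beta=0.5, c=1e-3, max_halvings=30):
    # Armijo-Goldstein: accept eta once F(new) <= F(old) - c * eta * ||grad||^2, otherwise halve eta;
    # take_step(eta) should return the iterate moved by -eta times the current gradient
    eta = eta0
    for _ in range(max_halvings):
        candidate = take_step(eta)
        if objective(candidate) <= current_value - c * eta * grad_sq_norm:
            break
        eta *= beta
    return eta, candidate
```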

We see from Figure 2 that the number of measurements required for the proposed algorithm to succeed not only seems to depend linearly on the number of sources, but is actually rather close to the information-theoretic limit given by the number of degrees of freedom. Indeed, the green dashed line in Figure 2, which represents the empirical boundary for the phase transition between success and failure, corresponds to a line with slope about . It is interesting to compare this empirical performance to the sharp theoretical phase transition bounds one would obtain via convex optimization [10, 26]. Considering the convex approach based on lifting in [23], we can adapt the theoretical framework in [10] to the blind deconvolution/demixing setting, but with one modification. The bounds in [10] rely on Gaussian widths of tangent cones related to the measurement matrices. Since simple analytic formulas for these expressions seem to be out of reach for the structured rank-one measurement matrices used in our paper, we instead compute the bounds for full-rank Gaussian random matrices, which yields a sharp bound of about (the corresponding bounds for rank-one sensing matrices will likely have a constant larger than 3). Note that these sharp theoretical bounds predict the empirical behavior of convex methods quite accurately. Thus our empirical bound for using a non-convex method compares rather favorably with that of the convex approach.

Figure 2: Phase transition plot for empirical recovery performance under different choices of $(s, L)$ where $K$ and $N$ are fixed. Black region: failure; white region: success. The red solid line depicts the number of degrees of freedom and the green dashed line shows the empirical phase transition bound for Algorithm 2.

Similar conclusions can be drawn from Figure 3; there all matrices $A_i$ take the form $A_i = F D_i H$, where $F$ is the unitary DFT matrix, the $D_i$ are independent diagonal binary ($\pm 1$) matrices, and $H$ is a fixed partial deterministic Hadamard matrix. The purpose of using the $D_i$ is to enhance the incoherence between the channels so that our algorithm is able to tell apart each individual signal and channel. As before we assume Gaussian channels, i.e., the $h_i$ are standard complex Gaussian vectors. Therefore, our approach works not only for Gaussian encoding matrices but also for matrices that are relevant for real-world applications, although no satisfactory theory has been derived yet for that case. Moreover, due to the structure of $F$, $D_i$, and $H$, fast transform algorithms are available, potentially allowing for real-time deployment.
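Structured encoding matrices of the kind used in this experiment are straightforward to generate; the sketch below builds a fixed partial Hadamard matrix and one independent random binary diagonal per user (the sizes and the normalization are illustrative assumptions).

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(4)
L, N, s = 256, 64, 4                          # L must be a power of two for scipy's hadamard

Hmat = hadamard(L)[:, :N] / np.sqrt(L)        # fixed partial deterministic Hadamard matrix
C_list = []
for _ in range(s):
    d = rng.choice([-1.0, 1.0], size=L)       # independent random binary (+-1) diagonal per user
    C_list.append(d[:, None] * Hmat)          # encoding matrix D_i H; fast transforms apply
```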

Figure 3: Empirical probability of successful recovery for different pairs $(s, L)$ when $K$ and $N$ are fixed.

Figure 4 shows the robustness of our algorithm under different levels of noise. We run 25 samples for each level of SNR and each choice of $L$, and then compute the average relative error. It is easily seen that the relative error (measured in dB) scales linearly with the SNR: one dB of increase in SNR results in one dB of decrease in the relative error.

Figure 4: Relative error vs. SNR (dB).

Theorem 3.3 suggests that the performance and convergence