Inexact Gradient Projection and Fast Data Driven Compressed Sensing
We study convergence of the iterative projected gradient (IPG) algorithm for arbitrary (possibly non-convex) sets and when both the gradient and projection oracles are computed approximately. We consider different notions of approximation of which we show that the Progressive Fixed Precision (PFP) and the -optimal oracles can achieve the same accuracy as for the exact IPG algorithm. We show that the former scheme is also able to maintain the (linear) rate of convergence of the exact algorithm, under the same embedding assumption. In contrast, the -approximate oracle requires a stronger embedding condition, moderate compression ratios and it typically slows down the convergence.
We apply our results to accelerate solving a class of data driven compressed sensing problems, where we replace iterative exhaustive searches over large datasets by fast approximate nearest neighbour search strategies based on the cover tree data structure. For datasets with low intrinsic dimensions our proposed algorithm achieves a complexity logarithmic in terms of the dataset population as opposed to the linear complexity of a brute force search. By running several numerical experiments we conclude similar observations as predicted by our theoretical analysis.
Signal inference under limited and noisy observations is a major line of research in signal processing, machine learning and statistics and it has a wide application ranging from biomedical imaging, astrophysics, remote sensing to data mining. Incorporating the structure of signals is proven to significantly help with an accurate inference since natural datasets often have limited degrees of freedom as compared to their original ambient dimensionality. This fact has been invoked in Compressed sensing (CS) literature by adopting efficient signal models to achieve accurate signal reconstruction given near-minimal number of measurements i.e. much smaller than the signal ambient dimension (see [1, 2, 3, 4, 5, 6] and e.g.  for an overview on different CS models). CS consists of a linear sampling protocol:
where a linear mapping samples a -dimensional vector of noisy measurements from a ground truth signal which typically lives in a high ambient dimension . Natural signals often have efficient compact representations using nonlinear models such as low dimensional smooth manifolds, low-rank matrices or the Union of Subspaces (UoS) that itself includes sparse (unstructured) or structured sparse (e.g. group, tree or analysis sparsity) representations in properly chosen orthobases or redundant dictionaries . CS reconstruction algorithms for estimating from are in general more computationally complex (as opposed to the simple linear acquisition of CS) as they typically require solving a nonlinear optimization problem based around a prior signal model. A proper model should be carefully chosen in order to efficiently promote the low-dimensional structure of signal meanwhile not bringing a huge computational burden to the reconstruction algorithm.
Consider the following constrained least square problem for CS reconstruction:
where, the constraint set represents the signal model. First order algorithms in the form of projected Landweber iteration a.k.a. iterative projected gradient (IPG) descent or Forward-Backward are very popular for solving (2). Interesting features of IPG include flexibility of handling various and often complicated signal models, e.g. might be convex, nonconvex or/and semi-algebraic such as sparsity or rank constraints (these last models result in challenging combinatorial optimization problems but with tractable projection operators). Also IPG (and more generally the proximal-gradient methods) has been considered to be particularly useful for big data applications . It is memory efficient due to using only first order local oracles e.g., the gradient and the projection onto , it can be implemented in a distributed/parallel fashion, and it is also robust to using cheap statistical estimates e.g. in stochastic descend methods  to shortcut heavy gradient computations.
In this regard a major challenge that IPG may encounter is the computational burden of performing an exact projection step onto certain complex models (or equivalently, performing an exact but complicated gradient step). In many interesting inverse problems the model projection amounts to solving another optimization within each iteration of IPG. This includes important cases in practice such as the total variation penalized least squares [10, 11], low-rank matrix completion  or tree sparse model-based CS . Another example is the convex inclusion constraints , appearing in multi constrained problems e.g. [13, 14], where one might be required to perform a Djkstra type feasibility algorithm at each iteration [15, 16]. Also, for data driven signal models the projection will typically involve some form of search through potentially large datasets. In all these cases accessing an exact oracle could be either computationally inefficient or even not possible (e.g. in analysis sparse recovery  or tensor rank minimization  where the exact projection is NP hard), and therefore a natural line of thought is to carry those steps with cheaper approximate oracles.
In this paper we feature an important property of the IPG algorithm; that it is robust against deterministic errors in calculation of the projection and gradient steps. We cover different types of oracles: i) A fixed precision (FP) oracle which compared to the exact one has an additive bounded approximation error. ii) A progressive fixed precision (PFP) oracle which allows for larger (additive) approximations in the earlier iterations and refines the precision as the algorithm progresses. iii) A -approximate oracle which introduces a notion of relatively optimal approximation with a multiplicative error (as compared to the exact oracle).
Our analysis uses a notion of model restricted bi-Lipschitz embedding similar to e.g. , however in a more local form and with an improved conditioning (we discuss this in more details in Section IV). With that respect, our analysis differs from the previous related works in the convex settings as the embedding enables us for instance to prove a globally optimal recovery result for nonconvex models, as well as establishing linear rate of convergences for the inexact IPG applied for solving CS problems (e.g. results of  on linear convergence of the inexact IPG assumes strong convexity which does not hold in solving underdetermined least squares such as CS).
In summary, we show that the FP type oracles restrict the final accuracy of the main reconstruction problem. This limitation can be overcome by increasing the precision at an appropriate rate using the PFP type oracles where one could achieve the same solution accuracy as for the exact IPG algorithm under the same embedding assumptions (and even with the convergence rate). We show that the -approximate projection can also achieve the accuracy of exact IPG however under a stronger embedding assumption, moderate compression ratios and using possibly more iterations (since using this type of oracle typically decreases the rate of convergence). In all the cases above we study conditions that provide us with linear convergence results.
Finally we apply this theory to a stylized data driven compressed sensing application that requires a nearest neighbour search order to calculate the model projection. We shortcut the computations involved, (iteratively) performing exhaustive searches over large datasets, by using approximate nearest neighbour search strategies corresponding to the aforementioned oracles and motivated by the cover tree structure introduced in . Our proposed algorithm achieves a complexity logarithmic in terms of the dataset population (as opposed to the linear complexity of a brute force search). By running several numerical experiments on different datasets we conclude similar observations as predicted by our theoretical results.
I-B Paper organization
The rest of this paper is organized as follows: In Section II we review and compare our results to the previous related works on inexact IPG. In Section III we define the inexact IPG algorithm for three types of approximate oracles. Section IV includes our main theoretical results on robustness and linear convergence of the inexact IPG for solving CS reconstruction problem. In Section VI we discuss an application of the proposed inexact algorithms to accelerate solving data driven CS problems. We also briefly discuss the cover tree data structure and the associated exact/approximate search strategies. Section VII is dedicated to the numerical experiments on using inexact IPG for data driven CS. And finally we discuss and conclude our results in Section VIII.
Ii Related works
Inexact proximal-gradient methods (in particular IPG) and their Nesterov accelerated variants have been the subject of a substantial amount of work in convex optimization [22, 23, 24, 20, 25]. Here we review some highlights and refer the reader for a comprehensive literature review to . Fixed precision approximates have been studied for the gradient step e.g. for using the smoothing techniques where the gradient is not explicit and requires solving an auxiliary optimization, see for more details  and  for a semi-definite programming example in solving sparse PCA. A similar approach has been extensively applied for carrying out the projection (or the proximity operator) approximately in cases where it does not have an analytical expression and requires solving another optimization within each iteration e.g. in total variation constrained inverse problems [11, 10] or the overlapping group Lasso problem . Fortunately in convex settings one can stop the inner optimization when its duality gap falls below a certain threshold and achieve a fixed precision approximate projection. In this case the solution accuracy of the main problem is proportional to the approximation level introduced within each iteration. Recently  studied the progressive fixed precision (PFP) type approximations for solving convex problems, e.g. a sparse CUR type factorization, via gradient-proximal (and its accelerated variant) methods. The authors show that IPG can afford larger approximation errors (in both gradient and projection/proximal steps) in the earlier stages of the algorithm and by increasing the approximation accuracy at an appropriate rate one can attain a similar convergence rate and accuracy as for the exact IPG however with significantly less computation.
A part of our results draws a similar conclusion for solving nonconvex constrained least squares that appears in CS problems by using inexact IPG. Note that an FP type approximation has been previously considered for the nonconvex IPG e.g. for the UoS  or the manifold  CS models, however these works only assume inexact projections whereas our result covers an approximate gradient step as well. Of more importance and to the best of our knowledge, our results on incorporating the PFP type approximation in nonconvex IPGs and analysing the associated rate of (global) convergence is the first result of its kind. We also note that the results in  are mainly about minimizing convex cost functions (i.e. not necessarily recovery guarantees) and that the corresponding linear convergence results only hold for uniformly strong convex objectives. We instead cover cases with local (and not uniform) strong convexity and establish the linear convergence of IPG for solving underdetermined inverse problems such as CS.
Using relative -approximate projections (described in Section IV-D) for CS recovery has been subject of more recent research activities (and mainly for nonconvex models). In  the authors studied this type of approximation for an IPG sparse recovery algorithm under the UoS model in redundant dictionaries. Our result encompasses this as a particular choice of model and additionally allows for inexactness in the gradient step. The work of  studied similar inexact oracles for a stochastic gradient-projection type algorithm customized for sparse UoS and low-rank models (see also  for low-rank CS using an accelerated variant of IPG). We share a similar conclusion with those works; for a large more measurements are required for CS recovery and the convergence becomes slow. Hegde et al.  proposed such projection oracle for tree-sparse signals and use it for the related model-based CS problem using a CoSamp type algorithm (see also [31, 32] for related works on inexact CoSamp type algorithms). In a later work  the authors consider a modified variant of IPG with -approximate projections for application to structured sparse reconstruction problems (specifically tree-sparse and earthmover distance CS models). For this scenario they are able to introduce a modified gradient estimate (called the Head approximation oracle) to strengthen the recovery guarantees by removing the dependency on , albeit with a more conservative Restricted Isometry Property. Unfortunately, this technique does not immediately generalize to arbitrary signal models and we therefore do not pursue this line of research here.
Iterative projected gradient iterates between calculating the gradient and projection onto the model i.e. for positive integers the exact form of IPG follows:
where, is the step size, and denote the exact gradient and the Euclidean projection oracles, respectively. The exact IPG requires the constraint set to have a well defined, not necessarily unique but computationally tractable Euclidean projection
Throughout we use as a shorthand for the Euclidean norm .
In the following we define three types of approximate oracles which frequently appear in the literature and could be incorporated within the IPG iterations. We also briefly discuss their applications. Each of these approximations are applicable to the data driven problem we will consider in Section VI.
Iii-a Fixed Precision (FP) approximate oracles
We first consider approximate oracles with additive bounded-norm errors, namely the fixed precision gradient oracle where:
and the fixed precision projection oracle where:
The values of denote the levels of inaccuracy in calculating the gradient and projection steps respectively.111Note that our fixed precision projection (5) is defined on the squared norms in a same way as in e.g. [19, 27]. The corresponding inexact IPG iterates as follows:
Note that, unlike [19, 27] in this formulation we allow for variations in the inexactness levels at different stages of IPG. The case where the accuracy levels are bounded by a constant threshold and , , refers to an inexact IPG algorithm with fixed precision (FP) approximate oracles.
Such errors may occur for instance in distributed network optimizations where the gradient calculations could be noisy during the communication on the network, or in CS under certain UoS models with infinite subspaces  where an exact projection might not exist by definition (e.g. when is an open set) however a FP type approximation could be achievable. It also has application in finite (discrete) super resolution , source localization and separation [35, 36] and data driven CS problems e.g., in Magnetic Resonance Fingerprinting [37, 38], where typically a continuous manifold is discretized and approximated by a large dictionary for e.g. sparse recovery tasks.
Iii-B Progressive Fixed Precision (PFP) approximate oracles
One obtains a Progressive Fixed Precision (PFP) approximate IPG by refining the FP type precisions thorough the course of iterations. Therefore any FP gradient or projection oracle which has control on tuning the accuracy parameter could be used in this setting and follows (6) with decaying sequences .
For instance this case includes projection schemes that require iteratively solving an auxiliary optimization (e.g. the total variation ball , sparse CUR factorization  or the multi-constraint inclusions [15, 16], etc) and can progress (at an appropriate rate) on the accuracy of their solutions by adding more subiterations. We also discuss in Section VI-C another example of this form which is customized for fast approximate nearest neighbour searches with application to the data driven CS framework.
Iii-C -approximate projection
Obtaining a fixed precision (and thus PFP) accuracy in projections onto certain constraints might be still computationally exhaustive, whereas a notion of relative optimality could be more efficient to implement. The -approximate projection is defined as follows: for a given ,
We note that might not be unique. In this regard, the inexact IPG algorithm with a -approximate projection takes the following form:
Note that we still assume a fixed precision gradient oracle with flexible accuracies . One could also consider a -approximate gradient oracle and in combination with those aforementioned inexact projections however for brevity and since the analysis would be quite similar to the relative approximate projection we decide to skip more details on this case.
The tree -sparse projection in and in the exact form requires solving a dynamical programming with running time  whereas solving this problem approximately with accuracy requires the time complexity  which better suits imaging problems in practice with a typical Wavelet sparsity level . Also [28, 29] show that one can reduce the cost of low-rank matrix completion problem by using randomized linear algebra methods, e.g. see [40, 41], and carry out fast low-rank factorizations with a type approximation. In a recent work on low-rank tensor CS  authors consider using a approximation (within an IPG algorithm) for low-rank tensor factorization since the exact problem is generally NP hard . Also in Section VI-C we discuss a data driven CS recovery algorithm which uses approximate nearest neighbour searches in order to break down the complexity of the projections from (using an exhaustive search) to , for denoting a large number of data points with low intrinsic dimensionality.
Iv Main results
Iv-a Uniform linear embeddings
The success of CS paradigm heavily relies on the embedding property of certain random sampling matrices which preserves signal information for low dimensional but often complicated/combinatorial models. It has been shown that IPG can stably predict the true signal from noisy CS measurements provided that satisfies the so called Restricted Isometry Property (RIP):
for a small constant . This has been shown for models such as sparse, low-rank and low-dimensional smooth manifold signals and by using IPG type reconstruction algorithms which in the nonconvex settings are also known as Iterative Hard Thresholding [43, 12, 44, 27, 6]. Interestingly these results indicate that under the RIP condition (and without any assumption on the initialization) the first order IPG algorithms with cheap local oracles can globally solve nonconvex optimization problems.
For instance random orthoprojectors and i.i.d. subgaussian matrices satisfy RIP when the number of measurements is proportional to the intrinsic dimension of the model (i.e. signal sparsity level, rank of a data matrix or the dimension of a smooth signal manifold, see e.g.  for a review on comparing different CS models and their measurement complexities) and sublinearly scales with the ambient dimension .
A more recent work generalizes the theory of IPG to arbitrary bi-Lipschitz embeddable models , that is for given and it holds
for some constants . Similar to the RIP these constants are defined uniformly over the constraint set i.e. . There Blumensath shows that if
then IPG robustly solves the corresponding noisy CS reconstruction problem for all . This result also relaxes the RIP requirement to a nonsymmetric and unnormalised notion of linear embedding whose implication in deriving sharper recovery bounds is previously studied by .
Iv-B Hybrid (local-uniform) linear embeddings
Similarly the notion of restricted embedding plays a key role in our analysis. However we adopt a more local form of embedding and show that it is still able to guarantee stable CS reconstruction.
Given there exists constants for which the following inequalities hold:
Uniform Upper Lipschitz Embedding (ULE)
Local Lower Lipschitz Embedding (LLE)
Upon existence, and denote respectively the smallest and largest constants for which the inequalities above hold.
This is a weaker assumption compared to RIP or the uniform bi-Lipschitz embedding. Note that for any we have:
(where denotes the matrix spectral norm i.e. the largest singular value). However with such an assumption one has to sacrifice the universality of the RIP-dependent results for a signal dependent analysis. Depending on the study, local analysis could be very useful to avoid e.g. worst-case scenarios that might unnecessarily restrict the recovery analysis . Similar local assumptions in the convex settings are shown to improve the measurement bound and the speed of convergence up to very sharp constants [47, 48].
Unfortunately we are currently unable to make the analysis fully local as we require the uniform ULE constraint. Nonetheless, one can always plug the stronger bi-Liptchitz assumption into our results throughout (i.e. replacing with ) and regain the universality.
Iv-C Linear convergence of (P)FP inexact IPG for CS recovery
In this section we show that IPG is robust against deterministic (worst case) errors. Moreover, we show that for certain decaying approximation errors, the IPG solution maintain the same accuracy as for the approximation-free algorithm.
Assume satisfy the main Lipschitz assumption with constants . Set the step size . The sequence generated by Algorithm (6) obeys the following bound:
Theorem 1 implications for the exact IPG (i.e. ) and inexact FP approximate IPG (i.e. ) improve [19, Theorem 2] in three ways: first by relaxing the uniform lower Lipschitz constant to a local form with the possibility of conducting a local recovery/convergence analysis. Second, by improving the embedding condition for CS stable recovery to
or for a uniform recovery . And third, by improving the rate of convergence.
The following corollary is an immediate consequence of the linear convergence result established in Theorem 1 for which we do not provide a proof:
With assumptions in Theorem 1 the IPG algorithm with FP approximate oracles achieves the solution accuracy
for any and in a finite number of iterations
As it turns out in our experiments and aligned with the result of Corollary 1, the solution accuracy of IPG can not exceed the precision level introduced by a PF oracle. In this sense Corollary 1 is tight as a trivial converse example would be that IPG starts from the optimal solution but an adversarial FP scheme projects it to another point within a fixed distance.
Interestingly one can deduce another implication from Theorem 1 and overcome such limitation by using a PFP type oracle. Remarkably one achieves a linear convergence to a solution with the same accuracy as for the exact IPG, as long as geometrically decays. The following corollary makes this statement explicit:
Assume for some error decay rate and a constant . Under the assumptions of Theorem 1 the solution updates of the IPG algorithm with PFP approximate oracles is bounded above by:
Which implies a linear convergence at rate
for an arbitrary small .
Similar to Corollary 1 one can increase the final solution precision of the FPF type IPG with logarithmically more iterations i.e. in a finite number of iterations one achieves . Therefore in contrast with the FP oracles one achieves an accuracy within the noise level that is the precision of an approximation-free IPG.
Using the PFP type oracles can also maintain the rate of linear convergence identical as for the exact IPG. For this the approximation errors suffice to follow a geometric decaying rate of .
The embedding condition (11) sufficient to guarantee our stability results is invariant to the precisions of the FP/PFP oracles and it is the same as for an exact IPG.
Iv-D Linear convergence of inexact IPG with -approximate projection for CS recovery
In this part we focus on the inexact algorithm (8) with a -approximate projection. As it turns out by the following theorem we require a stronger embedding condition to guarantee the CS stability compared to the previous algorithms.
Assume satisfy the main Lipschitz assumption and that
for and some constant . Set the step size . The sequence generated by Algorithm (8) obeys the following bound:
Similar conclusions follow as in Corollaries 1 and 2 on the linear convergence, logarithmic number of iterations vs. final level of accuracy (depending whether the gradient oracle is exact or FP/PFP) however with a stronger requirement than (11) on the embedding; increasing i.e. consequently , limits the recovery guarantee and slows down the convergence (compare in Theorems 1 and 2). Also approximations of this type result in amplifying distortions (i.e. constants ) due to the measurement noise and gradient errors. For example for an exact or a geometrically decaying PFP gradient updates and for a fixed chosen according to the Theorem 2 assumptions, Algorithm (8) achieves a full precision accuracy (similar to the exact IPG) in a finite number of iterations.
The assumptions of Theorem 2 impose a stringent requirement on the scaling of the approximation parameter i.e. which is not purely dependent on the model-restricted embedding condition but also on the spectral norm . In this sense since ignores the structure of the problem it might scale very differently than the corresponding embedding constants and . For instance a i.i.d. Gaussian matrix has w.h.p. (when ) whereas, e.g. for sparse signals, the embedding constants w.h.p. scale as . A similar gap exists for other low dimensional signal models and for other compressed sensing matrices e.g. random orthoprojectors. This indicates that the oracles may be sensitive to the CS sampling ratio i.e. for we may be limited to use very small approximations .
In the following we show by a deterministic example that this requirement is indeed tight. We also empirically observe in Section VII that such a limitation indeed holds in randomized settings (e.g. i.i.d. Gaussian ) and on average. Although it would be desirable to modify the IPG algorithm to avoid such restriction, as was done in  for specific structured sparse models, we note that this is the same term that appears due to ’noise folding’ when the signal model is not exact or when there is noise in the signal domain (see the discussion in Section IV-E). As such most practical CS systems will inevitably have to avoid regimes of extreme undersampling.
A converse example
Consider a noiseless CS recovery problem where , and the sampling matrix (i.e. here a row vector) is
for some parameter . Consider the following one-dimensional signal model along the first coordinate:
We have indeed . It is easy to verify that both of the embedding constants w.r.t. to are
Therefore one can tune to obtain arbitrary small ratios for .
Assume the true signal, the corresponding CS measurement, and the initialization point are
Consider an adversarial -approximate projection oracle which performs the following step for any given :
For simplicity we assume no errors on the gradient step. By setting , the corresponding inexact IPG updates as
and only along the first dimension (we note that due to the choice of oracle ). Therefore we have
which requires for convergence, and it diverges otherwise. As we can see for (i.e. where becomes extremely unstable w.r.t. sampling the first dimension) the range of admissible shrinks, regardless of the fact that the restricted embedding exhibits a perfect isometry; which is an ideal situation for solving a noiseless CS (i.e. in this case an exact IPG takes only one iteration to converge).
Iv-E When the projection is not onto the signal model
One can also make a distinction between the projection set and the signal model here denoted as (i.e. ) by modifying our earlier definitions (5) and (7) in the following ways: an approximate FP projection reads,
and a -approximate projection reads
With respect to such a distinction, Theorems 1 and 2 still hold (with the same embedding assumption/constants on the projection set ), conditioned that . Indeed this can be verified by following identical steps as in the proof of both theorems. This allows throughout the flexibility of considering an approximate projection onto a possibly larger set including the original signal model i.e. , which for instance finds application in fast tree-sparse signal or low-rank matrix CS recovery, see [30, 29]. Such an inclusion is also important to derive a uniform recovery result for all .
The case where can also be bounded in a similar fashion as in . We first consider a proximity point in i.e. and update the noise term to . We then use Theorems 1 and 2 to derive an error bound, here on . For this we assume the embedding condition uniformly holds over the projection set (which includes ). As a result we get a bound on the error which includes a bias term with respect to the distance of to . Note that since here also includes a signal (and not only measurement) noise term introduced by , the results are subjected to noise folding i.e. a noise amplification with a similar unfavourable scaling (when ) to our discussion in Remark 6 (for more details on CS noise folding see e.g. [49, 50]).
V-a Proof of Theorem 1
where the last inequality follows from the ULE property in Definition IV-B. Assuming , we have
Due to the update rule of Algorithm (6) and the inexact (fixed-precision) projection step, we have
The last inequality holds for any member of and thus here for . Therefore we can write
The last line replaces and uses the Cauchy-Schwartz inequality.
Similarly we use the LLE property in Definition IV-B to obtain an upper bound on :
where . Replacing this bound in (14) and cancelling from both sides of the inequality yields
We continue to lower bound the left-hand side of this inequality:
The first inequality uses the Cauchy-Schwartz’s and the second inequality follows from the ULE and LLE properties. Using this bound together with (15) we get
The last inequality assumes which holds since we previously assumed . As a result we deduce that
for and defined in Theorem 1. Applying this bound recursively (and setting ) completes the proof:
Note that for convergence we require and therefore, a lower bound on the step size which is .
V-B Proof of Corollary 2
V-C Proof of Theorem 2
where . For the last inequality we replace with two feasible points .
Note that so far we only assumed .
On the other hand by triangle inequality we have
The third inequality uses the ULE property and the last one holds since . Therefore, we get
Based on assumption of the theorem we can deduce
Applying this bound recursively (and setting ) completes the proof:
for defined in Theorem 2. The condition for convergence is which implies and a lower bound on the step size which is .
Vi Application in Data driven compressed sensing
Many CS reconstruction programs resort to signal models promoted by certain (semi) algebraic functions . For example we can have
where may be chosen as the or semi-norms or as the which promotes sparse, group-sparse or low-rank (for matrix spaces) solutions, respectively. One might also replace those penalties with their corresponding convex relaxations namely, the