AdaLISTA: Learned Solvers Adaptive to Varying Models
Abstract
Neural networks that are based on unfolding of an iterative solver, such as LISTA (learned iterative soft threshold algorithm), are widely used due to their accelerated performance. Nevertheless, as opposed to nonlearned solvers, these networks are trained on a certain dictionary, and therefore they are inapplicable for varying model scenarios. This work introduces an adaptive learned solver, termed AdaLISTA, which receives pairs of signals and their corresponding dictionaries as inputs, and learns a universal architecture to serve them all. We prove that this scheme is guaranteed to solve sparse coding in linear rate for varying models, including dictionary perturbations and permutations. We also provide an extensive numerical study demonstrating its practical adaptation capabilities. Finally, we deploy AdaLISTA to natural image inpainting, where the patchmasks vary spatially, thus requiring such an adaptation.
1 Introduction
Sparse coding is the task of representing a noisy signal as a combination of few base signals (called “atoms”), taken from a matrix – the “dictionary”. This is represented as the need to compute such that
(1) 
where the norm counts the nonzero elements, is the cardinality of the representation, and is often redundant (). Among the various approximation methods for handling this NPhard task, an appealing approach is a relaxation of the to an norm using Lasso or BasisPursuit Tibshirani (1996); Chen et al. (2001),
(2) 
An effective way to address this optimization problem uses an iterative algorithm such as ISTA (Iterative Soft Thresholding Algorithm) Daubechies et al. (2004), where the solution is obtained by iterations of the form
(3) 
where is the step size determined by the largest eigenvalue of the Gram matrix , and is the soft shrinkage function. FastISTA (FISTA) Beck and Teboulle (2009) is a speedup of the above iterative algorithm, which should remind the reader of the momentum method in optimization.
As a side note, we mention that ISTA has a much wider perspective when aiming to minimize a function of the form
(4) 
where and are convex functions, with possibly nonsmooth. The solution is given by the proximal gradient method Combettes and Wajs (2005); Beck (2017):
(5) 
The above fits various optimization problems such as a projected gradient descent over an indicator function , the matrix completion problem Mazumder et al. (2010), portfolio optimization Boyd and Vandenberghe (2004), nonnegative matrix factorization Sprechmann et al. (2015), and more.
Returning to the realm of sparse coding, the seminal work of LISTA (LearnedISTA) Gregor and LeCun (2010) has shown that by unfolding iterations of ISTA and freeing its parameters to be learned, one can achieve a substantial speedup over ISTA (and FISTA). Particularly, LISTA uses the following reparametrization:
(6) 
where and reparametrize the matrices and correspondingly. These two matrices and the scalar thresholding value are collectively referred to as – the parameters to be learned. The model, denoted as , is trained by minimizing the squared error between the predicted sparse representations at the th unfolding , and the optimal codes obtained by running ISTA itself,
(7) 
Once trained, LISTA requires only the test signals during inference, without their underlying dictionary. It has been shown in Gregor and LeCun (2010) that LISTA generalizes well for signals of the same distribution as in the train set, allowing a significant speedup versus its nonlearned counterparts. This may be explained by the fact that while nonlearned solvers do not make any assumption on the input signals, LISTA fits itself to the input distribution. More specifically, in sparse coding, the input signals are restricted to a union of lowdimensional Gaussians, as they are generated by a linear combination of few atoms. By focusing on such signals solely, this allows LISTA to achieve its acceleration. Note, however, that the original dictionary is hardcoded into the model weights via the ground truth solutions used during the supervised training. Given a new test sample that emerges from a slightly deviated (yet known) model/dictionary, LISTA will most likely deteriorate in performance, whereas ISTA and FISTA are expected to provide a robust and consistent result, as they are agnostic to the input signals and dictionary.
From a different point of view, a drawback of LISTA is its relevance to a single dictionary, requiring a separate and renewed training if the model evolves over time. Such is the case in video related applications as enhancement Protter and Elad (2008) or surveillance Zhao et al. (2011), where the dictionary should vary along time. Similarly, in some image restoration problems, the model encapsulated by the dictionary is often corrupted by an additional constant perturbation, e.g., the sensing matrix in compressive sensing Kulkarni et al. (2016), the blur kernel in nonblind image deblurring Tang et al. (2014), and a spatiallyvarying mask in image inpainting Mairal et al. (2007). In all these cases, deployment of the classic framework of LISTA necessitates a newly trained network for each new dictionary. An alternative to the above is incorporating LISTA as a fixed blackbox denoiser, and merging it within the plugandplay Venkatakrishnan et al. (2013) or RED Romano et al. (2017) schemes, significantly increasing the inference complexity.
Main Contributions:
Our aim in this work is to extend the applicability of LISTA to scenarios of model perturbations and varying signal distributions. More specifically,

We bridge the gap between the efficiency and the fast convergence rate of LISTA, and the high adaptivity and applicability of ISTA (and FISTA), by introducing “AdaLISTA” (AdaptiveLISTA). Our training is based on pairs of signals and their corresponding dictionaries, learning a generic architecture that wraps the dictionary by two auxiliary weight matrices. At inference, our model can accommodate the signal and its corresponding dictionary, allowing to handle a variety of model modifications without repetitive retraining.

We perform extensive numerical experiments, demonstrating the robustness of our model to three types of dictionary perturbations: permuted columns, additive Gaussian noise, and completely renewed random dictionaries. We demonstrate the ability of AdaLISTA to handle complex and varying signal models while still providing an impressive advantage over both learned and nonlearned solvers.

We prove that our modified scheme achieves a linear convergence rate under a constant dictionary. More importantly, we allow for noisy modifications and random permutations to the dictionary and prove that robustness remains, with an ability to reconstruct the ideal sparse representations with the same linear rate.

We demonstrate the use of our approach on natural image inpainting, which cannot be directly used with hardcoded models as LISTA. We show a clear advantage of AdaLISTA versus its nonlearned counterparts.
Adopting a wider perspective, our study contributes to the understanding of learned solvers and their ability to accelerate convergence. Common belief suggests that the signal model should be structured and fixed for successful learning of such solvers. Our work reveals, however, that effective learning can be achieved with a weaker constraint – having a fixed conditional distribution of the data given the model .
The LISTA concept of unfolding the iterations of a classical optimization scheme into an RNNlike neural network, and freeing its parameters to be learned over the training data, appears in many works. These include an unsupervised and online training procedure Sprechmann et al. (2015), a multilayer version Sulam et al. (2019), a gated mechanism compensating shrinkage artifacts Wu et al. (2020), as well as reducedparameter schemes Chen et al. (2018); Liu et al. (2019). This paradigm has been brought to various applications, such as compressed sensing, superresolution, communication, MRI reconstruction Zhang and Ghanem (2018); Metzler et al. (2017); Wang et al. (2015); Borgerding et al. (2017); Sun et al. (2016); Hershey et al. (2014), and more. A prominent line of work investigates the success of such learned solvers from a theoretical point of view Xin et al. (2016); Wang et al. (2016); Moreau and Bruna (2016); Giryes et al. (2018); Zarka et al. (2019). Most of these consider a fixed signal model, with the exception of “robustALISTA” Liu et al. (2019) that introduces an adaptive variation of LISTA. This scheme, however, is restricted to small model perturbations, and cannot address more complicated model variations. A more detailed discussion of the relevant literature in relation of our study appears in Section 4.
2 Proposed Method
Thus far, as depicted in Figure 1, one could either benefit from a high convergence rate using a learned solver as LISTA, while restricting the signals to a specific model, or employ a nonlearned and less effective solver as ISTA/FISTA that is capable of handling any pair of signal and its generative model. In this paper we introduce a novel architecture, termed “AdaptiveLISTA” (AdaLISTA), combining both benefits. Beyond enjoying the acceleration benefits of learned solvers, we incorporate the dictionary as part of the input at both training and inference time, allowing for adaptivity to different models. Figure 2 provides our suggested architecture, based on the following:
Definition 1 (AdaLISTA).
The AdaLISTA solver is defined
(8) 
The signal and the dictionary are the inputs, and the learned parameters are and .
The inference (for both ISTA and FISTA) and the training procedures of AdaLISTA are detailed in Algorithms 1 and 2 correspondingly. We consider a similar loss as in Equation 7, while also incorporating the concurrent dictionaries,
(9) 
This learning regime is supervised, requiring reference representations to be computed using ISTA. An unsupervised alternative can be envisioned, as in Sprechmann et al. (2015); Golts et al. (2018), where the loss is
In this paper we shall focus on the supervised mode of learning, leaving the unsupervised alternative to future work.
Several key questions arise on the applicability of the above learned solver: Does it work? and if so, is performance compromised by AdaLISTA, as opposed to training LISTA for each separate model? To what extent can it be used? Can it handle completely random models? Can theoretical guarantees be provided on its convergence rate, or adaptation capability? We aim to answer these questions, and we start with a theorem on the robustness of our scheme by proving linear rate convergence under varying model.
3 AdaLISTA: Theoretical Study
For the following study, we consider a reduced scheme of AdaLISTA with a single weight matrix, so as to avoid complication in theorem conditions. We emphasize, however, that the same claims can be derived for our original scheme.
Definition 2 (AdaLISTA – Single Weight Matrix).
AdaLISTA with a single weight matrix is defined by
(10) 
We start by recalling the definition of mutual coherence between two matrices:
Definition 3 (Mutual Coherence).
Given two matrices, and , if the diagonal elements of are equal to , then the mutual coherence is defined as
(11) 
where and are the th and th columns of and .
Our first goal is to prove that AdaLISTA is capable of solving the sparse coding problem in linear rate. We show that if all the signals emerge from the same dictionary , there exists a weight matrix and threshold values such that the recovery error decreases linearly over iterations. The following theorem indicates that if AdaLISTA’s training reaches its global minimum, the rate would be at least linear. In this part, we follow the steps in Zarka et al. (2019), which generalize the proof of ALISTA Liu et al. (2019) to noisy signals. The proof for Theorem 1 appears in Appendix A.
Theorem 1 (AdaLISTA Convergence Guarantee).
Consider a noisy input . If is sufficiently sparse,
(12) 
and the thresholds satisfy the condition
(13) 
with , , and , then the support in the th iteration of AdaLISTA (Definition 2) is included in the support of , and its values satisfy
(14) 
We proceed by claiming that AdaLISTA can be adaptive to model variations. In this setting, we argue that the signal can originate from different models, and nonetheless there exist global parameters such that AdaLISTA will converge in linear rate to the original representation. Our Theorem exposes the key idea that, as opposed to LISTA which corresponds to a single dictionary, AdaLISTA can be flexible to various models, while still providing good generalization. Appendix B contains the proof of the following Theorem.
Theorem 2 (AdaLISTA – The Applicable Dictionaries).
Consider a trained AdaLISTA network with a fixed , and noisy input . If the following conditions hold:
1. The diagonal elements of are close to : ;
2. The offdiagonals are bounded: ;
3. is sufficiently sparse: ; and
4. The thresholds satisfy
with , , and ,
then the support of the th iteration of AdaLISTA is included in the support of , and its values satisfy
(15) 
An interesting question arising is the following: Once AdaLISTA has been trained and the matrix is fixed, which dictionaries can be effectively served with the same parameters, without additional training? Theorem 2 reveals that as long as the effective matrix is sufficiently close to the identity matrix, linear convergence is guaranteed. This holds in particular for two interesting scenarios, proven in Appendices C and D:

Random permutations – If AdaLISTA converges for signals emerging from , it also converges for signals originating from any permutation of ’s atoms.

Noisy dictionaries – If AdaLISTA converges given a clean dictionary , satisfying , it also converges for noisy models , with some probability, depending on the distribution of .
To the best of our knowledge, Theorem 2 provides the first convergence guarantee in the presence of model variations, claiming that linear rate convergence is guaranteed, depending on the availability of small enough cardinality and low mutualcoherence . Note that the above claim, as in previous work Liu et al. (2019); Zarka et al. (2019), addresses the core capability of reaching linear convergence rate while disregarding both training and generalization errors.
4 Related Work
As already mentioned, the literature discussing LISTA and its successors, is abundant. In this section we aim to discuss relevant work to provide better context to our contribution.
The most relevant work to ours is “robustALISTA” Liu et al. (2019), introducing adaptivity to dictionary perturbations. Their work assumes that every signal comes from a different noisy model , where is an interference matrix. For each noisy dictionary this method computes an analytic matrix that minimizes the mutual coherence . Then, and are embedded in the architecture, and the training is performed over the step sizes and the thresholds only, leading to a considerable reduction in the number of trained parameters.
While RobustALISTA considers model perturbations only, we show empirically that our method can handle more complicated model deviations, as dictionary permutations and totally random models. Additionally, in terms of computational complexity, robustALISTA has a complicated calculation of the analytic matrices during inference time, a limitation that does not exist in our scheme. We refer the reader to Appendix F for a more detailed discussion on the difference between both methods.
As to the theoretical aspect of our study, Chen et al. (2018); Liu et al. (2019); Zarka et al. (2019); Wu et al. (2020) have recently shown that learned solvers can achieve linear convergence, under specific conditions on the sparsity level and mutual coherence. These results are the inspiration behind Theorem 1. This work, however, generalizes these guarantees to a varying model scenario, proving that the same weight matrix can serve different models while still reaching linear convergence.
5 Numerical Results
To demonstrate the effectiveness of our approach, we perform extensive numerical experiments, where our goal is twofold. First we examine AdaLISTA on a variety of synthetic data scenarios, including column permutations of the input dictionary, additive noisy versions of it, and completely random input dictionaries. Second, we perform a natural image inpainting experiment, showcasing our robustness to a realworld task
5.1 Synthetic Experiments
Experiment Setting.
We construct a dictionary with random entries drawn from a normal distribution, and normalize its columns to have a unit norm. Our signals are created as sparse combinations of atoms over this dictionary, . While the reported experiments in this section assume no additive noise, Appendix E presents a series of similar tests with varying levels of noise, showing the same qualitative results. The representation vectors are created by randomly choosing a support of cardinality with Gaussian coefficients, . Instead of using the true sparse representations as ground truth for training, we compute the Lasso solution with FISTA ( iterations, ), using the obtained signals and their corresponding dictionary . This is done in order to maintain a realworld setting, where one does not have access to the true sparse representations. We create in this manner examples for training, and for test. Our metric for comparison between different methods is the MSE (Mean Square Error) between the ground truth and the predicted sparse representations at unfoldings, . In all experiments, the AdaLISTA weight matrices are both initialized as the identity matrix. In the following set of experiments we gradually diverge from the initial model, given by the dictionary , by applying different degradation/modifications to it.
Random Permutations.
We start with a scenario in which the columns of the initial dictionary are permuted randomly to create a new dictionary . This transformation can occur in the nonconvex process of dictionary learning, in which different initializations might incur a different order of the resulting atoms. Although the signals’ subspace remains intact, learned solvers as LISTA where the dictionary is hardcoded during training, will most likely fail, as they cannot predict the updated support.
Here and below, we compare the results of four solvers: ISTA, FISTA, OracleLISTA and AdaLISTA, all versus the number of iterations/unfoldings, . For each training example in ISTA, FISTA and AdaLISTA, we create new instances of a permuted dictionary and its corresponding true representation, . We then apply FISTA for iterations and obtain the ground truth representations for the signal . Then ISTA and FISTA are applied for only iterations to solve for the pairs . Similarly, the supervised AdaLISTA is given the ground truth for training. In OracleLISTA we solve a simpler problem in which the dictionary is fixed () for all training examples . The results in Figure 3 clearly show that AdaLISTA is much more efficient compared to ISTA/FISTA, capable of mimicking the performance of the OracleLISTA, which considers a single constant .
Noisy Dictionaries.
In this experiment we aim to show that AdaLISTA can handle a more challenging case in which the dictionary varies by . Each training signal is created by drawing a different noisy instance of the dictionary and a sparse representation , and solving the FISTA to obtain . ISTA and FISTA receive the pairs , and AdaLISTA receives the triplet . By vanilla LISTA, we refer to a learned solver that obtains , and trains a network while disregarding the changing models. OracleLISTA, as before, handles a simpler case in which the dictionary is fixed, being , and all signals are created from it.
Figure 4 presents the performance of the different solvers with a decreasing SNR (Signal to Noise Ratio) of the dictionary
Random Dictionaries.
In this setting, we diverge even further from a fixed model, and examine the capability of our method to handle completely random input dictionaries. This time, for each training example we create a different Gaussian normalized dictionary , and a corresponding representation vector with an increasing cardinality: . The resulting signals, , and their corresponding dictionaries are fed to FISTA to obtain the ground truth sparse representations for training, . We compare the performance of ISTA, FISTA, AdaLISTA and OracleLISTA. Similarly to previous experiments, AdaLISTA is fed during training with the triplet . Vanilla LISTA cannot handle such variation in the input distribution, and thus it is omitted. For reference, we show the results of OracleLISTA in which all of the training signals are created from the same dictionary.
As can be seen in Figure 5, for a small cardinality of , OracleLISTA is able to drastically lower the reconstruction error as compared to ISTA and FISTA. This result, however, has already been demonstrated in Gregor and LeCun (2010). AdaLISTA which deals with a much more complex scenario, still provides a similar improvement over both ISTA and FISTA. As the cardinality increases to , the performance of both learned solvers deteriorates, and the improvement over their nonlearned counterparts diminishes.
The last experiment provides a valuable insight on the success of LISTAlike learned solvers. The common belief is that acceleration in convergence can be obtained when the signals are restricted to a union of lowdimensional subspaces, as opposed to the entire signal space. The above experiment suggests otherwise: Although the signals occupy the whole space, AdaLISTA still achieves improved convergence. This implies that the underlying structure should be only of the signal given its generative model , as opposed to the signal model, . In the above, even if the dictionaries are random, the signals must be sparse combinations of atoms. As this assumption of structure weakens with the increased cardinality, the resulting acceleration becomes less prominent. We believe that this conditional information is the key for improved convergence.
5.2 Natural Image Inpainting
In this section we apply our method to a natural image inpainting task. We assume the image is corrupted by a known mask with a ratio of missing pixels. Thus, the updated objective we wish to solve is
(16) 
where is a corrupt patch of the same size as the clean one, is a dictionary trained on clean image patches, and represents the mask, being an identity matrix with a percentage of diagonal elements equal to zero. Thus, the dictionary is constant, but each patch has a different (yet known) inpainting mask, and thus the effective dictionary changes for each signal.
Barbara  Boat  House  Lena  Peppers  C.man  Couple  Finger  Hill  Man  Montage  
ISTA  
FISTA  
AdaLFISTA  26.09  30.03  32.36  32.50  28.81  27.94  30.02  28.25  30.86  30.67  27.22 
Updated Model.
We slightly change the formulation of the model described in Section 2, and reverse the roles of the input and learned matrices. Specifically, the updated shrinkage step (Equation 3) for image inpainting is
(17) 
We consider the mask as part of the input, while the dictionary is learned with the following parameterization:
(18) 
where are the same size as the dictionary , and initialized by it.
Experiment Setting.
In order to collect natural image patches, we use the BSDS500 dataset Martin et al. (2001) and divide it to and training, validation and test images correspondingly. To train the dictionary , we extract patches at random locations from the train images, subtract their mean and divide by the average standard deviation. The dictionary of size is learned via scikitlearn’s function MiniBatchDictionaryLearning with . To train our network, we randomly pick a subset of training and validation patches. We train the network to perform an image inpainting task with ratio of . Instead of using AdaLISTA as before, we tweak the architecture described in Equation (18) to unfold the FISTA algorithm, termed AdaLFISTA, as described in algorithm 1. The input to our network is triplets of the corrupt train patches , their corresponding mask , and the solutions of the FISTA solver applied for iterations on the corrupt signals. The output is the reconstructed representations .
We evaluate the performance of our method on images from the popular Set11, corrupted with the same inpainting ratio of , and compare between ISTA, FISTA and AdaLFISTA for a fixed number of iterations/unfoldings. We extract all overlapping patches in each image, subtract the mean and divide by the standard deviation, apply each solver, unnormalize the patches and return their mean, and then place them in their correct position in the image and average over overlaps. The quality of the results is measured in PSNR between the clean images and the reconstruction of their corrupt version. The patchwise validation error versus the the number of unfoldings is given in Figure 7; numerical results are given in Table 1, and select qualitative results are shown in Figure 6 and more in Appendix G. There is a clear advantage to AdaLFISTA over the nonlearned ISTA and FISTA solvers. In this setting of missing pixels, a hardcoded solver with a fixed , such as LISTA, cannot deal with the changing mask of each patch.
6 Conclusions
We have introduced a new extension of LISTA, termed AdaLISTA, which receives both the signals and their dictionaries as input, and learns a universal architecture that can cope with the varying models. This modification produces great flexibility in working with changing dictionaries, leveling the playing field with nonlearned solvers such as ISTA and FISTA that are agnostic to the entire signal distribution, while enjoying the acceleration and convergence benefits of learned solvers. We have substantiated the validity of our method, both in a comprehensive theoretical study, and with extensive synthetic and realworld experiments. Future work includes further investigation of the discussed rationale, and an extension to additional applications.
Appendix A Proof of Theorem 1
Proof.
This proof follows the steps from Zarka et al. (2019), with slight modifications to fit our scheme. Following the notations in Theorem 1, denotes the true sparse representation of the signal , and . In addition, we define as the support of a vector.
Induction hypothesis:
For any iteration the following hold

The estimated support is contained in the true support,
(19) 
The recovery error is bounded by
(20)
Base case:
We start by showing that the induction hypothesis holds for . Since we get that the support is empty and the support hypothesis Equation 19 holds. As for the recovery error, we get that
(21) 
Therefore, to verify Equation 20 we need to show that
(22) 
Since , for any index we can write
(23) 
where denotes the th column in and denotes the th element in . Multiplying each side by we get
(24) 
Since by assumption , the left term becomes . In addition, since, by assumption, there are no more than nonzeros in and is bounded by , we get the following bound
(25) 
By taking a maximum over we obtain
(26) 
Since we have assumed that , and
(27) 
we get
(28) 
Finally, since , and , we get
(29) 
as in Equation 20, and therefore the recovery error hypothesis holds for the base case.
Inductive step:
Assuming the induction hypothesis holds for iteration , we show that it also holds for the next iteration . We define and denote by the subset of columns in .
We start by proving the support hypothesis (Equation 19). By Definition 2, the following holds for any index :
(30) 
Placing , we get
(31) 
Since , the following holds:
Therefore, Equation 31 becomes
(32) 
We aim to show that for any , , as the support hypothesis suggests. Since , we can bound the input argument of the soft threshold by
(33) 
Using the induction assumption on the support, , we can upper bound the first term in the righthandside,
(34) 
Using the induction assumption on the recovery error (Equation 20), we have . Therefore, we get
(35) 
However, by our assumptions,
(36) 
Therefore,
(37) 
and by placing we get
(38) 
Since is the input to the soft threshold operator , and it is no bigger than the threshold, we get that , and the support hypothesis holds.
We proceed by proving that the recovery error hypothesis also holds (Equation 20). We use the fact that for any scalar triplet, , the soft threshold satisfies
(39) 
Therefore, following Equation 32 we get
As before, since , we have
(40) 
Therefore, by using Equation 36 we get
(41) 
and by placing we obtain
(42) 
By taking a maximum over , we establish the recovery error hypothesis (Equation 20), concluding the proof. ∎
Appendix B Proof of Theorem 2
We define an effective matrix . In this part, we aim to prove that linear convergence is guaranteed for any dictionary , satisfying two conditions: (i) the diagonal elements of are close to , and (ii) the offdiagonal elements of are bounded.
Proof.
This proof is based on Appendix A, with the following two modifications: The mutual coherence is replaced with , and the diagonal element is not assumed to be equal to , but rather bounded from below by .
The base case of the induction (Equation 26) now becomes:
(43) 
Since we assume , and
(44) 
we get
(45) 
As , , therefore the induction hypothesis holds for the base case.
Moving to the inductive step, the proof of the support hypothesis remains almost the same, apart from replacing with . This is due to the fact that if , then , and therefore the diagonal elements multiply zero elements.
As to the recovery error hypothesis, we need to upper bound for . Since we need to modify Equation 31:
(46) 
Using Equation 39 we get that is upper bounded by
(47) 
which in turn is upper bounded by
(48) 
Placing results in
(49) 
Taking a maximum over establishes the recovery error assumption, proving the induction hypothesis. ∎
Appendix C Proof for Random Permutations
We show that if the weight matrix leads to linear convergence for signals generated by , then linear convergence is also guaranteed for signals originating from , where is a permutation matrix. The proof is straightforward, as the permutation matrix does not flip diagonal and offdiagonal elements in the effective matrix . Thus, the mutual coherence does not change and the conditions of Theorem 2 hold, establishing linear convergence.
Appendix D Proof for Noisy Dictionaries
We now consider signals from noisy models, , where , and the model deviations are of Gaussian distribution, . Given pairs of ), we show that AdaLISTA recovers the original representations , with respect to their model in linear rate.
Theorem 3 (AdaLISTA Convergence – Noisy Model).
Consider a noisy input , where , . If for some constants , is sufficiently sparse,
(50) 
and the thresholds satisfy
(51) 
with , , , , and , then, with probability of at least , the support of the th iteration of AdaLISTA is included in the support of and its values satisfy
(52) 
Proof.
The proof for this theorem consists of two stages. First, we study the effect of model perturbations on the effective matrix , deriving probabilistic bounds for the changes in the diagonal and offdiagonal elements. Then, we place these bounds in Theorem 2 to guarantee linear rate.
We start by bounding the changes in the effective matrix . These deviations modify the offdiagonal elements, which are no longer bounded by , and the diagonal elements that are not equal to anymore. Define as:
(53) 
This implies is equal to:
(54) 
Since and the elements in are independent, the expected value of is
(55) 
To bound the changes in we aim to use Cantelli’s inequality, but first, we need to find the variance of :
In what follows we calculate each term in the righthandside, starting with :
(56) 
Moving on to , we get