Dictionary Learning with BLOTLESS Update
Abstract
Algorithms for learning a dictionary to sparsely represent a given dataset typically alternate between sparse coding and dictionary update stages. Methods for dictionary update aim to minimise the expansion error by updating the dictionary vectors and expansion coefficients given the patterns of nonzero coefficients obtained in the sparse coding stage. We propose a block total least squares (BLOTLESS) algorithm for dictionary update. BLOTLESS updates a block of dictionary elements and the corresponding sparse coefficients simultaneously. In the error-free case, three necessary conditions for exact recovery are identified. Lower bounds on the number of training data are established so that the necessary conditions hold with high probability. Numerical simulations show that the bounds approximate well the number of training data needed for exact dictionary recovery. Numerical experiments further demonstrate several benefits of dictionary learning with BLOTLESS update compared with state-of-the-art algorithms, especially when the amount of training data is small.
I Introduction
Sparse signal representation has found a wide range of applications, including image denoising [19, 32], image inpainting [19], image deconvolution [11], image super-resolution [48, 16], etc. The key idea behind the concept of sparse representation is that natural signals tend to have sparse representations under certain bases/dictionaries. Hence, finding a dictionary under which a given data set can be represented in a sparse manner has become a very active area of research. Although numerous analytical dictionaries exist, including Fourier bases [15], discrete cosine transform (DCT) dictionaries, wavelets [14], curvelets [12], etc., the need to adapt to the properties of specific data sets has long been driving research efforts towards the development of efficient algorithms for dictionary learning [35, 5]. More formally, dictionary learning is the problem of finding a dictionary of vectors such that the training samples can be written as the product of the dictionary and a sparse coefficient matrix. Of particular interest is overcomplete dictionary learning, where the number of dictionary items is larger than the data dimension, and the number of training samples is typically much larger than the size of the dictionary. Dictionary learning is a nonconvex bilinear inverse problem, very challenging to solve in general.
The bilinear dictionary learning problem is typically approached by alternating between two stages: sparse coding and dictionary update [35, 5, 20, 29, 18, 45]. In the sparse coding stage, the goal is to find sparse representations of the training samples for a given dictionary. For that purpose, scores of algorithms have been developed, which can be divided into two main categories. The first category consists of greedy algorithms, including orthogonal matching pursuit (OMP) [46], regularized orthogonal matching pursuit (ROMP) [34], subspace pursuit (SP) [17], etc. In the second category, sparse coding is formulated as a convex optimization problem in which the ℓ1 norm is used to promote sparsity [13], and then optimization techniques, e.g. the fast iterative shrinkage-thresholding algorithm (FISTA) [10], can be applied. Reviews of sparse recovery algorithms can be found in [47].
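As a concrete illustration of the greedy category, the following is a minimal numpy sketch of OMP; the function name and interface are ours, not taken from any cited implementation.

```python
import numpy as np

def omp(D, y, k):
    """Greedy orthogonal matching pursuit: pick up to k atoms of D to fit y."""
    m, n = D.shape
    residual = y.copy()
    support = []
    for _ in range(k):
        # Select the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # Least-squares fit on the selected atoms, then update the residual.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(n)
    x[support] = coef
    return x
```

With an orthonormal dictionary the greedy selections coincide with the largest-magnitude inner products, so exact recovery of a k-sparse signal is guaranteed in that special case.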
The goal of the dictionary update stage is to refine the dictionary so that the training samples have more accurate sparse representations given the indices of the nonzero coefficients obtained in the sparse coding stage. In the probabilistic framework, one may apply either the maximum likelihood (ML) estimator [35] or the maximum a posteriori (MAP) estimator [29], and then solve the resulting problems using gradient descent procedures. In the context of the ML formulation [35], Engan et al. [20] proposed the method of optimal directions (MOD), where the sparse coefficients are fixed and the dictionary update problem is cast as a least squares problem which can be solved efficiently; modifications of MOD were subsequently proposed in [1, 41, 21].
Recently an alternative approach for dictionary update has become popular, in which both the dictionary and the sparse coefficients are updated simultaneously for a given sparsity pattern. Representative algorithms include the famous KSVD algorithm [5, 42] and SimCO [18]. The crux of the KSVD algorithm [5] is to update dictionary items and their corresponding sparse coefficients simultaneously, sequentially one by one. KSVD was subsequently extended to allow the simultaneous update of multiple dictionary elements and the corresponding coefficients [42]. SimCO [18], of which KSVD is a special case, goes further and updates the whole dictionary and all sparse coefficients simultaneously. The main idea of SimCO is that, given a sparsity pattern, the sparse coefficients can be viewed as a function of the dictionary. As a result, dictionary update becomes a nonconvex optimisation problem with respect to the dictionary. The optimisation is then performed using the gradient descent method combined with a heuristic subroutine designed to deal with singular points, which can prevent convergence to the global minimum [18].
Due to the nonconvexity of the dictionary learning problem, it is challenging to understand under which conditions exact dictionary recovery is possible and which method is optimal in achieving it. Following early efforts on the theoretical analysis of exact dictionary recovery [4, 26, 36, 24, 39, 28, 40], Spielman et al. [43] more recently studied the dictionary learning problem with complete dictionaries, where the dictionary can be represented as a square matrix. By solving a certain sequence of linear programs, they showed that one can recover a complete dictionary when the coefficient matrix is a sufficiently sparse random matrix. In [2, 3, 8, 7], the authors propose algorithms which combine clustering, spectral initialization, and local refinement to recover overcomplete and incoherent dictionaries.
Again these algorithms succeed when the coefficient matrix has sufficiently few nonzeros per column. The work in [9] provides a polynomial-time algorithm that recovers a large class of overcomplete dictionaries in a sparse regime. However, the proposed algorithm runs in super-polynomial time when the sparsity level grows. Similarly, in [6] the authors proposed a super-polynomial time algorithm that guarantees recovery for denser coefficient matrices. Sun et al. [44, 27], on the other hand, proposed a polynomial-time algorithm that provably recovers a complete dictionary given sufficiently sparse coefficients and sufficiently many training samples.
This paper addresses the dictionary update problem, in which both the dictionary and the sparse coefficients are updated for a given sparsity pattern. Whilst this is a subproblem of overall dictionary learning, it is nevertheless an important step towards solving the overall problem, and its bilinear nature makes it nonconvex and hence very challenging to solve. Our main contributions are as follows.
- For the error-free case, when the sparsity pattern is known exactly, three necessary conditions for unique recovery are identified, expressed in terms of lower bounds on the number of training data. Numerical simulations show that the theoretical bounds well approximate the empirical number of training data needed for exact dictionary recovery. In particular, we derive the scaling of the number of training samples needed for complete dictionary update.
- BLOTLESS is numerically demonstrated to be robust to errors in the assumed sparsity pattern. When embedded into the overall dictionary learning process, it leads to a faster convergence rate and requires fewer training samples compared to state-of-the-art algorithms including MOD, KSVD and SimCO.
Our work is inspired by recent work [31] in which bilinear inverse problems are formulated as linear inverse problems. The main difference is that our theoretical analysis and algorithm designs in Sections III and IV are specifically tailored to the generic dictionary update problem, while the focus of [31] is self-calibration, which can be viewed as dictionary learning with only diagonal dictionaries. Parts of the results in this paper were presented in the conference paper [49]. In this journal paper, we refine the bounds in Section III and provide detailed proofs, add two total least squares algorithms in Section IV, and include more simulation results in Section V to support the design of the algorithm.
This paper is organized as follows. Section II briefly reviews dictionary learning and update methods. Section III discusses an ideal case where exact dictionary recovery is possible, for which a least squares method is developed and analysed. In Section IV, the general case of dictionary update is discussed, and the least squares method is extended to total least squares methods, leading to BLOTLESS. Results of extensive simulations are presented in Section V and conclusions are drawn in Section VI.
I-A Notation
In this paper, ‖·‖ denotes the ℓ2 norm and ‖·‖_F stands for the Frobenius norm. For a positive integer n, define the index set [n] = {1, 2, …, n}. For a matrix, subscripted symbols denote its rows and columns respectively. Consider the sparse coefficient matrix. Its support set is the index set containing the indices of all its nonzero entries, and the support set of a row vector is defined analogously. Restricting a row vector to its support gives the row vector obtained by keeping its nonzero entries and removing all its zero entries. We also use standard symbols for the identity matrix, the vector with all entries equal to 1, and the vector with all zero entries. For a given set, its complement is taken with respect to the relevant index set.
II Dictionary Learning: The Background
Dictionary learning is the process of finding a dictionary which sparsely represents given training samples. Let be the training sample matrix, where is the dimension of training sample vectors and is the number of training samples. The overall dictionary learning problem is often formulated as:
(1) 
where is the dictionary, is the sparse coefficient matrix, the pseudonorm gives the number of nonzero elements, also known as sparsity level, and is the upper bound of the sparsity level.
Dictionary learning algorithms typically iterate between two stages: sparse coding and dictionary update. The goal of sparse coding is to find a sparse coefficient matrix for a given dictionary. One way to achieve this is to solve the problem
(2) 
In the dictionary update stage, the goal is to refine the dictionary with either fixed sparse coefficients or a fixed sparsity pattern, i.e. fixed locations of the nonzero coefficients. The famous MOD method [20] falls into the first category. With fixed sparse coefficients, dictionary update is simply a least squares problem.
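As an illustration, the MOD least squares update admits a closed-form solution via the pseudo-inverse of the coefficient matrix; a minimal numpy sketch follows (the unit-norm column normalisation is the usual dictionary convention, not part of the least squares solution itself):

```python
import numpy as np

def mod_update(Y, X):
    """MOD dictionary update: minimise ||Y - D X||_F over D for fixed X."""
    D = Y @ np.linalg.pinv(X)           # closed-form least squares solution
    # Normalise columns to unit l2 norm (standard dictionary convention).
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    return D
```

With noise-free data and coefficients of full row rank, this update recovers the generating dictionary exactly.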
A more popular and advantageous approach is to simultaneously update the dictionary and nonzero sparse coefficients by fixing only the sparsity pattern. With this idea, dictionary update is then formulated as [5, 18, 42]
(3) 
where gives the vector formed by the entries of indexed by . However, problem (3) is bilinear, nonconvex, and challenging to solve.
Among the many methods for solving (3), we briefly review KSVD [5] and SimCO [18] here. The KSVD algorithm successively updates individual dictionary items and the corresponding sparse coefficients whilst keeping all other dictionary items and coefficients fixed:
(4) 
The optimal solution can be obtained by taking the largest left and right singular vectors of the corresponding residual matrix.
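The single-atom update just described can be sketched as follows: the residual with the current atom removed, restricted to the samples that use it, is given its best rank-1 fit by the leading singular vectors. This is a sketch of the standard K-SVD step; the function name is ours.

```python
import numpy as np

def ksvd_atom_update(Y, D, X, j):
    """Update atom j and its nonzero coefficients via a rank-1 SVD fit."""
    omega = np.nonzero(X[j, :])[0]          # samples that use atom j
    if omega.size == 0:
        return D, X
    # Residual with atom j's contribution removed, restricted to omega.
    E = Y[:, omega] - D @ X[:, omega] + np.outer(D[:, j], X[j, omega])
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D, X = D.copy(), X.copy()
    D[:, j] = U[:, 0]                       # largest left singular vector
    X[j, omega] = s[0] * Vt[0, :]           # scaled right singular vector
    return D, X
```

Because the rank-1 SVD fit is optimal over all rank-1 matrices, the update can never increase the representation error.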
The idea of SimCO is to formulate the dictionary update problem in (3) as a nonconvex optimisation problem with respect to the overall dictionary, that is
(5) 
and then solve it using gradient descent of . This leads to an update of all dictionary items and sparse coefficients simultaneously. KSVD can be viewed as a special case of SimCO where the objective function reads
The focus of this paper is a novel solution to Problem (3).
III Exact Dictionary Recovery
This section focuses on an ideal case in which the dictionary can be exactly recovered. We assume that the training samples are generated from a tall or square ground-truth dictionary and that the sparsity pattern of the coefficient matrix is given. For ease of exposition, we focus on the case where the dictionary is a square matrix, as the same analysis is valid for a tall dictionary.
With given sparsity pattern denoted by , the dictionary update problem can be formulated as a bilinear inverse problem in which the goal is to find and such that
(6) 
The constraint is nonconvex. Generally speaking, it is challenging to solve (6) and there are no guarantees for the global optimality of the solution.
Least Squares Solver
Suppose that the unknown dictionary matrix is invertible. The nonconvex problem in (6) can be translated into a convex problem by using a strategy similar to that explored in [31]. Define . Then . The goal is now to find and such that
(7) 
or equivalently,
(8) 
where the subscripts are used to indicate matrix dimensions. In this manner the original bilinear problem (6) is cast as an equivalent linear least squares problem.
However, the formulation in (8) admits trivial solutions, for example the all-zero solution. In fact, (8) admits at least linearly independent solutions.
Proposition 1.
There are at least linearly independent solutions to the least squares problem in (8).
Proof.
This proposition is proved by construction. Let . Define matrix by keeping the th column of the matrix and setting all other columns to zero, that is, and for all . From the fact that , it is straightforward to verify that , , is a solution of (8).
The solutions , , are linearly independent. This can be easily verified by observing that the positions of nonzero elements in and , , are different. ∎
Necessary Conditions for Unique Recovery
We now consider the uniqueness of the solution in more detail and derive necessary conditions for unique recovery. Two ambiguities can be identified in the dictionary update problem in (8). The first is the permutation ambiguity. Consider the support sets (the index sets containing the indices of the nonzero entries) of two rows of the coefficient matrix. If they coincide, then swapping the corresponding rows (together with the corresponding dictionary columns, via the associated permutation matrix) yields another valid solution of (6). On the other hand, there is no permutation ambiguity if the supports of all rows are pairwise distinct. In practice, the given sparsity pattern is typically diverse enough to avoid permutation ambiguity.
The second is the scaling ambiguity, which cannot be avoided. Multiplying the dictionary by a diagonal matrix with nonzero diagonal elements, and the coefficients by its inverse, yields another valid solution of (6); all such solutions form an equivalence class. The scaling ambiguity can be addressed by introducing additional constraints. One option, used in [31], is to require that the sum of the elements in each row of the unknown matrix is one. With these constraints, one has
(9) 
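The constrained linear system above can be solved row by row in the least squares sense. The following is a minimal numpy sketch, assuming a square invertible ground-truth dictionary, noise-free samples, and an exactly known sparsity pattern; the function name and interface are ours. Each row of the inverse-dictionary matrix must annihilate the samples where the corresponding coefficient row is zero, and the sum-to-one constraint fixes the scaling.

```python
import numpy as np

def ls_dictionary_update(Y, supports):
    """Recover U = D^{-1} (up to the fixed scaling) and X from U Y = X.

    supports[i] holds the column indices where row i of X is nonzero.
    Row u_i must satisfy u_i @ Y[:, comp] = 0 on the complement comp,
    together with sum(u_i) = 1 to remove the scaling ambiguity.
    """
    m, N = Y.shape
    U = np.zeros((len(supports), m))
    for i, omega in enumerate(supports):
        comp = np.setdiff1d(np.arange(N), omega)
        # Homogeneous equations stacked with the sum-to-one constraint.
        A = np.vstack([Y[:, comp].T, np.ones((1, m))])
        b = np.zeros(len(comp) + 1)
        b[-1] = 1.0
        U[i], *_ = np.linalg.lstsq(A, b, rcond=None)
    return U, U @ Y
```

In the ideal case each row is determined uniquely, so the product of the recovered matrix with the true dictionary is diagonal.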
Henceforth, we define unique recovery as unique up to the scaling ambiguity.
Definition 1 (Unique Recovery).
The dictionary update problem (6) is said to admit a unique solution if all solutions are of the form and for some diagonal matrix with nonzero diagonal elements.
In the following, we identify three necessary conditions for unique recovery.
Proposition 2.
Assume that the dictionary is square and invertible. If problem (6) has a unique solution, then the following hold:
1) .
2) For all , the support set of the th row of , denoted by , satisfies .
3) For all and all , such that and .
Proof.
Necessary condition 1 is proved by using the fact that the solution of (9) is unique only if the number of equations is greater than or equal to the number of unknown variables. Comparing the number of unknown variables in (9) with the number of equations in (9), elementary calculations lead to the bound.
The proof of the other two necessary conditions is based on the fact that
where . To simplify the notation, we omit the subscript 0 in the rest of this proof. Divide the sample matrix into two submatrices. Then it holds that
( is in the null space of .) Hence is unique (up to a scaling factor) if and only if has dimension . In this case, is the null space of both and . It is concluded that is unique if and only if .
Necessary condition 2 follows directly from that .
To prove the last necessary condition, note that the fact that implies that each column of participates in generating some columns of . Necessary condition 3 is therefore proved. Note that condition 3 is not sufficient: it does not prevent the following rank-deficient case: there exist such that both and only participate in generating a single sample in for some . ∎
Discussions on the Number of Samples
We now study the number of samples needed to ensure that the necessary conditions for unique recovery, as specified in Proposition 2, hold with high probability. To that end, we use the following probabilistic model: the entries of the dictionary are independently generated from the Gaussian distribution, and the entries of the coefficient matrix are independently generated from the Bernoulli-Gaussian distribution, which is defined as follows.
Definition 2.
A random variable is Bernoulli-Gaussian distributed with , if , where the random variables and are independent, is Bernoulli distributed with parameter , and .
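The probabilistic model above is straightforward to simulate; a minimal numpy sketch of the data generation (dimensions and the activation probability below are illustrative choices, not values from the paper):

```python
import numpy as np

def bernoulli_gaussian(n, N, p, rng):
    """Sample an n x N matrix with i.i.d. Bernoulli(p)-Gaussian entries."""
    mask = rng.random((n, N)) < p          # Bernoulli part: active w.p. p
    return mask * rng.standard_normal((n, N))

rng = np.random.default_rng(0)
D = rng.standard_normal((16, 16))          # ground-truth dictionary, Gaussian entries
X = bernoulli_gaussian(16, 500, 0.3, rng)  # sparse coefficient matrix
Y = D @ X                                  # training samples
```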
Remark 1.
The Gaussian distribution is not essential. It can be replaced by any continuous distribution.
Proposition 3 (Number of Samples).
Proof.
See Appendix A. ∎
Remark 2.
We have the following observations.
1) With fixed and , and scale linearly with while is proportional to .
2) With fixed and , , , and increase proportionally to .
3) With fixed and , when increases from 0 to 1, and increase, while first decreases and then increases. This matches the intuition that when the sparsity ratio is too small, more samples are needed to have enough information to recover the dictionary, while when it is too large, more samples are needed to generate the orthogonal space of each dictionary element. This is verified by simulations in Section V.
The bound provides a good estimate of the number of samples needed for unique recovery. By elementary set theory, if an event A is a necessary condition for an event B, then B implies A, and hence the probability of B is at most the probability of A. In Proposition 3, the derived quantity is a lower bound on the number of samples for which these necessary conditions hold. Unfortunately, it is neither a lower nor an upper bound on the number of samples for which the dictionary can be uniquely recovered. Nevertheless, our simulations show that it is a good approximation to the number of samples needed to recover the dictionary uniquely with high probability.
In an asymptotic regime, the bounds can be simplified.
Corollary 1 (Asymptotic Bounds).
This corollary follows from elementary calculations and the fact that when .
IV Dictionary Update with Uncertainty
While Section III studies the ideal case, this section focuses on the general case using the insights from Section III. In practice, there may be noise in the training samples, and there may be errors in the assumed sparsity pattern, so the exact equality in (6) may no longer hold. Following the idea in Section III, total least squares methods are applied to handle the uncertainties. The techniques for non-overcomplete and overcomplete dictionaries are developed in Sections IV-A and IV-B respectively.
IV-A Non-overcomplete Dictionary Update
In the case , let be the pseudoinverse of and assume that . Due to the uncertainty, Equation (9) becomes approximate, that is,
(10) 
Total least squares is a technique for solving a least squares problem in which errors in both the observations and the regression model are considered [33, 37]. It aims to minimise the total error via
(11) 
The constraint set above is nonconvex and hence (11) is a nonconvex optimisation problem. Nevertheless, its globally optimal solution can be obtained by using the singular value decomposition (SVD). Observe that the constraint in (11) implies a rank constraint. The optimal solution can be obtained from the right singular vectors corresponding to the smallest singular values of the stacked matrix, and the optimal denoised matrix is a best low-rank approximation of it.
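The classical SVD-based solution just described can be sketched in a few lines of numpy; the function name and interface are ours. The right singular vectors of the stacked matrix corresponding to the smallest singular values yield the solution in closed form.

```python
import numpy as np

def tls(A, B):
    """Classical total least squares for A Z ~= B via the SVD of [A, B].

    Returns Z minimising the Frobenius norm of the perturbations [dA, dB]
    subject to (A + dA) Z = B + dB.
    """
    n = A.shape[1]
    _, _, Vt = np.linalg.svd(np.hstack([A, B]))
    V = Vt.T
    V12 = V[:n, n:]       # top-right block of V
    V22 = V[n:, n:]       # bottom-right block, assumed invertible
    return -V12 @ np.linalg.inv(V22)
```

When the system is exactly consistent, the smallest singular values are zero and the TLS solution coincides with the exact one.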
The difficulty in applying total least squares directly is due to the additional constraint in (10). Below we present three possible solutions, of which the last, IterTLS, performs best and is adopted.
Structured Total Least Squares (STLS)
Consider having uncertainties in both and the sparsity pattern. Based on (10), a straightforward total least squares formulation is
(12)  
To solve the above nonconvex optimisation problem, we follow the approach in [30]. It involves an iterative process in which each iteration solves an approximate quadratic optimisation problem that admits a closed-form optimal solution.
At each iteration, denote the initial estimate of by . Note that the constraint set in (12) can be written as where
We consider the first order Taylor approximation of at given , which reads
where , , and is the corresponding Jacobian matrix. With this approximation, the nonconvex optimisation problem in (12) becomes a quadratic optimisation problem with equality constraints
(13)  
This is a quadratic optimisation problem with linear equality constraints, and it admits a closed-form solution by a direct application of the KKT conditions [23].
The STLS approach has two issues. The first is its very high computational cost: the quadratic optimisation problem (13) involves a large number of unknowns and equality constraints, and its closed-form solution involves a very large matrix. We derived the closed form of the Jacobian matrix and implemented a conjugate gradient algorithm that exploits the structure in (13) for a speed-up (details are omitted here). However, simulations in Section V-B show that the computation is still too slow for practical problems. The second issue is its inferior performance compared to the other TLS methods in Sections IV-A2 and IV-A3. This is because a Taylor approximation of the constraint is used in STLS, while the other TLS methods below incorporate the constraints directly without Taylor approximation.
Parallel Total Least Squares (ParTLS)
The key idea of ParTLS is to decouple the problem (10) into subproblems that can be solved in parallel:
It is straightforward to verify that this is equivalent to
(14) 
where is the projection matrix obtained by keeping the columns of the identity matrix indexed by and removing all other columns.
Subproblems (14) can be solved by directly applying the TLS formulation (11). Note that . The vector can be computed as a scaled version of the right singular vector of the matrix associated with its smallest singular value. Then and can be obtained from .
ParTLS enjoys the following advantages. 1) Its global optimality is guaranteed in the ideal case of no data noise or sparsity pattern errors; it is straightforward to see that in this case the ParTLS solutions satisfy (9). 2) It is computationally efficient: the subproblems (14) are of small size and can be solved in parallel. However, ParTLS also has its own issue: certain structures in the problem are not enforced. In particular, the 'denoised' data can differ across the subproblems.
Iterative Total Least Squares (IterTLS)
IterTLS is an iterative algorithm in which each iteration formulates a total least squares problem based on the estimate from the previous iteration. It starts with an initial estimate obtained by solving the ideal-case equation (9). In each iteration, given the current estimate of the coefficients, from either the initialisation or the previous iteration, we formulate the following total least squares problem
(15) 
which has the same form as (11). Note that the constraint in (10) is imposed implicitly. The problem (15) can be optimally solved by using the SVD.
The optimal solution is given in closed form from the SVD. To prepare the next iteration, one obtains an updated estimate by applying a simple projection operator that re-imposes the sparsity pattern on the denoised coefficients. With this new estimate, one proceeds with the next iteration until convergence.
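The IterTLS loop can be sketched as follows: a plain TLS solve for the inverse-dictionary matrix alternates with a projection of the denoised coefficients back onto the known sparsity pattern. This is a simplified sketch under the square-dictionary assumption; the function name, interface, and iteration count are ours.

```python
import numpy as np

def iter_tls(Y, X0, supports, n_iter=5):
    """IterTLS sketch: alternate a TLS solve for U in U Y ~= X with a
    projection of the denoised coefficients onto the sparsity pattern."""
    X = X0.copy()
    mask = np.zeros_like(X, dtype=bool)
    for i, omega in enumerate(supports):
        mask[i, omega] = True
    for _ in range(n_iter):
        # TLS for U: with A = Y^T and B = X^T, the unknown is Z = U^T.
        A, B = Y.T, X.T
        m = A.shape[1]
        _, _, Vt = np.linalg.svd(np.hstack([A, B]))
        V = Vt.T
        U = (-V[:m, m:] @ np.linalg.inv(V[m:, m:])).T
        # Projection step: zero out entries outside the known support.
        X = (U @ Y) * mask
    return U, X
```

On noise-free data with a correct sparsity pattern, the iteration is a fixed point and the recovered matrix inverts the true dictionary.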
IV-B Overcomplete Dictionary Update
The difficulty of overcomplete dictionary update comes from the fact that, for an overcomplete dictionary, the pseudo-inverse is not a left inverse in general. Therefore, the above least squares and total least squares approaches cannot be directly applied.
To address this issue, a straightforward approach is to divide the whole dictionary into a set of subdictionaries, each of which is either complete or undercomplete, and then update these subdictionaries one by one whilst fixing all other subdictionaries and the corresponding coefficients. More explicitly, given the current estimates, consider updating one subdictionary, i.e. the submatrix of the dictionary consisting of the columns indexed by a given block, together with the corresponding rows of the coefficient matrix. Then, consider the residual matrix
(16) 
and apply the method in Section IV-A3 to fit the subdictionary and its coefficients to this residual. Then repeat this step for all subdictionaries. As the dictionary is updated block by block, we refer to our algorithm as BLOck Total LEast SquareS (BLOTLESS).
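The block-by-block outer loop can be sketched as follows. The inner non-overcomplete update is left as a pluggable function (in practice IterTLS); the sketch below, with names of our choosing, only illustrates how the residual of (16) feeds each block update.

```python
import numpy as np

def blotless_sketch(Y, D, X, blocks, update_fn):
    """BLOTLESS outer loop: update each non-overcomplete subdictionary
    (and its coefficient rows) against the residual left by the others.

    update_fn(R, D_block, X_block) is any non-overcomplete update rule,
    e.g. an IterTLS solve; here it is a caller-supplied function.
    """
    D, X = D.copy(), X.copy()
    for block in blocks:                 # block: list of atom indices
        # Residual of all the *other* blocks, cf. (16).
        R = Y - D @ X + D[:, block] @ X[block, :]
        D[:, block], X[block, :] = update_fn(R, D[:, block], X[block, :])
    return D, X
```

For instance, with a simple least squares refit as the inner update and noise-free data, a corrupted block is restored exactly because the residual isolates that block's contribution.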
V Numerical Test
Parts of the numerical tests are based on synthetic data. The training samples are generated according to the probabilistic model specified in Section III. When the dictionary recovery is not exact, the performance criterion is the difference between the ground-truth dictionary and the estimated dictionary. In particular, the estimation error is defined as
(17) 
where and denote the th items in the estimated and ground-truth dictionaries respectively. The items in both dictionaries are normalised to have unit norm.
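Since (17) is not reproduced in this extraction, the sketch below implements one common form of such a dictionary-distance measure (our assumption, not necessarily the paper's exact definition): for each ground-truth atom, one minus the largest absolute inner product with any estimated atom, averaged over atoms, with all columns normalised to unit norm.

```python
import numpy as np

def dictionary_error(D_true, D_est):
    """Assumed form of the dictionary estimation error: average over
    ground-truth atoms of 1 - |<d_i, d_hat_j>| for the best-matching
    estimated atom, with unit-norm columns."""
    Dn = D_true / np.linalg.norm(D_true, axis=0)
    En = D_est / np.linalg.norm(D_est, axis=0)
    G = np.abs(Dn.T @ En)          # |inner products| between all atom pairs
    return float(np.mean(1.0 - G.max(axis=1)))
```

A measure of this form is invariant to the inherent permutation and sign/scaling ambiguities, which is why such matching-based errors are standard in dictionary recovery experiments.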
Numerical tests based on real data are presented in Section V-D2 for image denoising. The performance metric is the peak signal-to-noise ratio (PSNR) of the denoised images.
V-A Simulations for Exact Dictionary Recovery
In this section we numerically evaluate the bounds in Proposition 3. All the results presented here are based on 100 random and independent trials. For the theoretical performance prediction, we compute the bounds of Proposition 3. In the numerical simulations, we vary the number of training samples and find its critical value, below which exact recovery happens with an empirical probability of at most 99% and above which exact recovery happens with an empirical probability of at least 99%.
We start with the relation between the number of training samples and the sparsity ratio. In Figure 1, we fix the other parameters, vary the sparsity ratio, and study the probability of exact recovery against the number of training samples. The results in Figure 1 show that the theoretical prediction is quite close to the critical value obtained by simulations. One can also observe that the required number of training samples first decreases and then increases as the sparsity ratio increases, which is also predicted by the theoretical bounds. A larger-scale numerical test is presented in Figure 2, where the four subfigures correspond to four different parameter settings. Once again, the bounds in Proposition 3 match the simulations very well.
Let us next consider the number of samples required for exact recovery as a function of the dictionary size for a fixed sparsity ratio. Simulation results are depicted in Figure 3; the observed scaling behaviour is consistent with Proposition 3.
Finally, we numerically study the accuracy of the asymptotic results in Corollary 1. We plot the normalised number of training samples for exact recovery as a function of the sparsity ratio for different dictionary sizes. The simulation results in Figure 4 show a trend that is consistent with the asymptotic results in Corollary 1.
V-B Total Least Squares Methods
The three total least squares methods introduced in Section IV, henceforth denoted BLOTLESS-STLS, BLOTLESS-ParTLS and BLOTLESS-IterTLS respectively, are compared by embedding them in the dictionary learning process. Random dictionaries are used as the initial starting point of dictionary learning, and OMP is used for sparse coding.
Table I: Runtime comparison (in seconds) of the three TLS methods.
STLS     621.6544   836.3027   1098.5955   1265.1895
ParTLS     9.2544    18.1191     25.3678     31.0551
IterTLS    6.3446     8.5157     10.9026     13.8056
Fig. 5 compares the dictionary learning errors (17) for both complete and overcomplete dictionaries. BLOTLESS-IterTLS converges the fastest and has the smallest error floor. A runtime comparison is then given in Table I, where BLOTLESS-IterTLS clearly outperforms the other two methods. It is therefore used as the default dictionary update method in later simulations.
V-C Robustness to Errors in Sparsity Pattern
Simulations are next designed to evaluate the robustness of different dictionary update algorithms to errors in the sparsity pattern. Towards that end, a fraction of the indices in the true support are randomly chosen and replaced with the same number of randomly chosen indices outside the true support set. This erroneous sparsity pattern is then fed into the different dictionary update algorithms. The numerical results in Fig. 6 demonstrate the robustness of BLOTLESS (with IterTLS).
V-D Dictionary Learning with BLOTLESS Update
This subsection compares the dictionary learning performance of different dictionary update methods. The sparse coding algorithm is OMP, and IterTLS from Section IV-A3 is used for BLOTLESS. Results for synthetic data are presented in Section V-D1, while Section V-D2 focuses on image denoising using real data.
Synthetic Data
Figs. 7 and 8 compare the performance of dictionary learning using different dictionary update algorithms. Fig. 7 focuses on the noise-free case, while Fig. 8 concerns the noisy case, where additive Gaussian noise with i.i.d. entries is added to the training samples and the signal-to-noise ratio (SNR) is set to 15 dB. Both figures include the cases of complete and overcomplete dictionaries. The results presented in Figs. 7 and 8 clearly indicate that dictionary learning based on BLOTLESS converges much faster and needs fewer training samples than the other benchmark dictionary update methods.
In the BLOTLESS update, blocks of the dictionary (subdictionaries) are updated sequentially. Fig. 9 compares the performance for different block sizes. Note that when each block contains only one dictionary item, the dictionary update problem is the same as that in KSVD; hence the performance of KSVD is added to Fig. 9. The simulations suggest that the larger the dictionary blocks, the faster the convergence and the better the performance. The performance of BLOTLESS with block size one is slightly better than that of KSVD. This is because the TLS step in IterTLS does not enforce the sparsity pattern and hence better accommodates errors in the estimated sparsity pattern.
Real Data
We use the Olivetti Research Laboratory (ORL) face database [38] for dictionary learning and then use the learned dictionary for image denoising.
For dictionary learning, according to the simulation results in Section V-D1, patches sampled from the face images suffice for training a dictionary via BLOTLESS. The parameters used in dictionary learning are , , , and . After learning a dictionary, image denoising [19] is performed using test images from the same dataset. The denoising results are shown in Table II, where four test images are used. In all four tests, the BLOTLESS method outperforms all other algorithms, which is consistent with the simulations in Section V-D1.
Table II: Original, noisy, and denoised test images with the corresponding PSNR values for MOD, KSVD, SimCO, and BLOTLESS.
VI Conclusion
This paper proposed the BLOTLESS algorithm for dictionary update. It divides the dictionary into subdictionaries, each of which is non-overcomplete. BLOTLESS then updates each subdictionary and the corresponding sparse coefficients using least squares or total least squares approaches. Necessary conditions for unique recovery are identified, and they hold with high probability when the number of training samples is larger than the bounds derived in Proposition 3. Simulations show that these bounds match the empirical results well, and that BLOTLESS outperforms other benchmark algorithms. One future direction is to find sufficient conditions for unique recovery and to compare the corresponding bounds with the necessary ones.
Appendix A: Proof of Proposition 3
The proof needs Hoeffding's inequality [22] for Bernoulli random variables, stated below.
Lemma 1 (Hoeffding’s Inequality).
Let $X_1, \dots, X_n$ be independent, identically distributed Bernoulli random variables, each taking the value 1 with probability $p$ and 0 with probability $1-p$. Then for any constant $t > 0$, the following Hoeffding inequality holds:
$$\mathbb{P}\left( \left| \frac{1}{n} \sum_{i=1}^{n} X_i - p \right| \ge t \right) \le 2 \exp\left(-2 n t^2\right). \tag{18}$$
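As a sanity check, the bound in (18) can be verified empirically. The snippet below is illustrative only; the values of n, p and t are arbitrary and not taken from the paper.

```python
import numpy as np

# Empirically check the Hoeffding bound (18) for Bernoulli(p) samples:
# P(|sample mean - p| >= t) should not exceed 2*exp(-2*n*t^2).
rng = np.random.default_rng(1)
n, p, t, trials = 200, 0.3, 0.1, 20000
means = rng.binomial(1, p, size=(trials, n)).mean(axis=1)
empirical = np.mean(np.abs(means - p) >= t)
bound = 2.0 * np.exp(-2.0 * n * t**2)
assert empirical <= bound   # holds comfortably: bound is about 0.037
```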
To derive the first bound, we consider the event that necessary condition 1 in Proposition 2 fails. That is,
Based on Hoeffding’s inequality, the probability of this event is upper bounded by
If this probability is required to be smaller than the prescribed threshold, it follows that
The left hand side of the above inequality is quadratic in , which after some elementary calculations leads to
The derivation of the second bound is similar. Consider the probability that necessary condition 2 in Proposition 2 fails:
where the inequality in the third line follows from the union bound. Applying Hoeffding’s inequality and requiring the resulting upper bound to be less than the prescribed threshold, we obtain
It follows that
To derive the third bound, we define the following event:

: For given , such that and .
Then the probability that the necessary condition 3 fails is given by
where the inequality in the second line follows from the union bound. If we set this probability to be smaller than , we obtain
References
 (2001) Optimized signal expansions for sparse representation. IEEE Trans. Signal Process. 49 (5), pp. 1087–1096. Cited by: §I.
 (2014) Learning sparsely used overcomplete dictionaries. In Conf. Learn. Theory, pp. 123–137. Cited by: §I.
 (2013) Exact recovery of sparsely used overcomplete dictionaries. Stat 1050, pp. 8–39. Cited by: §I.
 (2006) On the uniqueness of overcomplete dictionaries, and a practical way to retrieve them. Linear Algeb. and Its Appl. 416 (1), pp. 48–67. Cited by: §I.
 (2006) KSVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 54 (11), pp. 4311. Cited by: §I, §I, §I, §II, §II.
 (2014) More algorithms for provable dictionary learning. arXiv preprint arXiv:1401.0579. Cited by: §I.
 (2015) Simple, efficient, and neural algorithms for sparse coding. Proc. Mach. Learn. Res.. Cited by: §I.
 (2014) New algorithms for learning incoherent and overcomplete dictionaries. In Conf. Learn. Theory, pp. 779–806. Cited by: §I.
 (2015) Dictionary learning and tensor decomposition via the sum-of-squares method. In Proc. 47th Ann. ACM Symp. Theory of Computing, pp. 143–151. Cited by: §I.
 (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2 (1), pp. 183–202. Cited by: §I.
 (2005) Blind deconvolution of images using optimal sparse representations. IEEE Trans. Image Process. 14 (6), pp. 726–736. Cited by: §I.
 (2000) Curvelets: a surprisingly effective nonadaptive representation for objects with edges. Technical report Stanford Univ Ca Dept of Statistics. Cited by: §I.
 (2001) Atomic decomposition by basis pursuit. SIAM Rev. 43 (1), pp. 129–159. Cited by: §I.
 (1995) Discrete-time wavelet extrema representation: design and consistent reconstruction. IEEE Trans. Signal Process. 43 (3), pp. 681–693. Cited by: §I.
 (2000) On discrete short-time Fourier analysis. IEEE Trans. Signal Process. 48 (9), pp. 2628–2640. Cited by: §I.
 (2017) Sparse representation-based multiple frame video super-resolution. IEEE Trans. Image Process. 26 (2), pp. 765–781. Cited by: §I.
 (2009) Subspace pursuit for compressive sensing signal reconstruction. IEEE Trans. Inf. Theory 55 (5), pp. 2230–2249. Cited by: §I, §II.
 (2012) Simultaneous codeword optimization (SimCO) for dictionary update and learning. IEEE Trans. Signal Process. 60 (12), pp. 6340–6353. Cited by: §I, §I, §II, §II.
 (2006) Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Process. 15 (12), pp. 3736–3745. Cited by: §I, §VD2.
 (1999) Method of optimal directions for frame design. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Vol. 5, pp. 2443–2446. Cited by: §I, §I, §II.
 (2007) Family of iterative LS-based dictionary learning algorithms, ILS-DLA, for sparse signal representation. Digital Signal Process. 17 (1), pp. 32–49. Cited by: §I.
 (1963) Probability inequalities for sums of bounded random variables. J. of the Amer. Statist. Assoc. 58 (301), pp. 13–30. Cited by: §A.
 (2013) Practical methods of optimization; 2nd ed.. Wiley, Hoboken, NJ. Cited by: §IVA1.
 (2014) On the local correctness of minimization for dictionary learning. In IEEE Int. Symp. Inf. Theory (ISIT), pp. 3180–3184. Cited by: §I.
 (2012) Blind calibration for compressed sensing by convex optimization. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pp. 2713–2716. Cited by: 1st item.
 (2015) When can dictionary learning uniquely recover sparse data from subsamples?. IEEE Trans. Inf. Theory 61 (11), pp. 6290–6297. Cited by: §I.
 (2017) Complete dictionary recovery over the sphere II: recovery by Riemannian trust-region method. IEEE Trans. Inf. Theory 63 (2), pp. 885–914. Cited by: §I.
 (2014) On the identifiability of overcomplete dictionaries via the minimisation principle underlying KSVD. Appl. and Comput. Harmon. Anal. 37 (3), pp. 464–491. Cited by: §I.
 (2003) Dictionary learning algorithms for sparse representation. Neur. Comput. 15 (2), pp. 349–396. Cited by: §I, §I.
 (2003) Efficient implementation of a structured total least squares based speech compression method. Linear Algeb. and its Appl. 366, pp. 295–315. Cited by: §IVA1.
 (2018) Self-calibration and bilinear inverse problems via linear least squares. SIAM J. Imag. Sci. 11 (1), pp. 252–292. Cited by: 1st item, §I, §III1, §III2.
 (2017) Weighted j