Dictionary Learning with BLOTLESS Update

Algorithms for learning a dictionary to sparsely represent a given dataset typically alternate between sparse coding and dictionary update stages. Methods for dictionary update aim to minimise the expansion error by updating dictionary vectors and expansion coefficients, given the patterns of non-zero coefficients obtained in the sparse coding stage. We propose a block total least squares (BLOTLESS) algorithm for dictionary update. BLOTLESS updates a block of dictionary elements and the corresponding sparse coefficients simultaneously. In the error-free case, three necessary conditions for exact recovery are identified. Lower bounds on the number of training data are established so that the necessary conditions hold with high probability. Numerical simulations show that the bounds approximate well the number of training data needed for exact dictionary recovery. Numerical experiments further demonstrate several benefits of dictionary learning with BLOTLESS update compared with state-of-the-art algorithms, especially when the amount of training data is small.

I Introduction

Sparse signal representation has found a wide range of applications, including image denoising [19, 32], image inpainting [19], image deconvolution [11], image super-resolution [48, 16], etc. The key idea behind the concept of sparse representation is that natural signals tend to have sparse representations under certain bases/dictionaries. Hence, finding a dictionary under which a given data set can be represented in a sparse manner has become a very active area of research. Although numerous analytical dictionaries exist, including the Fourier basis [15], discrete cosine transform (DCT) dictionaries, wavelets [14], curvelets [12], etc., the need to adapt to properties of specific data sets has long been driving research efforts towards the development of efficient algorithms for dictionary learning [35, 5]. More formally, dictionary learning is the problem of finding a dictionary D ∈ R^{M×K} of K vectors such that the training samples Y ∈ R^{M×N} can be written as Y ≈ DX, where the coefficient matrix X ∈ R^{K×N} is sparse. Of particular interest is overcomplete dictionary learning, where the number of dictionary items is larger than the data dimension, i.e., K > M, and the number of training samples is typically much larger than the size of the dictionary, N ≫ K. Dictionary learning is a nonconvex bilinear inverse problem, very challenging to solve in general.

The bilinear dictionary learning problem is typically approached by alternating between two stages: sparse coding and dictionary update [35, 5, 20, 29, 18, 45]. In the sparse coding stage, the goal is to find sparse representations X of the training samples Y for a given dictionary D. For that purpose, scores of algorithms have been developed. They can be divided into two main categories. The first category consists of greedy algorithms, including orthogonal matching pursuit (OMP) [46], regularized orthogonal matching pursuit (ROMP) [34], subspace pursuit (SP) [17], etc. In the second category, sparse coding is formulated as a convex optimization problem in which the ℓ1-norm is used to promote sparsity [13], and then optimization techniques, e.g. the fast iterative shrinkage-thresholding algorithm (FISTA) [10], can be applied. Reviews of sparse recovery algorithms can be found in [47].

The goal of the dictionary update is to refine the dictionary so that the training samples have more accurate sparse representations, given the indices of non-zero coefficients obtained in the sparse coding stage. In the probabilistic framework, one may apply either a maximum likelihood (ML) estimator [35] or a maximum a posteriori (MAP) estimator [29], and then solve them by using gradient descent procedures. In the context of the ML formulation [35], Engan et al. [20] proposed the method of optimal directions (MOD) where the sparse coefficients are fixed and the dictionary update problem is cast as a least squares problem which can be solved efficiently; modifications of MOD were subsequently proposed in [1, 41, 21].

Recently an alternative approach for dictionary update has become popular, where both the dictionary and the sparse coefficients are updated simultaneously with a given sparsity pattern. Representative algorithms include the famous K-SVD algorithm [5, 42] and SimCO [18]. The crux of the K-SVD algorithm [5] is to update dictionary items and their corresponding sparse coefficients simultaneously, sequentially one by one. K-SVD was subsequently extended to allow simultaneous update of multiple dictionary elements and corresponding coefficients [42]. SimCO [18], of which K-SVD is a special case, goes further and updates the whole dictionary and sparse coefficients simultaneously. The main idea of SimCO is that, given a sparsity pattern, the sparse coefficients can be viewed as a function of the dictionary. As a result, the dictionary update becomes a nonconvex optimisation problem with respect to the dictionary. The optimisation is then performed using the gradient descent method combined with a heuristic sub-routine designed to deal with singular points which can prevent convergence to the global minimum [18].

Due to the non-convexity of the dictionary learning problem, it is challenging to understand under which conditions exact dictionary recovery is possible and which method is optimal in achieving it. Following early efforts on the theoretical analysis of exact dictionary recovery [4, 26, 36, 24, 39, 28, 40], more recently, Spielman et al. [43] studied the dictionary learning problem with complete dictionaries, where the dictionary is a square invertible matrix. By solving a certain sequence of linear programs, they showed that one can recover a complete dictionary D from Y = DX when X is a sparse random matrix with O(√n) nonzeros per column, where n denotes the dictionary size. In [2, 3, 8, 7], the authors propose algorithms which combine clustering, spectral initialization, and local refinement to recover overcomplete and incoherent dictionaries.

Again, these algorithms succeed when X has O(√n) nonzeros per column. The work in [9] provides a polynomial-time algorithm that recovers a large class of over-complete dictionaries when X has n^{1−δ} nonzeros per column for any constant δ > 0. However, the proposed algorithm runs in super-polynomial time when the sparsity level goes up to O(n). Similarly, in [6] the authors proposed a super-polynomial-time algorithm that guarantees recovery with close to O(n) nonzeros per column. Sun et al. [44, 27], on the other hand, proposed a polynomial-time algorithm that provably recovers a complete dictionary when X has O(n) nonzeros per column and the number of training samples is polynomial in n.

This paper addresses the dictionary update problem, where both the dictionary and the sparse coefficients are updated for a given sparsity pattern. Whilst it is a sub-problem of the overall dictionary learning problem, it is nevertheless an important step towards solving the overall problem, and its bilinear nature makes it nonconvex and hence very challenging to solve. Our main contributions are as follows.

  • BLOTLESS simultaneously updates a block of dictionary items and the corresponding sparse coefficients. Inspired by ideas presented in [31, 25], in BLOTLESS the bilinear nonconvex block update problem is transformed into a linear least squares problem, which can be solved efficiently.

  • For the error-free case, when the sparsity pattern is known exactly, three necessary conditions for unique recovery are identified, expressed in terms of lower bounds on the number of training data. Numerical simulations show that the theoretical bounds well approximate the empirical number of training data needed for exact dictionary recovery. In particular, the bounds characterise how the number of training samples needed for complete dictionary update scales with the dictionary size and the sparsity level.

  • BLOTLESS is numerically demonstrated to be robust to errors in the assumed sparsity pattern. When embedded into the overall dictionary learning process, it leads to a faster convergence rate and requires fewer training samples compared to state-of-the-art algorithms including MOD, K-SVD and SimCO.

Our work is inspired by a recent work [31] where bilinear inverse problems are formulated as linear inverse problems. The main difference is that our theoretical analysis and algorithm designs in Sections III and IV are specifically tailored to the generic dictionary update problem while the focus of [31] is self-calibration which can be viewed as dictionary learning with only diagonal dictionaries. Parts of the results in this paper were presented in the conference paper [49]. In this journal paper, we refine the bounds in Section III and provide detailed proofs, add two total least squares algorithms in Section IV, and include more simulation results in Section V to support the design of the algorithm.

This paper is organized as follows. Section II briefly reviews dictionary learning and update methods. Section III discusses an ideal case where exact dictionary recovery is possible, for which a least squares method is developed and analysed. In Section IV, the general case of dictionary update is discussed, and the least squares method is extended to total least squares methods, leading to BLOTLESS. Results of extensive simulations are presented in Section V and conclusions are drawn in Section VI.

I-A Notation

In this paper, ||·||_2 denotes the ℓ2-norm and ||·||_F stands for the Frobenius norm. For a positive integer n, define [n] := {1, 2, …, n}. For a matrix A, A_{i,:} and A_{:,j} denote the i-th row and the j-th column of A, respectively. Consider the sparse coefficient matrix X. Let Ω be the support set of X, i.e., the index set containing the indices of all nonzero entries in X. Let Ω_k be the support set of the row vector X_{k,:}. Then X_{k,Ω_k} is the row vector obtained by keeping the nonzero entries of X_{k,:} and removing all its zero entries. Symbols I, 1, and 0 denote the identity matrix, the vector of which all the entries are 1, and the vector with all zero entries, respectively. For a given set Ω_k ⊆ [N], Ω_k^c denotes its complement in [N].

II Dictionary Learning: The Background

Dictionary learning is the process of finding a dictionary which sparsely represents given training samples. Let Y ∈ R^{M×N} be the training sample matrix, where M is the dimension of the training sample vectors and N is the number of training samples. The overall dictionary learning problem is often formulated as:

min_{D,X} ||Y − DX||_F^2 subject to ||X_{:,n}||_0 ≤ S for all n ∈ [N], (1)

where D ∈ R^{M×K} is the dictionary, X ∈ R^{K×N} is the sparse coefficient matrix, the ℓ0 pseudo-norm ||·||_0 gives the number of non-zero elements, also known as the sparsity level, and S is the upper bound on the sparsity level.

Dictionary learning algorithms typically iterate between two stages: sparse coding and dictionary update. The goal of sparse coding is to find a sparse coefficient matrix X for a given dictionary D. One way to achieve this is to solve the problem

min_X ||Y − DX||_F^2 subject to ||X_{:,n}||_0 ≤ S for all n ∈ [N], (2)

using greedy algorithms, for example OMP [46] or SP [17].
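As a concrete illustration of the greedy approach, the following is a minimal OMP sketch in NumPy. The function name and interface are ours, not from [46], and library implementations (e.g. scikit-learn's) differ in detail:

```python
import numpy as np

def omp(D, y, S):
    """Greedy sparse coding: select up to S atoms (columns of D) for sample y."""
    residual = y.copy()
    support = []
    for _ in range(S):
        # pick the atom most correlated with the current residual
        k = int(np.argmax(np.abs(D.T @ residual)))
        support.append(k)
        # least-squares refit of y on all selected atoms, then update the residual
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x
```

Each iteration adds one atom to the support and refits all selected coefficients by least squares, which is what distinguishes OMP from plain matching pursuit.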

In the dictionary update stage, the goal is to refine the dictionary with either fixed sparse coefficients or a fixed sparsity pattern, i.e. fixed locations of non-zero coefficients. The famous MOD method [20] falls into the first category. With fixed sparse coefficients, dictionary update is simply a least squares problem, whose solution is given by D = Y X^T (X X^T)^{−1} = Y X^+.

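The MOD update can be sketched in one line of NumPy (a hypothetical helper of ours, using the standard least squares closed form):

```python
import numpy as np

def mod_update(Y, X):
    """MOD dictionary update: with X fixed, the least-squares dictionary
    is D = Y X^T (X X^T)^{-1}, i.e. Y times the pseudo-inverse of X."""
    return Y @ np.linalg.pinv(X)
```

Using the pseudo-inverse also covers the case where X X^T is rank deficient.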
A more popular and advantageous approach is to simultaneously update the dictionary and the nonzero sparse coefficients by fixing only the sparsity pattern. With this idea, dictionary update is then formulated as [5, 18, 42]

min_{D,X} ||Y − DX||_F^2 subject to X_{k,Ω_k^c} = 0 for all k ∈ [K], (3)

where X_{k,Ω_k} gives the vector formed by the entries of X_{k,:} indexed by Ω_k. However, problem (3) is bilinear, nonconvex, and challenging to solve.

Among the many methods for solving (3), we here briefly review K-SVD [5] and SimCO [18]. The K-SVD algorithm successively updates individual dictionary items and the corresponding sparse coefficients whilst keeping all other dictionary items and coefficients fixed:

min_{d_k, X_{k,Ω_k}} ||E_k − d_k X_{k,Ω_k}||_F^2, where E_k := Y_{:,Ω_k} − Σ_{i≠k} d_i X_{i,Ω_k}. (4)

The optimal solution can be obtained by taking the largest left and right singular vectors of the residual matrix E_k.
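A minimal sketch of this rank-1 atom update (our own implementation; production K-SVD adds further details such as replacing unused atoms):

```python
import numpy as np

def ksvd_atom_update(Y, D, X, k):
    """K-SVD step: rank-1 update of atom D[:, k] and row X[k, :] on its support."""
    omega = np.flatnonzero(X[k, :])          # samples that currently use atom k
    if omega.size == 0:
        return D, X
    # residual with atom k's contribution removed, restricted to its support
    E = Y[:, omega] - D @ X[:, omega] + np.outer(D[:, k], X[k, omega])
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, k] = U[:, 0]                        # best rank-1 factor: unit-norm atom
    X[k, omega] = s[0] * Vt[0, :]            # corresponding coefficients
    return D, X
```

Because the best rank-1 approximation of E_k can be no worse than the current pair (d_k, X_{k,Ω_k}), each such step never increases the overall fitting error.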

The idea of SimCO is to formulate the dictionary update problem in (3) as a nonconvex optimisation problem with respect to the overall dictionary, that is,

f(D) := min_{X: X_{k,Ω_k^c} = 0, ∀k} ||Y − DX||_F^2, (5)

and then solve it using gradient descent of f(D). This leads to an update of all dictionary items and sparse coefficients simultaneously. K-SVD can be viewed as a special case of SimCO where the objective function is the same f restricted to a single dictionary item at a time.
The focus of this paper is a novel solution to Problem (3).

III Exact Dictionary Recovery

This section focuses on an ideal case in which the dictionary can be exactly recovered. We assume that the training samples in Y are generated from Y = DX, where D is a tall or square matrix (M ≥ K) and the sparsity pattern of X is given. For compositional convenience, we focus on the case where D is a square matrix, M = K, as the same analysis is valid for a tall dictionary with M > K.

With the given sparsity pattern denoted by Ω, the dictionary update problem can be formulated as a bilinear inverse problem in which the goal is to find D and X such that

Y = DX, X_{k,Ω_k^c} = 0 for all k ∈ [K]. (6)

The bilinear constraint Y = DX is nonconvex in (D, X). Generally speaking, it is challenging to solve (6) and there are no guarantees for the global optimality of the solution.

Least Squares Solver

Suppose that the unknown dictionary matrix D is invertible. The nonconvex problem in (6) can be translated into a convex problem by using a strategy similar to that explored in [31]. Define W := D^{−1}. Then X = WY. The goal is now to find W and X such that

WY = X, X_{k,Ω_k^c} = 0 for all k ∈ [K], (7)

or equivalently,

[W_{K×K}, −I_{K×K}] [Y; X] = 0_{K×N}, X_{k,Ω_k^c} = 0 for all k ∈ [K], (8)

where the subscripts are used to indicate matrix dimensions. In this manner the original bilinear problem (6) is cast as an equivalent linear least squares problem.

However, the formulation in (8) admits the trivial solution W = 0 and X = 0. In fact, (8) admits at least K linearly independent solutions.

Proposition 1.

There are at least K linearly independent solutions to the least squares problem in (8).


This proposition is proved by construction. Let W_0 := D^{−1} and let X_0 be the ground-truth coefficient matrix. For each k ∈ [K], define the matrix W^{(k)} by keeping the k-th row of W_0 and setting all other rows to zero, that is, W^{(k)}_{k,:} = (W_0)_{k,:} and W^{(k)}_{i,:} = 0 for all i ≠ k; define X^{(k)} from X_0 analogously. From the fact that W_0 Y = X_0, it is straightforward to verify that each pair (W^{(k)}, X^{(k)}), k ∈ [K], is a solution of (8).

The solutions (W^{(k)}, X^{(k)}), k ∈ [K], are linearly independent. This can be easily verified by observing that the positions of the nonzero elements in W^{(k)} and X^{(k)} are different for different k. ∎
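The construction in this proof can be checked numerically. A small sketch (all names are ours; W0 plays the role of D^{-1}):

```python
import numpy as np

rng = np.random.default_rng(3)
K, N = 5, 40
D = rng.standard_normal((K, K))                  # square dictionary, invertible w.h.p.
X = rng.standard_normal((K, N)) * (rng.random((K, N)) < 0.4)
Y = D @ X
W0 = np.linalg.inv(D)

for k in range(K):
    Wk = np.zeros_like(W0); Wk[k, :] = W0[k, :]  # keep only row k of W0
    Xk = np.zeros_like(X);  Xk[k, :] = X[k, :]   # keep only row k of X
    # each pair (Wk, Xk) satisfies the homogeneous system W Y = X
    assert np.allclose(Wk @ Y, Xk, atol=1e-6)
```

The K pairs are linearly independent since their nonzero rows do not overlap.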

Necessary Conditions for Unique Recovery

We now consider the uniqueness of the solution in more detail and derive necessary conditions for unique recovery. Two ambiguities can be identified in the dictionary update problem in (8). The first is the permutation ambiguity. Let Ω_i and Ω_j be the support sets (the index sets containing the indices corresponding to nonzero entries) of the i-th and j-th rows of X. If Ω_i = Ω_j, then the tuple (D P_{ij}, P_{ij} X) is a valid solution of (6), where P_{ij} denotes the permutation matrix generated by permuting the i-th and j-th rows of the identity matrix. On the other hand, there is no permutation ambiguity if Ω_i ≠ Ω_j for all i ≠ j. In practice, the given sparsity pattern is typically diverse enough to avoid permutation ambiguity.

The second is the scaling ambiguity, which cannot be avoided. Let Λ be a diagonal matrix with nonzero diagonal elements. It is clear that the tuple (D Λ, Λ^{−1} X) is also a valid solution of (6). All solutions of the form (D Λ, Λ^{−1} X) form an equivalence class. The scaling ambiguity can be addressed by introducing additional constraints. One option, used in [31], is that the sum of the elements in each row of W is one, i.e., W 1 = 1. With these constraints, one has, for each k ∈ [K],

W_{k,:} Y_{:,Ω_k^c} = 0, W_{k,:} 1 = 1. (9)
Henceforth, we define unique recovery as unique up to the scaling ambiguity.

Definition 1 (Unique Recovery).

The dictionary update problem (6) is said to admit a unique solution if all solutions are of the form (D Λ, Λ^{−1} X) for some diagonal matrix Λ with nonzero diagonal elements.

In the following, we identify three necessary conditions for unique recovery.

Proposition 2.

Assume that D is square and invertible. If the problem (6) has a unique solution, then the following hold:

  1. N ≥ K − 1 + (1/K) Σ_{k∈[K]} |Ω_k|.

  2. For all k ∈ [K], the support set of the k-th row of X, denoted by Ω_k, satisfies |Ω_k^c| ≥ K − 1.

  3. For all k ∈ [K] and all i ≠ k, there exists j ∈ [N] such that j ∈ Ω_i and j ∈ Ω_k^c.


Necessary condition 1 is proved by using the fact that the solution of (9) is unique only if the number of equations is larger than or equal to the number of unknown variables. The number of unknown variables in (9) is K^2, while the number of equations in (9) is Σ_{k∈[K]} (N − |Ω_k|) + K. Elementary calculations lead to the bound N ≥ K − 1 + (1/K) Σ_{k∈[K]} |Ω_k|.

The proof of the other two necessary conditions is based on the fact that

(W_0)_{k,:} Y_{:,Ω_k^c} = 0,

where W_0 := D_0^{−1}. To simplify the notation, we omit the subscript 0 from W_0, X_0 and D_0 in the rest of this proof. Divide the sample matrix Y into the two sub-matrices Y_{:,Ω_k} and Y_{:,Ω_k^c}. Then it holds that

W_{k,:} Y_{:,Ω_k^c} = 0,

that is, W_{k,:}^T lies in the null space of Y_{:,Ω_k^c}^T. Hence W_{k,:} is unique (up to a scaling factor) if and only if this null space has dimension 1. It is concluded that W_{k,:} is unique if and only if rank(Y_{:,Ω_k^c}) = K − 1.

Necessary condition 2 follows directly from the fact that rank(Y_{:,Ω_k^c}) = K − 1 requires |Ω_k^c| ≥ K − 1.

To prove the last necessary condition, note first that the fact that rank(Y_{:,Ω_k^c}) = K − 1 implies that each column of D other than d_k participates in generating some columns of Y_{:,Ω_k^c}. That is, each d_i, i ≠ k, participates in generating Y_{:,j} for some j ∈ Ω_k^c. Necessary condition 3 is therefore proved. Note that condition 3 is not sufficient. It does not prevent the following rank-deficient case: there exist i_1 ≠ i_2 such that both d_{i_1} and d_{i_2} only participate in generating a single sample Y_{:,j} for some j ∈ Ω_k^c. ∎

Discussions on the Number of Samples

We now study the number of samples needed to ensure that the necessary conditions for unique recovery, as specified in Proposition 2, hold with high probability. To that end we use the following probabilistic model: entries of D are independently generated from the Gaussian distribution N(0, 1), and entries of X are independently generated from the Bernoulli-Gaussian distribution with parameter θ, where the Bernoulli-Gaussian distribution is defined as follows.

Definition 2.

A random variable x is Bernoulli-Gaussian distributed with parameter θ ∈ (0, 1], if x = b · g, where the random variables b and g are independent, b is Bernoulli distributed with parameter θ, and g ∼ N(0, 1).

Remark 1.

The Gaussian distribution is not essential. It can be replaced by any continuous distribution.

Proposition 3 (Number of Samples).

Suppose that Y = DX, where D and X are generated according to the above probability model. Given a constant ε ∈ (0, 1), the i-th necessary condition in Proposition 2 holds with probability at least 1 − ε if N ≥ N_i, where N_i denotes the corresponding sample-size bound, i ∈ {1, 2, 3}.

Furthermore, it holds that N_3 ≥ max(N_1, N_2). Hence, if N ≥ N_3, then all three necessary conditions in Proposition 2 hold.


See Appendix A. ∎

Remark 2.

We have the following observations.

  • With fixed θ and ε, N_1 and N_2 scale linearly with the dictionary size K, while N_3 grows faster by a logarithmic factor.

  • With fixed K and θ, N_1, N_2, and N_3 increase proportionally to log(1/ε).

  • With fixed K and ε, when θ increases from 0 to 1, N_1 and N_2 increase, while N_3 first decreases and then increases. This matches the intuition that when θ is too small, we need more samples to have enough information to recover the dictionary. On the other hand, when θ is too large, more samples are needed so that each sub-matrix Y_{:,Ω_k^c} is large enough to determine the orthogonal complement that identifies W_{k,:}. This is verified by simulations in Section V.

The bound provides a good estimate of the number of samples needed for unique recovery. By set theory, if event A is a necessary condition for event B, then B implies A, or equivalently, P(B) ≤ P(A). In Proposition 3, the quantity N_3 is a lower bound on the number of samples at which the necessary conditions hold with high probability. Unfortunately, it is neither a lower nor an upper bound on the number of samples at which the dictionary can be uniquely recovered. Nevertheless, our simulations show that N_3 is a good approximation to the number of samples needed to recover the dictionary uniquely with probability more than 99%.

In an asymptotic regime, the bounds can be simplified.

Corollary 1 (Asymptotic Bounds).

Consider the same settings as in Proposition 3. For a given θ, let the dictionary size K → ∞. If N grows in accordance with the bounds of Proposition 3, then all three necessary conditions in Proposition 2 hold with a probability arbitrarily close to 1.

This corollary follows from elementary calculations and the asymptotic behaviour of the bounds in Proposition 3 as K → ∞.

IV Dictionary Update with Uncertainty

While Section III studies the ideal case, this section focuses on the general case using the insight from Section III. In practice, there may be noise in the training samples , and there may be errors in the assumed sparsity pattern. The exact equality in (6) may not hold any longer. Following the idea in Section III, total least squares methods are applied to handle the uncertainties. The techniques for non-overcomplete and overcomplete dictionaries are developed in Sections IV-A and IV-B respectively.

IV-A Non-overcomplete Dictionary Update

In the case M ≥ K, let W be the pseudo-inverse of D and assume that WY = X. Due to the uncertainty, Equation (9) becomes approximate, that is,

WY ≈ X, X_{k,Ω_k^c} = 0 for all k ∈ [K]. (10)

Total least squares is a technique to solve a least squares problem of the form WY ≈ X where errors in both the observations and the regression model are considered [33, 37]. It targets minimising the total error via

min_{W, ΔY, ΔX} ||[ΔY^T, ΔX^T]||_F^2 subject to W (Y + ΔY) = X + ΔX. (11)

The constraint set above is nonconvex and hence (11) is a nonconvex optimisation problem. Nevertheless, its globally optimal solution can be obtained by using the singular value decomposition (SVD). Set Z := [Y^T, X^T]. Observe that the constraint in (11) implies that the corrected matrix Z + [ΔY^T, ΔX^T] has a null space of dimension at least K, containing the rows of [W, −I]. The optimal W can be obtained from the K right singular vectors of Z corresponding to its smallest singular values, and the optimal correction makes Z + [ΔY^T, ΔX^T] a best lower-rank approximation of the matrix Z.
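For concreteness, here is a minimal sketch of the classic single-column TLS solution via the SVD (names are ours; the matrix case takes the K smallest right singular vectors instead of one):

```python
import numpy as np

def tls(A, b):
    """Total least squares for A x ≈ b: the solution comes from the right
    singular vector of [A, b] associated with the smallest singular value."""
    Z = np.column_stack([A, b])
    _, _, Vt = np.linalg.svd(Z)
    v = Vt[-1, :]
    # a TLS solution exists when the last entry of v is nonzero
    return -v[:-1] / v[-1]
```

With exact (noise-free) data the smallest singular value is zero and TLS reduces to the ordinary least squares solution.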

The difficulty in applying total least squares directly is due to the additional sparsity constraint in (10). Below we present three possible solutions, of which the last one, IterTLS, performs best and is therefore adopted.

Structured Total Least Squares (STLS)

Consider having uncertainties in both Y and the sparsity pattern. Based on (10), a straightforward total least squares formulation is

min_{W, ΔY, ΔX} ||[ΔY^T, ΔX^T]||_F^2 subject to W (Y + ΔY) = X + ΔX, X_{k,Ω_k^c} = 0 for all k ∈ [K]. (12)

To solve the above nonconvex optimisation problem, we follow the approach in [30]. It involves an iterative process where each iteration solves an approximated quadratic optimisation problem which admits a closed-form optimal solution.

At each iteration, denote the current estimate of the unknowns by ẑ. The constraint set in (12) can be written as h(z) = 0, where z collects all the unknowns and h collects the bilinear equality constraints. We consider the first-order Taylor approximation of h at the given ẑ, which reads

h(z) ≈ h(ẑ) + J(ẑ)(z − ẑ),

where J(ẑ) is the corresponding Jacobian matrix. With this approximation, the nonconvex optimisation problem in (12) becomes a quadratic optimisation problem with equality constraints

min_z ||[ΔY^T, ΔX^T]||_F^2 subject to h(ẑ) + J(ẑ)(z − ẑ) = 0. (13)

This is a quadratic optimisation problem with linear equality constraints, and admits a closed-form solution by a direct application of the KKT conditions [23].

The STLS approach has two issues. The first issue is its very high computational cost. The quadratic optimisation problem (13) involves all the entries of W, ΔY and ΔX as unknowns, together with a large number of equality constraints, and its closed-form solution involves a correspondingly large matrix. We have obtained the closed form of the Jacobian matrix J and implemented a conjugate gradient algorithm that exploits the structure in (13) for a speed-up (details are omitted here). However, simulations in Section V-B show that the computation speed is still too slow for practical problems. The second issue is its inferior performance compared to the other TLS methods in Sections IV-A2 and IV-A3. This is because a Taylor approximation of the constraint is used in STLS, while the other TLS methods below incorporate the constraints directly without Taylor approximation.

Parallel Total Least Squares (ParTLS)

The key idea of ParTLS is to decouple the problem (10) into K sub-problems, one for each row W_{k,:}, that can be solved in parallel:

W_{k,:} Y ≈ X_{k,:}, X_{k,Ω_k^c} = 0.

It is straightforward to verify that this is equivalent to

[W_{k,:}, −X_{k,Ω_k}] [Y; P_{Ω_k}^T] ≈ 0, (14)

where P_{Ω_k} is the projection matrix obtained by keeping the columns of the identity matrix indexed by Ω_k and removing all other columns.

Sub-problems (14) can be solved by directly applying the TLS formulation (11). Note that X_{k,:} = X_{k,Ω_k} P_{Ω_k}^T. The vector [W_{k,:}, −X_{k,Ω_k}] can be computed as a scaled version of the right singular vector of the matrix [Y^T, P_{Ω_k}] corresponding to its smallest singular value. Then W_{k,:} and X_{k,Ω_k} can be obtained from this vector.

ParTLS enjoys the following advantages. 1) Its global optimality is guaranteed for the ideal case of no data noise or sparsity pattern errors; it is straightforward to see that in the ideal case the ParTLS solutions satisfy (9). 2) It is computationally efficient: the sub-problems (14) are of small size and can be solved in parallel. However, ParTLS also has its own issue, namely that certain structures in the problem are not enforced. For different sub-problems k, the 'denoised' data matrix Y + ΔY^{(k)} can be different.

Iterative Total Least Squares (IterTLS)

IterTLS is an iterative algorithm such that in each iteration a total least squares problem is formulated based on the estimate from the previous iteration. It starts with an initial estimate obtained by solving the ideal-case equation (9). In each iteration, let X̂ be an estimate of X from either the initialisation or the previous iteration. We formulate the following total least squares problem

min_{W, ΔY, ΔX} ||[ΔY^T, ΔX^T]||_F^2 subject to W (Y + ΔY) = X̂ + ΔX, (15)

which has the identical form as (11). Note that the constraint in (10) is implicitly imposed, as X̂ obeys the given sparsity pattern. The problem (15) can be optimally solved by using the SVD

[Y^T, X̂^T] = U Σ V^T.

The optimal solution is obtained from the K right singular vectors corresponding to the smallest singular values, which span the null space containing the rows of [W, −I]. To prepare the next iteration, one obtains an updated estimate X̂ by applying a simple projection operator to WY: X̂_{k,Ω_k} = (WY)_{k,Ω_k} and X̂_{k,Ω_k^c} = 0. With this new estimate X̂, one can proceed with the next iteration until convergence.
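A compact sketch of this loop under our reading of the formulation (helper names are ours); in the error-free square case it recovers W = D^{-1} exactly:

```python
import numpy as np

def iter_tls(Y, X_init, pattern, n_iter=10):
    """IterTLS sketch: alternately (i) solve the TLS problem for W from the
    K smallest right singular vectors of the stacked matrix [Y; X]^T, and
    (ii) project X = W Y back onto the known sparsity pattern."""
    M = Y.shape[0]
    K = X_init.shape[0]
    X = X_init.copy()
    for _ in range(n_iter):
        Z = np.vstack([Y, X]).T                # N x (M + K)
        _, _, Vt = np.linalg.svd(Z)
        B = Vt[-K:, :]                         # basis of the K smallest directions
        B1, B2 = B[:, :M], B[:, M:]
        W = -np.linalg.solve(B2, B1)           # normalise the basis to [W, -I]
        X = (W @ Y) * pattern                  # enforce the sparsity pattern
    return np.linalg.inv(W), X                 # dictionary estimate D = W^{-1}
```

The normalisation to the [W, −I] form removes the scaling ambiguity discussed in Section III.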

IV-B Overcomplete Dictionary Update

The difficulty of overcomplete dictionary update comes from the fact that for an overcomplete dictionary D with K > M, in general D^+ D ≠ I, where D^+ is the pseudo-inverse of D. Therefore, the above least squares or total least squares approaches cannot be directly applied.

To address this issue, a straightforward approach is to divide the whole dictionary into a set of sub-dictionaries, each of which is either complete or undercomplete, and then update these sub-dictionaries one-by-one whilst fixing all other sub-dictionaries and the corresponding coefficients. More explicitly, given the estimates D̂ and X̂, consider updating D̂_{:,B}, the submatrix of D̂ consisting of the columns indexed by a block B ⊂ [K] with |B| ≤ M, and X̂_{B,:}, the submatrix of X̂ consisting of the rows indexed by B. Then, consider the residual matrix

Y_res := Y − D̂_{:,B^c} X̂_{B^c,:}, (16)

and apply the method in Section IV-A3 to solve the problem D̂_{:,B} X̂_{B,:} ≈ Y_res. Then repeat this step for all sub-dictionaries. As the dictionary is updated block by block, we refer to our algorithm as BLOck Total LEast SquareS (BLOTLESS).
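The residual-based block decomposition can be sanity-checked numerically; a small sketch with assumed sizes (names are ours):

```python
import numpy as np

rng = np.random.default_rng(6)
M, K, N = 8, 16, 200                     # overcomplete: K > M
D = rng.standard_normal((M, K))
X = rng.standard_normal((K, N)) * (rng.random((K, N)) < 0.2)
Y = D @ X

block = np.arange(8)                     # indices of one complete 8 x 8 sub-dictionary
rest = np.setdiff1d(np.arange(K), block)
# residual once the contribution of all other sub-dictionaries is removed
Y_res = Y - D[:, rest] @ X[rest, :]
# the block update then fits D[:, block] X[block, :] to Y_res (here exactly)
assert np.allclose(D[:, block] @ X[block, :], Y_res, atol=1e-8)
```

Each sub-problem is a non-overcomplete update of the kind handled in Section IV-A.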

V Numerical Tests

Parts of the numerical tests are based on synthetic data. The training samples are generated according to the probability model specified in Section III-3. When the dictionary recovery is not exact, the performance criterion is the difference between the ground-truth dictionary D and the estimated dictionary D̂. In particular, the estimation error is defined as

err(D, D̂) = (1/K) Σ_{k∈[K]} min_{j∈[K]} (1 − |d̂_j^T d_k|), (17)

where d̂_j is the j-th item in the estimated dictionary, d_k is the k-th item in the ground-truth dictionary, and the minimisation matches each ground-truth item to its closest estimated item. The items in both dictionaries are normalised to have unit ℓ2-norm.
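A sketch of a best-matching error metric of this kind, invariant to sign flips and permutations of the atoms (the exact matching rule in the paper's definition may differ):

```python
import numpy as np

def dict_error(D_true, D_est):
    """Average, over ground-truth atoms, of 1 - |cosine similarity| to the
    best-matching estimated atom (sign- and permutation-invariant)."""
    Dt = D_true / np.linalg.norm(D_true, axis=0)
    De = D_est / np.linalg.norm(D_est, axis=0)
    C = np.abs(Dt.T @ De)                 # |cosine| between every atom pair
    return float(np.mean(1.0 - C.max(axis=1)))
```

The metric is zero exactly when every ground-truth atom appears in the estimate up to sign and ordering.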

Numerical tests based on real data are presented in Section V-D2 for image denoising. The performance metric is the Peak Signal-to-Noise-Ratio (PSNR) of the denoised images.

V-A Simulations for Exact Dictionary Recovery

In this section we numerically evaluate the bounds in Proposition 3. All the results presented here are based on 100 random and independent trials. For the theoretical performance prediction, we compute N_1, N_2, and N_3 using ε = 0.01. In the numerical simulations, we vary the number of training samples N and find its critical value, below which exact recovery happens with an empirical probability of less than 99% and above which exact recovery happens with an empirical probability of at least 99%.

Fig. 1: Probability of exact recovery against the number of training samples, for four sparsity ratios (panels (a) to (d)).
Fig. 2: Normalised number of training samples for at least 99% probability of exact recovery versus the sparsity ratio.

We start with the relation between the number of training samples and the sparsity ratio θ for a given dictionary size. In Figure 1, we fix the dictionary size, vary θ, and study the probability of exact recovery against the number of training samples. Results in Figure 1 show that the theoretical prediction is quite close to the critical number of training samples obtained by simulations. One can also observe that the needed number of training samples first decreases and then increases as θ increases, which is also predicted by the theoretical bounds. A larger scale numerical test is presented in Figure 2, where four sub-figures correspond to four different dictionary sizes. Once again, the simulations demonstrate that the bounds in Proposition 3 match the empirical results very well.

Fig. 3: Required number of training samples for exact recovery versus the dictionary size, for a given sparsity ratio.

Let us consider the required number of training samples N for exact recovery as a function of the dictionary size K for a fixed sparsity ratio θ. Simulation results are depicted in Figure 3. The growth rate of the required N with K differs between the small-θ and the large-θ regimes, which is consistent with Proposition 3.

Fig. 4: Asymptotic bounds in Corollary 1 compared with empirical results for finite dictionary sizes.

Finally, we numerically study the accuracy of the asymptotic results in Corollary 1. We plot the normalised number of training samples needed for exact recovery as a function of the sparsity ratio, for several values of the dictionary size. Simulation results in Figure 4 show a trend that is consistent with the asymptotic results in Corollary 1.

V-B Total Least Squares Methods

The three total least squares methods introduced in Section IV, henceforth denoted by BLOTLESS-STLS, BLOTLESS-ParTLS and BLOTLESS-IterTLS respectively, are compared by embedding each of them in the overall dictionary learning process. Random dictionaries are used as the initial starting point of dictionary learning, and OMP is used for sparse coding.

Fig. 5: Performance comparison of different total least squares methods when used in the overall dictionary learning process, for two problem settings (panels (a) and (b)). Results are averages of 50 independent trials.

TABLE I: Runtime (seconds per iteration of the dictionary learning process) comparison for different total least squares methods across four problem settings (columns). Results are averages of 50 independent trials using Matlab 2018b on a MacBook Pro with 16GB RAM and a 2.3 GHz Intel Core i5 processor.

  STLS:    621.6544 | 836.3027 | 1098.5955 | 1265.1895
  ParTLS:    9.2544 |  18.1191 |   25.3678 |   31.0551
  IterTLS:   6.3446 |   8.5157 |   10.9026 |   13.8056

Fig. 5 compares the dictionary learning errors (17) for both complete and over-complete dictionaries. BLOTLESS-IterTLS converges the fastest and has the smallest error floor. A runtime comparison is then given in Table I, where BLOTLESS-IterTLS again clearly outperforms the other two methods. It is therefore used as the default dictionary update method in later simulations.

V-C Robustness to Errors in Sparsity Pattern

Fig. 6: Performance comparison of different dictionary update algorithms under different levels of sparsity pattern mismatch. Results are averages of 100 independent trials.

Simulations are next designed to evaluate the robustness of different dictionary update algorithms to errors in sparsity pattern. Towards that end, a fraction of indices in the true support are randomly chosen to be replaced with the same number of randomly chosen indices not in the true support set. This erroneous sparsity pattern is then fed into different dictionary update algorithms. The numerical results in Fig. 6 demonstrate the robustness of BLOTLESS (with IterTLS).
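The corruption model used in this experiment can be sketched as follows (a hypothetical helper, not the authors' code):

```python
import numpy as np

def corrupt_pattern(pattern, frac, rng):
    """Replace a fraction `frac` of true support indices with randomly
    chosen off-support indices, keeping the total support size fixed."""
    P = pattern.astype(bool).ravel()
    on, off = np.flatnonzero(P), np.flatnonzero(~P)
    m = int(round(frac * on.size))
    P[rng.choice(on, size=m, replace=False)] = False   # drop some true indices
    P[rng.choice(off, size=m, replace=False)] = True   # add spurious indices
    return P.reshape(pattern.shape)
```

The corrupted pattern has the same number of non-zeros as the true one, but a controlled fraction of them are misplaced.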

V-D Dictionary Learning with BLOTLESS Update

This subsection compares dictionary learning performance for different dictionary update methods. The sparse coding algorithm is OMP. IterTLS in Section IV-A3 is used for BLOTLESS. Results for synthetic data are presented in Section V-D1 while Section V-D2 focuses on image denoising using real data.

Synthetic Data

Fig. 7: Comparison of dictionary update methods for the noise-free case, for complete and over-complete dictionaries (panels (a) to (d)). Results are averages of 100 trials.
Fig. 8: Comparison of dictionary update methods for the noisy case (SNR of 15dB), with settings as in Fig. 7 (panels (a) to (d)). Results are averages of 100 trials.

Figs. 7 and 8 compare the performance of dictionary learning using different dictionary update algorithms. Fig. 7 focuses on the noise-free case where Y = DX, and Fig. 8 concerns the noisy case where Y = DX + V, where V is an additive Gaussian noise matrix with i.i.d. entries and the signal-to-noise ratio (SNR) is set to 15dB. Both figures include the cases of complete and over-complete dictionaries. The results presented in Figs. 7 and 8 clearly indicate that dictionary learning based on BLOTLESS converges much faster and needs substantially fewer training samples than the other benchmark dictionary update methods.

Fig. 9: Dictionary learning via BLOTLESS with different block sizes.

In the BLOTLESS update, blocks of the dictionary (sub-dictionaries) are updated sequentially. Fig. 9 compares the performance for different block sizes. Note that when each block contains only one dictionary item, the dictionary update problem is the same as that in K-SVD. Hence the performance of K-SVD is added in Fig. 9. Simulations suggest that the larger the dictionary blocks are, the faster the convergence and the better the performance. The performance of BLOTLESS with block size one is slightly better than that of K-SVD. This is because the TLS step in IterTLS does not enforce the sparsity pattern and hence better accommodates errors in the estimated sparsity pattern.

Real Data

We use the Olivetti Research Laboratory (ORL) face database [38] for dictionary learning and then use the learned dictionary for image denoising.

For dictionary learning, according to the simulation results in Section V-D1, patches sampled from the face images are enough for training a dictionary via BLOTLESS. After learning a dictionary, image denoising [19] is performed using test images from the same dataset. The denoising results are shown in Table II, where four test images are used. In all four tests, the BLOTLESS method outperforms all other algorithms, which is consistent with the simulations in Section V-D1.
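As a rough illustration of the patch-based denoising pipeline of [19] (not the exact implementation used in our experiments), the sketch below sparse-codes every overlapping patch with an error-driven OMP and averages the overlapping reconstructions; the function names, patch size and tolerance are illustrative choices.

```python
import numpy as np

def omp_tol(D, y, tol, max_atoms):
    """OMP that stops when the residual norm drops below tol (error-driven)."""
    residual, support = y.copy(), []
    coef = np.zeros(0)
    while np.linalg.norm(residual) > tol and len(support) < max_atoms:
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

def denoise(img, D, patch=8, tol=1.0, max_atoms=10):
    """Denoise by sparse-coding every overlapping patch and averaging."""
    out = np.zeros_like(img, dtype=float)
    weight = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0] - patch + 1):
        for j in range(img.shape[1] - patch + 1):
            y = img[i:i + patch, j:j + patch].ravel()
            x = omp_tol(D, y, tol, max_atoms)
            out[i:i + patch, j:j + patch] += (D @ x).reshape(patch, patch)
            weight[i:i + patch, j:j + patch] += 1.0
    return out / weight   # average the overlapping patch estimates
```

In practice the tolerance is tied to the noise level, so that each patch is approximated just closely enough to suppress the noise without fitting it.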

TABLE II: Denoising comparison using different learnt dictionaries; the denoised PSNR (dB) is reported for four test images, each corrupted at three noise levels. Method row labels were not recovered from the source; per the text, BLOTLESS attains the highest PSNR in every case.

            Image 1              Image 2              Image 3              Image 4
Noisy       28.13 22.11 18.59    28.13 22.11 18.59    28.13 22.11 18.59    28.13 22.11 18.59
Method 1    33.00 29.24 27.22    32.68 28.84 26.82    31.95 27.43 25.76    32.13 28.23 26.26
Method 2    32.50 28.72 26.79    32.03 28.26 26.35    31.49 27.43 25.31    31.58 27.74 25.81
Method 3    33.43 29.78 27.81    33.58 30.11 28.04    32.18 28.51 26.47    32.65 29.18 27.27
Method 4    33.67 29.90 27.95    33.91 30.33 28.20    32.38 28.67 26.66    32.88 29.38 27.45

VI Conclusion

This paper proposed the BLOTLESS algorithm for dictionary update. It divides the dictionary into sub-dictionaries, each of which is non-overcomplete, and updates each sub-dictionary together with the corresponding sparse coefficients using least squares or total least squares. Necessary conditions for unique recovery are identified, and they hold with high probability when the number of training samples exceeds the bounds derived in Proposition 3. Simulations show that these bounds match the empirical results well, and that BLOTLESS outperforms the benchmark algorithms. One future direction is to derive sufficient conditions for unique recovery and to compare the corresponding bounds with the necessary ones.

Appendix A: Proof of Proposition 3

The proof relies on Hoeffding's inequality [22] for Bernoulli random variables, stated below.

Lemma 1 (Hoeffding’s Inequality).

Let $X_1, \dots, X_n$ be i.i.d. Bernoulli random variables, each taking the value $1$ with probability $p$ and $0$ with probability $1-p$. Then for any $t > 0$,

$$\Pr\left( \left| \frac{1}{n} \sum_{i=1}^{n} X_i - p \right| \ge t \right) \le 2 \exp\left( -2 n t^2 \right).$$
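As a quick sanity check of the bound (illustrative only, not part of the proof), the following sketch compares the empirical deviation probability of a Bernoulli sample mean with the Hoeffding bound $2\exp(-2nt^2)$:

```python
import numpy as np

# Empirically check Hoeffding's bound P(|mean - p| >= t) <= 2 exp(-2 n t^2)
# for Bernoulli(p) samples; n, p, t and the trial count are arbitrary choices.
rng = np.random.default_rng(0)
n, p, t, trials = 200, 0.3, 0.1, 20000
means = rng.binomial(n, p, size=trials) / n          # sample means of n draws
empirical = np.mean(np.abs(means - p) >= t)          # deviation frequency
bound = 2.0 * np.exp(-2.0 * n * t * t)               # Hoeffding upper bound
assert empirical <= bound                            # bound holds empirically
```

The bound is loose here (it covers all bounded distributions, not just Bernoulli), which is consistent with the conservative nature of the sample-size bounds derived below.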

To derive the first bound, we consider the event that necessary condition 1 in Proposition 2 fails. That is,

Based on Hoeffding’s inequality, the probability of this event is upper bounded by

If this probability is smaller than , it follows that

The left hand side of the above inequality is quadratic in , which after some elementary calculations leads to

The derivation of is similar. Consider the probability that the necessary condition 2 in Proposition 2 fails:

where the inequality in the third line follows from the union bound. Applying Hoeffding's inequality and requiring the resulting upper bound to be smaller than the target failure probability, we obtain

It follows that

To derive , we define the following event

  • : For given , such that and .

Then the probability that the necessary condition 3 fails is given by

where the inequality in the second line follows from the union bound. If we set this probability to be smaller than , we obtain


  1. S. O. Aase, J. H. Husoy, J. Skretting and K. Engan (2001) Optimized signal expansions for sparse representation. IEEE Trans. Signal Process. 49 (5), pp. 1087–1096. Cited by: §I.
  2. A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli and R. Tandon (2014) Learning sparsely used overcomplete dictionaries. In Conf. Learn. Theory, pp. 123–137. Cited by: §I.
  3. A. Agarwal, A. Anandkumar and P. Netrapalli (2013) Exact recovery of sparsely used overcomplete dictionaries. Stat 1050, pp. 8–39. Cited by: §I.
  4. M. Aharon, M. Elad and A. M. Bruckstein (2006) On the uniqueness of overcomplete dictionaries, and a practical way to retrieve them. Linear Algeb. and Its Appl. 416 (1), pp. 48–67. Cited by: §I.
  5. M. Aharon, M. Elad and A. Bruckstein (2006) K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 54 (11), pp. 4311–4322. Cited by: §I, §I, §I, §II, §II.
  6. S. Arora, A. Bhaskara, R. Ge and T. Ma (2014) More algorithms for provable dictionary learning. arXiv preprint arXiv:1401.0579. Cited by: §I.
  7. S. Arora, R. Ge, T. Ma and A. Moitra (2015) Simple, efficient, and neural algorithms for sparse coding. Proc. Mach. Learn. Res.. Cited by: §I.
  8. S. Arora, R. Ge and A. Moitra (2014) New algorithms for learning incoherent and overcomplete dictionaries. In Conf. Learn. Theory, pp. 779–806. Cited by: §I.
  9. B. Barak, J. A. Kelner and D. Steurer (2015) Dictionary learning and tensor decomposition via the sum-of-squares method. In Proc. the 47th ann. ACM symp. Theory of comput., pp. 143–151. Cited by: §I.
  10. A. Beck and M. Teboulle (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2 (1), pp. 183–202. Cited by: §I.
  11. M. M. Bronstein, A. M. Bronstein, M. Zibulevsky and Y. Y. Zeevi (2005) Blind deconvolution of images using optimal sparse representations. IEEE Trans. Image Process. 14 (6), pp. 726–736. Cited by: §I.
  12. E. J. Candes and D. L. Donoho (2000) Curvelets: a surprisingly effective nonadaptive representation for objects with edges. Technical report Stanford Univ Ca Dept of Statistics. Cited by: §I.
  13. S. S. Chen, D. L. Donoho and M. A. Saunders (2001) Atomic decomposition by basis pursuit. SIAM Rev. 43 (1), pp. 129–159. Cited by: §I.
  14. Z. Cvetkovic and M. Vetterli (1995) Discrete-time wavelet extrema representation: design and consistent reconstruction. IEEE Trans. Signal Process. 43 (3), pp. 681–693. Cited by: §I.
  15. Z. Cvetkovic (2000) On discrete short-time fourier analysis. IEEE Trans. Signal Process. 48 (9), pp. 2628–2640. Cited by: §I.
  16. Q. Dai, S. Yoo, A. Kappeler and A. K. Katsaggelos (2017) Sparse representation-based multiple frame video super-resolution. IEEE Trans. Image Process. 26 (2), pp. 765–781. Cited by: §I.
  17. W. Dai and O. Milenkovic (2009) Subspace pursuit for compressive sensing signal reconstruction. IEEE Trans. Inf. Theory 55 (5), pp. 2230–2249. Cited by: §I, §II.
  18. W. Dai, T. Xu and W. Wang (2012) Simultaneous codeword optimization (SimCO) for dictionary update and learning. IEEE Trans. Signal Process. 60 (12), pp. 6340–6353. Cited by: §I, §I, §II, §II.
  19. M. Elad and M. Aharon (2006) Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Process. 15 (12), pp. 3736–3745. Cited by: §I, §V-D2.
  20. K. Engan, S. O. Aase and J. H. Husoy (1999) Method of optimal directions for frame design. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Vol. 5, pp. 2443–2446. Cited by: §I, §I, §II.
  21. K. Engan, K. Skretting and J. H. Husøy (2007) Family of iterative ls-based dictionary learning algorithms, ILS-DLA, for sparse signal representation. Digital Signal Process. 17 (1), pp. 32–49. Cited by: §I.
  22. W. Hoeffding (1963) Probability inequalities for sums of bounded random variables. J. of the Amer. Statist. Assoc. 58 (301), pp. 13–30. Cited by: §-A.
  23. R. Fletcher (2013) Practical methods of optimization; 2nd ed.. Wiley, Hoboken, NJ. Cited by: §IV-A1.
  24. Q. Geng and J. Wright (2014) On the local correctness of ℓ¹-minimization for dictionary learning. In IEEE Int. Symp. Inf. Theory (ISIT), pp. 3180–3184. Cited by: §I.
  25. R. Gribonval, G. Chardon and L. Daudet (2012) Blind calibration for compressed sensing by convex optimization. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pp. 2713–2716. Cited by: 1st item.
  26. C. J. Hillar and F. T. Sommer (2015) When can dictionary learning uniquely recover sparse data from subsamples?. IEEE Trans. Inf. Theory 61 (11), pp. 6290–6297. Cited by: §I.
  27. J. Sun, Q. Qu, and J. Wright (2017) Complete dictionary recovery over the sphere ii: recovery by riemannian trust-region method. IEEE Trans. Inf. Theory 63 (2), pp. 885–914. Cited by: §I.
  28. K. Schnass (2014) On the identifiability of overcomplete dictionaries via the minimisation principle underlying K-SVD. Appl. and Comput. Harmon. Anal. 37 (3), pp. 464–491. Cited by: §I.
  29. K. Kreutz-Delgado, J. F. Murray, B. D. Rao, K. Engan, T. Lee and T. J. Sejnowski (2003) Dictionary learning algorithms for sparse representation. Neur. Comput. 15 (2), pp. 349–396. Cited by: §I, §I.
  30. P. Lemmerling, N. Mastronardi and S. Van Huffel (2003) Efficient implementation of a structured total least squares based speech compression method. Linear Algeb. and its Appl. 366, pp. 295–315. Cited by: §IV-A1.
  31. S. Ling and T. Strohmer (2018) Self-calibration and bilinear inverse problems via linear least squares. SIAM J. Imag. Sci. 11 (1), pp. 252–292. Cited by: 1st item, §I, §III-1, §III-2.
  32. L. Liu, L. Chen, C. P. Chen and Y. Y. Tang (2017) Weighted j