Convergence radius and sample complexity of ITKM algorithms for dictionary learning

Karin Schnass

Karin Schnass is with the Department of Mathematics, University of Innsbruck, Technikerstraße 13, 6020 Innsbruck, Austria, karin.schnass@uibk.ac.at.
Abstract

In this work we show that iterative thresholding and K means (ITKM) algorithms can recover a generating dictionary with K atoms from noisy sparse signals up to an error as long as the initialisation is within a convergence radius, that is up to a factor inversely proportional to the dynamic range of the signals, and the sample size is proportional to . The results are valid for arbitrary target errors if the sparsity level is of the order of the square root of the signal dimension and for target errors down to if scales as .

Keywords: dictionary learning, sparse coding, sparse component analysis, sample complexity, convergence radius, alternating optimisation, thresholding, K-means

1 Introduction

The goal of dictionary learning is to find a dictionary that will sparsely represent a class of signals. That is, given a set of training signals , which are stored as columns in a matrix , one wants to find a collection of normalised vectors , called atoms, which are stored as columns in the dictionary matrix , and coefficients , which are stored as columns in the coefficient matrix such that

(1)

Research into dictionary learning comes in two flavours corresponding to the two origins of the problem, the slightly older one in the independent component analysis (ICA) and blind source separation (BSS) community, where dictionary learning is also known as sparse component analysis, and the slightly younger one in the signal processing community, where it is also known as sparse coding. The main motivation for dictionary learning in the ICA/BSS community comes from the assumption that the signals of interest are generated as sparse mixtures - random sparse mixing coefficients - of several sources or independent components - the dictionary - which can be used to describe or explain a (natural) phenomenon, [15, 30, 27, 26]. For instance in the 1996 paper by Olshausen and Field, [15], which is widely regarded as the mother contribution to dictionary learning, the dictionary is learned on patches of natural images, and the resulting atoms bear a striking similarity to simple cell receptive fields in the visual cortex. A natural question in this context is, when the generating dictionary can be identified from , that is, the sources from the mixtures. Therefore the first theoretical insights into dictionary learning came from this community, [18]. Also the first dictionary recovery algorithms with global success guarantees, which are based on finding overlapping clusters in a graph derived from the signal correlation matrix , take the ICA/BSS point of view, [6, 2].
The main motivation for dictionary learning in the signal processing community is that sparse signals are immensely practical, as they can be easily stored, denoised, or reconstructed from incomplete information, [13, 33, 31]. Thus the interest is less in the dictionary itself than in the fact that it will provide sparse representations . Following the rule 'the sparser, the better' the obvious next step is to look for the dictionary that provides the sparsest representations. So given a budget of atoms and non-zero coefficients per signal, one way to concretise the abstract formulation of the dictionary learning problem in (1) is to formulate it as an optimisation problem, such as

(2)

where counts the nonzero elements of a vector or matrix and is defined as . While is for instance the starting point for the MOD or K-SVD algorithms, [14, 3], other definitions of optimally sparse lead to other optimisation problems and algorithms, [49, 37, 48, 32, 43, 38]. The main challenge of optimisation programmes for dictionary learning is finding the global optimum, which is hard because the constraint manifold is not convex and the objective function is invariant under sign changes and permutations of the dictionary atoms with corresponding sign changes and permutations of the coefficient rows. In other words for every local optimum there are equivalent local optima.
So while in the signal processing setting there is a priori no concept of a generating dictionary, it is often used as auxiliary assumption to get theoretical insights into the optimisation problem. Indeed without the assumption that the signals are sparse in some dictionary the optimisation formulation makes little or no sense. For instance if the signals are uniformly distributed on the sphere in , in asymptotics becomes a covering problem and the set of optima is invariant under orthonormal transforms.
Based on a generating model on the other hand it is possible to gain several theoretical insights. For instance, how many training signals are necessary such that the sparse representation properties of a dictionary on the training samples (e.g. the optimiser) will extrapolate to the whole class, [34, 47, 35, 20]. What are the properties of a generating dictionary and the maximum sparsity level of the coefficients and signal noise such that this dictionary is a local optimiser or near a local optimiser given enough training signals, [21, 17, 39, 40, 19].
An open problem for overcomplete dictionaries with some first results for bases, [44, 45], is whether there are any spurious optimisers which are not equivalent to the generating dictionary, or whether any starting point of a descent algorithm will lead to a global optimum. A related question (in case there are spurious optima) is whether the generating dictionary is the global optimiser. If yes, it would justify using one of the graph clustering algorithms for recovering the optimum, [6, 2, 4, 7]. This is important since all dictionary learning algorithms with global success guarantees are computationally very costly, while optimisation approaches are locally very efficient and robust to noise. Knowledge of the convergence properties of a descent algorithm, such as convergence radius (basin of attraction), rate or limiting precision based on the number of training signals, therefore helps to decide when it should take over from a global algorithm for fast local refinement, [1].
In this paper we will investigate the convergence properties of two iterative thresholding and K-means algorithms. The first algorithm ITKsM, which uses signed signal means, originates from the response maximisation principle introduced in [40]. There it is shown that a generating -coherent dictionary constitutes a local maximum of the response principle as long as the sparsity level of the signals scales as . It further contains the first results showing that the maximiser remains close to the generator for sparsity levels up to . For a target recovery error the sample complexity is shown to scale as and the basin of attraction is conjectured to be of size .
Here we will not only improve on the conjecture by showing that in its online version the algorithm has a convergence radius of size but also show that for the algorithm rather than the principle the sample complexity reduces to (omitting factors). Again recovery to arbitrary precision holds for sparsity levels and stable recovery up to an error for sparsity levels . We also show that the computational complexity assuming an initialisation within the convergence radius scales as or omitting log factors .
Motivated by the desire to reduce the sample complexity for the case of exactly sparse, noiseless signals, we then introduce a second iterative thresholding and K-means algorithm, ITKrM, which uses residual instead of signal means. It has roughly the same properties as ITKsM apart from the convergence radius, which reduces to , and the computational complexity, which scales as and thus can go up to for . However, if and the signals follow an exactly sparse, noiseless model, we can show that the sample complexity reduces to (omitting factors). Our results are in the same spirit as the results for the alternating minimisation algorithm in [1] but have the advantage that they are valid for more general coefficient distributions and a lower level of sparsity (S larger) resp. higher level of coherence, that the convergence radius is larger and that the algorithms exhibit a lower computational complexity. They are also close to some very recent results about several alternating minimisation algorithms, which, like the ITKMs, are based on thresholding, [5]. Compared to our results they are essentially the same in terms of convergence radius and sample complexity but are only valid for sparsity levels and up to a limiting precision (even in the exact sparse noiseless case). More interestingly [5] contains a strategy for finding initialisations within a radius to the generating dictionary, which is proven to succeed for sparsity levels . With slight modifications and using Tropp’s results on average isometry constants, [46], this initialisation strategy could probably be proven to work also for sparsity levels up to . However, its computational complexity seems to explode as grows.
The rest of the paper is organised as follows. After summarising notation and conventions in the following section, in Section 3 we re-introduce the ITKsM algorithm, discuss our sparse signal model and analyse the convergence properties of ITKsM. Based on the shortcomings of ITKsM we motivate the ITKrM algorithm in Section 4, and again analyse its convergence properties. In Section 5 we provide numerical simulations indicating that the convergence radius of both ITKM algorithms is generically much larger and that sometimes ITKrM even converges globally from random initialisations. Finally in Section 6 we compare our results to existing work and point out future directions of research.

2 Notations and Conventions

Before we join the melee, we collect some definitions and say a few words about notation; usually subscripted letters will denote vectors with the exception of , where they are numbers, e.g. vs. . However, it should always be clear from the context what we are dealing with.
For a matrix , we denote its (conjugate) transpose by and its Moore-Penrose pseudo inverse by . We denote its operator norm by and its Frobenius norm by ; recall that we have .
We consider a dictionary to be a collection of unit norm vectors , . By abuse of notation we will also refer to the matrix that collects the atoms as its columns as the dictionary, i.e. . The maximal absolute inner product between two different atoms is called the coherence of a dictionary, .
By we denote the restriction of the dictionary to the atoms indexed by , i.e. , , and by the orthogonal projection onto the span of the atoms indexed by , i.e. . Note that in case the atoms indexed by are linearly independent we have . We also define the orthogonal projection onto the orthogonal complement of the span of the atoms indexed by , that is , where is the identity operator (matrix) in .
(Ab)using the language of compressed sensing we define as the smallest number such that all eigenvalues of are included in and the isometry constant of the dictionary as . When clear from the context we will usually omit the reference to the dictionary. For more details on isometry constants, see for instance [11].
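For readers who want to check these quantities numerically, the following is a minimal NumPy sketch under the definitions just given: the coherence as the maximal absolute inner product between two distinct atoms, and the isometry constant as the smallest number such that all eigenvalues of every S-atom sub-Gram matrix lie within that distance of one. The function names are illustrative and the exhaustive search over supports is only feasible for small dictionaries.

```python
import numpy as np
from itertools import combinations

def coherence(Phi):
    """Maximal absolute inner product between two distinct unit-norm atoms."""
    G = Phi.T @ Phi                  # Gram matrix of the dictionary
    np.fill_diagonal(G, 0.0)
    return np.max(np.abs(G))

def isometry_constant(Phi, S):
    """Smallest delta such that the eigenvalues of every S-atom sub-Gram matrix
    lie in [1 - delta, 1 + delta] (exhaustive search, small K and S only)."""
    K = Phi.shape[1]
    delta = 0.0
    for I in combinations(range(K), S):
        sub = Phi[:, list(I)]
        eigvals = np.linalg.eigvalsh(sub.T @ sub)
        delta = max(delta, np.max(np.abs(eigvals - 1.0)))
    return delta
```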
To keep the sub(sub)scripts under control we denote the indicator function of a set by , that is is one if and zero else. The set of the first integers we abbreviate by .
We define the distance of a dictionary to a dictionary as

(3)

Note that this distance is not a metric, since it is not symmetric. For example if is the canonical basis and is defined by for , , and then we have while . A symmetric distance between two dictionaries could be defined as the maximal distance between two corresponding atoms, i.e.

(4)

where is the set of permutations of . Since locally the distances are equivalent we will state our results in terms of the easier to calculate asymmetric distance and assume that is already signed and rearranged in a way that .
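Since we will state our results in terms of the asymmetric distance, here is a small sketch of how it can be evaluated. As equation (3) is not reproduced above, the sketch assumes the common form in which every atom of the first dictionary is matched to its closest atom of the second up to sign, using the fact that for unit-norm atoms the squared distance up to sign equals two minus twice the absolute inner product; the function name is illustrative.

```python
import numpy as np

def asym_dist(Psi, Phi):
    """Worst-case distance of an atom of Psi to its closest atom of Phi up to
    sign; for unit-norm atoms ||psi -/+ phi||_2^2 = 2 - 2|<psi, phi>|."""
    overlap = np.clip(np.max(np.abs(Psi.T @ Phi), axis=1), 0.0, 1.0)
    return np.max(np.sqrt(2.0 - 2.0 * overlap))
```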
We will make heavy use of the following decomposition of a dictionary into a given dictionary and a perturbation dictionary . If we set , where by definition . We can then find unit vectors with such that

(5)

The dictionary collects the perturbation vectors on its columns, that is and we define the diagonal matrices implicitly via

(6)

or in MATLAB notation with and analogously for . Based on this decomposition we further introduce the shorthand and .
We call a collection of vectors a frame if there exist two positive constants such that for all we have

(7)

If can be chosen equal to , i.e. , the frame is called tight and if all elements of a tight frame have unit norm we have . The operator is called frame operator and by (7) its spectrum is bounded by . For more details on frames, see e.g. [12].
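As a quick numerical illustration, the frame bounds are exactly the extreme eigenvalues of the frame operator, so they can be computed directly (function name illustrative):

```python
import numpy as np

def frame_bounds(Phi):
    """Lower and upper frame bounds as the extreme eigenvalues of the
    frame operator Phi Phi^T."""
    eigvals = np.linalg.eigvalsh(Phi @ Phi.T)
    return eigvals[0], eigvals[-1]

# Example: two orthonormal bases in R^d stacked next to each other form a
# unit-norm tight frame with K = 2d atoms, so both bounds equal K/d = 2.
d = 8
Phi = np.hstack([np.eye(d), np.linalg.qr(np.random.randn(d, d))[0]])
print(frame_bounds(Phi))   # approximately (2.0, 2.0)
```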
Finally we introduce the Landau symbols to characterise the growth of a function; we write f = O(g) if asymptotically f grows at most as fast as g up to a constant factor.

3 Dictionary Learning via ITKsM

Iterative thresholding and K signal means (ITKsM) for dictionary learning was introduced as an algorithm to maximise the -response criterion

(8)

which for reduces to the K-means criterion, [40]. It belongs to the class of alternating optimisation algorithms for dictionary learning, which alternate between updating the sparse coefficients based on the current version of the dictionary and updating the dictionary based on the current version of the coefficients, [14, 3, 1]. As its name suggests, the update of the sparse coefficients is based on thresholding while the update of the dictionary is based on K signal means.

Algorithm 3.1 (ITKsM one iteration).

Given an input dictionary and training signals do:

  • For all find .

  • For all calculate

    (9)
  • Output .
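Below is a minimal NumPy sketch of one such iteration, written from the verbal description of the algorithm: thresholding keeps for each signal the S atoms with the largest absolute inner products, each selected atom accumulates the signal multiplied by the sign of the corresponding inner product (the signed signal means), and the accumulated atoms are normalised at the end. Since formula (9) is not reproduced in the extracted text, the exact update written here should be read as an assumption based on this description, and the function name is illustrative.

```python
import numpy as np

def itksm_iteration(Psi, Y, S):
    """One ITKsM iteration (sketch). Psi: d x K input dictionary with unit-norm
    columns, Y: d x N matrix of training signals, S: sparsity level."""
    d, K = Psi.shape
    N = Y.shape[1]
    inner = Psi.T @ Y                              # all inner products <psi_k, y_n>
    Psi_new = np.zeros((d, K))
    for n in range(N):
        # thresholding: indices of the S largest absolute inner products
        I = np.argsort(-np.abs(inner[:, n]))[:S]
        # signed signal mean update of the selected atoms
        # (the 1/N factor is omitted since it cancels in the normalisation)
        Psi_new[:, I] += np.sign(inner[I, n])[None, :] * Y[:, n:n+1]
    norms = np.linalg.norm(Psi_new, axis=0)
    norms[norms == 0] = 1.0                        # guard against atoms never selected
    return Psi_new / norms
```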

The algorithm can be stopped after a fixed number of iterations or once a stopping criterion, such as improvement for some threshold , is reached. Its advantages over most other dictionary learning algorithms are threefold. First it has very low computational complexity. In each step the most costly operation is the calculation of the matrix vector products , that is the matrix product , of order . In comparison the globally successful graph clustering algorithms need to calculate the signal correlation matrix , cost .
Second due to its structure only one signal has to be processed at a time. Instead of calculating for all and calculating the sum, one simply calculates for the signal at hand, updates all atoms for which as and turns to the next signal. Once signals have been processed one does the normalisation step and outputs . Further in this online version only values corresponding to the input dictionary, the current version of the updated dictionary and the signal at hand, need to be stored rather than the signal matrix. Parallelisation can be achieved in a similar way. Again for comparison, the graph clustering algorithms, K-SVD, [3], and the alternating minimisation algorithm in [1] need to store the whole signal resp. residual matrix as well as the dictionary.
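To make the memory footprint explicit, the same iteration can be written in streaming form, processing one signal at a time and keeping only the input dictionary, the running atom sums and the current signal in memory; again an illustrative sketch rather than a reference implementation.

```python
import numpy as np

def itksm_online(Psi, signal_stream, S, N):
    """Online ITKsM sketch: process signals one by one; only Psi, the running
    sums Psi_acc and the current signal are held in memory."""
    d, K = Psi.shape
    Psi_acc = np.zeros((d, K))
    for count, y in enumerate(signal_stream, start=1):
        ip = Psi.T @ y                                 # K inner products
        I = np.argsort(-np.abs(ip))[:S]                # thresholding
        Psi_acc[:, I] += np.sign(ip[I]) * y[:, None]   # signed update of selected atoms
        if count == N:                                 # normalise after N signals
            break
    norms = np.linalg.norm(Psi_acc, axis=0)
    norms[norms == 0] = 1.0
    return Psi_acc / norms
```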
The third advantage is that with high probability the algorithm converges locally to a generating dictionary assuming that we have enough training signals and that these follow a sparse random model in . In order to prove the corresponding result we next introduce our sparse signal model.

3.1 Signal Model

We employ the same signal model, which has already been used for the analyses of the S-response and K-SVD principles, [39, 40]. Given a dictionary , we assume that the signals are generated as,

(10)

where is drawn from a sign and permutation invariant probability distribution on the unit sphere and is a centred random subgaussian vector with parameter , that is and for all vectors the marginals are subgaussian with parameter , meaning they satisfy for all . We recall that a probability measure on the unit sphere is sign and permutation invariant, if for all measurable sets , for all sign sequences and all permutations we have

(11)
(12)

We can get a simple example of such a measure by taking a positive, non increasing sequence , that is , choosing a sign sequence and a permutation uniformly at random and setting with . Conversely we can factorise any sign and permutation invariant measure into a random draw of signs and permutations and a measure on the space of non-increasing sequences.
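The following sketch draws coefficients from exactly this construction in its 'flat' instance (S equally large entries, random signs, random positions) and forms a noisy training signal as in (10); the dimensions, the noise level and the function name are illustrative choices, not values from the paper.

```python
import numpy as np

def sample_signal(Phi, S, noise_level, rng):
    """Draw one training signal y = Phi x + r (sketch): x has S equally large
    coefficients ('flat' distribution) with random signs at random positions,
    so its distribution is sign and permutation invariant; r is Gaussian noise."""
    d, K = Phi.shape
    c = np.zeros(K)
    c[:S] = 1.0 / np.sqrt(S)                   # non-increasing sequence on the unit sphere
    signs = rng.choice([-1.0, 1.0], size=K)    # random sign sequence
    perm = rng.permutation(K)                  # random permutation
    x = signs * c[perm]                        # sign and permutation invariant coefficients
    r = noise_level * rng.standard_normal(d)   # centred (sub)gaussian noise vector
    return Phi @ x + r

# Example usage with illustrative dimensions:
rng = np.random.default_rng(0)
d, K, S = 64, 128, 4
Phi = rng.standard_normal((d, K))
Phi /= np.linalg.norm(Phi, axis=0)             # unit-norm atoms
Y = np.column_stack([sample_signal(Phi, S, 0.05, rng) for _ in range(1000)])
```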
By abuse of notation let now denote the mapping that assigns to each the non increasing rearrangement of the absolute values of its components, i.e. with for a permutation such that . Then the mapping together with the probability measure on induces a probability measure on via the preimage , that is for any measurable set .
Using this new measure we can rewrite our signal model as

(13)

where we define for a positive, non-increasing sequence distributed according to , a sign sequence and a permutation distributed uniformly at random and again a centred random subgaussian vector with parameter . Note that we have , with equality for instance in the case of Gaussian noise. To incorporate sparsity into our signal model we make the following definitions.

Definition 3.1.

A sign and permutation invariant coefficient distribution on the unit sphere is called -sparse with absolute gap and relative gap , if

(14)

or equivalently

(15)

The coefficient distribution is called strongly -sparse if .

For exactly sparse signals is simply the smallest non-zero coefficient and is the inverse dynamic range of the non-zero coefficients. We have the bounds and . Since equality holds for the ’flat’ distribution generated from for and zero else, we will usually think of being of the order and being of the order . We can also see that coefficient distributions can only be strongly -sparse as long as is smaller than , that is .
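For exactly sparse coefficients the two gaps are thus straightforward to compute; a short illustrative sketch:

```python
import numpy as np

def sparse_gaps(c, S):
    """Absolute gap (smallest non-zero coefficient) and relative gap
    (inverse dynamic range) of an exactly S-sparse coefficient vector c."""
    mags = np.sort(np.abs(c))[::-1]     # non-increasing rearrangement
    return mags[S - 1], mags[S - 1] / mags[0]

# For the flat distribution with S non-zero entries of size 1/sqrt(S) this
# returns (1/sqrt(S), 1).
```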
For the statement of our results we will use three other signal statistics,

(16)

The constants and will help characterise the expected size of . We have and

(17)

compare [40]. From the above inequality we can see that captures the expected signal to noise ratio, that is for large we have

(18)

Similarly the constant can be interpreted as the expected energy of the signal approximation using the largest generating coefficients and the generating dictionary, or in other words is a bound for the expected energy of the approximation error.
For noiseless signals generated from the flat distribution described above we have , and , so we will usually think of these constants having the orders , and .
From the discussion we see that, while being relatively simple, our signal model allows us to capture both approximation error and noise. Our results have quite straightforward extensions to more complicated (realistic) signal models, which for instance include outliers (normalised but not sign or permutation invariant coefficients) or a small portion of coefficients without gap. With somewhat more effort it is also possible to relax the assumption of sign and permutation invariance in our coefficient model, potentially at the cost of decreasing the admissible sparsity level, the convergence radius and the recovery accuracy and increasing the sample complexity. Indeed we will see that the main reason for assuming sign invariance is to ensure that when thresholding with the generating dictionary always succeeds in recovering the generating support with a large margin and therefore also succeeds with a perturbed dictionary. To a lesser degree, especially in the case of ITKrM, the sign invariance also supports the permutation invariance in ensuring a richness of signals such that the averaging procedures contract towards the generating atoms. In particular the permutation invariance prevents the situation that two atoms are always used together and could therefore be replaced by two of their linear combinations.
However, we will sacrifice generality for comprehensibility and therefore just give pointers in the respective proofs.

3.2 Convergence analysis of ITKsM

We first look at the more general case of noisy, not exactly S-sparse signals and specialise to noiseless, strongly S-sparse signals later.

Theorem 3.2.

Let be a unit norm frame with frame constants and coherence and assume that the training signals are generated according to the signal model in (13) with coefficients that are -sparse with absolute gap and relative gap .
Fix a target error , where

(19)

Given an input dictionary such that

(20)

then after iterations of ITKsM, each on a fresh batch of training signals, the output dictionary satisfies

(21)

except with probability

Before providing the proof let us discuss the result above. We first see that ITKsM will succeed if the input dictionary is within a radius to the generating dictionary . In case of exactly sparse signals this means that the convergence radius is up to a log factor inversely proportional to the dynamic range of the coefficients. This should not come as a big surprise, considering that the average success of thresholding for sparse recovery with a ground truth dictionary depends on the dynamic range, [42]. It also means that in the best case the convergence radius is actually of size , since for the flat distribution .
Next note that in the theorem above we have restricted the target error to be larger than . However, at the cost of unattractively large constants in the probability bound, we can actually reach any target error larger than .
To highlight the relation between the sparsity level and the minimally achievable error, we specialise the result to coefficients drawn from the flat distribution, meaning . We further assume white Gaussian noise with variance , corresponding to an expected signal to noise ratio of 1, and an incoherent dictionary with . If for some then the minimally achievable error can be as small as .
Lastly, we want to get a feeling for the total number of training signals we need to have a good success probability. For exactly S-sparse signals with dynamic coefficient range 1 we have . Omitting loglog factors each iteration is therefore likely to be successful when using a batch of training signals, meaning that ITKsM is successful with high probability as soon as the total number of training signals used in the algorithm scales as . Note that in the case of noise the factor seems unavoidable due to information theoretic arguments, [25].
To summarise the discussion we provide an O-notation version of the theorem, which is less plug and play but free of messy constants and as such better suited to convey the quality of the result. Compare also Subsection 3.1 for the O notation conventions.

Theorem - O (3.2).

Assume that in each iteration the number of training signals scales as . If then with high probability for any starting dictionary within distance to the generating dictionary after iterations of ITKsM, each on a fresh batch of training signals, the distance of the output dictionary to the generating dictionary will be smaller than

(22)
Proof.

The proof consists of two steps. First we show that with high probability one iteration of ITKsM reduces the error by at least a factor . Then we iteratively apply the results for one iteration.
Step 1: For the first step we use the following ideas, compare also [40]: For most sign sequences and therefore most signals thresholding with a perturbation of the original dictionary will still recover the generating support , that is . Assuming that the generating support is recovered, for each the expected difference of the sum in (9) between using the original and the perturbation is small, that is smaller than , and due to concentration of measure also the difference on a finite number of samples will be small. Finally for each the sum in (9) will again concentrate around its expectation, a scaled version of the atom .
Formally we write,

(23)
(24)

Since , see the proof of Lemma B.5 in the appendix, using the triangle inequality and the bound we get,

(25)

Next note that for the draw of the event that for a given index the signal coefficient using thresholding with is different from the oracle signal is contained in the event that thresholding does not recover the entire generating support or that on the generating support the empirical sign pattern using is different from the generating pattern, for a ,

(26)

From [40], e.g. proof of Proposition 7, we know that the right hand side in (26) is in turn contained in the event , where

(27)
(28)
(29)

In particular if we choose , and we get that , which contains the event that thresholding using the generating dictionary fails, is independent of . To estimate the number of signals for which the thresholding summand is different from the oracle summand, it suffices to count how often or ,

(30)

Substituting these bounds into (25) we get,

(31)

If we want the error between and to be of the order , we need to ensure that the right hand side of (31) is less than .
From Lemma B.3 in the appendix we know that

(32)

Next Lemma B.4 tells us that

(33)

whenever

(34)

Finally by Lemma B.5 we have

(35)

whenever . Thus with high probability we have,

(36)

To be more precise if we choose a target error and set , , and , then except with probability

(37)

we have

(38)

By Lemma B.10 this further implies that

(39)

Note that in case of outliers we first have to split the sum in (9) into the outliers, whose number concentrates around the probability of being an outlier, and the inliers for which we can use the same procedure as above, see [19] for more details. Similarly the small portion of coefficients without (sufficiently) large gap can be included in the small number of signals for which thresholding fails.
Step 2: From Step 1 we know that in each iteration the error will either be decreased by at least a factor or, if it is already below , will stay below . So after iterations, each using a new batch of signals, , except with probability

(40)

Setting and taking into account that the failure probability of each iteration is bounded by leads to the final estimate.
∎

For most desired precisions Theorem 3.2, which is valid for a quite large hyper-cube of input dictionaries and a wide range of sparsity levels, will actually be sufficient. However, for completeness we specialise the theorem above to the case of strongly S-sparse, noiseless signals and show that in this case ITKsM can achieve arbitrarily small errors, provided enough samples.

Corollary 3.3.

Let be a unit norm frame with frame constants and coherence and assume that the training signals are generated according to the signal model in (13) with and coefficients that are strongly -sparse with relative gap . Fix a target error . If for the input dictionary we have

(41)

then after iterations of ITKsM, each on a fresh batch of training signals, the output dictionary satisfies

(42)

except with probability

The proof is analogous to that of Theorem 3.2 and can be found in Appendix A.1.

Let us again discuss the result. The main difference to Theorem 3.2 is that the condition can only hold for much lower sparsity levels, that is and thus for incoherent dictionaries up to the square root of the ambient dimension . It is also no surprise that once the input dictionary is up to a log factor within this radius, ITKsM can achieve arbitrarily small errors. Indeed once thresholding is always guaranteed to recover the sparse support of a signal given the ground truth dictionary or a slight perturbation of it, [42].
To again turn the corollary into something less technical and more interesting we combine it with the corresponding theorem. If the coefficients are strongly -sparse the minimally achievable error using Theorem 3.2 will be smaller than the error we need for Corollary 3.3 to take over and so we get the following O notation result.

Corollary - O (3.3).

Assume that in each iteration the number of noiseless, exactly S-sparse training signals scales as . If then with high probability for any starting dictionary within distance