On the Identifiability of Overcomplete Dictionaries via the Minimisation Principle Underlying K-SVD

Karin Schnass
(Karin Schnass is with the Computer Vision Laboratory, University of Sassari, Porto Conte Ricerche, 07041 Alghero, Italy, email: kschnass@uniss.it)
Abstract

This article gives theoretical insights into the performance of K-SVD, a dictionary learning algorithm that has gained significant popularity in practical applications. The particular question studied here is when a dictionary can be recovered as a local minimum of the minimisation criterion underlying K-SVD from a set of training signals. A theoretical analysis of the problem leads to two types of identifiability results, assuming the training signals are generated from a tight frame with coefficients drawn from a random symmetric distribution. First, asymptotic results show that in expectation the generating dictionary can be recovered exactly as a local minimum of the K-SVD criterion if the coefficient distribution exhibits sufficient decay. Second, based on the asymptotic results, it is demonstrated that given a sufficiently large finite number of training samples, except with small probability there is a local minimum of the K-SVD criterion within a small distance of the generating dictionary.

Keywords

dictionary learning, sparse coding, sparse component analysis, K-SVD, finite sample size, sampling complexity, dictionary identification, minimisation criterion, sparse representation

1 Introduction

As the universe expands so does the information we are collecting about and in it. New and diverse sources such as the internet, astronomic observations, medical diagnostics, etc., confront us with a flood of data in ever increasing dimensions, and while we have a lot of technology at our disposal to acquire these data, we are already facing difficulties in storing and, even more importantly, interpreting them. Thus in the last decades high-dimensional data processing has become a very challenging and interdisciplinary field, requiring the collaboration of researchers capturing the data on the one hand and researchers from computer science, information theory, electrical engineering and applied mathematics, who develop the tools to deal with the data, on the other hand. One of the most promising approaches to dealing with high-dimensional data so far has proven to be the concept of sparsity.
A signal is called sparse if it has a representation or good approximation in a dictionary, i.e. a representation system like an orthonormal basis or frame, [10], such that the number of dictionary elements, also called atoms, with non-zero coefficients is small compared to the dimension of the space. Modelling the signals as vectors and the dictionary accordingly as a matrix collecting normalised atom-vectors as its columns, i.e. , we have

for a set of size , i.e. , which is small compared to the ambient dimension, i.e. .
The above characterisation already shows why sparsity provides such an elegant way of dealing with high-dimensional data. No matter the size of the original signal, given the right dictionary, its size effectively reduces to a small number of non-zero coefficients. For instance the sparsity of natural images in wavelet bases is the fundamental principle underlying the compression standard JPEG 2000.
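As a minimal illustration of this dimensionality reduction, the following numpy sketch builds a signal that is exactly S-sparse in a random overcomplete dictionary; the dimensions, the dictionary and the sparsity level are illustrative choices and not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, S = 64, 128, 4                       # ambient dimension, number of atoms, sparsity

# random overcomplete dictionary with unit norm columns (illustrative choice)
Phi = rng.standard_normal((d, K))
Phi /= np.linalg.norm(Phi, axis=0)

# an S-sparse signal: S active atoms with random coefficients
I = rng.choice(K, size=S, replace=False)   # support set, |I| = S << d
x_I = rng.standard_normal(S)
y = Phi[:, I] @ x_I

# storing the support and the S coefficients needs 2*S numbers instead of d
print(f"signal length {d}, stored values {2 * S}")
```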
Classical sparsity research studies two types of problems. The first line of research investigates how to perform the dimensionality reduction algorithmically, i.e. how to find the sparse approximation of a signal given the sparsity inducing dictionary. By now there exists a substantial amount of theory, including a vast choice of algorithms, e.g. [13, 9, 29, 6, 12], together with analyses of their worst case or average case performance, [38, 39, 35, 20]. The second line of research investigates how sparsity can be exploited for efficient data processing. For instance, it has been shown that sparse signals are very robust to noise or corruption and can therefore easily be denoised, [15], or restored from incomplete information. This second effect is exploited in the very active research field of compressed sensing, see [14, 8, 31].
However, while sparsity based methods have proven very efficient for high-dimensional data processing, they suffer from one common drawback. They all rely on the existence of a dictionary providing sparse representations for the data at hand.
The traditional approach to finding efficient dictionaries is through the careful analysis of the given data class, which for instance has led to the development of wavelets, [11], and curvelets, [7], for natural images. However, when faced with a (possibly exotic) new signal class, this analytic approach has the disadvantage of requiring too much time and effort. Therefore, more recently, researchers have started to investigate the possibility of learning the appropriate dictionary directly from the new data class, i.e. given signals , stored as columns in a matrix , find a decomposition

into a dictionary matrix with unit norm columns and a coefficient matrix with sparse columns. Looking at the matrix decomposition we can immediately see that, on top of being the key to sparse data processing schemes, dictionary learning is actually a powerful data-analysis tool. Indeed within the blind source separation community dictionary learning is known as sparse component analysis (the dictionary atoms are the sparse components) and this data-analysis point of view has been a parallel driving force for the development of dictionary learning.
So far the research focus in dictionary learning has been on algorithmic development rather than theoretical analysis. This means that by now there are several dictionary learning algorithms which are efficient in practice and therefore popular in applications, see [16, 23, 3, 26, 42, 24, 36] or [32] for a more complete survey, but only comparatively little theory. Some theoretical insights come from the blind source separation community, [43, 18], and more recently from a set of generalisation bounds for learned dictionaries, [27, 40, 28, 19], which predict the quality of a learned dictionary for future data, but unfortunately do not directly imply uniqueness of the 'true' dictionary nor guarantee recoverability by an efficient algorithm. However, especially to justify the use of dictionary learning as a data analysis tool, we need theoretical identification results quantifying the conditions on the dictionary, the coefficient model generating the sparse signals and the number of training signals under which a scheme will be successful.
While it is true that for most schemes we do not yet understand their behaviour, there exists a handful of exceptions to this rule, [4, 21, 17, 22, 37] (for the sake of completeness we also mention, without discussion, some very recent results developed while this work has been under review, [5, 2, 1]). For these schemes there are known conditions under which a dictionary can be recovered from a given signal class, but unfortunately they all have certain drawbacks limiting their practical applicability. In [4] the authors themselves state that the algorithm is only of theoretical interest because of its computational complexity, and also for the -minimisation principle, suggested in [43, 30] and studied in [21, 17, 22], finding a local minimum is computationally challenging enough to prohibit the learning of very high-dimensional dictionaries. Finally, the ER-SpUD algorithm, [37], has the disadvantage that it can only learn a basis, but not an overcomplete dictionary.
In this paper we will start bridging the gap between practically efficient and provably efficient dictionary learning schemes by providing identification results for the minimisation principle underlying K-SVD (K-Singular Value Decomposition), one of the most widely applied dictionary learning algorithms.
K-SVD was introduced by Aharon, Elad and Bruckstein in [3] as a generalisation of the K-means clustering process. The starting point for the algorithm is the following minimisation criterion. Given some signals , , find

(1)

for and , where counts the number of non-zero entries of , and denotes the Frobenius norm. In other words we are looking for the dictionary that provides on average the best -term approximation to the signals in .
K-SVD aims to find the minimum of (1) by alternating two procedures: a) fixing the dictionary and finding a new, close to optimal coefficient matrix column-wise, using a sparse approximation algorithm such as (Orthogonal) Matching Pursuit, [38], or Basis Pursuit, [9], and b) updating the dictionary atom-wise, choosing the updated atom to be the left singular vector corresponding to the maximal singular value of the matrix having as its columns the residuals of all signals to which the current atom contributes, i.e. . If in every step for every signal the best sparse approximation is found, the K-SVD algorithm is guaranteed to find a local minimiser of (1). However, because of the non-optimal sparse approximation procedure, it can in general not be guaranteed to converge to a local minimiser of (1) unless and a greedy algorithm is used, see also the discussion in Section 5. We will not go further into algorithmic details, but refer the reader to the original paper [3] as well as [4]. Instead we concentrate on the theoretical aspects of the posed minimisation problem.
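To make the alternation concrete, here is a schematic numpy sketch of one K-SVD-style iteration: sparse coding of each signal with a naive orthogonal matching pursuit, followed by an atom-wise update via the leading left singular vector of the residuals of the signals using that atom. The routine names, the simple OMP implementation and the simultaneous coefficient update are our illustrative choices and not the reference implementation of [3].

```python
import numpy as np

def omp(Phi, y, S):
    """Naive orthogonal matching pursuit: greedily select S atoms and
    back-project onto their span.  Returns a sparse coefficient vector."""
    d, K = Phi.shape
    support, residual = [], y.copy()
    for _ in range(S):
        k = int(np.argmax(np.abs(Phi.T @ residual)))
        if k not in support:
            support.append(k)
        coeffs, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coeffs
    x = np.zeros(K)
    x[support] = coeffs
    return x

def ksvd_iteration(Phi, Y, S):
    """One alternation in the spirit of K-SVD: (a) sparse coding of every
    signal, (b) atom-wise update using the leading left singular vector of
    the residuals of the signals to which the atom contributes."""
    d, K = Phi.shape
    X = np.column_stack([omp(Phi, y, S) for y in Y.T])
    for k in range(K):
        users = np.nonzero(X[k, :])[0]          # signals that use atom k
        if users.size == 0:
            continue
        # residuals without the contribution of atom k
        E = Y[:, users] - Phi @ X[:, users] + np.outer(Phi[:, k], X[k, users])
        U, s, Vt = np.linalg.svd(E, full_matrices=False)
        Phi[:, k] = U[:, 0]                     # updated atom
        X[k, users] = s[0] * Vt[0, :]           # matching coefficient row
    return Phi, X
```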
First it will be convenient to rewrite the objective function using the fact that for any signal the best -term approximation using is given by the largest projection onto a set of atoms , i.e.,

where denotes the Moore-Penrose pseudo inverse of . Abbreviating the projection onto the span of by , we can thus replace the minimisation problem in (1) with the following maximisation problem,

(2)

From the above formulation it is quite easy to see the motivation for the proposed learning criterion. Indeed assume that the training signals are all -sparse in an admissible dictionary , i.e. and , then clearly there is a global maximum of (2) at , respectively a global minimum of (1) at , as long as . (The generating dictionary is a global maximiser together with all dictionaries consisting of a permutation of its atoms provided with a sign; for a more detailed discussion on the uniqueness of the maximiser/minimiser see e.g. [21].) However in practice we will be facing something like,

(3)

where the coefficient vectors in are only approximately -sparse or rapidly decaying and the pure signals are corrupted with noise . In this case it is no longer trivial or obvious that is a local maximum of (2), but we can hope for a result of the following type.

Goal 1.1.

Assume that the signals are generated as in (3), with drawn from a distribution of approximately sparse or decaying vectors and random noise. As soon as the number of signals is large enough , with high probability there will be a local maximum of (2) within distance from .

The rest of this paper is organised as follows. After introducing some notation in Section 2, we first give conditions on the dictionary and the coefficients which allow for asymptotic identifiability by studying when is exactly at a local maximum in the limiting case, where we replace the sum in (2) with the expectation,

(4)

Thus in Section 3 we will prove identification results for (4), assuming first a simple (discrete, noise-free) signal model and then progressing to a noisy, continuous signal model. In Section 4 we will go from asymptotic results to results for finite sample sizes and prove versions of Goal 1.1 that, under the same assumptions as the asymptotic results, quantify the sizes of the parameters in terms of the number of training signals and the size of in terms of the number of atoms . In the last section we will discuss the implications of our results for practical applications, compare them to existing identification results and point out some directions for future research.

2 Notations and Conventions

Before we jump into the fray, we collect some definitions and say a few words about notation. Usually subscripted letters will denote vectors, with the exception of and , where they are numbers, e.g. vs. ; however, it should always be clear from the context what we are dealing with.
For a matrix , we denote its (conjugate) transpose by and its Moore-Penrose pseudo inverse by . We denote its operator norm by and its Frobenius norm by ; recall that we have .
A dictionary is a collection of unit norm vectors , . By abuse of notation we will also refer to the matrix collecting the atoms as its columns as the dictionary, i.e. . The maximal absolute inner product between two different atoms is called the coherence of the dictionary, .
By we denote the restriction of the dictionary to the atoms indexed by , i.e. , , and by the orthogonal projection onto the span of the atoms indexed by , i.e. . Note that in case the atoms indexed by are linearly independent we have .
(Ab)using the language of compressed sensing we denote the minimal eigenvalue of by and define the lower isometry constant of the dictionary as . If every set of atoms is linearly independent we have , and in general we have the bound . When clear from the context we will usually omit the reference to the dictionary. For more details on isometry constants, see for instance [8].
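The following numpy sketch computes the quantities just defined (coherence, the restricted dictionary, the orthogonal projection onto its span and a brute-force lower isometry constant) for a small random example; the convention used below for the isometry constant, one minus the smallest eigenvalue of the restricted Gram matrix maximised over all supports of a given size, is a common one and may differ slightly from the exact formulation in the text.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
d, K, S = 8, 12, 3
Phi = rng.standard_normal((d, K))
Phi /= np.linalg.norm(Phi, axis=0)             # unit norm atoms

# coherence: maximal absolute inner product between two different atoms
G = Phi.T @ Phi
coherence = np.max(np.abs(G - np.eye(K)))

# restriction to an index set I and orthogonal projection onto its span,
# computed via the Moore-Penrose pseudo-inverse
I = [0, 3, 7]
Phi_I = Phi[:, I]
P_I = Phi_I @ np.linalg.pinv(Phi_I)

# brute-force lower isometry constant for sparsity level S: one minus the
# smallest eigenvalue of the restricted Gram matrix over all supports
min_eig = min(np.linalg.eigvalsh(Phi[:, list(J)].T @ Phi[:, list(J)])[0]
              for J in combinations(range(K), S))
delta_S = 1.0 - min_eig

print(coherence, delta_S, np.allclose(P_I @ P_I, P_I))  # P_I is a projection
```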
For two dictionaries we define the distance between them as the maximal distance between two corresponding atoms, i.e.

(5)

We call a collection of vectors a frame if there exist two positive constants such that for all we have

(6)

If can be chosen equal to , i.e. , the frame is called tight, and if all elements of a tight frame have unit norm we have . The operator is called the frame operator and by (6) its spectrum is bounded by . For more details on frames, see e.g. [10].
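As a small numerical check of these definitions, the sketch below builds a unit norm tight frame as the union of two orthonormal bases and reads off the frame bounds from the spectrum of the frame operator; the construction is just an example.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8

# unit norm tight frame: union of the standard basis and a random orthonormal basis
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
Phi = np.hstack([np.eye(d), Q])                 # K = 2d unit norm vectors

S_op = Phi @ Phi.T                              # frame operator
eigs = np.linalg.eigvalsh(S_op)
A, B = eigs[0], eigs[-1]                        # lower and upper frame bounds

# for this unit norm tight frame the bounds coincide and equal K/d = 2
print(A, B, Phi.shape[1] / d)
```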
Finally we introduce the Landau symbols to characterise the growth of a function. We write if and if .

3 Asymptotic identification results

As mentioned in the introduction, if the signals are all -sparse in a dictionary then clearly there is a global minimum of (1), or global maximum of (4), with parameter at . However, what happens if we do not have perfect -sparsity? Let us start with a very simple negative example of a coefficient distribution for which the original generating dictionary is not at a local maximum for the case .

Example 3.1.

Let be an orthonormal basis and let the signals be generated as , where is a randomly 2-sparse, 'flat' coefficient sequence, i.e. we pick an index set and two signs uniformly at random and set for and zero else. Then there is no local maximum of (4) with at . Indeed, since the signals are all 2-sparse, the maximal inner product with all atoms in is the same as the maximal inner product with only atoms. We can use this degree of freedom to construct an ascent direction. Choose . Using the identity we get,

which is larger than .
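The following Monte Carlo sketch checks this negative example numerically: for an orthonormal basis, flat 2-sparse coefficients and sparsity parameter one, replacing the first atom by the normalised sum of the first two atoms (one concrete ascent direction in the spirit of the construction above) strictly increases the empirical value of the criterion. The dimension, sample size and chosen perturbation are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_signals = 6, 100000

Phi = np.eye(d)                                      # generating orthonormal basis
Psi = Phi.copy()
Psi[:, 0] = (Phi[:, 0] + Phi[:, 1]) / np.sqrt(2)     # perturbed first atom

def expected_max_response(D):
    """Empirical expectation of the best 1-term response max_k <y, d_k>^2
    over random flat 2-sparse signals y = (s1*e_i + s2*e_j)/sqrt(2)."""
    idx = np.argsort(rng.random((n_signals, d)), axis=1)[:, :2]   # random index pairs
    signs = rng.choice([-1.0, 1.0], size=(n_signals, 2))
    Y = np.zeros((n_signals, d))
    Y[np.arange(n_signals), idx[:, 0]] = signs[:, 0] / np.sqrt(2)
    Y[np.arange(n_signals), idx[:, 1]] = signs[:, 1] / np.sqrt(2)
    return np.mean(np.max(np.abs(Y @ D), axis=1) ** 2)

print(expected_max_response(Phi))   # = 1/2
print(expected_max_response(Psi))   # strictly larger, so Phi is not a local maximiser
```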

From the above example we see that in order to have a local maximum at the original dictionary we need a signal/coefficient model where the coefficients show some type of decay.

3.1 A simple model of decaying coefficients

To get started we consider a very simple coefficient model, constructed from a non-negative, non-increasing sequence with , which we permute uniformly at random and provide with random signs. To be precise, for a permutation and a sign sequence , , we define the sequence component-wise as , and set , where with probability .
The normalisation has the advantage that for dictionaries that are an orthonormal basis the resulting signals also have unit norm, and for general dictionaries the signals have unit square norm in expectation, i.e. . This reflects the situation in practical applications, where we would normalise the signals in order to equally weight their importance.
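A small numpy sketch of this coefficient model, assuming an l2-normalised geometric decay profile purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 16                                       # dimension, here Phi is an orthonormal basis
c = 0.7 ** np.arange(d)                      # non-negative, non-increasing profile
c /= np.linalg.norm(c)                       # normalisation, so ||c||_2 = 1

def draw_coefficients():
    perm = rng.permutation(d)                # uniformly random permutation
    signs = rng.choice([-1.0, 1.0], size=d)  # iid signs, +1/-1 with probability 1/2
    x = np.zeros(d)
    x[perm] = c                              # randomly permuted decay profile
    return signs * x

Phi = np.linalg.qr(rng.standard_normal((d, d)))[0]   # an example orthonormal basis
y = Phi @ draw_coefficients()
print(np.linalg.norm(y))                     # = 1, as noted above for orthonormal bases
```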
Armed with this model we can now prove a first dictionary identification result for (4).

Theorem 3.1.

Let be a unit norm tight frame with frame constant and lower isometry constant . Let be a random permutation of a positive, non-increasing sequence , where and , provided with random signs, i.e. with probability . Assume that the signals are generated as . If there exists such that for we have

(7)

then there is a local maximum of (4) at .
Moreover for we have as soon as

(8)

where and because of (7).

Proof.

The basic idea of the proof is that for the original dictionary the maximal response is always attained for the set , and that for most signals, i.e. most sign sequences, the maximal response for a perturbed dictionary is still attained at . Since the average loss of a perturbed dictionary over most sign sequences,

(9)

is larger than the possible gain on exceptional sign sequences, we have a maximum at . More detailed sketches and a version of the proof for can be found in [34, 33].
Following the proof idea we first calculate the expectation using the original dictionary . Condition (7) quite obviously (and artlessly) guarantees that the maximum is always attained for the set , so setting we get from Lemma A.1 in the appendix,

(10)

To compute the expectation for a perturbation of the original dictionary we first note that we can parametrise all -perturbations of the original dictionary , i.e. , as

for some with and some with . For conciseness of the following presentation we define , and . Further we define and to get and . Note that some perturbations, e.g. small rotations, will also be unit norm tight frames, but in general the perturbed dictionaries will not be tight.
As pointed out in the proof idea our strategy will be to show that for a fixed permutation with high probability (over ) the maximal projection is still onto the atoms indexed by .
For any index set of size we can bound the projection onto a perturbed dictionary as,

(11)

leading to

(12)

However, (12) is quite a pessimistic estimate since for most , meaning for most , the expression will be much smaller than . Indeed we can estimate its typical size via the following convenient, if not optimal, concentration inequality for Rademacher series from [25], Chapter 4.

Corollary 3.2 (of Theorem 4.7 in [25]).

For a vector-valued Rademacher series , i.e. for independent Bernoulli variables with and , and we have,

(13)
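The following simulation only illustrates the phenomenon behind this corollary, namely that the norm of a vector-valued Rademacher series concentrates around its mean at the scale given by the square root of the sum of the squared norms of the vectors; the exact constants of (13) are not reproduced, and the vectors used are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
d, m, trials = 20, 50, 20000

V = rng.standard_normal((d, m)) / np.sqrt(d)        # fixed vectors v_1, ..., v_m
scale = np.sqrt(np.sum(np.linalg.norm(V, axis=0) ** 2))

eps = rng.choice([-1.0, 1.0], size=(trials, m))     # independent Rademacher signs
norms = np.linalg.norm(eps @ V.T, axis=1)           # || sum_i eps_i v_i ||_2 per trial

print("mean", norms.mean(), "std", norms.std(), "scale", scale)
# deviations of more than a fraction of the scale above the mean are rare
for t in (0.25, 0.5, 1.0):
    print(t, np.mean(norms > norms.mean() + t * scale))
```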

Applied to this leads to the following estimate,

(14)

whenever ; otherwise we trivially have . We now define the set ,

(15)

whose size we can estimate using (14) with and a union bound,

(16)

Note that whenever we have , since using the (reverse) triangle inequality we have

(17)

To finally calculate the expectation over for a perturbed dictionary we split it into a sum over the sign sequences contained in and its complement. We can estimate,

(18)
(19)

where we have used (3.1), reversing the roles of and and choosing , to go from (18) to (19). Using the expression for derived in Lemma A.1 in the appendix we get the following bound for the expectation of the maximal projection using a perturbed dictionary,

(20)

We are now ready to compare the above expression to the corresponding one for the original dictionary. Abbreviating and using the estimates for and from Lemma A.2 in the appendix, we get

(21)

with and and where we have used that (8) implies . Denote by the set for which is maximal. We can further estimate,

Thus to have a local maximum at we need to show that for small enough we have

or equivalently that

(22)

Applying Lemma A.3 we get that for the inequality above is satisfied if we have

(23)

Employing the bound this is further implied by

(24)

which is in turn implied by (8).
Finally, all that remains to show is that for we have . Assume to the contrary that for we have , meaning that for all of size . We can then find an index for which we have for some with and = 1. For all of size containing we have and therefore . Choose to be any set of size containing . For all we have or , which means that either is in the span of and therefore , or that is in the span of for all . However, this would mean that has rank , which contradicts being a frame, and we can conclude that for all of size containing . Iterating the argument we get that has to be in the span of , which contradicts and = 1. ∎
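As a sanity check of the statement just proved, the following simulation compares the empirical value of the criterion (2) at a generating orthonormal basis (a unit norm tight frame) with its value at a small perturbation, for signals drawn from the coefficient model above; the decay profile, perturbation size and sample size are illustrative choices, and under sufficient decay the first value is expected to be the larger one.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
d, S, N, eps = 8, 2, 20000, 0.1

Phi = np.eye(d)                                    # generating orthonormal basis
c = 0.5 ** np.arange(d)
c /= np.linalg.norm(c)                             # decaying, normalised coefficients

# training signals y_n = Phi x_n with randomly permuted and signed coefficients
X = np.zeros((N, d))
for n in range(N):
    X[n, rng.permutation(d)] = c
X *= rng.choice([-1.0, 1.0], size=(N, d))
Y = X @ Phi.T

# an eps-perturbation: rotate every atom towards a random orthogonal direction
Z = rng.standard_normal((d, d))
Z -= Phi * np.sum(Z * Phi, axis=0)                 # remove components along the atoms
Z /= np.linalg.norm(Z, axis=0)
Psi = np.sqrt(1 - eps ** 2) * Phi + eps * Z        # unit norm perturbed atoms

def objective(D):
    """Empirical version of (2): average energy of the best S-term projection."""
    supports = list(combinations(range(D.shape[1]), S))
    B = np.stack([np.linalg.qr(D[:, list(I)])[0] for I in supports])   # orthonormal bases
    resp = np.einsum('mds,nd->nms', B, Y)          # projection coefficients per support
    return np.mean(np.max(np.sum(resp ** 2, axis=2), axis=1))

# with sufficiently decaying coefficients the generating basis should attain the larger value
print(objective(Phi), objective(Psi))
```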

Remark 3.2.

(a) To make the theorem more applicable it would be nice to have a concrete condition in terms of the coherence of the dictionary rather than the abstract condition in (7). Indeed it can be shown, see [34], Appendix C, that we can find a if we have and

(25)

In some cases we can also easily derive estimates for .
If is an orthonormal basis we have

(26)

and if we have