# Local Identification of Overcomplete Dictionaries

###### Abstract

This paper presents the first theoretical results showing that stable identification of overcomplete -coherent dictionaries is locally possible from training signals with sparsity levels up to the order and signal to noise ratios up to . In particular the dictionary is recoverable as the local maximum of a new maximisation criterion that generalises the K-means criterion. For this maximisation criterion results for asymptotic exact recovery for sparsity levels up to and stable recovery for sparsity levels up to as well as signal to noise ratios up to are provided. These asymptotic results translate to finite sample size recovery results with high probability as long as the sample size scales as , where the recovery precision can go down to the asymptotically achievable precision. Further, to actually find the local maxima of the new criterion, a very simple Iterative Thresholding and K (signed) Means algorithm (ITKM), which has complexity in each iteration, is presented and its local efficiency is demonstrated in several experiments.

Department of Mathematics

University of Innsbruck

Technikerstraße 19a

6020 Innsbruck, Austria

Editor: Shie Mannor

Keywords: dictionary learning, dictionary identification, sparse coding, sparse component analysis, vector quantisation, K-means, finite sample size, sample complexity, maximisation criterion, sparse representation

## 1 Introduction

Be it the 300 million photos uploaded to Facebook per day, the 800GB the large Hadron collider records per second or the 320.000GB per second it cannot record, it is clear that we have reached the age of big data. Indeed, in 2012, the amount of data existing worldwide is estimated to have reached 2.8 ZB = 2.800 billion GB and while 23 % of these data are expected to be useful if analysed, only 1% actually are. So how do we deal with this big data challenge? The key concept, that has driven data processing and data analysis in the past decade, is that even high-dimensional data has intrinsically low complexity, meaning that every data point can be represented as linear combination of a sparse (small) number of elements or atoms of an overcomplete dictionary , that is,

for a set of size , , which is small compared to
the ambient dimension, .

These sparse components do not only describe the data but the representations can also be used for a myriad of efficient sparsity based data processing schemes, ranging from denoising, (Donoho et al., 2006), to compressed sensing, (Donoho, 2006; Candès et al., 2006). Therefore a promising tool both for data analysis and data processing, that has emerged in the last years, is dictionary learning, also known as sparse coding or sparse component analysis. Dictionary learning addresses the fundamental question of how to automatically learn a dictionary, providing sparse representations for a given data class. That is, given signals , stored as columns in a matrix , find a decomposition

into a dictionary matrix with unit norm columns and a coefficient matrix with sparse columns.

Until recently the main research focus in dictionary learning has been on the development of algorithms.
Thus by now there is an ample choice of learning algorithms, that perform well in experiments and are popular in applications, (Field and Olshausen, 1996; Kreutz-Delgado and Rao, 2000; Kreutz-Delgado et al., 2003; Aharon et al., 2006; Yaghoobi et al., 2009; Mairal et al., 2010; Skretting and Engan, 2010; Rubinstein et al., 2010). However, slowly the interest is shifting and researchers are starting to investigate also the theoretical aspects of dictionary learning. Following the first theoretical insights, originating in the blind source separation community, (Zibulevsky and Pearlmutter, 2001; Georgiev et al., 2005), there is now a set of generalisation bounds predicting how well a learned dictionary can be expected to sparsely approximate future data, (Maurer and Pontil, 2010; Vainsencher et al., 2011; Mehta and Gray, 2012; Gribonval et al., 2013). These results give a theoretical foundation for dictionary learning as data processing tool, for example for compression, but unfortunately do not give guarantees that an efficient algorithm will find/recover a good dictionary provided that it exists. However, in order to justify the use of dictionary learning as data analysis tool, for instance in blind source separation, it is important to provide conditions under which an algorithm or scheme can identify the dictionary from a finite number of training signals, that is, the sources from the mixtures. Following the first dictionary identification results for the -minimisation principle, which was suggested in Zibulevsky and Pearlmutter (2001)/Plumbley (2007), by Gribonval and Schnass (2010); Geng et al. (2011); Jenatton et al. (2014) and for the ER-SPuD algorithm for learning a basis in (Spielman et al., 2012), 2013 has seen a number of interesting developments. First in Schnass (2014) it was shown that the K-SVD minimisation principle suggested in Aharon et al. (2006) can locally identify overcomplete tight dictionaries. Later in Arora et al. (2014); Agarwal et al. (2014b) algorithms with global identification guarantees for coherent dictionaries were presented. Finally in Agarwal et al. (2014a) it was shown that an alternating minimisation method is locally convergent to the correct generating dictionary. One aspect that all these results have in common is that the sparsity level of the training signals required for successful identification is of order or for incoherent dictionaries. Considering that on average sparse recovery in a given dictionary is successful for sparsity levels , (Tropp, 2008; Schnass and Vandergheynst, 2007), and that for dictionary learning we usually have a lot of training signals at our disposal, the same sparsity level should be sufficient for dictionary learning and indeed in this paper we provide the first indication that global dictionary identification could be possible for sparsity levels by proving that it is locally possible. Further we show that in experiments a very simple iterative algorithm, based on thresholding and K signed means, is locally successful.

The paper is organised as follows. After introducing all necessary notation in Section 2 we present a new optimisation criterion, motivated by the analysis of the K-SVD principle, (Schnass, 2014), in Section 3. In Section 4 we give asymptotic identification results both for exact and stable recovery, which in Section 5 are extended to results to finite sample sizes. Section 6 provides an algorithm for actually finding a local optimum and some experiments confirming the theoretical results. Finally in the last section we compare the results for the new criterion to existing identification results, discuss the implications of these local results for global dictionary identification algorithms and point out directions for future research.

## 2 Notations and Conventions

Before we jump into the fray, we collect some definitions and lose a few words on notations; usually subscripted letters will denote vectors with the exception of and where they are numbers, eg. vs. , however, it should always be clear from the context what we are dealing with.

We consider a dictionary a collection of unit norm vectors , . By abuse of notation we will also refer to the matrix collecting the atoms as its columns as the dictionary, that is .
The maximal absolute inner product between two different atoms is called the coherence of the dictionary, . By we denote the restriction of the dictionary to the atoms indexed by , that is , . We indicate the conjugate transpose of a matrix with a , for example would be the transpose of .

The set of all dictionaries of a given size () is denoted by .
For two dictionaries we define the distance between each other as the maximal distance between two corresponding atoms,

We consider a frame a collection of vectors for which there exist two positive frame constants such that for all we have

(1) |

From (1) it follows that , interpreted as matrix, has rank and that its non-zero singular values are in the interval . If can be chosen equal to , that is , the frame is called tight. If all frame elements have unit norm, we call a unit norm frame. For more details on frames, see e.g. Christensen (2003).

Finally we introduce the Landau symbols to characterise the growth of a function. We write

.

## 3 A Response Maximisation Criterion

One of the origins of dictionary learning can be found in the field of vector quantisation, where the aim is to find a codebook (dictionary) such that the codewords (atoms) closely represent the data, that is

Indeed the vector quantisation problem can be seen as an extreme case of dictionary learning, where we do not only want all our signals to be approximately -sparse but also want the single non-zero coefficient equal to one. On the other hand we allow the atoms (codewords) to have any length. The problem above is usually solved by a K-means algorithm, which alternatively separates the training data into K clusters, each assigned to one codeword, and then updates the codeword to be the mean of the associated train signals. For more detailed information about vector quantisation or the K-means algorithm see for instance Gersho and Gray (1992) or the introduction of Aharon et al. (2006). If we relax the condition that each coefficient has to be positive, but in turn ask for the atoms to have unit norm, we are already getting closer to the concept of -sparse dictionary learning,

This minimisation problem can be rewritten as

and is therefore equivalent to the maximisation problem

(2) |

A local maximum of (2) can be found with a signed K-means algorithm, which assigns each training signal to the atom of the current dictionary giving the largest response in absolute value and then updating the atom as normalised signed mean of the associated training signals, see Section 6 for more details. The question now is how do we go from these -sparse dictionary learning formulations to -sparse formulations with . The most common generalisation, which provides the starting point for the MOD and the K-SVD algorithm, is to give up all constraints on the coefficients except for -sparsity and to minimise,

(3) |

However, rewriting the problem we see that this formulation does not reduce to the same maximisation problem in case . Then the best one term approximation in the given dictionary is simply the largest projection onto one atom and we have

leading instead to the maximisation problem

A local maximum can now be found using the same partitioning strategy as before but updating the atoms as largest singular vector rather than signed mean of the associated training signals, requiring K SVDs as opposed to K means. While the minimisation problem (3) is definitely the most effective generalisation for dictionary learning when the goal is compression, it brings with it some complications when used as analysis tool. Indeed in Schnass (2014) it has been shown that for the K-SVD criterion (3) can only identify the underlying dictionary from sparse random mixtures to arbitrary precision (given enough training samples) if this dictionary is tight and it is conjectured that the same holds for . Roughly simplified the reason for this is that for random sparse signals and an -perturbation the average of the largest squared response behaves like

If is tight the term is constant over all dictionaries and therefore there is a local maximum at . From the above we also see that the average of the largest absolute response should behave like

meaning that we should have a maximum at also if it is non-tight. This suggests as alternative way to generalise the K-means optimisation principle for dictionary identification to simply maximise the absolute norm of the -largest responses,

(4) |

Other than for the K-SVD criterion it is not obvious that there should be a local optimum of (4) at even if all signals are perfectly -sparse in . Therefore it is quite intriguing that we will not only be able to prove local identifiability of any generating dictionary via (4) from randomly sparse signals, but that these identifiability properties are stable under coherence and noise. However, before we get to the main result in Theorem 5 on page 5, we first have to lay the foundation, by providing suitable random signal models and by studying the asymptotic identifiability properties of the new principle.

## 4 Asymptotic Results

We can get to the asymptotic version of the -response maximisation principle in (4) simply by replacing the sum over the training signals with the expectation, leading to

(5) |

Next we need a random sparse coefficient model to generate our signals . We make the following definition, see also Schnass (2014).

###### Definition 1

A probability distribution (measure) on the unit sphere is called symmetric if for all measurable sets , for all sign sequences and all permutations we have

Setting where is drawn from a symmetric probability measure on the unit sphere has the advantage that for dictionaries which are orthonormal bases
the resulting signals have unit norm and for general dictionaries the signals have unit square norm in expectation, that is . This reflects the situation in practical application, where we would normalise the signals in order to equally weight their importance.

One example of such a probability measure can be constructed from a non-negative, non-increasing sequence with , which we permute uniformly at random and provide with random signs. To be precise for a permutation and a sign sequence , , we define the sequence component-wise as
, and set
if there exist such that and otherwise.
While being very simple this measure exhibits all the necessary structure and indeed in our proofs we will reduce the general case of a symmetric measure to this simple case.

So far we have not incorporated any sparse structure in our coefficient distribution.
To motivate the sparsity requirements on our coefficients we will recycle the simple negative example of a sparse coefficient distribution for which the original generating dictionary is not at a local maximum of (5) with from Schnass (2014).

###### Example 1

Let be an orthonormal basis and let the signals be constructed as . If is randomly 2-sparse with ’flat’ coefficients, that is, drawn from the simple symmetric probability measure with base sequence , where , for , then is not a local maximum of (5) with .

Indeed, since the signals are all 2-sparse, the maximal inner product with all atoms in is the same as the maximal inner product with only atoms. This degree of freedom we can use to construct an ascent direction. Choose , then we have

From the above example we see that, in order to have a local maximum of (5) with at the original dictionary, we need our signals to be truly 1-sparse, that is, we need to have a decay between the first and the second largest coefficient. In the following sections we will study how large this decay should be to have a local maximum exactly at or near to the generating dictionary for more general dictionaries and sparsity levels.

### 4.1 Exact Recovery

To warm up we first provide an asymptotic exact dictionary identification result for (5) for incoherent dictionaries in the noiseless setting.

###### Theorem 2

Let be a unit norm frame with frame constants and coherence . Let the coefficients be drawn from a symmetric probability distribution on the unit sphere and assume that the signals are generated as . If there exists such that for the non-increasing rearrangement of the absolute values of the components of we have almost surely, that is

(6) |

then there is a local maximum of (5) at .

Moreover for we have as soon as

(7) |

where .

Proof idea We briefly sketch the main ideas of the proof, which are the same as for the corresponding theorem for the K-SVD principle in Schnass (2014). For self-containedness of the paper the full proof is included in Appendix A.1.

Assume that we have the case of a simple probability measure based on one sequence c, that is .
For any fixed permutation the condition in (6) ensures that for all sign sequences , and consequently all signals, the maximal S responses of the original dictionary are attained at and that there is a gap of size to the remaining responses.

For an -perturbation of the generating dictionary we have for some unit vectors with and . Now for most sign sequences the contribution of to the response will be smaller than so the maximal S responses will still be attained at . Comparing the loss of the perturbed dictionary over the typical sign sequences of all permutations, which scales as , to the maximal gain over the approximately atypical sign sequences shows that there is a maximum at the original dictionary. The general result follows from an integration argument.

As already mentioned, while for the K-SVD criterion (3) there is always an optimum at the generating dictionary if all training signals are -sparse, this it is not obvious for the response principle. Indeed, in the special case where all the training signals are exactly -sparse, almost surely, we get an additional condition to ensure asymptotic recoverability,

To get a better feeling for this constraint we bound the sum over the S largest reponses by S times the largest response, and arrive at the condition

(8) |

which is the classical condition under which simple thresholding will find the support of an exactly -sparse signal, compare for instance Schnass and Vandergheynst (2008).

### 4.2 Stability under Coherence and Noise

While giving a first insight into the identification properties of the response principle, Theorem 2 suffers from two main limitations.

First, the required condition on the coherence of the dictionary with respect to the decay of the coefficients, , is unfortunately quite strict. In the most favourable case of exactly sparse signals with equally sized coefficients, , we see from (8) that we can only identify dictionaries from very sparse signals, where . In case of very incoherent dictionaries with
this means that .
However, for most sign sequences we have

which indicates that a condition of the form may be strong enough to guarantee (approximate) recoverability of the dictionary. Assuming again the most favourable case of equally sized coefficients, we could therefore identify dictionaries from signals with sparsity levels of the order , which, in case of incoherent dictionaries, means of the order of the ambient dimension .

The second limitation of Theorem 2 is that, even if it allows for not exact S-sparseness of the signals, it does not take into account noise.
Our next goal is therefore to extend the exact identification result in Theorem 2 to a stable identification result for less sparse (larger S) and noisy signals. For this task we first need to amend our signal model to incorporate noise. We would like to consider unbounded white noise, but also keep the property that in expectation the signals have unit square norm.
Further for the next section, where we want to transform our asymptotic identification results to results for finite sample sizes, it will be convenient if our signals are bounded. These considerations lead to the following model:

(9) |

where is a centred random subgaussian vector with parameter . That is, the entries are independent and satisfy .

Employing this noisy signal model and formalising the ideas about the typical gap size between responses of the generating dictionary inside and outside the true support, leads to the following theorem.

###### Theorem 3

Let be a unit norm frame with frame constants and coherence . Let the coefficients be drawn from a symmetric probability distribution on the unit sphere . Further let be a centred random subgaussian noise-vector with parameter and assume that the signals are generated according to the noisy signal model in (9). If there exists such that for the non-increasing rearrangement of the absolute values of the components of we have almost surely and

(10) |

where and , then there is a local maximum of (5) at satisfying,

(11) |

Proof idea As outlined at the beginning of the section the main ingredient we have to add to the proof idea of Theorem 2 is a probabilistic argument to substitute the condition guaranteeing that the S largest responses of the generating dictionary are . Due to concentration of measure we get that for most sign sequences, and therefore most signals, the maximum is still attained at . Moreover the gap to the remaining responses is actually large enough to accommodate relatively high levels of noise and/or perturbations.

The detailed proof can be found in Appendix A.2.

Let us make some observations about the last result.

First, we want to point out that for sub-gaussian noise with parameter , the quantity in the statement above is well behaved. If for example the are iid Bernoulli-variables, that is , we have . In general we have the following estimate due for instance to Theorem 1 in Hsu et al. (2012). Since we have

setting , we get , which leads to

Also to illustrate the result we again specialise it to the most favourable case of exactly -sparse signals with balanced coefficients, that is . Assuming white Gaussian noise with variance we see that identification is possible even for expected signal to noise ratios of the order , that is

Similarly, by specialising Theorem 3 to the case of exactly -sparse and noiseless signals we get - to the best of our knowledge - the first result establishing that locally it is possible to stably identify dictionaries from signals with sparsity levels beyond the spark of the generating dictionary. Indeed, even if some of the -sparse signals could have representations in that require less than atoms, there will still be a local maximum of the asymptotic criterion close to the original dictionary as long as the smallest coefficient of each signal is of the order , which in the most favourable case means that we can have or . The quality of this result is on a par with the best results for finding sparse approximations in a given dictionary, which say that on average Basis Pursuit or thresholding can find the correct sparse support even for signals with sparsity levels of the order of the ambient dimension (Tropp, 2008; Schnass and Vandergheynst, 2007).

Next note that with the available tools it would be possible to consider also a signal model where a small fraction of the coefficients violates the decay condition and still have stability. However, we leave explorations in that direction to the interested reader and instead turn to the study of the criterion for a finite number of training samples.

## 5 Finite Sample Size Results

In this section we will transform the two asymptotic results from the last section into results for finite sample sizes, that is, we will study when is close to a local maximum of

(12) |

assuming that the are following either the noise-free or the noisy signal model. For convenience we will do the analysis for the normalised version (12) of the -response criterion (4).

###### Theorem 4

Let be a unit norm frame with frame constants and coherence . Let the coefficients be drawn from a symmetric probability distribution on the unit sphere and assume that the signals are generated as . If there exists such that for the non-increasing rearrangement of the absolute values of the components of we have almost surely and the target precision satisfies

where , then except with probability,

there is a local maximum of (12) respectively (4) at satisfying,

###### Theorem 5

Let be a unit norm frame with frame constants and coherence . Let the coefficients be drawn from a symmetric probability distribution on the unit sphere . Further let be i.i.d. centred random subgaussian noise-vectors with parameter and assume that the signals are generated according to the noisy signal model in (9). If there exists such that for the non-increasing rearrangement of the absolute values of the components of we have almost surely and if the target precision , the noiseparameter and the coherence satisfy

(13) | ||||

where and , then except with probability

where |

there is a local maximum of (12) respectively (4) at , satisfying

Proof idea The proofs, which can be found in Appendix A.3, are based on three ingredients, a Lipschitz property for the mapping for the respective signal model, the concentration of the sum around its expectation for a -net covering the space of all admissible dictionaries close to and a triangle inequality argument to show that the finite sample response differences are close to the expected response differences and therefore larger than 0 for all .

To see better how the sample complexity behaves, we simplify the two theorems to the special case of noiseless exactly -sparse signals with balanced coefficients for various orders of magnitude of .

If we have , Theorem 4 implies that in order to have a maximum within radius to the original dictionary with probability we need samples. Conversely given training signals we can expect the distance between generating dictionary and closest local maximum to be of the order .

If we assume a very incoherent dictionary where and thus let the sparsity level be of the order the sample complexity rises to . Taking into account that by (13) the target precision needs to be of order this means that we need at least training signals and once this initial level is reached, the error goes to zero at rate .

For an even lower sparsity level, , again assuming a very incoherent dictionary, the sample complexity for target precision implied now by Theorem 5 rises to . In this regime, however, we cannot reach arbitrarily small errors by choosing large enough but only approach the asymptotic precision
.

Following these promising theoretical results, in the next section we will finally see how theory translates into practice.

## 6 Experiments

After showing that the optimisation criterion in (4) is locally suitable for dictionary identification, in this section we present an iterative thresholding and K means type algorithm (ITKM) to actually find the local maxima of (4) and conduct some experiments to illustrate the theoretical results. We recall that given the input signals and a fixed sparsity parameter we want to solve,

Using LaGrange multipliers,

where , we arrive at the following update rule,

(14) |

where is a scaling parameter ensuring that .

In practice, when we do not have an oracle giving us the generating dictionary as initialisation, we also need to safeguard against bad initialisations resulting in a zero-update . For example we can choose
the zero-updated atom uniformly at random from the unit sphere or from the input signals.

Note that finding the sets corresponds to thresholding operations while updating according to (14) corresponds to K signed means. Altogether this means that each iteration of ITKM has computational complexity determined by the matrix multiplication , meaning . This is light in comparison to K-SVD, which even when using thresholding instead of OMP as sparse approximation procedure still requires the calculation of the maximal singular vector of K on average matrices. It is also more computationally efficient than the algorithm for local dictionary refinement, proposed in Arora et al. (2014), which is also based on averaging. Furthermore it is straightforward to derive online or parallelised versions of ITKM. In an online version for each newly arriving signal we calculate using thresholding and update for . After signals have been processed we renormalise the atoms to have unit norm and set . Similarly, to parallelise we divide the training samples into sets of size . Then on each node we learn a dictionary according to (14) with . We then calculate the sum of these dictionaries and renormalise the atoms in to have unit norm.

Armed with this very simple algorithm we will now conduct four experiments to illustrate our theoretical findings^{1}^{1}1A Matlab penknife (mini-toolbox) for playing around with ITKM and reproducing the experiments can be found at http://homepage.uibk.ac.at/~c7021041/ITKM.zip.

### 6.1 ITKM vs. K-SVD

In our first experiment we compare the local recovery error of ITKM and K-SVD for 3-dimensional bases with increasing condition numbers.

The bases are perturbations of the canonical basis with the vector . That is, , where and varying from 0 to 0.5 in steps of 0.1, which corresponds to condition numbers varying from 1 to 2.5. We generate approximately -sparse noiseless signals from the signal model described in Table 1 with , , and and run both ITKM and K-SVD with 1000 iterations, sparsity parameter and the true dictionary (basis) as initialisation. Figure 1(a) shows the recovery error between the original dictionary and the output of the respective algorithm averaged over 10 runs.

As predicted by the theoretical results on the corresponding underlying minimisation principles, the recovery errors of ITKM and K-SVD are roughly the same for , which is an orthogonal basis and therefore tight. However, while for ITKM the recovery error stays constantly low over all condition numbers, for K-SVD it increases with increasing condition number or non-tightness.

### 6.2 Recovery Error and Sample Size

The next experiment is designed to show how fast the maximiser near the original dictionary converges to with increasing sample size .

The generating dictionaries consist of the canonical basis in for and the first elements of the Hadamard basis and as such are not tight. For every set of parameters we generate noiseless signals with varying from to and run ITKM with 1000 iterations, sparsity parameter equal to the coefficient parameter and the true dictionary as initialisation. Figure 1(b) shows the recovery error between the original dictionary and the output of ITKM averaged over 10 runs.

As predicted by Theorem 4 the recovery error decays as . However, the separation of the curves for and almost exactly sparse signals by a factor around instead of , as suggested by the estimate , indicates that the cubic dependence of the sampling complexity on the number of atoms may be too pessimistic and could be lowered.

(a) | (b) |

### 6.3 Stability of Recovery Error under Coherence and Noise

With the last two experiments we illustrate the stability of the maximisation criterion under coherence and noise. As generating dictionaries we use again the canonical basis plus half Hadamard dictionaries described in the last experiment, which have coherence .To test the stability under coherence we use a large enough number of noiseless training signals , such that the distance between the local maximum of the criterion near the generating dictionary, that is the output of ITKM with oracle initialisation, and the generating dictionary is mainly determined by the ratio between the gap size and the coherence. For each set of parameters we create training signals with decreasing gap sizes by increasing from 0 to 0.1 in steps of 0.01 and run ITKM with oracle initialisation, parameter and 1000 iterations. Figure 2(a) shows the recovery error between the original dictionary and the output of ITKM again averaged over 10 trials.

Again the experiments reflect our theoretical results. For with or with the gap size is large enough that over the whole range of parameters the recovery error stays constantly low at the level defined by the number of samples. Note that this is quite good, since for we are already far beyond the gap size coherence ratios where the stable theoretical results hold. On the other hand for with or with early on the gap decreases enough to become the error determining factor and so we see an increase in recovery error as grows.

Conversely to test the stability under noise we use a large enough number of exactly sparse training signals, such that the recovery error will be mainly determined by the noise level. For each set of parameters we create training signals with Gaussian noise of variance (noise level) going from to in steps of and run ITKM with oracle initialisation, parameter and 1000 iterations. Figure 2(b) shows the recovery error between the original dictionary and the output of ITKM , this time averaged over 20 trials.

(a) | (b) |

The curves again correspond to the prediction of the theoretical results, that is the recovery error stays at roughly the same level defined by the number of samples until the noise becomes large enough and then increases. What is maybe interesting to observe in both experiments is the dithering effect for with , which is due to the special structure of the dictionary. Indeed using almost equally sized, almost exactly sparse coefficients, it is possible to build signals using only the canonical basis that have almost the same response in only half the Hadamard basis and the other way round. This indicates that slight perturbations of one with the other lead to even better responses and therefore a larger recovery error. After showing that the theoretical results translate into algorithmic practice, we finally turn to a discussion of our results in the context of existing work and point out directions of future research.

## 7 Discussion

We have introduced a new response maximisation principle for dictionary learning and shown that this is locally suitable to identify a generating -coherent dictionary from approximately -sparse training signals to arbitrary precision as long as the sparsity level is of the order . We have also presented - to the best of our knowledge - the first results showing that stable dictionary identification is locally possible not only for signal to noise ratios of the order but also for sparsity levels of the order .

The derived sample complexity (omitting log factors) of , for signals with sparsity levels is roughly the same as for the K-SVD criterion, Schnass (2014), or the -minimisation criterion, Jenatton et al. (2014), but somewhat large compared to recently developed dictionary algorithms that have a sample complexity of , Arora et al. (2014); Agarwal et al. (2014a), or , Agarwal et al. (2014b). However, as the sparsity approaches and goes beyond the derived sample complexity of compares quite favourably to the sample complexity of for a sparsity level as projected in Arora et al. (2014). Given that also our experimental results suggest that is quite pessimistic, one future direction of research aims to lower the sample complexity. In particular ongoing work suggests that for the ITKM algorithm a sample size of order is enough to guarantee local recovery with high probability.

Another strong point of the results is that the corresponding maximisation algorithm ITKM (Iterative Thresholding and signed Means) is locally successful, as demonstrated in several experiments, and computationally very efficient. The most complex step in each iteration is the matrix multiplication of order , which is even lighter than the iterative averaging algorithm described in Arora et al. (2014).

However, the serious drawback is that ITKM is only a local algorithm and that all our results are only local. Also while for the K-SVD criterion and the -minimisation criterion there is reason to believe that all local minima might be equivalent, the response maximisation principle has a lot of smaller local maxima, which is confirmed by preliminary experiments with random initialisations. There ITKM fails but with grace, that is, it outputs local maximisers that have not all, but only most atoms in common with what seems to be the global maximiser near the generating dictionary.
This behaviour is in strong contrast to the algorithms presented in Arora et al. (2014); Agarwal et al. (2014b), that have global success guarantees at a computational cost of the order , and leads to several very important research directions.

First we want to confirm that ITKM has a convergence radius of the order . This is suggested by the derived radius of the area on which the generating dictionary is the optimal maximiser as well as preliminary experiments. Alternatively, we could investigate how the results for the local iterative algorithms in Arora et al. (2014); Agarwal et al. (2014a) could be extended to larger sparsity levels and convergence radii using our techniques. The associated important question is how to extend the results for the algorithms presented in Arora et al. (2014); Agarwal et al. (2014b) to sparsity levels , if possible at lower cost than . Given the conjectured size of the convergence radius for ITKM it would even be sufficient for the output of the algorithm to arrive at a dictionary within distance to the generating dictionary, since the output could then be used as initialisation for ITKM.

A parallel approach for getting global identification results for sparsity levels , that we are currently pursuing, is to analyse a version of ITKM using residual instead of pure signal means, which in preliminary experiments exhibits global convergence properties.

The last research directions we want to point out are concerned with the realism of the signal model. The fact that for an input sparsity a gap of order between the and largest coefficient is sufficient can be interpreted as a relaxed dependence of the algorithm on the sparsity parameter, since a gap of order can occur quite frequently. To further decrease this sensitivity to the sparsity parameter in the criterion and the algorithm we would therefore like to extend our results to the case where we can only guarantee a gap of order between the largest and the largest coefficient for some . Last but not least we would like to exactly reflect the practical situation, where we would normalise our training signals to equally weight their importance and analyse the unit norm signal model where .

Acknowledgments

This work was supported by the Austrian Science Fund (FWF) under Grant no. J3335 and improved thanks to the reviewers’ comments and suggestions. Thanks also go to the Computer Vision Laboratory of the University of Sassari, Italy, which provided the surroundings, where almost all of the presented work was done, and to Michael Dymond, who looked over the final manuscript with a native speaker’s eye.

## A Proofs

### a.1 Proof of Theorem 2

We first reformulate and prove the theorem for the simple case of a symmetric coefficient distribution based on one sequence and then use an integration argument to extend it to the general case.

###### Proposition 6

Let be a unit norm frame with frame constants and coherence .
Let be a random permutation of a sequence , where and , provided with random signs, that is with probability . Assume that the signals are generated as .
If satisfies then there is a local maximum of (5) at .

Moreover for we have as soon as

(15) |

Proof We start by evaluating the objective function at the original dictionary .