Constrained Overcomplete Analysis Operator Learning for Cosparse Signal Modelling

# Constrained Overcomplete Analysis Operator Learning for Cosparse Signal Modelling

## Abstract

We consider the problem of learning a low-dimensional signal model from a collection of training samples. The mainstream approach would be to learn an overcomplete dictionary to provide good approximations of the training samples using sparse synthesis coefficients. This famous sparse model has a less well known counterpart, in analysis form, called the cosparse analysis model. In this new model, signals are characterised by their parsimony in a transformed domain using an overcomplete (linear) analysis operator. We propose to learn an analysis operator from a training corpus using a constrained optimisation framework based on L1 optimisation. The reason for introducing a constraint in the optimisation framework is to exclude trivial solutions. Although there is no final answer here for which constraint is the most relevant constraint, we investigate some conventional constraints in the model adaptation field and use the uniformly normalised tight frame (UNTF) for this purpose. We then derive a practical learning algorithm, based on projected subgradients and Douglas-Rachford splitting technique, and demonstrate its ability to robustly recover a ground truth analysis operator, when provided with a clean training set, of sufficient size. We also find an analysis operator for images, using some noisy cosparse signals, which is indeed a more realistic experiment. As the derived optimisation problem is not a convex program, we often find a local minimum using such variational methods. For two different settings, we provide preliminary theoretical support for the well-posedness of the learning problem, which can be practically used to test the local identifiability conditions of learnt operators.

## 1Introduction

natural signals can be modelled using low-dimensional structures. This means, although the investigated signals are distributed in a high dimensional space, they can be represented using just a few parameters in an appropriate model [1]. This is made possible by the fact that few parameters are actually involved in generating/describing such signals. Examples include polyphonic music, speech signals, cartoon images (where some lines/edges describe the images) and standard videos, where the scenes change slightly frame to frame. In modern signal processing, low-dimensional modelling is one of the most effective approaches to clarify the ambiguities in inverse problems, regularise the signals and overcome some conventional physical, temporal and computational barriers. This model has been heavily investigated in the last decade and many remarkable results achieved in almost all signal processing applications, see for example [2].

A natural and crucial question immediately arises. Given a class of signals that we are interested in, how can we model its low-dimensional structures? As an abstract answer, one could imagine that if is the signal class with low-dimensional structures, then there must be a map such that is sparse for all . Unfortunately, finding such out of all possible maps would be too complex, and one would have to restrict admissible class of maps. A simple option is to consider the class of linear maps. Hence, we ask: is there such that is sparse for ? This is the type of problem we focus on in this paper. Before getting to the nitty-gritty, we review relevant results to give some contexts to our approach.

### 1.1Sparse Signal Models

The most familiar type of low-dimensional structure is the sparse synthesis model [4]. In this setting, if a signal can approximately be generated by adding just a few elementary vectors , called atoms, from an overcomplete set, called a dictionary, as follows,

where is the dictionary (matrix) and is an index set with small cardinality, i.e. , it is called (approximately) sparse in . From the signal processing perspective, it is then important to identify the support . However, this is not an easy task [6]: when the signals are at most -sparse—this means —in model (Equation 1), they live in a union of subspaces (UoS) model [8] where each subspace has a dimension of at most , and the number of such subspaces is exponential. A popular and practical approach to find a sparse is the -minimisation [5] in which one looks for that minimises among the ones that satisfy . The attractiveness of the -objective comes from the fact that it promotes sparsity and is convex—good in computational aspects—at the same time.

An alternative form of -minimisation has been used in practice. In this form, one looks for that minimises among the signals that matches available (linear) information. Here, is some known linear operator. Note that a solution of this -minimisation necessarily leads to many zeros in and hence is orthogonal to many rows of (see Figure ?.a). This observation is not exactly compatible with the synthesis interpretation above, and the corresponding low-dimensional model is called the cosparse analysis model [11]. A signal in this model is called cosparse. is referred to as the analysis operator in this paper. Although this model is also an instance of the UoS model and relies on the vector form of parsimony, it has many structural differences with the synthesis sparsity model [12]. Very recently the signal recovery using analysis sparsity model has been investigated [13]. As can be deduced from the expression , the cosparse analysis model is intimately related to our work.

As can be seen from the synthesis model description, the dictionary plays an important role. When we deal with the natural signals, one can then naturally ask, how good we can choose such a dictionary? Generally, we can use some signal exemplars [15] for learning a dictionary or some domain knowledge for designing a suitable dictionary [17]. The readers can refer to [19] and [20] as some surveys on the dictionary selection approaches. It has been shown that the learned dictionaries usually out perform the canonical dictionaries, e.g. DCT and Wavelet, in practice. Online dictionary learning methods [21] have also been introduced to adapt the generative model to a large set of data or a time-varying system.

In the exemplar based dictionary learning methods, a set of training signals is given. Similar to sparse approximations, the norm is often preferred as the sparsity-promoting objective [16], and the dictionary learning with this objective can be formulated as follows:

where and is an admissible set. Here, is the dictionary we want to learn and is the coefficient matrix which is expected to be sparse column-wise. The reason for is to exclude trivial solutions, such as , with and . This type of problem is called the scale ambiguity of the signal modelling. Various constraints for the resolution of scale ambiguity have been proposed for dictionaries: fixed atom norms [23], fixed Frobenius norm [24] and convex balls made using these norms [25]. Note that the optimisation is no longer convex and can be difficult to solve. Different techniques have been proposed to find a sensible solution for (Equation 2), where they are often based on alternating optimisations of the objective based upon and . See [19] and [20] for a review on state of the art methods.

Designing (overcomplete) linear transforms to map some classes of signals to another space with a few significant elements, is not a new subject. It has actually been investigated for almost a decade and many harmonic/wavelet type transforms have been designed with some guarantees on fast decay of coefficients in the transform domain (See for example [27] and [28]). These transforms are designed such that the perfect reconstructions of the signals are possible, i.e. bijective mapping.

Relatively few research results have been reported in the exemplar based adaptation of analysis operators. This is an important part of data modelling, which has the potential to improve the performance of many current signal processing applications.

### 1.3Contributions

This paper investigates the problem of learning an analysis operator from a set of exemplars, for cosparse signal modelling. Our approach to analysis operator learning can be viewed as the counterpart to the optimisation in the analysis model setting. Here is the summary of our contributions in this work:

1. We propose to formulate analysis operator learning (AOL) as a constrained optimisation problem. Perhaps surprisingly, in contrast to its synthesis equivalent—the dictionary learning problem—the natural formulation of the analysis operator learning problem as an optimisation problem has trivial solutions even when scaling issues are dealt with. The constraints presented in this report exclude such trivial solutions but are also compatible with a number of conventional analysis operators, e.g. curvelets [27] and wave-atoms [28], giving further support for our proposal.

2. We provide a practical optimisation approach for AOL problem, based upon projected subgradient method. Clearly, an AOL principle which does not permit computationally feasible algorithm is of limited value. By implementing a practical algorithm, we also open the possibility to cope with large scale problems.

3. We give preliminary theoretical results toward a characterisation of the local optimality conditions in the proposed constrained optimisation problem. This helps us to better understand the nature of the admissible operator set. It also has potential use in the initialisation of our AOL problem.

Our approach is developed from the idea that was primarily suggested in [29] and [30].

### 1.4Related Work

Remarkably, even though the concept of cosparse analysis modelling has only been very recently pointed out as distinct from the more standard synthesis sparse model [11], it is already gaining momentum and a few other approaches have already been proposed to learn analysis operators. The most important challenge in formulation of the AOL as an optimisation problem is to avoid various trivial solutions. Ophir et al. used a random cycling approach to statistically avoiding the optimisation problem becoming trapped in a trivial solution [32]. Peyré et al. used a geometric constraint, using a linearisation approach, in the operator update step, to get a sensible local optimum for the problem [33]. In a recent approach, Elad et al. have introduced a K-SVD type approach to update each row of the operator at a time [34]. While these approaches have shown promising empirical results, a specificity of our approach is its explicit expression as an optimisation problem.1 We expect that this will open the door to a better understanding of the AOL problem, through mathematical characterisations of the optima of the considered cost function.

### 1.5Notation and Terminology

In this paper we generally use bold letters for vectors and bold capital letters for matrices. We have presented a list of the most frequent parameters in Table 1. We have also presented the corresponding parameters and their range-spaces, used in the related papers, in the same table. The notation , or simply subscript , has been used to specify the element locating in the row and column of the operand matrix. , and are respectively , and Frobenius norms. With an abuse of notations, we will use for the norm defined by the entrywise sum of absolute values of the operand matrix, which is different from the operator norm for matrices. The notation represents the canonical inner-product of two operands for, respectively, and Frobenius norms, in the vector-valued and matrix-valued vector spaces. denotes the trace of the operand matrix. The notation will be used to represent the orthogonal projection whose range is specified by the subscript, e.g. . Sub- and super- scripts in braces are used to indicate iteration number in the algorithm section.

The sparsity of in is noted here by , which is the cardinality of the support of , where is the sparse representation of . We can similarly define the cosparsity of [31], with respect to , and note it by , which is the cardinality of its co-support, i.e. .

Following the notation of [36], a sub-indexed vector or matrix will be determined using a subscript for the original parameter, e.g. . Here, has the dimension of , identical values as in its support and zeros elsewhere. We also use “bar”, e.g. , to denote a complement of the index set in this context.

### 1.6Organisation of the Paper

In the following section, we formulate the analysis operator learning problem as a constrained optimisation problem. We briefly review some of canonical constraints for the exemplar based learning frameworks and explain why they are not enough. we introduce a new constraint to make the problem ‘well-posed’ in ?. After the formulation of the AOL optimisation problem, we introduce a practical algorithm to find a sub-optimal solution for the problem in Section 3. Following the algorithm, in Section 4, we will present some simulation results for analysis operator learning with different settings. The simulations are essentially based on two scenarios, synthetic operators and operators for image patches. In the appendix, we will investigate the local optimality conditions of operators for the proposed learning framework.

## 2Analysis Operator Learning Formulation

The aim of analysis operator learning is to find an operator adapted to a set of observations of the signals , where is an element (column) of sample data and denotes the information available for the learning algorithm. For our purposes, we set , where denotes the (potential) noise. The data is assumed to be of full rank since otherwise, we can reduce the dimension of the problem with a suitable orthogonal transform and solve the operator learning in the new low dimension space.

A standard approach for these types of model adaptation problems, is to define a relevant optimisation problem such that its optimal solution promotes maximal sparsity of . The penalty can be used as the sparsity promoting cost function. As hinted in the introduction, we will use the convex penalty. The extension to an AOL is left for a future work.

### 2.1Constrained Analysis Operator Learning (CAOL)

Momentarily assume that we know . Unconstrained minimisation of , based on , has some trivial solutions. A solution for such a minimisation problem is ! A suggestion to exclude such trivial solutions is to restrict the solution set to an admissible set . Not assuming that is known any more, AOL can thus be formulated as,

where is the parameter corresponding to the noise. We call the AOL to be a noiseless operator learning when .

We prefer to use an alternative, regularised version of (Equation 3), using a Lagrangian multiplier , to simplify its optimisation. The reformulated AOL is the following problem,

If , problem (Equation 4) is similar to the noiseless case. In the following section, we explore some candidates for .

### 2.2Constraints for the Analysis Operators

A suitable admissible set should exclude undesired solutions of AOL while minimally affecting the rest of the search space. We name and as some undesired solutions of (Equation 3). These operators clearly do not reveal low-dimensional structures of the signals of interest. For simplicity, we again assume that is given, i.e .

Here, we intially propose some constraints for the problem (Equation 3) and explain why some of them can not individually exclude undesired solutions. A combined constraint , which is the Uniform Normalised Tight Frame (UNTF), will be introduced subsequently. The proposed constraint is smooth and differentiable.

#### Row norm constraints are insufficient

The first constraint is on the norms of rows of , i.e. for the row. We can find the solution of the optimisation by first finding that minimises . Then, the optimum is obtained by repeating , i.e., is the minimum. Of course, (see Figure ?.b), and the solution is not interesting enough.

#### Row norm + full rank constraints are insufficient

The set of full rank operators is not closed, and the operator belongs to its closure. As a result, the problem (Equation 3) with does not admit a minimiser but there is a sequence of operators, arbitrarily close to , that yield an objective function arbitrarily close to its infimum value.

#### Tight frame constraints are insufficient

In a complete setting , an orthonormality constraint can resolve the ill-conditioning of the problem. The rows of are geometrically as separated as possible. Letting , the orthonormality constraint is not further applicable, i.e. more rows than the dimension of space. In this case, the orthonormality constraint in the ambient space, and , where are respectively the and columns of , is possible. The admissible set of this constraint is the set of tight frames in , i.e. , where is the identity operator in . The admissible set is smooth and differentiable, which is a useful property for optimisation over this set. Such a constraint actually constructs a manifold in the space of , called the (orthogonal) Stiefel manifold [37].

Although the tight frame constraint may look appropriate to avoid “trivial” solutions to (Equation 3), preliminary empirical and theoretical investigations indicate that the analysis operator minimising (Equation 3), using this constraint, is always an orthonormal basis completed by zero columns (see Figure ?.d). This is possibly caused by the fact that the zero elements in such operators, are orthogonal to all finite signals, without giving any insight about their directions, and the operators can thus be truncated to produce complete operators. Therefore, this constraint does not bring anything new compared to the orthonormality constraint.

#### Proposed constraint: Uniform Normalised Tight Frame

This motivates us to apply an extra constraint. Here, we choose to combine the uniformly normalised rows and the tight frame constraints, yielding the UNTF constraint set. This constraint will be used in the rest of paper and the analysis of Appendix Section 6 is based on this admissible set.

The UNTF’s are actually in the intersection of two manifolds, uniform normalised (UN) frames manifold and tight frames (TF) manifold. There exist some results that guarantee the existence of such tight frames, i.e. non-emptiness of the intersection of two manifolds, for any arbitrary dimensions and [38]. This is indeed an important fact, as the optimisation with such a constraint is then always feasible.

This constraint can be written as follows,

and is the th row of . Since is not convex, (Equation 4) may have many separate local optima. Finding a global optima of (Equation 4) is not easy and we often find a local optima using variational methods. Using variational analysis techniques, we can also check if an operator is a local optimum. Details are presented in Appendix Section 6.

The exact recovery, using a variational analysis, in this setting, is essentially based on the following property,

This definition of local identifiability is a generalisation of the (global) identifiability as here we can not find the global solution. Such a property has been previously investigated in the context of dictionary learning in [36] and [39].

### 2.3Extension of UNTF Constraints

In some situations, we may not necessarily want to require that form a frame for the whole of . The inspiration for such a case comes from the finite difference operators which is closely tied to the popular TV-norm. When we model cartoon-like piece-wise constant images, an analysis operator need only to detect transitions in neighboring pixel values. Therefore, all the rows of can be orthogonal to the DC component.

We can facilitate learning similar rank-deficient operators by extending the UNTF constraints. Here, we know that , where is the known null-space. We can modify (Equation 5) to construct the following constraint,

where is the orthogonal projection operator onto the span of .

In the following, we introduce a practical algorithm to find a suboptimal solution of (Equation 4), constrained to or , using a projected subgradient based algorithm.

## 3Constrained Analysis Operator Learning Algorithm

Cosparse signals, by definition, have a large number of zero elements in the analysis space. In this framework, a convex formulation for recovering a signal from its (noisy) observation is to minimise the following convex program,

where is a fixed analysis operator. As (Equation 7) is a convex problem, many algorithms have been proposed to solve this program very efficiently (see [40],[41] and [42] as some examples of the latest methods). These methods can easily be extended to the matrix form to find given . Unfortunately, the AOL problem , due to the freedom to also change in this formulation, is very difficult.

This difficulty is similar to what we face in the dictionary learning for synthesis sparse approximation. In the dictionary learning problem, we have also a joint optimisation problem (Equation 2) that, in a simple setting, can be minimised in an alternating optimisation approach [43] as follows,

An alternating minimisation technique can also be used for the AOL problem to seek a local minimum. In this framework we first keep fixed and update the operator . We then update the cosparse approximations of the signals while keeping the operator fixed. Such an alternating update continues while the new pair is very similar to the previous or is repeated for a predetermined number of iterations. A pseudocode for such an alternating minimisation AOL is presented in Algorithm ?, Analysis Operator Learning Algorithm (AOLA). Each subproblem, i.e line ? or line ?, of AOLA can be solved separately based upon a single parameter. Although the problem in line ? is convex and it can thus be solved in a polynomial time, the problem in line ? is not a convex problem due to the UNTF constraint .

Note that when the cosparse matrix is given, i.e. in a noiseless scenario , the algorithm only has the operator update step, which we need to repeat until convergence. This step is also called noiseless AOL[29].

### 3.1Analysis Operator Update

We use a projected subgradient type algorithm to solve the operator update subproblem in line ? of the AOLA. The subgradient of the objective is , where is an extended set-valued sign function defined as follows,

In the projected subgradient methods, we have to choose a value in the set of subgradients. We randomly choose a value in , when the corresponding element is zero.

After the subgradient descent step, the modified analysis operator is no longer UNTF and needs to be projected onto the UNTF set. Unfortunately, to the authors’ knowledge, there is no analytical method to find the projection of a point onto this set. Many attempts have been done to find such an (approximate) projection, see for example [44] for an alternating projections and [45] for an ordinary differential equations based method. We rely on an alternating projections approach.

Projection of an operator with non-zero rows, onto the space of fixed row norm frames is easy and can be done by scaling each row to have norm . We use to denote this projection. If a row is zero, we set the row to a normalised random vector. This means that is not uniquely defined. This is due to the fact that the set of uniformly normalised vectors is not convex. The projection can be found by,

where is a random vector on the unit sphere.

Projection of a full rank matrix onto the tight frame manifold is possible by calculating a singular value decomposition of the linear operator [44]. Let be the given point and be a singular value decomposition of and be a diagonal matrix with identity on the main diagonal. The projection of can be found using,

Note that, although there is no guarantee to converge to a UNTF using this method, this technique practically works very well [44]. As the projected subgradient continuously changes the current point, which needs to be projected onto UNTF’s, we only use a single pair of projections at each iteration of the algorithm, to reduce the complexity of the algorithm, where the algorithm now asymptotically converges to a UNTF fixed point. To guarantee a uniform reduction of the objective , we can use a simple line search technique to adaptively reduce the stepsize. It indeed prevents any large update, which increases 2 . A pseudocode of this algorithm is presented in Algorithm ?, where is the maximum number of iterations.

For the constraint set mentioned in Section 2.3, we present a simple modification to the previous algorithm in order to constrain the operator to have a specific null-space. Here, we need to find a method to (approximately) project an operator onto . Following the alternating projection technique, which was used earlier to project onto , we only need to find the projection onto the set , as we already know how to project onto UN, i.e. . To project an onto the set of TF’s in , we need to compute the singular value decomposition of , projected into the orthogonal complement space of . The projection onto , , can be found using an arbitrary orthonormal basis for and simply subtracting the projection onto this basis, from the original . We can then rewrite the decomposition as . If the dimension of is , only singular values of can be non-zero. We find the constrained projection as follows,

where matrix is a diagonal matrix in , with identity values on the first diagonal positions, while the last elements are set to zero.

We therefore need to alternatingly use and to find a point in the intersection of these two constraint sets. In Algorithm ?, we only need to replace with to consider the new constraint .

### 3.2Cosparse Signal Update

When is known or given by line ? of AOLA, a convex program based on needs to be solved. Although this program is a matrix valued optimisation problem, it can easily be decoupled to some vector valued subproblems based upon the columns of . One approach is to individually solve each subproblem. Here we present an efficient approach to solve this program in the original matrix-valued form, despite the challenging nature of it, due to its large size.

The main difficulties are, a) is not differentiable and b) inside the penalty, does not allow us to use conventional methods for solving similar penalised problems. We here use the Douglas-Rachford Splitting (DRS) technique to efficiently solve the cosparse signal approximation of the AOLA. It is also called the alternating direction method of multipliers (ADMM) in this setting, see [46] for a brief overview. This technique has indeed been used for the Total Variation (TV) and analysis sparse approximations in [40][47]3. We here only present a simple version of the DRS technique, carefully tailored for this problem. The problem is a constrained convex program with two parameters and as follows,

The Augmented Lagrangian (AL) [49] method is applied to solve (Equation 8). In the Lagrangian multiplier method we use the dual parameter and add a penalty term, . In the AL, we also add an extra quadratic penalty related to the constraint and derive the new objective as follows,

where is a constant parameter. According to the duality property, the solution of coincides with the solution of (Equation 8). Using the DRS method, we iteratively optimise a convex/concave surrogate objective , where is the current estimation of . The fixed points of the iterative updates of are the same as , as the extra term vanishes in any fixed points. is convex with respect to and and concave with respect to . We can thus iteratively update each of the parameters, while keeping the rest fixed, see Algorithm ?. In this algorithm, , with an , is the entrywise soft-threshold operator defined by [50]. Note that the update formula for in Algorithm ?, line ?, involves a matrix inversion, which is computationally expensive. As is a tight frame here, the matrix inversion is significantly simplified using the fact that is identity. In this case, the operator is simply a scaling with .

We iterate Algorithm ? for a number of iterations or until the parameters cease to change significantly. Although the convergence of the iterative updates of this algorithm can separately be investigated, it can also be deduced using the fact that it is a particular case of DRS, which converges under mild conditions [46].

## 4Simulations

We present two categories of simulation results in this section. The first part is concerned with empirical demonstration of (almost) exact recovery of the reference analysis operators. An astute reader might ask: why should we care about exact recovery at all? Is it not enough to show that learned operators are good? This is, of course, true. However, when the signals are generated to be cosparse with respect to a reference operator or they are known to be cosparse with respect to a known analysis operator, the recovery of those operators is a good demonstration of the effectiveness of our method. Going further, one may imagine situations where the rows of the reference analysis operators explain certain properties/dynamics of the signals; for example, finite differences for piecewise constant images detect edges. Obviously, the recovery of reference operators in these contexts is significant.

In the second part, we are interested in demonstrating the effectiveness of learned operators in a practical task, namely, image denoising.

### 4.1Exact recovery of analysis operators

For the first experiment, the reference operators and the cosparse signal data set were generated as follows: A random operator was generated using i.i.d. zero mean, unit variance normal random variables4. The reference analysis operator is made by alternatingly projecting onto the sets of ’s and ’s. A set of training samples was generated, with different cosparsities, by randomly selecting a normal vector in the orthogonal complement space of a randomly selected rows of . Such a vector has (at least) zero components in , and it is, thus, cosparse.

To initialise the proposed algorithm, we used a linear model to generate the initial by combining the reference operator and a normalised random matrix , i.e. , and then alternatingly projecting onto and . It is clear that when is zero, we actually initialise with the reference model and when , the initial will be random.

First, we chose a set of size of such training corpus and used the noiseless formulation (Equation 3), where . The AOL algorithm in this setting is just the projected subgradient base algorithm, Algorithm ?, which was iterated times. To check the local optimality of the operator and the size of basin of attraction, we chose and . The average percentage of operator (rows) recovery, i.e. the maximum distance of any recovered row and the closest row of the reference operator, is not more than , for different cosparsity and 100 trials, are plotted in Figure ?. We practically observe that the operator is the local optimum even when the cosparsity of the signal is as low as 3. We also see that the average recovery reduces by starting from a point far from the the actual reference operator or a random operator, which is the case .

We now investigate the role of on the average operator recovery by some simulations. We kept the previous experiment settings and repeated the simulations for two new training sets, with populations of and , which are and times of the population in the previous experiments. We show the average operator recovery for in Figure ?. The simulation results show not only that can be locally identified with even less cosparse signals, smaller , but the basin of attraction is also extended and now the reference operator can be recovered by starting from a distant initial point, even using 2 cosparse signals.

In the next two experiments, we show that if the cosparsity is low, i.e. small , then the analysis operator cannot be “recovered”: it is not a solution of the proposed optimization problem (Equation 3). One way to demonstrate it would consist in solving the large scale convex optimisation program ( ?) expressed in Proposition ? of the Appendix Section 6. Alternatively, we can use Lemma ? (respectively Theorem ?) of the appendix, to possibly find a matrix , which violates the conditions. Here, we randomly generated 1000 in , and checked whether the inequality ( ?) (respectively ( ?)) was satisfied. We repeated this process for 10 different pseudo-randomly generated and plotted the percentages of satisfying the inequality in Figure ?, respectively in the left column (a) and the right column (b). We have also mentioned the training size, i.e. on each row of this figure. From left column, we can see that with 1-cosparse signals, there are operators which are not local minimum of the program (Equation 3) (). We can also see that the relaxation used to derive Theorem ?, makes the result of this Theorem very conservative, i.e. there are many cases which do not satisfy this theorem, but they still can be locally-identifiable.

For the last experiment, we chose the Shepp Logan Phantom, a well-known piecewise constant image in the Magnetic Resonance Imaging (MRI) community (see Figure ?). We used blocks of size 8 by 8, randomly selected from the image, such that, except one DC training sample, all the training images contain some edges. We learned a operator for the vectorised training image patches, by initialising the (noise-less) AOL algorithm with a pseudo-random UNTF operator. We used two different constraints and for the learning operator , i.e. (Equation 5) and (Equation 6), where the null space was selected to be constant images. We chose the second constraint to see if we would find a different operator by enforcing a similar null-space to the Finite Difference (FD) operator.

The learned operators are shown in Figure ?. We found operators which have many FD type elements. The operator which we learned using in (Equation 6) seems to have more FD type elements. We also compare the level of objective value for the learned operator, using in (Equation 5), and a reference operator. As a two dimensional FD operator is not a UNTF, we chose a two-times overcomplete Haar wavelet—see the right panel of Figure ?—which has some similarities with the FD, in the fine scale. The objective of (Equation 4) for the learned operator in different steps of training and the Haar based operator are shown in Figure ?. This shows that the learned operator finally outperforms the baseline operator in the objective value. Note that if we initialise the algorithm with the Haar based operator, the AOL algorithm does not provide a new operator. This empirically shows that the reference operator is a local optimum for the problem (Equation 4), where . This fact can be investigated using the analysis provided in Appendix Section 6, i.e. randomly generating admissible vectors and checking ( ?) or ( ?).

### 4.2Image denoising with learned analysis operators

For the next experiments we used a set of face images which are centred and cropped [51]. Such images can be modelled as approximately piecewise smooth signals. A pseudo-random admissible has been used as an initial analysis operator and a training set of size of image patches from 13 different faces, after normalised to have mean zero, have been used to learn the operators ().

We applied both of the noiseless AOL Algorithm and the noise-aware AOL Algorithm to demonstrate how much the cosparsities of the training samples increase using the noise aware formulation. (NL)AOL algorithm was iterated times and NAAOL algorithm iterated for iterations with , while the inner-loop, i.e. Algorithm ?, was iterated 100000 times.

A plot of the analysis coefficients for a selected along with its corresponding cosparsities, with three different ’s, are presented in Figure ?. The initial operator has been applied to in . Not surprisingly, the signal is not cosparse with this arbitrary operator (). In , the same plot is drawn using the learned operator with (NL)AOL. Although some coefficients are small, most are not zero, and . In the last plot, we have shown the analysis coefficients for using the learned operator with NAAOL. It is clear that the cosparsity has been increased significantly (from to ). We have further plotted the cosparsities of the first training samples ’s using the learned operator found by (NL)AOL and corresponding approximations ’s, which are found by NAAOL, in Figure ?. This figure also shows, the operator learning using the noise aware formulation (Equation 4), where is finite, results in much greater cosparsity. We show the learned operator found using NAAOL algorithm in Figure ?. This experiment suggests that a harmonic type operator should perform better than FD based operators for such images. This is an interesting observation which should be explored in the future in more detail.

The aim of next experiment is to compare denoising performance of an image using a learned operator and a FD operator [52] which is similar to the analysis operator in the TV denoising formulation and is shown in the right panel of Figure ?). We keep the previous experiment settings. The learned operator and the FD operator can now be used to denoise a corrupted version of another face from the database, using (Equation 7). The original face is shown in Figure ? (a) and the noisy version, with additive i.i.d Gaussian noise, is shown in (b), with PSNR = 26.8720 dB. Denoising was performed using two different regularisation settings: . The bottom two rows show the denoised images using the FD operator and the learned operator. We can visually conclude that the two operators successfully denoise the corrupted images with some slight differences. The results with the learned operators are smoother (this is mostly visible on a screen rather than a printed copy of the paper). The average PSNR’s of the denoised images over 100 trials are presented in Table 2. Although we get a marginally better average PSNR using the learned operator instead of FD when , we get a noticeably better average PSNR, i.e., 1 dB, when .

As the initial goal was to increase the cosparsities of signals, we have also shown the cosparsities of different patches of the selected face image (see Figure ?). The horizontal axis presents the index number of the patches. To compare the cosparsity using these operators, we have plotted their differences and its average in the bottom plot. Positive values here demonstrate the cases when the learned operator is a better operator than the FD operator. The average, which is indeed positive, is plotted as a horizontal line. As a result, the learned operator here performs much better (15% improvement) than the FD operator.

The synthesis framework has been used for different applications with some very promising results. In the last experiment, we have a comparative study between the two frameworks, for denoising face images. We chose the settings of the previous experiment and used the patch based learning. For the reference, we used a two times overcomplete DCT, which is a standard selection for the sparsity based denoising [53]. For the synthesis sparse representation, we used OMP method, and for the analysis sparse recovery, we used a convex formulation similar to the previous experiment. The optimum value of in the analysis was . The OMP was running to achieve an approximation error equal to , where is the standard deviation of the additive noise. This setting has been used in [53], for denoising with the learned synthesis dictionary using K-SVD method. We used another technique, which has also been used in the mentioned reference, to average out over the overlapping-blocks of images to reduce the blocking effect. The average PSNR’s of the denoised face images over 5 trials, for the DCT dictionary/operator are shown in the first row of Table 3.

This experiment demonstrates that the synthesis framework works better in the denoising of face images. We already have techniques to learn dictionaries/operators for each of these three cases. For the OMP and analysis based denoisings, we respectively chose K-SVD [54] and the proposed CAOL methods. The Lagrange multiplier for the analysis based denoising was , while we used the operator learned in the previous experiment. The average PSNR’s of the denoised face images over 5 trials are presented in the second row of Table 3.

By comparing the synthesis denoising techniques, with the overcomplete DCT and K-SVD dictionaries, we find a slight improvement (+0.2 dB) in the PSNR of the denoised images, using the learned dictionary. This improvement is more significant in the analysis denoising framework, as we achieved more than 2.5 dB improvement in the PSNR of the denoised image. In comparison between two frameworks, although the PSNR of the denoised image using the learned operator is significantly improved, it is marginally behind the synthesis competitor. However, we found these results very promising, since this is only the beginning of the field “denoising with the learned analysis operators”. More experiments are necessary to find a reason for such a behaviour, which we leave it for the future work.

## 5Summary and Conclusion

In this paper, we presented a new concept for learning linear analysis operators, called constrained analysis operator learning. The need for a constraint in the learning process comes from the fact that we have various trivial solutions, which we want to avoid by selecting an appropriate constraint. A suitable constraint for this problem was introduced after briefly explaining why some of the canonical constraints in model learning problems are not sufficient for this task. Although there is no claim that the introduced constraint is the most suitable selection, we practically observed very good results with the introduced algorithm.

In the simulation part, we showed some results to support our statements about the relevance of the constraint and the learning algorithm. We actually demonstrated that the learning framework can recover a synthetic analysis operator, when the signals are cosparse and enough training signals are given. By applying the algorithm on two different types of image classes, i.e. piecewise constant and approximately piecewise smooth images, we learned operators which have respectively similarities with the finite difference and harmonic type operators. This observation emphasises the fact that we benefit from selecting the most appropriate analysis operator for image processing tasks. The Matlab code of proposed simulations will be available, as a part of the dictionary learning toolbox SMALLbox [55], later.

The proposed constrained optimisation problem is a non-convex program, for which we can often find a local optimum using variational optimisation methods. In the appendix, we characterise the local optima of such programs. This can be useful to avoid initialising the algorithm with a local minima, and also for empirically checking when an operator can be a local minima. The latter emphasises the fact that we can recover such an operator, by starting the algorithm with a point in a close neighbourhood of the operator, which we do not know a priori.

## 6Theoretical Analysis of the CAOL

If we rewrite the optimisation program (Equation 4), using the UNTF admissible set, we find the following program,

The variational analysis [56] of the objective, near to a given pair , provides the following proposition.

Proposition ? is actually checking the local optimality of a pair, using the first-order approximation of the objective, which can be achieved using and . The constraint of ( ?) is the tangent space of the constraint of (Equation 9) at ,

which is a linear subspace of .

This proposition will be used in the next two subsections to investigate the conditions for the local optimality of a point in two different scenarios: (NA)AOL and NLAOL.

### 6.1Noise Aware Analysis Operator Learning

The tightest condition for the local optimality of pair , is the following necessary and sufficient condition.

According to the proposition ?, we can check the optimality of by checking that is the only solution of ( ?). The proof here is formed by showing that the objective of ( ?) increases if is replaced with , or equivalently, 5, when ( ?) is assured. From the definitions of in ( ?) and , we have,

where . The positivity of (Equation 10) is assured when

This completes the sufficiency of ( ?) for the optimality. If ( ?) is violated, such that is greater than . As the constraint , is a linear space, when and are admissible, and are also admissible. If we set and , it is easy to show that

which contradicts the optimality of , i.e. . This completes the necessary part of the lemma.

The proof is based on finding an upper bound for the right hand side of ( ?). Note that the maximum of may be achieved when is equal to on . Thus,

This means that condition implies the condition of Lemma ? and is thus a local minimum for (Equation 9).

### 6.2Noiseless Analysis Operator Learning

When the noise and model mismatch does not exist, we can solve (Equation 9) with . This case may be considered as an ideal operator learning form, as is here exactly cosparse. This formulation can be used as a benchmark for the operator recovery, very similar to the framework in [36] for dictionary recovery. In this setting we have , we can simplify the problem as follows,

As this formulation has a single parameter , the analysis is significantly easier. (Equation 11) has , i.e. , unknown parameters and constraints. For a full rank , the system of equations is underdetermined if . As we assume is rich enough and therefore full rank, minimising the objective is reasonable if .

We use formulation (Equation 11) to recover operator , while is cosparse with . The following proposition investigate the local optimality conditions of a point for (Equation 11), which is actually equivalent to Proposition ?, when . This proposition has a flavor similar to those obtained in the context of dictionary learning for the synthesis model [36], as well as in hybrid synthesis/analysis framework [57].

Note that ( ?) has a convex objective with a linear constraint. This matrix valued convex problem has a close relation to the analysis parsimony problems investigated in [12], where acts as the analysis operator of the columns of . We present some conditions for locally identifiability of a using variational analysis. A similar technique for the analysis parsimony problems have already been used in [11].

The kernel of the matrix has an important role in the identifiability of the analysis operator. Let be a singular value decomposition of . We can now partition into two parts, i.e. one part is multiplied to , which is the non-zero diagonal part of , and , a basis for the kernel of , and find and . Let two new parameters be defined as . We can now reformulate ( ?) based on the new parameters as follows,

where is the Hadamard product. The constraint of (Equation 12) is linear, which it means, we can represent it based on a linear operator and the vector version of , as . Such an operator is derived in Appendix Section 7. has a nontrivial kernel when . We can now characterise the kernel subspace of the constraint of (Equation 12), , as follows,

As ( ?) is a convex problem, is the only solution iff and . This is equivalent to the statement based on (Equation 12) as follows, is the only solution of ( ?) iff and . We can now find the necessary and sufficient condition for the subject as follow,

As , positivity of is equivalent to the positivity of . is a linear space and therefore if , too. This freedom in choosing the sign of , lets us have for any given , such that . Therefore, the necessary and sufficient condition for to be the only solution of ( ?) is ,

This theorem is a generalisation of lemma ?. The proof is based on, firstly, finding a lower-bound for the left-hand side of ( ?) and, secondly, generalising the lemma using any possible cosparsity pattern .

is never greater than , as,

This assures that . For any , it assures that if , is identifiable based on lemma ?. Now, if we consider all providing -cosparse training samples, with some possible cosparsity , the inequality ( ?) assures the identifiability of .

## 7Deriving a simple linear representation of the constraints

The optimisation problem (Equation 12) has three constraints, where the last two are clearly linear, and they can be represented as,

where . Let . The first constraint can now be reformulated as, . To derive this equation in a similar form to (Equation 13), we reformulate , then right-multiply with .

We now reformulate the first constraint as,

We can now generate using the weights in (Equation 13) and (Equation 14), corresponding to , and make a linear presentation as .

### Footnotes

1. Peyré and Fadili [33] use a similar type of expression to learn an analysis operator in the context of signal denoising, with exemplars of noisy and clean signals.
2. The stability of algorithm is guaranteed in a Lyapunov sense, i.e. the operator is enforced to remain bounded, when .
3. [47] derives the formulation by incorporating the Bregman distance. However, it has been shown that the new method, called Alternating Bregman Splitting method, is the same as DRS, applied to the dual problems [48].
4. is not necessarily a UNTF and needs to be projected onto the set of UNTF’s.
5. This condition for the (local) optimality is related to the positivity condition of directional derivative of for any admissible and in the tangent cone(s) of the constraint(s), see for example [56].

### References

1. R. Baraniuk, V. Cevher, and M. Wakin, “Low-dimensional models for dimensionality reduction and signal recovery: A geometric perspective,” Proceedings of the IEEE, vol. 98, no. 6, pp. 959–971, 2010.
2. J. Starck, F. Murtagh, and M. Fadili, Sparse Image and Signal Processing - Wavelets, Curvelets, Morphological Diversity.1em plus 0.5em minus 0.4emCambridge University Press, 2010.
3. M. Elad, Sparse and Redundant Representations - From Theory to Applications in Signal and Image Processing.1em plus 0.5em minus 0.4emSpringer, 2010.
4. S. Mallat and Z. Zhang, “Matching pursuits with time frequency dictionaries,” IEEE Trans. on Signal Processing, vol. 41, no. 12, pp. 3397–3415, 1993.
5. S. Chen, D. Donoho, and M. Saunders, “Atomic decomposition by basis pursuit,” SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 33–61, 1998.
6. G. Davis, “Adaptive nonlinear approximations,” Ph.D. dissertation, New York University, 1994.
7. B. Natarajan, “Sparse approximate solutions to linear systems,” SIAM Journal of Comput, vol. 24, no. 2, pp. 227–234, 1995.
8. Y. Lu and M. Do, “A theory for sampling signals from a union of subspaces,” IEEE Transactions on Signal Processing, vol. 56, no. 6, pp. 2334 –2345, june 2008.
9. T. Blumensath and M. Davies, “Sampling theorems for signals from the union of finite-dimensional linear subspaces,” IEEE Transactions on Information Theory, vol. 55, no. 4, pp. 1872 –1882, april 2009.
10. Y. Eldar and M. Mishali, “Robust recovery of signals from a structured union of subspaces,” Information Theory, IEEE Transactions on, vol. 55, no. 11, pp. 5302 –5316, nov. 2009.
11. S. Nam, M. Davies, M. Elad, and R. Gribonval, “The cosparse analysis model and algorithms,” 2012, (In Press) Applied and Computational Harmonic Analysis.
12. M. Elad, P. Milanfar, and R. Rubinstein, “Analysis versus synthesis in signal priors,” Inverse Problems, vol. 23, no. 3, pp. 947–968, 2007.
13. E. Candès, Y. Eldar, D. Needell, and P. Randall, “Compressed sensing with coherent and redundant dictionaries,” ACHA, vol. 31, no. 1, pp. 59 – 73, 2011.
14. S. Vaiter, G. Peyré, C. Dossal, and J. Fadili, “Robust sparse analysis regularization,” submitted, arXiv:1109.6222v1.
15. B. Olshausen and D. Field, “Sparse coding with an overcomplete basis set: a strategy employed by V1?” Vision Research, vol. 37, no. 23, pp. 3311–3325, 1997.
16. M. Lewicki and T. Sejnowski, “Learning overcomplete representations,” Neural Comp, vol. 12, no. 2, pp. 337–365, 2000.
17. =2plus 43minus 4 D. L. Donoho, “Sparse components of images and optimal atomic decompositions,” Constructive Approximation, vol. 17, pp. 353–382, 2001, 10.1007/s003650010032. [Online]. Available: http://dx.doi.org/10.1007/s003650010032 =0pt
18. M. Yaghoobi, L. Daudet, and M. Davies, “Parametric dictionary design for sparse coding,” IEEE Transaction on Signal Processing, vol. 57, no. 12, pp. 4800–4810, 2009.
19. R. Rubinstein, A. Bruckstein, and M. Elad, “Dictionaries for sparse representation modeling,” Proceedings of the IEEE, vol. 98, no. 6, pp. 1045 – 1057, June 2010.
20. I. Tošić and P. Frossard, “Dictionary learning,” IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 27–38, March 2011.
21. J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” Journal of Machine Learning Research, vol. 11, pp. 19–60, 2010.
22. S. Lesage, R. Gribonval, F. Bimbot, and L. Benaroya, “Learning unions of orthonormal bases with thresholded singular value decomposition,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, March 2005, pp. 293 – 296.
23. K. Engan, S. Aase, and J. Hakon-Husoy, “Method of optimal directions for frame design,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1999.
24. K. Kreutz-Delgado, J. Murray, B. Rao, K. Engan, T. Lee, and T. Sejnowski, “Dictionary learning algorithms for sparse representation,” Neural Comp., vol. 15, pp. 349–396, 2003.
25. H. Lee, A. Battle, R. Raina, and A. Y. Ng, “Efficient sparse coding algorithms,” in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds.1em plus 0.5em minus 0.4emCambridge, MA: MIT Press, 2007, pp. 801–808.
26. M. Yaghoobi, T. Blumensath, and M. Davies, “Regularized dictionary learning for sparse approximation,” in EUSIPCO, 2008.
27. E. Candes and L. Demanet, “The curvelet representation of wave propagators is optimally sparse,” Communications on Pure and Applied Mathematics, vol. 58, no. 11, pp. 1472–1528, 2005.
28. =2plus 43minus 4 L. Demanet and L. Ying, “Wave atoms and sparsity of oscillatory patterns,” Applied and Computational Harmonic Analysis, vol. 23, no. 3, pp. 368 – 387, 2007. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1063520307000450 =0pt
29. M. Yaghoobi, S. Nam, R. Gribonval, and M. Davies, “Analysis operator learning for overcomplete cosparse representations,” in EUSIPCO, August 2011.
30. ——, “Noise aware analysis operator learning for approximately cosparse signals,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2012.
31. S. Nam, M. Davies, M. Elad, and R. Gribonval, “Cosparse analysis modeling- uniqueness and algorithms,” in IEEE Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.
32. B. Ophir, M. Elad, N. Bertin, and M. Plumbley, “Sequential minimal eigenvalues – an approach to analysis dictionary learning,” in Proceedings of EUSIPCO, September 2011.
33. G. Peyré and J. Fadili, “Learning analysis sparsity priors,” in Proc. of Sampta’11, 2011.
34. M. Elad and R. Rubinstein, “K-SVD dictionary-learning for analysis sparse models,” in Workshop on signal processing with adaptive sparse structured representations, SPARS11, 2011.
35. R. Rubinstein, T. Faktor, and M. Elad, “K-svd dictionary-learning for the analysis sparse model,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 5405–5409.
36. R. Gribonval and K. Schnass, “Dictionary identification - sparse matrix-factorisation via minimisation,” IEEE Trans. on Info. Theory, vol. 56, no. 7, pp. 3523–3539, July 2010.
37. P. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds.1em plus 0.5em minus 0.4emPrinceton University Press, 2008.
38. =2plus 43minus 4 J. J. Benedetto and M. Fickus, “Finite normalized tight frames,” Advances in Computational Mathematics, vol. 18, no. 2, pp. 357 –385, 2003. [Online]. Available: http://dx.doi.org/10.1023/A:1021323312367 =0pt
39. Q. Geng, H. Wang, and J. Wright, “On the local correctness of l1 minimization for dictionary learning,” CoRR, Tech. Rep. abs/1101.5672, 2011.
40. M. Afonso, J. Bioucas-Dias, and M. Figueiredo, “Fast image recovery using variable splitting and constrained optimization,” IEEE Trans. on Image Processing, vol. 19, no. 9, pp. 2345–2356, 2010.
41. S. Becker, J. Bobin, and E. J. Candes, “Nesta: a fast and accurate first-order method for sparse recovery,” 2009, submitted for publication.
42. S. R. Becker, E. J. Candès, and M. C. Grant, “Templates for convex cone problems with applications to sparse signal recovery,” Mathematical Programming Computation, vol. 3, no. 3, pp. 165–218, 2011.
43. M. Yaghoobi, T. Blumensath, and M. Davies, “Dictionary learning for sparse approximations with the majorization method,” IEEE Trans. on Signal Processing, vol. 57, no. 6, pp. 2178–2191, 2009.
44. J. Tropp, I. Dhillon, R. Heath Jr., and T. Strohmer, “Designing structural tight frames via an alternating projection method,” IEEE Trans. on Information Theory, vol. 51, no. 1, pp. 188–209, 2005.
45. =2plus 43minus 4 B. G. Bodmann and P. G. Casazza, “The road to equal-norm parseval frames,” Journal of Functional Analysis, vol. 258, no. 2, pp. 397 – 420, 2010. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0022123609003619 =0pt
46. J. Eckstein and D. Bertsekas, “On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators,” Mathematical Programming, vol. 55, no. 1, pp. 293 –318, 1992.
47. T. Goldstein, , and S. Osher, “The split Bregman method for l1-regularized problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 2, pp. 323 – 343, 2009.
48. S. Setzer, Split Bregman Algorithm, Douglas-Rachford Splitting and Frame Shrinkage, X.-C. Tai, K. Mœrken, M. Lysaker, and K.-A. Lie, Eds.1em plus 0.5em minus 0.4emSpringer Berlin / Heidelberg, 2009.
49. R. Rockafellar, “Augmented lagrange multiplier functions and duality in nonconvex programming,” SIAM J. Control, vol. 12, pp. 268 – 286, 1974.
50. D. Donoho and J. Johnstone, “Ideal spatial adaptation by wavelet shrinkage,” Biometrika, vol. 81, no. 3, pp. 425–455, 1994.
51. K. Lee, J. Ho, and D. Kriegman, “Acquiring linear subspaces for face recognition under variable lighting,” PAMI, vol. 27, no. 5, pp. 684–698, May 2005.
52. L. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D: Nonlinear Phenomena, vol. 60, no. 1-4, pp. 259–268, November 1992.
53. M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Trans. on Image Processing, vol. 15, no. 12, pp. 3736–3745, December 2006.
54. M. Aharon, E. Elad, and A. Bruckstein, “K-SVD: an algorithm for designining of overcomplete dictionaries for sparse representation,” IEEE Trans. on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.
55. I. Damnjanovic, M. E. P. Davies, and M. D. Plumbley, “Smallbox - an evaluation framework for sparse representations and dictionary learning algorithms,” in LVA/ICA 10, 2010, p. 418–425.
56. R. Rockafellar and R.-B. Wets, Variational Analysis.1em plus 0.5em minus 0.4emGrundlehren der Math. Wissenschaften, Springer, Berlin, 1997, vol. 317.
57. F. Jaillet, R. Gribonval, M. Plumbley, and H. Zayyani, “An L1 criterion for dictionary learning by subspace identification,” in IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), 2010.