Trainlets: Dictionary Learning in High Dimensions
Sparse representation has been shown to be a very powerful model for real-world signals, and has enabled the development of applications with notable performance. Combined with the ability to learn a dictionary from signal examples, sparsity-inspired algorithms often achieve state-of-the-art results in a wide variety of tasks. Yet, these methods have traditionally been restricted to small dimensions, mainly due to the computational constraints that the dictionary learning problem entails. In the context of image processing, this implies handling small image patches.
In this work we show how to efficiently handle bigger dimensions and go beyond the small patches in sparsity-based signal and image processing methods. We build our approach based on a new cropped wavelet decomposition, which enables a multi-scale analysis with virtually no border effects. We then employ this as the base dictionary within a double sparsity model to enable the training of adaptive dictionaries. To cope with the increase of training data, while at the same time improving the training performance, we present an Online Sparse Dictionary Learning (OSDL) algorithm to train this model effectively, enabling it to handle millions of examples. This work shows that dictionary learning can be up-scaled to tackle a new level of signal dimensions, obtaining large adaptable atoms that we call trainlets.
Sparse representations over redundant dictionaries have been shown to be a very powerful model for many real-world signals, enabling the development of applications with notable performance in many signal and image processing tasks. The basic assumption of this model is that natural signals can be expressed as a sparse linear combination of atoms, chosen from a collection called a dictionary. Formally, for a signal $\mathbf{y} \in \mathbb{R}^{n}$, this can be described by $\mathbf{y} = \mathbf{D}\mathbf{x}$, where $\mathbf{D} \in \mathbb{R}^{n \times m}$ ($n < m$) is a redundant dictionary that contains the atoms as its columns, and $\mathbf{x} \in \mathbb{R}^{m}$ is the representation vector.
Given the signal $\mathbf{y}$, finding its representation can be done in terms of the following sparse approximation problem:
$$\min_{\mathbf{x}} \|\mathbf{x}\|_0 \quad \text{subject to} \quad \|\mathbf{y} - \mathbf{D}\mathbf{x}\|_2 \le \epsilon, \tag{1}$$
where $\epsilon$ is a permitted deviation in the representation accuracy, and the expression $\|\mathbf{x}\|_0$ is a count of the number of non-zeros in the vector $\mathbf{x}$. The process of solving the above optimization problem is commonly referred to as sparse coding. Solving this problem is in general NP-hard, but several greedy algorithms and other relaxation methods allow us to solve the problem exactly under certain conditions, and to obtain useful approximate solutions in more general settings. These methods include Matching Pursuit (MP), Orthogonal Matching Pursuit (OMP), Basis Pursuit (BP) and FOCUSS, among others.
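The greedy pursuit mentioned above can be sketched in a few lines. The following is a minimal, unoptimized illustration of OMP, assuming unit-norm atoms; the function name `omp` and its parameters are hypothetical and are not taken from the software package that accompanies this work.

```python
import numpy as np

def omp(D, y, eps=1e-6, max_atoms=None):
    """Greedy pursuit for: min ||x||_0  s.t.  ||y - D x||_2 <= eps.

    Assumes the columns (atoms) of D have unit l2-norm.
    """
    n, m = D.shape
    max_atoms = n if max_atoms is None else max_atoms
    support = []
    x = np.zeros(m)
    residual = np.asarray(y, dtype=float).copy()
    while np.linalg.norm(residual) > eps and len(support) < max_atoms:
        # Select the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break  # numerical stagnation: no new atom can be added
        support.append(j)
        # Orthogonal step: least-squares fit of y on all selected atoms.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x = np.zeros(m)
        x[support] = coef
        residual = y - D @ x
    return x

# Usage: with an orthonormal dictionary the greedy selection is exact.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))
x_true = np.zeros(16)
x_true[[2, 7, 11]] = [1.5, -2.0, 0.7]
x_hat = omp(Q, Q @ x_true)
```

For redundant dictionaries exact recovery is only guaranteed under additional conditions on the atoms, so the orthonormal case is used here purely because its outcome is unambiguous.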
A fundamental element in this problem is the choice of the dictionary $\mathbf{D}$. While some analytically-defined dictionaries (or transformations) such as the overcomplete Discrete Cosine Transform (ODCT) or Wavelet dictionaries were used originally, learning the dictionary from signal examples for a specific task has been shown to perform significantly better. This adaptivity to the data allows sparsity-inspired algorithms to achieve state-of-the-art results in many tasks. The dictionary learning problem can be written as:
$$\min_{\mathbf{D}, \mathbf{X}} \frac{1}{2}\|\mathbf{Y} - \mathbf{D}\mathbf{X}\|_F^2 \quad \text{subject to} \quad \|\mathbf{x}_i\|_0 \le p \;\; \forall i, \tag{2}$$
where $\mathbf{Y} \in \mathbb{R}^{n \times N}$ is a matrix containing $N$ signal examples, $\mathbf{X} \in \mathbb{R}^{m \times N}$ contains the corresponding sparse vectors $\mathbf{x}_i$, both ordered column wise, and $p$ is the maximal number of non-zeros allowed per representation. Several iterative methods have been proposed to handle this task [8, 9, 10]. Due to the computational complexity of this problem, all these methods have been restricted to relatively small signals. When dealing with high-dimensional data, the common approach is to partition the signal into small blocks, where the dictionary learning problem is more feasible.
In the context of image processing, small signals imply handling small image patches. Most state-of-the-art methods for image restoration exploit such a localized patch based approach [11, 12, 13]. In this setting, small overlapping patches ( - ) are extracted from the corrupted image and treated relatively independently according to some image model [14, 13], sparse representations being a popular choice [15, 16, 17, 18]. The full image estimation is then formed by merging together the small restored patches by overlapping and averaging.
Some works have attempted to handle larger two-dimensional patches (i.e., greater than ) with some success. In , and later in , traditional K-SVD is applied in the Wavelet domain. These works implicitly manage larger patches while keeping the atom dimension small, noting that small patches of Wavelet coefficients translate to large regions in the image domain. In the context of Convolutional Networks, on the other hand, the work in  has reported encouraging state-of-the-art results on patches of size .
Though adaptable, explicit dictionaries are computationally expensive to apply. Some effort has gone into designing fast dictionaries that can be both applied and learned efficiently. This requirement implies constraining the degrees of freedom of the explicit matrix in some way, i.e., imposing some structure on the dictionary. One such possibility is the search for adaptable separable dictionaries, as in , or the search for a dictionary which is an image in itself as in [23, 24], lowering the degrees of freedom and obtaining (close to) shift-invariant atoms.
Another, more flexible alternative, has been the pursuit of sparse dictionaries [25, 26]. In these works the dictionary is composed of a multiplication of two matrices, one of which is sparse. The work in  takes this idea a step further, composing a dictionary from the multiplication of a sequence of sparse matrices. In the interesting work reported in  the dictionary is modeled as a collection of convolutions with sparse kernels, lowering the complexity of the problem and enabling the approximation of popular analytically-defined atoms. All of these works, however, have not addressed dictionary learning on real data of considerably higher dimensions or with a considerably large dataset.
A related but different model from the one posed in Equation (2) is the analysis model [29, 30]. In this framework, a dictionary is learned such that . A close variant is the Transform Learning model, where it is assumed that and , as presented in . This framework presents interesting advantages due to the very cheap sparse coding stage (a thresholding operation). An online transform learning approach was presented in , and a sparse transform model was presented in , enabling the training on bigger image patches. In our work, however, we constrain ourselves to the study of synthesis dictionary models.
We give careful attention to the model proposed in . In this work a double sparse model is proposed by combining a fixed separable dictionary with an adaptable sparse component. This lowers the degrees of freedom of the problem in Equation (2), and provides a feasible way of treating high dimensional signals. However, the work reported in  concentrated on 2D and 3D-DCT as a base-dictionary, thus restricting its applicability to relatively small patches.
In this work we expand on this model, showing how to efficiently handle bigger dimensions and go beyond the small patches in sparsity-based signal and image processing methods. This model provides the flexibility of incorporating multi-scale properties in the learned dictionary, a property we deem vital for representing larger signals. For this purpose, we propose to replace the fixed base-dictionary with a new multi-scale one. We build our approach on cropped wavelets, a multi-scale decomposition which overcomes the limitations of the traditional wavelet transform to efficiently represent small images (expressed often in the form of severe border effects).
Another aspect that has limited the training of large dictionaries has been the amount of data required and the corresponding amount of computation involved. As the signal size increases, a (significant) increase in the number of training examples is needed in order to effectively learn the inherent data structure. While traditional dictionary learning algorithms require many sweeps of the whole training corpus, this is no longer feasible in our context. Instead, we look to online learning methods, such as Stochastic Gradient Descent (SGD). These methods have gained prominence in recent years with the advent of big data, and have been used in the context of traditional (unstructured) dictionary learning and in training the special structure of the Image Signature Dictionary. We present an Online Sparse Dictionary Learning (OSDL) algorithm to effectively train the double-sparsity model. This approach allows us to handle very large training sets while using high-dimensional signals, achieving faster convergence than the batch alternative and providing a better treatment of local minima, which are abundant in non-convex dictionary learning problems.
To summarize, this paper introduces a novel online dictionary learning algorithm, which builds a structured dictionary based on the double-sparsity format. The proposed base dictionary is a fully separable cropped Wavelet dictionary that has virtually no boundary effects. The overall dictionary learning algorithm can be trained on a corpus of millions of examples, and is capable of representing images of size and even more, while keeping the training, the memory, and the computational load reasonable and manageable. This high-dimensional dictionary learning framework, termed trainlets, shows that global dictionaries for entire images are feasible and trainable. We demonstrate the applicability of the proposed algorithm and its various ingredients in this paper, and we accompany this work with a freely available software package.
This paper is organized as follows. In section II we review sparse dictionary models. In section III we introduce the Cropped Wavelets and show their advantages over standard Wavelets. In section IV we present the Online Sparse Dictionary Learning algorithm, comparing it to the alternative method for training such a model, Sparse KSVD, and to the Online Dictionary Learning algorithm of , which trains an unconstrained (dense) dictionary. In section V we present results from several experiments and applications to image processing, demonstrating the benefits of our proposed method, and in section VI we conclude the paper.
II Sparse Dictionaries
Learning dictionaries for large signals requires adding some constraint to the dictionary, otherwise signal diversity and the number of training examples needed make the problem intractable. Often, these constraints are given in terms of a certain structure. One such approach is the double-sparsity model. In this model the dictionary is assumed to be a multiplication of a fixed operator $\boldsymbol{\Phi}$ (we will refer to it as the base dictionary) by a sparse adaptable matrix $\mathbf{A}$. Every atom in the effective dictionary $\mathbf{D} = \boldsymbol{\Phi}\mathbf{A}$ is therefore a linear combination of few and arbitrary atoms from the base dictionary. Formally, this means that the training procedure requires solving the following problem:
$$\min_{\mathbf{A}, \mathbf{X}} \frac{1}{2}\|\mathbf{Y} - \boldsymbol{\Phi}\mathbf{A}\mathbf{X}\|_F^2 \quad \text{subject to} \quad \|\mathbf{x}_i\|_0 \le p \;\; \forall i, \quad \|\mathbf{a}_j\|_0 = k \;\; \forall j, \tag{3}$$
with at most $p$ non-zeros per representation $\mathbf{x}_i$ and $k$ non-zeros per column $\mathbf{a}_j$ of $\mathbf{A}$.
Note that the number of columns in $\boldsymbol{\Phi}$ and $\mathbf{A}$ might differ, allowing flexibility in the redundancy of the effective dictionary. The authors in  used an over-complete Discrete Cosine Transform (ODCT) as the base dictionary in their experiments. Using Wavelets was proposed but never implemented, due both to implementation issues (the traditional Wavelet transform is not entirely separable) and to the significant border effects Wavelets have in small-to-medium sized patches. We address both of these issues in the following section.
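The structure of the double-sparsity model can be made concrete with a toy example. The sketch below uses a random matrix as a stand-in for the base dictionary and arbitrary illustrative sizes (none of these values come from the paper); it only shows that each effective atom mixes a few base atoms, and that the effective dictionary never has to be formed explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m_base, m_atoms, k = 64, 128, 100, 5   # illustrative sizes only

# Stand-in for a fixed base dictionary with unit-norm atoms.
Phi = rng.standard_normal((n, m_base))
Phi /= np.linalg.norm(Phi, axis=0)

# Sparse adaptable matrix A: exactly k non-zeros in every column.
A = np.zeros((m_base, m_atoms))
for j in range(m_atoms):
    rows = rng.choice(m_base, size=k, replace=False)
    A[rows, j] = rng.standard_normal(k)

# Effective dictionary: every atom is a combination of k base atoms.
D = Phi @ A

# Synthesis of a signal from a sparse representation x.
x = np.zeros(m_atoms)
x[[3, 10]] = [1.0, -0.5]
y_explicit = D @ x            # using the explicit dictionary
y_factored = Phi @ (A @ x)    # sparse A first, then the base operator
```

In a practical implementation `A` would be stored in a sparse format and `Phi` applied through a fast (ideally separable) operator, which is where the computational savings come from.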
As for the training of such a model, the update of the dictionary is now constrained by the number of non-zeros in the columns of $\mathbf{A}$. In  a variant of the K-SVD algorithm (termed Sparse K-SVD) was proposed for updating the dictionary. Like the work in , this is a batch method that updates every atom sequentially. In the context of the double-sparsity structure, this task is converted into a sparse-coding problem, and approximated by the greedy OMP algorithm.
In the recent inspiring work reported in  the authors extended the double-sparsity model to a scenario where the base dictionary itself is a multiplication of several sparse matrices, that are to be learned. While this structure allows for a clear decrease in the computational cost of applying the dictionary, its capacity to treat medium-size problems is not explored. The proposed algorithm involves a hierarchy of matrix factorizations with multiple parameters to be set, such as the number of levels and the sparsity of each level.
III A New Wavelets Dictionary
The double sparsity model relies on a base dictionary which should be computationally efficient to apply. The ODCT dictionary has been used for this purpose in , but its applicability to larger signal sizes is weak. Indeed, as the patch size grows, getting closer to an image size, a multi-scale analysis framework becomes more desirable (see Footnote 1). The separability of the base dictionary provides a further decrease in the computational complexity. Applying two (or more) 1D dictionaries on each dimension separately is typically much more efficient than an equivalent non-separable multi-dimensional dictionary. We will combine these two characteristics as guidelines in the design of the base dictionary for our model.
Footnote 1: It is well known that when working with small patches in an image, a transform such as the 2D-DCT is highly effective. This is the reason for the success of DCT in JPEG. When the patch grows to become a small image, DCT is in fact highly ineffective as it insists on periodicity all over the support of the image. It is then Wavelets and their variants that emerge as an appealing alternative. Again, this explains the migration to Wavelets and frames when it comes to JPEG-2000 and global image restoration methods.
III-A Optimal Extensions and Cropped Wavelets
The two-dimensional Wavelet transform has been shown to be very effective in sparsifying natural (normal-sized) images. When used to analyze small or medium sized images, not only is the number of possible decomposition scales limited, but more importantly the border effects become a serious limitation. Other works have pointed out the importance of the boundary conditions in the context of deconvolution [35, 36]. However, our approach is different from these, as we will focus on the basis elements rather than on the signal boundaries, and on the pursuit of the corresponding coefficients.
In order to build (bi-)orthogonal Wavelets over a finite (and small) interval, one usually assumes their periodic or symmetric extension onto an infinite axis. A third alternative, zero-padding, assumes the signal is zero outside of the interval. However, none of these alternatives provides an optimal approximation of the signal borders. In general, all these methods do not preserve their vanishing moments at the boundary of the interval, leading to additional non-zero coefficients corresponding to the basis functions that overlap with the boundaries . An alternative is to modify the Wavelet filters such that they preserve their vanishing moments at the borders of the interval, although constructing such Wavelets while preserving their orthogonality is complicated .
We begin our derivation by looking closely at the zero-padding case. Let $\mathbf{f} \in \mathbb{R}^{n}$ be a finite signal. Consider $\bar{\mathbf{f}} = \mathbf{P}^T\mathbf{f}$, the zero-padded version of $\mathbf{f}$, where $\mathbf{P} \in \mathbb{R}^{n \times m}$, $m > n$ ($m$ is “big enough”). Considering the Wavelet analysis matrix $\mathbf{W}_a$ of size $m \times m$, the Wavelet representation coefficients are obtained by applying the Discrete Wavelet Transform (DWT) to $\bar{\mathbf{f}}$, which can be written as $\mathbf{g}_w = \mathbf{W}_a\bar{\mathbf{f}}$. Note that this is just a projection of the (zero-padded) signal onto the orthogonal Wavelet atoms.
As for the inverse transform, the padded signal is recovered by applying the inverse Wavelet transform or Wavelet synthesis operator $\mathbf{W}_s$ ($\mathbf{W}_s = \mathbf{W}_a^T$, assuming orthogonal Wavelets), of size $m \times m$, to the coefficients $\mathbf{g}_w$. Immediately after, the padding is discarded (multiplying by $\mathbf{P}$) to obtain the final signal in the original finite interval:
$$\hat{\mathbf{f}} = \mathbf{P}\mathbf{W}_s\mathbf{g}_w = \mathbf{P}\mathbf{W}_s\left(\mathbf{W}_a\mathbf{P}^T\mathbf{f}\right) = \mathbf{f}.$$
Zero-padding is not an option of preference because it introduces discontinuities in the function $\bar{\mathbf{f}}$ that result in large (and many) Wavelet coefficients, even if $\mathbf{f}$ is smooth inside the finite interval. This phenomenon can be understood from the following perspective: we are seeking the representation vector $\mathbf{g}$ that will satisfy the perfect reconstruction of $\mathbf{f}$,
$$\mathbf{P}\mathbf{W}_s\mathbf{g} = \mathbf{f}.$$
The matrix $\mathbf{P}\mathbf{W}_s$ serves here as the effective dictionary that multiplies the representation in order to recover the signal. This relation is an under-determined linear system of equations with $n$ equations and $m$ unknowns, and thus it has infinitely many possible solutions.
In fact, zero padding chooses a very specific solution to the above system, namely, $\mathbf{g}_w = \mathbf{W}_a\mathbf{P}^T\mathbf{f}$. This is nothing but the projection of the signal onto the adjoint of the above-mentioned dictionary, since $\mathbf{W}_a\mathbf{P}^T = (\mathbf{P}\mathbf{W}_s)^T$. While this is indeed a feasible solution, such a solution is expected to have many non-zeros if the atoms are strongly correlated. This indeed occurs for the finite-support Wavelet atoms that intersect the borders, and which are cropped by $\mathbf{P}$.
To overcome this problem, we propose the following alternative optimization objective:
$$\mathbf{g}_c = \arg\min_{\mathbf{g}} \|\mathbf{g}\|_0 \quad \text{subject to} \quad \mathbf{P}\mathbf{W}_s\mathbf{g} = \mathbf{f},$$
i.e., seeking the sparsest solution to this under-determined linear system. Note that in performing this pursuit, we are implicitly extending the signal $\mathbf{f}$ to become $\mathbf{W}_s\mathbf{g}_c$, which is the smoothest possible with respect to the Wavelet atoms (i.e., it is sparse under the Wavelet transform). At the same time, we keep using the original Wavelet atoms with all their properties, including their vanishing moments. On the other hand, we pay the price of performing a pursuit instead of a simple back-projection. In particular, we use OMP to approximate the solution to this sparse coding problem. To conclude, our treatment of the boundary issue is obtained by applying the cropped Wavelets dictionary $\mathbf{P}\mathbf{W}_s$, and seeking the sparsest representation with respect to it, implicitly obtaining an extension of $\mathbf{f}$ without boundary problems.
To illustrate our approach, in Fig. 1 we show the typical periodic, symmetric and zero-padding border extensions applied to a random smooth function, as well as the ones obtained by our method. As can be seen, this extension, which is nothing else than Wavelet atoms that fit in the borders in a natural way, is guaranteed not to create the discontinuities that result in denser representations (see Footnote 2). Note that we will not be interested in the actual extensions explicitly in our work.
Footnote 2: A similar approach was presented in  in the context of compression. The authors proposed to optimally extend the borders of an irregular shape in the sense of minimal -norm of the representation coefficients under a DCT transform.
To provide further evidence on the better treatment of the borders by the cropped Wavelets, we present the following experiment. We construct 1,000 random smooth functions of length 64 (3rd degree polynomials), and introduce a random step discontinuity at sample 32. These signals are then normalized to have unit $\ell_2$-norm. We approximate these functions with only 5 Wavelet coefficients (see Footnote 3), and measure the energy of the point-wise (per sample) error (in the $\ell_2$ sense) of the reconstruction. Fig. 2 shows the mean distribution of these errors. As expected, the discontinuity at the center introduces a considerable error. However, the traditional (periodic) Wavelets also exhibit substantial errors at the borders. The proposed cropped Wavelets, on the other hand, manage to reduce these errors by avoiding the creation of extra discontinuities.
Footnote 3: The m-term approximation with Wavelets is performed with the traditional non-linear approximation scheme. In this framework, orthogonal Wavelets with periodic extensions perform better than symmetric extensions or zero-padding, which we therefore omit from the comparison. We used for this experiment Daubechies Wavelets with 13 taps. All random variables were chosen from Gaussian distributions.
Practically speaking, the proposed cropped Wavelet dictionary can be constructed by taking a Wavelet synthesis matrix for signals of length $m$ and cropping it. Also, because we will be making use of greedy pursuit methods, each atom is normalized to have unit $\ell_2$ norm. This way, the cropped Wavelets dictionary can be expressed as
$$\boldsymbol{\Phi}_1^c = \mathbf{P}\mathbf{W}_s\mathcal{W},$$
where $\mathcal{W}$ is a diagonal matrix of size $m \times m$ with values such that each atom (column) in $\boldsymbol{\Phi}_1^c$ (of size $n \times m$) has a unit norm (see Footnote 4). The resulting transform is no longer orthogonal, but this (now redundant) Wavelet dictionary resolves the border issues of traditional Wavelets, enabling a lower approximation error.
Footnote 4: Because the atoms in $\mathbf{W}_s$ are compactly supported, some of them may be identically zero in the central $n$ samples. These are discarded in the construction of $\boldsymbol{\Phi}_1^c$.
Just as in the case of zero-padding, the redundancy obtained depends on the dimension of the signal, the number of decomposition scales and the length of the support of the Wavelet filters (refer to  for a thorough discussion). In practice, we set $m = 2^{\lceil \log_2 n \rceil + 1}$; i.e., twice the closest higher power of 2 (which reduces to $m = 2n$ if $n$ is a power of two, yielding a redundancy of at most 2), guaranteeing a sufficient extension of the borders.
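The construction described above (synthesis matrix for a longer signal, crop, discard empty atoms, normalize) can be sketched as follows. To keep the example self-contained it uses Haar wavelets built by hand, which is an assumption made only for illustration; the function names `haar_analysis` and `cropped_wavelet_dictionary` are hypothetical, and the central placement of the crop follows the footnote's mention of the central samples.

```python
import numpy as np

def haar_analysis(m):
    """Orthonormal Haar analysis matrix of size m x m (m a power of two)."""
    W = np.array([[1.0]])
    while W.shape[0] < m:
        size = W.shape[0]
        top = np.kron(W, [1.0, 1.0])                 # coarser-scale functions
        bottom = np.kron(np.eye(size), [1.0, -1.0])  # finest-scale details
        W = np.vstack([top, bottom]) / np.sqrt(2.0)
    return W

def cropped_wavelet_dictionary(n):
    """Crop an m x m synthesis matrix to its n central rows, then normalize."""
    m = 2 * (1 << (n - 1).bit_length())   # twice the closest higher power of 2
    Ws = haar_analysis(m).T               # synthesis operator: atoms as columns
    start = (m - n) // 2
    cropped = Ws[start:start + n, :]      # keep the n central samples of each atom
    norms = np.linalg.norm(cropped, axis=0)
    keep = norms > 1e-12                  # discard atoms that vanish after cropping
    return cropped[:, keep] / norms[keep] # unit-norm atoms

Phi_c = cropped_wavelet_dictionary(8)
```

The resulting matrix has more columns than rows and full row rank, so any signal of length `n` admits infinitely many representations; the sparsest one would then be sought with a pursuit such as OMP.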
III-B A Separable 2-D Extension
The one-dimensional Wavelet transform is traditionally extended to treat two-dimensional signals by constructing two-dimensional atoms as the separable product of two one-dimensional ones, per scale. This yields three two-dimensional Wavelet functions at each scale, implying a decomposition which is only separable per scale. In practice, this means cascading this two-dimensional transform on the approximation band at every scale.
An alternative extension is a completely separable construction. Considering all the basis elements of the 1-D DWT (in all scales) arranged column-wise in the matrix $\boldsymbol{\Phi}_1$, the 2-D separable transform can be represented as the Kronecker product $\boldsymbol{\Phi}_2 = \boldsymbol{\Phi}_1 \otimes \boldsymbol{\Phi}_1$. This way, all properties of the transform translate to each of the dimensions of the 2-dimensional signal on which it is applied. Now, instead of cascading down a two-dimensional decomposition, the same 1-D Wavelet transform is applied first to all the columns of the image and then to all the rows of the result (or vice versa). In relatively small images, this alternative is simpler and faster to apply compared to the traditional cascade. This modification is not only applicable to the traditional Wavelet transform, but also to the cropped Wavelets dictionary introduced above. In this 2-D set-up, both vertical and horizontal borders are implicitly extended to provide a sparser Wavelet representation.
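The computational benefit of the fully separable construction follows from a standard Kronecker identity: multiplying by the large 2-D dictionary is equivalent to applying the 1-D dictionary along each dimension. The snippet below checks this numerically, using a random matrix as a stand-in for the 1-D dictionary and illustrative sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 12                              # illustrative sizes only
Phi1 = rng.standard_normal((n, m))        # stand-in 1-D dictionary
G = rng.standard_normal((m, m))           # 2-D array of coefficients

# Explicit 2-D dictionary of size n^2 x m^2 acting on the vectorized coefficients.
Phi2 = np.kron(Phi1, Phi1)
img_explicit = (Phi2 @ G.reshape(-1)).reshape(n, n)

# Separable application: 1-D dictionary along one dimension, then the other.
img_separable = Phi1 @ G @ Phi1.T
```

The explicit product costs on the order of $n^2 m^2$ operations, whereas the separable form needs two small matrix products, which is what makes a separable base dictionary attractive for larger signals.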
We present in Fig. 3 the 2-D atoms of the Wavelet (Haar) Transform for signals of size as an illustrative example. The atoms corresponding to the coarsest decomposition scale and the diagonal bands are the same in both separable and non-separable constructions. The difference appears in the vertical and horizontal bands (at the second scale and below). In the separable case we see elongated atoms, mixing a low scale in one direction with high scale in the other.
III-C Approximation of Real World Signals
While it is hard to rank the performance of separable versus non-separable analytical dictionaries or transforms in the general case, we have observed that the separable Wavelet transform provides sparser representations than the traditional 2-D decomposition on small-medium size images. To demonstrate this, we take 1,000 image patches of size from popular test images, and compare the m-term approximation achieved by the regular two-dimensional Wavelet transform, the completely separable Wavelet transform and our separable and cropped Wavelets. A small subset of these patches is presented on the left of Fig. 4. These large patches are in themselves small images, exhibiting the complex structures characteristic of real world images.
As we see from the results in Fig. 4 (right), the separability provides some advantage over regular Wavelets in representing the image patches. Furthermore, the proposed separable cropped Wavelets give an even better approximation of the data with fewer coefficients.
Before concluding this section, we make the following remark. It is well known that Wavelets (separable or not) are far from providing an optimal representation for general images [37, 40, 41]. Nonetheless, in this work these basis functions will be used only as the base dictionary, while our learned dictionary will consist of linear combinations thereof. It is up to the learning process to close the gap between the sub-optimal representation capability of the Wavelets, and the need for a better two dimensional representation that takes into account edge orientation, scale invariance, and more.
IV Online Sparse Dictionary Learning
As seen previously, the de-facto method for training the doubly sparse model has been a batch-like process. When working with higher dimensional data, however, the required amount of training examples and the corresponding computational load increase. In this big-data (or medium-data) scenario, it is often unfeasible or undesired to perform several sweeps over the entire data set. In some cases, the dimensionality and the amount of data might restrict the learning process to only a couple of iterations. In this regime of work it may be impossible to even store all training samples in memory during the training process. In an extreme online learning set-up, each data sample is seen only once as new data flows in.
These reasons lead naturally to the formulation of an online training method for the double-sparsity model. In this section, we first introduce a dictionary learning method based on the Normalized Iterative Hard-Thresholding algorithm . We then use these ideas to propose an Online Sparse Dictionary Learning (OSDL) algorithm based on the popular Stochastic Gradient Descent technique, and show how it can be applied efficiently to our specific dictionary learning problem.
IV-A NIHT-based Dictionary Learning
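A popular practice in dictionary learning, which has been shown to be quite effective, is to employ a block coordinate minimization over this non-convex problem. This often reduces to alternating between a sparse coding stage, throughout which the dictionary is held constant, and a dictionary update stage in which the sparse coefficients (or their support) are kept fixed. We shall focus on the second stage, as the first remains unchanged, essentially applying sparse coding to a group of examples. Embarking from the objective as given in Equation (3), the problem to consider in the dictionary update stage is the following:
$$\min_{\mathbf{A}} \frac{1}{2}\|\mathbf{Y} - \boldsymbol{\Phi}\mathbf{A}\mathbf{X}\|_F^2 \quad \text{subject to} \quad \|\mathbf{a}_j\|_0 = k \;\; \forall j,$$
where $\boldsymbol{\Phi}$ is the base dictionary of size $n \times L$ and $\mathbf{A}$ is a matrix of size $L \times m$ with $k$ non-zeros per column. Many dictionary learning methods undertake a sequential update of the atoms in the dictionary ([8, 10, 25]). Following this approach, we can consider $m$ minimization problems of the following form:
$$\min_{\mathbf{a}_j} \frac{1}{2}\|\mathbf{E}_j - \boldsymbol{\Phi}\mathbf{a}_j\mathbf{x}_T^j\|_F^2 \quad \text{subject to} \quad \|\mathbf{a}_j\|_0 = k, \tag{8}$$
where $\mathbf{E}_j$ is the error given by $\mathbf{Y} - \sum_{i \neq j} \boldsymbol{\Phi}\mathbf{a}_i\mathbf{x}_T^i$ and $\mathbf{x}_T^i$ denotes the $i$-th row of $\mathbf{X}$. This problem produces the $j$-th column in $\mathbf{A}$, and thus we sweep through $j = 1, \dots, m$ to update all of $\mathbf{A}$.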
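The Normalized Iterative Hard-Thresholding (NIHT) algorithm is a popular sparse coding method in the context of Compressed Sensing. This method can be understood as a projected gradient descent algorithm. We can propose a dictionary update based on the same concept. Note that we could rewrite the cost function in Equation (8) as $\frac{1}{2}\|\mathbf{E}_j - \mathcal{H}_j\mathbf{a}_j\|_F^2$, for an appropriate operator $\mathcal{H}_j$ (namely $\mathcal{H}_j\mathbf{a}_j = \boldsymbol{\Phi}\mathbf{a}_j\mathbf{x}_T^j$). Written in this way, we can perform the dictionary update in terms of the NIHT by iterating:
$$\mathbf{a}_j^{t+1} = \mathcal{P}_k\left[\mathbf{a}_j^t + \eta_j^t\, \mathcal{H}_j^*\left(\mathbf{E}_j - \mathcal{H}_j\mathbf{a}_j^t\right)\right], \tag{9}$$
where $\mathcal{H}_j^*$ is the adjoint of $\mathcal{H}_j$, $\mathcal{P}_k$ is a Hard-Thresholding operator that keeps the $k$ largest non-zeros (in absolute value), and $\eta_j^t$ is an appropriate step-size. Note that this algorithm implies iterating over Equation (9) until convergence per atom in the dictionary update stage.
The choice of the step size is critical. Noting that $\mathcal{H}_j^*\left(\mathbf{E}_j - \mathcal{H}_j\mathbf{a}_j^t\right) = \boldsymbol{\Phi}^T\mathbf{R}_j^t(\mathbf{x}_T^j)^T$ with $\mathbf{R}_j^t = \mathbf{E}_j - \boldsymbol{\Phi}\mathbf{a}_j^t\mathbf{x}_T^j$, in  the authors propose to set this parameter per iteration as:
$$\eta_j^t = \frac{\left\|\boldsymbol{\Phi}_{\mathcal{S}}^T\mathbf{R}_j^t(\mathbf{x}_T^j)^T\right\|_2^2}{\left\|\boldsymbol{\Phi}_{\mathcal{S}}\boldsymbol{\Phi}_{\mathcal{S}}^T\mathbf{R}_j^t(\mathbf{x}_T^j)^T\mathbf{x}_T^j\right\|_F^2}, \tag{10}$$
where $\mathcal{S}$ denotes the support of $\mathbf{a}_j^t$. With this step size, the estimate $\tilde{\mathbf{a}}_j^{t+1}$ is obtained by performing a gradient step and hard-thresholding as in Equation (9). Note that if the supports of $\tilde{\mathbf{a}}_j^{t+1}$ and $\mathbf{a}_j^t$ are the same, setting $\eta_j^t$ as in Equation (10) is indeed optimal, as it is the minimizer of the quadratic cost w.r.t. $\eta_j^t$. In this case, we simply set $\mathbf{a}_j^{t+1} = \tilde{\mathbf{a}}_j^{t+1}$. If the support changes after applying $\mathcal{P}_k$, however, the step-size must be diminished until a condition is met, guaranteeing a decrease in the cost function (see Footnote 5). Following this procedure, the work reported in  shows that the algorithm in Equation (9) is guaranteed to converge to a local minimum of the problem in (8).
Footnote 5: The step size is decreased by , where . We refer the reader to  and  for further details.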
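The atom-wise update just described can be sketched as a projected gradient loop. The version below is a simplified illustration, not the authors' implementation: the function names are hypothetical, the support-restricted step size follows the NIHT rule, and plain backtracking (halving the step until the cost does not increase) stands in for the exact NIHT acceptance condition.

```python
import numpy as np

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def cost(Phi, E, x, a):
    """0.5 * || E - Phi a x^T ||_F^2 for a row of coefficients x."""
    return 0.5 * np.linalg.norm(E - np.outer(Phi @ a, x)) ** 2

def niht_atom_update(Phi, E, x, a, k, n_iter=30, shrink=0.5):
    for _ in range(n_iter):
        g = Phi.T @ ((E - np.outer(Phi @ a, x)) @ x)   # negative gradient
        S = np.flatnonzero(a)
        if S.size == 0:                                # empty start: use top-k of g
            S = np.argsort(np.abs(g))[-k:]
        gS = g[S]
        denom = np.linalg.norm(Phi[:, S] @ gS) ** 2 * (x @ x)
        if denom <= 0.0:
            break
        eta = (gS @ gS) / denom                        # support-restricted step size
        a_new = hard_threshold(a + eta * g, k)
        # Assumption: simple backtracking replaces NIHT's step-size condition.
        for _trial in range(50):
            if cost(Phi, E, x, a_new) <= cost(Phi, E, x, a):
                break
            eta *= shrink
            a_new = hard_threshold(a + eta * g, k)
        else:
            break                                      # no acceptable step found
        a = a_new
    return a

rng = np.random.default_rng(0)
n, L, N, k = 16, 32, 20, 3
Phi = rng.standard_normal((n, L))
Phi /= np.linalg.norm(Phi, axis=0)
a_true = np.zeros(L)
a_true[[1, 5, 9]] = [1.0, -1.0, 0.5]
x_row = rng.standard_normal(N)
E = np.outer(Phi @ a_true, x_row)
a_hat = niht_atom_update(Phi, E, x_row, np.zeros(L), k)
```

By construction every accepted step keeps at most `k` non-zeros and never increases the cost; recovering the generating atom exactly is not guaranteed, since the problem is non-convex.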
Consider now the algorithm given by iterating between 1) sparse coding of all examples in , and 2) atom-wise dictionary update with NIHT in Equation (8). An important question that arises is: will this simple algorithm converge? Let us assume that the pursuit succeeds, obtaining the sparsest solution for a given sparse dictionary , which can indeed be guaranteed under certain conditions. Moreover, pursuit methods like OMP, Basis Pursuit and FOCUSS perform very well in practice when (refer to  for a thorough review). For the cases where the theoretical guarantees are not met, we can adopt an external interference approach by comparing the best solution using the support obtained in the previous iteration to the one proposed by the new iteration of the algorithm, and choosing the best one. This small modification guarantees a decrease in the cost function at every sparse coding step. The atom-wise update of the dictionary is also guaranteed to converge to a local minimum for the above mentioned choice of step sizes. Performing a series of these alternating minimization steps ensures a monotonic reduction in the original cost function in Equation (2), which is also bounded from below, and thus convergence to a fixed point is guaranteed.
IV-B From Batch to Online Learning
As noted in [23, 10], it is not compulsory to accumulate all the examples to perform an update in the gradient direction. Instead, we turn to a stochastic (projected) gradient descent approach. In this scheme, instead of computing the expected value of the gradient by the sample mean over all examples, we estimate this gradient over a single randomly chosen example . We then update the atoms of the dictionary based on this estimation using:
Since these updates might be computationally costly (and because we are only performing an alternating minimization over problem (3)), we might stop after a few iterations of applying Equation (11). We also restrict this update to those atoms that are used by the current example (since others have no contribution in the corresponding gradient). In addition, instead of employing the step size suggested by the NIHT algorithm, we employ the common approach of using decreasing step sizes throughout the iterations, which has been shown beneficial in stochastic optimization . To this end, and denoting by the step size resulting from the NIHT, we employ an effective learning rate of , with a manually set parameter . This modification does not compromise the guarantees of a decrease in the cost function (for the given random sample ), since this factor is always smaller than one. We outline the basic stages of this method in Algorithm 1.
An important question that now arises is whether shifting from a batch training approach to this online algorithm preserves the convergence guarantees described above. Though plenty is known in the field of stochastic approximations, most of the existing results address convergence guarantees for convex functions, and little is known in this area regarding projected gradient algorithms. For non-convex cases, convergence guarantees still demand the cost function to be differentiable with continuous derivatives. In our case, the $\ell_0$ pseudo-norm makes a proof of convergence challenging, since the problem becomes not only non-convex but also (highly) discontinuous.
That said, one could re-formulate the dictionary learning problem using a non-convex but continuous and differentiable penalty function (see Footnote 6), moving from a constrained optimization problem to an unconstrained one. We conjecture that convergence to a fixed point of this problem can be reached under the mild conditions described in . Despite these theoretical benefits, we choose to maintain our initial formulation in terms of the $\ell_0$ measure for the sake of simplicity (note that we need no parameters other than the target sparsity). Practically, we saw in all our experiments that convergence is reached, providing numerical evidence for the behavior of our algorithm.
Footnote 6: One of many such possibilities is , replacing .
Iv-C OSDL In Practice
We now turn to describe a variant of the method described in Algorithm 1, and outline other implementation details. The atom-wise update of the dictionary, while providing a specific step-size, is computationally slower than a global update. In addition, guaranteeing a decreasing step in the cost function implies a line-search per atom that is costly. For this reason we propose to replace this stage by a global dictionary update of the form
where the thresholding operator now operates in each column of its argument. While we could maintain a NIHT approach in the choice of the step-size in this case as well, we choose to employ
Note that this is the square root of the value in Equation (10), which may appear counter-intuitive. We present a numerical justification of this choice in the following section.
Secondly, instead of considering a single sample per iteration, a common practice in stochastic gradient descent algorithms is to consider mini-batches of examples arranged in the matrix . As explained in detail in , the computational cost of the OMP algorithm can be reduced by precomputing (and storing) the Gram matrix of the dictionary , given by . In a regular online learning scheme, this would be infeasible due to the need to recompute this matrix for each example. In our case, however, the matrix needs only to be updated once per mini-batch. Furthermore, only a few atoms get updated each time. We exploit this by updating only the respective rows and columns of the matrix . Moreover, this update can be done efficiently due to the sparsity of the dictionary .
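The partial Gram update described above can be illustrated with a minimal NumPy sketch. It assumes, purely for brevity, that the effective dictionary is available as an explicit dense array (in the setting of this paper it would be the product of the base dictionary and a sparse matrix, and the products would exploit that sparsity). The function name `update_gram` and all variable names are hypothetical.

```python
import numpy as np

def update_gram(G, D_new, updated):
    """Refresh only the rows/columns of G = D^T D that belong to updated atoms.

    G       : (m, m) Gram matrix of the previous dictionary (modified in place)
    D_new   : (n, m) dictionary after the update; only columns in `updated` changed
    updated : 1-D integer array with the indices of the atoms that changed
    """
    cross = D_new.T @ D_new[:, updated]   # (m, k) inner products with the changed atoms
    G[:, updated] = cross                 # refresh the affected columns
    G[updated, :] = cross.T               # refresh the affected rows (G is symmetric)
    return G

rng = np.random.default_rng(0)
D = rng.standard_normal((16, 40))
G = D.T @ D                               # full Gram matrix, computed once

updated = np.array([3, 17, 25])           # atoms touched by the current mini-batch
D_new = D.copy()
D_new[:, updated] += 0.1 * rng.standard_normal((16, updated.size))

G = update_gram(G, D_new, updated)
```

With k updated atoms out of m, this sketch costs one (m by n) times (n by k) product per mini-batch instead of recomputing all m squared inner products.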
Stochastic algorithms often introduce different strategies to regularize the learning process and to avoid getting trapped in local minima. In our case, we incorporate in our algorithm a momentum term controlled by a parameter . This term helps to attenuate oscillations and can speed up the convergence by incorporating information from the previous gradients. This algorithm, termed Online Sparse Dictionary Learning (OSDL), is depicted in Algorithm 2. In addition, many dictionary learning algorithms [8, 10] include the replacement of (almost) unused atoms and the pruning of similar atoms. We incorporate these strategies here as well, checking for such cases once every few iterations.
IV-D Complexity Analysis
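As a rough illustration of how a momentum term and a column-wise hard thresholding can be combined, consider the sketch below. It is not a transcription of Algorithm 2: the update rule, the fixed step size `eta`, the momentum weight `gamma`, and every function and variable name are assumptions made for this example, and atom normalization, atom replacement and pruning are omitted. The only standard ingredient is the gradient of the squared representation error with respect to the sparse matrix.

```python
import numpy as np

def hard_threshold_columns(A, p):
    """Keep the p largest-magnitude entries in each column of A, zero the rest."""
    out = np.zeros_like(A)
    idx = np.argsort(-np.abs(A), axis=0)[:p]          # (p, m) row indices per column
    cols = np.arange(A.shape[1])
    out[idx, cols] = A[idx, cols]
    return out

def momentum_step(A, U, Phi, Y, X, eta, gamma, p):
    """One illustrative momentum update of the atoms used by the mini-batch.

    Cost: 0.5 * ||Y - Phi @ A @ X||_F^2, with A column-sparse (p non-zeros each).
    U stores the previous update direction (the momentum variable).
    """
    used = np.flatnonzero(np.any(X != 0, axis=1))     # atoms with non-zero coefficients
    R = Y - Phi @ (A @ X)                             # representation residual
    grad = -(Phi.T @ R) @ X[used].T                   # gradient w.r.t. the used columns of A
    U_new = U.copy()
    U_new[:, used] = gamma * U[:, used] + eta * grad  # blend in previous directions
    A_new = A.copy()
    A_new[:, used] = hard_threshold_columns(A[:, used] - U_new[:, used], p)
    return A_new, U_new

rng = np.random.default_rng(0)
n, m_phi, m, N, p = 8, 12, 12, 5, 3
Phi = rng.standard_normal((n, m_phi))                 # stand-in for the base dictionary
A = hard_threshold_columns(rng.standard_normal((m_phi, m)), p)
X = np.zeros((m, N))
X[[0, 4, 7], [0, 1, 2]] = 1.0                         # only atoms 0, 4 and 7 are used
Y = rng.standard_normal((n, N))
U = np.zeros_like(A)

A_new, U_new = momentum_step(A, U, Phi, Y, X, eta=1e-2, gamma=0.9, p=p)
```

Only the columns of the used atoms change, which mirrors the restriction of the update to the atoms selected by the current mini-batch.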
We now turn to address the computational cost of the proposed online learning scheme. As was thoroughly discussed in , the sparse dictionary enables an efficient sparse coding step. In particular, any multiplication by , or its transpose, has a complexity of , where is the number of atoms in (assume for simplicity square), is the atom sparsity and is the complexity of applying the base dictionary. For the separable case, this reduces to .
Using a sparse dictionary, the sparse coding stage with OMP (in its Cholesky implementation) is per example. Considering examples in a mini-batch, and assuming and , we obtain a complexity of .
Moving to the update stage in the OSDL algorithm (footnote 7: we analyze the complexity of just the OSDL for simplicity; the analysis of Algorithm 1 is similar, adding the complexity of the line search of the step sizes), calculating the gradient has a complexity of , and so does the calculation of the step size. Recall that is the set of atoms used by the current samples, and that ; i.e., the update is applied only on a subset of all the atoms. Updating the momentum variable grows as , and the hard thresholding operator is . In a pessimistic approach, assume .
Putting these elements together, the OSDL algorithm has a complexity of per mini-batch. The first term depends on the number of examples per mini-batch, and the second one depends only on the size of the dictionary. For high dimensions (large ), the first term is the leading one. Clearly, the number of non-zeros per atom determines the computational complexity of our algorithm. While in this study we do not address the optimal way of scaling , experiments shown hereafter suggest that its dependency with might in fact be less than linear. The sparse dictionary provides a computational advantage over the online learning methods using explicit dictionaries, such as , which have complexity of .
In this section we present a number of experiments to illustrate the behaviour of the method presented in the previous section. We start with a detailed experiment on learning an image-specific dictionary. We then move on to demonstrations on image denoising and image compression. Finally we tackle the training of universal dictionaries on millions of examples in high dimensions.
V-A Image-Specific Dictionary Learning
To test the behaviour of the proposed approach, we present the following experiment. We train an adaptive sparse dictionary in three setups of increasing dimension: with patches of size , and , all extracted from the popular image Lena, using a fixed number of non-zeros in the sparse coding stage (4, 10 and 20 non-zeros, respectively). We also repeat this experiment for different levels of sparsity of the dictionary . We employ the OSDL algorithm, as well as the method presented in Algorithm 1 (in its mini-batch version, for comparison). We also include the results by Sparse K-SVD, which is the classical (batch) method for the double sparsity model, and the popular Online Dictionary Learning (ODL) algorithm . Note that this last method is an online method that trains a dense (full) dictionary. Training is done on 200,000 examples, leaving 30,000 as a test set.
The sparse dictionaries use the cropped Wavelets as their operator , built using the Symlet Wavelet with 8-taps. The redundancy of this base dictionary is 1.75 (in 1-D), and the matrix is set to be square, resulting in a total redundancy of just over 3. For a fair comparison, we initialize the ODL method with the same cropped Wavelets dictionary. All methods use OMP in the sparse coding stage. Also, note that the ODL algorithm (footnote 8: we used the publicly available SPArse Modeling Software package, at http://spams-devel.gforge.inria.fr/) is implemented entirely in C, while in our case this is only true for the sparse coding, giving the ODL somewhat of an advantage in run-time.
The results are presented in Fig. 5, showing the representation error on the test set, where each marker corresponds to an epoch. The atom sparsity refers to the number of non-zeros per column of with respect to the signal dimension (i.e., in the case implies 7 non-zeros). Several conclusions can be drawn from these results. First, as expected, the online approaches provide a much faster convergence than the batch alternative. For the low dimensional case, there is little difference between Algorithm 1 and the OSDL, though this difference becomes more prominent as the dimension increases. In these cases, not only does Algorithm 1 converge slower but it also seems to be more prone to local minima.
As the number of non-zeros per atom grows, the representation power of our sparse dictionary increases. In particular, OSDL achieves the same performance as ODL for an atom sparsity of for a signal dimension of 144. Interestingly, OSDL and ODL achieve the same performance for a decreasing number of non-zeros in as the dimension increases: for the case and for the . In this higher dimensional setting, not only does the sparse dictionary provide faster convergence but it also achieves a lower minimum. The fewer degrees of freedom of the sparse dictionary prove beneficial in this context, where the amount of training data is limited and perhaps insufficient to train a full dictionary (footnote 9: this limitation needed to be imposed for a comparison with Sparse K-SVD; further along this section we will present a comparison without this limitation). This example suggests that indeed could grow slower than linearly with the dimension .
Before moving on, we want to provide some empirical evidence to support the choice of the step size in the OSDL algorithm. In Fig. 6 we plot the atom-wise step sizes obtained by Algorithm 1, (i.e., the optimal values from the NIHT perspective), together with their mean value, as a function of the iterations for the case for illustration. In addition, we show the global step sizes of OSDL as in Equation (13). As can be seen, this choice provides a fair approximation to the mean of the individual step sizes. Clearly, the square of this value would be too conservative, yielding very small step sizes and providing substantially slower convergence.
V-B Image Restoration Demonstration
In the context of image restoration, most state-of-the-art algorithms take a patch-based approach. While the different algorithms differ in the models they enforce on the corrupted patches (or the prior they choose to consider, in the context of a Bayesian formulation), the general scheme remains very much the same: overlapping patches are extracted from the degraded image, then restored more or less independently, before being merged back together by averaging. Though this provides an effective option, this locally-focused approach is far from being optimal. As noted in several recent works ([20, 49, 50]), not looking at the image as a whole causes inconsistencies between adjacent patches which often result in texture-like artifacts. A possible direction toward a more global outlook is, therefore, to allow for bigger patches.
We do not intend to provide a complete image restoration algorithm in this paper. Instead, we will show that benefit can indeed be found in using bigger patches in image restoration – given an algorithm which can cope with the dimension increase. We present an image denoising experiment on several popular images, for increasing patch sizes. In the context of sparse representations, an image restoration task can be formulated as a Maximum a Posteriori problem . In the case of a sparse dictionary, this problem can be posed as:
where is the image estimate given the noisy observation , is an operator that extracts the patch from a given image and is the sparse representation of the patch. We can minimize this problem by taking a similar approach to that of the dictionary learning problem: use a block-coordinate descent by fixing the unknown image , and minimizing w.r.t the sparse vectors and the dictionary (by any dictionary learning algorithm). We then fix the sparse vectors and update the image . Note that even though this process should be iterated (as effectively shown in ) we stick to the first iteration of this process to make a fair comparison with the K-SVD based algorithms.
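The patch extraction and the final image update can be sketched as follows. This is a minimal illustration under stated assumptions: the restored patches are taken as given (in the scheme above they would be produced by sparse coding over the trained dictionary), the image update is assumed to be the usual weighted average of the noisy image and the overlapping restored patches, and the weight `lam` as well as the function names are hypothetical.

```python
import numpy as np

def extract_patches(img, s):
    """Stack all overlapping s-by-s patches of img as columns (row-major scan)."""
    H, W = img.shape
    cols = [img[i:i + s, j:j + s].ravel()
            for i in range(H - s + 1) for j in range(W - s + 1)]
    return np.stack(cols, axis=1)

def merge_patches(patches, noisy, s, lam):
    """Average restored patches back into an image, weighted against the noisy input."""
    H, W = noisy.shape
    acc = lam * noisy.astype(float)          # contribution of the noisy observation
    weight = lam * np.ones((H, W))           # per-pixel normalization
    k = 0
    for i in range(H - s + 1):
        for j in range(W - s + 1):
            acc[i:i + s, j:j + s] += patches[:, k].reshape(s, s)  # put patch k back in place
            weight[i:i + s, j:j + s] += 1.0                        # count the overlap
            k += 1
    return acc / weight

img = np.arange(36, dtype=float).reshape(6, 6)
P = extract_patches(img, 3)                  # one column per overlapping patch
rec = merge_patches(P, img, 3, lam=0.5)      # untouched patches reproduce the image
```

Because every pixel is divided by the number of patches covering it (plus `lam`), pixels near the border are averaged over fewer patches than interior ones.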
For this experiment, denoted as Experiment 4, we use both Sparse K-SVD and OSDL, for training the double sparsity model. Each method is run with the traditional ODCT and with the cropped Wavelets dictionary, presented in this paper. We include as a reference the results of the K-SVD denoising algorithm , which trains a regular (dense) dictionary with patches of size . The dictionary sparsity was set to be of the signal dimension. Regarding the size of the dictionary, the redundancy was determined by the redundancy of the cropped Wavelets (as explained in Section III-A), and setting the sparse matrix to be square. This selection of parameters is certainly not optimal. For example, we could have set the redundancy as an increasing function of the signal dimension. However, learning such increasingly redundant dictionaries is limited by the finite data of each image. Therefore, we use a square matrix for all patch sizes, leaving the study of other alternatives for future work. 10 iterations were used for the K-SVD methods and 5 iterations for the OSDL.
Fig. 7 presents the averaged results over the set of 10 publicly available images used by , where the noise standard deviation was set to . Note how the original algorithm presented in , Sparse K-SVD with the ODCT as the base dictionary, does not scale well with the increasing patch size. In fact, once the base dictionary is replaced by the cropped Wavelets dictionary, the same algorithm shows a jump in performance of nearly 0.4 dB. A similar effect is observed for the OSDL algorithm, where the cropped Wavelets dictionary performs the best.
Employing even greater patch sizes eventually results in decreasing denoising quality, even for the OSDL with Cropped Wavelets. Partially, this could be caused by a limitation of the sparse model in representing fine details as the dimension of the signal grows. Also, the amount of training data is limited by the size of the image, having approximately 250,000 examples to train on. Once the dimension of the patches increases, the amount of training data might become a limiting factor in the denoising performance.
As a final word about this experiment, we note that treating all patches the same way (with the same patch size) is clearly not optimal. A multi-size patch approach has already been suggested in , though in the context of the Non-Local Means algorithm. The OSDL algorithm may be the right tool to bring multi-size patch processing to sparse representation-based algorithms, and this remains a topic of future work.
V-C Adaptive Image Compression
Image compression is the task of reducing the amount of information needed to represent an image, such that it can be stored or transmitted efficiently. In a world where image resolution increases at a surprising rate, more efficient compression algorithms are always in demand. In this section, we do not attempt to provide a complete solution to this problem but rather show how our online sparse dictionaries approach could indeed aid a compression scheme.
Most (if not all) compression methods rely on sparsifying transforms. In particular, JPEG2000, one of the best performing and popular algorithms available, is based on the 2-D Wavelet transform. Dictionary learning has already been shown to be beneficial in this application. In , the authors trained several dictionaries for patches of size on pre-aligned face pictures. These offline trained dictionaries were later used to compress images of the same type, by sparse coding the respective patches of each picture. The results reported in  surpass those by JPEG2000, showing the great potential of similar schemes.
In the experiment we are presenting here (Experiment 5), we go beyond the locally based compression scheme and propose to perform naive compression by just keeping a certain number of coefficients through sparse coding, where each signal is the entire target image. To this end, we use the same data set as in  consisting of over 11,000 examples, and re-scaled them to a size of . We then train a sparse dictionary on these signals with OSDL, using the cropped Wavelets as the base dictionary for 15 iterations. For a fair comparison with other non-redundant dictionaries, in this case we chose the matrix such that the final dictionary is non-redundant (a rectangular tall matrix). A word of caution should be said regarding the relatively small training data set. Even though we are training just over 4000 atoms on only 11,000 samples, these atoms are only 250-sparse. This provides a great reduction to the degrees of freedom during training. A subset of the obtained atoms can be seen in Fig. 8a.
For completeness, we include here the results obtained by the SeDiL algorithm  (with the code provided by the authors and with the suggested parameters), which trains a separable dictionary consisting of 2 small dictionaries of size . Note that this implies a final dictionary which has a redundancy of 4, though the degrees of freedom are of course limited due to the separability imposed.
The results of this naive compression scheme are shown in Fig. 8b for a testing set (not included in the training). As we see, the obtained dictionary performs substantially better than Wavelets – on the order of 8 dB at a given coefficient count. Partially, the performance of our method is aided by the cropped Wavelets, which in themselves perform better than the regular 2-D Wavelet transform. However, the adaptability of the matrix results in a much better compression-ratio. A substantial difference in performance is obtained after training with OSDL, even while the redundancy of the obtained dictionary is less (by about half) than the redundancy of its base-dictionary. The dictionary obtained by the SeDiL algorithm, on the other hand, has difficulties learning a completely separable dictionary for this dataset, in which the faces, despite being aligned, are difficult to approximate through separable atoms.
As one could observe from the obtained dictionary atoms by our method, some of them might resemble PCA-like basis elements. Therefore we include the results by compressing the testing images with a PCA transform, obtained from the same training set – essentially, performing a dimensionality reduction. As one can see, the PCA results are indeed better than Wavelets due to the regular structure of the aligned faces, but they are still relatively far from the results achieved by OSDL.
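A PCA baseline of this kind can be sketched in a few lines. The sketch assumes that compression keeps the M leading principal components (a plain dimensionality reduction); the exact protocol used for the figure is not specified here, and the function names and the random stand-in data are hypothetical.

```python
import numpy as np

def pca_basis(train):
    """train: (n, N) array with one vectorized image per column."""
    mean = train.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(train - mean, full_matrices=False)  # principal axes in columns
    return mean[:, 0], U

def pca_compress(y, mean, U, M):
    """Keep the M leading PCA coefficients of y and reconstruct the signal."""
    coeffs = U[:, :M].T @ (y - mean)
    return mean + U[:, :M] @ coeffs

rng = np.random.default_rng(0)
train = rng.standard_normal((20, 50))        # stand-in for the training images
test_img = rng.standard_normal(20)           # stand-in for a held-out image

mean, U = pca_basis(train)
errors = [np.linalg.norm(test_img - pca_compress(test_img, mean, U, M))
          for M in (1, 5, 10, 20)]
```

Since the retained subspaces are nested, the reconstruction error cannot increase as M grows, which gives a rate-distortion curve comparable to the M-term curves of the other transforms.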
Lastly, we show that this naive compression scheme, based on the OSDL algorithm, does not rely on the regularity of the aligned faces in the previous database. To support this claim, we perform a similar experiment on images obtained from the “Cropped Labeled Faces in the Wild Database” . This database includes images of subjects found on the web, and its cropped version consists of images including only the face of the different subjects. These face images are in different positions, orientations, resolutions and illumination conditions. We trained a dictionary for this database, which consists of just over 13,000 examples, with the same parameters as in the previous case, and the compression is evaluated on a testing set not included in the training. An analogous training process was performed with SeDiL. As shown in Fig. 8c, the PCA results are now inferior, due to the lack of regularity of the images. The separable dictionary provided by SeDiL performs better in this dataset, whose examples consist of truncated faces rather than heads, and which can be better represented by separable atoms. Yet, its representation power is compromised by its complete separability when compared to OSDL, with a 1 dB gap between the two.
V-D Pursuing Universal Big Dictionaries
Dictionary learning has shown how to truly take advantage of sparse representations in specific domains; however, dictionaries can also be trained for more general domains (i.e., natural images). For relatively small dimensions, several works have demonstrated that it is possible to train general dictionaries on patches extracted from non-specific natural images. Such general-purpose dictionaries have in turn been used in many applications in image restoration, outperforming analytically-defined transforms .
Using our algorithm we want to tackle the training of such universal dictionaries for image patches of size , i.e., of dimension 1024. To this end, in this experiment we train a sparse dictionary with a total redundancy of 6: the cropped Wavelets dictionary introduces a redundancy of around 3, and the matrix has a redundancy of 2. The atom sparsity was set to 250, and each example was coded with 60 non-zeros in the sparse coding stage. Training was done on 10 Million patches taken from natural images from the Berkeley Segmentation Dataset . We run the OSDL algorithm for two data sweeps. For comparison, we trained a full (unconstrained) dictionary with ODL with the same redundancy, on the same database and with the same parameters.
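To get a feel for the sizes involved, the figures quoted above can be turned into a quick parameter count. This is back-of-the-envelope arithmetic only: it counts the non-zero values of the sparse matrix (not the cost of storing their positions) and treats the total redundancy of 6 as exact, although the base redundancy is only stated as "around 3".

```python
n = 1024                      # signal dimension stated in the text
total_redundancy = 6          # total redundancy of the trained dictionary
atom_sparsity = 250           # non-zeros per column of the sparse matrix

num_atoms = n * total_redundancy               # 6144 atoms
sparse_params = num_atoms * atom_sparsity      # non-zero coefficients to learn
dense_params = n * num_atoms                   # entries of an explicit (dense) dictionary

print(num_atoms, sparse_params, dense_params, dense_params / sparse_params)
```

Under these assumptions the sparse model has roughly four times fewer trainable values than an explicit dictionary of the same size.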
We evaluate the quality of such a trained dictionary in an M-Term approximation experiment on 600 patches (or little images). Comparison is done with regular and separable cropped Wavelets (the last one being the base-dictionary of the double sparsity model, and as such the starting point of the training). We also want to compare our results with the approximation achieved by more sophisticated multi-scale transforms, such as Contourlets. Contourlets are a better suited multi-scale analysis for two dimensions, providing an optimal approximation rate for piece-wise smooth functions with discontinuities along twice differentiable curves . This is a slightly redundant transform due to the Laplacian Pyramid used for the multi-scale decomposition (redundancy of 1.33). Note that, traditionally, hard-thresholding is used to obtain an M-term approximation, as implemented in the code made available by the authors. However, this is not optimal in the case of redundant dictionaries. We therefore construct an explicit Contourlet synthesis dictionary, and apply the same greedy pursuit we employ throughout the paper. Thus we fully leverage the approximation power of this transform, making the comparison fair.
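An M-term approximation experiment with a greedy pursuit can be sketched as below. This is a plain, unoptimized OMP written for illustration (not the Cholesky-based implementation mentioned earlier); the random dictionary stands in for an explicit synthesis dictionary, and the PSNR peak value of 1 is an assumption about the signal range.

```python
import numpy as np

def omp(D, y, M):
    """Greedy M-term approximation of y over the columns of D (assumed unit norm)."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(M):
        k = int(np.argmax(np.abs(D.T @ residual)))     # atom most correlated with residual
        if k in support:                                # nothing new to add; stop early
            break
        support.append(k)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)  # refit on the support
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for signals with the given peak value."""
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)            # unit-norm atoms
y = rng.random(64)                        # stand-in for a vectorized patch in [0, 1)

curve = [psnr(y, D @ omp(D, y, M)) for M in (1, 5, 10, 20)]
```

Because each run follows the same greedy path and refits all selected coefficients, the PSNR curve is non-decreasing in M, which is what makes curves from different dictionaries directly comparable.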
Moreover, and to provide a complete picture of the different transforms, we include also the results obtained for a cropped version of Contourlets. Since Contourlets are not separable we use a 2-D extension of our cropping procedure detailed in Section III-A to construct a cropped Contourlets synthesis dictionary. The lack of separability makes this dictionary considerably less efficient computationally. As in cropped Wavelets, we naturally obtain an even more redundant dictionary (redundancy factor of 5.3) (footnote 10: another option to consider is to use undecimated multi-scale transforms. The Undecimated Wavelet Transform (UDWT)  and the Nonsubsampled Contourlet Transform (NSCT)  are shift-invariant versions of the Wavelet and Contourlet transforms, respectively, and are obtained by skipping the decimation step at each scale. This greater flexibility in representation, however, comes at the cost of a huge redundancy, which becomes a prohibiting factor in any pursuit scheme. A similar undecimated scheme could be proposed for the corresponding cropped transforms, but this is out of the scope of this work).
A subset of the obtained dictionary is shown in Fig. 9, where the atoms have been sorted according to their entropy. Very different types of atoms can be observed: from the piece-wise-constant-like atoms, to textures at different scales and edge-like atoms. It is interesting to see that Fourier type atoms, as well as Contourlet and Gabor-like atoms, naturally arise out of the training. In addition, such a dictionary obtains some flavor of shift invariance. As can be seen in Fig. 10, similar patterns may appear in different locations in different atoms. An analogous question could be posed regarding rotation invariance. Furthermore, we could consider enforcing these, or other, properties explicitly in the training. These, and many more questions, are the lines of on-going work.
The approximation results are shown in Fig. 11.a, where Contourlets can be seen to perform slightly better than Wavelets. The cropping of the atoms significantly enhances the results for both transforms, with a slight advantage for cropped Wavelets over cropped Contourlets. The Trainlets, obtained with OSDL, give the highest PSNR. Interestingly, the ODL algorithm by  performs slightly worse than the proposed OSDL, despite the vast database of examples. In addition, the learning (two epochs) with ODL took roughly 4.6 days, whereas the OSDL took approximately 2 days (footnote 11: this experiment was run on a 64-bit operating system with an Intel Core i7 microprocessor, with 16 GB of RAM, in Matlab). As we see, the sparse structure of the dictionary is not only beneficial in cases with limited training data (as in Experiment 1), but also in this big data scenario. We conjecture that this is due to the better guiding of the training process, helping to avoid local minima which an unconstrained dictionary might be prone to.
As a last experiment, we want to show that our scheme can be employed to train an adaptive dictionary for even higher dimensional signals. In Experiment 8, we perform a similar training with OSDL on patches (or images) of size , using an atom sparsity of 600. The cropped Wavelets dictionary has a redundancy of 2.44, and we set to be square.
In order to have a fair comparison, and due to the extensive time involved in running ODL, we first ran ODL for 5 days, giving it sufficient time for convergence. During this time ODL accessed 3.8 million training examples. We then ran OSDL using the same examples (footnote 12: the provided code for ODL is not particularly well suited for cluster-processing, needed for this experiment, and so the times involved in this case should not be taken as an accurate run-time comparison).
As shown in Fig. 11.b, the relative performance of the different methods is similar to the previous case. Trainlets again give the best approximation performance, giving a glimpse into the potential gains achievable when training can be effectively done at larger signal scales. It is not possible to show here the complete trained dictionary, but we do include some selected atoms from it in Fig. 11.c. We obtain many different types of atoms: from the very local curvelet-like atoms, to more global Fourier atoms, and more.
VI Summary and Future Work
This work shows that dictionary learning can be up-scaled to tackle a new level of signal dimensions. We propose a modification of the Wavelet transform by constructing two-dimensional separable cropped Wavelets, which allow a multi-scale decomposition of patches without significant border effects. We apply these Wavelets as a base-dictionary within the Double Sparsity model, allowing this approach to now handle larger and larger signals. In order to handle the vast data sets needed to train such a big model, we propose an Online Sparse Dictionary Learning algorithm, employing SGD ideas in the dictionary learning task. We show how, using these methods, dictionary learning is no longer limited to small signals, and can now be applied to obtain Trainlets, high dimensional trainable atoms.
While OMP proved sufficient for the experiments shown in this work, considering other sparse coding algorithms might be beneficial. In addition, the entire learning algorithm was developed using a strict pseudo-norm, and its relaxation to other convex norms opens new possibilities in terms of training methods. Another direction is to extend our model to allow for the adaptability of the separable base-dictionary itself, incorporating ideas of separable dictionary learning thus providing a completely adaptable structure. Understanding quantitatively how different parameters affect the learned dictionaries, such as redundancy and atom sparsity, will provide a better understanding of our model. These questions, among others, are part of ongoing work.
The authors would like to thank the anonymous reviewers who helped improve the quality of this manuscript, as well as the authors of  for generously providing their code and advice for comparison purposes.
-  M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer Publishing Company, Incorporated, 1st ed., 2010.
-  A. M. Bruckstein, D. L. Donoho, and M. Elad, “From Sparse Solutions of Systems of Equations to Sparse Modeling of Signals and Images,” SIAM Review, vol. 51, pp. 34–81, Feb. 2009.
-  S. Mallat and Z. Zhang, “Matching Pursuits With Time-Frequency Dictionaries,” IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3397–3415, 1993.
-  Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, “Orthogonal Matching Pursuit: Recursive Function Approximation with Applications to Wavelet Decomposition,” Asilomar Conf. Signals, Syst. Comput. IEEE, pp. 40–44, 1993.
-  S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic Decomposition by Basis Pursuit,” SIAM Review, vol. 43, no. 1, pp. 129–159, 2001.
-  I. F. Gorodnitsky and B. D. Rao, “Sparse signal reconstruction from limited data using FOCUSS: a re-weighted minimum norm algorithm,” IEEE Trans. Signal Process., vol. 45, pp. 600–616, Mar. 1997.
-  R. Rubinstein, A. M. Bruckstein, and M. Elad, “Dictionaries for sparse representation modeling,” IEEE Proceedings - Special Issue on Applications of Sparse Representation & Compressive Sensing, vol. 98, no. 6, pp. 1045–1057, 2010.
-  M. Aharon, M. Elad, and A. M. Bruckstein, “K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation,” IEEE Trans. on Signal Process., vol. 54, no. 11, pp. 4311–4322, 2006.
-  K. Engan, S. O. Aase, and J. H. Husoy, “Method of Optimal Directions for Frame Design,” in IEEE Int. Conf. Acoust. Speech, Signal Process., pp. 2443–2446, 1999.
-  J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online Learning for Matrix Factorization and Sparse Coding,” J. Mach. Learn. Res., vol. 11, pp. 19–60, 2010.
-  K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image Denoising by Sparse 3-D Transform-Domain Collaborative Filtering,” IEEE Trans. on Image Process., vol. 16, pp. 2080–2095, Jan. 2007.
-  J. Mairal, F. Bach, and G. Sapiro, “Non-local Sparse Models for Image Restoration,” IEEE International Conference on Computer Vision., vol. 2, pp. 2272–2279, 2009.
-  D. Zoran and Y. Weiss, “From learning models of natural image patches to whole image restoration,” 2011 International Conference on Computer Vision, ICCV., pp. 479–486, Nov. 2011.
-  K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising with block-matching and 3D filtering,” Proc. SPIE-IS&T Electron. Imaging, vol. 6064, pp. 1–12, 2006.
-  W. Dong, L. Zhang, G. Shi, and X. Wu, “Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization,” IEEE Trans. on Image Process., vol. 20, no. 7, pp. 1838–1857, 2011.
-  J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Trans. on Image Process., vol. 19, no. 11, pp. 2861–2873, 2010.
-  M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Trans. Image Process., vol. 15, pp. 3736–3745, Dec. 2006.
-  Y. Romano, M. Protter, and M. Elad, “Single image interpolation via adaptive nonlocal sparsity-based modeling,” IEEE Trans. on Image Process., vol. 23, no. 7, pp. 3085–3098, 2014.
-  B. Ophir, M. Lustig, and M. Elad, “Multi-Scale Dictionary Learning Using Wavelets,” IEEE J. Sel. Top. Signal Process., vol. 5, pp. 1014–1024, Sept. 2011.
-  J. Sulam, B. Ophir, and M. Elad, “Image Denoising Through Multi-Scale Learnt Dictionaries,” in IEEE International Conference on Image Processing, pp. 808 – 812, 2014.
-  H. C. Burger, C. J. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with BM3D?,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 2392–2399, 2012.
-  S. Hawe, M. Seibert, and M. Kleinsteuber, “Separable dictionary learning,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 438–445, 2013.
-  M. Aharon and M. Elad, “Sparse and Redundant Modeling of Image Content Using an Image-Signature-Dictionary,” SIAM Journal on Imaging Sciences, vol. 1, no. 3, pp. 228–247, 2008.
-  L. Benoît, J. Mairal, F. Bach, and J. Ponce, “Sparse image representation with epitomes,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011.
-  R. Rubinstein, M. Zibulevsky, and M. Elad, “Double Sparsity : Learning Sparse Dictionaries for Sparse Signal Approximation,” IEEE Trans. Signal Process., vol. 58, no. 3, pp. 1553–1564, 2010.
-  M. Yaghoobi and M. E. Davies, “Compressible dictionary learning for fast sparse approximations,” in IEEE/SP 15th Workshop on Statistical Signal Processing, pp. 662–665, Aug. 2009.
-  L. Le Magoarou and R. Gribonval, “Chasing butterflies: In search of efficient dictionaries,” in IEEE Int. Conf. Acoust. Speech, Signal Process, Apr. 2015.
-  O. Chabiron, F. Malgouyres, J. Tourneret, and N. Dobigeon, “Toward Fast Transform Learning,” International Journal of Computer Vision, pp. 1–28, 2015.
-  M. Elad, P. Milanfar, and R. Rubinstein, “Analysis versus synthesis in signal priors,” Inverse Problems, vol. 23, pp. 947–968, 2007.
-  R. Rubinstein and M. Elad, “Dictionary Learning for Analysis-Synthesis Thresholding,” IEEE Trans. on Signal Process., vol. 62, no. 22, pp. 5962–5972, 2014.
-  S. Ravishankar and Y. Bresler, “Learning Sparsifying Transforms,” IEEE Trans. Signal Process., vol. 61, no. 5, pp. 1072–1086, 2013.
-  S. Ravishankar, B. Wen, and Y. Bresler, “Online Sparsifying Transform Learning - Part I: Algorithms,” IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 4, pp. 625–636, 2015.
-  S. Ravishankar and Y. Bresler, “Learning doubly sparse transforms for images,” IEEE Trans. Image Process., vol. 22, no. 12, pp. 4598–4612, 2013.
-  L. Bottou, “Online algorithms and stochastic approximations,” in Online Learning and Neural Networks, Cambridge University Press, 1998. Revised Oct. 2012.
-  M. Almeida and M. Figueiredo, “Frame-based image deblurring with unknown boundary conditions using the alternating direction method of multipliers,” in IEEE International Conference on Image Processing (ICIP), pp. 582–585, Sept 2013.
-  S. Reeves, “Fast image restoration without boundary artifacts,” IEEE Trans. Image Process., vol. 14, pp. 1448–1453, Oct 2005.
-  S. Mallat, A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way. Academic Press, 3rd ed., 2008.
-  A. Cohen, I. Daubechies, and P. Vial, “Wavelet bases on the interval and fast algorithms,” Journal of Applied and Computational Harmonic Analysis, vol. 1, no. 12, pp. 54–81, 1993.
-  Y. Zhao and D. Malah, “Improved segmentation and extrapolation for block-based shape-adaptive image coding,” in Proc. Vision Interface, pp. 388–394, 2000.
-  E. J. Candes and D. L. Donoho, “Curvelets, multiresolution representation, and scaling laws,” in Proc. SPIE, vol. 4119, pp. 1–12, 2000.
-  M. N. Do and M. Vetterli, “The contourlet transform: an efficient directional multiresolution image representation,” IEEE Trans. Image Process., vol. 14, no. 12, pp. 2091–2106, 2005.
-  T. Blumensath and M. E. Davies, “Normalized iterative hard thresholding: Guaranteed stability and performance,” IEEE Journal on Selected Topics in Signal Processing, vol. 4, no. 2, pp. 298–309, 2010.
-  T. Blumensath and M. E. Davies, “Iterative Thresholding for Sparse Approximations,” Journal of Fourier Analysis and Applications, vol. 14, pp. 629–654, Sept. 2008.
-  L. Bottou, “Stochastic Gradient Descent Tricks,” Neural Networks: Tricks of the Trade, vol. 1, no. 1, pp. 421–436, 2012.
-  L. Bottou and O. Bousquet, “The Tradeoffs of Large Scale Learning,” Artificial Intelligence, vol. 20, pp. 161–168, 2008.
-  L. Bottou, “Online learning and stochastic approximations,” On-line learning in neural networks, pp. 1–34, 1998.
-  R. Rubinstein, M. Zibulevsky, and M. Elad, “Efficient Implementation of the K-SVD Algorithm using Batch Orthogonal Matching Pursuit,” Technion - Computer Science Department - Technical Report, pp. 1–15, 2008.
-  J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online Dictionary Learning for Sparse Coding,” in Int. Conference on Machine Learning, 2009.
-  J. Sulam and M. Elad, “Expected patch log likelihood with a sparse prior,” in Energy Minimization Methods in Computer Vision and Pattern Recognition, Lecture Notes in Computer Science, pp. 99–111, Springer International Publishing, 2015.
-  Y. Romano and M. Elad, “Boosting of Image Denoising Algorithms,” SIAM Journal on Imaging Sciences, vol. 8, no. 2, pp. 1187–1219, 2015.
-  M. Lebrun, A. Buades, and J. M. Morel, “Implementation of the ‘Non-Local Bayes’ (NL-Bayes) Image Denoising Algorithm,” Image Processing On Line, vol. 3, no. 3, pp. 1–42, 2013.
-  A. Levin, B. Nadler, F. Durand, and W. T. Freeman, “Patch Complexity, Finite Pixel Correlations and Optimal Denoising,” in European Conference on Computer Vision (ECCV), 2012.
-  O. Bryt and M. Elad, “Compression of facial images using the K-SVD algorithm,” J. Vis. Commun. Image Represent., vol. 19, pp. 270–282, May 2008.
-  C. Sanderson and B. C. Lovell, “Multi-region probabilistic histograms for robust and scalable identity inference,” Lecture Notes in Computer Science, pp. 199–208, 2009.
-  D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proc. 8th Int’l Conf. Computer Vision, vol. 2, pp. 416–423, July 2001.
-  R. Eslami and H. Radha, “Translation-invariant contourlet transform and its application to image denoising,” IEEE Trans. Image Process., vol. 15, no. 11, pp. 3362–3374, 2006.