Online Discriminative Dictionary Learning for Image Classification Based on Block-Coordinate Descent Method
Abstract
Previous research has demonstrated that the framework of dictionary learning with sparse coding, in which signals are decomposed as linear combinations of a few atoms of a learned dictionary, is well adapted to reconstruction problems. This framework has also been used for discrimination tasks such as image classification. To achieve better classification performance, several methods have been developed to learn a discriminative dictionary in a supervised manner. However, when the data become extremely large in scale, these methods are no longer effective because they are all batch-oriented approaches. For this reason, we propose a novel online algorithm for discriminative dictionary learning, dubbed ODDL in this paper. First, we introduce a linear classifier into the conventional dictionary learning formulation and derive a discriminative dictionary learning problem. Then, we use an online algorithm to solve the derived problem. Unlike most existing approaches, which update the dictionary and classifier alternately by iteratively solving subproblems, our approach updates them jointly. It can also largely shorten the training runtime and is particularly suitable for large-scale classification problems. To evaluate the proposed ODDL approach in image recognition, we conduct experiments on three well-known benchmarks, and the experimental results demonstrate that ODDL is fairly promising for image classification tasks.
1 Introduction
Dictionary learning with sparse coding, which decomposes signals as linear combinations of a few atoms from some basis or dictionary, has drawn extensive attention in recent years. Researchers have demonstrated that this framework can achieve state-of-the-art performance in image processing tasks such as image denoising [9] and face recognition [22, 27]. Given a signal $x$ and a fixed dictionary $D$ which contains $k$ atoms, we say that $x$ admits a sparse representation over $D$ if we can find a sparse coefficient vector $\alpha$ such that $x \approx D\alpha$. Predefined dictionaries based on various types of wavelets [19] are not suitable for many vision applications such as appearance-based image classification, because their atoms do not make use of the semantic prior of the given signals. Learned dictionaries, in contrast, can achieve more promising performance than predefined ones in various image processing tasks [18, 26].
Several algorithms have recently been proposed for learning such dictionaries based on sparse representation. For example, the K-SVD algorithm [1] learns an overcomplete dictionary from the training data. It updates the atoms of the dictionary one at a time, keeping all the other atoms fixed and finding a new atom and its corresponding coefficients that minimize the mean square error (MSE). Researchers have shown that this algorithm achieves outstanding performance in image compression and denoising [5, 10]. However, K-SVD focuses merely on the reconstructive power of the learned dictionary, so it is not intrinsically adapted to (image) discrimination or classification tasks. To address this problem while retaining the power of dictionary learning, several methods have been proposed recently. For example, semi-supervised dictionaries [22] are learned by iteratively updating the K-SVD dictionary based on the results of a linear classifier. Likewise, by adding a linear classifier, another algorithm called discriminative K-SVD [27] was developed for image classification. Moreover, to make the dictionary discriminative, a more sophisticated loss function, the logistic loss (softmax for multiclass classification), is added to the classical dictionary learning formulation [16, 17].
In addition, most recent dictionary learning methods are iterative batch algorithms, which access all the training samples at each iteration to minimize the objective function under sparsity constraints. Consequently, when the training set becomes very large, these methods are no longer efficient. To overcome this bottleneck, an online dictionary learning algorithm that applies the block-coordinate descent method [15] has been proposed in the literature. However, this online method still learns a reconstructive dictionary, which represents the signals well but is not adapted to classification. Mairal et al. attempt to address this issue with task-driven dictionary learning [13], where supervised dictionaries are learned via a stochastic gradient descent algorithm.
To overcome the above two problems, i.e. the lack of discriminative power in reconstructive dictionaries and the difficulty of handling large-scale training sets, we propose a novel formulation for learning discriminative dictionaries in an online manner. We name our approach ODDL in this paper. We first incorporate label information into the dictionary learning stage by adopting a linear classifier, and then formulate a supervised dictionary learning problem. To solve this problem, we propose a corresponding online algorithm, in which we apply the block-coordinate descent method to train the dictionary and classifier simultaneously. Unlike most recent methods, which update the dictionary and classifier alternately by iteratively solving subproblems, it learns the dictionary and classifier jointly. Finally, we carry out experiments on three well-known benchmarks to demonstrate the effectiveness of the proposed method, and the experimental results show that ODDL is fairly competitive for classification tasks.
In summary, the main contributions of this paper include the following:

We propose a novel online algorithm, with a numerical solution, for learning a discriminative dictionary. It merges online learning and discriminative dictionary learning into one framework. In other words, the proposed approach can derive the discriminative dictionary efficiently and effectively while also handling large-scale classification problems.

Our analysis shows that the algorithm updates the classifier simultaneously with the dictionary when a new training sample arrives. In this way, the computational cost can be significantly reduced.

As shown experimentally, our approach achieves encouraging performance compared with some other dictionary learning approaches.

We also suggest a novel, efficient and effective dictionary construction scheme for face recognition, which our experiments show to be promising.
The paper is organized as follows. Section 2 introduces the basic formulation of dictionary learning and sparse representation for classification. Then our proposed approach is presented in Section 3, followed by the experimental results demonstrated in Section 4. Finally, we conclude our paper in Section 5.
2 Related Work on Dictionary Learning Methods
Recent research has demonstrated that natural signals such as images admit sparse representations over some redundant basis (also called a dictionary). Here the term “basis” is used loosely, since the dictionary can be overcomplete and, even when it is just complete, there is no guarantee of independence between the atoms. This phenomenon explains why image classification can be done by sparse representation with an overcomplete dictionary learned from the training images. In this section we briefly review three dictionary learning schemes that are closely related to our proposed method. Fig. 1 illustrates the flow of the three dictionary learning schemes together with the classifier training process.
2.1 Reconstructive Dictionary Learning for Classification
In classical sparse coding problems, consider a signal $x$ and a dictionary $D$. Under the assumption that a natural signal can be approximately represented by a linear combination of a few selected atoms from the dictionary, $x$ can be represented by $D\alpha$ for some sparse coefficient vector $\alpha$. Finding the sparse representation of $x$ is equivalent to the following optimization problem:
(1) 
where $p$ is 0 or 1. The $\ell_0$ pseudo-norm sparse coding problem is NP-hard [2], and several greedy algorithms [20, 21] have been proposed to approximate its solution. The $\ell_1$ formulation of sparse coding is the well-known Lasso [25] or Basis Pursuit [6] problem and can be solved effectively by algorithms such as LARS [8].
Eq. 1 is the classical reconstructive dictionary learning problem, in which overlapping patches rather than whole images are sparsely decomposed, because natural images are usually very large. For an image , suppose there are overlapping patches from image . Then the dictionary is learned by alternately solving the following optimization over and :
(2) 
where is the coefficient matrix, is the patch of image written as a column vector, and is the corresponding sparse code. Several algorithms have been proposed for solving this dictionary learning problem, such as K-SVD [1] and the method of optimal directions [11].
Suppose we are given sets of signals which belong to different classes. The training stage for classification based on sparse representations is composed of two independent parts: dictionary learning and classifier learning. First, a dictionary of classes is learned according to (2). Then, the classifier is trained by solving the following optimization problem:
(3) 
where is the label matrix of the training patches, is the coefficient matrix computed on the learned dictionary , and is a loss function. However, this dictionary learning scheme has two main drawbacks, both visible in Fig. 1 (a):
1. Dictionary training and classifier training are two independent stages. Thus, the learned dictionary cannot capture the most discriminative cues that are helpful for classification.
2. In practice, to improve the representative capacity of the dictionary, we often use large-scale training sets. Batch learning on such a set, however, fails to produce an effective dictionary because of the large-scale dataset problem.
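The second, independent stage of this scheme can be illustrated with a minimal sketch. It assumes a squared loss with a small ridge penalty, which is only one possible choice for the loss in (3); the function name `train_linear_classifier` and the parameter `reg` are hypothetical and do not come from the paper.

```python
import numpy as np

def train_linear_classifier(A, H, reg=1e-3):
    """Fit W minimizing ||H - W A||_F^2 + reg * ||W||_F^2 with the dictionary held fixed.

    A: (k, n) sparse codes, one column per training patch.
    H: (c, n) label matrix, one column per training patch.
    The squared loss and `reg` are illustrative assumptions.
    """
    k = A.shape[0]
    # Closed form: W = H A^T (A A^T + reg * I)^{-1}; the matrix is symmetric.
    return np.linalg.solve(A @ A.T + reg * np.eye(k), A @ H.T).T
```

Because the codes `A` are computed before the classifier ever sees a label, nothing in this stage can feed back into the dictionary, which is the first drawback listed above.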
2.2 Discriminative Dictionary Learning for Classification
Researchers have already made efforts to overcome the first drawback mentioned in the previous subsection, namely that the learned dictionaries lack discriminative power for classification. In [16, 17], a discriminative term is introduced to combine the classifier learning process with dictionary learning, and the final objective function is:
(4) 
where is the classifier parameter, or is the label of patch , and is a logistic loss function, . In addition, in [22] and [27], a simpler term which is a linear classifier is considered for the discriminative power:
(5) 
where and are the classifier parameters, and is the label vector of patch , in which the element associated with the class label is 1 and the others are 0. denotes the Frobenius norm of a matrix , i.e. . Without loss of generality, the intercept can be omitted by normalizing all the signals.
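The label vectors described here, with a 1 at the class index and 0 elsewhere, can be assembled into a label matrix as in the sketch below. The helper names `one_hot_labels` and `predict` are hypothetical, and classifying by the largest linear score is a common convention assumed here rather than a rule stated in the text.

```python
import numpy as np

def one_hot_labels(labels, num_classes):
    """Return a (num_classes, n) matrix whose i-th column is the label vector of sample i."""
    labels = np.asarray(labels)
    H = np.zeros((num_classes, labels.size))
    H[labels, np.arange(labels.size)] = 1.0   # 1 at the class index, 0 elsewhere
    return H

def predict(W, A, b=None):
    """Assign each column of the code matrix A to the class with the largest linear score."""
    scores = W @ A if b is None else W @ A + b[:, None]
    return np.argmax(scores, axis=0)
```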
Dictionaries learned by these methods generally perform better in classification tasks than those learned in a reconstructive way. However, Fig. 1 (b) reveals a serious drawback of these methods: if a new and important training sample arrives after the dictionary has been learned, the dictionary has to be relearned from scratch. In other words, discriminative dictionary learning methods also suffer from the large-scale dataset problem.
2.3 Online Dictionary Learning for Classification
Learning from a large-scale training set is a natural counterpart of the way human beings learn from experience. However, the two dictionary learning schemes above fail to handle the large-scale dataset problem. For this reason, the online dictionary learning algorithm of [15] has emerged as an efficient paradigm for large-scale training sets. Inspired by [4], Mairal et al. replace the original empirical objective function with the expected objective function, obtaining a new dictionary learning problem:
(6) 
where denotes the sparse coefficients computed in the sparse coding stage. To solve the above problem, they propose an online algorithm which applies the block-coordinate descent method for dictionary updating. However, one obvious drawback of this algorithm is that it also ignores the valuable label information that would enhance classification performance. Furthermore, the training flow in Fig. 1 (c) reveals another critical defect: even though the dictionary can be learned efficiently in an online manner, the classifier must be relearned from scratch whenever a new training sample arrives.
3 Online Discriminative Dictionary Learning
In the previous section, we reviewed three dictionary learning schemes and their respective drawbacks. We now derive our online discriminative dictionary learning (ODDL) method to overcome these defects. The schematic flow chart is shown in Fig. 2, which makes the difference from the three aforementioned schemes apparent.
3.1 Proposed Formulation
To address the lack of discriminative information in the learned dictionary, we introduce a discriminative term into the original dictionary learning problem. In this paper, we consider a linear classifier for its simplicity. Adding the linear classifier, we obtain the following problem:
(7) 
where is the patch matrix, is the label matrix of the training patches, is the reconstructive error term, is the discriminative term, and controls the tradeoff between the reconstructive and discriminative terms.
We now address the large-scale dataset problem. As Bottou and Bousquet [4] point out, what matters is not the minimization of the empirical cost but the minimization of the expected cost:
(8) 
where the expectation is taken with respect to the joint distribution of . In practice, a large amount of training data is needed to improve the representative power of learned dictionaries. For example, when applying dictionaries to image processing tasks, the number of training patches can reach several million for a single image. In this case, an efficient technique is required to handle the large-scale dataset problem, and online learning is such a technique.
3.2 Optimization
In this subsection, we introduce an online discriminative dictionary learning algorithm to solve formulation (8) proposed in the previous subsection. As in most existing dictionary learning algorithms, our algorithm has two stages.
Sparse coding The sparse coding problem (1) with a learned dictionary is an $\ell_p$-norm optimization problem, where $p$ is 0 or 1. Several algorithms have been proposed for solving this problem. In this paper, we choose the $\ell_0$ pseudo-norm optimization problem as our sparse coding problem, since this formulation lets us explicitly control the sparsity (the number of nonzero elements) of the coefficients of the signals projected onto the learned dictionary. This leads us to use the Orthogonal Matching Pursuit (OMP) algorithm [21], a greedy algorithm which sequentially selects the atom with the highest correlation to the current orthogonally projected residual.
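The greedy selection just described can be sketched as follows. This is a plain textbook version of OMP, not the authors' implementation; the function name `omp` and the argument `sparsity` (the maximum number of nonzero coefficients) are illustrative.

```python
import numpy as np

def omp(D, x, sparsity):
    """Greedy sparse coding: approximate x with at most `sparsity` atoms (columns) of D."""
    residual = x.astype(float).copy()
    support = []                      # indices of the atoms selected so far
    coef = np.zeros(0)
    for _ in range(sparsity):
        correlations = np.abs(D.T @ residual)
        if support:
            correlations[support] = 0.0      # never pick the same atom twice
        j = int(np.argmax(correlations))
        if correlations[j] < 1e-12:          # residual is already explained
            break
        support.append(j)
        # Least-squares fit on the selected atoms, i.e. an orthogonal projection of x
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    alpha = np.zeros(D.shape[1])
    if support:
        alpha[support] = coef
    return alpha
```

Refitting all selected coefficients at every step is what distinguishes OMP from plain matching pursuit: the residual stays orthogonal to every atom already chosen.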
Dictionary and classifier updating This stage is markedly different from that of other discriminative dictionary learning approaches. In our proposed ODDL, we use the block-coordinate descent method to update the dictionary and classifier jointly, whereas the usual strategy in other algorithms is to approximate the solutions for the dictionary and classifier by solving subproblems iteratively.
Rewriting Eq. 7, we derive a compact formulation as our objective function:
(9) 
Note that from a dictionary learning viewpoint, the “dictionary” , which represents the “signal” , is always assumed to be normalized column-wise during the update, i.e. the Euclidean length of each column of the “dictionary” is 1. Moreover, the actual dictionary we derive is also normalized. Therefore, we can drop the regularization term in the objective function and arrive at the final function:
(10) 
Our algorithm makes the important assumption that the training set is composed of i.i.d. samples drawn from a probability distribution . Using the same strategy as stochastic gradient descent, our algorithm draws one sample at each iteration, computes the sparse code of on the previous dictionary , and then updates the dictionary and classifier parameter simultaneously by solving the following problem:
(11) 
To address this problem, first we denote as and as . Then problem (11) can be rewritten as
(12) 
Using the blockcoordinate descent method, the th column of can be updated using
(13) 
Then parting and off we can update and by
(14) 
The details of the derivation are given in the Appendix.
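The joint update can be sketched as follows, under explicit assumptions, since the exact expressions of (9) to (14) are not reproduced here. The sketch stacks the dictionary and the classifier into one "dictionary", keeps running statistics in the style of the online method of [15], sweeps the stacked columns once by block-coordinate descent with unit-length normalization, and then splits the result. The names `joint_update`, `A_acc`, `B_acc` and the weight `gamma` on the label rows are hypothetical; how the paper's own tradeoff parameter maps onto `gamma` is not specified here.

```python
import numpy as np

def joint_update(D, W, A_acc, B_acc, x, h, alpha, gamma=1.0):
    """One online step updating dictionary D (m x k) and classifier W (c x k) together.

    x: signal (m,), h: label vector (c,), alpha: sparse code of x (k,).
    A_acc (k x k) and B_acc ((m + c) x k) are running sums, modified in place.
    """
    m = D.shape[0]
    g = np.sqrt(gamma)
    S = np.vstack([D, g * W])              # stacked "dictionary"
    z = np.concatenate([x, g * h])         # stacked "signal"
    A_acc += np.outer(alpha, alpha)
    B_acc += np.outer(z, alpha)
    for j in range(S.shape[1]):
        if A_acc[j, j] < 1e-12:            # atom not used so far, leave it unchanged
            continue
        u = S[:, j] + (B_acc[:, j] - S @ A_acc[:, j]) / A_acc[j, j]
        S[:, j] = u / max(np.linalg.norm(u), 1e-12)   # unit-length stacked column
    return S[:m], S[m:] / g, A_acc, B_acc  # split back into D and W
```

A single sweep touches both blocks of every column at once, so no alternation between a dictionary subproblem and a classifier subproblem is needed. Whether the dictionary block should be renormalized on its own after the split is left out of this sketch.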
3.3 Algorithm
The approach we propose in this paper is a block-coordinate descent algorithm, summarized in Algorithm 1. In this algorithm, the i.i.d. samples are drawn sequentially from an unknown probability distribution . However, since the distribution is unknown, obtaining such i.i.d. samples may be very difficult. The common trick in online algorithms is to cycle over a randomly permuted training set [3]. The convergence of the overall algorithm has been established empirically and theoretically in [15]. We do not elaborate on the proofs, as they are not the main contribution of this paper; interested readers are referred to [15].
Input: (random variables and a method to draw i.i.d samples of ), (regularization parameters), (sparsity factor), (number of iterations).
Output: Dictionary and classifier parameter .
Input: , , .
Output: and .
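The trick of cycling over a randomly permuted training set can be sketched as a small generator. The name `sample_stream` and the fixed `seed` are illustrative choices; the sketch shuffles once and then cycles, which is one reading of the description above.

```python
import numpy as np

def sample_stream(X, H, n_iters, seed=0):
    """Yield (signal, label vector) pairs by cycling over a randomly permuted training set.

    X: (m, n) signals as columns, H: (c, n) label vectors as columns.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[1])    # shuffle once, then cycle
    for t in range(n_iters):
        i = order[t % order.size]
        yield X[:, i], H[:, i]
```

Setting `n_iters` to the number of training samples makes exactly one pass over the data.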
Initialization. The initialization of the dictionary and classifier plays an important role in our proposed method; poor initialization may lead to poor performance. One can use patches randomly selected from the training data and a zero matrix to initialize and respectively. In practice, our experiments show that using a classical reconstructive dictionary as the initial dictionary consistently leads to better performance than using raw patches from the training data. Given this initial dictionary , the classifier can be initialized by solving the optimization problem (5).
Mini-batch strategy. The convergence speed of our algorithm can be improved with a mini-batch strategy, which is widely used in stochastic gradient descent algorithms. The mini-batch strategy draws more than one sample (denote the number of samples as ) at each iteration instead of a single one. It is motivated by the fact that the runtime for solving the $\ell_0$ pseudo-norm optimization problem (1) with dictionary can be greatly shortened by using the Batch-OMP algorithm [24] with precomputation of the matrix .
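Why precomputation pays off for a mini-batch can be seen in the sketch below. It assumes the precomputed matrix is the Gram matrix of the atoms, which is what Batch-OMP [24] reuses across signals; the function names are hypothetical, and this is not a full Batch-OMP implementation.

```python
import numpy as np

def batch_precompute(D, X_batch):
    """Quantities shared by all signals of a mini-batch (X_batch holds signals as columns)."""
    G = D.T @ D            # Gram matrix of the atoms, computed once per dictionary
    C = D.T @ X_batch      # initial atom/signal correlations, one column per signal
    return G, C

def residual_correlations(G, c0, support, coef):
    """Correlations D^T r for r = x - D[:, support] @ coef, without forming r explicitly."""
    return c0 - G[:, support] @ coef
```

Each greedy step then works with vectors of length equal to the number of atoms instead of touching the signal dimension again.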
4 Experiment
In this section, we demonstrate the performance of our proposed ODDL method on two image classification tasks, handwritten digit recognition and face recognition. (Because ODDL is an online approach, testing on a large-scale database is required to evaluate its performance fully; that evaluation is under way, and we plan to report it in future work.) Before presenting the experiments, we first discuss the choices of three important parameters in our algorithm.
4.1 Choices of Parameters
Parameter . As introduced in the previous section, we choose the $\ell_0$ pseudo-norm optimization problem as our sparse coding problem and use the Orthogonal Matching Pursuit (OMP) algorithm to find approximate solutions. The sparsity prior controls the number of nonzero elements of the sparse coefficients in our algorithm. Our experiments have shown that handwritten digit images and face images can be represented well when is and respectively.
Parameter . This parameter controls the tradeoff between the reconstructive and discriminative power in our method. Large values place most of the weight on the reconstruction error, while small values enhance the discriminative power at the cost of representation ability. Thus, the value of plays an important role in balancing representation and classification. In practice, the value has given good performance in our experiments.
Parameter . In our method, we cycle over a randomly permuted training set, which is a common technique in online algorithms for obtaining i.i.d. samples. We have observed that when is chosen so that the whole training set is cycled through exactly once, the experimental results are consistently good.
4.2 Handwritten Digit Recognition
In this section we present experiments on the MNIST [14] and USPS [7] handwritten digit datasets. MNIST contains a total of 70,000 images of size $28 \times 28$, of which 60,000 are for training and 10,000 for testing. USPS contains 7,291 training images and 2,007 testing images of size $16 \times 16$.
All the digit images are vectorized and normalized to have zero mean and unit norm. Using these two datasets, we test four methods: our proposed ODDL method; the K-SVD method with a linear classifier, dubbed ksvdlinear; the online reconstructive dictionary learning method with a linear classifier, dubbed onlinereclinear; and the dksvd method of [27]. In the ODDL and dksvd methods, we learn a single dictionary with atoms, corresponding to roughly atoms per class, and a single classifier. For the ksvdlinear and onlinereclinear methods, independent dictionaries, each with atoms, are first learned, one for each class. Then, we adopt the one-vs-all strategy [23] for learning classifiers. For class , the one-vs-all strategy uses all samples from class as the positive samples and the samples from the other classes as the negative samples to train the classifier of class .
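The preprocessing and the one-vs-all labeling can be sketched as follows. The helper names are hypothetical, and encoding positives as +1 and negatives as -1 is an assumption; the text only specifies which samples are positive and which are negative.

```python
import numpy as np

def normalize_columns(X):
    """Shift each column (one vectorized image) to zero mean, then scale it to unit l2 norm."""
    X = X - X.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    return X / np.maximum(norms, 1e-12)    # guard against constant images

def one_vs_all_targets(labels, positive_class):
    """+1 for samples of `positive_class`, -1 for samples of every other class."""
    return np.where(np.asarray(labels) == positive_class, 1.0, -1.0)
```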
The average error rates of the four tested methods on MNIST and USPS are shown in Table 1. The results show that dictionaries learned in a discriminative way lead to better classification performance than those learned in a reconstructive way. Compared with methods that use more sophisticated classifier models, such as linear and bilinear logistic loss functions, our proposed method does not perform better. We believe that one of the main reasons is the simplicity of our linear classifier model. Nevertheless, our method provides a new strategy for online discriminative dictionary learning, and its great strength is that the dictionary and classifier can be updated jointly, which is markedly different from the training strategy of most existing methods. Figure 3 shows dictionaries of the USPS dataset learned by the ksvdlinear and ODDL methods respectively.
Table 1: Average error rates (%) on MNIST and USPS.
Method            MNIST   USPS
ODDL              3.58    5.35
ksvdlinear        5.07    7.12
onlinereclinear   5.32    7.35
dksvd             4.58    6.53
In addition, we compare the runtime of our ODDL method and the ksvdlinear method for dictionary and classifier training. We take the total time for learning the dictionaries and classifiers of all classes, then compute the average runtime by dividing it by the number of classes. The results are shown in Table 2, from which we can see that ODDL largely shortens the runtime for dictionary and classifier learning compared with the ksvdlinear method at the same dictionary size.
Table 2: Average training runtime per class on MNIST and USPS.
Method       MNIST   USPS
ODDL         156     23
ksvdlinear   583     62
To study the role of the dictionary size in our method, we conduct another set of experiments. We learn dictionaries of different sizes $k \in \{160, 320, 640, 960, 1280, 2560\}$ from the training set, and record their performance on the testing set. The results are shown in Table 3. We observe that the dictionary size plays an important role in the classification task. If $k$ is too small, the learned dictionaries do not contain sufficient information for discrimination. When $k$ is too large, the learned dictionaries contain too much redundant information, which may impair discrimination.
Table 3: Error rates (%) for different dictionary sizes k.
k       160    320    640    960    1280   2560
MNIST   5.49   4.76   4.02   3.58   3.92   4.38
USPS    7.63   6.43   5.78   5.35   5.69   6.24
4.3 Extended YaleB Face Recognition
The Extended YaleB face dataset [12] consists of 2,414 near-frontal face images of 38 individuals. These images are taken with different poses and under different illumination conditions. We randomly divide the dataset into two parts, each containing roughly half of the samples. One part is used for learning the dictionary and classifier, while the other is used as the testing set. Before presenting our experiments, we describe the preprocessing steps. The most important features in face recognition are the eyebrows, eyes, nose, mouth, and chin. Using this information, we divide each face image into four non-overlapping patches from top to bottom, and into three non-overlapping patches from left to right. Figure 4 shows such patches. We can observe that each patch contains at least one feature. After this step, we have seven classes of patches for each person. We then vectorize all the patches and normalize them to have unit norm. In our experiments, seven dictionaries with atoms and seven classifiers are learned, corresponding to the seven patch classes.
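The patch construction can be sketched as follows. Splitting into strips of (nearly) equal size is an assumption, since the exact patch boundaries are only shown in Figure 4; the function name `face_patches` is hypothetical.

```python
import numpy as np

def face_patches(image):
    """Return seven vectorized, unit-norm patches of a 2-D face image.

    Four horizontal strips (top to bottom) followed by three vertical strips
    (left to right); equal-sized strips are an illustrative assumption.
    """
    rows = np.array_split(image, 4, axis=0)
    cols = np.array_split(image, 3, axis=1)
    return [p.reshape(-1) / max(np.linalg.norm(p), 1e-12) for p in rows + cols]
```

Each of the seven positions then forms its own patch class, with one dictionary and one classifier per class.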
For comparison, we also test the ksvdlinear, onlinelinear, and dksvd methods alongside our proposed method. The results are reported in Table 4. It is easy to see that discriminative dictionaries perform better than reconstructive ones. Figure 5 plots the dictionaries learned by our ODDL method for two individuals.
Table 4: Results on the Extended YaleB dataset.
ODDL   ksvdlinear   onlinelinear   dksvd
1.09   2.03         2.24           1.76
As in the experiments with the handwritten digit datasets, we also compare the average training runtime of our proposed ODDL method and the ksvdlinear method. Table 5 shows the results. For the ksvdlinear method, the dictionary size is for each patch class of each person. For our proposed method, we measure the average training runtime when the dictionary sizes are 304 and 608 respectively. As expected, learning smaller dictionaries shortens the runtime.
Table 5: Average training runtime on the Extended YaleB dataset.
ODDL (304)   ODDL (608)   ksvdlinear
3            4            10
5 Conclusion and Future Work
In this paper, we propose a novel framework for online discriminative dictionary learning (ODDL) for image classification. By introducing a linear classifier into the conventional dictionary learning problem, the learned dictionary captures discriminative cues for classification while retaining its representation power for reconstruction. We propose an online algorithm to solve this discriminative problem. Unlike other algorithms, which find the dictionary and classifier alternately by solving subproblems iteratively, our algorithm finds them jointly. The experimental results on the MNIST and USPS handwritten digit datasets and the Extended YaleB face dataset demonstrate that our method is very competitive when applied to image classification tasks with large-scale training sets. More experiments are needed in the future to better demonstrate the performance of the proposed method for image classification.
Acknowledgements
This work is supported by the 973 Program (Project No. 2010CB327905) and the Natural Science Foundation of China (No. 61071218).
Appendix A
To obtain (12), denote as the function to minimize in (11), then a bit of algebra gives
(15) 
where , and . Since the last term of the final expression does not depend on and , we can drop it and obtain (12).
In order to obtain the update of , the th column of , a blockcoordinate descent method is used. Denote the objective function in (12) as , then using some algebraic transformations we obtain
(16) 
Now consider only the terms associated with , which we denote as
(17) 
Notice that the above transformations use the important fact that the matrix is symmetric. Computing the derivative of with respect to , we have
(18) 
Thus setting the above derivative to 0, can be updated
(19) 
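The step of setting the derivative to zero can be checked numerically with the sketch below. It assumes that the objective in (12) has the standard quadratic form used in online dictionary learning [15], namely 0.5 Tr(S^T S A) - Tr(S^T B) with A symmetric; the names `S`, `A`, `B` and the helper functions are illustrative and may differ from the paper's notation.

```python
import numpy as np

def surrogate(S, A, B):
    """Quadratic surrogate 0.5 * Tr(S^T S A) - Tr(S^T B)."""
    return 0.5 * np.trace(S.T @ S @ A) - np.trace(S.T @ B)

def column_gradient(S, A, B, j):
    """Gradient of the surrogate with respect to column j of S (uses A = A^T)."""
    return S @ A[:, j] - B[:, j]

def update_column(S, A, B, j):
    """Return a copy of S whose column j zeroes its own gradient (no normalization)."""
    S = S.copy()
    S[:, j] += (B[:, j] - S @ A[:, j]) / A[j, j]
    return S
```

With the other columns held fixed, the updated column is the exact minimizer along that block, so the surrogate value cannot increase.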
References
 [1] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.
 [2] E. Amaldi and V. Kann. On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems, 1997.
 [3] L. Bottou. Online algorithms and stochastic approximations. In D. Saad, editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998.
 [4] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In NIPS, 2007.
 [5] O. Bryt and M. Elad. Compression of facial images using the K-SVD algorithm. J. Vis. Comun. Image Represent., 19:270–282, May 2008.
 [6] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1):33–61, 1998.
 [7] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pages 396–404. Morgan Kaufmann, 1990.
 [8] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Ann. Statist., 32(2):407–499, 2004. With discussion, and a rejoinder by the authors.
 [9] M. Elad and M. Aharon. Image denoising via learned dictionaries and sparse representation. In CVPR, 2006.
 [10] M. Elad and M. Aharon. Image denoising via learned dictionaries and sparse representation. In CVPR, pages 17–22, 2006.
 [11] K. Engan, S. O. Aase, and J. Hakon Husoy. Method of optimal directions for frame design. In Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’99), pages 2443–2446, Washington, DC, USA, 1999. IEEE Computer Society.
 [12] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):643–660, June 2001.
 [13] J. Mairal, F. Bach, and J. Ponce. Task-driven dictionary learning. Technical report, 2010.
 [14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
 [15] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.
 [16] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. In CVPR, 2008.
 [17] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. In NIPS, pages 1033–1040, 2008.
 [18] J. Mairal, M. Elad, and G. Sapiro. Sparse representation for color image restoration. IEEE Transactions on Image Processing, pages 53–69, 2007.
 [19] S. Mallat. A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, 3rd edition, December 2008.
 [20] S. Mallat and Z. Zhang. Matching pursuit with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41:3397–3415, 1993.
 [21] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers, pages 40–44, 1993.
 [22] D.S. Pham and S. Venkatesh. Joint learning and dictionary construction for pattern recognition. In CVPR, 2008.
 [23] R. Rifkin and A. Klautau. In Defense of One-Vs-All Classification. Journal of Machine Learning Research, 5:101–141, Jan. 2004.
 [24] R. Rubinstein, M. Zibulevsky, and M. Elad. Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit. Technical report, April 2008.
 [25] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.
 [26] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, and S. Yan. Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE, 98(6):1031–1044, June 2010.
 [27] Q. Zhang and B. Li. Discriminative K-SVD for dictionary learning in face recognition. In CVPR, pages 2691–2698, 2010.