Sparse Coding with Earth Mover's Distance for Multi-Instance Histogram Representation
Abstract
Sparse coding (SC) has been well studied as a powerful data representation method. It attempts to represent the feature vector of a data sample by reconstructing it as a sparse linear combination of some basic elements, and a norm distance function is usually used as the loss function for the reconstruction error. In this paper, we investigate using SC as the representation method within the multi-instance learning framework, where a sample is given as a bag of instances and is further represented as a histogram of the quantized instances. We argue that for the data type of histogram, using the $\ell_2$ norm distance is not suitable, and propose to use the earth mover's distance (EMD) instead as the measure of the reconstruction error. By minimizing the EMD between the histogram of a sample and its reconstruction from some basic histograms, a novel sparse coding method is developed, which is referred to as SC-EMD. We evaluate its performance as a histogram representation method in two multi-instance learning problems: abnormal image detection in wireless capsule endoscopy videos, and protein binding site retrieval. The encouraging results demonstrate the advantages of the new method over the traditional method using the $\ell_2$ norm distance.
Keywords:
Multi-Instance Learning · Histogram Representation · Sparse Coding · Earth Mover's Distance
1 Introduction
Sparse Coding (SC) has recently been proposed and well studied as an effective data representation method in the machine learning community Lee2007 (); Yang20091794 (); Mairal201019 (); Yang201545 (); Bai2015 (); Representing2015 (). Given a set of basic elements and a data sample, SC tries to represent the sample by reconstructing it as a linear combination of these basic elements. The linear combination coefficient vector can be used as the new representation of the sample. To this end, the basic elements and the coefficient vector (also called the coding vector) are learned by minimizing the reconstruction error. At the same time, we also hope that the coding vector is as sparse as possible. To measure the reconstruction error, a squared $\ell_2$ norm distance is usually applied as a loss function to compare the original feature vector and its sparse linear combination. Meanwhile, an $\ell_1$ norm regularization term is imposed on the coding vector to seek its sparsity. The advantage of using the $\ell_2$ norm distance as the loss function and the $\ell_1$ norm regularization for the sparsity lies in the fact that they are easy to optimize and interpret. The feature-sign search method was proposed to solve the SC problem by Lee et al. in Lee2007 (). Several different SC variants have also been proposed since then, by adding different bias terms to the original $\ell_2$ norm distance based SC loss function LapSC2010 (); GraphSC2011 (); Gao2013 ().
In the multi-instance learning framework, each sample is given as a bag of multiple instances, instead of one single instance as in traditional machine learning problems Zhou20091249 (); Zhou2005135 (); Zhou20071609 (); Du201551 (). For example, in image classification and retrieval problems, an image can be split into many small image patches, and each patch is an instance. In this case, we usually first learn a set of instance prototypes by clustering the instances of the training samples, then represent a sample by quantizing its instances into the instance prototypes, and obtain a quantization histogram Tsai2009100 (); Lu2005956 (); Kotani2002II105 (). The normalized histogram is used as the feature vector of the sample for further classification or retrieval tasks. When we try to apply SC to represent the histogram data samples under the multi-instance learning framework, directly using the $\ell_2$ norm distance may not be suitable anymore. Other distance functions that are especially suitable for histogram data are desired. In fact, many distance functions have been studied for histogram comparison, such as the Kullback-Leibler divergence Seghouane200797 (); Rached2004917 (); Hershey2007IV317 (), the $\chi^2$ distance Huong20091310 (); Emran200219 (); Ye2006393 (), and the Earth Mover's Distance (EMD) Levina2001251 (); Ling2007840 (); Rubner200099 (). Among these distance functions, the EMD metric has been known to quantify the errors in histogram comparison better than the other distance metrics.
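The bag-of-instances quantization step described above can be sketched as follows; this is a minimal illustration in which the function and variable names are our own, not from the paper:

```python
import numpy as np

def bag_to_histogram(instances, prototypes):
    """Quantize each instance (one row) to its nearest prototype and
    return the normalized quantization histogram of the bag."""
    # squared Euclidean distance from every instance to every prototype
    d2 = ((instances[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)              # index of the closest prototype
    hist = np.bincount(nearest, minlength=len(prototypes)).astype(float)
    return hist / hist.sum()                 # normalize so the bins sum to 1
```

The resulting normalized histogram is exactly the bag-level feature vector that the sparse coding stage operates on.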
In this paper, we propose the first SC method with the EMD metric for the representation of histogram data. Instead of using the $\ell_2$ norm distance, we model the sparse coding problem by using the EMD to construct the loss function. The newly proposed method, SC-EMD, is especially suitable for the representation of histogram data under the multi-instance learning framework.
The rest of this paper is organized as follows: the objective function of SC-EMD, the linear programming based optimization, and an iterative learning algorithm are presented in Section 2; experiments on three real multi-instance learning tasks are presented in Section 3; and conclusions are given in Section 4.
2 Sparse Coding with Earth Mover’s Distance
In this section, we introduce the novel sparse coding method using the EMD, instead of the traditional squared $\ell_2$ norm distance, as the distance metric for the representation of histogram data.
2.1 Objective function
Assume that we have a training set of $n$ data samples. We represent each data sample as a bag of multiple instances under the framework of multi-instance learning. To extract the feature vector of the $i$-th sample, we quantize the instances of the $i$-th sample into a set of instance prototypes, and use the quantization histogram as the feature vector. We assume that the number of instance prototypes is $d$; thus the feature vector of the $i$-th sample is a normalized $d$-dimensional histogram, denoted as $\mathbf{x}_i = [x_{i1}, \dots, x_{id}]^\top$, where $x_{ij}$ is the $j$-th bin of the histogram. Note that $\mathbf{x}_i$ is normalized so that $\sum_{j=1}^{d} x_{ij} = 1$, i.e., it is a distribution. The set of the histograms of all the training samples is denoted as $\mathcal{X} = \{\mathbf{x}_i\}_{i=1}^{n}$, where $\mathbf{x}_i$ is the histogram of the $i$-th sample.
Under the framework of sparse coding, we try to represent each histogram in $\mathcal{X}$ as a sparse linear combination of a set of basic histograms. We denote the set of basic histograms as $\mathcal{U} = \{\mathbf{u}_k\}_{k=1}^{m}$, where $\mathbf{u}_k = [u_{1k}, \dots, u_{dk}]^\top$ is the $k$-th basic histogram and $m$ is the number of basic histograms. Similar to $\mathbf{x}_i$, each $\mathbf{u}_k$ is also normalized so that $\sum_{j=1}^{d} u_{jk} = 1$. The basic histograms are further organized as a basis matrix $U = [\mathbf{u}_1, \dots, \mathbf{u}_m] \in \mathbb{R}^{d \times m}$. With the basic histograms, we try to reconstruct each $\mathbf{x}_i$ as a weighted linear combination of these basic histograms, as

(1) $\mathbf{x}_i \approx U \mathbf{v}_i = \sum_{k=1}^{m} v_{ik} \mathbf{u}_k,$

where $\mathbf{v}_i = [v_{i1}, \dots, v_{im}]^\top$ is the reconstruction coefficient vector, also called the coding vector of $\mathbf{x}_i$, and $v_{ik}$ is the coefficient of the $k$-th basic histogram for the reconstruction of the histogram of the $i$-th sample. Similarly, the sparse coding vectors for all the samples in $\mathcal{X}$ can be organized as a coding matrix $V = [\mathbf{v}_1, \dots, \mathbf{v}_n] \in \mathbb{R}^{m \times n}$, with the $i$-th column being the coding vector of the $i$-th sample. Given the histogram $\mathbf{x}_i$ of the $i$-th sample, the target of the sparse coding problem is to learn a basis matrix $U$ and a sparse coding vector $\mathbf{v}_i$, such that $\mathbf{x}_i$ and its reconstruction $U\mathbf{v}_i$ are as close to each other as possible, i.e., the reconstruction error is minimized. At the same time, we also expect the coding vector $\mathbf{v}_i$ to be as sparse as possible. To this end, we discuss the following two issues to build the objective function for the learning of $U$ and $V$.
- Reconstruction Error
To measure the reconstruction error between $\mathbf{x}_i$ and $U\mathbf{v}_i$, traditional sparse coding methods have used the squared $\ell_2$ norm distance,

(2) $\min_{U, \mathbf{v}_i} \; \|\mathbf{x}_i - U\mathbf{v}_i\|_2^2,$

for the learning of $U$ and $\mathbf{v}_i$. The objective function is built by applying the squared $\ell_2$ norm distance to all samples in the training set. However, as discussed in the introduction, the $\ell_2$ norm distance is unsuitable for histogram data. In this work, we apply the EMD, which has been a popular metric for histogram data, as the distance measure between $\mathbf{x}_i$ and $U\mathbf{v}_i$. To define the EMD between the two histograms $\mathbf{x}_i$ and $U\mathbf{v}_i$, we treat each bin $x_{ij}$ of $\mathbf{x}_i$ as a supply, and each bin $(U\mathbf{v}_i)_k$ of $U\mathbf{v}_i$ as a demand. We denote by $d_{jk}$ the ground distance from the $j$-th supply to the $k$-th demand. The EMD between $\mathbf{x}_i$ and $U\mathbf{v}_i$ is defined as the minimum amount of work needed to fill all the demands with all the supplies,

(3) $EMD(\mathbf{x}_i, U\mathbf{v}_i) = \min_{T_i} \sum_{j=1}^{d} \sum_{k=1}^{d} d_{jk} t_{jk}^{i}, \quad s.t.\; t_{jk}^{i} \geq 0, \;\; \sum_{k=1}^{d} t_{jk}^{i} \leq x_{ij}, \;\; \sum_{j=1}^{d} t_{jk}^{i} \leq (U\mathbf{v}_i)_k, \;\; \sum_{j=1}^{d} \sum_{k=1}^{d} t_{jk}^{i} = \min\Big(\sum_{j=1}^{d} x_{ij}, \sum_{k=1}^{d} (U\mathbf{v}_i)_k\Big),$

where the variable $t_{jk}^{i}$ denotes the amount transported from the $j$-th supply to the $k$-th demand for the $i$-th sample, and $T_i = [t_{jk}^{i}] \in \mathbb{R}^{d \times d}$ is the matrix of the transported amounts. The constraint $t_{jk}^{i} \geq 0$ prevents negative transportation. The constraint $\sum_{k} t_{jk}^{i} \leq x_{ij}$ means that the mass moved out of the $j$-th supply should not be larger than $x_{ij}$, while $\sum_{j} t_{jk}^{i} \leq (U\mathbf{v}_i)_k$ means that the mass moved into the $k$-th demand should not be larger than $(U\mathbf{v}_i)_k$; the last constraint requires that the total transported mass equals the smaller of the total supply and the total demand, so that the minimization is not trivially zero. The problem in (3) can be solved as a Linear Programming (LP) problem Liu2014397 (); BenTal2000411 (); Candes20054203 ().
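For concreteness, when both histograms are fixed and normalized, the transportation problem in (3) reduces to a small LP. The following sketch solves it with SciPy; the function name and solver choice are ours, and the total-mass equality below assumes both histograms sum to one:

```python
import numpy as np
from scipy.optimize import linprog

def emd(x, y, D):
    """EMD between two normalized histograms x and y, with ground-distance
    matrix D (D[j, k] = cost of moving one unit of mass from bin j of x
    to bin k of y), solved as the transportation LP."""
    d = len(x)
    c = D.reshape(-1)                       # objective: sum_jk D_jk * t_jk
    # supply constraints: sum_k t_jk <= x_j  (flows stored row-major)
    A_sup = np.zeros((d, d * d))
    for j in range(d):
        A_sup[j, j * d:(j + 1) * d] = 1.0
    # demand constraints: sum_j t_jk <= y_k
    A_dem = np.zeros((d, d * d))
    for k in range(d):
        A_dem[k, k::d] = 1.0
    A_ub = np.vstack([A_sup, A_dem])
    b_ub = np.concatenate([x, y])
    # total transported mass = 1 (both histograms are distributions)
    A_eq = np.ones((1, d * d))
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=(0, None), method="highs")
    return res.fun
```

Note that the paper embeds these flow variables inside a larger LP rather than solving one EMD per pair; this standalone version only illustrates the constraint structure.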
- Sparsity Regularization
To encourage the sparsity of each coding vector $\mathbf{v}_i$, traditional sparse coding approaches impose an $\ell_1$ norm based sparsity penalty on $\mathbf{v}_i$,

(4) $\min_{\mathbf{v}_i} \; \|\mathbf{v}_i\|_1 = \sum_{k=1}^{m} |v_{ik}|.$

Using the $\ell_1$ norm sparsity penalty, most of the elements of $\mathbf{v}_i$ are driven to zero, and only a few of them are kept for the reconstruction of $\mathbf{x}_i$.
Direct optimization of this regularization term is difficult, because it is non-smooth. We follow the works of Fan et al. fan2014finding (); fan2010enhanced (); fan2011margin (); fan2015improved (); fan2014tightening (), which were proposed to improve the bounds for Bayesian network structure learning, and propose to optimize an upper bound of the $\ell_1$ norm regularization. Fan et al. fan2014tightening () proposed a method to tighten the upper and lower bounds of the Bayesian network structure learning problem. In that work, more informed variable groupings are used to create the pattern databases for tightening the lower bounds, and an anytime learning algorithm is used for tightening the upper bounds. Moreover, Fan et al. fan2015improved () proposed a new partition method that uses the information extracted from the potentially optimal parent sets to improve the lower bound for Bayesian network structure learning. Inspired by the works of Fan et al. fan2014tightening (); fan2015improved (), to solve the problem together with the LP problem of the EMD, instead of minimizing the $\ell_1$ norm of the coding vector directly, we introduce a slack vector as the upper bound of its absolute value vector, and minimize the $\ell_1$ norm of the slack vector. Specifically, we introduce a nonnegative slack vector $\mathbf{w}_i = [w_{i1}, \dots, w_{im}]^\top$ for each coding vector $\mathbf{v}_i$ as the upper bound of the absolute value vector of the coding vector, $|v_{ik}| \leq w_{ik}$, $k = 1, \dots, m$, and then minimize the $\ell_1$ norm of the slack vector to seek the sparsity of $\mathbf{v}_i$ indirectly. Because $\mathbf{w}_i$ is a nonnegative vector, its $\ell_1$ norm can be computed simply as the sum of its elements, $\|\mathbf{w}_i\|_1 = \sum_{k=1}^{m} w_{ik}$. To seek the sparsity of $\mathbf{v}_i$, we have the following optimization problem,
(5) $\min_{\mathbf{v}_i, \mathbf{w}_i} \; \sum_{k=1}^{m} w_{ik}, \quad s.t.\; -\mathbf{w}_i \leq \mathbf{v}_i \leq \mathbf{w}_i.$

We also organize the upper bound vectors for the sparse codes as an upper bound matrix $W = [\mathbf{w}_1, \dots, \mathbf{w}_n]$, where the $i$-th column is the upper bound vector of the $i$-th sparse coding vector. By using the slack vectors, we make the sparsity regularization term a smooth (indeed linear) function, which can be integrated into the EMD optimization problem naturally and solved easily as an LP problem.
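The slack-variable device above is the standard LP encoding of an $\ell_1$ objective. The following self-contained illustration applies it to the generic problem of finding a minimum-$\ell_1$-norm solution of a linear system; the function name, the toy basis, and the solver choice are our assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import linprog

def min_l1_solution(U, x):
    """Solve  min ||v||_1  s.t.  U v = x  by introducing a slack vector w
    with -w <= v <= w and minimizing the linear objective sum(w)."""
    d, m = U.shape
    # decision variables: [v (m entries), w (m entries)]
    c = np.concatenate([np.zeros(m), np.ones(m)])     # minimize sum(w)
    # v - w <= 0  and  -v - w <= 0  together encode |v_k| <= w_k
    A_ub = np.block([[np.eye(m), -np.eye(m)],
                     [-np.eye(m), -np.eye(m)]])
    b_ub = np.zeros(2 * m)
    A_eq = np.hstack([U, np.zeros((d, m))])           # U v = x
    bounds = [(None, None)] * m + [(0, None)] * m     # v free, w >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=x,
                  bounds=bounds, method="highs")
    return res.x[:m]
```

Because the objective and all constraints are linear, this subproblem merges directly with the transportation constraints of the EMD into one LP, which is exactly the point of the reformulation in (5).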
By applying both the EMD based reconstruction error term in (3) and the sparsity regularization term in (5) to each training sample in $\mathcal{X}$, and summing them up, we have the following objective function for the EMD based sparse coding problem,

(6) $\min_{U, V, W} \; \sum_{i=1}^{n} \left[ EMD(\mathbf{x}_i, U\mathbf{v}_i) + \alpha \sum_{k=1}^{m} w_{ik} \right], \quad s.t.\; -\mathbf{w}_i \leq \mathbf{v}_i \leq \mathbf{w}_i, \;\; u_{jk} \geq 0, \;\; \sum_{j=1}^{d} u_{jk} = 1,$
where $\alpha$ is a tradeoff parameter, and the constraints $u_{jk} \geq 0$ and $\sum_{j=1}^{d} u_{jk} = 1$ are introduced on the basic histograms to guarantee that the learned basic histograms are normalized distributions. Please note that $EMD(\mathbf{x}_i, U\mathbf{v}_i)$ itself is also obtained by solving a minimization problem with regard to $T_i$. We substitute (3) and (5) into (6), so that the optimization problem in (6) is extended into a parameter-enlarged optimization with the transported amount matrices $T_1, \dots, T_n$ as additional parameters,

(7) $\min_{U, V, W, \{T_i\}} \; \sum_{i=1}^{n} \left[ \sum_{j=1}^{d} \sum_{k=1}^{d} d_{jk} t_{jk}^{i} + \alpha \sum_{k=1}^{m} w_{ik} \right], \quad s.t.$ the constraints of (3) and (5) for each sample, together with $u_{jk} \geq 0$ and $\sum_{j=1}^{d} u_{jk} = 1.$
This problem is a parameter-enlarged LP problem.
2.2 Optimization
Directly optimizing the objective of (7) is difficult and time-consuming. Similar to the original $\ell_2$ norm based sparse coding method, we adopt an alternating optimization strategy for the learning of $U$ and $V$ in an iterative algorithm. In each iteration, one of $U$ and $V$ is optimized while the other is fixed, and then their roles are switched. The iterations are repeated until a maximum iteration number is reached.
2.2.1 Optimizing $V$ while fixing $U$
By fixing $U$, we can optimize the coding vectors in $V$ together with the other additional variables. Similar to traditional sparse coding methods, we update each sparse coding vector individually. When the coding vector $\mathbf{v}_i$ of the $i$-th sample and its slack vector $\mathbf{w}_i$ are being optimized, the others with their corresponding additional variables ($\mathbf{v}_{i'}$, $\mathbf{w}_{i'}$, and $T_{i'}$ for $i' \neq i$) are fixed. Thus, the optimization problem in (7) turns into
(8) $\min_{\mathbf{v}_i, \mathbf{w}_i, T_i} \; \sum_{j=1}^{d} \sum_{k=1}^{d} d_{jk} t_{jk}^{i} + \alpha \sum_{k=1}^{m} w_{ik}, \quad s.t.$ the constraints of (3) and (5) for the $i$-th sample,
which can be solved as an LP problem. The LP problem is solved using an active-set algorithm. Please note that an LP is solved over a given vector of unknown variables. Here we substitute the vector of variables in $T_i$ (the original variables of the EMD problem) with a longer vector, which contains the entries of $T_i$, $\mathbf{v}_i$, and $\mathbf{w}_i$. We do not reformulate the EMD objective, but we change its constraints to accommodate the additional variables of the problem being solved. In this way, the new LP problem is different from that of the original EMD, and its solution contains the entries of $T_i$, $\mathbf{v}_i$, and $\mathbf{w}_i$.
2.2.2 Optimizing $U$ while fixing $V$
By fixing $V$ and $W$, the optimization problem in (7) turns into
(9) $\min_{U, \{T_i\}} \; \sum_{i=1}^{n} \sum_{j=1}^{d} \sum_{k=1}^{d} d_{jk} t_{jk}^{i}, \quad s.t.$ the constraints of (3) for each sample, $u_{jk} \geq 0$, and $\sum_{j=1}^{d} u_{jk} = 1,$
which can also be solved as an LP problem using the active-set algorithm.
An important limitation of both the optimization problems in (8) and (9) is the large number of additional variables in the LP problem. For each sample $\mathbf{x}_i$, a transported amount matrix $T_i \in \mathbb{R}^{d \times d}$ is solved in both (8) and (9); thus there are in total $n d^2$ transportation amount variables in the LP problem for the $n$ training samples. When the dimension $d$ of the histogram or the number $n$ of training samples is large, there will be a large number of variables, which can cause serious computational problems. To overcome this shortcoming, we reduce the number of variables in $T_i$ by allowing the earth to be moved from the $j$-th supply only to its $K$ nearest demands instead of all the demands. The $K$ nearest demands of the $j$-th supply are found using the ground distances $d_{jk}$. In this way, we reduce the number of transported mass variables for each supply of each sample from $d$ to $K$. Usually $K \ll d$; thus the total number of transported amount variables is reduced significantly from $n d^2$ to $n d K$.
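The variable reduction above amounts to enumerating, for each supply bin, only the $K$ flow variables with the smallest ground distance; a small sketch (the function name is ours):

```python
import numpy as np

def flow_variable_pairs(D, K):
    """For each supply bin j, keep flow variables only to the K demand
    bins with the smallest ground distance D[j, k], shrinking the LP
    from d*d to d*K transport variables per sample."""
    keep = np.argsort(D, axis=1)[:, :K]      # K nearest demands per supply
    return [(j, int(k)) for j in range(D.shape[0]) for k in keep[j]]
```

Only the retained $(j, k)$ pairs are instantiated as LP variables; all omitted flows are implicitly fixed to zero.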
2.3 Algorithm
We summarize the iterative basic histogram and coding vector learning algorithm in Algorithm 1. In each iteration, the sparse coding vector of each sample is first learned sequentially, and the basic histograms are then updated based on the learned sparse coding vectors. The iterations are repeated $T$ times. When a novel test sample comes with its histogram, we simply solve (8) to obtain its sparse coding vector.
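The alternating scheme of Algorithm 1 can be sketched as follows, with the two LP subproblems (8) and (9) abstracted as callables; `solve_codes` and `solve_basis` are hypothetical stand-ins for an LP solver, not the paper's implementation:

```python
import numpy as np

def learn_sc_emd(X, U0, n_iter, solve_codes, solve_basis):
    """Alternate between the two subproblems of Section 2.2:
    fix U and update the codes V (per-sample LP (8)), then
    fix V and update the basis U (LP (9))."""
    U = U0
    V = None
    for _ in range(n_iter):
        V = solve_codes(X, U)    # Section 2.2.1: codes + slacks + flows
        U = solve_basis(X, V)    # Section 2.2.2: basic histograms + flows
    return U, V
```

For a test sample, only the first callable needs to be invoked once with the learned basis, mirroring the last sentence of Section 2.3.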
2.4 Relation to nonnegative matrix factorization with earth mover’s distance
Sandler and Lindenbaum sandler2011nonnegative () proposed nonnegative matrix factorization with earth mover's distance (NMF-EMD), which factorizes a nonnegative matrix by minimizing the EMD between the original data matrix and the product of the two factor matrices. Our SC-EMD is closely related to it. We discuss the relations between the two methods as follows:

- Both SC-EMD and NMF-EMD use the earth mover's distance to measure the reconstruction error of the data, which is a suitable distance measure for multi-instance quantization histograms.

- NMF-EMD does not impose sparsity on the factor matrices, while SC-EMD imposes the reconstruction coefficients to be sparse. The sparsity of the reconstruction coefficients is measured by an $\ell_1$ norm. Thus the objective function of SC-EMD differs from that of NMF-EMD: the objective of NMF-EMD contains only an EMD term, while the objective of SC-EMD is composed of an EMD term and an $\ell_1$ norm term. The optimization of SC-EMD is more difficult than that of NMF-EMD due to this additional term.

- NMF-EMD constrains the factor matrices to be nonnegative, while SC-EMD does not impose such constraints on the coding vectors. However, these constraints do not change the nature of the optimization: NMF-EMD is optimized as a linearly constrained LP problem, and adding the nonnegativity constraint only adds some more linear constraints to the problem. SC-EMD, in contrast, adds an $\ell_1$ norm regularization term to the objective, which changes the optimization problem.
3 Experiments
In this section, we evaluate the proposed method on three multi-instance learning problems, in each of which the feature vector of a sample is a histogram.
3.1 Experiment I: Abnormal Image Detection in Wireless Capsule Endoscopy Videos
Wireless Capsule Endoscopy (WCE) has been used to detect mucosal abnormalities in the gastrointestinal tract, including blood, ulcers, polyps, etc. Iddan2000417 (); Ell2002685 (); Mylonaki20031122 (); Hwang2011320 (). However, usually only a few frames of a long WCE video contain abnormalities; thus a medical clinician has to spend a long time finding the abnormal frames in a WCE video. In this situation, it is necessary to develop a system that automatically discriminates abnormal frames from normal ones. In this experiment, we evaluated the proposed method as an image representation method for the task of abnormal image detection in WCE videos.
3.1.1 Dataset and Setup
We constructed the data set for the experiment by collecting 170 images from WCE videos belonging to three abnormal classes and one normal class. The data set contains 50 normal images, 40 polyp images, 40 ulcer images, and 40 blood images. Given an image from a WCE video, the task of abnormal image detection is to classify it into one of the four classes. To this end, each image was split into many small patches, and each patch was treated as an instance; thus the image was represented as a bag of instances under the framework of multi-instance learning. We extracted color and texture features from each patch and concatenated them as the visual feature of each instance. Then the instances were quantized into a pool of instance prototypes, and the quantization histogram was normalized and used as the feature vector of the image. The histograms were further represented as sparse coding vectors using the proposed SC-EMD algorithm, and the coding vectors were used to train a Support Vector Machine (SVM) Cherkassky2004113 (); Schuldt200432 (); Keerthi2001637 (); Tsang2005 () to classify the images into one of the four image types.
To conduct the experiment, we employed the 10-fold cross-validation protocol Gandek19981171 (); Craven1978377 (); Bagby199423 (). The entire data set was split randomly into 10 non-overlapping folds. Each fold contained 5 normal images, 4 polyp images, 4 ulcer images, and 4 blood images. Each fold was used as the test set in turn, and the remaining 9 folds were combined and used as the training set. After the images in the training set were represented as histograms under the multi-instance learning framework, we applied the SC-EMD algorithm to them and obtained the basic histograms and the sparse coding vectors of the training images. We then trained an SVM classifier on these sparse coding vectors for each class. To handle the multi-class problem, we used the one-against-all protocol to train the classifiers Kumar201114238 (); Liu2005849 (); Polat20091587 (): an SVM classifier was trained for each class, using the images of this class as positive samples and all other images as negative samples. Based on the basic histograms learned from the training histograms, we represented the test images as sparse coding vectors, and finally input them into the learned SVM classifiers to obtain the final classification results. Please note that the parameters were tuned using only the training set, excluding the test set.
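The per-fold class counts above (5 normal and 4 of each abnormal class per fold) correspond to a stratified split, which can be sketched as follows; the function name and random seed are our own choices:

```python
import numpy as np

def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into n_folds folds, keeping the class
    proportions of `labels` equal across folds when the class sizes
    divide evenly by the number of folds."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        for i, j in enumerate(idx):          # deal indices round-robin
            folds[i % n_folds].append(j)
    return [np.array(f) for f in folds]
```

With 50/40/40/40 images per class and 10 folds, every fold receives exactly the 5/4/4/4 split described in the protocol.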
The classification results are measured by the recall-precision curve Goadrich2006231 (); Gordon1989145 (); Huang20051665 (), the Receiver Operating Characteristic (ROC) curve Cook2007928 (); DeLong1988837 (); Hanley198229 (), and the Area Under the ROC Curve (AUC) value Pencina2008157 (); DeLong1988837 (); Hanley198229 () for each class.
3.1.2 Results
In the experiments, we compared our SC-EMD algorithm as a data representation method against the traditional sparse coding method using the $\ell_2$ norm distance (denoted as SC-$\ell_2$), and also against the original histogram representation (denoted as Histogram). The recall-precision curves for the four classes are given in Fig. 1 (a)-(d). These figures clearly show that with the proposed SC-EMD, the classification performances for all four classes are improved significantly; the improvement is particularly dramatic for the Polyp and Normal classes. SC-$\ell_2$ can improve on the original histogram features to some extent; however, because it employs the $\ell_2$ norm distance as the loss function, which is not suitable for histogram data, its improvement is limited. In particular, Fig. 1 (a) shows that an increase in classification performance is obtained by SC-EMD against both the original histogram and SC-$\ell_2$. The results validate the importance of performing sparse coding with an appropriate loss function for histogram data.
The ROC curves of the different classes are shown in Fig. 2, and the AUC values are given in Fig. 3. As shown in Fig. 2 and Fig. 3, our SC-EMD algorithm clearly outperforms the original histogram feature and the SC-$\ell_2$ based method in all four classes. The advantage is particularly significant on the more challenging Normal class. This result highlights the importance of using the EMD measure for histograms rather than the $\ell_2$ norm distance.
3.2 Experiment II: Protein binding site retrieval
Searching for geometrically similar protein binding sites is important for understanding the functions of proteins and also for drug discovery Bind2012 (); Bradford20051487 (); Neuvirth2004181 (); Laurie20051908 (). Pang et al. Bind2012 () represented protein binding sites as histograms using the multi-instance learning framework for the protein binding site retrieval problem. In this experiment, we evaluated the proposed algorithm for the representation of histograms of protein binding sites.
3.2.1 Dataset and Setup
In this experiment, we used the protein binding site data set reported by Pang et al. Bind2012 (). In this non-redundant data set, there are in total 2,819 protein binding sites belonging to 501 different classes. The number of sites in each class varies from 2 to 58. To conduct 4-fold cross-validation, we randomly selected 2,226 binding sites to construct our data set. The selected data set contained sites from 249 classes, and the number of sites in each class ranged from 4 to 58, so that we could guarantee that when 4-fold cross-validation was performed, each fold contained at least one site from every class. The numbers of sites for all the selected classes are shown in Fig. 4.
Given a query binding site and a protein binding site database, the protein binding site retrieval problem is to rank the database sites according to their similarity to the query in descending order, so that the database sites belonging to the same class as the query are ranked at the top positions of the returned list. To this end, we first represented each binding site as a bag of feature points selected from the binding site surface, and for each point the geometric features were extracted Bind2012 (). In the multi-instance learning framework, a binding site was referred to as a bag, and each feature point was referred to as an instance. Then all the feature points were quantized into a set of prototype points, and a histogram was generated as the bag-level feature of the binding site Bind2012 (). Using the proposed SC-EMD algorithm, the histograms were represented as sparse codes for the final ranking. The ranking performances were evaluated by the recall-precision and ROC curves. The AUC value of the ROC curve was also reported as a single performance measure of the ranking results.
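Once every site is encoded as a sparse coding vector, the retrieval step ranks database sites by distance in code space; the sketch below uses Euclidean distance between codes as a simplifying assumption of ours:

```python
import numpy as np

def rank_database(query_code, db_codes):
    """Return database indices ordered from most to least similar,
    measured by distance between sparse coding vectors."""
    dist = np.linalg.norm(db_codes - query_code, axis=1)
    return np.argsort(dist)      # ascending distance = descending similarity
```

A good representation places same-class sites at the head of this ordering, which is exactly what the recall-precision and ROC curves quantify.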
3.2.2 Results
The recall-precision and ROC curves of the different histogram representation methods are given in Fig. 5. From the results in Fig. 5, we can see that the SC-EMD method performs best in terms of both the recall-precision and ROC curves. This indicates that the EMD based method provides a better distance measure for histogram comparison and coding. The AUC values of the ROC curves are given in Fig. 6. The protein binding site retrieval system with the SC-EMD representation method achieves an AUC value of 0.9466, compared to an AUC value of 0.9282 using SC-$\ell_2$ and 0.9114 using the original histogram.
3.3 Experiment III: Object Recognition
In this section, we compare the proposed method to some other feature extraction and classification methods on a publicly accessible image database.
3.3.1 Dataset and Protocol
In this experiment, we used the COREL-2000 image database, which is popular in the computer vision community for the problem of object recognition Tang2007384 (); chen2004image (). This data set contains 2,000 images of 20 objects, with 100 images per object. The target of object recognition is to assign a given image to the correct object class.
To conduct the experiment, we also used 10-fold cross-validation. Each image was regarded as a sample, and we extracted Regions Of Interest (ROIs) as the instances Christopoulos2000247 (); Heikkil2009425 (). We extracted multiple ROIs from each image; thus each image was represented as a bag of multiple instances. Moreover, the instances of the training images were clustered to generate a set of instance prototypes using a clustering algorithm Frey2007972 (); Xu2005645 (); Xie1991841 (); Jain1999316 (), and then the instances of each image were quantized into the prototypes to represent each image as a histogram. The histograms were first normalized, and then the proposed SC-EMD algorithm was applied to represent the histograms as sparse codes. The histograms of the training samples were used to learn the basic histograms and their sparse codes first, and then an SVM classifier was learned in the sparse code space for each class. To classify a test sample, we also represented its histogram as a sparse coding vector using the basic histograms learned from the training set, and then classified the sparse coding vector using the SVM classifier learned from the training set. To evaluate the classification performance of the proposed algorithm, we used the classification accuracy as the performance measure Lim2000203 (); Baldi2000412 (); Foody2002185 (), which is computed as
(10) $Accuracy = \dfrac{\text{number of correctly classified test images}}{\text{total number of test images}}.$
3.3.2 Results
In this experiment, we compared the proposed histogram representation algorithm against several visual feature extraction methods, including the Learning Locality-Constrained Collaborative Representation (LCCR) method proposed by Peng et al. Peng2013Learning (), SC, Histogram of Oriented Gradients (HOG) Suard2006206 (); Zhu20061491 (); Dalal2005886 (), and the Scale-Invariant Feature Transform (SIFT) Li2007332 (); Shen20091714 (); Cheung20092012 (). The experimental results are given in Fig. 7 (a). As we can see from this figure, the proposed method outperforms all the other methods significantly, except LCCR. SIFT and HOG both represent the images as histograms; however, they ignore the structure of the data set by representing each image individually, and thus their performances are inferior to the others. SC explores the training set by learning a set of basic histograms to represent all the image histograms, and achieves some minor improvements; but it uses the $\ell_2$ norm distance to compare the histograms, which is not suitable. It is very interesting to notice that LCCR, which improves the robustness and discriminative power of the data representation by introducing local consistency, achieves performance similar to SC-EMD and also outperforms the other methods. Although it also uses the $\ell_2$ norm as the loss function, which is unsuitable for histograms, it considers the local consistency of the data samples and seeks the smoothness of the codes instead of their sparsity. These are the main reasons for the good performance of LCCR. This also encourages us to develop novel methods that combine the EMD and locality-constrained collaborative representation, which may improve the performance even further. Moreover, we also compared our method to two popular classification methods: Sparse Representation based Classification (SRC) Wright2009210 () and Nearest Neighbor classification (NN) Zhang20062126 (); Weinberger2009207 (); Denoeux1995804 (). The results are given in Fig. 7 (b).
As we can see from the figure, the proposed algorithm based on the EMD outperforms both classification methods, which use the $\ell_2$ norm distance as the distance metric for histograms. This is another piece of evidence that the EMD is essential for histogram data analysis and classification.
4 Conclusion and Future Works
A new type of sparse coding method, sparse coding with the EMD metric, is proposed in this paper for the representation of histogram data. The objective function is composed of an EMD term between the original histogram and its reconstruction from a pool of basic histograms, and an $\ell_1$ norm term for the regularization of the coding vector. The optimization problem is solved as an LP problem in an iterative algorithm. Algorithms based on the proposed SC-EMD outperformed previous $\ell_2$ norm based sparse coding algorithms in three challenging multi-instance learning tasks.
Acknowledgements
The work was supported in part by the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (Grant No. 61521003), in part by the Science and Technology Foundation of Henan Province, China (Grant No. 152102210087), and in part by the Foundation of the Educational Committee of Henan Province, China (Grant No. 14A520040).
References
 (1) Bagby, R., Parker, J., Taylor, G.: The twenty-item toronto alexithymia scale - i: Item selection and cross-validation of the factor structure. Journal of Psychosomatic Research 38(1), 23–32 (1994)
 (2) Bai, X., Yan, C., Ren, P., Bai, L., Zhou, J.: Discriminative sparse neighbor coding. Multimedia Tools and Applications p. 25 (2015). DOI 10.1007/s1104201529514
 (3) Baldi, P., Brunak, S., Chauvin, Y., Andersen, C., Nielsen, H.: Assessing the accuracy of prediction algorithms for classification: An overview. Bioinformatics 16(5), 412–424 (2000)
 (4) Ben-Tal, A., Nemirovski, A.: Robust solutions of linear programming problems contaminated with uncertain data. Mathematical Programming, Series B 88(3), 411–424 (2000)
 (5) Bradford, J., Westhead, D.: Improved prediction of protein-protein binding sites using a support vector machines approach. Bioinformatics 21(8), 1487–1494 (2005)
 (6) Candes, E., Tao, T.: Decoding by linear programming. IEEE Transactions on Information Theory 51(12), 4203–4215 (2005)
 (7) Chen, Y., Wang, J.Z.: Image categorization by learning and reasoning with regions. The Journal of Machine Learning Research 5, 913–939 (2004)
 (8) Cherkassky, V., Ma, Y.: Practical selection of svm parameters and noise estimation for svm regression. Neural Networks 17(1), 113–126 (2004)
 (9) Cheung, W., Hamarneh, G.: n-sift: n-dimensional scale invariant feature transform. IEEE Transactions on Image Processing 18(9), 2012–2021 (2009)
 (10) Christopoulos, C., Askelöf, J., Larsson, M.: Efficient methods for encoding regions of interest in the upcoming jpeg2000 still image coding standard. IEEE Signal Processing Letters 7(9), 247–249 (2000)
 (11) Cook, N.: Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation 115(7), 928–935 (2007)
 (12) Craven, P., Wahba, G.: Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik 31(4), 377–403 (1978)
 (13) Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proceedings  2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, vol. I, pp. 886–893 (2005)
 (14) DeLong, E., DeLong, D., ClarkePearson, D.: Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44(3), 837–845 (1988)
 (15) Denoeux, T.: A k-nearest neighbor classification rule based on Dempster-Shafer theory. IEEE Transactions on Systems, Man and Cybernetics 25(5), 804–813 (1995)
 (16) Du, X., Wang, J.Y.: Support image set machine: Jointly learning representation and classifier for image set classification. Knowledge-Based Systems 78(1), 51–58 (2015)
 (17) Ell, C., Remke, S., May, A., Helou, L., Henrich, R., Mayer, G.: The first prospective controlled trial comparing wireless capsule endoscopy with push enteroscopy in chronic gastrointestinal bleeding. Endoscopy 34(9), 685–689 (2002)
 (18) Emran, S., Ye, N.: Robustness of chi-square and Canberra distance metrics for computer intrusion detection. Quality and Reliability Engineering International 18(1), 19–28 (2002)
 (19) Fan, X., Malone, B., Yuan, C.: Finding optimal Bayesian network structures with constraints learned from data. In: Proceedings of the 30th Annual Conference on Uncertainty in Artificial Intelligence (UAI-14), pp. 200–209 (2014)
 (20) Fan, X., Tang, K.: Enhanced maximum AUC linear classifier. In: Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International Conference on, vol. 4, pp. 1540–1544. IEEE (2010)
 (21) Fan, X., Tang, K., Weise, T.: Margin-based oversampling method for learning from imbalanced datasets. In: Advances in Knowledge Discovery and Data Mining, pp. 309–320. Springer (2011)
 (22) Fan, X., Yuan, C.: An improved lower bound for Bayesian network structure learning. In: Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2439–2445 (2015)
 (23) Fan, X., Yuan, C., Malone, B.: Tightening bounds for Bayesian network structure learning. In: Proceedings of the 28th AAAI Conference on Artificial Intelligence, pp. 2439–2445 (2014)
 (24) Foody, G.: Status of land cover classification accuracy assessment. Remote Sensing of Environment 80(1), 185–201 (2002)
 (25) Frey, B., Dueck, D.: Clustering by passing messages between data points. Science 315(5814), 972–976 (2007)
 (26) Gandek, B., Ware, J., Aaronson, N., Apolone, G., Bjorner, J., Brazier, J., Bullinger, M., Kaasa, S., Leplege, A., Prieto, L., Sullivan, M.: Cross-validation of item selection and scoring for the SF-12 health survey in nine countries: Results from the IQOLA project. Journal of Clinical Epidemiology 51(11), 1171–1178 (1998)
 (27) Gao, S., Tsang, I.H., Chia, L.T.: Laplacian sparse coding, hypergraph Laplacian sparse coding, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1), 92–104 (2013)
 (28) Gao, S., Tsang, I.H., Chia, L.T., Zhao, P.: Local features are not lonely – Laplacian sparse coding for image classification. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3555–3561 (2010)
 (29) Goadrich, M., Oliphant, L., Shavlik, J.: Gleaner: Creating ensembles of first-order clauses to improve recall-precision curves. Machine Learning 64(1-3), 231–261 (2006)
 (30) Gordon, M., Kochen, M.: Recall-precision trade-off: A derivation. Journal of the American Society for Information Science 40(3), 145–151 (1989)
 (31) Hanley, J., McNeil, B.: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1), 29–36 (1982)
 (32) Heikkilä, M., Pietikäinen, M., Schmid, C.: Description of interest regions with local binary patterns. Pattern Recognition 42(3), 425–436 (2009)
 (33) Hershey, J., Olsen, P.: Approximating the Kullback-Leibler divergence between Gaussian mixture models. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings, vol. 4, pp. IV-317–IV-320 (2007)
 (34) Huang, Y., Powers, R., Montelione, G.: Protein NMR recall, precision, and F-measure scores (RPF scores): Structure quality assessment measures based on information retrieval statistics. Journal of the American Chemical Society 127(6), 1665–1674 (2005)
 (35) Huong, V., Park, D.C., Woo, D.M., Lee, Y.: Centroid neural network with chi-square distance measure for texture classification. In: Proceedings of the International Joint Conference on Neural Networks, pp. 1310–1315 (2009)
 (36) Hwang, S.: Bag-of-visual-words approach to abnormal image detection in wireless capsule endoscopy videos. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6939 LNCS(Part 2), 320–327 (2011)
 (37) Iddan, G., Meron, G., Glukhovsky, A., Swain, P.: Wireless capsule endoscopy. Nature 405(6785), 417–418 (2000)
 (38) Jain, A., Murty, M., Flynn, P.: Data clustering: A review. ACM Computing Surveys 31(3), 316–323 (1999)
 (39) Keerthi, S., Shevade, S., Bhattacharyya, C., Murthy, K.: Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation 13(3), 637–649 (2001)
 (40) Kotani, K., Qiu, C., Ohmi, T.: Face recognition using vector quantization histogram method. In: IEEE International Conference on Image Processing, vol. 2, pp. II/105–II/108 (2002)
 (41) Kumar, M., Gopal, M.: Reduced one-against-all method for multiclass SVM classification. Expert Systems with Applications 38(11), 14238–14248 (2011)
 (42) Laurie, A., Jackson, R.: Q-SiteFinder: An energy-based method for the prediction of protein-ligand binding sites. Bioinformatics 21(9), 1908–1916 (2005)
 (43) Lee, H., Battle, A., Raina, R., Ng, A.: Efficient sparse coding algorithms. In: Advances in Neural Information Processing Systems, pp. 801–808 (2007)
 (44) Levina, E., Bickel, P.: The earth mover’s distance is the mallows distance: Some insights from statistics. In: Proceedings of the IEEE International Conference on Computer Vision, vol. 2, pp. 251–256 (2001)
 (45) Li, L., Guo, B., Shao, K.: Geometrically robust image watermarking using scale-invariant feature transform and Zernike moments. Chinese Optics Letters 5(6), 332–335 (2007)
 (46) Lim, T.S., Loh, W.Y., Shih, Y.S.: Comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning 40(3), 203–228 (2000)
 (47) Ling, H., Okada, K.: An efficient earth mover’s distance algorithm for robust histogram comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(5), 840–853 (2007)
 (48) Liu, Y., Ding, M.F.: A ladder method for linear semi-infinite programming. Journal of Industrial and Management Optimization 10(2), 397–412 (2014)
 (49) Liu, Y., Zheng, Y.: One-against-all multiclass SVM classification using reliability measures. In: Proceedings of the International Joint Conference on Neural Networks, vol. 2, pp. 849–854 (2005)
 (50) Lu, Z.M., Burkhardt, H.: Colour image retrieval based on DCT-domain vector quantisation index histograms. Electronics Letters 41(17), 956–957 (2005)
 (51) Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research 11, 19–60 (2010)
 (52) Mylonaki, M., Fritscher-Ravens, A., Swain, P.: Wireless capsule endoscopy: A comparison with push enteroscopy in patients with gastroscopy and colonoscopy negative gastrointestinal bleeding. Gut 52(8), 1122–1126 (2003)
 (53) Neuvirth, H., Raz, R., Schreiber, G.: ProMate: A structure based prediction program to identify the location of protein-protein binding sites. Journal of Molecular Biology 338(1), 181–199 (2004)
 (54) Pang, B., Zhao, N., Korkin, D., Shyu, C.R.: Fast protein binding site comparisons using visual words representation. Bioinformatics 28(10), 1345–1352 (2012)
 (55) Pencina, M., D'Agostino Sr., R., D'Agostino Jr., R., Vasan, R.: Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Statistics in Medicine 27(2), 157–172 (2008)
 (56) Peng, X., Zhang, L., Yi, Z., Tan, K.K.: Learning locality-constrained collaborative representation for robust face recognition. Pattern Recognition 47(9), 2794–2806 (2014)
 (57) Polat, K., Gunes, S.: A novel hybrid intelligent method based on C4.5 decision tree classifier and one-against-all approach for multiclass classification problems. Expert Systems with Applications 36(2, Part 1), 1587–1592 (2009)
 (58) Rached, Z., Alajaji, F., Campbell, L.: The Kullback-Leibler divergence rate between Markov sources. IEEE Transactions on Information Theory 50(5), 917–921 (2004)
 (59) Rubner, Y., Tomasi, C., Guibas, L.: Earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision 40(2), 99–121 (2000)
 (60) Sandler, R., Lindenbaum, M.: Nonnegative matrix factorization with earth mover's distance metric for image analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(8), 1590–1602 (2011)
 (61) Schüldt, C., Laptev, I., Caputo, B.: Recognizing human actions: A local SVM approach. In: Proceedings – International Conference on Pattern Recognition, vol. 3, pp. 32–36 (2004)
 (62) Seghouane, A.K., Amari, S.I.: The AIC criterion and symmetrizing the Kullback-Leibler divergence. IEEE Transactions on Neural Networks 18(1), 97–106 (2007)
 (63) Shen, Y., Guturu, P., Damarla, T., Buckles, B., Namuduri, K.: Video stabilization using principal component analysis and scale invariant feature transform in particle filter framework. IEEE Transactions on Consumer Electronics 55(3), 1714–1721 (2009)
 (64) Suard, F., Rakotomamonjy, A., Bensrhair, A., Broggi, A.: Pedestrian detection using infrared images and histograms of oriented gradients. In: IEEE Intelligent Vehicles Symposium, Proceedings, pp. 206–212 (2006)
 (65) Tang, J., Lewis, P.: A study of quality issues for image auto-annotation with the Corel dataset. IEEE Transactions on Circuits and Systems for Video Technology 17(3), 384–389 (2007)
 (66) Tsai, P.: Histogram-based reversible data hiding for vector quantisation-compressed images. IET Image Processing 3(2), 100–114 (2009)
 (67) Tsang, I., Kwok, J., Cheung, P.M.: Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research 6 (2005)
 (68) Wang, J., Zhou, Y., Yin, M., Chen, S., Edwards, B.: Representing data by sparse combination of contextual data points for classification. In: X. Hu, Y. Xia, Y. Zhang, D. Zhao (eds.) Advances in Neural Networks – ISNN 2015, Lecture Notes in Computer Science, vol. 9377, pp. 373–381 (2015)
 (69) Weinberger, K., Saul, L.: Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10, 207–244 (2009)
 (70) Wright, J., Yang, A., Ganesh, A., Sastry, S., Ma, Y.: Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2), 210–227 (2009)
 (71) Xie, X.L., Beni, G.: A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(8), 841–847 (1991)
 (72) Xu, R., Wunsch II, D.: Survey of clustering algorithms. IEEE Transactions on Neural Networks 16(3), 645–678 (2005)
 (73) Yang, J., Ding, Z., Guo, F., Wang, H., Hughes, N.: A novel multivariate performance optimization method based on sparse coding and hyper-predictor learning. Neural Networks 71, 45–54 (2015)
 (74) Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. In: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2009, pp. 1794–1801 (2009)
 (75) Ye, N., Parmar, D., Borror, C.: A hybrid SPC method with the chi-square distance monitoring procedure for large-scale, complex process data. Quality and Reliability Engineering International 22(4), 393–402 (2006)
 (76) Zhang, H., Berg, A., Maire, M., Malik, J.: SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2126–2136 (2006)
 (77) Zheng, M., Bu, J., Chen, C., Wang, C., Zhang, L., Qiu, G., Cai, D.: Graph regularized sparse coding for image representation. IEEE Transactions on Image Processing 20(5), 1327–1336 (2011)
 (78) Zhou, Z.H., Jiang, K., Li, M.: Multi-instance learning based web mining. Applied Intelligence 22(2), 135–147 (2005)
 (79) Zhou, Z.H., Sun, Y.Y., Li, Y.F.: Multi-instance learning by treating instances as non-i.i.d. samples. In: Proceedings of the 26th International Conference on Machine Learning, ICML 2009, pp. 1249–1256 (2009)
 (80) Zhou, Z.H., Zhang, M.L.: Multi-instance multi-label learning with application to scene classification. In: Advances in Neural Information Processing Systems, pp. 1609–1616 (2007)
 (81) Zhu, Q., Avidan, S., Yeh, M.C., Cheng, K.T.: Fast human detection using a cascade of histograms of oriented gradients. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1491–1498 (2006)