A bag-to-class divergence approach to multiple-instance learning
Kajsa Møllersen, Jon Yngve Hardeberg, Fred Godtliebsen
Abstract
In multi-instance (MI) learning, each object (bag) consists of multiple feature vectors (instances), and is most commonly regarded as a set of points in a multi-dimensional space. A different viewpoint is that the instances are realisations of random vectors with a corresponding probability distribution, and that a bag is the distribution, not the realisations. In MI classification, each bag in the training set has a class label, but the instances are unlabelled. By introducing the probability distribution space to bag-level classification problems, dissimilarities between probability distributions (divergences) can be applied. The bag-to-bag Kullback-Leibler information is asymptotically the best classifier, but the typical sparseness of MI training sets is an obstacle. We introduce bag-to-class divergence to MI learning, emphasising the hierarchical nature of the random vectors that makes bags from the same class different. We propose two properties for bag-to-class divergences, and an additional property for sparse training sets.
keywords:
Multi-instance, Multiple-instance, Object-to-class, Divergence, Dissimilarity, Classification, Image analysis, Multiple-instance learning, Bag-to-class, Kullback-Leibler

1 Introduction
1.1 Multi-instance learning
In supervised learning, the training data consists of objects, $x_i$, with corresponding class labels, $y_i$; $i = 1, \ldots, N$. An object is typically a vector of feature values, $x \in \mathbb{R}^{d}$, named an instance. In multi-instance (MI) learning, each object consists of several instances. The set $X = \{x_1, \ldots, x_n\}$, where the elements are vectors of length $d$, is referred to as a bag. The number of instances, $n$, varies from bag to bag, whereas the vector length $d$ is constant. In supervised MI learning, the training data consists of sets $X_1, \ldots, X_N$ and their corresponding class labels, $y_1, \ldots, y_N$.
Figure 1 shows an image (bag) of breast tissue [Gelasca2008Evaluation], divided into segments with corresponding feature vectors (instances) [Kandemir2014Empowering].
The images in the data set have class labels; the individual segments do not. This is a key characteristic of MI learning: the instances are not labelled. MI learning includes instance classification [Doran2016MultipleInstance], clustering [Zhang2009Multiinstance], regression [Zhang2009Multiinstance], and multi-label learning [Zhou2012Multiinstance], but this article will focus on bag classification.
The term MI learning was introduced in an application concerning molecules (bags) with different shapes (instances), and their ability to bind to other molecules [Dietterich1997Solving]. A molecule binds if at least one of its shapes can bind. In MI terminology, the two classes in binary classification are referred to as positive ($+$) and negative ($-$). The assumption that a positive bag contains at least one positive instance, and a negative bag contains only negative instances, is referred to as the standard MI assumption.
Many newer applications violate the standard MI assumption, such as image classification [Xu2016Multipleinstance] and text categorisation [Qiao2017Diversified]. Consequently, successful algorithms meet more general assumptions, see e.g. the hierarchy of Weidmann et al. [Weidmann2003Twolevel] or Foulds and Frank's taxonomy [Foulds2010Review]. For a more recent review of MI classification algorithms, see e.g. [Cheplygina2015Multiple]. Carbonneau et al. [Carbonneau2018Multiple] discussed sample independence and data sparsity, which we address in Section 2 and Section 3. Amores [Amores2013Multiple] presented the three paradigms of instance space (IS), embedded space (ES), and bag space (BS). IS methods aggregate the outcomes of single-instance classifiers applied to the instances of a bag, whereas ES methods map the instances to a vector and then use a single-instance classifier. In the BS paradigm, the instances are transformed to a non-vectorial space, e.g. graphs [Zhou2009Multiinstance; Lee2012Bridging], where the classification is performed, avoiding the detour via single-instance classifiers. The non-vectorial space of probability functions has not yet been introduced to the BS paradigm, despite its analytical benefits.
1.2 The non-vectorial space of probability functions
From the probabilistic viewpoint [Maron1998Framework], an instance is a realisation of a random vector, $X$, with probability distribution $P$ and sample space $\mathcal{X}$. The posterior probability, $P(y \,|\, x)$, is an effective classifier if the standard MI assumption holds, since it is known beforehand to be
$$P(+ \,|\, x) = \begin{cases} 1, & x \in \mathcal{X}^{+}, \\ 0, & \text{otherwise}, \end{cases}$$
where $\mathcal{X}^{+}$ is the positive instance space, and the positive and negative instance spaces are disjoint.
Bayes' rule, $P(y \,|\, x) = P(x \,|\, y) P(y) / P(x)$, can be used when the posterior probability is unknown. An assumption used to estimate the probability distribution of an instance given the class, $P(x \,|\, y)$, is that instances from bags of the same class are independent and identically distributed (i.i.d.) random samples [Maron1998Framework]. This is a poor description of MI learning, and therefore approaches such as estimating the expectation by the mean [Xu2004Logistic], or estimating class distribution parameters [Tax2011Bag], are not valid. As an illustrative example, let the instances be the colours of image segments from the class sea. One image depicts a clear blue sea, whereas another depicts a deep green sea, and the instance distributions clearly depend not only on the class, but also on the bag. The random vectors within one bag are i.i.d., but have a different distribution from those in the other bag, a view that was missed in [Zhou2009Multiinstance]. An important distinction between uncertain objects, whose distribution depends solely on the class label [Jiang2013Clustering; Kriegel2005Densitybased], and MI learning is that the instances of two bags from the same class are not from the same distribution.
The dependency structure of MI learning can be described as a hierarchical distribution (Eq. 1), where a bag is defined as the probability distribution of its instances, and the bag space is a set of distributions. This approach was introduced for learnability theory under the standard MI assumption for instance classification [Doran2016MultipleInstance]. We expand the use of the hierarchical model and introduce the non-vectorial space of probability functions as an extension within the BS paradigm for bag classification through dissimilarity measures between distributions.
1.3 Dissimilarities in MI learning
Dissimilarities in MI learning have been categorised as instance-to-instance or bag-to-bag [Amores2013Multiple; Cheplygina2016DissimilarityBased]. Bag-to-class dissimilarity has not been studied within the MI framework, but was used under the assumption of i.i.d. samples given the class for image classification [Boiman2008In], where the sparseness of training sets was also addressed: aggregating the instances at class level gives a denser representation.
Many MI algorithms use dissimilarities, e.g. graph distances [Lee2012Bridging], Hausdorff metrics [Scott2005Generalized], functions of the Euclidean distance [Cheplygina2015Multiple; RuizMunoz2016Enhancing], and distances based on distribution parameters [Cheplygina2015Multiple]. The performance of dissimilarities on specific data sets has been investigated [Cheplygina2015Multiple; Tax2011Bag; Cheplygina2016DissimilarityBased; RuizMunoz2016Enhancing; Sorensen2010DissimilarityBased], but more analytical comparisons are missing. Amores [Amores2013Multiple] implicitly assumed metricity for dissimilarity functions in the BS paradigm [Scholkopf2000Kernel], but nothing inherent to MI learning imposes this restriction. The non-metric Kullback-Leibler (KL) information [Kullback1951Information], applied in [Boiman2008In], is an example of a divergence: a dissimilarity measure between two probability distributions.
Divergences have not been used in MI learning, due to the lack of a probability function space defined for the BS paradigm, despite the benefit of analysis independent of specific data sets [Gibbs2002Choosing]. The $f$-divergences [Ali1966General; Csiszar1967Informationtype] have desirable properties for dissimilarity measures, including attaining their minimum value for equal distributions, but there is no complete categorisation of divergences. The KL information is a non-symmetric divergence, often used in both statistics and computer science, and is defined as follows for two probability density functions (pdfs) $p$ and $q$:
$$\mathrm{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx.$$
An example of a symmetric divergence is the Bhattacharyya (BH) distance, defined as
$$\mathrm{BH}(p, q) = -\log \int \sqrt{p(x) \, q(x)} \, dx,$$
which can be a better choice if the absolute difference, and not the ratio, differentiates the two pdfs. The appropriate divergence for a specific task can be chosen based on identified properties, e.g. for clustering [Mollersen2016DataIndependent], or a new dissimilarity function can be proposed [Mollersen2015Divergencebased]. This article aims to identify properties for bag classification, and we make the following contributions:

- Presenting the hierarchical model for general, non-standard MI assumptions (Section 2).
- Introduction of a bag-to-class dissimilarity measure (Section 3).
- Identification of two properties for bag-to-class divergences (Section 4.1).
- A new bag-to-class dissimilarity measure for sparse training data (Section 4.2).
In Section 5, the KL information and the new dissimilarity measure are applied to data sets and the results are reported. Bags defined in the probability distribution space, in combination with bag-to-class divergence, constitute a new framework for MI learning, which is compared to other frameworks in Section 6.
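For concreteness, the two divergences defined above can be approximated by numerical integration. The following is a minimal sketch with hypothetical inputs (two unit-variance Gaussian pdfs on a grid); the function names, grid and parameters are ours, chosen so the results can be checked against the known closed forms.

```python
import numpy as np

def kl_information(p, q, x):
    """KL(p||q) = integral of p log(p/q), via a Riemann sum on the grid x."""
    dx = x[1] - x[0]
    m = p > 0                       # only points where p has mass contribute
    return float(np.sum(p[m] * np.log(p[m] / q[m])) * dx)

def bhattacharyya_distance(p, q, x):
    """BH(p, q) = -log integral of sqrt(p q), via a Riemann sum."""
    dx = x[1] - x[0]
    return float(-np.log(np.sum(np.sqrt(p * q)) * dx))

# Hypothetical inputs: two unit-variance Gaussian pdfs with means 0 and 1.
x = np.linspace(-12.0, 12.0, 20001)
p = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
q = np.exp(-0.5 * (x - 1.0)**2) / np.sqrt(2.0 * np.pi)

# Closed forms for this pair: KL = 0.5, BH = 0.125.
print(round(kl_information(p, q, x), 3))          # 0.5
print(round(bhattacharyya_distance(p, q, x), 3))  # 0.125
```

Note the asymmetry of the KL information is immaterial for this particular equal-variance pair, but matters in general.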
2 Hierarchical distributions
A bag is the probability distribution from which the instances are sampled. The generative model of instances from a positive or negative bag follows a hierarchical distribution
$$x \sim f_b(x), \qquad f_b \sim \mathcal{F}^{+} \ \text{or} \ f_b \sim \mathcal{F}^{-}, \tag{1}$$
respectively. The common view in MI learning is that a bag consists of positive and negative instances, which corresponds to a bag being a mixture of a positive and a negative distribution.
Consider tumour images labelled $+$ or $-$, with instances extracted from segments. Let $f_b^{+}$ and $f_b^{-}$ denote the pdfs of positive and negative segments, respectively, of image $b$. The pdf of bag $b$ is a mixture distribution
$$f_b(x) = \pi_b f_b^{+}(x) + (1 - \pi_b) f_b^{-}(x),$$
where $\pi_b = P(y_i = 1)$, and $y_i = 1$ if instance $i$ is positive. The probability of positive segments, $\pi_b$, depends on the image's class label, and hence is sampled from $F_{\Pi}^{+}$ or $F_{\Pi}^{-}$. The characteristics of positive and negative segments vary from image to image. Hence, $f_b^{+}$ and $f_b^{-}$ are realisations of random variables, with corresponding probability distributions $F_{\Theta}^{+}$ and $F_{\Theta}^{-}$. The generative model of instances from a positive (negative) bag is
$$x \sim \pi f(x; \theta^{+}) + (1 - \pi) f(x; \theta^{-}), \qquad \pi \sim F_{\Pi}^{+} \ (F_{\Pi}^{-}), \quad \theta^{+} \sim F_{\Theta}^{+}, \quad \theta^{-} \sim F_{\Theta}^{-}. \tag{2}$$
The corresponding sampling procedure from a positive (negative) bag, $b$, is:
Step 1: Draw $\pi$ from $F_{\Pi}^{+}$ ($F_{\Pi}^{-}$), $\theta^{+}$ from $F_{\Theta}^{+}$, and $\theta^{-}$ from $F_{\Theta}^{-}$. These three parameters define the bag.
Step 2: For $i = 1, \ldots, n_b$, draw $y_i$ from $\mathrm{Bernoulli}(\pi)$; draw $x_i$ from $f(x; \theta^{+})$ if $y_i = 1$, and from $f(x; \theta^{-})$ otherwise.
By imposing restrictions, assumptions can be accurately described, e.g. the standard MI assumption: at least one positive instance in a positive bag ($\sum_i y_i \geq 1$); no positive instances in a negative bag ($\pi = 0$); and disjoint positive and negative instance spaces.
Eq. 2 is the generative model of MI problems, under the assumptions that the instances have unknown class labels and that the distributions are parametric. The parameters $\theta^{+}$, $\theta^{-}$ and $\pi$ are i.i.d. samples from their respective distributions, but they are not observed and are hard to estimate, due to the very nature of MI learning: the instances are not labelled. Instead, the bag distribution can be estimated from the observed instances, and a divergence can serve as classifier.
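The two-step sampling procedure above can be sketched as follows. All concrete distributional choices (a Beta distribution for $\pi$, Gaussian bag-level means, Gaussian instances) are hypothetical placeholders for $F_{\Pi}$ and $F_{\Theta}$, and the standard-MI restrictions (e.g. at least one positive instance per positive bag) are not enforced here.

```python
import numpy as np

def sample_bag(n, positive, rng):
    """Sample one bag by the two-step hierarchical procedure of Section 2.
    All distributional choices are illustrative placeholders."""
    # Step 1: draw the bag-level parameters; these define the bag.
    pi = rng.beta(4.0, 2.0) if positive else 0.0   # P(instance is positive)
    theta_pos = rng.normal(3.0, 0.5)               # positive-component mean
    theta_neg = rng.normal(0.0, 0.5)               # negative-component mean
    # Step 2: draw instance labels y_i ~ Bernoulli(pi), then instances.
    y = rng.random(n) < pi
    x = np.where(y, rng.normal(theta_pos, 1.0, n),
                    rng.normal(theta_neg, 1.0, n))
    return x

rng = np.random.default_rng(0)
pos_bag = sample_bag(50, positive=True, rng=rng)
neg_bag = sample_bag(50, positive=False, rng=rng)
print(pos_bag.shape, neg_bag.shape)  # (50,) (50,)
```

The instance labels `y` are discarded after sampling, mirroring the MI setting in which only the instances and the bag label are observed.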
3 Bag-to-class dissimilarity
The training set in MI learning consists of the instances, since the bag distributions are unknown. Under the assumption that the instances from each bag are i.i.d. samples, the KL information has a special role in model selection, both from the frequentist and the Bayesian perspective. Let $g$ be the sample distribution (unlabelled bag), and let $p_1$ and $p_2$ be two models (labelled bags). Then the expectation over $g$ of the log ratio of the two models, $\mathrm{E}_g[\log(p_1(x)/p_2(x))]$, equals $\mathrm{KL}(g \,\|\, p_2) - \mathrm{KL}(g \,\|\, p_1)$. In other words, the log ratio test reveals the model closest to the sampling distribution in terms of KL information [Eguchi2006Interpreting]. From the Bayesian viewpoint, the Akaike Information Criterion (AIC) reveals the model closest to the data in terms of KL information, and is asymptotically equivalent to the Bayes factor under certain assumptions [Kass1995Bayes].
If the bag sampling is sparse, the dissimilarity between an unlabelled bag and the labelled bags becomes somewhat arbitrary w.r.t. its true label. The risk is high for ratio-based divergences such as the KL information, since the integrand $g(x) \log(g(x)/p(x))$ is unbounded where $p(x) \to 0$ while $g(x) > 0$. The bag-to-bag KL information is asymptotically the best choice of divergence function, but this is not the case for sparse training sets. Bag-to-class dissimilarity makes up for some of the sparseness by aggregating instances. Consider an image segment of colour deep green, which appears in sea images but not in sky images, and a segment of colour white, which appears in both classes (waves and clouds). If the combination deep green and white does not appear in the training set, then the bag-to-bag KL information will be infinite for all bags, regardless of class, but the bag-to-class KL information will be finite for the sea class.
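The sea/sky argument can be made concrete on a discrete colour space. A minimal sketch with hypothetical colour frequencies, showing an infinite bag-to-bag KL information alongside a finite bag-to-class KL information:

```python
import numpy as np

def kl_discrete(p, q):
    """KL(p||q) on a discrete space; infinite if p has mass where q has none."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any((p > 0) & (q == 0)):
        return float('inf')
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

# Colour space: [deep_green, white, blue] -- hypothetical frequencies.
sea_bag_1 = [0.5, 0.0, 0.5]   # deep green and blue, no white
sea_bag_2 = [0.0, 0.5, 0.5]   # white and blue, no deep green
test_bag  = [0.5, 0.5, 0.0]   # deep green and white together

# Bag-to-bag KL is infinite for every training bag ...
print(kl_discrete(test_bag, sea_bag_1), kl_discrete(test_bag, sea_bag_2))  # inf inf

# ... but aggregating instances to class level gives a finite divergence:
sea_class = np.mean([sea_bag_1, sea_bag_2], axis=0)   # [0.25, 0.25, 0.5]
print(kl_discrete(test_bag, sea_class))               # log(2) ~ 0.693
```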
Let $p_{+}$ and $p_{-}$ be the probability distributions of a random vector from the bags of class $+$ and class $-$, respectively. Let $D(g \,\|\, p_{+})$ and $D(g \,\|\, p_{-})$ be the divergences between the unlabelled bag, $g$, and each of the classes. The choice of divergence is not obvious, since $g$ differs from both $p_{+}$ and $p_{-}$, but it can be made by identification of properties.
4 Properties for bag-level classification
4.1 Properties for bag-to-class divergences
We here propose two properties for bag-to-class divergences, regarding infinite bag-to-class ratio and zero instance probability. Denote by $D(g \,\|\, p_{\mathrm{ref}})$ the divergence between an unlabelled bag, $g$, and a reference (class) distribution, $p_{\mathrm{ref}}$.
As a motivating example, consider the following: a positive bag is a continuous uniform distribution whose location parameter is itself uniformly sampled. A negative bag is sampled analogously, with a location range that partially overlaps that of the positive class. For both positive and negative bags, the bag density exceeds the class density on one subspace of $\mathcal{X}$ and falls below it on another, merely reflecting that the variability in instances within a class is larger than within a bag, as illustrated in Fig. 2.
If $g$ is a sample from the negative class, and the bag-to-class ratio $g(x)/p_{+}(x)$ is large on some subspace of $\mathcal{X}$, it can easily be classified. From the above analysis, a large bag-to-class ratio should be reflected in a large divergence, whereas a large class-to-bag ratio should not.
Property 1: For the subspace of $\mathcal{X}$ where the bag-to-class ratio $g(x)/p_{\mathrm{ref}}(x)$ is larger than some $\kappa$, the contribution to the total divergence approaches the maximum contribution as $\kappa \to \infty$. For the subspace of $\mathcal{X}$ where the class-to-bag ratio $p_{\mathrm{ref}}(x)/g(x)$ is larger than $\kappa$, the contribution to the total divergence does not approach the maximum contribution as $\kappa \to \infty$.
Property 1 cannot be fulfilled by a symmetric divergence.
As a second motivating example, consider the same positive class as before, and two alternative negative classes whose densities differ only on a subspace where the unlabelled bag, $g$, has zero density. For bag classification, the question becomes: from which class is a specific bag sampled? It is equally probable that the bag comes from each of the two negative classes, since their densities only differ where $g(x) = 0$, and we argue that the divergence from $g$ to the one should equal the divergence from $g$ to the other.
Property 2: For the subspace of $\mathcal{X}$ where $g(x)$ is smaller than some $\epsilon$, the contribution to the total divergence approaches zero as $\epsilon \to 0$.
The KL information is the only divergence among the non-symmetric divergences listed in [Taneja2006Generalized] that fulfils these two properties. Since there is no complete list of divergences, other divergences that the authors are not aware of may also fulfil them.
4.2 A class-conditional dissimilarity for MI classification
In the sea and sky images example, consider an unlabelled image with a pink segment, e.g. a boat. If pink is absent from the training set, then the bag-to-class KL information will be infinite for both classes. We therefore propose the following property:
Property 3: For the subspace of $\mathcal{X}$ where both class probabilities are smaller than some $\epsilon$, the contribution to the total divergence approaches zero as $\epsilon \to 0$.
We present a class-conditional dissimilarity that accounts for this, and which also fulfils Properties 1 and 2. It differs from the earlier proposed weighted KL information [Sahu2003Fast], whose weight is a constant function of $x$.
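As a rough illustration of the weighted-KL idea behind the proposed dissimilarity: the sketch below uses a hypothetical indicator weight that vanishes where the class density is negligible (in the spirit of Property 3); it is not the proposed weight, which is not reproduced here.

```python
import numpy as np

def weighted_kl(p, q, w):
    """sum_x w(x) p(x) log(p(x)/q(x)) on a discrete space.
    `w` is a caller-supplied placeholder weight, not the proposed one."""
    p, q, w = (np.asarray(a, float) for a in (p, q, w))
    m = (p > 0) & (w > 0)
    return float(np.sum(w[m] * p[m] * np.log(p[m] / q[m])))

# Space: [sea colours..., pink]; pink is absent from both classes.
p_bag = np.array([0.4, 0.4, 0.2])   # unlabelled bag with a pink segment
p_sea = np.array([0.5, 0.5, 0.0])   # sea-class density: no pink mass

# Plain bag-to-class KL blows up on the unseen pink mass:
print(weighted_kl(p_bag, p_sea, np.ones(3)))   # inf

# A weight that vanishes where the class density is below some eps
# keeps the dissimilarity finite (Property 3):
eps = 1e-6
w = (p_sea > eps).astype(float)
print(weighted_kl(p_bag, p_sea, w))
```

Note that a weighted KL information need not be non-negative; the point of the sketch is only that the unseen pink mass no longer forces the dissimilarity to infinity.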
5 Simulated and real data
5.1 Simulated data
The following study exemplifies the differences between the BH distance ratio, the KL information ratio, and the proposed class-conditional dissimilarity as classifiers for sparse training data. The number of instances from each bag takes the values 5, 10 and 25, the number of bags in the training set from each class is varied from 1 to 10, and the number of bags in the test set is held fixed. Each bag and its instances are sampled as described in Eq. 2. For simplicity, we use Gaussian distributions in one dimension for Sim 1–Sim 4:
Sim 1: no positive instances in negative bags.
Sim 2: positive instances in negative bags.
Sim 3: positive and negative instances have the same expectation of the mean, but unequal variance.
Sim 4: positive instances are sampled from two distributions with unequal mean expectation.
We add Sim 5 and Sim 6 for the discussion on instance labels in Section 6, as follows: Sim 5 is an uncertain object classification problem, where the positive bags are log-normal densities and the negative bags are Gaussian mixture densities, with parameters chosen such that the two densities are nearly identical, see (McLachlan2000Finite, p. 15). In Sim 6, the parameters of Sim 5 are i.i.d. observations from Gaussian distributions.
All distributions are estimated by kernel density estimation (KDE) with the Epanechnikov kernel, and bandwidth varying with the number of observations. Figure 3 shows the estimated class densities and two estimated bag densities for Sim 2.
The integrals are estimated by importance sampling. The minimum-dissimilarity bag-to-bag classifier is also implemented, based on the KL information and the BH distance.
The area under the receiver operating characteristic (ROC) curve (AUC) serves as performance measure. Table 1 shows the mean AUCs over the repetitions.
Table 1: Mean AUC (in %). Rows: number of training bags per class (1, 5, 10) for each simulation; column groups: number of instances per bag (5, 10, 25). Within each group, the columns are the BH distance ratio, the KL information ratio, and the proposed class-conditional dissimilarity.

Sim  Bags | n = 5         | n = 10        | n = 25
          | BH  KL  prop. | BH  KL  prop. | BH  KL  prop.
1    1    | 61  69  85    | 62  72  89    | 61  73  92
     5    | 63  75  86    | 64  82  94    | 68  84  97
     10   | 69  86  87    | 73  91  95    | 75  91  98
2    1    | 57  61  75    | 59  61  78    | 58  55  75
     5    | 59  67  79    | 60  68  84    | 62  63  85
     10   | 64  77  80    | 66  78  86    | 68  72  86
3    1    | 51  55  71    | 52  58  73    | 50  57  74
     5    | 53  61  76    | 53  66  81    | 52  65  83
     10   | 58  73  78    | 58  76  84    | 57  76  87
4    1    | 55  61  70    | 56  62  73    | 56  58  69
     5    | 56  63  75    | 57  64  81    | 59  59  80
     10   | 60  74  77    | 62  76  85    | 63  69  84
5    1    | 64  61  62    | 67  63  66    | 64  62  67
     5    | 73  69  63    | 74  70  67    | 75  71  72
     10   | 74  70  62    | 75  73  69    | 76  74  72
6    1    | 68  68  67    | 66  68  68    | 68  71  68
     5    | 65  64  67    | 68  68  69    | 70  71  74
     10   | 66  64  66    | 70  69  72    | 72  73  74
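The simulation pipeline of Section 5.1 can be sketched end-to-end. Everything concrete below is an illustrative assumption, not the paper's setup: Sim-2-like Gaussian parameters, fixed bandwidths, a grid-based Riemann sum in place of importance sampling, and a tie-ignoring Mann-Whitney AUC.

```python
import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(-6.0, 10.0, 2001)
DX = grid[1] - grid[0]

def epanechnikov_kde(samples, h):
    """KDE with the Epanechnikov kernel K(u) = 0.75 (1 - u^2) on |u| <= 1."""
    u = (grid[:, None] - np.asarray(samples)[None, :]) / h
    return (0.75 * np.clip(1.0 - u**2, 0.0, None)).sum(axis=1) / (len(samples) * h)

def kl_on_grid(p, q, eps=1e-12):
    """Riemann-sum KL(p||q); eps floors q as a numerical safeguard."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / np.maximum(q[m], eps))) * DX)

def auc(pos_scores, neg_scores):
    """AUC as the Mann-Whitney statistic (ties ignored for brevity)."""
    pos, neg = np.asarray(pos_scores), np.asarray(neg_scores)
    return float(np.mean(pos[:, None] > neg[None, :]))

def make_bag(positive):
    """Hypothetical Sim-2-like bag: draw a bag-level mean, then 100 instances."""
    mu = rng.normal(3.0 if positive else 0.0, 0.5)
    return rng.normal(mu, 1.0, 100)

# Class densities from pooled training instances (10 bags per class).
f_pos = epanechnikov_kde(np.concatenate([make_bag(True) for _ in range(10)]), h=0.4)
f_neg = epanechnikov_kde(np.concatenate([make_bag(False) for _ in range(10)]), h=0.4)

def score(bag):
    """Bag-to-class score: closer (in KL) to the positive class => larger."""
    f_bag = epanechnikov_kde(bag, h=0.6)
    return kl_on_grid(f_bag, f_neg) - kl_on_grid(f_bag, f_pos)

pos = [score(make_bag(True)) for _ in range(20)]
neg = [score(make_bag(False)) for _ in range(20)]
print(auc(pos, neg))
```

With this amount of separation between the classes the sketch yields an AUC close to 1; the interesting regimes in Table 1 arise from far sparser training sets and overlapping classes.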
5.2 Real data
Breast tissue images (see Fig. 1) with corresponding feature vectors (www.miproblems.org) [Kandemir2014Empowering] are used as example. Following the procedure in [Kandemir2014Empowering], principal components are used for dimension reduction, and cross-validation is used in the classification. For density estimation, Gaussian mixture models (GMMs) are fitted to the first principal component, using an EM algorithm, with the number of components chosen by minimum AIC. In addition, KDE as in Section 5.1, and KDE with a Gaussian kernel and optimal bandwidth, are used.
                       | KDE (Epan.) | KDE (Gauss.) | GMMs
KL information ratio   | 90          | 92           | 94
proposed dissimilarity | 82          | 92           | 96
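The GMM density-estimation step above (EM fit, number of components chosen by minimum AIC) can be sketched with a minimal 1-D EM. The data, the quantile initialisation, and the parameter count are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def fit_gmm_1d(x, k, n_iter=200):
    """EM for a k-component 1-D Gaussian mixture; returns params and log-likelihood.
    Deterministic quantile initialisation, variances floored for stability."""
    w = np.full(k, 1.0 / k)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    var = np.full(k, np.var(x))
    for _ in range(n_iter):
        dens = np.exp(-0.5 * (x[:, None] - mu)**2 / var) / np.sqrt(2 * np.pi * var)
        r = w * dens
        r /= r.sum(axis=1, keepdims=True)      # E-step: responsibilities
        nk = r.sum(axis=0)                     # M-step: update w, mu, var
        w, mu = nk / len(x), (r * x[:, None]).sum(axis=0) / nk
        var = np.maximum((r * (x[:, None] - mu)**2).sum(axis=0) / nk, 1e-6)
    dens = np.exp(-0.5 * (x[:, None] - mu)**2 / var) / np.sqrt(2 * np.pi * var)
    loglik = float(np.log((w * dens).sum(axis=1)).sum())
    return w, mu, var, loglik

def aic(loglik, k):
    """AIC = 2 * (free parameters) - 2 log-likelihood; 3k - 1 for k components."""
    return 2 * (3 * k - 1) - 2 * loglik

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])
best_k = min(range(1, 5), key=lambda k: aic(fit_gmm_1d(x, k)[3], k))
print(best_k)
```

On this well-separated two-component sample, AIC rejects the single-component model by a wide margin.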
5.3 Results
The general trend in Table 1 is that the proposed class-conditional dissimilarity gives higher AUC than the KL information ratio, which in turn gives higher AUC than the BH distance ratio, in line with the divergences' properties for sparse training sets. The same trend can be seen with a Gaussian kernel and optimal bandwidth (numbers not reported). The gap between the KL information ratio and the proposed measure narrows with larger training sets. In other words, the benefit of the proposed measure increases with sparsity. This can be explained by the risk of infinite KL information, as seen in Figure 3(a).
Increasing the number of instances per bag also narrows the gap between the KL information ratio and the proposed measure, and for large numbers of instances the KL ratio eventually outperforms it (numbers not reported). Sim 1 and Sim 3 are less affected, since the bag-to-class ratio is already extreme.
The minimum bag-to-bag classifier gives a single sensitivity-specificity outcome, and the KL information outperforms the BH distance. Compared to the ROC curve, as illustrated in Fig. 4, the minimum bag-to-bag KL information classifier exceeds the bag-to-class dissimilarities only for very large training sets, typically 500 bags or more, and then at the expense of extensive computation time.
Sim 5 is an example in which the absolute difference, and not the ratio, differentiates the two classes, and the BH distance ratio has the superior performance. When the extra hierarchy level is added in Sim 6, the relative performances return to those observed for Sim 1–Sim 4.
The real-data study shows that the simple divergence-based approach can outperform more sophisticated algorithms. The proposed dissimilarity is more sensitive than the KL information ratio to the choice of density estimation method. It performs better than the KL information ratio with GMMs, and both exceed the AUC of the original algorithm.
6 Discussion
6.1 Comparison to other frameworks
The bag-to-class divergence approach fits into the branch of collective assumptions of the Foulds and Frank taxonomy [Foulds2010Review], where all instances contribute to the bag label. Another branch includes instance-level and bag-level distances, which we have expanded with a bag-to-class category. Amores [Amores2013Multiple] defines a bag as a set of points in a $d$-dimensional space, and the probability distribution viewpoint offers an alternative definition of bags. The bag-to-class divergence approach fits into the BS paradigm, which is the least explored of the three.
Carbonneau et al. [Carbonneau2018Multiple] assume underlying instance labels. In Sim 1–Sim 4, the instance labels are inaccessible through observations without prior knowledge about the distributions. In Sim 6, the instance label approach is not useful, due to the similarity between the two distributions:
$$x \sim f(x; \theta), \qquad \theta \sim F_{\Theta}^{+} \ (F_{\Theta}^{-}), \tag{3}$$
where $f(\cdot \,; \theta)$ is the log-normal or the Gaussian mixture density, respectively. Eq. 2 is just a special case of Eq. 3, where $\theta$ is the random vector $(\pi, \theta^{+}, \theta^{-})$. Without knowledge about the distributions, discriminating between training sets following the generative model of Eq. 2 and Eq. 3 is only possible for a limited number of problems. Even the uncertain objects of Sim 5 are difficult to discriminate from MI objects based solely on the observations in the training set.
The assumption of i.i.d. samples within a bag facilitates distribution estimation, but it is not inherent to the probability distribution viewpoint. Prior knowledge about the distribution shape can facilitate the choice of divergence, as seen for the uncertain objects of Sim 5.
6.2 Conclusions
Although the bag-to-bag KL information asymptotically has the minimum misclassification rate, the typical sparseness of MI training sets is an obstacle, which bag-to-class dissimilarities partly overcome. The proposed class-conditional KL information accounts for additional sparsity, but the diversity of data types, assumptions, problem characteristics, sampling sparsity, etc. is far too large for any one approach to be sufficient.
The bag-to-class divergence approach addresses three main challenges in MI learning. (1) Aggregation of instances according to bag label, with the additional class-conditioning, provides a solution to the data sparsity problem. (2) The bag-to-bag approach suffers from extensive computation time, which the bag-to-class approach avoids. (3) Viewing bags as probability distributions gives access to analytical tools from statistics and probability theory, and methods can be compared on a data-independent level through identification of properties. The properties presented here are not an exhaustive list, and any extra knowledge should be taken into account whenever available.
The introduction of divergences as an alternative class of dissimilarity functions, and of bag-to-class dissimilarity as an alternative to bag-to-bag dissimilarity, adds new tools to the MI toolbox.
Acknowledgements
No specific funding was received.
Footnotes
 journal: Pattern Recognition
References
 E. Drelie Gelasca, J. Byun, B. Obara, B. S. Manjunath, Evaluation and benchmark for biological image segmentation, in: IEEE International Conference on Image Processing, 2008. doi:10.1109/ICIP.2008.4712130.

 G. Doran, S. Ray, Multiple-instance learning from distributions, Journal of Machine Learning Research 17 (128) (2016) 1–50. URL http://jmlr.org/papers/v17/15171.html
 M.-L. Zhang, Z.-H. Zhou, Multi-instance clustering with applications to multi-instance prediction, Applied Intelligence 31 (1) (2009) 47–68. doi:10.1007/s104890070111x.
 Z.-H. Zhou, M.-L. Zhang, S.-J. Huang, Y.-F. Li, Multi-instance multi-label learning, Artificial Intelligence 176 (1) (2012) 2291–2320. doi:10.1016/j.artint.2011.10.002.
 T. G. Dietterich, R. H. Lathrop, T. Lozano-Pérez, Solving the multiple instance problem with axis-parallel rectangles, Artificial Intelligence 89 (1) (1997) 31–71. doi:10.1016/s00043702(96)000343.
 Y.-Y. Xu, Multiple-instance learning based decision neural networks for image retrieval and classification, Neurocomputing 171 (2016) 826–836. doi:10.1016/j.neucom.2015.07.024.
 M. Qiao, L. Liu, J. Yu, C. Xu, D. Tao, Diversified dictionaries for multi-instance learning, Pattern Recognition 64 (2017) 407–416. doi:10.1016/j.patcog.2016.08.026.
 N. Weidmann, E. Frank, B. Pfahringer, A two-level learning method for generalized multi-instance problems, in: N. Lavrač, D. Gamberger, H. Blockeel, L. Todorovski (Eds.), Machine Learning: ECML 2003, Springer Berlin Heidelberg, 2003, pp. 468–479. doi:10.1007/9783540398578_42.
 J. Foulds, E. Frank, A review of multi-instance learning assumptions, The Knowledge Engineering Review 25 (1) (2010) 1–25. doi:10.1017/s026988890999035x.
 V. Cheplygina, D. M. J. Tax, M. Loog, Multiple instance learning with bag dissimilarities, Pattern Recognition 48 (1) (2015) 264–275. doi:10.1016/j.patcog.2014.07.022.
 M. A. Carbonneau, V. Cheplygina, E. Granger, G. Gagnon, Multiple instance learning: A survey of problem characteristics and applications, Pattern Recognition 77 (2018) 329–353. doi:10.1016/j.patcog.2017.10.009.
 J. Amores, Multiple instance classification: Review, taxonomy and comparative study, Artif. Intell. 201 (2013) 81–105. doi:10.1016/j.artint.2013.06.003.
 Z.-H. Zhou, Y.-Y. Sun, Y.-F. Li, Multi-instance learning by treating instances as non-i.i.d. samples, in: Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, ACM, New York, NY, USA, 2009, pp. 1249–1256. doi:10.1145/1553374.1553534.
 W.J. Lee, V. Cheplygina, D. M. J. Tax, M. Loog, R. P. W. Duin, Bridging structure and feature representations in graph matching, Int. J. Patt. Recogn. Artif. Intell. 26 (05) (2012) 1260005+. doi:10.1142/s0218001412600051.
 O. Maron, T. Lozano-Pérez, A framework for multiple-instance learning, in: Advances in Neural Information Processing Systems 10, 1998, pp. 570–576.
 X. Xu, E. Frank, Logistic regression and boosting for labeled bags of instances, in: H. Dai, R. Srikant, C. Zhang (Eds.), Advances in Knowledge Discovery and Data Mining, Springer Berlin Heidelberg, 2004, pp. 272–281. doi:10.1007/9783540247753_35.
 D. M. J. Tax, M. Loog, R. P. W. Duin, V. Cheplygina, W.-J. Lee, Bag dissimilarities for multiple instance learning, in: M. Pelillo, E. R. Hancock (Eds.), Similarity-Based Pattern Recognition, Springer Berlin Heidelberg, 2011, pp. 222–234. doi:10.1007/9783642244711_16.
 B. Jiang, J. Pei, Y. Tao, X. Lin, Clustering uncertain data based on probability distribution similarity, IEEE Transactions on Knowledge and Data Engineering 25 (4) (2013) 751–763. doi:10.1109/tkde.2011.221.
 H.-P. Kriegel, M. Pfeifle, Density-based clustering of uncertain data, in: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, ACM, New York, NY, USA, 2005, pp. 672–677. doi:10.1145/1081870.1081955.
 V. Cheplygina, D. M. J. Tax, M. Loog, Dissimilarity-based ensembles for multiple instance learning, IEEE Transactions on Neural Networks and Learning Systems 27 (6) (2016) 1379–1391. doi:10.1109/TNNLS.2015.2424254.
 O. Boiman, E. Shechtman, M. Irani, In defense of nearest-neighbor based image classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8. doi:10.1109/cvpr.2008.4587598.
 S. Scott, J. Zhang, J. Brown, On generalized multiple-instance learning, Int. J. Comp. Intel. Appl. 05 (01) (2005) 21–35. doi:10.1142/s1469026805001453.
 J. F. Ruiz-Muñoz, G. Castellanos-Dominguez, M. Orozco-Alzate, Enhancing the dissimilarity-based classification of birdsong recordings, Ecological Informatics 33 (2016) 75–84. doi:10.1016/j.ecoinf.2016.04.001.
 L. Sørensen, M. Loog, D. M. J. Tax, W.-J. Lee, M. de Bruijne, R. P. W. Duin, Dissimilarity-based multiple instance learning, in: E. R. Hancock, R. C. Wilson, T. Windeatt, I. Ulusoy, F. Escolano (Eds.), Structural, Syntactic, and Statistical Pattern Recognition, Springer Berlin Heidelberg, 2010, pp. 129–138. doi:10.1007/9783642149801_12.

 B. Schölkopf, The kernel trick for distances, in: Proceedings of the 13th International Conference on Neural Information Processing Systems, NIPS'00, MIT Press, Cambridge, MA, USA, 2000, pp. 283–289. URL http://portal.acm.org/citation.cfm?id=3008793
 S. Kullback, R. A. Leibler, On information and sufficiency, The Annals of Mathematical Statistics 22 (1) (1951) 79–86. doi:10.1214/aoms/1177729694.
 A. L. Gibbs, F. E. Su, On choosing and bounding probability metrics, International Statistical Review 70 (3) (2002) 419–435. doi:10.1111/j.17515823.2002.tb00178.x.
 K. Møllersen, S. S. Dhar, F. Godtliebsen, On data-independent properties for density-based dissimilarity measures in hybrid clustering, Applied Mathematics 07 (15) (2016) 1674–1706. doi:10.4236/am.2016.715143.
 K. Møllersen, J. Y. Hardeberg, F. Godtliebsen, Divergence-based colour features for melanoma detection, in: Colour and Visual Computing Symposium (CVCS), 2015, IEEE, 2015, pp. 1–6. doi:10.1109/CVCS.2015.7274885.

 S. M. Ali, S. D. Silvey, A general class of coefficients of divergence of one distribution from another, Journal of the Royal Statistical Society. Series B (Methodological) 28 (1) (1966) 131–142. URL http://www.jstor.org/stable/2984279
 I. Csiszár, Information-type measures of difference of probability distributions and indirect observations, Studia Scientiarum Mathematicarum Hungarica 2 (1967) 299–318.
 L. M. Bregman, The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming, USSR Computational Mathematics and Mathematical Physics 7 (3) (1967) 200–217. doi:10.1016/00415553(67)900407.
 S. Eguchi, J. Copas, Interpreting Kullback-Leibler divergence with the Neyman-Pearson lemma, J. Multivar. Anal. 97 (9) (2006) 2034–2040. doi:10.1016/j.jmva.2006.03.007.
 R. E. Kass, A. E. Raftery, Bayes factors, Journal of the American Statistical Association 90 (430) (1995) 773–795. doi:10.2307/2291091.
 I. J. Taneja, P. Kumar, Generalized non-symmetric divergence measures and inequalities, Journal of Interdisciplinary Mathematics 9 (3) (2006) 581–599. doi:10.1080/09720502.2006.10700466.
 S. K. Sahu, R. C. H. Cheng, A fast distance-based approach for determining the number of components in mixtures, Can J Statistics 31 (1) (2003) 3–22. doi:10.2307/3315900.
 M. Kandemir, C. Zhang, F. A. Hamprecht, Empowering multiple instance histopathology cancer diagnosis by cell graphs, in: P. Golland, N. Hata, C. Barillot, J. Hornegger, R. Howe (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2014, Springer International Publishing, 2014, pp. 228–235. doi:10.1007/9783319104706_29.
 G. McLachlan, D. Peel, Finite Mixture Models, Wiley Series in Probability and Statistics, John Wiley & Sons, Inc., 2000. doi:10.1002/0471721182.