Diversity in Machine Learning
Abstract
Machine learning methods have achieved good performance and have been widely applied in various real-world applications. They can learn the model adaptively and better fit the special requirements of different tasks. Many factors can affect the performance of the machine learning process, among which the diversity of the machine learning system is an important one. Generally, a good machine learning system is composed of plentiful training data, a good model training process, and accurate inference. Diversity can help each of these procedures and thereby the system as a whole: diversity of the training data ensures that the data contain enough discriminative information; diversity of the learned model (diversity in the parameters of each model or diversity among models) makes each parameter/model capture unique or complementary information; and diversity in inference provides multiple choices, each of which corresponds to a plausible result. However, there is no systematic analysis of diversification in machine learning systems. In this paper, we systematically summarize the methods for data diversification, model diversification, and inference diversification in the machine learning process. In addition, typical applications in which diversity technology has improved machine learning performance are surveyed, including remote sensing imaging tasks, machine translation, camera relocalization, image segmentation, object detection, topic modeling, and others. Finally, we discuss some challenges of diversity technology in machine learning and point out some directions for future work. Our analysis provides a deeper understanding of diversity technology in machine learning tasks, and hence can help design and learn more effective models for specific tasks.
Keywords:
Diversity, Training Data, Model Learning, Inference, Machine Learning, Active Learning, Bayesian Method, Posterior Regularization
1 Introduction
Traditionally, machine learning methods learn parameters automatically from training samples, and thus they can adapt to the special requirements of various applications. They have achieved great success in tackling many real-world artificial intelligence and data mining problems (Bishop, 2006), such as optical character recognition, face detection, autonomous car driving, data mining of biological data, and web search/information retrieval. A successful machine learning system often includes plentiful training data, which provide enough information to train the model; a good model learning process, which can better model the data; and accurate inference to discriminate different objects. Many factors can affect the performance of the machine learning process, among which diversity plays an important role.
Diversity appears frequently in many fields, such as biological systems, culture, and products. A diversified system usually contains more information and can better fit various environments. The diversity property can also be applied to machine learning systems. Here, we define the diversity property in machine learning as follows: different data contain complementary and useful information or features, and each factor in the same layer captures unique information from the data. The diversity property tries to minimize the redundancy between different training data as well as the redundancy between the information captured by different factors. Therefore, diversity can improve the performance of the model and plays an important role in the machine learning process. We summarize the diversification of machine learning into three categories: the diversity in training data (data diversification), the diversity of the model/models (model diversification), and the diversity of the inference (inference diversification).
Data diversification can provide samples with enough information to train the machine learning model. Diversity in the training data can maximize the information contained in the data. Many prior works have imposed diversity on the construction of each training batch so that the model is trained more effectively (Zhang et al, 2017). In addition, diversity in active learning can make the labelled training data contain the most information (You and Tao, 2014; Shi and Shen, 2016), and thus the learned model can achieve good performance with limited training samples.
Model diversification is motivated by the diversity in the human visual system. Vinje and Gallant (2000); Olshausen and Field (1996); Zylberberg et al (2011) have shown that the human visual system exhibits decorrelation and sparseness, namely diversity. This makes different neurons respond to different stimuli and generates little redundancy in the learning process, which ensures the high effectiveness of human learning. However, usual machine learning methods exhibit redundancy in the learned model, where different factors model similar features (Xie et al, 2015b). Therefore, model diversification can significantly improve the performance of machine learning systems. Model diversification includes the diversity in the parameters of each model (D-model) and the diversity among the parameters of multiple base models (D-models). The former encourages different parameters in each model to be diversified so that each parameter models unique information (Zhong et al, 2017; Xie et al, 2015a). As a result, the performance of each model can be significantly improved (Gong et al, 2018c). In contrast, diversity among multiple base models, which is also called ensemble diversity, tries to make different base models repel each other and encourages each base model to provide complementary information (Viola and Jones, 2004; Lee et al, 2016; Zhu and Xing, 2009).
Inference diversification can provide choices/representations with more complementary information. Generally, one can obtain multiple choices from the inference. However, the choices obtained from usual machine learning systems are similar to each other, where the next choice tends to be a one-pixel shifted version of the others (Park and Ramanan, 2011). Through diversification, by contrast, the machine learning model can provide multiple choices that contain the most complementary information (Batra et al, 2012; Yadollahpour et al, 2013; Chen et al, 2013; Felzenszwalb et al, 2010).
This work systematically covers the literature on diversity-promoting methods over data diversification, model diversification, and inference diversification in machine learning tasks. In particular, three main questions arise in the analysis of diversity technology in machine learning.

How can the training data, the learned model/models, and the inference in a machine learning system be diversified, respectively? Why do these methods work for the diversification of the machine learning system?

Is there any difference between the diversification of the model and models? Furthermore, is there any similarity between the diversity in training data, the learned model/models, and the inference?

In which real-world applications has diversity been applied in prior works? How do the diversification methods work in these applications?
All three questions are important, yet none of them has been thoroughly answered. Diversity in machine learning can balance the training data, encourage the learned parameters to be diversified, and diversify the choices from the inference. By enforcing diversity in the machine learning system, the machine learning model can achieve better performance. Following this framework, the three questions above are addressed through both theoretical analysis and real-world applications.
The remainder of this paper is organized as follows. Section 2 discusses the general forms of supervised learning and active learning in machine learning models. Section 3 summarizes the most commonly used diversity measurements in prior works. Section 4 outlines some of the prior works on diversification in training data. Section 5 reviews the strategies for model diversification, including the D-model and the D-models. The prior works on inference diversification are summarized in Section 6. Section 7 introduces some applications of the diversity-promoting methods in prior works. Finally, we discuss the results, conclude the paper, and point out some future directions.
2 General Machine Learning Models
Traditionally, machine learning consists of supervised learning, active learning, unsupervised learning, and reinforcement learning. In reinforcement learning, training data are given only as feedback to the program’s actions in a dynamic environment; accurate input/output pairs are not required, and suboptimal actions need not be explicitly corrected. However, the diversity technologies mainly work on the model itself to improve the model’s performance. Therefore, this work does not cover reinforcement learning and mainly discusses the machine learning model shown in Fig. 1. In addition, although some works have explored diversification in the data pre-processing of unsupervised learning, most prior works focus on the diversification of supervised learning and active learning. Therefore, this work mainly summarizes these supervised and active learning models, as Fig. 1 shows.
2.1 Supervised Learning
We consider the task of general supervised machine learning models, which are commonly used in real-world machine learning tasks. Fig. 1 shows the flowchart of the general machine learning methods in this work. It can be noted from Fig. 1 that the supervised machine learning model consists of data pre-processing, training (modeling), and inference.
Let $X = \{x_1, x_2, \ldots, x_N\}$ denote the set of training samples and let $y_i$ be the corresponding label of $x_i$, where $y_i \in \Omega = \{\omega_1, \omega_2, \ldots, \omega_n\}$ ($\Omega$ is the set of class labels and $n$ is the number of classes). Traditionally, the machine learning task can be formulated as the following optimization problem:
(1)  $$\min_{W}\ L(W \mid X) \quad \text{s.t.}\quad g(W) \le 0$$
where $L(W \mid X)$ represents the loss function, $W$ denotes the parameters of the model, and $g(W) \le 0$ is the constraint on the parameters of the model. Using a Lagrange multiplier, the optimization can be reformulated as follows.
(2)  $$L_0 = L(W \mid X) + \eta\, g(W)$$
where $\eta$ is a positive value. Therefore, the machine learning problem can be seen as the minimization of $L_0$.
Figs. 2 and 3 show the flowcharts of special forms of supervised learning models. Fig. 2 shows the flowchart of supervised machine learning with a single model. In the data pre-processing stage, the more diverse and balanced each training batch is, the more effective the training process is. In addition, the factors in the same layer of the model can be diversified to improve the representational ability of the model (which is called D-model in this paper). Moreover, when we obtain multiple choices from the model, the obtained choices are desired to provide more complementary information. Therefore, some works focus on the diversification of multiple choices (which we call inference diversification). Fig. 3 shows the flowchart of supervised machine learning with multiple parallel base models. A good strategy for diversifying the training sets of different base models can improve the performance of the whole ensemble. Furthermore, we can diversify these base models directly so that each base model provides more complementary information for further analysis.
2.2 Active Learning
Since labelling is costly and time-consuming, enough labelled samples are usually not available for training. Therefore, active learning, which can reduce the labelling cost and keep the training set at a moderate size, plays an important role in machine learning. It makes use of the most informative samples and can obtain higher performance with fewer labelled training samples.
Through active learning, we can choose the most informative samples as labelled samples to train the model. This paper will take the Convex Transductive Experimental Design (CTED) as the representative of the active learning methods (Yu et al, 2006, 2008).
Denote $U = [u_1, u_2, \ldots, u_M]$ as the candidate unlabelled samples for active learning. Then, the active learning problem can be formulated as the following optimization problem (Yu et al, 2008):
(3)  $$\min_{A,\,b}\ \|U - UA\|_F^2 + \sum_{i=1}^{M}\frac{\sum_{j=1}^{M} a_{ij}^2}{b_i} + \alpha\|b\|_1 \quad \text{s.t.}\quad b_i \ge 0,\ i = 1, \ldots, M$$
where $a_{ij}$ is the $(i,j)$th entry of $A$, and $\alpha$ is a positive tradeoff parameter. As is shown, CTED utilizes a data reconstruction framework to select informative samples for labelling. The matrix $A$ contains the reconstruction coefficients and $b$ is the sample selection vector. The $L_1$ norm makes the learned $b$ sparse. Then, the obtained $b$ is used to select samples for labelling, and finally the training set is constructed from the selected samples and the original training samples. However, the samples selected by CTED are usually similar to each other, which leads to redundancy in the training samples. Therefore, the diversity property is also required in the active learning process.
In summary, diversification can be used in supervised learning and active learning to improve the model’s performance. According to the models in Sections 2.1 and 2.2, the diversification technology in machine learning is divided into three parts: data diversification (Section 4), model diversification (Section 5), and inference diversification (Section 6). Since the diversification of the training batch (Fig. 2) and the diversification in active learning mainly concern the training data, we summarize the prior works on both as data diversification in Section 4. In addition, the diversification of the model in Fig. 2 and of the models in Fig. 3 mainly focuses on the trained model or on the different base models directly, and thus we summarize these works in Section 5. Finally, the inference diversification in Fig. 2 is summarized in Section 6. In the following section, we first introduce the measurements that can calculate the similarity and promote diversity in machine learning models.
3 Diversity Measurements
Even though the physical meanings of the training data, the parameter factors in the trained model, and the choices in inference are different, the mathematical forms of these factors are similar, since all of them can be represented as vectors. Denote $W = [w_1, w_2, \ldots, w_K]$, where $w_i$ represents the training data or factors in the machine learning model. Generally, geometric properties between these vectors can be used to measure the similarity between the vectors. Since vectors can be seen as points in a space, the distribution of the data points can also be used to measure the similarity. In addition, the results obtained from a machine learning model are usually probability distributions, so some Bayesian methods can be used to calculate the similarity between the probability distributions. Finally, considering the diversity within groups is another way to measure the similarity. In the following, we summarize the diversity measurements that are commonly used in prior works to encourage diversification over different factors from these four aspects.
3.1 Geometric Properties
Generally, when two vectors are orthogonal or the distance between them is large enough, the two vectors are regarded as uncorrelated and dissimilar. Therefore, geometric properties between different vectors, such as the distance and the angle, can be used to measure the similarity between different factors. In the following, we introduce angle-based methods (cosine similarity and inner product), distance-based methods (the Euclidean distance and the heat kernel), and eigenvalue-based methods (submodular spectral diversity, and uncorrelation and evenness).
3.1.1 Cosine Similarity Measurement
Cosine similarity. The most commonly used measurement of the correlation between different vectors is cosine similarity. The cosine similarity between different factors $w_i$ and $w_j$ can be calculated as
(4)  $$c_{ij} = \frac{\langle w_i, w_j\rangle}{\|w_i\|\,\|w_j\|}$$
The diversity-promoting prior of the generalized cosine similarity measurement can be written as
(5)  $$P(W) \propto \exp\Big(-\gamma\Big(\sum_{i\neq j}|c_{ij}|^{p}\Big)^{1/p}\Big)$$
It should be noted that when $p$ is set to 1, the diversity-promoting prior over different vectors by cosine similarity can be formulated as
(6)  $$P(W) \propto \exp\Big(-\gamma\sum_{i\neq j}|c_{ij}|\Big)$$
where $\gamma$ is a positive value. When the cosine similarity measurement is used as a diversity-promoting prior, $c_{ij}$ is pushed towards 0. Then, $w_i$ and $w_j$ tend to be orthogonal, and different factors are encouraged to be uncorrelated and diversified. However, this measurement has a defect: it is variant to orientation.
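The cosine-similarity penalty described above can be sketched in a few lines. This is a minimal illustration rather than the formulation of any cited work: the function names are hypothetical, and taking the absolute value of each similarity is an assumption made here so that anti-parallel factors also count as redundant.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_penalty(factors):
    """Sum of |cosine similarity| over all ordered pairs i != j.

    The absolute value is an assumption of this sketch: it makes the
    penalty smallest (zero) when all factors are mutually orthogonal.
    """
    total = 0.0
    for i, u in enumerate(factors):
        for j, v in enumerate(factors):
            if i != j:
                total += abs(cosine_similarity(u, v))
    return total

orthogonal = [[1.0, 0.0], [0.0, 2.0]]   # mutually orthogonal factors
redundant = [[1.0, 0.0], [3.0, 0.1]]    # nearly parallel factors
print(cosine_penalty(orthogonal))       # 0.0
print(cosine_penalty(redundant))        # close to 2, the maximum for two factors
```

A prior of the form exp(-gamma * penalty) then assigns higher probability to the orthogonal set than to the redundant one, and rescaling any factor leaves the penalty unchanged.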
Angle. Some works use the mean and variance of the angles between different factors to formulate the diversity of the model and thereby overcome the problem that occurs with cosine similarity. The angle between different factors can be formulated as
(7)  $$\theta_{ij} = \arccos\frac{\langle w_i, w_j\rangle}{\|w_i\|\,\|w_j\|}$$
The diversity function can be defined as
(8)  $$\Gamma(W) = \Psi(W) - \Pi(W)$$
where
$$\Psi(W) = \frac{1}{K(K-1)}\sum_{i\neq j}\theta_{ij}, \qquad \Pi(W) = \frac{1}{K(K-1)}\sum_{i\neq j}\big(\theta_{ij} - \Psi(W)\big)^2 .$$
In other words, $\Psi(W)$ is the mean of the angles between different factors and $\Pi(W)$ is the variance of the angles. Then, the diversity-promoting prior by the angle of the cosine similarity measurement can be formulated as
(9)  $$P(W) \propto \exp\big(\gamma\,\Gamma(W)\big)$$
Here $\theta_{ij}$ represents the angle between different factors. The prior in Eq. 9 encourages the angles between different factors to be $\pi/2$, and thus these factors are enforced to be diversified under the diversification prior. Moreover, the measurement is invariant to scale, translation, rotation, and orientation.
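The mean-minus-variance idea above can be illustrated as follows. This is a sketch under stated assumptions: the function names are hypothetical, the mean and variance are taken over unordered pairs (which gives the same values as averaging over ordered pairs), and no particular normalization from the cited works is implied.

```python
import math

def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))

def angular_diversity(factors):
    """Mean of the pairwise angles minus their variance."""
    angles = [angle(factors[i], factors[j])
              for i in range(len(factors))
              for j in range(i + 1, len(factors))]
    mean = sum(angles) / len(angles)
    variance = sum((a - mean) ** 2 for a in angles) / len(angles)
    return mean - variance

basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
skewed = [[1.0, 0.0, 0.0], [1.0, 0.1, 0.0], [0.0, 0.0, 1.0]]
print(angular_diversity(basis))   # all angles equal pi/2, so the variance is 0
print(angular_diversity(skewed))  # smaller: one small angle lowers the mean and raises the variance
```

Subtracting the variance penalizes configurations in which a few pairs are well separated while others are nearly parallel, which the mean alone would not detect.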
3.1.2 Inner Product Measurement
Different vectors present more diversity when they tend to be more orthogonal. The inner product can measure the orthogonality between different vectors, and therefore it can be applied in machine learning models for more diversity. Guo et al (2017) use a special form of the inner product measurement, which is called exclusivity. The exclusivity between two vectors $w_i$ and $w_j$ is defined as
(10)  $$\chi(w_i, w_j) = \|w_i \odot w_j\|_0$$
where $\odot$ denotes the Hadamard product, and $\|\cdot\|_0$ denotes the $L_0$ norm. Therefore, the diversity-promoting prior can be written as
(11)  $$P(W) \propto \exp\Big(-\gamma\sum_{i\neq j}\|w_i \odot w_j\|_0\Big)$$
Due to the nonconvexity and discontinuity of the $L_0$ norm, the relaxed exclusivity is calculated as
(12)  $$\chi_r(w_i, w_j) = \|w_i \odot w_j\|_1$$
where $\|\cdot\|_1$ denotes the $L_1$ norm. Then, the diversity-promoting prior based on relaxed exclusivity can be calculated as
(13)  $$P(W) \propto \exp\Big(-\gamma\sum_{i\neq j}\|w_i \odot w_j\|_1\Big)$$
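As a small illustration of exclusivity and its relaxation, the sketch below computes both quantities for plain Python lists. The function names are hypothetical and the code only mirrors the verbal definitions above (the norm of the Hadamard product).

```python
def exclusivity(u, v):
    """L0 form: number of positions where both vectors are non-zero."""
    return sum(1 for a, b in zip(u, v) if a != 0 and b != 0)

def relaxed_exclusivity(u, v):
    """L1 form: sum of absolute values of the element-wise product."""
    return sum(abs(a * b) for a, b in zip(u, v))

u = [1.0, 0.0, 2.0, 0.0]
v = [0.0, 3.0, 0.0, 4.0]   # disjoint support from u
w = [2.0, 0.0, 0.5, 1.0]   # shares positions 0 and 2 with u

print(exclusivity(u, v), relaxed_exclusivity(u, v))  # 0 0.0
print(exclusivity(u, w), relaxed_exclusivity(u, w))  # 2 3.0
```

Both quantities vanish when two factors use disjoint sets of coordinates, which is a stronger condition than orthogonality. The relaxed form grows with the magnitude of the shared entries, so it is not scale invariant.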
Li et al (2016); Liu et al (2016) use the trace to form the inner product measurement. The diversity-promoting prior of Li et al (2016); Liu et al (2016) can be formulated as
(14) 
where $\mathrm{tr}(\cdot)$ represents the trace of the matrix.
The inner product measurement takes advantage of the characteristics among the vectors and tries to encourage different factors to be orthogonal to enforce the learned factors to be diversified. It should be noted that the measurement can be seen as a special form of cosine similarity measurement. It is easy to implement but is variant to scale and orientation.
3.1.3 Euclidean Distance Measurement
In general, the larger the Euclidean distance between two vectors, the more different the vectors are. Therefore, the Euclidean distance can be used to measure the difference between vectors. We can diversify different vectors by enlarging the Euclidean distances between them. Then, the diversity-promoting prior by Euclidean distance can be formulated as
(15)  $$P(W) \propto \exp\Big(\gamma\sum_{i\neq j}\|w_i - w_j\|_2\Big)$$
Even though the Euclidean distance measures the similarity between factors directly through their distance, the measurement is variant to scale, which may decrease the effectiveness of the diversity measurement.
3.1.4 Heat Kernel Measurement
Another commonly used method to measure the correlation between different parameters is the heat kernel. The correlation between different factors can be calculated as
(16)  $$h_{ij} = \exp\Big(-\frac{\|w_i - w_j\|^2}{\sigma}\Big)$$
where $\sigma$ is a positive value. We can find that when $w_i$ and $w_j$ are dissimilar, $h_{ij}$ tends to zero. The term $h_{ij}$ can thus measure the correlation between different factors. Then, the diversity-promoting prior by heat kernel can be formulated as
(17)  $$P(W) \propto \exp\Big(-\gamma\sum_{i\neq j}\exp\Big(-\frac{\|w_i - w_j\|^2}{\sigma}\Big)\Big)$$
The heat kernel takes advantage of the distance between factors to encourage the diversity of the model. In addition, the measurement makes the penalization of diversification vary according to a Gaussian function, and the degree of the variation is controlled by the factor $\sigma$.
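The distance-based measurements can be sketched as below. The bandwidth parameter `sigma`, the squared-distance form inside the exponential, and the function names are assumptions of this illustration; other parameterizations of the heat kernel exist.

```python
import math

def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def heat_kernel(u, v, sigma=1.0):
    """Gaussian similarity: 1 for identical vectors, near 0 for distant ones."""
    return math.exp(-squared_distance(u, v) / sigma)

def heat_kernel_penalty(factors, sigma=1.0):
    """Sum of heat-kernel similarities over all ordered pairs i != j."""
    return sum(heat_kernel(u, v, sigma)
               for i, u in enumerate(factors)
               for j, v in enumerate(factors) if i != j)

close = [[0.0, 0.0], [0.1, 0.0]]
far = [[0.0, 0.0], [3.0, 0.0]]
print(heat_kernel_penalty(close))  # near 2: the two factors are almost identical
print(heat_kernel_penalty(far))    # near 0: the two factors are well separated
```

Because the kernel depends on the raw distance, multiplying all factors by a constant changes the penalty, which reflects the scale variance of distance-based measurements. A larger `sigma` makes the penalty decay more slowly with distance.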
Instead of the distance-based and angle-based measurements, the eigenvalues of the kernel matrix can also be used to encourage different factors to be orthogonal and diversified. Recall that, for a matrix with orthonormal columns, all the eigenvalues of the kernel matrix are equal to 1. Here, we denote $G = W^\top W$ as the kernel matrix of $W$. Therefore, when we constrain the eigenvalues of $G$ to 1, the obtained vectors tend to be orthogonal. Two measurements encourage the eigenvalues to be 1: the submodular spectral diversity measurement and the uncorrelation and evenness measurement. In the following, the two measurements are introduced in detail.
3.1.5 Submodular Spectral Diversity (SSD) Measurement
The submodular spectral diversity (SSD) measurement uses the squared distance to encourage the eigenvalues to be 1 directly. Define $\lambda_1, \lambda_2, \ldots, \lambda_K$ as the eigenvalues of the kernel matrix. Then, the diversity-promoting prior by SSD can be formulated as
(18)  $$P(W) \propto \exp\Big(-\gamma\sum_{i=1}^{K}(\lambda_i - 1)^2\Big)$$
where $\gamma$ is also a positive value. This regularizes the variance of the eigenvalues of the matrix. Since all the eigenvalues are pushed towards 1, the obtained factors are more orthogonal and thus the model presents more diversity.
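A penalty of this kind can be evaluated without an eigendecomposition. The sketch below assumes the kernel matrix is the Gram matrix of the factors and uses the fact that, for a symmetric matrix, the sum of squared eigenvalues equals the squared Frobenius norm, so the sum of (eigenvalue - 1)^2 over the Gram matrix G equals the squared Frobenius norm of G - I. The function names are hypothetical.

```python
def gram_matrix(factors):
    """Gram (kernel) matrix whose (i, j) entry is the inner product of factors i and j."""
    return [[sum(a * b for a, b in zip(u, v)) for v in factors] for u in factors]

def ssd_penalty(factors):
    """Sum over eigenvalues of (lambda - 1)^2, computed as ||G - I||_F^2."""
    g = gram_matrix(factors)
    size = len(g)
    return sum((g[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(size) for j in range(size))

orthonormal = [[1.0, 0.0], [0.0, 1.0]]
duplicated = [[1.0, 0.0], [1.0, 0.0]]   # Gram matrix has eigenvalues 2 and 0

print(ssd_penalty(orthonormal))  # 0.0
print(ssd_penalty(duplicated))   # 2.0, i.e. (2 - 1)^2 + (0 - 1)^2
```

The penalty is zero only for orthonormal factors, so it constrains both the angles between the factors and their norms.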
3.1.6 Uncorrelation and Evenness (UE) Measurement
Another diversity measurement based on kernel matrix is uncorrelation and evenness (Xie et al, 2017c). This measurement encourages the learned factors to be uncorrelated and to play equally important roles in modeling data. Formally, this amounts to encouraging the kernel matrix of the vectors to have more uniform eigenvalues.
The basic idea is to normalize the eigenvalues into a probability simplex and encourage the discrete distribution parameterized by the normalized eigenvalues to have a small Kullback-Leibler (KL) divergence from the uniform distribution (Xie et al, 2017c). Then, the diversity-promoting prior by uniform eigenvalues is formulated as
(19) 
subject to ( is positive definite matrix) and , where is the kernel matrix.
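The idea of comparing normalized eigenvalues with the uniform distribution can be written out in eigenvalue form. This is only a sketch under stated assumptions: the symbols $\lambda_i$ (eigenvalues of the kernel matrix), $\bar{\lambda}_i$ (their normalized values), and $K$ (their number) are notation introduced here, and the matrix form used in the cited work may differ.

```latex
\bar{\lambda}_i = \frac{\lambda_i}{\sum_{k=1}^{K}\lambda_k},
\qquad
\mathrm{KL}\!\left(\bar{\lambda}\,\middle\|\,\tfrac{1}{K}\mathbf{1}\right)
  = \sum_{i=1}^{K}\bar{\lambda}_i \log\frac{\bar{\lambda}_i}{1/K}
  = \sum_{i=1}^{K}\bar{\lambda}_i \log\bar{\lambda}_i + \log K \;\ge\; 0 .
```

The divergence is zero exactly when all normalized eigenvalues equal $1/K$, that is, when the eigenvalues are uniform. Because the eigenvalues are normalized first, rescaling all factors by a common constant does not change this quantity.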
3.2 Data Distributions
Different from the aforementioned measurements, which use geometric properties to encourage diversity between pairwise vectors, the data distribution method, which favors a diverse set of vectors, can also be used to enforce diversity between different factors. In the following, we introduce the determinantal point process (DPP), which is commonly used in prior works to diversify machine learning models.
3.2.1 Determinantal Point Process (DPP) Measurement
A DPP is a distribution over subsets of a fixed ground set, which prefers a diverse set of factors to a redundant one (Kulesza and Taskar, 2012). Let $\Theta$ denote a continuous space and the factors $w_i \in \Theta$. Then, denote by $L$ a positive semidefinite kernel function on $\Theta$,
(20)  $$P(W) = \frac{\det(L_W)}{\det(L + I)}$$
where $L_W$ denotes the kernel matrix whose $(i,j)$th entry $L(w_i, w_j)$ is the pairwise correlation between $w_i$ and $w_j$, $\det(\cdot)$ denotes the determinant of a matrix, and $I$ is an identity matrix. Since the space $\Theta$ is constant, $\det(L + I)$ is a constant value. Therefore, the corresponding diversity prior of the parameter matrix $W$ modeled by DPP can be formulated as
(21)  $$P(W) \propto \det(L_W)$$
In general, the kernel can be divided into a correlation part and a prior part. Therefore, the kernel can be reformulated as
(22)  $$L(w_i, w_j) = R(w_i, w_j)\sqrt{\pi(w_i)\,\pi(w_j)}$$
where $\pi(w_i)$ is the prior for the parameter $w_i$ and $R(w_i, w_j)$ denotes the correlation of these factors. These kernels always induce repulsion between different factors, and thus a diverse set of factors tends to have higher probability. Generally, the vectors are supposed to be uniformly distributed variables. Therefore, the prior $\pi(\cdot)$ is a constant value, and then the kernel satisfies
(23)  $$L(w_i, w_j) \propto R(w_i, w_j)$$
Some works have shown that the DPP prior is not arbitrarily strong in some special cases when applied to machine learning models (Lavancier et al, 2015). To make the DPP prior strong enough for all the training data, the DPP prior is augmented by an additional positive parameter $\gamma$. Therefore, the DPP prior can be reformulated as
(24)  $$P(W) \propto \det(L_W)^{\gamma}$$
When we set the cosine similarity as the correlation kernel, the DPP prior has a geometric interpretation: it can be seen as the squared volume of the parallelepiped spanned by the normalized columns of $W$ (Kulesza and Taskar, 2012). Therefore, diverse sets are more probable because their feature vectors are more orthogonal, and hence span larger volumes.
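The volume interpretation can be checked numerically with a small sketch. It assumes the cosine similarity as the correlation kernel and computes the unnormalized DPP score as the determinant of the resulting kernel matrix; the function names and the pure-Python determinant routine are illustrative choices, not part of any cited method.

```python
import math

def cosine_kernel(factors):
    """Kernel matrix of cosine similarities between all pairs of factors."""
    normed = []
    for w in factors:
        norm = math.sqrt(sum(a * a for a in w))
        normed.append([a / norm for a in w])
    return [[sum(a * b for a, b in zip(u, v)) for v in normed] for u in normed]

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # numerically singular: the factors are linearly dependent
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det

def dpp_score(factors):
    """Unnormalized DPP probability: determinant of the cosine kernel matrix."""
    return determinant(cosine_kernel(factors))

print(dpp_score([[1.0, 0.0], [0.0, 5.0]]))  # orthogonal factors: score 1
print(dpp_score([[1.0, 0.0], [1.0, 0.1]]))  # nearly parallel factors: score near 0
```

With this kernel the score depends only on the directions of the factors, so rescaling a factor leaves it unchanged.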
3.3 Probability Distribution
Bayesian methods, such as divergence and cross entropy, can measure the similarity between different distributions. They can also be used to enforce diversity between different factors, and prior works have shown the effectiveness of these methods. In addition, some works combine the Bayesian methods with statistics to encourage different factors to be diversified, such as negative correlation learning (Liu and Yao, 1997; Alhamdoosh and Wang, 2014). In the following, these Bayesian methods are introduced in detail.
3.3.1 Divergence Measurement
Each factor can be treated as a probability distribution. Since a divergence can measure the dissimilarity between different distributions, it can also be used to measure the diversity between different factors. The divergence between factors $w_i$ and $w_j$ can be calculated as
(25)  $$D(w_i \,\|\, w_j) = \sum_{k} w_{ik}\log\frac{w_{ik}}{w_{jk}}$$
subject to $\|w_i\|_1 = 1$, where $w_{ik}$ is the $k$th (nonnegative) entry of $w_i$. The divergence can measure the dissimilarity between the learned factors, such that the diversity-promoting prior by divergence can be formulated as
(26)  $$P(W) \propto \exp\Big(\gamma\sum_{i\neq j} D(w_i \,\|\, w_j)\Big)$$
The measurement takes advantage of the characteristics of the divergence to measure the dissimilarity between different distributions. It should be noted that the norm of the learned factors needs to be constrained to 1. This limits the settings in which the divergence measurement can be used.
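As an illustration of the divergence measurement, the sketch below uses the Kullback-Leibler divergence between factors normalized to sum to one. The choice of KL, the normalization helper, and the function names are assumptions of this sketch; the text above only requires that each factor be treated as a distribution with unit norm.

```python
import math

def normalize(w):
    """Map a non-zero vector to a distribution using absolute values (unit L1 norm)."""
    total = sum(abs(x) for x in w)
    return [abs(x) / total for x in w]

def kl_divergence(p, q):
    """KL(p || q) for distributions with q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def divergence_diversity(factors):
    """Sum of KL divergences over all ordered pairs of normalized factors."""
    dists = [normalize(w) for w in factors]
    return sum(kl_divergence(p, q)
               for i, p in enumerate(dists)
               for j, q in enumerate(dists) if i != j)

similar = [[5.0, 5.0], [6.0, 4.0]]
different = [[9.0, 1.0], [1.0, 9.0]]
print(divergence_diversity(similar))    # small: the two distributions nearly coincide
print(divergence_diversity(different))  # large: the mass sits on different entries
```

Since each factor is normalized before the divergence is computed, rescaling a factor does not change the result. The restriction to nonnegative, normalized entries is what limits where this measurement applies.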
3.3.2 Negative Correlation Learning (NCL) Measurement
Negative correlation learning (NCL) has been proposed to reduce the covariance among different models without increasing the variance and bias terms (Rosen, 1996). The measurement is usually used for diversifying multiple models. Denote $f_k$ as the inference results from the $k$th model, and let $W_k$ represent the parameters in the $k$th model. Rosen (1996) uses a penalty to decorrelate the current learning model from all previously learned models
(27)  $$p_k = \sum_{j=1}^{k-1}(f_j - y)(f_k - y)$$
Define $\bar{f} = \frac{1}{K}\sum_{k=1}^{K} f_k$, where $K$ is the number of models. Then, the penalty term can also be defined to reduce the correlation mutually among all the learned models by using the actual distribution $\bar{f}$ obtained from the models instead of the target function $y$ (Liu and Yao, 1997; Alhamdoosh and Wang, 2014).
(28)  $$p_k = (f_k - \bar{f})\sum_{j\neq k}(f_j - \bar{f})$$
Then, the diversity-promoting prior by NCL can be written as
(29)  $$P(W) \propto \exp\Big(-\gamma\sum_{k=1}^{K} p_k\Big)$$
This measurement uses the covariance of the inference results obtained from the multiple models to reduce the correlation mutually among the learned models. Therefore, the learned multiple models can be diversified.
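The mutual form of the NCL penalty (Liu and Yao, 1997) can be sketched for scalar outputs of several models on one sample. The function name and the scalar setting are assumptions of this illustration. Because the deviations from the ensemble mean sum to zero, each penalty equals minus the squared deviation of that model from the mean.

```python
def ncl_penalties(outputs):
    """Per-model NCL penalty p_k = (f_k - mean) * sum_{j != k} (f_j - mean)."""
    mean = sum(outputs) / len(outputs)
    deviations = [f - mean for f in outputs]
    total = sum(deviations)  # zero up to floating-point error
    return [d * (total - d) for d in deviations]

identical = [0.5, 0.5, 0.5]
spread = [0.2, 0.5, 0.8]
print(ncl_penalties(identical))  # all zero: no model deviates from the ensemble mean
print(ncl_penalties(spread))     # negative for the models that deviate from the mean
```

Adding these penalties to the training loss with a positive weight therefore rewards models whose outputs differ from the ensemble mean, which is how the ensemble is pushed towards negatively correlated errors.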
3.3.3 Cross Entropy Measurement
Cross entropy is another measurement that can be used in D-models. As in the former subsection, $f_i$ denotes the inference results from the $i$th model. Therefore, the cross entropy between different models can be calculated as
(30)  $$H(f_i, f_j) = -\sum_{c} f_i^{(c)}\log f_j^{(c)}$$
where $f_i^{(c)}$ is the probability that the $i$th model assigns to class $c$. The diversity-promoting regularization can be formulated as
(31)  $$\mathcal{H}(W) = \sum_{i\neq j} H(f_i, f_j)$$
Then, the diversity-promoting prior by cross entropy can also be written as
(32)  $$P(W) \propto \exp\big(\gamma\,\mathcal{H}(W)\big)$$
The larger the cross entropy is, the more different the distributions are. Therefore, under the cross entropy measurement, different models can be diversified and provide more complementary information.
Both the cross entropy measurement and the negative correlation learning measurement encourage diversity among multiple models by making the inference results obtained from the models repel each other. In particular, the cross entropy measurement uses the cross entropy between pairwise distributions to encourage two distributions to be dissimilar, so that different base models can provide more complementary information.
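The pairwise cross entropy between the predictive distributions of several models can be sketched as follows. The function names and the small constant `eps`, which guards against taking the logarithm of zero, are assumptions of this illustration.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """Cross entropy H(p, q) = -sum_c p_c * log(q_c) between two distributions."""
    return -sum(a * math.log(max(b, eps)) for a, b in zip(p, q))

def pairwise_cross_entropy(predictions):
    """Sum of cross entropies over all ordered pairs of model predictions."""
    return sum(cross_entropy(p, q)
               for i, p in enumerate(predictions)
               for j, q in enumerate(predictions) if i != j)

agreeing = [[0.9, 0.1], [0.9, 0.1]]      # two models with the same prediction
disagreeing = [[0.9, 0.1], [0.1, 0.9]]   # two models with opposite predictions
print(pairwise_cross_entropy(agreeing))
print(pairwise_cross_entropy(disagreeing))  # larger: the predictions differ
```

Maximizing this quantity alone would push the models apart without regard to accuracy, so in practice such a term is combined with the usual task loss of each model.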
3.4 Groupwise Correlation
3.4.1 $L_{2,1}$ Measurement
It is well known that the $L_{2,1}$ norm leads to a groupwise sparse representation of $W$. $L_{2,1}$ can be used to measure the correlation between different parameter factors and diversify the learned factors to improve the representational ability of the model. Then, the prior can be calculated as
(33)  $$P(W) \propto \exp\Big(-\gamma\sum_{i=1}^{K}\Big(\sum_{j}|w_{ij}|\Big)^{2}\Big)$$
where $w_{ij}$ means the $j$th entry of $w_i$. The internal $L_1$ norm encourages different factors to be sparse, while the external $L_2$ norm is used to control the complexity of the entire model.
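The nested-norm structure described above, an inner L1 norm within each factor and an outer L2 norm across factors, can be sketched as follows. The squared outer norm and the function name are assumptions of this illustration.

```python
def groupwise_penalty(factors):
    """Sum over factors of the squared L1 norm of each factor."""
    return sum(sum(abs(x) for x in w) ** 2 for w in factors)

dense = [[1.0, -1.0], [1.0, 1.0]]
sparse = [[2.0, 0.0], [0.0, 2.0]]   # same squared Euclidean norms as `dense`
print(groupwise_penalty(dense))   # 8.0
print(groupwise_penalty(sparse))  # 8.0
print(groupwise_penalty([[1.0, 0.0], [0.0, 1.0]]))  # 2.0
```

For a fixed Euclidean norm, a factor that concentrates its weight on few entries has a smaller L1 norm than one that spreads the weight out, which is why the inner norm favors sparse factors.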
3.5 Analysis
These diversity measurements can calculate the similarity between different vectors and thus encourage the diversity of the machine learning model. However, there are differences between these measurements. The details of these diversity measurements can be seen in Table 1. It can be noted from the table that all these methods take advantage of the pairwise correlation except $L_{2,1}$, which uses the groupwise correlation between different factors. Moreover, the determinantal point process, submodular spectral diversity, uncorrelation and evenness, and negative correlation learning can also take advantage of the correlation among three or more factors.
Another property of these diversity measurements is scale invariance. Scale invariance makes the diversity of the model invariant w.r.t. the norm of these factors. The cosine similarity measurement calculates the diversity via the angle between different vectors and is therefore scale invariant. As a special case, the cosine similarity can be used as the correlation term in DPP, and thus the DPP measurement is scale invariant. For the divergence measurement, since the factors are constrained to unit norm, the measurement is scale invariant. In addition, cross-entropy and negative correlation learning take advantage of the distribution of the model, and the distribution obtained from the model is invariant to the scale of these factors.
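Table 1. Properties of the diversity measurements (✓: has the property; -: not marked).
Measurements | Pairwise Correlation | Multiple Correlation | Groupwise Correlation | Scale Invariant
Cosine Similarity | ✓ | - | - | ✓
Determinantal Point Process | ✓ | ✓ | - | ✓
Submodular Spectral Diversity | ✓ | ✓ | - | -
Euclidean Distance | ✓ | - | - | -
Heat Kernel | ✓ | - | - | -
Divergence | ✓ | - | - | ✓
Uncorrelation and Evenness | ✓ | ✓ | - | -
Inner Product | ✓ | - | - | -
Cross-Entropy | ✓ | - | - | ✓
Negative Correlation Learning | ✓ | ✓ | - | ✓
$L_{2,1}$ | - | - | ✓ | -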
These measurements can encourage diversity among different vectors. Generally, a machine learning model can be viewed as a set of latent parameter factors, which can be represented as vectors. These factors can be learned and used to represent the objects. In the following, we mainly summarize the use of different diversity measurements in the machine learning process to improve the model’s performance.
4 Data Diversification
4.1 Diversification in Data Pre-Processing
A machine learning model is usually trained with mini-batches so that the model is estimated accurately. Most former works generate the mini-batches randomly. However, due to the imbalance of the training samples, redundancy may occur in the generated mini-batches, which has negative effects on the machine learning process. Different from classical stochastic gradient descent (SGD), which relies on uniformly sampling data points to form a mini-batch, Zhang et al (2017) propose a non-uniform sampling scheme based on the DPP measurement.
As Section 3.2.1 shows, DPPs provide a probability measure over every configuration of subsets of data points. Through a similarity matrix over the data and a determinant operator, a DPP assigns higher probabilities to those subsets with dissimilar items. Therefore, it gives low probabilities to mini-batches that contain redundant data, and higher probabilities to mini-batches with more diverse data. This simultaneously balances the data and leads to stochastic gradients with lower variance.
Through the DPP measurement, each mini-batch contains more diverse and balanced training samples, which trains the model more effectively, and thus the learned model can extract more discriminative features from the objects.
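The effect of determinant-based mini-batch selection can be illustrated on a toy data set by enumerating all subsets of a fixed size. This is a sketch under stated assumptions, not the sampling algorithm of Zhang et al (2017): exhaustive enumeration is only feasible for very small data sets, and the Gaussian similarity kernel, its bandwidth `sigma`, and the function names are hypothetical choices.

```python
import itertools
import math
import random

def rbf_kernel(points, sigma=1.0):
    """Similarity matrix with entries exp(-||x_i - x_j||^2 / sigma)."""
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [[math.exp(-squared_distance(u, v) / sigma) for v in points]
            for u in points]

def det(m):
    """Determinant by cofactor expansion (adequate for small mini-batches)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** c * m[0][c] * det([row[:c] + row[c + 1:] for row in m[1:]])
               for c in range(len(m)))

def minibatch_probabilities(points, batch_size, sigma=1.0):
    """Probability of each size-k subset, proportional to the determinant of
    its kernel submatrix. Assumes at least one subset has a positive score."""
    kernel = rbf_kernel(points, sigma)
    subsets = list(itertools.combinations(range(len(points)), batch_size))
    # Clip tiny negative values that can arise from rounding.
    scores = [max(det([[kernel[i][j] for j in s] for i in s]), 0.0) for s in subsets]
    total = sum(scores)
    return {s: score / total for s, score in zip(subsets, scores)}

def sample_minibatch(points, batch_size, rng, sigma=1.0):
    """Draw one mini-batch (a tuple of indices) according to the probabilities above."""
    probs = minibatch_probabilities(points, batch_size, sigma)
    subsets = list(probs)
    return rng.choices(subsets, weights=[probs[s] for s in subsets], k=1)[0]

points = [[0.0], [0.05], [3.0]]   # the first two points are near-duplicates
probs = minibatch_probabilities(points, batch_size=2)
for subset, p in probs.items():
    print(subset, round(p, 4))
print(sample_minibatch(points, 2, random.Random(0)))
```

The mini-batch that pairs the two near-duplicate points receives a probability close to zero, whereas uniform sampling would select it one time in three.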
4.2 Diversification in Active Learning
As Section 2.2 shows, active learning can obtain good performance with fewer labelled training samples. However, some samples selected by CTED are similar to each other. Highly similar samples introduce redundancy into the training samples and thus decrease the training efficiency, so that more training samples are required for comparable performance.
To select more informative and complement samples with active learning method, some prior works introduce diversity in the selected samples obtained from CTED. Shi and Shen (2016) enhances CTED with a diversity regularizer
(34)  
where the similarity matrix models the pairwise similarities among all the samples, such that a larger entry means higher similarity between the corresponding pair of samples. Shi and Shen (2016) use the cosine similarity measurement to formulate the diversity term. Similarly, You and Tao (2014) formulate the diversity term in active learning with the angle derived from the cosine similarity to obtain a diverse set of training samples (see Section 3.1.1 for details).
By adding diversity regularization over the samples selected by active learning, more informative samples can be chosen for training. The machine learning process can therefore obtain comparable or better performance with a limited number of training samples than with a larger number of training samples.
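The effect of such a diversity regularizer can be sketched with a simple greedy selection rule. The code below is a stand-in for illustration, not the CTED optimization of Shi and Shen (2016): the per-sample informativeness scores, the greedy procedure, the `trade_off` parameter and the function names are all assumptions, and feature vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_diverse_queries(features, informativeness, budget, trade_off=1.0):
    """Pick samples that are informative yet dissimilar to those already chosen."""
    chosen = []
    candidates = set(range(len(features)))
    while candidates and len(chosen) < budget:
        def gain(i):
            # Redundancy: total cosine similarity to the samples selected so far.
            redundancy = sum(cosine_similarity(features[i], features[j])
                             for j in chosen)
            return informativeness[i] - trade_off * redundancy
        best = max(sorted(candidates), key=gain)
        chosen.append(best)
        candidates.remove(best)
    return chosen
```

With `trade_off=0` the rule simply takes the highest-scoring samples, which may be near-duplicates; a positive `trade_off` swaps a redundant sample for a less similar one.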
5 Model Diversification
In addition to diversifying the training samples by active learning to improve the performance with fewer training samples, we can also diversify the model to improve its representational ability directly. As the introduction shows, machine learning methods aim to learn the parameters automatically from the training samples. However, because the training samples are limited and imbalanced, the learned parameters can be highly similar, which makes the learned model redundant and decreases its representational ability.
One way to solve this problem is to diversify the learned parameters and improve the representational ability of the model (D-model). Each parameter factor can then model unique information, and the factors as a whole model a larger proportion of the information. Another way is to obtain multiple diversified models (D-models). Traditionally, if we train multiple models separately, the representations obtained from the different models tend to be similar, which leads to redundancy between the representations. By regularizing the multiple base models with a diversification prior, the models are forced to repulse each other, and each base model can provide more complementary information.
5.1 D-Model
The first method diversifies the parameters within the model to directly improve its representational ability. Traditionally, the Bayesian method and the posterior regularization method can be used to impose the diversity property on the model. Prior works have proposed different diversity-promoting priors to measure the diversity between the learned parameter factors according to the special requirements of different tasks. This subsection introduces the methods used to enforce the diversity of the model and summarizes how they appear in prior works.
5.1.1 Bayesian Method
Traditionally, diversity-promoting priors can be used to measure the diversification of the model. The parameters of the model can be estimated by the Bayesian method as
(35) 
where the posterior of the factors in the machine learning model is determined by the likelihood of the training set under the constructed model and by the prior knowledge about the learned model. For the machine learning task at hand, this prior is the diversity-promoting prior. Then, the machine learning task can be written as
(36) 
The log-likelihood form of the optimization can be formulated as
(37) 
5.1.2 Posterior Regularization Method
Generally, regularization can add side information to parameter estimation and thus encourage the learned factors to possess a specific property. We can also use posterior regularization to force the learned model to be diversified. The diversity-regularized optimization problem can be formulated as
(39) 
where the objective combines the optimization term of the model (see subsection 2.1) with a diversity function that measures the diversification between the learned factors, and a trade-off parameter balances the optimization term against the diversification term.
From Eqs. 38 and 39, we can see that posterior regularization has a form similar to that of the Bayesian method. In general, the optimization (38) can be transformed into the form (39). Therefore, in the following, we summarize the diversity-promoting methods mainly in the posterior regularization form.
Measurements  Papers

Cosine Similarity  Gong et al (2018c); Zhong et al (2017); Xiong et al (2015); Xie et al (2015b); Zhu et al (2015); Li et al (2016); Xie et al (2015c, 2017a, 2016); Xiong et al (2014); Zhao et al (2016); Rao et al (2015)
Determinantal Point Process  Kwok and Adams (2012); Xie et al (2017b); Qiao et al (2015, 2017); Cohen (2015); Kulesza and Taskar (2012); Bardenet and AUEB (2015); Shotton et al (2013a); Gu and Han (2013); Gillenwater et al (2014); Kang (2013); Affandi et al (2012); Kulesza and Taskar (2010); Mariet and Sra (2015); Wachinger and Golland (2015); Zhang and Ou (2016)
Submodular Spectral Diversity  Das et al (2012)
Inner Product  Li et al (2016); Liu et al (2016)
Euclidean Distance  Cai et al (2011); Zhang and Huan (2012); Graff and Ellen (2016)
Heat Kernel  Sun et al (2017); Peng et al (2017); Belkin and Niyogi (2002)
Divergence  Cai et al (2011)
Uncorrelation and Evenness  Xie et al (2017c)
$L_{2,1}$  Jiang et al (2014); Hu et al (2015); Wang et al (2015); Lang et al (2012); Zhai et al (2014); Sun et al (2016); Peng et al (2016); Parizi et al (2014); Li et al (2015)
5.1.3 Diversity Regularization
Distance-based measurements. The simplest way to formulate the diversity between different factors is the Euclidean distance. As subsection 3.1.3 introduces, increasing the distances between different factors decreases the similarity between them. Cai et al (2011), Zhang and Huan (2012), and Graff and Ellen (2016) have applied the Euclidean distance as the measurement to encourage the latent factors in machine learning to be diversified. The diversity term can be formulated as
(40) 
where the sum is taken over the factors that we intend to diversify in the machine learning model. Another commonly used distance-based method to encourage diversity in machine learning is the heat kernel (Sun et al, 2017; Peng et al, 2017; Belkin and Niyogi, 2002). The diversity term based on the heat kernel (subsection 3.1.4) can be formulated as
(41) 
where the scale parameter of the kernel is a positive value. The heat kernel has the form of a Gaussian function, so the diversity penalization depends on the distance between the factors. All of these distance-based methods encourage the diversity of the model by pushing the factors away from each other, so that the factors differ more. However, it should be noted that the Euclidean distance measurement can be significantly affected by scaling.
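The two distance-based terms can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the pairwise sums, the sign conventions (a distance to be increased, a kernel similarity to be decreased), the parameter name `sigma` and the function names are chosen here and may differ from the exact forms of Eqs. 40 and 41.

```python
import math
from itertools import combinations


def squared_distance(u, v):
    """Squared distance between two factors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def distance_diversity(factors):
    """Sum of pairwise squared distances (larger means more diverse)."""
    return sum(squared_distance(u, v) for u, v in combinations(factors, 2))


def heat_kernel_penalty(factors, sigma=1.0):
    """Sum of pairwise heat-kernel similarities (smaller means more diverse)."""
    return sum(math.exp(-squared_distance(u, v) / sigma)
               for u, v in combinations(factors, 2))
```

Doubling every factor multiplies `distance_diversity` by four although the directions of the factors are unchanged, which is the scale sensitivity noted above.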
Angular-based measurements. To make the diversity measurement invariant to scale, some works use the angle between factors to encourage the diversity of the model. Among these works, the cosine similarity measurement is the most commonly used (Gong et al, 2018c; Zhong et al, 2017). As subsection 3.1.1 shows, the cosine similarity measures the similarity between different vectors. In machine learning tasks, it can be used to measure the redundancy between different latent factors (Gong et al, 2018c; Zhong et al, 2017; Xiong et al, 2015; Xie et al, 2015b; Zhu et al, 2015; Li et al, 2016). The aim of the cosine similarity prior is to encourage different latent factors to be uncorrelated, such that each factor can model unique features of the samples. The diversity term can be formulated as
(42) 
However, this diversity term is not invariant to orientation. To overcome this problem, many works use the angle derived from the cosine similarity to measure the diversity between different factors. Since the angle between different factors is invariant to translation, rotation, orientation and scale, Xie et al (2015a, b, c) develop an angular-based diversifying method for the Restricted Boltzmann Machine. The details of the method can be found in subsection 3.1.1. When we impose the diversifying prior on a traditional machine learning method, the diversity term can be formulated as
(43) 
where the quantities in the term are defined as in subsection 3.1.1. As a special form, some works also use the inner product to measure the correlation between different factors (Li et al, 2016; Liu et al, 2016). Then, the diversity term can be formulated as (see subsection 3.1.2 for details)
(44) 
where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. It should be noted that this term is not invariant to scale or orientation, but it is easy to implement.
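The differences between the three angular-based terms can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the angle-based term is assumed to be the mean minus the variance of the pairwise non-obtuse angles (in the spirit of Xie et al), and the inner-product term is assumed to be a sum of squared pairwise inner products rather than the exact trace form of Eq. 44.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return max(-1.0, min(1.0, dot / norm))  # clamp against rounding error


def cosine_penalty(factors):
    """Sum of pairwise cosine similarities; changes when a factor flips sign."""
    return sum(cosine(u, v) for u, v in combinations(factors, 2))


def angle_diversity(factors):
    """Mean minus variance of the pairwise non-obtuse angles (assumed form)."""
    angles = [math.acos(abs(cosine(u, v))) for u, v in combinations(factors, 2)]
    mean = sum(angles) / len(angles)
    variance = sum((a - mean) ** 2 for a in angles) / len(angles)
    return mean - variance


def inner_product_penalty(factors):
    """Sum of squared pairwise inner products; sensitive to scale."""
    return sum(sum(a * b for a, b in zip(u, v)) ** 2
               for u, v in combinations(factors, 2))
```

Flipping the sign of one factor changes `cosine_penalty` but leaves `angle_diversity` unchanged, and rescaling the factors changes only `inner_product_penalty`.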
Eigenvalue-based measurements. Consider the kernel matrix of the latent factors. Many prior works introduce diversity into the machine learning process based on this kernel matrix. The first method is submodular spectral diversity (see subsection 3.1.5 for details), which is based on the eigenvalues of the kernel matrix. Das et al (2012) introduce submodular spectral diversity into feature selection, with the aim of selecting a diverse set of features. Feature selection is a key component in many machine learning settings. The process involves choosing a small subset of features with which to build a model that approximates the target concept well. The diversity term can be formulated as
(45) 
In addition, Xie et al (2017c) develop an uncorrelation and evenness measurement based on the kernel matrix (see subsection 3.1.6 for details). The basic idea is to normalize the eigenvalues into a probability simplex and to encourage the discrete distribution parameterized by the normalized eigenvalues to have a small Kullback-Leibler (KL) divergence from the uniform distribution. The diversity-promoting uniform eigenvalue regularizer (UER) is formulated as
(46) 
where the regularizer also depends on the dimension of each factor.
Divergence measurement. The Bayesian method can also be applied to model diversification. Traditionally, divergence is used to measure the dissimilarity between different distributions. Some works (Cai et al, 2011) use the divergence to formulate the similarity between different factors. The diversity term in machine learning can be formulated as
(47) 
As subsection 3.3.1 shows, the norm of the learned factors needs to satisfy a constraint, which limits the application of this diversity measurement.
$L_{2,1}$ measurement. The $L_{2,1}$ norm is another popular diversity measurement (Jiang et al, 2014; Hu et al, 2015; Wang et al, 2015). It can also be used for model diversification, since it yields a group-wise sparse representation of the latent factors. The diversity term based on the $L_{2,1}$ norm can be formulated as
(48) 
where the term depends on the dimension of each factor. The internal norm encourages the different factors to be sparse, while the external norm is used to control the complexity of the entire model.
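One reading of this description can be written down directly. In the sketch below, the internal norm is assumed to be the sum of absolute entries of each factor and the external norm is assumed to be a sum of squares across factors; the function name is hypothetical and the exact form of Eq. 48 may differ.

```python
def group_sparsity_penalty(factors):
    """Inner sum of absolute entries per factor, outer sum of squares across factors."""
    return sum(sum(abs(entry) for entry in factor) ** 2 for factor in factors)
```

For two factors with the same Euclidean length, the sparse one receives the smaller penalty: `group_sparsity_penalty([(1.0, 0.0)])` is 1.0, while the dense factor `(0.6, 0.8)` gives about 1.96.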
DPP measurement. The diversity measurements above mainly focus on pairwise diversification. In contrast, the DPP measurement takes the correlation among multiple latent factors into consideration. As subsection 3.2.1 shows, it encourages the learned factors to repulse each other. Therefore, the DPP-based diversifying prior yields machine learning models with a diverse rather than a redundant set of learned factors. Some works have shown that, in some special cases, the DPP prior cannot be made arbitrarily strong when applied to machine learning models (Lavancier et al, 2015). To make the DPP prior strong enough for all the training data, it is augmented with an additional positive parameter. Therefore, the DPP prior can be reformulated as
(49) 
The learned factors are usually normalized, and thus the optimization for machine learning can be written as
(50) 
where the DPP-based term serves as the diversity term for machine learning. It should be noted that different kernels can be selected according to the special requirements of different machine learning tasks (Affandi et al, 2014).
In conclusion, numerous approaches have been proposed to diversify the learned factors in a machine learning model. A summary of the most frequently encountered diversity methods is shown in Table 2. Although most papers use slightly different specifications for diversifying the learned model, the fundamental representation of diversification is usually similar. It should also be noted that the studied diversity methods have one thing in common: diversity enforced in a pairwise form between members strikes a good balance between complexity and effectiveness (Guo et al, 2017). In addition, the diversity measurement should be chosen according to the specific requirements of each application.
5.2 D-Models
The former subsection introduces ways to diversify the parameters within a single model and thereby improve its representational ability directly. Much effort has been devoted to obtaining the highest-probability configuration of a machine learning model, i.e. the maximum a posteriori (MAP) solution. However, even when the training samples are sufficient, the MAP solution could be suboptimal. In many situations, one could benefit from additional representations provided by multiple models. However, the traditional way of training multiple models tends to produce similar representations, whereas the representations obtained from different models are desired to provide complementary information. Recently, many diversifying methods have been proposed to overcome this problem. Through diversification, each base model can provide more complementary information, so that a more discriminative representation can be obtained from the multiple diversified representations.
Let each base model have its own parameters and its own inference. Then, the optimization of the machine learning process that obtains multiple models can be written as
(51) 
where each base model has its own optimization term and its own training samples. Traditionally, the training samples are randomly divided into multiple subsets, and each subset trains a corresponding model. However, selecting the subsets randomly may lead to redundancy between the different representations. Therefore, the first way to obtain multiple diversified models is to diversify the training samples across the different base models, which we call sample-based methods.
Another way to encourage diversification between different models is to measure the similarity between the base models with a specific similarity measurement and to encourage the base models to be diversified during the training process. The optimization of these methods can be written as
(52) 
where the additional term measures the diversification between different base models.
Finally, some other methods obtain a large number of models and select the top-ranked ones as the final ensemble. In the following, we summarize the different methods for diversifying multiple models from these three aspects in detail.
Methods  Measurements  Papers

Optimization-based  Divergence  Kuncheva et al (2003); Zhu and Xing (2009)
Optimization-based  Rényi entropy  Xing and Wang (2017)
Optimization-based  Cross Entropy  Gong et al (2018b); Lee et al (2015)
Optimization-based  Cosine Similarity  Yu et al (2011)
Optimization-based  Exclusivity  Guo et al (2017)
Optimization-based  $L_{2,1}$  Wang et al (2015)
Optimization-based  NCL  Rosen (1996); Liu and Yao (1997); Alhamdoosh and Wang (2014)
Optimization-based  Others  Yin et al (2014b); Kuncheva et al (2003); Ho (1998); Tang et al (2006); Yu et al (2011); Giacinto and Roli (2001); Dietterich (2000)
Sample-based  -  Guzman-Rivera et al (2014b); Viola and Jones (2004); Zhang and Zhou (2013); Carreira-Perpiñán and Raziperchikolaei (2016); Lee et al (2016); Shotton et al (2013a); Gu and Han (2013)
Ranking-based  -  Ahmed et al (2015)
5.2.1 Optimization-Based Method
Optimization-based methods are among the most commonly used methods to diversify multiple models. They obtain multiple diversified models by optimizing a given objective function that includes a diversity measurement. The optimization for these methods can be written as Eq. 52. The main problem of these methods is therefore to define diversity measurements that can quantify the difference between different models.
Many prior works (Tang et al, 2006; Yin et al, 2014b; Yu et al, 2011) have summarized pairwise diversity measurements, such as the Q-statistics measure (Kuncheva et al, 2003), the correlation coefficient measure (Kuncheva et al, 2003), the disagreement measure (Ho, 1998; Yin et al, 2014b), the double-fault measure (Giacinto and Roli, 2001; Yin et al, 2014b), the $\kappa$-statistic measure (Dietterich, 2000), the Kohavi-Wolpert variance (Tang et al, 2006), the inter-rater agreement (Tang et al, 2006), the generalized diversity (Tang et al, 2006) and the measure of "difficulty" (Tang et al, 2006). Recently, further measurements have been developed, including not only pairwise diversity measurements (Zhu and Xing, 2009; Kuncheva et al, 2003; Yu et al, 2011) but also measurements that capture the correlation among multiple models, among others (Wang et al, 2015; Lee et al, 2012; Alhamdoosh and Wang, 2014; Liu and Yao, 1997). This subsection summarizes these methods systematically.
Bayesian-based measurements. As with the D-model, Bayesian methods can also be applied to D-models. Among these Bayesian methods, divergence is a popular one (subsection 3.3.1). The diversity-promoting term is formulated by calculating the divergence between the parameters of the different models (Zhu and Xing, 2009; Kuncheva et al, 2003). The diversity-promoting term based on divergence can be formulated as
(53) 
where the divergence is computed over the individual entries of the model parameters. In addition to the divergence measurements, the Rényi entropy, which measures the kernelized distances between the images of the samples and the center of the ensemble in the high-dimensional feature space, can also be used (Xing and Wang, 2017). The diversity-promoting term based on the Rényi entropy can be formulated as
(54) 
where the scale parameter is a positive value and the kernel is the Gaussian kernel function, which can be calculated as
(55) 
where the kernel also depends on the dimension of its input. Another measurement based on the Bayesian method is the cross entropy measurement (Gong et al, 2018b; Lee et al, 2015). Based on subsection 3.3.3, the diversity-promoting term can be formulated as
(56) 
where the term is computed from the inference of each model, namely the probability that a sample belongs to each class. Moreover, Lee et al (2012) propose a hierarchical fair competition-based parallel genetic algorithm (HFC-PGA) to increase the diversity among the component neural networks. Then, the diversity term by HFC-PGA can be formulated as
(57) 
Another method, negative correlation learning (NCL) (Rosen, 1996; Liu and Yao, 1997; Alhamdoosh and Wang, 2014), tries to reduce the covariance among all the models without increasing the variance and bias terms. NCL trains the base models simultaneously in a cooperative manner that decorrelates the individual errors. The penalty term can be designed in different ways depending on whether the models are trained sequentially or in parallel. Rosen (1996) defines the penalty by decorrelating the current learning model with all previously learned models
(58) 
Define the ensemble inference as the average of the inferences of the base models. Then the penalty term can also be defined to reduce the correlation mutually among all the models by using this actual inference instead of the target function $y$ (Liu and Yao, 1997; Alhamdoosh and Wang, 2014).
(59) 
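The mutual form of the NCL penalty can be illustrated for a single training sample. The sketch below assumes the commonly stated form in which each model's deviation from the ensemble average is multiplied by the summed deviations of the other models, added to a squared error with a weight `strength`; the function names and this weight are hypothetical, and the exact notation of Eq. 59 may differ.

```python
def ensemble_average(predictions):
    """Simple average of the base models' outputs for one sample."""
    return sum(predictions) / len(predictions)


def ncl_penalties(predictions):
    """Per-model penalty: own deviation times the summed deviations of the others."""
    mean = ensemble_average(predictions)
    deviations = [p - mean for p in predictions]
    total = sum(deviations)
    return [d * (total - d) for d in deviations]


def ncl_losses(predictions, target, strength=0.5):
    """Per-model loss: half squared error plus the weighted correlation penalty."""
    penalties = ncl_penalties(predictions)
    return [0.5 * (p - target) ** 2 + strength * q
            for p, q in zip(predictions, penalties)]
```

Because the deviations from a simple average sum to zero, each penalty equals the negative squared deviation of that model from the ensemble average, so minimizing it pushes the models apart around their mean.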
Yin et al (2014a) also combine NCL with sparsity. The sparsity is pursued purely by norm regularization, without considering the complementary characteristics of the available base models. These Bayesian methods either take advantage of the probability distribution obtained from each base model or transform the parameters of each base model into a probability distribution in order to measure the diversity between different models.
Cosine similarity measurement. Unlike the Bayesian methods, which promote diversity from a distributional view, Yu et al (2011) introduce the cosine similarity measurement to quantify the difference between models from a geometric view. As Section 3.1.1 shows, the diversity-promoting term can be written as
(60) 
In addition, a special form of the inner product measurement, termed exclusivity, has been proposed by Guo et al (2017) to obtain diversified models (see Section 3.1.2 for details); it can be regarded as a special form of angular-based measurement. It jointly suppresses the training error of the ensemble and enhances the diversity between the bases. The diversity-promoting term by exclusivity can be written as
(61) 
These measurements encourage each pair of models to be uncorrelated, such that each base model can provide more complementary information.
$L_{2,1}$ measurement. As in the former subsection, the $L_{2,1}$ norm can also be used to diversify multiple models (Wang et al, 2015). The diversity-promoting regularization by the $L_{2,1}$ norm can be formulated as
(62) 
The measurement uses the group-wise correlation between different base models and favors selecting diverse models residing in more groups.
Some other diversity measurements have been proposed for deep ensembles. Zhou et al (2002) reveal that it may be better to ensemble many, rather than all, of the neural networks at hand. The paper develops an approach named GASEN to obtain a weight for each neural network. Based on the obtained weights, the deep ensemble can then be formed. Moreover, Keshavarz-Hedayati and Dimopoulos (2017) also encourage the diversity of the deep ensemble by defining a pairwise similarity between different terms.
These optimization-based methods utilize the correlation between different models and try to make the models repulse one another. The aim is to force the representations obtained from the different models to be diversified, so that each base model provides more complementary information.
5.2.2 Sample-Based Method
In addition to obtaining multiple diversified models from the optimization view, we can also diversify the models from the sample view. In general, the training set is randomly divided into multiple subsets, and each base model is trained on a specific subset. However, this may cause redundancy between the features obtained from different models. To overcome this problem and obtain more complementary information from different models, Lee et al (2016) develop a method that divides the training samples into multiple subsets in a non-random way: each sample is assigned to the subset whose corresponding learned model has the lowest prediction error on it. Therefore, each base model focuses on modeling different features. Clustering is another popular method to divide the training samples among different models (Shotton et al, 2013a). Although diversifying the multiple subsets can make the models provide more complementary information, dividing the whole training set leaves fewer training samples for each model, which has a negative effect on the performance.
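The assignment rule described above, in which each sample goes to the model with the lowest error on it, can be sketched with toy models. The code below is only an illustration of that rule and not the procedure of Lee et al (2016): the base models are assumed to be constant predictors over scalar targets, the error is assumed to be the squared error, and the function names and the fixed number of rounds are hypothetical.

```python
def train_constant_model(targets):
    """Toy base model: predict the mean of its own training subset."""
    return sum(targets) / len(targets)


def fit_specialised_models(targets, initial_models, rounds=10):
    """Alternate between assigning samples to their best model and retraining."""
    models = list(initial_models)
    assignment = []
    for _ in range(rounds):
        # Assign each sample to the model with the lowest squared error on it.
        assignment = [min(range(len(models)), key=lambda k: (y - models[k]) ** 2)
                      for y in targets]
        # Retrain each model only on the samples assigned to it.
        for k in range(len(models)):
            subset = [y for y, a in zip(targets, assignment) if a == k]
            if subset:
                models[k] = train_constant_model(subset)
    return models, assignment
```

With targets drawn from two well-separated groups, the two models end up specialising on one group each instead of both fitting the overall mean.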
To overcome this problem, another way to force different models to be diversified is to assign different weights to each sample (Guzman-Rivera et al, 2014b). By training different models with different sample weights, each base model can focus on complementary information from the samples. The detailed steps in (Guzman-Rivera et al, 2014b) are as follows: first, define the weights of the training samples randomly, and train the first model with the given weights; second, revise the weight of each training sample based on the final loss of the obtained model, and train the second model with the updated weights; finally, train the remaining models with the same strategy.
The former methods use the labelled training samples to enforce the diversity of multiple models. Another method, UDEED (Zhang and Zhou, 2013), uses unlabelled samples to provide diversity. Unlike existing semi-supervised ensemble methods, in which error-prone pseudo-labels are estimated for the unlabelled data in order to enlarge the labelled data and improve accuracy, UDEED works by maximizing the accuracies of the base models on the labelled data while maximizing the diversity among them on the unlabelled data.
Moreover, Carreira-Perpiñán and Raziperchikolaei (2016) combine different initializations, different training sets and different feature subsets to encourage the diversity of the multiple models.
The methods in this subsection operate on the training sets to diversify different models. By training different models with different training samples, or with differently weighted samples, the models provide different information, and thus the models as a whole can capture a larger proportion of the information.
5.2.3 Ranking-Based Method
Another kind of method to promote diversity among the obtained models is the ranking-based method. All the models are first ranked according to some criterion, and then the top-ranked ones are selected to form the final ensemble. Here, Ahmed et al (2015) focus on pruning techniques based on forward/backward selection, since they allow a direct comparison with the simple estimation of accuracy from different models.
Clustering can also be used as a ranking-based method to enforce the diversity of the multiple models (Gu and Han, 2013). In (Gu and Han, 2013), the models are first clustered based on the similarity of their predictions, each cluster is then pruned to remove redundant models, and finally the remaining models in each cluster are combined as the base models.
In addition to the methods mentioned above, Viola and Jones (2004) obtain multiple diversified models by selecting different sets of features. Through multi-scale processing or other tricks, each sample provides a large number of features, from which the top features are chosen as the base features (see Viola and Jones (2004) for details). Then, each base feature is used to train a specific model, and the final inference is obtained by combining these models.
In summary, this section summarizes the diversification of multiple models from three aspects: optimization-based methods, sample-based methods, and ranking-based methods. The most frequently encountered diversity methods are detailed in Table 3. Optimization-based methods encourage the multiple models to be diversified by imposing diversity regularization between the base models while optimizing them. In contrast, sample-based methods obtain diversified models by training different models with specific training sets. Most methods for diversifying multiple models focus on these two aspects, while ranking-based methods obtain the multiple diversified models by choosing the top-ranked models.
6 Inference Diversification
The former section summarizes the methods to diversify the parameters within a model or across models. The D-model focuses on diversifying the parameters within the model and improves the representational ability of the model itself, while D-models aim to obtain multiple diversified models by diversifying the parameters across different models. In addition, many methods focus on obtaining multiple diversified choices directly, which we call inference diversification. This section summarizes these methods for inference diversification under graph models.
We consider a set of discrete random variables, each taking a value in a finite label set, and a graph defined over these variables. For any subset of variables, the joint label space is the Cartesian product of the label sets of the variables in that subset. Node and edge functions define the energy at each node and edge for the labelling of the variables in their scope. The goal of MAP inference is to find the labelling of the variables that minimizes this real-valued energy function:
(63) 
Traditional methods to obtain multiple results try to solve the following optimization:
(64)  
However, the obtained second-best choice will typically be a one-pixel shifted version of the best one (Park and Ramanan, 2011). In other words, the next-best choices will almost certainly be located on the upper slope of the peak corresponding to the most confident detection, while other peaks may be ignored entirely. To overcome this problem, many methods, such as diversified multiple choice learning (D-MCL), submodular diversification, M-modes and M-NMS, have been introduced. These methods try to diversify the obtained choices (so that they do not overlap under a user-defined criterion) while obtaining a high score on the optimization term.
Measurements  Papers

Diversity-Promoting Multiple Choice Learning (D-MCL)  Guzman-Rivera et al (2012); Batra et al (2012); Kirillov et al (2015); Yadollahpour et al (2013); Guzman-Rivera et al (2014a); Gimpel et al (2013)
Submodular for Diversification  Prasad et al (2014a); Kirillov et al (2016); Nemhauser et al (1978)
M-modes  Chen et al (2013)
M-NMS  Blaschko (2011); Felzenszwalb et al (2010); Stephens et al (2013)
DPP  Azadi et al (2017)
6.1 Diversity-Promoting Multiple Choice Learning (D-MCL)
D-MCL tries to find a diverse set of highly probable solutions under a discrete probabilistic model. Given a dissimilarity function that measures the difference between pairs of choices, the formulation maximizes a linear combination of the probability and the dissimilarity to previous choices. Even if the MAP solution alone is of poor quality, a diverse set of highly probable hypotheses might still enable accurate predictions. The goal of D-MCL is to produce a diverse set of low-energy solutions.
The first method approaches the problem with a greedy algorithm, in which the next choice is defined as the lowest-energy state with at least some minimum dissimilarity from the previously chosen choices. To do so, we assume access to a dissimilarity function between labellings. To find the diverse, low-energy labellings, the method proceeds by solving a sequence of problems of the form (Batra et al, 2012; Yadollahpour et al, 2013; Guzman-Rivera et al, 2014a; Gimpel et al, 2013)
(65) 
for each subsequent choice, where a trade-off parameter balances diversity against energy, the first choice is the MAP solution, and the dissimilarity function defines the diversity of two labellings. In other words, the function takes a large value if the two labellings are diverse, and a small value otherwise. As a special case, the M-best MAP is obtained when the dissimilarity is a 0-1 dissimilarity.
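The greedy sequential procedure can be sketched on a problem small enough to enumerate. The code below assumes that each step minimizes the energy minus a weighted sum of dissimilarities to the earlier solutions; the chain-structured energy with a Potts pairwise cost, the Hamming dissimilarity, the brute-force search and all names are assumptions made for illustration and are not taken from the cited works.

```python
from itertools import product


def hamming(y, z):
    """Number of positions at which two labellings differ."""
    return sum(a != b for a, b in zip(y, z))


def chain_energy(labelling, unary, pairwise_cost=1.0):
    """Unary costs plus a Potts penalty for each pair of differing neighbours."""
    energy = sum(unary[i][label] for i, label in enumerate(labelling))
    energy += pairwise_cost * sum(a != b
                                  for a, b in zip(labelling, labelling[1:]))
    return energy


def diverse_m_best(unary, num_labels, m, trade_off, pairwise_cost=1.0):
    """Greedily pick m labellings: low energy, far from earlier solutions."""
    space = list(product(range(num_labels), repeat=len(unary)))
    solutions = []
    for _ in range(m):
        def augmented(y):
            reward = sum(hamming(y, prev) for prev in solutions)
            return chain_energy(y, unary, pairwise_cost) - trade_off * reward
        solutions.append(min(space, key=augmented))
    return solutions
```

The first solution is the MAP labelling because no earlier solutions exist. If `trade_off` is too small, the dissimilarity reward cannot outweigh the energy gap and the procedure returns the MAP labelling again, which shows why the trade-off has to be tuned.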
In contrast to the former form, the second method formulates the M-best diverse problem as a single energy minimization problem (Kirillov et al, 2015). Instead of the greedy sequential procedure in (65), this method infers all the labellings jointly, by minimizing
(66) 
where the diversity term defines the total diversity of the set of labellings. To achieve this, let us first create one copy of the initial model for each labelling. Three specific diversity measures are introduced. The split-diversity measure is written as the sum of pairwise diversities, i.e. those penalizing pairs of labellings (Kirillov et al, 2015)
(67) 
The node-diversity measure is defined as (Kirillov et al, 2015)
(68) 
Finally, the special case of the split-diversity and node-diversity measures is the node-split-diversity measure (Kirillov et al, 2015)
(69) 
The D-MCL methods find multiple choices with a dissimilarity function. The obtained choices differ more from one another and thus show more diversity. However, the obtained choices may not be optima, and there may exist other choices that represent the objects better than the obtained ones.
6.2 Submodular for Diversification
The problem of searching for a diverse but highquality subset of items in a ground set of items has been studied in information retrieval, web search, sensor placement, document summarization, viral marketing and robotics. In many of these works, an effective, theoreticallygrounded and practical tool for measuring the diversity of a set are submodular set functions. Submodularity is a property that comes from marginal gains. A set function is submodular when its marginal gains are decreasing: for all and . In addition, if is monotone, i.e. whenever , then a simple greedy algorithm that iteratively picks the element with the largest marginal gain to add to the current set , achieves the best possible approximation bound of (Nemhauser et al, 1978). This result has had significant practical impact. Unfortunately, if the number of items is exponentially large, then even a single linear scan for greedy augmentation is simply infeasible. The diversity is measured by a monotone, nondecreasing and normalized submodular function .
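Denote $S$ as the set of choices. The diversification is measured by a monotone nondecreasing and normalized submodular function $D(S)$. Then, we aim to find a set of $M$ configurations maximizing the combined score (Prasad et al, 2014a)
$$F(S)=R(S)+\lambda D(S) \tag{70}$$
where $R(S)=\sum_{y\in S}r(y)$ sums the relevance scores of the selected choices and $\lambda\geq 0$ weights the diversity term. The optimization can be solved by the greedy algorithm that starts out with $S^{0}=\emptyset$, and iteratively adds the best term (Prasad et al, 2014a):
$$y^{m}=\arg\max_{y}F(y\mid S^{m-1})=\arg\max_{y}\left\{r(y)+\lambda D(y\mid S^{m-1})\right\} \tag{71}$$
where $S^{m}=S^{m-1}\cup\{y^{m}\}$ and $D(y\mid S)=D(S\cup\{y\})-D(S)$ is the marginal gain. The selected set $S^{M}$ is within a factor of $(1-\frac{1}{e})$ of the optimal solution $S^{*}$:
$$F(S^{M})\geq\left(1-\frac{1}{e}\right)F(S^{*}) \tag{72}$$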
The submodular method exploits the maximization of marginal gains to find multiple choices that provide the most complementary information.
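The greedy marginal-gain procedure described above can be sketched as follows. The coverage-style diversity function (the number of distinct groups covered), the relevance scores, the weight `lam` and all names are hypothetical, and the sketch scans an explicit candidate list, which is only feasible when the item set is small.

```python
def greedy_select(candidates, relevance, diversity, lam, M):
    """Greedily pick M items by marginal gain of relevance + lam * diversity."""
    selected = []
    remaining = list(candidates)
    for _ in range(min(M, len(remaining))):
        base = diversity(selected)

        def gain(c):
            # Marginal gain of adding c to the current selection.
            return relevance(c) + lam * (diversity(selected + [c]) - base)

        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy ground set (assumed): item -> (relevance score, groups it covers).
items = {
    "a": (1.0, {1, 2}),
    "b": (0.9, {1, 2}),
    "c": (0.5, {3}),
    "d": (0.4, {2, 3}),
}


def relevance(name):
    return items[name][0]


def coverage(selection):
    # Number of distinct groups covered: normalized, monotone and submodular.
    covered = set()
    for name in selection:
        covered |= items[name][1]
    return len(covered)


chosen = greedy_select(items, relevance, coverage, lam=1.0, M=2)
top_by_relevance = sorted(items, key=relevance, reverse=True)[:2]
```

In this toy example the two most relevant items cover the same groups, so the greedy procedure replaces the second of them with a less relevant item that covers a new group.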
6.3 M-NMS
Another way to obtain multiple diversified choices is non-maximum suppression (M-NMS). M-NMS is typically defined in an algorithmic way: starting from the MAP prediction, one goes through all labellings in increasing order of energy. A labelling becomes part of the predicted set if and only if it is more than $\rho$ away from the ones chosen before, where $\rho$ is a user-defined threshold that decides whether two labellings are similar. M-NMS thus guarantees that the choices are apart from each other. It is typically implemented by a greedy algorithm (Felzenszwalb et al, 2010; Barinova et al, 2012; Desai et al, 2011).
A simple greedy algorithm is used to instantiate multiple choices: search over the exponentially large space of choices for the maximally scoring choice, instantiate it, remove all choices that overlap with it, and repeat. The process continues until the score of the next-best choice falls below a threshold or $M$ choices have been instantiated. However, a naive implementation of such an algorithm would take exponential time.
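The greedy suppression procedure described above can be sketched as below. The sketch assumes an explicit, small list of candidate labellings (so it sidesteps the exponential search space), the Hamming distance as the distance between labellings, and a threshold named `rho`; the energies and all names are hypothetical.

```python
def hamming(a, b):
    """Number of positions on which two labellings differ."""
    return sum(x != y for x, y in zip(a, b))


def m_nms(labellings, energy, distance, rho, M):
    """Visit labellings by increasing energy and keep those far from all kept ones."""
    kept = []
    for y in sorted(labellings, key=energy):
        if len(kept) == M:
            break
        # Keep y only if it is more than rho away from every earlier choice.
        if all(distance(y, k) > rho for k in kept):
            kept.append(y)
    return kept


# Toy energies for five labellings of three binary nodes (assumed values).
energies = {
    (0, 0, 0): 0.0,
    (0, 0, 1): 0.1,
    (0, 1, 1): 0.3,
    (1, 1, 1): 0.5,
    (1, 1, 0): 0.6,
}

choices = m_nms(list(energies), energies.get, hamming, rho=1, M=3)
```

Here the second-lowest-energy labelling is suppressed because it differs from the MAP labelling in a single node, and the procedure moves on to labellings that are further away.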
The M-NMS method finds the M-best choices by discarding similar choices from the candidate set. In summary, D-MCL, the submodular method, and M-NMS share a similar idea: all of them try to find the M-best choices under a dissimilarity function, or the ones that provide the most complementary information.
6.4 M-modes
Even though the former three methods guarantee that the obtained choices are apart from each other, the choices are typically not local extrema of the probability distribution. To guarantee both local optimality and diversification of the obtained choices, the problem can be transformed into the M-modes problem. M-modes have many possible applications because they are intrinsically diverse. This contrasts with M-best predictions, whose main shortcoming is a lack of diversity, because the choice space of the model is typically very large and fine-grained.
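For a nonnegative integer $\delta$, define the $\delta$-neighborhood of a labelling $y$ as the set $\mathcal{N}_{\delta}(y)=\{y'\mid d(y,y')\leq\delta\}$ of labellings whose distance from $y$ is no more than $\delta$, where $d(\cdot,\cdot)$ measures the distance between two labellings, for which we can choose the Hamming distance. We call a labelling $y$ a local minimum of the energy function $E$ iff $E(y)\leq E(y')$, $\forall y'\in\mathcal{N}_{\delta}(y)$.
Given $\delta$, the set of modes is denoted by $\mathcal{M}^{\delta}$, formally (Chen et al, 2013)
$$\mathcal{M}^{\delta}=\{y\mid E(y)\leq E(y'),\ \forall y'\in\mathcal{N}_{\delta}(y)\} \tag{73}$$
As $\delta$ increases from zero to infinity, the $\delta$-neighborhood of $y$ monotonically grows and the set of modes $\mathcal{M}^{\delta}$ monotonically decreases. Therefore, the sets $\mathcal{M}^{\delta}$ form a nested sequence (Chen et al, 2013)
$$\mathcal{M}^{0}\supseteq\mathcal{M}^{1}\supseteq\cdots\supseteq\mathcal{M}^{\infty}=\{\mathrm{MAP}\} \tag{74}$$
Thus, the M-modes problem can be defined as computing the $M$ labellings with minimal energies in $\mathcal{M}^{\delta}$.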
Chen et al (2013) show that a labelling is a mode if and only if it behaves like a “local mode” everywhere; based on this, a new chain is constructed and the M-modes problem is reduced to the M-best problem on the new chain.
Furthermore, they establish a one-to-one, cost-preserving correspondence between the consistent configurations of the new chain and the set of modes. Therefore, the problem of computing the $M$ best modes is transferred to the problem of computing the $M$ best configurations in the new chain.
Different from the former three methods, M-modes obtains $M$ choices that are local extrema of the optimization problem, and these choices can therefore provide highly complementary information.
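To make the notion of modes concrete, the sketch below enumerates them by brute force on a tiny binary chain. It treats a mode as a labelling with no lower-energy labelling within Hamming distance `delta`, an assumption consistent with seeking minimal energies; the chain energy and all names are hypothetical, and the exhaustive search is only an illustration, not the chain construction of Chen et al (2013).

```python
from itertools import product


def hamming(a, b):
    """Number of nodes on which two labellings disagree."""
    return sum(x != y for x, y in zip(a, b))


def modes(energy, space, delta):
    """Labellings whose energy is minimal within their delta-neighborhood."""
    return [
        y for y in space
        if all(energy(y) <= energy(z) for z in space if hamming(y, z) <= delta)
    ]


def m_modes(energy, space, delta, M):
    """The M lowest-energy modes for a given neighborhood size."""
    return sorted(modes(energy, space, delta), key=energy)[:M]


def chain_energy(y, unary=2, pairwise=3):
    # Assumed toy energy: cost per node labelled 1 plus cost per label change.
    return unary * sum(y) + pairwise * sum(a != b for a, b in zip(y, y[1:]))


space = list(product((0, 1), repeat=4))
two_best = sorted(space, key=chain_energy)[:2]
two_modes = m_modes(chain_energy, space, delta=1, M=2)
mode_counts = [len(modes(chain_energy, space, d)) for d in range(5)]
```

In this example the two lowest-energy labellings differ in a single node, whereas the two modes for `delta=1` are the all-zeros and all-ones labellings; `mode_counts` shrinks as `delta` grows, in line with the nested sequence of mode sets.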
6.5 DPP
Different from the former methods, which obtain multiple diversified choices by repelling the choices from each other, Azadi et al (2017) propose a differentiable DPP layer to predict a set of diverse and informative proposals with enriched representations.
A key step in object detection is to find multiple proposals for further detection. As noted in Section 3.2.1, the DPP is a distribution over subsets of a fixed ground set, which prefers a diverse set of factors. Azadi et al (2017) therefore develop a DPP loss that finds a subset of diverse bounding boxes using the outputs of the other two loss functions (namely, the probability that each proposal belongs to each object category and the location information of the proposals), and that reinforces them in finding more accurate object instances. The DPP loss is employed to maximize the likelihood of an accurate selection given the pool of overlapping background and non-background boxes over multiple categories (Azadi et al, 2017).
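The preference of a DPP for diverse subsets can be illustrated with the standard L-ensemble form, in which the probability of a subset $Y$ is $\det(L_Y)/\det(L+I)$. The sketch below computes this log-likelihood with NumPy; the kernel values are made up for illustration, and the code is not the DPP layer or loss of Azadi et al (2017).

```python
import numpy as np


def dpp_log_likelihood(L, subset):
    """Log-probability of `subset` under an L-ensemble DPP with kernel L."""
    L = np.asarray(L, dtype=float)
    idx = list(subset)
    if idx:
        # Log-determinant of the kernel restricted to the selected items.
        _, logdet_sub = np.linalg.slogdet(L[np.ix_(idx, idx)])
    else:
        logdet_sub = 0.0  # The determinant of an empty matrix is 1.
    # Normalizer: the sum of det(L_Y) over all subsets equals det(L + I).
    _, logdet_norm = np.linalg.slogdet(L + np.eye(L.shape[0]))
    return logdet_sub - logdet_norm


# Assumed kernel: items 0 and 1 are highly similar, item 2 is unrelated.
L = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

similar_pair = dpp_log_likelihood(L, [0, 1])
diverse_pair = dpp_log_likelihood(L, [0, 2])
```

Because the determinant of the restricted kernel shrinks when the selected items are similar, `diverse_pair` is larger than `similar_pair`.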
6.6 Analysis
Even though all the methods in the former subsections can be used for inference diversification, there are differences between them. The methods in prior works are summarized in Table 4. D-MCL is the easiest to implement and the most commonly seen in prior works: one only needs to calculate the MAP choice and then obtain other choices by constraining the optimization with a dissimilarity function. In contrast, M-NMS discards the choices in the neighborhood of the previously selected choices and obtains other choices from the remainder. D-MCL and M-NMS obtain choices by solving the optimization with a user-defined similarity, while the submodular method tries to obtain the choices that provide the maximal marginal gain and complementary information. The former three methods may provide choices that are not local optima, whereas locally optimal choices contain more information than others. Therefore, different from the former three methods, M-modes tries to obtain multiple diversified choices that are also local optima. All of the above methods can be used with traditional machine learning methods. The DPP method for inference diversification, first proposed by Azadi et al (2017), is mainly applied in deep learning models: it constructs a DPP layer as well as a DPP loss for use in deep models. Based on the introduction of data diversification, D-model, D-models, and inference diversification, one can choose the proper diversification method for various computer vision tasks. In the following, we introduce some applications of diversity technology in machine learning models.
7 Applications
Diversity technology in machine learning can significantly improve the representational ability of the model in many tasks, including remote sensing imaging tasks (Gong et al, 2018c; Zhong et al, 2017; Gong et al, 2018b, a), machine translation (Gimpel et al, 2013; Li and Jurafsky, 2016), camera relocalization (GuzmanRivera et al, 2014b; Shotton et al, 2013b), natural image segmentation (Yadollahpour et al, 2013; Batra et al, 2012; GuzmanRivera et al, 2014a), object detection (Blaschko, 2011; Felzenszwalb et al, 2010), topic modeling (Cohen, 2015), and so on. The diversity priors, which decrease the redundancy in the learned model or diversify the obtained multiple choices, can provide more informative features and show powerful ability in real-world applications, especially for tasks with limited training samples and complex structures in the training samples. In the following, the applications of diversity technology in machine learning are introduced in detail.
7.1 Remote Sensing Imaging Tasks
Remote sensing images, including hyperspectral images, multispectral images, and so on, have played an increasingly important role in the past two decades (BioucasDias et al, 2013). However, remote sensing imaging tasks pose typical difficulties. First, the limited number of training samples usually makes it difficult to represent the images: since labelling is usually time-consuming and costly, we usually cannot provide enough training samples to train the model. Second, remote sensing images usually have large intra-class variance and low inter-class variance, which makes it difficult to extract discriminative features from the images. Finally, deep models with large numbers of parameters are usually used to model remote sensing images. These models usually perform better than shallow models, but the limited training samples often lead the learned model to be suboptimal.
To overcome these problems, some works have applied diversity-promoting priors to diversify the model (Gong et al, 2018c; Zhong et al, 2017). In Gong et al (2018c), the independence prior, which is based on the cosine similarity introduced in the former section, is imposed on a deep structural metric learning method for remote sensing scene classification. Zhong et al (2017) impose the independence prior on a DBN for hyperspectral image classification. Xiong et al (2014) show that the diversity-promoting prior is effective for the traditional RBM model; a DBN, which is a stack of RBMs, can likewise be diversified for better representation.
Other works focus on the diversification of multiple models for remote sensing images (Gong et al, 2018b, a). Gong et al (2018b) apply a cross-entropy measurement to diversify the obtained multiple models, so that the models provide more complementary information. Different from Gong et al (2018b), Gong et al (2018a) divide the training samples into several subsets, one for each model. Each model then focuses on the representation of different classes, and the overall representation of these models is improved.
7.2 Machine Translation
Machine translation (MT) is a subfield of computational linguistics that investigates the use of software to translate text or speech from one language to another. Recently, machine translation systems have been developed and widely used in real-world applications. Commercial machine translation services, such as Google translator, Microsoft translator, and Baidu translator, have achieved great success. From the perspective of user interaction, the ideal machine translator is an agent that reads documents in one language and produces accurate, high-quality translations in another. This interaction ideal has been implicit in MT research since the field’s inception. Unfortunately, when a real, imperfect MT system makes an error, the user is left trying to guess what the original sentence means. To overcome this problem, it is helpful to provide the M-best translations instead of a single best one (Venugopal et al, 2008).
However, many translations on M-best lists are extremely similar, often differing only by a single punctuation mark or a minor morphological variation. We argue that the implicit goal behind these technologies is to better explore the output space by introducing diversity into the surrogate set.
Some prior works have introduced diversity into the obtained multiple choices and achieved better performance (Li and Jurafsky, 2016). Gimpel et al (2013) apply the method for diversifying multiple choices introduced in subsection 6.1. The authors define a novel dissimilarity function on pairs of translations to increase the diversity between the obtained translations.
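As a rough illustration of diversifying an M-best list with a dissimilarity function, the sketch below greedily reranks scored translations while penalizing bigram overlap with the translations already selected. The bigram-overlap penalty, the weight `lam`, the toy scores and all names are assumptions made for illustration; they are not claimed to be the exact dissimilarity function of Gimpel et al (2013).

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) occurring in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap(a, b, n=2):
    """Number of distinct n-grams shared by two whitespace-tokenized sentences."""
    return len(ngrams(a.split(), n) & ngrams(b.split(), n))


def diverse_rerank(scored, lam, M, n=2):
    """Greedily select M translations by model score minus an overlap penalty."""
    selected = []
    pool = list(scored)
    while pool and len(selected) < M:
        best = max(
            pool,
            key=lambda item: item[1]
            - lam * sum(overlap(item[0], s[0], n) for s in selected),
        )
        selected.append(best)
        pool.remove(best)
    return [text for text, _ in selected]


# Hypothetical M-best list: (translation, model score).
scored = [
    ("the cat sat on the mat", -1.0),
    ("the cat sat on the mat .", -1.1),
    ("a cat was sitting on the rug", -1.6),
]

diverse = diverse_rerank(scored, lam=0.5, M=2)
plain = diverse_rerank(scored, lam=0.0, M=2)
```

With `lam=0.0` the list is returned in score order, so the two near-identical translations are kept; with a positive penalty the second choice becomes the translation that shares fewer bigrams with the first.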
7.3 Camera Relocalization
Camera relocalization estimates the pose of a camera relative to a known 3D scene from a single RGB-D frame (Shotton et al, 2013b). It can be formulated as the inversion of the generative rendering procedure, i.e. finding the camera pose corresponding to a rendering of the 3D scene model that is most similar to the observed input. Since this is a non-convex optimization problem with many local optima, one way to solve it is to find a set of M predictors that generate M camera pose hypotheses and then infer the best pose from these hypotheses. As in traditional M-best problems, the obtained M predictors are usually similar.
To overcome this problem and obtain hypotheses that differ from each other, GuzmanRivera et al (2014b) learn “marginally relevant” predictors, which make complementary predictions, and compare their performance when used with different selection procedures. In GuzmanRivera et al (2014b), a greedy algorithm is used to obtain multiple diversified models: a weight is defined on each training sample, and the weights are updated with the training loss of the previously learned model. Finally, multiple diversified models are obtained for camera relocalization.
7.4 Image Segmentation
In computer vision, image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels). The goal of segmentation is to simplify and change the representation of an image into something that is more meaningful and easier to analyze. More precisely, image segmentation is the process of assigning a label to each pixel in an image such that pixels with the same label share certain characteristics. Since a semantic segmentation algorithm deals with a tremendous amount of uncertainty from inter- and intra-object occlusion and varying appearance, lighting and pose, obtaining multiple best choices from all possible segmentations is one possible way to solve the problem. Therefore, the image segmentation problem can be transformed into the M-best problem. However, as is typical for M-best problems, the obtained multiple choices are usually similar and the information provided to the user tends to be redundant.
The way to solve this problem is to introduce diversity into the training process to encourage the multiple choices to be diverse. Many works (Yadollahpour et al, 2013; Batra et al, 2012; GuzmanRivera et al, 2014a, 2012; Prasad et al, 2014a; Ramakrishna and Batra, 2012; Sun and Batra, 2015; Lee et al, 2016; Prasad et al, 2014b) have introduced diversity into image segmentation tasks in different ways. Batra et al (2012) first introduce the D-MCL of subsection 6.1 for image segmentation. Yadollahpour et al (2013) and GuzmanRivera et al (2012) combine D-MCL with reranking, which provides a way to obtain multiple diversified choices and select the proper one among them. Prasad et al (2014a, b) use submodular functions to measure the diversification between multiple choices. Sun and Batra (2015) combine NMS (see details in subsection 6.3) with sliding windows to obtain multiple choices. The former works mainly focus on obtaining diversified multiple choices, while Lee et al (2016) try to obtain multiple models. The method proposed by Lee et al (2016) divides the training samples into several subsets, and each base model is trained on a specific one. By allocating each training sample to the model with the lowest prediction error, each model tends to model different classes from the others.
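The allocation step described above, in which each training sample is assigned to the model with the lowest prediction error, can be sketched as follows. The loss matrix is made-up data and the function names are hypothetical; the sketch shows only the assignment and the resulting per-sample best loss, not the full training procedure of Lee et al (2016).

```python
import numpy as np


def assign_to_best_model(losses):
    """Index of the lowest-loss model for each sample.

    `losses` has shape (n_samples, n_models).
    """
    return np.asarray(losses, dtype=float).argmin(axis=1)


def best_model_loss(losses):
    """Mean over samples of the loss achieved by each sample's best model."""
    return float(np.asarray(losses, dtype=float).min(axis=1).mean())


# Assumed per-sample losses of two base models on three training samples.
losses = np.array([
    [0.2, 0.9],
    [0.8, 0.1],
    [0.3, 0.4],
])

assignment = assign_to_best_model(losses)
subsets = {m: np.flatnonzero(assignment == m).tolist() for m in range(losses.shape[1])}
mean_best = best_model_loss(losses)
```

Each base model would then be updated only on the samples in its own subset, which is what drives the models to specialize on different parts of the data.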
7.5 Object Detection
Object detection is a computer vision task that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos. Similar to image segmentation, object detection algorithms involve great uncertainty. Therefore, obtaining multiple diversified choices is also an important way to solve the problem. Some prior works (Blaschko, 2011; Felzenszwalb et al, 2010) have attempted to obtain multiple diversified choices. Blaschko (2011) demonstrates that the energies resulting from M-NMS lead to the maximization of a submodular function; then, through a branch-and-bound strategy, the whole image can be explored and diversified multiple detections can be obtained.
7.6 Topic Modeling
In machine learning and natural language processing, a topic model is a statistical model for discovering the abstract “topics” that occur in a collection of documents. Probabilistic topic models such as Latent Dirichlet Allocation (LDA) can provide a useful and elegant tool for discovering hidden structure within large sets of discrete data, such as text corpora. However, LDA implicitly discovers topics along only a single dimension. Recent research on multidimensional topic modeling aims to devise techniques that can discover multiple groups of topics, where each group models a different dimension or aspect of the data.
Cohen (2015) presents a new multidimensional topic model that uses a determinantal point process prior (see details in subsection 6.1) to encourage different groups of topics to model different dimensions of the data. Determinantal point processes are probabilistic models of repulsive phenomena which originated in statistical physics but have recently seen interest from the machine learning community.
8 Discussions
This article surveyed the available work on diversity technology in general supervised machine learning models, systematically categorizing it into diversity in training samples, D-model, D-models, and inference diversity. We first summarize the main results and identify the challenges encountered throughout the article.
Machine learning methods have shown powerful ability in real-world applications. Diversity technology can be used when the training samples of a task are limited in number or imbalanced. We want to emphasize that diversity technology is not decisive on its own; it aims to improve the training process of machine learning methods.
Advice for implementation. We expect this article to be useful to researchers who want to improve the representational ability of machine learning models for computer vision tasks. For a given computer vision task, the proper machine learning model should be chosen first. Then, we advise considering whether to add diversity-promoting priors to improve the performance of the model and, further, what type of diversity measurement is desired. When one desires to obtain multiple models or multiple choices, one can consider diversifying the multiple models or the obtained multiple choices, for which Sections 5.2 and 6 are relevant and helpful. We advise the reader to first consider whether multiple models or multiple choices can help the performance.
9 Conclusions
The training of machine learning models requires large amounts of labelled samples. However, the limited training samples constrain the performance of machine learning models. Therefore, effective diversity technology, which encourages the model to be diversified and improves its representational ability, is expected to be an active area of research in machine learning. This paper summarizes the diversity technology for machine learning in previous work. We introduce diversity technology in data preprocessing, model training, and inference, respectively. Researchers can judge whether diversity technology is needed and choose the proper diversity method for their specific requirements according to the introductions in the former sections.
Acknowledgements.
This work was supported in part by the Natural Science Foundation of China under Grant 61671456 and 61271439, in part by the Foundation for the Author of National Excellent Doctoral Dissertation of China (FANEDD) under Grant 201243, and in part by the Program for New Century Excellent Talents in University under Grant NECT130164.
References
 Affandi et al (2012) Affandi RH, Kulesza A, Fox EB (2012) Markov determinantal point processes. arXiv preprint arXiv:12104850
 Affandi et al (2014) Affandi RH, Fox EB, Adams RP, Taskar B (2014) Learning the parameters of determinantal point process kernels. In: International Conference on Machine Learning, pp 1224–1232
 Ahmed et al (2015) Ahmed MA, Didaci L, Fumera G, Roli F (2015) An empirical investigation on the use of diversity for creation of classifier ensembles. In: International Workshop on Multiple Classifier Systems, pp 206–219
 Alhamdoosh and Wang (2014) Alhamdoosh M, Wang D (2014) Fast decorrelated neural network ensembles with random weights. Information Sciences 264:104–117
 Azadi et al (2017) Azadi S, Feng J, Darrell T (2017) Learning detection with diverse proposals. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 7369–7377
 Bardenet and AUEB (2015) Bardenet R, AUEB MTR (2015) Inference for determinantal point processes without spectral knowledge. In: Advances in Neural Information Processing Systems, pp 3393–3401
 Barinova et al (2012) Barinova O, Lempitsky V, Kohli P (2012) On detection of multiple object instances using Hough transforms. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(9):1773–1784
 Batra et al (2012) Batra D, Yadollahpour P, GuzmanRivera A, Shakhnarovich G (2012) Diverse mbest solutions in markov random fields. In: European Conference on Computer Vision, pp 1–16
 Belkin and Niyogi (2002) Belkin M, Niyogi P (2002) Laplacian eigenmaps and spectral techniques for embedding and clustering. In: Advances in Neural Information Processing Systems, pp 585–591
 BioucasDias et al (2013) BioucasDias JM, Plaza A, CampsValls G, Scheunders P, Nasrabadi N, Chanussot J (2013) Hyperspectral remote sensing data analysis and future challenges. IEEE Geoscience and Remote Sensing Magazine 1(2):6–36
 Bishop (2006) Bishop CM (2006) Pattern Recognition and Machine Learning. Springer
 Blaschko (2011) Blaschko M (2011) Branch and bound strategies for nonmaximal suppression in object detection. In: Energy Minimization Methods in Computer Vision and Pattern Recognition, pp 385–398
 Cai et al (2011) Cai D, He X, Han J, Huang TS (2011) Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(8):1548–1560
 Carreira-Perpiñán and Raziperchikolaei (2016) Carreira-Perpiñán MA, Raziperchikolaei R (2016) An ensemble diversity approach to supervised binary hashing. In: Advances in Neural Information Processing Systems, pp 757–765
 Chen et al (2013) Chen C, Kolmogorov V, Zhu Y, Metaxas D, Lampert C (2013) Computing the m most probable modes of a graphical model. Artificial Intelligence and Statistics pp 161–169
 Cohen (2015) Cohen J (2015) Multidimensional topic modeling with determinantal point processes. Independent Work Report Fall
 Das et al (2012) Das A, Dasgupta A, Kumar R (2012) Selecting diverse features via spectral regularization. In: Advances in Neural Information Processing Systems, pp 1583–1591
 Desai et al (2011) Desai C, Ramanan D, Fowlkes CC (2011) Discriminative models for multiclass object layout. International journal of computer vision 95(1):1–12
 Dietterich (2000) Dietterich TG (2000) An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning 40(2):139–157
 Felzenszwalb et al (2010) Felzenszwalb PF, Girshick RB, McAllester D, Ramanan D (2010) Object detection with discriminatively trained partbased models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9):1627–1645
 Giacinto and Roli (2001) Giacinto G, Roli F (2001) Design of effective neural network ensembles for image classification purposes. Image and Vision Computing 19(9–10):699–707
 Gillenwater et al (2014) Gillenwater JA, Kulesza A, Fox E, Taskar B (2014) Expectationmaximization for learning determinantal point processes. In: Advances in Neural Information Processing Systems, pp 3149–3157
 Gimpel et al (2013) Gimpel K, Batra D, Dyer C, Shakhnarovich G (2013) A systematic exploration of diversity in machine translation. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp 1100–1111
 Gong et al (2018a) Gong Z, Zhong P, Shan J, Hu W (2018a) A diversified deep ensemble for hyperspectral image classification. In: WHISPERS
 Gong et al (2018b) Gong ZQ, Zhong P, Shan JX, Hu WD (2018b) Diversifying deep multiple choices for remote sensing scene classification. IGARSS
 Gong et al (2018c) Gong ZQ, Zhong P, Yu Y, Hu WD (2018c) Diversitypromoting deep structural metric learning for remote sensing scene classification. IEEE Transactions on Geoscience and Remote Sensing 56(1):371–390
 Graff and Ellen (2016) Graff CA, Ellen J (2016) Correlating filter diversity with convolutional neural network accuracy. In: IEEE International Conference on Machine Learning and Applications (ICMLA), pp 75–80
 Gu and Han (2013) Gu Q, Han J (2013) Clustered support vector machines. In: AISTATS
 Guo et al (2017) Guo X, Wang X, Ling H (2017) Exclusivity regularized machine: a new ensemble svm classifier. In: IJCAI, pp 1739–1745
 GuzmanRivera et al (2012) GuzmanRivera A, Batra D, Kohli P (2012) Multiple choice learning: Learning to produce multiple structured outputs. In: Advances in Neural Information Processing Systems, pp 1799–1807
 GuzmanRivera et al (2014a) GuzmanRivera A, Kohli P, Batra D, Rutenbar R (2014a) Efficiently enforcing diversity in multioutput structured prediction. Artificial Intelligence and Statistics pp 284–292
 GuzmanRivera et al (2014b) GuzmanRivera A, Kohli P, Glocker B, Shotton J, Sharp T, Fitzgibbon A, Izadi S (2014b) Multioutput learning for camera relocalization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1114–1121
 Ho (1998) Ho TK (1998) The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8):832–844
 Hu et al (2015) Hu W, Li W, Zhang X, Maybank S (2015) Single and multiple object tracking using a multifeature joint sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(4):816–833
 Jiang et al (2014) Jiang L, Meng D, Yu SI, Lan Z, Shan S, Hauptmann A (2014) Selfpaced learning with diversity. In: Advances in Neural Information Processing, pp 2078–2086
 Kang (2013) Kang B (2013) Fast determinantal point process sampling with application to clustering. In: Advances in Neural Information Processing Systems, pp 2319–2327
 KeshavarzHedayati and Dimopoulos (2017) KeshavarzHedayati B, Dimopoulos NJ (2017) Sensitivity and similarity regularization in dynamic selection of ensembles of neural networks. In: IJCNN, pp 3953–3958
 Kirillov et al (2015) Kirillov A, Savchynskyy B, Schlesinger D, Vetrov D, Rother C (2015) Inferring mbest diverse labelings in a single one. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1814–1822
 Kirillov et al (2016) Kirillov A, Shekhovtsov A, Rother C, Savchynskyy B (2016) Joint mbestdiverse labelings as a parametric submodular minimization. In: Advances in Neural Information Processing Systems, pp 334–342
 Kulesza and Taskar (2010) Kulesza A, Taskar B (2010) Structured determinantal point processes. In: Advances in Neural Information Processing Systems, pp 1171–1179
 Kulesza and Taskar (2012) Kulesza A, Taskar B (2012) Determinantal point processes for machine learning. Foundations and Trends in Machine Learning 5:123–286
 Kuncheva et al (2003) Kuncheva LI, Whitaker CJ, Shipp C, Duin R (2003) Limits on majority vote accuracy in classifier fusion. Pattern Analysis and Applications 6(1):22–31
 Kwok and Adams (2012) Kwok JT, Adams RP (2012) Priors for diversity in generative latent variable models. In: Advances in Neural Information Processing Systems, pp 2996–3004
 Lang et al (2012) Lang C, Liu G, Yu J, Yan S (2012) Saliency detection by multitask sparsity pursuit. IEEE Transactions on Image Processing 21(3):1327–1338
 Lavancier et al (2015) Lavancier F, Moller J, Rubak E (2015) Determinantal point process models and statistical inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 77(4):853–877
 Lee et al (2012) Lee H, Kim E, Pedrycz W (2012) A new selective neural network ensemble with negative correlation. Applied intelligence 37(4):488–498
 Lee et al (2015) Lee S, Purushwalkam S, Cogswell M, Crandall D, Batra D (2015) Why m heads are better than one: Training a diverse ensemble of deep networks. ArXiv preprint arXiv: 151106314
 Lee et al (2016) Lee S, Prakash SPS, Cogswell M, Ranjan V, Crandall D, Batra D (2016) Stochastic multiple choice learning for training diverse deep ensembles. In: Advances in Neural Information Processing Systems, pp 2119–2127
 Li and Jurafsky (2016) Li J, Jurafsky D (2016) Mutual information and diverse decoding improve neural machine translation. In: arXiv preprint arXiv: 1601.00372
 Li et al (2016) Li T, Dou Y, Liu X (2016) Joint diversity regularization and graph regularization for multiple kernel kmeans clustering via latent variables. Neurocomputing 218:154–163
 Li et al (2015) Li Z, Liu J, Tang J, Lu H (2015) Robust structured subspace learning for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(10):2085–2098
 Liu et al (2016) Liu X, Dou Y, Yin J, Wang L, Zhu E (2016) Multiple kernel kmeans clustering with matrixinduced regularization. In: AAAI, pp 1888–1894
 Liu and Yao (1997) Liu Y, Yao X (1997) Negatively correlated neural networks can produce best ensembles. Australian journal of intelligent information processing systems 4(3):176–185
 Mariet and Sra (2015) Mariet Z, Sra S (2015) Fixedpoint algorithms for learning determinantal point processes. In: International Conference on Machine Learning, pp 2389–2397
 Nemhauser et al (1978) Nemhauser GL, Wolsey LA, Fisher ML (1978) An analysis of approximations for maximizing submodular set functions. Mathematical Programming 14(1):265–294
 Olshausen and Field (1996) Olshausen BA, Field DJ (1996) Emergence of simplecell receptive field properties by learning a sparse code for natural images. Nature 381(6583):607–607
 Parizi et al (2014) Parizi SN, Vedaldi A, Zisserman A, Felzenszwalb P (2014) Automatic discovery and optimization of parts for image classification. arXiv preprint arXiv: 14126598
 Park and Ramanan (2011) Park D, Ramanan D (2011) Nbest maximal decoders for part models. In: IEEE International Conference on Computer Vision, pp 2627–2634
 Peng et al (2017) Peng H, Li B, Ling H, Hu W, Xiong W, Maybank SJ (2017) Salient object detection via structured matrix decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4):818–832
 Peng et al (2016) Peng Y, Zhai X, Zhao Y, Huang X (2016) Semisupervised crossmedia feature learning with unified patch graph regularization. IEEE Transactions on Circuits and Systems for Video Technology 26(3):583–596
 Prasad et al (2014a) Prasad A, Jegelka S, Batra D (2014a) Submodular maximization and diversity in structured output spaces. In: Advances in Neural Information Processing Systems, pp 1–6
 Prasad et al (2014b) Prasad A, Jegelka S, Batra D (2014b) Submodular meets structured: Finding diverse subsets in exponentiallylarge structured item sets. In: Advances in Neural Information Processing Systems, pp 2645–2653
 Qiao et al (2015) Qiao M, Bian W, DaXu RY, Tao D (2015) Diversified hidden markov models for sequential labeling. IEEE Transactions on Knowledge and Data Engineering 27(11):2947–2960
 Qiao et al (2017) Qiao M, Liu L, Yu J, Xu C, Tao D (2017) Diversified dictionaries for multiinstance learning. Pattern Recognition 64:407–416
 Ramakrishna and Batra (2012) Ramakrishna V, Batra D (2012) Modemarginals: Expressing uncertainty via mbest solutions. In: NIPS Workshop on Perturbations, Optimization, and Statistics, pp 1–5
 Rao et al (2015) Rao V, Jain P, Jawahar CV (2015) Diverse yet efficient retrieval using hash functions. arXiv preprint arXiv: 150906553
 Rosen (1996) Rosen BE (1996) Ensemble learning using decorrelated neural networks. Connection science 8(3):373–384
 Shi and Shen (2016) Shi L, Shen YD (2016) Diversifying convex transductive experimental design for active learning. In: IJCAI, pp 1997–2003
 Shotton et al (2013a) Shotton J, Glocker B, Zach C, Izadi S, Criminisi A, Fitzgibbon A (2013a) Scene coordinate regression forests for camera relocalization in rgb-d images. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2930–2937
 Shotton et al (2013b) Shotton J, Glocker B, Zach C, Izadi S, Criminisi A, Fitzgibbon A (2013b) Scene coordinate regression forests for camera relocalization in rgb-d images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2930–2937
 Stephens et al (2013) Stephens GJ, Mora T, Tkacik G, Bialek W (2013) Statistical thermodynamics of natural images. Physical Review Letters 110(1):018701
 Sun and Batra (2015) Sun Q, Batra D (2015) Submodboxes: Near-optimal search for a set of diverse object proposals. In: Advances in Neural Information Processing Systems, pp 1378–1386
 Sun et al (2016) Sun X, He Z, Zhang X, Zou W, Baciu G (2016) Saliency detection via diversity-induced multi-view matrix decomposition. In: Asian Conference on Computer Vision, pp 137–151
 Sun et al (2017) Sun X, He Z, Xu C, Zhang X, Zou W, Baciu G (2017) Diversity induced matrix decomposition model for salient object detection. Pattern Recognition 66:253–267
 Tang et al (2006) Tang EK, Suganthan PN, Yao X (2006) An analysis of diversity measures. Machine Learning 65(1):247–271
 Venugopal et al (2008) Venugopal A, Zollmann A, Smith NA, Vogel S (2008) Wider pipelines: N-best alignments and parses in mt training. In: Proceedings of AMTA, pp 192–201
 Vinje and Gallant (2000) Vinje WE, Gallant JL (2000) Sparse coding and decorrelation in primary visual cortex during natural vision. Science 287(5456):1273–1276
 Viola and Jones (2004) Viola P, Jones MJ (2004) Robust real-time face detection. International Journal of Computer Vision 57(2):137–154
 Wachinger and Golland (2015) Wachinger C, Golland P (2015) Diverse landmark sampling from determinantal point processes for scalable manifold learning. arXiv preprint arXiv:1503.03506
 Wang et al (2015) Wang S, Peng J, Liu W (2015) An regularization framework for diverse learning tasks. Signal Processing 109:206–211
 Xie et al (2015a) Xie P, Deng Y, Xing E (2015a) Diversifying restricted boltzmann machine for document modeling. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 1315–1324
 Xie et al (2015b) Xie P, Deng Y, Xing E (2015b) Latent variable modeling with diversity-inducing mutual angular regularization. arXiv preprint arXiv:1512.07336
 Xie et al (2015c) Xie P, Deng Y, Xing E (2015c) On the generalization error bounds of neural networks under diversity-inducing mutual angular regularization. arXiv preprint arXiv:1511.07110
 Xie et al (2016) Xie P, Zhu J, Xing E (2016) Diversity-promoting bayesian learning of latent variable models. In: International Conference on Machine Learning, pp 59–68
 Xie et al (2017a) Xie P, Deng Y, Zhou Y, Kumar A, Yu Y, Zou J, Xing EP (2017a) Learning latent space models with angular constraints. In: International Conference on Machine Learning, pp 3799–3810
 Xie et al (2017b) Xie P, Salakhutdinov R, Mou L, Xing EP (2017b) Deep determinantal point process for large-scale multi-label classification. In: IEEE International Conference on Computer Vision, pp 473–482
 Xie et al (2017c) Xie P, Singh A, Xing EP (2017c) Uncorrelation and evenness: a new diversity-promoting regularizer. In: International Conference on Machine Learning, pp 3811–3820
 Xing and Wang (2017) Xing HJ, Wang XZ (2017) Selective ensemble of svdds with renyi entropy based diversity measure. Pattern Recognition 61:185–196
 Xiong et al (2014) Xiong H, Szedmak S, Rodriguez-Sanchez A, Piater J (2014) Towards sparsity and selectivity: Bayesian learning of restricted boltzmann machine for early visual features. In: International Conference on Artificial Neural Networks, pp 419–426
 Xiong et al (2015) Xiong H, Rodriguez-Sanchez AJ, Szedmak S, Piater J (2015) Diversity priors for learning early visual features. In: Frontiers in Computational Neuroscience, pp 1–9
 Yadollahpour et al (2013) Yadollahpour P, Batra D, Shakhnarovich G (2013) Discriminative reranking of diverse segmentations. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1923–1930
 Yin et al (2014a) Yin XC, Huang K, Yang C, Hao HW (2014a) Convex ensemble learning with sparsity and diversity. Information Fusion 20:49–59
 Yin et al (2014b) Yin XC, Yang C, Hao HW (2014b) Learning to diversify via weighted kernels for classifier ensemble. arXiv preprint arXiv:1406.1167
 You and Tao (2014) You X, Tao RWD (2014) Diverse expected gradient active learning for relative attributes. IEEE Transactions on Image Processing 23(7):3203–3217
 Yu et al (2006) Yu K, Bi J, Tresp V (2006) Active learning via transductive experimental design. In: Proceedings of the 23rd international conference on Machine learning, pp 1081–1088
 Yu et al (2008) Yu K, Zhu S, Xu W, Gong Y (2008) Non-greedy active learning for text categorization using convex transductive experimental design. In: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pp 635–642
 Yu et al (2011) Yu Y, Li YF, Zhou ZH (2011) Diversity regularized machine. In: Proceedings of International Joint Conference on Artificial Intelligence, pp 1603–1608
 Zhai et al (2014) Zhai X, Peng Y, Xiao J (2014) Learning cross-media joint representation with sparse and semisupervised regularization. IEEE Transactions on Circuits and Systems for Video Technology 24(6):965–978
 Zhang et al (2017) Zhang C, Kjellstrom H, Mandt S (2017) Determinantal point processes for mini-batch diversification. In: 33rd Conference on Uncertainty in Artificial Intelligence
 Zhang and Huan (2012) Zhang J, Huan J (2012) Inductive multi-task learning with multiple view data. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 543–551
 Zhang and Ou (2016) Zhang MJ, Ou Z (2016) Block-wise map inference for determinantal point processes with application to change-point detection. In: Statistical Signal Processing Workshop (SSP), pp 1–5
 Zhang and Zhou (2013) Zhang ML, Zhou ZH (2013) Exploiting unlabeled data to enhance ensemble diversity. Data Mining and Knowledge Discovery 26(1):98–129
 Zhao et al (2016) Zhao Y, Dou Y, Liu X, Li T (2016) Elm based multiple kernel k-means with diversity-induced regularization. In: International Joint Conference on Neural Networks (IJCNN), pp 2699–2705
 Zhong et al (2017) Zhong P, Gong ZQ, Li ST, Schönlieb CB (2017) Learning to diversify deep belief networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing 55(6):3516–3530
 Zhou et al (2002) Zhou ZH, Wu J, Tang W (2002) Ensembling neural networks: many could be better than all. Artificial intelligence 137(1):239–263
 Zhu and Xing (2009) Zhu J, Xing EP (2009) Maximum entropy discrimination markov networks. Journal of Machine Learning Research 10:2531–2569
 Zhu et al (2015) Zhu Y, Lan Y, Guo J, Cheng X (2015) Structural learning of diverse ranking. arXiv preprint arXiv:1504.04596
 Zylberberg et al (2011) Zylberberg J, Murphy JT, DeWeese MR (2011) A sparse coding model with synaptically local plasticity and spiking neurons can account for the diverse shapes of v1 simple cell receptive fields. PLoS computational biology 7(10):e1002250