Diversity in Machine Learning

Zhiqiang Gong, Ping Zhong, Weidong Hu
College of Electrical Science, National University of Defense Technology, Changsha, Hunan, China 410073
email: gongzhiqiang13@nudt.edu.cn, zhongping@nudt.edu.cn, wdhuatr@icloud.com

Received: date / Accepted: date
Abstract

Machine learning methods have achieved good performance and have been widely applied in various real-world applications. They can learn models adaptively and therefore be tailored to the special requirements of different tasks. Many factors affect the performance of the machine learning process, among which the diversity of the machine learning system is an important one. Generally, a good machine learning system is composed of plentiful training data, a good model training process, and an accurate inference. Diversity helps each of these components contribute to an overall good machine learning system: diversity of the training data ensures that the data contain enough discriminative information, diversity of the learned model (diversity in the parameters of each model or diversity among models) makes each parameter/model capture unique or complementary information, and diversity in inference provides multiple choices, each of which corresponds to a plausible result. However, there is no systematic analysis of diversification in machine learning systems. In this paper, we systematically summarize the methods for data diversification, model diversification, and inference diversification in the machine learning process. In addition, we survey typical applications where diversity technology has improved machine learning performance, including remote sensing imaging tasks, machine translation, camera relocalization, image segmentation, object detection, topic modeling, and others. Finally, we discuss some challenges of diversity technology in machine learning and point out some directions for future work. Our analysis provides a deeper understanding of diversity technology in machine learning tasks and can therefore help in designing and learning more effective models for specific tasks.

Keywords:
Diversity · Training Data · Model Learning · Inference · Machine Learning · Active Learning · Bayesian Method · Posterior Regularization

1 Introduction

Traditionally, machine learning methods learn parameters automatically from training samples and can thus adapt to the special requirements of various applications. Indeed, they have achieved great success in tackling many real-world artificial intelligence and data mining problems (Bishop, 2006), such as optical character recognition, face detection, autonomous car driving, data mining of biological data, and web search/information retrieval. A successful machine learning system often includes plentiful training data which provide enough information to train the model, a good model learning process which can properly model the data, and an accurate inference to discriminate different objects. Many factors can affect the performance of the machine learning process, among which diversity plays an important role.

Recently, diversity has frequently appeared in many fields, such as biological systems, culture, products, and so on. A diversified system usually contains more information and can better adapt to various environments. The diversity property can also be applied to machine learning systems. Here, we define diversity in machine learning as the property that different data items contain complementary and useful information or features, and that each factor in the same layer captures unique information from the data. Diversity tries to minimize the redundancy between different training data as well as the redundancy between the information captured by different factors. Therefore, diversity in machine learning can improve the performance of the model and plays an important role in the machine learning process. We summarize the diversification of machine learning into three categories: diversity in the training data (data diversification), diversity of the model/models (model diversification), and diversity of the inference (inference diversification).

Data diversification provides samples with enough information to train the machine learning model. Diversity in the training data maximizes the information contained in the data. Many prior works have imposed diversity in the construction of each training batch so that the model can be trained more effectively (Zhang et al, 2017). In addition, diversity in active learning can make the labelled training data contain the most information (You and Tao, 2014; Shi and Shen, 2016), so that the learned model achieves good performance with limited training samples.

Model diversification is inspired by diversity in the human visual system. Vinje and Gallant (2000); Olshausen and Field (1996); Zylberberg et al (2011) have shown that the human visual system exhibits decorrelation and sparseness, namely diversity. This makes different neurons respond to different stimuli and generates little redundancy in the learning process, which ensures the high effectiveness of human learning. In contrast, usual machine learning methods exhibit redundancy in the learned model, where different factors model similar features (Xie et al, 2015b). Therefore, model diversification can significantly improve the performance of machine learning systems. Model diversification includes diversity in the parameters of a single model (D-model) and diversity among the parameters of multiple base models (D-models). The former encourages different parameters within a model to be diversified so that each parameter models unique information (Zhong et al, 2017; Xie et al, 2015a); as a result, the performance of each model can be significantly improved (Gong et al, 2018c). In contrast, diversity among multiple base models, also called ensemble diversity, repulses different base models from one another and encourages each base model to provide complementary information (Viola and Jones, 2004; Lee et al, 2016; Zhu and Xing, 2009).

Inference diversification provides choices/representations with more complementary information. Generally, one can obtain multiple choices from the inference. However, the choices obtained from usual machine learning systems are highly similar to each other, where the next-best choice is often a one-pixel-shifted version of another (Park and Ramanan, 2011). Through diversification, the machine learning model can instead provide multiple choices which contain the most complementary information (Batra et al, 2012; Yadollahpour et al, 2013; Chen et al, 2013; Felzenszwalb et al, 2010).

This work systematically covers the literature on diversity-promoting methods for data diversification, model diversification, and inference diversification in machine learning tasks. In particular, three main questions arise in the analysis of diversity technology in machine learning.

  • How can the training data, the learned model/models, and the inference in a machine learning system be diversified, respectively? Why do these methods diversify the machine learning system?

  • Is there any difference between the diversification of a single model (D-model) and of multiple models (D-models)? Furthermore, is there any similarity between the diversity in the training data, in the learned model/models, and in the inference?

  • In which real-world applications has diversity been applied in prior works? How do the diversification methods work in these applications?

All three problems are important, yet none of them has been thoroughly answered. Diversity in machine learning can balance the training data, encourage the learned parameters to be diversified, and diversify the choices obtained from the inference. By enforcing diversity in the machine learning system, the model can achieve better performance. Following this framework, the three questions above are answered by both theoretical analysis and real-world applications.

The remainder of this paper is organized as follows. Section 2 discusses the general forms of supervised learning and active learning in machine learning models. Section 3 summarizes the most commonly used diversity measurements in prior works. Section 4 outlines prior works on diversification of the training data. Section 5 reviews strategies for model diversification, including the D-model and the D-models. Prior works on inference diversification are summarized in Section 6. Section 7 introduces applications of diversity-promoting methods in prior works. Finally, we discuss open issues, conclude the paper, and point out some future directions.

Figure 1: Flowchart of the training process of general machine learning (including active learning, supervised learning, and unsupervised learning). When the training data are labelled, the training process is supervised; when they are unlabelled, it is unsupervised. When labelled and unlabelled data are both used for training, the training process is semi-supervised.

2 General Machine Learning Models

Traditionally, machine learning consists of supervised learning, active learning, unsupervised learning, and reinforcement learning. In reinforcement learning, training data are given only as feedback to the program's actions in a dynamic environment; accurate input/output pairs are not required and sub-optimal actions need not be explicitly corrected. However, the diversity technologies discussed here mainly work on the model itself to improve its performance. Therefore, this work ignores reinforcement learning and mainly discusses the machine learning models shown in Fig. 1. In addition, although some attempts have been made to diversify the data pre-processing of unsupervised learning, most prior works focus on the diversification of supervised learning and active learning. Therefore, this work mainly summarizes these kinds of supervised and active learning models, as Fig. 1 shows.

2.1 Supervised Learning

We consider the task of general supervised machine learning, which is commonly used in real-world machine learning tasks. Fig. 1 shows the flowchart of the general machine learning methods considered in this work. It can be noted from Fig. 1 that a supervised machine learning model consists of data pre-processing, training (modeling), and inference.

Let $X = \{x_1, x_2, \cdots, x_N\}$ denote the set of training samples and $y_i$ the corresponding label of $x_i$, where $y_i \in \Gamma$ ($\Gamma$ is the set of class labels and $C$ is the number of classes). Traditionally, the machine learning task can be formulated as the following optimization problem:

$$\min_{W} \ L(W; X, Y) \quad \mathrm{s.t.} \quad R(W) \leq \epsilon, \qquad (1)$$

where $L(\cdot)$ represents the loss function and $R(W) \leq \epsilon$ is the constraint on the parameters $W$ of the model. Using a Lagrange multiplier, the optimization can be reformulated as follows:

$$\min_{W} \ E(W) = L(W; X, Y) + \lambda R(W), \qquad (2)$$

where $\lambda$ is a positive value. Therefore, the machine learning problem can be seen as the minimization of $E(W)$.

Figs. 2 and 3 show flowcharts of special forms of supervised learning models. Fig. 2 shows the flowchart of supervised machine learning with a single model. In the data pre-processing stage, the more diversified and balanced each training batch is, the more effective the training process is. In addition, the factors in the same layer of the model can be diversified to improve the representational ability of the model (which is called the D-model in this paper). Moreover, when we obtain multiple choices from the model, the obtained choices are desired to provide more complementary information; some works therefore focus on the diversification of multiple choices (which we call inference diversification). Fig. 3 shows the flowchart of supervised machine learning with multiple parallel base models. A good strategy for diversifying the training sets of the different base models can improve the performance of the whole ensemble. Furthermore, we can diversify these base models directly to enforce each base model to provide more complementary information for further analysis.

Figure 2: Flowchart of supervised machine learning with a single model. In the data pre-processing stage, the more diversified and balanced each training batch is, the more effective the training process is. The factors in the same layer of the model can be diversified to improve the representational ability of the model (the D-model in this paper). Moreover, when multiple choices are obtained from the model, they are desired to provide more complementary information; some works therefore focus on the diversification of multiple choices (inference diversification).
Figure 3: Flowchart of supervised machine learning with multiple parallel models. A good strategy for diversifying the training sets of the different models can improve the performance of the ensemble. Furthermore, the models can be diversified directly to enforce each one to provide more complementary information for further analysis.

2.2 Active Learning

Since labelling is costly and time consuming, it is often impossible to provide enough labelled samples for training. Therefore, active learning, which can reduce the labelling cost and keep the training set at a moderate size, plays an important role in machine learning. It makes use of the most informative samples and obtains higher performance with fewer labelled training samples.

Through active learning, we can choose the most informative samples as labelled samples to train the model. This paper will take the Convex Transductive Experimental Design (CTED) as the representative of the active learning methods (Yu et al, 2006, 2008).

Denote $Z = \{z_1, z_2, \cdots, z_M\}$ as the candidate unlabelled samples for active learning. Then, the active learning problem can be formulated as the following optimization problem (Yu et al, 2008):

(3)

where $b_i$ is the $i$-th entry of the selection vector $\mathbf{b}$, and $\mu$ is a positive tradeoff parameter. As is shown, CTED utilizes a data reconstruction framework to select informative samples for labelling. The matrix $A$ contains the reconstruction coefficients and $\mathbf{b}$ is the sample selection vector. The $\ell_1$-norm makes the learned $\mathbf{b}$ sparse. The obtained $\mathbf{b}$ is then used to select samples for labelling, and finally the training set is constructed from the selected samples and the original training samples. However, the samples selected by CTED are usually similar to each other, which leads to redundancy in the training samples. Therefore, the diversity property is also required in the active learning process.

In conclusion, diversification can be used in supervised learning and in active learning to improve a model's performance. According to the models in Sections 2.1 and 2.2, diversification technology in machine learning is divided into three parts: data diversification (Section 4), model diversification (Section 5), and inference diversification (Section 6). Since the diversification of training batches (Fig. 2) and the diversification in active learning mainly concern the training data, we summarize these prior works as data diversification in Section 4. The diversification of the model in Fig. 2 and of the models in Fig. 3 mainly concerns the training model or the relations between different base models directly, and thus we summarize these works in Section 5. Finally, inference diversification in Fig. 2 is summarized in Section 6. In the following section, we first introduce the measurements which calculate similarity and promote diversity in machine learning models.

3 Diversity Measurements

Even though the physical meanings of the training data, the parameter factors in the training model, and the choices in inference are different, their mathematical forms are similar since all of these factors can be represented as vectors. Denote $W = \{w_1, w_2, \cdots, w_K\}$, where each $w_i$ represents a training sample or a factor in the machine learning model. Generally, geometric properties of these vectors can be used to measure the similarity between them. Since vectors can be seen as points in a space, the distribution of the data points can also be used to measure similarity. In addition, the results obtained from a machine learning model are usually probability distributions, so Bayesian methods can be used to calculate similarity between probability distributions. Finally, considering diversity group-wise is another way to measure similarity. In the following, we summarize the diversity measurements which are usually used to encourage diversification over different factors in prior works from these four aspects, respectively.

3.1 Geometric Properties

Generally, when two vectors are orthogonal or the distance between them is sufficiently large, the two vectors are regarded as uncorrelated and dissimilar. Therefore, geometric properties of the vectors, such as distances and angles, can be used to measure the similarity between different factors. In the following, we introduce the angle-based methods (such as cosine similarity and the inner product), the distance-based methods (such as the Euclidean distance and the heat kernel), and the eigenvalue-based methods (such as the uncorrelation and evenness measurement and submodular spectral diversity), respectively.

3.1.1 Cosine Similarity Measurement

Cosine similarity. The most commonly used measurement to calculate the correlation between different vectors is cosine similarity. The cosine similarity between two factors $w_i$ and $w_j$ can be calculated as

$$s(w_i, w_j) = \frac{w_i^{T} w_j}{\|w_i\|_2 \|w_j\|_2}. \qquad (4)$$

The diversity-promoting prior of the generalized cosine similarity measurement can be written as

$$p(W) \propto \exp\Big(-\gamma \sum_{i \neq j} |s(w_i, w_j)|^{p}\Big). \qquad (5)$$

It should be noted that when $p$ is set to 1, the diversity-promoting prior over different vectors by cosine similarity can be formulated as

$$p(W) \propto \exp\Big(-\gamma \sum_{i \neq j} |s(w_i, w_j)|\Big), \qquad (6)$$

where $\gamma$ is a positive value. When the cosine similarity measurement is used as a diversity-promoting prior, $s(w_i, w_j)$ is enforced towards 0. Then, $w_i$ and $w_j$ tend to be orthogonal and the different factors are encouraged to be uncorrelated and diversified. However, this measurement has the defect of being variant to orientation.

Angular. Some works use the variance and mean value of the angles between different factors to formulate the diversity of the model and overcome the problem of cosine similarity. The angle between two factors can be formulated as

$$\theta_{ij} = \arccos\left(\frac{|w_i^{T} w_j|}{\|w_i\|_2 \|w_j\|_2}\right). \qquad (7)$$

The diversity function can be defined as

$$\phi(W) = \bar{\Theta}(W) - \mathrm{var}(\Theta(W)), \qquad (8)$$

where

$$\bar{\Theta}(W) = \frac{1}{K(K-1)}\sum_{i \neq j} \theta_{ij}, \qquad \mathrm{var}(\Theta(W)) = \frac{1}{K(K-1)}\sum_{i \neq j} \big(\theta_{ij} - \bar{\Theta}(W)\big)^2.$$

In other words, $\bar{\Theta}(W)$ presents the mean of the angles between different factors and $\mathrm{var}(\Theta(W))$ presents the variance of the angles. Then, the diversity-promoting prior based on the angles of the cosine similarity measurement can be formulated as

$$p(W) \propto \exp\big(\gamma\, \phi(W)\big). \qquad (9)$$

Here $\theta_{ij}$ represents the angle between different factors. The prior in Eq. 9 encourages the angles between different factors to be large and even, and thus these factors are enforced to be diversified under the diversification prior. Moreover, the measurement is invariant to scale, translation, rotation, and orientation.
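To make the angle-based measurement concrete, the following sketch (a minimal NumPy illustration assuming the factors are stored as the rows of a matrix W; it is not code from the cited papers) computes the pairwise cosine similarities of Eq. 4 and the mean-minus-variance diversity score of Eq. 8. A redundant set of factors receives a lower score than a well-spread one.

```python
import numpy as np

def cosine_similarity_matrix(W):
    """Pairwise cosine similarities between the rows (factors) of W."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return Wn @ Wn.T

def angular_diversity(W):
    """Mean minus variance of the pairwise non-obtuse angles (Eq. 8)."""
    S = np.clip(np.abs(cosine_similarity_matrix(W)), 0.0, 1.0)
    iu = np.triu_indices(W.shape[0], k=1)        # each unordered pair once
    angles = np.arccos(S[iu])
    return angles.mean() - angles.var()

rng = np.random.default_rng(0)
W_redundant = np.tile(rng.normal(size=(1, 8)), (5, 1)) + 0.01 * rng.normal(size=(5, 8))
W_diverse = rng.normal(size=(5, 8))
print(angular_diversity(W_redundant), angular_diversity(W_diverse))  # low vs. clearly higher score
```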

3.1.2 Inner Product Measurement

Different vectors present more diversity when they tend to be more orthogonal. The inner product can measure the orthogonality between different vectors and therefore it can be applied in machine learning models to promote diversity. Guo et al (2017) use a special form of the inner product measurement, which is called exclusivity. The exclusivity between two vectors $w_i$ and $w_j$ is defined as

$$\mathcal{X}(w_i, w_j) = \|w_i \odot w_j\|_0, \qquad (10)$$

where $\odot$ denotes the Hadamard (element-wise) product and $\|\cdot\|_0$ denotes the $\ell_0$ norm. Therefore, the diversity-promoting prior can be written as

$$p(W) \propto \exp\Big(-\gamma \sum_{i \neq j} \mathcal{X}(w_i, w_j)\Big). \qquad (11)$$

Due to the non-convexity and discontinuity of the $\ell_0$ norm, the relaxed exclusivity is calculated as

$$\tilde{\mathcal{X}}(w_i, w_j) = \|w_i \odot w_j\|_1 = \sum_{k} |w_{ik}| \cdot |w_{jk}|, \qquad (12)$$

where $\|\cdot\|_1$ denotes the $\ell_1$ norm. Then, the diversity-promoting prior based on relaxed exclusivity can be calculated as

$$p(W) \propto \exp\Big(-\gamma \sum_{i \neq j} \tilde{\mathcal{X}}(w_i, w_j)\Big). \qquad (13)$$

Li et al (2016); Liu et al (2016) use the trace to form the inner product measurement. The diversity-promoting prior in (Li et al, 2016; Liu et al, 2016) can be formulated as

(14)

where $\mathrm{tr}(\cdot)$ represents the trace of a matrix.

The inner product measurement takes advantage of the relations among the vectors and encourages different factors to be orthogonal, so that the learned factors are diversified. It should be noted that this measurement can be seen as a special form of the cosine similarity measurement. It is easy to implement but is variant to scale and orientation.
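A minimal sketch of the relaxed exclusivity of Eq. 12 is given below (an illustration under the assumption that the factors are the rows of W, not code from Guo et al (2017)): factors with disjoint supports incur zero penalty, while factors sharing their support are penalized.

```python
import numpy as np

def relaxed_exclusivity(W):
    """Sum over factor pairs of sum_k |w_ik| * |w_jk| (a relaxed co-support penalty)."""
    A = np.abs(W)
    G = A @ A.T                              # G[i, j] = sum_k |w_ik| * |w_jk|
    return (G.sum() - np.trace(G)) / 2.0     # off-diagonal pairs, each counted once

rng = np.random.default_rng(0)
W_overlap = np.abs(rng.normal(size=(4, 10)))         # all factors share the same support
W_disjoint = np.zeros((4, 10))
for i in range(4):                                   # factors with disjoint supports
    W_disjoint[i, 2 * i:2 * i + 2] = np.abs(rng.normal(size=2))
print(relaxed_exclusivity(W_overlap), relaxed_exclusivity(W_disjoint))  # large vs. 0.0
```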

3.1.3 Euclidean Distance Measurement

In general, the larger the Euclidean distance between two vectors is, the more different the vectors are. Therefore, the Euclidean distance can be used to measure the difference between vectors, and different vectors can be diversified by enlarging the Euclidean distances between them. The diversity-promoting prior based on the Euclidean distance can be formulated as

$$p(W) \propto \exp\Big(\gamma \sum_{i \neq j} \|w_i - w_j\|_2^2\Big). \qquad (15)$$

Even though the Euclidean distance measurement uses the distances between factors to measure their similarity, it is variant to scale, which may decrease the effectiveness of the diversity measurement.

3.1.4 Heat Kernel Measurement

Another commonly used method to measure the correlation between different parameters is the heat kernel. The correlation between two factors can be calculated as

$$s(w_i, w_j) = \exp\left(-\frac{\|w_i - w_j\|_2^2}{t}\right), \qquad (16)$$

where $t$ is a positive value. We can find that when $w_i$ and $w_j$ are dissimilar, $s(w_i, w_j)$ tends to zero, so this term measures the correlation between different factors. Then, the diversity-promoting prior based on the heat kernel can be formulated as

$$p(W) \propto \exp\Big(-\gamma \sum_{i \neq j} s(w_i, w_j)\Big). \qquad (17)$$

The heat kernel takes advantage of the distances between factors to encourage the diversity of the model. In addition, the measurement makes the penalization of diversification vary with a Gaussian function, and the rate of this variation is controlled by the factor $t$.
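The heat-kernel penalty can be evaluated directly from the pairwise squared distances. The sketch below is a small NumPy illustration (assuming the factors are the rows of W): nearly identical factors give a penalty close to the number of pairs, while well-separated factors give a penalty close to zero.

```python
import numpy as np

def heat_kernel_penalty(W, t=1.0):
    """Sum over factor pairs of exp(-||w_i - w_j||^2 / t); small when the factors are spread out."""
    sq_norms = (W ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * W @ W.T
    S = np.exp(-sq_dists / t)
    return (S.sum() - np.trace(S)) / 2.0     # off-diagonal pairs, each counted once

rng = np.random.default_rng(1)
W_close = 0.01 * rng.normal(size=(6, 4))     # nearly identical factors: penalty close to 15 pairs
W_spread = 10.0 * rng.normal(size=(6, 4))    # well-separated factors: penalty close to 0
print(heat_kernel_penalty(W_close), heat_kernel_penalty(W_spread))
```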

Besides the distance-based and angle-based measurements, the eigenvalues of the kernel matrix can also be used to encourage different factors to be orthogonal and diversified. Recall that, for an orthogonal matrix, all the eigenvalues of the kernel matrix are equal to 1. Here, we denote $G = W W^{T}$ as the kernel matrix of $W$. Therefore, when we constrain the eigenvalues of $G$ to be 1, the obtained vectors tend to be orthogonal. Two measurements encourage the eigenvalues to be 1: the submodular spectral diversity measurement and the uncorrelation and evenness measurement. In the following, the two measurements are introduced in detail.

3.1.5 Submodular Spectral Diversity (SSD) Measurement

The submodular spectral diversity (SSD) measurement uses the squared distance to encourage the eigenvalues to be 1 directly. Define $\lambda_1, \lambda_2, \cdots, \lambda_K$ as the eigenvalues of the kernel matrix $G$. Then, the diversity-promoting prior by SSD can be formulated as

$$p(W) \propto \exp\Big(-\gamma \sum_{i=1}^{K} (\lambda_i - 1)^2\Big), \qquad (18)$$

where $\gamma$ is also a positive value. This regularizes the variance of the eigenvalues of the matrix. Since all the eigenvalues are enforced to be 1, the obtained factors become more orthogonal and thus the model presents more diversity.

3.1.6 Uncorrelation and Evenness (UE) Measurement

Another diversity measurement based on the kernel matrix is uncorrelation and evenness (Xie et al, 2017c). This measurement encourages the learned factors to be uncorrelated and to play equally important roles in modeling the data. Formally, this amounts to encouraging the kernel matrix of the vectors to have more uniform eigenvalues.

The basic idea is to normalize the eigenvalues into a probability simplex and encourage the discrete distribution parameterized by the normalized eigenvalues to have small Kullback-Leibler (KL) divergence with the uniform distribution (Xie et al, 2017c). Then, the diversity-promoting prior by uniform eigenvalues is formulated as

$$\mathrm{KL}\left(\left\{\frac{\lambda_i}{\sum_{j}\lambda_j}\right\} \Big\| \left\{\frac{1}{K}\right\}\right) = \sum_{i=1}^{K} \frac{\lambda_i}{\sum_{j}\lambda_j} \log \frac{K\lambda_i}{\sum_{j}\lambda_j}, \qquad (19)$$

subject to $G \succ 0$ ($G$ is a positive definite matrix), where $G$ is the kernel matrix and $\lambda_1, \cdots, \lambda_K$ are its eigenvalues.
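Both eigenvalue-based measurements can be evaluated from the kernel matrix $G = W W^{T}$. The following sketch (an illustration only, with the factors as the rows of W) computes the SSD-style penalty of Eq. 18 and the uniform-eigenvalue KL of Eq. 19; both vanish when the factors are orthonormal.

```python
import numpy as np

def eigenvalue_penalties(W):
    """SSD-style penalty and uniform-eigenvalue KL computed on the kernel matrix G = W W^T."""
    G = W @ W.T
    lam = np.linalg.eigvalsh(G)                        # eigenvalues of the kernel matrix
    ssd = np.sum((lam - 1.0) ** 2)                     # pushes every eigenvalue towards 1
    p = lam / lam.sum()                                # normalized eigenvalues on the simplex
    kl_to_uniform = np.sum(p * np.log(np.maximum(p, 1e-12) * len(p)))
    return ssd, kl_to_uniform

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 5)))           # 5 orthonormal factors of dimension 8
print(eigenvalue_penalties(Q.T))                       # both penalties are (numerically) 0
print(eigenvalue_penalties(rng.normal(size=(5, 8))))   # generic factors: larger penalties
```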

3.2 Data Distributions

Different from the aforementioned measurements which use the geometric properties to encourage diversity between pairwise vectors, the data distribution method, which prefers to a diverse set of vectors, can also be used to enforce the diversity between different factors. In the following, we’ll introduce the determinantal point process (DPP) which is usually used in prior works to diversify machine learning models.

3.2.1 Determinantal Point Process (DPP) Measurement

A DPP is a distribution over subsets of a fixed ground set, which prefers a diverse set of factors rather than a redundant one (Kulesza and Taskar, 2012). Let $\Omega$ denote a continuous space and let the factors be $W = \{w_1, w_2, \cdots, w_K\} \subset \Omega$. Denote by $L$ a positive semi-definite kernel function on $\Omega$; the DPP then assigns to the set $W$ the probability

$$P(W) = \frac{\det(L_W)}{\det(L + I)}, \qquad (20)$$

where $L_W$ denotes the kernel matrix restricted to the elements of $W$, i.e. $(L_W)_{ij} = L(w_i, w_j)$ is the pairwise correlation between $w_i$ and $w_j$, $\det(\cdot)$ denotes the determinant of a matrix, and $I$ is an identity matrix. Since the space $\Omega$ is fixed, $\det(L + I)$ is a constant value. Therefore, the corresponding diversity prior over the parameter matrix modeled by the DPP can be formulated as

$$p(W) \propto \det(L_W). \qquad (21)$$

In general, the kernel can be divided into a correlation part and a prior part. Therefore, the kernel can be reformulated as

$$L(w_i, w_j) = q(w_i)\,\rho(w_i, w_j)\,q(w_j), \qquad (22)$$

where $q(\cdot)$ is the prior for the parameter and $\rho(\cdot,\cdot)$ denotes the correlation of these factors. Such kernels always induce repulsion between different factors, and thus a diverse set of factors has higher probability. Generally, the factors are supposed to be uniformly distributed variables. Therefore, the prior $q(\cdot)$ is a constant value, and the kernel becomes

$$L(w_i, w_j) \propto \rho(w_i, w_j). \qquad (23)$$

Some works have shown that the DPP prior is usually not arbitrarily strong for some special cases when applied to machine learning models (Lavancier et al, 2015). To make the DPP prior strong enough for all the training data, the DPP prior is augmented by an additional positive parameter $\beta$. Therefore, the DPP prior can be reformulated as

$$p(W) \propto \big(\det(L_W)\big)^{\beta}. \qquad (24)$$

When we set the cosine similarity as the correlation kernel $\rho$, from a geometric interpretation, the DPP prior can be seen as the volume of the parallelepiped spanned by the columns of $W$ (Kulesza and Taskar, 2012). Therefore, diverse sets are more probable because their feature vectors are more orthogonal, and hence span larger volumes.
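The determinant-as-volume interpretation is easy to verify numerically. The sketch below (an illustration, not code from Kulesza and Taskar (2012)) evaluates det(L_W) with a cosine-similarity kernel for a nearly parallel set of factors and for a well-spread one.

```python
import numpy as np

def dpp_score(W):
    """Unnormalized DPP probability det(L_W) with a cosine-similarity kernel."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    L = Wn @ Wn.T                   # Gram matrix of the normalized factors
    return np.linalg.det(L)         # squared volume spanned by the normalized factors

rng = np.random.default_rng(0)
base = rng.normal(size=(1, 6))
W_redundant = np.vstack([base, base + 0.05 * rng.normal(size=(2, 6))])  # nearly parallel factors
W_diverse = rng.normal(size=(3, 6))                                     # well-spread factors
print(dpp_score(W_redundant))       # close to 0
print(dpp_score(W_diverse))         # noticeably larger
```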

3.3 Probability Distribution

Bayesian methods, such as divergence and cross entropy, can measure the similarity between different distributions. They can be also used to enforce diversity between different factors and prior works have shown the effectiveness of these methods. In addition, some works combine the Bayesian methods with statistics to encourage different factors to be diversified, such as negative correlation learning (Liu and Yao, 1997; Alhamdoosh and Wang, 2014). In the following, these Bayesian methods have been introduced in detail.

3.3.1 Divergence Measurement

Each factor can be treated as a probability distribution. Since divergence can measure the dissimilarity between different distributions, it can also be used to measure the diversity between different factors. The divergence between factors $w_i$ and $w_j$ can be calculated as

$$\mathrm{KL}(w_i \| w_j) = \sum_{k} w_{ik} \log \frac{w_{ik}}{w_{jk}}, \qquad (25)$$

subject to $\sum_{k} w_{ik} = 1$ and $w_{ik} \geq 0$. The divergence measures the dissimilarity between the learned factors, such that the diversity-promoting prior by divergence can be formulated as

$$p(W) \propto \exp\Big(\gamma \sum_{i \neq j} \mathrm{KL}(w_i \| w_j)\Big). \qquad (26)$$

The measurement takes advantage of the characteristics of the divergence to measure the dissimilarity between different distributions. It should be noted that the norm of the learned factors needs to be constrained to 1, which limits the applicability of the divergence measurement.
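The pairwise KL-based diversity can be computed as follows; this is a plain illustration which assumes each factor has already been normalized onto the probability simplex.

```python
import numpy as np

def pairwise_kl_diversity(W, eps=1e-12):
    """Sum of KL(w_i || w_j) over ordered factor pairs; each row of W must lie on the simplex."""
    P = np.clip(W, eps, None)
    P = P / P.sum(axis=1, keepdims=True)
    total = 0.0
    for i in range(len(P)):
        for j in range(len(P)):
            if i != j:
                total += np.sum(P[i] * np.log(P[i] / P[j]))
    return total

W_similar = np.array([[0.40, 0.30, 0.30], [0.38, 0.32, 0.30]])
W_diverse = np.array([[0.90, 0.05, 0.05], [0.05, 0.05, 0.90]])
print(pairwise_kl_diversity(W_similar), pairwise_kl_diversity(W_diverse))  # small vs. large
```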

3.3.2 Negative Correlation Learning (NCL) Measurement

Negative correlation learning (NCL) has been proposed to reduce the covariance among different models while the variance and bias terms are not increased (Rosen, 1996). This measurement is usually used for diversifying multiple models. Denote $f_k(x)$ as the inference result of the $k$th model and $W_k$ as the parameters of the $k$th model. Rosen (1996) uses the following penalty to decorrelate the current learning model with all previously learned models:

$$p_k(x) = \sum_{l < k} \big(f_l(x) - y\big)\big(f_k(x) - y\big). \qquad (27)$$

Define $\bar{f}(x) = \frac{1}{M}\sum_{k=1}^{M} f_k(x)$, where $M$ is the number of models. Then, the penalty term can also be defined to reduce the correlation mutually among all the learned models by using the actual output obtained from each model instead of the target function $y$ (Liu and Yao, 1997; Alhamdoosh and Wang, 2014):

$$p_k(x) = \big(f_k(x) - \bar{f}(x)\big) \sum_{l \neq k} \big(f_l(x) - \bar{f}(x)\big). \qquad (28)$$

Then, the diversity-promoting prior by NCL can be written as

$$p(W_1, \cdots, W_M) \propto \exp\Big(-\gamma \sum_{k=1}^{M} \sum_{x} p_k(x)\Big). \qquad (29)$$

This measurement uses the covariance of the inference results obtained from the multiple models to reduce the correlation among the learned models. Therefore, the learned multiple models can be diversified.
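The NCL penalty of Eq. 28 can be computed from the stacked model outputs. The sketch below is only an illustration (not the training procedure of the cited papers); because the deviations from the ensemble mean sum to zero, the penalty for each model reduces to the negative squared deviation, so adding it to the training loss rewards disagreement with the ensemble mean.

```python
import numpy as np

def ncl_penalties(preds):
    """NCL penalty of Eq. 28 per model, for preds of shape (n_models, n_samples)."""
    dev = preds - preds.mean(axis=0)                  # f_k(x) - \bar{f}(x)
    # (f_k - fbar) * sum_{l != k} (f_l - fbar), summed over the samples; since the
    # deviations sum to zero this equals -(f_k - fbar)^2 and is lower (more negative)
    # when a model disagrees more with the ensemble mean.
    return np.sum(dev * (dev.sum(axis=0) - dev), axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=100)
redundant = np.vstack([x + 0.01 * rng.normal(size=100) for _ in range(3)])  # near-identical models
diverse = rng.normal(size=(3, 100))                                         # disagreeing models
print(ncl_penalties(redundant).sum(), ncl_penalties(diverse).sum())  # ~0 vs. strongly negative
```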

3.3.3 Cross Entropy Measurement

Cross entropy is another measurement which can be used in D-models. As the former subsection shows, $f_k(x)$ denotes the inference result (a probability distribution) from the $k$th model. The cross entropy between two models can be calculated as

$$H(f_i, f_j) = -\sum_{x} f_i(x) \log f_j(x). \qquad (30)$$

The diversity-promoting regularization can be formulated as

$$\Omega(f_1, \cdots, f_M) = -\sum_{i \neq j} H(f_i, f_j). \qquad (31)$$

Then, the diversity-promoting prior by cross entropy can also be written as

$$p(W_1, \cdots, W_M) \propto \exp\Big(\gamma \sum_{i \neq j} H(f_i, f_j)\Big). \qquad (32)$$

It is well known that the larger the cross entropy is, the more different the distributions are. Therefore, under the cross entropy measurement, different models are diversified and provide more complementary information.

Both the cross entropy measurement and the negative correlation learning measurement encourage diversity of the multiple models by repulsing the obtained inference results from each other. In particular, the cross entropy measurement uses the cross entropy between pairwise distributions to encourage two distributions to be dissimilar, so that different base models can provide more complementary information.

3.4 Group-wise Correlation

3.4.1 $\ell_{1,2}$-Norm Measurement

It is well known that the $\ell_{1,2}$-norm leads to a group-wise sparse representation of $W$. It can be used to measure the correlation between different parameter factors and to diversify the learned factors so as to improve the representational ability of the model. The prior can be calculated as

(33)

where $w_{id}$ means the $d$th entry of $w_i$. The internal $\ell_1$ norm encourages different factors to be sparse, while the external $\ell_2$ norm is used to control the complexity of the entire model.

3.5 Analysis

These diversity measurements calculate the similarity between different vectors and thus encourage the diversity of the machine learning model. However, there are differences between the measurements; the details can be seen in Table 1. It can be noted from the table that all these methods take advantage of pairwise correlation except the $\ell_{1,2}$-norm measurement, which uses the group-wise correlation between different factors. Moreover, the determinantal point process, submodular spectral diversity, uncorrelation and evenness, and negative correlation learning measurements can also take advantage of the correlation among three or more factors.

Another property of these diversity measurements is scale invariance, which makes the diversity of the model invariant w.r.t. the norms of the factors. The cosine similarity measurement calculates the diversity via the angles between different vectors and is therefore scale invariant. As a special case, the cosine similarity can be used as the correlation term in the DPP, and the DPP measurement is then also scale invariant. For the divergence measurement, since the factors are constrained to sum to 1, the measurement is scale invariant. In addition, cross-entropy and negative correlation learning take advantage of the distributions output by the models, and these distributions are invariant to the scale of the factors.

Measurements Pairwise Correlation Multiple Correlation Group-wise Correlation Scale Invariant
Cosine Similarity
Determinantal Point Process
Submodular Spectral Diversity
Euclidean Distance
Heat Kernel
Divergence
Uncorrelation and Evenness
Inner Product
Cross-Entropy
Negative Correlation Learning
Table 1: Comparison of the diversity measurements in terms of whether they exploit pairwise correlation, correlation among three or more factors (multiple correlation), or group-wise correlation, and whether they are scale invariant.

These measurements can encourage diversity among different vectors. Generally, a machine learning model can be seen as a set of latent parameter factors, which can be represented as vectors. These factors are learned and used to represent the objects. In the following, we mainly summarize the use of different diversity measurements in the machine learning process to improve a model's performance.

4 Data Diversification

4.1 Diversification in Data Pre-Processing

Machine learning models are usually trained with mini-batches to accurately estimate the model. Most former works generate the mini-batches randomly. However, due to the imbalance of the training samples, redundancy may occur in the generated mini-batches, which has negative effects on the machine learning process. Different from classical stochastic gradient descent (SGD), which relies on uniformly sampling data points to form a mini-batch, Zhang et al (2017) propose a non-uniform sampling scheme based on the DPP measurement.

As Section 3.2.1 shows, DPPs provide a probability measure over every configuration of subsets on data points. Through a similarity matrix over the data and a determinant operator, DPP assigns higher probabilities to those subsets with dissimilar items. Therefore, it can give low probabilities to mini-batches which contain redundant data, and higher probabilities to mini-batches with more diverse data. This simultaneously balances the data and leads to stochastic gradients with lower variance.

Through the DPP measurement, each mini-batch contains more diverse and balanced training samples, which trains the model more effectively; the learned model can thus extract more discriminative features from the objects.
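The effect of such non-uniform sampling can be imitated with a simple greedy heuristic: repeatedly add the point that keeps the determinant of the kernel restricted to the current batch as large as possible. The sketch below is only an illustration of this greedy, DPP-style selection with an RBF kernel; it is not the sampling scheme of Zhang et al (2017).

```python
import numpy as np

def greedy_diverse_batch(X, batch_size, bandwidth=1.0):
    """Greedily build a mini-batch that keeps det(L_S) large under an RBF similarity kernel."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    L = np.exp(-sq / bandwidth) + 1e-6 * np.eye(len(X))   # similarity kernel (jitter for stability)
    selected = [0]
    while len(selected) < batch_size:
        best, best_logdet = None, -np.inf
        for i in range(len(X)):
            if i in selected:
                continue
            sign, logdet = np.linalg.slogdet(L[np.ix_(selected + [i], selected + [i])])
            if sign > 0 and logdet > best_logdet:         # near-duplicates collapse the determinant
                best, best_logdet = i, logdet
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.05, size=(50, 2)),       # an imbalanced toy set: a large cluster
               rng.normal(3.0, 0.05, size=(5, 2))])        # and a small one
print(greedy_diverse_batch(X, batch_size=4))               # the batch covers both clusters
```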

4.2 Diversification in Active Learning

As Section 2.2 shows, active learning can obtain good performance with fewer labelled training samples. However, some of the samples selected by CTED are similar to each other. These highly similar samples introduce redundancy into the training set, which decreases training efficiency and requires more training samples for comparable performance.

To select more informative and complementary samples with active learning, some prior works introduce diversity among the samples selected by CTED. Shi and Shen (2016) enhance CTED with a diversity regularizer

(34)

where the similarity matrix $S$ is introduced to model the pairwise similarities among all the samples, such that a larger value of $S_{ij}$ means higher similarity between the $i$th sample and the $j$th one. Shi and Shen (2016) use the cosine similarity measurement to formulate the diversity term. Similarly, You and Tao (2014) define the diversity term in active learning with the angles of the cosine similarity to obtain a diverse set of training samples (see Subsection 3.1.1 for details).

By adding a diversity regularization over the samples selected by active learning, more informative samples can be chosen for training. Therefore, with a limited number of labelled samples, the machine learning process can obtain performance comparable to or better than that obtained with many more training samples.
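The combination of informativeness and diversity can be illustrated with a simple greedy rule: pick the sample with the highest utility score after subtracting its largest cosine similarity to the samples already chosen. This sketch is a generic illustration, not the CTED-based algorithms of the cited papers; the informativeness vector merely stands in for a CTED-style utility score.

```python
import numpy as np

def select_diverse_informative(X, informativeness, n_select, lam=1.0):
    """Greedy selection: informativeness minus maximum cosine similarity to the chosen set."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = np.abs(Xn @ Xn.T)
    chosen = [int(np.argmax(informativeness))]
    while len(chosen) < n_select:
        scores = informativeness - lam * sim[:, chosen].max(axis=1)
        scores[chosen] = -np.inf                 # never pick the same sample twice
        chosen.append(int(np.argmax(scores)))
    return chosen

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                   # candidate unlabelled pool
informativeness = rng.uniform(size=200)          # stand-in for a CTED-style utility score
print(select_diverse_informative(X, informativeness, n_select=5))
```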

5 Model Diversification

In addition to diversifying the training samples via active learning to improve performance with fewer training samples, we can also diversify the model itself to directly improve its representational ability. As the introduction shows, machine learning methods aim to learn parameters automatically from the training samples. However, due to limited and imbalanced training samples, highly similar learned parameters lead to redundancy in the learned model and decrease its representational ability.

To solve this problem, one method is to diversify the learned parameters of a single model (D-model) so as to improve its representational ability; each parameter factor then models unique information, and the whole set of factors models a larger proportion of the information. Another method is to obtain diversified multiple models (D-models). Traditionally, if we train multiple models separately, the representations obtained from the different models tend to be similar, which leads to redundancy between them. By regularizing the multiple base models with a diversification prior, the models are enforced to repulse each other and each base model can provide more complementary information.

5.1 D-Model

The first method diversifies the parameters within a single model to directly improve its representational ability. Traditionally, the Bayesian method and the posterior regularization method can be used to impose the diversity property on the model. Different diversity-promoting priors have been proposed in prior works to measure the diversity between the learned parameter factors according to the special requirements of different tasks. This subsection introduces the ways to enforce the diversity of a model and summarizes the methods that appear in prior works.

5.1.1 Bayesian Method

Traditionally, diversity-promoting priors can be used to measure the diversification of the model. The parameters of the model can be estimated by the Bayesian method as

$$p(W|X) = \frac{p(X|W)\,p(W)}{p(X)} \propto p(X|W)\,p(W), \qquad (35)$$

where $W$ represents the factors in the machine learning model, $p(X|W)$ is the likelihood of the training set under the constructed model, and $p(W)$ stands for the prior knowledge of the learned model. For the machine learning task at hand, $p(W)$ denotes the diversity-promoting prior. Then, the machine learning task can be written as

$$W^{*} = \arg\max_{W} \ p(X|W)\,p(W). \qquad (36)$$

The log-likelihood form of the optimization can be formulated as

$$W^{*} = \arg\max_{W} \ \log p(X|W) + \log p(W). \qquad (37)$$

Then, Eq. 37 can be written as the following optimization:

$$\min_{W} \ L(W; X, Y) - \gamma\, \phi(W), \qquad (38)$$

where $L(W; X, Y)$ represents the optimization objective of the model, which can be formulated as Subsection 2.1 shows, and the diversity term $\phi(W)$ (the log of the diversity-promoting prior) aims to encourage the learned factors to be diversified.

5.1.2 Posterior Regularization Method

Generally, the regularization method can add side information into parameter estimation and thus encourage the learned factors to possess specific properties. We can also use posterior regularization to enforce the learned model to be diversified. The diversity-regularized optimization problem can be formulated as

$$\min_{W} \ L(W; X, Y) - \lambda\, \phi(W), \qquad (39)$$

where $\phi(W)$ stands for the diversity function which measures the diversification between the learned factors, $L(W; X, Y)$ represents the optimization term of the model as in Subsection 2.1, and $\lambda$ controls the tradeoff between the optimization term and the diversification term.

From Eqs. 38 and 39, we can find that posterior regularization has a similar form to the Bayesian method. In general, the optimization (38) can be transformed into the form of (39). Therefore, in the following, we summarize the diversity-promoting methods mainly in the posterior regularization form.
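As an illustration of the posterior regularization form in Eq. 39, the following sketch trains a softmax classifier by gradient descent while adding the gradient of a simple pairwise squared-inner-product penalty on the class weight vectors. The penalty is chosen only because its gradient is easy to write down; it is one possible instantiation of the diversity term, not the regularizer of any specific cited work.

```python
import numpy as np

def train_with_diversity(X, y, n_classes, lam=0.1, lr=0.1, epochs=200):
    """Softmax regression whose class weight vectors are decorrelated by a
    pairwise squared-inner-product penalty (an illustrative diversity regularizer)."""
    rng = np.random.default_rng(0)
    W = 0.01 * rng.normal(size=(n_classes, X.shape[1]))
    Y = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = X @ W.T
        P = np.exp(logits - logits.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        grad_loss = (P - Y).T @ X / len(X)       # gradient of the cross-entropy loss
        G = W @ W.T
        np.fill_diagonal(G, 0.0)                 # only cross terms are penalized
        grad_div = 4.0 * G @ W                   # gradient of sum_{i != j} (w_i . w_j)^2
        W -= lr * (grad_loss + lam * grad_div)
    return W

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))
y = rng.integers(0, 3, size=300)
W = train_with_diversity(X, y, n_classes=3)
print(np.round(W @ W.T, 3))                      # off-diagonal inner products are kept small
```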

Measurements Papers
Cosine Similarity Gong et al (2018c); Zhong et al (2017); Xiong et al (2015); Xie et al (2015b); Zhu et al (2015); Li et al (2016); Xie et al (2015c, 2017a, 2016); Xiong et al (2014); Zhao et al (2016); Rao et al (2015)
Determinantal Point Process Kwok and Adams (2012); Xie et al (2017b); Qiao et al (2015, 2017); Cohen (2015); Kulesza and Taskar (2012); Bardenet and AUEB (2015); Shotton et al (2013a); Gu and Han (2013); Gillenwater et al (2014); Kang (2013); Affandi et al (2012); Kulesza and Taskar (2010); Mariet and Sra (2015); Wachinger and Golland (2015); Zhang and Ou (2016)
Submodular Spectral Diversity Das et al (2012)
Inner Product Li et al (2016); Liu et al (2016)
Euclidean Distance Cai et al (2011); Zhang and Huan (2012); Graff and Ellen (2016)
Heat Kernel Sun et al (2017); Peng et al (2017); Belkin and Niyogi (2002)
Divergence Cai et al (2011)
Uncorrelation and Evenness Xie et al (2017c)
$\ell_{1,2}$-Norm Jiang et al (2014); Hu et al (2015); Wang et al (2015); Lang et al (2012); Zhai et al (2014); Sun et al (2016); Peng et al (2016); Parizi et al (2014); Li et al (2015)
Table 2: Overview of the most frequently used diversification methods in the D-model and the papers in which example measurements can be found.

5.1.3 Diversity Regularization

Distance-based measurements. The simplest way to formulate the diversity between different factors is the Euclidean distance. As Subsection 3.1.3 introduces, increasing the distances between different factors decreases the similarity between them. Cai et al (2011); Zhang and Huan (2012); Graff and Ellen (2016) have applied the Euclidean distance as the measurement to encourage the latent factors in machine learning to be diversified. The diversity term can be formulated as

$$\phi(W) = \frac{1}{K(K-1)}\sum_{i \neq j}\|w_i - w_j\|_2^2, \qquad (40)$$

where $K$ is the number of factors which we intend to diversify in the machine learning model. Another commonly used distance-based method to encourage diversity in machine learning is the heat kernel (Sun et al, 2017; Peng et al, 2017; Belkin and Niyogi, 2002). The diversity term by the heat kernel (Subsection 3.1.4) can be formulated as

$$\phi(W) = -\sum_{i \neq j}\exp\left(-\frac{\|w_i - w_j\|_2^2}{t}\right), \qquad (41)$$

where $t$ is a positive value. The heat kernel has the form of a Gaussian function, and the diversity penalization varies with the distance. All of these distance-based methods encourage the diversity of the model by enforcing the factors to move away from each other so that they show more difference. However, it should be noted that the Euclidean distance measurement can be significantly affected by scaling.

Angular-based measurements. To make the diversity measurement invariant to scale, some works take advantage of the angles between factors to encourage the diversity of the model. Among these works, the cosine similarity measurement is the most commonly used (Gong et al, 2018c; Zhong et al, 2017). As Subsection 3.1.1 shows, the cosine similarity measures the similarity between different vectors. In machine learning tasks, it can be used to measure the redundancy between different latent factors (Gong et al, 2018c; Zhong et al, 2017; Xiong et al, 2015; Xie et al, 2015b; Zhu et al, 2015; Li et al, 2016). The aim of the cosine similarity prior is to encourage different latent factors to be uncorrelated, such that each factor can model unique features from the samples. The diversity term can be formulated as

$$\phi(W) = -\sum_{i \neq j} \frac{|w_i^{T} w_j|}{\|w_i\|_2 \|w_j\|_2}. \qquad (42)$$

However, this diversity term is variant to orientation. To overcome this problem, many works use the angles of the cosine similarity to measure the diversity between different factors. Since the angles between different factors are invariant to translation, rotation, orientation, and scale, Xie et al (2015a, b, c) develop the angle-based diversifying method for the Restricted Boltzmann Machine. The details of the method can be seen in Subsection 3.1.1. When we impose the diversifying prior on a traditional machine learning method, the diversity term can be formulated as

$$\phi(W) = \bar{\Theta}(W) - \mathrm{var}(\Theta(W)), \qquad (43)$$

where $\bar{\Theta}(W)$ and $\mathrm{var}(\Theta(W))$ are defined as in Subsection 3.1.1. As a special form, some works also use the inner product to measure the correlation between different factors (Li et al, 2016; Liu et al, 2016). The diversity term can then be formulated as (see Subsection 3.1.2 for details)

(44)

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. It should be noted that this term is variant to scale and orientation but is easy to implement.

Eigenvalue-based measurements. Denote $G = W W^{T}$ as the kernel matrix of the latent factors. Many prior works introduce diversity into the machine learning process based on this kernel matrix. The first method is submodular spectral diversity (see Subsection 3.1.5 for details), which is based on the eigenvalues of the kernel matrix. Das et al (2012) introduce submodular spectral diversity into the process of feature selection, which aims to select a diverse set of features. Feature selection is a key component in many machine learning settings; it involves choosing a small subset of features in order to build a model that approximates the target concept well. The diversity term can be formulated as

$$\phi(W) = -\sum_{i=1}^{K}(\lambda_i - 1)^2, \qquad (45)$$

where $\lambda_1, \cdots, \lambda_K$ are the eigenvalues of $G$. In addition, Xie et al (2017c) develop an uncorrelation and evenness measurement based on the kernel matrix (see Subsection 3.1.6 for details). The basic idea is to normalize the eigenvalues into a probability simplex and encourage the discrete distribution parameterized by the normalized eigenvalues to have small Kullback-Leibler (KL) divergence with the uniform distribution. The diversity-promoting uniform eigenvalue regularizer (UER) is formulated as

(46)

where $D$ is the dimension of each factor.

Divergence measurement. Bayesian methods can also be applied to model diversification. Traditionally, divergence can be used to measure the dissimilarity between different distributions. Some works (Cai et al, 2011) use the divergence to formulate the similarity between different factors. The diversity term can be formulated as

$$\phi(W) = \sum_{i \neq j} \mathrm{KL}(w_i \| w_j). \qquad (47)$$

As Subsection 3.3.1 shows, the learned factors need to satisfy $\sum_{k} w_{ik} = 1$, which limits the application of this diversity measurement.

$\ell_{1,2}$ measurement. The $\ell_{1,2}$-norm is another popular diversity measurement (Jiang et al, 2014; Hu et al, 2015; Wang et al, 2015) and can also be used for model diversification. It can obtain a group-wise sparse representation of the latent factors $W$. The diversity term based on the $\ell_{1,2}$-norm can be formulated as

(48)

where $D$ is the dimension of each factor $w_i$. The internal $\ell_1$ norm encourages different factors to be sparse, while the external $\ell_2$ norm is used to control the complexity of the entire model.

DPP measurement. The former diversity measurements mainly focus on pairwise diversification. Different from them, the DPP measurement takes the higher-order correlation between different latent factors into consideration. As Subsection 3.2.1 shows, it encourages the learned factors to repulse each other. Therefore, the DPP-based diversifying prior yields machine learning models with a diverse set of learned factors rather than a redundant one. Some works have shown that the DPP prior is usually not arbitrarily strong for some special cases when applied to machine learning models (Lavancier et al, 2015). To make the DPP prior strong enough for all the training data, it is augmented by an additional positive parameter $\beta$ and can be reformulated as

$$p(W) \propto \big(\det(L_W)\big)^{\beta}. \qquad (49)$$

The learned factors are usually normalized, and thus the optimization for machine learning can be written as

$$\min_{W} \ L(W; X, Y) - \lambda \log\det(L_W), \qquad (50)$$

where $\log\det(L_W)$ represents the diversity term for machine learning. It should be noted that different kernels can be selected according to the special requirements of different machine learning tasks (Affandi et al, 2014).

In conclusion, numerous approaches have been proposed to diversify the learned factors in machine learning models. A summary of the most frequently encountered diversity methods is shown in Table 2. Although most papers use slightly different specifications for the diversification of the learned model, the fundamental representation of diversification is usually similar. It should also be noted that what the studied diversity methods have in common is that diversity enforced in a pairwise form between members strikes a good balance between complexity and effectiveness (Guo et al, 2017). In addition, different applications should choose the proper diversity measurement according to their specific requirements.

5.2 D-Models

The former subsection introduced ways to diversify the parameters within a single model and improve its representational ability directly. Much effort has been devoted to obtaining the maximum a posteriori (MAP) configuration in machine learning models. However, even when the training samples are sufficient, the MAP solution can be sub-optimal, and in many situations one can benefit from additional representations produced by multiple models. Unfortunately, the traditional way of training multiple models tends to provide similar representations, while the representations obtained from different models are desired to provide complementary information. Recently, many diversifying methods have been proposed to overcome this problem. Through diversification, each base model can provide more complementary information, such that a more discriminative representation can be obtained from the multiple diversified representations.

Denote $W_k$ and $f_k$ as the parameters and the inference of the $k$th model, respectively. Then, the optimization of machine learning to obtain multiple models can be written as

$$\min_{W_1, \cdots, W_M} \ \sum_{k=1}^{M} L(W_k; X_k, Y_k), \qquad (51)$$

where $L(W_k; X_k, Y_k)$ represents the optimization term of the $k$th model and $X_k$ denotes the training samples of the $k$th model. Traditionally, the training samples are randomly divided into multiple subsets and each subset trains a corresponding model. However, selecting subsets randomly may lead to redundancy between the different representations. Therefore, the first way to obtain multiple diversified models is to diversify the training samples over the different base models, which we call sample-based methods.

Another way to encourage diversification between different models is to measure the similarity between the base models with a specific similarity measurement and encourage the base models to be diversified during training. The optimization of these methods can be written as

$$\min_{W_1, \cdots, W_M} \ \sum_{k=1}^{M} L(W_k; X_k, Y_k) - \lambda\, \phi(W_1, \cdots, W_M), \qquad (52)$$

where $\phi(W_1, \cdots, W_M)$ measures the diversification between the different base models.

Finally, some other methods obtain a large number of models and select the top-$k$ as the final ensemble. In the following, we summarize the methods for diversifying multiple models from these three aspects in detail.

Methods Measurements Papers
Optimization-based Divergence Kuncheva et al (2003); Zhu and Xing (2009)
Renyi-entropy Xing and Wang (2017)
Cross Entropy Gong et al (2018b); Lee et al (2015)
Cosine Similarity Yu et al (2011)
Exclusivity Guo et al (2017)
$\ell_{1,2}$-Norm Wang et al (2015)
NCL Rosen (1996); Liu and Yao (1997); Alhamdoosh and Wang (2014)
Others Yin et al (2014b); Kuncheva et al (2003); Ho (1998); Tang et al (2006); Yu et al (2011); Giacinto and Roli (2001); Dietterich (2000)
Sample-based - Guzman-Rivera et al (2014b); Viola and Jones (2004); Zhang and Zhou (2013); Carreira-Perpinán and Raziperchikolaei (2016); Lee et al (2016); Shotton et al (2013a); Gu and Han (2013)
Ranking-based - Ahmed et al (2015)
Table 3: Overview of most frequently used diversification method in D-models and the papers in which example measurements can be found.

5.2.1 Optimization-Based Method

Optimization-based methods are among the most commonly used methods for diversifying multiple models. They obtain multiple diversified models by optimizing a given objective function which includes a diversity measurement; the optimization for these methods can be written as Eq. 52. The main problem for these methods is therefore to define diversity measurements which can calculate the difference between different models.

Many prior works (Tang et al, 2006; Yin et al, 2014b; Yu et al, 2011) have summarized pairwise diversity measurements, such as the Q-statistics measure (Kuncheva et al, 2003), the correlation coefficient measure (Kuncheva et al, 2003), the disagreement measure (Ho, 1998; Yin et al, 2014b), the double-fault measure (Giacinto and Roli, 2001; Yin et al, 2014b), the kappa statistic measure (Dietterich, 2000), the Kohavi-Wolpert variance (Tang et al, 2006), the inter-rater agreement (Tang et al, 2006), the generalized diversity (Tang et al, 2006), and the measure of "difficulty" (Tang et al, 2006). Recently, more measurements have been developed, including not only pairwise diversity measurements (Zhu and Xing, 2009; Kuncheva et al, 2003; Yu et al, 2011) but also measurements which exploit higher-order correlation and others (Wang et al, 2015; Lee et al, 2012; Alhamdoosh and Wang, 2014; Liu and Yao, 1997). This subsection summarizes these methods systematically.

Bayesian-based measurements. Similar to the D-model, Bayesian methods can also be applied to D-models. Among these Bayesian methods, divergence is a popular one (Subsection 3.3.1). The diversity-promoting term by divergence is formed by calculating the divergence between the parameters of the different models (Zhu and Xing, 2009; Kuncheva et al, 2003):

$$\phi(W_1, \cdots, W_M) = \sum_{k \neq l} \mathrm{KL}(W_k \| W_l) = \sum_{k \neq l} \sum_{i} w_{ki} \log \frac{w_{ki}}{w_{li}}, \qquad (53)$$

where $w_{ki}$ means the $i$th entry of $W_k$. In addition to the divergence measurements, the Renyi entropy, which measures the kernelized distances between the images of the samples and the center of the ensemble in a high-dimensional feature space, can also be used (Xing and Wang, 2017). The diversity-promoting term based on the Renyi entropy can be formulated as

(54)

where $\sigma$ is a positive value and $\kappa(\cdot,\cdot)$ represents the Gaussian kernel function, which can be calculated as

$$\kappa(u, v) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left(-\frac{\|u - v\|_2^2}{2\sigma^2}\right), \qquad (55)$$

where $d$ denotes the dimension of the feature vectors. Another measurement based on the Bayesian method is the cross entropy measurement (Gong et al, 2018b; Lee et al, 2015). Based on Subsection 3.3.3, the diversity-promoting term can be formulated as

$$\phi = -\sum_{k \neq l} \sum_{i=1}^{N} \sum_{c=1}^{C} f_k^{(c)}(x_i) \log f_l^{(c)}(x_i), \qquad (56)$$

where $f_k(x_i)$ is the inference of the $k$th model and $f_k^{(c)}(x_i)$ is the probability of sample $x_i$ belonging to the $c$th class. Moreover, Lee et al (2012) propose a hierarchical pair competition-based parallel genetic algorithm (HFC-PGA) to increase the diversity among the component neural networks. The diversity term by HFC-PGA can be formulated as

(57)

Another method, namely negative correlation learning (NCL) (Rosen, 1996; Liu and Yao, 1997; Alhamdoosh and Wang, 2014), tries to reduce the covariance among all the models while keeping the variance and bias terms from increasing. NCL trains the base models simultaneously in a cooperative manner that decorrelates the individual errors. The penalty term can be designed in different ways depending on whether the models are trained sequentially or in parallel. Rosen (1996) defines the penalty by decorrelating the current learning model with all previously learned models:

$$p_k(x) = \sum_{l < k} \big(f_l(x) - y\big)\big(f_k(x) - y\big). \qquad (58)$$

Define $\bar{f}(x) = \frac{1}{M}\sum_{k=1}^{M} f_k(x)$, where $M$ is the number of models. Then the penalty term can also be defined to reduce the correlation mutually among all the models by using the actual inference instead of the target function $y$ (Liu and Yao, 1997; Alhamdoosh and Wang, 2014):

$$p_k(x) = \big(f_k(x) - \bar{f}(x)\big)\sum_{l \neq k}\big(f_l(x) - \bar{f}(x)\big). \qquad (59)$$

Yin et al (2014a) also combine NCL with sparsity, where the sparsity is pursued purely by norm regularization without considering the complementary characteristics of the available base models. These Bayesian methods either take advantage of the probability distributions obtained from each base model or transform the parameters of each base model into a probability distribution in order to measure the diversity between the different models.

Cosine similarity measurement. Different from the Bayesian methods, which promote diversity from the distribution view, Yu et al (2011) introduce the cosine similarity measurement to calculate the difference between models from the geometric view. As Section 3.1.1 shows, the diversity-promoting term can be written as

$$\phi(W_1, \cdots, W_M) = -\sum_{k \neq l} \frac{|W_k^{T} W_l|}{\|W_k\|_2 \|W_l\|_2}. \qquad (60)$$

In addition, a special form of the inner product measurement, termed exclusivity, has been proposed by Guo et al (2017) to obtain diversified models (see Section 3.1.2 for details). It can jointly suppress the training error of the ensemble and enhance the diversity between the bases. The diversity-promoting term by exclusivity can be written as

$$\phi(W_1, \cdots, W_M) = -\sum_{k \neq l} \|W_k \odot W_l\|_1. \qquad (61)$$

These measurements try to encourage pairwise models to be uncorrelated, such that each base model can provide more complementary information.

$\ell_{1,2}$ measurement. As in the former subsection, the $\ell_{1,2}$ norm can also be used for the diversification of multiple models (Wang et al, 2015). The diversity-promoting regularization by the $\ell_{1,2}$ norm can be formulated as

(62)

The measurement uses the group-wise correlation between different base models and favors selecting diverse models residing in more groups.

Some other diversity measurements have been proposed for deep ensembles. Zhou et al (2002) reveal that it may be better to ensemble many instead of all of the neural networks at hand, and develop an approach named GASEN which learns a weight for each neural network; the deep ensemble is then formed based on the obtained weights. Moreover, Keshavarz-Hedayati and Dimopoulos (2017) also encourage the diversity of a deep ensemble by defining a pairwise similarity between its members.

These optimization-based methods utilize the correlation between different models and try to repulse the models from one another. The aim is to enforce the representations obtained from different models to be diversified, so that each base model can provide more complementary information.

5.2.2 Sample-Based Method

In addition to obtaining multiple diversified models from the optimization view, we can also diversify the models from the sample view. In general, we randomly divide the training set into multiple subsets, where each base model uses a specific subset as its training samples. However, this may cause redundancy between the features obtained from different models. To overcome this problem and obtain more complementary information from different models, Lee et al (2016) develops a novel method of dividing the training samples into multiple subsets. In (Lee et al, 2016), each sample is assigned to the subset whose corresponding learned model has the lowest prediction error. Therefore, each base model focuses on modeling different features. Moreover, clustering is another popular method to divide the training samples for different models (Shotton et al, 2013a). Although diversifying the obtained multiple subsets can make the multiple models provide more complementary information, the reduced number of training samples for each model caused by dividing the whole training set can negatively affect the performance.
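
A minimal sketch of this assignment idea, loosely following the lowest-prediction-error rule described above: each sample is (re)assigned to the base model that currently predicts it best, and each model is then refit on its own subset. The least-squares base learners and the alternating loop are assumptions made for illustration, not the exact procedure of Lee et al (2016).

```python
import numpy as np

def fit_least_squares(X, y):
    """Ordinary least-squares fit; returns the weight vector."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def assign_and_refit(X, y, n_models=3, n_iters=5, seed=0):
    """Alternate between assigning samples to their best model and refitting."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(n_models, size=len(y))      # random initial split
    models = [None] * n_models
    for _ in range(n_iters):
        for m in range(n_models):
            idx = np.flatnonzero(assign == m)
            if len(idx) >= X.shape[1]:                # enough samples to fit
                models[m] = fit_least_squares(X[idx], y[idx])
            elif models[m] is None:
                models[m] = np.zeros(X.shape[1])
        errors = np.stack([(X @ w - y) ** 2 for w in models])  # (M, N)
        assign = errors.argmin(axis=0)                # lowest-error model wins
    return models, assign

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 4))
    y = np.where(X[:, 0] > 0, X @ [1, 2, 0, 0], X @ [0, 0, 3, -1])
    models, assign = assign_and_refit(X, y)
    print(np.bincount(assign, minlength=3))           # subset sizes per model
```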

To overcome this problem, another way to enforce different models to be diversified is to assign each sample a different weight (Guzman-Rivera et al, 2014b). By training different models with differently weighted samples, each base model can focus on complementary information from the samples. The detailed steps in (Guzman-Rivera et al, 2014b) are as follows: first, define the weights over the training samples randomly and train a model with the given weights; second, revise the weights of the training samples based on the loss of the obtained model and train the second model with the updated weights; finally, repeat this reweighting strategy to train the remaining models.
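
The reweighting strategy can be sketched as follows; the choice of base learner, the squared loss, and the loss-proportional reweighting rule are illustrative assumptions rather than the exact scheme of Guzman-Rivera et al (2014b).

```python
import numpy as np

def train_weighted_ensemble(X, y, n_models=3, seed=0):
    """Sequentially train base models, upweighting poorly fit samples."""
    rng = np.random.default_rng(seed)
    weights = rng.random(len(y))                # random initial sample weights
    weights /= weights.sum()
    models = []
    for _ in range(n_models):
        # weighted least squares: scale rows by the square root of the weights
        sw = np.sqrt(weights)[:, None]
        w, *_ = np.linalg.lstsq(X * sw, y * sw[:, 0], rcond=None)
        models.append(w)
        loss = (X @ w - y) ** 2                 # per-sample loss of this model
        weights = loss / (loss.sum() + 1e-12)   # focus the next model on hard samples
    return models

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 3))
    y = X @ [1.0, -2.0, 0.5] + 0.1 * rng.normal(size=100)
    print(len(train_weighted_ensemble(X, y)))
```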

The former methods take advantage of the labelled training samples to enforce the diversity of multiple models. There exists another method, namely UDEED (Zhang and Zhou, 2013), which uses unlabelled samples to promote the diversity of the models. Unlike existing semi-supervised ensemble methods, where error-prone pseudo-labels are estimated for unlabelled data to enlarge the labelled set and improve accuracy, UDEED works by maximizing the accuracies of the base models on labelled data while maximizing the diversity among them on unlabelled data.

Moreover, Carreira-Perpinán and Raziperchikolaei (2016) combines different initializations, different training sets, and different feature subsets to encourage the diversity of the multiple models.

The methods in this subsection operate on the training sets to diversify different models. By training different models with different training samples, or with samples weighted differently, these models provide different information and thus the whole ensemble covers a larger proportion of the information.

5.2.3 Ranking-Based Method

Another kind of method to promote diversity in the obtained multiple models is the ranking-based method. All the models are first ranked according to some criterion, and then the top-ranked ones are selected to form the final ensemble. Here, Ahmed et al (2015) focuses on pruning techniques based on forward/backward selection, since they allow a direct comparison with the simple estimation of accuracy from different models.
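
A small sketch of forward selection for ensemble pruning, under the assumption that we score a candidate subset by the validation accuracy of its majority vote and greedily add whichever model improves it most; this is a generic instance of the idea rather than the exact procedure of Ahmed et al (2015).

```python
import numpy as np

def majority_vote(preds):
    """preds: (M, N) integer class predictions; returns the voted labels."""
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)

def forward_select(preds, y_val, k):
    """Greedily pick up to k models whose joint vote maximizes validation accuracy."""
    selected, remaining = [], list(range(preds.shape[0]))
    best_acc = -1.0
    while remaining and len(selected) < k:
        scores = []
        for m in remaining:
            voted = majority_vote(preds[selected + [m]])
            scores.append((voted == y_val).mean())
        best = int(np.argmax(scores))
        if scores[best] <= best_acc:        # stop when no model helps
            break
        best_acc = scores[best]
        selected.append(remaining.pop(best))
    return selected, best_acc

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    y_val = rng.integers(0, 3, size=50)
    # simulate 6 noisy base classifiers that are right about 70% of the time
    preds = np.where(rng.random((6, 50)) < 0.7, y_val, rng.integers(0, 3, (6, 50)))
    print(forward_select(preds, y_val, k=3))
```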

Clustering can also be used as a ranking-based method to enforce the diversity of multiple models (Gu and Han, 2013). In (Gu and Han, 2013), the models are first clustered based on the similarity of their predictions, then each cluster is pruned to remove redundant models, and finally the remaining models in each cluster are combined as the base models.

In addition to the aforementioned methods, Viola and Jones (2004) provides multiple diversified models by selecting different sets of features. Through multi-scale or other tricks, each sample provides a large number of candidate features, from which the top-ranked ones are chosen as the base features (see Viola and Jones (2004) for details). Each base feature is then used to train a specific model, and the final inference is obtained through the combination of these models.

In summary, this paper summarizes the diversification of multiple models from three aspects: optimization-based methods, sample-based methods, and ranking-based methods. The details of the most frequently encountered diversity methods are shown in Table 3. Optimization-based methods encourage the multiple models to be diversified by imposing diversity regularization between different base models while optimizing these models. In contrast, sample-based methods obtain diversified models by training different models with specific training sets. Most methods to diversify multiple models focus on these two aspects, while the ranking-based methods obtain multiple diversified models by choosing the top-ranked models.

6 Inference Diversification

The former section summarizes the methods to diversify the parameters within a model or between models. The D-model focuses on the diversification of parameters within a single model and improves the representational ability of the model itself, while D-models tries to obtain multiple diversified models by diversifying the parameters between different models. In addition, many methods focus on obtaining multiple diversified choices directly, which we call inference diversification. This part summarizes these methods for inference diversification under graphical models.

We consider a set of discrete random variables $\mathbf{y} = \{y_1, \dots, y_n\}$, each taking values in a finite label set $\mathcal{L}$. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be a graph defined over these variables, and let $\mathcal{Y}$ denote the Cartesian product of the label sets of the variables. Let $\theta_i(\cdot)$ ($i \in \mathcal{V}$) and $\theta_{ij}(\cdot, \cdot)$ ($(i,j) \in \mathcal{E}$) be functions defining the energy at each node and edge for the labelling of the variables in their scope. The goal of MAP inference is to find the labelling of the variables that minimizes this real-valued energy function:

$\mathbf{y}^{*} = \arg\min_{\mathbf{y} \in \mathcal{Y}} E(\mathbf{y}) = \arg\min_{\mathbf{y} \in \mathcal{Y}} \left[ \sum_{i \in \mathcal{V}} \theta_i(y_i) + \sum_{(i,j) \in \mathcal{E}} \theta_{ij}(y_i, y_j) \right]$  (63)

Traditional methods to obtain multiple results try to solve the following optimization:

$\mathbf{y}^{(m)} = \arg\min_{\mathbf{y} \in \mathcal{Y} \setminus \{\mathbf{y}^{(1)}, \dots, \mathbf{y}^{(m-1)}\}} E(\mathbf{y})$  (64)

However, the obtained second-best choice will typically be a one-pixel-shifted version of the best one (Park and Ramanan, 2011). In other words, the next best choices will almost certainly be located on the upper slope of the peak corresponding to the most confident detection, while other peaks may be ignored entirely. To overcome this problem, many methods, such as diversity-promoting multiple choice learning (D-MCL), submodular diversification, M-modes, and M-NMS, have been introduced. These methods try to diversify the obtained choices (so that they do not overlap under a user-defined criterion) while still obtaining high scores on the optimization term.

Measurements Papers
Diversity-Promoting Multiple Choice Learning (D-MCL) Guzman-Rivera et al (2012); Batra et al (2012); Kirillov et al (2015); Yadollahpour et al (2013); Guzman-Rivera et al (2014a); Gimpel et al (2013)
Submodular for Diversification Prasad et al (2014a); Kirillov et al (2016); Nemhauser et al (1978)
M-modes Chen et al (2013)
M-NMS Blaschko (2011); Felzenszwalb et al (2010); Stephens et al (2013)
DPP Azadi et al (2017)
Table 4: Overview of most frequently used inference diversification methods and the papers in which example measurements can be found.

6.1 Diversity-Promoting Multiple Choice Learning (D-MCL)

The D-MCL methods try to find a diverse set of highly probable solutions under a discrete probabilistic model. Given a dissimilarity function measuring the difference between pairwise choices, the formulation maximizes a linear combination of the probability of each choice and its dissimilarity to the previous choices. Even if the MAP solution alone is of poor quality, a diverse set of highly probable hypotheses might still enable accurate predictions. The goal of D-MCL is to produce a diverse set of low-energy solutions.

The first method approaches the problem with a greedy algorithm, where the next choice is defined as the lowest-energy state with at least some minimum dissimilarity from the previously chosen choices. To do so, we assume access to a dissimilarity function $\Delta(\cdot, \cdot)$. In order to find the diverse, low-energy labellings $\mathbf{y}^{(1)}, \dots, \mathbf{y}^{(M)}$, the method proceeds by solving a sequence of problems of the form (Batra et al, 2012; Yadollahpour et al, 2013; Guzman-Rivera et al, 2014a; Gimpel et al, 2013)

$\mathbf{y}^{(m)} = \arg\min_{\mathbf{y} \in \mathcal{Y}} \left[ E(\mathbf{y}) - \lambda \sum_{j=1}^{m-1} \Delta\left(\mathbf{y}, \mathbf{y}^{(j)}\right) \right]$  (65)

for $m = 2, \dots, M$, where $\lambda$ determines a trade-off between diversity and energy, $\mathbf{y}^{(1)}$ is the MAP solution, and the function $\Delta(\cdot, \cdot)$ defines the diversity of two labellings. In other words, $\Delta(\mathbf{y}, \mathbf{y}')$ takes a large value if $\mathbf{y}$ and $\mathbf{y}'$ are diverse, and a small value otherwise. As a special case, the M-best MAP problem is obtained when $\Delta$ is a 0-1 dissimilarity (i.e. $\Delta(\mathbf{y}, \mathbf{y}') = \mathbb{1}[\mathbf{y} \neq \mathbf{y}']$).
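
The following toy sketch applies the greedy rule of Eq. (65) on a tiny chain model with a Hamming-distance dissimilarity; the energy function, the value of lambda, and the brute-force enumeration of the label space are illustrative assumptions (real implementations solve each step with MAP inference rather than enumeration).

```python
import itertools
import numpy as np

def energy(y, unary):
    """Toy chain energy: unary costs plus a Potts smoothness term."""
    u = sum(unary[i][yi] for i, yi in enumerate(y))
    pair = sum(yi != yj for yi, yj in zip(y, y[1:]))
    return u + 0.5 * pair

def hamming(y, z):
    return sum(a != b for a, b in zip(y, z))

def diverse_m_best(unary, n_labels, m, lam=1.0):
    """Greedy diverse M-best: minimize energy minus lambda times diversity."""
    n = len(unary)
    space = list(itertools.product(range(n_labels), repeat=n))
    chosen = []
    for _ in range(m):
        def score(y):
            return energy(y, unary) - lam * sum(hamming(y, z) for z in chosen)
        chosen.append(min(space, key=score))   # next most diverse low-energy labelling
    return chosen

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    unary = rng.random((5, 3))                 # 5 chain nodes, 3 labels each
    for y in diverse_m_best(unary, n_labels=3, m=3):
        print(y)
```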

In contrast to the former form, the second method formulates the $M$-best diverse problem as a single energy minimization problem (Kirillov et al, 2015). Instead of the greedy sequential procedure in (65), this method suggests inferring all $M$ labellings jointly, by minimizing

$\min_{\mathbf{y}^{(1)}, \dots, \mathbf{y}^{(M)}} \sum_{m=1}^{M} E\left(\mathbf{y}^{(m)}\right) - \lambda\, \Delta^{M}\left(\mathbf{y}^{(1)}, \dots, \mathbf{y}^{(M)}\right)$  (66)

where $\Delta^{M}(\cdot)$ defines the total diversity of the $M$ labellings. To achieve this, one first creates $M$ copies of the initial model. Three specific diversity measures are introduced. The split-diversity measure is written as a sum of pairwise diversities, i.e. terms penalizing pairs of labellings (Kirillov et al, 2015)

$\Delta^{M}\left(\mathbf{y}^{(1)}, \dots, \mathbf{y}^{(M)}\right) = \sum_{1 \le m < m' \le M} \Delta\left(\mathbf{y}^{(m)}, \mathbf{y}^{(m')}\right)$  (67)

The node-diversity measure is defined as (Kirillov et al, 2015)

$\Delta^{M}\left(\mathbf{y}^{(1)}, \dots, \mathbf{y}^{(M)}\right) = \sum_{i \in \mathcal{V}} \Delta_i\left(y_i^{(1)}, \dots, y_i^{(M)}\right)$  (68)

Finally, the node-split-diversity measure is the special case that is both a split-diversity and a node-diversity measure (Kirillov et al, 2015)

$\Delta^{M}\left(\mathbf{y}^{(1)}, \dots, \mathbf{y}^{(M)}\right) = \sum_{i \in \mathcal{V}} \sum_{1 \le m < m' \le M} \Delta_i\left(y_i^{(m)}, y_i^{(m')}\right)$  (69)

The D-MCL methods try to find multiple choices under a dissimilarity function. This yields choices with larger differences and thus more diversity. However, the obtained choices may not be local optima, and there may exist other choices which could represent the objects better than the obtained ones.

6.2 Submodular for Diversification

The problem of searching for a diverse but high-quality subset of items in a ground set has been studied in information retrieval, web search, sensor placement, document summarization, viral marketing, and robotics. In many of these works, an effective, theoretically grounded, and practical tool for measuring the diversity of a set is the class of submodular set functions. Submodularity is a property defined through marginal gains: a set function $F$ is submodular when its marginal gains are decreasing, i.e. $F(A \cup \{\mathbf{y}\}) - F(A) \ge F(B \cup \{\mathbf{y}\}) - F(B)$ for all $A \subseteq B$ and $\mathbf{y} \notin B$. In addition, if $F$ is monotone, i.e. $F(A) \le F(B)$ whenever $A \subseteq B$, then a simple greedy algorithm that iteratively picks the element with the largest marginal gain to add to the current set achieves the best possible approximation bound of $(1 - 1/e)$ (Nemhauser et al, 1978). This result has had significant practical impact. Unfortunately, if the number of items is exponentially large, then even a single linear scan for greedy augmentation is infeasible.

Denote by $Y \subseteq \mathcal{Y}$ the set of selected choices. The diversification is measured by a monotone, nondecreasing, and normalized submodular function $D(Y)$. Then, we aim to find the $M$ configurations maximizing the combined score (Prasad et al, 2014a)

$\max_{Y \subseteq \mathcal{Y},\, |Y| = M} F(Y) = -\sum_{\mathbf{y} \in Y} E(\mathbf{y}) + \lambda D(Y)$  (70)

The optimization can be solved by a greedy algorithm that starts out with $Y_0 = \emptyset$ and iteratively adds the choice with the largest marginal gain (Prasad et al, 2014a):

$\mathbf{y}^{(m+1)} = \arg\max_{\mathbf{y} \in \mathcal{Y}} \left[ F\left(Y_m \cup \{\mathbf{y}\}\right) - F\left(Y_m\right) \right]$  (71)

where $Y_{m+1} = Y_m \cup \{\mathbf{y}^{(m+1)}\}$. The selected set $Y_M$ is within a factor of $(1 - 1/e)$ of the optimal solution $Y^{*}$:

$F(Y_M) \ge \left(1 - \frac{1}{e}\right) F\left(Y^{*}\right)$  (72)

The submodular approach takes advantage of the maximization of marginal gains to find multiple choices which provide the most complementary information.
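
A compact sketch of the greedy maximization in Eq. (71) for a finite pool of candidates: the quality term is a non-negative per-item score (for example, negated and shifted energies) and the diversity term is a facility-location coverage function, both chosen only for illustration; any monotone submodular D(Y) could be substituted.

```python
import numpy as np

def facility_location(selected, sim):
    """Coverage term: how well the selected items represent all items."""
    if not selected:
        return 0.0
    return sim[:, selected].max(axis=1).sum()

def greedy_submodular(quality, sim, m, lam=1.0):
    """Greedy selection of m items maximizing quality + lam * coverage.

    quality: (N,) non-negative per-item scores.
    sim:     (N, N) non-negative similarity matrix between candidates.
    The objective is monotone submodular, so the greedy solution is within
    a (1 - 1/e) factor of the optimum (Nemhauser et al, 1978).
    """
    selected, current = [], 0.0
    for _ in range(m):
        gains = []
        for j in range(len(quality)):
            if j in selected:
                gains.append(-np.inf)
                continue
            value = quality[selected + [j]].sum() + lam * facility_location(selected + [j], sim)
            gains.append(value - current)       # marginal gain of adding item j
        best = int(np.argmax(gains))
        current += gains[best]
        selected.append(best)
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    feats = rng.normal(size=(20, 4))
    sim = np.exp(-((feats[:, None] - feats[None]) ** 2).sum(-1))  # RBF similarities
    quality = rng.random(20)
    print(greedy_submodular(quality, sim, m=5))
```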

6.3 M-NMS

Another way to obtain multiple diversified choices is M-best non-maximum suppression (M-NMS). M-NMS is typically defined in an algorithmic way: starting from the MAP prediction, one goes through all labellings in increasing order of energy. A labelling becomes part of the predicted set if and only if it is more than a threshold away from the ones chosen before, where the threshold is defined by the user to judge whether two labellings are similar. M-NMS therefore guarantees that the choices are apart from each other, and it is typically implemented by a greedy algorithm (Felzenszwalb et al, 2010; Barinova et al, 2012; Desai et al, 2011).

A simple greedy algorithm for instantiating multiple choices is used: search over the exponentially large space of choices for the maximally scoring choice, instantiate it, remove all choices which overlap with it, and repeat. The process is repeated until the score of the next-best choice falls below a threshold or $M$ choices have been instantiated. However, a naive implementation of such an algorithm would take exponential time.
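
A generic sketch of the greedy M-NMS loop for a finite list of scored candidates (here, detection boxes); the IoU overlap measure and the thresholds are assumptions made for the example.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-12)

def m_nms(boxes, scores, m, overlap_thresh=0.5, score_thresh=0.0):
    """Greedy M-NMS: repeatedly keep the best-scoring box that does not
    overlap (by more than overlap_thresh) with previously kept boxes."""
    order = np.argsort(scores)[::-1]           # candidates by decreasing score
    kept = []
    for idx in order:
        if scores[idx] < score_thresh or len(kept) >= m:
            break
        if all(iou(boxes[idx], boxes[k]) <= overlap_thresh for k in kept):
            kept.append(idx)
    return kept

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    corners = rng.random((30, 2)) * 10
    boxes = np.hstack([corners, corners + 2.0])   # 30 boxes of size 2x2
    scores = rng.random(30)
    print(m_nms(boxes, scores, m=5))
```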

The M-NMS method tries to find the M best choices by discarding similar choices from the candidate set. In conclusion, D-MCL, submodular diversification, and M-NMS share a similar idea: all of them try to find the M best choices under a dissimilarity function, or the ones which can provide the most complementary information.

6.4 M-modes

Even though the former three methods guarantee that the obtained multiple choices are apart from each other, the choices are typically not local extrema of the probability distribution. To guarantee both the local optimality and the diversification of the obtained multiple choices, the problem can be transformed into M-modes. M-modes have multiple possible applications, because they are intrinsically diverse. This is different from M-best predictions, whose main shortcoming is a lack of diversity, because the choice space of the model is typically very large and fine-grained.

For a non-negative integer $\delta$, define the $\delta$-neighborhood of a labelling $\mathbf{y}$ to be the set of labellings whose distance from $\mathbf{y}$ is no more than $\delta$, i.e. $N_{\delta}(\mathbf{y}) = \{\mathbf{y}' : d(\mathbf{y}', \mathbf{y}) \le \delta\}$, where $d(\cdot, \cdot)$ measures the distance between two labellings, for which we can choose the Hamming distance. We call a labelling $\mathbf{y}$ a local minimum (mode) of the energy function $E$ iff $E(\mathbf{y}) \le E(\mathbf{y}')$ for all $\mathbf{y}' \in N_{\delta}(\mathbf{y})$.

Given $\delta$, the set of modes is denoted by $\mathcal{M}_{\delta}$; formally (Chen et al, 2013),

$\mathcal{M}_{\delta} = \left\{ \mathbf{y} \in \mathcal{Y} : E(\mathbf{y}) \le E(\mathbf{y}'),\ \forall\, \mathbf{y}' \in N_{\delta}(\mathbf{y}) \right\}$  (73)

As $\delta$ increases from zero to infinity, the $\delta$-neighborhood of each labelling monotonically grows and the set of modes monotonically shrinks. Therefore, the sets $\mathcal{M}_{\delta}$ form a nested sequence (Chen et al, 2013),

$\mathcal{Y} = \mathcal{M}_{0} \supseteq \mathcal{M}_{1} \supseteq \mathcal{M}_{2} \supseteq \cdots \supseteq \mathcal{M}_{\infty}$  (74)

Thus, the problem has been transformed into M-modes: compute the $M$ labellings with minimal energies in $\mathcal{M}_{\delta}$.

Chen et al (2013) validates that a labelling is a mode if and only if it behaves like a “local mode” everywhere; thus a new chain can be constructed and the M-modes problem is reduced to an M-best problem on the new chain.

Furthermore, it also validates a one-to-one, cost-preserving correspondence between the consistent configurations of the new chain and the set of modes $\mathcal{M}_{\delta}$. Therefore, the problem of computing the best modes is transferred to the problem of computing the best configurations in the new chain.

Different from the former three methods, M-modes obtains M choices which are local extrema of the optimization, and thus the M choices can contain the most complementary information.
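
For intuition, the brute-force sketch below enumerates a tiny chain model, keeps the labellings that are local minima of the energy within a Hamming ball of radius delta, and returns the M lowest-energy ones; the toy energy and the exhaustive enumeration are assumptions that are only feasible at this scale (Chen et al (2013) instead give an efficient reduction to an M-best problem on a new chain).

```python
import itertools
import numpy as np

def energy(y, unary):
    """Toy chain energy: unary costs plus a Potts smoothness term."""
    u = sum(unary[i][yi] for i, yi in enumerate(y))
    pair = sum(yi != yj for yi, yj in zip(y, y[1:]))
    return u + 0.5 * pair

def hamming(y, z):
    return sum(a != b for a, b in zip(y, z))

def m_modes(unary, n_labels, m, delta=1):
    """Brute-force M-modes: M lowest-energy labellings that are local minima
    of the energy within their Hamming delta-neighborhood."""
    n = len(unary)
    space = list(itertools.product(range(n_labels), repeat=n))
    energies = {y: energy(y, unary) for y in space}
    modes = [
        y for y in space
        if all(energies[y] <= energies[z]
               for z in space if 0 < hamming(y, z) <= delta)
    ]
    modes.sort(key=lambda y: energies[y])      # lowest-energy modes first
    return modes[:m]

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    unary = rng.random((4, 3))                 # 4 chain nodes, 3 labels each
    for y in m_modes(unary, n_labels=3, m=3):
        print(y)
```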

6.5 DPP

Different from the former methods, which obtain multiple diversified choices by repelling different choices from each other, Azadi et al (2017) proposes a differentiable DPP layer to predict a set of diverse and informative proposals with enriched representations.

A key step in object detection is to find multiple proposals for further detection. From Section 3.2.1, it can be noted that the DPP is a distribution over subsets of a fixed ground set which prefers diverse subsets. Therefore, Azadi et al (2017) develops a DPP loss to find a subset of diverse bounding boxes using the outputs of the other two loss functions (namely, the probability of each proposal belonging to each object category as well as the location information of the proposals), which reinforces them in finding more accurate object instances in the end. The DPP loss is employed to maximize the likelihood of an accurate selection given the pool of overlapping background and non-background boxes over multiple categories (Azadi et al, 2017).
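
To give a feel for how a DPP prefers diverse subsets, the sketch below greedily (and approximately) maximizes the log-determinant of an L-ensemble kernel built from proposal quality scores and pairwise similarities, following the standard quality/diversity decomposition of DPPs (Kulesza and Taskar, 2012); the features, scores, and greedy selection are assumptions made for illustration, and this is not the trainable DPP layer of Azadi et al (2017).

```python
import numpy as np

def build_kernel(quality, feats):
    """Standard quality/diversity decomposition: L = diag(q) S diag(q)."""
    U = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    S = U @ U.T                                  # cosine similarity matrix
    return np.outer(quality, quality) * S

def greedy_dpp_map(L, m):
    """Greedily add the item giving the largest gain in log det(L_Y)."""
    selected = []
    for _ in range(m):
        best, best_val = None, -np.inf
        for j in range(L.shape[0]):
            if j in selected:
                continue
            idx = selected + [j]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            val = logdet if sign > 0 else -np.inf
            # log det trades high quality against redundancy, so
            # near-duplicate proposals are penalized
            if val > best_val:
                best, best_val = j, val
        if best is None:
            break
        selected.append(best)
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(8)
    feats = rng.normal(size=(25, 8))             # proposal descriptors
    quality = 0.5 + rng.random(25)               # e.g. objectness scores
    L = build_kernel(quality, feats)
    print(greedy_dpp_map(L, m=5))
```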

6.6 Analysis

Even though all the methods in the former subsections can be used for inference diversity, there exist some differences between them. These methods in prior works are summarized in Table 4. It can be noted from the former subsections that the D-MCL is the easiest to implement and is commonly seen in prior works: one only needs to compute the MAP choice and obtain further choices by constraining the optimization with a dissimilarity function. In contrast, the M-NMS discards the choices that lie in the neighborhood of the previously selected ones and obtains further choices from the remainder. The D-MCL and M-NMS obtain choices by solving the optimization with a user-defined similarity, while the submodular method tries to obtain the choices which can provide the maximal marginal and complementary information. The former three methods may provide choices which are not local optima, even though locally optimal choices contain more information than others. Therefore, different from the former three methods, the M-modes method tries to obtain multiple diversified choices which are also local optima. All the methods above can be used in traditional machine learning methods. The DPP method for inference diversification, which is first proposed by Azadi et al (2017), is mainly applied in deep learning models; the method constructs a DPP layer as well as a DPP loss for the implementation in deep learning models. From the introduction of data diversification, D-model, D-models, and inference diversification, one could choose the proper method for diversification of machine learning in various computer vision tasks. In the following, we will introduce some applications of diversity technology in machine learning models.

7 Applications

Diversity technology in machine learning can significantly improve the representational ability of the model in many computer vision tasks, including remote sensing imaging tasks (Gong et al, 2018c; Zhong et al, 2017; Gong et al, 2018b, a), machine translation (Gimpel et al, 2013; Li and Jurafsky, 2016), camera relocalization (Guzman-Rivera et al, 2014b; Shotton et al, 2013b), natural image segmentation (Yadollahpour et al, 2013; Batra et al, 2012; Guzman-Rivera et al, 2014a), object detection (Blaschko, 2011; Felzenszwalb et al, 2010), topic modeling (Cohen, 2015), and so on. The diversity priors, which decrease the redundancy in the learned model or diversify the obtained multiple choices, can provide more informative features and show powerful ability in real-world applications, especially for computer vision tasks with limited training samples and complex structures in the training samples. In the following, the applications of diversity technology in machine learning are introduced in detail.

7.1 Remote Sensing Imaging Tasks

Remote sensing images, including hyperspectral images, multi-spectral images, and so on, have played an increasingly important role in the past two decades (Bioucas-Dias et al, 2013). However, there are typical difficulties in remote sensing imaging tasks. First, the limited number of training samples usually makes it difficult to represent the images: since labelling is usually time-consuming and costly, we usually cannot provide enough training samples to train the model. Second, remote sensing images usually have large intra-class variance and low inter-class variance, which makes it difficult to extract discriminative features from the images. Finally, deep models with large amounts of parameters are usually used to model remote sensing images. These models usually perform better than shallow models, but the limited training samples usually lead the learned model to be sub-optimal.

To overcome these problems, some works have applied diversity-promoting priors to diversify the model (Gong et al, 2018c; Zhong et al, 2017). In (Gong et al, 2018c), the independence prior, which is based on the cosine similarity introduced in the former section, is imposed on a special deep structural metric learning method for remote sensing scene classification. Zhong et al (2017) imposes the independence prior on a DBN for hyperspectral image classification. From (Xiong et al, 2014), we can find that the diversity-promoting prior is effective for the traditional RBM model; the DBN, which is a stack of RBMs, can also be diversified for better representation.

Some other works focus on the diversification of multiple models for remote sensing images (Gong et al, 2018b, a). Gong et al (2018b) applies a cross-entropy measurement to diversify the obtained multiple models so that they provide more complementary information. Different from (Gong et al, 2018b), Gong et al (2018a) divides the training samples into several subsets for the different models separately. Then, each model focuses on the representation of different classes and the overall representation of these models can be improved.

7.2 Machine Translation

Machine translation (MT) is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another. Recently, machine translation systems have been developed and widely used in real-world applications. Commercial machine translation services, such as Google translator, Microsoft translator, and Baidu translator, have achieved great success. From the perspective of user interaction, the ideal machine translator is an agent that reads documents in one language and produces accurate, high-quality translations in another. This interaction ideal has been implicit in machine translation (MT) research since the field's inception. Unfortunately, when a real, imperfect MT system makes an error, the user is left trying to guess what the original sentence means. Therefore, to overcome this problem, providing the M best translations instead of a single best one is necessary (Venugopal et al, 2008).

However, in MT, many translations on M-best lists are extremely similar, often differing only by a single punctuation mark or minor morphological variation. We argue that the implicit goal behind these technologies is to better explore the output space by introducing diversity into the surrogate set.

Some prior works have introduced diversity into the obtained multiple choices and achieved better performance (Li and Jurafsky, 2016). Gimpel et al (2013) develops the method to diversify multiple choices which is introduced in Subsection 6.1: the authors define a novel dissimilarity function over different translations to increase the diversity between the obtained translations.

7.3 Camera Relocalization

Camera relocalization aims to estimate the pose of a camera relative to a known 3D scene from a single RGB-D frame (Shotton et al, 2013b). It can be formulated as the inversion of the generative rendering procedure, i.e. finding the camera pose corresponding to a rendering of the 3D scene model that is most similar to the observed input. Since the problem is a non-convex optimization problem with many local optima, one way to solve it is to find a set of M predictors which generate M camera pose hypotheses and then infer the best pose from the multiple hypotheses. As in traditional M-best problems, the obtained M predictors are usually similar.

To overcome this problem and obtain hypotheses that are different from each other, Guzman-Rivera et al (2014b) tries to learn ‘marginally relevant’ predictors, which make complementary predictions, and compares their performance when used with different selection procedures. In (Guzman-Rivera et al, 2014b), a greedy algorithm is used to obtain multiple diversified models: different weights are defined on each training sample, and the weights are updated according to the training loss of the previously learned model. Finally, multiple diversified models can be obtained for camera relocalization.

7.4 Image Segmentation

In computer vision, image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels). The goal of segmentation is to simplify and change the representation of an image into something that is more meaningful and easier to analyze. More precisely, image segmentation is the process of assigning a label to each pixel in an image such that pixels with the same label share certain characteristics. Since a semantic segmentation algorithm deals with a tremendous amount of uncertainty from inter- and intra-object occlusion and varying appearance, lighting, and pose, obtaining multiple best choices from all possible segmentations is one possible way to address the problem. Therefore, the image segmentation problem can be transformed into an M-best problem. However, as in the traditional M-best problem, the obtained multiple choices are usually similar and the information provided to the user tends to be redundant.

The way to solve this problem is to introduce diversity into the training process to encourage the multiple choices to be diverse. Many works (Yadollahpour et al, 2013; Batra et al, 2012; Guzman-Rivera et al, 2014a, 2012; Prasad et al, 2014a; Ramakrishna and Batra, 2012; Sun and Batra, 2015; Lee et al, 2016; Prasad et al, 2014b) have introduced diversity into image segmentation tasks in different ways. Batra et al (2012) first introduces the D-MCL of Subsection 6.1 for image segmentation. Yadollahpour et al (2013); Guzman-Rivera et al (2012) combine D-MCL with reranking, which provides a way to obtain multiple diversified choices and select the proper one from them. Prasad et al (2014a, b) use submodularity to measure the diversification between multiple choices. Sun and Batra (2015) combines NMS (see details in Subsection 6.3) with a sliding window to obtain multiple choices. The former works mainly focus on obtaining diversified multiple choices, while Lee et al (2016) tries to obtain multiple models. The method proposed by Lee et al (2016) divides the training samples into several subsets, where each base model is trained with a specific one. Through allocating each training sample to the model with the lowest prediction error, each model tends to model classes different from the others.

7.5 Object Detection

Object detection is a computer vision task which deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos. Similar to image segmentation tasks, great uncertainty is contained in object detection algorithms. Therefore, obtaining multiple diversified choices is also an important way to solve the problem. Some prior works (Blaschko, 2011; Felzenszwalb et al, 2010) have made attempts to obtain multiple diversified choices. Blaschko (2011) demonstrates that the energies resulting from M-NMS lead to the maximization of a submodular function, and then, through a branch-and-bound strategy, the whole image can be explored and multiple diversified detections can be obtained.

7.6 Topic Modeling

In machine learning and natural language processing, a topic model is a statistical model for discovering the abstract “topics” that occur in a collection of documents. Probabilistic topic models such as Latent Dirichlet Allocation (LDA) provide a useful and elegant tool for discovering hidden structure within large sets of discrete data, such as corpora of text. However, LDA implicitly discovers topics along only a single dimension. Recent research on multi-dimensional topic modeling aims to devise techniques that can discover multiple groups of topics, where each group models a different dimension or aspect of the data.

Cohen (2015) presents a new multi-dimensional topic model that uses a determinantal point process prior (see details in subsection 6.1) to encourage different groups of topics to model different dimensions of the data. Determinantal point processes are probabilistic models of repulsive phenomena which originated in statistical physics but have recently seen interest from the machine learning community.

8 Discussions

This article surveyed the available work on diversity technology in general supervised machine learning models, by systematically categorizing diversity in the training samples, the D-model, D-models, and inference diversity. Here we summarize the main results and identify the challenges encountered throughout the article.

Machine learning methods have shown powerful ability in real-world applications. Due to the limited number and imbalance of training samples in many tasks, diversity technology can be used. We want to emphasize that diversity technology is not decisive on its own: it aims to improve the training process of machine learning methods.

Advice for implementation. We expect this article to be useful to researchers who want to improve the representational ability of machine learning models for computer vision tasks. For a given computer vision task, the proper machine learning model should be chosen first. Then, we advise considering whether diversity-promoting priors can improve the performance of the model, and what type of diversity measurement is desired. We also advise the reader to consider whether multiple models or multiple choices can be helpful for the performance; when one desires to obtain multiple models or multiple choices, one can consider diversifying the multiple models or the obtained multiple choices, and Sections 5.2 and 6 will be relevant and helpful.

9 Conclusions

The training of machine learning models requires large amounts of labelled samples, but the limited training samples constrain the performance of machine learning models. Therefore, effective diversity technology, which can encourage the model to be diversified and improve the representational ability of the model, is expected to remain an active area of research in machine learning tasks. This paper summarizes the diversity technology for machine learning in previous work. We introduce diversity technology in data pre-processing, model training, and inference, respectively. Other researchers can judge whether diversity technology is needed and choose the proper diversity method for their specific requirements according to the introductions in the former sections.

Acknowledgements.
This work was supported in part by the Natural Science Foundation of China under Grant 61671456 and 61271439, in part by the Foundation for the Author of National Excellent Doctoral Dissertation of China (FANEDD) under Grant 201243, and in part by the Program for New Century Excellent Talents in University under Grant NECT-13-0164.

References

  • Affandi et al (2012) Affandi RH, Kulesza A, Fox EB (2012) Markov determinantal point processes. arXiv preprint arXiv:12104850
  • Affandi et al (2014) Affandi RH, Fox EB, Adams RP, Taskar B (2014) Learning the parameters of determinantal point process kernels. In: International Conference on Machine Learning, pp 1224–1232
  • Ahmed et al (2015) Ahmed MA, Didaci L, Fumera G, Roli F (2015) An empirical investigation on the use of diversity for creation of classifier ensembles. In: International Workshop on Multiple Classifier Systems, pp 206–219
  • Alhamdoosh and Wang (2014) Alhamdoosh M, Wang D (2014) Fast decorrelated neural network ensembles with random weights. Information Sciences 264:104–117
  • Azadi et al (2017) Azadi S, Feng J, Darrell T (2017) Learning detection with diverse proposals. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 7369–7377
  • Bardenet and AUEB (2015) Bardenet R, AUEB MTR (2015) Inference for determinantal point processes without spectral knowledge. In: Advances in Neural Information Processing Systems, pp 3393–3401
  • Barinova et al (2012) Barinova O, Lempitsky V, Kholi P (2012) On detection of multiple object instances using hough transforms. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(9):1773–1784
  • Batra et al (2012) Batra D, Yadollahpour P, Guzman-Rivera A, Shakhnarovich G (2012) Diverse m-best solutions in markov random fields. In: European Conference on Computer Vision, pp 1–16
  • Belkin and Niyogi (2002) Belkin M, Niyogi P (2002) Laplacian eigenmaps and spectral techniques for embedding and clustering. In: Advances in Neural Information Processing Systems, pp 585–591
  • Bioucas-Dias et al (2013) Bioucas-Dias JM, Plaza A, Camps-Valls G, Scheunders P, Nasrabadi N, Chanussot J (2013) Hyperspectral remote sensing data analysis and future challenges. IEEE Geoscience and Remote Sensing Magazine 1(2):6–36
  • Bishop (2006) Bishop CM (2006) Pattern Recognition and Machine Learning. Springer
  • Blaschko (2011) Blaschko M (2011) Branch and bound strategies for non-maximal suppression in object detection. In: Energy Minimization Methods in Computer Vision and Pattern Recognition, pp 385–398
  • Cai et al (2011) Cai D, He X, Han J, Huang TS (2011) Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(8):1548–1560
  • Carreira-Perpinán and Raziperchikolaei (2016) Carreira-Perpinán MA, Raziperchikolaei R (2016) An ensemble diversity approach to supervised binary hashing. In: Advances in Neural Information Processing Systems, pp 757–765
  • Chen et al (2013) Chen C, Kolmogorov V, Zhu Y, Metaxas D, Lampert C (2013) Computing the m most probable modes of a graphical model. Artificial Intelligence and Statistics pp 161–169
  • Cohen (2015) Cohen J (2015) Multi-dimensional topic modeling with determinantal point processes. Independent Work Report Fall
  • Das et al (2012) Das A, Dasgupta A, Kumar R (2012) Selecting diverse features via spectral regularization. In: Advances in Neural Information Processing Systems, pp 1583–1591
  • Desai et al (2011) Desai C, Ramanan D, Fowlkes CC (2011) Discriminative models for multi-class object layout. International journal of computer vision 95(1):1–12
  • Dietterich (2000) Dietterich TG (2000) An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning 40(2):139–157
  • Felzenszwalb et al (2010) Felzenszwalb PF, Girshick RB, McAllester D, Ramanan D (2010) Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9):1627–1645
  • Giacinto and Roli (2001) Giacinto G, Roli F (2001) Design of effective neural network ensembles for image classification purposes. Image and Vision Computing 19(9-10):699–707
  • Gillenwater et al (2014) Gillenwater JA, Kulesza A, Fox E, Taskar B (2014) Expectation-maximization for learning determinantal point processes. In: Advances in Neural Information Processing Systems, pp 3149–3157
  • Gimpel et al (2013) Gimpel K, Batra D, Dyer C, Shakhnarovich G (2013) A systematic exploration of diversity in machine translation. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp 1100–1111
  • Gong et al (2018a) Gong Z, Zhong P, Shan J, Hu W (2018a) A diversified deep ensemble for hyperspectral image classification. In: WHISPERS
  • Gong et al (2018b) Gong ZQ, Zhong P, Shan JX, Hu WD (2018b) Diversifying deep multiple choices for remote sensing scene classification. IGARSS
  • Gong et al (2018c) Gong ZQ, Zhong P, Yu Y, Hu WD (2018c) Diversity-promoting deep structural metric learning for remote sensing scene classification. IEEE Transactions on Geoscience and Remote Sensing 56(1):371–390
  • Graff and Ellen (2016) Graff CA, Ellen J (2016) Correlating filter diversity with convolutional neural network accuracy. In: IEEE International Conference on Machine Learning and Applications (ICMLA), pp 75–80
  • Gu and Han (2013) Gu Q, Han J (2013) Clustered support vector machines. In: AISTATS
  • Guo et al (2017) Guo X, Wang X, Ling H (2017) Exclusivity regularized machine: a new ensemble svm classifier. In: IJCAI, pp 1739–1745
  • Guzman-Rivera et al (2012) Guzman-Rivera A, Batra D, Kohli P (2012) Multiple choice learning: Learning to produce multiple structured outputs. In: Advances in Neural Information Processing Systems, pp 1799–1807
  • Guzman-Rivera et al (2014a) Guzman-Rivera A, Kohli P, Batra D, Rutenbar R (2014a) Efficiently enforcing diversity in multi-output structured prediction. Artificial Intelligence and Statistics pp 284–292
  • Guzman-Rivera et al (2014b) Guzman-Rivera A, Kohli P, Glocker B, Shotton J, Sharp T, Fitzgibbon A, Izadi S (2014b) Multi-output learning for camera relocalization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1114–1121
  • Ho (1998) Ho TK (1998) The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8):832–844
  • Hu et al (2015) Hu W, Li W, Zhang X, Maybank S (2015) Single and multiple object tracking using a multi-feature joint sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(4):816–833
  • Jiang et al (2014) Jiang L, Meng D, Yu SI, Lan Z, Shan S, Hauptmann A (2014) Self-paced learning with diversity. In: Advances in Neural Information Processing, pp 2078–2086
  • Kang (2013) Kang B (2013) Fast determinantal point process sampling with application to clustering. In: Advances in Neural Information Processing Systems, pp 2319–2327
  • Keshavarz-Hedayati and Dimopoulos (2017) Keshavarz-Hedayati B, Dimopoulos NJ (2017) Sensitivity and similarity regularization in dynamic selection of ensembles of neural networks. In: IJCNN, pp 3953–3958
  • Kirillov et al (2015) Kirillov A, Savchynskyy B, Schlesinger D, Vetrov D, Rother C (2015) Inferring m-best diverse labelings in a single one. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1814–1822
  • Kirillov et al (2016) Kirillov A, Shekhovtsov A, Rother C, Savchynskyy B (2016) Joint m-best-diverse labelings as a parametric submodular minimization. In: Advances in Neural Information Processing Systems, pp 334–342
  • Kulesza and Taskar (2010) Kulesza A, Taskar B (2010) Structured determinantal point processes. In: Advances in Neural Information Processing Systems, pp 1171–1179
  • Kulesza and Taskar (2012) Kulesza A, Taskar B (2012) Determinantal point processes for machine learning. Foundations and Trends in Machine Learning 5:123–286
  • Kuncheva et al (2003) Kuncheva LI, Whitaker CJ, Shipp C, Duin R (2003) Limits on majority vote accuracy in classifier fusion. Pattern Analysis and Applications 6(1):22–31
  • Kwok and Adams (2012) Kwok JT, Adams RP (2012) Priors for diversity in generative latent variable models. In: Advances in Neural Information Processing Systems, pp 2996–3004
  • Lang et al (2012) Lang C, Liu G, Yu J, Yan S (2012) Saliency detection by multitask sparsity pursuit. IEEE Transactions on Image Processing 21(3):1327–1338
  • Lavancier et al (2015) Lavancier F, Moller J, Rubak E (2015) Determinantal point process models and statistical inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 77(4):853–877
  • Lee et al (2012) Lee H, Kim E, Pedrycz W (2012) A new selective neural network ensemble with negative correlation. Applied intelligence 37(4):488–498
  • Lee et al (2015) Lee S, Purushwalkam S, Cogswell M, Crandall D, Batra D (2015) Why m heads are better than one: Training a diverse ensemble of deep networks. ArXiv preprint arXiv: 151106314
  • Lee et al (2016) Lee S, Prakash SPS, Cogswell M, Ranjan V, Crandall D, Batra D (2016) Stochastic multiple choice learning for training diverse deep ensembles. In: Advances in Neural Information Processing Systems, pp 2119–2127
  • Li and Jurafsky (2016) Li J, Jurafsky D (2016) Mutual information and diverse decoding improve neural machine translation. In: arXiv preprint arXiv: 1601.00372
  • Li et al (2016) Li T, Dou Y, Liu X (2016) Joint diversity regularization and graph regularization for multiple kernel k-means clustering via latent variables. Neurocomputing 218:154–163
  • Li et al (2015) Li Z, Liu J, Tang J, Lu H (2015) Robust structured subspace learning for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(10):2085–2098
  • Liu et al (2016) Liu X, Dou Y, Yin J, Wang L, Zhu E (2016) Multiple kernel k-means clustering with matrix-induced regularization. In: AAAI, pp 1888–1894
  • Liu and Yao (1997) Liu Y, Yao X (1997) Negatively correlated neural networks can produce best ensembles. Australian journal of intelligent information processing systems 4(3):176–185
  • Mariet and Sra (2015) Mariet Z, Sra S (2015) Fixed-point algorithms for learning determinantal point processes. In: International Conference on Machine Learning, pp 2389–2397
  • Nemhauser et al (1978) Nemhauser GL, Wolsey LA, Fisher ML (1978) An analysis of approximations for maximizing submodular set functions. Mathematical Programming 14(1):265–294
  • Olshausen and Field (1996) Olshausen BA, Field DJ (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381(6583):607–607
  • Parizi et al (2014) Parizi SN, Vedaldi A, Zisserman A, Felzenszwalb P (2014) Automatic discovery and optimization of parts for image classification. arXiv preprint arXiv: 14126598
  • Park and Ramanan (2011) Park D, Ramanan D (2011) N-best maximal decoders for part models. In: IEEE International Conference on Computer Vision, pp 2627–2634
  • Peng et al (2017) Peng H, Li B, Ling H, Hu W, Xiong W, Maybank SJ (2017) Salient object detection via structured matrix decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4):818–832
  • Peng et al (2016) Peng Y, Zhai X, Zhao Y, Huang X (2016) Semi-supervised cross-media feature learning with unified patch graph regularization. IEEE Transactions on Circuits and Systems for Video Technology 26(3):583–596
  • Prasad et al (2014a) Prasad A, Jegelka S, Batra D (2014a) Submodular maximization and diversity in structured output spaces. In: Advances in Neural Information Processing Systems, pp 1–6
  • Prasad et al (2014b) Prasad A, Jegelka S, Batra D (2014b) Submodular meets structured: Finding diverse subsets in exponentially-large structured item sets. In: Advances in Neural Information Processing Systems, pp 2645–2653
  • Qiao et al (2015) Qiao M, Bian W, DaXu RY, Tao D (2015) Diversified hidden markov models for sequential labeling. IEEE Transactions on Knowledge and Data Engineering 27(11):2947–2960
  • Qiao et al (2017) Qiao M, Liu L, Yu J, Xu C, Tao D (2017) Diversified dictionaries for multi-instance learning. Pattern Recognition 64:407–416
  • Ramakrishna and Batra (2012) Ramakrishna V, Batra D (2012) Mode-marginals: Expressing uncertainty via m-best solutions. In: NIPS Workshop on Perturbations, Optimization, and Statistics, pp 1–5
  • Rao et al (2015) Rao V, Jain P, Jawahar CV (2015) Diverse yet efficient retrieval using hash functions. arXiv preprint arXiv: 150906553
  • Rosen (1996) Rosen BE (1996) Ensemble learning using decorrelated neural networks. Connection science 8(3):373–384
  • Shi and Shen (2016) Shi L, Shen YD (2016) Diversifying convex transductive experimental design for active learning. In: IJCAI, pp 1997–2003
  • Shotton et al (2013a) Shotton J, Glocker B, Zach C, Izadi S, Criminisi A, Fitzgibbon A (2013a) Scene coordinate regression forests for camera relocalization in rgb-d images. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2930–2937
  • Shotton et al (2013b) Shotton J, Glocker B, Zach C, Izadi S, Criminisi A, Fitzgibbon A (2013b) Scene coordinate regression forests for camera relocalization in rgb-d images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2930–2937
  • Stephens et al (2013) Stephens GJ, Mora T, Tkacik G, Bialek W (2013) Statistical thermodynamics of natural images. Physical Review Letters 110(1):018,701–018,701
  • Sun and Batra (2015) Sun Q, Batra D (2015) Submodboxes: Near-optimal search for a set of diverse object proposals. In: Advances in Neural Information Processing Systems, pp 1378–1386
  • Sun et al (2016) Sun X, He Z, Zhang X, Zou W, Baciu G (2016) Saliency detection via diversity-induced multi-view matrix decomposition. In: Asian Conference on Computer Vision, pp 137–151
  • Sun et al (2017) Sun X, He Z, Xu C, Zhang X, Zou W, Baciu G (2017) Diversity induced matrix decomposition model for salient object detection. Pattern Recognition 66:253–267
  • Tang et al (2006) Tang EK, Suganthan PN, Yao X (2006) An analysis of diversity measures. Machine Learning 65(1):247–271
  • Venugopal et al (2008) Venugopal A, Zollmann A, Smith NA, Vogel S (2008) Wider pipelines: N-best alignments and parses in mt training. In: Proceedings of AMTA, pp 192–201
  • Vinje and Gallant (2000) Vinje WE, Gallant JL (2000) Sparse coding and decorrelation in primary visual cortex during natural vision. Science 287(5456):1273–1276
  • Viola and Jones (2004) Viola P, Jones MJ (2004) Robust real-time face detection. International Journal of Computer Vision 57(2):137–154
  • Wachinger and Golland (2015) Wachinger C, Golland P (2015) Diverse landmark sampling from determinantal point processes for scalable manifold learning. arXiv preprint arXiv:150303506
  • Wang et al (2015) Wang S, Peng J, Liu W (2015) An regularization framework for diverse learning tasks. Signal Processing 109:206–211
  • Xie et al (2015a) Xie P, Deng Y, Xing E (2015a) Diversifying restricted boltzmann machine for document modeling. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 1315–1324
  • Xie et al (2015b) Xie P, Deng Y, Xing E (2015b) Latent variable modeling with diversity-inducing mutual angular regularization. arXiv preprint arXiv: 151207336
  • Xie et al (2015c) Xie P, Deng Y, Xing E (2015c) On the generalization error bounds of neural networks under diversity-inducing mutual angular regularization. arXiv preprint arXiv: 151107110
  • Xie et al (2016) Xie P, Zhu J, Xing E (2016) Diversity-promoting bayesian learning of latent variable models. In: International Conference on Machine Learning, pp 59–68
  • Xie et al (2017a) Xie P, Deng Y, Zhou Y, Kumar A, Yu Y, Zou J, Xing EP (2017a) Learning latent space models with angular constraints. In: International Conference on Machine Learning, pp 3799–3810
  • Xie et al (2017b) Xie P, Salakhutdinov R, Mou L, Xing EP (2017b) Deep determinantal point process for large-scale multi-label classification. In: IEEE International Conference on Computer Vision, pp 473–482
  • Xie et al (2017c) Xie P, Singh A, Xing EP (2017c) Uncorrelation and evenness: a new diversity-promoting regularizer. In: International Conference on Machine Learning, pp 3811–3820
  • Xing and Wang (2017) Xing HJ, Wang XZ (2017) Selective ensemble of svdds with renyi entropy based diversity measure. Pattern Recognition 61:185–196
  • Xiong et al (2014) Xiong H, Szedmak S, Rodriguez-Sanchez A, Piater J (2014) Towards sparsity and selectivity: Bayesian learning of restricted boltzmann machine for early visual features. In: International Conference on Artificial Neural Networks, pp 419–426
  • Xiong et al (2015) Xiong H, Rodriguez-Sanchez AJ, Szedmak S, Piater J (2015) Diversity priors for learning early visual features. In: Frontiers in Computational Neuroscience, pp 1–9
  • Yadollahpour et al (2013) Yadollahpour P, Batra D, Shakhnarovich G (2013) Discriminative re-ranking of diverse segmentations. In: Proceedings of the IEEE International Conference on Computer Vision, pp 1923–1930
  • Yin et al (2014a) Yin XC, Huang K, Yang C, Hao HW (2014a) Convex ensemble learning with sparsity and diversity. Information Fusion 20:49–59
  • Yin et al (2014b) Yin XC, Yang C, Hao HW (2014b) Learning to diversify via weighted kernels for classifier ensemble. arXiv preprint arXiv:14061167
  • You and Tao (2014) You X, Tao RWD (2014) Diverse expected gradient active learning for relative attributes. IEEE Transactions on Image Processing 23(7):3203–3217
  • Yu et al (2006) Yu K, Bi J, Tresp V (2006) Active learning via transductive experimental design. In: Proceedings of the 23rd international conference on Machine learning, pp 1081–1088
  • Yu et al (2008) Yu K, Zhu S, Xu W, Gong Y (2008) Non-greedy active learning for text categorization using convex transductive experimental design. In: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pp 635–642
  • Yu et al (2011) Yu Y, Li YF, Zhou ZH (2011) Diversity regularized machine. In: Proceedings of International Joint Conference on Artificial Intelligence, pp 1603–1608
  • Zhai et al (2014) Zhai X, Peng Y, Xiao J (2014) Learning cross-media joint representation with sparse and semi-supervised regularization. IEEE Transaction on Circuits and Systems for Video Technology 24(6):965–978
  • Zhang et al (2017) Zhang C, Kjellstrom H, Mandt S (2017) Determinantal point processes for mini-batch diversification. In: 33rd Conference on Uncertainty in Artificial Intelligence
  • Zhang and Huan (2012) Zhang J, Huan J (2012) Inductive multi-task learning with multiple view data. In: proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 543–551
  • Zhang and Ou (2016) Zhang MJ, Ou Z (2016) Block-wise map inference for determinantal point processes with application to change-point detection. In: Statistical Signal Processing Workshop (SSP), pp 1–5
  • Zhang and Zhou (2013) Zhang ML, Zhou ZH (2013) Exploiting unlabeled data to enhance ensemble diversity. Data Mining and Knowledge Discovery 26(1):98–129
  • Zhao et al (2016) Zhao Y, Dou Y, Liu X, Li T (2016) Elm based multiple kernel k-means with diversity-induced regularization. In: International Joint Conference on Neural Networks (IJCNN), pp 2699–2705
  • Zhong et al (2017) Zhong P, Gong ZQ, Li ST, Schönlieb CB (2017) Learning to diversify deep belief networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing 66(6):3516–3530
  • Zhou et al (2002) Zhou ZH, Wu J, Tang W (2002) Ensembling neural networks: many could be better than all. Artificial intelligence 137(1):239–263
  • Zhu and Xing (2009) Zhu J, Xing EP (2009) Maximum entropy discrimination markov networks. Journal of Machine Learning Research 10:2531–2569
  • Zhu et al (2015) Zhu Y, Lan Y, Guo J, Cheng X (2015) Structural learning of diverse ranking. arXiv preprint arXiv: 150404596
  • Zylberberg et al (2011) Zylberberg J, Murphy JT, DeWeese MR (2011) A sparse coding model with synaptically local plasticity and spiking neurons can account for the diverse shapes of v1 simple cell receptive fields. PLoS computational biology 7(10):e1002,250–e1002,250