# Towards Mixture Proportion Estimation without Irreducibility

## Abstract

Mixture proportion estimation (MPE) is a fundamental problem of practical significance, where we are given data from only a mixture and one of its two components and must identify the proportion of each component. All existing distribution-independent MPE methods explicitly or implicitly rely on the irreducible assumption, i.e., that the unobserved component is not a mixture containing the observable component. If this assumption is not satisfied, those methods suffer a critical estimation bias. In this paper, we propose Regrouping-MPE, which works without the irreducible assumption: it builds a new irreducible MPE problem and solves the new problem. Changing the problem is worthwhile: we prove that if the assumption holds, our method does not affect the estimate; if the assumption does not hold, the bias introduced by changing the problem is less than the bias caused by violating the irreducible assumption in the original problem. Experiments show that our method outperforms all state-of-the-art MPE methods on various real-world datasets.

## 1 Introduction

Mixture proportion estimation (MPE) aims to identify the mixture proportion of a component distribution in a mixture distribution. Let $F$, $G$, and $H$ be distributions over a Hilbert space $\mathcal{X}$. The problem can be formulated as follows:

$$F = (1-\kappa^*)G + \kappa^* H, \tag{1}$$

where $\kappa^*$ is the mixture proportion; $F$ is the mixture distribution; $G$ and $H$ are the component distributions. Given only samples $X_F$ and $X_H$ drawn i.i.d. from the mixture distribution $F$ and the component distribution $H$, respectively, MPE aims to identify the mixture proportion $\kappa^*$ [Scott, 2015].
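The data-generating process behind Eq. (1) can be sketched in a few lines. This is only an illustration: the two one-dimensional Gaussians standing in for $G$ and $H$, the helper names, and the sample sizes are assumptions made here, not choices taken from the paper.

```python
import random

def sample_mixture(sample_g, sample_h, kappa, n, rng):
    """Draw n points from F = (1 - kappa) * G + kappa * H."""
    points = []
    for _ in range(n):
        # With probability kappa the point comes from H, otherwise from G.
        source = sample_h if rng.random() < kappa else sample_g
        points.append(source(rng))
    return points

def sample_g(rng):
    return rng.gauss(0.0, 1.0)  # hypothetical latent component G

def sample_h(rng):
    return rng.gauss(3.0, 1.0)  # hypothetical observed component H

rng = random.Random(0)
kappa_star = 0.5  # known here only because the data are simulated
x_f = sample_mixture(sample_g, sample_h, kappa_star, 2000, rng)  # mixture sample
x_h = [sample_h(rng) for _ in range(2000)]                       # component sample
```

An MPE estimator receives only `x_f` and `x_h`; `kappa_star` and `sample_g` stay hidden from it.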

The mixture proportion is a crucial ingredient for constructing statistically consistent classifiers in many weakly supervised classification problems. It has different physical interpretations under different learning scenarios, e.g.,

• in positive-unlabeled learning [Elkan and Noto, 2008, Kiryo et al., 2017, Sakai et al., 2017], semi-supervised learning [Zhu, 2005, Grandvalet and Bengio, 2005, Lawrence and Jordan, 2005] and learning with unlabeled data sets [Lu et al., 2018, 2019], it represents the positive class prior, or the positive proportion, i.e., the proportion of the positive instances contained in a set of unlabeled instances;

• in multi-instance learning [Zhang and Goldman, 2002, Zhou, 2004], it represents the positive proportion;

• in label-noise learning [Scott et al., 2013, Han et al., 2018b, Xia et al., 2019], it represents the inverse flip rate [Liu and Tao, 2015, Scott, 2015], i.e., the probability of a true label given a noisy one;

• in similar-unlabeled learning [Bao et al., 2018], it represents the proportion of similar-data pairs in similar- and dissimilar-data pairs that are formed by exploiting an unlabeled data set.

More extensive reviews of weakly supervised classification problems related to MPE are given in Appendix A.1.

Since the distribution $G$ is unobserved, MPE is ill-posed without any assumption on the latent distribution $G$ [Blanchard et al., 2010], i.e., the mixture proportion $\kappa^*$ is not identifiable. For example, Figure 1(a) shows both the mixture distribution $F$ and the component distribution $H$. In Figure 1(b), we assume that the latent distribution $G$ is fixed as shown in green and that $\kappa^* = 0.5$. However, $F$ can also be written as a convex combination of $H$ and another distribution over $\mathcal{X}$ with a different mixture proportion, $\kappa = 0.7$, as illustrated in Figure 1(c). Then, without any knowledge of the latent distribution $G$, both $0.5$ and $0.7$ are valid solutions for MPE in Figure 1. This implies that the physical quantities, e.g., class prior, positive proportion, or label noise flip rate, cannot be identified or learned.

Figure 1: (a) The mixture distribution $F$ and the component distribution $H$ are given. (b) Assume that the latent distribution $G$ is fixed, i.e., $0.5G$ is shown by the green curve, and that the mixture proportion $\kappa^*$ is $0.5$, i.e., $F = 0.5G + 0.5H$. (c) The existing MPE estimators will output $\kappa = 0.7$ instead of $0.5$ because they always output the maximum proportion of $H$ in $F$. (d) Applying the proposed Regrouping-MPE method, a new component distribution $H'$ is created, and the existing MPE estimators will output $\kappa' = 0.49$ instead of $0.7$ when given $H'$ and $F$ instead of $H$ and $F$.

By far, the weakest assumption to yield identifiability of the mixture proportion is the irreducible assumption [Blanchard et al., 2010], i.e., $G$ is irreducible to $H$. Intuitively, it means that the component distribution $G$ is not contaminated by the distribution $H$, or that the maximum proportion of $H$ in $G$ is zero. Mathematically, the irreducible assumption means that the ratio $G(S)/H(S)$ has infimum zero over sets $S \subseteq \mathcal{X}$ with $H(S) > 0$, where $G(S)$ represents the probability of the event $S$ under the distribution $G$, i.e., the expectation under $G$ of an indicator function that returns $1$ if $x \in S$ and $0$ otherwise. To the best of our knowledge, the irreducible assumption and its variants have been explicitly [Blanchard et al., 2010, Liu and Tao, 2015, Scott, 2015, Ramaswamy et al., 2016] or implicitly [Jain et al., 2016, Ivanov, 2019, Bekker and Davis, 2018] used in all the popular distribution-independent MPE estimators.

However, it is hard to check the irreducible assumption, because we do not have any sample from the latent distribution $G$ in practice. Moreover, the irreducible assumption may not hold in many real-world applications. For example, it is likely that no set $S$ with $H(S) > 0$ makes the ratio $G(S)/H(S)$ vanish. If the assumption is not satisfied, all the popular distribution-independent MPE estimators [Blanchard et al., 2010, Liu and Tao, 2015, Scott, 2015, Ramaswamy et al., 2016] will produce a critical bias because they output the maximum proportion of $H$ in $F$. For example, in Figure 1(c), existing MPE estimators will output $\kappa = 0.7$. This differs from the ground truth $\kappa^* = 0.5$, since the distribution $G$ shown in Figure 1(b) is a mixture containing $H$. When the irreducible assumption does not hold, how to design an unbiased estimator for the physical quantities, e.g., class priors, remains an unsolved and challenging problem.

In this paper, we propose a novel method for MPE that does not require the irreducible assumption, called Regrouping-MPE. Specifically, instead of estimating the mixture proportion of $H$ in $F$, our method builds a new MPE problem by creating a new component distribution $H'$ that satisfies the irreducible assumption. Then we use an existing MPE estimator to solve for the maximum proportion of $H'$ in $F$, which is denoted by $\kappa'$. In this way, the estimation bias can be greatly reduced. For example, in Figure 1(d), we create a new component distribution $H'$. Solving for the maximum proportion of $H'$ in $F$ gives $\kappa' = 0.49$. The estimation bias of the existing estimators is thus reduced to $0.01$ instead of $0.2$. We will further show, with both theoretical analyses and experimental validations, that when the irreducible assumption holds, our Regrouping-MPE method does not hurt the existing estimators; when the irreducible assumption does not hold, our method helps the current estimators to achieve a smaller estimation bias, which could greatly improve the performance of many weakly supervised classification tasks.

The rest of the paper is organized as follows. In Section 2, we review the irreducible assumption and its stronger variants. We also discuss the difficulty of checking the irreducible assumption. In Section 3, we provide some examples that do not fulfill the irreducible assumption and discuss the estimation bias of the existing consistent estimators under such circumstances. Then we propose our method, Regrouping-MPE, followed by a theoretical analysis of its estimation bias and convergence property, as well as the implementation details. All the proofs are listed in Appendix B. The experimental validations are provided in Section 4. Section 5 concludes the paper.

## 2 MPE with Irreducibility

In this section, we briefly review the irreducible assumption for the mixture proportion estimation (MPE) problem.

Identifiability of MPE.   Since we only have samples drawn i.i.d. from $F$ and $H$, respectively, MPE is unidentifiable without making assumptions on the latent distribution $G$ [Blanchard et al., 2010]. Specifically, let $\kappa$ be the maximum proportion of $H$ in $F$, i.e., $F = (1-\kappa)M + \kappa H$, where $M$ is another distribution that may or may not be identical to $G$. Given $F$ and $H$, for any $\delta \in [0, \kappa]$, we have

$$F = (1-\kappa)M + \kappa H = (1-\kappa+\delta)K + (\kappa-\delta)H, \tag{2}$$

where $K$ is a distribution over the Hilbert space $\mathcal{X}$. Thus, without any restriction on the latent distribution, both $M$ and $K$ are valid latent distributions, and $\kappa$ and $\kappa - \delta$ are the corresponding valid mixture proportions.
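The non-identifiability in Eq. (2) can be checked numerically on a finite support. The sketch below is illustrative only: the three-atom distributions and the values of $\kappa$ and $\delta$ are made up, and $K$ is built as $K = \big((1-\kappa)M + \delta H\big)/(1-\kappa+\delta)$, which is the distribution that makes the second decomposition hold.

```python
from fractions import Fraction as Fr

def mix(w, p, q):
    """Return the discrete mixture (1 - w) * p + w * q on a shared support."""
    return [(1 - w) * a + w * b for a, b in zip(p, q)]

# Hypothetical distributions over three atoms (exact rational arithmetic).
M = [Fr(1, 2), Fr(1, 2), Fr(0)]
H = [Fr(0), Fr(1, 2), Fr(1, 2)]
kappa, delta = Fr(1, 2), Fr(1, 5)

F = mix(kappa, M, H)

# K absorbs a delta-share of H, so it is a mixture containing H.
K = [((1 - kappa) * m + delta * h) / (1 - kappa + delta) for m, h in zip(M, H)]
F_alt = mix(kappa - delta, K, H)

print(F == F_alt)   # both decompositions describe the same mixture F
```

Both `(M, kappa)` and `(K, kappa - delta)` reproduce `F` exactly, so samples from `F` and `H` alone cannot distinguish them.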

Irreducibility.   To make MPE identifiable, or in other words, to ensure that the estimated mixture proportion converges to the mixture proportion $\kappa^*$, the irreducible assumption was proposed by Blanchard et al. [2010].

###### Definition 1 (Irreducibility).

$G$ is said to be irreducible with respect to $H$ if $G$ is not a mixture containing $H$. That is, there does not exist a decomposition $G = (1-\gamma)Q + \gamma H$, where $Q$ is some probability distribution over $\mathcal{X}$ and $0 < \gamma \le 1$.

MPE is identifiable under the irreducible assumption. In this case, $M$ is identical to $G$ and the mixture proportion $\kappa^*$ is identical to $\kappa(F|H)$, which represents the maximum proportion of $H$ in $F$ and can be found as follows:

$$\kappa(F|H) \triangleq \sup\{\kappa \mid F = (1-\kappa)G + \kappa H \text{ for some distribution } G \text{ over } \mathcal{X}\} = \inf_{S \subseteq \mathcal{X},\, H(S) > 0} \frac{F(S)}{H(S)}. \tag{3}$$

Suppose we can access the distributions $F$ and $H$ and the set containing all possible latent distributions; a pseudo-algorithm that outputs $\kappa(F|H)$ is provided in Algorithm 1.
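For distributions on a finite support, the infimum in Eq. (3) can be evaluated directly, because the ratio $F(S)/H(S)$ over a set is never smaller than the smallest single-atom ratio inside it. The following sketch relies on that finite-support simplification; the function name `max_proportion` and the example distributions are hypothetical and are not the paper's Algorithm 1.

```python
from fractions import Fraction as Fr

def max_proportion(F, H):
    """kappa(F|H) on a finite support: min of F(x) / H(x) over atoms with H(x) > 0."""
    return min(f / h for f, h in zip(F, H) if h > 0)

kappa_star = Fr(1, 2)
G = [Fr(1, 2), Fr(1, 2), Fr(0)]   # G puts no mass on the last atom
H = [Fr(0), Fr(1, 2), Fr(1, 2)]
F = [(1 - kappa_star) * g + kappa_star * h for g, h in zip(G, H)]

kappa = max_proportion(F, H)
print(kappa)   # 1/2: G is irreducible to H here, so kappa(F|H) equals kappa*
```

The last atom plays the role of an anchor set: $G$ vanishes there while $H$ does not, so `max_proportion(G, H)` is zero.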

To the best of our knowledge, all existing distribution-independent MPE methods [Blanchard et al., 2010, Scott et al., 2013, Liu and Tao, 2015, Scott, 2015, Ramaswamy et al., 2016, Ivanov, 2019] are variants of estimating $\kappa(F|H)$, i.e., solving for the maximum proportion of $H$ in $F$ explicitly or implicitly. Many of them are statistically consistent estimators [Blanchard et al., 2010, Scott et al., 2013, Liu and Tao, 2015, Scott, 2015].

Stronger variants of irreducibility.  Based on the irreducible assumption, estimators can be designed with theoretical guarantees that they converge to the proportion $\kappa^*$ [Blanchard et al., 2010]. However, the convergence rate can be arbitrarily slow [Scott, 2015]. The reason is that irreducibility implies [Blanchard et al., 2010, Scott et al., 2013]

$$\inf_{S \subseteq \mathcal{X},\, H(S) > 0} \frac{G(S)}{H(S)} = 0, \tag{4}$$

i.e., the mixture proportion of $H$ in $G$ approaches $0$. If the convergence to the infimum is arbitrarily slow, the convergence rate of the designed estimators will be arbitrarily slow. To ensure a bounded rate of convergence, the anchor set assumption, a stronger variant of irreducibility, has been proposed [Scott, 2015, Liu and Tao, 2015]. It assumes

$$\min_{S \subseteq \mathcal{X},\, H(S) > 0} \frac{G(S)}{H(S)} = 0, \tag{5}$$

i.e., the mixture proportion of $H$ in $G$ is $0$. The set achieving the minimum is called an anchor set. Scott [2015] shows that, under this assumption, the universal estimator proposed by Blanchard et al. [2010] converges to $\kappa^*$ at a rate that depends on the size of the sample. The separability assumption [Ramaswamy et al., 2016], another stronger variant of irreducibility, was also proposed to bound the convergence rate of the method based on kernel mean matching (KMM) [Gretton et al., 2012].

To the best of our knowledge, the irreducible assumption or its variant has been explicitly [Blanchard et al., 2010, Liu and Tao, 2015, Scott, 2015, Ramaswamy et al., 2016] or implicitly [Jain et al., 2016, Ivanov, 2019, Bekker and Davis, 2018] used in all the popular distribution-independent MPE methods.

Difficulty of checking irreducibility.  To check the irreducible assumption, i.e., whether the latent distribution $G$ is a mixture containing the distribution $H$, we need to verify whether Eq. (4) or Eq. (5) is satisfied. However, since $G$ itself is not observable, it is difficult to verify Eq. (4) or Eq. (5). Therefore, it is difficult to check the irreducible assumption for MPE.

## 3 MPE without Irreducibility

In this section, we propose a regrouping method for MPE. We prove that, when the irreducible assumption holds, the proposed method will not affect the prediction of existing distribution-independent estimators; when the irreducible assumption does not hold, our method enables the estimators to learn a more accurate mixture proportion.

### 3.1 Motivation

The mixture proportion $\kappa^*$, which represents useful physical quantities, e.g., class priors and label noise flip rates, is essential for building statistically consistent classifiers in many learning scenarios; see, e.g., Han et al. [2018a] and Xia et al. [2019]. The existing MPE methods can only estimate $\kappa^*$ when the irreducible assumption holds. However, the irreducible assumption is impossible to check without making any assumption on $G$. The irreducible assumption may also fail to hold for many real-world problems, such as positive-unlabeled learning and similar-unlabeled learning. Detailed examples are provided in Appendix A.2.

Estimation Bias.  In the following, we show that when the irreducible assumption does not hold, the existing distribution-independent MPE methods introduce an estimation bias.

###### Proposition 1.

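Let $\kappa(F|H)$ be the maximum proportion of $H$ in $F$. Given $F = (1-\kappa^*)G + \kappa^* H$,

$$\kappa(F|H) = \kappa^* + (1-\kappa^*) \inf_{S \subseteq \mathcal{X},\, H(S) > 0} \frac{G(S)}{H(S)} = \kappa^* + (1-\kappa^*)\beta, \tag{6}$$

where $\beta$ denotes the infimum of $G(S)/H(S)$.

According to Definition 1, if the irreducible assumption does not hold, then there exists a $\beta > 0$ such that $G$ is a mixture containing a proportion $\beta$ of $H$. In this case, $\kappa(F|H)$ can still be obtained. However, it differs from $\kappa^*$ and is equal to $\kappa^* + (1-\kappa^*)\beta$. This implies that when the irreducible assumption does not hold, directly employing existing MPE methods introduces an estimation bias of $(1-\kappa^*)\beta$.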
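Proposition 1 can be verified on a small finite example in which $G$ is reducible. The distributions below are invented for illustration, and `max_proportion` is a hypothetical helper that evaluates the infimum atom by atom, which is valid on a finite support.

```python
from fractions import Fraction as Fr

def max_proportion(P, Q):
    """Maximum proportion of Q in P on a finite support (atom-wise minimum ratio)."""
    return min(p / q for p, q in zip(P, Q) if q > 0)

kappa_star = Fr(1, 2)
G = [Fr(1, 2), Fr(1, 4), Fr(1, 4)]   # G overlaps H on every atom where H > 0
H = [Fr(0), Fr(1, 2), Fr(1, 2)]
F = [(1 - kappa_star) * g + kappa_star * h for g, h in zip(G, H)]

beta = max_proportion(G, H)          # 1/2 > 0, so G is reducible to H
kappa_FH = max_proportion(F, H)      # what a consistent estimator targets
bias = kappa_FH - kappa_star

print(beta, kappa_FH, bias)          # 1/2 3/4 1/4
```

Here $\kappa(F|H) = 3/4$ although $\kappa^* = 1/2$, and the gap $1/4$ equals $(1-\kappa^*)\beta$ as Eq. (6) states.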

As the irreducible assumption is hard to check and the assumption may not always hold, it motivates us to seek for a new approach that can reduce the bias without checking the irreducible assumption.

### 3.2 Regrouping for MPE

If the irreducible assumption is not satisfied, estimating the maximum proportion of $H$ in $F$ will produce a critical estimation bias. As discussed in Proposition 1, the bias is determined by $\beta$: the larger $\beta$ is, the larger the bias. The intuition of our idea is to reduce the bias by building a new MPE problem. Our method changes the original component distributions $G$ and $H$ into new component distributions $G'$ and $H'$ that satisfy the irreducible assumption, i.e., $G'$ is irreducible to $H'$, while the mixture distribution $F$ is unchanged. This can be achieved by regrouping the examples drawn from a certain support of $G$, e.g., $A$, to $H$. Note that, after regrouping, the probability $G'(A)$ becomes $0$, and $H'(A)$ is larger than $0$. Thus, $G'(A)/H'(A) = 0$.

Let the maximum proportion of the new component distribution $H'$ in $F$ be $\kappa'$; we want $\kappa'$ to be a good approximation of $\kappa^*$. However, as the component distributions are changed, the new mixture proportion $\kappa'$ can differ considerably from $\kappa^*$. To minimize the difference, we should select a set $A$ with a small probability $G(A)$, such that $G'$ is close to $G$ and $H'$ is close to $H$. We name the proposed regrouping method for MPE Regrouping-MPE. We will prove that if the irreducible assumption is not satisfied, our regrouping method leads to a smaller estimation bias; if $G$ is irreducible to $H$, our method does not affect the prediction of the existing consistent estimators, since the selected set $A$ is such that $G(A)$ is close to $0$.

The rest of Section 3.2 is organized as follows. In Theorem 1, we analyze the relationship between $\kappa^*$ and the new mixture proportion $\kappa'$ obtained by employing our Regrouping-MPE method. We find that they are related through the set $A$ used for regrouping. Specifically, a small $G(A)$ leads to a small difference between them.

In Theorem 2, we give the condition under which $\kappa'$ is a better approximation of $\kappa^*$ than $\kappa(F|H)$. Theorem 3 discusses how the condition in Theorem 2 can be satisfied by exploiting the given samples.

Finally, we analyze the convergence property of the proposed method in Section 3.2.2. Let $\hat{\kappa}'$ be the estimated maximum proportion of $H'$ in $F$. Theorem 4 bounds the difference between $\hat{\kappa}'$ and $\kappa^*$. We further discuss that, under an assumption (Assumption 1, also used in Scott [2015]), as the size of the training samples increases, $\hat{\kappa}'$ converges to $\kappa'$ at a fixed rate, so that the difference approaches the regrouping bias $(1-\kappa^*)G(A)$ given by Eq. (11).

#### 3.2.1 Bias and consistency

The new component distribution $H'$ is generated by regrouping examples from a certain support of, or a “part” of, $G$ to $H$. We first show how to split such a “part” from a probability measure.

###### Definition 2.

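Let $M$ be a probability measure over a Hilbert space $\mathcal{X}$. Given a set $A \subseteq \mathcal{X}$, we define a measure $M_A$ over the space $\mathcal{X}$ as follows:

$$\forall S \in 2^{A}, \quad M_A(S) = M(S), \tag{7}$$

$$\forall S \in 2^{\mathcal{X}} \setminus 2^{A}, \quad M_A(S) = M(S \cap A). \tag{8}$$

Let $M_A$ and $M_{A^c}$ be two measures obtained according to Definition 2, where $A^c = \mathcal{X} \setminus A$. Then $M_A$ and $M_{A^c}$ have the following property.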

###### Lemma 1.

Let $M$ be a probability measure over a Hilbert space $\mathcal{X}$. For any set $A \subseteq \mathcal{X}$, we have

$$M_A + M_{A^c} = M. \tag{9}$$
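On a finite support, Definition 2 and Lemma 1 reduce to masking atoms. The sketch below stores a measure as a list of atom masses; the helper names `restrict` and `measure` and the example values are assumptions made for illustration.

```python
from fractions import Fraction as Fr

def restrict(M, A):
    """M_A from Definition 2: M_A(S) = M(S ∩ A), stored atom by atom."""
    return [m if i in A else Fr(0) for i, m in enumerate(M)]

def measure(M, S):
    """Mass that the (possibly unnormalized) measure M assigns to the set S."""
    return sum(M[i] for i in S)

M = [Fr(1, 2), Fr(1, 4), Fr(1, 4)]
A, A_c = {2}, {0, 1}

M_A, M_Ac = restrict(M, A), restrict(M, A_c)

# Lemma 1: the two restrictions add back up to M.
print([a + b for a, b in zip(M_A, M_Ac)] == M)
```

Note that `M_A` is a measure but generally not a probability measure: its total mass is $M(A)$, which is why Eqs. (12) and (13) renormalize.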

Now, we introduce the regrouping process. Fixing a set $A \subseteq \mathcal{X}$, the probability measure $G$ is split into $G_A$ and $G_{A^c}$ according to Definition 2 and Lemma 1. Then we regroup $G_A$ to $H$, that is,

$$\begin{aligned} F &= (1-\kappa^*)G + \kappa^* H \\ &= (1-\kappa^*)(G_A + G_{A^c}) + \kappa^* H \\ &= (1-\kappa^*)G_{A^c} + \underbrace{(1-\kappa^*)G_A + \kappa^* H}_{\text{Regrouped}}. \end{aligned} \tag{10}$$

By regrouping $G_A$ to $H$, we can rewrite $F$ as a mixture of two new component distributions for which the anchor set assumption always holds. Note that the anchor set assumption is a stronger variant of the irreducible assumption.

###### Theorem 1.

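Let $F = (1-\kappa^*)G + \kappa^* H$, and let $A \subseteq \mathcal{X}$ be the set used for regrouping. By regrouping $G_A$ to $H$, $F$ can be written as a mixture, i.e., $F = (1-\kappa')G' + \kappa' H'$, where

$$\kappa' = \kappa(F|H') = \kappa^* + (1-\kappa^*)G(A), \tag{11}$$

$$G' = \frac{G_{A^c}}{G(A^c)}, \tag{12}$$

$$H' = \frac{(1-\kappa^*)G_A + \kappa^* H}{(1-\kappa^*)G(A) + \kappa^*}, \tag{13}$$

and $G'$ and $H'$ satisfy the anchor set assumption.

When $G$ is reducible to $H$, $\kappa^*$ is not identifiable, which leads to an estimation bias as discussed before. However, the above theorem states that the new mixture proportion $\kappa'$ is always identifiable, as $G'$ is always irreducible to $H'$. Thus, after regrouping, $\kappa'$ can be estimated by the existing MPE estimators based on Eq. (3).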
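Theorem 1 can be checked exactly on a finite support. The sketch below builds $\kappa'$, $G'$, and $H'$ from Eqs. (11)-(13) for a made-up reducible example; the function names, the distributions, and the choice $A = \{2\}$ are all illustrative assumptions, and the code requires $G(A) < 1$.

```python
from fractions import Fraction as Fr

def max_proportion(P, Q):
    """Maximum proportion of Q in P on a finite support (atom-wise minimum ratio)."""
    return min(p / q for p, q in zip(P, Q) if q > 0)

def regroup(G, H, kappa_star, A):
    """Return (kappa', G', H') of Eqs. (11)-(13) for a set A of atom indices."""
    g_a = sum(G[i] for i in A)                                   # G(A), must be < 1
    kappa_new = kappa_star + (1 - kappa_star) * g_a              # Eq. (11)
    G_new = [Fr(0) if i in A else g / (1 - g_a)                  # Eq. (12)
             for i, g in enumerate(G)]
    H_new = [((1 - kappa_star) * (g if i in A else 0) + kappa_star * h) / kappa_new
             for i, (g, h) in enumerate(zip(G, H))]              # Eq. (13)
    return kappa_new, G_new, H_new

kappa_star = Fr(1, 2)
G = [Fr(1, 2), Fr(1, 4), Fr(1, 4)]   # reducible: beta = 1/2
H = [Fr(0), Fr(1, 2), Fr(1, 2)]
F = [(1 - kappa_star) * g + kappa_star * h for g, h in zip(G, H)]

kappa_new, G_new, H_new = regroup(G, H, kappa_star, A={2})
F_rebuilt = [(1 - kappa_new) * g + kappa_new * h for g, h in zip(G_new, H_new)]

print(kappa_new, F_rebuilt == F, max_proportion(F, H_new), max_proportion(G_new, H_new))
```

In this example the mixture is unchanged, $\kappa(F|H') = \kappa' = 5/8$, and the maximum proportion of $H'$ in $G'$ is zero, so the regrouped problem is identifiable.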

According to Theorem 1, we can also see that, to make $\kappa'$ closer to $\kappa^*$, we need to select a set $A$ with a smaller probability in the distribution $G$.

###### Theorem 2.

Suppose a set is selected to satisfy . Then,
1) if $G$ is irreducible to $H$, then $\kappa' = \kappa^*$;
2) if $G$ is reducible to $H$, then $\kappa^* < \kappa' < \kappa(F|H)$.

Theorem 2 provides a guideline on the selection of the set $A$ used for regrouping to make $\kappa'$ a good approximation of $\kappa^*$. Specifically, under the condition stated in Theorem 2, if $G$ is irreducible to $H$, the new estimate $\kappa'$ is identical to $\kappa^*$; if $G$ is reducible to $H$, $\kappa'$ has a smaller estimation bias than $\kappa(F|H)$.
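The effect of the choice of $A$ can be illustrated by comparing the two biases directly: Eq. (11) gives a regrouping bias of $(1-\kappa^*)G(A)$, while Proposition 1 gives an original bias of $(1-\kappa^*)\beta$, so regrouping helps whenever $G(A) < \beta$. The sketch below evaluates this for a few candidate sets in a made-up reducible example; it does not implement the selection rule of Theorem 3.

```python
from fractions import Fraction as Fr

kappa_star = Fr(1, 2)
G = [Fr(1, 2), Fr(1, 4), Fr(1, 4)]
H = [Fr(0), Fr(1, 2), Fr(1, 2)]

beta = min(g / h for g, h in zip(G, H) if h > 0)   # 1/2, so G is reducible
bias_original = (1 - kappa_star) * beta            # Proposition 1

def bias_after_regrouping(A):
    """kappa' - kappa* = (1 - kappa*) * G(A), from Eq. (11)."""
    return (1 - kappa_star) * sum(G[i] for i in A)

candidates = [{1}, {2}, {1, 2}, {0}]
report = {tuple(sorted(A)): bias_after_regrouping(A) for A in candidates}

for A, b in report.items():
    print(A, b, "smaller" if b < bias_original else "no gain")
```

The single-atom sets with small $G(A)$ halve the bias in this example, while sets carrying as much $G$-mass as $\beta$ bring no improvement.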

Let , which is a set satisfying the condition stated in Theorem 2, i.e., . The following theorem presents how to find . Let denote a discrete probability distribution over a set of observable distributions. In the MPE problem, and are distributions of observable distributions. Let be a random variable following the distribution . We will use to denote the prior knowledge that is drawn from . Note that .

###### Theorem 3.

The ratio is proportional to , where ; and are the density function of and respectively.

From the above theorem, to find the set used for regrouping, we need to estimate two quantities. The first is equivalent to the class posterior probability, which can be estimated by constructing a binary classifier based on the samples from $F$ and $H$. The second is the density of the instances, which can also be estimated directly from the given samples by using density estimation methods [Silverman].

#### 3.2.2 Convergence Analysis

Our method can inherit the convergence rate of the current estimators. Let $X_F$, $X_H$, and $X_{H'}$ be the samples drawn i.i.d. from $F$, $H$, and $H'$, respectively. Under Assumption 1, the statistically consistent estimators designed for Eq. (3) converge to $\kappa'$ at a fixed rate that depends on the sample size [Scott, 2015].

Let $A$ be the set used for regrouping. Let $h_A \in \mathcal{H}$ be a function that predicts $1$ for all elements in the set $A$ and $0$ otherwise, where $\mathcal{H}$ denotes a hypothesis space. Then $H'(A)$ can be expressed as $\int h_A(x) f_{H'}(x)\,dx$, where $f_{H'}$ is the density function of the distribution $H'$. Let $\hat{H}'(A)$ be the empirical version of $H'(A)$, i.e., the average of $h_A$ over the sample $X_{H'}$. Similarly, let $\hat{F}(A)$ be the empirical version of $F(A)$. The following theorem is proved by exploiting the Rademacher complexity [Mohri et al., 2018].

###### Theorem 4.

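Let $F = (1-\kappa^*)G + \kappa^* H$, and suppose that a set $A$ is selected and $G_A$ is regrouped to $H$. Then, with high probability (controlled by the confidence parameter $\delta$), the estimate $\hat{\kappa}'$ satisfies

$$|\hat{\kappa}' - \kappa^*| \le \frac{\epsilon_{\delta,\mathcal{H}}(X_{H'})}{\hat{H}'(A) + \epsilon_{\delta,\mathcal{H}}(X_{H'})} + \frac{\epsilon_{\delta,\mathcal{H}}(X_F)}{\hat{H}'(A) + \epsilon_{\delta,\mathcal{H}}(X_{H'})} + (1-\kappa^*)G(A), \tag{14}$$

where $\epsilon_{\delta,\mathcal{H}}(\cdot)$ is an error term defined through the empirical Rademacher complexity of the hypothesis space $\mathcal{H}$ [Bartlett and Mendelson, 2002].

As the sample sizes increase, the error terms $\epsilon_{\delta,\mathcal{H}}$ converge to zero, which means that the bound converges to $(1-\kappa^*)G(A)$. To control the rate of this convergence, a universal approximation assumption has been proposed.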

###### Assumption 1.

(Universal Approximation Property [Scott, 2015]) Consider a sequence $\mathcal{S}_1, \mathcal{S}_2, \ldots$ of VC classes of sets with finite VC dimension. The sequence is assumed to satisfy the universal approximation property if, for any measurable set $S^*$ and any distribution $M$,

$$\liminf_{k \to \infty} \inf_{S \in \mathcal{S}_k} M(S \Delta S^*) = 0, \tag{15}$$

where $\Delta$ is the symmetric set difference.

Under the above assumption, Scott [2015] proved, by exploiting VC theory [Vapnik, 2013], that the estimation error converges to zero at a fixed rate as the sample size increases. Since the empirical Rademacher complexity of a hypothesis space can be upper bounded in terms of its VC dimension [Mohri et al., 2018], both error terms $\epsilon_{\delta,\mathcal{H}}(X_{H'})$ and $\epsilon_{\delta,\mathcal{H}}(X_F)$, which are based on the empirical Rademacher complexity, also converge to zero at a fixed rate. We can then conclude that, under Assumption 1, $\hat{\kappa}'$ converges to $\kappa'$ at a fixed rate.

### 3.3 Implementation

The proposed algorithm is summarized in Algorithm 2. Let $X_F$ and $X_H$ be the samples drawn i.i.d. from $F$ and $H$. The inputs are $X_F$, $X_H$, and a hyper-parameter $p$, the fraction of points to be copied. The hyper-parameter is introduced as a trade-off between the theoretical and empirical findings. Theoretically, Eq. (11) shows that, to obtain a small estimation error, we prefer the set $A$ to have a small size. Empirically, the designed estimator is not sensitive to a small amount of regrouped data. For example, the difference between the mixture proportion estimated from the samples $X_F$ and $X_H$ and the one estimated from the samples $X_F$ and $X_{H'}$ can hardly be observed if $X_H$ and $X_{H'}$ differ in only one or two points. Specifically, a single fixed value of $p$ is selected for the experiments on all datasets, which leads to a significant improvement of the estimation accuracy. The details on the selection of the hyper-parameter value are explained in Section 4.1.

The core part of Regrouping-MPE is to create a new MPE problem according to Theorem 1 and a sample $X_{H'}$ drawn from $H'$. It is constructed by copying a set of data points with a small probability in $G$ to the sample $X_H$. To locate the support of these data points, we use Theorem 2 and Theorem 3. According to Theorem 3, we need to estimate both the class posterior probability and the density function $f_X$ of the data drawn from $F$ and $H$. However, the estimation of the density may introduce a large error, especially when the dimension is high. Furthermore, since we only need $f_X$ over a small and compact set, e.g., $A$, it can be approximated with a small error by a truncated Taylor series around a point $c$. Specifically, we have

$$f_X(x) = f_X(c) + \sum_{n=1}^{\infty} \frac{f_X^{(n)}(c)}{n!}(x-c)^n, \tag{16}$$

where $f_X^{(n)}(c)$ denotes the $n$-th derivative of $f_X$ at a point $c$, and $n!$ denotes the factorial of $n$. If the set $A$ is an open ball with center $c$ and a small radius, the higher-order terms $(x-c)^n$ are small. By approximating $f_X$ with the constant $f_X(c)$, the density $f_X$ in Theorem 3 cancels out. Note that, by treating the sample from $H$ as positive and the sample from $F$ as negative, the remaining quantity is equivalent to the positive class posterior probability.

As we do not have examples drawn from $G$, it is hard to create $H'$, let alone to sample from it. We therefore approximate $H'$ by using the observed samples. The following proposition shows that, when the regrouped part is small, the approximation is almost identical to $H'$.

###### Proposition 2.

Let , and . For all and for all , .

Note that the proposed method encourages the set $A$ in Theorem 1 to be small. The approximation will then be a good approximation of $H'$.
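The sample-level regrouping step described above can be sketched as follows. This is not Algorithm 2 itself: the function `regroup_samples`, the `posterior` callable (standing in for a trained binary classifier that scores how likely a point is to come from the component sample rather than the mixture sample), and the rule of copying the highest-scoring fraction $p$ of mixture points are assumptions made for illustration.

```python
def regroup_samples(x_f, x_h, posterior, p):
    """Copy a fraction p of the mixture sample into the component sample.

    `posterior(x)` is a hypothetical callable returning the estimated
    probability that x belongs to the component sample. Copying the
    highest-scoring points is an assumption of this sketch.
    """
    k = int(round(p * len(x_f)))                       # number of points to copy
    ranked = sorted(x_f, key=posterior, reverse=True)  # most component-like first
    return list(x_h) + ranked[:k]                      # approximate sample of H'

# Toy one-dimensional data; the identity function plays the role of the classifier.
x_f = [0.1, 0.4, 0.9, 0.7, 0.2, 0.95, 0.3, 0.6, 0.5, 0.8]
x_h = [0.85, 0.9, 0.99]
x_h_new = regroup_samples(x_f, x_h, posterior=lambda x: x, p=0.2)
print(x_h_new)
```

The mixture sample `x_f` is left unchanged, matching the fact that $F$ is not modified by regrouping; any existing MPE estimator can then be run on `x_f` and `x_h_new`.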

Although we have introduced a trade-off hyper-parameter and used approximations in the implementation, empirical results on all synthetic and real-world datasets consistently show the superiority of Regrouping-MPE.

## 4 Experiments

We run experiments on synthetic datasets and real-world datasets. The objectives of employing synthetic datasets are to validate the correctness of the proposed method and to guide the selection of the trade-off parameter $p$. The objective of using the real-world datasets is to illustrate the effectiveness of our method.

To have a rigorous performance evaluation, for each dataset, experiments are conducted via random sampling. Specifically, we select a fraction of either the positive or the negative examples to be the sample of the component distribution $H$. We let the rest of the examples be the sample of the mixture distribution $F$. In this way, pairs of empirical mixture and component distributions are generated. Then, for each pair of distributions, we randomly draw mixture and component samples of several sizes, which are used as input data. Note that the mixture and component samples have the same size, as in Ramaswamy et al. [2016]. For each sample size, repeated experiments are carried out with random sampling.

For all experiments, we employ a simple deep network trained with the stochastic gradient descent optimizer, using momentum and weight decay. The network is trained for 150 epochs. The model with the best validation accuracy is used to estimate the positive class posterior probability. We sample a validation set whose size is 20% of the training data size.

### 4.1 Experiments on Synthetic Datasets

We create two datasets, one satisfying the irreducible assumption and the other not. The irreducible dataset is created by sampling from two different 10-dimensional Gaussian distributions as the component distributions. One of the distributions has zero mean and unit covariance matrix. The other has 10-unit mean and unit covariance matrix. The reducible dataset is also created by drawing examples from two different 10-dimensional Gaussian distributions. One of the distributions has zero mean and unit covariance matrix. The other has unit mean and covariance matrix. Then we train a binary classifier with the drawn examples and use its predictions to remove a subset of the data points.

To validate the correctness of our method and to select a suitable value of the hyper-parameter $p$, we carry out two experiments. The consistent distribution-independent estimator KM2 is used as the baseline, which is compared to our method RKM2, i.e., the regrouping version of KM2. Firstly, we compare the differences between the estimates of KM2 and RKM2 for different fractions of points copied from the mixture sample $X_F$ to the component sample $X_H$, as illustrated in Figure 2(a). Then we compare the differences of the absolute errors between the baseline and our method as the copy fraction increases, as illustrated in Figure 2(b). Note that each point in Figure 2 is obtained by averaging over repeated experiments.

Figure 2(a) validates the correctness of Theorem 2 and Eq. (11). Theorem 2 states that, by properly selecting the set $A$, on the reducible dataset, $\kappa'$ should be smaller than $\kappa(F|H)$; on the irreducible dataset, $\kappa'$ should be close to $\kappa(F|H)$. Figure 2(a) matches this statement. It shows that, on the reducible dataset, the estimates of RKM2 remain smaller than those of KM2 over a range of copy fractions; on the irreducible dataset, the two estimates have similar values until the copy fraction becomes large. According to Eq. (11), the positive bias of our estimator should become larger as $G(A)$ increases. This is reflected in the fact that the differences become smaller on both datasets once the copy fraction is large.

Figure 2(b) illustrates the average differences of the absolute error between the baseline and the proposed method. On the reducible dataset, our method consistently outperforms the baseline over a range of copy fractions. However, the differences of the average absolute error start to decrease once the copy fraction is large. On the irreducible dataset, the differences of the average absolute error are close to zero until the copy fraction becomes large.

Figure 2: Experiments on Synthetic Datasets. (a) Average estimation differences between KM2 and Regrouping-KM2 (RKM2) as the copy fraction $p$ increases. (b) Average differences of the absolute error between KM2 and Regrouping-KM2 (RKM2) as the copy fraction $p$ increases.

By observing Figure 2, we can see that the curves are smooth as the copy fraction increases, which means that the proposed Regrouping-MPE method is not sensitive to the hyper-parameter $p$. For simplicity and consistency, we use a single fixed value of the hyper-parameter $p$ for all the following experiments.

### 4.2 Experiments on Real-world Datasets

We compare the proposed method with popular baselines on the real-world datasets, namely AlphaMax (AM) [Jain et al., 2016], Elkan-Noto (EN) [Elkan and Noto, 2008], KM1, KM2 [Ramaswamy et al., 2016] and ROC [Scott, 2015]. Using our method, regrouped versions of these baselines are implemented, which are called RAM, REN, RKM1, RKM2 and RROC. In Table 1, we compare the absolute estimation errors of each baseline with those of its regrouped version on different datasets with different sample sizes. Each number in Table 1 is the average over repeated experiments.

Table 1 reflects the effectiveness of our regrouping method. Our regrouping method achieves state-of-the-art estimation accuracy. Overall, the estimation accuracy of all popular MPE estimators is increased by using our regrouping method. As the last row shows, except for the KM2 estimator, the regrouped versions of the estimators have much smaller average estimation errors on most of the datasets with different sample sizes. Additionally, Regrouping-AlphaMax (RAM) yields the smallest average estimation error among all methods.

## 5 Conclusion

In this paper, we propose an effective regrouping method for MPE without the irreducible assumption, called Regrouping-MPE, which only requires training a binary classifier in addition to the existing MPE estimators. We have theoretically analyzed the estimation bias and convergence property of Regrouping-MPE. Experiments on benchmark datasets justify its correctness and effectiveness. Regrouping-MPE outperforms all state-of-the-art MPE methods. Future work will focus on how to generate a sample from $H'$ instead of using an approximation.

## Acknowledgments

TLL was supported by Australian Research Council Project DE-190101473. BH was supported by HKBU Tier 1 Start-up Grant and HKBU CSD Start-up Grant. MS was supported by JST CREST Grant Number JPMJCR1403. DCT was supported by Australian Research Council Project FL-170100117.

## Appendix A

### A.1 Applications of the MPE Problem

#### A.1.1 Positive and Unlabeled (PU) Learning

Let $P_+$, $P_-$, and $P_U = \pi_+ P_+ + (1-\pi_+)P_-$ be the positive class conditional distribution, negative class conditional distribution, and marginal distribution, respectively, where $\pi_+$ represents the positive class prior.

In learning with positive and unlabeled data (PU learning), there are two different settings for data generation, i.e., two sample (TS) and one sample (OS) [Niu et al.].

In TS, the positive sample and the unlabeled sample are drawn i.i.d. from $P_+$ and $P_U$, respectively. Note that in this setting, the positive class prior $\pi_+$ can be estimated by employing MPE because we have $P_U = \pi_+ P_+ + (1-\pi_+)P_-$ and samples drawn from $P_U$ and $P_+$.

In OS, the positive sample and the unlabeled sample are drawn dependently. Specifically, an unlabeled sample is first drawn i.i.d. from $P_U$ and then a positive sample is distilled from it, i.e., randomly selected from the positive instances contained in the unlabeled data. The remaining unlabeled sample has the following distribution $P_{U'}$, i.e.,

$$P_{U'} = \theta_+ P_+ + (1-\theta_+)P_-,$$

where $\theta_+$ represents the ratio of positive examples contained in the remaining unlabeled examples. Note that $\pi_+$ can be estimated from $\theta_+$ together with the sizes of the unlabeled data sample and the positive data sample. As we have samples from $P_{U'}$ and $P_+$, learning $\theta_+$ is an MPE problem.

To learn a classifier with positive and unlabeled data, we need to utilize the unlabeled data to evaluate the classification risk [Du Plessis et al., 2014, Kiryo et al., 2017, Sakai et al., 2017], where $\pi_+$ and $\theta_+$ play important roles.

#### A.1.2 Semi-supervised Learning

Let $P_+$, $P_-$ and $P_U$ be defined the same as those in Section A.1.1.

In semi-supervised classification [Zhu, 2005], similar to PU learning, there are also two different settings for data generation. We call them three sample (THS) and one sample (OS), respectively.

In THS, the negative, positive, and unlabeled samples are drawn i.i.d. from $P_-$, $P_+$ and $P_U$, respectively [Sakai et al., 2017]. Note that we have

$$P_U = \pi_+ P_+ + (1-\pi_+)P_-,$$

where $\pi_+$ represents the positive class prior. As we have samples from $P_-$, $P_+$, and $P_U$, $\pi_+$ can be learned by MPE [Yu et al.].

In OS, the negative, positive, and unlabeled samples are drawn dependently. Specifically, the positive sample and the negative sample are distilled from an unlabeled sample. Then the remaining unlabeled sample has the following distribution $P_{U'}$, i.e.,

$$P_{U'} = \theta_+ P_+ + (1-\theta_+)P_-,$$

where $\theta_+$ is the mixture proportion, representing the ratio of positive examples contained in the unlabeled sample. As we have samples from $P_-$, $P_+$, and $P_{U'}$, $\theta_+$ can be learned by MPE [Yu et al.].

Note that if $\pi_+$ or $\theta_+$ is identified, the unlabeled data can be exploited to build risk-consistent classifiers [Sakai et al., 2017].

#### A.1.3 Learning with Unlabeled Data Sets

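Let $P_+$ and $P_-$ be defined the same as those in Section A.1.1. Learning with unlabeled data [Lu et al., 2018, 2019] deals with the problem of having two unlabeled data sets with different distributions,

$$\begin{aligned} P'_U &= \theta' P_+ + (1-\theta')P_-, \\ P''_U &= \theta'' P_+ + (1-\theta'')P_-, \end{aligned}$$

where $\theta'$ and $\theta''$ are two mixture proportions. They represent the ratios of positive examples contained in the two unlabeled samples.

Learning $\theta'$ and $\theta''$ is essential to build statistically consistent classifiers [Lu et al., 2018, 2019]. Since the class conditional distributions and the mixture proportions are unknown, to obtain $\theta'$ and $\theta''$, we need to solve two other MPE problems obtained by substituting the above equations into each other,

$$\begin{aligned} P'_U &= \tilde{\theta}' P_+ + (1-\tilde{\theta}')P''_U, \\ P''_U &= \tilde{\theta}'' P_+ + (1-\tilde{\theta}'')P'_U, \end{aligned}$$

where $\tilde{\theta}'$ and $\tilde{\theta}''$ are two mixture proportions. As we have samples from $P'_U$ and $P''_U$, $\tilde{\theta}'$ and $\tilde{\theta}''$ can be estimated by MPE.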

#### A.1.4 Multi-instance Learning

In multi-instance learning [Zhou, 2004], instead of having individual examples, bags (or a collection) of individual examples are available, where labels are only given on the bag level. Specifically, a bag will be labeled as positive if it contains at least one positive example; otherwise it will be labeled as negative.

Let $P_+$ and $P_-$ be defined the same as those in Section A.1.1. Let $\theta_+$ be the ratio of the positive examples mixed in a bag. The distribution of the examples in one bag can be formulated as

$$\tilde{P} = \theta_+ P_+ + (1-\theta_+)P_-.$$

As we have samples from $\tilde{P}$ (e.g., a positively labeled bag) and $P_-$ (e.g., a negatively labeled bag), $\theta_+$ for each positive bag can be estimated by MPE.

Note that the values of $\theta_+$ are helpful for building a classifier [Niu et al.] that classifies whether an instance is positive or negative, which can also be employed to classify a bag.

#### A.1.5 Label-noise Learning

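In label-noise learning, let $\tilde{Y}$ denote the noisy label and $Y$ denote the clean label. Let $P_+$ and $P_-$ denote the clean positive and clean negative class conditional distributions, respectively. Let $\tilde{P}_+$ and $\tilde{P}_-$ denote the noisy positive and noisy negative class conditional distributions, respectively.

Estimating $\lambda_-$ and $\lambda_+$ in the following two equations is essential to build statistically consistent classifiers [Liu and Tao, 2015]:

$$\begin{aligned} \tilde{P}_- &= \lambda_- P_+ + (1-\lambda_-)P_-, \\ \tilde{P}_+ &= \lambda_+ P_- + (1-\lambda_+)P_+, \end{aligned}$$

where $\lambda_-$ and $\lambda_+$ are two mixture proportions. They represent the probability of a clean label given the noisy one, and are called inverse flip rates.

Since the clean class conditional distributions and the mixture proportions are unknown, to obtain $\lambda_-$ and $\lambda_+$, we need to solve two other MPE problems obtained by substituting the above equations into each other, i.e.,

$$\begin{aligned} \tilde{P}_- &= \tilde{\lambda}_- \tilde{P}_+ + (1-\tilde{\lambda}_-)P_-, \\ \tilde{P}_+ &= \tilde{\lambda}_+ \tilde{P}_- + (1-\tilde{\lambda}_+)P_+, \end{aligned}$$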

where and are two mixture proportions. As we have samples from and