# Semi-Supervised learning with Density-Ratio Estimation

## Abstract

In this paper, we study statistical properties of semi-supervised learning, which is regarded as an important problem in the machine learning community. In standard supervised learning, only labeled data is observed; classification and regression problems are formalized as supervised learning. In semi-supervised learning, unlabeled data is also obtained in addition to labeled data. Hence, exploiting unlabeled data is important for improving the prediction accuracy. This problem is regarded as a semiparametric estimation problem with missing data. Under discriminative probabilistic models, it was long believed that unlabeled data is useless for improving the estimation accuracy. Recently, it was revealed that a weighted estimator using the unlabeled data achieves better prediction accuracy than a learning method using only labeled data, especially when the discriminative probabilistic model is misspecified. That is, improvement under the semiparametric model with missing data is possible when the semiparametric model is misspecified. In this paper, we apply the density-ratio estimator to obtain the weight function in semi-supervised learning. The benefit of our approach is that the proposed estimator does not require a well-specified probabilistic model for the probability of the unlabeled data. Based on statistical asymptotic theory, we prove that the estimation accuracy of our method outperforms supervised learning using only labeled data. Numerical experiments demonstrate the usefulness of our method.

## 1 Introduction

In this paper, we analyze statistical properties of semi-supervised learning. In standard supervised learning, only the labeled data $(x, y)$ is observed, and the goal is to estimate the relation between $x$ and $y$. In semi-supervised learning [3], unlabeled data is also obtained in addition to labeled data. In real-world data such as text data, we can often obtain both labeled and unlabeled data. A typical example is that $x$ and $y$ stand for the text of an article and the tag of the article, respectively. Tagging the article demands a lot of effort. Hence, the labeled data is scarce, while the unlabeled data is abundant. In semi-supervised learning, studying methods of exploiting unlabeled data is an important issue.

In standard semi-supervised learning, statistical models of the joint probability $p(x, y)$, i.e., generative models, are often used to incorporate the information contained in the unlabeled data into the estimation. For example, under the statistical model $p(x, y; \theta)$ having the parameter $\theta$, the information contained in the unlabeled data is used to estimate the parameter via the marginal probability $p(x; \theta)$. The amount of information in unlabeled samples is studied by [2]. This approach has been developed to deal with various data structures. For example, semi-supervised learning with the manifold assumption or the cluster assumption has been studied along this line [1]. Under some assumptions on generative models, it has been shown that unlabeled data is useful for improving the prediction accuracy.

Statistical models of the conditional probability $p(y|x)$, i.e., discriminative models, are also used in semi-supervised learning. It seems that the unlabeled data is not very useful for the estimation of the conditional probability, since the marginal probability $p(x)$ does not carry any information on $p(y|x)$ [14]. Indeed, the maximum likelihood estimator using a parametric model of $p(y|x)$ is not affected by the unlabeled data. Sokolovska, et al. [23], however, proved that even under discriminative models, unlabeled data is still useful for improving the prediction accuracy of the learning method with only labeled data.

Semi-supervised learning methods generally work well under some assumptions on the population distribution and the statistical models. However, it was also reported that semi-supervised learning can degrade the estimation accuracy, especially when a misspecified model is applied [5]. Hence, a safe semi-supervised learning method is desired. The learning algorithms proposed by Sokolovska, et al. [23] and Li and Zhou [15] have a theoretical guarantee that the unlabeled data does not degrade the estimation accuracy.

In this paper, we extend the study of [23]. To incorporate the information contained in unlabeled data into the estimator, Sokolovska, et al. [23] used the weighted estimator. In the estimation of the weight function, a well-specified model for the marginal probability was assumed. This is a strong assumption for semi-supervised learning. To overcome this drawback, we apply the density-ratio estimator for the estimation of the weight function [24]. We prove that semi-supervised learning with density-ratio estimation improves on standard supervised learning. Our method is applicable not only to classification problems but also to regression problems, while many semi-supervised learning methods focus on binary classification problems.

This paper is organized as follows. In Section 2, we show the problem setup. In Section 3, we introduce the weighted estimator investigated by Sokolovska, et al. [23]. In Section 4, we briefly explain the density-ratio estimation. In Section 5, the asymptotic variance of the estimators under consideration is studied. Section 6 is devoted to proving that the weighted estimator using labeled and unlabeled data outperforms the supervised learning using only labeled data. In Section 7, numerical experiments are presented. We conclude in Section 8.

## 2 Problem Setup

We introduce the problem setup. We suppose that the probability distribution of training samples is given as

where is the conditional probability of given , and and are the marginal probabilities on . Here, is regarded as the probability in the testing phase, i.e., the test data is distributed from the joint probability , and the estimation accuracy is evaluated under the test probability. The paired sample is called “labeled data”, and the unpaired sample is called “unlabeled data”. Our goal is to estimate the conditional probability or the conditional expectation based on the labeled and unlabeled data in . When is a finite set, the problem is called the classification problem. For , the estimation of is referred to as the regression problem.

We describe the assumption on the marginal distributions, and in . In the context of the covariate shift adaptation [21], the assumption that is employed in general. The weighted estimator with the weight function is used to correct the estimation bias induced by the covariate shift; see [24] for details. Hence, the estimation of the weight function is important to achieve a good estimation accuracy. On the other hand, in the semi-supervised learning [3], the equality is assumed, and often is much larger than . This setup is also quite practical. For example, in the text data mining, the labeled data is scarce, while the unlabeled data is abundant. In this paper, we assume that the equality

holds.

We define the following semiparametric model,

for the estimation of the conditional probability , where is the set of all probability densities of the covariate . The parameter of interest is , and is the nuisance parameter. The model does not necessarily include the true probability , i.e., there may not exist the parameter such that holds. This is the significant condition, when we consider the improvement of the inference with the labeled and unlabeled data. Our target is to estimate the parameter satisfying

in which denotes the expectation with respect to the population distribution. If the model includes the true probability, we have due to the non-negativity of Kullback-Leibler divergence [4]. In the misspecified setup, however, the equality is not guaranteed.

## 3 Weighted Estimator in Semi-supervised Learning

We introduce the weighted estimator. For the estimation of under the model , we consider the maximum likelihood estimator (MLE). For the statistical model , let be the score function

where denotes the gradient with respect to the model parameter. Then, for any , we have

Hence, we can estimate the conditional density by , where is a solution of the estimation equation

Under the regularity condition, the MLE has the statistical consistency to the parameter in ; see [26] for details. In addition, the score function is an optimal choice among Z-estimators [26], when the true probability density is included in the model . This implies that the efficient score of the semiparametric model is the same as the score function of the model . This is because, in the semiparametric model , the tangent space of the parameter of interest is orthogonal to that of the nuisance parameter. Here, the asymptotic variance matrix of the estimated parameter is employed to compare the estimation accuracy.

Next, we consider the setup of the semi-supervised learning. When the model is specified, we find that the estimator using only the labeled data is efficient. This is obtained from the results of numerous studies about the semiparametric inference with missing data; see [16] and references therein.

Suppose that the model is misspecified. Then, it is possible to improve the MLE in by using the weighted MLE [23]. The weighted MLE is defined as a solution of the equation,

where is a weight function. Suppose that . Then the law of large numbers leads to the probabilistic convergence,

Hence the estimator based on will provide a good estimator of under the marginal probability . This indicates that is expected to approximate over the region on which is large. The weight function serves to adjust the bias of the estimator under the covariate shift [21]. In the setup of the semi-supervised learning, however, holds, and it is known beforehand. Hence, one may think that there is no need to estimate the weight function. Sokolovska, et al. [23] showed that estimation of the weight function is useful, even though it is already known in the semi-supervised learning.
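As a minimal sketch of a weighted estimating equation, assume a linear regression model with Gaussian noise, so that the weighted score equation reduces to weighted least squares. The function name `weighted_least_squares` and the toy data are hypothetical and not taken from this paper; the weights stand in for values of a weight function evaluated at the labeled inputs.

```python
def weighted_least_squares(xs, ys, ws):
    """Solve sum_i w_i * (1, x_i) * (y_i - a - b * x_i) = 0 for (a, b).

    This is the weighted score equation of a linear-Gaussian model
    (an illustrative assumption, not the paper's general model).
    """
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    det = sw * swxx - swx * swx          # determinant of the 2x2 normal equations
    slope = (sw * swxy - swx * swy) / det
    intercept = (swy - slope * swx) / sw
    return intercept, slope


# Toy usage with exactly linear data y = 2 + 3x and arbitrary positive weights.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 + 3.0 * x for x in xs]
intercept, slope = weighted_least_squares(xs, ys, ws=[1.0, 0.5, 2.0, 1.5])
```

With equal weights this is ordinary least squares. When the linear model fits the data exactly, as in the toy usage, the weights do not change the solution; they matter only when the model is misspecified.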

We briefly introduce the result in [23]. Let the set be finite. Then, is a finite dimensional parametric model. Suppose that the sample size of the unlabeled data is enormous, and that the probability function on is known with a high degree of accuracy. The probability is estimated by the maximum likelihood estimator based on the samples in the labeled data. Then, Sokolovska, et al. [23] showed that the weighted MLE with the estimated weight function improves the naive MLE, when the model is misspecified, i.e., .

Shimodaira [21] pointed out that the weighted MLE using the exact density ratio has the statistical consistency to the target parameter , when the covariate shift occurs. Under the regularity condition, it is rather straightforward to see that the weighted MLE using the estimated weight function also converges to in probability, since converges to in probability. Sokolovska’s result implies that when holds, the weighted MLE using the estimated weight function improves the weighted MLE using the true density ratio in the sense of the asymptotic variance of the estimator.

The phenomenon above is similar to the statistical paradox analyzed by [8]. In semiparametric estimation, Henmi and Eguchi [8] pointed out that the estimation accuracy of the parameter of interest can be improved by estimating the nuisance parameter, even when the nuisance parameter is known beforehand. Hirano, et al. [10] also pointed out that the estimator with the estimated propensity score is more efficient than the estimator using the true propensity score in the estimation of average treatment effects. Here, the propensity score corresponds to the weight function in our context. The degree of improvement is described by using the projection of the score function onto the subspace defined by the efficient score for the semiparametric model. In our analysis, the projection of the score function also plays an important role, as shown in Section 6.

For the estimation of the weight function in , we apply the density-ratio estimator [24] instead of estimating the probability densities separately. We show that the density-ratio estimator provides a practical method for the semi-supervised learning. In the next section, we introduce the density-ratio estimation.

## 4 Density-ratio estimation

Density-ratio estimators are available to estimate the weight function . Recently, methods of the direct estimation for density-ratios have been developed in the machine learning community [24]. We apply the density-ratio estimator to estimate the weight function instead of using the estimator of each probability density.

We briefly introduce the density-ratio estimator according to [18]. Suppose that the following training samples are observed,

Our goal is to estimate the density-ratio . The -dimensional parametric model for the density-ratio is defined by

where is assumed. For any function which may depend on the parameter , one has the equality

Hence, the empirical approximation of the above equation is expected to provide an estimation equation of the density-ratio. The empirical approximation of the above equality under the parametric model of is given as

Let be a solution of , and then, is an estimator of . Note that we do not need to estimate probability densities and separately. The estimation equation provides a direct estimator of the density-ratio based on the moment matching with the function .

Qin [18] proved that the optimal choice of is given as

where . By using above, the asymptotic variance matrix of is minimized among the set of moment matching estimators, when is realized by the model . Hence, is regarded as the counterpart of the score function for parametric probability models.
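The idea of direct estimation can be sketched in a few lines. The example below is an illustration under assumptions, not the estimator analyzed in this paper: it assumes a one-dimensional covariate and an exponential-form ratio model `exp(a0 + a1 * x)`, and it fits that model by logistic regression on the pooled sample, a commonly used probabilistic-classification approach to density-ratio estimation. The function name, the step size, the number of iterations, and the synthetic Gaussian samples are all hypothetical.

```python
import math
import random


def fit_log_linear_ratio(num_sample, den_sample, steps=300, lr=0.5):
    """Estimate r(x) = p_num(x) / p_den(x) with the model exp(a0 + a1 * x).

    A logistic regression separates the two samples; its posterior odds
    equal (n_num / n_den) * r(x), so the log sample-size ratio is removed.
    """
    data = [(x, 1.0) for x in num_sample] + [(x, 0.0) for x in den_sample]
    n = len(data)
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, label in data:
            prob = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += label - prob            # gradient of the log-likelihood
            g1 += (label - prob) * x
        b0 += lr * g0 / n                 # plain gradient ascent step
        b1 += lr * g1 / n
    a0 = b0 - math.log(len(num_sample) / len(den_sample))
    return lambda x: math.exp(a0 + b1 * x)


# Synthetic check: numerator sample N(1, 1), denominator sample N(0, 1),
# for which the true ratio exp(x - 0.5) is increasing in x.
random.seed(0)
num_sample = [random.gauss(1.0, 1.0) for _ in range(400)]
den_sample = [random.gauss(0.0, 1.0) for _ in range(400)]
ratio = fit_log_linear_ratio(num_sample, den_sample)
```

Neither density is estimated on its own; only the ratio model is fitted, which is the point made above about direct estimation.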

## 5 Semi-Supervised Learning with Density-Ratio Estimation

We study the asymptotics of the weighted MLE using the estimated density-ratio. The estimation equation is given as

Here, the statistical models and are employed. The first equation is used for the estimation of the parameter of the model , and the second equation is used for the estimation of the density-ratio . The estimator defined by is referred to as density-ratio estimation based on semi-supervised learning, or DRESS for short.

In Sokolovska, et al. [23], the marginal probability density is estimated by using a well-specified parametric model. Clearly, preparing the well-specified parametric model is not practical, when is not a finite set. On the other hand, it is easy to prepare a specified model of the density-ratio , whenever holds in . The model is an example. Indeed, holds. Hence, the assumption that the true weight function is realized by the model is not an obstacle in semi-supervised learning.

We show the asymptotic expansion of the estimation equation . Let and be a solution of . In addition, let be a solution of

and be the parameter such that , i.e., . We prepare some notations: , . The Jacobian of the score function with respect to the parameter is denoted as , i.e., the by matrix whose element is given as . The variance matrix and the covariance matrix under the probability are denoted as and , respectively. Without loss of generality, we assume that at is represented as

where is an arbitrary function orthogonal to , i.e., holds. If does not have any component which is represented as a linear transformation of , the estimator would be degenerated. Under the regularity condition, the estimated parameters, and , converge to and , respectively. The asymptotic expansion of around leads to

Hence, we have

Therefore, we obtain the asymptotic variance,

On the other hand, the variance of the naive MLE, , defined as a solution of is given as

where .

## 6 Maximum Improvement by Semi-Supervised Learning

Given the model for the density-ratio , we compare the asymptotic variance matrices of the estimators, and . First, let us define

i.e., is the projection of the score function onto the subspace consisting of all functions depending only on , where the inner product is defined by the expectation under the joint probability . Note that the equality holds. Let the matrix be

Then, a simple calculation yields that the difference of the variance matrix between and is equal to

In the second equality, we supposed that converges to a positive constant. When is positive definite, the estimator using the labeled and unlabeled data improves the estimator using only the labeled data. It is straightforward to see that the improvement is not attained if holds. In general, the score function satisfies , if the model is specified. When the model of the conditional probability is misspecified, however, there is a possibility that the proposed estimator outperforms the MLE .

We derive the optimal moment function for the estimation of the parameter . The optimal can be different from . We prepare some notations. Let be the -valued function on , each element of which is the projection of each element of onto the subspace spanned by . Here, the inner product is defined by the expectation under the marginal probability . In addition, let be the projection of onto the orthogonal complement of the subspace, i.e., .

Due to , one has and . Hence, one has . Our goal is to find which minimizes in in the sense of positive definiteness. The orthogonal decomposition leads to

because of the orthogonality between and , and the equality . Hence, satisfying

is an optimal choice. Since the matrix is row full rank, a solution of the above equation is given by

We obtain the maximum improvement of by using the equalities and .

Suppose that the optimal moment function presented in Theorem ? is used with the score function . Then, the improvement is maximized when is minimized. Hence, the model with the lower dimensional parameter is preferable as long as the assumption in Theorem ? is satisfied. This is intuitively understandable, because the statistical perturbation of the density-ratio estimator is minimized, when the smallest model is employed.

It is not practical to apply the optimal function defined by . The optimal moment function depends on , and one needs information on the probability to obtain the explicit form of . The estimation of needs non-parametric estimation, since the model misspecification of is significant in our setup. Thus, we consider more practical estimator for the density ratio. Suppose that holds for the moment function . For example, the optimal moment function satisfies at , i.e., . For the density-ratio model with and the moment function satisfying , a brief calculation yields that

Hence, the improvement is attained, when holds. As an interesting fact, we see that the larger model attains the better improvement in . Indeed, gets close to , when the density-ratio model becomes large. Hence, the non-parametric estimation of the density-ratio may be a good choice to achieve a large improvement for the estimation of the conditional probability. This is totally different from the case that the optimal presented in Theorem ? is used in the density-ratio estimation. The relation between using the optimal and with is illustrated in Figure 1. In the limit of the dimension of , both variance matrices converge to monotonically.

## 7 Numerical Experiments

We show numerical experiments to compare the standard supervised learning and the semi-supervised learning using DRESS. Both regression problems and classification problems are presented.

### 7.1 Regression problems

We consider the regression problem with the -dimensional covariate variable shown below.

labeled data:

unlabeled data:

.

regression model:

.

score function:

.

The parameter in implies the degree of the model misspecification. Let be the target function, , and define

which implies the squared distance from the true function to the linear regression model. On the other hand, the mean square error of the naive least mean square (LMS) estimator , i.e., , is asymptotically equal to , when the model is specified. We use the ratio

as the normalized measure of the model misspecification. When holds, the misspecification of the model can be statistically detected.

First, we use a parametric model for density ratio estimation. For any positive integer , let be the -dimensional vector . The density-ratio model is defined as

having dimensional parameter . We apply the estimator presented by Qin [18]. Note that the estimator satisfies at . Hence, the improvement is asymptotically given by . Under the setup of and , we compute the mean square errors for LMS estimator and DRESS . The difference of test errors,

is evaluated for each and each dimension of the density ratio, , where the expectation is evaluated over the test samples. The mean square error is calculated by the average over 500 iterations.

Figure ? shows the results. When the model is specified, i.e., , the LMS estimator performs better than DRESS. Under the practical setup such as , however, we see that DRESS outperforms the LMS estimator. The dependency on the dimension of the density-ratio model is not clearly detected in this experiment. Overall, larger density-ratio models give rather unstable results. Indeed, in DRESS with a large density-ratio model, say the right bottom panel in Figure ?, the mean square error of DRESS can be large, i.e., the improvement is negative, even when the model misspecification is large.

Next, we compare the LMS estimator and DRESS with a nonparametric estimator of the density-ratio. Here, we use KuLSIF [11] as the density-ratio estimator. KuLSIF is a nonparametric estimator of the density-ratio based on the kernel method. Regularization is applied to suppress the degrees of freedom of the nonparametric model. In KuLSIF, the kernel function of the reproducing kernel Hilbert space corresponds to the basis function .

Under the setup of and , we compute the mean square errors by the average over 100 iterations. In Figure ?, the square root of the mean square errors for the LMS estimator and DRESS are plotted as a function of , i.e., (model error)/(statistical error). When is around , it is statistically hard to detect the model misspecification by the training data of the size . When the model is specified (), the LMS estimator performs better than DRESS. Under the practical setup such as , however, we see that DRESS with KuLSIF outperforms the LMS estimator. As shown in the asymptotic analysis, we notice that the sample size of the unlabeled data affects the estimation accuracy of DRESS. The numerical results show that DRESS with large attains a smaller error compared to DRESS with small , especially when holds. In the numerical experiment, even DRESS with and slightly outperforms the LMS estimator. This is not supported by the asymptotic analysis. Hence, a more involved theoretical study of the statistical features of semi-supervised learning is needed.

### 7.2 Classification problems

As a classification task, we use the spam dataset in the R package “kernlab” [12]. The dataset includes 4601 samples. The dimension of the covariate is 57, i.e., whose elements represent statistical features of each document. The output is assigned to “spam” or “nonspam”.

For the binary classification problem, we use the logistic model,

where is the dimension of the covariate used in the logistic model. In numerical experiments, varies from 10 to 57; hence, the dimension of the model parameter varies from 11 to 58. We tested DRESS with KuLSIF [11] and MLE with randomly chosen labeled training samples and unlabeled training samples. The remaining samples serve as the test data. The score function is used for the estimation.

Table 1 shows the prediction errors with the standard deviation. We also show the p-value of the one-tailed paired $t$-test for prediction errors of DRESS and MLE. Small p-values denote the superiority of DRESS. We notice that the p-value is small when the dimension is not large. In other words, the numerical results agree with the asymptotic theory in Section 6. For relatively high dimensional models, the prediction error of MLE is smaller than that of DRESS; see the row of in Table 1. The size of unlabeled data, , also affects the results. Indeed, the p-value becomes small for large . This result is supported by the asymptotic analysis presented in Section 6.
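The paired comparison can be sketched as follows. The error values are hypothetical placeholders, not results from Table 1, and the helper name `paired_t_statistic` is an assumption. Only the standard library is used, so the p-value itself (the lower-tail probability of a Student $t$ distribution with $n - 1$ degrees of freedom) is left to a statistics package.

```python
import math
import statistics


def paired_t_statistic(errors_a, errors_b):
    """Return the paired t statistic and degrees of freedom for errors_a - errors_b."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)   # sample standard error
    return mean_diff / std_err, n - 1


# Hypothetical per-split prediction errors (NOT taken from the paper).
dress_errors = [0.10, 0.12, 0.11, 0.09]
mle_errors = [0.12, 0.13, 0.13, 0.12]
t_stat, dof = paired_t_statistic(dress_errors, mle_errors)
```

A large negative statistic means the first method has the smaller error on most splits, which corresponds to a small one-tailed p-value.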

## 8 Conclusion

In this paper, we investigated semi-supervised learning with the density-ratio estimator. We proved that the unlabeled data is useful when the model of the conditional probability is misspecified. This result agrees with the result given by Sokolovska, et al. [23], in which the weight function is estimated by using the estimator of the marginal probability under a specified model of . The estimator proposed in this paper is useful in practice, since our method does not require a well-specified model for the marginal probability. Numerical experiments demonstrate the effectiveness of our method. We are currently investigating semi-supervised learning from the perspective of semiparametric inference with missing data. A positive use of the statistical paradox in semiparametric inference is an interesting direction for future work on semi-supervised learning.

## Acknowledgement

The authors are grateful to Dr. Masayuki Henmi, Dr. Hironori Fujisawa and Prof. Shinto Eguchi of the Institute of Statistical Mathematics. TK was partially supported by Grant-in-Aid for Young Scientists (20700251).

### References

1. Semi-supervised learning on Riemannian manifolds.
Belkin, M., & Niyogi, P. (2004). Machine Learning, 56, 209–239.
2. The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter.
Castelli, V., & Cover, T. M. (1996). IEEE Transactions on Information Theory, 42, 2102–2117.
3. Semi-supervised learning.
Chapelle, O., Schölkopf, B., & Zien, A. (Eds.). (2006). MIT Press.
4. Elements of information theory (wiley series in telecommunications and signal processing).
Cover, T. M., & Thomas, J. A. (2006). Wiley-Interscience.
5. Semi-supervised learning of mixture models.
Cozman, F., Cohen, I., & Cirelo, M. (2003). Proceedings of the International Conference on Machine Learning.
6. Asymptotic analysis of generative semi-supervised learning.
Dillon, J. V., Balasubramanian, K., & Lebanon, G. (2010). 27th International Conference on Machine Learning (pp. 295–302).
7. Semi-supervised learning by entropy minimization.
Grandvalet, Y., & Bengio, Y. (2005). Neural Information Processing Systems 17 (NIPS 2004) (pp. 529–536).
8. A paradox concerning nuisance parameters and projected estimating functions.
Henmi, M., & Eguchi, S. (2004). Biometrika, 91, 929–941.
9. Importance sampling via the estimated sampler.
Henmi, M., Yoshida, R., & Eguchi, S. (2007). Biometrika, 94, 985–991.
10. Efficient estimation of average treatment effects using the estimated propensity score.
Hirano, K., Imbens, G. W., & Ridder, G. (2003). Econometrica, 71, 1161–1189.
11. Statistical analysis of kernel-based least-squares density-ratio estimation.
Kanamori, T., Suzuki, T., & Sugiyama, M. (2012). Machine Learning, 86, 335–367.
12. kernlab – an S4 package for kernel methods in R.
Karatzoglou, A., Smola, A., Hornik, K., & Zeileis, A. (2004). Journal of Statistical Software, 11, 1–20.
13. Statistical analysis of semi-supervised regression.
Lafferty, J. D., & Wasserman, L. A. (2007). NIPS.
14. Principled hybrids of generative and discriminative models.
Lasserre, J. A., Bishop, C. M., & Minka, T. P. (2006). CVPR (1) (pp. 87–94).
15. Towards making unlabeled data never hurt.
Li, Y.-F., & Zhou, Z.-H. (2011). ICML (pp. 1081–1088).
16. Asymptotic theory for the semiparametric accelerated failure time model with missing data.
Nan, B., Kalbfleisch, J. D., & Yu, M. (2009). The Annals of Statistics, 37, 2351–2376.
17. Text classification from labeled and unlabeled documents using em.
Nigam, K., Mccallum, A. K., Thrun, S., & Mitchell, T. (1999). Machine Learning (pp. 103–134).
18. Inferences for case-control and semiparametric two-sample density ratio models.
Qin, J. (1998). Biometrika, 85, 619–639.
19. Estimation of regression coefficients when some regressors are not always observed.
Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1994). Journal of the American Statistical Association, 89, 846–866.
20. Learning with labeled and unlabeled data (Technical Report).
Seeger, M. (2001). Institute for Adaptive and Neural Computation, University of Edinburgh.
21. Improving predictive inference under covariate shift by weighting the log-likelihood function.
Shimodaira, H. (2000). Journal of Statistical Planning and Inference, 90, 227–244.
22. The value of labeled and unlabeled examples when the model is imperfect.
Sinha, K., & Belkin, M. (2007). NIPS.
23. The asymptotics of semi-supervised learning in discriminative probabilistic models.
Sokolovska, N., Cappé, O., & Yvon, F. (2008). Proceedings of the Twenty-Fifth International Conference on Machine Learning (pp. 984–991).
24. Machine learning in non-stationary environments: Introduction to covariate shift adaptation.
Sugiyama, M., & Kawanabe, M. (2012). The MIT Press.
25. Density ratio estimation in machine learning.
Sugiyama, M., Suzuki, T., & Kanamori, T. (2012). Cambridge University Press.
26. Asymptotic statistics.
van der Vaart, A. W. (1998). Cambridge University Press.
27. A probability analysis on the value of unlabeled data for classification problems.
Zhang, T., & Oles, F. J. (2000). 17th International Conference on Machine Learning.