# Uncoupled Regression from Pairwise Comparison Data

###### Abstract

Uncoupled regression is the problem of learning a model from unlabeled data and a set of target values, where the correspondence between them is unknown. Such a situation arises when predicting anonymized targets that involve sensitive information, e.g., one's annual income. Since existing methods for uncoupled regression often require strong assumptions on the true target function, and thus their range of applications is limited, in this paper we introduce a novel framework that does not require such assumptions. Our key idea is to utilize pairwise comparison data, which consists of pairs of unlabeled data points for which we know which one has the larger target value. Such pairwise comparison data is easy to collect, as is typically discussed in the learning-to-rank scenario, and does not break the anonymity of the data. We propose two practical methods for uncoupled regression from pairwise comparison data and show that the learned regression model converges to the optimal model at the optimal parametric convergence rate when the target variable is uniformly distributed. Moreover, we empirically show that, for linear models, the proposed methods are comparable to ordinary supervised regression with labeled data.

## 1 Introduction

In supervised regression, we need a vast amount of labeled data in the training phase, which is costly and laborious to collect in many real-world applications. To deal with this problem, weakly-supervised regression has been proposed in various settings, such as semi-supervised learning (see Kostopoulos et al. [17] for a survey), multiple instance regression [27, 34], and transductive regression [4, 5]. See [35] for a thorough review of weakly-supervised learning in binary classification, whose techniques can be extended to regression with slight modifications.

Uncoupled regression [2] is one variant of weakly-supervised learning. In ordinary "coupled" regression, pairs of features and targets are provided, and we aim to learn a model that minimizes the prediction error on test data. In the uncoupled regression problem, by contrast, we only have access to unlabeled data and the set of target values, and we do not know the true target for each data point. Such a situation often arises when we aim to predict sensitive personal matters such as one's annual salary or total amount of deposits, data which is often anonymized for privacy reasons. Note that uncoupled regression is impossible without further assumptions, since no labeled data is provided.

Carpentier and Schlueter [2] showed that uncoupled regression is solvable if the true target function is monotonic in a one-dimensional feature, by matching the empirical distributions of the feature and the target. Although their algorithm is of limited practical use due to this strong assumption, their work offers a valuable insight: a model is learnable from uncoupled data if we know the ranking of the dataset. In this paper, we show that, instead of imposing the monotonicity assumption, we can infer such ranking information from data to solve uncoupled regression. We use pairwise comparison data as a source of ranking information, which consists of pairs of unlabeled data points for which we know which point has the larger target value.

Note that pairwise comparison data is easy to collect even for sensitive matters such as one's annual earnings. Although people often hesitate to give an explicit answer, it may be easier to answer an indirect question such as "Which person earns more than you?"^1, which yields exactly the pairwise comparison data we need. Since we do not put any assumption on the true target function, our method is applicable to many situations.

^1 This type of questioning can be regarded as a randomized response (indirect questioning) technique [32], a survey method for avoiding social desirability bias.

One naive method for uncoupled regression with pairwise comparison data is to use a score-based ranking method [29], which learns a score function with the minimum number of inversions on the pairwise comparison data. With such a score function, we can match the unlabeled data to the set of target values and then conduct supervised learning. However, as discussed in Rigollet and Weed [28], when the target variable contains noise, we cannot consistently recover the true target function even if we know the true order of the unlabeled data.

In contrast, our method directly minimizes the regression risk. We first rewrite the regression risk so that it can be estimated from unlabeled and pairwise comparison data, and then learn a model through empirical risk minimization. Such risk-rewriting approaches have been extensively studied in the classification scenario [7, 6, 23, 30, 18] and exhibit promising performance. We consider two estimators of the risk defined through the expected Bregman divergence [11], which is a natural choice of risk function. We show that if the target variable is marginally uniformly distributed, then the estimators are unbiased and the learned model converges to the optimal model at the optimal rate. For general marginal distributions, however, we prove that such an unbiased estimator cannot exist and the learned model may not converge to the optimal one. Still, our empirical evaluations on synthetic data and benchmark datasets show that our methods perform similarly to a model learned by ordinary supervised learning.

The paper is structured as follows. After discussing the related work in Section 2, we formulate the uncoupled regression problem with pairwise comparison data in detail in Section 3. In Sections 4 and 5, we discuss two methods for uncoupled regression and derive estimation error bounds for each method. Finally, we show empirical results in Section 6 and conclude the paper in Section 7.

## 2 Related Work

Several methods have been proposed to match two independently collected data sources. In the context of data integration [3], matching is conducted based on contextual data provided for both sources. For example, Walter and Fritsch [31] used spatial information as contextual data to integrate two data sources. Other work evaluates the quality of a matching by an information criterion and finds the best matching by maximizing that metric. This problem, called cross-domain object matching (CDOM), was formulated by Jebara [15], and a number of methods have been proposed for it, such as Quadrianto et al. [26], Yamada and Sugiyama [33], and Jitta and Klami [16].

Another line of work on the uncoupled regression problem imposes an assumption on the true target function. For example, Carpentier and Schlueter [2] assumed that the true target function is monotonic in a single feature, and this approach was refined by Rigollet and Weed [28]. Another common assumption is that the true target function is linear in the features, as studied by Hsu et al. [14] and Pananjady et al. [24]. Although these methods yield accurate models, their strong assumptions limit their practical use. In contrast, our methods do not require any assumption on the mapping function and are applicable to wider scenarios.

It is worth noting that some methods use uncoupled data to enhance the performance of semi-supervised learning. For example, in label regularization [19], uncoupled data is used to regularize a regression model so that the distribution of prediction on unlabeled data is close to the marginal distribution of target variables, which is reported to increase the accuracy.

Pairwise comparison data was originally considered in the ranking problem [29, 22], which aims to learn a score function that ranks data correctly. In fact, ranking methods such as rankSVM [13] can be applied to our problem. However, their naive application performs worse than the proposed methods, as we show empirically, since our goal is not to order data correctly but to predict the true target values.

## 3 Problem Settings

In this section, we formulate the uncoupled regression problem and introduce pairwise comparison data. We first define the uncoupled regression, and then, we describe the data generating process of pairwise comparison data.

### 3.1 Uncoupled Regression Problem

We first briefly formulate the standard regression problem. Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a $d$-dimensional feature space and $\mathcal{Y} \subseteq \mathbb{R}$ be a target space. We denote by $X$ and $Y$ random variables on $\mathcal{X}$ and $\mathcal{Y}$, respectively, and assume that they follow a joint distribution $p(x, y)$. The goal of the regression problem is to obtain a model $f$ in a hypothesis space $\mathcal{F}$ that minimizes the risk defined as

$$ R(f) = \mathbb{E}_{(X, Y) \sim p}\!\left[ L(Y, f(X)) \right], \qquad (1) $$

where $\mathbb{E}_{(X, Y) \sim p}$ denotes the expectation over $p(x, y)$ and $L$ is a loss function.

The loss function $L(y, \hat{y})$ measures the closeness between a true target $y$ and the output $\hat{y}$ of a model, and generally grows as the prediction gets farther from the target. In this paper, we mainly consider $L$ to be the Bregman divergence $d_\phi$, which is defined as

$$ d_\phi(y, \hat{y}) = \phi(y) - \phi(\hat{y}) - \phi'(\hat{y})\,(y - \hat{y}) $$

for some convex function $\phi$, where $\phi'$ denotes the derivative of $\phi$. This is a natural choice of loss function, since the minimizer of the risk is $\mathbb{E}[Y \mid X = x]$ when the hypothesis space is rich enough [11], where the expectation is over the conditional distribution of $Y$ given $X = x$. Many common loss functions can be interpreted as Bregman divergences; for instance, $\phi(y) = y^2$ yields the squared loss $d_\phi(y, \hat{y}) = (y - \hat{y})^2$, and $\phi(y) = y \log y + (1 - y) \log(1 - y)$ yields the Kullback–Leibler divergence between Bernoulli distributions with parameters $y$ and $\hat{y}$.
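As a concrete illustration, the two special cases above can be checked numerically. This is a minimal sketch; the function names are our own:

```python
import numpy as np

def bregman_divergence(phi, phi_prime, y, y_hat):
    # d_phi(y, y_hat) = phi(y) - phi(y_hat) - phi'(y_hat) * (y - y_hat)
    return phi(y) - phi(y_hat) - phi_prime(y_hat) * (y - y_hat)

# phi(y) = y^2 gives the squared loss: d_phi(3, 1) = (3 - 1)^2 = 4
sq = bregman_divergence(lambda y: y ** 2, lambda y: 2 * y, 3.0, 1.0)

# phi(y) = y log y + (1 - y) log(1 - y) gives the KL divergence between
# Bernoulli(y) and Bernoulli(y_hat)
phi_b = lambda y: y * np.log(y) + (1 - y) * np.log(1 - y)
dphi_b = lambda y: np.log(y) - np.log(1 - y)
kl = bregman_divergence(phi_b, dphi_b, 0.5, 0.25)
```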

In the standard regression scenario, we are given labeled training data $\{(x_i, y_i)\}_{i=1}^{n}$ drawn independently and identically from $p(x, y)$. Based on the training data, we empirically estimate the risk $R(f)$ and learn a model as the minimizer of the empirical risk. In uncoupled regression, however, no individual label is available, so this approach is no longer applicable. Instead of ordinary "coupled" data, we are given unlabeled data $\mathcal{X}_U = \{x_i\}_{i=1}^{n_U}$ and a set of target values $\mathcal{Y}_U$, where $n_U$ is the size of the unlabeled data. Furthermore, we denote the marginal distribution of the feature $X$ by $p_X$ and its probability density function by $p(x)$. Similarly, $p_Y$ stands for the marginal distribution of the target $Y$, and $p(y)$ is its density function. We write $\mathbb{E}$, $\mathbb{E}_X$, and $\mathbb{E}_Y$ for the expectations over $p$, $p_X$, and $p_Y$, respectively.

Unlike Carpentier and Schlueter [2], we do not try to match unlabeled data and target values. In fact, our methods do not use the individual target values in $\mathcal{Y}_U$ but only the density function $p(y)$ of the target, which can be estimated from $\mathcal{Y}_U$. For simplicity, we assume that the true density $p(y)$ is known. The case where $p(y)$ must be estimated from $\mathcal{Y}_U$ is discussed in Appendix B.

### 3.2 Pairwise Comparison Data

Here, we introduce pairwise comparison data. It consists of two random variables $(X^+, X^-)$, where the target value of $X^+$ is larger than that of $X^-$. Formally, $(X^+, X^-)$ is defined as

$$ (X^+, X^-) = \begin{cases} (X_1, X_2) & \text{if } Y_1 \ge Y_2, \\ (X_2, X_1) & \text{otherwise}, \end{cases} \qquad (2) $$

where $(X_1, Y_1)$ and $(X_2, Y_2)$ are two independent random variables following $p(x, y)$. We denote the joint distribution of $(X^+, X^-)$ by $p_{\pm}$ and the marginal distributions by $p_+$ and $p_-$. Density functions and expectations are defined analogously.

We assume that we have access to $n_{PC}$ i.i.d. samples of $(X^+, X^-)$, denoted $\{(x_j^+, x_j^-)\}_{j=1}^{n_{PC}}$, in addition to the unlabeled data $\mathcal{X}_U$ and the density function $p(y)$ of the target variable. In the following sections, we show that uncoupled regression can be solved from this information alone. In fact, our method requires samples of only one of $X^+$ or $X^-$, which corresponds to the case where only the winner or only the loser of each comparison is observable.
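The generating process in (2) can be sketched as follows; the toy generator `sample_xy` and all sizes are illustrative assumptions, not part of the paper's setup:

```python
import numpy as np

def sample_pairwise_comparisons(sample_xy, n_pc, rng):
    # Draw two i.i.d. labeled points and keep only the features, ordered so
    # that x_plus belongs to the point with the larger target value.
    x_plus, x_minus = [], []
    for _ in range(n_pc):
        (x1, y1), (x2, y2) = sample_xy(rng), sample_xy(rng)
        if y1 >= y2:
            x_plus.append(x1); x_minus.append(x2)
        else:
            x_plus.append(x2); x_minus.append(x1)
    return np.array(x_plus), np.array(x_minus)

# toy generator: the (hidden) target is the first feature plus small noise
def sample_xy(rng):
    x = rng.normal(size=3)
    return x, x[0] + 0.1 * rng.normal()

rng = np.random.default_rng(0)
xp, xm = sample_pairwise_comparisons(sample_xy, 1000, rng)
# winners have a larger first coordinate on average
```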

One naive approach to uncoupled regression with pairwise comparison data would be to adopt ranking methods, i.e., to learn a ranker $g$ that minimizes the following expected ranking loss:

$$ R_{\mathrm{rank}}(g) = \mathbb{E}_{(X^+, X^-)}\!\left[ \mathbb{1}\!\left[ g(X^+) \le g(X^-) \right] \right], \qquad (3) $$

where $\mathbb{1}[\cdot]$ is the indicator function. By minimizing an empirical estimate of (3) based on the pairwise comparison data, we can learn a ranker that sorts data points by the target $Y$. We can then estimate the quantile of a test point by ranking it within the unlabeled data $\mathcal{X}_U$ and obtain a prediction by applying the inverse of the cumulative distribution function (CDF) of $Y$. Formally, if the test point $x$ is ranked $k$-th from the top in $\{x\} \cup \mathcal{X}_U$, we predict the target value for $x$ as

$$ \hat{y} = F^{-1}\!\left( \frac{n_U + 1 - k}{n_U + 1} \right), \qquad (4) $$

where $F$ is the CDF of $Y$.
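A minimal sketch of this rank-then-invert baseline, under the assumption that the empirical quantile is smoothed to stay inside $(0, 1)$:

```python
import numpy as np

def rank_based_prediction(g, x_test, x_unlabeled, target_ppf):
    # Rank the test point among unlabeled data by score, then map its
    # empirical quantile through the inverse CDF of the target marginal.
    scores = g(x_unlabeled)
    s = g(x_test.reshape(1, -1))[0]
    q = (np.sum(scores < s) + 1) / (len(scores) + 2)  # smoothed to stay in (0, 1)
    return target_ppf(q)

# toy check: the score is the single feature itself and the target marginal is
# Uniform[0, 1], whose inverse CDF is the identity, so a median-score test
# point should be mapped near 0.5
rng = np.random.default_rng(1)
x_u = rng.normal(size=(10000, 1))
pred = rank_based_prediction(lambda x: x[:, 0], np.array([0.0]), x_u, lambda q: q)
```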

This approach, however, is known to be highly sensitive to noise, as discussed in Rigollet and Weed [28]. This is because noise in a single data point changes the ranking of all other data points and thus affects their predictions. As illustrated in Rigollet and Weed [28], even when we have a perfect ranker, i.e., we know the true order of the unlabeled data, the model (4) still differs from the expected target given the feature in the presence of noise.

## 4 Empirical Risk Minimization by Risk Approximation

In this section, we propose a method to learn a model from the pairwise comparison data, the unlabeled data, and the density function $p(y)$ of the target variable. The method follows the empirical risk minimization principle, with the risk approximated so that it can be empirically estimated from the available data. We therefore call it the risk approximation (RA) approach. Below, we present the approximated risk and derive its estimation error bound.

From the definition of the Bregman divergence, the risk function in (1) can be expressed as

$$ R(f) = \mathbb{E}_Y[\phi(Y)] - \mathbb{E}_X[\phi(f(X))] + \mathbb{E}_X\!\left[ \phi'(f(X))\, f(X) \right] - \mathbb{E}_{(X, Y)}\!\left[ \phi'(f(X))\, Y \right]. \qquad (5) $$

In this decomposition, the last term is the only problematic part in uncoupled regression, since it requires an expectation over the joint distribution $p(x, y)$. We therefore consider approximating this term using the following expectations over the distributions of $X^+$ and $X^-$.

###### Lemma 1.

For any bounded measurable function $h$, we have

$$ \mathbb{E}_{X^+}\!\left[ h(X^+) \right] = 2\,\mathbb{E}_{(X, Y)}\!\left[ F(Y)\, h(X) \right], \qquad \mathbb{E}_{X^-}\!\left[ h(X^-) \right] = 2\,\mathbb{E}_{(X, Y)}\!\left[ (1 - F(Y))\, h(X) \right], $$

where $F$ is the CDF of $Y$.

The proof can be found in Appendix C.1. From Lemma 1, we can see that $\mathbb{E}_{(X, Y)}[Y h(X)] = \frac{1}{2}\mathbb{E}_{X^+}[h(X^+)]$ if $F(y) = y$, which corresponds to the case where the target variable is marginally uniformly distributed on $[0, 1]$. This motivates us to consider an approximation of the form

$$ \mathbb{E}_{(X, Y)}\!\left[ \phi'(f(X))\, Y \right] \approx w_+\, \mathbb{E}_{X^+}\!\left[ \phi'(f(X^+)) \right] + w_-\, \mathbb{E}_{X^-}\!\left[ \phi'(f(X^-)) \right] \qquad (6) $$

for some constants $w_+$ and $w_-$. Note that the above uniform case corresponds to $(w_+, w_-) = (1/2, 0)$. More generally, if the target is marginally uniformly distributed on $[a, b]$ for $a < b$, that is, $p(y) = 1/(b - a)$ for all $y \in [a, b]$, approximation (6) becomes exact for $(w_+, w_-) = (b/2, a/2)$ from Lemma 1. In such a case, we can construct an unbiased estimator of the true risk from unlabeled and pairwise comparison data. For non-uniform target marginal distributions, we choose $(w_+, w_-)$ to minimize an upper bound on the estimation error, which we discuss in detail later.
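Identities of this kind can be checked by Monte Carlo simulation. The sketch below assumes the form $\mathbb{E}[h(X^+)] = 2\,\mathbb{E}[F(Y)\, h(X)]$ for a bounded test function $h$; the toy joint distribution is our own choice:

```python
import numpy as np

# Assumed identity: E[h(X+)] = 2 E[F(Y) h(X)], with F the CDF of Y.
rng = np.random.default_rng(0)
n = 200000
x = rng.normal(size=n)
y = x + 0.5 * rng.normal(size=n)              # toy joint p(x, y)

# pairwise winners from two i.i.d. copies of (X, Y)
x1, y1, x2, y2 = x[: n // 2], y[: n // 2], x[n // 2 :], y[n // 2 :]
x_plus = np.where(y1 >= y2, x1, x2)

h = np.tanh                                    # bounded test function
lhs = h(x_plus).mean()                         # estimate of E[h(X+)]
F = (np.argsort(np.argsort(y)) + 0.5) / n      # empirical CDF values F(y_i)
rhs = 2 * (F * h(x)).mean()                    # estimate of 2 E[F(Y) h(X)]
# lhs and rhs agree up to Monte Carlo error
```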

Since we have $\mathbb{E}_{X^+}[h(X^+)] + \mathbb{E}_{X^-}[h(X^-)] = 2\,\mathbb{E}_X[h(X)]$ from Lemma 1, the RHS of (6) equals

$$ (w_+ - \beta)\,\mathbb{E}_{X^+}\!\left[ \phi'(f(X^+)) \right] + (w_- - \beta)\,\mathbb{E}_{X^-}\!\left[ \phi'(f(X^-)) \right] + 2\beta\,\mathbb{E}_X\!\left[ \phi'(f(X)) \right] \qquad (7) $$

for an arbitrary constant $\beta$. Hence, substituting (7) for the last term of (5), we obtain the approximated risk

$$ R_{\mathrm{RA}}(f) = \mathbb{E}_Y[\phi(Y)] - \mathbb{E}_X[\phi(f(X))] + \mathbb{E}_X\!\left[ \phi'(f(X))\, f(X) \right] - (w_+ - \beta)\,\mathbb{E}_{X^+}\!\left[ \phi'(f(X^+)) \right] - (w_- - \beta)\,\mathbb{E}_{X^-}\!\left[ \phi'(f(X^-)) \right] - 2\beta\,\mathbb{E}_X\!\left[ \phi'(f(X)) \right]. $$

Here, $\mathbb{E}_Y[\phi(Y)]$ does not depend on $f$ and can be ignored in the optimization process. Now, the empirical estimator $\hat{R}_{\mathrm{RA}}(f)$ of $R_{\mathrm{RA}}(f)$ is

$$ \hat{R}_{\mathrm{RA}}(f) = \frac{1}{n_U} \sum_{i=1}^{n_U} \Bigl[ -\phi(f(x_i)) + \phi'(f(x_i))\, f(x_i) - 2\beta\,\phi'(f(x_i)) \Bigr] - \frac{1}{n_{PC}} \sum_{j=1}^{n_{PC}} \Bigl[ (w_+ - \beta)\,\phi'(f(x_j^+)) + (w_- - \beta)\,\phi'(f(x_j^-)) \Bigr], $$

which is to be minimized in the RA approach. Again, we would like to emphasize that if the marginal distribution of the target is uniform on $[a, b]$ and $(w_+, w_-)$ is set to $(b/2, a/2)$, then $R_{\mathrm{RA}}$ equals $R$ up to a constant, and $\hat{R}_{\mathrm{RA}}$ is an unbiased estimator of it.
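For the squared loss ($\phi(y) = y^2$, $\phi'(y) = 2y$), an estimator of this weighted form can be sketched as follows. The weight names `w_plus`, `w_minus`, and `beta` are our own, and the exact weighting is an assumption rather than the paper's verbatim formula:

```python
import numpy as np

def ra_risk_sq(f, x_u, x_plus, x_minus, w_plus=0.5, w_minus=0.0, beta=0.0):
    # Empirical RA-style risk for the squared loss (phi(y) = y^2, phi'(y) = 2y),
    # dropping the constant E[phi(Y)] term. Defaults (w_plus, w_minus) = (1/2, 0)
    # correspond to a target uniform on [0, 1]; the weighting is an assumption.
    fu, fp, fm = f(x_u), f(x_plus), f(x_minus)
    cross = ((w_plus - beta) * np.mean(2 * fp)
             + (w_minus - beta) * np.mean(2 * fm)
             + 2 * beta * np.mean(2 * fu))
    return np.mean(fu ** 2) - cross

# sanity check with X = Y ~ Uniform[0, 1]: for f(x) = x the estimator should be
# close to E[f(X)^2] - 2 E[f(X) Y] = 1/3 - 2/3 = -1/3
rng = np.random.default_rng(0)
x_u = rng.uniform(size=100000)
y1, y2 = rng.uniform(size=100000), rng.uniform(size=100000)
risk = ra_risk_sq(lambda x: x, x_u, np.maximum(y1, y2), np.minimum(y1, y2))
```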

From the definition of $\hat{R}_{\mathrm{RA}}$, we can see that by setting $\beta$ to either $w_+$ or $w_-$, the estimator becomes independent of either $\{x_j^+\}$ or $\{x_j^-\}$. This means that we can conduct uncoupled regression even if one side of the comparisons is missing from the data, which corresponds to the case where only winners or only losers of the comparisons are observed.

Another advantage of tuning the free parameter $\beta$ is that we can reduce the variance of the empirical risk, as discussed in Sakai et al. [30] and Bao et al. [1]. Following Sakai et al. [30], the optimal $\beta$ that minimizes the variance of $\hat{R}_{\mathrm{RA}}(f)$ for a fixed $f$ is derived as follows.

###### Theorem 1.

For a given model $f$, let $\sigma_+^2$, $\sigma_-^2$, and $\sigma_U^2$ be the variances of the pairwise and unlabeled components of $\hat{R}_{\mathrm{RA}}(f)$, respectively, where the variance is taken with respect to the corresponding random variable. Then, setting $\beta$ to balance these variances yields the estimator with the minimum variance among estimators of the form above when $n_U \to \infty$.

The proof can be found in Appendix C.3. From Theorem 1, we can see that the optimal $\beta$ is nonzero, which means that with a sufficient number of unlabeled data points we can reduce the variance of the empirical estimate by tuning $\beta$. This situation is natural, since unlabeled data is easier to collect than pairwise comparison data, as discussed in Duh and Kirchhoff [9].

Now, using the notion of the pseudo-dimension [12], we establish an upper bound on the estimation error, which is used to choose the weights $(w_+, w_-)$. Let $f_{\mathrm{RA}}$ and $f^*$ be the minimizers of $\hat{R}_{\mathrm{RA}}$ and $R$ in the hypothesis class $\mathcal{F}$, respectively. Then, we have the following theorem, which bounds the excess risk in terms of the parameters $(w_+, w_-, \beta)$.

###### Theorem 2.

Suppose that the pseudo-dimensions of the relevant function classes are finite and that there exist constants $C_\phi$ and $C_{\phi'}$ such that $|\phi(f(x))| \le C_\phi$ and $|\phi'(f(x))| \le C_{\phi'}$ for all $f \in \mathcal{F}$ and all $x \in \mathcal{X}$. Then,

$$ R(f_{\mathrm{RA}}) - R(f^*) \le 2\,\Delta(w_+, w_-) + C_\delta \left( n_{PC}^{-1/2} + n_U^{-1/2} \right) $$

holds with probability $1 - \delta$ for a constant $C_\delta$ depending on $\delta$ and the pseudo-dimensions, where $\Delta(w_+, w_-)$ is defined as

$$ \Delta(w_+, w_-) = C_{\phi'}\, \mathbb{E}_Y\!\left[ \left| Y - 2 w_+ F(Y) - 2 w_- (1 - F(Y)) \right| \right], \qquad (8) $$

with $F$ the CDF of $Y$.

The proof can be found in Appendix C.2. Note that the boundedness conditions hold for many losses, e.g., the squared loss, when we consider a hypothesis space of bounded functions.

From Theorem 2, we can see that we can learn a model with smaller excess risk by minimizing $\Delta(w_+, w_-)$. Note that $\Delta$ can easily be minimized since the density function $p(y)$ is known or can be estimated from $\mathcal{Y}_U$. In particular, if the target is uniformly distributed on $[a, b]$, we have $\Delta = 0$ by setting $(w_+, w_-) = (b/2, a/2)$. In that case, $f_{\mathrm{RA}}$ becomes a consistent model, i.e., $R(f_{\mathrm{RA}}) \to R(f^*)$ as $n_U \to \infty$ and $n_{PC} \to \infty$. The convergence rate is $O_p(n^{-1/2})$ in the joint sample size, which is the optimal parametric rate for empirical risk minimization without additional assumptions, given enough unlabeled and pairwise comparison data [21].

One important case where the target variable is uniformly distributed is when the target is a quantile value. For instance, suppose we are building a screening system for credit cards. What we are interested in is how creditworthy an applicant is relative to the population, i.e., we want to predict the quantile of the applicant's "credit score" in the marginal distribution. By definition, such a quantile value is uniformly distributed, and thus we can obtain a consistent model by minimizing $\hat{R}_{\mathrm{RA}}$.
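The uniformity of quantile targets is just the probability integral transform, which the following sketch verifies on skewed synthetic scores; the lognormal stand-in data is our own choice:

```python
import numpy as np

# Probability integral transform: for a continuous target Y, the quantile
# value F(Y) is uniform on [0, 1], so quantile targets satisfy the
# uniformity condition by construction.
rng = np.random.default_rng(0)
y = rng.lognormal(mean=10.0, sigma=1.0, size=100000)     # skewed raw scores
quantiles = (np.argsort(np.argsort(y)) + 0.5) / len(y)   # empirical F(y_i)
# every decile bin of `quantiles` now contains exactly 10% of the points
```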

In general cases, however, we may have $\Delta > 0$, in which case $f_{\mathrm{RA}}$ is not consistent. Nevertheless, this is inevitable, as the following theorem suggests.

###### Theorem 3.

There exists a pair of joint distributions that yield the same marginal distributions of the feature $X$ and the target $Y$, and the same distribution of the pairwise comparison data $(X^+, X^-)$, but have different conditional expectations $\mathbb{E}[Y \mid X = x]$.

Theorem 3 states that there exist pairs of distributions that cannot be distinguished from the available data. Since the risk minimizer is $\mathbb{E}[Y \mid X = x]$ when the hypothesis space is rich enough [11], this theorem implies that we cannot always obtain a consistent model. Still, by tuning the weights $(w_+, w_-)$, we can obtain a model competitive with a consistent one. In Section 6, we show that $f_{\mathrm{RA}}$ empirically achieves accuracy similar to a model learned from ordinary coupled data.

## 5 Empirical Risk Minimization by Target Transformation

In this section, we introduce another approach to uncoupled regression with pairwise comparison data, called the target transformation (TT) approach. Whereas the RA approach minimizes an approximation of the original risk, the TT approach transforms the target variable so that it is marginally uniformly distributed, and minimizes an unbiased estimator of the risk defined on the transformed variable.

Although there are several ways to map $Y$ to a uniformly distributed random variable, one natural candidate is its CDF $F$, which leads to considering the following risk:

$$ R_{\mathrm{TT}}(f) = \mathbb{E}_{(X, Y)}\!\left[ d_\phi\!\left( F(Y),\, f(X) \right) \right]. \qquad (9) $$

Since $F(Y)$ is uniformly distributed on $[0, 1]$ by definition, we can construct the following unbiased estimator $\hat{R}_{\mathrm{TT}}(f)$ of (9) by the same argument as in the previous section:

$$ \hat{R}_{\mathrm{TT}}(f) = \frac{1}{n_U} \sum_{i=1}^{n_U} \Bigl[ -\phi(f(x_i)) + \phi'(f(x_i))\, f(x_i) - 2\beta\,\phi'(f(x_i)) \Bigr] - \frac{1}{n_{PC}} \sum_{j=1}^{n_{PC}} \Bigl[ \Bigl( \tfrac{1}{2} - \beta \Bigr)\,\phi'(f(x_j^+)) - \beta\,\phi'(f(x_j^-)) \Bigr], $$

where $\beta$ is a hyper-parameter to be tuned. The TT approach minimizes $\hat{R}_{\mathrm{TT}}$ to learn a model. However, the learned model $f_{\mathrm{TT}}$ is, again, not always consistent in terms of the original risk $R$. This is because, in a rich enough hypothesis space $\mathcal{F}$, the minimizer of (9) differs from the minimizer of (1) unless the target is uniformly distributed. Hence, for non-uniform targets, we cannot always obtain a consistent model. However, we can still derive an estimation error bound if the hypothesis space contains the true target function and the target variable is generated as

$$ Y = f^*(X) + \varepsilon, \qquad (10) $$

where $f^*$ is the true target function and $\varepsilon$ is a zero-mean noise variable bounded in $[-\sigma_\varepsilon, \sigma_\varepsilon]$ for some constant $\sigma_\varepsilon > 0$.

###### Theorem 4.

Assume that the target variable is generated by (10). If the pseudo-dimensions of the relevant function classes are finite and there exist constants bounding $|\phi(f(x))|$ and $|\phi'(f(x))|$ for all $f \in \mathcal{F}$ as in Theorem 2, then an estimation error bound whose leading term scales with the noise level $\sigma_\varepsilon$ holds for $R(f_{\mathrm{TT}}) - R(f^*)$ with probability $1 - \delta$, where $f_{\mathrm{TT}}$ is the minimizer of $\hat{R}_{\mathrm{TT}}$ in $\mathcal{F}$.

The proof can be found in Appendix C.5. From Theorem 4, we can see that $f_{\mathrm{TT}}$ is not necessarily consistent. Again, this is inevitable, for the same reason as in the RA approach. Comparing Theorems 2 and 4, the TT approach is more advantageous than the RA approach when the target contains less noise. In Section 6, we empirically compare the two approaches and show that which one is more suitable differs from case to case.
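The prediction pipeline of the TT approach, fitting a model to CDF-transformed targets and inverting the CDF at test time, can be sketched as follows. The empirical CDF construction and the constant placeholder model `f_hat` are illustrative assumptions:

```python
import numpy as np

def make_cdf_and_inverse(y_samples):
    # Empirical CDF F of the target and its inverse, built from the observed
    # (uncoupled) set of target values.
    ys = np.sort(np.asarray(y_samples))
    n = len(ys)
    F = lambda y: np.searchsorted(ys, y, side="right") / n
    F_inv = lambda q: np.quantile(ys, np.clip(q, 0.0, 1.0))
    return F, F_inv

# In the TT approach, a model f is fit against the transformed target F(Y),
# which is uniform on [0, 1]; the final prediction maps back through F^{-1}.
y_set = np.random.default_rng(0).normal(size=100000)     # observed target set
F, F_inv = make_cdf_and_inverse(y_set)
f_hat = lambda x: 0.5 * np.ones(len(x))                  # placeholder model
pred = F_inv(f_hat(np.zeros((5, 3))))                    # quantile 0.5 -> median
```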

## 6 Experiments

In this section, we present the empirical performance of the proposed methods in experiments on synthetic data and benchmark data. We show that our proposed methods outperform the naive method described in (4) and perform similarly to a model learned by ordinary supervised learning with labeled data. All code is available on GitHub.

Before presenting the results, we describe the experimental procedure. In all experiments, we use the squared loss $L(y, \hat{y}) = (y - \hat{y})^2$, which corresponds to setting $\phi(y) = y^2$ in the Bregman divergence. Performance is evaluated by the mean squared error (MSE) on held-out test data. We repeat each experiment 100 times and report the mean and the standard deviation. We employ a hypothesis space of linear functions. The procedure for tuning the hyper-parameters in $\hat{R}_{\mathrm{RA}}$ and $\hat{R}_{\mathrm{TT}}$ can be found in Appendix A.

We use two types of baseline methods. One is the naive application of a ranking method described in (4), for which we use SVMRank [13]; for a fair comparison, we use the linear kernel. The other is ordinary supervised linear regression (LR), in which we fit a linear model using the true labels of the unlabeled data $\mathcal{X}_U$. Note that LR does not use the pairwise comparison data.

##### Result for Synthetic Data.

First, we show the results for synthetic data, for which we know the true marginal $p(y)$. We sample $d$-dimensional unlabeled data from the normal distribution $N(0, I_d)$, where $I_d$ is the identity matrix. We then sample a true unknown parameter $\theta^*$ with $\|\theta^*\| = 1$ uniformly at random. The target is generated as $y = \theta^{*\top} x + \varepsilon$, where $\varepsilon$ is Gaussian noise. Consequently, the marginal $p(y)$ is Gaussian, and it is supplied to the proposed methods and the ranking baseline. The pairwise comparison data is generated by (2): we first sample two features from $N(0, I_d)$ and then compare them based on their target values. We fix $n_U$ to 100,000 and vary $n_{PC}$ from 20 to 10,240 to see how performance changes with the size of the pairwise comparison data.
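The synthetic data protocol described above can be sketched as follows; the dimension, sample sizes, and noise level are illustrative choices, not the paper's exact values:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_u, n_pc, sigma = 10, 100000, 1024, 0.1   # illustrative sizes and noise

x_u = rng.normal(size=(n_u, d))                # unlabeled data from N(0, I_d)
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)                 # unit-norm direction, random
y_u = x_u @ theta + sigma * rng.normal(size=n_u)   # targets, hidden from learner

# pairwise comparison data: compare the targets of two fresh draws
xa, xb = rng.normal(size=(n_pc, d)), rng.normal(size=(n_pc, d))
ya = xa @ theta + sigma * rng.normal(size=n_pc)
yb = xb @ theta + sigma * rng.normal(size=n_pc)
win = (ya >= yb)[:, None]
x_plus, x_minus = np.where(win, xa, xb), np.where(win, xb, xa)
# with unit-norm theta, the marginal of Y is Gaussian with variance 1 + sigma^2
```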

The results are presented in Figure 2. With sufficient pairwise comparison data, the performance of our methods is significantly better than the SVMRank baseline and close to LR. This is remarkable, since LR uses the true labels of $\mathcal{X}_U$ while our methods do not.

Moreover, the TT approach outperforms the RA approach when pairwise comparison data is plentiful. This observation can be understood from the estimation error bound in Theorem 2, whose approximation error term becomes dominant once sufficient data is provided. This term is large for this synthetic data since the Gaussian target is not bounded. Hence, the guarantee for the RA approach becomes weaker than that for the TT approach when $n_{PC}$ is large enough, which explains the inferior empirical performance of the RA approach.

Meanwhile, when the pairwise comparison data is small, the TT approach is unstable and worse than the RA approach. This is because minimizing $\hat{R}_{\mathrm{TT}}$ amounts to learning the quantile value, which can be severely inaccurate when few comparisons are available. In contrast, $\hat{R}_{\mathrm{RA}}$ directly minimizes an approximation of the true risk, which is less sensitive to small $n_{PC}$.

##### Result for Benchmark Datasets.

We also conducted experiments on benchmark datasets, for which the true marginal $p(y)$ is unknown. Details of the benchmark datasets can be found in Appendix A. We use the original features as unlabeled data. The density function $p(y)$ is estimated from the target values in the dataset by kernel density estimation [25] with a Gaussian kernel, whose bandwidth is determined by cross-validation. The pairwise comparison data is constructed by comparing the true target values of two data points sampled uniformly from the dataset.
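Density estimation for $p(y)$ can be sketched with a Gaussian kernel density estimator. The paper tunes the bandwidth by cross-validation; for brevity this sketch uses SciPy's default bandwidth (Scott's rule), and the stand-in target values are our own:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
y_values = rng.normal(loc=5.0, scale=2.0, size=5000)   # stand-in target set
p_hat = gaussian_kde(y_values)                         # Scott's-rule bandwidth
density_at_mean = p_hat(np.array([5.0]))[0]
# for reference, the true N(5, 2^2) density at its mean is about 0.1995
```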

Table 1: MSE (standard deviation in parentheses) on benchmark datasets with $n_{PC} = 5{,}000$. LR is supervised regression; SVMRank, RA, and TT are uncoupled regression methods.

| Dataset | LR (supervised) | SVMRank | RA | TT |
|---|---|---|---|---|
| housing | 24.5 (5.0) | 110.3 (29.5) | 29.5 (6.9) | 22.5 (6.2) |
| diabetes | 3041.9 (219.8) | 8575.9 (883.1) | 3087.3 (256.3) | 3127.3 (278.8) |
| airfoil | 23.3 (2.2) | 62.1 (7.6) | 23.7 (2.0) | 22.7 (2.2) |
| concrete | 109.5 (13.3) | 322.9 (45.8) | 111.7 (13.2) | 139.1 (17.9) |
| powerplant | 20.6 (0.9) | 372.2 (34.8) | 21.8 (1.1) | 22.0 (1.0) |
| mpg | 12.1 (2.04) | 125 (15.1) | 12.8 (2.16) | 10.3 (2.08) |
| redwine | 0.412 (0.0361) | 1.28 (0.112) | 0.442 (0.0473) | 0.466 (0.0412) |
| whitewine | 0.574 (0.0325) | 1.58 (0.0691) | 0.597 (0.0382) | 0.644 (0.0414) |
| abalone | 5.05 (0.375) | 20.9 (1.44) | 5.26 (0.372) | 5.54 (0.424) |

Figure 2 shows the performance of each method on the housing dataset as the size of the pairwise comparison data varies. Although the TT approach is unstable when $n_{PC}$ is small, the proposed methods significantly outperform SVMRank and approach LR. This suggests that the estimation error in $p(y)$ has little impact on performance. The results for all datasets with $n_{PC} = 5{,}000$ are presented in Table 1, where both proposed methods show promising performance. Note that the approach with the lower MSE differs across datasets, so we cannot easily judge which approach is better.

## 7 Conclusions

In this paper, we proposed novel methods for uncoupled regression that utilize pairwise comparison data. We introduced two methods, the RA approach and the TT approach. The RA approach approximates the expected Bregman divergence by a linear combination of expectations over the given data, and the TT approach learns a model of quantile values and uses the inverse CDF to predict the target. We derived estimation error bounds for each method and showed that the learned model is consistent when the target variable is uniformly distributed. Furthermore, empirical evaluations on both synthetic data and benchmark datasets demonstrated the competitiveness of our methods. The results also indicated that the TT approach is unstable when the pairwise comparison data is small, and some regularization scheme may be needed to prevent this, which is left for future work.

#### Acknowledgments

LX utilized the facility provided by Masason Foundation. MS was supported by JST CREST Grant Number JPMJCR18A2.

## References

- Bao et al. [2018] H. Bao, G. Niu, and M. Sugiyama. Classification from pairwise similarity and unlabeled data. In Proceedings of the 35th International Conference on Machine Learning, 2018.
- Carpentier and Schlueter [2016] A. Carpentier and T. Schlueter. Learning relationships between data obtained independently. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016.
- Cohen and Richman [2002] W. W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
- Cortes and Mohri [2007] C. Cortes and M. Mohri. On transductive regression. In Proceedings of 19th Advances in Neural Information Processing Systems, 2007.
- Cortes et al. [2008] C. Cortes, M. Mohri, D. Pechyony, and A. Rastogi. Stability of transductive regression algorithms. In Proceedings of the 25th International Conference on Machine Learning, 2008.
- du Plessis et al. [2015] M. du Plessis, G. Niu, and M. Sugiyama. Convex formulation for learning from positive and unlabeled data. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
- du Plessis et al. [2014] M. C. du Plessis, G. Niu, and M. Sugiyama. Analysis of learning from positive and unlabeled data. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Proceedings of the 27th Advances in Neural Information Processing Systems, 2014.
- Dua and Graff [2017] D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
- Duh and Kirchhoff [2011] K. Duh and K. Kirchhoff. Semi-supervised ranking for document retrieval. Computer Speech and Language, 25(2):261–281, 2011.
- Efron et al. [2004] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.
- Frigyik et al. [2008] A. Frigyik, S. Srivastava, and M. R. Gupta. Functional Bregman divergence and Bayesian estimation of distributions. IEEE Transactions on Information Theory, 54(11):5130–5139, 2008.
- Haussler [1992] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.
- Herbrich et al. [2000] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 115–132, Cambridge, MA, 2000. MIT Press.
- Hsu et al. [2017] D. J. Hsu, K. Shi, and X. Sun. Linear regression without correspondence. In Proceedings of the 30th Advances in Neural Information Processing Systems, pages 1531–1540, 2017.
- Jebara [2004] T. Jebara. Kernelizing sorting, permutation and alignment for minimum volume PCA. In Proceedings of the 17th Annual Conference on Learning Theory, 2004.
- Jitta and Klami [2017] A. Jitta and A. Klami. Few-to-few cross-domain object matching. In Proceedings of The 3rd International Workshop on Advanced Methodologies for Bayesian Networks, 2017.
- Kostopoulos et al. [2018] G. Kostopoulos, S. Karlos, S. Kotsiantis, and O. Ragos. Semi-supervised regression: A recent review. Journal of Intelligent and Fuzzy Systems, 35:1–18, 2018.
- Lu et al. [2019] N. Lu, G. Niu, A. K. Menon, and M. Sugiyama. On the minimal supervision for training any binary classifier from only unlabeled data. In Proceedings of the 7th International Conference on Learning Representations, 2019.
- Mann and McCallum [2010] G. S. Mann and A. McCallum. Generalized expectation criteria for semi-supervised learning with weakly labeled data. The Journal of Machine Learning Research, 11:955–984, 2010.
- Massart [1990] P. Massart. The tight constant in the Dvoretzky–Kiefer–Wolfowitz inequality. The Annals of Probability, 18(3):1269–1283, 1990.
- Mendelson [2008] S. Mendelson. Lower bounds for the empirical minimization algorithm. IEEE Transactions on Information Theory, 54:3797–3803, 2008.
- Mohri et al. [2012] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
- Niu et al. [2016] G. Niu, M. C. du Plessis, T. Sakai, Y. Ma, and M. Sugiyama. Theoretical comparisons of positive-unlabeled learning against positive-negative learning. In Proceedings of the 29th Advances in Neural Information Processing Systems 29, 2016.
- Pananjady et al. [2018] A. Pananjady, M. J. Wainwright, and T. A. Courtade. Linear regression with shuffled data: Statistical and computational limits of permutation recovery. IEEE Transactions on Information Theory, 64(5):3286–3300, 2018.
- Parzen [1962] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
- Quadrianto et al. [2009] N. Quadrianto, L. Song, and A. J. Smola. Kernelized sorting. In Proceedings of the 21st Advances in Neural Information Processing Systems, 2009.
- Ray and Page [2001] S. Ray and D. Page. Multiple instance regression. In Proceedings of the 18th International Conference on Machine Learning, 2001.
- Rigollet and Weed [2019] P. Rigollet and J. Weed. Uncoupled isotonic regression via minimum Wasserstein deconvolution. Information and Inference: A Journal of the IMA, 2019.
- Rudin et al. [2005] C. Rudin, C. Cortes, M. Mohri, and R. E. Schapire. Margin-based ranking meets boosting in the middle. In Proceedings of the 18th Annual Conference on Learning Theory, 2005.
- Sakai et al. [2017] T. Sakai, M. C. du Plessis, G. Niu, and M. Sugiyama. Semi-supervised classification based on classification from positive and unlabeled data. In Proceedings of the 34th International Conference on Machine Learning, 2017.
- Walter and Fritsch [1999] V. Walter and D. Fritsch. Matching spatial data sets: a statistical approach. International Journal of Geographical Information Science, 13(5):445–473, 1999.
- Warner [1965] S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
- Yamada and Sugiyama [2011] M. Yamada and M. Sugiyama. Cross-domain object matching with model selection. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.
- Zhang and Goldman [2002] Q. Zhang and S. A. Goldman. EM-DD: An improved multiple-instance learning technique. In Advances in Neural Information Processing Systems 15, 2002.
- Zhou [2018] Z. Zhou. A brief introduction to weakly supervised learning. National Science Review, 5(1):44–53, 2018.

## Appendix A Experiments Details

In this appendix, we explain the detailed settings of the experiments. First, we describe the procedure of hyper-parameter tuning used during the experiments. Then, we provide detailed information on the benchmark datasets.

### a.1 Procedure of hyper-parameter tuning

To construct the risk , we need to tune , which is done by minimizing the empirical approximation of defined in (8). Let and be the -quantile and the -quantile of , respectively. Note that we can calculate these quantities since we have access to . Then, we define as , by which is approximated as

We employ the that minimizes the empirical approximation above with , and fix to be in all cases.

We also use an approximation in to reduce the computational time. Instead of calculating , we use , where is the logistic function . We fix for this risk, and the objective minimized during the experiments is
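The smoothing trick mentioned above is generic: a non-differentiable indicator is replaced with the logistic function so that the empirical objective can be minimized by gradient methods. The following is a minimal sketch of that idea only; the function names and the particular loss shown here are illustrative, not the exact risk from the paper.

```python
import numpy as np

def logistic(m):
    """Logistic function sigma(m) = 1 / (1 + exp(-m))."""
    return 1.0 / (1.0 + np.exp(-m))

def step_loss(margin):
    """Non-smooth 0-1 comparison loss: 1 when the margin is non-positive."""
    return (margin <= 0).astype(float)

def logistic_loss(margin):
    """Smooth surrogate of step_loss: differentiable everywhere,
    so gradient-based minimization becomes tractable."""
    return logistic(-margin)
```

For large positive margins both losses approach 0, and for large negative margins both approach 1; the surrogate simply rounds off the jump at the origin.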

After obtaining the minimizer of , we predict the target by .

### a.2 Benchmark dataset details

We use eight benchmark datasets from the UCI repository [8] and one (diabetes) from Efron et al. [10]. The details of the datasets can be found in Table 2. As preprocessing, we excluded all instances containing missing values, and we encoded the categorical feature in abalone as a one-hot vector.

Table 2: Details of the benchmark datasets.

| Dataset | Datasize | Dim. | Source |
|---|---|---|---|
| housing | 404 | 13 | UCI Repository |
| diabetes | 353 | 10 | [10] |
| airfoil | 1202 | 5 | UCI Repository |
| concrete | 824 | 8 | UCI Repository |
| powerplant | 7654 | 4 | UCI Repository |
| mpg | 313 | 7 | UCI Repository |
| redwine | 1279 | 11 | UCI Repository |
| whitewine | 3918 | 11 | UCI Repository |
| abalone | 3341 | 10 | UCI Repository |
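The preprocessing described above (dropping instances with missing values, one-hot encoding the categorical feature) can be sketched in a few lines. The mini-dataset and column layout below are our own illustration, not taken from the paper; only the two operations themselves come from the text.

```python
def drop_missing(rows):
    """Exclude any instance that contains a missing value (None)."""
    return [r for r in rows if all(v is not None for v in r)]

def one_hot(value, categories):
    """Encode a categorical value as a one-hot vector."""
    return [1.0 if value == c else 0.0 for c in categories]

# Illustrative mini-dataset: (sex, length, target), with a categorical
# first column as in abalone. The second instance has a missing value.
rows = [("M", 0.45, 9), ("F", None, 11), ("I", 0.30, 7)]
clean = drop_missing(rows)
encoded = [one_hot(r[0], ("M", "F", "I")) + list(r[1:]) for r in clean]
```

After encoding, each instance is a purely numeric vector, which is what the regression models in the experiments consume.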

## Appendix B Estimating Density Function and Cumulative Distribution Function

In this section, we discuss the case where the true probability density function is not given. In such a case, a slight modification of the proposed approaches is needed, since we have to estimate from the set of target values , where is the size of . We first introduce a modification of the RA approach and derive an estimation error bound for it. Then, we discuss the same for the TT approach.

### b.1 Modification of the risk approximation approach

Although does not depend on or , we need the information of when tuning the weights , which is done by minimizing defined in (8). Since cannot be directly calculated without and , we propose another quantity below, which replaces the expectation over with the empirical mean and the CDF with the empirical CDF.

where is the empirical CDF defined as

Note that can be minimized given . To show the validity of the method, we establish an estimation error bound involving as follows.
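The empirical CDF used throughout this appendix is the standard one: for a sample of size $n$, it maps each threshold $t$ to the fraction of sample points no larger than $t$. A minimal sketch (names are ours):

```python
def empirical_cdf(sample):
    """Return the empirical CDF of a 1-D sample:
    t -> (1/n) * #{i : y_i <= t}."""
    y = sorted(sample)
    n = len(y)
    def F_hat(t):
        return sum(v <= t for v in y) / n
    return F_hat

F = empirical_cdf([3.0, 1.0, 2.0, 4.0])
```

This is a right-continuous step function taking values in {0, 1/n, ..., 1}, and it is the quantity plugged into the modified risks in place of the unknown true CDF.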

###### Theorem 5.

Let be bounded in . Then, for all , we have

with probability .
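A standard ingredient for bounds of this type is the uniform convergence of the empirical CDF to the true CDF, given by the Dvoretzky–Kiefer–Wolfowitz inequality with the tight constant due to Massart [1990], cited above:

```latex
\Pr\!\left( \sup_{t \in \mathbb{R}} \bigl| \widehat{F}_n(t) - F(t) \bigr| > \varepsilon \right)
\le 2 e^{-2 n \varepsilon^2}
\qquad \text{for all } \varepsilon > 0,
```

or equivalently, with probability at least $1-\delta$, $\sup_t |\widehat{F}_n(t) - F(t)| \le \sqrt{\log(2/\delta)/(2n)}$.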

###### Proof.

### b.2 Modification on the target transformation approach

On the other hand, appears in the risk . Let be the risk obtained by substituting in with the empirical CDF, defined as

Using (11), we have

for all with probability . Let be the minimizer of in the hypothesis space . Then, under the condition given in Theorem 4, we have

with probability , therefore we have

with probability , which follows from a slight modification of the proof of Theorem 4.

## Appendix C Proofs

### c.1 Proof of Lemma 1

Lemma 1 can be proved as follows.

###### Proof of Lemma 1.

Let be the probability density function (PDF) of . From the definition of , we have

where is the normalizing constant and is the PDF of . Now, is calculated as

The last equality follows from integration by parts. Therefore, we have

The expectation over can be derived in the same way. ∎

### c.2 Proof of Theorem 2

Here, we show the proof of Theorem 2. First, we show that the gap between and can be bounded as follows.

###### Lemma 2.

For all , such that for all , we have

for all .

###### Proof.

Now, Theorem 2 can be derived as follows.