Unsupervised Domain Adaptation Meets Offline Recommender Learning

Yuta Saito
Tokyo Institute of Technology
saito.y.bj@m.titech.ac.jp
Abstract

To construct a well-performing recommender offline, it is critical to eliminate the selection bias of rating feedback. A currently promising solution to this challenge is the propensity weighting approach, which models the missing mechanism of rating feedback. However, the performance of existing propensity-based algorithms can be significantly affected by propensity estimation bias. To alleviate this problem, we formulate the missing-not-at-random recommendation as an unsupervised domain adaptation problem and derive a propensity-agnostic generalization error bound. We further propose a corresponding algorithm that minimizes the bound via adversarial learning. Our theoretical framework and algorithm do not depend on the propensity score and can obtain a well-performing rating predictor without the true propensity information. Empirical evaluation using benchmark real-world datasets demonstrates the effectiveness and the real-world applicability of the proposed approach.

1 Introduction

The main objective of recommender systems is to obtain a well-performing rating predictor from sparse observed rating feedback. A major challenge in this process is that the missing mechanism of most real-world rating datasets is missing-not-at-random (MNAR). The MNAR missing mechanism arises from two factors. The first is the past recommendation policy. If one relied on a policy that recommends popular items with high probability, then the ratings observed under that policy over-represent popular items [5, 33]. The other is users' self-selection. For example, users tend to rate items for which they have positive preferences, so ratings reflecting negative preferences are more likely to be missing [26, 20].

The MNAR problem makes it difficult to learn rating predictors from observable data [31], as it is widely recognized that naive methods often lead to sub-optimal and biased recommendations under MNAR settings [31, 26, 27]. One of the most established solutions is the propensity-based approach. It defines the probability of each instance being observed as the propensity score and obtains an unbiased estimator for the true metric of interest by weighting each data point by the inverse of its propensity [26, 18, 32]. In general, this unbiasedness is desirable; however, it is ensured only when the true propensities are available, and it is widely known that the performance of propensity-based algorithms is highly susceptible to the propensity estimation method [25, 2]. In real-world recommender systems, the true propensities are mostly unknown, which leads to severe bias in the estimation of the loss function of interest even when propensity weighting estimators are used.

To overcome this limitation of previous propensity-based methods, in this work, we establish a new theory of MNAR recommendation inspired by the theoretical framework of unsupervised domain adaptation. Similar to propensity weighting, unsupervised domain adaptation addresses problem settings in which the feature distributions of the training and test sets differ. Moreover, methods of unsupervised domain adaptation generally utilize distance metrics that measure the dissimilarity between probability distributions and do not depend on propensity weighting techniques [7, 6, 23, 24]. Thus, they are promising for solving the problem caused by propensity estimation bias. However, the connection between MNAR recommendation and unsupervised domain adaptation has not yet been thoroughly investigated.

To bridge these two potentially related fields, we first define a discrepancy metric that measures the distance between two missing mechanisms, inspired by domain discrepancy measures for unsupervised domain adaptation [3, 4]. Then, we derive a generalization error bound that depends on the naive loss on the MNAR feedback and the discrepancy between the ideal missing-completely-at-random (MCAR) and the common MNAR missing mechanisms. Our theoretical bound is independent of the propensity score; thus, the bias problem related to propensity scoring is eliminated. Moreover, we propose an algorithm called Domain Adversarial Matrix Factorization, which simultaneously minimizes the naive loss on the MNAR feedback and the discrepancy measure in an adversarial manner. Finally, we conduct experiments on standard real-world datasets to empirically demonstrate the effectiveness of the proposed approach.

The contributions of this paper can be summarized as follows.

  • We construct a new theoretical approach to the problem of MNAR recommendation based on the theoretical bound of unsupervised domain adaptation. Unlike the previous propensity-based unbiased estimation approach, our approach does not depend on the propensity score.

  • We propose Domain Adversarial Matrix Factorization, which eliminates the bias of a recommender by introducing domain adversarial learning into the matrix factorization model.

  • We conduct comprehensive experiments on standard real-world datasets. In particular, we show that the existing propensity-based approaches are susceptible to the choice of propensity estimators. Moreover, our proposed method outperforms the baseline methods with respect to rating prediction accuracy in situations where the true propensities are unknown or costly MCAR data is unavailable.

The rest of the paper is organized as follows. Section 2 summarizes the related literature. In Section 3, we formulate the MNAR recommendation problem and describe some limitations of existing methods. Then, in Section 4, we construct a new theoretical framework to address the MNAR problem inspired by the theory of unsupervised domain adaptation. Experimental setups and results are described in Section 5. Finally, we conclude and discuss future research directions in Section 6.

2 Related Work

2.1 De-biasing Recommender Systems

To address the bias of MNAR explicit feedback, several related works jointly posit a missing-data model and a rating model and estimate their parameters via iterative procedures [20, 11]. However, these methods are highly complex and do not perform well on real-world rating datasets [26, 32].

The propensity-based methods were proposed to overcome the limitations of these methods and to theoretically address the bias of MNAR feedback [26, 18, 32, 31]. Among them, the most basic approach is Inverse Propensity Score (IPS) estimation, established in the context of causal inference [21, 22, 12]. This method provides an unbiased estimator of the true metric of interest by weighting each data point by the inverse of its propensity score, defined as the probability of observing that instance. The rating predictor based on the IPS estimator empirically outperformed naive matrix factorization [15] and the probabilistic generative model [11]. These propensity-based methods can remove the bias of the naive methods, but their performance largely depends on the propensity estimation model. In fact, ensuring the performance of the propensity estimator is challenging in real-world recommendations because users are free to choose which items to rate, and one cannot control the missing mechanism [11].

Another approach for the MNAR recommendation is Causal Embeddings for Recommendations (CausE) proposed in [5]. This method jointly trains two matrix factorization models to reduce the effect of selection bias and empirically outperformed the propensity-based methods in terms of binary classification metrics. However, this method also requires some amount of MCAR rating feedback that is costly and inaccessible in most real-world recommender systems.

Therefore, a method that is independent of both the propensity score and MCAR data is needed for real-world applications, but such a method has not yet been proposed in the context of MNAR recommendation.

2.2 Unsupervised Domain Adaptation

Unsupervised domain adaptation (UDA) aims to train a predictor that works well on a target domain by using only labeled source data and unlabeled target data during training [23, 16]. The major challenge of this field is that the feature distributions and the labeling functions can differ between the source and target domains. Thus, a predictor trained using only the labeled source data does not generalize well on the target domain, and measuring the discrepancy between the two domains is essential to achieve the desired performance on the target domain [16, 17]. Several discrepancy measures for the difference in the feature distributions between the source and target domains have been proposed [3, 16, 17, 34]. For example, the $\mathcal{H}$-divergence and the $\mathcal{H}\Delta\mathcal{H}$-divergence [4, 3] have been used to construct many prediction methods in UDA, such as DANN, ADDA, and MCD [7, 6, 30, 24]. These methods are built on the adversarial learning framework and can be theoretically explained as minimizing the empirical errors and the discrepancy measures between the source and target domains. The optimization of these methods does not depend on the propensity score. Thus, methods of UDA are considered beneficial for constructing an effective recommender with biased rating feedback, because the true propensities are inaccessible in most real-world recommender systems.

The work most closely related to ours is [2]. In that study, a propensity-agnostic lower bound on the performance of treatment policies is derived. The bound is based on the well-established $\mathcal{H}$-divergence and can be optimized through domain adversarial learning. The proposed DACPOL procedure empirically outperforms POEM [28, 29], a propensity-based treatment policy optimization algorithm, under the situation where the past treatment policies (propensities) are unknown. Note that DACPOL targets treatment policy optimization and cannot be directly applied to the rating prediction task. Our proposed method shares a similar structure with the method proposed in [2] but is the first extension of domain adversarial learning to alleviating the bias of MNAR recommendation without true propensity information.

3 Preliminaries

In this section, we introduce the notations and formulation of the MNAR recommendation with explicit feedback. Then, we describe previous estimators and their limitations.

3.1 Notation and Formulation

In this study, $\mathcal{U}$ is a set of users ($u \in \mathcal{U}$), and $\mathcal{I}$ is a set of items ($i \in \mathcal{I}$). We also denote the set of all user and item pairs as $\mathcal{D} = \mathcal{U} \times \mathcal{I}$. Let $R \in \mathbb{R}^{|\mathcal{U}| \times |\mathcal{I}|}$ be a true rating matrix, where each entry $R_{u,i}$ represents the true rating of user $u$ for item $i$.

The objective of this study is to develop an algorithm to obtain an optimal predicted rating matrix $\hat{R} \in \mathbb{R}^{|\mathcal{U}| \times |\mathcal{I}|}$, where each entry $\hat{R}_{u,i}$ is the predicted rating for the pair $(u, i)$. To achieve this objective, we formally define the ideal loss function that an optimal algorithm should minimize as follows:

$\mathcal{L}_{ideal}(\hat{R}) = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \ell(R_{u,i}, \hat{R}_{u,i})$  (1)

where $\ell(\cdot, \cdot)$ is an arbitrary loss function. For example, when $\ell(R_{u,i}, \hat{R}_{u,i}) = |R_{u,i} - \hat{R}_{u,i}|$, Eq. (1) is called the mean-absolute-error (MAE); in contrast, when $\ell(R_{u,i}, \hat{R}_{u,i}) = (R_{u,i} - \hat{R}_{u,i})^2$, it is called the mean-squared-error (MSE).

In real-world recommender systems, calculating the ideal loss function is impossible because most of the rating data are missing. To formulate this missing mechanism precisely, we utilize two other matrices. The first is the propensity matrix, denoted as $P \in (0, 1]^{|\mathcal{U}| \times |\mathcal{I}|}$. Each entry of this matrix is the propensity score $P_{u,i}$ representing the probability of the feedback of the pair $(u, i)$ being observed. Next, let $O \in \{0, 1\}^{|\mathcal{U}| \times |\mathcal{I}|}$ be an observation matrix where each entry $O_{u,i}$ is a Bernoulli random variable with expectation $P_{u,i}$. If $O_{u,i} = 1$, then the rating of the pair $(u, i)$ is observed; otherwise, it is unobserved. Throughout this study, we assume $P_{u,i} > 0$ for all $(u, i) \in \mathcal{D}$ and for all the observation matrices.

Under this formulation, constructing an estimator for the ideal loss function that can be computed using only the set of observable feedback is critical to developing an effective recommendation algorithm.

3.2 Naive Estimator

Given the observed feedback data $\mathcal{O} = \{(u, i) \in \mathcal{D} : O_{u,i} = 1\}$, the most basic estimator for the ideal loss is the naive estimator defined as follows:

$\hat{\mathcal{L}}_{naive}(\hat{R} \mid O) = \frac{1}{|\mathcal{O}|} \sum_{(u,i) \in \mathcal{O}} \ell(R_{u,i}, \hat{R}_{u,i})$  (2)

The naive estimator is the average of the loss values over the observed rating feedback. This estimator is valid when the missing mechanism of the rating data is missing-completely-at-random (MCAR) because, under MCAR settings, it is unbiased against the ideal loss function [27, 26].

However, several previous studies have shown that, under general MNAR settings, the naive estimator is biased. Thus, it is undesirable for learning a recommendation algorithm [27, 26]; one should instead rely on an estimator that addresses this bias.

3.3 Inverse Propensity Score Estimator

To improve the naive estimator, several previous works applied the IPS estimation to the recommendation settings [26, 18]. In the context of causal inference, the propensity scoring estimator is widely used to estimate causal effects of treatments from observational data [21, 22, 12]. One can derive an unbiased estimator for the loss function of interest with the true propensity score as follows:

$\hat{\mathcal{L}}_{IPS}(\hat{R} \mid O, P) = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \frac{O_{u,i}}{P_{u,i}} \, \ell(R_{u,i}, \hat{R}_{u,i})$  (3)

This estimator is unbiased against the ideal loss and is thus considered more desirable than the naive estimator in terms of bias. However, this unbiasedness is valid only when the true propensity score is available; the IPS estimator can be biased under an inaccurate propensity estimator (see Lemma 5.1 of [26]). This bias problem arises in most real-world recommender systems because the missing mechanism of the rating feedback can depend on user self-selection as well as the past recommendation policy, making it challenging to accurately estimate the missing probability of each instance [20, 26, 32].

In fact, most of the previous studies estimate the propensity score for the propensity-based matrix factorization model using some amount of MCAR test data [26, 31]. However, this kind of propensity estimation is infeasible in practice because of the costly annotation process [8]. Therefore, in the next section, we develop a theory and an algorithm that are independent of the propensity score, aiming to alleviate the propensity estimation bias. In addition, we investigate the effect of different propensity estimators on the performance of the propensity-based matrix factorization method in the experimental section.
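To make the contrast concrete, the following minimal NumPy sketch compares the ideal loss in Eq. (1), the naive estimator in Eq. (2), and the IPS estimator in Eq. (3) with both true and misspecified propensities. The rating matrix, the fixed predictor, and the MNAR propensity model are synthetic illustrations of our own, not any of the benchmark datasets.

```python
# A minimal sketch contrasting Eqs. (1)-(3); all data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 1000, 100

R = rng.integers(1, 6, size=(n_users, n_items)).astype(float)  # true ratings
R_hat = np.full_like(R, 4.0)                                   # a fixed predictor

P = np.clip(0.05 * R, 0.05, 0.5)  # MNAR: higher ratings observed more often
O = rng.binomial(1, P)            # realized observation matrix

loss = np.abs(R - R_hat)               # element-wise MAE loss
L_ideal = loss.mean()                  # Eq. (1): average over all pairs
L_naive = loss[O == 1].mean()          # Eq. (2): average over observed pairs
L_ips = (O / P * loss).mean()          # Eq. (3) with the true propensities

P_bad = np.full_like(P, O.mean())      # a misspecified (uniform) estimate
L_ips_bad = (O / P_bad * loss).mean()  # Eq. (3) with estimated propensities

print(f"ideal {L_ideal:.3f} | naive {L_naive:.3f} | "
      f"IPS(true) {L_ips:.3f} | IPS(misspecified) {L_ips_bad:.3f}")
# The naive and misspecified-IPS estimates deviate from the ideal loss,
# whereas IPS with the true propensities concentrates around it.
```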

4 Proposed Method

In this section, we first derive the generalization error bound of the ideal loss function based on the discrepancy measure between two different propensity matrices. Our bound is propensity-agnostic; thus, the problem relating to the propensity estimation is eliminated in this bound. Inspired by the theoretical analysis, we propose Domain Adversarial Matrix Factorization (DAMF), which minimizes the derived upper bound via the adversarial learning procedure. The optimization of the proposed algorithm is independent of the propensity score; thus, the benefit of the proposed method is emphasized in situations with unknown propensities. Note that all the proofs in this section can be found in the supplementary materials.

4.1 Theoretical Bound

First, we define the discrepancy measure for the recommendation settings.

Definition 1.

(disc-divergence for recommendation) Let $\mathcal{H}$ be a class of predicted rating matrices and let $\ell$ be a loss function. Then, the disc-divergence between two propensity matrices $P^{(1)}$ and $P^{(2)}$ is defined as follows:

$disc_{\mathcal{H}}(P^{(1)}, P^{(2)}) = \max_{\hat{R}, \hat{R}' \in \mathcal{H}} \left| \mathcal{L}(\hat{R}, \hat{R}' \mid P^{(1)}) - \mathcal{L}(\hat{R}, \hat{R}' \mid P^{(2)}) \right|$  (4)

where $\mathcal{L}(\hat{R}, \hat{R}' \mid P) = \sum_{(u,i) \in \mathcal{D}} \tilde{P}_{u,i} \, \ell(\hat{R}_{u,i}, \hat{R}'_{u,i})$ and $\tilde{P}_{u,i} = P_{u,i} / \sum_{(u',i') \in \mathcal{D}} P_{u',i'}$ denotes the normalized propensity.

Note that the disc-divergence for recommendation is independent of the true rating matrix; it compares two predicted rating matrices rather than comparing predictions against the true ratings. Therefore, one can calculate this divergence for any given pair of propensity matrices without the true rating information.
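Because the divergence requires only the propensity matrices and a hypothesis class, it can be computed exactly for small finite classes. The following sketch illustrates Eq. (4) by brute force; the three-element hypothesis class and the two propensity matrices are toy assumptions for illustration only.

```python
# A brute-force sketch of the disc-divergence in Eq. (4) for a toy finite class.
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_users, n_items = 50, 20

# Toy finite hypothesis class: three constant rating predictors.
H = [np.full((n_users, n_items), c) for c in (1.0, 3.0, 5.0)]

P_mcar = np.full((n_users, n_items), 0.1)            # uniform propensities
P_mnar = rng.uniform(0.01, 0.3, (n_users, n_items))  # non-uniform propensities

def weighted_loss(P, R1, R2):
    # Loss between two hypotheses under the normalized propensities P-tilde.
    return (P * np.abs(R1 - R2)).sum() / P.sum()

# Eq. (4): the maximum weighted-loss gap over all hypothesis pairs in H x H.
disc = max(
    abs(weighted_loss(P_mcar, R1, R2) - weighted_loss(P_mnar, R1, R2))
    for R1, R2 in product(H, H)
)
print(f"disc-divergence: {disc:.4f}")
# No true ratings appear anywhere: the divergence depends only on the
# propensity matrices and the hypothesis class.
```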

However, in reality, the true propensity matrices ($P^{(1)}$ and $P^{(2)}$) are unobserved. Thus, one has to estimate the divergence using their realizations ($O^{(1)}$ and $O^{(2)}$). The following lemma provides a deviation bound for the disc-divergence.

Lemma 1.

Suppose that a pair of propensity matrices $P^{(1)}$ and $P^{(2)}$ and their realizations $O^{(1)}$ and $O^{(2)}$ are given, and that the loss function is bounded above by a positive constant $\Delta$. Then, for any $\delta \in (0, 1)$, the following inequality holds with a probability of at least $1 - \delta$:

$\left| disc_{\mathcal{H}}(P^{(1)}, P^{(2)}) - \widehat{disc}_{\mathcal{H}}(O^{(1)}, O^{(2)}) \right| \le 2\Delta \sqrt{\frac{\log(4|\mathcal{H}|^2/\delta)}{2|\mathcal{D}|}}$  (5)

where $\widehat{disc}_{\mathcal{H}}$ denotes the empirical divergence obtained by replacing each propensity matrix with its realized observation matrix.

Then, we state the generalization error bound based on an ideal MCAR observation.

Lemma 2.

(Generalization Error Bound under MCAR Observation) Suppose that an MCAR observation matrix $O^{mcar}$, whose propensities satisfy $P^{mcar}_{u,i} = p$ for all $(u, i) \in \mathcal{D}$,

and any finite hypothesis space $\mathcal{H}$ of predictions are given. The loss function is bounded above by a positive constant $\Delta$. Then, for any hypothesis $\hat{R} \in \mathcal{H}$ and for any $\delta \in (0, 1)$, the following inequality holds with a probability of at least $1 - \delta$:

$\mathcal{L}_{ideal}(\hat{R}) \le \hat{\mathcal{L}}_{naive}(\hat{R} \mid O^{mcar}) + \frac{\Delta}{p} \sqrt{\frac{\log(2|\mathcal{H}|/\delta)}{2|\mathcal{D}|}}$  (6)

The next lemma relates the losses under two different propensity matrices.

Lemma 3.

Assume that the loss function $\ell$ obeys the triangle inequality. Then, for any given predicted rating matrix $\hat{R} \in \mathcal{H}$ and two propensity matrices $P^{(1)}$ and $P^{(2)}$, the following inequality holds:

$\mathcal{L}(\hat{R} \mid P^{(1)}) \le \mathcal{L}(\hat{R} \mid P^{(2)}) + disc_{\mathcal{H}}(P^{(1)}, P^{(2)}) + \lambda^*$  (7)

where $\mathcal{L}(\hat{R} \mid P) = \sum_{(u,i) \in \mathcal{D}} \tilde{P}_{u,i} \, \ell(R_{u,i}, \hat{R}_{u,i})$ is the propensity-weighted loss against the true ratings and $\lambda^* = \min_{\hat{R}^* \in \mathcal{H}} \left( \mathcal{L}(\hat{R}^* \mid P^{(1)}) + \mathcal{L}(\hat{R}^* \mid P^{(2)}) \right)$ is the joint minimal loss.

Finally, using the disc-divergence for recommendation and the lemmas above, we derive the propensity-agnostic generalization error bound on the ideal loss function.

Theorem 1.

(Propensity-agnostic Generalization Error Bound) Suppose that two observation matrices $O^{mcar}$ and $O^{mnar}$, having MCAR and MNAR missing mechanisms with propensity matrices $P^{mcar}$ and $P^{mnar}$, respectively, and any finite hypothesis space $\mathcal{H}$ of predictions are given. The loss function is bounded above by a positive constant $\Delta$. Then, for any hypothesis $\hat{R} \in \mathcal{H}$ and for any $\delta \in (0, 1)$, the following inequality holds with a probability of at least $1 - \delta$:

$\mathcal{L}_{ideal}(\hat{R}) \le \hat{\mathcal{L}}_{naive}(\hat{R} \mid O^{mnar}) + \widehat{disc}_{\mathcal{H}}(O^{mcar}, O^{mnar}) + \lambda^* + C(\delta, \Delta, |\mathcal{H}|, |\mathcal{D}|)$  (8)

where $\lambda^*$ is the joint minimal loss defined in Lemma 3 and $C(\cdot)$ collects the finite-sample confidence terms of Lemmas 1 and 2, which vanish as $|\mathcal{D}| \to \infty$.

As previously explained, the bound is independent of the propensity score, and the problems relating to propensity score estimation are avoided.

4.2 Algorithm

Here, we describe the detailed algorithm of the proposed DAMF. Inspired by Theorem 1, we consider minimizing the following objective:

$\min_{\hat{R} \in \mathcal{H}} \; \hat{\mathcal{L}}_{naive}(\hat{R} \mid O^{mnar}) + \beta \cdot \widehat{disc}_{\mathcal{H}}(O^{mcar}, O^{mnar}),$

where $\beta$ is the trade-off hyperparameter between the naive loss on the MNAR feedback and the discrepancy between the MCAR and MNAR observation mechanisms. This optimization criterion consists of the two controllable terms of the theoretical bound in Eq. (8); both terms are independent of the propensity score, so one does not have to estimate propensities to optimize the objective. The minimization of the first term (the loss on the MNAR feedback) can easily be conducted. In contrast, minimizing the second term (the disc between the MCAR and MNAR mechanisms) is difficult because it requires an optimization over pairs of hypotheses.

Therefore, in this work, we introduce a discriminator to classify item latent factors into two classes, rare and popular, aiming to derive item latent factors such that item popularity bias is eliminated. We adopt this approach because item popularity bias is the most problematic type of bias in recommender systems [33], and a similar optimization approach has shown promising results in the neural word embedding literature [9].

Here we describe the proposed algorithm in detail. First, we denote the user and item latent factors as $u_u \in \mathbb{R}^d$ for each user $u$ and $v_i \in \mathbb{R}^d$ for each item $i$, collected in $U$ and $V$, and rating predictions are completed via the following dot product:

$\hat{R}_{u,i} = u_u^{\top} v_i.$

The loss function to derive these parameters is the naive loss on the observed feedback:

$\mathcal{L}_{rating}(U, V) = \frac{1}{|\mathcal{O}|} \sum_{(u,i) \in \mathcal{O}} \ell(R_{u,i}, u_u^{\top} v_i).$

Moreover, predictions for the item popularity are completed via the following linear transformation:

$\hat{d}_i = \sigma(w^{\top} v_i + b),$

where $(w, b)$ is a vector-scalar parameter pair and $\sigma(\cdot)$ is the sigmoid function. The outputs are confidence scores representing how rare each item is. The loss to derive these parameters is represented in the following binary cross-entropy form:

$\mathcal{L}_{disc}(V, w, b) = -\frac{1}{|\mathcal{D}_{rare}|} \sum_{(u,i) \in \mathcal{D}_{rare}} \log \hat{d}_i - \frac{1}{|\mathcal{D}_{pop}|} \sum_{(u,i) \in \mathcal{D}_{pop}} \log(1 - \hat{d}_i),$

where $\mathcal{D}_{rare}$ is the set of all users and rare items and, in contrast, $\mathcal{D}_{pop}$ is the set of all users and popular items.

Following the framework of domain adversarial training [6, 7, 9], the rating predictor and the popularity discriminator are trained in a minimax manner as follows:

$\min_{U, V} \max_{w, b} \; \mathcal{L}_{rating}(U, V) - \beta \, \mathcal{L}_{disc}(V, w, b),$  (9)

where $\beta$ is the trade-off hyperparameter between the prediction and domain losses. Given fixed latent factors $U$ and $V$, the optimization of the discriminator is as follows:

$\hat{w}, \hat{b} = \arg\max_{w, b} \; \mathcal{L}_{rating}(U, V) - \beta \, \mathcal{L}_{disc}(V, w, b),$  (10)

which is equivalent to minimizing the binary cross entropy. Then, given fixed discriminator parameters $(\hat{w}, \hat{b})$, the optimization of $U$ and $V$ is as follows:

$\hat{U}, \hat{V} = \arg\min_{U, V} \; \mathcal{L}_{rating}(U, V) - \beta \, \mathcal{L}_{disc}(V, \hat{w}, \hat{b}).$  (11)

We implement the proposed algorithm in TensorFlow and optimize $(U, V)$ and $(w, b)$ iteratively using the Adam optimizer [14]. The detailed training procedure of DAMF is described in Algorithm 1.

1: Input: a set of observed ratings $\mathcal{O}$; rare and popular user-item pairs $\mathcal{D}_{rare}, \mathcal{D}_{pop}$; learning_rate $\eta$; trade-off hyperparameter $\beta$; number of discriminator steps $k$.
2: Output: user and item latent factors $\hat{U}, \hat{V}$.
3: repeat
4:     Sample a mini-batch from $\mathcal{O}$
5:     Update $U$ and $V$ by gradient descent according to Eq. (11) with fixed $(w, b)$
6:     for $k$ discriminator steps do
7:         Update $(w, b)$ by gradient ascent according to Eq. (10) with fixed $U$ and $V$
8:     end for
9: until convergence
10: return $\hat{U}, \hat{V}$
Algorithm 1: Domain Adversarial Matrix Factorization (DAMF)
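For concreteness, the following TensorFlow 2 sketch implements the training loop of Algorithm 1 with an MSE rating loss. The synthetic data, shapes, learning rates, number of steps, and the 20% popularity split are illustrative assumptions rather than the exact experimental configuration.

```python
# A compact TensorFlow 2 sketch of Algorithm 1 (DAMF); all data is synthetic.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
n_users, n_items, dim, beta = 100, 50, 15, 0.735

# Synthetic observed (user, item, rating) triplets; top-20% item ids "popular".
users = rng.integers(0, n_users, 2000)
items = rng.integers(0, n_items, 2000)
ratings = tf.constant(rng.integers(1, 6, 2000), dtype=tf.float32)
rare = tf.constant(items >= int(0.2 * n_items), dtype=tf.float32)  # 1 = rare

U = tf.Variable(tf.random.normal([n_users, dim], stddev=0.1))
V = tf.Variable(tf.random.normal([n_items, dim], stddev=0.1))
w = tf.Variable(tf.random.normal([dim, 1], stddev=0.1))
b = tf.Variable(tf.zeros([1]))

opt_pred = tf.keras.optimizers.Adam(1e-2)
opt_disc = tf.keras.optimizers.Adam(1e-2)
bce = tf.keras.losses.BinaryCrossentropy()

def losses():
    u, v = tf.gather(U, users), tf.gather(V, items)
    pred = tf.reduce_sum(u * v, axis=1)                      # dot-product ratings
    rating_loss = tf.reduce_mean(tf.square(ratings - pred))  # naive MSE loss
    d = tf.sigmoid(tf.squeeze(v @ w, axis=1) + b)            # item "rareness" score
    return rating_loss, bce(rare, d)

for step in range(500):
    # Eq. (11): update U, V to fit ratings while fooling the discriminator.
    with tf.GradientTape() as tape:
        rating_loss, disc_loss = losses()
        pred_obj = rating_loss - beta * disc_loss
    opt_pred.apply_gradients(zip(tape.gradient(pred_obj, [U, V]), [U, V]))

    # Eq. (10): improve the discriminator (w, b) with U, V fixed; descending
    # the cross entropy equals ascending the minimax objective in Eq. (9).
    with tf.GradientTape() as tape:
        _, disc_loss = losses()
    opt_disc.apply_gradients(zip(tape.gradient(disc_loss, [w, b]), [w, b]))
```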

5 Experimental Evaluation

We conducted an empirical evaluation to compare the proposed method to other existing baselines. Note that the detailed description of the used datasets and the hyper-parameter tuning procedure can be found in the supplementary material.

5.1 Experimental Setup

5.1.1 Datasets

We used the following real-world datasets.

  • MovieLens (ML) 1M (http://grouplens.org/datasets/movielens/): This dataset contains five-star movie ratings collected from a movie recommendation service, and the ratings are MNAR. The dataset consists of approximately 1 million ratings from 6,040 users and 3,706 movies. In the experiments, we kept movies that had been rated by at least 20 users.

  • Yahoo! R3 dataset (http://webscope.sandbox.yahoo.com/): It contains five-star user-song ratings. The training data contains approximately 300,000 MNAR ratings from 15,400 users against 1,000 songs, and the test data was collected by asking a subset of 5,400 users to rate 10 randomly selected songs. Thus, the test data is considered desirable for evaluating the performance of a recommender.

  • Coat dataset (https://www.cs.cornell.edu/~schnabts/mnar/): It contains five-star user-coat ratings from 290 Amazon Mechanical Turk workers on an inventory of 300 coats. The training data contains 6,500 MNAR ratings collected through self-selection by the Turk workers. The test data was collected by asking the Turk workers to rate 16 randomly selected coats.

Figure 1: Rating distributions of training and test sets for all the datasets. The distributions are significantly different between the training and test sets. KL is the Kullback–Leibler divergence of the rating distributions between training and test sets.

5.1.2 Train/Validation/Test Splits

For the ML 1M dataset, we created test sets having rating distributions different from the original one. We created them by first dividing the original data into training and test sets, and then resampling records from the test set with weights proportional to the inverse of the rating density ratio in Eq. (12). This creates a test set whose rating distribution differs completely from that of the training set.

$w(r) = \frac{P_{train}(r)}{P^{*}(r)}$  (12)

Here, $P_{train}(r)$ is the empirical probability mass of rating value $r$ in the training set and $P^{*}(r)$ is a prespecified test prior (see Table 3 in the supplementary material).

We tested three prior rating distributions (type 1, type 2, and type 3) for the ML 1M dataset. As shown in Figure 1, type 1 exhibits a small, type 2 a medium, and type 3 a large difference between the training and test rating distributions.
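A minimal sketch of this resampling step, under our reading of Eq. (12), is given below; the training/test rating arrays and the prior (the type-3 masses from Table 3) are illustrative.

```python
# Resampling a test set by the inverse of the rating density ratio in Eq. (12).
import numpy as np

rng = np.random.default_rng(0)
skew = [0.05, 0.1, 0.25, 0.35, 0.25]  # an MNAR-like empirical distribution
train_ratings = rng.choice([1, 2, 3, 4, 5], size=10000, p=skew)
test_ratings = rng.choice([1, 2, 3, 4, 5], size=2000, p=skew)

p_star = {1: 0.5, 2: 0.4, 3: 0.05, 4: 0.03, 5: 0.02}  # type-3 prior (Table 3)
p_train = {r: np.mean(train_ratings == r) for r in range(1, 6)}

# Weight of each test record: the inverse density ratio p*(r) / p_train(r).
weights = np.array([p_star[r] / p_train[r] for r in test_ratings])
weights /= weights.sum()

idx = rng.choice(len(test_ratings), size=1000, replace=True, p=weights)
resampled = test_ratings[idx]
# The resampled test ratings now approximately follow p* instead of p_train.
```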

For the Yahoo! R3 and Coat datasets, the original datasets were divided into training and test sets. We randomly selected 10% of the original training set for the validation set.

5.1.3 Baselines & Propensity estimators

We compared the MF-IPS in [26] to our proposed DAMF. It predicts each rating by $\hat{R}_{u,i} = u_u^{\top} v_i$, where $u_u$ and $v_i$ are user and item latent factors. We did not include the user-item bias terms, to give the same model complexity as our proposed method. MF-IPS optimizes its parameters by minimizing the IPS loss in Eq. (3) with L2-regularization terms.

For the MF-IPS, we tested the following propensity estimators.

uniform propensity: $\hat{P}_{u,i} = |\mathcal{O}| / |\mathcal{D}|$
user propensity: $\hat{P}_{u,i} = |\{i' : O_{u,i'} = 1\}| / |\mathcal{I}|$
item propensity: $\hat{P}_{u,i} = |\{u' : O_{u',i} = 1\}| / |\mathcal{U}|$
user-item propensity: $\hat{P}_{u,i} = \hat{P}^{user}_{u} \cdot \hat{P}^{item}_{i}$
logistic regression: $\hat{P}_{u,i} = \sigma(\theta^{\top} x_{u,i})$, where $x_{u,i}$ is a feature vector of the pair $(u, i)$

Note that when the uniform propensity is used, the MF-IPS is identical to the MF with the naive loss function [15].

In contrast to previous works [26, 31], we did not use any data in the test set for the propensity estimation, to imitate the real-world situation. However, in Section 5.2, we report the results with the following propensity estimator as a reference.

Naive Bayes (NB) with true prior: $\hat{P}_{u,i} = P(O_{u,i} = 1 \mid R_{u,i} = r) = \dfrac{P(R = r \mid O = 1) \, P(O = 1)}{P(R = r)},$

where $r$ is a realized rating value. The likelihood $P(R = r \mid O = 1)$ and the evidence $P(O = 1)$ can be estimated from the MNAR training data, whereas the true prior $P(R = r)$ must be estimated from MCAR data. NB with true prior is, in reality, infeasible in most real-world problems because it requires MCAR explicit feedback to estimate the prior rating distribution.
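The following sketch shows how such estimators can be computed from the observation matrix alone; the exact functional forms are our assumptions following common practice (cf. [26]) rather than a verified reproduction of the experimental code.

```python
# Heuristic propensity estimators computed from a synthetic observation matrix.
import numpy as np

rng = np.random.default_rng(0)
O = (rng.random((1000, 100)) < 0.05).astype(float)  # synthetic observations

p_uniform = O.mean()                       # one global observation rate
p_user = O.mean(axis=1, keepdims=True)     # per-user rate, shape (n_users, 1)
p_item = O.mean(axis=0, keepdims=True)     # per-item rate, shape (1, n_items)
P_user_item = p_user * p_item / p_uniform  # one way to combine both signals

# Broadcast to a full matrix before plugging into the IPS loss in Eq. (3);
# entries are clipped away from zero before the inverse weighting.
P_hat = np.clip(np.broadcast_to(p_item, O.shape), 1e-3, 1.0)
```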

                       ML 1M (type 1)   ML 1M (type 2)   ML 1M (type 3)   Yahoo! R3        Coat
                       MAE    MSE       MAE    MSE       MAE    MSE       MAE    MSE       MAE    MSE
MF (uniform)           0.905  1.282     1.024  1.585     1.201  2.012     1.150  2.044     0.803  1.091
MF-IPS (user)          0.936  1.353     1.114  1.814     1.330  2.339     1.128  1.973     0.815  1.127
MF-IPS (item)          0.938  1.368     1.075  1.739     1.292  2.277     1.101  1.973     0.794  1.064
MF-IPS (user-item)     0.957  1.410     1.097  1.789     1.297  2.289     1.110  2.009     0.935  1.444
MF-IPS (logistic)      0.908  1.293     1.031  1.605     1.187  1.979     1.157  2.075     0.856  1.249
DAMF (ours)            0.890  1.215     0.968  1.416     1.110  1.744     0.991  1.566     0.909  1.413
MF-IPS (NB with true)  0.845  1.143     0.741  0.908     0.562  0.592     0.796  1.095     0.793  1.107
Table 1: Performance of the different approaches on all the datasets over five different train/validation/test splits. DAMF significantly outperformed the other methods on both metrics. The bold fonts represent the best performance among the methods except for MF-IPS (NB with true). We report the model performance on the test sets at the iteration with the lowest validation loss.

5.1.4 Hyperparameter Tuning

We tuned the dimension $d$ of the latent factors and the L2-regularization parameter for all the methods, and additionally the trade-off hyperparameter $\beta$ for the proposed method; the searching spaces are listed in Table 4 in the supplementary material. The combinations of hyperparameters that minimized the loss on the validation sets were selected using the Optuna software [1]. In addition, for the proposed method, we set the top 20% most frequent items in the training set as popular items and the remainder as rare items.
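A hedged sketch of this tuning loop with Optuna is shown below; the search-space bounds and the train_and_validate stub are placeholders, not the values actually used (those are listed in Table 4).

```python
# A sketch of hyperparameter tuning with Optuna; bounds are placeholders.
import optuna

def train_and_validate(dim, reg, beta):
    # Placeholder: fit the model with these hyperparameters and return the
    # validation loss; replaced by the real training routine in practice.
    return 0.0

def objective(trial):
    dim = trial.suggest_int("dim", 5, 40, step=5)           # latent dimension
    reg = trial.suggest_float("reg", 1e-6, 1e-1, log=True)  # L2 regularization
    beta = trial.suggest_float("beta", 0.0, 1.0)            # DAMF trade-off
    return train_and_validate(dim, reg, beta)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```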


Figure 2: Test MSEs vs. number of iterations on all the datasets. Note that, in Table 1, we report the model with the lowest validation loss out of all iterations.

5.2 Results & Discussions

Table 1 provides the averaged MAE and MSE over five different simulations on the ML 1M, Yahoo! R3, and Coat datasets.

First, consistent with the previous work [26], MF-IPS with true prior information performed the best in terms of both MAE and MSE. (The performance of MF-IPS (NB with true) on the Yahoo! R3 data is slightly worse than in the previous experiments [26] because we used a simple version of MF-IPS without user-item bias terms.) However, MF-IPS with the other propensity estimators did not always outperform the vanilla MF (MF-IPS with uniform). These results suggest that MF-IPS is potentially an effective de-biasing method but is highly sensitive to how the propensities are estimated. In particular, using only MNAR training data for the propensity estimation does not lead to a well-performing recommender.

DAMF achieved significant performance gains over the propensity-based MF models on the three types of the ML 1M dataset and on the Yahoo! R3 dataset. Moreover, as shown in Table 1, the performance gain on ML 1M type 3 is much larger than those on types 1 and 2: DAMF outperformed the best baseline by 13.4% on type 3, 11.9% on type 2, and 9.5% on type 1. These results suggest that the benefit of the proposed method grows when there is a large divergence between the training and test distributions.

On the other hand, the proposed method performed poorly only on the Coat dataset. This is presumably because DAMF has a larger number of parameters to optimize, while the size of this dataset is relatively small (only 6,264 ratings in the training set). This hypothesis is also consistent with the significant gains over MF-IPS on both ML 1M and Yahoo! R3, where the number of observed ratings is much larger than on the Coat dataset.

Figure 2 shows the test MSEs vs. the number of iterations. For the three types of the ML 1M dataset, DAMF generally outperformed MF-IPS after 300 iterations. For the Yahoo! R3 dataset, the performance of MF-IPS first reaches a very high level and then gradually worsens with further iterations. This phenomenon is very similar to the memorization effects in the noisy label literature [19, 10, 13]. In contrast, DAMF alleviates this degradation and almost monotonically improves its performance after 300 iterations.

To sum up, the proposed DAMF algorithm significantly outperforms the baseline methods, especially on moderately sized and severely biased datasets. The results validate the effectiveness of the proposed approach in situations where the true propensities are unknown or costly MCAR data is unavailable.

6 Conclusion

In this study, we explored the problem of learning rating predictors from MNAR explicit feedback. First, we derived a generalization error bound on the loss function of interest inspired by the theoretical framework of unsupervised domain adaptation. The bound is propensity-agnostic; thus, problems related to propensity estimation are eliminated in this bound. Then, we proposed Domain Adversarial Matrix Factorization, which simultaneously minimizes the naive loss on the MNAR feedback and the discrepancy between two missing mechanisms. Finally, we conducted experiments on standard real-world datasets and showed that the proposed method significantly outperformed the baseline methods under a realistic situation where the true propensities are inaccessible.

An important future research direction is the extension of the proposed method to recommendation with implicit feedback. Moreover, some gaps between the theory and the algorithm remain, although the benefit of the proposed algorithm was empirically shown; bridging them is another important theme.

Acknowledgement. The author would like to thank Suguru Yaginuma and Kazuki Taniguchi for their helpful comments and discussions.

References

  • [1] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, pages 2623–2631, New York, NY, USA, 2019. ACM.
  • [2] Onur Atan, William R Zame, and Mihaela van der Schaar. Learning optimal policies from observational data. arXiv preprint arXiv:1802.08679, 2018.
  • [3] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
  • [4] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in neural information processing systems, pages 137–144, 2007.
  • [5] Stephen Bonner and Flavian Vasile. Causal embeddings for recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys ’18, pages 104–112, New York, NY, USA, 2018. ACM.
  • [6] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1180–1189, Lille, France, 07–09 Jul 2015. PMLR.
  • [7] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • [8] Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. Offline a/b testing for recommender systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 198–206. ACM, 2018.
  • [9] Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. Frage: Frequency-agnostic word representation. In Advances in neural information processing systems, pages 1334–1345, 2018.
  • [10] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in neural information processing systems, pages 8527–8537, 2018.
  • [11] José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani. Probabilistic matrix factorization with non-random missing data. In International Conference on Machine Learning, pages 1512–1520, 2014.
  • [12] Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
  • [13] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055, 2017.
  • [14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [15] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.
  • [16] Seiichi Kuroki, Nontawat Charonenphakdee, Han Bao, Junya Honda, Issei Sato, and Masashi Sugiyama. Unsupervised domain adaptation based on source-guided discrepancy. arXiv preprint arXiv:1809.03839, 2018.
  • [17] Jongyeong Lee, Nontawat Charoenphakdee, Seiichi Kuroki, and Masashi Sugiyama. Domain discrepancy measure using complex models in unsupervised domain adaptation. arXiv preprint arXiv:1901.10654, 2019.
  • [18] Dawen Liang, Laurent Charlin, and David M Blei. Causal inference for recommendation. In Causation: Foundation to Application, Workshop at UAI, 2016.
  • [19] Eran Malach and Shai Shalev-Shwartz. Decoupling "when to update" from "how to update". In Advances in Neural Information Processing Systems, pages 960–970, 2017.
  • [20] Benjamin M Marlin and Richard S Zemel. Collaborative prediction and ranking with non-random missing data. In Proceedings of the third ACM conference on Recommender systems, pages 5–12. ACM, 2009.
  • [21] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
  • [22] Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of educational Psychology, 66(5):688, 1974.
  • [23] Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised domain adaptation. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2988–2997, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • [24] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3723–3732, 2018.
  • [25] Yuta Saito, Hayato Sakata, and Kazuhide Nakata. Doubly robust prediction and evaluation methods improve uplift modeling for observational data. In Proceedings of the 2019 SIAM International Conference on Data Mining, pages 468–476. SIAM, 2019.
  • [26] Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. Recommendations as treatments: Debiasing learning and evaluation. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1670–1679, New York, New York, USA, 20–22 Jun 2016. PMLR.
  • [27] Harald Steck. Training and testing of recommender systems on data missing not at random. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 713–722. ACM, 2010.
  • [28] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823, 2015.
  • [29] Adith Swaminathan and Thorsten Joachims. The self-normalized estimator for counterfactual learning. In advances in neural information processing systems, pages 3231–3239, 2015.
  • [30] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.
  • [31] Xiaojie Wang, Rui Zhang, Yu Sun, and Jianzhong Qi. Doubly robust joint learning for recommendation on data missing not at random. In International Conference on Machine Learning, pages 6638–6647, 2019.
  • [32] Yixin Wang, Dawen Liang, Laurent Charlin, and David M. Blei. The deconfounded recommender: A causal inference approach to recommendation. CoRR, abs/1808.06581, 2018.
  • [33] Longqi Yang, Yin Cui, Yuan Xuan, Chenyang Wang, Serge Belongie, and Deborah Estrin. Unbiased offline recommender evaluation for missing-not-at-random implicit feedback. In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys ’18, pages 279–287, New York, NY, USA, 2018. ACM.
  • [34] Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. Bridging theory and algorithm for domain adaptation. In International Conference on Machine Learning, pages 7404–7413, 2019.

Supplementary Materials

Appendix A Omitted Proofs

A.1 Proof of Lemma 1

Proof.

First, by the definition of the divergence,

$\left| disc_{\mathcal{H}}(P^{(1)}, P^{(2)}) - \widehat{disc}_{\mathcal{H}}(O^{(1)}, O^{(2)}) \right| \le \max_{\hat{R}, \hat{R}' \in \mathcal{H}} \left| \mathcal{L}(\hat{R}, \hat{R}' \mid P^{(1)}) - \hat{\mathcal{L}}(\hat{R}, \hat{R}' \mid O^{(1)}) \right| + \max_{\hat{R}, \hat{R}' \in \mathcal{H}} \left| \mathcal{L}(\hat{R}, \hat{R}' \mid P^{(2)}) - \hat{\mathcal{L}}(\hat{R}, \hat{R}' \mid O^{(2)}) \right|$  (13)

The deviations on the right-hand side can be bounded by following the same logic flow as the proof of Theorem 5.2 in [26]. Therefore, the following inequalities hold, each with a probability of at least $1 - \delta/2$:

$\max_{\hat{R}, \hat{R}' \in \mathcal{H}} \left| \mathcal{L}(\hat{R}, \hat{R}' \mid P^{(1)}) - \hat{\mathcal{L}}(\hat{R}, \hat{R}' \mid O^{(1)}) \right| \le \Delta \sqrt{\frac{\log(4|\mathcal{H}|^2/\delta)}{2|\mathcal{D}|}}$  (14)
$\max_{\hat{R}, \hat{R}' \in \mathcal{H}} \left| \mathcal{L}(\hat{R}, \hat{R}' \mid P^{(2)}) - \hat{\mathcal{L}}(\hat{R}, \hat{R}' \mid O^{(2)}) \right| \le \Delta \sqrt{\frac{\log(4|\mathcal{H}|^2/\delta)}{2|\mathcal{D}|}}$  (15)

Combining Eq. (13), Eq. (14), and Eq. (15) with the union bound completes the proof. ∎

A.2 Proof of Lemma 2

Proof.

Replacing the propensity $P_{u,i}$ with the uniform MCAR propensity $p$ for all $(u, i) \in \mathcal{D}$ in Theorem 5.2 of [26] completes the proof. ∎

A.3 Proof of Lemma 3

Proof.

Let $\hat{R}^*$ be the minimizer defining $\lambda^*$ in Lemma 3. By the triangle inequality of the loss function and the definition of the disc-divergence,

$\mathcal{L}(\hat{R} \mid P^{(1)}) \le \mathcal{L}(\hat{R}^* \mid P^{(1)}) + \mathcal{L}(\hat{R}, \hat{R}^* \mid P^{(1)}) \le \mathcal{L}(\hat{R}^* \mid P^{(1)}) + \mathcal{L}(\hat{R}, \hat{R}^* \mid P^{(2)}) + disc_{\mathcal{H}}(P^{(1)}, P^{(2)}) \le \mathcal{L}(\hat{R}^* \mid P^{(1)}) + \mathcal{L}(\hat{R}^* \mid P^{(2)}) + \mathcal{L}(\hat{R} \mid P^{(2)}) + disc_{\mathcal{H}}(P^{(1)}, P^{(2)}),$

which is the stated bound with $\lambda^* = \mathcal{L}(\hat{R}^* \mid P^{(1)}) + \mathcal{L}(\hat{R}^* \mid P^{(2)})$. ∎

A.4 Proof of Theorem 1

Proof.

First, we obtain the following inequality by substituting $P^{mcar}$ and $P^{mnar}$ for $P^{(1)}$ and $P^{(2)}$, respectively, in Lemma 3:

$\mathcal{L}_{ideal}(\hat{R}) = \mathcal{L}(\hat{R} \mid P^{mcar}) \le \mathcal{L}(\hat{R} \mid P^{mnar}) + disc_{\mathcal{H}}(P^{mcar}, P^{mnar}) + \lambda^*$  (16)

where the equality $\mathcal{L}_{ideal}(\hat{R}) = \mathcal{L}(\hat{R} \mid P^{mcar})$ holds

by definition, because the normalized MCAR propensities are uniform over $\mathcal{D}$. Then, from Lemma 1 and Lemma 2 applied with $\delta/2$ each, the following inequalities hold, each with a probability of at least $1 - \delta/2$:

$\mathcal{L}(\hat{R} \mid P^{mnar}) \le \hat{\mathcal{L}}_{naive}(\hat{R} \mid O^{mnar}) + C_1(\delta/2)$  (17)
$disc_{\mathcal{H}}(P^{mcar}, P^{mnar}) \le \widehat{disc}_{\mathcal{H}}(O^{mcar}, O^{mnar}) + C_2(\delta/2)$  (18)

where $C_1$ and $C_2$ are the corresponding confidence terms. Combining Eq. (16), Eq. (17), and Eq. (18) with the union bound completes the proof. ∎

Appendix B Detailed Experimental Setup

Here we describe the detailed experimental setups.

B.1 Statistics of the datasets used

The statistics of the datasets used in the experiments after preprocessing are given in Table 2. In addition, the prior rating distributions of types 1, 2, and 3 for the ML 1M dataset are given in Table 3.

                #User   #Item  #Train data  Sparsity  Avg rating (train)  Avg rating (test)  KL-divergence
ML 1M (type 1)  6,040   2,836  445,705      2.26%     3.58                3.00               0.155
ML 1M (type 2)  6,040   2,836  445,705      2.26%     3.58                2.20               0.641
ML 1M (type 3)  6,040   2,836  445,705      2.26%     3.58                1.67               1.571
Yahoo! R3       15,400  1,000  280,533      1.82%     2.89                1.82               0.470
Coat            290     300    6,264        7.20%     2.61                2.23               0.049
Table 2: Statistics of the datasets used in the experiments after pre-processing. KL-divergence is the divergence of the rating distributions between the training and test sets.
                Rating value:  1     2     3     4     5
ML 1M (type 1)                 0.2   0.2   0.2   0.2   0.2
ML 1M (type 2)                 0.35  0.3   0.2   0.1   0.05
ML 1M (type 3)                 0.5   0.4   0.05  0.03  0.02
Table 3: The test probability masses of each rating value for the three types of the ML 1M dataset.

B.2 Hyper-parameter tuning procedure

Table 4 describes the used hyper-parameter searching spaces. Moreover, the selected sets of hyper-parameters for all the methods are shown in Table 5.

Methods optimizer init. learning_rate batch_size max iterations
MF-IPS - Adam 2,500
DAMF Adam 2,500
Table 4: Hyperparameter searching spaces. The same searching spaces were used for all the datasets. $d$ is the dimension of the latent factors, and $\lambda_{reg}$ is the hyperparameter for the L2-regularization.
                       ML 1M (type1)  ML 1M (type2)  ML 1M (type3)  Yahoo! R3   Coat
MF-IPS (uniform)       5              5              5              5           35
MF-IPS (user)          5              5              5              5           40
MF-IPS (item)          5              5              5              5           40
MF-IPS (user-item)     5              5              5              5           5
MF-IPS (logistic)      5              5              5              5           5
MF-IPS (NB with true)  5              5              5              5           40
DAMF                   15 / 0.735     15 / 0.735     15 / 0.735     15 / 0.735  5 / 0.994
Table 5: The selected sets of hyperparameters for all the methods and all the datasets. Each cell reports the selected dimension of the latent factors; for DAMF, the second value is the selected trade-off hyperparameter.