# Few-shot Domain Adaptation by Causal Mechanism Transfer

## Abstract

We study few-shot supervised domain adaptation (DA) for regression problems, where only a few labeled target domain data and many labeled source domain data are available. Many of the current DA methods base their transfer assumptions on either parametrized distribution shift or apparent distribution similarities, e.g., identical conditionals or small distributional discrepancies. However, these assumptions may preclude the possibility of adaptation from intricately shifted and apparently very different distributions. To overcome this problem, we propose mechanism transfer, a meta-distributional scenario in which a data generating mechanism is invariant among domains. This transfer assumption can accommodate nonparametric shifts resulting in apparently different distributions while providing a solid statistical basis for DA. We take the structural equations in causal modeling as an example and propose a novel DA method, which is shown to be useful both theoretically and experimentally. Our method can be seen as the first attempt to fully leverage the structural causal models for DA.

## 1 Introduction

Learning from a limited amount of data is a long-standing yet actively studied problem of machine learning. Domain adaptation (DA) (Ben-David et al., 2010) tackles this problem by leveraging auxiliary data sampled from related but different domains. In particular, we consider few-shot supervised DA for regression problems, where only a few labeled target domain data and many labeled source domain data are available.

A key component of DA methods is the transfer assumption (TA) to relate the source and the target distributions. Many of the previously explored TAs have relied on certain direct distributional similarities, e.g., identical conditionals (Shimodaira, 2000) or small distributional discrepancies (Ben-David et al., 2007). However, these TAs may preclude the possibility of adaptation from apparently very different distributions. Many others assume parametric forms of the distribution shift (Zhang et al., 2013) or the distribution family (Storkey and Sugiyama, 2007), which can severely limit the considered set of distributions (we further review related work in Section 5.1).

To alleviate the intrinsic limitation of previous TAs due to relying on apparent distribution similarities or parametric assumptions, we focus on a meta-distributional scenario where there exists a common generative mechanism behind the data distributions (Figures 1,2). Such a common mechanism may be more conceivable in applications involving structured table data such as medical records (Yadav et al., 2018). For example, in medical record analysis for disease risk prediction, it can be reasonable to assume that there is a pathological mechanism that is common across regions or generations, but the data distributions may vary due to the difference in cultures or lifestyles. Such a hidden structure (pathological mechanism, in this case), once estimated, may provide portable knowledge to enable DA, allowing one to obtain accurate predictors for under-investigated regions or new generations.

Concretely, our assumption relies on the generative model of nonlinear independent component analysis (nonlinear ICA; Figure 1), where the observed labeled data are generated by first sampling latent independent components (ICs) and later transforming them by a nonlinear invertible mixing function denoted by $f$ (Hyvärinen et al., 2019). Under this generative model, our TA is that $f$, representing the mechanism, is identical across domains (Figure 2). This TA allows us to formally relate the domain distributions and develop a novel DA method without assuming their apparent similarities or making parametric assumptions.

#### Our contributions.

Our key contributions can be summarized in three points as follows.

1. We formulate the flexible yet intuitively accessible TA of shared generative mechanism and develop a few-shot regression DA method (Section 3). The idea is as follows. First, from the source domain data, we estimate the mixing function $f$ by nonlinear ICA (Hyvärinen et al., 2019) because $f$ is the only assumed relation of the domains. Then, to transfer the knowledge, we perform data augmentation on the target domain data using the estimated $\hat f$ and the independence of the IC distributions. In the end, the augmented data is used to fit a target predictor (Figure 3).

2. We theoretically justify the augmentation procedure by invoking the theory of generalized U-statistics (Lee, 1990). The theory shows that the proposed data augmentation procedure yields the uniformly minimum variance unbiased risk estimator in an ideal case. We also provide an excess risk bound (Mohri et al., 2012) to cover a more realistic case (Section 4).

3. We experimentally demonstrate the effectiveness of the proposed algorithm (Section 6). The real-world data we use is taken from the field of econometrics, for which structural equation models have been applied in previous studies (Greene, 2012).

A salient example of the generative model we consider is the structural equations of causal modeling (Section 2). In this context, our method can be seen as the first attempt to fully leverage the structural causal models for DA (Section 5.2).

## 2 Problem Setup

In this section, we describe the problem setup and the notation. To summarize, our problem setup is homogeneous, multi-source, and few-shot supervised domain adapting regression. That is, respectively, all data distributions are defined on the same data space, there are multiple source domains, and a limited number of labeled data is available from the target distribution (and we do not assume the availability of unlabeled data). In this paper, we use the terms domain and distribution interchangeably.

#### Notation.

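Let us denote the set of real (resp. natural) numbers by $\mathbb{R}$ (resp. $\mathbb{N}$). For $N \in \mathbb{N}$, we define $[N] := \{1, 2, \ldots, N\}$. Throughout the paper, we fix $D \in \mathbb{N}$ and suppose that the input space $\mathcal{X}$ is a subset of $\mathbb{R}^{D-1}$ and the label space $\mathcal{Y}$ is a subset of $\mathbb{R}$. As a result, the overall data space $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$ is a subset of $\mathbb{R}^{D}$. We generally denote a labeled data point by $Z = (X, Y)$. We denote by $\mathcal{Q}$ the set of independent distributions on $\mathbb{R}^{D}$ with absolutely continuous marginals. For a distribution $p$, we denote its induced expectation operator by $\mathbb{E}_p$. Table 3 in Supplementary Material provides a summary of notation.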

#### Basic setup: Few-shot domain adaptation.

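Let $p_{\mathrm{Tar}}$ be a distribution (the target distribution) over $\mathcal{Z}$, and let $\mathcal{G}$ be a hypothesis class. Let $\ell(g, z)$ be a loss function taking values in $[0, B_\ell]$, where $B_\ell > 0$ is a constant. Our goal is to find a predictor $g \in \mathcal{G}$ which performs well for $p_{\mathrm{Tar}}$, i.e., the target risk $R(g) := \mathbb{E}_{p_{\mathrm{Tar}}} \ell(g, Z)$ is small. We denote $g^* \in \operatorname{arg\,min}_{g \in \mathcal{G}} R(g)$. To this goal, we are given an independent and identically distributed (i.i.d.) sample $\{Z_i\}_{i=1}^{n_{\mathrm{Tar}}}$ from $p_{\mathrm{Tar}}$. In a fully supervised setting where $n_{\mathrm{Tar}}$ is large, a standard procedure is to select $g$ by empirical risk minimization (ERM), i.e., $\hat g \in \operatorname{arg\,min}_{g \in \mathcal{G}} \hat R(g)$, where $\hat R(g) := \frac{1}{n_{\mathrm{Tar}}} \sum_{i=1}^{n_{\mathrm{Tar}}} \ell(g, Z_i)$. However, when $n_{\mathrm{Tar}}$ is not sufficiently large, $\hat R(g)$ may not accurately estimate $R(g)$, resulting in a high generalization error of $\hat g$. To compensate for the scarcity of data from the target distribution, let us assume that we have data from $K$ distinct source distributions over $\mathcal{Z}$, that is, we have independent i.i.d. samples $\{Z^{\mathrm{Src}}_{k,i}\}_{i=1}^{n_k}$ ($k \in [K]$) whose relations to $p_{\mathrm{Tar}}$ are described shortly. We assume for simplicity.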

#### Key assumption.

In this work, the key transfer assumption is that all domains follow nonlinear ICA models with identical mixing functions (Figure 2). To be precise, we assume that there exists a set of IC distributions, one per domain, and a smooth invertible function $f$ (the transformation or mixing) such that each source data point $Z^{\mathrm{Src}}_{k,i}$ is generated by first sampling the ICs $S^{\mathrm{Src}}_{k,i}$ from the IC distribution of domain $k$ and later transforming them by

$$Z^{\mathrm{Src}}_{k,i} = f(S^{\mathrm{Src}}_{k,i}), \tag{1}$$

and similarly for the target data $Z_i$. The above assumption allows us to formally relate the source and the target distributions. It also allows us to estimate $f$ when sufficient identification conditions required by the theory of nonlinear ICA are met. Due to space limitation, we provide a brief review of the nonlinear ICA method used in this paper and the known theoretical conditions in Supplementary Material A. The requirement for multiple source domains comes from the currently known identification condition of nonlinear ICA. Note that complex changes in the IC distributions are allowed, hence the assumption of invariant $f$ can accommodate intricate shifts in the apparent data distributions. We discuss this further in Section 5.3 by taking a simple example.

#### Example: Structural equation models

A salient example of generative models expressed as Eq. (1) is structural equation models (SEMs; Pearl, 2009; Peters et al., 2017), more precisely, the reduced form (Reiss and Wolak, 2007) of Markovian SEMs (Pearl, 2009) under adequate assumptions such as invertibility. SEMs are used to describe the data generating mechanism involving the causality of random variables (Pearl, 2009). This interpretation of SEMs as Eq.(1) has been exploited in methods of causal structure discovery such as the linear non-Gaussian additive-noise models and their successors (Kano and Shimizu, 2003; Shimizu et al., 2006; Monti et al., 2019). In the case of SEMs, the key assumption of this paper translates into the invariance of the structural equations among domains. Note that we do not require the estimation of the exact structural equations, but only the reduced form needs to be estimated. Since the reduced-form equation can be derived given the structural-form equations by recursive imputation, we are only required to solve an easier estimation problem compared to causal discovery.
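To make the relation between the structural form and the reduced form concrete, the following minimal sketch uses a hypothetical two-variable linear SEM. The coefficient `2.0` and all function names are illustrative assumptions, not taken from the paper. Substituting the parent equation into the child equation gives a map from the independent noises to the observed variables, which is an instance of Eq. (1).

```python
def structural_form(s):
    """Hypothetical Markovian SEM: each variable depends on its parents and its own noise."""
    s1, s2 = s
    x = s1              # X := S1
    y = 2.0 * x + s2    # Y := 2 X + S2
    return (x, y)


def reduced_form(s):
    """Reduced form z = f(s): observed variables written directly in terms of the noises."""
    s1, s2 = s
    return (s1, 2.0 * s1 + s2)


def reduced_form_inverse(z):
    """Inverse of f: recovers the independent components from an observation."""
    x, y = z
    return (x, y - 2.0 * x)
```

Only `reduced_form` and its inverse are needed by a method that works with Eq. (1); the causal ordering encoded in `structural_form` does not have to be recovered.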

## 3 Proposed Method: Mechanism Transfer

In this section, we detail the proposed method, mechanism transfer (Algorithm 1). The method first estimates the common generative mechanism from the source domain data and then uses it to perform data augmentation of the target domain data to transfer the knowledge (Figure 3).

### 3.1 Step 1: Estimate f using the source domain data

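The first step estimates the common transformation $f$ by nonlinear ICA, namely via generalized contrastive learning (GCL; Hyvärinen et al., 2019). GCL uses auxiliary information for training a certain binary classification function, $r_{\hat f, \psi}$, equipped with a parametrized feature extractor $\hat f$. The trained feature extractor $\hat f$ is used as an estimator of $f$. The auxiliary information we use in our problem setup is the domain indices $[K]$. The classification function to be trained in GCL is $r_{\hat f, \psi}(z, u) := \sum_{d=1}^{D} \psi_d(\hat f^{-1}(z)_d, u)$ consisting of $(\hat f, \{\psi_d\}_{d=1}^{D})$, and the classification task of GCL is logistic regression to classify $(Z^{\mathrm{Src}}_{k,i}, k)$ as positive and $(Z^{\mathrm{Src}}_{k,i}, k')$ ($k' \neq k$) as negative. This yields the following domain-contrastive learning criterion to estimate $f$:

$$\operatorname*{arg\,min}_{\hat f \in \mathcal{F},\ \{\psi_d\}_{d=1}^{D} \subset \Psi} \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i=1}^{n_k} \left( \phi\left(r_{\hat f, \psi}(Z^{\mathrm{Src}}_{k,i}, k)\right) + \mathbb{E}_{k' \neq k}\, \phi\left(-r_{\hat f, \psi}(Z^{\mathrm{Src}}_{k,i}, k')\right) \right),$$

where $\mathcal{F}$ and $\Psi$ are sets of parametrized functions, $\mathbb{E}_{k' \neq k}$ denotes the expectation with respect to $k' \sim U([K] \setminus \{k\})$ ($U$ denotes the uniform distribution), and $\phi$ is the logistic loss $\phi(m) := \log(1 + \exp(-m))$. We use the solution $\hat f$ as an estimator of $f$. In experiments, $\mathcal{F}$ is implemented by invertible neural networks (Kingma and Dhariwal, 2018), $\Psi$ by multi-layer perceptron, and $\mathbb{E}_{k' \neq k}$ is replaced by a random sampling renewed for every mini-batch.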
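As a rough illustration of the domain-contrastive criterion, the sketch below evaluates the objective for a given scoring function. It is not the training code of the paper: the name `domain_contrastive_loss`, the 0-based domain indices, and the use of the exact average over the other domains (instead of per-mini-batch sampling) are assumptions made for this example. At least two source domains are required.

```python
import math


def logistic_loss(m):
    """phi(m) = log(1 + exp(-m)), evaluated in a numerically stable way."""
    if m >= 0:
        return math.log1p(math.exp(-m))
    return -m + math.log1p(math.exp(m))


def domain_contrastive_loss(r, source_samples):
    """Evaluate the criterion for a scoring function r(z, k).

    source_samples[k] is the list of data points from source domain k.
    A point paired with its own domain index is a positive example;
    the same point paired with any other index is a negative example.
    """
    num_domains = len(source_samples)
    total = 0.0
    for k, sample in enumerate(source_samples):
        others = [kp for kp in range(num_domains) if kp != k]
        domain_sum = 0.0
        for z in sample:
            positive = logistic_loss(r(z, k))
            # Exact expectation over k' drawn uniformly from the other domains.
            negative = sum(logistic_loss(-r(z, kp)) for kp in others) / len(others)
            domain_sum += positive + negative
        total += domain_sum / len(sample)
    return total
```

In an actual implementation, `r` would be built from the inverse of the invertible network and the networks $\psi_d$, and the objective would be minimized by stochastic gradient descent.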

### 3.2 Step 2: Extract and inflate the target ICs using $\hat f$

The second step extracts and inflates the target domain ICs using the estimated $\hat f$. We first extract the ICs of the target domain data by applying the inverse of $\hat f$ as

$$\hat s_i = \hat f^{-1}(Z_i).$$

After the extraction, we inflate the set of IC values by taking all dimension-wise combinations of the estimated ICs:

$$\bar s_{\boldsymbol{i}} = \left(\hat s^{(1)}_{i_1}, \ldots, \hat s^{(D)}_{i_D}\right), \quad \boldsymbol{i} = (i_1, \ldots, i_D) \in [n_{\mathrm{Tar}}]^{D},$$

to obtain new plausible IC values $\bar s_{\boldsymbol{i}}$. The intuitive motivation of this procedure stems from the independence of the IC distributions. Theoretical justifications are provided in Section 4. In our implementation, we use invertible neural networks (Kingma and Dhariwal, 2018) to model the function $\hat f$ to enable the computation of the inverse $\hat f^{-1}$.
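The inflation step amounts to a Cartesian product over the per-dimension IC values. A minimal sketch, assuming the extracted ICs are already available as a list of tuples (the function name `inflate_ics` is hypothetical):

```python
from itertools import product


def inflate_ics(s_hat):
    """Return all dimension-wise recombinations of the extracted ICs.

    s_hat is a list of n IC vectors, each of length D.
    The result contains n**D vectors, including the n original ones.
    """
    num_dims = len(s_hat[0])
    # columns[d] collects the d-th coordinate of every extracted IC vector.
    columns = [[s[d] for s in s_hat] for d in range(num_dims)]
    return [tuple(combo) for combo in product(*columns)]
```

For example, three two-dimensional IC vectors yield nine recombined vectors. Because the count grows as $n^D$, a practical implementation may draw a random subset of index tuples instead of enumerating all of them.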

### 3.3 Step 3: Synthesize target data from the inflated ICs

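The third step estimates the target risk $R$ by the empirical distribution of the augmented data:

$$\check R(g) := \frac{1}{n_{\mathrm{Tar}}^{D}} \sum_{\boldsymbol{i} \in [n_{\mathrm{Tar}}]^{D}} \ell\left(g, \hat f(\bar s_{\boldsymbol{i}})\right), \tag{2}$$

and performs empirical risk minimization. In experiments, we use a regularization term $\Omega(\cdot)$ to control the complexity of $\mathcal{G}$ and select

$$\check g \in \operatorname*{arg\,min}_{g \in \mathcal{G}} \left\{ \check R(g) + \Omega(g) \right\}.$$

The generated hypothesis $\check g$ is then used to make predictions in the target domain. In our experiments, we use $\Omega(g) = \lambda \|g\|^2$, where $\lambda > 0$ and the norm is that of the reproducing kernel Hilbert space (RKHS) which we take the subset $\mathcal{G}$ from. Note that we may well reduce the computing time by taking only a subset of combinations in Eq. (2).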
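The three steps can be combined into a short routine that evaluates the augmented risk of Eq. (2) for a fixed hypothesis. This is a sketch under simplifying assumptions: `f_hat` and `f_hat_inv` are given as plain Python callables, the regularization term is omitted, and all names are hypothetical.

```python
from itertools import product


def augmented_risk(g, loss, f_hat, f_hat_inv, target_data):
    """Average loss over data synthesized from all dimension-wise IC recombinations."""
    s_hat = [f_hat_inv(z) for z in target_data]       # Step 2: extract the ICs
    num_dims = len(s_hat[0])
    columns = [[s[d] for s in s_hat] for d in range(num_dims)]
    total, count = 0.0, 0
    for s_bar in product(*columns):                   # Step 2: inflate (n**D combinations)
        total += loss(g, f_hat(s_bar))                # Step 3: synthesize and evaluate
        count += 1
    return total / count


# Toy usage with an assumed mixing (x, y) = (s1, s1 + s2) and the squared loss.
def toy_f(s):
    return (s[0], s[0] + s[1])


def toy_f_inv(z):
    return (z[0], z[1] - z[0])


def squared_loss(g, z):
    return (g(z[0]) - z[1]) ** 2


risk = augmented_risk(lambda x: 0.0, squared_loss, toy_f, toy_f_inv,
                      [(0.0, 1.0), (2.0, 1.0)])
```

Minimizing this quantity over a hypothesis class, possibly with a regularizer, gives the target predictor.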

## 4 Theoretical Insights

In this section, we state two theorems to investigate the statistical properties of the method proposed in Section 3 and provide plausibility beyond the intuition that we take advantage of the independence of the IC distributions.

### 4.1 Minimum variance property: Idealized case

The first theorem provides an insight into the statistical advantage of the proposed method: in the ideal case, the method attains the minimum variance among all possible unbiased risk estimators.

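###### Theorem 1 (Minimum variance property of $\check R$).

Assume that $\hat f = f$. Then, for each $g \in \mathcal{G}$, the proposed risk estimator $\check R(g)$ is the uniformly minimum variance unbiased estimator of $R(g)$, i.e., for any unbiased estimator $\tilde R(g)$ of $R(g)$,

$$\forall q \in \mathcal{Q}, \quad \mathrm{Var}(\check R(g)) \le \mathrm{Var}(\tilde R(g))$$

as well as $\mathbb{E}[\check R(g)] = R(g)$ holds.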

The proof of Theorem 1 is immediate once we rewrite $R(g)$ as a $D$-variate regular statistical functional and $\check R(g)$ as its corresponding generalized U-statistic (Lee, 1990). Details can be found in Supplementary Material D. Theorem 1 implies that the proposed risk estimator can have superior statistical efficiency in terms of the variance over the ordinary empirical risk.
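The variance reduction can be checked numerically in a toy setting. The sketch below is an illustration only: it assumes the mixing function is known, takes two independent standard normal ICs, and uses a hypothetical function `h` that plays the role of the loss of a fixed hypothesis written as a function of the ICs. It then compares the Monte Carlo variance of the ordinary empirical average with that of the average over all dimension-wise recombinations.

```python
import random
from itertools import product


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def plain_risk(ics, h):
    """Ordinary empirical average over the n observed IC vectors."""
    return mean(h(s) for s in ics)


def recombined_risk(ics, h):
    """Average over all n**D dimension-wise recombinations of the IC vectors."""
    num_dims = len(ics[0])
    columns = [[s[d] for s in ics] for d in range(num_dims)]
    return mean(h(s_bar) for s_bar in product(*columns))


def compare_variances(n=5, trials=2000, seed=0):
    """Monte Carlo variances of both estimators under independent N(0, 1) ICs."""
    rng = random.Random(seed)
    h = lambda s: (s[0] + s[1]) ** 2
    plain, recombined = [], []
    for _ in range(trials):
        ics = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
        plain.append(plain_risk(ics, h))
        recombined.append(recombined_risk(ics, h))
    return variance(plain), variance(recombined)
```

Both estimators are unbiased for the same quantity in this setting, so a smaller variance of the recombined estimator translates directly into a smaller mean squared error.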

### 4.2 Excess risk bound: More realistic case

In real situations, one has to estimate $f$. The following theorem characterizes the statistical gain and loss arising from the estimation error $f - \hat f$. The intuition is that the increased number of points suppresses the possibility of overfitting because the hypothesis has to fit the majority of the inflated data, but the estimator $\hat f$ has to be accurate so that fitting to the inflated data is meaningful. Note that the theorem is agnostic to how $\hat f$ is obtained, hence it applies to more general problem setups as long as $f$ can be estimated.

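###### Theorem 2 (Excess risk bound).

Let $\check g$ be the hypothesis generated by Eq. (2). Under appropriate assumptions (see Theorem 3 in Supplementary Material), for arbitrary $\delta, \delta' \in (0, 1)$, we have with probability at least $1 - (\delta + \delta')$,

$$R(\check g) - R(g^*) \le \underbrace{C \sum_{j=1}^{D} \left\| f_j - \hat f_j \right\|_{W^{1,1}}}_{\text{Approximation error}} + \underbrace{4 D\, \mathfrak{R}(\mathcal{G}) + 2 D B_\ell \sqrt{\frac{\log 2/\delta}{2n}}}_{\text{Estimation error}} + \underbrace{\kappa_1(\delta', n) + D B_\ell B_q\, \kappa_2(f - \hat f)}_{\text{Higher order terms}}.$$

Here, $\|\cdot\|_{W^{1,1}}$ is the $(1,1)$-Sobolev norm, and we define the effective Rademacher complexity $\mathfrak{R}(\mathcal{G})$ by

$$\mathfrak{R}(\mathcal{G}) := \frac{1}{n} \mathbb{E}_{\hat S} \mathbb{E}_{\sigma} \left[ \sup_{g \in \mathcal{G}} \left| \sum_{i=1}^{n} \sigma_i\, \mathbb{E}_{S'_2, \ldots, S'_D} \left[ \tilde\ell(\hat s_i, S'_2, \ldots, S'_D) \right] \right| \right], \tag{3}$$

where $\{\sigma_i\}_{i=1}^{n}$ are independent sign variables, $\mathbb{E}_{\hat S}$ is the expectation with respect to $\{\hat s_i\}_{i=1}^{n}$, the dummy variables $S'_2, \ldots, S'_D$ are i.i.d. copies of $\hat s_1$, and $\tilde\ell$ is defined by using the degree-$D$ symmetric group $\mathfrak{S}_D$ as

$$\tilde\ell(s_1, \ldots, s_D) := \frac{1}{D!} \sum_{\pi \in \mathfrak{S}_D} \ell\left(g, \hat f\left(s^{(1)}_{\pi(1)}, \ldots, s^{(D)}_{\pi(D)}\right)\right),$$

and $\kappa_1(\delta', n)$ and $\kappa_2(f - \hat f)$ are higher order terms. The constants $B_q$ and $B_\ell$ depend only on $q$ and $\ell$, respectively, while $C$ depends only on , and .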

Details of the statement and the proof can be found in Supplementary Material C. The Sobolev norm (Adams and Fournier, 2003) emerges from the evaluation of the difference between the estimated IC distribution and the ground-truth IC distribution. In Theorem 2, the utility of the proposed method appears in the effective complexity measure. The complexity is defined by a set of functions which are marginalized over all but one argument, resulting in mitigated dependence on the input dimensionality from exponential to linear (Supplementary Material C, Remark 3).

## 5 Related Work and Discussion

In this section, we review some existing TAs for DA to clarify the relative position of the paper. We also clarify the relation to the literature of causality-related transfer learning.

### 5.1 Existing transfer assumptions

Here, we review some of the existing work and TAs. See Table 1 for a summary.

#### (1) Parametric assumptions.

Some TAs assume parametric distribution families, e.g., Gaussian mixture model in covariate shift (Storkey and Sugiyama, 2007). Some others assume parametric distribution shift, i.e., parametric representations of the target distribution given the source distributions. Examples include location-scale transform of class conditionals (Zhang et al., 2013; Gong et al., 2016), linearly dependent class conditionals (Zhang et al., 2015), and low-dimensional representation of the class conditionals after kernel embedding (Stojanov et al., 2019). In some applications, e.g., remote sensing, some parametric assumptions have proven useful (Zhang et al., 2013).

#### (2) Invariant conditionals and marginals.

Some methods assume invariance of certain conditionals or marginals (Quiñonero-Candela et al., 2009), e.g., $p(Y \mid X)$ in the covariate shift scenario (Shimodaira, 2000), $p(Y \mid T(X))$ for an appropriate feature transformation $T$ in transfer component analysis (Pan et al., 2011), $p(Y \mid X_S)$ for a feature selection $S$ in Rojas-Carulla et al. (2018), $p(X \mid Y)$ in the target shift (TarS) scenario (Zhang et al., 2013; Nguyen et al., 2016), and a few components of regular-vine copulas and marginals in Lopez-paz et al. (2012). For example, the covariate shift scenario has been shown to fit well to brain computer interface data (Sugiyama et al., 2007).

#### (3) Small discrepancy or integral probability metric.

Another line of work relies on certain distributional similarities, e.g., integral probability metric (Courty et al., 2017) or hypothesis-class dependent discrepancies (Ben-David et al., 2007; Blitzer et al., 2008; Ben-David et al., 2010; Kuroki et al., 2019; Zhang et al., 2019; Cortes et al., 2019). These methods assume the existence of the ideal joint hypothesis (Ben-David et al., 2010), corresponding to a relaxation of the covariate shift assumption. These TAs are suited for unsupervised or semi-supervised DA in computer vision applications (Courty et al., 2017).

#### (4) Transferable parameter.

Some others consider parameter transfer (Kumagai, 2016), where the TA is the existence of a parameterized feature extractor that performs well in the target domain for linear-in-parameter hypotheses and its learnability from the source domain data. For example, such a TA has been known to be useful in natural language processing or image recognition (Lee et al., 2009; Kumagai, 2016).

### 5.2 Causality for transfer learning

Our method can be seen as the first attempt to fully leverage structural causal models for DA. Most of the causality-inspired DA methods express their assumptions at the level of graphical causal models (GCMs), which carry much coarser information than the structural causal models (SCMs) (Peters et al., 2017, Table 1.1) exploited in this paper. Compared to previous work, our method takes one step further to assume and exploit the invariance of SCMs. Specifically, many studies assume the GCM $Y \to X$ (the anticausal scenario) following the seminal meta-analysis of Schölkopf et al. (2012) and use it to motivate their parametric distribution shift assumptions or the parameter estimation procedure (Zhang et al., 2013, 2015; Gong et al., 2016, 2018). Although such assumptions on the GCM have the virtue of being more robust to misspecification, they tend to require parametric assumptions to obtain theoretical justifications. On the other hand, our assumption enjoys a theoretical guarantee without relying on parametric assumptions.

### 5.3 Plausibility of the assumption

The invariance of causal mechanisms has been exploited in recent work of causal discovery such as Xu et al. (2014) and Monti et al. (2019), or under the name of the multi-environment setting in Ghassami et al. (2017). The SEMs are normally assumed to remain invariant unless explicitly intervened in (Hünermund and Bareinboim, 2019). As the first algorithm in the approach to fully exploit SCMs for DA, we consider the case where all variables are observable. Although it is often assumed in a causal inference problem that there are some unobserved confounding variables, we leave further extension to such a case for future work.

The relation between $X$ and $Y$ can drastically change while $f$ is invariant. For example, even in a simple additive noise model , the conditional $p(Y \mid X)$ can shift drastically if the distribution of the independent noise changes in a complex manner, e.g., becoming multimodal from unimodal.

## 6 Experiment

In this section, we provide proof-of-concept experiments to demonstrate the effectiveness of the proposed approach. Note that the primary purpose of the experiments is to confirm whether the proposed method can properly perform DA in real-world data, and it is not to determine which DA method and TA are the most suited for the specific dataset.

### 6.1 Implementation details of the proposed method

#### Estimation of f (Step 1).

We model $\hat f$ by an -layer Glow neural network (Supplementary Material B.2). We model $\psi_d$ by a -hidden-layer neural network with a varied number of hidden units, $K$ output units, and the rectified linear unit activation (LeCun et al., 2015). We use its $k$-th output ($k \in [K]$) as the value for $\psi_d(\cdot, k)$. For training, we use the Adam optimizer (Kingma and Ba, 2017) with fixed parameters , fixed initial learning rate , and the maximum number of epochs . The other fixed hyperparameters of $\hat f$ and its training process are described in Supplementary Material B.

#### Augmentation of target data (Step 3).

For each evaluation step, we take all combinations (with replacement) of the estimated ICs to synthesize target domain data. After we synthesize the data, we filter them by applying a novelty detection technique with respect to the union of source domain data. Namely, we use one-class support vector machine (Schölkopf et al., 2000) with the fixed parameter and radial basis function (RBF) kernel with . This is because the estimated transform is not expected to be trained well outside the union of the supports of the source distributions.

#### Predictor hypothesis class G.

As the predictor model, we use the kernel ridge regression (KRR) with RBF kernel. The bandwidth is chosen by the median heuristic similarly to Yamada et al. (2011) for simplicity. Note that the choice of the predictor model is for the sake of comparison with the other methods tailored for KRR (Cortes et al., 2019), and that an arbitrary predictor hypothesis class and learning algorithm can be easily combined with the proposed approach.
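For readers unfamiliar with the median heuristic, the sketch below shows one common variant, in which the RBF bandwidth is set to the median pairwise Euclidean distance between the inputs. The exact convention of Yamada et al. (2011) is not restated in this paper, so the scaling used here, as well as the function names, should be read as assumptions.

```python
import math
from itertools import combinations
from statistics import median


def median_heuristic_bandwidth(points):
    """Median of the pairwise Euclidean distances between the given points."""
    distances = [math.dist(p, q) for p, q in combinations(points, 2)]
    return median(distances)


def rbf_kernel(x, y, bandwidth):
    """Gaussian RBF kernel exp(-||x - y||^2 / (2 * bandwidth^2))."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * bandwidth ** 2))
```

The appeal of the heuristic is that it removes one hyperparameter from the grid search, which matters when only a handful of target points are available for validation.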

#### Hyperparameter selection.

We perform grid-search for hyperparameter selection. The number of hidden units for is chosen from and the coefficient of weight-decay from . The regularization coefficient of KRR is chosen from following Cortes et al. (2019). To perform hyperparameter selection as well as early-stopping, we record the leave-one-out cross-validation (LOOCV) mean-squared error on the target training data every epochs and select its minimizer. The leave-one-out score is computed using the well-known analytic formula instead of training the predictor for each split. Note that we only use the original target domain data and not the synthesized data as the held-out set.
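The analytic leave-one-out formula relies on the fact that, for ridge-type estimators, the held-out residual equals the in-sample residual divided by one minus the corresponding diagonal entry of the hat matrix. The sketch below verifies this identity in a deliberately simplified setting, one-dimensional linear ridge regression without intercept, where the hat matrix has a closed form. It is not the KRR implementation of the paper; for KRR the same identity is applied with the hat matrix of the kernel ridge solution. All names are hypothetical.

```python
def ridge_loo_mse_analytic(xs, ys, lam):
    """LOO mean squared error via the leverage identity, without refitting."""
    a = sum(x * x for x in xs) + lam
    w = sum(x * y for x, y in zip(xs, ys)) / a
    # Leverage of point i is x_i^2 / a; the LOO residual is residual / (1 - leverage).
    loo_residuals = [(y - w * x) / (1.0 - x * x / a) for x, y in zip(xs, ys)]
    return sum(r * r for r in loo_residuals) / len(xs)


def ridge_loo_mse_bruteforce(xs, ys, lam):
    """LOO mean squared error by refitting the model once per held-out point."""
    errors = []
    for i in range(len(xs)):
        x_train = xs[:i] + xs[i + 1:]
        y_train = ys[:i] + ys[i + 1:]
        w = sum(x * y for x, y in zip(x_train, y_train)) / (sum(x * x for x in x_train) + lam)
        errors.append((ys[i] - w * xs[i]) ** 2)
    return sum(errors) / len(xs)
```

Both functions return the same value up to floating-point error, while the analytic version needs only a single fit.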

#### Computation environment

All experiments were conducted on an Intel Xeon(R) 2.60 GHz CPU with 132 GB memory. They were implemented in Python using the PyTorch library (Paszke et al., 2019) or the R language (R Core Team, 2018).

### 6.2 Experiment using real-world data

#### Dataset.

We use the gasoline consumption data (Greene, 2012, p.284, Example 9.5), which is a panel dataset of gasoline usage in 18 of the OECD countries over 19 years. We consider each country as a domain, and we disregard the time-series structure and consider the data as i.i.d. samples for each country in this proof-of-concept experiment. The dataset contains four variables, all of which are log-transformed: motor gasoline consumption per car (the predicted variable), per-capita income, motor gasoline price, and the stock of cars per capita (the predictor variables) (Baltagi and Griffin, 1983). For further details of the data, see Supplementary Material B. We used the dataset because there are very few public datasets for domain adapting regression tasks (Cortes and Mohri, 2014), especially for multi-source DA, and also because the dataset has been used in econometric analyses involving SEMs (Baltagi, 2005), conforming to our approach.

#### Compared methods.

We compare the following DA methods, all of which apply to regression problems. Unless explicitly specified, the predictor class is chosen to be KRR with the same hyperparameter candidates as the proposed method (Section 6.1). Further details are described in Supplementary Material B.5.

• Naive baselines (SrcOnly, TrgOnly, and S&TV): SrcOnly (resp. TrgOnly) trains a predictor on the source domain data (resp. target training data) without any device. SrcOnly can be effective if the source domains and the target domain have highly similar distributions. The S&TV baseline trains on both source and target domain data, but the LOOCV score is computed only from the target domain data.

• TrAdaBoost: Two-stage TrAdaBoost.R2; a boosting method tailored for few-shot regression transfer proposed in Pardoe and Stone (2010). It is an iterative method with early-stopping (Pardoe and Stone, 2010), for which we use the leave-one-out cross-validation score on the target domain data as the criterion. As suggested in Pardoe and Stone (2010), we set the maximum number of outer loop iterations at . The base predictor is the decision tree regressor with the maximum depth (Hastie et al., 2009). Note that although TrAdaBoost does not have a clarified transfer assumption, we compare the performance for reference.

• IW: Importance weighted KRR using RuLSIF (Yamada et al., 2011). The method directly estimates a relative joint density ratio function for , where is a hypothetical source distribution created by pooling all source domain data. Following Yamada et al. (2011), we experiment on and report the results separately. The regularization coefficient is selected from using importance-weighted cross-validation (Sugiyama et al., 2007).

• GDM: Generalized discrepancy minimization (Cortes et al., 2019). This method performs instance-weighted training on the source domain data with the weights that minimize the generalized discrepancy (via quadratic programming). We select the hyper-parameters from as suggested by Cortes et al. (2019). The selection criterion is the performance of the trained predictor on the target training labels as the method trains on the source domain data and the target unlabeled data.

• Copula: Non-parametric regular-vine copula method (Lopez-paz et al., 2012). This method presumes using a specific joint density estimator called regular-vine (R-vine) copulas. Adaptation is realized in two steps: the first step estimates which components of the constructed R-vine model are different by performing two-sample tests based on maximum mean discrepancy (Lopez-paz et al., 2012), and the second step re-estimates the components in which a change is detected using only the target domain data.

• LOO (reference score): Leave-one-out cross-validated error estimate is also calculated for reference. It is the average prediction error of predicting for a single held-out test point when the predictor is trained on the rest of the whole target domain data including those in the test set for the other algorithms.

#### Evaluation procedure.

The prediction accuracy was measured by the mean squared error (MSE). For each train-test split, we randomly select one-third (6 points) of the target domain dataset as the training set and use the rest as the test set. All experiments were repeated 10 times with different train-test splits of target domain data.

#### Results.

The results are reported in Table 2. We report the MSE scores normalized by that of LOO to facilitate the comparison, similarly to Cortes and Mohri (2014). In many of the target domain choices, the naive baselines (SrcOnly and S&TV) suffer from negative transfer, i.e., higher average MSE than TrgOnly (in 12 out of 18 domains). On the other hand, the proposed method successfully performs better than TrgOnly or is more resistant to negative transfer than the other compared methods. The performances of GDM, Copula, and IW are often inferior even compared to the baseline performance of S&TV. For GDM and IW, this can be attributed to the fact that these methods presume the availability of abundant (unlabeled) target domain data, which is unavailable in the current problem setup. For Copula, the performance inferior to the naive baselines is possibly due to the restriction of the predictor model to its accompanied probability model (Lopez-paz et al., 2012). TrAdaBoost works reasonably well for many but not all domains. For some domains, it suffered from negative transfer similarly to others, possibly because of the very small number of training data points. Note that the transfer assumption of TrAdaBoost has not been stated (Pardoe and Stone, 2010), and it is not understood when the method is reliable.

## 7 Conclusion

In this paper, we proposed a novel few-shot supervised DA method for regression problems based on the assumption of shared generative mechanism. Through theoretical and experimental analysis, we demonstrated the effectiveness of the proposed approach. By considering the latent common structure behind the domain distributions, the proposed method successfully induces positive transfer even when a naive usage of the source domain data can suffer from negative transfer. Our future work includes making an experimental comparison with extensively more datasets and methods as well as an extension to the case where the underlying mechanisms are not exactly identical but similar.

## Acknowledgments

TT was supported by Masason Foundation. TT would like to thank Yuko Kuroki and Taira Tsuchiya for their feedback on the manuscript. MS was supported by JST CREST Grant Number JPMJCR18A2.

This is the Supplementary Material for “Few-shot Domain Adaptation by Causal Mechanism Transfer.” Table 3 summarizes the abbreviations and the symbols used in the paper.

## Appendix A Preliminary: Nonlinear ICA

Here, we use the same notation as the main text. The recently developed nonlinear ICA provides an algorithm to estimate the mixing function $f$. For the case of nonlinear $f$, the impossibility of identification (i.e., consistent estimation) of $f$ in the one-sample i.i.d. case was established more than two decades ago (Hyvärinen and Pajunen, 1999). However, recently, various conditions have been proposed under which $f$ can be identified with the help of auxiliary information (Hyvärinen and Morioka, 2016, 2017; Hyvärinen et al., 2019; Khemakhem et al., 2019).

The identification condition that is directly relevant to this paper is that of the generalized contrastive learning (GCL) proposed in Hyvärinen et al. (2019). Hyvärinen et al. (2019) assume that an auxiliary variable $u$ from some measurable set $\mathcal{U}$ is obtained for each data point as $(z, u)$ and that the ICs are conditionally independent given $u$, i.e., the conditional log-density of the ICs decomposes dimension-wise:

$$\log q(s \mid u) = \sum_{d=1}^{D} q^{(d)}(s^{(d)} \mid u).$$

Under such conditions, GCL estimates $f$ by training a classification function

$$r_{\hat f, \psi}(z, u) = \sum_{d=1}^{D} \psi_d\left(\hat f^{-1}(z)_d, u\right) \tag{4}$$

parametrized by $\hat f$ and $\{\psi_d\}_{d=1}^{D}$ with the logistic loss for classifying

$$(z, u) \quad \text{vs.} \quad (z, \tilde u),$$

where $\tilde u \neq u$. The key condition for the identification of $f$ is the following.

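###### Assumption 1 (Assumption of variability; Hyvärinen et al., 2019, Theorem 1).

For any $z \in \mathbb{R}^{D}$, there exist $2D + 1$ distinct values of the auxiliary variable, denoted by $u_0, \ldots, u_{2D}$, such that the set of $2D$-dimensional vectors $\{w(z \mid u_j) - w(z \mid u_0)\}_{j=1}^{2D}$ is linearly independent, where

$$w(z \mid u) := \left( \frac{\partial q^{(1)}(z_1 \mid u)}{\partial z_1}, \ldots, \frac{\partial q^{(D)}(z_D \mid u)}{\partial z_D}, \frac{\partial^2 q^{(1)}(z_1 \mid u)}{\partial z_1^2}, \ldots, \frac{\partial^2 q^{(D)}(z_D \mid u)}{\partial z_D^2} \right).$$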

Under Assumption 1 and some regularity conditions, Theorem 1 of Hyvärinen et al. (2019) states that the transformation $\hat f^{-1}$ in Eq. (4) trained by GCL is a consistent estimator of $f^{-1}$ up to additional dimension-wise invertible transformations. Note that the assumption is intrinsically difficult to confirm based on data due to the unsupervised nature of the problem setting. In this paper, we use the source domain index as the auxiliary variable and employ GCL for domain adaptation. The present version of Assumption 1 requires that we have at least $2D + 1$ distinct source domains. Although this condition can be restrictive in high-dimensional data, we conjecture that this assumption may be made less stringent in the future because the identification condition is only known to be a sufficient condition, not a necessary condition. However, pursuing a refinement of the identification condition is out of the scope of this paper. Among the various methods for nonlinear ICA, we chose to use GCL (Hyvärinen et al., 2019) because it can operate under a nonparametric assumption on the IC distributions whereas other nonlinear ICA methods (Hyvärinen and Morioka, 2016, 2017; Khemakhem et al., 2019) may require parametric assumptions.

## Appendix B Experiment Details

Here, we describe more implementation details of the experiment. We plan to publish the experiment code and the dataset if the manuscript is accepted for publication. A URL to the experiment code will appear here.

### B.2 Model details: Invertible neural networks

Here, we describe the details of the Glow architecture (Kingma and Dhariwal, 2018) used in our experiments. Glow consists of three types of layers that are invertible by design: affine coupling layers, 1×1 convolution layers, and actnorm layers. In our implementation, we do not include actnorm layers, and each layer of our Glow architecture consists of a 1×1 convolution layer followed by an affine coupling layer.

#### Affine coupling layers.

The coefficients and for affine coupling layers in the notation of Kingma and Dhariwal (2018) are parametrized by two one-hidden-layer neural networks that have the same number of hidden units and share the first-layer parameter. The activation functions of the first layer, the second layer of , and the second layer of are the rectified linear unit (ReLU) activation (LeCun et al., 2015), the hyperbolic tangent function, and the linear activation function, respectively. A standard practice for affine coupling layers is to compose the coefficient with an exponential function so as to simplify the computation of the log-determinant of the Jacobian (Kingma and Dhariwal, 2018). In our implementation, since we do not require the computation of the log-determinant, we omit this device and instead compose . The addition of shifts the parameter space so that corresponds to the identity map, where denotes the constant zero function. The split of the affine coupling layers is fixed at .
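To make the parametrization concrete, the following is a minimal NumPy sketch of one affine coupling layer with a shared ReLU first layer, a tanh head for the scale, and a linear head for the shift. It is an illustration under stated assumptions rather than the experiment code: the function names are hypothetical, biases are omitted for brevity, and the scale is taken to be `tanh(.) + 1`, which is our reading of the shift described above.

```python
import numpy as np

def init_params(d_in, d_out, hidden, rng):
    return {
        "W1": rng.normal(scale=0.1, size=(d_in, hidden)),   # shared first layer
        "Ws": rng.normal(scale=0.1, size=(hidden, d_out)),  # second layer of the scale
        "Wt": rng.normal(scale=0.1, size=(hidden, d_out)),  # second layer of the shift
    }

def scale_and_shift(params, x_a):
    h = np.maximum(x_a @ params["W1"], 0.0)   # ReLU on the shared first layer
    s = np.tanh(h @ params["Ws"]) + 1.0       # assumption: scale = tanh(.) + 1, in (0, 2)
    t = h @ params["Wt"]                      # linear activation
    return s, t

def coupling_forward(params, x, split):
    x_a, x_b = x[:, :split], x[:, split:]
    s, t = scale_and_shift(params, x_a)
    return np.concatenate([x_a, s * x_b + t], axis=1)

def coupling_inverse(params, y, split):
    y_a, y_b = y[:, :split], y[:, split:]
    s, t = scale_and_shift(params, y_a)       # y_a == x_a, so s and t are recovered exactly
    return np.concatenate([y_a, (y_b - t) / s], axis=1)

rng = np.random.default_rng(0)
params = init_params(d_in=2, d_out=3, hidden=4, rng=rng)
x = rng.normal(size=(5, 5))
y = coupling_forward(params, x, split=2)
x_rec = coupling_inverse(params, y, split=2)  # recovers x up to floating-point error
```

Under this reading, setting the second-layer weights to zero gives a scale of one and a shift of zero, so the layer reduces to the identity map.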

#### 1×1 convolution layers.

We initialize the parameters of the neural networks by where is the number of parameters of each layer and is the normal distribution.

### B.3 Model details: Penultimate layer networks

We initialize the parameter for each layer of by , where is the number of input features and is the uniform distribution.

### B.4 Training details

During the training of GCL, we fix the batch size at 32.

### B.5 Compared methods details

As suggested in Pardoe and Stone (2010), we use the linear loss function and set the maximum number of internal boosting iterations at .

#### GDM.

We fix the number of samples required for approximating the maximization in the generalization discrepancy at . This method presumes hypothesis classes in a reproducing kernel Hilbert space (RKHS).

#### Copula.

For this model, the probabilistic model of non-parametric R-vine copula of depth is used following Lopez-Paz et al. (2012). Kernel density estimators with RBF kernels are used for estimating the marginal distributions and the copulas. The bandwidths of the RBF kernels are determined using the rule of thumb implemented as “normal-reference” in the np package for R (Hayfield and Racine, 2008). The predictions are made by numerically aggregating the estimated conditional distribution over the interval where denotes the square root of the unbiased variance of . The aggregation is performed by discretizing the interval into a grid of points. The level of the two-sample test is fixed at for all combinations of the two-sample tests, following the experiment code of Lopez-Paz et al. (2012). This method is a single-source domain adaptation method, and we pool all source domain data for adaptation.
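The grid-based aggregation step can be sketched as follows. This is a generic illustration, assuming that the aggregation is a conditional mean computed with normalized grid weights; the function name, the interval endpoints, and the number of grid points are all hypothetical and are not the values used in the experiments.

```python
import numpy as np

def grid_conditional_mean(cond_density, lo, hi, num_points):
    """Approximate E[Y | x] from an (unnormalized) conditional density on a grid."""
    y = np.linspace(lo, hi, num_points)
    w = cond_density(y)
    w = w / w.sum()              # normalize the density values on the grid
    return float(np.sum(w * y))

# Toy check with a Gaussian conditional density centered at 1.
density = lambda y: np.exp(-0.5 * ((y - 1.0) / 0.5) ** 2)
prediction = grid_conditional_mean(density, lo=-4.0, hi=6.0, num_points=301)
```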

## Appendix C Details and Proofs of Theorem 2

Here, we detail the assumptions, the statement, and the proof of Theorem 2.

### C.1 Notation

To make the proof self-contained, we first recall some general and problem-specific notation. In the notation here, we omit the domain identifiers from the distributions and the sample size, such as Tar or Src, because only the target domain data or their distributions appear in the proofs. The theorem holds regardless of how is estimated as long as is independent of the target domain data. In the proof, we extend the maximal discrepancy bound of U-statistics previously proved for the case of degree- in Rejchel (2012), to allow higher degrees.

#### General mathematical notation.

We denote the set of natural numbers (resp. real numbers) by (resp. ). For any , we define . We use to denote the number of -combinations of elements. For a finite set , the notation denotes the operator to take an average over , i.e., . For a -dimensional function , we denote its -th dimension () by suffixing . For a vector , we denote its -th element by . We denote the Jacobian determinant of a differentiable function at by . We denote the identity matrix by regardless of the size of the matrices when there is no ambiguity. For finite dimensional vectors, we denote the -norm by and the -norm by . For square matrices, we denote the operator- norm by and the operator- norm by . We use to denote the Sobolev space (on ) of order and define its associated norm by where is a multi-index and denotes the partial derivative (Adams and Fournier, 2003, Paragraph 3.1). We let be the degree- symmetric group, be the set of grouping of indices in , and be the set of all size- combinations (without replacement) of indices in .

#### Distributions and expectations.

We denote by the set of all factorized distributions on with absolutely continuous marginals. For a measure , we denote its -product measure by (repeated times). We assume that all measures appearing in this proof are absolutely continuous with respect to the Lebesgue measure. The push-forward of a distribution by a function is denoted by . The expectation of a function with respect to measure is denoted by (if it exists) by abuse of notation. We also abuse the notation to use as the shorthand for where .

### C.2 Problem setup

We denote the target domain distribution by .

We fix a hypothesis class , and our goal is to find a such that the risk functional

 $$R(g) := \int p(z)\,\ell(g,z)\,dz$$

is small, where is a loss function. We denote by a minimizer of (assuming it exists). To this end, we are given the training data . Throughout, we assume . To complement the smallness of , we assume the existence of a generative mechanism. Concretely, we assume that there exists a diffeomorphism such that satisfies . With this transform, the original risk functional is also expressed as

 $$R(g) = \int q(s)\,\ell(g,f(s))\,ds.$$

As an estimator of , we are given another diffeomorphism such that . With this , the proposed method converts the dataset by . We can regard , where . We use (resp. ) to denote the probability measure corresponding to the density (resp. ). This conversion results in the relation:

 $$\check{q}(s) = q\bigl(f^{-1}\circ\hat{f}(s)\bigr)\,\bigl|(J f^{-1}\circ\hat{f})(s)\bigr|.$$
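The density relation above is the usual change-of-variables formula, and it can be checked numerically in one dimension. The sketch below uses hypothetical linear mechanisms ($f(s)=2s$, $\hat{f}(s)=3s$) and a standard normal $q$, chosen only for illustration.

```python
import numpy as np

def normal_pdf(x, sigma=1.0):
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Hypothetical 1-D mechanisms: true f(s) = 2s, estimate f_hat(s) = 3s.
f_inv = lambda z: z / 2.0
f_hat = lambda s: 3.0 * s
h = lambda s: f_inv(f_hat(s))   # f^{-1} o f_hat, here h(s) = 1.5 s
h_jac = 1.5                     # |J h|, constant because h is linear

s_grid = np.linspace(-3.0, 3.0, 61)
# Right-hand side of the relation: q(h(s)) |J h(s)| with q standard normal.
q_check = normal_pdf(h(s_grid)) * h_jac
# Direct density of f_hat^{-1}(f(S)) = (2/3) S for S ~ N(0, 1).
q_direct = normal_pdf(s_grid, sigma=2.0 / 3.0)
```

The two arrays agree up to floating-point error, since $\hat{f}^{-1}(f(S)) = \tfrac{2}{3}S$ is normal with standard deviation $2/3$.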

As a candidate hypothesis , the proposed method selects a minimizer of the proposed risk estimator defined as

 $$\check{R}(g) := \frac{1}{n^D}\sum_{(i_1,\dots,i_D)\in[n]^D} \ell\bigl(g,\hat{f}(\hat{s}^{(1)}_{i_1},\dots,\hat{s}^{(D)}_{i_D})\bigr). \tag{5}$$

In the proof, we evaluate its concentration around the expectation . We use to denote the expectation with respect to . Let denote a hypothesis which minimizes (assuming it exists).
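For intuition, the risk estimator in Eq. (5) can be computed by brute force as in the following sketch. The function name `augmented_risk` and the toy loss are hypothetical. The loop enumerates all $n^D$ index tuples, so it is only feasible for small $n$ and $D$; a practical implementation would presumably subsample the tuples.

```python
import itertools
import numpy as np

def augmented_risk(loss, g, f_hat, s_hat):
    """Average loss over all n^D recombinations of IC coordinates (cf. Eq. (5))."""
    n, D = s_hat.shape
    total = 0.0
    for idx in itertools.product(range(n), repeat=D):
        # Take the d-th coordinate from the idx[d]-th extracted sample.
        s = np.array([s_hat[idx[d], d] for d in range(D)])
        total += loss(g, f_hat(s))
    return total / n**D

# Toy usage: identity f_hat and a loss that is additive in the coordinates.
s_hat = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 8.0]])
toy_loss = lambda g, z: float(z.sum())
risk = augmented_risk(toy_loss, None, lambda s: s, s_hat)
```

For this additive toy loss, the estimator reduces to the sum of the coordinate-wise sample means, which gives a quick sanity check of the enumeration.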

In what follows, for notational simplicity, we define the -variate symmetric function as

where indicates an averaging operation over all permutations (without replacement) of . We use to denote the sample average operator with respect to or , depending on the context.

### C.3 Assumptions

###### Assumption 2 (The underlying density function is bounded and Lipschitz continuous).

Assume

 $$B_q := \sup_{s\in\mathbb{R}^D} q(s) < \infty,\qquad L_q := \sup_{s_1\neq s_2}\frac{|q(s_1)-q(s_2)|}{\|s_1-s_2\|} < \infty.$$

###### Assumption 3 ($f^{-1}$ is Lipschitz continuous and Hölder continuous).

We assume where is the -Hölder space (Adams and Fournier, 2003, Paragraph 1.29) and

 $$L_{f^{-1}} := \sup_{z_1\neq z_2}\frac{\|f^{-1}(z_1)-f^{-1}(z_2)\|}{\|z_1-z_2\|} < \infty.$$

###### Assumption 4 (Bounded derivatives of $f$ and $f^{-1}$).

Assume that

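 $$B^{\infty}_{\partial f} := \sup_{s\in\mathbb{R}^D}\left\|\frac{df}{ds}(s)\right\|_{\infty} < \infty,\qquad B^{\infty}_{\partial f^{-1}} := \sup_{z\in\mathbb{R}^D}\left\|\frac{df^{-1}}{dz}(z)\right\|_{\infty} < \infty,$$

where $\|\cdot\|_{\infty}$ denotes the maximum absolute value of the elements of a matrix.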

###### Assumption 5 (Loss function is bounded and uniformly Lipschitz continuous in Z).

The considered loss function takes values in a bounded interval:

 $$\ell:\mathcal{G}\times\mathcal{Z}\to[0,B_\ell],$$

where . Also assume

 $$L_{\ell_{\mathcal{G}}} := \sup_{g\in\mathcal{G}}\sup_{z_1\neq z_2}\frac{|\ell(g,z_1)-\ell(g,z_2)|}{\|z_1-z_2\|} < \infty.$$

###### Assumption 6 (Estimated feature extractor).

Assume is independent of and that for all .

Although and are assumed to be diffeomorphisms in the classical sense (implying that they are strongly differentiable), we introduce the Sobolev space because we want to measure their difference and their difference of derivatives in terms of integration.

###### Assumption 7 (Entropic condition: Euclidean class (Sherman, 1994)).

The function class is Euclidean for the envelope and constants and (Sherman, 1994), i.e., if is a measure for which , then

 $$D(t,d_\mu,\Phi) \le A t^{-V},\qquad 0 < t \le 1,$$

where is the pseudo metric defined by

 $$d_\mu(\phi_1,\phi_2) := \left[\mu|\phi_1-\phi_2|^2 / \mu F^2\right]^{1/2}$$

for , and denotes the packing number of with respect to the pseudometric and radius . Without loss of generality, we take the envelope such that .

###### Assumption 8.

The hypothesis class is expressive enough so that the model approximation error does not expand due to , i.e.,

 $$\inf_{g\in\mathcal{G}}\overline{\check{R}}(g) \le \inf_{g\in\mathcal{G}} R(g).$$

The following complexity measure of , which is a version of Rademacher complexity for our problem setting, is used to state the theorem.

###### Definition 1 (Effective Rademacher complexity).

Define

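 $$\mathfrak{R}(\mathcal{G}) := \frac{1}{n}\,\mathbb{E}_{\check{\mathcal{D}}}\mathbb{E}_{\sigma}\left[\sup_{g\in\mathcal{G}}\left|\sum_{i=1}^{n}\sigma_i\,\mathbb{E}_{S'_2,\dots,S'_D}\left[\tilde{\ell}(S_i,S'_2,\dots,S'_D)\right]\right|\right]$$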

where are independent uniform sign variables and are independent of all other random variables.

We provide the definition of the ordinary Rademacher complexity in Section C.8 and make a comparison of the two complexity measures in terms of how they depend on the input dimensionality.

### C.4 Theorem statement

Our goal is to prove the following theorem. This is a detailed version of the theorem appearing in the main body of the paper.

###### Theorem 3 (Excess risk bound).

Assume Assumptions 2, 3, 4, 5, 6, 7, and 8.

Then for arbitrary , we have with probability at least ,

 $$R(\check{g})-R(g^*) \le \underbrace{C\sum_{j=1}^{D}\bigl\|f_j-\hat{f}_j\bigr\|_{W^{1,1}}}_{\text{Approximation error}} + \underbrace{4D\,\mathfrak{R}(\mathcal{G}) + 2DB_\ell\sqrt{\frac{\log(2/\delta)}{2n}}}_{\text{Estimation error}} + \underbrace{\kappa_1(\delta',n) + DB_\ell B_q\,\kappa_2(f-\hat{f})}_{\text{Higher order terms}},$$

where

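 $$\begin{aligned} C &:= B_q L_{\ell_{\mathcal{G}}} + DB_\ell\bigl(L_q L_{f^{-1}} + B_q D C'_1\bigr),\\ C'_1 &:= (D+1)^{3/2}\left(B^{\infty}_{\partial f}\left(\sum_{k=1}^{D}\bigl\|f^{-1}_k\bigr\|_{C^{1,1}}\right) + B^{\infty}_{\partial f^{-1}}\right),\\ \kappa_1(\delta',n) &= O(n^{-1})/\delta' + O(n^{-1}),\\ \kappa_2(f-\hat{f}) &= \sum_{d=2}^{D}\binom{D}{d} C'_d \sum_{j=1}^{D}\bigl\|f_j-\hat{f}_j\bigr\|^{d}_{W^{1,d}}, \end{aligned}$$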

and are constants determined in Lemma 11.

###### Proof of Theorem 3.

By adding and subtracting terms, we have

 R(ˇg)−R(g∗)=(R−¯ˇR)(ˇg)\text{(A) Approximation error}+¯ˇR(ˇg)−¯ˇR(¯ˇg)