Domain Generalization by Marginal Transfer Learning
Abstract
Domain generalization is the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distribution-free, kernel-based approach that predicts a classifier from the marginal distribution of features, by leveraging the trends present in related classification tasks. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on synthetic data and three real data applications demonstrate the superiority of the method with respect to a pooling strategy.
Institut für Mathematik, Universität Potsdam

Aniket Anand Deshmukh (aniketde@umich.edu)
Electrical Engineering and Computer Science, University of Michigan

Ürun Dogan (urundogan@gmail.com)
Microsoft Research

Gyemin Lee (gyemin@seoultech.ac.kr)
Dept. Electronic and IT Media Engineering, Seoul National University of Science and Technology

Clayton Scott (clayscot@umich.edu)
Electrical Engineering and Computer Science, University of Michigan
Keywords: Domain Generalization, Kernel Methods, Kernel Approximation
1 Introduction
Is it possible to leverage the solution of one classification problem to solve another? This is a question that has received increasing attention in recent years from the machine learning community, and has been studied in a variety of settings, including multitask learning, covariate shift, and transfer learning. In this work we study domain generalization, another setting in which this question arises, and one that incorporates elements of the three aforementioned settings and is motivated by many practical applications.
To state the problem, let $\mathcal{X}$ be a feature space and $\mathcal{Y}$ a space of labels to predict. For a given distribution $P_{XY}$ on $\mathcal{X} \times \mathcal{Y}$, we refer to the $X$ marginal distribution $P_X$ as simply the marginal distribution, and the conditional $P_{Y|X}$ as the posterior distribution. There are $N$ similar but distinct distributions $P_{XY}^{(i)}$ on $\mathcal{X} \times \mathcal{Y}$, $i = 1, \ldots, N$. For each $i$, there is a training sample $S_i = (X_{ij}, Y_{ij})_{1 \le j \le n_i}$ of i.i.d. realizations of $P_{XY}^{(i)}$. There is also a test distribution $P_{XY}^T$ that is similar to but again distinct from the "training distributions" $P_{XY}^{(i)}$. Finally, there is a test sample $(X_j^T, Y_j^T)_{1 \le j \le n^T}$ of i.i.d. realizations of $P_{XY}^T$, but in this case the labels $Y_j^T$ are not observed. The goal is to correctly predict these unobserved labels. Essentially, given a random sample from the marginal test distribution $P_X^T$, we would like to predict the corresponding labels.
Domain generalization, which has also been referred to as learning to learn or lifelong learning, may be contrasted with other learning problems. In multitask learning, only the training distributions are of interest, and the goal is to use the similarity among distributions to improve the training of individual classifiers (Thrun, 1996; Caruana, 1997; Evgeniou et al., 2005). In our context, we view these distributions as "training tasks," and seek to generalize to a new distribution/task.^1 In the covariate shift problem, the marginal test distribution is different from the marginal training distribution(s), but the posterior distribution is assumed to be the same (Bickel et al., 2009). In our case, both marginal and posterior test distributions can differ from their training counterparts (Quiñonero-Candela et al., 2009).

^1 The terminology appears to vary. Here we call a specific distribution a task, but the terms domain or environment are also common in the literature.
Finally, in transfer learning, it is typically assumed that at least a few labels are available for the test data, and the training data sets are used to improve the performance of a standard classifier, for example by learning a metric or embedding which is appropriate for all data sets (Ando and Zhang, 2005; Rettinger et al., 2006). In our case, no test labels are available, but we hope that through access to multiple training data sets, it is still possible to obtain collective knowledge about the "labeling process" that may be transferred to the test distribution. Some authors have considered transductive transfer learning, which is similar to the problem studied here in that no test labels are available. However, existing work has focused on the case $N = 1$ and typically relies on the covariate shift assumption (Arnold et al., 2007).
We propose a distribution-free, kernel-based approach to domain generalization, based on the hypothesis that information about a task is encoded in its marginal distribution. Our methodology is shown to yield a consistent learning procedure, meaning that the generalization error tends to the best possible as the sample sizes tend to infinity. We also offer a thorough experimental study validating the proposed approach on a synthetic data set, and on three real-world data sets, including comparisons to a simple pooling approach.
The general probabilistic framework we adopt to theoretically analyze the proposed algorithm is to assume that the training task generating distributions, as well as the test task distribution, are themselves drawn i.i.d. from a probability distribution $\mu$ over the set of probability distributions on $\mathcal{X} \times \mathcal{Y}$. This two-stage task sampling model (a task distribution $P_{XY}$ is sampled from $\mu$, then training examples are sampled from $P_{XY}$) was first introduced in the seminal work of Baxter (1997, 2000), which also proposed a general learning-theoretic analysis of the model. A generic approach to the problem is to consider a family of hypothesis spaces, and use the training tasks in order to select in that family a hypothesis space that is optimally suited to learning tasks sampled from $\mu$; roughly speaking, this means finding a good trade-off between the complexity of said class and its approximation capabilities for tasks sampled from $\mu$, in an average sense. The information gained by finding a well-adapted hypothesis space can lead to a significantly improved label efficiency of learning a new task. A related approach is to learn directly the task sampling distribution $\mu$ (Carbonell et al., 2013).
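This two-stage sampling model can be mimicked in a few lines of code. The sketch below is a hypothetical instantiation, purely for illustration: the task distribution, its parameters, and all names are our own inventions, not part of the proposed methodology.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_task(rng):
    # One draw from a hypothetical task distribution mu: each task is a
    # two-class Gaussian problem whose mean shift varies across tasks.
    return rng.normal(0.0, 1.0, size=2)

def draw_sample(shift, n, rng):
    # n labeled points from the task P_XY determined by `shift`
    y = rng.choice([-1, 1], size=n)
    X = shift + y[:, None] + rng.normal(0.0, 0.5, size=(n, 2))
    return X, y

# N training tasks with n points each; a test task would be drawn the
# same way, but with its labels withheld
N, n = 5, 100
training_tasks = [draw_sample(draw_task(rng), n, rng) for _ in range(N)]
```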
The family of hypothesis spaces under consideration can be implicit, as when learning a feature representation or metric that is suitable for all tasks. In the context of (reproducing) kernel methods, this has been studied under the form of learning a linear transform or projection in kernel space (Maurer, 2009), and learning the kernel itself (Pentina and Ben-David, 2015).
A related approach is to find a transformation (feature extraction) of the input $X$ so that the transformed marginal distributions approximately match across tasks; the underlying assumption is that this makes it possible to find some common information between tasks. This idea has been combined with the principle of kernel mean mapping (which represents entire distributions as points in a Hilbert space) to compare distributions (Pan et al., 2011; Muandet et al., 2013; Maurer et al., 2013; Pentina and Lampert, 2014; Grubinger et al., 2015; Ghifary et al., 2017), generally to find a projection in kernel space realizing a suitable compromise between matching of transformed marginal distributions and preserving of information between input and label. It has also been proposed to match task distributions by optimal transport of marginal distributions (Courty et al., 2016).
In the present paper, our aim is to learn to predict the classifier for a given task from the marginal distribution; for this we will use the principle of kernel mean mapping as well. Still, our ansatz is different from the previously discussed methods, because instead of transforming the data to match distributions, we aim to learn how the task-dependent hypothesis (e.g., a linear classifier) transforms as a function of the marginal. In this sense our approach is a complement to these other algorithms rather than a competitor. Indeed, after our initial conference publication (Blanchard et al., 2011), our methodology was successfully applied in conjunction with the feature transformation proposed by Muandet et al. (2013).
2 Motivating Application: Automatic Gating of Flow Cytometry Data
Flow cytometry is a high-throughput measurement platform that is an important clinical tool for the diagnosis of blood-related pathologies. This technology allows for quantitative analysis of individual cells from a given cell population, derived for example from a blood sample from a patient. We may think of a flow cytometry data set as a set of $d$-dimensional attribute vectors $(x_i)_{1 \le i \le n}$, where $n$ is the number of cells analyzed, and $d$ is the number of attributes recorded per cell. These attributes pertain to various physical and chemical properties of the cell. Thus, a flow cytometry data set is a random sample from a patient-specific distribution.
Now suppose a pathologist needs to analyze a new (test) patient with data $(X_i^T)_{1 \le i \le n^T}$. Before proceeding, the pathologist first needs the data set to be "purified" so that only cells of a certain type are present. For example, lymphocytes are known to be relevant for the diagnosis of leukemia, whereas non-lymphocytes may potentially confound the analysis. In other words, it is necessary to determine the label $Y_i^T \in \{-1, 1\}$ associated to each cell, where $Y_i^T = 1$ indicates that the $i$th cell is of the desired type.
In clinical practice this is accomplished through a manual process known as "gating." The data are visualized through a sequence of two-dimensional scatter plots, where at each stage a line segment or polygon is manually drawn to eliminate a portion of the unwanted cells. Because of the variability in flow cytometry data, this process is difficult to quantify in terms of a small subset of simple rules. Instead, it requires domain-specific knowledge and iterative refinement. Modern clinical laboratories routinely see dozens of cases per day, so it would be desirable to automate this process.
Since clinical laboratories maintain historical databases, we can assume access to a number ($N$) of historical (training) patients that have already been expert-gated. Because of biological and technical variations in flow cytometry data, the distributions of the historical patients will vary. In order to illustrate the flow cytometry gating problem, we use the NDD data set from the FlowCap-I challenge.^2 For example, Fig. 1 shows exemplary two-dimensional scatter plots for two different patients; see caption for details. Despite differences in the two distributions, there are also general trends that hold for all patients. Virtually every cell type of interest has a known tendency (e.g., high or low) for most measured attributes. Therefore, it is reasonable to assume that there is an underlying distribution (on distributions) governing flow cytometry data sets, that produces roughly similar distributions, thereby making possible the automation of the gating process.

^2 We will revisit this data set in Section 7.5, where details are given.
3 Formal Setting
Let $\mathcal{X}$ denote the observation space and $\mathcal{Y}$ the output space. We assume that we observe $N$ samples $S_i = (X_{ij}, Y_{ij})_{1 \le j \le n_i}$, $i = 1, \ldots, N$.
Let $\mathcal{P}_{\mathcal{X} \times \mathcal{Y}}$ denote the set of probability distributions on $\mathcal{X} \times \mathcal{Y}$, $\mathcal{P}_{\mathcal{X}}$ the set of probability distributions on $\mathcal{X}$ (which we call "marginals"), and $\mathcal{P}_{Y|X}$ the set of conditional probability distributions of $Y$ given $X$ (also known as Markov transition kernels from $\mathcal{X}$ to $\mathcal{Y}$, which we also call "posteriors"). The disintegration theorem (see for instance Kallenberg (2002), Theorem 6.4) tells us that (under suitable regularity properties, e.g., $\mathcal{X} \times \mathcal{Y}$ is a Polish space) any element $P_{XY} \in \mathcal{P}_{\mathcal{X} \times \mathcal{Y}}$ can be written as a product $P_{XY} = P_X \bullet P_{Y|X}$, with $P_X \in \mathcal{P}_{\mathcal{X}}$, $P_{Y|X} \in \mathcal{P}_{Y|X}$, that is to say,
$$\int f(x, y) \, dP_{XY}(x, y) = \int \left( \int f(x, y) \, dP_{Y|X}(y \mid x) \right) dP_X(x)$$
for any integrable function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$. The space $\mathcal{P}_{\mathcal{X} \times \mathcal{Y}}$ is endowed with the topology of weak convergence and the associated Borel $\sigma$-algebra.
It is assumed that there exists a distribution $\mu$ on $\mathcal{P}_{\mathcal{X} \times \mathcal{Y}}$, where $P_{XY}^{(1)}, \ldots, P_{XY}^{(N)}$ are i.i.d. realizations from $\mu$, and the sample $S_i$ is made of i.i.d. realizations of $(X, Y)$ following the distribution $P_{XY}^{(i)}$. Now consider a test sample $S^T = (X_i^T, Y_i^T)_{1 \le i \le n^T}$ drawn from $P_{XY}^T \sim \mu$, whose labels $Y_i^T$ are not observed by the user. A decision function is a function $f : \mathcal{P}_{\mathcal{X}} \times \mathcal{X} \to \mathbb{R}$ that predicts $\widehat{Y} = f(\widehat{P}_X^T, X)$, where $\widehat{P}_X^T$ is the empirical marginal distribution of the test sample and $X$ is any given test point (which can belong to the test sample or not). If $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_+$ is a loss, and predictions on the test sample are given by $\widehat{Y}_i = f(\widehat{P}_X^T, X_i^T)$, then the empirical average loss incurred on the test sample is $\frac{1}{n^T} \sum_{i=1}^{n^T} \ell(\widehat{Y}_i, Y_i^T)$. Based on this, we define the average generalization error of a decision function over test samples of size $n^T$,
$$\mathcal{E}(f, n^T) := \mathbb{E}_{P_{XY}^T \sim \mu} \, \mathbb{E}_{S^T \sim (P_{XY}^T)^{\otimes n^T}} \left[ \frac{1}{n^T} \sum_{i=1}^{n^T} \ell\big( f(\widehat{P}_X^T, X_i^T), Y_i^T \big) \right]. \qquad (1)$$
An important point of the analysis is that, at training time as well as at test time, the marginal distribution for a sample is only known through the sample itself, that is, through the empirical marginal $\widehat{P}_X$. As is clear from equation (1), because of this the generalization error also depends on the test sample size $n^T$. As $n^T$ grows, $\widehat{P}_X^T$ will converge to $P_X^T$ (in the sense of weak convergence). This motivates the following generalization error when we have an infinite test sample, where we then assume that the true marginal $P_X^T$ is observed:
$$\mathcal{E}^\infty(f) := \mathbb{E}_{P_{XY}^T \sim \mu} \, \mathbb{E}_{(X^T, Y^T) \sim P_{XY}^T} \left[ \ell\big( f(P_X^T, X^T), Y^T \big) \right]. \qquad (2)$$
To gain some insight into this risk, let us decompose $\mu$ into two parts: $\mu_X$, which generates the marginal distribution $P_X$, and $\mu_{Y|X}$, which, conditioned on $P_X$, generates the posterior $P_{Y|X}$. Denote $\widetilde{X} = (P_X, X)$. We then have
$$\mathcal{E}^\infty(f) = \mathbb{E}_{\widetilde{X} \sim \widetilde{P}_X} \, \mathbb{E}_{Y \sim \widetilde{P}_{Y \mid \widetilde{X}}} \big[ \ell\big( f(\widetilde{X}), Y \big) \big].$$
Here $\widetilde{P}_X$ is the distribution that generates $\widetilde{X} = (P_X, X)$ by first drawing $P_X$ according to $\mu_X$, and then drawing $X$ according to $P_X$. Similarly, $Y$ is generated, conditioned on $\widetilde{X}$, by first drawing $P_{Y|X}$ according to $\mu_{Y|X}(\cdot \mid P_X)$, and then drawing $Y$ from $P_{Y|X}(\cdot \mid X)$. From this last expression, we see that the risk is like a standard supervised learning risk based on $(\widetilde{X}, Y)$. Thus, we can deduce properties that are known to hold for supervised learning risks. For example, in the binary classification setting, if the loss is the 0/1 loss, then $\operatorname{sign}(2\eta(\widetilde{x}) - 1)$ is an optimal predictor, where $\eta(\widetilde{x}) = \Pr(Y = 1 \mid \widetilde{X} = \widetilde{x})$. More generally, an optimal predictor is obtained by minimizing, for each $\widetilde{x}$, the conditional expected loss $\mathbb{E}[\ell(t, Y) \mid \widetilde{X} = \widetilde{x}]$ over $t \in \mathbb{R}$.
Our goal is a learning rule that asymptotically predicts as well as the global minimizer of (2), for a general loss $\ell$. By the above observations, consistency with respect to a general $\ell$ (thought of as a surrogate) will imply consistency for the 0/1 loss, provided $\ell$ is classification calibrated (Bartlett et al., 2006). Despite the similarity to standard supervised learning in the infinite sample case, we emphasize that the learning task here is different, because the realizations $(\widetilde{X}_{ij}, Y_{ij})$ are neither independent nor identically distributed.
Finally, we note that there is a condition under which, for almost all test distributions $P_{XY}^T$, the decision function $f^*(P_X^T, \cdot)$ (where $f^*$ is the global minimizer of (2)) coincides with an optimal Bayes decision function for $P_{XY}^T$ (although no labels from this test distribution are observed). This condition is simply that the posterior $P_{Y|X}$ is (almost surely) a function of the marginal $P_X$ (in other terms: with the notation introduced above, $\mu_{Y|X}(\cdot \mid P_X)$ is a Dirac measure for almost all $P_X$). Although we will not be assuming this condition throughout the paper, observe that it is implicitly assumed in the motivating application presented in Section 2, where an expert labels the data points by just looking at their marginal distribution.
Lemma 1
For a fixed distribution $P_{XY} \in \mathcal{P}_{\mathcal{X} \times \mathcal{Y}}$ and a decision function $f : \mathcal{P}_{\mathcal{X}} \times \mathcal{X} \to \mathbb{R}$, let us denote
$$R(f, P_{XY}) := \mathbb{E}_{(X, Y) \sim P_{XY}} \big[ \ell\big( f(P_X, X), Y \big) \big]$$
and
$$R^*(P_{XY}) := \inf_{f} R(f, P_{XY}),$$
the corresponding optimal (Bayes) risk for the loss function $\ell$. Assume that $\mu$ is a distribution on $\mathcal{P}_{\mathcal{X} \times \mathcal{Y}}$ such that $\mu$-a.s. it holds $P_{Y|X} = F(P_X)$ for some deterministic mapping $F$. Let $f^*$ be a minimizer of the risk (2). Then we have, for $\mu$-almost all $P_{XY}$:
$$R(f^*, P_{XY}) = R^*(P_{XY})$$
and
$$\mathcal{E}^\infty(f^*) = \mathbb{E}_{P_{XY} \sim \mu} \big[ R^*(P_{XY}) \big].$$
Proof  For any $f$, one has for all $P_{XY}$: $R(f, P_{XY}) \ge R^*(P_{XY})$. For any fixed $P_X$, consider the joint distribution $P_X \bullet F(P_X)$ and a Bayes decision function $f^*_{P_X}$ for this joint distribution. Pose $f^*(P_X, x) := f^*_{P_X}(x)$. Then $f^*$ coincides for $\mu$-almost all $P_{XY}$ with a Bayes decision function for $P_{XY}$, achieving equality in the above inequality. The second equality follows by taking expectation over $P_{XY} \sim \mu$.
4 Learning Algorithm
We consider an approach based on kernels. A function $k : \Omega \times \Omega \to \mathbb{R}$ is called a kernel on $\Omega$ if the matrix $\big( k(x_i, x_j) \big)_{1 \le i, j \le m}$ is symmetric and positive semi-definite for all positive integers $m$ and all $x_1, \ldots, x_m \in \Omega$. It is well-known that if $k$ is a kernel on $\Omega$, then there exist a Hilbert space $\mathcal{H}$ and a map $\Phi : \Omega \to \mathcal{H}$ such that $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$. While $\mathcal{H}$ and $\Phi$ are not uniquely determined by $k$, the Hilbert space $\mathcal{H}_k$ of functions (from $\Omega$ to $\mathbb{R}$) is uniquely determined by $k$, and is called the reproducing kernel Hilbert space (RKHS) of $k$.

One way to envision $\mathcal{H}_k$ is as follows. Define $\Phi(x) := k(\cdot, x)$, which is called the canonical feature map associated with $k$. Then the span of $\{ \Phi(x) : x \in \Omega \}$, endowed with the inner product $\langle \Phi(x), \Phi(x') \rangle = k(x, x')$, is dense in $\mathcal{H}_k$. We also recall the reproducing property, which states that $\langle f, \Phi(x) \rangle = f(x)$ for all $f \in \mathcal{H}_k$ and $x \in \Omega$.
For later use, we introduce the notion of a universal kernel. A kernel $k$ on a compact metric space $\Omega$ is said to be universal when its RKHS is dense in $C(\Omega)$, the set of continuous functions on $\Omega$, with respect to the supremum norm. Universal kernels are important for establishing universal consistency of many learning algorithms. We refer the reader to Steinwart and Christmann (2008) for additional background on kernels.
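As a concrete check of the kernel definition above, the following minimal snippet (our own illustration, with arbitrary data) verifies numerically that a Gaussian kernel produces symmetric, positive semi-definite Gram matrices, as the definition requires.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2)), a kernel on R^d
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = gaussian_kernel(X, X)

symmetric = np.allclose(K, K.T)
min_eig = np.linalg.eigvalsh(K).min()  # nonnegative up to round-off
```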
Several well-known learning algorithms, such as support vector machines and kernel ridge regression, may be viewed as minimizers of a norm-regularized empirical risk over the RKHS of a kernel. A similar development has also been made for multitask learning (Evgeniou et al., 2005). Inspired by this framework, we consider a general kernel algorithm as follows.
Consider the loss function $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_+$. Let $\bar{k}$ be a kernel on $\mathcal{P}_{\mathcal{X}} \times \mathcal{X}$, and let $\mathcal{H}_{\bar{k}}$ be the associated RKHS. For the sample $S_i$, let $\widehat{P}_X^{(i)}$ denote the corresponding empirical marginal distribution. Also consider the extended input space $\mathcal{P}_{\mathcal{X}} \times \mathcal{X}$ and the extended data $\widetilde{X}_{ij} = (\widehat{P}_X^{(i)}, X_{ij})$. Note that $\widehat{P}_X^{(i)}$ plays a role analogous to the task index in multitask learning. Now define
$$\hat{f}_\lambda = \arg\min_{f \in \mathcal{H}_{\bar{k}}} \ \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n_i} \sum_{j=1}^{n_i} \ell\big( f(\widehat{P}_X^{(i)}, X_{ij}), Y_{ij} \big) + \lambda \|f\|_{\mathcal{H}_{\bar{k}}}^2. \qquad (3)$$
4.1 Specifying the kernels
In the rest of the paper we will consider a kernel $\bar{k}$ on $\mathcal{P}_{\mathcal{X}} \times \mathcal{X}$ of the product form
$$\bar{k}\big( (P_1, x_1), (P_2, x_2) \big) = k_P(P_1, P_2) \, k_X(x_1, x_2), \qquad (4)$$
where $k_P$ is a kernel on $\mathcal{P}_{\mathcal{X}}$ and $k_X$ a kernel on $\mathcal{X}$.
Furthermore, we will consider kernels $k_P$ on $\mathcal{P}_{\mathcal{X}}$ of a particular form. Let $k_X'$ denote a kernel on $\mathcal{X}$ (which might be different from $k_X$) that is measurable and bounded. We define the kernel mean embedding $\Psi : \mathcal{P}_{\mathcal{X}} \to \mathcal{H}_{k_X'}$:
$$\Psi(P_X) := \int_{\mathcal{X}} k_X'(x, \cdot) \, dP_X(x). \qquad (5)$$
This mapping has been studied in the framework of "characteristic kernels" (Gretton et al., 2007a), and it has been proved that universality of $k_X'$ implies injectivity of $\Psi$ (Gretton et al., 2007b; Sriperumbudur et al., 2010).
Note that the mapping $\Psi$ is linear. Therefore, if we consider the kernel $k_P(P_1, P_2) = \langle \Psi(P_1), \Psi(P_2) \rangle$, it is a linear kernel on $\mathcal{P}_{\mathcal{X}}$ and cannot be a universal kernel. For this reason, we introduce yet another kernel $K$ on $\mathcal{H}_{k_X'}$ and consider the kernel $k_P$ on $\mathcal{P}_{\mathcal{X}}$ given by
$$k_P(P_1, P_2) = K\big( \Psi(P_1), \Psi(P_2) \big). \qquad (6)$$
Note that particular kernels $K$ inspired by the finite dimensional case are of the form
$$K(v, w) = F\big( \|v - w\| \big) \qquad (7)$$
or
$$K(v, w) = G\big( \langle v, w \rangle \big), \qquad (8)$$
where $F$ and $G$ are real functions of a real variable such that they define a kernel. For example, $F(t) = \exp(-t^2 / (2\sigma^2))$ yields a Gaussian-like kernel, while $G(t) = (1 + t)^d$ yields a polynomial-like kernel. Kernels of the above form on the space of probability distributions over a compact space have been introduced and studied in Christmann and Steinwart (2010). Below we apply their results to deduce that $\bar{k}$ is a universal kernel for certain choices of $k_X$, $k_X'$, and $K$.
4.2 Relation to other kernel methods
By choosing $k_P$ differently, one can recover other existing kernel methods. In particular, consider the class of kernels of the same product form as above, but where
$$k_P(P_1, P_2) = \tau + (1 - \tau) \, \mathbf{1}\{P_1 = P_2\}, \qquad \tau \in [0, 1].$$
If $\tau = 0$, the algorithm (3) corresponds to training kernel machines using $k_X$ (e.g., support vector machines in the case of the hinge loss) on each training data set, independently of the others (note that this does not offer any generalization ability to a new data set). If $\tau = 1$, we have a "pooling" strategy that, in the case of equal sample sizes $n_i = n$, is equivalent to pooling all training data sets together in a single data set, and running a conventional supervised learning algorithm with kernel $k_X$ (i.e., this corresponds to trying to find a single "one-fits-all" prediction function which does not depend on the marginal). In the intermediate case $0 < \tau < 1$, the resulting kernel is a "multi-task kernel," and the algorithm recovers a multitask learning algorithm like that of Evgeniou et al. (2005). We compare to the pooling strategy below in our experiments. We also examined the multitask kernel with $0 < \tau < 1$, but found that, as far as generalization to a new unlabeled task is concerned, it was always outperformed by pooling, and so those results are not reported. This fits the observation that the choice $\tau = 0$ does not provide any generalization to a new task, while $\tau = 1$ at least offers some form of generalization, if only by fitting the same decision function to all data sets.
In the special case where, for a given task, all labels are the same value, and $k_X$ is taken to be the constant kernel, the problem we consider reduces to "distributional" classification or regression, which is essentially standard supervised learning where a distribution (observed through a sample) plays the role of the feature vector. Our analysis techniques could easily be specialized to this problem.
5 Learning Theoretic Study
Although the regularized estimation formula (3) defining $\hat{f}_\lambda$ is standard, the generalization error analysis is not, since the $\widetilde{X}_{ij}$ are neither identically distributed nor independent (Szabo et al., 2015). We begin with a generalization error bound that establishes uniform estimation error control over functions belonging to a ball of $\mathcal{H}_{\bar{k}}$. We then discuss universal kernels, and finally deduce universal consistency of the algorithm.
To simplify somewhat the presentation, we assume below that all training samples have the same size $n_i = n$. Also let $\mathcal{B}_{\bar{k}}(R)$ denote the closed ball of radius $R$, centered at the origin, in the RKHS of the kernel $\bar{k}$. We will consider the following assumptions on the loss function and on the kernels:

(L) The loss function $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_+$ is Lipschitz in its first variable and bounded by a constant $B_\ell$.

(KA)
The kernels $k_X$, $k_X'$, and $K$ are bounded. In addition, the canonical feature map $\Phi_K$ associated to $K$ satisfies a Hölder condition of order $\alpha \in (0, 1]$ with constant $L_K$ on $\Psi(\mathcal{P}_{\mathcal{X}})$:
$$\big\| \Phi_K(v) - \Phi_K(w) \big\| \le L_K \|v - w\|^\alpha. \qquad (9)$$
Sufficient conditions for (9) are described in Section A.2. As an example, the condition is shown to hold with $\alpha = 1$ when $K$ is the Gaussian-like kernel on $\mathcal{H}_{k_X'}$. The boundedness assumptions are also clearly satisfied for Gaussian kernels.
Theorem 2 (Uniform estimation error control)
Assume conditions (L) and (KA) hold. If $P_{XY}^{(1)}, \ldots, P_{XY}^{(N)}$ are i.i.d. realizations from $\mu$, and for each $1 \le i \le N$, the sample $S_i$ is made of $n$ i.i.d. realizations from $P_{XY}^{(i)}$, then for any $R > 0$ and $\delta \in (0, 1)$, with probability at least $1 - \delta$:
(10) 
where $c$ is a numerical constant.
Proof [sketch] The full proofs of this and other results are given in Section A. We give here a brief overview. We use the decomposition
$$\sup_{f \in \mathcal{B}_{\bar{k}}(R)} \big| \widehat{\mathcal{E}}(f) - \mathcal{E}^\infty(f) \big| \le \underbrace{\sup_{f \in \mathcal{B}_{\bar{k}}(R)} \big| \widehat{\mathcal{E}}(f) - \widetilde{\mathcal{E}}(f) \big|}_{(I)} + \underbrace{\sup_{f \in \mathcal{B}_{\bar{k}}(R)} \big| \widetilde{\mathcal{E}}(f) - \mathcal{E}^\infty(f) \big|}_{(II)},$$
where $\widehat{\mathcal{E}}(f)$ denotes the empirical risk appearing in (3) and $\widetilde{\mathcal{E}}(f)$ the same quantity with the empirical marginals $\widehat{P}_X^{(i)}$ replaced by the true marginals $P_X^{(i)}$.

Bounding (I), using the Lipschitz property of the loss function, can be reduced to controlling $\max_{1 \le i \le N} \big\| \Psi(\widehat{P}_X^{(i)}) - \Psi(P_X^{(i)}) \big\|$ conditional to $(P_X^{(i)})_{1 \le i \le N}$, uniformly for $f \in \mathcal{B}_{\bar{k}}(R)$. This can be obtained using the reproducing property of the kernel $\bar{k}$, the convergence of $\Psi(\widehat{P}_X^{(i)})$ to $\Psi(P_X^{(i)})$ as a consequence of Hoeffding's inequality in a Hilbert space, and the other assumptions (boundedness/Hölder property) on the kernels.

Concerning the control of the term (II), it can be decomposed in turn into the convergence conditional to $(P_{XY}^{(i)})_{1 \le i \le N}$, and the convergence of the conditional generalization error. In both cases, a standard approach using the Azuma-McDiarmid inequality (McDiarmid, 1989) followed by symmetrization and Rademacher complexity analysis on a kernel space (Koltchinskii, 2001; Bartlett and Mendelson, 2002) can be applied. For the first part, the random variables are the $S_i$ (which are independent conditional to $(P_{XY}^{(i)})_{1 \le i \le N}$); for the second part, the i.i.d. variables are the $P_{XY}^{(i)}$ (the $S_i$ being integrated out).
Next, we turn our attention to universal kernels (see Section 4 for the definition). A relevant notion for our purposes is that of a normalized kernel. If $k$ is a kernel on $\Omega$, then
$$k^*(x, x') = \frac{k(x, x')}{\sqrt{k(x, x)} \sqrt{k(x', x')}}$$
is the associated normalized kernel. If a kernel is universal, then so is its associated normalized kernel. For example, the exponential kernel $k(x, x') = \exp(\langle x, x' \rangle)$, $x, x' \in \mathbb{R}^d$, can be shown to be universal on every compact subset of $\mathbb{R}^d$ through a Taylor series argument. Consequently, the Gaussian kernel
$$k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$$
is universal, being the normalized kernel associated with the exponential kernel with scaled inner product $\langle x, x' \rangle / \sigma^2$. See Steinwart and Christmann (2008) for additional details and discussion.
To establish that $\bar{k}$ is universal on $\mathcal{P}_{\mathcal{X}} \times \mathcal{X}$, the following lemma is useful.
Lemma 3
Let $\Omega_1, \Omega_2$ be two compact spaces and $k_1, k_2$ be kernels on $\Omega_1, \Omega_2$, respectively. Then if $k_1$ and $k_2$ are both universal, the product kernel
$$k\big( (x_1, x_2), (x_1', x_2') \big) := k_1(x_1, x_1') \, k_2(x_2, x_2')$$
is universal on $\Omega_1 \times \Omega_2$.
Several examples of universal kernels are known on Euclidean space. For our purposes, we also need universal kernels on $\mathcal{P}_{\mathcal{X}}$. Fortunately, this was studied by Christmann and Steinwart (2010). Some additional assumptions on the kernels and feature space are required:

(KB) $k_X$, $k_X'$, $K$, and $\mathcal{X}$ satisfy the following:

- $\mathcal{X}$ is a compact metric space
- $k_X$ is universal on $\mathcal{X}$
- $k_X'$ is continuous and universal on $\mathcal{X}$
- $K$ is universal on any compact subset of $\mathcal{H}_{k_X'}$.
Adapting the results of (Christmann and Steinwart, 2010), we have the following.
Theorem 4 (Universal kernel)
Assume condition (KB) holds. Then, for $k_P$ defined as in (6), the product kernel $\bar{k}$ in (4) is universal on $\mathcal{P}_{\mathcal{X}} \times \mathcal{X}$.

Furthermore, the assumption on $K$ is fulfilled if $K$ is of the form (8), where $G$ is an analytical function with positive Taylor series coefficients, or if $K$ is the normalized kernel associated to such a kernel.
Proof By Lemma 3, it suffices to show that $\mathcal{P}_{\mathcal{X}}$ is a compact metric space, and that $k_P$ is universal on $\mathcal{P}_{\mathcal{X}}$. The former statement follows from Theorem 6.4 of Parthasarathy (1967), where the metric is the Prohorov metric. We will deduce the latter statement from Theorem 2.2 of Christmann and Steinwart (2010). The statement of Theorem 2.2 there is apparently restricted to kernels $K$ of the form (8), but the proof actually only uses that $K$ is universal on any compact set of $\mathcal{H}_{k_X'}$. To apply Theorem 2.2, it remains to show that $\mathcal{H}_{k_X'}$ is a separable Hilbert space, and that $\Psi$ is injective and continuous. Injectivity of $\Psi$ is equivalent to $k_X'$ being a characteristic kernel, which follows from the assumed universality of $k_X'$ (Sriperumbudur et al., 2010). The continuity of $k_X'$ implies separability of $\mathcal{H}_{k_X'}$ (Steinwart and Christmann (2008), Lemma 4.33) as well as continuity of $\Psi$ (Christmann and Steinwart (2010), Lemma 2.3 and preceding discussion). Now Theorem 2.2 of (Christmann and Steinwart, 2010) may be applied, and the result follows.
The fact that kernels of the form (8), where $G$ is analytic with positive Taylor coefficients, are universal on any compact set of $\mathcal{H}_{k_X'}$ was established in the proof of Theorem 2.2 of the same work (Christmann and Steinwart, 2010).
As an example, suppose that $\mathcal{X}$ is a compact subset of $\mathbb{R}^d$. Let $k_X$ and $k_X'$ be Gaussian kernels on $\mathcal{X}$; then $k_X$ and $k_X'$ are continuous and universal on $\mathcal{X}$. By similar reasoning as in the finite dimensional case, the Gaussian-like kernel $K$ of the form (7) is universal on any compact subset of $\mathcal{H}_{k_X'}$. Thus the product kernel $\bar{k}$ is universal on $\mathcal{P}_{\mathcal{X}} \times \mathcal{X}$.
From Theorems 2 and 4, we may deduce universal consistency of the learning algorithm. Furthermore, we can weaken the assumption on the loss relative to Theorem 2. In particular, universal consistency does not require that the loss be bounded, and therefore holds for unbounded losses such as the hinge and logistic losses.
Corollary 5 (Universal consistency)
Let $\ell$ be Lipschitz in its first variable and such that
$$\sup_{y \in \mathcal{Y}} \ell(0, y) < \infty. \qquad (11)$$
Further assume that conditions (KA) and (KB) are satisfied. Assume that $n = n_N \to \infty$ as $N \to \infty$. Also let $(\lambda_N)$ be a sequence of regularization parameters such that $\lambda_N \to 0$ as $N \to \infty$, sufficiently slowly that the estimation error bound of Theorem 2 still vanishes.
Then
$$\mathcal{E}^\infty(\hat{f}_{\lambda_N}) \longrightarrow \inf_{f} \mathcal{E}^\infty(f)$$
almost surely, where the infimum is over all measurable functions $f : \mathcal{P}_{\mathcal{X}} \times \mathcal{X} \to \mathbb{R}$.
The proof of the corollary relies on the bound established in Theorem 2, the universality of $\bar{k}$ established in Theorem 4, and otherwise relatively standard arguments. The assumption (11) always holds for classification, and it holds for regression, for example, when $\mathcal{Y}$ is compact and $\ell$ is continuous.
6 Implementation
Implementation of the algorithm in (3) relies on techniques that are similar to those used for other kernel methods, but with some variations (software is available at https://github.com/aniketde/DomainGeneralizationMarginal). The first subsection illustrates how, for the case of hinge loss, the optimization problem corresponds to a certain cost-sensitive support vector machine. Subsequent subsections focus on more scalable implementations based on approximate feature mappings.
6.1 Representer theorem and hinge loss
For a particular loss $\ell$, existing algorithms for optimizing an empirical risk based on that loss can be adapted to the marginal transfer setting. We now illustrate this idea for the case of the hinge loss, $\ell(t, y) = \max(0, 1 - ty)$. To make the presentation more concise, we will employ the extended feature representation $\widetilde{X}_{ij} = (\widehat{P}_X^{(i)}, X_{ij})$, and we will also "vectorize" the indices $(i, j)$ so as to employ a single index on these variables and on the labels. Thus the training data are $(\widetilde{X}_m, Y_m)_{1 \le m \le M}$, where $M = \sum_{i=1}^N n_i$, and we seek a solution to
$$\min_{f \in \mathcal{H}_{\bar{k}}} \ \frac{1}{N} \sum_{m=1}^{M} c_m \max\big( 0, 1 - Y_m f(\widetilde{X}_m) \big) + \lambda \|f\|^2.$$
Here $c_m = 1/n_{i(m)}$, where $i(m)$ is the smallest positive integer $i$ such that $m \le n_1 + \cdots + n_i$. By the representer theorem (Steinwart and Christmann, 2008), the solution of (3) has the form
$$\hat{f} = \sum_{m=1}^{M} \alpha_m \bar{k}(\widetilde{X}_m, \cdot)$$
for real numbers $\alpha_1, \ldots, \alpha_M$. Plugging this expression into the objective function of (3), and introducing the auxiliary variables $\xi_m := \max\big( 0, 1 - Y_m f(\widetilde{X}_m) \big)$, we have the quadratic program
$$\min_{\alpha \in \mathbb{R}^M, \, \xi \in \mathbb{R}^M} \ \frac{1}{N} \sum_{m=1}^M c_m \xi_m + \lambda \, \alpha^\top \mathbf{K} \alpha \quad \text{s.t.} \quad \xi_m \ge 0, \ \ \xi_m \ge 1 - Y_m (\mathbf{K} \alpha)_m \ \ \forall m,$$
where $\mathbf{K} := \big( \bar{k}(\widetilde{X}_m, \widetilde{X}_{m'}) \big)_{1 \le m, m' \le M}$. Using Lagrange multiplier theory, the dual quadratic program is
$$\max_{\gamma \in \mathbb{R}^M} \ \sum_{m=1}^M \gamma_m - \frac{1}{2} \sum_{m, m'=1}^M \gamma_m \gamma_{m'} Y_m Y_{m'} \bar{k}(\widetilde{X}_m, \widetilde{X}_{m'}) \quad \text{s.t.} \quad 0 \le \gamma_m \le \frac{c_m}{2 \lambda N} \ \ \forall m,$$
and the optimal function is
$$\hat{f} = \sum_{m=1}^M \gamma_m Y_m \, \bar{k}(\widetilde{X}_m, \cdot).$$
This is equivalent to the dual of a cost-sensitive support vector machine, without offset, where the costs are given by $c_m / (2 \lambda N)$. Therefore we can learn the weights $\gamma_m$ using any existing software package for SVMs that accepts example-dependent costs and a user-specified kernel matrix, and allows for no offset. Returning to the original notation, the final predictor given a test sample has the form
$$x \mapsto \sum_{i=1}^N \sum_{j=1}^{n_i} \gamma_{ij} Y_{ij} \, k_P\big( \widehat{P}_X^{(i)}, \widehat{P}_X^T \big) \, k_X(X_{ij}, x),$$
where the $\gamma_{ij}$ are nonnegative. Like the SVM, the solution is often sparse, meaning most $\gamma_{ij}$ are zero.
Finally, we remark on the computation of $k_P$. When $K$ has the form of (7) or (8), the calculation of $k_P$ may be reduced to computations of the form $\langle \Psi(P_X), \Psi(P_X') \rangle$. If $P_X$ and $P_X'$ are empirical distributions based on the samples $x_1, \ldots, x_n$ and $x_1', \ldots, x_{n'}'$, then
$$\big\langle \Psi(P_X), \Psi(P_X') \big\rangle = \frac{1}{n n'} \sum_{i=1}^{n} \sum_{j=1}^{n'} k_X'(x_i, x_j').$$
Note that when $k_X'$ is a (normalized) Gaussian kernel, $\Psi(P_X)$ coincides (as a function) with a smoothing kernel density estimate for $P_X$.
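As a rough illustration of the resulting training procedure, the sketch below (our own toy construction, not the paper's experiments; all parameter values are arbitrary) builds the extended Gram matrix $k_P \cdot k_X$ for two synthetic tasks whose labeling rules disagree on overlapping inputs, and trains an SVM on it with scikit-learn's SVC. For simplicity it uses equal sample sizes and ignores two details of the dual above: the example-dependent costs and the no-offset constraint (SVC fits an intercept).

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_kernel(X, Z, sigma):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def k_P(Xa, Xb, s_in=1.0, s_out=0.3):
    # Gaussian-like kernel (7) between empirical marginals, computed via
    # <Psi(Pa), Psi(Pb)> = mean of k' over all cross pairs
    g = lambda A, B: gaussian_kernel(A, B, s_in).mean()
    d2 = g(Xa, Xa) - 2 * g(Xa, Xb) + g(Xb, Xb)
    return np.exp(-d2 / (2 * s_out ** 2))

rng = np.random.default_rng(0)
# two toy training tasks whose decision rules differ with the marginal
tasks = []
for shift in (-1.0, 1.0):
    y = rng.choice([-1, 1], size=40)
    X = np.c_[y + shift, rng.normal(size=40)]
    tasks.append((X, y))

X_all = np.vstack([X for X, _ in tasks])
y_all = np.hstack([y for _, y in tasks])
task_id = np.repeat([0, 1], 40)

# extended Gram matrix: k_bar = k_P(task_i, task_j) * k_X(x_i, x_j)
KP = np.array([[k_P(tasks[a][0], tasks[b][0]) for b in (0, 1)] for a in (0, 1)])
K = KP[task_id][:, task_id] * gaussian_kernel(X_all, X_all, 1.0)

clf = SVC(kernel="precomputed", C=1.0).fit(K, y_all)
train_acc = clf.score(K, y_all)
```

At test time, the corresponding rectangular kernel matrix between test and training examples (with $k_P$ evaluated against the test task's empirical marginal) would be passed to `clf.decision_function`.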
6.2 Approximate Feature Mapping for Scalable Implementation
Assuming $n_i = n$ for all $i$, the computational complexity of a nonlinear SVM solver is between $O\big((Nn)^2\big)$ and $O\big((Nn)^3\big)$ (Joachims, 1999; Chang and Lin, 2011). Thus, standard nonlinear SVM solvers may be insufficient when either or both of $N$ and $n$ are very large.
One approach to scaling up kernel methods is to employ approximate feature mappings together with linear solvers. This is based on the idea that kernel methods are solving for a linear predictor after first nonlinearly transforming the data. Since this nonlinear transformation can have an extremely high or even infinite-dimensional output, classical kernel methods avoid computing it explicitly. However, if the feature mapping can be approximated by a finite dimensional transformation with a relatively low-dimensional output, one can directly solve for the linear predictor, which can be accomplished in time linear in the number of training examples (Hsieh et al., 2008).
In particular, given a kernel $\bar{k}$, the goal is to find an approximate feature mapping $\hat{z}$ such that $\bar{k}(\widetilde{x}, \widetilde{x}') \approx \hat{z}(\widetilde{x})^\top \hat{z}(\widetilde{x}')$. Given such a mapping $\hat{z}$, one then applies an efficient linear solver, such as Liblinear (Fan et al., 2008), to the transformed training data to obtain a weight vector $w$. The final prediction on a test point $\widetilde{x}$ is then $w^\top \hat{z}(\widetilde{x})$. As described in the previous subsection, the linear solver may need to be tweaked, as in the case of unequal sample sizes $n_i$, but this is usually straightforward.
Recently, such low-dimensional approximate feature mappings have been developed for several kernels. We examine two such techniques in the context of marginal transfer learning: the Nyström approximation (Williams and Seeger, 2001; Drineas and Mahoney, 2005) and random Fourier features. The Nyström approximation applies to any kernel method, and therefore extends to the marginal transfer setting without additional work. On the other hand, we give a novel extension of random Fourier features to the marginal transfer learning setting (for the case of all Gaussian kernels), together with performance analysis.
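The pipeline just described (approximate feature map, then a linear solver) can be sketched as follows with the Nyström approximation. This is our own toy stand-in, not the paper's implementation: it uses scikit-learn's Nystroem transformer on a crude extended representation that simply appends an empirical-mean summary of each task's marginal to the raw features.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# toy extended inputs: raw x augmented with the task's empirical mean,
# a crude stand-in for the marginal-dependent component of (P_X, x)
X_parts, y_parts = [], []
for shift in (-1.0, 1.0):
    yi = rng.choice([-1, 1], size=200)
    Xi = np.c_[yi + shift, rng.normal(size=200)]
    X_parts.append(np.c_[Xi, np.tile(Xi.mean(axis=0), (200, 1))])
    y_parts.append(yi)
X, y = np.vstack(X_parts), np.hstack(y_parts)

# Nystrom approximation of a Gaussian kernel on the extended input,
# followed by a linear SVM on the approximate features
feat = Nystroem(kernel="rbf", gamma=0.5, n_components=50, random_state=0)
Z = feat.fit_transform(X)
clf = LinearSVC(C=1.0).fit(Z, y)
acc = clf.score(Z, y)
```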
6.2.1 Random Fourier Features
The approximation of Rahimi and Recht is based on Bochner’s theorem, which characterizes shift invariant kernels (Rahimi and Recht, 2007).
Theorem 6
A continuous, shift-invariant kernel $k(x, y) = \kappa(x - y)$ on $\mathbb{R}^d$ is positive definite if and only if $\kappa$ is the Fourier transform of a finite positive measure $p$, i.e.,
$$\kappa(x - y) = \int_{\mathbb{R}^d} e^{i \omega^\top (x - y)} \, dp(\omega). \qquad (12)$$
If a shift-invariant kernel $\kappa$ is properly scaled (so that $\kappa(0) = 1$), then Theorem 6 guarantees that $p$ in (12) is a proper probability distribution.
Random Fourier features (RFFs) approximate the integral in (12) using samples drawn from $p$. If $\omega_1, \ldots, \omega_D$ are i.i.d. draws from $p$ and $b_1, \ldots, b_D$ are i.i.d. uniform on $[0, 2\pi]$,
$$\kappa(x - y) \approx \frac{1}{D} \sum_{j=1}^{D} 2 \cos(\omega_j^\top x + b_j) \cos(\omega_j^\top y + b_j) = z(x)^\top z(y), \qquad (13)$$
where $z(x) := \sqrt{2/D} \big( \cos(\omega_1^\top x + b_1), \ldots, \cos(\omega_D^\top x + b_D) \big)^\top$ is an approximate nonlinear feature mapping of dimensionality $D$. In the following, we extend the RFF methodology to the kernel $\bar{k}$ on the extended feature space $\mathcal{P}_{\mathcal{X}} \times \mathcal{X}$. Let $\omega_1, \ldots, \omega_D$ and $b_1, \ldots, b_D$ be i.i.d. realizations of
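The scalar RFF approximation in (13) can be checked numerically in a few lines. The sketch below is our own illustration (dimension, feature count $D$, and bandwidth are arbitrary choices): it draws the spectral sample for a Gaussian kernel, whose spectral measure is itself Gaussian, and compares the RFF Gram matrix to the exact one.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 3, 2000, 1.0

# spectral measure of exp(-||delta||^2 / (2 sigma^2)) is N(0, sigma^-2 I)
W = rng.normal(0.0, 1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2 * np.pi, size=D)

def z(X):
    # random Fourier feature map: z(x)^T z(y) approximates k(x - y), eq. (13)
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

X = rng.normal(size=(10, d))
K_exact = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / (2 * sigma ** 2))
K_approx = z(X) @ z(X).T
max_err = np.abs(K_exact - K_approx).max()
```

The error decays at rate $O(1/\sqrt{D})$, so increasing `D` tightens the approximation at linear cost in feature dimension.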