Domain Adaptation: Learning Bounds and Algorithms
This paper addresses the general problem of domain adaptation which arises in a variety of applications where the distribution of the labeled sample available somewhat differs from that of the test data. Building on previous work by ? (?), we introduce a novel distance between distributions, discrepancy distance, that is tailored to adaptation problems with arbitrary loss functions. We give Rademacher complexity bounds for estimating the discrepancy distance from finite samples for different loss functions. Using this distance, we derive novel generalization bounds for domain adaptation for a wide family of loss functions. We also present a series of novel adaptation bounds for large classes of regularization-based algorithms, including support vector machines and kernel ridge regression based on the empirical discrepancy. This motivates our analysis of the problem of minimizing the empirical discrepancy for various loss functions for which we also give novel algorithms. We report the results of preliminary experiments that demonstrate the benefits of our discrepancy minimization algorithms for domain adaptation.
Yishay Mansour Google Research and Tel Aviv Univ. firstname.lastname@example.org Mehryar Mohri Courant Institute and Google Research email@example.com Afshin Rostamizadeh Courant Institute New York University firstname.lastname@example.org
In the standard PAC model [valiant] and other theoretical models of learning, training and test instances are assumed to be drawn from the same distribution. This is a natural assumption since, when the training and test distributions substantially differ, there can be no hope for generalization. However, in practice, there are several crucial scenarios where the two distributions are more similar and learning can be more effective. One such scenario is that of domain adaptation, the main topic of our analysis.
The problem of domain adaptation arises in a variety of applications in natural language processing [Dredze07Frustratingly, Blitzer07Biographies, jiang-zhai07, chelba, daume06], speech processing [Legetter&Woodlang, Gauvain&Lee, DellaPietra, Rosenfeld96, jelinek, roark03supervised], computer vision [martinez], and many other areas. Quite often, little or no labeled data is available from the target domain, but labeled data from a source domain somewhat similar to the target as well as large amounts of unlabeled data from the target domain are at one’s disposal. The domain adaptation problem then consists of leveraging the source labeled and target unlabeled data to derive a hypothesis performing well on the target domain.
A number of different adaptation techniques have been introduced in the past by the publications just mentioned and other similar work in the context of specific applications. For example, a standard technique used in statistical language modeling and other generative models for part-of-speech tagging or parsing is based on the maximum a posteriori adaptation which uses the source data as prior knowledge to estimate the model parameters [roark03supervised]. Similar techniques and other more refined ones have been used for training maximum entropy models for language modeling or conditional models [DellaPietra, jelinek, chelba, daume06].
The first theoretical analysis of the domain adaptation problem was presented by ? (?), who gave VC-dimension-based generalization bounds for adaptation in classification tasks. Perhaps the most significant contribution of this work was the definition and application of a distance between distributions, the $d_A$ distance, that is particularly relevant to the problem of domain adaptation and that can be estimated from finite samples for a finite VC dimension, as previously shown by ? (?). This work was later extended by ? (?), who also gave a bound on the error rate of a hypothesis derived from a weighted combination of the source data sets for the specific case of empirical risk minimization. A theoretical study of domain adaptation was presented by ? (?), where the analysis deals with the related but distinct case of adaptation with multiple sources, and where the target is a mixture of the source distributions.
This paper presents a novel theoretical and algorithmic analysis of the problem of domain adaptation. It builds on the work of ? (?) and extends it in several ways. We introduce a novel distance, the discrepancy distance, that is tailored to comparing distributions in adaptation. This distance coincides with the $d_A$ distance for 0-1 classification, but it can be used to compare distributions for more general tasks, including regression, and with other loss functions. As already pointed out, a crucial advantage of the $d_A$ distance is that it can be estimated from finite samples when the set of regions used has finite VC-dimension. We prove that the same holds for the discrepancy distance and in fact give data-dependent versions of that statement with sharper bounds based on the Rademacher complexity.
We give new generalization bounds for domain adaptation and point out some of their benefits by comparing them with previous bounds. We further combine these with the properties of the discrepancy distance to derive data-dependent Rademacher complexity learning bounds. We also present a series of novel results for large classes of regularization-based algorithms, including support vector machines (SVMs) [ccvv] and kernel ridge regression (KRR) [krr]. We compare the pointwise loss of the hypothesis returned by these algorithms when trained on a sample drawn from the target domain distribution, versus that of a hypothesis selected by these algorithms when training on a sample drawn from the source distribution. We show that the difference of these pointwise losses can be bounded by a term that depends directly on the empirical discrepancy distance of the source and target distributions.
These learning bounds motivate the idea of replacing the empirical source distribution with another distribution with the same support but with the smallest discrepancy with respect to the target empirical distribution, which can be viewed as reweighting the loss on each labeled point. We analyze the problem of determining the distribution minimizing the discrepancy in both 0-1 classification and square loss regression. We show how the problem can be cast as a linear program (LP) for the 0-1 loss and derive a specific efficient combinatorial algorithm to solve it in dimension one. We also give a polynomial-time algorithm for solving this problem in the case of the square loss by proving that it can be cast as a semi-definite program (SDP). Finally, we report the results of preliminary experiments showing the benefits of our analysis and discrepancy minimization algorithms.
In Section 2, we describe the learning set-up for domain adaptation and introduce the notation and Rademacher complexity concepts needed for the presentation of our results. Section 3 introduces the discrepancy distance and analyzes its properties. Section 4 presents our generalization bounds and our theoretical guarantees for regularization-based algorithms. Section 5 describes and analyzes our discrepancy minimization algorithms. Section 6 reports the results of our preliminary experiments.
2.1 Learning Set-Up
We consider the familiar supervised learning setting where the learning algorithm receives a sample of $m$ labeled points $S = (z_1, \ldots, z_m) = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m$, where $X$ is the input space and $Y$ the label set, which is $\{0, 1\}$ in classification and some measurable subset of $\mathbb{R}$ in regression.
In the domain adaptation problem, the training sample $S$ is drawn according to a source distribution $Q$, while test points are drawn according to a target distribution $P$ that may somewhat differ from $Q$. We denote by $f\colon X \to Y$ the target labeling function. We shall also discuss cases where the source labeling function $f_Q$ differs from the target domain labeling function $f_P$. Clearly, this dissimilarity will need to be small for adaptation to be possible.
We will assume that the learner is provided with an unlabeled sample $T$ drawn i.i.d. according to the target distribution $P$. We denote by $L\colon Y \times Y \to \mathbb{R}$ a loss function defined over pairs of labels and by $L_Q(f, g)$ the expected loss for any two functions $f, g\colon X \to Y$ and any distribution $Q$ over $X$: $L_Q(f, g) = \mathbb{E}_{x \sim Q}[L(f(x), g(x))]$.
The domain adaptation problem consists of selecting a hypothesis $h$ out of a hypothesis set $H$ with a small expected loss according to the target distribution $P$, $L_P(h, f)$.
2.2 Rademacher Complexity
Our generalization bounds will be based on the following data-dependent measure of the complexity of a class of functions.
Definition 1 (Rademacher Complexity)
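Let $H$ be a set of real-valued functions defined over a set $X$. Given a sample $S \in X^m$, the empirical Rademacher complexity of $H$ is defined as follows:
$$\hat{R}_S(H) = \frac{2}{m} \, \mathbb{E}_{\sigma}\Big[ \sup_{h \in H} \Big| \sum_{i=1}^m \sigma_i h(x_i) \Big| \;\Big|\; S = (x_1, \ldots, x_m) \Big].$$
The expectation is taken over $\sigma = (\sigma_1, \ldots, \sigma_m)$ where the $\sigma_i$s are independent uniform random variables taking values in $\{-1, +1\}$. The Rademacher complexity of a hypothesis set $H$ is defined as the expectation of $\hat{R}_S(H)$ over all samples of size $m$:
$$R_m(H) = \mathbb{E}_S\big[ \hat{R}_S(H) \;\big|\; |S| = m \big].$$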
The Rademacher complexity measures the ability of a class of functions to fit noise. The empirical Rademacher complexity has the added advantage that it is data-dependent and can be measured from finite samples. It can lead to tighter bounds than those based on other measures of complexity such as the VC-dimension [koltchinskii_and_panchenko].
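As a rough illustration of the "ability to fit noise" reading, the following Python sketch estimates the empirical Rademacher complexity of a small finite class by Monte Carlo sampling of the sign variables. It assumes the normalization $2/m$ with an absolute value inside the supremum (other texts use $1/m$ without the absolute value); the threshold classifiers, the sample, and the name `empirical_rademacher` are hypothetical choices made only for this example.

```python
import random

def empirical_rademacher(hypotheses, sample, n_trials=2000, seed=0):
    """Monte Carlo estimate of (2/m) * E_sigma[ max_h |sum_i sigma_i * h(x_i)| ]."""
    rng = random.Random(seed)
    m = len(sample)
    # Precompute the outputs of every hypothesis on the fixed sample.
    outputs = [[h(x) for x in sample] for h in hypotheses]
    total = 0.0
    for _ in range(n_trials):
        # Draw one vector of independent uniform signs in {-1, +1}.
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(abs(sum(s * o for s, o in zip(sigma, out))) for out in outputs)
    return 2.0 * total / (m * n_trials)

# Hypothetical data: 20 points on [0, 1) and threshold classifiers x -> 1[x >= t].
sample = [i / 20 for i in range(20)]
thresholds = [lambda x, t=t / 10: 1.0 if x >= t else 0.0 for t in range(11)]
single = thresholds[:1]

r_rich = empirical_rademacher(thresholds, sample)
r_single = empirical_rademacher(single, sample)
print(r_rich, r_single)
```

With the same seed, both calls see the same sign vectors, so the richer class can only score at least as high as the single hypothesis, which matches the intuition that a larger class correlates better with random noise.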
We will denote by $\hat{R}_S(h)$ the empirical average of a hypothesis $h\colon X \to \mathbb{R}$ and by $R(h)$ its expectation over a sample $S$ drawn according to the distribution considered. The following is a version of the Rademacher complexity bounds by ? (?) and ? (?). For completeness, the full proof is given in the Appendix.
Theorem 2 (Rademacher Bound)
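Let $H$ be a class of functions mapping $Z = X \times Y$ to $[0, 1]$ and $S = (z_1, \ldots, z_m)$ a finite sample drawn i.i.d. according to a distribution $Q$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over samples $S$ of size $m$, the following inequality holds for all $h \in H$:
$$R(h) \le \hat{R}_S(h) + \hat{R}_S(H) + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$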
3 Distances between Distributions
Clearly, for generalization to be possible, the distributions $Q$ and $P$ must not be too dissimilar; thus, some measure of the similarity of these distributions will be critical in the derivation of our generalization bounds or the design of our algorithms. This section discusses this question and introduces a discrepancy distance relevant to the context of adaptation.
The $l_1$ distance yields a straightforward bound on the difference of the error of a hypothesis $h$ with respect to $Q$ versus its error with respect to $P$.
Assume that the loss $L$ is bounded, $L \le M$ for some $M > 0$. Then, for any hypothesis $h \in H$,
$$|L_Q(h, f) - L_P(h, f)| \le M \, l_1(Q, P).$$
This provides us with a first adaptation bound suggesting that for small values of the $l_1$ distance between the source and target distributions, the average loss of hypothesis $h$ tested on the target domain is close to its average loss on the source domain. However, in general, this bound is not informative since the $l_1$ distance can be large even in favorable adaptation situations. Instead, one can use a distance between distributions better suited to the learning task.
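To make the looseness of this first bound concrete, here is a small Python sketch on discrete distributions. It assumes the distance in question is the $l_1$ distance $\sum_x |Q(x) - P(x)|$ and that the loss is the 0-1 loss (so the bound constant is 1); the distributions `Q` and `P`, the labeling function `f`, and the hypothesis `h` are hypothetical and chosen only for illustration.

```python
def expected_loss(dist, h, f, loss):
    """Expected loss of h against f under a discrete distribution {point: mass}."""
    return sum(p * loss(h(x), f(x)) for x, p in dist.items())

def l1_distance(q, p):
    """Sum over the joint support of |q(x) - p(x)|."""
    support = set(q) | set(p)
    return sum(abs(q.get(x, 0.0) - p.get(x, 0.0)) for x in support)

def zero_one(y, y_prime):
    return float(y != y_prime)

# Hypothetical source (Q) and target (P) distributions over four points.
Q = {0: 0.5, 1: 0.3, 2: 0.2, 3: 0.0}
P = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
f = lambda x: int(x >= 2)   # labeling function
h = lambda x: int(x >= 1)   # hypothesis, wrong only at x = 1

gap = abs(expected_loss(Q, h, f, zero_one) - expected_loss(P, h, f, zero_one))
bound = 1.0 * l1_distance(Q, P)  # the 0-1 loss is bounded by 1
print(gap, bound)
```

In this toy case the two expected losses differ by about 0.1 while the $l_1$ term is about 1.0: the inequality holds but says little, because $h$ and $f$ disagree only on a region where the two distributions are close.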
Consider for example the case of classification with the 0-1 loss. Fix $h \in H$, and let $a$ denote the support of $|h - f|$. Observe that $|L_Q(h, f) - L_P(h, f)| = |Q(a) - P(a)|$. A natural distance between distributions in this context is thus one based on the supremum of the right-hand side over all regions $a$. Since the target hypothesis $f$ is not known, the region $a$ should be taken as the support of $|h - h'|$ for any two $h, h' \in H$.
This leads us to the following definition of a distance originally introduced by ? (?) [pp. 271-272] under the name of generalized Kolmogorov-Smirnov distance, later by ? (?) as the $d_A$ distance, and introduced and applied to the analysis of adaptation in classification by ? (?) and ? (?).
Definition 3 ($d_A$-Distance)
Let $A$ be a set of subsets of $X$. Then, the $d_A$-distance between two distributions $Q_1$ and $Q_2$ over $X$ is defined as
$$d_A(Q_1, Q_2) = \sup_{a \in A} |Q_1(a) - Q_2(a)|.$$
As just discussed, in 0-1 classification, a natural choice for $A$ is $A = H \Delta H = \{|h' - h| \colon h, h' \in H\}$. We introduce a distance between distributions, the discrepancy distance, that can be used to compare distributions for more general tasks, e.g., regression. Our choice of the terminology is partly motivated by the relationship of this notion with the discrepancy problems arising in combinatorial contexts [chazelle].
Definition 4 (Discrepancy Distance)
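Let $H$ be a set of functions mapping $X$ to $Y$ and let $L\colon Y \times Y \to \mathbb{R}_+$ define a loss function over $Y$. The discrepancy distance $\mathrm{disc}_L$ between two distributions $Q_1$ and $Q_2$ over $X$ is defined by
$$\mathrm{disc}_L(Q_1, Q_2) = \max_{h, h' \in H} \big| L_{Q_1}(h', h) - L_{Q_2}(h', h) \big|.$$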
The discrepancy distance is clearly symmetric and it is not hard to verify that it satisfies the triangle inequality, regardless of the loss function used. In general, however, it does not define a distance: we may have $\mathrm{disc}_L(Q_1, Q_2) = 0$ for $Q_1 \neq Q_2$, even for non-trivial hypothesis sets such as that of bounded linear functions and standard continuous loss functions.
Note that for the 0-1 classification loss, the discrepancy distance coincides with the $d_A$ distance with $A = H \Delta H$. But the discrepancy distance helps us compare distributions for other losses such as $L_q(y, y') = |y - y'|^q$ for some $q$ and is more general.
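The definition can be evaluated directly when the hypothesis set is finite and the distributions are discrete. The Python sketch below does this by brute force over all pairs of hypotheses, assuming the discrepancy is the maximum over pairs $(h, h')$ of the absolute difference between the two expected losses of $h'$ against $h$. The distributions, the threshold and linear hypothesis sets, and the function name `discrepancy` are hypothetical.

```python
from itertools import product

def discrepancy(q1, q2, hypotheses, loss):
    """max over pairs (h, h') of |L_{q1}(h', h) - L_{q2}(h', h)| for discrete q1, q2."""
    support = sorted(set(q1) | set(q2))
    best = 0.0
    for h, hp in product(hypotheses, repeat=2):
        loss_1 = sum(q1.get(x, 0.0) * loss(hp(x), h(x)) for x in support)
        loss_2 = sum(q2.get(x, 0.0) * loss(hp(x), h(x)) for x in support)
        best = max(best, abs(loss_1 - loss_2))
    return best

zero_one = lambda y, yp: float(y != yp)
square = lambda y, yp: (y - yp) ** 2

# 0-1 loss with threshold classifiers: pairs disagree on intervals of points.
Q = {0: 0.5, 1: 0.3, 2: 0.2, 3: 0.0}
P = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
thresholds = [lambda x, t=t: int(x >= t) for t in range(5)]
d_01 = discrepancy(Q, P, thresholds, zero_one)

# Square loss with linear functions x -> w * x: only the second moment matters,
# so two different distributions can have zero discrepancy.
Q1 = {1.0: 1.0}
Q2 = {-1.0: 1.0}
linear = [lambda x, w=w: w * x for w in (-1.0, 0.0, 1.0)]
d_sq = discrepancy(Q1, Q2, linear, square)
print(d_01, d_sq)
```

The first value is the largest mass difference over an interval of points, which is the region-based distance for this class; the second value is zero although `Q1` and `Q2` differ, a small instance of the remark that the discrepancy is not a distance in general.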
As shown by ? (?), an important advantage of the $d_A$ distance is that it can be estimated from finite samples when $A$ has finite VC-dimension. We prove that the same holds for the $\mathrm{disc}_L$ distance and in fact give data-dependent versions of that statement with sharper bounds based on the Rademacher complexity.
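The following proposition shows that for a bounded loss function $L$, the discrepancy distance $\mathrm{disc}_L$ between a distribution and its empirical distribution can be bounded in terms of the empirical Rademacher complexity of the class of functions $L_H = \{x \mapsto L(h'(x), h(x)) \colon h, h' \in H\}$. In particular, when $L_H$ has finite pseudo-dimension, this implies that the discrepancy distance converges to zero as $O(\sqrt{\log m / m})$.
Assume that the loss function $L$ is bounded by $M > 0$. Let $Q$ be a distribution over $X$ and let $\hat{Q}$ denote the corresponding empirical distribution for a sample $S = (x_1, \ldots, x_m)$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over samples $S$ of size $m$ drawn according to $Q$:
$$\mathrm{disc}_L(Q, \hat{Q}) \le \hat{R}_S(L_H) + 3M \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$
Proof: We scale the loss $L$ to $[0, 1]$ by dividing by $M$, and denote the new class by $L_H / M$. By Theorem 2 applied to $L_H / M$, for any $\delta > 0$, with probability at least $1 - \delta$, the following inequality holds for all $h, h' \in H$:
$$\frac{L_Q(h', h)}{M} \le \frac{L_{\hat{Q}}(h', h)}{M} + \hat{R}_S(L_H / M) + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$
The empirical Rademacher complexity has the property that $\hat{R}_S(\alpha H) = \alpha \hat{R}_S(H)$ for any hypothesis class $H$ and positive real number $\alpha$ [bartlett]. Thus, $\hat{R}_S(L_H / M) = \frac{1}{M} \hat{R}_S(L_H)$, which proves the proposition.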
For the specific case of regression losses, the bound can be made more explicit.
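Corollary 5
Let $H$ be a hypothesis set bounded by some $M > 0$ for the loss function $L_q$: $L_q(h, h') \le M$, for all $h, h' \in H$. Let $Q$ be a distribution over $X$ and let $\hat{Q}$ denote the corresponding empirical distribution for a sample $S = (x_1, \ldots, x_m)$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over samples $S$ of size $m$ drawn according to $Q$:
$$\mathrm{disc}_{L_q}(Q, \hat{Q}) \le 4q \, \hat{R}_S(H) + 3M \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$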
Proof: The function is -Lipschitz for :
and . For , . Thus, by Talagrand’s contraction lemma [talagrand], is bounded by with . Then, can be written and bounded as follows
using the definition of the Rademacher variables and the sub-additivity of the supremum function. This proves the inequality and the corollary.
A very similar proof gives the following result for classification.
Corollary 6
Let $H$ be a set of classifiers mapping $X$ to $\{0, 1\}$ and let $L_{01}$ denote the 0-1 loss. Then, with the notation of Corollary 5, for any $\delta > 0$, with probability at least $1 - \delta$ over samples $S$ of size $m$ drawn according to $Q$:
$$\mathrm{disc}_{L_{01}}(Q, \hat{Q}) \le 4 \hat{R}_S(H) + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$
The factor of $4$ can in fact be reduced to $2$ in these corollaries when using a more favorable constant in the contraction lemma. The following corollary shows that the discrepancy distance can be estimated from finite samples.
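Corollary 7
Let $H$ be a hypothesis set bounded by some $M > 0$ for the loss function $L_q$: $L_q(h, h') \le M$, for all $h, h' \in H$. Let $P$ be a distribution over $X$ and $\hat{P}$ the corresponding empirical distribution for a sample $S_P$, and let $Q$ be a distribution over $X$ and $\hat{Q}$ the corresponding empirical distribution for a sample $S_Q$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over samples $S_P$ of size $m$ drawn according to $P$ and samples $S_Q$ of size $n$ drawn according to $Q$:
$$\mathrm{disc}_{L_q}(P, Q) \le \mathrm{disc}_{L_q}(\hat{P}, \hat{Q}) + 4q \big( \hat{R}_{S_P}(H) + \hat{R}_{S_Q}(H) \big) + 3M \Bigg( \sqrt{\frac{\log \frac{4}{\delta}}{2m}} + \sqrt{\frac{\log \frac{4}{\delta}}{2n}} \Bigg).$$
Proof: By the triangle inequality, we can write
$$\mathrm{disc}_{L_q}(P, Q) \le \mathrm{disc}_{L_q}(P, \hat{P}) + \mathrm{disc}_{L_q}(\hat{P}, \hat{Q}) + \mathrm{disc}_{L_q}(\hat{Q}, Q).$$
The result then follows by the application of Corollary 5 to $\mathrm{disc}_{L_q}(P, \hat{P})$ and $\mathrm{disc}_{L_q}(\hat{Q}, Q)$, each with confidence parameter $\delta / 2$.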
As with Corollary 6, a similar result holds for the 0-1 loss in classification.
4 Domain Adaptation: Generalization Bounds
This section presents generalization bounds for domain adaptation given in terms of the discrepancy distance just defined. In the context of adaptation, two types of questions arise:
we may ask, as for standard generalization, how the average loss of a hypothesis on the target distribution, $L_P(h, f)$, differs from $L_{\hat{Q}}(h, f)$, its empirical error based on the empirical distribution $\hat{Q}$;
another natural question is, given a specific learning algorithm, by how much does $L_P(h_Q, f)$ deviate from $L_P(h_P, f)$, where $h_Q$ is the hypothesis returned by the algorithm when trained on a sample drawn from $Q$ and $h_P$ the one it would have returned by training on a sample drawn from the true target distribution $P$.
We will present theoretical guarantees addressing both questions.
4.1 Generalization bounds
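Let $h_Q^* \in \mathrm{argmin}_{h \in H} L_Q(h, f_Q)$ and similarly let $h_P^*$ be a minimizer of $L_P(h, f_P)$. Note that these minimizers may not be unique. For adaptation to succeed, it is natural to assume that the average loss between the best-in-class hypotheses $h_Q^*$ and $h_P^*$ is small. Under that assumption and for a small discrepancy distance, the following theorem provides a useful bound on the error of a hypothesis with respect to the target domain.
Theorem 8
Assume that the loss function $L$ is symmetric and obeys the triangle inequality. Then, for any hypothesis $h \in H$, the following holds
$$L_P(h, f_P) \le L_P(h_P^*, f_P) + L_Q(h, h_Q^*) + \mathrm{disc}_L(P, Q) + L_P(h_Q^*, h_P^*).$$
Proof: Fix $h \in H$. By the triangle inequality property of $L$ and the definition of the discrepancy $\mathrm{disc}_L(P, Q)$, the following holds
$$L_P(h, f_P) \le L_P(h, h_Q^*) + L_P(h_Q^*, h_P^*) + L_P(h_P^*, f_P) \le L_Q(h, h_Q^*) + \mathrm{disc}_L(P, Q) + L_P(h_Q^*, h_P^*) + L_P(h_P^*, f_P).$$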
We compare (11) with the main adaptation bound given by ? (?) and ? (?):
It is very instructive to compare the two bounds. Intuitively, the bound of Theorem 8 has only one error term that involves the target function, while the bound of (12) has three terms involving the target function. One extreme case is when there is a single hypothesis in and a single target function . In this case, Theorem 8 gives a bound of , while the bound supplied by (12) is , which is larger than when . One can even see that the bound of (12) might become vacuous for moderate values of and . While this is clearly an extreme case, an error with a factor of 3 can arise in more realistic situations, especially when the distance between the target function and the hypothesis class is significant.
While in general the two bounds are incomparable, it is worthwhile to compare them using some relatively plausible assumptions. Assume that the discrepancy distance between and is small and so is the average loss between and . These are natural assumptions for adaptation to be possible. Then, Theorem 8 indicates that the regret is essentially bounded by , the average loss with respect to on . We now consider several special cases of interest.
Finally, Theorem 8 clearly leads to bounds based on the empirical error of $h$ on a sample drawn according to $Q$. We give the bound related to the 0-1 loss; others can be derived in a similar way from Corollaries 5-7 and other similar corollaries. The result follows from Theorem 8 combined with Corollary 7 and a standard Rademacher classification bound (Theorem 14) [bartlett].
Let be a family of functions mapping to and let the rest of the assumptions be as in Corollary 7. Then, for any hypothesis , with probability at least , the following adaptation generalization bound holds for the 0-1 loss:
4.2 Guarantees for regularization-based algorithms
In this section, we first assume that the hypothesis set includes the target function . Note that this does not imply that is in . Even when and are restrictions to and of the same labeling function , we may have and and the source problem could be non-realizable. Figure 1 illustrates this situation.
For a fixed loss function , we denote by the empirical error of a hypothesis with respect to an empirical distribution : . Let be a function defined over the hypothesis set . We will assume that is a convex subset of a vector space and that the loss function is convex with respect to each of its arguments. Regularization-based algorithms minimize an objective of the form
where is a trade-off parameter. This family of algorithms includes support vector machines (SVM) [ccvv], support vector regression (SVR) [vapnik98], kernel ridge regression [krr], and other algorithms such as those based on the relative entropy regularization [bousquet-jmlr].
We denote by the Bregman divergence associated to a convex function ,
and define as .
Let the hypothesis set be a vector space. Assume that is a proper closed convex function and that and are differentiable. Assume that admits a minimizer and a minimizer and that and coincide on the support of . Then, the following bound holds,
Proof: Since and , and a Bregman divergence is non-negative, the following inequality holds:
By the definition of and as the minimizers of and , and
This last inequality holds since by assumption is in .
We will say that a loss function is -admissible when there exists such that for any two hypotheses and for all , and ,
This assumption holds for the hinge loss with and for the loss with when the hypothesis set and the set of output labels are bounded by some : and .
Let be a positive-definite symmetric kernel such that for all , and let be the reproducing kernel Hilbert space associated to . Assume that the loss function is -admissible. Let be the hypothesis returned by the regularization algorithm based on for the empirical distribution , and the one returned for the empirical distribution , and that and that and coincide on . Then, for all , ,
Proof: For , is a proper closed convex function and is differentiable. We have , thus . When is differentiable, by Lemma 10,
This result can also be shown directly without assuming that is differentiable by using the convexity of and the minimizing properties of and with a proof that is longer than that of Lemma 10.
Now, by the reproducing property of , for all , and by the Cauchy-Schwarz inequality, . By the -admissibility of , for all , ,
which, combined with (20), proves the statement of the theorem.
Theorem 11 provides a guarantee on the pointwise difference of the loss for and with probability one, which of course is stronger than a bound on the difference between expected losses or a probabilistic statement. The result, as well as the proof, also suggests that the discrepancy distance is the “right” measure of difference of distributions for this context. The theorem applies to a variety of algorithms, in particular SVMs combined with arbitrary PDS kernels and kernel ridge regression.
In general, the functions and may not coincide on . For adaptation to be possible, it is reasonable to assume however that
This can be viewed as a condition on the proximity of the labeling functions (the s), while the discrepancy distance relates to the distributions on the input space (the s). The following result generalizes Theorem 11 to this setting in the case of the square loss.
Under the assumptions of Theorem 11, but with and potentially different on , when is the square loss and , then, for all , ,
Proof: Proceeding as in the proof of Lemma 10 and using the definition of the square loss and the Cauchy-Schwarz inequality give
Since , the inequality can be rewritten as
Solving the second-degree polynomial in leads to the equivalent constraint
The result then follows by the -admissibility of as in the proof of Theorem 11, with .
Using the same proof schema, similar bounds can be derived for other loss functions.
When the assumption is relaxed, the following theorem holds.
Under the assumptions of Theorem 11, but with not necessarily in and and potentially different on , when is the square loss and , then, for all , ,
5 Discrepancy Minimization Algorithms
The discrepancy distance appeared as a critical term in several of the bounds in the last section. In particular, Theorems 11 and 12 suggest that if we could select, instead of , some other empirical distribution with a smaller empirical discrepancy and use that for training a regularization-based algorithm, a better guarantee would be obtained on the difference of pointwise loss between and . Since is fixed, a sufficiently smaller discrepancy would actually lead to a hypothesis with pointwise loss closer to that of .
The training sample is given and we do not have any control over the support of $\hat{Q}$. But, we can search for the distribution $\hat{Q}'$ with the minimal empirical discrepancy distance:
$$\hat{Q}' = \mathrm{argmin}_{\hat{Q}' \in \mathcal{Q}} \mathrm{disc}_L(\hat{P}, \hat{Q}'), \qquad (25)$$
where $\mathcal{Q}$ denotes the set of distributions with support $\mathrm{supp}(\hat{Q})$. This leads to an optimization problem that we shall study in detail in the case of several loss functions.
Note that using instead of for training can be viewed as reweighting the cost of an error on each training point. The distribution can be used to emphasize some points or de-emphasize others to reduce the empirical discrepancy distance. This bears some similarity with the reweighting or importance weighting ideas used in statistics and machine learning for sample bias correction techniques [elkan, bias] and other purposes. Of course, the objective optimized here based on the discrepancy distance is distinct from that of previous reweighting techniques.
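The reweighting view can be sketched with a one-dimensional ridge regression in which each training point carries its own weight in the empirical loss. This is only an illustration of how a reweighted empirical distribution enters a regularized objective: the weights below are hand-picked, not the output of any discrepancy minimization, and the data, the scalar model $x \mapsto a x$, and the name `weighted_ridge_1d` are hypothetical.

```python
def weighted_ridge_1d(xs, ys, weights, lam):
    """Minimize sum_i w_i * (a * x_i - y_i)**2 + lam * a**2 over the scalar a.

    Setting the derivative to zero gives the closed form
    a = sum_i w_i x_i y_i / (sum_i w_i x_i**2 + lam).
    """
    num = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    den = sum(w * x * x for w, x in zip(weights, xs)) + lam
    return num / den

# Hypothetical source sample; the last point is assumed atypical for the target.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 8.0]

uniform = [0.25, 0.25, 0.25, 0.25]     # empirical source distribution
reweighted = [0.40, 0.30, 0.25, 0.05]  # hand-picked weights on the same support

a_uniform = weighted_ridge_1d(xs, ys, uniform, lam=0.1)
a_reweighted = weighted_ridge_1d(xs, ys, reweighted, lam=0.1)
print(a_uniform, a_reweighted)
```

Both weight vectors are distributions over the same four training points; only the cost attached to an error on each point changes, which is enough to move the returned hypothesis.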
We will denote by the support of , by the support of , and by their union , with and .
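In view of the definition of the discrepancy distance, problem (25) can be written as a min-max problem:
$$\hat{Q}' = \mathrm{argmin}_{\hat{Q}' \in \mathcal{Q}} \max_{h, h' \in H} \big| L_{\hat{P}}(h', h) - L_{\hat{Q}'}(h', h) \big|.$$
As with all min-max problems, the problem has a natural game theoretical interpretation. However, here, in general, we cannot permute the $\min$ and $\max$ operators since the convexity-type assumptions of the minimax theorems do not hold. Nevertheless, since the max-min value is always a lower bound for the min-max, it provides us with a lower bound on the value of the game, that is the minimal discrepancy:
$$\max_{h, h' \in H} \min_{\hat{Q}' \in \mathcal{Q}} \big| L_{\hat{P}}(h', h) - L_{\hat{Q}'}(h', h) \big| \le \min_{\hat{Q}' \in \mathcal{Q}} \max_{h, h' \in H} \big| L_{\hat{P}}(h', h) - L_{\hat{Q}'}(h', h) \big|.$$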
We will later make use of this inequality. Let us now examine the minimization problem (25) and its algorithmic solutions in the case of classification with the 0-1 loss and regression with the $L_2$ loss.
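For intuition, the min-max problem and its max-min lower bound can be checked by brute force on a tiny instance. The Python sketch below searches a coarse grid of weight vectors on the source support; it is not one of the algorithms studied in this section, only a sanity check under stated assumptions. The source points, the target distribution, the threshold hypotheses, and the names `simplex_grid` and `min_max_and_max_min` are hypothetical.

```python
from itertools import product

def simplex_grid(k, steps):
    """Weight vectors of length k with entries in {0, 1/steps, ..., 1} summing to 1."""
    for combo in product(range(steps + 1), repeat=k - 1):
        if sum(combo) <= steps:
            yield [c / steps for c in combo] + [(steps - sum(combo)) / steps]

def min_max_and_max_min(source_pts, target, hypotheses, loss, steps=20):
    """Grid search for weights on source_pts minimizing the discrepancy to target."""
    pairs = list(product(hypotheses, repeat=2))
    # Target-side expected loss and source-side pointwise losses for every pair.
    lp = [sum(p * loss(hp(x), h(x)) for x, p in target.items()) for h, hp in pairs]
    ls = [[loss(hp(x), h(x)) for x in source_pts] for h, hp in pairs]
    grid = list(simplex_grid(len(source_pts), steps))

    def gap(j, q):
        return abs(lp[j] - sum(qi * li for qi, li in zip(q, ls[j])))

    scores = [max(gap(j, q) for j in range(len(pairs))) for q in grid]
    best = min(range(len(grid)), key=scores.__getitem__)
    max_min = max(min(gap(j, q) for q in grid) for j in range(len(pairs)))
    return grid[best], scores[best], max_min

zero_one = lambda y, yp: float(y != yp)
thresholds = [lambda x, t=t: int(x >= t) for t in range(5)]
source_pts = [0, 1, 3]                         # the source never sees x = 2
target = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}  # empirical target distribution

best_q, min_max, max_min = min_max_and_max_min(source_pts, target, thresholds, zero_one)
print(best_q, min_max, max_min)
```

In this instance the region containing only the point 2 carries target mass 0.25 and no source mass under any reweighting, so both values equal 0.25; in general only the inequality max-min ≤ min-max is guaranteed.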
5.1 Classification, 0-1 Loss
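For the 0-1 loss, the problem of finding the best distribution $\hat{Q}'$ can be reformulated as the following min-max program, where $H \Delta H$ denotes the set of regions on which two hypotheses in $H$ disagree:
$$\hat{Q}' = \mathrm{argmin}_{\hat{Q}' \in \mathcal{Q}} \max_{a \in H \Delta H} \big| \hat{Q}'(a) - \hat{P}(a) \big|.$$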