Transfer Regression via Pairwise Similarity Regularization
Transfer learning methods address the situation where little labeled training data from the “target” problem exists, but much training data from a related “source” domain is available. However, the overwhelming majority of transfer learning methods are designed for simple settings where the source and target predictive functions are almost identical, limiting their applicability to real world data. We propose a novel, weaker property of the source domain that can be transferred even when the source and target predictive functions diverge. Our method assumes the source and target functions share a “Pairwise Similarity” property: if the source function makes similar predictions on a pair of instances, then so will the target function. We propose Pairwise Similarity Regularization Transfer, a flexible graph-based regularization framework which can incorporate this modeling assumption into standard supervised learning algorithms. We show how users can encode domain knowledge into our regularizer in the form of spatial continuity and pairwise “similarity constraints,” and how our method can be scaled to large data sets using the Nyström approximation. Finally, we present positive and negative results on real and synthetic data sets and discuss when our Pairwise Similarity transfer assumption seems to hold in practice.
Keywords: Transfer Learning, Regression
|Type of Transfer||Transfer Assumption||Existing Work||Failure Condition|
|Covariate Shift||f_S = f_T||[23, 24, 13, 19] and many others||f_S and f_T are different|
|Hypothesis Transfer||f_S ≈ f_T||[25, 17, 21]||f_S and f_T are too different|
|Location-Scale||f_T = a·f_S + b for smooth a and b||[29, 27, 28]||a and b are not smooth|
|Pairwise Similarity Transfer||If f_S(x_i) ≈ f_S(x_j) then f_T(x_i) ≈ f_T(x_j)||None||When this “Pairwise Similarity” assumption doesn’t hold.|
Motivation. Standard supervised learning methods can require large amounts of labeled training data in order to learn an accurate function. Transfer learning methods mitigate this by using a well-labeled source task to aid the poorly labeled target task. While previous work has been successful in a broad range of machine learning settings, traditional transfer learning methods assume the source and target predictive functions f_S and f_T are very similar (Figure 1, top-left and top-right). As such, these methods will perform poorly if there are significant differences between these two functions.
As an example, consider the problem of predicting taxi pickups throughout San Francisco given a source data set of neighborhood housing prices throughout the city. While it’s reasonable to assume there is some relationship between the two problems, such as pricier neighborhoods requiring fewer pickups due to car ownership, this relationship between the two predictive functions may be some non-trivial, non-constant and non-smooth transformation. In this case, standard methods such as simply regularizing target estimates to be similar to source estimates may result in worse performance than just using the small target data set, a phenomenon known as negative transfer.
Pairwise Similarity Transfer.
Consider the top-left, top-right, and bottom-left styles of transfer learning in Figure 1. Though mathematically different (see Table 1), they all effectively transfer the shape of the source function when estimating the target function, essentially requiring the target and source functions to be identical or near-identical. However, in our example with home prices and taxi pickups this is unlikely to be the case. In this work we examine a new property that can be shared by the source and target, “Pairwise Similarity,” wherein if the source function makes similar predictions on a pair of instances (i.e. f_S(x_i) ≈ f_S(x_j)) then so will the target function (i.e. f_T(x_i) ≈ f_T(x_j)). Note that this doesn’t make any assumptions on the relationship between f_S(x) and f_T(x) for any individual instance x. Thus, rather than transferring the overall shape of the source function, we can transfer this pairwise similarity information. For example, this property would imply that if two neighborhoods have similar house prices, then they’ll each need a similar number of taxi pickups. This can be a weaker transfer assumption because instead of requiring pointwise similarity it assumes pairwise similarity.
In this work we propose Pairwise Similarity Transfer (Figure 1), which assumes Pairwise Similarity holds and seeks to transfer this property from source to target. This can be seen in Figure 1 in how both functions divide the instance space into the same three subsets and, while the functions within these subsets are not identical, the first two are both relatively smooth (but the last is not). As such, Pairwise Similarity Transfer doesn’t fit into any of the previously proposed transfer learning settings. This difference is key because it enables Pairwise Similarity Transfer based methods to outperform previous work when this modeling assumption is accurate. Along these lines, we will show in our experiments that many real world data sets fall into this setting, such as the previously noted neighborhood housing price and taxi pickup example.
A Graph Based Mechanism For Transfer. Our method leverages this transfer assumption by first constructing a “Pairwise Similarity Graph” which simultaneously measures the similarity of pairs of instances as a function of the source function predictions and potentially other aspects defined by the user such as spatial proximity (Section 4.1) and guidance from a domain expert (Section 4.2).
Given this graph, we construct a penalty term which regularizes the discrepancy of pairs of predictions made by the target function. Since transfer is performed using a regularizer, this is a general purpose strategy that can be used with many learning algorithms. In our work we pair it with a nonparametric regression method and show it leads to better generalization performance than standard transfer learning methods on several spatial data sets which seem to display this pairwise similarity property. We also experimentally explore other data sets which do not display this property, and discuss when our method is appropriate.
Our contributions are:
We propose Pairwise Similarity Transfer, a new transfer learning setting wherein the pairwise similarity of source function predictions are transferred to the target (Section 3).
We propose Pairwise Similarity Regularization Transfer, a general purpose graph regularization transfer framework which uses an intuitive “Pairwise Similarity Graph” to transfer this information (Section 3).
We present both positive and negative experimental results on synthetic and real data sets. We also discuss when and why our methods perform better or worse than previous work (Section 6).
The overview of our paper is as follows. In Section 2 we discuss related work. In Section 3 we introduce our method and compare and contrast it with previous work. In Section 4 we present extensions to our method. In Section 5 we present theoretical results about our method. In Section 6 we experimentally compare different versions of our method and previous work. Finally we conclude in Section 7 and discuss future work.
2 Related Work
Transfer learning methods attempt to improve generalization performance using data from related source domains. Transfer learning algorithms vary in what relationships they assume hold between the source and the target, but most previous work focuses on Covariate Shift and Hypothesis Transfer, which either implicitly or explicitly assume the corresponding predictive functions f_S and f_T are identical [23, 24, 13, 19] or almost identical [7, 25, 12, 9, 17]. Exceptions to this include [29, 27, 28], which propose learning “Location-Scale” transformations to account for discrepancies between the domains.
2.1 Location-Scale Transfer.
Previous work proposed using “Location-Scale” transformations (compositions of scaling and translation functions) in order to adapt the source data to the target domain [29, 27, 28]. These methods first use what labeled target data is available to learn these transformations and then they augment the target data with the transformed source data. All these methods assume the scaling and translation functions are relatively simple in order to prevent overfitting, and as we show these methods can perform poorly on real world data.
2.2 Graph Regularization.
Laplacian Regularization is a popular framework for semisupervised learning. These methods first construct a weighted graph W over both the labeled and unlabeled training data, where W_ij is the edge weight between instances x_i and x_j. They then use its Laplacian, L = D - W, during the learning process to regularize the prediction discrepancy between pairs of instances, where D is a diagonal matrix with entries equal to the row sums of W. While our regularization formulation is similar, our work differs in how the graph is constructed and what it represents. Also, to our knowledge, using graph regularization for transferring pairwise similarity is novel.
3 Our Method
We assume we are given a labeled target data set (X_L, Y_L), an unlabeled target data set X_U, and a source function f_S which was trained on a source data set. Importantly, we do not assume the ground truth target function f_T is identical or near-identical to f_S. Rather, our work assumes that, to some extent, the source and target have the Pairwise Similarity property:
A source and target domain with ground truth predictive functions f_S and f_T are Pairwise Similar if, for much of the instance space, f_S(x_i) ≈ f_S(x_j) implies f_T(x_i) ≈ f_T(x_j).
Note that this definition doesn’t require this pairwise similarity to hold for the entire instance space.
We propose modeling this pairwise information using a weighted graph. Specifically, we treat each instance of the target domain as a node, and for every pair of instances we compute a weight W_ij = k(f_S(x_i) - f_S(x_j)), where k is a kernel, such as the Uniform or Gaussian kernels, so the weights are based on the source function predictions. These weights will be large when the source prediction discrepancy is small and vice versa. Using these weights we use soft constraints to regularize the target function estimate to enforce Pairwise Similarity. In a manner analogous to Laplacian Regularization, we propose the following optimization problem:

min_f Σ_{i ∈ L} ℓ(f(x_i), y_i) + (λ/2) Σ_{i,j} W_ij (f(x_i) - f(x_j))²     (1)

where ℓ is a loss function, λ is a regularization parameter, and for the moment we do not assume a specific form for f. The first term is the training error of the learned function on the labeled data, while the second is a graph-based regularizer which penalizes deviations in the pairwise discrepancies between f(x_i) and f(x_j). For example, if W_ij is large (indicating |f_S(x_i) - f_S(x_j)| is small), then a large penalty will be incurred if |f(x_i) - f(x_j)| is large. Conversely, if W_ij = 0, then f can vary an arbitrary amount (with respect to these two instances).
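As an illustration of the graph construction (a minimal sketch, not the paper’s implementation; the Gaussian kernel and the bandwidth name `sigma` are assumptions), the Pairwise Similarity Graph can be computed directly from the source predictions:

```python
import numpy as np

def pairwise_similarity_graph(f_src, sigma=1.0):
    """Build the Pairwise Similarity Graph W from source predictions.

    W[i, j] is large when the source function makes similar predictions
    on instances i and j, and small otherwise (Gaussian kernel sketch).
    """
    f_src = np.asarray(f_src, dtype=float)
    diff = f_src[:, None] - f_src[None, :]        # pairwise prediction gaps
    W = np.exp(-(diff ** 2) / (2 * sigma ** 2))   # Gaussian kernel weights
    np.fill_diagonal(W, 0.0)                      # no self-edges
    return W
```

Instances whose source predictions are close receive weights near 1, while pairs with very different source predictions receive weights near 0.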
Let L = D - W be the graph Laplacian constructed from W, where D is a diagonal matrix with entries equal to the row sums of W, and let f denote the vector of predictions made by f on the entire data set X = X_L ∪ X_U. Using this notation, the latter term of equation 1 can be more concisely written in matrix form as:

λ fᵀ L f     (2)
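In matrix notation the graph penalty is the Laplacian quadratic form fᵀ L f; the identity fᵀ L f = ½ Σ_{i,j} W_ij (f_i - f_j)² (the constant factor depends on whether the sum runs over ordered or unordered pairs) can be checked numerically with a small self-contained script:

```python
import numpy as np

# random symmetric weight matrix with empty diagonal
rng = np.random.default_rng(0)
W = rng.random((5, 5))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)

L = np.diag(W.sum(axis=1)) - W      # graph Laplacian L = D - W
f = rng.random(5)                   # arbitrary prediction vector

# pairwise penalty over ordered pairs, halved to match f^T L f
pairwise = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2
                     for i in range(5) for j in range(5))
assert np.isclose(f @ L @ f, pairwise)
```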
While this regularization framework does not assume a specific form for f, we experiment with the Nadaraya-Watson estimator (NW), a nonparametric regression estimator. The first step of NW is to construct a similarity matrix S, a matrix which measures the similarity between the labeled training instances and the instances to predict. Next, letting D_S be a diagonal matrix constructed from the row sums of S, the smoothing matrix A = D_S^{-1} S is calculated. Finally, letting f be the vector of predictions, the NW solution is f = A Y_L. In order to incorporate our regularizer, note that NW implicitly solves the following optimization problem:

min_f ||f - A Y_L||²
Thus, in order to incorporate Pairwise Similarity Regularization we add the graph regularizer from equation 2 to get:

min_f ||f - A Y_L||² + λ fᵀ L f     (3)
The first term is simply the objective of the Nadaraya-Watson estimator, while the second is our graph based regularizer.
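A minimal sketch of the Nadaraya-Watson estimator described above (Gaussian kernel and function names are illustrative assumptions):

```python
import numpy as np

def nadaraya_watson(X_train, y_train, X_pred, bandwidth=1.0):
    """Nadaraya-Watson: each prediction is a kernel-weighted average
    of the training labels (Gaussian kernel sketch)."""
    X_train = np.asarray(X_train, float).reshape(len(X_train), -1)
    X_pred = np.asarray(X_pred, float).reshape(len(X_pred), -1)
    # squared distances between prediction points and training points
    d2 = ((X_pred[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    S = np.exp(-d2 / (2 * bandwidth ** 2))    # similarity matrix S
    A = S / S.sum(axis=1, keepdims=True)      # smoothing matrix A = D^{-1} S
    return A @ np.asarray(y_train, float)     # NW prediction f = A Y_L
```

Because each row of A sums to one, predictions are convex combinations of the training labels; constant labels therefore produce constant predictions.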
This formulation has the closed-form solution f* = (I + λL)^{-1} A Y_L, which we derive in the supplementary materials. Also, while this formulation only makes predictions on the given data X, out-of-sample extensions can be made by using the data set (X, f*) to train any standard supervised learning model.
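The closed form can be sketched in a few lines of NumPy (an illustrative dense implementation; the function name is an assumption, and a real implementation would exploit sparsity rather than form the full Laplacian):

```python
import numpy as np

def psrt_predict(A, y_labeled, W, lam=1.0):
    """Closed-form sketch of Pairwise Similarity Regularization Transfer:
    solves (I + lam * L) f = A y, where L = D - W is the Laplacian of the
    Pairwise Similarity Graph W and A is the NW smoothing matrix."""
    W = np.asarray(W, float)
    L = np.diag(W.sum(axis=1)) - W               # graph Laplacian
    rhs = np.asarray(A, float) @ np.asarray(y_labeled, float)
    return np.linalg.solve(np.eye(len(W)) + lam * L, rhs)
```

With lam = 0 this reduces to the plain NW estimate A y; as lam grows, predictions on strongly connected pairs of nodes are pulled together while isolated nodes are untouched.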
3.1 Comparison With Previous Transfer Learning Assumptions.
Existing transfer learning formulations such as the first three in Figure 1 make strong assumptions on the similarity of the source and target functions. Specifically, they require f_S ≈ f_T (Location-Scale makes this assumption after the learned transformations are applied to the source data). Pairwise Similarity can be seen as a weaker assumption than these previous transfer learning settings for several reasons. First, it assumes a relationship between pairs of predictions, without requiring f_S and f_T to have similar shapes. Second, it makes assumptions only on subsets of the instance space where pairs of similar predictions are made in the source, while it makes no assumptions on the remainder of the instance space. Finally, since it is implemented as a regularizer, the impact of the assumption can be controlled by tuning λ.
4 Extensions
Here we outline three extensions to our work: incorporating spatial continuity, adding human guidance to the construction of the graph, and scaling our method to large data sets.
4.1 Enforcing Spatial Continuity.
The previous formulation only considers source prediction discrepancies when constructing W, but in many settings also considering spatial discrepancies between pairs of instances can be useful. Returning to the taxi pickup example, we might want to disentangle two neighborhoods which have similar housing prices but lie on different sides of the city. To do this, we modify the graph to include a spatial component as well:

W_ij = k_f(f_S(x_i) - f_S(x_j)) · k_x(x_i - x_j)

where k_f is the kernel over the source prediction discrepancies and k_x is the kernel over the spatial discrepancies. Using this graph, W_ij will only be large if the instances have similar predictions in the source domain and are spatially close.
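As an illustration (Gaussian kernels and the bandwidth names `sigma_f`, `sigma_x` are assumptions), the combined weights can be computed as a product of a prediction kernel and a spatial kernel:

```python
import numpy as np

def spatial_similarity_graph(f_src, X, sigma_f=1.0, sigma_x=1.0):
    """Pairwise Similarity Graph combining source-prediction and spatial
    discrepancies: W[i, j] is large only when BOTH the source predictions
    and the locations of instances i and j are close."""
    f_src = np.asarray(f_src, float)
    X = np.asarray(X, float).reshape(len(f_src), -1)
    df2 = (f_src[:, None] - f_src[None, :]) ** 2              # prediction gaps
    dx2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)      # spatial gaps
    W = np.exp(-df2 / (2 * sigma_f ** 2)) * np.exp(-dx2 / (2 * sigma_x ** 2))
    np.fill_diagonal(W, 0.0)
    return W
```

Two neighborhoods with identical housing prices but on opposite sides of the city would receive a near-zero weight from the spatial factor, which is exactly the disentangling behavior described above.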
4.2 Adding User Guidance.
One advantage of using a graph-based regularizer is that it can be easily modified to incorporate pairwise constraints generated by the user or from side information. This information can improve the performance of our method by replacing entries which W gets “wrong.” Clearly the ideal guidance of this form would be to replace entries in W using an “Oracle” which generates weights using the ground truth target labels, but in practice, users are unlikely to be able to generate accurate estimates of the delta |y_i - y_j|, so we instead propose the use of binary guidance, where each modified entry in W is set to 0 or 1 (see Algorithm 2). For example, if the user knows y_i and y_j should be similar, then they could set W_ij = 1. Alternatively, the user could set W_ij = 0 if they do not think they’re related.
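A possible encoding of such binary constraints (a sketch, not the paper’s Algorithm 2; the symmetric overwrite and the constraint format are assumptions):

```python
import numpy as np

def add_binary_guidance(W, constraints):
    """Overwrite Pairwise Similarity Graph entries with binary guidance.

    constraints: iterable of (i, j, similar) where similar is True if the
    user believes y_i and y_j are close, False otherwise. Sets the edge
    weight to 1 or 0 symmetrically; untouched entries keep their weights.
    """
    W = np.asarray(W, float).copy()
    for i, j, similar in constraints:
        W[i, j] = W[j, i] = 1.0 if similar else 0.0
    return W
```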
4.3 Scaling to Large Data Sets.
The closed-form solution to our method requires inverting a large matrix (or solving a large system of linear equations), which runs in time cubic in the number of instances in X. This can be too computationally expensive for large data sets, so we scale our method by approximating W using the Nyström method, a column-sampling method for matrix sketching.
Using Nyström, we approximate W as W ≈ C M⁺ Cᵀ, where C is the n × r matrix of sampled columns, M is the upper r × r block of C, and M⁺ denotes the pseudoinverse of M. Letting E = I + λD and applying the Woodbury matrix identity (treating M as invertible), we have

(E - λ C M⁺ Cᵀ)^{-1} = E^{-1} + E^{-1} C (M/λ - Cᵀ E^{-1} C)^{-1} Cᵀ E^{-1}

(see Algorithm 3). The right-hand side requires inverting a diagonal matrix and a dense r × r matrix, where r ≪ n, both of which can be done significantly faster. We found that setting r to a tiny fraction of n had a minor impact on generalization performance for most data sets while dramatically improving running time.
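A sketch of the accelerated solver under these assumptions (M is treated as invertible; the function and variable names are illustrative, and when all columns are sampled the result matches the exact dense solve):

```python
import numpy as np

def psrt_nystrom_solve(W_cols, idx, rhs, lam=1.0):
    """Approximately solve (I + lam * L) f = rhs with a Nystrom sketch of W.

    W_cols : n x r matrix of r sampled columns of the similarity graph W
    idx    : row indices of the sampled columns (so M = W_cols[idx] is r x r)
    rhs    : right-hand side, e.g. the NW estimate A @ y
    Uses the Woodbury identity so only a diagonal matrix and a dense
    r x r matrix are inverted.
    """
    C = np.asarray(W_cols, float)
    M = C[idx, :]                                 # upper r x r block
    W_hat = C @ np.linalg.pinv(M) @ C.T           # low-rank approximation of W
    d = W_hat.sum(axis=1)                         # approximate node degrees
    E_inv = 1.0 / (1.0 + lam * d)                 # (I + lam*D)^{-1}, diagonal
    # Woodbury: (E - lam*C M^{-1} C^T)^{-1}
    #         = E^{-1} + E^{-1} C (M/lam - C^T E^{-1} C)^{-1} C^T E^{-1}
    K = M / lam - C.T @ (E_inv[:, None] * C)
    u = E_inv * np.asarray(rhs, float)
    return u + E_inv * (C @ np.linalg.solve(K, C.T @ u))
```

The expensive n × n solve is replaced by one r × r solve plus matrix-vector products, which is the source of the speedup when r ≪ n.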
5 An Error Bound
In this section we provide an error bound for our algorithm. This bound is useful because it provides theoretical justification for our algorithm and shows when our algorithm will be effective. The proof is included in the supplementary materials.
First, some notation:
f_T, f_S: ground truth target and source functions.
L_T: the Laplacian created using f_T, i.e. the Laplacian of the graph with weights (W_T)_ij = k(f_T(x_i) - f_T(x_j)).
L_S: the Laplacian created using f_S, i.e. the Laplacian of the graph with weights (W_S)_ij = k(f_S(x_i) - f_S(x_j)).
f_NW, f_{L_T}, f_{L_S}: the vectors of predictions on the test data using Nadaraya-Watson, our method with L_T, and our method with L_S, respectively.
Our goal is to bound ||y - f_{L_S}||, the error of our method when using the source Laplacian L_S, where y is the vector of ground truth test labels.
Theorem 5.1 (Error Bound)
The prediction error of our method is bounded by:

||y - f_{L_S}|| ≤ ||y - f_{L_T}|| + λ ||L_S - L_T|| ||y - f_NW||
The first term on the right-hand side is the error of our method when using the “correct” Laplacian L_T. This “approximation error” is a property of L_T and the current hyperparameters.
The second term depends on two quantities: the error of the NW estimate and the difference between the source and target Laplacians. Ideally, we would simply use L_T, but because f_T is unavailable it cannot be used in practice. However, this term will be small when the source and target prediction graphs are very similar, showing our method can be successful if an appropriate source is used. While this term is minimized by setting λ to 0, doing so would make f_{L_S} = f_NW, which would cause the first term to grow.
5.1 Adding Constraints.
Suppose the user adds guidance to W_S, as suggested in Section 4.2, to produce a new Laplacian L_G. In this case, the previous bound becomes a function of ||L_G - L_T||. Thus, adding guidance is theoretically motivated as long as the new Laplacian better approximates L_T. This has two consequences. First, it shows that adding constraints can improve the accuracy of this method. Second, it shows that even adding noisy guidance can be helpful as long as it makes the Laplacian more similar to L_T.
6 Experiments
In our experiments we attempt to answer the following questions regarding the accuracy and scalability of our method, as well as the impact of adding simulated human guidance:
What is the impact of including spatial continuity when constructing the Pairwise Similarity Graph (Table 7)?
How does using the Nyström method affect the performance of our method (Figure 7)?
Does incorporating domain knowledge in the form of pairwise constraints when constructing the graph improve the accuracy of our method (Table 7)?
|Target Only||Training using only the target data.|
|Semisupervised||Semisupervised regression using the “Learning with Local and Global Consistency” method.|
|Stacked||A Hypothesis Transfer method, where predictions made by learned source and target functions are used as features for a linear function. This method performs best when the source and target functions are close to identical. While simple, this method can perform very well in practice.|
|Offset||A state-of-the-art transfer learning method which accounts for differences between the source and target by estimating a translation transformation between the source and target domains.|
|Location-Scale ||A state-of-the-art transfer learning method which estimates a location-scale transformation to map the source data to the target task.|
The competing methods we used are described in Table 2. The variations of our method we used are:
PSRT: The Pairwise Similarity Regularization Transfer method we proposed (Section 3).
PSRT+SC: The Pairwise Similarity Regularization Transfer method we proposed, including spatial continuity (Section 4.1).
PSRT: Nyström: Our method where we used the Nyström method to accelerate our solver (Section 4.3); the listed percentage is the fraction of columns sampled.
PSRT: Guidance: Our method with guidance (Section 4.2), where some percentage of the Similarity Graph entries were replaced with 1 if the ground truth target labels were similar and 0 otherwise. Here, we defined “similar” to mean the absolute difference in labels was within the bottom tenth percentile of all pairwise differences. The listed percentage is the fraction of possible pairs sampled for guidance.
Because the data sets we use do not suffer from “Covariate Shift,” we did not run experiments using Covariate Shift and Domain Adaptation methods such as those in [23, 24, 13, 19]. However, these methods could be used as a preprocessing step for our method when using data with Covariate Shift. For all methods all parameters were tuned on a validation set and results are the average over 30 train/test splits. Values in parentheses are confidence intervals. If this paper is accepted then all code and data will be made available online.
Data: We used 7 regression data sets: four spatial data sets which we found Pairwise Similarity Transfer performed well on, and three data sets where Pairwise Similarity Transfer did not perform well. This is to be expected as the Pairwise Similarity transfer assumption, like any transfer learning assumption, need not always hold. We included both positive and negative experiments to better understand the benefits and limitations of Pairwise Similarity Transfer. The data sets are described in Tables 3 and 4. To better understand the performance of our method we also included heat maps of the gradient magnitudes of the predictive functions in Figures 3, 4 and 5. These plots were generated by estimating the gradients of the predictive functions through the instance space and visualizing the norms of the gradients, where lighter values correspond to a lower magnitude, i.e. lighter areas indicate less variation of the function.
|Synthetic Piecewise||A synthetic piecewise constant data set with Gaussian noise. The source and target data were different piecewise constant functions, but share the same set of discontinuities.|
|Census ||Predicting the mean household size of zip codes in southern California as recorded from the 2010 US census. Source is mean income of households, as recorded in the 2014 American Community Survey.|
|Temperature ||Predicting the mean, high temperature in April 2016 across the southern United States. Source is mean, high temperature from January 2016.|
|Taxi+Housing [4, 22]||Predicting the average number of taxi pickups at various regions of San Francisco, California, between 5am and 12pm over thirty days. Source is average home price of neighborhood.|
|Bike Sharing [8, 18]||Bike rental prediction in 2011 and 2012 as a function of weather. We used the 2011 data as the source and the 2012 data as the target.|
|Boston Housing (BH) [14, 18]||House price prediction in Boston as a function of number of rooms in the house. Bottom quartile of LSTAT (percentage of lower status of the population) for target, second quartile for source.|
|King County Housing (KC Housing) ||Predicting housing prices in King County, Washington as a function of location. We split houses into source and target data sets by the number of floors of each house.|
All Methods Experiments: Table 5 shows the performance of these methods on real and synthetic data sets. For these data sets our method performed best. The positive performance of our method seems to be due to Pairwise Similarity holding for these data sets. For example, the climate data is shown in Figures 1(a), 1(b), and the magnitude of the gradients are shown in Figure 3. From these figures we see the two data sets seem to share subsets of the instance space where the functions are relatively smooth. In particular, looking at the gradient information shows shared areas of near-zero gradient magnitude. The other transfer methods likely performed worse due to their transfer assumptions being violated.
Table 6 shows the performance of these methods on a different set of data sets. While our method generally performed better than Location-Scale, Target Only and Stacked, Offset performed best overall. This is likely due to the transformation between the source and target being smooth. Looking at the gradient information in Figure 5 gives us some insight into this. While there is some alignment in where the functions do not rapidly vary, it is to a much lesser degree than in Figures 3 and 4. This matches the empirical performance of our method, which was good, but not as strong as Offset.
It’s worth noting how the “worst case” performance of our method and Offset differed. While Offset displays negative transfer on most of the first four data sets (performing worse than just using the target data), our method always performed better than Target Only. This is important because it suggests our method is more robust to negative transfer than Offset.
Another key insight is that all the real data sets with positive results were spatial data sets, indicating Pairwise Similarity is a property that can likely hold for spatial data.
Spatial Continuity: Tables 5 and 6 show the performance of our method with and without the incorporation of spatial continuity when constructing the Pairwise Similarity Graph. These results show including it can dramatically improve performance, but it did not universally help. Rather, the utility of spatial continuity seems to be a property of the data set, but using it seems wise in practice.
Nyström Approximation: Table 7 shows the performance of our method when using the Nyström method to approximate the source prediction graph. These results show that for most of our data sets sampling had a negligible impact on performance, even when only a small fraction of the Pairwise Similarity Graph’s columns were used. The exception to this is the Synthetic Piecewise data set, which displayed erratic performance. We suspect this is due to the Pairwise Similarity Graph not being low rank.
Overall, these results demonstrate that our method can be scaled by using the Nyström approximation without significant losses in accuracy.
Binary Guidance: Table 7 shows the performance of our method when incorporating varying amounts of binary guidance (see Section 4.2). These results show our approach of encoding guidance can dramatically improve the performance of our method. This is an important result because it shows even the relatively noisy guidance that humans could provide can aid our algorithm. Additionally, this shows that pairwise similarity guidance can be an effective mechanism for improving transfer learning.
7 Conclusion
We proposed Pairwise Similarity Transfer, a new transfer learning setting where the similarity between pairs of source predictions is transferred, and we empirically showed this assumption leads to positive results for a variety of complicated spatial transfer learning problems. We modeled Pairwise Similarity Transfer by creating a graph from the source data where each node is an instance and the edge weights are a function of the source prediction discrepancies. Modeling this form of transfer through a graph has several benefits. First, it is easy to extend the method to include spatial discrepancy. Second, we can use the Nyström approximation to scale the method to larger data sets. Finally, it allows a novel method of encoding pairwise user guidance. We showed that our method has strong theoretical justification by bounding its prediction error. In the future, we will explore whether Pairwise Similarity can occur in non-spatial data sets. In particular, we suspect it will hold in some time series data.
-  House sales in King County, WA. https://www.kaggle.com/harlfoxem/housesalesprediction.
-  National centers for environmental information. https://www.ncdc.noaa.gov.
-  United states census bureau. https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml.
-  Zillow research. https://www.zillow.com/research/data/.
-  M. Belkin, P. Niyogi, and V. Sindhwani, Manifold regularization: A geometric framework for learning from labeled and unlabeled examples, JMLR, (2006).
-  L. Bottou, Large-scale machine learning with stochastic gradient descent, in COMPSTAT, 2010.
-  H. Daume III and D. Marcu, Domain adaptation for statistical classifiers, Journal of Artificial Intelligence Research, (2006), pp. 101–126.
-  H. Fanaee-T and J. Gama, Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence, (2014).
-  B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, Unsupervised visual domain adaptation using subspace alignment, in ICCV, 2013.
-  C. Fowlkes, S. Belongie, F. Chung, and J. Malik, Spectral grouping using the nystrom method, PAMI, (2004).
-  G. H. Golub and C. F. Van Loan, Matrix computations, 2012.
-  B. Gong, Y. Shi, F. Sha, and K. Grauman, Geodesic flow kernel for unsupervised domain adaptation, in CVPR, 2012.
-  A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf, Covariate shift by kernel mean matching, Dataset shift in machine learning, (2009).
-  D. Harrison and D. L. Rubinfeld, Hedonic housing prices and the demand for clean air, Journal of environmental economics and management, (1978).
-  T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, The elements of statistical learning: data mining, inference and prediction, The Mathematical Intelligencer, (2005).
-  R. A. Horn and C. R. Johnson, Matrix analysis, Cambridge university press, 2012.
-  I. Kuzborskij and F. Orabona, Stability and hypothesis transfer learning., in ICML, 2013.
-  M. Lichman, UCI machine learning repository, 2013.
-  S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, Domain adaptation via transfer component analysis, Transactions on Neural Networks, (2011).
-  S. J. Pan and Q. Yang, A survey on transfer learning, KDE, (2010).
-  N. Patricia and B. Caputo, Learning to learn, from transfer learning to domain adaptation: A unifying perspective, in CVPR, 2014.
-  M. Piorkowski, N. Sarafijanovic-Djukic, and M. Grossglauser, CRAWDAD dataset epfl/mobility (v. 2009-02-24). Downloaded from http://crawdad.org/epfl/mobility/20090224, Feb. 2009.
-  M. Sugiyama, M. Krauledat, and K.-R. Müller, Covariate shift adaptation by importance weighted cross validation, JMLR, (2007).
-  M. Sugiyama, S. Nakajima, H. Kashima, P. V. Buenau, and M. Kawanabe, Direct importance estimation with model selection and its application to covariate shift adaptation, in NIPS, 2008.
-  T. Tommasi, F. Orabona, and B. Caputo, Safety in numbers: Learning categories from few examples with multi model knowledge transfer, in CVPR, 2010.
-  U. Von Luxburg, A tutorial on spectral clustering, Statistics and computing, (2007).
-  X. Wang, T.-K. Huang, and J. Schneider, Active transfer learning under model shift, in ICML, 2014.
-  X. Wang and J. Schneider, Flexible transfer learning under support and model shift, in NIPS, 2014.
-  K. Zhang, K. Muandet, Z. Wang, et al., Domain adaptation under target and conditional shift, in ICML, 2013.
-  D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, Learning with local and global consistency, NIPS, (2004).