
# A Data-Dependent Distance for Regression

Jeff M. Phillips School of Computing, University of Utah, Salt Lake City, Utah, US
jeffp@cs.utah.edu
Pingfan Tang School of Computing, University of Utah, Salt Lake City, Utah, US
tang1984@cs.utah.edu
###### Abstract

We develop a new data-dependent distance for regression problems to compare two regressors (hyperplanes that fit or divide a data set). Most distances between objects attempt to capture the intrinsic geometry of these objects and measure how similar that geometry is. However, we argue that existing measures are inappropriate for regressors to a data set. We introduce a family of new distances that measure how similarly two regressors interact with the data they attempt to fit. For the variant we advocate, we show that it is a metric (under mild assumptions), induces metric balls with bounded VC-dimension, is robust to changes in the corresponding data, and can be approximated quickly. We show a simple extension to trajectories that inherits these properties, as well as several other algorithmic applications.

Moreover, in order to develop efficient approximation algorithms for this distance we formalize the relationship between sensitivity and leverage scores. This may be of independent interest.

data-dependent distance, sensitivity, leverage score, trajectories


## 1 Introduction

Linear models are simple and widely used in many areas. There are many linear regression algorithms, like least squares regression, lasso regression, and the Siegel estimator [30], that predict a dependent variable based on the values of several independent variables. Linear classifiers use data labeled by a class (usually positive or negative) to separate a domain into corresponding regions for use in predicting that class for new data. These can also be viewed as regression, e.g., logistic regression, although there are many ways to build such models. All of these algorithms build a model from a data set and output a hyperplane. Despite the central role these techniques play in data analysis, and the extensive analysis of their efficiency and robustness, there is very little work on how to compare these outputs.

In other areas of data analysis, complex objects like lines or hyperplanes take on a first-class role, deserving of study. For instance, objects such as airplane routes, walking paths, or migratory paths of birds are collected in large quantities and need to be clustered, classified, or have their density analyzed. Moreover, these objects are relevant not just in their intrinsic geometry (where there is a burgeoning set of approaches [8, 22, 10, 9, 33]), but also in how they interact with a set of observation points $Q$, such as wifi access points, radio/cell towers, other airports, or no-fly/construction zones (for which there are few or no approaches).

Moreover, in both scenarios, most of the existing approaches are quite limited or have unintuitive and undesirable properties.

#### The default dual-Euclidean distance.

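Consider the least squares regression problem in $\mathbb{R}^2$: given $Q = \{(x_1,y_1),\ldots,(x_n,y_n)\} \subset \mathbb{R}^2$, return a line $\ell: y = ax + b$ such that $(a,b) = \arg\min_{(a,b)\in\mathbb{R}^2} \sum_{i=1}^n (a x_i + b - y_i)^2$. If $\ell_1: y = a_1 x + b_1$ is an alternate fit to this data, then to measure the difference in these variants, we can define a distance between $\ell$ and $\ell_1$. A simple and commonly used distance (which we call the dual-Euclidean distance) is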

$$d_{dE}(\ell,\ell_1) := \sqrt{(a-a_1)^2 + (b-b_1)^2}.$$

This can be viewed as dualizing the lines into a space defined by their parameters (slope $a$ and intercept $b$), and then taking the Euclidean distance between these parametric points. However, as shown in Figure 1, if both and have the same slope , and are offset the same amount from (), then , although intuitively does a much more similar job to with respect to than does .
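The deficiency described above can be illustrated with a minimal sketch. The three lines, the sample abscissae `xs`, and the helper `rms_prediction_gap` are assumptions chosen for illustration (they are not the configuration of Figure 1): two alternate lines share a slope and are offset by the same amount from the reference line, so their dual-Euclidean distances coincide, yet one of them agrees far better with the reference line where the data lies.

```python
import math

def dual_euclidean(line_a, line_b):
    """Euclidean distance between (slope, intercept) parameter pairs."""
    (a, b), (a1, b1) = line_a, line_b
    return math.hypot(a - a1, b - b1)

def rms_prediction_gap(line_a, line_b, xs):
    """Illustrative helper (hypothetical): RMS gap between the two lines'
    predicted y-values at the abscissae xs."""
    (a, b), (a1, b1) = line_a, line_b
    return math.sqrt(sum(((a - a1) * x + (b - b1)) ** 2 for x in xs) / len(xs))

ell = (0.0, 0.0)     # reference line y = 0
ell1 = (1.0, -1.0)   # y = x - 1, crosses the reference line at x = 1
ell2 = (1.0, 1.0)    # y = x + 1, same slope, opposite intercept offset
xs = [0.8, 0.9, 1.0, 1.1, 1.2]   # data concentrated near x = 1

print(dual_euclidean(ell, ell1), dual_euclidean(ell, ell2))   # identical values
print(rms_prediction_gap(ell, ell1, xs), rms_prediction_gap(ell, ell2, xs))
```

The parameter-space distance cannot distinguish the two alternates, while the data-aware comparison separates them clearly.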

#### Other potential distances.

A geometric object is usually described by an (often compact) set in $\mathbb{R}^d$. There are many ways to define and compute distances between such objects [1, 2, 18, 19]. These can be based on the minimum [18, 19] or maximum (e.g., Hausdorff) [1, 2] distance between objects. For lines or hyperplanes, which extend infinitely and may intersect at single points, such measures are clearly not meaningful.

Besides the dual-Euclidean distance whose deficiencies are discussed above, another plausible approach to defining a distance between lines would be via Plücker coordinates [29]. These assign six homogeneous coordinates to each line in projective 3-space that are unique up to a non-zero scaling factor. So one could then normalize these coordinates and define a distance as the geodesic on (or even just the Euclidean distance). However, this representation does not account for or extend to higher-dimensional structures like hyperplanes in , for , as our distance will.

The relation to other distances is explored further in Appendix A.

#### What is needed in a distance?

A definition of a distance is the key building block in most data analysis tasks. For instance, it is at the heart of any assignment-based clustering (e.g., $k$-means) or nearest-neighbor searching and analysis. We can also define a radial-basis kernel (or similarly), which is required for kernel SVM classification, kernel regression, and kernel density estimation. A change in the distance directly affects the meaning and modeling inherent in each of these tasks. So the first consideration in choosing a distance should always be: does it capture the properties between the objects that matter?

The second goal for a distance is often that it should be a metric: for instance, this is essential in the analysis of the Gonzalez algorithm [20] for $k$-center clustering, and in many other contexts such as nearest-neighbor searching.

The third goal is that its metric balls are well-behaved. More specifically, consider a set of hyperplanes as the base object and a distance , let be a metric ball around of radius . Then we can define a range space where , and consider its VC-dimension [31]. When the VC-dimension is small, it implies that the metric balls cannot interact with each other in too complex a way, indicating the distance is roughly as well-behaved as a roughly -dimensional Euclidean ball. More directly, this implies for kernel-based tasks that, by subsampling from a large set of hyperplanes, we can preserve density estimates [21], estimate multi-modality, and build classifiers with bounded generalization error [31]. In other words, this ensures that these tasks are stable with respect to the underlying data set .

A fourth goal, when a distance depends on a base set , is that we want the distance to be stable with respect to this as well. That is, for any (potentially large) set , there should be another one with bounded size so that is guaranteed to be close to for all inputs. Such a property implies that the distance is resilient to modifications to the base set , and (in our case) will imply that the cost of computing can always be reduced to something manageable, independent of the size of . Moreover, when is very large, it is important to be able to compute such a efficiently, and if possible in a streaming setting.

Finally, as a sanity check when the base objects are lines or halfspaces, it is desirable for any distance to align with intuitive notions. For instance when a set of halfspaces are all parallel, then should be isomorphic to the offset distance between these halfspaces.

### 1.1 Main Results

We define a new data-dependent distance $d_Q$ for linear models built from a data set $Q$. Our distance is a simple, intuitive way to capture the relationship between the model and $Q$; it does not have the same deficiencies as $d_{dE}$, and it satisfies all of the goals outlined in the above section: it is a metric, its metric balls have VC-dimension depending only on the ambient dimension, and we can approximate $Q$ independent of its size.

We first define $d_Q$ for lines in $\mathbb{R}^2$ (in Section 2) and show it is a metric when some subset of points in $Q$ are non-collinear. Then we show the same for the general case of hyperplanes in $\mathbb{R}^d$ (in Section 3).

We show that the range space induced by metric balls of $d_Q$ in $\mathbb{R}^d$ has VC-dimension that depends only on $d$, and not on the size or shape of $Q$. We find this surprising since there is a natural mapping under this metric from any hyperplane to a point in $\mathbb{R}^n$, where $d_Q$ is then the Euclidean distance in $\mathbb{R}^n$. This would suggest a VC-dimension of $n+1$; and indeed this is the best such bound we have for another variant of the distance we considered.

Then we show in Section 4 that we can approximate $d_Q$ with another distance defined by a point set $\tilde{Q}$. This result appeals to sensitivity sampling [23], a form of importance sampling. We also show how to perform this sampling in a streaming setting using a result for matrix row sampling with leverage scores [7]. This is required since the sensitivity scores change dynamically as more points are seen in the stream.

The stream sampling result requires showing an equivalence between sensitivity sampling and leverage score sampling. We believe this may be of independent interest since it ports recent advances in adaptive leverage-score based row sampling [7] to sensitivity-based methods common for clustering and subspace approximation.

Next, in Section 5 we show a simple way to extend the distance, and its aforementioned nice properties, to operate on trajectories. We show a few examples of how this directly leads to straightforward and effective clustering and classification algorithms.

Finally, we discuss algorithmic applications of $d_Q$. The VC-dimension bound immediately implies results in understanding distributions of linear estimators arising from uncertain data analysis and statistical variance analysis. It also validates an associated kernel density estimate over these estimators. We also describe how the results for sampling from $Q$ make various goals efficient, including evaluating coresets and detecting multi-modality.

## 2 The Distance Between Two Lines

In this section, as a warm-up to the general case, we define a new data-dependent distance between two lines, and give the condition under which it is a metric.

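Suppose $Q = \{q_1,\ldots,q_n\} \subset \mathbb{R}^2$ where $q_i$ has coordinates $(x_i, y_i)$ for $1 \le i \le n$, and $\ell$ is a line in $\mathbb{R}^2$, then $\ell$ can be uniquely expressed as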

$$\ell = \{(x,y)\in\mathbb{R}^2 \mid u_1 x + u_2 y + u_3 = 0\}, \tag{1}$$

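where $u = (u_1,u_2,u_3)$. Requiring $u_1^2 + u_2^2 = 1$ and that the first nonzero entry of $u$ is positive is a canonical way to normalize $u$, where $(u_1,u_2)$ is the unit normal and $u_3$ is an offset parameter. Let $v_{Q_i}(\ell) = u_1 x_i + u_2 y_i + u_3$; it is the signed distance from $q_i$ to the closest point on $\ell$. Then $v_Q(\ell) = (v_{Q_1}(\ell),\ldots,v_{Q_n}(\ell))$ is the $n$-dimensional vector of these distances. For two lines $\ell_1$, $\ell_2$ in $\mathbb{R}^2$, we can now define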

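$$d_Q(\ell_1,\ell_2) = \left\| \frac{1}{\sqrt{n}} \big(v_Q(\ell_1) - v_Q(\ell_2)\big) \right\| = \left( \sum_{i=1}^{n} \frac{1}{n} \big(v_{Q_i}(\ell_1) - v_{Q_i}(\ell_2)\big)^2 \right)^{\frac{1}{2}}. \tag{2}$$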

As shown in Figure 2, $v_{Q_i}(\ell)$ is the signed distance from $q_i$ to $\ell$. With the help of $v_Q$, we convert each line in $\mathbb{R}^2$ to a point in $\mathbb{R}^n$, and use the Euclidean distance between two points to define the distance between the original two lines. Via this Euclidean embedding, it directly follows that $d_Q$ is symmetric and satisfies the triangle inequality. The following theorem shows that, under reasonable assumptions on $Q$, no two different lines can be mapped to the same point in $\mathbb{R}^n$, so (2) is a metric.

{theorem}

Suppose in there are three non-collinear points, and , then is a metric in .

###### Proof.

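The function $d_Q$ is symmetric and, by mapping to $\mathbb{R}^n$, satisfies the triangle inequality, and $\ell_1 = \ell_2$ implies $d_Q(\ell_1,\ell_2) = 0$; we now show that if $d_Q(\ell_1,\ell_2) = 0$, then $\ell_1 = \ell_2$.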

Without loss of generality, we assume $q_1, q_2, q_3$ are not on the same line, which implies

$$\begin{vmatrix} x_1 & y_1 & 1 \\ x_2 & y_2 & 1 \\ x_3 & y_3 & 1 \end{vmatrix} \neq 0. \tag{3}$$

Suppose and are expressed in the form:

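$$\begin{aligned} \ell_1 &= \{(x,y)\in\mathbb{R}^2 \mid u^{(1)}_1 x + u^{(1)}_2 y + u^{(1)}_3 = 0\}, \\ \ell_2 &= \{(x,y)\in\mathbb{R}^2 \mid u^{(2)}_1 x + u^{(2)}_2 y + u^{(2)}_3 = 0\}, \end{aligned}$$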

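where $u^{(1)} = (u^{(1)}_1, u^{(1)}_2, u^{(1)}_3)$ and $u^{(2)} = (u^{(2)}_1, u^{(2)}_2, u^{(2)}_3)$ represent lines $\ell_1$ and $\ell_2$, respectively. If $d_Q(\ell_1,\ell_2) = 0$, then we have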

$$x_i\big(u^{(1)}_1 - u^{(2)}_1\big) + y_i\big(u^{(1)}_2 - u^{(2)}_2\big) + \big(u^{(1)}_3 - u^{(2)}_3\big) = 0$$

for $i = 1, 2, 3$. Using (3) we can write this as the system

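$$\begin{bmatrix} x_1 & y_1 & 1 \\ x_2 & y_2 & 1 \\ x_3 & y_3 & 1 \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \\ z_3 \end{bmatrix} = 0 \quad \text{where} \quad \begin{bmatrix} z_1 \\ z_2 \\ z_3 \end{bmatrix} = \begin{bmatrix} u^{(1)}_1 - u^{(2)}_1 \\ u^{(1)}_2 - u^{(2)}_2 \\ u^{(1)}_3 - u^{(2)}_3 \end{bmatrix}.$$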

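It has the unique solution $z_j = 0$ for $j = 1, 2, 3$. So, we have $u^{(1)}_1 = u^{(2)}_1$, $u^{(1)}_2 = u^{(2)}_2$, and $u^{(1)}_3 = u^{(2)}_3$, and thus $\ell_1 = \ell_2$. ∎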
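Definition (2), together with the canonical normalization of $(u_1,u_2,u_3)$, can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation; the function names `normalize_line`, `signed_distances`, and `d_Q` are hypothetical, and a line is passed as any nonzero multiple of its coefficient triple.

```python
import math

def normalize_line(u1, u2, u3):
    """Rescale so (u1, u2) is a unit normal and the first nonzero entry is positive."""
    norm = math.hypot(u1, u2)
    if norm == 0:
        raise ValueError("(u1, u2) must not both be zero")
    u = [u1 / norm, u2 / norm, u3 / norm]
    first = next(c for c in u if c != 0)   # always found among u1, u2
    if first < 0:
        u = [-c for c in u]
    return tuple(u)

def signed_distances(line, Q):
    """Vector v_Q(line): signed distance from every point of Q to the line."""
    u1, u2, u3 = normalize_line(*line)
    return [u1 * x + u2 * y + u3 for (x, y) in Q]

def d_Q(line1, line2, Q):
    """Equation (2): Euclidean distance between the scaled vectors v_Q / sqrt(n)."""
    v1 = signed_distances(line1, Q)
    v2 = signed_distances(line2, Q)
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(v1, v2)) / len(Q))

Q = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # three non-collinear points
line_a = (0.0, 1.0, 0.0)     # y = 0
line_b = (0.0, -3.0, 3.0)    # y = 1, written with a non-canonical scaling
print(d_Q(line_a, line_b, Q))
```

Because both lines are normalized before comparison, the result does not depend on how the coefficient triple was scaled when it was passed in.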

#### Remark.

For in (2), its absolute value is the distance from to the line , i.e. . Moreover, if is parallel to , then for any , which means definition (2) is a generalization of the natural offset distance between two parallel lines.
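The parallel case in the remark above can be checked with a short derivation, written here with the notation $u^{(1)}, u^{(2)}$ from the proof of the theorem. It relies on the observation that the canonical normalization fixes the sign of the unit normal, so two parallel lines share $(u_1,u_2)$ and differ only in the offset $u_3$.

```latex
% Parallel lines: u^{(1)}_1 = u^{(2)}_1 and u^{(1)}_2 = u^{(2)}_2.
\begin{aligned}
v_{Q_i}(\ell_1) - v_{Q_i}(\ell_2)
  &= \big(u^{(1)}_1 - u^{(2)}_1\big) x_i + \big(u^{(1)}_2 - u^{(2)}_2\big) y_i
     + \big(u^{(1)}_3 - u^{(2)}_3\big)
   = u^{(1)}_3 - u^{(2)}_3 \quad \text{for every } i, \\
d_Q(\ell_1,\ell_2)
  &= \Big( \sum_{i=1}^{n} \tfrac{1}{n} \big(u^{(1)}_3 - u^{(2)}_3\big)^2 \Big)^{1/2}
   = \big| u^{(1)}_3 - u^{(2)}_3 \big| .
\end{aligned}
```

The right-hand side is the usual offset distance between two parallel lines, and it does not depend on $Q$.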

#### Remark.

There are several other nicely defined variants of this distance. For a line we could define , as the unsigned distance from to the line . When the sign of the distance from to some bounded object (in place of ), this distance may be more natural. We are able to show in Appendix C that, under similar mild restrictions on , this is a metric; the condition requires points instead of . However, we are not able to show constant-size VC-dimension for its metric balls (as we do for in Section 3.1). There we also introduce another matrix Frobenius norm variant.

## 3 The Distance Between Two Hyperplanes

In this section, we generalize $d_Q$ to the distance between two hyperplanes, and we bound the VC-dimension of the range space induced by this distance. Let $\mathcal{H}$ represent the space of all hyperplanes in $\mathbb{R}^d$.

Suppose $Q = \{x_1,\ldots,x_n\} \subset \mathbb{R}^d$, where $x_i$ has the coordinates $(x_{i,1},\ldots,x_{i,d})$. Any hyperplane $h \in \mathcal{H}$ can be uniquely expressed in the form

$$h = \Big\{ x = (x_1,\cdots,x_d) \in \mathbb{R}^d \;\Big|\; \sum_{j=1}^{d} u_j x_j + u_{d+1} = 0 \Big\},$$

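where $u = (u_1,\ldots,u_{d+1})$ is a canonically normalized vector, i.e. $(u_1,\ldots,u_d)$ is the unit normal vector of $h$, and $u_{d+1}$ is the offset. We introduce the notation $v_Q(h) = (v_{Q_1}(h),\ldots,v_{Q_n}(h))$, where $v_{Q_i}(h)$ is again the signed distance from $x_i$ to the closest point on $h$. We can specify $v_{Q_i}(h) = \sum_{j=1}^{d} u_j x_{i,j} + u_{d+1}$, which is a dot-product with the unit normal of $h$, plus the offset $u_{d+1}$. Now for two hyperplanes $h_1, h_2$ in $\mathcal{H}$ define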

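$$d_Q(h_1,h_2) := \left\| \frac{1}{\sqrt{n}} \big(v_Q(h_1) - v_Q(h_2)\big) \right\| = \left( \sum_{i=1}^{n} \frac{1}{n} \big(v_{Q_i}(h_1) - v_{Q_i}(h_2)\big)^2 \right)^{\frac{1}{2}}. \tag{4}$$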

For $Q \subset \mathbb{R}^d$, similar to $Q$ in $\mathbb{R}^2$, we want to consider the case that there are $d+1$ points in $Q$ which are not on the same hyperplane. We refer to such a point set as full rank since if we treat the points as rows, and stack them to form a matrix, then that matrix is full rank. Like lines in $\mathbb{R}^2$, a hyperplane can also be mapped to a point in $\mathbb{R}^n$, and if $Q$ is full rank, then no two hyperplanes will be mapped to the same point in $\mathbb{R}^n$. So, similar to Theorem 2, we can prove (4) is a metric.

{theorem}

If $Q$ is full rank, then $d_Q$ is a metric in $\mathcal{H}$.

#### Remark.

The definition (4) can be generalized to weighted point sets and continuous probability distributions. Suppose , , and is a probability measure on . For two hyperplanes in , we define

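$$d_{Q,W}(h_1,h_2) = \left( \sum_{i=1}^{n} w_i \big(v_{Q_i}(h_1) - v_{Q_i}(h_2)\big)^2 \right)^{\frac{1}{2}}, \quad \text{and} \quad d_{\mu}(h_1,h_2) = \left( \int_{\mathbb{R}^d} \big(v_x(h_1) - v_x(h_2)\big)^2 \, d\mu(x) \right)^{\frac{1}{2}}$$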

where is defined in the same way as for .
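A minimal sketch of the weighted variant for hyperplanes in $\mathbb{R}^d$ follows. It assumes each hyperplane is already given by its canonically normalized vector $(u_1,\ldots,u_{d+1})$ and that the weights are nonnegative; the function names are hypothetical. With uniform weights $w_i = 1/n$ it reduces to definition (4).

```python
import math

def signed_distance(u, x):
    """Signed distance from point x in R^d to the hyperplane with normalized
    vector u = (u_1, ..., u_d, u_{d+1}); zip stops after the d coordinates of x."""
    return sum(uj * xj for uj, xj in zip(u, x)) + u[-1]

def d_QW(u1, u2, Q, W=None):
    """Weighted distance d_{Q,W}; W=None uses w_i = 1/n, which recovers (4)."""
    n = len(Q)
    if W is None:
        W = [1.0 / n] * n
    return math.sqrt(sum(w * (signed_distance(u1, q) - signed_distance(u2, q)) ** 2
                         for w, q in zip(W, Q)))

# Two parallel hyperplanes in R^3: z = 0 and z = 2.
Q = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
h1 = (0.0, 0.0, 1.0, 0.0)
h2 = (0.0, 0.0, 1.0, -2.0)
print(d_QW(h1, h2, Q))                          # uniform weights
print(d_QW(h1, h2, Q, [0.1, 0.2, 0.3, 0.4]))    # any weights summing to 1
```

For parallel hyperplanes the per-point difference is constant, so any weights that sum to one give the offset distance.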

### 3.1 VC-Dimension of Metric Balls for dQ

The distance can induce a range space , where again is the collection of all hyperplanes in , and with metric ball . We prove that the VC dimension [31] of this range space only depends on , and is independent of the number of points in .

{theorem}

Suppose is full rank, then the VC-dimension of the range space is at most .

###### Proof.

For any , suppose with and . This implies , so if is represented by a unique vector , then we have

$$\sum_{i=1}^{n} \frac{1}{n} \Big( \sum_{j=1}^{d} u_j x_{i,j} + u_{d+1} - v_{Q_i}(h_0) \Big)^2 \le r^2. \tag{5}$$

Since (5) can be viewed as a polynomial of , we can use a standard lifting map to convert it to a linear equation about new variables, and then use the VC-dimension of the collection of halfspaces to prove the result.

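To this end, we introduce the following data parameters $a_j$ [for $0 \le j \le d+1$] and $a_{j,j'}$ [for $1 \le j \le j' \le d+1$] which only depend on $Q$, $h_0$, and $r$. That is, these only depend on the metric and the choice of metric ball.

We also introduce another set of new variables $y_j$ [for $1 \le j \le d+1$] and $y_{j,j'}$ [for $1 \le j \le j' \le d+1$] which only depend on the choice of $h$:

$$y_j = u_j \ \ [\text{for } 1 \le j \le d+1] \quad \text{and} \quad y_{j,j'} = u_j u_{j'} \ \ [\text{for } 1 \le j \le j' \le d+1].$$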

Now (5) can be rewritten as

$$\sum_{j=1}^{d+1} a_j y_j + \sum_{1 \le j \le j' \le d+1} a_{j,j'} y_{j,j'} + a_0 \le 0.$$

Since the and only depend on , , and , and the above equation holds for any and implied by an , then it describes a way to convert into a halfspace in where . Since the VC-dimension of the collection of all halfspaces in is , the VC dimension of is at most . ∎
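The lifting step in the proof can be made concrete with a short sketch. The coefficient formulas below are one consistent choice obtained by expanding the square in (5); they are an assumption for illustration, since the exact expressions are not spelled out above. The point shown is that the ball-membership test becomes linear in the lifted variables $u_j$ and $u_j u_{j'}$.

```python
def lift_coefficients(Q, v0, r):
    """Coefficients (a0, a, a2) of the linearized form of (5).
    Q: points in R^d; v0[i]: signed distance from Q[i] to the ball center h0;
    r: ball radius. Obtained by expanding (sum_j u_j x_ij + u_{d+1} - v0_i)^2."""
    n = len(Q)
    X = [list(q) + [1.0] for q in Q]     # append 1 so the offset u_{d+1} is a coordinate
    D = len(X[0])
    a0 = sum(v * v for v in v0) / n - r * r
    a = [-2.0 / n * sum(v0[i] * X[i][j] for i in range(n)) for j in range(D)]
    a2 = {(j, k): (1.0 if j == k else 2.0) / n * sum(X[i][j] * X[i][k] for i in range(n))
          for j in range(D) for k in range(j, D)}
    return a0, a, a2

def lifted_value(u, a0, a, a2):
    """Linear function of the lifted variables y_j = u_j and y_{j,j'} = u_j u_{j'}."""
    return (a0 + sum(aj * uj for aj, uj in zip(a, u))
            + sum(c * u[j] * u[k] for (j, k), c in a2.items()))

def direct_value(u, Q, v0, r):
    """Left side of (5) minus r^2, evaluated without lifting."""
    n = len(Q)
    return sum((sum(uj * xj for uj, xj in zip(u, q)) + u[-1] - v) ** 2
               for q, v in zip(Q, v0)) / n - r * r

Q = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 3.0)]
v0 = [0.0, 0.0, 1.0, 3.0]    # signed distances to h0: y = 0
u = (1.0, 0.0, -1.0)         # candidate hyperplane x = 1
a0, a, a2 = lift_coefficients(Q, v0, 2.0)
print(lifted_value(u, a0, a, a2), direct_value(u, Q, v0, 2.0))
```

A hyperplane lies in the metric ball exactly when this value is at most zero, which is a halfspace condition in the lifted coordinates.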

#### Remark.

This distance, metric property, and VC-dimension result extends naturally to operate between any objects, such as polynomial models of regression, which can be linearized to hyperplanes in .

## 4 Estimating dQ

In this section, we study how to efficiently compute $d_Q$ approximately when the data set $Q$ is very large. The basic idea is to use the sensitivity sampling method [23], and an online row sampling algorithm designed for leverage sampling [12].

### 4.1 Estimation of dQ by Sensitivity Sampling on Q

We need the following concept that describes the importance of objects from a data set.

#### Sensitivity score.

Suppose $f$ is drawn from a family $F$ of nonnegative real-valued functions on a metric space $X$, and $\mu$ is a probability measure on $X$, $\bar{f} = \int_X f(x)\, d\mu(x)$. The sensitivity [23] of $x \in X$ w.r.t. $(F, X, \mu)$ is defined as:

$$\sigma_{F,X,\mu}(x) := \sup_{f \in F} \frac{f(x)}{\bar{f}},$$

and the total sensitivity of $F$ is defined as: $\mathfrak{S}(F) = \int_X \sigma_{F,X,\mu}(x)\, d\mu(x)$. This concept is quite general, and has been widely used in applications ranging from various forms of clustering [14, 17] to dimensionality reduction [15] to shape-fitting [32].

Now, we can use sensitivity sampling to estimate with respect to a tuple . First suppose is full rank and . Then we can let and ; what remains is to define the appropriate . Roughly, is defined with respect to a -dimensional vector space , where for some , for each ; and is the set of all linear functions on .

We now define in more detail. Recall for each can be represented as a vector . This defines a function , and these functions are elements of . The vector space is however larger and defined

so that there can be for which ; rather it can more generally be in . Then the desired family of real-valued functions is defined

$$F = \{ f : Q \to [0,\infty) \mid \exists\, v \in V \text{ s.t. } f(x) = v(x)^2,\ \forall x \in Q \}.$$

To see how this can be applied to estimate , consider two hyperplanes in and the two unique vectors which represent them. Now introduce the vector ; note that , but not necessarily in . Now for define a function as

$$f_{h_1,h_2}(x) = f_{h_1,h_2}(x_1,\cdots,x_d) = \Big( \sum_{i=1}^{d} u_i x_i + u_{d+1} \Big)^2,$$

so . And thus an estimation of provides an estimation of . In particular, given the sensitivity score for each , we can invoke Lemma 2.1 in [23] to reach the following theorem.

{theorem}

Suppose and is full rank. Let be an iid random sample from of size with weight , according to the distribution where is the sensitivity of . Then, with probability at least , we have

$$(1-\varepsilon)\, d_Q(h_1,h_2) \le d_{\tilde{Q},W}(h_1,h_2) \le (1+\varepsilon)\, d_Q(h_1,h_2), \quad \forall\ h_1,h_2 \in \mathcal{H}.$$
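The sampling scheme behind this theorem can be sketched as follows. This is a generic importance-sampling sketch under stated assumptions, not the exact construction of the theorem: points are drawn with probability proportional to $p_i \sigma_i$, and each draw receives the standard inverse-probability weight so that the weighted sum is an unbiased estimate of $\sum_i p_i f(x_i)$. The sensitivities are taken as given inputs, and all function names are hypothetical.

```python
import math
import random

def signed_distance(u, x):
    """Signed distance from x in R^d to the hyperplane with normalized vector u."""
    return sum(uj * xj for uj, xj in zip(u, x)) + u[-1]

def d_weighted(u1, u2, points, weights):
    """Weighted distance between two hyperplanes over a weighted point set."""
    return math.sqrt(sum(w * (signed_distance(u1, q) - signed_distance(u2, q)) ** 2
                         for q, w in zip(points, weights)))

def sensitivity_sample(Q, p, sigma, N, rng):
    """Draw N iid points with probability proportional to p_i * sigma_i.
    Each sampled point gets weight p_i / (N * Pr[i]), the usual
    importance-sampling correction (an assumption of this sketch)."""
    mass = [pi * si for pi, si in zip(p, sigma)]
    total = sum(mass)
    idx = rng.choices(range(len(Q)), weights=mass, k=N)
    sample = [Q[i] for i in idx]
    weights = [p[i] * total / (N * mass[i]) for i in idx]
    return sample, weights

# Corners of a square with uniform measure; for this symmetric set every
# point has sensitivity 3 with respect to squared affine functions.
Q = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
p = [0.25] * 4
sigma = [3.0] * 4
h1, h2 = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)    # the lines x = 0 and y = 0

sample, weights = sensitivity_sample(Q, p, sigma, 400, random.Random(7))
print(d_weighted(h1, h2, Q, p), d_weighted(h1, h2, sample, weights))
```

The sample is drawn once and then reused for every pair of hyperplanes, which is what makes a uniform guarantee over all $h_1, h_2$ useful in practice.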

### 4.2 Sensitivity Computation and its Relationship with Leverage Score

The next step is to compute the sensitivity score for each . To this end we can invoke a theorem about vector norms by Langberg and Schulman [23]:

{lemma}

[Theorem 2.2 in  [23]] Suppose is a probability measure on a metric space , and is a real vector space of dimension . Let , and be an orthonormal basis for under the inner product , . Then, and .

We have already set and , and have defined and . To apply the above theorem we need to define an orthonormal basis for . A straightforward basis (although not necessarily an orthonormal one) exists as and for all , where is an indicator vector with all zeros except 1 in the th coordinate. That is, the th basis element is simply the th coordinate of the input. Since is full rank, is a basis of .

We are now ready to state our main theorem on computing sensitivity scores on a general , where we typically set .

{theorem}

Suppose is a probability measure on a metric space such that for all , is a real vector space of dimension with a basis , and . If we introduce a matrix whose th column is defined as: , then we have

$$\sigma_{F,Q,\mu}(x_i) \cdot p_i = a_i^T (A A^T)^{-1} a_i, \quad \forall\ x_i \in Q. \tag{6}$$

This almost directly follows from Lemma 4.2, however Lemma 4.2 requires an orthonormal basis, and we have only defined a basis, but not shown it is orthonormal. The proof, which we defer to Appendix B, shows that we can always orthonormalize the straightforward basis we provided using the Gram-Schmidt process, and the resulting sensitivity score is simply derived from the described matrix . The details are a bit tedious, but not unexpected.

This theorem not only shows how to compute the sensitivity of a point, but also gives the relationship between sensitivity and the leverage score.
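Relation (6) can be checked numerically with a small sketch. It assumes the straightforward basis (coordinates plus a constant) and takes the $i$th column of the matrix to be $a_i = \sqrt{p_i}\,(x_{i,1},\ldots,x_{i,d},1)^T$, which is one choice consistent with (6); the helper names are hypothetical, and a small Gaussian-elimination routine stands in for a linear algebra library.

```python
def solve(M, b):
    """Solve M z = b by Gauss-Jordan elimination with partial pivoting."""
    k = len(b)
    aug = [row[:] + [bi] for row, bi in zip(M, b)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(k):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

def sensitivities(Q, p):
    """sigma(x_i) = b_i^T M^{-1} b_i with b_i = (x_i, 1) and M = sum_j p_j b_j b_j^T.
    Multiplying by p_i gives the leverage score a_i^T (A A^T)^{-1} a_i of (6)."""
    B = [list(q) + [1.0] for q in Q]
    k = len(B[0])
    M = [[sum(pj * bj[s] * bj[t] for pj, bj in zip(p, B)) for t in range(k)]
         for s in range(k)]
    return [sum(x * z for x, z in zip(bi, solve(M, bi))) for bi in B]

Q = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 3.0), (-1.0, 2.0)]
p = [0.2] * 5
sigma = sensitivities(Q, p)
# The p-weighted sum of sensitivities should be close to d + 1 = 3 up to rounding.
print(sigma, sum(pi * si for pi, si in zip(p, sigma)))
```

The weighted total equals the dimension of the function space, so the expected sample size in sensitivity sampling scales with $d$ rather than with $n$.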

#### Leverage score.

Let $(\cdot)^{+}$ denote the Moore-Penrose pseudoinverse of a matrix, so $(AA^T)^{+} = (AA^T)^{-1}$ when $AA^T$ is full rank. The leverage score [12] of the $i$th column $a_i$ of matrix $A$ is defined as:

$$\tau_i(A) := a_i^T (A A^T)^{+} a_i.$$

This definition is more specific and linear-algebraic than sensitivity. However, Theorem 4.2 shows that the value $\sigma_{F,Q,\mu}(x_i) \cdot p_i$ is just the leverage score of the $i$th column of the matrix $A$. Compared to sensitivity, leverage scores have received more attention for scalable algorithm development and approximation [12, 3, 11, 6, 28, 7], which we exploit in Section 4.3.

### 4.3 Estimate the Distance by Online Row Sampling

If the dimensionality is too high and the number of points is too large to be stored and processed in memory, we can apply online row sampling [7] to estimate $d_Q$. Note that as more rows are witnessed, the leverage scores of older rows change. While other approaches (cf. [11, 6, 28]) can obtain similar (and maybe slightly stronger) bounds, they rely on more complex procedures to manage these updating scores. The following Algorithm 1 by Cohen et al. [7], on the other hand, simply samples columns as they come proportional to their estimated ridge leverage score [6]; thus it seems like the “right” approach.

According to the Theorem 3 in [7], Algorithm 1 returns a matrix , with high probability, such that , and the number of rows in is . (Recall means for every vector .)

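Given a set of points $Q = \{x_1,\ldots,x_n\} \subset \mathbb{R}^d$, where $x_i$ has the coordinates $(x_{i,1},\ldots,x_{i,d})$, we introduce an $n \times (d+1)$ matrix $A_Q$ whose $i$th row is defined as:

$$a_i = (x_{i,1},\cdots,x_{i,d},1).$$

Any two hyperplanes $h_1, h_2$ can be uniquely expressed by vectors $u^{(1)}, u^{(2)}$; defining $u_{h_1,h_2} = u^{(1)} - u^{(2)}$, we have $d_Q(h_1,h_2) = \frac{1}{\sqrt{n}} \| A_Q u_{h_1,h_2} \|$. So, if $n$ is very large we can apply Algorithm 1 to efficiently sample rows from $A_Q$, and use $A_{\tilde{Q}}$ to estimate $d_Q(h_1,h_2)$. From Theorem 3 in [7], we have the following result.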

{theorem}

Suppose , and matrix is defined above. Let be the matrix returned by Algorithm 1. Then, with probability at least , for any two hyperplanes expressed by , suppose , we have

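$$\frac{1}{1+\varepsilon} \Big( \frac{1}{n} \| A_{\tilde{Q}} u_{h_1,h_2} \|^2 - \frac{1}{n} \delta \| u_{h_1,h_2} \|^2 \Big)^{\frac{1}{2}} \le d_Q(h_1,h_2) \le \frac{1}{1-\varepsilon} \Big( \frac{1}{n} \| A_{\tilde{Q}} u_{h_1,h_2} \|^2 + \frac{1}{n} \delta \| u_{h_1,h_2} \|^2 \Big)^{\frac{1}{2}},$$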

where is the Euclidean norm, and with probability at least the number of rows in is .

To make the above bound hold with arbitrarily high probability, we can use the standard median trick: run Algorithm 1 times in parallel to obtain , then for any two hyperplanes , we take the median of .
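The matrix form used in this subsection, in which the distance is the scaled Euclidean norm of the stacked rows $(x_{i,1},\ldots,x_{i,d},1)$ applied to the difference of the two parameter vectors, can be checked against the pointwise definition (4) with a short sketch. The function names are hypothetical and the hyperplanes are assumed to be given by already normalized vectors.

```python
import math

def d_Q_direct(u1, u2, Q):
    """Definition (4): root mean squared difference of signed distances."""
    def v(u, x):
        return sum(a * b for a, b in zip(u, x)) + u[-1]
    return math.sqrt(sum((v(u1, x) - v(u2, x)) ** 2 for x in Q) / len(Q))

def d_Q_matrix(u1, u2, Q):
    """Matrix form: (1 / sqrt(n)) * || A u ||, where row i of A is (x_i, 1)
    and u is the difference of the two parameter vectors."""
    A = [list(x) + [1.0] for x in Q]
    u = [a - b for a, b in zip(u1, u2)]
    Au = [sum(r * c for r, c in zip(row, u)) for row in A]
    return math.sqrt(sum(t * t for t in Au) / len(Q))

Q = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 3.0)]
h1 = (0.6, 0.8, -1.0)
h2 = (1.0, 0.0, 0.5)
print(d_Q_direct(h1, h2, Q), d_Q_matrix(h1, h2, Q))
```

Because the distance depends on the data only through this matrix product, any row-sampling scheme that approximately preserves such norms yields an estimate of the distance for all pairs of hyperplanes at once.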

#### Remark.

Since , we have

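$$\| u_{h_1,h_2} \|^2 = \big( \| u^{(1)} - u^{(2)} \| \big)^2 \le \big( \| u^{(1)} \| + \| u^{(2)} \| \big)^2 \le 2 \big( \| u^{(1)} \|^2 + \| u^{(2)} \|^2 \big) = 2 \Big( 2 + \big(u^{(1)}_{d+1}\big)^2 + \big(u^{(2)}_{d+1}\big)^2 \Big) = 4 + 2 d^2(0,h_1) + 2 d^2(0,h_2),$$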

where is the distance from a choice of origin to . If we assume that any hyperplanes we consider must pass within a distance to the choice of origin, then let and . Now where is the set of points corresponding to rows in , and the weighting is defined so . Then the conclusion of Theorem 4.3 can be rewritten as

 11+\eps(d˜Q,W(h1,h