
# K-Plane Regression

Naresh Manwani and P. S. Sastry are with the Department of Electrical Engineering, Indian Institute of Science, Bangalore 560012, India. E-mail: {naresh, sastry}@ee.iisc.ernet.in.
###### Abstract

In this paper, we present a novel algorithm for piecewise linear regression which can learn continuous as well as discontinuous piecewise linear functions. The main idea is to repeatedly partition the data and learn a linear model in each partition. While a simple algorithm incorporating this idea does not work well, an interesting modification results in a good algorithm. The proposed algorithm is similar in spirit to the $K$-means clustering algorithm. We show that our algorithm can also be viewed as an EM algorithm for maximum likelihood estimation of parameters under a reasonable probability model. We empirically demonstrate the effectiveness of our approach by comparing its performance with state-of-the-art regression learning algorithms on some real-world datasets.

Piecewise Linear, Regression, Mixture Models, Expectation Maximization, Learning.

## I Introduction

In a regression problem, given the training dataset containing pairs of multi-dimensional feature vectors and corresponding real-valued target outputs, the task is to learn a function that captures the relationship between feature vectors and their corresponding target outputs.

Least square regression and support vector regression are well known and generic approaches for regression learning problems [1, 2, 3, 4, 5]. In the least squares approach, nonlinear regression functions can be learnt by using user-specified fixed nonlinear mapping of feature vectors from original space to some suitable high dimensional space though this could be computationally expensive. In support vector regression (SVR), kernel functions are used for nonlinear problems. Using a nonlinear kernel function, SVR implicitly transforms the examples to some high dimensional space and finds a linear regression function in the high dimensional space. SVR has a large margin flavor and has well studied performance guarantees. In general, SVR solution is not easily interpretable in the original feature space for nonlinear problems.

A different approach to learning a nonlinear regression function is to approximate the target function by a piecewise linear function. Piecewise linear approach for regression problems provides better understanding of the behavior of the regression surface in the original feature space as compared to the kernel-based approach of SVR. In piecewise linear approaches, the feature space is partitioned into disjoint regions and for every partition a linear regression function is learnt. The goal here is to simultaneously estimate the optimal partitions and linear model for each partition. This problem is hard and is computationally intractable [6].

The simplest piecewise linear function is either a convex or a concave piecewise linear function which is represented as a maximum or minimum of affine functions. A generic piecewise linear regression function can be represented as a sum of these convex/concave piecewise linear functions [7, 8, 9].

In this paper we present a novel method of learning piecewise linear regression functions. In contrast to all the existing methods, our approach is capable of learning discontinuous functions as well. We show, through empirical studies, that this algorithm is attractive in comparison to the SVR approach as well as the hinging hyperplanes method, which is among the best algorithms for learning piecewise linear functions.

Existing approaches for piecewise linear regression learning can be broadly classified into two classes. In the first set of approaches, one assumes a specific form for the function and estimates the parameters. The form of a regression function can be fixed by fixing the number of hyperplanes and the way these hyperplanes are combined to approximate the regression surface. In the second set of approaches, the form of the regression function is not fixed a priori.

In fixed structure approaches we search over a parametrized family of piecewise linear regression functions and the parameters are learnt by solving an optimization problem to, typically, minimize the sum of the squared errors. Some examples of such methods are mixture of experts and hierarchical mixture of experts [10, 11, 12] models.

In the set of approaches where no fixed structure is assumed, the regression tree [13, 14] is the most widely used method. A regression tree is built by binary or multivariate recursive partitioning in a greedy fashion. Regression trees split the feature space at every node in such a way that fitting a linear regression function to each child node will minimize the sum of squared errors. This splitting or partitioning is then applied to each of the child nodes. The process continues until the number of data points at a node reaches a user-specified minimum size or the error becomes smaller than some tolerance limit. In contrast to decision trees, where leaf nodes are assigned class labels, leaf nodes in regression trees are associated with linear regression models. Most of the algorithms for learning regression trees are greedy in nature. At any node of the tree, once a hyperplane is learnt to split the feature space, it cannot be altered by any of its child nodes. The greedy nature of the method can result in convergence to a suboptimal solution.

A more refined regression tree approach is the hinging hyperplanes method [7, 15], which overcomes several drawbacks of the regression tree approach. A hinge function is defined as the maximum or minimum of two affine functions [7]. In the hinging hyperplanes approach, the regression function is approximated as a sum of hinge functions, where the number of hinge functions is not fixed a priori. The algorithm starts by fitting a hinge function on the training data using the hinge-finding algorithm [7]. Then, the residual error is calculated for every example and, based on this, a new hinge function may be added to the model (unless we reach the maximum allowed number of hinges). Every time a new hinge function is added, its parameters are found by fitting the residual error. This algorithm overcomes the greedy nature of the regression tree approach by providing a mechanism for re-estimating the parameters of each of the earlier hinge functions whenever a new hinge is added. Overall, the hinging hyperplanes algorithm tries to learn an optimal regression tree, given the training data.

A different greedy approach for piecewise linear regression learning is the bounded error approach [16, 17, 18]. In bounded error approaches, for a given bound $\epsilon$ on the tolerable error, the goal is to learn a piecewise linear regression function such that, for every point in the training set, the absolute difference between the target value and the predicted value is less than $\epsilon$. This property is called the bounded error property. Greedy heuristic algorithms [17, 18] have been proposed to find such a piecewise linear function. These algorithms start by finding a linear regression function which satisfies the bounded error property for as many points in the training set as possible. This problem is known as the maximum feasible subsystem (MAX-FS) problem and is shown to be NP-hard [16]. The MAX-FS problem is solved repeatedly on the remaining points until all points are exhausted. So far, there are no theoretical results to support the quality of the solutions found by these heuristic approaches.

Most of the existing approaches for learning regression functions find a continuous approximation for the regression surface even if the actual surface is discontinuous. In this paper, we present a piecewise linear regression algorithm which is able to learn both continuous as well as discontinuous functions.

We start with a simple algorithm that is similar, in spirit, to the $K$-means clustering algorithm. The idea is to repeatedly partition the training data and learn a hyperplane for each partition. In each such iteration, after learning the hyperplanes, we repartition the feature vectors so that all feature vectors in a partition have least prediction error with the hyperplane of that partition. We call this the $K$-plane regression algorithm. Though we are not aware of any literature where such a method is explicitly proposed and investigated for learning regression functions, similar ideas have been proposed in related contexts. For example, a similar problem is addressed in the system identification literature [16]. A probabilistic version of such an idea was discussed under the title of mixtures of linear regression models [1, Chapter 14].

This $K$-plane regression algorithm is attractive because it is conceptually very simple. However, it suffers from some serious drawbacks: convergence to non-optimal solutions, sensitivity to additive noise, and the lack of a model function. We discuss these issues and, based on this insight, propose a new, modified $K$-plane regression algorithm. In the modified algorithm also, we repeatedly partition the data and learn a linear model for each partition. However, we try to separately and simultaneously learn the centers of the partitions and the corresponding linear models. Through empirical studies we show that this algorithm is very effective for learning piecewise linear regression surfaces and that it compares favorably with other state-of-the-art regression function learning methods.

The rest of the paper is organized as follows. In Section II we discuss the $K$-plane regression algorithm, its drawbacks and the possible reasons behind them. We then propose the modified $K$-plane regression algorithm in Section III. We also show that the modified $K$-plane regression algorithm monotonically decreases the error function after every iteration. In Section IV we show the equivalence of our algorithm with an EM algorithm in a limiting case. Experimental results are given in Section V. We conclude the paper in Section VI.

## II K-Plane Regression

We begin by defining a $K$-piecewise affine function. We use the notation that a hyperplane in $\mathbb{R}^d$ is parametrized by $\tilde{w} = [w^T \; b]^T \in \mathbb{R}^{d+1}$, where $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$.

###### Definition 1

A function $f : \mathbb{R}^d \to \mathbb{R}$ is called $K$-piecewise affine if there exists a set of $K$ hyperplanes with parameters $\tilde{w}_k$, $k = 1, \ldots, K$, and sets $S_k \subset \mathbb{R}^d$, $k = 1, \ldots, K$ (which form a partition of $\mathbb{R}^d$), such that $f(x) = \tilde{w}_k^T \tilde{x}$ for all $x \in S_k$, where $\tilde{x} = [x^T \; 1]^T$.

From the definition above, it is clear that $f$ is affine on each of the sets $S_k$. Also, note that a $K$-piecewise affine function may be discontinuous.

### K-Plane Regression

Let $\mathcal{D} = \{(x_n, y_n) : n = 1, \ldots, N\}$ be the training dataset, where $x_n \in \mathbb{R}^d$ and $y_n \in \mathbb{R}$. Let $\tilde{x}_n = [x_n^T \; 1]^T$. The $K$-plane regression approach tries to find a pre-fixed number of hyperplanes such that each point in the training set is close to one of the hyperplanes. Let $K$ be the number of hyperplanes and let $\tilde{w}_k$, $k = 1, \ldots, K$, be the parameters of the hyperplanes. $K$-plane regression minimizes the following error function.

$$E(\Theta) = \sum_{n=1}^{N} \min_{k \in \{1, \ldots, K\}} \left( \tilde{w}_k^T \tilde{x}_n - y_n \right)^2$$

where $\Theta = \{\tilde{w}_1, \ldots, \tilde{w}_K\}$. Given the parameters of the hyperplanes, $\Theta$, define sets $S_k$, $k = 1, \ldots, K$, as $S_k := \{x_n \mid k = \arg\min_j (\tilde{w}_j^T \tilde{x}_n - y_n)^2\}$, where we break ties by putting $x_n$ in the set with the least index $k$. The sets $S_k$ are disjoint. We can now write $E(\Theta)$ as

$$E(\Theta) = \sum_{k=1}^{K} \sum_{x_n \in S_k} \left( \tilde{w}_k^T \tilde{x}_n - y_n \right)^2 \tag{1}$$

If we fix all the sets $S_k$, then each $\tilde{w}_k$ can be found by minimizing (over $\tilde{w}_k$) $\sum_{x_n \in S_k} (\tilde{w}_k^T \tilde{x}_n - y_n)^2$. However, in $E(\Theta)$ as defined in equation (1), the sets $S_k$ are themselves functions of the parameter set $\Theta$.

To find $\Theta$ which minimizes $E(\Theta)$ in (1), we can have an EM-like algorithm as follows. Let, after the $c^{\text{th}}$ iteration, the parameter set be $\Theta^c$. Keeping $\Theta^c$ fixed, we first find the sets $S_k^c$. Now we keep these sets $S_k^c$, $k = 1, \ldots, K$, fixed. Thus the error function becomes

$$E^c(\Theta) = \sum_{k=1}^{K} E_k^c(\tilde{w}_k)$$

where the superscript $c$ denotes the iteration and hence emphasizes the fact that the error function is evaluated by fixing the sets $S_k^c$, $k = 1, \ldots, K$, and

$$E_k^c(\tilde{w}_k) = \sum_{x_n \in S_k^c} \left( \tilde{w}_k^T \tilde{x}_n - y_n \right)^2. \tag{2}$$

Thus, minimizing $E^c(\Theta)$ with respect to $\Theta$ boils down to minimizing each of the $E_k^c$ with respect to $\tilde{w}_k$. For every $k$, a new weight vector $\tilde{w}_k^{c+1}$ is found using the standard linear least squares solution as follows.

$$\tilde{w}_k^{c+1} = \arg\min_{\tilde{w}_k} \sum_{x_n \in S_k^c} \left( \tilde{w}_k^T \tilde{x}_n - y_n \right)^2 = \left[ \sum_{x_n \in S_k^c} \tilde{x}_n \tilde{x}_n^T \right]^{-1} \left[ \sum_{x_n \in S_k^c} y_n \tilde{x}_n \right] \tag{3}$$

Now we fix $\tilde{w}_k^{c+1}$, $k = 1, \ldots, K$, find new sets $S_k^{c+1}$, and so on. We can now summarize the $K$-plane regression algorithm. We first find sets $S_k^c$, $k = 1, \ldots, K$, for the $c^{\text{th}}$ iteration (using $\Theta^c$). Then for each $k$, we find $\tilde{w}_k^{c+1}$ (as in equation (3)) by minimizing $E_k^c(\tilde{w}_k)$, which is defined in equation (2). We keep repeating these two steps until there is no significant decrement in the error function $E$. $E$ does not change when the weight vectors do not change or the sets $S_k$, $k = 1, \ldots, K$, do not change. The complete $K$-plane regression approach is described more formally in Algorithm 1.
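The two alternating steps above can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the authors' implementation; the function name `k_plane_regression`, the random initialization of the hyperplanes, and the stopping rule (stop when the partition stabilizes) are our own choices.

```python
import numpy as np

def k_plane_regression(X, y, K, n_iters=100, seed=0):
    """Simple K-plane regression: alternately (i) assign each point to the
    hyperplane with least squared prediction error and (ii) refit each
    hyperplane by least squares on its partition (eq. (3))."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    Xt = np.hstack([X, np.ones((N, 1))])      # x_tilde = [x^T, 1]^T
    W = rng.normal(size=(K, d + 1))           # rows are w_tilde_k
    labels = None
    for _ in range(n_iters):
        # partition step: argmin breaks ties toward the smallest index k
        errs = (Xt @ W.T - y[:, None]) ** 2   # shape (N, K)
        new_labels = errs.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                             # partition unchanged: converged
        labels = new_labels
        # refit step: least squares per non-empty partition
        for k in range(K):
            idx = labels == k
            if idx.any():
                W[k], *_ = np.linalg.lstsq(Xt[idx], y[idx], rcond=None)
    return W, labels
```

With $K = 1$ this reduces to ordinary least squares on the whole dataset; with $K > 1$ the result depends on the initialization, which is exactly the convergence issue discussed in Section II-A.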

### II-A Issues with K-Plane Regression

In spite of its simplicity and easy updates, the $K$-plane regression algorithm has some serious drawbacks in terms of convergence and model issues.

#### 1. Convergence to Non-optimal Solutions

It is observed that the algorithm has a serious problem of convergence to non-optimal solutions. Even when the data is generated from a piecewise linear function, the algorithm often fails to learn the structure of the target function.

Figure 1(a) shows points sampled from a concave (triangle-shaped) 2-piecewise affine function on the real line. On the horizontal axis, circles represent set $S_1$ and squares represent set $S_2$, where $S_1$ and $S_2$ constitute the correct partitioning of the training set in this problem. We see that the convex hulls of sets $S_1$ and $S_2$ are disjoint. This 2-piecewise affine function can be written (as per Definition 1) by choosing the partition sets to be the convex hulls of $S_1$ and $S_2$.

Figure 1(b) shows the 2-piecewise linear function learnt using the $K$-plane regression approach for a particular initialization. $S_1$ (represented as circles) and $S_2$ (represented as squares) are the sets corresponding to the two lines in the learnt function. Here, the $K$-plane regression algorithm completely misses the shape of the target function. We also see that the convex hulls of sets $S_1$ and $S_2$ intersect with each other.

#### 2. Sensitivity to Noise

It has been observed in practice that the simple $K$-plane regression algorithm is very sensitive to additive noise in the target values in the training set. With noisy examples, the algorithm performs badly. We illustrate this later in Section V.

#### 3. Lack of Model Function

The output of the $K$-plane regression algorithm is a set of $K$ hyperplanes. But the algorithm does not provide a way to use these hyperplanes to predict the value for a given test point. In other words, the $K$-plane regression algorithm does not have any model function for prediction. We expand on this issue in the next section.

## III Modified K-Plane Regression

As we have mentioned, given the training data $\mathcal{D}$, the $K$-plane regression algorithm outputs $K$ hyperplanes, $\tilde{w}_1, \ldots, \tilde{w}_K$. To convert this into a proper $K$-piecewise linear model on $\mathbb{R}^d$, we also need a $K$-partition of $\mathbb{R}^d$ such that in the $k^{\text{th}}$ partition the appropriate model to use would be $\tilde{w}_k$. We could attempt to get such a partition of $\mathbb{R}^d$ by considering the convex hulls of the sets $S_k$ (where $S_k$ is the set of training points assigned to the $k^{\text{th}}$ hyperplane). However, as we saw, under $K$-plane regression the convex hulls of such $S_k$ need not be disjoint. Hence another method to get the required partition is as follows. Let $\mu_k$ be the mean or centroid of $S_k$. Then, for any point $x$, our prediction could be $\tilde{w}_k^T \tilde{x}$, where $k$ is such that $\|x - \mu_k\| \le \|x - \mu_j\|, \; \forall j$ (break ties arbitrarily). This would define a proper model function with the hyperplanes obtained through $K$-plane regression. However, this may not give good performance. Often, the convex hulls of the sets $S_k$ (learnt using $K$-plane regression) have non-null intersection, because each of these sets may contain points from different disjoint regions of $\mathbb{R}^d$ (for example, see Figure 1). In such cases, if we re-partition the training data using distances to the different $\mu_k$, we may get sets much different from the $S_k$, and hence our final prediction even on training data may have large error. The main reason for this problem with $K$-plane regression is that the algorithm is not really bothered about the geometry of the sets $S_k$; it only focuses on $\tilde{w}_k$ being a good fit for the points in set $S_k$. Moreover, in situations where the same affine function works for two or more disjoint clusters, $K$-plane regression will consider them a single cluster, as its objective function does not enforce that points in the same cluster should be close to each other. As a result, the clusters learnt using the $K$-plane regression approach can have overlapping convex hulls, and sometimes even their means may be very close to each other. This may create problems during prediction: if we use the hyperplane whose corresponding cluster mean is closest to a point, we may not pick the correct hyperplane. This identification problem of the $K$-plane regression approach results in poor performance.
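The nearest-centroid prediction rule described above can be written as a short sketch. This is our own illustrative code, assuming the hyperplane parameters `W` (one $\tilde{w}_k$ per row) and the cluster centers have already been obtained by some means.

```python
import numpy as np

def predict(X, W, centers):
    """Piecewise linear prediction: for each point, use the hyperplane
    whose cluster center is nearest (argmin breaks ties toward the
    smallest index)."""
    Xt = np.hstack([X, np.ones((len(X), 1))])           # x_tilde = [x^T, 1]^T
    # squared distance of every point to every center, shape (N, K)
    dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    k = dist2.argmin(axis=1)                            # nearest center per point
    return np.einsum('nd,nd->n', Xt, W[k])              # w_tilde_k^T x_tilde
```

For example, with the two lines $y = x$ and $y = -x$ on the real line and centers at $0.5$ and $-0.5$, a point at $x = 1$ is routed to the first line and a point at $x = -1$ to the second.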

Motivated by this, we modify $K$-plane regression as follows. We want to simultaneously estimate $\tilde{w}_k$ and $\mu_k$, $k = 1, \ldots, K$, such that if $\tilde{w}_k$ is a good fit for the $k^{\text{th}}$ partition, all the points in the $k^{\text{th}}$ partition should be closer to $\mu_k$ than to any other $\mu_j$. Intuitively, we can think of $\mu_k$ as the center of the (cluster or) set of points $S_k$. However, as we saw from our earlier example, if we simply make $\mu_k$ the centroid of the final $S_k$ learnt, all the earlier problems still remain. Hence, in the modified $K$-plane regression, we try to independently learn both $\tilde{w}_k$ and $\mu_k$ from the data. To do that, we add an extra term to the objective function of the $K$-plane regression approach which tries to ensure that all the points of the same cluster are close together.

As earlier, let the number of hyperplanes be $K$. In the modified $K$-plane regression, we have to learn $2K$ parameter vectors. Corresponding to the $k^{\text{th}}$ partition, we have two parameter vectors, $\tilde{w}_k$ and $\mu_k$: $\tilde{w}_k$ represents the parameter vector of the hyperplane associated with the $k^{\text{th}}$ partition, and $\mu_k$ represents the center of the $k^{\text{th}}$ partition. Note that we want to simultaneously learn both $\tilde{w}_k$ and $\mu_k$ for every partition.

The error function minimized by the modified $K$-plane regression algorithm is

$$E(\Theta) = \sum_{n=1}^{N} \min_{k \in \{1, \ldots, K\}} \left[ \left( \tilde{w}_k^T \tilde{x}_n - y_n \right)^2 + \gamma \| x_n - \mu_k \|^2 \right] \tag{4}$$

where $\Theta = \{(\tilde{w}_k, \mu_k) : k = 1, \ldots, K\}$ and $\gamma > 0$ is a user-defined parameter which decides the relative weight of the two terms.

Given $\Theta$, we define sets $S_k$, $k = 1, \ldots, K$, as

$$S_k := \left\{ x_n \;\middle|\; k = \arg\min_j \left[ \left( \tilde{w}_j^T \tilde{x}_n - y_n \right)^2 + \gamma \| x_n - \mu_j \|^2 \right] \right\}$$

where we break ties by putting $x_n$ in the set with the least index $k$. The sets $S_k$ are disjoint. We can now write $E(\Theta)$ as

$$E(\Theta) = \sum_{k=1}^{K} \sum_{x_n \in S_k} \left[ \left( \tilde{w}_k^T \tilde{x}_n - y_n \right)^2 + \gamma \| x_n - \mu_k \|^2 \right] \tag{5}$$

As can be seen from the above, now, for a data point $x_n$ to be put in $S_k$, we not only need $\tilde{w}_k^T \tilde{x}_n$ to be close to $y_n$ as earlier, but we also need $x_n$ to be close to $\mu_k$, the 'current center' of $S_k$. The motivation is that, under such a partitioning strategy, each $S_k$ would contain only points that are close to each other. As we shall see later through simulations, this modification ensures that the algorithm performs well. As an example of where this modification is important, consider learning a piecewise linear model which is given by the same affine function in two (or more) disjoint regions of the feature space. For any splitting of the examples from these two regions into two parts, there will be a good linear model that fits each of the two parts. Hence, in the $K$-plane regression method, the error function (cf. eq. (1)) would be the same for any splitting of the examples from these two regions, which means we would not learn a good model. However, the modified $K$-plane regression approach will not treat all such splits as the same, because of the term involving $\mu_k$. This helps us learn a proper piecewise linear regression function. We illustrate this in Section V.
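This point admits a tiny numeric check (our own illustration, with hypothetical 1-D data lying on the single line $y = x$ in two well-separated clusters): both a spatially coherent split and a mixed split give zero squared residuals, so the $K$-plane cost (1) cannot distinguish them, but the centroid term in (5) clearly prefers the coherent split.

```python
import numpy as np

def modified_cost(parts, gamma=1.0):
    """Cost (5) for 1-D data lying on the line y = x, assuming each part
    gets its own least-squares line and its own centroid."""
    total = 0.0
    for xs in parts:
        xs = np.asarray(xs, dtype=float)
        ys = xs                                       # targets lie on y = x
        A = np.stack([xs, np.ones_like(xs)], axis=1)  # [x, 1] design matrix
        w, *_ = np.linalg.lstsq(A, ys, rcond=None)    # best affine fit
        total += ((A @ w - ys) ** 2).sum()            # squared residuals
        mu = xs.mean()
        total += gamma * ((xs - mu) ** 2).sum()       # centroid (K-means) term
    return total

coherent = [[0.0, 0.1], [2.0, 2.1]]   # compact clusters
mixed    = [[0.0, 2.0], [0.1, 2.1]]   # also zero residuals, but spread out
# modified_cost(coherent) ≈ 0.01 while modified_cost(mixed) ≈ 4.0 (gamma = 1),
# even though the pure K-plane cost (residuals only) is ≈ 0 for both splits.
```

Only the centroid term separates the two splits; this is exactly the role of $\gamma \| x_n - \mu_k \|^2$ in (5).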

Now consider finding $\Theta$ to minimize $E(\Theta)$ given by equation (5). If we fix all the sets $S_k$, then $\tilde{w}_k$ and $\mu_k$ can be found by minimizing (over $\tilde{w}_k$ and $\mu_k$) the inner sum in (5). However, in $E(\Theta)$ as defined in equation (5), the sets $S_k$ are themselves functions of the parameter set $\Theta$.

To find $\Theta$ which minimizes $E(\Theta)$ in (5), we can, once again, have an EM-like algorithm as follows. As earlier, let the parameter set after the $c^{\text{th}}$ iteration be $\Theta^c$. Keeping $\Theta^c$ fixed, we find the sets $S_k^c$, $k = 1, \ldots, K$, as follows

$$S_k^c = \left\{ x_n \;\middle|\; k = \arg\min_j \left[ \left( \tilde{x}_n^T \tilde{w}_j^c - y_n \right)^2 + \gamma \| x_n - \mu_j^c \|^2 \right] \right\} \tag{6}$$

Now we keep these sets fixed. Thus the error function becomes

$$E^c(\Theta) = \sum_{k=1}^{K} \sum_{x_n \in S_k^c} \left[ \left( \tilde{w}_k^T \tilde{x}_n - y_n \right)^2 + \gamma \| x_n - \mu_k \|^2 \right] = \sum_{k=1}^{K} E^c(\tilde{w}_k, \mu_k)$$

where the superscript $c$ denotes the iteration and emphasizes the fact that the error function is evaluated by fixing the sets $S_k^c$. Thus minimizing $E^c(\Theta)$ with respect to $\Theta$ boils down to minimizing each $E^c(\tilde{w}_k, \mu_k)$ with respect to $(\tilde{w}_k, \mu_k)$. Each $E^c(\tilde{w}_k, \mu_k)$ is composed of two terms. The first term depends only on $\tilde{w}_k$ and is the usual sum of squared errors. The second term depends only on $\mu_k$ and is the usual cost function of $K$-means clustering. Thus, the update equations for finding $\tilde{w}_k^{c+1}$ and $\mu_k^{c+1}$, $k = 1, \ldots, K$, are

$$\tilde{w}_k^{c+1} = \arg\min_{\tilde{w}_k} \sum_{j=1}^{K} E^c(\tilde{w}_j, \mu_j) = \arg\min_{\tilde{w}} \sum_{x_n \in S_k^c} \left( \tilde{w}^T \tilde{x}_n - y_n \right)^2 = \left[ \sum_{x_n \in S_k^c} \tilde{x}_n \tilde{x}_n^T \right]^{-1} \left[ \sum_{x_n \in S_k^c} y_n \tilde{x}_n \right] \tag{7}$$

$$\mu_k^{c+1} = \arg\min_{\mu_k} \sum_{j=1}^{K} E^c(\tilde{w}_j, \mu_j) = \arg\min_{\mu} \sum_{x_n \in S_k^c} \| x_n - \mu \|^2 = \frac{1}{|S_k^c|} \sum_{x_n \in S_k^c} x_n \tag{8}$$

Once we compute $\Theta^{c+1}$, we find new sets $S_k^{c+1}$, $k = 1, \ldots, K$, and so on.

In summary, the modified $K$-plane regression algorithm works as follows. We first find sets $S_k^c$, $k = 1, \ldots, K$, for the $c^{\text{th}}$ iteration (using $\Theta^c$) as given by eq. (6). Then for each $k$, we find $\tilde{w}_k^{c+1}$ and $\mu_k^{c+1}$ (as in equations (7) and (8)) by minimizing $E^c(\tilde{w}_k, \mu_k)$. We keep repeating these two steps until there is no significant decrement in the error function $E$. The complete description of the modified $K$-plane regression approach is given in Algorithm 2.
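The modified loop, combining eqs. (6), (7) and (8), can be sketched as follows. Again this is a minimal sketch rather than the authors' Algorithm 2; the function name, the K-means-style initialization of the centers, and the stopping rule are our own assumptions.

```python
import numpy as np

def modified_k_plane(X, y, K, gamma=1.0, n_iters=100, seed=0):
    """Modified K-plane regression: assign each point by the combined
    criterion (squared prediction error + gamma * squared distance to
    center), eq. (6); then refit each hyperplane by least squares,
    eq. (7), and each center as the partition mean, eq. (8)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    Xt = np.hstack([X, np.ones((N, 1))])                  # x_tilde = [x^T, 1]^T
    W = rng.normal(size=(K, d + 1))                       # hyperplane parameters
    centers = X[rng.choice(N, size=K, replace=False)]     # init centers on data
    labels = None
    for _ in range(n_iters):
        pred_err = (Xt @ W.T - y[:, None]) ** 2           # (N, K)
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = (pred_err + gamma * dist2).argmin(axis=1)   # eq. (6)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                         # partition unchanged
        labels = new_labels
        for k in range(K):
            idx = labels == k
            if idx.any():
                W[k], *_ = np.linalg.lstsq(Xt[idx], y[idx], rcond=None)  # eq. (7)
                centers[k] = X[idx].mean(axis=0)                         # eq. (8)
    return W, centers, labels
```

Note that the two refit steps decouple exactly as in the text: the least-squares fit touches only $\tilde{w}_k$, and the centroid update touches only $\mu_k$.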

### Monotone Error Decrement Property

Now we will show that the modified $K$-plane regression algorithm monotonically decreases the error function defined by equation (4). Note that this does not necessarily mean that we find the global minimum of the error function. More importantly, we cannot claim that minimizing the error as defined would lead to learning of a good piecewise linear model. We note here that the simple $K$-plane regression algorithm also results in a monotonic decrease of the error as defined for that algorithm, even though it may not learn good models. However, the fact that the algorithm continuously decreases the error at each iteration is an important property for a learning algorithm.

###### Theorem 1

Algorithm 2 monotonically decreases the cost function given by equation (4) after every iteration.

Proof:

We have

$$E^c(\Theta^c) = \sum_{k=1}^{K} \sum_{x_n \in S_k^c} \left[ \left( \tilde{x}_n^T \tilde{w}_k^c - y_n \right)^2 + \gamma \| x_n - \mu_k^c \|^2 \right]$$

Given the sets $S_k^c$, the parameters $\tilde{w}_k^{c+1}$ and $\mu_k^{c+1}$, $k = 1, \ldots, K$, are found using equations (7) and (8), in the following way.

$$\tilde{w}_k^{c+1} = \arg\min_{\tilde{w}_k} \sum_{x_n \in S_k^c} \left( \tilde{w}_k^T \tilde{x}_n - y_n \right)^2 \qquad \mu_k^{c+1} = \arg\min_{\mu_k} \sum_{x_n \in S_k^c} \| x_n - \mu_k \|^2$$

Thus, we have

$$\sum_{x_n \in S_k^c} \left( \tilde{x}_n^T \tilde{w}_k^c - y_n \right)^2 \ge \sum_{x_n \in S_k^c} \left( \tilde{x}_n^T \tilde{w}_k^{c+1} - y_n \right)^2, \quad k = 1, \ldots, K$$
$$\sum_{x_n \in S_k^c} \| x_n - \mu_k^c \|^2 \ge \sum_{x_n \in S_k^c} \| x_n - \mu_k^{c+1} \|^2, \quad k = 1, \ldots, K$$

This will further give us

$$\sum_{k=1}^{K} \sum_{x_n \in S_k^c} \left[ \left( \tilde{x}_n^T \tilde{w}_k^c - y_n \right)^2 + \gamma \| x_n - \mu_k^c \|^2 \right] \ge \sum_{k=1}^{K} \sum_{x_n \in S_k^c} \left[ \left( \tilde{x}_n^T \tilde{w}_k^{c+1} - y_n \right)^2 + \gamma \| x_n - \mu_k^{c+1} \|^2 \right] \tag{9}$$
$$\Rightarrow \; E^c(\Theta^c) \ge E^c(\Theta^{c+1})$$

Given $\Theta^{c+1}$, the sets $S_k^{c+1}$, $k = 1, \ldots, K$, are found as follows

$$S_k^{c+1} = \left\{ x_n \;\middle|\; k = \arg\min_j \left[ \left( \tilde{x}_n^T \tilde{w}_j^{c+1} - y_n \right)^2 + \gamma \| x_n - \mu_j^{c+1} \|^2 \right] \right\}$$

Using $S_k^{c+1}$, $k = 1, \ldots, K$, we can find $E^{c+1}(\Theta^{c+1})$, which is

$$E^{c+1}(\Theta^{c+1}) = \sum_{k=1}^{K} \sum_{x_n \in S_k^{c+1}} \left[ \left( \tilde{x}_n^T \tilde{w}_k^{c+1} - y_n \right)^2 + \gamma \| x_n - \mu_k^{c+1} \|^2 \right] = \sum_{j,k=1}^{K} \sum_{x_n \in S_j^c \cap S_k^{c+1}} \left[ \left( \tilde{x}_n^T \tilde{w}_k^{c+1} - y_n \right)^2 + \gamma \| x_n - \mu_k^{c+1} \|^2 \right]$$

By the definition of the sets $S_k^{c+1}$, $k = 1, \ldots, K$, we have, for every $x_n \in S_k^{c+1}$,

$$\left( \tilde{x}_n^T \tilde{w}_k^{c+1} - y_n \right)^2 + \gamma \| x_n - \mu_k^{c+1} \|^2 \le \left( \tilde{x}_n^T \tilde{w}_j^{c+1} - y_n \right)^2 + \gamma \| x_n - \mu_j^{c+1} \|^2, \quad \forall j \ne k$$

which is, in particular, true for any $x_n \in S_j^c \cap S_k^{c+1}$. Thus

$$E^{c+1}(\Theta^{c+1}) \le \sum_{j,k=1}^{K} \sum_{x_n \in S_j^c \cap S_k^{c+1}} \left[ \left( \tilde{x}_n^T \tilde{w}_j^{c+1} - y_n \right)^2 + \gamma \| x_n - \mu_j^{c+1} \|^2 \right] = \sum_{j=1}^{K} \sum_{x_n \in S_j^c} \left[ \left( \tilde{x}_n^T \tilde{w}_j^{c+1} - y_n \right)^2 + \gamma \| x_n - \mu_j^{c+1} \|^2 \right] = E^c(\Theta^{c+1}) \tag{10}$$

Combining (9) and (10), we get $E^{c+1}(\Theta^{c+1}) \le E^c(\Theta^c)$, which means that after one complete iteration the modified $K$-plane regression algorithm decreases the error function.

## IV EM View of Modified K-Plane Regression Algorithm

Here, we show that the modified $K$-plane regression algorithm presented in Section III can be viewed as a limiting case of an EM algorithm. In the general $K$-plane regression idea, the difficulty is due to the following credit assignment problem: when we decompose the problem into sub-problems, we do not know which example $(x_n, y_n)$ should be considered in which sub-problem. We can view this as the missing information in the EM formulation.

Recall that $\mathcal{D} = \{(x_n, y_n) : n = 1, \ldots, N\}$ is the training data set. In the EM framework, $\mathcal{D}$ can be thought of as the incomplete data. The missing data would be $z_n = [z_{n1}, \ldots, z_{nK}]^T$, $n = 1, \ldots, N$, where $z_{nk} \in \{0, 1\}$, such that $z_{nk} = 1$ if and only if the $n^{\text{th}}$ example belongs to the $k^{\text{th}}$ partition.

This gives us the following probability model

$$P(x_n, y_n \mid z_{nk} = 1, \Theta) = p(x_n, y_n \mid \tilde{w}_k, \mu_k) = p(x_n \mid \mu_k)\, p(y_n \mid x_n, \tilde{w}_k)$$
$$P(x_n, y_n \mid z_n, \Theta) = \sum_{k=1}^{K} z_{nk}\, p(x_n \mid \mu_k)\, p(y_n \mid x_n, \tilde{w}_k)$$

In our formulation, $\mu_k$ represents the center of the set of all $x_n$ for which the $k^{\text{th}}$ linear model is appropriate. Hence we take $p(x_n \mid \mu_k) = \mathcal{N}(\mu_k, \frac{\epsilon}{\gamma} I)$, a multivariate Gaussian in which the covariance matrix is given by $\frac{\epsilon}{\gamma} I$, where $\epsilon$ is a variance parameter and $I$ is the identity matrix. This covariance matrix is common for all components. We assume that the target values given in the training set may be corrupted with zero-mean Gaussian noise. Thus, for the $k^{\text{th}}$ component, the target value is assigned using $\tilde{w}_k$ as $y_n = \tilde{w}_k^T \tilde{x}_n + \xi$, where $\xi$ is Gaussian noise with mean 0 and variance $\epsilon$. The variance is kept the same for all components. Thus, $p(y_n \mid x_n, \tilde{w}_k) = \mathcal{N}(\tilde{w}_k^T \tilde{x}_n, \epsilon)$, a Gaussian with mean $\tilde{w}_k^T \tilde{x}_n$ and variance $\epsilon$. Thus,

$$p(x_n, y_n \mid \tilde{w}_k, \mu_k) = \mathcal{N}\!\left(\mu_k, \tfrac{\epsilon}{\gamma} I\right) \mathcal{N}\!\left(\tilde{w}_k^T \tilde{x}_n, \epsilon\right) = \frac{1}{L} \exp\left( -\frac{1}{2\epsilon} \left[ \left( y_n - \tilde{w}_k^T \tilde{x}_n \right)^2 + \gamma \| x_n - \mu_k \|^2 \right] \right)$$

where $L$ is the normalizing constant. Note that $\epsilon$ and $\gamma$ are assumed to be fixed constants, rather than parameters to be re-estimated. Thus, the density model for the incomplete data becomes

$$p(x_n, y_n \mid \Theta) = \sum_{k=1}^{K} \frac{I_{\{k = a_n\}}}{L} \exp\left( -\frac{ \left( y_n - \tilde{w}_k^T \tilde{x}_n \right)^2 + \gamma \| x_n - \mu_k \|^2 }{2\epsilon} \right) = \frac{1}{L} \exp\left( -\frac{1}{2\epsilon} \min_k \left[ \left( y_n - \tilde{w}_k^T \tilde{x}_n \right)^2 + \gamma \| x_n - \mu_k \|^2 \right] \right) \tag{11}$$

where $I_{\{\cdot\}}$ is the indicator function and $a_n = \arg\min_k \left[ \left( y_n - \tilde{w}_k^T \tilde{x}_n \right)^2 + \gamma \| x_n - \mu_k \|^2 \right]$.

The negative of the log-likelihood under the model given in equation (11) is the same as the error function minimized by the modified $K$-plane regression algorithm. Hence, we can now derive the EM iteration for maximizing the log-likelihood computed from (11).

However, the incomplete-data log-likelihood under our probability model (11) is non-differentiable due to the hard minimum function. To get around this, we change the probability model for the incomplete data into a mixture model with mixing coefficients as part of $\Theta$:

$$p(x_n, y_n \mid \Theta) = \sum_{k=1}^{K} \frac{\alpha_k}{L} \exp\left( -\frac{ \left( y_n - \tilde{w}_k^T \tilde{x}_n \right)^2 + \gamma \| x_n - \mu_k \|^2 }{2\epsilon} \right) \tag{12}$$

where $\alpha_k \ge 0$, $k = 1, \ldots, K$, and $\sum_{k=1}^{K} \alpha_k = 1$. Note that here,

 p(yn|xn,Θ) ∝ K∑k=1αkexp(−γ2ϵ||xn−\boldmathμk||2)∑Kj=1αjexp(−γ2ϵ||x