Multiple penalized principal curves: analysis and computation

# Multiple penalized principal curves: analysis and computation

Slav Kirov and Dejan Slepčev Department of Mathematical Sciences, Carnegie Mellon University, Pittsburgh, PA, 15213, USA.
July 27, 2019
###### Abstract.

We study the problem of determining the one-dimensional structure that best represents a given data set. More precisely, we take a variational approach to approximating a given measure (data) by curves. We consider an objective functional whose minimizers are a regularization of principal curves and introduce a new functional which allows for multiple curves. We prove existence of minimizers and investigate their properties. While both of the functionals used are non-convex, we show that enlarging the configuration space to allow for multiple curves leads to a simpler energy landscape with fewer undesirable (high-energy) local minima. We provide an efficient algorithm for approximating minimizers of the functional and demonstrate its performance on real and synthetic data. The numerical examples illustrate the effectiveness of the proposed approach in the presence of substantial noise, and the viability of the algorithm for high-dimensional data.

Keywords. principal curves, geometry of data, curve fitting

Classification. 49M25, 65D10, 62G99, 65D18, 65K10, 49Q20

## 1. Introduction

We consider the problem of finding one-dimensional structures best representing the data given as point clouds. This is a classical problem. It has been studied in 80’s by Hastie and Stuetzle [23] who introduced principal curves as the curves going through the “middle” of the data. A number of modifications of principal curves, which make them more stable and easier to compute, followed [12, 15, 21, 25, 36]. However there are very few precise mathematical results on the relation between the properties of principal curves and their variants, and the geometry of the data. On the other hand rigorous mathematical setup has been developed for related problems studied in the context of optimal (transportation) network design [5, 6]. In particular, an objective functional studied in network design, the average-distance functional, is closely related to a regularization of principal curves.

Our first aim is to carefully investigate a desirable variant of principal curves suggested by the average-distance problem. The objective functional includes a length penalty for regularization, and we call its minimizers penalized principal curves. We establish their basic properties (existence and basic regularity of minimizers) and investigate the sense in which they approximate the data and represent the one-dimensional structure. One of the shortcomings of principal curves is that they tend to overfit noisy data. Adding a regularization term to the objective functional minimized by principal curves is a common way to address overfitting. The drawback is that doing so introduces bias: when data lie on a smooth curve the minimizer is only going to approximate them. We investigate the relationship between the data and the minimizers and establish how the length scales present in the data and the parameters of the functional dictate the length scales seen in the minimizers. In particular we provide the critical length scale below which variations in the input data are treated as noise and establish the typical error (bias) when the input curve is smooth. We emphasize that the former has direct implications for when penalized principal curves begin to overfit given data.

Our second aim is to introduce a functional that permits its minimizers to consist of more than one curve (multiple penalized principal curves). The motivation is twofold. The data itself may have one-dimensional structure that consists of more than one component, and the relaxed setting would allow it to be appropriately represented. The less immediate appeal of the new functional is that it guides the design of an improved scheme for computing penalized principal curves. Namely, for many datasets the penalized principal curves functional has a complicated energy landscape with many local minima. This is a typical situation and an issue for virtually all present approaches to nonlinear principal components. As we explain below, enlarging the set over which the functional is considered (from a single curve to multiple curves) and appropriately penalizing the number of components leads to significantly better behavior of energy descent methods (they more often converge to low-energy local minima).

We find topological changes of multiple penalized principal curves are governed by a critical linear density. The linear density of a curve is the density of the projected data on the curve with respect to its length. If the linear density of a single curve drops below the critical value over a large enough length scale, a lower-energy configuration consisting of two curves can be obtained by removing the corresponding curve segment. Such steps are the means by which configurations following energy descent stay in higher-density regions of the data, and avoid local minima that penalized principal curves are vulnerable to. Identification of the critical linear density and the length scale over which it is recognized by the functional further provide insight as to the conditions under and the resolution to which minimizers to can recover one-dimensional components of the data.

We apply modern optimization algorithms based on alternating direction method of multipliers (ADMM) [3] and closely related Bregman iterations [22, 30] to compute approximate minimizers. We describe the algorithm in detail, discuss its complexity and present computational examples that both illustrate the theoretical findings and support the viability of the approach for finding complex one-dimensional structures in point clouds with tens of thousands of points in high dimensions.

### 1.1. Outline

In Section 2 we introduce the objective functionals (both for single and multiple curve approximation), and recall some of the related approaches. We establish basic properties of the functionals, including the existence of minimizers and their regularity. Under assumption of smoothness we derive the Euler-Lagrange equation for critical points of the functional. We conclude Section 2 by computing the second variation of the functional. In Section 3 we provide a number of illustrative examples and investigate the relation between the length scales present in the data, the parameters of the functional and the length scales present in the minimizers. At the end of Section 3 we discuss parameter selection for the functional. In Section 4 we describe the algorithm for computing approximate minimizers of the (MPPC) functional. In Section 4.7 we provide some further numerical examples that illustrate the applicability of the functionals and algorithm. Section 5 contains the conclusion and a brief discussion of and comparison with other approaches for one-dimensional data cloud approximation. Appendix A contains some technical details of an analysis of a minimizer considered in Section 3.

## 2. The functionals and basic properties

In this section we introduce the penalized principal curves functional and the multiple penalized principal curves functional. We recall and prove some of their basic properties. Let be the set of finite, compactly supported measures on , with and .

### 2.1. Penalized principal curves

Given a measure (distribution of data) , , and , the penalized principal curves are minimizers of

 (PPC) Eλμ(γ):=∫Rdd(x,Γ)pdμ(x)+λL(γ)

over , and where , is the distance from to set and is the length of :

 L(γ):=||γ||TV:=sup{n∑i=2|γ(xi)−γ(xi−1)|:0≤x1

and where denotes the Euclidean norm. The functional is closely related to the average-distance problem introduced by Buttazzo, Oudet, and Stepanov [5] having in mind applications to optimal transportation networks [6]. In this context, the first term can be viewed as the cost of a population to reach the network, and the second a cost of the network itself. There the authors considered general connected one-dimensional sets and instead of length penalty considered a length constraint. The penalized functional for one-dimensional sets was studied by Lu and one of the authors in [28], and later for curves (as in this paper) in [27]. Similar functionals have been considered in statistics and machine learning literature as regularizations of the principal curves problem by Tibshirani [37] (introduces curvature penalization) Kegl, Krzyzak, Linder, and Zeger [25] (length constraint), Biau and Fischer [2] (length constraint) Smola, Mika, Schölkopf, and Williamson [36] (a variety of penalizations including penalizing length as in (PPC)) and others.

The first term in (PPC) measures the approximation error, while the second one penalizes the complexity of the approximation. If has smooth density, or is concentrated on a smooth curve, the minimizer is typically supported on smooth curves. However this is not universally true. Namely it was shown in [35] that minimizers of average-distance problems can have corners, even if has smooth density. An analogous argument applies to (PPC) (and later introduced (MPPC)). This raises important modeling questions regarding what the best functional is and if further regularization is appropriate. We do not address these questions in this paper.

Existence of minimizers of (PPC) in was shown in [27]. There it was also shown that any minimizer has the following total curvature bound

 ||γ′min||TV≤pλdiam(supp(μ))p−1μ(Rd).

The total variation (TV) above allows to treat the curvature as a measure, with delta masses at locations of corners, which is necessary in light of the possible lack of regularity. In [27], it was also shown that minimizing curves are injective (i.e. do not self-intersect) in dimension if .

### 2.2. Multiple penalized principal curves

We now introduce an extension of (PPC) which allows for configurations to consist of more than one component. Since (PPC) can be made arbitrarily small by considering with many components, a penalty on the number of components is needed. Thus we propose the following functional for multiple curves

 (MPPC) Eλ1,λ2μ(γ):=∫Rdd(x,Γ)pdμ(x)+λ1(L(γ)+λ2(k(γ)−1))

where, we relax to now be piecewise Lipschitz and is the number of curves used to parametrize . More precisely, we aim to minimize (MPPC) over the admissible set

 A:={γ={γi}k(γ)i=1:k(γ)∈N,γi∈C,i=1,...,k(γ)}.

We call elements of multiple curves. One may think of this functional as penalizing both zero- and one-dimensional complexities of approximations to . In particular we can recover the (PPC) functional by taking large enough. On the other hand, taking large enough leads to a k-means clustering problem which penalizes the number of clusters, and has been encountered in [4, 26].

The main motivation for considering (MPPC), even if only one curve is sought, has to do with the non-convexity of the (PPC). We will see that numerically minimizing (MPPC) often helps evade undesirable (high-energy) local minima of the (PPC) functional. In particular (MPPC) can be seen as a relaxation of (PPC) to a larger configuration space. The energy descent for (MPPC) allows for curve splitting and reconnecting which is the mechanism that enables one to evade local minima of (PPC).

### 2.3. Existence of minimizers of (Mppc)

We show that minimizers of (MPPC) exist in . We follow the approach of [27], where existence of minimizers was shown for (PPC). We first cover some preliminaries, including defining the distance between curves. If with respective domains , where , we define the extension of to as

 ~γ1(t)={γ1(t)if t∈[0,a1]γ1(a1)if t∈(a1,a2].

We let

 dC(γ1,γ2)=maxt∈[0,a2]|~γ1(t)−γ2(t)|.

We have the following lemma, and the subsequent existence of minimizers.

###### Lemma 2.1.

Consider a measure and .

1. For any minimizing sequence of (MPPC)

1. , and

2. There exists a minimizing sequence of (MPPC) such that is contained in , the convex hull of the support of .

###### Proof.

The first property follows by taking a singleton as a competitor. The second follows from projecting any minimizing sequence onto . Doing so can only decrease the energy, as shown in [6, 27]. The argument relies on the fact that projection onto a convex set decrease length. ∎

###### Lemma 2.2.

Given a positive measure and , the functional (MPPC) has a minimizer in . Moreover, the image of any minimizer is contained in the convex hull of the support of .

###### Proof.

The proof is an extension of the one found in [27] for (PPC). Let be a minimizing sequence in . Since the number of curves is bounded, we can find a subsequence (which we take to be the whole sequence) with each member having the same number of curves . We enumerate the curves in each member of the sequence as . We assume that each curve is arc-length parametrized for all . Since the lengths of the curves are uniformly bounded, let , and extend the parametrization for each curve in the way defined above. Then for each , the curves satisfy the hypotheses of the Arzela-Ascoli Theorem. Hence for each , up to a subsequence converge uniformly to a curve . Diagonalizing, we find a subsequence (which we take to be the whole sequence) for which the aforementioned convergence holds for all . Moreover, the limiting object is a collection of curves which are 1-Lipschitz since all of the curves in the sequence are. Thus .

The mapping is continuous and is lower-semicontinuous with respect to convergence in . Thus , and so is a minimizer. ∎

### 2.4. First variation

In this section, we compute the first interior variation of the (PPC) functional and state under what conditions a smooth curve is a critical (i.e. stationary) point. In the case of multiple curves, one can apply the following analysis to each curve separately.

Let be a compactly supported measure. We will assume the curve is . Although minimizers may have corners as mentioned earlier, our use of the first variation is aimed at understanding how the parameters of the functional relate to the length scales observed in minimizers. We generally expect that minimizers will be except at finitely many points, and hence that our analysis applies to intervals between such points.

To compute the first variation we only perturb the interior of the curve, and not the endpoints. Without loss of generality we assume that , where denotes the partial derivative in . We consider variations of of the form , where , . We note that one could allow for and to be nonzero (as has been considered for example in [35]), but it is not needed for our analysis. We can furthermore assume (by reparameterizing the curves if necessary) that is orthogonal to the curve: .

Let , the image of . To compute how the energy (PPC) is changing when is perturbed we need to describe how the distance of points to is changing with . For any let be a point on which is closest to , that is let be a minimizer of over . For simplicity, we assume that is unique for all . The general case that the closest point is non-unique can also be considered [27, 35], but we omit it here. A further reason this assumption is inconsequential is that since is absolutely continuous with respect to the Lebesgue measure , the set of points where is non-unique has -measure zero [29]. We call the projection of data onto . For let . Then

 ∂g∂t =−2(x−Πt(x))⋅γt (2.1) ∂2g∂t2 =2(|γt|2−(x−Πt(x))⋅γtt−(γt⋅γs−(x−Πt(x))⋅γst)2|γs|2−(x−Πt(x))⋅γss)

where and its derivatives are evaluated at , where , that is . Note that for any the set of points projecting onto satisfies . Taking the derivative in , and changing coordinates so that the approximation-error term is written as double integral, we obtain

 dEdt=∫ba(λ1γs|γs|⋅γst−2α(s)∫Π−1t(γ)(x−Πt(x))⋅γt|1−→K(s)⋅(x−Πt(x))|dμs(x))ds

where we have suppressed notation for dependence on (and in some places ), is the Jacobian for change of coordinates, and is the curvature vector of . Here we have used the disintegration theorem (see for example pages 78-80 of [13]) to rewrite an integral for over as an iterated integral along slices orthogonal to the curve (more precisely over the set of points that project to a given point on the curve). We have denoted by the linear density of the projection of to , pulled back to the parameterization of ; that is . By we denote the probability measure supported on the slice . Integrating by parts we obtain

 dEdt∣∣∣t=0=∫ba(−λ1→K⋅γt−2α(s)∫Π−1t(γ)(x−Πt(x))⋅γt|1−→K⋅(x−Πt(x))|dμs(x))ds.

We conclude is a stationary configuration if and only if

 (2.2) λ1→K(s)=−2α(s)∫Π−1t(γ)(x−Πt(x))|1−→K(s)⋅(x−Πt(x))|dμs(x)

for a.e. .

### 2.5. Second variation.

In this section we compute the second variation of (PPC) for the purpose of providing conditions for linear stability. That is, we focus on the case that a straight line segment is a stationary configuration (critical point), and find when it is stable under the considered perturbations (when the second variation is greater than zero). This has important implications for determining when the penalized principal curves start to overfit the data, and is further investigated in the next section.

If is a straight line segment, , and (2.2) simplifies to

 γ(s)=¯x(s):=∫Π−1t(γ(s,t))xdμs(x)

for a.e. such that . This simply states that a straight line is a critical point of the functional if and only if almost every point on the line it is the mean of points projecting there. In other words, the condition is equivalent to being a principal curve (in the original sense).

The second variation of the length term is

 (2.3)

We note that , and therefore , so that the second variation of the length term becomes just . Thus using (2.1) we get

 (2.4) d2Edt2∣∣∣t=0=∫ba(λ1|γst|2+2α(s)∫Π−1t(γ)(|γt|2−((x−Πt(x))⋅γst)2+)dμs(x))ds.

## 3. Relation between the minimizers and the data

In this section, our goal is to relate the parameters of the functional, the length-scales present in the data, and the length-scales seen in the minimizers. To do so we consider examples of data and corresponding minimizers, use the characterization of critical points of (MPPC), and perform linear stability analysis.

### 3.1. Examples and properties of minimizers

Here we provide some insight as to how minimizers of (MPPC) behave. We start by characterizing minimizers in some simple yet instructive cases. In the first couple of cases we focus on the behavior of single curves, and then investigate when minimizers develop multiple components.

#### 3.1.1. Data on a curve.

Here we study the bias of penalized principal curves when the data lie on a curve without noise. If is supported on the image of a smooth curve, and a local minimizer of (PPC) is sufficiently close to , one can obtain an exact expression for the projection distance. More precisely, suppose that for each , only one point in projects to it. That is , the set of such that minimizes over is a singleton. Then (2.2) simplifies to

 λ1K(s)=2α(s)h(1+K(s)h)

where , denotes the unsigned scaler curvature of , and is the projected linear density. Consequently

 (3.1) h=12K⎛⎝√1+2λ1K2α−1⎞⎠≈⎧⎪⎨⎪⎩√λ12αif 1K≪√λ1αλ1K2αif 1K≫√λ1α.

Note that always . We illustrate the transition of the projection distance indicated in (3.1) with the example below.

###### Example 3.1.

Curve with decaying oscillations. We consider data uniformly spaced on the image of the function , which ensures that the amplitude and period are decreasing with the same rate, as . In Figure 2, the linear density of the data is constant (with respect to arc length) with total mass , and solution curves are shown for two different values of . For small enough the minimizing curve is flat, as it is not influenced by oscillations whose amplitude is less than . As the amplitude of oscillations grows beyond the smoothing length scale the minimizing curves start to follow them. As gets larger and becomes smaller, the projection distances at the peaks start to scale linearly with , as predicted by (3.1). Indeed, as decreases to zero the ratio of the curvature of the minimizer to that of the data curve approaches one and converges to a constant,. Hence from (3.1) follows that the ratio of the projection distances at the peaks converges to the ratio of the values.

#### 3.1.2. Linear stability.

In this section we establish conditions for the linear stability of penalized principal curves. For simplicity we consider the case when . Suppose that is arc-length parametrized and a stationary configuration of (PPC), and that for some , is a line segment. As previously, we let denote the projected linear density of onto .

We evaluate the second variation (2.4) over the interval , where the considered variations of are , where . Since is a line segment on , we can consider coordinates where . We then have

 d2Edt2∣∣∣t=0=∫baλ1(v′2)2+2α(s)∫Π−1t(γ)(v22−(v′2(x−Πt(x)))2)dμs(x)ds.

We define the mean squared projection distance

 (3.2) H(s):=(∫Π−1t(γ)(x−Πt(x))2dμs(x))12

and obtain

 (3.3) d2Edt2∣∣∣t=0=∫ba(λ1−2α(s)H(s)2)(v′2)2+2α(s)v22ds.

We see that if for almost every , then and so is linearly stable.

On the other hand, suppose that on some subinterval – without loss of generality we take it to be the entire interval . Consider the perturbation given by . Then the RHS of (3.3) becomes

 n2∫ba(λ1−2α(s)H(s)2)cos2(ns)ds+2∫baα(s)sin2(ns)ds

and we see the first term dominates (in absolute value) the second for large enough. Hence line segment

 (3.4) γ is linearly unstable on intervals where λ1<2αH2.

In the following examples we examine linear stability for some special cases of the data .

###### Example 3.2.

Parallel lines. We start with a simple case in which data, , lie uniformly on two parallel lines. In Figure 3 we show computed local minimizers starting with a slight perturbation of the initial straight line configuration, using the algorithm later described in Section 4. The data lines are of length 2, so that for the straight line configuration. The parameter and hence the condition for linear instability (3.4) of the straight line steady state becomes . The numerical results show that indeed straight line steady state becomes unstable when becomes slightly larger than .

###### Example 3.3.

Uniform density in rectangle. Consider a probability measure, , with uniform density over with . Linear instability of the line segment (which is a critical point of (PPC)) can be seen as indication of when a local minimizer starts to overfit the data. It follows from (3.2) that , and from (3.4) that is the critical value for linear stability.

In Figure 4, we show the resulting local minimizers of (PPC) when starting from a small perturbation of the straight line, for several values of , for and . The results from the numerical experiment appear to agree with the predicted critical value of , as the computed minimizer corresponding to has visible oscillations, while that of does not.

To illustrate how closely the curves approximate that data we consider average mean projection distance, , for various values of . We expect that condition for linear stability (3.4), which was derived for straight-line critical points applies, approximately, to curved minimizers. In particular we expect that curves where is larger than (approximately) will not be minimizers and will be evolved further by the algorithm. Here we investigate numerically if for minimizers , as is the case in one regime of (3.1). Our findings are presented on Figure 5.

###### Example 3.4.

Vertical Gaussian noise. Here we briefly remark on the case that has Gaussian noise with variance orthogonal to a straight line. We note that the mean squared projection distance is just the standard deviation . Therefore linear instability (overfitting) occurs if and only if .

#### 3.1.3. Role of λ2.

We now turn our attention to the role of in (MPPC). Our goal is to understand when do transitions in the number of curves in minimizers occur.

By direct inspection of (MPPC), it is always energetically advantageous to connect endpoints of distinct curves if the distance between them is less than . Similarly, it is never advantageous to disconnect a curve by removing a segment which has length less than . Thus represents the smallest scale at which distinct components can be detected by the (MPPC) functional. When distances are larger than , connectedness is governed by the projected linear density of the curves, as we investigate with the following simple example.

###### Example 3.5.

Uniform density on line. In this example, we consider the measure to have uniform density on the line segment . We relegate the technical details of the analysis to Appendix A; here we report the main conclusions. By (A.6) there is a critical density

 α∗=(43)2λ1λ22

such that if then the minimizer has one component and is itself a line segment contained in . It is straightforward to check that will be shorter than by a length of on each side. Note that at the endpoints , which is less than the upper bound at interior points predicted by (3.1).

On the other hand, if and is long enough then the minimizer consists of regularly spaced points on with space between them approximately (because of finite size effects)

 (3.5) gap≈2(3λ1λ24α)13.

An example of this scenario is provided in Figure 6.

### 3.2. Summary of important quantities and length scales.

Here we provide an overview of how length scales present in the minimizers are affected by the parameters and , and the geometric properties of data. We identify key quantities and length scales that govern the behavior of minimizers to (MPPC). We start with those that dictate the local geometry of penalized principal curves.

• — smoothing length scale (discussed in Sections 3.1.1 and 3.1.2 and illustrated in Example 3.3). This scale represents the resolution at which data will be approximated by curves. Consider data generated by smooth curve with data density per length and added noise (high-frequency oscillations, uniform distribution in a neighborhood of the curve, etc.). Then the noise will be "ignored" by the minimizer as long as its average amplitude (distance in space from the generating curve) is less than a constant multiple (depending on the type of noise) of . In other words is the length scale over which the noise is averaged out. Noise below this scale is neglected by the minimizer, while noise above is interpreted as signal that needs to be approximated. For example, if we think of data as drawn by a pen, then is the widest the pen tip can be, for the line drawn to be considered a line by the algorithm.

• — bias or approximation-error length scale (discussed in Section 3.1.1). Consider again data generated by smooth curve with data density per length and curvature . If the curvature of the curve is small (compared to ) and reach is comparable to , then the distance from the curve to the minimizer is going to scale like . That is the typical error in reconstruction of a smooth curve that a minimizer makes (due to the presence of the length penalty term) scales like .

In addition to the above length scales, the following quantities govern the topology of multiple penalized principal curves:

• — connectivity threshold (discussed in Section 3.1.3). This length scale sets the minimum distance between distinct components of the solution. Gaps in the data of size or less are not detected by the minimizer. Furthermore, this quantity provides the scale over which the following critical density is recognized.

• — linear density threshold (discussed in Example 3.5 and Appendix A.6). Consider again data generated (possibly with noise) by a smooth curve (with curvature small compared to ) with data density per length . If is smaller than , then it is cheaper for the data to be approximated by a series of points than by a continuous curve. That is if there are too few data points the functional no longer sees them as a continuous curve. If , then the minimizers of (PPC) and (MPPC) are expected to coincide, while if , then the minimizer of (MPPC) will consist of points spaced at distance about . Note that the condition can also be written as , and thus the minimizer can be expected to consist of more than component if the connectivity threshold is greater than the smoothing length scale.

We also remark the following scaling properties of the functionals. Note that for any . Thus, when the total mass of data points is changed, should scale like to preserve minimizers. Alternatively, if for every and some , one easily obtains that .

### 3.3. Parameter selection

Understanding the length scales above can guide one in choosing the parameters . Here we discuss a couple of approaches to selecting these parameters. We will assume that the data measure has been normalized, so that it is a probability measure.

A natural quantity to specify is a critical density , which ensures that the linear density of any found curve will be at least . From Section 3.2 it follows that setting imposes the following constraint on the parameters: . Alternatively, one can set if provided a bound on the desired curve length – if one is seeking a single curve with approximately constant linear density and length or less, then set .

There are a couple of ways of obtaining a second constraint, which in conjunction with the first determine values for .

#### 3.3.1. Specifying critical density α∗ and desired resolution H∗.

One can set a desired resolution for minimizers by bounding the mean squared projection distance . If is set to equal the minimum of along the curves then, the spatial resolution from the data to minimizing curves is at most . Consequently, if one specifies and desires spatial resolution , or better, the desired parameters are:

 λ1=2α∗H∗2 and λ2=4√23H∗.

Choosing proper depends on the level of noise present in the data. In particular, needs to be at least the mean squared height of vertical noise in order to prevent overfitting.

#### 3.3.2. Specifying critical density α∗ and λ2.

One may be able to choose directly, as it specifies the resolution for detecting distinct components. In particular, there needs to be a distance of at least between components, in order for them to detected as separate. Once set, .

Typically one desires the smallest (best) resolution , that does not lead to larger than desired. Even if a single curve is sought, taking a smaller value for can ensure less frequent undesirable local minima. One case of this is later illustrated in Example 4.1, where local minimizers can oscillate within the parabola.

###### Example 3.6.

Line segments. Here we provide a simple illustration of the role of parameters, using data generated by three line segments with noise. The line segments are of the same length, and the ratio of the linear density of data over the segments is approximately 4:2:1 (left to right). In addition, the first gap is larger than the second gap. Figure 7 shows how the minimizers of (MPPC) computed depend on parameters used. In the Subfigures 7(a), 7(b) 7(c) we keep fixed while decreasing . As the critical gap length is decreased, and equivalently having more components in the minimizer becomes cheaper, the gaps in the minimizer begin to appear. It no longer sees the data representing one line but two or three separate lines. the only difference between functionals in Subfigures 7(c) and Subfigures 7(d) is that is increased from to . This results in length of the curve becoming more expensive. In Subfigure 7(d) we see that, due to low data density per length (), the minimizer approximates the two data patches to the right by singletons rather than curves.

## 4. Numerical algorithm for computing multiple penalized principal curves

For this section we assume the data measure is discrete, with points and corresponding weights . The weights are uniform () for most applications, but we make note of our flexibility in this regard for cases when it is convenient to have otherwise.

For a piecewise linear curve , we consider projections of data to ’s only. Hence, we approximate , unless otherwise stated. (Notation: when is a vector, as are in the previous line, denotes the Euclidean norm). Before addressing minimization of (MPPC), we first consider (PPC) where represents a single curve. The discrete form is

 (4.1) m∑j=1∑i∈Ijwi|xi−yj|2+λ1m−1∑j=1|yj+1−yj|

where

 (4.2) Ij:={i:(∀k=1,...,m)|xi−yj|≤|xi−yk|}

represents the set indexes of data points for which is the closest among . In case that the closest point is not unique an arbitrary assignment is made so that partition (for example set ).

### 4.1. Basic approach for minimizing Ppc

Here we restrict our attention to performing energy decreasing steps for the (PPC) functional. We emphasize again that this minimization problem is non-convex. The projection assignments depend on itself. However, if the projection assignments are fixed, then the resulting minimization problem is convex. This suggests the following EM-type algorithm outlined in Algorithm 1.

Note that if the minimization of (4.1) is solved exactly, then Algorithm 1 converges to a local minimum in finitely many steps (since there are finitely many projection states, which cannot be visited more than once).

#### 4.1.1. Minimize functional with projections fixed

We now address the minimization of (4.1) with projections fixed (step 2 of Algorithm 1). One may observe that that his subproblem resembles that of a regression, and in particular the fused lasso [38].

To perform the minimization we apply the alternating direction method of multipliers (ADMM) [3], which is equivalent to split Bregman iterations [22] when the constraints are linear (our case) [17]. We rewrite the total variation term as , where is the difference operator, and again denotes the Euclidean norm. An equivalent constrained minimization problem is then

 miny,z:z=Dym∑j=1∑i∈Ijwi|xi−yj|2+λ||z||1,2

Expanding the quadratic term and neglecting the constant, we obtain

 (4.3) miny,z:z=Dy||y||2¯w−2(y,¯x)¯w+λ||z||1,2

where notation was introduced for total mass projecting to by , center of mass , and weighted inner product . One iteration of the ADMM algorithm then consists of the following updates:

where is a parameter that can be interpreted as penalizing violations of the constraint. As such, lower values of tend to make the algorithm more adventurous, though the algorithm is known to converge to the optimum for any fixed value of .

The minimization in the first step is convex, and the first order conditions yield a tridiagonal system for . The tridiagonal matrix to be inverted is the same for all subsequent iterations, so only one inversion is necessary, which can be done in time. In the second step, decouples, and the resulting solution is given by block soft thresholding

 zk+1i=⎧⎨⎩vki−λρvki||vki||if||vki||>λρ0else

where we have let . We therefore see that ADMM applied to (4.3) is very fast.

Note that one only needs for the energy to decrease in this step for Algorithm 1 to converge to a local minimum. This is typically achieved after one iteration of ADMM. In such cases few iterations may be appropriate, as finer precision typically gets lost once projections are updated. On the other hand, the projection step is more expensive, requiring operations to compute exactly. It may be worthwhile to investigate how to optimize alternating these steps, as well as more efficient methods for updating projections especially when changes in are small. In our implementation we exactly recompute all projections, and if the resulting change in energy is small, we minimize (4.1) to a higher degree of precision (apply more iterations of ADMM before again recomputing projections).

### 4.2. Approach to minimizing Mppc

We now discuss how we perform steps that decrease the energy of the modified functional (MPPC). We allow to consist of any number, , of curves, and we denote them , where . The indexes of the curve ends are for , and we set . The discrete of form of (MPPC) can then be written as

 (4.4) m∑j=1∑i∈Ijwi|xi−yj|2+λ1k−1∑c=0mc+1∑j=1|ysc+j+1−ysc+j|+λ1λ2(k−1).

Our approach to (locally) minimizing the problem over is to split the functional into parts that are decreased over different variables. Keeping constant and minimizing over we can decrease (4.4) by simply applying step 1 and step 2 of Algorithm 1 to each curve (note that step 2 can be run in parallel). To minimize over we introduce topological routines below that disconnect, connect, add, and remove curves based on the resulting change in energy.

#### 4.2.1. Disconnecting and connecting curves

Here we describe how to perform energy decreasing steps by connecting and disconnecting curves. We first examine the energy contribution of an edge . To do so we compare the energies corresponding to whether or not the given edge exists. It is straightforward to check that the energy contribution of the edge with respect to the continuum functional (MPPC) is

 ΔEi,i′:=λ1|yi′−yi|−λ1λ2−∑j∈Ii,i′wjmin(|yi−Πi,i′(xj)|,|yi′−Πi,i′(xj)|)2

where is the set of data points projecting to the edge , and is the orthogonal projection onto edge . Our connecting and disconnecting routines will be based on the sign of . We note that above criterion is based on the variation of the continuum functional rather than its discretization (4.4), in which projections to the vertices only (not edges) are considered. Our slight deviation here is motivated by providing a stable criterion that is invariant to further discretizations of the line segment . While we use the discrete functional to simplify computations in approximating the optimal fitting of curves, we will connect and disconnect curves based on the continuum energy (MPPC).

We first discuss disconnecting. We compute the energy contribution for each existing edge and if , then we remove edge . Note this condition can only be true if the length of the edge is at least . It may happen that all edge lengths are less than , but that the energy may be decreased by removing a sequence of edges, whose total length is greater than . Thus, in addition to checking single edges, we implement an analogous check for sequences of edges. The energy contribution of a sequence of edges (including the corresponding interior vertices ) is given by

 ΔEi:i+k:= λ1(