Adaptive piecewise polynomial estimationvia trend filtering\thanksrefT1

# Adaptive piecewise polynomial estimation via trend filtering

## Abstract

We study trend filtering, a recently proposed tool of Kim et al. [SIAM Rev. 51 (2009) 339–360] for nonparametric regression. The trend filtering estimate is defined as the minimizer of a penalized least squares criterion, in which the penalty term sums the absolute th order discrete derivatives over the input points. Perhaps not surprisingly, trend filtering estimates appear to have the structure of th degree spline functions, with adaptively chosen knot points (we say “appear” here as trend filtering estimates are not really functions over continuous domains, and are only defined over the discrete set of inputs). This brings to mind comparisons to other nonparametric regression tools that also produce adaptive splines; in particular, we compare trend filtering to smoothing splines, which penalize the sum of squared derivatives across input points, and to locally adaptive regression splines [Ann. Statist. 25 (1997) 387–413], which penalize the total variation of the th derivative. Empirically, we discover that trend filtering estimates adapt to the local level of smoothness much better than smoothing splines, and further, they exhibit a remarkable similarity to locally adaptive regression splines. We also provide theoretical support for these empirical findings; most notably, we prove that (with the right choice of tuning parameter) the trend filtering estimate converges to the true underlying function at the minimax rate for functions whose th derivative is of bounded variation. This is done via an asymptotic pairing of trend filtering and locally adaptive regression splines, which have already been shown to converge at the minimax rate [Ann. Statist. 25 (1997) 387–413]. At the core of this argument is a new result tying together the fitted values of two lasso problems that share the same outcome vector, but have different predictor matrices.

[
\kwd
\doi

10.1214/13-AOS1189 \volume42 \issue1 2014 \firstpage285 \lastpage323 \newproclaimremRemark

\runtitle

Trend filtering

\pdftitle

Adaptive piecewise polynomial estimation via trend filtering

{aug}

A]\fnmsRyan J. \snmTibshirani\correflabel=e1]ryantibs@cmu.edu \thankstextT1Supported by NSF Grant DMS-13-09174.

class=AMS] \kwd62G08 \kwd62G20 Trend filtering \kwdnonparametric regression \kwdsmoothing splines \kwdlocally adaptive regression splines \kwdminimax convergence rate \kwdlasso stability

\pdfkeywords

62G08, 62G20, Trend filtering, nonparametric regression, smoothing splines, locally adaptive regression splines, minimax convergence rate, lasso stability

## 1 Introduction.

Per the usual setup in nonparametric regression, we assume that we have observations from the model

 yi=f0(xi)+εi,i=1,…,n, (1)

where are input points, is the underlying function to be estimated, and are independent errors. For the most part, we will further assume that the inputs are evenly spaced over the interval , that is, for . (However, this assumption can be relaxed, as discussed in the supplementary document [Tibshirani (2014)].) The literature on nonparametric regression is rich and diverse, and there are many methods for estimating given observations from the model (1); some well-known examples include methods based on local polynomials, kernels, splines, sieves and wavelets.

This paper focuses on a relative newcomer in nonparametric regression: trend filtering, proposed by Kim et al. (2009). For a given integer , the th order trend filtering estimate of is defined by a penalized least squares optimization problem,

 ^β=argminβ∈Rn12∥y−β∥22+nkk!⋅λ∥∥D(k+1)β∥∥1, (2)

where is a tuning parameter, and is the discrete difference operator of order . (The constant factor multiplying is unimportant, and can be absorbed into the tuning parameter , but it will facilitate comparisons in future sections.) When ,

 D(1)=⎡⎢ ⎢ ⎢ ⎢⎣−110⋯000−11⋯00⋮000⋯−11⎤⎥ ⎥ ⎥ ⎥⎦∈R(n−1)×n (3)

and so . Hence, the 0th order trend filtering problem, which we will also call constant trend filtering, is the same as one-dimensional total variation denoising [Rudin, Osher and Faterni (1992)], or the one-dimensional fused lasso [Tibshirani et al. (2005)] (with pure fusion penalty, i.e., without an additional penalty on the coefficients themselves). In this case, , the components of the trend filtering estimate form a piecewise constant structure, with points corresponding to the nonzero entries of . See Figure 1 for an example.

For , the operator is most easily-defined recursively, as in

 D(k+1)=D(1)⋅D(k). (4)

[Above, is the version of the first order discrete difference operator (3).] In words, the definition (4) says that the st order difference operator is built up by evaluating differences of differences, a total of times. Therefore, the matrix can be thought of as the discrete analogy to the st order derivative operator, and the penalty term in (2) penalizes the discrete st derivative of the vector , that is, the changes in the discrete th derivative of . Accordingly, one might expect the components of the th order trend filtering estimate to exhibit the structure of a piecewise polynomial of order , for example, for first order trend filtering, the estimate would be piecewise linear, for second order, it would be piecewise quadratic, etc. Figure 1 gives empirical evidence towards this claim. Later, in Section 4, we provide a more definitive confirmation of this piecewise polynomial structure when we examine a continuous-time representation for trend filtering.

It is straightforward to check that

 D(2) = ⎡⎢ ⎢ ⎢ ⎢⎣1−210⋯001−21⋯0001−2⋯0⋮⎤⎥ ⎥ ⎥ ⎥⎦, D(3) = ⎡⎢ ⎢ ⎢ ⎢⎣−13−31⋯00−13−3⋯000−13⋯0⋮⎤⎥ ⎥ ⎥ ⎥⎦

and in general, the nonzero elements in each row of are given by the st row of Pascal’s triangle, but with alternating signs. A more explicit (but also more complicated-looking) expression for the th order trend filtering problem is therefore

 ^β=argminβ∈Rn12n∑i=1(yi−βi)2+nkk!⋅λn−k−1∑i=1∣∣∣i+k+1∑j=i(−1)j−i(k+1j−i)βj∣∣∣.

The penalty term above sums over successive linear combinations of adjacent coefficients, that is, the discrete difference operator is a banded matrix with bandwidth .

### 1.1 The generalized lasso and related properties.

For any order , the trend filtering estimate is uniquely defined, because the criterion in (2) is strictly convex. Furthermore, the trend filtering criterion is of generalized lasso form with an identity predictor matrix (this is called the signal approximator case) and a specific choice of penalty matrix . Some properties of the trend filtering estimate therefore follow from known results on the generalized lasso [Tibshirani and Taylor (2011; 2012)], for example, an exact representation of the trend filtering estimate in terms of its active set and signs, and also, a formula for its degrees of freedom:

 df(^β)=E[number of knots in ^β]+k+1, (5)

where the number of knots in is interpreted to mean the number of nonzero entries in (the basis and continuous-time representations of trend filtering, in Sections 3.3 and 4, provide a justification for this interpretation). To repeat some of the discussion in Tibshirani and Taylor (2011; 2012), the result in (5) may seem somewhat remarkable, as a fixed-knot th degree regression spline with knots also has degrees of freedom—and trend filtering does not employ fixed knots, but rather, chooses them adaptively. So, why does trend filtering not have a larger degrees of freedom? At a high level, the answer lies in the shrinkage due to the penalty in (2): the nonzero entries of are shrunken toward zero, compared to the same quantity for the corresponding equality-constrained least squares estimate. In other words, within each interval defined by the (adaptively chosen) knots, trend filtering fits a th degree polynomial whose th derivative is shrunken toward its th derivatives in neighboring intervals, when compared to a th degree regression spline with the same knots. Figure 2 gives a demonstration of this phenomenon for and .

In terms of algorithms, the fact that the discrete difference operator is banded is of great advantage for solving the generalized lasso problem in (2). Kim et al. (2009) describe a primal–dual interior point method for solving (2) at a fixed value of , wherein each iteration essentially reduces to solving a linear system in , costing operations. In the worst case, this algorithm requires iterations, so its complexity is .1 [Kim et al. (2009) focus mainly on linear trend filtering, the case , but their arguments carry over to the general case as well.] On the other hand, instead of solving (2) at a fixed , Tibshirani and Taylor (2011) describe a path algorithm to solve (2) over all values of , that is, to compute the entire solution path over . This path is piecewise linear as a function of (not to be confused with the estimate itself at any fixed , which has a piecewise polynomial structure over the input points ). Again, the bandedness of is key here for efficient computations, and Tibshirani and Arnold (2013) describe an implementation of the path algorithm in which computing the estimate at each successive critical point in the path requires operations.

Software for both of these algorithms is freely available online. For the primal–dual interior point method, see http://stanford.edu/~boyd/l1_tf, which provides Matlab and C implementations (these only cover the linear trend filtering case, but can be extended to the general polynomial case); for the path algorithm, see the function trendfilter in the R package genlasso, available on the CRAN repository.

### 1.2 Summary of our results.

Little is known about trend filtering—mainly, the results due to its generalized lasso form, for example, the degrees of freedom result (5) discussed in the previous section—and much is unknown. Examining the trend filtering fits in Figures 1 and 2, it appears that the estimates not only have the structure of piecewise polynomials, they furthermore have the structure of splines: these are piecewise polynomial functions that have continuous derivatives of all orders lower than the leading one [i.e., a th degree spline is a th degree piecewise polynomial with continuous th through st derivatives at its knots]. Figure 3 plots an example cubic trend filtering estimate, along with its discrete 1st, 2nd and 3rd derivatives (given by multiplication by , , and , resp.). Sure enough, the lower order discrete derivatives appear “continuous” across the knots, but what does this really mean for such discrete sequences? Does trend filtering have an analogous continuous-time representation, and if so, are the estimated functions really splines?

Besides these questions, one may also wonder about the performance of trend filtering estimates compared to other methods. Empirical examples (like those in Section 2) show that trend filtering estimates achieve a significantly higher degree of local adaptivity than smoothing splines, which are arguably the standard tool for adaptive spline estimation. Other examples (like those in Section 3) show that trend filtering estimates display a comparable level of adaptivity to locally adaptive regression splines, another well-known technique for adaptive spline estimation, proposed by Mammen and van de Geer (1997) on the basis of being more locally adaptive (as their name would suggest). Examples are certainly encouraging, but a solely empirical conclusion here would be unsatisfactory—fixing as a metric the squared error loss in estimating the true function , averaged over the input points, can we say more definitively that trend filtering estimates actually outperform smoothing splines, and perform as well as locally adaptive regression splines?

We investigate the questions discussed above in this paper. To summarize our results, we find that:

• for (constant or linear orders), the continuous-time analogues of trend filtering estimates are indeed th degree splines; moreover, they are exactly the same as th order locally adaptive regression splines;

• for (quadratic or higher orders), the continuous-time versions of trend filtering estimates are not quite splines, but piecewise polynomial functions that are “close to” splines (with small discontinuities in lower order derivatives at the knots); hence, they are not the same as th order locally adaptive regression splines;

• for any , if the th derivative of true function is of bounded variation, then the th order trend filtering estimate converges to (in terms of squared error loss) at the minimax rate; this rate is achieved by locally adaptive regression splines [Mammen and van de Geer (1997)], but not by smoothing splines nor any other estimate linear in [Donoho and Johnstone (1998)].

We note that, although trend filtering and locally adaptive regression splines are formally different estimators for , they are practically indistinguishable by eye in most examples. Such a degree of similarity, in finite samples, goes beyond what we are able to show theoretically. However, we do prove that trend filtering estimates and locally adaptive regression spline estimates converge to each other asymptotically (Theorem 1). The argument here boils down to a bound on the difference in the fitted values of two lasso problems that have the same outcome vector, but different predictor matrices (because both trend filtering and locally adaptive regression splines can be represented as lasso problems, see Section 3). To the best of our knowledge, this general bound is a new result (see the supplementary document [Tibshirani (2014)]). Further, we use this asymptotic pairing between trend filtering and locally adaptive regression splines to prove the minimax convergence rate for trend filtering (Corollary 1). The idea is simple: trend filtering and locally adaptive regression splines converge to each other at the minimax rate, locally adaptive regression splines converge to the true function at the minimax rate [Mammen and van de Geer (1997)], and hence so does trend filtering.

### 1.3 Why trend filtering?

Trend filtering estimates, we argue, enjoy the favorable theoretical performance of locally adaptive regression splines; but now it is fair to ask: why would we ever use trend filtering, over, say, the latter estimator? The main reason is that trend filtering estimates are much easier to compute, due to the bandedness of the discrete derivative operators, as explained previously. The computations for locally adaptive regression splines, meanwhile, cannot exploit such sparsity or structure, and are considerably slower. To be more concrete, the primal–dual interior point method described in Section 1.1 above can handle problems of size on the order of 1,000,000 points (and the path algorithm, on the order of 100,000 points), but even for 10,000 points, the computations for locally adaptive regression splines are prohibitively slow. We discuss this in Section 3.

Of course, the nonparametric regression toolbox is highly-developed and already offers plenty of good methods. We do not presume that trend filtering should be regarded as the preferred method in every nonparametric regression problem, but simply that it represents a useful contribution to the toolbox, being both fast and locally adaptive, that is, balancing the strengths of smoothing splines and locally adaptive regression splines. This manuscript mainly focuses on the comparison to the aforementioned estimators because they, too, like trend filtering, fit piecewise polynomials functions and they are widely used. Though we do not compare wavelets or smoothing splines with a spatially variable tuning parameter in as much detail, we consider them in Section 6 in an analysis of astrophysics data. It should be mentioned that for trend filtering to become a truly all-purpose nonparametric regression tool, it must be able to handle arbitrary input points (not just evenly spaced inputs). We give an extension to this case in the supplementary document [Tibshirani (2014)]. Our analysis of trend filtering with arbitrary inputs shows promising computational and theoretical properties, but still, a few questions remain unanswered. This will be the topic of future work.

As a separate point, another distinguishing feature of trend filtering is that it falls into what is called the analysis framework with respect to its problem formulation, whereas locally adaptive regression splines, smoothing splines, and most others fall into the synthesis framework. Synthesis and analysis are two terms used in signal processing that describe different approaches for defining an estimator with certain desired characteristics. In the synthesis approach, one builds up the estimate constructively from a set of characteristic elements or atoms; in the analysis approach, the strategy is instead to define the estimate deconstructively, via an operator that penalizes undesirable or uncharacterisic behavior. Depending on the situation, it can be more natural to implement the former rather than the latter, or vice versa, and hence both are important. We discuss the importance of the analysis framework in the context of nonparametric regression estimators in Section 7.2, where we define extensions of trend filtering that would be difficult to construct from the synthesis perspective, for example, a sparse variant of trend filtering.

Here is an outline for the rest of this article (though we have discussed its contents throughout the Introduction, we list them here in proper order). In Sections 2 and 3, we compare trend filtering to smoothing splines and locally adaptive regression splines, respectively. We give data examples that show trend filtering estimates are more locally adaptive than smoothing splines, and that trend filtering and locally adaptive regression splines are remarkably similar, at any common value of their tuning parameters. We also discuss the differing computational requirements for these methods. In Section 3, we show that both locally adaptive regression splines and trend filtering can be posed as lasso problems, with identical predictor matrices when or 1, and with similar but slightly different predictor matrices when . This allows us to conclude that trend filtering and locally adaptive regression splines are exactly the same for constant or linear orders, but not for quadratic or higher orders. Section 4 develops a continuous-time representation for the trend filtering problem, which reveals that (continuous-time) trend filtering estimates are always th order piecewise polynomials, but for , are not th order splines. In Section 5, we derive the minimax convergence rate of trend filtering estimates, under the assumption that the th derivative of the true function has bounded total variation. We do this by bounding the difference between trend filtering estimates and locally adaptive regression splines, and invoking the fact that the latter are already known to converge at the minimax rate [Mammen and van de Geer (1997)]. We also study convergence rates for a true function with growing total variation. In Section 6, we consider an astrophysics data example, and compare the performance of several commonly used nonparametric regression tools. Section 7 presents ideas for future work: multivariate trend filtering, sparse trend filtering, and the synthesis versus analysis perspectives. Essentially all proofs, and the discussion of trend filtering for arbitrary inputs, are deferred until the supplementary document [Tibshirani (2014)].

## 2 Comparison to smoothing splines.

Smoothing splines are a popular tool in nonparametric regression, and have been extensively studied in terms of both computations and theory [some well-known references are de Boor (1978), Wahba (1990), Green and Silverman (1994)]. Given input points , which we assume are unique, and observations , the th order smoothing spline estimate is defined as

 ^f=argminf∈W(k+1)/2n∑i=1(yi−f(xi))2+λ∫10(f((k+1)/2)(t))2dt, (6)

where is the derivative of of order , is a tuning parameter, and the domain of minimization here is the Sobolev space

 W(k+1)/2 = {f\dvtx[0,1]→R\dvtxf is (k+1)/2 times differentiable and ∫10(f((k+1)/2)(t))2dt<∞}.

Unlike trend filtering, smoothing splines are only defined for an odd polynomial order . In practice, it seems that the case (i.e., cubic smoothing splines) is by far the most common case considered. In the next section, we draw a high-level comparison between smoothing splines and trend filtering, by writing the smoothing spline minimization problem (6) in finite-dimensional form. Following this, we make empirical comparisons, and then discuss computational efficiency.

### 2.1 Generalized ridge regression and Reinsch form.

Remarkably, it can be shown that the infinite-dimensional problem in (6) is has a unique minimizer, which is a th degree natural spline with knots at the input points [see, e.g., Wahba (1990), Green and Silverman (1994), Hastie, Tibshirani and Friedman (2008)]. Recall that a th degree natural spline is a simply a th degree spline that reduces to a polynomial of degree before the first knot and after the last knot; it is easy to check the set of natural splines of degree , with knots at , is spanned by precisely basis functions. Hence, to solve (6), we can solve for the coefficients in this basis expansion:

 ^θ=argminθ∈Rn∥y−Nθ∥22+λθTΩθ, (7)

where contains the evaluations of th degree natural spline basis functions over the knots , and contains the integrated products of their nd derivatives; that is, if denotes a collection of basis functions for the set of th degrees natural splines with knots at , then

 Nij=ηj(xi)andΩij=∫10η((k+1)/2)i(t)⋅η((k+1)/2)j(t)dt (8) \eqntextfor all i,j=1,…,n. (9)

The problem in (7) is a generalized ridge regression, and from its solution , the function in (6) is simply given at the input points by

 (^f(x1),…,^f(xn))=N^θ.

More generally, the smoothing spline estimate at an arbitrary input is given by

 ^f(x)=n∑j=1^θjηj(x).

To compare the smoothing spline problem, as expressed in (7), with trend filtering, it helps to rewrite the smoothing spline fitted values as follows:

 N^θ = N(NTN+λΩ)−1NTy (10) = N(NT(I+λN−TΩN−1)N)−1NTy = (I+λK)−1y,

where . The expression in (10) is called the Reinsch form for the fitted values. From this expression, we can view as the solution of the minimization problem

 ^u=argminu∈Rn∥y−u∥22+λuTKu, (11)

which is of similar form to the trend filtering problem in (2), but here the penalty is replaced by the quadratic penalty . How do these two penalties compare? First, the penalty matrix used by smoothing splines is similar in nature to the discrete derivative operators [we know from its continuous-time analog in (6) that the term penalizes wiggliness in something like the nd derivative of ] but is still strictly different. For example, for (cubic smoothing splines) and input points , , it can be shown [Green and Yandell (1985)] that the smoothing spline penalty is where is the second order discrete derivative operator, and is a tridiagonal matrix, with diagonal elements equal to and off-diagonal elements equal to .

A second and more important difference is that smoothing splines utilize a (squared) penalty, while trend filtering uses an penalty. Analogous to the usual comparisons between ridge regression and the lasso, the former penalty shrinks the components of , but does not set any of the components to zero unless (in which case all components are zero), whereas the latter penalty shrinks and also adaptively sets components of to zero. One might imagine, recalling that and both act in a sense as derivative operators, that trend filtering estimates therefore exhibit a finer degree of local adaptivity than do smoothing splines. This idea is supported by the examples in the next section, which show that trend filtering estimates outperform smoothing splines (when both are optimally tuned) in estimating functions with spatially inhomogeneous smoothness. The idea is also supported by our theory in Section 5, where we prove that trend filtering estimates have a better rate of convergence than smoothing splines (in fact, they achieve the optimal rate) over a broad class of underlying functions.

### 2.2 Empirical comparisons.

We compare trend filtering and smoothing spline estimates on simulated data. We fix (i.e., we compare cubic trend filtering versus cubic smoothness splines), because the smooth.spline function in the R programming language provides a fast implementation for smoothing splines in this case. Generally speaking, smoothing splines and trend filtering provide similar estimates when the underlying function has spatially homogeneous smoothness, or to put it simply, is either entirely smooth or entirely wiggly throughout its domain. Hence, to illustrate the difference between the two estimators, we consider two examples of functions that display varying levels of smoothness at different spatial locations.

Our first example, which we call the “hills” example, considers a piecewise cubic function over , whose knots are spaced farther apart on the left-hand side of the domain, but bunched closer together on the right-hand side. As a result, is smooth for between and about , but then abruptly becomes more wiggly—see the top left panel of Figure 4. We drew noisy observations from over the evenly spaced inputs , (with independent, normal noise), and fit a trend filtering estimate, tuned to have 19 degrees of freedom, as shown in the top right panel.2 We can see here that the estimate adapts to the appropriate levels of smoothness at both the left and right sides of the domain. But this is not true of the smoothing spline estimate with 19 degrees of freedom, displayed in the bottom left panel: the estimate is considerably oversmoothed on the right-hand side. As we increase the allowed flexibility, the smoothing spline estimate is able to fit the small hills on the right, with a total of 30 degrees of freedom; however, this causes undersmoothing on the left-hand side, as shown in the bottom right panel.

For our second example, we take to be the “Doppler” function [as considered in, e.g., Donoho and Johnstone (1995), Mammen and van de Geer (1997)]. Figure 5, clockwise from the top left, displays the Doppler function and corresponding noisy observations, the trend filtering estimate with 50 degrees of freedom, the smoothing spline estimate with 50 degrees of freedom, and the smoothing spline estimate with 90 degrees of freedom. The same story, as in the hills example, holds here: trend filtering adapts to the local level of smoothness better than smoothing splines, which have trouble with the rapidly increasing frequency of the Doppler function (as decreases).

In Figure 6, we display the input-averaged squared error losses3

 1nn∑i=1(^βi−f0(xi))2and1nn∑i=1(^f(xi)−f0(xi))2

for the trend filtering and smoothing spline estimates and , respectively, on the hills and Doppler examples. We considered a wide range of model complexities indexed by degrees of freedom, and averaged the results over 50 simulated data sets for each setup (the dotted lines show plus or minus one standard deviations). Aside from the visual evidence given in Figures 4 and 5, Figure 6 shows that from the perspective of squared error loss, trend filtering outperforms smoothing splines in estimating underlying functions with variable spatial smoothness. As mentioned previously, we will prove in Section 5 that for a large class of underlying functions , trend filtering estimates have a sharper convergence rate than smoothing splines.

### 2.3 Computational considerations.

Recall that the smoothing spline fitted values are given by

 N^θ=N(NTN+λΩ)−1NTy, (12)

where contains the evaluations of basis functions for the subspace of th degree natural splines with knots at the inputs, and contains their integrated products of their nd order derivatives, as in (2.1). Depending on exactly which basis we choose, computation of (12) can be fast or slow; by choosing the -spline basis functions, which have local support, the matrix is banded, and so the smoothing spline fitted values can be computed in operations [e.g., see de Boor (1978)]. In practice, these computations are extremely fast.

By comparison, Kim et al. (2009) suggest a primal–dual interior point method, as mentioned in Section 1.1, that computes the trend filtering estimate (at any fixed value of the tuning parameter ) by iteratively solving a sequence of banded linear systems, rather than just a single one. Theoretically, the worst-case number of iterations scales as , but the authors report that in practice the number of iterations needed is only a few tens, almost independent of the problem size . Hence, trend filtering computations with the primal–dual path interior point method are slower than those for smoothing splines, but not by a huge margin.

To compute the trend filtering estimates for the examples in the previous section, we actually used the dual path algorithm of Tibshirani and Taylor (2011), which was also discussed in Section 1.1. Instead of solving the trend filtering problem at a fixed value of , this algorithm constructs the solution path as varies from  to . Essentially, it does so by stepping through a sequence of estimates, where each step either adds one knot to or deletes one knot from the fitted piecewise polynomial structure. The computations at each step amount to solving two banded linear systems, and hence require operations; the overall efficiency depends on how many steps along the path are needed before the estimates of interest have been reached (at which point the path algorithm can be terminated early). But because knots can be both added and deleted to the fitted piecewise polynomial structure at each step, the algorithm can take much more than steps to reach an estimate with knots. Consider the Doppler data example, in the last section, with points: the path algorithm used nearly 4000 steps to compute the trend filtering estimate with 46 knots (50 degrees of freedom) shown in the upper right panel of Figure 5. This took approximately 28 seconds on a standard desktop computer, compared to the smoothing spline estimates shown in the bottom left and right panels of Figure 5, which took about 0.005 seconds each. We reiterate that in this period of time, the path algorithm for trend filtering computed a total of 4000 estimates, versus a single estimate computed by the smoothing spline solver. (A quick calculation, , shows that the time per estimate here is comparable.) For the hills data set in the last section, where , the dual path algorithm constructed the entire path of trend filtering estimates (consisting of 548 steps) in less than 3 seconds; both smoothing spline estimates took under 0.005 seconds each.

## 3 Comparison to locally adaptive regression splines.

Locally adaptive regression splines are an alternative to smoothing splines, proposed by Mammen and van de Geer (1997). They are more computationally intensive than smoothing splines but have better adaptivity properties (as their name would suggest). Let denote the inputs, assumed unique and ordered as in , and denote the observations. For the th order locally adaptive regression spline estimate, where is a given arbitrary integer (not necessarily odd, as is required for smoothing splines), we start by defining the knot superset

This is essentially just the set of inputs , but with points near the left and right boundaries removed. We then define the th order locally adaptive regression spline estimate as

 ^f=argminf∈Gk12n∑i=1(yi−f(xi))2+λ⋅TV(f(k)), (14)

where is now the th weak derivative of , is a tuning parameter, denotes the total variation operator, and is the set

 Gk={f\dvtx[0,1]→R\dvtxf is k% th degree spline with knots contained in T}. (15)

Recall that for a function , its total variation is defined as

 TV(f)=sup{p∑i=1∣∣f(zi+1)−f(zi)∣∣\dvtxz1<⋯

and this reduces to if is (strongly) differentiable.

Next, we briefly address the difference between our definition of locally adaptive regression splines in (14) and the original definition found in Mammen and van de Geer (1997); this discussion can be skipped without interrupting the flow of ideas. After this, we rewrite problem (14) in terms of the coefficients of with respect to a basis for the finite-dimensional set . For an arbitrary choice of basis, this new problem is of generalized lasso form, and in particular, if we choose the truncated power series as our basis for , it simply becomes a lasso problem. We will see that trend filtering, too, can be represented as a lasso problem, which allows for a more direct comparison between the two estimators.

### 3.1 Unrestricted locally adaptive regression splines.

For readers familiar with the work of Mammen and van de Geer (1997), it may be helpful to explain the difference between our definition of locally adaptive regression splines and theirs: these authors define the locally adaptive regression spline estimate as

 ^f∈argminf∈Fk12n∑i=1(yi−f(xi))2+λ⋅TV(f(k)), (16)

where is the set

 Fk={f\dvtx[0,1]→R\dvtxf is k times weakly differentiable and TV(f(k))<∞}.

[The element notation in (16) emphasizes the fact that the minimizer is not generally unique.] We call (16) the unrestricted locally adaptive regression spline problem, in reference to its minimization domain compared to that of (14). Mammen and van de Geer (1997) prove that the minimum in this unrestricted problem is always achieved by a th degree spline, and that this spline has knots contained in if or 1, but could have knots outside of (and in fact, outside of the input set ) if . In other words, the solution in (14) is always a solution in (16) when or 1, but this need not be true when ; in the latter case, even though there exists a th degree spline that minimizes (16), its knots could occur at noninput points.

The unrestricted locally adaptive regression estimate (16) is the main object of theoretical study in Mammen and van de Geer (1997), but practically speaking, this estimate is difficult to compute when , because the possible knot locations are generally not easy to determine [see also Rosset and Zhu (2007)]. On the other hand, the restricted estimate as defined in (14) is more computationally tractable. Fortunately, Mammen and van de Geer (1997) show that essentially all of their theoretical results for the unrestricted estimate also apply to the restricted estimate, as long as the input points are not spaced too far apart. In particular, for evenly spaced inputs, , , the convergence rates of the unrestricted and restricted estimates are the same. We therefore focus on the restricted problem (14) in the current paper, and mention the unrestricted problem (16) purely out of interest. For example, to anticipate results to come, we will show in Lemma 3 of Section 3.3 that the trend filtering estimate (2) for or 1 is equal to the locally adaptive regression spline estimate (14) (i.e., they match at the input points ); hence, from what we discussed above, it also solves the unrestricted problem in (16).

### 3.2 Generalized lasso and lasso form.

We note that the knot set in (13) has elements, so is spanned by basis functions, call them . Since each , is a th degree spline with knots in , we know that its th weak derivative is piecewise constant and (say) right-continuous, with jump points contained in ; therefore, writing and , we have

 TV(g(k)j)=n−k−1∑i=1∣∣g(k)j(ti)−g(k)j(ti−1)∣∣.

Similarly, any linear combination of satisfies

 TV((n∑j=1θjgj)(k))=n−k−1∑i=1∣∣∣n∑j=1(g(k)j(ti)−g(k)j(ti−1))⋅θj∣∣∣.

Hence, we can reexpress (14) in terms of the coefficients in its basis expansion with respect to ,

 ^θ=argminθ∈Rn12∥y−Gθ∥22+λ∥Cθ∥1, (17)

where contains the evaluations of over the inputs , and contains the differences in their th derivatives across the knots, that is,

 Gij = gj(xi)for i,j=1,…,n, (18) Cij = g(k)j(ti)−g(k)j(ti−1)for i=1,…,n−k−1,j=1,…,n. (19)

Given the solution in (17), we can recover the locally adaptive regression spline estimate in (14) over the input points by

 (^f(x1),…,^f(xn))=G^θ,

or, at an arbitrary point by

 ^f(x)=n∑j=1^θjgj(x).

The problem (17) is a generalized lasso problem, with predictor matrix and penalty matrix ; by taking to be the truncated power basis, we can turn (a block of) into the identity, and hence (17) into a lasso problem.

###### Lemma 1

Let denote the set defined in (13), and let denote the th order truncated power basis with knots in ,

 g1(x) = 1,g2(x)=x,…,gk+1(x)=xk, (20) gk+1+j(x) = (x−tj)k⋅1{x≥tj},j=1,…,n−k−1.

(For the case , we interpret .) Then the locally adaptive regression spline problem (14) is equivalent to the following lasso problem:

 ^θ=argminθ∈Rn12∥y−Gθ∥22+λn∑j=k+2|θj|, (21)

in that for . Here, is the basis matrix as in (18).

Lemma 1 follows from the fact that for the truncated power basis, the penalty matrix in (19) satisfies for , and otherwise. It is worth noting that Osborne, Presnell and Turlach (1998) investigate a lasso problem similar to (21) for the purposes of knot selection in regression splines.

Note that (21) is of somewhat nonstandard form for a lasso problem, because the penalty is only taken over the last components of . We will see next that the trend filtering problem in (2) can also be written in lasso form (again with the penalty summing over the last coefficients), and we will compare these two formulations. First, however, it is helpful to express the knot superset in (13) and the basis matrix in (18) in a more explicit form, for evenly spaced input points , (this being the underlying assumption for trend filtering). These become

 T={((k+2)/2+i)/n,\quad for i=1,…,n−k−1, if k is even,((k+1)/2+i)/n,\quad for i=1,…,n−k−1, if k is odd (22)

and

 Missing or unrecognized delimiter for \bigl (23)

(It is not really important to separate the definition of for from that for , even; this is only done to make transparent the structure of .)

### 3.3 Trend filtering in lasso form.

We can transform the trend filtering problem in (2) into lasso form, just like the representation for locally adaptive regression splines in (21).

###### Lemma 2

The trend filtering problem in (2) is equivalent to the lasso problem

 ^α=argminα∈Rn12∥y−Hα∥22+λn∑j=k+2|αj|, (24)

in that the solutions satisfy . Here, the predictor matrix is given by

where we define for all and

 σ(k)i=i∑j=1σ(k−1)jfor k=1,2,3,…,

that is, is the th order cumulative sum of .

The proof of this lemma basically inverts the st order discrete difference operator ; see the supplementary document [Tibshirani (2014)]. We remark that this result, in the special case of , can be found in Kim et al. (2009).

It is not hard to check that in the case or 1, the definitions of in (23) and in (25) coincide, which means that the locally adaptive regression spline and trend filtering problems (21) and (24) are the same. But when , we have , and hence the problems are different.

###### Lemma 3

Consider evenly spaced inputs , , and the basis matrices defined in (23), (25). If or 1, then , so the lasso representations for locally adaptive regression splines and trend filtering, (21) and (24), are the same. Therefore, their solutions are the same, or in other words,

 ^βi=^f(xi)for i=1,…,n,

where and are the solutions of the original trend filtering and locally adaptive regression spline problems, (2) and (14), at any fixed common value of the tuning parameter .

If , however, then , so the problems (21) and (24) are different, and hence the trend filtering and locally adpative regression spline estimators are generically different.

See the supplement for the proof [Tibshirani (2014)]. Though the trend filtering and locally adaptive regression spline estimates are formally different for polynomial orders , they are practically very similar (at all common values of ). We give examples of this next, and then compare the computational requirements for the two methods.

### 3.4 Empirical comparisons.

We revisit the hills and Doppler examples of Section 2.2. Figure 7 displays, for (cubic order), the trend filtering and locally adaptive regression spline estimates at matching values of the tuning parameter . The estimates are visually identical in both examples (but not numerically identical—the average squared difference between the estimates across the input points is around for the hills example, and for the Doppler example). This remains true for a wide range of common tuning parameter values, and only for very small values of do slight differences between the two estimators become noticeable.

Nothing is special about the choice here or about these data sets in particular: as far as we can tell, the same phenomenon occurs for any polynomial order , and any set of observations. This extreme similarity between the two estimators, holding in finite sample and across essentially all common tuning parameter values, is beyond what we show theoretically in Section 5. In this section, we prove that for tuning parameters of a certain order, the two estimators converge asymptotically at a fast rate. Sharper statements are a topic for future work.

### 3.5 Computational considerations.

Both the locally adaptive regression spline and trend filtering problems can be represented as lasso problems with dense, square predictor matrices, as in (21) and (24). For trend filtering, we do this purely for analytical reasons, and computationally it is much more efficient to work from its original representation in (2), where the penalty operator is sparse and banded. As discussed in Sections 1.1 and 2.3, two efficient algorithms for trend filtering are the primal–dual interior point method of Kim et al. (2009) and the dual path algorithm of Tibshirani and Taylor (2011); the former computes the trend filtering estimate at a fixed value of , in worst-case complexity [the authors claim that the practical complexity is closer to ]; the latter computes the entire solution path over , with each critical point in this piecewise linear path requiring operations.

For locally adaptive regression splines, on the other hand, there is not a better computational alternative than solving the lasso problem in (21). One can check that the inverse of the truncated power basis matrix is dense, so if we converted (21) to generalized lasso form [to match the form of trend filtering in (2)], then it would have a dense penalty matrix. And if we were to choose, for example, the -spline basis over the truncated power basis to parameterize in (14), then although the basis matrix would be sparse and banded, the resulting penalty matrix in (17) would be dense. In other words, to compute the locally adaptive regression spline estimate, we are more or less stuck with solving the lasso problem in (21), where is the dense predictor matrix in (23). This task is manageable for small or moderately sized problems, but for large problems, dealing with the dense matrix , and even holding it in memory, becomes burdensome.

To compute the locally adaptive regression spline estimates for the examples in the last section, we solved the lasso problem in (21) using the LARS algorithm for the lasso path [Efron et al. (2004)], as implemented by the lars R package.4 This particular algorithm was chosen for the sake of a fair comparison to the dual path algorithm used for trend filtering. For the Doppler data example with points, the LARS algorithm computed the locally adaptive regression spline estimate (shown in the right panel of Figure 7) in a comparable amount of time to that taken by the dual path algorithm for trend filtering—in fact, it was faster, at approximately 16 versus 28 seconds on a standard desktop computer. The real issue, however, is scalability. For points, each of these algorithms required about 4000 steps to compute their respective estimates; for 10,000 noisy observations from the Doppler curve, the dual path algorithm completed 4000 steps in a little under 2.5 minutes, whereas the LARS algorithm completed 4000 steps in 1 hour. Furthermore, for problem sizes somewhat larger than 10,000, just fitting the basis matrix used by the LARS algorithm into memory becomes an issue.

## 4 Continuous-time representation.

Section 3.3 showed that the trend filtering minimization problem (2) can be expressed in lasso form (24), with a predictor matrix as in (25). The question we consider is now: is there a set of basis functions whose evaluations over the inputs give this matrix ? Our next lemma answers this question affirmatively.

###### Lemma 4

Given inputs , consider the functions defined as

 h1(x) = 1,h2(x)=x,…,hk+1(x)=xk, (26) hk+1+j(x) = k∏ℓ=1(x−xj+ℓ)⋅1{x≥xj+k},j=1,…,n−k−1.

If the input points are evenly spaced over , for , then the trend filtering basis matrix in (25) is generated by evaluating the functions over , that is,

 Hij=hj(xi),i,j=1,…,n. (27)

The proof is given in the supplementary document [Tibshirani (2014)]. As a result of the lemma, we can alternatively express the trend filtering basis matrix in (25) as

 Hij=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩ij−1/nj−1,for i=1,…,n, j=1,…,k+1,0,for i≤j−1, j≥k+2,k∏ℓ=1(i−(j−k−1+ℓ))/nk,% for i>j−1, j≥k+2. (28)

This is a helpful form for bounding the difference between the entries of and , which is needed for our convergence analysis in the next section. Moreover, the functions defined in (26) give rise to a natural continuous-time parameterization for trend filtering.

###### Lemma 5

For inputs , and the functions in (26), define the linear subspace of functions

 Hk=span{h1,…,hn}={n∑j=1αjhj\dvtxα1,…,αn∈R}. (29)

If the inputs are evenly spaced, , , then the continuous-time minimization problem

 ^f=argminf∈Hk12n∑i=1(yi−f(xi))2+λ⋅TV(f(k)) (30)

(where as before, is understood to mean the th weak derivative of ) is equivalent to the trend filtering problem in (2), that is, their solutions match at the input points,

 ^βi=^f(xi)for i=1,…,n.

This result follows by expressing in (30) in finite-dimensional form as , and then applying Lemmas 4 and 2. Lemma 5 says that the components of trend filtering estimate, , can be seen as the evaluations of a function over the input points, where solves the continuous-time problem (30). The function is a piecewise polynomial of degree , with knots contained in , and for , it is continuous since each of the basis functions are continuous. Hence, for or 1, the continuous-time trend filtering estimate is a spline (and further, it is equal to the locally adaptive regression spline estimate by Lemma 3). But is not necessarily a spline when , because in this case it can have discontinuities in its lower order derivatives (of orders ) at the input points. This is because each basis function , , though infinitely (strongly) differentiable in between the inputs, has discontinuous derivatives of all lower orders at the input point . These discontinuities are visually quite small in magnitude, and the basis functions look extremely similar to the truncated power basis functions ; see Figure 8 for an example.

Loosely speaking, the basis functions in (26) can be thought of as the falling factorial analogues of the truncated power basis in (20). One might expect then that the subspaces of th degree piecewise polynomial functions and are fairly close, and that the (continuous-time) trend filtering and locally adaptive regression spline problems (30) and (14) admit similar solutions. In the next section, we prove that (asymptotically) this is indeed the case, though we do so by instead studying the discrete-time parameterizations of these problems. We show that over a broad class of true functions , trend filtering estimates inherit the minimax convergence rate of locally adaptive regression splines, because the two estimators converge to each other at this same rate.

## 5 Rates of convergence.

In this section, we assume an observation model

 yi=f0(xi)+εi,i=1,…,n, (31)

where , are evenly spaced input points, is an unknown regression function to be estimated, and , are i.i.d. sub-Gaussian errors with zero mean, that is,

 E[εi]=0,P(|εi|>t)≤Mexp(−t2/(2σ2)) (32) \eqntextfor all t>0,i=1,…,n (33)

for some constants . [We will write to denote that is bounded in probability, for random sequences . We will also write to denote for constant sequences , and finally to denote and .]

In Mammen and van de Geer (1997), the authors consider the same setup, and study the performance of the locally adaptive regression spline estimate (14) when the true function belongs to the set

 Fk(C)={f\dvtx[0,1]→R\dvtxf is k times weakly differentiable and TV(f(k))≤C}

for some order and constant . Theorem 10 of Mammen and van de Geer (1997) shows that the th order locally adaptive regression spline estimate in (14), with , satisfies

 Extra close brace or missing open brace (34)

and also that .

### 5.1 Minimax convergence rate.

We note that the rate in (34) is the minimax rate for estimation over the function class , provided that . To see this, define the Sobolev smoothness class

 Wk(C)={f\dvtx[0,1]→R\dvtxf is% k times differentiable and ∫10(f(k)(t))2dt≤C}.

Minimax rates over the Sobolev classes are well-studied, and it is known [e.g., see Nussbaum (1985)] that

 min