Additive Models with Trend Filtering
We consider additive models built with trend filtering, i.e., additive models whose components are each regularized by the (discrete) total variation of their st (discrete) derivative, for a chosen integer . This results in th degree piecewise polynomial components, (e.g., gives piecewise constant components, gives piecewise linear, gives piecewise quadratic, etc.). Analogous to its advantages in the univariate case, additive trend filtering has favorable theoretical and computational properties, thanks in large part to the localized nature of the (discrete) total variation regularizer that it uses. On the theory side, we derive fast error rates for additive trend filtering estimates, and show these rates are minimax optimal when the underlying function is additive and has component functions whose derivatives are of bounded variation. We also show that these rates are unattainable by additive smoothing splines (and by additive models built from linear smoothers, in general). On the computational side, as per the standard in additive models, backfitting is an appealing method for optimization, but it is particularly appealing for additive trend filtering because we can leverage a few highly efficient univariate trend filtering solvers. Going one step further, we derive a new backfitting algorithm whose iterations can be run in parallel, which (as far as we know) is the first of its kind. Lastly, we present experiments that examine the empirical performance of additive trend filtering, and outline some possible extensions.
As the dimension of the input space grows large, nonparametric regression turns into a notoriously difficult problem. In this work, we adopt the stance taken by many others, and consider an additive model for responses , and corresponding input points , , of the form
where is an overall mean parameter, each is a univariate function with for identifiability, , and the errors , are i.i.d. with mean zero. A comment on notation: here and throughout, when indexing over the samples we use superscripts, and when indexing over the dimensions we use subscripts, so that, e.g., denotes the th component of the th input point. (Exceptions will occasionally be made, but the role of the index should be clear from the context.)
Additive models are a special case of the more general projection pursuit regression model of Friedman and Stuetzle (1981). Additive models for the Cox regression and logistic regression settings were studied in Tibshirani (1983) and Hastie (1983), respectively. Some of the first asymptotic theory for additive models was developed in Stone (1985). Two algorithms closely related to (backfitting for) additive models are the alternating least squares and alternating conditional expectations methods, from van der Burg and de Leeuw (1983) and Breiman and Friedman (1985), respectively. The work of Buja et al. (1989) advocates for the use of additive models in combination with linear smoothers, a surprisingly simple combination that gives rise to flexible and scalable multidimensional regression tools. The book by Hastie and Tibshirani (1990) is the definitive practical guide for additive models for exponential family data distributions, i.e., generalized additive models.
More recent work on additive models is focused on high-dimensional nonparametric estimation, and here the natural goal is to induce sparsity in the component functions, so that only a few select dimensions of the input space are used in the fitted additive model. Some nice contributions are given in Lin and Zhang (2006); Ravikumar et al. (2009); Meier et al. (2009), all primarily focused on fitting splines for component functions and achieving sparsity through a group lasso type penalty. In other even more recent and interesting work sparse additive models, Lou et al. (2016) consider a semiparametric (partially linear) additive model, and Petersen et al. (2016) consider a formulation that uses fused lasso (i.e., total variation) penalization applied to the component functions.
The literature on additive models (and by now, sparse additive models) is vast and the above is far form a complete list of references. In this paper, we examine a method for estimating additive models wherein each component is fit in a way that is locally adaptive to the underlying smoothness along its associated dimension of the input space. The literature on this line of work, as far as we can tell, is much less extensive. First, we review linear smoothers in additive models, motivate our general goal of local adaptivity, and then describe our specific proposal.
1.1 Review: additive models and linear smoothers
The influential paper by Buja et al. (1989) studies additive minimization problems of the form
where denotes the vector of responses, and is its centered version, with denoting the sample mean of , and the vector of all 1s. Each vector represents the evaluations of the th component function in our model, i.e., tied together by the relationship
In the problem (1), is a regularization parameter and , are penalty matrices. As a typical example, we might consider to be the Reinsch penalty matrix for smoothing splines along the th dimension of the input space, for . Under this choice, a backfitting (block coordinate descent) routine for (1) would repeatedly cycle through the updates
where the th update fits a smoothing spline to the th partial residual, over the th dimension of the input points, denoted by . At convergence, we arrive at an additive smoothing spline estimate, which solves (1).
Modeling the component functions as smoothing splines is arguably the most common formulation for additive models, and it is the standard in several statistical software packages like the R package gam. As Buja et al. (1989) explain, the backfitting steps in (2) suggest that a more algorithmic approach to additive modeling can be taken. Instead of starting with a particular criterion in mind, as in (2), one can instead envision repeatedly cycling through updates
where each is a particular (user-chosen) linear smoother, meaning, a linear map that performs a univariate smoothing across the th dimension of inputs . The linear smoothers , could correspond to, e.g., smoothing splines, regression splines (regression using a spline basis with given knots), kernel smoothing, local polynomial smoothing, or a combination of these, across the input dimensions. The convergence point of the iterations (3) solves a problem of the form (1) with , where is the Moore-Penrose pseudoinverse of , for .
The class of linear smoothers is broad enough to offer fairly flexible, interesting mechanisms for smoothing, and simple enough to understand precisely. Buja et al. (1989) provide a unified analysis of additive models with linear smoothers, in which they derive the effective degrees of freedom of these estimators and a generalized cross-validation routine for tuning; they also study fundamental properties such as uniqueness of the component fits, and convergence of the backfitting steps.
Much of the work following Buja et al. (1989) remains in keeping with the idea of using linear smoothers in combination with additive models. Studying high-dimensional additive models, Lin and Zhang (2006); Ravikumar et al. (2009); Meier et al. (2009); Koltchinskii and Yuan (2010); Raskutti et al. (2012) all essentially build their methods off of linear smoothers, with modifications to induce sparsity in the estimated component functions. Ravikumar et al. (2009) consider a sparsified version of backfitting in (3), while the others consider penalized versions of the additive criterion in (1).
1.2 The limitations of linear smoothers
The beauty of linear smoothers lies in their simplicity. However, with this simplicity comes serious limitations, in terms of their ability to adapt to varying local levels of smoothness. In the univariate setting, the seminal theoretical work by Donoho and Johnstone (1998) makes this idea precise. With , suppose that underlying regression function lies in the univariate function class
for a constant , where is the total variation operator, and the th weak derivative of . The class in (4) allows for greater fluctuation in the local level of smoothness of than, say, more typical function classes like Holder and Sobolev spaces. The results of Donoho and Johnstone (1998) (see also Section 5.1 of Tibshirani (2014)) imply that the minimax error rate for estimation over is , but the minimax error rate when we consider only linear smoothers (linear transformations of ) is . This difference is highly nontrivial, e.g., for this is a difference of (optimal) versus (optimal among linear smoothers) for estimating a function of bounded variation.
It is important to emphasize that this shortcoming is not just a theoretical one; it is also clearly noticeable in basic practical examples. This does not bode well for additive models built from linear smoothers, when estimating component functions , that display locally heterogeneous degrees of smoothness. Just as linear smoothers will struggle in the univariate setting, an additive estimate based on linear smoothers will not be able to efficiently track local changes in smoothness, across any of the input dimensions. This could lead to a loss in accuracy even if only some (or one) of the components , possesses heterogeneous smoothness across its domain.
Two well-studied univariate estimators that are locally adaptive, i.e., that attain the minimax error rate over the th order total variation class in (4), are wavelet smoothing and locally adaptive regression splines, as developed by Donoho and Johnstone (1998) and Mammen and van de Geer (1997), respectively. There is a substantial literature on these methods in the univariate case (especially for wavelets), but fewer authors have considered using these locally adaptive estimators in the additive models context. Notable exceptions are Zhang and Wong (2003); Sardy and Tseng (2004), who study additive models built from wavelets, and Petersen et al. (2016), who study sparse additive models with components given by 0th order locally adaptive regression splines (or equivalently, the components are regularized via fused lasso penalties, i.e., total variation penalties). The latter work is especially related to our focus in this paper.
1.3 Additive trend filtering
We consider additive models that are constructed using trend filtering (instead of linear smoothers, wavelets, or locally adaptive regression splines) as their componentwise smoother. Proposed independently by Steidl et al. (2006) and Kim et al. (2009), trend filtering is a relatively new approach to univariate nonparametric regression. As explained in Tibshirani (2014), it can be seen as a discrete-time analog of the locally adaptive regression spline estimator. Denoting by the vector of univariate input points, where we assume , the trend filtering estimate of order is defined as the solution of the optimization problem
where is a tuning parameter, and is a th order difference operator, constructed based on . These difference operators can be defined recursively, as in
(The leading matrix in (7) is the version of the difference operator in (6).) Intuitively, the interpretation is that the problem (5) penalizes the sum of absolute st order discrete derivatives of across the input points . Thus, at optimality, the coordinates of the trend filtering solution obey a th order piecewise polynomial form.
This intuition is formalized in Tibshirani (2014) and Wang et al. (2014), where it is shown that the components of the th order trend filtering estimate are precisely the evaluations of a fitted th order piecewise polynomial function across the inputs, and that the trend filtering and locally adaptive regression spline estimates of the same order are asymptotically equivalent. When or , in fact, there is no need for asymptotics, and the equivalence between trend filtering and locally adaptive regression spline estimates is exact in finite samples. It is also worth pointing out that when , the trend filtering estimate reduces to the 1d fused lasso estimate (Tibshirani et al., 2005), which is known as 1d total variation denoising in signal processing (Rudin et al., 1992).
Over the th order total variation function class defined in (4), Tibshirani (2014); Wang et al. (2014) prove that th order trend filtering achieves the minimax optimal error rate, just like th order locally adaptive regression splines. Another important property, as developed by Kim et al. (2009); Tibshirani (2014); Ramdas and Tibshirani (2016), is that trend filtering estimates are relatively cheap to compute—much cheaper than locally adaptive regression spline estimates—owing to the bandedness of the difference operators in (6), (7), which means that specially implemented convex programming routines can solve (5) in an efficient manner.
It is this computational efficiency, along with its capacity for local adaptivity, that makes trend filtering a particularly desirable candidate to extend to the additive model setting. Specifically, we consider the additive trend filtering estimate of order , defined as a solution in the problem
As before, is the centered response vector, and is a regularization parameter. Not to be confused with the notation for linear smoothers from a previous subsection, in (8) is a permutation matrix that sorts the th component of inputs into increasing order, i.e.,
Also, in (8) is the st order difference operator, as in (6), (7), but defined over the sorted th dimension of inputs , for . With backfitting (block coordinate descent), computation of a solution in (8) is still quite efficient, since we can leverage the efficient routines for univariate trend filtering.
1.4 A motivating example
Figure 1 shows a simulated example that compares the additive trend filtering estimates in (8) (of quadratic order, ), to the additive smoothing spline estimates in (1) (of cubic order). In the simulation, we used and . We drew input points , , and drew responses , , where was set to give a signal-to-noise ratio of about 1. The underlying component functions were defined as
so that possess different levels of smoothness ( being the smoothest, less smooth, and the least smooth), and so that itself has heteregeneous smoothness across its domain.
The first row of Figure 1 shows the estimated component functions from additive trend filtering, at a value of that minimizes the mean squared error (MSE), computed over 20 repetitions. The second row shows the estimates from additive smoothing splines, also at a value of that minimizes the MSE. We see that the trend filtering fits adapt well to the varying levels of smoothness, but the smoothing spline fits are undersmoothed, for the most part. In terms of effective degrees of freedom (df), the additive smoothing spline estimate is much more complex, having about 85 df (computed via Monte Carlo over the 20 repetitions); the additive trend filtering has only about 42 df. The third row of the figure shows the estimates from additive smoothing splines, when is chosen so that the resulting df is roughly matches that of additive trend filtering in the first row. Now we see that the first component fit is oversmoothed, yet the third is still undersmoothed.
Figure 2 displays the MSE curves from additive trend filtering, as a function of df. We see that trend filtering achieves a lower MSE, and moreover, its MSE curve is optimized at a lower df (i.e., less complex model) than that for smoothing splines. This is analogous to what is often seen in the univariate setting (Tibshirani, 2014).
1.5 Summary of contributions
A summary of our contributions, and an outline for the rest of this paper, are given below.
In Section 2, we develop some basic properties of the additive trend filtering model: an equivalent continuous-time formulation, a condition for uniqueness of component function estimates, and a simple formula for the effective degrees of freedom of the additive fit.
In Section 3, we introduce two estimators related to additive trend filtering, based on splines, that facilitate theoretical analysis (and are perhaps of interest in their own right).
In Section 4, we derive error bounds for additive trend filtering. Assuming that the underlying regression function is additive, denoted by , and that is bounded, for , we prove that the th order additive trend filtering estimator converges to at the rate when the dimension is fixed (under weak assumptions), and at the rate when is growing (under stronger assumptions). We prove that these rates are optimal in a minimax sense, and also show that additive smoothing splines (or more generally, additive models built from linear smoothers of any kind) are suboptimal over such a class of functions .
In Section 5, we study the backfitting algorithm for additive trend filtering models, and give a connection between backfitting and an alternating projections scheme in the additive trend filtering dual problem. This inspires a new, provably convergent, parallelized backfitting algorithm for additive trend filtering.
2 Basic properties
In this section, we derive a number of basic properties of additive trend filtering estimates, starting with a representation for the estimates as continuous functions over (rather than simply discrete fitted values at the input points).
2.1 Falling factorial representation
We may describe additive trend filtering in (8) as an estimation problem written in analysis form. The components are modeled directly by the parameters , , and the desired structure is established by regularizing the discrete derivatives of these parameters, through the penalty terms , . Here, we present an alternative representation for (8) in basis form, where each component is expressed as a linear combination of basis functions, and regularization is applied to the coefficients in this expansion.
Before we derive the basis formulation that underlies additive trend filtering, we first recall the falling factorial basis (Tibshirani, 2014; Wang et al., 2014). Given knot points , the th order falling factorial basis functions are defined by
We denote when , and 0 otherwise. (Also, our convention is to define the empty product to be 1, so that .) The functions are piecewise polynomial functions of order , and appear very similar in form to the th order truncated power basis functions. In fact, when or , the two bases are exactly equivalent (meaning that they have the same span). Similar to an expansion in the truncated power basis, an expansion in the falling factorial basis,
is a continuous piecewise polynomial function, having a global polynomial structure determined by , and exhibiting a knot—i.e., a change in its th derivative—at the location when . But, unlike the truncated power functions, the falling factorial functions in (9) are not splines, and when (as defined above) has a knot at a particular location, it displays a change not only in its th derivative at this location, but also in all lower order derivatives (i.e., all derivatives of orders ).
Tibshirani (2014); Wang et al. (2014) establish a connection between univariate trend filtering and the falling factorial functions, and show that the trend filtering problem can be interpreted as a sparse basis regression problem using these functions. As we show next, the analogous result holds for additive trend filtering.
Lemma 1 (Falling factorial representation).
An alternative way of expressing problem (10) is
For , define the th order falling factorial basis matrix by
Note that the columns of follow the order of the sorted inputs , but the rows do not; however, for , both its rows and columns of follow the order of . From Wang et al. (2014), we know that
for some matrix , i.e.,
and so , for each . ∎
This lemma not only provides an interesting reformulation for additive trend filtering, it is also practically useful in that it allows us to perform interpolation or extrapolation using the additive trend filtering model. That is, from a solution in (8), we can extend each component fit to the real line, by forming an appropriate linear combination of falling factorial functions:
The coefficients above are determined by the relationship , and are easily computable given the highly structured form of , as revealed in (13). Writing the coefficients in block form, as in , we have
The first coefficients index the pure polynomial functions . These coefficients will be generically dense (the form of is not important here, so we omit it for simplicity, but details are given in Appendix A.1). The last coefficients index the knot-producing functions , and when , the fitted function exhibits a knot at the th sorted input point among , i.e., at . Figure 3 gives an example.
2.2 Uniqueness of component fits
It is easy to see that, for the problem (8), the additive fit is always uniquely determined: denoting for a linear operator and , the loss term is strictly convex in the variable , and this, along with the convexity of the problem (8), implies a unique additive fit , no matter the choice of solution .
On the other hand, when , the criterion in (8) is not strictly convex in , and hence there need not be a unique solution , i.e., the individual components fits , need not be uniquely determined. We show next that uniqueness of the component fits can be guaranteed under some conditions on the input matrix . We will rely on the falling factorial representation for additive trend filtering, introduced in the previous subsection, and on the notion of general position: a matrix is said to have columns in general position provided that, for any , subset of columns denoted , and signs , the affine span of does not contain any element of . Informally, if the columns of are not in general position, then there must be some small subset of columns that are affinely dependent.
Lemma 2 (Uniqueness).
For , let be the falling factorial basis matrix constructed over the sorted th dimension of inputs , as in (12). Decompose into its first columns , and its last columns . The former contains evaluations of the pure polynomials ; the latter contains evaluations of the knot-producing functions . Also, let denote the matrix with its first column removed, for , and . Define
the product of and the columnwise concatenation of , . Let denote the projection operator onto the space orthogonal to the column span of , where has orthonormal columns, and define
the product of and the columnwise concatenation of , . A sufficient condition for uniqueness of the additive trend filtering solution in (8) can now be given in two parts.
If has columns in general position, then the knot-producing parts of all component fits are uniquely determined, i.e., for each , the projection of onto the column space of is unique.
If in addition has full column rank, then the polynomial parts of component fits are uniquely determined, i.e., for each , the projection of onto the column space of is unique, and thus the component fits , are all unique.
The proof is deferred to Appendix A.2. To rephrase, the above lemma decomposes each component of the additive trend filtering solution according to
where exhibits a purely polynomial trend over , and exhibits a piecewise polynomial trend over , and hence determines the knot locations, for . The lemma shows that the knot-producing parts , are uniquely determined when the columns of are in general position, and the polynomial parts , are unique when the columns of are in general position, and the columns of are linearly independent.
The conditions placed on in Lemma 2 are not strong. When , and the elements of input matrix are drawn from a density over , it is not hard to show that has full column rank with probability 1. We conjecture that, under the same conditions, will also have columns in general position with probability 1, but do not pursue a proof.
Remark 1 (Relationship to concurvity).
It is interesting to draw a connection to Buja et al. (1989). In the language used by these authors, when has linearly dependent columns, we say that the predictor variables display concurvity, i.e., linear dependence after nonlinear (here, polynomial) transformations are applied. Buja et al. (1989) establish that the components in the additive model (1), built with quadratic penalties, are unique provided there is no concurvity between variables. In comparison, Lemma 2 establishes uniqueness of the additive trend filtering components when there is no concurvity between variables, and additionally, the columns of are in general position. The latter two conditions together can be seen as requiring no generalized concurvity—if were to fail the general position assumption, then there would be a small subset of the variables that are linearly dependent after nonlinear (here, piecewise polynomial) transformations are applied.
2.3 Dual problem
Let us abbreviate , for the penalty matrices in the additive trend filtering problem (8). Basic arguments in convex analysis, deferred to Appendix A.3, show that the dual of problem (8) can be expressed as:
From the form of (19), it is clear that we can write the (unique) dual solution as , where is the (Euclidean) projection operator onto . Moreover, using (20), we can express the additive fit as , where is the operator that gives the residual from projecting onto . These relationships will be revisited in Section 5, where we return to the dual perspective, and argue that the backfitting algorithm for the additive trend filtering problem (8) can be seen as a type of alternating projections algorithm for its dual problem (19).
2.4 Degrees of freedom
where . Roughly speaking, the above definition sums the influence of the th component on its corresponding fitted value , across . A precise understanding of degrees of freedom is useful for model comparisons (recall the x-axis in Figure 2), and other reasons. For linear smoothers, in which for some , it is clear that , the trace of . (This also covers additive models whose components are built from univariate linear smoothers, because in total these are still just linear smoothers: the additive fit is still just a linear function of .)
Of course, additive trend filtering is a not a linear smoother; however, it is a particular type of generalized lasso estimator, and degrees of freedom for such a class of estimators is well-understood (Tibshirani and Taylor, 2011, 2012). The next result is an consequence of existing generalized lasso theory, proved in Appendix A.4.
Lemma 3 (Degrees of freedom).
Assume the conditions of Lemma 2, i.e., that the matrix in (17) has full column rank, and the matrix in (18) is in general position. Assume also that the response is Gaussian, , and treat the input points , as fixed and arbitrary, as well as the tuning parameter value . Then the additive trend filtering fit from (8) has degrees of freedom
Remark 2 (The effect of shrinkage).
Lemma 3 says that for an unbiased estimate of the degrees of freedom of the additive trend filtering fit, we count the number of knots in each component fit (recall that this is the number of nonzeros in , i.e., the number of changes in the discrete st derivative), add them up over , and add . This may seem surprising, as these knot locations are chosen adaptively based on the data . But, such adaptivity is counterbalanced by the shrinkage induced by the penalty in (8) (i.e., for each component fit , there is shrinkage in the differences between the attained th derivatives on either side of a selected knot). See Tibshirani (2015) for a study of this phenomenon.
Remark 3 (Easy unbiased degrees of freedom estimation).
It is worth emphasizing that an unbiased estimate from Lemma 3 for the degrees of freedom of the total fit in additive trend filtering is very easy to calculate: we scan the individual component fits and add up the number of knots that appear in each one. The same cannot be said for additive smoothing splines, or additive models built from univariate linear smoothers, in general. Although computing the fit itself is typically cheaper with additive linear smoothers than with additive trend filtering, computing the degrees of freedom is more challenging. For example, for the additive model in (1) built with quadratic penalties, we have
where has copies of the centering matrix stacked across its columns, is a block diagonal matrix with blocks , , and denotes the Moore-Penrose pseudoinverse of a matrix . The above formula does not obviously decompose into a sum of quantities across components, and is nontrivial to compute post optimization of (1), specifically when a backfitting algorithm as in (2) has been used to compute a solution.
3 Two related additive spline estimators
Additive trend filtering is closely related to two other additive spline estimators, introduced here.
3.1 Additive locally adaptive splines
As a natural extension of the univariate estimator given in Mammen and van de Geer (1997), let us define the additive locally adaptive regression spline estimate of order as a solution in
Note the close analogy to the additive trend filtering problem, when written in the functional form (11): the above minimization problem considers all additive functions (whose components are all times weakly differentiable), but the additive trend filtering problem just considers functions whose components are in the span of falling factorial functions, , .
The univariate results in Mammen and van de Geer (1997) extend immediately to the additive case, and imply that the infinite-dimensional locally adaptive regression spline problem (21) has a solution such that each component function , is a spline of degree (i.e., justifying the name). These results further imply that when or , the knots of the spline lie among the th dimension of the input points , for . But for , this need not be true, and in general the component functions can be splines with knots at locations other than the inputs, making computation of the additive locally adaptive regression spline estimate very difficult.
3.2 Restricted additive locally adaptive splines
To simplify the computational problems just raised, we define the restricted additive locally adaptive regression spline estimator of order as a solution in
Here, we restrict the minimization domain, by requiring , where is the space of splines of degree with knots lying among the th dimension of inputs , for . More precisely, we require that the splines in have knots in a set , which, writing for the sorted inputs along the th dimension, is defined by
i.e., defined by removing inputs at the boundaries, for . This makes (22) a finite-dimensional problem, just like (11). When or , as claimed in Section 2.1 (and shown in Tibshirani (2014)), the falling factorial functions are simply splines, which means that for , hence (11) and (22) are equivalent problems. When , this is no longer true, and the solutions in (11) and (22) will differ; the former will be much easier to compute, since does not admit a nice representation in terms of discrete derivatives for a th order spline (and yet it does for a th order falling factorial function, as seen in (8)).
To summarize, when or , all three total variation penalized additive estimators as defined in (21), (22), and (11) are equivalent. For , these estimators are generically different, though our intuition tells us that their differences should not be too large: the first problem admits a solution that is a spline in each component; the second restricts these splines to have knots at the input points; the third problem swaps splines for falling factorial functions, which are highly similar in form. Next we give theory given that will confirm this intuition, in large samples.
4 Error bounds
We derive error bounds for additive trend filtering and additive locally adaptive regression splines (both the unrestricted and restricted variants), when the underlying regression function is additive, and has components whose derivatives are of bounded variation. These results are actually special cases of a more general result we prove in this section, on a generic roughness-penalized estimator of an additive model, where we assume to have control over the entropy of the unit ball in the penalty functional. We treat separately the settings in which the dimension of the input space is fixed and growing. We also complement our error rates with minimax lower bounds. But first, we introduce some helpful notation.
Given a distribution supported on a set , and i.i.d. samples , from , denote by the associated empirical distribution. We define the and inner products, denoted