Exact Post-Selection Inference for Changepoint Detection and Other Generalized Lasso Problems


Abstract

We study tools for inference conditioned on model selection events that are defined by the generalized lasso regularization path. The generalized lasso estimate is given by the solution of a penalized least squares regression problem, where the penalty is the ℓ1 norm of a specified penalty matrix times the coefficient vector. The generalized lasso path collects these estimates as the penalty parameter varies (from ∞ down to 0). Leveraging a (sequential) characterization of this path from [?], and recent advances in post-selection inference from [?], we develop exact hypothesis tests and confidence intervals for linear contrasts of the underlying mean vector, conditioned on any model selection event along the generalized lasso path (assuming Gaussian errors in the observations).

Our construction of inference tools holds for any penalty matrix. By inspecting specific choices of this matrix, we obtain post-selection tests and confidence intervals for specific cases of generalized lasso estimates, such as the fused lasso, trend filtering, and the graph fused lasso. In the fused lasso case, the underlying coordinates of the mean are assigned a linear ordering, and our framework allows us to test selectively chosen breakpoints or changepoints in these mean coordinates. This is an interesting and well-studied problem with broad applications; our framework applied to the trend filtering and graph fused lasso cases serves several applications as well. Aside from the development of selective inference tools, we describe several practical aspects of our methods, such as (valid, i.e., fully-accounted-for) post-processing of generalized lasso estimates before performing inference in order to improve power, and problem-specific visualization aids that may be given to the data analyst to help choose linear contrasts to be tested. Many examples, both from simulated and real data sources, are presented to examine the empirical properties of our inference methods.

Keywords: generalized lasso, fused lasso, trend filtering, changepoint detection, post-selection inference

1 Introduction

Consider a classic Gaussian model for observations , with known marginal variance ,

where the (unknown) mean is the parameter of interest. In this paper, we examine problems in which the mean is believed to have some specific structure (at least approximately so), in that it is sparse when parametrized with respect to a particular basis. A key example is the changepoint detection problem, in which the components of the mean correspond to ordered underlying positions (or locations), and many adjacent components are believed to be equal, with the exception of a sparse number of breakpoints or changepoints to be determined. See the left plot in Figure 1 for a simple example.

Many methods are available for estimation and detection in the changepoint problem. We focus on the 1-dimensional fused lasso [?], also called 1-dimensional total variation denoising [?] in signal processing, for reasons that will become clear shortly. This method, which we call the 1d fused lasso (or simply fused lasso) for short, is often used for piecewise constant estimation of the mean, but it does not come with associated inference tools after changepoints have been detected. In the top right panel of Figure 1, we inspect the 1d fused lasso estimate that has been tuned to detect two changepoints, in a data model where the mean only has one true changepoint. Writing the changepoint locations as , we might consider testing

where we write and for notational convenience. If we were to naively ignore the data-dependent nature of the detected changepoint locations (these are the estimated changepoints from the two-step fused lasso procedure), i.e., treat them as fixed, then the natural tests for the null hypotheses would be to reject for large magnitudes of the statistics

respectively, where we use to denote the average of components of between positions and . Indeed, these can be seen as likelihood ratio tests stemming from the Gaussian model in .
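To make the naive construction concrete, the following is a minimal sketch in Python/numpy (the function name and indexing conventions are ours, not the paper's): it averages the data over the two segments adjacent to a detected changepoint and compares the standardized difference to a standard normal reference.

```python
import numpy as np
from scipy.stats import norm

def naive_z_pvalue(y, b, left, right, sigma):
    """Minimal sketch of a naive (two-sided) Z-test at a detected changepoint.

    Compares the average of y over positions (left, b] with the average over
    (b, right], ignoring that b was selected by looking at the data.
    Positions are 1-indexed; left/right are the neighboring changepoints
    (or 0 and len(y))."""
    seg1 = y[left:b]                 # y_{left+1}, ..., y_b
    seg2 = y[b:right]                # y_{b+1}, ..., y_right
    diff = seg2.mean() - seg1.mean()
    se = sigma * np.sqrt(1.0 / seg1.size + 1.0 / seg2.size)
    return 2.0 * norm.sf(abs(diff) / se)
```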

Figure 1: A simple example with n=100 points generated around a piecewise constant mean with one true changepoint at location 50, shown in the top left panel. The 1d fused lasso path, stopped at the (end of the) second step, produces the estimate in the top right panel, with two detected changepoints at locations 11 and 50, labeled A and B in the figure. The table reports p-values from the naive Z-test, which does not account for the data-dependent nature of the changepoints, and from our TG test for the 1d fused lasso, which does.
Label   Location   Naive p-value   TG p-value
A       11         0.057           0.359
B       50         0.000           0.000

The table in Figure 1 shows the results of running such naive Z-tests. At location 50 (labeled B in the figure), which corresponds to a true changepoint in the underlying mean, the test returns a very small p-value, as expected. But at location 11 (labeled A in the figure), a spurious detected changepoint, the naive Z-test also produces a small p-value. This happens because the location has been selected by the 1d fused lasso, which inexorably links it to an unusually large magnitude of the corresponding statistic; in other words, it is no longer appropriate to compare this statistic against its supposed Gaussian null distribution, with mean zero and the variance given above. Also shown in the table are the results of running our new truncated Gaussian (TG) test for the 1d fused lasso, which properly accounts for the data-dependent nature of the changepoints detected by the fused lasso, and produces p-values that are exactly uniform under the null, conditional on having been selected by the fused lasso in the first place. We now see that only the changepoint at location 50 has a small associated p-value.

1.1 Summary

In this paper, we make the following contributions.

  • We introduce the application of post-selection inference tools to selection events defined by a class of methods called generalized lasso estimators. The key mathematical task is to show that the model selection event defined by any (fixed) step of the generalized lasso solution path can be expressed as a polyhedron in the observation vector (Section 3.1). The (conditionally valid) TG tests and confidence intervals of [?] can then be applied, to test or cover any linear contrast of the mean vector.

  • We describe a stopping rule based on a generic information criterion (akin to AIC or BIC), to select a step along the generalized lasso path at which we are to perform conditional inference. We give a polyhedral representation for the ultimate model selection event that encapsulates both the selected path step and the generalized lasso solution at this step (Section 3.2). Along with the TG tests and confidence intervals, this makes for a practical (nearly-automatic) and broadly applicable set of inference tools.

  • We study various special cases of the generalized lasso problem—namely, the 1d fused lasso, trend filtering, graph fused lasso, and regression problems—and for each, we develop specific forms for linear contrasts that can be used to test different population quantities of interest (Sections 4.1 through 4.5). In each case, we believe that our tests provide new advancements in the set of currently available inferential tools. For example, in the 1d fused lasso, i.e., the changepoint detection problem, our tests are the first that we know of that are specifically designed to yield proper inferences after changepoint locations have been detected.

  • We present two extensions of the basic tools described above for post-selection inference in generalized lasso problems: a post-processing tool, to improve the power of our methods, and a visualization aid, to improve practical usability.

  • We conduct comprehensive simulations across the various special problem cases, to investigate the (conditional) power of our methods, and verify their (conditional) type I error control (Sections 5.1 through 5.5). We also demonstrate a realistic application of our selective inference tools for changepoint detection to a data set of comparative genomic hybridization (CGH) measurements from two glioblastoma multiforme (GBM) tumors (Section 5.6).

1.2 Related work

Post-selection inference, also known as selective inference, is a new but rapidly growing field. Unlike other recent developments in high-dimensional inference using a more classic full-population model, the point of selective inference is to provide a means of testing hypotheses derived from a selected model, the output of an algorithm that has been applied to data at hand. In a sequence of papers, [?] prove impossibility results about estimating the post-selection distribution of certain estimators in a classical regression setting. [?], [?] circumvent this by more directly conducting inference on post-selection targets (rather than basing inference on the distribution of the post-selection estimator itself). The former work is very broad and considers all selection mechanisms in regression (hence yielding more conservative inference); the latter is much more specific and considers the lasso estimator in particular. [?] improve on the method in [?], and introduce a pivot-based framework for post-selection inference. [?] describe the application to the lasso problem at a fixed tuning parameter; [?] describe the application to the lasso path at a fixed number of steps (and also, the least angle regression and forward stepwise paths). A number of extensions to different problem settings are given in [?]. Asymptotics for non-Gaussian error distributions are presented in [?]. A broad treatment of selective inference in exponential family models and selective power is presented in [?]. An improvement based on auxiliary randomization is considered in [?]. A study of selective sequential tests and stopping rules is given in [?]. Ours is the first work to consider selective inference in structural problems like the generalized lasso.

Changepoint detection carries a huge body of literature; reviews can be found in, e.g., [?]. Far sparser is the literature on changepoint inference, say, inference for the location or size of changepoints, or segment lengths. [?] are some examples, and [?] provide nice surveys and extensions. The main tools are built around likelihood ratio test statistics comparing two nested changepoint models, but at fixed locations. Since interesting locations to be tested are typically estimated, these inferences can be clearly invalid (if estimation and inference are both done on the same data samples).

Probably most relevant to our goal of valid post-selection changepoint inference is [?], who develop a simultaneous confidence band for the mean in a changepoint model. Their Simultaneous Multiscale Changepoint Estimator (SMUCE) seeks the most parsimonious piecewise constant fit subject to an upper limit on a certain multiscale statistic, and is solved via dynamic programming. Because the final confidence band has simultaneous coverage (over all components of the mean), it also has valid coverage for any (data-dependent) post-selection target. In contrast, our proposal does not give simultaneous coverage of the mean, but rather, selective coverage of a particular post-selection target. An empirical comparison between the two methods (SMUCE, and ours) is given in Section 5.2. While this comparison is useful and informative, it is also worth emphasizing that the framework in this paper applies far outside of the changepoint detection problem, i.e., to trend filtering, graph clustering, and regression problems with structured coefficients.

1.3 Notation

For a matrix , we will denote by the submatrix whose rows are in a set . We write to mean . Similarly, for a vector , we write or to extract the subvector whose components are in or not in , respectively. We use for the pseudoinverse of a matrix , and , , for the column space, row space, and null space of , respectively. We write for the projection matrix onto a linear subspace . Lastly, we will often abbreviate a sequence by .

2 Preliminaries

2.1 The generalized lasso regularization path

Given a response , the generalized lasso estimator is defined by the optimization problem

where is a matrix of predictors, is a prespecified penalty matrix, and is a regularization parameter. The penalty matrix is chosen so that sparsity of the penalized product induces some type of desired structure in the solution. Important special cases, each corresponding to a specific class of penalty matrices, include the 1d fused lasso, trend filtering, and graph fused lasso problems. More details on these problems are given in Section 4; see also Section 2 in [?].

We review the algorithm of [?] to compute the entire solution path in , i.e., the continuum of solutions as the regularization parameter descends from ∞ down to 0. We focus on the problem of signal approximation, where :

For a general , a simple modification to the arguments used for will deliver the solution path for , and we refrain from describing this until Section 4.5. The path algorithm of [?] for is derived from the perspective of its equivalent Lagrange dual problem, namely

The primal and dual solutions, in and in , are related by

as well as

The strategy is now to compute a solution path in the dual problem, as the regularization parameter descends from ∞ down to 0, and then use the primal-dual relationship to deliver the primal solution path. Therefore it suffices to describe the path algorithm as it operates on the dual problem; this is given next.

Explained in words, the dual path algorithm in Algorithm ? tracks the coordinates of the computed dual solution that are equal to , i.e., that lie on the boundary of the constraint region . The collection of such coordinates, at any given step in the path, is called the boundary set, and is denoted . Critical values of the regularization parameter at which the boundary set changes (i.e., at which coordinates join or leave the boundary set) are called knots, and are denoted . From the form of the dual solution as presented in Algorithm ?, and the primal-dual relationship , the primal solution path may be expressed in terms of the current boundary set and boundary sign list , as in

As we can see, the primal solution lies in the subspace , which implies it expresses a certain type of structure. This will become more concrete as we look at specific cases for in Section 4, but for now, the important point is that the structure of the generalized lasso solution is determined by the boundary set . Therefore, by conditioning on the observed boundary set after a certain number of steps of the path algorithm, we are effectively conditioning on the observed model structure in the generalized lasso solution at this step. This is essentially what is done in Section 3.
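As a concrete aid, here is a minimal numpy sketch of reconstructing the primal solution from a given boundary set and sign list in the signal approximator case, assuming the standard projection characterization referenced above (projecting the sign-adjusted data onto the null space of the non-boundary rows of the penalty matrix); all names and conventions are ours.

```python
import numpy as np

def genlasso_primal_from_boundary(y, D, B, s, lam):
    """Sketch: signal-approximator primal solution at parameter lam, given the
    dual boundary set B (row indices of D) and boundary signs s. Follows the
    projection form: project (y - lam * D_B^T s) onto null(D_{-B})."""
    mask = np.ones(D.shape[0], dtype=bool)
    mask[B] = False
    D_minusB = D[mask, :]
    D_B = D[B, :]
    # projection onto null(D_{-B}) is I - pinv(D_{-B}) D_{-B}
    P_null = np.eye(D.shape[1]) - np.linalg.pinv(D_minusB) @ D_minusB
    return P_null @ (y - lam * D_B.T @ np.asarray(s, dtype=float))
```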

Lastly, we note the following important point. In some generalized lasso problems, Step 2(c) in Algorithm ? does not need to be performed, i.e., we can formally replace this step by , and accordingly, the boundary set will only grow over iterations . This is true, e.g., for all 1d fused lasso problems; more generally, it is true for any generalized lasso signal approximator problem in which is diagonally dominant.

2.2 Exact inference after polyhedral conditioning

Under the Gaussian observation model in , [?] build a framework for inference on an arbitrary linear contrast of the mean , conditional on , where is an arbitrary polyhedron. A core tool in these works is an exact pivotal statistic for , conditional on : they prove that there exist random variables such that

where denotes the cumulative distribution function of a univariate Gaussian random variable conditional on lying in the interval . The above is called the truncated Gaussian (TG) pivot. The truncation limits are easily computable, given a half-space representation for the polyhedron, where the inequality is to be interpreted componentwise. Specifically, we have

where . The TG pivotal statistic in enables us to test the null hypothesis against the one-sided alternative . Namely, it is clear that the TG test statistic

is itself a p-value for , with finite sample validity, conditional on . (A two-sided test is also possible: we simply use as our p-value; see [?] for a discussion of the merits of one-sided and two-sided selective tests.) Confidence intervals follow directly from as well. For an (equi-tailed) interval with exact finite sample coverage , conditional on the event , we take , where are obtained by inverting the TG pivot, i.e., defined to satisfy

At this point, it may seem unclear how this framework applies to post-selection inference in generalized lasso problems. The key ingredients are, of course, the polyhedron and the contrast vector . In the next section, we will show how to construct polyhedra that correspond to model selection events of interest, at points along the generalized lasso path. In the following section, we will suggest choices of contrast vectors that lead to interesting and useful tests in specific settings, such as the 1d fused lasso, trend filtering, and graph fused lasso problems.
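For concreteness, here is a minimal Python/numpy/scipy sketch of the TG computation under the conventions of this subsection, with the polyhedron written in the form {y : Γy ≥ 0} and a known noise level; the function name and exact interface are ours, and no numerical safeguards are included.

```python
import numpy as np
from scipy.stats import norm

def tg_pvalue(y, Gamma, v, sigma, null_value=0.0):
    """Sketch of the one-sided TG test of H0: v^T theta = null_value,
    conditional on the polyhedron {y : Gamma y >= 0}, for y ~ N(theta, sigma^2 I).

    Returns 1 - F_{[a,b]}(v^T y), where F_{[a,b]} is the CDF of a Gaussian
    with mean null_value and variance sigma^2 ||v||^2, truncated to [a, b]."""
    vv = v @ v
    sd = sigma * np.sqrt(vv)
    vty = v @ y
    z = y - (vty / vv) * v                 # component of y not carrying v^T y
    Gv = Gamma @ v / vv
    Gz = Gamma @ z
    # Gamma y >= 0  <=>  Gz + (v^T y) * Gv >= 0: linear constraints on v^T y
    lower = -Gz[Gv > 0] / Gv[Gv > 0]       # constraints bounding v^T y from below
    upper = -Gz[Gv < 0] / Gv[Gv < 0]       # constraints bounding v^T y from above
    a = lower.max() if lower.size else -np.inf
    b = upper.min() if upper.size else np.inf
    num = norm.cdf((vty - null_value) / sd) - norm.cdf((a - null_value) / sd)
    den = norm.cdf((b - null_value) / sd) - norm.cdf((a - null_value) / sd)
    return 1.0 - num / den
```

Confidence intervals can then be obtained, as described above, by numerically inverting this pivot in the null value.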

2.3 Can we not just use lasso inference tools?

When the penalty matrix is square and invertible, the generalized lasso problem is equivalent to a lasso problem, in the variable , with design matrix . More generally, when has full row rank, problem is reducible to a lasso problem (see [?]). In this case, existing inference theory for the lasso path (from [?]) could be applied to the equivalent lasso problem, to perform post-selection inference on generalized lasso models. This covers inference for the 1d fused lasso and trend filtering problems. But when is row rank deficient (when it has more rows than columns), the generalized lasso is not equivalent to a lasso problem (see again [?]), and we cannot simply resort to lasso inference tools. This would hence rule out treating problems like the 2d fused lasso, the graph fused lasso (for any graph with more edges than nodes), the sparse 1d fused lasso, and sparse trend filtering from a pure lasso perspective. Our paper presents a unified treatment of post-selection inference across all generalized lasso problems, regardless of the penalty matrix .
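To illustrate the reduction mentioned above in the square, invertible case, here is a minimal sketch (our own helper, not part of any package): the generalized lasso in the original coefficients becomes a plain lasso in the transformed variable, with a modified design matrix.

```python
import numpy as np

def lasso_equivalent_design(X, D):
    """Sketch: when D is square and invertible, the generalized lasso penalty
    ||D beta||_1 becomes ||z||_1 under the change of variables z = D beta,
    and the design matrix becomes X D^{-1} (recover beta as D^{-1} z)."""
    assert D.shape[0] == D.shape[1], "this reduction requires a square, invertible D"
    return X @ np.linalg.inv(D)   # design matrix of the equivalent lasso problem
```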

3 Inference along the generalized lasso path

3.1 The selection event after a given number of steps

Here, we suppose that we have run a given (fixed) number of steps of the generalized lasso path algorithm, and we have a contrast vector in mind, such that is a parameter of interest (to be tested or covered). Define the generalized lasso model at step of the path to be

where are the boundary set and signs at step , and are quantities to be defined shortly. We will show that the entire model sequence from steps , denoted , is a polyhedral set in . By this we mean the following: if denotes the model sequence as a function of , and a given realization, then the set

is a polyhedron, more specifically, a convex cone, and can therefore be expressed as for a matrix that we will show how to construct, based on .

Our construction uses induction. When , and we write and , it is clear from the first step of Algorithm ? that is the hitting coordinate-sign pair if and only if

Hence we can construct to have the corresponding rows—to be explicit, these are , . We note that at the first step, there is no characterization needed for and (for simplicity, we may think of these as being empty sets).
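As an illustration only, the following sketch builds first-step rows of this kind under one concrete (and standard) characterization of the first hitting event, namely that the first hitting coordinate maximizes the absolute entries of the vector obtained by applying a pseudoinverse-based map to the data; this is our own rendering and does not reproduce the exact conventions of Algorithm ?.

```python
import numpy as np

def first_step_rows(y, D):
    """Sketch: rows encoding that the observed first hitting coordinate/sign
    pair (i1, s1) maximizes |[(D D^T)^+ D y]_i| over all coordinates i.
    Each returned row g satisfies g @ y >= 0 on this first-step event."""
    M = np.linalg.pinv(D @ D.T) @ D          # maps y to the vector of hitting times
    h = M @ y
    i1 = int(np.argmax(np.abs(h)))           # first hitting coordinate
    s1 = int(np.sign(h[i1]))                 # first hitting sign
    rows = []
    for j in range(M.shape[0]):
        rows.append(s1 * M[i1] - M[j])       # s1 * h_{i1} >= +h_j
        rows.append(s1 * M[i1] + M[j])       # s1 * h_{i1} >= -h_j
    return i1, s1, np.array(rows)
```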

Now assume that, given a model sequence , we have constructed a polyhedral representation for , i.e., we have constructed a matrix such that . To show that can also be written in the analogous form, we will define by appending rows to that capture the generalized lasso model at step of Algorithm ?. We will add rows to characterize the hitting time , leaving time , and the next action (either hitting or leaving) . Keeping with the notation in , a simple argument shows that the next hitting time can be alternatively written as

Plugging in for , we may characterize the viable hitting signs at step , , as well as the next hitting coordinate and hitting sign, and , by the following inequalities:

This corresponds to rows to be appended to .

For , we first define the viable leaving coordinates, denoted , by the subset of for which and . We may write , where is the set of for which , and is the set of for which . Plugging in for , we notice that only the former set depends on , and is deterministic once we have characterized . This gives rise to the following inequalities determining :

which corresponds to rows to be appended to . Given this characterization for , we may now characterize the next leaving coordinate by:

This corresponds to rows that must be appended to . Recall that the leaving coordinate is given by .

Lastly, for , we either use

if , or the above with the inequality sign flipped, if . In either case, only one more row is to be appended to . This completes the inductive proof.

It is worth noting that, in the inductive step that constructs by appending rows to , we append a total of at most rows. Therefore after steps, the polyhedral representation for the model sequence uses a matrix with at most rows.

Combining the results of this subsection with the TG pivotal statistic from Section 2.2, we are now equipped to perform conditional inference on the model that is selected at any fixed step of the generalized lasso path. (Recall, we are assuming that a reasonable contrast vector has been determined such that is a quantity of interest in the -step generalized lasso model; in-depth discussion of reasonable choices of contrast vectors, for particular problems, is given in Section 4.) Of course, the choice of which step to analyze is somewhat critical. The high-level idea is to fix a step that is large enough for the selected model to be interesting, but not so large that our tests will be low-powered. In some practical applications, choosing a priori may be natural; e.g., in the 1d fused lasso problem, where the selected model corresponds to detected changepoints (as discussed in the introduction), we may choose (say) steps, if in our particular setting we are interested in detecting and performing inference on at most 10 changepoints. But in most practical applications, fixing a step a priori is likely a difficult task. Hence, we present a rigorous strategy that allows the choice of to be data-driven, next.

3.2 The selection event after an IC-selected number of steps

We develop approaches based on a generic information criterion (IC), like AIC or BIC, for selecting a number of steps along the generalized lasso path that admits a “reasonable” model. By “reasonable”, we mean that the selected step yields a generalized lasso solution that balances training error against some notion of complexity. Importantly, we specifically design our IC-based approaches so that the selection event determining the chosen step is itself a polyhedral function of . We establish this below.

Defined in terms of a generalized lasso model at step , we consider the general form IC:

The first term above is the squared loss between and its projection onto the subspace ; recall that the -step generalized lasso solution itself lies in this subspace, as written in , and so here we have replaced the squared loss between and with the squared error loss between and the unshrunken estimate . (This is needed in order for our eventual IC-based rule to be equivalent to a polyhedral constraint in , as will be seen shortly.) The second term in utilizes , the dimension of , i.e., the dimension of the solution subspace. It hence penalizes the complexity associated with the -step generalized lasso solution. Indeed, from [?], the quantity is an unbiased estimate of the degrees of freedom of . Further, is a penalty function that is allowed to depend on and (the marginal variance in the observation model ). Some common choices are: , which makes like AIC; , motivated by BIC; and , where is a parameter to be chosen (say, for simplicity), motivated by extended BIC (EBIC) of [?]. Beyond these, any choice of complexity penalty will do as long as is an increasing function of .

Unfortunately, choosing to stop the path at the step that minimizes the IC defined in does not define a polyhedron in . Therefore, we use a modified IC-based rule. We first define

the set of steps at which we see action (nonzero adjacent differences) in the IC. For , we have , meaning that the structure of the primal solution is unchanged between steps and , and the IC is trivially constant as we move across these steps; we will hence restrict our attention to candidate steps in in crafting our stopping rule. Denoting by the sorted elements of , we define for each ,

the sign of the difference in IC values between steps and (two adjacent elements in at which the IC values are known to change nontrivially). We are now ready to define our stopping rule, which chooses to stop the path at the step

or in words, it chooses the smallest step such that the IC defined in experiences successive rises in a row, among the elements of the candidate set . Here is a prespecified integer; in practice, we have found that works well in most scenarios. It helps to see a visual depiction of the rule; see Figure 2.

Figure 2: Two illustrations of the IC selection rule; on the left, an example in which  changes at only 6 steps (representing, e.g., the 2d fused lasso case); on the right, an example in which  changes at every step (representing, e.g., the 1d fused lasso case). In both panels, solid blue circles mark the candidate set  in , and a large red circle is drawn around the IC-selected step in .
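In code, the stopping rule might look like the following minimal sketch (indexing and tie-breaking conventions are ours; it assumes the IC values along the path have already been computed):

```python
import numpy as np

def ic_stopping_step(ic, q=2):
    """Sketch of the IC-based stopping rule. ic[k] is the IC value of the
    k-step model along the path. Restrict attention to candidate steps where
    the IC changes, compute the signs of IC differences between successive
    candidate steps, and stop at the first candidate step followed by q
    consecutive rises (conventions here are ours, a sketch only)."""
    ic = np.asarray(ic, dtype=float)
    cand = [k for k in range(len(ic) - 1) if ic[k + 1] != ic[k]]   # candidate set
    signs = np.sign(np.diff(ic[cand]))         # rises/falls between candidate steps
    for j in range(len(signs) - q + 1):
        if np.all(signs[j:j + q] > 0):         # q successive rises
            return cand[j]
    return cand[-1] if cand else len(ic) - 1   # fallback: rule never triggered
```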

We now show that the following set is a polyhedron in ,

(By specifying , we have also implicitly specified the first elements of , and so we do not need to explicitly include a realization of the latter set in the definition of .) In particular, we show that we can express , for the matrix described in the previous subsection, and for and , whose construction is to be described below. Since the polyhedron (cone) characterizes the part , it suffices to study , given . And as this part is defined entirely by pairwise comparisons of IC values, it suffices to show that, for any ,

is equivalent to (a pair of) linear constraints on . By symmetry, if we reverse the inequality sign above, then this will still be equivalent to linear constraints on , and collecting these constraints over steps gives the polyhedron that determines . Simply recalling the IC definition in , and rearranging, we find that is equivalent to

Note that, by construction, the sets and differ by at most one element. For concreteness, suppose that ; the other direction is similar. Then , and the two subspaces are of codimension 1. Further, it is not hard to see that the difference in projection operators is itself the projection onto a subspace of dimension 1. Writing for the unit-norm basis vector for this subspace, and for the right hand side in , we see that becomes

or, since (this is implied by , and the complexity penalty being an increasing function),

This is a pair of linear constraints on , and we have verified the desired fact.
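The reduction just derived can be sketched numerically as follows; here we assume two nested solution subspaces whose projection matrices differ by a rank-1 projection, as in the argument above, and all names are ours.

```python
import numpy as np

def ic_comparison_constraints(P_small, P_big, pen_small, pen_big):
    """Sketch: reduce the comparison IC_small <= IC_big between two nested
    models to a pair of linear constraints on y. P_small and P_big are
    projections onto nested subspaces differing in dimension by one, and
    pen_big > pen_small are the corresponding complexity penalties.

    Since ||y - P_big y||^2 = ||y - P_small y||^2 - (v^T y)^2 for the unit
    vector v spanning the rank-1 difference P_big - P_small, the comparison
    is equivalent to |v^T y| <= sqrt(pen_big - pen_small)."""
    diff = P_big - P_small                    # rank-1 projection matrix
    w, V = np.linalg.eigh(diff)
    v = V[:, np.argmax(w)]                    # unit-norm basis vector of the difference
    c = np.sqrt(pen_big - pen_small)
    # the event {IC_small <= IC_big} is {-c <= v^T y <= c}: two half-spaces
    return v, c
```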

Altogether, with the final polyhedron , we can use the TG pivot from Section 2.2 to perform valid inference on linear contrasts of the mean , conditional on having chosen step with our IC-based stopping rule, and on having observed a given model sequence over the first steps of the generalized lasso path.

3.3 What is the conditioning set?

For a fixed , suppose that we have computed steps of the generalized lasso path and observed a model sequence . From Section 3.1, we can form a matrix such that . From Section 2.2, for any vector , we can invert the TG pivot as in to compute a conditional confidence interval , with the property

This holds for all possible realizations of model sequences, and thus we can marginalize along any dimension to yield a valid conditional coverage statement. For example, by marginalizing over all possible realizations of model sequences up to step , we obtain

Above, is the boundary set at step as a function of , and likewise are the boundary signs, viable hitting signs, and viable leaving coordinates at step , respectively, as functions of . Since a data analyst typically never sees the viable hitting signs or viable leaving coordinates at a generalized lasso solution (i.e., these are “hidden” details of the path computation, at least compared to the boundary set and signs, which are reflected in the structure of the solution itself, recall and ), the conditioning event in may seem like it includes “unnecessary” details. Hence, we can again marginalize over all possible realizations to yield

Among , , , the last is the cleanest statement and offers the simplest interpretation. This is reiterated when we cover specific problem cases in Section 4.

Similar statements hold when is chosen by our IC-based rule, from Section 3.2. Applying the TG framework from Section 2.2 to the full conditioning set, in order to derive a confidence interval for , and following a reduction analogous to , , , we arrive at the property

Again this is a clean conditional coverage statement and offers a simple interpretation, for chosen in a data-driven manner.

4 Special applications and extensions

4.1 Changepoint detection via the 1d fused lasso

Changepoint detection is an old topic with a vast literature. It has applications in many areas, e.g., bioinformatics, climate modeling, finance, and audio and video processing. Instead of attempting to thoroughly review the changepoint detection literature, we refer the reader to the comprehensive surveys and reviews in [?]. Broadly speaking, a changepoint detection problem is one in which the distribution of observations along an ordered sequence potentially changes at some (unknown) locations. In a slight abuse of notation, we use the term changepoint detection to refer to the particular setting in which there are changepoints in the underlying mean. Our focus is on conducting valid inference related to the selected changepoints. The existing literature applicable to this goal is relatively small; it is reviewed in Section 1.2 and compared to our methods in Section 5.2.

Among various methods for changepoint detection, the 1d fused lasso [?], also known as 1d total variation denoising in signal processing [?], is of particular interest because it is a special case of the generalized lasso. Let denote values observed at . Then the 1d fused lasso estimator is defined as in , with the penalty matrix being the discrete first difference operator:
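For concreteness, the discrete first difference operator can be built as in the following small sketch (sign convention ours; some authors use the opposite sign):

```python
import numpy as np

def first_difference_operator(n):
    """Sketch: the (n-1) x n discrete first difference operator used as the
    1d fused lasso penalty matrix; each row has the pattern (..., -1, 1, ...)."""
    D = np.zeros((n - 1, n))
    for i in range(n - 1):
        D[i, i] = -1.0
        D[i, i + 1] = 1.0
    return D
```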

In the 1d fused lasso problem, the dual boundary set tracked by Algorithm ? has a natural interpretation: it provides the locations of changepoints in the primal solution, which we can see more or less directly from (see also [?]). Therefore, we can rewrite as

Here denote the sorted elements of the boundary set , with , for convenience, denotes a vector with 1 in positions and 0 elsewhere, and denote levels estimated by the fused lasso with parameter . Note that in , we have implicitly used the fact that the boundary set after steps of the path algorithm has exactly elements; this is true since the path algorithm never deletes coordinates from the boundary set in 1d fused lasso problems (as mentioned following Algorithm ?). The dual boundary signs also have a natural meaning: writing the elements of as , these record the signs of differences (or jumps) between adjacent levels,

Below, we describe several aspects of selective inference with 1d fused lasso estimates. Similar discussions could be given for the different special classes of generalized lasso problems, like trend filtering and the graph fused lasso, but for brevity we only go into such detail for the 1d fused lasso.

Contrasts for the fused lasso. The framework laid out in Section 3 allows us to perform post-selection TG tests for hypotheses about , for any contrast vector . We introduce two specific forms of interesting contrasts, which we call the segment and spike contrasts. From the -step fused lasso solution, as portrayed in , , there are two natural questions one could ask about the changepoint , for some : first, whether there is a difference in the underlying mean exactly at ,

and second, whether there is an average difference in the mean between the regions separated by ,

These hypotheses are fundamentally different: that in is sensitive to the exact location of the underlying mean difference, whereas that in can be non-null even if the change in mean is not exactly at . To test , we use the so-called spike contrast

The resulting TG test, as in with , is called the spike test, since it tests differences in the mean at exactly one location. To test , we use the so-called segment contrast

The resulting TG test, as in with , is called the segment test, because it tests average differences across segments of the mean .
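As a concrete sketch of these two contrast vectors (0-indexed arrays, 1-indexed changepoint positions, and helper names are all our own conventions):

```python
import numpy as np

def spike_contrast(n, b, sign):
    """Sketch of a spike contrast for a detected changepoint between positions
    b and b+1 (1-indexed): tests the mean difference at exactly that location,
    oriented by the observed jump sign."""
    v = np.zeros(n)
    v[b - 1] = -sign
    v[b] = sign
    return v

def segment_contrast(n, b, left, right, sign):
    """Sketch of a segment contrast for the same changepoint: compares the
    average mean over (left, b] with the average over (b, right], where left
    and right are the neighboring detected changepoints (or 0 and n)."""
    v = np.zeros(n)
    v[left:b] = -sign / (b - left)
    v[b:right] = sign / (right - b)
    return v
```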

In practice, the segment test often has more power than the spike test to detect a change in the underlying mean, since it averages over entire segments. However, it is worth pointing out that the usefulness of the segment test at a given changepoint also depends on the quality of the other detected changepoints in the 1d fused lasso model (unlike the spike test, which does not), because these determine the lengths of the segments drawn out on either side of the changepoint. And, to emphasize what has already been said: unlike the spike test, the segment test does not test the precise location of a changepoint, so a rejection of its null hypothesis must not be mistakenly interpreted as evidence of an exact changepoint location (also, refer to the corresponding coverage statement in ).

Which test is appropriate ultimately depends on the goals of the data analyst. Figure 3 shows a simple example of the spike and segment tests. The behaviors of these two tests will be explored more thoroughly in Section 5.1.

Figure 3: An example with n=60 points, portraying the differences between the spike and segment tests for the fused lasso. The underlying mean has true changepoints at locations 20 and 30; the 2-step fused lasso estimate, drawn in blue, detects changepoints at locations 21 and 30, labeled A and B. P-values from the segment test are reported in the left panel, and from the spike test in the right panel. The segment and spike contrast vectors around changepoint B are visualized on the panels (the entries of these vectors have been scaled up for visibility). We can see that both segment p-values are small, and both segment null hypotheses defined around locations A and B should be rejected; but only the spike p-value at location B is small, and only the spike null hypothesis around location B should be rejected (as location A does not correspond to a true changepoint in the underlying mean).

Alternative motivation for the contrasts. It may be interesting to note that, for the segment contrast in , the statistic

is the likelihood ratio test statistic for testing the null versus the alternative , if the locations were fixed. An equivalent way to write these hypotheses, which will be a helpful generalization going forward (as we consider other classes of generalized lasso problems), is

In this notation, the segment contrast in is the unique (up to a scaling factor) basis vector for the rank 1 subspace , and is the likelihood ratio test statistic for the above set of null and alternative hypotheses.

Lastly, both segment and spike tests can be viewed from an equivalent regression perspective, after transforming the 1d fused lasso problem in , into an equivalent lasso problem (recall Section 2.3). In this context, it can be shown that the segment test corresponds to a test of a partial regression coefficient in the active model, whereas the spike test corresponds to a test of a marginal regression coefficient.

Inference with an interpretable conditioning event. As explained in Section 3.3, there are different levels of conditioning that can be used to interpret the results of the TG tests for model selection events along the generalized lasso path. Here we demonstrate, for the segment test in , what we see as the simplest interpretation of its conditional coverage property, with respect to its parameter . The TG interval in , computed by inverting the TG pivot, has the exact finite sample property

obtained by marginalizing over some dimensions of the conditioning set, as done in Section 3.3. In words, the coverage statement says that, conditional on the estimated changepoints and estimated jump signs in the -step 1d fused lasso solution, the interval traps the jump in segment averages with probability . This all assumes that the choice of step is fixed; for chosen by an IC-based rule as described in Section 3.2, the interpretation is very similar and we only need to add to the right-hand side of the conditioning bar in . A similar interpretation is also available for the spike test, which we omit for brevity.

One-sided or two-sided inference? We note that both setups in and use a one-sided alternative hypothesis, and the contrast vectors in and are defined accordingly. To put it in words, we are testing for a changepoint in the underlying mean (either exactly at one location, or in an average sense across local segments) and are looking to reject when a jump in occurs in the direction we have already observed in the fused lasso solution, as dictated by the sign . On the other hand, for coverage statements as in , we are implicitly using a two-sided alternative, replacing the alternative in by (since the coverage interval is the result of inverting a two-sided pivotal statistic). Two-sided tests and one-sided intervals are also possible in our inference framework; however, we find them less natural, and our default is therefore to consider the aforementioned versions.

4.2 Knot detection via trend filtering

Trend filtering can be seen as an extension of the 1d fused lasso for fitting higher-order piecewise polynomials [?]. It can be defined for any desired polynomial order, written as , with giving piecewise constant segments and reducing to the 1d fused lasso of the last subsection. Here we focus on the case , where piecewise linear segments are fitted. The general case is possible by following the exact same logic, though for simplicity, we do not cover it.

As before, we assume the data has been measured at ordered locations . The linear trend filtering estimate is defined as in with , the discrete second difference operator:
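For evenly spaced locations, the discrete second difference operator can be formed by composing two first difference operators, as in this small sketch (conventions ours):

```python
import numpy as np

def second_difference_operator(n):
    """Sketch: the (n-2) x n discrete second difference operator for linear
    trend filtering; each row has the pattern (..., 1, -2, 1, ...).
    Assumes evenly spaced locations."""
    def first_diff(m):
        D = np.zeros((m - 1, m))
        for i in range(m - 1):
            D[i, i], D[i, i + 1] = -1.0, 1.0
        return D
    return first_diff(n - 1) @ first_diff(n)
```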

For the linear trend filtering problem, the elements of the dual boundary set are in one-to-one correspondence with knots, i.e., changes in slope, in the piecewise linear sequence . This comes essentially from (for more, see [?]). Specifically, enumerating the elements of the boundary set as (and using and for convenience), each location serves as a knot in the trend filtering solution, so that we may rewrite as

Above, denotes the number of knots in the -step linear trend filtering solution, which in general need not be equal to , since (unlike the 1d fused lasso) the path algorithm for linear trend filtering can both add to and delete from the boundary set at each step. Also, for each , the quantities and denote the “local” intercept and slope parameters, respectively, of the linear trend filtering solution, over the segment .

i.e., these are the signs of changes in slope between adjacent trend filtering segments.

Contrasts for linear trend filtering. We can construct both spike and segment tests for linear trend filtering using similar motivations as in the 1d fused lasso. Given the trend filtering solution in , , we consider testing a particular knot location , for some . The spike contrast is defined by

and the TG statistic in with provides us with a test for

The segment contrast is harder to define explicitly from first principles, but can be defined following one of the alternative motivations for the segment contrast in the 1d fused lasso problem: consider the rank 1 subspace , and define to be a basis vector for this subspace (unique up to scaling). The segment contrast is then

i.e., we align so that its second difference around the knot matches that in the trend filtering solution. To test , we can use the TG statistic in with ; however, as is not easy to express in closed-form, this null hypothesis is also not easy to express in closed-form. Still, we can rewrite it in a slightly more explicit manner:

versus the appropriate one-sided alternative hypothesis. In words, is the projection of onto the space of piecewise linear vectors with knots at locations , , and is a single piecewise linear activation vector that rises from zero at location .
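A rough numerical sketch of this construction is below: it forms the two nested projections onto piecewise linear fits with and without the knot in question, extracts a unit basis vector for their rank-1 difference, and orients it by the observed sign. This is our own rendering of the description above, not code from the paper.

```python
import numpy as np

def tf_segment_contrast(n, knots, b, D2, sign):
    """Sketch: segment contrast for a detected trend filtering knot b, built as
    a basis vector for the rank-1 difference between the projection onto
    piecewise linear vectors with knots in `knots` (which contains b) and the
    projection with b removed. D2 is the (n-2) x n second difference operator;
    knots are row indices of D2. Conventions are ours."""
    def proj_piecewise_linear(knot_set):
        # projection onto null(D2 with the rows indexed by knot_set removed)
        mask = np.ones(D2.shape[0], dtype=bool)
        idx = sorted(knot_set)
        if idx:
            mask[idx] = False
        A = D2[mask, :]
        return np.eye(n) - np.linalg.pinv(A) @ A
    diff = proj_piecewise_linear(knots) - proj_piecewise_linear(set(knots) - {b})
    w, V = np.linalg.eigh(diff)                # rank-1 symmetric matrix
    v = V[:, np.argmax(w)]
    # orient v so its second difference at b matches the observed boundary sign
    return sign * np.sign(D2[b] @ v) * v
```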

The same high-level points comparing the spike and segment tests for the fused lasso also carry over to the linear trend filtering problem: the segment test can often deliver more power, but at a given location , the power of the segment test will depend on the other knot locations in the estimated model. The spike test at location does not depend on any other knot points in the trend filtering solution. Furthermore, the segment null does not specify a precise knot location, and one must be careful in interpreting a rejection here. Figure 4 gives examples of the segment test for linear trend filtering. More examples are investigated in Section 5.3.

Figure 4: An example with n=60 points, portraying two segment tests for trend filtering. The underlying piecewise linear mean has knots at locations 20 and 40; the 2-step linear trend filtering estimate, in blue, detects knots at locations 17 and 39, labeled A and B. The left plot shows the result of the segment test at knot A, and the right plot at knot B. In each, the segment contrast is visualized. Both p-values are small.

4.3 Cluster detection via the graph fused lasso

The graph fused lasso is another generalization of the 1d fused lasso, in which we depart from the 1-dimensional ordering of the components of . Now we think of these components as being observed over nodes of a given (undirected) graph, with edges , where say each joins some nodes and , for . Note that the 1d fused lasso corresponds to the special case in which