VectorValued Graph Trend Filtering with NonConvex Penalties
Abstract
This work studies the denoising of piecewise smooth graph signals that exhibit inhomogeneous levels of smoothness over a graph, where the value at each node can be vectorvalued. We extend the graph trend filtering framework to denoising vectorvalued graph signals with a family of nonconvex regularizers, which exhibit superior recovery performance over existing convex regularizers. Using an oracle inequality, we establish the statistical error rates of firstorder stationary points of the proposed nonconvex method for generic graphs. Furthermore, we present an ADMMbased algorithm to solve the proposed method and establish its convergence. Numerical experiments are conducted on both synthetic and realworld data for denoising, support recovery, event detection, and semisupervised classification.
I Introduction
Signal estimation from noisy observations is a classic problem in signal processing and has applications in signal inpainting, collaborative filtering, recommendation systems and other largescale data completion problems. Since noise can have deleterious, cascading effects in many downstream tasks, being able to efficiently and accurately filter and reconstruct a signal is of significant importance.
With the explosive growth of information and communication, signals are generated at an unprecedented rate from various sources, including social networks, citation networks, biological networks, and physical infrastructure [2]. Unlike time series or images, these signals lie on complex, irregular structures, and require novel processing techniques, leading to the emerging field of graph signal processing [3][5]. This framework models the structure by a graph and generalizes concepts and tools from classical discrete signal processing to graph signal processing. The associated graphstructured data are referred to as graph signals.
In graph signal processing, a common assumption is that the graph signal is smooth with respect to the graph, that is, the signal coefficients do not vary much over local neighborhoods of the graph. However, this characterization is insufficient for many realworld signals that exhibit spatially inhomogeneous levels of smoothness over the graph. In social networks for example, within a given community or social circle, users’ profiles tend to be homogeneous, while within a different social circle they will be of different, yet still have homogeneous values. Consequently, the signal is often characterized by large variations between regions and small variations within regions such that there are localized discontinuities and patterns in the signal. As a result, it is necessary to develop representations and algorithms to process and analyze such piecewise smooth graph signals.
In this work, we study the denoising of the class of piecewise smooth graph signals (including but not limited to piecewise constant graph signals), which is complementary to the class of smooth graph signals that exhibit homogeneous levels of smoothness over the graph. The reconstruction of smooth graph signals has been well studied in previous work both within graph signal processing [4][9] as well as in the context of Laplacian regularization [10, 11].
The Graph Trend Filtering (GTF) framework [12], which applies total variation denoising to graph signals [13], is a particularly flexible and attractive approach that regularizes discrete graph differences using the norm. Although the norm based regularization has many attractive properties [14], the resulting estimates are biased toward zero for large coefficients. To alleviate this bias effect, nonconvex penalties such as the Smoothly Clipped Absolute Deviation (SCAD) penalty [15] and the Minimax Concave Penalty (MCP) [16] have been proposed as alternatives. These penalties behave similarly to the norm when the signal coefficients are small, but tend to a constant when the signal coefficients are large. Notably, they possess the socalled oracle property: in the asymptotics of large dimension, they perform as well as the case where we know in advance the support of the sparse vectors [17][21].
In this work, we strengthen the GTF framework in [12] by considering a large family of possibly nonconvex regularizers, including SCAD and MCP that exhibit superior reconstruction performance over minimization for the denoising of piecewise smooth graph signals. Furthermore, we extend the GTF framework to allow vectorvalued signals, e.g. time series [22], on each node of the graph, which greatly broadens the applicability of GTF to applications in social networks [23], gene networks, and semisupervised classification [24][26]. Through theoretical analyses and empirical performance, we demonstrate that the use of nonconvex penalties improves the performance of GTF in terms of both reduced reconstruction error and improved support recovery, i.e. how accurately we can localize the discontinuities of the piecewise smooth signals. Our contributions can be summarized as follows:

Theoretically, we derive the statistical error rates of the signal estimates, defined as firstorder stationary points of the proposed GTF estimator. We derive the rates in terms of the noise level and the alignment of the ground truth signal with respect to the underlying graph, without making assumptions on the piecewise smoothness of the ground truth signal. The better the alignment, the more accurate the estimates. Importantly, the estimators do not need to be the global minima of the proposed nonconvex problem, which are much milder requirements and important for the success of optimization. For denoising vectorvalued signals, the GTF estimate is more accurate when each dimension of the signal shares similar patterns across the graph.

Algorithmically, we propose an ADMMbased algorithm that is guaranteed to converge to a critical point of the proposed GTF estimator.

Empirically, we demonstrate the performance improvements of the proposed GTF estimators with nonconvex penalties on both synthetic and real data for signal estimation, support recovery, event detection, and semisupervised classification.
The rest of this paper is organized as follows. Section II reviews related works and their relationships to the current paper. In Section III, we provide some background and definitions on graph signal processing and GTF. Section IV presents the proposed GTF framework with nonconvex penalties and vectorvalued graph signals. Section V develops its performance guarantees, and Section VI presents an efficient algorithm based on ADMM. Numerical performance of the proposed approach is examined on both synthetic and realworld data for denoising and semisupervised classification in Section VII. Finally, we conclude in Section VIII and briefly discuss future work.
Throughout this paper, we use boldface letters and to represent vectors and matrices respectively. The transpose of is denoted as . The th row of a matrix is denoted as , and the th column of a matrix is denoted as . The cardinality of a set is denoted as . For any set and , we denote such that if and only if for . Similarly, we define a submatrix of that corresponds to pulling out the rows of indexed by . The norm of a vector is defined as , and the spectral norm of a matrix is defined as . The pseudoinverse of a matrix is defined as . For a function , we write to denote the gradient or subdifferential of , if they exist, evaluated at . When the intention is clear, this may be written concisely as . We also follow the standard asymptotic notations. If for some constants , for all , then ; if , then . Finally, Table I summarizes some key notations used in this paper for convenience.
Symbol  Description  Dimension 

oriented incidence matrix  
th order graph difference operator  
scalarvalued graph signal  
vectorvalued graph signal  
noisy observation of  
noisy observation of  
th row of  
th column of  
spectral norm of 
Ii Related Work and Connections
Estimators that adapt to spatial inhomogeneities have been well studied in the literature via regularized regression, total variation and splines [27][29]. Most of these methods involve locating change points or knots that denote a distinct change in the behavior of the function or the signal.
Our work is most related to the spatially adaptive GTF estimator introduced in [12] that smoothens or filters noisy signals to promote piecewise smooth behavior with respect to the underlying graph structure; see also [30]. In the same spirit as [27], the fused LASSO and univariate trend filtering framework developed in [13, 31, 32] use discrete difference operators to fit a time series signal using piecewise polynomials. The GTF framework generalizes univariate trend filtering by generalizing a path graph to arbitrarily complex graphs. Specifically, by appropriately defining the discrete difference operator, we can enforce piecewise constant, piecewise linear, and more generally piecewise polynomial behaviors over the graph structure. In comparison to previous work [12], in this paper, we have significantly expanded its scope by allowing vectorvalued data over the graph nodes and a broader family of possibly nonconvex penalties.
We note that while a significant portion of the relevant literature on GTF or the fused LASSO has focused on the sparsistency or support recovery conditions under which we can ensure the recovery of the location of the discontinuities or knots [33, 34], in this work, we study the asymptotic error rates of our estimator with respect to the mean squared error. Our analysis of error rates leverages techniques in [35, 36] that result in sharp error rates of total variation denoising via oracle inequalities, which we have carefully adapted to allow nonconvex regularizers. The obtained error rates can be translated into bounds on support recovery or how well we can localize the boundary by leveraging techniques in [37].
Employing a graphbased regularizer that promotes similarities between the signal values at connected nodes has been investigated by many communities, such as graph signal processing, machine learning, applied mathematics, and network science. The Network LASSO proposed in [23], which is similar to the GTF framework with multidimensional or vectorvalued data, focused on the development of efficient algorithms without any theoretical guarantees. The recent works by Jung et. al. [25, 26, 38] have analyzed the performance of Network LASSO for semisupervised learning when the graph signal is assumed to be clustered according to the labels using the network null space property and the network compatibility condition inspired by related concepts in compressed sensing [39]. In contrast, our analysis does not make assumptions on the graph signal, and the error rate is adaptive to the alignment of the signal and the graph structure used in denoising.
A wellstudied generalization of the sparse linear inverse problem is when there are multiple measurement vectors (MMV), and the solutions are assumed to have a common sparsity pattern [40][42]. Sharing information across measurements, and thereby exploiting the conformity of the sparsity pattern, has been shown to significantly improve the performance of sparse recovery in compressive sensing and sparse coding [43][47]. Motivated by these works, we consider vectorvalued graph signals that are regarded as multiple measurements of scalarvalued graph signals sharing discontinuity patterns.
There are a few variants of nonconvex penalties that promote sparsity such as SCAD, MCP, weakly convex penalties, and () minimization [17, 48][51]. In this paper, we consider and develop theory for a family of nonconvex penalties parametrized similarly to that in [17, 49] with SCAD and MCP as our prime examples, although it is valid for other nonconvex penalties.
Iii Graph Signal Processing, Piecewise Smooth Signals, and Graph Trend Filtering
We consider an undirected graph , where is the set of nodes, is the set of edges, and is the unweighted adjacency matrix – also known as the graph shift operator [3]. The edge set represents the connections of the undirected graph , and the positive edge weight measures the underlying relation between the th and the th node, such as a similarity, a dependency, or a communication pattern. Let a scalarvalued graph signal be defined as
where denotes the signal coefficient at the th node.
Let be the oriented incidence matrix of , where each row corresponds to an edge. That is, if the edge connects the th node to the th node (), the entries in the th row of is then given as
The entries of the signal specifies the unweighted pairwise differences of the graph signal over each edge. As a result, can be interpreted as a graph difference operator. In graph signal processing, a signal is called smooth over a graph if is small.
Iiia Piecewise Smooth Graph Signals
In practice, the graph signal may not be necessarily smooth over the entire graph, but only locally within different pieces of the graph. To model inhomogeneous levels of smoothness over a graph, we say that a graph signal is piecewise constant over a graph if many of the differences are zero for . Consequently, the difference signal is sparse and is small.
We can characterize piecewise th order polynomial signals on a graph, where the piecewise constant case corresponds to , by generalizing the notion of graph difference operators. Specifically, we use the following recursive definition of the th order graph difference operator [12]. Let for . For , let
(1) 
The signal is said to be a piecewise th order polynomial graph signal if is small. To further illustrate, let us consider the piecewise linear graph signal, corresponding to , as a signal whose value at a node can be linearly interpolated from the weighted average of the values at neighboring nodes. It is easy to see that this is the same as requiring the secondorder differences to be sparse. Similarly, we say that a signal has a piecewise quadratic structure over a graph if the differences between the secondorder differences defined for piecewise linear signals are mostly zero, that is, if is sparse. Fig. 1 illustrates various orders of piecewise graph smooth signals over the Minnesota road network graph.
IiiB Denoising Piecewise Smooth Graph Signals via GTF
Assume we observe a noisy signal over the graph under i.i.d Gaussian noise:
(2) 
and seek to reconstruct from by leveraging the graph structure. When is a smooth graph signal, Laplacian smoothing [10, 11, 52][54] can be used, which solves the following problem:
(3) 
where . However, it cannot localize abrupt changes in the graph signal when the signal is piecewise smooth.
Graph trend filtering (GTF) [12] is a flexible framework for estimation on graphs that is adaptive to inhomogeneity in the level of smoothness of an observed signal across nodes. The th order GTF estimate is defined as:
(4) 
which can be regarded as applying total variation or fused LASSO with the graph difference operator [13, 55]. The sparsitypromoting properties of the norm have been wellstudied [56]. Consequently, applying the penalty in GTF sets many of the (higherorder) graph differences to zero while keeping a small fraction of nonzero values. GTF is then adaptive over the graph; its estimate at a node adapts to the smoothness in its localized neighborhood.
Iv VectorValued GTF with Nonconvex Penalties
In this section, we first extend GTF to allow a broader family of nonconvex penalties and then extend it to handle vectorvalued signals over the graph.
Iva (Non)convex Penalties
The norm penalty considered in (4) is wellknown to produce biased estimates [57], which motivates us to extend the GTF framework to a broader class of sparsitypromoting regularizers that are not necessarily convex. We wish to minimize the following generalized th order GTF loss function:
(5) 
where
is a regularizer defined as the sum of the penalty function applied elementwise to . Here, for even and for odd to account for different dimensions of ; see (1). We will refer to the GTF estimator that minimizes as scalarGTF.
Similarly to [17, 19, 49], we consider a family of penalty functions that satisfies the following assumptions.
Assumption 1.
Assume satisfies the following:

satisfies , is symmetric around , and is nondecreasing on the real nonnegative line.

For , the function is nonincreasing in . Also, is differentiable for all and subdifferentiable at , with . This upper bounds .

There exists such that is convex.
Many penalty functions satisfy these assumptions. Besides the penalty, the nonconvex SCAD [15] penalty
(6) 
and the MCP [16]
(7) 
also satisfy them. We note that Assumption 1 (c) is satisfied for SCAD with and for MCP with . Fig. 2 illustrates the , SCAD and MCP penalties for comparison. While the nonconvexity means that in general, we may not always find the global optimum of , it often affords us many other advantages. SCAD and MCP both taper off to a constant value and hence apply less shrinkage for higher values. As a result, they mitigate the bias effect while promoting sparsity. Further, they are smooth and differentiable for and are both upper bounded by the penalty for all .
IvB VectorValued GTF
In many applications, the signals on each node are in fact multidimensional or vectorvalued, e.g. time series in social networks, multiclass labels in semisupervised learning, feature vectors of different objects in feature selection. Therefore, it is natural to consider an extension to the graph signal denoising problem, where the graph signal on each node is a dimensional vector instead of a scalar. In this scenario, we define a vectorvalued graph signal to be piecewise smooth if it is piecewise smooth in each of its dimensions, and assume their discontinuities to coincide over the same small set of edges or nodes. Further, we denote the vectorvalued signal of interest as , such that the th row of the matrix corresponds to the th node of the graph. The noise model for the observation matrix is defined as
(8) 
where each element of is drawn i.i.d from . A naïve approach is to estimate each column of separately via scalarGTF:
(9) 
However, this formulation does not take full advantage of the multidimensionality of the graph signal. Instead, when the columns of are correlated, coupling them can be beneficial such that we encourage the sharing of information across dimensions or features. For example, if one column exhibits strong piecewise smoothness over the graph, and therefore has compelling evidence about the relationship between nodes, sharing that information to a related column can improve the overall denoising and filtering performance. As a result, we formulate a vectorGTF problem as follows:
(10) 
where the new penalty function is the sum of applied to the norm of each row of :
(11) 
By enforcing sparsity on , we are coupling to be of similar sparsity patterns across . Note the difference from (9), where elements of can be set to zero or nonzero independently.
V Theoretical Guarantees
In this section, we present the error rates and support recovery guarantees of the generalized GTF estimators, namely scalarGTF (5) and vectorGTF (10), under the AWGN noise model. Before continuing, we first define a few useful quantities. Let be the number of connected components in the graph , or equivalently, the dimension of the null space of . Further, let be the number of rows of , and be the maximum norm of the columns of .
Va Error Rates of Firstorder Stationary Points
Due to nonconvexity, global minima of the proposed GTF estimators may not be attainable. Therefore, it is more desirable to understand the statistical performance of any firstorder stationary points of the GTF estimators by considering oracle inequalities. We call a stationary point of , if it satisfies
We further introduce the compatibility factor, which generalizes the notion used in [35] to allow vectorvalued signals.
Definition 1 (Compatibility factor).
Let be fixed. The compatibility factor of a nonempty set is defined as
To build intuition, consider . This is precisely the definition of , an induced norm of the submatrix of . Together with , can be related to the restrictive eigenvalue condition, which is often used to bound the performance of LASSO [39]. With slight abuse of notation, we write .
We have the following oracle inequality that is applicable to the stationary points of the GTF estimators, whose proof is given in Appendix A. We stress that although GTF was motivated by piecewise smooth graph signals, Theorem 1 holds for any graph and graph signal .
Theorem 1 (Oracle inequality of GTF stationary points).
Remark 1.
Recall that is defined in Assumption 1 (c), which characterizes how “nonconvex” the regularizer is, and dictates the inflection point in Fig. 2. The assumption in Theorem 1 therefore implicitly constrains the level of nonconvexity of the regularizer. Take MCP in (7) for example: since , we can guarantee the existence of a valid such that as long as we set .
Theorem 1 allows one to select and to optimize the error bounds on the right hand side of (12) and (13). For example, pick in (12) (hence an “oracle”) to have
(14) 
where .

By setting as an empty set, we have
(15) which suggest that the reconstruction accuracy improves when the ground truth is better aligned with the graph structure, and consequently the value of is small.

On the other hand, by setting as the support of , we achieve
which grows linearly as we increase the sparsity level .
Similar discussions can be conducted for vectorGTF by choosing in (13). More importantly, we can directly compare the performance of vectorGTF with scalarGTF, which was formulated for vectorvalued graph signals in (9). The error bound of vectorGTF pays a small price in the order of , but is tighter than scalarGTF if . This suggests that vectorGTF is much more advantageous when the support sets of for overlap, i.e. when the local discontinuities and patterns in are shared.
To sum up, despite being nonconvex, we can guarantee that any stationary point of the proposed GTF estimator possesses strong statistical guarantees.
VB Comparison with ScalarGTF using Regularization
We compare our error bound for scalarGTF with [12, Theorem 3], which is obtained for GTF with the penalty, reproduced below for convenience.
Theorem 2 (Basic error bound of GTF minimizer).
If , then , the minimizer of (4), satisfies
The above bound is comparable to our bound in the special case of setting to an empty set, i.e. (15). The first term of the bound in (15) is upper bounded by that of Theorem 2. The nonconvex regularization yields especially tighter bounds when contains large coefficients, so that . On the other hand, the second term of (15) contains in the denominator, which makes it an upper bound of the second term in Theorem 2. This gap can be brought down by choosing a larger , which allows one to pick a smaller , as mentioned in Remark 1. However, as , nonconvex SCAD and MCP also tends to , which erases the improvement from using nonconvex regularizers in the first term of the bound. This indicates a tradeoff in the overall error bound based on , or the “nonconvexity” of the regularizers chosen for scalarGTF.
VC Error Rates for ErdősRényi Graphs
We next specialize Theorem 1 to the ErdősRényi random graphs using spectral graph theory [58]. Let and respectively be the maximum and expected degree of the graph. It is known that for any graph it holds [12]
(16) 
where is the smallest nonzero eigenvalue of the graph Laplacian matrix . Moreover, we have , and [59]. Next, we present a simple lower bound on , which is proved in Appendix B.
Proposition 1 (Bound on ).
is bounded for any and as
VD Support Recovery
An alternative yet important metric for gauging the success of the proposed GTF estimators is support recovery, which aims to localize the discontinuities in the piecewise smooth graph signals, i.e. the support set of , that is
(17) 
In particular, for odd , the discontinuities correspond to graph nodes; and for even , they correspond to the edges. Let be the GTF estimate of the graph signal. The quality of the support recovery can be measured using the graph screening distance [37]. For any and , let denote the length of the shortest path between them. The distance of from is then defined as
(18) 
Interestingly, Lin et.al. [37] showed recently that under mild assumptions, one can translate the error bound into a support recovery guarantee. Specifically, letting be the RHS of (12) that bounds the error in Theorem 1, we have
(19) 
where quantifies the minimum level of discontinuity, defined as the minimum absolute value of the nonzero values of , i.e.
(20) 
Consequently, this leads to support recovery guarantees of the proposed GTF estimators. Numerical experiment in Section VIIA verifies the superior performance of the nonconvex regularizers over the regularizer for support recovery.
Vi ADMM Algorithm and its Convergence
There are many algorithmic approaches to optimize the vectorGTF formulation in (10), since scalarGTF (5) can be regarded as a special case with . In this section, we illustrate the approach adopted in this paper, which is the Alternating Direction Method of Multipliers (ADMM) framework for solving separable optimization problems [62].
Via a change of variable as , we can transform (10) to
Its corresponding Lagrangian can be written as:
(21) 
where is the Lagrangian multiplier, and is the parameter. Alg. 1 shows the ADMM updates based on the Lagrangian in (21). Recall the proximal operator is defined as for a function . , SCAD and MCP all admit closedform solutions of Prox, which are simple thresholding operations [63]. Furthermore, we have the following convergence guarantee for Alg. 1, whose proof is provided in Appendix C.
In addition, we provide a detailed time complexity analysis of Alg. 1 in Table II. Note that since is a sparse matrix with exactly nonzero entries, Alg. 1 can run much faster when . As a preprocessing step for each , we compute and , the eigenvectors and eigenvalues of , exactly once. can then be initialized very efficiently for all experiments that use .
eigen decomposition  

initialization  
initialization  
update  
calculation  
update  
Total after iterations 
Vii Numerical Experiments
For the following experiments, we fixed for SCAD, for MCP. Further, the graphs we use in the following experiments satisfy Assumption 2 for this choice of . Unless explicitly mentioned, we tuned and for each experiment using the Hyperopt toolbox [64]. To meet the convergence criteria in Theorem 3, we enforced . SCAD/MCP were warmstarted with the GTF estimate with penalty. Python packages PyGSP [65] and NetworkX [66] were used to construct and plot graphs. The input signal SNR was calculated as , while the reconstructed signal SNR was calculated as , where was the reconstruction. Computation time was measured with MacBook Pro 2017 with an 2.9 GHz Intel Core i7 and 16GB RAM. Our code is available at https://github.com/HarlinLee/nonconvexGTFpublic.
Viia Denoising via GTF with Nonconvex Regularizers
We first highlight via synthetic examples two important advantages that nonconvex regularizers provide over the penalty.

Bias Reduction: We demonstrate the reduction in signal bias in Fig. 3 for the graph signal defined over a 2Dgrid graph, using both the penalty and the MCP penalty. Clearly, the MCP estimate (orange) has less bias than the estimate (blue), and can recover the ground truth surface (purple) more closely.

Support Recovery: We illustrate the improved support recovery performance of nonconvex penalties on localizing the boundaries for a piecewise constant signal on the Minnesota road graph, shown in Fig. 5. Particularly, we look at how well our estimator localizes the support of , that is, the discontinuity of the piecewise constant graph signal by looking at how well we can classify an edge as connecting two nodes in the same piece or being a cut edge across two pieces. By sweeping the regularization parameter , we obtain the ROC curve in Fig. 4, i.e. the true positive rate versus the false positive rate of classifying a cut edge correctly, and see that scalarGTF with MCP and SCAD consistently outperforms the scalarGTF with penalty.
Then, we compare the performance of GTF using nonconvex regularizers such as SCAD and MCP with that using the norm more rigorously. For the ground truth signal , we construct a piecewise constant signal on a 2Dgrid graph and the Minnesota road graph [65] as shown in the left panel of Fig. 5, and add different levels of noise following (2). We recover the signal by scalarGTF with Alg. 1, and plot the SNR of the reconstructed signal versus the SNR of the input signal in solid lines in the middle panel of Fig. 5, averaged over and realizations, respectively. SCAD/MCP consistently outperforms in denoising graph signals defined over both regular and irregular structures.
ViiB Denoising Vectorvalued Signals via GTF
We compare the performance of vectorGTF in (10) with (9), which applies scalarGTF to each column of the vectorvalued graph signal. The convex norm, and the nonconvex SCAD and MCP are employed. We reuse the same ground truth graph signals over the 2Dgrid graph and the Minnesota road graph constructed in Section VIIA in Fig. 5. independent noisy realizations of the graph signal are concatenated to construct a noisy vectorvalued graph signal with dimension on the 2Dgrid graph and with on the Minnesota road graph. We recover the vectorvalued signal by minimizing vectorGTF (10) with Alg. 1.
The middle panel of Fig. 5 plots the average SNR of the reconstructed signal versus the average SNR of the input signal in dotted lines. We emphasize that the performance of (9) is the same as applying scalarGTF to each realization, which is shown in the middle panel of Fig. 5 in solid lines. As before, SCAD/MCP consistently outperforms in denoising signals over both regular and irregular graphs. Furthermore, as expected, due to the sharing of information across realizations, vectorGTF consistently outperforms scalarGTF, especially in the low SNR regime.
The right panel of Fig. 5 plots the computation time versus the gain in SNR from denoising via vectorGTF. 10 trials are performed for each regularizer with the input signal SNR fixed at 20dB. Parameter tuning and eigen decomposition of are preprocessing steps, and hence they are not included in the time measurement. However, since GTF with nonconvex regularizers are warmstarted by the estimate, the runtime for GTF is added to the SCAD/MCP runtime. Overall, running vectorGTF with SCAD/MCP after once with takes more time, but with large benefits in the denoising performance. Even with the additional computation time, VectorGTF runs reasonably fast; with the Minnesota road network, where and , computation takes less than 25 seconds.
Average  #1  #2  #3  #4  #5  #6  #7  #8  

Input SNR (dB)  8.7  14  0  0  3.5  5.8  12  29  34 
VectorGTF +  29  10  20  23  26  36  37  39  38 
ScalarGTF +  21  0  11  13  16  18  26  41  45 
VectorGTF + SCAD  32  10  20  22  25  36  35  49  61 
ScalarGTF + SCAD  29  0  15  17  25  35  34  47  60 
VectorGTF + MCP  32  10  20  22  25  36  35  49  61 
ScalarGTF + MCP  29  0  15  22  24  30  33  49  60 
We further investigate the benefit of sharing information across measurements or realizations in the following experiment, using the same ground truth signal on the 2Dgrid graph. We stack eight noisy realizations of this same piecewise constant signal to build a vectorvalued signal. We construct these noisy measurements by scaling each one of them differently and randomly such that each will have SNR dB under (2). This has the effect of rendering some measurements more informative than others, and potentially allowing vectorGTF to reap the benefits of sharing information across measurements. We recover the 8dimensional graph signal via Alg. 1 using , SCAD, and MCP regularizers, and in Table III, report the input signal and reconstructed signal SNRs for each measurement in addition to the average SNRs. is fixed at .
First of all, notice that as before, using SCAD/MCP generally achieves results with higher SNR than using , and that on average, minimizing (10) outperforms minimizing (9). The effect of sharing information across measurements is most apparent in low SNR settings, when information about the boundaries of the graph signal can be borrowed from higher SNR signals to improve the estimation. On the other hand, sharing information with noisier signals does not help denoising signals with high input SNR. However, it is worth noting that, unlike , SCAD/MCP does not see decrease in its performance in the high SNR settings.
ViiC Event Detection with NYC Taxi Data
To further illustrate graph trend filtering on a realworld dataset, we consider the road network of Manhattan where the nodes correspond to junctions [67]. We map the pickups and dropoffs of the NYC taxi trip dataset [68, 69] to the nearest road junctions, and define the total count at that junction to be the signal value on the corresponding graph node. The signal of interest, plotted on the top left panel of Fig. 6, is the difference between the event graph signal on the day of NYC Gay Pride parade, 122pm on June 26, 2011, and the seasonal average graph signal at the same time during the 8 nearest Sundays. During the event, no pickups and dropoffs could occur in the areas shown in the top right panel of Fig. 6 [70]. We denoise the signal via GTF using both and MCP, where we chose such that . Once again, we observe the GTF estimate with MCP produces sharper traces around the parade route, indicating better capabilities of event detection and localization.
ViiD Semisupervised Classification
Heart  Wine quality  Wine  Iris  Breast  Car  
# of samples ()  303  1599  178  150  569  1728  
# of classes ()  2  6  3  3  2  4  
0.148  0.346  0.038  0.036  0.042  0.172  
SCAD pvalue  0.148  0.353  0.038  0.033  0.042  0.149  
1.  0.06  1.  0.27  1.  0.06  
MCP pvalue  0.144  0.351  0.037  0.035  0.040  0.148  
0.23  0.18  0.34  0.34  0.35  0.05  
0.143  0.351  0.034  0.039  0.035  0.104  
SCAD pvalue  0.144  0.350  0.034  0.039  0.035  0.104  
0.30  0.43  0.34  1.  0.71  0.66  
MCP pvalue  0.146  0.350  0.034  0.039  0.034  0.103  
0.05  0.44  0.34  1.  0.02  0.23 
Graphbased learning provides a flexible and attractive way to model data in semisupervised classification problems when vast amounts of unlabeled data are available compared to labeled data, and labels are expensive to acquire [10, 11, 54]. One can construct a nearestneighbor graph based on the similarities between each pair of samples, and hope to propagate the label information from labeled samples to unlabeled ones. We move beyond our original problem in (5) to a class classification problem in a semisupervised learning setting, where for a given dataset with samples, we observe a subset of the onehot encoded class labels, , such that if th sample has been observed to be in th class, and otherwise. A diagonal indicator matrix denotes samples whose class labels have been observed. Then, we can define the modified absorption problem [11, 12, 54] using a variation of GTF to estimate the unknown class probabilities :
(22) 
where (set to be uniform in the experiment) is a fixed prior belief, and determines how much emphasis to be given to the prior belief. The labels can be estimated using such that if and only if , and otherwise . Note that this can be completely separated into scalarGTF problems, one corresponding to each class.
We applied the algorithm in to 6 popular UCI classification datasets [71] with . For each dataset, we normalized each feature to have zero mean and unit variance, and constructed a 5nearestneighbor graph of the samples based on the Euclidean distance between their features, with edge weights from the Gaussian radial basis kernel. We observed the labels of 20% of samples in each class randomly. Table IV shows the misclassification rates averaged over repetitions, which demonstrates that the performance using nonconvex penalties such as SCAD/MCP are at least competitive with, and often better than, those with the penalty.
Viii Conclusions
We presented a framework for denoising piecewise smooth signals on graphs that generalizes the graph trend filtering framework to handle vectorvalued signals using a family of nonconvex regularizers. We provided theoretical guarantees on the error rates of our framework, and derived a general ADMMbased algorithm to solve this generalized graph trend filtering problem. Furthermore, we demonstrated the superior performance of these nonconvex regularizers in terms of reconstruction error, bias reduction, and support recovery on both synthetic and realworld data. In particular, its performance on semisupervised classification is investigated. In future work, we plan to further study this approach when the graph signals are observed by indirect and incomplete measurements.
References
 [1] R. Varma, H. Lee, Y. Chi, and J. Kovačević, “Improving graph trend filtering with nonconvex penalties,” in ICASSP 2019  2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5391–5395.
 [2] M. Newman, Networks. Oxford University Press, 2018.
 [3] A. Sandryhaila and J. M. Moura, “Discrete signal processing on graphs,” IEEE Transactions on Signal Processing, vol. 61, no. 7, pp. 1644–1656, 2013.
 [4] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending highdimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83–98, 2013.
 [5] A. Ortega, P. Frossard, J. Kovačević, J. M. Moura, and P. Vandergheynst, “Graph signal processing: Overview, challenges, and applications,” Proceedings of the IEEE, vol. 106, no. 5, pp. 808–828, 2018.
 [6] S. Chen, R. Varma, A. Singh, and J. Kovačević, “Signal representations on graphs: Tools and applications,” arXiv preprint arXiv:1512.05406, 2015.
 [7] S. Chen, R. Varma, A. Sandryhaila, and J. Kovačević, “Discrete signal processing on graphs: Sampling theory,” IEEE Transactions on Signal Processing, vol. 63, no. 24, pp. 6510–6523, 2015.
 [8] S. Chen, A. Sandryhaila, J. M. Moura, and J. Kovačević, “Signal recovery on graphs: Variation minimization,” IEEE Transactions on Signal Processing, vol. 63, pp. 4609–4624, 2015.
 [9] A. Elmoataz, O. Lezoray, and S. Bougleux, “Nonlocal discrete regularization on weighted graphs: a framework for image and manifold processing,” IEEE Transactions on Image Processing, vol. 17, no. 7, pp. 1047–1060, 2008.
 [10] M. Belkin, I. Matveeva, and P. Niyogi, “Regularization and semisupervised learning on large graphs,” in International Conference on Computational Learning Theory. Springer, 2004, pp. 624–638.
 [11] X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semisupervised learning using gaussian fields and harmonic functions,” in Proceedings of the 20th International Conference on Machine Learning (ICML03), 2003, pp. 912–919.
 [12] Y.X. Wang, J. Sharpnack, A. J. Smola, and R. J. Tibshirani, “Trend filtering on graphs,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 3651–3691, 2016.
 [13] S.J. Kim, K. Koh, S. Boyd, and D. Gorinevsky, “ Trend Filtering,” SIAM Review, vol. 51, no. 2, pp. 339–360, 2009.
 [14] P. Bühlmann and S. Van De Geer, Statistics for HighDimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011.
 [15] J. Fan and R. Li, “Variable selection via nonconcave penalized likelihood and its oracle properties,” Journal of the American Statistical Association, vol. 96, no. 456, pp. 1348–1360, 2001.
 [16] C.H. Zhang, “Nearly unbiased variable selection under minimax concave penalty,” The Annals of Statistics, vol. 38, no. 2, pp. 894–942, 2010.
 [17] P.L. Loh and M. J. Wainwright, “Regularized Mestimators with nonconvexity: Statistical and algorithmic theory for local optima,” in Advances in Neural Information Processing Systems, 2013, pp. 476–484.
 [18] P.L. Loh, “Statistical consistency and asymptotic normality for highdimensional robust Mestimators,” The Annals of Statistics, vol. 45, no. 2, pp. 866–896, 2017.
 [19] C.H. Zhang and T. Zhang, “A general theory of concave regularization for highdimensional sparse estimation problems,” Statistical Science, vol. 27, no. 4, pp. 576–593, 2012.
 [20] P. Breheny and J. Huang, “Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection,” The Annals of Applied Statistics, vol. 5, no. 1, p. 232, 2011.
 [21] S. Ma and J. Huang, “A concave pairwise fusion approach to subgroup analysis,” Journal of the American Statistical Association, vol. 112, no. 517, pp. 410–423, 2017.
 [22] S. Chen, F. Cerda, P. Rizzo, J. Bielak, J. H. Garrett, and J. Kovačević, “Semisupervised multiresolution classification using adaptive graph filtering with application to indirect bridge structural health monitoring,” IEEE Transactions on Signal Processing, vol. 62, no. 11, pp. 2879–2893, 2014.
 [23] D. Hallac, J. Leskovec, and S. Boyd, “Network lasso: Clustering and optimization in large graphs,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015, pp. 387–396.
 [24] A. Jung, A. O. Hero III, A. Mara, and S. Jahromi, “Semisupervised learning via sparse label propagation,” arXiv preprint arXiv:1612.01414, 2016.
 [25] A. Jung, N. Tran, and A. Mara, “When is network lasso accurate?” Frontiers in Applied Mathematics and Statistics, vol. 3, p. 28, 2018.
 [26] N. Tran, S. Basirian, and A. Jung, “When is network lasso accurate: The vector case,” arXiv preprint arXiv:1710.03942, 2017.
 [27] E. Mammen and S. van de Geer, “Locally adaptive regression splines,” The Annals of Statistics, vol. 25, no. 1, pp. 387–413, 1997.
 [28] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D: nonlinear phenomena, vol. 60, no. 14, pp. 259–268, 1992.
 [29] T. F. Chan, S. Osher, and J. Shen, “The digital TV filter and nonlinear denoising,” IEEE Transactions on Image Processing, vol. 10, no. 2, pp. 231–241, 2001.
 [30] F. Mahmood, N. Shahid, U. Skoglund, and P. Vandergheynst, “Adaptive graphbased total variation for tomographic reconstructions,” IEEE Signal Processing Letters, vol. 25, no. 5, pp. 700–704, 2018.
 [31] R. J. Tibshirani, “Adaptive piecewise polynomial estimation via trend filtering,” The Annals of Statistics, vol. 42, no. 1, pp. 285–323, 2014.
 [32] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight, “Sparsity and smoothness via the fused lasso,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 1, pp. 91–108, 2005.
 [33] J. Sharpnack, A. Singh, and A. Rinaldo, “Sparsistency of the edge lasso over graphs,” in Artificial Intelligence and Statistics, 2012, pp. 1028–1036.
 [34] Z. Harchaoui and C. LévyLeduc, “Multiple changepoint estimation with a total variation penalty,” Journal of the American Statistical Association, vol. 105, no. 492, pp. 1480–1493, 2010.
 [35] J.C. Hütter and P. Rigollet, “Optimal rates for total variation denoising,” in Conference on Learning Theory, 2016, pp. 1115–1146.
 [36] A. S. Dalalyan, M. Hebiri, and J. Lederer, “On the prediction performance of the lasso,” Bernoulli, vol. 23, no. 1, pp. 552–581, 2017.
 [37] K. Lin, J. Sharpnack, A. Rinaldo, and R. J. Tibshirani, “Approximate Recovery in Changepoint Problems, from Estimation Error Rates,” arXiv preprint arXiv:1606.06746, 2016.
 [38] A. Jung and M. Hulsebos, “The network nullspace property for compressed sensing of big data over networks,” arXiv preprint arXiv:1705.04379, 2017.
 [39] S. A. Van De Geer, P. Bühlmann et al., “On the conditions used to prove oracle results for the lasso,” Electronic Journal of Statistics, vol. 3, pp. 1360–1392, 2009.
 [40] S. F. Cotter, B. D. Rao, K. Engan, and K. KreutzDelgado, “Sparse solutions to linear inverse problems with multiple measurement vectors,” IEEE Transactions on Signal Processing, vol. 53, no. 7, pp. 2477–2488, 2005.
 [41] J. Chen and X. Huo, “Theoretical results on sparse representations of multiplemeasurement vectors,” IEEE Transactions on Signal Processing, vol. 54, no. 12, pp. 4634–4643, 2006.
 [42] Y. C. Eldar, P. Kuppinger, and H. Bolcskei, “Blocksparse signals: Uncertainty relations and efficient recovery,” IEEE Transactions on Signal Processing, vol. 58, no. 6, pp. 3042–3054, 2010.
 [43] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online Learning for Matrix Factorization and Sparse Coding,” Journal of Machine Learning Research, vol. 11, no. Jan, pp. 19–60, 2010.
 [44] Y. Chen, N. M. Nasrabadi, and T. D. Tran, “Hyperspectral Image Classification Using DictionaryBased Sparse Representation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 10, pp. 3973–3985, Oct. 2011.
 [45] Y. Li and Y. Chi, “OfftheGrid Line Spectrum Denoising and Estimation With Multiple Measurement Vectors,” IEEE Transactions on Signal Processing, vol. 64, no. 5, pp. 1257–1269, Mar. 2016.
 [46] Y. C. Eldar and M. Mishali, “Robust Recovery of Signals From a Structured Union of Subspaces,” IEEE Transactions on Information Theory, vol. 55, no. 11, pp. 5302–5316, Nov. 2009.
 [47] M. E. Davies and Y. C. Eldar, “Rank Awareness in Joint Sparse Recovery,” IEEE Transactions on Information Theory, vol. 58, no. 2, pp. 1135–1146, Feb. 2012.
 [48] P.L. Loh and M. J. Wainwright, “Support recovery without incoherence: A case for nonconvex regularization,” The Annals of Statistics, vol. 45, no. 6, pp. 2455–2482, 2017.
 [49] L. Chen and Y. Gu, “The convergence guarantees of a nonconvex approach for sparse recovery,” IEEE Transactions on Signal Processing, vol. 62, no. 15, pp. 3754–3767, 2014.
 [50] R. Chartrand and V. Staneva, “Restricted isometry properties and nonconvex compressive sensing,” Inverse Problems, vol. 24, no. 3, p. 035020, 2008.
 [51] K. Ji, J. Tan, Y. Chi, and J. Xu, “Learning latent features with pairwise penalties in matrix completion,” arXiv preprint arXiv:1802.05821, 2018.
 [52] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in Advances in Neural Information Processing Systems, 2002, pp. 585–591.
 [53] ——, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Computation, vol. 15, no. 6, pp. 1373–1396, 2003.
 [54] P. P. Talukdar and K. Crammer, “New regularized algorithms for transductive learning,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2009, pp. 442–457.
 [55] R. J. Tibshirani, The Solution Path of the Generalized Lasso. Stanford University, 2011.
 [56] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), pp. 267–288, 1996.
 [57] C.H. Zhang and J. Huang, “The sparsity and bias of the lasso selection in highdimensional linear regression,” The Annals of Statistics, vol. 36, no. 4, pp. 1567–1594, 2008.
 [58] F. Chung and M. Radcliffe, “On the spectra of general random graphs,” The Electronic Journal of Combinatorics, vol. 18, no. 1, p. 215, 2011.
 [59] F. R. Chung and F. C. Graham, Spectral Graph Theory. American Mathematical Society, 1997.
 [60] A. Blum, J. Hopcroft, and R. Kannan, “Foundations of data science,” Vorabversion eines Lehrbuchs, 2016.
 [61] A. Lubotzky, R. Phillips, and P. Sarnak, “Ramanujan graphs,” Combinatorica, vol. 8, no. 3, pp. 261–277, 1988.
 [62] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine learning, vol. 3, no. 1, pp. 1–122, 2011.
 [63] J. Huang, P. Breheny, and S. Ma, “A selective review of group selection in highdimensional models,” Statistical science: a review journal of the Institute of Mathematical Statistics, vol. 27, no. 4, 2012.
 [64] J. Bergstra, D. Yamins, and D. D. Cox, “Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures,” in Proceedings of the 30th International Conference on Machine Learning, 2013.
 [65] M. Defferrard, L. Martin, R. Pena, and N. Perraudin, “PyGSP: Graph Signal Processing in Python.” [Online]. Available: https://github.com/epfllts2/pygsp/
 [66] A. Hagberg, P. Swart, and D. S Chult, “Exploring network structure, dynamics, and function using NetworkX,” 2008.
 [67] G. Boeing, “OSMnx: New methods for acquiring, constructing, analyzing, and visualizing complex street networks,” Computers, Environment and Urban Systems, vol. 65, pp. 126–139, 2017.
 [68] Taxi and Limousine Commission (TLC), “2011 yellow taxi trip data.” [Online]. Available: https://data.cityofnewyork.us/Transportation/2011YellowTaxiTripData/uwypdntv
 [69] Socrata API, “2011 yellow taxi trip data.” [Online]. Available: https://dev.socrata.com/foundry/data.cityofnewyork.us/uwypdntv
 [70] “2011 Pride Guide.” [Online]. Available: https://issuu.com/nycpride/docs/prideguide_9_rev2_lowres
 [71] A. Asuncion and D. Newman, UCI Machine Learning Repository, 2007.
 [72] M. J. Wainwright, Highdimensional statistics: A nonasymptotic viewpoint. Cambridge University Press, 2019, vol. 48.
 [73] P. Tseng, “Convergence of a block coordinate descent method for nondifferentiable minimization,” Journal of Optimization Theory and Applications, vol. 109, no. 3, pp. 475–494, 2001.
a Proof of Theorem 1
Proof.
We denote . Define as the row space of , and the null space. Let , the projection onto , and . Additionally, , the projection onto . Since is a stationary point of , it follows that