Vector-Valued Graph Trend Filtering with Non-Convex Penalties

Vector-Valued Graph Trend Filtering with Non-Convex Penalties

Rohan Varma, , Harlin Lee, , Jelena Kovačević, , and
Yuejie Chi,
R. Varma, H. Lee and Y. Chi are with the Dept. of Electrical and Computer Engineering, Carnegie Mellon University. Emails:{rohanv, harlinl, yuejiec}@andrew.cmu.edu.J. Kovačević is with the Tandon School of Engineering, New York University. Email: jelenak@nyu.edu. This work is supported in part by NSF under grants CCF-1563918, CCF-1826519, CCF-1806154 and ECCS-1818571, by ONR under grant N00014-18-1-2142, by ARO under grant W911NF-18-1-0303, and by NIH under grant R01EB025018. A preliminary version of partial results in this paper was presented at the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing [1].The first two authors contributed equally.
Abstract

This work studies the denoising of piecewise smooth graph signals that exhibit inhomogeneous levels of smoothness over a graph, where the value at each node can be vector-valued. We extend the graph trend filtering framework to denoising vector-valued graph signals with a family of non-convex regularizers, which exhibit superior recovery performance over existing convex regularizers. Using an oracle inequality, we establish the statistical error rates of first-order stationary points of the proposed non-convex method for generic graphs. Furthermore, we present an ADMM-based algorithm to solve the proposed method and establish its convergence. Numerical experiments are conducted on both synthetic and real-world data for denoising, support recovery, event detection, and semi-supervised classification.

graph signal processing, graph trend filtering, semi-supervised classification, non-convex optimization

I Introduction

Signal estimation from noisy observations is a classic problem in signal processing and has applications in signal inpainting, collaborative filtering, recommendation systems and other large-scale data completion problems. Since noise can have deleterious, cascading effects in many downstream tasks, being able to efficiently and accurately filter and reconstruct a signal is of significant importance.

With the explosive growth of information and communication, signals are generated at an unprecedented rate from various sources, including social networks, citation networks, biological networks, and physical infrastructure [2]. Unlike time series or images, these signals lie on complex, irregular structures, and require novel processing techniques, leading to the emerging field of graph signal processing [3]-[5]. This framework models the structure by a graph and generalizes concepts and tools from classical discrete signal processing to graph signal processing. The associated graph-structured data are referred to as graph signals.

In graph signal processing, a common assumption is that the graph signal is smooth with respect to the graph, that is, the signal coefficients do not vary much over local neighborhoods of the graph. However, this characterization is insufficient for many real-world signals that exhibit spatially inhomogeneous levels of smoothness over the graph. In social networks for example, within a given community or social circle, users’ profiles tend to be homogeneous, while within a different social circle they will be of different, yet still have homogeneous values. Consequently, the signal is often characterized by large variations between regions and small variations within regions such that there are localized discontinuities and patterns in the signal. As a result, it is necessary to develop representations and algorithms to process and analyze such piecewise smooth graph signals.

In this work, we study the denoising of the class of piecewise smooth graph signals (including but not limited to piecewise constant graph signals), which is complementary to the class of smooth graph signals that exhibit homogeneous levels of smoothness over the graph. The reconstruction of smooth graph signals has been well studied in previous work both within graph signal processing [4]-[9] as well as in the context of Laplacian regularization [10, 11].

The Graph Trend Filtering (GTF) framework [12], which applies total variation denoising to graph signals [13], is a particularly flexible and attractive approach that regularizes discrete graph differences using the norm. Although the norm based regularization has many attractive properties [14], the resulting estimates are biased toward zero for large coefficients. To alleviate this bias effect, non-convex penalties such as the Smoothly Clipped Absolute Deviation (SCAD) penalty [15] and the Minimax Concave Penalty (MCP) [16] have been proposed as alternatives. These penalties behave similarly to the norm when the signal coefficients are small, but tend to a constant when the signal coefficients are large. Notably, they possess the so-called oracle property: in the asymptotics of large dimension, they perform as well as the case where we know in advance the support of the sparse vectors [17]-[21].

In this work, we strengthen the GTF framework in [12] by considering a large family of possibly non-convex regularizers, including SCAD and MCP that exhibit superior reconstruction performance over minimization for the denoising of piecewise smooth graph signals. Furthermore, we extend the GTF framework to allow vector-valued signals, e.g. time series [22], on each node of the graph, which greatly broadens the applicability of GTF to applications in social networks [23], gene networks, and semi-supervised classification [24]-[26]. Through theoretical analyses and empirical performance, we demonstrate that the use of non-convex penalties improves the performance of GTF in terms of both reduced reconstruction error and improved support recovery, i.e. how accurately we can localize the discontinuities of the piecewise smooth signals. Our contributions can be summarized as follows:

  • Theoretically, we derive the statistical error rates of the signal estimates, defined as first-order stationary points of the proposed GTF estimator. We derive the rates in terms of the noise level and the alignment of the ground truth signal with respect to the underlying graph, without making assumptions on the piecewise smoothness of the ground truth signal. The better the alignment, the more accurate the estimates. Importantly, the estimators do not need to be the global minima of the proposed non-convex problem, which are much milder requirements and important for the success of optimization. For denoising vector-valued signals, the GTF estimate is more accurate when each dimension of the signal shares similar patterns across the graph.

  • Algorithmically, we propose an ADMM-based algorithm that is guaranteed to converge to a critical point of the proposed GTF estimator.

  • Empirically, we demonstrate the performance improvements of the proposed GTF estimators with non-convex penalties on both synthetic and real data for signal estimation, support recovery, event detection, and semi-supervised classification.

The rest of this paper is organized as follows. Section II reviews related works and their relationships to the current paper. In Section III, we provide some background and definitions on graph signal processing and GTF. Section IV presents the proposed GTF framework with non-convex penalties and vector-valued graph signals. Section V develops its performance guarantees, and Section VI presents an efficient algorithm based on ADMM. Numerical performance of the proposed approach is examined on both synthetic and real-world data for denoising and semi-supervised classification in Section VII. Finally, we conclude in Section VIII and briefly discuss future work.

Throughout this paper, we use boldface letters and to represent vectors and matrices respectively. The transpose of is denoted as . The -th row of a matrix is denoted as , and the -th column of a matrix is denoted as . The cardinality of a set is denoted as . For any set and , we denote such that if and only if for . Similarly, we define a submatrix of that corresponds to pulling out the rows of indexed by . The norm of a vector is defined as , and the spectral norm of a matrix is defined as . The pseudo-inverse of a matrix is defined as . For a function , we write to denote the gradient or subdifferential of , if they exist, evaluated at . When the intention is clear, this may be written concisely as . We also follow the standard asymptotic notations. If for some constants , for all , then ; if , then . Finally, Table I summarizes some key notations used in this paper for convenience.

Symbol Description Dimension
oriented incidence matrix
th order graph difference operator
scalar-valued graph signal
vector-valued graph signal
noisy observation of
noisy observation of
-th row of
-th column of
spectral norm of
TABLE I: Key notations used in this paper.

Ii Related Work and Connections

Estimators that adapt to spatial inhomogeneities have been well studied in the literature via regularized regression, total variation and splines [27]-[29]. Most of these methods involve locating change points or knots that denote a distinct change in the behavior of the function or the signal.

Our work is most related to the spatially adaptive GTF estimator introduced in [12] that smoothens or filters noisy signals to promote piecewise smooth behavior with respect to the underlying graph structure; see also [30]. In the same spirit as [27], the fused LASSO and univariate trend filtering framework developed in [13, 31, 32] use discrete difference operators to fit a time series signal using piecewise polynomials. The GTF framework generalizes univariate trend filtering by generalizing a path graph to arbitrarily complex graphs. Specifically, by appropriately defining the discrete difference operator, we can enforce piecewise constant, piecewise linear, and more generally piecewise polynomial behaviors over the graph structure. In comparison to previous work [12], in this paper, we have significantly expanded its scope by allowing vector-valued data over the graph nodes and a broader family of possibly non-convex penalties.

We note that while a significant portion of the relevant literature on GTF or the fused LASSO has focused on the sparsistency or support recovery conditions under which we can ensure the recovery of the location of the discontinuities or knots [33, 34], in this work, we study the asymptotic error rates of our estimator with respect to the mean squared error. Our analysis of error rates leverages techniques in [35, 36] that result in sharp error rates of total variation denoising via oracle inequalities, which we have carefully adapted to allow non-convex regularizers. The obtained error rates can be translated into bounds on support recovery or how well we can localize the boundary by leveraging techniques in [37].

Employing a graph-based regularizer that promotes similarities between the signal values at connected nodes has been investigated by many communities, such as graph signal processing, machine learning, applied mathematics, and network science. The Network LASSO proposed in [23], which is similar to the GTF framework with multi-dimensional or vector-valued data, focused on the development of efficient algorithms without any theoretical guarantees. The recent works by Jung et. al. [25, 26, 38] have analyzed the performance of Network LASSO for semi-supervised learning when the graph signal is assumed to be clustered according to the labels using the network null space property and the network compatibility condition inspired by related concepts in compressed sensing [39]. In contrast, our analysis does not make assumptions on the graph signal, and the error rate is adaptive to the alignment of the signal and the graph structure used in denoising.

A well-studied generalization of the sparse linear inverse problem is when there are multiple measurement vectors (MMV), and the solutions are assumed to have a common sparsity pattern [40]-[42]. Sharing information across measurements, and thereby exploiting the conformity of the sparsity pattern, has been shown to significantly improve the performance of sparse recovery in compressive sensing and sparse coding [43]-[47]. Motivated by these works, we consider vector-valued graph signals that are regarded as multiple measurements of scalar-valued graph signals sharing discontinuity patterns.

There are a few variants of non-convex penalties that promote sparsity such as SCAD, MCP, weakly convex penalties, and () minimization [17, 48]-[51]. In this paper, we consider and develop theory for a family of non-convex penalties parametrized similarly to that in [17, 49] with SCAD and MCP as our prime examples, although it is valid for other non-convex penalties.

Fig. 1: Illustration of piecewise smooth signals on the Minnesota road graph. From left to right: piecewise constant (), piecewise linear (), and piecewise quadratic () graph signals. Note that the highlighted change points, i.e. the support of , are edges for even and nodes for odd .

Iii Graph Signal Processing, Piecewise Smooth Signals, and Graph Trend Filtering

We consider an undirected graph , where is the set of nodes, is the set of edges, and is the unweighted adjacency matrix – also known as the graph shift operator [3]. The edge set represents the connections of the undirected graph , and the positive edge weight measures the underlying relation between the th and the th node, such as a similarity, a dependency, or a communication pattern. Let a scalar-valued graph signal be defined as

where denotes the signal coefficient at the th node.

Let be the oriented incidence matrix of , where each row corresponds to an edge. That is, if the edge connects the th node to the th node (), the entries in the th row of is then given as

The entries of the signal specifies the unweighted pairwise differences of the graph signal over each edge. As a result, can be interpreted as a graph difference operator. In graph signal processing, a signal is called smooth over a graph if is small.

Iii-a Piecewise Smooth Graph Signals

In practice, the graph signal may not be necessarily smooth over the entire graph, but only locally within different pieces of the graph. To model inhomogeneous levels of smoothness over a graph, we say that a graph signal is piecewise constant over a graph if many of the differences are zero for . Consequently, the difference signal is sparse and is small.

We can characterize piecewise th order polynomial signals on a graph, where the piecewise constant case corresponds to , by generalizing the notion of graph difference operators. Specifically, we use the following recursive definition of the th order graph difference operator [12]. Let for . For , let

(1)

The signal is said to be a piecewise th order polynomial graph signal if is small. To further illustrate, let us consider the piecewise linear graph signal, corresponding to , as a signal whose value at a node can be linearly interpolated from the weighted average of the values at neighboring nodes. It is easy to see that this is the same as requiring the second-order differences to be sparse. Similarly, we say that a signal has a piecewise quadratic structure over a graph if the differences between the second-order differences defined for piecewise linear signals are mostly zero, that is, if is sparse. Fig. 1 illustrates various orders of piecewise graph smooth signals over the Minnesota road network graph.

Iii-B Denoising Piecewise Smooth Graph Signals via GTF

Assume we observe a noisy signal over the graph under i.i.d Gaussian noise:

(2)

and seek to reconstruct from by leveraging the graph structure. When is a smooth graph signal, Laplacian smoothing [10, 11, 52]-[54] can be used, which solves the following problem:

(3)

where . However, it cannot localize abrupt changes in the graph signal when the signal is piecewise smooth.

Graph trend filtering (GTF) [12] is a flexible framework for estimation on graphs that is adaptive to inhomogeneity in the level of smoothness of an observed signal across nodes. The th order GTF estimate is defined as:

(4)

which can be regarded as applying total variation or fused LASSO with the graph difference operator  [13, 55]. The sparsity-promoting properties of the norm have been well-studied [56]. Consequently, applying the penalty in GTF sets many of the (higher-order) graph differences to zero while keeping a small fraction of non-zero values. GTF is then adaptive over the graph; its estimate at a node adapts to the smoothness in its localized neighborhood.

Iv Vector-Valued GTF with Non-convex Penalties

In this section, we first extend GTF to allow a broader family of non-convex penalties and then extend it to handle vector-valued signals over the graph.

Iv-a (Non-)convex Penalties

The norm penalty considered in (4) is well-known to produce biased estimates [57], which motivates us to extend the GTF framework to a broader class of sparsity-promoting regularizers that are not necessarily convex. We wish to minimize the following generalized th order GTF loss function:

(5)

where

is a regularizer defined as the sum of the penalty function applied element-wise to . Here, for even and for odd to account for different dimensions of ; see (1). We will refer to the GTF estimator that minimizes as scalar-GTF.

Similarly to [17, 19, 49], we consider a family of penalty functions that satisfies the following assumptions.

Assumption 1.

Assume satisfies the following:

  1. satisfies , is symmetric around , and is non-decreasing on the real non-negative line.

  2. For , the function is non-increasing in . Also, is differentiable for all and sub-differentiable at , with . This upper bounds .

  3. There exists such that is convex.

Many penalty functions satisfy these assumptions. Besides the penalty, the non-convex SCAD [15] penalty

(6)

and the MCP [16]

(7)

also satisfy them. We note that Assumption 1 (c) is satisfied for SCAD with and for MCP with . Fig. 2 illustrates the , SCAD and MCP penalties for comparison. While the non-convexity means that in general, we may not always find the global optimum of , it often affords us many other advantages. SCAD and MCP both taper off to a constant value and hence apply less shrinkage for higher values. As a result, they mitigate the bias effect while promoting sparsity. Further, they are smooth and differentiable for and are both upper bounded by the penalty for all .

Fig. 2: Illustration of for , SCAD (), and MCP (), where . Both SCAD and MCP move towards as increases.

Iv-B Vector-Valued GTF

In many applications, the signals on each node are in fact multi-dimensional or vector-valued, e.g. time series in social networks, multi-class labels in semi-supervised learning, feature vectors of different objects in feature selection. Therefore, it is natural to consider an extension to the graph signal denoising problem, where the graph signal on each node is a -dimensional vector instead of a scalar. In this scenario, we define a vector-valued graph signal to be piecewise smooth if it is piecewise smooth in each of its dimensions, and assume their discontinuities to coincide over the same small set of edges or nodes. Further, we denote the vector-valued signal of interest as , such that the th row of the matrix corresponds to the th node of the graph. The noise model for the observation matrix is defined as

(8)

where each element of is drawn i.i.d from . A naïve approach is to estimate each column of separately via scalar-GTF:

(9)

However, this formulation does not take full advantage of the multi-dimensionality of the graph signal. Instead, when the columns of are correlated, coupling them can be beneficial such that we encourage the sharing of information across dimensions or features. For example, if one column exhibits strong piecewise smoothness over the graph, and therefore has compelling evidence about the relationship between nodes, sharing that information to a related column can improve the overall denoising and filtering performance. As a result, we formulate a vector-GTF problem as follows:

(10)

where the new penalty function is the sum of applied to the norm of each row of :

(11)

By enforcing sparsity on , we are coupling to be of similar sparsity patterns across . Note the difference from (9), where elements of can be set to zero or non-zero independently.

V Theoretical Guarantees

In this section, we present the error rates and support recovery guarantees of the generalized GTF estimators, namely scalar-GTF (5) and vector-GTF (10), under the AWGN noise model. Before continuing, we first define a few useful quantities. Let be the number of connected components in the graph , or equivalently, the dimension of the null space of . Further, let be the number of rows of , and be the maximum norm of the columns of .

V-a Error Rates of First-order Stationary Points

Due to non-convexity, global minima of the proposed GTF estimators may not be attainable. Therefore, it is more desirable to understand the statistical performance of any first-order stationary points of the GTF estimators by considering oracle inequalities. We call a stationary point of , if it satisfies

We further introduce the compatibility factor, which generalizes the notion used in [35] to allow vector-valued signals.

Definition 1 (Compatibility factor).

Let be fixed. The compatibility factor of a nonempty set is defined as

To build intuition, consider . This is precisely the definition of , an induced norm of the submatrix of . Together with , can be related to the restrictive eigenvalue condition, which is often used to bound the performance of LASSO [39]. With slight abuse of notation, we write .

We have the following oracle inequality that is applicable to the stationary points of the GTF estimators, whose proof is given in Appendix -A. We stress that although GTF was motivated by piecewise smooth graph signals, Theorem 1 holds for any graph and graph signal .

Theorem 1 (Oracle inequality of GTF stationary points).

Assume . Fix . For scalar-GTF (5), let be a stationary point. Set , then

(12)

with probability at least for any . Similarly, for vector-GTF (10), let be a stationary point. Set , then

(13)

with probability at least for any .

Remark 1.

Recall that is defined in Assumption 1 (c), which characterizes how “non-convex” the regularizer is, and dictates the inflection point in Fig. 2. The assumption in Theorem 1 therefore implicitly constrains the level of non-convexity of the regularizer. Take MCP in (7) for example: since , we can guarantee the existence of a valid such that as long as we set .

Theorem 1 allows one to select and to optimize the error bounds on the right hand side of (12) and (13). For example, pick in (12) (hence an “oracle”) to have

(14)

where .

  • By setting as an empty set, we have

    (15)

    which suggest that the reconstruction accuracy improves when the ground truth is better aligned with the graph structure, and consequently the value of is small.

  • On the other hand, by setting as the support of , we achieve

    which grows linearly as we increase the sparsity level .

Similar discussions can be conducted for vector-GTF by choosing in (13). More importantly, we can directly compare the performance of vector-GTF with scalar-GTF, which was formulated for vector-valued graph signals in (9). The error bound of vector-GTF pays a small price in the order of , but is tighter than scalar-GTF if . This suggests that vector-GTF is much more advantageous when the support sets of for overlap, i.e. when the local discontinuities and patterns in are shared.

To sum up, despite being non-convex, we can guarantee that any stationary point of the proposed GTF estimator possesses strong statistical guarantees.

V-B Comparison with Scalar-GTF using Regularization

We compare our error bound for scalar-GTF with [12, Theorem 3], which is obtained for GTF with the penalty, reproduced below for convenience.

Theorem 2 (Basic error bound of GTF minimizer).

If , then , the minimizer of (4), satisfies

The above bound is comparable to our bound in the special case of setting to an empty set, i.e. (15). The first term of the bound in (15) is upper bounded by that of Theorem 2. The non-convex regularization yields especially tighter bounds when contains large coefficients, so that . On the other hand, the second term of (15) contains in the denominator, which makes it an upper bound of the second term in Theorem 2. This gap can be brought down by choosing a larger , which allows one to pick a smaller , as mentioned in Remark 1. However, as , non-convex SCAD and MCP also tends to , which erases the improvement from using non-convex regularizers in the first term of the bound. This indicates a trade-off in the overall error bound based on , or the “non-convexity” of the regularizers chosen for scalar-GTF.

V-C Error Rates for Erdős-Rényi Graphs

We next specialize Theorem 1 to the Erdős-Rényi random graphs using spectral graph theory [58]. Let and respectively be the maximum and expected degree of the graph. It is known that for any graph it holds [12]

(16)

where is the smallest non-zero eigenvalue of the graph Laplacian matrix . Moreover, we have , and  [59]. Next, we present a simple lower bound on , which is proved in Appendix -B.

Proposition 1 (Bound on ).

is bounded for any and as

For an Erdős-Rényi random graph, if , we have almost surely [60, Corollary 8.2] and . Furthermore, [12, 58, 61], and for odd and for even . Therefore, with probability at least , we have

where by plugging in . These results are also applicable to -regular Ramanujan graphs [61].

V-D Support Recovery

An alternative yet important metric for gauging the success of the proposed GTF estimators is support recovery, which aims to localize the discontinuities in the piecewise smooth graph signals, i.e. the support set of , that is

(17)

In particular, for odd , the discontinuities correspond to graph nodes; and for even , they correspond to the edges. Let be the GTF estimate of the graph signal. The quality of the support recovery can be measured using the graph screening distance [37]. For any and , let denote the length of the shortest path between them. The distance of from is then defined as

(18)

Interestingly, Lin et.al. [37] showed recently that under mild assumptions, one can translate the error bound into a support recovery guarantee. Specifically, letting be the RHS of (12) that bounds the error in Theorem 1, we have

(19)

where quantifies the minimum level of discontinuity, defined as the minimum absolute value of the non-zero values of , i.e.

(20)

Consequently, this leads to support recovery guarantees of the proposed GTF estimators. Numerical experiment in Section VII-A verifies the superior performance of the non-convex regularizers over the regularizer for support recovery.

Vi ADMM Algorithm and its Convergence

There are many algorithmic approaches to optimize the vector-GTF formulation in (10), since scalar-GTF (5) can be regarded as a special case with . In this section, we illustrate the approach adopted in this paper, which is the Alternating Direction Method of Multipliers (ADMM) framework for solving separable optimization problems [62].

Via a change of variable as , we can transform (10) to

Its corresponding Lagrangian can be written as:

(21)

where is the Lagrangian multiplier, and is the parameter. Alg. 1 shows the ADMM updates based on the Lagrangian in (21). Recall the proximal operator is defined as for a function . , SCAD and MCP all admit closed-form solutions of Prox, which are simple thresholding operations [63]. Furthermore, we have the following convergence guarantee for Alg. 1, whose proof is provided in Appendix -C.

Theorem 3.

Let , then Alg. 1 converges to a stationary point of (10).

1:Inputs: , and parameters , ,
2:Initialize:
3:      or if given. , ,
4:repeat
5:     for  1 to num_cols() do
6:         
7:     end for
8:     for  1 to num_rows() do
9:         
10:     end for
11:     
12:until termination
Algorithm 1 ADMM for solving (10)

In addition, we provide a detailed time complexity analysis of Alg. 1 in Table II. Note that since is a sparse matrix with exactly non-zero entries, Alg. 1 can run much faster when . As a preprocessing step for each , we compute and , the eigenvectors and eigenvalues of , exactly once. can then be initialized very efficiently for all experiments that use .

eigen decomposition
initialization
initialization
update
calculation
update
Total after iterations
TABLE II: Time complexity analysis of Alg. 1.

Vii Numerical Experiments

For the following experiments, we fixed for SCAD, for MCP. Further, the graphs we use in the following experiments satisfy Assumption 2 for this choice of . Unless explicitly mentioned, we tuned and for each experiment using the Hyperopt toolbox [64]. To meet the convergence criteria in Theorem 3, we enforced . SCAD/MCP were warm-started with the GTF estimate with penalty. Python packages PyGSP [65] and NetworkX [66] were used to construct and plot graphs. The input signal SNR was calculated as , while the reconstructed signal SNR was calculated as , where was the reconstruction. Computation time was measured with MacBook Pro 2017 with an 2.9 GHz Intel Core i7 and 16GB RAM. Our code is available at https://github.com/HarlinLee/nonconvex-GTF-public.

Vii-a Denoising via GTF with Non-convex Regularizers

Fig. 3: Scalar-GTF with MCP (orange) has much lower bias than scalar-GTF with (blue) when estimating a piecewise constant signal over a grid graph. See highlighted regions pointed by red arrows in A and B. The scatter points correspond to a noisy signal with dB SNR.
Fig. 4: The ROC curve for classifying whether an edge lies on a boundary for the Minnesota road graph signal shown in Fig. 5. The input SNR of the noisy piecewise constant signal is dB.

We first highlight via synthetic examples two important advantages that non-convex regularizers provide over the penalty.

  • Bias Reduction: We demonstrate the reduction in signal bias in Fig. 3 for the graph signal defined over a 2D-grid graph, using both the penalty and the MCP penalty. Clearly, the MCP estimate (orange) has less bias than the estimate (blue), and can recover the ground truth surface (purple) more closely.

  • Support Recovery: We illustrate the improved support recovery performance of non-convex penalties on localizing the boundaries for a piecewise constant signal on the Minnesota road graph, shown in Fig. 5. Particularly, we look at how well our estimator localizes the support of , that is, the discontinuity of the piecewise constant graph signal by looking at how well we can classify an edge as connecting two nodes in the same piece or being a cut edge across two pieces. By sweeping the regularization parameter , we obtain the ROC curve in Fig. 4, i.e. the true positive rate versus the false positive rate of classifying a cut edge correctly, and see that scalar-GTF with MCP and SCAD consistently outperforms the scalar-GTF with penalty.

Fig. 5: The left panel shows the ground truth piecewise constant signals on 2D-grid graph (top), and Minnesota road graph (bottom). The middle panel shows their corresponding plots of input signal SNR versus reconstructed signal SNR, averaged over 10 and 20 realizations, respectively. Finally, the right panel plots the computation time against gain in SNR from denoising via vector-GTF. 10 trials were performed for each regularizer, where the input signal SNR was fixed at 20dB.

Then, we compare the performance of GTF using non-convex regularizers such as SCAD and MCP with that using the norm more rigorously. For the ground truth signal , we construct a piecewise constant signal on a 2D-grid graph and the Minnesota road graph [65] as shown in the left panel of Fig. 5, and add different levels of noise following (2). We recover the signal by scalar-GTF with Alg. 1, and plot the SNR of the reconstructed signal versus the SNR of the input signal in solid lines in the middle panel of Fig. 5, averaged over and realizations, respectively. SCAD/MCP consistently outperforms in denoising graph signals defined over both regular and irregular structures.

Vii-B Denoising Vector-valued Signals via GTF

We compare the performance of vector-GTF in (10) with (9), which applies scalar-GTF to each column of the vector-valued graph signal. The convex norm, and the non-convex SCAD and MCP are employed. We reuse the same ground truth graph signals over the 2D-grid graph and the Minnesota road graph constructed in Section VII-A in Fig. 5. independent noisy realizations of the graph signal are concatenated to construct a noisy vector-valued graph signal with dimension on the 2D-grid graph and with on the Minnesota road graph. We recover the vector-valued signal by minimizing vector-GTF (10) with Alg. 1.

The middle panel of Fig. 5 plots the average SNR of the reconstructed signal versus the average SNR of the input signal in dotted lines. We emphasize that the performance of (9) is the same as applying scalar-GTF to each realization, which is shown in the middle panel of Fig. 5 in solid lines. As before, SCAD/MCP consistently outperforms in denoising signals over both regular and irregular graphs. Furthermore, as expected, due to the sharing of information across realizations, vector-GTF consistently outperforms scalar-GTF, especially in the low SNR regime.

The right panel of Fig. 5 plots the computation time versus the gain in SNR from denoising via vector-GTF. 10 trials are performed for each regularizer with the input signal SNR fixed at 20dB. Parameter tuning and eigen decomposition of are preprocessing steps, and hence they are not included in the time measurement. However, since GTF with non-convex regularizers are warm-started by the estimate, the runtime for GTF is added to the SCAD/MCP runtime. Overall, running vector-GTF with SCAD/MCP after once with takes more time, but with large benefits in the denoising performance. Even with the additional computation time, Vector-GTF runs reasonably fast; with the Minnesota road network, where and , computation takes less than 25 seconds.

Average #1 #2 #3 #4 #5 #6 #7 #8
Input SNR (dB) 8.7 -14 0 0 3.5 5.8 12 29 34
Vector-GTF + 29 10 20 23 26 36 37 39 38
Scalar-GTF + 21 0 11 13 16 18 26 41 45
Vector-GTF + SCAD 32 10 20 22 25 36 35 49 61
Scalar-GTF + SCAD 29 0 15 17 25 35 34 47 60
Vector-GTF + MCP 32 10 20 22 25 36 35 49 61
Scalar-GTF + MCP 29 0 15 22 24 30 33 49 60
TABLE III: Noisy input and reconstructed signal SNRs for eight measurements of varying input SNRs, rounded to two significant figures. Highest reconstructed signal SNR for each measurement is in bold.

We further investigate the benefit of sharing information across measurements or realizations in the following experiment, using the same ground truth signal on the 2D-grid graph. We stack eight noisy realizations of this same piecewise constant signal to build a vector-valued signal. We construct these noisy measurements by scaling each one of them differently and randomly such that each will have SNR dB under (2). This has the effect of rendering some measurements more informative than others, and potentially allowing vector-GTF to reap the benefits of sharing information across measurements. We recover the 8-dimensional graph signal via Alg. 1 using , SCAD, and MCP regularizers, and in Table III, report the input signal and reconstructed signal SNRs for each measurement in addition to the average SNRs. is fixed at .

First of all, notice that as before, using SCAD/MCP generally achieves results with higher SNR than using , and that on average, minimizing (10) outperforms minimizing (9). The effect of sharing information across measurements is most apparent in low SNR settings, when information about the boundaries of the graph signal can be borrowed from higher SNR signals to improve the estimation. On the other hand, sharing information with noisier signals does not help denoising signals with high input SNR. However, it is worth noting that, unlike , SCAD/MCP does not see decrease in its performance in the high SNR settings.

Vii-C Event Detection with NYC Taxi Data

Fig. 6: Top left: the noisy signal on the Manhattan road network is the change in the taxi pickup and dropoff count during the 2011 NYC Gay Pride. Top right: areas of Pride events, where the traffic was blocked off. Bottom: the GTF estimates using and MCP. The GTF estimate with MCP better detects and localizes the event, compared to the one using penalty.

To further illustrate graph trend filtering on a real-world dataset, we consider the road network of Manhattan where the nodes correspond to junctions [67]. We map the pickups and dropoffs of the NYC taxi trip dataset [68, 69] to the nearest road junctions, and define the total count at that junction to be the signal value on the corresponding graph node. The signal of interest, plotted on the top left panel of Fig. 6, is the difference between the event graph signal on the day of NYC Gay Pride parade, 12-2pm on June 26, 2011, and the seasonal average graph signal at the same time during the 8 nearest Sundays. During the event, no pickups and dropoffs could occur in the areas shown in the top right panel of Fig. 6 [70]. We denoise the signal via GTF using both and MCP, where we chose such that . Once again, we observe the GTF estimate with MCP produces sharper traces around the parade route, indicating better capabilities of event detection and localization.

Vii-D Semi-supervised Classification

Heart Wine quality Wine Iris Breast Car
# of samples () 303 1599 178 150 569 1728
# of classes () 2 6 3 3 2 4
0.148 0.346 0.038 0.036 0.042 0.172
SCAD p-value 0.148 0.353 0.038 0.033 0.042 0.149
1. 0.06 1. 0.27 1. 0.06
MCP p-value 0.144 0.351 0.037 0.035 0.040 0.148
0.23 0.18 0.34 0.34 0.35 0.05
0.143 0.351 0.034 0.039 0.035 0.104
SCAD p-value 0.144 0.350 0.034 0.039 0.035 0.104
0.30 0.43 0.34 1. 0.71 0.66
MCP p-value 0.146 0.350 0.034 0.039 0.034 0.103
0.05 0.44 0.34 1. 0.02 0.23
TABLE IV: Misclassification rates averaged over 10 trials, with p-values from running sampled t-tests between SCAD/MCP misclassification rates and the corresponding rates using . Cases where non-convex penalties perform better than with p-value below are highlighted in bold, and where they perform worse are in italic.

Graph-based learning provides a flexible and attractive way to model data in semi-supervised classification problems when vast amounts of unlabeled data are available compared to labeled data, and labels are expensive to acquire [10, 11, 54]. One can construct a nearest-neighbor graph based on the similarities between each pair of samples, and hope to propagate the label information from labeled samples to unlabeled ones. We move beyond our original problem in (5) to a -class classification problem in a semi-supervised learning setting, where for a given dataset with samples, we observe a subset of the one-hot encoded class labels, , such that if th sample has been observed to be in th class, and otherwise. A diagonal indicator matrix denotes samples whose class labels have been observed. Then, we can define the modified absorption problem [11, 12, 54] using a variation of GTF to estimate the unknown class probabilities :

(22)

where (set to be uniform in the experiment) is a fixed prior belief, and determines how much emphasis to be given to the prior belief. The labels can be estimated using such that if and only if , and otherwise . Note that this can be completely separated into scalar-GTF problems, one corresponding to each class.

We applied the algorithm in to 6 popular UCI classification datasets [71] with . For each dataset, we normalized each feature to have zero mean and unit variance, and constructed a 5-nearest-neighbor graph of the samples based on the Euclidean distance between their features, with edge weights from the Gaussian radial basis kernel. We observed the labels of 20% of samples in each class randomly. Table IV shows the misclassification rates averaged over repetitions, which demonstrates that the performance using non-convex penalties such as SCAD/MCP are at least competitive with, and often better than, those with the penalty.

Viii Conclusions

We presented a framework for denoising piecewise smooth signals on graphs that generalizes the graph trend filtering framework to handle vector-valued signals using a family of non-convex regularizers. We provided theoretical guarantees on the error rates of our framework, and derived a general ADMM-based algorithm to solve this generalized graph trend filtering problem. Furthermore, we demonstrated the superior performance of these non-convex regularizers in terms of reconstruction error, bias reduction, and support recovery on both synthetic and real-world data. In particular, its performance on semi-supervised classification is investigated. In future work, we plan to further study this approach when the graph signals are observed by indirect and incomplete measurements.

References

  • [1] R. Varma, H. Lee, Y. Chi, and J. Kovačević, “Improving graph trend filtering with non-convex penalties,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 5391–5395.
  • [2] M. Newman, Networks.   Oxford University Press, 2018.
  • [3] A. Sandryhaila and J. M. Moura, “Discrete signal processing on graphs,” IEEE Transactions on Signal Processing, vol. 61, no. 7, pp. 1644–1656, 2013.
  • [4] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83–98, 2013.
  • [5] A. Ortega, P. Frossard, J. Kovačević, J. M. Moura, and P. Vandergheynst, “Graph signal processing: Overview, challenges, and applications,” Proceedings of the IEEE, vol. 106, no. 5, pp. 808–828, 2018.
  • [6] S. Chen, R. Varma, A. Singh, and J. Kovačević, “Signal representations on graphs: Tools and applications,” arXiv preprint arXiv:1512.05406, 2015.
  • [7] S. Chen, R. Varma, A. Sandryhaila, and J. Kovačević, “Discrete signal processing on graphs: Sampling theory,” IEEE Transactions on Signal Processing, vol. 63, no. 24, pp. 6510–6523, 2015.
  • [8] S. Chen, A. Sandryhaila, J. M. Moura, and J. Kovačević, “Signal recovery on graphs: Variation minimization,” IEEE Transactions on Signal Processing, vol. 63, pp. 4609–4624, 2015.
  • [9] A. Elmoataz, O. Lezoray, and S. Bougleux, “Nonlocal discrete regularization on weighted graphs: a framework for image and manifold processing,” IEEE Transactions on Image Processing, vol. 17, no. 7, pp. 1047–1060, 2008.
  • [10] M. Belkin, I. Matveeva, and P. Niyogi, “Regularization and semi-supervised learning on large graphs,” in International Conference on Computational Learning Theory.   Springer, 2004, pp. 624–638.
  • [11] X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semi-supervised learning using gaussian fields and harmonic functions,” in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 912–919.
  • [12] Y.-X. Wang, J. Sharpnack, A. J. Smola, and R. J. Tibshirani, “Trend filtering on graphs,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 3651–3691, 2016.
  • [13] S.-J. Kim, K. Koh, S. Boyd, and D. Gorinevsky, “ Trend Filtering,” SIAM Review, vol. 51, no. 2, pp. 339–360, 2009.
  • [14] P. Bühlmann and S. Van De Geer, Statistics for High-Dimensional Data: Methods, Theory and Applications.   Springer Science & Business Media, 2011.
  • [15] J. Fan and R. Li, “Variable selection via nonconcave penalized likelihood and its oracle properties,” Journal of the American Statistical Association, vol. 96, no. 456, pp. 1348–1360, 2001.
  • [16] C.-H. Zhang, “Nearly unbiased variable selection under minimax concave penalty,” The Annals of Statistics, vol. 38, no. 2, pp. 894–942, 2010.
  • [17] P.-L. Loh and M. J. Wainwright, “Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima,” in Advances in Neural Information Processing Systems, 2013, pp. 476–484.
  • [18] P.-L. Loh, “Statistical consistency and asymptotic normality for high-dimensional robust M-estimators,” The Annals of Statistics, vol. 45, no. 2, pp. 866–896, 2017.
  • [19] C.-H. Zhang and T. Zhang, “A general theory of concave regularization for high-dimensional sparse estimation problems,” Statistical Science, vol. 27, no. 4, pp. 576–593, 2012.
  • [20] P. Breheny and J. Huang, “Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection,” The Annals of Applied Statistics, vol. 5, no. 1, p. 232, 2011.
  • [21] S. Ma and J. Huang, “A concave pairwise fusion approach to subgroup analysis,” Journal of the American Statistical Association, vol. 112, no. 517, pp. 410–423, 2017.
  • [22] S. Chen, F. Cerda, P. Rizzo, J. Bielak, J. H. Garrett, and J. Kovačević, “Semi-supervised multiresolution classification using adaptive graph filtering with application to indirect bridge structural health monitoring,” IEEE Transactions on Signal Processing, vol. 62, no. 11, pp. 2879–2893, 2014.
  • [23] D. Hallac, J. Leskovec, and S. Boyd, “Network lasso: Clustering and optimization in large graphs,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.   ACM, 2015, pp. 387–396.
  • [24] A. Jung, A. O. Hero III, A. Mara, and S. Jahromi, “Semi-supervised learning via sparse label propagation,” arXiv preprint arXiv:1612.01414, 2016.
  • [25] A. Jung, N. Tran, and A. Mara, “When is network lasso accurate?” Frontiers in Applied Mathematics and Statistics, vol. 3, p. 28, 2018.
  • [26] N. Tran, S. Basirian, and A. Jung, “When is network lasso accurate: The vector case,” arXiv preprint arXiv:1710.03942, 2017.
  • [27] E. Mammen and S. van de Geer, “Locally adaptive regression splines,” The Annals of Statistics, vol. 25, no. 1, pp. 387–413, 1997.
  • [28] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D: nonlinear phenomena, vol. 60, no. 1-4, pp. 259–268, 1992.
  • [29] T. F. Chan, S. Osher, and J. Shen, “The digital TV filter and nonlinear denoising,” IEEE Transactions on Image Processing, vol. 10, no. 2, pp. 231–241, 2001.
  • [30] F. Mahmood, N. Shahid, U. Skoglund, and P. Vandergheynst, “Adaptive graph-based total variation for tomographic reconstructions,” IEEE Signal Processing Letters, vol. 25, no. 5, pp. 700–704, 2018.
  • [31] R. J. Tibshirani, “Adaptive piecewise polynomial estimation via trend filtering,” The Annals of Statistics, vol. 42, no. 1, pp. 285–323, 2014.
  • [32] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight, “Sparsity and smoothness via the fused lasso,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 1, pp. 91–108, 2005.
  • [33] J. Sharpnack, A. Singh, and A. Rinaldo, “Sparsistency of the edge lasso over graphs,” in Artificial Intelligence and Statistics, 2012, pp. 1028–1036.
  • [34] Z. Harchaoui and C. Lévy-Leduc, “Multiple change-point estimation with a total variation penalty,” Journal of the American Statistical Association, vol. 105, no. 492, pp. 1480–1493, 2010.
  • [35] J.-C. Hütter and P. Rigollet, “Optimal rates for total variation denoising,” in Conference on Learning Theory, 2016, pp. 1115–1146.
  • [36] A. S. Dalalyan, M. Hebiri, and J. Lederer, “On the prediction performance of the lasso,” Bernoulli, vol. 23, no. 1, pp. 552–581, 2017.
  • [37] K. Lin, J. Sharpnack, A. Rinaldo, and R. J. Tibshirani, “Approximate Recovery in Changepoint Problems, from Estimation Error Rates,” arXiv preprint arXiv:1606.06746, 2016.
  • [38] A. Jung and M. Hulsebos, “The network nullspace property for compressed sensing of big data over networks,” arXiv preprint arXiv:1705.04379, 2017.
  • [39] S. A. Van De Geer, P. Bühlmann et al., “On the conditions used to prove oracle results for the lasso,” Electronic Journal of Statistics, vol. 3, pp. 1360–1392, 2009.
  • [40] S. F. Cotter, B. D. Rao, K. Engan, and K. Kreutz-Delgado, “Sparse solutions to linear inverse problems with multiple measurement vectors,” IEEE Transactions on Signal Processing, vol. 53, no. 7, pp. 2477–2488, 2005.
  • [41] J. Chen and X. Huo, “Theoretical results on sparse representations of multiple-measurement vectors,” IEEE Transactions on Signal Processing, vol. 54, no. 12, pp. 4634–4643, 2006.
  • [42] Y. C. Eldar, P. Kuppinger, and H. Bolcskei, “Block-sparse signals: Uncertainty relations and efficient recovery,” IEEE Transactions on Signal Processing, vol. 58, no. 6, pp. 3042–3054, 2010.
  • [43] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online Learning for Matrix Factorization and Sparse Coding,” Journal of Machine Learning Research, vol. 11, no. Jan, pp. 19–60, 2010.
  • [44] Y. Chen, N. M. Nasrabadi, and T. D. Tran, “Hyperspectral Image Classification Using Dictionary-Based Sparse Representation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 10, pp. 3973–3985, Oct. 2011.
  • [45] Y. Li and Y. Chi, “Off-the-Grid Line Spectrum Denoising and Estimation With Multiple Measurement Vectors,” IEEE Transactions on Signal Processing, vol. 64, no. 5, pp. 1257–1269, Mar. 2016.
  • [46] Y. C. Eldar and M. Mishali, “Robust Recovery of Signals From a Structured Union of Subspaces,” IEEE Transactions on Information Theory, vol. 55, no. 11, pp. 5302–5316, Nov. 2009.
  • [47] M. E. Davies and Y. C. Eldar, “Rank Awareness in Joint Sparse Recovery,” IEEE Transactions on Information Theory, vol. 58, no. 2, pp. 1135–1146, Feb. 2012.
  • [48] P.-L. Loh and M. J. Wainwright, “Support recovery without incoherence: A case for nonconvex regularization,” The Annals of Statistics, vol. 45, no. 6, pp. 2455–2482, 2017.
  • [49] L. Chen and Y. Gu, “The convergence guarantees of a non-convex approach for sparse recovery,” IEEE Transactions on Signal Processing, vol. 62, no. 15, pp. 3754–3767, 2014.
  • [50] R. Chartrand and V. Staneva, “Restricted isometry properties and nonconvex compressive sensing,” Inverse Problems, vol. 24, no. 3, p. 035020, 2008.
  • [51] K. Ji, J. Tan, Y. Chi, and J. Xu, “Learning latent features with pairwise penalties in matrix completion,” arXiv preprint arXiv:1802.05821, 2018.
  • [52] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in Advances in Neural Information Processing Systems, 2002, pp. 585–591.
  • [53] ——, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Computation, vol. 15, no. 6, pp. 1373–1396, 2003.
  • [54] P. P. Talukdar and K. Crammer, “New regularized algorithms for transductive learning,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.   Springer, 2009, pp. 442–457.
  • [55] R. J. Tibshirani, The Solution Path of the Generalized Lasso.   Stanford University, 2011.
  • [56] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), pp. 267–288, 1996.
  • [57] C.-H. Zhang and J. Huang, “The sparsity and bias of the lasso selection in high-dimensional linear regression,” The Annals of Statistics, vol. 36, no. 4, pp. 1567–1594, 2008.
  • [58] F. Chung and M. Radcliffe, “On the spectra of general random graphs,” The Electronic Journal of Combinatorics, vol. 18, no. 1, p. 215, 2011.
  • [59] F. R. Chung and F. C. Graham, Spectral Graph Theory.   American Mathematical Society, 1997.
  • [60] A. Blum, J. Hopcroft, and R. Kannan, “Foundations of data science,” Vorabversion eines Lehrbuchs, 2016.
  • [61] A. Lubotzky, R. Phillips, and P. Sarnak, “Ramanujan graphs,” Combinatorica, vol. 8, no. 3, pp. 261–277, 1988.
  • [62] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine learning, vol. 3, no. 1, pp. 1–122, 2011.
  • [63] J. Huang, P. Breheny, and S. Ma, “A selective review of group selection in high-dimensional models,” Statistical science: a review journal of the Institute of Mathematical Statistics, vol. 27, no. 4, 2012.
  • [64] J. Bergstra, D. Yamins, and D. D. Cox, “Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures,” in Proceedings of the 30th International Conference on Machine Learning, 2013.
  • [65] M. Defferrard, L. Martin, R. Pena, and N. Perraudin, “PyGSP: Graph Signal Processing in Python.” [Online]. Available: https://github.com/epfl-lts2/pygsp/
  • [66] A. Hagberg, P. Swart, and D. S Chult, “Exploring network structure, dynamics, and function using NetworkX,” 2008.
  • [67] G. Boeing, “OSMnx: New methods for acquiring, constructing, analyzing, and visualizing complex street networks,” Computers, Environment and Urban Systems, vol. 65, pp. 126–139, 2017.
  • [68] Taxi and Limousine Commission (TLC), “2011 yellow taxi trip data.” [Online]. Available: https://data.cityofnewyork.us/Transportation/2011-Yellow-Taxi-Trip-Data/uwyp-dntv
  • [69] Socrata API, “2011 yellow taxi trip data.” [Online]. Available: https://dev.socrata.com/foundry/data.cityofnewyork.us/uwyp-dntv
  • [70] “2011 Pride Guide.” [Online]. Available: https://issuu.com/nycpride/docs/prideguide_9_rev2_low-res
  • [71] A. Asuncion and D. Newman, UCI Machine Learning Repository, 2007.
  • [72] M. J. Wainwright, High-dimensional statistics: A non-asymptotic viewpoint.   Cambridge University Press, 2019, vol. 48.
  • [73] P. Tseng, “Convergence of a block coordinate descent method for nondifferentiable minimization,” Journal of Optimization Theory and Applications, vol. 109, no. 3, pp. 475–494, 2001.

-a Proof of Theorem 1

Proof.

We denote . Define as the row space of , and the null space. Let , the projection onto , and . Additionally, , the projection onto . Since is a stationary point of , it follows that