Recovery Conditions and Sampling Strategies for Network Lasso
Abstract
The network Lasso is a recently proposed convex optimization method for machine learning from massive network-structured datasets, i.e., big data over networks. It is a variant of the well-known least absolute shrinkage and selection operator (Lasso), which underlies many methods in learning and signal processing involving sparse models. Highly scalable implementations of the network Lasso can be obtained by state-of-the-art proximal methods, e.g., the alternating direction method of multipliers (ADMM). By generalizing the concept of the compatibility condition put forward by van de Geer and Bühlmann as a powerful tool for the analysis of the plain Lasso, we derive a sufficient condition, the network compatibility condition (NCC), on the underlying network topology such that the network Lasso accurately learns a clustered underlying graph signal. The NCC relates the location of the sampled nodes to the clustering structure of the network. In particular, the NCC informs the choice of which nodes to sample or, in machine learning terms, which data points provide the most information if labeled.
Alexandru Mara and Alexander Jung
Department of Computer Science, Aalto University, Finland; firstname.lastname(at)aalto.fi
Keywords: compressed sensing, big data, semi-supervised learning, complex networks, convex optimization
I Introduction
We consider semi-supervised learning from massive heterogeneous datasets with an intrinsic network structure, which occur in many important applications ranging from image processing to bioinformatics [17]. In contrast to standard supervised learning methods, e.g., linear or logistic regression, which embed the data points into Euclidean space [2, 9], we model the data points as nodes of a finite space whose discrete topology is represented by a data graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathbf{W})$ whose nodes $\mathcal{V}$ represent individual data points. Two nodes $i, j \in \mathcal{V}$ which represent similar data points are connected by an edge $\{i,j\} \in \mathcal{E}$ whose strength is quantified by the positive weight $W_{i,j}$.
The goal of semi-supervised learning for network-structured datasets is to learn an underlying hypothesis which maps each data point $i \in \mathcal{V}$ to a label $x[i]$, which can be a categorical or continuous variable. In some applications we have access to a small amount of initial label information in the form of (typically corrupted) samples $y_i$ taken for all nodes in a small sampling set $\mathcal{M} \subseteq \mathcal{V}$. In order to learn the complete label information, we rely on a smoothness hypothesis [2, 4], requiring the signal to be nearly constant over well-connected subsets of nodes (clusters).
By representing label information as graph signals and using their total variation (TV) for measuring smoothness of the labeling, the learning problem can be formulated as a convex TV minimization problem. Following this approach, the authors of [7] obtain the network Lasso, which can be interpreted as a generalization of Lasso-based methods for learning sparse parameter vectors [9].
An efficient and scalable implementation of the network Lasso can be obtained via the alternating direction method of multipliers (ADMM) [3]. The ADMM implementation is appealing since the resulting iterative algorithm scales well when run on modern big data frameworks and is guaranteed to converge under very general conditions [3].
In this paper, we present a condition on the network topology such that the network Lasso is able to accurately learn a clustered graph signal. To this end, we introduce a very simple model for graph signals which are constant over well-connected groups of nodes (clusters). Our condition, which we coin the “network compatibility condition”, amounts to the existence of certain network flows and is closely related to the “network nullspace condition” proposed recently by the first author [1, 11].
Closest to our research program, initiated by the works [1, 11, 12, 10, 8, 6], are [18, 21], which provide sufficient conditions such that a special case of the network Lasso (referred to as the “edge Lasso”) accurately recovers smooth graph signals from noisy observations. However, these works require access to fully labeled datasets, while we consider datasets which are only partially labeled.
Outline. In Section II, we formalize the problem of recovering (learning) smooth graph signals from observing their values at a few sampled nodes. In particular, we show how to formulate this recovery as a convex optimization problem which coincides with the network Lasso problem studied in [7]. Our main result, stated in Section III, is a sufficient condition on the network structure and sampling set such that accurate recovery is possible. Loosely speaking, this condition requires sampling nodes which are well connected to the boundaries of clusters.
II Problem Formulation
We consider massive heterogeneous datasets which are represented by a network, i.e., an undirected weighted data graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathbf{W})$ whose nodes $\mathcal{V}$ represent individual data points. For example, a node $i \in \mathcal{V}$ might represent a chat message on a user profile, measurements of a molecule, a sound fragment or tabulated numerical data (cf. Figure 1) [5].
Many applications naturally suggest a notion of similarity between individual data points, e.g., the profiles of befriended social network users or greyscale values of neighbouring image pixels. These domain-specific notions of similarity are represented by the edges $\mathcal{E}$ of the graph $\mathcal{G}$, i.e., the nodes $i, j \in \mathcal{V}$ representing similar data points are connected by an undirected edge $\{i,j\} \in \mathcal{E}$. We quantify the extent of the similarity between connected data points using positive edge weights $W_{i,j} > 0$, which we collect in the symmetric weight matrix $\mathbf{W}$. In what follows, we consider only simple data graphs without self loops, i.e., for any node $i \in \mathcal{V}$ we have $\{i,i\} \notin \mathcal{E}$ and $W_{i,i} = 0$.
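We sometimes need to orient the data graph by declaring for each edge $e = \{i,j\}$ one node as the head (e.g., $e^{+} = i$) and the other node as the tail (e.g., $e^{-} = j$). Given an edge set $\mathcal{S} \subseteq \mathcal{E}$ in the data graph $\mathcal{G}$, we denote the set of directed edges obtained by orienting $\mathcal{S}$ as $\overrightarrow{\mathcal{S}}$.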
Besides the network structure, encoded by the edges $\mathcal{E}$, a dataset typically contains additional information, e.g., features, labels or model parameters associated with individual data points. Let us represent this additional information by a graph signal defined over the data graph $\mathcal{G}$. A graph signal $x[\cdot]$ is a mapping $\mathcal{V} \rightarrow \mathbb{R}$, which associates every node $i \in \mathcal{V}$ with the value $x[i] \in \mathbb{R}$. For the house price example considered in [7], the graph signal $x[i]$ corresponds to a regression parameter for a local price model (used for the house market in a limited geographical area represented by the node $i$).
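In some applications, initial labels $y_i$ are available for a few data points only. We collect those nodes in the data graph for which initial labels are available in the sampling set $\mathcal{M} \subseteq \mathcal{V}$ (typically $|\mathcal{M}| \ll |\mathcal{V}|$). In what follows, we model the initial labels $y_i$ as noisy versions of the true underlying labels $x[i]$, i.e.,
(1) $y_i = x[i] + \varepsilon_i$ for each sampled node $i \in \mathcal{M}$, with $\varepsilon_i$ denoting the noise.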
II-A Learning Graph Signals
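We aim at learning a graph signal $x[\cdot]$ defined over the data graph $\mathcal{G}$, from observing its noisy values $y_i$ provided on a (small) sampling set
(2) $\mathcal{M} := \{i_1, \ldots, i_M\} \subseteq \mathcal{V}$,
where typically $M \ll |\mathcal{V}|$.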
The network Lasso is a particular recovery method which rests on a smoothness assumption that is similar in spirit to the smoothness hypothesis of supervised machine learning [4]:
Assumption 1.
The graph signal values (labels) $x[i]$, $x[j]$ of two nodes $i, j \in \mathcal{V}$ within a cluster of the data graph are similar, i.e., $x[i] \approx x[j]$.
The class of smooth graph signals includes low-pass signals in digital signal processing, where time samples at adjacent time instants (forming a chain graph) are strongly correlated for a sufficiently high sampling rate. Another application involving smooth graph signals is image processing for natural images (forming a grid graph), whose nearby pixels tend to have similar colours.
What sets our work apart from digital signal processing is that we consider datasets whose data graph is not restricted to regular chain or grid graphs but may form arbitrary (complex) networks. In particular, our analysis targets the tendency of the networks occurring in many practical applications to form clusters, i.e., well-connected subsets of nodes. A very basic example of such a clustered data graph is illustrated in Figure 2, which involves a partition of the data graph into two disjoint clusters $\mathcal{C}_1$ and $\mathcal{C}_2$. The informal smoothness hypothesis of Assumption 1 requires the signal values $x[i]$ for all nodes $i \in \mathcal{C}_1$ (or $i \in \mathcal{C}_2$) to be mutually similar, e.g., close to a common value $a_1$ (or $a_2$).
In what follows, we quantify the smoothness of a graph signal $x[\cdot]$ via its total variation (TV)
(3) $\|x[\cdot]\|_{\mathrm{TV}} := \sum_{\{i,j\} \in \mathcal{E}} W_{i,j} \, |x[j] - x[i]|$.
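It will be convenient to introduce, for a given subset of edges $\mathcal{S} \subseteq \mathcal{E}$, the shorthand
(4) $\|x[\cdot]\|_{\mathcal{S}} := \sum_{\{i,j\} \in \mathcal{S}} W_{i,j} \, |x[j] - x[i]|$.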
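The TV and its restriction to an edge subset can be sketched in a few lines of Python. This is a minimal illustration, assuming the weighted form $\sum W_{i,j} |x[j] - x[i]|$ over the edges; the function name `total_variation`, the dictionary-based edge representation and the toy chain graph are hypothetical choices made for the example and are not taken from the paper.

```python
def total_variation(edges, x, subset=None):
    """Weighted TV: sum of W_ij * |x[j] - x[i]| over all edges, or over a subset."""
    selected = edges if subset is None else {e: edges[e] for e in subset}
    return sum(w * abs(x[j] - x[i]) for (i, j), w in selected.items())

# Hypothetical toy chain graph 0-1-2-3 with a weak middle edge.
edges = {(0, 1): 2.0, (1, 2): 0.5, (2, 3): 2.0}
x_clustered = [1.0, 1.0, 5.0, 5.0]  # constant on {0, 1} and on {2, 3}
x_rough = [1.0, 5.0, 1.0, 5.0]      # changes across every edge

tv_clustered = total_variation(edges, x_clustered)                  # 2.0
tv_rough = total_variation(edges, x_rough)                          # 18.0
tv_boundary = total_variation(edges, x_clustered, subset=[(1, 2)])  # 2.0
```

The clustered signal only varies across the weak edge, so its TV coincides with the TV restricted to that single edge, while the oscillating signal pays for every edge.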
Besides smoothness, another criterion for learning graph signals is a small empirical error
(5) $\widehat{E}(x[\cdot]) := \sum_{i \in \mathcal{M}} |x[i] - y_i|$,
where $y_i$ denotes the initial label provided for a data point $i$ belonging to the sampling set $\mathcal{M}$.
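Learning a signal with small TV (cf. (3)) and small empirical error (cf. (5)) yields the optimization problem
(6) $\hat{x}[\cdot] \in \arg\min_{x[\cdot]} \; \widehat{E}(x[\cdot]) + \lambda \|x[\cdot]\|_{\mathrm{TV}}$,
where the minimization is over all graph signals defined on the data graph, $\widehat{E}(\cdot)$ is the empirical error (5), $\|\cdot\|_{\mathrm{TV}}$ is the TV (3), and $\lambda > 0$ is a tuning parameter.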
As the notation already indicates, there might be multiple solutions for the optimization problem (6). However, any learned graph signal obtained by solving (6) balances the empirical error with the TV of the learned graph signal. The optimization problem (6) is a special case of the network Lasso problem studied in [7]. In particular, the network Lasso formulation in [7] allows for vector-valued labels and more general empirical loss functions. The parameter $\lambda$ in (6) trades off a small empirical error against signal smoothness. Choosing a small value for $\lambda$ forces the solutions of (6) to yield a small empirical error, whereas choosing a large value for $\lambda$ forces the solutions of (6) to have small TV, i.e., to be smooth.
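The trade-off controlled by the regularization parameter can be made concrete on a toy problem. The sketch below assumes an absolute-error loss on the sampled nodes plus the parameter times the weighted TV, and it replaces a proper convex solver by an exhaustive search over grid-valued signals, which is only feasible for a handful of nodes. The names `objective` and `grid_search`, the chain graph and the labels are hypothetical.

```python
from itertools import product

def objective(x, edges, labels, lam):
    """Empirical error on the sampled nodes plus lam times the weighted TV."""
    error = sum(abs(x[i] - y) for i, y in labels.items())
    tv = sum(w * abs(x[j] - x[i]) for (i, j), w in edges.items())
    return error + lam * tv

def grid_search(edges, labels, lam, num_nodes, grid):
    """Exhaustive search over grid-valued signals (toy graphs only)."""
    return min(product(grid, repeat=num_nodes),
               key=lambda x: objective(x, edges, labels, lam))

edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0}  # hypothetical chain graph
labels = {0: 0.0, 3: 4.0}                        # labels at the two end nodes
grid = [0.0, 1.0, 2.0, 3.0, 4.0]

# Small lam: the minimizer reproduces the labels on the sampled nodes.
x_small_lam = grid_search(edges, labels, lam=0.1, num_nodes=4, grid=grid)
# Large lam: the minimizer is constant, i.e., it has zero TV.
x_large_lam = grid_search(edges, labels, lam=10.0, num_nodes=4, grid=grid)
```

In this toy problem several signals attain the minimum for either value of the parameter (for instance, every non-decreasing signal that matches both labels when it is small), which illustrates why the solution need not be unique.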
There exist highly efficient methods for solving the network Lasso problem (6) (cf. [22] and the references therein). Most state-of-the-art convex optimization methods belong to the family of proximal methods [16]. One particular instance of proximal methods is ADMM, which has been applied to the network Lasso in [7] to obtain a highly scalable learning algorithm.
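Proximal methods are built from proximal operators of the individual terms of the objective. For the scaled absolute value, which is the elementary building block of both the TV and an absolute-error loss, the proximal operator is the well-known soft-thresholding map. The sketch below shows only this building block; how exactly it enters the ADMM iterations of [7] is not spelled out here, and the function name `soft_threshold` is a choice made for the example.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * |.|: argmin_u 0.5 * (u - v)**2 + tau * |u|."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0

shrunk_positive = soft_threshold(3.0, 1.0)   # 2.0
shrunk_negative = soft_threshold(-3.0, 1.0)  # -2.0
set_to_zero = soft_threshold(0.5, 1.0)       # 0.0
```

Values with magnitude below the threshold are set exactly to zero, which is the mechanism by which TV-type penalties produce signals that are exactly constant over groups of nodes.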
III Network Compatibility Condition
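For network Lasso methods based on solving (6) to be accurate, we have to verify that the solutions $\hat{x}[\cdot]$ of (6) are close to the true (but unknown) underlying graph signal $x[\cdot]$. In what follows, we present a condition which guarantees any solution of (6) to be close to a clustered graph signal $x[\cdot]$. Given a fixed partition $\mathcal{F} = \{\mathcal{C}_1, \ldots, \mathcal{C}_{|\mathcal{F}|}\}$ of the data graph into disjoint clusters $\mathcal{C}_l$, we define the class of clustered graph signals by
(7) $\mathcal{X} := \big\{ x[\cdot] = \sum_{\mathcal{C} \in \mathcal{F}} a_{\mathcal{C}} \, \mathcal{I}_{\mathcal{C}}[\cdot] : a_{\mathcal{C}} \in \mathbb{R} \big\}$,
where, for a subset $\mathcal{C} \subseteq \mathcal{V}$, we define the indicator signal
(8) $\mathcal{I}_{\mathcal{C}}[i] := 1$ if $i \in \mathcal{C}$, and $\mathcal{I}_{\mathcal{C}}[i] := 0$ otherwise.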
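For a given partition $\mathcal{F}$, the boundary $\partial\mathcal{F} \subseteq \mathcal{E}$ is the set of edges $\{i,j\} \in \mathcal{E}$ which connect nodes $i \in \mathcal{C}_a$ and $j \in \mathcal{C}_b$ from different clusters, i.e., with $a \neq b$. For a partition whose overall boundary weight $\sum_{\{i,j\} \in \partial\mathcal{F}} W_{i,j}$ is small, the clustered graph signals (7) have small TV, i.e., they are smooth.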
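The link between the boundary and the TV of a clustered signal can be checked numerically. The following sketch builds a piecewise constant signal from per-cluster coefficients, extracts the boundary edges of a partition, and confirms that only boundary edges contribute to the weighted TV. The helper names `clustered_signal` and `boundary`, the small graph and the coefficients are hypothetical.

```python
def clustered_signal(clusters, coefficients, num_nodes):
    """Signal taking the value coefficients[c] on every node of clusters[c]."""
    x = [0.0] * num_nodes
    for nodes, a in zip(clusters, coefficients):
        for i in nodes:
            x[i] = a
    return x

def boundary(edges, clusters):
    """Edges whose two end nodes lie in different clusters."""
    cluster_of = {i: c for c, nodes in enumerate(clusters) for i in nodes}
    return {e: w for e, w in edges.items()
            if cluster_of[e[0]] != cluster_of[e[1]]}

# Hypothetical graph: two triangles-like clusters joined by one weak edge.
edges = {(0, 1): 2.0, (1, 2): 2.0, (0, 2): 2.0,
         (2, 3): 0.5,
         (3, 4): 2.0, (4, 5): 2.0}
clusters = [[0, 1, 2], [3, 4, 5]]

x = clustered_signal(clusters, [1.0, 5.0], num_nodes=6)
cut = boundary(edges, clusters)  # {(2, 3): 0.5}
tv = sum(w * abs(x[j] - x[i]) for (i, j), w in edges.items())
tv_cut = sum(w * abs(x[j] - x[i]) for (i, j), w in cut.items())
```

Here both `tv` and `tv_cut` equal the boundary weight 0.5 times the coefficient gap 4, so a light boundary directly translates into a smooth clustered signal.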
The signal model (7), which has also been used in [18, 21], is closely related to the stochastic block model (SBM) [15]. Indeed, the SBM is obtained from (7) by choosing a distinct coefficient for each cluster. Moreover, the SBM provides a generative (stochastic) model for the edges within and between the clusters.
The main contribution of this paper is the insight that the network Lasso accurately learns clustered graph signals (cf. (7)) if there exist certain network flows [13] between the sampled nodes in the sampling set $\mathcal{M}$.
Definition 1.
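Consider an empirical graph $\mathcal{G}$ with an arbitrary but fixed orientation. A flow with demands $d[i]$, for $i \in \mathcal{V}$, is a mapping $f[\cdot]: \overrightarrow{\mathcal{E}} \rightarrow \mathbb{R}$ satisfying

the conservation law
(9) $\sum_{j \in \mathcal{N}^{+}(i)} f[(i,j)] - \sum_{j \in \mathcal{N}^{-}(i)} f[(j,i)] = d[i]$ for any node $i \in \mathcal{V}$,
the capacity constraints
(10) $|f[e]| \leq W_e$ for any oriented edge $e \in \overrightarrow{\mathcal{E}}$.
Here, we used the directed neighbourhoods $\mathcal{N}^{+}(i)$ and $\mathcal{N}^{-}(i)$ of node $i$.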
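Whether a candidate mapping is a valid flow with demands can be verified mechanically. The sketch below assumes one particular sign convention, which may differ from the one intended in the definition: an oriented edge (i, j) points from i to j, the net outflow of a node must equal its demand, and the magnitude of the flow on an edge must not exceed the edge weight. The function name `is_flow_with_demands` and the example data are hypothetical.

```python
def is_flow_with_demands(flow, demands, capacities, tol=1e-9):
    """Check capacity constraints and the conservation law on oriented edges.

    Assumed convention: edge (i, j) points from i to j, and the net outflow
    of node i (outgoing minus incoming flow) must equal demands[i].
    """
    net_outflow = {i: 0.0 for i in demands}
    for (i, j), f in flow.items():
        if abs(f) > capacities[(i, j)] + tol:
            return False  # capacity constraint violated on this edge
        net_outflow[i] += f
        net_outflow[j] -= f
    return all(abs(net_outflow[i] - demands[i]) <= tol for i in demands)

# Hypothetical oriented chain 0 -> 1 -> 2 -> 3 with edge weights as capacities.
capacities = {(0, 1): 2.0, (1, 2): 1.0, (2, 3): 2.0}
demands = {0: 1.0, 1: 0.0, 2: 0.0, 3: -1.0}

feasible = is_flow_with_demands(
    {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0}, demands, capacities)
overloaded = is_flow_with_demands(
    {(0, 1): 1.5, (1, 2): 1.5, (2, 3): 1.5}, {0: 1.5, 1: 0.0, 2: 0.0, 3: -1.5},
    capacities)
```

The first flow routes one unit from node 0 to node 3 and is valid, while the second exceeds the capacity of the middle edge and is rejected.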
Using the notion of a network flow with demands, we now adapt the compatibility condition introduced for learning sparse vectors with the Lasso [20] to learning clustered graph signals (cf. (7)) with the network Lasso (6).
Definition 2.
Consider a sampling set and a partition of the data graph into disjoint subsets. Then, the network compatibility condition (NCC) with parameters is satisfied by and , if for any orientation of the edges in the boundaries , we can orient the remaining edges in such that there exists a flow with demands on such that

for any boundary edge : ,

for every sampled node : ,

for every other node : .
We are now in the position to state our main result, i.e., if a sampling set satisfies the NCC, then any solution of the network Lasso is an accurate proxy for a true underlying clustered graph signal (cf. (7)).
Theorem 3.
It is important to realize that the network Lasso problem (6) does not require knowledge of the partition underlying the unknown clustered graph signal . The partition is only used for the analysis of learning methods based on the network Lasso (6). Moreover, for graph signals having different signal values over different clusters, the solutions of (6) could be used for determining the clusters which constitute the partitioning .
Finally, we note that the NCC depends on both the sampling set $\mathcal{M}$ and the graph partition $\mathcal{F}$, the latter via the (total weight of the) boundary $\partial\mathcal{F}$. Thus, for a given partition $\mathcal{F}$, we might choose the sampling set $\mathcal{M}$ such that the NCC is guaranteed. One particular such choice is suggested by the following result.
Lemma 4.
Consider a partitioning of the data graph which also contains the sampling set . If each boundary edge with , is connected to sampled nodes, i.e., and with , , and weights , then the sampling set satisfies the NCC with parameters and .
IV Numerical Experiments
In order to illustrate the theoretical findings of Section III, we applied the network Lasso to a synthetically generated dataset with data graph . In particular, we generated a data graph using the popular LFR model proposed by Lancichinetti, Fortunato and Radicchi [14]. The LFR model generates networks with a community structure similar to that observed in many real-world networks. In particular, networks obtained from the LFR model exhibit a power-law distribution of node degrees and community sizes. The final synthetic data graph contains a total of nodes which are partitioned into four clusters . The nodes are connected by undirected edges with uniform edge weights for all . Given the data graph and partition we generate a clustered graph signal according to (7) as with coefficients . We illustrate the data graph along with the graph signal values in Figure 3.
According to Lemma 4, in order to recover the entire graph signal it is most helpful to have its signal values for the nodes close to the boundaries between different clusters. In order to verify this intuition, we constructed two different sampling sets. The first sampling set was constructed in line with Lemma 4, by preferring to sample nodes near a cluster boundary. By contrast, the second sampling set was obtained by selecting nodes uniformly at random, thus ignoring the cluster structure inherent to the data graph. The size of both sampling sets is equal to . For simplicity, we assume noiseless measurements, i.e., the initial labels are given by $y_i = x[i]$ for each sampled node $i$ (for either of the two sampling sets).
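The two sampling strategies can be sketched as follows. This is only one plausible reading of "preferring to sample nodes near a cluster boundary": the boundary-aware variant picks same-cluster neighbours of the end nodes of boundary edges, and the baseline draws nodes uniformly at random. The function names, the fixed seed and the toy chain graph are hypothetical and are not the construction used for the reported experiment.

```python
import random

def boundary_aware_sampling(edges, cluster_of, budget):
    """Pick same-cluster neighbours of the end nodes of boundary edges."""
    neighbours = {}
    for i, j in edges:
        neighbours.setdefault(i, set()).add(j)
        neighbours.setdefault(j, set()).add(i)
    end_nodes = {v for (i, j) in edges
                 if cluster_of[i] != cluster_of[j] for v in (i, j)}
    candidates = {m for v in end_nodes for m in neighbours[v]
                  if cluster_of[m] == cluster_of[v]}
    # May return fewer than `budget` nodes if there are few candidates.
    return sorted(candidates)[:budget]

def uniform_sampling(nodes, budget, seed=0):
    """Baseline that ignores the cluster structure."""
    return sorted(random.Random(seed).sample(sorted(nodes), budget))

# Hypothetical chain graph 0-1-2-3-4-5 with clusters {0,1,2} and {3,4,5}.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
cluster_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

near_boundary = boundary_aware_sampling(edges, cluster_of, budget=2)  # [1, 4]
at_random = uniform_sampling(cluster_of.keys(), budget=2)
```

Both strategies respect the same sampling budget, so any difference in the learned signals can be attributed to where the samples are placed rather than to how many there are.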
For each of the two sampling sets, we learned the overall graph signal by solving the network Lasso problem (6) using a modified version of the ADMM implementation discussed in [7] (which considered the mean squared error instead of the empirical error (5)). The learned signals finally obtained for each of the two sampling sets are shown in Figure 4, along with the true clustered graph signal.
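The effect of the sampling set can be reproduced qualitatively on a toy chain graph. The sketch below is not the ADMM implementation used for the experiment: it assumes an absolute-error loss plus a weighted TV penalty and minimizes the objective by exhaustive search over grid-valued signals. The graph, its non-uniform weights, the regularization value 0.1 and both sampling sets are hypothetical.

```python
from itertools import product

def objective(x, edges, labels, lam):
    """Absolute error on the sampled nodes plus lam times the weighted TV."""
    error = sum(abs(x[i] - y) for i, y in labels.items())
    tv = sum(w * abs(x[j] - x[i]) for (i, j), w in edges.items())
    return error + lam * tv

def learn(edges, labels, lam, num_nodes, grid):
    """Exhaustive search over grid-valued signals (toy graphs only)."""
    return list(min(product(grid, repeat=num_nodes),
                    key=lambda x: objective(x, edges, labels, lam)))

# Chain 0-...-5 with clusters {0,1,2} and {3,4,5}; the boundary edge (2, 3)
# has weight 1.0, and node 0 is only weakly attached to its own cluster.
edges = {(0, 1): 0.5, (1, 2): 2.0, (2, 3): 1.0, (3, 4): 2.0, (4, 5): 2.0}
truth = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
grid = [1.0, 2.0, 3.0, 4.0, 5.0]

near_boundary = {i: truth[i] for i in (1, 4)}      # strongly tied to boundary
far_from_boundary = {i: truth[i] for i in (0, 5)}  # end nodes of the chain

x_near = learn(edges, near_boundary, 0.1, 6, grid)     # recovers truth
x_far = learn(edges, far_from_boundary, 0.1, 6, grid)  # jump moves to edge (0, 1)
```

With samples at nodes 1 and 4, the cheapest place for the learned signal to jump is the boundary edge, so the clustered signal is recovered exactly. With samples at nodes 0 and 5, the cheapest jump is the weak edge (0, 1), and nodes 1 and 2 are assigned the value of the wrong cluster.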
V Conclusions
We presented a sufficient condition on the network topology and sampling set such that any solution of the network Lasso problem is an accurate estimate for a true underlying clustered graph signal. This recovery condition, which we term the network compatibility condition, amounts to ensuring the existence of certain network flows with prescribed demands. We also provided a more specific, somewhat more practical, condition on the sampling set which implies the network compatibility condition. Loosely speaking, for a given budget of how many nodes to sample, our conditions suggest sampling more densely near the boundaries between different clusters in the data graph. This intuition is verified by means of numerical experiments involving a toy data graph which has been generated in line with the properties of many real-world networks, i.e., presence of clusters and a power-law degree distribution.
References
 [1] S. Basirian and A. Jung. Random walk sampling for big data over networks. In Proc. Int. Conf. Sampling Th. and Applications (SampTA).
 [2] C. M. Bishop. Pattern Recognition and Machine Learning. 2006.
 [3] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, volume 3 of Foundations and Trends in Machine Learning. Now Publishers, Hanover, MA, 2010.
 [4] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. The MIT Press, Cambridge, Massachusetts, 2006.
 [5] S. Cui, A. Hero, Z.-Q. Luo, and J. Moura, editors. Big Data over Networks. Cambridge Univ. Press, 2016.
 [6] G. B. Eslamlou, A. Jung, N. Goertz, and M. Fereydooni. Graph signal recovery from incomplete and noisy information using approximate message passing. In Proc. IEEE ICASSP 2016, pages 6170–6174, March 2016.
 [7] D. Hallac, J. Leskovec, and S. Boyd. Network lasso: Clustering and optimization in large graphs. In Proc. SIGKDD, pages 387–396, 2015.
 [8] G. Hannak, P. Berger, G. Matz, and A. Jung. Efficient graph signal recovery over big networks. In Proc. Asilomar Conf. Signals, Systems, Computers, pages 1–6, Nov 2016.
 [9] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, NY, USA, 2001.
 [10] A. Jung, P. Berger, G. Hannak, and G. Matz. Scalable graph signal recovery for big data over networks. In 2016 IEEE 17th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pages 1–6, July 2016.
 [11] A. Jung, A. Heimowitz, and Y. C. Eldar. The network nullspace property for compressed sensing over networks. In Proc. Int. Conf. Sampling Th. and Applications (SampTA).
 [12] A. Jung, A. O. Hero, III, A. Mara, and B. Jahromi. Sparse label propagation. In preparation. Preprint available at https://arxiv.org/abs/1612.01414, Nov. 2016.
 [13] J. Kleinberg and E. Tardos. Algorithm Design. Addison Wesley, 2006.
 [14] A. Lancichinetti, S. Fortunato, and F. Radicchi. Benchmark graphs for testing community detection algorithms. Phys. Rev. E, 78:046110, Oct 2008.
 [15] E. Mossel, J. Neeman, and A. Sly. Stochastic block models and reconstruction. arXiv e-prints, Feb. 2012.
 [16] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231, 2013.
 [17] A. Sandryhaila and J. M. F. Moura. Big data analysis with signal processing on graphs: Representation and processing of massive data sets with irregular structure. IEEE Signal Processing Magazine, 31(5):80–90, Sept 2014.
 [18] J. Sharpnack, A. Rinaldo, and A. Singh. Sparsistency of the edge lasso over graphs. AIStats (JMLR WCP), 2012.
 [19] D. Spielman. Algorithms, graph theory, and the solution of Laplacian linear equations. In Automata, Languages, and Programming, volume 7392 of Lecture Notes in Computer Science, pages 24–26. Springer Berlin Heidelberg, 2012.
 [20] S. A. van de Geer and P. Bühlmann. On the conditions used to prove oracle results for the Lasso. Electron. J. Statist., 3:1360 – 1392, 2009.
 [21] Y.-X. Wang, J. Sharpnack, A. J. Smola, and R. J. Tibshirani. Trend filtering on graphs. J. Mach. Learn. Res., 17, 2016.
 [22] Y. Zhu. An augmented ADMM algorithm with application to the generalized lasso problem. 2015.