# Learning Networked Exponential Families with Network Lasso

###### Abstract

The data arising in many important big-data applications, ranging from social networks to network medicine, consist of high-dimensional data points related by an intrinsic (complex) network structure. In order to jointly leverage the information conveyed in the network structure as well as the statistical power contained in high-dimensional data points, we propose networked exponential families. We apply the network Lasso to learn networked exponential families as a probabilistic model for heterogeneous datasets with intrinsic network structure. In order to allow for accurate learning from high-dimensional data we borrow statistical strength, via the intrinsic network structure, across the dataset. The resulting method aims at regularized empirical risk minimization using the total variation of the model parameters as regularizer. This minimization problem is a non-smooth convex optimization problem which we solve using a primal-dual splitting method. This method is appealing for big data applications as it can be implemented as a highly scalable message passing algorithm.

Alexander Jung
Department of Computer Science, Aalto University, Espoo, Finland; firstname.lastname(at)aalto.fi

## I Introduction

The data generated in many important application domains have an intrinsic network structure. Such networked data arises in the study of social networks, text document collections and personalized medicine [1, 2, 3]. Network science provides powerful tools for the analysis of such data based on its intrinsic network structure [4, 5]. However, the network structure of datasets is complemented by the information contained in attributes (such as features or labels) of individual data points [1].

In this paper, we study a particular class of statistical models for networked data which are based on modeling the statistics of data attributes using exponential families [6, 7, 8]. The exponential families describing the individual data points are coupled via the network structure underlying the data. The resulting networked exponential families allow us to jointly capitalize on the network structure and the statistical properties of features and labels assigned to individual data points.

Our approach extends prior work on networked (sparse) linear and logistic regression models [9, 10, 11]. Indeed, the proposed network exponential family model contains linear and logistic regression as special cases. In contrast to [1], which formulates a probabilistic model for the network structure, we consider the network structure as fixed and known. The closest to our work is [12] which considers networked models but uses a different smoothness measure for tying the models of neighbouring data points. In particular, while [12] uses the graph Laplacian quadratic form as a smoothness measure, our approach controls the non-smooth total variation of the model parameters.

The main theme of this paper is the application of the network Lasso (nLasso) to learning networked exponential families (see Figure 1). The nLasso has been proposed recently as a natural extension of the least absolute shrinkage and selection operator (Lasso) to networked data [13, 14]. We show how the nLasso can be implemented efficiently using a primal-dual splitting method for convex optimization. The resulting scalable learning method amounts to a message passing protocol over the data network structure.

Outline. The rest of the paper is organized as follows. We start in Section II with modeling networked high-dimensional data points using empirical graphs whose nodes are equipped with individual probabilistic models forming networked exponential families. In Section III we detail how some well-known networked models, such as linear and logistic regression, are obtained as special cases of networked exponential families. As discussed in Section IV, the nLasso provides a principled approach to learning exponential families via regularized empirical risk minimization. The resulting learning problem can be solved by applying an efficient primal-dual method for non-smooth convex optimization (see Section V). We also discuss how to cope with partially observed exponential families which is relevant for many popular topic (latent variable) models such as latent Dirichlet allocation.

Contribution. Our main contributions are: (i) We introduce networked exponential families as a novel statistical model for networked high-dimensional data. (ii) We develop a scalable method for learning networked exponential families. In particular, we apply a primal-dual method for convex optimization and verify convergence of the resulting method, which can be implemented as message passing over the data network.

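Notation. Boldface lowercase (uppercase) letters denote vectors (matrices). We denote by $\mathbf{x}^T$ the transpose of the vector $\mathbf{x}$. The $\ell_2$-norm of a vector $\mathbf{x}$ is $\|\mathbf{x}\|_2 = \sqrt{\mathbf{x}^T \mathbf{x}}$. The spectral norm of a matrix $\mathbf{M}$ is $\|\mathbf{M}\|_2 = \max_{\|\mathbf{x}\|_2 = 1} \|\mathbf{M}\mathbf{x}\|_2$. The convex conjugate of a function $f$ is $f^*(\mathbf{y}) = \sup_{\mathbf{x}} \big( \mathbf{y}^T \mathbf{x} - f(\mathbf{x}) \big)$.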

## II Networked Exponential Families

We consider networked data that is represented by an undirected weighted graph (the “empirical graph”) $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathbf{A})$. A particular node $i \in \mathcal{V} = \{1, \ldots, N\}$ of the graph represents an individual data point (such as a document, or a social network user profile). With a slight abuse of notation, we refer by $i \in \mathcal{V}$ to a node of the empirical graph as well as to the data point which is represented by that node. Two different data points $i, j \in \mathcal{V}$ are connected by an undirected edge $\{i,j\} \in \mathcal{E}$ if they are considered similar (such as documents authored by the same person or social network profiles of befriended users). For ease of notation, we denote the edge set $\mathcal{E}$ by $\{1, \ldots, E := |\mathcal{E}|\}$.

Each edge $e = \{i,j\} \in \mathcal{E}$ is endowed with a positive weight $A_e = A_{ij} > 0$ which quantifies the amount of similarity between the data points $i, j \in \mathcal{V}$. The neighborhood of a node $i \in \mathcal{V}$ is $\mathcal{N}(i) := \{ j : \{i,j\} \in \mathcal{E} \}$.

Besides the network structure, datasets convey additional information in the form of attributes $\mathbf{z}^{(i)}$ associated with each data point $i \in \mathcal{V}$. For a document collection with data points representing text documents, the attributes could be frequencies of particular words [1].

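We model the attributes $\mathbf{z}^{(i)}$ of the data points $i \in \mathcal{V}$ as independent random variables distributed according to an exponential family [7, 6]

$$p(\mathbf{z}^{(i)}; \mathbf{w}^{(i)}) := b^{(i)}(\mathbf{z}^{(i)}) \exp\big( (\mathbf{w}^{(i)})^T \mathbf{t}^{(i)}(\mathbf{z}^{(i)}) - \Phi^{(i)}(\mathbf{w}^{(i)}) \big). \tag{1}$$

The exponential family (1) is defined by the map $\mathbf{t}^{(i)}(\cdot)$ which is known as sufficient statistic or potential function [7]. The function $\Phi^{(i)}(\cdot)$ is known as the log partition function or cumulant function [7]. The factor $b^{(i)}(\cdot)$ is a non-negative base measure that does not depend on the weights.

Note that the distribution (1) is parametrized by the (unknown) weight vectors $\mathbf{w}^{(i)} \in \mathbb{R}^d$. The networked exponential family combines the node-wise models (1) by requiring the weight vectors to be similar for well-connected data points. In particular, we require the weight vectors to have a small total variation (TV)

$$\|\mathbf{w}\|_{\rm TV} := \sum_{\{i,j\} \in \mathcal{E}} A_{ij} \|\mathbf{w}^{(j)} - \mathbf{w}^{(i)}\|_2. \tag{2}$$

Requiring a small TV of the weight vectors $\mathbf{w}^{(i)}$, for $i \in \mathcal{V}$, typically implies that the weight vectors are approximately constant over well-connected subsets (clusters) of nodes.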
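To make the TV (2) concrete, the following minimal sketch evaluates it for a small weighted graph. The edge-list format `(i, j, A_ij)`, the function name `total_variation` and the toy chain graph are illustrative assumptions, not part of the paper.

```python
import numpy as np

def total_variation(weights, edges):
    """TV as in (2): sum over edges {i,j} of A_ij * ||w_j - w_i||_2.

    weights: array of shape (N, d), row i is the weight vector of node i.
    edges:   list of (i, j, A_ij) triples (assumed format).
    """
    return sum(a * np.linalg.norm(weights[j] - weights[i]) for i, j, a in edges)

# Hypothetical chain graph 0-1-2-3 with clusters {0,1} and {2,3},
# joined by a single weak boundary edge of weight 0.5.
edges = [(0, 1, 1.0), (1, 2, 0.5), (2, 3, 1.0)]
weights = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
tv = total_variation(weights, edges)
print(tv)
```

Only the boundary edge contributes here, since the weight vectors are constant within each cluster; this is the piecewise constant structure that a small TV favors.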

## III Some Examples

Before we turn to the problem of learning networked exponential families (1) in Section IV, we now discuss some important special cases of the model (1).

### III-A Networked Linear Regression

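Consider a networked dataset whose data points $i \in \mathcal{V}$ are characterized by features $\mathbf{x}^{(i)} \in \mathbb{R}^d$ and numeric labels $y^{(i)} \in \mathbb{R}$. Maybe the most basic (yet quite useful) model for the relation between features and labels is the linear model

$$y^{(i)} = (\mathbf{w}^{(i)})^T \mathbf{x}^{(i)} + \varepsilon^{(i)}, \tag{3}$$

with Gaussian noise $\varepsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$ of known variance $\sigma^2$. The linear model (3), for each node $i \in \mathcal{V}$, is parametrized by the weight vector $\mathbf{w}^{(i)}$. The networked linear regression model requires the weight vectors in the individual linear models (3) to have a small TV (2) [9, 10].

The model (3) is obtained as the special case of the exponential family (1) for the scalar attributes $z^{(i)} := y^{(i)}$ with $\mathbf{t}^{(i)}(z) = (z/\sigma^2)\, \mathbf{x}^{(i)}$ and $\Phi^{(i)}(\mathbf{w}) = (\mathbf{w}^T \mathbf{x}^{(i)})^2 / (2\sigma^2)$.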
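The correspondence between (3) and (1) can be checked by expanding the square in the Gaussian density of the label. The following worked step is a sketch; the grouping into base measure $b^{(i)}$, sufficient statistic $\mathbf{t}^{(i)}$ and log partition function $\Phi^{(i)}$ is one possible choice.

```latex
\begin{align*}
p(y^{(i)}; \mathbf{w}^{(i)})
 &= \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\Big( -\frac{\big(y^{(i)} - (\mathbf{w}^{(i)})^T \mathbf{x}^{(i)}\big)^2}{2\sigma^2} \Big) \\
 &= \underbrace{\frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\Big( -\frac{(y^{(i)})^2}{2\sigma^2} \Big)}_{b^{(i)}(y^{(i)})}
    \exp\Big( (\mathbf{w}^{(i)})^T
    \underbrace{\frac{y^{(i)}}{\sigma^2}\, \mathbf{x}^{(i)}}_{\mathbf{t}^{(i)}(y^{(i)})}
    - \underbrace{\frac{\big((\mathbf{w}^{(i)})^T \mathbf{x}^{(i)}\big)^2}{2\sigma^2}}_{\Phi^{(i)}(\mathbf{w}^{(i)})} \Big).
\end{align*}
```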

In some applications it is difficult to obtain accurate label information, i.e., the label $y^{(i)}$ is not known for some data points $i \in \mathcal{V}$. One approach to handle such partially labeled data is to use some crude estimates $\hat{y}^{(i)}$ of the labels for unlabelled nodes. We then account for the lack of accurate label information by using heterogeneous noise variances $\sigma_i^2$. In particular, we would assign a larger variance $\sigma_i^2$ to the noise at nodes $i$ for which only unreliable label information (e.g., in the form of crude estimates $\hat{y}^{(i)}$) is available.

### III-B Networked Logistic Regression

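Consider networked data points $i \in \mathcal{V}$, each characterized by features $\mathbf{x}^{(i)} \in \mathbb{R}^d$ and a binary label $y^{(i)} \in \{-1, 1\}$. Maybe the most basic (yet quite useful) model for the relation between features and binary labels is the logistic regression model

$$p(y^{(i)} = 1; \mathbf{w}^{(i)}) := \frac{1}{1 + \exp\big( -(\mathbf{w}^{(i)})^T \mathbf{x}^{(i)} \big)}. \tag{4}$$

The logistic regression model (4) is parametrized by the weight vector $\mathbf{w}^{(i)}$ for each node $i \in \mathcal{V}$. The networked logistic regression model requires the weight vectors in the node-wise logistic regression models (4) to have a small TV (2) [15].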
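As a sketch of how (4) fits into the exponential family (1), assume that the labels are encoded as $y^{(i)} \in \{-1, 1\}$ (an assumption made here for illustration). Writing $a := (\mathbf{w}^{(i)})^T \mathbf{x}^{(i)}$, both label values are covered by a single expression, which yields one possible choice of sufficient statistic and log partition function with base measure $b^{(i)} \equiv 1$:

```latex
\begin{align*}
p(y^{(i)}; \mathbf{w}^{(i)})
 &= \frac{1}{1 + \exp(-y^{(i)} a)}
  = \frac{\exp(y^{(i)} a / 2)}{\exp(a/2) + \exp(-a/2)} \\
 &= \exp\Big( (\mathbf{w}^{(i)})^T
    \underbrace{\tfrac{1}{2}\, y^{(i)} \mathbf{x}^{(i)}}_{\mathbf{t}^{(i)}(y^{(i)})}
    - \underbrace{\log\big( 2\cosh(a/2) \big)}_{\Phi^{(i)}(\mathbf{w}^{(i)})} \Big).
\end{align*}
```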

### III-C Networked Latent Dirichlet Allocation

Consider a networked dataset representing a collection of text documents (such as scientific articles). The latent Dirichlet allocation model (LDA) is a probabilistic model for the relative frequencies of words in a document [7, 16]. Within LDA, each document is considered a blend of different topics. Each topic has a characteristic distribution of the words in the vocabulary.

A simplified form of LDA represents each document containing “words” by two sequences of multinomial random variables and with being the size of the vocabulary defining elementary words and is the number of different topics. It can be shown that LDA is a special case of the exponential family (1) with particular choices for and (see [7, 16]).

## IV Network Lasso

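Our goal is to develop a method for learning an accurate estimate $\hat{\mathbf{w}}^{(i)}$ for the weight vectors $\mathbf{w}^{(i)}$ at the nodes $i \in \mathcal{V}$. The learning of the weight vectors is based on the availability of the node attributes $\mathbf{z}^{(i)}$ for a small sampling set $\mathcal{M} \subseteq \mathcal{V}$ of size $M := |\mathcal{M}|$. A reasonable estimate for the weight vectors can be obtained from maximizing the likelihood of observing the attributes $\mathbf{z}^{(i)}$, $i \in \mathcal{M}$:

$$p\big( \{\mathbf{z}^{(i)}\}_{i \in \mathcal{M}}; \mathbf{w} \big) = \prod_{i \in \mathcal{M}} b^{(i)}(\mathbf{z}^{(i)}) \exp\big( (\mathbf{w}^{(i)})^T \mathbf{t}^{(i)}(\mathbf{z}^{(i)}) - \Phi^{(i)}(\mathbf{w}^{(i)}) \big). \tag{5}$$

Taking the negative logarithm of (5) and dropping the terms $\log b^{(i)}(\mathbf{z}^{(i)})$, which do not depend on the weights, shows that maximizing (5) is equivalent to minimizing the empirical risk

$$\widehat{E}(\mathbf{w}) := \frac{1}{M} \sum_{i \in \mathcal{M}} \Big( -(\mathbf{w}^{(i)})^T \mathbf{t}^{(i)}(\mathbf{z}^{(i)}) + \Phi^{(i)}(\mathbf{w}^{(i)}) \Big). \tag{6}$$

The criterion (6) by itself is not sufficient to learn the weights at all nodes $i \in \mathcal{V}$, since it completely ignores the weights $\mathbf{w}^{(i)}$ at the unobserved nodes $i \in \mathcal{V} \setminus \mathcal{M}$. Therefore, we need to impose some additional structure on the weight vectors. In particular, any reasonable estimate should conform with the cluster structure of the empirical graph [4].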

The network structure of data arising in important applications is organized as clusters (or communities), which are well-connected subsets of nodes [17]. Many methods of (supervised) machine learning are motivated by a clustering assumption that nodes belonging to the same cluster represent similar data points. We implement this clustering assumption by requiring the parameter vectors in (1) to have a small TV (2).

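We are led quite naturally to learning the weights for the networked exponential family via the regularized empirical risk minimization (ERM)

$$\hat{\mathbf{w}} \in \arg\min_{\mathbf{w}} \; \widehat{E}(\mathbf{w}) + \lambda \|\mathbf{w}\|_{\rm TV}. \tag{7}$$

The learning problem (7) is an instance of the generic nLasso problem [13]. The parameter $\lambda > 0$ in (7) allows us to trade off a small TV $\|\hat{\mathbf{w}}\|_{\rm TV}$ against a small error $\widehat{E}(\hat{\mathbf{w}})$ (cf. (6)). The choice of $\lambda$ can be guided by cross validation [18].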
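The objective in (7) can be evaluated directly for the networked linear regression model of Section III. The sketch below assumes the sufficient statistic $\mathbf{t}(y) = y\mathbf{x}/\sigma^2$ and log partition function $\Phi(\mathbf{w}) = (\mathbf{w}^T\mathbf{x})^2/(2\sigma^2)$ of the linear Gaussian model; the function names and the two-node toy data are hypothetical.

```python
import numpy as np

def empirical_risk(W, X, y, labeled, sigma2=1.0):
    """Empirical risk (6) for the linear Gaussian model:
    average over labeled nodes of -(w^T x) y / sigma2 + (w^T x)^2 / (2 sigma2)."""
    risk = 0.0
    for i in labeled:
        a = W[i] @ X[i]
        risk += (-a * y[i] + 0.5 * a ** 2) / sigma2
    return risk / len(labeled)

def nlasso_objective(W, X, y, labeled, edges, lam, sigma2=1.0):
    """Regularized empirical risk (7): risk plus lam times the TV (2)."""
    tv = sum(a * np.linalg.norm(W[j] - W[i]) for i, j, a in edges)
    return empirical_risk(W, X, y, labeled, sigma2) + lam * tv

# Hypothetical two-node graph; only node 0 is labeled.
X = np.eye(2)
y = np.array([2.0, 0.0])
edges = [(0, 1, 1.0)]
W_smooth = np.array([[2.0, 0.0], [2.0, 0.0]])   # fits node 0, zero TV
W_rough = np.array([[2.0, 0.0], [0.0, 0.0]])    # fits node 0, TV = 2
obj_smooth = nlasso_objective(W_smooth, X, y, [0], edges, lam=0.1)
obj_rough = nlasso_objective(W_rough, X, y, [0], edges, lam=0.1)
print(obj_smooth, obj_rough)
```

Both candidates fit the single labeled node equally well, so the TV term alone decides between them and favors copying the weight vector to the unlabeled neighbor.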

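It will be convenient to reformulate (7) using vector notation. We represent a graph signal, which assigns the weight vector $\mathbf{w}^{(i)} \in \mathbb{R}^d$ to each node $i \in \mathcal{V}$, as the vector

$$\mathbf{w} = \big( (\mathbf{w}^{(1)})^T, \ldots, (\mathbf{w}^{(N)})^T \big)^T \in \mathbb{R}^{dN}. \tag{8}$$

Define a partitioned matrix $\mathbf{D} \in \mathbb{R}^{(dE) \times (dN)}$ block-wise as

$$\mathbf{D}_{e,i} = \begin{cases} A_{ij}\, \mathbf{I} & \text{for } e = \{i,j\},\ i < j, \\ -A_{ij}\, \mathbf{I} & \text{for } e = \{i,j\},\ i > j, \\ \mathbf{0} & \text{otherwise,} \end{cases} \tag{9}$$

where $\mathbf{I}$ is the $d \times d$ identity matrix. The term $A_{ij} (\mathbf{w}^{(i)} - \mathbf{w}^{(j)})$, whose norm appears in (2), is the $e$-th block of $\mathbf{D}\mathbf{w}$ for the edge $e = \{i,j\}$ with $i < j$. Using (8) and (9), we can reformulate the nLasso (7) as

$$\hat{\mathbf{w}} \in \arg\min_{\mathbf{w} \in \mathbb{R}^{dN}} \; h(\mathbf{w}) + g(\mathbf{D}\mathbf{w}) \tag{10}$$

with

$$h(\mathbf{w}) := \widehat{E}(\mathbf{w}), \quad g(\mathbf{u}) := \lambda \sum_{e=1}^{E} \|\mathbf{u}^{(e)}\|_2, \tag{11}$$

with stacked vector $\mathbf{u} = \big( (\mathbf{u}^{(1)})^T, \ldots, (\mathbf{u}^{(E)})^T \big)^T \in \mathbb{R}^{dE}$.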
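The block matrix in (9) can be assembled explicitly for small graphs, which makes it easy to confirm that the block norms of the product of this matrix with the stacked weight vector sum to the TV (2). This is a minimal dense sketch under assumed conventions (zero-based node indices, edges given as `(i, j, A_ij)` triples); a large-scale implementation would rather use sparse matrices or never form the matrix at all.

```python
import numpy as np

def incidence_matrix(edges, num_nodes, d):
    """Dense block matrix as in (9): the block row of edge e = {i,j} holds
    +A_ij * I at the node with the smaller index and -A_ij * I at the other."""
    D = np.zeros((d * len(edges), d * num_nodes))
    eye = np.eye(d)
    for e, (i, j, a) in enumerate(edges):
        lo, hi = min(i, j), max(i, j)
        D[e * d:(e + 1) * d, lo * d:(lo + 1) * d] = a * eye
        D[e * d:(e + 1) * d, hi * d:(hi + 1) * d] = -a * eye
    return D

# Hypothetical chain graph with two clusters and a weak boundary edge.
edges = [(0, 1, 1.0), (1, 2, 0.5), (2, 3, 1.0)]
W = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
D = incidence_matrix(edges, num_nodes=4, d=2)

w_stacked = W.reshape(-1)                        # stacked vector as in (8)
blocks = (D @ w_stacked).reshape(len(edges), 2)  # one row per edge
tv_from_D = np.linalg.norm(blocks, axis=1).sum()
print(tv_from_D)
```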

## V Efficient Implementation

The nLasso (10) is a convex optimization problem with a non-smooth objective function, which rules out the use of plain gradient descent methods [19]. However, the objective function is highly structured since it is the sum of a smooth convex function $h(\mathbf{w})$ and a non-smooth convex function $g(\mathbf{D}\mathbf{w})$, each of which can be optimized efficiently when considered separately. This suggests using a proximal method [20, 21] for solving (10).

One particular example of a proximal method is the alternating direction method of multipliers (ADMM) which has been considered in [13]. However, we will choose another proximal method which is based on a dual problem to (10). Based on this dual problem, efficient primal-dual methods have been proposed recently [22, 23]. These methods are attractive since their analysis provides natural choices for the algorithm parameters. In contrast, tuning the ADMM parameters is non-trivial [24].

### V-A Primal-Dual Method

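The preconditioned primal-dual method [22] starts from reformulating the problem (10) as the saddle-point problem

$$\min_{\mathbf{w} \in \mathbb{R}^{dN}} \max_{\mathbf{u} \in \mathbb{R}^{dE}} \; \mathbf{u}^T \mathbf{D} \mathbf{w} + h(\mathbf{w}) - g^*(\mathbf{u}), \tag{12}$$

with the convex conjugate $g^*$ of $g$ [23].

Any solution $(\hat{\mathbf{w}}, \hat{\mathbf{u}})$ of (12) is characterized by [25, Thm 31.3]

$$-\mathbf{D}^T \hat{\mathbf{u}} \in \partial h(\hat{\mathbf{w}}), \quad \mathbf{D}\hat{\mathbf{w}} \in \partial g^*(\hat{\mathbf{u}}). \tag{13}$$

This condition is, in turn, equivalent to

$$\hat{\mathbf{w}} - \mathbf{T}\mathbf{D}^T \hat{\mathbf{u}} \in (\mathbf{I} + \mathbf{T}\partial h)(\hat{\mathbf{w}}), \quad \hat{\mathbf{u}} + \boldsymbol{\Sigma}\mathbf{D}\hat{\mathbf{w}} \in (\mathbf{I} + \boldsymbol{\Sigma}\partial g^*)(\hat{\mathbf{u}}), \tag{14}$$

with positive definite matrices $\boldsymbol{\Sigma} \in \mathbb{R}^{dE \times dE}$ and $\mathbf{T} \in \mathbb{R}^{dN \times dN}$. The matrices $\boldsymbol{\Sigma}, \mathbf{T}$ are design parameters whose choice will be detailed below. The condition (14) lends itself naturally to the following coupled fixed point iterations [22]

$$\mathbf{w}_{k+1} := (\mathbf{I} + \mathbf{T}\partial h)^{-1} \big( \mathbf{w}_k - \mathbf{T}\mathbf{D}^T \mathbf{u}_k \big), \tag{15}$$

$$\mathbf{u}_{k+1} := (\mathbf{I} + \boldsymbol{\Sigma}\partial g^*)^{-1} \big( \mathbf{u}_k + \boldsymbol{\Sigma}\mathbf{D}(2\mathbf{w}_{k+1} - \mathbf{w}_k) \big). \tag{16}$$

If the matrices $\boldsymbol{\Sigma}$ and $\mathbf{T}$ in (15), (16) satisfy

$$\big\| \boldsymbol{\Sigma}^{1/2} \mathbf{D} \mathbf{T}^{1/2} \big\|_2^2 < 1, \tag{17}$$

the sequences obtained from iterating (15) and (16) converge to a saddle point of the problem (12) [22, Thm. 1]. The condition (17) is satisfied for the block diagonal choice $\boldsymbol{\Sigma} := \mathrm{diag}\{ (1/(2A_e))\, \mathbf{I} \}_{e \in \mathcal{E}}$ and $\mathbf{T} := \mathrm{diag}\{ (\tau/d^{(i)})\, \mathbf{I} \}_{i \in \mathcal{V}}$, with node degree $d^{(i)} := \sum_{j \in \mathcal{N}(i)} A_{ij}$ and some $\tau < 1$ [22, Lem. 2].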
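The condition (17) can be checked numerically for a given graph. The sketch below assumes diagonal preconditioners with edge entries $1/(2A_e)$ and node entries $\tau/d^{(i)}$, where $d^{(i)}$ is the weighted node degree and $\tau = 0.9$; this follows the row-sum and column-sum construction of [22, Lem. 2] and is an assumption about the intended choice. It uses scalar weights ($d = 1$); for $d > 1$ the matrix is a Kronecker product with the identity and has the same spectral norm.

```python
import numpy as np

# Hypothetical weighted graph given as (i, j, A_ij) triples.
edges = [(0, 1, 1.0), (1, 2, 0.5), (2, 3, 1.0), (0, 2, 2.0)]
num_nodes = 4

# Incidence matrix (9) for scalar weights (d = 1).
D = np.zeros((len(edges), num_nodes))
for e, (i, j, a) in enumerate(edges):
    D[e, min(i, j)] = a
    D[e, max(i, j)] = -a

degree = np.abs(D).sum(axis=0)           # weighted node degrees d_i
tau = 0.9                                # assumed step-size factor, tau < 1
T_diag = tau / degree                    # primal (node) step sizes
Sigma_diag = 1.0 / np.abs(D).sum(axis=1) # dual (edge) step sizes, 1 / (2 A_e)

# Squared spectral norm of Sigma^{1/2} D T^{1/2}, the left-hand side of (17).
scaled = np.sqrt(Sigma_diag)[:, None] * D * np.sqrt(T_diag)[None, :]
lhs = np.linalg.norm(scaled, 2) ** 2
print(lhs)
```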

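The update (16) involves the resolvent operator

$$(\mathbf{I} + \boldsymbol{\Sigma}\partial g^*)^{-1}(\mathbf{v}) = \arg\min_{\mathbf{v}' \in \mathbb{R}^{dE}} \; g^*(\mathbf{v}') + \frac{1}{2} \|\mathbf{v}' - \mathbf{v}\|^2_{\boldsymbol{\Sigma}^{-1}}, \tag{18}$$

where $\|\mathbf{v}\|_{\boldsymbol{\Sigma}^{-1}} := \sqrt{\mathbf{v}^T \boldsymbol{\Sigma}^{-1} \mathbf{v}}$. The convex conjugate $g^*$ of $g$ (see (11)) can be decomposed as $g^*(\mathbf{v}) = \sum_{e=1}^{E} g_2^*(\mathbf{v}^{(e)})$ with the convex conjugate $g_2^*$ of the scaled $\ell_2$-norm $g_2(\cdot) := \lambda \|\cdot\|_2$. Moreover, since $\boldsymbol{\Sigma}$ is a block diagonal matrix, the $e$-th block of the resolvent operator can be obtained by the Moreau decomposition as [21, Sec. 6.5]

$$\big( (\mathbf{I} + \boldsymbol{\Sigma}\partial g^*)^{-1}(\mathbf{v}) \big)^{(e)} = \mathbf{v}^{(e)} - \Big( 1 - \frac{\lambda}{\|\mathbf{v}^{(e)}\|_2} \Big)_+ \mathbf{v}^{(e)},$$

where $(a)_+ := \max\{a, 0\}$ for $a \in \mathbb{R}$. Thus, each block $\mathbf{v}^{(e)}$ is projected onto the Euclidean ball of radius $\lambda$.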
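The block-wise resolvent of the dual update (16) amounts to clipping each edge block to Euclidean norm at most $\lambda$. A minimal sketch, with one row per edge block (the function name `clip_blocks` and the small guard against division by zero are illustrative choices):

```python
import numpy as np

def clip_blocks(V, lam):
    """Apply v - (1 - lam / ||v||_2)_+ v to every row v of V, i.e. project
    each edge block onto the Euclidean ball of radius lam."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    # Guard against division by zero; all-zero rows are left unchanged.
    shrink = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return V - shrink * V

V = np.array([[3.0, 4.0],    # norm 5, outside the ball: rescaled to norm 1
              [0.3, 0.4],    # norm 0.5, inside the ball: unchanged
              [0.0, 0.0]])   # zero block: unchanged
U = clip_blocks(V, lam=1.0)
print(U)
```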

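The update (15) involves the resolvent operator $(\mathbf{I} + \mathbf{T}\partial h)^{-1}$ of $h$ (see (6) and (11)), which does not have a simple closed-form solution in general. However, for the choice $\mathbf{T} = \mathrm{diag}\{ \tau^{(i)} \mathbf{I} \}_{i \in \mathcal{V}}$, the update (15) decomposes into separate node-wise updates. For each observed node $i \in \mathcal{M}$,

$$\mathbf{w}^{(i)}_{k+1} := \arg\min_{\tilde{\mathbf{w}} \in \mathbb{R}^d} \; -\tilde{\mathbf{w}}^T \mathbf{t}^{(i)} + \Phi^{(i)}(\tilde{\mathbf{w}}) + \frac{1}{2\tilde{\tau}^{(i)}} \big\| \tilde{\mathbf{w}} - \bar{\mathbf{w}}^{(i)} \big\|_2^2 \tag{19}$$

with $\mathbf{t}^{(i)} := \mathbf{t}^{(i)}(\mathbf{z}^{(i)})$, $\bar{\mathbf{w}} := \mathbf{w}_k - \mathbf{T}\mathbf{D}^T \mathbf{u}_k$ and $\tilde{\tau}^{(i)} := \tau^{(i)}/M$, where $M = |\mathcal{M}|$. For the unobserved nodes $i \notin \mathcal{M}$, the empirical risk (6) does not depend on $\mathbf{w}^{(i)}$ and the update reduces to $\mathbf{w}^{(i)}_{k+1} := \bar{\mathbf{w}}^{(i)}$.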

In general, it is not possible to compute the update (19) exactly. Notable exceptions are networked linear Gaussian models, for which (19) amounts to simple matrix operations [26]. However, since (19) amounts to the unconstrained minimization of a smooth and convex objective function, we can apply efficient convex optimization methods [27]. In fact, the update (19) is a regularized maximum likelihood problem for the exponential family (1). This can be solved efficiently using quasi-Newton methods such as L-BFGS [28, 29]. We will detail a particular iterative method for approximately solving (19) in Section V-B.
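For the linear Gaussian model, the node-wise update has a closed form. The sketch below assumes the update takes the form of minimizing $-\mathbf{w}^T\mathbf{t} + \Phi(\mathbf{w}) + \|\mathbf{w} - \bar{\mathbf{w}}\|_2^2/(2\tilde{\tau})$ with $\mathbf{t} = y\mathbf{x}/\sigma^2$ and $\Phi(\mathbf{w}) = (\mathbf{w}^T\mathbf{x})^2/(2\sigma^2)$; setting the gradient to zero gives a $d \times d$ linear system. The function name and the numbers are hypothetical.

```python
import numpy as np

def node_update_linear(x, y, w_bar, tau_tilde, sigma2=1.0):
    """Minimize -w^T t + Phi(w) + ||w - w_bar||^2 / (2 tau_tilde) for the
    linear Gaussian model by solving the zero-gradient condition
    (x x^T / sigma2 + I / tau_tilde) w = x y / sigma2 + w_bar / tau_tilde."""
    d = x.shape[0]
    lhs = np.outer(x, x) / sigma2 + np.eye(d) / tau_tilde
    rhs = x * y / sigma2 + w_bar / tau_tilde
    return np.linalg.solve(lhs, rhs)

# Hypothetical node data.
x = np.array([1.0, 2.0])
y = 3.0
w_bar = np.array([0.5, -0.5])
tau_tilde = 0.1
w_new = node_update_linear(x, y, w_bar, tau_tilde)

# Gradient of the node objective at the computed point (sigma2 = 1).
grad = -x * y + x * (x @ w_new) + (w_new - w_bar) / tau_tilde
print(w_new, grad)
```

Since the system matrix is the identity plus a rank-one term, the Sherman-Morrison formula would avoid the general solve for large feature dimensions.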

Let us denote the approximate solution to (19) by and assume that it is sufficiently accurate such that

(20) |

Thus, we require the approximation quality (for approximating the update (19)) to increase with the iteration number . According to [30, Thm. 3.2], the error bound (20) ensures the sequences obtained by (15) and (16) when replacing the exact update (19) with the approximation still converge to a saddle-point of (12) and, in turn, a solution of the nLasso problem (10).
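Putting the pieces together, the iterations (15) and (16) can be sketched for networked linear regression, where the primal step is available in closed form. Everything below is an illustrative sketch under stated assumptions: edges are `(i, j, A_ij)` triples on a graph without isolated nodes, the step sizes are `tau / degree` per node and `1 / (2 A_e)` per edge, the empirical risk is averaged over the labeled nodes, and the function name `nlasso_linear` is hypothetical.

```python
import numpy as np

def nlasso_linear(X, y, labeled, edges, lam, num_iter=500, tau=0.9, sigma2=1.0):
    """Primal-dual iterations (15), (16) for networked linear regression."""
    N, d = X.shape
    M = len(labeled)
    A = np.array([a for _, _, a in edges])
    degree = np.zeros(N)
    for i, j, a in edges:
        degree[i] += a
        degree[j] += a
    tau_node = tau / degree        # diagonal of T (one value per node)
    sigma_edge = 1.0 / (2.0 * A)   # diagonal of Sigma (one value per edge)

    def D_apply(W):                # edge block e of D w, see (9)
        return np.array([a * (W[min(i, j)] - W[max(i, j)]) for i, j, a in edges])

    def Dt_apply(U):               # node block i of D^T u
        out = np.zeros((N, d))
        for e, (i, j, a) in enumerate(edges):
            out[min(i, j)] += a * U[e]
            out[max(i, j)] -= a * U[e]
        return out

    W = np.zeros((N, d))
    U = np.zeros((len(edges), d))
    for _ in range(num_iter):
        # Primal step (15): unlabeled nodes keep w_bar, labeled nodes
        # solve the regularized least-squares problem in closed form.
        W_bar = W - tau_node[:, None] * Dt_apply(U)
        W_new = W_bar.copy()
        for i in labeled:
            tt = tau_node[i] / M
            lhs = np.outer(X[i], X[i]) / sigma2 + np.eye(d) / tt
            rhs = X[i] * y[i] / sigma2 + W_bar[i] / tt
            W_new[i] = np.linalg.solve(lhs, rhs)
        # Dual step (16): extrapolate, then clip each edge block to norm lam.
        V = U + sigma_edge[:, None] * D_apply(2.0 * W_new - W)
        norms = np.linalg.norm(V, axis=1, keepdims=True)
        U = V - np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0) * V
        W = W_new
    return W, U

# Hypothetical toy data: clusters {0,1,2} and {3,4,5} joined by one weak edge.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
w_true = np.array([[1.0, -1.0]] * 3 + [[-2.0, 0.5]] * 3)
y = np.sum(X * w_true, axis=1)
edges = [(0, 1, 1.0), (1, 2, 1.0), (3, 4, 1.0), (4, 5, 1.0), (2, 3, 0.1)]
lam = 0.05
W_hat, U_hat = nlasso_linear(X, y, labeled=[0, 1, 3, 4], edges=edges, lam=lam)
print(W_hat)
```

Each iteration only combines quantities of a node with those of its incident edges, which is what allows an implementation as message passing over the empirical graph.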

### V-B Approximate Primal-Dual Steps

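Let us detail here a simple iterative method for computing an approximate solution to the update (19). By standard convex analysis (see [27]), any solution $\mathbf{w}^{(i)}_{k+1}$ of (19) is characterized by the zero gradient condition

$$\nabla f\big( \mathbf{w}^{(i)}_{k+1} \big) = \mathbf{0} \tag{21}$$

with

$$f(\tilde{\mathbf{w}}) := -\tilde{\mathbf{w}}^T \mathbf{t}^{(i)} + \Phi^{(i)}(\tilde{\mathbf{w}}) + \frac{1}{2\tilde{\tau}^{(i)}} \big\| \tilde{\mathbf{w}} - \bar{\mathbf{w}}^{(i)} \big\|_2^2. \tag{22}$$

Inserting (22) into (21), and using some basic calculus, yields

$$\mathbf{w}^{(i)}_{k+1} = \bar{\mathbf{w}}^{(i)} + \tilde{\tau}^{(i)} \mathbf{t}^{(i)} - \tilde{\tau}^{(i)} \nabla\Phi^{(i)}\big( \mathbf{w}^{(i)}_{k+1} \big). \tag{23}$$

The condition (23), which is necessary and sufficient for $\mathbf{w}^{(i)}_{k+1}$ to solve (19), is a fixed point equation $\mathbf{w}^{(i)}_{k+1} = \mathcal{T}^{(i)}\big( \mathbf{w}^{(i)}_{k+1} \big)$ with

$$\mathcal{T}^{(i)}(\mathbf{w}) := \bar{\mathbf{w}}^{(i)} + \tilde{\tau}^{(i)} \mathbf{t}^{(i)} - \tilde{\tau}^{(i)} \nabla\Phi^{(i)}(\mathbf{w}). \tag{24}$$

We will use the Hessian $\mathbf{H}^{(i)}(\mathbf{w}) \in \mathbb{R}^{d \times d}$ of $\Phi^{(i)}$ with entries

$$H^{(i)}_{m,n}(\mathbf{w}) := \frac{\partial^2 \Phi^{(i)}(\mathbf{w})}{\partial w_m \, \partial w_n}. \tag{25}$$

According to the mean-value theorem [31, Thm. 9.19.], the map $\mathcal{T}^{(i)}$ is Lipschitz continuous with Lipschitz constant $\tilde{\tau}^{(i)} \sup_{\mathbf{w}} \|\mathbf{H}^{(i)}(\mathbf{w})\|_2$. Thus, if we choose $\tilde{\tau}^{(i)}$ such that

$$\tilde{\tau}^{(i)} \sup_{\mathbf{w} \in \mathbb{R}^d} \big\| \mathbf{H}^{(i)}(\mathbf{w}) \big\|_2 < 1, \tag{26}$$

the map $\mathcal{T}^{(i)}$ is a contraction and the fixed-point iteration

$$\mathbf{w}^{(r+1)} := \mathcal{T}^{(i)}\big( \mathbf{w}^{(r)} \big) = \bar{\mathbf{w}}^{(i)} + \tilde{\tau}^{(i)} \mathbf{t}^{(i)} - \tilde{\tau}^{(i)} \nabla\Phi^{(i)}\big( \mathbf{w}^{(r)} \big) \tag{27}$$

will converge to the solution of (19).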
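The fixed-point iteration can be illustrated for a single node with a logistic regression model. The sketch assumes labels $y \in \{-1, 1\}$ with sufficient statistic $\mathbf{t} = y\mathbf{x}/2$ and log partition function $\Phi(\mathbf{w}) = \log(2\cosh(\mathbf{w}^T\mathbf{x}/2))$, so that $\nabla\Phi(\mathbf{w}) = \tfrac{1}{2}\tanh(\mathbf{w}^T\mathbf{x}/2)\,\mathbf{x}$ and the Hessian norm is at most $\|\mathbf{x}\|_2^2/4$. The function name, the step size and the data are hypothetical.

```python
import numpy as np

def node_update_logistic(x, y, w_bar, tau_tilde, num_iter=50):
    """Iterate w <- w_bar + tau_tilde * (t - grad Phi(w)) as in (27)
    for a logistic model with label y in {-1, +1}."""
    # Contraction condition (26): tau_tilde * sup ||Hessian|| < 1.
    assert tau_tilde * (x @ x) / 4.0 < 1.0
    t = 0.5 * y * x
    w = w_bar.copy()
    for _ in range(num_iter):
        grad_phi = 0.5 * np.tanh(0.5 * (w @ x)) * x
        w = w_bar + tau_tilde * (t - grad_phi)
    return w

# Hypothetical node data; tau_tilde * ||x||^2 / 4 = 0.5 here.
x = np.array([1.0, -2.0])
y = 1.0
w_bar = np.array([0.2, 0.1])
tau_tilde = 0.4
w_fix = node_update_logistic(x, y, w_bar, tau_tilde)

# Residual of the fixed-point equation (23) at the returned point.
residual = w_fix - (w_bar + tau_tilde * (0.5 * y * x - 0.5 * np.tanh(0.5 * (w_fix @ x)) * x))
print(w_fix, np.linalg.norm(residual))
```

With a contraction factor of at most 0.5, the error shrinks at least by half in every iteration, so a few dozen iterations already reach machine precision in this example.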

Moreover, if (26) is satisfied, we can bound the deviation between the iterate and the (unique) solution of (19) as (see [31, Proof of Thm. 9.23])

(28) |

Thus, if we use the approximation for the update (19), we can ensure (20) by iterating (27) for at least

(29) |

iterations.

Note that computing the iterates (27) requires the evaluation of the gradient $\nabla\Phi^{(i)}(\mathbf{w})$ of the log partition function $\Phi^{(i)}$. According to [7, Prop. 3.1.], this gradient is given by the expectation of the sufficient statistic $\mathbf{t}^{(i)}(\mathbf{z})$ under the distribution $p(\mathbf{z}; \mathbf{w})$:

$$\nabla\Phi^{(i)}(\mathbf{w}) = \mathrm{E}\big\{ \mathbf{t}^{(i)}(\mathbf{z}) \big\}. \tag{30}$$

Moreover, the Hessian in (25) is obtained as the covariance matrix of the sufficient statistic [7, Prop. 3.1.]. In particular, the entries of the Hessian are

$$H^{(i)}_{m,n}(\mathbf{w}) = \mathrm{E}\big\{ t^{(i)}_m(\mathbf{z})\, t^{(i)}_n(\mathbf{z}) \big\} - \mathrm{E}\big\{ t^{(i)}_m(\mathbf{z}) \big\}\, \mathrm{E}\big\{ t^{(i)}_n(\mathbf{z}) \big\}. \tag{31}$$

In general, the expectations (30) and (31) cannot be computed exactly in closed form. A notable exception is the class of exponential families obtained from a probabilistic graphical model defined on a triangulated graph such as a tree. In this case it is possible to compute (30) in closed form (see [7, Sec. 2.5.2]). Other special cases of (1) for which (30) and (31) can be evaluated in closed form are linear and logistic regression (see Section III).
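The relation between the log partition function and the moments of the sufficient statistic can be verified numerically for a small discrete exponential family. In the sketch below, the four attribute values and their two-dimensional sufficient statistics are arbitrary illustrative choices, and the base measure is assumed to be constant.

```python
import numpy as np

# Hypothetical discrete family: row z of T_stat is the sufficient statistic t(z).
T_stat = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, -1.0]])

def log_partition(w):
    """Phi(w) = log sum_z exp(w^T t(z)), computed in a numerically stable way."""
    s = T_stat @ w
    return np.log(np.sum(np.exp(s - s.max()))) + s.max()

def moments(w):
    """Mean and covariance of t(z) under p(z; w) proportional to exp(w^T t(z))."""
    s = T_stat @ w
    p = np.exp(s - s.max())
    p /= p.sum()
    mean = p @ T_stat
    cov = (T_stat * p[:, None]).T @ T_stat - np.outer(mean, mean)
    return mean, cov

w = np.array([0.3, -0.7])
mean, cov = moments(w)

# Central finite differences of Phi, to be compared with the mean of t(z).
eps = 1e-6
num_grad = np.array([
    (log_partition(w + eps * e) - log_partition(w - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
print(num_grad, mean)
```

The covariance matrix returned by `moments` plays the role of the Hessian, so its largest eigenvalue gives the local Lipschitz constant that enters the contraction condition.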

### V-C Partially Observed Models

The learning Algorithm 1 can be adapted easily to cope with partially observed exponential families [7]. In particular, for the networked LDA described in Section III, we typically have access only to the word variables of some documents . However, for (approximately) computing the update step (19) we would also need the values of the topic variables but those are not observed since they are latent (hidden) variables. In this case we can approximate (19) by some “Expectation-Maximization” (EM) principle (see [7, Sec. 6.2]). An alternative to EM methods, based on the method of moments, for learning (latent variable) topic models has been studied in a recent line of work [32, 33].

## VI Numerical Experiments

In this section we report on the results obtained by applying particular instances of Alg. 1 to different datasets. The first dataset is synthetically generated using an empirical graph composed of two well-connected clusters. We also consider a dataset obtained from temperature measurements at various locations in Finland.

### VI-A Two-Cluster Dataset

We generate the empirical graph () by sparsely connecting two random graphs and , each of size and with average degree . The nodes of are assigned feature vectors obtained by i.i.d. random vectors uniformly distributed on the unit sphere . The labels of the nodes are generated according to the linear model (3) with zero noise and piecewise constant weight vectors

(32) |

with some two (different) fixed vectors . We assume that the labels are known for the nodes in a small training set which includes three data points from each cluster, i.e., .

As shown in [34] the performance of network Lasso type methods (for learning problems similar to but different from (1)) depends on the connectivity of the cluster nodes with the boundary edges which connect nodes in different clusters. In order to quantify the connectivity of the labeled nodes with the cluster boundary, we compute, for each cluster , the normalized flow value from one particular in each cluster and the cluster boundary . We normalize this flow by the boundary size .

In Fig. 2, we depict the normalized mean squared error (NMSE) incurred by Alg. 1 (averaged over i.i.d. simulation runs) for varying connectivity, as measured by the empirical average of and (having same distribution). According to Fig. 2 there are two regimes of levels of connectivity. For sufficiently large connectivity , Alg. 1 is able to capitalize on the network structure in order to learn the piece-wise constant weight vectors .

### VI-B Weather Data

In this experiment, we consider networked data obtained from the Finnish meteorological institute. The empirical graph of this data represents Finnish weather stations (see Fig. 3), which are initially connected by an edge to their nearest neighbors. The feature vector of node contains the local (daily mean) temperature for the preceding three days. The label is the current day-average temperature.

We use Alg. 1 to learn the weight vectors for a localized linear model (3). For the sake of illustration we focus on the weather stations in the capital region around Helsinki (indicated by a red cross in Fig. 3). These stations are represented by nodes and we assume that labels are available for all nodes outside and for the nodes . Thus, for more than half of the nodes in we do not know the labels but predict them via with the weight vectors obtained from Alg. 1 (using and a fixed number of iterations). The normalized average squared prediction error is and only slightly larger than the prediction error incurred by fitting a single linear model to the cluster using a simple linear regression method.

## VII Conclusion

We have introduced networked exponential families as a flexible statistical modeling paradigm for networked data. Individual data points are modeled by exponential families whose parameters are coupled across connected data points by requiring a small TV. An efficient method for learning networked exponential families is obtained by applying a primal-dual method to solve the non-smooth nLasso problem.

## Acknowledgments

We thank Roope Tervo from the Finnish Meteorological Institute for helping with gathering the weather data.

## References

- [1] J. Chang and D. M. Blei, “Relational topic models for document networks,” in Proc. of the 12th Int. Conf. on Art. Int. Stat. (AISTATS), Florida, USA, 2009.
- [2] A. Barabási, N. Gulbahce, and J. Loscalzo, “Network medicine: a network-based approach to human disease,” Nature Reviews Genetics, vol. 12, no. 56, 2010.
- [3] W. W. Zachary, “An information flow model for conflict and fission in small groups,” J. Anthro. Research, vol. 33, no. 4, pp. 452–473, 1977.
- [4] M. E. J. Newman, Networks: An Introduction, Oxford Univ. Press, 2010.
- [5] S. Cui, A. Hero, Z.-Q. Luo, and J.M.F. Moura, Eds., Big Data over Networks, Cambridge Univ. Press, 2016.
- [6] L. D. Brown, Fundamentals of Statistical Exponential Families, Institute of Mathematical Statistics, Hayward, CA, 1986.
- [7] M. J. Wainwright and M. I. Jordan, Graphical Models, Exponential Families, and Variational Inference, vol. 1 of Foundations and Trends in Machine Learning, Now Publishers, Hanover, MA, 2008.
- [8] A. Jung, S. Schmutzhard, and F. Hlawatsch, “The RKHS approach to minimum variance estimation revisited: Variance bounds, sufficient statistics, and exponential families,” IEEE Trans. Inf. Theory, vol. 60, no. 7, pp. 4050–4065, Jul. 2014.
- [9] A. Jung and N. Tran, “Localized linear regression in networked data,” IEEE Sig. Proc. Letters, 2019.
- [10] M.Yamada, T. Koh, T. Iwata, J. Shawe-Taylor, and S. Kaski, “Localized Lasso for High-Dimensional Regression,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, Apr. 2017, vol. 54, pp. 325–333, PMLR.
- [11] H. Ambos, N. Tran, and A. Jung, “Classifying big data over networks via the logistic network lasso,” in Proc. 52nd Asilomar Conf. Signals, Systems, Computers, Oct./Nov. 2018.
- [12] T. Li, E. Levina, and J. Zhu, “Prediction models for network-linked data,” The Annals of Applied Statistics, 2019.
- [13] D. Hallac, J. Leskovec, and S. Boyd, “Network lasso: Clustering and optimization in large graphs,” in Proc. SIGKDD, 2015, pp. 387–396.
- [14] T. Hastie, R. Tibshirani, and M. Wainwright, Statistical Learning with Sparsity. The Lasso and its Generalizations, CRC Press, 2015.
- [15] H. Ambos, N. Tran, and A. Jung, “Classifying big data over networks via the logistic network lasso,” in Proc. 52nd Asilomar Conference on Signals, Systems, and Computers. 2018, 10.1109/ACSSC.2018.8645260.
- [16] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, Jan. 2003.
- [17] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3-5, pp. 75–174, Feb. 2010.
- [18] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer Series in Statistics. Springer, New York, NY, USA, 2001.
- [19] A. Jung, “A fixed-point of view on gradient methods for big data,” Frontiers in Applied Mathematics and Statistics, vol. 3, 2017.
- [20] P. L. Combettes and J.-C. Pesquet, “Proximal splitting methods in signal processing,” in Fixed-Point Algorithms for Inverse Problems in Science and Engineering, H. Bauschke, R. Burachik, P. Combettes, V. Elser, D. Luke, and H. Wolkowicz, Eds., vol. 49. Springer New York, 2011.
- [21] N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends in Optimization, vol. 1, no. 3, pp. 123–231, 2013.
- [22] T. Pock and A. Chambolle, “Diagonal preconditioning for first order primal-dual algorithms in convex optimization,” in IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, Nov. 2011.
- [23] A. Chambolle and T. Pock, “A first-order primal-dual algorithm for convex problems with applications to imaging,” J. Math. Imag. Vis., vol. 40, no. 1, 2011.
- [24] R. Nishihara, L. Lessard, B. Recht, A. Packard, and M. I. Jordan, “A general analysis of the convergence of ADMM,” in Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015, vol. 37.
- [25] R. T. Rockafellar, Convex Analysis, Princeton Univ. Press, Princeton, NJ, 1970.
- [26] A. Jung and N. Vesselinova, “Analysis of network lasso for semi-supervised regression,” in The 22nd Int. Conf. Art. Int. Stat. (AISTATS), Okinawa, Japan, April 2019.
- [27] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge Univ. Press, Cambridge, UK, 2004.
- [28] G. Andrew and J. Gao, “Scalable training of l1-regularized log-linear models,” in Proc. of the 24th International Conference on Machine Learning (ICML), Corvallis, OR, 2007.
- [29] A. Mokhtari and A. Ribeiro, “Global convergence of online limited memory BFGS,” Jour. Mach. Learning Res., vol. 16, pp. 3151–3181, Dec. 2015.
- [30] L. Condat, “A primal–dual splitting method for convex optimization involving lipschitzian, proximable and linear composite terms,” Journal of Optimization Theory and Applications, vol. 158, no. 2, pp. 460–479, Aug. 2013.
- [31] W. Rudin, Principles of Mathematical Analysis, McGraw-Hill, New York, 3rd edition, 1976.
- [32] S. Arora, R. Ge, F. Koehler, T. Ma, and A. Moitra, “Provable algorithms for inference in topic models,” in Proc. 33rd Int. Conf. Mach. Learn. (ICML), New York, NY, USA, 2016.
- [33] A. Anandkumar, D.P. Foster, D.J. Hsu, S.M. Kakade, and L. Yi-kai, “A spectral algorithm for latent dirichlet allocation,” in Advances in Neural Information Processing Systems 25, 2012, pp. 917–925.
- [34] A. Jung, N.T. Quang, and A. Mara, “When is Network Lasso Accurate?,” Front. Appl. Math. Stat., vol. 3, Jan. 2018.