Distributed Reconstruction of Nonlinear Networks: An ADMM Approach\thanksreffootnoteinfo
In this paper, we present a distributed algorithm for the reconstruction of large-scale nonlinear networks. In particular, we focus on the identification from time-series data of the nonlinear functional forms and associated parameters of large-scale nonlinear networks. In (Pan et al. (2013)), a nonlinear network reconstruction problem was formulated as a nonconvex optimisation problem based on the combination of a marginal likelihood maximisation procedure with sparsity inducing priors. Using a convex-concave procedure (CCCP), an iterative reweighted lasso algorithm was derived to solve the initial nonconvex optimisation problem. By exploiting the structure of the objective function of this reweighted lasso algorithm, a distributed algorithm can be designed. To this end, we apply the alternating direction method of multipliers (ADMM) to decompose the original problem into several subproblems. To illustrate the effectiveness of the proposed methods, we use our approach to identify a network of interconnected Kuramoto oscillators with different network sizes (500100,000 nodes).
footnoteinfo]W. Pan gratefully acknowledge the support of Microsoft Research through the PhD Scholarship Program. A. Sootla and G.-B. Stan acknowledge the support of EPSRC through the project EP/J014214/1 and the EPSRC Science and Innovation Award EP/G036004/1.
The importance of reconstructing nonlinear systems and its associated difficulties are widely recognised (Ljung et al. (2011)). Reconstruction methods focus on specific system classes such as those described by Wiener and Volterra series or nonlinear auto-regressive with exogenous inputs (NARX) models to name just a few examples (see Ljung (1999) and references therein). However, nonlinear systems can be described by other functional forms. One of the most important and challenging problems in nonlinear network reconstruction is nonlinear structure identification (Sjöberg et al. (1995)). Nonlinear functional forms can be typically expanded as sums of terms belonging to a family of parameterised functions (see Sec. 5.4, Ljung (1999)). A usual approach to identifying a nonlinear black-box model is to search amongst a set of possible nonlinear terms (e.g., basis functions) for a parsimonious description coherent with the available data set Haber and Unbehauen (1990)).
In this paper, nonlinear systems are represented in a general state-space form. The framework we develop uses some a priori knowledge of the type of system we want to reconstruct, i.e., we consider a set of candidate dictionary functions appropriate for the specific type of systems from which the data have been collected (e.g., biological, chemical, mechanical, or electrical system). We assume the measured time-series data and a set of candidate dictionary functions are given. Our main objective is to identify the most parsimonious representation that explains the collected time-series data at best. Generally, we cast the reconstruction problem into a sparse signal recovery problem (Candès and Tao (2005); Donoho (2006)). In Pan et al. (2013), the nonlinear network reconstruction problem was casted as a nonconvex optimisation problem which was shown to be efficiently solvable using a centralised reweighted lasso algorithm. Nonlinear reconstruction problems solved by centralised reconstruction methods have typically a relatively small size.
Social networks, communication networks and biological networks are typically very large (e.g. more than 100,000 nodes) and the data set collected from them is therefore quite “big”. Typically, centralised reconstruction algorithms cannot handle such problems due to their associated very large memory and computational requirements. Here, we will apply the alternating direction method of multipliers (ADMM) to split the centralised problem into several subproblems with each subproblem solving a weighed lasso problem independently. This approach has the advantage that memory and computational requirements can be both reduced in comparison to generic centralised solvers.
ADMM is a powerful algorithm for solving structured convex optimization problems. The ADMM method was introduced for optimisation in the 1970’s and is closely related to many other optimisation algorithms including Bregman iterative algorithms, Douglas-Rachford splitting, and proximal point methods (see (Boyd et al. (2011)) and references therein). ADMM has been shown to have strong convergence properties and to be useful for solving by decomposition large optimisation problems which cannot be handled by generic optimization solvers. ADMM has been applied in many areas, such as filtering (Wahlberg et al. (2012)), image processing (Figueiredo and Bioucas-Dias (2010)) as well as large-scale problems in statistics and machine learning (Boyd et al. (2011)).
The paper is organised as follows. In Section 2, we formulate the nonlinear network reconstruction problem we consider in this paper. In Section 3, we re-interpret this reconstruction problem from a Bayesian point of view. In Section 4, we derive an iterative reweighted lasso algorithm to solve the nonconvex optimisation problem based on concave-convex procedure. In Section 5, we review ADMM, apply it to our optimisation problem, and derive a distributed algorithm. In Section 6, we apply our method to the reconstruction of networks of interconnected Kuramoto oscillators. Finally, in Section 7, we conclude and discuss several future problems.
2 Problem Formulation
2.1 Nonlinear Dynamical Systems
We consider dynamical systems described by multi-input multi-output (MIMO) nonlinear discrete-time equations with additive noise:
where denotes the state vector; denotes the input vector; , and is assumed to be a zero-mean Gaussian white noise vector with constant positive covariance matrix , i.e., . Since this description covers most of the discrete-time nonlinear dynamical systems with infinite number of possible functional forms for , we confine the scope and assume system (1) satisfies the following assumptions:
The system (1) is fully measurable, i.e., all the state variables can be measured and there are no hidden variables.
2.2 Construction of Dictionary Functions
Depending on the field for which the dynamical model needs to be built, only a few typical nonlinearities specific to this field need to be considered. For example, the class of models that arise from genetic regulatory networks (GRN) typically involves nonlinearities that capture fundamental biochemical kinetic laws, e.g., first-order degradation functions, mass-action kinetics, Hill and Michaelis-Menten functions, which are confined to either polynomial or rational functions. In what follows we gather the set of all candidate/possible dictionary functions that we want to consider for reconstruction. Consider state variable , . Under Assumption 2, the function terms for state can be written as:
where and are dictionary functions that are assumed to govern the dynamics. can be monomial, polynomial, constant or any other functional form such as rational, exponential, trigonometric etc.
Taking the transpose of both sides of (2), we obtain
where is assumed to be i.i.d. Gaussian distributed: , with with . If data samples including the initial value satisfying (3) can be obtained from the system of interest, the system in (3) can be written as
The problem is thus to find given the measured noisy data stored in .
2.3 Discussion on relaxations of solutions
Considering the network reconstruction problem in practice, several problems arise with respect to (5). Firstly, a low number of time-series measurements will render the linear regression in (5) under-determined. Secondly, the number of columns of the dictionary matrix might be very large, due to the potential introduction of non-relevant and/or non-independent dictionary functions in . As a result, the sparsest solution will be favoured due to the model selection criterions such as Akaike information criterion, (Akaike (1974)) or Bayesian information criterion, (Schwarz (1978)). However, finding the sparsest solution is NP-hard. Classically, a lasso algorithm is typically used as a relaxation (Tibshirani (1996)) to this NP-hard optimisation problem. Lasso usually works well when the dictionary matrix has certain properties such as the restricted isometry property (RIP), (Candès and Tao (2005) or the incoherence property, (Donoho and Elad (2003)). These properties basically state that two or more of the columns of dictionary matrix cannot be co-linear or close to be co-linear. Unfortunately, such properties are hardly guaranteed in typical network reconstruction problems. In the following, we shall introduce, from a probabilistic viewpoint, how a Bayesian treatment can alleviate these RIP or incoherence requirements (Tipping (2001); Seeger and Wipf (2010)).
3 Bayesian Viewpoint
3.1 Sparsity Inducing Prior
Bayesian modelling treats all unknowns as stochastic variables with certain probability distributions, (Bishop (2006)). For , it is assumed that the stochastic variables in are i.i.d. Gaussian distributed with . In such case, the likelihood of the data given is
Given the likelihood function in (6) and specifying a prior . We further define a prior distribution as where is a given function of . To enforce sparsity on , the function is usually chosen as a concave, non-decreasing function of . Such penalty functions can be realised by adding sparsity inducing priors on . Such sparsity inducing priors include Laplace , where , Student’s t, where , and others, (Palmer et al. (2005)).
Typically, is approximated by a Gaussian distribution from which efficient algorithms exist (Bishop (2006)). To this end, we may consider super-Gaussian priors, which yield a lower bound for the priors (Palmer et al. (2005)). More specifically, if we define , we can represent the prior in the following relaxed (variational) form:
where is a nonnegative function which is treated as a hyperprior with being its associated hyperparameters. Throughout, we call the “potential function”. This Gaussian relaxation is possible if and only if is concave on . Hereafter, we adopt Student’s t prior and the corresponding .
3.2 Marginal Likelihood Maximisation
For a fixed , we define a relaxed prior which is a joint probability distribution over and
where Since is is Gaussian in (6), we can get a relaxed posterior which is also Gaussian.
Now the key question is how to choose the most appropriate to maximise such that can be a “good” relaxation to . Using the product rule for probabilities, we can write the full posterior Since is independent of , the quantity
is the prime target for variational methods (Wainwright and Jordan (2008)). This quantity is known as evidence or marginal likelihood. A good way of selecting is to choose it as the minimiser of the sum of the misaligned probability mass, e.g.,
The second equality is a consequence of (see (8)). The procedure in (3.2) is referred to as evidence/marginal likelihood maximisation or type-II maximum likelihood, (Tipping (2001); Seeger and Wipf (2010); Wipf et al. (2011); Seeger and Nickisch (2011)). It means that the marginal likelihood can be maximised by selecting the most probable hyperparameters able to explain the observed data. Once is computed, an estimate of the unknown weights can be obtained by setting to the posterior mean
However, such formulation does not allow to incorporate convex constraints on , which are typically very useful in network reconstruction. In (Wipf et al. (2011); Pan et al. (2013)), it is shown that the dual cost function in -space has the following form
To ease notation and avoid interrupting the flow, we will derive the algorithm without considering convex constraints in the sequel.
4 Concave-Convex Procedure
The cost function in (19) is convex in but nonconvex in . In the next section, we show how this nonconvex optimisation problem can be formulated as a concave-convex procedure (CCCP) and then, show that finding the solution to CCCP is equivalent to solving iterative reweighted lasso problem. CCCP is another interpretation of the derivation of iterative reweighted lasso algorithm (Wipf and Nagarajan (2010)).
Note that is jointly convex in and , is convex in . Then the minimisation of the cost function (19) can be formulated as a concave-convex procedure
Since is differentiable over , the problem in (12) can be transformed into the following iterative convex optimisation problem
Using basic principles in convex analysis, we then obtain the following analytic form for the negative gradient of at is:
Then the iterative procedure (13) can be formulated as
The objective function in (14) is jointly convex in and and can be globally minimised by solving over and then . If is fixed, it gives
We can then set , . Then we update by (4). However, some of the estimated weights will be several orders of magnitude lower than the average “energy”, e.g., . Thus a threshold needs to be defined a priori to prune “small” weights at each iteration.
We can now explain how the update of the parameters can be performed based on the above. Set the iteration count to zero and . At this stage, the optimisation is a typical lasso. Then at the iteration, we initialise , . The above described procedure is summarised in Algorithm 1.
In the next section, we will reformulate the centralised optimisation in Algorithm 1 into distributed optimisation by ADMM.
There is one interesting finding from (15). We can actually minimise over first then minimise over . Rather than only initialise , we also initialise . The iterative procedure is then a reweighted algorithm (Wipf and Nagarajan (2010))
Then with the estimate
has a closed form solution . Then we can update as in (4) and iterate until a stopping criterion is satisfied.
The update in (17) is actually a reweighted- algorithm. are the weighting vectors like in reweighted lasso in Algorithm 1. We also tested this reweighted algorithm for the examples in the sequel. However the heuristic convergence rate and reconstruction accuracy is not as good as reweighted lasso algorithm. We are still studying on it.
5 Alternating Direction Method of Multipliers (Admm)
In this section we give an overview of ADMM. We follow closely the development of (Boyd et al. (2011)
ADMM is a numerical algorithm for solving optimisation problems such as
with variable and , where , , and . We will assume that and are convex.
As the method of multipliers, we form the augmented Lagrangian
Defining the residual and as the scaled dual variable, we can express ADMM as
5.1.1 Stopping criterion
The primal and dual residuals at iteration are given by
We terminate the algorithm when the primal and dual residuals satisfy a stopping criterion:
Here, the tolerances and and can be set via an absolute plus relative criterion
where and are absolute and relative tolerances. More details can be found in (Boyd et al. (2011)).
5.2 Splitting across candidate functions
In our setting, the number of candidate functions will be very large. Therefore we partition across the candidate functions. Each subsystem can deal with its split of candidate functions independently then update the shared variables. The following are direct consequences of the so-called sharing problem in (Boyd et al. (2011)).
We partition the parameter vector as , with , where . Partition the dictionary matrix as , with . Thus , i.e., can be thought of as a ‘partial’ prediction of using only the candidate functions referenced in .
Then the reweighted lasso problem (16)
Following the approach used for the sharing problem, we express the problem as
with new variables . The derivation and simplification of ADMM also follows that for the sharing problem. The scaled form of ADMM is
As in the discussion for the sharing problem, we carry out the -update by first solving for the average
where . Substituting the last expression into the update for , we find that
which shows that, as in the sharing problem, all the dual variables are equal. Using a single dual variable , eliminating , and define
we arrive at the Algorithm 2
Each -update is a lasso problem with variables, which can be solved using any lasso method.
In the -update, we have (meaning that none of the features in the -th block are used) if and only if
The first step involves solving parallel weighted lasso (weighted -regularised least squares) problems in variables each. Between the first and second steps, we collect and sum the partial predictors to form . The second step is a single minimisation in variables, a quadratically regularised loss minimisation problem
5.3 ADMM for Weighted Lasso
The weighted lasso problem can be solved using ADMM as well.
Define , it yields the ADMM algorithm
where is the penalty parameter and the soft thresholding operator is defined as
5.4 Algorithm For Nonlinear Network Reconstruction
Now, we summarise the procedure for nonlinear network reconstruction in Algorithm 4.
Collect the time-series data, specify the candidate functions and construct the dictionary matrix;
Partition the dictionary matrix as , with ;
Initialise the weight as , , with .
6 Numerical Illustration
6.1 An Example of Kuramoto Oscillator
A classical example in physics, engineering and biology is the Kuramoto oscillator network (Strogatz (2000)). We consider a network where the Kuramoto oscillator are nonidentical (each has its own natural oscillation frequency ) and the coupling strengths between nodes are not the same. The corresponding discrete-time dynamics can be described by
where , is the phase of oscillator , is its natural frequency, and the coupling function is usually taken as sine for all . represent the coupling strength between oscillators and thus defines the topology of the oscillator network. Here, assuming we don’t know the exact form of , we reconstruct from time-series data of the individual phases a dynamical network consisting of Kuramoto oscillators, i.e., we identify the coupling functions as well as the model parameters, i.e., and , .
To define the dictionary matrix , we assume that all the dictionary functions are functions of a pair of state variables only and consider candidate coupling functions : , . Based on this, we define the dictionary matrix as
To also take into account the natural frequencies, we add to the last column of a unit vector. This leads to the following dictionary matrix :
Then the output can be defined as
To generate the time-series data, we simulated a Kuramoto oscillator network for which of the non-diagonal entries of the weight matrix are nonzero (assuming and are zeros), and the non-zero values are drawn from a standard uniform distribution on the interval . The natural frequencies are drawn from a normal distribution with mean and variance . In order to create simulated data, we simulated the discrete-time model (24) and took ‘measurements data points’ every between and (in arbitrary units) from random initial conditions which are drawn from a standard uniform distribution on the open interval . Thus a total of 1001 measurements for each oscillator phase are collected (including the initial value). Once again, it should be noted that the the number of rows is less than that of columns of the dictionary matrix.
6.2 Algorithmic Performance Comparisons
The reweighted lasso algorithm can be implemented in a centralised way by using CVX (Grant et al. (2008)) or YALMIP (Lofberg (2004)), Matlab packages for specifying and solving convex optimisation problems. CVX or YALMIP calls generic SDP solvers (SDPT3 (Toh et al. (1999) or SeDuMi (Sturm (1999)) to solve the problem. While these solvers are reliable for wide classes of optimisation problems, they are not customised for particular problem families, such as ours.
We compare the centralised algorithm using CVX, the centralised algorithm using ADMM, and the distributed algorithm using ADMM. We fixed the number of measurements to be and varied the network size between and . For the distributed algorithm, we split the problem into 1000 subproblems where each one has the same dimension. The algorithm is implemented in MATLAB R2012b. The calculations were performed on HP workstation with two 8 core Intel® Xeon(R) CPU E5-2650 2.00GHz with 64 GB RAM.
Since the reconstruction problem in (4) for each node is independent, we therefore consider the performance of a single node for illustration. We first investigate the performance for different signal-to-noise ratios of the data generated for this example. We define signal-to-noise ratio (SNR) by . We considered SNR ranging from 5 dB to 25 dB for each generated weight. To characterise the accuracy of a reconstruction, we use the normalised mean square error (NMSE) as a performance index, defined by , where is the estimate of the true weight . For each SNR, we generated 50 independent experiments (with different initial conditions and parameters) and calculated the average NMSE for each SNR over these 10 experiments. The results are shown in Fig. 1.
Next, we compare the computation time for different network sizes for each method. These methods are tested under different SNRs as above. In the implementation of the distributed algorithm, we use Matlab command
parfor to parallelise the -update in Algorithm 2. We use Matlab command
matlabpool(‘size’) to start a worker pool. The
size varies from 2 to 10. For each method, we found that the computation time over each SNR varied slightly (at least within the same magnitude). We calculated the average computation time from a total 250 (=5) independent experiments for each method.
The results are shown in Table 1 (unit in second).
For larger problem, with network size 50,000 and 100,000, CVX based reweighted lasso run into memory difficulties. Thus there are no results reported in the table for these network sizes.
For the distributed reweighted lasso algorithm, the computation time decreases when the problem is split between an increasing number of processors.
However, it should be noted the computation time for distributed reweighted lasso is small partially because Matlab performs some matrix computations in parallel. On the other hand, CVX exploits only one core.
For all the experiments, we set the penalty parameter in the augmented Lagrangian and the scalar regularisation parameter . We also considered termination tolerances and and set the ADMM iteration number to be 200 and the reweighted iteration number to be 10 throughout all algorithms.
|Distributed reweighted lasso with 2 cores||24.7||84.3||156.5||587.1||1341.5|
|Distributed reweighted lasso with 4 cores||16.4||64.1||91.5||411.8||868.7|
|Distributed reweighted lasso with 10 cores||15.5||46.9||62.9||345.2||788.3|
7 Conclusion and Discussion
In this paper, a new distributed reconstruction method for nonlinear dynamical network is proposed. The proposed method only requires time-series data and some prior knowledge about the class of systems for which a dynamical model needs to be built. The network reconstruction problem can be casted as a sparse linear regression problem. Under Bayesian interpretation, this problem is solved using a reweighed algorithm which can further reduce the Normalised Mean Square Error in comparison with the classic lasso algorithm. Furthermore, our distributed algorithm can deal with networks comprising more than 50,000 nodes, which centralised algorithm typically cannot deal with.
Although the convergence of ADMM algorithm and reweighted lasso algorithm have been studied previously (Boyd et al. (2011); Wipf and Nagarajan (2007)), the convergence of the algorithms developed in this paper still need to be properly characterised. We are currently establishing such convergence results as well as the associated convergence rates and their dependence on parameters such as and .
Authors would like to thank Dr Ye Yuan and Dr Jorge Gonçalves for helpful discussions and suggestions.
- Akaike (1974) H. Akaike. A new look at the statistical model identification. Automatic Control, IEEE Transactions on, 19(6):716–723, 1974.
- Bishop (2006) C.M. Bishop. Pattern recognition and machine learning, volume 4. springer New York, 2006.
- Boyd et al. (2011) Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
- Candès and Tao (2005) E.J. Candès and T. Tao. Decoding by linear programming. Information Theory, IEEE Transactions on, 51(12):4203–4215, 2005.
- Donoho (2006) D.L. Donoho. Compressed sensing. Information Theory, IEEE Transactions on, 52(4):1289–1306, 2006.
- Donoho and Elad (2003) D.L. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via minimization. Proceedings of the National Academy of Sciences, 100(5):2197–2202, 2003.
- Figueiredo and Bioucas-Dias (2010) Mário AT Figueiredo and José M Bioucas-Dias. Restoration of poissonian images using alternating direction optimization. Image Processing, IEEE Transactions on, 19(12):3133–3145, 2010.
- Grant et al. (2008) M. Grant, S. Boyd, and Y. Ye. Cvx: Matlab software for disciplined convex programming. Online accessiable: http://stanford. edu/b̃oyd/cvx, 2008.
- Haber and Unbehauen (1990) R. Haber and H. Unbehauen. Structure identification of nonlinear dynamic systems: survey on input/output approaches. Automatica, 26(4):651–677, 1990.
- Ljung (1999) L. Ljung. System Identification: Theory for the User. Prentice Hall, 1999.
- Ljung et al. (2011) L. Ljung, H. Hjalmarsson, and H. Ohlsson. Four encounters with system identification. European Journal of Control, 17(5):449, 2011.
- Lofberg (2004) Johan Lofberg. Yalmip: A toolbox for modeling and optimization in matlab. In Computer Aided Control Systems Design, 2004 IEEE International Symposium on, pages 284–289. IEEE, 2004.
- Palmer et al. (2005) Jason Palmer, Kenneth Kreutz-delgado, Bhaskar D Rao, and David P Wipf. Variational em algorithms for non-gaussian latent variable models. In Advances in neural information processing systems, pages 1059–1066, 2005.
- Pan et al. (2013) Wei Pan, Ye Yuan, Jorge Gonçalves, and Guy-Bart Stan. Bayesian approaches to nonlinear network reconstruction. submitted, 2013.
- Schwarz (1978) G. Schwarz. Estimating the dimension of a model. The annals of statistics, 6(2):461–464, 1978.
- Seeger and Nickisch (2011) Matthias W Seeger and Hannes Nickisch. Large scale bayesian inference and experimental design for sparse linear models. SIAM Journal on Imaging Sciences, 4(1):166–199, 2011.
- Seeger and Wipf (2010) M.W. Seeger and D.P. Wipf. Variational bayesian inference techniques. Signal Processing Magazine, IEEE, 27(6):81–91, 2010.
- Sjöberg et al. (1995) J. Sjöberg, Q. Zhang, L. Ljung, A. Benveniste, B. Delyon, P.Y. Glorennec, H. Hjalmarsson, and A. Juditsky. Nonlinear black-box modeling in system identification: a unified overview. Automatica, 31(12):1691–1724, 1995.
- Strogatz (2000) S.H. Strogatz. From kuramoto to crawford: exploring the onset of synchronisation in populations of coupled oscillators. Physica D: Nonlinear Phenomena, 143(1):1–20, 2000.
- Sturm (1999) Jos F Sturm. Using sedumi 1.02, a matlab toolbox for optimization over symmetric cones. Optimization methods and software, 11(1-4):625–653, 1999.
- Tibshirani (1996) R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
- Tipping (2001) M.E. Tipping. Sparse bayesian learning and the relevance vector machine. The Journal of Machine Learning Research, 1:211–244, 2001.
- Toh et al. (1999) Kim-Chuan Toh, Michael J Todd, and Reha H Tütüncü. Sdpt3–a matlab software package for semidefinite programming, version 1.3. Optimization Methods and Software, 11(1-4):545–581, 1999.
- Wahlberg et al. (2012) Bo Wahlberg, Stephen Boyd, Mariette Annergren, and Yang Wang. An admm algorithm for a class of total variation regularized estimation problems. 16th IFAC Symposium on System Identification, 2012.
- Wainwright and Jordan (2008) M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
- Wipf and Nagarajan (2010) D. Wipf and S. Nagarajan. Iterative reweighted l1 and l2 methods for finding sparse solutions. IEEE Journal of Selected Topics in Signal Processing, 4(2):317–329, 2010.
- Wipf and Nagarajan (2007) David P Wipf and Srikantan S Nagarajan. A new view of automatic relevance determination. In Advances in Neural Information Processing Systems, pages 1625–1632, 2007.
- Wipf et al. (2011) D.P. Wipf, B.D. Rao, and S. Nagarajan. Latent variable bayesian models for promoting sparsity. Information Theory, IEEE Transactions on, 57(9):6236–6255, 2011.