This work focuses on the estimation of change-points in a time-varying Ising graphical model (outputs or ) evolving in a piecewise constant fashion. The occurring changes alter the graphical structure of the model, a structure which we also estimate in the present paper. For this purpose, we propose a new optimization program consisting in the minimization of a penalized negative conditional log-likelihood. The objective of the penalization is twofold: it imposes the learned graphs to be sparse and, thanks to a fused-type penalty, it enforces them to evolve piecewise constantly. Using few assumptions, we then give a change-point consistency theorem. Up to our knowledge, we are the first to present such theoretical result in the context of time-varying Ising model. Finally, experimental results on several synthetic examples and a real-world dataset demonstrate the empirical performance of our method.
Detecting multiple change-points in the time-varying Ising model
Batiste Le Bars &Pierre Humbert &Argyris Kalogeratos &Nicolas Vayatis
CMLA, CNRS, ENS Paris-Saclay,
94235, Cachan cedex, France. &
31670, Labège, France
Graphs are fundamental tools for revealing and studying static or varying relationships between variables of potentially high-dimensional vector-data. In the static scenario, the task of learning relationships between variables is referred as graph inference and emerges in many fields such as in Graph Signal Processing (dong2016learning; bigbosses) or in physics and biology (rodriguez2011uncovering; du2012learning). In this work, we consider a probabilistic framework where the observed data are drawn from an Ising model, which is a discrete Markov Random Field (MRF) with -outputs. MRF are undirected probabilistic graphical models (koller2009probabilistic) where a set of random variables are represented as different nodes of a graph. In this graph an edge between two nodes indicates that the two corresponding random variables are dependent given the other variables.
Using a set of observations to estimate the structure of such probabilistic graphical models has been widely investigated (banerjee2008model). In particular for Gaussian Graphical Models (GGM) (meinshausen2006high; yuan2007model; ren2015asymptotic) with the graphical lasso (friedman2008sparse). Learning the parameters of an Ising model is also addressed by several papers in the literature (guo2010joint; hofling2009estimation; ravikumar2010high; xue2012nonconcave).
Over the past years, however, there has been a burst of interest in investigating variations and changes occurring over time in real-world networks. Take for example the brain functional activity network that radically changes during an epilepsy (kramer2008emergent) or faced to an external stimuli such as anesthesia. Similar observations can be made for several network cases, namely social, transportation, communication, or biological networks (e.g. interacting genes (friedman2008sparse)).
Thus, the interest on the problem of change-point detection emerges naturally for time-varying probabilistic graphical models that are of particular significance for studying dynamic phenomena. In fact, this has been extensively investigated in time-varying MRFs with piecewise constant structure. The Gaussian case has been particularly explored (wang2018fast), either algorithmically (hallac2017network; gibberd2017regularized), or theoretically (kolar2012estimating) for all the possible variations of the change-point detection problem: single (bybee2018change) or multiple (gibberd2017multiple) change-points; offline (kolar2012estimating) or online (keshavarz2018sequential) detection, etc.
Notably, the advancements related to the time-varying Ising model are a lot more limited. In ahmed2009recovering and kolar2010estimating, the authors introduce the task of fitting the parameters of a time-varying Ising model without emphasis on the change-point detection problem, and without providing a theoretical understanding of their approach. Only the single change-point detection method in roy2017change comes with theoretical analysis.
In this work we consider the task of detecting an unknown number of change-points in a time-varying Ising model with piecewise constant parameters, while also being able to estimate the varying graph structure. In technical terms we introduce an optimization program using a fused-type penalty (harchaoui2010multiple; kolar2012estimating). Contrarily to ahmed2009recovering and kolar2010estimating, we propose the use of the -norm as fused penalty, which allows us to derive a consistency theorem for the detected change-points. To the best of our knowledge, we are the first to provide such a theoretical result. The proposed method is referred as TVI-FL, which stands for Time-Varying Ising model identified with Fused and Lasso penalties.
The paper is organized as follows. First, we briefly recall important properties of the static Ising model and describe its piecewise constant version. Second, we present our methodology to infer the piecewise constant graph structure over time and the timestamps at which abrupt changes occur. Then, we present our main theoretical result which consist in a change-point detection consistency theorem. Finally, results on simulated data show empirically that our method outperforms the Gaussian version of it (kolar2012estimating), so as the method based on a total-variation penalty (ahmed2009recovering), with respect to our simulation process. An additional real-world experiment done on a telecommunication dataset shows promising results.
2 The time-varying Ising model
Let us first recall that the static Ising model is a discrete MRF with -outputs. This model is defined by a graph and is associated to a symmetric weight matrix whose non-zero elements correspond to the set of edges . Formally, for the the -th element of is iff . An Ising model is thus entirely described by its associated weight matrix . Let be a random vector following an Ising model, with realizations , then its associated probability function is given by:
where is the the normalizing constant, and , are respectively the -th and -th coordinates of . In the following we write .
Due to the intractability of the normalizing constant , most Ising model inference methods use the conditional probability distribution function of a node knowing the other nodes given by:
where denotes the -th column (or row) of that is used to parametrize the probability function of Eq. (2). Here, denotes the vector without the coordinate .
A time-varying Ising model is defined by a set of graphs , over a fixed set of nodes through a time-varying set of edges . Like in the static case, each is associated to a symmetric weight matrix and an independent distribution given by Eq. (1). A random variable associated to this model is a set of independent random vectors , hence a single realization is a set of vectors, each denoted by .
In addition, we assume that the model is piecewise constant, i.e. there exist a collection of timestamps , sorted in ascending order, and a set of real-valued symmetric matrices such that :
with and , while is the number of change-points considered by the model. According to Eq. (3), for a fixed , the set contains iid vector samples drown from .
3 Learning Methodology
Assuming the observation of a single realization of the described time-varying model, our objective is twofold. Firstly, we want to recover the set of change-points , and secondly, we want to recover the graph structure underlying the observed data vectors, i.e. which edges are activated at each time-stamp. We next describe our methodology to perform those tasks.
Due to the complexity of our model and the aforementioned intractability of the normalizing constant , classical maximum likelihood approaches are difficult to apply. Hence, we extend the neighborhood selection strategy introduced for the static setting of ravikumar2010high to our time-varying setting. Instead of maximizing the global likelihood of Eq. (1), this approach maximizes, for each node, the conditional likelihood of the node knowing the other nodes. As in the Gaussian version of our framework (kolar2012estimating), the proposed TVI-FL method, uses a fused-like penalty to infer the change-points and assumes that for each submodel the underlying graph is sparse. With such approach, at each timestamp, the local neighborhood of each node is selected with a penalized logistic regression method given in the next section.
3.1 Optimization program
For each node , we solve the program:
where stands for the negative conditional log-likelihood of the node , knowing the rest of the nodes. It has a closed-form given by:
Provided two positive hyperparameters, and , the penalty function that we propose to take is the following:
With such an objective function, we learn at each timestamp the neighborhood of the node given by the -th column of . The penalty function is necessary to prevent overfitting. In particular, without the first term, we would fit for each timestamp a parameter vector that perfectly matches the observation (in terms of likelihood). In such situation, we would obtain as many different parameter vectors as the different samples, making the piecewise constant assumption (Eq. 3) impossible to get. On the contrary here, the -norm encourages consecutive parameter vectors to be equal. The second term, also important to prevent overfitting, especially when the number of nodes is big, allows the learned parameter vectors to be sparse and thus imposes structure in the learned graphs. Knowing that, the hyperparameter controls the number of estimated change-points – the bigger is, the lower the number of estimated change-points is. Similarly, when increases, the sparsity of each parameter vector increases as well.
3.2 Change-point detection and structure identification
Now assume that the optimization program (4) is completed. The set of estimated change-points is:
namely the set of timestamps at which the estimated parameter vectors have changed. The total number of estimated change-points is given by the cardinality of , let that be .
If we are interested in learning the graph weights for each submodel , the -th column of is estimated by . All the non-zero elements of indicate the neighborhood of .
One should notice that this estimation would lead to a non-symmetric weight matrix (directed graphs). To face this problem, the authors of ravikumar2010high; kolar2012estimating propose to either use the min or max operator. For example, to estimates the structure of the -th graph, we take:
where is the -th element of and conversely for . In this case, there is an edge between two nodes if at least one of the nodes and contains the other node in its neighborhood.
4 Theoretical analysis
In this section, we present a change-point consistency theorem when the right number of change-points is estimated. This property states that when the number of samples tends to infinity, the time instances at which the change-points occur are well-estimated. Although left for future work, this consistency result is fundamental to obtain the consistency of the estimated graph structures, as it is the case in former works in time-varying Gaussian graphical models (kolar2012estimating; gibberd2017multiple).
We now give two important quantities in order to introduce the main assumptions of our theorem. Denoting by the set of indices , the first quantity is the minimal time difference between two change-points and is given by:
The second one is the minimal variation in parameters between two change-points and is given by:
The following assumptions are assumed to be true for each node .
(A1) There exist two constants and such that , and . Here and denote respectively the smallest and largest eigenvalues of the input matrix.
(A2) There exists a constant such that and a constant such that .
(A3) The sequence satisfies, for each , , where is a fixed, unknown sequence of the change-point fractions belonging to . This allows to have enough samples in each of the submodels, as goes to infinity.
4.2 A change-point consistency theorem
In this section, we show the change-point consistency when the right number of change-points are estimated. The proof is made for a unique node but generalizes to all the other nodes. We first give optimality conditions necessary to the global proof.
(Optimality Conditions) A matrix is optimal for problem (4) if and only if there exist a collection of subgradient vectors and , with and , that for all they satisfy:
and the hyperbolic tangent function.
The proof is given in the Appendix. It consists in writing the sub-differential of the objective function and say, thanks to the convexity, that belongs to it. ∎
(Change-point consistency) Let be a sequence of observation drawn from a piecewise constant Ising model presented in Sec. 2. Suppose that the assumptions described earlier hold, and assume that . Let be a non-increasing sequence that converges to , and such that , , with . Assume further that , , and .
Then, if the right number of change-points are estimated (), we have:
The proof closely follows the steps given in harchaoui2010multiple; kolar2012estimating; gibberd2017multiple which can be found in the Appendix. We present here a sketch of proof. First of all, thanks to the union bound, the probability of Eq. (8) can be upper bounded by:
To prove Eq. (8), it is now sufficient to show that . Let us define the event and its complementary . The rest of the proof is divided in two parts: bounding the good scenario, namely shows that and doing the same for the bad scenario i.e .
To show the first scenario, the proof applies Lemma 1 to bound the considered probability by three others probabilities. Those latter are then asymptotically bounded by thanks to a combination of the Assumptions (A1-A3), the assumptions of the theorem and concentration inequalities related to the considered time-varying Ising model given in the lemmas of the Appendix.
To bound the bad case scenario, the three following complementary events are defined:
Thus, it suffices to prove that , and tends to as goes to infinity. To prove this, similar arguments to those used for the good case are again employed. ∎
5 Simulated Experiments
5.1 Preliminary questions about the optimization procedure
The first question that rises is how to solve the program (4). Although non-differentiable, due to the convexity of the objective function there are several solvers available to minimize such a function. In our experiments, we use the python package CVXPY (cvxpy). Note that, since the optimization of each node parameter is independent, all these can be optimized in parallel.
Another technical remark about the optimization step is that, in some situations, the data vectors that we observe does not necessarily belong to but to instead. In this case, the proposed method is essentially the same, except that the negative log-likelihood must be replaced by:
In some cases, we may observe more than one data vector at each timestamp. Again, in this case, one has to replace the negative log-likelihood by the following:
where stands for the number of data vectors observed at timestamp and for the -th observed vector at time .
The last technical issue concerns the choice of the hyperparameters and . In kolar2012estimating and ahmed2009recovering, the authors note that the problem of learning the weights actually corresponds to a supervised classification task. Indeed, our optimization function actually corresponds to the regression of a node give the others as explanatory variables. For this reason they propose to either use cross-validation techniques or BIC for model selection. In this work we use the BIC, first used in kolar2010estimating, which consist in averaging for all nodes the following quantity:
We will see that, despite being easy to compute, BIC does not necessarily identify the best model regarding our learning tasks.
5.2 Experimental setup
We compare our method, TVI-FL, to two state-of-the-art approaches. The first one, referred as TD-lasso (kolar2012estimating), corresponds to the Gaussian version of our algorithm. Indeed, it assumes that the output of the graphical models are Gaussian data and their proposed optimization program is the one of Eq. (4), where the negative likelihood function is replaced by . The second approach, named TESLA (ahmed2009recovering), is closely related to our method as it simply replaces the -norm in the penalty function by the -norm. However, this method should not be seen as proper competitor. Indeed, as explained in (hallac2017network), the difference in using either the - or the -norm resides in the type of changes we aim to detect. The -norm will be best-suited to the detection of local changes in the neighborhood of a node, whereas the -norm will be more appropriate for the detection of global changes.
We propose to compare the methods’ performance on totally independent synthetic datasets. For each of them, the total number of time-instances is and number of change-points , resulting in submodels. The first time-instances are dedicated to the first submodel, the next for the second, and the last for the third one.
Then, for each of the submodels, we generate independently a random regular graph of size with the degree of all nodes equal to . To draw such random graph, we used the generator developed in the python package Networkx (hagberg2008exploring). Note also that, for a fixed example, the degree of the graph remains the same across the three submodels. The experiments are actually performed for the four values of , resulting in experiments per dataset in total.
Now that we have the the graph structure of the Ising models, we need the edge weights. As in the paper of ahmed2009recovering, for each edge, the corresponding weight is drawn from a uniform distribution taken over .
Finally we need to draw the observations following the corresponding submodel. To do so, we used Gibbs sampling with a burn-in period of samples and a lag of between each sample to avoid dependency between the observation. Furthermore, to accelerate the convergence, we observe (instead of just )vectors of size at each time-instance. It results in the sample of observations for each of the examples and in the use of the likelihood of Eq. (9).
We use four different metrics usually used in such experiments. The first one, the Hausdorff metric (-score), measures the performance of the estimated set of change-points . It corresponds to the greatest temporal distance between a change-point and its prediction:
and has to be as small as possible.
The other metrics, namely , , and score, concern the goodness of the learned graph structure. They are given by the following equations:
A final remark concerns the choice of the two hyperparameters. We performed a grid-search strategy to find the best pair of values for in for TVI-FL and TESLA, and in for TD-lasso.
To begin with, for each experiment and for each degree , we select the models with the lowest BIC (Eq. (10)). Associated metrics are collected in Tab. 1. For or , TVI-FL obtains the best results in terms of graph reconstruction and change-point detection, e.g. when the average -score and -score are equal to and , against and for Tesla, and and for TD-Lasso. On the contrary, when increases, the BIC fails to select the best model for our method, which results in better performances for TD-Lasso.
In the following, we empirically prove that it is possible to obtain more accurate results on both -score and -score with a better selection of the hyperparameters. To do so, for each experiment and for each degree , we select the model with the highest -score when the associated -score is: (a) equal to , or (b) lower than , or (c) , or (d) . In other words, we evaluate the best -score when the worst change-point error is equal to or lower than or timestamps. The results are displayed in Fig. 1. First of all, it shows that for a lower -score, TVI-FL is able to obtain similar -score than those displayed in Tab. 1. Furthermore, we observe that it almost always gives better -score than the competitors. Notice that, during our experiments, no model from TD-Lasso reached an -score equal to , and hence it is not displayed in Fig. 1(a). In conclusion, although having similar results as the graph density increases, with a better model selection could help TVI-FL to outperform the state-of-the-art methods. In the Appendix, we provide the table of results when the -score is lower than . It confirms that BIC does not select the best set of hyperparameters, and that TVI-FL outperforms the competitors by selecting them properly.
Finally we display in Fig. 2 the heat-maps of the average -score and -score with respect to and when the degree of each node is , or . From these heat-maps, we see that when , both and need to be high enough (greater than and ) to obtain a -score lower than . On the contrary, to obtain an -score greater than , need to be lower than and greater than . Hence, there is a trade-off between accurately detecting the change-points (low -score) and recovering accurately graph structure (high -score). As an example, if we want -score , we need to set and , but with such values we cannot have -score higher than . The same empirical conclusion can be made for the other node degrees. Heat-maps of the other methods are given in the Appendix and corroborate the idea of finding a trade-off. It appears that compared to them, the trade-off is easier to find for TVI-FL, as a wide range of are suitable.
5.4 A real-world example
In this section, we evaluate the goodness of graph learning with TVI-FL on the Sigfox IoT dataset (le2019probabilistic). It consists in data from a telecommunication network, where each observation corresponds to a message that has been received by a subset of antennas. Each data vector is binary and indicates which antennas has received the message or not (, not ). The dataset contains all the messages received by the antennas, on a daily basis over a period of five months, resulting in timestamps. According to the authors, one antenna is working poorly after timestamp number . In the following experiment, we selected this antenna so as the geographically closest others, and we selected randomly messages at each timestamp. The learned graphs with TVI-FL at timestamps (before the antenna problem) and (during the antenna problem) are displayed in Fig. 3 (only the positive edges are drawn).
The goodness-of-fit of our method can be corroborated by the two following observations; 1) The learned graphs are in adequacy with the spatial distribution of the antennas: nearby antennas are more likely to be connected. This was expected as nearby antennas have higher chance to receive the same messages, 2) The failing antenna lost edges between the two graphs. Again, this could have been expected since a poorly working antenna would receive less messages, implying a decreasing correlation with its neighbors.
6 Conclusion and future work
This paper proposed a new optimization program to learn a time-varying Ising model with piecewise constant structure. To the best of our knowledge, it is the first to provide a change-point consistency theorem and a empirical comparative study with the previous works on both time-varying Ising (ahmed2009recovering; kolar2010estimating) and Gaussian (kolar2012estimating) graphical models. Future directions of research include the investigation of a more adapted way to solve the optimization program, extensive real-world experiments, and improving the theoretical analysis with the proof of consistent graph structure estimation (sparsistency).