A Learning-to-Infer Method for Real-Time
Power Grid Topology Identification
Abstract
Identifying arbitrary topologies of power networks in real time is a computationally hard problem because the number of hypotheses grows exponentially with the network size. A new “Learning-to-Infer” variational inference method is developed for efficient inference of every line status in the network. Optimizing the variational model is transformed into, and solved as, a discriminative learning problem based on Monte Carlo samples generated with power flow simulations. A major advantage of the developed Learning-to-Infer method is that the labeled data used for training can be generated quickly, in arbitrarily large amounts, and at very little cost. As a result, the power of offline training is fully exploited to learn very complex classifiers for effective real-time topology identification. The proposed methods are evaluated in the IEEE 30, 118 and 300 bus systems. Excellent performance in identifying arbitrary power network topologies in real time is achieved even with relatively simple variational models and a reasonably small amount of data.
I Introduction
Lack of situational awareness in abnormal system conditions is a major cause of blackouts in power networks [3]. Network component failures such as transmission line outages, if not identified and contained in a timely manner, can quickly escalate to cascading failures. In particular, when line failures happen, the power network topology changes instantly, newly stressed areas can unexpectedly emerge, and subsequent failures may be triggered that lead to increasingly complex network topology changes. While the power system is usually protected against the so-called “$N-1$” failure scenarios (i.e., only one component fails), as failures accumulate, effective automatic protection is no longer guaranteed. Thus, when cascading failures start developing, real-time protective actions critically depend on correct and timely knowledge of the network status. Indeed, without knowledge of the network topology changes, protective control methods have been observed to further aggravate the failure scenarios [4]. Thus, real-time network topology identification is essential to all network control decisions for mitigating failures. In particular, since the first few line outages may have already been missed, the ability to identify in real time the network topology with an arbitrary number of line outages becomes critical to prevent system collapse.
Real-time topology identification is, however, a very challenging problem, especially when unknown line statuses in the network quickly accumulate, as in scenarios that cause large-scale blackouts [3]. The number of possible topologies grows exponentially with the number of unknown line statuses, making real-time topology identification fundamentally hard. Other limitations in practice such as behaviors of human operators under time pressure, missing and contradicting information, and privacy concerns over data sharing can make this problem even harder. Assuming a small number of line failures, exhaustive search methods have been developed in [5], [6], [7] and [8] based on hypothesis testing, and in [9] and [10] based on logistic regression. To overcome the prohibitive computational complexity of exhaustive search methods, [11] has developed sparsity-exploiting outage identification methods with overcomplete observations to identify sparse multi-line outages. Without assuming sparsity of line outages, a graphical-model-based approach has been developed for identifying arbitrary network topologies [12].
On a related note, non-real-time topology identification has also been extensively studied: the underlying topology stays the same, while a large amount of data is collected over a relatively long period of time before the topology can be identified. A variety of data have been exploited for addressing this problem, e.g., data of power injections [13], voltage correlation [14], and energy prices [15]. For power distribution systems in particular, graphical-model-based approaches have been developed [16, 17].
In this paper, we focus on real-time identification of arbitrary grid topologies based on instantly collected measurements in the power system. We start with a probabilistic model of the variables in a power system (topology, power injections, voltages, power flows, currents, etc.) and in its monitoring system (sensor measurements on all kinds of physical quantities). We then formulate the topology identification problem in a Bayesian inference framework, where we aim to compute the posterior probabilities of the topologies given any instant measurements.
To overcome the fundamental computational complexity due to the exponentially large number of possible topologies, we develop a variational inference framework, in which we aim to approximate the desired posterior probabilities using models that allow computationally easy marginal inference of line statuses. Importantly, we develop “end-to-end” variational models for topology identification, and allow arbitrary variational model structures and complexities. In order to find effective end-to-end variational models, we transform optimizing a variational model into a discriminative learning problem leveraging a Monte Carlo approach: a) Based on full-blown power flow equations, data samples of network topology, network states, and sensor measurements in the network can be efficiently generated according to a generative model of these quantities, and b) With these simulated data, discriminative models are learned offline, which then offer real-time prediction of the network topology based on newly observed instant measurements from the real world. We thus term the proposed method “Learning-to-Infer”. It is important to note that this Learning-to-Infer method is not limited by any potential lack of real-world data, as the offline training procedure can be conducted entirely based on simulated data.
A major strength of the proposed Learning-to-Infer method is that the labeled data set for training the variational model can be generated in an arbitrarily large amount, at very little cost. As such, we can fully exploit the benefit of offline model training in order to get accurate online topology identification performance. The proposed approach is also not restricted to specific models and learning methods, but can exploit any powerful models such as deep neural networks [18]. As a result, variational models of very high complexities can be adopted without worrying about overfitting, since more labeled training data can always be generated if overfitting is observed.
The developed Learning-to-Infer method is evaluated in the IEEE 30, 118, and 300 bus systems [19] for identifying topologies with an arbitrary number of line outages. It is demonstrated that, even with relatively simple variational models and a reasonably small amount of data, the performance is surprisingly good for this very challenging task.
The remainder of the paper is organized as follows. Section II introduces the system model, and formulates real-time topology identification as a Bayesian inference problem. Section III develops the Learning-to-Infer variational inference method. Section IV discusses the architectures of neural networks employed in this study. Section V presents the results from our numerical experiments. Section VI concludes the paper.
II Problem Formulation
II-A Power Flow Models
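We consider a power system with $N$ buses, and its baseline topology (i.e., the network topology when there is no line outage) with $L$ lines. We denote the incidence matrix of the baseline topology by $\mathbf{M} \in \{-1, 0, 1\}^{N \times L}$ [20]. We use a binary variable $s_l$ to denote the status of a line $l$, with $s_l = 1$ for a connected line $l$, and $s_l = 0$ otherwise. The actual topology of the network can then be represented by $\mathbf{s} = [s_1, \ldots, s_L]^T$. Generalizing this notation, we also employ $s_{ij}$ to denote whether two buses $i$ and $j$ are connected by a line or not. Given a network topology $\mathbf{s}$, the system’s bus admittance matrix $\mathbf{Y}$ can be determined accordingly with the physical parameters of the system [21]: $\mathbf{Y} = \mathbf{G} + j\mathbf{B}$, where $\mathbf{G}$ and $\mathbf{B}$ denote conductance and susceptance, respectively. Note that, when two buses $i$ and $j$ are not connected, $Y_{ij} = G_{ij} = B_{ij} = 0$.
We denote the real and reactive power injections at all the buses by $\mathbf{P}, \mathbf{Q} \in \mathbb{R}^N$, and the voltage magnitudes and phase angles by $\mathbf{V}, \boldsymbol{\theta} \in \mathbb{R}^N$. Given the bus admittance matrix $\mathbf{Y}$, the nodal power injections and the nodal voltages satisfy the following AC power flow equations [21]:
(1) $$P_i = V_i \sum_{j=1}^{N} V_j \left( G_{ij} \cos(\theta_i - \theta_j) + B_{ij} \sin(\theta_i - \theta_j) \right), \qquad Q_i = V_i \sum_{j=1}^{N} V_j \left( G_{ij} \sin(\theta_i - \theta_j) - B_{ij} \cos(\theta_i - \theta_j) \right),$$
where a subscript $i$ denotes the $i$-th component of a vector. In particular, given the network topology $\mathbf{s}$ and a set of controlled input values $\{\mathbf{P}^{I}, \mathbf{Q}^{I}, \mathbf{V}^{I}\}$ (where $\mathbf{P}^{I}, \mathbf{Q}^{I}$ and $\mathbf{V}^{I}$ consist of some subsets of $\mathbf{P}, \mathbf{Q}$ and $\mathbf{V}$, respectively), the remaining values of $\{\mathbf{P}, \mathbf{Q}, \mathbf{V}, \boldsymbol{\theta}\}$ can be determined by solving (1). Typically, apart from a slack bus, most buses are “PQ buses” at which the real and reactive power injections are controlled inputs, and the remaining buses are “PV buses” at which the real power injection and voltage magnitude are controlled inputs [21]. We refer the readers to [21] for more details of solving AC power flow equations.
A useful approximation of the AC power flow model is the DC power flow model: under a topology $\mathbf{s}$, the nodal real power injections and voltage phase angles approximately satisfy the following equation [21],
(2) $$\mathbf{P} = \mathbf{M} \mathbf{S} \boldsymbol{\Gamma} \mathbf{M}^T \boldsymbol{\theta},$$
where $\mathbf{S} = \mathrm{diag}(s_1, \ldots, s_L)$, $\boldsymbol{\Gamma} = \mathrm{diag}(1/x_1, \ldots, 1/x_L)$, and $x_l$ is the reactance of line $l$. We note that, in the DC power flow model, reactive power is not considered, and all voltage magnitudes are approximated by a constant.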
II-B Observation Models
To monitor the power system, we consider real-time measurements taken by sensors measuring nodal voltage magnitudes and phase angles, current magnitudes and phase angles on lines, real and reactive power flows on lines, nodal real and reactive power injections, etc. In general, the observation model can be written as follows,
(3) $$\mathbf{y} = \mathbf{h}(\mathbf{s}, \mathbf{P}^{I}, \mathbf{Q}^{I}, \mathbf{V}^{I}) + \mathbf{v},$$
where a) $\mathbf{y}$ collects all the noisy measurements, b) $\mathbf{h}(\mathbf{s}, \mathbf{P}^{I}, \mathbf{Q}^{I}, \mathbf{V}^{I})$ denotes the noiseless values of the measured quantities, and the form of $\mathbf{h}$ depends on the specific locations and types of the sensors, and c) $\mathbf{v}$ denotes the measurement noises.
Remark 1
A noiseless measurement function $\mathbf{h}$ may not have a closed form. For example, given $\mathbf{s}$ and $\{\mathbf{P}^{I}, \mathbf{Q}^{I}, \mathbf{V}^{I}\}$, while the nodal voltage magnitude and phase angle at a particular bus can be solved from (1), such a solution can only be obtained using numerical methods, and a closed form expression is not available. For discussions on the existence and uniqueness of the solution to the power flow equations (1), we refer the readers to [22].
The observation model can be significantly simplified under the approximate DC power flow model (2). For example, measurements of $\boldsymbol{\theta}$ provided by phasor measurement units (PMUs) located at a subset of the buses $\mathcal{B}$ can be modeled as
(4) $$\mathbf{y} = \boldsymbol{\theta}_{\mathcal{B}} + \mathbf{v},$$
where $\boldsymbol{\theta}_{\mathcal{B}}$ is formed by entries of $\boldsymbol{\theta}$ from buses in $\mathcal{B}$. From the DC power flow model (2), we have
(5) $$\boldsymbol{\theta} = \left( \mathbf{M} \mathbf{S} \boldsymbol{\Gamma} \mathbf{M}^T \right)^{+} \mathbf{P},$$
where $(\cdot)^{+}$ denotes the pseudoinverse. (Footnote 1: For a connected network, the solution of $\boldsymbol{\theta}$ given $\mathbf{P}$ is made unique by setting the phase angle at a reference bus to be zero.) We note that, while the noiseless voltage phase angle measurements enjoy a closed form (5) and are linear in the power injections $\mathbf{P}$, they are not linear in the line statuses $\mathbf{s}$.
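The DC relations (2) and (5) can be tried out numerically. The sketch below is illustrative only: the 3-bus triangle network, its reactances, the injections, and the helper names `dc_matrix` and `angles_from_injections` are assumptions made here, not quantities from the paper. It shows how removing one line changes the recovered phase angles even though the injections are unchanged, which is the nonlinearity in the line statuses noted above.

```python
import numpy as np

# Hypothetical 3-bus, 3-line triangle network (illustrative values only).
# Each column of M is a line; +1 / -1 mark the two end buses of that line.
M = np.array([[ 1,  1,  0],
              [-1,  0,  1],
              [ 0, -1, -1]], dtype=float)
x = np.array([0.1, 0.2, 0.25])           # assumed line reactances
Gamma = np.diag(1.0 / x)

def dc_matrix(s):
    """Return M S Gamma M^T for a line-status vector s in {0,1}^L, as in (2)."""
    return M @ np.diag(s) @ Gamma @ M.T

def angles_from_injections(s, P, ref=0):
    """Recover theta from P with the pseudoinverse, as in (5), then shift
    the result so that the reference bus has zero phase angle."""
    theta = np.linalg.pinv(dc_matrix(s)) @ P
    return theta - theta[ref]

P = np.array([1.0, -0.4, -0.6])           # balanced injections (they sum to zero)
theta_all = angles_from_injections(np.array([1.0, 1.0, 1.0]), P)  # all lines in service
theta_out = angles_from_injections(np.array([1.0, 0.0, 1.0]), P)  # second line in outage
```

The same injections produce different angle profiles under the two topologies, so angle measurements carry information about which lines are connected.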
II-C Topology Identification as Bayesian Inference
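We are interested in identifying the network topology $\mathbf{s}$ in real time based on instant measurements $\mathbf{y}$ collected in the power system. We formulate this topology identification problem as a Bayesian inference problem. First, we model $\{\mathbf{s}, \mathbf{P}^{I}, \mathbf{Q}^{I}, \mathbf{V}^{I}\}$ and $\mathbf{y}$ with a joint probability distribution,
(6) $$p(\mathbf{s}, \mathbf{P}^{I}, \mathbf{Q}^{I}, \mathbf{V}^{I}, \mathbf{y}) = p(\mathbf{s}, \mathbf{P}^{I}, \mathbf{Q}^{I}, \mathbf{V}^{I}) \, p(\mathbf{y} \mid \mathbf{s}, \mathbf{P}^{I}, \mathbf{Q}^{I}, \mathbf{V}^{I}).$$
It is important to note that, given $\{\mathbf{s}, \mathbf{P}^{I}, \mathbf{Q}^{I}, \mathbf{V}^{I}\}$, the noiseless measurements $\mathbf{h}(\mathbf{s}, \mathbf{P}^{I}, \mathbf{Q}^{I}, \mathbf{V}^{I})$ (cf. (3)) can be exactly computed by solving the AC power flow equations (1). Adding noises $\mathbf{v}$ to $\mathbf{h}(\mathbf{s}, \mathbf{P}^{I}, \mathbf{Q}^{I}, \mathbf{V}^{I})$ then leads to $\mathbf{y}$.
Remark 2 (Generative Model)
(6) represents a generative model [23] with which a) the topology $\mathbf{s}$ and the controlled inputs of power injections and voltage magnitudes $\{\mathbf{P}^{I}, \mathbf{Q}^{I}, \mathbf{V}^{I}\}$ are generated according to a prior distribution $p(\mathbf{s}, \mathbf{P}^{I}, \mathbf{Q}^{I}, \mathbf{V}^{I})$, and b) all the quantities measured in the system can then be computed by solving the power flow equations (1), based on which the actual noisy measurements $\mathbf{y}$ follow the conditional probability distribution $p(\mathbf{y} \mid \mathbf{s}, \mathbf{P}^{I}, \mathbf{Q}^{I}, \mathbf{V}^{I})$.
Our objective is to infer the topology $\mathbf{s}$ of the power grid given the observed measurements $\mathbf{y}$. Thus, under a Bayesian inference framework, we are interested in computing the posterior conditional probabilities: $\forall \mathbf{s}$,
(7) $$p(\mathbf{s} \mid \mathbf{y}) = \frac{p(\mathbf{s}, \mathbf{y})}{p(\mathbf{y})} = \frac{p(\mathbf{s}, \mathbf{y})}{\sum_{\mathbf{s}'} p(\mathbf{s}', \mathbf{y})},$$ where $p(\mathbf{s}, \mathbf{y})$ is obtained from (6) by marginalizing over the controlled inputs.
Given the observations $\mathbf{y}$, a maximum a posteriori probability (MAP) detector would pick $\arg\max_{\mathbf{s}} p(\mathbf{s} \mid \mathbf{y})$ as the topology identification decision, which minimizes the identification error probability [24]. However, as the number of hypotheses of $\mathbf{s}$ grows exponentially with the number of unknown line statuses, performing such hypothesis testing based on an exhaustive search becomes computationally intractable. In general, as there are up to $2^L$ possibilities for $\mathbf{s}$, computing, or even listing, the probabilities $p(\mathbf{s} \mid \mathbf{y})$ has an exponential complexity.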
Posterior Marginal Probabilities
As an initial step towards addressing the fundamental challenge of computational complexity, instead of computing $p(\mathbf{s} \mid \mathbf{y})$, we focus on computing the posterior marginal conditional probabilities $p(s_l \mid \mathbf{y}), l = 1, \ldots, L$. We note that the posterior marginals are characterized by just $L$ numbers, $p(s_l = 1 \mid \mathbf{y}), l = 1, \ldots, L$, as opposed to the $2^L - 1$ numbers required for characterizing $p(\mathbf{s} \mid \mathbf{y})$. Accordingly, the hypothesis testing problem on $\mathbf{s}$ is decoupled into $L$ separate binary hypothesis testing problems: for each line $l$, the MAP detector identifies $\arg\max_{s_l \in \{0, 1\}} p(s_l \mid \mathbf{y})$. As a result, instead of minimizing the identification error probability of the vector $\mathbf{s}$ (i.e., “symbol” error probability), the binary MAP detectors minimize the identification error probability of each line status $s_l$ (i.e., “bit” error probability).
Although listing the posterior marginals is tractable, computing them still remains intractable. In particular, even with $p(\mathbf{s} \mid \mathbf{y})$ given, summing out all $s_k, k \neq l$, to obtain $p(s_l \mid \mathbf{y})$ still requires an exponential computational complexity [25]. As a result, even a binary MAP detection decision of $s_l$ cannot be made in a computationally tractable way. This challenge will be addressed by a novel method we will develop in the next section.
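To make the cost of exact marginalization concrete, the following sketch computes per-line posterior marginals by brute force. It is a toy illustration, not the paper's procedure: the function names and the made-up posterior `toy_joint` (with independent lines, chosen so the result is easy to check) are assumptions. The loop visits all $2^L$ topologies, which is exactly what becomes infeasible for realistic networks.

```python
from itertools import product

def posterior_marginals(joint, L):
    """Compute p(s_l = 1 | y) for every line by summing a joint posterior
    over all 2**L topologies. `joint` maps a status tuple to p(s | y)."""
    marginals = [0.0] * L
    for s in product((0, 1), repeat=L):       # 2**L terms: the bottleneck
        p = joint(s)
        for l, s_l in enumerate(s):
            if s_l == 1:
                marginals[l] += p
    return marginals

# Made-up posterior over L = 3 lines with independent statuses.
p_on = [0.9, 0.2, 0.6]

def toy_joint(s):
    prob = 1.0
    for s_l, p in zip(s, p_on):
        prob *= p if s_l else 1.0 - p
    return prob

marg = posterior_marginals(toy_joint, 3)
decisions = [int(m > 0.5) for m in marg]      # per-line binary MAP decisions
```

With only 3 lines the sum has 8 terms; each additional line doubles the work, which motivates replacing the exact sum with a learned approximation.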
III A Learning-to-Infer Method
III-A A Variational Inference Framework
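In this section, we develop a variational method for approximate inference of the posterior marginal conditional probabilities $p(s_l \mid \mathbf{y}), l = 1, \ldots, L$. The general idea is to find a variational conditional distribution $q(\mathbf{s} \mid \mathbf{y})$ which
a) approximates the original $p(\mathbf{s} \mid \mathbf{y})$ very closely, and
b) offers fast and accurate topology identification results.
In particular, we consider that $q(\mathbf{s} \mid \mathbf{y})$ is modeled by some parametric form (e.g., neural networks), and is hence chosen from some family of parameterized conditional probability distributions $\{q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y})\}$, where $\boldsymbol{\beta}$ is a vector of model parameters. It is worth noting that $q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y})$ is a function of both $\mathbf{s}$ and $\mathbf{y}$, and the parameters $\boldsymbol{\beta}$ associate both $\mathbf{s}$ and $\mathbf{y}$ with the probability value $q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y})$.
To achieve the two goals above, we aim to choose a family of probability distributions $\{q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y})\}$ to satisfy the following:
a) The parametric form of $q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y})$ has sufficient expressive power to represent very complicated functions, so that our approximation to the true $p(\mathbf{s} \mid \mathbf{y})$ can be made sufficiently precise.
b) It is easy to compute the marginal $q_{\boldsymbol{\beta}}(s_l \mid \mathbf{y})$, so that we can use it to infer $s_l$ based on the observed $\mathbf{y}$ with low computational complexity.
From a family of parameterized distributions $\{q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y})\}$, we would like to choose a $q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y})$ that approximates $p(\mathbf{s} \mid \mathbf{y})$ as closely as possible. For this, we employ the Kullback-Leibler (KL) divergence as a metric of closeness between two probability distributions,
(8) $$D(p \,\|\, q_{\boldsymbol{\beta}}) = \sum_{\mathbf{s}} p(\mathbf{s} \mid \mathbf{y}) \log \frac{p(\mathbf{s} \mid \mathbf{y})}{q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y})}.$$
Note that, for any particular realization of observations $\mathbf{y}$, a KL divergence can be computed. Thus, $D(p \,\|\, q_{\boldsymbol{\beta}})$ can be viewed as a function of $\mathbf{y}$. Since we would like the parameterized conditional $q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y})$ to closely approximate $p(\mathbf{s} \mid \mathbf{y})$ for all $\mathbf{y}$, we would like to minimize the expected KL divergence as follows:
(9) $$\min_{\boldsymbol{\beta}} \; \mathbb{E}_{\mathbf{y}} \left[ D(p \,\|\, q_{\boldsymbol{\beta}}) \right] \;\Longleftrightarrow\; \max_{\boldsymbol{\beta}} \; \mathbb{E}_{\mathbf{s}, \mathbf{y}} \left[ \log q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y}) \right],$$
where the expectations are taken with respect to the true distributions $p(\mathbf{y})$ and $p(\mathbf{s}, \mathbf{y})$, respectively.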
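Minimizing the expected KL divergence amounts to maximizing an expected log-likelihood under the variational model, which can be checked in one step. The derivation below is a short sketch; the notation $p(\mathbf{s} \mid \mathbf{y})$ for the true posterior, $q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y})$ for the variational model, and $H(\cdot)$ for entropy is assumed here for illustration.

```latex
\begin{aligned}
\mathbb{E}_{\mathbf{y}}\!\left[ D(p \,\|\, q_{\boldsymbol{\beta}}) \right]
&= \mathbb{E}_{\mathbf{y}}\!\left[ \sum_{\mathbf{s}} p(\mathbf{s} \mid \mathbf{y}) \log p(\mathbf{s} \mid \mathbf{y}) \right]
 - \mathbb{E}_{\mathbf{y}}\!\left[ \sum_{\mathbf{s}} p(\mathbf{s} \mid \mathbf{y}) \log q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y}) \right] \\
&= \underbrace{-\,\mathbb{E}_{\mathbf{y}}\!\left[ H\big(p(\cdot \mid \mathbf{y})\big) \right]}_{\text{does not depend on } \boldsymbol{\beta}}
 \;-\; \mathbb{E}_{\mathbf{s}, \mathbf{y}}\!\left[ \log q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y}) \right].
\end{aligned}
```

Since the entropy term is fixed by the true model, only the second term matters for choosing $\boldsymbol{\beta}$, and that term is an expectation over the joint distribution of $(\mathbf{s}, \mathbf{y})$, which can be sampled.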
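Table I: The Learning-to-Infer method.
Offline computation:
1. Generate labeled data set $\{\mathbf{s}^t, \mathbf{y}^t\}$ using Monte Carlo simulations with the full-blown power flow and sensor models.
2. Select a parameterized variational model $q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y})$.
3. Train the model parameters $\boldsymbol{\beta}$ using the generated data set.
Online inference (in real time):
1. Collect instant measurements $\mathbf{y}$ from the system.
2. Compute the approximate posterior marginals $q_{\boldsymbol{\beta}}(s_l \mid \mathbf{y})$ and infer the line statuses $\hat{s}_l$.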
III-B From Generative Model to Discriminative Learning
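Evaluating $\mathbb{E}_{\mathbf{s}, \mathbf{y}} [\log q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y})]$ is, however, very difficult, primarily because it again requires the summation of an exponentially large number of terms. To address this, the key step forward is that we can approximate the expectation by the empirical mean of $\log q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y})$ over a large number of Monte Carlo samples, generated according to the true joint probability $p(\mathbf{s}, \mathbf{y})$ (cf. (6)). We denote the relevant Monte Carlo samples by $\{\mathbf{s}^t, \mathbf{y}^t; t = 1, \ldots, T\}$. Accordingly, (9) is approximated by the following,
(10) $$\max_{\boldsymbol{\beta}} \; \frac{1}{T} \sum_{t=1}^{T} \log q_{\boldsymbol{\beta}}(\mathbf{s}^t \mid \mathbf{y}^t).$$
With a data set $\{\mathbf{s}^t, \mathbf{y}^t\}$ generated using Monte Carlo simulations, (10) can then be solved as a deterministic optimization problem. The optimal solution of the model parameters approaches that for the original problem (9) as $T \to \infty$.
In fact, the problem (10) can be viewed as an empirical risk minimization problem in machine learning [26], as it trains a discriminative model $q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y})$ with a data set $\{\mathbf{s}^t, \mathbf{y}^t\}$ generated from a generative model $p(\mathbf{s}, \mathbf{y})$ (cf. Remark 2). As a result of this offline learning / training process (10), an approximate posterior function $q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y})$ is obtained.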
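The step from a generative model to a trained discriminative model can be sketched on a deliberately tiny problem. Everything in the block below is an assumption made for illustration: a single line status, a scalar measurement, a logistic model standing in for the variational family, and plain gradient ascent standing in for the training algorithm. It is not the paper's power flow simulation, only the same pattern of simulating labeled pairs and maximizing the empirical mean log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(T):
    """Toy generative model (assumed): one line status s ~ Bernoulli(0.5)
    and a scalar noisy measurement y = (2s - 1) + noise."""
    s = rng.integers(0, 2, size=T)
    y = (2 * s - 1) + 0.5 * rng.standard_normal(T)
    return s, y

def q(beta, y):
    """Variational model q_beta(s = 1 | y): a logistic function of y."""
    w, b = beta
    return 1.0 / (1.0 + np.exp(-(w * y + b)))

def fit(s, y, steps=300, lr=0.5):
    """Gradient ascent on the empirical mean of log q_beta(s^t | y^t)."""
    beta = np.zeros(2)
    for _ in range(steps):
        err = s - q(beta, y)                  # derivative of the log-likelihood w.r.t. the logit
        beta += lr * np.array([np.mean(err * y), np.mean(err)])
    return beta

s_train, y_train = simulate(5000)             # offline: simulate labeled data and train
beta = fit(s_train, y_train)
s_test, y_test = simulate(5000)               # online stand-in: fresh measurements
accuracy = np.mean((q(beta, y_test) > 0.5) == s_test)
```

Because the labels come for free from the simulator, the training set size `T` is limited only by computation time, which is the property the method relies on.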
III-C Offline Learning for Online Inference
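It is important to note that,
a) the training process to obtain the function $q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y})$ is conducted completely offline;
b) the use of the trained function $q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y})$, however, is in real time, i.e., online.
In particular, in real time, given any newly observed measurements $\mathbf{y}$ of the system, the approximate posterior marginals $q_{\boldsymbol{\beta}}(s_l \mid \mathbf{y})$ will be computed based on $q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y})$. Based on such instantly computed $q_{\boldsymbol{\beta}}(s_l \mid \mathbf{y})$, a detection decision of whether line $l$ ($l = 1, \ldots, L$) is connected or not in the current topology will be made. For example, a MAP detector would make the following decision,
(11) $$\hat{s}_l = \arg\max_{s_l \in \{0, 1\}} q_{\boldsymbol{\beta}}(s_l \mid \mathbf{y}).$$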
Accordingly, we name our proposed methodology “Learning-to-Infer”: To perform real-time inference of network topology, we exploit offline learning to train a detector based on labeled data simulated from the full-blown physical model of the power system. The methodology is summarized in Table I. A system diagram is plotted in Figure 1.
Remark 3 (Training Binary Classifiers)
Any detector that identifies the status of a line $l$ (e.g., a binary MAP detector) can also be viewed as a binary classifier: For each possible realization of $\mathbf{y}$, this classifier outputs an inferred status of line $l$. From this perspective, solving (10) is exactly a supervised learning process based on a labeled data set, $\{s_l^t, \mathbf{y}^t; t = 1, \ldots, T\}$, where $s_l^t$ are the output labels that correspond to the input data $\mathbf{y}^t$. As a result, the rich literature on supervised learning for training binary classifiers directly applies to our problem under this Learning-to-Infer framework.
III-D Advantages of the Proposed Method
One great advantage of this Learning-to-Infer method is that we can generate labeled data very efficiently. Specifically, we can efficiently sample from the generative model of $\{\mathbf{s}, \mathbf{P}^{I}, \mathbf{Q}^{I}, \mathbf{V}^{I}, \mathbf{y}\}$ (cf. (6)) as long as we have some prior $p(\mathbf{s}, \mathbf{P}^{I}, \mathbf{Q}^{I}, \mathbf{V}^{I})$ that is easy to sample from. While historical data and expert knowledge would surely help in forming such priors, using simple uninformative priors can already suffice, as will be shown later in the numerical examples. As a result, we can obtain an arbitrarily large set of data with very little cost to train the discriminative model. This is quite different from the typical situations encountered in machine learning problems, where obtaining a large amount of labeled data is usually expensive as it requires extensive human annotation effort.
Furthermore, once the approximate posterior distribution $q_{\boldsymbol{\beta}}(\mathbf{s} \mid \mathbf{y})$ is learned, it can be deployed to infer the power grid topology in real time, as the computational complexity of evaluating $q_{\boldsymbol{\beta}}(s_l \mid \mathbf{y})$ is very low by design. This is especially important in monitoring large-scale power grids in real time, because, although training could take a reasonable amount of time, the inference speed is very fast. Therefore, the learned predictor can be used in real time with low-cost hardware.
Limitations of Historical Data and Power of Simulated Data
In overcoming the computational complexity challenges of real-time topology identification, it is particularly worth noting the fundamental limitation of using real historical data. Even with the explosion of data available from pervasive sensors in power systems, the data are often collected under a very limited set of system scenarios. For example, most historical data are collected under normal system topologies. Even with data collected under slowly updated systems or faulty systems, the underlying topologies in these real-world cases only represent a very small fraction of the entire, exponentially large model space. It would not be prudent to postulate that a fault event in the future would always resemble some of the earlier faults that happened in the past and for which data have been collected. Consequently, historical data are fundamentally insufficient for real-time grid topology identification, especially under rare faults and cascading failures.
Simulated data, as evidenced in the proposed Learning-to-Infer framework, offer great potential beyond what historical data can offer. A set of scenarios that is orders of magnitude richer can be generated, and a learning procedure based on these simulated data can provide very powerful classifiers for identifying arbitrary topologies that may appear in the future, but have rarely, if at all, appeared in the past.
IV Neural Network Architectures for Learning Classifiers
To perform binary MAP inference of each line status, the decision boundary of the MAP detector is highly nonlinear (cf. Remark 3). We investigate classifiers based on neural networks to capture such complex nonlinear decision boundaries. As we have $L$ lines, a straightforward design architecture is to train a separate classifier for each single line $l$: the input layer of the neural network consists of $\mathbf{y}$, and the output layer consists of just one node predicting either $s_l = 1$ or $s_l = 0$. Thus, a total of $L$ classifiers need to be trained. For training and testing, we generate labeled data $\{\mathbf{s}^t, \mathbf{y}^t\}$ randomly that satisfy the power flow equations and the observation models. Each $\mathbf{s}^t$ consists of $L$ labels used by the $L$ classifiers respectively. A diagram illustrating this neural network architecture is depicted in Figure 2. The function of the neural network for classifying $s_l$ can be understood as follows: The hidden layers of neurons compute a number of nonlinear features of the input $\mathbf{y}$, and the output layer applies a binary linear classifier to these features to make a decision on $s_l$.
Next, we introduce a second architecture that allows classifiers for different lines to share features, which can lead to more efficient learning of the classifiers. Specifically, instead of training $L$ separate neural networks each with one node in its output layer, we train one neural network whose output layer consists of $L$ nodes each predicting a different line’s status. An illustration of this architecture is depicted in Figure 3. As a result, the features computed by the hidden layers can all be used in classifying any line’s status. The idea of using shared features is that certain common features may provide good predictive power in inferring many different lines’ statuses in a power network.
Furthermore, using a single neural network with feature sharing can drastically reduce the computational complexity of both the training and the testing processes. Indeed, while using separate neural networks requires training of $L$ classifiers, using a neural network that allows feature sharing involves training of only a single classifier. Note that, with similar sizes of neural networks, adding nodes in the output layer incurs only a very small increase in the training time. As a result, there is an $O(L)$ reduction in computation time for this architecture with shared features, which can be a significant saving for large power networks.
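The two architectures can be compared side by side in a few lines. The sketch below is an illustration under stated assumptions: the helper names, the random initialization, and the input size of 60 (taken here as 30 phase angles plus 30 injections for the 30 bus system) are not from the paper, while the 38 line statuses and the hidden widths of 75 and 300 are the values quoted in the experiments section. Only the forward pass and the parameter counts are shown, not training.

```python
import numpy as np

def init_params(d, h, n_out, rng):
    """One hidden layer of width h feeding n_out output nodes."""
    return {"W1": 0.1 * rng.standard_normal((d, h)), "b1": np.zeros(h),
            "W2": 0.1 * rng.standard_normal((h, n_out)), "b2": np.zeros(n_out)}

def forward(params, y):
    """Return one score per output node; a positive score reads as 'connected'."""
    features = np.maximum(0.0, y @ params["W1"] + params["b1"])   # ReLU features
    return features @ params["W2"] + params["b2"]

def n_params(params):
    return sum(p.size for p in params.values())

rng = np.random.default_rng(0)
d, L = 60, 38    # assumed input size; 38 line statuses as in the 30 bus experiments
separate = [init_params(d, 75, 1, rng) for _ in range(L)]    # one network per line
shared = init_params(d, 300, L, rng)                         # one network, L output nodes

y = rng.standard_normal((5, d))                  # a batch of 5 dummy measurement vectors
scores_shared = forward(shared, y)               # shape (5, L), features shared by all lines
scores_separate = np.hstack([forward(p, y) for p in separate])   # shape (5, L)
total_separate = sum(n_params(p) for p in separate)
total_shared = n_params(shared)
```

Under these assumptions the shared network has several times fewer parameters than the 38 separate networks combined, and one forward pass yields all line scores at once.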
Evidently, compared with separate neural networks, a shared neural network of the same size would have a performance degradation in classification due to the reduced expressive power of the model. However, such a performance degradation can be erased by increasing the size of the shared neural network. In fact, increasing the size of the shared neural network to the sum of the sizes of the separate neural networks leads to a classifier model that is strictly more general, and hence offers a performance enhancement as opposed to a degradation. As will be shown later, it is sufficient to increase the size of the shared neural network architecture by a much smaller factor to achieve the same performance as the separate neural network architecture does.
With the proposed Learning-to-Infer method, since labeled data can be generated in an arbitrarily large amount using Monte Carlo simulations, whenever overfitting is observed, it can in principle always be overcome by generating more labeled data for training. Thus, as long as the computation time allows, we can use neural network models of whatever complexity for approximating the binary MAP detectors, without worrying about overfitting.
V Numerical Experiments
We evaluate the proposed Learning-to-Infer method for grid topology identification with three benchmark systems of increasing sizes, the IEEE 30, 118, and 300 bus systems, as the baseline topologies. As opposed to considering only a small number of simultaneous line outages as in existing works, we allow any number of line outages, and investigate whether the learned discriminative classifiers can successfully recover the topologies.
V-A Data Set Generation
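In our experiments, we employ the DC power flow model (2) to generate the data sets. Accordingly, the set of controlled inputs reduces to just $\mathbf{P}$, and the generative model (6) reduces to $p(\mathbf{s}, \mathbf{P}, \mathbf{y}) = p(\mathbf{s}, \mathbf{P}) \, p(\mathbf{y} \mid \mathbf{s}, \mathbf{P})$. To generate a data set $\{\mathbf{s}^t, \mathbf{y}^t\}$, we assume the prior distribution factors as $p(\mathbf{s}, \mathbf{P}) = p(\mathbf{s}) \, p(\mathbf{P})$. As such, we generate the network topologies $\mathbf{s}$ and the power injections $\mathbf{P}$ independently: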

We generate the line statuses using independent and identically distributed (IID) Bernoulli random variables, with and for the IEEE 30, 118, and 300 bus systems, respectively. We do not consider disconnected networks in this study, and exclude the line status samples if they lead to disconnected networks. As such, considering that some lines must always be connected to ensure network connectivity, after some network reduction, the equivalent networks for the IEEE 30, 118, and 300 bus systems have , and lines that can possibly be in outage, respectively.

We would like our predictor to be able to identify the topology for arbitrary values of power injections as opposed to fixed ones. Accordingly, we generate using the following procedure: For each data sample, we first generate bus voltage phase angles as IID uniformly distributed random variables in , and then compute according to (2) under the baseline topologies.
With each pair of generated $\mathbf{s}$ and $\mathbf{P}$, we consider two types of measurements that constitute $\mathbf{y}$: nodal voltage phase angle measurements and nodal power injection measurements. For these, a) we generate IID Gaussian voltage phase angle measurement noises with a standard deviation of degree, the state-of-the-art PMU accuracy [27], and b) we assume power injections are measured accurately. Here, we consider that measurements of voltage phase angles and power injections are collected at all the buses. The effect of the number and locations of sensors will be discussed toward the end of this section.
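The data generation procedure described above can be sketched as follows. This is a minimal illustration under assumptions, not the experimental code: the 4-bus test network, unit reactances, the helper names, and the default values of `p_on`, `angle_range` and `noise_std` are placeholders chosen here and are not the paper's settings. Connectivity is checked through the rank of the graph Laplacian, which is one possible way to reject disconnected topologies.

```python
import numpy as np

def is_connected(M, s):
    """A graph with N buses is connected iff its Laplacian has rank N - 1."""
    lap = M @ np.diag(s) @ M.T
    return np.linalg.matrix_rank(lap) == M.shape[0] - 1

def generate_sample(M, x, rng, p_on=0.7, angle_range=0.2, noise_std=1e-3):
    """Draw one (s, y) pair under the DC model. p_on, angle_range (rad) and
    noise_std (rad) are placeholder values, not the paper's settings."""
    N, L = M.shape
    Gamma = np.diag(1.0 / x)
    while True:                                    # rejection step: keep connected topologies only
        s = (rng.random(L) < p_on).astype(float)
        if is_connected(M, s):
            break
    theta0 = rng.uniform(-angle_range, angle_range, N)
    P = M @ Gamma @ M.T @ theta0                   # injections under the baseline topology, via (2)
    theta = np.linalg.pinv(M @ np.diag(s) @ Gamma @ M.T) @ P     # angles under topology s, via (5)
    theta -= theta[0]                              # bus 0 taken as the zero-angle reference
    y = np.concatenate([theta + noise_std * rng.standard_normal(N), P])
    return s, y

# Hypothetical 4-bus ring with one chord (5 lines), unit reactances.
M = np.array([[ 1,  0,  0, -1,  1],
              [-1,  1,  0,  0,  0],
              [ 0, -1,  1,  0, -1],
              [ 0,  0, -1,  1,  0]], dtype=float)
rng = np.random.default_rng(0)
data = [generate_sample(M, np.ones(5), rng) for _ in range(100)]
```

Each sample pairs a label vector `s` with a measurement vector `y` made of noisy angles followed by exact injections, which is the input and output format a classifier would be trained on.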
Table II
The (reduced) IEEE 30 bus system with lines
Number of all topologies:
Number of topologies with disconnected lines:
The generated data set:
The (reduced) IEEE 118 bus system with lines
Number of all topologies:
Number of topologies with disconnected lines:
The generated data set:
The (reduced) IEEE 300 bus system with lines
Number of all topologies:
Number of topologies with disconnected lines:
The generated data set:
In this study, we generate , , and data samples for the IEEE 30, 118, and 300 bus systems, respectively. The data for the IEEE 30 bus system are divided into , and samples for training, validation, and testing, respectively; the data for the IEEE 118 bus system are divided into , and samples; and the data for the IEEE 300 bus system are divided into , and samples. We note that over of the generated 30 bus topologies are distinct from each other, as are those of the generated 118 bus topologies and those of the 300 bus topologies. As a result, these generated data sets can very well evaluate the generalizability of the trained classifiers, as (almost) all data samples in the test set have topologies unseen in the training set.
Moreover, in the generated data sets, the average numbers of disconnected lines relative to the baseline topology are and for the IEEE 30, 118 and 300 bus systems, respectively. These numbers of simultaneous line outages are significantly higher than those typically assumed in sparse line outage studies. Furthermore, we would like to compare the size of the generated data set to the total number of possible topology hypotheses, as highlighted in Table II. Clearly, a) it is computationally prohibitive to perform line status inference based on exhaustive search, and b) the generated and data sets are only a tiny fraction of the entire space of all topologies. Yet, we will show that the classifiers trained with the generated data sets exhibit excellent inference performance and generalizability.
V-B Neural Network Structure and Training
We employ two-layer (i.e., one hidden layer) fully connected neural networks for both the separate training architecture and the feature sharing architecture. Rectified Linear Units (ReLUs) are employed as the activation functions in the hidden layer. In the output layer we employ hinge loss as the loss function. In training the classifiers, we use stochastic gradient descent (SGD) with momentum update and Nesterov’s acceleration [28]. While this optimization algorithm works sufficiently well for our experiments, we note that other algorithms may further accelerate the training procedure [29].
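The loss and optimizer named above can be sketched compactly. To keep the example short, the block below applies hinge loss and Nesterov-momentum SGD to a linear scorer rather than to a neural network, which is a simplification made here; the synthetic data, the hyperparameters (`lr`, `mu`, `batch`, `epochs`) and the function names are likewise assumptions, not the paper's settings. Labels are encoded as $-1$ and $+1$, the usual convention for hinge loss.

```python
import numpy as np

def hinge_loss_and_grad(w, X, t):
    """Mean hinge loss max(0, 1 - t * score) for a linear scorer X @ w,
    with labels t in {-1, +1}, and a subgradient with respect to w."""
    margins = 1.0 - t * (X @ w)
    active = margins > 0
    loss = np.mean(np.where(active, margins, 0.0))
    grad = -(X * (t * active)[:, None]).mean(axis=0)
    return loss, grad

def nesterov_sgd(X, t, epochs=50, batch=32, lr=0.1, mu=0.9, seed=0):
    """Mini-batch SGD with Nesterov momentum on the hinge loss."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)
    for _ in range(epochs):
        order = rng.permutation(len(t))
        for start in range(0, len(t), batch):
            idx = order[start:start + batch]
            # Nesterov: evaluate the gradient at the look-ahead point w + mu * v.
            _, g = hinge_loss_and_grad(w + mu * v, X[idx], t[idx])
            v = mu * v - lr * g
            w = w + v
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((400, 5))
t = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]))    # synthetic separable labels
t[t == 0] = 1.0                                           # guard against an exact zero score
loss_before, _ = hinge_loss_and_grad(np.zeros(5), X, t)
w = nesterov_sgd(X, t)
loss_after, _ = hinge_loss_and_grad(w, X, t)
train_accuracy = np.mean(np.sign(X @ w) == t)
```

For a neural network the same update applies to all weights, with the gradient obtained by backpropagation instead of the closed-form subgradient used here.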
V-C Evaluation Results
V-C1 The Separate Training Architecture vs. the Feature Sharing Architecture
We compare the performance of the separate training architecture (cf. Figure 2) and the feature sharing architecture (cf. Figure 3) on the IEEE 30 bus system. We train the neural network classifiers and obtain the accuracy of identifying each line status. In the remainder of this section, the average accuracies across all binary inferences of line statuses are presented.
For separately training a neural network for each line status inference (out of 38 in total), we employ 75 neurons in the hidden layer, whereas for training a single neural network with feature sharing we employ 300 neurons. We note that . The sizes of the models are chosen such that both the separate training architecture and the feature sharing architecture achieve the same average accuracy of 0.989, and their training times can be fairly compared. For all neural networks, we run SGD for 2000 epochs in training. On a laptop with an Intel Core i7 3.1GHz CPU and 8 GB of RAM, with the training samples, it takes about 14.7 hours to separately train 38 neural networks of size 75, but only 1.4 hours to train the one neural network of size 300 with feature sharing. We observe that the feature sharing architecture is about 11 times faster to train than the separate training architecture while achieving the same performance. Such a speed advantage of the feature sharing architecture will become even more pronounced in larger power networks.
As discussed in Section III-D, while the offline data generation and training procedures may take a reasonable amount of time, the testing procedure, i.e., real-time topology identification, is performed extremely fast: In all of our numerical experiments, the testing time per data sample is under a millisecond. The extremely fast testing speed demonstrates that the proposed approach applies very well to real-time tasks, such as failure identification during cascading failures.
V-C2 Performance of the Learning-to-Infer Method
From this point on, all simulations are performed with the feature sharing architecture (cf. Figure 3). We now present the results for the IEEE 30, 118 and 300 bus systems.
In particular, we continue to employ neural networks with one hidden layer, and use and neurons in the hidden layer for the IEEE 30, 118 and 300 bus systems, respectively. For all three systems, we plot in Figure 4 the training and validation losses achieved at every epoch, and in Figure 5 the testing accuracies achieved at every epoch. The training and validation losses stay very close to each other for all three systems, and thus no overfitting is observed. Moreover, very high testing accuracies of 0.989, 0.990 and 0.997 are achieved for the IEEE 30, 118 and 300 bus systems, respectively.
The testing accuracies can be equivalently understood by the average numbers of misidentified line statuses, plotted in Figure 6. We observe that, at the beginning of the training procedures, the average numbers of misidentified line statuses are , and for the IEEE 30, 118 and 300 bus systems, which are exactly the average numbers of disconnected lines in the respective generated data sets (cf. Section VA). Indeed, this coincides with the result from a naive identification decision rule of always claiming all the lines as connected (i.e., a trivial majority rule). As the training procedures progress, the average numbers of misidentified line statuses are drastically reduced to eventually , and . In other words, for the IEEE 300 bus system for example, facing on average simultaneous line outages, only line status would be misidentified on average by the learned classifier. We again emphasize that such a performance is achieved with identification decisions made in real time, under a millisecond.
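The equivalence between average accuracy and the average number of misidentified line statuses can be checked with a short sketch. The data below are synthetic (the 10% outage rate and the 2% error rate are arbitrary assumptions, not values from the experiments), and the function names are illustrative.

```python
import numpy as np

def average_accuracy(pred, true):
    """Fraction of correct binary line status decisions over all samples and lines."""
    return float(np.mean(pred == true))

def avg_misidentified(pred, true):
    """Average number of misidentified line statuses per sample."""
    return float(np.mean(np.sum(pred != true, axis=1)))

rng = np.random.default_rng(0)
n_samples, n_lines = 1000, 38

# Synthetic ground truth: 1 = connected, 0 = disconnected (assumed 10% outage rate).
true = (rng.random((n_samples, n_lines)) > 0.1).astype(int)

# Naive rule: always declare every line connected.
all_connected = np.ones_like(true)
baseline_errors = avg_misidentified(all_connected, true)
avg_outages = float(np.mean(np.sum(true == 0, axis=1)))

# A synthetic imperfect classifier: flip each true status with probability 0.02.
flips = rng.random(true.shape) < 0.02
pred = np.where(flips, 1 - true, true)
acc = average_accuracy(pred, true)
errors = avg_misidentified(pred, true)
```

The naive rule misidentifies exactly the disconnected lines, so `baseline_errors` equals `avg_outages`. For any classifier, `errors` equals `(1 - acc) * n_lines`. For instance, with the 38 lines and the rounded accuracy of 0.989 reported for the IEEE 30 bus system, this relation gives (1 − 0.989) × 38 ≈ 0.42 misidentified line statuses per sample.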
We would like to further emphasize that the topologies and the power injections used to train the classifier are different from the ones in the validation and test sets. This is of particular interest because it means that our learned classifier is able to generalize well on the unseen test topologies and power injections based on its knowledge learned from the training data set. It is also worth noting that we have generated the training, validation and testing data sets with uniformly random voltage phase angles, and hence considerably variable power injections. In practice, there is often more informative prior knowledge about the power injections based on historical data and load forecasts. With such information, the model can be trained with much less variable samples of power injections, and the identification performance can be further improved significantly.
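The link between random phase angles and variable power injections can be illustrated loosely with the sketch below. It is not the data generation procedure of Section VA: it swaps in the DC power flow approximation for brevity, whereas the experiments rely on full power flow models. The 4-bus network, the susceptances, the outage probability `p_outage` and the angle range `theta_max` are all hypothetical, and the sketch does not check whether the sampled topology remains connected.

```python
import numpy as np

# Hypothetical 4-bus, 5-line network: (from_bus, to_bus, susceptance).
LINES = [(0, 1, 10.0), (0, 2, 8.0), (1, 2, 5.0), (1, 3, 6.0), (2, 3, 4.0)]
N_BUS = 4

def dc_injections(status, theta):
    """DC approximation: P = B(status) @ theta, with B the weighted graph Laplacian."""
    B = np.zeros((N_BUS, N_BUS))
    for s, (i, j, b) in zip(status, LINES):
        if s:  # only lines in service contribute
            B[i, i] += b
            B[j, j] += b
            B[i, j] -= b
            B[j, i] -= b
    return B @ theta

def sample(rng, p_outage=0.2, theta_max=0.2 * np.pi):
    """Draw one (line statuses, phase angles, injections) triple."""
    status = (rng.random(len(LINES)) >= p_outage).astype(int)  # 1 = connected
    theta = rng.uniform(0.0, theta_max, size=N_BUS)            # uniformly random angles
    return status, theta, dc_injections(status, theta)

rng = np.random.default_rng(0)
samples = [sample(rng) for _ in range(100)]
injection_spread = np.std([p for _, _, p in samples], axis=0)  # per-bus variability
```

Each sample pairs a label vector (`status`) with measurements (`theta`). The nonzero `injection_spread` shows how the injections vary from sample to sample.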
VC3 Model Size, Sample Complexity, and Scalability
In the proposed LearningtoInfer method, obtaining labeled data is not an issue since data can be generated in an arbitrarily large amount using Monte Carlo simulations. This leads to two questions that are of particular interest: to learn a good classifier, a) what size of a neural network is needed? and b) how much data needs to be generated? To answer these questions,

for the IEEE 30 bus system, we vary the size of the hidden layer of the neural network from 100 neurons to 300 neurons, as well as the training data size from to , and evaluate the learned classifiers;

for the IEEE 118 bus system, we vary the size of the hidden layer of the neural network from 300 neurons to 1000 neurons, as well as the training data size from to , and evaluate the learned classifiers;

for the IEEE 300 bus system, we vary the size of the hidden layer of the neural network from 1000 neurons to 3000 neurons, as well as the training data size from to , and evaluate the learned classifiers.
We plot the testing results for the IEEE 30, 118 and 300 bus systems in Figures 7, 8 and 9, respectively. We make the following observations:

For the IEEE 30 bus system: a) With training data, the neural network models of size 200 and 300 are severely overfit, but the levels of overfitting are significantly reduced as the size of training data increases to above ; b) The best performance is achieved with 300 neurons and training data, where no overfitting is observed.

For the IEEE 118 bus system: a) With data, the three neural network models of sizes 300, 600 and 1000 all severely overfit, but the levels of overfitting are significantly reduced as the size of training data increases to above ; b) The best performance is achieved with 1000 neurons and training data, where no overfitting is observed.

For the IEEE 300 bus system: a) With data, the three neural network models of sizes 1000, 2000 and 3000 all severely overfit, but the levels of overfitting are significantly reduced as the size of training data increases to above ; b) The best performance is achieved with 3000 neurons and training data, where no overfitting is observed.
Based on all these experiments, we now examine the scalability of the proposed LearningtoInfer method as the problem size increases. We observe that training data sizes of and and neural network models of sizes 300, 1000 and 3000 ensure very high and comparable performance with no overfitting for the IEEE 30, 118 and 300 bus systems, respectively. When these data sizes are halved, some overfitting appears for these models in all three systems. We plot the training data sizes against the problem sizes for the three systems in Figure 10. We observe that the required training data size increases approximately linearly with the problem size. This linear scaling behavior implies that the proposed LearningtoInfer method can be effectively implemented for largescale systems with reasonable computation resources.
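One way to quantify such a scaling trend is a least-squares line fit together with its coefficient of determination. The sketch below is illustrative only: the function name is hypothetical, and the demo inputs are placeholders chosen to lie exactly on a line, not the data sizes from the experiments.

```python
import numpy as np

def fit_linear_scaling(problem_sizes, data_sizes):
    """Fit data_size ~ slope * problem_size + intercept and return (slope, intercept, R^2)."""
    x = np.asarray(problem_sizes, dtype=float)
    y = np.asarray(data_sizes, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # coefficients, highest degree first
    residuals = y - (slope * x + intercept)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return float(slope), float(intercept), r2

# Placeholder inputs (NOT the experimental values): points on y = 2x + 1.
x_demo = [1.0, 2.0, 4.0]
y_demo = [3.0, 5.0, 9.0]
slope, intercept, r2 = fit_linear_scaling(x_demo, y_demo)
```

An R² close to 1 on the measured (problem size, training data size) pairs would support the approximately linear trend.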
VC4 Effect of Number and Locations of Sensors
We close this section with a look at the effect of sensor placement on topology identification. The performance of realtime topology identification clearly depends on where and what types of sensor measurements are collected. Given limited sensing resources, optimizing the sensor placement is a hard problem that many studies have addressed (see, e.g., [7] among others). Here, we present a case study on the IEEE 30 bus system, for which voltage phase angles are collected only at buses (as opposed to all the buses as in the previous experiments), as depicted in Figure 11. Interestingly, the achieved average identification accuracy only drops to (from when all the buses are monitored). This translates to on average only misidentified line statuses among a total of lines. A more comprehensive study of sensor placement for realtime topology identification is left for future work.
VI Conclusion
We have developed a new LearningtoInfer variational inference method for realtime topology identification of power grids. The computational complexity due to the exponentially large number of topology hypotheses is overcome by efficient marginal inference with optimized variational models. Optimization of the variational model is transformed to and solved as a discriminative learning problem, based on Monte Carlo samples efficiently generated with fullblown power flow models. The developed LearningtoInfer method has the major advantages that a) the training process takes place completely offline, and b) labeled data sets can be generated quickly, in arbitrarily large amounts, and at very little cost. As a result, very complex variational models can be employed without worrying about overfitting, since more labeled training data can always be generated whenever overfitting is observed. The classifiers are learned offline and then used in real time, with topology identification decisions made in under a millisecond. We have evaluated the proposed method with the IEEE 30, 118 and 300 bus systems. It has been demonstrated that arbitrary network topologies can be identified in real time with excellent performance using classifiers trained with a reasonably small amount of generated data.
References
 [1] Y. Zhao, J. Chen, and H. V. Poor, “Learning to infer: A new variational inference approach for power grid topology identification,” in Proc. IEEE Workshop on Statistical Signal Processing, Jun. 2016, pp. 1–5.
 [2] ——, “Efficient neural network architecture for topology identification in smart grid,” in Proc. IEEE Global Conference on Signal and Information Processing (GlobalSIP), Dec. 2016, pp. 811–815.
 [3] U.S.-Canada Power System Outage Task Force, Final Report on the August 14, 2003 Blackout in the United States and Canada, 2004.
 [4] Arizona-Southern California Outages on September 8, 2011: Causes and Recommendations. FERC, NERC, 2012.
 [5] J. E. Tate and T. J. Overbye, “Line outage detection using phasor angle measurements,” IEEE Transactions on Power Systems, vol. 23, no. 4, pp. 1644–1652, Nov. 2008.
 [6] ——, “Double line outage detection using phasor angle measurements,” in Proc. IEEE Power and Energy Society General Meeting, Jul. 2009.
 [7] Y. Zhao, J. Chen, A. Goldsmith, and H. V. Poor, “Identification of outages in power systems with uncertain states and optimal sensor locations,” IEEE Journal of Selected Topics in Signal Processing, vol. 8, no. 6, pp. 1140–1153, Dec. 2014.
 [8] Y. Zhao, A. Goldsmith, and H. V. Poor, “On PMU location selection for line outage detection in wide-area transmission networks,” in Proc. IEEE Power and Energy Society General Meeting, Jul. 2012, pp. 1–8.
 [9] T. Kim and S. J. Wright, “PMU placement for line outage identification via multinomial logistic regression,” IEEE Transactions on Smart Grid, 2017.
 [10] M. Garcia, T. Catanach, S. Vander Wiel, R. Bent, and E. Lawrence, “Line outage localization using phasor measurement data in transient state,” IEEE Transactions on Power Systems, vol. 31, no. 4, pp. 3019–3027, 2016.
 [11] H. Zhu and G. B. Giannakis, “Sparse overcomplete representations for efficient identification of power line outages,” IEEE Transactions on Power Systems, vol. 27, no. 4, pp. 2215–2224, Nov. 2012.
 [12] J. Chen, Y. Zhao, A. Goldsmith, and H. V. Poor, “Line outage detection in power transmission networks via message passing algorithms,” in Proc. 48th Asilomar Conference on Signals, Systems and Computers, 2014, pp. 350–354.
 [13] M. He and J. Zhang, “A dependency graph approach for fault detection and localization towards secure smart grid,” IEEE Transactions on Smart Grid, vol. 2, no. 2, pp. 342–351, Jun. 2011.
 [14] S. Bolognani, N. Bof, D. Michelotti, R. Muraro, and L. Schenato, “Identification of power distribution network topology via voltage correlation analysis,” in IEEE Conference on Decision and Control, 2013, pp. 1659–1664.
 [15] V. Kekatos, G. B. Giannakis, and R. Baldick, “Online energy price matrix factorization for power grid topology tracking,” IEEE Transactions on Smart Grid, vol. 7, no. 3, pp. 1239–1248, 2016.
 [16] D. Deka, M. Chertkov, and S. Backhaus, “Structure learning in power distribution networks,” IEEE Transactions on Control of Network Systems, 2017.
 [17] Y. Weng, Y. Liao, and R. Rajagopal, “Distributed energy resources topology identification via graphical modeling,” IEEE Transactions on Power Systems, vol. 32, no. 4, pp. 2682–2694, 2017.
 [18] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
 [19] Power Systems Test Case Archive, University of Washington Electrical Engineering, https://www.ee.washington.edu/research/pstca/.
 [20] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin, Network Flows: Theory, Algorithms, and Applications. Prentice Hall, 1993.
 [21] J. D. Glover, M. Sarma, and T. Overbye, Power System Analysis & Design. Cengage Learning, 2011.
 [22] R. Baldick, Applied optimization: formulation and algorithms for engineering systems. Cambridge University Press, 2006.
 [23] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
 [24] H. V. Poor, An Introduction to Signal Detection and Estimation. Springer-Verlag, New York, 1994.
 [25] M. Mezard and A. Montanari, Information, physics, and computation. Oxford University Press, 2009.
 [26] V. Vapnik, Statistical Learning Theory. Wiley, New York, 1998.
 [27] A. von Meier, D. Culler, A. McEachern, and R. Arghandeh, “Micro-synchrophasors for distribution systems,” in Proc. IEEE Innovative Smart Grid Technologies (ISGT), 2013.
 [28] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013, vol. 87.
 [29] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.