Realtime Anomaly Detection and Classification
in Streaming PMU Data
Abstract
Ensuring secure and reliable operations of the power grid is a primary concern of system operators. Phasor measurement units (PMUs) are rapidly being deployed in the grid to provide fastsampled operational data that should enable quicker decisionmaking. This work presents a general interpretable framework for analyzing realtime PMU data, and thus enabling grid operators to understand the current state and to identify anomalies on the fly. Applying statistical learning tools on the streaming data, we first learn an effective dynamical model to describe the current behavior of the system. Next, we use the probabilistic predictions of our learned model to define in a principled way an efficient anomaly detection tool. Finally, the last module of our framework produces onthefly classification of the detected anomalies into common occurrence classes using features that grid operators are familiar with. We demonstrate the efficacy of our interpretable approach through extensive numerical experiments on real PMU data collected from a transmission operator in the USA.
The work was supported by funding from the U.S. DOE as part of the DOE Grid Modernization Initiative, and by the Laboratory Directed Research and Development program of Los Alamos National Laboratory under project number 20190351ER.
I Introduction
Traditional supervisory control and data acquisition (SCADA) systems provide information regarding the system state at the order of seconds to the operator. However, such fidelity, considered appropriate in prior decades, is not sufficient to observe or predict disturbances at faster timescales that are increasingly being observed in today’s stochastic grid [hossain2011investigation]. To provide more rapid measurement data, phasor measurement units (PMUs) have gained widespread deployment. PMUs [pmu] are timesynchronized by GPS timestamps and collect measurements of system states (Eg. voltage, frequency, currents) multiple times per second at rates between 30 and 240 Hz. Thereby, if efficient methods for PMU analysis are developed, they can help in obtaining a more holistic understanding of the state of the power grid. Realtime detection of changes in the power system can have multiple benefits to grid security. Among others, it can enable corrective control actions by operators that can prevent grid failures from devolving into cascading blackouts [ghanavati2015identifying]. Together with eventdetection, the problem of event identification/classification is equally important as it helps determine probable causes and followup actions if necessary. Figure 1 shows several voltage anomalies measured by a PMU system in a US transmission grid. While PMUs have gained increasing deployment in the modern power system, the use of their data reporting in realtime operations has not yet realized full adoption. The primary use of PMU data so far has been for offline event discovery and postevent analysis [PMU_loc].
The overarching goal of this research is to create a realtime approach for facilitating a remote analysis of individual components in the grid for monitoring, detection, and classification of anomalies. The implementation of our framework is fully automated, userfriendly and crucially interpretable. Any time the warning flag about the system state is raised, our methods ensure that the operator is able to see quantitative details of the prediction and classification subtasks in realtime for validation.
A principled way to achieve the interpretability of predictions consists of connecting data with the physical properties of the system. This physicsinformed approach has been proven successful when a reasonable system model under given conditions is known, such as the recovery of linearized swing equations in transmissionlevel power grids from PMU observation [lokhov2018online]. However, often even the functional form of the model is not known, or the conditions under which it holds can not be verified a priori; hence, it becomes desirable to use a more flexible datadriven approach that is based on, but also extends the physicsbased models commonly used in modeling the dynamics of power grids.
The datadriven framework proposed in this work is principally based on the theory of robust statistical machine learning [wainwrightjordan]. In particular, we learn the parameters of a general stochastic vector autoregressive (VAR) process to model the effective relationship between measurements collected. This multivariate approach, updated at regular intervals, makes nearterm predictions of future trajectories for the data. The anomalies are then identified based on statistically significant deviations in the incoming data from the ensuring predictions. Finally, we design a decisiontree based classification tool into our framework to distinguish the identified anomalies into different interpretable categories. In what follows, we discuss the application of our learning framework on real PMU datafeed collected from a US transmission system operator (ISO).
Compared to univariate data analysis and other generalpurpose machine learning techniques such as neural networks, our multivariate approach coupled with decision treebased classification shows superior performance for both event detection and classification in the available dataset. Even more importantly, our approach does not involve tuning or fitting of thresholds and hyperparameters specific to a particular system, and hence is intrinsically transferable from one real system to another.
The rest of the paper is organized as follows: Section II presents related work, Section III presents the implementation of our machine learning framework and the procedure for setting proper parameter values in the statistical model. In Sections IV and V, the anomaly detection and classification application using the implementation of the framework is presented and evaluated. Finally, the implications of the framework with proposed future work are discussed in Section VI.
Ii Related Work
Realtime use of PMUs is a growing field of research, though its practical implementation has been sporadic. The approach presented in this paper provides a statistical learning framework for modeling the multivariate stream of raw data from a single PMU that measures frequency, and complexvalued voltage, current, and power measurements in realtime. This step is motivated by nontrivial dependence between singlevariable measurements at the same location, as observed in later sections. Conversely, in prior work [pmu_small, pmuOther], statistical tests with sliding windows are designed to detect anomalies in single variables. In [PMU_loc], the authors use a rulebased approach for detecting changes in each data variable independently. The approach utilizes offline analysis of data rather than the realtime approach presented in this work. In [smartMeter], the authors focus on detecting and classifying smart meter anomalies using unsupervised machine learning techniques, including neural network, and projectionbased methods. The low datarate of smart meters makes it nontrivial to extend the techniques to realtime PMU data. In [PMU_ensemble], the authors develop an ensemblebased method that combines multiple detection algorithms to create a more robust detection mechanism. In [PMU_decision], the authors use a decision tree to classify line events from PMU data. While our work utilizes decision trees as well, due to their interpretability, the detection methods used makes it different from the existing work.A guide for signal processing based techniques is developed in [nrel_long]. The authors use the voltage phase angle difference measured by PMUs with a reference bus to determine power system events. In contrast, we model the PMU data stream individually within its local context for use in realtime detection. In [PMU_comms], the authors use Knearestneighbor (KNN) and Singular Spectrum Analysis (SSA) combined with fog computing which moves processing to edge devices. In future work, we intend to evaluate our modeling methods on multiple PMU data streams and incorporate edge computing paradigms.
Overall, the approach presented in this paper attempts to generalize the problem of anomaly detection: detection and scoring of anomalies in a principled manner based on an effective model that is learned on the fly; welldefined datadriven approach to selecting hyperparameters; capturing the interaction of variables due to underlying system properties; and working in the invariant spaces of probabilistic scores which ensures transferrability of methods.
Iii Approach
To analyze the PMU measurements collected from the grid during normal ambient operation, we model the observed values using a dynamical model. Once a suitable effective descriptive model is obtained, we can gain insight into the state of the system and changes made in it. We start by describing the measurements available from a US ISO for analysis.
Iiia Data Description and Preliminary Analysis
To provide a foundation for our statistical model, we use timestamped vectorvalued data collected from a single PMU on the transmission system of a US ISO. In this study, we use days of data, collected at 30Hz. Due to the underlying physics of the grid, there is an inherent relationship between the power variables useful to power operators in the datastream. The power triangle relates active power (P), reactive power (Q), Complex Power (S), and phase angle difference . Coupling with voltage, current and frequency, there are many potential variables at hand to model the system. Using all the information available is not optimal from the information storage and realtime data processing points of view; additionally, exact strong correlation relations between some of the variables may artificially complicate the learning of an effective model. Therefore, we limit the data variables used in our model to voltage, current, the sine of the angle difference of voltage and current and frequency. All other quantities can be straightforwardly estimated from these four if required. Figure 2 shows the empirical correlation between the variables used in our model. On the diagonal of the figure, the histograms of the data for each plot are shown. The plots below the diagonal contain the bivariate distribution ellipses to 95% confidence of the values along with their locally estimated scatter plot smoothing (LOESS) lines. Above the diagonal, the Pearson correlation coefficients are listed.
A simple exploratory analysis of the Figure 2 indicates that the sine of the angle difference is correlated to the magnitude of current and that voltage magnitude is correlated to the frequency of the system. This observation motivates the need to take the multivariate nature of the signal into account when constructing the effective dynamical model of the system, which is one of the main features of our method.
IiiB Data Modeling
To model the PMU data stream, we design a multivariate approach for capturing the interaction between variables. The interaction is expected due to the underlying physics of the electric power grid. In general, the dynamics may have an arbitrary nonlinear functional complexity; however, a common approach consists of considering a liner effective model linearized around a current operating point, which is typically valid over short time scales. Hence, in what follows we use the four identified variables in a vector autoregressive model (VAR) which relates the system’s current state to timelagged values of previous states via dynamical state matrices and an error term.
A VAR(p)process is defined consisting of endogenous variables for as:
(1) 
with as coefficient matrices for and is a dimensional random noise process with expected mean, E and time invariant white noise positive definite covariance matrix [vars2]. In the case of an independent Gaussian noise, the coefficients of a VAR model can be optimally estimated using leastsquares applied to the rewritten equations for each variable separately [lokhov2018online, lokhov2018cdc].
Since different variables have different scales, it is convenient to first preprocess and adequately format the data for the statistical learning model. First, each signal is individually detrended over the period the model learns on. To normalize, we then divide the data interval by the standard deviation, for each variable . Thus our transformed data is measured in standard deviations from the linear trendline and original units such as volts, amperes are removed. Figure 3 shows an example of regular raw data signals next to the standardized signals. A VAR process is then fitted to the preprocessed data. The R language VAR implementation is used [vars1] for analysis. Notice that it is immediate to transform the standardized data back to physical scale and dimension using the saved values of the linear trend and value of the standard deviation. The standardization parameters are constantly adjusted to incorporate new incoming data in realtime.
IiiC Hyperparameter selection in VAR
The VAR estimation process includes the following input hyperparameters: (a) the time resolution we want to model; (b) , the length of the time period over which the VAR coefficients are trained; and (c) , the number of lag terms included in the model. Below, we describe a principled empirical procedure for selecting the right values of each of the hyperparameters for a given data set.
Model resolution
The time resolution hyperparameter is not datarelated, but instead is intrinsically tied to the time scale of the application of interest. For example, detection of forced oscillations occurring at the scale of seconds would require much high data resolution compared to slowmoving changes such as transformer aging [transformers] which is the process that may take weeks or even months. In this study, we are interested in realtime detection of anomalies on the maximum scale of seconds; we use a 0.5second prediction interval that enables detection of realtime changes in the system. To fit the model using this resolution, the data should be appropriately coarsegrained to scale. In the case of a given data set, the data is processed by taking the rate of data transmission, i.e., 30 Hertz and dividing by the granularity (0.5) seconds and taking the mean of those points.
Training data size
Next, we determine the selection of training data length, . For a stable model, more data (larger ) leads to a better model fit. However, the physical parameters in an electric power grid tend to change throughout the day due to changing weather, control settings, or consumption and generation. Hence, we do not expect our model to be stable for very long periods of time; the effective description should drift with time as well. Moreover, our requirement of realtime data processing means that we would like to minimize the number of data points to construct a model on the fly. Hence, an ideal amount of training data will balance the training error caused by a finite number of learning samples and the model misspecification error due to the naturally changing effective model of the system, at the same time keeping the computational complexity of the learning procedure low. Below, we describe a selfconsistent datadriven approach to set the value of that would satisfy these requirements.
How can one estimate the training error given that the ground truth model is unknown? To tackle this challenge, we propose to learn a trial VAR(1) model from data and then use it to synthesize data; this model will then serve as ground truth in this synthetic experiment. From the synthesized data, we then retrain a VAR(1) model and calculate the difference between the original and the inferred models. A larger difference would show that there is an insufficient amount of data to accurately reconfigure the VAR model. In Figure 4, data is synthesized from VAR(1) models trained on PMU data from minutes at halfminute intervals. VAR(1) models are then retrained on the synthesized data. The error is computed using the following relation defined by the average absolute perelement difference :
(2) 
Here, is a reference VAR(1) coefficient matrix and is the VAR(1) coefficient matrix fitted on data simulated using . Figure 4 shows that the error rapidly decreases to an average of about 0.005 at about 8 minutes, after which the decay becomes gradual. This illustrates that the typical dynamic state matrix for the VAR(1) model is reconstructed to an acceptable accuracy using the data equivalent of 812 minutes.
In order to evaluate the misspecification error arising from the natural change of the effective model, we compare the coefficient matrices estimated using two consecutive intervals:
(3) 
Similarly to Eq. 2, Eq. 3 computes the average absolute perunit difference of and where is the coefficient matrix from the VAR(1) model trained from and is the coefficient matrix fitted from the VAR(1) model trained from for ranging from 0.5 to 20 minutes.
Figure 5 illustrates the stability of the model when , the size of the training set, increases. As expected, over short intervals i.e., 1 minute, the model remains very stable with less than 0.025 per element change in the coefficient matrices. As the length of the model increases, the similarity of the coefficient matrices decreases. After about 10 minutes, the error plateaus around 0.05. Based on our similarity metric, it can be concluded that beyond 10 minutes, the difference in matrices encompasses the error due to the change in the effective model.
The analysis of the two sources of error depicted in Figures 4 and 5 shows that an increase in error due to model misspecification will take over the reduction of error due to statistical estimation. In what follows, we fix minutes as an acceptable choice for the length of training data in the current data set.
Depth of the model
The final parameter of the VAR model is the number of lag terms , that are used to predict the next state. The benefit of including more time lags which increases the expressivity of the model can be hampered by the decrease of the accuracy of the recovered model due to the final amount of data. In order to determine an appropriate number of lag terms to include, we analyze the impact of the number of lags on a model’s retraining error using Eq. 2. For the case of 2 lags, we compute the average perelement absolute error over both coefficient matrices and . Figure 6 illustrates the error over 50 runs where increases from 0.5 minutes to 25 minutes. It can be seen that a VAR(2) model with 20 minutes has over twice the perelement error as a VAR(1) model with 10 minutes. Therefore, we can conclude that to minimize errors associated with the selected , the inclusion of two lags is prohibitive and instead one lag is sufficient for the VAR model on the current data set.
To summarize, for the data set analyzed in this study, we utilize a VAR(1) model with a training data length of minutes, and a 0.5second prediction scale. The variables included in the VAR(1) model are voltage magnitude, current magnitude, sine of the phasor angle differences, and the frequency. In the next section, we use this VAR(1) model for anomaly detection and subsequent classification. For other systems and data sets, the operator may use the procedure described above to set the optimal hyperparameters of our learning framework.
Iv Anomaly Detection
Under normal control operations as well as external perturbations, the measurements recorded in the PMU measurements change gradually. However brief or extended large changes are observed from time to time and qualified as anomalies. These include both operational changes such as load tap changes, as well as external perturbations such as lightning strikes. Figure 1 shows a few examples of anomalies in the voltage signal of the available PMU. While large anomalies can be visually seen by the human eye in individual measurement streams, we propose a statistical approach that enables early detection of anomalies when they begin or less severe or subtle anomalies that can only be detected by considering multiple measured variables in the VAR(1) model.
Probabilistic scoring of the incoming data
As described in the previous section, we use a VAR(1) model with 4 variables, scaled to 0.5second granularity, with training data of 10 minutes for modeling the PMU datastream. The VAR(1) model is retrained in realtime every 0.5 second and used to predict the next data point in the time series. Once the next datapoint is received, a vectorvalued error/residual is computed as
(4) 
where is the prediction based on and is the observed measurement. The residuals can then be scored according to the Mahalanobis distance [mahdist] using noise covariance matrix which is learned during the VAR(1) training:
(5) 
The distance of the residuals from the model’s noise distribution are unitless and hence transferable across PMU streams and different operating conditions. This metric (given here in the zeromean case due to detrending) generalizes the distance in terms of standard deviations for onedimensional case to the case of multivariate Gaussian distributions.
The analysis of the multivariate distribution can also provide a distance metric related to an individual variable (e.g. voltage alone), but informed by the state of other variables. Indeed, a conditional distribution of given the other variables is a Gaussian distribution itself:
(6) 
with mean and standard deviation that are computed for each variable and the state of the remaining variables .
The incoming data points are provided with the score equal to the Mahalanobis distance in the case of full multivariate distribution, or to the distance to the mean measured in standard deviations for an individual variable . Examples of distances of the individual residuals using the VAR(1) model on the available PMU data is shown in Figure 7. Notice that most of the points lie within the “” score, which means that they are considered normal for the current effective model. On the other hand, the larger the value of , the more likely it is to be a severe event. To identify anomalies, we compare the relative score against a userspecified threshold . When a score larger than the threshold is found, the VAR(1) model retraining is stopped, as we know that the data set contains anomalous points, and does not describe normal behavior of the system. Our learning algorithm then scores the next streaming measurement samples by their corresponding residues based on a step forecast of the VAR(1) process given by . In our detection approach, we set to detect events of short duration or larger events that progress in this period of time.
An example of this procedure is given in Figure 8 for a voltage anomaly over 10 seconds. The conditional probability is used for scoring the residuals for the 5 seconds following the anomaly. In this anomaly with , the deviation of the voltage is over 70 standard deviations away from the expected distribution.
Multivariate vs. univariate detection
It is worth mentioning that in prior work, anomaly detection is typically based on historical data involving a single variable on a sliding window [pmu_small, pmuOther]. Our approach is more general in utilizing a multivariate statistical linear model. Figure 9 compares anomaly detection by our multivariate VAR(1) approach with that based on univariate minmax detection approach on a PMU data stream over 1 day. The minmax algorithm takes the range (maximumminimum) of a sliding window and determines if it is some standard deviations from the mean of the whole data set. Using a 10second sliding window, there are many overlaps in the conditional probability of VAR residuals as well as the minmax algorithm. For the multivariate approach, we use a threshold of to identify the start of the anomaly. For the minmax algorithm, the thresholds are set at 3 for voltage magnitude, 4 for current and angle difference, and 6 for frequency. This is in accordance with the National Renewable Energy Laboratory’s PMU anomaly detection guidelines [nrel_long] which suggest that the use of multiple techniques is useful to detect changes based on historical PMU data. It is clear from Figure 9 that our multivariate approach is able to identify almost all anomalies that are detected in different single variable statistical analysis. Crucially, it is able to overcome the fact that certain anomalies detected by a singlevariable based test are missed entirely by tests based on some other variable. This combined with the automatic retraining during normal operating regime makes our framework seamless yet accurate.
Selection of the detection threshold
To demonstrate the dependence on the threshold , we consider the entire datastream of days and report results on anomaly detection by multivariate as well as individual variables in Figure 10 for two thresholds (a) , and (b) . In either case, each detected anomaly labeled based on the set of individual variables scores and/or multivariate score that led to its discovery. Setting a detection threshold of = 5 standard deviations results in a total of just under 8000 anomalies. Observe that the most common detection of anomalies corresponds to detection through the multivariate case. Voltage anomalies are the second most occurring anomaly type, but they are also detected by the multivariate test. A threshold corresponding to five standard deviations may be too low as in the large amount of data it would also capture the lowprobability deviations that correspond to normal events under the learned model. Setting a detection threshold of in Figure 10 (b), reduces the number of false positives but may potentially miss some low impact events. The trend remains similar in both cases: the multivariate score is enough to detect the score in all other variables. In practice, a threshold selection should be done by the operator using randomized tests with labeled data of events of importance to the system. In this threshold setting scheme, one could imagine anoperator assisted procedure where a human may introduce some feedback to finetune the threshold since the events are detected in realtime. In the next step, we move to the crucial step of anomaly classification of detected events.
V Classification of Anomalies
It is desirable to provide information useful for operators once an event is detected. Specifically, a tool that is able provide an automatic classification of the anomaly into predefined classes can provide quick insight into the operation of the system. This would enable the operator (or automated system) to take fast and informative control action. Another requirement for the classifier is to be as interpretable as possible: the operator should be able to look up why a certain event has been classified into a given class.
From the detection results, it is clear that among single variable based detection tests, voltage anomalies are the most frequent anomaly type. To describe and demonstrate our classification task, we focus on assigning the voltage anomalies into a set of common occurrence classes. The PMU data available to us is unlabeled, and hence for the sake of illustration of our methods, we resort to shapebased but intuitive classes. We have manually labeled 750 voltage events into 4 categories using raw data of the voltage signal: voltage spikes, prolonged voltage drops, step up and step down events typically caused by load tap changes and miscellaneous oscillatory events. In realworld application, these classes should be defined based on the domain expertise of the operators.
To enable automatic classification, data features are generated from the data residuals, using 5 seconds of data following anomaly detection. The extracted features include:

Maximum distance

Average distance

Number of points above the threshold

Decile Ranks 19

Index of maximum distance

# of oscillations around the 0.25, 0.5, and 0.75 levels

Return from anomaly index. (Index of value such that future residuals are all below the anomaly threshold)

The difference of indices of maximum distance and return from anomaly
For elucidation, a representative anomaly of length seconds with annotated features is shown in Figure 11. In practice, additional features can be generated using automatic feature extraction schemes or using domain knowledge .
Our classification scheme is based on a supervised decision tree [cart84] based on the above features. The decision tree partitions regressively on features using the algorithm [rpart] and the R library rpart [rpart_r]. Among others, decision trees have good interpretability as they offer operators the exact sequence of feature values that led to an anomaly being classified into a particular class. In practice, this can help in the selection or removal of specific features based on the accuracy over labeled training data. The regressive partitioning algorithm in our decision tree selects a subset of features that give the most accurate classification: decile ranks 2 and 8, the maximum distance and the oscillations around the mean distance. Figure 12 shows the view of the decision tree used for our data. Using 10fold crossvalidation, we present the accuracy of classification by our decision tree. To benchmark its performance, we compare it with blackbox neural network and kmeans clusteringbased classification using both features extracted from the data and raw data in Table I. Despite its simplicity and interpretability, the decision tree approach has an accuracy of 93.3%, which is on par and even higher than that of other learning schemes.
Classification Approach  Accuracy 

Decision Tree (features)  93.3% 
Neural Network 0 hidden layers (features)  91.1% 
Neural Network 1010010 hidden layers (features)  91.7% 
Neural Network 10 hidden layers (raw data)  86.0% 
Kmeans Clustering (features)  90.9% 
Vi Discussion and conclusions
Large quantities of synchronized data from phasor measurement units have created a need for a statistical learning based framework to process, model and evaluate the data to gain insight and understanding of the electric power system. The requirements for such a framework include: (a) generality: the scheme should not be systemspecific and should be transferrable and deployed on a large number of infrastructure systems; (b) interpretability: each step, be it model learned, anomaly detected, or classified, should be easily interpretable by an operator to better inform the response decisions; (c) modularity: each of the components could potentially be replaced by other schemes if desired; (d) guarantees: welldefined theoretical and empirical error measures provided with each step of the framework; and (e) computational efficiency: the methods should operate on streaming data, perform realtime computations, and provide early detection, classification, and explanation of the events.
This paper shows a general realtime approach that satisfies all the criteria listed above. It utilizes an effective physicsmotivated linear vector autoregressive model learned from a PMU data stream. We presented the general datadriven procedure that can be used to select hyperparameters of the learning scheme in a principled way. The value of the effective model learned is shown through two important applications: categorization of anomalies detected using welldefined probabilistic scores according to the learned model; and classification of anomalies using interpretable decision trees that also show a very good level of accuracy.
One of the advantages of the designed anomaly detection scheme consists in the choice of the invariant representation of the signals in terms of probabilistic scoring according to the effective model learned on the fly. This ensures the transferability of methods to very general settings. As far as classification is concerned, given that most of the data is unlabeled, in future work it is important to move to unsupervised approaches such as clustering to be able to classify unlabeled data corresponding to different variables automatically.
One important future direction is to use the full power of the learned effective model to see the signs of endogenous events or cyber attacks [lokhov2016detection] before they occur. Inherently, it is impossible to predict external perturbations, but it is possible to predict the response of the system to exogenous perturbations and to take corrective actions if the response is dangerous for the stability of the system. We anticipate that for many such scenarios, the information should be contained in the dynamic state matrix of the effective model learned prior to the event.
This work focused on establishing data processing, anomaly detection and classification at the level of a single PMU with multivariable stream. Moving forward, it is important to use the inferred information for networkwide tasks, such as localization of anomalies [8718345]. With the overwhelming quantity of data transmitted in modern WAMS, and limited comunication network infrastrucure, it is desirable for data to be processed locally on each PMU in the way described in this study; and that only labeled information on the detected events with time stamps could be sent to the central server for performing networkwide analysis from a partial PMU coverage [lokhov2018cdc].
Last, our short term goal will be to release code associated with the methods developed in this study with a visualization component, which would enhance the interpretability of the results for grid operations.
Vii Acknowledgments
We are grateful to Scott Backhaus, Trevor Crawford and Alaa Moussawi for fruitful discussions and for their insights in formulating various elements of the developed framework.