Feature Learning for Fault Detection in High-Dimensional Condition-Monitoring Signals
Preliminary versions of this work were presented in [michau_deep_2017]. Work supported by the Swiss Commission for Technology and Innovation under grant number 17833.2 PFES-ES.
Complex industrial systems are continuously monitored by a large number of heterogeneous sensors. The diversity of their operating conditions and the possible fault types make it impossible to collect enough data for learning all the possible fault patterns.
The paper proposes an integrated, automatic, unsupervised feature learning approach for fault detection that is trained on healthy-condition data only. The approach is based on stacked Extreme Learning Machines (namely Hierarchical ELM, or HELM) and comprises stacked autoencoders performing unsupervised feature learning, and a one-class classifier monitoring the variations in the features to assess the health of the system.
This study provides a comprehensive evaluation of HELM fault detection capability compared to other machine learning approaches, including Deep Belief Networks. The performance is first evaluated on a synthetic dataset with typical characteristics of condition monitoring data. Subsequently, the approach is evaluated on a real case study of a power plant fault.
HELM demonstrates a better performance specifically in cases where several non-informative signals are included.
Machine learning and artificial intelligence have recently achieved some major breakthroughs leading to significant progress in many domains, including industrial applications [bengio_representation_2013, Shao2017, Valada2018, Li2018]. One of the major enablers has been the progress achieved on automatic feature learning, also known as representation learning [bengio_representation_2013]. It improves the performance of machine learning algorithms while limiting the need for human intervention. Feature learning aims at transforming raw data into more meaningful representations that simplify the main learning task.
Traditionally, features were engineered manually and were thereby highly dependent on the experience and expertise of domain experts. This has limited the transferability, generalization ability and scalability of the developed machine learning applications [Khan2018]. The feature engineering task is also becoming even more challenging as the size and complexity of data streams capturing the condition of industrial assets are growing. These challenges have fostered the development of data-driven approaches with automatic feature learning, such as Deep Learning that allows for end-to-end learning [Khan2018].
Feature learning techniques can be classified into supervised and unsupervised learning. While in its traditional meaning, supervised learning requires examples of “labels” to be learned, supervised feature learning refers instead to cases where features are learned while performing a supervised learning task, such as fault classification or remaining useful life (RUL) estimation in PHM applications. Recently, deep neural networks have been used for such integrated supervised learning tasks, including Convolutional Neural Networks (CNN) [Ince2016, Krummenacher2018, Li2018b] or different types of recurrent neural networks, e.g., Long Short-Term Memory (LSTM) networks [Zhao2017b]. The performance achieved with such end-to-end feature learning architectures on commonly used PHM benchmark datasets, such as the CMAPSS RUL predictions, was shown to be superior to traditional feature engineering applications [Li2018, Zhang2018].
In contrast to supervised learning, unsupervised feature learning does not require any additional information on the input data and aims at extracting relevant characteristics of the input data itself. Examples of unsupervised feature learning include clustering and dictionary learning. Deep neural networks have also been used for unsupervised feature learning tasks for PHM applications, e.g., with Deep Belief Networks [Chen2017, Zhao2017], and with different types of autoencoders [Chen2017], e.g., Variational Autoencoders [Yoon2017].
Even though the approaches described above applied either supervised or unsupervised feature learning within their applications, the subsequent step of using the learned features within fault detection, diagnostics or prognostics applications has always required labeled data to train the algorithms. Yet, while data collected in PHM are certainly massive, they often span only a very small fraction of the equipment lifetime or operating conditions. In addition, when systems are complex, the potential fault types can be extremely numerous, with varied natures and consequences. In such cases, it is unrealistic to assume data availability on each and every fault that might occur, and these characteristics challenge the ability of data-driven approaches to actually learn the faults without having access to a representative dataset of possible fault types.
The focus of the methodology proposed in this paper is to account for this possibility. Using healthy data only, as they are usually largely available, it first applies an unsupervised feature learning step, which provides condensed and informative features to a second learning step, in which the health of the system is estimated and subsequently used to monitor the system condition and detect faults. This health monitoring concept also differs from other approaches that proposed to use health indicators (HIs) to monitor the system condition [Malhotra2016], as those HIs were learned in a supervised way. The proposed approach is able to measure the distance to the healthy system condition and thereby also to distinguish different degrees of fault severity. To the best of our knowledge, this is the first unsupervised deep learning approach enabling efficient health monitoring trained solely with healthy condition monitoring data.
The approach presented here takes advantage of the recent advances in using multi-layer extreme learning machines for representation learning [yang_autoencoder_2017]. The novelty here is the use of hierarchical extreme learning machines (HELM) [cao_building_2016, michau_deep_2017, Michau_2018b]. HELM consists in stacking sparse autoencoders, whose neurons can directly be interpreted as features, and a one-class classifier, which aggregates the features into a single indicator representing the health of the system. The very good learning abilities of HELM have already been used in other fields, e.g., in medical applications [miotto_deep_2016] and in PHM [michau_deep_2017]. This study provides a comprehensive evaluation of HELM detection capability compared to other machine learning approaches, including Deep Belief Networks.
The paper is organised as follows: Section 2 details the HELM theory and the algorithms used for this research. Section 3 presents a simulated case study designed to test and compare HELM to other traditional PHM models. Controlled experiments allow for quantified results and comparisons. A real application is analysed in Section 4, with data from a power plant experiencing a generator inter-turn failure. Finally, results and perspectives are discussed in Section 5.
2 Framework of Hierarchical Extreme Learning Machines (HELM)
HELM is a multilayer perceptron network, which was first proposed in [tang_extreme_2016]. The structure of HELM consists in stacking single-layer neural networks hierarchically so that each one can be trained independently. In this project we propose to first stack compressive autoencoders, used here for unsupervised feature learning, and then to use the learnt features as input to a one-class classifier that aggregates them into a relevant indicator [michau_deep_2017], which can be interpreted as a health indicator. This section details the theoretical background of HELM and highlights the adaptations needed to match the specific requirements of PHM applications.
In the following, , , and refer respectively to signals, model outputs, targets and labels. is the signal dataset dimensions, represents the number of samples and the dimension of the model output. When needed, the superscript notation is used to discriminate between variables specific to training data (e.g., ), validating data (e.g., ) or testing data (e.g., ).
For single layer feed-forward networks, with hidden neurons, and refer respectively to the weights and biases between inputs and the hidden layer neurons. , refers to the weights between the hidden layer neurons and outputs. is the hidden layer matrix such as where is the activation function.
2.2 Theory and Training of ELM
Extreme Learning Machines (ELM) are Single-hidden Layer Feed-forward Neural networks (SLFNs) [huang_extreme_2004].
SLFN equation is usually written as
where represents the input, the output, the activation function, and , and weights and biases respectively.
The whole theory behind ELM relies on the proof that, once set, a linear kernel, and a training set where represents the target output, the network output can approximate with any accuracy given, first, enough neurons (that is, the length and width of , and respectively) and, second, randomly sampled and [huang_universal_2006]. That is, for any , for and for enough neurons, exists such that:
where denotes the hidden layer matrix .
In the case of traditional feed-forward neural networks, the weights , and are optimized over several iterations, usually using a back-propagation algorithm. By sampling randomly , the input weights, and , the bias, training an ELM only consists in finding the optimal output weights . This simplifies Problem (3) considerably as it is now formulated as the single-variable regularised convex optimisation in Problem (4).
where , and is the weight of the regularization. represents a compromise between the learning objective and properties of that one would like to impose (e.g., sparsity, non-diverging coefficient, etc…). Any combination of , , , leads to a different solution of Problem (4), it is however usual to take .
When and , Equation (4) is known as the ridge regression problem (also referred to as Tikhonov regularisation problem) [huang_what_2015], and it has an explicit solution:
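A minimal sketch of this training procedure in Python with NumPy may help fix ideas. It rests on several assumptions that are not taken from the paper: the activation is tanh, the random weights follow a standard normal distribution, the regularisation constant multiplies the squared norm of the output weights, and all names (`train_elm_ridge`, `predict_elm`, `reg`, `n_hidden`) are illustrative.

```python
import numpy as np

def train_elm_ridge(X, T, n_hidden, reg=1e-3, seed=0):
    """Fit a single-hidden-layer ELM with ridge-regressed output weights."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Input weights and biases are sampled once and never updated.
    W = rng.normal(size=(n_features, n_hidden))
    b = rng.normal(size=n_hidden)
    # Hidden layer matrix, shape (n_samples, n_hidden); tanh is an assumption.
    H = np.tanh(X @ W + b)
    # Closed-form ridge solution: solve (H^T H + reg * I) beta = H^T T.
    A = H.T @ H + reg * np.eye(n_hidden)
    beta = np.linalg.solve(A, H.T @ T)
    return W, b, beta

def predict_elm(X, W, b, beta):
    """Propagate inputs through the fixed random layer and the learned output weights."""
    return np.tanh(X @ W + b) @ beta

# Toy regression problem (synthetic data, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
T = np.sin(X[:, :1]) + 0.5 * X[:, 1:2]
W, b, beta = train_elm_ridge(X, T, n_hidden=50)
mse = float(np.mean((predict_elm(X, W, b, beta) - T) ** 2))
```

Solving the linear system with `np.linalg.solve` rather than forming an explicit matrix inverse is a numerical design choice of this sketch; only the output weights are learned, which is what makes the training a single linear solve.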
2.3 ELM-Autoencoder for Feature Learning
Autoencoders (or AE) are unsupervised machine learning tools trained for the reconstruction of their input. Structural or mathematical constraints make sure that they do not simply learn the identity operation, such as adding noise to the input (the autoencoder is then qualified as denoising), imposing dimensionality reduction (compressive AE), dimensionality increase or adding a kernel regression hidden layer (variational AE).
Within neural networks, autoencoders are often part of multilayer architectures: they can be trained independently of the problem at hand and are often used for feature learning [vincent_stacked_2010]. Conceptually, in a compressive AE, the hidden layers learn the best few combinations of the different measurements that can explain all the signals. The ELM framework can be used to train single-layer autoencoder neural networks.
In this case, it consists in solving Problem (4) with the target set equal to the input. An intuition on feature relevance is that if each feature is connected to only a few measurements, each feature should maximise the information it contains on part of the system. This sparse connectivity between features and reconstructed inputs can be achieved using the $\ell_0$-norm for the regularization term. Yet, this would make Problem (4) non-convex and require an integer programming solver. A surrogate is the $\ell_1$-norm, which leads to the LASSO problem, one of the classic convex optimisation problems in the literature [Hastie_2017]. Note that the $\ell_2$-norm regularisation has a closed-form solution and is thus much easier to compute (cf. Equation (5)); yet the solution is likely to be dense and redundant [cambria_extreme_2013] and thus not compatible with our idea of a good set of features.
The AE-ELM consists finally in solving the typical LASSO problem:
In this paper, we solve this problem using the FISTA algorithm (Fast Iterative Shrinkage-Thresholding Algorithm) for both its fast convergence and computational simplicity [beck_fast_2009]. The process for solving Problem (4) is detailed in Algorithm (1).
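As a hedged illustration of how FISTA can be applied to such a LASSO problem, the sketch below follows the standard Beck and Teboulle iteration with a constant step size. It is not a transcription of Algorithm (1): the 0.5 scaling of the data term, the fixed iteration count, the tanh hidden layer and all names (`fista_lasso`, `soft_threshold`, `lam`) are assumptions made for the example.

```python
import numpy as np

def soft_threshold(Z, tau):
    """Proximal operator of tau * ||.||_1 (element-wise shrinkage)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def lasso_objective(H, X, beta, lam):
    """0.5 * ||H beta - X||_F^2 + lam * ||beta||_1."""
    return 0.5 * np.sum((H @ beta - X) ** 2) + lam * np.sum(np.abs(beta))

def fista_lasso(H, X, lam, n_iter=200):
    """Minimise the LASSO objective over beta with FISTA (constant step 1/L)."""
    # Lipschitz constant of the gradient of the smooth term: largest singular value squared.
    L = np.linalg.norm(H, 2) ** 2
    beta = np.zeros((H.shape[1], X.shape[1]))
    y, t = beta.copy(), 1.0
    for _ in range(n_iter):
        grad = H.T @ (H @ y - X)
        beta_next = soft_threshold(y - grad / L, lam / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        # Momentum step combining the two most recent iterates.
        y = beta_next + ((t - 1.0) / t_next) * (beta_next - beta)
        beta, t = beta_next, t_next
    return beta

# Autoencoder setting: the target is the input itself (synthetic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
H = np.tanh(X @ rng.normal(size=(8, 20)) + rng.normal(size=20))
beta = fista_lasso(H, X, lam=1.0)
obj_start = lasso_objective(H, X, np.zeros_like(beta), 1.0)
obj_end = lasso_objective(H, X, beta, 1.0)
```

Each iteration only needs matrix products and an element-wise shrinkage, which reflects the computational simplicity mentioned above.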
2.4 Health Indicator for Fault Detection
In the context of critical and complex systems, the high number of possible faults due to the large number of parts involved in the system, the rarity of certain faults (if ever experienced at all), and, overall, the (understandable) lack of willingness from operators to let the faults happen repeatedly for data collection, make it impossible to aim for fault recognition or classification. Instead, we propose to focus on abnormality detection, training a one-class classifier ELM [leng_one-class_2015] on features from healthy data solely.
In this case, we use the ridge regression problem (5) with (). Then, the distance between the output and the normal class () is monitored. A threshold on this distance discriminates between the normal and the faulty class. As such, this distance is similar to a health indicator [hu_deep_2016]. We propose here to base the threshold on a validation set , that is, a set of data points from healthy conditions that were not used in the training. Experiments have shown that a good threshold can be designed as:
where is the th-percentile function and, are hyperparameters.
Then the actual class of a sample ( for healthy, otherwise) can be devised with the following equation:
The choice of both and is to some extent a single problem, consisting in fact in choosing a robust threshold. In this paper, we chose to fix and find the best by optimising on the number of true and false positives (cf. Section 3). Experiments will show that gives good results.
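The thresholding logic described above can be sketched as follows. This is an illustration under assumptions rather than the paper's exact rule: the threshold is taken to be a multiplicative factor times a percentile of the validation distances, the names `gamma` and `p` and their default values are hypothetical, and the label coding (1 for faulty, 0 for healthy) is chosen for the example only.

```python
import numpy as np

def detection_threshold(val_distances, gamma=1.5, p=99.5):
    """Threshold as gamma times the p-th percentile of healthy validation distances.

    The multiplicative form and the default values are assumptions of this sketch.
    """
    return gamma * np.percentile(val_distances, p)

def classify(distances, threshold):
    """Return 1 where the distance exceeds the threshold (faulty), else 0 (healthy)."""
    return (np.asarray(distances) > threshold).astype(int)

# Synthetic distances to the normal class, for illustration only.
rng = np.random.default_rng(0)
val_distances = np.abs(rng.normal(0.0, 0.05, size=1000))
test_distances = np.array([0.01, 0.04, 0.60, 0.90])
thr = detection_threshold(val_distances)
labels = classify(test_distances, thr)
```

Fixing the percentile and tuning only the factor, as done in the paper, reduces the search to a single scalar that directly trades false positives against missed detections.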
2.5 Stacking AE(s) and a one-class classifier for HELM
A traditional approach to benefit from AE performance in deep learning is to pre-train an AE before using it in a deeper classification architecture, itself trained and fine-tuned using back-propagation (BP) [lecun_deep_2015]. HELM consists instead in training each layer sequentially and independently. The hidden layer of each successive ELM becomes the input of the next. This avoids the well-known limitations of BP, namely the risk of exploding or vanishing gradients and the intensive computations required for the training [vincent_stacked_2010, hinton_fast_2006]. In HELM, the lower layers are autoencoders, while the last layer is a one-class classifier, as illustrated in Figure 1.
In the following, HELM with a single AE is tested on a simulation and on a real case study. HELM is compared to five other feature extraction and fault detection approaches. First, a one-class classifier ELM and a one-class Support Vector Machine (SVM) with the raw signal as input (no feature learning stage). Second, the same two methods applied to the results of a Principal Component Analysis (PCA). Last, a Deep Belief Network (DBN) composed of two stacked Restricted Boltzmann Machines (RBMs): one for feature learning and a second whose hidden layer is fully connected to a single output neuron trained as a one-class classifier.
Remark about the one-class classifier SVM: Note that usually, a one-class classifier SVM outputs the likelihood for data points to belong to the main class rather than being outliers. The traditional interpretation is thus to label positive-likelihood points as “normal data” and negative-likelihood points as outliers or “faulty”. Yet this proved to give very poor detection performance on the applied datasets and we propose, for a fairer comparison, to use the decision rule (8), using the negative likelihood as the model output distance. This is, however, not exactly a distance as it can take negative values.
3 Simulated Case Study
To evaluate the performances of the proposed approaches under known and controlled conditions, a simulated case study is designed. The synthetically generated datasets aim to represent behaviors observed in real condition-monitoring signals and to simulate varied faults.
Methodology: To simulate the behaviour of a real complex industrial system, a set of artificial datasets has been generated, each composed of training, validating and testing data. It is assumed that the simulated system has a number of intrinsic properties (e.g., temperature, rotation, voltage etc…) that are here simulated with distinct random strictly positive signals . Then, sensors are simulated, each sensor reading one of the signals according to an internal reading function and a noise . Here, is either a noised affine function , or a logarithm, often used for intensity measurements, , where is the signal (among ) read by the sensor , models the sensitivity of the sensor and is drawn randomly and is an additive random Gaussian noise of .
At a given time step a fault is simulated which can impact one base signal in (Fault (1) to (4)) or the sensors themselves (Fault (5)). Simulated faults are:
(1) 20% amplitude change: the signal is multiplied by 1.2
(2) 50% amplitude change: the signal is multiplied by 1.5
(3) 20% stepwise deviation: a constant value computed as 20% of the signal amplitude is added to the signal
(4) new: a new random signal is generated and used instead
(5) alternatively, the fault can instead impact 5% of the sensors by increasing the reading amplitude by 20% (similarly to fault (1) but at the sensor level).
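The five fault types listed above can be sketched in a few lines. The multipliers 1.2 and 1.5 and the 5% share of sensors come from the text; everything else is an assumption of this example, in particular the definition of the signal amplitude as max minus min, the uniform generator used for the new signal, and the function names `apply_fault` and `apply_sensor_fault`.

```python
import numpy as np

def apply_fault(signal, fault_id, rng):
    """Return a faulty copy of a strictly positive base signal for faults (1) to (4)."""
    s = np.asarray(signal, dtype=float)
    if fault_id == 1:
        return 1.2 * s                       # 20% amplitude change
    if fault_id == 2:
        return 1.5 * s                       # 50% amplitude change
    if fault_id == 3:
        # Stepwise deviation; amplitude taken as max - min (assumption).
        return s + 0.2 * (s.max() - s.min())
    if fault_id == 4:
        # Replacement by a fresh random signal; the uniform law is an assumption.
        return rng.uniform(s.min(), s.max(), size=s.shape)
    raise ValueError("fault_id must be 1, 2, 3 or 4")

def apply_sensor_fault(readings, rng, fraction=0.05, gain=1.2):
    """Fault (5): scale the readings of a random subset of sensors (columns)."""
    out = np.array(readings, dtype=float)
    n_sensors = out.shape[1]
    n_faulty = max(1, int(round(fraction * n_sensors)))
    idx = rng.choice(n_sensors, size=n_faulty, replace=False)
    out[:, idx] *= gain
    return out, idx

# Synthetic base signal and sensor readings, for illustration only.
rng = np.random.default_rng(0)
base = 1.0 + rng.random(500)
readings = 1.0 + rng.random((500, 100))
faulty_readings, faulty_idx = apply_sensor_fault(readings, rng)
```

In a full experiment, faults (1) to (4) would be applied to a base signal from the fault onset onward and then propagated through the sensor reading functions, whereas fault (5) acts on the sensor readings directly.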
The signals before the fault are split into training and validation data. The latter dataset is used to estimate the detection threshold as per Equation (8). Among the testing data, a non-faulty set enables the computation of True Negatives (TN) (complementary to the False Positives (FP)) while the True Positives (TP), complementary to the False Negatives (FN) are obtained from the datasets with faults (1) to (5).
For each dataset, if the threshold in Equation (8) is exceeded at least once, then it is considered either as TP (if the dataset has a fault) or as FP otherwise.
For each model and for each simulated fault independently, the accuracy result is optimised with a grid search over the hyperparameters, where the accuracy is defined as:
where is the experiment number.
Hyperparameters optimised in the process are:
HELM: (), respectively the ridge regression constant, the norm penalisation, the number of neurons for the autoencoder, the number of neurons for the one class classifier and the factor used in the decision rule (8).
PCA: (), the number of principal components selected in the PCA (up to 99% of explained variance).
SVM: (), the Gaussian kernel scale, the number of outliers used for the training and the decision rule (8) factor.
DBN: (), the number of neurons and the factor used in the decision rule (8).
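For the PCA entry in the list above, the upper bound of the search range can be obtained from the cumulative explained variance. The sketch below is one possible way to compute it with a singular value decomposition; the function name `n_components_for_variance` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def n_components_for_variance(X, target=0.99):
    """Smallest number of principal components reaching the target explained variance."""
    Xc = X - X.mean(axis=0)                       # centre each signal
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values, descending
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)    # cumulative explained variance
    k = int(np.searchsorted(ratio, target)) + 1
    return min(k, len(s))

# Five correlated signals driven by two latent factors (synthetic example).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
k = n_components_for_variance(X)
```

The grid search would then consider component counts from one up to this value.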
Datasets: The experiments are repeated 100 times (). For each experiment, there are training points, validation points, and six sets of testing points: one non-faulty and one for each of the faults (1) to (5).
Impact of randomness on (H)ELM efficiency: Experiments demonstrated that, for ELM-based methods, averaging results over five independent training runs gives slightly more robust results, without changing the conclusions drawn from the case study. This is important for practitioners, as HELM is inexpensive to train and run and as reducing the impact of randomness on the results is a valid concern.
The accuracies of the six models for all five fault types and for the two measurement models , are summarised in Tables 1 and 2. Table 1 presents the best accuracies achieved independently for each fault, for and affected base signals, while Table 2 presents the results for the models with the best average accuracy over the five fault types and demonstrates the robustness and generalisation performance of HELM.
Based on Table 1, and according to expectations, accuracies are globally inferior for higher , as the underlying complexity of the dataset increases. Accuracies are also globally smaller for the logarithmic sensor measurement function. HELM achieves very good detection performances on all fault types and is the most efficient approach to detect faults that impact the base signals. HELM provides consistent results both for the different number of affected base signals and also for the two different measurement functions.
Compared to ELM, HELM performs consistently better, which demonstrates the benefit of the additional feature encoding layer. SVM performs very close to HELM for but as the inherent dimensionality of the signal increases, to , its accuracy is strongly reduced, achieving only slightly better than random detection on faults (1) and (3). Using PCA as a first step for dimension reduction proved to decrease significantly the models' efficiency for , for faults impacting the base signals. On fault (5), where the sensors themselves are impacted by a fault, PCA tends to improve the results, yet not significantly. The non-linear relationships between signals and faults seem to limit the efficiency and impact of PCA. Last, the Deep Belief Network shows robust results for or , but with low accuracies.
In Table 2, in addition to the accuracies, the magnification coefficient is computed as the ratio between the 99% quantile of the output value and the decision threshold of Eq. (8). This coefficient indicates the model robustness: a higher magnification corresponds to a lower sensitivity to the threshold. Results show that HELM is the most robust model: with a single set of hyperparameters, it achieves almost optimal results on all faults, with accuracies consistently around 95% and very strong magnification coefficients (up to around 100). SVM achieves accuracies close to those of HELM, yet it is up to 100 times slower to train and its magnification is very low. Performing PCA before the SVM does reduce the time needed for training and increases the magnification, but it lowers the accuracies.
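The magnification coefficient defined above is a one-line computation. In the sketch below, the assumption is that the 99% quantile is taken over the model outputs on the faulty test data; the function name `magnification` and the toy values are illustrative.

```python
import numpy as np

def magnification(outputs, threshold, q=99.0):
    """Ratio between the q-th percentile of the model output and the decision threshold."""
    return np.percentile(outputs, q) / threshold

# Toy outputs evenly spread between 0 and 10, with a threshold of 0.5.
outputs = np.linspace(0.0, 10.0, 101)
mag = magnification(outputs, threshold=0.5)
```

A coefficient well above 1 means that the threshold could be moved substantially without changing the detection outcome, which is the sense in which it measures robustness.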
4 Real-Application Case Study: Generator Fault Detection
4.1 Power Plant Condition Monitoring
For the real case study, an H2-cooled generator from an electricity-producing plant is evaluated. The generator is working on nominal load, with changing operating conditions and thus variations in the parameters. The evaluated dataset covers nine months (275 days) with 310 sensors recording every 5 minutes ( and ). Sensors are grouped in 5 families: The rotor shaft voltage is used to detect shaft grounding problems, shaft rubbing, electroerosion, bearing isolation problems and rotor inter-turn shorts. Rotor flux is used to detect the occurrence, the magnitude and the location of the rotor winding inter-turn short circuit. Partial discharge is used to detect aging of the main insulation, loose bars or contact as well as contamination. End winding vibration is mainly used to detect deterioration in mechanical stiffness of the overhang support system. Last, the stator bar water temperature is also monitored.
The generator failure mode tested here has been explained a posteriori by experts as follows: first, at day 169, an intermittent short circuit in the rotor that remained undetected, denoted in the following as the lower-level fault; it then developed into a second faulty state, or upper-level fault, at day 247, when the short circuit worsened to a continuous one. This led to the power plant shut-down.
As in the previous case study, the data are split in three: for training and validation (first 120 days with a random sampling ) and for testing (from day 120 to day 169 to count FP, the rest to count TP). The dataset is hence split as follows: , and . Recall that . As results need to be assessed on a single experiment, in the present case study, the TP and FP are redefined as the proportion of points for which the model output exceeds the threshold after and before day 169 respectively.
| Acc | TP (%) | FP (%) | Mag | Time (s) |
Table 3 summarises the accuracy, percentage of TP and FP, the magnification coefficient and the training time for the best performing model of each family. In addition, the distance to the normal class () of these models is illustrated in Figure 8. In accordance with the results obtained so far, HELM provides the best performance. Its accuracy is higher and it is a robust model, with a strong magnification (20), and easily trained. The one-class classifier alone performs noticeably worse: numerous FP are raised and TP are scarce, which is consistent with its much lower magnification. SVM provides good performance with no FP and a good rate of TP, yet its accuracy is lower than that of HELM and its magnification is very small (). SVM is the least robust with respect to the threshold definition and is the hardest to train (300 times slower). Performing a PCA before the classification also worsened the results, which could be the consequence of inherent non-linear relationships between the measurements. The Deep Belief Network also has a lower accuracy than HELM but it has a stronger magnification.
Based on Fig. 8, for all models but SVM the distance to normal class increases after the lower level fault (from day 169 to day 247) and again after the upper-level fault (after day 247). The output of the one-class classifier behaves similarly to a health indicator.
4.4 HELM versus Expertise
Experts who analysed the power plant operation could identify the fault starting at day 169 thanks to eight particular signals. Using these signals as inputs to the ELM and SVM models mimics the traditional machine learning approach using engineered features as input. Doing so, ELM achieves an accuracy of 0.88 (TP: 75.1%, FP: 0.0%) and a magnification of 20, and SVM an accuracy of 0.8 (TP: 60.2%, FP: 0.0%) and a magnification of 6. This approach brings ELM results very close to those achieved by HELM in the previous section. For SVM, accuracy decreases but the higher magnification now makes it possible to distinguish between lower- and higher-level faults.
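As a quick arithmetic check, the accuracies quoted above are consistent with the mean of the TP rate and the true-negative rate (one minus the FP rate). This reading is an inference from the reported numbers, not a definition taken from the source, and the function name `balanced_accuracy` is illustrative.

```python
def balanced_accuracy(tp_rate, fp_rate):
    """Mean of the true-positive rate and the true-negative rate (1 - FP rate)."""
    return 0.5 * (tp_rate + (1.0 - fp_rate))

# Rates reported for the models fed with the eight expert-selected signals.
elm_acc = balanced_accuracy(0.751, 0.0)
svm_acc = balanced_accuracy(0.602, 0.0)
```

Under this reading, ELM gives 0.8755 and SVM gives 0.801, which round to the reported 0.88 and 0.8.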
This approach, however, is based on the signals selected by the experts a posteriori to the fault. It could probably detect similar faults in the future but it may miss other types of faults, impacting other signals. This flaw is not shared by HELM which does not require any a posteriori knowledge.
5 Discussion and Conclusions
Both case studies analysed in this research demonstrate a better performance of HELM compared to that of other methods. Tested on high dimensional signals with varied characteristics and faults, HELM raises consistent and robust alarms in faulty system conditions. It also enables tracking in time of the evolution of the faulty system conditions. The other tested methods are more impacted by the dimensionality of the input data, the characteristics of the signals and of faults. The dimensionality reduction step with PCA, for example, has proven in other experiments to be particularly suitable for signals with linear relationships. However, in the presence of non-linear relationships, the ability of PCA to extract non-correlated components decreases significantly. Other deep learning approaches, such as DBN, can also provide excellent results but are usually highly dependent on the selection of hyperparameters. Experimentally, training an efficient DBN necessitates a lot of fine tuning and has proved to be a difficult task.
On both case studies, HELM demonstrates a performance that is either superior or similar to that of the other approaches on all of the datasets. It is, in addition, much faster and more robust. The generator case study demonstrated in addition that HELM is able to learn the relevant features also on real condition-monitoring signals and to detect the fault at its earliest stage, with higher accuracy than with manually selected features. This is a promising advantage as the dimensionality of datasets tends to increase rapidly: features become harder to engineer and signal selection becomes more challenging.
HELM integrates a feature learning step which extracts the high-level representations of the condition-monitoring signals and improves the detection accuracy. HELM is able to handle raw condition-monitoring signals and does not require knowledge of the faults as it is trained on normal system state data solely. Its output can be interpreted as a distance to the normal system condition, or health indicator, and provides, as such, more information over time than a typical binary decision process. By detecting very early anomalies, months before faults happen, as in the generator case study, the savings in terms of operational costs can reach millions of euros. HELM therefore satisfies the many requirements of a good decision support tool for diagnostics engineers.
The proposed approach has shown to be particularly beneficial for industrial applications with high-dimensional signals having complex relationships, experiencing very few but critical faults and lacking representative datasets covering all possible operating conditions and fault types.
For future work, the question of a priori hyperparameter setting remains open. While in this study, results were presented for the models that would maximise the accuracies a posteriori, it is crucial for diagnostics engineers to know how to set the model prior to the occurrence of faulty system conditions. Nonetheless, we have seen in the simulated case that HELM trained with a good set of parameters is able to detect many different kinds of faults. Therefore, inducing artificial faults on real condition monitoring data could help in finding a good set of hyperparameters that will likely be valid for other types of faults. The investigation of the impact of the number of autoencoder layers on the feature learning performance is also left for future work. In this contribution, a single feature learning layer was sufficient to demonstrate a superior performance. Additional feature learning layers could improve results or be necessary if the dimensionality of the dataset is higher than in the present case. This would add a further hyperparameter to be selected.