Feature Learning for Fault Detection in High-Dimensional Condition-Monitoring Signals

Preliminary versions of this work were presented in [michau_deep_2017]. Work supported by the Swiss Commission for Technology and Innovation under grant number 17833.2 PFES-ES.

Gabriel Michau, ETH Zürich, Swiss Federal Institute of Technology, Zurich, Switzerland (email: gmichau@ethz.ch)
Yang Hu, Science and Technology on Complex Aviation Systems Simulation Laboratory, Fengtai District, Beijing, 100076, China
Thomas Palmé, General Electric (GE) Switzerland, Brown Boveri Str. 7, Baden, 5401, Switzerland
Olga Fink, ETH Zürich, Swiss Federal Institute of Technology, Zurich, Switzerland

Complex industrial systems are continuously monitored by a large number of heterogeneous sensors. The diversity of their operating conditions and the possible fault types make it impossible to collect enough data for learning all the possible fault patterns.

The paper proposes an integrated automatic unsupervised feature learning approach for fault detection that uses only healthy condition data for its training. The approach is based on stacked Extreme Learning Machines (Hierarchical ELM, or HELM) and comprises stacked autoencoders performing unsupervised feature learning, and a one-class classifier monitoring the variations in the features to assess the health of the system.

This study provides a comprehensive evaluation of HELM fault detection capability compared to other machine learning approaches, including Deep Belief Networks. The performance is first evaluated on a synthetic dataset with typical characteristics of condition monitoring data. Subsequently, the approach is evaluated on a real case study of a power plant fault.

HELM demonstrates a better performance specifically in cases where several non-informative signals are included.



1 Introduction

Machine learning and artificial intelligence have recently achieved major breakthroughs, leading to significant progress in many domains, including industrial applications [bengio_representation_2013, Shao2017, Valada2018, Li2018]. One of the major enablers has been the progress achieved in automatic feature learning, also known as representation learning [bengio_representation_2013]. It improves the performance of machine learning algorithms while limiting the need for human intervention. Feature learning aims at transforming raw data into more meaningful representations that simplify the main learning task.

Traditionally, features were engineered manually and were thereby highly dependent on the experience and expertise of domain experts. This has limited the transferability, generalization ability and scalability of the developed machine learning applications [Khan2018]. The feature engineering task is also becoming even more challenging as the size and complexity of data streams capturing the condition of industrial assets are growing. These challenges have fostered the development of data-driven approaches with automatic feature learning, such as Deep Learning that allows for end-to-end learning [Khan2018].

Feature learning techniques can be classified into supervised and unsupervised learning. While supervised learning, in its traditional meaning, requires examples of “labels” to be learned, supervised feature learning refers instead to cases where features are learned while performing a supervised learning task, such as fault classification or remaining useful life (RUL) estimation in prognostics and health management (PHM) applications. Recently, deep neural networks have been used for such integrated supervised learning tasks, including Convolutional Neural Networks (CNN) [Ince2016, Krummenacher2018, Li2018b] and different types of recurrent neural networks, e.g., Long Short-Term Memory (LSTM) networks [Zhao2017b]. The performance achieved with such end-to-end feature learning architectures on commonly used PHM benchmark datasets, such as the CMAPSS RUL predictions, was shown to be superior to traditional feature engineering applications [Li2018, Zhang2018].

In contrast to supervised learning, unsupervised feature learning does not require any additional information on the input data and aims at extracting relevant characteristics of the input data itself. Examples of unsupervised feature learning include clustering and dictionary learning. Deep neural networks have also been used for unsupervised feature learning tasks in PHM applications, e.g., with Deep Belief Networks [Chen2017, Zhao2017] and with different types of autoencoders [Chen2017], e.g., Variational Autoencoders [Yoon2017].

Even though the approaches described above applied either supervised or unsupervised feature learning within their applications, the subsequent step of using the learned features within fault detection, diagnostics or prognostics applications has always required labeled data to train the algorithms. Yet, while data collected in PHM are certainly massive, they often span a very small fraction of the equipment lifetime or operating conditions. In addition, when systems are complex, the potential fault types can be extremely numerous and of varied natures and consequences. In such cases, it is unrealistic to assume data availability on each and every fault that might occur, and these characteristics challenge the ability of data-driven approaches to actually learn the faults without having access to a representative dataset of possible fault types.

The methodology proposed in this paper addresses this situation. Using healthy data only, as they are usually largely available, it first applies an unsupervised feature learning step, which provides condensed and informative features to a second learning step, in which the health of the system is estimated and subsequently used to monitor the system condition and detect faults. This health monitoring concept also differs from other approaches that proposed to use health indicators (HIs) to monitor the system condition [Malhotra2016], as those HIs were learned in a supervised way. The proposed approach is able to measure the distance to the healthy system condition and thereby also to distinguish different degrees of fault severity. To the best of our knowledge, this is the first unsupervised deep learning approach enabling efficient health monitoring trained solely with healthy condition monitoring data.

The approach presented here takes advantage of the recent advances in using multi-layer extreme learning machines for representation learning [yang_autoencoder_2017]. The novelty here is the use of hierarchical extreme learning machines (HELM) [cao_building_2016, michau_deep_2017, Michau_2018b]. HELM consists in stacking sparse autoencoders, whose neurons can directly be interpreted as features, and a one-class classifier, which aggregates the features into a single indicator representing the health of the system. The very good learning abilities of HELM have already been used in other fields, e.g., in medical applications [miotto_deep_2016] and in PHM [michau_deep_2017]. This study provides a comprehensive evaluation of HELM detection capability compared to other machine learning approaches, including Deep Belief Networks.

The paper is organised as follows: Section 2 details the HELM theory and the algorithms used for this research. Section 3 presents a simulated case study designed to test and compare HELM to other traditional PHM models. Controlled experiments allow for quantified results and comparisons. A real application is analysed in Section 4, with data from a power plant experiencing a generator inter-turn failure. Finally, results and perspectives are discussed in Section 5.

2 Framework of Hierarchical Extreme Learning Machines (HELM)

HELM is a multilayer perceptron network, which was first proposed in [tang_extreme_2016]. The structure of HELM consists in stacking single-layer neural networks hierarchically so that each one can be trained independently. In this work, we propose to first stack compressive autoencoders, used here for unsupervised feature learning, and then to use the learnt features as input to a one-class classifier that aggregates them into a relevant indicator [michau_deep_2017] that can be interpreted as a health indicator. This section details the theoretical background of HELM and highlights the adaptations needed to match the specific requirements of PHM applications.

2.1 Notations

In the following, , , and refer respectively to signals, model outputs, targets and labels. is the signal dataset dimensions, represents the number of samples and the dimension of the model output. When needed, the superscript notation is used to discriminate between variables specific to training data (e.g., ), validating data (e.g., ) or testing data (e.g., ).

For single layer feed-forward networks, with hidden neurons, and refer respectively to the weights and biases between inputs and the hidden layer neurons. , refers to the weights between the hidden layer neurons and outputs. is the hidden layer matrix such as where is the activation function.

2.2 Theory and Training of ELM

Extreme Learning Machines (ELM) are Single-hidden Layer Feed-forward Neural networks (SLFNs) [huang_extreme_2004].

SLFN equation is usually written as


where represents the input, the output, the activation function, and , and weights and biases respectively.

The whole theory behind ELM relies on the proof that, once set, a linear kernel, and a training set where represents the target output, the network output can approximate with any accuracy given, first, enough neurons (that is, the length and width of , and respectively) and, second, randomly sampled and [huang_universal_2006]. That is, for any , for and for enough neurons, exists such that:


where denotes the hidden layer matrix .

In the case of traditional feed-forward neural networks, the weights , and are optimized over several iterations, usually using a back-propagation algorithm. By sampling randomly , the input weights, and , the bias, training an ELM only consists in finding the optimal output weights . This simplifies Problem (3) considerably as it is now formulated as the single-variable regularised convex optimisation in Problem (4).


where , and is the weight of the regularization. represents a compromise between the learning objective and properties of that one would like to impose (e.g., sparsity, non-diverging coefficient, etc…). Any combination of , , , leads to a different solution of Problem (4), it is however usual to take .

When and , Equation (4) is known as the ridge regression problem (also referred to as Tikhonov regularisation problem) [huang_what_2015], and it has an explicit solution:
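As an illustration of the ridge solution, the following NumPy sketch trains a single ELM. The function names (`elm_fit`, `elm_predict`), the sigmoid activation, the uniform sampling range, and the convention that `C` multiplies the squared norm of the output weights, so that the output weights are (HᵀH + C·I)⁻¹HᵀT, are assumptions made for the example and are not taken from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, T, n_hidden, C=1e-5, seed=0):
    """Train a single-hidden-layer ELM with a ridge penalty (assumed convention)."""
    rng = np.random.default_rng(seed)
    # Input weights and biases are sampled once and never updated.
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = sigmoid(X @ W + b)  # hidden layer matrix
    # Closed-form ridge solution for the output weights.
    beta = np.linalg.solve(H.T @ H + C * np.eye(n_hidden), H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta

# Toy regression problem (hypothetical data).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
T = np.sin(X[:, :1]) + 0.5 * X[:, 1:2]
W, b, beta = elm_fit(X, T, n_hidden=50)
Y = elm_predict(X, W, b, beta)
train_mse = float(np.mean((Y - T) ** 2))
```

Only `beta` is learned; `W` and `b` stay at their random initial values, which is what makes the training a single linear solve.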


2.3 ELM-Autoencoder for Feature Learning

Autoencoders (AE) are unsupervised machine learning tools trained to reconstruct their input. Structural or mathematical constraints ensure that they do not simply learn the identity operation; examples include adding noise to the input (the autoencoder is then called denoising), imposing dimensionality reduction (compressive AE), dimensionality increase, or adding a kernel regression hidden layer (variational AE).

Within neural networks, autoencoders are often part of multilayer architectures: they can be trained independently of the problem at hand and are often used for feature learning [vincent_stacked_2010]. Conceptually, in a compressive AE, the hidden layers learn the best few combinations of the different measurements that can explain all the signals. The ELM framework can be used to train single-layer autoencoder neural networks.

In this case, it consists in solving Problem (4) with the target equal to the input . An intuition on feature relevance is that if each feature is connected to few measurements only, each feature should maximise the information it contains on part of the system. This sparse connectivity between features and reconstructed inputs can be achieved using the ℓ0-norm for the regularization term (). Yet, this would make Problem (4) non-convex and require an integer programming solver. A surrogate is the ℓ1-norm, which leads to the LASSO problem, one of the classic convex optimisation problems in the literature [Hastie_2017]. One can note that the ℓ2-norm has a closed-form solution and is thus much easier to compute (cf. Equation (5)); yet, the solution is likely to be dense and redundant [cambria_extreme_2013] and thus not compatible with our idea of a good set of features.

The AE-ELM consists finally in solving the typical LASSO problem:


In this paper, we solve this problem using the FISTA algorithm (Fast Iterative Shrinkage-Thresholding Algorithm) for both its fast convergence and computational simplicity [beck_fast_2009]. The process for solving Problem (4) is detailed in Algorithm (1).

Algorithm 1: FISTA (iterates while a convergence criterion, Crit, is not satisfied)
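Since only the caption of the algorithm is legible here, the following is a minimal NumPy sketch of FISTA applied to a LASSO problem, following the general scheme of [beck_fast_2009]. The objective scaling (a factor 0.5 on the squared error), the stopping rule, the function names and the toy data are assumptions for illustration and may differ from the authors' implementation.

```python
import numpy as np

def soft_threshold(Z, tau):
    """Proximal operator of tau * ||.||_1 (element-wise shrinkage)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def fista_lasso(H, X, lam, n_iter=500, tol=1e-8):
    """Minimise 0.5*||H @ B - X||_F^2 + lam*||B||_1 over B with FISTA."""
    L = np.linalg.norm(H, 2) ** 2  # Lipschitz constant of the gradient
    B = np.zeros((H.shape[1], X.shape[1]))
    Z, t = B.copy(), 1.0
    for _ in range(n_iter):
        grad = H.T @ (H @ Z - X)
        B_new = soft_threshold(Z - grad / L, lam / L)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        # Momentum step on the auxiliary point.
        Z = B_new + ((t - 1.0) / t_new) * (B_new - B)
        crit = np.linalg.norm(B_new - B)
        B, t = B_new, t_new
        if crit < tol:
            break
    return B

def lasso_objective(H, X, B, lam):
    return 0.5 * np.sum((H @ B - X) ** 2) + lam * np.sum(np.abs(B))

# Toy problem with a sparse ground truth (hypothetical data).
rng = np.random.default_rng(0)
H = rng.normal(size=(100, 20))
B_true = np.zeros((20, 4))
B_true[:3] = rng.normal(size=(3, 4))
X = H @ B_true + 0.01 * rng.normal(size=(100, 4))
B_hat = fista_lasso(H, X, lam=1.0)
```

In the autoencoder setting, `H` would be the random hidden layer and `X` the input to be reconstructed, so that `B_hat` plays the role of the sparse output weights.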

2.4 Health Indicator for Fault Detection

In the context of critical and complex systems, the high number of possible faults due to the large number of parts involved in the system, the rarity of certain faults (if ever experienced at all), and, overall, the (understandable) lack of willingness from operators to let the faults happen repeatedly for data collection, make it impossible to aim for fault recognition or classification. Instead, we propose to focus on abnormality detection, training a one-class classifier ELM [leng_one-class_2015] on features from healthy data solely.

In this case, we use the ridge regression problem (5) with (). Then, the distance between the output and the normal class () is monitored. A threshold on this distance discriminates between the normal and the faulty class. As such, this distance is similar to a health indicator [hu_deep_2016]. We propose here to base the threshold on a validation set , that is, a set of data points from healthy conditions not used in the training. Experiments have shown that a good threshold can be designed as:


where is the th-percentile function and, are hyperparameters.

Then the actual class of a sample ( for healthy, otherwise) can be devised with the following equation:


The choice of both and is to some extent a single problem, consisting in fact in choosing a robust threshold. In this paper, we chose to fix and find the best by optimising on the number of true and false positives (cf. Section 3). Experiments will show that gives good results.
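The thresholding logic can be sketched as follows. The healthy target value of 1, the absolute difference used as distance, the default values of `gamma` and `p`, and all function names are assumptions chosen for the example; the text only states that the threshold is built from a percentile of the validation distances and a multiplicative factor.

```python
import numpy as np

def health_distance(y, target=1.0):
    """Distance of the classifier output to the healthy target (assumed to be 1)."""
    return np.abs(target - np.asarray(y, dtype=float))

def fit_threshold(y_val, gamma=1.5, p=99.5):
    """Threshold = gamma times the p-th percentile of the validation distances."""
    return gamma * np.percentile(health_distance(y_val), p)

def is_faulty(y, threshold):
    """Flag samples whose distance to the healthy target exceeds the threshold."""
    return health_distance(y) > threshold

# Hypothetical classifier outputs.
rng = np.random.default_rng(0)
y_val = 1.0 + 0.02 * rng.normal(size=1000)   # healthy validation data
y_test = np.array([1.01, 0.97, 1.40, 0.55])  # last two drift away from 1
thr = fit_threshold(y_val)
flags = is_faulty(y_test, thr)
```

Because the threshold is estimated on healthy data only, the factor `gamma` acts as a safety margin against false alarms.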

2.5 Stacking AE(s) and a one-class classifier for HELM

A traditional approach to benefit from AE performance in deep learning is to pre-train an AE before using it in a deeper classification architecture, itself trained and fine-tuned using back-propagation (BP) [lecun_deep_2015]. HELM consists instead in training each layer sequentially and independently. The hidden layer of each successive ELM becomes the input of the next. This avoids well-known limitations of BP, namely the risk of exploding or vanishing gradients and the intensive computations required for the training [vincent_stacked_2010, hinton_fast_2006]. In HELM, the lower layers are autoencoders, while the last layer is a one-class classifier, as illustrated in Figure 1.

Figure 1: HELM architecture: example with a single AE whose hidden layer is used as the input of the next layer, a one-class classifier.

If denotes the number of stacked autoencoders, then training the HELM corresponds to Algorithm 2. Running the HELM, for validation and testing, consists in applying Algorithm 3.

  for each stacked autoencoder do (Stacked AE ELM)
      Generate: random weights
      Output weights (cf. Alg. 1)
  Upper layer ELM:
      Generate: random weights
      Output weights (cf. Eq. (5))
Algorithm 2: HELM Training

  for each stacked autoencoder do
Algorithm 3: Running HELM
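The training and running procedures can be sketched end to end in NumPy as below. This is an illustrative sketch under several assumptions: sigmoid activations, uniform random weights, a compact FISTA solver with a fixed number of iterations, a healthy target equal to 1, and the common H-ELM convention of computing the learned features by projecting the layer input through the transposed autoencoder output weights. The names `helm_train` and `helm_run` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lasso_fista(H, X, lam, n_iter=200):
    """FISTA for 0.5*||H @ B - X||_F^2 + lam*||B||_1 (fixed iteration count)."""
    L = np.linalg.norm(H, 2) ** 2
    B = np.zeros((H.shape[1], X.shape[1]))
    Z, t = B.copy(), 1.0
    for _ in range(n_iter):
        G = Z - H.T @ (H @ Z - X) / L
        B_new = np.sign(G) * np.maximum(np.abs(G) - lam / L, 0.0)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        Z = B_new + ((t - 1.0) / t_new) * (B_new - B)
        B, t = B_new, t_new
    return B

def helm_train(X, ae_sizes, n_oc, lam=1e-2, C=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    ae_betas, F = [], X
    for n_ae in ae_sizes:                        # stacked AE-ELMs
        W = rng.uniform(-1, 1, (F.shape[1], n_ae))
        b = rng.uniform(-1, 1, n_ae)
        H = sigmoid(F @ W + b)
        beta = lasso_fista(H, F, lam)            # reconstruct the layer input
        ae_betas.append(beta)
        F = sigmoid(F @ beta.T)                  # learned features (assumed convention)
    W = rng.uniform(-1, 1, (F.shape[1], n_oc))   # one-class ELM on top
    b = rng.uniform(-1, 1, n_oc)
    H = sigmoid(F @ W + b)
    T = np.ones((X.shape[0], 1))                 # healthy target, assumed equal to 1
    beta_oc = np.linalg.solve(H.T @ H + C * np.eye(n_oc), H.T @ T)
    return ae_betas, (W, b, beta_oc)

def helm_run(X, model):
    ae_betas, (W, b, beta_oc) = model
    F = X
    for beta in ae_betas:
        F = sigmoid(F @ beta.T)
    return (sigmoid(F @ W + b) @ beta_oc).ravel()

# Hypothetical healthy data: 10 sensors driven by 2 latent signals.
rng = np.random.default_rng(1)
A = rng.normal(size=(2, 10))

def make_healthy(n_samples):
    latent = rng.uniform(size=(n_samples, 2))
    return latent @ A + 0.05 * rng.normal(size=(n_samples, 10))

X_train, X_test = make_healthy(500), make_healthy(100)
model = helm_train(X_train, ae_sizes=[5], n_oc=50)
d_train = np.abs(1.0 - helm_run(X_train, model))
d_test = np.abs(1.0 - helm_run(X_test, model))
```

The distances `d_train` and `d_test` are the quantities that would be compared against the detection threshold; no back-propagation is involved at any stage.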

In the following, HELM with a single AE is tested on a simulation and on a real case study. HELM is compared to five other feature extraction and fault detection approaches: first, a one-class classifier ELM and a Support Vector Machine (SVM) with the raw signal as an input (no feature learning stage); second, the same two methods applied to the results of a Principal Component Analysis (PCA); last, a Deep Belief Network (DBN) composed of two stacked Restricted Boltzmann Machines (RBMs): one for feature learning and a second whose hidden layer is fully connected to a single output neuron trained as a one-class classifier.

Remark about the one class classifier SVM: Note that usually, one class classifier SVM outputs the likelihood for data points to belong to the main class rather than being outliers. The traditional interpretation is thus to label positive likelihood points as “normal data” and negative likelihood points as outliers or “faulty”. Yet this proved to give very bad detection performance on the applied datasets and we proposed, for a fairer comparison, to use the decision rule (8), using the negative likelihood as the model output distance . This is, however, not exactly a distance as it can take negative values.

3 Simulated Case Study

To evaluate the performances of the proposed approaches under known and controlled conditions, a simulated case study is designed. The synthetically generated datasets aim to represent behaviors observed in real condition-monitoring signals and to simulate varied faults.

3.1 Description

Methodology: To simulate the behaviour of a real complex industrial system, a set of artificial datasets have been generated, each composed of training, validating and testing data. It is assumed that the simulated system has a number of intrinsic properties (e.g., temperature, rotation, voltage etc…) that are here simulated with distinct random strictly positive signals . Then, sensors are simulated, each sensor reading one of the signals according to an internal reading function and a noise . Here, is either a noised affine function , or a logarithm, often used for intensity measurements, , where is the signal (among ) read by the sensor , models the sensitivity of the sensor and is drawn randomly and is an additive random Gaussian-noise of .

At a given time step a fault is simulated which can impact one base signal in (Fault (1) to (4)) or the sensors themselves (Fault (5)). Simulated faults are:

  1. 20% amplitude change: the signal is multiplied by 1.2

  2. 50% amplitude change: the signal is multiplied by 1.5

  3. 20% stepwise deviation: a constant value computed as 20% of the signal amplitude is added to the signal

  4. new: a new random signal is generated and used instead

  5. alternatively, the fault can instead impact 5% of the sensors by increasing the reading amplitude by 20% (similarly to fault (1) but at the sensor level).
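The generation procedure and the five faults can be sketched as follows. Several choices are assumptions, since the text does not fix them: base signals are built from a rescaled random walk, the affine reading uses a gain only, the "signal amplitude" of fault (3) is taken as the peak-to-peak range, and all function and variable names are hypothetical.

```python
import numpy as np

def base_signals(n_steps, n_base, rng):
    """Strictly positive random signals (a rescaled random walk is an assumption)."""
    walk = np.cumsum(rng.normal(size=(n_steps, n_base)), axis=0)
    walk -= walk.min(axis=0)
    return 1.0 + walk / walk.max(axis=0)  # values in [1, 2]

def read_sensors(S, source, gain, kind, noise_std, rng):
    """Each sensor reads one base signal through an affine or logarithmic function."""
    raw = gain * S[:, source]
    out = np.log(raw) if kind == "log" else raw
    return out + rng.normal(scale=noise_std, size=out.shape)

def apply_fault(S, fault, idx, rng):
    """Faults (1) to (4) act on the base signal with index `idx`."""
    S = S.copy()
    if fault == 1:
        S[:, idx] *= 1.2
    elif fault == 2:
        S[:, idx] *= 1.5
    elif fault == 3:
        S[:, idx] += 0.2 * (S[:, idx].max() - S[:, idx].min())
    elif fault == 4:
        S[:, idx] = base_signals(S.shape[0], 1, rng)[:, 0]
    return S

def apply_sensor_fault(readings, rng, fraction=0.05, factor=1.2):
    """Fault (5): a fraction of the sensors read 20% too high."""
    n_sensors = readings.shape[1]
    hit = rng.choice(n_sensors, size=max(1, int(fraction * n_sensors)), replace=False)
    faulty = readings.copy()
    faulty[:, hit] *= factor
    return faulty, hit

rng = np.random.default_rng(0)
S = base_signals(1000, 5, rng)
n_sensors = 100
source = rng.integers(0, 5, n_sensors)   # base signal read by each sensor
gain = rng.uniform(0.5, 2.0, n_sensors)  # sensor sensitivity
healthy = read_sensors(S, source, gain, "affine", 0.01, rng)
faulty_1 = read_sensors(apply_fault(S, 1, 0, rng), source, gain, "affine", 0.01, rng)
faulty_5, hit = apply_sensor_fault(healthy, rng)
```

Passing `"log"` instead of `"affine"` to `read_sensors` switches to the logarithmic measurement model.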

The signals before the fault are split into training and validation data. The latter dataset is used to estimate the detection threshold as per Equation (8). Among the testing data, a non-faulty set enables the computation of True Negatives (TN) (complementary to the False Positives (FP)) while the True Positives (TP), complementary to the False Negatives (FN) are obtained from the datasets with faults (1) to (5).

For each dataset, if the threshold in Equation (8) is exceeded at least once, then it is considered either as TP (if the dataset has a fault) or as FP otherwise.

For each model and for each simulated fault independently, the accuracy result is optimised with a grid search over the hyperparameters, where the accuracy is defined as:


where is the experiment number.
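The dataset-level counting rule and the accuracy can be illustrated with a short sketch. The balanced form used here, the mean of the TP rate and the TN rate, is an assumption; it is, however, consistent with the figures reported in Table 3 (for instance, TP 89.1% and FP 0.0% give 0.95). Function names and the toy distances are hypothetical.

```python
import numpy as np

def dataset_alarm(distances, threshold):
    """A dataset raises an alarm if the threshold is exceeded at least once."""
    return bool(np.any(np.asarray(distances) > threshold))

def accuracy(tp_rate, fp_rate):
    """Mean of TP rate and TN rate, with rates given as fractions in [0, 1]."""
    return 0.5 * (tp_rate + (1.0 - fp_rate))

# Four healthy and four faulty experiments (hypothetical distances).
healthy_runs = [[0.2, 0.4], [0.3, 1.2], [0.1, 0.5], [0.6, 0.9]]
faulty_runs = [[0.8, 1.7], [1.1, 2.3], [0.4, 0.7], [1.5, 0.2]]
fp = np.mean([dataset_alarm(d, 1.0) for d in healthy_runs])  # 1 alarm out of 4
tp = np.mean([dataset_alarm(d, 1.0) for d in faulty_runs])   # 3 alarms out of 4
acc = accuracy(tp, fp)
```

With this definition, a detector that always or never raises an alarm scores 0.5, which matches the reading of 50% as random detection.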

Hyperparameters optimised in the process are:

  • HELM: (), respectively the ridge regression constant, the norm penalisation, the number of neurons for the autoencoder, the number of neurons for the one class classifier and the factor used in the decision rule (8).

  • ELM: ().

  • PCA: (), the number of principal components selected in the PCA (up to 99% of explained variance).

  • SVM: (), the Gaussian kernel scale, the number of outliers used for the training and the decision rule (8) factor.

  • DBN: (), the number of neurons and the factor used in the decision rule (8).

Datasets: The experiments are repeated 100 times (). For each experiment, there are training points, validation points, and six sets of testing points: one non-faulty and one for each of the faults (1) to (5).

Impact of randomness on (H)ELM efficiency: Experiments demonstrated that, for ELM-based methods, averaging results over 5 independent training runs gives slightly more robust results, without changing the conclusions drawn from the case study. This is important for practitioners, as HELM is inexpensive to train and run and as reducing the impact of randomness on the results is a valid concern.

3.2 Results

The accuracies of the six models for the five fault types and for the two measurement models (affine and logarithmic) are summarised in Tables 1 and 2. Table 1 presents the best accuracies achieved independently for each fault, for n = 5 and n = 10 affected base signals, while Table 2 presents the results for the models with the best average accuracy over the 5 fault types and demonstrates the robustness and generalisation performance of HELM.

Fault type   (1)       (2)       (3)       (4)       (5)
n            5    10   5    10   5    10   5    10   5    10
HELM         96   75   100  99   95   77   94   94   100  100
ELM          81   66   98   93   83   70   91   92   99   98
PCA ELM      63   64   87   85   68   65   91   92   100  100
SVM          86   61   100  98   88   66   93   93   98   98
PCA SVM      69   68   98   92   76   70   93   90   100  100
DBN          70   68   84   79   67   66   84   86   87   87

Fault type   (1)       (2)       (3)       (4)       (5)
n            5    10   5    10   5    10   5    10   5    10
HELM         93   73   100  96   94   73   96   94   100  100
ELM          67   67   89   92   65   69   90   92   100  98
PCA ELM      58   64   84   78   58   61   92   89   100  100
SVM          92   59   100  98   93   60   95   93   100  100
PCA SVM      66   57   98   90   65   61   94   91   100  100
DBN          65   62   78   71   64   60   82   84   90   90

Table 1: Accuracies for the different models, and affected base signals for fault types (1) to (5)

Based on Table 1, and as expected, accuracies are globally lower for higher n, as the underlying complexity of the dataset increases. Accuracies are also globally lower for the logarithmic sensor measurement function. HELM achieves very good detection performance on all fault types and is the most efficient approach for detecting faults that impact the base signals. HELM provides consistent results both for the different numbers of affected base signals and for the two different measurement functions.

Compared to ELM, HELM performs consistently better, which demonstrates the benefit of the additional feature encoding layer. SVM performs very close to HELM for n = 5, but as the inherent dimensionality of the signal increases to n = 10, its accuracy is strongly reduced, achieving just better than random detection on faults (1) and (3). Using PCA as a first step for dimension reduction significantly decreased the models' efficiency for n = 5 for faults impacting the base signals. On fault (5), where the sensors themselves are impacted by a fault, PCA tends to improve the results, yet not significantly. The non-linear relationships between signals and faults seem to limit the efficiency and impact of PCA. Last, the Deep Belief Network shows robust results for n = 5 or n = 10, but with low accuracies.

In Table 2, in addition to the accuracies, the magnification coefficient is computed as the ratio between the 99% quantile of the output value and the decision threshold of Eq. (8). This coefficient indicates the model robustness: a higher magnification corresponds to a lower sensitivity to the threshold. Results show that HELM is the most robust model: with a single set of hyperparameters, it achieves almost optimal results on all faults, with accuracies consistently around 95% and very strong magnification coefficients (up to around 100). SVM achieves accuracies close to those of HELM, yet it is up to 100 times slower to train and its magnification is very low. Performing PCA before the SVM does reduce the time needed for training and increases the magnification, but it lowers the accuracies.
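As a worked illustration of the magnification coefficient, the sketch below computes the ratio between the 99% quantile of the model output and the decision threshold. The function name and the toy values are hypothetical, and the output is assumed to be the distance to the normal class on a test set.

```python
import numpy as np

def magnification(distances, threshold, q=99.0):
    """Ratio between the q-th percentile of the output distance and the threshold."""
    return float(np.percentile(distances, q) / threshold)

# Hypothetical distances on a faulty test set and a hypothetical threshold.
distances = np.linspace(0.0, 10.0, 101)
mag = magnification(distances, threshold=2.0)  # 99th percentile is 9.9, so mag is 4.95
```

A value close to 1 means that a small change of the threshold could flip the decision, whereas a large value leaves a wide margin.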

                  (1)         (2)         (3)         (4)         (5)
           Acc    Acc  Mag    Acc  Mag    Acc  Mag    Acc  Mag    Acc  Mag
HELM       95.5   94   2.2    99   6.6    95   2.2    94   85     98   102
ELM        87.6   79   1.37   95   1.88   80   1.34   89   5.4    95   6
PCA ELM    76.6   59   1.78   82   1.99   56   1.63   90   28     97   72
SVM        89.4   80   1.15   99   1.31   81   1.15   93   1.53   96   1.46
PCA SVM    84     65   1.3    98   1.49   65   1.27   93   12     100  29
DBN        74.9   67   1.2    75   1.28   66   1.17   83   1.72   86   2.1

           Hyperparameters            Time (s)
HELM       1.5  20  100  0.01  1e-5   0.16
ELM        1.5  400  1e-5             0.10
PCA ELM    1.5  100  15               0.20
SVM        1.1  1  10                 10.5
PCA SVM    1.2  15  1  10             3.3
DBN        1.1  70  100               8.2

Table 2: Accuracies and Magnification on different faults for a single set of hyperparameters. ,

4 Real-Application Case Study: Generator Fault Detection

4.1 Power Plant Condition Monitoring

For the real case study, a hydrogen-cooled generator from an electricity-producing plant is evaluated. The generator is working at nominal load, with changing operating conditions and thus variations in the parameters. The evaluated dataset covers nine months (275 days) with 310 sensors recording every 5 minutes ( and ). Sensors are grouped in 5 families: The rotor shaft voltage is used to detect shaft grounding problems, shaft rubbing, electroerosion, bearing isolation problems and rotor inter-turn shorts. Rotor flux is used to detect the occurrence, the magnitude and the location of the rotor winding inter-turn short circuit. Partial discharge is used to detect aging of the main insulation, loose bars or contact as well as contamination. End winding vibration is mainly used to detect deterioration in mechanical stiffness of the overhang support system. Last, the stator bar water temperature is also monitored.

The generator failure mode tested here has been explained a posteriori by experts as, first, at day 169, an intermittent short circuit in the rotor that remained undetected, denoted in the following as the lower level fault, developing into a second faulty state, or upper level fault, at day 247, when the short circuit worsened to a continuous one. This led to the power plant shut-down.

4.2 Methodology

As in the previous case study, the data are split in three: training and validation (first 120 days, with a random sampling ) and testing (from day 120 to day 169 to count FP, the rest to count TP). The dataset is hence split as follows: , and . Recall that . As results need to be assessed on a single experiment, in the present case study the TP and FP are redefined as the proportion of points for which the model output exceeds the threshold after and before day 169, respectively.

4.3 Results

           Acc    TP (%)   FP (%)   Mag    Time (s)
HELM       0.95   89.1     0.0      20     0.5
ELM        0.63   26.6     1.0      4.8    0.4
PCA ELM    0.57   14.6     0.5      3.1    1.4
SVM        0.87   73.6     0.0      1.04   169
PCA SVM    0.69   39.8     1.2      1.03   41
DBN        0.78   56.2     0.0      40     5.5
Table 3: Performances on Real Dataset
(a) HELM (TP: 89.1% – FP: 0.0%)
(b) ELM (TP: 26.6% – FP: 1.0%)
(c) PCA-ELM (TP: 35.6% – FP: 3.2%)
(d) SVM (TP: 73.6% – FP: 0.0%)
(e) PCA-SVM (TP: 39.8% – FP: 1.2%)
(f) DBN (TP: 56.3% – FP: 0.0%)
Figure 8: Generator Data Case Study: Distance to normal conditions for the 6 approaches against time (days). Blue points represent the training, yellow points the validation and red points the testing. The black horizontal line corresponds to the threshold in Eq. (8). The two vertical black lines represent days 169 and 247. The Y-axis is rescaled such that the threshold is 1.

Table 3 summarises the accuracy, the percentages of TP and FP, the magnification coefficient and the training time for the best performing model of each family. In addition, the distance to the normal class () of these models is illustrated in Figure 8. In accordance with the results obtained so far, HELM provides the best performance. Its accuracy is the highest, and it is a robust model, with a strong magnification (20), that is easily trained. The one-class classifier alone performs noticeably worse: numerous FP are raised and TP are scarce, which is consistent with its much lower magnification. SVM provides good performance, with no FP and a good rate of TP, yet its accuracy is lower than that of HELM and its magnification is very small (1.04). SVM is the least robust with respect to the threshold definition and is the hardest to train (300 times slower). Performing a PCA before the classification also worsened the results, which could be the consequence of inherent non-linear relationships between the measurements. The Deep Belief Network also has a lower accuracy than HELM but it has a stronger magnification.

Based on Fig. 8, for all models but SVM the distance to normal class increases after the lower level fault (from day 169 to day 247) and again after the upper-level fault (after day 247). The output of the one-class classifier behaves similarly to a health indicator.

4.4 HELM versus Expertise

Experts who analysed the power plant operation could identify the fault starting at day 169 thanks to eight particular signals. Using these signals as inputs to the ELM and SVM models mimics the traditional machine learning approach using engineered features as input. Doing so, ELM achieves an accuracy of 0.88 (TP: 75.1%, FP: 0.0%) and a magnification of 20, and SVM an accuracy of 0.8 (TP: 60.2%, FP: 0.0%) and a magnification of 6. This approach brings ELM results very close to those achieved by HELM in the previous section. For SVM, accuracy decreases but the higher magnification now makes it possible to distinguish between lower and higher level faults.

This approach, however, is based on the signals selected by the experts a posteriori to the fault. It could probably detect similar faults in the future but it may miss other types of faults, impacting other signals. This flaw is not shared by HELM which does not require any a posteriori knowledge.

5 Discussion and Conclusions

Both case studies analysed in this research demonstrate a better performance of HELM compared to that of other methods. Tested on high dimensional signals with varied characteristics and faults, HELM raises consistent and robust alarms in faulty system conditions. It also enables tracking in time of the evolution of the faulty system conditions. The other tested methods are more impacted by the dimensionality of the input data, the characteristics of the signals and of faults. The dimensionality reduction step with PCA, for example, has proven in other experiments to be particularly suitable for signals with linear relationships. However, in the presence of non-linear relationships, the ability of PCA to extract non-correlated components decreases significantly. Other deep learning approaches, such as DBN, can also provide excellent results but are usually highly dependent on the selection of hyperparameters. Experimentally, training an efficient DBN necessitates a lot of fine tuning and has proved to be a difficult task.

In both case studies, HELM demonstrates a performance that is either superior or similar to that of the other approaches on all of the datasets. It is, in addition, much faster and more robust. The generator case study demonstrated in addition that HELM is able to learn the relevant features on real condition-monitoring signals as well and to detect the fault at its earliest stage, with higher accuracy than with manually selected features. This is a promising advantage, as the dimensionality of datasets tends to increase rapidly: features become harder to engineer and signal selection becomes more challenging.

HELM integrates a feature learning step which extracts the high-level representations of the condition-monitoring signals and improves the detection accuracy. HELM is able to handle raw condition-monitoring signals and does not require knowledge of the faults, as it is trained solely on normal system state data. Its output can be interpreted as a distance to the normal system condition, or health indicator, and provides, as such, more information over time than a typical binary decision process. By detecting anomalies very early, months before faults happen, as in the generator case study, the savings in terms of operational costs can reach millions of euros. HELM therefore satisfies the many requirements of a good decision support tool for diagnostics engineers.

The proposed approach has been shown to be particularly beneficial for industrial applications with high-dimensional signals having complex relationships, experiencing very few but critical faults and lacking representative datasets covering all possible operating conditions and fault types.

For future work, the question of a priori hyperparameter setting remains open. While in this study results were presented for the models that maximise the accuracies a posteriori, it is crucial for diagnostics engineers to know how to set the model prior to the occurrence of faulty system conditions. Nonetheless, we have seen in the simulated case that HELM trained with a good set of parameters is able to detect many different kinds of faults. Therefore, inducing artificial faults on real condition-monitoring data could help in finding a good set of hyperparameters that will likely be valid for other types of faults. The investigation of the impact of the number of autoencoder layers on the feature learning performance is also left for future work. In this contribution, a single feature learning layer was sufficient to demonstrate a superior performance. Additional feature learning layers could improve results or be necessary if the dimensionality of the dataset is higher than in the present case. This would add a further hyperparameter to be selected.
