Augmenting Monte Carlo Dropout Classification Models with Unsupervised Learning Tasks for Detecting and Diagnosing Out-of-Distribution Faults
Abstract
The Monte Carlo dropout method has proved to be a scalable and easy-to-use approach for estimating the uncertainty of deep neural network predictions. This approach was recently applied to Fault Detection and Diagnosis (FDD) applications to improve the classification performance on incipient faults. In this paper, we propose a novel approach of augmenting the classification model with an additional unsupervised learning task. We justify our choice of algorithm design via an information-theoretical analysis. Our experimental results on three datasets from diverse application domains show that the proposed method leads to improved fault detection and diagnosis performance, especially on out-of-distribution examples including both incipient and unknown faults.
Introduction
Data-driven approaches relying on supervised deep learning have achieved considerable success in many application domains due to their ability to classify data from multiple classes. Although supervised deep learning methods tend to perform well on known (in-distribution) data patterns, unseen (out-of-distribution) data may lead to unexpected prediction behaviors. In the context of Fault Detection and Diagnosis (FDD), labeled data for normal (fault-free) states and high-severity fault states are more easily accessible. On the contrary, labeled data for incipient faults (low-severity faults of known types) are more difficult to obtain [Jin et al.2019a] and are usually missing or underrepresented in the dataset. In addition, in real-world operation, there may be unknown faults that do not belong to any fault type modeled by a classification model. These out-of-distribution faults, not seen by the model in the training phase, may mislead a classification model into confident but wrong predictions, which is undesirable for FDD applications. Although this problem can conceptually be alleviated by training the model on a larger and more comprehensive dataset, in practice it is infeasible to include data of all fault types at all possible severity levels. It is therefore desirable for a diagnostic algorithm to report, in addition to its decisions, uncertainty estimates behind those decisions. Highly uncertain cases can then be flagged as requiring particular attention in future maintenance and repair actions.
The standard way to estimate a neural network's prediction uncertainty is a Bayesian approach whose goal is to learn a distribution over the network weights; however, such an approach is computationally expensive and has not been widely adopted yet. A recent seminal work [Gal2016] discovered that one can approximate the posterior distribution of a dropout neural network by repeatedly sampling its predictions with dropout turned on at test time. This method, referred to as Monte Carlo dropout (MC dropout), provides an efficient and scalable way to perform Bayesian inference that easily fits into the standard training pipelines of today's deep learning frameworks. Owing to its ease of use, the MC dropout approach has been applied to disease diagnosis [Leibig et al.2017] and time series anomaly detection [Zhu and Laptev2017]. Jin et al. [2019] extended the use of MC dropout to a multiclass setting, and showed that with the produced uncertainty estimates, MC dropout can not only help detect incipient faults but also give informative hints about the types of these difficult-to-diagnose faults.
Apart from supervised classification approaches, one category of unsupervised methods called one-class models [Tax2002] is particularly appealing in anomaly detection and FDD applications, because only normal (fault-free) data are required to train a detection model. A one-class model aims to learn the distribution of the data for the normal operating condition of a system. Outliers to the learned distribution are recognized as anomalies or faults. Typical examples of one-class models include autoencoders [Thompson et al.2002] and one-class support vector machines [Schölkopf et al.2001]. In practice, it can sometimes be difficult to train a good one-class classification model [Jin et al.2019a], due to the lack of fault data for cross-validation. On the other hand, even when fault data are available, there is no straightforward way to directly incorporate them into the training process. In addition, as these models can only tell whether or not an input data point belongs to the normal data distribution, they lack the diagnostic ability to differentiate between faults of different types.
The aforementioned reasons motivate us to devise a method that can leverage the strengths of both supervised and unsupervised learning approaches. The resulting model should not only be good at classifying in-distribution data, but should also give reasonable uncertainty estimates alongside its diagnostic decisions when predicting out-of-distribution fault examples. We summarize our main contributions in this paper as follows.

We propose a novel neural network architecture that combines an MC dropout classifier and an autoencoder, thereby leveraging the strengths of supervised and unsupervised learning in one joint-training framework.

We motivate the design choice of regularizing the latent space representation with a “decoding pathway” from an information theory point of view. The experimental results match the conjecture from our theoretical analysis.

Our experimental results on three datasets from different domains demonstrate superior FDD performance compared to non-augmented MC dropout classifiers and MC dropout autoencoders, especially on out-of-distribution fault examples.
Problem Statement
Fault Detection
Let X be the set of data points, and M be a model class. Each model m ∈ M defines an anomaly score function s_m(·) that characterizes how likely a data point corresponds to a fault state; a larger s_m(x) implies a higher chance of data point x being a fault. For a given threshold value τ, we can define the precision and recall of the model on the test data distribution as follows:

precision(τ) = P(x is a fault | s_m(x) > τ),   recall(τ) = P(s_m(x) > τ | x is a fault).
Our goal is to learn a score function s_m and a corresponding threshold τ, such that the pair (s_m, τ) optimizes the precision and the recall on unseen test data.
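As a concrete illustration, the precision and recall of a thresholded score function can be computed as follows. This is a minimal numpy sketch; the helper name `precision_recall` is ours, not from the paper.

```python
import numpy as np

def precision_recall(scores, labels, tau):
    """Precision and recall of the detector that flags s(x) > tau.

    scores: anomaly scores s(x) for each example.
    labels: 1 for fault examples, 0 for normal examples.
    """
    flagged = scores > tau
    tp = np.sum(flagged & (labels == 1))          # true positives
    precision = tp / max(np.sum(flagged), 1)      # flagged examples that are faults
    recall = tp / max(np.sum(labels == 1), 1)     # faults that were flagged
    return precision, recall
```

Sweeping `tau` over the observed score range traces out the model's precision-recall trade-off.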
Fault Diagnosis
The fault diagnosis problem can be viewed as a natural extension of the fault detection problem. A fault diagnosis model not only needs to detect the existence of faults, but must also differentiate between faults of different classes. In this paper, we view a fault diagnosis model that deals with K fault classes as the aggregation of K fault detection models, where each fault detection model aims to distinguish the normal class from one particular type of fault. Concretely, the anomaly score function outputs one anomaly score s_k(x) for each type of fault k, and we aim at identifying the optimal score function and thresholds τ_1, …, τ_K which result in the best precision and recall for all the faults. It is possible to assign multiple fault labels to a given input, which means the fault diagnosis problem in our context is not only a multiclass classification problem, but also a multilabel classification problem.
Motivation of Algorithm Design
We propose to use the prediction uncertainty on the label of the input data as our score function for identifying fault data points. In the following, we motivate our choice of score function from an information-theoretic perspective.
Let us assume a classification model trained with learning algorithm A on training set D_train. The training set consists of data of class “normal” (labeled 0) and of class “fault” (labeled 1). In the test set D_test, besides data that are in the training distribution, there are also data that do not belong to the training distribution. We denote the index sets of normal data, fault data, and out-of-distribution data in the test set respectively as I_n, I_f, and I_o.
Given input x_i, the output of the trained model is a random variable Ŷ_i. Let H(Ŷ_i) be the entropy of random variable Ŷ_i under the model's predictive distribution. By the conditional independence of data points given the model, we can decompose the total entropy of the output variable on the test set into three parts,

H_total = Σ_{i∈I_n} H(Ŷ_i) + Σ_{i∈I_f} H(Ŷ_i) + Σ_{i∈I_o} H(Ŷ_i).   (1)
Since we want to utilize prediction uncertainty as an indicator for identifying fault data points (especially the out-of-distribution faults), we desire a learning algorithm that drives down the entropy (uncertainty) on the normal data points, and meanwhile increases the uncertainty on the out-of-distribution faults so that they are more distinguishable from the in-distribution data points. To lower the uncertainty on a selected subset of the data points (in our case the normal data points), a basic idea is to devote more effort to this class during training. One straightforward way is to increase the weight of the data points in the normal class when training the classification model; however, this may not yield the desired result, because a large class weight will push the decision boundary farther away from the normal class, thus allowing more incipient fault data points to be mistakenly classified as normal with little uncertainty.
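In practice, the per-example entropy appearing in this analysis can be estimated from MC dropout samples of the softmax output. A minimal numpy sketch; the helper name `predictive_entropy` is ours:

```python
import numpy as np

def predictive_entropy(samples):
    """Entropy of the predictive mean distribution.

    samples: array of shape (T, K) holding T MC dropout softmax outputs.
    """
    p = samples.mean(axis=0)                      # predictive mean distribution
    return float(-np.sum(p * np.log(p + 1e-12)))  # small eps avoids log(0)
```

A confidently classified example yields entropy near zero, while an ambiguous one approaches log K.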
To bypass this problem and better detect incipient faults, we propose to use an auxiliary task at a hidden layer for regularization purposes, which encourages the network to learn different latent space representations for the data at this hidden layer. Since part of the network's capacity is devoted to the auxiliary task, it can be expected that the entropy at the output of the resulting classifier on the test data is increased with the modified training algorithm A′, i.e.,
H′_total ≥ H_total. Here primes are used to indicate quantities that correspond to the modified training algorithm A′. The auxiliary task is designed to drive down the uncertainty on the normal data points, so Σ_{i∈I_n} H(Ŷ′_i) ≤ Σ_{i∈I_n} H(Ŷ_i). By Eq. (1), we hence have

Σ_{i∈I_f} H(Ŷ′_i) + Σ_{i∈I_o} H(Ŷ′_i) ≥ Σ_{i∈I_f} H(Ŷ_i) + Σ_{i∈I_o} H(Ŷ_i).
The above relation suggests that, for algorithm A′, a lower uncertainty on the normal data points comes at the price of higher uncertainty on the in-distribution fault data points. This analysis, however, does not give a definitive conclusion on the entropy of out-of-distribution (incipient and unknown) faults. We conjecture that these out-of-distribution examples will also exhibit higher uncertainty compared to normal examples under A′. As observed in our empirical study in the following sections, this trade-off is actually helpful for detecting out-of-distribution faults. In other words, we have made the model more sensitive in detecting deviations from the distribution of normal data by using an alternative learning algorithm that can suppress the uncertainty of the normal data on the auxiliary task.
The above information-theoretical analysis motivates us to incorporate an auxiliary learning task into the existing classification model to obtain improved sensitivity to potential out-of-distribution faults. In the upcoming sections, we describe how we design an augmented neural network architecture that incorporates reconstruction (autoencoding) as an auxiliary learning task to achieve the desired trade-off.
Methodology
Monte Carlo Dropout
Dropout [Srivastava et al.2014] is a powerful regularization technique to prevent overfitting of neural network parameters. In effect, the dropout technique provides an inexpensive approximation to training and evaluating an ensemble of exponentially many neural networks. The dropout mechanism offers a way to incorporate intrinsic randomization into neural network models. Recently, Gal and Ghahramani proposed MC dropout [Gal2016] to estimate a neural network's prediction uncertainty by using dropout at test time. The uncertainty estimates are obtained by repeatedly “sampling” the outputs of a dropout model given the same input x. Suppose we have obtained T i.i.d. sampled outputs (output probabilities from the softmax layer). Their predictive mean can be understood as the expected output given input x, and the predictive variance can be used to measure the confidence of the model in its prediction.
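The sampling procedure can be sketched with a toy two-layer network in numpy. All weights and layer sizes below are illustrative placeholders, not the models used in this paper; the key point is that dropout stays active at prediction time (with inverted scaling) and the T samples yield a predictive mean and variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weights for a 4-input, 3-class network (illustrative only).
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_with_dropout(x, p=0.5):
    """One stochastic forward pass; dropout remains ON at test time."""
    h = np.maximum(x @ W1, 0.0)
    mask = rng.random(h.shape) > p     # randomly drop hidden units
    h = h * mask / (1.0 - p)           # inverted-dropout scaling
    return softmax(h @ W2)

def mc_dropout_predict(x, T=100):
    """Predictive mean and variance from T i.i.d. stochastic passes."""
    samples = np.stack([forward_with_dropout(x) for _ in range(T)])
    return samples.mean(axis=0), samples.var(axis=0)

x = rng.normal(size=4)
mean, var = mc_dropout_predict(x, T=200)
```

The per-class variance is the uncertainty signal used by the anomaly scores defined later.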
Augmented Classification Network
As motivated in the previous section, we augment a regular classification network by adding a decoding pathway to an intermediate layer of the classification network. We illustrate the network structure in Figure 1.
The resulting augmented network now has two output pathways: a classifying pathway that aims to output the correct label for an input to the encoding pathway, and an additional decoding pathway whose goal is to reconstruct the original input from the latent space representation. Our model therefore embodies the functionality of both a classifier and an autoencoder. To get satisfactory results from both pathways, we expect the network to learn, at the latent space where the two pathways diverge, meaningful representations that are conducive to both tasks. To encourage the separation of fault data points from normal data points in the latent space, we use a small classifying pathway and add dropout layers only in the encoding pathway. A small classifying pathway with limited capacity motivates the network to learn a clear-cut decision boundary between the normal points and the fault points in the latent space. In addition, the random dropout in the encoding pathway also improves the separation between different classes, so that the learned decision boundary is robust against the stochastic latent space embedding produced by the encoding pathway.
The inclusion of a decoding pathway may benefit the detection of out-of-distribution faults in another way. A classifier trained as a discriminative model tends to find the most discriminative features to distinguish between classes. Much useful information in the original data is thus lost when training a discriminative model; however, such information may be useful for telling out-of-distribution faults apart from normal data. Since the decoding pathway encourages more information about the normal class to be preserved in the latent space, these out-of-distribution faults may become more discernible in the latent space, making it possible to detect and even diagnose them.
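A minimal forward pass of such a two-pathway network can be sketched in plain numpy. All layer sizes and weights below are illustrative placeholders, not the architectures used in the experiments; the sketch only shows the structure: stochastic encoder (dropout on), one shared latent code, and two heads.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 8 input features, 2-d latent space, 3 output classes.
D_IN, D_HID, D_LAT, D_CLS = 8, 16, 2, 3

W_enc1 = rng.normal(size=(D_IN, D_HID))
W_enc2 = rng.normal(size=(D_HID, D_LAT))
W_cls = rng.normal(size=(D_LAT, D_CLS))       # deliberately small classifying head
W_dec1 = rng.normal(size=(D_LAT, D_HID))
W_dec2 = rng.normal(size=(D_HID, D_IN))

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, p=0.5):
    """One stochastic pass: dropout only in the encoding pathway."""
    h = relu(x @ W_enc1)
    h = h * (rng.random(h.shape) > p) / (1 - p)   # dropout stays on at test time
    z = relu(h @ W_enc2)                          # shared latent representation
    y_prob = softmax(z @ W_cls)                   # classifying pathway
    x_rec = relu(z @ W_dec1) @ W_dec2             # decoding pathway
    return y_prob, x_rec

x = rng.normal(size=D_IN)
y_prob, x_rec = forward(x)
```

Because the dropout mask changes per call, repeated calls to `forward` on the same input give the MC samples used for uncertainty estimation on both pathways.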
Loss function
Unlike a plain autoencoder or classifier, the augmented network is trained to minimize both the reconstruction loss and the classification loss at the same time. We thus use a loss function of the form

L = Σ_i [ ℓ_c(x_i) + λ · 1{y_i = 0} · ℓ_r(x_i) ],

where the loss of each data point consists of two parts, the classification loss ℓ_c and the reconstruction loss ℓ_r. A hyperparameter λ is introduced to balance the trade-off between the two losses. Since we only want to suppress the prediction uncertainty on the normal data points, we apply the reconstruction loss only to the normal data points (hence the indicator 1{y_i = 0}). In the experiments to be described later, we used the cross-entropy loss for ℓ_c and the mean squared error for ℓ_r.
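A direct reading of this loss in code, assuming integer labels with 0 as the normal class. The helper name `joint_loss` is ours, and the equation above was reconstructed from the surrounding text, so this is a sketch rather than the authors' exact implementation.

```python
import numpy as np

def joint_loss(y_true, y_prob, x, x_rec, lam=1.0, normal_class=0):
    """Mean per-example loss: cross-entropy + lam * MSE on normal points only.

    y_true: integer labels, shape (N,); y_prob: softmax outputs, shape (N, K);
    x, x_rec: inputs and reconstructions, shape (N, D); lam: trade-off weight.
    """
    n = len(y_true)
    ce = -np.log(y_prob[np.arange(n), y_true] + 1e-12)   # classification loss
    mse = np.mean((x - x_rec) ** 2, axis=1)              # reconstruction loss
    normal = (y_true == normal_class).astype(float)      # indicator 1{y_i = 0}
    return np.mean(ce + lam * normal * mse)
```

Setting `lam=0` recovers the plain classifier objective, which makes the role of the decoding pathway easy to ablate.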
Evaluation
Since our proposed model can be used both as an MC dropout autoencoder and as an MC dropout classifier, its FDD performance can be evaluated in more than one way. The evaluation metrics described shortly also apply to autoencoders and classifiers, and will be used in the case studies for comparing the performance of these networks. In our experiments, we used an MC dropout autoencoder and an MC dropout classifier (hereafter referred to as “autoencoder” and “classifier” for brevity) as benchmarks to show the performance gain from using our proposed augmented model. For a fair comparison, the two benchmark models share the same components (pathways) as the augmented model.
Binary Classification (Fault Detection)
In fault detection tasks, we need a model to tell whether (or how likely) an example is a fault. Both the decoding pathway and the classifying pathway can be used for this task.
Anomaly scores
To measure how significantly an input example exhibits anomalous behavior at the output node of a two-class classifying pathway, we define the anomaly score of input x as

s(x) = μ(x) + σ(x),   (2)

where μ(x) and σ(x) are the predictive mean and predictive standard deviation of the output node obtained from the MC dropout samples. A smaller anomaly score implies less uncertainty in the prediction mean and variance. Without loss of generality, let us label the output classes (the normal class and the faults) by 0, 1, …, K, where class 0 is the normal class. Because the normal state is signified by an output of 1 at the class-0 output node, the above definition of the anomaly score needs special treatment for the class-0 output,

s_0(x) = (1 − μ_0(x)) + σ_0(x).
For the decoding pathway, the anomaly score of input x is defined to be the reconstruction error (i.e., the mean squared error) at the output,

s_rec(x) = ‖x − x̂‖² / d,

where x̂ is the predictive mean of the reconstruction output produced by the decoding pathway given input x, and d is the input dimension.
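Both pathway scores can be computed from MC dropout samples. The sketch below assumes the score form given above (predictive mean plus standard deviation, with the class-0 mean flipped); since parts of the original formulas were lost in extraction, treat this as an illustration rather than the authors' exact definition.

```python
import numpy as np

def anomaly_scores(samples):
    """Classifying-pathway scores from T MC dropout softmax samples.

    samples: array (T, K+1), where column 0 is the normal class.
    Assumed score: s_k = mean_k + std_k, except s_0 = (1 - mean_0) + std_0,
    since the normal state is signalled by an output of 1 at node 0.
    """
    mean = samples.mean(axis=0)
    std = samples.std(axis=0)
    scores = mean + std
    scores[0] = (1.0 - mean[0]) + std[0]   # special treatment for class 0
    return scores

def reconstruction_score(rec_samples, x):
    """Decoding-pathway score: MSE between input and mean reconstruction."""
    x_hat = rec_samples.mean(axis=0)       # predictive mean reconstruction
    return np.mean((x - x_hat) ** 2)
```

A confidently normal example thus scores near zero on every entry of `anomaly_scores` and near zero on `reconstruction_score`.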
Detection Thresholds
Ideally, an example belonging to the normal class should give zero anomaly scores for both the classifying pathway and the decoding pathway. In practice, normal examples will still exhibit small anomaly scores, so we need a detection threshold τ for determining whether an input example is faulty or not. To do so, we use a predefined false positive rate α to determine the detection threshold, such that the false positive rate on normal training data is α. We deem an input example x as suspicious if its anomaly score is above the corresponding detection threshold, i.e., s(x) > τ. The above method of choosing thresholds essentially limits the false positive rate to a given value, thus controlling the costs incurred by false alarms. A similar constant false alarm rate principle [Chen and Reed1987] has been adopted in adaptive algorithms for radar systems.
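Choosing a threshold that yields a target false positive rate on normal training data amounts to taking an empirical quantile of the normal anomaly scores; a one-line sketch (the helper name and default α are illustrative):

```python
import numpy as np

def detection_threshold(normal_scores, alpha=0.005):
    """Pick tau so that a fraction alpha of normal training scores exceed it."""
    return np.quantile(normal_scores, 1.0 - alpha)
```

Each pathway (and, in the multiclass case, each fault class) gets its own threshold computed this way from its own score distribution on normal data.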
Fault detection in multiclass models
In multiclass models, fault detection performance is also an important and meaningful metric to evaluate, for two reasons. First, in FDD applications, being able to tell the existence of a fault is meaningful by itself. Second, it is important for an FDD model to be able to signal a potential deviation from the normal condition. If the input corresponds to an unseen fault type that does not belong to the ones modeled by the classification model, then the detection performance is more valuable than the diagnostic performance.
In the multiclass case, we define a detection threshold τ_k for each fault class in a similar fashion as in the two-class case. Let us use a variable d_k to indicate an anomaly corresponding to the k-th type of fault. We consider k to be a possible label for input x if s_k(x) > τ_k. It is possible that an input example gets assigned more than one label, which reflects the classifier's uncertainty about the true label of x. We define the final set of predicted labels L̂(x) as the disjunction of the d_k's,

L̂(x) = { k : s_k(x) > τ_k }.   (3)
Multiclass Classification (Fault Diagnosis)
The multiclass case can be seen as a natural extension of the two-class case. One major difference is that the softmax function is used as the activation function in the output layer of the network. A multiclass classification model has the capability to tell what type of fault an input corresponds to (fault diagnosis). We introduce the notion of diagnostic accuracy for evaluating how accurately a multiclass model can pinpoint the underlying fault type.
Diagnostic Accuracy
In a multiclass setting, fault diagnosis is a more difficult task than just detecting the existence of faults. Let L̂(x) be the set of predicted labels of the classifier on input x, and ℓ(x) be the ground-truth label of input x. Note that in our context each example has only one ground-truth label. We define the diagnostic accuracy on input x as

acc(x) = 1[ℓ(x) ∈ L̂(x)] / |L̂(x) \ {0}|,   (4)

and report its average over the test examples.
where the denominator is the total number of detected fault labels, and the numerator indicates whether the true label is correctly detected. The higher the diagnostic accuracy, the more accurately the classification result pinpoints the true underlying system health status. It is worth noting that class 0 is excluded from the count in the denominator. In other words, the diagnostic accuracy will not be discounted if class 0 (the normal class) is included in the set of predicted labels, as long as the correct label is also included. We believe it is an acceptable and desired behavior for an incipient fault, as an intermediate state between the normal class and its corresponding fault class, to be suspected and labeled as both by an FDD algorithm.
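The multi-label decision rule and the diagnostic accuracy can be sketched as follows. The helper names are ours, class 0 denotes the normal class, and the convention of scoring 0 when no fault label is detected is our assumption.

```python
import numpy as np

def predicted_labels(scores, taus):
    """Label set: every class k whose score exceeds its threshold."""
    return {k for k, (s, t) in enumerate(zip(scores, taus)) if s > t}

def diagnostic_accuracy(label_sets, true_labels):
    """Average of 1[true label detected] / (number of detected fault labels)."""
    accs = []
    for labels, y in zip(label_sets, true_labels):
        fault_labels = labels - {0}        # class 0 excluded from the denominator
        if not fault_labels:
            accs.append(0.0)               # assumed convention: nothing detected
            continue
        accs.append((y in labels) / len(fault_labels))
    return float(np.mean(accs))
```

Note that including class 0 alongside the correct fault label leaves the score unchanged, matching the behavior described above for incipient faults.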
Datasets
Table 1: Binary classification accuracy and diagnostic accuracy. Classification accuracy columns (left to right): augmented model (decoding pathway), MC dropout autoencoder, augmented model (classifying pathway), MC dropout classifier. Diagnostic accuracy columns: augmented model, MC dropout classifier.

Dataset | Class | Aug. (dec.) | Autoencoder | Aug. (cls.) | Classifier | Aug. (diag.) | Classifier (diag.)
Thyroid | Normal | 0.900 | 0.900 | 0.900 | 0.882 | - | -
Thyroid | Subnormal | 0.208 | 0.063 | 0.627 | 0.377 | - | -
Thyroid | Diseased | 0.910 | 0.860 | 1.000 | 1.000 | - | -
Chiller | Normal | 0.937 | 0.944 | 0.942 | 0.936 | - | -
Chiller | SL1 | 0.936 | 0.290 | 0.618 | 0.503 | 0.244 | 0.133
Chiller | SL2 | 0.885 | 0.565 | 0.716 | 0.703 | 0.203 | 0.169
Chiller | SL3 | 0.815 | 0.853 | 0.921 | 0.896 | 0.270 | 0.233
Chiller | SL4 | 0.796 | 0.319 | 1.000 | 1.000 | 0.280 | 0.311
Chiller | unknown | 0.936 | 0.043 | 0.999 | 0.133 | - | -
Digits | Zero | 0.947 | 0.947 | 0.949 | 0.946 | - | -
Digits | Nonzero | 0.997 | 0.977 | 1.000 | 1.000 | 0.717 | 0.587
Digits | Ambiguous | 0.371 | 0.268 | 0.530 | 0.492 | 0.300 | 0.231
Digits | Out of domain | 0.999 | 0.997 | 0.999 | 0.998 | - | -
We selected three datasets from different domains to benchmark the performance of our proposed model. The common trait shared by the three datasets is that they all have some notion of “incipient faults” or “unknown faults”; these out-of-distribution faults are not represented in the training data. In the case of the hypothyroidism dataset, the resulting model is a binary classification model. Uncertainty information given by the deep learning model is used to indicate a potential third class, the subnormal condition, which can be seen as an “incipient fault”. On the chiller dataset, we train a multiclass classification model for predicting different types of faults. We also tested our approach on the MNIST [LeCun1998] dataset to show that our proposed approach can also work with image data. More details about these datasets are given below.
Thyroid Disease Data
We used the “ANN-thyroid” dataset from the UCI machine learning repository [Dua and Graff2017]. The dataset contains clinical data from both normal people and those who have been diagnosed with hypothyroidism. Each data point has 21 features, among which 15 are binary and the rest are continuous. We used only the continuous features in our experiment.
The data points are classified into three classes. Besides normal and hypothyroidism, there is also a third class that represents subnormal (mild) hypothyroidism [Quinlan1987]. The dataset is highly unbalanced; the vast majority of the data points correspond to the normal condition. Among the rest, about 67% are of the subnormal class. The normal data and subnormal data in this dataset have some overlap (see Figure 4 in the supplemental material), which causes some difficulty in differentiating the normal data from the subnormal data in classification.
RP-1043 Chiller Data
We used the ASHRAE RP-1043 dataset [Comstock, Braun, and Bernhard1999] to test the proposed approach in a multiclass setting. In RP-1043, sensor measurements of a typical cooling system, a 90-ton centrifugal water-cooled chiller, were recorded under both fault-free and various fault conditions. Besides the normal state (NM), we used seven different types of process faults (referred to as FWC, FWE, RL, RO, CF, NC & EO) from the RP-1043 dataset in our study; see [Jin et al.2019b] for a detailed description. Each fault was introduced at four levels of severity (SL1–SL4, from slightest to severest), except for the EO fault, which only has three severity levels. We used the same sixteen features as previous work [Jin et al.2019b].
Only the normal (SL0) data and the SL4 fault data were used for training the classification models. The less severe SL1, SL2 & SL3 faults were held out as ambiguous examples for testing purposes. We also held out the EO fault to see how the networks respond to unknown fault examples.
MNIST Digits Data
In this case study, we considered digit-0 images from the MNIST dataset [LeCun1998] to be the “normal” class, and four other digits (5, 6, 8 & 9) that resemble digit 0 as “faults”. As with the chiller dataset, we used two types of out-of-distribution examples in our study. The first type is ambiguous digits that resemble two or more digits. For example, some digit-0's are easily mistaken for digit-6's. To generate such ambiguous examples, we used a Variational Autoencoder (VAE) to interpolate between digit 0 and the fault digits; the interpolation was done in the latent space. The other five digits (1, 2, 3, 4 & 7) were used as “unknown faults” in this study.
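The interpolation step itself is generic and can be sketched independently of the VAE. The sketch below assumes two latent codes `z0` and `z1` already produced by a trained VAE encoder for a digit-0 image and a fault-digit image; decoding each interpolated code would then yield the ambiguous examples.

```python
import numpy as np

def interpolate_latent(z0, z1, n_steps=8):
    """Linear interpolation between two latent codes (VAE encoder assumed given)."""
    alphas = np.linspace(0.0, 1.0, n_steps)
    return np.stack([(1.0 - a) * z0 + a * z1 for a in alphas])
```

Intermediate points along this path decode to images that blend the two digits, which is the source of the ambiguous test examples.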
Experimental Evaluation
To examine the performance of our proposed augmented model, we used an autoencoder model and a classifier as benchmark models, and compared their FDD performance on both indistribution and outofdistribution test data. In this section, we will demonstrate the results on the three datasets described in the previous section.
These three models all share the same encoding pathway, but differ in their output pathways. Both the autoencoder and the classifier can be seen as parts of the augmented model: the autoencoder only has a decoding pathway, and the classifier only has a classifying pathway. The augmented model has both output pathways, with the same structures as those in the classifier and the autoencoder. In our model, dropout layers are added only to the encoding pathway. The neural networks used in our experimental study were implemented in Keras [Chollet and others2015]. We set λ to the same fixed value in all of our experiments.
We evaluate the three models’ FDD performances in terms of their binary classification accuracy and diagnostic accuracy. The autoencoder model is evaluated by its fault detection performance in terms of binary classification accuracy. The classification model is evaluated by its diagnostic accuracy, in addition to binary classification accuracy. Because the augmented model has two output pathways, it can be compared with the above two models separately depending on which output pathway we focus on.
Thyroid Dataset
We built a network with Fully Connected (FC) layers for the thyroid dataset, whose latent space had two dimensions. The network had three FC layers in each of its encoding and decoding pathways and two FC layers in its classifying pathway. We added a dropout layer after each FC layer in the encoding pathway. Among the three categories of data, we took the normal and the diseased data and randomly divided them into a training set and a test set. The whole training set was used to train the augmented model and the classification model, while the autoencoder was trained only with the normal data. For the augmented model, reconstruction was presumably a more difficult task than classification. We found it easier to reconcile the two objectives if we pretrained the model with only the reconstruction loss as a warm-up. In our experiment, the augmented model was pretrained for 20 epochs and then trained with the joint objective for another 100 epochs.
We calculated the performance metrics of all three models; see Table 1 for a comparison. Because of the overlap between the normal and subnormal data, we gave the model more tolerance to false positives when choosing α. Table 2 shows that our model has a much lower average detection threshold (0.002) than the classifier (0.005). A lower detection threshold implies that both the means and the standard deviations of the normal data predictions are closer to zero. In other words, our augmented model is more sensitive in detecting outliers. From Table 1, it is clear that our model performs better than the autoencoder and the classifier in both fault detection and fault diagnosis. The normal data and diseased data are both accurately classified. On subnormal data, the binary classification accuracy of our augmented model (0.627) is significantly higher than that of the classifier (0.377), which also shows that the subnormal data are more likely to be detected.
Table 2: Average detection thresholds.

Model | Thyroid | Chiller (average) | Digits (average)
Augmented model (decoding path) | 0.009 | 0.019 | 0.040
MC dropout autoencoder | 0.007 | 0.013 | 0.036
Augmented model (classifying path) | 0.002 | 0.007 | 0.010
MC dropout classifier | 0.005 | 0.018 | 0.019
Considering the decoding pathway, the average detection threshold given by our model (0.009) is slightly higher than that of the MC dropout autoencoder (0.007). We believe the reason is that our model performs both the classification and reconstruction tasks, while the autoencoder concentrates only on normal data reconstruction. We visualize and compare the latent spaces of the autoencoder and our augmented model in Figure 2. In the latent space, the data points mostly reside on a straight line. To better visualize the distributions of each cluster, we add some Gaussian noise to spread them out. Compared with the MC dropout autoencoder, data with different labels are more separated in the latent space of our augmented model. The improvement is obvious on the subnormal data. Furthermore, from Table 1, our model also performs much better than the autoencoder in fault detection, with the classification accuracy on subnormal data increasing from 0.063 to 0.208.
Chiller Dataset
The chiller dataset was also trained and tested with an FC neural network, whose latent space had four dimensions. The encoding pathway, decoding pathway and classifying pathway each had three FC layers, and each FC layer in the encoding pathway was followed by a dropout layer. We randomly split the data into a training set and a test set. The normal data and the severe fault (SL4) data were used to train the augmented model and the classifier. The autoencoder was trained only with the normal data. The augmented model was trained for 200 epochs with 40 pretraining epochs, and the other two models were both trained until convergence.
From Table 1, it is clear that the binary classification accuracy given by the classifying pathway in our augmented model is much higher than that of the MC dropout classifier, especially on incipient faults and unknown faults, which means our model performs better in fault detection. The diagnostic accuracy on incipient faults (SL1 to SL3) also sees improvements, which shows that incipient faults are more likely to be correctly diagnosed. Similar to the thyroid dataset, the detection threshold of our model on the chiller dataset (see Table 2) also decreases compared with the other two models, indicating a higher sensitivity of our model in detecting potential anomalies.
The binary classification accuracy given by the decoding pathway of our model also improves over that of the MC dropout autoencoder. It is noteworthy that the binary classification accuracy increases significantly on SL1 (0.290 to 0.936), SL2 (0.565 to 0.885) and unknown faults (0.433 to 0.936). Similar to the thyroid experiment, we used a Linear Discriminant Analysis (LDA) visualization of the latent space. As shown in Figure 2, the data clusters in the latent space of our augmented model are more dispersed than those of the autoencoder. The incipient faults are partially separated from the normal data, which benefits fault detection (see Figure 8 in the supplemental material). In addition, the average anomaly scores of our augmented model and the classifier (see Figure 5 in the supplemental material) also illustrate the improvements of our model in fault detection and diagnosis.
MNIST Digits Dataset
We designed a Convolutional Neural Network (CNN) to deal with the image data. The encoding pathway had two downsampling groups, each having two convolution layers and a max-pooling layer. Similarly, the decoding pathway had two upsampling groups, each having one upsampling layer and two convolution layers. Between the downsampling groups and the upsampling groups are six FC layers with 8 hidden nodes at the bottleneck. The classifying pathway was made up of three FC layers. We also added a dropout layer after each FC layer in the encoding pathway.
The augmented model and the MC dropout classifier were trained using both normal data and fault data; for training the autoencoder, only the normal data were used. We trained our model for 200 epochs (40 epochs for pretraining), and trained the other two models until convergence.
We computed the performance metrics of the three models, in terms of both binary classification accuracy and diagnostic accuracy. Our model shows improvements in both metrics compared with the MC dropout classifier. As shown in Table 1, the zero and nonzero data can be almost perfectly classified (see example prediction results in the supplemental material). Furthermore, our model also performs well on the ambiguous data, with the binary classification accuracy reaching 0.530. In multiclass fault diagnosis, our model also shows better diagnostic performance than the classifier, with the diagnostic accuracy climbing from 0.587 to 0.717 on nonzero digits, and from 0.231 to 0.300 on ambiguous digits. Table 2 shows that the average detection threshold of our model is also lower than that of the classifier, implying that our model is more sensitive than the classifier when detecting outliers.
The average anomaly scores also reflect the improvements of our model in detecting and diagnosing faults. In general, the anomaly scores computed from our model are higher than those from the classifier, indicating that our model is more sensitive to out-of-distribution faults. The improvements in the diagonal values of the matrix mean that our model performs better in diagnosing both the faults and the incipient faults.
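As an illustration of how such per-output-node anomaly scores can be obtained, the following NumPy sketch keeps dropout active at test time on a toy linear "network" and uses the per-node standard deviation of the Monte Carlo softmax outputs as the score. The weights, dropout rate, and exact score definition here are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.array([[ 2.0, -1.0],
              [-1.0,  2.0],
              [ 0.5,  0.5]])  # toy weights of a 3-class "network": logits = W @ x

def mc_dropout_predict(x, T=200, p=0.5):
    """Collect T stochastic softmax outputs with dropout left on at test time."""
    outs = []
    for _ in range(T):
        mask = (rng.random(W.shape) > p) / (1.0 - p)  # inverted dropout mask
        logits = (W * mask) @ x
        e = np.exp(logits - logits.max())
        outs.append(e / e.sum())
    return np.stack(outs)  # shape (T, n_classes)

samples = mc_dropout_predict(np.array([1.0, 0.2]))
mean_pred = samples.mean(axis=0)     # MC estimate of the class probabilities
anomaly_score = samples.std(axis=0)  # per-output-node predictive uncertainty
```

Averaging such scores over all examples of one fault type at one output node yields one cell of the matrix discussed above.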
Our augmented model also shows performance improvements in the results from the decoding pathway, compared with those from the autoencoder. As shown in Table 1, most zero and nonzero digits are accurately identified. The binary classification accuracy of our model on ambiguous data (0.371) improves over that of the autoencoder (0.268). Similar to our observations on the previous two datasets, the distributions of the different classes also become more dispersed in the latent space with our model; see Figure 2 for an LDA visualization.
Conclusion
In this paper, we proposed a novel neural network model for FDD applications. In the proposed network structure, an MC dropout classifier is augmented with a decoding pathway; as a result, the augmented network is trained to perform two tasks simultaneously: classification and reconstruction. We have shown that this combined-objective training gives improved FDD performance compared to autoencoders and MC dropout classifiers, especially on out-of-distribution faults. As future work, we plan to conduct a more in-depth theoretical analysis of the proposed method.
References
 [An and Cho2015] An, J., and Cho, S. 2015. Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE 2(1).
 [Chen and Reed1987] Chen, J. Y., and Reed, I. S. 1987. A detection algorithm for optical targets in clutter. IEEE Transactions on Aerospace and Electronic Systems (1):46–59.
 [Chollet and others2015] Chollet, F., et al. 2015. Keras. https://keras.io.
 [Comstock, Braun, and Bernhard1999] Comstock, M. C.; Braun, J. E.; and Bernhard, R. 1999. Development of analysis tools for the evaluation of fault detection and diagnostics in chillers. Purdue University.
 [Dua and Graff2017] Dua, D., and Graff, C. 2017. UCI machine learning repository.
 [Gal2016] Gal, Y. 2016. Uncertainty in deep learning. University of Cambridge.
 [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in neural information processing systems, 2672–2680.
 [Jin et al.2019a] Jin, B.; Chen, Y.; Li, D.; Poolla, K.; and Sangiovanni-Vincentelli, A. 2019a. A one-class support vector machine calibration method for time series change point detection. arXiv preprint arXiv:1902.06361.
 [Jin et al.2019b] Jin, B.; Li, D.; Srinivasan, S.; Ng, S.-K.; Poolla, K.; et al. 2019b. Detecting and diagnosing incipient building faults using uncertainty information from deep neural networks. arXiv preprint arXiv:1902.06366.
 [Kingma and Welling2013] Kingma, D. P., and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
 [LeCun1998] LeCun, Y. 1998. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
 [Leibig et al.2017] Leibig, C.; Allken, V.; Ayhan, M. S.; Berens, P.; and Wahl, S. 2017. Leveraging uncertainty information from deep neural networks for disease detection. Scientific reports 7(1):17816.
 [Li et al.2019] Li, D.; Chen, D.; Shi, L.; Jin, B.; Goh, J.; and Ng, S.-K. 2019. MAD-GAN: Multivariate anomaly detection for time series data with generative adversarial networks. arXiv preprint arXiv:1901.04997.
 [Quinlan1987] Quinlan, J. R. 1987. Simplifying decision trees. International journal of man-machine studies 27(3):221–234.
 [Schölkopf et al.2001] Schölkopf, B.; Platt, J. C.; Shawe-Taylor, J.; Smola, A. J.; and Williamson, R. C. 2001. Estimating the support of a high-dimensional distribution. Neural computation 13(7):1443–1471.
 [Srivastava et al.2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.
 [Tax2002] Tax, D. M. J. 2002. One-class classification: Concept learning in the absence of counterexamples.
 [Thompson et al.2002] Thompson, B. B.; Marks, R. J.; Choi, J. J.; El-Sharkawi, M. A.; Huang, M.-Y.; and Bunje, C. 2002. Implicit learning in autoencoder novelty assessment. In Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN'02 (Cat. No. 02CH37290), volume 3, 2878–2883. IEEE.
 [Wang et al.2019] Wang, X.; Du, Y.; Lin, S.; Cui, P.; and Yang, Y. 2019. Self-adversarial variational autoencoder with Gaussian anomaly prior distribution for anomaly detection. arXiv preprint arXiv:1903.00904.
 [Zhu and Laptev2017] Zhu, L., and Laptev, N. 2017. Deep and confident prediction for time series at Uber. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), 103–110. IEEE.
Supplemental material
Here, we present supplemental material with additional information about the datasets used in our study and additional experimental results.
Additional Information for the Chiller Dataset
In the RP-1043 chiller dataset, seven categories of faults were injected into the chiller system, with each fault introduced at four levels of severity (SL1 to SL4, from the slightest to the severest). Shown in Table 3 are the seven categories of faults and their respective normal operation ranges. The condenser fouling (CF) fault was emulated by plugging tubes into the condenser. The reduced condenser water flow rate (FWC) fault and the reduced evaporator water flow rate (FWE) fault were emulated directly by reducing the water flow rates in the condenser and evaporator, respectively. The refrigerant overcharge (RO) and refrigerant leakage (RL) faults were emulated by increasing and decreasing the amount of refrigerant charge, respectively. The non-condensable in refrigerant (NC) fault was emulated by adding nitrogen to the refrigerant. The excess oil (EO) fault was emulated by charging more oil than nominal.
Additional Visualization for the Thyroid Dataset
A visualization of the data points from the thyroid dataset is shown in Figure 4. The plot was created by projecting the subnormal data onto the Linear Discriminant Analysis (LDA) directions computed for the in-distribution data (normal and diseased). The overlap between the normal data and the subnormal data makes it difficult to differentiate between them.
Additional Visualization for the Chiller Dataset
Figure 5 shows the average anomaly scores given by the augmented model and the classifier across different output nodes, and on data of different fault types. This plot is similar to Figure 3 in the main text. Each value in this matrix represents the average anomaly score of one kind of fault at a single output node. It is worth noting that our model in general gives higher average anomaly scores than the classifier on the diagonal, which shows that our model is not only more accurate but also more sensitive in diagnosing the incipient faults.
Table 3: Chiller faults and their normal operation conditions.
Fault | Normal Operation
Reduced Condenser Water Flow (FWC) | 270 gpm
Reduced Evaporator Water Flow (FWE) | 216 gpm
Refrigerant Leak (RL) | 300 lbs
Refrigerant Overcharge (RO) | 300 lbs
Condenser Fouling (CF) | 164 tubes
Non-Condensables in System (NC) | No nitrogen
Excess Oil (EO) | 22 lbs
Additional Visualization for the Digits Dataset
Shown in Figure 6 are examples of a normal data point (digit “0”), a fault data point (digit “9”) and an ambiguous data point from the MNIST digits dataset, together with histograms showing the distributions of their respective prediction outputs from our augmented model under Monte Carlo sampling. Both digit “0” and digit “9” are correctly classified with small prediction uncertainty, and the prediction uncertainty of digit “9” is slightly higher than that of digit “0”. For the ambiguous digit, substantial prediction uncertainty can be seen at the output node for class “9”, showing that the model has difficulty deciding whether the input image is a “0” or a “9”. The results demonstrated by these examples are consistent with our design intent: suppress the uncertainty on normal data and increase the uncertainty on out-of-distribution examples.
Latent Space Visualization for the Chiller Dataset
The latent space visualization of different severity levels on the chiller dataset is shown in Figure 8. It is clear that the incipient fault data and the normal data become less separated in the latent space as the severity level decreases. Some SL1 fault data highly overlap with the normal data. Nevertheless, the different data clusters are more dispersed in the latent space with our model than with the MC dropout autoencoder. The normal data are partially separated from the incipient fault data, which offers a good basis for the decoding and classifying pathways to detect and diagnose the incipient faults.
Related Work
Recently, deep generative models such as the VAE [Kingma and Welling2013] and Generative Adversarial Networks (GANs) [Goodfellow et al.2014] have become popular in anomaly detection applications. The main difference between a VAE and an autoencoder is that the VAE is a stochastic generative model from which a probabilistic measure can be derived for differentiating normal and fault data. In an earlier work [An and Cho2015], the authors used a Monte Carlo method to estimate the reconstruction probability of an input to a VAE for identifying faults. The idea of adversarial training has also been found useful for anomaly detection, especially in unsupervised settings where adversarial (anomalous) training data are not available. In [Li et al.2019], a GAN-trained discriminator network learns to distinguish fake data from real data in an unsupervised fashion. In [Wang et al.2019], the authors introduced a self-adversarial training procedure to the VAE, so that the resulting deep representation not only captures the distribution of normal data but also has discriminative ability against faults.
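For intuition, here is a toy NumPy sketch in the spirit of the reconstruction-probability idea of [An and Cho2015]: sample latent codes from the encoder's approximate posterior, average the decoder's log-likelihood of the input, and flag low-scoring inputs as anomalous. The lambda "encoder" and "decoder" below are stand-ins for trained VAE networks, not actual models from that work.

```python
import numpy as np

rng = np.random.default_rng(2)

def reconstruction_probability(x, encode, decode, L=50):
    """MC estimate of E_{q(z|x)}[log p(x|z)]; low values indicate anomalies."""
    mu, log_var = encode(x)
    total = 0.0
    for _ in range(L):
        # Sample z ~ N(mu, diag(exp(log_var))) via the reparameterization trick.
        z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
        x_mu, x_log_var = decode(z)
        # Log-density of x under N(x_mu, diag(exp(x_log_var))).
        total += -0.5 * np.sum(x_log_var
                               + (x - x_mu) ** 2 / np.exp(x_log_var)
                               + np.log(2.0 * np.pi))
    return total / L

# Toy stand-ins: a decoder that has only "learned" the all-zeros normal pattern.
encode = lambda x: (x, np.full_like(x, -2.0))
decode = lambda z: (np.zeros_like(z), np.full_like(z, -2.0))

score_normal = reconstruction_probability(np.zeros(4), encode, decode)
score_anomalous = reconstruction_probability(np.full(4, 3.0), encode, decode)
```

Inputs resembling the normal pattern reconstruct well and score high, while far-from-normal inputs score much lower, giving the probabilistic fault measure discussed above.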