Unsupervised Detection of Anomalous Sound Based on Deep Learning and the Neyman-Pearson Lemma
Abstract
This paper proposes a novel optimization principle and its implementation for unsupervised anomaly detection in sound (ADS) using an autoencoder (AE). The goal of unsupervised-ADS is to detect unknown anomalous sounds without training data of anomalous sound. Use of an AE as a normal model is a state-of-the-art technique for unsupervised-ADS. To decrease the false positive rate (FPR), the AE is trained to minimize the reconstruction error of normal sounds, and the anomaly score is calculated as the reconstruction error of the observed sound. Unfortunately, since this training procedure does not take into account the anomaly score for anomalous sounds, the true positive rate (TPR) does not necessarily increase. In this study, we define an objective function based on the Neyman-Pearson lemma by considering ADS as a statistical hypothesis test. The proposed objective function trains the AE to maximize the TPR under an arbitrarily low FPR condition. To calculate the TPR in the objective function, we consider the set of anomalous sounds to be the complement of the set of normal sounds and simulate anomalous sounds by using a rejection sampling algorithm. Through experiments using synthetic data, we found that the proposed method improved the performance measures of ADS under low FPR conditions. In addition, we confirmed that the proposed method could detect anomalous sounds in real environments.
I Introduction
Anomaly detection in sound (ADS) has received much attention. Since anomalous sounds might indicate symptoms of mistakes or malicious activities, their prompt detection can possibly prevent such problems. In particular, ADS has been used for various purposes including audio surveillance [1, 2, 3, 4], animal husbandry [5, 6], product inspection, and predictive maintenance [7, 8]. For the last application, since anomalous sounds might indicate a fault in a piece of machinery, prompt detection of anomalies would decrease the number of defective products and/or prevent the propagation of damage. In this study, we investigated ADS for industrial equipment by focusing on machine-operating sounds.
ADS tasks can be broadly divided into supervised-ADS and unsupervised-ADS. The difference between the two categories lies in the definition of anomalies. Supervised-ADS is the task of detecting "defined" anomalous sounds such as gunshots or screams [2], and it is a kind of rare sound event detection (SED) [9, 10, 11]. Since the anomalies are defined, we can collect a dataset of the target anomalous sounds even though the anomalies are rarer than normal sounds. Thus, the ADS system can be trained using a supervised method, as in the various SED tasks of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, such as audio scene classification [12, 13], sound event detection [14, 15], and audio tagging [16]. On the other hand, unsupervised-ADS [17, 18, 19] is the task of detecting "unknown" anomalous sounds that have not been observed. In real-world factories, from the viewpoint of development cost, it is impracticable to deliberately damage an expensive target machine. In addition, actual anomalous sounds occur rarely and have high variability. Therefore, it is impossible to collect an exhaustive set of anomalous sounds, and we need to detect anomalous sounds for which no training data exist. For this reason, the task is often tackled as a one-class unsupervised classification problem [17, 18, 19]. This point is one of the major differences in premise between the DCASE tasks and ADS for industrial equipment. Thus, in this study, we aim to detect unknown anomalous sounds based on an unsupervised approach.
In unsupervised anomaly detection, an "anomaly" is defined as a pattern in the data that does not conform to expected "normal" behavior [19]. Namely, the universal set consists only of the normal and the anomalous, and the anomaly set is the complement of the normal set. More intuitively, the universal set is the set of various machine sounds of many machine types, the normal set is the sounds of one specific machine type, and the anomaly set is all other machine sounds. Therefore, a typical approach to unsupervised-ADS is the use of outlier-detection techniques. Here, the deviation between a normal model and an observed sound is calculated; this deviation is often called the "anomaly score". The normal model encodes the notion of normal behavior and is trained from training data of normal sounds. The observed sound is identified as anomalous when the anomaly score is higher than a predefined threshold value. Namely, anomalous sounds are defined as sounds that do not exist in the training data of normal sounds.
To train the normal model, it is necessary to define the optimality of the anomaly score. One popular way to measure the performance of ADS is to measure both the true positive rate (TPR) and the false positive rate (FPR). The TPR is the proportion of anomalies that are correctly identified, and the FPR is the proportion of normal sounds that are incorrectly identified as anomalous. To improve the performance of ADS, we need to increase the TPR and decrease the FPR simultaneously. However, these metrics depend on the threshold value and have a trade-off relationship, as shown in Fig. 1. When the PDFs of the anomaly scores of normal and anomalous sounds overlap, false detections cannot be avoided regardless of the threshold. Thus, to increase the TPR and decrease the FPR simultaneously, we need to train the normal model to reduce the overlap area. More intuitively, it is essential to provide small anomaly scores for normal sounds and large anomaly scores for anomalous sounds. In addition, if an ADS system gives false alerts frequently, we cannot trust it, just as "the boy who cried wolf" could not be trusted. Therefore, it is especially important in practice to increase the TPR under a low FPR condition.
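The threshold trade-off described above can be reproduced numerically. The following is a minimal sketch (our illustration, using hypothetical Gaussian-distributed anomaly scores, not data from the paper): sweeping the threshold shows that lowering the FPR also lowers the TPR whenever the two score distributions overlap.

```python
import random

random.seed(0)

# Hypothetical anomaly scores: normal sounds cluster low, anomalous sounds
# cluster high, but the two distributions overlap.
normal_scores = [random.gauss(0.0, 1.0) for _ in range(1000)]
anomaly_scores = [random.gauss(2.0, 1.0) for _ in range(1000)]

def tpr_fpr(threshold, normal, anomalous):
    """TPR: fraction of anomalies above the threshold; FPR: fraction of normals above it."""
    tpr = sum(a > threshold for a in anomalous) / len(anomalous)
    fpr = sum(n > threshold for n in normal) / len(normal)
    return tpr, fpr

# Raising the threshold lowers the FPR but also lowers the TPR.
for th in (0.0, 1.0, 2.0):
    tpr, fpr = tpr_fpr(th, normal_scores, anomaly_scores)
    print(f"threshold={th:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Because the overlap cannot be removed by the threshold alone, training must reshape the score distributions themselves.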
The early studies used various statistical models to calculate the anomaly score, such as the Gaussian mixture model (GMM) [3, 8] and the support vector machine (SVM) [4]. More recent literature calculates the anomaly score through deep neural networks (DNNs) such as the autoencoder (AE) [20, 21, 22, 23] and the variational AE (VAE) [24, 25]. In the case of the AE, the network is trained to minimize the reconstruction error of the normal training data, and the anomaly score is calculated as the reconstruction error of the observed sound. Thus, the AE provides small anomaly scores for normal sounds. However, there is no guarantee that it provides large anomaly scores for anomalous sounds. Indeed, if the AE generalizes, anomalous sounds will also be reconstructed, and the anomaly score of an anomalous sound will be small. Therefore, to increase the TPR and decrease the FPR simultaneously, the objective function should be modified.
Another strategy for unsupervised-ADS is the use of a generative adversarial network (GAN) [26, 27]. GANs have been used to detect anomalies in medical images [28]. In this strategy, a generator simulates "fake" normal data, and a discriminator identifies whether the input data is real normal data or not. Therefore, the discriminator can be trained to increase the TPR for fake normal data and decrease the FPR for true normal data simultaneously. However, since the generator is trained to generate normal data, if it perfectly generates normal sounds, the anomaly scores of normal sounds, and hence the FPR, will increase. Therefore, it is necessary to build an algorithm to simulate "non-normal" sounds.
In this study, we propose a novel optimization principle and its implementation for ADS using an AE. By considering outlier-detection-based ADS as a statistical hypothesis test, we define optimality as an objective function based on the Neyman-Pearson lemma [29]. The objective function works to increase the TPR under an arbitrarily low FPR condition. A problem in calculating the TPR is the simulation of anomalous sound data. Here, we explicitly define the set of anomalous sounds to be the complement of the set of normal sounds and simulate anomalous sounds by using a rejection sampling algorithm.
A preliminary version of this work was presented in [8]. The previous study utilized a DNN as a feature extractor, and the anomaly score was calculated as the negative log-likelihood of a GMM trained on normal data. Thus, although the DNN was trained to maximize the objective function based on the Neyman-Pearson lemma, the normal model was not guaranteed to increase the TPR and decrease the FPR. In this study, end-to-end training is achieved by using an AE as the normal model, and both the feature extractor and the normal model are trained to increase the TPR and decrease the FPR.
The rest of this paper is organized as follows. Section II briefly introduces outlier-detection-based ADS and its implementation using an AE. Section III describes the proposed training method and the details of the implementation. After reporting the results of objective experiments using synthetic data and verification experiments in real environments in Section IV, we conclude this paper in Section V. The mathematical symbols are listed in Appendix A.
II Conventional method
II-A Identification of anomalous sound based on outlier detection
ADS is an identification problem of determining whether the sound emitted from a target is a normal sound or an anomalous one. In this section, we briefly introduce the procedure of unsupervised-ADS.
First, an anomaly score $\mathcal{A}_\Theta(x_\tau)$ is calculated using a normal model. Here, $x_\tau$ is an input vector calculated from the observed sound, $\tau$ is the time index, and $\Theta$ is the set of parameters of the normal model. In many of the previous studies, $x_\tau$ was composed of hand-crafted acoustic features such as mel-frequency cepstrum coefficients (MFCCs) [1, 2, 3], and the normal model was often constructed as a PDF of normal sounds. Accordingly, the anomaly score can be calculated as
$$\mathcal{A}_\Theta(x_\tau) = -\ln p(x_\tau \mid s=0, \Theta), \tag{1}$$
where $s$ denotes the state, $s=0$ is normal, and $s=1$ is not normal, i.e., anomalous. $p(x_\tau \mid s=0, \Theta)$ is a normal model such as a GMM [8]. The observed sound is determined to be anomalous when the anomaly score exceeds a predefined threshold value $\phi$:
$$\mathcal{H}(x_\tau) = \begin{cases} 1\ (\text{anomalous}) & \mathcal{A}_\Theta(x_\tau) > \phi \\ 0\ (\text{normal}) & \text{otherwise.} \end{cases} \tag{2}$$
One of the performance measures of ADS consists of the pair of TPR and FPR. The TPR and FPR can be calculated as expectations of $\mathcal{H}$ with respect to anomalous and normal sounds, respectively:
$$\mathrm{TPR}(\Theta, \phi) = \mathbb{E}_{x \mid s=1}\left[ \mathcal{H}(x) \right], \tag{3}$$
$$\mathrm{FPR}(\Theta, \phi) = \mathbb{E}_{x \mid s=0}\left[ \mathcal{H}(x) \right], \tag{4}$$
where $\mathbb{E}_{x \mid s}[\cdot]$ denotes the expectation with respect to $x$ given the state $s$. These metrics depend on $\phi$ and have a trade-off relationship, as shown in Fig. 1. The top figure shows the PDFs of the anomaly scores for normal sounds and anomalous sounds. The bottom figures show the FPR and TPR with respect to $\phi$. When these PDFs overlap, false detections, i.e., false-positives and/or false-negatives, cannot be avoided regardless of the choice of $\phi$. In addition, false detections increase as the overlap area gets wider. Therefore, to increase the TPR and decrease the FPR simultaneously, it is necessary to train $\Theta$ so that the anomaly score is small for normal sounds and large for anomalous sounds. More precisely, we need to train $\Theta$ to reduce the overlap area.
II-B Unsupervised-ADS using an autoencoder
Recently, deep learning has been used to construct the normal model. Several studies on deep-learning-based unsupervised-ADS have used an autoencoder (AE) [21, 20, 22, 23]. This section briefly describes unsupervised-ADS using an AE (see Fig. 2).
The goal of using an AE is to learn an efficient representation of the input vector by using two neural networks, $\mathcal{E}$ and $\mathcal{D}$, which are called the encoder and decoder, respectively. First, the input vector $x_\tau$ is converted into a latent vector $z_\tau$ by the encoder. Then, an input vector $\hat{x}_\tau$ is reconstructed from $z_\tau$ by the decoder. These processes are expressed as
$$z_\tau = \mathcal{E}(x_\tau), \tag{5}$$
$$\hat{x}_\tau = \mathcal{D}(z_\tau). \tag{6}$$
The parameters of both neural networks are trained to minimize the reconstruction error:
$$\mathcal{L}(x) = \left\| x - \mathcal{D}\left( \mathcal{E}(x) \right) \right\|_2^2. \tag{7}$$
In ADS using an AE, the anomaly score is the reconstruction error of the observed sound, which is calculated as
$$\mathcal{A}_\Theta(x_\tau) = \left\| x_\tau - \mathcal{D}\left( \mathcal{E}(x_\tau) \right) \right\|_2^2. \tag{8}$$
To train the normal model to provide small anomaly scores for normal sounds, the AE is trained to minimize the average reconstruction error of normal sound,
$$\Theta \leftarrow \arg\min_{\Theta} \frac{1}{N} \sum_{n=1}^{N} \left\| x_n - \mathcal{D}\left( \mathcal{E}(x_n) \right) \right\|_2^2, \tag{9}$$
where $x_n$ is the $n$-th training sample of normal sound and $N$ is the number of training samples of normal sound. This objective function works to decrease the anomaly score of normal sounds. However, there is no guarantee that the anomaly scores of anomalous sounds will increase. Indeed, if the AE generalizes, anomalous sounds will also be reconstructed, and the anomaly score of anomalous sounds will also be small. Therefore, (9) does not ensure that false detections are reduced and the accuracy of ADS is improved; thus, it would be better to modify the objective function.
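To make the reconstruction-error criterion concrete, the toy sketch below (our illustration, not the paper's network) replaces the trained encoder and decoder with a fixed projection onto a one-dimensional subspace on which "normal" inputs are assumed to lie; inputs off that subspace reconstruct poorly and thus receive large anomaly scores, as in (8).

```python
import math

# Toy stand-in for a trained AE: normal sounds are assumed to lie near the
# 1-D subspace spanned by the unit vector u, so encoding is a projection.
u = (1 / math.sqrt(2), 1 / math.sqrt(2))

def encoder(x):          # latent "vector" z = E(x): scalar projection onto u
    return x[0] * u[0] + x[1] * u[1]

def decoder(z):          # reconstruction x_hat = D(z)
    return (z * u[0], z * u[1])

def anomaly_score(x):    # reconstruction error ||x - D(E(x))||^2
    xh = decoder(encoder(x))
    return (x[0] - xh[0]) ** 2 + (x[1] - xh[1]) ** 2

print(anomaly_score((3.0, 3.0)))   # on the "normal" subspace: score near 0
print(anomaly_score((3.0, -3.0)))  # off the subspace: large score
```

The same weakness noted above is visible here: any input that happens to lie on the learned subspace gets a small score, whether or not it is actually normal.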
Iii Proposed method
We will begin by defining an objective function that builds upon the Neyman-Pearson lemma in Sec. III-A. Then, we will describe the rejection sampling algorithm for simulating the anomalous sounds used for calculating the TPR in Sec. III-B. After that, the overall training and detection procedures of the proposed method will be summarized in Sec. III-C and Sec. III-D. As a modified implementation of the proposed method, we extend it to maximization of the area under the receiver operating characteristic curve (AUC) in Sec. III-E.
III-A Objective function for anomaly detection based on the Neyman-Pearson lemma
From (1) and (2), an anomalous sound satisfies the following inequality:
$$-\ln p(x_\tau \mid s=0, \Theta) > \phi \quad \Longleftrightarrow \quad p(x_\tau \mid s=0, \Theta) < e^{-\phi}. \tag{10}$$
Since $\phi$ is assumed to be sufficiently large to avoid false positives, an anomalous sound can be defined as "a sound that cannot be regarded as a sample of the normal model." Thus, we can regard outlier-detection-based ADS as a statistical hypothesis test. In other words, the observed sound is identified as anomalous when the following null hypothesis is rejected.
Null hypothesis: $x_\tau$ is a sample of the normal model $p(x \mid s=0, \Theta)$.
The Neyman-Pearson lemma [29] states the condition for $\mathcal{H}$ to achieve the most powerful test between two simple hypotheses. According to it, the most powerful test has the greatest detection power among all possible tests at a given FPR [30]. More simply, the most powerful test maximizes the TPR under the constraint that the FPR equals $\rho$, i.e.,

$$\text{maximize } \mathrm{TPR}(\Theta, \phi) \quad \text{subject to } \mathrm{FPR}(\Theta, \phi) = \rho.$$
Since the FPR can be controlled by manipulating $\phi$, we define $\phi_\rho$ as the threshold satisfying $\mathrm{FPR}(\Theta, \phi_\rho) = \rho$. Accordingly, the objective function to obtain the most powerful test function can be defined as the one that maximizes $\mathrm{TPR}(\Theta, \phi_\rho)$ with respect to $\Theta$. However, since the FPR is also a function of $\Theta$, it may become large when focusing only on the TPR. To maximize the TPR and minimize the FPR simultaneously, we train $\Theta$ to maximize the following objective function,
$$\mathcal{J}^{\mathrm{NP}}(\Theta) = \mathrm{TPR}(\Theta, \phi_\rho) - \mathrm{FPR}(\Theta, \phi_\rho), \tag{11}$$
where the superscript "NP" is an abbreviation of "Neyman-Pearson". Since the proposed objective function directly increases the TPR and decreases the FPR, $\Theta$ can be trained to provide a small anomaly score for normal sounds and a large anomaly score for anomalous sounds.
There are two problems when it comes to training $\mathcal{E}$ and $\mathcal{D}$ to maximize (11). The first problem is the calculation of the TPR. The TPR and FPR are expectations of $\mathcal{H}$, and in most practical cases, the expectation is approximated as an average over the training data. Thus, to calculate the TPR and FPR, we need to collect enough normal and anomalous sound data for the average to be an accurate approximation of the expectation. However, since anomalous sounds occur rarely and have high variability, this condition is difficult to satisfy. In Section III-B, to calculate the TPR, we consider "anomaly" to mean "not normal" and simulate anomalous sounds by using a sampling algorithm. The second problem is the determination of the threshold $\phi_\rho$. In a parametric hypothesis test such as a $t$-test, the threshold at which the FPR equals $\rho$ can be calculated analytically. However, a DNN is a non-parametric statistical model; thus, the threshold cannot be calculated analytically. In Section III-C, we numerically calculate $\phi_\rho$ as the $\lfloor \rho N \rfloor$-th largest of the sorted anomaly scores of normal sounds, where $\lfloor \cdot \rfloor$ is the flooring function.
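The numerical threshold determination can be sketched as follows (a minimal illustration with hypothetical scores): the anomaly scores of normal sounds are sorted in descending order, and the value at the position corresponding to the target FPR is taken as the threshold.

```python
import math
import random

random.seed(1)

def threshold_for_fpr(normal_scores, rho):
    """Return the threshold at which roughly a fraction rho of normal scores
    exceed it: the floor(rho * N)-th value in descending order."""
    s = sorted(normal_scores, reverse=True)
    return s[math.floor(rho * len(s))]

# Hypothetical normal anomaly scores.
scores = [random.gauss(0.0, 1.0) for _ in range(10000)]
phi = threshold_for_fpr(scores, rho=0.1)
fpr = sum(x > phi for x in scores) / len(scores)
print(phi, fpr)  # the empirical FPR is close to 0.1 by construction
```

This is the same idea used per minibatch during training, with the minibatch size in place of the full dataset size.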
III-B Anomalous sound simulation using an autoencoder
In accordance with (10), anomalous sounds emitted from the target machine are different from normal ones. Thus, we consider the set of normal sounds to be a subset of various machine sounds, and the set of anomalous sounds to be its complement. Then, we use rejection sampling to simulate anomalous sounds; namely, a sound is sampled from the PDF of various machine sounds, and it is accepted as an anomalous sound when its anomaly score is high. However, since the PDF of various machine sounds in the input vector domain may have a complex form, the PDF cannot be written analytically, and the sampling algorithm would become complex. Inspired by the strategy of the VAE, we can avoid this problem by training $\mathcal{E}$ so that the PDF of the latent vectors of various machine sounds is mapped to a PDF whose samples can be generated by a pseudorandom number generator from a uniform distribution and its variable conversion. Then, the latent vectors of anomalous sounds are sampled using the rejection sampling algorithm, and the input vectors of anomalous sounds are reconstructed using a third neural network $\mathcal{G}$,
$$\hat{x}_\tau = \mathcal{G}(z_\tau), \tag{12}$$
where $\Theta_G$ is the set of parameters of $\mathcal{G}$. Hereafter, we call $\mathcal{G}$ the generator. Although there is no constraint on the architecture of $\mathcal{G}$, we will use the same architecture for $\mathcal{G}$ and $\mathcal{D}$. In addition, to simply generate and reject a candidate latent vector, we use two constraints to train $\mathcal{E}$ and $\mathcal{G}$, and model the PDF of normal latent vectors using a GMM as
$$p(z \mid s=0) = \sum_{k=1}^{K} w_k \, \mathcal{N}(z \mid \mu_k, \Sigma_k), \tag{13}$$
where $k$ is the mixture index, $K$ is the number of mixtures, and $w_k$, $\mu_k$, and $\Sigma_k$ are respectively the weight, mean vector, and covariance matrix of the $k$-th Gaussian. The concepts of these PDFs are shown in Fig. 3, and the procedure of anomalous sound simulation is summarized in Algorithm 1 and Fig. 4.
First, we describe the two constraints for training $\mathcal{E}$ and $\mathcal{G}$. For algorithmic efficiency, candidate latent vectors should be generated at a low computational cost. As the target PDF, we use the standard Gaussian distribution, because its samples can be generated by a pseudorandom number generator such as the Mersenne Twister. Thus, for training $\mathcal{E}$ and $\mathcal{G}$, the first constraint is that the latent vectors of the various machine sounds follow a standard Gaussian distribution. To satisfy this constraint, we train $\mathcal{E}$ to minimize the following Kullback-Leibler divergence (KLD):
$$\mathcal{J}^{\mathrm{KL}} = \frac{1}{2} \left( \mathrm{tr}(\Sigma_z) + \mu_z^{\top} \mu_z - D_z - \ln \det \Sigma_z \right), \tag{14}$$
where the superscript "KL" is an abbreviation of "Kullback-Leibler", $\mathrm{tr}(\cdot)$ denotes the trace of a matrix, $\top$ denotes transposition, the reference distribution is $\mathcal{N}(0, I)$ with the zero vector and unit matrix of dimension $D_z$, and $\mu_z$ and $\Sigma_z$ are respectively the mean vector and covariance matrix calculated from the latent vectors of the various machine sounds. To generate anomalous sounds from (12), $\mathcal{G}$ needs to reconstruct various machine sounds, as $\mathcal{G}(\mathcal{E}(x)) \approx x$. Thus, as the second constraint, we train $\mathcal{E}$ and $\mathcal{G}$ to minimize the reconstruction error (7) calculated on the various machine sounds.
Next, we describe the GMM that models the PDF of the normal latent vectors. To reject a candidate that seems to come from a normal sound, we need to calculate the probability that the candidate is a normal one; that is, we need to model $p(z \mid s=0)$. Since there is no constraint on the form of $p(z \mid s=0)$ in the training procedure of $\mathcal{E}$, it might have a complex form. For simplicity, we model it with the GMM expressed in (13).
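The rejection sampling step can be sketched in one dimension (our toy example; the mixture weights, means, variances, and acceptance threshold are hypothetical): candidates are drawn from the standard Gaussian prior of all machine sounds, and a candidate is accepted as "anomalous" only if its likelihood under the GMM of normal latent vectors is low. In the actual method, each accepted latent vector would then be decoded by the generator.

```python
import math
import random

random.seed(2)

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy GMM fitted to the latent vectors of NORMAL sounds: normal latents are
# assumed to cluster around -1 and +1 inside the N(0, 1) prior.
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [0.05, 0.05]

def gmm_pdf(z):
    return sum(w * normal_pdf(z, m, v) for w, m, v in zip(weights, means, variances))

def sample_anomalous_latent(density_threshold=0.05):
    """Rejection sampling: draw from the prior of ALL machine sounds (standard
    Gaussian) and keep only candidates unlikely under the normal GMM."""
    while True:
        z = random.gauss(0.0, 1.0)
        if gmm_pdf(z) < density_threshold:  # low normal likelihood -> "not normal"
            return z

z_anom = [sample_anomalous_latent() for _ in range(200)]
# Every accepted latent is improbable under the normal model.
print(max(gmm_pdf(z) for z in z_anom))
```

The accepted latents land between and outside the normal clusters, which is exactly the "complement of normal" region the method targets.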
III-C Detailed description of training procedure
Here, we describe the details of the training procedure shown in Fig. 5. The training procedure consists of three steps. Hereafter, we call the proposed method using this training procedure NP-PROP. The algorithm inputs are training data constructed from normal sounds and various machine sounds, and the outputs are the trained parameters $\Theta_E$ and $\Theta_D$. Moreover, $x_m^{(v)}$ and $x_m^{(n)}$ respectively denote the $m$-th training samples of minibatches of various and normal machine sounds, and $M$ is the number of samples included in a minibatch.
First, $\mathcal{E}$ and $\mathcal{G}$ are trained to simulate anomalous sounds. A minibatch of various machine sounds is randomly selected from the training dataset of various machine sounds. Next, its latent vectors are calculated as $z_m = \mathcal{E}(x_m^{(v)})$. Then, the parameters of the Gaussian distribution of the minibatch are calculated as
$$\mu_z = \frac{1}{M} \sum_{m=1}^{M} z_m, \tag{15}$$
$$\Sigma_z = \frac{1}{M} \sum_{m=1}^{M} \left( z_m - \mu_z \right) \left( z_m - \mu_z \right)^{\top}. \tag{16}$$
Finally, to minimize the KLD and the reconstruction error of various sounds, the objective function is calculated as
$$\mathcal{J}^{\mathrm{KR}} = \mathcal{J}^{\mathrm{KL}} + \frac{1}{M} \sum_{m=1}^{M} \left\| x_m^{(v)} - \mathcal{G}\left( \mathcal{E}(x_m^{(v)}) \right) \right\|_2^2, \tag{17}$$
where the superscript "KR" is an abbreviation of "KLD and reconstruction", and $\Theta_E$ and $\Theta_G$ are updated by gradient descent to minimize $\mathcal{J}^{\mathrm{KR}}$:
$$\Theta_E \leftarrow \Theta_E - \lambda \nabla_{\Theta_E} \mathcal{J}^{\mathrm{KR}}, \tag{18}$$
$$\Theta_G \leftarrow \Theta_G - \lambda \nabla_{\Theta_G} \mathcal{J}^{\mathrm{KR}}, \tag{19}$$
where $\lambda$ is the step size.
Second, $\mathcal{E}$ and $\mathcal{D}$ are trained to maximize the objective function. A minibatch of normal sounds is randomly selected from the training dataset of normal sounds, and a minibatch of anomalous sounds is simulated using Algorithm 1. Here, since a DNN is not a parametric PDF, the threshold $\phi_\rho$ that satisfies $\mathrm{FPR}(\Theta, \phi_\rho) = \rho$ cannot be calculated analytically. Thus, in this study, we approximately calculate $\phi_\rho$ by sorting the anomaly scores of the normal sounds in the minibatch: the anomaly scores of the normal and simulated anomalous minibatches are calculated, and $\phi_\rho$ is set as the $\lfloor \rho M \rfloor$-th value of the normal anomaly scores sorted in descending order. Then, the TPR and FPR are approximately evaluated as
$$\mathrm{TPR}(\Theta, \phi_\rho) \approx \frac{1}{M} \sum_{m=1}^{M} \sigma\!\left( \mathcal{A}_\Theta(\hat{x}_m^{(a)}) - \phi_\rho \right), \tag{20}$$
$$\mathrm{FPR}(\Theta, \phi_\rho) \approx \frac{1}{M} \sum_{m=1}^{M} \sigma\!\left( \mathcal{A}_\Theta(x_m^{(n)}) - \phi_\rho \right), \tag{21}$$
where the binary decision function $\mathcal{H}$ is approximated by the sigmoid function $\sigma(\cdot)$, allowing the gradient to be calculated analytically, and $\hat{x}_m^{(a)}$ is the $m$-th simulated anomalous sample. Finally, $\Theta_E$ and $\Theta_D$ are updated to increase $\mathcal{J}^{\mathrm{NP}}$ by gradient ascent:
$$\Theta_E \leftarrow \Theta_E + \lambda \nabla_{\Theta_E} \mathcal{J}^{\mathrm{NP}}, \tag{22}$$
$$\Theta_D \leftarrow \Theta_D + \lambda \nabla_{\Theta_D} \mathcal{J}^{\mathrm{NP}}. \tag{23}$$
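The second training step can be sketched as follows (a minimal illustration with hypothetical minibatch scores, not the paper's network code): the hard decision is relaxed to a sigmoid so that the objective, TPR minus FPR, is differentiable, and better-separated scores yield a larger objective value.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def np_objective(scores_anom, scores_norm, phi):
    """Smoothed J^NP = TPR - FPR: the hard decision 1[A > phi] is replaced
    by sigmoid(A - phi) so the objective is differentiable w.r.t. the scores."""
    tpr = sum(sigmoid(a - phi) for a in scores_anom) / len(scores_anom)
    fpr = sum(sigmoid(a - phi) for a in scores_norm) / len(scores_norm)
    return tpr - fpr

# Hypothetical minibatch anomaly scores: training should push these apart.
well_separated = np_objective([5.0, 6.0, 7.0], [0.0, 0.5, 1.0], phi=3.0)
overlapping = np_objective([2.5, 3.0, 3.5], [2.0, 2.5, 3.0], phi=3.0)
print(well_separated, overlapping)  # the separated case scores higher
```

In the actual method the scores are functions of the encoder and decoder parameters, so the gradient of this smoothed objective can flow back through the networks.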
III-D Detailed description of detection procedure
After training $\mathcal{E}$ and $\mathcal{D}$, we can identify whether the observed sound is a normal one or not. First, the input vector $x_\tau$ is calculated from the observed sound. Then, the anomaly score is calculated as in (8). Finally, a decision score is calculated by aggregating the frame-wise anomaly scores, and when it exceeds a predefined value, the observed sound is determined to be anomalous. In this study, we used the maximum over frames as the decision score, meaning that if the anomaly score exceeds the threshold even for one frame, the observed sound is determined to be anomalous.
III-E Modified implementation as an AUC maximization
The receiver operating characteristic (ROC) curve and the AUC are widely used performance measures for imbalanced data classification and/or anomaly detection. The AUC is calculated as
$$\mathrm{AUC}(\Theta) = \int_0^1 \mathrm{TPR}(\Theta, \phi_r) \, dr, \tag{24}$$
$$\approx \frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} \mathcal{H}\!\left( \mathcal{A}_\Theta(\hat{x}_i^{(a)}) - \mathcal{A}_\Theta(x_j^{(n)}) \right), \tag{25}$$
where $\phi_r$ is the threshold satisfying $\mathrm{FPR}(\Theta, \phi_r) = r$, and $\mathcal{H}(u)$ takes 1 if $u > 0$ and 0 otherwise.
As we can see in (25), anomalous sound data are needed to calculate the AUC. Although the AUC has been used as an objective function in imbalanced data classification [31, 32, 33], it has not been applied to unsupervised-ADS so far. Fortunately, since the proposed rejection sampling can simulate anomalous sound data, AUC maximization can be used as an objective function of ADS. Instead of $\mathcal{J}^{\mathrm{NP}}$, the following objective function can be used in the training procedure:
$$\mathcal{J}^{\mathrm{AUC}} = \frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} \sigma\!\left( \mathcal{A}_\Theta(\hat{x}_i^{(a)}) - \mathcal{A}_\Theta(x_j^{(n)}) \right). \tag{26}$$
Hereafter, we call the proposed method using $\mathcal{J}^{\mathrm{AUC}}$ instead of $\mathcal{J}^{\mathrm{NP}}$ AUC-PROP.
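The AUC objective admits a simple pairwise sketch (our illustration with hypothetical scores): it is the fraction of (anomalous, normal) score pairs that are correctly ordered, with the indicator relaxed to a sigmoid so it can be maximized by gradient ascent.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def auc_objective(scores_anom, scores_norm):
    """Smoothed pairwise AUC: the fraction of (anomalous, normal) pairs whose
    anomaly scores are correctly ordered, with 1[a > n] relaxed to a sigmoid."""
    pairs = [(a, n) for a in scores_anom for n in scores_norm]
    return sum(sigmoid(a - n) for a, n in pairs) / len(pairs)

print(auc_objective([4.0, 5.0], [0.0, 1.0]))  # near 1: anomalies scored higher
print(auc_objective([1.0, 0.0], [4.0, 5.0]))  # near 0: ordering reversed
```

Unlike the Neyman-Pearson objective, no explicit threshold appears here; all operating points of the ROC curve are weighted equally.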
IV Experiments
We conducted experiments to evaluate the performance of the proposed method. First, we conducted an objective experiment using synthetic anomalous sounds (Sec. IV-B). To generate an anomalous dataset large enough for the ADS accuracy evaluation, we used collision and sustained sounds from the dataset of the Detection and Classification of Acoustic Scenes and Events 2016 challenge (DCASE2016) [36]. To show the effectiveness of the method in real environments, we conducted verification experiments in three real environments (Sec. IV-C).
IV-A Experimental conditions
Compared methods
The proposed methods described in Sec. III-C (NP-PROP) and Sec. III-E (AUC-PROP) were compared with three state-of-the-art ADS methods:
- VAE [24]: the encoder and decoder were implemented using a VAE. The encoder estimated the mean and variance parameters of the Gaussian distribution in the latent space. The latent vectors were then sampled from the Gaussian distribution whose parameters were estimated by the encoder, and the decoder reconstructed the input vector from the sampled latent vectors. Finally, the reconstruction error was calculated and used as the anomaly score.
We also used our previous work [8] (CONV-PROP) for comparison. This method uses a VAE to extract latent vectors as acoustic features. A GMM is used as the normal model, and the encoder and decoder are trained to maximize (11).
DNN architecture and setup
We tested two types of network architecture, as shown in Fig. 6. The first architecture, "FNN", consisted of fully connected DNNs with three hidden layers and 512 hidden units. The rectified linear unit (ReLU) was used as the activation function of the hidden layers. The input vector was defined as the concatenation of mel-filterbank outputs over a context window,

$$x_\tau = \left( \left| \mathbf{M} f_{\tau - C} \right|^{\top}, \ldots, \left| \mathbf{M} f_{\tau + C} \right|^{\top} \right)^{\top},$$

where $f_\tau$ is the discrete Fourier transform (DFT) spectrum of the observed sound, $C$ is the context window size, and $\mathbf{M}$ and $|\cdot|$ denote 40-dimensional mel-matrix multiplication and the element-wise absolute value. Thus, the dimension of $x_\tau$ was $40 \times (2C + 1) = 440$. The second architecture, "1D-CRNN", consisted of a one-dimensional convolutional neural network (1D-CNN) layer and a long short-term memory (LSTM) layer; it worked well in supervised anomaly detection (rare SED) in DCASE 2017 [10]. In order to detect anomalous sounds in real time, we changed the backward LSTM to a forward one. In addition, to avoid overfitting, we used only one forward LSTM layer instead of two backward LSTM layers. The input vector was a 40-dimensional log mel-band energy, so the dimension of $x_\tau$ was 40. For each architecture, the dimension of the latent vector was 40. All input vectors were mean-and-variance normalized using the training data statistics.
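The 40-dimensional log mel-band energy features can be sketched as follows (one common recipe, using the frame settings from Table I; the naive DFT and the filterbank construction here are our simplification, not necessarily the paper's exact pipeline):

```python
import math

SR, N_FFT, N_MELS = 16000, 512, 40

def mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    """Triangular filters with center frequencies evenly spaced on the mel scale."""
    pts = [inv_mel(i * mel(sr / 2) / (n_mels + 1)) for i in range(n_mels + 2)]
    bins = [int(n_fft * f / sr) for f in pts]
    fbank = [[0.0] * (n_fft // 2 + 1) for _ in range(n_mels)]
    for m in range(n_mels):
        lo, c, hi = bins[m], bins[m + 1], bins[m + 2]
        for k in range(lo, c):
            fbank[m][k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m][k] = (hi - k) / max(hi - c, 1)
    return fbank

def log_mel_energies(frame, fbank):
    """|DFT| of one 512-sample frame -> 40 log mel-band energies (naive DFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(frame[t] * math.cos(-2 * math.pi * k * t / n) for t in range(n))
        im = sum(frame[t] * math.sin(-2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return [math.log(sum(w + 0.0 and w * m or w * m for w, m in zip(row, mags)) + 1e-10)
            for row in fbank]

# A 1 kHz tone as a stand-in for one frame of machine sound.
frame = [math.sin(2 * math.pi * 1000 * t / SR) for t in range(N_FFT)]
feats = log_mel_energies(frame, mel_filterbank())
print(len(feats))  # a 40-dimensional input vector, as used by the 1D-CRNN
```

The band with the largest energy corresponds to the 1 kHz tone, confirming that the filterbank localizes spectral energy as expected.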
As an implementation of the gradient method, the Adam method [34] was used instead of the plain gradient descent/ascent shown in (18)-(23). To avoid overfitting, $L_2$ regularization [35] was used. The minibatch size for all methods was 512. All models were trained for 500 epochs. In all methods, the average value of the loss was calculated on the training set at every epoch, and when the loss did not decrease for five consecutive epochs, the step size was halved.
Other conditions
All sounds were recorded at a sampling rate of 16 kHz. The frame size of the DFT was 512, and the frame was shifted every 256 samples. For the GMM in (13), the number of Gaussian mixtures was $K = 16$, and a diagonal covariance matrix was used to prevent the estimation from being ill-conditioned. The EM algorithm for the GMM was run for 30 iterations. All the above-mentioned conditions are summarized in Table I.
Table I: Experimental conditions

Parameters for signal processing
  Sampling rate: 16.0 kHz
  FFT length: 512 pts
  FFT shift length: 256 pts
  Number of mel-filterbanks: 40

Other parameters
  Context window size: 5
  Dimension of input vector for FNN: 440
  Dimension of input vector for 1D-CRNN: 40
  Dimension of acoustic feature vector: 40
  Number of GMM mixtures (GMM updated per gradient step): 16
  Minibatch size: 512
  FPR parameter $\rho$: 0.2
  Step size $\lambda$
  $L_2$ regularization parameter
IV-B Objective experiments on synthetic data
Dataset
Sounds emitted from a condensing unit of an air conditioner operating in a real environment were used as the normal sounds. In addition, various machine sounds were recorded from other machines, including a compressor, an engine, a compression pump, and an electric drill, as well as environmental noise of factories. The normal and various machine sound data totaled 4 and 20 hours (4 hours normal + 16 hours other machines), respectively. These sounds were recorded at a 16-kHz sampling rate. In order to improve robustness to different loudness levels and ratios of the normal and anomalous sounds, the various machine sounds in the training dataset were augmented by multiplication with five amplitude gains, calculated so that the maximum amplitudes of the various sounds become 1.0, 0.5, 0.25, 0.125, and 0.063.
Since it is difficult to collect a massive amount of test data including anomalous sounds, synthetic anomalous data were used in this evaluation.
In particular, we used training datasets of DCASE2016 [36] as anomalous sounds.
Although these sounds are "normal" sounds in an office, in unsupervised-ADS, unknown sounds are categorized as "anomalous".
Thus, we consider that this evaluation can at least evaluate the detection performance for unknown sounds.
Since the anomalous sounds of machines are roughly categorized into collision sounds (e.g., the sound of a metal part falling on the floor) and sustained sounds (e.g., frictional sound caused by scratched bearings), we selected 80 collision sounds (slamming doors, knocking at doors, keys put on a table, and keystrokes on a keyboard) and 60 sustained sounds (drawers being opened, pages being turned, and phones ringing) from this dataset [37].
To synthesize the test data, the anomalous sounds were mixed with normal sounds at several anomaly-to-normal power ratios (ANRs), as follows:

1. Select an anomalous sound, and randomly cut a normal sound so that it has the same signal length as the selected anomalous sound.
2. For the cut normal and anomalous sounds, calculate the frame-wise log power on a dB scale, using 512-point frames with a 256-point shift.
3. Select the median of the frame-wise log powers as the representative power of each sound.
4. Manipulate the power of the anomalous sound so that the ANR has the desired value.
5. Use the cut normal sound as the test data of normal sound, and generate the test data of anomalous sound by mixing the anomalous sound with the cut normal sound.
In total, we used 140 normal and anomalous sound samples for each ANR condition.
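The mixing procedure above can be sketched as follows (our simplified illustration with synthetic white-noise signals; the original used MATLAB): the representative power of each sound is the median frame-wise log power, and the anomalous sound is scaled so that the power difference equals the desired ANR before mixing.

```python
import math
import random

random.seed(3)
FRAME, SHIFT = 512, 256

def frame_log_powers(x):
    """Frame-wise log power (dB) with 512-point frames and a 256-point shift."""
    powers = []
    for start in range(0, len(x) - FRAME + 1, SHIFT):
        p = sum(v * v for v in x[start:start + FRAME]) / FRAME
        powers.append(10.0 * math.log10(p + 1e-12))
    return powers

def median(v):
    s = sorted(v)
    return s[len(s) // 2]

def mix_at_anr(normal, anomaly, anr_db):
    """Scale the anomalous sound so that (median anomaly power) minus
    (median normal power) equals anr_db, then mix the two signals."""
    p_n = median(frame_log_powers(normal))
    p_a = median(frame_log_powers(anomaly))
    gain = 10.0 ** ((anr_db - (p_a - p_n)) / 20.0)
    return [n + gain * a for n, a in zip(normal, anomaly)]

# One second of synthetic "normal" and "anomalous" sound at 16 kHz.
normal = [random.gauss(0.0, 0.1) for _ in range(16000)]
anomaly = [random.gauss(0.0, 0.5) for _ in range(16000)]
mixed = mix_at_anr(normal, anomaly, anr_db=-6.0)
print(len(mixed))
```

A negative ANR means the anomalous sound is quieter than the normal sound, which makes detection harder and is the practically interesting regime.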
The training dataset of normal sounds and the MATLAB code to generate the test dataset are freely available on the website.
Results
To evaluate the performance of ADS, we used the AUC, TPR$_\rho$, and partial AUC (pAUC) [38]. The AUC is a traditional performance measure of anomaly detection. The other two measures evaluate the performance under low FPR conditions. TPR$_\rho$ is the TPR under the condition that the FPR equals $\rho$. The pAUC is the AUC calculated with FPRs ranging from 0 to $p$, normalized by its maximum value. The parameters $\rho$ and $p$ were fixed to small values in all evaluations. We evaluated these metrics on three evaluation sets: 80 collision sounds (Collision), 60 sustained sounds (Sustained), and the union of these sounds (Mix).
The results for each score, sound category, and ANR on FNN and 1D-CRNN are shown in Figs. 7 and 8, respectively. Overall, the performances of AE, NP-PROP, and AUC-PROP were better than those of VAE and VAE-GAN. In detail, AE achieved high scores on all measures, AUC-PROP achieved high scores for AUC and pAUC, and NP-PROP achieved high scores for TPR$_\rho$ and pAUC. In addition, for all conditions, the TPR$_\rho$ and pAUC scores of NP-PROP were higher than those of AE. To discuss the differences between the objective functions of AE, NP-PROP, and AUC-PROP, we show the ROC curves in Fig. 9. Since the differences between the results on Collision, Sustained, and Mix were small, we plotted only those of the Mix dataset. From these ROC curves, we can see that the TPRs of NP-PROP under the low FPR conditions were significantly higher than those of the other methods. This might be because the objective function of NP-PROP works to increase the TPR under a low FPR condition. In addition, although AUC-PROP's TPRs under the low FPR conditions were lower than those of NP-PROP, its TPRs under the moderate and high FPR conditions were higher than those of the other methods. This might be because the objective function of AUC-PROP works to increase the TPR over all FPR conditions. Since the individual results and objective functions tend to coincide, we consider that the training of each neural network succeeded. Moreover, the TPR under low FPR conditions is especially important when ADS is used in real environments, because if an ADS system frequently gives false alerts, we cannot trust it. Therefore, unsupervised-ADS using an AE trained with (11) would be effective in real situations.
In addition, regarding the FNN results, VAE scored lower than AE, and VAE-GAN scored lower than all the other methods. These results suggest that, when calculating the anomaly score using a simple network architecture like FNN, a simple reconstruction error would be better than complex calculation procedures such as those of VAE and VAE-GAN. Moreover, the scores of CONV-PROP were lower than those of the DNN-based methods. In our previous study [8], we used a DNN as a feature extractor and constructed the normal model using a GMM. These results suggest that using a DNN for the normal model would be better than using a GMM.
IV-C Verification experiment in a real environment
We conducted three verification experiments to test whether anomalous sounds in real environments can be detected. The target equipment and experimental conditions were as follows:

Stereolithography 3D-printer: We collected an actual collision-type anomalous sound. Two hours' worth of normal sounds were collected as training data. The anomalous sound was caused by a collision between the sweeper and the formed object. The 3D-printer stopped 5 minutes after this anomalous sound occurred.

Air blower pump: We collected an actual collision-type anomalous sound. Twenty minutes' worth of normal sounds were collected as training data. The anomalous sound was caused by a foreign object stuck in the air blower duct. This anomaly does not lead to immediate machine failure; however, it should be addressed.

Water pump: We collected an actual sustained-type anomalous sound. Three hours' worth of normal sounds were collected as training data. The anomalous sound, which was due to wear of the bearings, had a larger amplitude than the normal sounds above 4 kHz. An expert conducting a periodic inspection diagnosed that the bearings needed to be replaced.
All anomalous and normal sounds were recorded at a 16-kHz sampling rate. The other conditions were the same as in the objective experiment. The FNN architecture was used for the anomaly score calculation.
Figure 10 shows the spectrogram (top) and the anomaly scores of each method (bottom). The red dashed line in each of the bottom figures is the threshold, which is defined such that the FPR on the training data is 0.1%. Anomalous sounds are enclosed in white dotted boxes in the spectrograms, and false-positive detections are circled in purple in the anomaly score graphs. Since the anomalous sound of the water pump is a sustained sound, for ease of comparison, 60 seconds of normal sound and 60 seconds of anomalous sound are concatenated in each figure. In addition, the regions around the anomalous sounds are enlarged, since the spectral changes due to the anomalous sounds of the 3D-printer and water pump are otherwise difficult to see.
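The threshold used in these plots, set so that the FPR on the normal training data is 0.1%, is simply a high quantile of the training anomaly scores. A short sketch (the function name is ours):

```python
import numpy as np

def fpr_threshold(train_scores, target_fpr=0.001):
    """Pick the anomaly-score threshold so that the fraction of normal
    training scores above it is approximately target_fpr (0.1% here)."""
    return np.quantile(train_scores, 1.0 - target_fpr)
```

Frames whose anomaly score exceeds this threshold are flagged as anomalous; lowering `target_fpr` trades missed detections for fewer false alerts, which is why a wide score margin above the threshold matters in the discussion below.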
All of the results for NP-PROP and AUC-PROP indicate that the anomalous sounds were clearly detected; the anomaly scores of the anomalous sounds evidently exceeded the threshold, while those of the normal sounds stayed below it. Meanwhile, in the results of AE and VAE, although the anomaly scores of all anomalous sounds exceeded the threshold, false positives were also observed in the results for the water pump. In addition, although AE's anomaly score for the 3D-printer and VAE's anomaly score for the air blower pump exceeded the threshold, the margin above the threshold was small, and it would be difficult to use a higher threshold to reduce the FPR. This might be because these objective functions do not work to increase the anomaly scores of anomalous sounds, and thus the encoder and decoder reconstructed not only normal sounds but also anomalous sounds. With VAE-GAN, the anomaly scores of the 3D-printer and the water pump exceeded the threshold, whereas those of the air blower pump did not. The reason might be that when the generator precisely generates "fake" normal sounds, the normal model is trained to increase the anomaly scores of normal sounds; therefore, the threshold for the air blower pump, which is defined such that the FPR on the normal training data equals 0.001, takes a very high value. These verification experiments suggest that the proposed method is effective at identifying anomalous sounds under practical conditions.
V Conclusions
This paper proposed a novel training method for unsupervised-ADS using an AE for detecting unknown anomalous sounds. The contributions of this research are as follows: 1) by considering outlier-detection-based ADS as a statistical hypothesis test, we defined an objective function that builds upon the Neyman-Pearson lemma [29]. The objective function increases the TPR under a low FPR condition, which is often used in practice. 2) By considering the set of anomalous sounds to be the complement of the set of normal sounds, we formulated a rejection sampling algorithm to simulate anomalous sounds. Experimental results showed that these contributions enabled us to construct an ADS system that accurately detects unknown anomalous sounds in three real environments.
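The rejection-sampling idea in contribution 2), treating anomalies as the complement of the normal set, can be sketched as follows: draw candidates from a broad proposal distribution and keep only those that the normal model assigns low probability. The proposal and acceptance rule below are illustrative, not the paper's exact algorithm:

```python
import numpy as np

def sample_pseudo_anomalies(normal_logpdf, proposal_sampler,
                            logpdf_threshold, n, max_tries=100000):
    """Simulate anomalous samples: draws from a broad proposal are kept
    only when they are unlikely under the normal model, i.e. when they
    fall in the complement of the normal set."""
    out = []
    for _ in range(max_tries):
        x = proposal_sampler()
        if normal_logpdf(x) < logpdf_threshold:  # in the complement set
            out.append(x)
            if len(out) == n:
                break
    return np.array(out)
```

Such simulated anomalies make the TPR term of the objective computable even though no real anomalous training data exist.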
In the future, we will tackle the following remaining issues of ADS systems in real environments:
1) Extension to a supervised approach to detect both known and unknown anomalous sounds: while operating an ADS system in a real environment, we may occasionally obtain partial samples of anomalous sounds. While it might be better to use the collected anomalous sounds in training, the cross-entropy loss would not be the best way to detect both known and unknown anomalous sounds [39]. In addition, if we calculate the TPR in the objective function using only a subset of anomalous sounds, the training does not guarantee the performance for unknown anomalous sounds. Thus, we should develop a supervised-ADS method that can also detect unknown anomalous sounds; a preliminary study on this has been published in [25].
2) Incorporating machine- or context-specific knowledge: to simplify the experiments, we used the simple detection rule described in Sec. III-D. However, for anomaly alerts, it would be better to use machine- or context-specific rules, such as modifying or smoothing the detection result derived from the raw anomaly score. Thus, it will be necessary to develop rules or a trainable post-processing block to modify the anomaly score.
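As one example of such a post-processing block, a simple moving-median smoother over the frame-wise anomaly scores suppresses isolated spikes before thresholding, so that only sustained excursions raise an alert. The window length is a machine-specific choice, not prescribed by the paper:

```python
import numpy as np

def smooth_scores(scores, window=5):
    """Median-filter the frame-wise anomaly scores: a single spiky
    frame is suppressed, while a sustained excursion survives."""
    half = window // 2
    padded = np.pad(scores, half, mode="edge")  # extend edges
    return np.array([np.median(padded[i:i + window])
                     for i in range(len(scores))])
```

A trainable alternative would replace the fixed median filter with a small learned network, but even this fixed rule removes many one-frame false alerts.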
A List of Symbols
1. Functions
Objective function  
Anomaly score  
Binary decision  
Encoder of autoencoder  
Decoder of autoencoder  
Generator  
Gaussian distribution  
Expectation with respect to  
Gradient with respect to  
Trace of matrix  
KullbackLeibler divergence between and  
norm  
Flooring function 
2. Parameters
Parameters of normal model  
Parameters of encoder  
Parameters of decoder  
Parameters of generator  
Parameters of Gaussian mixture model 
3. Variables
Input vector  
State variable  
Latent vector  
Threshold for anomaly score  
Desired false positive rate  
Mean vector  
Covariance matrix  
Mixing weight of Gaussian mixture model  
Number of Gaussian mixtures  
Number of time frames of observation  
Number of training samples  
Minibatch size  
Dimension of input vector  
Dimension of latent vector  
Step size for gradient method  
Context window size  
Temporary variable of anomaly score  
Anomaly decision score for one audio clip 
4. Notations
Timeframe index of observation  
Index of training sample  
Index of Gaussian distribution  
Transpose of matrix or vector  
Variable of normal sound  
Variable of anomalous sound  
Variable of various sound 
Yuma Koizumi (M ’15) received the B.S. and M.S. degrees from Hosei University, Tokyo, in 2012 and 2014, and the Ph.D. degree from the University of Electro-Communications, Tokyo, in 2017. Since joining the Nippon Telegraph and Telephone Corporation (NTT) in 2014, he has been researching acoustic signal processing and machine learning, including basic research on sound source enhancement and unsupervised/supervised anomaly detection in sound. He was awarded the FUNAI Best Paper Award and the IPSJ Yamashita SIG Research Award from the Information Processing Society of Japan (IPSJ) in 2013 and 2014, respectively, and the Awaya Prize from the Acoustical Society of Japan (ASJ) in 2017. He is a member of the ASJ and the Institute of Electronics, Information and Communication Engineers (IEICE). 
Shoichiro Saito (S ’06–M ’07) received the B.E. and M.E. degrees from the University of Tokyo in 2005 and 2007. Since joining NTT in 2007, he has been engaged in research and development of acoustic signal processing systems, including acoustic echo cancellers, hands-free telecommunication, and anomaly detection in sound. He is currently a Senior Research Engineer in the Audio, Speech, and Language Media Laboratory, NTT Media Intelligence Laboratories. He is a member of the IEICE and the ASJ. 
Hisashi Uematsu received the B.E., M.E., and Ph.D. degrees in Information Science from Tohoku University, Miyagi, in 1991, 1993, and 1996. He joined NTT in 1996 and has been engaged in research on psychoacoustics (human auditory mechanisms) and digital signal processing. He is currently a Senior Research Engineer in the Cross-Modal Computing Project, NTT Media Intelligence Laboratories. He was awarded the Awaya Prize from the ASJ in 2001. He is a member of the ASJ. 
Yuta Kawachi received the B.E. and M.E. degrees from Waseda University, Tokyo, in 2012 and 2014. Since joining NTT in 2014, he has been researching acoustic signal processing and machine learning. He is a member of the ASJ. 
Noboru Harada (M ’99–SM ’18) received the B.S. and M.S. degrees from the Department of Computer Science and Systems Engineering, Kyushu Institute of Technology, in 1995 and 1997, respectively. He received the Ph.D. degree from the Graduate School of Systems and Information Engineering, University of Tsukuba, in 2017. Since joining NTT in 1997, he has been researching speech and audio signal processing, such as high-efficiency coding and lossless compression. His current research interests include acoustic signal processing and machine learning for acoustic event detection, including anomaly detection in sound. He received the Technical Development Award from the ASJ in 2016, the Industrial Standardization Encouragement Award from the Ministry of Economy, Trade and Industry (METI) of Japan in 2011, and the Telecom System Technology Paper Encouragement Award from the Telecommunications Advancement Foundation (TAF) of Japan in 2007. He is a member of the ASJ, the IEICE, and the IPSJ. 
Footnotes
 ANR is a measure comparing the level of an anomalous sound to the level of a normal sound. This definition is the same as the signaltonoise ratio (SNR) when the signal is an anomalous sound and the noise is a normal sound.
 https://archive.org/details/ADSdataset
References
 C. Clavel, T. Ehrette, and G. Richard, “Events Detection for an Audio-Based Surveillance System,” in Proc. of ICME, 2005.
 G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti, “Scream and Gunshot Detection and Localization for Audio-Surveillance Systems,” in Proc. of AVSS, 2007.
 S. Ntalampiras, I. Potamitis, and N. Fakotakis, “Probabilistic Novelty Detection for Acoustic Surveillance Under Real-World Conditions,” IEEE Trans. on Multimedia, pp.713–719, 2011.
 P. Foggia, N. Petkov, A. Saggese, N. Strisciuglio, and M. Vento, “Audio Surveillance of Roads: A System for Detecting Anomalous Sounds,” IEEE Trans. ITS, pp.279–288, 2016.
 P. Coucke, B. De. Ketelaere, and J. De. Baerdemaeker, “Experimental analysis of the dynamic, mechanical behavior of a chicken egg,” Journal of Sound and Vibration, Vol. 266, pp.711–721, 2003.
 Y. Chung, S. Oh, J. Lee, D. Park, H. H. Chang and S. Kim, “Automatic Detection and Recognition of Pig Wasting Diseases Using Sound Data in Audio Surveillance Systems,” Sensors, pp.12929–12942, 2013.
 A. Yamashita, T. Hara, and T. Kaneko, “Inspection of Visible and Invisible Features of Objects with Image and Sound Signal Processing,” in Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS2006), pp. 3837–3842, 2006.
 Y. Koizumi, S. Saito, H. Uematsu, and N. Harada, “Optimizing Acoustic Feature Extractor for Anomalous Sound Detection Based on Neyman-Pearson Lemma,” in Proc. of EUSIPCO, 2017.
 A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: tasks, datasets and baseline system,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), pp. 85–92, 2017.
 H. Lim, J. Park and Y. Han, “Rare Sound Event Detection Using 1D Convolutional Recurrent Neural Networks,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017.
 E. Cakir and T. Virtanen, “Convolutional Recurrent Neural Networks for Rare Sound Event Detection,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017.
 H. EghbalZadeh, B. Lehner, M. Dorfer, and G. Widmer, “CPJKU Submissions for DCASE2016: a Hybrid Approach Using Binaural IVectors and Deep Convolutional Neural Networks,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), 2016.
 S. Mun, S. Park, D. K. Han, and H. Ko, “Generative Adversarial Network Based Acoustic Scene Training Set Augmentation and Selection Using Svm Hyperplane,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017.
 S. Adavanne, G. Parascandolo, P. Pertila, T. Heittola, and T. Virtanen, “Sound Event Detection in Multichannel Audio Using Spatial and Harmonic Features,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), 2016.
 S. Adavanne, and T. Virtanen, “A Report on Sound Event Detection with Different Binaural Features,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017.
 T. Lidy and A. Schindler, “CQTBased Convolutional Neural Networks for Audio Scene Classification and Domestic Audio Tagging,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), 2016.
 V. J. Hodge and J. Austin, “A Survey of Outlier Detection Methodologies,” Artificial Intelligence Review, pp.85–126, 2004.
 A. Patcha and J. M. Park, “An overview of anomaly detection techniques: Existing solutions and latest technological trends,” Computer Networks, pp.3448–3470, 2007.
 V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Computing Surveys, 2009.
 E. Marchi, F. Vesperini, F. Eyben, S. Squartini, and B. Schuller, “A Novel Approach for Automatic Acoustic Novelty Detection using a Denoising Autoencoder with Bidirectional LSTM Neural Networks,” in Proc. of ICASSP, 2015.
 T. Tagawa, Y. Tadokoro, and T. Yairi, “Structured Denoising Autoencoder for Fault Detection and Analysis,” Proceedings of Machine Learning Research, pp.96–111, 2015.
 E. Marchi, F. Vesperini, F. Weninger, F. Eyben, S. Squartini, and B. Schuller, “Non-linear prediction with LSTM recurrent neural networks for acoustic novelty detection,” in Proc. of IJCNN, 2015.
 Y. Kawaguchi and T. Endo, “How can we detect anomalies from subsampled audio signals?,” in Proc. of MLSP, 2017.
 J. An and S. Cho, “Variational Autoencoder based Anomaly Detection using Reconstruction Probability,” Technical Report, SNU Data Mining Center, pp.1–18, 2015.
 Y. Kawachi, Y. Koizumi, and N. Harada, “Complementary Set Variational Autoencoder for Supervised Anomaly Detection,” in Proc. of ICASSP, 2018.
 I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Networks,” in Proc. of NIPS, 2014.
 A. B. L. Larsen, S. K. Sonderby, H. Larochelle, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” in Proc. of ICML, 2016.
 T. Schlegl, P. Seebock, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs, “Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery,” in Proc. of IPMI, 2017.
 J. Neyman and E. S. Pearson, “On the Problem of the Most Efficient Tests of Statistical Hypotheses,” Phil. Trans. of the Royal Society, 1933.
 G. Casella and R. L. Berger, “Statistical Inference, section 8.3.2 Most Powerful Test,” Duxbury Press, pp.387–393, 2001.
 A. P. Bradley, “The use of the area under the ROC curve in the evaluation of machine learning algorithms,” Pattern Recognition, vol. 30, no. 7, pp. 1145–1159, 1997.
 A. Herschtal and B. Raskutti, “Optimising Area Under the ROC Curve Using Gradient Descent,” in Proc. of ICML, 2004.
 A. Fujino and N. Ueda, “A Semi-supervised AUC Optimization Method with Generative Models,” in Proc. of ICDM, 2016.
 D. P. Kingma and J. L. Ba, “Adam: A Method for Stochastic Optimization,” in Proc. of ICLR, 2015.
 A. Krogh and J. A. Hertz, “A Simple Weight Decay Can Improve Generalization,” in Proc. of NIPS, 1992.
 http://www.cs.tut.fi/sgn/arg/dcase2016/
 http://www.cs.tut.fi/sgn/arg/dcase2016/download
 S. D. Walter, “The partial area under the summary ROC curve,” Statistics in Medicine, pp.2025–2040, 2005.
 N. Gornitz, M. Kloft, K. Rieck, and U. Brefeld, “Toward Supervised Anomaly Detection,” Journal of Artificial Intelligence Research, pp.235–262, 2013.