Improving Unsupervised Defect Segmentation by Applying Structural Similarity to Autoencoders
Abstract
Convolutional autoencoders have emerged as popular models for unsupervised defect segmentation on image data. Most commonly, this task is performed by thresholding a pixelwise reconstruction error based on an $\ell^p$ distance. However, this procedure generally leads to high novelty scores whenever the reconstruction encompasses slight localization inaccuracies around edges. We show that this problem prevents these approaches from being applied to complex real-world scenarios and that it cannot be easily avoided by employing more elaborate architectures. Instead, we propose to use a perceptual loss function based on structural similarity. Our approach achieves state-of-the-art performance on a real-world dataset of nanofibrous materials, while being trained end-to-end without requiring additional priors such as pretrained networks or handcrafted features.
1 Introduction
Visual inspection is essential in many industrial manufacturing pipelines to ensure high production quality and increased cost effectiveness by quickly discarding defective parts. Since manual inspection by humans is slow, expensive, and error-prone, the usage of fully automated computer vision systems is becoming increasingly popular. Supervised methods, where the system learns how to segment defective regions by training on both defective and nondefective samples, are commonly used. However, generating labeled data requires considerable effort, and all possible defect types need to be known beforehand. Furthermore, in some production processes the scrap rate might be too small to produce a sufficient number of defective samples for training, especially for data-hungry deep learning models.
In this work, we focus on unsupervised defect segmentation for visual inspection. Our goal is to segment defective regions in images after having trained exclusively on nondefective samples. It has been shown that architectures based on convolutional neural networks (CNNs), such as autoencoders [1] or generative adversarial networks (GANs) [2], can be used for this task. We give a brief overview of such methods in Section 2. These models aim to reconstruct their inputs in the presence of certain constraints such as a bottleneck and thereby manage to capture the essence of high-dimensional data (e.g. images) in a lower-dimensional space. Anomalies in the test data deviate from the training data manifold, so the model fails to reproduce them. As a result, large reconstruction errors indicate the presence of defects. Typically, the error measure that is employed is a pixelwise $\ell^p$ distance such as $\ell^1$ or $\ell^2$. However, these measures yield high novelty scores in locations where the reconstruction is only slightly inaccurate, for example due to small localization imprecisions of edges. They also fail to detect structural differences between input and reconstructed images when the respective pixels’ color values are roughly consistent. This limits the usefulness of these methods when employed in complex real-world scenarios.
To alleviate these problems, we propose to compare input and reconstructed images using the structural similarity (SSIM) metric [3], a distance measure designed to capture perceptual similarity. By applying this method to a real-world inspection dataset of industrial relevance, we show that it solves the aforementioned problems and yields a performance that is on par with other state-of-the-art unsupervised defect segmentation approaches (cf. Section 4.3). In contrast to these, we do not rely on any model priors, such as handcrafted features or pretrained networks. Figure 1 shows some qualitative results of our method.
2 Related Work
Detecting anomalies that deviate from the training data has been a longstanding problem in machine learning. Pimentel et al. [4] give a comprehensive overview of the field. In computer vision, one needs to distinguish between two setups of this task. First, there is the classification scenario, where novel samples appear as entirely different object classes that shall be labeled as outliers. Second, there is a scenario where novelties manifest themselves in subtle deviations from otherwise known structures and a segmentation of these deviations is required. For the first subproblem, a number of approaches have been proposed [5, 6]. We will limit ourselves to an overview of methods that attempt to tackle the latter problem.
Napoletano et al. [7] extract features from a CNN that has been pretrained on a classification task. The features are clustered in a dictionary during training and anomalous structures are identified when the extracted features strongly deviate from the learned cluster centers. General applicability of this approach is not guaranteed since the pretrained network might not extract useful features for the new task at hand and it is unclear which features of the network should be selected for clustering. The results achieved with this method are the current state-of-the-art on the NanoTWICE dataset (cf. Section 4.1) we use in our experiments. They improve upon previous results by Carrera et al. [8], who build a dictionary that yields a sparse representation of the normal data. Similar approaches using sparse representations for novelty detection are [9, 10, 11].
Schlegl et al. [12] train a GAN on optical coherence tomography images of the retina and detect anomalies such as retinal fluid by searching for a latent sample that minimizes the pixelwise reconstruction error as well as a discriminator loss. The rather large number of optimization steps that must be performed to find a suitable latent sample makes this approach very slow. Therefore, it is only of use in practical applications that are not time-critical. Recently, Zenati et al. [13] proposed to use bidirectional GANs [14] to add the missing encoder network for faster inference. However, GANs are prone to mode collapse, meaning that there is no guarantee that all modes of the distribution of nondefective images are captured by the model. Furthermore, they are more difficult to train than autoencoders since the loss function of the adversarial training typically cannot be trained to convergence [15]. Instead, the training results must be judged manually after regular optimization intervals.
Baur et al. [16] propose a general framework for defect segmentation using autoencoding architectures and a per-pixel reconstruction loss. To circumvent the disadvantages of their loss function, they improve the reconstruction quality by requiring aligned input data and adding an adversarial loss to enhance the visual quality of the reconstructed images. However, for many applications that work on unstructured data, prior alignment is impossible. In addition to introducing instabilities during training, adversarial losses might alter the visual appearance of the reconstruction, which further discourages the use of a per-pixel error function.
Other approaches take into account the structure of the latent space of variational autoencoders [17] in order to define measures for outlier detection. An et al. [18] define a reconstruction probability for every image pixel by drawing multiple samples from the estimated encoding distribution and measuring the variability of the decoded outputs. Soukup et al. [19] disregard the decoder output entirely and instead compute the KL divergence as a novelty measure between the prior and the encoder distribution. This is based on the underlying assumption that defective inputs will manifest themselves in mean and variance values that are very different from those of the prior. Similarly, Vasilev et al. [20] define multiple novelty measures, either purely considering latent space behavior or combined measures with pixelwise reconstruction losses. Obtaining only a single scalar value that indicates novelty can quickly become a performance bottleneck in a segmentation scenario, where a separate forward pass would be required for each image pixel to obtain an accurate segmentation result. Furthermore, we show that pixelwise reconstruction probabilities obtained from variational autoencoders suffer from the same problems as pixelwise deterministic losses (cf. Section 4.3).
Ridgeway et al. [21] show that SSIM [3] and the closely related multiscale version MS-SSIM [22] can be used as differentiable loss functions to generate sharper reconstructions in autoencoding architectures. Autoencoders are straightforward to train and reliably reconstruct nondefective images while visually altering defective regions to keep the reconstruction close to the learned manifold of the training data. While pixelwise loss functions are not designed to detect such structural changes, SSIM performs much better at identifying these alterations since it is designed to measure perceptual similarity.
3 Methodology
3.1 Autoencoders
Autoencoders attempt to reconstruct an input image $x \in \mathbb{R}^{k \times h \times w}$ through a bottleneck, effectively projecting the input image into a lower-dimensional space, called latent space. An autoencoder consists of an encoder function $E: \mathbb{R}^{k \times h \times w} \to \mathbb{R}^{d}$ and a decoder function $D: \mathbb{R}^{d} \to \mathbb{R}^{k \times h \times w}$, where $d$ denotes the dimensionality of the latent space and $k$, $h$, and $w$ denote the number of channels, height, and width of the input image, respectively. Choosing $d \ll k \times h \times w$ prevents the architecture from simply copying its input and forces the encoder to extract useful features from the input patches that facilitate accurate reconstruction by the decoder. The overall process can be summarized as
$$\hat{x} = D(E(x)) = D(z) \tag{1}$$
where $z$ denotes the latent vector and $\hat{x}$ is the reconstruction of the input. In the following, the functions $E$ and $D$ are parameterized by CNNs. Strided convolutions are used to downsample the input feature maps in the encoder and to upsample them in the decoder.
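As a rough illustration of how strided convolutions shrink and restore the spatial resolution, the following sketch applies the standard output-size formulas for convolutions and transposed convolutions. The kernel size, stride, padding, input size, and number of layers are assumptions chosen for illustration and are not taken from the architecture used in this work.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a strided convolution (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out_size(size, kernel, stride, padding):
    """Spatial output size of a transposed convolution without output padding."""
    return (size - 1) * stride - 2 * padding + kernel


# Hypothetical settings: 4x4 kernels, stride 2, padding 1, five layers, 128-pixel input.
KERNEL, STRIDE, PADDING, LAYERS = 4, 2, 1, 5

encoder_sizes = [128]
for _ in range(LAYERS):
    encoder_sizes.append(conv_out_size(encoder_sizes[-1], KERNEL, STRIDE, PADDING))

decoder_sizes = [encoder_sizes[-1]]
for _ in range(LAYERS):
    decoder_sizes.append(
        conv_transpose_out_size(decoder_sizes[-1], KERNEL, STRIDE, PADDING)
    )

print("encoder feature map sizes:", encoder_sizes)
print("decoder feature map sizes:", decoder_sizes)
```

With these assumed settings each encoder layer halves the side length and each decoder layer doubles it again, so the decoder output has the same spatial size as the input.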
To force the autoencoder to reconstruct its input, a loss function must be defined that guides it towards this behavior. For simplicity, one often chooses a per-pixel error measure, such as the $\ell^2$ loss
$$L_2(x, \hat{x}) = \sum_{r=0}^{h-1} \sum_{c=0}^{w-1} \left( x(r, c) - \hat{x}(r, c) \right)^2 \tag{2}$$
where $x(r, c)$ denotes the intensity value of image $x$ at row and column indices $r$ and $c$. This loss function is also widely used for both the training and the evaluation of unsupervised defect segmentation autoencoders. We will discuss the usefulness of such a pixelwise error measure and present a better alternative, the structural similarity index, in Section 3.2.
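The per-pixel squared error can be sketched in a few lines. This is a minimal illustration for grayscale images stored as nested lists; the function names are hypothetical. The error map is what would be thresholded for segmentation, and its sum is the scalar loss.

```python
def squared_error_map(x, x_hat):
    """Per-pixel squared difference between an image and its reconstruction."""
    return [
        [(a - b) ** 2 for a, b in zip(row_x, row_r)]
        for row_x, row_r in zip(x, x_hat)
    ]


def l2_loss(x, x_hat):
    """Sum of the per-pixel squared differences over all rows and columns."""
    return sum(sum(row) for row in squared_error_map(x, x_hat))


image = [[0.0, 1.0], [1.0, 0.0]]
reconstruction = [[0.0, 0.5], [1.0, 0.0]]
print(squared_error_map(image, reconstruction))
print(l2_loss(image, reconstruction))
```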
There exist various extensions to the deterministic autoencoder framework. Some works, such as the recently introduced variational autoencoder (VAE) [17], constrain the latent variables to follow a certain distribution $P(z)$. For simplicity, the distribution is typically chosen to be a unit-variance Gaussian. This turns the entire framework into a probabilistic model that enables efficient posterior inference and also allows new data to be generated from the training manifold by sampling from the latent distribution. The approximate posterior distribution $Q(z \mid x)$ obtained by encoding an input image can be used to define further novelty measures. One option is to compute a distance between the two distributions, such as the KL divergence, and indicate novelty for large deviations from the prior [19]. However, this approach by itself does not yield a pixel-accurate segmentation; to obtain one, a forward pass needs to be performed for a patch centered around each pixel of the entire input image. A second approach for utilizing the posterior, which yields a novelty score for each pixel, is to decode latent samples drawn from $Q(z \mid x)$ and then to evaluate the per-pixel reconstruction probability as described in [18].
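For the KL-based novelty measure, a closed form exists when the encoder distribution is a diagonal Gaussian and the prior is a standard normal. The sketch below assumes exactly that setting (a common choice for VAEs, not confirmed here as the configuration of [19]); the function name is hypothetical.

```python
import math


def kl_to_standard_normal(mean, variance):
    """KL( N(mean, diag(variance)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(
        m * m + v - 1.0 - math.log(v) for m, v in zip(mean, variance)
    )


# An encoding that matches the prior has zero divergence.
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))
# An encoding far from the prior would receive a large novelty score.
print(kl_to_standard_normal([3.0, -2.0], [0.1, 4.0]))
```

This yields a single scalar per encoded patch, which is why a per-pixel segmentation would need one forward pass per pixel.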
Another extension to standard autoencoders was proposed by Dosovitskiy et al. [23]. They increase the quality of the produced reconstructions by extracting features from both the input image $x$ and its reconstruction $\hat{x}$ and enforcing them to be equal. Let $F: \mathbb{R}^{k \times h \times w} \to \mathbb{R}^{f}$ be a feature extractor that obtains an $f$-dimensional feature vector from an input image. Then a regularizer can be added to the loss function of the autoencoder, yielding the feature matching autoencoder (FMAE) loss
$$L_{\mathrm{FM}}(x, \hat{x}) = L_2(x, \hat{x}) + \lambda \, \lVert F(x) - F(\hat{x}) \rVert_2^2 \tag{3}$$
where $L_2$ is the $\ell^2$ loss of Equation 2 and $\lambda > 0$ denotes the weighting factor between the two loss terms. $F$ can be parametrized using the first layers of a CNN pretrained on an image classification task. We show that employing such more elaborate architectures does not yield satisfactory improvements over deterministic autoencoders trained and evaluated with a pixelwise $\ell^2$ distance.
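The structure of the feature matching loss can be sketched as follows. The feature extractor here is a deliberately simple, hypothetical stand-in (horizontal intensity differences) and not a pretrained CNN; the default weighting factor is likewise an assumption for illustration.

```python
def l2_loss(x, x_hat):
    """Sum of per-pixel squared differences."""
    return sum(
        (a - b) ** 2 for row_x, row_r in zip(x, x_hat) for a, b in zip(row_x, row_r)
    )


def horizontal_gradients(image):
    """Hypothetical stand-in for F: differences of horizontally adjacent pixels."""
    return [row[c + 1] - row[c] for row in image for c in range(len(row) - 1)]


def feature_matching_loss(x, x_hat, extractor=horizontal_gradients, lam=1.0):
    """Pixelwise loss plus a weighted squared distance between extracted features."""
    features_x, features_r = extractor(x), extractor(x_hat)
    feature_term = sum((a - b) ** 2 for a, b in zip(features_x, features_r))
    return l2_loss(x, x_hat) + lam * feature_term


image = [[0.0, 1.0], [1.0, 0.0]]
reconstruction = [[0.0, 0.0], [1.0, 0.0]]
print(feature_matching_loss(image, reconstruction))
```

Setting `lam=0.0` recovers the plain pixelwise loss, which makes the contribution of the regularizer easy to inspect.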
3.2 Structural Similarity
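The SSIM metric [3] defines a symmetric distance measure between two $K \times K$ sized image patches $p$ and $q$, taking into account their similarity in luminance $l(p, q)$, contrast $c(p, q)$, and structure $s(p, q)$. These are combined as a product
$$\mathrm{SSIM}(p, q) = l(p, q)^{\alpha} \, c(p, q)^{\beta} \, s(p, q)^{\gamma} \tag{4}$$
where $\alpha$, $\beta$, and $\gamma$ are weight factors for the three terms. They are typically set to $\alpha = \beta = \gamma = 1$ to simplify the expression. Based on the mean values $\mu_p$ and $\mu_q$, variances $\sigma_p^2$ and $\sigma_q^2$, and covariance $\sigma_{pq}$, the above equation can then be compactly rewritten as
$$\mathrm{SSIM}(p, q) = \frac{(2 \mu_p \mu_q + c_1)(2 \sigma_{pq} + c_2)}{(\mu_p^2 + \mu_q^2 + c_1)(\sigma_p^2 + \sigma_q^2 + c_2)} \tag{5}$$
The constants $c_1$ and $c_2$ ensure numerical stability and are typically set to $c_1 = 0.01$ and $c_2 = 0.03$. It holds that $\mathrm{SSIM}(p, q) \in [-1, 1]$. In particular, $\mathrm{SSIM}(p, q) = 1$ if and only if $p$ and $q$ are identical [3].
To compute the structural similarity between an entire image $x$ and its reconstruction $\hat{x}$, one slides a $K \times K$ sized window across the image and computes an SSIM value at each pixel location. Since Equation 5 is differentiable, it can be employed as a loss function in deep learning architectures that are optimized using gradient descent.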
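The sliding-window computation can be sketched as below. This is a simplified illustration under several assumptions: uniform weights inside the window, population (not sample) variances, intensities in the range 0 to 1, default constants of 0.01 and 0.03, and values computed only where the window fits entirely inside the image. It is not the implementation used in this work.

```python
def ssim_patch(p, q, c1=0.01, c2=0.03):
    """SSIM of two flattened patches from means, variances, and covariance."""
    n = len(p)
    mu_p, mu_q = sum(p) / n, sum(q) / n
    var_p = sum((a - mu_p) ** 2 for a in p) / n
    var_q = sum((b - mu_q) ** 2 for b in q) / n
    cov = sum((a - mu_p) * (b - mu_q) for a, b in zip(p, q)) / n
    numerator = (2 * mu_p * mu_q + c1) * (2 * cov + c2)
    denominator = (mu_p ** 2 + mu_q ** 2 + c1) * (var_p + var_q + c2)
    return numerator / denominator


def ssim_map(x, x_hat, k):
    """Slide a k x k window over both images; one SSIM value per valid position."""
    h, w = len(x), len(x[0])
    result = []
    for r in range(h - k + 1):
        row = []
        for c in range(w - k + 1):
            p = [x[r + i][c + j] for i in range(k) for j in range(k)]
            q = [x_hat[r + i][c + j] for i in range(k) for j in range(k)]
            row.append(ssim_patch(p, q))
        result.append(row)
    return result


# Toy data: a checkerboard compared with itself and with its inverse.
checker = [[float((r + c) % 2) for c in range(6)] for r in range(6)]
inverted = [[1.0 - v for v in row] for row in checker]
same = ssim_map(checker, checker, 3)
opposite = ssim_map(checker, inverted, 3)
print(same[0][0], opposite[0][0])
```

Identical inputs give values of one everywhere, while the inverted checkerboard has negative covariance in every window and therefore negative SSIM values.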
Figure 2 shows the advantages that SSIM has over pixelwise error functions such as $\ell^2$. In the left image of Figure 2, we see the input to an autoencoder that contains four gray strokes that simulate defects. The right image shows the corresponding reconstruction created by an autoencoder trained on defect-free checkerboard patterns. Figure 2 also shows the error maps obtained when computing SSIM over a sliding window (left) and the $\ell^2$ distance (right) between the two images. For the $\ell^2$ distance, both the defects and the inaccuracies in the reconstruction of the edges are weighted equally in the error map, which makes them indistinguishable. In contrast, SSIM gives more weight to the actual defects, assigning less importance to the small inaccuracies in the reconstruction of the edges. This ultimately enables us to detect and segment defects in complex structures.
4 Experiments
We evaluate our method on a dataset of nanofibrous materials [8] and compare it to $\ell^2$-loss-based deterministic, variational, and feature matching autoencoders. Figure 1 shows two images of the dataset where red contours outline the ground truth of present defects and green areas indicate defective regions found by our method.
4.1 The NanoTWICE Dataset
The dataset consists of 45 grayscale images of nanofibrous materials acquired by a scanning electron microscope and is publicly available (http://www.mi.imati.cnr.it/ettore/NanoTWICE/). A detailed description of the acquisition process can be found in [8]. All images share the same size, and the dataset is composed of two disjoint subsets. The first set consists of five images that do not contain any anomalies. We use four of these images for training. The fifth can be used as a validation image for setting the threshold at test time by fixing a certain false positive rate. The remaining 40 images constitute the second subset, which is used for testing. These images contain various defects such as beads, specks of dust, or flattened areas, which are annotated with a pixelwise segmentation map.
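Choosing the threshold from a defect-free validation image can be sketched as follows. Because every validation pixel is nondefective, any pixel whose novelty score exceeds the threshold is a false positive, so the threshold is simply a high quantile of the validation scores. The function name and the toy scores are hypothetical.

```python
def threshold_for_fpr(validation_scores, target_fpr):
    """Smallest observed score with at most target_fpr of the scores strictly above it."""
    ordered = sorted(validation_scores, reverse=True)
    # Number of validation pixels that may exceed the threshold.
    allowed = min(int(target_fpr * len(ordered)), len(ordered) - 1)
    return ordered[allowed]


# Toy novelty scores from a defect-free validation image.
scores = list(range(1, 101))
threshold = threshold_for_fpr(scores, 0.05)
false_positive_rate = sum(s > threshold for s in scores) / len(scores)
print(threshold, false_positive_rate)
```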

4.2 Training and Testing Procedure
For the training of our autoencoder, we employ the following steps. First, we extract 20,000 patches of size $128 \times 128$ from the given training images, since the input images are comparably large and only few of them are available. Based on our general autoencoding structure as shown in Figure 3, we set up four different architectures for training and evaluation. Three networks are trained using the $\ell^2$ error metric: a deterministic, a variational, and a feature matching autoencoder. The fourth architecture is a deterministic autoencoder using SSIM. We train each network for 200 epochs, using the ADAM [24] optimizer with a learning rate of 0.0002 and weight decay.
In order to improve the quality of our reconstructions, which might enable the $\ell^2$ error metric to find defects more reliably, we train a deterministic autoencoder with the feature matching loss defined in Equation 3, using a fixed weighting factor $\lambda$. For calculating the features to be compared between the input and reconstructed image, we use the first three convolutional layers of an AlexNet [25] pretrained on ImageNet [26].
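Table 1: Area under the ROC curve (AUC) for different latent dimensions, SSIM window sizes, and patch sizes.

Latent dimension   AUC      SSIM window size   AUC      Patch size   AUC
50                 0.848    3                  0.889
100                0.935    7                  0.965    32           0.949
200                0.961    11                 0.966    64           0.959
500                0.966    15                 0.960    128          0.966
1000               0.962    19                 0.952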
The evaluation is performed by striding over the testing images and reconstructing image patches of size $128 \times 128$ using the trained autoencoder. In principle, it would be possible to set the horizontal and vertical stride to the patch size. We noted, however, that at different spatial locations the autoencoder produces slightly different reconstructions of the same data, which leads to some striding artifacts. Therefore, we decreased the stride, so that neighboring patches overlap, and averaged the reconstructed pixel values. Then, we compare the input to the reconstruction using the respective error metric that was used for training ($\ell^2$ or SSIM). In the case of the variational autoencoder, we decode latent samples from the approximate posterior distribution and evaluate the reconstruction probability for each pixel as a novelty score. We expect a larger variance of the decoded reconstructions for defective input patches, yielding lower reconstruction probabilities, which might improve the performance in comparison to the deterministic autoencoder. The resulting novelty maps are thresholded to obtain candidate regions where a defect might be present. An opening with a circular structuring element of diameter four is applied as a morphological post-processing step to delete outlier regions that are only a few pixels wide [27]. We take the convex hull of each region found in order to close spurious holes in the segmentation result. An overview of the final novelty detection pipeline is depicted in Figure 3.
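The overlapping reconstruction step can be sketched as follows. This is a minimal illustration for grayscale images stored as nested lists; `reconstruct_patch` stands in for the trained autoencoder, and the patch size and stride used in the example are hypothetical toy values. Shifting the last patch so that it ends at the image border is an assumption of this sketch.

```python
def patch_origins(length, patch, stride):
    """Start indices along one axis; a final patch is added to reach the border."""
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] != length - patch:
        origins.append(length - patch)
    return origins


def reconstruct_image(image, reconstruct_patch, patch, stride):
    """Reconstruct overlapping patches and average the predictions per pixel."""
    h, w = len(image), len(image[0])
    total = [[0.0] * w for _ in range(h)]
    count = [[0] * w for _ in range(h)]
    for r0 in patch_origins(h, patch, stride):
        for c0 in patch_origins(w, patch, stride):
            tile = [row[c0:c0 + patch] for row in image[r0:r0 + patch]]
            output = reconstruct_patch(tile)
            for i in range(patch):
                for j in range(patch):
                    total[r0 + i][c0 + j] += output[i][j]
                    count[r0 + i][c0 + j] += 1
    return [[total[r][c] / count[r][c] for c in range(w)] for r in range(h)]


# Toy usage with an identity "autoencoder" on a 7 x 9 image.
toy_image = [[float(10 * r + c) for c in range(9)] for r in range(7)]
averaged = reconstruct_image(toy_image, lambda tile: tile, patch=4, stride=3)
print(averaged == toy_image)
```

The averaged reconstruction would then be compared to the input with the chosen error metric, and the resulting novelty map thresholded.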
Using this setup, a forward pass through our architecture for a patch of size $128 \times 128$ takes 14.1 milliseconds (ms) on a Tesla K40c GPU and the inference on a full input image takes around 9.6 seconds. This is comparable to the runtime reported by Napoletano et al. [7]. One should keep in mind, however, that the segmentations produced in their experiments are made up of blocks that each span multiple pixels. For their method to achieve a truly pixel-accurate segmentation, a much higher runtime would be required. Additionally, as argued by [8], the computational time achieved with our method falls well below the time needed to produce a nanofiber sample and is therefore sufficient for the applicability of our algorithm.
We tested different hyperparameter settings using the deterministic autoencoder trained with the SSIM loss and then used the same values for all architectures to ensure comparability. We varied the latent space dimension of the autoencoder, the window size of the SSIM similarity measure, and the size of the patches that the autoencoder was trained on. Table 1 shows the respective areas under the receiver operating characteristic (ROC) curves when evaluating the trained networks. Here, the true positive rate is defined as the percentage of pixels that were correctly classified as defective across the entire dataset. The false positive rate is the percentage of pixels that were wrongly classified as defective. Our approach is rather insensitive to different hyperparameter settings. However, if the latent space dimension is not set to a sufficiently large value, the autoencoder fails to reconstruct nondefective images and therefore its performance decreases. Nevertheless, increasing the latent space dimension does not improve the performance indefinitely. As it weakens the effect of the bottleneck, it ultimately enables the network to copy its inputs and thus perfectly reconstruct defective regions, rendering their detection impossible.
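The pixel-level ROC curve and its area can be sketched as follows. The sketch assumes flattened lists of novelty scores and binary ground-truth labels (1 for defective) that contain at least one pixel of each class; the function names and toy values are hypothetical. It here takes the rates relative to the defective and nondefective pixels, respectively, which is the standard ROC convention.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs for all score thresholds."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal integration of the ROC curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


toy_scores = [0.9, 0.8, 0.2, 0.1]
print(area_under_curve(roc_points(toy_scores, [1, 1, 0, 0])))
print(area_under_curve(roc_points(toy_scores, [1, 0, 1, 0])))
```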
4.3 Evaluation
In Figure 4, we see an example that visualizes the difference in performance of autoencoders using the $\ell^2$ error metric and SSIM. Both approaches manage to reconstruct the nondefective parts of the image and significantly alter the appearance of the defect in the reconstruction. The $\ell^2$ distance fails to segment the defect since it cannot be distinguished from the large novelty scores that are produced around the reconstructed nondefect edges. Moreover, since the defect is replaced by a structure that has similar color values as the input, the $\ell^2$ error fails to detect a large portion of the defect surface. In contrast, SSIM gives more weight to the visually altered area such that the defect can be reliably segmented.
This general behavior manifests itself in our numerical results as well. Figure 5 compares the ROC curves and their respective area under the curve (AUC) values of our approach using SSIM to the ones of deterministic, variational, and feature matching autoencoders that employ the pixelwise $\ell^2$ distance. The performance of the deterministic and variational autoencoder is only marginally better than classifying each pixel randomly. We found the reconstructions obtained by different latent samples from the posterior of the VAE not to vary greatly. Thus, it could not improve on the deterministic framework. Feature matching yields a better performance as it manages to produce better reconstructions with more accurate edge locations. This enables the $\ell^2$ error metric to detect some of the anomalies. However, the results are still not competitive with other state-of-the-art methods on this dataset. Our method using SSIM outperforms all other tested architectures, indicating that altering the loss function can indeed boost performance on complex, unstructured datasets. The achieved AUC of 0.966 is comparable to the state-of-the-art as given in [7], where they report values of up to 0.974. In contrast to their method, our approach does not rely on any model priors such as handcrafted features or pretrained networks.
Since defects of smaller size contribute less to the overall true positive rate when weighting all pixels equally, we further evaluated the overlap of each detected anomaly region with the ground truth and report the quantiles for in Figure 5. We can see that for false positive rates as low as , more than of the defects have an overlap with the ground truth that is larger than . Therefore, we outperform the results achieved by [7], who report a minimal overlap of in this setting.
Figure 6 shows four close-ups of test images together with reconstructions produced by our autoencoder and the corresponding detection results. Our approach manages to find defects of various sizes as well as broken fibers. Note how the autoencoder alters the visual appearance of the defects in the reconstructed images, which ultimately enables us to detect them using SSIM.
5 Conclusion
We propose to use a structural similarity measure in combination with autoencoders for unsupervised defect segmentation. This measure is less sensitive to small inaccuracies of edge locations and instead focuses on structural differences that are more salient for humans. Employing it for the comparison of input images and reconstructions produced by an autoencoder, we manage to achieve state-of-the-art performance on a challenging dataset of nanofibrous materials which is of industrial relevance. We show that our approach constructs accurate error maps and manages to reliably detect defects of various scales. In contrast to the present state-of-the-art on this dataset, our method does not require the existence and selection of a layer of a pretrained CNN suited to the task at hand. Furthermore, it provides a pixel-accurate segmentation with an acceptable runtime.
In comparison, we evaluate the performance of autoencoders using the commonly used pixelwise reconstruction error. We show that this approach is not well suited for the segmentation of defects in complex, real-world data. Even if we employ more sophisticated probabilistic novelty measures obtained from variational autoencoders or if we improve the quality of our reconstructions by employing a feature matching loss, per-pixel error metrics still perform significantly worse.
References
 Goodfellow et al. [2016] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
 Goodfellow et al. [2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
 Wang et al. [2004] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
 Pimentel et al. [2014] M. A. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, “A review of novelty detection,” Signal Processing, vol. 99, pp. 215–249, 2014.
 Perera and Patel [2018] P. Perera and V. M. Patel, “Learning Deep Features for One-Class Classification,” arXiv preprint arXiv:1801.05365, 2018.
 Sabokrou et al. [2018] M. Sabokrou, M. Khalooei, M. Fathy, and E. Adeli, “Adversarially Learned One-Class Classifier for Novelty Detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3379–3388.
 Napoletano et al. [2018] P. Napoletano, F. Piccoli, and R. Schettini, “Anomaly Detection in Nanofibrous Materials by CNN-Based Self-Similarity,” Sensors, vol. 18, no. 1, p. 209, 2018.
 Carrera et al. [2017] D. Carrera, F. Manganini, G. Boracchi, and E. Lanzarone, “Defect Detection in SEM Images of Nanofibrous Materials,” IEEE Transactions on Industrial Informatics, vol. 13, no. 2, pp. 551–561, 2017.
 Boracchi et al. [2014] G. Boracchi, D. Carrera, and B. Wohlberg, “Novelty Detection in Images by Sparse Representations,” in 2014 IEEE Symposium on Intelligent Embedded Systems (IES). IEEE, 2014, pp. 47–54.
 Carrera et al. [2015] D. Carrera, G. Boracchi, A. Foi, and B. Wohlberg, “Detecting anomalous structures by convolutional sparse models,” in 2015 International Joint Conference on Neural Networks (IJCNN). IEEE, 2015, pp. 1–8.
 Carrera et al. [2016] D. Carrera, G. Boracchi, A. Foi, and B. Wohlberg, “Scale-invariant anomaly detection with multiscale group-sparse models,” in 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016, pp. 3892–3896.
 Schlegl et al. [2017] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs, “Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery,” in International Conference on Information Processing in Medical Imaging. Springer, 2017, pp. 146–157.
 Zenati et al. [2018] H. Zenati, C. S. Foo, B. Lecouat, G. Manek, and V. R. Chandrasekhar, “Efficient GAN-Based Anomaly Detection,” arXiv preprint arXiv:1802.06222, 2018.
 Donahue et al. [2017] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial Feature Learning,” International Conference on Learning Representations, 2017.
 Arjovsky and Bottou [2017] M. Arjovsky and L. Bottou, “Towards Principled Methods for Training Generative Adversarial Networks,” International Conference on Learning Representations, 2017.
 Baur et al. [2018] C. Baur, B. Wiestler, S. Albarqouni, and N. Navab, “Deep Autoencoding Models for Unsupervised Anomaly Segmentation in Brain MR Images,” arXiv preprint arXiv:1804.04488, 2018.
 Kingma and Welling [2014] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” International Conference on Learning Representations, 2014.
 An and Cho [2015] J. An and S. Cho, “Variational Autoencoder based Anomaly Detection using Reconstruction Probability,” SNU Data Mining Center, Tech. Rep., 2015.
 Soukup and Pinetz [2018] D. Soukup and T. Pinetz, “Reliably Decoding Autoencoders’ Latent Spaces for One-Class Learning Image Inspection Scenarios,” in OAGM Workshop 2018. Verlag der Technischen Universität Graz, 2018.
 Vasilev et al. [2018] A. Vasilev, V. Golkov, I. Lipp, E. Sgarlata, V. Tomassini, D. K. Jones, and D. Cremers, “q-Space Novelty Detection with Variational Autoencoders,” arXiv preprint arXiv:1806.02997, 2018.
 Ridgeway et al. [2015] K. Ridgeway, J. Snell, B. Roads, R. S. Zemel, and M. C. Mozer, “Learning to generate images with perceptual similarity metrics,” arXiv preprint arXiv:1511.06409, 2015.
 Wang et al. [2003] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, vol. 2. IEEE, 2003, pp. 1398–1402.
 Dosovitskiy and Brox [2016] A. Dosovitskiy and T. Brox, “Generating Images with Perceptual Similarity Metrics based on Deep Networks,” in Advances in Neural Information Processing Systems, 2016, pp. 658–666.
 Kingma and Ba [2015] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” International Conference on Learning Representations, 2015.
 Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification With Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
 Russakovsky et al. [2015] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
 Steger et al. [2018] C. Steger, M. Ulrich, and C. Wiedemann, Machine Vision Algorithms and Applications, 2nd ed. Weinheim: Wiley-VCH, 2018.