Unsupervised Anomaly Detection for X-Ray Images
Abstract
Obtaining labels for medical (image) data requires scarce
and expensive experts. Moreover, due to ambiguous symptoms, single
images rarely suffice to correctly diagnose a medical condition. Instead, additional background information such as the patient's medical history or test results often needs to be taken into account. Hence, instead of
focusing on uninterpretable black-box systems delivering an uncertain
final diagnosis in an end-to-end-fashion, we investigate how unsupervised
methods trained on images without anomalies can be used to assist doctors
in evaluating X-ray images of hands. Our method increases the efficiency
of making a diagnosis and reduces the risk of missing important regions.
To this end, we adopt state-of-the-art approaches for unsupervised learning
to detect anomalies and show how the outputs of these methods can
be explained. To reduce the effect of noise, which can often be mistaken for an anomaly, we introduce a powerful preprocessing pipeline. We provide an extensive evaluation of different approaches and demonstrate empirically that, even without labels, it is possible to achieve satisfactory results on a real-world dataset of X-ray images of hands. We also evaluate the importance of preprocessing, and one of our main findings is that without it, most of our approaches perform no better than random.
To foster reproducibility and accelerate research, we make our code publicly available on GitHub.
1 Introduction
Deep Learning techniques are ubiquitous and achieve state-of-the-art performance in many areas. However, they require vast amounts of labeled data, as witnessed by the tremendous boost in image recognition after the publication of the large-scale ImageNet dataset [4, 11]. In medical applications, labels are expensive to acquire. While anyone can decide whether an image depicts a dog or a cat, deciding whether a medical image shows abnormalities is a highly difficult task requiring specialists with years of training. Another peculiarity of medical applications is that a simple classification decision often does not suffice. End-to-end deep learning solutions tend to be hard to interpret, preventing their application in an area as sensitive as deciding on a treatment. Moreover, additional patient information, such as the patient's medical history and clinical test results, is often crucial to a correct diagnosis. Integrating this information into an end-to-end pipeline is difficult and makes results even less interpretable. Thus, the motivation for our work is to let doctors decide on the final diagnosis and treatment, and to develop a system that can provide hints to doctors on where to pay more attention.
Hence, in this work, we investigate how we can support doctors in assessing X-ray images faster and reduce the chance of overlooking suspicious regions. To this end, we demonstrate how state-of-the-art unsupervised methods, such as Autoencoders (AEs) or Generative Adversarial Networks (GANs), can be used for anomaly detection on X-ray images. As this dataset is noisy, a general problem for many real-world datasets, we present a sophisticated preprocessing pipeline to obtain better training data. Afterwards, we train several unsupervised models and explain for each how to obtain image-level anomaly scores. For some of them, it is even natural to obtain pixel-wise annotations, highlighting anomalous regions. One of our main findings is that accurate data preprocessing is indispensable. The advantage of using autoencoders is that they naturally provide pixel-level anomaly heatmaps, which can be used to understand model decisions. In contrast, GAN-based approaches seem to be able to cope with noisier data, yet are only able to produce image-wise anomaly scores. We envision that this methodology can easily be integrated into the daily clinical routine to support doctors in quickly assessing X-ray images and spotting candidate regions for anomalies.
In this work, we focus on a subset of the MURA dataset [18] containing only hand images. In total, we have 5,543 images from 2,018 studies of 1,945 patients. Each study is labeled as negative or positive, where positive means that an anomaly was diagnosed in this study. There are 521 positive studies, with a total of 1,484 images. Figure 1 shows some examples from the dataset. In summary, our contributions are as follows:
- We present a powerful preprocessing pipeline for the MURA dataset [18], enabling the construction of a high-quality training set.
- We extensively survey unsupervised Deep Learning methods and present approaches for obtaining image-level and even pixel-level anomaly scores.
- We report extensive experiments on a real-world dataset, evaluating the influence of proper preprocessing as well as the usability of the anomaly scores. To foster reproducibility, we make our code publicly available.
The rest of the paper is structured as follows: In Section 2 we describe our approach. We start with the description of the data preprocessing in Section 2.1 and describe the anomaly detection approaches along with their anomaly scores in Section 2.2. We discuss related work in Section 3. Finally, Section 4 presents quantitative and qualitative experimental results on image-level and pixel-level anomaly detection.
[Figure 1: Example X-ray images of hands from the MURA dataset.]
2 Unsupervised Anomaly Detection
2.1 Preprocessing
Real-life data is often noisy. This is especially problematic for unsupervised approaches to anomaly detection. On the one hand, it is necessary to remove noise to make sure that it is not recognized as an anomaly. On the other hand, it is crucial that the data denoising process does not mistake anomalies for noise and remove them. After extensive experimentation, we arrived at the preprocessing pipeline depicted in Figure 2. We distinguish between offline and online processing steps: the offline processing is done once and the result stored to disk to save time, whereas the online preprocessing is done on-the-fly while loading the data. The individual steps are described in detail subsequently.
Cropping The first step in our pipeline is to detect the X-ray image carrier in the image. To this end, we apply OpenCV's contour detection using Otsu binarization [15], and retrieve the minimum-size bounding box, which need not be axis-aligned. This works sufficiently well as long as the majority of the image carrier is within the image (cf. Figure 3). However, the approach might fail for heavily tilted images or those where larger parts of the image carrier reach beyond the image border.
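A minimal sketch of this step with OpenCV (the contour selection heuristic is our assumption; the paper only specifies Otsu binarization and a minimum-size, possibly rotated bounding box):

```python
import cv2
import numpy as np

def detect_image_carrier(gray):
    """Locate the image carrier via Otsu binarization and contour detection,
    and return its minimum-area (not necessarily axis-aligned) bounding box.
    gray: single-channel uint8 X-ray image."""
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    carrier = max(contours, key=cv2.contourArea)   # assume: carrier = largest contour
    rect = cv2.minAreaRect(carrier)                # ((cx, cy), (w, h), angle)
    corners = cv2.boxPoints(rect).astype(np.intp)  # four corner points for cropping
    return rect, corners
```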
[Figure 3: Examples of detected image carriers and localized hands.]
Hand Localization To further improve the detection of hands, and in particular to split images depicting two hands, we manually labeled approximately 150 bounding boxes. Using this small dataset, we fine-tune a pre-trained single shot multibox detector (SSD) [13] with a MobileNet backbone, as provided by TensorFlow. An exemplary result is shown in Figure 3.
Foreground Segmentation In a final step, foreground segmentation is performed using Photoshop’s “select subject” method in batch processing mode. Thereby, we obtain a pixel-wise mask, roughly encompassing the scanned hand.
Data Augmentation Due to GPU memory constraints, the images for BiGAN and α-GAN are resized to 128 pixels on the longer image side while maintaining the aspect ratio before applying the augmentation; for the auto-encoder models, this is not necessary. Afterwards, standard data augmentation methods (horizontal/vertical flipping, channel-wise multiplication, rotation, scaling) are applied using the imgaug library (see the sketch below; the full settings are listed in the supplementary material).
2.2 Models
In this section, we describe the different model types we trained in a fully unsupervised / self-supervised fashion on the training split, which comprises only images from patients without attested anomalies. We also describe how to obtain anomaly scores from the trained models. In the appendix, we additionally provide architectural details for every model.
2.3 Autoencoders
We studied different auto-encoder architectures for the task at hand. Common among them is their usage of a reconstruction loss, i.e. the input to the network is also used as the target, and we evaluate how well the input is reconstructed. As the information has to pass through an informational bottleneck, the model cannot simply copy the input data, but instead has to perform a form of compression, extracting features that suffice to reconstruct the image sufficiently well. Hence, we have an encoder part of the network, $\mathrm{enc} \colon \mathcal{X} \to \mathcal{Z}$, which transforms the input $x$ non-linearly into a latent space representation $z = \mathrm{enc}(x)$. Analogously, there is a decoder $\mathrm{dec} \colon \mathcal{Z} \to \mathcal{X}$ that transforms an input from latent space back to an element in the input space.
For simplicity, we describe the general loss formulation using a vector input $x \in \mathbb{R}^n$ instead of a two-dimensional pixel matrix. In its simplest form, the reconstruction loss is given as the mean over pixel-wise squared differences. Let $\hat{x} = \mathrm{dec}(\mathrm{enc}(x))$, then

$$\mathcal{L}_{\mathrm{rec}}(x) = \frac{1}{n} \sum_{i=1}^{n} \left(x_i - \hat{x}_i\right)^2 .$$
As we are only interested in detecting anomalies on the hand part of the image, we consider a variant of this loss, named masked reconstruction loss, where only those pixels are considered that belong to the mask. Let $m \in \{0, 1\}^n$ be the mask, where $m_i = 1$ if and only if position $i$ belongs to the hand. Then

$$\mathcal{L}_{\mathrm{masked}}(x) = \frac{1}{\lVert m \rVert_1} \sum_{i=1}^{n} \left(m \odot (x - \hat{x})\right)_i^2 ,$$

where $\odot$ denotes the Hadamard product (i.e. element-wise multiplication).
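As a minimal illustration of this loss (a NumPy sketch following the notation above, not the exact training implementation):

```python
import numpy as np

def masked_reconstruction_loss(x, x_hat, m):
    """Squared reconstruction error restricted to pixels under the mask.
    x, x_hat, m: flattened arrays of equal length; m is binary."""
    diff = m * (x - x_hat)   # Hadamard product zeroes out the background
    return float((diff ** 2).sum() / m.sum())
```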
In the following, we describe the network architectures in more detail. In a convolutional auto-encoder (CAE), we implement encoder and decoder as fully convolutional neural networks (CNNs). In general, the encoder is built as a sequence of repeated convolution blocks. We apply Batch Normalization [9] between every convolution and the respective activation, and use ReLU [7] as the activation function. A detailed model description is given in the appendix. Similarly, the decoder consists of repeated blocks of transposed convolutions. As before, we apply batch normalization before every activation function. As bottleneck size, we use a spatial resolution of 2 × 2 with 512 channels.
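To make the block structure concrete, hypothetical Keras-style building blocks could look as follows (an illustration only; the exact layer configuration is listed in the appendix):

```python
import tensorflow as tf

def encoder_block(x, filters, kernel_size=4, strides=2):
    """Convolution -> batch normalization -> ReLU, as described above."""
    x = tf.keras.layers.Conv2D(filters, kernel_size, strides=strides, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

def decoder_block(x, filters, kernel_size=4, strides=2):
    """Transposed convolution -> batch normalization -> ReLU."""
    x = tf.keras.layers.Conv2DTranspose(filters, kernel_size, strides=strides, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)
```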
The Variational AE (VAE) [10] is a generative model which maps an input $x$ to a Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ in latent space, characterized by its mean $\mu$ and covariance $\Sigma$, instead of mapping it to a fixed latent representation. The covariance matrix is usually restricted to a diagonal matrix. For reconstruction, a sample $z \sim \mathcal{N}(\mu, \Sigma)$ is drawn and passed through the decoder sub-network. To avoid very small values in $\Sigma$, which would approach a delta distribution, i.e. a traditional AE, an additional loss term is introduced as the Kullback-Leibler divergence (KLD) between $\mathcal{N}(\mu, \Sigma)$ and the standard normal distribution $\mathcal{N}(0, I)$.
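For a diagonal covariance $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_d^2)$, this KLD term has the well-known closed form from [10]:

$$D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \Sigma) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2} \sum_{j=1}^{d} \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right) .$$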
Anomaly Detection Scores
The rationale behind using AEs for anomaly detection is that, as the AE is trained on normal data only, it has not seen anomalies during training and hence will fail to reproduce them. Due to the convolutional nature of the network, the error is even expected to be stronger in regions close to the anomaly, and weaker further away. If the receptive field is small enough, regions outside of it are not affected at all. Hence, we can use the reconstruction error in two ways:
- Pixel-wise, to obtain a heatmap highlighting regions that were hardest to reconstruct. If there is an anomaly, we expect the highest error in that region. We show an example of such a heatmap in the qualitative results, Figure 5.
- Aggregated over all pixels (under the mask), to obtain an image-wise score. Here, we explore different aggregation strategies (see the sketch below). In the simplest case, we just average over all locations. By using only the $k$ highest values to compute the mean, we can obtain a score that is more sensitive towards regions of high reconstruction error (i.e. anomalous regions).
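A minimal sketch of both aggregation strategies, assuming the pixel-wise error map and the hand mask are given as NumPy arrays:

```python
import numpy as np

def image_anomaly_score(err_map, mask, top_k=None):
    """Aggregate a pixel-wise reconstruction-error map into one image-level
    score; with top_k set, only the k largest errors enter the mean."""
    errors = err_map[mask > 0]             # consider only pixels under the mask
    if top_k is not None:
        errors = np.sort(errors)[-top_k:]  # the k highest error values
    return float(errors.mean())
```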
We aim for auto-encoder architectures that are strong enough to successfully reconstruct normal hands, without risking that they learn identity mappings through overly wide bottlenecks. While the architecture should generalize over all normal hands, too strong a generalization might have the effect that anomalies can also be reconstructed sufficiently well.
2.4 GANs
A Generative Adversarial Network (GAN) [8] comprises two sub-networks, a generator , and a discriminator , which can be seen as antagonists in a two-player game. The generator takes random noise as input and generates samples in the target domain. The discriminator takes real data points, as well as generated ones, and has to distinguish between real and fake data. The sub-networks are trained alternatingly, and if successful, the generator can afterwards be used to sample from the (approximated) data distribution, and the discriminator can be used to decide whether a sample is drawn from the given data distribution.
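Formally, generator and discriminator optimize the minimax objective from [8]:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big] .$$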
The Deep Convolutional GAN (DCGAN) [16] is an extension of the original GAN architecture to convolutional neural networks. Similarly to the CAE, the two networks use convolutions (discriminator) and transposed convolutions (generator) instead of the fully connected layers of the originally proposed GAN architecture.
α-GAN [20] comprises four sub-networks:

- An encoder which transforms a real image $x$ into a latent representation $z_x = \mathrm{enc}(x)$.
- A code-discriminator which distinguishes between the latent representations $z_x$ produced by the encoder and the random noise $z$ used as generator input.
- A generator which generates an image from either the randomly sampled $z$ or the encoded image $z_x$.
- A discriminator which distinguishes between reconstructed real images $\mathrm{gen}(z_x)$ and generated images $\mathrm{gen}(z)$.
In addition to the classification losses for both discriminators, a reconstruction loss is applied to the auto-encoder formed by the encoder-generator pair. Hence, the code-discriminator gives the encoder the incentive to transform the inputs such that they match the noise distribution, similarly to what the KL-divergence achieves in the VAE. Likewise, the discriminator motivates matching the data distribution in the image domain.
Anomaly Detection Scores
For the GAN models, we generally use the discriminator's output as the anomaly score. When converged, the discriminator should be able to distinguish between images belonging to the data manifold, i.e. images of hands without any anomalies, and those which lie outside, such as images containing anomalous regions. For α-GAN, we use the mean of the code-discriminator and discriminator probabilities.
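As an illustration, an image-level score could be derived as follows (a sketch with placeholder function names, assuming each discriminator outputs the probability of its input being real):

```python
def gan_anomaly_score(x, discriminator, code_discriminator=None, encoder=None):
    """Anomaly score from discriminator outputs: a low probability of
    'real' translates into a high anomaly score."""
    score = 1.0 - discriminator(x)
    if code_discriminator is not None and encoder is not None:
        # alpha-GAN variant: average image- and code-level scores
        code_score = 1.0 - code_discriminator(encoder(x))
        score = 0.5 * (score + code_score)
    return score
```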
3 Related Work
With the rapid advancement of deep learning methods, they have also found their way into medical imaging, cf. e.g. [12, 19]. Despite the limited availability of labels in medical contexts, supervised methods make up the vast majority. Very likely, this is due to the easier trainability, but possibly also because the interpretability of the results has so far often been secondary.

Sato et al. [22] use a 3D CAE for pathology detection in CT scans of brains. The CAE is trained solely on normal images, and at test time, the MSE between the image and its reconstruction is taken as the anomaly score. Uzunova et al. [25] use VAEs for medical 2D and 3D CT images. Similarly, they use the MSE reconstruction loss as the anomaly score. Besides the KL-divergence in latent space, they use a reconstruction loss for training, which produced less smooth output. GANomaly [2] and its extension with skip-connections use an AE and map the reconstructed input back to the latent space. The anomaly score is then computed in latent space between the original and the reconstructed input. They apply their methods to X-ray security imagery to detect anomalous items in baggage.

Recently, there have been many publications using the currently popular GANs. For example, [14] uses a semi-supervised approach for anomaly detection in chest X-ray images. They replace the standard discriminator classification into real and fake with a three-way classification into real-normal, real-abnormal, and fake. While this allows training with fewer labels, it still requires them for training. Schlegl et al. [23] train a DCGAN on slices of OCT scans, where the original volume is cut along the x-z axis, and the slices are further randomly cropped. At test time, they use gradient descent to iteratively solve the inverse problem of obtaining a latent vector that produces the image. Stopping after a few iterations, the distance between the generated image and the input image is considered as the residual loss.

To summarize, the focus of recent work on anomaly detection lies either in applying existing methods to a new type of data or in adapting unsupervised methods for anomaly detection. Instead, we provide an extensive evaluation of state-of-the-art unsupervised learning approaches that can be directly used for anomaly detection. Furthermore, we evaluate the importance of different preprocessing steps and compare methods with regard to explainability.
4 Experiments
We demonstrate the capability of our preprocessing pipeline and all described models in experiments on a subset of the MURA dataset containing only X-ray images of hands. 3,062 images are stored as single-channel PNGs, and 2,481 with three (RGB) channels. However, all images look like gray-scale images, which is why we convert all 3-channel images to a single channel. The longer side of each image is always 512 pixels; the shorter side ranges from 160 to 512, with the majority between 350 and 450.
As our approach is unsupervised, we train only on negative images, i.e. images without an anomaly. Furthermore, to avoid test leakage, we split the data by patient, and not by study or image, ensuring that we do not have an image of a patient in the training data and another image of the same patient in the test or validation data. To this end, we proceed as follows: Let $P$ be the set of all patients, and $P_a \subseteq P$ the set of patients with a study that is labeled as abnormal. The rest of the patients is denoted by $P_n = P \setminus P_a$. For the test and validation sets, we aim at balanced classes. Therefore, we distribute $P_a$ evenly at random across test and validation. Afterwards, we randomly sample the same number of patients without known anomalies for test and validation, and use the remaining patients for training. The procedure is visualized in Figure 4. In total, we end up with 2,554 training images, 1,494 validation images, and 1,495 test images.
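The following sketch illustrates this patient-level split (a hypothetical helper, assuming patient IDs are given as lists):

```python
import random

def split_by_patient(abnormal_patients, normal_patients, seed=42):
    """Distribute abnormal patients evenly over validation/test, match them
    with equally many normal patients, and use the rest for training."""
    rng = random.Random(seed)
    abnormal = rng.sample(abnormal_patients, len(abnormal_patients))
    half = len(abnormal) // 2
    val_ab, test_ab = abnormal[:half], abnormal[half:]
    normal = rng.sample(normal_patients, len(normal_patients))
    val_no = normal[:len(val_ab)]
    test_no = normal[len(val_ab):len(val_ab) + len(test_ab)]
    train = normal[len(val_ab) + len(test_ab):]  # training: normal patients only
    return train, val_ab + val_no, test_ab + test_no
```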
We trained all models on a machine with one NVIDIA Tesla V100 GPU with 16 GiB of VRAM, 20 cores, and 360 GB of RAM. Following [17], we train our models from scratch and do not use transfer learning from large image classification datasets. We performed a manual hyper-parameter search on the validation set and selected the best-performing models per type with respect to the area under the receiver operating characteristic curve (ROC-AUC). We report the ROC-AUC on the test set.
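Computing this metric is straightforward, e.g. with scikit-learn (y_true and scores are placeholder variables):

```python
from sklearn.metrics import roc_auc_score

# y_true: 1 for images from studies labeled abnormal, 0 otherwise
# scores: image-level anomaly scores, higher = more anomalous
auc = roc_auc_score(y_true, scores)
```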
4.1 Quantitative Results
Table 1: Test ROC-AUC (mean ± standard deviation over four runs) for all models and anomaly scores on raw, cropped (crop), and fully preprocessed (full) data, with and without histogram equalization (HE).

| Model / Score | raw, w/o HE | raw, w/ HE | crop, w/o HE | crop, w/ HE | full, w/o HE | full, w/ HE |
|---|---|---|---|---|---|---|
| **CAE** | | | | | | |
| MSE | .460 ± .033 | .504 ± .034 | .466 ± .022 | .510 ± .021 | .501 ± .013 | .570 ± .019 |
| MSE (top-200) | .466 ± .013 | .448 ± .025 | .486 ± .015 | .473 ± .018 | .506 ± .039 | .553 ± .023 |
| **VAE** | | | | | | |
| KLD | .488 ± .031 | .491 ± .013 | .470 ± .046 | .496 ± .045 | .520 ± .026 | .533 ± .014 |
| L1 | .432 ± .033 | .446 ± .016 | .438 ± .033 | .438 ± .016 | .435 ± .014 | .483 ± .009 |
| L1 + KLD | .432 ± .033 | .446 ± .016 | .438 ± .034 | .437 ± .016 | .438 ± .011 | .488 ± .011 |
| L1 (top-200) | .438 ± .017 | .472 ± .010 | .440 ± .025 | .471 ± .013 | .428 ± .013 | .481 ± .010 |
| MSE | .432 ± .033 | .446 ± .016 | .438 ± .033 | .438 ± .016 | .435 ± .014 | .483 ± .009 |
| MSE + KLD | .432 ± .033 | .446 ± .016 | .438 ± .033 | .438 ± .016 | .436 ± .013 | .486 ± .010 |
| MSE (top-200) | .438 ± .017 | .472 ± .010 | .440 ± .025 | .471 ± .013 | .428 ± .013 | .481 ± .010 |
| **DCGAN** | | | | | | |
| Disc. (D) | .497 ± .018 | .491 ± .041 | .493 ± .015 | .493 ± .025 | .530 ± .027 | .527 ± .022 |
| **BiGAN** | | | | | | |
| MSE | .471 ± .021 | - | .438 ± .039 | - | .491 ± .042 | .522 ± .017 |
| MSE (top-200) | .471 ± .011 | - | .459 ± .030 | - | .475 ± .033 | .508 ± .026 |
| Disc. (D) | .508 ± .007 | - | .534 ± .016 | - | .549 ± .006 | .522 ± .019 |
| **α-GAN** | | | | | | |
| Code-Disc. (C) | .500 ± .000 | - | .500 ± .001 | - | .500 ± .000 | .500 ± .000 |
| MSE | .476 ± .029 | - | .466 ± .022 | - | .442 ± .013 | .528 ± .018 |
| MSE (top-200) | .465 ± .031 | - | .446 ± .018 | - | .422 ± .016 | .533 ± .013 |
| Disc. (D) | .503 ± .022 | - | .534 ± .022 | - | .607 ± .016 | .584 ± .012 |
| C + D | .503 ± .022 | - | .534 ± .022 | - | .607 ± .016 | .584 ± .012 |
Apart from the performance of single models, we also evaluate the importance of the preprocessing steps. Therefore, we evaluate the models on the raw data, the data after cropping the hand regions, as well as on the fully preprocessed data. We also vary whether histogram equalization is applied before the augmentation or not. We summarize the quantitative results in Table 1, showing the mean and standard deviation across four runs.

There is a clear trend regarding preprocessing: All models have their best runs in the fully preprocessed setting, emphasizing the importance of our preprocessing pipeline for noisy datasets. Interestingly, without foreground segmentation, i.e. only by cropping the single hands, the results appear to be worse than on the raw data. While histogram equalization is a contrast enhancement method particularly useful for improving human perception of low-contrast images, it seems to improve the results for the AE-based models consistently. For BiGAN and α-GAN, the runs with histogram equalization on the raw and cropped data did not finish before the deadline; as these models comprise AE components, we expect to see an improvement there as well. On raw and also cropped data, we frequently observe ROC-AUC values smaller than 45%. Hence, we might be able to improve the ROC-AUC score by flipping the anomaly decision. Partially, we attribute this to the rather unstable results for these models. Regarding the aggregation of the reconstruction error, we observe that using only the top-k loss values across all pixels does not improve the result. We attribute this partially to not tuning across different values of k, as we only used k = 200 for all models, which may be too few pixels to detect some anomalies. Due to the lack of pixel-level annotations, we did not investigate this issue further. In total, we obtain the best ROC-AUC score of 60.7% for α-GAN using the discriminator probability. The CAE, however, also achieves 57% ROC-AUC and can additionally provide natural pixel-level anomaly scores, yielding higher interpretability.
4.2 Qualitative Results
[Figure 5: Input, reconstruction, and pixel-wise error heatmap for a hand from a study labeled normal (top) and an abnormal hand with metal parts (bottom), produced by the CAE.]
In addition to the numerical results, we also showcase some qualitative results. For all methods with a reconstruction loss, i.e. all AEs as well as α-GAN, we can generate heatmaps visualizing the pixel-wise losses. Thereby, we can highlight regions that could not be reconstructed well. According to our assumption, these should be the anomalous regions. In Figure 5, we show prototypical examples produced by the CAE. The upper image shows a hand contained in a study labeled as normal. We can see that the reconstruction error is not concentrated, but rather spread widely across the hand. The maxima seem to occur around joints, which, due to their more complex structure, are likely harder to reconstruct. Compared to the lower image, which shows a study labeled as abnormal, we see a clear highlighting at the middle finger. Even a non-expert can spot metal parts in the X-ray image at the very same location. For those anomalies which could be validated by a person without a medical background, the highlighted regions seem to correspond largely to the anomalous regions.
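Such heatmaps can be produced directly from the masked pixel-wise error (a minimal matplotlib sketch, assuming x, x_hat, and mask are 2D NumPy arrays):

```python
import matplotlib.pyplot as plt

def show_anomaly_heatmap(x, x_hat, mask):
    """Overlay the masked squared reconstruction error on the input image."""
    err = (x - x_hat) ** 2 * mask           # pixel-wise loss, background zeroed
    plt.imshow(x, cmap="gray")              # original X-ray image
    plt.imshow(err, cmap="jet", alpha=0.5)  # error heatmap on top
    plt.axis("off")
    plt.show()
```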
5 Conclusion
In this paper, we investigated methods for unsupervised anomaly detection in X-ray images. To this end, we surveyed two families of unsupervised models, auto-encoders and GANs, regarding their applicability for deriving anomaly scores. In addition, we provided a sophisticated multi-step preprocessing pipeline. In our experiments, we compared the methods against each other and, furthermore, revealed that the preprocessing is crucial for most models to obtain good results on real-world data. For the auto-encoder family, we studied the interpretability of pixel-wise losses as anomaly heatmaps and verified that in cases of anomalies which a non-expert can detect (e.g. metal pieces in the hand), these heatmaps closely match the anomalous regions. As future work, we envision the extension to broader datasets such as the full MURA dataset, as well as obtaining pixel-level anomaly scores for the GAN-based models. To this end, methods from the field of explainable AI, such as Grad-CAM [24] or LRP [3], can be applied to the discriminator to obtain heatmaps similar to those of the AE models. Moreover, we see potential for different model architectures tailored more closely to the specific problem and data type, as well as the possibility of building an ensemble model using the different ways of extracting anomaly scores from single models, or even across different model types.
Acknowledgement
We would like to thank Franz Pfister and Rami Eisaway from deepc (www.deepc.ai) for access to the data and support in understanding the use case. Part of this work has been conducted during a practical course at Ludwig-Maximilians-Universität München funded by Z.DB. This work has been funded by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this work take full responsibility for its content.
Supplementary Material
.1 Schematic Architectures and Reconstruction Examples





.2 Architecture Details
| VAE | |
|---|---|
| **encoder** | |
| kernel size | output shape |
| (4, 4) | (255, 255, 8) |
| (4, 4) | (126, 126, 16) |
| (4, 4) | (62, 62, 32) |
| (4, 4) | (30, 30, 64) |
| (4, 4) | (14, 14, 128) |
| (4, 4) | (6, 6, 256) |
| (4, 4) | (2, 2, 512) |
| **bottleneck** | |
| reshape (flatten) | (2048,) |
| μ = FC | (1024,) |
| σ = FC | (1024,) |
| sample z ~ N(μ, σ) | (1024,) |
| reshape | (2, 2, 512) |
| **decoder** | |
| kernel size | output shape |
| (4, 4) | (6, 6, 256) |
| (4, 4) | (14, 14, 128) |
| (4, 4) | (30, 30, 64) |
| (4, 4) | (62, 62, 32) |
| (4, 4) | (126, 126, 16) |
| (4, 4) | (254, 254, 8) |
| (6, 6) | (512, 512, 1) |
| DCGAN | |
|---|---|
| **generator** | |
| kernel size | output shape |
| (4, 4) | (4, 4, 1024) |
| (4, 4) | (8, 8, 512) |
| (4, 4) | (16, 16, 256) |
| (4, 4) | (32, 32, 128) |
| (4, 4) | (64, 64, 64) |
| (4, 4) | (128, 128, 32) |
| (4, 4) | (256, 256, 16) |
| (4, 4) | (512, 512, 1) |
| **discriminator** | |
| kernel size | output shape |
| (4, 4) | (256, 256, 4) |
| (4, 4) | (128, 128, 8) |
| (4, 4) | (64, 64, 16) |
| (4, 4) | (32, 32, 32) |
| (4, 4) | (16, 16, 64) |
| (4, 4) | (8, 8, 128) |
| (4, 4) | (4, 4, 256) |
| (4, 4) | (1, 1, 512) |
| minibatch discrimination | (1, 1, 528) |
| FC | (1,) |
| BiGAN | |
|---|---|
| **generator** | |
| kernel size | output shape |
| (4, 4) | (4, 4, 1024) |
| (4, 4) | (8, 8, 512) |
| (4, 4) | (16, 16, 256) |
| (4, 4) | (32, 32, 128) |
| (4, 4) | (64, 64, 64) |
| (4, 4) | (128, 128, 1) |
| **encoder** | |
| kernel size | output shape |
| (4, 4) | (64, 64, 64) |
| (4, 4) | (32, 32, 128) |
| (4, 4) | (16, 16, 256) |
| (4, 4) | (8, 8, 512) |
| (4, 4) | (4, 4, 1024) |
| (4, 4) | (1, 1, 200) |
| **Discriminator: Image Branch** | |
| kernel size | output shape |
| (4, 4) | (64, 64, 64) |
| (4, 4) | (32, 32, 128) |
| (4, 4) | (16, 16, 256) |
| (4, 4) | (8, 8, 512) |
| (4, 4) | (4, 4, 1024) |
| (4, 4) | (1, 1, 1024) |
| **Discriminator: Code Branch** | |
| kernel size | output shape |
| (1, 1) | (1, 1, 512) |
| (1, 1) | (1, 1, 512) |
| **Discriminator: Combination** | |
| kernel size | output shape |
| stack branches | |
| (1, 1) | (1, 1, 1024) |
| (1, 1) | (1, 1, 1024) |
| (1, 1) | (1, 1, 1) |
| α-GAN | |
|---|---|
| **generator** | |
| kernel size | output shape |
| (4, 4) | (4, 4, 1024) |
| (4, 4) | (8, 8, 512) |
| (4, 4) | (16, 16, 256) |
| (4, 4) | (32, 32, 128) |
| (4, 4) | (64, 64, 64) |
| (4, 4) | (128, 128, 1) |
| **encoder** | |
| kernel size | output shape |
| (4, 4) | (64, 64, 64) |
| (4, 4) | (32, 32, 128) |
| (4, 4) | (16, 16, 256) |
| (4, 4) | (8, 8, 512) |
| (4, 4) | (4, 4, 1024) |
| mean and variance: | |
| (4, 4) | (1, 1, 200) |
| **discriminator** | |
| kernel size | output shape |
| (4, 4) | (64, 64, 64) |
| (4, 4) | (32, 32, 128) |
| (4, 4) | (16, 16, 256) |
| (4, 4) | (8, 8, 512) |
| (4, 4) | (4, 4, 1024) |
| minibatch discrimination | (4, 4, 1028) |
| (4, 4) | (1, 1, 1) |
| **code-discriminator** | |
| kernel size | output shape |
| (1, 1) | 100 |
| (1, 1) | 50 |
| (1, 1) | 25 |
| (1, 1) | 1 |
Notes: * denotes an additional max-pooling / nearest-neighbor upsampling layer; FC denotes a fully-connected layer; some layers additionally use a self-attention layer.
.3 Data Augmentation
We use two different augmentation strategies, named default (used for GANs and VAE) and advanced (used for CAE and BAE).
Default

- Horizontally flip 50% of all images
- Vertically flip 50% of all images
- Center pad all images to the target resolution

Advanced

- Horizontally flip 50% of all images
- Vertically flip 50% of all images
- For 50% of the images, change the brightness by multiplying all channels by a scalar value drawn randomly from a uniform distribution
- For 50% of the images, scale in x- and y-direction independently by a factor drawn randomly from a uniform distribution
- For 50% of the images, rotate the image by an angle drawn randomly from a uniform distribution
- Center pad all images to the target resolution
.4 Training Details
For all models, we train 4 variants with different random seeds: 42, 4242, 424242, and 42424242.

CAE

- Batch Size: 32
- Image Resolution: 512 × 512
- 1,000 epochs
- Batch Normalization
- Learning rate: 0.0001
- Adam optimizer

BAE

- Batch Size: 32
- Image Resolution: 512 × 512
- 500 epochs
- Batch Normalization
- Learning rate: 0.0001
- Adam optimizer

VAE

- Batch Size: 32
- Image Resolution: 512 × 512
- 500 epochs
- Batch Normalization
- Learning rate: 0.0001
- Adam optimizer

DCGAN

- Batch Size: 80
- Image Resolution: 512 × 512
- 500 epochs
- No Batch Normalization
- Spectral Normalization
- Soft Labels
- Generator Learning Rate: 0.001
- Discriminator Learning Rate: 0.00001
- Soft Delta: 0.01
- As we observed mode collapse, we added minibatch discrimination [21]
- Adam optimizer

BiGAN

- Batch Size: 16
- Image Resolution: 128 × 128
- 500 epochs
- Generator & Encoder Learning Rate: 0.001
- Discriminator Learning Rate: 0.000005
- Adversarial Loss: Hinge Loss
- Adam optimizer

α-GAN

- Batch Size: 16
- Image Resolution: 128 × 128
- 500 epochs
- Generator & Encoder Learning Rate: 0.001
- Discriminator & Code-Discriminator Learning Rate: 0.000005
- Adversarial Loss: Hinge Loss
- As we observed mode collapse, we added minibatch discrimination [21]
- Adam optimizer
Footnotes
- https://github.com/Valentyn1997/xray
- https://github.com/aleju/imgaug
- In the paper is called as opposed to the generator
References
- [1] (2018) 15th IEEE International Symposium on Biomedical Imaging, ISBI 2018, Washington, DC, USA, April 4-7, 2018. IEEE. ISBN 978-1-5386-3636-7.
- [2] (2019) GANomaly: Semi-supervised anomaly detection via adversarial training. In Computer Vision - ACCV 2018, C. V. Jawahar, H. Li, G. Mori and K. Schindler (Eds.), Cham, pp. 622-637. ISBN 978-3-030-20893-6.
- [3] (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE 10(7), pp. 1-46.
- [4] (2009) ImageNet: A large-scale hierarchical image database. In CVPR 2009.
- [5] (2017) Adversarial feature learning. In ICLR 2017.
- [6] (2017) Adversarially learned inference. In ICLR 2017.
- [7] (2011) Deep sparse rectifier neural networks. In AISTATS 2011, pp. 315-323.
- [8] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680.
- [9] (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML 2015, pp. 448-456.
- [10] (2014) Auto-encoding variational Bayes. In ICLR 2014.
- [11] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105.
- [12] (2017) A survey on deep learning in medical image analysis. Medical Image Analysis 42, pp. 60-88.
- [13] (2016) SSD: Single shot multibox detector. In ECCV 2016, Proceedings, Part I, pp. 21-37.
- [14] (2018) Semi-supervised learning with generative adversarial networks for chest X-ray classification with ability of data domain adaptation. In ISBI 2018 [1], pp. 1038-1042.
- [15] (1979) A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics 9(1), pp. 62-66.
- [16] (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
- [17] (2019) Transfusion: Understanding transfer learning with applications to medical imaging. CoRR abs/1902.07208.
- [18] (2017) MURA: Large dataset for abnormality detection in musculoskeletal radiographs. arXiv preprint arXiv:1712.06957.
- [19] (2018) A tour of unsupervised deep learning for medical image analysis. CoRR abs/1812.07715.
- [20] (2017) Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987.
- [21] (2016) Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29, pp. 2234-2242.
- [22] (2018) A primitive study on unsupervised anomaly detection with an autoencoder in emergency head CT volumes. In Medical Imaging 2018: Computer-Aided Diagnosis, pp. 105751P.
- [23] (2017) Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In IPMI 2017, Proceedings, pp. 146-157.
- [24] (2017) Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618-626.
- [25] (2019) Unsupervised pathology detection in medical images using conditional variational autoencoders. Int. J. Comput. Assist. Radiol. Surg. 14(3), pp. 451-461.
