Detecting semantic anomalies

Faruk Ahmed and Aaron Courville
Mila, Université de Montréal (Aaron Courville is also a CIFAR Fellow)
E-mail: faruk.ahmed@umontreal.ca, aaron.courville@umontreal.ca
Abstract

We critically appraise the recent interest in out-of-distribution (OOD) detection and question the practical relevance of existing benchmarks. While the currently prevalent trend is to consider different datasets as OOD, we posit that out-distributions of practical interest are ones where the distinction is semantic in nature for a specified context, and evaluative tasks should reflect this more closely. Assuming a context of object recognition, we recommend a set of benchmarks, motivated by referencing practical applications. Finally, we explore a multi-task learning based approach and show empirically that auxiliary objectives for improved semantic awareness can result in improved semantic anomaly detection, with accompanying generalization benefits.

1 Introduction

In recent years, concerns have been raised about modern neural network based classification systems providing incorrect predictions with high confidence Guo et al. (2017). A possibly related finding is that classification-trained CNNs find it much easier to overfit to low-level properties such as texture Geirhos et al. (2019), canonical pose Alcorn et al. (2019), or contextual cues Beery et al. (2018) rather than learning globally coherent characteristics of objects. A subsequent worry is that such classifiers, trained on data sampled from a particular distribution, are likely to be misleading when encountering novel situations in deployment. For example, silent failure might occur due to equally confident categorization of unknown objects into known categories. This last concern is one of the primary motivating reasons for wanting to be able to detect when test data comes from a different distribution than that of the training data. This problem has been recently dubbed out-of-distribution (OOD) detection Hendrycks and Gimpel (2017); Amodei et al. (2016), but is also referred to as anomaly / novelty / outlier detection in the contemporary machine learning context. Evaluation is typically carried out with benchmarks of the style proposed in Hendrycks and Gimpel (2017), where different datasets are treated as OOD after training on a particular in-distribution dataset. This area of research has been steadily developing Liang et al. (2018); DeVries and Taylor (2018); Shalev et al. (2018); Lee et al. (2018); Hendrycks et al. (2019), with some additions of new OOD datasets to the evaluation setup Liang et al. (2018), and improved results.

Current benchmarks are ill-motivated

Despite such tasks rapidly becoming the standard benchmark for OOD detection in the community, we suggest that, taken as a whole, they are not very well-motivated. For example, a classifier trained on the object recognition dataset CIFAR-10 (consisting of images of objects placed in the foreground) is typically tested against noise, or against different datasets such as downsampled LSUN (a dataset of scenes), SVHN (a dataset of house numbers), or tiny-Imagenet (a different dataset of objects). For the simpler cases of noise, or of datasets of scenes or numbers, low-level image statistics are sufficient to tell them apart. While choices like tiny-Imagenet might seem more reasonable, it has been noted that particular datasets carry particular biases related to specific data collection and curation quirks Torralba and Efros (2011); Tommasi et al. (2017), which renders the practice of treating different datasets as OOD questionable. It is possible we are only getting better at distinguishing such idiosyncrasies. Perhaps because these tasks are intrinsically easy, most approaches report very flattering performance on such benchmarks.

Semantic distributional shift is relevant

We call into question the practical relevance of these evaluative tasks which are currently treated as standard by the community. While they might have some value as very preliminary reliability certification or as a testbed for diagnosing peculiar pathologies Nalisnick et al. (2019), their significance as benchmarks for practical OOD detection is less clear. The implicit goal for the current style of benchmarks is that of detecting one or more of a wide variety of distributional shifts, which mostly consist of irrelevant factors when high-dimensional data has low-dimensional semantics. We suggest that this is misguided: in a realistic setting, distributional shift across non-semantic factors (for example, camera and image-compression artifacts) is something we want to be robust to, while shift in semantic factors (for example, object identity) should be flagged down as anomalous or novel. Therefore, we argue that in practical contexts, OOD detection is well-motivated only when the distributional shift is semantic in nature.

Context dictates semantics of interest

We argue that, in practical settings, OOD detection becomes meaningful only after acknowledging context and identifying the relevant semantics of interest. Such semantics are the factors of variation whose unnatural deviations are of concern to us in our assumed context. For example, in the context of scene classification, a kitchen with a bed in the middle is an anomalous observation. However, in the context of object recognition, the primary semantic is no longer the composition of scene-components, but the identity of the foreground object; the unusual context should not prevent correct object recognition. If we claim that our object recognition models should be less certain of identifying an object in a novel context, it amounts to saying that we would prefer our models to be biased. Similarly, in action recognition, we care about the temporal evolution of the action being performed across frames. In such a context, a non-semantic shift, such as in background detail, does not constitute novelty. In fact, we would like our models to systematically generalize Fodor and Pylyshyn (1988) in order to be trustworthy and useful: we would like them to form predictions from a globally coherent assimilation of the relevant semantics for the task.

Without context, OOD detection is too broad to be meaningful

The problem of OOD detection, then, as currently treated by the community, suffers from imprecision due to context-free presumption and evaluation. Even though most works assume an underlying classification task, the benchmark OOD datasets include significant variation over non-semantic factors (also see Appendix D). OOD detection with density models is typically presented as being unaware of a downstream module, but we posit that such a context must be specified in order to determine which shifts are of concern to us, since we typically do not care about all possible variations. Being agnostic of context when discussing OOD detection leads to a corresponding lack of clarity about the implications of underlying methodologies in proposed approaches. The current benchmarks and methods therefore carry a risk of misalignment between evaluative performance and field performance in practical OOD detection problems. Henceforth, we shall refer to such realistic OOD detection problems, where the concerned distributional shift is semantic for a specified context, by the term anomaly detection.

Contributions and overview

Our contributions in this paper are summarized as follows.

1. Semantic shifts are interesting, and benchmarks should reflect this more closely: In this section, we provided a grounded discussion about the relevance of semanticity in the context of a task for realistic OOD detection problems, which we regard as anomaly detection. Viewing distributional shifts as either semantic or non-semantic for a specified context, we concluded that semantic shifts are of practical interest, while non-semantic shifts are variations we typically wish to be robust to if we want to develop and deploy reliable models in the real world.

2. More practical benchmarks for anomaly detection: Although our discussion applies generally, in this paper we assume the common context of an object recognition task. In this context, unseen object categories may be considered anomalous at the “highest level” of semanticity. Anomalies corresponding to intermediate levels of semantic decomposition can also be relevant; for example, a liger should result in 50-50 uncertainties if the training data contains only lions and tigers. However such anomalies are significantly harder to curate, requiring careful interventions at collection-time. Since detection of novel categories is a compelling anomaly detection task in itself (in an object recognition context), we recommend benchmarks that reflect such applications in section 2.

3. Auxiliary objectives for improved semantic representation improve anomaly detection: Following our discussion about the relevance of semanticity, in sections 4 and 5 we investigate the effectiveness of multi-task learning with auxiliary self-supervised objectives. These objectives have been shown to result in semantic representations, as measured by linear separability of object categories. Our results indicate that such augmented objectives result in improved anomaly detection, with accompanying improvements in generalization.

2 Motivation and proposed tasks

In order to develop meaningful benchmarks, we begin by considering some practical applications where being able to detect anomalies, in the context of classification tasks, would find use.

Nature studies and monitoring: Biodiversity scientists want to keep track of variety and statistics of species across the world. Online tools such as iNaturalist iNaturalist (2019) enable photo-based classification and subsequent cataloguing in data repositories from pictures uploaded by naturalists. In such automated detection tools, a potentially novel species should result in a request for expert help rather than misclassification into a known species, and detection of undiscovered species is in fact a task of interest. A similar practical application is camera-trap monitoring of members in an ecosystem, notifying caretakers upon detection of invasive species Fedor et al. (2009); Willi et al. (2019). Taxonomy of collected specimens is often backlogged due to the human labour involved. Automating digitization and identification can help catch up, and often new species are brought to light through the process Carranza-Rojas et al. (2017), which obviously depends on effective detection of novel specimens.

Medical diagnosis and clinical microbiology: Online medical diagnosis tools such as Chester Cohen et al. (2019) can be impactful at improving healthcare levels worldwide. Such tools should be especially adept at knowing when they are faced with a novel pathology, rather than categorizing it into a known subtype. Similar desiderata apply to being able to quickly detect new strains of pathogens when using machine learning systems to automate clinical identification in the microbiology lab Zieliński et al. (2017).

AI safety: Amodei et al. (2016) discusses the problem of distributional shift in the context of autonomous agents operating in our midst, with examples of actions that do not translate well across domains. A similar example in the vein of Amodei et al. (2016), grounded in a computer vision classification task, is the contrived scenario of encountering a novel vehicle (that follows different dynamics of motion), which might lead to a dangerous decision by a self-driving car which fails to recognize unfamiliarity.

Having compiled the examples above, we can now try to come up with an evaluative setting more aligned with realistic applications. The basic assumptions we make about possible evaluative tasks are: (i) that anomalies of practical interest are semantic in nature; (ii) that they are relatively rare events whose correct detection is of more primary relevance than minimizing false positives; and (iii) that we do not have access to examples of anomalies (as some existing works assume Liang et al. (2018); Lee et al. (2018)). These assumptions guide our choice of benchmarks and evaluation.

Proposed benchmarks

A very small number of recent works Akcay et al. (2018); Zenati et al. (2018) have considered a case that is more aligned with the goals stated above. Namely, for a choice of dataset, for example MNIST, train as many versions of classifiers as there are classes, holding out one class every time. At evaluation time, score the ability to detect the held-out class as anomalous. This setup is more clearly related to the task of detecting semantic anomalies, holding dataset-bias factors invariant to a significantly greater extent. In this paper, we explore this setting with CIFAR-10 and STL-10, and recommend it as the default benchmark for evaluating anomaly detection in the context of object recognition. Similar setups apply to different contexts. We discourage the practice of treating one category as in-distribution and many other categories as out-distributions (as in Pidhorskyi et al. (2018); Golan and El-Yaniv (2018), for example): this is less aligned with the more prevalent multi-class scenarios and, being a much easier task, leads to overly optimistic scores.
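To make the protocol concrete, the following is a minimal sketch of how one such hold-out-class split could be constructed with torchvision's CIFAR-10 loader. The function name and return format are illustrative choices, not the exact pipeline used in our experiments.

```python
import numpy as np
from torchvision.datasets import CIFAR10

def holdout_class_split(root, held_out_class, download=True):
    """Build one hold-out-class anomaly detection split from CIFAR-10.

    Training data: all images except those of `held_out_class`.
    Test data: the usual test set, with images of `held_out_class`
    marked as anomalies (label 1) and all other images as normal (label 0).
    """
    train = CIFAR10(root, train=True, download=download)
    test = CIFAR10(root, train=False, download=download)

    train_targets = np.array(train.targets)
    keep = train_targets != held_out_class
    train_images = train.data[keep]                  # (N, 32, 32, 3) uint8
    # Remap the remaining 9 labels to 0..8 for the classifier.
    remap = {c: i for i, c in enumerate(sorted(set(range(10)) - {held_out_class}))}
    train_labels = np.array([remap[t] for t in train_targets[keep]])

    test_targets = np.array(test.targets)
    anomaly_labels = (test_targets == held_out_class).astype(int)
    return (train_images, train_labels), (test.data, anomaly_labels)

# One classifier is trained per held-out class; detection is then scored with
# average precision of the anomaly labels against the detector's scores.
```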

While the hold-out-class setting for CIFAR-10 and STL-10 is a good setup for testing anomaly detection of disparate objects, a lot of applications, including some of the ones we described earlier, require detection of more fine-grained anomalies. For such situations, we propose a suite of tasks comprised of subsets of Imagenet (ILSVRC2012 Russakovsky et al. (2015)) with fine-grained subcategories. For example, the spider subset consists of the members tarantula, Argiope aurantia, barn spider, black widow, garden spider, and wolf spider. We also propose fungus, dog, snake, and car subsets. These subsets have varied sizes, with some of them being fairly small (see Table 1). Although this is a significantly harder task, we believe this setting aligns with the practical situations we described above, where large quantities of labelled data are not always available, and a particular fine-grained selection of categories is of interest. We intend our recommendations as a reasonably well-aligned and low-overhead set of tasks for researchers to evaluate approaches on. For very particular tasks, we advise practitioners to curate particularly relevant and more thorough benchmarks. See Appendix A for more details about our construction.

Evaluation

Current works tend to use both the Area under the Receiver Operating Characteristic curve (AUROC) and the Area under the Precision-Recall Curve (AUPRC) to evaluate performance on anomaly detection. In situations where positive examples are not only much rarer, but also of primary interest for detection, AUROC scores are a poor reflection of detection performance; precision is more relevant than the false positive rate Fawcett (2006); Davis and Goadrich (2006); Avati et al. (2018). We shall not inspect AUROC scores because in all of our settings, normal examples significantly outnumber anomalous examples, and AUROC scores are insensitive to this skew, thus resulting in misleading scores Davis and Goadrich (2006). Precision and recall are calculated as

\text{precision}(\tau) = \frac{TP(\tau)}{TP(\tau) + FP(\tau)} \quad (1)
\text{recall}(\tau) = \frac{TP(\tau)}{TP(\tau) + FN(\tau)} \quad (2)

and a precision-recall curve is then defined as a set of precision-recall points

\text{PR} = \{ (\text{recall}(\tau), \text{precision}(\tau)) \} \quad (3)

where \tau is a threshold parameter on the detection score, and TP, FP, and FN denote counts of true positives, false positives, and false negatives, respectively.

The area under the precision-recall curve is calculated by varying the threshold over a range spanning the data, creating a finite set of points for the PR curve. One alternative is to interpolate these points, producing a continuous curve as an approximation to the true curve, and computing the area under the interpolation by, for example, the trapezoid rule. Interpolation in a precision-recall curve can sometimes be misleading, as studied in Boyd et al. (2013), who recommend a number of more robust estimators. Here we use the standard approximation to average precision as the weighted mean of precisions at successive thresholds, weighted by the increase in recall from the previous threshold:

\text{AP} = \sum_{n} (R_n - R_{n-1}) \, P_n \quad (4)

where P_n and R_n denote the precision and recall at the n-th threshold.

Subset Number of members Total training images Total test images
Dog (hound dog) 12 14864 600
Car 10 13000 500
Snake (colubrid snake) 9 11700 450
Spider 6 7800 300
Fungus 6 7800 300
Table 1: Sizes of proposed benchmark subsets from ILSVRC2012. Sample images are in the Appendix. The training set consists of roughly 1300 images per member, and the test set of 50 images per member (which come from the validation set images in the ILSVRC2012 dataset).
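For reference, a small, self-contained sketch of this estimator is given below; it is functionally similar to sklearn.metrics.average_precision_score, and the variable names are illustrative.

```python
import numpy as np

def average_precision(scores, labels):
    """Average precision as the recall-weighted mean of precisions (Eq. 4).

    `scores`: higher means "more anomalous"; `labels`: 1 for anomalies, 0 for normal.
    """
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)            # true positives at each threshold
    fp = np.cumsum(1 - labels)        # false positives at each threshold
    precision = tp / (tp + fp)
    recall = tp / labels.sum()
    # AP = sum_n (R_n - R_{n-1}) * P_n, with R_0 = 0.
    return np.sum(np.diff(np.concatenate([[0.0], recall])) * precision)
```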

3 Related work

Evaluative tasks

As discussed earlier, the style of benchmarks widely adopted today follows the recommendation in Hendrycks and Gimpel (2017). Among follow-ups, the most significant successor has been Liang et al. (2018), which augmented the suite of tests with slightly more reasonable choices: for example, tiny-Imagenet is considered as out-of-distribution for in-distribution datasets such as CIFAR-10. However, on closer inspection, we find that tiny-Imagenet shares semantic categories with CIFAR-10, such as species of {dogs, cats, frogs, birds}, so it is unclear how such choices of evaluative tasks correspond to realistic anomaly detection problems. Work in the area of open-set recognition is closer to a realistic setup in terms of evaluation: in Bendale and Boult (2016), detection of novel categories is tested with a set of images corresponding to different classes that were discontinued in subsequent versions of Imagenet, but later work Dhamija et al. (2018) relapsed into treating very different datasets as novel. We do not encourage using one particular split of a collection of unseen classes as anomalous, because such a one-time split might favour implicit biases in the predefined split, and the chances of this happening are reduced with multiple hold-out trials. As mentioned earlier, a small number of works have already used the hold-out-class style of tasks for evaluation. Unfortunately, due to a lack of a motivating discussion, the community continues to adopt the tasks in Hendrycks and Gimpel (2017) and Liang et al. (2018).

Approaches to OOD detection

In Hendrycks and Gimpel (2017), the most natural baseline for a trained classifier is presented, where the detection score is simply given by the predictive confidence of the classifier (MSP). Follow-up work in Liang et al. (2018) proposed adding a small amount of adversarial perturbation, followed by temperature scaling of the softmax (ODIN). Methodologically, the approach suffers from having to pick a temperature and perturbation weight per anomaly-dataset. Confidence calibration has also been explored in DeVries and Taylor (2018), and was shown to improve complementary approaches like MSP and ODIN.

Using auxiliary datasets as surrogate anomalies has been shown to improve performance on existing benchmarks in Hendrycks et al. (2019). This approach is limited, due to its reliance on other datasets, but a more practical variant in Lee et al. (2018) uses a GAN to generate negative samples. However, Lee et al. (2018) suffers from the methodological issue of hyperparameters being optimized per anomaly-dataset. We believe that such contentious practices arise from a lack of a clear discussion of the nature of the tasks we should be concerned with, and a lack of grounding in practical applications which would dictate proper methodology. The primary goal of our paper is to help fill this gap.

In Shalev et al. (2018), the training set is augmented with semantically similar labels, but it is not always practical to assume access to a corpus providing such labels. In the next part of the paper, we explore a way to potentially induce more semantic representations, with the hope that this would lead to corresponding improvements in semantic anomaly detection and generalization.

CIFAR-10 STL-10
Classification only
Classification+rotation
Table 2: Multi-task augmentation with the self-supervised objective of predicting rotation improves generalization.
Figure 1: Plots of training and testing costs, accuracies, and average precision corresponding to hold-out-class experiments with three categories each from CIFAR-10 (top) and STL-10 (bottom), using the MSP method Hendrycks and Gimpel (2017). While classification performance is not correlated with performance at anomaly detection (compare absolute test accuracy numbers with average precision scores across columns), the “pattern” of improvement in anomaly detection appears roughly related to generalization (compare the coarse shape of test accuracy curves with that of average precision curves across iterations), indicating that there is some connection between generalization and the ability to detect semantic anomalies.

4 Encouraging semantic representation with auxiliary self-supervised objectives

We hypothesize that classifiers that learn representations which are more oriented toward capturing semantic properties would naturally lead to better performance at detecting semantic anomalies. Overfitting to low-level features such as colour or texture without consideration of global coherence might result in potential confusions in situations where the training data is biased and not representative. For a lot of existing datasets, it is quite possible to achieve good generalization performance without learning semantic distinctions, a possibility that spurs the search for removing algorithmic bias Zemel et al. (2013), and which is often exposed in embarrassing ways. As a contrived example, if the training and testing data consists of only one kind of animal which is furry, the classifier only needs to learn about fur-texture, and can ignore other meaningful characteristics such as the shape. Such a system would fail to recognize another furry, but differently shaped creature as novel, while achieving good test performance. Motivated by this line of thinking, we ask the question of how we might encourage classifiers to learn more meaningful representations.

Multi-task learning with auxiliary objectives

Caruana (1993) describes how sharing parameters for learning multiple tasks, which are related in the sense of requiring similar features, can be a powerful tool for inducing domain-specific inductive biases in a learner. Hand-design of inductive biases requires complicated engineering, while using the training signal from a related task can be a much easier way to achieve similar goals. Even when related tasks are not explicitly available, it is often possible to construct one. We explore such a framework for augmenting object recognition classifiers with auxiliary tasks. Expressed in notation, given the primary loss function \mathcal{L}_{\text{primary}}, which is the categorical cross-entropy loss in the case of classification, and the auxiliary loss \mathcal{L}_{\text{aux}} corresponding to the auxiliary task, we aim to optimize the combined loss

\min_{\theta} \; \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ \mathcal{L}_{\text{primary}}(x, y; \theta) + \lambda \, \mathcal{L}_{\text{aux}}(x; \theta) \right] \quad (5)

where \theta denotes the parameters shared across both tasks, \mathcal{D} is the dataset, and \lambda is a hyper-parameter we choose by optimizing for classification accuracy on the validation set. In practice, we alternate between the two updates rather than taking one global step; this balances the training rates of the two tasks.
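A minimal sketch of this alternating update scheme, assuming a PyTorch-style setup in which `model` is the shared feature extractor and `clf_head` / `aux_head` are the task-specific output layers (all names are illustrative, not our exact implementation):

```python
import torch

def multitask_step(model, clf_head, aux_head, clf_batch, aux_batch,
                   optimizer, lam=0.5):
    """One round of alternating updates for the combined objective in Eq. (5).

    `lam` is the weight on the auxiliary loss (the hyper-parameter above).
    """
    # Primary task: cross-entropy on object labels.
    x, y = clf_batch
    optimizer.zero_grad()
    loss_primary = torch.nn.functional.cross_entropy(clf_head(model(x)), y)
    loss_primary.backward()
    optimizer.step()

    # Auxiliary task (e.g. rotation prediction), weighted by lam.
    x_aux, y_aux = aux_batch
    optimizer.zero_grad()
    loss_aux = lam * torch.nn.functional.cross_entropy(aux_head(model(x_aux)), y_aux)
    loss_aux.backward()
    optimizer.step()
    return loss_primary.item(), loss_aux.item()
```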

Auxiliary tasks

Recently, there has been strong interest in self-supervision applied to vision Doersch et al. (2015); Pathak et al. (2016); Noroozi and Favaro (2016); Zhang et al. (2017); van den Oord et al. (2018); Gidaris et al. (2018); Caron et al. (2018), exploring tasks that induce representations which are linearly separable by object categories. These objectives naturally lend themselves as auxiliary tasks for encouraging inductive biases towards semantic representations. First, we experiment with the recently introduced task in Gidaris et al. (2018), which asks the learner to predict the orientation of a rotated image. In Table 2, we show significantly improved generalization performance of classifiers on CIFAR-10 and STL-10 when augmented with the auxiliary task of predicting rotation. Details of experimental settings, and performance on anomaly detection, are in the next section. We also perform experiments on anomaly detection with contrastive predictive coding van den Oord et al. (2018) as the auxiliary task and find that similar trends continue to hold.

The addition of such auxiliary objectives is complementary to the choice of scoring anomalies. Additionally, as in standard multi-task learning setups, it enables further augmentation with more auxiliary tasks Doersch and Zisserman (2017), which we leave for future exploration.

5 Evaluation

We study the two existing representative baselines of maximum softmax probability (MSP) Hendrycks and Gimpel (2017) and ODIN Liang et al. (2018) on the proposed benchmarks. For ODIN, it is unclear how to choose the hyperparameters for temperature scaling and the weight for adversarial perturbation without assuming access to anomalous examples, an assumption we consider unrealistic in most practical settings. We therefore fix these hyperparameters across all experiments, following the most common settings.
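For concreteness, hedged sketches of the two scoring rules are given below; the ODIN hyperparameter values shown are illustrative placeholders rather than the settings used in our experiments.

```python
import torch

@torch.no_grad()
def msp_score(model, x):
    """Maximum softmax probability (MSP) baseline of Hendrycks & Gimpel (2017).

    A lower maximum probability is treated as more anomalous, so we negate it
    to obtain a score where larger values mean "more anomalous".
    """
    probs = torch.softmax(model(x), dim=1)
    return -probs.max(dim=1).values

def odin_score(model, x, temperature=1000.0, epsilon=0.0014):
    """ODIN-style score (Liang et al., 2018): a small input perturbation that
    increases the confidence of the predicted class, followed by a
    temperature-scaled softmax. Hyperparameter values here are placeholders.
    """
    x = x.clone().requires_grad_(True)
    logits = model(x) / temperature
    loss = torch.nn.functional.cross_entropy(logits, logits.argmax(dim=1))
    loss.backward()
    x_pert = x - epsilon * x.grad.sign()   # step down the loss => higher confidence
    with torch.no_grad():
        probs = torch.softmax(model(x_pert) / temperature, dim=1)
    return -probs.max(dim=1).values
```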

CIFAR-10 Classification-only Rotation-augmented
Anomaly MSP ODIN Accuracy MSP ODIN Accuracy
airplane 43.30 ± 1.13 48.23 ± 1.90 96.00 ± 0.16 46.87 ± 2.10 49.75 ± 2.30 96.91 ± 0.02
automobile 14.13 ± 1.33 13.47 ± 1.50 95.78 ± 0.12 17.39 ± 1.26 17.35 ± 1.12 96.66 ± 0.03
bird 46.55 ± 1.27 50.59 ± 0.95 95.90 ± 0.17 51.49 ± 1.07 54.62 ± 1.10 96.79 ± 0.06
cat 38.06 ± 1.31 38.97 ± 1.43 97.05 ± 0.12 53.12 ± 0.92 55.80 ± 0.76 97.46 ± 0.07
deer 49.11 ± 0.53 53.03 ± 0.50 95.87 ± 0.12 50.35 ± 2.57 52.82 ± 2.96 96.76 ± 0.09
dog 25.39 ± 1.17 24.41 ± 1.05 96.64 ± 0.13 32.11 ± 0.82 32.46 ± 1.39 97.36 ± 0.06
frog 40.91 ± 0.81 42.21 ± 0.48 95.65 ± 0.09 52.39 ± 4.58 54.44 ± 5.80 96.51 ± 0.12
horse 36.18 ± 0.77 36.78 ± 0.82 95.64 ± 0.08 39.93 ± 2.30 39.65 ± 4.31 96.27 ± 0.07
ship 28.35 ± 0.81 30.61 ± 1.46 95.70 ± 0.15 29.36 ± 3.16 28.82 ± 4.63 96.66 ± 0.17
truck 27.17 ± 0.73 28.01 ± 1.06 96.04 ± 0.24 29.22 ± 2.87 29.93 ± 3.86 96.91 ± 0.12
Average 34.92 ± 0.41 36.63 ± 0.61 96.03 ± 0.00 40.22 ± 0.16 41.56 ± 0.15 96.83 ± 0.02
Table 3: We train ResNet classifiers on CIFAR-10 holding out each class per run, and score detection with average precision for the maximum softmax probability (MSP) baseline in Hendrycks and Gimpel (2017) and ODIN Liang et al. (2018). We find that augmenting with rotation results in complementary improvements to both scoring methods for anomaly detection (contrast columns in the right half with those in the left half). All results are reported over 3 trials.
STL-10 Classification-only Rotation-augmented
Anomaly MSP ODIN Accuracy MSP ODIN Accuracy
airplane 19.21 ± 1.05 23.46 ± 1.65 85.18 ± 0.20 22.21 ± 0.76 23.37 ± 1.71 89.24 ± 0.12
bird 29.05 ± 0.69 33.51 ± 0.36 85.91 ± 0.36 36.12 ± 2.08 40.08 ± 3.30 89.91 ± 0.29
car 14.52 ± 0.37 16.14 ± 0.83 84.32 ± 0.55 15.95 ± 2.20 16.87 ± 2.94 89.52 ± 0.44
cat 25.21 ± 0.93 27.92 ± 0.84 86.95 ± 0.36 29.34 ± 1.30 31.35 ± 1.88 90.89 ± 0.26
deer 24.29 ± 0.53 25.94 ± 0.49 85.34 ± 0.35 27.60 ± 2.22 29.71 ± 2.55 89.20 ± 0.17
dog 23.42 ± 0.60 23.44 ± 1.18 87.78 ± 0.45 26.78 ± 0.71 26.14 ± 0.62 91.37 ± 0.33
horse 21.31 ± 1.01 22.19 ± 0.75 85.52 ± 0.21 23.79 ± 1.46 23.59 ± 1.63 89.60 ± 0.11
monkey 23.67 ± 0.83 21.98 ± 0.91 86.66 ± 0.31 28.43 ± 1.67 28.32 ± 1.20 90.07 ± 0.23
ship 14.61 ± 0.12 13.78 ± 0.63 84.65 ± 0.21 16.79 ± 1.20 15.37 ± 1.22 89.33 ± 0.15
truck 15.43 ± 0.17 14.35 ± 0.12 85.34 ± 0.17 17.05 ± 0.50 16.59 ± 0.60 90.08 ± 0.38
Average 21.07 ± 0.25 22.27 ± 0.29 85.77 ± 0.13 24.41 ± 0.23 25.14 ± 0.45 89.92 ± 0.08
Table 4: Average precision scores for hold-out-class experiments with STL-10. We observe that the same trends in improvements hold as with the previous experiments on CIFAR-10.
Classification-only Rotation-augmented
Subset Skew (%) MSP ODIN Accuracy MSP ODIN Accuracy
dog 8.33 23.92 ± 0.49 25.85 ± 0.09 85.09 ± 0.14 24.66 ± 0.58 25.73 ± 0.87 85.25 ± 0.17
car 10.00 21.54 ± 0.62 22.49 ± 0.54 77.17 ± 0.10 21.66 ± 0.19 22.38 ± 0.46 76.72 ± 0.19
snake 11.11 18.62 ± 0.93 19.18 ± 0.79 69.74 ± 1.63 20.23 ± 0.18 21.17 ± 0.12 70.51 ± 0.48
spider 16.67 21.20 ± 0.56 24.15 ± 0.72 68.40 ± 0.21 22.90 ± 1.29 25.10 ± 1.78 68.68 ± 0.77
fungus 16.67 42.56 ± 0.49 44.59 ± 1.46 88.23 ± 0.45 44.19 ± 1.86 46.86 ± 1.13 88.47 ± 0.43
Table 5: Averaged average precisions for the proposed subsets of Imagenet, with rotation-prediction as the auxiliary task. Each row shows averaged performance across all members of the subset. A random-guessing baseline would score at the skew rate. Expanded rows are in Appendix B.

5.1 Experimental settings

Settings for CIFAR-10 and STL-10

Our base network for all CIFAR-10 experiments is a Wide ResNet Zagoruyko and Komodakis (2016) with 28 convolutional layers and a widening factor of 10 (WRN-28-10) with the recommended dropout rate of 0.3. Following Zagoruyko and Komodakis (2016), we train for 200 epochs, with an initial learning rate of 0.1 which is scaled down by 5 at the 60th, 120th, and 160th epochs, using stochastic gradient descent with Nesterov’s momentum at 0.9. We train in parallel on 4 Pascal V100 GPUs with batches of size 128 on each. For STL-10, we use the same architecture but append an extra group of 4 residual blocks with the same layer widths as in the previous group. We also use a widening factor of 4 instead of 10, and batches of size 64 on each of the 4 GPUs. We use the same optimizer settings as with CIFAR-10. In both cases, we apply standard data augmentation of random crops (after padding) and random horizontal reflections.

Settings for Imagenet

For experiments with the proposed subsets of Imagenet, we replicate the architecture we use for STL-10, but add a downsampling average pooling layer after the first convolution on the images. We do not use dropout, and use a batch size of 64; otherwise all other details follow from the experiments on STL-10. The standard data augmentation steps of random crops and random horizontal reflections are applied.

Predicting rotation as an auxiliary task

For adding rotation-prediction as an auxiliary task, all we do is append an extra linear layer alongside the one that is responsible for object recognition. The auxiliary weight λ is tuned to 0.5 for CIFAR-10, 1.0 for STL-10, and a mix of 0.5 and 1.0 for Imagenet, based on validation performance. The optimizer and regularizer settings are kept the same, with the learning rate decayed along with the learning rate for the classifier at the same scales.

We emphasize that this procedure is not equivalent to data augmentation, since we do not optimize the linear classification layer for rotated images. Only the rotation prediction linear layer gets updated for inputs corresponding to the rotation task, and only the linear classification layer gets updated for non-rotated, object-labelled images. Asking the classifier to be rotation-invariant would require the auxiliary task to develop a disjoint subset in the shared representation that is not rotation-invariant, so that it can succeed at predicting rotations. This encourages an internally split representation, thus diminishing the potential advantage we hope to achieve from a shared, mutually beneficial space.
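A minimal sketch of how the rotation-prediction inputs and targets could be constructed is shown below; the helper is illustrative, and in our setup the resulting rotated batches are fed only to the rotation head.

```python
import torch

def make_rotation_batch(images):
    """Create the 4-way rotation-prediction task of Gidaris et al. (2018).

    Each image in the (B, C, H, W) batch is rotated by 0, 90, 180, and 270
    degrees; the auxiliary head must predict which rotation was applied.
    Only the rotation head (not the object-classification head) is trained
    on these rotated inputs.
    """
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)])
    labels = torch.arange(4).repeat_interleave(images.size(0))
    return rotated, labels
```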

CPC as an auxiliary task

We also experimented with contrastive predictive coding van den Oord et al. (2018) as an auxiliary task. Since this is a patch-based method, the input spaces are different across the two tasks: that of predicting encodings of patches in the image, and that of predicting the object category from the entire image. We found that two tricks are very useful for fostering co-operation: (i) replacing the normalization layers with their conditional variants de Vries et al. (2017) (conditioning on the task at hand), and (ii) using symmetric-padding instead of zero-padding. Since CPC induces significant computational overhead, we resorted to a lighter-weight base network. This generally comes at the cost of a drop in classification accuracy and performance at detecting anomalies. We still find, in Table 6, that similar patterns of improvements continue to hold, in terms of improved anomaly detection and improved generalization, with our auxiliary task. We describe details of the model and report full results in Appendix C.

Classification-only CPC-augmented
Subset Skew (%) MSP ODIN Accuracy MSP ODIN Accuracy
dog 8.33 20.84 ± 0.50 22.77 ± 0.74 83.12 ± 0.26 21.43 ± 0.63 24.08 ± 0.63 84.16 ± 0.07
car 10.00 19.86 ± 0.21 21.42 ± 0.48 75.42 ± 0.11 22.21 ± 0.44 23.61 ± 0.57 78.88 ± 0.15
snake 11.11 18.20 ± 0.76 18.67 ± 1.07 66.15 ± 1.89 18.78 ± 0.40 20.39 ± 0.60 68.02 ± 0.85
spider 16.67 22.03 ± 0.68 24.08 ± 0.70 66.65 ± 0.42 22.28 ± 0.60 23.37 ± 0.68 68.67 ± 0.36
fungus 16.67 39.19 ± 1.26 41.71 ± 1.94 87.05 ± 0.06 42.08 ± 0.57 45.05 ± 1.11 88.91 ± 0.46
Table 6: Averaged average precisions for the proposed subsets of Imagenet where CPC is the auxiliary task. Expanded results are in Appendix C.

5.2 Discussion

Self-supervised multi-task learning is effective

In Tables 3 and 4 we report average precision scores on CIFAR-10 and STL-10 for the baseline scoring methods MSP Hendrycks and Gimpel (2017) and ODIN Liang et al. (2018). We note that ODIN, with fixed hyperparameter settings across all experiments, continues to outperform MSP most of the time. When we augment our classifiers with the auxiliary rotation-prediction task, we find that anomaly detection as well as test set accuracy are markedly improved for both scoring methods. As we have remarked earlier, a representation space with greater semanticity should be expected to bring improvements on both fronts. All results report mean ± standard deviation over 3 trials. In Table 5, we repeat the same process for the much harder Imagenet subsets. Full results, corresponding to individual members of the subsets, are in Appendix B, while here we only show the average performance across all members of each subset. In Appendix C, we show results when CPC is the auxiliary task. Taken together, our results indicate that multi-task learning with self-supervised auxiliary tasks can be an effective approach for improving anomaly detection, with accompanying improvements in generalization.

Improving test set accuracy might not improve anomaly detection

Training methods developed solely to improve generalization, without consideration of the effect on semantic understanding, might perform worse at detecting semantic anomalies. This is because it is often possible to overfit to low-level or contextual discriminatory patterns, which are almost surely biased in small datasets for complex domains such as natural images, and still perform reasonably well on the test set. To illustrate this, we run an experiment where we randomly mask out a region within the central area of CIFAR-10 images. We find that while this leads to improved test accuracies, anomaly detection suffers (numbers are averages across hold-out-class trials):

Method Accuracy Av. Prec. with MSP
Base model 96.03 ± 0.00 34.92 ± 0.41
Random-center-masked 96.27 ± 0.05 34.41 ± 0.74
Rotation-augmented 96.83 ± 0.02 40.22 ± 0.16

This hints that while the masking strategy is effective as a regularizer, it might come at the cost of less semantic representation. Such training choices can therefore result in models with seemingly improved generalization but which have a poorer understanding of object coherence, due to potentially overfitting to a greater extent on biases in local statistics or contextual cues in the dataset. For comparison, the rotation-augmented network achieves both a higher test set accuracy as well as an improved average precision. This example serves as yet another cautionary tale about developing techniques that might inadvertently lead to neural networks achieving reassuring test set performance, while following an internal modus operandi very much misaligned with the pattern of reasoning we hope they discover. This can have unexpected consequences when such models are deployed.

6 Conclusion

We provided a critical review of the current interest in OOD detection, concluding that realistic applications involve detecting semantic distributional shift in a specified context, which we regard as anomaly detection. While there is significant recent interest in the area, current research suffers from questionable benchmarks and methodology. In light of these considerations, we proposed a set of benchmarks which are better aligned with realistic anomaly detection applications in the context of object classification systems.

We also explored the effectiveness of a multi-task learning framework with auxiliary objectives. Our results demonstrate improved anomaly detection along with improved generalization under such augmented objectives. This suggests that inductive biases induced through such auxiliary tasks could have an important role to play in developing neural networks with representations that lead to improved anomaly detection and generalization.

We note that the ability to detect semantic anomalies also provides us with an indirect view of semanticity in the representations learned by our mostly opaque deep models.

Acknowledgements

We thank Rachel Rolland for referencing and discussing the motivating examples of anomaly detection in nature studies. Ishaan Gulrajani, Tim Cooijmans, and anonymous reviewer R1 provided useful feedback.

The experiments were carried out on computational resources provided by Compute Canada.


References

  • [1] S. Akcay, A. Atapour-Abarghouei, and T. P. Breckon (2018) GANomaly: semi-supervised anomaly detection via adversarial training. ACCV. Cited by: §2.
  • [2] M. Alcorn, Q. Li, Z. Gong, C. Wang, L. Mai, W. Ku, and A. Nguyen (2019) Strike (with) a pose: neural networks are easily fooled by strange poses of familiar objects. CVPR. Cited by: §1.
  • [3] D. Amodei, C. Olah, J. Steinhardt, P. F. Christiano, J. Schulman, and D. Mané (2016) Concrete problems in AI safety. CoRR abs/1606.06565. Cited by: §1, §2.
  • [4] A. Avati, K. Jung, S. Harman, L. Downing, A. Ng, and N. H. Shah (2018-12-12) Improving palliative care with deep learning. BMC Medical Informatics and Decision Making 18 (4), pp. 122. Cited by: §2.
  • [5] S. Beery, G. V. Horn, and P. Perona (2018) Recognition in terra incognita. CoRR. Cited by: §1.
  • [6] A. Bendale and T. E. Boult (2016) Towards open set deep networks. CVPR. Cited by: §3.
  • [7] K. Boyd, K. H. Eng, and C. D. Page (2013) Area under the precision-recall curve: point estimates and confidence intervals. ECML-PKDD. Cited by: §2.
  • [8] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In ECCV, Cited by: §4.
  • [9] J. Carranza-Rojas, H. Goeau, P. Bonnet, E. Mata-Montero, and A. Joly (2017-08-11) Going deeper in the automated identification of herbarium specimens. BMC Evolutionary Biology 17 (1), pp. 181. Cited by: §2.
  • [10] R. Caruana (1993) Multitask learning: a knowledge-based source of inductive bias. In ICML, Cited by: §4.
  • [11] J. P. Cohen, P. Bertin, and V. Frappier (2019) Chester: A web delivered locally computed chest x-ray disease prediction system. CoRR abs/1901.11210. Cited by: §2.
  • [12] J. Davis and M. Goadrich (2006) The relationship between precision-recall and ROC curves. In ICML, pp. 233–240. Cited by: §2.
  • [13] H. de Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. C. Courville (2017) Modulating early visual processing by language. NIPS. Cited by: Appendix C, §5.1.
  • [14] T. DeVries and G. W. Taylor (2018) Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865. Cited by: §1, §3.
  • [15] A. R. Dhamija, M. Günther, and T. Boult (2018) Reducing network agnostophobia. NIPS. Cited by: §3.
  • [16] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. ICCV. Cited by: §4.
  • [17] C. Doersch and A. Zisserman (2017) Multi-task self-supervised visual learning. In ICCV, Cited by: §4.
  • [18] T. Fawcett (2006) An introduction to ROC analysis. Pattern Recognition Letters 27 (8), pp. 861–874. Cited by: §2.
  • [19] P. Fedor, J. Vanhara, J. Havel, I. Malenovsky, and I. Spellerberg (2009) Artificial intelligence in pest insect monitoring. Systematic Entomology 34 (2), pp. 398–400. Cited by: §2.
  • [20] J. A. Fodor and Z. W. Pylyshyn (1988) Connectionism and cognitive architecture: a critical analysis. Cognition 28 (1), pp. 3 – 71. Cited by: §1.
  • [21] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel (2019) ImageNet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. ICLR. Cited by: §1.
  • [22] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. ICLR. Cited by: §4.
  • [23] I. Golan and R. El-Yaniv (2018) Deep anomaly detection using geometric transformations. NeurIPS. Cited by: §2.
  • [24] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. ICML, pp. 1321–1330. Cited by: §1.
  • [25] O. J. Hénaff, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord (2019) Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272. Cited by: Appendix C.
  • [26] D. Hendrycks and K. Gimpel (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. ICLR. Cited by: §1, Figure 1, §3, §3, §5.2, Table 3, §5.
  • [27] D. Hendrycks, M. Mazeika, and T. Dietterich (2019) Deep anomaly detection with outlier exposure. ICLR. Cited by: §1, §3.
  • [28] iNaturalist (2019) https://news.developer.nvidia.com/ai-app-identifies-plants-and-animals-in-seconds, accessed on 17 May 2019. Cited by: §2.
  • [29] K. Lee, H. Lee, K. Lee, and J. Shin (2018) Training confidence-calibrated classifiers for detecting out-of-distribution samples. ICLR. Cited by: §1, §2, §3.
  • [30] S. Liang, Y. Li, and R. Srikant (2018) Enhancing the reliability of out-of-distribution detection in neural networks. ICLR. Cited by: Appendix D, §1, §2, §3, §3, §5.2, Table 3, §5.
  • [31] E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan (2019) Do deep generative models know what they don’t know?. ICLR. Cited by: Appendix D, §1.
  • [32] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. ECCV. Cited by: §4.
  • [33] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros (2016) Context encoders: feature learning by inpainting. CVPR. Cited by: §4.
  • [34] S. Pidhorskyi, R. Almohsen, and G. Doretto (2018) Generative probabilistic novelty detection with adversarial autoencoders. NIPS. Cited by: §2.
  • [35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. IJCV. Cited by: §2.
  • [36] G. Shalev, Y. Adi, and J. Keshet (2018) Out-of-distribution detection using multiple semantic label representations. NeurIPS. Cited by: §1, §3.
  • [37] T. Tommasi, N. Patricia, B. Caputo, and T. Tuytelaars (2017) A deeper look at dataset bias. In Domain Adaptation in Computer Vision Applications, pp. 37–55. Cited by: §1.
  • [38] A. Torralba and A. A. Efros (2011-06) Unbiased look at dataset bias. In CVPR, pp. 1521–1528. Cited by: §1.
  • [39] A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. CoRR abs/1807.03748. Cited by: Appendix C, §4, §5.1.
  • [40] M. Willi, R. T. Pitman, A. W. Cardoso, C. Locke, A. Swanson, A. Boyer, M. Veldthuis, and L. Fortson (2019) Identifying animal species in camera trap images using deep learning and citizen science. Methods in Ecology and Evolution 10 (1), pp. 80–91. Cited by: §2.
  • [41] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In BMVC, Cited by: §5.1.
  • [42] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork (2013) Learning fair representations. ICML. Cited by: §4.
  • [43] H. Zenati, C. S. Foo, B. Lecouat, G. Manek, and V. R. Chandrasekhar (2018) Efficient gan-based anomaly detection. CoRR abs/1802.06222. Cited by: §2.
  • [44] R. Zhang, P. Isola, and A. A. Efros (2017) Split-brain autoencoders: unsupervised learning by cross-channel prediction. CVPR. Cited by: §4.
  • [45] B. Zieliński, A. Plichta, K. Misztal, P. Spurek, M. Brzychczy-Włoch, and D. Ochońska (2017) Deep learning approach to bacterial colony classification. PLoS One 12(9). Cited by: §2.

Appendix A Imagenet benchmarks

We present details of the Imagenet-based benchmarks we proposed. For constructing these datasets, we first sorted all subsets by the number of members, as structured in the Imagenet hierarchy. We then picked from among the list of the top twenty subsets, with a preference for subsets that are more closely aligned with the motivating practical applications we described. We also manually inspected the data to check for inconsistencies, and performed some pruning. For example, the beetle subset, while seeming ideal, has some issues with labelling: leaf beetle and ladybug appear to overlap in some cases. Finally, we settled on our choice of 5 subsets. In Table 7, we list the members under every proposed subset.

Dog (hound) Car Snake (colubrid) Spider Fungus
Ibizan hound Model T ringneck snake tarantula stinkhorn
bluetick race car vine snake Argiope aurantia bolete
beagle sports car hognose snake barn spider hen-of-the-woods
Afghan hound minivan thunder snake black widow earthstar
Weimaraner ambulance garter snake garden spider gyromitra
Saluki cab king snake wolf spider coral fungus
redbone beach wagon night snake
otterhound jeep green snake
Norwegian elkhound convertible water snake
basset hound limo
Scottish deerhound
bloodhound
Table 7: Imagenet subset members

In Figures 2, 3, 4, 5, and 6 we show samples of images. The sets are collected by first resizing such that the shorter side is of length 256 pixels, followed by a center crop. It is obvious that, due to intrinsic dataset bias, some categories may be viewed as anomalous without careful inspection of the object of interest. For example, owing to their smaller size, ringneck snakes are most often photographed when held in human hands, and race cars are usually pictured on race tracks. Such dataset biases have historically been hard to account for, and we recommend more thoughtful curation of specific datasets for specific tasks for our proposed style of benchmarks to be more reflective of field performance. This is also why we recommend multiple hold-out trials as opposed to a single predefined split: it is possible that a particular set of such biases falls into a particular split, which would score methods that exploit such biases higher than ones that are potentially more robust across a family of biases. Multiple hold-out trials reduce such inadvertent advantages due to bias.

Figure 2: Sample images from the Dog (hound dog) subset: Ibizan hound, bluetick, beagle, Afghan hound, Weimaraner, Saluki, redbone, otterhound, Norwegian elkhound, basset hound, Scottish deerhound, bloodhound.
Figure 3: Sample images from the Car subset: Model T, race car, sports car, minivan, ambulance, cab, beach wagon, jeep, convertible, limousine.
Figure 4: Sample images from the Snake (colubrid) subset: ringneck snake, vine snake, hognose snake, thunder snake, garter snake, king snake, night snake, green snake, water snake.
Figure 5: Sample images from the Spider subset: tarantula, Argiope aurantia, barn spider, black widow, garden spider, wolf spider.
Figure 6: Sample images from the Fungus subset: stinkhorn, bolete, hen-of-the-woods, earthstar, gyromitra, coral fungus.

Appendix B Expanded results for Imagenet-subset experiments with rotation-prediction as the auxiliary task

We show expanded results for the Imagenet experiments with predicting rotation as an auxiliary task, corresponding to every hold-out-class experiment, in the tables below.

Dog Classification-only Rotation-augmented
Anomalous dogs MSP ODIN Accuracy MSP ODIN Accuracy
Ibizan hound 25.56 2.43 26.34 2.65 85.19 0.65 22.85 2.54 24.27 2.60 85.99 0.73
bluetick 34.37 4.32 39.19 1.43 85.50 0.82 29.60 4.39 31.70 1.99 85.36 0.59
beagle 18.29 1.54 17.05 1.21 86.33 1.18 19.79 2.02 18.52 1.69 86.37 0.71
Afghan hound 20.05 3.07 18.69 1.22 84.16 0.59 20.62 1.26 20.56 1.37 83.44 0.98
Weimaraner 31.04 2.22 36.87 2.90 83.68 0.36 27.65 2.52 30.59 1.03 83.62 1.40
Saluki 26.64 2.50 31.75 2.01 83.76 1.08 28.20 1.46 29.27 1.30 84.74 0.19
redbone 17.93 0.59 18.66 0.59 86.61 1.48 19.14 0.39 19.54 1.15 86.01 1.10
otterhound 22.71 0.77 23.31 0.83 84.50 0.54 21.32 3.71 22.90 4.28 84.24 0.56
Norwegian elkhound 28.82 2.16 36.55 0.61 83.91 1.35 34.64 6.16 41.61 6.55 84.33 0.84
basset hound 18.39 0.91 16.23 0.58 86.34 0.76 21.33 3.10 19.46 1.72 86.45 0.65
Scottish deerhound 26.83 2.97 26.52 2.23 83.95 0.61 26.64 1.17 24.87 0.52 83.80 0.02
bloodhound 16.43 1.17 19.04 0.91 87.17 0.24 24.19 6.23 25.42 6.60 88.69 0.84
Average 23.92 0.49 25.85 0.09 85.09 0.14 24.66 0.58 25.73 0.87 85.25 0.17
Car Classification-only Rotation-augmented
Anomalous cars MSP ODIN Accuracy MSP ODIN Accuracy
Model T 26.77 1.21 31.20 1.22 72.92 0.49 32.09 0.86 36.10 1.84 72.52 0.62
race car 22.48 2.53 27.12 3.90 79.65 1.85 20.32 5.47 22.41 6.41 74.67 3.39
sports car 16.20 1.77 13.86 0.44 80.97 1.97 16.80 0.93 15.58 0.96 81.00 0.48
minivan 17.19 2.57 17.78 1.68 79.25 1.89 17.32 3.08 18.45 2.67 80.01 0.98
ambulance 11.13 1.78 9.51 0.97 75.71 2.44 11.24 0.84 10.61 1.25 75.78 0.31
cab 26.17 2.42 27.93 2.30 75.92 3.77 28.57 1.91 29.39 2.52 76.74 3.09
beach wagon 24.82 0.85 26.30 2.00 78.75 1.09 24.50 1.64 25.22 1.89 79.81 1.27
jeep 25.47 0.38 26.99 2.70 74.67 1.37 27.92 5.01 27.74 3.28 72.84 0.47
convertible 20.00 2.63 18.35 1.17 76.86 0.81 15.32 2.01 14.79 2.24 76.26 1.74
limo 25.16 1.43 25.87 0.83 77.04 1.52 22.53 2.08 23.49 1.17 77.54 1.19
Average 21.54 0.62 22.49 0.54 77.17 0.10 21.66 0.19 22.38 0.46 76.72 0.19
Snake Classification-only Rotation-augmented
Anomalous snakes MSP ODIN Accuracy MSP ODIN Accuracy
ringneck snake 20.18 2.98 20.56 2.78 71.08 3.13 20.84 0.77 23.22 1.07 69.03 0.46
vine snake 16.07 4.51 17.19 3.91 67.94 3.96 15.94 1.05 16.15 0.96 72.65 6.44
hognose snake 16.82 0.38 16.65 0.46 67.95 2.81 19.70 1.22 19.32 0.77 69.85 2.50
thunder snake 17.06 2.94 19.18 3.31 71.86 7.58 21.26 0.35 23.08 0.70 69.34 7.08
garter snake 21.45 4.35 22.16 3.81 67.26 2.19 22.67 1.13 23.12 1.83 68.29 5.90
king snake 17.37 0.39 16.55 1.19 66.45 5.38 19.47 3.72 17.96 2.74 68.13 2.74
night snake 21.70 4.01 20.50 3.36 76.56 0.78 23.28 0.79 24.12 1.26 78.71 3.91
green snake 12.42 3.31 13.49 3.57 71.07 6.41 13.15 1.11 13.94 0.23 71.74 2.75
water snake 24.50 3.10 26.36 3.24 67.46 6.52 25.77 3.59 29.62 5.04 66.85 0.70
Average 18.62 0.93 19.18 0.79 69.74 1.63 20.23 0.18 21.17 0.12 70.51 0.48
Spider Classification-only Rotation-augmented
Anomalous spiders MSP ODIN Accuracy MSP ODIN Accuracy
tarantula 19.45 0.73 22.91 2.37 60.67 0.66 24.27 3.25 26.07 2.73 60.07 0.86
Argiope aurantia 12.97 0.39 12.49 0.48 69.70 2.17 12.82 0.51 12.01 0.21 69.17 1.57
barn spider 23.03 3.03 23.83 2.95 75.69 0.55 21.41 1.41 23.54 1.21 76.56 1.85
black widow 29.24 4.39 37.96 5.68 61.79 0.87 37.08 7.50 42.64 8.87 62.63 0.09
garden spider 17.36 1.33 15.51 0.88 77.81 2.29 16.57 1.58 15.88 1.42 76.38 1.94
wolf spider 25.15 2.98 32.23 1.91 64.73 0.45 25.48 1.75 30.69 1.28 67.19 0.37
Average 21.20 0.56 24.15 0.72 68.40 0.21 22.90 1.29 25.10 1.78 68.68 0.77
Fungus Classification-only Rotation-augmented
Anomalous fungi MSP ODIN Accuracy MSP ODIN Accuracy
stinkhorn 52.43 1.15 56.37 1.98 90.91 0.54 54.37 4.65 59.10 5.71 92.27 0.97
bolete 51.04 0.42 52.82 3.10 89.19 0.94 49.43 2.05 53.07 3.48 89.22 1.09
hen-of-the-woods 44.83 1.52 48.04 0.84 89.41 1.64 48.87 2.00 51.37 2.44 90.13 0.33
earthstar 34.90 3.26 36.79 2.16 86.70 1.91 41.96 7.66 43.24 4.92 86.46 0.62
gyromitra 46.75 0.42 49.06 2.64 86.79 1.66 44.90 1.94 49.20 1.51 86.39 0.18
coral fungus 25.42 2.60 24.44 3.04 86.36 1.25 25.58 1.15 25.22 2.81 86.35 0.80
Average 42.56 0.49 44.59 1.46 88.23 0.45 44.19 1.86 46.86 1.13 88.47 0.43

Appendix C Experiments with contrastive predictive coding (CPC) as the auxiliary task

In this section, we provide further details of our experiments with CPC [39] as an auxiliary task. We only run these experiments on our proposed Imagenet subsets since, as a patch-encoding predictive method, CPC has been developed primarily for signals with sufficient spatial or temporal extent for meaningful subsampling. Existing work has explored the application to smaller images, but here we only focus on the most realistic and most difficult of the benchmarks we have proposed.

CPC involves performing predictions for encodings of patches of an image from those above them. To avoid learning trivial codes, a contrastive loss is used which essentially trains the model to distinguish between correct codes and “noisy” ones. These negative samples are taken from patches within and across images in the batch.

We use the same network architecture as we used for the Imagenet experiments with rotation-prediction as the auxiliary task, but modify the first convolution layer to have a stride of 2. This reduces the computational overhead sufficiently for concurrent training with CPC at reasonable batch-sizes (CPC training batch-sizes are 32), but at a minor expense of classification performance. We use the first three blocks of the network for the patch encoder as in [39], and append the final layers for the classification task. Unlike with rotation, the auxiliary task works on patches while the primary classifier works on the entire image. This leads to differences in the operating receptive-fields, and differing proportions of boundary effects. To facilitate easier parameter sharing across the two tasks, we make the following changes. First, we replace all default zero-padding with symmetric-padding. This removes the effect of having a different ratio of border-zeros to pixels when the spatial dimensions of the input change. Second, we replace all normalization layers with conditional normalization variants [13]: this means separate sets of scale and shift parameters are used depending on the current predictive task. Since batch-normalization allows trivial solutions to CPC for patches sampled from different images, as noted in [25], we only use patches from within the same image, and find that we can continue using CPC to our advantage (although we found that such an implementation of CPC by itself leads to less linearly-separable representations compared to also taking negative samples from other images). We keep the same optimizer settings from the rotation experiments, but it is possible that different choices might lead to further improvements. The auxiliary weight λ is tuned to 10.0 for all experiments, following a coarse hyperparameter search over a range of values for the best validation-set classification accuracy.
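As an illustration of the task-conditional normalization described above, a minimal sketch is given below. The module is a simplified stand-in (it shares batch-norm running statistics across tasks, for instance), so details may differ from the exact conditional variant used in our experiments.

```python
import torch
import torch.nn as nn

class TaskConditionalBatchNorm2d(nn.Module):
    """Batch normalization with a separate affine transform per task,
    in the spirit of conditional normalization (de Vries et al., 2017).

    Shared convolutional features receive task-specific scales and shifts,
    so the patch-based auxiliary task and the full-image primary task can
    use different statistics over a shared backbone.
    """
    def __init__(self, num_features, num_tasks=2):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.gamma = nn.Parameter(torch.ones(num_tasks, num_features))
        self.beta = nn.Parameter(torch.zeros(num_tasks, num_features))

    def forward(self, x, task_id):
        h = self.bn(x)
        g = self.gamma[task_id].view(1, -1, 1, 1)
        b = self.beta[task_id].view(1, -1, 1, 1)
        return g * h + b
```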

In the tables below, we show that similar patterns of improved anomaly detection and generalization are observed as with our experiments where rotation-prediction was the auxiliary task.

Classification-only CPC-augmented
Anomalous dog MSP ODIN Accuracy MSP ODIN Accuracy
Ibizan hound 20.07 2.37 21.48 3.04 84.08 0.51 20.33 1.12 21.23 1.80 84.49 0.62
bluetick 24.64 3.66 30.79 2.86 82.09 1.06 27.23 1.64 34.59 1.10 83.86 0.27
beagle 19.45 0.62 18.49 0.70 84.68 0.45 20.17 1.26 18.42 1.03 85.56 0.51
Afghan hound 17.46 1.38 18.51 1.17 80.95 0.58 16.33 1.48 20.60 1.56 83.21 0.84
Weimaraner 26.80 5.68 32.76 6.41 82.15 0.78 26.09 1.99 29.83 2.19 82.72 0.52
Saluki 25.22 1.38 29.76 0.66 82.96 0.50 23.19 0.74 27.07 1.71 84.13 0.13
redbone 16.47 1.12 16.62 1.68 83.78 0.64 18.39 2.35 17.91 0.81 84.25 0.16
otterhound 17.37 1.80 17.41 1.43 82.00 0.81 16.37 1.22 16.44 1.25 83.77 0.21
Norwegian elkhound 23.71 3.89 27.66 3.81 81.26 0.17 28.82 2.58 38.19 4.92 82.53 0.59
basset hound 18.63 1.45 17.53 1.87 84.59 0.62 18.04 1.24 17.11 0.64 85.36 0.30
Scottish deerhound 21.70 0.23 19.70 0.67 82.79 0.79 23.23 2.82 23.42 3.56 83.46 0.55
bloodhound 18.53 1.63 22.59 2.13 86.49 0.72 18.96 1.09 24.09 1.99 86.55 1.02
Average 20.84 0.50 22.77 0.74 83.12 0.26 21.43 0.63 24.08 0.63 84.16 0.07
Classification-only CPC-augmented
Anomalous car MSP ODIN Accuracy MSP ODIN Accuracy
Model T 24.48 2.28 27.42 2.11 71.00 0.98 27.08 2.01 30.44 2.39 76.90 0.60
race car 21.22 0.78 26.82 1.07 76.79 0.80 20.19 2.57 24.54 3.58 80.22 0.16
sports car 15.16 0.82 14.34 0.63 79.40 0.47 19.33 1.45 15.33 0.96 81.70 0.53
minivan 16.92 2.16 18.60 2.95 77.48 0.58 17.87 0.46 19.54 2.34 80.38 0.62
ambulance 11.12 0.58 10.36 0.28 73.16 0.40 11.41 2.43 11.32 2.66 76.70 1.36
cab 23.52 1.33 27.08 1.13 76.26 0.65 26.45 1.01 28.11 1.64 78.51 0.38
beach wagon 23.52 2.35 23.60 0.96 76.82 0.34 24.57 0.68 27.34 1.32 80.96 0.50
jeep 25.72 0.32 27.31 0.89 73.10 0.85 27.37 3.33 29.21 2.51 76.77 0.63
convertible 15.05 0.51 14.11 0.60 74.80 0.19 20.84 2.61 21.43 2.43 78.84 0.85
limo 21.91 2.30 24.52 2.90 75.38 0.63 26.97 2.27 28.84 2.50 77.84 0.97
Average 19.86 0.21 21.42 0.48 75.42 0.11 22.21 0.44 23.61 0.57 78.88 0.15
Classification-only CPC-augmented
Anomalous snake MSP ODIN Accuracy MSP ODIN Accuracy
ringneck snake 19.14 0.52 19.89 1.12 63.49 3.27 20.06 1.20 22.98 0.70 66.64 0.51
vine snake 13.91 1.83 15.51 2.13 64.52 2.01 14.81 1.27 15.63 0.45 67.14 2.57
hognose snake 15.07 0.84 13.72 0.45 65.56 3.19 13.73 0.93 14.10 1.15 67.52 2.72
thunder snake 19.16 0.03 19.35 0.92 70.29 2.43 20.25 0.71 20.07 2.81 72.44 4.05
garter snake 20.86 2.26 22.62 2.94 63.59 2.32 20.25 2.03 23.80 2.18 65.23 2.32
king snake 17.36 1.34 15.17 2.04 62.43 2.04 19.35 2.62 18.16 2.95 64.61 2.39
night snake 20.67 0.10 19.81 1.18 73.04 2.02 21.94 2.49 22.78 0.85 73.62 2.60
green snake 12.31 0.63 13.63 0.90 67.11 4.65 12.43 1.73 12.92 0.97 65.58 2.37
water snake 25.30 2.04 28.33 3.52 65.35 4.95 26.20 2.50 33.06 2.34 69.38 3.72
Average 18.20 0.76 18.67 1.07 66.15 1.89 18.78 0.40 20.39 0.60 68.02 0.85
Classification-only CPC-augmented
Anomalous spider MSP ODIN Accuracy MSP ODIN Accuracy
tarantula 22.42 2.32 22.47 2.51 58.71 1.58 21.94 0.22 23.34 1.48 61.46 0.66
Argiope aurantia 13.84 0.37 12.98 0.38 66.66 1.04 14.44 0.95 12.93 0.34 68.59 1.97
barn spider 25.39 1.40 25.81 2.52 74.17 0.96 20.60 2.90 22.92 3.42 75.76 1.91
black widow 24.20 1.60 29.34 3.18 60.60 1.28 28.93 0.91 34.29 1.42 63.52 1.21
garden spider 17.21 0.28 16.02 0.20 75.05 1.57 17.90 0.53 16.65 0.19 75.87 1.79
wolf spider 29.11 2.91 37.87 2.64 64.72 1.38 29.88 0.83 30.10 0.33 66.79 1.74
Average 22.03 0.68 24.08 0.70 66.65 0.42 22.28 0.60 23.37 0.68 68.67 0.36
Classification-only CPC-augmented
Anomalous fungus MSP ODIN Accuracy MSP ODIN Accuracy
stinkhorn 46.05 1.98 51.30 1.59 89.81 0.78 50.69 3.13 58.81 5.49 91.27 0.08
bolete 46.73 2.58 50.28 6.04 88.41 0.42 49.19 3.72 51.67 2.73 90.87 0.43
hen-of-the-woods 43.58 2.47 47.98 1.84 88.10 0.45 38.97 2.87 42.59 2.33 90.63 0.58
earthstar 35.63 0.71 36.72 1.59 84.75 0.88 39.83 3.16 40.01 3.92 85.42 0.75
gyromitra 39.90 1.59 42.04 2.19 86.35 0.45 45.44 2.50 49.57 0.86 87.99 0.85
coral fungus 23.25 2.15 21.93 2.50 84.89 1.16 28.35 0.43 27.67 3.76 87.29 0.62
Average 39.19 1.26 41.71 1.94 87.05 0.06 42.08 0.57 45.05 1.11 88.91 0.46

Appendix D Trivial baseline for OOD detection on existing benchmarks

To demonstrate that the current benchmarks are trivial with very low-level information, we tested OOD detection with CIFAR-10 as the in-distribution by simply looking at likelihoods under a mixture of 3 Gaussians, trained channel-wise at a pixel-level. We find that this simple baseline compares very well with approaches in recent papers at all but one of the benchmark OOD tasks in [30] for CIFAR-10, as we show below:

OOD dataset Average precision
TinyImagenet (crop) 96.84
TinyImagenet (resize) 99.03
LSUN 58.06
LSUN (resize) 99.77
iSUN 99.21

We see that this method does not do well on LSUN. When we inspect LSUN, we find that the images are cropped patches from scene-images, and a majority of them are of uniform colour and texture, with little variation and structure in them. While this dataset is most obviously different from the in-distribution examples from CIFAR-10, we believe that the particular appearance of the images results in the phenomenon reported in [31], where one distribution that “sits inside” the other because of a similar mean but lower variance ends up being more likely under the wider distribution. In fact, thresholding on simply the “energy” of the edge-detection map gives us an average precision of around 87.5% for LSUN, thus indicating that the extremely trivial feature of a lower edge-count is already a strong indicator for telling apart such an obvious difference.

We found that this simple baseline of pixel-level channel-mixture of Gaussians underperforms severely on the hold-out-class experiments on CIFAR-10, achieving an average precision of a mere 11.17% across the 10 experiments.
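For completeness, a sketch of how such a pixel-level, channel-wise mixture-of-Gaussians baseline might be implemented is shown below. Details such as averaging (rather than summing) per-pixel log-likelihoods, and the exact fitting procedure, are assumptions of this sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_channel_gmms(train_images, n_components=3):
    """Fit one mixture of Gaussians per colour channel over raw pixel values.

    `train_images`: uint8 array of shape (N, H, W, 3), e.g. CIFAR-10 training data.
    """
    return [
        GaussianMixture(n_components=n_components).fit(
            train_images[..., c].reshape(-1, 1).astype(np.float64))
        for c in range(3)
    ]

def image_log_likelihood(gmms, images):
    """Mean per-pixel log-likelihood of each image; lower values = more OOD."""
    scores = np.zeros(len(images))
    for c, gmm in enumerate(gmms):
        pixels = images[..., c].reshape(-1, 1).astype(np.float64)
        # score_samples returns per-pixel log-density; average over each image.
        scores += gmm.score_samples(pixels).reshape(len(images), -1).mean(axis=1)
    return scores / 3.0
```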
