# Isotropic Maximization Loss and Entropic Score: Neural Networks Out-of-Distribution Detection

## Abstract

Current out-of-distribution detection (ODD) approaches require cumbersome procedures that add undesired side-effects to the solution. In this paper, we argue that the uncertainty in neural networks is mainly due to SoftMax loss anisotropy. Consequently, we propose an isotropic loss (IsoMax) and a decision score (Entropic Score) to significantly improve ODD performance while keeping the overall solution fast, accurate, scalable, unexposed, turnkey, and native. Our experiments showed that uncertainty is greatly reduced simply by replacing the SoftMax loss, without relying on techniques such as adversarial training/validation, special-purpose data augmentation, outlier exposure, ensemble methods, Bayesian mechanisms, generative approaches, metric learning, or additional classifiers/regressors. The results also showed that our straightforward proposal outperforms ODIN and ACET and is competitive with the Mahalanobis approach, while avoiding their undesired requirements and weaknesses. Since IsoMax loss works as a direct and transparent drop-in replacement for the SoftMax loss, these techniques may be combined with our loss to increase the overall performance even further if their associated drawbacks are not a concern in a particular use case.

## 1 Introduction

Table 1: Techniques and associated shortcomings.

Technique | SoftMax | ODIN | Mahalanobis | IsoMax (proposed) |
---|---|---|---|---|
Input Preprocessing: Inference-time Backpropagation (3X Slower Inference, 3X Higher Energy Consumption) | Not Required | Required | Required | Not Required |
Feature Ensemble: Low-Level Features Dependency (Reduced Scalability/Applicability) | Not Required | Not Required | Required | Not Required |
Availability of Out-of-Distribution or Adversarially Generated Samples (Overfitting/Complexity) | Not Required | Required | Required | Not Required |
Post-processing Phase: Validation on Out-of-Distribution/Adversary Samples and In-Distribution Data Availability | Not Required | Required | Required | Not Required |
Additional Ad-hoc Models: Classification/Regression Models Training/Validation (Increased Computational Resources) | Not Required | Not Required | Required | Not Required |

Neural networks have been used as classifiers in a wide range of applications. Their design usually assumes that an instance from the training distribution is presented to the model at inference time. If this holds, the neural network tends to perform satisfactorily.

However, in real-world applications, this assumption may not hold. Neural networks are known to make overconfident predictions even for objects they were not trained to recognize (Guo et al., 2017). In such situations, it is better to have a system that acknowledges that it is unable to decide.

To mitigate these drawbacks, Hendrycks & Gimpel (2017) established baseline datasets and metrics for what is currently called out-of-distribution detection (ODD). This task consists in evaluating whether a sample belongs to the in-distribution on which the network was trained. They also proposed an ODD approach by simply using the maximum predicted probability as the score to detect whether an example belongs to the in-distribution. This solution establishes the baseline performance for this task.
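The baseline score described above can be sketched in a few lines of NumPy. This is only an illustration: the logits and the threshold value are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def max_probability_score(logits):
    # Baseline ODD score: the maximum predicted probability.
    # A higher score means "more likely in-distribution".
    return softmax(logits).max(axis=-1)

# Hypothetical network outputs for two inputs.
logits = np.array([
    [8.0, 0.5, 0.1],   # peaked output
    [1.0, 0.9, 1.1],   # nearly flat output
])
scores = max_probability_score(logits)

# Hypothetical threshold; in practice it is chosen for a target operating point.
threshold = 0.5
is_in_distribution = scores >= threshold
```

The peaked output receives a score close to one and is accepted, while the nearly flat output receives a score close to 1/3 and is rejected.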

ODIN was proposed in Liang et al. (2018) by combining SoftMax temperature calibration and input preprocessing techniques. Despite significantly outperforming the baseline, ODIN considerably increases the inference time by requiring a backpropagation operation and a second inference to perform the final prediction on a single sample. Considering that backpropagation is typically slower than inference, this input preprocessing makes ODIN inference at least three times slower than the baseline approach. Moreover, the at least three times higher inference power consumption may make this approach prohibitive for embedded and resource-constrained devices.
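The factor of three follows from a simple accounting of the passes involved. In the sketch below, $T_{\text{fwd}}$ and $T_{\text{bwd}}$ are our own shorthand for the time of one forward pass and one backward pass, and the only assumption is the one stated in the text, namely $T_{\text{bwd}} \ge T_{\text{fwd}}$:

$$T_{\text{ODIN}} \approx \underbrace{T_{\text{fwd}}}_{\text{first inference}} + \underbrace{T_{\text{bwd}}}_{\text{input gradient}} + \underbrace{T_{\text{fwd}}}_{\text{second inference}} \;\ge\; 3\,T_{\text{fwd}}$$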

Furthermore, to validate its hyperparameters, the original ODIN proposal required access to out-of-distribution samples, which may be unrealistic. Even if some out-of-distribution samples are available during design, using those examples for validation may overfit the solution to that particular type of out-distribution. At inference, if the system is presented with samples from other out-distributions, the previously estimated ODD performance may degrade significantly. Therefore, validating parameters using out-of-distribution samples may be difficult in practice and is likely to generate unrealistic ODD performance expectations.

The seminal Mahalanobis distance-based approach introduced by Lee et al. (2018), which we call the Mahalanobis method, overcomes the need for access to out-of-distribution samples by validating the required hyperparameters on adversarial examples. This is more practical and produces more realistic performance estimates. Hence, we only consider validation on adversarial samples for ODIN and Mahalanobis. Our approach requires no validation.

However, validation using adversarial examples has the disadvantage of adding a cumbersome procedure to the process. Even worse, generating adversarial samples itself requires defining parameters such as the maximum adversarial perturbation. For typical research datasets, we may know suitable values, but for real-world (new/unknown) data, they may be hard to find. Moreover, the Mahalanobis approach still requires input preprocessing, which brings drawbacks such as at least 3X slower inference and 3X higher power consumption, making this approach less sustainable from an environmental point of view (Schwartz et al., 2019).

The feature ensemble introduced in Mahalanobis also presents limitations. Since it requires training/inference of ad-hoc classification/regression models on features produced in many neural network layers, this approach may not scale to applications using large images, as it would require using those shallow models in spaces of thousands of dimensions (see Table 1 for techniques and associated shortcomings).

Contributions: In this paper, we develop a new ODD approach that avoids all previously mentioned drawbacks, requirements, and side-effects. For this work, we follow the “SoftMax loss” expression as defined in Liu et al. (2016) and illustrated in Fig. 1(a). If not otherwise mentioned, the expression “SoftMax” means “SoftMax loss”. Our proposed loss is called Isotropic Maximization (IsoMax) loss (Fig. 1(b)). If not otherwise mentioned, the expression “IsoMax” means “IsoMax loss”. IsoMax was designed to work as a drop-in replacement for SoftMax. From a practical point of view, this substitution could be performed by simply changing one line of code. Therefore, no modifications to the model, data, or training procedure are required to swap SoftMax for IsoMax. Moreover, loss enhancement techniques such as outlier exposure (Hendrycks et al., 2019; Papadopoulos et al., 2019) can be readily adapted to work not only with SoftMax but also with IsoMax.

We also show that after training the neural network using IsoMax, high-performance ODD can be obtained by merely calculating the entropy of the network output probabilities, which we define as the Entropic Score (ES). Indeed, our combined approach (IsoMax for training and Entropic Score for ODD during inference) is fast (no inference input preprocessing; no adversarial training), accurate (no classification accuracy degradation), scalable (no feature ensemble; no adversarial training), and unexposed (no auxiliary dataset of outliers required for training). Furthermore, it is also turnkey (no post-processing for validation or inference; no access to out-of-distribution or adversarial samples is necessary) and native (no ad-hoc classification/regression models need to be trained or used for prediction or ODD) while providing clear state-of-the-art performance when compared with alternative methods under the same constraints. Moreover, the performance of our proposal is still competitive with state-of-the-art approaches operating under much more favorable and less restrictive conditions.

In other words, our experiments showed that IsoMax presented no classification performance degradation compared to SoftMax. They also showed that IsoMax+ES outperforms the baseline, ODIN, and ACET (Hein et al., 2018), and is competitive with Mahalanobis. In some cases, IsoMax+ES even outperforms Mahalanobis while remaining a fast, scalable, turnkey, and native solution.

Although they are not used in our approach, in order to avoid the associated drawbacks, all the previously mentioned techniques are compatible with IsoMax. Therefore, if the mentioned mechanisms do not represent a real-world concern for the application under consideration, they can be applied in conjunction with IsoMax (in the same way they were used with SoftMax) to possibly achieve even higher ODD performance. For example, it is possible to construct an IsoMax-based Mahalanobis solution. Therefore, although we compare IsoMax to ODIN, ACET, and Mahalanobis in this work to give a perspective on the ODD performance of our solution, in reality, the only direct competitor of IsoMax is SoftMax.

Finally, our solution does not rely on adversarial training, unlike the ACET approach proposed in Hein et al. (2018). Therefore, training a neural network with IsoMax is as fast as training with SoftMax. Also, as a direct consequence of not using adversarial training, our proposal scales to large-scale datasets of large images just like SoftMax-trained networks, which is a challenge for solutions that rely on adversarial training.

## 2 IsoMax Loss and Entropic Score

### 2.1 Isotropic Maximization Loss

Table 2: Out-of-distribution detection with fast, accurate, scalable, unexposed, turnkey, and native approaches. Each cell lists SoftMax+MPS / SoftMax+ES / IsoMax+ES (higher is better).

Model | In-Distribution (training) | Out-Distribution (unseen) | TNR (%) | AUROC (%) | DTACC (%) |
---|---|---|---|---|---|
DenseNet | CIFAR10 | SVHN | 32.2 / 33.2 / 77.0 | 86.6 / 86.9 / 96.6 | 79.9 / 79.9 / 91.6 |
DenseNet | CIFAR10 | TinyImageNet | 55.8 / 59.8 / 88.0 | 93.5 / 94.2 / 97.8 | 87.6 / 87.8 / 93.2 |
DenseNet | CIFAR10 | LSUN | 64.9 / 69.5 / 94.5 | 95.2 / 95.9 / 98.8 | 89.9 / 90.0 / 94.9 |
DenseNet | CIFAR100 | SVHN | 20.6 / 24.9 / 29.3 | 80.1 / 81.9 / 88.8 | 73.9 / 74.3 / 83.4 |
DenseNet | CIFAR100 | TinyImageNet | 19.4 / 23.7 / 53.6 | 77.0 / 78.8 / 91.1 | 70.6 / 71.1 / 83.2 |
DenseNet | CIFAR100 | LSUN | 18.8 / 24.4 / 61.5 | 75.9 / 77.9 / 93.1 | 69.5 / 70.2 / 86.1 |
DenseNet | SVHN | CIFAR10 | 81.5 / 83.7 / 90.7 | 96.5 / 96.9 / 97.8 | 91.9 / 92.1 / 93.5 |
DenseNet | SVHN | TinyImageNet | 88.2 / 90.0 / 95.3 | 97.7 / 98.1 / 98.7 | 93.5 / 93.7 / 95.3 |
DenseNet | SVHN | LSUN | 86.4 / 88.4 / 93.1 | 97.3 / 97.8 / 98.4 | 92.8 / 93.0 / 94.3 |
ResNet | CIFAR10 | SVHN | 43.1 / 44.5 / 56.8 | 91.7 / 92.0 / 93.8 | 86.5 / 86.5 / 87.3 |
ResNet | CIFAR10 | TinyImageNet | 46.3 / 48.0 / 74.8 | 89.8 / 90.0 / 95.2 | 84.0 / 84.1 / 88.7 |
ResNet | CIFAR10 | LSUN | 51.2 / 53.3 / 85.6 | 92.2 / 92.6 / 97.3 | 86.5 / 86.6 / 92.2 |
ResNet | CIFAR100 | SVHN | 15.9 / 18.0 / 41.9 | 71.3 / 72.7 / 90.5 | 66.1 / 66.3 / 84.0 |
ResNet | CIFAR100 | TinyImageNet | 18.5 / 22.4 / 37.5 | 74.7 / 76.3 / 89.2 | 68.8 / 69.1 / 82.8 |
ResNet | CIFAR100 | LSUN | 18.4 / 22.4 / 36.9 | 74.7 / 76.5 / 90.1 | 69.1 / 69.4 / 84.3 |
ResNet | SVHN | CIFAR10 | 67.3 / 67.7 / 88.1 | 89.8 / 89.7 / 97.4 | 87.0 / 86.9 / 92.7 |
ResNet | SVHN | TinyImageNet | 66.9 / 67.3 / 86.7 | 89.0 / 89.0 / 97.1 | 86.7 / 86.6 / 92.2 |
ResNet | SVHN | LSUN | 62.2 / 62.5 / 85.4 | 86.0 / 85.8 / 96.6 | 84.2 / 84.1 / 91.5 |

Table 3: ODD approaches present different requirements; ODIN and Mahalanobis produce undesired side-effects. Each cell lists ODIN / IsoMax+ES / Mahalanobis, with the difference between Mahalanobis and IsoMax+ES in parentheses (higher is better).

Model | In-Distribution (training) | Out-Distribution (unseen) | AUROC (%) | DTACC (%) |
---|---|---|---|---|
DenseNet | CIFAR10 | SVHN | 92.8 / 96.6 / 97.6 (+1.0) | 86.5 / 91.6 / 92.6 (+1.0) |
DenseNet | CIFAR10 | TinyImageNet | 97.2 / 97.8 / 98.8 (+1.0) | 92.1 / 93.2 / 95.0 (+1.8) |
DenseNet | CIFAR10 | LSUN | 98.5 / 98.8 / 99.2 (+0.4) | 94.3 / 94.9 / 96.2 (+1.3) |
DenseNet | CIFAR100 | SVHN | 88.2 / 88.8 / 91.8 (+3.0) | 80.7 / 83.4 / 84.6 (+1.2) |
DenseNet | CIFAR100 | TinyImageNet | 85.3 / 91.1 / 97.0 (+5.9) | 77.2 / 83.2 / 91.8 (+8.6) |
DenseNet | CIFAR100 | LSUN | 85.7 / 93.1 / 97.9 (+4.8) | 77.3 / 86.1 / 93.8 (+7.6) |
DenseNet | SVHN | CIFAR10 | 91.9 / 97.8 / 98.8 (+1.0) | 86.6 / 93.5 / 96.3 (+2.8) |
DenseNet | SVHN | TinyImageNet | 94.8 / 98.7 / 99.8 (+1.1) | 90.2 / 95.3 / 98.9 (+3.6) |
DenseNet | SVHN | LSUN | 94.1 / 98.4 / 99.9 (+1.5) | 89.1 / 94.3 / 99.2 (+4.9) |
ResNet | CIFAR10 | SVHN | 86.5 / 93.8 / 95.5 (+1.7) | 77.8 / 87.3 / 89.1 (+1.8) |
ResNet | CIFAR10 | TinyImageNet | 93.9 / 95.2 / 99.0 (+3.8) | 86.0 / 88.7 / 95.4 (+6.7) |
ResNet | CIFAR10 | LSUN | 93.7 / 97.3 / 99.5 (+2.2) | 85.8 / 92.2 / 97.2 (+5.0) |
ResNet | CIFAR100 | SVHN | 72.0 / 90.5 / 84.4 (-6.1) | 67.7 / 84.0 / 76.5 (-7.5) |
ResNet | CIFAR100 | TinyImageNet | 83.6 / 89.2 / 87.9 (-1.3) | 75.9 / 82.8 / 84.6 (+1.8) |
ResNet | CIFAR100 | LSUN | 81.9 / 90.1 / 82.3 (-7.8) | 74.6 / 84.3 / 79.7 (-4.6) |
ResNet | SVHN | CIFAR10 | 92.1 / 97.4 / 97.6 (+0.2) | 89.4 / 92.7 / 94.6 (+1.9) |
ResNet | SVHN | TinyImageNet | 92.9 / 97.1 / 99.3 (+2.2) | 90.1 / 92.2 / 98.8 (+6.6) |
ResNet | SVHN | LSUN | 90.7 / 96.6 / 99.9 (+3.3) | 88.2 / 91.5 / 99.5 (+6.0) |

Let $\mathbf{x}$ represent the input applied to a neural network and $\mathbf{f}_{\boldsymbol{\theta}}(\mathbf{x})$ represent the high-level feature vector produced by it (for simplicity, from now on, we omit the subscript $\boldsymbol{\theta}$, which represents the neural network parameters). For this work, the underlying structure of the neural network does not matter. Considering $k$ to be the correct class for a particular training example $\mathbf{x}$, we can write the SoftMax loss associated with this specific training sample as:

$$\mathcal{L}_{S}(\hat{y}^{(k)}\,|\,\mathbf{x}) = -\log\left(\frac{\exp(\mathbf{w}_k^\top \mathbf{f}(\mathbf{x}) + b_k)}{\sum_{j}\exp(\mathbf{w}_j^\top \mathbf{f}(\mathbf{x}) + b_j)}\right) \tag{1}$$

In the above equation, $\mathbf{w}_j$ and $b_j$ represent, respectively, the weights and biases associated with the class $j$. From a geometric perspective, the equation $\mathbf{w}_j^\top \mathbf{f}(\mathbf{x}) + b_j = 0$ defines a hyperplane in the high-level feature space. It divides the feature space into two subspaces that we call the positive and negative subspaces. The deeper inside the positive subspace the features of an example are located, the more likely the example is considered to belong to the class in question. Therefore, training neural networks using SoftMax loss does not encourage the representations of the examples associated with a particular class to cluster in a limited region of the feature space. The immediate consequence is the propensity of SoftMax-trained neural networks to make confident predictions on examples in regions very far away from the training examples, which explains the low out-of-distribution detection performance (Hein et al., 2018).
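The geometric argument above can be checked numerically. The NumPy sketch below computes a SoftMax loss for an affine last layer and shows that moving a feature vector farther along a fixed direction only increases the confidence. The weights, biases, and feature direction are hypothetical values chosen for illustration, not taken from any trained network.

```python
import numpy as np

def softmax_loss(feature, weights, biases, k):
    # Affine last layer: one logit per class.
    logits = weights @ feature + biases
    # Numerically stable log-softmax.
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    # Negative log-probability of the correct class k, plus all probabilities.
    return -log_probs[k], np.exp(log_probs)

# Hypothetical last-layer parameters for three classes in a 2-D feature space.
weights = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
biases = np.zeros(3)

# A fixed direction in feature space; larger scales lie farther from the origin.
direction = np.array([1.0, 0.5])

confidences = []
for scale in (1.0, 10.0, 100.0):
    _, probs = softmax_loss(scale * direction, weights, biases, k=0)
    confidences.append(probs.max())
```

With these values, the maximum probability grows from roughly 0.59 at scale 1 to essentially 1 at scale 100, even though nothing suggests that such distant points resemble training data.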

Additionally, the main characteristic of the Mahalanobis distance used in Lee et al. (2018) is that it is locally isotropic around the prototypes produced. The fact that it achieved high ODD performance indicates that deploying locally isotropic spaces around class prototypes indeed improves ODD performance. However, SoftMax-trained neural networks are based on affine transformations in the last layer, which are essentially inner products. Consequently, the last-layer representations of SoftMax-trained networks tend to align with the direction of the weight vectors, producing a preferential direction in space and, therefore, anisotropy.

A possible way to avoid this anisotropy is to design a loss that depends only on the distances of the high-level representations to class prototypes. Such a loss would prevent the networks from learning preferred directions in the feature space, enforcing local isotropy during training in a natural way while avoiding the need for Mahalanobis distance-based metric learning post-processing procedures.

Therefore, to construct a loss that confines the high-level representations of the examples associated with a particular class to a locally isotropic region of the feature space, we have to replace the term $\mathbf{w}_j^\top \mathbf{f}(\mathbf{x}) + b_j$ by an expression of the form $-g_{\boldsymbol{\gamma}}(d(\mathbf{f}(\mathbf{x}), \mathbf{p}_j))$. The expression $g_{\boldsymbol{\gamma}}(\cdot)$ represents a differentiable scalar function with possible learnable parameters $\boldsymbol{\gamma}$ (for simplicity, from now on, we omit the subscript $\boldsymbol{\gamma}$). The expression $d(\mathbf{f}(\mathbf{x}), \mathbf{p}_j)$ represents a valid distance between the high-level representation $\mathbf{f}(\mathbf{x})$ and a learnable prototype $\mathbf{p}_j$ associated with the class $j$. Replacing $\mathbf{w}_j^\top \mathbf{f}(\mathbf{x}) + b_j$ by $-g(d(\mathbf{f}(\mathbf{x}), \mathbf{p}_j))$ in Eq. (1), we have:

$$\mathcal{L}_{I}(\hat{y}^{(k)}\,|\,\mathbf{x}) = -\log\left(\frac{\exp(-g(d(\mathbf{f}(\mathbf{x}), \mathbf{p}_k)))}{\sum_{j}\exp(-g(d(\mathbf{f}(\mathbf{x}), \mathbf{p}_j)))}\right) \tag{2}$$

The above equation defines a generic IsoMax loss. If we make $g$ monotonically increasing, minimizing the IsoMax loss naturally reduces the distance between the high-level representation of an input and the learnable prototype of its class. Simultaneously, minimizing IsoMax also increases the distances among the prototypes that are being learned. Hence, during inference, inputs that generate feature space representations far away from the corresponding class prototype can be readily flagged as out-of-distribution examples.
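A distance-based loss of this generic form can be sketched as follows. This is an illustrative NumPy version, not the authors' implementation: it assumes the Euclidean distance and, by default, the identity as the increasing scalar function, and the prototypes and feature vectors are hypothetical.

```python
import numpy as np

def isomax_logits(feature, prototypes, g=lambda d: d):
    # One distance per class prototype (Euclidean distance is an assumption here).
    distances = np.linalg.norm(prototypes - feature, axis=1)
    # Each logit is the negative of an increasing function of the distance.
    return -g(distances)

def isomax_loss(feature, prototypes, k, g=lambda d: d):
    logits = isomax_logits(feature, prototypes, g)
    # Numerically stable log-softmax over the distance-based logits.
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[k]

# Hypothetical prototypes for three classes in a 2-D feature space.
prototypes = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])

near_feature = np.array([0.1, 0.1])  # close to the class-0 prototype
far_feature = np.array([2.0, 2.0])   # equally far from all three prototypes

loss_near = isomax_loss(near_feature, prototypes, k=0)
loss_far = isomax_loss(far_feature, prototypes, k=0)
```

The loss is small when the feature sits next to its class prototype and equals log(3) when the feature is equidistant from all three prototypes, so minimizing it pulls representations toward the prototype of their class.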

IsoMax-trained networks behave in feature space similarly to distance-based generative classifiers such as Mahalanobis while avoiding all the previously mentioned drawbacks. We impose a predefined metric on the features during training rather than trying to find an appropriate metric a posteriori. The representations are created during network training in such a way that the predefined metric is optimal for the learned high-level features, so no metric learning post-processing is required. Using IsoMax, the representations are learned from the ground up to perform properly with the predefined distance in the resulting feature space.

There are valid reasons to learn an optimal metric when the problem involves dealing directly with the data, or when the features are hand-engineered (machine learning). However, if the situation also involves feature learning (deep learning), using a loss with a predefined metric imposes the creation of a feature space in which the considered distances make sense from the start, avoiding a two-stage process of independent procedures (metric learning after feature learning). Rather than learning a metric from a preexisting feature space, we are learning a feature space from a preexisting metric. We propose better feature learning rather than feature learning followed by metric learning. Hence, no additional classification/regression models are needed to transform the originally anisotropic feature space learned by the network into a new isotropic one.

Input probability distributions are not given a priori but rather constructed during network training by appropriately imposing the geometric restrictions of the loss on the input representations. No cumbersome covariance matrix optimization is required during network training, as would be the case if we tried to make a truly Mahalanobis distance-based solution trainable end-to-end. Indeed, directly training neural networks to use the Mahalanobis distance in a seamless way is difficult because of the covariance matrix.

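During inference, the isotropic probability that a particular input $\mathbf{x}$ belongs to a given class $i$ with an associated prototype $\mathbf{p}_i$ is defined by the following equation, where $d$ is the distance introduced above and $h$ is a scalar function that is most likely also monotonically increasing:

$$p(y^{(i)}\,|\,\mathbf{x}) = \frac{\exp(-h(d(\mathbf{f}(\mathbf{x}), \mathbf{p}_i)))}{\sum_{j}\exp(-h(d(\mathbf{f}(\mathbf{x}), \mathbf{p}_j)))} \tag{3}$$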

In a first attempt to tackle ODD problems using distance-based loss functions, we tried the loss used in Snell et al. (2017) to perform few-shot learning. To build a drop-in replacement for the SoftMax loss, we adapted it to be trained end-to-end (features and prototypes simultaneously) using only stochastic gradient descent (SGD) and backpropagation, without offline procedures. Indeed, in Prototypical Networks, the prototypes are calculated offline as the mean of examples instead of being learned directly by SGD/backpropagation (see Snell et al. (2017), Algorithm 1).

However, the experiments showed low ODD performance, probably because using the squared Euclidean distance in such a case is, after all, equivalent to a linear transformation (Snell et al., 2017; Mensink et al., 2013). Admittedly, Snell et al. (2017) themselves make it clear that, in Prototypical Networks, “all of the required non-linearity can be learned within the embedding function” rather than in the last layer. Moreover, unlike the non-squared Euclidean distance, the squared Euclidean distance does not obey the triangle inequality (which, for the Euclidean norm, follows from the Cauchy–Schwarz inequality). Indeed, Prototypical Networks use Bregman divergences (Snell et al., 2017; Banerjee et al., 2005) rather than true geometric metrics. IsoMax has to use a proper metric for the previous geometric considerations to make sense.
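Both observations about the squared Euclidean distance can be verified with a short NumPy check. The features and prototypes below are random, hypothetical values. The first part relies on the identity $-\lVert\mathbf{f}-\mathbf{p}\rVert^2 = 2\,\mathbf{p}^\top\mathbf{f} - \lVert\mathbf{p}\rVert^2 - \lVert\mathbf{f}\rVert^2$, whose last term is shared by all classes and cancels inside the SoftMax.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 4))    # hypothetical high-level features
prototypes = rng.normal(size=(3, 4))  # hypothetical class prototypes

# Logits based on the negative squared Euclidean distance.
squared = ((features[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)

# Equivalent affine logits: weights 2 * p_j and biases -||p_j||^2.
linear_logits = 2.0 * features @ prototypes.T - (prototypes ** 2).sum(axis=1)

same_probabilities = np.allclose(softmax(-squared), softmax(linear_logits))

# Triangle inequality on three collinear points 0, 1, 2.
def euclidean(u, v):
    return np.linalg.norm(u - v)

a, b, c = np.array([0.0]), np.array([1.0]), np.array([2.0])
squared_violates = euclidean(a, c) ** 2 > euclidean(a, b) ** 2 + euclidean(b, c) ** 2
euclidean_holds = euclidean(a, c) <= euclidean(a, b) + euclidean(b, c)
```

The squared distance from 0 to 2 is 4, which exceeds the sum 1 + 1 of the two intermediate squared distances, so the triangle inequality fails for the squared version while it holds for the non-squared one.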

Therefore, we decided to use the non-squared Euclidean distance as $d$ for constructing the IsoMax loss. Consequently, IsoMax-trained neural networks are capable of performing a truly non-linear transformation in the last layer, and the previously presented geometric considerations are consistent, as our distance indeed obeys the triangle inequality required of a metric. Additionally, from a practical point of view, we noticed that learning the prototypes simultaneously with the features using a unified loss minimization procedure is much easier with the non-squared Euclidean distance than with the squared one, as numerical problems are much more likely to occur when computing derivatives with values of the order of the squared distance than of the distance itself.

Besides, our experiments showed that defining is of critical importance to achieve high ODD performance (see Fig. 2). We defined as a global hyperparameter. Once validated in a simple metric, model, in-distribution, and out-distribution; the global value of generalizes well to all other metrics, models, in-distributions, and out-distributions (see Fig. 2). No further (case-by-case) validations are required. It is remarkably important to emphasize the effect of since the solution does not present high ODD performance if is absent or, in other words, .

Experimentally, we observed that using Xavier (Glorot & Bengio, 2010) or Kaiming (He et al., 2016) initialization for the prototypes made ODD performance oscillate: sometimes it improved, sometimes it decreased. Hence, we decided to always initialize all prototypes to zero. Since prototypes are network weights, weight decay was applied to them.
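To make the role of zero-initialized, learnable prototypes concrete, the sketch below performs one plain gradient step on the prototypes for a single example. It is a simplified illustration under stated assumptions: the distance is the non-squared Euclidean distance, the scalar function applied to it is the identity, weight decay is omitted, and the feature vector and learning rate are hypothetical.

```python
import numpy as np

def isomax_loss_and_prototype_grad(feature, prototypes, k):
    diffs = prototypes - feature                 # shape (classes, dim)
    distances = np.linalg.norm(diffs, axis=1)    # assumed nonzero for every class
    logits = -distances
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    loss = -np.log(probs[k])
    onehot = np.zeros(len(prototypes))
    onehot[k] = 1.0
    # d loss / d p_j = (1[j == k] - probs_j) * (p_j - f) / ||p_j - f||
    grad = (onehot - probs)[:, None] * diffs / distances[:, None]
    return loss, grad

feature = np.array([3.0, 4.0])   # hypothetical high-level feature of a class-0 example
prototypes = np.zeros((3, 2))    # all prototypes initialized to zero
k = 0
learning_rate = 0.1              # hypothetical value

loss_before, grad = isomax_loss_and_prototype_grad(feature, prototypes, k)
updated = prototypes - learning_rate * grad
loss_after, _ = isomax_loss_and_prototype_grad(feature, updated, k)

dist_before = np.linalg.norm(prototypes - feature, axis=1)
dist_after = np.linalg.norm(updated - feature, axis=1)
```

After the step, the prototype of the correct class has moved toward the feature while the other two have moved away from it, which breaks the initial symmetry of the zero initialization.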

(4) |

(5) |

We adopted and used as the default value. However, adversarial samples may be used to validate to improve our proposal ODD performance. We are using no regularization term. As mentioned before, with IsoMax, the neural network is trained end-to-end, including the prototypes, using loss minimization by regular SGD/backpropagation. Summarizing, we can rewrite the equations (2) and (3) as equations (4) and (5), respectively.

### 2.2 Entropic Score

Out-of-distribution detection approaches typically define a score to be used during inference to evaluate whether an example should be considered out-of-distribution. In generative classifiers, this score is usually (a function of) the minimum distance of the example representation to any class prototype. If the nearest prototype is farther away than a threshold, the example is considered out-of-distribution.

Our proposal follows an entirely different approach. In a seminal work, Shannon (1948) demonstrated that entropy is the optimum measure of the uncertainty of a source of symbols. More broadly, we currently understand entropy as a measure of the uncertainty we have about a random variable. Hence, if the output probabilities of neural networks are supposed to make any sense, the most natural and theoretically sound measure of how uncertain they are in classifying a particular example is simply the entropy of their output probabilities. Additionally, the uncertainty in classifying a specific sample should be an optimum metric to evaluate whether a particular example is out-of-distribution if these concepts are supposed to be fundamentally consistent. In other words, if the entropy of the probabilities produced by a neural network is not a high-quality measure of how much we believe a sample is out-of-distribution, it implies that the neural network is producing meaningless probabilities.

$$\mathrm{ES}(\mathbf{x}) = \sum_{i} p(y^{(i)}\,|\,\mathbf{x}) \log p(y^{(i)}\,|\,\mathbf{x}) \tag{6}$$

Therefore, the second major component of our proposal for the seamless integration of ODD into the neural networks framework is what we call the Entropic Score (ES), which consists in using the negative of the entropy of the output probabilities as the ODD score. By using the negative entropy as a score, rather than relying on a single network output (for example, distance to only one prototype), this approach takes into consideration the information provided by all available outputs to evaluate if a particular sample is out-of-distribution. Hence, ES is expressed by equation (6), where $p(y^{(i)}\,|\,\mathbf{x})$ is the probability that the network assigns to class $i$ for the input $\mathbf{x}$ and the sum runs over all classes.
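Computing the negative entropy of the output probabilities takes only a few lines. The NumPy sketch below is illustrative: the probability vectors are hypothetical, and the small clipping constant is our own addition to avoid evaluating log(0).

```python
import numpy as np

def entropic_score(probabilities, eps=1e-12):
    # Negative Shannon entropy of each row of class probabilities.
    # Higher (closer to zero) means more certain, hence more likely in-distribution.
    p = np.clip(probabilities, eps, 1.0)
    return (p * np.log(p)).sum(axis=-1)

# Hypothetical output probabilities for two inputs over four classes.
probabilities = np.array([
    [0.97, 0.01, 0.01, 0.01],  # confident prediction
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain prediction
])
scores = entropic_score(probabilities)
```

The uniform row attains the lowest possible score for four classes, -log(4), whereas the confident row scores much closer to zero. Unlike the maximum probability, the score changes whenever any of the output probabilities changes.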

From a practical perspective, this means that it is possible to avoid training an additional ad-hoc regression model to detect out-of-distribution samples in a post-processing phase, as required in Mahalanobis, which uses feature ensembles. Even more importantly, since no regression model needs to be trained, there is no need for unrealistic access to out-of-distribution samples or for generating adversarial examples. Since ES is a predefined, non-trainable score, it is available as soon as the neural network training finishes.

## 3 Experimental Results

The deterministic code to reproduce the results is available online.

Considering that outlier exposure may be successfully integrated with and benefit both SoftMax and IsoMax losses, all experiments were performed without relying on outlier exposure data. Hence, outlier exposure techniques may be further used to improve the results we present in this paper. We speculate that IsoMax would benefit more from outlier exposure than SoftMax because of its isotropic nature. Similar arguments hold for approaches based on background samples (Dhamija et al., 2018).

The classification results, as well as the corresponding TNR, are presented in Fig. 2. The hyperparameter value that presented the best ODD performance also generalizes well to unseen out-distributions, as required for a global hyperparameter candidate. Consequently, this value was used for all other experiments (combinations of models, in-distributions, and out-distributions in Tables 2 and 3). It can also be observed that the classification accuracy of networks trained with IsoMax is insensitive to changes in the hyperparameter. Hence, once this value was confirmed as our global hyperparameter, the experiments showed that IsoMax-trained networks present classification accuracy extremely similar to that of SoftMax-trained ones for all other datasets and models (see Supplementary Material B).

In Table 2, the compared approaches require neither adversarial training, input preprocessing, temperature calibration, feature ensemble, out-of-distribution/adversarial validation, nor additional ad-hoc classification/regression models. SoftMax with the Maximum Probability Score (MPS) presents the worst results, and the Entropic Score produces a small positive effect when applied to SoftMax-trained networks.

However, the combination of IsoMax with the same Entropic Score significantly improves the ODD performance across all metrics for all pairs of in-distribution and out-distribution. This is robust evidence that IsoMax trained neural networks present more realistic output probabilities. The gains are usually of several percentage points.

Comparing our results with Hein et al. (2018) [Table 1, ResNet model], we observe that IsoMax+ES presented much better AUROC performance on the in/out pairs CIFAR10/ImageNet (+10%), CIFAR10/LSUN (+11%), CIFAR100/ImageNet (+15%), and CIFAR100/LSUN (+20%), besides being a much more straightforward solution that avoids adversarial training and its associated drawbacks (magenta bold in Table 2). Our proposal performed slightly worse on the in/out pairs CIFAR10/SVHN (-4%) and SVHN/LSUN (-3%). For the in/out pairs CIFAR100/SVHN, SVHN/CIFAR10, and SVHN/ImageNet, both approaches produced similar results (difference of less than 1%).

Table 3 shows the results of a set of approaches that present different requirements and side-effects. Input preprocessing (and consequently much slower inference) and validation on adversarial samples are used in both ODIN and Mahalanobis, while temperature calibration is required only in ODIN. Feature ensemble and ad-hoc classification/regression models are mandatory in the Mahalanobis solution. The IsoMax+ES approach does not rely on any of these techniques. Regardless of these considerations, which impose much more restrictive conditions on our approach, the table shows that IsoMax+ES considerably outperforms ODIN in all evaluated scenarios.

Moreover, in more than half of the cases, even operating under much more favorable circumstances, Mahalanobis surpasses IsoMax+ES by less than 2%. More surprisingly, in some scenarios, the latter even outperforms the former despite never having been presented with adversarial samples, while being native, more scalable, straightforward to implement, and at least three times faster at inference. IsoMax+ES performs particularly well in one of the CIFAR100 cases, which may suggest that the use of all outputs by ES works even better when many classes are present.

If we allowed the hyperparameter to be selected using adversarial validation, IsoMax+ES would be even more competitive against Mahalanobis. In such a case, it would be possible to surpass the Mahalanobis performance in more scenarios. However, we speculate this could also be achieved in a better way using isotropic regularization or special data augmentation techniques (Thulasidasan et al., 2019; Yun et al., 2019) to avoid the need for out-of-distribution or adversarial samples. We believe that the isotropy of IsoMax will perform particularly well in combination with those ODD-oriented data augmentation techniques. Adding outlier exposure is also a promising alternative. Indeed, we believe IsoMax isotropy will lead to increased ODD performance when exposed to outlier data during training.

In Fig. 3, SoftMax-trained networks produce very high maximum probabilities for both in-distribution and out-distribution samples, while IsoMax-trained ones produce higher maximum probabilities for in-distribution than for out-distribution samples. Once more, the experiments show that IsoMax naturally produces much more realistic probabilities than the extremely overconfident SoftMax. The maximum probability and entropy are strongly correlated in SoftMax-trained networks because they almost always have a maximum probability near one. This is not true for IsoMax-trained networks.

## 4 Conclusion

In this paper, we proposed the IsoMax loss and the Entropic Score to prove that neural networks ODD performance and uncertainty can be significantly improved in a fast, accurate, scalable, unexposed, turnkey and native way simply by replacing the SoftMax loss and using an appropriate, predefined, meaningful, and information-theoretic well-founded score without relying on ad-hoc techniques to avoid their associated drawbacks, requirements and side-effects.

However, if the mentioned limitations are not a concern for a particular application, those techniques may be combined with IsoMax to achieve even higher ODD performance. In future work, we intend to make the global hyperparameter a learnable parameter.

Supplementary Material

## Appendix A Experiment Details

### A.1 Data Distributions

In our experiments, we trained from scratch several 100-layer DenseNets (Huang et al., 2017) and 34-layer ResNets on the CIFAR10 (Krizhevsky, 2009), CIFAR100 (Krizhevsky, 2009), and SVHN (Netzer et al., 2011) datasets using SoftMax and IsoMax losses with exactly the same protocol (learning rates, learning rate schedule, weight decay values, etc.) presented in Lee et al. (2018).

To evaluate the performance of the competing approaches, we added out-of-distribution images to the test images of each of the CIFAR10, CIFAR100, and SVHN datasets. The final test sets had 50% in-distribution images and 50% out-of-distribution images. We also used resized images from the TinyImageNet (Deng et al., 2009) and LSUN datasets.

### A.2 Performance Assessment

The performance of the compared methods was evaluated using three detection accuracy metrics. First, we calculated the True Negative Rate (TNR) at 95% True Positive Rate (TPR). We also evaluated the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Detection Accuracy (DTACC). All the mentioned metrics follow the same calculation procedures detailed in Lee et al. (2018).
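The three metrics can be sketched in NumPy as follows. This is a simplified illustration rather than the exact procedure of Lee et al. (2018). It assumes that in-distribution is the positive class, that a higher score means "more in-distribution", and that the detection accuracy weights both classes equally. The score arrays are hypothetical.

```python
import numpy as np

def auroc(in_scores, out_scores):
    # Probability that a random in-distribution score exceeds a random
    # out-of-distribution score, counting ties as one half.
    diff = in_scores[:, None] - out_scores[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def tnr_at_tpr(in_scores, out_scores, tpr=0.95):
    # Pick the threshold that keeps at least `tpr` of the in-distribution samples.
    ordered = np.sort(in_scores)
    n_keep = int(np.ceil(tpr * len(ordered) - 1e-9))
    threshold = ordered[len(ordered) - n_keep]
    # Fraction of out-of-distribution samples rejected at that threshold.
    return np.mean(out_scores < threshold)

def detection_accuracy(in_scores, out_scores):
    # Best balanced accuracy over all candidate thresholds.
    thresholds = np.concatenate([in_scores, out_scores])
    tpr = (in_scores[None, :] >= thresholds[:, None]).mean(axis=1)
    tnr = (out_scores[None, :] < thresholds[:, None]).mean(axis=1)
    return (0.5 * (tpr + tnr)).max()

# Hypothetical, perfectly separated scores.
in_scores = np.linspace(0.5, 1.0, 100)
out_scores = np.linspace(0.0, 0.4, 100)

separated = (auroc(in_scores, out_scores),
             tnr_at_tpr(in_scores, out_scores),
             detection_accuracy(in_scores, out_scores))

# Identical score distributions give a chance-level AUROC.
chance_auroc = auroc(in_scores, in_scores)
```

For perfectly separated scores all three metrics equal 1, and for identical score distributions the AUROC is 0.5.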

## Appendix B Neural Network Performance Comparison

Test accuracy (%) of networks trained with the SoftMax and IsoMax losses:

Model | Data | SoftMax Loss | IsoMax Loss |
---|---|---|---|
DenseNet | SVHN | 96.6 | 96.7 |
DenseNet | CIFAR10 | 94.9 | 95.1 |
DenseNet | CIFAR100 | 75.7 | 76.1 |
ResNet | SVHN | 96.7 | 96.6 |
ResNet | CIFAR10 | 95.4 | 95.3 |
ResNet | CIFAR100 | 75.8 | 75.3 |

### References

- Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. Clustering with Bregman divergences. Journal of Machine Learning Research, 2005.
- Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition, 2009.
- Dhamija, A. R., Günther, M., and Boult, T. Reducing network agnostophobia. In Advances in Neural Information Processing Systems, 2018.
- Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, 2010.
- Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, 2017.
- He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In International Conference on Computer Vision, 2016.
- Hein, M., Andriushchenko, M., and Bitterwolf, J. Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Conference on Computer Vision and Pattern Recognition, 2018.
- Hendrycks, D. and Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2017.
- Hendrycks, D., Mazeika, M., and Dietterich, T. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations, 2019.
- Huang, G., Liu, Z., Maaten, L. v. d., and Weinberger, K. Q. Densely Connected Convolutional Networks. In Conference on Computer Vision and Pattern Recognition, 2017.
- Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, 2009.
- Lee, K., Lee, K., Lee, H., and Shin, J. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, 2018.
- Liang, S., Li, Y., and Srikant, R. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations, 2018.
- Liu, W., Wen, Y., Yu, Z., and Yang, M. Large-margin softmax loss for convolutional neural networks. In International Conference on Machine Learning, 2016.
- Mensink, T., Verbeek, J. J., Perronnin, F., and Csurka, G. Distance-based image classification: Generalizing to new classes at near-zero cost. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
- Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. 2011.
- Papadopoulos, A.-A., Rajati, M. R., Shaikh, N., and Wang, J. Outlier exposure with confidence control for out-of-distribution detection. arXiv preprint arXiv:1906.03509, 2019.
- Schwartz, R., Dodge, J., Smith, N. A., and Etzioni, O. Green artificial intelligence. ArXiv, 2019.
- Shannon, C. E. A Mathematical Theory of Communication. Bell System Technical Journal, 1948.
- Snell, J., Swersky, K., and Zemel, R. S. Prototypical networks for few-shot learning. In Neural Information Processing Systems, 2017.
- Thulasidasan, S., Chennupati, G., Bilmes, J. A., Bhattacharya, T., and Michalak, S. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In Neural Information Processing Systems, 2019.
- Yu, F., Zhang, Y., Song, S., Seff, A., and Xiao, J. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. ArXiv, abs/1506.03365, 2015.
- Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In International Conference on Computer Vision, 2019.