Denoising Autoencoders for Overgeneralization in Neural Networks

Giacomo Spigler

G. Spigler is with the Department of Psychology, The University of Sheffield, Sheffield, UK (e-mail: g.spigler sheffield).

Despite the recent developments that allowed neural networks to achieve impressive performance on a variety of applications, these models are intrinsically affected by the problem of overgeneralization, due to their partitioning of the full input space into the fixed set of target classes used during training. Thus it is possible for novel inputs belonging to categories unknown during training, or even completely unrecognizable to humans, to fool the system into classifying them as one of the known classes, even with a high degree of confidence. Solving this problem may help improve the security of such systems in critical applications, and may further lead to applications in the context of open set recognition and 1-class recognition. This paper presents a novel way to compute a confidence score using denoising autoencoders and shows that such a confidence score can correctly identify the regions of the input space close to the training distribution by approximately identifying its local maxima.

overgeneralization, fooling, autoencoder, open set recognition, open world recognition, 1-class recognition, confidence score, neural networks

I Introduction

Discriminative models in machine learning, like neural networks, have achieved impressive performance in a variety of applications. Models in this class, however, suffer from the problem of overgeneralization, whereby the whole input space is partitioned between the set of target classes specified during training, and generally lack the possibility to reject a novel sample as not belonging to any of those.

A main issue with overgeneralization is in the context of open set recognition [1] and open world recognition [2], where only a limited number of classes is encountered during training while testing is performed on a larger set that includes a potentially very large number of unknown classes that have never been observed before. An example is shown in Figure 1 where a linear classifier is trained to discriminate between drawings of digits ‘0’ and ‘6’. As digit ‘9’ is not present in the training set, it is here wrongly classified as ‘6’. In general, instances of classes that are not present in the training set will fall into one of the partitions of the input space learnt by the classifier. The problem becomes worse in real world applications where it may be extremely hard to know in advance all the possible categories that can be observed.

Further, the region of meaningful samples in the input space is usually small compared to the whole space. This can be easily grasped by randomly sampling a large number of points from the input space, for example images at a certain resolution, and observing that the chance of producing a recognizable sample is negligible. Yet, discriminative models may assign a high confidence score to such random images, depending on the learnt partition of the input space. This is indeed observed with fooling [3], for which it was shown to be possible to generate input samples that are unrecognizable to humans but get classified as a specific target class with high confidence (see example in Fig. 1). Fooling in particular may lead to severe problems in applications related to safety and security.

Fig. 1: A linear classifier is trained to recognize exclusively pictures of digits ‘0’ and ‘6’. Digit ‘9’ was never observed during training, but in this example it is wrongly classified as digit ‘6’. This is an example of overgeneralization. A similar problem is ‘fooling’, whereby it is possible to generate images that are unrecognizable to humans but are nonetheless classified as one of the known classes with high confidence, for example here the noise-looking picture in the bottom-left corner that is classified as digit ‘0’.

As suggested in [3], these problems may be mitigated or solved by using generative models that, rather than learning the posterior of the class label $p(y \mid x)$ directly, learn the joint distribution $p(x, y)$, from which the marginal $p(x)$ can be computed. Modeling the distribution of the data would then give a model the capability to identify input samples as belonging to known classes, and to reject those that are believed to belong to unknown ones. Apart from mitigating the problem of overgeneralization, modeling the distribution of the data would also be useful for applications in novelty and outlier detection [4] and incremental learning [2], broadening the range of applications the same model could be used in.

Estimating the marginal probability is, however, not trivial. Luckily, computing the full distribution may not be necessary. The results in this work suggest that identification of high-density regions close to the local maxima of the data distribution is sufficient to correctly identify which samples belong to the distribution and which ones are to be rejected. Specifically, it is possible to identify and classify the critical points of the data distribution by exploiting recent work that has shown that in denoising [5] and contractive [6] autoencoders, the reconstruction error tends to approximate the gradient of the log-density. A measure of a confidence score can then be computed as a function of this gradient.

A similar approach is exploited in “Energy-Based Generative Adversarial Networks” (EBGANs) [7], where a denoising autoencoder is used as the discriminator of a GAN model, trying to determine whether its inputs belong to the training distribution or not. Indeed, the reconstruction error of an autoencoder can be used to compute an energy function that assigns low values to points belonging to the distribution of interest and high values everywhere else, provided that regularization is used to prevent it from learning the identity function, which would correspond to a constant energy landscape.

Here, a set of experiments is presented to show the empirical performance of the proposed model and to compare it with baselines and with the COOL (Competitive Overcomplete Output Layer) [8] model that has been recently applied to the problem of fooling.

II Overview of Previous Work

The problem of overgeneralization in discriminative models like neural networks can be a serious concern in the context of security and model interpretability, and is a critical aspect of recognition problems where a limited number of classes have to be recognized among a larger number of unknown ones.

The simplest way to build a recognition system based on a given classifier is to set a threshold on the predicted output labels, computed for example using a softmax function that turns them into a probability distribution over the set of known classes, and to reject any sample that falls below that threshold (for example [9]). The output of the model is thus treated as an estimate of the confidence score of the classifier. This approach may however be of only limited help, as it was shown to be sensitive to the problem of fooling [3].
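The thresholding baseline described above can be sketched in a few lines. This is a minimal illustration only: the function names, the rejection label `-1` and the threshold value are assumptions made for the example, not part of the original work.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict_with_reject(logits, threshold):
    """Return the predicted class per row, or -1 (hypothetical 'reject' label)
    when the highest softmax probability falls below `threshold`."""
    probs = softmax(logits)
    labels = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= threshold
    return np.where(confident, labels, -1)

logits = np.array([[4.0, 0.0, 0.0],   # one class clearly dominates
                   [0.1, 0.0, 0.2]])  # nearly uniform scores: rejected
preds = predict_with_reject(logits, threshold=0.9)
```

Because the softmax output is always normalized over the known classes, a sample far from the training data can still receive a dominant score, which is why this baseline remains sensitive to fooling.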

Another way to mitigate the problem, as done in classical object recognition, is to use a training set of positive samples complemented with a set of negative samples that bundle together instances belonging to a variety of ‘other’ classes. This approach however does not completely solve the problem, and it is usually affected by an unbalanced training set due to the generally larger amount of negatives required for correct classification. As the potential amount of negatives can be arbitrarily large, a further problem consists in gathering the sufficient amount of data required to approximate their actual distribution, which is made even worse by the fact that the full set of negative categories may not be fully known when training the system. For example, in the context of object recognition in vision, high-resolution images may represent any possible image class, the majority of which is likely not known during training. The use of negative training instances may however mitigate the effect of categories that are known to be potentially observed by the system.

The problem of overgeneralization can be further analysed in the context of ‘open set recognition’, that was formally defined by Scheirer and colleagues [1]. In this framework, it is assumed that machine learning models are trained on a set of ‘known’ classes and potentially on a set of ‘known unknown’ ones (e.g., negative samples). Testing, however, is performed on a larger set of samples that include ‘unknown unknown’ classes that are never seen during training. Models developed to address the problem of open set recognition focus on the problem of ‘unknown unknown’ classes [10]. The seminal paper that gave the first formal definition of the problem proposed an extension of the SVM algorithm called the 1-vs-Set Machine that is designed to learn an envelope around the training data using two parallel hyperplanes, with the inner one separating the data from the origin, in feature space [1]. Scheirer and colleagues then proposed the Weibull-calibrated SVM (W-SVM) algorithm to address multi-class open set recognition [10]. Another interesting approach was recently applied to deep neural networks with the OpenMax model [11], that works by modelling the class-specific distribution of the activation vectors in the top hidden layer of a neural network, and using the information to recognize outliers.

Related to the problem of open set recognition is that of ‘open world recognition’, in which novel classes first have to be detected and then learnt incrementally [2]. This can be seen as an extension to open set recognition in which the ‘unknown unknown’ classes are discovered over time, becoming ‘novel unknowns’. The new classes are then labelled, potentially in an unsupervised way, and become ‘known’. The authors proposed the Nearest Non-Outlier (NNO) algorithm to address the problem.

An extreme case of open set recognition is 1-class-recognition, in which training is performed using samples from a single class, with or without using a set of negative samples. The 1-Class SVM algorithm was proposed to address this problem [12], by fitting a hyperplane that separates all the data points from the origin, in feature space, maximizing its distance from the origin. The algorithm has been applied in novelty and outlier detection [13]. Variants of the algorithm like Support Vector Data Description (SVDD) have also been used to learn an envelope around points in the dataset [14]. Other systems have tried to estimate the boundaries of the data by computing the region of minimum volume in input space containing a certain probability mass [15].

Finally, a specific sub-problem of overgeneralization is that of fooling [3]. In their paper, Nguyen and colleagues make a particularly interesting remark, suggesting that generative models may be less affected by the problem, due to their modeling of the data generating distribution. They also argued, however, that present generative models may still be too limited to fully solve the problem, as they do not scale well to high-dimensional datasets like ImageNet [16]. The “Competitive Overcomplete Output Layer” (COOL) model [8] was recently proposed to mitigate the problem of fooling. COOL works by replacing the final output layer of a neural network with a special COOL layer, constructed by replacing each output unit, corresponding to a specific target class, with several units (their number being the degree of overcompleteness). The output units for each target class are then made to compete for activation by means of a softmax activation that forces them to learn to recognize different parts of the input space, overlapping only around the region of support of the data generating distribution. The network can then compute a confidence score as the product of the activation of all the units belonging to the same target class, which is high for inputs on which a large number of units agree, and low in regions far from the data distribution, where only a few output units will be active.

III Proposed Solution

The solution presented here is based on a novel measure of confidence in the correct identification of data points as belonging to the distribution observed during training, or their rejection. Ideally, such a confidence score would be a function of the probability of a sample as belonging to the data distribution. Computing the full distribution may however not be necessary. In particular, the problem can be simplified with the identification of points belonging to the data manifold as points that are close to the local maxima of the data generating distribution.

It has been recently shown that denoising [5] and contractive [6] autoencoders implicitly learn features of the underlying data distribution [17, 18], specifically that their reconstruction error approximates the gradient of its log-density

$$\frac{\partial \log p(x)}{\partial x} \propto r(x) - x \qquad (1)$$

for small corruption noise ($\sigma \to 0$), where $p(x)$ is the data distribution, $r(x)$ is the reconstructed input and $\sigma$ is the standard deviation of the corruption noise. Larger noise is however found to work best in practice. The result has been proven to hold for any type of input (continuous or discrete), any noise process and any reconstruction loss, as long as it is compatible with a log-likelihood interpretation [18]. A similar interpretation suggested that the reconstruction error of regularized autoencoders can be used to define an energy surface that is trained to take small values on points belonging to the training distribution and higher values everywhere else [7].
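The relation between reconstruction error and the gradient of the log-density can be checked numerically in a case where everything is available in closed form. The sketch below assumes one-dimensional Gaussian data and Gaussian corruption noise, for which the optimal denoiser under a squared reconstruction loss is the posterior mean; the values of `MU` and `S` are hypothetical and chosen only for illustration.

```python
import numpy as np

MU, S = 1.0, 2.0  # hypothetical mean and standard deviation of the toy data

def score(x):
    """Exact gradient of log N(x; MU, S^2) with respect to x."""
    return (MU - x) / S**2

def optimal_denoiser(x, sigma):
    """Posterior mean of the clean input given the corrupted value x, for
    Gaussian data and additive Gaussian noise of standard deviation sigma."""
    return x + sigma**2 / (S**2 + sigma**2) * (MU - x)

def max_gap(sigma, x=np.linspace(-3.0, 5.0, 9)):
    """Largest difference between (r(x) - x) / sigma^2 and the true score."""
    estimate = (optimal_denoiser(x, sigma) - x) / sigma**2
    return float(np.max(np.abs(estimate - score(x))))

# The gap shrinks as the corruption noise becomes smaller.
gaps = [max_gap(sigma) for sigma in (1.0, 0.1, 0.01)]
```

In this toy case the rescaled reconstruction error equals $(\mu - x)/(s^2 + \sigma^2)$, which tends to the true score $(\mu - x)/s^2$ as $\sigma \to 0$. A trained autoencoder only approximates the optimal denoiser, so the agreement in practice is looser.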

Thus, critical points of the data distribution correspond to points with small gradient of the log distribution, that is small reconstruction error (Equation 1). Those are indeed points that the network can reconstruct well, and that it has thus hopefully experienced during training or has managed to generalize to well. A confidence score is designed that takes high values (close to 1) for points on the data manifold, that is points near the local minima of the energy function, corresponding to a small magnitude of the gradient of the log-density of the data distribution, and small values (close to 0) everywhere else.

Note however that this approach cannot distinguish between local minima, maxima or saddle points (Figure 2 shows such an example), and may thus assign a high confidence score to a small set of points not belonging to the target distribution. Here the problem is addressed by scaling the computed confidence by a function that favours small or negative curvature of the log-density of the data distribution, which can be computed from the diagonal of the Hessian, estimated as suggested in [17]

$$\frac{\partial^2 \log p(x)}{\partial x_i^2} \propto \frac{\partial r_i(x)}{\partial x_i} - 1 \qquad (2)$$

where $r_i(x)$ is the $i$-th component of the reconstruction of the input $x$ and $p(x)$ is the data distribution.
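As a rough sketch of how the diagonal of the Hessian of the log-density might be estimated from a reconstruction function, the code below differentiates each output component with respect to its own input by central finite differences and rescales by the noise variance. The helper name, the finite-difference step and the toy denoiser (the closed-form optimal denoiser for zero-mean isotropic Gaussian data) are assumptions for illustration; a real implementation would more likely use automatic differentiation on the trained autoencoder.

```python
import numpy as np

def hessian_diag_estimate(r, x, sigma, eps=1e-4):
    """Estimate the diagonal of the Hessian of log p at x as
    (d r_i / d x_i - 1) / sigma^2, using central finite differences on r."""
    x = np.asarray(x, dtype=float)
    diag = np.empty_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        diag[i] = (r(x + step)[i] - r(x - step)[i]) / (2.0 * eps)
    return (diag - 1.0) / sigma**2

S, SIGMA = 2.0, 0.05  # hypothetical data and noise standard deviations

def denoiser(x):
    """Optimal denoiser for data drawn from N(0, S^2 I) with noise SIGMA."""
    return x * S**2 / (S**2 + SIGMA**2)

estimate = hessian_diag_estimate(denoiser, np.array([0.3, -1.2]), SIGMA)
```

For this toy distribution the true second derivative of the log-density is $-1/s^2$ in every coordinate, and the estimate approaches it as the noise level decreases. Negative values indicate a local maximum along that coordinate.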


A variety of functions may be defined with the required characteristics. Throughout this paper, the confidence score is computed with a single fixed function, which depends on the dimensionality of the inputs, on a parameter that controls the sensitivity of the function to outliers and on a parameter that controls the sensitivity to the local curvature of the log-density.
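To make the required characteristics concrete, the sketch below shows one hypothetical function of this kind. It is not claimed to be the expression used in the experiments: the exponential form, the way the curvature enters, and the names `alpha` and `beta` are all assumptions chosen so that the score is close to 1 for well-reconstructed points with non-positive curvature and decays towards 0 elsewhere.

```python
import numpy as np

def confidence(x, r_x, curvature_diag, alpha=1.0, beta=1.0):
    """Hypothetical confidence score in [0, 1].

    x              -- input vector
    r_x            -- its reconstruction by the autoencoder
    curvature_diag -- estimate of the diagonal of the Hessian of log p at x
    alpha          -- assumed sensitivity to the reconstruction error
    beta           -- assumed sensitivity to positive curvature
    """
    x = np.asarray(x, dtype=float)
    r_x = np.asarray(r_x, dtype=float)
    dim = x.size
    # Proximity to a critical point: small reconstruction error gives ~1.
    proximity = np.exp(-alpha * np.sum((r_x - x) ** 2) / dim)
    # Curvature gate: only positive curvature (minimum-like) is penalized.
    positive = np.maximum(np.asarray(curvature_diag, dtype=float), 0.0)
    gate = np.exp(-beta * np.mean(positive))
    return float(proximity * gate)

x = np.array([0.2, 0.8, 0.5])
near_maximum = confidence(x, x, [-0.3, -0.1, -0.2])       # perfect reconstruction
far_away = confidence(x, x + 3.0, [-0.3, -0.1, -0.2])     # large reconstruction error
near_minimum = confidence(x, x, [0.5, 0.5, 0.5])          # positive curvature
```

Normalizing the squared error by the input dimensionality keeps the same `alpha` usable across inputs of different sizes, which is one plausible reason for the dimensionality to appear in such a score.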

A generic classifier can finally be modified by scaling its predicted output probabilities by the confidence score $c(x)$, computed using a denoising autoencoder trained together with the classifier

$$\tilde{y}(x) = c(x)\, y(x)$$

where $y(x)$ denotes the output of the unmodified classifier and $\tilde{y}(x)$ the scaled output. If the outputs of the classifier are normalized, for example using a softmax output, this can be interpreted as introducing an implicit ‘reject’ option with probability $1 - c(x)$. The confidence score proposed here, however, was not designed as a probability estimate.
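The scaling step and its ‘reject’ interpretation can be illustrated as follows. This is a minimal sketch assuming softmax-normalized classifier outputs and a confidence value in [0, 1] per sample; the function name and the explicit extra column for the reject option are choices made for this example.

```python
import numpy as np

def scale_with_reject(probs, conf):
    """Scale class probabilities by a per-sample confidence score and append
    the remaining mass (1 - confidence) as an explicit 'reject' column."""
    probs = np.asarray(probs, dtype=float)
    conf = np.asarray(conf, dtype=float).reshape(-1, 1)
    scaled = conf * probs
    reject = 1.0 - conf
    return np.hstack([scaled, reject])

probs = np.array([[0.7, 0.2, 0.1],
                  [0.5, 0.3, 0.2]])
conf = np.array([0.9, 0.05])  # second sample lies far from the training data
out = scale_with_reject(probs, conf)
```

Each row still sums to one, so the last column behaves like the probability of rejection, with the caveat stated above that the confidence score was not designed as a calibrated probability.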

In the experiments presented here, the classifier is constructed as a fully connected softmax layer attached on top of the top hidden layer of an autoencoder with symmetric weights (i.e., attached to the output of the encoder), in order to keep the number of weights similar (minus the bias terms of the decoder) to an equivalent feed-forward benchmark model, identical except for its lack of the decoder. In general, keeping the autoencoder separate from the classifier or connecting the two in more complex ways will work, too, as well as using a classifier that is not a neural network. In case the autoencoder and the classifier are kept separate, the autoencoder is only used to infer information about the data distribution. Pairing the systems together, however, might provide advantages outside the scope of the present work, like enabling a degree of semi-supervised learning. The autoencoder can be further improved by replacing it with the discriminator of an EBGAN [7] to potentially learn a better model of the data.

IV Experiments

IV-A 2D example

The model was first tested on a 2D classification task to help visualize its capacity to learn the region of the input space of each training class. Three target distributions were defined as uniform rings of fixed thickness and inner radius, centered at three different points. The training distributions are shown in Figure 2A. Training was performed in minibatches using the Adam optimizer [19] for a fixed number of update steps. As shown in Figure 2B, the model learned to correctly identify the support region of the target distributions. On the contrary, the uncorrected classifier partitioned the whole space into three regions, incorrectly labeling most points (panel C). The confidence score computed by the model presented here (panel D) helps the system to prevent overgeneralization by limiting the decisions of the classifier to points likely to belong to one of the target distributions. Panel D further shows the different contributions of the two factors used to compute the confidence score. A measure of proximity to any local extremum of the data generating distribution (top part of panel D) is modulated to remove the local minima using the function of the local curvature of the log probability of the distribution (bottom part of panel D). It is important to observe, however, that the curvature function may reduce the computed confidence score of valid samples. Different types of applications may benefit from its inclusion or exclusion, depending on whether more importance is given to the correct rejection of samples that do not belong to the training distribution or to their correct classification, at the expense of a partial increase in overgeneralization.
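A dataset of this kind can be generated with a few lines of NumPy. The sketch below is only illustrative: the centers, inner radius, thickness and sample counts are hypothetical values, not the ones used in the experiment. Sampling the squared radius uniformly makes the points uniform in area over each ring.

```python
import numpy as np

def sample_ring(n, center, inner_radius, thickness, rng):
    """Draw n points uniformly (in area) from a 2D ring around `center`."""
    outer_radius = inner_radius + thickness
    # Uniform in area: the squared radius is uniformly distributed.
    radius = np.sqrt(rng.uniform(inner_radius**2, outer_radius**2, size=n))
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
    offsets = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)
    return np.asarray(center, dtype=float) + offsets

rng = np.random.default_rng(0)
centers = [(-1.0, 0.0), (1.0, 0.0), (0.0, 1.5)]  # hypothetical ring centers
points = np.concatenate(
    [sample_ring(500, c, inner_radius=0.5, thickness=0.2, rng=rng) for c in centers]
)
labels = np.repeat(np.arange(len(centers)), 500)  # one class per ring
```

The empty disc inside each ring is a region of low density surrounded by data, which is the kind of region where a proximity-to-critical-point measure alone can be misled and the curvature term becomes relevant.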

Fig. 2: The system presented here is trained to classify points sampled from three uniform ring distributions. A. Sampling of data points from each of the target distributions. B. Labeling of each point in the input space scaled by the computed confidence score. Regions in white are assigned low confidence scores. C. Labeling of each point in the input space without re-scaling of the output of the classifier by the computed confidence score. D. Top: confidence score before scaling by the curvature function. Bottom: estimate of the curvature of the log-distribution of the data. The confidence score is the product of the two functions. The panel in B is the product of the classifier’s output (C) and the confidence score. Black denotes high values.

IV-B Fooling on MNIST

The model presented in this paper was next tested on a benchmark of fooling on the MNIST dataset [20] similar to the one proposed in [8]. However, contrary to the previous work, the classification accuracy of the models tested is reported as a ‘thresholded classification accuracy’, that is, with the further requirement that, in order to be counted as correctly classified, a test sample has to be assigned a confidence score higher than the threshold used to consider a fooling instance valid. This metric should indeed be reported alongside the fooling rate for each model, as otherwise a model that trivially caps the confidence scores of a network to a fixed value lower than the threshold used to consider fooling attempts to be valid would by definition never be fooled. The same model would however never classify any valid sample above that same threshold. This metric thus proves useful to compare different models with varying degrees of sensitivity to overgeneralization.
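The thresholded classification accuracy can be sketched as below, including the degenerate case of a model whose scores are capped below the threshold. The function name, the toy scores and the threshold values are assumptions for illustration.

```python
import numpy as np

def thresholded_accuracy(probs, labels, threshold):
    """Fraction of samples that are classified correctly AND whose top score
    exceeds `threshold` (the same threshold used to validate fooling samples)."""
    probs = np.asarray(probs, dtype=float)
    predicted = probs.argmax(axis=1)
    confident = probs.max(axis=1) > threshold
    return float(np.mean((predicted == np.asarray(labels)) & confident))

probs = np.array([[0.95, 0.05],
                  [0.60, 0.40],
                  [0.20, 0.80]])
labels = np.array([0, 0, 1])

plain = thresholded_accuracy(probs, labels, threshold=0.0)   # ordinary accuracy
strict = thresholded_accuracy(probs, labels, threshold=0.9)  # only 1 of 3 passes
# A model that caps every score at 0.5 can never be fooled at a 0.9 threshold,
# but it also never classifies anything above that threshold.
capped = thresholded_accuracy(np.minimum(probs, 0.5), labels, threshold=0.9)
```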

The fooling test was performed by trying to fool a target network to classify an input that is unrecognizable to humans into each target class (digits 0 to 9). The fooling instances were generated using a Fooling Generator Network (FGN) consisting of a single layer perceptron with sigmoid activation and an equal number of input and output units (784 for MNIST, one per pixel). Most importantly, the FGN produces samples with values bounded in [0, 1] without requiring an explicit definition of the constraint. Fooling of each target digit was attempted by performing stochastic gradient descent on the parameters of the FGN to minimize the cross-entropy between the output of the network to be fooled and the specific desired target output class. Fooling of each digit was attempted over repeated trials with different random inputs to the FGN, each trial consisting of up to a maximum number of parameter updates, as described in [8].
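The fooling procedure can be sketched with plain NumPy. To keep the example self-contained, the network to be fooled is replaced by a fixed random linear softmax classifier, which is an assumption made purely for illustration, as are the input size, learning rate, threshold and maximum number of updates. Only the FGN parameters are updated, and gradients are written out by hand.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fool(classifier_w, target, threshold, rng, lr=0.02, max_steps=5000):
    """Train a single-layer sigmoid FGN so that a fixed linear softmax
    classifier (logits = x @ classifier_w) assigns `target` a probability
    above `threshold`. Returns (sample, number_of_updates) or (None, max_steps)."""
    dim, n_classes = classifier_w.shape
    z = rng.uniform(size=dim)                     # fixed random input to the FGN
    w = 0.01 * rng.standard_normal((dim, dim))    # FGN weights
    b = np.zeros(dim)                             # FGN biases
    onehot = np.eye(n_classes)[target]
    for updates in range(max_steps):
        x = sigmoid(w @ z + b)                    # generated sample, bounded in (0, 1)
        p = softmax(x @ classifier_w)
        if p[target] > threshold:
            return x, updates
        grad_x = classifier_w @ (p - onehot)      # gradient of cross-entropy w.r.t. x
        grad_a = grad_x * x * (1.0 - x)           # back-propagate through the sigmoid
        w -= lr * np.outer(grad_a, z)
        b -= lr * grad_a
    return None, max_steps

rng = np.random.default_rng(0)
classifier_w = rng.standard_normal((64, 10))      # stand-in for a trained network
sample, updates = fool(classifier_w, target=3, threshold=0.9, rng=rng)
```

The sigmoid output layer is what keeps the generated sample inside the valid pixel range without an explicit constraint. Against a model that scales its outputs by a confidence score, the same loop would be run on either the unscaled or the scaled output, depending on which quantity the attack targets.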

In the first test we compared three models: a plain Convolutional Neural Network (CNN), the same CNN with a Competitive Overcomplete Output Layer (COOL) [8], and a network based on the system described in Section III, built on the same CNN as the other two models with the addition of a decoder taking the activation of the top hidden layer of the CNN as input, to complete the denoising autoencoder used to compute the confidence score. The denoising autoencoder (dAE) was trained with corruption of the inputs by additive Gaussian noise. All the models were trained for a fixed number of epochs. Fooling was attempted at two different thresholds, in contrast to the previous work that used only one [8]. Comparing the models at different thresholds may indeed give more information about their robustness and may amplify their differences, thus improving the comparison.

Table I reports the results for the three models, with the denoising autoencoder model further split into two separate tests, using either a separate decoder or a decoder built as a symmetric transpose of the encoder (with the addition of new bias parameters). The table reports the thresholded classification accuracy for all the models together with the original, unthresholded one. Fooling was measured as the proportion of trials (an equal number of repetitions for each of the 10 digits) that produced valid fooling samples within the maximum number of updates. The average number of updates required to fool each network is reported in parentheses. The full set of parameters used in the simulations is reported in Appendix -B. The model presented here outperformed the other two at both thresholds, while also retaining a high thresholded classification accuracy, even at high thresholds. As in the previous protocol [8], the cross-entropy loss used to optimize the FGN was computed using the unscaled output of the network.

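Model    | Accuracy (unthresholded) | Accuracy (lower threshold) | Accuracy (higher threshold) | Fooling rate at lower threshold (avg steps) | Fooling rate at higher threshold (avg steps)
CNN      | 99.35% | 99.23% | 99%    | 100% (63.5)   | 99% (187.1)
COOL     | 99.33% | 98.1%  | 93.54% | 34.5% (238.8) | 4.5% (313.4)
dAE sym  | 98.98% | 98.11% | 96.8%  | 0% (-)        | 0% (-)
dAE asym | 99.14% | 98.41% |        | 0% (-)        |
TABLE I: MNIST fooling results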

As the difference between the autoencoders using a symmetric versus asymmetric decoder was found to be minimal, the symmetric autoencoder was used for all the remaining experiments, so that the number of parameters of the three models was kept similar.

The results in Table I however were found to be different from those reported in [8]. In particular, the fooling rate of COOL was found to be significantly lower than that reported, as was the average number of updates required to fool it. The major contributor to this difference was found to be the use of Rectified Linear Units (ReLUs) in the experiments reported here, compared to sigmoid units in the original study. This was shown in a separate set of simulations where all the three models used sigmoid activations instead of ReLUs and a single fixed fooling threshold. In this case the thresholded classification accuracy of the models was slightly higher, but it was matched with a significant increase in the fooling rate of the COOL model. Other variations in the protocol that could further account for the differences found are the different paradigm for training (a fixed number of training epochs versus early stopping with a maximum number of epochs) and a slightly different network architecture, which in the present work used a higher number of filters at each convolutional layer.

Next, the effect of the learning rate used in the fooling update steps was investigated by increasing it from the one used in the previous study to the one used to train the models, expecting a higher fooling rate. A single classification threshold was used. Indeed, the plain benchmark network was found to be fooled within a small number of updates, while the dAE based model was still never fooled. COOL, on the other hand, significantly decreased in performance, showing a higher fooling rate.

Finally, the COOL and dAE models were tested by trying to fool their confidence scores directly, rather than their output classification scores, in contrast to [8] (i.e., using the output scaled by the confidence score instead of the unscaled output to compute the cross-entropy loss used to update the FGN). A single threshold was used. Interestingly, the COOL model was never fooled, while the model described here could be fooled, although only after a large number of updates on average. Also, it was found that while adding regularization to the weights of the dAE model led to the generation of a large number of instances passing the fooling test, the generated samples actually resembled real digits closely, and thus could not be considered examples of fooling. This shows that the dAE model, when heavily regularized, is capable of learning a tight boundary around the high density regions of the data generating distribution, although at the cost of reducing its thresholded accuracy. The set of generated samples is shown as Supplementary Figure D.

An example of the generated fooling samples is reported in Fig. 3, showing instances from the main results of Table I for the plain CNN and COOL, and from the experiment with fooling the confidence scores directly for the dAE model.

Fig. 3: Visualization of a set of generated fooling samples from the main results of Table I. The samples from the plain CNN and the COOL models were computed by trying to fool each system’s output classification scores above the fooling threshold. As fooling was unsuccessful on the dAE model in this case, the results reported here were taken from the simulations in which fooling was performed directly on the output scaled by the confidence score.

IV-C Open Set Recognition

The three models that were tested on fooling, a plain CNN, COOL [8] and the dAE model described in this paper, were next compared in the context of open set recognition.

Open set recognition was tested by building a sequence of classification problems with varying degrees of openness based on the MNIST dataset [20]. Each problem consisted in training a target model only on a limited number of ‘known’ training classes (digits) and then testing it on the full test set of 10 digits, requiring the model to be able to reject samples hypothesized to belong to ‘unknown’ classes. The degree of openness of each problem was computed similarly to [1], as a function of the number of ‘known’ classes seen during training and of the total number of classes, which is 10 for MNIST. A high value of openness reflects a larger number of unknown classes seen during testing than that of classes experienced during training. The number of training classes was varied to cover the full range of degrees of openness offered by the dataset.
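For reference, the openness measure defined in [1] can be computed as in the sketch below. Whether the experiments here use exactly this expression is an assumption; the text only states that openness was computed similarly to [1]. The sketch further assumes that the target classes coincide with the training classes and that all 10 MNIST digits appear at test time.

```python
import math

def openness(n_train, n_test, n_target=None):
    """Openness as defined in [1]:
    1 - sqrt(2 * |training classes| / (|testing classes| + |target classes|)).
    By default the target classes are assumed to be the training classes."""
    if n_target is None:
        n_target = n_train
    return 1.0 - math.sqrt(2.0 * n_train / (n_test + n_target))

# Openness for every possible number of known MNIST digits (10 test classes).
values = {n: openness(n, n_test=10) for n in range(1, 11)}
```

Under these assumptions openness is zero when all digits are seen during training and grows as fewer classes are known.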

For each fixed number of training classes, the models were trained for several repetitions on different random subsets of the digits, to balance between easier and harder problems depending on the specific digits used, with the exception of the test using the full training set, which was run only once. The same subsets of digits were used for all the three models. Correct classification was computed as a correct identification of the class label together with a confidence score above a fixed classification threshold, while correct rejection was measured as either assigning a classification score below the threshold or classifying the input sample as any of the classes not seen during training (for simplicity, the networks used a fixed number of output units for all the problems, with the target outputs corresponding to the ‘unknown’ classes always set to zero). The models were trained for a fixed number of epochs for each task.
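The decision rule for correct classification and correct rejection can be sketched as follows. The function name, the toy scores and the threshold are assumptions for illustration; the rule itself follows the description above, with a sample rejected when its top score is not above the threshold or when the predicted class is one that was not seen during training.

```python
import numpy as np

def open_set_outcomes(probs, labels, known_classes, threshold):
    """Return two boolean arrays: correctly classified known samples and
    correctly rejected unknown samples."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    predicted = probs.argmax(axis=1)
    top_score = probs.max(axis=1)
    is_known = np.isin(labels, known_classes)
    accepted = (top_score > threshold) & np.isin(predicted, known_classes)
    correct_classification = is_known & accepted & (predicted == labels)
    correct_rejection = ~is_known & ~accepted
    return correct_classification, correct_rejection

probs = np.array([[0.95, 0.03, 0.01, 0.01],   # known class, confident and correct
                  [0.60, 0.40, 0.00, 0.00],   # known class, wrong and below threshold
                  [0.50, 0.30, 0.10, 0.10],   # unknown class, low score: rejected
                  [0.02, 0.01, 0.02, 0.95],   # unknown class, predicted as unseen class
                  [0.97, 0.01, 0.01, 0.01]])  # unknown class, wrongly accepted
labels = np.array([0, 1, 2, 3, 2])
classified, rejected = open_set_outcomes(probs, labels, known_classes=[0, 1], threshold=0.9)
```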

Figure 4 reports the results of the experiment. As in previously published benchmarks on open set recognition [1, 10, 11], the performance of the models for each degree of openness was computed as the F-measure, the harmonic mean of the precision and recall scores, averaged across all the repetitions for the same degree of openness.
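The F-measure itself is straightforward to compute from counts of true positives, false positives and false negatives. How those counts are accumulated in an open set setting (for example, whether a wrongly accepted unknown sample counts as a false positive) is an assumption left to the caller in this sketch.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall; returns 0.0 when undefined."""
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

balanced = f_measure(true_pos=8, false_pos=2, false_neg=2)   # precision = recall = 0.8
skewed = f_measure(true_pos=6, false_pos=2, false_neg=6)     # precision 0.75, recall 0.5
```

Because the harmonic mean is dominated by the smaller of the two terms, a model that accepts everything (high recall, low precision) or rejects almost everything (high precision, low recall) scores poorly.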

Results from a similar experiment with a lower threshold are available as Supplementary Figure F.

Fig. 4: Comparison of a plain CNN, COOL and the dAE model described in this paper on a benchmark of open set recognition. The F-measure was computed for each of the three models on problems created from the MNIST dataset [20] by only using a limited number of ‘known’ classes during training while testing on the full test set, for example by training each model on only two digits but testing it on all ten, requiring the model to be able to reject samples hypothesized to belong to ‘unknown’ classes. Higher values for the openness of a problem (x-axis) reflect a smaller number of classes used during training, while the lowest value corresponds to the case in which the full set of digits is used. The curves are averaged across runs using different sub-sets of digits, for each degree of openness except the lowest, where all the digits were used in training and the test was performed only once. Error bars denote standard deviation.

IV-D 1-Class Recognition

The limit of open set recognition in which a single class is observed during training, that is the problem of 1-class-recognition, was explored here by comparing the model presented in this paper with COOL [8] and 1-Class SVM [12].

A separate 1-class-recognition problem was created from the MNIST dataset [20] for each digit. For each problem the models were trained using only samples from the selected class, while they were tested on the full test set comprising all the digits. No negative samples were used during training. Training was performed for a set number of epochs for each model on each problem.

Figure 5 reports the results as a ROC curve averaged across the curves computed for each of the 1-class-recognition problems. The dAE based model outperformed the other two, achieving a higher Area Under the Curve (AUC) than both 1-Class SVM and COOL.
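The Area Under the ROC Curve can be computed directly from confidence scores without tracing the curve, using its interpretation as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties counted as one half). The sketch below uses this pairwise form; the toy scores are hypothetical.

```python
import numpy as np

def roc_auc(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a correct pair."""
    scores = np.asarray(scores, dtype=float)
    mask = np.asarray(is_positive, dtype=bool)
    pos, neg = scores[mask], scores[~mask]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Samples of the trained digit are positives; all other digits are negatives.
partial = roc_auc([0.9, 0.4, 0.6, 0.2], [True, True, False, False])
perfect = roc_auc([0.9, 0.8, 0.3, 0.6, 0.2], [True, True, False, True, False])
```

The pairwise form uses memory proportional to the number of positive-negative pairs, so for a full test set a rank-based implementation would be preferable.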

Fig. 5: ROC curves averaged over the 1-class recognition problems, one for each digit in MNIST, for three models: the dAE model described in this paper, 1-Class SVM [12] and COOL [8]. The Area Under the Curve (AUC) is highest for the dAE model.

V Discussion

The confidence score based on denoising autoencoders that was introduced in this paper was found to perform better than a set of competing models in open set recognition and 1-class recognition. The system was also found to be significantly more robust to the problem of fooling than the state of the art COOL model. Together, these results show that it is possible to use information about the data generating distribution implicitly learnt by denoising autoencoders in meaningful ways, even without explicitly modelling the full distribution.

It is to be noted that when comparing the results to the COOL model we used the same degree of overcompleteness as in the original paper. However, fine tuning of the parameter and in particular using higher values may achieve higher performance on the benchmarking tasks used here. Also, similarly to the original COOL paper, fooling was attempted on the output of the classifier, rather than directly on the confidence scores. This gives an advantage to systems in which the confidence score is computed in more complex ways, not directly dependent on the output of the classifier. However, further tests as presented in Section IV-B showed that the system presented here significantly outperforms the other models even when fooling is attempted directly on the confidence scores. In this particular case, it was further found that training the denoising autoencoder with heavy regularization resulted in generated samples resembling real digits, thus showing that the model had learnt a tight boundary around the data manifold.

It is interesting that the Energy-Based GAN (EBGAN) [7] makes use of the reconstruction error of a denoising autoencoder in a way compatible with the interpretation proposed here. In particular, it uses it as an approximate energy function that the autoencoder learns to make low for points belonging to the training distribution and high everywhere else. As we have seen in Equation 1, the reconstruction error of denoising autoencoders has been shown to be proportional to the gradient of the log distribution of the data. Thus, small absolute values of the reconstruction error correspond to critical points of the distribution, which include not only local maxima but also minima and saddle points. If Fig. 2 were a good example of the dynamics of the system even on more complex data, then the problem of local minima and saddle points might be limited. However, if that were not the case, then EBGAN might learn to generate samples from regions of local minima of the data distribution, which may not be desirable. It would be interesting to modify the system using the function described here (Equation 4) in order to correctly isolate only the local maxima of the distribution.
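For reference, the relation behind this argument can be written out explicitly. Assuming Equation 1 refers to the asymptotic result of Alain and Bengio [17] for an optimal denoising autoencoder r trained with Gaussian noise of variance σ², the reconstruction error and the critical-point conditions read as follows (the notation is chosen here for illustration and may differ from the paper's):

```latex
% Reconstruction error of the optimal denoising autoencoder (small-noise limit)
r^{*}_{\sigma}(x) - x \;=\; \sigma^{2}\,\frac{\partial \log p(x)}{\partial x} \;+\; o(\sigma^{2})
\qquad \text{as } \sigma \to 0 .

% A vanishing reconstruction error only identifies a critical point of p
\lVert r^{*}_{\sigma}(x) - x \rVert \approx 0
\;\Longleftrightarrow\;
\frac{\partial \log p(x)}{\partial x} \approx 0 .

% Among critical points, local maxima are those with negative definite curvature
\frac{\partial^{2} \log p(x)}{\partial x^{2}} \prec 0 .
```

The first condition alone is satisfied by maxima, minima and saddle points alike, which is why a curvature term is needed to single out the local maxima.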

Likewise, it would be interesting to complement the model presented here with the regularization function used in EBGAN, the “repelling regularizer”, which adds a Pulling-away Term (PT) that forces learnt representations to be maximally different for different data points by attempting to orthogonalize each pair of samples in a minibatch [7]. The stronger regularization might help the denoising autoencoder learn a tighter representation of the support of the data distribution, thus improving the confidence score.
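As a rough illustration of the Pulling-away Term, the sketch below computes the mean squared cosine similarity over all pairs of distinct representations in a minibatch, following the definition given in the EBGAN paper [7]. The function name and the small `eps` guard against zero-norm rows are assumptions added here; how the term would be weighted against the other losses in the present model is left open.

```python
import numpy as np

def pulling_away_term(S, eps=1e-12):
    """Mean squared cosine similarity over all ordered pairs i != j of rows of S.

    S has shape (batch_size, representation_dim). The value is 0 when all
    representations are mutually orthogonal and 1 when they are all parallel.
    """
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    if n < 2:
        raise ValueError("need at least two samples in the minibatch")
    # Normalize each row to unit length (eps avoids division by zero).
    unit = S / (np.linalg.norm(S, axis=1, keepdims=True) + eps)
    cos_sq = (unit @ unit.T) ** 2
    # Drop the diagonal, which holds each sample's similarity with itself.
    off_diag = cos_sq.sum() - np.trace(cos_sq)
    return off_diag / (n * (n - 1))
```

Minimizing this quantity alongside the reconstruction loss pushes the representations of different samples towards orthogonality, which is the effect described above.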

Further improvements in the performance of the system may be achieved by separating the classifier and the denoising autoencoder, although combining the two may bring other advantages, such as adding a degree of semi-supervised learning or further regularizing the autoencoder. It may also be possible to build the autoencoder on top of a larger pre-trained network that is kept fixed during training, so that it learns to reconstruct higher-level, more stable feature vectors rather than high-dimensional raw inputs.

VI Conclusion

Overgeneralization is a significant problem for neural networks. This paper presented a novel approach to address the problem by pairing a generic classifier with a denoising or contractive autoencoder that is used to compute a confidence score that assigns high values only to input vectors likely to belong to the training data distribution. In particular, recognition of an input as belonging to the distribution is performed by taking into account the gradient of the log density and its curvature at the specific input point, and using this information to determine whether it lies close to a local maximum of the distribution.
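The role of the curvature can be illustrated on a toy 1-D density. The sketch below is not the paper's confidence score: it uses a hand-written Gaussian mixture in place of a trained autoencoder, estimates the gradient and curvature of the log density by finite differences, and relies on the small-noise relation from [17] to express the ideal reconstruction error. The mixture parameters, the tolerance `grad_tol` and all function names are assumptions made for illustration.

```python
import numpy as np

def log_density(x, means=(-2.0, 2.0), std=0.5):
    """Log density of an equal-weight 1-D Gaussian mixture (toy data distribution)."""
    x = np.asarray(x, dtype=float)
    comps = [np.exp(-0.5 * ((x - m) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))
             for m in means]
    return np.log(np.mean(comps, axis=0))

def score(x, eps=1e-4):
    """Central finite-difference estimate of d/dx log p(x)."""
    return (log_density(x + eps) - log_density(x - eps)) / (2.0 * eps)

def curvature(x, eps=1e-4):
    """Finite-difference estimate of the second derivative of log p(x)."""
    return (log_density(x + eps) - 2.0 * log_density(x) + log_density(x - eps)) / eps ** 2

def ideal_reconstruction_error(x, sigma=0.1):
    """r(x) - x for an ideal denoising autoencoder, assuming r(x) - x = sigma^2 * score."""
    return sigma ** 2 * score(x)

def near_local_maximum(x, grad_tol=1e-2):
    """True if the gradient is (almost) zero and the curvature is negative."""
    return bool(abs(score(x)) < grad_tol and curvature(x) < 0.0)
```

In this toy setting the point x = 0, halfway between the two modes, has zero reconstruction error but positive curvature, so a test based on the reconstruction error alone would accept it while `near_local_maximum` rejects it; the modes near x = -2 and x = 2 pass both conditions.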

It is further suggested that the system presented here may have significant applications to a variety of problems. First, fooling and overgeneralization may lead to security issues in deployed systems [3], as they make it possible to trick such models and thus leave them vulnerable to malicious exploitation, in a manner similar to that proposed for adversarial examples [21, 22]. An example could be a face or retina recognition system for restricted access to a critical facility. Thus, resistance to fooling and to overgeneralization in general can lead to classifiers that are more robust and more likely to behave as desired.

In this paper, we have also explored the natural application of an overgeneralization-resistant system in the context of open set recognition. In general, the model presented here could be used in more complex architectures to allow for incremental and continual learning, by learning to recognize the regions of input space that have already been explored and learnt and potentially provide for different training regimes in the unexplored parts, in case new samples from those regions were to be observed in the future. For example, it may be applied to a system to allow for adding novel target classes even after it is deployed for real world use, without requiring a full re-training that may be costly in terms of compute time required, especially for large models.

As we have seen, the extreme case of open set recognition is 1-class recognition, which has proved to be a challenging problem. Building systems capable of robust 1-class recognition has critical applications in the context of novelty, outlier and anomaly detection.

In conclusion, developing discriminative models capable of capturing some aspects of the data distribution, even without computing it perfectly, can prove very useful in a large number of practical applications, and future work on the topic will be highly beneficial. Here a system was presented to address the problem and was shown to perform better than other previously proposed systems on a set of benchmarks.

[Details of the simulations]

Training of all models reported in this paper -with the exception of 1-class-SVM- was performed with Stochastic Gradient Descent with a cross-entropy loss function using the Adam algorithm [19] with , and . The models and the tests were implemented using Python and Tensorflow [23].

-A 2D example

The dAE model used parameters , and . The network used is a symmetric denoising autoencoder with inputs of size and two hidden layers both of size . The output layer for the classifier was connected to the top hidden layer of the autoencoder and had output units. Training was performed for steps with minibatches of size . The three target distributions were defined as uniform rings with thickness of and inner radius of , centered at the three points , and .
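The ring-shaped target distributions can be generated along the following lines. This is a sketch under stated assumptions: "uniform" is taken to mean uniform per unit area, and the centre, inner radius and thickness used in any call are placeholders, since the actual values are not legible in this copy of the text. The function name `sample_ring` is hypothetical.

```python
import numpy as np

def sample_ring(n, center, inner_radius, thickness, rng):
    """Draw n points uniformly (by area) from a 2-D ring around `center`."""
    outer_radius = inner_radius + thickness
    # Taking the square root of a uniform draw over [r_in^2, r_out^2]
    # makes the density uniform per unit area rather than per unit radius.
    radius = np.sqrt(rng.uniform(inner_radius ** 2, outer_radius ** 2, size=n))
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
    offsets = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)
    return np.asarray(center, dtype=float) + offsets
```

Calling the function once per class centre, with a label attached to each batch, would yield a three-class 2D training set of the kind described above.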

-B Fooling - MNIST

The three models compared are a regular CNN, the same network with the output layer replaced with a COOL layer (degree of overcompleteness , as in [8]), and the same network with the addition of a decoder connected to the top hidden layer of the CNN, to complete the denoising autoencoder used to compute the confidence score . A supplementary reconstruction loss was added to the cross-entropy loss of the classification task when training the autoencoder. Two variants of autoencoders were tested, using a symmetric decoder whose weights were the transpose of the weights of the encoder and an asymmetric one with independent weights of the same size.
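The combination of classification and reconstruction losses, and the difference between the symmetric and asymmetric decoders, can be sketched on a much smaller network than the CNN used in the experiments. The example below uses a single dense ReLU layer as the encoder; the layer sizes, the noise level `noise_std`, the weighting `recon_weight` and the function name `joint_loss` are all assumptions made for illustration, not values or code from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def joint_loss(x, labels, params, rng, tied=True, noise_std=0.1, recon_weight=1.0):
    """Cross-entropy on the classifier head plus a denoising reconstruction term.

    params = (W_enc, b_enc, W_dec, b_dec, W_out, b_out). With tied=True the
    decoder reuses the transpose of the encoder weights and W_dec is ignored.
    Returns (total, cross_entropy, reconstruction).
    """
    W_enc, b_enc, W_dec, b_dec, W_out, b_out = params
    # Corrupt the input with zero-mean Gaussian noise; the target stays clean.
    x_noisy = x + rng.normal(0.0, noise_std, size=x.shape)
    h = relu(x_noisy @ W_enc + b_enc)
    decoder_W = W_enc.T if tied else W_dec
    x_hat = h @ decoder_W + b_dec
    probs = softmax(h @ W_out + b_out)
    ce = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    recon = np.mean(np.sum((x_hat - x) ** 2, axis=1))
    return ce + recon_weight * recon, ce, recon
```

With `tied=True` the decoder adds no weight matrix of its own, whereas the asymmetric variant learns `W_dec` independently; both share the same hidden representation with the classifier head.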

The architecture of the CNN is . Each layer is followed by a ReLU non-linearity, except for the output layer that is followed by a softmax.

Fooling was attempted for times for each digit, each for up to update steps with a learning rate for updating the FGN set to as in [8]. Training was performed for a fixed epochs for each model. Training of the denoising autoencoder was done by corrupting the input samples with additive Gaussian noise with zero mean and . The parameters for the dAE model were set to and variable depending on the threshold used ( for the classification threshold, for the threshold).

-C Open Set Recognition

The same models used for the MNIST fooling experiments were used for the Open Set Recognition tests. The classification threshold was fixed to .

-D 1-Class Recognition

The same parameters as in the previous experiments were used for the COOL and the dAE models, except for the dAE model that used and the addition of regularization to the weights (). For 1-Class SVM, the simulations used and an RBF kernel was used with scaling parameter . Training of the 1-Class-SVM model was done using the implementation available from the scikit-learn library [24].


The author would like to thank everybody!


  • [1] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult, “Toward open set recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 7, pp. 1757–1772, 2013.
  • [2] A. Bendale and T. Boult, “Towards open world recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1893–1902.
  • [3] A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 427–436.
  • [4] M. Markou and S. Singh, “Novelty detection: a review - part 1: statistical approaches,” Signal processing, vol. 83, no. 12, pp. 2481–2497, 2003.
  • [5] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th international conference on Machine learning. ACM, 2008, pp. 1096–1103.
  • [6] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Contractive auto-encoders: Explicit invariance during feature extraction,” in Proceedings of the 28th international conference on machine learning (ICML-11), 2011, pp. 833–840.
  • [7] J. Zhao, M. Mathieu, and Y. LeCun, “Energy-based generative adversarial network,” arXiv preprint arXiv:1609.03126, 2016.
  • [8] N. Kardan and K. O. Stanley, “Mitigating fooling with competitive overcomplete output layer neural networks,” in Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 2017, pp. 518–525.
  • [9] P. J. Phillips, P. Grother, and R. Micheals, “Evaluation methods in face recognition,” in Handbook of face recognition. Springer, 2011, pp. 551–574.
  • [10] W. J. Scheirer, L. P. Jain, and T. E. Boult, “Probability models for open set recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 11, pp. 2317–2324, 2014.
  • [11] A. Bendale and T. E. Boult, “Towards open set deep networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1563–1572.
  • [12] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, “Estimating the support of a high-dimensional distribution,” Neural computation, vol. 13, no. 7, pp. 1443–1471, 2001.
  • [13] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt, “Support vector method for novelty detection,” in Advances in neural information processing systems, 2000, pp. 582–588.
  • [14] D. M. Tax and R. P. Duin, “Support vector data description,” Machine learning, vol. 54, no. 1, pp. 45–66, 2004.
  • [15] C. Park, J. Z. Huang, and Y. Ding, “A computable plug-in estimator of minimum volume sets for novelty detection,” Operations Research, vol. 58, no. 5, pp. 1469–1480, 2010.
  • [16] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [17] G. Alain and Y. Bengio, “What regularized auto-encoders learn from the data-generating distribution,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3563–3593, 2014.
  • [18] Y. Bengio, L. Yao, G. Alain, and P. Vincent, “Generalized denoising auto-encoders as generative models,” in Advances in Neural Information Processing Systems, 2013, pp. 899–907.
  • [19] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [20] Y. LeCun, “The mnist database of handwritten digits,” http://yann.lecun.com/exdb/mnist/, 1998.
  • [21] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint arXiv:1312.6199, 2013.
  • [22] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
  • [23] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
  • [24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

Giacomo Spigler was born in 1990. He received the B.Sc. (Hons.) Degree and Diploma (Hons.) in computer engineering from the University of Pisa and Sant’Anna School of Advanced Studies, respectively, and the M.Sc. (Hons.) in Cognitive Science (Computational Neuroscience and Neuroinformatics) from the University of Edinburgh. He is currently finishing his Ph.D. in Computational Neuroscience at the University of Sheffield. His current research interests include deep neural networks and continual learning.
