Deep pNML: Predictive Normalized Maximum Likelihood for Deep Neural Networks
The Predictive Normalized Maximum Likelihood (pNML) scheme has been recently suggested for universal learning in the individual setting, where both the training and test samples are individual data. The goal of universal learning is to compete with a “genie” or reference learner that knows the data values, but is restricted to use a learner from a given model class. The pNML minimizes the associated regret for any possible value of the unknown label. Furthermore, its min-max regret can serve as a pointwise measure of learnability for the specific training and data sample.
In this work we examine the pNML and its associated learnability measure for the Deep Neural Network (DNN) model class. As shown, the pNML outperforms the commonly used Empirical Risk Minimization (ERM) approach and provides robustness against adversarial attacks. Together with its learnability measure it can detect out of distribution test examples, be tolerant to noisy labels and serve as a confidence measure for the ERM. Finally, we extend the pNML to a “twice universal” solution, that provides universality for model class selection and generates a learner competing with the best one from all model classes.
Koby Bibas, School of Electrical Engineering, Tel Aviv University
Yaniv Fogel, School of Electrical Engineering, Tel Aviv University
Meir Feder, School of Electrical Engineering, Tel Aviv University
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
1 Introduction
In the common situation of supervised machine learning, a trainset is given to a learner, consisting of $N$ pairs of examples, $z^N = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is the data or the feature and $y_i$ is the label. Then, a new $x$ is given and the task is to predict its corresponding label $y$.
The formal definition of the learning problem includes a loss function that measures the accuracy of the prediction. Here we will assume that the learner assigns a probability $q(\cdot \mid x)$ to the next label, and we will use the log-loss:
$$\ell(q; x, y) = -\log q(y \mid x) \tag{1}$$
Clearly, a reasonable goal is to find the predictor with the minimal loss. However, this problem is ill-posed unless additional assumptions are made.
First, a “model” class, or “hypotheses” class, must be defined. This class is a set of conditional probability distributions
$$P_\Theta = \{\, p_\theta(y \mid x),\ \theta \in \Theta \,\} \tag{2}$$
where $\Theta$ is a general index set. This is equivalent to saying that there is a set of stochastic functions used to explain the relation between $x$ and $y$.
Next, assumptions must be made on how the data and the labels are generated. The most common setting in learning theory is Probably Approximately Correct (PAC), established in [1], where $x$ and $y$ are assumed to be generated by some source $P(x, y)$. The conditional distribution $P(y \mid x)$ is not necessarily a member of the model class $P_\Theta$. The goal is to perform as well as a learner that knows the true probability.
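Another possible setting for the learning problem, recently suggested in [2] following, e.g., [3], is the individual setting, where the data and labels of both the training and test are specific and individual values. In this setting the goal is to compete with a “genie” or a reference learner that knows the desired label value, but is restricted to use a model from the given hypotheses class $P_\Theta$, and that does not know which of the samples is the test. Given the trainset $z^N = \{(x_i, y_i)\}_{i=1}^{N}$ and the test pair $(x, y)$, this learner then chooses:
$$\hat{\theta}(z^N; x, y) = \arg\max_{\theta \in \Theta} \left[ p_\theta(y \mid x) \prod_{i=1}^{N} p_\theta(y_i \mid x_i) \right] \tag{3}$$
The log-loss difference between a universal learner $q$ and the reference is the regret:
$$R(z^N; x, y; q) = \log \frac{p_{\hat{\theta}(z^N; x, y)}(y \mid x)}{q(y \mid x; z^N)} \tag{4}$$
The chosen universal learner solves the min-max problem:
$$\Gamma = R^*(z^N; x) = \min_{q} \max_{y} R(z^N; x, y; q) \tag{5}$$
This min-max optimal solution, termed Predictive Normalized Maximum Likelihood (pNML), is obtained using “equalizer” reasoning:
$$q_{\mathrm{pNML}}(y \mid x; z^N) = \frac{p_{\hat{\theta}(z^N; x, y)}(y \mid x)}{\sum_{y'} p_{\hat{\theta}(z^N; x, y')}(y' \mid x)} \tag{6}$$
and its corresponding regret, independent of the true $y$, is:
$$\Gamma = \log \sum_{y'} p_{\hat{\theta}(z^N; x, y')}(y' \mid x) \tag{7}$$
Note that this deviates from the commonly-used Empirical Risk Minimization (ERM) approach [6], where in the log-loss case the learner chooses the model that assigns the maximal probability for the trainset:
$$\hat{\theta}_{\mathrm{ERM}}(z^N) = \arg\max_{\theta \in \Theta} \prod_{i=1}^{N} p_\theta(y_i \mid x_i) \tag{8}$$
Nevertheless, it turns out that $\Gamma$ can also be used to obtain a bound on the performance of the ERM.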
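As a minimal sketch of the normalization step behind the pNML, assume the per-label genie probabilities are already available, i.e. for each candidate label the probability that the model fitted with the test sample carrying that label assigns to that same label. The function name `pnml_from_genie_probs` and the example numbers are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def pnml_from_genie_probs(genie_probs):
    """Normalize per-label genie probabilities into a pNML assignment.

    genie_probs[y] is assumed to be the probability that the model fitted
    with the test sample labeled y gives to that same label y.
    """
    genie_probs = np.asarray(genie_probs, dtype=float)
    normalizer = genie_probs.sum()        # sum over all candidate labels
    q_pnml = genie_probs / normalizer     # valid probability assignment
    regret = np.log(normalizer)           # min-max regret (natural log)
    return q_pnml, regret

# Hypothetical genie probabilities for a 3-label problem.
genie = np.array([0.9, 0.3, 0.3])
q, gamma = pnml_from_genie_probs(genie)
```

Because every label is divided by the same normalizer, the log-ratio between the genie probability and the pNML probability is the same for all labels, which is the equalizer property that makes the regret independent of the true label.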
The pNML has been derived for several model classes in related works, such as the barrier (or 1-D perceptron) model, and in [7] for the linear regression problem. In all these cases the advantages of the pNML and its corresponding learnability measure were discussed.
In some cases there are several possible model classes, or several possible algorithms. In this case, one can use a “twice-universal” approach to achieve near-optimal performance not just within a class but over all possible classes. In this approach, the best pNML learner from each of $M$ model classes is chosen, $q_k(y \mid x)$ for $k = 1, \dots, M$. Then, another pNML procedure is executed over all of these learners:
$$q_{\mathrm{TU}}(y \mid x) = \frac{\max_{k} q_k(y \mid x)}{\sum_{y'} \max_{k} q_k(y' \mid x)} \tag{9}$$
Main contributions. This paper’s contribution is essentially in demonstrating the pNML technique for DNN model classes. We show that the pNML solution improves upon the ERM learner’s performance in accuracy and log-loss on the testset, especially in the worst-case performance. The pNML is also more robust to noisy labels in the trainset and to adversarial attacks in the test. Furthermore, we show that the pNML regret $\Gamma$ may serve as a confidence measure, or learnability measure, for both the pNML and the ERM, and that it can be used to point out when the trainset is composed of random labels and when the test sample is out of distribution or adversarial. Finally, we demonstrate that the twice universal pNML scheme over the number of fine-tuned layers of DNNs achieves even better performance than both the pNML and the ERM.
2 Related Work
In this section, we briefly describe related work in universal prediction, model generalization, out of distribution detection and adversarial attack robustness.
Universal Prediction and Online Learning. In online learning, prediction is done on a sequentially revealed sequence, where the loss is accumulated, and as the predicted label is revealed it essentially becomes part of the training. This problem is an extension of the well-studied work on universal prediction, which is essentially online learning with no features $x$; see the comprehensive survey [3]. In that setting, the universal prediction solution with log-loss provides a “universal probability” for the entire sequence, $q(y^N)$, which can be converted to a sequential prediction strategy via the chain rule, $q(y^N) = \prod_{t=1}^{N} q(y_t \mid y^{t-1})$. The universal probability is given by either a Bayesian mixture (stochastic setting) or by the normalized maximum likelihood (NML) [5] (individual setting). Both solutions solve a corresponding min-max problem. Online learning with a feature sequence $x^N$ is less understood, as it has one important difference: the chain rule does not hold for conditional probabilities. Yet in [8] a universal learning solution is proposed that achieves vanishing regret in various settings.
Model Generalization. Understanding the model generalization capabilities is considered a fundamental problem in machine learning [9]. This problem was mainly considered in the PAC learning framework, where several measures such as the VC-dimension and Rademacher complexity were suggested. However, those measures seem to fail in explaining the generalization of DNNs, see [10]. In this paper we advocate another measure, the pNML regret, which seems to work for DNNs.
Out of distribution. The problem of detecting out of distribution samples has been studied extensively for deep learning models. It can be treated as a classification problem where one tries to determine whether a test sample is from a different distribution than the training data. Some works detect out of distribution test samples by post-processing the output of the model [11, 12] in order to generate an uncertainty measure. Others require a modification of the neural network or the training process [13, 14]. In [15] it was suggested to use dropout as a Bayesian inference approximation and by that get the variance of the prediction for each test sample. Again, in this paper we provide another technique to detect this situation.
Adversarial Attack. DNN models are vulnerable to adversarially perturbed samples that were designed to mislead a model at inference time. A number of defensive techniques against adversarial attacks on DNNs have been proposed [16, 17]. Adversarial training [18, 19] appears to have the best results for learning robust models; however, it mostly protects against the attack it was trained for. Unlike these methods, the method we present, based on the pNML and its learnability measure, can be utilized for any learner, does not use adversarial samples in the training phase, and is not restricted to an architecture or a learning process, and so it may improve any suggested procedure.
3 Deep pNML
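Recall the pNML. Denote the normalization factor:
$$K = \sum_{y'} p_{\hat{\theta}(z^N; x, y')}(y' \mid x)$$
where $\hat{\theta}(z^N; x, y')$ is the model fitted to the trainset $z^N$ together with the test sample $x$ labeled $y'$, and so the pNML learner (6) can be rewritten as:
$$q_{\mathrm{pNML}}(y \mid x; z^N) = \frac{1}{K}\, p_{\hat{\theta}(z^N; x, y)}(y \mid x)$$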
Intuitively, to assign a probability for a possible outcome, the pNML adds it to the trainset, finds the best-suited model, and takes the probability it gives to that label. It then follows this procedure for every label and normalizes to get a valid probability assignment. This method can be extended to any general learning procedure that generates a prediction based on a trainset. Such an algorithm can be the stochastic gradient descent used to train neural networks.
Our implementation of the pNML for DNNs (code will be available upon publication) is described in Algorithm 1 and consists of two steps. First, we conduct an initial training where we train the DNN using a given trainset with stochastic gradient descent (SGD). This is done using given hyperparameters: the initial weights, learning rate, weight decay and number of epochs. Then, given a specific test example $x$, we examine every possible label $y$ and conduct the fine-tuning: we add the pair $(x, y)$ to the trainset, and perform several more epochs of SGD with the same hyperparameters. The pNML predictor for that label will simply be the predicted probability of that label after the fine-tuning, normalized by the summation of all the corresponding probabilities to get a valid probability assignment. Note that during the fine-tuning phase we did not necessarily change all the weights: some of them were selected only based on the initial training. When presenting our results in Section 4 we will explicitly state which weights were allowed to change and which were “frozen”.
For the rest of the paper, when referring to the ERM we will actually refer to the DNN whose weights were given by the initial training. The genie will be the DNN whose weights were obtained after the fine-tuning with the true label. The pNML predictor will be the one obtained by taking the predictions for each label, each with the corresponding fine-tuned weights, and normalizing to get a valid probability function.
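The two-step procedure above (initial training, then per-label fine-tuning and normalization) can be sketched on a toy problem. This is only an illustration under stated assumptions: a linear softmax classifier stands in for the DNN, full-batch gradient descent stands in for SGD, and the names `gradient_epochs` and `pnml_predict`, the synthetic data and all hyperparameter values are hypothetical rather than taken from the paper.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def gradient_epochs(W, X, Y, epochs, lr):
    """Full-batch gradient steps on the mean log-loss of a linear softmax model."""
    onehot = np.eye(W.shape[1])[Y]
    for _ in range(epochs):
        probs = softmax(X @ W)
        W = W - lr * X.T @ (probs - onehot) / len(X)
    return W

def pnml_predict(W_erm, X_train, Y_train, x_test, num_labels, epochs=10, lr=0.1):
    """Fine-tune once per candidate label, then normalize the genie probabilities."""
    X_aug = np.vstack([X_train, x_test])          # trainset plus the test sample
    genie_probs = np.zeros(num_labels)
    for y in range(num_labels):
        Y_aug = np.append(Y_train, y)             # test sample labeled y
        W_y = gradient_epochs(W_erm.copy(), X_aug, Y_aug, epochs, lr)
        genie_probs[y] = softmax(x_test[None, :] @ W_y)[0, y]
    normalizer = genie_probs.sum()
    return genie_probs / normalizer, np.log(normalizer)

# Toy data: two classes separated by the sign of the first feature.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 4))
Y_train = (X_train[:, 0] > 0).astype(int)

# Initial training gives the ERM weights; fine-tuning starts from them.
W_erm = gradient_epochs(np.zeros((4, 2)), X_train, Y_train, epochs=200, lr=0.5)
x_test = rng.normal(size=4)
q_pnml, regret = pnml_predict(W_erm, X_train, Y_train, x_test, num_labels=2)
```

In this sketch all weights are updated during fine-tuning; freezing part of the model, as done in the experiments, would amount to updating only a subset of the parameters inside `gradient_epochs`.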
Twice universality. In many problems the best model class is unknown, and therefore being universal with respect to a number of model classes can improve the performance of the learner. As part of our research we implemented various versions of the pNML, where during the fine-tuning phase only some layers of the DNN are updated. This can be seen as a nested hierarchy of model classes, where the richest class is the one where all the layers are updated in the fine-tuning phase, and the smallest class is that where none of them are updated and essentially the only model in the class is the ERM. We treated each of these variants as a different model class and evaluated their predictors based on the pNML scheme in Algorithm 1. Then, we executed the twice-universal algorithm described in (9) and in Algorithm 2.
4 Experiments
This section describes the application of the pNML in deep neural networks. First, we show how the pNML can improve upon the ERM learner in both the log-loss and accuracy sense. Then, we examine a trainset where the labels are assigned randomly, and show, again, that the pNML outperforms the ERM, and that the amount of noise in the trainset can be detected using the pNML regret. Next, we use the pNML regret measure for detecting out of distribution samples. We further demonstrate that with the pNML we create a learner that is more robust to adversarial samples. Finally, we execute the twice universality algorithm on the number of fine-tuned layers, and we show that it outperforms the best learner tuned to each class.
4.1 pNML prediction performance
For our first experiment we used the ResNet-18 architecture [20] and the CIFAR10 dataset [21]. We executed initial training and fine-tuning as described in Algorithm 1. The initial training consisted of 200 epochs using stochastic gradient descent with a batch size of 128 and a learning rate of 0.1, decreased to 0.01 and 0.001 after 100 and 150 epochs respectively. In addition, we used a momentum of 0.9, weight decay, and standard normalization and data augmentation techniques. During fine-tuning we allowed updates of the weights of the last residual block along with the last fully connected layer, a total of 37,504 trainable parameters. The fine-tuning consisted of 10 epochs with a learning rate of 0.001.
Table 1: Performance of the ERM, pNML and genie on the CIFAR10 testset. Columns: Method, Acc., Loss mean, Loss STD.
The comparison between the ERM, pNML and the genie for the 10,000 test samples of the CIFAR10 dataset is summarized in Table 1. We can see an improvement in the log-loss with a slightly better accuracy rate. More notable is the improvement in the standard deviation of the log-loss, which suggests that the pNML manages to avoid large losses. This property of the pNML can also be observed in the log-loss histogram of the ERM and the pNML, presented in Figure 1(a).
4.2 The regret as a confidence measure
Figure 1(b) shows the histogram of the regret with a distinction between correctly and incorrectly predicted samples. The two regret histograms are clearly distinct, and the histogram for correctly classified samples indicates lower typical regret values than that of incorrectly classified samples. Another interesting plot is given in Figure 2. This figure shows a scatter plot of the log-loss of both the pNML and ERM as a function of the pNML regret of the test samples, along with the empirical marginal distributions of the loss and the regret in semi-log scale. One can clearly see that for large regret there is a tendency for high log-loss of the ERM, and in these cases for the same sample the ERM loss is higher (sometimes much higher) than that of the pNML. One can also observe the almost straight line representing the pNML loss as a function of the regret $\Gamma$. This is because the pNML loss is $\Gamma$ plus the genie loss, and so $\Gamma$ may be interpreted as an “insurance” the pNML pays to protect against large loss.
These results lead us to conclude that the regret $\Gamma$ is indeed a valid confidence measure for both the pNML and the ERM. To deepen our understanding of that conclusion, we checked the performance of both predictors when taking into account only samples whose regret is not larger than some threshold, as shown in Figure 3. When predicting only the roughly 70% of the test samples whose regret is below the corresponding threshold, the pNML predictor correctly classifies 99% of them with log-loss smaller than 0.041. When classifying the 90% of the test samples with the smallest regret, the pNML’s accuracy is 0.947 and the log-loss is smaller than 0.1187. The ERM’s results are similar, indicating that the regret is also a good confidence measure for the ERM.
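The thresholding described above can be sketched as a small selective-prediction helper: keep only the test samples whose regret falls below the quantile matching a desired coverage, and report the accuracy on the kept samples. The function name `accuracy_at_coverage` and the five example values are hypothetical and are not the paper's data.

```python
import numpy as np

def accuracy_at_coverage(regret, correct, coverage):
    """Accuracy over the samples whose regret is below the coverage quantile."""
    regret = np.asarray(regret, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    threshold = np.quantile(regret, coverage)   # regret threshold for this coverage
    kept = regret <= threshold                  # samples we are willing to predict
    return threshold, kept.mean(), correct[kept].mean()

# Hypothetical per-sample regrets and correctness flags.
regret = [0.01, 0.02, 0.05, 0.30, 0.90]
correct = [True, True, True, False, True]
thr, kept_fraction, acc = accuracy_at_coverage(regret, correct, coverage=0.6)
```

Sweeping `coverage` from 0 to 1 would trace an accuracy-versus-coverage curve of the kind the threshold analysis relies on.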
4.3 Random labels
Recent evidence shows that it is possible to train DNNs on data with randomly generated labels and still get an accuracy rate of 100% over the trainset [10]. This may question the generalization property of DNNs, as clearly such a model cannot generalize despite its perfect fit to the data. Interestingly, the regret (7), which measures the distance from the best model, can be used in order to reflect the generalization capabilities for a specific trainset.
In order to examine this situation we trained a WideResNet-18 model [22] on the CIFAR10 dataset, where some of the trainset samples were assigned random labels. We performed the initial training using stochastic gradient descent with a learning rate of 0.01 for 350 epochs without any regularization or data augmentation. We ensured that the model accuracy rate on the trainset is 1.0. The fine-tuning phase consisted of 6 epochs with a learning rate of 0.01, changing the weights of only the last residual block along with the last fully connected layer. We repeated the experiment for 300 test samples with a variety of probabilities for a train sample’s label to be random: 0.0, 0.2, 0.4, 0.6, 0.8 and 1.0.
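A minimal sketch of how such a noisy trainset could be produced: each label is replaced, with a given probability, by a label drawn uniformly at random. Whether the random draw may coincide with the original label is an assumption here, and the function name `randomize_labels` and the seed are hypothetical.

```python
import numpy as np

def randomize_labels(labels, prob_random, num_labels, seed=0):
    """Replace each label with a uniformly random one with probability prob_random."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    mask = rng.random(noisy.shape[0]) < prob_random      # which samples get a random label
    noisy[mask] = rng.integers(0, num_labels, size=int(mask.sum()))
    return noisy, mask

# Hypothetical clean labels for a 10-class problem.
clean = np.arange(1000) % 10
noisy_none, mask_none = randomize_labels(clean, prob_random=0.0, num_labels=10)
noisy_all, mask_all = randomize_labels(clean, prob_random=1.0, num_labels=10)
```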
The results of this experiment are presented in Table 2. Note that the regret is considerably larger when a significant part of the trainset has random labels, allowing us to detect when the model was trained in such a manner. In addition, it is evident that the pNML is more robust to noisy training data in the log-loss sense than the ERM, while there is no significant difference in the accuracy.
4.4 Detecting Out of Distribution examples
Another situation where one can use a confidence measure is detecting out of distribution examples. Theoretically, a good learner or predictor should assign a low confidence to its predictions regarding out of distribution examples. Nevertheless, DNNs frequently produce high-confidence predictions, arguably because softmax probabilities are computed with the fast-growing exponential function. Thus, it would be interesting to see if the regret $\Gamma$ is indeed large for out of distribution examples.
To check that, we performed the initial training with the ResNet-18 architecture and the CIFAR10 dataset as in 4.1. We then performed the fine-tuning phase, again in a similar manner to that described in 4.1, examining 10,000 test samples from CIFAR10, 100 test samples from SVHN [23] and 100 random Gaussian noise images.
The regret histograms for the different testsets are presented in Figure 4, showing a clear distinction between in distribution and out of distribution samples.
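To further evaluate the performance of the regret as a way to detect out of distribution samples, we used common metrics that measure the differences between two probability distributions. Let $P$ denote the probability assignment of the in distribution samples and $Q$ the probability assignment of the out of distribution samples. The first metric to measure the difference between the probabilities is the average Kullback-Leibler (KL) divergence:
$$D_{\mathrm{KL}}^{\mathrm{avg}}(P, Q) = \frac{1}{2}\left[ D_{\mathrm{KL}}(P \,\|\, Q) + D_{\mathrm{KL}}(Q \,\|\, P) \right]$$
The second evaluation metric is the Bhattacharyya distance [24], which measures the amount of overlap in the probability assignments:
$$D_{B}(P, Q) = -\ln \sum_{u} \sqrt{P(u)\, Q(u)}$$
The last metric comes from the likelihood ratio test (LRT), see [25]. Denote the probability assignment
$$P_\lambda(u) = \frac{P^{\lambda}(u)\, Q^{1-\lambda}(u)}{\sum_{u'} P^{\lambda}(u')\, Q^{1-\lambda}(u')}$$
and let $\lambda^*$ be the point for which the KL divergence between $P_{\lambda^*}$ and $P$ equals the KL divergence between $P_{\lambda^*}$ and $Q$. The value of the KL divergence at this point is the distance
$$D_{\mathrm{LRT}}(P, Q) = D_{\mathrm{KL}}(P_{\lambda^*} \,\|\, P) = D_{\mathrm{KL}}(P_{\lambda^*} \,\|\, Q)$$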
We compare our method to the maximum probability method, which determines if the test sample is from the in distribution or out of distribution based on the maximum of the probability assignment [11], and to an uncertainty criterion computed from $p_1$ and $p_2$, where $p_1$ is the maximum probability of the prediction and $p_2$ is the second largest value of the probability assignment [26].
The comparison is shown in Table 3. One can note that by all the metrics we considered, the regret is indeed a significantly better classifier of out of distribution samples than the max probability and uncertainty measures for both noise images and SVHN images.
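The three separation metrics used in this comparison can be sketched for two discrete histograms as follows. This is an illustration under assumptions: the histograms share the same bins, a small constant `eps` is added to avoid empty bins, the average KL divergence is taken as the mean of the two directions, and the LRT-based distance is found by bisection over the tilting parameter. All function names are hypothetical.

```python
import numpy as np

def as_dist(p, eps=1e-12):
    """Smooth and normalize a histogram into a strictly positive distribution."""
    p = np.asarray(p, dtype=float) + eps
    return p / p.sum()

def kl(p, q):
    """KL divergence between two strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

def symmetric_kl(p, q):
    p, q = as_dist(p), as_dist(q)
    return 0.5 * (kl(p, q) + kl(q, p))

def bhattacharyya_distance(p, q):
    p, q = as_dist(p), as_dist(q)
    return float(-np.log(np.sum(np.sqrt(p * q))))

def lrt_distance(p, q, iters=60):
    """KL divergence at the tilted distribution that is equidistant from p and q."""
    p, q = as_dist(p), as_dist(q)
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        p_lam = as_dist(p ** lam * q ** (1.0 - lam))
        gap = kl(p_lam, p) - kl(p_lam, q)
        if gap > 0:      # tilted distribution still closer to q: move toward p
            lo = lam
        else:
            hi = lam
    return kl(p_lam, p)

# Hypothetical two-bin histograms.
p_in = [0.5, 0.5]
q_out = [0.9, 0.1]
d_kl = symmetric_kl(p_in, q_out)
d_b = bhattacharyya_distance(p_in, q_out)
d_lrt = lrt_distance(p_in, q_out)
```

Larger values of all three quantities indicate that the in distribution and out of distribution histograms are easier to tell apart.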
Figure 5 (Top) shows the histogram of in distribution and out of distribution test images based on the max probability of the ERM output and Figure 5 (Middle) shows the histogram based on the uncertainty criterion. The histogram of in distribution and out of distribution test images based on the pNML regret is shown in Figure 5 (Bottom). Clearly, when we categorize samples according to the regret there is a significant separation between the two distributions, whereas when using the max probability criterion or the uncertainty criterion the two histograms overlap almost entirely.
4.5 Adversarial Attack
We evaluated the performance of our approach on adversarial samples. Our goal is both to detect adversarial samples and to devise a learner that is robust against them.
We created the adversarial samples using the Fast Gradient Sign Method (FGSM) [18]: First, we computed the sign of the loss function’s gradients with respect to the input pixels, and then multiplied these signs by a small value $\epsilon$ to create a perturbation. The adversarial image is the original image after the addition of the perturbation:
$$x_{\mathrm{adv}} = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x L(\theta, x, y)\right)$$
We generated adversarial samples on the CIFAR10 dataset for 500 test samples using a different model than the one used for testing (black box setting), namely ResNet-32, with the perturbation strength $\epsilon$ equal to 0, 0.001, 0.005, 0.01 and 0.05 (the images were normalized such that the pixel values are between -1.0 and 1.0).
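The FGSM step can be sketched on a toy model for which the input gradient is available in closed form. This is an illustration only: a linear softmax classifier with hypothetical random weights `W` replaces the DNN, the attack is computed on the same model rather than in the black box setting, and the function name `fgsm_attack` is hypothetical. Inputs are assumed to lie in [-1, 1], and the result is clipped to that range.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fgsm_attack(x, y, W, epsilon, low=-1.0, high=1.0):
    """One FGSM step for a linear softmax model with logits W.T @ x."""
    probs = softmax(W.T @ x)
    onehot = np.eye(W.shape[1])[y]
    grad_x = W @ (probs - onehot)              # gradient of the log-loss w.r.t. x
    x_adv = x + epsilon * np.sign(grad_x)      # step in the loss-increasing direction
    return np.clip(x_adv, low, high)           # stay in the valid pixel range

# Hypothetical model weights and input.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))
x = rng.uniform(-1.0, 1.0, size=8)
y = 1
x_adv = fgsm_attack(x, y, W, epsilon=0.05)
loss_clean = -np.log(softmax(W.T @ x)[y])
loss_adv = -np.log(softmax(W.T @ x_adv)[y])
```

For a DNN the gradient with respect to the input would come from automatic differentiation instead of the closed-form expression used here.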
We implemented the pNML with the same parameters as in 4.1 and with the ResNet-18 architecture, and tested on adversarial samples. The performance of the pNML, ERM and the genie for each $\epsilon$ is presented in Table 4 and in Figure 6. We can see that the pNML and the ERM have similar performance in accuracy. However, the log-loss of the pNML is much better for every $\epsilon$; in the most pronounced case the pNML log-loss is 1.72 compared with the ERM log-loss of 3.22. It is interesting to note that the genie, which is trained with the true test label, also has a significant decrease in accuracy, reaching only 53% success at one of the tested perturbation strengths.
Table 4 shows that when the perturbation strength is increased, the mean regret is also increased. This behavior implies that the adversarial image might be detected based on the regret.
The evaluation of adversarial image detection using the regret is shown in Table 5 with the same metrics as in 4.4. It shows that there is some difference in the regret between regular and adversarial samples, which grows as $\epsilon$ increases. Nevertheless, the difference is much more subtle than that obtained in the out of distribution detection case, even for large values of $\epsilon$.
4.6 Twice Universality
In addition to the universality inside a model class $P_\Theta$, a second universality can be made regarding the index of multiple model families. This is the “twice universal” approach. For the DNN models we considered 3 model classes: The first class is the one described in 4.1, where the fine-tuning affected only the last two layers. In the second class, fine-tuning affected all layers. The third class is simply the DNN obtained after the initial training, which is the ERM learner.
Our next results were obtained by twice-universality over these 3 classes: For each test sample and for each label, we took the maximum of the pNML probabilities over the 3 model classes. Then we normalized the probability of the test sample in order to create a valid probability assignment, as described in Algorithm 2.
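The max-then-normalize step described above can be sketched as follows, assuming the per-class predictors for one test sample are stacked as rows of an array. The function name `twice_universal` and the example probabilities are hypothetical.

```python
import numpy as np

def twice_universal(class_predictions):
    """Combine per-class predictors: take the max per label, then normalize.

    class_predictions has shape (num_model_classes, num_labels); each row is
    the probability assignment of one model class for the same test sample.
    """
    preds = np.asarray(class_predictions, dtype=float)
    best = preds.max(axis=0)              # best probability per label over the classes
    normalizer = best.sum()
    return best / normalizer, np.log(normalizer)

# Hypothetical predictions of 3 model classes over 3 labels.
preds = [[0.7, 0.2, 0.1],
         [0.5, 0.4, 0.1],
         [0.6, 0.1, 0.3]]
q_tu, extra_regret = twice_universal(preds)
```

The logarithm of the normalizer is the price paid, in log-loss, for not knowing in advance which model class is best for the given sample.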
Table 6 summarizes the results on CIFAR10. The twice universal predictor is better than any other class predictor in the log-loss sense. In addition, its accuracy is 0.92 which is as good as the 2 layer model.
We also examined the twice universality method on the MNIST dataset [27]. The network used was a simple multilayer perceptron [28] with 10 hidden units. The initial training consisted of 100 epochs with a learning rate of 0.01 for the first 10 epochs, 0.01 for the next 30, 0.001 for the next 40 and 0.0001 for the rest. Next, we executed the pNML procedure of Algorithm 1. The fine-tuning phase consisted of 10 epochs with a learning rate of 0.0001, changing either the two layers or only the last one.
The results of the twice universality method on MNIST dataset are also presented in Table 6. The twice universal learner has the best performance, both in the log-loss and the accuracy rate. Since this learner is close to the optimal one for every specific sample, it seems that it avoids the pitfalls of each of the other learners and thus manages to outperform them.
5 Conclusion
In this paper, we presented the Deep pNML learner in the individual setting with respect to the log-loss function. We showed that the pNML scheme outperforms the ERM in the accuracy and log-loss sense, is more robust to a noisy trainset, and has better performance on adversarial samples. In addition, the inherent regret property of the pNML can detect out of distribution and adversarial samples as well as noisy trainsets. Finally, we introduced the twice universality concept and showed its performance on the CIFAR10 and MNIST datasets with different DNN architectures, making it universal over any hyper-parameter choice, yet still at least as good as the best one.
Future work. There is ongoing work on the theory of the pNML and its generalizations, including relating the pNML procedure to stability analysis. In the more practical aspects related to DNNs, it is interesting to explore the effect of using a varying number of epochs in the fine-tuning phase. This may enhance our understanding of DNNs, and perhaps improve the performance. In addition, we plan to try the pNML for various network architectures and hyper-parameters, with the hope of improving performance and gaining more insight into how they affect the generalization abilities of the network.
References
[1] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
[2] Yaniv Fogel and Meir Feder. Universal batch learning with log-loss in the individual setting. arXiv, 2018.
[3] Neri Merhav and Meir Feder. Universal prediction. IEEE Transactions on Information Theory, 44(6):2124–2147, 1998.
[4] Yaniv Fogel and Meir Feder. Universal supervised learning for individual data. arXiv preprint arXiv:1812.09520v1, pages 1–15, 2018.
[5] Yurii Mikhailovich Shtarkov. Universal sequential coding of single messages. Problemy Peredachi Informatsii, 23(3):3–17, 1987.
[6] Vladimir Vapnik. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems, pages 831–838, 1992.
[7] Koby Bibas, Yaniv Fogel, and Meir Feder. New look at an old problem: A universal learning approach to linear regression. Submitted, ISIT, 2019.
[8] Yaniv Fogel and Meir Feder. On the problem of on-line learning with log-loss. In IEEE International Symposium on Information Theory - Proceedings, pages 2995–2999, 2017.
[9] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
[10] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
[11] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
[12] Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017.
[13] Jerone T. A. Andrews, Thomas Tanay, Edward J. Morton, and Lewis D. Griffin. Transfer representation-learning for anomaly detection. In ICML, 2016.
[14] Gabi Shalev, Yossi Adi, and Joseph Keshet. Out-of-distribution detection using multiple semantic label representations. In Advances in Neural Information Processing Systems, pages 7386–7396, 2018.
[15] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
[16] Aran Nayebi and Surya Ganguli. Biologically inspired protection of deep networks from adversarial attacks. arXiv preprint arXiv:1703.09202, 2017.
[17] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. arXiv preprint arXiv:1704.08847, 2017.
[18] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples, 2015.
[19] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[21] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/~kriz/cifar.html, 2014.
[22] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks, 2016.
[23] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, page 5, 2011.
[24] Thomas Kailath. The divergence and Bhattacharyya distance measures in signal selection. IEEE Transactions on Communication Technology, 15(1):52–60, 1967.
[25] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[26] Noam Mor and Lior Wolf. Confidence prediction for lexicon-free OCR. arXiv preprint arXiv:1805.11161, 2018.
[27] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. Online: http://yann.lecun.com/exdb/mnist/, 2010.
[28] Umut Orhan, Mahmut Hekim, and Mahmut Ozer. EEG signals classification using the k-means clustering and a multilayer perceptron neural network model. Expert Systems with Applications, 38(10):13475–13481, 2011.