Deep pNML: Predictive Normalized Maximum Likelihood for Deep Neural Networks


Koby Bibas
School of Electrical Engineering
Tel Aviv University
kobybibas@gmail.com
Yaniv Fogel
School of Electrical Engineering
Tel Aviv University
Yaniv.fogel8@gmail.com
Meir Feder
School of Electrical Engineering
Tel Aviv University
meir@eng.tau.ac.il
Abstract

The Predictive Normalized Maximum Likelihood (pNML) scheme has been recently suggested for universal learning in the individual setting, where both the training and test samples are individual data. The goal of universal learning is to compete with a “genie” or reference learner that knows the data values, but is restricted to use a learner from a given model class. The pNML minimizes the associated regret for any possible value of the unknown label. Furthermore, its min-max regret can serve as a pointwise measure of learnability for the specific training and data sample.

In this work we examine the pNML and its associated learnability measure for the Deep Neural Network (DNN) model class. As shown, the pNML outperforms the commonly used Empirical Risk Minimization (ERM) approach and provides robustness against adversarial attacks. Together with its learnability measure it can detect out of distribution test examples, be tolerant to noisy labels and serve as a confidence measure for the ERM. Finally, we extend the pNML to a “twice universal” solution, that provides universality for model class selection and generates a learner competing with the best one from all model classes.

 


33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

In the common situation of supervised machine learning, a trainset $z^N = \{(x_i, y_i)\}_{i=1}^{N}$ is given to a learner, consisting of pairs of examples, where $x_i$ is the data (or feature) and $y_i$ is the label. Then, a new test feature $x$ is given and the task is to predict its corresponding label $y$.

The formal definition of the learning problem includes a loss function that measures the accuracy of the prediction. Here we will assume that the learner assigns a probability $q(y|x)$ to the next label, and we will use the log-loss:

$\ell(q; x, y) = -\log q(y|x)$    (1)

Clearly, a reasonable goal is to find the predictor with the minimal loss. However, this problem is ill-posed unless additional assumptions are made.

First, a “model” class, or “hypotheses” class, must be defined. This class is a set of conditional probability distributions

$P_\Theta = \{\, p_\theta(y|x),\ \theta \in \Theta \,\}$    (2)

where $\Theta$ is a general index set. This is equivalent to saying that there is a set of stochastic functions used to explain the relation between $x$ and $y$.

Next, assumptions must be made on how the data and the labels are generated. The most common setting in learning theory is Probably Approximately Correct (PAC), established in [1], where $x$ and $y$ are assumed to be generated by some probability source $P(x, y)$, which is not necessarily a member of $P_\Theta$. The goal is to perform as well as a learner that knows the true probability.

Another possible setting for the learning problem, recently suggested in [2] following, e.g., [3], is the individual setting, where the data and labels of both the training and test are specific, individual values. In this setting the goal is to compete with a “genie”, or a reference learner, that knows the desired label value but is restricted to use a model from the given hypotheses class $P_\Theta$, and that does not know which of the samples is the test. This learner chooses:

$\hat{\theta}(z^N; x, y) = \arg\max_{\theta \in \Theta}\ p_\theta(y|x)\prod_{i=1}^{N} p_\theta(y_i|x_i)$    (3)

The log-loss difference between a universal learner $q$ and the reference is the regret:

$R(q; z^N, x, y) = \log \dfrac{p_{\hat{\theta}(z^N; x, y)}(y|x)}{q(y|x)}$    (4)

As advocated in [4], the chosen universal learner solves:

$q_{\mathrm{pNML}} = \arg\min_{q} \max_{y} R(q; z^N, x, y)$    (5)

This min-max optimal solution, termed Predictive Normalized Maximum Likelihood (pNML), is obtained using “equalizer” reasoning, following [5]:

$q_{\mathrm{pNML}}(y|x) = \dfrac{p_{\hat{\theta}(z^N; x, y)}(y|x)}{\sum_{y'} p_{\hat{\theta}(z^N; x, y')}(y'|x)}$    (6)

and its corresponding regret, independent of the true label $y$, is:

$\Gamma = \log \sum_{y'} p_{\hat{\theta}(z^N; x, y')}(y'|x)$    (7)
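To make (6) and (7) concrete, the following minimal Python sketch computes the pNML probability assignment and its regret from the per-label genie probabilities; the function name and the example values are illustrative, not taken from the paper.

```python
import numpy as np

def pnml_from_genie_probs(genie_probs):
    """Compute the pNML assignment (6) and regret (7) from the per-label
    genie probabilities p_{theta_hat(z^N; x, y')}(y' | x)."""
    genie_probs = np.asarray(genie_probs, dtype=np.float64)
    normalizer = genie_probs.sum()           # K in (10)
    q_pnml = genie_probs / normalizer        # valid probability assignment (6)
    regret = np.log(normalizer)              # min-max regret Gamma (7)
    return q_pnml, regret

# Example with three candidate labels; the values are illustrative only.
q, gamma = pnml_from_genie_probs([0.9, 0.3, 0.2])
print(q, gamma)   # q sums to 1; gamma > 0 whenever the genie probabilities sum above 1
```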

Note that this deviates from the commonly used Empirical Risk Minimization (ERM) approach [6], where in the log-loss case the learner chooses the model that assigns the maximal probability to the trainset:

$\hat{\theta}_{\mathrm{ERM}}(z^N) = \arg\max_{\theta \in \Theta} \prod_{i=1}^{N} p_\theta(y_i|x_i)$    (8)

Nevertheless, it turns out that $\Gamma$ can also be used to obtain a bound on the performance of the ERM, see [2].

The pNML has been derived for several model classes in related works: in [4] for, e.g., the barrier (or 1-D perceptron) model, and in [7] for the linear regression problem. In all these cases the advantages of the pNML and its corresponding learnability measure were discussed.

In some cases there are several possible model classes, or several possible algorithms. In such cases one can use a “twice-universal” approach, see [2], to achieve near-optimal performance not just within a class but over all possible classes. In this approach, the pNML learner $q_i$ is first derived for each of the $M$ model classes. Then, another pNML procedure is executed over all of these learners:

$q_{\mathrm{TU}}(y|x) = \dfrac{\max_{1 \le i \le M} q_i(y|x)}{\sum_{y'} \max_{1 \le i \le M} q_i(y'|x)}$    (9)

Main contributions. This paper's contribution is essentially in demonstrating the pNML technique for DNN model classes. We show that the pNML solution improves upon the ERM learner's accuracy and log-loss on the testset, especially in the worst case. The pNML is also more robust to noisy labels in the trainset and to adversarial attacks at test time. Furthermore, we show that $\Gamma$ may serve as a confidence measure, or learnability measure, for both the pNML and the ERM, and that it can be used to point out when the trainset is composed of random labels and when the test sample is out of distribution or adversarial. Finally, we demonstrate that the twice universal pNML scheme, applied over the number of fine-tuned layers of the DNN, performs even better than both the pNML and the ERM.

Paper Structure. The paper is organized as follows. Section 2 presents related work. Section 3 describes our learning method, while Section 4 shows the different applications of our method. Section 5 presents the conclusions and possible future work.

2 Related Work

In this section, we briefly describe related work in universal prediction, model generalization, out of distribution detection and adversarial attack robustness.

Universal Prediction and Online Learning. In online learning, prediction is done on a sequentially revealed sequence, where the loss is accumulated and, as the predicted label is revealed, it essentially becomes part of the training. This problem is an extension of the well-studied work on universal prediction, which is essentially online learning with no features $x$; see the comprehensive survey [3]. As described in [3], the universal prediction solution with log-loss provides a “universal probability” $q(y_1, \dots, y_N)$ for the entire sequence, which can be converted to a sequential prediction strategy via the chain rule $q(y_t|y^{t-1}) = q(y^t)/q(y^{t-1})$. The universal probability is given by either a Bayesian mixture (stochastic setting) or by the normalized maximum likelihood (NML) [5] (individual setting). Both solutions solve a corresponding min-max problem. Online learning with a feature sequence $x^N$ is less understood, as it has one important difference: the chain rule does not hold for conditional probabilities. Yet in [8] a universal learning solution is proposed that achieves vanishing regret in various settings.

Model Generalization. Understanding the generalization capabilities of a model is considered a fundamental problem in machine learning [9]. This problem was mainly considered in the PAC learning framework, where several measures such as the VC-dimension and Rademacher complexity were suggested. However, these measures seem to fail in explaining the generalization of DNNs, see [10]. In this paper we advocate another measure, the pNML regret, which seems to work for DNNs.

Out of distribution. The problem of detecting out of distribution samples has been studied extensively for deep learning models. It can be treated as a classification problem where one tries to determine whether a test sample comes from a different distribution than the training data. Some works detect out of distribution test samples by post-processing the output of the model [11, 12] in order to generate an uncertainty measure. Others require a modification of the neural network or the training process [13, 14]. In [15] it was suggested to use dropout as a Bayesian inference approximation and thereby obtain the variance of the prediction for each test sample. Again, in this paper we provide another technique to detect this situation.

Adversarial Attack. DNN models are vulnerable to adversarially perturbed samples designed to mislead the model at inference time. A number of defensive techniques against adversarial attacks on DNNs have been proposed [16, 17]. Adversarial training [18, 19] appears to yield the most robust models; however, it mostly protects against the attack it was trained on. Unlike these methods, the method we present, based on the pNML and its learnability measure, can be utilized with any learner, does not use adversarial samples in the training phase, and is not restricted to a specific architecture or learning process, so it may improve any suggested procedure.

3 Deep pNML

Recall the pNML. Denote the normalization factor:

$K = \sum_{y'} p_{\hat{\theta}(z^N; x, y')}(y'|x)$    (10)

and so the pNML learner (6) can be rewritten as:

$q_{\mathrm{pNML}}(y|x) = \dfrac{1}{K}\, p_{\hat{\theta}(z^N; x, y)}(y|x)$    (11)

Intuitively, to assign a probability to a possible outcome, the pNML adds it to the trainset, finds the best-suited model, and takes the probability that model gives to that label. It then follows this procedure for every label and normalizes to get a valid probability assignment. This method can be extended to any general learning procedure that generates a prediction based on a trainset; such a procedure can be the stochastic gradient descent used to train neural networks.

Our implementation of the pNML for DNNs (code will be available upon publication) is described in Algorithm 1 and consists of two steps. First, we conduct an initial training, where we train the DNN on the given trainset with stochastic gradient descent (SGD). This is done using given hyperparameters $w_0$, $\eta$, $\lambda$ and $T_0$, which are the initial weights, learning rate, weight decay and number of epochs, respectively. Then, given a specific test example $x$, we examine every possible label $y'$ and conduct the fine-tuning: we add the pair $(x, y')$ to the trainset and perform several more SGD epochs ($T_f$ epochs) with the same hyperparameters. The pNML prediction for that label is simply the probability assigned to it after the fine-tuning, normalized by the sum of all the corresponding probabilities to get a valid probability assignment. Note that during the fine-tuning phase we do not necessarily change all the weights; some of them are kept at the values obtained in the initial training. When presenting our results in Section 4 we explicitly state which weights were allowed to change and which were “frozen”.

For the rest of the paper, when referring to the ERM we will actually refer to the DNN whose weights were given by the initial training. The genie will be the DNN whose weights were obtained after the fine-tuning with the true label. The pNML predictor will be the one obtained by taking the predictions for each label, each with the corresponding fine-tuned weights, and normalizing to get a valid probability function.

Algorithm 1: Deep pNML
Input: trainset $z^N$, test feature $x$, initial weights $w_0$, learning rate $\eta$, weight decay $\lambda$, initial-training epochs $T_0$, fine-tuning epochs $T_f$.
Initial training: starting from $w_0$, train the DNN on $z^N$ with SGD($\eta$, $\lambda$) for $T_0$ epochs to obtain weights $w$.
For every candidate label $y'$: add $(x, y')$ to $z^N$, fine-tune $w$ with SGD for $T_f$ epochs to obtain $w_{y'}$, and record $p_{y'} = p_{w_{y'}}(y'|x)$.
Return the pNML assignment $q_{\mathrm{pNML}}(y'|x) = p_{y'} / \sum_{y''} p_{y''}$ and the regret $\Gamma = \log \sum_{y''} p_{y''}$.

Algorithm 2: Twice Universality
Input: test sample $x$, pNML learners $q_1, \dots, q_M$.
For every candidate label $y'$: compute $\tilde{q}(y') = \max_{1 \le i \le M} q_i(y'|x)$.
Return $q_{\mathrm{TU}}(y'|x) = \tilde{q}(y') / \sum_{y''} \tilde{q}(y'')$.
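As a rough illustration of the fine-tuning phase of Algorithm 1, here is a PyTorch-style sketch. The function and variable names are ours; the trained model, data loader and hyperparameters are assumed given, and for brevity the test pair is appended to every SGD mini-batch rather than inserted once into the trainset as described in the paper.

```python
import copy
import math
import torch
import torch.nn.functional as F

def deep_pnml_predict(model, train_loader, x_test, num_classes,
                      lr=1e-3, fine_tune_epochs=10, device="cpu"):
    """Sketch of the fine-tuning phase of Algorithm 1.

    `model` is assumed to be the network obtained from the initial (ERM)
    training and `train_loader` iterates over the original trainset.
    For each candidate label we clone the model, let SGD see (x_test, y')
    together with the training data, and record the probability the
    fine-tuned network assigns to y'.
    """
    genie_probs = []
    for y_cand in range(num_classes):
        net = copy.deepcopy(model).to(device).train()
        optimizer = torch.optim.SGD(net.parameters(), lr=lr)
        for _ in range(fine_tune_epochs):
            for x_batch, y_batch in train_loader:
                # For brevity the test pair is appended to every mini-batch;
                # the paper adds it once to the trainset.
                xb = torch.cat([x_batch, x_test.unsqueeze(0)]).to(device)
                yb = torch.cat([y_batch, torch.tensor([y_cand])]).to(device)
                optimizer.zero_grad()
                F.cross_entropy(net(xb), yb).backward()
                optimizer.step()
        net.eval()
        with torch.no_grad():
            probs = F.softmax(net(x_test.unsqueeze(0).to(device)), dim=1)
        genie_probs.append(probs[0, y_cand].item())
    normalizer = sum(genie_probs)                     # K in (10)
    q_pnml = [p / normalizer for p in genie_probs]    # pNML assignment (11)
    regret = math.log(normalizer)                     # regret Gamma in (7)
    return q_pnml, regret
```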

Twice universality. In many problems the best model class is unknown, and therefore being universal with respect to a number of model classes can improve the performance of the learner. As part of our research we implemented various versions of the pNML, where during the fine-tuning phase only some layers of the DNN are updated. This can be seen as a nested hierarchy of model classes, where the richest class is the one in which all the layers are updated in the fine-tuning phase, and the smallest class is the one in which none of them are updated, so that the only model in the class is essentially the ERM. We treated each of these variants as a different model class $P_{\Theta_i}$ and evaluated its predictor $q_i$ using the pNML scheme of Algorithm 1. Then, we executed the twice-universal procedure described in (9) and in Algorithm 2, as sketched below.
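A minimal sketch of the twice-universal combination step of Algorithm 2, assuming the per-class pNML assignments have already been computed; the names and example values are illustrative.

```python
import numpy as np

def twice_universal(q_per_class):
    """Combine pNML assignments from several model classes (Algorithm 2).

    q_per_class[i][k] is the probability that the pNML learner of model
    class i assigns to label k for the current test sample.
    """
    q_max = np.max(np.asarray(q_per_class, dtype=np.float64), axis=0)
    return q_max / q_max.sum()   # renormalize to a valid assignment, as in (9)

# Example with three classes (e.g. 0, 2 and 7 fine-tuned layers);
# the numbers are illustrative only.
q_tu = twice_universal([[0.70, 0.20, 0.10],
                        [0.60, 0.30, 0.10],
                        [0.50, 0.25, 0.25]])
print(q_tu)
```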

4 Applications

This section describes the application of the pNML in deep neural networks. First, we show how the pNML can improve upon the ERM learner in both the log-loss and accuracy sense. Then, we examine a trainset where the labels are assigned randomly, and show, again, that the pNML outperforms the ERM, and that the amount of noise in the trainset can be detected using the pNML regret. Next, we use the pNML regret measure for detecting out of distribution samples. We further demonstrate that with the pNML we create a learner that is more robust to adversarial samples. Finally, we execute the twice universality algorithm on the number of fine-tuned layers, and we show that it outperforms the best learner tuned to each class.

4.1 pNML prediction performance

For our first experiment we used the ResNet-18 architecture [20] and the CIFAR10 dataset [21]. We executed the initial training and fine-tuning as described in Algorithm 1. The initial training consisted of 200 epochs of stochastic gradient descent with a batch size of 128 and a learning rate of 0.1, decreased to 0.01 and 0.001 after 100 and 150 epochs respectively. In addition, we used a momentum of 0.9, weight decay, and standard normalization and data augmentation techniques. During fine-tuning we allowed updates of the weights of the last residual block along with the last fully connected layer, a total of 37,504 trainable parameters. The fine-tuning consisted of 10 epochs with a learning rate of 0.001.


Method | Acc. | Loss mean | Loss STD
ERM | 0.9183 | 0.194 | 0.822
pNML | 0.9184 | 0.167 | 0.314
Genie | 0.9875 | 0.027 | 0.231
Table 1: Performance comparison. Performance of the learning methods on the CIFAR10 testset using the ResNet-18 architecture.

The comparison between the ERM, the pNML and the genie over the 10,000 test samples of the CIFAR10 dataset is summarized in Table 1. We can see an improvement in the log-loss with a slightly better accuracy rate. More notable is the improvement in the standard deviation of the log-loss, which suggests that the pNML manages to avoid large losses. This property of the pNML can also be observed in the log-loss histograms of the ERM and the pNML, presented in Figure 1(a).

(a) Log-loss histogram for pNML and ERM.
(b) pNML regret histogram.
Figure 1: Log-loss and regret histograms. Subfigure (a) shows the log-loss for the CIFAR10 testset with the ResNet-18 architecture; the histogram is plotted for both the pNML (6) and the ERM (8), with density in semi-log scale and log-loss in base 10. Subfigure (b) presents the regret histogram, separated into correctly classified and misclassified samples; it is plotted for the pNML learner with the ResNet-18 architecture, trained and tested on the CIFAR10 dataset.

Figure 2: Scatter plot of the log-loss and the regret. (Bottom-left) Scatter plot of the log-loss versus the regret of the pNML and the ERM on the CIFAR10 testset using the ResNet-18 architecture. (Top) The marginal empirical distribution of the pNML regret in semi-log scale. (Right) The marginal empirical distribution of the pNML log-loss in semi-log scale.

4.2 The regret as a confidence measure

Figure 1(b) shows the histogram of the regret with a distinction between correctly and incorrectly predicted samples. Clearly, the two regret histograms are distinct, and the histogram of the correctly classified samples indicates lower typical regret values than that of the incorrectly classified ones. Another interesting plot is given in Figure 2. This figure shows a scatter plot of the log-loss of both the pNML and the ERM as a function of the pNML regret of the test samples, along with the empirical marginal distributions of the loss and the regret in semi-log scale. One can clearly see that large regret tends to come with high ERM log-loss, and in these cases the ERM loss on the same sample is higher (sometimes much higher) than that of the pNML. One can also observe the almost straight line representing the pNML loss as a function of $\Gamma$. This is because the pNML loss equals $\Gamma$ plus the genie loss, and so $\Gamma$ may be interpreted as an “insurance” premium the pNML pays to protect against large loss.


Figure 3: Regret based classifier performance. (Top) The accuracy rate over all CIFAR10 test samples with regret smaller than the regret threshold. (Middle) The log-loss of the learners as a function of the regret threshold. (Bottom) The Cumulative Distribution Function (CDF) of the regret of the test samples.

These results lead us to conclude that $\Gamma$ is indeed a valid confidence measure for both the pNML and the ERM. To deepen our understanding of this conclusion, we checked the performance of both predictors when taking into account only samples whose regret is not larger than some threshold, as shown in Figure 3. When predicting on about 70% of the test samples, those with the smallest regret, the pNML predictor correctly classifies 99% of them with a log-loss smaller than 0.041. When classifying the 90% of the test samples with the smallest regret, the pNML's accuracy is 0.947 and the log-loss is smaller than 0.1187. The ERM's results are similar, indicating that the regret is also a good confidence measure for the ERM.
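The selective evaluation behind Figure 3 can be sketched as follows, assuming arrays of per-sample regrets, log-losses and correctness indicators are available; this is our illustration, not the authors' code.

```python
import numpy as np

def selective_metrics(regrets, losses, correct, threshold):
    """Keep only test samples whose pNML regret is below `threshold` and
    report the coverage, accuracy and mean log-loss of the retained subset,
    mirroring the curves of Figure 3."""
    regrets = np.asarray(regrets)
    keep = regrets <= threshold
    coverage = keep.mean()
    accuracy = np.asarray(correct, dtype=float)[keep].mean()
    mean_loss = np.asarray(losses, dtype=float)[keep].mean()
    return coverage, accuracy, mean_loss
```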

4.3 Random labels

Recent evidence shows that it is possible to train DNNs on data with randomly generated labels and still get an accuracy rate of 100% over the trainset [10]. This calls into question the generalization property of DNNs, as such a model clearly cannot generalize despite its perfect fit to the data. Interestingly, the regret (7), which measures the distance from the best model in the class, can be used to reflect the generalization capabilities for a specific trainset.

In order to examine this situation we trained the WideResNet-18 model [22] on the CIFAR10 dataset, where some of the trainset samples were assigned random labels. We performed the initial training using stochastic gradient descent with a learning rate of 0.01 for 350 epochs, without any regularization or data augmentation. We verified that the model's accuracy rate on the trainset is 1.0. The fine-tuning phase consisted of 6 epochs with a learning rate of 0.01, changing the weights of only the last residual block along with the last fully connected layer. We repeated the experiment for 300 test samples, with each training label assigned at random with probability 0.0, 0.2, 0.4, 0.6, 0.8 or 1.0.


Property | 0.0 | 0.2 | 0.4 | 0.6 | 0.8 | 1.0
ERM acc. | 0.84 | 0.72 | 0.46 | 0.31 | 0.15 | 0.09
pNML acc. | 0.85 | 0.74 | 0.47 | 0.30 | 0.17 | 0.09
ERM loss | 0.51 | 1.00 | 2.63 | 3.67 | 5.49 | 6.29
pNML loss | 0.49 | 0.83 | 0.90 | 0.97 | 1.12 | 1.28
Regret | 0.49 | 0.81 | 0.86 | 0.91 | 0.90 | 0.89
Table 2: Random labels performance. Learner performance on the CIFAR10 testset when each trainset label is random with probability 0.0, 0.2, 0.4, 0.6, 0.8 or 1.0.

The results of this experiment are presented in Table 2. Note that the regret is considerably larger when a significant part of the trainset has random labels, allowing us to detect when the model was trained in such a manner. In addition, it is evident that the pNML is more robust to noisy training data in the log-loss sense than the ERM, while there is no significant difference in the accuracy.

4.4 Detecting Out of Distribution examples

Another situation where one can use a confidence measure is detecting out of distribution examples. Theoretically, a good learner or predictor should assign low confidence to its predictions on out of distribution examples. Nevertheless, DNNs frequently produce high-confidence predictions, arguably because softmax probabilities are computed with the fast-growing exponential function [11]. Thus, it is interesting to see whether $\Gamma$ is indeed large for out of distribution examples.


Figure 4: Out of distribution regret histogram. The regret histogram of test samples from CIFAR10, SVHN and random noise images. The pNML was trained using CIFAR10 trainset with ResNet-18 architecture.

Testset | Criterion | KL | Bhattacharyya | LRT
Noise | Max prob | 0.794 | 0.188 | 0.113
Noise | $p_1/p_2$ | 0.643 | 0.162 | 0.164
Noise | Regret | 3.583 | 0.652 | 0.658
SVHN | Max prob | 0.777 | 0.189 | 0.108
SVHN | $p_1/p_2$ | 0.701 | 0.152 | 0.153
SVHN | Regret | 2.950 | 0.517 | 0.535
Table 3: Out of distribution detection. Comparison between the different criteria for noise images and for images from the SVHN dataset. The metrics are the mean KL divergence (12), the Bhattacharyya distance (13), and the likelihood ratio test distance (15). The higher the value of a metric, the better the detection of out of distribution samples.

To check this, we performed the initial training with the ResNet-18 architecture and the CIFAR10 dataset as in Section 4.1. We then performed the fine-tuning phase, again in a manner similar to that described in Section 4.1, examining 10,000 test samples from CIFAR10, 100 test samples from SVHN [23] and 100 random Gaussian noise images.

The regret histograms for the different testsets are presented in Figure 4, showing a clear distinction between in distribution and out of distribution samples.

To further evaluate the performance of the regret as a way to detect out of distribution samples, we used common metrics that measure the differences between two probability distributions. Let $P_{\mathrm{in}}$ denote the probability assignment of the in distribution samples and $P_{\mathrm{out}}$ the probability assignment of the out of distribution samples. The first metric to measure the difference between the probabilities is the average Kullback-Leibler (KL) divergence:

$D_{\mathrm{KL}}(P_{\mathrm{in}}, P_{\mathrm{out}}) = \dfrac{1}{2}\left[ \sum_{x} P_{\mathrm{in}}(x) \log \dfrac{P_{\mathrm{in}}(x)}{P_{\mathrm{out}}(x)} + \sum_{x} P_{\mathrm{out}}(x) \log \dfrac{P_{\mathrm{out}}(x)}{P_{\mathrm{in}}(x)} \right]$    (12)

The second evaluation metric is the Bhattacharyya distance [24], which measures the amount of overlap between the two probability assignments:

$D_{B}(P_{\mathrm{in}}, P_{\mathrm{out}}) = -\ln \sum_{x} \sqrt{P_{\mathrm{in}}(x)\, P_{\mathrm{out}}(x)}$    (13)

The last metric comes from the likelihood ratio test (LRT), see [25]. Denote the probability assignment

$P_{\lambda}(x) = \dfrac{P_{\mathrm{in}}^{\lambda}(x)\, P_{\mathrm{out}}^{1-\lambda}(x)}{\sum_{x'} P_{\mathrm{in}}^{\lambda}(x')\, P_{\mathrm{out}}^{1-\lambda}(x')}$    (14)

and let $\lambda^{*}$ be the point for which the KL divergence between $P_{\lambda^{*}}$ and $P_{\mathrm{in}}$ equals the KL divergence between $P_{\lambda^{*}}$ and $P_{\mathrm{out}}$. The value of the KL divergence at this point is the distance

$C(P_{\mathrm{in}}, P_{\mathrm{out}}) = D\left(P_{\lambda^{*}} \,\|\, P_{\mathrm{in}}\right) = D\left(P_{\lambda^{*}} \,\|\, P_{\mathrm{out}}\right)$    (15)

We compare our method to the maximum probability criterion, which determines whether a test sample is in distribution or out of distribution based on the maximum of the probability assignment [11], and to the ratio $p_1/p_2$, where $p_1$ is the maximum probability of the prediction and $p_2$ is the second largest value of the probability assignment [26].

The comparison is shown in Table 3. One can see that by all the metrics we considered, the regret is a significantly better detector of out of distribution samples than the max probability and $p_1/p_2$ uncertainty measures, for both noise images and SVHN images.
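For completeness, here is a sketch of how the separation metrics of Table 3 could be computed from the empirical (histogram) distributions of a criterion over in- and out-of-distribution samples. The interpretation of the “average” KL as a symmetrized mean and the grid search for $\lambda^*$ are our assumptions, not confirmed by the paper.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    # KL divergence between two histograms, with a small epsilon for stability.
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    return float(np.sum(p * np.log(p / q)))

def separation_metrics(p_in, p_out):
    """Separation between the in- and out-of-distribution histograms of a
    criterion (regret, max prob, or p1/p2), in the spirit of Table 3."""
    p_in = np.asarray(p_in, dtype=np.float64)
    p_out = np.asarray(p_out, dtype=np.float64)
    mean_kl = 0.5 * (kl(p_in, p_out) + kl(p_out, p_in))      # "average" KL (12), assumed symmetrized
    bhattacharyya = -np.log(np.sum(np.sqrt(p_in * p_out)))   # Bhattacharyya distance (13)
    # LRT distance (15): search for the tilted distribution P_lambda (14)
    # that is KL-equidistant from p_in and p_out, and report that KL value.
    best_gap, lrt = np.inf, 0.0
    for lam in np.linspace(0.0, 1.0, 101):
        p_lam = (p_in ** lam) * (p_out ** (1.0 - lam))
        p_lam /= p_lam.sum()
        gap = abs(kl(p_lam, p_in) - kl(p_lam, p_out))
        if gap < best_gap:
            best_gap, lrt = gap, kl(p_lam, p_in)
    return mean_kl, bhattacharyya, lrt
```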


Figure 5: In and out of distribution histograms. The empirical distributions of the different uncertainty measures. The learners were trained on the CIFAR10 trainset and tested on both the CIFAR10 testset and SVHN. (Top) The empirical distribution of CIFAR10 and SVHN test images based on the max probability output. (Middle) The empirical distribution based on the $p_1/p_2$ ratio. (Bottom) The empirical distribution based on the pNML regret.

Figure 5 (Top) shows the histogram of in distribution and out of distribution test images based on the max probability of the ERM output, and Figure 5 (Middle) shows the histogram based on the $p_1/p_2$ criterion. The histogram of in distribution and out of distribution test images based on the pNML regret is shown in Figure 5 (Bottom). Clearly, when we categorize samples according to the regret there is a significant separation between the two distributions, whereas when using the max probability criterion or the $p_1/p_2$ criterion the two histograms overlap almost entirely.

4.5 Adversarial Attack

We evaluated the performance of our approach on adversarial samples. Our goal is both to detect adversarial samples and to devise a learner that is robust against them.

We created the adversarial samples using the Fast Gradient Sign Method (FGSM) [18]: first, we computed the sign of the gradient of the loss function with respect to the input pixels, and then multiplied these signs by a small value $\epsilon$ to create a perturbation. The adversarial image is the original image after the addition of the perturbation:

$x_{\mathrm{adv}} = x + \epsilon \cdot \mathrm{sign}\left( \nabla_{x} L(w, x, y) \right)$    (16)

We generated adversarial samples on the CIFAR10 dataset using a different model than the one used for testing (a black box setting), namely ResNet-32, with $\epsilon$ equal to 0, 0.001, 0.005, 0.01 and 0.05 (the images were normalized such that the pixel values are between -1.0 and 1.0), for 500 test samples.
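A short PyTorch-style sketch of the FGSM attack (16) under the stated normalization; the function name is illustrative and the surrogate model is assumed given.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon):
    """Fast Gradient Sign Method (16). `model` is the surrogate (black box
    setting) network, `x` a batch of images normalized to [-1, 1], `y` the
    true labels, `epsilon` the perturbation strength."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step in the direction of the sign of the input gradient and clamp back
    # to the valid pixel range used in the experiments.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(-1.0, 1.0).detach()
```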

We implemented the pNML with the same parameters as in Section 4.1 and with the ResNet-18 architecture, and tested it on the adversarial samples. The performance of the pNML, the ERM and the genie for each $\epsilon$ is presented in Table 4 and in Figure 6. We can see that the pNML and the ERM have similar accuracy. However, the log-loss of the pNML is much better for every $\epsilon$, especially for $\epsilon = 0.05$, where the pNML log-loss is 1.72 compared with the ERM log-loss of 3.22. It is interesting to note that the genie, which is trained with the true test label, also suffers a significant decrease in accuracy, with only 53% success for $\epsilon = 0.05$.

Table 4 shows that as the perturbation strength increases, the mean regret increases as well. This behavior implies that adversarial images might be detected based on the regret.

The evaluation of adversarial image detection using the regret is shown in Table 5, with the same metrics as in Section 4.4. It shows that there is some difference in the regret between regular and adversarial samples, which grows as $\epsilon$ grows. Nevertheless, the difference is much more subtle than that obtained in the out of distribution detection case, even for large values of $\epsilon$.


$\epsilon$ | 0.0 | 0.001 | 0.005 | 0.01 | 0.05
ERM acc. | 0.92 | 0.88 | 0.77 | 0.68 | 0.26
pNML acc. | 0.92 | 0.87 | 0.77 | 0.69 | 0.27
Genie acc. | 0.99 | 0.96 | 0.92 | 0.86 | 0.53
ERM loss | 0.19 | 0.31 | 0.70 | 1.11 | 3.22
pNML loss | 0.18 | 0.20 | 0.35 | 0.52 | 1.72
Genie loss | 0.03 | 0.07 | 0.17 | 0.31 | 1.39
Regret | 0.15 | 0.13 | 0.18 | 0.22 | 0.33
Table 4: Adversarial attack performance. Learner performance on adversarial samples from the CIFAR10 testset. The adversarial samples were generated with the Fast Gradient Sign Method [18] in the black box setting. $\epsilon$ indicates the strength of the perturbation of the adversarial attack.

Figure 6: Adversarial samples log-loss. The log-loss of the ERM and the pNML on adversarial samples from the CIFAR10 testset. The adversarial samples were generated with the Fast Gradient Sign Method [18] in the black box setting. $\epsilon$ indicates the strength of the perturbation of the adversarial attack.

$\epsilon$ | KL | Bhattacharyya | LRT
0.05 | 0.67 | 0.17 | 0.17
0.01 | 0.14 | 0.03 | 0.03
0.005 | 0.05 | 0.01 | 0.01
0.001 | 0.03 | 0.01 | 0.01
Table 5: Adversarial attack detection. The metrics are the mean KL divergence (12), the Bhattacharyya distance (13) and the likelihood ratio test distance (15). The higher the distance, the better the adversarial samples are detected.

4.6 Twice Universality

In addition to the universality within a model class $P_{\Theta_i}$, a second level of universality can be applied over the index $i$ of multiple model classes. This is the “twice universal” approach. For the DNN models we considered three model classes: the first is the one described in Section 4.1, where the fine-tuning affects only the last two layers; in the second, fine-tuning affects all layers; the third is simply the DNN obtained after the initial training, which is the ERM learner.

Our next results were obtained by applying twice universality over these three classes: for each test sample and for each label, we took the maximum of the pNML probabilities over the three model classes. Then we normalized over the labels to create a valid probability assignment, as described in Algorithm 2.


Dataset | Model | Acc. | Log-loss
CIFAR10 | 0 layers | 0.920 | 0.203
CIFAR10 | 2 layers | 0.921 | 0.173
CIFAR10 | 7 layers | 0.913 | 0.244
CIFAR10 | Twice Univ. | 0.920 | 0.168
MNIST | 0 layers | 0.930 | 0.093
MNIST | 1 layer | 0.937 | 0.086
MNIST | 2 layers | 0.937 | 0.085
MNIST | Twice Univ. | 0.950 | 0.081
Table 6: Twice universality performance. The performance of the pNML learners with different numbers of fine-tuned layers, together with the performance of the twice universal learner. The learners were trained and tested on the CIFAR10 and MNIST datasets.

Table 6 summarizes the results on CIFAR10. The twice universal predictor is better than any single-class predictor in the log-loss sense. In addition, its accuracy is 0.92, which is as good as the 2-layer model.

We also examined the twice universality method on the MNIST dataset [27]. The network used was a simple multilayer perceptron [28] with 10 hidden units. The initial training consisted of 100 epochs, with a learning rate of 0.01 for the first 10 epochs, 0.01 for the next 30, 0.001 for the next 40 and 0.0001 for the rest. Next, we executed the pNML procedure of Algorithm 1. The fine-tuning phase consisted of 10 epochs with a learning rate of 0.0001, changing either both layers or only the last one.

The results of the twice universality method on MNIST dataset are also presented in Table 6. The twice universal learner has the best performance, both in the log-loss and the accuracy rate. Since this learner is close to the optimal one for every specific sample, it seems that it avoids the pitfalls of each of the other learners and thus manages to outperform them.

5 Conclusion

In this paper we presented the Deep pNML learner in the individual setting with respect to the log-loss function. We showed that the pNML scheme outperforms the ERM in both accuracy and log-loss, is more robust to noisy trainsets, and has better performance on adversarial samples. In addition, the inherent regret property of the pNML can detect out of distribution and adversarial samples as well as noisy trainsets. Finally, we introduced the twice universality concept and showed its performance on the CIFAR10 and MNIST datasets with different DNN architectures, making the learner universal with respect to this hyper-parameter choice, yet still at least as good as the best single class.

Future work. There is ongoing work on the theory of the pNML and its generalizations, including relating the pNML procedure to stability analysis. On the more practical side related to DNNs, it would be interesting to explore the effect of using a varying number of epochs in the fine-tuning phase. This may enhance our understanding of DNNs, and perhaps improve the performance. In addition, we plan to try the pNML on various network architectures and hyper-parameters, with the hope of improving performance and gaining more insight into how they affect the generalization abilities of the network.

References

  • [1] Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
  • [2] Yaniv Fogel and Meir Feder. Universal batch learning with log-loss in the individual setting. arXiv, 2018.
  • [3] Neri Merhav and Meir Feder. Universal prediction. IEEE Transactions on Information Theory, 44(6):2124–2147, 1998.
  • [4] Yaniv Fogel and Meir Feder. Universal Supervised Learning for Individual Data. arXiv preprint arXiv:1812.09520v1, pages 1–15, 2018.
  • [5] Yurii Mikhailovich Shtarkov. Universal sequential coding of single messages. Problemy Peredachi Informatsii, 23(3):3–17, 1987.
  • [6] Vladimir Vapnik. Principles of risk minimization for learning theory. In Advances in neural information processing systems, pages 831–838, 1992.
  • [7] Koby Bibas, Yaniv Fogel, and Meir Feder. New look at an old problem: A universal learning approach to linear regression. Submitted, ISIT, 2019.
  • [8] Yaniv Fogel and Meir Feder. On the problem of on-line learning with log-loss. IEEE International Symposium on Information Theory - Proceedings, pages 2995–2999, 2017.
  • [9] Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 2013.
  • [10] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
  • [11] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
  • [12] Shiyu Liang, Yixuan Li, and R Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017.
  • [13] Jerone TA Andrews, Thomas Tanay, Edward J Morton, and Lewis D Griffin. Transfer representation-learning for anomaly detection. In ICML, 2016.
  • [14] Gabi Shalev, Yossi Adi, and Joseph Keshet. Out-of-distribution detection using multiple semantic label representations. In Advances in Neural Information Processing Systems, pages 7386–7396, 2018.
  • [15] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
  • [16] Aran Nayebi and Surya Ganguli. Biologically inspired protection of deep networks from adversarial attacks. arXiv preprint arXiv:1703.09202, 2017.
  • [17] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. arXiv preprint arXiv:1704.08847, 2017.
  • [18] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples, 2015.
  • [19] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [21] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The cifar-10 dataset. online: http://www.cs.toronto.edu/~kriz/cifar.html, 2014.
  • [22] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks, 2016.
  • [23] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
  • [24] Thomas Kailath. The divergence and bhattacharyya distance measures in signal selection. IEEE transactions on communication technology, 15(1):52–60, 1967.
  • [25] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
  • [26] Noam Mor and Lior Wolf. Confidence prediction for lexicon-free ocr. arXiv preprint arXiv:1805.11161, 2018.
  • [27] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. online: http://yann.lecun.com/exdb/mnist/, 2010.
  • [28] Umut Orhan, Mahmut Hekim, and Mahmut Ozer. Eeg signals classification using the k-means clustering and a multilayer perceptron neural network model. Expert Systems with Applications, 38(10):13475–13481, 2011.