Improving CNN classifiers by estimating test-time priors


Milan Sulc, Jiri Matas
Dept. of Cybernetics, FEE CTU in Prague
Technicka 2, Prague, Czech Republic

The problem of different training and test set class priors is addressed in the context of CNN classifiers. An EM-based algorithm for test-time class priors estimation is evaluated on fine-grained computer vision problems for both the batch and on-line situations. Experimental results show a significant improvement on the fine-grained classification tasks using the known evaluation-time priors, increasing the top-1 accuracy by 4.0% on the FGVC iNaturalist 2018 validation set and by 3.9% on the FGVCx Fungi 2018 validation set. Iterative estimation of test-time priors on the PlantCLEF 2017 dataset increased the image classification accuracy by 3.4%, allowing a single CNN model to achieve state-of-the-art results and outperform the competition-winning ensemble of 12 CNNs.



Preprint. Work in progress.

1 Introduction

A common assumption of many machine learning algorithms is that the training set is independently sampled from the same data distribution as the test data [1, 4, 5]. In practical computer vision tasks, this assumption is often violated: training samples may be obtained from diverse sources where classes appear with frequencies differing from those at test time. For instance, for the task of fine-grained recognition of plant species from images, training examples can be downloaded from an online encyclopedia. However, the number of photographs of a species in the encyclopedia may not correspond to the species incidence or to the frequency with which a species is queried in a plant identification service.

In this paper, we show that state-of-the-art results can be obtained by expecting and adapting to the change of class priors. Methods [10, 2] for adjusting classifier outputs to new and unknown a priori probabilities were published years ago, yet the problem of changed class priors is commonly not addressed in computer vision tasks where the situation arises. Section 2 provides a formulation of the problem: a probabilistic interpretation of CNN classifier outputs in Section 2.1, compensation for the change in a-priori class probabilities in Section 2.2, and estimation of the new a-priori probabilities in Section 2.3.

The training set a-priori class probabilities can be easily empirically determined from the class frequencies in the training set. We also consider the more complex scenario where the training set (and its distribution) changes during training and fine-tuning.

Experiments in Section 3 show that the predictions of state-of-the-art Convolutional Neural Networks (CNN) on fine-grained image classification tasks can noticeably benefit from correcting the a priori probabilities. We evaluate the impact of the estimation of the a priori probabilities for the case when the whole test set is available to the classifier as well as the situation where the test images are classified on-line (sequentially).

Figure 1: The numbers of instances per class in the training and test data of the PlantCLEF 2017 plant recognition challenge have a long-tail distribution which is typical for many real-world fine-grained classification problems. The second and third plots zoom on the 20 most frequent species and the 3000 least frequent species.

2 Problem Formulation and Methodology

2.1 Probabilistic interpretation of CNN outputs

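Let us assume that a Convolutional Neural Network classifier is trained to provide an estimate of the posterior probabilities of classes $c_k$ given an image observation $x_i$:

$$f_{\mathrm{CNN}}(c_k \mid x_i, \theta) \approx p(c_k \mid x_i), \tag{1}$$

where $\theta$ are the parameters of the trained CNN.

This is a common interpretation of the process of training a deep network by minimizing the cross-entropy loss over samples $x_i$ with known class-membership labels $C_i$:

$$\theta^{*} = \arg\min_{\theta} \; - \sum_{i=1}^{N} \sum_{k=1}^{K} c_{ik} \log f_{\mathrm{CNN}}(c_k \mid x_i, \theta), \tag{2}$$

where $c_{ik}$ is a one-hot encoding of class label $C_i$:

$$c_{ik} = \begin{cases} 1 & \text{if } C_i = c_k, \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$

The cross-entropy minimization from Eq. 2 can be rewritten as a maximum a-posteriori (MAP) estimation:

$$\theta^{*} = \arg\max_{\theta} \prod_{i=1}^{N} \prod_{k=1}^{K} f_{\mathrm{CNN}}(c_k \mid x_i, \theta)^{c_{ik}} = \arg\max_{\theta} \prod_{i=1}^{N} f_{\mathrm{CNN}}(C_i \mid x_i, \theta). \tag{4}$$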
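The equivalence between the cross-entropy objective and the likelihood of the labeled classes can be checked numerically. The following is a minimal sketch in plain Python; the function names and the example probabilities are hypothetical and chosen only for illustration.

```python
import math


def cross_entropy(probs, labels, num_classes):
    """Cross-entropy objective: -sum_i sum_k c_ik * log f(c_k | x_i)."""
    total = 0.0
    for p_i, label in zip(probs, labels):
        # One-hot encoding c_ik of the class label C_i.
        one_hot = [1.0 if k == label else 0.0 for k in range(num_classes)]
        total -= sum(c * math.log(p) for c, p in zip(one_hot, p_i))
    return total


def negative_log_likelihood(probs, labels):
    """Negative log of the product of f(C_i | x_i) over all samples."""
    return -sum(math.log(p_i[label]) for p_i, label in zip(probs, labels))


# Hypothetical softmax outputs for three samples and three classes.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
labels = [0, 1, 2]
ce = cross_entropy(probs, labels, num_classes=3)
nll = negative_log_likelihood(probs, labels)
```

Both quantities agree up to floating-point rounding, so minimizing one is the same as maximizing the product of the probabilities assigned to the labeled classes.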
2.2 New a-priori class distribution

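When the prior class probabilities $p_e(c_k)$ in our validation/test set differ from the training set priors $p(c_k)$, the posterior $p_e(c_k \mid x_i)$ changes too (we use the index $e$, for evaluation, to denote all evaluation-time distributions). The probability density function $p(x_i \mid c_k)$, describing the statistical properties of observations $x_i$ of class $c_k$, remains unchanged:

$$p_e(x_i \mid c_k) = p(x_i \mid c_k). \tag{5}$$

Applying Bayes' theorem to both sides of Eq. 5 gives $p_e(c_k \mid x_i)\, p_e(x_i) / p_e(c_k) = p(c_k \mid x_i)\, p(x_i) / p(c_k)$. Since $\sum_{k=1}^{K} p_e(c_k \mid x_i) = 1$, the mutual relation of the posteriors is:

$$p_e(c_k \mid x_i) = p(c_k \mid x_i) \frac{p_e(c_k)\, p(x_i)}{p(c_k)\, p_e(x_i)} = \frac{p(c_k \mid x_i) \dfrac{p_e(c_k)}{p(c_k)}}{\sum_{j=1}^{K} p(c_j \mid x_i) \dfrac{p_e(c_j)}{p(c_j)}}. \tag{6}$$

The class priors $p(c_k)$ can be empirically quantified as the relative frequency of images labeled as $c_k$ in the training set. The test-time priors $p_e(c_k)$ are, however, often unknown at test time.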
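When the test-time priors are known, the correction amounts to re-weighting each class posterior by the ratio of the new prior to the training prior and renormalizing. A minimal sketch in plain Python follows; the function name `adjust_posteriors` and the example numbers are assumptions made for illustration, and all training priors are assumed to be strictly positive.

```python
def adjust_posteriors(posteriors, train_priors, test_priors):
    """Re-weight each row of posteriors by p_e(c_k) / p(c_k) and renormalize."""
    adjusted = []
    for row in posteriors:
        # Numerator of the correction for every class.
        weighted = [p * pe / pt for p, pe, pt in zip(row, test_priors, train_priors)]
        # The denominator makes the corrected posteriors sum to one.
        norm = sum(weighted)
        adjusted.append([w / norm for w in weighted])
    return adjusted


# Hypothetical two-class case: training set dominated by class 0,
# evaluation set balanced.
train_priors = [0.8, 0.2]
test_priors = [0.5, 0.5]
posteriors = [[0.6, 0.4]]
adjusted = adjust_posteriors(posteriors, train_priors, test_priors)
```

In this example the uncorrected output favors class 0, while the corrected posterior favors class 1, because part of the evidence for class 0 came only from its over-representation in the training set.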

2.3 Estimating the new a-priori probabilities

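Saerens et al. [10] proposed to approach the estimation of unknown test-time a priori probabilities by iteratively maximizing the likelihood of the $N$ test observations $x_1, \dots, x_N$:

$$L(x_1, \dots, x_N) = \prod_{i=1}^{N} p_e(x_i) = \prod_{i=1}^{N} \sum_{k=1}^{K} p_e(x_i, c_k) = \prod_{i=1}^{N} \sum_{k=1}^{K} p(x_i \mid c_k)\, p_e(c_k). \tag{7}$$

They derive a simple EM algorithm comprising the following steps:

$$p_e^{(s)}(c_k \mid x_i) = \frac{p(c_k \mid x_i) \dfrac{p_e^{(s)}(c_k)}{p(c_k)}}{\sum_{j=1}^{K} p(c_j \mid x_i) \dfrac{p_e^{(s)}(c_j)}{p(c_j)}}, \tag{8}$$

$$p_e^{(s+1)}(c_k) = \frac{1}{N} \sum_{i=1}^{N} p_e^{(s)}(c_k \mid x_i), \tag{9}$$

where Eq. 8 is the Expectation-step, Eq. 9 is the Maximization-step, and $p_e^{(0)}(c_k)$ may be initialized, for example, by the training set relative frequency $p(c_k)$.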

Du Plessis and Sugiyama [2] proved that this procedure is equivalent to a fixed-point iteration minimizing the KL divergence between the test observation density $p_e(x)$ and a linear combination of the class-wise densities $q_e(x) = \sum_{k=1}^{K} P_k\, p(x \mid c_k)$, where $P_k$ are the estimates of $p_e(c_k)$.
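The EM procedure of Eq. 8 and Eq. 9 can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function name, the iteration limit, and the stopping tolerance are assumptions, and the classifier outputs and training priors are assumed to be strictly positive.

```python
def estimate_test_priors(posteriors, train_priors, num_iters=100, tol=1e-8):
    """Estimate test-time class priors from classifier outputs by EM."""
    num_classes = len(train_priors)
    n = len(posteriors)
    priors = list(train_priors)  # initialization with the training priors
    for _ in range(num_iters):
        sums = [0.0] * num_classes
        for row in posteriors:
            # E-step (Eq. 8): posteriors under the current prior estimate.
            weighted = [p * pe / pt for p, pe, pt in zip(row, priors, train_priors)]
            norm = sum(weighted)
            for k, w in enumerate(weighted):
                sums[k] += w / norm
        # M-step (Eq. 9): new priors are the mean of the adjusted posteriors.
        new_priors = [s / n for s in sums]
        change = max(abs(a - b) for a, b in zip(new_priors, priors))
        priors = new_priors
        if change < tol:  # assumed stopping rule, not specified in the text
            break
    return priors


# Hypothetical outputs of a classifier trained with balanced priors:
# eight test samples look like class 0, two look like class 1.
posteriors = [[0.9, 0.1]] * 8 + [[0.1, 0.9]] * 2
estimated = estimate_test_priors(posteriors, train_priors=[0.5, 0.5])
```

For this toy input the estimate converges to roughly 0.875 for class 0, which is higher than the 0.8 share of samples predicted as class 0, since the procedure accounts for the uncertainty in each individual prediction.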


3 Experiments

Figure 2: Examples from the CIFAR-100 dataset.
Figure 3: Examples from FGVCx Fungi 2018 (top row), FGVC iNaturalist 2018 (middle row), and PlantCLEF 2017 (bottom row)

The following fine-grained classification datasets are used for experiments in this Section:

  1. CIFAR-100 is a popular dataset for smaller-scale fine-grained classification experiments, introduced by Krizhevsky and Hinton [8] in 2009. It contains small resolution (32x32) color images of 100 classes. While the dataset is balanced (with 500 training samples and 100 test samples for each class), we sample a number of its unbalanced subsets for our experiments in this Section.

  2. PlantCLEF 2017 [3] was a plant species recognition challenge organized as part of the LifeCLEF workshop [7]. The provided training images for 10,000 plant species consisted of an EOL "trusted" training set (downloaded from the Encyclopedia of Life), a significantly larger "noisy" training set (obtained from Google and Bing image search results, including mislabeled or irrelevant images), and images from previous years (2015-2016) depicting only a subset of the species. We use the training data in two ways: either training on all the sets together (including the "noisy" set), further denoted as PlantCLEF-All, or excluding the "noisy" set (i.e. using the 2017 EOL data and the previous years' data), further denoted as PlantCLEF-Trusted. The test set from the PlantCLEF 2017 challenge is used for evaluation. All data is publicly available. PlantCLEF presents an example of a real-world fine-grained classification task, where the number of available images per class is highly unbalanced.

  3. FGVC iNaturalist 2018 is a large-scale species classification competition, organized with the FGVC5 workshop at CVPR 2018. The provided dataset covers 8,142 species of plants, animals and fungi. The training set is highly unbalanced and contains almost 440K images. A balanced validation set of 24K images is provided.

  4. FGVCx Fungi 2018 is another species classification competition, focused only on fungi, also organized with the FGVC5 workshop at CVPR 2018. The dataset covers nearly 1,400 fungi species. The training set contains almost 86K images and is highly unbalanced. The validation set is balanced, with 4,182 images in total.

Examples from the datasets are displayed in Figures 2 and 3.

3.1 Validation of posterior estimates on the training set

Figure 4: Comparison of class frequency and CNN output marginalization over all images in the training and test sets sampled from CIFAR-100.

Before considering the change in class priors, let us validate that the marginalization of CNN predictions on training and validation data estimates the class priors well:

$$p(c_k) \approx \frac{1}{N} \sum_{i=1}^{N} f_{\mathrm{CNN}}(c_k \mid x_i, \theta). \tag{10}$$

We simulated normal and exponential prior class distributions by randomly picking subsets of the CIFAR-100 database that follow the chosen distributions. A 32-layer Residual Network (implementation from [6]) was trained on the training subsets. The comparison of empirical class frequencies and the estimates obtained by marginalizing the CNN outputs (i.e. averaging CNN predictions) is displayed in Figure 4. The training set class distributions are estimated almost perfectly. The estimates on the test set are noisier, but still approximate the class frequencies well.
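The comparison described above boils down to averaging the predicted class probabilities over a set of images and comparing the result with the empirical class frequencies. A minimal sketch in plain Python, with hypothetical function names and made-up predictions for illustration:

```python
def marginal_class_estimate(posteriors):
    """Average the predicted class probabilities over all images."""
    n = len(posteriors)
    num_classes = len(posteriors[0])
    return [sum(row[k] for row in posteriors) / n for k in range(num_classes)]


def class_frequencies(labels, num_classes):
    """Empirical relative frequency of each class label."""
    counts = [0] * num_classes
    for label in labels:
        counts[label] += 1
    return [c / len(labels) for c in counts]


# Hypothetical predictions for four images and their ground-truth labels.
posteriors = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
labels = [0, 0, 1, 0]
estimate = marginal_class_estimate(posteriors)
frequency = class_frequencies(labels, num_classes=2)
```

With only four images the two vectors differ visibly; the agreement reported in Figure 4 relies on averaging over many images.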

3.2 Adjusting posterior probabilities when test-time priors are known


Figure 5: Training and validation set distributions (left) and accuracy before and after correcting predictions with the known/uniform val. set distribution (right) for FGVC iNaturalist 2018 (top) and FGVCx Fungi 2018 (bottom)

For experiments with known test-time prior probabilities $p_e(c_k)$, we use the training and validation sets from the FGVC iNaturalist Competition 2018 and the FGVCx Fungi Classification Competition 2018. In these challenges, the validation sets are balanced (i.e. the class prior distribution is uniform). A state-of-the-art Convolutional Neural Network, Inception-v4 [11], was fine-tuned for each task. The predictions were corrected as prescribed by Equation 6.

Figure 5 displays the training and evaluation distributions and the improvement in accuracy achieved by correcting the predictions with the known priors. The improvement in top-1 accuracy is 4.0% and 3.9% after 400K training steps (and up to 7.4% and 4.9% during fine-tuning) for the FGVC iNaturalist and FGVCx Fungi classification challenges respectively.

3.3 Adjusting posterior probabilities when the whole test set with unknown priors is available at test-time

We choose the PlantCLEF 2017 challenge test set as an example of a test environment where no knowledge about the class distribution was available. The training set is highly unbalanced and the test set statistics do not follow the training set statistics very well, see Figure 1.

We used an Inception-V4 model pre-trained on all available training data (PlantCLEF-All). The results in Table 1 show that the top-1 accuracy increases by 3.4% when estimating the test set priors using the EM algorithm of Saerens et al. [10] (Eqs. 8 and 9). To compare with the results of the 2017 challenge, we combine the predictions per specimen observation (the test set contained several images per specimen, linked by ObservationID meta-data) and compute the observation-identification accuracy. Note that after the test set prior estimation, our single CNN model outperforms the winning submission of PlantCLEF 2017, composed of 12 very deep CNN models (ResNet-152, ResNeXt-101 and GoogLeNet architectures).
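The text does not state how the per-image predictions of one observation are combined. One simple option, assumed here purely for illustration, is to average the class probabilities of all images sharing an ObservationID. The function name and the sample data below are hypothetical.

```python
from collections import defaultdict


def combine_per_observation(posteriors, observation_ids):
    """Average per-image class probabilities within each observation.

    Averaging is an assumed combination rule, not one given in the text.
    """
    grouped = defaultdict(list)
    for row, obs_id in zip(posteriors, observation_ids):
        grouped[obs_id].append(row)
    combined = {}
    for obs_id, rows in grouped.items():
        num_classes = len(rows[0])
        combined[obs_id] = [
            sum(r[k] for r in rows) / len(rows) for k in range(num_classes)
        ]
    return combined


# Hypothetical predictions: two images of observation "a", one of "b".
posteriors = [[0.6, 0.4], [0.2, 0.8], [0.1, 0.9]]
observation_ids = ["a", "a", "b"]
combined = combine_per_observation(posteriors, observation_ids)
```

The observation-level decision is then the class with the highest combined probability, so that a single ambiguous image can be outvoted by the other images of the same specimen.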

Model                                           | Accuracy | Accuracy after EM | Accuracy per observation (combined) | Acc. per obs., known $p_e(c_k)$ (oracle)
Inception V4                                    | 83.3%    | 86.7%             | 90.8%                               | 93.7%
Ensemble of 12 CNNs [9] (PlantCLEF2017 winner)  |          |                   | 88.5%                               |

Table 1: Improvement in accuracy after applying the iterative test set prior estimation in the PlantCLEF 2017 plant identification challenge.

Another set of experiments was performed with the networks from Section 3.1 trained on the selected subsets of CIFAR-100. We evaluate the networks against the full (balanced) CIFAR-100 test set, and compare the accuracies of the CNN predictions against the adjusted predictions, obtained either with the iterative estimation or with known test-time priors. The results are in Table 2.

Interestingly, in experiments on the FGVC iNaturalist 2018 and FGVCx Fungi 2018 challenges, adjusting the predictions using the iterative test prior estimation did not improve the results: it even decreased the accuracy, by 1.6% and 0.5% respectively.

Train. distr. (12 training distributions, shown as plots)
Accuracy [%]                  | 48.15 | 55.70 | 60.88 | 64.01 | 65.62 | 67.29 | 36.68 | 47.72 | 54.00 | 56.57 | 60.37 | 61.66
Accuracy [%] after EM         | 49.73 | 56.90 | 61.57 | 64.57 | 65.61 | 67.13 | 38.65 | 49.04 | 55.15 | 57.03 | 60.58 | 61.76
Accuracy [%] known $p_e(c_k)$ | 51.20 | 57.61 | 62.23 | 64.73 | 65.92 | 67.44 | 40.62 | 50.07 | 55.86 | 57.49 | 60.92 | 62.11

Table 2: Correction of CNN estimates trained on unbalanced CIFAR-100 subsets and evaluated on the full CIFAR-100 test set

3.4 Adjusting posterior probabilities on-line with new test samples

In practical tasks, test samples are often evaluated sequentially rather than all at once. We evaluate how the test-time class prior estimation on the PlantCLEF 2017 dataset affects the results on-line, i.e. when the priors are always estimated from the already seen examples. In Figure 6, after about 1,000 test samples, the predictions adjusted by iteratively estimated class priors gain a noticeable margin against the plain CNN predictions. Moreover, the accuracy of the adjusted predictions was not significantly lower than that of the original predictions even for the first few hundred test cases.
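The on-line setting can be sketched as a loop that re-estimates the priors from all samples seen so far and uses them to adjust the current prediction. The sketch below is one possible reading of that procedure, not the authors' code: the function names, the fixed number of EM iterations, the restart from the training priors at every step, and the choice to include the current sample in the estimate are all assumptions.

```python
def em_priors(posteriors, train_priors, num_iters=20):
    """EM estimate of test-time priors from the samples seen so far."""
    priors = list(train_priors)
    for _ in range(num_iters):
        sums = [0.0] * len(priors)
        for row in posteriors:
            # Expectation step: adjust each posterior with the current priors.
            weighted = [p * pe / pt for p, pe, pt in zip(row, priors, train_priors)]
            norm = sum(weighted)
            for k, w in enumerate(weighted):
                sums[k] += w / norm
        # Maximization step: average the adjusted posteriors.
        priors = [s / len(posteriors) for s in sums]
    return priors


def classify_online(stream, train_priors, num_iters=20):
    """Classify samples one by one, adjusting with priors estimated so far."""
    seen, decisions = [], []
    priors = list(train_priors)
    for row in stream:
        seen.append(row)  # assumption: the current sample is included
        priors = em_priors(seen, train_priors, num_iters)
        weighted = [p * pe / pt for p, pe, pt in zip(row, priors, train_priors)]
        decisions.append(max(range(len(weighted)), key=weighted.__getitem__))
    return decisions, priors


# Hypothetical stream: nine samples that look like class 1, then one
# ambiguous sample that the plain classifier would assign to class 0.
stream = [[0.2, 0.8]] * 9 + [[0.55, 0.45]]
decisions, final_priors = classify_online(stream, train_priors=[0.5, 0.5])
```

Re-running EM from scratch for every sample is quadratic in the number of test samples; it keeps the sketch short but would likely be replaced by an incremental update in practice.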

Figure 6: On-line test-prior estimation (i.e. images tested sequentially) on the PlantCLEF 2017 dataset.

3.5 Changing the training set priors

Two experiments were performed in which the training set changed during the CNN training.

In the first experiment, new samples are added into the training set. We take a network from Section 3.1 pre-trained on an unbalanced subset of CIFAR-100 and we fine-tune it on the full (balanced) CIFAR-100 training set. The predictions are evaluated on the complete (and balanced) test set.

The second experiment covers the other case: removing samples from the training set. On the PlantCLEF 2017 dataset, we used all training data (PlantCLEF-All) and then removed the major subset with noisy labels and fine-tuned only on the trusted data (PlantCLEF-Trusted).

Figure 7: CNN pre-trained on unbalanced CIFAR-100 subset fine-tuned on the full CIFAR-100 training set (left). CNN pre-trained on PlantCLEF-All fine-tuned on PlantCLEF-Trusted (right).

The results for both experiments are in Figure 7. The CIFAR experiment clearly shows that using the old training set priors is still favorable for a few fine-tuning steps, but the effective priors of the CNN classifier seem to change fast. In the second experiment, the difference between the old and new priors is significantly smaller, but the results show a similar behavior.

4 Conclusions

The paper highlighted the importance of not ignoring the commonly found difference between the class priors in the training and test sets. We show that an existing EM-based method [10] for estimation of class priors is suitable and effective for CNN-based classifiers for both the case where the test set can be handled as a batch and where the classification is on-line.

Experimental results show a significant improvement on the FGVC iNaturalist 2018 and FGVCx Fungi 2018 classification tasks using the known evaluation-time priors, increasing the top-1 accuracy by 4.0% and 3.9% respectively. Iterative estimation of test-time priors on the PlantCLEF 2017 dataset increases the image classification accuracy by 3.4%, allowing a single CNN model to achieve state-of-the-art results and outperform the competition-winning ensemble of 12 CNNs.


  • [1] Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, 2006.
  • [2] Marthinus Christoffel Du Plessis and Masashi Sugiyama. Semi-supervised learning of class balance under class-prior change by distribution matching. Neural Networks, 50:110–119, 2014.
  • [3] Herve Goeau, Pierre Bonnet, and Alexis Joly. Plant identification based on noisy web data: the amazing performance of deep learning (lifeclef 2017). CEUR Workshop Proceedings, 2017.
  • [4] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, Cambridge, 2016.
  • [5] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.
  • [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [7] Alexis Joly, Hervé Goëau, Hervé Glotin, Concetto Spampinato, Pierre Bonnet, Willem-Pier Vellinga, Jean-Christophe Lombardo, Robert Planque, Simone Palazzo, and Henning Müller. Lifeclef 2017 lab overview: multimedia species identification challenges. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 255–274. Springer, 2017.
  • [8] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
  • [9] Mario Lasseck. Image-based plant species identification with deep convolutional neural networks. Working Notes of CLEF, 2017, 2017.
  • [10] Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural computation, 14(1):21–41, 2002.
  • [11] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.