Improving CNN classifiers
by estimating test-time priors
Abstract
The problem of different training and test set class priors is addressed in the context of CNN classifiers. An EM-based algorithm for test-time class prior estimation is evaluated on fine-grained computer vision problems in both the batch and online settings. Experimental results show a significant improvement on the fine-grained classification tasks using the known evaluation-time priors, increasing the top-1 accuracy by 4.0% on the FGVC iNaturalist 2018 validation set and by 3.9% on the FGVCx Fungi 2018 validation set. Iterative estimation of test-time priors on the PlantCLEF 2017 dataset increased the image classification accuracy by 3.4%, allowing a single CNN model to achieve state-of-the-art results and outperform the competition-winning ensemble of 12 CNNs.
Milan Šulc, Jiří Matas
Dept. of Cybernetics, FEE CTU in Prague
Technicka 2, Prague, Czech Republic
{sulcmila, matas}@fel.cvut.cz
Preprint. Work in progress.
1 Introduction
A common assumption of many machine learning algorithms is that the training set is sampled independently from the same data distribution as the test data [1, 4, 5]. In practical computer vision tasks, this assumption is often violated: training samples may be obtained from diverse sources where classes appear with frequencies differing from the test-time class frequencies. For instance, for the task of fine-grained recognition of plant species from images, training examples can be downloaded from the online Encyclopedia of Life. However, the number of photographs of a species in the encyclopedia may not correspond to the species incidence, or to the frequency with which a species is queried in a plant identification service.
In this paper, we show that state-of-the-art results can be obtained by expecting and adapting to the change of class priors. Methods for adjusting classifier outputs to new and unknown a priori probabilities [10, 2] were published years ago, yet the problem of changed class priors is commonly not addressed in computer vision tasks where the situation arises. Section 2 provides a formulation of the problem: a probabilistic interpretation of CNN classifier outputs in Section 2.1, compensation for the change in a priori class probabilities in Section 2.2, and estimation of the new a priori probabilities in Section 2.3.
The training set a priori class probabilities can easily be determined empirically from the class frequencies in the training set. We also consider the more complex scenario where the training set (and its distribution) changes during training and fine-tuning.
Experiments in Section 3 show that the predictions of state-of-the-art Convolutional Neural Networks (CNNs) on fine-grained image classification tasks can noticeably benefit from correcting for the a priori probabilities. We evaluate the impact of the estimation of the a priori probabilities both for the case when the whole test set is available to the classifier and for the situation where the test images are classified online (sequentially).
2 Problem Formulation and Methodology
2.1 Probabilistic interpretation of CNN outputs
Let us assume that a Convolutional Neural Network classifier is trained to provide an estimate of the posterior probabilities of classes $c_k$ given an image observation $x_i$:

$$f_k(x_i, \theta) \approx p(c_k \mid x_i) \quad (1)$$

where $\theta$ are the parameters of the trained CNN.
This is a common interpretation of the process of training a deep network by minimizing the cross-entropy loss over samples $x_i$ with known class-membership labels $k_i$:

$$\ell(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log f_k(x_i, \theta) \quad (2)$$

where $y_{ik}$ is a one-hot encoding of class label $k_i$:

$$y_{ik} = \begin{cases} 1 & \text{if } k = k_i \\ 0 & \text{otherwise} \end{cases} \quad (3)$$
The cross-entropy minimization from Eq. 2 can be rewritten as a maximum a-posteriori (MAP) estimation:

$$\theta^* = \arg\max_{\theta} \prod_{i=1}^{N} f_{k_i}(x_i, \theta) \quad (4)$$
2.2 New a priori class distribution
When the prior class probabilities $p_e(c_k)$ in our validation/test set (we use index $e$, for evaluation, to denote all evaluation-time distributions) differ from the training-set priors $p(c_k)$, the posterior probabilities change too. The probability density function $p(x_i \mid c_k)$, describing the statistical properties of observations $x_i$ of class $c_k$, remains unchanged:

$$p_e(x_i \mid c_k) = p(x_i \mid c_k) \quad (5)$$

Since $p(c_k \mid x_i) = \dfrac{p(x_i \mid c_k)\, p(c_k)}{p(x_i)}$, the mutual relation of the posteriors is:

$$p_e(c_k \mid x_i) = p(c_k \mid x_i)\, \frac{\dfrac{p_e(c_k)}{p(c_k)}}{\displaystyle\sum_{j=1}^{K} p(c_j \mid x_i)\, \frac{p_e(c_j)}{p(c_j)}} \quad (6)$$
The training-set class priors $p(c_k)$ can be empirically quantified as the relative frequencies of images labeled $c_k$ in the training set. The evaluation-time priors $p_e(c_k)$ are, however, often unknown at test time.
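When the evaluation-time priors are known, the correction of Eq. 6 reduces to re-weighting the CNN outputs by the prior ratio and re-normalizing. A minimal NumPy sketch of this correction (the function name and array layout are our own illustration, not part of the paper):

```python
import numpy as np

def adjust_posteriors(probs, train_priors, test_priors):
    """Recompute posteriors for new class priors (Eq. 6).

    probs:        (N, K) array of CNN posterior estimates p(c_k | x_i)
    train_priors: (K,)   training-set priors p(c_k)
    test_priors:  (K,)   evaluation-time priors p_e(c_k)
    """
    ratio = test_priors / train_priors           # p_e(c_k) / p(c_k)
    adjusted = probs * ratio                     # unnormalized new posteriors
    return adjusted / adjusted.sum(axis=1, keepdims=True)
```

Note that when the two priors coincide, the ratio is constant and the normalization leaves the predictions unchanged.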
2.3 Estimating the new a priori probabilities
Saerens et al. [10] proposed to approach the estimation of unknown test-time a priori probabilities by iteratively maximizing the likelihood of the test observations $x_1, \dots, x_N$:

$$L(x_1, \dots, x_N) = \prod_{i=1}^{N} p_e(x_i) = \prod_{i=1}^{N} \sum_{k=1}^{K} p(x_i \mid c_k)\, p_e(c_k) \quad (7)$$
They derive a simple EM algorithm comprising the following steps:

$$p_e^{(s)}(c_k \mid x_i) = \frac{\dfrac{p_e^{(s)}(c_k)}{p(c_k)}\, p(c_k \mid x_i)}{\displaystyle\sum_{j=1}^{K} \dfrac{p_e^{(s)}(c_j)}{p(c_j)}\, p(c_j \mid x_i)} \quad (8)$$

$$p_e^{(s+1)}(c_k) = \frac{1}{N} \sum_{i=1}^{N} p_e^{(s)}(c_k \mid x_i) \quad (9)$$

where Eq. 8 is the Expectation step, Eq. 9 is the Maximization step, and $p_e^{(0)}(c_k)$ may be initialized, for example, with the training-set relative frequency $p(c_k)$.
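The EM iteration of Eqs. 8 and 9 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the function name, the fixed iteration count, and the array layout are our own choices:

```python
import numpy as np

def estimate_test_priors(probs, train_priors, n_iter=100):
    """EM estimation of test-time priors (Eqs. 8 and 9).

    probs:        (N, K) CNN posterior estimates p(c_k | x_i) on the test set
    train_priors: (K,)   training-set priors p(c_k), used as initialization
    """
    test_priors = train_priors.copy()           # p_e^(0)(c_k) = p(c_k)
    for _ in range(n_iter):
        # E-step (Eq. 8): posteriors re-weighted by the current prior estimate
        weighted = probs * (test_priors / train_priors)
        posteriors = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step (Eq. 9): new priors are the mean of the adjusted posteriors
        test_priors = posteriors.mean(axis=0)
    return test_priors
```

In practice the loop can also be terminated when the prior estimate stops changing between iterations.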
Du Plessis and Sugiyama [2] proved that this procedure is equivalent to a fixed-point iteration minimizing the KL divergence between the test observation density $p_e(x)$ and a linear combination of the class-wise densities $p(x \mid c_k)$, where $\hat{p}_e(c_k)$ are the estimates of $p_e(c_k)$:

$$KL\!\left( p_e(x) \,\middle\|\, \sum_{k=1}^{K} \hat{p}_e(c_k)\, p(x \mid c_k) \right) \quad (10)$$
3 Experiments
The following finegrained classification datasets are used for experiments in this Section:

CIFAR-100 is a popular dataset for smaller-scale fine-grained classification experiments, introduced by Krizhevsky and Hinton [8] in 2009. It contains small-resolution (32×32) color images of 100 classes. While the dataset is balanced (with 500 training samples and 100 test samples per class), we sample a number of its unbalanced subsets for the experiments in this Section.

PlantCLEF 2017 [3] was a plant species recognition challenge organized as part of the LifeCLEF workshop [7]. The provided training images for 10,000 plant species consisted of an EOL "trusted" training set (downloaded from the Encyclopedia of Life, http://www.eol.org/), a significantly larger "noisy" training set (obtained from Google and Bing image search results, including mislabeled or irrelevant images), and the images from the previous years (2015-2016), depicting only a subset of the species. We use the training data in two ways: either training on all the sets together (including the "noisy" set), further denoted as PlantCLEF-All, or excluding the "noisy" set (i.e. using the 2017 EOL data and the previous years' data), further denoted as PlantCLEF-Trusted. The test set from the PlantCLEF 2017 challenge is used for evaluation. All data is publicly available (http://www.imageclef.org/lifeclef/2017/plant, http://www.imageclef.org/node/198). PlantCLEF presents an example of a real-world fine-grained classification task, where the number of available images per class is highly unbalanced.

FGVC iNaturalist 2018 is a large-scale species classification competition, organized with the FGVC5 workshop at CVPR 2018. The provided dataset covers 8,142 species of plants, animals and fungi: the training set is highly unbalanced and contains almost 440K images. A balanced validation set of 24K images is provided.

FGVCx Fungi 2018 is another species classification competition, focused only on fungi, also organized with the FGVC5 workshop at CVPR 2018. The dataset covers nearly 1,400 fungi species. The training set contains almost 86K images and is highly unbalanced. The validation set is balanced, with 4,182 images in total.
3.1 Validation of posterior estimates on the training set
Before considering the change in class priors, let us validate that the marginalization of CNN predictions over the training and validation data estimates the class priors well:

$$p(c_k) = \int p(c_k \mid x)\, p(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} p(c_k \mid x_i) \quad (11)$$
We simulated normal and exponential prior class distributions by randomly picking subsets of the CIFAR-100 database that follow the chosen distributions. A 32-layer Residual Network [6] (implementation from https://github.com/tensorflow/models/tree/master/research/resnet) was trained on the training subsets. The comparison of empirical class frequencies and the estimates obtained by marginalizing the CNN outputs (i.e. averaging CNN predictions) is displayed in Figure 4. The training set class distributions are estimated almost perfectly. The estimates on the test set are noisier, but still approximate the class frequencies well.
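The marginalization check of Eq. 11 amounts to averaging the per-image prediction vectors over a set of observations. A minimal sketch (the function name is our own):

```python
import numpy as np

def marginalize_predictions(probs):
    """Estimate class priors by averaging CNN predictions over a set (Eq. 11).

    probs: (N, K) array of CNN posterior estimates p(c_k | x_i)
    """
    return probs.mean(axis=0)
```

Comparing this estimate against the empirical class frequencies of the same set is exactly the validation performed above.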
3.2 Adjusting posterior probabilities when test-time priors are known
For experiments with known test-time prior probabilities $p_e(c_k)$, we use the training and validation sets from the FGVC iNaturalist Competition 2018 (https://sites.google.com/view/fgvc5/competitions/inaturalist) and the FGVCx Fungi Classification Competition 2018 (https://sites.google.com/view/fgvc5/competitions/fgvcx/fungi). In these challenges, the validation sets are balanced (i.e. the class prior distribution is uniform). A state-of-the-art Convolutional Neural Network, Inception-v4 [11], was fine-tuned for each task. The predictions were corrected as prescribed by Equation 6.
Figure 5 displays the training and evaluation distributions and the improvement in accuracy achieved by correcting the predictions with the known priors. The improvement in top-1 accuracy is 4.0% and 3.9% after 400K training steps (and up to 7.4% and 4.9% during fine-tuning) for the FGVC iNaturalist and FGVCx Fungi classification challenges respectively.
3.3 Adjusting posterior probabilities when the whole test set with unknown priors is available at test time
We choose the PlantCLEF 2017 challenge test set as an example of a test environment where no knowledge about the class distribution is available. The training set is highly unbalanced, and the test set statistics do not follow the training set statistics very well, see Figure 1.
We used an Inception-v4 model pretrained on all available training data (PlantCLEF-All). The results in Table 1 show that the top-1 accuracy increases by 3.4% when estimating the test set priors using the EM algorithm of Saerens et al. [10] (Eqs. 8, 9). To compare with the results of the 2017 challenge, we combine the predictions per specimen observation (the test set contained several images per specimen, linked by ObservationID metadata) and compute the observation-identification accuracy. Note that after the test-set prior estimation, our single CNN model outperforms the winning submission of PlantCLEF 2017, composed of 12 very deep CNN models (ResNet-152, ResNeXt-101 and GoogLeNet architectures).
Table 1: Results on the PlantCLEF 2017 test set.

Model | Accuracy | Accuracy after EM | Acc. per observation (combined) | Acc. per obs., known priors (oracle)
Inception V4 | 83.3% | 86.7% | 90.8% | 93.7%
Ensemble of 12 CNNs [9] (PlantCLEF 2017 winner) | – | – | 88.5% | –
Another set of experiments was performed with the networks from Section 3.1 trained on the selected subsets of CIFAR-100. We evaluate the networks on the full (balanced) CIFAR-100 test set, and compare the accuracies of the plain CNN predictions against the adjusted predictions, using either the iterative estimation or the known test-time priors. The results are in Table 2.
Interestingly, in the experiments on the FGVC iNaturalist 2018 and FGVCx Fungi 2018 challenges, adjusting the predictions using the iterative test-time prior estimation did not improve the results; it actually decreased the accuracy, by 1.6% and 0.5% respectively.
3.4 Adjusting posterior probabilities online with new test samples
In practical tasks, test samples are often evaluated sequentially rather than all at once. We evaluate how the test-time class prior estimation on the PlantCLEF 2017 dataset affects the results online, i.e. when the priors are always estimated from the already seen examples. In Figure 6, after about 1,000 test samples, the predictions adjusted by iteratively estimated class priors gain a noticeable margin over the plain CNN predictions. Moreover, the accuracy of the adjusted predictions was not significantly lower than that of the original predictions even for the first few hundred test cases.
3.5 Changing the training set priors
Two experiments were performed in which the training set changed during the CNN training.
In the first experiment, new samples are added into the training set. We take a network from Section 3.1 pretrained on an unbalanced subset of CIFAR-100 and fine-tune it on the full (balanced) CIFAR-100 training set. The predictions are evaluated on the complete (and balanced) test set.
The second experiment covers the opposite case: removing samples from the training set. On the PlantCLEF 2017 dataset, we first used all training data (PlantCLEF-All), then removed the large subset with noisy labels and fine-tuned only on the trusted data (PlantCLEF-Trusted).
The results for both experiments are in Figure 7. From the CIFAR experiment, it is clearly visible that using the old training set priors is still favorable for the first few fine-tuning steps, but the effective priors of the CNN classifier seem to change quickly. In the second experiment, the difference between the old and new priors is significantly smaller, but a similar trend is displayed.
4 Conclusions
The paper highlighted the importance of not ignoring the commonly found difference between the class priors of the training and test sets. We show that an existing EM-based method [10] for the estimation of class priors is suitable and effective for CNN-based classifiers, both in the case where the test set can be handled as a batch and in the case of online classification.
Experimental results show a significant improvement on the FGVC iNaturalist 2018 and FGVCx Fungi 2018 classification tasks using the known evaluation-time priors, increasing the top-1 accuracy by 4.0% and 3.9% respectively. Iterative estimation of test-time priors on the PlantCLEF 2017 dataset increases the image classification accuracy by 3.4%, allowing a single CNN model to achieve state-of-the-art results and outperform the competition-winning ensemble of 12 CNNs.
References
 [1] Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, 2006.
 [2] Marthinus Christoffel Du Plessis and Masashi Sugiyama. Semisupervised learning of class balance under classprior change by distribution matching. Neural Networks, 50:110–119, 2014.
 [3] Hervé Goëau, Pierre Bonnet, and Alexis Joly. Plant identification based on noisy web data: the amazing performance of deep learning (LifeCLEF 2017). CEUR Workshop Proceedings, 2017.
 [4] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
 [5] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.
 [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [7] Alexis Joly, Hervé Goëau, Hervé Glotin, Concetto Spampinato, Pierre Bonnet, Willem-Pier Vellinga, Jean-Christophe Lombardo, Robert Planque, Simone Palazzo, and Henning Müller. LifeCLEF 2017 lab overview: multimedia species identification challenges. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 255–274. Springer, 2017.
 [8] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
 [9] Mario Lasseck. Imagebased plant species identification with deep convolutional neural networks. Working Notes of CLEF, 2017, 2017.
 [10] Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural computation, 14(1):21–41, 2002.
 [11] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inceptionv4, inceptionresnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.