Improving CNN classifiers
by estimating test-time priors
The problem of different training and test set class priors is addressed in the context of CNN classifiers. An EM-based algorithm for test-time class priors estimation is evaluated on fine-grained computer vision problems for both the batch and on-line situations. Experimental results show a significant improvement on the fine-grained classification tasks using the known evaluation-time priors, increasing the top-1 accuracy by 4.0% on the FGVC iNaturalist 2018 validation set and by 3.9% on the FGVCx Fungi 2018 validation set. Iterative estimation of test-time priors on the PlantCLEF 2017 dataset increased the image classification accuracy by 3.4%, allowing a single CNN model to achieve state-of-the-art results and outperform the competition-winning ensemble of 12 CNNs.
Milan Sulc, Jiri Matas Dept. of Cybernetics, FEE CTU in Prague Technicka 2, Prague, Czech Republic sulcmila,email@example.com
Preprint. Work in progress.
1 Introduction
A common assumption of many machine learning algorithms is that the training set is independently sampled from the same data distribution as the test data [1, 4, 5]. In practical computer vision tasks, this assumption is often violated: training samples may be obtained from diverse sources where classes appear with frequencies that differ from those at test time. For instance, for the task of fine-grained recognition of plant species from images, training examples can be downloaded from the online Encyclopedia. However, the number of photographs of a species in the Encyclopedia may not correspond to the species incidence or to the frequency with which a species is queried in a plant identification service.
In this paper, we show that state-of-the-art results can be obtained by expecting and adapting to the change of class priors. Methods [10, 2] for adjusting classifier outputs to new and unknown a priori probabilities were published years ago, yet the problem of changed class priors is commonly not addressed in computer vision tasks where the situation arises. Section 2 provides a formulation of the problem: a probabilistic interpretation of CNN classifier outputs in Section 2.1, compensation for the change in a-priori class probabilities in Section 2.2, and estimation of the new a-priori probabilities in Section 2.3.
The training set a-priori class probabilities can be easily empirically determined from the class frequencies in the training set. We also consider the more complex scenario where the training set (and its distribution) changes during training and fine-tuning.
Experiments in Section 3 show that the predictions of state-of-the-art Convolutional Neural Networks (CNN) on fine-grained image classification tasks can noticeably benefit from correcting the a priori probabilities. We evaluate the impact of the estimation of the a priori probabilities for the case when the whole test set is available to the classifier as well as the situation where the test images are classified on-line (sequentially).
2 Problem Formulation and Methodology
2.1 Probabilistic interpretation of CNN outputs
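Let us assume that a Convolutional Neural Network classifier is trained to provide an estimate of the posterior probabilities of classes $c_k$ given an image observation $x_i$:

$$f_{\text{CNN}}(c_k \mid x_i, \theta) \approx p(c_k \mid x_i), \tag{1}$$

where $\theta$ are the parameters of the trained CNN.
This is a common interpretation of the process of training a deep network by minimizing the cross-entropy loss over samples $x_i$ with known class-membership labels $c_i$:

$$\theta^{*} = \arg\min_{\theta} \; -\sum_{i} \sum_{k} c_{ik} \log f_{\text{CNN}}(c_k \mid x_i, \theta), \tag{2}$$

where $c_{ik}$ is a one-hot encoding of class label $c_i$:

$$c_{ik} = \begin{cases} 1 & \text{if } c_i = c_k, \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$

The cross-entropy minimization from Eq. 2 can be rewritten as a maximum a-posteriori (MAP) estimation:

$$\theta^{*} = \arg\max_{\theta} \sum_{i} \log f_{\text{CNN}}(c_i \mid x_i, \theta) = \arg\max_{\theta} \prod_{i} f_{\text{CNN}}(c_i \mid x_i, \theta). \tag{4}$$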
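The step from cross-entropy to a likelihood can be checked numerically. The following is a minimal sketch with hypothetical values (the 3-class prediction and the function name `cross_entropy` are illustrative, not from the paper): with a one-hot label, only the term of the true class survives, so the loss equals the negative log of the probability assigned to that class.

```python
import math


def cross_entropy(one_hot, probs):
    """Cross-entropy between a one-hot label vector and predicted probabilities."""
    return -sum(c * math.log(p) for c, p in zip(one_hot, probs))


# Hypothetical 3-class prediction for a sample whose true class has index 1.
probs = [0.2, 0.7, 0.1]
one_hot = [0, 1, 0]

loss = cross_entropy(one_hot, probs)
# Negative log-probability of the true class alone.
neg_log_likelihood = -math.log(probs[1])

print(loss, neg_log_likelihood)
```

Summing this loss over all training samples therefore gives the negative log-likelihood of the labels, which is why minimizing it is the same as maximizing the product of the predicted probabilities of the true classes.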
2.2 New a-priori class distribution
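When the prior class probabilities $p_e(c_k)$ in our validation/test set differ from the training-set priors $p(c_k)$, the posterior $p_e(c_k \mid x_i)$ changes too. (We use the index $e$, for evaluation, to denote all evaluation-time distributions.) The probability density function $p(x_i \mid c_k)$, describing the statistical properties of observations $x_i$ of class $c_k$, remains unchanged:

$$p_e(x_i \mid c_k) = p(x_i \mid c_k). \tag{5}$$

Since $p(x_i \mid c_k) = p(c_k \mid x_i)\, p(x_i) / p(c_k)$, the mutual relation of the posteriors is:

$$p_e(c_k \mid x_i) = p(c_k \mid x_i)\, \frac{p_e(c_k)\, p(x_i)}{p(c_k)\, p_e(x_i)} = \frac{p(c_k \mid x_i)\, \frac{p_e(c_k)}{p(c_k)}}{\sum_{j} p(c_j \mid x_i)\, \frac{p_e(c_j)}{p(c_j)}}, \tag{6}$$

where the second form follows because the adjusted posteriors must sum to one over all classes.
The class priors $p(c_k)$ can be empirically quantified as the relative number of images labeled as $c_k$ in the training set. The evaluation-time priors $p_e(c_k)$ are, however, often unknown at test time.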
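The relation between the training-time and evaluation-time posteriors described above amounts to re-weighting each class score by the ratio of the new prior to the training prior and renormalizing. A minimal sketch in plain Python follows; the function name `adjust_posteriors` and the 3-class numbers are hypothetical, and the training priors are assumed to be strictly positive.

```python
def adjust_posteriors(posteriors, train_priors, test_priors):
    """Re-weight one posterior vector by the prior ratio and renormalize.

    posteriors   -- classifier output for a single sample, one value per class
    train_priors -- class priors of the training set (assumed all > 0)
    test_priors  -- known or estimated evaluation-time class priors
    """
    weighted = [
        p * pe / pt
        for p, pe, pt in zip(posteriors, test_priors, train_priors)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical example: the classifier output equals the training prior,
# i.e. the image carries no evidence beyond class frequency.
cnn_output = [0.6, 0.3, 0.1]
train_priors = [0.6, 0.3, 0.1]
uniform_test_priors = [1 / 3, 1 / 3, 1 / 3]

adjusted = adjust_posteriors(cnn_output, train_priors, uniform_test_priors)
print(adjusted)
```

In this example the adjusted prediction becomes uniform: once the bias introduced by the unbalanced training set is removed, a prediction that merely reproduced the training prior favors no class.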
2.3 Estimating the new a-priori probabilities
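Saerens et al. [10] proposed to approach the estimation of unknown test-time a priori probabilities by iteratively maximizing the likelihood of the test observations:

$$L(x_1, \dots, x_N) = \prod_{i=1}^{N} p_e(x_i) = \prod_{i=1}^{N} \sum_{k} p_e(x_i, c_k) = \prod_{i=1}^{N} \sum_{k} p(x_i \mid c_k)\, p_e(c_k). \tag{7}$$

They derive a simple EM algorithm comprising the following steps:

$$p_e^{(s)}(c_k \mid x_i) = \frac{p(c_k \mid x_i)\, \frac{p_e^{(s)}(c_k)}{p(c_k)}}{\sum_{j} p(c_j \mid x_i)\, \frac{p_e^{(s)}(c_j)}{p(c_j)}}, \tag{8}$$

$$p_e^{(s+1)}(c_k) = \frac{1}{N} \sum_{i=1}^{N} p_e^{(s)}(c_k \mid x_i), \tag{9}$$

where $N$ is the number of test observations and $s$ is the iteration index. Eq. 8 is the expectation step and Eq. 9 is the maximization step.
Du Plessis and Sugiyama [2] proved that this procedure is equivalent to a fixed-point-iteration minimization of the KL divergence between the test observation density $p_e(x)$ and a linear combination of the class-wise predictions $q_e(x) = \sum_{k} P_k\, p(x \mid c_k)$, where $P_k$ are the estimates of $p_e(c_k)$.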
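The EM procedure of Saerens et al. alternates between re-weighting the classifier outputs with the current prior estimate and averaging the re-weighted outputs to obtain the next estimate. The sketch below is a plain-Python illustration under stated assumptions, not the authors' implementation: the function name, the stopping rule (`max_iter`, `tol`) and the initialization with the training priors are choices made here, and the training priors are assumed to be strictly positive.

```python
def estimate_test_priors(posteriors, train_priors, max_iter=100, tol=1e-8):
    """Estimate test-time class priors from classifier outputs with EM.

    posteriors   -- list of per-sample classifier outputs (one list per sample)
    train_priors -- class priors of the training set (assumed all > 0)
    """
    priors = list(train_priors)  # assumption: start from the training priors
    n = len(posteriors)
    for _ in range(max_iter):
        sums = [0.0] * len(priors)
        for post in posteriors:
            # E-step: adjust the posterior with the current prior estimate.
            weighted = [
                p * pe / pt for p, pe, pt in zip(post, priors, train_priors)
            ]
            total = sum(weighted)
            for k, w in enumerate(weighted):
                sums[k] += w / total
        # M-step: the new prior is the mean of the adjusted posteriors.
        new_priors = [s / n for s in sums]
        change = max(abs(a - b) for a, b in zip(new_priors, priors))
        priors = new_priors
        if change < tol:
            break
    return priors


# Hypothetical toy test set: a perfectly confident classifier sees
# three samples of class 0 and one sample of class 1.
toy_posteriors = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
estimated = estimate_test_priors(toy_posteriors, train_priors=[0.5, 0.5])
print(estimated)
```

For this toy input the estimate settles at the empirical class frequencies of the test set (0.75 and 0.25). With softer, realistic classifier outputs the estimate depends on how well calibrated the posteriors are.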
3 Experiments
The following fine-grained classification datasets are used for the experiments in this section:
CIFAR-100 is a popular dataset for smaller-scale fine-grained classification experiments, introduced by Krizhevsky and Hinton [8] in 2009. It contains low-resolution (32x32) color images of 100 classes. While the dataset is balanced (with 500 training samples and 100 test samples for each class), we sample a number of its unbalanced subsets for our experiments in this section.
PlantCLEF 2017 [3] was a plant species recognition challenge organized as part of the LifeCLEF workshop [7]. The provided training images for 10,000 plant species consisted of an EOL "trusted" training set (downloaded from the Encyclopedia of Life, http://www.eol.org/), a significantly larger "noisy" training set (obtained from Google and Bing image search results, including mislabeled or irrelevant images), and the images from previous years (2015-2016), depicting only a subset of the species. We use the training data in two ways: either training on all the sets together (including the "noisy" set), further denoted as PlantCLEF-All, or excluding the "noisy" set (i.e. using the 2017 EOL data and the data from previous years), further denoted as PlantCLEF-Trusted. The test set from the PlantCLEF 2017 challenge is used for evaluation. All data is publicly available (http://www.imageclef.org/lifeclef/2017/plant, http://www.imageclef.org/node/198). PlantCLEF presents an example of a real-world fine-grained classification task, where the number of available images per class is highly unbalanced.
FGVC iNaturalist 2018 is a large scale species classification competition, organized with the FGVC5 workshop at CVPR 2018. The provided dataset covers 8,142 species of plants, animals and fungi: The training set is highly unbalanced and contains almost 440K images. A balanced validation set of 24K images is provided.
FGVCx Fungi 2018 is another species classification competition, focused only on fungi, also organized with the FGVC5 workshop at CVPR 2018. The dataset covers nearly 1,400 fungi species. The training set contains almost 86K images and is highly unbalanced. The validation set is balanced, with 4,182 images in total.
3.1 Validation of posterior estimates on the training set
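Before considering the change in class priors, let us validate that the marginalization of CNN predictions on training and validation data estimates the class priors well:

$$p(c_k) = \int p(c_k \mid x)\, p(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} f_{\text{CNN}}(c_k \mid x_i, \theta). \tag{10}$$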
We simulated normal and exponential prior class distributions by randomly picking subsets of the CIFAR-100 database that follow the chosen distributions. A 32-layer Residual Network [6] (implementation from https://github.com/tensorflow/models/tree/master/research/resnet) was trained on the training subsets. The comparison of empirical class frequencies and the estimates obtained by marginalizing the CNN outputs (i.e. averaging CNN predictions) is displayed in Figure 4. The training set class distributions are estimated almost perfectly. The estimates on the test set are noisier, but still approximate the class frequencies well.
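Marginalizing the CNN outputs, as used in this validation, is simply a per-class average of the predicted probability vectors. A minimal sketch with hypothetical values (the function name `average_predictions` and the two-sample input are illustrative only):

```python
def average_predictions(posteriors):
    """Estimate class priors as the per-class mean of the predictions."""
    n = len(posteriors)
    n_classes = len(posteriors[0])
    return [sum(post[k] for post in posteriors) / n for k in range(n_classes)]


# Hypothetical outputs of a 2-class classifier on two images.
predictions = [[0.8, 0.2], [0.4, 0.6]]
prior_estimate = average_predictions(predictions)
print(prior_estimate)
```

Comparing such an estimate with the empirical class frequencies of the same set gives a quick check of how well the network outputs behave as posterior probabilities.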
3.2 Adjusting posterior probabilities when test-time priors are known
For experiments with known test-time prior probabilities $p_e(c_k)$, we use the training and validation sets from the FGVC iNaturalist Competition 2018 (https://sites.google.com/view/fgvc5/competitions/inaturalist) and the FGVCx Fungi Classification Competition 2018 (https://sites.google.com/view/fgvc5/competitions/fgvcx/fungi). In these challenges, the validation sets are balanced (i.e. the class prior distribution is uniform). A state-of-the-art Convolutional Neural Network, Inception-v4 [11], was fine-tuned for each task. The predictions were corrected as prescribed by Equation 6.
Figure 5 displays the training and evaluation class distributions and the improvement in accuracy achieved by correcting the predictions with the known priors. The improvement in top-1 accuracy is 4.0% and 3.9% after 400K training steps (and up to 7.4% and 4.9% during fine-tuning) for the FGVC iNaturalist and FGVCx Fungi classification challenges respectively.
3.3 Adjusting posterior probabilities when the whole test set with unknown priors is available at test-time
We chose the PlantCLEF 2017 challenge test set as an example of a test environment where no knowledge about the class distribution is available. The training set is highly unbalanced and the test set statistics do not follow the training set statistics well; see Figure 1.
We used an Inception-V4 model pre-trained on all available training data (PlantCLEF-All). The results in Table 1 show that the top-1 accuracy increases by 3.4% when estimating the test set priors using the EM algorithm of Saerens et al. [10] (Eq. 8, 9). To compare with the results of the 2017 challenge, we combine the predictions per specimen observation (the test set contained several images per specimen, linked by ObservationID meta-data) and compute the observation-identification accuracy. Note that after the test set prior estimation, our single CNN model outperforms the winning submission of PlantCLEF 2017, composed of 12 very deep CNN models (ResNet-152, ResNeXt-101 and GoogLeNet architectures).
|Model|Accuracy|Accuracy after EM|Accuracy per observation (combined)|Accuracy per observation, known priors (oracle)|
|Ensemble of 12 CNNs [9] (PlantCLEF 2017 winner)|–|–|88.5%| |
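The text does not state how the predictions of images sharing an ObservationID are combined. As one plausible illustration only, the sketch below averages the per-image probability vectors within each observation; the averaging rule, the function name `combine_per_observation` and the example IDs are all assumptions, not the method used in the paper.

```python
from collections import defaultdict


def combine_per_observation(predictions, observation_ids):
    """Average per-image predictions over images of the same observation.

    Assumption: simple averaging; the source does not specify the rule.
    """
    grouped = defaultdict(list)
    for obs_id, post in zip(observation_ids, predictions):
        grouped[obs_id].append(post)
    return {
        obs_id: [sum(column) / len(posts) for column in zip(*posts)]
        for obs_id, posts in grouped.items()
    }


# Hypothetical example: two images of observation "a", one of "b".
image_predictions = [[0.6, 0.4], [0.2, 0.8], [0.9, 0.1]]
image_observation_ids = ["a", "a", "b"]
combined = combine_per_observation(image_predictions, image_observation_ids)
print(combined)
```

The class with the highest combined score would then be reported once per observation, which is the unit scored by the observation-identification accuracy.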
Another set of experiments was performed with the networks from Section 3.1 trained on the selected subsets of CIFAR-100. We evaluate the networks against the full (balanced) CIFAR-100 test set, and compare the accuracies of the CNN predictions against the adjusted predictions, either using the iterative estimation or using the known test-time priors. The results are in Table 2.
Interestingly, in experiments on the FGVC iNaturalist 2018 and FGVCx Fungi 2018 challenges, adjusting the predictions using the iterative test prior estimation did not improve the results: it actually decreased the accuracy at the end of training, by 1.6% and 0.5% respectively.
3.4 Adjusting posterior probabilities on-line with new test samples
In practical tasks, test samples are often evaluated sequentially rather than all at once. We evaluate how the test-time class prior estimation on the PlantCLEF 2017 dataset affects the results on-line, i.e. when the priors are always estimated from the already seen examples. In Figure 6, after about 1,000 test samples, the predictions adjusted by iteratively estimated class priors gain a noticeable margin against the plain CNN predictions. Moreover, the accuracy of the adjusted predictions was not significantly lower than that of the original predictions even for the first few hundred test cases.
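The on-line setting can be sketched as a loop that re-estimates the priors from the samples seen so far before classifying each new sample. This is an illustration under assumptions rather than the evaluated implementation: the function names are hypothetical, the current sample is included in the set used for estimation, EM is restarted from the training priors at every step, and a fixed number of iterations is used.

```python
def em_priors(posteriors, train_priors, n_iter=50):
    """EM estimate of class priors from classifier outputs (fixed iterations)."""
    priors = list(train_priors)
    for _ in range(n_iter):
        sums = [0.0] * len(priors)
        for post in posteriors:
            weighted = [
                p * pe / pt for p, pe, pt in zip(post, priors, train_priors)
            ]
            total = sum(weighted)
            for k, w in enumerate(weighted):
                sums[k] += w / total
        priors = [s / len(posteriors) for s in sums]
    return priors


def classify_online(stream, train_priors):
    """Classify samples one by one, adapting the priors to the samples seen."""
    seen, decisions = [], []
    for post in stream:
        seen.append(post)  # assumption: the current sample counts as seen
        priors = em_priors(seen, train_priors)
        adjusted = [
            p * pe / pt for p, pe, pt in zip(post, priors, train_priors)
        ]
        # The arg-max does not depend on normalization, so it is skipped.
        decisions.append(max(range(len(adjusted)), key=adjusted.__getitem__))
    return decisions


# Hypothetical stream: three confident class-0 samples, then an ambiguous one.
stream = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.45, 0.55]]
decisions = classify_online(stream, train_priors=[0.5, 0.5])
print(decisions)
```

In this toy stream the last sample would be assigned to class 1 by the plain arg-max, but the priors estimated from the earlier samples shift the decision to class 0. Re-running EM from scratch for every sample costs time quadratic in the stream length; warm-starting from the previous estimate would be a natural refinement.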
3.5 Changing the training set priors
Two experiments were performed in which the training set changed during CNN training.
In the first experiment, new samples are added into the training set. We take a network from Section 3.1 pre-trained on an unbalanced subset of CIFAR-100 and we fine-tune it on the full (balanced) CIFAR-100 training set. The predictions are evaluated on the complete (and balanced) test set.
The second experiment covers the other case: removing samples from the training set. On the PlantCLEF 2017 dataset, we used all training data (PlantCLEF-All) and then removed the major subset with noisy labels and fine-tuned only on the trusted data (PlantCLEF-Trusted).
The results for both experiments are in Figure 7. From the CIFAR experiment, it is clearly visible that using the old training set priors is still favorable for the first few fine-tuning steps, but the effective priors of the CNN classifier seem to change quickly. In the second experiment, the difference between using the old and the new priors is significantly smaller, but shows a similar trend.
4 Conclusions
The paper highlighted the importance of not ignoring the commonly found difference between the class priors in the training and test sets. We show that an existing EM-based method [10] for the estimation of class priors is suitable and effective for CNN-based classifiers, both when the test set can be handled as a batch and when the classification is on-line.
Experimental results show a significant improvement on the FGVC iNaturalist 2018 and FGVCx Fungi 2018 classification tasks using the known evaluation-time priors, increasing the top-1 accuracy by 4.0% and 3.9% respectively. Iterative estimation of test-time priors on the PlantCLEF 2017 dataset increases the image classification accuracy by 3.4%, allowing a single CNN model to achieve state-of-the-art results and outperform the competition-winning ensemble of 12 CNNs.
- [1] Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, 2006.
- [2] Marthinus Christoffel Du Plessis and Masashi Sugiyama. Semi-supervised learning of class balance under class-prior change by distribution matching. Neural Networks, 50:110–119, 2014.
- [3] Herve Goeau, Pierre Bonnet, and Alexis Joly. Plant identification based on noisy web data: the amazing performance of deep learning (LifeCLEF 2017). CEUR Workshop Proceedings, 2017.
- [4] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, Cambridge, 2016.
- [5] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, 2001.
- [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [7] Alexis Joly, Hervé Goëau, Hervé Glotin, Concetto Spampinato, Pierre Bonnet, Willem-Pier Vellinga, Jean-Christophe Lombardo, Robert Planque, Simone Palazzo, and Henning Müller. LifeCLEF 2017 lab overview: multimedia species identification challenges. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 255–274. Springer, 2017.
- [8] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
- [9] Mario Lasseck. Image-based plant species identification with deep convolutional neural networks. Working Notes of CLEF 2017, 2017.
- [10] Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Computation, 14(1):21–41, 2002.
- [11] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.