Texture Bias Of CNNs Limits Few-Shot Classification Performance
Accurate image classification given small amounts of labelled data (few-shot classification) remains an open problem in computer vision. In this work we examine how the known texture bias of Convolutional Neural Networks (CNNs) affects few-shot classification performance. Although texture bias can help in standard image classification, we show that it significantly harms few-shot classification performance. After correcting for this bias, we demonstrate state-of-the-art performance on the competitive miniImageNet task using a method far simpler than the current best-performing few-shot learning approaches.
The ability of neural networks to perform image classification has increased dramatically in recent years Krizhevsky et al. (2012); He et al. (2016); Tan and Le (2019). However, much of this improvement has relied on using large amounts of labelled data, with successful classification of a given class often requiring thousands of labelled images for training. This is in contrast to the ability of humans to recognize new classes using only one or two labelled examples. The goal of few-shot classification is to bridge this gap such that machines can generalize their classification ability to unseen classes using very small amounts of labelled data.
Geirhos et al. (2019) show that Convolutional Neural Networks (CNNs) exhibit a greater bias towards learning texture-based features than humans do. However, they also demonstrate that this bias actually improves classification performance in the standard classification setting, where the classes at test time are drawn from the same distribution as those seen at train time.
At the time of writing there has been no investigation as to how the texture bias of CNNs affects classification performance when a distributional shift in classes is experienced between train and test time, such as that seen in the few-shot learning setting. This work performs this investigation and demonstrates that texture bias significantly decreases performance under distributional shift and correcting for this bias leads to large improvements in few-shot classification accuracy. We focus particularly on how the data itself can be manipulated or exploited to aid the learning process.
2 Related Work
2.1 Few-Shot Learning
A common methodology for evaluating few-shot image classification approaches is to pre-train a model on a corpus of labelled images and then test the model’s classification ability on unseen classes, given a small amount of labelled data from those unseen classes. The labelled data from the unseen classes is typically termed the support set, and the images on which classification is tested are termed the query set. A wide range of approaches based on this methodology have been developed Finn et al. (2017); Vinyals et al. (2016); Rusu et al. (2019); Lee et al. (2019).
One distinction between such approaches is that of parametric versus non-parametric methods. Parametric methods pre-train a model and, when presented with the support set, will fine-tune the parameters of the pre-trained model to improve performance on the query set. One such parametric approach is model-agnostic meta-learning (MAML) Finn et al. (2017), which uses second-order gradients to learn an initialization that can be fine-tuned on a given support set. The resulting model can then perform classification on a corresponding query set. At the time of writing, the best performing parametric approach is classifier synthesis learning (CASTLE) Ye et al. (2019), which synthesizes few-shot classifiers based on a shared neural dictionary across classes, and then combines these synthesized classifiers with standard ’many-shot’ classifiers.
Non-parametric methods perform no fine-tuning when given the support set. Instead, they learn an embedding function and an associated metric space over which classification of new images can be performed. Such approaches include prototypical networks Snell et al. (2017) and matching networks Vinyals et al. (2016). Snell et al. (2017) learn an embedding function that maps images to points in a latent space. For a support set, each class ‘prototype’ is represented by the mean embedding of that class’s examples in the support set. Each query image is then assigned to the class whose prototype is closest, in Euclidean distance, to the embedding of that image. Non-parametric approaches have shown marginally inferior performance to parametric approaches. However, they are far simpler to implement at both pre-training and test time.
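The prototype-based classification rule described above can be illustrated with a minimal sketch. It assumes embeddings are already available as plain lists of floats; the function names and the toy two-dimensional embeddings are hypothetical and are not taken from Snell et al. (2017).

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_by_prototype(support, query_embedding):
    """Return the label of the nearest class prototype.

    support: dict mapping class label -> list of support embeddings.
    """
    prototypes = {label: mean_vector(embs) for label, embs in support.items()}
    return min(prototypes, key=lambda label: euclidean(prototypes[label], query_embedding))

# Toy embeddings (illustrative values only).
support = {
    "cat": [[0.0, 0.0], [0.2, 0.0]],
    "dog": [[1.0, 1.0], [1.0, 0.8]],
}
prediction = classify_by_prototype(support, [0.9, 0.7])
```

No parameters are updated when the support set arrives; only the class means change, which is what makes the method non-parametric at test time.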
2.2 Texture Bias Of CNNs
Outside of the few-shot learning field, there has been great progress in the interrogation of the features learned by CNNs when performing classification. Until recently, it was widely believed that CNNs were able to recognize objects through learning increasingly complicated spatial features Krizhevsky et al. (2012). However, Geirhos et al. (2019) show that CNNs trained on the ImageNet dataset show extreme bias towards learning texture-based representations of images over shape-based representations. Furthermore, Landau et al. (1988) show that shape is the single most important feature that humans use when classifying images.
Brendel and Bethge (2019) train CNNs with constrained receptive fields, effectively limiting the learned features to high-frequency local features, such as texture. The resulting model achieves high test accuracies on ImageNet. This shows that texture-based features are adequate for good performance in standard image classification, where train and test classes are drawn from the same distribution.
Taken together, these results pose the question: why do CNNs learn to classify based almost entirely on texture, whereas humans rely mostly on shape? In this work, we consider a hypothesis: although high-frequency local features (such as texture) generalize well within a distribution, global low-frequency features (such as shape) generalize better under distributional shift. If this hypothesis were true, it could help explain why humans depend so heavily on shape-based features: they generalize better to new classes than texture-based features do.
3.1 Problem Formulation
In the typical few-shot classification formulation, a task consists of using a labelled support set to classify an unseen query set. The support and query sets are sampled from the same set of classes. During the pre-training phase, tasks are sampled from a set of training tasks, and at test time tasks are sampled from a separate set of test tasks. There is no class overlap between the training and test tasks. A k-shot n-way task contains images from n different classes, with k labelled images from each class. In this work we use the same training scheme, model and loss as Snell et al. (2017).
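As an illustration of this formulation, the sketch below samples one k-shot n-way task from a toy dataset. The function name, the `n_query` parameter (number of query images per class) and the image identifiers are hypothetical; the text does not specify how many query images each task contains.

```python
import random

def sample_task(images_by_class, n_way, k_shot, n_query, rng):
    """Sample one k-shot n-way task.

    images_by_class: dict mapping class label -> list of image identifiers.
    Returns (support, query), each a list of (image, label) pairs.
    """
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Draw disjoint support and query images for this class.
        chosen = rng.sample(images_by_class[label], k_shot + n_query)
        support += [(image, label) for image in chosen[:k_shot]]
        query += [(image, label) for image in chosen[k_shot:]]
    return support, query

# Toy dataset: 8 classes with 20 image identifiers each (illustrative only).
rng = random.Random(0)
toy = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(8)}
support, query = sample_task(toy, n_way=5, k_shot=5, n_query=3, rng=rng)
```

Keeping training and test classes disjoint is then a matter of calling the sampler on two dictionaries that share no keys.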
3.2 Stylized Pre-training
Geirhos et al. (2019) are able to remove the texture bias of CNNs by training on Stylized-ImageNet, which removes each image’s texture by using AdaIN style transfer Huang and Belongie (2017). For this work, we create an analogous dataset: Stylized-miniImageNet.
Our pre-training tasks are sampled from Stylized-miniImageNet with a fixed probability and from miniImageNet otherwise. By varying this probability, we can control the extent to which our model learns texture-based features as opposed to shape-based features. At test time all tasks are sampled from the withheld classes of standard miniImageNet.
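The mixing scheme can be sketched as a per-task coin flip. This is a minimal illustration, assuming the choice is made independently for each pre-training task; the function name and the `p_stylized` parameter are hypothetical labels for the sampling probability described above.

```python
import random

def choose_pretraining_source(p_stylized, rng):
    """Return which dataset the next pre-training task is drawn from."""
    return "stylized" if rng.random() < p_stylized else "standard"

# Empirical check of the mixing proportion for an example probability of 0.3.
rng = random.Random(0)
draws = [choose_pretraining_source(0.3, rng) for _ in range(10_000)]
stylized_fraction = draws.count("stylized") / len(draws)
```

Setting `p_stylized` to 0 recovers ordinary miniImageNet pre-training, and setting it to 1 pre-trains on stylized tasks only.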
3.3 Support & Query Data Augmentation
At test time, as the number of shots k in a task increases, the problem tends from few-shot towards standard many-shot classification and test accuracy increases dramatically. In light of this, we ‘artificially’ increase k by performing data augmentation on the support set: each image in the support set is augmented a fixed number of times. Our intention is to investigate the efficacy of emulating the high-k setting using augmented examples from the support set. The prototype of each class is then given by the mean embedding of the augmented copies of that class’s support images (equation 1). Each augmented copy is produced by a randomly sampled data augmentation function and mapped to the latent space by the learned embedding function, in our case a prototypical network Snell et al. (2017).
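A minimal sketch of the augmented prototype follows. It assumes the prototype averages only the augmented copies (whether the unaugmented image is also included is not stated here), and that a fresh augmentation is drawn for every copy. The names `augmented_prototype`, `toy_embed` and `toy_sample_augmentation` are hypothetical stand-ins, not the actual model or augmentations.

```python
import random

def augmented_prototype(support_images, embed, sample_augmentation, n_aug, rng):
    """Mean embedding over n_aug augmented copies of each support image."""
    embeddings = []
    for image in support_images:
        for _ in range(n_aug):
            augment = sample_augmentation(rng)  # draw a fresh random augmentation
            embeddings.append(embed(augment(image)))
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]

def toy_embed(image):
    """Stand-in embedding: (mean, max) of a 1-D 'image'."""
    return [sum(image) / len(image), max(image)]

def toy_sample_augmentation(rng):
    """Stand-in augmentation: random horizontal flip of a 1-D 'image'."""
    flip = rng.random() < 0.5
    return (lambda image: image[::-1]) if flip else (lambda image: list(image))

rng = random.Random(0)
class_support = [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]]
prototype = augmented_prototype(class_support, toy_embed, toy_sample_augmentation, n_aug=4, rng=rng)
```

With a real embedding network the augmented copies of an image would generally map to different points, so averaging over them smooths the prototype in the same way that additional genuine shots would.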
For the query set, we also augment each image a fixed number of times. Each of these augmented images is then passed through the embedding function, and the estimated probability that a given original query image, x, belongs to a given class is given by equations 2 and 3.
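One plausible reading of this query-side procedure is sketched below. It is an assumption rather than a reproduction of equations 2 and 3: each augmented copy gets a SoftMax over negative squared Euclidean distances to the prototypes, the distances are divided by a temperature, and the per-copy probabilities are averaged. All function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable SoftMax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_probabilities(embedding, prototypes, temperature):
    """SoftMax over negative squared distances to each prototype (assumed form)."""
    logits = [
        -sum((a - b) ** 2 for a, b in zip(embedding, proto)) / temperature
        for proto in prototypes
    ]
    return softmax(logits)

def tta_probabilities(augmented_embeddings, prototypes, temperature):
    """Average the per-copy class probabilities (assumed aggregation rule)."""
    per_copy = [class_probabilities(e, prototypes, temperature) for e in augmented_embeddings]
    n = len(per_copy)
    return [sum(column) / n for column in zip(*per_copy)]

# Toy prototypes and embeddings of three augmented copies of one query image.
prototypes = [[0.0, 0.0], [8.0, 8.0]]
copies = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = tta_probabilities(copies, prototypes, temperature=32.0)
```

Averaging probabilities rather than embeddings keeps each copy's prediction independent, so a single badly cropped copy cannot dominate the result.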
|Method||Backbone||Test Accuracy (%)|
|Matching Networks Vinyals et al. (2016)||Conv-64||55.3 ± 0.7|
|MAML Finn et al. (2017)||Conv-32||63 ± 1|
|TADAM Oreshkin et al. (2018)||ResNet-12||76.7 ± 0.3|
|ProtoNet (Baseline) Snell et al. (2017)||ResNet-12||76.8 ± 0.3|
|LEO Rusu et al. (2019)||WRN-28-10||77.6 ± 0.1|
|MetaOptNet Lee et al. (2019)||ResNet-12||78.6 ± 0.5|
|CASTLE Ye et al. (2019)||ResNet-12||79.52 ± 0.02|
Table 1 shows the performance of our training scheme when testing on miniImageNet for the 5-shot 5-way task. Our method is entirely non-parametric and far simpler to implement at both pre-train and test time than many of the other few-shot classification methods.
Geirhos et al. (2019) show that training on a combination of unstylized and stylized data leads to a small drop in classification accuracy. This is because when the training and testing data are drawn from the same distribution (i.e. the same classes) the inherent texture bias of CNNs can actually aid performance.
However, in the few-shot learning setting, testing data is drawn from a different distribution to the training data. The ablation in Table 2 shows that pre-training on a combination of unstylized and stylized data actually increases performance at test time, even though the testing data is entirely unstylized. This suggests that the texture bias of CNNs adversely impacts performance under distributional shift.
|Unstylized Data||Stylized Data||Support TTA||Query TTA||Test Accuracy|
Figure 2 shows that as the proportion of Stylized-miniImageNet pre-training data increases from 0 to 0.3, the accuracy increases as the resulting model is less biased towards texture-based features. However, as the proportion of Stylized-miniImageNet increases from 0.3, the accuracy begins to decrease again as the final model is biased too heavily towards shape-based features.
It has previously been demonstrated that CNNs are biased towards learning texture-based features over the shape-based features used in human vision. Although this bias aids classification performance when training and testing classes are drawn from the same distribution (standard image classification), this work shows that the texture bias of CNNs significantly decreases classification performance in the few-shot learning setting, where distributional shift is experienced. Correcting for this bias achieves state-of-the-art performance on miniImageNet, even using only a simple non-parametric method.
- Brendel and Bethge (2019) Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet.
- Finn et al. (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML.
- Geirhos et al. (2019) ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In ICLR.
- He et al. (2016) Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Huang and Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. 2017 IEEE International Conference on Computer Vision (ICCV).
- Kingma and Ba (2014) Adam: a method for stochastic optimization.
- Krizhevsky et al. (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105.
- Landau et al. (1988) The importance of shape in early lexical learning. Cognitive Development 3 (3), pp. 299–321.
- Lee et al. (2019) Meta-learning with differentiable convex optimization. In CVPR.
- Oreshkin et al. (2018) TADAM: task dependent adaptive metric for improved few-shot learning.
- Rusu et al. (2019) Meta-learning with latent embedding optimization.
- Snell et al. (2017) Prototypical networks for few-shot learning.
- Tan and Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In ICML.
- Vinyals et al. (2016) Matching networks for one shot learning. In NIPS.
- Ye et al. (2019) Learning classifier synthesis for generalized few-shot learning.
In this work we use a prototypical network Snell et al. (2017) with a standard ResNet-12 backbone He et al. (2016). For training we use the Adam optimizer Kingma and Ba (2014), starting from a fixed initial learning rate. We do not initialize from any pre-trained weights and our model is trained for 70,000 steps, with the learning rate halving every 15,000 steps. We perform early stopping based on validation accuracy. We use a temperature of 32 in the SoftMax function in equation 2. We test models over 20,000 sampled few-shot tasks and report the mean accuracy and 95% confidence intervals above. We use no dropout, weight decay or label smoothing.
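The step-wise schedule and the reported confidence intervals can be sketched as follows. This is an illustrative sketch under two assumptions: `base_lr` is a hypothetical parameter standing in for the initial learning rate, and the 95% interval uses the usual normal approximation (1.96 standard errors), which is not confirmed by the text.

```python
import math

def learning_rate(step, base_lr, halve_every=15_000):
    """Learning rate after halving once every `halve_every` training steps."""
    return base_lr * 0.5 ** (step // halve_every)

def mean_and_ci95(accuracies):
    """Mean accuracy and half-width of a normal-approximation 95% interval."""
    n = len(accuracies)
    mean = sum(accuracies) / n
    variance = sum((a - mean) ** 2 for a in accuracies) / (n - 1)  # sample variance
    half_width = 1.96 * math.sqrt(variance / n)
    return mean, half_width

# Over a 70,000-step run the rate is halved four times (at 15k, 30k, 45k, 60k).
final_factor = learning_rate(69_999, base_lr=1.0)

# Toy per-task accuracies (illustrative values only).
mean_acc, ci = mean_and_ci95([0.8, 0.6, 0.8, 0.6])
```

Because the half-width shrinks with the square root of the number of tasks, evaluating over 20,000 tasks gives intervals far tighter than a few hundred tasks would.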
For the generation of Stylized-miniImageNet, we began with miniImageNet and used the same stylization procedure as Geirhos et al. (2019). When the stylization coefficient that Geirhos et al. (2019) use on ImageNet was applied to the smaller images of miniImageNet, it led to images that were so distorted that even humans were unable to classify them successfully. For this reason, we generated Stylized-miniImageNet with a less aggressive stylization coefficient.
To ensure diversity of styles and true independence of texture and underlying image, we generate 10 stylized images for each original miniImageNet image. The stylization was performed only on the train split of miniImageNet as testing and validation were both done on standard miniImageNet.
At train and test time, we apply a standard set of data augmentations: random horizontal flip, random brightness jitter, random contrast jitter, random saturation jitter and random crop between 70% and 100% of the original image size. The final image is resized to 84x84 pixels.
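The geometric part of this pipeline can be sketched by sampling the parameters of one augmentation. The sketch assumes the 70% to 100% range applies to the side length of the crop (it could equally refer to area), and it omits the colour jitter because the jitter magnitudes are not given. The function name and dictionary keys are hypothetical.

```python
import random

def sample_augmentation_params(width, height, rng, min_scale=0.7, output_size=84):
    """Sample flip and crop parameters for one augmented copy of an image."""
    scale = rng.uniform(min_scale, 1.0)          # assumed: scale of each side
    crop_w = max(1, round(width * scale))
    crop_h = max(1, round(height * scale))
    left = rng.randint(0, width - crop_w)        # inclusive bounds
    top = rng.randint(0, height - crop_h)
    return {
        "flip": rng.random() < 0.5,
        "crop_box": (left, top, left + crop_w, top + crop_h),
        "output_size": (output_size, output_size),
    }

rng = random.Random(0)
params = sample_augmentation_params(width=120, height=100, rng=rng)
```

The returned crop box and output size can be passed to any image library's crop and resize routines; the sampling itself is independent of the library used.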