A Close Look at Deep Learning with Small Data


In this work, we perform a wide variety of experiments with different deep learning architectures on datasets of limited size. Our study shows that model complexity is a critical factor when only a few samples per class are available. In contrast to the literature, we show that in some configurations the state of the art can be improved using low-complexity models. For instance, in problems with scarce training samples and without data augmentation, low-complexity convolutional neural networks perform comparably to or better than state-of-the-art architectures. Moreover, we show that even standard data augmentation can boost recognition performance by large margins. This result suggests that more complex data generation/augmentation pipelines are worth developing for cases where data is limited. Finally, we show that dropout, a widely used regularization technique, remains a good regularizer even when data is scarce. Our findings are empirically validated on sub-sampled versions of the popular CIFAR-10, Fashion-MNIST, and SVHN benchmarks.

I Introduction

The popularity of machine learning (ML) has rapidly increased thanks to the success of deep learning [26]. In particular, convolutional neural network (CNN) architectures have achieved considerable success in a wide range of computer vision tasks including object classification [6], object detection [19], and semantic segmentation [15], to cite just a few. The two main ingredients that have favored the rise of this type of algorithm are i) networks with deeper structure and ii) the use of large annotated datasets. The latter requirement cannot always be fulfilled. Obtaining and labeling data is needed to achieve strong results, but this process might be extremely expensive or not possible at all. For instance, in the medical field, high-quality annotations by radiology experts are often costly and not manageable at large scales [13]. Several sub-domains of ML try to mitigate the need for training data, tackling the problem from different perspectives. Transfer learning aims at learning representations in one domain and transferring the learned features to a closely related target domain [18], [3]. Similarly, few-shot learning uses a base set of labelled pairs to generalize from a scarce support set of target classes [24]. Both approaches have gained much attention in the community but still require a large source of annotated data. Furthermore, the target domain must be related to the source domain. Another research direction, known as self-supervised learning, tries to reduce the demand for annotations. Usually, a large pool of images is used to teach a CNN to solve a pretext task [9]. This task does not require human annotations and is conceived to teach general visual features that can be transferred to the downstream task. In this manner, costly human annotations are not needed, but there is still the problem of collecting many images.
In general, we would like to have systems that can recognize objects from just a few exemplars.

In this work, we present a detailed empirical study of deep learning models in the small data regime. Similarly to what has been done in [2] and [1], we benchmark the approaches by varying the number of data points in the training sample while keeping it low with respect to the current standards of computer vision datasets. Such a scenario also corresponds to the early stages of an application that collects data over time and aims to perform the task in the best possible way before it has collected large amounts of data. For example, a robot involved in an interactive activity with people needs to recognize human actions or behaviour [27], [23]. Due to its inherent difficulty, not many past works have studied this problem. Indeed, it is very hard to train function approximators when few multi-dimensional points are available to interpolate. Any ML model is highly prone to overfitting the training dataset, especially if its complexity is too high for the task at hand. CNNs have proved to be extraordinarily resistant to overfitting, although the number of trainable parameters is usually much greater than the number of data points [10]. In this paper, we show that model complexity is a critical factor in small-data domains and that small nets can be better than big ones in scenarios with limited training samples. A large number of works have proposed techniques to increase the generalization capabilities of deep networks. For instance, data augmentation is commonly used to fight the data scarcity problem and to increase generalization [21]. We perform our analysis with and without data augmentation to understand its effectiveness when the dataset size is limited. In our experiments, we show that standard augmentation pipelines improve recognition performance by large margins. Clearly, the augmentation type should be carefully designed since its success is highly correlated with the image type.
Moreover, dropout is an extremely popular technique that regularizes neural networks [22] by randomly dropping units to prevent co-adaptation and to favor generalization through ensembles. In this paper, we show that dropout is a good regularizer even when data is scarce, although it is slightly less effective when the number of samples per class is extremely low (e.g. ). In summary, the contributions of this paper are the following: 1) we perform a large set of experiments with different quantities of training data and deep architectures over three popular computer vision benchmarks; 2) we show that for small-data problems, low-complexity CNNs are comparable to or better than high-complexity ones depending on the training set size and the use of data augmentation; 3) we demonstrate that standard data augmentation consistently improves the testing accuracy of deep networks trained with few samples; 4) we show that dropout proves to be a good regularizer even in these small-data settings.

Model Layer-1 Layer-2 Layer-3 Layer-4
CNN-lc 8 16 32 64
CNN-mc 16 32 64 128
CNN-hc 32 64 128 256
TABLE I: In this table, we give more details regarding the computational complexity of the tested networks. We show the number of convolutional filters per layer of the standard CNNs used. Further, we report for each model and dataset the number of trainable parameters (PARAMS) and floating-point operations (FLOPs), two standard metrics to measure model complexity.

II Related Work

Learning from datasets of limited size is extremely challenging and, for this reason, largely unsolved. As previously said, few works have tried to tackle the problem of training deep architectures with a small number of samples due to the difficulty of generalizing to novel instances.
We start by mentioning a series of works that focused on the classification of vector data and mainly used the UCI Machine Learning Repository as a benchmark. In [5] the authors showed the superiority of random forests over a large set of classifiers including feed-forward networks. Later, [17] used a linear program to empirically decompose fitted neural networks into ensembles of low-bias sub-networks. They showed that these sub-networks were relatively uncorrelated, which led to an internal regularization process similar to what happens in random forests, obtaining comparable results. More recently, [1] proposed the use of neural tangent kernel (NTK) architectures in low data tasks and obtained significant improvements over all previously mentioned classifiers.
None of the previous works tested CNNs since inputs were not images. When the input dimensionality increases, the classification task inevitably becomes more complex. For this reason, a straightforward approach to improve generalization is to implement techniques that try to synthesize new images through different transformations (e.g. data augmentation [21]). Some previous knowledge regarding the problem at hand might turn out to be useful in some cases [8]. However, this makes data augmentation techniques not always generalizable to all possible image classification domains. It has also been proposed to train generative models (e.g. GANs) to increase the dataset size and, consequently, performance [14]. Generating new images to improve performance is extremely attractive and effective. Yet, training a generative model might be computationally intensive or present severe challenges in the small sample domain. In our work, we use standard data augmentation and do not focus on approaches that improve image synthesis. We show that even standard augmentation can be highly effective in improving recognition performance and confirm that data augmentation/generation is a promising approach for datasets of limited size.
[20] suggested training CNNs with a greedy layer-wise method, analogous to that used in unsupervised deep networks, and showed that their method could learn more interpretable and cleaner visual features. [2] proposed the use of the cosine loss to prevent overfitting, claiming that the L2 normalization involved in the cosine loss is a strong, hyper-parameter-free regularizer when samples are scarce. They obtained the best results on fine-grained datasets that have between and samples per class. On the other hand, [1] performed experiments with convolutional neural tangent kernel (CNTK) networks on small CIFAR-10, showing the superiority of CNTK in comparison to a ResNet-34 model. Finally, [4] studied the generalization performance of deep networks as the size of the training set varies. They found that even larger networks can handle overfitting and obtain comparable or better results than smaller nets if properly optimized and calibrated.

III Small-Data Classification Problem

In this section, we first define our classification problem with limited data and then describe the deep learning architectures studied in our experiments. Finally, we describe the datasets and regularization techniques used in this work.

III-A Problem definition

As in a standard supervised ML problem, given a distribution of training images and a distribution of labels, our objective is to learn a classifier, parametrized by a set of variables, that maps any image to its corresponding label. We suppose that the label distribution comprises a fixed number of classes and that the training dataset, sampled from the two distributions, contains the same number of images per class.

Differently from the standard computer vision datasets, we choose to keep the number of samples per class low. Although the notion of low is highly subjective, we follow two different sub-sampling protocols in our experiments. First, to directly compare our tested models with the networks proposed in [1], the number of samples per class is doubled each time, starting from 1 up to 128. Second, since we were also interested in configurations with more data, we also varied it in the set {10, 20, 40, 80, 160, 320, 640, 1280} (the column headers of Tab. II), corresponding to a ten-fold increase of samples per class with respect to the first protocol. In this manner, it is easy to benchmark the capabilities of the models as the number of samples per class increases. In the results section, we specify whether we are using the first or the second sub-sampling protocol.
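The two doubling schedules can be sketched as follows. This is a minimal illustration, assuming the larger protocol matches the column headers of Tab. II (10 to 1280 samples per class) and the smaller one is a tenth of it, as the ten-fold relation in the text implies; the helper name `doubling_protocol` is hypothetical.

```python
def doubling_protocol(start, steps):
    """Return `steps` samples-per-class values, doubling from `start`."""
    return [start * 2 ** i for i in range(steps)]


# Larger protocol: assumed to match the column headers of Tab. II.
large_protocol = doubling_protocol(10, 8)

# Smaller protocol: assumed to be one tenth of the larger one.
small_protocol = [n // 10 for n in large_protocol]

print(large_protocol)
print(small_protocol)
```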

As previously said, such a setting arises naturally when data is collected over time. Understanding which model is the most effective before large quantities of data are available is an important research question.

In general, there is no specific limit on the number of classes. In our work, we have chosen three problems with ten classes each.

III-B Models

Since we are interested in comparing deep models of different complexity, we define three CNNs with increasing filter widths and a more complex ResNet-20.
 CNN. We test a standard CNN architecture made of convolutional and max-pooling layers as feature extractors. The entire network minimizes the standard classification loss:


More details about the structure will follow.
ResNet-20. Residual networks were introduced in [7] to improve the training of very deep neural networks. The basic learning block of such networks is the residual block. ResNets are made of different blocks of stacked layers (convolutions with non-linear activation functions and batch normalization). Shortcut connections join the input of each block to its output. This addition helps gradients to flow backward and eases the training of the overall network. ResNet-20 minimizes the same loss described in Eq. 1.
Models structure. The CNNs process the input image through four convolutional layers. The kernel is of size and moves with stride on both width and height directions. On the other hand, the max-pooling layers have a pool-size and stride of . The output of the convolutional feature extractor is flattened. Then, a feed-forward layer maps the extracted features to the requested output dimensionality .
We do not describe the full structure of ResNet-20, which can be found in [7]. As for the kernel widths of ResNet-20, we set them to , and as in the default CIFAR-10 network proposed in [7].
Models complexity. We control the complexity of the CNNs through the number of filters in the convolutional layers. We fix the lowest model complexity for the CNN by setting a number of filters for the first convolutional layer. Then, for each deeper layer, we simply double that number in order to increase complexity. We define a low, medium, and high complexity CNN that we call CNN-lc, CNN-mc, and CNN-hc. The three architectures have respectively 8, 16, and 32 base filters. A detailed description can be found in Tab. I. We also report the complexity in terms of parameters and floating-point operations (FLOPs) of all networks.
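The doubling rule and its effect on the size of the convolutional stack can be illustrated with a short sketch. The base filter counts come from Tab. I; the kernel size and the number of input channels are passed as arguments, and the values used below (3 and 3) are placeholders rather than the settings of the paper. The count covers only convolutional weights and biases, not the final feed-forward layer, so it is not meant to reproduce the PARAMS figures of Tab. I.

```python
def filter_widths(base_filters, n_layers=4):
    """Filters per layer when the width doubles at every deeper layer."""
    return [base_filters * 2 ** i for i in range(n_layers)]


def conv_params(in_channels, widths, kernel_size):
    """Weights plus biases of a stack of square-kernel convolutional layers."""
    total = 0
    for out_channels in widths:
        total += kernel_size * kernel_size * in_channels * out_channels
        total += out_channels  # one bias per filter
        in_channels = out_channels
    return total


# Base filters taken from Tab. I.
cnn_widths = {name: filter_widths(base)
              for name, base in [("CNN-lc", 8), ("CNN-mc", 16), ("CNN-hc", 32)]}

for name, widths in cnn_widths.items():
    # kernel_size=3 and in_channels=3 are placeholder assumptions.
    print(name, widths, conv_params(3, widths, kernel_size=3))
```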
Training setup. All standard CNNs have been optimized with Adam and default parameters [11].
As for ResNet-20, we slightly modified the training policy originally proposed in [7]. The optimizer used is stochastic gradient descent (SGD) with weight decay and Nesterov momentum respectively set to and . We start with a learning rate of and decrease it by an order of magnitude after 75% of the total number of iterations. We increased the number of iterations with the initial learning rate to be sure of decreasing it after having reached the training loss plateau. Furthermore, we noticed that decreasing the learning rate a second time did not further improve the testing performance. The network was fed with a mini-batch of images. We preferred a smaller batch with respect to the original since it led to better performance. We also trained the other CNNs with a batch size of for the same reason. Due to different training set sizes (including the use of data augmentation), we train the networks for a different number of epochs. For the sub-sampling protocol with , assuming that the first element of the set corresponds to datasets with samples per class and the last one to , architectures are trained for .
For the other sub-sampling protocol () and without data augmentation, models are trained for epochs. When augmentation is used, we slightly increase the number of epochs to be sure that models have converged (). Note that standard CNNs require far fewer iterations than ResNet-20 to converge because of their inherently lower complexity and the optimizer used. Indeed, Adam is known to be faster than SGD in terms of convergence speed. To be sure that overfitting did not influence our results, for each run, the maximum value of the accuracy scored on the test set was considered. We mainly use the categorical cross-entropy as classification loss (). When specified, we perform some comparisons with the cosine loss proposed in [2].
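The single-step schedule described above (one tenfold decrease after 75% of the iterations) can be sketched as a plain function. The function name `step_lr` and the base learning rate used in the example are placeholders; only the 75% threshold and the order-of-magnitude drop come from the text.

```python
def step_lr(iteration, total_iterations, base_lr,
            drop_fraction=0.75, factor=0.1):
    """Learning rate cut by `factor` once `drop_fraction` of training is done."""
    if iteration >= drop_fraction * total_iterations:
        return base_lr * factor
    return base_lr


# Example with a placeholder base learning rate of 0.1 and 1000 iterations.
schedule = [step_lr(i, 1000, 0.1) for i in range(1000)]
print(schedule[0], schedule[749], schedule[750])
```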
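For reference, the two losses can be written for a single sample as below. This is a sketch under the usual definitions: the categorical cross-entropy is the negative log of the softmax probability of the true class, and the cosine loss of [2] is one minus the cosine similarity between the network output and the one-hot target. The function names and the small epsilon guard are illustrative choices.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class index `target`."""
    return -math.log(softmax(logits)[target])


def cosine_loss(outputs, target, eps=1e-12):
    """1 - cosine similarity between `outputs` and the one-hot target.

    The one-hot vector has unit norm, so the similarity reduces to the
    target component of the L2-normalized output.
    """
    norm = math.sqrt(sum(o * o for o in outputs))
    return 1.0 - outputs[target] / max(norm, eps)


print(cross_entropy([2.0, 0.5, -1.0], 0))
print(cosine_loss([2.0, 0.5, -1.0], 0))
```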

III-C Datasets

We perform our experiments on three popular computer vision datasets, namely CIFAR-10 [12], FMNIST (Fashion-MNIST) [25] and SVHN [16].
FMNIST is a popular dataset comprised of black and white images of dimension . The total number of categories is and each class belongs to fashion items including dresses, trousers, sandals, etc. The dataset has originally training images and testing images with both sets balanced.
CIFAR-10 is an established computer vision benchmark consisting of color images () coming from different classes of objects and animals. The sizes of training and test splits of CIFAR-10 are equal to the ones of FMNIST.
Finally, SVHN is a real-world image dataset semantically similar to MNIST since it contains images of digits. However, it is significantly harder because recognizing numbers in natural scene images constitutes a more complex task. The cropped-version sets used originally have training and testing images of dimension .
We have chosen datasets made of ten classes and relatively small images since the restrictions imposed on the number of samples per class make the problem already challenging. As previously anticipated, since all three original datasets contain several samples per class (many more than our defined ), we build, for each dimension, a sub-sampled version of the original dataset. We will refer to any down-sampled version of any dataset as . For instance, sCIFAR-10-20 indicates the down-sampled version of CIFAR-10 with twenty samples per class.
The testing sets remain fixed to the original sizes throughout all the evaluations. To ensure consistency of the results, we perform runs for each sub-sampled version of each dataset.
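A class-balanced sub-sampling step of this kind can be sketched as follows. The sketch assumes that a fresh subset is drawn for every run with a different seed, which the text does not state explicitly; the function name, the toy label list, and the number of runs (3) are placeholders.

```python
import random
from collections import defaultdict


def subsample_per_class(labels, n_per_class, seed):
    """Return sorted indices of a subset with `n_per_class` items per label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        # random.sample raises ValueError if a class has too few items.
        chosen.extend(rng.sample(by_class[label], n_per_class))
    return sorted(chosen)


# Toy label list standing in for a real training set: 10 classes, 50 each.
toy_labels = [c for c in range(10) for _ in range(50)]

# One subset of 20 samples per class for each run (3 runs as a placeholder).
subsets = [subsample_per_class(toy_labels, 20, seed) for seed in range(3)]
print([len(s) for s in subsets])
```

The test split would be left untouched, as stated above; only the training indices are sub-sampled.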

III-D Regularization techniques

We study the influence of two popular regularization techniques for some of our experiments with datasets of limited size.
Dropout. We use dropout in the last layer of CNN-lc, CNN-mc and CNN-hc. We vary the dropping-rate probabilities in the set {0.0, 0.4, 0.7} (the values listed in the Dropout column of Tab. II), corresponding to absent, medium, and high dropout. We will show that high rates help to prevent overfitting except for rare cases.
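As an illustration of what a dropping rate means in practice, the sketch below applies dropout to a list of activations. It uses the common "inverted" formulation, in which surviving units are rescaled during training so that nothing changes at test time; this is an assumption about the implementation, not a detail given in the text.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`,
    and rescale the surviving units by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)
features = [1.0] * 10
for rate in (0.0, 0.4, 0.7):  # rates listed in the Dropout column of Tab. II
    print(rate, dropout(features, rate, rng))
```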
Data augmentation. Artificially increasing the images of a dataset by applying data augmentation techniques has been shown to improve the generalization capabilities of deep models even when a great quantity of data is available [21].
The transformations applied to the images should not modify the semantics needed to perform the classification task. To this end, we applied three different sets of augmentations to the three datasets. Inputs are augmented at each epoch, constituting a non-static dataset.
For CIFAR-10 images we used the standard augmentation pipeline also used in [7]. More precisely, each image is padded 4 pixels on each side and a crop is randomly sampled from the padded image or its horizontal flip.
FMNIST objects present small variations in orientation and scale with respect to the ones of CIFAR-10. In this case, we padded the images with 2 pixels per side and cropped a window of pixels. We have not applied horizontal flipping.
Finally, digits in SVHN undergo noticeable changes in terms of illumination and contrast. For this reason, we adjusted image brightness and contrast by random factors. The first one was contained in while the second one in between and . Further, we applied the cropping policy of CIFAR-10 without considering the horizontal flipping that could change the numbers semantics.
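The pad-and-crop policy shared by the three pipelines can be sketched on a single-channel image stored as a list of rows. The sketch assumes zero padding, a crop of the same size as the original image, and a flip probability of 0.5; these are common defaults rather than details confirmed by the text. Horizontal flipping can be disabled, as done for FMNIST and SVHN.

```python
import random


def pad_crop_flip(image, pad, rng, allow_flip=True):
    """Zero-pad `image` by `pad` pixels per side, take a random crop of the
    original size, and optionally mirror it horizontally (probability 0.5)."""
    h, w = len(image), len(image[0])
    zero_row = [0] * (w + 2 * pad)
    padded = ([zero_row[:] for _ in range(pad)]
              + [[0] * pad + list(row) + [0] * pad for row in image]
              + [zero_row[:] for _ in range(pad)])
    top = rng.randint(0, 2 * pad)    # inclusive bounds
    left = rng.randint(0, 2 * pad)
    crop = [row[left:left + w] for row in padded[top:top + h]]
    if allow_flip and rng.random() < 0.5:
        crop = [row[::-1] for row in crop]
    return crop


rng = random.Random(0)
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
augmented = pad_crop_flip(toy_image, pad=2, rng=rng, allow_flip=False)
for row in augmented:
    print(row)
```

Calling the function again at every epoch yields a different view of the same image, which is what makes the dataset non-static.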

IV Results

Model Augmentation Dropout sCIFAR10-10 sCIFAR10-20 sCIFAR10-40 sCIFAR10-80 sCIFAR10-160 sCIFAR10-320 sCIFAR10-640 sCIFAR10-1280
CNN-lc no 0.0 0.271 0.324 0.361 0.418 0.453 0.501 0.541 0.594
CNN-lc no 0.4 0.297 0.338 0.382 0.435 0.471 0.528 0.572 0.617
CNN-lc no 0.7 0.297 0.349 0.400 0.449 0.494 0.539 0.581 0.623
CNN-mc no 0.0 0.285 0.343 0.388 0.430 0.486 0.537 0.588 0.633
CNN-mc no 0.4 0.297 0.349 0.399 0.454 0.509 0.558 0.610 0.662
CNN-mc no 0.7 0.315 0.362 0.413 0.471 0.519 0.577 0.628 0.676
CNN-hc no 0.0 0.301 0.342 0.391 0.447 0.506 0.561 0.612 0.659
CNN-hc no 0.4 0.317 0.361 0.408 0.465 0.520 0.580 0.635 0.688
CNN-hc no 0.7 0.319 0.370 0.425 0.481 0.539 0.595 0.648 0.698
CNN-hc yes 0.7 0.325 0.382 0.455 0.526 0.603 0.677 0.730 0.772
ResNet-20 no - 0.233 0.290 0.319 0.385 0.447 0.513 0.623 0.715
ResNet-20 yes - 0.266 0.339 0.385 0.469 0.603 0.710 0.794 0.848

Model Augmentation Dropout sFMNIST-10 sFMNIST-20 sFMNIST-40 sFMNIST-80 sFMNIST-160 sFMNIST-320 sFMNIST-640 sFMNIST-1280
CNN-lc no 0.0 0.711 0.740 0.778 0.810 0.836 0.854 0.867 0.882
CNN-lc no 0.4 0.712 0.761 0.798 0.819 0.842 0.857 0.871 0.887
CNN-lc no 0.7 0.713 0.756 0.787 0.816 0.833 0.853 0.870 0.879
CNN-mc no 0.0 0.725 0.755 0.790 0.820 0.844 0.863 0.880 0.891
CNN-mc no 0.4 0.724 0.761 0.796 0.829 0.847 0.866 0.881 0.896
CNN-mc no 0.7 0.725 0.769 0.799 0.829 0.849 0.868 0.882 0.894
CNN-hc no 0.0 0.719 0.759 0.801 0.823 0.851 0.868 0.886 0.895
CNN-hc no 0.4 0.722 0.763 0.802 0.830 0.850 0.869 0.885 0.898
CNN-hc no 0.7 0.733 0.774 0.805 0.832 0.853 0.871 0.888 0.899
CNN-hc yes 0.7 0.739 0.777 0.812 0.844 0.865 0.880 0.896 0.908
ResNet-20 no - 0.623 0.714 0.770 0.804 0.841 0.869 0.892 0.905
ResNet-20 yes - 0.709 0.757 0.795 0.832 0.862 0.890 0.905 0.919

Model Augmentation Dropout sSVHN-10 sSVHN-20 sSVHN-40 sSVHN-80 sSVHN-160 sSVHN-320 sSVHN-640 sSVHN-1280
CNN-lc no 0.0 0.253 0.375 0.505 0.641 0.730 0.777 0.809 0.836
CNN-lc no 0.4 0.284 0.449 0.590 0.702 0.770 0.803 0.833 0.858
CNN-lc no 0.7 0.288 0.467 0.606 0.721 0.777 0.813 0.8323 0.862
CNN-mc no 0.0 0.268 0.399 0.534 0.686 0.755 0.796 0.832 0.855
CNN-mc no 0.4 0.296 0.433 0.643 0.721 0.783 0.822 0.848 0.873
CNN-mc no 0.7 0.277 0.458 0.641 0.746 0.797 0.832 0.862 0.882
CNN-hc no 0.0 0.249 0.375 0.555 0.677 0.745 0.801 0.840 0.861
CNN-hc no 0.4 0.275 0.456 0.631 0.736 0.793 0.827 0.857 0.881
CNN-hc no 0.7 0.288 0.448 0.647 0.744 0.795 0.840 0.863 0.886
CNN-hc yes 0.7 0.344 0.582 0.766 0.832 0.874 0.895 0.910 0.924
ResNet-20 no - 0.203 0.400 0.547 0.741 0.835 0.867 0.895 0.922
ResNet-20 yes - 0.314 0.574 0.746 0.830 0.879 0.905 0.923 0.938
TABLE II: Average testing accuracy for all tested models considering the sub-sampling protocol with and . The best value per column is reported in bold. Note that only in some cases we used data augmentation. The numbers below the Dropout column stand for the probability of dropping units for the dropout layer. We do not report standard deviations to improve table readability. If needed, standard deviations are qualitatively shown in the figures that analyze the proposed comparisons of our paper.

IV-A Influence of models complexity on performance

Fig. 1: Comparison between networks with different complexities. The simpler CNNs are all equipped with dropout (dropping rate equal to ). In these experiments, data augmentation is not used. Results, over the test sets, are in terms of accuracy, averaged over runs.

First, we analyze the impact of model complexity in our small-data classification problems. To this end, we test the three standard CNNs with increasing complexity (CNN-lc, CNN-mc, and CNN-hc) and ResNet-20, which has the highest complexity in terms of FLOPs (see Tab. I for more details). In this case, all networks are trained with the cross-entropy loss. The results of this analysis are shown in Fig. 1. Note that for this analysis, data augmentation is not used. The standard CNNs consistently outperform ResNet-20 in the sCIFAR10 problem when is smaller than samples. ResNet scores the best accuracy out of the four models only when . The difference between ResNet-20 and CNN-hc is roughly for sCIFAR10-10 up to sCIFAR10-320. For larger training sets, the difference sharply decreases until it becomes negative. It is interesting to note that CNN-lc also keeps a consistent gap up to sCIFAR10-320 although it has roughly two orders of magnitude fewer FLOPs than ResNet-20. Similar behavior can be noted in sFMNIST and sSVHN, with the only difference that the most complex model increases its performance more rapidly. This is probably due to the intrinsic difficulty of the classification problem. When the training and testing distributions are more similar (e.g. sFMNIST or sSVHN), the addition of training samples helps the classifier at a faster rate. In sFMNIST and sSVHN, ResNet starts to have a comparable accuracy when the number of samples per class is, respectively, at least and . The gap reaches a net difference in terms of average accuracy of around for sSVHN-40 and sFMNIST-10. As for the comparison between the CNNs, we note that results are comparable although there is a significant difference in terms of model complexity (i.e. CNN-hc has more than one order of magnitude more parameters than CNN-lc).
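The sCIFAR10 gap discussed above can be recomputed directly from Tab. II. The sketch assumes that the first CNN-hc row with dropout 0.7 and the first ResNet-20 row are the runs without data augmentation, which is consistent with the setting of Fig. 1.

```python
sizes = [10, 20, 40, 80, 160, 320, 640, 1280]

# sCIFAR10 accuracies copied from Tab. II (rows assumed to be non-augmented).
cnn_hc = [0.319, 0.370, 0.425, 0.481, 0.539, 0.595, 0.648, 0.698]
resnet20 = [0.233, 0.290, 0.319, 0.385, 0.447, 0.513, 0.623, 0.715]

# Positive values mean CNN-hc is ahead of ResNet-20.
gaps = [round(c - r, 3) for c, r in zip(cnn_hc, resnet20)]

for n, gap in zip(sizes, gaps):
    print(f"sCIFAR10-{n}: {gap:+.3f}")
```

Under this reading, the gap stays between roughly 0.08 and 0.11 up to sCIFAR10-320, shrinks at sCIFAR10-640, and changes sign at sCIFAR10-1280.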

IV-B Influence of regularization techniques on performance

Fig. 2: Comparison between CNN-hc and ResNet-20 trained with and without data augmentation. CNN-hc is using dropout with a dropping rate of . Results are in terms of accuracy averaged over runs.
Fig. 3: Analysis regarding the influence of dropout on testing accuracy with models of different complexities and three levels of dropout (absent, medium, and high). All networks are trained without data augmentation. Results are in terms of average accuracy obtained over runs.

Here, we study how two popular regularization techniques, dropout and data augmentation, influence recognition performance when only a few samples are available.
Dropout. Fig. 3 shows the results of the three CNNs with different levels of dropout on the three datasets. Exact numbers are reported in Tab. II. For all three cases, dropout regularizes the dense layer of the classifier.
In sCIFAR-10, we note that high drop-rates () improve the generalization abilities of all three networks. The largest improvements over the case without dropout are obtained with CNN-lc (around ). However, CNN-hc also outperforms its variants with absent and medium dropout.
A similar picture can be seen for sSVHN. Here, improvements are even more noticeable when the training set is limited. For instance, CNN-mc with high dropout outperforms its version without dropout by in sSVHN-40. Improvements decrease as the training set increases, but they are still noticeable.
Finally, for sFMNIST the improvements are less pronounced, and the use of dropout does not seem to influence the final testing accuracy much. When present, improvements are very small. Dropout seems to be most effective when using CNN-hc and or samples per class.
To summarize, dropout maintains its role as a good regularizer even when the dataset has a limited size. Networks equipped with it either generalize better or perform comparably well, depending on the type of dataset.
Data augmentation. We report the results regarding the use of data augmentation in Fig. 2. Refer to Tab. II for the exact values. We have chosen to train two networks with the previously specified augmentation policies. The first one is CNN-hc, which generally scored the highest testing accuracy among the standard CNNs. The second one is ResNet-20, which is expected to improve noticeably with the help of data augmentation. Starting from sCIFAR-10, we note that data augmentation boosts recognition abilities in almost all cases, playing an important role from roughly samples per class. The maximum and minimum gains of CNN-hc correspond to and with and . On the other hand, net improvements of augmented ResNet-20 are and with and samples per class. It is interesting to notice that data augmentation allows ResNet-20 to match the CNN-hc testing accuracy with . We have previously seen in the analysis without augmentation that this happened only with at least samples per class.
Large gains are also obtained on sSVHN. However, the trend is different from the one seen on sCIFAR-10. Indeed, in this case, the greatest improvements are obtained with smaller training sets (). Augmented CNN-hc scores a maximum increase of with while ResNet-20 of with . Differently from sCIFAR-10, we also notice that augmentation does not give a clear advantage to ResNet-20. Indeed, the two architectures perform comparably well, except for larger datasets () where ResNet-20 scores improvements of around .
Finally, data augmentation in sFMNIST shows low gains for CNN-hc (maximum around ). On the other hand, ResNet-20 is still largely helped when data is very scarce. Testing accuracy is boosted by in the case of sFMNIST-10. As we have also seen in sCIFAR-10, here ResNet-20 can match CNN-hc performance when trained with smaller datasets. In this scenario, it happens with half of the samples, more precisely when is equal to .
To sum up, our experiments show that the success of data augmentation seems to be related to the dataset size and type. However, in many cases, testing accuracy has been boosted (up to margins), making even standard augmentation policies a must-use when dealing with a small dataset.

IV-C Comparison with the state of the art

Fig. 4: Comparison between CNN-hc and ResNet-20 trained with cross-entropy loss and other state-of-the-art techniques. Results are given in terms of accuracy averaged over runs. Models are trained without any kind of data augmentation.

Finally, in Fig. 4 we compare the CNN-hc and ResNet-20 trained with the cross-entropy loss to state-of-the-art techniques. More precisely, we use the sub-sampling protocol of CIFAR-10 originally proposed in [1] (, ) to compare our networks to CNTK. In this case, augmentation is turned off to match the procedure used in [1]. Further, we also train CNN-hc and ResNet-20 with the cosine loss that was proposed as a state-of-the-art loss function for deep learning with tiny datasets in [2].
We start by comparing CNN-hc and ResNet-20 trained with the two different losses. We note that the cosine loss makes the two architectures either comparable to or worse than networks trained with the cross-entropy. Indeed, the model trained with the cross-entropy loss performs better in many cases. We presume that the dataset type, size, and complexity of the model greatly influence the final performance. The experiments in [2] were run with high-capacity neural nets (i.e. ResNet-110, ResNet-50) on mainly fine-grained datasets with a greater number of classes. Furthermore, in all training set-ups, data augmentation was included. In the setup proposed by Arora et al. [1], the cosine loss does not outperform the cross-entropy. This setup is extremely hard for larger networks since the number of training samples is extremely low and data augmentation is not used. The obtained results are reasonable and somewhat expected. Indeed, ResNets trained with cross-entropy or cosine loss are the worst models. More studies regarding the use of the cosine loss should be performed to understand when it is actually advantageous over the cross-entropy.
Then, we compare CNN-hc and ResNet-20 with CNTK. As was already shown in [1], the CNTK architecture outperforms ResNets. In the original paper, the comparison was with a ResNet-34. We confirm the superiority of CNTK also over a smaller ResNet. However, we note that our CNN-hc clearly outperforms CNTK in all cases. The gains are around throughout all training splits with the maximum obtained with and minimum with .
To recap, when using the hard training protocol proposed in [1] with extremely limited samples and without data augmentation, the cosine loss does not give a clear advantage over the cross-entropy considering the two main architectures that we have used in this work. Moreover, the computationally simple CNN-hc outperforms CNTK in all sub-sampled versions of CIFAR-10 making low-complexity CNNs a viable choice in these extreme training conditions.

V Conclusions

In this work, we have tackled the problem of image classification with few samples per class. Although the results are still very far from the successes of high-capacity models with tons of data, we encourage the community to study and improve the capabilities of deep networks with tiny datasets. We have run several experiments and trained different flavors of deep architectures on three popular computer vision benchmarks.
In the experiments regarding the influence of model complexity on performance, we have shown that relatively simple networks can be less prone to overfitting and generalize better when facing small datasets. Therefore, architectures proposed in the future should be compared not only with state-of-the-art models (e.g. ResNet-34) but also with simpler networks. On the other hand, when data augmentation is used, deeper networks rapidly gain higher performance even with basic augmentation policies thanks to the artificial addition of training images.
Furthermore, we ran a wide analysis to test the benefits of using popular regularization techniques such as data augmentation and dropout in small data settings. We found that the former can induce large improvements in recognition capabilities depending on the dataset type and size. As for dropout, we have shown that it generally improves results and should be used as well. In future work, we would like to perform the proposed analysis on other datasets and implement more sophisticated data augmentation techniques.


Work partially funded by the EU Horizon 2020 research and innovation program under AI4EU project grant N. 825619.


  1. S. Arora, S. S. Du, Z. Li, R. Salakhutdinov, R. Wang and D. Yu (2020) Harnessing the power of infinitely wide deep nets on small-data tasks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
  2. B. Barz and J. Denzler (2020) Deep learning on small datasets without pre-training using cosine loss. In The IEEE Winter Conference on Applications of Computer Vision, pp. 1371–1380.
  3. Y. Bengio (2012) Deep learning of representations for unsupervised and transfer learning. In Proceedings of ICML workshop on unsupervised and transfer learning, pp. 17–36.
  4. J. Bornschein, F. Visin and S. Osindero (2020) Small data, big decisions: model selection in the small-data regime. Proceedings of the 37th International Conference on Machine Learning.
  5. M. Fernández-Delgado, E. Cernadas, S. Barro and D. Amorim (2014) Do we need hundreds of classifiers to solve real world classification problems?. The journal of machine learning research 15 (1), pp. 3133–3181.
  6. K. He, X. Zhang, S. Ren and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034.
  7. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
  8. G. Hu, X. Peng, Y. Yang, T. M. Hospedales and J. Verbeek (2017) Frankenstein: learning deep face representations using small data. IEEE Transactions on Image Processing 27 (1), pp. 293–303.
  9. L. Jing and Y. Tian (2020) Self-supervised visual feature learning with deep neural networks: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  10. K. Kawaguchi, L. P. Kaelbling and Y. Bengio (2017) Generalization in deep learning. arXiv preprint arXiv:1710.05468.
  11. D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.).
  12. A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images.
  13. G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken and C. I. Sánchez (2017) A survey on deep learning in medical image analysis. Medical image analysis 42, pp. 60–88.
  14. L. Liu, M. Muelly, J. Deng, T. Pfister and L. Li (2019) Generative modeling for small-data object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6073–6081.
  15. J. Long, E. Shelhamer and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440.
  16. Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning.
  17. M. Olson, A. Wyner and R. Berk (2018) Modern neural networks generalize on small data sets. In Advances in Neural Information Processing Systems, pp. 3619–3628.
  18. S. J. Pan and Q. Yang (2009) A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22 (10), pp. 1345–1359.
  19. J. Redmon, S. Divvala, R. Girshick and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788.
  20. D. Rueda-Plata, R. Ramos-Pollán and F. A. González (2015) Supervised greedy layer-wise training for deep convolutional networks with small datasets. In Computational Collective Intelligence - 7th International Conference, ICCCI 2015, Madrid, Spain, September 21-23, 2015. Proceedings, Part I, M. Núñez, N. T. Nguyen, D. Camacho and B. Trawinski (Eds.), Lecture Notes in Computer Science, Vol. 9329, pp. 275–284.
  21. C. Shorten and T. M. Khoshgoftaar (2019) A survey on image data augmentation for deep learning. Journal of Big Data 6 (1), pp. 60.
  22. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958.
  23. S. Valipour, C. Perez and M. Jagersand (2017) Incremental learning for robot perception through hri. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2772–2777.
  24. J. Vanschoren (2018) Meta-learning: a survey. arXiv preprint arXiv:1810.03548.
  25. H. Xiao, K. Rasul and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
  26. Y. LeCun, Y. Bengio and G. E. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444.
  27. L. Zhang, S. Li, H. Xiong, X. Diao, O. Ma and Z. Wang (2019) Prediction of intentions behind a single human action: an application of convolutional neural network. In 2019 IEEE 9th Annual International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), pp. 670–676.