Automatic Dataset Augmentation
Large-scale image datasets and deep convolutional neural networks (DCNNs) are the two primary driving forces behind the rapid progress in generic object recognition in recent years. While many network architectures have been designed in pursuit of lower error rates, few efforts have been devoted to enlarging existing datasets, owing to high labeling cost and concerns about unfair comparison. In this paper, we aim to achieve lower error rates by augmenting existing datasets automatically. Our method leverages both the Web and DCNNs: the Web provides massive numbers of images with rich contextual information, and a DCNN replaces human annotators to automatically label images under the guidance of Web contextual information. Experiments show that our method can automatically and significantly scale up existing datasets from billions of web pages with high accuracy, and that the automatically augmented datasets significantly improve performance on object recognition tasks, which demonstrates that more supervisory information has been automatically gathered from the Web. Both the dataset and the models trained on it are made publicly available.
Generic object recognition is a fundamental problem in computer vision, and has made steady progress thanks to both large-scale dataset construction and sophisticated model design. Although the goal is to minimize the expected error on previously unseen images, only the empirical error on a set of labeled images can be directly optimized, with respect to a function space defined by a model. According to statistical learning theory, the gap between expected error and empirical error is determined by the sample size and the model capacity. The gap shrinks as the sample size increases, while model design tries to minimize the expected error by defining a function space that minimizes the empirical error and controls the model capacity. Starting from the success of AlexNet on the ILSVRC-2012 dataset [3, 22], years of effort have been devoted to model design, and a series of improved DCNNs such as ZFNet, VGGNet, GoogLeNet and ResNet have been proposed. There have also been many efforts to create new datasets for new recognition tasks [13, 17, 31, 33]. However, there has been little effort to enlarge an existing dataset so as to bring the empirical error closer to the expected error, mainly for two reasons: the labeling cost scales linearly with the size of the dataset, and comparisons become unfair when more human labeling is used to achieve better results.
In this work, we attempt to automatically augment an existing dataset from the Web with a DCNN pre-trained on that dataset. (This differs from the common practice of data augmentation for DCNN training, which randomly crops training samples from an image to avoid overfitting and to achieve translation/scale invariance.)
The Web hosts a massive number of images with rich contextual information, and the volume keeps growing fast, which has made applications such as image search engines possible. The Web is also the basic source of many datasets that are scraped from search engines, with or without further human labeling, e.g., ImageNet, Places, 80M tiny images, CIFAR-10/100, etc. An image on a Web page often comes with rich contextual information written by Web authors, e.g., the alt text, which conveys the same essential information as the image and can be displayed in its place in a purely text-based browser; the page title, which describes what the whole web page is about; and the surrounding text, which is related to the image content in some manner. Since this contextual information is not written with the purpose of annotating image content, it is also quite noisy.
DCNNs trained on large-scale datasets have achieved superior performance, which inspires us to investigate whether a DCNN can replace humans in the image labeling task. In an early study, we found that a DCNN trained on ImageNet performs much worse on Web images, because neither the images nor the categories follow the same distribution as the training set, which results in many false positives for each category. The problem can be alleviated by setting a high threshold on the prediction score; however, the images that pass such a threshold provide limited additional information for improving the pre-trained DCNN, since the DCNN is already quite confident about them.
A DCNN extracts an image's visual information while the Web provides its contextual information; the two are complementary and can jointly provide additional information to an existing dataset. Noise in the contextual information can be removed by the DCNN using visual information, while rich contextual information lowers the prediction-score threshold that the DCNN requires to achieve high prediction accuracy. Together, they allow us to augment an existing dataset in a scalable, accurate and informative way. Specifically, we automatically augment ILSVRC-2012 with an additional 12.5 million images from the Web. By training the same DCNN on the augmented dataset without human-labeled images, we observe significant performance gains, which demonstrates that a well-trained DCNN can improve itself by self-labeling more images from the Web. Another encouraging experimental result is that we can even boost the performance of ResNet-50 on the ILSVRC-2012 validation set from 74.55% to 77.36% by using our augmented dataset, which is labeled by a lower-performance AlexNet. We release the dataset and models (available at https://auto-da.github.io/) to facilitate research on learning-based object recognition.
The rest of this paper proceeds as follows. After an overview of related work in Section 2, we introduce our proposed method, which automatically augments a dataset using both Web labeling and DCNN labeling, in Section 3. We evaluate the quality of the augmented datasets in Section 4, and conclude with a discussion in Section 5.
2 Related Work
Datasets are the basic inputs for statistical learning algorithms to train learning models, and significant efforts have been made to construct datasets for various recognition tasks. In this section, we discuss related efforts according to the amount of human labeling used during dataset construction.
2.1 No human labeling
Some datasets are collected directly from image search engines or social networks without human labeling. TinyImage contains 80 million low-resolution images, collected from image search engines by using words in WordNet as queries. YFCC100M is another large database of approximately 100 million images with associated metadata collected from Flickr. Krause et al. use only Web images to fine-tune a DCNN pre-trained on ILSVRC-2012 for fine-grained classification, and obtain even higher accuracies than with fine-grained benchmark datasets, which is to be expected because existing fine-grained benchmark datasets are quite small. Phong et al. collect 3.14 million Web images from Bing and Flickr for the same 1000 categories as ILSVRC-2012. Massouh et al. also propose a framework that collects images from the Web and uses a visual and natural language concept expansion strategy to improve the visual variability of the constructed dataset. Recently, Li et al. also constructed a dataset by directly querying images from Flickr and Google Images Search. However, DCNNs trained on any of these automatically constructed datasets perform much worse than those trained on human-labeled datasets when tested on ILSVRC-2012, which reflects the noisy and highly biased nature of Web images.
2.2 Fully human labeling
Each image is manually labeled by one or more users to ensure high accuracy. Because of the high labeling cost, fully labeled datasets are often small; typical examples are Caltech101/256 [5, 6], Pascal VOC and several datasets for fine-grained object recognition [11, 18, 29]. These datasets are widely used for learning shallow models, but are not large enough to train a DCNN from scratch. Though challenging, million-scale datasets have been constructed, such as ImageNet for object recognition and Places for scene recognition. With ImageNet, DCNNs first proved their success, and their learned feature extractors have improved most object recognition tasks. However, the high labeling cost limits both the number of images that can be labeled for each category and the number of categories that can be labeled.
2.3 Partially human labeling
To reduce human labeling cost and use a limited budget more effectively, several active-learning-based approaches have been proposed that label only the images considered informative for a model. Collins et al. propose to iterate between image labeling and model training: some randomly selected images are first labeled as a seed training set to train an initial model, and the model is then applied to a set of unlabeled images to select the subset about which it is most uncertain for human labeling. The process is repeated until the classification accuracy converges or the budget runs out. Krause et al. present a similar scheme for fine-grained object recognition using a DCNN as the model. Since informative images are selected based on a specific model, human involvement is always required for newly designed models.
To decouple human labeling from model training, Tong et al. propose to train a DCNN for clothing classification with both a clean dataset manually labeled by annotators and millions of images with noisy labels provided by sellers on online shopping websites. Though noisy, the accuracy of images from online shopping websites (% ) is much higher than that of general Web images ( ). Sukhbaatar et al. train a DCNN with 0.3M clean ILSVRC-2012 training images and 0.9M noisy Web images, and show a marginal improvement from a noise layer that models the noise, but still with a much higher error rate than a DCNN trained directly on the 1.2M ILSVRC-2012 training images.
Since the accuracy of Web images is relatively low, the number of Web images needs to be orders of magnitude larger than existing datasets to contain enough clean images. Thus, we aim to use as many Web images as possible: as of July 2017, we have used 186.4 million Web images as candidates to augment several labeled image datasets. With significantly more training images, these augmented datasets achieve higher performance on object recognition tasks than the human-labeled datasets. To the best of our knowledge, this is the first work that uses a DCNN to label Web images and demonstrates that a well-trained DCNN can automatically improve itself by “surfing” the Web.
3 Automatic Dataset Augmentation
Starting from a human-labeled image dataset, we aim to augment it into a much larger dataset whose additional images are automatically labeled from the Web by a DCNN trained on the human-labeled dataset. Labeling images is a process that requires sufficient intelligence and knowledge. In this section, we first investigate two separate labeling methods, by DCNN and by the Web, respectively, and then present our method, which labels images using both. Unless otherwise stated, AlexNet designed by Krizhevsky et al. is used as the basic DCNN because of its relatively low computational cost in large-scale experiments.
3.1 Labeling By DCNN
DCNNs have achieved remarkable prediction accuracy on the validation and test sets of ILSVRC-2012 through end-to-end learning on the training set, which inspires us to use a DCNN in place of humans for the image labeling task. Consider a DCNN trained on the labeled dataset, which maps an image to a set of confidence scores, one for each pre-defined category. Using the DCNN for labeling is then straightforward: a new image is labeled as an instance of a category if its confidence score for that category exceeds some threshold.
An augmented dataset can then be labeled by applying the DCNN to a large unlabeled image set and keeping the images that pass this test together with their predicted categories.
The labeling process is fully automatic and requires only a feedforward computation over an unlabeled image set. We investigate this method by using the DCNN learned from the ILSVRC-2012 training set to label an unlabeled candidate image set randomly collected from the Web. By analyzing the labeling results, we find several properties of labeling by DCNN.
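As an illustration only, the thresholding rule of this subsection can be sketched in a few lines of Python. The function name `label_by_dcnn`, the dictionary layout of the scores and the numeric values are assumptions made for this sketch; they are not taken from the system used in the experiments.

```python
def label_by_dcnn(scores, threshold):
    """Label images whose confidence for some category exceeds the threshold.

    scores: {image_id: {category: confidence}}, a hypothetical layout that
    stands in for the DCNN output on an unlabeled image set.
    """
    labeled = []
    for image_id, per_category in scores.items():
        for category, confidence in per_category.items():
            if confidence > threshold:
                labeled.append((image_id, category))
    return labeled


# Made-up confidence scores for two unlabeled images.
scores = {
    "img_a": {"jay": 0.97, "tench": 0.01},
    "img_b": {"jay": 0.40, "tench": 0.35},
}
print(label_by_dcnn(scores, threshold=0.9))  # [('img_a', 'jay')]
```

Raising the threshold in this sketch keeps fewer images, which mirrors the trade-off between dataset size and accuracy discussed in this subsection.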
Unbalance. Figure 1 shows the number of images labeled for each category using a relatively high threshold. The numbers are extremely unbalanced across categories: “web site” has more than 100,000 images, while “toilet tissue” and “American chameleon, anole” have none. The imbalance is caused by the unbalanced nature of web images, since images of some categories are inherently more popular than others. The only way to label enough images for every category is to run prediction on more images, in which case most of the computation is spent on images of popular categories.
Low Accuracy. Figure 2 shows the quantity and accuracy of automatically labeled datasets under different thresholds, where accuracy is estimated by manually inspecting randomly sampled images (10 random images per category) from 100 categories in the constructed dataset. As expected, a higher threshold results in a smaller dataset with higher accuracy. However, even with a relatively high threshold, the accuracy is still much lower than that achieved by human labelers on ImageNet. Figure 3 shows some false positives for each category; most of these images fall outside the 1000 categories used for training but are visually similar to the assigned category in some respects. The result also shows that it is still hard for the DCNN to generalize to a test set with many out-of-class images.
Less Informative. Although higher accuracy can be obtained by further increasing the threshold, this causes two problems. First, fewer images can be collected from a fixed unlabeled dataset, so the unlabeled dataset needs to be even larger to yield enough images. Second, and worse, images labeled with high confidence scores are iconic samples that closely resemble images in the existing training set, as shown in the third row of Figure 4, and therefore bring little new supervisory information to the existing training set.
3.2 Labeling By Web
The Web hosts trillions of images with rich metadata, which provides a “free” way to label images since the labels are already present in the metadata provided by Web users. Image search engines directly leverage this metadata to index massive numbers of Web images and make them retrievable. Although image search engines provide a convenient way to collect Web images by searching for words or phrases that describe a category, they have several limitations for dataset construction because they are optimized for human users. For example, they typically limit the number of images retrievable for each query (on the order of a few hundred to a thousand), and the retrieved images are often iconic, presenting a single, centered object on a simple background, which is not representative of natural conditions. Thus, we resort directly to raw images with textual metadata from the Web as our source data. Specifically, four textual fields are collected for each image:
Anchor text is the visible, clickable text in a hyperlink that points to an image, which usually gives the user a relevant description of the content of the linked image.
Alt text is shown when an image cannot be displayed to a reader. Thus, it can be seen as a textual counterpart to the visual content of an image.
Page title is an important field for the page author to state what the main content of the webpage is about.
Surrounding text consists of the text paragraphs around an image in a webpage. The surrounding text is in many cases semantically related to the image content. However, since it can also contain information that is unrelated to the image, this field can be a very noisy source of contextual information.
A data item from the Web then consists of an image together with its textual metadata. Figure 5 shows a web image and its four types of textual metadata, in which rich information about the image content is embedded.
Given a web image dataset, labeling by the Web can be carried out directly through string matching. Each category is represented by a set of word phrases drawn from its WordNet synonyms and from relevant descriptions in 12 different languages (AR, ZH, EN, FR, DE, EL, HE, HI, IT, JA, RU, ES) from BabelNet. An image is labeled as an instance of a category if at least one of its textual fields contains at least one phrase in that category's set.
An augmented dataset can then be labeled from the Web data by collecting every image that matches a category in this way.
The labeling process is also fully automatic, and is very fast once the Web data has been collected.
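The string-match rule can be sketched as follows. This is a simplified illustration under stated assumptions: the field names in `FIELDS`, the function `label_by_web`, the phrase sets and the example items are all hypothetical, and plain case-insensitive substring matching stands in for whatever matching procedure a production crawler would use.

```python
FIELDS = ("anchor_text", "alt_text", "page_title", "surrounding_text")


def label_by_web(items, category_phrases):
    """Return (image_id, category) pairs found by case-insensitive string match."""
    labeled = []
    for item in items:
        texts = [item.get(field, "").lower() for field in FIELDS]
        for category, phrases in category_phrases.items():
            # A category matches if any phrase occurs in any textual field.
            if any(phrase.lower() in text for text in texts for phrase in phrases):
                labeled.append((item["image_id"], category))
    return labeled


# Hypothetical phrase sets and web items, for illustration only.
category_phrases = {"jay": ["jay"], "tench": ["tench", "Tinca tinca"]}
items = [
    {"image_id": "img_1", "alt_text": "A blue jay on a branch", "page_title": "Backyard birds"},
    {"image_id": "img_2", "surrounding_text": "Jay at his birthday party"},
    {"image_id": "img_3", "page_title": "Holiday photos"},
]
web_labeled = label_by_web(items, category_phrases)
print(web_labeled)  # [('img_1', 'jay'), ('img_2', 'jay')]
```

Because the match is purely textual, the second item is labeled “jay” even though its text refers to a person, which illustrates why labels obtained this way are noisy.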
Using this method, we collect a dataset of 186.4 million images for the 1000 categories of the ILSVRC-2012 dataset. Here, we summarize several properties observed in the dataset.
Figure 6 shows the percentage of images collected by each textual field. Surrounding text contributes the most, since most images have surrounding text and it typically contains more words than the other fields, while the number of images collected by anchor text is much smaller than for other fields, since anchor texts are typically very short and often not provided by web authors.
Besides the quantity, we also check the quality of the collected dataset. To avoid manual checking, we use the DCNN to compute the confidence score of the labeled category for each image in the Web-labeled dataset; a large confidence score means a high probability that the label is correct. Figure 7 shows the distribution of confidence scores for the different textual fields. Images collected by anchor text and alt text have a larger proportion of high confidence scores, which means these two fields are more reliable than the others. This conclusion is also consistent with experience in using textual features for image search engines (see https://support.google.com/webmasters/answer/114016?hl=en).
However, as expected, images collected from the Web are very noisy: 82.8% of the images have confidence scores lower than 0.05. After analyzing the noisy images, we find that most of them are introduced by ambiguities in the textual metadata. A typical example is the category “jay”, which is supposed to be a bird, but for which many images of people are collected, since “jay” is often used as a personal name. Although these noisy images are hard to remove using textual information alone, they are easy to remove using visual information, since images of different senses of a name are typically visually distinguishable.
3.3 Labeling by Web and DCNN
Since visual labeling by DCNN and contextual labeling by the Web each have their own limitations, we combine them to improve labeling by leveraging their complementarity. We learned above that labeling by DCNN is computationally more expensive and tends to spend too much time on popular categories; we therefore first use the Web to label a dataset in a relatively balanced way, and then use the DCNN to go through this textually labeled dataset, keeping an image only if the confidence score of its Web-assigned category exceeds a threshold.
The result is a filtered subset of the Web-labeled dataset, from which many noisy images have been removed by the DCNN. Unlike labeling by DCNN alone (Section 3.1), the contextual labeling filters out the majority of out-of-class noisy images, so the Web-labeled candidate set has a much higher signal-to-noise ratio than a randomly collected image set, which allows us to use a lower threshold and label more informative images. Figure 8 shows the quantity and accuracy curves with respect to the confidence threshold on images labeled by the Web. Encouragingly, much higher accuracy is achieved even with a very low confidence threshold, e.g., 94% accuracy is achieved when the threshold is set to 0.1.
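A minimal sketch of this two-stage filtering is given below. The function `label_by_web_and_dcnn`, the `confidence` callable and the toy scores are hypothetical names and values introduced for illustration; only the threshold of 0.1 is taken from the example in the text.

```python
def label_by_web_and_dcnn(web_labeled, confidence, threshold):
    """Keep a Web-labeled (image_id, category) pair only if the DCNN agrees.

    confidence(image_id, category) is assumed to return the DCNN confidence
    score of the Web-assigned category for that image.
    """
    return [
        (image_id, category)
        for image_id, category in web_labeled
        if confidence(image_id, category) > threshold
    ]


# Made-up scores standing in for the DCNN output on the Web-assigned category.
toy_scores = {("img_1", "jay"): 0.62, ("img_2", "jay"): 0.01}
web_labeled = [("img_1", "jay"), ("img_2", "jay")]

kept = label_by_web_and_dcnn(
    web_labeled,
    confidence=lambda image_id, category: toy_scores[(image_id, category)],
    threshold=0.1,
)
print(kept)  # [('img_1', 'jay')]
```

The DCNN is only asked about the category proposed by the textual match, so a modest confidence such as 0.62 is enough to keep the first image, while the second is rejected.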
Since the accuracy of the Web-labeled dataset is still relatively low when simple string matching is used, we are limited in how far the confidence threshold can be lowered to bring in more diverse and difficult images while keeping accuracy high. Thus, we are motivated to further reduce the noise in the Web-labeled dataset. Since the image, its text, the metadata type and the image URL domain are coupled together as a single data item in our dataset, labels assigned to images by the DCNN are also assigned to the metadata, which yields an automatically labeled textual dataset over the noisy images collected for each category by string match.
Inspired by previous work on sentence classification, we train a two-layer fully connected network to categorize textual metadata at the semantic level. The input to the network is the one-hot representation of the metadata type, the image URL domain and the bigrams in the text. As Figure 7 shows, the metadata type is a helpful prior for this text classification task. We also found that there are some specialized websites on which the vast majority of images are relevant to one category; e.g., “farnhamanglingsociety.com” is a website about fishing, and many images of tench can be found on it. The first layer of the network generates an embedding representation of the input with one weight matrix, and the second layer classifies the input into categories based on this representation with a second weight matrix, using softmax regression.
The model is trained by minimizing the classification loss on the automatically labeled textual dataset, using stochastic gradient descent with a linearly decaying learning rate. As a result, a new dataset labeled by our text classification model can be constructed.
The experimental results show that the accuracy of the image set labeled by the text classification model is nearly 71.5%, which is significantly higher than that of the set labeled by string match, whose accuracy is nearly 21.3%. Naturally, a new dataset jointly constrained by the DCNN and the text classification model can then be constructed.
The high-performance text classification model makes it possible to lower the visual confidence threshold and to mine a larger and more diverse dataset without a drop in accuracy, e.g., 93.8% accuracy is still achieved at the lowered threshold. Finally, we obtain a dataset labeled jointly by the Web and the DCNN.
Figure 4 shows snapshots of the human-labeled dataset ImageNet and four datasets constructed automatically by different methods. Compared to the datasets labeled only by the DCNN or only by the Web, the dataset constructed under joint semantic and visual constraints shows higher accuracy and diversity.
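The text classification model of this section can be sketched as a small two-layer network in plain Python. The sketch rests on several assumptions that are not confirmed by the text: the sparse input is averaged over its active features (as in bag-of-features sentence classifiers), the loss is the cross-entropy of the softmax output, and the class name `TextClassifier`, the feature naming scheme, the dimensions, the learning rate and the two toy examples (including the domain `birds.example`) are all hypothetical.

```python
import math
import random


def build_features(metadata_type, url_domain, text):
    """Sparse input: metadata type, URL domain, and word bigrams of the text."""
    tokens = text.lower().split()
    bigrams = [f"bigram={a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return [f"type={metadata_type}", f"domain={url_domain}"] + bigrams


class TextClassifier:
    def __init__(self, vocab, num_categories, dim=8, seed=0):
        rng = random.Random(seed)
        self.index = {feature: i for i, feature in enumerate(vocab)}
        # First layer: one embedding row per input feature.
        self.embed = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]
        # Second layer: one weight row per category.
        self.out = [[0.0] * dim for _ in range(num_categories)]

    def _forward(self, features):
        ids = [self.index[f] for f in features if f in self.index]
        dim = len(self.out[0])
        hidden = [sum(self.embed[i][k] for i in ids) / max(len(ids), 1) for k in range(dim)]
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in self.out]
        peak = max(logits)
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        return ids, hidden, [e / total for e in exps]

    def predict_proba(self, features):
        return self._forward(features)[2]

    def sgd_step(self, features, label, lr):
        ids, hidden, probs = self._forward(features)
        # Gradient of the cross-entropy loss with respect to the logits.
        delta = [p - (1.0 if c == label else 0.0) for c, p in enumerate(probs)]
        grad_hidden = [sum(delta[c] * self.out[c][k] for c in range(len(delta)))
                       for k in range(len(hidden))]
        for c, row in enumerate(self.out):
            for k in range(len(row)):
                row[k] -= lr * delta[c] * hidden[k]
        for i in ids:
            for k in range(len(hidden)):
                self.embed[i][k] -= lr * grad_hidden[k] / len(ids)


def train(model, examples, epochs=200, lr0=0.5):
    total_steps = epochs * len(examples)
    step = 0
    for _ in range(epochs):
        for features, label in examples:
            lr = lr0 * (1.0 - step / total_steps)  # linearly decaying learning rate
            model.sgd_step(features, label, lr)
            step += 1


# Two toy metadata items; category 0 stands for "jay", category 1 for "tench".
examples = [
    (build_features("alt", "birds.example", "blue jay on a branch"), 0),
    (build_features("alt", "farnhamanglingsociety.com", "large tench caught today"), 1),
]
vocab = sorted({f for features, _ in examples for f in features})
model = TextClassifier(vocab, num_categories=2)
train(model, examples)
print([round(p, 3) for p in model.predict_proba(examples[0][0])])
```

In this sketch the URL domain and the metadata type enter the network as ordinary sparse features alongside the bigrams, so a website dominated by one category can act as a prior in the same way as any informative word pair.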
4 Experimental Results
In our experiments, for a given category set with labeled images, we first train a DCNN that is used both for labeling and as the baseline for comparison. For a comprehensive analysis and comparison, we consider four sets of categories from different domains: dog, bird, wheeled object and structure. The human-labeled datasets for the four domains are subsets of the ILSVRC-2012 training set. The four DCNNs, one trained on each human-labeled dataset, are used to label a Web-labeled dataset that contains 186.4 million images. Table 1 summarizes the statistics of the human-labeled and automatically labeled datasets; our method significantly increases the dataset scale in each domain.
|Dataset||# of categories||# of images|
4.1 Results on Augmented Datasets
In this paper, we measure the quality of augmented datasets by measuring their performance on object recognition. Single-crop top-1 accuracy on corresponding subsets of ILSVRC-2012 validation set is used as the performance metric. Table 2 reports the results of DCNNs trained from augmented datasets and human labeled datasets. The augmented datasets (without human labeled images) give consistent improvement across all four different domains, which demonstrates that well-trained DCNNs can automatically label more useful images from Web and improve themselves further. Averaging the predictions of the two DCNNs trained on human labeled datasets and augmented datasets can further improve the performance.
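The evaluation protocol described above, top-k accuracy and the averaging of two models' predictions, can be illustrated with a short sketch. The function names and all score values below are made up for illustration and do not correspond to any result in this paper.

```python
def top_k_accuracy(score_rows, labels, k=1):
    """Fraction of images whose true label is among the k highest-scoring categories."""
    hits = 0
    for scores, label in zip(score_rows, labels):
        top = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]
        hits += label in top
    return hits / len(labels)


def average_predictions(rows_a, rows_b):
    """Average the per-category scores of two models, image by image."""
    return [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(rows_a, rows_b)]


# Made-up scores of two models on two validation images with three categories.
labels = [0, 1]
human_scores = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
augmented_scores = [[0.7, 0.2, 0.1], [0.2, 0.7, 0.1]]

print(top_k_accuracy(human_scores, labels))  # 0.5
print(top_k_accuracy(average_predictions(human_scores, augmented_scores), labels))  # 1.0
```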
To analyze how the augmented datasets improve recognition performance, we compare the training/test curves of the DCNNs trained on the human-labeled and the augmented datasets in Figure 9. The smaller training/test gap and better test accuracy show that the significantly larger datasets help prevent overfitting when training deep models.
To analyze how Web labeling influences the quality of the constructed dataset, we compare the performance of DCNNs trained on the dataset labeled by the DCNN alone and on the dataset labeled by the Web and the DCNN, in the dog domain. Since the accuracy of the DCNN-labeled dataset relies heavily on the confidence threshold, as shown in Figure 2, we try three different threshold settings for constructing it in this experiment. The experimental results in Table 3 show that the performance of a DCNN trained on the DCNN-labeled dataset improves as the confidence threshold increases, since a higher confidence threshold leads to a more accurate dataset. However, the performance of DCNNs trained on the DCNN-labeled dataset is lower than that of the DCNN trained on the human-labeled dataset, which means that a DCNN still cannot improve itself by self-labeling images without using contextual information from the Web.
We further investigate whether better model design and automatically labeled larger dataset can boost recognition performance together. Here, we choose ResNet-50  which performs much better than AlexNet on ILSVRC-2012 to repeat the experiment on the dog domain. Table 4 reports the results, where ResNet-50 consistently outperforms AlexNet as expected, and ResNet-50 also improves itself using the automatically labeled data, which demonstrates that better model design and larger automatically labeled dataset can further boost the performance together.
4.2 Results on ILSVRC-2012
We also augment the ILSVRC-2012 training set with our proposed method. For categories with more than 15,000 images, we keep just 15,000 images by random sampling. After that, 12.5 million images remain in the augmented ILSVRC-2012; most categories have more than 10,000 images, but several rare categories still contain fewer than 6,000 images. Since the validation set is balanced, training on an unbalanced dataset can lead to poor performance; we therefore balance the distribution of the augmented dataset by subsampling categories with more than 6,000 images, and construct a balanced dataset with 5.7 million images.
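The per-category capping used here can be sketched as follows. The function name `cap_per_category`, the fixed seed and the toy data are assumptions for illustration; in the setting above the cap would be 15,000 for the first pass and 6,000 for the balanced dataset.

```python
import random
from collections import Counter, defaultdict


def cap_per_category(labeled, cap, seed=0):
    """Randomly keep at most `cap` images per category."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for image_id, category in labeled:
        by_category[category].append(image_id)
    capped = []
    for category, image_ids in by_category.items():
        # Categories at or below the cap are kept whole; larger ones are sampled.
        kept = image_ids if len(image_ids) <= cap else rng.sample(image_ids, cap)
        capped.extend((image_id, category) for image_id in kept)
    return capped


# Toy data: one popular category and one rare category.
labeled = [(f"img_{i}", "web site") for i in range(10)] + [("img_x", "toilet tissue")]
capped = cap_per_category(labeled, cap=3)
counts = Counter(category for _, category in capped)
print(counts["web site"], counts["toilet tissue"])  # 3 1
```

Capping reduces the dominance of popular categories but cannot add images to rare ones, which is why some categories remain below the cap after balancing.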
Table 5 reports the top-1 and top-5 classification accuracy on the ILSVRC-2012 validation set, evaluated with a single crop. We found that classification performance is affected to a large extent by the number of training iterations: models trained on a larger training dataset need more iterations to converge fully. The best performance is achieved for both AlexNet and ResNet-50 by merging the human-labeled dataset and the augmented dataset. It is worth noting that the augmented dataset is labeled by a low-performance AlexNet whose top-1 accuracy is 56.15%, yet it can still boost a high-performance ResNet-50 from 74.55% to 77.36%. We also evaluated the performance of the DCNN without dropout layers. The results in Table 5 show that the DCNN without dropout layers converges faster, and that, thanks to the large-scale augmented dataset, the influence of overfitting is alleviated and better performance is achieved.
However, the performance obtained using only the automatically constructed dataset is still lower than that obtained with the human-labeled dataset. We analyze this carefully in the next section.
|Model||Iterations||Human labeled||Augmented||Merge||Merge (No Dropout)|
|AlexNet||0.4M||56.15 (78.11)||51.99 (73.86)||56.13 (79.27)||59.90 (81.17)|
|AlexNet||2.0M||60.36 (82.38)||56.58 (78.57)||62.71 (83.71)||61.72 (82.62)|
|ResNet-50||0.5M||74.55 (92.06)||67.25 (85.99)||75.57 (91.83)||-|
|ResNet-50||2.5M||74.44 (92.11)||70.17 (88.09)||77.36 (93.29)||-|
4.2.1 Dataset Analysis
The performance gap comes from the difference in distribution between the two datasets. ImageNet was collected about ten years ago, and the visual appearance of many categories has changed over time, especially for man-made categories such as monitors and table lamps. In addition, the main source of ImageNet is Flickr, while our augmented dataset comes from a wider range of websites, some of which, such as Pinterest, did not even exist when ImageNet was collected. Figure 10 shows the difference between the domain distributions of the image sources of ImageNet and of our augmented dataset.
To systematically study the difference in distribution between the two datasets, we train the discriminator of a Wasserstein Generative Adversarial Network (GAN) to differentiate images in ILSVRC-2012 from images in our dataset by maximizing the Wasserstein distance between the two. Using the trained discriminator, we sort the images in our dataset according to its output value and show in Figure 11 the images whose styles are most different from and most similar to those of ILSVRC-2012. We found that many of the images that are easily distinguished from ILSVRC-2012 images were collected from e-commerce websites.
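For reference, the discriminator (critic) objective of a Wasserstein GAN has the following standard form. The symbols are assumptions chosen for this note rather than notation recovered from the text: $\mathbb{P}_{h}$ denotes the distribution of ILSVRC-2012 images, $\mathbb{P}_{a}$ the distribution of images in our dataset, and $f$ the discriminator, constrained to be 1-Lipschitz.

```latex
\max_{\|f\|_{L} \le 1} \;
  \mathbb{E}_{x \sim \mathbb{P}_{h}}\big[ f(x) \big]
  - \mathbb{E}_{x \sim \mathbb{P}_{a}}\big[ f(x) \big]
```

Under this form, images in our dataset with the lowest values of $f$ are the ones the discriminator separates most easily from ILSVRC-2012, which is one way the sorting described above can be read.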
Considering that the differences between ImageNet and our dataset are mainly in man-made categories, we split the 1000 categories into two subsets according to the WordNet ontology: an artifact set with 522 categories and a natural object set with 478 categories. We train DCNNs on each of these two subsets of our dataset and compare them with those trained on ImageNet; Table 6 reports the performance. As expected, our dataset achieves better results on natural categories and worse results on artifact categories. Since many images in ImageNet are out of date, we will carry out further evaluation to verify our dataset.
4.2.2 Results on Cross-Dataset Generalization
To further compare ILSVRC-2012 and our constructed dataset, we examine the cross-dataset generalization ability of the two datasets. Cross-dataset generalization measures the performance of classifiers learned from one dataset when applied to another. We also compare our augmented dataset with the WebVision dataset, which is constructed from Flickr and Google Images Search by querying the category names. Table 7 shows the classification error rates. A DCNN is trained on the training set of each dataset and then evaluated on the test sets of the different datasets. In all cases, the best performance is achieved by training and testing on the same dataset. The experimental results show that our augmented dataset performs better than ImageNet on WebVision. Moreover, our augmented dataset also performs better than WebVision on the human-labeled ImageNet. Overall, our dataset generalizes much better than the other two datasets.
|Training Data||Test Data|
|ILSVRC 2012 Val||WebVision Val|
4.2.3 Results of MSR-Bing Grand Challenge
Inspired by the success of the feature extractors in DCNNs learned from ILSVRC-2012, we also compare the generalization ability of feature extractors learned from the human-labeled ILSVRC-2012 and from our augmented ILSVRC-2012 dataset. To evaluate the quality of the feature extractors more comprehensively, we test them on an open-domain image retrieval task, the MSR-Bing Grand Challenge.
The MSR-Bing Grand Challenge task provides a training set of 11.7 million queries and 1 million images and a test set of 1000 queries and 79,665 images. It requires learning a ranking model from the training set and ranking the images for each query in the test set. Normalized Discounted Cumulative Gain (NDCG) is used as the evaluation metric for a ranking list, and is defined as
$\mathrm{NDCG}@d = Z_d \sum_{j=1}^{d} \frac{2^{r_j} - 1}{\log(1 + j)}$,
where $r_j$ (Excellent = 3, Good = 2, Bad = 0) is the manually judged relevance of the image ranked at position $j$ with respect to the query, and $Z_d$ is a normalization factor that makes the score 1 in the ideal case. The performance is measured by the average NDCG over all queries in the test set.
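The metric can be computed with a few lines of Python. This is a sketch under stated assumptions: the gain of an image with relevance grade r is taken to be 2^r - 1, the discount at rank j is log2(1 + j), the ideal case is a list in which every position holds an Excellent image, and the function name `ndcg_at` and the cut-off parameter `d` are hypothetical.

```python
import math


def ndcg_at(relevances, d, max_relevance=3):
    """NDCG of a ranked list of relevance grades, truncated at rank d.

    Positions beyond the end of the list contribute nothing, and the
    normalizer is the score of d images that all have the highest grade.
    """
    dcg = sum((2 ** r - 1) / math.log2(1 + j)
              for j, r in enumerate(relevances[:d], start=1))
    ideal = sum((2 ** max_relevance - 1) / math.log2(1 + j) for j in range(1, d + 1))
    return dcg / ideal


# Grades: Excellent = 3, Good = 2, Bad = 0.
print(ndcg_at([3, 2, 0, 3], d=4))
```

Because the score is a ratio of two sums that use the same discount, the base of the logarithm does not affect the result.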
We use Canonical Correlation Analysis (CCA)  as the basic ranking model and represent a query with bag-of-textual-words. For images, we use the outputs of the last but one fully-connected layer of a DCNN as the image representation, and two DCNNs trained on ILSVRC-2012 and augmented ILSVRC-2012 will be used. Figure 12 compares the performance of ranking results using image representations provided by the two DCNNs, where the DCNN trained on augmented ILSVRC-2012 achieves consistently better performance, which further demonstrates the generalization ability of model learned from automatically labeled dataset.
5 Conclusion
In this paper, we propose a method for automatic dataset augmentation that uses both the Web and DCNNs. Specifically, the Web provides massive numbers of images with rich contextual information, while well-trained DCNNs are used to label these images and filter out noisy ones. Meanwhile, the rich contextual information from the Web allows a DCNN to achieve high labeling accuracy with a relatively low confidence threshold. Together, they allow us to augment a labeled image dataset in a scalable, accurate and informative way. Extensive experiments demonstrate that well-trained DCNNs can automatically label images from the Web and further improve themselves with the automatically labeled datasets. We hope the automatically constructed large-scale datasets with rich contextual information can help research on large neural networks.
- [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
- [2] B. Collins, J. Deng, K. Li, and L. Fei-Fei. Towards scalable dataset construction: An active learning approach. In European Conference on Computer Vision, pages 86–98. Springer, 2008.
- [3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
- [4] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
- [5] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59–70, 2007.
- [6] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. 2007.
- [7] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.
- [8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
- [9] X.-S. Hua, L. Yang, J. Wang, J. Wang, M. Ye, K. Wang, Y. Rui, and J. Li. Clickage: Towards bridging semantic and intent gaps via mining click logs of search engines. In Proceedings of the 21st ACM International Conference on Multimedia, pages 243–252. ACM, 2013.
- [10] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.
- [11] A. Khosla, N. Jayadevaprakash, B. Yao, and F.-F. Li. Novel dataset for fine-grained image categorization: Stanford dogs. In Proc. CVPR Workshop on Fine-Grained Visual Categorization (FGVC), volume 2, 2011.
- [12] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and L. Fei-Fei. The unreasonable effectiveness of noisy data for fine-grained recognition. arXiv preprint arXiv:1511.06789, 2015.
- [13] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.
- [14] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
- [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105. 2012.
- [16] W. Li, L. Wang, E. Agustsson, and L. V. Gool. WebVision: Visual Understanding by Learning from Web Data. http://www.vision.ee.ethz.ch/webvision, 2017. [Online; accessed 6-Aug-2017].
- [17] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
- [18] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
- [19] N. Massouh, F. Babiloni, T. Tommasi, J. Young, N. Hawes, and B. Caputo. Learning deep visual object models from noisy web data: How to make it work. arXiv preprint arXiv:1702.08513, 2017.
- [20] G. A. Miller. WordNet: A lexical database for English. Communications of the ACM, 1995.
- [21] R. Navigli and S. P. Ponzetto. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250, 2012.
- [22] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014.
- [23] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [24] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.
- [25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
- [26] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
- [27] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.
- [28] P. D. Vo, A. Ginsca, H. L. Borgne, and A. Popescu. On deep representation learning from noisy web images. arXiv preprint arXiv:1512.04785, 2015.
- [29] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD birds-200-2011 dataset. 2011.
- [30] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2691–2699, 2015.
- [31] J. Xu, T. Mei, T. Yao, and Y. Rui. MSR-VTT: A large video description dataset for bridging video and language. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [32] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.
- [33] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014.