DAF:re: A Challenging, Crowd-Sourced, Large-Scale, Long-Tailed Dataset for Anime Character Recognition

Abstract

In this work we tackle the challenging problem of anime character recognition. Anime refers to animation produced within Japan and to work derived from or inspired by it. For this purpose we present DAF:re (DanbooruAnimeFaces:revamped), a large-scale, crowd-sourced, long-tailed dataset with almost 500K images spread across more than 3000 classes. Additionally, we conduct experiments on DAF:re and similar datasets using a variety of classification models, including CNN-based ResNets and the self-attention based Vision Transformer (ViT). Our results give new insights into the generalization and transfer learning properties of ViT models on datasets whose domain differs substantially from the one used for upstream pre-training, including the influence of batch size and image size in their training. Additionally, we share our dataset, source code, pre-trained checkpoints and results as Animesion, the first end-to-end framework for large-scale anime character recognition: https://github.com/arkel23/animesion.

Edwin Arkel Rios, Wen-Huang Cheng, Bo-Cheng Lai
Institute of Electronic Engineering, National Chiao Tung University, Taiwan

Keywords: anime, cartoon, face recognition, transfer learning, visual benchmark dataset

1 Introduction

Figure 1: Samples of images from DAF:re.

Anime, originally a word describing animation works produced in Japan, can now be seen as an umbrella term for work that is inspired by or follows a similar style to the former [4]. It is a complex global cultural phenomenon, with an industry that surpasses 2 trillion Japanese yen [1]. Recently, the anime film Kimetsu no Yaiba (Demon Slayer) became the highest-grossing film of all time in Japan, the highest-grossing anime and Japanese film of all time, the highest-grossing animated film of 2020, and the 5th highest-grossing film of 2020 worldwide [10]. Clearly, anime, as a phenomenon and an industry, is thriving economically. Furthermore, viewing has been recognized as an integral part of literacy development by educators [8]. Its importance as a medium cannot be overstated. For these reasons, it is important to develop robust multimedia content analysis systems for more efficient access, digestion, and retrieval of information; these are key requirements for effective content recommendation systems, such as those used by Netflix.

Our work aims to facilitate research in this area of computer vision (CV) and multimedia analysis by making three contributions. First, we revamp an existing dataset, DanbooruAnimeFaces (DAF), into DanbooruAnimeFaces:revamped (DAF:re), making it more tractable and manageable for the task of anime character recognition. Second, we conduct extensive experiments on this dataset, and on another similar but much smaller dataset, using a variety of neural network architectures, including the CNN-based ResNet [11] and the recent self-attention based state-of-the-art (SotA) model for image classification, the Vision Transformer (ViT) [7]. These experiments give us new insights into the generalization and transfer learning properties of ViT models for downstream tasks that are substantially different from those used for upstream pre-training, including the effects of image size and batch size on the downstream classification task. Third, we release our datasets, along with source code and pre-trained model checkpoints, to encourage and facilitate further research in this domain.

2 Background and Related Work

2.1 Deep Learning for Computer Vision

The past few years have seen a meteoric rise in deep learning (DL) applications. One of the factors enabling this is that DL models can easily take advantage of increases in computational resources and available data [15]. In particular, the dataset associated with the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [6] became the de facto testbed for many new image classification models [14, 11], pushing the SotA in image classification to super-human levels of precision.

Recently, there has been a lot of attention from the research community on transformer models [13]. Since Vaswani et al. [20] proposed them in 2017, self-attention based transformers have revolutionized the natural language processing (NLP) field, and there has been active research into porting this architecture to CV tasks [18, 5]. The big breakthrough came in the form of the Vision Transformer, proposed by Dosovitskiy et al. [7], which takes a transformer encoder and applies it directly to image patches. ViT reached SotA in image classification on a variety of datasets by modelling long-range dependencies between patches.
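
To make the patch-based formulation concrete, the following is a minimal PyTorch sketch of ViT's input stage (not taken from the original implementation, and omitting position embeddings and the class token): an image is cut into non-overlapping patches and each patch is linearly projected into a token embedding, which the transformer encoder then processes with self-attention.

```python
import torch

def image_to_patch_embeddings(images, patch_size=16, embed_dim=768):
    """Split images into non-overlapping patches and linearly project each one,
    mirroring ViT's input stage (position embeddings and the class token are
    omitted for brevity)."""
    b, c, h, w = images.shape
    # A conv with kernel = stride = patch_size is equivalent to flattening
    # each patch and applying a shared linear projection.
    proj = torch.nn.Conv2d(c, embed_dim, kernel_size=patch_size, stride=patch_size)
    patches = proj(images)                     # (b, embed_dim, h/p, w/p)
    return patches.flatten(2).transpose(1, 2)  # (b, num_patches, embed_dim)

# Example: a 224x224 RGB image yields 14x14 = 196 patch tokens of dimension 768.
tokens = image_to_patch_embeddings(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```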

2.2 Computer Vision for Drawn Media

Anime, comics, cartoons, manga, sketches: whatever we call them, they all have something in common; traditionally, they have all been drawn media. There is a significant gap between the characteristics of these media and those of natural images (images captured by standard cameras), which most CV algorithms are designed for. Of particular relevance is the fact that CNNs are biased towards texture recognition rather than shapes [9]. Therefore, drawn media can be a challenging testbed for CV models.

CV research on these media is not new, and several surveys of computational approaches exist [2]. Most existing work has focused on applying CV methods for image translation, synthesis, generation, and/or colorization of characters [12, 21].

On the other hand, the task of character recognition and classification has been mostly unexplored. Work has been done using comic and manga (Japanese comics) datasets, but only with small datasets composed of dozens to hundreds of characters at most, with samples on the order of thousands [19]. Matsui et al. [16] compiled the manga109 dataset to address the lack of a standard, relatively large-scale testbed for CV approaches to sketch analysis, and it has been used for manga character recognition at larger scales, on the order of hundreds to thousands of characters [17]. However, this dataset is not entirely suitable for anime character recognition, since the styles differ; most significantly, manga is usually in grey-scale, while a characteristic feature of anime is its variety of color palettes and styles.

With this in consideration, the closest work to ours is that of Zheng et al. [22]. They compiled a dataset, Cartoon Face, composed of 389,678 images with 5,013 identities for recognition and 60,000 images for face detection, collected from public websites and online videos, and established two tasks: face recognition and face detection. Their face recognition task is the most similar to ours, but there are a few significant differences.

First, we frame the task differently: our dataset follows a standard K-class classification structure, split into three sets, with a classification label for each image. Second, since our dataset is crowd-sourced, it naturally contains noise. It is also highly non-uniform in terms of style, even for the same character, since many different artists may be involved. While this may be considered a weakness, we embrace these difficulties since they not only make the task more challenging, but also push our models to generalize more robustly. Finally, thanks to the crowd-sourced procedure for obtaining our dataset, updating it to include more entities, as well as a variety of other adjustments related to examples per class, image size, and adaptation to other tasks such as object and face detection and segmentation, is much more feasible.

3 Methodology

3.1 Data

Figure 2: Histogram of DAF:re for the 100 classes with the most samples. The distribution is clearly long-tailed.

Figure 3: Histogram of moeImouto for the 100 classes with the most samples.

DAF:re

DAF:re is mostly based on DanbooruAnimeFaces (DAF)1, which is a subset of Danbooru2018 [3]. Danbooru2018 is probably the largest tagged, crowd-sourced dataset of anime-related illustrations. It was extracted from Danbooru, an imageboard developed by the anime community for image hosting and collaborative tagging. The first release of the Danbooru dataset was the 2017 version, with 2.94M images and 77.5M tag instances (of 333K defined tags); the 2018 version contains 3.33M images with 92.7M tag instances (of 365K defined tags); and the latest release, the 2019 version, has 3.69M images with 108M tag instances (of 392K defined tags).

DAF was made to address the challenging problem of anime character recognition. To obtain it, the authors first filtered the tags to keep only character tags. Then, they kept images that have only one character tag and extracted head bounding boxes using a YoloV3-based anime head detector; images with multiple detected head boxes were discarded. This resulted in 0.97M head images, resized to 128x128, representing 49K character classes. The authors further filtered the dataset by keeping only images with a bounding box prediction confidence above 85%. This resulted in 561K images, split across training (541K), validation (10K), and testing (10K) sets, representing 34K classes.

However, the resulting split made it far too difficult for an image classifier to accurately assign an image to the correct character class, as evidenced by the fact that their best model, a ResNet-18 with an ArcFace loss, could only achieve 37.3% testing accuracy. The difficulty arose from the nature of the dataset (noisy, long-tailed, few-shot classification) and the aforementioned limitations of CNNs on drawings.

For this reason, we propose a set of small but significant improvements to the filtering methodology to obtain DAF:re. First, we only kept classes with a number of samples above a certain threshold. We tried thresholds of 5, 10, and 20, reducing the number of samples from 977K images to 520K, 495K, and 463K, respectively, and the number of classes from 49K to 9.4K, 5.6K, and 3.2K. We settled on 20 to ensure all splits had at least one sample of each class, making the dataset more manageable while still keeping it challenging. Second, we split the dataset using a standard 0.7/0.1/0.2 ratio for training, validation, and testing. Our final version keeps 463,437 head images with a resolution of 128x128, representing 3,263 classes. The mean, median, and standard deviation of samples per class are 142, 48, and 359, respectively.
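
The thresholding and splitting logic is straightforward; the following is a minimal Python sketch of it, assuming a hypothetical list of (image_path, class_id) pairs rather than the exact metadata files shipped with DAF.

```python
import random
from collections import Counter

def filter_and_split(labels, min_samples=20, ratios=(0.7, 0.1, 0.2), seed=0):
    """Keep classes with at least `min_samples` images, then split 70/10/20.

    `labels` is a hypothetical list of (image_path, class_id) pairs; the real
    DAF:re scripts operate on the DAF metadata, so this is only illustrative.
    """
    counts = Counter(class_id for _, class_id in labels)
    kept = [(p, c) for p, c in labels if counts[c] >= min_samples]

    # Split per class so that every class appears in train, val and test.
    by_class = {}
    for p, c in kept:
        by_class.setdefault(c, []).append(p)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for c, paths in by_class.items():
        rng.shuffle(paths)
        n = len(paths)
        n_train = int(ratios[0] * n)
        n_val = max(1, int(ratios[1] * n))
        train += [(p, c) for p in paths[:n_train]]
        val += [(p, c) for p in paths[n_train:n_train + n_val]]
        test += [(p, c) for p in paths[n_train + n_val:]]
    return train, val, test
```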

moeImouto

The moeImouto dataset was obtained from Kaggle2. It was originally created by nagadomi3 using a custom face detector based on the Viola-Jones cascade classifier. It originally contained 14,397 head images with a resolution of roughly 160x160, representing 173 character classes. We discard two images that were not saved in RGB format, leaving us with 14,395 images, which we split into training and testing sets with ratios of 0.8 and 0.2, respectively. The mean, median, and standard deviation of samples per class are 83, 80, and 27, respectively.
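
As an illustration of the RGB filtering step, the sketch below uses Pillow to discard any image whose stored mode is not RGB; the directory layout and file extensions are assumptions and may differ from the Kaggle archive.

```python
from pathlib import Path
from PIL import Image

def keep_rgb_images(image_dir):
    """Return the paths of images stored in RGB mode; anything else
    (e.g. grayscale or palette images) is discarded."""
    kept = []
    for path in Path(image_dir).rglob("*"):
        if path.suffix.lower() not in {".png", ".jpg", ".jpeg"}:
            continue
        with Image.open(path) as img:
            if img.mode == "RGB":
                kept.append(path)
    return kept
```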

3.2 Experiments

We conducted experiments on the two aforementioned datasets, training for 50 or 200 epochs, using image sizes of 128x128 or 224x224 and batch sizes of either 64 or 1024 images. We compare these settings across a variety of neural network architectures for image classification. As a baseline we use a shallow CNN architecture, based mostly on LeNet, with only 5 layers. We also run experiments with ResNet-18 and ResNet-152, both pre-trained on ImageNet-1K and trained from scratch, and with the self-attention based ViT B-16, B-32, L-16, and L-32. For the pre-trained ResNet models we freeze all layers except the classification layer, which we replace according to the number of classes in our dataset. For all of our experiments we use stochastic gradient descent (SGD) with momentum, with an initial learning rate (LR) of 0.001 and momentum of 0.9. We also apply LR decay, reducing the current LR by 1/3 every 20 epochs when training for 50 epochs, and every 50 epochs when training for 200 epochs.
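
As a rough sketch of how this configuration maps to code (using torchvision model names; the exact Animesion training loop may differ), the snippet below builds an ImageNet-pretrained ResNet-18 with only its classification head trainable, and sets up the SGD optimizer with step-wise LR decay. Whether "reduce by 1/3" means multiplying the LR by 1/3 or by 2/3 is an interpretation, so the gamma value below is an assumption.

```python
import torch
import torchvision

NUM_CLASSES = 3263  # DAF:re; use 173 for moeImouto

# ImageNet-pretrained ResNet-18: freeze the backbone, replace and train
# only the final classification layer. Newer torchvision versions use the
# `weights=` argument instead of `pretrained=True`.
model = torchvision.models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False
model.fc = torch.nn.Linear(model.fc.in_features, NUM_CLASSES)

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=0.001, momentum=0.9)

# Step-wise LR decay every 20 epochs for the 50-epoch schedule
# (step_size=50 for the 200-epoch schedule); gamma=1/3 is our reading
# of "reduce the LR by 1/3".
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=1/3)
```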

As a pre-processing step, we normalize the images and apply random flips and random crops for data augmentation during training: we first resize the image to a square of size 160 or 256, then take a random square crop of the desired input size (128 or 224). For validation and testing, we only resize and normalize the images.
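
A sketch of this pipeline with torchvision transforms is shown below; the normalization statistics are an assumption (ImageNet means and standard deviations are the common default when fine-tuning ImageNet-pretrained backbones), since the exact values are not specified here.

```python
import torchvision.transforms as T

# Assumed normalization statistics (ImageNet defaults).
MEAN, STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)

def build_transforms(input_size=128, resize_size=160, train=True):
    """Training: resize to a square, then random crop + random horizontal flip.
    Validation/testing: resize to the input size only.
    Use input_size=224 and resize_size=256 for the larger setting."""
    if train:
        return T.Compose([
            T.Resize((resize_size, resize_size)),
            T.RandomCrop(input_size),
            T.RandomHorizontalFlip(),
            T.ToTensor(),
            T.Normalize(MEAN, STD),
        ])
    return T.Compose([
        T.Resize((input_size, input_size)),
        T.ToTensor(),
        T.Normalize(MEAN, STD),
    ])
```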

4 Results and Discussion

We use validation and testing top-1 and top-5 classification accuracies as our performance metrics. In this section we refer to the shallow architecture as SN (ShallowNet), ResNet-18 as R-18, ResNet-152 as R-152, and the ViT models by their configuration (B-16, B-32, L-16, L-32). The results are summarized in Tables 1, 2, 3, and 4, with the best results highlighted in bold.
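
For reference, top-1 and top-5 accuracy can be computed from a batch of logits as in the following sketch (a standard formulation, not code lifted from Animesion).

```python
import torch

def topk_accuracy(logits, targets, ks=(1, 5)):
    """Fraction of samples whose ground-truth class is among the k
    highest-scoring predictions, for each k in `ks`."""
    max_k = max(ks)
    _, pred = logits.topk(max_k, dim=1)      # (batch, max_k) class indices
    correct = pred.eq(targets.unsqueeze(1))  # (batch, max_k) boolean hits
    return {k: correct[:, :k].any(dim=1).float().mean().item() for k in ks}

# Example: a batch of 8 samples over 3263 classes.
accs = topk_accuracy(torch.randn(8, 3263), torch.randint(0, 3263, (8,)))
print(accs)  # e.g. {1: 0.0, 5: 0.125}
```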

Particularly noteworthy is the fact that the CNN-based architectures severely outperform the ViT models when using a batch size of 1024 with an image size of 224x224, regardless of the dataset and of whether the models are pre-trained. However, when using the smaller image size of 128x128 with the smaller batch size of 64, the ViT models obtain much better results when pre-trained, and competitive results when trained from scratch.

Model   Pretrained=False     Pretrained=True
        Top-1     Top-5      Top-1     Top-5
R-18    69.09     84.64      26.47     45.30
R-152   64.36     81.20      26.49     44.88
B-16    63.30     78.58      82.14     92.77
B-32    51.09     71.30      75.42     89.22
L-16    59.39     77.91      85.95     94.23
L-32    51.81     71.81      75.88     89.39

Table 1: Classification accuracy (%) on DAF:re, trained for 50 epochs with batch size 64 and image size 128x128.
Model   Pretrained=False     Pretrained=True
        Top-1     Top-5      Top-1     Top-5
R-18    72.58     90.86      61.20     83.83
R-152   63.17     86.41      63.06     85.90
B-16    67.28     82.22      91.57     98.06
B-32    48.76     78.19      85.76     96.81
L-16    66.56     87.43      92.80     98.44
L-32    49.88     78.36      85.22     96.94

Table 2: Classification accuracy (%) on moeImouto, trained for 200 epochs with batch size 64 and image size 128x128.
Model   Pretrained=False     Pretrained=True
        Top-1     Top-5      Top-1     Top-5
SN      53.68     72.04      -         -
R-18    68.30     84.01      24.31     39.82
B-32    38.19     59.06      59.92     79.20

Table 3: Classification accuracy (%) on DAF:re, trained for 200 epochs with batch size 1024 and image size 224x224. The shallow network (SN) has no pre-trained variant.
Model   Pretrained=False     Pretrained=True
        Top-1     Top-5      Top-1     Top-5
SN      57.49     80.43      -         -
R-18    60.41     84.20      34.08     58.95
B-32    9.17      20.12      24.69     54.03

Table 4: Classification accuracy (%) on moeImouto, trained for 200 epochs with batch size 1024 and image size 224x224. The shallow network (SN) has no pre-trained variant.

5 Future Work

DAF:re can be easily modified to include more or fewer images per class and, following the original methodology proposed by the authors of DAF, augmented with the updates made in 2019 to the parent dataset, Danbooru2019. Furthermore, the original DAF also included bounding boxes, but to make this initial version more manageable, we decided to focus on the classification task. Our next step is to update the dataset to include bounding boxes.

With respect to ViT models, there is still much work to be done. A detailed study of the effects of image size and batch size, for upstream and downstream tasks, in both similar-domain and domain-adaptation settings, needs to be conducted.

6 Conclusion

We present the DAF:re dataset to study the challenging problem of anime character recognition. We perform extensive experiments on the DAF:re and moeImouto datasets using a variety of models. From our results, we conclude that while ViT models offer a promising alternative to CNN-based models for image classification, more work needs to be done on the effects of different hyperparameters if we aim to fully exploit the generalization and transfer learning capacities of transformers for computer vision applications.

7 Disclaimer

This dataset was created to enable the study of computer vision for anime multimedia systems. DAF:re does not own the copyright of these images. It only provides thumbnails of images, in a way similar to ImageNet.

Footnotes

  1. https://github.com/grapeot/Danbooru2018AnimeCharacterRecognitionDataset
  2. https://www.kaggle.com/mylesoneill/tagged-anime-illustrations/home
  3. http://www.nurs.or.jp/~nagadomi/animeface-character-dataset/

References

  1. (2020-12) Anime Industry Report 2019 Summary. The Association of Japanese Animations (ja). External Links: Link Cited by: §1.
  2. O. Augereau, M. Iwata and K. Kise (2018-04) A survey of comics research in computer science. arXiv:1804.05490 [cs]. Note: arXiv: 1804.05490 External Links: Link Cited by: §2.2.
  3. G. Branwen (2015-12) Danbooru2019: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset. (en-us). Note: Last Modified: 2020-09-04 External Links: Link Cited by: §3.1.1.
  4. P. Brophy (2007-01) Tezuka the Marvel of Manga. National Gallery of Victoria, Melbourne, Vic. External Links: ISBN 978-0-7241-0278-5 Cited by: §1.
  5. N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov and S. Zagoruyko (2020-05) End-to-End Object Detection with Transformers. arXiv:2005.12872 [cs]. Note: arXiv: 2005.12872 External Links: Link Cited by: §2.1.
  6. J. Deng, W. Dong, R. Socher, L. Li, Kai Li and Li Fei-Fei (2009-06) ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Note: ISSN: 1063-6919 External Links: Document Cited by: §2.1.
  7. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit and N. Houlsby (2020-10) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929 [cs]. Note: arXiv: 2010.11929 External Links: Link Cited by: §1, §2.1.
  8. N. Frey and D. Fisher (2008-01) Teaching Visual Literacy: Using Comic Books, Graphic Novels, Anime, Cartoons, and More to Develop Comprehension and Thinking Skills. Corwin Press (en). Note: Google-Books-ID: cb4xcSFkFtsC External Links: ISBN 978-1-4129-5311-5 Cited by: §1.
  9. R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann and W. Brendel (2019-01) ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv:1811.12231 [cs, q-bio, stat]. Note: arXiv: 1811.12231 External Links: Link Cited by: §2.2.
  10. D. Harding (2020-12) Demon Slayer: Mugen Train Dethrones Spirited Away to Become the No. 1 Film in Japan of All Time. (en-us). External Links: Link Cited by: §1.
  11. K. He, X. Zhang, S. Ren and J. Sun (2015-12) Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs]. Note: arXiv: 1512.03385 External Links: Link Cited by: §1, §2.1.
  12. Y. Jin, J. Zhang, M. Li, Y. Tian, H. Zhu and Z. Fang (2017-08) Towards the Automatic Anime Characters Creation with Generative Adversarial Networks. arXiv:1708.05509 [cs]. Note: arXiv: 1708.05509 External Links: Link Cited by: §2.2.
  13. S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan and M. Shah (2021-01) Transformers in Vision: A Survey. arXiv:2101.01169 [cs]. Note: arXiv: 2101.01169 External Links: Link Cited by: §2.1.
  14. A. Krizhevsky, I. Sutskever and G. E. Hinton (2012) ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems 25, pp. 1097–1105 (en). External Links: Link Cited by: §2.1.
  15. Y. LeCun, Y. Bengio and G. Hinton (2015-05) Deep learning. Nature 521 (7553), pp. 436–444 (en). External Links: ISSN 1476-4687, Link, Document Cited by: §2.1.
  16. Y. Matsui, K. Ito, Y. Aramaki, A. Fujimoto, T. Ogawa, T. Yamasaki and K. Aizawa (2017-10) Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications 76 (20), pp. 21811–21838 (en). External Links: ISSN 1573-7721, Link, Document Cited by: §2.2.
  17. R. Narita, K. Tsubota, T. Yamasaki and K. Aizawa (2017-11) Sketch-Based Manga Retrieval Using Deep Features. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 03, pp. 49–53. Note: ISSN: 2379-2140 External Links: Document Cited by: §2.2.
  18. N. Parmar, A. Vaswani, J. Uszkoreit, Å. Kaiser, N. Shazeer, A. Ku and D. Tran (2018-06) Image Transformer. arXiv:1802.05751 [cs]. Note: arXiv: 1802.05751 External Links: Link Cited by: §2.1.
  19. W. Sun, J. Burie, J. Ogier and K. Kise (2013-08) Specific Comic Character Detection Using Local Feature Matching. In 2013 12th International Conference on Document Analysis and Recognition, pp. 275–279. Note: ISSN: 2379-2140 External Links: Document Cited by: §2.2.
  20. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin (2017-12) Attention Is All You Need. arXiv:1706.03762 [cs]. Note: arXiv: 1706.03762 External Links: Link Cited by: §2.1.
  21. L. Zhang, Y. Ji and X. Lin (2017-06) Style Transfer for Anime Sketches with Enhanced Residual U-net and Auxiliary Classifier GAN. arXiv:1706.03319 [cs]. Note: arXiv: 1706.03319 External Links: Link Cited by: §2.2.
  22. Y. Zheng, Y. Zhao, M. Ren, H. Yan, X. Lu, J. Liu and J. Li (2020-06) Cartoon Face Recognition: A Benchmark Dataset. arXiv:1907.13394 [cs]. Note: arXiv: 1907.13394 External Links: Link Cited by: §2.2.