Afro-MNIST: Synthetic generation of MNIST-style datasets for low-resource languages
We present Afro-MNIST, a set of synthetic MNIST-style datasets for four orthographies used in Afro-Asiatic and Niger-Congo languages: Ge‘ez (Ethiopic), Vai, Osmanya, and N’Ko. These datasets serve as “drop-in” replacements for MNIST. We also describe and open-source a method for generating synthetic MNIST-style datasets from a single example of each digit. The datasets can be found at https://github.com/Daniel-Wu/AfroMNIST. We hope that MNIST-style datasets will be developed for other numeral systems, and that these datasets vitalize machine learning education in nations that are underrepresented in the research community.
Classifying MNIST Hindu-Arabic numerals (LeCun et al., 1998) has become the “Hello World!” challenge of the machine learning community. This task has excited a large number of prospective machine learning scientists and has led to practical advancements in optical character recognition.
The Hindu-Arabic numeral system is the predominant numeral system used in the world today. However, there is a sizeable number of languages whose numeric glyphs are not inherited from the Hindu-Arabic numeral system. Where human language is concerned, work in machine learning and artificial intelligence focuses almost exclusively on high-resource languages such as English and Mandarin Chinese.
Unfortunately, these “mainstream” languages comprise only a tiny fraction of all extant languages. Indeed, of over 7,000 languages in the world, the vast majority are not represented in the machine learning research community. In particular, we note that there is a vast collection of alternative numeral systems for which an MNIST-style dataset is not available.
In an effort to make machine learning education more accessible to diverse groups of people, it is imperative that we develop datasets which represent the heterogeneity of existing numeral systems. Furthermore, as studied in Mgqwashu (2011), familiarity of the glyph shapes from the students’ mother tongue facilitates learning and enhances epistemological access. With the recent drive towards spreading AI literacy in Africa, we argue that it would be rather useful to use local numeral glyphs in the initial Machine Learning 101 courses to spark enthusiasm as well as to establish familiarity.
Additionally, we are inspired by the dire warning that appears in the work by Mgqwashu (2011) that the numeral system is the most endangered aspect of any language and that the excuse of not having these datasets and downstream OCR applications will further accelerate the decline of several numeral systems dealt with in this work.
In similar work, Prabhu (2019) collected and open-sourced an MNIST-style dataset for the Kannada language; however, this process took a significant amount of effort from 65 volunteers. While this methodology is effective for high-resource languages, we expect that it is not practical enough to be applied to the large number of low-resource languages.
Much of the world’s linguistic diversity comes from languages spoken in developing nations. In particular, there is a wealth of linguistic diversity in the languages of Africa, many of which have dedicated orthographies and numeral systems. In this work, we focus on the Ge‘ez (Ethiopic), Vai, Osmanya, and N’Ko scripts.
Because large amounts of training data for African languages such as Amharic and Somali are not readily available, we experiment with creating synthetic numerals that mimic the likeness of handwritten numerals in their writing systems. Previous work in few-shot learning and representation learning has shown that effective neural networks can be trained on highly perturbed versions of just a single image of each class (Dosovitskiy et al., 2015). Thus, we propose the synthetic generation of MNIST-style datasets from Unicode exemplars of each numeral.
We release synthetic MNIST-style datasets for four scripts used to write Afro-Asiatic or Niger-Congo languages: Ge‘ez, Vai, Osmanya, and N’Ko.
We describe a general framework for resource-light synthesis of MNIST-style datasets.
Inspired by the work in Prabhu et al. (2019), we first generate an exemplar seed dataset for each numeral system (Figure 2) from the corresponding Unicode characters. Following Qi et al. (2006), from a category-theoretic perspective these would constitute the classwise prototypes, as they:
Reflect the central tendency of the instances’ properties or patterns;
Are more similar to some category members than others;
Are themselves self-realizable but are not necessarily an instance.
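As a small illustration of what the Unicode exemplars are, the sketch below lists the ten code points per script using Python's standard `unicodedata` module. The dictionary and the function name `exemplar_codepoints` are hypothetical and not taken from the released code; rendering the glyphs to images would additionally need a font that covers each script, which is not shown here.

```python
import unicodedata

# First code point of each script's run of ten numerals in Unicode.
# Ge'ez has no zero, so its ten classes are the numerals 1-10.
SCRIPT_FIRST_CODEPOINT = {
    "Ge'ez": 0x1369,     # ETHIOPIC DIGIT ONE ... ETHIOPIC NUMBER TEN
    "Vai": 0xA620,       # VAI DIGIT ZERO ... VAI DIGIT NINE
    "Osmanya": 0x104A0,  # OSMANYA DIGIT ZERO ... OSMANYA DIGIT NINE
    "N'Ko": 0x07C0,      # NKO DIGIT ZERO ... NKO DIGIT NINE
}


def exemplar_codepoints(script):
    """Return the ten exemplar code points for a script, in class order."""
    start = SCRIPT_FIRST_CODEPOINT[script]
    return [start + offset for offset in range(10)]


for script in SCRIPT_FIRST_CODEPOINT:
    for codepoint in exemplar_codepoints(script):
        character = chr(codepoint)
        # Print the code point, its official name, and its numeric value.
        print(script, f"U+{codepoint:04X}", unicodedata.name(character),
              int(unicodedata.numeric(character)))
```

Each class label then corresponds to one prototype character, from which the deformed samples are derived.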
In order to generate synthetic examples from these prototypical glyphs, we apply elastic deformations and corruptions (similar to Mu and Gilmer (2019) and as mentioned in Simard et al. (2003)) to the exemplars. We empirically chose elastic deformation parameters in order to maximize variance while still retaining visual distinctness. The impact of these parameters on the synthetic image is explored in the Appendix (Figure 11).
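The deformation step can be sketched as follows, in the spirit of the elastic distortions of Simard et al. (2003). This is a minimal NumPy-only sketch and not the released pipeline: the function names, the parameter names `alpha` (displacement scale) and `sigma` (smoothing width), the values used, and the toy vertical stroke standing in for a rendered glyph are all assumptions made for illustration.

```python
import numpy as np


def gaussian_matrix(size, sigma):
    """Row-normalised Gaussian smoothing matrix of shape (size, size)."""
    index = np.arange(size)
    weights = np.exp(-((index[:, None] - index[None, :]) ** 2) / (2.0 * sigma ** 2))
    return weights / weights.sum(axis=1, keepdims=True)


def elastic_deform(image, alpha, sigma, rng):
    """Warp a 2-D image with a smoothed random displacement field."""
    height, width = image.shape
    smooth_rows = gaussian_matrix(height, sigma)
    smooth_cols = gaussian_matrix(width, sigma)
    # Random per-pixel displacements, blurred so neighbouring pixels move together.
    shift_y = alpha * (smooth_rows @ rng.uniform(-1, 1, (height, width)) @ smooth_cols.T)
    shift_x = alpha * (smooth_rows @ rng.uniform(-1, 1, (height, width)) @ smooth_cols.T)
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    # Source coordinates, clamped to the image so edge pixels are replicated.
    src_y = np.clip(rows + shift_y, 0, height - 1)
    src_x = np.clip(cols + shift_x, 0, width - 1)
    # Bilinear interpolation between the four surrounding pixels.
    y0 = np.floor(src_y).astype(int)
    x0 = np.floor(src_x).astype(int)
    y1 = np.minimum(y0 + 1, height - 1)
    x1 = np.minimum(x0 + 1, width - 1)
    frac_y = src_y - y0
    frac_x = src_x - x0
    top = (1 - frac_x) * image[y0, x0] + frac_x * image[y0, x1]
    bottom = (1 - frac_x) * image[y1, x0] + frac_x * image[y1, x1]
    return (1 - frac_y) * top + frac_y * bottom


rng = np.random.default_rng(0)
exemplar = np.zeros((28, 28))
exemplar[6:22, 12:16] = 1.0  # toy stroke standing in for a rendered glyph
variants = np.stack([elastic_deform(exemplar, alpha=34.0, sigma=4.0, rng=rng)
                     for _ in range(5)])
print(variants.shape)
```

Larger `alpha` moves pixels further and smaller `sigma` makes the displacement field less smooth, which is the trade-off between variance and visual distinctness described above.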
We compared our results to a small dataset of written Ge‘ez digits (Molla, 2019); the differences in exemplar versus handwritten data are shown in Figure 3. We note that, in cases where a limited amount of handwritten data is available, deformations and corruptions can be applied to those examples instead of Unicode exemplars.
3 Dataset Details
To produce “drop-in” replacements for MNIST, we closely emulate the format of the latter. Each of our datasets contains 60,000 training images and 10,000 testing images. Each image is greyscale and, as in MNIST, 28 × 28 pixels in size. Each dataset contains an equal number of images of each digit.
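A quick sanity check of this format might look like the sketch below. The function `check_split` is hypothetical, the arrays are zero-filled placeholders rather than the released data, and `side=28` assumes MNIST's image size; how the released files are actually loaded is not shown.

```python
import numpy as np


def check_split(images, labels, expected_count, side=28, num_classes=10):
    """Validate one MNIST-style split and return its per-class counts."""
    if images.shape != (expected_count, side, side):
        raise ValueError(f"unexpected image array shape {images.shape}")
    if labels.shape != (expected_count,):
        raise ValueError(f"unexpected label array shape {labels.shape}")
    counts = np.bincount(labels, minlength=num_classes)
    # Every class must appear equally often.
    if counts.size != num_classes or np.any(counts != expected_count // num_classes):
        raise ValueError("classes are not balanced")
    return counts


# Placeholder arrays with the stated sizes (not real data).
train_images = np.zeros((60000, 28, 28), dtype=np.uint8)
train_labels = np.tile(np.arange(10), 6000)
test_images = np.zeros((10000, 28, 28), dtype=np.uint8)
test_labels = np.tile(np.arange(10), 1000)

train_counts = check_split(train_images, train_labels, expected_count=60000)
test_counts = check_split(test_images, test_labels, expected_count=10000)
print(train_counts, test_counts)
```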
We are also interested in the morphological differences between our generated datasets and the original (Hindu-Arabic) MNIST dataset. We visualize the morphological characteristics of our datasets according to the methodology of Castro et al. (2019) and also plot the UMAP embeddings of the data. These analyses on Ge‘ez-MNIST are shown in Figures 5 and 6. Analogous analyses for our other datasets are given in the Appendix (Figures 8 to 10).
To provide a baseline for machine learning methods on these datasets, we train LeNet-5 (LeCun et al., 1998), the network architecture first used on the original MNIST dataset, for numeral classification. We train with the Adam optimizer with an initial learning rate of using the categorical crossentropy loss. Models were trained until convergence. The model architecture is described in the Appendix (Table 2). Results are shown in Table 1.
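To make the training objective concrete, the sketch below shows how the categorical cross-entropy loss and an Adam update fit together. It is deliberately not LeNet-5: a single linear softmax layer on toy two-dimensional data is swapped in so that the example stays self-contained in NumPy. The learning rate of `1e-3` is a commonly used Adam default and is an assumption here, not the value used in our experiments; all function names are hypothetical.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum(axis=1, keepdims=True)


def categorical_cross_entropy(probabilities, onehot):
    return float(-np.mean(np.sum(onehot * np.log(probabilities + 1e-12), axis=1)))


def adam_step(param, grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` holds the step count and moment estimates."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected second moment
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)


# Toy data: three well-separated clusters, one per class.
rng = np.random.default_rng(0)
angles = 2 * np.pi * np.arange(3) / 3
centers = 3.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
labels = np.repeat(np.arange(3), 100)
features = centers[labels] + 0.5 * rng.standard_normal((300, 2))
onehot = np.eye(3)[labels]

LEARNING_RATE = 1e-3  # assumption: common Adam default, not the paper's value
weights = np.zeros((2, 3))
state = {"t": 0, "m": np.zeros_like(weights), "v": np.zeros_like(weights)}
initial_loss = categorical_cross_entropy(softmax(features @ weights), onehot)
for _ in range(300):
    probabilities = softmax(features @ weights)
    # Gradient of the mean cross-entropy with respect to the weights.
    grad = features.T @ (probabilities - onehot) / len(features)
    weights = adam_step(weights, grad, state, LEARNING_RATE)

final_probabilities = softmax(features @ weights)
final_loss = categorical_cross_entropy(final_probabilities, onehot)
accuracy = float(np.mean(final_probabilities.argmax(axis=1) == labels))
print(initial_loss, final_loss, accuracy)
```

In the actual baseline, the linear layer would be replaced by the LeNet-5 architecture and the toy clusters by the image data; the loss and the update rule have the same form.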
We note that certain scripts show comparatively high variability when handwritten relative to our exemplars, and this leads to poor generalization (Figures 3 and 6). Furthermore, it is clear from an examination of the UMAP embeddings that our dataset is not as heterogeneous as in-the-wild handwriting. When we tested a LeNet-5 trained on Ge‘ez-MNIST on the aforementioned dataset of handwritten Ge‘ez numerals, the model achieved an accuracy of only 30.30%. Certain numerals were easier to distinguish than others (Figure 7). We expect this benchmark to be a fertile starting point for exploring augmentation and transfer learning strategies for low-resource languages.
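One way to see which numerals are easier to distinguish is to tabulate a confusion matrix and per-class accuracy from the model's predictions. The sketch below uses small made-up label arrays purely for illustration; the function names are hypothetical and the numbers are not results from our experiments.

```python
import numpy as np


def confusion_matrix(true_labels, predicted_labels, num_classes=10):
    """Entry [i, j] counts samples of true class i predicted as class j."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(matrix, (true_labels, predicted_labels), 1)
    return matrix


def per_class_accuracy(matrix):
    """Fraction of each true class predicted correctly (NaN for empty classes)."""
    totals = matrix.sum(axis=1)
    result = np.full(len(totals), np.nan)
    np.divide(np.diag(matrix), totals, out=result, where=totals > 0)
    return result


# Made-up labels for three classes, for illustration only.
true_labels = np.array([0, 0, 1, 1, 2, 2])
predicted_labels = np.array([0, 1, 1, 1, 2, 0])

matrix = confusion_matrix(true_labels, predicted_labels, num_classes=3)
class_accuracy = per_class_accuracy(matrix)
overall_accuracy = np.trace(matrix) / matrix.sum()
print(matrix)
print(class_accuracy, overall_accuracy)
```

Off-diagonal entries show which pairs of numerals are confused with each other, which can guide the choice of augmentations.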
We present Afro-MNIST — a set of MNIST-style datasets for numerals in four African orthographies. It is our hope that the availability of these datasets enables the next generation of diverse research scientists to have their own “Hello World” moment. We also open-source a simple pipeline to generate these datasets without any manual data collection, and we look forward to seeing the release of a wide range of new MNIST-style challenges from the machine learning community for other numeral systems.
Appendix A
- We note that the Vai, Osmanya, and N’Ko scripts are not in wide use, but nonetheless they can be synthesized using the methods we present.
- The Ge‘ez script lacks the digit 0, so our classes represent the numerals 1-10 for that script.
- Morpho-MNIST: quantitative assessment and diagnostics for representation learning. Journal of Machine Learning Research 20.
- Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (9), pp. 1734–1747.
- Languages of the world. SIL International.
- Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
- Academic literacy in the mother tongue: a pre-requisite for epistemological access. Diversity, Transformation and Student Experience in Higher Education Teaching and Learning, pp. 159.
- Ethiopian-mnist. GitHub. https://github.com/Tesfamichael1074/Ethiopian-MNIST
- Mnist-c: a robustness benchmark for computer vision. arXiv preprint arXiv:1906.02337.
- Fonts-2-handwriting: a seed-augment-train framework for universal digit classification. arXiv preprint arXiv:1905.08633.
- Kannada-mnist: a new handwritten digits dataset for the kannada language. arXiv preprint arXiv:1908.01242.
- Fuzzy soil mapping based on prototype category theory. Geoderma 136 (3-4), pp. 774–787.
- Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, Vol. 3.