# Hangul Fonts Dataset: a Hierarchical and Compositional Dataset for Interrogating Learned Representations

###### Abstract

Interpretable representations of data are useful for testing a hypothesis or to distinguish between multiple potential hypotheses about the data. In contrast, applied machine learning, and specifically deep learning (DL), is often used in contexts where performance is valued over interpretability. Indeed, deep networks (DNs) are often treated as “black boxes”, and it is not well understood what and how they learn from a given dataset. This lack of understanding seriously hinders adoption of DNs as data analysis tools in science and poses numerous research questions. One problem is that current deep learning research datasets either have very little hierarchical structure or are too complex for their structure to be analyzed, impeding precise predictions of hierarchical representations. To address this gap, we present a benchmark dataset with known hierarchical and compositional structure and a set of methods for performing hypothesis-driven data analysis using DNs. The Hangul Fonts Dataset is composed of 35 fonts, each with 11,172 written syllables consisting of 19 initial consonants, 21 medial vowels, and 28 final consonants. The rules for combining and modifying individual Hangul characters into blocks can be encoded, with translation, scaling, and style variation that depend on precise block content, as well as naturalistic variation across fonts. Thus, the Hangul Fonts Dataset will provide an intermediate complexity dataset with well-defined, hierarchical features to interrogate learned representations. We first present a summary of the structure of the dataset. Using a set of unsupervised and supervised methods, we find that deep network representations contain structure related to the geometrical hierarchy of the characters. Our results lay the foundation for a better understanding of what deep networks learn from complex, structured datasets.

## 1 Introduction

Representation learning underlies many machine learning and deep learning methods. Representations learned from data are valuable if they can be used for downstream analyses or to solve a task. Interpretable representations can be used to understand the underlying structure of a dataset (Murdoch et al., 2019). Understanding the representations deep networks learn and how they relate to the structure of the training data is an area of open research (Cheung et al., 2014; Higgins et al., 2017; Saxe et al., 2013).

The goal of experimental science is to generate datasets that can be used to uncover the underlying structure of the world around us. Broadly, the analysis of the produced data falls into two categories that are often used together: exploratory and hypothesis-driven. In exploratory data analysis, there may not be a detailed hypothesis being considered or tested. The goal is typically to uncover the general structure or patterns in a dataset. In hypothesis-driven analysis, one or more hypotheses about the underlying structure in the data are being directly tested. In this case, the goal is to test how well the hypotheses account for the observed data. One of the goals of machine learning and deep learning is to learn representations which are useful for understanding the structure of data or which help solve tasks. In this study, we present an image dataset based on the Hangul writing system for developing methods for hypothesis-driven data analysis of representations learned by deep networks.

One approach to understanding learned representations is to build datasets with known structure. Deep learning benchmark datasets are often drawn from real data (Fig 1A, B, C, E). In image datasets like MNIST, CIFAR10/100, and ImageNet, the images have naturally varying levels of low-level complexity (Fig 1F, data complexity axis, defined as the mutual information between 2 halves of 10 pixel image patches). These datasets have class labels which limit task complexity (Fig 1F, task complexity axis, measured by the entropy of the labels), but other sub-tasks can be created, for example even-odd classification for MNIST or reductions on the WordNet hierarchy for ImageNet. However, one limitation of many benchmark datasets is that the underlying structure of the data is not well specified. This implies that they are of limited use for understanding the structure of learned representations.

In this work, we present the new Hangul Fonts Dataset (Fig 1D) for investigating methods for understanding learned representations. This dataset has known latent hierarchical and compositional structure which can be used to test hypotheses about representations. The dataset lies in the middle of common deep learning image benchmark datasets (Fig 1F) in terms of low-level data complexity and task complexity. However, unlike the common benchmark datasets, the latent hierarchical and compositional structure is known. We describe a framework and set of methods for comparing generative hypotheses for datasets to deep network representations learned from data. These methods are generally applicable to the case where domain scientists are testing hypotheses about data structure using deep networks. Finally, we explore whether typical deep learning methods can be used to uncover the underlying generative model of the Hangul Fonts Dataset. Using deep networks, we see that different parts of the latent structure are either resolved or discarded as one analyzes deeper layers. The Hangul Fonts Dataset contains a large number of data samples (391,020 across 35 fonts), encodable hierarchical and compositional structure, and naturalistic variation. Together these properties address a gap in benchmark datasets for deep learning and representation learning research.

### 1.1 Related Work

#### Benchmark datasets and deep learning

A number of benchmark datasets have been previously proposed for representation learning research. The Shapeset dataset, which is composed of objects of different shapes with various shape parameters (scale, rotation, translation), was used to show that supervised fine-tuning of deep architectures yields better classification performance than training the unsupervised and supervised components separately (Lamblin and Bengio, 2010). Saxe et al. (2013) used a benchmark hierarchical dataset to study the dynamics of backpropagation. The dSprites dataset was developed to aid in factorized representation learning research (Matthey et al., 2017; Higgins et al., 2017). The Hangul Fonts Dataset expands the types of structure in one benchmark dataset that can be learned and also has more structure per image compared to these datasets (see section 2).

#### Structured representations in deep networks

For datasets where the form of the generative model is not known, deep representation learning methods often look for factorial or disentangled representations (Schmidhuber, 1992; Cheung et al., 2014; Radford et al., 2015; Higgins et al., 2017; Singh et al., 2018; Achille and Soatto, 2018; Kazemi et al., 2019). While factorial representations are useful for certain tasks like sampling, they are less useful for understanding datasets with hierarchy and compositionality.

Deep networks can learn feature hierarchies, wherein features from higher levels of the hierarchy are formed by the composition of lower level features. The hierarchical multiscale RNN captures the latent hierarchical structure on two tasks (character-level language modelling and handwriting sequence generation) by encoding the temporal dependencies with different timescales using a novel update mechanism (Chung et al., 2016). Livezey et al. (2018) showed that deep networks learn an articulatory hierarchy when trained on neural data recorded during spoken speech syllables. Hangul characters are formed from a hierarchy of atoms, glyphs, and geometric structure which deep networks can be trained to learn.

## 2 The Hangul Fonts Dataset

The Korean writing system (Hangul) was invented in the year 1444 to promote literacy (National Institute of Korean Language, 2008a). Since the writing system was created at one time for a specific purpose, the exact geometrical rules for combining glyphs into syllable blocks are known. The Hangul alphabet consists of 19 initial consonants, 21 medial vowels, and 27+1 final consonants (including no final consonant), which generate $19 \times 21 \times 28 = 11{,}172$ possible Hangul character blocks, each of which corresponds to a syllable. Not all blocks/syllables are used in written/spoken Korean; however, all 11,172 blocks were generated for use in this dataset. Additionally, there are known redundant characters across the initial, medial, and final positions (National Institute of Korean Language, 2008b). The dataset consists of all blocks drawn in 35 different open-source fonts from (Naver Software; Google) along with single character images and a number of generative labels for blocks and characters, for a total of 391,020 images.

The Hangul blocks can be described most simply as having initial, medial, and final independent generative variables. However, there are other sets of generative variables that can be used to describe the dataset. Some of these other variables have hierarchical structure which is induced by the geometrical layout of the blocks. Others have compositional structure through the repeated use and combination of glyphs (after a set of possible translations, rotations, and scalings) within and across the initial, medial, and final locations. Together, these different descriptions of the data facilitate investigation into what aspects of this structure deep networks will learn when trained on this dataset.

### 2.1 The structure of a block

There is a fixed set of geometrical rules for creating a block from individual character glyphs. These rules provide a set of latent variables that can be attributed to an image of any block. The initial consonant is located on the left or top and the vowel(s) and other consonant(s) follow to the right or bottom. The syllable is read left to right and top to bottom (Fig 2). The initial, medial, and final characters in a block are three generative labels associated with all blocks. The set of all possible blocks can be simply described as the outer-product of these three class labels. Each block is then composed of the constituent initial, medial, and final character glyphs. However, as will be described later, the context of the other characters within a block can change the glyph of a character within a block for a specific font.
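The outer-product structure of the initial, medial, and final labels can be made concrete with the standard Unicode arrangement of precomposed Hangul syllables, which start at U+AC00 and are ordered by initial, then medial, then final index. The following is a minimal sketch of that index arithmetic; the function names are illustrative and are not taken from the dataset's code.

```python
from typing import Tuple

# Counts stated in the text: 19 initials, 21 medials, 28 finals
# (the 28 finals include the "no final consonant" case at index 0).
N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 28
HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable block


def compose(initial: int, medial: int, final: int) -> str:
    """Return the syllable block for zero-based (initial, medial, final) indices."""
    if not (0 <= initial < N_INITIAL and 0 <= medial < N_MEDIAL and 0 <= final < N_FINAL):
        raise ValueError("label index out of range")
    return chr(HANGUL_BASE + (initial * N_MEDIAL + medial) * N_FINAL + final)


def decompose(block: str) -> Tuple[int, int, int]:
    """Recover the zero-based (initial, medial, final) indices of a syllable block."""
    offset = ord(block) - HANGUL_BASE
    if not 0 <= offset < N_INITIAL * N_MEDIAL * N_FINAL:
        raise ValueError("not a precomposed Hangul syllable")
    initial, rest = divmod(offset, N_MEDIAL * N_FINAL)
    medial, final = divmod(rest, N_FINAL)
    return initial, medial, final


# Enumerate the full outer product of the three label sets.
all_blocks = [
    compose(i, m, f)
    for i in range(N_INITIAL)
    for m in range(N_MEDIAL)
    for f in range(N_FINAL)
]
```

Because the mapping is a bijection, `decompose(compose(i, m, f))` returns `(i, m, f)` for every valid triple, so the three class labels of any block can be read directly from its code point.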

### 2.2 Hierarchy and compositionality

In addition to the initial, medial, and final (IMF) labels, each block can also be labeled with additional generative latent variables that describe the hierarchical and compositional structure. There is a geometrical hierarchy across initial, medial, and final labels. Furthermore, there is a base set of glyphs that is first expanded through rotations to a set of glyphs that are composed across IMF positions into blocks.

First, we describe the geometrical hierarchy. There are two types of initial characters: single or double characters (indicated by the white lines in Fig 2). There are five possible medial character types: below, right-single, right-double, below-right-single, and below-right-double (shown across columns in Fig 2). There are three types of final characters: no final character, single, and double (shown across rows and with white lines in Fig 2). It is also possible to describe all 30 of the geometrical possibilities together. These additional generative variables induce a hierarchy in the individual IMF labels and blocks that share geometric structure.
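As a small illustration of how the 30 joint geometry classes arise, the sketch below enumerates the product of the two initial, five medial, and three final geometry types named above. The type names come from the text; the integer ordering of the joint label and the helper name `all_geometry_label` are assumptions made for this example, not the dataset's own encoding.

```python
from itertools import product

# Geometry types as named in the text.
INITIAL_GEOMETRY = ("single", "double")
MEDIAL_GEOMETRY = (
    "below",
    "right-single",
    "right-double",
    "below-right-single",
    "below-right-double",
)
FINAL_GEOMETRY = ("none", "single", "double")

# Every joint geometry is one choice from each of the three sets.
ALL_GEOMETRY = list(product(INITIAL_GEOMETRY, MEDIAL_GEOMETRY, FINAL_GEOMETRY))

# Assumed integer encoding: position in the enumeration above.
GEOMETRY_INDEX = {combo: k for k, combo in enumerate(ALL_GEOMETRY)}


def all_geometry_label(initial_type: str, medial_type: str, final_type: str) -> int:
    """Map one geometry type per position to a single joint class index."""
    return GEOMETRY_INDEX[(initial_type, medial_type, final_type)]
```

Each joint class groups together all blocks that share the same layout, which is what makes the geometry variables coarser than the individual IMF labels.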

Embedded within the IMF and geometrical structure is a set of glyph compositions. There is a base set of atomic glyphs from which all IMF glyphs are drawn (Fig 3, Atom row). Then, one initial, medial, and final glyph is composed into a block (Fig 3, IMF and Block rows). In this view, each block is built from a composition of potential rotations applied to a base set of glyphs which are then structured by the geometrical rules and composed into an image. The underlines in the Atom and IMF rows of Fig 3 correspond to inclusion in the final colored blocks in the bottom row. For comparisons with learned representations, the composition structure is encoded in 2 ways (although the full structure is available in the dataset). The first is a “bag-of-Atoms” binary feature set where each block is given a binary feature vector which contains a 1 for each Atom from the top row of Fig 3 that appears at least once in the block (16 features). The second is a “bag-of-Atom” binary feature set where the rotations have not been taken into account (24 features). These two feature sets do not encode the complete compositional structure, but they are amenable to common representation comparison methods.
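The binary "bag" encoding described above can be sketched as follows. The atom identifiers and the toy block are hypothetical placeholders (the real Atom vocabulary is defined by Fig 3 and the dataset labels); only the presence/absence logic is meant to be illustrative.

```python
from typing import Iterable, List, Sequence


def bag_of_atoms(atoms_in_block: Iterable[str], vocabulary: Sequence[str]) -> List[int]:
    """Binary feature vector: 1 if the atom occurs at least once in the block, else 0."""
    present = set(atoms_in_block)
    unknown = present - set(vocabulary)
    if unknown:
        raise ValueError(f"atoms not in vocabulary: {sorted(unknown)}")
    return [1 if atom in present else 0 for atom in vocabulary]


# Hypothetical 4-atom vocabulary; the dataset's feature sets have 16 and 24 entries.
toy_vocabulary = ["a0", "a1", "a2", "a3"]
# Hypothetical block that uses atom "a2" twice and atom "a0" once.
toy_block = ["a2", "a0", "a2"]

features = bag_of_atoms(toy_block, toy_vocabulary)
```

Repeated atoms collapse to a single 1, which is why this encoding does not capture the complete compositional structure.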

### 2.3 Variation across contexts and fonts

The size and shape of a glyph can change within a font depending on the context. Some of these changes are consistent across fonts and stem from the changing geometry of a block with different initial, medial, or final contexts (Fig 2). Additionally, some of this variation is specific to a font and is based on the decisions the font designer made. These decisions cannot be quantified and are one part of the “naturalistic variation” in this dataset.

There are variations across fonts due to the nature of the design or style of the characters. These include the style of characters, which can vary from clean, computer-like fonts to highly stylized fonts which are meant to resemble hand-written characters (Fig 4). Additionally, certain fonts draw characters that are connected or disconnected from the neighboring characters. These types of variation are the main source of naturalistic variation in the dataset since they cannot be exactly described. Finally, for certain fonts, bold and light versions of the same font are included, a naturalistic but fairly regular source of variation across fonts.

Different types of variations such as rotation, translation, and more naturalistic style variations arise in the dataset (Fig 4). The first set of panels in Fig 4 shows how two medial glyphs can be rotated and preserve the same geometric structure. The second set of panels shows how scaling can alter how much space a glyph takes up in a block. Two initial glyphs are shown in different contexts. The ㄱ initial glyph in 가 can extend from top to bottom but in 굔 the tail is cut short. Similarly, the ㅇ glyph in 악 is smaller in size than it is in 아. For translation, the ㅁ glyph is shaped differently depending on where it is placed in the block. Finally, style displays the same character in three different fonts.

A model could capture and describe this variation by decomposing characters into their constituent parts or strokes (Lake et al., 2015). The Hangul dataset could be used for this type of research into generative models. However, since this “naturalistic” variation cannot be mapped onto any known latent variables, it will be a source of variability that the networks will be tasked with integrating out in this study.

## 3 Recovering generative variables from latent representation

For scientific interpretation, it is desirable for deep network representations to be useful for recovering the generative variables. However, it is currently not known whether deep network representations can be used to read off latent structure in a simple way. In order to understand this, we attempt to recover the latent structure of the Hangul blocks using unsupervised clustering of the representations and supervised logistic regression. We first show results of using these methods on linear models.

Fig 5A shows latent variable recovery for linear methods. We find that the medial, final, and all geometry labels can be recovered from the linear methods. The initial and medial geometry labels are weakly recovered and the initial and final geometry is not recovered.

Unlike the linear methods, deep networks have many representations across layers. To validate the clustering and regression methods, we first apply them in a context where we have strong expectations about the results. If a deep network is trained to predict the initial, medial, or final labels, then we expect the latent structure in the representations to become more similar to that variable across layers. We find that the clustering analysis does recover this structure across layers in the network (Fig 5A-C, left 3 groupings of bars). For networks trained on the initial labels, we find that the medial labels and the medial and all geometry labels are represented in the early layers of the networks, but are no longer represented in the final layers (Fig 5A). Surprisingly, the initial geometry is only weakly represented in the later layers. The final-related labels have representation at chance. For networks trained on the medial labels, we find weak evidence of the initial and final labels and strong evidence of the medial and all geometric labels (Fig 5B). The initial and final geometry labels have representations at chance. For networks trained on the final labels, we find weak evidence of the medial labels and weak evidence of the medial, final, and all geometry labels (Fig 5C). Together, these results could be interpreted as showing strong evidence for the hypothesis that the Hangul Fonts Dataset has structure related to the geometry variables. There is also weak evidence for initial-medial and medial-final context dependence. These observations are consistent with the ground-truth knowledge about the structure of the dataset.

Understanding whether deep network representations tend to be more distributed or local is an open area of research. We investigated whether deep networks learn a local representation by training logistic regression models with an $L_1$ penalty to predict the latent generative variables from the representations (Fig 6). Across initial, medial, and final tasks, we find that the task labels do become more local and can be read out with higher accuracy across layers. The geometric variables do not have a strong trend. The Atom variables cannot be read out with accuracy higher than chance for any network or layer. These results suggest that standard, fully-connected deep networks do not typically learn local representations.

## 4 Methods

### 4.1 Creating and normalizing images

We created a text file for the 11,172 Hangul blocks using the Unicode values from (Programming in Korean). We then converted the text files to an image file using the convert utility (ImageMagick Studio, 2008) and font files. The image sizes were different across blocks, so the images were resized to the max image size across blocks. Then, for each fontsize, the blocks were different sizes across fonts and so the blocks were resized to the median size across fonts. Individual images for the initial, medial, and final characters are included in some fonts. For fonts that do not include these individual glyphs, we cropped them by hand out of composite blocks and then inserted them into blank images at a location and scale determined from other fonts. Further information about the dataset creation process and summary statistics for the dataset can be found in Appendix A. The font files and final data arrays will be posted publicly.

The 35 fonts were used in a 7-fold cross validation loop for the machine learning methods. The fonts were randomly permuted and then 5 fonts were used for each of the non-overlapping validation and test sets. The analysis of representations was done on the test set representations.
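A font-level split of this kind might be sketched as below. The text fixes the number of fonts (35), the number of folds (7), and the size of the validation and test sets (5 fonts each); how the validation fonts are chosen for each fold is not stated, so the rotation used here (the validation group is the group after the test group) is an assumption, as are the function name and the seed.

```python
import random
from typing import Dict, List


def font_folds(n_fonts: int = 35, n_folds: int = 7, seed: int = 0) -> List[Dict[str, List[int]]]:
    """Split font indices into folds with disjoint train/validation/test fonts."""
    if n_fonts % n_folds != 0:
        raise ValueError("n_fonts must be divisible by n_folds")
    fonts = list(range(n_fonts))
    random.Random(seed).shuffle(fonts)  # random permutation of the fonts
    size = n_fonts // n_folds
    groups = [fonts[k * size:(k + 1) * size] for k in range(n_folds)]

    folds = []
    for k in range(n_folds):
        val_k = (k + 1) % n_folds  # assumed rotation: next group is validation
        train = [f for j, g in enumerate(groups) if j not in (k, val_k) for f in g]
        folds.append({"train": train, "val": groups[val_k], "test": groups[k]})
    return folds


folds = font_folds()
```

Splitting by font rather than by image means that every test block is drawn in a font the model has never seen, so the naturalistic style variation is held out along with the images.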

### 4.2 Data and Task Complexity

We define the task complexity as the log of the number of classes (equal to the entropy of the labels if they are equally probable). Benchmark datasets typically have a set of labels that are used. It is also possible to create ad hoc sub-tasks from these labels, for instance, even-versus-odd in MNIST. The Hangul Fonts Dataset has many generative variables and therefore many possible tasks. For ImageNet, the WordNet hierarchy can be used to create tasks that are different from the 1000 label task. Defining and estimating data complexity is much more difficult. Here, to give an intuitive picture, we define data complexity as the mutual information between the 2 adjacent halves of 10 pixel image patches. For all datasets, we first cluster the data with KMeans (30 centroids) and use the JVHW estimator for mutual information (Jiao et al., 2015).
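The two complexity measures can be illustrated with a short sketch. Two simplifications are assumed here: the logarithm is taken base 2 (the text does not state a base), and mutual information is computed with a naive plug-in estimator on discrete labels, which stands in for, and is not, the JVHW estimator used in the paper. The inputs `left` and `right` would be the cluster assignments (for example, KMeans centroid indices) of the two patch halves.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def task_complexity(n_classes: int) -> float:
    """Log (base 2, assumed) of the number of classes, in bits."""
    return math.log2(n_classes)


def plugin_mutual_information(left: Sequence[Hashable], right: Sequence[Hashable]) -> float:
    """Naive plug-in estimate of I(left; right) in bits from paired discrete labels."""
    if len(left) != len(right) or len(left) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(left)
    joint = Counter(zip(left, right))
    count_left = Counter(left)
    count_right = Counter(right)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) * log2( p(a, b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_left[a] * count_right[b]))
    return mi
```

The plug-in estimate is biased upward when the number of samples is small relative to the number of joint states, which is one motivation for using a bias-corrected estimator such as JVHW in practice.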

### 4.3 Generative structure recovery from representation of the data

Both shallow feature learning methods such as PCA, ICA, and NMF and deep learning models create representations (or transformations) of the input data. PCA and ICA produce linear representations, NMF produces a nonlinear inferred representation from a linear model, and deep networks can produce an increasingly nonlinear set of representations for each layer. Given one or more hypotheses about the generative, latent structure of data, using unsupervised and supervised methods, we can test whether the hypothesized structure is contained in the representation in a “simple” way.

Clustering a representation produces a reduced representation for every datapoint in an unsupervised way. If one chooses the number of clusters to be equal to the dimensionality or number of classes the generative variable has, then they can be directly compared (up to a permutation). We cluster the representations with KMeans and then find the optimal alignment of the real and clustered labels (see Appendix B for more details). We then report the clustering accuracy and chance accuracy of this labeling.

The second method attempts to localize the information about a generative variable into a small set of representational features. To do this, we use supervised classification from the representation to the generative variable using logistic regression to test for representations that can be read out in a linear way. In order to have strong selection of features, we use $L_1$-regularized logistic regression to optimize both predictive accuracy and feature selection. Using this method, only a small number of features from the representation will be selected as predictive of each generative label. For a given representation and generative variable, we report the classification accuracy over chance and the number of features selected (median over features or classes) divided by the number of features or classes.
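A minimal sketch of this sparse read-out, assuming an $L_1$ penalty as the sparsity-inducing regularizer and using scikit-learn's `LogisticRegression`, is given below. The synthetic representation, the 3-class label, and the regularization strength `C=0.05` are all hypothetical stand-ins chosen for illustration; they are not the paper's data or hyperparameters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical representation: 600 samples with 20 features.
R = rng.normal(size=(600, 20))
# Hypothetical 3-class generative label that depends only on feature 0.
y = np.digitize(R[:, 0], [-0.5, 0.5])

R_train, R_test, y_train, y_test = train_test_split(R, y, test_size=0.25, random_state=0)

# The L1 penalty drives the weights of uninformative features to exactly zero.
clf = LogisticRegression(penalty="l1", solver="saga", C=0.05, max_iter=5000)
clf.fit(R_train, y_train)

accuracy = clf.score(R_test, y_test)
# Chance level taken as the frequency of the most common training class.
chance = np.bincount(y_train).max() / len(y_train)

# Number of features with non-zero weight for each class, summarized by the
# median over classes and normalized by the total number of features.
selected_per_class = np.count_nonzero(clf.coef_, axis=1)
fraction_selected = float(np.median(selected_per_class)) / R.shape[1]
```

A small `fraction_selected` together with accuracy well above `chance` would indicate that the variable is carried by a few features, which is the sense of "local" used in this analysis.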

For deep networks, these two methods were applied to the activation of every layer both before and after the ReLU nonlinearities. The accuracy and chance were calculated for each of the 7 folds’ test sets and summarized across layers, training variables, and latent generative variables.
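To make "before and after the ReLU nonlinearities" concrete, the sketch below runs a forward pass through a fully-connected network with 3 hidden layers and stores both versions of every hidden activation. It uses NumPy with random weights as a stand-in for a trained network; the layer sizes, the dictionary keys, and the function name are assumptions for illustration only.

```python
import numpy as np


def forward_with_activations(x, weights, biases):
    """Forward pass that records pre- and post-ReLU activations of each hidden layer."""
    activations = {}
    h = x
    n_layers = len(weights)
    for layer, (W, b) in enumerate(zip(weights, biases)):
        pre = h @ W + b
        if layer < n_layers - 1:
            activations[f"hidden{layer + 1}_pre"] = pre
            h = np.maximum(pre, 0.0)  # ReLU
            activations[f"hidden{layer + 1}_post"] = h
        else:
            activations["output"] = pre  # logits, no ReLU on the output layer
            h = pre
    return h, activations


rng = np.random.default_rng(0)
# Hypothetical sizes: 64 input pixels, three hidden layers of 32 units, 19 output classes.
sizes = [64, 32, 32, 32, 19]
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

x = rng.normal(size=(8, 64))  # a batch of 8 hypothetical flattened images
logits, acts = forward_with_activations(x, weights, biases)
```

Each entry of `acts` is one candidate representation to which the clustering and regression analyses can be applied.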

### 4.4 Representation learning methods

Fully-connected networks with 3 hidden layers were trained on one of the initial, medial, or final glyph variables. For each task, 100 sets of hyperparameters were used for training. The hyperparameters and their ranges are listed in Appendix C. All deep learning models were trained using PyTorch (Paszke et al., 2017) on Nvidia GTX 1080s or Titan Xs. The model with the best validation accuracy was chosen and the downstream analysis was done on the test set representations (test accuracies reported in Appendix D). Code for training the networks and reproducing the figures will be posted publicly. Deep network representation analysis was partially completed on the NERSC supercomputer.

Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Non-negative Matrix Factorization (NMF) from scikit-learn (Pedregosa et al., 2011) were used to learn representations from the data. These methods were all trained with 100 components, which is at least 3 times larger than any of the latent generative variables under consideration. The models were trained on the training and validation sets and the representation analysis was done on the test set.
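The linear baselines might be set up along the following lines with scikit-learn. The random non-negative arrays are hypothetical stand-ins for flattened images (NMF requires non-negative input), and the solver settings (`max_iter`, `init`, `random_state`) are assumptions; only the choice of models and the 100 components come from the text.

```python
import numpy as np
from sklearn.decomposition import NMF, PCA, FastICA

rng = np.random.default_rng(0)
# Hypothetical stand-ins: 300 training+validation images and 50 test images,
# each flattened to 144 non-negative pixel values.
X_fit = rng.random((300, 144))
X_test = rng.random((50, 144))

N_COMPONENTS = 100  # as stated in the text

models = {
    "PCA": PCA(n_components=N_COMPONENTS),
    "ICA": FastICA(n_components=N_COMPONENTS, max_iter=500, random_state=0),
    "NMF": NMF(n_components=N_COMPONENTS, init="nndsvda", max_iter=500, random_state=0),
}

# Fit on the training+validation data, then compute test set representations.
representations = {}
for name, model in models.items():
    model.fit(X_fit)
    representations[name] = model.transform(X_test)
```

On unstructured random data such as this, ICA and NMF may emit convergence warnings; the sketch is only meant to show the fit-then-transform pattern that yields one 100-dimensional representation per method.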

## 5 Discussion

The Hangul Fonts Dataset (HFD) presented here is an intermediate complexity benchmark dataset that has hierarchical and compositional structure that can be encoded into a set of auxiliary variables. These features make the HFD well suited for deep representation research. Using a set of unsupervised and supervised methods, we are able to extract information about latent generative variables from the representations of deep networks. Understanding how to recover dataset structure from deep network representations will broaden the application of deep learning in science.

In many scientific domains like cosmology, neuroscience, and climate science, deep learning is being used to make high accuracy predictions given growing dataset sizes (Mathuriya et al., 2018; Livezey et al., 2018; Kim et al., 2019). However, deep learning is not commonly used to directly test hypotheses about dataset structure. This is partially because the nonlinear, compositional structure of deep networks, which is conducive to high accuracy prediction from complex data, is not ideal for interrogating hypotheses about data. It is not generally known how the structure of a dataset influences the learned internal representations and overall mapping or whether the structure of the dataset can be “read-out” of the learned representations. Understanding whether dataset structure can be extracted from learned deep representations is important for the expanded use of deep learning in scientific applications.

In this work, simple fully-connected networks were considered. Understanding how proposed methods for learning factorial or disentangled representations (Schmidhuber, 1992; Cheung et al., 2014; Higgins et al., 2017; Achille and Soatto, 2018) impact the structure of learned representations is important for using deep network representations for hypothesis testing in scientific domains.

## References

- Murdoch et al. [2019] W James Murdoch, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, and Bin Yu. Interpretable machine learning: definitions, methods, and applications. arXiv preprint arXiv:1901.04592, 2019.
- Cheung et al. [2014] Brian Cheung, Jesse A Livezey, Arjun K Bansal, and Bruno A Olshausen. Discovering hidden factors of variation in deep networks. arXiv preprint arXiv:1412.6583, 2014.
- Higgins et al. [2017] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, volume 3, 2017.
- Saxe et al. [2013] Andrew M Saxe, James L McClelland, and Surya Ganguli. Learning hierarchical category structure in deep neural networks. In Proceedings of the 35th annual meeting of the Cognitive Science Society, pages 1271–1276, 2013.
- Lamblin and Bengio [2010] Pascal Lamblin and Yoshua Bengio. Important gains from supervised fine-tuning of deep architectures on large labeled sets. In NIPS* 2010 Deep Learning and Unsupervised Feature Learning Workshop, pages 1–8, 2010.
- Matthey et al. [2017] Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dsprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
- Schmidhuber [1992] Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992.
- Radford et al. [2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- Singh et al. [2018] Krishna Kumar Singh, Utkarsh Ojha, and Yong Jae Lee. Finegan: Unsupervised hierarchical disentanglement for fine-grained object generation and discovery. arXiv preprint arXiv:1811.11155, 2018.
- Achille and Soatto [2018] Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018.
- Kazemi et al. [2019] Hadi Kazemi, Seyed Mehdi Iranmanesh, and Nasser Nasrabadi. Style and content disentanglement in generative adversarial networks. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 848–856. IEEE, 2019.
- Chung et al. [2016] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.
- Livezey et al. [2018] Jesse A Livezey, Kristofer E Bouchard, and Edward F Chang. Deep learning as a tool for neural data analysis: speech classification and cross-frequency coupling in human sensorimotor cortex. arXiv preprint arXiv:1803.09807, 2018.
- National Institute of Korean Language [2008a] National Institute of Korean Language. Want to know about Hangeul?, Jan. 2008a. URL https://web.archive.org/web/20190111001341/http://www.korean.go.kr/eng_hangeul/setting/002.html.
- National Institute of Korean Language [2008b] National Institute of Korean Language. Want to know about Hangeul?, Jan. 2008b. URL https://web.archive.org/web/20190111001835/http://www.korean.go.kr/eng_hangeul/principle/001.html.
- [16] Naver Software. Naver software hangul font collections. URL https://software.naver.com/software/fontList.nhn?categoryId=I0000000.
- [17] Google. Google fonts files. URL https://github.com/google/fontsl.
- Lake et al. [2015] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
- [19] Programming in Korean. Hangul in unicode. URL https://web.archive.org/web/20190513221943/http://www.programminginkorean.com/programming/hangul-in-unicode/.
- ImageMagick Studio [2008] LLC ImageMagick Studio. Imagemagick, 2008.
- Jiao et al. [2015] Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, 2015.
- Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. Pytorch: Tensors and dynamic neural networks in python with strong gpu acceleration. PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration, 6, 2017.
- Pedregosa et al. [2011] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
- Mathuriya et al. [2018] Amrita Mathuriya, Deborah Bard, Peter Mendygral, Lawrence Meadows, James Arnemann, Lei Shao, Siyu He, Tuomas Kärnä, Diana Moise, Simon J Pennycook, et al. Cosmoflow: using deep learning to learn the universe at scale. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 819–829. IEEE, 2018.
- Kim et al. [2019] Sookyung Kim, Hyojin Kim, Joonseok Lee, Sangwoong Yoon, Samira Ebrahimi Kahou, Karthik Kashinath, and Mr Prabhat. Deep-hurricane-tracker: Tracking and forecasting extreme climate events. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1761–1769. IEEE, 2019.
- Kuhn [1955] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.

## Appendix

## Appendix A The Hangul fonts dataset: summary statistics and visualization

### A.1 Font families

For certain fonts, bold and light versions of the same font are included, a "natural" but fairly regular source of variation across fonts (Fig A.1).

We also performed correlations across fonts to analyze how fonts differ from one another. Fig A.2 shows the dendrogram that results from hierarchical clustering of the correlation matrix using Ward’s method. GothicA1-Bold, GothicA1-SemiBold, GothicA1-Regular, GothicA1-Light, GothicA1-Thin, GothicA1-Black, GothicA1-Medium, and GothicA1-ExtraBold all fall under the same GothicA1 category.

### A.2 Summary statistics of dataset

Various statistics were calculated on fontsize 24 images within a single font and across all fonts. The mean, median, and standard deviation of the images are shown in Fig A.3. This was done for all blocks in all fonts, all blocks in a single font, and a single block in all fonts. All three statistics for all blocks in all fonts and for all blocks in one font preserve the block structure of the images, whereas ‘가’ is clearly visible across the statistics for a single block in all fonts. Fig A.4A shows a histogram of the pixels of all the images within a font for all 35 fonts. The histograms are very similar across the four fonts highlighted in the legend. Fig A.4B shows a histogram of pixels within a font across all characters for all 35 fonts. The Frobenius norm is taken for all characters to study how characters differ within a font. NanumMyeongjo and NanumBrush are similar fonts as they have overlapping character norms. GothicA1-Regular, which resembles computer-type fonts, has the narrowest distribution as its characters do not differ greatly.

### A.3 Dimensionality Reduction: PCA, ICA, NMF

Linear models were trained on individual fonts and the learned dictionaries are shown in Fig A.5.

### A.4 UMAP visualization

Fig A.6A-C shows the result of applying UMAP to the images of a single font, GothicA1-Regular, with initial, medial, and final labels. Individual glyphs are plotted with a red kernel density estimate in the background. Fig A.6B shows the best clustering with regard to the geometry of the glyphs. The more vertically oriented glyphs (ㅏ,ㅑ,ㅕ,ㅓ) cluster together on the right side, while the more horizontally oriented glyphs (ㅛ,ㅡ,ㅜ,ㅠ) cluster on the left. In Fig A.6C several of the duplet glyphs embed in the same location, suggesting similarity in the duplet structure.

Fig A.7A-C shows the result of applying UMAP to the images with initial, medial, and final geometry labels. Points are plotted rather than glyphs. Fig A.7B shows the best separation: the 5 medial geometric types are each distinct and hence affect the overall structure of the block character. For example, right-single and right-double medial geometric types are always placed on the left region of the blocks. In contrast, the initial and final geometry types, which include none, single, or double, do not drastically influence the greater structure of the block. Single and double geometric types have very similar embeddings in both the initial and final geometry plots.

## Appendix B Clustering accuracy

For a dataset of sample size $N$ and a representation $X$, we want to compare a clustering of $X$ with a categorical generative variable $Y$ that takes $K$ values. First, $X$ is clustered into $K$ clusters, either with k-means (with $k=K$) or by cutting the dendrogram to give $K$ clusters. Given this clustering, each sample $x_i$ is assigned a cluster label $c_i$. We then have to find the best alignment between the cluster labels and the generative variable labels. To do this we form a similarity matrix $S \in \mathbb{R}^{K \times K}$. To calculate $S_{jk}$, we first form the set $A_j$, which contains the samples with cluster label $j$, and the set $B_k$, which contains the samples with generative label $k$. The similarity is then the cardinality of the intersection of the sets divided by the cardinality of their union, $S_{jk} = |A_j \cap B_k| / |A_j \cup B_k|$; that is, given all samples labelled with either label, what fraction of them are labelled as both. We then use the Hungarian method (Kuhn [1955]) to optimally pair the generative labels with a permutation of the cluster labels using this similarity matrix. If the cluster labelling is an exact permutation of the generative labels, the clustering accuracy will be 100%, and it will be at chance for a random relabelling.
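
The alignment procedure can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the optimal pairing is found by exhaustive search over permutations, a simple stand-in for the Hungarian method that is only feasible for a handful of classes, and the final accuracy is assumed to be the fraction of samples whose matched cluster label equals the generative label. All function names are hypothetical.

```python
from itertools import permutations


def jaccard_similarity_matrix(cluster_labels, true_labels, n_classes):
    """sim[i][j] = |A_i & B_j| / |A_i | B_j| for cluster i and class j."""
    cluster_sets = [set() for _ in range(n_classes)]
    class_sets = [set() for _ in range(n_classes)]
    for idx, (c, y) in enumerate(zip(cluster_labels, true_labels)):
        cluster_sets[c].add(idx)
        class_sets[y].add(idx)
    sim = []
    for a in cluster_sets:
        row = []
        for b in class_sets:
            union = a | b
            row.append(len(a & b) / len(union) if union else 0.0)
        sim.append(row)
    return sim


def clustering_accuracy(cluster_labels, true_labels, n_classes):
    """Accuracy after optimally pairing cluster labels with class labels."""
    sim = jaccard_similarity_matrix(cluster_labels, true_labels, n_classes)
    # Brute-force replacement for the Hungarian method: try every pairing
    # and keep the one with the largest total similarity.
    best = max(
        permutations(range(n_classes)),
        key=lambda p: sum(sim[i][p[i]] for i in range(n_classes)),
    )
    # Assumed definition: fraction of samples whose matched label is correct.
    matches = sum(best[c] == y for c, y in zip(cluster_labels, true_labels))
    return matches / len(true_labels)


true_labels = [0, 0, 1, 1, 2, 2]
# An exact permutation of the generative labels.
acc_perfect = clustering_accuracy([2, 2, 0, 0, 1, 1], true_labels, 3)
# The same clustering with one sample moved to the wrong cluster.
acc_partial = clustering_accuracy([2, 2, 0, 0, 1, 0], true_labels, 3)
```

For the 19, 21, or 28 classes of the Hangul tasks, exhaustive search is intractable, so a polynomial-time assignment solver such as the Hungarian method is needed in practice.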

## Appendix C Hyperparameters

All networks were trained with 3 hidden layers of the same dimensionality and ReLU nonlinearities. Table 1 lists the hyperparameters that were randomly sampled and their ranges. and indicate the input and output dimensionality of the data and task.

Name | Type | Range/Options |
---|---|---|
Init. momentum | float | 0.5 |
Learning rate reduction on plateau | float | 0.5 |
Epochs of patience for early stopping | int | 10 |
Dense layer size | int | : |
Learning rate | float | -6 : 1 |
(1-momentum) | float | -2 : -0.00436 (momentum=0.01) |
Weight decay | float | -6 : 1 |
Batch size | int | 32 : 512 |
Input dropout rate | float | 0.1 : 0.99 |
Input dropout rescale | float | 0.1 : 10 |
Hidden dropout rate | float | 0.1 : 0.99 |
Hidden dropout rescale | float | 0.1 : 10 |
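
A random search over the table above might be sketched as follows. Several choices here are assumptions that the text does not state: values are drawn uniformly within each range, the negative ranges for the learning rate, (1-momentum), and weight decay are read as base-10 exponents, and single-valued rows are treated as fixed. The dense layer size is omitted because its range is not recoverable from the table. The function and key names are hypothetical.

```python
import random


def sample_hyperparameters(rng):
    """Draw one configuration from the tabulated ranges.

    Assumptions: uniform sampling within each range, and negative
    ranges interpreted as log10 exponents.
    """
    return {
        # Rows with a single value, treated here as fixed.
        "init_momentum": 0.5,
        "lr_reduction_on_plateau": 0.5,
        "early_stopping_patience": 10,
        # Assumed log10-uniform ranges.
        "learning_rate": 10 ** rng.uniform(-6, 1),
        "momentum": 1 - 10 ** rng.uniform(-2, -0.00436),
        "weight_decay": 10 ** rng.uniform(-6, 1),
        # Assumed uniform ranges.
        "batch_size": rng.randint(32, 512),
        "input_dropout_rate": rng.uniform(0.1, 0.99),
        "input_dropout_rescale": rng.uniform(0.1, 10),
        "hidden_dropout_rate": rng.uniform(0.1, 0.99),
        "hidden_dropout_rescale": rng.uniform(0.1, 10),
    }


# Seeded generator so the draw is reproducible.
config = sample_hyperparameters(random.Random(0))
```

Under the log10 reading, the upper bound of -0.00436 for (1-momentum) corresponds to a momentum of about 0.01, which matches the note in the table.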

## Appendix D Classification accuracy

The deep networks were trained on 3 tasks: initial, medial, and final (IMF) classification. Here we report the test-set accuracy for logistic regression as well as the best deep networks (selected by the validation accuracy). Logistic regression had accuracies of , , and respectively for the IMF tasks. Deep networks had accuracies of , , and respectively for the IMF tasks. Chance accuracy is , , and respectively for the IMF tasks.
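
As a point of reference for the three tasks, the sketch below computes the accuracy of uniform random guessing. It assumes that every class is equally frequent, which holds when all 19 x 21 x 28 blocks of a font are present, and that "chance" means guessing uniformly among classes; the paper's own definition of chance may differ.

```python
# Class counts for the initial, medial, and final tasks (from the abstract).
n_classes = {"initial": 19, "medial": 21, "final": 28}

# Every combination of initial, medial, and final forms one block.
blocks_per_font = n_classes["initial"] * n_classes["medial"] * n_classes["final"]

# Assumed definition: uniform guessing over balanced classes.
chance = {task: 1 / k for task, k in n_classes.items()}

for task, acc in chance.items():
    print(f"{task}: {100 * acc:.1f}%")
```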