Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations
Sparse language vectors from linguistic typology databases and learned embeddings from tasks like multilingual machine translation have been investigated in isolation, without analysing how they could benefit from each other’s language characterisation. We propose to fuse both views using singular vector canonical correlation analysis and study what kind of information is induced from each source. By inferring typological features and language phylogenies, we observe that our representations embed typology and strengthen correlations with language relationships. We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy in tasks that require information about language similarities, such as language clustering and ranking candidates for multilingual transfer. With our method, which is also released as a tool, we can easily project and assess new languages without expensive retraining of massive multilingual or ranking models, which are major disadvantages of related approaches.
Recent surveys consider linguistic typology as a potential source of knowledge to support multilingual natural language processing (NLP) tasks O’Horan et al. (2016); Ponti et al. (2019). Linguistic typology studies language variation in terms of functional processes Comrie (1989). Several typological knowledge bases (KB) have been crafted, from which we can extract categorical language features Littell et al. (2017). Nevertheless, their sparsity and reduced coverage present a challenge for an end-to-end integration into NLP algorithms. For example, the World Atlas of Language Structures (WALS; Dryer and Haspelmath, 2013) encodes 143 features for 2,679 languages, but the mean feature coverage per language is barely around 14%.
Dense and data-driven language representations have emerged in response. They are computed from multilingual settings of language modelling Östling and Tiedemann (2017) and neural machine translation (NMT) Malaviya et al. (2017). However, the language diversity in the corpus-based representations is limited. The language coverage could be broadened with other knowledge, such as that encoded in WALS, to distinguish even more language properties. Therefore, to obtain the best of both views (KB and task-learned) with minimal information loss, we project a shared space of discrete and continuous features using a variant of canonical correlation analysis Raghu et al. (2017).
For our study, we fuse language-level embeddings from multilingual machine translation with syntactic features of WALS. We inspect how much typological knowledge is present by predicting features for new languages. Then, we infer language phylogenies and inspect whether specific relationships are induced from the task-learned vectors.
Furthermore, to demonstrate that our approach has practical benefits in NLP, we apply our language vectors in multilingual NMT with language clustering Tan et al. (2019) and adapt the ranking of related languages for multilingual transfer Lin et al. (2019). As a side outcome, we identify that there is an ideal setting to encode language relationships in language embeddings from NMT. Finally, we are releasing a simple tool to allow everyone to fuse their own representations for clustering, ranking and more.
2 Multi-view language representations
Our primary goal is to fuse parallel representations of the same language in one shared space, and canonical correlation analysis (CCA) allows us to find a projection of two views for a given set of data. With CCA, we look for linear combinations that maximise the correlation of the two sources in each coordinate iteratively Hardoon et al. (2004). After training, we can apply the transformation learned on a new sample from any view to obtain a CCA-based language representation.
CCA considers all dimensions of the two views as equally important. However, our sources are potentially redundant: KB features are mostly one-hot-encoded, whereas task-learned ones inherit the high dimensionality of the embedding layer. Moreover, few samples and sparsity could make the convergence harder. For the redundancy issue, singular value decomposition (SVD) is an appealing alternative. With SVD, we factorise the source data matrix to compute the principal components and singular values. Furthermore, to deal with sparsity, we adopt a truncated SVD approximation, which is also known as latent semantic analysis in the context of linear dimensionality reduction for term-count matrices Dumais (2004).
The two-step transformation of SVD followed by CCA is called singular vector canonical correlation analysis (SVCCA; Raghu et al., 2017) in the context of understanding the representation learning throughout neural network layers. That being said, we use SVCCA to get language representations and not to inspect a neural architecture.
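The two-step transformation can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions, not the released tool: the function names (`svd_reduce`, `cca`, `svcca_fit`, `svcca_project`) are hypothetical, the CCA step is solved in closed form by whitening each view, and the default thresholds simply reuse the explained-variance values reported in Appendix C.

```python
import numpy as np


def svd_reduce(X, var_ratio):
    """Centre X and keep the leading SVD directions whose accumulated
    explained variance reaches `var_ratio`."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(explained, var_ratio)) + 1, len(s))
    return Xc @ Vt[:k].T, mean, Vt[:k]


def inv_sqrt(S):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T


def cca(A, B, eps=1e-8):
    """Linear CCA on two centred views A (n x p) and B (n x q).
    Returns one projection matrix per view and the canonical correlations."""
    n = A.shape[0]
    Saa = A.T @ A / (n - 1) + eps * np.eye(A.shape[1])
    Sbb = B.T @ B / (n - 1) + eps * np.eye(B.shape[1])
    Sab = A.T @ B / (n - 1)
    Ka, Kb = inv_sqrt(Saa), inv_sqrt(Sbb)
    U, rho, Vt = np.linalg.svd(Ka @ Sab @ Kb, full_matrices=False)
    return Ka @ U, Kb @ Vt.T, rho


def svcca_fit(X_kb, X_nmt, var_kb=0.65, var_nmt=0.60):
    """SVD on each view, then CCA on the reduced views.
    Rows of X_kb and X_nmt must describe the same languages."""
    A, mean_a, Va = svd_reduce(X_kb, var_kb)
    B, mean_b, Vb = svd_reduce(X_nmt, var_nmt)
    Wa, Wb, rho = cca(A, B)
    return {"kb": (mean_a, Va, Wa), "nmt": (mean_b, Vb, Wb), "rho": rho}


def svcca_project(model, x, view):
    """Project a vector (or matrix of rows) from a single view
    into the shared space."""
    mean, V, W = model[view]
    return (x - mean) @ V.T @ W
```

A language that only has a KB entry could then be placed in the shared space with `svcca_project(model, x_kb, "kb")`, which is the property we rely on for "unknown" languages.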
3 Methodology and research questions
To embed linguistic typology knowledge in dense representations for a broad set of languages, we employ SVCCA (§2) with the following sources:
(NMT) Learned view.
Firstly, we exploit the NMT-learned embeddings from the Bible (512 dim.) Malaviya et al. (2017). Up to 731 entries are available that intersect with the KB entries. They were trained in a many-to-English NMT model with a pseudo-token identifying the source language at the beginning of every input sentence.
Secondly, we take the many-to-English language embeddings learned for the language clustering task on multilingual NMT (256 dim.) Tan et al. (2019), where they use 23 languages of the WIT corpus Cettolo et al. (28-30).
One main difference for the latter is the use of factors in the architecture, meaning that the embedding of every input token was concatenated with the embedded pseudo-token that identifies the source language. The second difference is the neural architecture used to extract the embeddings: the former use a recurrent neural network, whereas the latter a small transformer model Vaswani et al. (2017).
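The difference between the two input configurations can be illustrated with a toy formatter. This is only a sketch: the tag format (`<spa>`) and the `|` separator are our own illustrative assumptions, and the concrete syntax depends on the toolkit.

```python
def with_pseudo_token(tokens, lang):
    """Initial pseudo-token setting: a single language tag is placed
    before the first token of the sentence."""
    return [f"<{lang}>"] + list(tokens)


def with_factors(tokens, lang, sep="|"):
    """Factored setting: the language tag accompanies every input token,
    so its embedding can be concatenated with each token embedding."""
    return [f"{tok}{sep}{lang}" for tok in tokens]
```

For the Spanish input `["la", "casa"]`, the first function yields one extra token at the start, whereas the second one keeps the sentence length and attaches the language identity to each position.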
What knowledge do we represent?
Each source embeds specialised knowledge to assess language relatedness. The KB vectors can measure typological similarity, whereas task-learned embeddings correlate with other kinds of language relationships (e.g. genetic) Bjerva et al. (2019). To analyse whether each kind of knowledge is induced with SVCCA, we assess the tasks of typological feature prediction (§4) and reconstruction of a language phylogeny (§5).
What is the benefit for multilingual NMT (and NLP)?
Language-level representations can evaluate the distance between languages in a vector space. We then can assess their applicability on multilingual NMT tasks that require guidance from language relationships. Therefore, language clustering and ranking related partner languages for (multilingual) transfer are our study cases (§6).
4 Prediction of typological features
An example of a typological feature is a word order specification, like whether the adjective is predominantly placed before or after the noun (features #24 and #25 of the KB syntax vectors). Our task consists in predicting syntactic features while leaving one language and one language family out to control for phylogenetic relationships Bjerva et al. (2019). Previous work has shown that task-learned embeddings are potential candidates to predict features of a linguistic typology KB Malaviya et al. (2017), and our goal is to evaluate whether SVCCA can enhance the NMT-learned language embeddings with typological knowledge from their KB parallel view.
We use a Logistic Regression classifier per feature, which is trained with the NMT-learned or SVCCA representations in both one-language-out and one-language-family-out settings. For prediction, we use the original embedding or its SVCCA projection as inputs.
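The two evaluation settings can be expressed as index splits. The sketch below is one possible reading, assuming that every language family is held out in turn as a whole test set; the function names and the toy family labels are hypothetical, and the per-feature classifier would be trained on the `train` indices of each split.

```python
def leave_one_language_out(n_languages):
    """Yield (train, test) index lists, holding out one language at a time."""
    for i in range(n_languages):
        train = [j for j in range(n_languages) if j != i]
        yield train, [i]


def leave_one_family_out(families):
    """Yield (family, train, test), holding out every language of one
    family so that no relative of a test language is seen in training."""
    for fam in sorted(set(families)):
        test = [i for i, f in enumerate(families) if f == fam]
        train = [i for i, f in enumerate(families) if f != fam]
        yield fam, train, test
```

Holding out the whole family is the stricter setting, since the classifier cannot rely on close relatives of the test language.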
In Table 1, we observe that the SVCCA representations outperformed their NMT-learned counterparts for the WIT-23 and TED-53 embeddings, where the performance is significantly better for the one-language-out setting. In the case of the Bible embeddings (with 731 entries), we notice that the overall performance drops, and the SVCCA transformation cannot improve it. We argue that a potential reason for the accuracy drop is the method used to extract the NMT-learned embeddings (initial pseudo-token instead of factors: §7), which could diminish the information embedded about each language and, consequently, impact the SVCCA projection.
5 Language phylogeny analysis
According to Bjerva et al. (2019), there is a positive correlation between the language distances in a phylogenetic tree and a pairwise distance-matrix of task-learned representations. Our goal therefore is to investigate whether fusing linguistic typology with SVCCA can preserve or enhance the embedded relationship information. For that reason, we examine how well a language phylogeny can be reconstructed from language representations (§5.1), and also study the correlation (Appendix B).
5.1 Inference of a phylogenetic tree
Based on previous work Rabinovich et al. (2017), we take a tree of 17 Indo-European languages Serva and Petroni (2008) as a Gold Standard (GS), which is shown in Figure 1a.
We also consider a concatenation of the KB and NMT-learned views as a baseline.
It is essential to highlight that none of the NMT-learned and concatenated vectors have all the 17 language entries of the GS. Therefore, we can already see one of the significant advantages of the SVCCA vectors, as we are able to represent “unknown” languages using one of the views. The NMT-learned views lack English, since they were extracted from the source side of a many-to-English system, but we were able to project the KB English vectors into the shared space.
We differ from previous studies and use a tree edit distance metric, which is defined as the minimum cost of transforming one tree into another by inserting, deleting or modifying (the label of) a node. Specifically, we used the All Path Tree Edit Distance algorithm (APTED; Pawlik and Augsten, 2015, 2016), a novel one for the task. We chose an edit-distance method as it is more transparent for assessing what is the degree of impact for a single change of linkage in the GS.
As we need to compare inferred pruned trees with different numbers of nodes, we propose a normalised version given by $\mathrm{nTED}(\tau, \mathrm{GS}) = \frac{\mathrm{TED}(\tau, \mathrm{GS})}{|\tau| + |\mathrm{GS}|}$, where $\tau$ is the inferred tree, and $|\cdot|$ indicates the number of nodes. The denominator then is the maximum cost possible of deleting all nodes of $\tau$ and inserting each GS node.
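As a worked example of the normalisation, the raw edit distance is divided by the sum of the node counts of the inferred tree and the GS. In the sketch below, the function name is hypothetical, and the split of the maximum cost of 66 into 33 + 33 nodes is our assumption for complete trees; only the total is reported in the text.

```python
def normalised_ted(ted, n_inferred_nodes, n_gold_nodes):
    """Tree edit distance divided by the cost of deleting every node of
    the inferred tree and inserting every node of the Gold Standard."""
    return ted / (n_inferred_nodes + n_gold_nodes)


# Ten edits against a maximum cost of 66 (assumed 33 + 33 nodes).
best_score = normalised_ted(10, 33, 33)
# Thirty edits against the same maximum cost.
kb_score = normalised_ted(30, 33, 33)
```

Rounded to two decimals, these two values correspond to the 0.15 and 0.45 scores reported in Table 2.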
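|View||Single view||Concatenation||SVCCA|
|(Syntax)||30 – 0.45|
|(Bible)||35 – 0.54||27 – 0.42||23 – 0.34|
|(WIT-23)||35 – 0.62||23 – 0.41||27 – 0.48|
|(TED-53)||15 – 0.26||18 – 0.29||10 – 0.15|
Each cell reports the number of edits and the normalised score.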
Table 2 shows the results for all settings, where the single-view scores are meagre in most of the cases. For instance, the tree inferred from the KB (Syntax) view (Fig. 1c) requires 30 edits to match the GS. The exception is the NMT-learned (TED-53) view (Fig. 1d), which requires half the edits, although it is incomplete.
We observe that the best absolute and normalised scores are obtained by fusing the (Syntax) and (TED-53) views with SVCCA (Fig. 1b). English is projected in the Germanic branch, although Latvian is separated from the Balto-Slavic group. The latter case is similar for Bulgarian, which is misplaced in the original tree as well. Nevertheless, we only require ten edits to match the GS (where 66 is the maximum cost possible), confirming that our approach is a robust alternative for completing language entries and inferring a language phylogeny.
In conclusion, we observe that using typological knowledge with SVCCA enhances the language relationship encoded in the NMT-learned embeddings. In Appendix B, we further discuss what kind of relationship we are representing in the NMT-learned embeddings and SVCCA, and study their correlation.
6 Application in multilingual NMT
With multilingual NMT, we can translate several language-pairs using a single model. Low-resource languages usually benefit through multilingual transfer, which resembles a simultaneous training of the parent(s) and child models. Therefore, we want to take advantage of a language-level vector space for relating similar languages and enhancing multilingual transfer within multilingual NMT. For that reason, we first address the language clustering task proposed by Tan et al. (2019), and afterwards, the language ranking model of Lin et al. (2019).
The main idea is to obtain smaller multilingual NMT models as an intermediate point between maintaining many pairwise systems and a single massive multilingual model. With limited resources, it is challenging to support the first scenario, whereas the advantages for the massive setting are also very appealing (e.g. simplified training process, translation improvement for low-resource languages or zero-shot translation Johnson et al. (2017)). Therefore, to address the task, Tan et al. (2019) trained a factored multilingual NMT model of 23 languages from Cettolo et al. (28-30), where the language embedding is concatenated in every input token. Then, they performed hierarchical clustering with the representations, and selected a number of clusters guided by the Elbow method. Finally, they compared the systems against individual, massive and language family-based cluster models.
In a practical multilingual NMT system, choosing the right clustering is not enough: the ability to easily add new languages is also important. With this in mind, we apply our multi-view representations to compute a set of clusters, and we also address the question: do we need to train the massive model again if we want to add one or more new languages to our setting?
The original goal of LangRank is to choose a parent language to perform transfer learning in different tasks, NMT included. To achieve this, Lin et al. (2019) trained a model based on the performance of several hundred pairwise MT systems using the dataset of Qi et al. (2018). For the input features, they considered linguistically-informed vectors from Littell et al. (2017) and corpus-based statistics, such as word/sub-word overlapping and the ratio of the token-types or the data size between the target child and potential candidates, where the latter features were some of the most relevant.
Considering the transfer capabilities within multilingual NMT and the possibility to obtain a ranked list of candidates from LangRank, we propose an adapted task of choosing $k$ related languages for multilingual transfer. We then use our multi-view representations to rank related languages from the vector space, as they embed information about typological and lexical relationships. This is similar to the features that Lin et al. (2019) consider, but without training a ranking model fed with scores from pairwise MT systems.
6.1 Experimental setup
We focus on the many-to-one (English) multilingual NMT setting to simplify the evaluation in both tasks. However, similar experiments could be performed in a one-to-many direction.
We use the dataset processed and tokenised by Qi et al. (2018) of 53 languages (TED-53), from where we learned our embeddings. We opted for TED-53 to better evaluate the extensibility of clusters and because it is also used to train the LangRank model. The list of languages, set sizes and other details are included in Appendix A. Before preprocessing the text, we drop any sentences from the training sets which overlap with any of the test sets. Since we are building many-to-English multilingual systems, this is important, as any such overlap will bias the results.
Model and training.
Similar to Tan et al. (2019), we train small transformer models Vaswani et al. (2017). We jointly learn 90k shared sub-words with the byte pair encoding Sennrich et al. (2016) algorithm built in SentencePiece Kudo and Richardson (2018). We also oversample all the training data of the less-resourced languages in each cluster, and shuffle them proportionally in all batches.
We use Nematus Sennrich et al. (2017) only to extract the factored language embeddings from the TED-53 corpus. Given the large number of experiments, we choose the efficient Marian NMT Junczys-Dowmunt et al. (2018) toolkit for training the rest of the systems. With Marian NMT, we only use the basic pseudo-token setting for identifying the source language, as we did not need to retrieve new language embeddings after training. We also allow the Marian NMT framework to automatically determine the mini-batch size given the sentence length and available memory (mini-batch-fit parameter).
We train our models with up to four NVIDIA P100 GPUs using the Adam optimiser Kingma and Ba (2014) with default parameters and early stopping at 5 validation steps for the cross-entropy metric. Finally, the sacreBLEU version string Post (2018) is as follows: BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.3.7.
We first list the baselines and our approaches, with the number of clusters/models between brackets:
Individual : Pairwise model per language.
Massive : A single model for all languages.
Language families : Based on historical linguistics. We divide the 33 Indo-European languages into 7 branches. Moreover, 11 groups only have one language.
KB : The (Syntax) view tends to agglomerate large clusters (with 4-13-33 languages), behaving similarly to a massive model (Fig. 1(c)).
Concatenation : Of the KB (Syntax) and NMT-learned (TED-53) vectors.
SVCCA-53 : Multi-view representation with SVCCA composing both the KB (Syntax) and NMT-learned (TED-53) vectors (Fig. 1(a)).
With the last setting, we are interrogating whether SVCCA is a useful method for rapidly increasing the number of languages without retraining massive models given new entries that require their NMT-learned embeddings for clustering.
Similar to Tan et al. (2019), we use hierarchical agglomeration with average linkage and cosine similarity. However, we use a different criterion for selecting the optimal number of clusters.
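Hierarchical agglomeration with average linkage over cosine distances can be sketched in plain Python as follows. This is a naive illustration written for readability, with hypothetical function names; it is not the implementation used in our experiments, and a library routine would be preferable for larger inputs.

```python
import math


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def average_linkage(vectors, n_clusters):
    """Merge the two clusters with the smallest mean pairwise distance
    until `n_clusters` remain. Returns lists of row indices."""
    dist = [[cosine_distance(u, v) for v in vectors] for u in vectors]
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                mean = sum(pairs) / len(pairs)
                if best is None or mean < best[0]:
                    best = (mean, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Each language vector starts as its own cluster, so stopping the loop at different values of `n_clusters` gives the candidate partitions that the selection criterion has to compare.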
Selection of number of clusters.
The Elbow criterion has been suggested for this purpose Tan et al. (2019); however, as we can see in Figure 2, it might be ambiguous. Thus, we propose using a heuristic called Silhouette Rousseeuw (1987), which returns a score in the [-1,1] range. A sample with a silhouette close to 1 indicates that its cluster is cohesive and well-separated. We compute the average silhouette of all samples while varying the number of clusters, and look for the peak value when the number of clusters is above two.
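The silhouette criterion can be sketched as follows for a precomputed distance matrix. The function names are hypothetical, and reading "above two" as a minimum of three clusters (`min_k=3`) is our assumption.

```python
def mean_silhouette(dist, labels):
    """Average silhouette over all samples, given a square distance matrix
    and one cluster label per sample (at least two clusters are required).
    Samples in singleton clusters contribute a silhouette of 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def pick_number_of_clusters(scores, min_k=3):
    """Return the number of clusters with the peak average silhouette,
    considering only candidates with at least `min_k` clusters."""
    candidates = {k: s for k, s in scores.items() if k >= min_k}
    return max(candidates, key=candidates.get)
```

In practice, `scores` would map each candidate number of clusters to the `mean_silhouette` of the corresponding partition of the hierarchy.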
We focus on five low-resource languages from TED-53: Bosnian (bos, Indo-European/Balto-Slavic), Galician (glg, Indo-European/Italic), Malay (zlm, Austronesian), Estonian (est, Uralic) and Georgian (kat, Kartvelian). They have between 5k and 13k translated sentences with English, and we chose them as they achieved the most significant improvement from the individual to the massive setting. We then identified the top-3 related languages using LangRank, which gives us a multilingual training set of around 500 thousand sentences for each case. Given that LangRank usually prefers to choose candidates with larger data size Lin et al. (2019), for a fair comparison, we use SVCCA and cosine similarity to choose the closest languages that can agglomerate a similar amount of parallel sentences.
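The selection of partner languages from the vector space can be sketched as a greedy procedure: rank the candidates by cosine similarity to the child language and add them until a data budget is reached. The function names, the toy vectors and the toy sizes below are illustrative assumptions and do not correspond to our actual representations or corpus statistics.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_partners(child, vectors, sizes, budget):
    """Add the closest languages to `child` until their accumulated
    training size reaches `budget`. Returns (languages, total size)."""
    ranked = sorted(
        (lang for lang in vectors if lang != child),
        key=lambda lang: cosine_similarity(vectors[child], vectors[lang]),
        reverse=True,
    )
    chosen, total = [], 0
    for lang in ranked:
        if total >= budget:
            break
        chosen.append(lang)
        total += sizes[lang]
    return chosen, total


# Toy two-dimensional vectors and sizes in thousands of sentences.
toy_vectors = {"glg": [1.0, 0.1], "por": [1.0, 0.2], "spa": [0.9, 0.3], "kor": [0.0, 1.0]}
toy_sizes = {"por": 180, "spa": 190, "kor": 200}
partners, total_size = rank_partners("glg", toy_vectors, toy_sizes, budget=300)
```

With a fixed budget, the number of selected languages varies with the size of each candidate, which is why the vector-based ranking may need more partners than LangRank to reach a comparable amount of data.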
6.2 Language clustering results
We first briefly discuss the composition of clusters obtained by SVCCA. Then, we analyse the results grouped by training size bins. We complement the analysis by family groups in Appendix D.
In Figure 2, we observe that SVCCA-53 (Fig. 1(a)) has adopted ten clusters with a proportionally distributed number of languages (the smallest one is Greek-Arabic-Hebrew, and the largest one has seven entries). Moreover, the languages are usually grouped by phylogenetic or geographical criteria. These agglomeration trends are adopted from both the KB (Fig. 1(c)) and NMT-learned (Fig. 1(d)) sources.
From a more detailed inspection, there are entries that do not correspond to their respective family branches, although the single-view sources might induce the bias. For instance, the phylogenetic tree (Fig. 1d) “misplaced” Bulgarian within Italic languages. Nevertheless, the unexpected agglomerations rely on the features encoded in the KB or the NMT learning process, and we expect they can uncover surprising clusters to avoid isolating languages without close relatives (e.g. Basque, or even Japanese as the only Japonic member in the set). Another benefit is noticeable in the SVCCA-23 clusters (Fig. 1(b)), which have resemblances with the SVCCA-53 agglomeration despite using only 23 languages to compute the shared space.
Training size bins:
We manually define the upper bounds of the bins as [10,75,175,215] thousands of training sentences, which results in groups composed of [14,14,13,12] languages. Figure 3 shows the box plots of BLEU, from which we can analyse each distribution (mean, variance).
Throughout all the bins, we observe that both SVCCA-53 and SVCCA-23 achieve an accuracy comparable with the best setting in each group. In other words, their clusters provide stable performance for both low- and high-resource languages.
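The assignment of a language to a training size bin can be sketched with a binary search over the upper bounds. Treating each upper bound as inclusive is our assumption, and the function name is hypothetical.

```python
from bisect import bisect_left

# Upper bounds of the bins, in thousands of training sentences.
UPPER_BOUNDS_K = [10, 75, 175, 215]


def size_bin(size_k):
    """Return the index of the first bin whose upper bound is not
    exceeded by `size_k` (bounds are assumed to be inclusive)."""
    idx = bisect_left(UPPER_BOUNDS_K, size_k)
    if idx == len(UPPER_BOUNDS_K):
        raise ValueError("training size exceeds the last upper bound")
    return idx
```

For instance, a language with 13 thousand sentences falls in the second bin (index 1), whereas one with 200 thousand falls in the last one (index 3).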
In the first bin of the smallest corpora, the Massive baseline and the large clusters of the KB (Syntax) view barely surpass the SVCCA schemes. Nevertheless, SVCCA contributes a notable advantage if we want to train a multilingual NMT model for a specific low-resource language, and we do not have the resources for training a massive system. We further analyse this scenario in §6.3.
In the rightmost bin, for the highest-resource languages, the Massive baseline and the KB (Syntax) clusters performed worse than SVCCA. Furthermore, we show a competitive accuracy for the Individual and Family approaches. The latter’s clusters have steady performance across most of the bins as well. Nevertheless, they double the number of clusters that we have in both SVCCA settings, with more than half of the “clusters” having only one language.
Other approaches, like using the NMT-learned embeddings as Tan et al. (2019) or the concatenation baseline, obtain similar translation results in the last three bins. However, those methods first require the NMT-learned embeddings (from a 53-language massive model). With SVCCA, a pre-trained smaller set of language embeddings is enough for projecting new representations, as we show with our SVCCA-23 approach.
6.3 Language ranking results
After discussing overall translation accuracy for all the languages, we now focus on five specific low-resource cases and how multilingual transfer enhances their performance. Table 3 shows the BLEU scores of the translation into English for the smaller multilingual models that group each child language with their candidates ranked by LangRank and our SVCCA-53 representations.
We also include the results of the individual and massive MT systems. Even though the latter baseline provides a significant improvement over the former, we observe that many of the smaller multilingual models outperform the translation accuracy of the massive system. The result suggests that the amount of data is not the most important factor for supporting multilingual transfer in a low-resource language, which is aligned with the literature Wang and Neubig (2019).
Comparing the two ranking approaches, we observe that SVCCA approximates the performance of LangRank in most of the cases. We note that LangRank prefers related languages with large datasets, as it only requires three candidates to group around half a million training samples, whereas SVCCA suggests including from three to ten languages to reach a similar amount of parallel sentences. However, increasing the number of languages could impact the multilingual transfer negatively (see the case of Georgian or kat), as it is analogous to adding different “out-of-domain” samples. To alleviate this, we could bypass candidate languages that do not possess a specific amount of training samples.
We argue that our representations still provide a robust alternative to determine which languages are suitable for multilingual transfer learning.
The notable advantage is that we do not need to pre-train MT systems from a specific dataset, and we can easily extend the coverage of languages without re-training the ranking model to consider new language entries.
|Lang.||Individual||Massive||LangRank (training size, k)||SVCCA-53|
|bos||4.2||26.6||28.8 (434)||28.2|
|glg||8.4||24.9||27.7 (443)||28.4|
|zlm||4.1||20.1||21.2 (463)||21.0|
|est||5.8||13.5||13.5 (533)||12.1|
|kat||5.8||14.3||13.3 (499)||10.5|
7 Factors over initial pseudo-tokens
We additionally argue that the configuration used to compute the language embeddings impacts what relationship they can learn. For the analysis, we extract an alternative set of 53 language embeddings using the initial pseudo-token setting instead of factors. Then, we perform a silhouette analysis to identify whether we can build cohesive and well-separated clusters of languages.
Figure 4 shows the silhouette analysis for the aforementioned embeddings together with the Bible embeddings, which were trained with the same configuration. We observe that the silhouette score never exceeds 0.2, and the curve keeps degrading when we examine a higher number of clusters, which contrasts with the trend shown in Figure 2. The pattern indicates that the vectors are not suitable for clustering (the hierarchies are shown in Figure 6 in the Appendix), and they might only encode enough information to perform a classification task in the multilingual NMT training and inference. For that reason, we consider it essential to use language embeddings from factors for extracting language relationships.
8 Related work
For language-level representations, URIEL and lang2vec Littell et al. (2017) allow a straightforward extraction of typological binary features from different KBs. Murawaki (2015, 2017, 2018) exploits them to build latent language representations with independent binary variables. Language features are encoded from data-driven tasks as well, such as NMT Malaviya et al. (2017) or language modelling Tsvetkov et al. (2016); Östling and Tiedemann (2017); Bjerva and Augenstein (2018) with complementary linguistic-related target tasks Bjerva and Augenstein (2018).
Our approach is most similar to Bjerva et al. (2019), as they build a generative model from typological features and use language embeddings, extracted from factored language modelling at character-level, as a prior of the model to extend the language coverage. However, our method primarily differs as it is mainly based on linear algebra, encodes information from both sources from the beginning, and can deal with a small number of shared entries (e.g. 23 from WIT-23) to compute robust representations.
There has been very little work on adopting typology knowledge for NMT. There is not a deep integration of the topics Ponti et al. (2019), but one shallow and prominent case is the ranking method Lin et al. (2019) that we analysed in §6.
Finally, CCA and its variants have been previously used to derive embeddings at word-level Faruqui and Dyer (2014); Dhillon et al. (2015); Osborne et al. (2016). Kudugunta et al. (2019) also used SVCCA but to inspect sentence-level representations, where they uncover relevant insights about language similarity that are aligned with our results in §5. However, as far as we know, this is the first time a CCA-based method has been used to compute language-level representations.
9 Takeaways and practical tool
We summarise our key findings as follows:
SVCCA can fuse linguistic typology KB entries with NMT-learned embeddings without diminishing the originally encoded typological and genetic similarity of languages.
Our method is a robust alternative for identifying clusters and choosing related languages for multilingual transfer in NMT. The advantage is notable when it is not feasible to pre-train a ranking model or learn embeddings from a massive multilingual system. Assessing new languages is an important ability, given that most of them do not have even enough monolingual corpora to learn embeddings from multilingual language modelling Joshi et al. (2020).
Factored language embeddings encode more information to agglomerate related languages than the initial pseudo-token setting.
Furthermore, we make our code available as an open-source tool.
We compute multi-view language representations with SVCCA using two sources: KB and NMT-learned vectors. With a typological feature prediction task and the inference of phylogenetic trees, we showed that the knowledge and language relationship encoded in both sources is preserved in the combined representation. Moreover, our approach offers important advantages because we can evaluate projected languages with entries in only one of the views and can easily extend the language coverage. The benefits are noticeable in multilingual NMT tasks, like language clustering and ranking related languages for multilingual transfer. We plan to study how to deeply incorporate our typologically-enriched embeddings in multilingual NMT, where there are promising avenues in parameter selection Sachan and Neubig (2018) and generation Platanios et al. (2018).
This work was supported by funding from the European Union’s Horizon 2020 research and innovation programme under grant agreements No 825299 (GoURMET) and the EPSRC fellowship grant EP/S001271/1 (MTStretch). Also, it was performed using resources provided by the Cambridge Service for Data Driven Discovery (CSD3) operated by the University of Cambridge Research Computing Service (http://www.csd3.cam.ac.uk/), provided by Dell EMC and Intel using Tier-2 funding from the Engineering and Physical Sciences Research Council (capital grant EP/P020259/1), and DiRAC funding from the Science and Technology Facilities Council (www.dirac.ac.uk). We express our thanks to Kenneth Heafield and Rico Sennrich, who provided us with access to the computing resources.
Last but not least, we thank the organisers and participants of the First Workshop of Typology for Polyglot NLP, and the members of the Statistical Machine Translation group at the University of Edinburgh, who provided relevant feedback in an early stage of the study.
Appendix A Languages and individual BLEU scores
We work with 53 languages pre-processed by Qi et al. (2018), from where we mapped the ISO 639-1 codes to the ISO 639-2 standard. However, we need to manually correct the mapping of some codes to identify the correct language vector in the URIEL Littell et al. (2017) library:
zh (zho, Chinese macro-language) mapped to cmn (Mandarin Chinese).
fa (fas, Persian inclusive code for 11 dialects) mapped to pes (Western/Iranian Persian).
ar (ara, Arabic) mapped to arb (Standard Arabic).
We disregard working with artificial languages like Esperanto (eo) or variants like Brazilian Portuguese (pt-br) and Canadian French (fr-ca).
Table 4 presents the list of all the languages with the following details: ISO 639-2 code, language family, size of the training set in thousands of sentences (with their respective training size bin) and the individual BLEU score obtained per clustering approach and other baselines.
|BLEU score per approach|
|ISO||Language||Lang. family||Size (k)||Bin||Individual||Massive||Family||SVCCA-53||SVCCA-23|
Appendix B Correlation of SVCCA with genetic similarity
Bjerva et al. (2019) argued that raw language embeddings from language modelling correlate with genetic and structural similarity.
We perform a Spearman correlation between the cophenetic matrix
Appendix C SVD explained variance selection
To compute SVCCA, we transform each source space using SVD, where we can choose to preserve a number of dimensions that represents an accumulated explained variance of the original dataset. For that reason, we perform a parameter sweep between 0.5 and 1.0 using 0.05 incremental steps. For a fair comparison, we also transform the single spaces (KB or Learned) with SVD and look for the optimal threshold.
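The parameter sweep can be sketched as follows: for each threshold between 0.5 and 1.0, we keep the smallest number of leading singular directions whose accumulated explained variance reaches it. The function name and the toy singular values are hypothetical; the explained variance of each direction is proportional to its squared singular value when the data is centred.

```python
def components_for_threshold(singular_values, threshold):
    """Smallest number of leading components whose accumulated explained
    variance ratio reaches `threshold` (values sorted in descending order)."""
    energies = [s * s for s in singular_values]
    total = sum(energies)
    accumulated = 0.0
    for k, energy in enumerate(energies, start=1):
        accumulated += energy
        # Small tolerance to absorb floating-point rounding at 1.0.
        if accumulated / total >= threshold - 1e-12:
            return k
    return len(energies)


# Thresholds from 0.5 to 1.0 in 0.05 steps.
thresholds = [round(0.5 + 0.05 * i, 2) for i in range(11)]

# Toy singular values: accumulated ratios are 9/14, 13/14 and 14/14.
toy_singular_values = [3.0, 2.0, 1.0]
sweep = {t: components_for_threshold(toy_singular_values, t) for t in thresholds}
```

Each entry of `sweep` gives the dimensionality that would be preserved for one view at a given threshold, which can then be evaluated on the downstream task.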
Prediction of typological features.
We selected a 0.5 threshold for the NMT-learned vectors of and , and 0.7 for . In the case of the SVCCA representation, uses [0.75,0.70], whereas and employ [0.95,0.50] values. The parameter values apply to both the one-language-out and one-family-out settings. We can argue that there is redundancy in the NMT-learned embeddings, as the prediction of typological features with Logistic Regression always prefers a dimensionality-reduced version over the original data (threshold at 1.0).
Language phylogeny inference.
In Table 5, we report the optimal value for the SVD explained variance ratio in each single and multi-view (concatenation and SVCCA) setting.
Language clustering (and ranking).
We cannot perform an exhaustive analysis for the threshold of the explained variance ratio per view. As our main goal is to increase the coverage of languages steadily, we must determine what configuration allows a stable growth of the hierarchy.
We therefore take inspiration from bootstrap clustering Nerbonne et al. (2008), and increase the number of language entries from a few (e.g. 10) to 53 by bootstrap resampling using each of the source vectors: , and . Afterwards, we search for the threshold value that preserves a stable number of clusters given the peak silhouette value. Our heuristic looks for the least variability throughout the incremental bootstrapping (Fig. 5).
We found that 0.65 is the most stable value for , whereas 0.60 is the best one for both and , so we fix SVCCA-53 and SVCCA-23 to [0.65,0.6]. We also apply the chosen thresholds to the concatenation baseline for a fair comparison. In the single-view cases, the transformations with the tuned variance ratio do not outperform their non-optimised counterparts.
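The stability criterion can be sketched as follows. This is only one possible reading of the heuristic: it assumes that, for each candidate threshold, the number of clusters at the peak silhouette value has already been recorded at every bootstrap step, and it measures variability with the population standard deviation, which is our assumption here. The function name and the toy counts are hypothetical and are not the values behind Fig. 5.

```python
from statistics import pstdev


def most_stable_threshold(cluster_counts):
    """Pick the threshold whose number of clusters varies the least
    across the incremental bootstrap steps.

    cluster_counts: dict mapping a threshold to the list of cluster counts
    observed as the number of language entries grows.
    """
    return min(cluster_counts, key=lambda t: pstdev(cluster_counts[t]))


# Made-up cluster counts per threshold, for illustration only.
toy_counts = {
    0.50: [3, 5, 8, 10],
    0.75: [4, 4, 5, 5],
    1.00: [2, 6, 6, 9],
}

print(most_stable_threshold(toy_counts))
```

Other dispersion measures (for example, the range of the counts) would fit the same interface.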
|(Syntax)||30 / 0.45 (0.5)|
|(Bible)||35 / 0.54 (0.9)||27 / 0.42 (0.70,0.55)||23 / 0.34 (0.70,0.75)|
|(WIT-23)||35 / 0.62 (0.8)||23 / 0.41 (0.75,0.95)||27 / 0.48 (0.50,0.95)|
|(TED-53)||15 / 0.26 (0.6)||18 / 0.29 (0.70,0.55)||10 / 0.15 (1.00,0.55)|
Appendix D Language clustering results by language families
|Lang. families||# L||Size (k)||Individual||Massive||Family||SVCCA-53||SVCCA-23|
|Number of clusters/models||53||1||20||3||11||18||10||10|
Following a guide for evaluating multilingual benchmarks Anastasopoulos (2019), we also group the scores by language families. Table 6 includes the overall average weighted by the number of languages in each family branch. We observe that most of the approaches obtain clusters with similar overall translation accuracy. The individual models are the only ones that significantly underperform. Their poor performance carries over to the Family baseline, as most of the groups contain only one language given the low language diversity of the dataset.
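The weighting by number of languages can be sketched in a few lines. The family names, scores and counts below are placeholders invented for illustration, and `weighted_average` is a hypothetical helper, not part of the released tool.

```python
def weighted_average(scores, counts):
    """Average of per-family scores, weighted by languages per family."""
    total = sum(counts[family] for family in scores)
    return sum(scores[family] * counts[family] for family in scores) / total


# Placeholder values: family -> mean BLEU, family -> number of languages.
toy_scores = {"family_a": 20.0, "family_b": 10.0}
toy_counts = {"family_a": 3, "family_b": 1}

print(weighted_average(toy_scores, toy_counts))  # (20*3 + 10*1) / 4 = 17.5
```

A family with more languages therefore contributes proportionally more to the overall figure than a single-language family.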
The vectors obtain the highest overall accuracy, mostly from their few large clusters (see Fig. 1(c)). Meanwhile, SVCCA-53 achieves the second-best overall result by a minimal margin, with 3 to 7 languages per cluster; such small clusters usually converge faster. The massive model, the embeddings and the concatenation baseline are also competitive. However, the first requires more resources to train until convergence, whereas the last two need the 53 pre-trained embeddings from a previous massive system.
In contrast, SVCCA-23 is a faster alternative if we want to target specific new languages (see Fig. 1(b)). We only require a small group of language embeddings (e.g. 23 entries) and project the rest with SVCCA and a set of KB vectors as a side view. For instance, if we need to deploy a translation model for Basque or Thai, we could reach an accuracy comparable to or better than that of a massive model with the SVCCA-23 chosen clusters of only 3 (Arabic, Hebrew) or 5 (Chinese, Indonesian, Vietnamese, Malay) languages, respectively.
- With language representations, we refer to an annotated or unsupervised characterisation of a language itself (e.g. Spanish or English), and not to word or sentence-level representations, as the term is used in the recent NLP literature.
- As the SVD step performs a dimensionality reduction while preserving as much explained variance as possible, we can consider two additional parameters: a threshold value in the [0.5,1.0] range, in steps of 0.05, for the explained variance ratio of each view. With a value equal to 1, we bypass SVD and compute CCA only. We then tuned all our following experiments (see Appendix C for details).
- We prefer to use factored embeddings over initial pseudo-tokens as we identified that there is a difference for encoding information about language similarity (see §7).
- In other words, it is difficult for SVCCA to deal with the noise present in the learned embeddings. In Figures 5(a) and 5(b) of the Appendix, we observe noisy agglomerations in the dendrograms (obtained by clustering different language representations), which are preserved after fusing with the KB vectors through SVCCA, as we can see in Fig. 5(c).
- We do not generalise the analysis for more languages, as the inferred tree of Serva and Petroni (2008) is only an approximation by lexicostatistic methods (see Appendix B).
- This is illustrative only, as we could obtain an English vector from many-to-many multilingual NMT models or language models. However, the artificial case generalises as a benefit for projecting new languages with SVCCA. For instance, contains 2,989 and 287 unique entries in the KB and NMT-learned views, respectively.
- In further analysis, we confirmed that the inferred tree with only the 12 languages of SVCCA (without projection of extra entries) is comparable to or better than the rest of the baselines.
- However, we do not answer what multilingual NMT really transfers to the low-resource languages. We leave that question for further research, together with optimising the number of languages or the amount of data per language.
- We note that Bjerva et al. (2019) used monolingual texts translated from different languages to investigate what kind of genetic information is preserved. Concerning structural similarity, they computed a distance matrix using syntax-dependency-tags counts per language from annotated treebanks. We leave this analysis for further work.
- Pairwise distances between the hierarchy’s leaves (languages).
- A note on evaluating multilingual benchmarks.
- What do language representations really represent? Computational Linguistics 45 (2), pp. 381–389.
- Tracking typological traits of Uralic languages in distributed language representations. In Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages, Helsinki, Finland, pp. 76–86.
- From phonology to syntax: unsupervised linguistic typology at different levels with language embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 907–916.
- A probabilistic generative model of linguistic typology. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1529–1540.
- WIT: web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), Trento, Italy, pp. 261–268.
- Language universals and linguistic typology: syntax and morphology. University of Chicago Press.
- Eigenwords: spectral word embeddings. The Journal of Machine Learning Research 16 (1), pp. 3035–3078.
- WALS online. Max Planck Institute for Evolutionary Anthropology, Leipzig.
- Latent semantic analysis. Annual Review of Information Science and Technology 38 (1), pp. 188–230.
- An Indoeuropean classification: a lexicostatistical experiment. Transactions of the American Philosophical Society 82 (5), pp. iii–132.
- Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden, pp. 462–471.
- Canonical correlation analysis: an overview with application to learning methods. Neural Computation 16 (12), pp. 2639–2664.
- Google’s multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, pp. 339–351.
- The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 6282–6293.
- Marian: cost-effective high-quality neural machine translation in C++. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, Melbourne, Australia, pp. 129–135.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, pp. 66–71.
- Investigating multilingual NMT representations at scale. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 1565–1575.
- Choosing transfer languages for cross-lingual learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3125–3135.
- URIEL and lang2vec: representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 8–14.
- Learning language representations for typology prediction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2529–2535.
- Continuous space representations of linguistic typology and their application to phylogenetic inference. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 324–334.
- Diachrony-aware induction of binary latent representations from typological features. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan, pp. 451–461.
- Analyzing correlated evolution of multiple features using latent representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4371–4382.
- Projecting dialect distances to geography: bootstrap clustering vs. noisy clustering. In Data Analysis, Machine Learning and Applications, C. Preisach, H. Burkhardt, L. Schmidt-Thieme and R. Decker (Eds.), Berlin, Heidelberg, pp. 647–654.
- Survey on the use of typological information in natural language processing. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 1297–1308.
- Encoding prior knowledge with eigenword embeddings. Transactions of the Association for Computational Linguistics 4, pp. 417–430.
- Continuous multilinguality with language vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 644–649.
- Efficient computation of the tree edit distance. ACM Transactions on Database Systems (TODS), pp. 3:1–3:40.
- Tree edit distance: robust and memory-efficient. Information Systems 56, pp. 157–173.
- Contextual parameter generation for universal neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 425–435.
- Modeling language variation and universals: a survey on typological linguistics for natural language processing. Computational Linguistics 45 (3), pp. 559–601.
- A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, pp. 186–191.
- When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 529–535.
- Found in translation: reconstructing phylogenetic language trees from translations. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 530–540.
- SVCCA: singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems 30, pp. 6076–6085.
- Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, pp. 53–65.
- Parameter sharing methods for multilingual self-attentional translation models. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, pp. 261–271.
- Nematus: a toolkit for neural machine translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, pp. 65–68.
- Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725.
- Indo-European languages tree by Levenshtein distance. EPL (Europhysics Letters) 81 (6), pp. 68005.
- Multilingual neural machine translation with language clustering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 963–973.
- Polyglot neural language models: a case study in cross-lingual phonetic representation learning. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 1357–1366.
- Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett (Eds.), pp. 5998–6008.
- Target conditioned sampling: optimizing data selection for multilingual neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5823–5828.
- Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58 (301), pp. 236–244.