Inducing Grammars with and for Neural Machine Translation
Machine translation systems require semantic knowledge and grammatical understanding. Neural machine translation (NMT) systems often assume this information is captured by an attention mechanism and a decoder that ensures fluency. Recent work has shown that incorporating explicit syntax alleviates the burden of modeling both types of knowledge. However, requiring parses is expensive and does not explore the question of what syntax a model needs during translation. To address both of these issues we introduce a model that simultaneously translates while inducing dependency trees. In this way, we leverage the benefits of structure while investigating what syntax NMT must induce to maximize performance. We show that our dependency trees (1) are language pair dependent and (2) improve translation quality.
Language has syntactic structure, and translation models need to understand grammatical dependencies to resolve the semantics of a sentence and preserve agreement (e.g., number and gender). Many current approaches to MT have been able to avoid explicitly providing structural information by relying on advances in sequence-to-sequence (seq2seq) models. The most famous advances include attention mechanisms (bahdanau:2015) and gating in Long Short-Term Memory (LSTM) cells (Hochreiter:1997).
In this work we aim to benefit from syntactic structure, without providing it to the model, and to disentangle the semantic and syntactic components of translation, by introducing a gating mechanism which controls when syntax should be used.
Consider the process of translating the sentence “The boy sitting next to the girls ordered a coffee.” (Figure 1) from English to German. In German, translating ordered requires knowledge of its subject boy to correctly predict the verb’s number: bestellte instead of bestellten. This is a case where syntactic agreement requires long-distance information. On the other hand, next can be translated in isolation. The model should uncover these relationships and decide when and which aspects of syntax are necessary. While in principle decoders can utilize previously predicted words (e.g., the translation of boy) to reason about subject-verb agreement, in practice LSTMs still struggle with long-distance dependencies. Moreover, Belinkov17 showed that using attention reduces the decoder’s capacity to learn target side syntax.
In addition to demonstrating improvements in translation quality, we are also interested in analyzing the predicted dependency trees discovered by our models. Recent work has begun analyzing task-specific latent trees (Williams2017). We present the first results on learning latent trees with a joint syntactic-semantic objective. We do this in the service of machine translation which inherently requires access to both aspects of a sentence. Further, our results indicate that language pairs with rich morphology require and therefore induce more complex syntactic structure.
Our use of a structured self attention encoder (§4) that predicts a non-projective dependency tree over the source sentence provides a soft structured representation of the source sentence that can then be transferred to the decoder, which alleviates the burden of capturing target syntax on the target side.
We will show that the quality of the induced trees depends on the choice of the target language (§7). Moreover, the gating mechanism will allow us to examine which contexts require source side syntax.
In summary, in this work:
We propose a new NMT model that discovers latent structures for encoding and when to use them, while achieving significant improvements in BLEU scores over a strong baseline.
We perform an in-depth analysis of the induced structures and investigate where the target decoder decides syntax is required.
2 Related Work
Recent work has begun investigating what syntax seq2seq models capture (Linzen16), but this is evaluated via downstream tasks designed to test the model’s abilities and not its representation.
Simultaneously, recent research in neural machine translation (NMT) has shown the benefit of modeling syntax explicitly (aharoni-goldberg:2017:Short; Bastings17; Li17; Eriguchi17) rather than assuming the model will automatically discover and encode it.
bradbury-socher:2017:StructPred presented an encoder-decoder architecture based on RNNG (dyer-EtAl:2016:N16-1). However, their preliminary work was not scaled to a large MT dataset and omitted analysis of the induced trees.
Unlike the previous work on source side latent graph parsing (Hashimoto17), our structured self attention encoder allows us to extract a dependency tree in a principled manner. Therefore, learning the internal representation of our model is related to work done in unsupervised grammar induction (Klein:2004aa; Spitkovsky:2011ab), except that by focusing on translation we require both syntactic and semantic knowledge.
In this work, we attempt to contribute both to modeling syntax and to investigating a more interpretable interface for testing the syntactic content of a seq2seq model’s internal representation.
3 Neural Machine Translation
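Given a training pair of source and target sentences $(x, y)$ of length $n$ and $m$ respectively, neural machine translation is a conditional probabilistic model $p(y \mid x; \theta)$ implemented using neural networks:
$$\log p(y \mid x; \theta) = \sum_{j=1}^{m} \log p(y_j \mid y_{<j}, x; \theta) \qquad (1)$$
where $\theta$ denotes the model’s parameters. We omit the parameters $\theta$ hereafter for readability.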
The NMT system used in this work is a seq2seq model that consists of a bidirectional LSTM encoder and an LSTM decoder coupled with an attention mechanism (bahdanau:2015; luong-pham-manning:2015:EMNLP). Our system is based on a PyTorch implementation
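Here we use $S \in \mathbb{R}^{d \times n}$ to denote the concatenation of the encoder hidden states $s_1, \dots, s_n$. The decoder is composed of stacked LSTMs with input-feeding. Specifically, the inputs of the decoder at time step $t$ are a concatenation of the embedding of the previously generated word $y_{t-1}$ and a vector $u_{t-1}$:
$$u_{t-1} = g(h_{t-1}, c_{t-1}) \qquad (2)$$
where $g$ is a one-layer feed-forward network, $h_{t-1}$ is the output of the LSTM decoder, and $c_{t-1}$ is a context vector computed by an attention mechanism:
$$\alpha_{t-1} = \mathrm{softmax}\left(h_{t-1}^{\top} W_a S\right) \qquad (3)$$
$$c_{t-1} = S \alpha_{t-1}^{\top} \qquad (4)$$
where $W_a \in \mathbb{R}^{d \times d}$ is a trainable parameter.
Finally, a single-layer feed-forward network $f$ takes $u_t$ as input and returns a multinomial distribution over all the target words:
$$y_t \sim f(u_t) \qquad (5)$$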
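The attention step described above can be illustrated with a minimal sketch in plain Python. It assumes a bilinear score between the decoder output and each source state, followed by a softmax and a weighted sum. The function names and the list-of-vectors layout for the source states are illustrative choices, not the authors' implementation.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(h, W_a, S):
    """h: decoder output (length d); W_a: d x d matrix (list of rows);
    S: list of n source state vectors, each of length d (assumed layout).
    Returns the attention weights and the context vector."""
    d = len(h)
    # q = h^T W_a
    q = [sum(h[r] * W_a[r][c] for r in range(d)) for c in range(d)]
    # One bilinear score per source position, normalized with a softmax.
    alpha = softmax([sum(q[k] * s_i[k] for k in range(d)) for s_i in S])
    # Context vector: attention-weighted sum of the source states.
    c = [sum(a * s_i[k] for a, s_i in zip(alpha, S)) for k in range(d)]
    return alpha, c
```

With an identity `W_a`, the source state most similar to `h` receives the largest weight, and the context vector is pulled toward that state.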
4 Syntactic Attention Model
We propose a syntactic attention model
4.1 Head Word Selection
The head word selection layer learns to select a soft head word for each source word. This layer transforms $S$ into a matrix $M$ that encodes the implicit dependency structure of $x$ using structured self attention. First we apply three trainable weight matrices $W_q, W_k, W_v \in \mathbb{R}^{d \times d}$ to map $S$ to query, key, and value matrices $S_q = W_q S$, $S_k = W_k S$, $S_v = W_v S \in \mathbb{R}^{d \times n}$, respectively. Then we compute the structured self attention probabilities $\beta \in \mathbb{R}^{n \times n}$ via a function sattn applied to the scaled scores $\phi = S_q^{\top} S_k / \sqrt{d}$: $\beta = \mathrm{sattn}(\phi)$. Finally the syntactic context $M$ is computed as $M = S_v \beta$.
Here $n$ is the length of the source sentence, so $\beta$ captures all pairwise word dependencies. Each cell $\beta_{i,j}$ of the attention matrix $\beta$ is the posterior probability $p(x_i = \mathrm{head}(x_j) \mid x)$. The structured self attention function sattn is inspired by the work of Kim17 but differs in two important ways. First, we model non-projective dependency trees. Second, we utilize Kirchhoff’s Matrix-Tree Theorem (Tutte84) instead of the sum-product algorithm presented in Kim17 for fast evaluation of the attention probabilities. We note that Liu2017 were the first to propose using the Matrix-Tree Theorem for evaluating the marginals in end-to-end training of neural networks. Their work, however, focuses on the tasks of natural language inference (Bowman:snli15) and document classification, which arguably require less syntactic knowledge than machine translation. Additionally, we will evaluate our structured self attention on datasets that are up to 20 times larger than the datasets studied in previous work.
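Let $z \in \{0,1\}^{n \times n}$ be an adjacency matrix encoding a source’s dependency tree. Let $\phi \in \mathbb{R}^{n \times n}$ be a scoring matrix such that cell $\phi_{i,j}$ scores how likely word $x_i$ is to be the head of word $x_j$. The probability of a dependency tree $z$ is therefore given by
$$p(z \mid x; \phi) = \frac{\exp\left(\sum_{i,j} z_{i,j}\, \phi_{i,j}\right)}{Z(\phi)}$$
where $Z(\phi)$ is the partition function.
In the head selection model, we are interested in the marginal $p(z_{i,j} = 1 \mid x; \phi)$:
$$\beta_{i,j} = p(z_{i,j} = 1 \mid x; \phi) = \sum_{z \,:\, z_{i,j} = 1} p(z \mid x; \phi)$$
We use the framework presented by koo2007 to compute the marginals of non-projective dependency structures. koo2007 use Kirchhoff’s Matrix-Tree Theorem (Tutte84) to compute $p(z_{i,j} = 1 \mid x; \phi)$ by first defining the Laplacian matrix $L \in \mathbb{R}^{n \times n}$ as follows:
$$L_{i,j}(\phi) = \begin{cases} \sum_{k \neq j} \exp(\phi_{k,j}) & \text{if } i = j \\ -\exp(\phi_{i,j}) & \text{otherwise} \end{cases}$$
Now we construct a matrix $\hat{L}$ that accounts for root selection:
$$\hat{L}_{i,j}(\phi) = \begin{cases} \exp(\phi_{j,j}) & \text{if } i = 1 \\ L_{i,j}(\phi) & \text{if } i > 1 \end{cases}$$
The marginals in $\beta$ are then
$$\beta_{i,j} = (1 - \delta_{1,j}) \exp(\phi_{i,j}) \left[\hat{L}^{-1}(\phi)\right]_{j,j} - (1 - \delta_{i,1}) \exp(\phi_{i,j}) \left[\hat{L}^{-1}(\phi)\right]_{j,i}$$
where $\delta_{i,j}$ is the Kronecker delta. For the root node, the marginals are given by
$$\beta_{k,k} = \exp(\phi_{k,k}) \left[\hat{L}^{-1}(\phi)\right]_{k,1}$$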
The computation of the marginals is fully differentiable, thus we can train the model in an end-to-end fashion by maximizing the conditional likelihood of the translation.
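To make the Matrix-Tree computation concrete, here is a small plain-Python sketch of single-root, non-projective head marginals in the style of koo2007. It rests on some assumptions: `phi[i][j]` scores word `i` as the head of word `j`, the diagonal `phi[j][j]` scores word `j` as the root, and indices are 0-based. The Gauss-Jordan inverse stands in for a differentiable matrix inverse and is only meant for tiny examples.

```python
import math

def invert(a):
    # Gauss-Jordan inverse with partial pivoting (illustrative, O(n^3)).
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def tree_marginals(phi):
    """Return beta, where beta[i][j] is the marginal probability that word i
    heads word j and beta[j][j] is the probability that word j is the root."""
    n = len(phi)
    e = [[math.exp(phi[i][j]) for j in range(n)] for i in range(n)]
    # Laplacian: column sums of incoming edge weights on the diagonal,
    # negated edge weights elsewhere.
    lap = [[sum(e[k][j] for k in range(n) if k != j) if i == j else -e[i][j]
            for j in range(n)] for i in range(n)]
    # Replace the first row with the root scores.
    lap[0] = [e[j][j] for j in range(n)]
    inv = invert(lap)
    beta = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                beta[i][j] = e[j][j] * inv[j][0]
            else:
                first = e[i][j] * inv[j][j] if j != 0 else 0.0
                second = e[i][j] * inv[j][i] if i != 0 else 0.0
                beta[i][j] = first - second
    return beta
```

Each column of the result sums to one, because every word either has exactly one head or is the root. The diagonal also sums to one, because a tree has a single root.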
4.2 Incorporating Syntactic Context
Having set the annotations $S$ and $M$ with the encoder, the LSTM decoder can utilize this information at every generation step by means of attention. At time step $t$, we first compute standard attention weights $\alpha_{t-1}$ and context vector $c_{t-1}$ as in Equations (3) and (4). We then compute a weighted syntactic vector:
$$d_{t-1} = M \alpha_{t-1}^{\top}$$
Note that the syntactic vector $d_{t-1}$ and the context vector $c_{t-1}$ share the same attention weights $\alpha_{t-1}$. The main idea behind sharing attention weights (Figure 1(c)) is that if the model attends to a particular source word $x_i$ when generating the next target word, we also want the model to attend to the head word of $x_i$. We share the attention weights $\alpha_{t-1}$ because we expect that, if the model picks a source word $x_i$ to translate with the highest probability $\alpha_{t-1}[i]$, the contribution of $x_i$’s head in the syntactic vector $d_{t-1}$ should also be highest.
Figure 3 shows the latent tree learned by our translation objective. Unlike the gold tree provided in Figure 1, the model decided that “the boy” is the head of “ordered”. This is common in our model because the BiLSTM context means that a given word’s representation is actually a summary of its local context/constituent.
It is not always useful or necessary to access the syntactic context $d_{t-1}$ at every time step $t$. Ideally, we should let the model decide whether it needs to use this information or not. For example, the model might decide to only use syntax when it needs to resolve long-distance dependencies on the source side. To control the amount of source side syntactic information, we introduce a gating mechanism:
$$\hat{d}_{t-1} = d_{t-1} \odot \sigma(W_g h_{t-1})$$
The vector $u_{t-1}$ from Eq. (2) now becomes
$$u_{t-1} = g(h_{t-1}, c_{t-1}, \hat{d}_{t-1})$$
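The shared-attention syntactic vector and its gate can be sketched as follows. This is a minimal illustration that assumes the gate is an element-wise sigmoid of a linear map of the decoder output. The names `W_g`, `h`, and `M`, and the list-of-vectors layout, are illustrative assumptions rather than the authors' code.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_syntactic_vector(alpha, M, W_g, h):
    """alpha: n attention weights shared with the standard context vector;
    M: list of n syntactic context vectors, each of length d (assumed layout);
    W_g: d x d gate matrix (list of rows); h: decoder output of length d.
    Returns (gated syntactic vector, gate activations)."""
    d = len(h)
    # Syntactic vector: the same attention weights applied to head contexts.
    d_vec = [sum(a * m_i[k] for a, m_i in zip(alpha, M)) for k in range(d)]
    # Element-wise gate in (0, 1) computed from the decoder state.
    gate = [sigmoid(sum(W_g[r][c] * h[c] for c in range(d))) for r in range(d)]
    return [g * v for g, v in zip(gate, d_vec)], gate
```

A gate near zero suppresses the syntactic vector for that time step, while a gate near one lets it through unchanged.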
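Another approach to incorporating syntactic annotations $M$ in the decoder is to use a separate attention layer to compute the syntactic vector $d_{t-1}$ at time step $t$:
$$\gamma_{t-1} = \mathrm{softmax}\left(h_{t-1}^{\top} W_m M\right), \qquad d_{t-1} = M \gamma_{t-1}^{\top}$$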
We will provide a comparison to this approach in our results.
4.3 Hard Attention over Tree Structures
Finally, to simulate the scenario where the model has access to a dependency tree given by an external parser we report results with hard attention. Forcing the model to make hard decisions during training mirrors the extraction and conditioning on a dependency tree (§7.1). We expect this technique will improve the performance on grammar induction, despite making translation lossy. A similar observation has been reported in (Hashimoto17) which showed that translation performance degraded below their baseline when they provided dependency trees to the encoder.
Recall the marginal $\beta_{i,j}$ gives us the probability that word $x_i$ is the head of word $x_j$. We convert these soft weights to hard ones $\bar{\beta}$ by
$$\bar{\beta}_{k,j} = \begin{cases} 1 & \text{if } k = \arg\max_i \beta_{i,j} \\ 0 & \text{otherwise} \end{cases} \qquad (15)$$
We train this model using the straight-through estimator (bengio2013estimating). In this setup, each word has a parent, but there is no guarantee that the structure given by hard attention will result in a tree (i.e., it may contain cycles). A more principled way to enforce a tree structure is to decode the best tree using the maximum spanning tree algorithm (chu-liu-1965; Edmonds) and to set $\bar{\beta}_{k,j} = 1$ if the edge $(x_k \to x_j)$ belongs to that tree. Maximum spanning tree decoding can be prohibitively slow, as the Chu-Liu-Edmonds algorithm is not GPU friendly. We therefore greedily pick a parent word for each word $x_j$ in the sentence using Eq. (15). This is actually a principled simplification, as greedily assigning a parent to each word is the first step in the Chu-Liu-Edmonds algorithm.
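The greedy parent selection, and the reason it may fail to produce a tree, can be illustrated with a short sketch. It assumes `beta[i][j]` is the marginal that word `i` heads word `j`, and that a word whose best head is itself is treated as the root. Both function names are hypothetical.

```python
def greedy_heads(beta):
    # For every word j, pick the most probable head (column-wise argmax).
    n = len(beta)
    return [max(range(n), key=lambda i: beta[i][j]) for j in range(n)]

def is_tree(heads):
    """True if there is exactly one root (a word that is its own head)
    and every word reaches that root without running into a cycle."""
    roots = [j for j, h in enumerate(heads) if h == j]
    if len(roots) != 1:
        return False
    for start in range(len(heads)):
        seen = set()
        node = start
        while heads[node] != node:
            if node in seen:
                return False  # revisited a word: the greedy choice has a cycle
            seen.add(node)
            node = heads[node]
    return True
```

For example, if words 1 and 2 each pick the other as their most probable head, the greedy structure contains a cycle. A full Chu-Liu-Edmonds decoder would contract and resolve that cycle.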
Next we will discuss our experimental setup and report results for English–German (En–De), English–Russian (En–Ru), and Russian–Arabic (Ru–Ar) translation models.
We use the WMT17 (bojar-EtAl:2017:WMT1) data in our experiments. Table 1 shows the statistics of the data. For En–De, we use a concatenation of Europarl, Common Crawl, Rapid corpus of EU press releases, and News Commentary v12. We use newstest2015 for development and newstest2016 and newstest2017 for testing. For En–Ru, we use Common Crawl, News Commentary v12, and Yandex Corpus. The development data comes from newstest2016, and newstest2017 is reserved for testing. For Ru–Ar, we use the data from the six-way sentence-aligned subcorpus of the United Nations Parallel Corpus v1.0 (Ziemski:16). The corpus also contains the official development and test data.
| Language pair | Train | Valid | Test | Vocabulary |
|---|---|---|---|---|
| En–De | 5.9M | 2,169 | 2,999 / 3,004 | 36,251 / 35,913 |
| En–Ru | 2.1M | 2,998 | 3,001 | 34,872 / 34,989 |
| Ru–Ar | 11.1M | 4,000 | 4,000 | 32,735 / 32,955 |
Our language pairs were chosen to compare results across and between morphologically rich and poor languages. This will prove particularly interesting in our grammar induction results where different pairs must preserve different amounts of syntactic agreement information.
We use BPE (sennrich-haddow-birch:2016:P16-12) with 32,000 merge operations. We run BPE for each language instead of using BPE for the concatenation of both source and target languages.
Our baseline is an NMT model with input-feeding (§3). As we will be making several modifications to the basic architecture in our proposed structured self attention NMT (SA-NMT), we verify each choice in our architecture design empirically. First we validate the structured self attention module by comparing it to a self-attention module (lin:2017; Vaswani17). Self attention computes attention weights $\beta$ simply as $\beta = \mathrm{softmax}(\phi)$. Since self-attention does not assume any hierarchical structure over the source sentence, we refer to it as flat-attention NMT (FA-NMT). Second, we validate the benefit of using two sets of annotations in the encoder. We combine the hidden states of the encoder $s_i$ with the syntactic context $m_i$ to obtain a single set of annotations using the following equation:
$$\bar{s}_i = s_i + \sigma(W_g s_i) \odot m_i$$
Here we first down-weight the syntactic context $m_i$ before adding it to $s_i$. The sigmoid function $\sigma(W_g s_i)$ decides the weight of the head word of $x_i$ based on whether translating $x_i$ needs additional dependency information. We refer to this baseline as SA-NMT-1set. Note that in this baseline, there is only one attention layer from the target to the source annotations.
In all the models, we share the weights of target word embeddings and the output layer as suggested by Inan:2016 and press:2017.
5.3 Hyper-parameters and Training
For all the models, we set the word embedding size to 1024, the number of LSTM layers to 2, and the dropout rate to 0.3. Parameters are initialized uniformly in . We use the Adam optimizer (kingma:2014) with an initial learning rate of 0.001. We evaluate our models on development data every 10,000 updates for De–En and Ru–Ar, and every 5,000 updates for Ru–En. If the validation perplexity increases, we decay the learning rate by a factor of 0.5. We stop training after decaying the learning rate five times, as suggested by denkowski-neubig:2017:NMT. The mini-batch size is 64 in the Ru–Ar experiments and 32 in the rest. Finally, we report BLEU scores computed using the standard multi-bleu.perl script.
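The learning-rate schedule described above can be sketched in a few lines. This is one reading of the rule, and it assumes an "increase" means the validation perplexity is higher than at the previous evaluation. The function name and its return value are illustrative.

```python
def lr_schedule(val_perplexities, lr=0.001, factor=0.5, max_decays=5):
    """Return the learning rate in effect after each evaluation.
    Training stops (the list ends) once the rate has been decayed max_decays times."""
    rates, decays, prev = [], 0, None
    for ppl in val_perplexities:
        if prev is not None and ppl > prev:
            lr *= factor      # perplexity went up: halve the learning rate
            decays += 1
        rates.append(lr)
        prev = ppl
        if decays == max_decays:
            break             # fifth decay reached: stop training
    return rates
```

For instance, a perplexity trace of 10, 9, 9.5, 9.2, 9.3 triggers two decays, one at the third evaluation and one at the fifth.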
In our experiments, the SA-NMT models are two times slower than the baseline NMT, as measured by the number of target words generated per second.
5.4 Translation Results
Table 2 shows the BLEU scores in our experiments. We test statistical significance using bootstrap resampling (riezler-maxwell:2005:MTSumm). Statistical significances are marked as and when compared against the baselines. Additionally, we also report statistical significances and when comparing against the FA-NMT models that have two separate attention layers from the decoder to the encoder. Overall, the SA-NMT (shared) model performs the best, gaining more than 0.5 BLEU on De→En wmt16, up to 0.82 BLEU on En→De wmt17, and 0.64 BLEU in the En→Ru direction over a competitive NMT baseline. The gain of the SA-NMT model on Ru–Ar is small (0.45 BLEU) but significant. The results show that structured self attention is useful when translating from English to languages that have long-distance dependencies and complex morphological agreements. We also see that the gain is marginal compared to self-attention models (FA-NMT-shared) and not significant. Within FA-NMT models, sharing attention is helpful. Our results also confirm the advantage of having two separate sets of annotations in the encoder when modeling syntax. The hard structured self attention model (SA-NMT-hard) performs comparably to the baseline. While this is a somewhat expected result from the hard attention model, we will show in Section 7 that the quality of induced trees from hard attention is often far better than those from soft attention.
6 Gate Activation Visualization
As mentioned earlier, our models allow us to ask the question: When does the target LSTM need to access source side syntax? We investigate this by analyzing the gate activations of our best model, SA-NMT (shared). At time step $t$, when the model is about to predict the target word $y_t$, we compute the norm of the gate activations from the gating mechanism of §4.2:
$$z_t = \left\lVert \sigma(W_g h_{t-1}) \right\rVert_2$$
The activation norm $z_t$ allows us to see how much syntactic information flows into the decoder. We observe that $z_t$ has its highest value when the decoder is about to generate a verb, while it has its lowest value when the end-of-sentence token </s> is predicted. Figure 4 shows some examples of German target sentences. The darker colors represent higher activation norms.
It is clear that translating verbs requires structural information. We also see that after verbs, the gate activation norms are highest at nouns Zeit (time), Mut (courage), Dach (roof) and then tail off as we move to function words which require less context to disambiguate. Below are the frequencies with which the highest activation norm in a sentence is applied to a given part-of-speech tag on newstest2016. We include the top 7 most common activations. We see that while nouns are often the most common tag in a sentence, syntax is disproportionately used for translating verbs.
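The frequency analysis described above, which asks what part-of-speech tag receives the highest gate activation norm in each sentence, can be sketched as below. The input format (a list of `(tag, gate_vector)` pairs per sentence) and the function names are assumptions made for illustration, and the toy values in the usage note are not results from the paper.

```python
import math
from collections import Counter

def gate_norm(gate):
    # Euclidean norm of one time step's gate activations.
    return math.sqrt(sum(g * g for g in gate))

def top_tag_frequencies(sentences):
    """sentences: list of sentences, each a list of (pos_tag, gate_vector) pairs.
    Returns, per tag, the fraction of sentences in which that tag's token
    has the highest gate activation norm."""
    counts = Counter()
    for sent in sentences:
        tag, _ = max(sent, key=lambda item: gate_norm(item[1]))
        counts[tag] += 1
    total = sum(counts.values())
    return {tag: c / total for tag, c in counts.items()}
```

Calling `top_tag_frequencies` on two toy sentences, one peaking on a verb and one on a noun, returns a frequency of 0.5 for each tag.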
7 Grammar Induction
NLP has long assumed hierarchical structured representations are important to understanding language. In this work, we borrowed that intuition to inform the construction of our model. We investigate whether the internal latent representations discovered by our models share properties previously identified within linguistics and, if not, what important differences exist. We investigate the interpretability of our model’s representations through (1) a quantitative measure of attachment accuracy and (2) a qualitative look at its output.
Our results corroborate and refute previous work (Hashimoto17; Williams2017). We provide stronger evidence that syntactic information can be discovered via latent structured self attention, but we also present preliminary results indicating that conventional definitions of syntax may be at odds with task specific performance.
Unlike in the grammar induction literature, our model is not specifically constructed to recover traditional dependency grammars, nor have we provided the model with access to part-of-speech tags or universal rules (Naseem:2010aa; Bisk:2013aa). The model only uncovers the syntactic information necessary for a given language pair, though future work should investigate if structural linguistic constraints benefit MT.
7.1 Extracting a Tree
For extracting non-projective dependency trees, we use the Chu-Liu-Edmonds algorithm (chu-liu-1965; Edmonds). First, we must collapse BPE segments into words. Assume the $i$-th word corresponds to BPE tokens from index $u$ to $v$. We obtain a new word-level matrix by summing over the entries of the token-level matrix that belong to the corresponding BPE segments
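One way to read the collapsing step is as a block sum over the token-level matrix, sketched below. This is an assumption-laden illustration: it sums every entry in the block spanned by the head word's tokens (rows) and the dependent word's tokens (columns), which may differ in detail from the authors' exact formula. The function name and the inclusive span format are hypothetical.

```python
def collapse_bpe(matrix, spans):
    """matrix: token-level n x n matrix (list of rows);
    spans: one (start, end) pair of inclusive BPE token indices per word.
    Returns a word-level matrix whose cell (i, j) sums the token-level block
    for word i (rows) and word j (columns)."""
    return [
        [sum(matrix[r][c]
             for r in range(ru, rv + 1)
             for c in range(cu, cv + 1))
         for (cu, cv) in spans]
        for (ru, rv) in spans
    ]
```

For a three-token sentence whose first two tokens form one word, `collapse_bpe(m, [(0, 1), (2, 2)])` yields a 2 x 2 word-level matrix that a maximum spanning tree decoder could then consume.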