Global-Context Neural Machine Translation through Target-Side Attentive Residual Connections


Abstract

Neural sequence-to-sequence models achieve remarkable performance not only in machine translation (MT) but also in other language processing tasks. One of the reasons for their effectiveness is the ability of their decoder to capture contextual information through its recurrent layer. However, its sequential modeling over-emphasizes the local context, i.e. the previously translated word in the case of MT. As a result, the model ignores important information from the global context of the translation. In this paper, we address this limitation by introducing attentive residual connections from the previously translated words to the output of the decoder, which enable the learning of longer-range dependencies between words. The proposed model can emphasize any of the previously translated words, as opposed to only the last one, gaining access to the global context of the translated text. The model outperforms strong neural MT baselines on three language pairs, as well as a neural language modeling baseline. The analysis of the attention learned by the decoder confirms that it emphasizes a wide context, and reveals resemblance to syntactic-like structures.

1 Introduction

Neural machine translation [12] has recently become the state-of-the-art approach to machine translation [4]. Due to its quality and simplicity, in contrast to previous phrase-based statistical machine translation approaches [15], neural machine translation (NMT) has been widely adopted in both academia and industry. The effectiveness of attention-based NMT can be largely attributed to the ability of its decoder to capture contextual information through its recurrent layer. However, its sequential modeling over-emphasizes the local context, mainly the previously translated word, and consequently gives insufficient importance to the global context of the translated text. Ignoring the global context may lead to sub-optimal lexical choices and imperfect fluency of the output text.

Figure 1: Decoder of the baseline NMT, which incorporates a residual connection to the previously predicted word $y_{t-1}$ at each time step.
Figure 2: Decoder with access to the global context. The summary vector $d_t$ is computed using residual connections coming from all the previously predicted words. Several ways to compute $d_t$ are proposed in Section 4.

To address this limitation, we propose a novel approach, represented in Figure 2, which is based on attentive residual connections from the previously translated words to the output of the decoder, within an attention-based NMT architecture. The ability of the model to access the global context at any time improves the flow of information and enables the learning of longer-range dependencies between words than in previous models. Moreover, these benefits come with only a small computational overhead with respect to previous models. We demonstrate the generality of the proposed approach by applying it, in addition to NMT, to language modeling. For both tasks, our approach outperforms strong baselines, respectively in terms of BLEU [19] and perplexity scores.

Our contributions are the following:

  • We propose and compare several options for using attentive residual connections within a standard decoder for attention-based NMT, which enable access to the global context of a translated text.

  • We demonstrate consistent improvements over strong SMT and NMT baselines for three language pairs (English-to-Chinese, Spanish-to-English, and English-to-German), as well as for language modeling.

  • We perform an ablation study and analyze the learned attention function in terms of its behavior and structure, providing additional insights on its actual contributions.

2 Related Work

Enhancing sequential models to better capture the structure of sentences has been explored in a variety of studies. Many of them investigated recursive long short-term memory (LSTM) [11] architectures to combine words into phrases and higher-level arrangements for generating sentence structures [31]. More specifically, [25] proposed a tree-structured LSTM to model syntactic properties of a sentence. Other architectures were designed to model sentences as graphs [30], or used a self-attention memory to capture word dependencies or relations in a sentence [7]. In contrast to the latter study, which modified the LSTM units using memory networks, the self-attention mechanism proposed here uses residual connections and does not interfere with the recurrent layer, thus preserving its efficiency. Several other approaches incorporate explicit syntactic structure into NMT. For instance, in the tree-to-sequence approach proposed by [9], the encoder is modified to use information from the syntactic tree of the source sentence. In the string-to-tree model proposed by [1], linearized trees are used to incorporate structural information of the source and target sentences in a sequential model. In the sequence-to-dependency method proposed by [27], a dependency tree is used on the decoder side. All these approaches rely on external parsers to obtain structural information. In the present work, in contrast, we let the network learn meaningful connections by itself. The dependencies that appear to be learned are not strictly syntactic, as the ones obtained from a parser would be, but they are important for optimizing the objective function of the translation model.

To account for longer-range dependencies, our model relies on residual connections, also called shortcut connections, which have been shown to improve the learning process of deep neural networks by addressing the vanishing gradient problem [10]. These connections create a direct path from previous layers, which helps in the transmission of information. Recently, several architectures using residual connections with LSTMs have been proposed [29], applying the shortcut connections between LSTM layers in the spatial domain. In contrast, here, we propose to use residual connections in the temporal domain.

3 Background: Neural Machine Translation

Neural machine translation aims to compute the conditional distribution of emitting a sentence in a target language given a sentence in a source language, noted $p(y \mid x; \theta)$, where $\theta$ is the set of parameters of the neural model, and $x = (x_1, \dots, x_{T_x})$ and $y = (y_1, \dots, y_{T_y})$ are respectively the representations of the source and target sentences as sequences of words. The parameters are learned by training a sequence-to-sequence neural model on a corpus of parallel sentences. In particular, the learning objective is typically to maximize the following conditional log-likelihood:

$$\max_{\theta} \sum_{(x, y)} \log p(y \mid x; \theta)$$

where the sum runs over the sentence pairs of the parallel training corpus.

The attention-based NMT model designed by [2] has become the de facto baseline for machine translation. The model is based on RNNs, typically using gated recurrent units (GRUs), as in the system designed by [8], or LSTMs [11]. The architecture comprises three main components: an encoder, a decoder, and an attention mechanism.

The goal of the encoder is to build meaningful representations of the source sentences. It consists of a bidirectional RNN which includes contextual information from past and future words into the vector representation $h_i$ of a particular word $x_i$, formally defined as follows:

$$h_i = \big[\overrightarrow{h}_i ; \overleftarrow{h}_i\big], \qquad \overrightarrow{h}_i = f\big(\overrightarrow{h}_{i-1}, x_i\big), \qquad \overleftarrow{h}_i = f\big(\overleftarrow{h}_{i+1}, x_i\big)$$

Here, $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are the hidden states of the forward and backward passes of the bidirectional RNN respectively, and $f$ is a non-linear activation function.

The decoder (Figure 1) is in essence a recurrent language model. At each time step, it predicts a target word $y_t$ conditioned on the previously generated words and the information from the encoder, using the following posterior probability:

$$p(y_t \mid y_{<t}, x) = g(s_t, y_{t-1}, c_t)$$

where $g$ is a non-linear multilayer function that outputs the probability of $y_t$. The hidden state $s_t$ of the decoder is defined as:

$$s_t = f(s_{t-1}, y_{t-1}, c_t)$$

It depends on a context vector $c_t$ that is learned by the attention mechanism.

The attention mechanism allows the decoder to select which parts of the source sentence are more useful to predict the next output word. This goal is achieved by considering a weighted sum over all hidden states of the encoder ($T_x$ being the length of the source sentence):

$$c_t = \sum_{i=1}^{T_x} \alpha_{ti}\, h_i$$

where $\alpha_{ti}$ is a weight calculated using a normalized exponential function, also known as the alignment function, which computes how well the input at position $i$ and the output at position $t$ match:

$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{T_x} \exp(e_{tk})}, \qquad \text{where} \qquad e_{ti} = a(s_{t-1}, h_i)$$

Different types of alignment functions $a(\cdot)$ have been used for the NMT framework, as investigated by [17]. Here, we use the one originally defined by [2].
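To make this computation concrete, here is a minimal NumPy sketch of one attention step with an additive alignment function as in [2]; the parameter names (v_a, W_a, U_a) and shapes are illustrative assumptions rather than the actual dl4mt variables.

```python
# Minimal sketch of one attention step (assumed parameter names and shapes).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def context_vector(s_prev, enc_states, v_a, W_a, U_a):
    """s_prev: previous decoder state (n,); enc_states: (T_x, 2n) encoder states."""
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h) for h in enc_states])
    alpha = softmax(scores)           # alignment weights over source positions
    return alpha @ enc_states, alpha  # context vector c_t and the weights
```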

4 Decoder with Attentive Residual Connections

The state-of-the-art decoder of the attention-based NMT model uses a residual connection from the previously predicted word to the output classifier in order to enhance the performance of translation. As shown by the posterior probability defined in Section 3, the probability of a particular word is calculated by a function $g$ which takes as input the hidden state of the recurrent layer $s_t$, the representation of the previously predicted word $y_{t-1}$, and the context vector $c_t$. In theory, $s_t$ should be enough for predicting the next word, given that it already depends on the other two local-context components through its recurrence. However, this model over-emphasizes the local context (through $y_{t-1}$ and $c_t$), and essentially, the last predicted word plays the most important role in generating the next one. How can we allow the model to exploit a broader context for translation?

To answer this question, we propose to change the decoder formula by including residual connections not only from the previous time step $t-1$, but from all previous time steps $1$ to $t-1$. This reinforces the memory of the recurrent layer towards what has been translated so far, through weighted connections applied to all previously predicted words. At each time step, the new model decides which of the previously predicted words should be emphasized to predict the next one. In order to deal with the dynamic length of this new input, we use a target-side summary vector $d_t$ that can be interpreted as the representation of the decoded sentence up to time $t$ in the word embedding space. We therefore modify the decoder's output function by replacing $y_{t-1}$ with $d_t$:

$$p(y_t \mid y_{<t}, x) = g(s_t, d_t, c_t)$$

The replacement of $y_{t-1}$ with $d_t$ means that the number of parameters added to the model depends only on the calculation of $d_t$. Figure 2 illustrates the modification made to the decoder. We define two methods for summarizing the context into $d_t$, which are described in the following sections.

4.1 Mean Residual Connections

One simple way to aggregate information from multiple word embeddings is by averaging them. The average of the embedding vectors represents a shared space of the previously decoded words, which can be understood as the sentence representation up to time $t$. We hypothesize that this representation is more informative than using only the embedding of the previous word. Formally, this representation is computed as follows:

$$d_t = \frac{1}{t-1} \sum_{k=1}^{t-1} E_y[y_k]$$

where $E_y[y_k]$ denotes the embedding of the previously predicted word $y_k$.
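A minimal sketch of this computation, assuming the embeddings of the already-predicted words $y_1, \dots, y_{t-1}$ have been collected in a list:

```python
# Mean residual connection: d_t is the average of the previous word embeddings.
import numpy as np

def mean_summary(prev_embeddings):
    """prev_embeddings: list of embeddings of y_1 .. y_{t-1}; returns d_t."""
    return np.mean(np.stack(prev_embeddings), axis=0)
```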

4.2 Attentive Residual Connections

Averaging is a simple and cheap way to aggregate information from multiple words, but it may not be sufficient for all dependencies. Instead, we propose a dynamic way to aggregate information in each sentence, such that different words have different importance according to their relation with the prediction of the next word. We propose to use a shared self-attention mechanism to obtain a summary representation of the translation, i.e. a weighted average representation of the words translated from $y_1$ to $y_{t-1}$. This mechanism aims to model, in part, important non-sequential dependencies among words, and serves as a complementary memory to the recurrent layer:

$$d_t = \sum_{k=1}^{t-1} \beta_{tk}\, E_y[y_k], \qquad \text{where} \qquad \beta_{tk} = \frac{\exp\big(\mathrm{score}(E_y[y_k], c_t)\big)}{\sum_{j=1}^{t-1} \exp\big(\mathrm{score}(E_y[y_j], c_t)\big)}$$

The weights $\beta_{tk}$ of the attention model are computed by a scoring function $\mathrm{score}(\cdot)$ that predicts how important each previous word $y_k$ (with $k = 1, \dots, t-1$) is for the current prediction $y_t$. Figure 2 illustrates the attentive residual connections in the proposed decoder at the moment of predicting $y_t$, in contrast with the baseline model presented in Figure 1. We experiment with four different scoring functions.

In all of them, $W$, $v$, and $b$ are learned parameters of the network, whose dimensions depend on $d_w$ and $d_c$, the dimensions of the word embeddings and of the context vector respectively.

Firstly, we explore a simple scoring function that is calculated based only on the embeddings of the previously predicted words, as used e.g. by [20]. Secondly, we study the same scoring function (noted sum) as the one used in the attention mechanism over the encoder, as proposed by [2]. Thirdly, we consider an element-wise product, and lastly a dot-product function as proposed by [17]. In contrast to the first scoring function, which is based only on the previous words, the other three functions make use of the context vector $c_t$.
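As an illustration, the following is a minimal NumPy sketch of the attentive summary using the first (simple) scoring variant; the exact parameterization (shapes of W, v, b) is an assumption for illustration, not necessarily the one used in the experiments.

```python
# Attentive residual connections with the simple scoring variant (assumed shapes).
import numpy as np

def attentive_summary(prev_embeddings, W, v, b):
    """prev_embeddings: list of d_w-dimensional embeddings of y_1 .. y_{t-1}.
    Returns d_t, a weighted average of the previous word embeddings."""
    E = np.stack(prev_embeddings)          # (t-1, d_w)
    scores = np.tanh(E @ W.T + b) @ v      # one scalar score per previous word
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()      # softmax over previous positions
    return weights @ E                     # d_t lives in the embedding space
```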

5 Experimental Data and Setup

5.1 Datasets

To evaluate the proposed MT models in different conditions, we select three language pairs with increasing amounts of training data: English-Chinese (0.5M sentences), Spanish-English (2.1M), and English-German (4.5M).

For English-to-Chinese, we use a subset of the UN parallel corpus [22]1, with 500,000 sentence pairs for training, 2,000 sentences for development, and 2,000 sentences for testing. For training Spanish-to-English MT, we use a subset of WMT 2013 [3]2, corresponding to Europarl v7 with ca. 1,900,000 sentence pairs, plus News Commentary v11 with ca. 200,000 pairs. Newstest2012 and Newstest2013 were used for development and testing respectively, with 3,000 sentence pairs each. Finally, we use the complete English-to-German set from WMT 2016 [4]3 which includes Europarl v7, Common Crawl, and News Commentary v11 with a total of ca. 4.5 million sentence pairs. The development and testing sets are in this case Newstest2013 and Newstest2014 with 3,000 sentences each.

We report translation quality using BLEU scores over tokenized and truecased texts, using the scripts available in the Moses toolkit [14]4.

5.2 Setup and Parameters

We use the implementation of the attention-based NMT baseline provided by its authors, dl4mt-tutorial5, written in Python using the Theano framework [26]. The system implements the attention-based NMT model described previously using one layer of GRUs [8]. The vocabulary size is 25,000 for English-to-Chinese NMT, and 50,000 for Spanish-to-English and English-to-German. We use the byte pair encoding (BPE) strategy for out-of-vocabulary words [23]. In all cases, the maximum sentence length of the training samples is 50, the dimension of the word embeddings is 500, and the dimension of the hidden layers is 1,024. We use dropout with a probability of 0.5 after each layer. The parameters of the models are initialized randomly from a standard normal distribution scaled by a factor of 0.01. The loss function is optimized using Adadelta [28] with the best-performing settings from the original paper. The systems were trained for 36 epochs on English-to-Chinese, 18 epochs on Spanish-to-English, and 12 on English-to-German. Training took 7–12 days for each model on a Tesla K40 GPU, at a speed of ca. 1,000 words/sec.
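For reference, the hyperparameters above can be summarized as a configuration dictionary (hypothetical key names, not actual dl4mt-tutorial options):

```python
# Hyperparameters described in this section (illustrative key names only).
config = {
    "vocab_size": 50000,          # 25,000 for English-to-Chinese
    "subword": "bpe",             # byte pair encoding for rare words
    "max_sentence_length": 50,
    "embedding_dim": 500,
    "hidden_dim": 1024,
    "dropout": 0.5,
    "init_scale": 0.01,           # N(0, 1) initialization scaled by 0.01
    "optimizer": "adadelta",
}
```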

6 Analysis of Results

The BLEU scores of the different NMT models for the three language pairs are shown in Table 1. Along with the NMT baseline, we included an SMT one based on Moses [14] with the same training/tuning/test data as the NMT. The results show that all the proposed models outperform the attention-based NMT baseline, and the best scores are obtained by the NMT model with attentive residual connections. Despite their simplicity, the mean residual connections already improve the translation, without increasing the number of parameters.


Table 1: BLEU scores of the evaluated models for three translation directions. The highest score per dataset is marked with an asterisk (*). The attentive residual connections make use of the simple attention function. We also show scores reported by other systems for English-to-German on newstest2014.
Models                                    En→Zh    Es→En    En→De
SMT Winning WMT 14                          –        –      20.7
NMT                                         –        –      20.9
SMT baseline                               21.6     25.2     17.5
NMT baseline                               22.6     25.4     22.6
NMT + Mean residual connections            23.6     25.7     22.9
NMT + Attentive residual connections      *24.0    *26.3    *23.2

6.1 Role of the attention function.

We now examine the four scoring functions for the attention to residual connections presented in Section 4.2, considering only English-to-Chinese (the smallest dataset) due to limited computing time.

Table ? shows the BLEU scores on the test set: the best method is the simple matching function, which depends only on the word embeddings, in contrast to the other three methods, which additionally depend on the context vector. When using both word embeddings and local context to compute the attention, the performance is better than the baseline system but not better than simple attention; this is likely because such a function is harder to learn than one based on the global context only.

The improved performance comes at the cost of only a slight increase in the number of parameters compared to the baseline system, corresponding to the additional parameters of the scoring function, whose size depends only on $d_w$, the dimensionality of the target word embeddings.

6.2 Performance according to human evaluation.

Manual evaluation on samples of 50 sentences for each language pair helped to ascertain the conclusions obtained from the BLEU scores, and to provide a qualitative understanding of the improvements brought by our model. For each language, we employed one evaluator who was a native speaker of the target language and had good knowledge of the source language. The evaluator ranked three translations of the same source sentence – one from each of our models: baseline, mean residual connections, and attentive residual connections – according to their translation quality. The three translations were presented in a random order, so that the system that had generated them could not be identified. To aggregate the judgments, we proceeded in pairs, and counted the number of times each system was ranked higher than, equal to, or lower than another competing system. The results shown in Table 2 indicate that the attentive model outperforms the one with mean residual connections, and both outperform the baseline, for all three language pairs. The rankings are thus identical to those obtained using BLEU in Table 1.


Table 2: Human evaluation of sentence-level translation quality on three language pairs. We compare the models in pairs, indicating the percentages of sentences that were ranked higher (>), equal (=), or lower (<) for the first system with respect to the second one. Each group of three columns corresponds to one language pair. The final ranking shows that the attentive model scores higher than the mean one, and both outperform the NMT baseline.

System                      >   =   <     >   =   <     >   =   <
Mean vs. Baseline          26  56  18    20  64  16    28  58  24
Attentive vs. Baseline     28  60  12    28  56  16    32  54  14
Attentive vs. Mean         24  62  14    28  58  14    32  56  12

6.3 Performance on language modeling.

The overall aim of our proposal is to model word dependencies which are meaningful for translation, and to extend the memory of the decoder. We hypothesize that the component that is actually affected by our change in the NMT architecture is the language modeling component. In order to verify this hypothesis, we incorporate the proposed residual connections into a neural model tailored for a language modeling (LM) task. This task is formulated as estimating the probability of a given text, expressed as a sequence of words noted $w_1, \dots, w_T$:

$$p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1})$$

The neural LM uses an RNN to approximate this probability, thanks to the following equations:

$$h_t = \phi(h_{t-1}, w_{t-1}), \qquad p(w_t \mid w_1, \dots, w_{t-1}) = g(h_t)$$

where $\phi$ and $g$ are non-linear functions, and $h_t$ is the hidden state of the recurrent layer. Similar to the formulation used in the decoder, we add the summary vector $d_t$ to the LM by modifying its output function as follows:

$$p(w_t \mid w_1, \dots, w_{t-1}) = g(h_t, d_t)$$
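A minimal PyTorch sketch of this idea (shown with the mean summary for brevity; the attentive variant replaces the cumulative mean with learned weights). This is an illustrative re-implementation, not the exact code used in the experiments:

```python
# LSTM language model whose output layer also receives a summary d_t of the
# embeddings of the words generated so far (mean variant shown for brevity).
import torch
import torch.nn as nn

class ResidualLM(nn.Module):
    def __init__(self, vocab, d_emb=150, d_hid=300):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_emb)
        self.rnn = nn.LSTM(d_emb, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid + d_emb, vocab)   # hidden state + summary d_t

    def forward(self, x):                            # x: (batch, T) word indices
        e = self.emb(x)                              # (batch, T, d_emb)
        h, _ = self.rnn(e)                           # (batch, T, d_hid)
        # d_t = cumulative mean of the embeddings of the words seen so far
        counts = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
        d = e.cumsum(dim=1) / counts                 # (batch, T, d_emb)
        return self.out(torch.cat([h, d], dim=-1))   # logits over the vocabulary
```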

For our experiment, we use a PyTorch LM implementation based on the work of [21]6. The baseline model is an LSTM with dimension 150 for the embeddings and 300 for the hidden states. All the models are trained on the Penn Treebank [18] for 100 epochs. Table ? shows that both proposed methods improve over the baseline. The best performing model uses attentive residual connections and reduces perplexity by 3 points with respect to the baseline.

Figure 3: Examples of Spanish-to-English translations from the attentive residual connections model (target sentences only). The output sentence is on the vertical axis, while the horizontal axis represents the previous words. A darker shade indicates a higher attention weight to each corresponding previous word. The hypothesized syntactic structures are indicated with brackets, and changes of attention focus are visible from the change in shades.
Figure 4: Number of times the word at a given position was assigned the maximum weight. The count is normalized by the length of the sentence and relative word position.
Figure 5: Mean of the weights assigned to a given word position. The weights are normalized by the length of the sentence and relative word position.

7 Qualitative Analysis

We perform a qualitative analysis of the attentive residual connections method. In particular, we analyze the distribution of the weights assigned by the attention mechanism, and the kind of connections that this approach learns. We also present sample translations to contrast the baseline system with the proposed approach.

7.1 Distribution of Attention.

Figures 4 and 5 show the distribution of the attention to previous words obtained with attentive residual connections, for Spanish-to-English NMT (the other two pairs exhibit similar distributions). The weights computed on the test corpus are summarized using two normalizations. First, we normalize the weights by the number of preceding words: at each time step the number of preceding words increases, so the weights are inherently more dispersed for words towards the end of the sentence. Second, we normalize the weights by the sentence length, given its variability throughout the test set.

Figure 4 shows the distribution of the relative positions which received maximal attention. In other words, we selected, at each word prediction, the preceding word with maximal weight, and counted its relative position, represented on the horizontal axis. Although the maximal attention was most frequently directed to the closest preceding words, in many cases distant words (more than 30 positions away) were still given maximum attention. Figure 5 shows the mean weight assigned at each word position, now considering the weights assigned to all preceding words (not only the maximal one as in the previous case). We observe a uniform distribution (with noise for sentences longer than our training samples of 50 words), showing that the attention is indeed distributed throughout a sentence.

7.2 Structures Learned by the Attentive Model.

Figure 3 shows matrices with target-side attention weights for four samples translated from Spanish to English with our attentive model. The output sentence is on the vertical axis, while the horizontal axis represents the previous words. The color shades at each position ("cell") represent the attention weights. A first attempt to determine classes of words which receive more attention did not lead to conclusive results, except possibly for prepositions and conjunctions. In a second attempt, we observed the formation of sub-phrases. By analyzing the changes of attention as a binary tree, it appears from the examined examples that the decoder learns certain syntactic patterns. In Figure 3, starting from the first word, we mark with brackets the formation of a new sub-phrase each time the focus of attention changes to a new word. In these examples, we can see that the sub-phrases coincide with noun phrases or verb phrases.

Translation Samples.

Table 3 shows some examples of translations from the baseline and the attentive residual connections model. The first part (upper three examples) includes cases where the proposed model reached a higher BLEU score than the baseline system: here, the structure of the sentence, or at least the word order, is improved. The second part (lower three examples) contains cases where the baseline had a better BLEU score than our model. In the first of these examples, the structure of the sentence is different but the information is correct; in the second one, the lexical choices differ from the reference but are still correct; while in the third one, the translation is simply incorrect.


Table 3: Translation samples from Spanish to English. S: source, R: reference, B: baseline, A: attentive model.
S: Estudiantes y profesores se están tomando a la ligera la fecha.
R: Students and teachers are taking the date lightly.
B: Students and teachers are being taken lightly to the date.
A: Students and teachers are taking the date lightly.

S: No porque compartiera su ideología, sino porque para él los Derechos Humanos son indivisibles.
R: Not because he shared their world view, but because for him, human rights are indivisible.
B: Not because I share his ideology, but because he is indivisible by human rights.
A: Not because he shared his ideology, but because for him human rights are indivisible.

S: ¿El PSG puede convertirse en un gran club europeo a corto plazo?
R: Can PSG become a top European club in the short term?
B: Can the PG can become a large short-term European club?
A: Can the PSG be a great European club in the short term?

S: El gobierno intenta que no se construyan tantas casas pequeñas.
R: The Government is trying not to build so many small houses.
B: The government is trying not to build so many small houses.
A: The government is trying to ensure that so many small houses are not built.

S: Otras personas pueden tener niños.
R: Other people can have children.
B: Other people can have children.
A: Others may have children.

S: Dar a luz en el espacio
R: Give birth in space
B: Give birth in space
A: In the light of the space

8 Conclusion

We presented a novel approach to enrich the sequential model of state-of-the-art decoders for neural machine translation. We proposed to extend the context by introducing residual connections to the previous word predictions. To manage the variable length of the previous predictions, we proposed two methods for context summarization: mean residual connections and attentive residual connections. The first one is an average of the word embedding vectors, which can be interpreted as a representation of the sentence up to the time of the new word prediction. The second one uses an attention mechanism over the word embedding vectors to emphasize relevant information from the previous context. We evaluated our proposal on three datasets, for English-to-Chinese, Spanish-to-English, and English-to-German NMT. In each case, we improved the BLEU score compared to the NMT baseline. A manual evaluation over a small set of sentences for each language pair confirmed the improvement. We hypothesized that most of the improvement comes from the language model component, and showed indeed that our model has a lower perplexity than the baseline on an LM task. Finally, a qualitative analysis showed that the attentive model distributes weights throughout an entire sentence, and learns structures resembling syntactic ones.

To encourage further research in global-context NMT, our code will be made available upon publication.


Appendix A: Detailed Architecture

This appendix describes in detail the implementation of the Global-Context NMT model which is based on the attention-based NMT implementation of dl4mt-tutorial7.

The input of the model is a source sentence denoted as a sequence of 1-of-K coded vectors, where each element of the sequence corresponds to a word:

$$x = (x_1, \dots, x_{T_x}), \qquad x_i \in \mathbb{R}^{K}$$

and the output is a target sentence denoted likewise as a sequence of 1-of-K coded vectors:

$$y = (y_1, \dots, y_{T_y}), \qquad y_t \in \mathbb{R}^{K}$$

where $K$ is the size of the vocabulary on the target and source side, and $T_x$ and $T_y$ are the lengths of the source and target sentences respectively. We omit the bias vectors for simplicity.

A.1 Encoder

Each word of the source sentence is embedded in an $m$-dimensional vector space using the embedding matrix $E_x \in \mathbb{R}^{m \times K}$. The hidden states are $n$-dimensional vectors modeled by a bi-directional GRU. The forward states are computed as:

$$\overrightarrow{h}_i = (1 - \overrightarrow{z}_i) \odot \overrightarrow{h}_{i-1} + \overrightarrow{z}_i \odot \widetilde{\overrightarrow{h}}_i$$

where

$$\widetilde{\overrightarrow{h}}_i = \tanh\big(\overrightarrow{W} E_x x_i + \overrightarrow{U}\,[\overrightarrow{r}_i \odot \overrightarrow{h}_{i-1}]\big)$$
$$\overrightarrow{z}_i = \sigma\big(\overrightarrow{W}_z E_x x_i + \overrightarrow{U}_z \overrightarrow{h}_{i-1}\big)$$
$$\overrightarrow{r}_i = \sigma\big(\overrightarrow{W}_r E_x x_i + \overrightarrow{U}_r \overrightarrow{h}_{i-1}\big)$$

Here, $\sigma$ denotes the logistic sigmoid function, and $\overrightarrow{W}, \overrightarrow{W}_z, \overrightarrow{W}_r$ and $\overrightarrow{U}, \overrightarrow{U}_z, \overrightarrow{U}_r$ are weight matrices. The backward states are computed in a similar manner. The embedding matrix is shared for both passes, and the final hidden states are formed by their concatenation:

$$h_i = \big[\overrightarrow{h}_i ; \overleftarrow{h}_i\big]$$

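As an illustration of the forward and backward passes just described, here is a minimal NumPy sketch of a GRU step and of the bidirectional encoding; the argument order and the omission of bias terms are simplifying assumptions:

```python
# Minimal bidirectional GRU encoder sketch (illustrative parameter names).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_emb, h_prev, W, U, W_z, U_z, W_r, U_r):
    """One GRU step: x_emb is a word embedding, h_prev the previous state."""
    z = sigmoid(W_z @ x_emb + U_z @ h_prev)          # update gate
    r = sigmoid(W_r @ x_emb + U_r @ h_prev)          # reset gate
    h_tilde = np.tanh(W @ x_emb + U @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde

def encode(embeddings, params_fwd, params_bwd, n):
    """Run forward and backward passes and concatenate their hidden states."""
    T = len(embeddings)
    h_fwd, h_bwd = [np.zeros(n)], [np.zeros(n)]
    for t in range(T):                               # forward pass
        h_fwd.append(gru_step(embeddings[t], h_fwd[-1], *params_fwd))
    for t in reversed(range(T)):                     # backward pass
        h_bwd.append(gru_step(embeddings[t], h_bwd[-1], *params_bwd))
    h_bwd = h_bwd[1:][::-1]                          # align backward states with positions
    return [np.concatenate([hf, hb]) for hf, hb in zip(h_fwd[1:], h_bwd)]
```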
A.2 Attention mechanism

The context vector $c_t$ at time $t$ is calculated by:

$$c_t = \sum_{i=1}^{T_x} \alpha_{ti}\, h_i$$

where

$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{T_x} \exp(e_{tk})}, \qquad e_{ti} = v_a^{\top} \tanh\big(W_a s_{t-1} + U_a h_i\big)$$

Here, $W_a$ and $U_a$ are weight matrices and $v_a$ is a weight vector.

A.3 Decoder

The inputs of the decoder are the previous word $y_{t-1}$ and the context vector $c_t$; the objective is to predict $y_t$. The hidden states of the decoder are initialized with the mean of the context vectors:

$$s_0 = \tanh\Big(W_{init}\, \frac{1}{T_x} \sum_{i=1}^{T_x} h_i\Big)$$

where $W_{init}$ is a weight matrix and $T_x$ is the size of the source sentence. The following hidden states are calculated with a GRU conditioned on the context vector $c_t$ at time $t$ as follows:

$$s_t = (1 - z_t) \odot \hat{s}_t + z_t \odot \widetilde{s}_t$$

where

$$\widetilde{s}_t = \tanh\big(W c_t + U\,[r_t \odot \hat{s}_t]\big)$$
$$z_t = \sigma\big(W_z c_t + U_z \hat{s}_t\big)$$
$$r_t = \sigma\big(W_r c_t + U_r \hat{s}_t\big)$$

Here, $E_y$ is the embedding matrix for the target language, and $W$, $W_z$, $W_r$, $U$, $U_z$, and $U_r$ are weight matrices. The intermediate vector $\hat{s}_t$ is calculated with a simple GRU that takes the embedding of the previous word $E_y y_{t-1}$ and the previous state $s_{t-1}$ as inputs:

$$\hat{s}_t = \mathrm{GRU}\big(E_y y_{t-1},\, s_{t-1}\big)$$

In the attention-based NMT model, the probability of a target word is given by:

$$p(y_t \mid y_{<t}, x) = \mathrm{softmax}\Big(W_o \tanh\big(W_s s_t + W_e E_y y_{t-1} + W_c c_t\big)\Big)$$

whereas in our global-context NMT model, the probability of a target word is given by:

$$p(y_t \mid y_{<t}, x) = \mathrm{softmax}\Big(W_o \tanh\big(W_s s_t + W_d d_t + W_c c_t\big)\Big)$$

Here, $W_o$, $W_s$, $W_e$ (respectively $W_d$), and $W_c$ are weight matrices. The summary vector $d_t$ can be calculated in different manners based on the previous words $y_1$ to $y_{t-1}$. First, as a simple average:

$$d_t = \frac{1}{t-1} \sum_{k=1}^{t-1} E_y y_k$$

Second, by using an attention mechanism:

$$d_t = \sum_{k=1}^{t-1} \beta_{tk}\, E_y y_k, \qquad \beta_{tk} = \frac{\exp\big(\mathrm{score}(E_y y_k, c_t)\big)}{\sum_{j=1}^{t-1} \exp\big(\mathrm{score}(E_y y_j, c_t)\big)}$$

The matching function $\mathrm{score}(\cdot)$ can itself be calculated in different manners, corresponding to the four variants of Section 4.2 (simple, sum, element-wise product, and dot product), where the additional weight matrices and vectors involved are learned together with the rest of the network.
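For illustration, plausible forms of the four scoring variants named in Section 4.2 are sketched below; the exact parameterizations used in our implementation may differ, so these should be read as assumptions:

```python
# Hypothetical forms of the four scoring variants (simple, sum, element-wise
# product, dot product); e_k is the embedding of a previous word y_k.
import numpy as np

def score_simple(e_k, W, v, b):          # uses only the word embedding e_k
    return v @ np.tanh(W @ e_k + b)

def score_sum(e_k, c_t, W, U, v, b):     # additive, as in the encoder attention
    return v @ np.tanh(W @ e_k + U @ c_t + b)

def score_prod(e_k, c_t, W, U, v):       # element-wise product of projections
    return v @ np.tanh((W @ e_k) * (U @ c_t))

def score_dot(e_k, c_t, W):              # (general) dot product, as in [17]
    return (W @ e_k) @ c_t
```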

Footnotes

  1. http://www.uncorpora.org/
  2. http://www.statmt.org/wmt13/
  3. http://www.statmt.org/wmt16/
  4. Namely multi-bleu, tokenizer and truecase.
  5. https://github.com/nyu-dl/dl4mt-tutorial
  6. https://github.com/pytorch/examples/tree/master/word_language_model
  7. https://github.com/nyu-dl/dl4mt-tutorial

References

  1. Aharoni, R., and Goldberg, Y. 2017. Towards string-to-tree neural machine translation.
  2. Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate.
  3. Bojar, O., et al. 2013. Findings of the 2013 Workshop on Statistical Machine Translation.
  4. Bojar, O., et al. 2016. Findings of the 2016 Conference on Machine Translation.
  5. Buck, C.; Heafield, K.; and van Ooyen, B. 2014. N-gram counts and language models from the Common Crawl.
  6. Chen, Q.; Zhu, X.; Ling, Z.; Wei, S.; and Jiang, H. 2016. Enhancing and combining sequential and tree LSTM for natural language inference.
  7. Cheng, J.; Dong, L.; and Lapata, M. 2016. Long short-term memory-networks for machine reading.
  8. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation.
  9. Eriguchi, A.; Hashimoto, K.; and Tsuruoka, Y. 2016. Tree-to-sequence attentional neural machine translation.
  10. He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition.
  11. Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory.
  12. Kalchbrenner, N., and Blunsom, P. 2013. Recurrent continuous translation models.
  13. Kim, J.; El-Khamy, M.; and Lee, J. 2017. Residual LSTM: Design of a deep recurrent architecture for distant speech recognition.
  14. Koehn, P., et al. 2007. Moses: Open source toolkit for statistical machine translation.
  15. Koehn, P. 2010. Statistical Machine Translation.
  16. Le, P., and Zuidema, W. 2015. Compositional distributional semantics with long short term memory.
  17. Luong, T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation.
  18. Marcus, M. P.; Marcinkiewicz, M. A.; and Santorini, B. 1993. Building a large annotated corpus of English: The Penn Treebank.
  19. Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation.
  20. Pappas, N., and Popescu-Belis, A. 2017. Explicit document modeling through weighted multiple-instance learning.
  21. Press, O., and Wolf, L. 2017. Using the output embedding to improve language models.
  22. Rafalovitch, A., et al. 2009. United Nations General Assembly resolutions: A six-language parallel corpus.
  23. Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural machine translation of rare words with subword units.
  24. Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks.
  25. Tai, K. S.; Socher, R.; and Manning, C. D. 2015. Improved semantic representations from tree-structured long short-term memory networks.
  26. Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions.
  27. Wu, S.; Zhang, D.; Yang, N.; Li, M.; and Zhou, M. 2017. Sequence-to-dependency neural machine translation.
  28. Zeiler, M. D. 2012. ADADELTA: an adaptive learning rate method.
  29. Zhang, Y.; Chen, G.; Yu, D.; Yao, K.; Khudanpur, S.; and Glass, J. 2016. Highway long short-term memory RNNs for distant speech recognition.
  30. Zhu, X.; Sobhani, P.; and Guo, H. 2016. DAG-structured long short-term memory for semantic compositionality.
  31. Zhu, X.; Sobhani, P.; and Guo, H. 2015. Long short-term memory over recursive structures.