Cache-based Document-level Neural Machine Translation

Shaohui Kuang          Deyi Xiong          Weihua Luo          Guodong Zhou
School of Computer Science and Technology, Soochow University, Suzhou, China
Alibaba Group
shkuang@stu.suda.edu.cn, dyxiong@suda.edu.cn,
weihua.luowh@alibaba-inc.com, gdzhou@suda.edu.cn
Abstract

Sentences in a well-formed text are connected to each other via various links that form the cohesive structure of the text. Current neural machine translation (NMT) systems translate a text in a conventional sentence-by-sentence fashion, ignoring such cross-sentence links and dependencies. This may lead to the generation of an incohesive and incoherent target text for a cohesive and coherent source text. In order to handle this issue, we propose a cache-based approach to document-level neural machine translation that captures contextual information either from recently translated sentences or from the entire document. In particular, we explore two types of caches: a dynamic cache, which stores words from the best translation hypotheses of preceding sentences, and a topic cache, which maintains a set of target-side topical words that are semantically related to the document to be translated. On this basis, we build a new layer to score target words in these two caches with a cache-based neural model. The estimated probabilities from the cache-based neural model are combined with NMT probabilities into the final word prediction probabilities via a gating mechanism. Finally, the proposed cache-based neural model is trained jointly with a state-of-the-art neural machine translation system in an end-to-end manner. On several NIST Chinese-English translation tasks, our experiments demonstrate that the proposed cache-based model achieves substantial improvements over several state-of-the-art SMT and NMT baselines.


1 Introduction

Neural machine translation [\citenameSutskever et al.2014, \citenameBahdanau et al.2015], as an emerging machine translation approach, has quickly achieved state-of-the-art translation performance on many language pairs, e.g., English-French [\citenameLuong et al.2015b] and English-German [\citenameShen et al.2015, \citenameLuong et al.2015a]. In principle, NMT is established on an encoder-decoder framework, where the encoder reads a source sentence and encodes it into a fixed-length semantic vector, and the decoder generates a translation according to this vector.

In spite of its current success, NMT translates the sentences of a text independently, ignoring document-level information during translation. This largely limits its success, since document-level information imposes constraints on the translations of individual sentences of a text. Such document-level constraints can be categorized into at least three aspects: consistency, disambiguation and coherence. First, the same phrases or terms should be translated consistently across the entire text as much as possible, no matter in which sentences they occur. If the sentences of a text are translated independently of each other, it is difficult to maintain translation consistency across sentences. Second, document-level information provides a global context that can help disambiguate words with multiple senses when the sentence-level local context is not sufficient for disambiguation. Third, the topic of a document can keep individual sentences translated in a coherent way.

In the literature, such informative constraints have occasionally been investigated in statistical machine translation and have achieved certain success via a variety of document-level models, such as cache-based language and translation models [\citenameTiedemann2010, \citenameGong et al.2011, \citenameNepveu et al.2004] for the consistency constraint and topic-based coherence models [\citenameXiong and Zhang2013, \citenameTam et al.2007] for the coherence constraint. By contrast, in neural machine translation, to the best of our knowledge, such constraints have not been explored so far.

Partially inspired by the success of cache models in SMT, we propose a cache-based approach to document-level neural machine translation. In particular, we incorporate two types of caches in the proposed neural cache model: a static topic cache and a dynamic cache that is updated on the fly. For the topic cache, we first use a projection-based bilingual topic learning approach to infer the topic distribution for each document to be translated and obtain the corresponding topical words on the target side. These topical words are then integrated into the topic cache. For the dynamic cache, words are retrieved from the best translation hypotheses of recently translated sentences. While the topic cache builds a global profile for a document, which helps impose the coherence constraint on document translation, the dynamic cache follows the intuition that words that occurred previously should have higher probabilities of recurrence, even if they are rare words in the training data.

In order to integrate these two caches into neural machine translation, we further propose a cache-based neural model, which adds a new neural network layer on the cache component to score each word in the cache. During decoding, we estimate the probability of a word from the cache according to its score and combine this cache probability with the original probability computed by the decoder via a gating mechanism to obtain the final word prediction probability.

On the NIST Chinese-English translation tasks, our experimental results show that the proposed cache-based neural approach achieves significant improvements of up to 0.98 BLEU points on average (up to 1.34 BLEU points on NIST04) over the state-of-the-art attention-based NMT baseline.

2 Related Work

In the literature, cache models have been proposed both for language models (traditional n-gram and neural language models) and for translation models in conventional statistical machine translation. In this section, we first introduce related work on cache-based language models and then on cache-based translation models.

For traditional n-gram language models, \newciteKuhn1990A propose a cache-based language model for speech recognition, which mixes a large global language model with a small local model estimated from recent items in the history of the input stream. \newcitedella1992adaptive introduce a MaxEnt-based cache model by integrating a cache into a smoothed trigram language model, reporting reductions in both perplexity and word error rates. \newcitechueh2010topic present a new topic cache model for speech recognition based on the latent Dirichlet language model, incorporating a large-span topic cache into the generation of topic mixtures.

For neural language models, \newcitehuang2014cache propose a cache-based RNN inference scheme, which avoids repeated computation of identical LM calls and caches previously computed scores and useful intermediate results, thus reducing the computational expense of RNNLMs. \newciteGrave2016Improving extend the neural network language model with a neural cache model, which stores recent hidden activations to be used as contextual representations. Our caches differ significantly from these two in that we store linguistic items in the cache rather than scores or activations.

For cache-based translation models, \newcitenepveu2004adaptive propose a dynamic adaptive translation model using cache-based implementation for interactive machine translation, and develop a monolingual dynamic adaptive model and a bilingual dynamic adaptive model. \newcitetiedemann2010context propose a cache-based translation model, filling the cache with bilingual phrase pairs from the best translation hypotheses of previous sentences in a document. \newcitegong2011cache further propose a cache-based approach to document-level translation, which includes three caches, a dynamic cache, a static cache and a topic cache, to capture various document-level information. \newcitebertoldi2013cache describe a cache mechanism to implement online learning in phrase-based SMT and use a repetition rate measure to predict the utility of cached items expected to be useful for the current translation.

Our caches are similar to those used by \newcitegong2011cache who incorporate these caches into statistical machine translation. We adapt them to neural machine translation with a neural cache model. It is worthwhile to emphasize that such adaptation is nontrivial as shown below because the two translation philosophies and frameworks are significantly different.

3 Neural Machine Translation

In this section, we briefly describe the attention-based NMT model proposed in [\citenameBahdanau et al.2015].

In their framework, the encoder encodes a source sentence into a sequence of vectors with bi-directional RNNs, where the forward RNN reads the source sentence from left to right and the backward RNN reads it in the inverse direction. The hidden states in the forward RNN are computed as follows:

$\overrightarrow{h}_i = f(\overrightarrow{h}_{i-1}, x_i)$   (1)

where $f$ is a non-linear activation function, here defined as a gated recurrent unit (GRU) [\citenameChung et al.2014]. The hidden states $\overleftarrow{h}_i$ of the backward RNN can be calculated similarly. On this basis, the forward and backward hidden states are concatenated into the final annotation vectors $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$.

The decoder is also an RNN that predicts the next word $y_t$ given the context vector $c_t$, the hidden state $s_t$ and the previously generated partial translation $y_{<t}$. The probability of the next word $y_t$ is calculated as follows:

$p(y_t|y_{<t}, x) = g(y_{t-1}, s_t, c_t)$   (2)

where $g$ is a softmax activation function, and $s_t$ is the state of the RNN decoder at time step $t$, computed as:

$s_t = f(s_{t-1}, y_{t-1}, c_t)$   (3)

where $f$ is the activation function, the same as that used in the encoder. The context vector $c_t$ is calculated as a weighted sum of all hidden states of the encoder as follows:

$c_t = \sum_{i=1}^{T_x} \alpha_{ti} h_i$   (4)

$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{T_x} \exp(e_{tk})}$   (5)

$e_{ti} = a(s_{t-1}, h_i)$   (6)

where $\alpha_{ti}$ is the attention weight of each hidden state $h_i$ computed by the attention model, and $a$ is a feedforward neural network with a single hidden layer.
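To make these computations concrete, the following NumPy sketch implements Eqs. (4)-(6) for a single decoding step. The parameter names W_a, U_a and v_a are illustrative assumptions about the single-hidden-layer scorer $a$, not names taken from the original implementation.

    import numpy as np

    def attention(s_prev, H, W_a, U_a, v_a):
        """One step of additive attention (Eqs. 4-6), as a sketch.
        s_prev: previous decoder state, shape (d,)
        H: encoder annotations [h_1 .. h_Tx], shape (Tx, 2d)
        W_a, U_a, v_a: parameters of the one-hidden-layer scorer a(.)
        """
        # Eq. (6): e_ti = a(s_{t-1}, h_i)
        e = np.tanh(s_prev @ W_a + H @ U_a) @ v_a   # shape (Tx,)
        # Eq. (5): attention weights via a softmax over source positions
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()
        # Eq. (4): context vector as the weighted sum of annotations
        c = alpha @ H                               # shape (2d,)
        return c, alpha

Called with parameters of compatible shapes, the function returns a context vector and a normalized weight distribution over source positions.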

We also implement an NMT system that adopts feedback attention, taken from the dl4mt tutorial (https://github.com/nyu-dl/dl4mt-tutorial/tree/master/session2), referred to as RNNSearch* in this paper. In the feedback attention, $e_{ti}$ is computed as follows:

$e_{ti} = a(\tilde{s}_{t-1}, h_i)$   (7)

where $\tilde{s}_{t-1} = \mathrm{GRU}(s_{t-1}, y_{t-1})$, and the hidden state of the decoder is updated as follows:

$s_t = \mathrm{GRU}(\tilde{s}_{t-1}, c_t)$   (8)

In this paper, the proposed cache-based neural approach is implemented on top of the RNNSearch* system, where the encoder-decoder NMT framework is trained, as usual, to optimize the sum of the conditional log-probabilities of the correct translations of all source sentences in a parallel corpus.
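For concreteness, the two-step decoder transition behind Eqs. (7)-(8) can be sketched as follows. Here gru1, gru2 and attend are illustrative stand-ins for the conditional-GRU components of the dl4mt implementation, not its actual API.

    def decoder_step(s_prev, y_prev_emb, H, gru1, gru2, attend):
        """One decoder step with feedback attention (Eqs. 7-8), a sketch."""
        # First transition: fold the previous target word into the state,
        # producing the intermediate state \tilde{s}_{t-1}.
        s_tilde = gru1(s_prev, y_prev_emb)
        # Eq. (7): the attention network is queried with the intermediate
        # state rather than with s_{t-1} directly (feedback attention).
        c_t, alpha = attend(s_tilde, H)
        # Eq. (8): the second transition folds the fresh context vector in.
        s_t = gru2(s_tilde, c_t)
        return s_t, c_t, alpha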

4 Document-level NMT with a Cache-based Neural Model

In this section, we elaborate on the proposed cache-based neural model and how we integrate it into neural machine translation. Figure 1 shows the overall architecture of our NMT with the cache-based neural model.

Figure 1: Architecture of NMT with the neural cache model. $p_{cache}$ is the probability for the next target word estimated by the cache-based neural model.

4.1 Dynamic Cache and Topic Cache

The aim of the cache is to incorporate document-level constraints and thereby to improve the consistency and coherence of document translations. In this section, we introduce the proposed dynamic cache and topic cache in detail.

4.1.1 Dynamic Cache

We apply the following rules to build the dynamic cache.

  1. Words are dynamically extracted from recently translated sentences and from the partial translation of the current sentence being translated, and inserted into the dynamic cache.

  2. The maximum size of the dynamic cache is fixed in advance (set to 100 in our experiments; see Section 5.1).

  3. Following the first-in-first-out rule, when the dynamic cache is full and a new word is inserted, the oldest word in the cache is removed.

  4. Duplicate entries are not allowed: a word that is already in the cache is not inserted again.

It is worth noting that we also maintain a stop word list, to which we add English punctuation marks and "<UNK>". Words in the stop word list are not inserted into the dynamic cache, so common words like "a" and "the" cannot appear in the cache. All words in the dynamic cache can be found in the target-side vocabulary of RNNSearch*. These rules are illustrated in the sketch below.
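The rules above can be realized with a small first-in-first-out structure, as in the following minimal Python sketch; the class and method names are ours, not taken from the paper's implementation.

    from collections import OrderedDict

    class DynamicCache:
        """Dynamic cache of Section 4.1.1: FIFO eviction, no duplicates,
        stop words (including "<UNK>") excluded."""

        def __init__(self, max_size=100, stop_words=None):
            self.max_size = max_size
            self.stop_words = stop_words or {"<UNK>", ".", ",", "?", "!"}
            self.words = OrderedDict()  # insertion order = arrival order

        def add(self, word):
            if word in self.stop_words or word in self.words:
                return  # stop-word filtering and rule 4 (no duplicates)
            if len(self.words) >= self.max_size:
                self.words.popitem(last=False)  # rule 3: evict the oldest word
            self.words[word] = True

        def __contains__(self, word):
            return word in self.words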

4.1.2 Topic Cache

In order to build the topic cache, we first use an off-the-shelf LDA topic tool (http://www.arbylon.net/projects/) to learn topic distributions of source- and target-side documents separately. Then we estimate a topic projection distribution $p(z_{tgt}|z_{src})$ over all target-side topics for each source topic $z_{src}$ by collecting events and accumulating counts of $(z_{src}, z_{tgt})$ from aligned document pairs, where $z_{src}/z_{tgt}$ is the topic with the highest topic probability on the source/target side. The topic cache is then used as follows:

  1. During the training process of NMT, the learned target-side topic model is used to infer the topic distribution for each target document. For a target document $d$ in the training data, we select the topic $z$ with the highest probability as the topic for the document. The most probable topical words in topic $z$ are extracted to fill the topic cache for document $d$.

  2. In the NMT testing process, we first infer the topic distribution for a source document in question with the learned source-side topic model. From the topic distribution, we choose the topic with the highest probability as the topic for the source document. Then we use the learned topic projection function to map the source topic onto a target topic with the highest projection probability, as illustrated in Figure 2. After that, we use the most probable topical words in the projected target topic to fill the topic cache.

The words of the topic cache and the dynamic cache together form the final cache model. In practice, the cache stores word embeddings, as shown in Figure 3. As we do not want to introduce extra embedding parameters, we let the cache share the same target word embedding matrix with the NMT model. In this case, if a word is not in the target-side vocabulary of NMT, we discard it from the cache. A sketch of the topic projection and cache filling follows.
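As an illustration of the topic projection and cache filling, here is a minimal sketch. The data layout (per-document dominant topic pairs, per-topic ranked word lists) is an assumption for exposition, not the actual tooling used in the paper.

    from collections import Counter

    def learn_topic_projection(doc_pairs):
        """Estimate p(z_tgt | z_src) from aligned document pairs, where each
        pair holds the most probable source- and target-side LDA topics."""
        counts = Counter(doc_pairs)
        totals = Counter(z_src for z_src, _ in doc_pairs)
        return {(zs, zt): c / totals[zs] for (zs, zt), c in counts.items()}

    def fill_topic_cache(z_src, projection, topic_words, k=200):
        """Map the source topic to the target topic with the highest
        projection probability and return its top-k topical words."""
        candidates = [(p, zt) for (zs, zt), p in projection.items() if zs == z_src]
        _, z_tgt = max(candidates)
        return topic_words[z_tgt][:k]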

4.2 The Cache-based Neural Model

The cache-based neural model evaluates the probabilities of words occurring in the cache and provides the evaluation results to the decoder via a gating mechanism.

4.2.1 Evaluating Word Entries in the Cache

When the decoder generates the next target word $y_t$, we hope that the cache can provide helpful information to judge whether $y_t$ is appropriate from the perspective of the document-level cache, provided that $y_t$ occurs in the cache. To achieve this goal, we need to appropriately evaluate the word entries in the cache.

In this paper, we build a new neural network layer as the scorer for the cache. At each decoding step $t$, we use the scorer to score $y_t$ if $y_t$ is in the cache. The inputs to the scorer are the current hidden state $s_t$ of the decoder, the previous word $y_{t-1}$, the context vector $c_t$, and the word $y_t$ from the cache. The score of $y_t$ is calculated as follows:

$\mathrm{score}(y_t) = \phi(s_t, y_{t-1}, c_t, y_t)$   (9)

where $\phi$ is a non-linear activation function; the hyperbolic tangent function is used here.

This score is further used to estimate the cache probability of $y_t$ as follows:

$p_{cache}(y_t|y_{<t}, x) = \frac{\exp(\mathrm{score}(y_t))}{\sum_{y' \in \mathrm{cache}} \exp(\mathrm{score}(y'))}$   (10)

Figure 2: Schematic diagram of the topic projection during the testing process.
Figure 3: Architecture of the cache model.

4.2.2 Integrating the Cache-based Neural Model into NMT

Since we have two prediction probabilities for the next target word $y_t$, one from the cache-based neural model, $p_{cache}$, and the other originally estimated by the NMT decoder, $p_{nmt}$, how do we integrate these two probabilities? Here, we introduce a gating mechanism to combine them: the word prediction probabilities over the vocabulary of NMT are updated by linearly interpolating the NMT probability and the cache-based neural model probability. The final word prediction probability for $y_t$ is calculated as follows:

$p(y_t|y_{<t}, x) = (1 - \alpha_t)\, p_{nmt}(y_t|y_{<t}, x) + \alpha_t\, p_{cache}(y_t|y_{<t}, x)$   (11)

Notice that if $y_t$ is not in the cache, we set $\alpha_t = 0$, where $\alpha_t$ is the gate, computed as follows:

$\alpha_t = \sigma(f_{gate}(s_t, y_{t-1}, c_t))$   (12)

where $\sigma$ is a sigmoid function and $f_{gate}$ is a feedforward network (see Section 5.1).

We use the contextual elements $s_t$, $y_{t-1}$ and $c_t$ both to score the current target word occurring in the cache (Eq. (9)) and to estimate the gate (Eq. (12)). If the target word is consistent with the context and in the cache at the same time, its prediction probability will be high. A sketch of this combination is given below.
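To make Eqs. (10)-(12) concrete, the following NumPy sketch combines the two distributions for one decoding step; the function and argument names (combine, cache_ids, gate_logit) are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def combine(p_nmt, cache_scores, cache_ids, gate_logit):
        """Blend decoder and cache probabilities (Eqs. 10-12), a sketch.
        p_nmt: decoder distribution over the vocabulary, shape (V,)
        cache_scores: Eq. (9) scores of the K cache words, shape (K,)
        cache_ids: vocabulary indices of the K cache words
        gate_logit: pre-activation of the gate alpha_t in Eq. (12)
        """
        # Eq. (10): normalize the scores over cache entries only
        p_k = np.exp(cache_scores - cache_scores.max())
        p_k /= p_k.sum()
        p_cache = np.zeros_like(p_nmt)
        p_cache[cache_ids] = p_k
        # Eq. (12): the gate, forced to 0 for words outside the cache
        alpha = 1.0 / (1.0 + np.exp(-gate_logit))
        mask = np.zeros_like(p_nmt)
        mask[cache_ids] = 1.0
        # Eq. (11): linear interpolation of the two predictions
        return (1.0 - alpha * mask) * p_nmt + alpha * mask * p_cache

For a word outside the cache, the masked gate is zero and the NMT probability is returned unchanged, exactly as stipulated after Eq. (11).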

Finally, we train the proposed cache model jointly with the NMT model towards minimizing the negative log-likelihood on the training corpus. The cost function is computed as follows:

$J(\theta) = -\sum_{n=1}^{N} \sum_{t} \log p(y_t^n|y_{<t}^n, x^n)$   (13)

where $\theta$ denotes all parameters in the cache-based NMT model.

4.3 Decoding Process

Our cache-based NMT system works as follows:

  1. When the decoder shifts to a new test document, clear the topic and dynamic cache.

  2. Obtain target topical words for the new test document as described in Section 4.1 and fill them in the topic cache.

  3. Clear the dynamic cache when translating the first sentence of the test document.

  4. For each sentence in the new test document, translate it with the proposed cache-based NMT model and continuously expand the dynamic cache with newly generated target words and with target words obtained from the best translation hypotheses of previously translated sentences.

In this way, the topic cache can provide useful global information at the beginning of the translation process, while the dynamic cache grows as translation progresses. A sketch of this loop is given below.
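Reusing the DynamicCache sketch from Section 4.1.1, the loop can be organized as follows; nmt.translate is a hypothetical stand-in for the beam-search decoder of the cache-based NMT system.

    def translate_document(sentences, nmt, topic_words):
        """Document-level decoding (steps 1-4 of Section 4.3), a sketch."""
        # Steps 1-3: start each document with a fresh dynamic cache and a
        # topic cache filled with the document's target topical words.
        dynamic_cache = DynamicCache(max_size=100)
        topic_cache = set(topic_words)
        translations = []
        for sentence in sentences:
            # Step 4: decode with both caches; the decoder consults them
            # through the cache scorer and the gate at every position.
            hyp = nmt.translate(sentence, topic_cache, dynamic_cache)
            for word in hyp.split():
                dynamic_cache.add(word)  # grow the dynamic cache on the fly
            translations.append(hyp)
        return translations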

5 Experimentation

We evaluated the effectiveness of the proposed cache-based neural model for neural machine translation on NIST Chinese-English translation tasks.

5.1 Experimental Setting

We selected the corpora LDC2003E14, LDC2004T07, LDC2005T06, LDC2005T10 and a portion of the corpus LDC2004T08 (Hong Kong Hansards/Laws/News) as our bilingual training data, where document boundaries are explicitly kept. In total, our training data contain 103,236 documents and 2.80M sentences; on average, each document consists of 28.4 sentences. We chose the NIST05 dataset (1082 sentence pairs) as our development set, and NIST02, NIST04, NIST06 and NIST08 (878, 1788, 1664 and 1357 sentence pairs, respectively) as our test sets. We compared our proposed model against the following three systems: Moses [\citenameKoehn et al.2007], an open-source phrase-based SMT system; GroundHog, an open-source attention-based NMT system; and RNNSearch*, our in-house attention-based NMT system with feedback attention and dropout (Section 3).

For Moses, we used the full training data to train the model. We ran GIZA++ [\citenameOch and Ney2000] on the training data in both directions, and merged alignments in two directions with “grow-diag-final” refinement rule [\citenameKoehn et al.2005] to obtain final word alignments. We trained a 5-gram language model on the Xinhua portion of GIGA-WORD corpus using SRILM Toolkit with a modified Kneser-Ney smoothing.

For GroundHog, we used the parallel corpus to train the attention-based NMT model. The encoder of GroundHog consists of a forward and a backward recurrent neural network. The word embedding dimension is 620 and the size of a hidden layer is 1000. The maximum length of sentences used to train GroundHog in our experiments was set to 50 on both the Chinese and English sides. We used the most frequent 30K words for both Chinese and English, covering approximately 99.0% and 99.2% of the data in the two languages respectively, and replaced rare words with a special token "<UNK>". All other settings were the same as those in [\citenameBahdanau et al.2015]. Once the NMT model was trained, we adopted a beam search to find possible translations with high probabilities. We set the beam width to 10.

For RNNSearch*, we used the feedback attention mechanism rather than the vanilla attention network. All the other settings were set the same as GroundHog. Dropout was applied only on the output layer and the dropout rate was set to 0.5.

For the proposed cache-based NMT model, we implemented it on top of RNNSearch*. We set the sizes of the dynamic cache and the topic cache to 100 and 200, respectively. For the dynamic cache, we only kept the most recently visited items. For the LDA tool, we set the number of topics to 100 and the number of topic words used to fill the topic cache to 200. The hyperparameters $\alpha$ and $\beta$ of LDA were set to 0.5 and 0.1, respectively. For the cache scorer (Equation (9)) and the gating mechanism (Equation (12)), we used feedforward neural networks with two hidden layers. For the gating mechanism, the numbers of units in the two hidden layers were set to 500 and 200, respectively; for the cache scorer, they were set to 1000 and 500, respectively. We used a pre-training strategy that has been widely used in the literature to train our proposed model: we first trained the regular attention-based NMT model using our implementation of RNNSearch*, and then used its parameters to initialize the parameters of the proposed model, except for those related to the operations of the proposed cache model.

We used the stochastic gradient descent algorithm with mini-batches and Adadelta to train the NMT models. The mini-batch size was set to 80 sentences, and the decay rate $\rho$ and constant $\epsilon$ of Adadelta were set to 0.95 and $10^{-6}$, respectively.

5.2 Experimental Results

Table 1 shows the results of the different models measured in terms of BLEU score. (As our model requires document boundaries to guarantee that cache words come from the same document, we use all LDC corpora that provide document boundaries. Most training sentences are from the Hong Kong Hansards/Laws Parallel Text, accounting for 57.82% of our training data, which is in the law domain rather than the news domain of our test/dev sets. This is why our baseline is lower than other published results obtained with more news-domain training data without document boundaries.) From the table, we can find that our implementation RNNSearch*, using the feedback attention and dropout, outperforms GroundHog and Moses by 1.33 and 2.65 BLEU points, respectively. The proposed model with the dynamic cache ([+$C_d$]) achieves an average gain of 0.60 BLEU points over RNNSearch* on all test sets. Further, with both the dynamic and topic cache ([+$C_d,C_t$]), the model achieves an average gain of 0.97 BLEU points over RNNSearch*, and it outperforms GroundHog and Moses by 2.3 and 3.62 BLEU points, respectively. These results strongly suggest that the dynamic and topic cache are very helpful and able to improve translation quality in document translation.

Model NIST02 NIST04 NIST05 NIST06 Avg
Moses 31.52 32.73 29.52 29.57 30.69
GroundHog 32.31 35.52 30.92 29.32 32.01
RNNSearch* 33.44 36.76 32.31 30.88 33.34
+$C_d$ 34.01 37.55 32.02 32.19 33.94
+$C_d,C_t$ 34.41 38.10 32.90 31.82 34.31
Table 1: Experiment results on the NIST Chinese-English translation tasks. [+$C_d$] is the proposed model with the dynamic cache. [+$C_d,C_t$] is the proposed model with both the dynamic and topic cache. The BLEU scores are case-insensitive. Avg is the average BLEU score on all test sets.

Effect of the Gating Mechanism

In order to validate the effectiveness of the gating mechanism used in the cache-based neural model, we set a fixed gate value $\alpha_t$ for [+$C_d,C_t$]; in other words, we use a mixture of probabilities with fixed proportions to replace the gating mechanism that automatically learns the weights of the probability mixture.

Table 2 displays the results. When we set the gate to a fixed value of 0.3, the performance declines markedly compared with that of [+$C_d,C_t$] in terms of BLEU score; it is even worse than RNNSearch* by 7.93 BLEU points. Therefore, without a good gating mechanism, the cache-based neural model cannot be appropriately integrated into NMT. This shows that the gating mechanism plays an important role in [+$C_d,C_t$].

Model NIST02 NIST04 NIST05 NIST06 Avg
RNNSearch* 33.44 36.76 32.31 30.88 33.34
+$C_d,C_t$ 34.41 38.10 32.90 31.82 34.31
+$C_d,C_t$, $\alpha_t$=0.3 23.39 17.83 31.51 28.90 25.41
Table 2: Effect of the gating mechanism. [+$C_d,C_t$, $\alpha_t$=0.3] is [+$C_d,C_t$] with a fixed gate value of 0.3.

Effect of the Topic Cache

When the NMT decoder translates the first sentence of a document, the dynamic cache is empty. In this case, we hope that the topic cache will provide document-level information for translating the first sentence. We therefore further investigate how the topic cache influences the translation of the first sentence in a document. We count the average number of words that appear both in the translations of the beginning sentences of documents and in the topic cache.

The statistics are shown in Table 3. Even without using the cache model, RNNSearch* generates translations that contain words from the topic cache, as these topic words are tightly related to the documents being translated. With the topic cache, our neural cache model makes the translations of first sentences more relevant to the global topics of the documents being translated, as these translations contain more words from the topic cache. Since the dynamic cache is empty when the decoder translates a beginning sentence, the topic cache is complementary to such a cold cache at the start. Comparing the translations generated by our model with human translations (Reference in Table 3), we find that, with the help of the topic cache, the translations of the first sentences of documents become closer to the human translations.

Model NIST02 NIST04 NIST05 NIST06 Mean
RNNSearch* 1.90 2.43 1.31 1.50 1.78
+$C_d,C_t$ 2.11 2.51 1.53 1.73 1.96
Reference 2.39 2.51 2.20 1.28 2.09
Table 3: The average number of words in translations of the beginning sentences of documents that are also in the topic cache. Reference represents the average number of words in the four human translations that are also in the topic cache.
Setting NIST02 NIST04 NIST05 NIST06 Avg
[$C_t$, document] 25.70 27.48 26.89 41.00 30.27
[$C_d,C_t$, document] 59.46 64.24 76.70 124.25 81.16
[$C_t$, sentence] 2.93 3.07 2.49 1.95 2.61
[$C_d,C_t$, sentence] 6.77 7.18 7.09 5.89 6.73
Table 4: The average number of words in translations generated by [+$C_d,C_t$] that are also in the dynamic and topic cache. [$C_t$, document/sentence] denotes the average number of words that appear in both document/sentence translations and the topic cache. [$C_d,C_t$, document/sentence] denotes the average number of words occurring in both document/sentence translations and the two caches.
SRC (1) 并 将 计划 中 的 一 系列 军事 行动 提前 付诸 实施 。
(2) 会议 决定 加大 对 巴方 的 军事 打击 行动 。
REF (1) and to implement ahead of schedule a series of military actions still being planned .
(2) the meeting decided to increase military actions against palestinian .
RNNSearch* (1) and to implement a series of military operations .
(2) the meeting decided to increase military actions against the palestinian side .
+$C_d,C_t$ (1) and to implement a series of military actions plans.
(2) the meeting decided to increase military actions against the palestinian side .
Table 5: Translation examples on the test set. SRC for source sentences, REF for human translations. These two sentences (1) and (2) are in the same document.

Analysis on the Cache-based Neural Model

As shown above, the topic cache is able to influence both the translations of beginning sentences and those of subsequent sentences, while the dynamic cache, built from translations of preceding sentences, has an impact on the translations of subsequent sentences. We further study what roles the dynamic and topic cache play in the translation process. For this purpose, we calculate the average number of words in translations generated by [+$C_d,C_t$] that are also in the caches. During counting, stop words and "<UNK>" are removed from sentence and document translations.

Table 4 shows the results. If only the topic cache is used ([$C_t$, document/sentence] in Table 4), the cache can still provide useful information to help NMT translate sentences and documents: 28.3 words per document and 2.39 words per sentence are from the topic cache. When both the dynamic and topic cache are used ([$C_d,C_t$, document/sentence] in Table 4), the numbers of words that occur in both sentence/document translations and the two caches sharply increase from 2.61/30.27 to 6.73/81.16. The reason is that words that appear in preceding sentences have a high probability of appearing in subsequent sentences. This shows that the dynamic cache plays an important role in keeping document translations consistent by reusing words from preceding sentences.

We also provide two translation examples in Table 5. We can find that RNNSearch* generates different translations, "operations" and "actions", for the same Chinese word "行动 (xingdong)", while our proposed model produces the same translation, "actions".

5.3 Analysis on Translation Coherence

Model Coherence
RNNSearch* 0.4259
+$C_d,C_t$ 0.4274
Human Reference 0.4347
Table 6: The average cosine similarity of adjacent sentences (coherence) on all test sets.

We want to further study how the proposed cache-based neural model influences coherence in document translation. To this end, we follow \newciteLapata2005Automatic to measure coherence as sentence similarity. First, each sentence is represented by the mean of the distributed vectors of its words. Second, the similarity between two sentences is determined by the cosine of their means:

$\mathrm{sim}(S_1, S_2) = \cos(\mu(\vec{S_1}), \mu(\vec{S_2}))$   (14)

where $\mu(\vec{S_i}) = \frac{1}{|S_i|} \sum_{\vec{w} \in S_i} \vec{w}$, and $\vec{w}$ is the vector for word $w$.

We use Word2Vec (http://word2vec.googlecode.com/svn/trunk/) to obtain the distributed vectors of words, with the English Gigaword Fourth Edition (https://catalog.ldc.upenn.edu/LDC2009T13) as training data. We consider that embeddings from word2vec trained on a large monolingual corpus encode semantic information of words well. We set the dimensionality of word embeddings to 200. A sketch of this coherence measure follows.
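As a minimal sketch of this measure, assuming vectors maps each word to its word2vec embedding:

    import numpy as np

    def sentence_vector(sentence, vectors, dim=200):
        """Mean of the word vectors of a sentence."""
        vs = [vectors[w] for w in sentence.split() if w in vectors]
        return np.mean(vs, axis=0) if vs else np.zeros(dim)

    def coherence(sentences, vectors):
        """Average cosine similarity of adjacent sentences (Eq. 14)."""
        sims = []
        for s1, s2 in zip(sentences, sentences[1:]):
            v1, v2 = sentence_vector(s1, vectors), sentence_vector(s2, vectors)
            denom = np.linalg.norm(v1) * np.linalg.norm(v2)
            sims.append(float(v1 @ v2 / denom) if denom else 0.0)
        return sum(sims) / max(len(sims), 1)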

Table 6 shows the average cosine similarity of adjacent sentences on all test sets. From the table, we can find that the [+$C_d,C_t$] model produces better coherence in document translation than RNNSearch* in terms of cosine similarity.

6 Conclusion and Future Work

In this paper, we have presented a novel cache-based neural model for NMT to capture global topic information and inter-sentence cohesion dependencies. We use a gating mechanism to integrate both the topic and dynamic cache into the proposed neural cache model. Experimental results show that the cache-based neural model achieves consistent and significant improvements in translation quality over several state-of-the-art NMT and SMT baselines. Further analysis reveals that the topic cache and the dynamic cache are complementary to each other, and that both are able to guide the NMT decoder to use topical words and to reuse words from recently translated sentences as next-word predictions.

The proposed cache-based neural model only stores target words in the cache. In the future, we would like to integrate both source- and target-side information into document-level NMT.

References

  • [\citenameBahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.
  • [\citenameBertoldi et al.2013] Nicola Bertoldi, Mauro Cettolo, and Marcello Federico. 2013. Cache-based online adaptation for machine translation enhanced computer assisted translation. Proceedings of the XIV Machine Translation Summit, pages 35–42.
  • [\citenameChueh and Chien2010] Chuang-Hua Chueh and Jen-Tzung Chien. 2010. Topic cache language model for speech recognition. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pages 5194–5197. IEEE.
  • [\citenameChung et al.2014] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. Presented in NIPS 2014 Deep Learning and Representation Learning Workshop.
  • [\citenameDella Pietra et al.1992] Stephen Della Pietra, Vincent Della Pietra, Robert L Mercer, and Salim Roukos. 1992. Adaptive language modeling using minimum discriminant estimation. In Proceedings of the workshop on Speech and Natural Language, pages 103–106. Association for Computational Linguistics.
  • [\citenameGong et al.2011] Zhengxian Gong, Min Zhang, and Guodong Zhou. 2011. Cache-based document-level statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 909–919. Association for Computational Linguistics.
  • [\citenameGrave et al.2016] Edouard Grave, Armand Joulin, and Nicolas Usunier. 2016. Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426.
  • [\citenameHuang et al.2014] Zhiheng Huang, Geoffrey Zweig, and Benoit Dumoulin. 2014. Cache based recurrent neural network language model inference for first pass speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 6354–6358. IEEE.
  • [\citenameKoehn et al.2005] Philipp Koehn, Amittai Axelrod, Alexandra Birch, Chris Callison-Burch, Miles Osborne, David Talbot, and Michael White. 2005. Edinburgh system description for the 2005 iwslt speech translation evaluation. In IWSLT, pages 68–75.
  • [\citenameKoehn et al.2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 177–180. Association for Computational Linguistics.
  • [\citenameKuhn and De Mori1990] R. Kuhn and Renato De Mori. 1990. A cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(6):570–583.
  • [\citenameLapata and Barzilay2005] Mirella Lapata and Regina Barzilay. 2005. Automatic evaluation of text coherence: models and representations. In International Joint Conference on Artificial Intelligence, pages 1085–1090.
  • [\citenameLuong et al.2015a] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP, pages 1412–1421.
  • [\citenameLuong et al.2015b] Minh-Thang Luong, Ilya Sutskever, Quoc V Le, Oriol Vinyals, and Wojciech Zaremba. 2015b. Addressing the rare word problem in neural machine translation. In Proceedings of ACL.
  • [\citenameNepveu et al.2004] Laurent Nepveu, Guy Lapalme, Philippe Langlais, and George F. Foster. 2004. Adaptive language and translation models for interactive machine translation. In Proceedings of EMNLP 2004, pages 190–197.
  • [\citenameOch and Ney2000] Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 440–447. Association for Computational Linguistics.
  • [\citenameShen et al.2015] Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2015. Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433.
  • [\citenameSutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
  • [\citenameTam et al.2007] Yik Cheung Tam, Ian Lane, and Tanja Schultz. 2007. Bilingual lsa-based adaptation for statistical machine translation. Machine Translation, 21(4):187–207.
  • [\citenameTiedemann2010] Jörg Tiedemann. 2010. Context adaptation in statistical machine translation using models with exponentially decaying cache. In Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, pages 8–15. Association for Computational Linguistics.
  • [\citenameXiong and Zhang2013] Deyi Xiong and Min Zhang. 2013. A topic-based coherence model for statistical machine translation. In AAAI.