
# Neural Machine Translation: A Review and Survey

## Abstract

The field of machine translation (MT), the automatic translation of written text from one natural language into another, has experienced a major paradigm shift in recent years. Statistical MT, which mainly relies on various count-based models and which used to dominate MT research for decades, has largely been superseded by neural machine translation (NMT), which tackles translation with a single neural network. In this work we will trace back the origins of modern NMT architectures to word and sentence embeddings and earlier examples of the encoder-decoder network family. We will conclude with a survey of recent trends in the field.

###### keywords:
Neural machine translation, Neural sequence models

Various fields in the area of natural language processing (NLP) have been boosted by the rediscovery of neural networks (Goldberg, 2016). However, for a long time, the integration of neural nets into machine translation (MT) systems was rather shallow. Early attempts used feedforward neural language models (Bengio et al., 2003, 2006) for the target language to rerank translation lattices (Schwenk et al., 2006). The first neural models which also took the source language into account extended this idea by using the same model with bilingual tuples instead of target language words (Zamora-Martinez et al., 2010), scoring phrase pairs directly with a feedforward net (Schwenk, 2012), or adding a source context window to the neural language model (Le et al., 2012; Devlin et al., 2014). Kalchbrenner and Blunsom (2013) and Cho et al. (2014b) introduced recurrent networks for translation modelling. All of those approaches applied neural networks as components in a traditional statistical machine translation system: they retained the log-linear model combination and only exchanged parts of the traditional architecture.

Neural machine translation (NMT) has overcome this separation by using a single large neural net that directly transforms the source sentence into the target sentence (Cho et al., 2014a; Sutskever et al., 2014; Bahdanau et al., 2015). The advent of NMT certainly marks one of the major milestones in the history of MT, and has led to a radical and sudden departure of mainstream research from many previous research lines. This is perhaps best reflected by the explosion of scientific publications related to NMT in the past years (Fig. 1), and the large number of publicly available NMT toolkits (Tab. 1). NMT has already been widely adopted in industry (Wu et al., 2016; Crego et al., 2016; Schmidt and Marg, 2018; Levin et al., 2017) and is deployed in production systems by Google, Microsoft, Facebook, Amazon, SDL, Yandex, and many more. This article will introduce the basic concepts of NMT, and will give a comprehensive overview of current research in the field. For even more insight into the field of neural machine translation, we refer the reader to other overview papers such as (Neubig, 2017; Cromieres et al., 2017; Koehn, 2017; Popescu-Belis, 2019).

## 1 Nomenclature

We will denote the source sentence of length $I$ as $x$. We use the subscript $i$ to index tokens in the source sentence. We refer to the source language vocabulary as $\Sigma_{\text{src}}$.

$$x = x_1^I = (x_1, \dots, x_I) \in \Sigma_{\text{src}}^I \tag{1}$$

The translation of the source sentence $x$ into the target language is denoted as $y$. We use an analogous nomenclature on the target side.

$$y = y_1^J = (y_1, \dots, y_J) \in \Sigma_{\text{trg}}^J \tag{2}$$

In case we deal with only one language we drop the subscript src/trg. For convenience we represent tokens as indices in a list of subwords or word surface forms. Therefore, $\Sigma_{\text{src}}$ and $\Sigma_{\text{trg}}$ are sets of the first natural numbers (i.e. $\Sigma = \{1, \dots, |\Sigma|\}$ where $|\Sigma|$ is the vocabulary size). Additionally, we use the projection function $\pi_k$ which maps a tuple or vector to its $k$-th entry:

$$\pi_k(z_1, \dots, z_k, \dots, z_n) = z_k. \tag{3}$$

For a matrix $A \in \mathbb{R}^{m \times n}$ we denote the element in the $p$-th row and the $q$-th column as $A_{p,q}$, the $p$-th row vector as $A_{p,:}$, and the $q$-th column vector as $A_{:,q}$. For a series of $m$-dimensional vectors $a_1, \dots, a_n$ ($a_q \in \mathbb{R}^m$) we denote the matrix which results from stacking the vectors horizontally as $A = (a_1, \dots, a_n)$, as illustrated with the following tautology:

$$A = (A_{p,:})_{p=1:m} = \big((A_{:,q})_{q=1:n}\big)^\top. \tag{4}$$
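The row/column notation above can be checked numerically. The following NumPy sketch, using a toy $2\times 3$ matrix, is purely illustrative:

```python
import numpy as np

# Illustration of Eq. 4: a matrix can be rebuilt either from its
# row vectors A_{p,:} or from its column vectors A_{:,q}.
A = np.arange(6.0).reshape(2, 3)            # a toy 2x3 matrix

rows = [A[p, :] for p in range(A.shape[0])]  # row vectors A_{p,:}
cols = [A[:, q] for q in range(A.shape[1])]  # column vectors A_{:,q}

A_from_rows = np.stack(rows, axis=0)         # (A_{p,:})_{p=1:m}
A_from_cols = np.stack(cols, axis=0).T       # ((A_{:,q})_{q=1:n})^T

assert np.array_equal(A, A_from_rows)
assert np.array_equal(A, A_from_cols)
```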

## 2 Word Embeddings

Representing words or phrases as continuous vectors is arguably one of the keys in connectionist models for NLP. To the best of our knowledge, continuous space word representations were first successfully used for language modelling (Bellegarda, 1997; Bengio et al., 2003). The key idea is to represent a word $w$ as a $d$-dimensional vector of real numbers. The size $d$ of the embedding layer is normally chosen to be much smaller than the vocabulary size ($d \ll |\Sigma|$) in order to obtain interesting representations. The mapping from a word to its distributed representation can be represented by an embedding matrix $E \in \mathbb{R}^{d \times |\Sigma|}$ (Collobert and Weston, 2008). The $w$-th column of $E$ (denoted as $E_{:,w}$) holds the $d$-dimensional representation for the word $w$.

Learned continuous word representations have the potential of capturing morphological, syntactic and semantic similarity across words (Collobert and Weston, 2008). In neural machine translation, embedding matrices are usually trained jointly with the rest of the network using backpropagation (Rumelhart et al., 1988) and a gradient based optimizer such as stochastic gradient descent. In other areas of NLP, pre-trained word embeddings trained on unlabelled text have become ubiquitous (Collobert et al., 2011). Methods for training word embeddings on raw text often take the context into account in which the word occurs frequently (Pennington et al., 2014; Mikolov et al., 2013a), or use cross-lingual information to improve embeddings (Mikolov et al., 2013b; Upadhyay et al., 2016).
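A minimal sketch of the embedding lookup described above. The vocabulary size, dimensionality, and random (untrained) matrix `E` are toy stand-ins; in a real NMT system $E$ is trained jointly with the rest of the network:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4        # toy values; real systems use e.g. d = 512

# Embedding matrix E with one d-dimensional column per vocabulary entry.
# Here the entries are random; training would move them to useful values.
E = rng.normal(size=(d, vocab_size))

def embed(token_ids):
    """Look up the column of E for each token index."""
    return E[:, token_ids]   # shape (d, number of tokens)

sentence = [3, 1, 4]         # tokens represented as vocabulary indices
X = embed(sentence)
assert X.shape == (d, len(sentence))
```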

A newly emerging type of contextualized word embeddings (Peters et al., 2017; McCann et al., 2017) is gaining popularity in various fields of NLP. Contextualized representations depend not only on the word itself but also on the entire input sentence. Thus, they cannot be described by a single embedding matrix but are usually generated by neural sequence models which have been trained under a language model objective. Most approaches use either LSTM (Peters et al., 2017, 2018) or Transformer architectures (Radford et al., 2018; Devlin et al., 2019) but differ in the way these architectures are used to compute the word representations. Contextualized word embeddings have advanced the state of the art in several NLP benchmarks (Peters et al., 2018; Bowman et al., 2018; Devlin et al., 2019). Goldberg (2019) showed that contextualized embeddings are remarkably sensitive to syntax. Choi et al. (2017) reported gains from contextualizing word embeddings in NMT using a bag of words.

## 3 Phrase Embeddings

For various NLP tasks such as sentiment analysis or MT it is desirable to embed whole phrases or sentences instead of single words. For example, a distributed representation of the source sentence $x$ could be used as conditional for the distribution over the target sentences $y$. Early approaches to phrase embedding were based on recursive autoencoders (Pollack, 1990; Socher et al., 2011). To represent a phrase as a $d$-dimensional vector, Socher et al. (2011) first trained a word embedding matrix $E$. Then, they recursively applied an autoencoder network which finds $d$-dimensional representations for $2d$-dimensional inputs, where the input is the concatenation of two parent representations. The parent representations are either word embeddings or representations calculated by the same autoencoder from two different parents. The order in which representations are merged is determined by a binary tree over $x$ which can be constructed greedily (Socher et al., 2011) or derived from an Inversion Transduction Grammar (Wu, 1997, ITG) (Li et al., 2013). Fig. 1(a) shows an example of a recursive autoencoder embedding a phrase with five words into a four-dimensional space. One of the disadvantages of recursive autoencoders is that the word and sentence embeddings need to have the same dimensionality. This restriction is not very critical in sentiment analysis because the sentence representation is only used to extract the sentiment of the writer (Socher et al., 2011). In MT, however, the sentence representations need to convey enough information to condition the target sentence distribution on them, and thus should be higher dimensional than the word embeddings.
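The recursive merging scheme can be sketched as follows. The encoder weights are random stand-ins for a trained autoencoder, and a greedy left-to-right merge order stands in for the learned binary tree; only the shapes and the merging mechanics are meant to be instructive:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                # shared word/phrase dimension

# Toy autoencoder *encoder*: maps a concatenated pair of d-dimensional
# parent representations back to d dimensions. Weights are random
# stand-ins; a real model trains them to minimize reconstruction error.
We = rng.normal(size=(d, 2 * d))

def merge(left, right):
    return np.tanh(We @ np.concatenate([left, right]))

words = [rng.normal(size=d) for _ in range(5)]   # five word embeddings

# Merge order as a left-branching binary tree (a stand-in for the
# greedily constructed or ITG-derived tree).
phrase = words[0]
for w in words[1:]:
    phrase = merge(phrase, w)

# The phrase embedding has the same dimension as the word embeddings --
# exactly the restriction discussed in the text.
assert phrase.shape == (d,)
```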

## 4 Sentence Embeddings

Kalchbrenner and Blunsom (2013) used convolution to find vector representations of phrases or sentences and thus avoided the dimensionality issue of recursive autoencoders. As shown in Fig. 1(b), their model yields $n$-gram representations at each convolution level, with $n$ increasing with depth. The top level can be used as representation for the whole sentence. Other notable examples of using convolution for sentence representations include (Kalchbrenner et al., 2014; Kim, 2014; Mou et al., 2016; dos Santos and Gatti, 2014; Er et al., 2016). However, the convolution operations in these models lose information about the exact word order and are thus more suitable for sentiment analysis than for tasks like machine translation. A recent line of work uses self-attention rather than convolution to find sentence representations (Shen et al., 2018a; Wu et al., 2018b; Zhang et al., 2018b). Another interesting idea explored by Yu et al. (2018) is to resort to (recursive) relation networks (Santoro et al., 2017; Palm et al., 2018) which repeatedly aggregate pairwise relations between words in the sentence. Recurrent architectures are also commonly used for sentence representation. It has been noted that even random RNNs without any training can work surprisingly well for several NLP tasks (Conneau et al., 2017a, 2018; Wieting and Kiela, 2019).

## 5 Encoder-Decoder Networks with Fixed Length Sentence Encodings

Kalchbrenner and Blunsom (2013) were the first to condition the target sentence distribution on a distributed fixed-length representation of the source sentence. Their recurrent continuous translation models (RCTM) I and II gave rise to a new family of so-called encoder-decoder networks, which is the currently prevailing architecture for NMT. Encoder-decoder networks are subdivided into an encoder network which computes a representation of the source sentence, and a decoder network which generates the target sentence from that representation. As introduced in Sec. 1, we denote the source sentence as $x$ and the target sentence as $y$. All existing NMT models define a probability distribution over the target sentences by factorizing it into conditionals:

$$P(y|x) \overset{\text{Chain rule}}{=} \prod_{j=1}^{J} P(y_j \mid y_1^{j-1}, x). \tag{5}$$
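The factorization in Eq. 5 can be illustrated with made-up per-step conditionals for a three-token target sentence (the probability values are invented purely for the arithmetic):

```python
import math

# Hypothetical per-step conditionals P(y_j | y_1^{j-1}, x) for a
# 3-token target sentence; the values are made up for illustration.
step_probs = [0.7, 0.5, 0.9]

# Eq. 5: the sentence probability is the product of the conditionals.
p_sentence = math.prod(step_probs)
assert abs(p_sentence - 0.315) < 1e-12

# In practice the product is accumulated in log space to avoid underflow
# for long sentences.
log_p = sum(math.log(p) for p in step_probs)
assert abs(math.exp(log_p) - p_sentence) < 1e-12
```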

Different encoder-decoder architectures differ vastly in how they model the distribution $P(y_j \mid y_1^{j-1}, x)$. We will first discuss encoder-decoder networks in which the encoder represents the source sentence as a fixed-length vector, like the methods in Sec. 4. The conditionals are modelled as:

$$P(y_j \mid y_1^{j-1}, x) = g(y_j \mid s_j, y_{j-1}, c(x)) \tag{6}$$

where $s_j$ is the hidden state of a recurrent neural (decoder) network (RNN). We will formally introduce $s_j$ in Sec. 6.3. Gated activation functions such as the long short-term memory (Hochreiter and Schmidhuber, 1997, LSTM) or the gated recurrent unit (Cho et al., 2014b, GRU) are commonly used to alleviate the vanishing gradient problem (Hochreiter et al., 2001) which makes it difficult to train RNNs to capture long-range dependencies. Deep architectures with stacked LSTM cells were used by Sutskever et al. (2014). The encoder can be a convolutional network as in the RCTM I (Kalchbrenner and Blunsom, 2013), an LSTM network (Sutskever et al., 2014), or a GRU network (Cho et al., 2014b). $g(\cdot)$ is a feedforward network with a softmax layer at the end which takes as input the decoder state $s_j$ and an embedding of the previous target token $y_{j-1}$. In addition, $g(\cdot)$ may also take the source sentence encoding $c(x)$ as input to condition on the source sentence (Kalchbrenner and Blunsom, 2013; Cho et al., 2014b). Alternatively, $c(x)$ is just used to initialize the decoder state $s_0$ (Sutskever et al., 2014; Bahdanau et al., 2015). Fig. 3 contrasts both methods. Intuitively, once the source sentence has been encoded, the decoder starts generating the first target sentence symbol $y_1$, which is then fed back to the decoder network for producing the second symbol $y_2$. The algorithm terminates when the network produces the end-of-sentence symbol </s>. Sec. 7 explains more formally what we mean by the network “generating” a symbol and sheds more light on the aspect of decoding in NMT. Fig. 4 shows the complete architecture of Sutskever et al. (2014) who presented one of the first working standalone NMT systems that did not rely on any SMT baseline. One of the reasons why this paper was groundbreaking is the simplicity of the architecture, which stands in stark contrast to traditional SMT systems that used a very large number of highly engineered features.
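The generation loop described above can be sketched as follows. `encode`, `update_state`, and `output_dist` are hypothetical stand-ins for the encoder, the recurrent decoder, and the output network $g(\cdot)$, and are mocked with a scripted distribution for illustration:

```python
# Schematic greedy decoding loop for an encoder-decoder model with a
# fixed-length encoding (Eq. 6). The three callables are hypothetical
# stand-ins for real network components.

def greedy_decode(encode, update_state, output_dist, x, eos, max_len=50):
    c = encode(x)                            # fixed-length encoding c(x)
    s, y_prev, output = c, None, []          # c also initializes the state
    for _ in range(max_len):
        s = update_state(s, y_prev, c)
        dist = output_dist(s, y_prev, c)     # g(. | s_j, y_{j-1}, c(x))
        y = max(dist, key=dist.get)          # pick the most probable token
        if y == eos:                         # stop at end-of-sentence
            break
        output.append(y)
        y_prev = y                           # feed the token back in
    return output

# Toy mocks: predict "a" twice, then the end-of-sentence symbol.
script = iter([{"a": 0.9, "</s>": 0.1}] * 2 + [{"</s>": 1.0}])
out = greedy_decode(lambda x: 0, lambda s, y, c: s,
                    lambda s, y, c: next(script), x=[1, 2], eos="</s>")
assert out == ["a", "a"]
```

Greedy selection is only one decoding strategy; as noted above, decoding is discussed more formally in Sec. 7.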

Different ways of providing the source sentence to the encoder network have been explored in the past. Cho et al. (2014b) fed the tokens to the encoder in the natural order they appear in the source sentence (cf. Fig. 4(a)). Sutskever et al. (2014) reported gains from simply feeding the sequence in reversed order (cf. Fig. 4(b)). They argue that these improvements might be “caused by the introduction of many short term dependencies to the dataset” (Sutskever et al., 2014). Bidirectional RNNs (Schuster and Paliwal, 1997, BiRNN) are able to capture both directions (cf. Fig. 4(c)) and are often used in attentional NMT (Bahdanau et al., 2015).

## 6 Attentional Encoder-Decoder Networks

### 6.1 Attention

One problem of early NMT models which is still not fully solved (see Sec. 10.1) is that they often produced poor translations for long sentences (Sountsov and Sarawagi, 2016). Cho et al. (2014a) suggested that this weakness is due to the fixed-length source sentence encoding. Sentences with varying length convey different amounts of information. Therefore, despite being appropriate for short sentences, a fixed-length vector “does not have enough capacity to encode a long sentence with complicated structure and meaning” (Cho et al., 2014a). Pouget-Abadie et al. (2014) tried to mitigate this problem by chopping the source sentence into short clauses. They composed the target sentence by concatenating the separately translated clauses. However, this approach does not cope well with long-distance reorderings, as word reorderings are only possible within a clause. Bahdanau et al. (2015) introduced the concept of attention to avoid having a fixed-length source sentence representation. Their model no longer uses a constant context vector $c(x)$ that encodes the whole source sentence. Instead, the attentional decoder can place its attention only on parts of the source sentence which are useful for producing the next token. The constant context vector is thus replaced by a series of context vectors $c_j(x)$; one for each time step $j$.

We will first introduce attention as a general concept before describing the architecture of Bahdanau et al. (2015) in detail in Sec. 6.3. We follow the terminology of Vaswani et al. (2017) and describe attention as mapping $n$ query vectors to output vectors via a mapping table (or a memory) of $m$ key-value pairs. This view is related to memory-augmented neural networks which we will discuss in greater detail in Sec. 13.3. We make the simplifying assumption that all vectors have the same dimension $d$ so that we can stack the vectors into matrices $K \in \mathbb{R}^{m \times d}$, $V \in \mathbb{R}^{m \times d}$, and $Q \in \mathbb{R}^{n \times d}$. Intuitively, for each query vector we compute an output vector as a weighted sum of the value vectors. The weights are determined by a similarity score between the query vector and the keys (cf. (Vaswani et al., 2017, Eq. 1)):

$$\underbrace{\text{Attention}(K,V,Q)}_{n \times d} = \text{Softmax}\big(\underbrace{\text{score}(Q,K)}_{n \times m}\big)\,\underbrace{V}_{m \times d}. \tag{7}$$

The output of $\text{score}(\cdot,\cdot)$ is an $n \times m$ matrix of similarity scores. The softmax function normalizes over the columns of that matrix so that the weights for each query vector sum up to one. A straightforward choice for $\text{score}(\cdot,\cdot)$ proposed by Luong et al. (2015b) is the dot product ($\text{score}(Q,K) = QK^\top$). The most common scoring functions are summarized in Tab. 2.
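A minimal NumPy rendering of Eq. 7 with the dot-product scoring function; the matrix sizes are toy values chosen only to make the shapes visible:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(K, V, Q):
    """Eq. 7 with the dot-product score of Luong et al. (2015b)."""
    scores = Q @ K.T                  # (n x m) similarity scores
    weights = softmax(scores, axis=-1)
    return weights @ V                # (n x d) outputs

rng = np.random.default_rng(0)
m, n, d = 5, 3, 4                     # memory size, #queries, dimension
K, V, Q = (rng.normal(size=s) for s in [(m, d), (m, d), (n, d)])

out = attention(K, V, Q)
assert out.shape == (n, d)
# After the softmax, the weights for each query vector sum to one.
assert np.allclose(softmax(Q @ K.T).sum(axis=-1), 1.0)
```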

A common way to use attention in NMT is at the interface between encoder and decoder. Bahdanau et al. (2015); Luong et al. (2015b) used the hidden decoder states as query vectors. Both the key and value vectors are derived from the hidden states of a recurrent encoder. Formally, this means that the decoder states $s_j$ ($j \in [1,J]$) are the query vectors, where $J$ is the target sentence length, the encoder states $h_i$ ($i \in [1,I]$) are the key and value vectors, and $I$ is the source sentence length. The outputs of the attention layer are used as time-dependent context vectors $c_j(x)$. In other words, rather than using a fixed-length sentence encoding as in Sec. 5, at each time step $j$ we query a memory with $I$ entries which store (context-sensitive) representations of the source words. In this setup it is possible to derive an attention matrix $A \in \mathbb{R}^{J \times I}$ to visualize the learned relations between words in the source sentence and words in the target sentence:

$$A := \text{Softmax}\big(\text{score}((s_j)_{j=1:J}, (h_i)_{i=1:I})\big). \tag{8}$$

Fig. 6 shows an example of an attention matrix $A$ from an English-German NMT system with additive attention. The attention matrix captures cross-lingual word relationships such as the correspondence between “is” and “ist”. The system has learned that the English source word “is” is relevant for generating the German target word “ist” and thus emits a high attention weight for this pair. Consequently, the context vector at that time step mainly represents the source word “is”. This is particularly significant as the system was not explicitly trained to align words but to optimize translation performance. However, as we will argue in Sec. 12.4, it would be wrong to think of $A$ as a soft version of a traditional SMT word alignment.

An important generalization of attention is multi-head attention proposed by Vaswani et al. (2017). The idea is to perform $H$ attention operations instead of a single one, where $H$ is the number of attention heads (usually $H=8$). The query, key, and value vectors for the attention heads are linear transforms of $K$, $V$, and $Q$. The output of multi-head attention is the concatenation of the outputs of each attention head. The dimensionality of the attention heads is usually divided by $H$ to avoid increasing the number of parameters. Formally, it can be described as follows (Vaswani et al., 2017):

$$\text{MultiHead}(K,V,Q) = \text{Concat}(\text{head}_1, \dots, \text{head}_H)\,W^O \tag{9}$$

with weight matrix $W^O \in \mathbb{R}^{d \times d}$ where

$$\text{head}_h = \text{Attention}(K W^K_h, V W^V_h, Q W^Q_h) \tag{10}$$

with weight matrices $W^K_h, W^V_h, W^Q_h \in \mathbb{R}^{d \times d/H}$ for $h \in [1,H]$. Fig. 7 shows a multi-head attention module with three heads. Note that with multi-head attention it is not obvious anymore how to derive a single attention weight matrix $A$ as shown in Fig. 6. Therefore, models using multi-head attention tend to be more difficult to interpret.
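A sketch of multi-head attention following the shapes above; the per-head weight matrices and the output projection are random stand-ins, and only the shape bookkeeping (heads of dimension $d/H$, concatenation, projection) is meant to be instructive:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(K, V, Q):
    return softmax(Q @ K.T) @ V      # dot-product attention (Eq. 7)

def multi_head(K, V, Q, heads, Wo):
    """Eqs. 9-10: each head applies its own linear transforms; the
    head outputs are concatenated and projected by Wo."""
    outs = [attention(K @ Wk, V @ Wv, Q @ Wq) for Wk, Wv, Wq in heads]
    return np.concatenate(outs, axis=-1) @ Wo

rng = np.random.default_rng(0)
H, d = 4, 8                          # H heads of dimension d/H each
heads = [tuple(rng.normal(size=(d, d // H)) for _ in range(3))
         for _ in range(H)]          # random stand-in weights per head
Wo = rng.normal(size=(d, d))         # output projection W^O

K, V, Q = (rng.normal(size=(5, d)) for _ in range(3))
out = multi_head(K, V, Q, heads, Wo)
assert out.shape == Q.shape          # one d-dimensional output per query
```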

The concept of attention is no longer just a technique to improve sentence lengths in NMT. Since its introduction by Bahdanau et al. (2015) it has become a vital part of various NMT architectures, culminating in the Transformer architecture (Sec. 6.5) which is entirely attention-based. Attention has also been proven effective for, inter alia, object recognition (Larochelle and Hinton, 2010; Ba et al., 2014; Mnih et al., 2014), image caption generation (Xu et al., 2015), video description (Yao et al., 2015), speech recognition (Chorowski et al., 2014; Chan et al., 2016), cross-lingual word-to-phone alignment (Duong et al., 2016), bioinformatics (Sønderby et al., 2015), text summarization (Rush et al., 2015), text normalization (Sproat and Jaitly, 2016), grammatical error correction (Yuan and Briscoe, 2016), question answering (Hermann et al., 2015; Yang et al., 2016; Sukhbaatar et al., 2015), natural language understanding and inference (Dong and Lapata, 2016; Shen et al., 2018a; Im and Cho, 2017; Liu et al., 2016), uncertainty detection (Adel and Schütze, 2017), photo optical character recognition (Lee and Osindero, 2016), and natural language conversation (Shang et al., 2015).

### 6.2 Attention Masks and Padding

NMT usually groups sentences into batches to make more efficient use of the available hardware and to reduce noise in gradient estimation (cf. Sec. 11.1). However, the central data structure in many machine learning frameworks (Bastien et al., 2012; Abadi et al., 2016) is the tensor – a multi-dimensional array with fixed dimensionality. Re-arranging the source sentences of a batch as a tensor often results in some unused space, as the sentences may vary in length. In practice, shorter sentences are filled up with a special padding symbol <pad> to match the length of the longest sentence in the batch (Fig. 8). Most implementations work with masks to avoid taking padded positions into account when computing the training loss. Attention layers also have to be restricted to non-padding symbols, which is usually realized by multiplying the attention weights by a mask that sets the attention weights for padding symbols to zero.
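The padding and masking scheme can be sketched as follows; the padding index and the zero dummy scores are assumptions made for illustration (an equivalent formulation adds $-\infty$ to the scores before the softmax, which is what is shown here):

```python
import numpy as np

PAD = 0   # assumed padding index for this sketch

def pad_batch(sentences):
    """Right-pad token index sequences to the length of the longest one."""
    max_len = max(len(s) for s in sentences)
    batch = np.full((len(sentences), max_len), PAD, dtype=int)
    for r, s in enumerate(sentences):
        batch[r, :len(s)] = s
    return batch

batch = pad_batch([[4, 7, 9], [5, 2], [3, 8, 6, 1]])
assert batch.shape == (3, 4)

# Mask: 1 for real tokens, 0 for <pad>. Scores at padded positions are
# pushed to -inf so they receive zero weight after the softmax.
mask = (batch != PAD)
scores = np.zeros(batch.shape)               # dummy attention scores
masked = np.where(mask, scores, -np.inf)
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)

assert np.allclose(weights.sum(axis=-1), 1.0)    # still a distribution
assert weights[1, 2] == 0.0 and weights[1, 3] == 0.0  # padded positions
```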

### 6.3 Recurrent Neural Machine Translation

This section contains a complete formal description of the RNNsearch architecture of Bahdanau et al. (2015) which was the first NMT model using attention. Recall that NMT uses the chain rule to decompose the probability of a target sentence given a source sentence into left-to-right conditionals (Eq. 5). RNNsearch models the conditionals as follows (Bahdanau et al., 2015, Eq. 2,4):

$$P(y|x) \overset{\text{Eq. 5}}{=} \prod_{j=1}^{J} P(y_j \mid y_1^{j-1}, x) = \prod_{j=1}^{J} g(y_j \mid y_{j-1}, s_j, c_j(x)). \tag{11}$$

Similarly to Eq. 6, the function $g(\cdot)$ encapsulates the decoder network which computes the distribution for the next target token $y_j$ given the last produced token $y_{j-1}$, the RNN decoder state $s_j$, and the context vector $c_j(x)$. We denote the sizes of the encoder and decoder hidden layers with $n$ and $m$, respectively. The context vector $c_j(x)$ is a distributed representation of the relevant parts of the source sentence. In NMT without attention (Sutskever et al., 2014; Cho et al., 2014b) (Sec. 5), the context vector is constant and thus needs to encode the whole source sentence. Adding an attention mechanism results in different context vectors for each target sentence position $j$. This effectively addresses issues in NMT due to the limited capacity of a fixed context vector, as illustrated in Fig. 9.

As outlined in Sec. 6.1, the context vectors are weighted sums of source sentence annotations $h_1, \dots, h_I$. The annotations are produced by the encoder network. In other words, the encoder converts the input sequence $x_1, \dots, x_I$ to a sequence of annotations of the same length. Each annotation $h_i$ encodes information about the entire source sentence “with a strong focus on the parts surrounding the $i$-th word of the input sequence” (Bahdanau et al., 2015, Sec. 3.1). RNNsearch uses a bidirectional RNN (Schuster and Paliwal, 1997, BiRNN) to generate the annotations. A BiRNN consists of two independent RNNs. The forward RNN $\overrightarrow{f}$ reads $x$ in the original order (from $x_1$ to $x_I$). The backward RNN $\overleftarrow{f}$ consumes $x$ in reversed order (from $x_I$ to $x_1$):

$$\overrightarrow{h}_i = \overrightarrow{f}(x_i, \overrightarrow{h}_{i-1}) \tag{12}$$
$$\overleftarrow{h}_i = \overleftarrow{f}(x_i, \overleftarrow{h}_{i+1}). \tag{13}$$

The RNNs $\overrightarrow{f}$ and $\overleftarrow{f}$ are usually LSTM (Hochreiter and Schmidhuber, 1997) or GRU (Cho et al., 2014b) cells. The annotation $h_i$ is the concatenation of the hidden states $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ (Bahdanau et al., 2015, Sec. 3.2):

$$h_i = [\overrightarrow{h}_i^\top; \overleftarrow{h}_i^\top]^\top. \tag{14}$$

The context vectors $c_j(x)$ are computed from the annotations as weighted sums with weights $\alpha_{j,i}$ (Bahdanau et al., 2015, Eq. 5):

$$c_j(x) = \sum_{i=1}^{I} \alpha_{j,i} h_i. \tag{15}$$

The weights $\alpha_{j,i}$ are determined by the alignment model $a(\cdot,\cdot)$:

$$\alpha_{j,i} = \frac{1}{Z}\exp\big(a(s_{j-1}, h_i)\big) \quad \text{with} \quad Z = \sum_{k=1}^{I}\exp\big(a(s_{j-1}, h_k)\big) \tag{16}$$

where $a(\cdot,\cdot)$ is a feedforward neural network which estimates the importance of annotation $h_i$ for producing the $j$-th target token given the current decoder state $s_{j-1}$. In the terminology of Sec. 6.1, the annotations $h_i$ represent the keys and values, the decoder states $s_{j-1}$ are the queries, and $a(\cdot,\cdot)$ is the attention scoring function.
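Eqs. 15-16 can be sketched directly. The additive alignment network below (a one-hidden-layer feedforward net with assumed weight shapes, and a decoder state of the same size as the annotations for simplicity) is a toy stand-in for the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
I, n = 6, 8                          # source length, annotation size

# Toy additive alignment model a(s, h): a one-hidden-layer feedforward
# net with assumed weights W, U and scoring vector v (cf. Tab. 2).
W, U, v = (rng.normal(size=(n, n)), rng.normal(size=(n, n)),
           rng.normal(size=n))

def a(s, h):
    return v @ np.tanh(W @ s + U @ h)

h = rng.normal(size=(I, n))          # annotations h_1..h_I
s_prev = rng.normal(size=n)          # previous decoder state s_{j-1}

# Eq. 16: softmax-normalized alignment scores.
e = np.array([a(s_prev, h[i]) for i in range(I)])
alpha = np.exp(e - e.max()) / np.exp(e - e.max()).sum()
assert np.isclose(alpha.sum(), 1.0)

# Eq. 15: the context vector is the alpha-weighted sum of annotations.
c = alpha @ h
assert c.shape == (n,)
```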

The function $g(\cdot)$ in Eq. 11 does not only take the previous target token $y_{j-1}$ and the context vector $c_j(x)$ but also the decoder hidden state $s_j$, which is updated recurrently:

$$s_j = f(s_{j-1}, y_{j-1}, c_j) \tag{17}$$

where $f(\cdot)$ is modelled by a GRU or LSTM cell. The function $g(\cdot)$ is defined as follows:

$$g(y_j \mid y_{j-1}, s_j, c_j) \propto \exp\big(W_o \max(t_j, u_j)\big) \tag{18}$$

with

$$t_j = T_s s_j + T_y E y_{j-1} + T_c c_j \tag{19}$$
$$u_j = U_s s_j + U_y E y_{j-1} + U_c c_j \tag{20}$$

where $\max(\cdot,\cdot)$ is the element-wise maximum, and $W_o$, $T_s$, $T_y$, $T_c$, $U_s$, $U_y$, $U_c$ are weight matrices. The definition of $g(\cdot)$ can be seen as connecting the output of the recurrent layer, an embedding of the previous target token, and the context vector with a single maxout layer (Goodfellow et al., 2013b) and using a softmax over the target language vocabulary (Bahdanau et al., 2015). Fig. 10 illustrates the complete RNNsearch model.
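The output layer of Eqs. 18-20 can be sketched with toy dimensions; all sizes and random weights below are assumptions made only to show how the maxout and softmax combine:

```python
import numpy as np

rng = np.random.default_rng(0)
n, e, V = 6, 4, 10          # state size, embedding size, vocab size (toy)
l = 5                        # maxout layer size (assumed)

s_j = rng.normal(size=n)     # decoder state
Ey = rng.normal(size=e)      # embedding E y_{j-1} of the previous token
c_j = rng.normal(size=n)     # context vector

# Random stand-in weight matrices with the shapes implied by Eqs. 19-20.
Ts, Ty, Tc = (rng.normal(size=(l, k)) for k in (n, e, n))
Us, Uy, Uc = (rng.normal(size=(l, k)) for k in (n, e, n))
Wo = rng.normal(size=(V, l))

t = Ts @ s_j + Ty @ Ey + Tc @ c_j            # Eq. 19
u = Us @ s_j + Uy @ Ey + Uc @ c_j            # Eq. 20
logits = Wo @ np.maximum(t, u)               # Eq. 18: element-wise max
probs = np.exp(logits - logits.max())
probs /= probs.sum()                         # softmax over the vocabulary
assert np.isclose(probs.sum(), 1.0) and probs.shape == (V,)
```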

### 6.4 Convolutional Neural Machine Translation

Although convolutional neural networks (CNNs) were first proposed by Waibel et al. (1989) for phoneme recognition, their traditional use case is computer vision (LeCun et al., 1989, 1990, 1998). CNNs are especially useful for processing images for two reasons. First, they use a high degree of weight tying and thus reduce the number of parameters dramatically compared to fully connected networks. This is crucial for high dimensional input like visual imagery. Second, they automatically learn space invariant features. Spatial invariance is desirable in vision since we often aim to recognize objects or features regardless of their exact position in the image. In NLP, convolutions are usually one dimensional since we are dealing with sequences rather than two dimensional images as in computer vision. We will therefore limit our discussion to the one dimensional case. We will also exclude concepts like pooling or strides as they are uncommon for sequence models in NLP.

The input to a 1D convolutional layer is a sequence of $M$-dimensional vectors $u_1, \dots, u_L$. The literature about CNNs usually refers to the $M$ dimensions in each $u_i$ ($i \in [1,L]$) as channels, and to the $i$-axis as the spatial dimension. The convolution transforms the input sequence to a sequence of $N$-dimensional vectors of the same length by moving a kernel of width $K$ over the input sequence. The kernel is a linear transform which maps the $K$-gram $u_i, \dots, u_{i+K-1}$ to the output $v_i$ for $i \in [1,L]$ (we append $K-1$ padding symbols to the input). Standard convolution parameterizes this linear transform with a full weight matrix $W^{\text{std}}$:

$$\text{StdConv:}\quad (v_i)_n = \sum_{m=1}^{M} \sum_{k=0}^{K-1} W^{\text{std}}_{kM+m,n}\,(u_{i+k})_m \tag{21}$$

with $W^{\text{std}} \in \mathbb{R}^{KM \times N}$ and $n \in [1,N]$. Standard convolution represents two kinds of dependencies: spatial dependency (inner sum in Eq. 21) and cross-channel dependency (outer sum in Eq. 21). Pointwise and depthwise convolution factor out these dependencies into two separate operations:

$$\text{PointwiseConv:}\quad (v_i)_n = \sum_{m=1}^{M} W^{\text{pw}}_{m,n}\,(u_i)_m = u_i W^{\text{pw}} \tag{22}$$
$$\text{DepthwiseConv:}\quad (v_i)_n = \sum_{k=0}^{K-1} W^{\text{dw}}_{k,n}\,(u_{i+k})_n \tag{23}$$

where $W^{\text{pw}} \in \mathbb{R}^{M \times N}$ and $W^{\text{dw}} \in \mathbb{R}^{K \times N}$ are weight matrices. Fig. 11 illustrates the differences between these types of convolution. The idea behind depthwise separable convolution is to replace standard convolution with depthwise convolution followed by pointwise convolution. As shown in Tab. 3, the decomposition into two simpler steps reduces the number of parameters and has been shown to make more efficient use of the parameters than regular convolution in vision (Chollet, 2017; Howard et al., 2017).
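The parameter savings follow directly from the weight-matrix shapes above. With illustrative (not benchmark-derived) sizes:

```python
# Parameter counts for the convolution variants, from the weight-matrix
# shapes in Eqs. 21-23. K: kernel width, M: input channels, N: output
# channels. The concrete numbers are illustrative only.
K, M, N = 3, 512, 512

std_params = K * M * N               # W_std in R^{KM x N}
pw_params = M * N                    # W_pw in R^{M x N}
dw_params = K * N                    # W_dw in R^{K x N}
sep_params = dw_params + pw_params   # depthwise separable convolution

assert std_params == 786432
assert sep_params == 263680
# Here the separable decomposition needs roughly 1/K of the parameters.
assert sep_params < std_params / 2
```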

Using convolution rather than recurrence in NMT models has several potential advantages. First, convolutional layers reduce sequential computation and are therefore easier to parallelize on GPU hardware. Second, their hierarchical structure connects distant words via a shorter path than sequential topologies (Gehring et al., 2017), which eases learning (Hochreiter et al., 2001). Both regular (Kalchbrenner et al., 2016; Gehring et al., 2017) and depthwise separable (Kaiser et al., 2017; Wu et al., 2019) convolution have been used for NMT in the past. Fig. 11(a) shows the general architecture for a fully convolutional NMT model such as ConvS2S (Gehring et al., 2017) or SliceNet (Kaiser et al., 2017) in which both encoder and decoder are convolutional. Stacking multiple convolutional layers increases the effective context size. In the decoder, we need to mask the receptive field of the convolution operations to make sure that the network has no access to future information (van den Oord et al., 2016). Encoder and decoder are connected via attention: Gehring et al. (2017) used attention into the encoder representations after each convolutional layer in the decoder.

### 6.5 Self-attention-based Neural Machine Translation

Recall that Eq. 5 states that NMT factorizes into conditionals . We have reviewed two ways to model the dependency on the source sentence in NMT: via a fixed-length sentence encoding (Sec. 5) or via time-dependent context vectors which are computed using attention (Sec. 6.1). We have also presented two ways to implement the dependency on the target sentence prefix : via a recurrent connection which passes through the decoder state to the next time step (Sec. 6.3) or via convolution (Sec. 6.4). A third option to model target side dependency is using self-attention. Using the terminology introduced in Sec. 6.1, decoder self-attention derives all three components (queries, keys, and values) from the decoder state. The decoder conditions on the translation prefix by attending to its own states from previous time steps. Besides machine translation, self-attention has been applied to various NLP tasks such as sentiment analysis (Cheng et al., 2016), natural language inference (Shen et al., 2018a; Parikh et al., 2016; Liu et al., 2016; Shen et al., 2018b), text summarization (Paulus et al., 2017), headline generation (Daniil et al., 2019), sentence embedding (Lin et al., 2017; Wu et al., 2018b; Zhang et al., 2018b), and reading comprehension (Hu et al., 2018). Similarly to convolution, self-attention introduces short paths between distant words and reduces the amount of sequential computation. Studies indicate that these short paths are especially useful for learning strong semantic feature extractors, but (perhaps somewhat counter-intuitively) less so for modelling long-range subject-verb agreement (Tang et al., 2018). Like in convolutional models we also need to mask future decoder states to prevent conditioning on future tokens (cf. Sec. 6.2). The general layout for self-attention-based NMT models is shown in Fig. 11(b). The first example of this new class of NMT models was the Transformer (Vaswani et al., 2017). 
The Transformer uses attention for three purposes: 1) within the encoder to enable context-sensitive word representations which depend on the whole source sentence, 2) between the encoder and the decoder as in previous models, and 3) within the decoder to condition on the current translation history. The Transformer uses multi-head attention (Sec. 6.1) rather than regular attention. Using multi-head attention has been shown to be essential for the Transformer architecture (Tang et al., 2018; Chen et al., 2018).
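The causal masking of decoder self-attention mentioned above can be sketched as follows; the states `S` are random stand-ins, and using them directly as queries, keys, and values (without the learned projections of a real Transformer) is a simplification for illustration:

```python
import numpy as np

def causal_mask(J):
    """Lower-triangular mask: position j may attend to positions <= j."""
    return np.tril(np.ones((J, J), dtype=bool))

def masked_self_attention(S):
    """Decoder self-attention over states S with future positions masked
    (no learned projections; queries = keys = values = S)."""
    scores = S @ S.T
    scores = np.where(causal_mask(len(S)), scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ S

S = np.random.default_rng(0).normal(size=(4, 8))
out = masked_self_attention(S)
assert out.shape == S.shape
# The first position can only attend to itself, so its output is s_1.
assert np.allclose(out[0], S[0])
```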

A challenge in self-attention-based models (and to some extent in convolutional models) is that vanilla attention as introduced in Sec. 6.1 by itself has no notion of order. The key-value pairs in the memory are accessed purely based on the correspondence between key and query (content-based addressing) and not based on the location of the key in the memory (location-based addressing). This is less of a problem in recurrent NMT (Sec. 6.3) as queries, keys, and values are derived from RNN states and already carry a strong sequential signal due to the RNN topology. In the Transformer architecture, however, recurrent connections are removed in favor of attention. Vaswani et al. (2017) tackled this problem using positional encodings. Positional encodings are (potentially partial) functions $\text{PE}: \mathbb{N} \to \mathbb{R}^D$ where $D$ is the word embedding size, i.e. they are $D$-dimensional representations of natural numbers. They are added to the (input and output) word embeddings to make them (and consequently the queries, keys, and values) position-sensitive. Vaswani et al. (2017) stacked sine and cosine functions of different frequencies to implement $\text{PE}(\cdot)$:

$$\text{PE}_{\sin}(n)_d = \begin{cases} \sin(10000^{-d/D}\,n) & \text{if } d \text{ is even} \\ \cos(10000^{-d/D}\,n) & \text{if } d \text{ is odd} \end{cases} \tag{24}$$

for $n \in \mathbb{N}$ and $d \in [1, D]$. Alternatively, positional encodings can be learned in an embedding matrix (Gehring et al., 2017):

$$\text{PE}_{\text{learned}}(n) = W_{:,n} \tag{25}$$

with weight matrix $W \in \mathbb{R}^{D \times n_{\max}}$ for some sufficiently large $n_{\max}$. The input to $\text{PE}(\cdot)$ is usually the absolute position of the word in the sentence (Vaswani et al., 2017; Gehring et al., 2017), but relative positioning is also possible (Shaw et al., 2018). We will give an overview of extensions to the Transformer architecture in Sec. 13.1.
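A small implementation of the sinusoidal encoding, following Eq. 24 as stated here (the exact indexing convention varies slightly between papers):

```python
import numpy as np

def pe_sin(n, D):
    """Sinusoidal positional encoding of position n (Eq. 24)."""
    d = np.arange(D)
    angle = 10000.0 ** (-d / D) * n
    # Even dimensions use sine, odd dimensions use cosine.
    return np.where(d % 2 == 0, np.sin(angle), np.cos(angle))

D = 8
emb = np.zeros(D)                     # stand-in word embedding
# Positional encodings are simply added to the word embeddings.
assert (emb + pe_sin(0, D)).shape == (D,)

# The encodings are bounded and differ across nearby positions, so the
# model can distinguish word positions without recurrence.
P = np.stack([pe_sin(n, D) for n in range(50)])
assert np.all(np.abs(P) <= 1.0)
assert not np.allclose(P[1], P[2])
```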

### 6.6 Comparison of the Fundamental Architectures

As outlined in the previous sections, NMT can come in one of three flavors: recurrent, convolutional, or self-attention-based. In this section, we will discuss three concrete architectures in greater detail – one of each flavor. For an empirical comparison see (Stahlberg et al., 2018b). Fig. 13 visualizes the data streams in Google’s Neural Machine Translation system (Wu et al., 2016, GNMT) as an example of a recurrent network, the convolutional ConvS2S model (Gehring et al., 2017), and the self-attention-based Transformer model (Vaswani et al., 2017) in plate notation. We excluded components like dropout (Srivastava et al., 2014), batch normalization (Ioffe and Szegedy, 2015), and layer normalization (Ba et al., 2016) to simplify the diagrams. All models fall in the general category of encoder-decoder networks, with the encoder in the left column and the decoder in the right column. Output probabilities are generated by a linear projection layer followed by a softmax activation at the end. They all use attention at each decoder layer to connect the encoder with the decoder, although the specifics differ. GNMT (Fig. 13(a)) uses regular attention, ConvS2S (Fig. 13(b)) adds the source word encodings to the values, and the Transformer (Fig. 13(c)) uses multi-head attention (Sec. 6.1). Residual connections (He et al., 2016b) are used in all three architectures to encourage gradient flow in multi-layer networks. Positional encodings are used in ConvS2S and the Transformer, but not in GNMT. An interesting fusion is the RNMT+ model (Chen et al., 2018) shown in Fig. 13(d) which reintroduces ideas from the Transformer like multi-head attention into recurrent NMT. Other notable mixed architectures include Gehring et al. (2017) who used a convolutional encoder with a recurrent decoder, Miculicich et al. (2018); Wang et al. (2019a); Werlen et al. (2018) who added self-attention connections to a recurrent decoder, Hao et al. (2019) who used a Transformer encoder and a recurrent encoder in parallel, and Lin et al. (2018) who equipped a recurrent decoder with a convolutional model to provide global target-side context.

## 7 Neural Machine Translation Decoding

### 7.1 The Search Problem in NMT

So far we have described how NMT defines the translation probability P(y|x). However, in order to apply these definitions directly, both the source sentence x and the target sentence y have to be given. They do not directly provide a method for generating a target sentence from a given source sentence, which is the ultimate goal in machine translation. The task of finding the most likely translation ŷ for a given source sentence x is known as the decoding or inference problem:

$$\hat{y} = \operatorname*{argmax}_{y \in \Sigma_{\mathrm{trg}}^*} P(y|x) \qquad (26)$$

NMT decoding is non-trivial for mainly two reasons. First, the search space is vast as it grows exponentially with the sequence length. For example, with a vocabulary size in the tens of thousands, the number of possible translations with 20 words or less already exceeds the number of atoms in the observable universe. Thus, complete enumeration of the search space is impossible. Second, as we will see in Sec. 10, certain types of model errors are very common in NMT. The mismatch between the most likely and the “best” translation has deep implications for search as more exhaustive search often leads to worse translations (Stahlberg and Byrne, 2019). We will discuss possible solutions to both problems in the remainder of Sec. 7.
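The exponential growth of the search space is easy to verify. The sketch below assumes an illustrative vocabulary size of 32,000 (actual NMT vocabularies vary between systems) and compares the number of output sequences of up to 20 tokens with the common rough estimate of 10^80 atoms in the observable universe:

```python
VOCAB_SIZE = 32_000  # illustrative; actual NMT vocabularies vary

# Number of distinct token sequences of length 1..20
num_sequences = sum(VOCAB_SIZE ** length for length in range(1, 21))

ATOMS_IN_UNIVERSE = 10 ** 80  # common rough estimate
assert num_sequences > ATOMS_IN_UNIVERSE  # complete enumeration is hopeless
```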

### 7.2 Greedy and Beam Search

The most popular decoding algorithms for NMT are greedy search and beam search. Both search procedures are based on the left-to-right factorization of NMT in Eq. 5. Translations are built up from left to right while partial translation prefixes are scored using the conditionals P(y_j|y_1^{j−1}, x). This means that both algorithms work in a time-synchronous manner: in each iteration j, partial hypotheses of (up to) length j are compared to each other, and a subset of them is selected for expansion in the next time step. The algorithms terminate if either all or the best of the selected hypotheses end with the end-of-sentence symbol </s> or if some maximum number of iterations is reached. Fig. 14 illustrates the difference between greedy search and beam search. Greedy search (highlighted in green) selects the single best expansion at each time step: ‘c’ in the first time step, then ‘a’, then ‘b’. However, greedy search is vulnerable to the so-called garden-path problem (Koehn, 2017). The algorithm selects ‘c’ in the first time step which turns out to be a mistake later on as subsequent distributions are very smooth and scores are comparably low. However, greedy decoding cannot correct this mistake later as it is already committed to this path. Beam search (highlighted in orange in Fig. 14) tries to mitigate the risk of the garden-path problem by passing not one but n possible translation prefixes to the next time step (n = 2 in Fig. 14). The hypotheses which survive a time step are called active hypotheses. At each time step, the accumulated path scores for all possible continuations of active hypotheses are compared, and the n best ones are selected. Thus, beam search does not only expand ‘c’ but also ‘b’ in time step 1, and thereby finds the high scoring translation prefix ‘ba’.
Note that although beam search seems to be the more accurate search procedure, it is not guaranteed to always find a translation with a score higher than or equal to that of greedy decoding.9 It is therefore still prone to the garden-path problem, although less so than greedy search. Stahlberg and Byrne (2019) demonstrated that even beam search suffers from a high number of search errors.

### 7.3 Formal Description of Decoding for the RNNsearch Model

In this section, we will formally define decoding for the RNNsearch model (Bahdanau et al., 2015). We will resort to the mathematical symbols used in Sec. 6.3 to describe the algorithms. First, the source annotations h_1, …, h_I are computed and stored, as this does not require any search. Then, we compute the distribution over the first target token (Alg. 1). The initial decoder state s_0 is often a linear transform of the last encoder hidden state h_I: s_0 = W h_I for some weight matrix W.

Greedy decoding selects the most likely target token according to the returned distribution and iteratively calls the decoder until the end-of-sentence symbol </s> is emitted (Alg. 2). We use the projection function (Eq. 3), which maps a posterior vector to one of its components.

The beam search strategy (Alg. 3) does not only keep the single best partial hypothesis but a set of n promising hypotheses, where n is the size of the beam. A partial hypothesis is represented by a 3-tuple consisting of the translation prefix, the accumulated score, and the last decoder state.
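A minimal sketch of this beam search, with hypotheses reduced to (prefix, accumulated log-score) pairs for brevity (a real implementation would also carry the decoder state as the third tuple element), might look as follows. The toy model and all names are our own and only serve to illustrate the garden-path behavior discussed above:

```python
import math

def beam_search(step_logprobs, vocab, beam_size, eos, max_len=20):
    """Sketch of time-synchronous beam search (cf. Alg. 3).

    `step_logprobs(prefix)` is assumed to return a dict with the
    log-probability of every token in `vocab` given the prefix.
    """
    beam = [((), 0.0)]  # hypotheses: (translation prefix, accumulated score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            logp = step_logprobs(prefix)
            for token in vocab:
                candidates.append((prefix + (token,), score + logp[token]))
        candidates.sort(key=lambda h: h[1], reverse=True)
        beam = []
        for hyp in candidates[:beam_size]:
            (finished if hyp[0][-1] == eos else beam).append(hyp)
        if not beam:  # all surviving hypotheses ended with </s>
            break
    return max(finished + beam, key=lambda h: h[1])

# Toy garden-path model: 'c' looks best at the first step, but the
# continuation of 'b' scores higher overall.
def toy_model(prefix):
    if prefix == ():
        return {"b": math.log(0.4), "c": math.log(0.5), "</s>": math.log(0.1)}
    if prefix[0] == "b":
        return {"b": math.log(0.05), "c": math.log(0.05), "</s>": math.log(0.9)}
    return {"b": math.log(0.3), "c": math.log(0.3), "</s>": math.log(0.4)}

vocab = ["b", "c", "</s>"]
greedy = beam_search(toy_model, vocab, beam_size=1, eos="</s>")  # follows 'c'
beam2 = beam_search(toy_model, vocab, beam_size=2, eos="</s>")   # recovers 'b'
assert beam2[1] > greedy[1]
```

Note that greedy search is simply the special case beam_size = 1, and that the beam of size 2 escapes the garden path because it keeps ‘b’ alive in the first time step.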

### 7.4 Ensembling

Ensembling (Dietterich, 2000; Hansen and Salamon, 1990) is a simple yet very effective technique to improve the accuracy of NMT. The basic idea is illustrated in Fig. 15. The decoder makes use of K NMT networks rather than only one, which are either trained independently (Sutskever et al., 2014; Neubig, 2016; Wu et al., 2016) or share some amount of training iterations (Sennrich et al., 2016a; Cromieres et al., 2016; Durrani et al., 2016). The ensemble decoder computes predictions for each of the K individual models, which are then combined using the arithmetic (Sutskever et al., 2014) or geometric (Cromieres et al., 2016) average:

$$S_{\mathrm{arith}}(y_j|y_1^{j-1}, x) = \frac{1}{K} \sum_{k=1}^{K} P_k(y_j|y_1^{j-1}, x) \qquad (27)$$
$$S_{\mathrm{geo}}(y_j|y_1^{j-1}, x) = \sum_{k=1}^{K} \log P_k(y_j|y_1^{j-1}, x) \qquad (28)$$

Both S_arith and S_geo can be used as drop-in replacements for the conditionals P(y_j|y_1^{j−1}, x) in Eq. 5. The arithmetic average is more sound as S_arith still forms a valid probability distribution which sums up to one. However, the geometric average is numerically more stable as log-probabilities can be combined directly without converting them to probabilities. Note that the core idea of ensembling is similar to language model interpolation used in statistical machine translation or speech recognition.
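The two combination schemes can be contrasted on a toy vocabulary. The distributions below are invented for illustration; the point is that the arithmetic average (Eq. 27) remains normalized while the geometric combination (Eq. 28) operates directly on log-probabilities and is unnormalized:

```python
import math

# Two hypothetical model distributions over a toy vocabulary {A, B}
P1 = {"A": 0.9, "B": 0.1}
P2 = {"A": 0.6, "B": 0.4}

# Eq. 27: arithmetic average -- still a valid probability distribution
S_arith = {w: (P1[w] + P2[w]) / 2 for w in P1}
assert abs(sum(S_arith.values()) - 1.0) < 1e-12

# Eq. 28: geometric combination in log space -- numerically stable because
# log-probabilities are added directly, but the scores are unnormalized
S_geo = {w: math.log(P1[w]) + math.log(P2[w]) for w in P1}
assert sum(math.exp(s) for s in S_geo.values()) < 1.0  # 0.54 + 0.04 = 0.58
```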

Ensembling consistently outperforms single NMT by a large margin. All top systems in recent machine translation evaluation campaigns ensemble a number of NMT systems (Bojar et al., 2016, 2017, 2018; Bojar, 2019; Sennrich et al., 2016a, 2017; Neubig, 2016; Cromieres et al., 2016; Durrani et al., 2016; Stahlberg et al., 2018b; Wang et al., 2017c; Junczys-Dowmunt, 2018b; Wang et al., 2018a), perhaps most famously taken to the extreme by the WMT18 submission of Tencent that ensembled up to 72 translation models (Wang et al., 2018a). However, the decoding speed is significantly worse since the decoder needs to apply K NMT models rather than only one. This means that the decoder has to perform K times as many forward passes through the networks, and has to apply the expensive softmax function K times in each time step. Ensembling also often increases the number of CPU/GPU switches and the communication overhead between CPU and GPU when averaging is implemented on the CPU. Ensembling is also often more difficult to implement than single system NMT. Knowledge distillation, which we will discuss in Sec. 16, is one method to deal with the shortcomings of ensembling. Stahlberg and Byrne (2017) proposed to unfold the ensemble into a single network and shrink the unfolded network afterwards for efficient ensembling.

In NMT, all models in an ensemble usually have the same size and topology and are trained on the same data. They differ only due to the random weight initialization and the randomized order of the training samples. Notable exceptions include Freitag and Al-Onaizan (2016) who use ensembling to prevent overfitting in domain adaptation, He et al. (2018) who combined models that selected their training data based on marginal likelihood, and the UCAM submission to WMT18 (Stahlberg et al., 2018b) that ensembled different NMT architectures with each other.10

When all models are equally powerful and are trained with the same data, it is surprising that ensembling is so effective. One common narrative is that different models make different mistakes, but the mistake of one model can be outvoted by the others in the ensemble (Rokach, 2010). This explanation is plausible for NMT since translation quality can vary widely between training runs (Sennrich et al., 2016c). The variance in translation performance may also indicate that the NMT error surface is highly non-convex such that the optimizer often ends up in local optima. Ensembling might mitigate this problem. Ensembling may also have a regularization effect on the final translation scores (Goodfellow et al., 2016).

Checkpoint averaging (Junczys-Dowmunt et al., 2016a,b) is a technique which is often discussed in conjunction with ensembling (Liu et al., 2018b). Checkpoint averaging keeps track of the few most recent checkpoints during training, and averages their weight matrices to create the final model. This results in a single model and thus does not increase the decoding time. Therefore, it has become a very common technique in NMT (Vaswani et al., 2017; Popel and Bojar, 2018; Stahlberg et al., 2018b). Checkpoint averaging addresses a quite different problem than ensembling as it mainly smooths out minor fluctuations in the training curve which are due to the optimizer’s update rule or noise in the gradient estimation due to mini-batch training. In contrast, the weights of independently trained models are very different from each other, and there is no obvious direct correspondence between neuron activities across the models. Therefore, checkpoint averaging cannot be applied to independently trained models.
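Checkpoint averaging itself is a one-liner per parameter: the weights of the last few checkpoints are averaged element-wise. The sketch below stands in for real weight tensors with flat Python lists; the parameter names and values are hypothetical:

```python
def average_checkpoints(checkpoints):
    """Element-wise average of the weights of several checkpoints.

    Each checkpoint is a dict mapping parameter names to (flat) lists of
    weights; all checkpoints share the same shapes by construction, since
    they stem from the same training run.
    """
    K = len(checkpoints)
    return {
        name: [sum(ckpt[name][i] for ckpt in checkpoints) / K
               for i in range(len(values))]
        for name, values in checkpoints[0].items()
    }

# Three hypothetical consecutive checkpoints of a tiny two-parameter model
ckpts = [
    {"W": [0.9, 2.1], "b": [0.0]},
    {"W": [1.0, 2.0], "b": [0.1]},
    {"W": [1.1, 1.9], "b": [0.2]},
]
final = average_checkpoints(ckpts)  # W averages to ~[1.0, 2.0], b to ~[0.1]
```

In contrast to ensembling, the result is a single model of the original size, so decoding time is unaffected.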

### 7.5 Decoding Direction

Standard NMT factorizes the probability P(y|x) from left to right (L2R) according to Eq. 5. Mathematically, the left-to-right order is rather arbitrary, and other arrangements such as a right-to-left (R2L) factorization are equally correct:

$$\begin{aligned} P(y|x) &= \prod_{j=1}^{J} P(y_j|y_1^{j-1}, x) = P(y_1|x) \cdot P(y_2|y_1, x) \cdot P(y_3|y_1, y_2, x) \cdots \\ &= \prod_{j=1}^{J} P(y_j|y_{j+1}^{J}, x) = P(y_J|x) \cdot P(y_{J-1}|y_J, x) \cdot P(y_{J-2}|y_{J-1}, y_J, x) \cdots \end{aligned} \qquad (29)$$

NMT models which produce the target sentence in reverse order have led to some gains in evaluation systems when combined with left-to-right models (Sennrich et al., 2016a; Wang et al., 2017c; Stahlberg et al., 2018b; Wang et al., 2018a). A common combination scheme is based on rescoring: A strong L2R ensemble first creates an n-best list which is then rescored with an R2L model (Liu et al., 2016; Sennrich et al., 2016a). Stahlberg et al. (2018b) used R2L models via a minimum Bayes risk framework. The L2R and R2L systems are normally trained independently, although some recent work proposes joint training schemes in which each direction is used as a regularizer for the other direction (Zhang et al., 2018e; Yang et al., 2018). Other orderings besides L2R and R2L have also been proposed such as middle-out (Mehri and Sigal, 2018), top-down in a binary tree (Welleck et al., 2019), insertion-based (Gu et al., 2019a; Stern et al., 2019; Östling and Tiedemann, 2017; Gu et al., 2019b), or in source sentence order (Stahlberg et al., 2018).

Another way to give the decoder access to the full target-side context is the two-stage approach of Li et al. (2017) who first drafted a translation, and then employed a multi-source NMT system to generate the final translation from both the source sentence and the draft. Zhang et al. (2018c) proposed a similar scheme but generated the draft translations in reverse order. A similar two-pass approach was used by ElMaghraby and Rafea (2019) to make Arabic MT more robust against domain shifts. Geng et al. (2018) used reinforcement learning to choose the best number of decoding passes.

Besides explicit combination with an R2L model and multi-pass strategies, we are aware of the following efforts to make the decoder more sensitive to the right-side target context: He et al. (2017) used reinforcement learning to estimate the long-term value of a candidate. Lin et al. (2018) provided global target sentence information to a recurrent decoder via a convolutional model. Hoang et al. (2017) proposed a very appealing theoretical framework to relax the discrete NMT optimization problem into a continuous optimization problem which makes it possible to include both decoding directions.

### 7.6 Efficiency

NMT decoding is very fast on GPU hardware and can reach up to 5000 words per second.11 However, GPUs are very expensive, and speeding up CPU decoding to the level of SMT remains more challenging. Therefore, how to improve the efficiency of neural sequence decoding algorithms is still an active research question. One bottleneck is the sequential left-to-right order of beam search which makes parallelization difficult. Stern et al. (2018) suggested to compute multiple time steps in parallel and validate translation prefixes afterwards. Kaiser et al. (2018) reduced the amount of sequential computation by learning a sequence of latent discrete variables which is shorter than the actual target sentence, and generating the final sentence from this latent representation in parallel. Di Gangi and Federico (2018) sped up recurrent NMT by using a simplified architecture for recurrent units. Another line of research tries to reintroduce the idea of hypothesis recombination to neural models. This technique is used extensively in traditional SMT (Koehn, 2010). The idea is to keep only the better of two partial hypotheses if it is guaranteed that both will be scored equally in the future. For example, this is the case for n-gram language models if both hypotheses end with the same n-gram. The problem in neural sequence models is that they condition on the full translation history. Therefore, hypothesis recombination for neural sequence models does not insist on exact equivalence but clusters hypotheses based on the similarity between RNN states or the n-gram history (Zhang et al., 2018b; Liu et al., 2014). A similar idea was used by Lecorvé and Motlicek (2012) to approximate RNNs with WFSTs, which also requires mapping histories into equivalence classes.
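Approximate hypothesis recombination based on the n-gram history can be sketched as follows. The clustering criterion (the last n tokens of the prefix) follows the idea described above; the data structures are our own simplification:

```python
def recombine(hypotheses, n=2):
    """Keep only the best hypothesis per n-gram history cluster.

    Two partial hypotheses (prefix, score) that share their last n tokens
    are assumed to be scored (approximately) identically in the future,
    so only the higher-scoring one is kept.
    """
    best = {}
    for prefix, score in hypotheses:
        key = tuple(prefix[-n:])  # cluster by the most recent n tokens
        if key not in best or score > best[key][1]:
            best[key] = (prefix, score)
    return list(best.values())

hyps = [
    (("the", "black", "cat"), -2.1),
    (("a", "black", "cat"), -2.5),   # same bigram history -> recombined away
    (("the", "black", "dog"), -3.0),
]
assert len(recombine(hyps)) == 2
```

Unlike in n-gram language models, this equivalence is only approximate for neural models, since they condition on the full history.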

It is also possible to speed up beam search by reducing the beam size. Wu et al. (2016); Freitag and Al-Onaizan (2017) suggested to use a variable beam size, using various heuristics to decide the beam size at each time step. Alternatively, the NMT model training can be tailored towards the decoding algorithm (Goyal et al., 2018; Wiseman and Rush, 2016; Collobert et al., 2019; Gu et al., 2017). Wiseman and Rush (2016) proposed a loss function for NMT training which penalizes when the reference falls off the beam during training. Kim and Rush (2016) reported that knowledge distillation (discussed in Sec. 16) reduces the gap between greedy decoding and beam decoding significantly. Greedy decoding can also be improved by using a small actor network which modifies the hidden states in an already trained model (Gu et al., 2017; Chen et al., 2018).

### 7.7 Generating Diverse Translations

An issue with using beam search is that the hypotheses found by the decoder are very similar to each other and often differ only by one or two words (Li and Jurafsky, 2016; Li et al., 2016b; Gimpel et al., 2013). The lack of diversity is problematic for several reasons. First, natural language in general and translation in particular often come with a high level of ambiguity that is not represented well by non-diverse n-best lists. Second, it impedes user interaction as NMT is not able to provide the user with alternative translations if needed. Third, collecting statistics about the search space such as estimating the probabilities of n-grams for minimum Bayes-risk decoding (Goel et al., 2000; Kumar and Byrne, 2004; Tromble et al., 2008; Iglesias et al., 2018; Stahlberg et al., 2018b, 2017) or risk-based training (Sec. 11.5) is much less effective.

Cho (2016) added noise to the activations in the hidden layer of the decoder network to produce alternative high scoring hypotheses. This is justified by the observation that small variations of a hidden configuration encode semantically similar contexts (Bengio et al., 2013). Li and Jurafsky (2016); Li et al. (2016b) proposed a diversity promoting modification of the beam search objective function. They added an explicit penalization term to the NMT score based on a maximum mutual information criterion which penalizes hypotheses from the same parent node. Note that both extensions can be used together (Cho, 2016). Vijayakumar et al. (2016) suggested to partition the active hypotheses into groups, and use a dissimilarity term to ensure diversity between groups. Park et al. (2016) found alternative translations by k-nearest neighbor search from the greedy translation in a translation memory.
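A simple instance of such a diversity-promoting penalty, loosely in the spirit of Li and Jurafsky (2016), is to demote expansions by their rank among siblings of the same parent hypothesis. The function below is our own illustrative simplification (gamma is a hypothetical penalty weight), not the exact objective from the paper:

```python
def rank_penalized_expansions(expansions, gamma=0.5):
    """Penalize each continuation by gamma times its rank among the
    continuations of the same parent hypothesis, so that the top of the
    candidate list is not monopolized by a single parent.

    `expansions` maps a parent hypothesis id to (token, log_prob) pairs.
    """
    scored = []
    for parent, conts in expansions.items():
        ranked = sorted(conts, key=lambda c: c[1], reverse=True)
        for rank, (token, logp) in enumerate(ranked):
            scored.append((parent, token, logp - gamma * rank))
    return sorted(scored, key=lambda s: s[2], reverse=True)

expansions = {"A": [("x", -0.1), ("y", -0.2)], "B": [("z", -0.3)]}
# Without the penalty both top candidates come from parent 'A';
# with gamma=0.5 the best candidate of parent 'B' moves into the top two.
top2 = rank_penalized_expansions(expansions)[:2]
assert {parent for parent, _, _ in top2} == {"A", "B"}
```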

### 7.8 Simultaneous Translation

Most of the research in MT assumes an offline scenario: a complete source sentence is to be translated to a complete target sentence. However, this basic assumption does not hold for many real-life applications. For example, useful machine translation for parliamentary speeches and lectures (Müller et al., 2016; Fügen et al., 2007) or voice call services such as Skype (Lewis, 2015) does not only have to produce good translations but also has to do so with very low latency (Mieno et al., 2015). To reduce the latency in such real-time speech-to-speech translation scenarios it is desirable to start translating before the full source sentence has been vocalized by the speaker. Most approaches frame simultaneous machine translation as a source sentence segmentation problem. The source sentence is revealed one word at a time. After a certain number of words, the segmentation policy decides to translate the current partial source sentence prefix and to commit to a translation prefix which may not be a complete translation of the partial source. This process is repeated until the full source sentence is available. The segmentation policy can be heuristic (Cho and Esipova, 2016) or learned with reinforcement learning (Grissom II et al., 2014; Gu et al., 2017). The translation itself is usually carried out by a standard MT system which was trained on full sentences. This is sub-optimal for two reasons. First, using a system which was trained on full sentences to translate partial sentences is brittle due to the significant mismatch between training and testing time. Ma et al. (2018a) tried to tackle this problem by training NMT to generate the target sentence with a fixed maximum latency to the source sentence. Second, human simultaneous interpreters use sophisticated strategies to reduce the latency by changing the grammatical structure (Paulik and Waibel, 2009, 2013; He et al., 2016a). These strategies are neglected by a vanilla translation system.
Unfortunately, training data from human simultaneous translators is rare (Paulik and Waibel, 2013) which makes it difficult to adapt MT to it.

## 8 Open Vocabulary Neural Machine Translation

### 8.1 Using Large Output Vocabularies

As discussed in Sec. 2, NMT and other neural NLP models use embedding matrices to represent words as real-valued vectors. Embedding matrices need to have a fixed shape to make joint training with the translation model possible, and thus can only be used with a fixed and pre-defined vocabulary. This has several major implications for NMT.

First, the size of the embedding matrices grows with the vocabulary size. As shown in Tab. 4, the embedding matrices make up most of the model parameters of a standard RNNsearch model. Increasing the vocabulary size inflates the model drastically. Large models require a small batch size because they take more space in the (GPU) memory, but reducing the batch size often leads to noisier gradients, slower training, and eventually worse model performance (Popel and Bojar, 2018). Furthermore, a large softmax output layer is computationally very expensive. In contrast, traditional (symbolic) MT systems can easily use very large vocabularies (Heafield et al., 2013; Lin and Dyer, 2010; Chiang, 2007; Koehn, 2010). Besides these practical issues, training embedding matrices for large vocabularies is also complicated by the long-tail distribution of words in a language. Zipf’s law (Zipf, 1946) states that the frequency of a word and its rank in the frequency table are inversely proportional to each other. Fig. 16 shows that 843K of the 875K distinct words (96.5%) occur less than 100 times in an English text with 140M running words – that is, in less than 0.00007% of the entire text. It is difficult to train robust word embeddings for such rare words. Word-based NMT models address this issue by restricting the vocabulary to the most frequent words, and replacing all other words by a special token UNK. A problem with this approach is that the UNK token may appear in the generated translation. In fact, limiting the vocabulary to the 30K most frequent words results in an out-of-vocabulary (OOV) rate of 2.9% on the training set (Fig. 16). That means an UNK token can be expected to occur every 35 words. In practice, the number of UNKs is usually even higher. One simple reason is that the test set OOV rate is often higher than on the training set because the distribution of words and phrases naturally varies across genre, corpora, and time.
Another observation is that word-based NMT often prefers emitting UNK even if a more appropriate word is in the NMT vocabulary. This is possibly due to the imbalance between the UNK token and other words: replacing all rare words with the same UNK token leads to an over-representation of UNK in the training set, and therefore a strong bias towards UNK during decoding.
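The statistics above are straightforward to compute. The sketch below counts word frequencies on a toy corpus (the figures quoted in the text were computed on a 140M-word English text) and derives the OOV rate for a frequency-based shortlist:

```python
from collections import Counter

def oov_rate(tokens, vocab_size):
    """Fraction of running words not covered by the `vocab_size` most
    frequent word types -- i.e. the rate at which UNK would occur."""
    counts = Counter(tokens)
    vocab = {w for w, _ in counts.most_common(vocab_size)}
    return sum(1 for t in tokens if t not in vocab) / len(tokens)

# Toy corpus with a Zipf-like long tail of rare words
corpus = ["the"] * 50 + ["cat"] * 20 + ["sat"] * 10 \
       + [f"rare{i}" for i in range(20)]
assert abs(oov_rate(corpus, vocab_size=3) - 0.2) < 1e-12  # every 5th token
```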

#### Translation-specific Approaches

Jean et al. (2015) distinguished between translation-specific and model-specific approaches. Translation-specific approaches keep the shortlist vocabulary in the original form, but correct UNK tokens afterwards. For example, the UNK replace technique (Luong et al., 2015; Le et al., 2016) keeps track of the positions of source sentence words which correspond to the UNK tokens. In a post-processing step, they replaced the UNK tokens with the most likely translation of the aligned source word according to a bilingual word-level dictionary which was extracted from a word-aligned training corpus. Gulcehre et al. (2016) followed a similar idea but used a special pointer network for referring to source sentence words. These approaches are rather ad-hoc because simple dictionary lookup without context is not a very strong model of translation. Li et al. (2016) replaced each OOV word with a similar in-vocabulary word based on the cosine similarity between their distributed representations in a pre-processing step. However, this technique cannot tackle all OOVs as it is based on vector representations of words which are normally only available for a closed vocabulary. Moreover, the replacements might differ from the original meaning significantly. Further UNK replacement strategies were presented by Li et al. (2017a, b); Miao et al. (2017), but all share the inevitable limitation of translation-specific approaches, namely that the translation model itself cannot discriminate between a large number of OOVs.

#### Model-specific Approaches

Model-specific approaches change the NMT model to make training with large vocabularies feasible. For example, Nguyen and Chiang (2018) improved the translation of rare words in NMT by adding a lexical translation model which directly connects corresponding source and target words. Another very popular idea is to train networks to output probability distributions without using the full softmax (Andreas and Klein, 2015). Noise-contrastive estimation (Gutmann and Hyvärinen, 2010; Dyer, 2014, NCE) trains a logistic regression model which discriminates between real training examples and noise. For example, to train an embedding for a word w, Mnih and Kavukcuoglu (2013) treat w as a positive example, and sample from the global unigram word distribution in the training data to generate negative examples. The logistic regression model is a binary classifier and thus does not need to sum over the full vocabulary. NCE has been used to train large vocabulary neural sequence models such as language models (Mnih and Teh, 2012). The technique falls into the category of self-normalizing training (Andreas and Klein, 2015) because the model is trained to emit normalized distributions without explicitly summing over the output vocabulary. Self-normalization can also be achieved by adding the value of the partition function to the training loss (Devlin et al., 2014), encouraging the network to learn parameters which generate normalized output.
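The binary NCE objective can be sketched as follows. Here `score_w` stands for the model's unnormalized log-score of a word and `noise_logprob_w` for the log-probability under the noise (e.g. unigram) distribution; both names and the toy scorers are our own. Note that no sum over the vocabulary appears anywhere:

```python
import math

def nce_loss(score_w, noise_logprob_w, data_word, noise_words, k):
    """Noise-contrastive estimation loss for one positive example and
    k noise samples (binary logistic discrimination, no full softmax)."""
    def p_data(word):
        # Posterior that `word` came from the data rather than the noise
        s = score_w(word) - noise_logprob_w(word) - math.log(k)
        return 1.0 / (1.0 + math.exp(-s))
    loss = -math.log(p_data(data_word))      # push the data word up
    for w in noise_words:
        loss -= math.log(1.0 - p_data(w))    # push noise words down
    return loss

# A model that scores the observed word highly incurs a lower loss:
good = nce_loss(lambda w: 2.0 if w == "good" else -2.0,
                lambda w: math.log(0.25), "good", ["bad", "bad"], k=2)
flat = nce_loss(lambda w: -2.0,
                lambda w: math.log(0.25), "good", ["bad", "bad"], k=2)
assert good < flat
```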

Another approach (sometimes referred to as vocabulary selection) is to approximate the partition function of the full softmax by using only a subset of the vocabulary. This subset can be selected in different ways. For example, Jean et al. (2015) applied importance sampling to select a small set of words for approximating the partition function. Both softmax sampling and UNK replace have been used in one of the winning systems at the WMT’15 evaluation on English-German (Jean et al., 2015). Various methods have been proposed to select the vocabulary to normalize over during decoding, such as fetching all possible translations in a conventional phrase table (Mi et al., 2016), using the vocabulary of the translation lattices from a traditional MT system (Stahlberg et al., 2016b, local softmax), and attention-based (Sankaran et al., 2017) and embedding-based (L’Hostis et al., 2016) methods.

### 8.2 Character-based NMT

Arguably, both translation-specific and model-specific approaches to word-based NMT are fundamentally flawed. Translation-specific techniques like UNK replace are indiscriminative between translations that differ only by OOV words. A translation model which assigns exactly the same score to a large number of hypotheses is of limited use on its own. Model-specific approaches suffer from the difficulty of training embeddings for rare words (Sec. 8.1). Compound or morpheme splitting (Hans and Milton, 2016; Tamchyna et al., 2017) can mitigate this issue only to a certain extent. More importantly, a fully-trained NMT system even with a very large vocabulary cannot be extended with new words. However, customizing systems to new domains (and thus new vocabularies) is a crucial requirement for commercial MT. Moreover, many OOV words are proper names which can be passed through untranslated. Hiero (Chiang, 2007) and other symbolic systems can easily be extended with new words and phrases.

More recent attempts try to alleviate the vocabulary issue in NMT by departing from words as modelling units. These approaches decompose the word sequences into finer-grained units and model the translation between those instead of words. To the best of our knowledge, Ling et al. (2015) were the first who proposed an NMT architecture which translates between sequences of characters. The core of their NMT network is still on the word-level, but the input and output embedding layers are replaced with subnetworks that compute word representations from the characters of the word. Such a subnetwork can be recurrent (Ling et al., 2015; Johansen et al., 2016) or convolutional (Costa-jussà and Fonollosa, 2016; Kim et al., 2016). This idea was extended to a hybrid model by Luong and Manning (2016) who used the standard lookup table embeddings for in-vocabulary words and the LSTM-based embeddings only for OOVs.

Having a word-level model at the core of a character-based system does circumvent the closed vocabulary restriction of purely word-based models, but it is still segmentation-dependent: The input text has to be preprocessed with a tokenizer that separates words by blank symbols in languages without word boundary markers, optionally applies compound or morpheme splitting in morphologically rich languages, and isolates punctuation symbols. Since tokenization is by itself error-prone and can degrade the translation performance (Domingo et al., 2018), it is desirable to design character-level systems that do not require any prior segmentation. Chung et al. (2016) used a bi-scale recurrent neural network that is similar to dynamically segmenting the input using jointly learned gates between a slow and a fast recurrent layer. Lee et al. (2017); Yang et al. (2016) used convolution to achieve segmentation-free character-level NMT. Costa-jussà et al. (2017) took character-level NMT one step further and used bytes rather than characters to help multilingual systems. Gulcehre et al. (2017) added a planning mechanism to improve the attention weights between character-based encoders and decoders.

### 8.3 Subword-unit-based NMT

As a compromise between characters and full words, compression methods like Huffman codes (Chitnis and DeNero, 2015), word piece models (Schuster and Nakajima, 2012; Wu et al., 2016), or byte pair encoding (Sennrich et al., 2016c; Gage, 1994, BPE) can be used to transform words into sequences of subword units. Subwords have rarely been used in traditional SMT (Kunchukuttan and Bhattacharyya, 2017, 2016; Liu et al., 2018a), but are currently the most common translation units for NMT. Byte pair encoding (BPE) initializes the set of available subword units with the character set of the language. This set is extended iteratively in subsequent merge operations. Each merge combines the two units with the highest number of co-occurrences in the text.12 This process terminates when the desired vocabulary size is reached. This vocabulary size is often set empirically, but can also be tuned on data (Salesky et al., 2018).
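The BPE merge loop itself is compact. The sketch below learns merge operations from a toy word list; following the description above, each iteration merges the pair of adjacent units that co-occurs most often (real implementations additionally weight words by their corpus frequency and handle word boundaries):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations, starting from single characters."""
    corpus = [list(w) for w in words]  # each word as a list of units
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for units in corpus:
            for a, b in zip(units, units[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # most frequent adjacent pair
        merges.append((a, b))
        for units in corpus:                 # apply the merge everywhere
            i = 0
            while i < len(units) - 1:
                if units[i] == a and units[i + 1] == b:
                    units[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

merges = learn_bpe(["lower", "lowest", "low"], num_merges=2)
# First 'l'+'o' are merged (3 occurrences), then 'lo'+'w'
assert merges == [("l", "o"), ("lo", "w")]
```

Note that the merged symbols ('lo', 'low', …) remain in the vocabulary alongside their parts, which is the source of the segmentation ambiguity discussed below.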

Given a fixed BPE vocabulary, there are often multiple ways to segment an unseen text.13 The ambiguity stems from the fact that symbols are still part of the vocabulary even after they are merged. Most BPE implementations select a segmentation greedily by preferring longer subword units. Interestingly, the ambiguity can also be used as source of noise for regularization. Kudo (2018) reported surprisingly large gains by augmenting the training data with alternative subword segmentations and by decoding from multiple segmentations of the same source sentence.

Segmentation approaches differ in the level of constraints they impose on the subwords. A common constraint is that subwords cannot span multiple words (Sennrich et al., 2016c). However, enforcing this constraint again requires a tokenizer, which is a potential source of errors (see Sec. 8.2). The SentencePiece model (Kudo and Richardson, 2018) is a tokenization-free subword model that is estimated on raw text. At the other end of the spectrum, it has been observed that automatically learned subwords generally do not correspond to linguistic entities such as morphemes, stems, or affixes. However, linguistically motivated subword units (Huck et al., 2017; Macháček et al., 2018; Ataman et al., 2017; Pinnis et al., 2017) that take morpheme boundaries into account do not always improve over completely data-driven ones.

### 8.4 Words, Subwords, or Characters?

There is no conclusive agreement in the literature on whether characters or subwords are the better translation units for NMT. Tab. 5 summarizes some of the arguments. The tendency seems to be that character-based systems have the potential to outperform subword-based NMT, but they are technically difficult to deploy. Therefore, most systems in the WMT18 evaluation are based on subwords (Bojar et al., 2018). On a more profound level, we view the shift towards smaller modelling units with some concern. Chung et al. (2016) noted that “we often have a priori belief that a word, or its segmented-out lexeme, is a basic unit of meaning, making it natural to approach translation as mapping from a sequence of source-language words to a sequence of target-language words.” Translation is the task of transferring meaning from one language to another, and it makes intuitive sense to model this process with meaningful units. The decades of research in traditional SMT were characterized by a constant movement towards larger translation units – starting from the word-based IBM models (Brown et al., 1993) and proceeding to phrase-based MT (Koehn, 2010) and hierarchical SMT (Chiang, 2007), which models syntactic structures. Expressions consisting of multiple words are even more appropriate units for translation than single words, since there is rarely a 1:1 correspondence between source and target words. In contrast, the starting point for character- and subword-based models is the language’s writing system. Most writing systems are not logographic but alphabetic or syllabic and thus use symbols without any relation to meaning. The introduction of symbolic word-level and phrase-level information into NMT is one of the main motivations for NMT-SMT hybrid systems (Sec. 18).

## 9 Using Monolingual Training Data

In practice, parallel training data for MT is scarce and expensive to acquire, whereas untranslated monolingual data is usually abundant. This is one of the reasons why language models (LMs) are central to traditional SMT. For example, in Hiero (Chiang, 2007), the translation grammar spans a vast space of possible translations but is weak in assigning scores to them. The LM is mainly responsible for selecting a coherent and fluent translation from that space. However, the vanilla NMT formalism does not allow the integration of an LM or monolingual data in general.

There are several lines of research which investigate the use of monolingual training data in NMT. Gulcehre et al. (2015, 2017) suggested integrating a separately trained RNN-LM into the NMT decoder. Similarly to traditional SMT (Koehn, 2010), they started out by combining RNN-LM and NMT scores via a log-linear model (‘shallow fusion’). They reported even better performance with ‘deep fusion’ which uses a controller network that dynamically adjusts the weights between RNN-LM and NMT. Both deep fusion and n-best reranking with count-based language models have led to some gains in WMT evaluation systems (Jean et al., 2015; Wang et al., 2017c). The ‘simple fusion’ technique (Stahlberg et al., 2018a) trains the translation model to predict the residual probability of the training data added to the prediction of a pre-trained and fixed LM.
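Shallow fusion in particular is straightforward to sketch: at each decoding step, the NMT and LM log-probabilities are combined log-linearly with a fixed interpolation weight before ranking the beam candidates. The distributions and the weight below are toy stand-ins, not values from any of the cited systems.

```python
import numpy as np

def shallow_fusion_step(nmt_logprobs, lm_logprobs, lm_weight=0.3):
    """Combine NMT and LM token scores log-linearly (shallow fusion).

    Both inputs are log-probability vectors over the same vocabulary;
    the returned scores are used to rank candidates in beam search.
    """
    return np.asarray(nmt_logprobs) + lm_weight * np.asarray(lm_logprobs)

# Toy vocabulary of 3 tokens: the LM breaks the tie left by the NMT model.
nmt = np.log([0.4, 0.4, 0.2])
lm = np.log([0.1, 0.7, 0.2])
best = int(np.argmax(shallow_fusion_step(nmt, lm)))  # token 1 wins
```

Deep fusion differs only in that `lm_weight` is replaced by the output of a learned controller network that conditions on the decoder state.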

The second line of research makes use of monolingual text via data augmentation. The idea is to add monolingual data in the target language to the natural parallel training corpus. Different strategies for filling in the source side for these sentences have been proposed, such as using a single dummy token (Sennrich et al., 2016b) or copying the target sentence over to the source side (Currey et al., 2017). The most successful strategy is called back-translation (Schwenk, 2008; Sennrich et al., 2016b), which employs a separate translation system in the reverse direction to generate the source sentences for the monolingual target language sentences. The back-translating system is usually smaller and computationally cheaper than the final system for practical reasons, although with enough computational resources improving the quality of the reverse system can affect the final translation performance significantly (Burlot and Yvon, 2018). Iterative approaches that back-translate with systems that were themselves trained with back-translation can yield improvements (Hoang et al., 2018b; Niu et al., 2018; Zhang et al., 2018d), although they are not widely used due to their computational costs. Back-translation has become a very common technique and has been used in nearly all neural submissions to recent evaluation campaigns (Sennrich et al., 2016a; Bojar et al., 2017, 2018).

A major limitation of back-translation is that the amount of synthetic data has to be balanced with the amount of real parallel data (Sennrich et al., 2016b, a; Poncelas et al., 2018). Therefore, the back-translation technique can only make use of a small fraction of the available monolingual data. An imbalance between synthetic and real data can be partially corrected by over-sampling – duplicating real training samples a number of times to match the synthetic data size. However, very high over-sampling rates often do not work well in practice. Recently, Edunov et al. (2018) proposed to add noise to the back-translated sentences to provide a stronger training signal from the synthetic sentence pairs. They showed that adding noise not only improves the translation quality but also makes the training more robust against a high ratio of synthetic to real sentences. The effectiveness of using noise for data augmentation in NMT has also been confirmed by Wang et al. (2018b). These methods increase the variety of the training data and thus make it harder for the model to fit, which ultimately leads to stronger training signals. The variety of synthetic sentences in back-translation can also be increased by sampling multiple sentences from the reverse translation model (Imamura et al., 2018).
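Viewed as a data pipeline, back-translation is simple: translate monolingual target sentences into the source language with a reverse system and append the resulting synthetic pairs to the real parallel data, optionally over-sampling the real pairs to balance the ratio. The `reverse_translate` function below is a hypothetical stand-in for the reverse NMT system.

```python
def back_translate(monolingual_targets, reverse_translate, parallel_data,
                   oversample=1):
    """Augment parallel data with back-translated sentence pairs.

    `reverse_translate` is a hypothetical target-to-source MT system.
    Real parallel samples can be duplicated (`oversample`) to balance
    the synthetic-to-real ratio.
    """
    synthetic = [(reverse_translate(t), t) for t in monolingual_targets]
    return parallel_data * oversample + synthetic

# Toy usage with a stand-in "translator" that just reverses the string.
augmented = back_translate(["cd"], lambda t: t[::-1], [("ab", "ba")])
```

The final forward system is then trained on `augmented` as if it were ordinary parallel data.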

A third class of approaches changes the NMT training loss function to incorporate monolingual data. For example, Cheng et al. (2016); Tu et al. (2017); Escolano et al. (2018) proposed to add autoencoder terms to the training objective which capture how well a sentence can be reconstructed from its translated representation. Using the reconstruction error is also central to (unsupervised) dual learning approaches (He et al., 2016a; Hassan et al., 2018; Wang et al., 2018). However, training with respect to the new loss is often computationally intensive and requires approximations. Alternatively, multi-task learning has been used to incorporate source-side (Zhang and Zong, 2016) and target-side (Domhan and Hieber, 2017) monolingual data. Another way of utilizing monolingual data in both source and target language is to warm start Seq2Seq training from pre-trained encoder and decoder networks (Ramachandran et al., 2017; Skorokhodov et al., 2018). An extreme form of leveraging monolingual training data is unsupervised NMT which removes the need for parallel training data entirely. We will discuss unsupervised NMT in Sec. 14.4.

## 10 NMT Model Errors

NMT is highly effective in assigning scores (or probabilities) to translations because, in stark contrast to SMT, it does not make any conditional independence assumptions in Eq. 5 to model sentence-level translation. A potential drawback of such a powerful model is that it prohibits the use of sophisticated search procedures. Compared to hierarchical SMT systems like Hiero (Chiang, 2007) that explore very large search spaces, NMT beam search appears to be overly simplistic. This observation suggests that translation errors in NMT are more likely due to search errors (the decoder does not find the highest scoring translation) than model errors (the model assigns a higher probability to a worse translation). Interestingly, this is not necessarily the case. Search errors in NMT have been studied by Niehues et al. (2017); Stahlberg et al. (2018); Stahlberg and Byrne (2019). In particular, Stahlberg and Byrne (2019) demonstrated that NMT decoding leaves a large number of search errors. However, as we will show in this section, NMT also suffers from various kinds of model errors in practice despite its theoretical advantage.

### 10.1 Sentence Length

Increasing the beam size exposes one of the most noticeable model errors in NMT. The red curve in Fig. 17 plots the BLEU score (Papineni et al., 2002) of a recent Transformer NMT model against the beam size. A beam size of 10 is optimal on this test set. Wider beams lead to a steady drop in translation performance because the generated translations are becoming too short (green curve). However, as expected, the log-probabilities of the found translations (blue curve) are decreasing as we increase the beam size. NMT seems to assign too much probability mass to short hypotheses which are only found with more exhaustive search. Sountsov and Sarawagi (2016) argue that this model error is due to the locally normalized maximum likelihood training objective in NMT that underestimates the margin between the correct translation and shorter ones if trained with regularization and finite data. A similar argument was made by Murray and Chiang (2018) who pointed out the difficulty for a locally normalized model to estimate the “budget” for all remaining (longer) translations in each time step. Kumar and Sarawagi (2019) demonstrated that NMT models are often poorly calibrated, and that calibration issues can cause the length deficiency in NMT. A similar case is illustrated in Fig. 18. The NMT model underestimates the combined probability mass of translations continuing after “Stadtrat” in time step 7 and overestimates the probability of the period symbol. Greedy decoding does not follow the green translation since “der” is more likely in time step 7. However, beam search with a large beam keeps the green path and thus finds the shorter (incomplete) translation with better score. In fact, Stahlberg and Byrne (2019) linked the bias of large beam sizes towards short translations with the reduction of search errors.

At first glance this seems to be good news: fast beam search with a small beam size is already able to find good translations. However, fixing the model error of short translations by introducing search errors with a narrow beam is like fighting fire with fire. In practice, this means that the beam size is yet another hyper-parameter that needs to be tuned for each new NMT training technique (e.g. label smoothing (Szegedy et al., 2016) usually requires a larger beam), NMT architecture (the Transformer model is usually decoded with a smaller beam than typical recurrent models), and language pair (Koehn and Knowles, 2017). More importantly, it is not clear whether the gains to be had from reducing the number of search errors with wider beams are simply obliterated by the NMT length deficiency.

#### Model-agnostic Length Models

The first class of approaches to alleviate the length problem is model-agnostic. Methods in this class treat the NMT model as a black box but add a correction term to the NMT score to bias beam search towards longer translations. A simple method is length normalization, which divides the NMT log-probability by the sentence length (Jean et al., 2015; Boulanger-Lewandowski et al., 2013):

$$S_{\text{LN}}(\mathbf{y}|\mathbf{x})=\frac{\log P(\mathbf{y}|\mathbf{x})}{|\mathbf{y}|} \qquad (30)$$

Wu et al. (2016) proposed an extension of this idea by introducing a tunable parameter α:

$$S_{\text{LN-GNMT}}(\mathbf{y}|\mathbf{x})=\log P(\mathbf{y}|\mathbf{x})\cdot\frac{(1+5)^{\alpha}}{(5+|\mathbf{y}|)^{\alpha}} \qquad (31)$$

Alternatively, as in SMT, we can use a word penalty which rewards each word in the sentence:

$$S_{\text{WP}}(\mathbf{y}|\mathbf{x})=\sum_{j=1}^{J}\Big(\gamma(j,\mathbf{x})+\log P(y_j\,|\,y_1^{j-1},\mathbf{x})\Big) \qquad (32)$$

A constant reward γ which is independent of j and x can be found with the standard minimum-error-rate-training (Och, 2003, MERT) algorithm (He et al., 2016b) or with a gradient-based learning scheme (Murray and Chiang, 2018). Alternative policies which reward words with respect to some estimated sentence length were suggested by Huang et al. (2017) and Yang et al. (2018).
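The three correction terms above differ only in how they rescale the sentence-level score. The following is a minimal sketch of Eqs. 30–32, assuming the sentence and per-token log-probabilities are already given; the values of α and γ are toy settings rather than tuned parameters.

```python
def length_norm(logprob, length):
    """Eq. 30: divide the sentence log-probability by its length."""
    return logprob / length

def gnmt_length_norm(logprob, length, alpha=0.6):
    """Eq. 31: GNMT-style length penalty with tunable alpha."""
    return logprob * (1 + 5) ** alpha / (5 + length) ** alpha

def word_reward(token_logprobs, gamma=0.5):
    """Eq. 32 with a constant per-token reward gamma."""
    return sum(gamma + lp for lp in token_logprobs)

# A 3-token and a 6-token hypothesis with the same total log-probability:
short, long_ = -6.0, -6.0
assert length_norm(long_, 6) > length_norm(short, 3)  # favours the longer one
```

All three leave the underlying model untouched; only the ranking of beam candidates changes.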

#### Source-side Coverage Models

Tu et al. (2016) connected the sentence length issue in NMT with the lack of an explicit mechanism for checking the source-side coverage of a translation. Traditional SMT keeps track of a coverage vector C which contains 1 for source words that have already been translated and 0 otherwise. C is used to guard against under-translation (missing translations of some words) and over-translation (some words being unnecessarily translated multiple times). Since vanilla NMT does not use an explicit coverage vector, it can be prone to both under- and over-translation (Tu et al., 2016; Yang et al., 2018) and tends to prefer fluency over adequacy (Kong et al., 2018). There are two popular ways to model coverage in NMT; both make use of the encoder-decoder attention weight matrix A introduced in Sec. 6.1. The simpler methods combine the scores of an already trained NMT system with a coverage penalty cp(x, y) without retraining. This penalty represents how much of the source sentence has already been translated. Wu et al. (2016) proposed the following term:

$$cp(\mathbf{x},\mathbf{y})=\beta\sum_{i=1}^{I}\log\Big(\min\Big(\sum_{j=1}^{J}A_{i,j},\,1.0\Big)\Big). \qquad (33)$$

A very similar penalty was suggested by Li et al. (2018):

$$cp(\mathbf{x},\mathbf{y})=\alpha\sum_{i=1}^{I}\log\Big(\max\Big(\sum_{j=1}^{J}A_{i,j},\,\beta\Big)\Big) \qquad (34)$$

where α and β are hyper-parameters that are tuned on the development set.
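Both penalties operate directly on the attention weight matrix. Below is a sketch of the coverage penalty of Wu et al. (2016) (Eq. 33), assuming A is an I×J NumPy array of attention weights with rows indexing source positions and columns indexing target time steps; the example matrix is a toy value.

```python
import numpy as np

def coverage_penalty_gnmt(A, beta=0.2):
    """Eq. 33: penalize source positions whose total attention is below 1.

    A[i, j] is the attention weight on source word i at target step j.
    Fully covered source words (row sum >= 1) contribute log(1) = 0.
    """
    row_sums = A.sum(axis=1)
    return beta * np.sum(np.log(np.minimum(row_sums, 1.0)))

# Two source words: the first fully attended, the second only half covered.
A = np.array([[0.7, 0.3],
              [0.3, 0.2]])
cp = coverage_penalty_gnmt(A, beta=1.0)  # = log(0.5) from the second row
```

The penalty is added to the NMT score of each hypothesis during beam search, so hypotheses that leave source words uncovered are ranked lower.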

An even tighter integration can be achieved by changing the NMT architecture itself and jointly training it with a coverage model (Tu et al., 2016; Mi et al., 2016a). Tu et al. (2016) reintroduced an explicit coverage matrix C to NMT. Intuitively, the j-th column of C stores to what extent each source word has been translated in time step j. C can be filled with an RNN-based controller network (the “neural network based” coverage model of Tu et al. (2016)). Alternatively, we can directly use the attention matrix A to compute the coverage (the “linguistic” coverage model of Tu et al. (2016)):

$$C_{i,j}=\frac{1}{\Phi_i}\sum_{k=1}^{j}A_{i,k} \qquad (35)$$

where Φ_i is the estimated number of target words the i-th source word generates, which is similar to fertility in SMT. Φ_i is predicted by a feedforward network that conditions on the i-th encoder state. In both the neural network based and the linguistic coverage model, the decoder is modified to additionally condition on C. The idea of using fertilities to prevent over- and under-translation has also been explored by Malaviya et al. (2018). A coverage model for character-based NMT was suggested by Kazimi and Costa-Jussá (2017).

All approaches discussed in this section operate on the attention weight matrix and are thus only readily applicable to models with single encoder-decoder attention like GNMT, but not to models with multiple encoder-decoder attention modules such as ConvS2S or the Transformer (see Sec. 6.6 for detailed descriptions of GNMT, ConvS2S, and the Transformer).

#### Controlling Mechanisms for Output Length

In some sequence prediction tasks such as headline generation or text summarization, the approximate desired output length is known in advance. In such cases, it is possible to control the length of the output sequence by explicitly feeding in the desired length to the neural model. The length information can be provided as additional input to the decoder network (Fan et al., 2018; Liu et al., 2018), at each time step as the number of remaining tokens (Kikuchi et al., 2016), or by modifying Transformer positional embeddings (Takase and Okazaki, 2019). However, these approaches are not directly applicable to machine translation as the translation length is difficult to predict with sufficient accuracy.

## 11 NMT Training

NMT models are normally trained using backpropagation (Rumelhart et al., 1988) and a gradient-based optimizer like Adadelta (Zeiler, 2012) with cross-entropy loss (Sec. 11.1). Modern NMT architectures like the Transformer, ConvS2S, or recurrent networks with LSTM (Hochreiter and Schmidhuber, 1997) or GRU (Cho et al., 2014b) cells help to address known training problems like vanishing gradients (Hochreiter et al., 2001). However, there is evidence that the optimizer still fails to exploit the full potential of NMT models and often gets stuck in suboptimal regions of the parameter space:

1. NMT models vary greatly in performance, even when they use exactly the same architecture and training data and are trained for the same number of iterations. Sennrich et al. (2016c) observed up to 1 BLEU difference between such models.

2. NMT ensembling (Sec. 15) combines the scores of multiple separately trained NMT models of the same kind. NMT ensembles consistently outperform single NMT models by a large margin. The gains achieved through ensembling might indicate difficulties in the training of the single models.

Training is therefore still a very active and diverse research topic. We will outline the different efforts in the literature on NMT training in this section.

### 11.1 Cross-entropy Training

The most common objective function for NMT training is the cross-entropy loss. The optimization problem over the model parameters Θ for a single sentence pair (x, y) under this loss is defined as follows:

$$\operatorname*{argmin}_{\Theta}\,\mathcal{L}_{\text{CE}}(\mathbf{x},\mathbf{y},\Theta)=\operatorname*{argmin}_{\Theta}\,\Big({-\sum_{j=1}^{|\mathbf{y}|}\log P_{\Theta}(y_j\,|\,y_1^{j-1},\mathbf{x})}\Big). \qquad (36)$$

In practice, NMT training groups several instances from the training corpus into batches and optimizes Θ by following the gradient of the average loss in the batch. There are various ways to interpret this loss function.
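Concretely, the per-sentence loss of Eq. 36 is just the negative sum of the log-probabilities the model assigns to the reference tokens. Below is a minimal sketch with toy per-step distributions standing in for the model's predictions; the vocabulary and probabilities are hypothetical.

```python
import math

def cross_entropy_loss(step_distributions, reference):
    """Eq. 36: negative log-likelihood of the reference token sequence.

    `step_distributions[j]` maps each vocabulary token to the model
    probability P(y_j | y_1^{j-1}, x) at target position j.
    """
    return -sum(math.log(dist[tok])
                for dist, tok in zip(step_distributions, reference))

# Toy 2-step example over a 3-token vocabulary.
dists = [{"a": 0.7, "b": 0.2, "</s>": 0.1},
         {"a": 0.1, "b": 0.1, "</s>": 0.8}]
loss = cross_entropy_loss(dists, ["a", "</s>"])  # = -(log 0.7 + log 0.8)
```

Batched training averages this quantity over all sentence pairs in the batch before taking a gradient step.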

**Cross-entropy loss maximizes the log-likelihood of the training data.** A direct interpretation of Eq. 36 is that it yields a maximum likelihood estimate of Θ, as it directly maximizes the probability P_Θ(y|x):

$$-\log P_{\Theta}(\mathbf{y}|\mathbf{x})\overset{\text{Eq.~5}}{=}-\sum_{j=1}^{|\mathbf{y}|}\log P_{\Theta}(y_j\,|\,y_1^{j-1},\mathbf{x})=\mathcal{L}_{\text{CE}}(\mathbf{x},\mathbf{y},\Theta). \qquad (37)$$

**Cross-entropy loss optimizes a Monte Carlo approximation of the cross-entropy to the real sequence-level distribution.** Another intuition behind the cross-entropy loss is that we want to find model parameters Θ that make the model distribution P_Θ(·|x) similar to the real distribution P(·|x) over translations of a source sentence x. The similarity is measured with the cross-entropy H_x(P, P_Θ). In practice, the real distribution is not known, but we have access to a training corpus of pairs (x, y). For each such pair we consider the target sentence y as a sample from the real distribution P(·|x). We now approximate the cross-entropy using Monte Carlo estimation with only one sample (N = 1):

$$H_{\mathbf{x}}(P,P_{\Theta})=\mathbb{E}_{\mathbf{y}\sim P(\cdot|\mathbf{x})}\big[-\log P_{\Theta}(\mathbf{y}|\mathbf{x})\big]=-\sum_{\mathbf{y}'}P(\mathbf{y}'|\mathbf{x})\log P_{\Theta}(\mathbf{y}'|\mathbf{x})\overset{\text{MC}}{\approx}-\frac{1}{N}\sum_{n=1}^{N}\log P_{\Theta}(\mathbf{y}^{(n)}|\mathbf{x})\overset{N=1}{=}-\log P_{\Theta}(\mathbf{y}|\mathbf{x}).$$