Learning Coupled Policies for Simultaneous Machine Translation
In simultaneous machine translation, the system needs to incrementally generate the output translation before the input sentence ends. This is a coupled decision process consisting of a programmer and an interpreter. The programmer’s policy decides when to Write the next output or Read the next input, and the interpreter’s policy decides which word to write. We present an imitation learning (IL) approach to efficiently learn effective coupled programmer-interpreter policies. To enable IL, we present an algorithmic oracle that produces oracle Read/Write actions for the training bilingual sentence pairs using the notion of word alignments. We attribute the effectiveness of the learned coupled policies to (i) scheduled sampling addressing the coupled exposure bias, and (ii) the quality of oracle actions capturing enough information from the partial input before writing the output.
Experiments show our method outperforms strong baselines in terms of translation quality and delay, when translating from German/Arabic/Czech/Bulgarian/Romanian to English.
Simultaneous machine translation (SiMT) is a setting where the model needs to incrementally generate the translation while the source utterance is being received. This is crucial in live or streaming scenarios, e.g. speech-to-speech translation, where waiting in order to translate a complete utterance leads to intolerable delay. This is a challenging translation scenario as the SiMT model needs to trade off the delay and the quality of the generated translation.
Recent research on SiMT relies on a strategy, aka policy, to decide when to read a word from the input or write a word to the output (e.g. [14, 7]). This is based on a sequential decision making formulation of SiMT, where the decision about the next Read/Write action is made by an agent interacting with the neural machine translation (NMT) environment. The current approaches are sub-optimal, as they either fix the agent’s policy to focus on learning the NMT model (e.g. [10, 6]) or learn adaptive agent policies while the NMT model is fixed (e.g. [7, 1]). Furthermore, the majority of policy learning approaches rely on reinforcement learning (RL), which makes the training process unstable and long due to exploration. This is exacerbated when learning the programmer’s and interpreter’s policies jointly. As such, recent research has considered an imitation learning (IL) scenario [18, 19], which is generally superior to RL in terms of stability and sample complexity. However, the bottleneck of IL in SiMT is the unavailability of the oracle sequence of actions. Designing algorithmic oracles to compute sequences of Read/Write actions with low translation latency and high translation quality is underexplored.
In this paper, we take a neural programmer-interpreter (NPI) approach to SiMT, where the programmer (agent) communicates Read/Write actions to the interpreter (NMT). This provides a natural framework to learn the coupled policies for the agent (programmer) and the underlying NMT model (interpreter). We present an IL approach to efficiently learn effective coupled programmer-interpreter policies in SiMT, based on the following contributions. Firstly, we present a simple, fast, and effective algorithmic oracle to produce oracle Read/Write actions from the training bilingual sentence pairs based on statistical word alignments. Secondly, as the two policies collaborate in NPI-SiMT, their learning inter-dependency needs to be taken into account. In particular, each policy should be robust not only to its own incorrect predictions, but also to incorrect predictions of the other policy. We provide an effective learning algorithm to guard against this coupled exposure bias using scheduled sampling.
We conduct translation experiments on five language pairs, translating from German, Czech, Arabic, Bulgarian, and Romanian as source languages into English as the target language. Our experiments show the policies trained using our approach compare favorably with strong policies from previous work. We attribute the effectiveness of the learned coupled policies to (i) scheduled sampling handling the coupled exposure bias, resulting in up to +10 BLEU score improvements, and (ii) the quality of the oracle actions generated by our algorithmic oracle, which nicely balances translation quality and delay.
2 NPI Approach to SiMT
In this section, we describe our neural programmer-interpreter (NPI) approach to simultaneous machine translation (SiMT). At each time step $t$, the programmer needs to decide whether to read the next source word or to write the next target word of the translation. The interpreter then immediately executes the action generated by the programmer.
The Programmer.
The programmer needs to sequentially decide the next action, given the previous actions $\mathbf{a}_{<t}$, the prefix of the source utterance read so far $\mathbf{x}_{\le i(t)}$, and the prefix of the target translation generated so far $\mathbf{y}_{<j(t)}$. That is, our programmer is modeled as $\pi(a_t \mid \mathbf{a}_{<t}, \mathbf{x}_{\le i(t)}, \mathbf{y}_{<j(t)})$. This sequential decision making is best modeled as a Markov decision process (MDP), where the action set is $\mathcal{A} = \{\mathrm{r}, \mathrm{w}\}$ and the state space is the cross-product of possible sequences of actions from $\mathcal{A}$, source utterance prefixes, and target translation prefixes.
The Interpreter.
The interpreter needs to execute the action generated by the programmer. At time step $t$, if the generated action is Read ($\mathrm{r}$), the next word of the source utterance is read and $i(t+1) = i(t) + 1$. Otherwise, if the generated action is Write ($\mathrm{w}$), the next target word is generated and $j(t+1) = j(t) + 1$. The next target word is generated according to $P(y_j \mid \mathbf{y}_{<j}, \mathbf{x}_{\le i(t)})$. This is best modeled as a sequential decision-making process where the action set consists of the target language’s vocabulary. We realise this component with an auto-regressive NMT model, which generates the next target word given the previously generated words and the observed part of the source utterance. This NMT architecture implicitly conditions on the previous actions through the prefixes of the source utterance and target translation.
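The coupled decision process above can be sketched as a small decoding loop. This is a minimal illustration, not the paper’s implementation: `programmer` and `interpreter` are stand-in callables for the two policies.

```python
# Sketch of the NPI-SiMT generative loop: the programmer chooses Read/Write,
# the interpreter immediately executes the chosen action.

READ, WRITE = "r", "w"

def npi_simt_decode(source, programmer, interpreter, eos="EOS"):
    """Interleave Read/Write actions until the translation is complete."""
    read_prefix, translation, actions = [], [], []
    while True:
        action = programmer(actions, read_prefix, translation)
        # A Read past the end of the source is forced to a Write.
        if action == READ and len(read_prefix) == len(source):
            action = WRITE
        actions.append(action)
        if action == READ:
            read_prefix.append(source[len(read_prefix)])  # consume next source word
        else:
            word = interpreter(read_prefix, translation)   # generate next target word
            translation.append(word)
            if word == eos:
                break
    return translation, actions
```

For instance, a programmer that always reads the full source before writing reduces this loop to ordinary full-sentence translation.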
The Probabilistic Model.
The generative process of NPI-SiMT is outlined in Algorithm 1. The probability of simultaneously generating the translation $\mathbf{y}$ and the sequence of actions $\mathbf{a}$ for a source utterance $\mathbf{x}$ is

$$P(\mathbf{y}, \mathbf{a} \mid \mathbf{x}) = \prod_{t=1}^{|\mathbf{a}|} \pi(a_t \mid \mathbf{a}_{<t}, \mathbf{x}_{\le i(t)}, \mathbf{y}_{<j(t)}) \; P(y_{j(t)} \mid \mathbf{y}_{<j(t)}, \mathbf{x}_{\le i(t)})^{[a_t = \mathrm{w}]}$$
The indices $i(t)$ and $j(t)$ are the number of Read and Write actions in the program up to the time step $t$, i.e. $i(t) = \sum_{t' \le t} [a_{t'} = \mathrm{r}]$ and $j(t) = \sum_{t' \le t} [a_{t'} = \mathrm{w}]$.
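As a concrete illustration of these definitions, the running counts $i(t)$ and $j(t)$ can be computed directly from an action sequence (the function name is ours):

```python
def read_write_counts(actions):
    """Return (i(t), j(t)) for each step t: the number of Read and Write
    actions in the program up to and including step t."""
    i = j = 0
    counts = []
    for a in actions:
        if a == "r":
            i += 1
        else:
            j += 1
        counts.append((i, j))
    return counts
```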
The Neural Architectures.
Our NPI-SiMT architecture is similar to the learning-to-translate-in-real-time framework [7]. We slightly modify their framework to instead use the last hidden state of the encoder, the last hidden state of the decoder, and the embedding of the last generated action as the input to the programmer. Our programmer is a recurrent neural network with a binary softmax activation to produce a probability distribution over Read/Write actions.
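A sketch of this programmer in PyTorch. The hidden size (512), single LSTM layer, and binary softmax follow the text; the action-embedding size and the exact feature wiring are our assumptions, not the authors’ released code.

```python
import torch
import torch.nn as nn

class Programmer(nn.Module):
    """Single-layer LSTM agent over (encoder state, decoder state, action embedding)."""

    def __init__(self, state_size=512, action_emb_size=32, hidden_size=512):
        super().__init__()
        self.action_emb = nn.Embedding(2, action_emb_size)  # 0 = Read, 1 = Write
        self.rnn = nn.LSTMCell(2 * state_size + action_emb_size, hidden_size)
        self.out = nn.Linear(hidden_size, 2)  # binary softmax over {Read, Write}

    def forward(self, enc_state, dec_state, prev_action, hidden):
        feats = torch.cat([enc_state, dec_state, self.action_emb(prev_action)], dim=-1)
        h, c = self.rnn(feats, hidden)
        log_probs = torch.log_softmax(self.out(h), dim=-1)
        return log_probs, (h, c)
```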
Training the Model.
In SiMT, we are interested in not only producing a high quality translation, but also reducing the delay between the times of receiving the source words and generating their translations. Training the model based on this hybrid training objective can be done with algorithms from reinforcement learning (RL) or imitation learning (IL). More specifically, the training objective in RL is the expected reward,

$$\mathcal{J}(\theta) = \mathbb{E}_{(\mathbf{x}, \mathbf{y}^{*}) \sim \mathcal{D}} \, \mathbb{E}_{(\mathbf{a}, \mathbf{y}) \sim P_{\theta}(\cdot, \cdot \mid \mathbf{x})} \big[ R(\mathbf{y}, \mathbf{a}, \mathbf{y}^{*}) \big]$$
where $\theta$ denotes the parameters of our SiMT model, $\mathcal{D}$ is the bilingual parallel training set, $(\mathbf{a}, \mathbf{y})$ are the actions and the corresponding translation generated by the model, and $\mathbf{y}^{*}$ is the ground truth translation. $R$ is the reward function, which is a hybrid of the translation quality and the induced delay. The RL approach has been attempted by [14, 7, 1] for training the programmer; however, it is unstable and inefficient due to exploration. This is exacerbated in our programmer-interpreter approach, as the cross product of the spaces of possible action sequences and translations is enormous, hindering effective and efficient learning of the programmer’s and interpreter’s policies using RL. We thus take the IL approach for sample-efficient, effective, and stable learning of policies in NPI-SiMT.
3 Deep Coupled Imitation Learning
Our goal is to learn a pair of reasonable policies for the programmer and interpreter using IL, which requires addressing the following two challenges. Firstly, this is different from typical IL scenarios, where there is only one policy to learn. As the two policies collaborate in the NPI-SiMT approach, their learning inter-dependency needs to be taken into account; this is addressed in §3.1. Secondly, we need to come up with the oracle program actions for each sentence pair in the training set, i.e. the program which would have been responsible for generating the translation for a source utterance with as low a delay as possible. In §3.2, we address this challenge by proposing an algorithmic oracle which computes the oracle actions using word alignments.
3.1 Learning Robust Coupled Policies
Assuming we have the oracle actions, we can learn the policies for both the programmer and interpreter using behavioural cloning in IL. That is, the model parameters are learned by maximising the likelihood of the oracle actions for both the programmer and interpreter:

$$\max_{\theta} \sum_{(\mathbf{x}, \mathbf{y}^{*}, \mathbf{a}^{*}) \in \mathcal{D}} \log P_{\theta}(\mathbf{y}^{*}, \mathbf{a}^{*} \mid \mathbf{x})$$
This is akin to taking the expectation in the original RL training objective under a point-mass distribution over the oracle actions.
IL with behavioural cloning does not lead to policies that are robust to unseen examples at test time, due to the exposure bias. That is, the agent is only exposed to situations resulting from correct actions at training time, not the ones resulting from likely incorrect actions at test time. Scheduled sampling [3, 13] addresses this issue by exposing the agent to incorrect decisions at training time through perturbation of the oracle decisions, which we extend to learning policy pairs. Crucially, the programmer-interpreter policies need to be robust to incorrect decisions not only in their own trajectories, but also in each other’s trajectory history.
Learning the Programmer.
To train our programmer on a training example with scheduled sampling, we first create the perturbation $(\tilde{\mathbf{a}}, \tilde{\mathbf{y}})$ of the ground truth program and interpreter decisions (described shortly). We then maximise the following training objective:

$$\sum_{t} \log \pi_{\theta}(a^{*}_{t} \mid \tilde{\mathbf{a}}_{<t}, \mathbf{x}_{\le i(t)}, \tilde{\mathbf{y}}_{<j(t)})$$
Based on the generative process described in Algorithm 1, the programmer conditions the generation of Read/Write actions at each time step on the current states of the NMT’s encoder and decoder. Hence, while training the programmer, the Read/Write actions need to be communicated to the interpreter and executed, in order to provide the NMT’s encoder/decoder states for the programmer to condition upon. Crucially, the communicated actions are the ground truth actions $\mathbf{a}^{*}$. The perturbed program and translation are only used as inputs to the recurrent architectures of the programmer and the interpreter’s decoder.
The perturbations of the ground truth translation and program are created as follows. For each element of the ground truth sequence, we first decide randomly whether to perturb that element, according to a perturbation probability. In case of perturbation, we replace the ground truth element by randomly selecting an action from the space of possible actions. The random action can be selected according to various probability distributions, e.g. uniform or the predictive distribution of the NPI-SiMT model.
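The element-wise perturbation can be sketched as follows, here with the uniform variant; `vocab` is the space of possible actions (or target words) and `p` is the perturbation probability:

```python
import random

def perturb(sequence, vocab, p, rng=random):
    """Scheduled-sampling perturbation: each element is independently replaced,
    with probability p, by an element drawn uniformly from `vocab`."""
    return [rng.choice(vocab) if rng.random() < p else x for x in sequence]
```

Sampling the replacement from the model’s predictive distribution instead of `vocab` is a drop-in change to `rng.choice`.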
Learning the Interpreter.
The interpreter needs to be robust with respect to incorrect actions in the previously generated words of the translation, as well as in the Read/Write actions generated by the programmer. Thus, the training objective for the interpreter is

$$\sum_{j} \log P_{\theta}(y^{*}_{j} \mid \tilde{\mathbf{y}}_{<j}, \mathbf{x}_{\le \tilde{i}(j)})$$

where $\tilde{i}(j)$ counts the Read actions in the perturbed program before the $j$-th Write.
Here $(\tilde{\mathbf{a}}, \tilde{\mathbf{y}})$ is the perturbed version of the ground truth program and translation for the training data point. When generating the program perturbation, we need to make sure that it is valid with respect to the sentence pair, i.e. it respects the number of Read/Write actions for the words in the source/target sentence. While generating the perturbed program, if the number of Write actions matches the length of the target sentence, we terminate the generative process of NPI-SiMT, even if the words of the source sentence have not been fully read. On the other hand, if the program contains enough Read actions to fully read the source sentence, we only generate Write actions from then on, in order to fully generate the target sentence and terminate the generative process of NPI-SiMT.
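A sketch of how a perturbed program can be made valid with respect to a sentence pair, following the two termination rules above (the function name is ours):

```python
def legalise_program(program, src_len, tgt_len):
    """Clip a (possibly perturbed) program so it is valid for a sentence pair:
    stop once tgt_len Writes have been emitted, and once src_len Reads have
    occurred emit only Writes until the target is fully generated."""
    out, reads, writes = [], 0, 0
    for a in program:
        if writes == tgt_len:
            break
        if a == "r" and reads == src_len:
            a = "w"  # source exhausted: only Writes remain
        out.append(a)
        reads += a == "r"
        writes += a == "w"
    while writes < tgt_len:  # pad with Writes if the program ran short
        out.append("w")
        writes += 1
    return out
```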
3.2 Oracle Program Actions
The success of imitation learning is determined by the quality of its teacher oracle. To generate such an oracle, we note that the generation of a target word $y_j$ depends on the source prefix $\mathbf{x}_{\le i}$ read so far. More source words provide more context, but induce more delay in the system. Our oracle measures the appropriate number of source words needed for translating a particular target word. That is, for each $y_j$, we determine the key source word that needs to be read before starting its translation.
One way to determine this key source word is to make use of traditional word alignments [4], which capture a strong relationship between parallel sentences. Using these word alignments, we deduce an algorithm to create our NPI-SiMT oracle. For each target token $y_j$:
- Read the source until the key source word of $y_j$ is read.
- Write $y_j$.
The first step of this algorithm can be skipped if the key source word has already been read in a previous iteration, or if there is no source word aligned to the current target word. This algorithm ensures that, when writing $y_j$, the interpreter has enough information to compute a correct context vector.
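The algorithmic oracle can be sketched in a few lines. The `(src_idx, tgt_idx)` alignment-pair format is an assumption about the alignment tool’s output; choosing the furthest aligned source word as the key word follows the footnote on multiple alignments.

```python
def oracle_actions(alignment, tgt_len):
    """Derive oracle Read/Write actions from word alignments.

    `alignment` is a collection of 0-based (src_idx, tgt_idx) pairs, e.g. from
    a statistical aligner. For each target word, the key source word is the
    furthest aligned source word; unaligned target words need no extra Reads.
    """
    key = {}
    for s, t in alignment:
        key[t] = max(key.get(t, -1), s)  # keep the furthest aligned source word

    actions, reads = [], 0
    for t in range(tgt_len):
        needed = key.get(t, -1) + 1      # source words required before writing y_t
        while reads < needed:            # Read until the key word is available
            actions.append("r")
            reads += 1
        actions.append("w")              # Write y_t
    return actions
```

For a monotone alignment {(0,0), (1,1)} this yields an alternating program, while a cross-alignment {(1,0), (0,1)} forces both Reads before the first Write.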
4 Experiments
Datasets and Oracle.
We conduct translation experiments on five language pairs, all translating into English. We choose the German (DE), Czech (CS), and Arabic (AR) corpora from the IWSLT 2016 translation dataset.
To generate the oracle, we use the fast-align toolkit to produce the word alignments.
Both the programmer and the interpreter employ a single RNN as their core component. Our interpreter is a standard attentional NMT architecture with an MLP attention function and input feeding. No bridge is used. We use a single-layer left-to-right unidirectional long short-term memory (LSTM) network for the programmer, and for both the interpreter’s encoder and decoder. We use a hidden unit size of 512 and a standard dropout probability of 0.2 at the LSTM output. A dropout probability of 0.1 is employed at the interpreter’s word embeddings. We built our SiMT framework using the PyTorch toolkit.
We train the programmer and the interpreter jointly using the Adam optimizer, minimizing their joint negative log likelihood (NLL) during training. For model selection, however, we use only the interpreter’s NLL, halving the optimizer’s learning rate each time the interpreter’s NLL increases on the development set. Early stopping is reached after the fourth learning rate decay. During sequence generation, we use beam search with a beam size of 5, dividing a sequence’s accumulated log-probability by its length for length normalization during search.
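One plausible reading of this schedule, sketched as code; `step_fn` and `eval_dev_nll` are placeholders for the actual training and evaluation routines:

```python
def train(step_fn, eval_dev_nll, lr=1e-3, max_epochs=50, max_decays=4):
    """Halve the learning rate each time the dev NLL increases over its best
    value; stop once a further increase occurs after `max_decays` halvings.
    `step_fn(lr)` runs one epoch of updates; `eval_dev_nll()` returns dev NLL."""
    best, decays = float("inf"), 0
    for _ in range(max_epochs):
        step_fn(lr)
        nll = eval_dev_nll()
        if nll > best:
            if decays == max_decays:
                break  # early stopping after the final decay
            lr /= 2
            decays += 1
        else:
            best = nll
    return lr, decays
```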
|DE→EN||CS→EN||AR→EN||BG→EN||RO→EN|
The valid action perturbation is constructed by (i) selecting a subset of indices of the oracle program by tossing a coin independently for each index, and (ii) replacing the action at each selected index with a randomly sampled action.
We evaluate the SiMT systems based on their translation quality and delay. Translation quality is measured by case-sensitive BLEU score using SacreBLEU, and delay is measured by Average Proportion (AP) and Average Lagging (AL).
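For reference, a sketch of Average Lagging (AL); this follows the standard definition in the SiMT literature, not necessarily the exact variant used here. `g[t]` is the number of source words read before writing target word `t+1`.

```python
def average_lagging(g, src_len, tgt_len):
    """AL = (1/tau) * sum_{t=1}^{tau} g(t) - (t-1)/gamma, with gamma = |y|/|x|
    and tau the first step whose Write already has the full source available."""
    gamma = tgt_len / src_len
    tau = next((t + 1 for t, gt in enumerate(g) if gt >= src_len), tgt_len)
    return sum(g[t] - t / gamma for t in range(tau)) / tau
```

For balanced lengths, a wait-1 schedule (g = [1, 2, 3] for a 3-word pair) gives an AL of 1, matching the intuition that the system lags one word behind.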
We compare against the wait-$k$ baseline [10], where the programmer’s policy begins with $k$ Reads, followed by alternating Write and Read actions until the source sentence is exhausted or the EOS symbol is written. If the source sentence is exhausted, only Write actions are emitted. This baseline is superior in terms of quality compared to RL approaches, and can be tuned for the desired delay. For a fair comparison, we compare the accuracy of our proposed method with the wait-$k$ system at the same delay.
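The wait-$k$ action schedule described above can be written down directly (the function name is ours):

```python
def wait_k_actions(k, src_len, tgt_len):
    """wait-k policy: k initial Reads, then alternate Write/Read; once the
    source is exhausted, emit only Writes."""
    actions, reads = [], 0
    for _ in range(min(k, src_len)):  # initial waiting phase
        actions.append("r")
        reads += 1
    for t in range(tgt_len):
        actions.append("w")
        if t < tgt_len - 1 and reads < src_len:  # alternate while source remains
            actions.append("r")
            reads += 1
    return actions
```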
|wait-2||R(_Aber, _wenn) W(_) R(_wir) W(B) R(_die) W(ut) R(_Zusammensetzung) W(_if) R(_des) W(_we) R(_Erd) W(_look) R(boden) W(_at) R(s) W(_the) R(_nicht) W(_composition) R(_ändern) W(_of) R(,) W(_) R(_werden) W(E) R(_wir) W(ar) R(_das) W(th) R(_nie) W(’) R(_tun) W(s) R(.) W(_ground) R(EOS) W(,, _we, _never, _will, ., EOS)|
|ss-both||R(_Aber) W(_) R(_wenn) W(B) R(_wir) W(ut) R(_die, _Zusammensetzung, _des, _Erd, boden, s, _nicht, _ändern, ,) W(_if, _we, _don, ’, t, _change, _the, _composition, _of, _the) R(_werden, _wir) W(_soil ,,) R(_das, nie) W(_we) R(_tun, ., EOS) W(_will, _never, _do, _that, .) W(EOS)|
|Oracle||R(_Aber) W(_but) R(_wenn) W(_if) R(_wir) W(_we) R(_die, _Zusammensetzung,_des, _Erd, boden, s, _nicht) W(_don, ’, t) R(_ändern, ,) W(_change, _the, _composition, _of, _the) R(_werden, _wir) W(_soil, ,, _we, _will) R(_das, _nie) W(_never) R(_tun, ., EOS) W(_do, _this, ., EOS)|
Oracle Policy vs Wait-$k$ Policy.
Figure 1 compares the policies trained by our algorithmic oracle with those trained using the wait-$k$ policy, when translating from German/Czech/Romanian into English. In each of these three plots, the policy trained using the oracle actions corresponds to the leftmost point on the dashed line. As can be seen, the policy trained using the oracle actions compares favorably with those trained using the wait-$k$ method in terms of the combination of translation quality (higher is better) and translation delay (lower is better).
We are further interested in investigating the effect of increasing the delay of the oracle policy, in a controlled manner, on the translation quality of the trained systems. As such, we increase the delay of the oracle policy by moving the last Read action in the oracle program to the beginning of the program. For larger delays, we repeat this process as many times as needed. We expect the delayed oracle programs to lead to trained policies with better translation quality, at the expense of increased delay. The points on the dashed lines in Figure 1 correspond to policies trained using increasingly delayed versions of the oracle program. Note that adding zero delay (the first point on the dashed line) corresponds to the system trained using the original oracle program. As can be seen, the policies trained with the delayed versions of the oracle program consistently outperform the wait-$k$ policies.
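The delay-injection procedure, moving the last Read action to the front of the program, can be sketched as:

```python
def delay_program(actions, extra_delay):
    """Increase the oracle's delay by moving the last Read action to the
    front of the program, repeated `extra_delay` times."""
    actions = list(actions)
    for _ in range(extra_delay):
        if "r" not in actions:
            break
        last_read = len(actions) - 1 - actions[::-1].index("r")
        actions.insert(0, actions.pop(last_read))
    return actions
```

Note that once the program starts with all its Reads, further applications leave it unchanged, so the maximum achievable delay is that of full-sentence translation.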
We are now interested in the effect of scheduled sampling (SS) on learning coupled policies in NPI-SiMT.
For this purpose, we train four versions of our model, applying SS to both the programmer and the interpreter, only the programmer, only the interpreter, or neither. Table 1 shows the results when translating from German/Czech/Arabic/Bulgarian/Romanian into English.
We further put the results in the context of the policies trained using the wait-$k$ method.
The NPI-SiMT system trained using our proposed oracle (no-SS) was able to learn from the low-delay oracle, as its natural delay (AL and AP) is generally as low as the delay of the oracle during training.
However, it is clearly difficult to perfectly predict the oracle at test time.
We now address the difficulty of translating SOV inputs when translating from German to English. Table 2 shows a comparison of the wait-2 system, our trained system, and the gold oracle in translating a German sentence at a delay of two. The wait-2 system, which lacks the source input “ändern”, was forced to guess the verb “look” after reading the word “Zusammensetzung”, which means “composition”. Moreover, it also lacks the insight that the input is a negative sentence, indicated by “nicht”, which comes just before the correct verb. It is difficult for any system to correctly predict negation without seeing the actual input. Our proposed system is able to wait for enough input, and successfully translated the German phrase “Zusammensetzung des Erdbodens nicht ändern” into “don’t change the composition of the soil”. We believe this phrase is linguistically difficult to translate in part because most of the alignments between the parallel phrases are cross-alignments.
5 Related Work
[14, 7, 1] formulate simultaneous NMT as a sequential decision making problem where an agent interacts with the environment (i.e. the underlying NMT model) through Read/Write actions. They pre-train the NMT system, while the agent’s policy is trained using deep RL. [10, 6] train the NMT model with respect to fixed wait-$k$ and/or simple policies, while [2] jointly trains an adaptive policy and re-trains the underlying NMT system. Another approach produces oracle Read/Write actions using a pre-trained NMT model, which are then used to train an adaptive agent based on supervised learning, i.e. behavioural cloning in imitation learning. That work differs from ours in that: (i) it does not use word alignments to produce the oracle actions, and (ii) it does not make use of scheduled sampling. [19] adds a “delay” token, equivalent to the Read action, to the target vocabulary and trains one model which integrates both the translation model and the agent. The supervision is provided using actions of a fixed oracle.
Simultaneous Speech Translation.
[11, 12, 2] consider the problem of simultaneous speech-to-text translation, where the system needs to produce a textual translation of the gradually incoming foreign speech. This scenario allows for revising the incremental translation, hence evaluation needs to account for translation quality, delay, and stability. [16] considers the problem of simultaneous translation of source-language speech to target-language speech, unlike the majority of the work in the literature on full-sentence speech translation, which results in too much delay; e.g. [9, 8].
6 Conclusion
This paper proposes a simple and effective way to train a simultaneous translation system that uses minimal delay to translate the target sentence. The delay is determined based on the number of source words needed to translate a particular target token. The word alignments used to generate this oracle are effective, as our proposed system is superior to the wait-k baseline in the same delay bracket. This oracle is cheap and effective for future imitation-learning-based simultaneous translation research.
Moreover, we also show the importance of scheduled sampling when training the system. This is a crucial component, as the exposure bias induced in training multiple agents (the programmer and the interpreter) is more severe. Our regularization increased the BLEU score of the non-regularized system by up to 10 points, while also slightly reducing the delay.
- The code will be released upon publication.
- Note that there can be more than one source word aligned to the target. We choose the furthest source word as the “key” word in this case.
- Except the first Read and the last Write, whose positions we fix. This prevents the final EOS from being perturbed, and avoids having a Write before the first Read.
- Our preliminary experiments comparing the generated actions with the gold oracle result in around 70% BLEU score.
- (2018) Prediction improves simultaneous neural machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Cited by: §1, §2, §5.
- (2019) Monotonic infinite lookback attention for simultaneous machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Cited by: §5, §5.
- (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama and R. Garnett (Eds.), pp. 1171–1179. Cited by: §1, §3.1.
- (1993) The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19 (2), pp. 263–311. Cited by: §1, §3.2, §3.
- (2019) Thinking slow about latency evaluation for simultaneous machine translation. arXiv preprint arXiv:1906.00048. Cited by: §4.1.
- (2018) Incremental decoding and training methods for simultaneous translation in neural machine translation. In Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 493–499. Cited by: §1, §5.
- (2017-04) Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain, pp. 1053–1062. External Links: Cited by: §1, §2, §2, §4.2, §5.
- (2019) Direct speech-to-speech translation with a sequence-to-sequence model. ArXiv abs/1904.06037. Cited by: §5.
- (1997) Janus-iii: speech-to-speech translation in multiple languages. In 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 99–102. Cited by: §5.
- (2019-07) STACL: simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3025–3036. External Links: Cited by: §1, §4.1, §5.
- (2016) Dynamic transcription for low-latency speech translation. In 17th Annual Conference of the International Speech Communication Association (Interspeech), pp. 2513–2517. Cited by: §5.
- (2018) Low-latency neural speech translation. In 19th Annual Conference of the International Speech Communication (Interspeech), pp. 1293–1297. Cited by: §5.
- (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635. Cited by: §3.1.
- (2016) Simultaneous machine translation using deep reinforcement learning. In Proceedings of the Abstraction in Reinforcement Learning Workshop, Cited by: §1, §2, §5.
- (2019) Recent advances in imitation learning from observation. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 6325–6331. Cited by: §3.1.
- (2020) SimulS2S: end-to-end simultaneous speech to speech translation. External Links: Cited by: §5.
- (2019-07) Bridging the gap between training and inference for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4334–4343. External Links: Cited by: §4.1.
- (2019) Simpler and faster learning of adaptive policies for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 1349–1354. Cited by: §1, §4.1.
- (2019) Simultaneous translation with flexible policy via restricted imitation learning. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 5816–5822. Cited by: §1, §5.