Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder


Fast inference speed is an important goal towards real-world deployment of speech translation (ST) systems. End-to-end (E2E) models based on the encoder-decoder architecture are more suitable for this goal than traditional cascaded systems, but their effectiveness regarding decoding speed has not been explored so far. Inspired by recent progress in non-autoregressive (NAR) methods in text-based translation, which generate target tokens in parallel by eliminating conditional dependencies, we study the problem of NAR decoding for E2E-ST. We propose a novel NAR E2E-ST framework, Orthros, in which both NAR and autoregressive (AR) decoders are jointly trained on the shared speech encoder. The latter is used for selecting a better translation among candidates of various lengths generated by the former, which dramatically improves the effectiveness of a large length beam with negligible overhead. We further investigate effective length prediction methods from speech inputs and the impact of vocabulary sizes. Experiments on four benchmarks show the effectiveness of the proposed method in improving inference speed while maintaining competitive translation quality compared to state-of-the-art AR E2E-ST systems.


Hirofumi Inaguma1, Yosuke Higuchi2, Kevin Duh3, Tatsuya Kawahara1, Shinji Watanabe3
1Kyoto University, Japan; 2Waseda University, Japan; 3Johns Hopkins University, USA

Keywords: End-to-end speech translation, non-autoregressive decoding, conditional masked language model

1 Introduction

There is a growing interest in speech translation [33] due to the increase in demand for international communications. The goal is to transform speech from one language to text in another language. Traditionally, the dominant approach is to cascade automatic speech recognition (ASR) and machine translation (MT) systems. Thanks to recent progress in neural approaches, researchers are shifting to the development of end-to-end speech translation (E2E-ST) systems [3], aiming to optimize a direct mapping from source speech to target text translation by bypassing ASR components. The potential advantages of E2E modeling are (a) mitigation of error propagation from incorrect transcriptions, (b) low-latency inference, and (c) applications in endangered language documentation [5]. However, most efforts have been devoted to investigating methods to improve translation quality by the use of additional data [21, 38, 50, 51, 18, 9, 39, 44, 45], better parameter initialization [4, 2, 20], and improved training methods [53, 31].

For applications of ST systems in lectures and dialogues, inference speed is also an essential factor from the user’s perspective [34]. Although E2E models are more suitable for low-latency inference than cascaded models since the ASR decoder and the MT encoder processing can be skipped, their effectiveness regarding inference speed has not been well studied. Moreover, the incremental left-to-right token generation of autoregressive (AR) E2E models increases computational complexity, so these models still suffer from slow inference. To speed up inference while achieving comparable translation quality to AR models, non-autoregressive (NAR) decoding has been investigated in text-based translation [14, 29, 22, 30, 11, 15, 47, 32, 12]. NAR inference generates tokens in parallel by eliminating conditional dependencies between tokens.

Motivated by this, we propose a novel NAR framework, Orthros, for the E2E-ST task. Orthros has dual decoders on the shared speech encoder: NAR and AR decoders. The NAR decoder is used for fast token generation, and the AR decoder selects a better translation among multiple length candidates during inference. As the AR decoder can rescore all tokens in parallel and reuse encoder outputs, its overhead is minimal. This architecture design is motivated by the difficulty of estimating a suitable target length from speech in advance. We adopt the conditional masked language model (CMLM) [11] for the NAR decoder, in which a subset of tokens is repeatedly updated by partial masking over a constant number of iterations. We also use semi-autoregressive training (SMART) to alleviate mismatches between training and testing conditions [13]. Conceptually, however, any NAR decoder can be used in Orthros. We also study effective length prediction methods and the impact of vocabulary sizes.

Experiments on four benchmark corpora show that the proposed framework achieves comparable translation quality to state-of-the-art AR E2E- and cascaded-ST models with approximately 2.31× and 4.61× decoding speed-ups, respectively. Interestingly, the best NAR model can even outperform AR models in terms of the BLEU score in some cases. To the best of our knowledge, this work is the first study of NAR models for the E2E-ST task.

Reference: but with that and and when we bought it i thought who’s gonna put a you know a walkman or something and that’s what i’m doing now
Predictions at successive iterations (earliest first):
  he came with that and when we bought it i thought who going to put you know know ak or or something and now i’m doing it
  he came with that and when we bought it i thought who he going to put you know akman or something and now i’m doing it
  he came with that and when we bought it i thought who is going to put you know a walkman or something and now i’m doing it
Table 1: An example of Mask-Predict decoding on the Fisher-dev set. Highlighted tokens are masked for the next iteration.

2 Background

2.1 End-to-end Speech Translation (E2E-ST)

E2E-ST is formulated as a direct mapping problem from input speech $X$ in a source language to the target translation text $Y = (y_1, \dots, y_N)$. E2E models can be implemented with any encoder-decoder architecture, and we adopt the state-of-the-art Transformer model [49]. A conventional E2E-ST model is composed of a speech encoder and an autoregressive (AR) translation decoder, which decomposes the probability distribution of $Y$ into a chain of conditional probabilities from left to right as:

$$P_{\mathrm{ar}}(Y \mid X) = \prod_{i=1}^{N} P_{\mathrm{ar}}(y_i \mid y_{<i}, X). \qquad (1)$$

Parameters are optimized with a single translation objective after pre-training with the ASR and MT tasks [4].

2.2 Conditional masked language model (CMLM)

Since AR models generate target tokens incrementally during inference, they suffer from high inference latency and do not fully leverage the computational power of modern hardware such as GPUs. To tackle this problem, parallel sequence generation with non-autoregressive (NAR) models has been investigated in a wide range of tasks such as text-to-speech synthesis [35, 42], machine translation [14], and speech recognition [6, 17, 10, 1, 48]. NAR models factorize conditional probabilities in Eq. (1) by assuming conditional independence for every target position. However, iterative refinement methods generally achieve better quality than pure NAR methods at the cost of speed because of multiple forward passes [29, 11, 16]. Among them, the conditional masked language model (CMLM) [11] is a natural choice because of its simplicity and good performance. Moreover, quality can be flexibly traded for speed during inference by changing the number of iterations $T$.

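In CMLM, the partial decoder inputs are masked by replacing them with a unique token [MASK] based on confidence. Intermediate discrete variables $\hat{Y}_{\mathrm{mask}}^{(t)}$ at the $t$-th iteration ($0 \le t < T$) are iteratively refined given the remaining observed tokens $\hat{Y}_{\mathrm{obs}}^{(t)}$ as:

$$P_{\mathrm{cmlm}}(\hat{Y}_{\mathrm{mask}}^{(t)} \mid \hat{Y}_{\mathrm{obs}}^{(t)}, X) = \prod_{\hat{y}_i \in \hat{Y}_{\mathrm{mask}}^{(t)}} P_{\mathrm{cmlm}}(\hat{y}_i \mid \hat{Y}_{\mathrm{obs}}^{(t)}, X). \qquad (2)$$

The target length distribution for $Y$ is typically modeled by a linear classifier stacked on the encoder.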

However, NAR models suffer from a multimodality problem, where a distribution of multiple correct translations must be modeled given the same source sentence. Recent studies reveal that sequence-level knowledge distillation [23] makes training data deterministic and mitigates this problem [14, 54, 43].

3 Proposed method: Orthros

3.1 Model architecture

Typical text-based NAR models generate target sentences with multiple lengths in parallel to improve quality in a stochastic [14] or deterministic [11] way, followed by an optional rescoring step with a separate AR model [14]. However, a spoken utterance consists of hundreds of acoustic frames even after downsampling, and its length varies significantly with the speaking rate and silence duration. Therefore, it is challenging to accurately estimate the target length from speech in advance. Moreover, the extra computation and memory consumption for feature encoding with a separate AR model in rescoring are not negligible. This motivated us to propose Orthros, which has dual decoders on top of the shared encoder: an NAR decoder for fast token generation and an AR decoder for selecting among the candidates produced by the NAR decoder. This way, re-encoding speech frames is unnecessary, and the AR decoder greatly improves the effectiveness of using a large length beam. The speech encoder is identical to that of AR models. A length predictor and a CTC ASR layer for the auxiliary ASR task are also built on the same encoder.

Our NAR decoder is based on the conditional masked language model (CMLM) [11]. One distinct advantage over pure NAR models [14] is that CMLM removes the need for a copy of the source sentence to initialize decoder inputs. In speech translation, such a copy could only be obtained from a predicted transcription from the auxiliary ASR sub-module, which would contradict the motivation of avoiding ASR errors.

3.2 Inference

The inference of the CMLM is based on the Mask-Predict algorithm [11]. Let $T$ be the number of iterations, $\hat{N}$ be a predicted target sequence length, and $\hat{Y}_{\mathrm{mask}}^{(t)}$ and $\hat{Y}_{\mathrm{obs}}^{(t)}$ be masked and observed tokens in the prediction $\hat{Y}^{(t)}$ ($|\hat{Y}^{(t)}| = \hat{N}$) at the $t$-th iteration ($0 \le t < T$), respectively. At the initial iteration $t = 0$, all tokens are initialized with [MASK]. An example is shown in Table 1.

The mask-predict algorithm performs two operations, mask and predict, at every iteration $t$. In the mask operation, given predicted tokens at the previous iteration, $\hat{Y}^{(t-1)}$, we mask $k$ tokens having the lowest confidence scores, where $k$ is a linear decay function $k = \lfloor \hat{N} \cdot \frac{T - t}{T} \rfloor$. In the predict operation, we take the most probable token from a posterior probability distribution $P_{\mathrm{cmlm}}$ at every masked position $i$ and update $\hat{Y}_{\mathrm{mask}}^{(t)}$ as:

$$\hat{y}_i^{(t)} = \operatorname*{arg\,max}_{w \in \mathcal{V}} P_{\mathrm{cmlm}}(y_i = w \mid \hat{Y}_{\mathrm{obs}}^{(t)}, X), \qquad (3)$$

where $\mathcal{V}$ is the vocabulary. When using SMART, described in Section 3.3.1, all tokens are updated in Eq. (3) if they differ from those at the previous iteration. Furthermore, we generate target sentences having different lengths in parallel and select the most probable candidate by calculating the average log probability over all tokens at the last iteration: $\frac{1}{\hat{N}} \sum_{i=1}^{\hat{N}} \log p_i$, where $p_i$ is the confidence score of the $i$-th token.
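The mask and predict operations above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: iterations are indexed from 0 as in Mask-Predict [11], `predict_fn` is a hypothetical stand-in for the CMLM decoder that returns a (token, probability) pair for every position, and `toy_predict` is a made-up model whose confidence grows with the number of observed tokens.

```python
import math

MASK = "[MASK]"

def num_to_mask(n_tokens, t, total_iters):
    """Linear decay of the number of masked tokens (t is 0-based, as in [11])."""
    return math.floor(n_tokens * (total_iters - t) / total_iters)

def mask_predict(predict_fn, n_tokens, total_iters):
    """Decode one length candidate with a fixed number of iterations."""
    tokens = [MASK] * n_tokens          # t = 0: every position is masked
    probs = [0.0] * n_tokens
    for t in range(total_iters):
        if t > 0:
            # Mask: re-mask the k positions with the lowest confidence.
            k = num_to_mask(n_tokens, t, total_iters)
            lowest = sorted(range(n_tokens), key=lambda i: probs[i])[:k]
            for i in lowest:
                tokens[i] = MASK
        # Predict: fill in masked positions only; observed tokens are kept.
        predictions = predict_fn(tokens)
        for i, (token, prob) in enumerate(predictions):
            if tokens[i] == MASK:
                tokens[i], probs[i] = token, prob
    # Average log probability, usable for ranking length candidates.
    avg_log_prob = sum(math.log(p) for p in probs) / n_tokens
    return tokens, avg_log_prob

def toy_predict(tokens):
    """Hypothetical decoder: always proposes the same words, with a
    confidence that increases as more tokens are observed."""
    target = ["he", "came", "with", "that"]
    n_observed = sum(tok != MASK for tok in tokens)
    confidence = min(0.5 + 0.1 * n_observed, 0.9)
    return [(word, confidence) for word in target]

hypothesis, score = mask_predict(toy_predict, n_tokens=4, total_iters=4)
```

With 10 tokens and 4 iterations, `num_to_mask` yields 10, 7, 5, and 2 masked tokens at iterations 0 to 3, so later passes revise progressively fewer positions.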

For target length prediction, we take the top-$l$ length candidates, where $l$ is the length beam width, from a linear classifier conditioned on time-averaged encoder outputs. We also study a simple scaling method that uses the CTC outputs of the auxiliary ASR task.
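As a rough sketch of the classifier-based length prediction, the snippet below averages encoder outputs over time, applies a linear layer, and keeps the highest-scoring lengths. The function name, the plain-list representation, and the convention that class index equals target length are assumptions made for illustration only.

```python
def top_length_candidates(encoder_outputs, weight, bias, beam):
    """Return the `beam` most probable target lengths.

    encoder_outputs: list of frames, each a list of floats (time x dim).
    weight, bias: parameters of a hypothetical linear classifier with one
    output class per candidate length (class index = length).
    """
    n_frames = len(encoder_outputs)
    dim = len(encoder_outputs[0])
    # Time-average the encoder outputs into a single vector.
    pooled = [sum(frame[d] for frame in encoder_outputs) / n_frames
              for d in range(dim)]
    # Linear classifier: one logit per candidate length.
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weight, bias)]
    # Ranking by logit equals ranking by softmax probability.
    ranked = sorted(range(len(logits)), key=lambda n: logits[n], reverse=True)
    return ranked[:beam]

enc = [[1.0, 0.0], [3.0, 2.0]]                      # 2 frames, 2 dims
weight = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0, 0.0]
lengths = top_length_candidates(enc, weight, bias, beam=2)
```

Each returned length would then seed one fully masked candidate that is decoded in parallel with the others.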

Candidate selection with AR decoder

Using multiple length candidates is effective for improving quality, but it is sub-optimal to directly use sequence-level scores from the NAR decoder's posterior in Eq. (3) because they are stale [11]. Therefore, we propose to select the most probable translation among the length candidates after the last iteration by using log probability scores from the AR decoder averaged over all tokens. Note that we do not use scores from the NAR decoder here. Since the AR decoder can rescore all tokens of a candidate in parallel, it retains the advantage of parallelism in self-attention.
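The selection rule can be illustrated as follows. This is a sketch under assumptions: `ar_token_probs` is a hypothetical callable standing in for one teacher-forced pass of the AR decoder, returning the probability it assigns to each token of a finished candidate, and the toy probabilities are invented.

```python
import math

def select_candidate(candidates, ar_token_probs):
    """Pick the candidate with the highest length-normalized AR log probability."""
    def avg_log_prob(candidate):
        probs = ar_token_probs(candidate)
        return sum(math.log(p) for p in probs) / len(probs)
    return max(candidates, key=avg_log_prob)

# Invented per-token AR probabilities for two length candidates.
toy_scores = {
    ("who", "going", "to", "put"): [0.9, 0.3, 0.8, 0.7],
    ("who", "is", "going", "to", "put"): [0.9, 0.8, 0.9, 0.9, 0.8],
}
candidates = [list(c) for c in toy_scores]
best = select_candidate(candidates, lambda c: toy_scores[tuple(c)])
```

Averaging over tokens keeps the comparison fair across candidates of different lengths; a plain sum of log probabilities would favor shorter outputs.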

3.3 Training

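The training objective of the CMLM can be formulated as follows:

$$\mathcal{L}_{\mathrm{cmlm}} = -\sum_{y_i \in Y_{\mathrm{mask}}} \log P_{\mathrm{cmlm}}(y_i \mid Y_{\mathrm{obs}}, X), \qquad (4)$$

where $Y_{\mathrm{mask}} \subset Y$ and $Y_{\mathrm{obs}} = Y \setminus Y_{\mathrm{mask}}$. We sample the number of masked tokens from a uniform distribution, $\mathcal{U}(1, N)$, following [11].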
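A minimal sketch of this masking scheme and loss is given below. It assumes, following [11], that the number of masked tokens is drawn uniformly between 1 and the target length; `token_prob` is a hypothetical stand-in for the decoder's probability of the reference token at a masked position.

```python
import math
import random

MASK = "[MASK]"

def sample_mask(n_tokens, rng):
    """Draw the number of masked tokens uniformly from 1..n_tokens,
    then choose that many positions at random."""
    n_mask = rng.randint(1, n_tokens)  # both ends inclusive
    return set(rng.sample(range(n_tokens), n_mask))

def cmlm_loss(target, masked_positions, token_prob):
    """Negative log-likelihood summed over masked positions only.

    token_prob(i, observed) is a hypothetical model call returning the
    probability of target[i] given the partially observed input.
    """
    observed = [MASK if i in masked_positions else tok
                for i, tok in enumerate(target)]
    return -sum(math.log(token_prob(i, observed))
                for i in sorted(masked_positions))

rng = random.Random(0)
target = ["he", "came", "with", "that", "walkman"]
mask = sample_mask(len(target), rng)
loss = cmlm_loss(target, mask, lambda i, observed: 0.5)
```

With the constant toy probability of 0.5, the loss equals the number of masked tokens times log 2, which makes it easy to see that observed positions contribute nothing.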

| Type | ID | Model | Must-C De | Must-C Fr | Fsh-test | CH-evltest | Libri-trans | Latency (ms) | Speedup |
|---|---|---|---|---|---|---|---|---|---|
| E2E AR | A1 | Transformer () | 21.54 | 32.26 | 48.38 | 18.07 | 16.52 | 175 | 1.54 |
| | | Transformer () | 23.12 | 33.84 | 48.49 | 18.90 | 16.84 | 271 | |
| | A2 | Transformer + Seq-KD () | 23.88 | 33.92 | 50.34 | 19.09 | 15.91 | | |
| | | Transformer + Seq-KD () | 24.43 | 34.57 | 50.32 | 19.81 | 16.44 | | |
| NAR | N1 | CTC () | 19.40 | 27.38 | 45.97 | 15.91 | 12.10 | 13 | 20.84 |
| | N2 | Orthros (CMLM, ) | 18.78 | 25.99 | 46.03 | 16.71 | 12.90 | | |
| | | Orthros (CMLM, +AR) | 19.62 | 27.77 | 47.80 | 18.28 | 13.69 | | |
| | | Orthros (CMLM, ) | 20.89 | 28.74 | 48.56 | 18.60 | 14.68 | | |
| | | Orthros (CMLM, +AR) | 21.79 | 30.31 | 49.98 | 19.71 | 15.43 | | |
| | N3 | Orthros (SMART, ) | 20.03 | 27.22 | 45.89 | 17.39 | 14.17 | 46 | 5.89 |
| | | Orthros (SMART, +AR) | 21.08 | 29.30 | 48.73 | 19.25 | 14.99 | 61 | 4.44 |
| | | Orthros (SMART, ) | 21.25 | 29.31 | 47.09 | 18.25 | 15.11 | 99 | 2.73 |
| | | Orthros (SMART, +AR) | 22.27 | 31.07 | 50.07 | 20.10 | 16.08 | 111 | 2.44 |
| | N4 | + BPE8k→16k | 22.88 | 32.20 | 50.18 | 19.88 | 16.22 | 117 | 2.31 |
| | N5 | + large (SMART, +AR, ) | 22.54 | | | | | 59 | 4.59 |
| | | + large (SMART, +AR, ) | 23.92 | | | | | 113 | 2.39 |
| Cascade AR | A3 | AR ASR () → AR MT () | 22.20 | 31.67 | 40.94 | 19.15 | 16.44 | 154 + 166 | 0.84 |
| | | AR ASR () → AR MT () | 23.30 | 33.40 | 42.05 | 19.77 | 16.52 | 333 + 207 | 0.50 |

Table 2: BLEU scores of AR and NAR methods on the tst-COMMON sets of Must-C (En→De and En→Fr), Fisher-test (Fsh-test) and CallHome-evltest (CH-evltest) sets of Fisher-CallHome Spanish (Es→En), and the test set of Libri-trans (En→Fr). Seq-KD represents sequence-level knowledge distillation. Latency is measured as average decoding time per sentence on Must-C En→De, with batch size 1.

3.3.1 Semi-autoregressive training (SMART)

To bridge the gap between training and test conditions in the CMLM, we adopt semi-autoregressive training (SMART) [13]. SMART uses two forward passes to calculate the cross-entropy (CE) loss. In the first pass, the CMLM generates predictions at all positions, $\hat{Y}$, given partially-observed ground-truth tokens as in the original training process. The gradient flow is truncated in the first pass [13]. Then, a subset of tokens in $\hat{Y}$ is masked again with a new mask. The resulting observed tokens are fed into the decoder as inputs in the second pass. Unlike the original training, the CE loss is calculated with predictions at all positions in the second pass.
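The two-pass data flow of SMART can be sketched as below. This only illustrates how the second-pass inputs and labels are formed; `predict_fn` is a hypothetical stand-in for the CMLM decoder run without gradients, and drawing the second mask in the same uniform way as the first is an assumption rather than a detail given in the text.

```python
import random

MASK = "[MASK]"

def random_mask(n_tokens, rng):
    """Uniformly sized random set of positions (assumed masking scheme)."""
    return set(rng.sample(range(n_tokens), rng.randint(1, n_tokens)))

def smart_second_pass(target, predict_fn, rng):
    """Build decoder inputs and labels for the second SMART pass."""
    n = len(target)
    # Pass 1: mask the ground truth and predict every position.
    # In a real system this pass would run with gradients disabled.
    first_mask = random_mask(n, rng)
    pass1_inputs = [MASK if i in first_mask else tok
                    for i, tok in enumerate(target)]
    predicted = predict_fn(pass1_inputs)
    # Re-mask the model's own predictions with a new mask.
    second_mask = random_mask(n, rng)
    pass2_inputs = [MASK if i in second_mask else tok
                    for i, tok in enumerate(predicted)]
    # Pass 2 is trained against the ground truth at all positions.
    labels = list(target)
    return pass2_inputs, labels

rng = random.Random(1)
target = ["now", "i", "am", "doing", "it"]
inputs, labels = smart_second_pass(target, lambda toks: ["x"] * len(toks), rng)
```

Because the observed tokens in the second pass come from the model itself, they may contain errors, which is closer to what the decoder sees during Mask-Predict inference.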

3.3.2 Total training objective

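The speech encoder and all four branches (NAR/AR decoders, length predictor, and CTC ASR layer) are optimized jointly. The total objective function is formulated as:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{cmlm}} + \lambda_{\mathrm{ar}} \mathcal{L}_{\mathrm{ar}} + \lambda_{\mathrm{lp}} \mathcal{L}_{\mathrm{lp}} + \lambda_{\mathrm{asr}} \mathcal{L}_{\mathrm{asr}}, \qquad (5)$$

where $\mathcal{L}_{\mathrm{ar}}$, $\mathcal{L}_{\mathrm{lp}}$, and $\mathcal{L}_{\mathrm{asr}}$ are losses in AR E2E-ST, length prediction, and ASR tasks, and $\lambda_{\mathrm{ar}}$, $\lambda_{\mathrm{lp}}$, and $\lambda_{\mathrm{asr}}$ are the corresponding tunable hyperparameters. We set ($\lambda_{\mathrm{ar}}$, $\lambda_{\mathrm{lp}}$, $\lambda_{\mathrm{asr}}$) to (0.3, 0.1, 0.3) throughout the experiments.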

4 Experimental evaluation

4.1 Datasets

We used En-De (229k pairs, 408 hours) and En-Fr (275k pairs, 492 hours) language directions on Must-C [8], Fisher-CallHome Spanish (Es-En, 138k pairs, 170 hours, hereafter Fisher-CH) [40], and Libri-trans (En-Fr, 45k pairs, 100 hours) [26] corpora. All corpora contain triplets of source speech and the corresponding transcription and translation, and we used the same preprocessing as [19]. For Must-C, we report case-sensitive BLEU [36] on the tst-COMMON set. Non-verbal speech labels such as “(Applause)” were removed during evaluation. For Fisher-CH, we report case-insensitive BLEU on the Fisher-test set with four references and on the CallHome-evltest set with a single reference. For Libri-trans, we report case-insensitive BLEU on the test set. We removed case information and all punctuation marks except apostrophes from the transcriptions of all corpora and from the translations of Fisher-CH.

We extracted 80-channel log-mel filterbank coefficients with 3-dimensional pitch features using Kaldi [41] as input speech features, which were augmented by a factor of 3 with speed perturbation [25] and by SpecAugment [37] to avoid overfitting. All sentences were tokenized with the tokenizer.perl script in Moses [27]. We built vocabularies based on the byte pair encoding (BPE) algorithm [46] implemented with Sentencepiece [28]. Joint source and target vocabularies were used in the ST/MT tasks, while the ASR vocabularies were constructed from the transcriptions only. For Must-C, we used 5k and 8k vocabularies for ASR and E2E-ST/MT models, respectively. For Fisher-CH and Libri-trans, we used 8k and 1k vocabularies for NAR E2E-ST models and the others, respectively.

4.2 Model configurations

We used the Transformer architecture implemented in ESPnet-ST [19] for all tasks. All ASR and E2E-ST models consisted of stacked 12 encoder layers and 6 decoder layers. Speech encoders had 2 CNN layers before the self-attention layers, which performed 4-fold downsampling. The text encoders in the MT models consisted of 6 layers. The dimension of self-attention layer and feed-forward network , and the number of heads were set to 256, 2048, and 4, respectively. For the large model on Must-C, we set and . The Adam optimizer [24] with , , and was used for training with Noam learning rate schedule [49]. Warmup steps and a learning rate constant were set to and , respectively. A mini-batch was constructed with 32 utterances, and gradients were accumulated for 8 steps in NAR E2E-ST models. The last 5 best checkpoints based on the validation performance were used for model averaging. Following the standard practice in NAR models [14, 11], we used sequence-level knowledge distillation (Seq-KD) [23] with the corresponding AR Transformer MT model, except for Libri-trans. During inference, we used a beam width for AR ASR/ST/MT models, and a length beam width for NAR models. The language model was used for the ASR model on Libri-trans only. Joint CTC/Attention decoding [52] was performed for ASR models. Decoding time was measured with a batch size 1 on a single NVIDIA TITAN RTX GPU by averaging on five runs. We initialized encoder parameters of E2E-ST models with those of the corresponding pre-trained ASR model and AR decoder parameters with those of the corresponding pre-trained AR MT model trained on the same triplets, respectively [4]. However, NAR decoder parameters were initialized based on the weight initialization scheme in BERT [7, 11].

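Figure 1: Trade-off between decoding speed-up and BLEU scores on the Must-C En-De tst-COMMON set. Parentheses, square brackets, and curly brackets represent ($T$, $l$), [$b$], and {$b$ (ASR), $b$ (MT)}, respectively, where $T$ is the number of iterations, $l$ the length beam width, and $b$ the beam width.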
| Model | BLEU: w/o AR | w/ AR | w/o AR | w/ AR |
|---|---|---|---|---|
| Orthros (N3) | 45.76 | 49.01 | 46.88 | 50.28 |
| - Seq-KD | 44.36 | 47.42 | 44.25 | 49.50 |
| - AR decoder | 45.53 | N/A | 46.94 | N/A |
| + length prediction w/ CTC | 45.41 | 48.18 | 46.79 | 50.05 |

Table 3: Ablation study on the Fisher-dev set

4.3 Main results

The main results are shown in Table 2. Iterative refinement based on CMLM (N2) significantly outperformed the pure NAR CTC model (N1) in translation quality (see footnote 1). Increasing the number of iterations was effective for improving quality at the cost of latency. SMART also boosted the BLEU scores with no extra latency during inference, except for Fisher-CH (N3). This is probably because sequence lengths in Fisher-CH are relatively short. Candidate selection with the AR decoder greatly improved the quality with a negligible increase in latency, which corresponds to performing one more iteration. We also found that NAR models prefer a larger vocabulary size for better quality. Using BPE16k, BLEU scores improved with similar latency (N4). We note that the BPE size was tuned separately for AR and NAR models. We will analyze this phenomenon in Section 4.5. Increasing the model capacity also improved the quality of NAR models, while it did not hurt the speed much when using a GPU. Although not shown in the table, AR models did not benefit from the larger capacity on this corpus (see Figure 1).

For a comparison with AR models, Orthros achieved comparable quality to both strong AR E2E (A1, A2) and cascaded systems (A3) with smaller latency. Interestingly, N4 and N5 even outperformed A1 in quality by a large margin on Fisher-CH and Must-C En-De, respectively. Seq-KD was very effective for AR models as well except for Libri-trans. Although relative speed-ups are smaller than those in the MT literature [11, 16], this is probably because we used the smaller vocabulary and the baseline AR models have much smaller latency (e.g., 607ms in [16] vs. 271ms in ours).
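The speedup column of Table 2 can be cross-checked with simple arithmetic. The sketch below rests on an observation from the table rather than a statement by the authors: each reported value matches the 271 ms latency of the slower A1 row divided by the system's latency, truncated to two decimals. Reading the cascade latency as the sum of its ASR and MT parts (154 and 166 ms) is likewise an assumption, and the dictionary keys are descriptive labels only.

```python
import math

BASELINE_MS = 271  # latency of the slower A1 row in Table 2

# Latencies in milliseconds, copied from Table 2.
latency_ms = {
    "A1 (faster row)": 175,
    "N1 CTC": 13,
    "N4 Orthros": 117,
    "A3 cascade (first row)": 154 + 166,  # assumed: ASR + MT
}

def speedup(latency, baseline=BASELINE_MS):
    """Relative speedup, truncated (not rounded) to two decimals."""
    return math.floor(100 * baseline / latency) / 100

table = {name: speedup(ms) for name, ms in latency_ms.items()}
```

This reproduces the reported 1.54, 20.84, 2.31, and 0.84, which suggests all speedups share the same baseline.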

Figure 1 shows the trade-off between relative speed-ups and BLEU scores on the Must-C En-De tst-COMMON set. Consistent with text-based CMLM models [11], a large length beam width alone was not effective. However, the proposed candidate selection significantly improved the performance with a larger length beam width $l$. This way, a similar BLEU score can be achieved with fewer iterations. Moreover, Orthros (N4) can obtain the same BLEU as a baseline AR model (A1) with more than 3 times speed-up relative to greedy decoding and 1.5 times relative to beam search. The large Orthros (N5) achieved better BLEU scores than N4 with similar latency and outperformed the AR models with beam search in both quality and latency. Although the cascaded models showed reasonable BLEU scores, they were much slower than the E2E models. We also compared AR models for candidate selection: the AR decoder on the unified encoder (proposal) vs. a separate AR encoder-decoder. The unified encoder showed smaller latency with better quality. We suspect that this is because sharing the encoder has a positive effect on candidate selection, or because the AR decoder in Orthros was trained with Seq-KD. We will analyze this in future work. Although the overhead for additional speech encoding was relatively small here, it would grow with a more complicated encoder architecture. One more advantage of Orthros is that the memory consumption for model parameters and encoder output caching is much smaller.

4.4 Ablation study

To see the individual contributions of the proposed techniques, we conducted an ablation study on the Fisher-dev set (Table 3). Seq-KD was beneficial for boosting BLEU scores, consistent with the NAR MT task [11]. Joint training with the AR decoder did not hurt BLEU scores when candidate selection was not used. For length prediction, we also investigated a simple approach that scales the transcription length $N_{\mathrm{asr}}$, which was obtained from the CTC ASR layer with greedy decoding, by a constant value $\alpha$, i.e., $\hat{N} = \alpha \cdot N_{\mathrm{asr}}$. Although this works as well, we needed to tune $\alpha$ on the dev set of each corpus, and therefore we adopted the classification approach.
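The CTC-based alternative can be sketched as follows: collapse the frame-level greedy CTC labels into a transcription length, then multiply by a constant. The function names, the blank index, the rounding, and the floor of one token are assumptions for illustration; the scale value would have to be tuned per corpus, as noted above.

```python
def ctc_greedy_length(frame_labels, blank=0):
    """Length of the greedy CTC output: merge repeats, then drop blanks."""
    length, prev = 0, None
    for label in frame_labels:
        if label != blank and label != prev:
            length += 1
        prev = label
    return length

def scaled_target_length(frame_labels, scale, blank=0):
    """Predict the target length as a constant multiple of the
    transcription length (rounded, and at least one token)."""
    return max(1, round(scale * ctc_greedy_length(frame_labels, blank)))

# Frame-level argmax labels from a hypothetical CTC ASR layer (0 = blank).
frames = [0, 3, 3, 0, 3, 5, 5, 0]
asr_length = ctc_greedy_length(frames)          # tokens 3, 3, 5
target_length = scaled_target_length(frames, scale=1.2)
```

Unlike the classifier, this approach needs no extra parameters, but a single constant cannot capture utterance-specific variation in the source-to-target length ratio.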

Figure 2: Impact of vocabulary size on the Fisher-dev set

4.5 Effect of vocabulary size

Finally, we investigated the impact of vocabulary size. Figure 2 shows BLEU scores of AR and NAR E2E models as a function of the vocabulary size on the Fisher-dev set. We observed that AR models peak at around 1k BPE units because the data size is relatively small (170 hours). However, the performance of NAR models continued to improve with the vocabulary size up to 16k. Candidate selection with the AR decoder was beneficial for all vocabulary sizes, especially for 32k. This is probably because misspellings were alleviated thanks to the many complete words in the large vocabulary, which compensated for the conditional independence assumption made in the NAR models. We also observed similar trends in other corpora and in CTC models.

5 Conclusion

In this work, we proposed Orthros, a unified NAR decoding framework with NAR and AR decoders on a shared encoder, to speed up inference in the E2E-ST task. Selecting the better candidate with the AR decoder greatly improved the effectiveness of a large length beam in the NAR decoder. We also showed that a large vocabulary and more parameters are effective for NAR E2E-ST models. The best NAR E2E model reached the level of a state-of-the-art AR Transformer model in BLEU score while reducing inference latency by more than a factor of two.


  1. CTC in the E2E-ST task has the speech encoder and the output classifier. It was optimized with a single CTC objective with a pair of ($X$, $Y$). Since input speech lengths are generally longer than target sequence lengths in the E2E-ST task, we did not use the upsampling technique in [30].


  1. Y. Bai (2020) Listen attentively, and spell once: Whole sentence generation via a non-autoregressive architecture for low-latency speech recognition. In Proc. of Interspeech, Cited by: §2.2.
  2. S. Bansal (2019) Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. In Proc. of NAACL-HLT, pp. 58–68. Cited by: §1.
  3. A. Bérard (2016) Listen and translate: A proof of concept for end-to-end speech-to-text translation. In Proc. of NeurIPS 2016 End-to-end Learning for Speech and Audio Processing Workshop, Cited by: §1.
  4. A. Bérard (2018) End-to-end automatic speech translation of audiobooks. In Proc. of ICASSP, pp. 6224–6228. Cited by: §1, §2.1, §4.2.
  5. M. Z. Boito (2017) Unwritten languages demand attention too! word discovery with encoder-decoder models. In Proc. of ASRU, pp. 458–465. Cited by: §1.
  6. N. Chen (2019) Listen and fill in the missing letters: Non-autoregressive transformer for speech recognition. arXiv preprint arXiv:1911.04908. Cited by: §2.2.
  7. J. Devlin (2019) BERT: Pre-training of deep bidirectional Transformers for language understanding. In Proc. of NAACL-HLT, pp. 4171–4186. Cited by: §4.2.
  8. M. A. Di Gangi (2019) MuST-C: a Multilingual Speech Translation Corpus. In Proc. of NAACL-HLT, pp. 2012–2017. Cited by: §4.1.
  9. M. A. Di Gangi (2019) One-to-many multilingual end-to-end speech translation. In Proc. of ASRU, pp. 585–592. Cited by: §1.
  10. Y. Fujita (2020) Insertion-based modeling for end-to-end automatic speech recognition. In Proc. of Interspeech, Cited by: §2.2.
  11. M. Ghazvininejad (2019) Mask-predict: Parallel decoding of conditional masked language models. In Proc. of EMNLP, pp. 6114–6123. Cited by: §1, §1, §2.2, §3.1, §3.1, §3.2.2, §3.2, §3.3, §4.2, §4.3, §4.3, §4.4.
  12. M. Ghazvininejad (2020) Aligned cross entropy for non-autoregressive machine translation. In Proc. of ICML, Cited by: §1.
  13. M. Ghazvininejad (2020) Semi-autoregressive training improves mask-predict decoding. arXiv preprint arXiv:2001.08785. Cited by: §1, §3.3.1.
  14. J. Gu (2018) Non-autoregressive neural machine translation. In Proc. of ICLR, Cited by: §1, §2.2, §2.2, §3.1, §3.1, §4.2.
  15. J. Gu (2019) Levenshtein Transformer. In Proc. of NeurIPS, pp. 11181–11191. Cited by: §1.
  16. J. Guo (2020) Jointly masked sequence-to-sequence model for non-autoregressive neural machine translation. In Proc. of ACL, pp. 376–385. Cited by: §2.2, §4.3.
  17. Y. Higuchi (2020) Mask CTC: Non-autoregressive end-to-end ASR with CTC and mask predict. In Proc. of Interspeech, Cited by: §2.2.
  18. H. Inaguma (2019) Multilingual end-to-end speech translation. In Proc. of ASRU, pp. 570–577. Cited by: §1.
  19. H. Inaguma (2020) ESPnet-ST: all-in-one speech translation toolkit. In Proc. of ACL: System Demonstrations, pp. 302–311. Cited by: §4.1, §4.2.
  20. S. Indurthi (2020) End-end speech-to-text translation with modality agnostic meta-learning. In Proc. of ICASSP, pp. 7904–7908. Cited by: §1.
  21. Y. Jia (2019) Leveraging weakly supervised data to improve end-to-end speech-to-text translation. In Proc. of ICASSP, pp. 7180–7184. Cited by: §1.
  22. L. Kaiser (2018) Fast decoding in sequence models using discrete latent variables. In Proc. of ICML, pp. 2390–2399. Cited by: §1.
  23. Y. Kim (2016) Sequence-level knowledge distillation. In Proc. of EMNLP, pp. 1317–1327. Cited by: §2.2, §4.2.
  24. D. Kingma (2015) Adam: A method for stochastic optimization. Proc. of ICLR. Cited by: §4.2.
  25. T. Ko (2015) Audio augmentation for speech recognition. In Proc. of Interspeech, pp. 3586–3589. Cited by: §4.1.
  26. A. C. Kocabiyikoglu (2018) Augmenting Librispeech with French translations: A multimodal corpus for direct speech translation evaluation. In Proc. of LREC, Cited by: §4.1.
  27. P. Koehn (2007) Moses: Open source toolkit for statistical machine translation. In Proc. of ACL: Demo and Poster Sessions, pp. 177–180. Cited by: §4.1.
  28. T. Kudo (2018) Subword regularization: Improving neural network translation models with multiple subword candidates. In Proc. of ACL, pp. 66–75. Cited by: §4.1.
  29. J. Lee (2018) Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proc. of EMNLP, pp. 1173–1182. Cited by: §1, §2.2.
  30. J. Libovický (2018) End-to-end non-autoregressive neural machine translation with connectionist temporal classification. In Proc. of EMNLP, pp. 3016–3021. Cited by: §1, footnote 1.
  31. Y. Liu (2019) End-to-end speech translation with knowledge distillation. In Proc. of Interspeech, pp. 1128–1132. Cited by: §1.
  32. X. Ma (2019) FlowSeq: Non-autoregressive conditional sequence generation with generative flow. In Proc. of EMNLP, pp. 4273–4283. Cited by: §1.
  33. H. Ney (1999) Speech translation: Coupling of recognition and translation. In Proc. of ICASSP, pp. 517–520. Cited by: §1.
  34. J. Niehues (2016) Dynamic transcription for low-latency speech translation. In Proc. of Interspeech, pp. 2513–2517. Cited by: §1.
  35. A. Oord (2018) Parallel wavenet: fast high-fidelity speech synthesis. In Proc. of ICML, pp. 3918–3926. Cited by: §2.2.
  36. K. Papineni (2002) Bleu: a method for automatic evaluation of machine translation. In Proc. of ACL, pp. 311–318. Cited by: §4.1.
  37. D. S. Park (2019) SpecAugment: A simple data augmentation method for automatic speech recognition. In Proc. of Interspeech, pp. 2613–2617. Cited by: §4.1.
  38. J. Pino (2019) Harnessing indirect training data for end-to-end automatic speech translation: tricks of the trade. arXiv, pp. arXiv–1909. Cited by: §1.
  39. J. Pino (2020) Self-training for end-to-end speech translation. In Proc. of Interspeech, Cited by: §1.
  40. M. Post (2013) Improved speech-to-text translation with the Fisher and Callhome Spanish–English speech translation corpus. In Proc. of IWSLT, Cited by: §4.1.
  41. D. Povey (2011) The kaldi speech recognition toolkit. In Proc. of ASRU, Cited by: §4.1.
  42. Y. Ren (2019) FastSpeech: Fast, robust and controllable text to speech. In Proc. of NeurIPS, pp. 3171–3180. Cited by: §2.2.
  43. Y. Ren (2020) A study of non-autoregressive model for sequence generation. In Proc. of ACL, pp. 149–159. Cited by: §2.2.
  44. E. Salesky (2019) Exploring phoneme-level speech representations for end-to-end speech translation. In Proc. of ACL, pp. 1835–1841. Cited by: §1.
  45. E. Salesky (2020) Phone features improve speech translation. In Proc. of ACL, pp. 2388–2397. Cited by: §1.
  46. R. Sennrich (2016) Neural machine translation of rare words with subword units. In Proc. of ACL, pp. 1715–1725. Cited by: §4.1.
  47. M. Stern (2019) Insertion Transformer: Flexible sequence generation via insertion operations. In Proc. of ICML, pp. 5976–5985. Cited by: §1.
  48. Z. Tian (2020) Spike-triggered non-autoregressive Transformer for end-to-end speech recognition. In Proc. of Interspeech, Cited by: §2.2.
  49. A. Vaswani (2017) Attention is all you need. In Proc. of NeurIPS, pp. 5998–6008. Cited by: §2.1, §4.2.
  50. C. Wang (2020) Bridging the gap between pre-training and fine-tuning for end-to-end speech translation. In Proc. of AAAI, pp. 9161–9168. Cited by: §1.
  51. C. Wang (2020) Curriculum pre-training for end-to-end speech translation. In Proc. of ACL, pp. 3728–3738. Cited by: §1.
  52. S. Watanabe (2017) Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1240–1253. Cited by: §4.2.
  53. R. J. Weiss (2017) Sequence-to-sequence models can directly translate foreign speech. In Proc. of Interspeech, pp. 2625–2629. Cited by: §1.
  54. C. Zhou (2019) Understanding knowledge distillation in non-autoregressive machine translation. In Proc. of ICLR, Cited by: §2.2.