Enhancing Monotonic Multihead Attention for Streaming ASR

Abstract

We investigate monotonic multihead attention (MMA), which extends hard monotonic attention to Transformer-based automatic speech recognition (ASR), for online streaming applications. For streaming inference, all monotonic attention (MA) heads should learn proper alignments because the next token is not generated until every head detects the corresponding token boundary. However, we found that not all MA heads learn alignments with a naïve implementation. To encourage every head to learn alignments properly, we propose HeadDrop regularization, which stochastically masks out a subset of heads during training. Furthermore, we propose pruning redundant heads to improve consensus among heads for boundary detection and to prevent delayed token generation caused by such heads. Chunkwise attention on each MA head is extended to the multihead counterpart. Finally, we propose head-synchronous beam search decoding to guarantee stable streaming inference.


Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara
Graduate School of Informatics, Kyoto University, Kyoto, Japan
{inaguma,mimura,kawahara}@sap.ist.i.kyoto-u.ac.jp

Index Terms: Transformer, streaming end-to-end ASR, monotonic multihead attention, beam search decoding

1 Introduction

Recent progress in end-to-end (E2E) automatic speech recognition (ASR) models has bridged the gap to state-of-the-art hybrid systems [7]. To make E2E models applicable to simultaneous interpretation in lecture and meeting domains, online streaming processing is necessary. For E2E models, connectionist temporal classification (CTC) [12] and the recurrent neural network transducer (RNN-T) [13] have been the dominant approaches and have reached the level of real applications [14, 37]. Meanwhile, attention-based encoder-decoder (AED) models [8, 5] have demonstrated powerful modeling capability in offline tasks [34, 3, 48], and a number of streaming variants have been investigated for RNN-based models [15, 42, 21, 22, 28, 35, 6].

Recently, the Transformer architecture [44], based on self-attention and multihead attention, has been shown to outperform its RNN counterparts in various domains [18, 47], and several streaming models have been proposed, such as triggered attention [30], continuous integrate-and-fire (CIF) [11], hard monotonic attention (HMA) [43, 25], and other variants [41]. Triggered attention truncates encoder outputs by using CTC spikes and performs an attention mechanism over all past frames. CIF learns acoustic boundaries explicitly and extracts context vectors from the segmented region. These models therefore adopt adaptive segmentation policies that rely on acoustic cues only.

On the other hand, HMA detects token boundaries on the decoder side by using lexical information as well. Thus, it is more flexible for modeling non-monotonic alignments and has been investigated in simultaneous machine translation (MT) [1]. Recently, HMA was extended to the Transformer architecture, named monotonic multihead attention (MMA), by replacing each encoder-decoder attention head in the decoder with a monotonic attention (MA) head [24]. Unlike the single MA head used in RNN-based models, each MA head can extract source contexts at a different pace and learn complex alignments between input and output sequences. Concurrently, similar methods have been investigated for Transformer-based streaming ASR [43, 25]. Miao et al. [25] simplified the MMA framework by equipping each decoder layer with a single MA head to truncate encoder outputs, as in triggered attention, and then performing attention over all past frames. Tsunoo et al. [43] also investigated the MMA framework but resorted to using all past frames to obtain a decent performance. However, looking back to the beginning of the input frames lessens the advantage of linear-time decoding with HMA as the input length grows.

In this work, we investigate the MMA framework with restricted input context for the streaming ASR task. To perform streaming recognition with the MMA framework, every MA head must learn alignments properly, because the next token is not generated until all heads detect the corresponding token boundaries. If some heads fail to detect the boundaries before seeing the encoder output of the final frame, the next token generation is delayed accordingly. However, with a naïve implementation, we found that proper monotonic alignments are learnt by dominant MA heads only. To prevent this, we propose HeadDrop, in which a subset of heads is entirely masked out at random during training as a regularization to encourage the remaining non-masked heads to learn alignments properly. Moreover, we propose to prune redundant MA heads in lower decoder layers to further improve consensus among heads on token boundary detection. Chunkwise attention [6] on top of each MA head is further extended to the multihead counterpart to extract useful representations and compensate for the limited context size. Finally, we propose head-synchronous beam search decoding to guarantee streamable inference.

Experimental evaluations on the Librispeech corpus show that the proposed methods effectively encourage MA heads to learn alignments properly, which leads to improved ASR performance. Our optimal model enables stable streaming inference on other corpora as well, without architecture modification.

2 Transformer ASR architecture

In this section, we detail the base Transformer architecture used in this paper. It consists of front-end CNN blocks followed by stacked encoder layers and decoder layers [18]. A CNN block has two CNN layers, each followed by a ReLU activation, and the frame rate is reduced by a max-pooling layer after every block. Each encoder layer is composed of a self-attention (SAN) sub-layer followed by a position-wise feed-forward network (FFN) sub-layer, both wrapped by residual connections and layer normalization [2]. A key component of the SAN sub-layer is the multihead attention (MHA) mechanism, in which the key, value, and query matrices are split into $H$ heads of dimension $d_k$ after linear transformations, and each head performs scaled-dot attention: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^{\top}/\sqrt{d_k})V$, where $K$, $Q$, and $V$ represent the key, query, and value matrices of each head, respectively. The outputs from all heads are concatenated along the feature dimension, followed by a linear transformation. The FFN sub-layer is composed of two linear layers with inner dimension $d_{\rm ff}$, interleaved with a ReLU activation.
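For reference, the following minimal PyTorch sketch illustrates the scaled-dot multihead attention described above (the function name, tensor shapes, and the omission of the input/output linear projections are illustrative choices, not the exact implementation used in this paper):

import torch
import torch.nn.functional as F

def multihead_attention(query, key, value, num_heads):
    # query: (batch, len_q, d_model); key, value: (batch, len_kv, d_model).
    # Input/output linear projections are omitted for brevity.
    batch, len_q, d_model = query.size()
    d_k = d_model // num_heads

    def split(x):
        # (batch, len, d_model) -> (batch, num_heads, len, d_k)
        return x.view(batch, -1, num_heads, d_k).transpose(1, 2)

    q, k, v = split(query), split(key), split(value)
    # Scaled-dot attention per head: softmax(Q K^T / sqrt(d_k)) V
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    context = torch.matmul(F.softmax(scores, dim=-1), v)
    # Concatenate the heads along the feature dimension.
    return context.transpose(1, 2).contiguous().view(batch, len_q, d_model)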

In each decoder layer, unlike the encoder layer, an additional encoder-decoder attention sub-layer is inserted between the SAN and FFN sub-layers, and causal masking is applied to prevent the decoder from seeing future tokens. We adopt three 1D convolutional layers for positional embeddings [27]. The entire network is optimized by minimizing the negative log-likelihood and the CTC loss with an interpolation weight [18].

3 Monotonic multihead attention (MMA)

In this section, we review hard monotonic attention (HMA) [35] and monotonic chunkwise attention (MoChA) [6], and then monotonic multihead attention (MMA) [24] as an extension of them.

3.1 Hard monotonic attention (HMA)

HMA was originally proposed for online linear-time decoding with RNN-based AED models. At output step $i$, the decoder scans the encoder outputs from left to right and stops at an index $t_i$ (token boundary) to attend to the single corresponding encoder output $h_{t_i}$. At each index, the decoder chooses either to stop there or to move forward to the next index. The next boundary is determined by resuming the scan from the previous boundary. As hard attention is not differentiable, the expected alignments $\alpha_{i,j}$ are calculated by marginalizing over all possible paths during training as follows:

$$\alpha_{i,j} = p_{i,j} \sum_{k=1}^{j} \Big( \alpha_{i-1,k} \prod_{l=k}^{j-1} (1 - p_{i,l}) \Big) \qquad (1)$$

$$p_{i,j} = \mathrm{Sigmoid}\big(\mathrm{MonotonicEnergy}(s_i, h_j)\big) \qquad (2)$$

where $p_{i,j}$ is a selection probability and the monotonic energy function takes the $i$-th decoder state $s_i$ and the $j$-th encoder output $h_j$ as inputs. Whenever $p_{i,j} \geq 0.5$ is satisfied at test time, $z_{i,j}$ is activated (i.e., set to 1.0).
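As a rough illustration of how the marginal in Eq. (1) can be computed in practice, the sketch below derives the expected alignments from the selection probabilities of Eq. (2) using the standard left-to-right recurrence that evaluates the same quantity; the tensor layout and the one-hot initialization convention are assumptions of this sketch:

import torch

def expected_alignments(p_choose):
    # p_choose: (batch, I, T) selection probabilities p_{i,j} from Eq. (2).
    # Returns alpha: (batch, I, T) expected alignments of Eq. (1).
    batch, I, T = p_choose.size()
    # Convention: alpha_{-1} is a one-hot vector attending to the first frame.
    alpha_prev = torch.zeros(batch, T)
    alpha_prev[:, 0] = 1.0
    alphas = []
    for i in range(I):
        p_i = p_choose[:, i]
        alpha_i = torch.zeros(batch, T)
        carry = torch.zeros(batch)  # probability mass not yet selected at frame j
        for j in range(T):
            q = carry + alpha_prev[:, j]
            alpha_i[:, j] = p_i[:, j] * q
            carry = (1.0 - p_i[:, j]) * q
        alphas.append(alpha_i)
        alpha_prev = alpha_i
    return torch.stack(alphas, dim=1)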

3.2 Monotonic chunkwise attention (MoChA)

To relax the strict input-output alignment by using surrounding contexts, MoChA introduces an additional soft attention mechanism on top of HMA. Given the boundary $t_i$, chunkwise attention is performed over a fixed window of width $w$ ending there:

$$\beta_{i,j} = \sum_{k=j}^{j+w-1} \frac{\alpha_{i,k}\,\exp(u_{i,j})}{\sum_{l=k-w+1}^{k} \exp(u_{i,l})} \qquad (3)$$

where $u_{i,j}$ is a chunk energy parameterized similarly to the monotonic energy in Eq. (2), using separate parameters. $\alpha_{i,k}$ in Eq. (3) is a continuous value during training, but becomes a binary value according to $z_{i,k}$ at test time.
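A minimal sketch of test-time chunkwise attention, which restricts a softmax over the chunk energies to the window of width $w$ ending at the detected boundary (function and variable names are illustrative):

import torch
import torch.nn.functional as F

def chunkwise_attention(chunk_energy, boundary, w):
    # chunk_energy: (batch, T) chunk energies u_{i,j} for the current output step.
    # boundary: (batch,) detected boundary indices t_i (long tensor).
    batch, T = chunk_energy.size()
    j = torch.arange(T, device=chunk_energy.device).unsqueeze(0)  # (1, T)
    t = boundary.unsqueeze(1)                                     # (batch, 1)
    inside = (j > t - w) & (j <= t)  # window [t_i - w + 1, t_i]
    masked = chunk_energy.masked_fill(~inside, float("-inf"))
    # beta_{i,j}: nonzero only inside the window.
    return F.softmax(masked, dim=-1)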

3.3 Monotonic multihead attention (MMA)

To keep the expressive power of the Transformer with its multihead attention mechanism while enabling online linear-time decoding, MMA was proposed as an extension of HMA [24]. Each encoder-decoder attention head in the decoder is replaced with a monotonic attention (MA) head as in Eq. (1) by defining the monotonic energy function in Eq. (2) as follows:

$$e^{(h)}_{i,j} = \frac{\big(s_i W^{(h)}_{q}\big)\big(h_j W^{(h)}_{k}\big)^{\top}}{\sqrt{d_k}} + r^{(h)} \qquad (4)$$

where $W^{(h)}_{q}$ and $W^{(h)}_{k}$ are parameter matrices and $r^{(h)}$ is a learnable offset parameter. Unlike the case of a single MA head in Section 3.1, each MA head can attend to the input speech at a different pace because the decision processes regarding when to activate do not influence each other at each output step. The side effect is that all heads must be activated to generate a token; otherwise, some MA heads continue to scan the encoder outputs until the last time index, which significantly increases latency during inference.

Furthermore, unlike previous works [24, 43] that attach a single chunkwise attention (CA) head to each MA head, we extend it to a multihead version with $H_{\rm ca}$ CA heads per MA head to extract useful representations with multiple views from each boundary (chunkwise multihead attention). Assuming each decoder layer has $H_{\rm ma}$ MA heads, the total number of CA heads in the layer is $H_{\rm ma} \times H_{\rm ca}$. However, we found in pilot experiments that sharing the CA parameters among the MA heads in the same layer is effective, and we adopt this strategy in all experiments. The chunk energy for each CA head is designed as in Eq. (4) without the offset $r^{(h)}$.
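The per-head monotonic energy of Eq. (4) could be implemented along the following lines (the class name, the zero initialization of the offset $r$, and the bias handling of the projections are assumptions of this sketch, not the paper's exact implementation):

import torch
import torch.nn as nn

class MonotonicEnergy(nn.Module):
    # Eq. (4): e_{i,j} = (s_i W_q)(h_j W_k)^T / sqrt(d_k) + r, per head.
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.r = nn.Parameter(torch.zeros(num_heads))  # learnable offset per head

    def forward(self, dec_states, enc_outs):
        # dec_states: (batch, I, d_model); enc_outs: (batch, T, d_model)
        B, I, _ = dec_states.size()
        T = enc_outs.size(1)
        q = self.w_q(dec_states).view(B, I, self.num_heads, self.d_k).transpose(1, 2)
        k = self.w_k(enc_outs).view(B, T, self.num_heads, self.d_k).transpose(1, 2)
        energy = torch.matmul(q, k.transpose(-2, -1)) / self.d_k ** 0.5  # (B, H, I, T)
        return energy + self.r.view(1, -1, 1, 1)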

Figure 1: System overview. Residual connections and layer normalization are omitted.

4 Enhancing monotonic alignments

Transformer models contain many attention heads and residual connections, so it is unclear whether all heads contribute to the final predictions. Michel et al. [26] reported that most heads can be pruned at test time without significant performance degradation in standard MT and BERT [9] architectures. They also revealed that the important heads are determined in an early training stage. Concurrently, Voita et al. [45] reported similar observations by automatically pruning a subset of heads with an $L_0$ penalty [23]. In our preliminary experiments, we also observed that not all MA heads learn alignments properly in MMA-based ASR models; monotonic alignments are learnt only by dominant heads in the upper decoder layers. Since the expected alignments $\alpha_{i,j}$ in Eq. (1) are not normalized to sum to one over the input frames during training, context vectors from heads that do not learn alignments are likely to become zero vectors at test time. This is a severe problem because (1) it leads to a mismatch between training and testing conditions, and (2) subsequent tokens cannot be generated until all heads are activated. To alleviate this problem, we propose the following methods.

4.1 HeadDrop

We first propose a regularization method to encourage each MA head to contribute equally to the target task. During training, we stochastically zero out all elements of each MA head with a fixed probability to force the other heads to learn alignments. The decision of dropping each head is made independently of the other heads, regardless of the depth of the decoder. The output of the MMA function is then normalized by dividing it by the number of non-masked MA heads. We name this HeadDrop, inspired by dropout [39] and DropConnect [46].
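A minimal sketch of HeadDrop, assuming the per-head MMA outputs are stacked along a head dimension; the tensor layout and the handling of the (rare) case where every head is dropped are assumptions of this sketch:

import torch

def headdrop(head_outputs, p_drop, training):
    # head_outputs: (batch, num_heads, len, dim) per-head outputs of the MMA function.
    # Returns the masked outputs and eta, the number of non-masked heads, by which
    # the caller divides the combined MMA output.
    num_heads = head_outputs.size(1)
    if not training or p_drop == 0.0:
        return head_outputs, num_heads
    # Independent drop decision per head (shared across the batch here for simplicity).
    keep = (torch.rand(num_heads, device=head_outputs.device) >= p_drop).float()
    eta = max(int(keep.sum().item()), 1)  # degenerate all-dropped case left to the caller
    return head_outputs * keep.view(1, num_heads, 1, 1), eta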

4.2 Pruning monotonic attention heads in lower layers

Although HeadDrop is effective for improving the contribution of each MA head, we found that some MA heads in the lower decoder layers still do not learn alignments properly. Therefore, we propose to prune such redundant heads because they are harmful for streaming decoding. We remove the MMA function from the first few decoder layers, counted from the bottom, during both training and test time (see Figure 1); a sketch of this construction follows below. These layers have SAN and FFN sub-layers only and serve as a pseudo language model (LM), and the total number of effective MA heads is reduced accordingly. This method is also based on the finding in [45] that the lower layers of the Transformer decoder are mostly responsible for language modeling. Another advantage of pruning redundant heads is improved inference speed, which is helpful for streaming ASR.
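Such pruning can be realized simply by building the lowest layers without the MMA sub-layer, for example as in the following sketch (the factory functions are hypothetical placeholders, not the paper's implementation):

import torch.nn as nn

def build_decoder_layers(num_layers, num_pruned, make_lm_layer, make_mma_layer):
    # The first `num_pruned` layers have SAN and FFN sub-layers only (pseudo LM);
    # the remaining upper layers keep the MMA sub-layer.
    layers = []
    for l in range(num_layers):
        layers.append(make_lm_layer() if l < num_pruned else make_mma_layer())
    return nn.ModuleList(layers)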

5 Head-synchronous beam search decoding

During beam search in the MMA framework, failure of boundary detection by some MA heads in some beam candidates at an output step easily prevents the decoder from continuing streaming inference. This is because the other candidates must wait for the hypothesis pruning until all heads in all candidates are activated at each output step. To continue streaming inference, we propose head-synchronous beam search decoding (Algorithm 1). The idea is to force non-activated heads to activate after a small fixed delay. If a head in the $l$-th layer cannot detect a boundary within a wait time threshold of $\epsilon_{\rm wait}$ frames after the leftmost boundary detected by the other heads in the same layer, its boundary is set to the rightmost boundary among the already detected boundaries at the current output step (line 16). Therefore, the latency between the fastest (rightmost) and slowest (leftmost) boundary positions in the same layer is less than $\epsilon_{\rm wait}$ frames. We note that boundary detection decisions at the $l$-th layer depend on the outputs from the $(l-1)$-th layer, and at least one head must be activated at each layer to generate a token. For an efficient implementation, we search the boundaries of all heads in a layer in parallel, so the loop in line 9 can be ignored. Moreover, we perform batch processing over the multiple hypotheses in the beam and cache previous decoder states for efficiency. Note that head synchronization is not performed during training in order to maintain the diversity of boundary positions; thus, synchronization can have an ensemble effect on boundary detection.

6 Experimental evaluations

6.1 Experimental setup

We used the 960-hour Librispeech dataset [31] for experimental evaluations. We extracted 80-channel log-mel filterbank coefficients computed with a 25-ms window size shifted every 10 ms using Kaldi [33]. We used a 10k vocabulary based on the byte pair encoding (BPE) algorithm [38]. The baseline MMA models follow the Transformer configuration described in Section 2. The Adam optimizer [19] was used for training with the Noam learning rate schedule [44] and warmup. Model parameters of the last epochs were averaged for evaluation. Both dropout and label smoothing [40] were used. A 4-layer LSTM LM was used during beam search decoding.1

1: Input: encoder outputs h_1, ..., h_T; wait time threshold ε_wait; beam width k
2: Output: top-k hypotheses Ω
3: Initialize: Ω ← {⟨sos⟩}; all head boundaries ← 0
4: for i ← 1 to I_max do
5:     Ω_new ← ∅
6:     for each hypothesis y in Ω do ▷ Batchfy
7:         for l ← (number of pruned layers) + 1 to N_dec do
8:             B ← ∅ ▷ boundaries detected in layer l at step i
9:             for h ← 1 to H_ma in parallel do
10:                 t_prev ← boundary of head h at step i − 1
11:                 for j ← t_prev to T do
12:                     if p_{i,j}^{(l,h)} ≥ 0.5 then
13:                         boundary of head h ← j;  add j to B;  break;
14:                     else
15:                         if B ≠ ∅ and j − min(B) ≥ ε_wait then
16:                             boundary of head h ← max(B) ▷ Forced activation
17:                             break
18:                         end if
19:                     end if
20:                 end for
21:             end for
22:         end for
23:         Append candidates extended from y to Ω_new
24:     end for
25:     Ω ← top-k hypotheses in Ω_new
26:     if all hypotheses in Ω end with ⟨eos⟩ or i = I_max then break end if
27: end for
28: return Ω
Algorithm 1 Head-synchronous beam search decoding
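The per-layer boundary detection with forced activation could look roughly as follows for a single hypothesis (a greedy sketch; the fallback for heads that never activate and the exact timing of the forced activation are assumptions of this sketch):

import torch

def detect_boundaries_sync(p_choose, prev_boundaries, eps_wait):
    # p_choose: (num_heads, T) selection probabilities of one layer at the current
    # output step; prev_boundaries: (num_heads,) boundaries of the previous step.
    num_heads, T = p_choose.size()
    boundaries = torch.full((num_heads,), -1, dtype=torch.long)
    for j in range(int(prev_boundaries.min()), T):
        for h in range(num_heads):
            if boundaries[h] < 0 and j >= int(prev_boundaries[h]) and p_choose[h, j] >= 0.5:
                boundaries[h] = j  # normal activation
        detected = boundaries[boundaries >= 0]
        if len(detected) == num_heads:
            break
        # Forced activation: if the slowest head lags eps_wait frames behind the
        # leftmost detected boundary, snap it to the rightmost detected boundary.
        if len(detected) > 0 and j - int(detected.min()) >= eps_wait:
            boundaries[boundaries < 0] = int(detected.max())
            break
    # Heads that never activated fall back to the last frame.
    boundaries[boundaries < 0] = T - 1
    return boundaries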

6.2 Evaluation measure of boundary detection

To assess the consensus among heads for token boundary detection, we propose metrics to evaluate (1) how well each MA head learns alignments (boundary coverage) and (2) how often the model satisfies the streamable condition (streamability). Even if a better word error rate (WER) is obtained, the model cannot continue streaming inference if some heads do not learn alignments well; this evaluation is missing in [25, 43].

Boundary coverage

During beam search, we count the total number of boundaries ($j$ such that $z_{i,j}=1$) up to the $i$-th output step, averaged over all MA heads, for every candidate $k$ in the $n$-th utterance; we denote this count by $b^{n,k}_{i}$.

The boundary coverage is defined as the ratio of this count at the final output step of the best candidate to the corresponding hypothesis length, averaged over all utterances in the evaluation set.

Streamability

The streamability is defined as the fraction of utterances in which, for every candidate in the hypothesis set at every output step (i.e., until generation of the best hypothesis is completed), all MA heads detect their boundaries without scanning to the last encoder frame. A streamability of zero for an utterance indicates that the model failed streaming recognition somewhere in that utterance, i.e., it continued scanning the encoder outputs until the last frame. However, we note that this does not mean the model leverages additional context.
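The two metrics could be computed from per-head boundary counts collected during decoding along the following lines (the data layout, with one entry per utterance holding the counts for the best hypothesis and for all candidates at each output step, is an assumption of this sketch):

def boundary_coverage_and_streamability(per_utt_counts):
    # per_utt_counts: list of dicts, one per utterance, with
    #   'best': list of per-head-averaged boundary counts b_i for the best hypothesis,
    #   'all':  list over output steps of lists of b_i for every candidate in the beam.
    N = len(per_utt_counts)
    coverage, streamable = 0.0, 0.0
    for utt in per_utt_counts:
        best = utt['best']
        # Ratio of detected boundaries to the best hypothesis length.
        coverage += best[-1] / len(best)
        # Streamable iff every candidate at every output step i has exactly i boundaries.
        ok = all(b == i + 1 for i, cands in enumerate(utt['all']) for b in cands)
        streamable += 1.0 if ok else 0.0
    return coverage / N, streamable / N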

6.3 Offline ASR results

HeadDrop and pruning MA heads in lower layers improve WER and streamability.   Table 1 shows the results for offline MMA models on the Librispeech dev-clean/other sets. "Offline" means that the encoder uses the offline architecture. Boundary coverage and streamability were averaged over the two sets. The naïve implementation A1 showed very poor performance. Pruning MA heads in lower layers by increasing the number of pruned layers significantly reduced WER, but the boundary coverage was still not high (A5). The proposed HeadDrop also significantly improved WER, and as the number of pruned layers increased, the boundary coverage was drastically improved to almost 100% (B3-B5). We can conclude that MA heads in lower layers are not necessarily important. This is probably because (1) the modalities of the input and output sequences differ in the ASR task and (2) hidden states in lower decoder layers tend to represent the previous tokens and are thus not suitable for alignment.

ID | #Pruned layers | MA heads / layer | Total MA heads | HD | %WER (dev-clean / dev-other) | Boundary coverage [%] | Streamability [%]
A1 | 0 | 4 | 24 | - | 8.6 / 16.5 | 67.40 | 0.0
A2 | 1 | 4 | 20 | - | 7.3 / 16.3 | 79.02 | 0.0
A3 | 2 | 4 | 16 | - | 4.7 / 12.6 | 86.07 | 0.0
A4 | 3 | 4 | 12 | - | 4.5 / 12.8 | 83.87 | 0.0
A5 | 4 | 4 | 8 | - | 3.6 / 10.8 | 93.80 | 0.9
B1 | 0 | 4 | 24 | ✓ | 3.7 / 11.4 | 60.59 | 0.0
B2 | 1 | 4 | 20 | ✓ | 4.0 / 11.9 | 73.73 | 0.0
B3 | 2 | 4 | 16 | ✓ | 3.9 / 10.8 | 98.85 | 3.7
B4 | 3 | 4 | 12 | ✓ | 4.1 / 11.0 | 99.36 | 6.4
B5 | 4 | 4 | 8 | ✓ | 4.1 / 11.3 | 99.50 | 15.8
C1 | 0 | 1 | 6 | - | 4.9 / 11.7 | 99.38 | 15.7
C2 | 0 | 1 | 6 | ✓ | 3.7 / 10.4 | 99.86 | 35.9
C3 | 0 | 2 | 12 | ✓ | 3.5 / 10.7 | 72.08 | 0.0
Table 1: Results of offline MMA models on the dev sets with standard beam search decoding. HD: HeadDrop.
ID | #Pruned layers | MA heads / layer | w | CA heads per MA head | %WER (dev-clean / dev-other) | Boundary coverage [%] | Streamability [%]
B3 | 2 | 4 | 4 | 1 | 3.9 / 10.7 | 99.74 | 21.6
B4 | 3 | 4 | 4 | 1 | 3.9 / 10.6 | 99.76 | 25.1
B5 | 4 | 4 | 4 | 1 | 3.8 / 11.1 | 99.84 | 40.5
D1 | 2 | 4 | 16 | 1 | 3.3 / 9.9 | 99.78 | 37.4
D2 | 3 | 4 | 16 | 1 | 3.7 / 10.8 | 99.83 | 36.5
D3 | 4 | 4 | 16 | 1 | 3.5 / 10.4 | 99.93 | 60.4
E1 | 2 | 4 | 16 | 2 | 3.3 / 10.2 | 99.78 | 40.6
E2 | 3 | 4 | 16 | 2 | 3.6 / 10.3 | 99.87 | 51.2
E3 | 4 | 4 | 16 | 2 | 3.5 / 10.7 | 99.92 | 50.0
E4 | 2 | 4 | 16 | 4 | 3.3 / 9.8 | 99.91 | 77.9
E5 | 3 | 4 | 16 | 4 | 3.4 / 9.9 | 99.90 | 84.5
E6 | 4 | 4 | 16 | 4 | 3.6 / 10.4 | 99.92 | 63.2
F1 | 0 | 1 | 16 | 4 | 3.5 / 10.5 | 96.23 | 40.6
Table 2: Results of offline MMA models on the dev sets with head-synchronous beam search decoding and HeadDrop.
Model | Librispeech test-clean / test-other (%WER) | TEDLIUM2 (%WER) | AISHELL-1 (%CER)
Offline:
  Transformer (ours) | 3.3 / 9.1 | 10.1 | 6.4
  + data augmentation | 2.8 / 7.0 | - | -
  ++ large model | 2.5 / 6.1 | - | -
  MMA (E5) | 3.4 / 9.9 | 10.5 | 6.5
Streaming:
  Triggered attention [29] | 2.8 / 7.2 | - | -
  CIF [11] | 3.3 / 9.7 | - | -
  MoChA [16] | 4.0 / 9.5 | 11.3 | -
  MMA [43] | - / - | - | 9.7
  MMA (narrow chunk) | 3.5 / 11.1 | 11.0 | 7.5
  MMA (wide chunk) | 3.3 / 10.5 | 10.2 | 6.6
  + data augmentation | 3.0 / 8.5 | - | -
  ++ large model | 2.7 / 7.1 | - | -
Table 3: Results of streaming MMA models on the Librispeech, TEDLIUM2, and AISHELL-1 test sets.

Head-synchronous beam search decoding improves WER and streamability.   Next, the results with head-synchronous beam search decoding are shown in Table 2. The wait time threshold was set to 8 in all models. Head-synchronous decoding improved both boundary coverage and streamability. We found that if a head cannot detect a boundary around the corresponding actual acoustic boundary, it tends to stop around the next acoustic boundary twice to compensate for the failure when using standard beam search. Head-synchronous decoding alleviated this mismatch of boundary positions and led to a small WER improvement.

Chunkwise multihead attention is effective.   Furthermore, we increased the window size $w$ and the number of CA heads $H_{\rm ca}$ in chunkwise attention, both of which further improved WER. With four pruned layers, however, E3 and E6 did not obtain benefits from a larger $H_{\rm ca}$, and increasing $w$ beyond 16 was not effective.

Multiple MA heads in each layer are necessary.   We also examined the effect of the number of MA heads in each layer (C1-C3 in Table 1). C1, with only one MA head per layer, showed a high boundary coverage, which was further improved with HeadDrop (C2). C3, with two heads per layer, severely degraded streamability. Although C2 showed better performance than the B* models, it did not obtain much gain from a larger $w$ and $H_{\rm ca}$ (F1 in Table 2). This confirms that having multiple MA heads in the upper layers is more effective than simply reducing the number of MA heads per layer. In other words, the placement of MA heads matters more than their total number.

What, then, does the remaining 15.5% of streamability in E5 account for? We found that the last few tokens, corresponding to the tail part of the input frames, were predicted after the head pointers in the upper layers reached the last encoder output. For these utterances, E5 was able to continue streaming decoding until almost the end of the input frames on average. Since the tail part is mostly silence, this does not affect streaming recognition. In our manual analysis, we observed that MA heads in the same layer move forward at a similar pace, and the pace gets faster in the upper layers.2 This is because decoder states depend on the output from the lower layer. Considering the balance between WER and streamability, we use the E5 setting for the streaming experiments in the next section.

6.4 Streaming ASR results

Finally, we present the results of the streaming MMA models on the Librispeech test sets in Table 3. We also include results on TEDLIUM2 [36] and AISHELL-1 [4] to confirm whether the optimal configuration tuned on Librispeech works on other corpora as well. We adopted the chunk hopping mechanism [10] for the online encoder, with a narrow and a wide setting of the left/current (hop)/right chunk sizes following [11, 25]. We used speed perturbation [20] and SpecAugment [32] for data augmentation; speed perturbation was applied by default for TEDLIUM2 and AISHELL-1. For the large models, we increased the model dimensions and kept the other hyperparameters. Head-synchronous decoding was used for all MMA models. CTC scores were used during inference only for the standard Transformer models. Our streaming MMA models achieved results comparable to the state-of-the-art Transformer-based streaming ASR model [29] without looking back to the first input frame. Moreover, our model outperformed the MMA model of [43] by a large margin. Increasing the model size was also effective. The streamability of the streaming MMA models with the wide chunk was 80.0% on TEDLIUM2 and 81.5% on AISHELL-1. This confirms that the E5 setting generalizes to other corpora without architecture modification.

7 Conclusion

We tackled the alignment issue in monotonic multihead attention (MMA) for online streaming ASR with HeadDrop regularization and head pruning in lower decoder layers. We also stabilized streamable inference by head-synchronous decoding. Our future work includes investigation of adaptive policies for head pruning and regularization methods to make the most of the MA heads instead of discarding them. Minimum latency training as done in MoChA [17] is another interesting direction.

Footnotes

  1. Code: https://github.com/hirofumi0810/neural_sp.
  2. Examples available at https://github.com/hirofumi0810.github.io/demo/enhancing_mma_asr.

References

  1. N. Arivazhagan, C. Cherry, W. Macherey, C. Chiu, S. Yavuz, R. Pang, W. Li and C. Raffel (2019) Monotonic infinite lookback attention for simultaneous machine translation. In Proc. of ACL, pp. 1313–1323. Cited by: §1.
  2. J. L. Ba, J. R. Kiros and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §2.
  3. E. Battenberg, J. Chen, R. Child, A. Coates, Y. G. Y. Li, H. Liu, S. Satheesh, A. Sriram and Z. Zhu (2017) Exploring neural transducers for end-to-end speech recognition. In Proc. of ASRU, pp. 206–213. Cited by: §1.
  4. H. Bu, J. Du, X. Na, B. Wu and H. Zheng (2017) AISHELL-1: An open-source mandarin speech corpus and a speech recognition baseline. In Proc. of O-COCOSDA, pp. 1–5. Cited by: §6.4.
  5. W. Chan, N. Jaitly, Q. Le and O. Vinyals (2016) Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proc. of ICASSP, pp. 4960–4964. Cited by: §1.
  6. C. Chiu and C. Raffel (2018) Monotonic chunkwise attention. In Proc. of ICLR, Cited by: §1, §1, §3.
  7. C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao and K. Gonina (2018) State-of-the-art speech recognition with sequence-to-sequence models. In Proc. of ICASSP, pp. 4774–4778. Cited by: §1.
  8. J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho and Y. Bengio (2015) Attention-based models for speech recognition. In Proc. of NeurIPS, pp. 577–585. Cited by: §1.
  9. J. Devlin, M. Chang, K. Lee and K. Toutanova (2019) BERT: Pre-training of deep bidirectional Transformers for language understanding. In Proc. of NAACL-HLT, pp. 4171–4186. Cited by: §4.
  10. L. Dong, F. Wang and B. Xu (2019) Self-attention aligner: A latency-control end-to-end model for ASR using self-attention network and chunk-hopping. In Proc. of ICASSP, pp. 5656–5660. Cited by: §6.4.
  11. L. Dong and B. Xu (2020) CIF: Continuous integrate-and-fire for end-to-end speech recognition. In Proc. of ICASSP, pp. 6079–6083. Cited by: §1, §6.4, Table 3.
  12. A. Graves, S. Fernández, F. Gomez and J. Schmidhuber (2006) Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proc. of ICML, pp. 369–376. Cited by: §1.
  13. A. Graves (2012) Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711. Cited by: §1.
  14. Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu and R. Pang (2019) Streaming end-to-end speech recognition for mobile devices. In Proc. of ICASSP, pp. 6381–6385. Cited by: §1.
  15. J. Hou, S. Zhang and L. Dai (2017) Gaussian prediction based attention for online end-to-end speech recognition. In Proc. of Interspeech, pp. 3692–3696. Cited by: §1.
  16. H. Inaguma, M. Mimura and T. Kawahara (2020) CTC-synchronous training for monotonic attention model. arXiv preprint arXiv:2005.04712. Cited by: Table 3.
  1. H. Inaguma, Y. Gaur, L. Lu, J. Li and Y. Gong (2020) Minimum latency training strategies for streaming sequence-to-sequence ASR. In Proc. of ICASSP, pp. 6064–6068. Cited by: §7.
  2. S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto and X. Wang (2019) A comparative study on Transformer vs RNN in speech applications. In Proc. of ASRU, pp. 449–456. Cited by: §1, §2, §2.
  19. D. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.1.
  20. T. Ko, V. Peddinti, D. Povey and S. Khudanpur (2015) Audio augmentation for speech recognition. In Proc. of Interspeech, pp. 3586–3589. Cited by: §6.4.
  21. D. Lawson, C. Chiu, G. Tucker, C. Raffel, K. Swersky and N. Jaitly (2018) Learning hard alignments with variational inference. In Proc. of ICASSP, pp. 5799–5803. Cited by: §1.
  22. M. Li, M. Liu and H. Masanori (2019) End-to-end speech recognition with adaptive computation steps. In Proc. of ICASSP, pp. 6246–6250. Cited by: §1.
  3. C. Louizos, M. Welling and D. P. Kingma (2018) Learning sparse neural networks through L0 regularization. In Proc. of ICLR, Cited by: §4.
  24. X. Ma, J. Pino, J. Cross, L. Puzon and J. Gu (2020) Monotonic multihead attention. In Proc. of ICLR, Cited by: §1, §3.3, §3.3, §3.
  25. H. Miao, G. Cheng, C. Gao, P. Zhang and Y. Yan (2020) Transformer-based online CTC/attention end-to-end speech recognition architecture. In Proc. of ICASSP, pp. 6084–6088. Cited by: §1, §1, §6.2, §6.4.
  26. P. Michel, O. Levy and G. Neubig (2019) Are sixteen heads really better than one?. In Proc. of NeurIPS, pp. 14014–14024. Cited by: §4.
  27. A. Mohamed, D. Okhonko and L. Zettlemoyer (2019) Transformers with convolutional context for ASR. arXiv preprint arXiv:1904.11660. Cited by: §2.
  28. N. Moritz, T. Hori and J. Le Roux (2019) Triggered attention for end-to-end speech recognition. In Proc. of ICASSP, pp. 5666–5670. Cited by: §1.
  29. N. Moritz, T. Hori and J. L. Roux (2020) Streaming automatic speech recognition with the Transformer model. arXiv preprint arXiv:2001.02674. Cited by: §6.4, Table 3.
  30. N. Moritz, T. Hori and J. L. Roux (2020) Streaming automatic speech recognition with the Transformer model. In Proc. of ICASSP, pp. 6074–6078. Cited by: §1.
  31. V. Panayotov, G. Chen, D. Povey and S. Khudanpur (2015) Librispeech: An ASR corpus based on public domain audio books. In Proc. of ICASSP, pp. 5206–5210. Cited by: §6.1.
  32. D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk and Q. V. Le (2019) SpecAugment: A simple data augmentation method for automatic speech recognition. In Proc. of Interspeech, Cited by: §6.4.
  33. D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian and P. Schwarz (2011) The Kaldi speech recognition toolkit. In Proc. of ASRU, Cited by: §6.1.
  34. R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson and N. Jaitly (2017) A comparison of sequence-to-sequence models for speech recognition. In Proc. of Interspeech, pp. 939–943. Cited by: §1.
  35. C. Raffel, M. Luong, P. J. Liu, R. J. Weiss and D. Eck (2017) Online and linear-time attention by enforcing monotonic alignments. In Proc. of ICML, pp. 2837–2846. Cited by: §1, §3.
  36. A. Rousseau, P. Deléglise and Y. Estève (2012) TED-LIUM: An automatic speech recognition dedicated corpus. In Proc. of LREC, pp. 125–129. Cited by: §6.4.
  37. T. N. Sainath, Y. He, B. Li, A. Narayanan, R. Pang, A. Bruguier, S. Chang, W. Li, R. Alvarez and Z. Chen (2020) A streaming on-device end-to-end model surpassing server-side conventional model quality and latency. In Proc. of ICASSP, pp. 6059–6063. Cited by: §1.
  38. R. Sennrich, B. Haddow and A. Birch (2016) Neural machine translation of rare words with subword units. In Proc. of ACL, pp. 1715–1725. Cited by: §6.1.
  39. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov (2014) Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (56), pp. 1929–1958. Cited by: §4.1.
  40. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proc. of CVPR, pp. 2818–2826. Cited by: §6.1.
  41. Z. Tian, J. Yi, Y. Bai, J. Tao, S. Zhang and Z. Wen (2020) Synchronous Transformers for end-to-end speech recognition. In Proc. of ICASSP, pp. 7884–7888. Cited by: §1.
  42. A. Tjandra, S. Sakti and S. Nakamura (2017) Local monotonic attention mechanism for end-to-end speech and language processing. In Proc. of IJCNLP, pp. 431–440. Cited by: §1.
  43. E. Tsunoo, Y. Kashiwagi, T. Kumakura and S. Watanabe (2019) Towards online end-to-end Transformer automatic speech recognition. arXiv preprint arXiv:1910.11871. Cited by: §1, §1, §3.3, §6.2, §6.4, Table 3.
  44. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin (2017) Attention is all you need. In Proc. of NeurIPS, pp. 5998–6008. Cited by: §1, §6.1.
  45. E. Voita, D. Talbot, F. Moiseev, R. Sennrich and I. Titov (2019) Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proc. of ACL, pp. 5797–5808. Cited by: §4.2, §4.
  46. L. Wan, M. Zeiler, S. Zhang, Y. Le Cun and R. Fergus (2013) Regularization of neural networks using dropconnect. In Proc. of ICML, pp. 1058–1066. Cited by: §4.1.
  47. A. Zeyer, P. Bahar, K. Irie, R. Schlüter and H. Ney (2019) A comparison of Transformer and LSTM encoder decoder models for ASR. In Proc. of ASRU, pp. 8–15. Cited by: §1.
  48. A. Zeyer, K. Irie, R. Schlüter and H. Ney (2018) Improved training of end-to-end attention models for speech recognition. In Proc. of Interspeech, pp. 7–11. Cited by: §1.