End-to-end Speech Recognition: A review for the French language

Abstract

Recently, end-to-end ASR based either on sequence-to-sequence networks or on the CTC objective function has gained a lot of interest from the community, achieving competitive results over traditional systems using robust but complex pipelines. One of the main features of end-to-end systems, in addition to the ability to free themselves from extra linguistic resources such as dictionaries or language models, is the capacity to model acoustic units such as characters, subwords or even words directly, opening up the possibility to transcribe speech with different representations or levels of knowledge depending on the target language. In this paper we propose a review of the existing end-to-end ASR approaches for the French language. We compare results to conventional state-of-the-art ASR systems and discuss which units are best suited to model the French language.

Florian Boyer, Jean-Luc Rouas
Airudit, ENSEIRB-MATMECA, Talence, France
Univ. de Bordeaux, LaBRI, INP, CNRS, UMR5800, Talence, France
florian.boyer@{labri.fr, ea4t.com}, jean-luc.rouas@labri.fr

Index Terms: acoustic modeling, end-to-end speech recognition, French language

1 Introduction

Automatic Speech Recognition (ASR) has traditionally used Hidden Markov Models (HMM), describing temporal variability, combined with Gaussian Mixture Models (GMM), computing emission probabilities from HMM states, to model and map acoustic features to phones. In recent years, the introduction of deep neural networks replacing GMMs for acoustic modeling showed huge improvements compared to previous state-of-the-art systems [1, 2]. However, building and training such systems can be complex and a lot of preprocessing steps are involved. Traditional ASR systems are also factorized into several modules, the acoustic model being only one of them alongside the lexicon and the language model.

Recently, more direct approaches – called end-to-end methods – in which neural architectures are trained to directly map sequences of acoustic features to sequences of characters have been proposed [3, 4, 5]. Predicting context-independent targets such as characters with a single neural network architecture has drawn a lot of interest from the research community as well as from non-expert developers. This is due to the simplicity of the pipeline and the possibility to create a complete ASR system without the need for expert knowledge. Moreover, an orthographic output allows words to be constructed freely, which is appealing with respect to the Out-Of-Vocabulary (OOV) problem encountered in traditional ASR systems.

End-to-end systems are nowadays extensively used and studied for multiple tasks and languages such as English, Mandarin or Japanese. However, for a language such as French, ASR performance and results with the existing methods have been scarcely studied, although the large number of silent letters, homophones or slang words makes comparing the assumptions made by each method particularly attractive.

In this context, we decided to study the three main types of architectures which have demonstrated promising results over traditional systems: 1) Connectionist Temporal Classification (CTC) [6, 7], which uses Markov assumptions (i.e. conditional independence between predictions at each time step) to efficiently solve sequential problems by dynamic programming, 2) attention-based methods [8, 9], which rely on an attention mechanism to perform non-monotonic alignment between acoustic frames and recognized acoustic units, and 3) RNN-transducer [1, 10, 11], which extends CTC by additionally modeling the dependencies between outputs at different steps using a prediction network analogous to a language model. We extend our experiments by adding two hybrid end-to-end methods: a multi-task method called joint CTC-attention [12, 13] and a RNN-transducer extended with attention mechanisms [14]. To complete our review, we build a state-of-the-art phone-based system based on the lattice-free MMI criterion [15] and its end-to-end counterpart with both phonetic and orthographic units [16].

2 End-to-end systems for Speech Recognition

2.1 Connectionist Temporal Classification

The CTC [6] can be seen as a direct translation of conventional HMM-DNN ASR systems into lexicon-free systems. Thus, the CTC follows the general ASR formulation, training the model to maximize the probability distribution over all possible label sequences:

$$p(Y|X) = \sum_{\pi \in \mathcal{B}^{-1}(Y)} p(\pi|X)$$

Here, $X = (x_1, \ldots, x_T)$ denotes the sequence of observations and $Y = (y_1, \ldots, y_L)$ is a sequence of acoustic units of length $L$ such that $y_l \in \mathcal{U}$, where $\mathcal{U}$ is an alphabet containing all distinct units. As in traditional HMM-DNN systems, the CTC model makes conditional independence assumptions between output predictions at different time steps given aligned inputs, and it uses the probabilistic chain rule to factorize the posterior distribution $p(Y|X)$ into three distributions (i.e. framewise posterior distribution, transition probability and prior distribution of units). However, unlike HMM-based models, the framewise posterior distribution is defined here over a framewise acoustic unit sequence $\pi = (\pi_1, \ldots, \pi_T)$ with an additional blank label $\langle b \rangle$, such that $\pi_t \in \mathcal{U} \cup \{\langle b \rangle\}$:

$$p(Y|X) \approx \sum_{\pi \in \mathcal{B}^{-1}(Y)} \prod_{t=1}^{T} p(\pi_t | X)$$

Here, $\mathcal{B}$ introduces two contraction rules for the output labels, collapsing repeated successive acoustic units and removing blank symbols, the blank making it possible to output repeated units.
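
As a small illustration of the mapping $\mathcal{B}$, the following sketch (our own, purely illustrative Python example) applies the two contraction rules to framewise label paths; the blank symbol name and the toy labels are arbitrary.

```python
# Minimal sketch of the CTC contraction mapping B (illustrative, not the
# implementation used in the paper): collapse repeated labels, then remove blanks.
from itertools import groupby

BLANK = "<b>"

def ctc_collapse(path):
    """Map a framewise label path to an output sequence (B in the text)."""
    # Rule 1: collapse runs of identical labels.
    collapsed = [label for label, _ in groupby(path)]
    # Rule 2: remove the blank symbol.
    return [label for label in collapsed if label != BLANK]

# Both framewise paths below are members of B^{-1}(("c", "a", "t")):
print(ctc_collapse(["c", "c", BLANK, "a", "t", "t"]))      # ['c', 'a', 't']
print(ctc_collapse(["c", BLANK, "a", BLANK, "t", BLANK]))  # ['c', 'a', 't']
# The blank also allows repeated output units, e.g. the "tt" in "attendre":
print(ctc_collapse(["t", BLANK, "t"]))                     # ['t', 't']
```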

2.2 Attention-based model

As opposed to CTC, the attention-based approach [8, 9] does not assume conditional independence between predictions at different time steps and does not marginalize over all alignments. Thus the posterior distribution $p(Y|X)$ is directly computed by picking a soft alignment between each output step and every input step as follows:

$$p(Y|X) = \prod_{u=1}^{L} p(y_u \mid y_{1:u-1}, X)$$

Here $p(y_u \mid y_{1:u-1}, X)$ – which defines our attention-based objective function $p_{att}(Y|X)$ – is obtained according to a probability distribution, typically a softmax, applied to the linear projection of the output of a recurrent neural network (or long short-term memory network), called decoder, such as:

$$p(y_u \mid y_{1:u-1}, X) = \mathrm{softmax}(W d_u + b)$$

The decoder output $d_u$ is conditioned by the previous output $y_{u-1}$, a hidden vector $d_{u-1}$ and a context vector $c_u$. Here $d_u$ denotes the high level representation (i.e. hidden states) of the decoder at step $u$, encoding the target input, and $c_u$ designates the context – or symbol-wise vector in our case – for decoding step $u$, which is computed as the sum of the complete high level representation $h = (h_1, \ldots, h_T)$ of another recurrent neural network, called encoder, encoding the source input $X$, weighted by the attention weights $\alpha_{u,t}$:

$$c_u = \sum_{t=1}^{T} \alpha_{u,t}\, h_t, \qquad \alpha_{u,t} = \frac{\exp(e_{u,t})}{\sum_{t'=1}^{T} \exp(e_{u,t'})}$$

where $e_{u,t}$, also referred to as energy, measures how well the inputs around position $t$ and the output at position $u$ match, given the decoder state $d_{u-1}$ at decoding step $u$ and the encoder state $h_t$ for input $t$. In the following, we report the standard content-based mechanism and its location-aware variant, which takes into account the alignment produced at the previous step using convolutional features:

$$e_{u,t} = \begin{cases} w^\top \tanh(W d_{u-1} + V h_t + b) & \text{(content-based)} \\ w^\top \tanh(W d_{u-1} + V h_t + U f_{u,t} + b) & \text{(location-aware)} \end{cases}$$

with $f_u = F \ast \alpha_{u-1}$, where $w$ and $b$ are vectors, $W$ the matrix for the decoder, $V$ the matrix for the high level representation and $U$ the matrix for the convolutional filters $F$, which take the previous alignment $\alpha_{u-1}$ into account for the location-aware attention mechanism.
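
For concreteness, the sketch below shows one possible PyTorch formulation of the location-aware energies and context vector defined above; the module name, dimensions and tensor layout are our own assumptions, not the ESPNET implementation used later in the paper.

```python
# Hedged sketch of location-aware attention: energies from decoder state,
# encoder states and convolutional features over the previous alignment.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationAwareAttention(nn.Module):
    def __init__(self, dec_dim, enc_dim, att_dim, n_filters=10, filter_width=100):
        super().__init__()
        self.W = nn.Linear(dec_dim, att_dim, bias=False)    # decoder state projection
        self.V = nn.Linear(enc_dim, att_dim, bias=False)    # encoder state projection
        self.U = nn.Linear(n_filters, att_dim, bias=False)  # convolutional features
        self.conv = nn.Conv1d(1, n_filters, filter_width, padding=filter_width // 2)
        self.w = nn.Linear(att_dim, 1, bias=True)

    def forward(self, dec_state, enc_states, prev_alpha):
        # enc_states: (B, T, enc_dim); dec_state: (B, dec_dim); prev_alpha: (B, T)
        f = self.conv(prev_alpha.unsqueeze(1))[:, :, : enc_states.size(1)]  # (B, F, T)
        f = f.transpose(1, 2)                                               # (B, T, F)
        e = self.w(torch.tanh(self.W(dec_state).unsqueeze(1)
                              + self.V(enc_states) + self.U(f))).squeeze(-1)  # energies
        alpha = F.softmax(e, dim=-1)                                    # attention weights
        context = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)  # context c_u
        return context, alpha
```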

2.3 RNN transducer

The RNN transducer architecture was first introduced by Graves et al. [10] to address the main limitation of the proposed CTC network: it cannot model interdependencies within the output sequence as it assumes conditional independence between predictions at different time steps.
To tackle this issue, the authors introduced a CTC-like network augmented with a separate RNN predicting each label given the previous ones, analogous to a language model. With the addition of another network taking into account both encoder and decoder outputs, the system can jointly model interdependencies between inputs and outputs and within the output label sequence.

Although the CTC and RNN-transducer approaches are similar, it should be noted that, unlike CTC, which represents a loss function, the RNN-transducer defines a model structure composed of the following subnetworks:

  • The encoder or transcription network: from an input value $x_t$ at timestep $t$, this network yields an output vector $h_t^{enc}$ of dimension $|\mathcal{U}| + 1$, where the additional dimension denotes the blank label $\langle b \rangle$, which acts similarly as in the CTC model.

  • The prediction network: given as input the previous label prediction $y_{u-1}$, this network computes an output vector $g_u$ dependent on the entire label sequence $y_{1:u-1}$.

  • The joint network: using both encoder outputs $h_t^{enc}$ and prediction outputs $g_u$, it computes a joint output $z_{t,u}$ for each input $t$ in the encoder sequence and label $u$ in the prediction network such as:

$$z_{t,u} = \tanh\left(W^{enc} h_t^{enc} + W^{pred} g_u + b\right)$$

The output from the joint network is then passed to a softmax layer which defines a probability distribution over the set of possible target labels, including the blank symbol.

It should be noted that we made a small modification compared to the last proposed version [1]: instead of feeding the hidden activations of both networks into a single separate linear layer, whose outputs are then normalised, we include another linear layer and feed each hidden activation to its corresponding linear layer, which yields a vector of dimension $J$, the defined joint space.

Similarly to the CTC, the marginalized alignments are local and monotonic and the label likelihood can be computed using dynamic programming. However, unlike CTC, the RNN-transducer allows the prediction of multiple characters at a single time step, through vertical transitions in its output lattice.
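
To make the joint computation concrete, here is a hedged PyTorch sketch of a joint network with per-network linear layers projecting into a joint space, in the spirit of the modification described above; names and dimensions are illustrative, not the actual implementation.

```python
# Illustrative joint network: encoder and prediction outputs are projected
# into a joint space of dimension J, combined, and mapped to label logits.
import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    def __init__(self, enc_dim, pred_dim, joint_dim, n_labels_plus_blank):
        super().__init__()
        self.proj_enc = nn.Linear(enc_dim, joint_dim)    # encoder -> joint space
        self.proj_pred = nn.Linear(pred_dim, joint_dim)  # prediction net -> joint space
        self.out = nn.Linear(joint_dim, n_labels_plus_blank)

    def forward(self, h_enc, g_pred):
        # h_enc: (B, T, enc_dim), g_pred: (B, U, pred_dim)
        z = torch.tanh(self.proj_enc(h_enc).unsqueeze(2)      # (B, T, 1, J)
                       + self.proj_pred(g_pred).unsqueeze(1)) # (B, 1, U, J)
        # Logits over units + blank for every (t, u) pair; the softmax and the
        # forward-backward marginalization are applied by the transducer loss.
        return self.out(z)  # (B, T, U, n_labels_plus_blank)
```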

2.4 Other notable approaches

Joint CTC-attention  The key idea behind the joint CTC-attention [12] learning approach is simple. By simultaneously training the encoder of the attention-based model with a standard CTC objective function as an auxiliary task, monotonic alignments between speech and label sequences can be enforced, reducing the irregular alignments caused by large jumps or loops on the same frame in the attention-based model. The objective function below formulates the multi-task learning of the network, where $\lambda \in [0, 1]$ is a tunable parameter weighting the contribution of each loss function:

$$\mathcal{L}_{MTL} = \lambda\, \mathcal{L}_{CTC} + (1 - \lambda)\, \mathcal{L}_{att}$$
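
A minimal sketch of this multi-task objective, assuming the two losses have already been computed (the weight value follows Section 4.3; everything else is a placeholder):

```python
# Hedged sketch of the multi-task objective: the CTC loss is used as an
# auxiliary task next to the attention loss.
def mtl_loss(loss_ctc, loss_att, lam=0.3):
    """L_MTL = lambda * L_CTC + (1 - lambda) * L_att."""
    return lam * loss_ctc + (1.0 - lam) * loss_att
```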

The approach proposed in [13] introduced a joint decoding method to take the CTC predictions into account in the beam-search based decoding process of the attention-based model. Since their respective scores are difficult to combine – the attention-based decoder performs the beam search character-synchronously whereas the CTC performs it frame-synchronously – two methods were proposed.

The first one is a two-pass decoding process where the complete hypotheses from the attention model are first computed and then rescored according to the following equation, where $p_{ctc}(Y|X)$ is computed using the standard CTC forward-backward algorithm and $\hat{\Omega}$ denotes the set of complete hypotheses:

$$\hat{Y} = \underset{Y \in \hat{\Omega}}{\arg\max}\ \left\{\lambda \log p_{ctc}(Y|X) + (1 - \lambda) \log p_{att}(Y|X)\right\}$$

The second method is a one-pass decoding method where the probability of each partial hypothesis in the beam search process is computed directly using both the CTC and attention models such as, given the partial hypothesis $h$ and the score $\alpha(h)$ defined as the log probability of the hypothesized sequence:

$$\alpha(h) = \lambda\, \alpha_{ctc}(h) + (1 - \lambda)\, \alpha_{att}(h)$$
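
A corresponding sketch of the one-pass score combination for a partial hypothesis, with placeholder score values and the CTC weight of Section 4.4:

```python
# Hedged sketch of the one-pass joint score: a lambda-weighted combination of
# the CTC and attention log-probabilities of a partial hypothesis h.
def joint_score(alpha_ctc, alpha_att, lam=0.2):
    """alpha(h) = lam * alpha_ctc(h) + (1 - lam) * alpha_att(h)."""
    return lam * alpha_ctc + (1.0 - lam) * alpha_att
```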

End-to-end lattice-free MMI  The end-to-end lattice-free MMI [16] is the end-to-end version of the method introduced by Povey et al. [15]. In this version, a flat-start manner is adopted in order to remove the need to train an initial HMM-GMM for alignments and the tree-building pipeline. Although, in terms of pipeline, the approach is closer to a flat-start adaptation of the state-of-the-art method than to a true end-to-end system, and it does not benefit from the open-vocabulary property of the previously presented methods to construct unseen words, we use it in our experiments as it showed only a small degradation over the original lattice-free MMI with different acoustic units. We can therefore contrast the orthographic differences in production between open systems and more constrained ones, where the relationship between acoustic units and a word-level representation is restricted.

RNN-transducer with attention  The RNN transducer architecture augmented with attention mechanisms was first mentioned, to the best of our knowledge, in [14]. Here, the prediction network described in 2.3 is replaced by an attention-based decoder similar to the one described in 2.2 and used in the joint CTC-attention. This modification allows the decoder to access acoustic information alongside the sequence of previous predictions. As the decoder output computation is not affected by this change (the decoder and joint outputs computation are not dependent on a particular choice of segmentation), the architecture can be trained with the same forward-backward algorithm used for standard RNN-transducer. Finally, unlike the previous hybrid procedure, the inference procedure can be performed frame-synchronously with an unmodified greedy or beam search algorithm.

3 Database

We carried out our experiments using the data provided during the ESTER evaluation campaign (Evaluation of Broadcast News enriched transcription systems) [17], which is one of the most commonly used corpora for the evaluation of French ASR. Evaluations are done on the test set. The details of the dataset, corresponding to 6h34 of speech, are described in [17]. We use the same normalization and scoring rules as in the evaluation plan of the ESTER 2 campaign, except that we do not use the equivalence dictionary and partially pronounced words are scored as full words.

To train the acoustic models we use the 90h of the ESTER 2 training set, augmented by 75h from the ESTER 1 training set and 90h from the additional subset provided in ESTER 1, with their transcriptions provided in the EPAC corpus [18]. We removed segments containing less than 1.5 seconds of transcribed speech and, for the end-to-end models, we excluded the utterances corresponding to segments with more than 3000 input frames or sentences of more than 400 characters. Because some irregular segment-utterance pairs remained, we re-segmented the training data using the GMM-HMM model (with LDA-MLLT-SAT features) that we build our phone-based chain model upon. During re-segmentation, only the audio parts matching the transcripts are selected. This brings the training data to approximately 231h. For neural network training, we applied 3-fold speed perturbation [19] and volume perturbation with a random volume scale factor between 0.25 and 2, leading to a total of approximately 700h of training data.
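
As an illustration of the segment selection rules above, here is a minimal sketch (our own; the 10 ms frame shift is an assumption, not stated in the text):

```python
# Illustrative filtering rule for end-to-end training utterances
# (thresholds from the text; the 10 ms frame shift is our assumption).
def keep_utterance(duration_s, transcript, frame_shift_s=0.01):
    n_frames = int(duration_s / frame_shift_s)
    return (duration_s >= 1.5            # drop segments under 1.5 s of speech
            and n_frames <= 3000         # drop inputs longer than 3000 frames
            and len(transcript) <= 400)  # drop sentences over 400 characters
```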

For language modeling, we use the manual transcripts from the training set. We extend this set with manually selected transcriptions from other speech sources (BREF corpus [20], oral interventions in EuroParl from ’96-’06 [21] and a small portion of transcriptions from internal projects). The final corpus is composed of 2 041 916 sentences, for a total of 46 840 583 words.

4 Implementations

All our systems share equivalent optimization – no rescoring technique or post-processing is done – as well as equivalent resource usage. Each system is kept to its initial form (i.e. no further training on top of the reported system).

4.1 Acoustic units

For our experiments, three kinds of acoustic units were chosen: phones, characters and subwords. The baseline phone-based systems use the standard 36 phones used in French. The CTC, attention and hybrid systems each have two versions: one for characters, with 41 classes (26 letters from the Latin alphabet, 14 letters with diacritics and the apostrophe), and another for subwords, where the number of classes is set to 500, the final set of subword units used in our training being selected with a subword segmentation algorithm based on a unigram language model [22] and implemented in Google's SentencePiece toolkit [23]. For the end-to-end variant of the chain model, character units are used with the same 41-class set.
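
A subword inventory of this kind can be built along the following lines with recent versions of the SentencePiece Python package; file names and the example sentence are placeholders:

```python
# Hedged example of training a 500-class unigram subword model [22, 23].
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",    # normalized training transcripts (placeholder)
    model_prefix="ester_unigram500",  # hypothetical output prefix
    vocab_size=500,                   # subword inventory size used in this work
    model_type="unigram",             # unigram LM segmentation
)

sp = spm.SentencePieceProcessor(model_file="ester_unigram500.model")
print(sp.encode("la parole est reconnue", out_type=str))  # subword segmentation
```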

4.2 Baseline systems

We used the Kaldi toolkit [24] to train the chain model and its end-to-end variant.

The chain model is a TDNN-HMM model trained with the LF-MMI objective function. The neural network is a sub-sampled time-delay neural network (TDNN) with 7 TDNN layers of 1024 units each, the time-stride value being set to 1 in the first three layers, 0 in the fourth layer and 3 in the final ones. The end-to-end version of the chain model is trained in the same way as the original model but with a different architecture. When using characters as units, the network is composed of one projected LSTM layer [25] with 512 units followed by 2 TDNN layers of 512 units – this block of three layers being repeated twice – and another projected LSTM layer with 512 units. The time-delay value in the recurrent connections of the projected LSTM layers is set to 3.

As input for our models, we use a 40-dimensional high-resolution MFCC vector (i.e. a linear transform of the filterbanks) with CMVN for both the chain model trained with lattice-free MMI and its end-to-end variant. We also trained separately a phone-based chain model with the previous 40-dimensional MFCC vector concatenated with a 100-dimensional i-vector [26] as input, to assess the impact of speaker-dependent features.

For the linguistic part, we also trained a word 3-gram language model using SRILM's n-gram counting method [27] with Kneser-Ney discounting. As lexicon we use the phonetic dictionary provided by the LIUM; the vocabulary of our language model is thus limited to the 50k most frequent words found in our training texts that are also present in their dictionary.

For the end-to-end version modeling characters, we replace the phonetic lexicon by an orthographic lexicon with the same entries, where the pronunciation of each word is its character sequence, with a space inserted between each character.

4.3 End-to-end systems

We use the ESPNET toolkit [28] to train the five end-to-end systems. For each method two acoustic units are used: character and subword. Ten epochs are used to train each model.

The acoustic models for all methods share the same architecture, composed of a VGG bottleneck [13] followed by a 3-layer bidirectional LSTM with 1024 units in each layer and each direction. As decoder for the models using an attention mechanism, we use a 1-layer LSTM with 1024 units and a location-based mechanism with 10 centered convolution filters of width 100 for the convolutional feature extraction. When training jointly CTC and attention, $\lambda$ was set to 0.3 based on preliminary experiments. For the RNN-transducer, the joint space between encoder and decoder was set to 1024 dimensions.
The input features for these models are 80-dimensional filterbanks with their first and second derivatives, with cepstral mean normalization (CMN).
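
For reference, here is a hedged PyTorch sketch of the recurrent part of this shared encoder; the VGG front-end is omitted and the actual models are built with ESPNET [28]:

```python
# Illustrative shared encoder topology: 3-layer bidirectional LSTM with
# 1024 units per layer and direction, over 240-dim inputs
# (80 filterbanks + first and second derivatives).
import torch
import torch.nn as nn

encoder = nn.LSTM(input_size=240, hidden_size=1024, num_layers=3,
                  bidirectional=True, batch_first=True)

x = torch.randn(4, 500, 240)       # (batch, frames, features), dummy input
outputs, _ = encoder(x)            # outputs: (batch, frames, 2048)
```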

For the experiments involving language models, we trained three different models using the RNNLM module available in ESPNET: one with characters, another with subwords and the last one with full words, used for multi-level combination when dealing with characters as units. Each model is incorporated at inference time using shallow fusion [29], except for the word LM, which relies on multi-level decoding [30]. The main architecture of our RNNLMs is a 1-layer RNN, the number of units depending on the target unit: 650 units for subwords and characters, and 1024 units for words. Unlike the systems described above, the vocabulary for the word-based RNNLM was limited according to the training texts only.

In order to directly compare the baseline systems to the end-to-end systems relying on different word-based LMs (i.e. N-gram and RNN-based), another RNNLM was trained using the tools available in Kaldi. This language model shares the same architecture as the word RNNLM described in this subsection and was trained with equivalent training parameters. Following the lattice rescoring approach proposed in [31], decoding was then performed with this RNNLM for all baseline systems. We observe a maximal WER improvement of 0.12% on the dev set and 0.16% on the test set compared to the systems relying on the original 3-gram. Adding to that a difference of less than 1.3% between the words in the language model vocabularies of the baseline and end-to-end systems, we consider the impact on our comparison to be minimal.

4.4 Decoding

To measure the best performance, we set the beam size to 30 in decoding under all conditions and for all models. When decoding with the attention-only model, we do not use sequence length control parameters such as a coverage term or length normalization [32]. When joint-decoding, the CTC weight $\lambda$ is set to 0.2 based on our preliminary experiments. For the CTC and attention experiments involving a RNNLM, separate language model weights are used during decoding for the character, subword and word LMs. For the RNN-transducer, we downscale the weight of the external language model when performing multi-level LM decoding.
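
The sketch below illustrates how an external LM enters the ranking of beam candidates through shallow fusion; the weight and log-probabilities are illustrative placeholders, not values from our systems:

```python
# Hedged sketch of shallow fusion during beam search: each candidate hypothesis
# is ranked by its end-to-end model log-probability plus a weighted LM term.
def fused_score(logp_e2e, logp_lm, lm_weight):
    """Score used to rank a (partial) hypothesis in the beam."""
    return logp_e2e + lm_weight * logp_lm

# Example: rank two candidate extensions of a partial hypothesis.
candidates = {"chat": (-1.2, -0.7), "chas": (-1.1, -2.5)}  # (log p_e2e, log p_lm)
best = max(candidates, key=lambda y: fused_score(*candidates[y], lm_weight=0.3))
print(best)  # the LM pushes the beam toward the more likely word form: "chat"
```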

5 Results

| Model | Units | Lexicon | LM | Corr. | Sub. | Ins. | Del. | CER | WER |
|---|---|---|---|---|---|---|---|---|---|
| chain LF-MMI | phone | 50K | word 3-gram | - | - | - | - | - | 14.2 |
| chain LF-MMI (i-vectors) | phone | 50K | word 3-gram | - | - | - | - | - | 13.7 |
| e2e chain LF-MMI | phone | 50K | phone 4-gram + word 3-gram | - | - | - | - | - | 14.4 |
| e2e chain LF-MMI | char | 50K | char 4-gram + word 3-gram | 94.3 | 2.6 | 2.0 | 3.0 | 7.6 | 14.8 |
| CTC | char | None | None | 87.4 | 4.9 | 7.7 | 3.0 | 15.5 | 42.3 |
| CTC | char | None | char RNNLM | 89.5 | 4.4 | 6.1 | 2.8 | 13.3 | 31.0 |
| CTC | char | 50K | word RNNLM | 89.8 | 4.3 | 5.8 | 2.7 | 12.8 | 27.3 |
| CTC | subword | None | None | 81.2 | 9.2 | 9.6 | 1.4 | 20.1 | 28.4 |
| CTC | subword | None | subword RNNLM | 85.7 | 9.1 | 6.1 | 2.3 | 17.5 | 21.2 |
| Attention (location-based) | char | None | None | 89.8 | 3.2 | 6.7 | 3.3 | 13.2 | 24.4 |
| Attention (location-based) | char | None | char RNNLM | 90.1 | 3.2 | 6.4 | 3.2 | 12.8 | 23.6 |
| Attention (location-based) | char | 50K | word RNNLM | 90.2 | 3.3 | 6.2 | 3.2 | 12.7 | 23.0 |
| Attention (location-based) | subword | None | None | 84.1 | 12.3 | 3.6 | 3.6 | 19.5 | 22.7 |
| Attention (location-based) | subword | None | subword RNNLM | 85.0 | 11.1 | 3.4 | 3.3 | 18.4 | 21.8 |
| RNN-transducer | char | None | None | 93.9 | 2.8 | 3.3 | 2.4 | 8.5 | 19.7 |
| RNN-transducer | char | None | char RNNLM | 94.0 | 2.6 | 3.4 | 2.2 | 8.2 | 18.8 |
| RNN-transducer | char | 50K | word RNNLM | 94.1 | 2.5 | 3.4 | 2.1 | 8.0 | 18.1 |
| RNN-transducer | subword | None | None | 87.1 | 8.9 | 4.1 | 2.5 | 15.5 | 18.5 |
| RNN-transducer | subword | None | subword RNNLM | 87.4 | 8.2 | 4.3 | 2.2 | 14.7 | 17.4 |
| Joint CTC-attention + joint decoding (MTL=0.3, ctc weight=0.2) | char | None | None | 91.7 | 2.9 | 5.4 | 2.1 | 10.4 | 22.1 |
| Joint CTC-attention + joint decoding | char | None | char RNNLM | 92.2 | 2.9 | 5.0 | 2.2 | 10.1 | 20.6 |
| Joint CTC-attention + joint decoding | char | 50K | word RNNLM | 92.8 | 3.1 | 4.1 | 2.4 | 9.6 | 18.6 |
| Joint CTC-attention + joint decoding | subword | None | None | 87.3 | 9.3 | 3.5 | 2.5 | 15.3 | 18.7 |
| Joint CTC-attention + joint decoding | subword | None | subword RNNLM | 87.4 | 8.8 | 3.3 | 2.4 | 14.5 | 17.8 |
| RNN-transducer w/ attention (location-based) | char | None | None | 94.1 | 2.7 | 3.2 | 2.3 | 8.2 | 19.1 |
| RNN-transducer w/ attention | char | None | char RNNLM | 94.1 | 2.5 | 3.4 | 2.1 | 8.0 | 18.3 |
| RNN-transducer w/ attention | char | 50K | word RNNLM | 94.3 | 2.4 | 3.3 | 2.1 | 7.8 | 17.6 |
| RNN-transducer w/ attention | subword | None | None | 87.1 | 9.0 | 4.1 | 2.5 | 15.6 | 18.4 |
| RNN-transducer w/ attention | subword | None | subword RNNLM | 87.3 | 8.3 | 4.4 | 2.2 | 14.9 | 17.5 |

Table 1: Character Error Rate (CER) with detailed report (correct, substituted, inserted and deleted characters) and Word Error Rate (WER), all in %, for all evaluated methods on the ESTER test set. For subword systems, the detailed error report and CER are computed on subword units. Bold values in the original table indicate the best results for each section.

The results of our experiments in terms of Character Error Rate (CER) and Word Error Rate (WER) on the test set are gathered in Table 1. For CER we also report the detailed error counts: correct, substituted, inserted and deleted characters.
It should be noted that the default CER computation in all frameworks does not score the space as a separate character. Since the important information carried by this character, namely word-boundary errors, can still be observed through the WER variations during comparison, we kept the default CER computation. Thus, for small CER variations, larger WER differences are expected, notably between traditional and end-to-end systems.
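
As a minimal illustration of this effect, the sketch below computes the CER on character sequences with spaces stripped and the WER on word sequences, using a plain Levenshtein distance; this is our own illustration, not the toolkits' scoring scripts:

```python
# CER vs WER with spaces excluded from character scoring: a single missing
# space leaves the CER untouched but raises the WER.
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (rolling-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def cer(ref, hyp):
    ref_c, hyp_c = ref.replace(" ", ""), hyp.replace(" ", "")
    return edit_distance(ref_c, hyp_c) / len(ref_c)

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

print(cer("la maison", "lamaison"), wer("la maison", "lamaison"))  # 0.0 1.0
```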

5.1 Baseline systems

The phone-based chain model trained with the lattice-free MMI criterion has a WER of 14.2% on the test set. Compared to the best system reported during the ESTER campaign (WER 12.1% [17]), the performance shows a relative degradation of 14.8%. Although the compared system relies on a HMM-GMM architecture, it should be noted that a triple-pass rescoring (+ post-processing) is applied, a substantial number of parameters is used, and a much larger amount of data is used for training the language model (more than 11 times our volume). Adding i-vector features, the performance of our model is further improved, leading to a WER of 13.7%.

For the end-to-end phone-based system we note a small WER degradation of 0.2% compared to the original system without i-vectors, which is a good trade-off considering the removal of the initial HMM-GMM training. Switching to characters as acoustic units, we obtain a WER of 14.8%, corresponding to a CER of 7.6%. The detailed report shows that all types of errors are quite balanced, with however a higher number of deletions. The system remains competitive even with orthographic units, despite the low correspondence between phonemes and letters in French. On the same note, a plain conversion of the phonetic lexicon to a grapheme-based one does not negatively impact the performances. This was not expected considering the use of alternative phonetic representations in French to denote possible liaisons (the pronunciation of the otherwise silent final consonant of a word when the following word begins with a vowel sound).

5.2 End-to-end systems

Character-based models  While, without language model, the attention-based model outperforms the CTC model as expected, the RNN-transducer performance exceeds our initial estimations, surpassing both previous models in terms of CER and WER. The RNN-transducer even outperforms these models coupled with a language model, regardless of the level of knowledge included (character and word level). The CER obtained with this model is 8.5% and the WER is 19.7%. This represents a relative decrease of almost 40% for the CER and 17% for the WER against the attention-based model with word LM, the second best classic end-to-end system. Compared to the end-to-end chain model modeling characters, we observe a small CER difference of 0.9% which corresponds to a WER difference of 4.9%. While the CER is competitive, errors at word level seem to indicate difficulties in modeling word boundaries compared to the baseline systems.

Extending the comparison to the hybrid models, only the RNN-transducer with attention mechanism could achieve similar or better results than its vanilla version. Although the joint CTC-attention procedure is beneficial to correct some limitations of the individual approaches, the system only reaches a CER of 10.4%, corresponding to a WER of 22.1%. However, by adding a word LM and using multi-level decoding, the system can achieve a closer WER (18.6%) despite the significant difference in terms of CER (9.6%).
For the hybrid transducer relying on an additional attention module, performances in all experiments are further improved compared to the standard transducer, reaching 8.2% CER and 19.1% WER without language model.

Concerning the best systems, it should be noted that the RNN-transducer performance is further improved with the use of a language model, obtaining a CER of 8.0%, close to our baseline score (7.6%), with a word LM. In terms of WER this represents a relative improvement of 8.5% against the previous results, which is however still far from the performance obtained with the baseline system for this metric (14.8%). For the RNN-transducer with an attention decoder, we achieve even better performance with a CER of 7.8%, corresponding to a WER of 17.6%. This is our best model with characters as acoustic units.

Focusing on the CER report, several observations can be made:

Deletion errors are lower for CTC models than for attention-based systems, even when language models are added. Attention-based models are expected to have a higher number of deletions or insertions depending on the length difference between input and output sequences; it is however unanticipated to observe such a high number of insertion errors.

Following this last observation, we investigated the insertion errors made by the attention-only model. From what we found, the main reason is the existence of irregular segment-utterance pairs in the dataset (i.e. segments with very low correspondence to their transcript). Using coverage, penalty or length ratio terms helped on the problematic pairs but degraded the global performances, regular short or long pairs being impacted.

Adding a language model decreases all errors in the CTC systems, while only insertion errors decrease in the attention-only system. Coupled with a word language model, substitution errors are even higher for the attention model.

Similar observations can be made for the transducer. While we observe a small decrease of deletion errors with the addition of a language model, we also see an increase in insertion errors. However, the system is more impacted by the deletion changes, as the number of substitutions decreases and the number of correct characters increases.

Despite similar CER performances between, for example, the CTC model with word LM and the attention-only model with character or word LM, the first system cannot reach the word error rate of the second ones. It is beneficial to model linguistic information alongside acoustic information rather than only in an external language model, whether at character or word level, although both can be combined to reach better performances. However, we should also consider that the training data for the acoustic model is the same as the data used to train the LM, augmented with a volume equivalent to less than a quarter of the initial training sentences.

Comparing the end-to-end chain model modeling characters to the RNN-transducer with language models, we can extract several useful pieces of information. Insertion errors made by the transducer are more influential at word level than the deletion errors made by the baseline system. From the hypotheses, we observed that the deletion errors of the baseline system mostly happen on ambiguous verbal forms, gender forms or singular/plural forms. For the transducer, the same behaviour is observed; however, insertion errors also add a space character in most cases, notably on proper and common nouns, which are numerous in the corpus.

Although we observe a smaller number of substitutions at character level for the RNN-transducer, with or without attention, compared to the baseline system, substitution errors impact more words than in the baseline system. These errors are mostly due to the same problems described previously, while substitutions in the baseline systems are more localized, particularly due to the presence of OOV and ambiguous words.

Considering all the previous observations, further investigation should be done to compare and categorize errors at character and word level in each system, and also to assess the impact of these errors. The character errors reported for the RNN-transducer with attention should be sufficient motivation, as we report, against the baseline system, a lower number of substitution and deletion errors coupled with an equivalent number of correct characters, despite a significant gap in WER performance.

Subword-based models  Replacing characters with subword units improves the overall performance of all end-to-end methods. The gain is particularly important for CTC, lowering the WER from 42.3% to 28.4% without language model. The gain observed when adding the language model to CTC is as impressive, with a relative improvement of almost 28% on WER. For the system relying only on attention, the WER is further improved without and with language model but, unlike when we used characters, the model is outperformed on both CER and WER by the model relying on CTC. Although we observe a similar CER for both methods, we also note a significant difference in terms of correct characters and WER. The attention model mostly makes consecutive mistakes on the same words or groups of words (particularly at the beginning and end of utterances), while the CTC tends to recognize parts of words as independent, thus incorrectly recognizing word boundaries. Adding the RNN-transducer to the comparison, both previous methods are surpassed, on CER (20.1% for CTC, 19.5% for attention and 15.5% for the transducer) and on WER (28.4% for CTC, 22.7% for attention and 18.5% for the transducer). Decoding with an external language model, the CER and WER are further improved, by 0.8% and 1.1% respectively. It should be noted that the transducer model without language model exceeds CTC and attention coupled with the subword LM.

Adding the hybrid systems to the comparison, we note some differences compared to the character-based systems. The RNN-transducer is not improved by the attention mechanism and is even slightly degraded on both CER and WER. The same observation can be made with and without the addition of a LM. It seems the attention mechanism has more difficulty modeling intra-subword relations than intra-character relations. However, further work should be allocated to extending the comparison with different attention mechanisms, such as multi-head attention, and to estimating the influence of the architecture depending on output dimensions and representations.
Concerning the last hybrid system, joint CTC-attention seems better suited to subwords than to characters, reaching performances comparable to the transducer even without language model: 18.7% against 18.5% for the RNN-transducer and 18.4% for its attention variant. Although the transducers are reported as our best systems, it should be noted that joint CTC-attention reaches equal or better performance on subword errors. Considering only the conventional ASR metrics, we consider the two hybrid systems and the vanilla transducer equivalent for subword units.

As in the previous section, we also focused on the detailed error report and noted some differences compared to the previous observations:

Akin to the previous observations with characters, deletion errors are lower for the CTC model (1.4%) than for the attention-based model (3.6%) with subwords. However, here, the number of deletions for CTC is even lower than for all the other methods, the transducer and hybrid systems showing a deletion error of 2.5%.

Previously, we noted that a higher number of deletions or insertions should be expected with the attention-only model. With subword units, we observe a balanced number of deletions and insertions, although we also note a significant number of substitutions. Following this new observation, we also investigated the orthographic output of both models. We found that the previous limitation of the attention model, word sequences being unrolled or stopped early, was mostly removed. However, it translated into a very large number of substitutions, some subwords within the word structure being repeated or cut.

Although we report a higher number of correct labels and a lower number of errors for joint CTC-attention, the hybrid method obtains a higher WER than the RNN-transducer and its hybrid version. Analyzing the hypotheses formulated and the error distribution of both systems, we could not extract any relevant information explaining the number of words impacted by the errors at character level.

On the same note, the following difference should still be noted: transducer-based models have a lower number of substitutions and deletions whereas joint CTC-attention has a lower number of insertions and a higher number of correct characters. Outside correct labels, only the CTC has a similar error distribution.

In the case of joint CTC-attention, we can see that CTC as an auxiliary function acts as predicted: the numbers of substitutions and deletions are further improved compared to the attention-only model, whereas the number of insertions is kept in the same range despite the high number of insertions of the CTC-only model. In the case of the additional attention module for the RNN-transducer, although the attention-only model has a lower number of insertion errors than the RNN-transducer, the inclusion of the attention mechanism did not help to reduce this number: the error distribution is the same with and without attention. It should also be noted that the RNN-transducer with attention has equivalent performance with character and subword units.

Adding language models, all errors are lowered. The only exceptions are the number of deletions for CTC (raised from 1.4% to 2.3%) and the number of insertions for the RNN-transducer (from 4.1% to 4.3%) and its hybrid counterpart (from 4.1% to 4.4%). In these cases, and similarly to when we use character units, we can observe that one error rate (e.g. deletions) decreases when the other (e.g. insertions) increases.

6 Conclusion

In this paper, we experimentally showed that end-to-end approaches with different orthographic units are rather suitable to model the French language. The RNN-transducer was found especially competitive with character units compared to the other end-to-end approaches. Among the two orthographic units, subwords were found beneficial for most methods to address the problems described in Section 5.2 and to retain information on ambiguous patterns in French. Extending the systems with language models, we obtained promising results compared to traditional phone-based systems. The best performing system for character units is the RNN-transducer with an additional attention module, achieving 7.8% CER and 17.6% WER. For subword units, the classic RNN-transducer, the RNN-transducer with attention and joint CTC-attention show comparable performance on subword error rate and WER, the first one being slightly better on WER (17.4%) and the last one having a lower error rate on subwords (14.5%).
However, we also showed differences in the errors produced by each method and a different impact at word level depending on the approach and the units. Thus, future work will focus on analysing the orthographic output of these systems in two ways: 1) investigate the errors produced by the end-to-end methods and explore several approaches to correct common errors made in French, and 2) compare the end-to-end methods in a SLU context and evaluate the semantic value of the partially correct produced words.

References

  • [1] A. Graves, A. R. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in IEEE Inter. Conf. on Acoustics, Speech, and Signal Processing, 2013, pp. 6645–6649.
  • [2] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. R. Mohamed, N. Jaitly, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, pp. 82–97, 2012.
  • [3] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Inter. Conf. on Machine Learning, 2014, pp. 1764–1772.
  • [4] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, and A. Ng, “Deep speech: Scaling up end-to-end speech recognition,” arxiv:1412.5567 (arXiv preprint), 2014.
  • [5] F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Inter. Conf. INTERSPEECH, 2011, pp. 444–448.
  • [6] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Inter. Conf. on Machine Learning, 2006, pp. 369–376.
  • [7] Y. Miao, M. Gowayyed, and F. Metze, “Eesen: End-to-end speech recognition using deep rnn models and wfst-based decoding,” in IEEE Workshop on Automatic Speech Recognition and Understanding, 2015, pp. 167–174.
  • [8] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Annual Conf. on Neural Information Processing Systems, 2015, pp. 577–585.
  • [9] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-end attention-based large vocabulary speech recognition,” in IEEE Inter. Conf. on Acoustics, Speech, and Signal Processing, 2016, pp. 4945–4949.
  • [10] A. Graves, “Sequence transduction with recurrent neural networks,” arXiv:1211.3711 (arXiv preprint), 2012.
  • [11] E. Battenberg, J. Chen, R. Child, A. Coates, Y. G. Y. Li, H. Liu, and Z. Zhu, “Exploring neural transducers for end-to-end speech recognition,” in IEEE Workshop on Automatic Speech Recognition and Understanding, 2017, pp. 206–213.
  • [12] S. Kim, T. Hori, and S. Watanabe, “Joint ctc-attention based end-to-end speech recognition using multi-task learning,” in IEEE Inter. Conf. on Acoustics, Speech, and Signal Processing, 2017, pp. 4835–4839.
  • [13] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint ctc-attention based end-to-end speech recognition with a deep cnn encoder and rnn-lm,” in Inter. Conf. INTERSPEECH, 2017, pp. 949–954.
  • [14] R. Prabhavalkar, K. Rao, T. Sainath, B. Li, L. Johnson, and N. Jaitly, “A comparison of sequence-to-sequence models for speech recognition,” in Inter. Conf. INTERSPEECH, 2017, pp. 939–944.
  • [15] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, and S. Khudanpur, “Purely sequence-trained neural networks for asr based on lattice-free mmi,” in Inter. Conf. INTERSPEECH, 2016, pp. 2751–2755.
  • [16] H. Hadian, H. Sameti, D. Povey, and S. Khudanpur, “End-to-end speech recognition using lattice-free mmi,” in Inter. Conf. INTERSPEECH, 2018, pp. 12–16.
  • [17] S. Galliano, G. Gravier, and L. Chaubard, “The ester 2 evaluation campaign for the rich transcription of french radio broadcasts,” in Inter. Conf. INTERSPEECH, 2009, pp. 2583–2586.
  • [18] Y. Esteve, T. Bazillon, J. Y. Antoine, F. Béchet, and J. Farinas, “The epac corpus: Manual and automatic annotations of conversational speech in french broadcast news,” in Inter. Conf. of the Language Resources and Evaluation Conf., 2010.
  • [19] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Inter. Conf. INTERSPEECH, 2015, pp. 3586–3589.
  • [20] L. F. Lamel, J. L. Gauvain, and M. Eskenazi, “BREF, a large vocabulary spoken corpus for French,” in Europ. Conf. Eurospeech, 1991, pp. 505–508.
  • [21] P. Koehn, “Europarl: A parallel corpus for statistical machine translation,” in Machine Translation Summit, 2005, pp. 79–86.
  • [22] T. Kudo, “Subword regularization: Improving neural network translation models with multiple subword candidates,” in Annual Meeting of the Asso. for Computational Linguistics, 2018, pp. 66–75.
  • [23] T. Kudo and J. Richardson, “Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Inter. Conf. on Empirical Methods in Natural Language Processing, 2018, pp. 66–71.
  • [24] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, and J. Silovsky, “The kaldi speech recognition toolkit,” in IEEE Workshop on Automatic Speech Recognition and Understanding, 2011, pp. 1–4.
  • [25] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Inter. Conf. INTERSPEECH, 2014, pp. 338–342.
  • [26] V. Gupta, P. Kenny, P. Ouellet, and T. Stafylakis, “I-vector-based speaker adaptation of deep neural networks for french broadcast audio transcription,” in IEEE Inter. Conf. on Acoustics, Speech, and Signal Processing, 2014, pp. 6334–6338.
  • [27] A. Stolcke, “Srilm - an extensible language modeling toolkit,” in Inter. Conf. INTERSPEECH, 2002, pp. 901–904.
  • [28] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unni, and A. Renduchintala, “Espnet: End-to-end speech processing toolkit,” in Inter. Conf. INTERSPEECH, 2018, pp. 2207–2211.
  • [29] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model,” in IEEE Inter. Conf. on Acoustics, Speech, and Signal Processing, 2018, pp. 5824–5828.
  • [30] T. Hori, S. Watanabe, and J. R. Hershey, “Multi-level language modeling and decoding for open vocabulary end-to-end speech recognition,” in IEEE Workshop on Automatic Speech Recognition and Understanding, 2017, pp. 287–293.
  • [31] H. Xu, T. Gao, Y. Wang, K. Li, N. Goel, and S. Khudanpur, “A pruned rnnlm lattice-rescoring algorithm for automatic speech recognition,” in IEEE Inter. Conf. on Acoustics, Speech, and Signal Processing, 2018, pp. 5929–5933.
  • [32] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi et al., “Google’s neural machine translation system: bridging the gap between human and machine translation,” arXiv:1609.08144 (arXiv preprint), 2016.