Speech-to-Speech Translation Between Untranscribed Unknown Languages

Andros Tjandra, Sakriani Sakti, Satoshi Nakamura
Nara Institute of Science and Technology, Japan
RIKEN, Center for Advanced Intelligence Project AIP, Japan
{andros.tjandra.ai6,ssakti,s-nakamura}@is.naist.jp

Abstract: In this paper, we explore a method for training speech-to-speech translation models without any transcription or linguistic supervision. Our proposed method consists of two steps: first, we train and generate a discrete representation with unsupervised term discovery using a discrete quantized autoencoder; second, we train a sequence-to-sequence model that directly maps the source language speech to the target language's discrete representation. Our proposed method can directly generate target speech without any auxiliary or pre-training steps involving a source or target transcription. To the best of our knowledge, this is the first work to perform pure speech-to-speech translation between untranscribed unknown languages.

Keywords: speech translation, sequence-to-sequence, zero-resource modeling, unit discovery, autoencoder

1 Introduction

Information exchange among different countries continues to increase. International travelers for tourism, emigration, or foreign study are becoming increasingly diverse, heightening the need for effective interaction among people who speak different languages. Since automatic speech-to-speech translation (S2ST) allows people to communicate in their own languages, it significantly lowers language barriers and closes cross-cultural gaps.

Many researchers have been developing S2ST systems over the past several decades. A traditional approach to S2ST requires constructing several components, including automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis, all of which are trained and tuned independently. Given speech input, ASR transforms the speech into text in the source language, MT transforms the source language text into corresponding text in the target language, and finally TTS generates speech from the text in the target language. Significant progress has been made, and various commercial speech translation systems are already available for several language pairs. However, more than 6000 languages, spoken by 350 million people, have not been covered yet. Critically, over half of the world's languages actually have no written form; they are only spoken.

Recently, end-to-end deep learning frameworks have shown impressive performance on many sequence-related tasks, such as ASR, MT, and TTS [1, 2, 3]. Their architecture commonly uses an attentional encoder-decoder mechanism, which allows the model to learn alignments between the source and target sequences and to perform end-to-end mapping between different modalities. Many complicated hand-engineered models can also be simplified by letting neural networks find their own way to map from input to output spaces. Thus, the approach makes it possible to learn a direct mapping between variable-length source and target sequences whose alignment is often not known a priori. Several works extended the sequence-to-sequence model's coverage by directly performing end-to-end speech translation using only a single neural network architecture, instead of separately building its components (ASR, MT, and TTS).

Although the first feasibility study was presented by Duong et al. [4], they focused on the alignment between speech in the source language and text in the target language because their speech-to-word model did not yield any useful output. The first full-fledged end-to-end attentional speech-to-text translation system was successfully demonstrated by Bérard et al. on a small French-English synthetic corpus [5], but its performance was only compared with statistical MT systems. Weiss et al. [6] demonstrated that end-to-end speech-to-text models on Spanish-English language pairs outperformed neural cascade models. Kano et al. then proved that this approach is feasible for distant language pairs such as Japanese-to-English translation [7]. Similar to the model of Weiss et al. [6], their model neither explicitly transcribes the speech into text in the source language nor requires supervision from the groundtruth source language transcription during training. However, most of these works remain limited to speech-to-text translation and require text transcriptions in the target language.

Recently, Jia et al. [8] proposed an end-to-end trained deep learning model that learns to map speech spectrograms into target spectrograms in another language corresponding to the translated content (in the same or a different canonical voice). Unfortunately, since training without auxiliary losses leads to extremely poor performance, they provided a solution by integrating auxiliary decoder networks to predict phoneme sequences corresponding to the source and/or target speech. Despite much progress in direct speech translation research, completely direct speech-to-speech translation, trained without any text transcription in the source and target languages, has not been achieved yet. Therefore, it remains difficult to scale up existing approaches to unknown languages for which no written form or transcription data is available.

On the other hand, the speech community has organized efforts to push toward developing unsupervised, data-driven systems that are less reliant on linguistic expertise. Zero-resource modeling is an approach in which completely unsupervised techniques learn the elements of a language's speech hierarchy solely from untranscribed audio data. This means that only spoken audio data are available in a specific language, while transcriptions, annotations, and prior knowledge about it are all unavailable. The "Zero Resource Speech Challenge" series [9, 10, 11] was constructed to progress incrementally toward a system that learns an end-to-end spoken dialog (SD) system in an unknown language from scratch, using only information available to a language-learning infant. The ZeroSpeech 2019 challenge [11] confronts the problem of constructing a speech synthesizer without any text or phonetic labels: TTS without T. It is a continuation of the subword unit discovery track of ZeroSpeech 2015 and 2017 [9, 10]. Nineteen systems were submitted, but few studies proposed end-to-end frameworks [12, 13, 14]. Among these proposed systems, the vector quantized variational autoencoder (VQ-VAE) approach provided better naturalness, based on the mean opinion score (MOS) of the generated speech and the character error rate after human transcription of the synthesized speech. Further details of the results are available at www.zerospeech.com/2019/results.html.

In this paper, we take a step beyond the current ZeroSpeech 2019 task and propose a method for training speech-to-speech translation models without any transcription or linguistic supervision. Instead of only discovering subword units and synthesizing them within a certain language, our approach discovers subword units that are directly translated to another language. Our proposed method consists of two steps: (1) we train and generate a discrete representation with unsupervised term discovery, also based on a discrete quantized autoencoder; (2) we train a sequence-to-sequence model to directly map the source language speech to the target language's discrete representation. Our proposed method can directly generate target speech without any auxiliary or pre-training steps involving source or target transcriptions. To the best of our knowledge, this is the first work to perform pure speech-to-speech translation between untranscribed unknown languages.

2 Unsupervised Unit Discovery with VQ-VAE

A speech signal can be disentangled into independent factors of variation, such as context and speaking style. In the speech domain, we assume the context has properties similar to phonemes or subwords, which are represented with a limited set of discrete symbols. Therefore, to capture the context without any supervision, we use a generative model named the vector quantized variational autoencoder (VQ-VAE) [15] to extract discrete symbols. There are several distinctions between a VQ-VAE and a standard autoencoder [16] or a standard variational autoencoder (VAE) [17]. The VQ-VAE encoder maps the input features to a limited number of discrete latent variables, whereas a standard VAE encoder maps the input features into continuous latent variables. Therefore, a VQ-VAE encoder has a many-to-one mapping, due to restricting the representation to the nearest codebook vector, while the standard VAE encoder has a one-to-one mapping between input and latent variables.

Figure 1: VQ-VAE for unsupervised unit discovery consists of several parts: an encoder, a decoder, codebooks, and an (optional) speaker embedding.
Figure 2: Building block inside VQ-VAE encoder and decoder: a) Encoder residual block and 1D convolution with stride 2 to downsample input sequence length; b) Decoder residual block and 1D transposed convolution with stride 2 to upsample codebook back to original input length.

We illustrate the VQ-VAE model in Fig. 1 and define $E = [e_1, \dots, e_K]$ as a collection of codebook vectors with $e_k \in \mathbb{R}^D$. During the encoding step, the input $x$ is a speech feature representation such as MFCC or mel-spectrogram, and the speaker identity of input $x$ is denoted by $s$. In Fig. 2, we show the details of the residual blocks inside the encoder and decoder modules. The encoder generates a discrete latent variable $c \in \{1, \dots, K\}$ ($c$ can also be represented as a one-hot vector). To transform a continuous representation into a discrete random variable, the encoder first produces an intermediate continuous representation $z$. Later, we find which codebook vector in $E$ has the minimum distance to $z$. Mathematically, we formulate the operation:

$$ c = \operatorname*{argmin}_{k \in \{1,\dots,K\}} d(z, e_k), $$

where $d(\cdot,\cdot)$ is a function to calculate the distance between two vectors. In this paper, we define $d$ as the L2-norm distance.

After we find the closest codebook index $c$, we substitute the intermediate variable $z$ with the corresponding codebook vector $e_c$. To reconstruct the input data, the decoder reads the codebook vector $e_c$ and the speaker embedding of $s$ and generates the reconstruction $\hat{x}$.
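The quantization step above can be sketched with a small numpy toy. All sizes here are illustrative, not the paper's actual settings; this is only a sketch of the nearest-codebook lookup, assuming a codebook of K vectors and a single encoder output z.

```python
import numpy as np

K, D = 8, 4                      # codebook size, latent dimension (toy values)
rng = np.random.default_rng(0)
E = rng.normal(size=(K, D))      # codebook vectors e_1..e_K
z = rng.normal(size=(D,))        # intermediate continuous representation z

# Find the codebook index c with minimum L2 distance to z,
# then substitute z with the corresponding codebook vector e_c.
dists = np.linalg.norm(E - z, axis=1)
c = int(np.argmin(dists))        # discrete latent variable
e_c = E[c]                       # quantized vector fed to the decoder
```

In a full model, `e_c` (together with the speaker embedding) would be the decoder's input instead of `z`.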

The following is the learning objective for VQ-VAE:

$$ \mathcal{L} = -\log p(x \mid e_c, s) + \beta \, \| z - \mathrm{sg}(e_c) \|_2^2, $$

where the function $\mathrm{sg}(\cdot)$ stops the gradient, defined as:

$$ \mathrm{sg}(a) = a, \qquad \nabla_a \, \mathrm{sg}(a) = 0. $$

The first term is a negative log-likelihood that measures the reconstruction loss between the original input $x$ and the reconstruction $\hat{x}$ and optimizes the encoder and decoder parameters. The second term minimizes the distance between the intermediate representation $z$ and the nearest codebook vector $e_c$, but its gradient is only back-propagated into the encoder parameters as a commitment loss. This commitment loss can be scaled with an additional hyperparameter $\beta$. To update the codebook vectors, we use an exponential moving average (EMA) [18]. With the EMA update rule for training codebook $E$, the model achieves more stable results during training and avoids the posterior collapse issue [19].
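The EMA codebook update can be sketched as follows. For each codebook entry k, a smoothed count N_k and a smoothed sum M_k of the encoder outputs assigned to it are maintained, and the codebook vector is their ratio. The decay value 0.99 and all tensor sizes are illustrative assumptions, not necessarily the paper's settings.

```python
import numpy as np

gamma = 0.99                         # EMA decay (assumed, common default)
K, D, B = 4, 3, 16                   # codebook size, latent dim, batch size
rng = np.random.default_rng(1)
E = rng.normal(size=(K, D))          # codebook vectors
N = np.ones(K)                       # EMA cluster counts
M = E.copy()                         # EMA cluster sums

Z = rng.normal(size=(B, D))          # batch of encoder outputs z
# Assign each z to its nearest codebook vector (the argmin from Eq. above).
assign = np.argmin(((Z[:, None, :] - E[None]) ** 2).sum(-1), axis=1)

for k in range(K):
    n_k = (assign == k).sum()
    sum_k = Z[assign == k].sum(axis=0) if n_k else np.zeros(D)
    N[k] = gamma * N[k] + (1 - gamma) * n_k      # smoothed count
    M[k] = gamma * M[k] + (1 - gamma) * sum_k    # smoothed sum
E = M / N[:, None]                   # updated codebook vectors e_k = M_k / N_k
```

Because the codebook is updated by these moving averages rather than by gradients, no explicit codebook loss term appears in the objective above.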

3 Sequence-to-Sequence from Speech to Codebook

Our speech-to-speech translation model is built on an attentional sequence-to-sequence (seq2seq) framework [20, 21]. Assume a paired source sequence $x = [x_1, \dots, x_S]$ and target sequence $y = [y_1, \dots, y_T]$. A sequence-to-sequence model directly learns the mapping $p_\theta(y \mid x)$, parameterized by $\theta$. In this paper, we specify $x$ to represent speech features such as MFCC or mel-spectrogram and $y$ to represent codebook indices. Fig. 3 illustrates a seq2seq model with an attention mechanism.

Figure 3: Sequence-to-sequence model with attention mechanism. Here the encoder input is the speech features $x$, and the decoder predicts a codebook index for each time-step.

Inside a seq2seq model, there are three different components:

  1. The encoder module reads all the speech features $x$ and represents them with encoded states $h^e = \mathrm{Enc}(x)$, where $h^e = [h^e_1, \dots, h^e_S]$.

  2. The attention module assists the decoder in finding which part of the encoder states contains information relevant to the current decoding state [20]. Given the decoder state $h^d_t$, the attention module generates attention weights $a_t$ and a context vector $c_t$:

$$ a_t(s) = \frac{\exp\big(\mathrm{Score}(h^e_s, h^d_t)\big)}{\sum_{s'=1}^{S} \exp\big(\mathrm{Score}(h^e_{s'}, h^d_t)\big)}, \qquad c_t = \sum_{s=1}^{S} a_t(s)\, h^e_s, $$

    where the function $\mathrm{Score}(\cdot,\cdot)$ predicts the relevancy between the encoder and decoder states. Many scoring functions exist, including dot-product [22], MLP [20], or modified MLP with history [23].

  3. The decoder module predicts the class probability $p(y_t \mid c_t, y_{<t}, h^d_t)$ over $K$ different classes (depending on the codebook size), given the context $c_t$, the previous information $y_{<t}$, and the current decoder state $h^d_t$.
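The attention step in item 2 can be sketched with the dot-product scoring variant. The sequence length and hidden size below are toy values, and the dot-product score is one of the several options the text lists, not necessarily the one used in the paper's experiments.

```python
import numpy as np

def softmax(v):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(v - v.max())
    return e / e.sum()

S, H = 5, 8                          # source length, hidden size (toy values)
rng = np.random.default_rng(2)
h_e = rng.normal(size=(S, H))        # encoder states h^e_1..h^e_S
h_d = rng.normal(size=(H,))          # current decoder state h^d_t

scores = h_e @ h_d                   # Score(h^e_s, h^d_t), dot-product variant
a = softmax(scores)                  # attention probabilities over source frames
ctx = a @ h_e                        # context vector c_t = sum_s a_t(s) h^e_s
```

The context `ctx` is then concatenated with the decoder state to predict the next codebook index, as in item 3.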

To train a seq2seq model, the most common objective is to minimize the negative log-likelihood of the correct class:

$$ \mathcal{L}_{s2s} = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(y_t \mid c_t, y_{<t}, h^d_t), $$

where $p_\theta(y_t \mid \cdot)$ is the predicted probability of the correct class $y_t$ at time-step $t$.
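A minimal sketch of this objective, assuming random toy decoder logits and reference codebook indices (the sizes are illustrative only):

```python
import numpy as np

T, K = 6, 32                         # target length, codebook size (toy values)
rng = np.random.default_rng(3)
logits = rng.normal(size=(T, K))     # decoder outputs per time-step
y = rng.integers(0, K, size=T)       # reference codebook indices

# Log-softmax over classes, then average NLL of the correct class per step.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
nll = -log_probs[np.arange(T), y].mean()
```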

4 Codebook Inverter

A codebook inverter is a module that synthesizes the corresponding speech utterance from a sequence of codebook indices. Its input is a sequence of codebook embeddings $[e_{c_1}, \dots, e_{c_N}]$, and the output target is a sequence of speech representations (e.g., a linear magnitude spectrogram) $x^R = [x^R_1, \dots, x^R_T]$.

We illustrate our codebook inverter architecture in Fig. 4.

Figure 4: Codebook inverter: given a codebook sequence, we predict the corresponding linear magnitude spectrogram $x^R$. If the lengths of the codebook sequence and $x^R$ differ, we consecutively duplicate each codebook vector $r$ times.

Our codebook inverter is composed of several residual 1D blocks, followed by stacked bidirectional LSTMs [24], and finally another several residual 1D blocks. Fig. 5 shows the details inside the block.

Figure 5: Residual 1D block combines multiscale 1D convolution with different kernel size and “SAME” padding, LeakyReLU [25] activation function, and batch normalization [26].

Under certain circumstances, the codebook sequence length $N$ might be shorter than the spectrogram length $T$ because the VQ-VAE encoder has convolutions with a stride larger than 1. Therefore, to align the codebook sequence with the speech representation target sequence, we duplicate each codebook vector into $r$ copies side-by-side, where $r = T / N$. To train the codebook inverter, we set the objective function:

$$ \mathcal{L}_{inv} = \frac{1}{T} \sum_{t=1}^{T} \| \hat{x}^R_t - x^R_t \|_2^2, $$

to minimize the L2-norm between the predicted spectrogram $\hat{x}^R$ and the groundtruth spectrogram $x^R$, where $\hat{x}^R$ is the output of the inverter parameterized by $\phi$. In the inference stage, we used Griffin-Lim [27] to reconstruct the phase from the spectrogram and applied an inverse short-time Fourier transform (STFT) to invert it into a speech waveform.
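The side-by-side duplication used to align the two sequences amounts to a simple repeat. The index values and the factor r = 4 below are toy assumptions for illustration:

```python
import numpy as np

r = 4                                # time-reduction factor (e.g., 4, 8, or 12)
codes = np.array([3, 7, 7, 1])       # codebook index sequence of length N = 4
upsampled = np.repeat(codes, r)      # length N * r, matches spectrogram frames
```

Each index now covers `r` consecutive spectrogram frames, so the inverter's input and target have equal length.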

5 Training and Inference

Figure 6: a) Train the VQ-VAE to represent continuous MFCC vectors with a codebook sequence, and train the codebook inverter to generate a linear magnitude spectrogram from the generated codebook sequence; b) train a seq2seq model from source language MFCC to the target language codebook; c) in the inference stage, the seq2seq model takes source language MFCC and predicts codebook sequences, and the codebook inverter then generates the target language speech representation.

In this section, we explain our proposed method step-by-step. To train our proposed model, we set up three different modules: a VQ-VAE (Section 2), a speech-to-codebook seq2seq model (Section 3), and a codebook inverter (Section 4). Fig. 6 shows which modules are trained in each step. Initially, we define $(x^s, x^t)$ as paired parallel speech, where $x^s$ denotes the MFCC features from the source language and $x^t$ the MFCC features from the target language. $c^t$ denotes the codebook sequence generated by the VQ-VAE encoder given $x^t$ as the input, and $\hat{x}^{R}$ is the predicted linear spectrogram of the target language. The losses for the three modules are calculated by the formulas in Eqs. 4, 10, and 9.

  1. First, we trained the VQ-VAE model on the target language MFCC $x^t$. We also trained the codebook inverter to predict the corresponding linear spectrogram $\hat{x}^{R}$.

  2. Second, we trained the seq2seq model from the source language speech to the target language codebook. Given paired parallel MFCCs from the source and target languages $(x^s, x^t)$, we extracted the codebook sequence $c^t$ from the VQ-VAE encoder. Later, we trained the seq2seq translation model to predict $\hat{c}^t$ and minimize the loss between $\hat{c}^t$ and $c^t$.

  3. In the inference step, given source language speech $x^s$, we decoded a target language codebook index sequence $\hat{c}^t$ and synthesized it into target language speech.

6 Experimental Setup

6.1 Dataset

In this paper, we ran our experiments on the Basic Travel Expression Corpus (BTEC) [28, 29], which covers several language pairs. We chose two tasks: French-to-English and Japanese-to-English. For both language pairs, we used the BTEC1 set, which consists of 162,318 training sentences and 510 test sentences. Since speech utterances for the sentences are unavailable, we generated them with the Google text-to-speech API for all language pairs. Despite the lack of a natural speech dataset in this paper, the VQ-VAE and codebook inverter are applicable to, and have shown great performance on, multispeaker natural speech [14, 13]. Some papers [30, 31, 32] also show that performance improvements on synthetic datasets can carry over to real datasets.

6.2 Speech Feature Extraction

For both the source and target language speech utterances, we represented the speech with 13-dimensional mel-frequency cepstral coefficients (MFCCs) (39 dimensions in total). For the target language speech utterances, we also generated a linear magnitude spectrogram with 1025 dimensions as the training target for the codebook inverter (Section 4). For each frame, we extracted the MFCCs and the linear magnitude spectrogram with a 25-millisecond window and 10-millisecond time-steps. We extracted both the MFCCs and the linear magnitude spectrogram with the Librosa [33] library.
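The framing described above can be sketched in plain numpy. The 16 kHz sampling rate and the FFT size of 2048 (which yields 1025-dimensional magnitude spectra) are assumptions for illustration, since the paper does not state them; the actual extraction used Librosa.

```python
import numpy as np

sr = 16000                           # assumed sampling rate
win = int(0.025 * sr)                # 25 ms window -> 400 samples
hop = int(0.010 * sr)                # 10 ms hop   -> 160 samples
n_fft = 2048                         # n_fft // 2 + 1 = 1025 frequency bins

rng = np.random.default_rng(4)
y = rng.normal(size=sr)              # one second of toy "audio"

# Slice into overlapping Hann-windowed frames, then take the magnitude
# of a zero-padded real FFT per frame: shape (num_frames, 1025).
frames = [y[i:i + win] * np.hanning(win)
          for i in range(0, len(y) - win + 1, hop)]
spec = np.abs(np.fft.rfft(np.stack(frames), n=n_fft))
```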

6.3 Evaluation

For an objective evaluation of the target speech utterances, currently no standard method exists to measure translation quality directly on speech. Therefore, we utilized an ASR pre-trained on the English BTEC dataset and evaluated the generated transcriptions. For the ASR architecture, the encoder module has three stacked Bi-LSTMs with 512 hidden units, and the decoder has one LSTM with 512 hidden units. For the attention module, we utilized MLP attention with multiscale location history [23]. For the output unit, we used word-level tokens from the English transcription. Because there is a performance gap between the ASR output and the ground truth caused by imperfect transcription, we assume the metric (calculated from the ASR transcription) is a lower bound for the corresponding translation model. We utilized two metrics to evaluate translation performance from the transcribed text: BLEU [34] and METEOR [35], computed with the Multeval toolkit [36]. Our pre-trained ASR model achieved a 2.84% WER, a 94.9 BLEU, and a 69.1 METEOR on English speech utterances from the BTEC test set, and we set those scores as the groundtruth topline.
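To illustrate how a BLEU-style metric scores a transcribed translation against a reference, here is a simplified sketch of the clipped unigram precision that underlies BLEU. Real BLEU combines modified precisions for n = 1..4 with a brevity penalty; this toy shows only the unigram case, using sentences from Table 3.

```python
from collections import Counter

def unigram_precision(hyp, ref):
    # Clip each hypothesis word's count by its count in the reference,
    # then divide by the hypothesis length (modified unigram precision).
    hyp_counts, ref_counts = Counter(hyp), Counter(ref)
    clipped = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
    return clipped / max(len(hyp), 1)

ref = "how long are you going to stay".split()
hyp = "how long are you staying".split()
p1 = unigram_precision(hyp, ref)     # 4 of 5 hypothesis words appear in ref
```

In practice we relied on the Multeval toolkit rather than any hand-rolled scorer; this sketch only conveys the intuition behind the metric.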

7 Results and Discussion

In this section, we present our experimental results, followed by a discussion.

7.1 Baseline

For the baseline translation task, we modified the Tacotron [3] model by changing the source input from a one-hot character embedding into a continuous vector. Essentially, we replaced the embedding layer in the encoder with a linear projection layer. Therefore, this model directly translated the source language MFCC to a target language mel-spectrogram. However, this approach did not converge at all and produced no audible speech. Jia et al. [8] also observed a similar result in a similar scenario.

7.2 Topline with Cascade ASR-TTS

In this paper, we set the topline performance by using a cascade of ASR and TTS systems. First, we trained the ASR system using the source language MFCC as the input and the target language character transcription as the output. Second, we trained a Tacotron-based TTS [3] to generate target language speech representations from target language characters.

7.3 French-to-English by Speech to Codebook

Table 1 shows our experimental results for various hyperparameters across different codebook sizes and time-reduction factors. Our best performance was produced by a codebook size of 64 and a time-reduction factor of 12, with a score of 25.0 BLEU and 23.2 METEOR.

Model                          Codebook size  Time reduction  BLEU  METEOR
Tacotron with MFCC input       -              -               -     -
Proposed Speech2Code           32             4               19.4  19.1
                               32             8               23.8  22.2
                               32             12              23.2  22.1
                               64             4               16.1  16.9
                               64             8               24.4  22.9
                               64             12              25.0  23.2
                               128            4               16.9  17.4
                               128            8               23.3  22.1
                               128            12              24.2  21.9
Topline (Cascade ASR -> TTS)   -              -               47.4  41.2
Table 1: Our experimental results on BTEC French-English speech-to-speech translation.

7.4 Japanese-to-English by Speech to Codebook

Table 2 shows our experimental results for various hyperparameters across different codebook sizes and time-reduction factors. Our best performance was produced by a codebook size of 128 and a time-reduction factor of 8, with a score of 15.3 BLEU and 15.3 METEOR.

Model                          Codebook size  Time reduction  BLEU  METEOR
Tacotron with MFCC input       -              -               -     -
Proposed Speech2Code           32             4               14.8  15.0
                               32             8               14.2  15.6
                               32             12              16.0  16.0
                               64             4               10.8  12.1
                               64             8               14.2  14.7
                               64             12              14.7  14.8
                               128            4               11.9  13.5
                               128            8               15.3  15.3
                               128            12              14.9  14.5
Topline (Cascade ASR -> TTS)   -              -               37.4  32.8
Table 2: Our experimental results on BTEC Japanese-English speech-to-speech translation.

7.5 Transcription Example and Discussion

In Table 3, we provide some transcription examples from the ground truth, our proposed speech-to-code model, and the topline cascade ASR-TTS model. In the first example, all models' translations convey a meaning similar to the ground truth. In the second example, all models still maintain semantics similar to the ground truth; however, compared to the topline, the speech-to-code model does not produce a translation for "as soon as he comes in". In the third example, our proposed method only translates the beginning of the sentence correctly and produces an incorrect result in the latter part. The missing and arbitrary transcriptions in the latter halves of utterances would be interesting to investigate in future work.

For further information and translation samples, readers can refer to:

Model              Transcription Result
Groundtruth        how long are you going to stay
Speech2Code FR-EN  how long are you going to stay
Speech2Code JA-EN  how long will it take
Topline FR-EN      how long are you staying
Topline JA-EN      how long are you staying

Groundtruth        please tell him to call me as soon as he comes in
Speech2Code FR-EN  please tell him to call me back
Speech2Code JA-EN  please tell him that i called
Topline FR-EN      please tell her to call me and check it
Topline JA-EN      please ask him to call me as soon as possible

Groundtruth        i would like a balcony seat please
Speech2Code FR-EN  i would like to have this film please
Speech2Code JA-EN  i would like a seat near the seat
Topline FR-EN      i would like a balcony seat please
Topline JA-EN      i would like a balcony seat

Table 3: Transcription examples comparing the ground truth, our proposed Speech2Code, and the topline (Cascade ASR-TTS) model.

8 Conclusion

In this paper, we proposed a novel approach for training speech-to-speech translation between two languages without any transcription. First, we trained a discrete quantized autoencoder to generate a discrete representation from the target speech features. Second, we trained a sequence-to-sequence model to predict the codebook sequence given the source speech representation. This method is applicable to any type of language, with or without a written form, because the target speech representations are trained and generated in an unsupervised manner. Based on our experimental results, our model can perform direct speech-to-speech translation on French-English and Japanese-English pairs.

9 Acknowledgments

Part of this work was supported by JSPS KAKENHI Grant Numbers JP17H06101 and JP17K00237.

