Mockingjay: Unsupervised Speech Representation Learning With Deep Bidirectional Transformer Encoders


We present Mockingjay, a new speech representation learning approach in which bidirectional Transformer encoders are pre-trained on a large amount of unlabeled speech. Previous speech representation methods learn by conditioning on past frames and predicting information about future frames, whereas Mockingjay is designed to predict the current frame by jointly conditioning on both past and future contexts. The Mockingjay representation improves performance on a wide range of downstream tasks, including phoneme classification, speaker recognition, and sentiment classification on spoken content, while outperforming other approaches. Mockingjay is empirically powerful and can be fine-tuned with downstream models; with only 2 epochs of fine-tuning we further improve performance dramatically. In a low-resource setting with only 0.1% of the labeled data, we outperform Mel-features trained with all 100% of the labeled data.


Andy T. Liu, Shu-wen Yang, Po-Han Chi, Po-chun Hsu, Hung-yi Lee
National Taiwan University
College of Electrical Engineering and Computer Science
{r07942089, r08944041, r08942074, r07942095, hungyilee}

Keywords: speech representation learning, unsupervised training, transformer encoders, low resource

1 Introduction

The goal of speech representation learning is to find a transform of speech that makes high-level information more accessible to SLP (Speech and Language Processing) downstream tasks, as speech signals possess a rich set of acoustic and linguistic content, including phonemes, words, semantic meanings, tone, speaker characteristics, and even sentiment information. In this paper, we propose Mockingjay to learn speech representations through unsupervised training without the use of any labels. We use multi-layer Transformer encoders and multi-head self-attention [25] to achieve bidirectional encoding; this framework allows our model to consider past and future contexts at the same time. To achieve unsupervised pre-training for speech representations, Mockingjay learns under the proposed Masked Acoustic Model (MAM) task. During training, masked frames are given as input, and the model learns to reconstruct and predict the original frames. Hence the name Mockingjay, after a bird that mimics sound. The proposed framework is illustrated in Figure 1.

Figure 1: The proposed Masked Acoustic Model pre-training task; 15% of the input frames are masked to zero at random.

1.1 Related work

Unsupervised speech representation learning [5, 6, 8, 24, 7, 20, 2, 1, 13] is effective in extracting high-level properties from speech. SLP downstream tasks can be improved through speech representations because surface features such as log Mel-spectrograms or waveforms reveal the abundant information within speech only poorly.

Contrastive Predictive Coding (CPC) [24] and wav2vec [20] use a multi-layer CNN to encode past context; representations are learned by predicting the future in latent space under a contrastive binary classification task. Autoregressive Predictive Coding (APC) [7] uses autoregressive models to encode temporal information of the past acoustic sequence; the model predicts future frames like an RNN-based language model [16] and is optimized with a reconstruction loss. Unidirectional models are commonly used in these previous approaches [5, 6, 8, 24, 7, 20]. However, this constraint on model architecture limits the potential of speech representation learning.

Figure 2: The proposed Mockingjay training framework.

The recently proposed vq-wav2vec [2] approach attempts to apply the well-performing Natural Language Processing (NLP) algorithm BERT [9] to continuous speech. Input speech is discretized to a K-way quantized embedding space, so that continuous speech can act like discrete units similar to word tokens in NLP tasks. In vq-wav2vec [2], an exhaustive two-stage training pipeline with massive computing resources is required to adapt speech to the NLP algorithm, as the quantization process goes against the continuous nature of speech. Unlike [2], which adapts speech to BERT [9] through quantization, the proposed approach can be seen as a modified version of BERT [9] for direct application to continuous speech.

1.2 Proposed Method

Unlike previous left-to-right unidirectional approaches that only consider past sequences to predict information about future frames, the proposed method allows us to train a bidirectional speech representation model, alleviating the unidirectionality constraint of previous methods. As a result, the Mockingjay model obtains substantial improvements in several SLP tasks. Moreover, whereas previous approaches restrict the power of the pre-trained models to representation extraction only [24, 7, 20, 2], the proposed method can be fine-tuned easily on downstream tasks. We show that fine-tuning for only 2 epochs yields significant improvement.

The proposed approach outperforms other representations and features. Compared to the commonly used log Mel-features, it improves phoneme classification accuracy by 35.2% (absolute), speaker recognition accuracy by 28.0% (absolute), and sentiment discrimination accuracy by 6.4% (absolute) on a spoken content dataset unseen during pre-training. We also experiment in low-resource settings to show that Mockingjay is capable of improving supervised training in real-life low-resource scenarios. With only 0.36 hours (0.1%) of transcribed speech, the proposed approach outperforms Mel-features with 360 hours (100%) of labels.

2 Mockingjay

In this section, we first introduce the model architecture and its design, then explain the proposed unsupervised context prediction task, and finally explain how the proposed model is used with downstream task models.

2.1 Model Architecture

We use a multi-layer Transformer encoder with multi-head self-attention for left-and-right bidirectional encoding; this architecture is illustrated in Figure 2. Each encoder layer has two sub-layers: the first is a multi-head self-attention network, and the second is a feed-forward layer. Each sub-layer has a residual connection followed by layer normalization [3], based on the design described in [25]. All encoder layers in the model, as well as their sub-layers, produce outputs of an identical dimension, the hidden dimension. Figure 2 also marks the feed-forward size, the number of self-attention heads, and the total number of Transformer layers. The Mockingjay representations are extracted from the Transformer encoders’ hidden states; we explain how they are used as representations in Section 2.3.

Since Transformer encoders contain no recurrence and no convolution, we use positional encoding to make our model aware of the input sequence order [25]. As direct addition of acoustic features to positional encoding may lead to potential training failure [21], the input frames are first projected linearly to the hidden dimension before being added to the positional encoding [19]. We use sinusoidal positional encoding instead of learnable positional embeddings [10] because acoustic features can be arbitrarily long with high variance [19]. We apply downsampling on input features to adapt our model to long sequences. To reduce the sequence length by the downsample factor, we use the reshape technique from [21, 19], stacking that many consecutive frames into one step.
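
The two input-side operations described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `stack_frames` and `sinusoidal_positions` are hypothetical, dropping trailing frames that do not fill a complete step is an assumption, and the positional encoding follows the sinusoidal formula of [25] with an even dimension assumed.

```python
import math

def stack_frames(frames, factor):
    """Stack `factor` consecutive frames into one step.

    Assumption: trailing frames that do not fill a full step are dropped.
    """
    usable = len(frames) - len(frames) % factor
    return [
        [value for frame in frames[i:i + factor] for value in frame]
        for i in range(0, usable, factor)
    ]

def sinusoidal_positions(length, dim):
    """Sinusoidal positional encoding from [25]; `dim` is assumed to be even."""
    table = []
    for pos in range(length):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

# Toy input: 7 frames of 2-dimensional features, downsample factor 3.
frames = [[float(t), float(t) + 0.5] for t in range(7)]
stacked = stack_frames(frames, 3)
positions = sinusoidal_positions(len(stacked), 4)
print(len(stacked), len(stacked[0]))  # 2 6
```

In the model itself, the stacked frames would first pass through the linear projection to the hidden dimension, and the positional table would be built with that same dimension before the two are added.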

2.2 Masked Acoustic Modeling

We propose the Masked Acoustic Modeling task, where we randomly select 15% of the input frames, and the model predicts the selected frames based on their left and right context, as illustrated in Figure 1. During training, we add a prediction head consisting of two layers of feed-forward network with layer normalization, using the output of the last encoder layer as its input. We use L1 loss to minimize the reconstruction error between the predicted and ground-truth frames on the selected 15%. The prediction head is not used once the model is trained.
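
As a minimal sketch of the reconstruction objective, the L1 loss can be restricted to the selected frames as follows. The helper name `masked_l1_loss` is hypothetical, and averaging over the selected values (rather than summing) is an assumption; the paper does not state the reduction.

```python
def masked_l1_loss(predicted, target, selected):
    """Mean absolute error over the frames whose index is in `selected`.

    Assumption: the loss is averaged over all values of the selected frames.
    """
    total, count = 0.0, 0
    for t in selected:
        for p, y in zip(predicted[t], target[t]):
            total += abs(p - y)
            count += 1
    return total / count if count else 0.0

# Three frames with two features each; only frame 1 was selected for prediction.
target = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
predicted = [[1.0, 2.0], [2.0, 6.0], [0.0, 0.0]]
loss = masked_l1_loss(predicted, target, selected=[1])
print(loss)  # 1.5
```

Frames outside the selected set, such as frame 2 in the toy data, do not contribute to the loss even when the prediction there is poor.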

During training, for the selected 15% of frames, we 1) mask them all to zero 80% of the time, 2) replace them all with a random frame 10% of the time, and 3) leave the frames untouched 10% of the time (see footnote 1). We introduce this sub-random process instead of always masking the frames to alleviate the mismatch between training and inference, as masked frames do not appear at inference time. Note that in contrast to BERT [9], where the sub-random process is performed token-wise on the i-th chosen token, our sub-random process is performed utterance-wise. In other words, in case 3) our model receives the ground-truth frames as input 10% of the time, rather than always having some of the inputs augmented as in [9].
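
The utterance-wise sub-random process can be sketched as below. This is an illustration under stated assumptions, not the authors' code: `apply_masking` is a hypothetical name, and drawing each replacement frame from the same utterance is an assumption, since the text only says "a random frame".

```python
import random

def apply_masking(frames, selected, rng):
    """Apply one action, drawn once per utterance, to all selected frames."""
    out = [list(frame) for frame in frames]
    draw = rng.random()
    if draw < 0.8:
        action = "zero"      # 80%: mask the selected frames to zero
        for t in selected:
            out[t] = [0.0] * len(frames[t])
    elif draw < 0.9:
        action = "random"    # 10%: replace with random frames (assumed same utterance)
        for t in selected:
            out[t] = list(frames[rng.randrange(len(frames))])
    else:
        action = "keep"      # 10%: leave the ground-truth frames untouched
    return out, action

rng = random.Random(0)
frames = [[float(t + 1), float(t + 2)] for t in range(20)]
selected = rng.sample(range(len(frames)), k=round(0.15 * len(frames)))
masked, action = apply_masking(frames, selected, rng)
print(action, sorted(selected))
```

Because the action is drawn once per utterance, the "keep" case yields an input that is entirely ground truth, which is the behaviour the paragraph above contrasts with token-wise BERT masking.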

To prevent the model from exploiting the local smoothness of acoustic frames, we propose additional consecutive masking, where we mask consecutive frames to zero. The model is thus required to infer from global structure rather than local information. We also use dynamic masking [14], where masking patterns are sampled from a uniform distribution every time we feed a sequence to the model, unlike the static mask employed in [9], where masking is performed during data preprocessing. We use only a single context prediction task to train our representation model, as suggested by [14], unlike BERT [9] and ALBERT [12], which need two tasks to train their language models. In our preliminary experiments, we found that the sentence prediction task used in [9, 12] is not helpful, as additional tasks can potentially harm training behavior. We do not include details due to space limitations.
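
One possible way to combine consecutive masking with dynamic sampling is sketched below. The paper does not specify how the runs are positioned, so this sketch makes simplifying assumptions: the hypothetical helper `sample_consecutive_mask` splits the sequence into aligned blocks of `span` frames and samples whole blocks uniformly, which guarantees non-overlapping runs.

```python
import random

def sample_consecutive_mask(num_frames, span, ratio, rng):
    """Pick roughly `ratio` of the frames as non-overlapping runs of `span` frames.

    Assumption: runs are aligned to a grid of blocks of length `span`.
    """
    num_blocks = num_frames // span
    num_chosen = max(1, round(ratio * num_frames / span))
    num_chosen = min(num_chosen, num_blocks)
    chosen_blocks = rng.sample(range(num_blocks), k=num_chosen)
    return sorted(b * span + offset for b in chosen_blocks for offset in range(span))

# Dynamic masking: a fresh pattern is sampled each time the sequence is fed.
rng = random.Random(0)
first = sample_consecutive_mask(num_frames=140, span=7, ratio=0.15, rng=rng)
second = sample_consecutive_mask(num_frames=140, span=7, ratio=0.15, rng=rng)
print(len(first), len(second))  # 21 21
```

Calling the sampler inside the training loop, rather than once during preprocessing, is what distinguishes dynamic masking from the static mask of [9].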

2.3 Incorporating with Downstream Tasks

Mockingjay representations are essentially the Transformer encoder hidden states. There are many ways to incorporate learned representations into downstream tasks. In this work, we mainly extract representations from the last layer. However, we also expose the deep internals of Mockingjay to downstream models by using a mixture of representations from all layers, similar to the ELMo [18] approach. In other words, we use a learnable weighted sum to integrate hidden states from all layers. Last but not least, a pre-trained Mockingjay model can be fine-tuned with downstream models to further improve results; in this case, we update the pre-trained Mockingjay together with randomly initialized downstream task models.
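
A learnable weighted sum over layers can be sketched as follows. The function name `weighted_layer_sum` is hypothetical, and normalizing the per-layer scalars with a softmax (as in ELMo [18]) is an assumption; during training, the scalars would be updated together with the downstream model.

```python
import math

def weighted_layer_sum(hidden_states, layer_logits):
    """Combine per-layer hidden states with softmax-normalized scalar weights.

    hidden_states: list over layers of [time][feature] values.
    layer_logits: one learnable scalar per layer (assumed softmax-normalized).
    """
    peak = max(layer_logits)
    exps = [math.exp(w - peak) for w in layer_logits]
    weights = [e / sum(exps) for e in exps]
    num_steps, dim = len(hidden_states[0]), len(hidden_states[0][0])
    return [
        [sum(w * layer[t][d] for w, layer in zip(weights, hidden_states))
         for d in range(dim)]
        for t in range(num_steps)
    ]

# Three layers, two time steps, two features per step; equal logits give a plain average.
hidden_states = [
    [[1.0, 0.0], [2.0, 0.0]],
    [[0.0, 1.0], [0.0, 2.0]],
    [[1.0, 1.0], [2.0, 2.0]],
]
mixed = weighted_layer_sum(hidden_states, layer_logits=[0.0, 0.0, 0.0])
print(mixed)
```

Extracting from the last layer only corresponds to the limiting case where all of the weight sits on the final layer.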

3 Implementation

In this work, we use two types of features as our model’s output reconstruction target: the Mel-scale spectrogram and the linear-scale spectrogram. As the Mel-scale spectrogram is a more concise acoustic feature than the linear-scale spectrogram, we propose two model settings: BASE and LARGE. Both models take Mel-features as input and transform them into high-level representations. They share a hidden dimension size of 768, a feed-forward size of 3072, and 12 attention heads, and differ in the number of layers, the downsample factor, and the consecutive masking number; these differences are listed in Table 1. We further analyze their differences in our experiment section.

Setting                      BASE    LARGE
Target                       Mel     Linear
Number of layers             3       12
Downsample factor            1       3
Consecutive masking number   7       3
Parameters                   21.4M   85.4M
Table 1: The proposed BASE and LARGE model settings.

The proposed Mockingjay models are pre-trained on the LibriSpeech [17] corpus train-clean-360 subset. We use Adam [11], where the learning rate is warmed up over the first 7% of 500k total training steps to a peak value of 4e-4 and then linearly decayed. A dropout [22] of 0.1 is applied on all layers and attention weights. For downstream task fine-tuning, most of the hyperparameters are the same as in pre-training, except that the learning rate is 4e-3 and the number of training epochs is set to 2 (approximately 50k steps). We train with a batch size of 6 using a single 1080Ti GPU. We provide pre-trained models with our implementation; they are publicly available for reproducibility.
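
The pre-training learning-rate schedule can be written out as a small function. This is a sketch under assumptions: the name `learning_rate` is hypothetical, the warm-up is taken to be linear from zero, and the decay is taken to end at zero on the final step; the text states only the warm-up fraction, the peak value, and that the decay is linear.

```python
def learning_rate(step, total_steps=500_000, warmup_fraction=0.07, peak=4e-4):
    """Warm-up to `peak`, then linear decay (assumed linear warm-up, decay to zero)."""
    warmup_steps = round(total_steps * warmup_fraction)  # 35,000 steps
    if step < warmup_steps:
        return peak * step / warmup_steps
    remaining = total_steps - warmup_steps
    return peak * max(0.0, (total_steps - step) / remaining)

for step in (0, 17_500, 35_000, 267_500, 500_000):
    print(step, learning_rate(step))
```

With these defaults the peak of 4e-4 is reached at step 35,000, which is 7% of the 500k training steps.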

4 Experiment

Following previous works [5, 6, 8, 24, 7, 20, 2], we evaluate different features and representations on downstream tasks, including: phoneme classification, speaker recognition, and sentiment classification on spoken content. For a fair comparison, each downstream task uses an identical model architecture and hyperparameters despite different input features.

We report results from 5 of our settings: 1) BASE and 2) LARGE, where Mockingjay representations are extracted from the last encoder layer; 3) BASE-FT2, where we fine-tune BASE with randomly initialized downstream models for 2 epochs; 4) BASE-FT500, where we fine-tune for 500k steps; and 5) LARGE-WS, where we incorporate hidden states from all encoder layers of the LARGE model through a learnable weighted sum. We did not fine-tune the LARGE model, as it is meant for extracting representations. Empirically, we found that even with supervised training, a randomly initialized Mockingjay model followed by any downstream model is hard to train from scratch. This shows that the proposed pre-training is essentially indispensable.

4.1 Comparing with other representations

The proposed approaches are mainly compared with APC [7] representations, as they are also evaluated on phone classification and speaker verification. As reported in [7], the APC approach outperformed CPC representations [24, 20, 1] in both tasks, which makes APC suitable as a strong baseline. APC uses a unidirectional autoregressive model. We compare the proposed approach with APC to show that our bidirectional approach has advantages in speech representation learning. For a fair comparison, we pre-train APC using the official implementation with the reported ideal parameters and settings, but expand the model’s hidden size to 768 to match ours. We also report results on 160-dimensional log Mel-features, which helps evaluate the accessibility of speech information from regular acoustic features.

Figure 3: Comparing representations by phone classification accuracy across different amounts of transcribed data.

4.2 Phoneme Classification

To measure the accessibility of phonetic information, we train linear phone classifiers using Mel-features, APC, and Mockingjay representations from the LibriSpeech train-clean-360 subset. We obtain force-aligned phoneme sequences with the Montreal Forced Aligner [15], where there are 72 possible phone classes. Testing results on the LibriSpeech test-clean subset are presented in Figure 3. In the case where all 360 hours of labels are used to train the classifier, BASE and LARGE representations improve accuracy by 11.8% and 15.2% over Mel-features. The BASE-FT2 model outperforms all representations after 2 epochs of fine-tuning, with 10.2% and 35.2% absolute improvement over APC and Mel-features, respectively. We observe that fine-tuning for 2 epochs is enough to reveal our method’s potential, as there is only a small gap (3.9%) between BASE-FT2 and BASE-FT500. Furthermore, LARGE-WS improves over LARGE, as expected.

To demonstrate how pre-training on speech can improve supervised training in resource-constrained scenarios where human labels are scarce, we train the classifier with reduced amounts of training data. The performance variation of the different methods is plotted in Figure 3, where we measure over various intervals of constrained training data to observe the performance drop. Both BASE and LARGE increase accuracy over Mel-features across various amounts of transcribed data, whereas the APC approach performs well with the full resource but fails to generalize to limited amounts of labeled data. In the case where only 0.36 hours of data are available, we improve accuracy by 22.7% (an absolute improvement over Mel-features). Note that with only 0.36 hours (0.1%) of labels available, BASE-FT2 (57.9% acc) even outperforms Mel-features (49.1% acc) trained with all 360 hours (100%) of labeled data. We conclude that pre-training Mockingjay on speech substantially improves performance on supervised tasks that require human annotations.

4.3 Speaker Recognition

To demonstrate that the proposed approach performs consistently across SLP downstream tasks, we report speaker recognition results on the LibriSpeech 100 hour selected subset, where the train/test split is performed randomly with a 9:1 ratio, and there are 63 possible speakers. We train a simple one-layer RNN classifier for speaker recognition using the different representations; results are listed in Table 2 for comparison. The proposed BASE and LARGE representations outperform both APC and Mel-features. BASE-FT2 further improves upon BASE and achieves the highest accuracy, while LARGE-WS also outperforms LARGE.

4.4 Sentiment Classification on Spoken Content

To demonstrate the domain-invariant transferability of the proposed representation across different datasets, the Mockingjay model is pre-trained on LibriSpeech and applied to the MOSEI [4] dataset. We again use a simple one-layer RNN classifier, where the model is trained to extract linguistic meaning from speech and discriminate between sentiments. The results listed in Table 2 lead to the same conclusion as in the speaker recognition task discussed above, except that for sentiment classification LARGE-WS achieves the highest score without the need for fine-tuning, demonstrating that a deeper model has great potential for extracting general speech representations. To conclude this section, we claim that the proposed representations are general and can be used on datasets from various unseen domains.

Methods        Speaker (acc)   Sentiment (acc)
Mel-Features   70.06           64.63
APC            85.88           65.97
BASE           94.54           67.38
BASE-FT2       98.05           68.45
LARGE          96.26           70.07
LARGE-WS       96.40           71.05
Table 2: Comparing different methods on different tasks.
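
As a quick arithmetic check, the absolute improvements over Mel-features can be recomputed from the Table 2 entries. The snippet below only restates the table; the variable names are illustrative.

```python
# (speaker accuracy, sentiment accuracy) as listed in Table 2.
accuracy = {
    "Mel-Features": (70.06, 64.63),
    "APC": (85.88, 65.97),
    "BASE": (94.54, 67.38),
    "BASE-FT2": (98.05, 68.45),
    "LARGE": (96.26, 70.07),
    "LARGE-WS": (96.40, 71.05),
}

baseline = accuracy["Mel-Features"]
gains = {
    name: (round(speaker - baseline[0], 2), round(sentiment - baseline[1], 2))
    for name, (speaker, sentiment) in accuracy.items()
}
print(gains["BASE-FT2"][0], gains["LARGE-WS"][1])  # 27.99 6.42
```

These values appear to correspond to the 28.0% (speaker recognition) and 6.4% (sentiment) absolute improvements quoted in Section 1.2, taking the best entry in each column.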

5 Conclusion

The proposed representation contains a variety of knowledge, including but not limited to phonetic, speaker, and sentiment information. We improve performance for a wide range of downstream tasks, and show promising results in low resource settings, as the learned speech representations are robust and can be transferred to different tasks across different datasets. In future work, we will investigate and deploy Mockingjay representations on more downstream SLP tasks, including ASR, voice conversion, and speech translation.


Footnote 1: This process is similar to the Cloze task [23] and the Masked Language Model task from BERT [9], but we mask frames of speech to zero instead of using the MASK token.

References

  1. Anonymous authors (2020) Unsupervised learning of efficient and robust speech representations. In ICLR 2020 Conference Blind Submission.
  2. Anonymous authors (2020) Vq-wav2vec: self-supervised learning of discrete speech representations. In ICLR 2020 Conference Blind Submission.
  3. J. L. Ba, J. R. Kiros and G. E. Hinton (2016) Layer normalization. arXiv:1607.06450.
  4. A. Bagher Zadeh, P. P. Liang, S. Poria, E. Cambria and L. Morency (2018-07) Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2236–2246.
  5. J. Chorowski, R. J. Weiss, S. Bengio and A. van den Oord (2019-12) Unsupervised speech representation learning using wavenet autoencoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (12), pp. 2041–2053.
  6. Y. Chung and J. Glass (2018-09) Speech2Vec: a sequence-to-sequence framework for learning word embeddings from speech. Interspeech 2018.
  7. Y. Chung, W. Hsu, H. Tang and J. Glass (2019-09) An unsupervised autoregressive model for speech representation learning. Interspeech.
  8. Y. Chung, C. Wu, C. Shen, H. Lee and L. Lee (2016) Audio word2vec: unsupervised learning of audio segment representations using sequence-to-sequence autoencoder. arXiv:1603.00982.
  9. J. Devlin, M. Chang, K. Lee and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
  10. J. Gehring, M. Auli, D. Grangier, D. Yarats and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. arXiv:1705.03122.
  11. D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv:1412.6980.
  12. Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma and R. Soricut (2019) ALBERT: a lite BERT for self-supervised learning of language representations. arXiv:1909.11942.
  13. A. T. Liu, P. Hsu and H. Lee (2019) Unsupervised end-to-end learning of discrete linguistic units for voice conversion. arXiv:1905.11563.
  14. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692.
  15. M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner and M. Sonderegger (2017) Montreal forced aligner: trainable text-speech alignment using Kaldi.
  16. T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ and S. Khudanpur (2010) Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association.
  17. V. Panayotov, G. Chen, D. Povey and S. Khudanpur (2015-04) Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210.
  18. M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv:1802.05365.
  19. N. Pham, T. Nguyen, J. Niehues, M. Muller and A. Waibel (2019) Very deep self-attention networks for end-to-end speech recognition. arXiv preprint arXiv:1904.13377.
  20. S. Schneider, A. Baevski, R. Collobert and M. Auli (2019-09) Wav2vec: unsupervised pre-training for speech recognition. Interspeech.
  21. M. Sperber, J. Niehues, G. Neubig, S. Stüker and A. Waibel (2018-09) Self-attentional acoustic models. Interspeech 2018.
  22. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958.
  23. W. L. Taylor (1953) “Cloze procedure”: a new tool for measuring readability. Journalism Bulletin 30 (4), pp. 415–433.
  24. A. van den Oord, Y. Li and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv:1807.03748.
  25. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin (2017) Attention is all you need. arXiv:1706.03762.