From Speech Chain to Multimodal Chain: Leveraging Cross-modal Data Augmentation for Semi-supervised Learning



The most common way for humans to communicate is by speech. However, a language system arguably cannot know what it is communicating about without a connection to the real world through image perception. In fact, humans perceive these multiple sources of information together to build a general concept. However, constructing a machine that can integrate these modalities in a supervised learning fashion is difficult, because it requires a parallel dataset across the speech, image, and text modalities, which is often unavailable. A machine speech chain based on sequence-to-sequence deep learning was previously proposed to achieve semi-supervised learning, enabling automatic speech recognition (ASR) and text-to-speech synthesis (TTS) to teach each other when they receive unpaired data. In this research, we take a further step by expanding the speech chain into a multimodal chain and design a closely knit chain architecture that connects ASR, TTS, image captioning (IC), and image retrieval (IR) models in a single framework. The ASR, TTS, IC, and IR components can be trained in a semi-supervised fashion by assisting each other given incomplete datasets and by leveraging cross-modal data augmentation within the chain.


Johanes Effendi, Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

Nara Institute of Science and Technology, Japan

RIKEN, Center for Advanced Intelligence Project AIP, Japan

{johanes.effendi.ix4, andros.tjandra.ai6, ssakti, s-nakamura}

Index Terms: speech recognition, text-to-speech, image captioning, image retrieval, multimodal chain

1 Introduction

Human communication is multisensory and involves several communication channels, including auditory and visual channels. Such multiple information sources are perceived together to build a general concept and understanding. Moreover, the sensory inputs from several modalities share complementary behavior to ensure a robust perception of the overall information.

Over the past decades, several studies have integrated audio and visual cues to improve speech recognition performance. Within recent deep learning frameworks, Petridis et al. [1] proposed one of the first end-to-end audiovisual speech recognition schemes. Another approach is the “Watch, Listen, Attend, and Spell (WLAS)” framework [2], an extension of the LAS framework [3] for speech recognition that utilizes a dual attention mechanism operating in three ways: over visual input only, audio input only, or both. Afouras et al. [4] also proposed deep audio-visual speech recognition models that recognize phrases and sentences spoken by a talking face, with or without audio.

Image processing research has recently pursued two main directions for utilizing different modalities (image and text) in image retrieval. The first accommodates both modalities within a multimodal embedding space to enable cross-modal image retrieval by text, such as Word2VisualVec [5], full-network embedding [6], and the multimodal CNN [7]. Another approach uses a generative adversarial network (GAN) that incrementally generates images based on model attention over captions [8].

However, constructing a machine that can integrate these modalities in a supervised learning fashion is complicated, because it requires a paired (input-output) dataset across the speech, image, and text modalities, which is often unavailable. Furthermore, most existing approaches handle the multimodal mechanism within a single system. In contrast, humans process different modalities with different organs (i.e., ears for listening, a mouth for speaking, eyes for seeing, etc.).

Previously, a machine speech chain was proposed [9, 10] to mimic human speech perception and production behaviors with a closed-loop speech chain mechanism (i.e., auditory feedback from a speaker’s mouth to her ears). The sequence-to-sequence model in the closed-loop architecture achieves semi-supervised learning by enabling ASR as a listening component and TTS as a speaking component to mutually improve their performances when they receive unpaired data.

In this research, we take a further step and expand the speech chain into a multimodal chain and design a closely knit chain architecture that combines ASR, TTS, image captioning (IC), and image retrieval (IR) models into one framework. In this way, the ASR, TTS, IC, and IR components can be trained in a semi-supervised fashion, because they can assist each other given incomplete datasets and leverage cross-modal data augmentation within the chain.

Figure 1: Architecture of (a) original speech chain framework [9], and (b) our proposed multimodal chain mechanism.

2 Multimodal Speech Chain Framework

Figure 1 illustrates (a) the original speech chain framework [9] and (b) our proposed multimodal chain mechanism. In this extension, we include image captioning and image retrieval models to incorporate the visual modality in the chain. The framework consists of dual loop mechanisms between the speech and visual chains that involve four learning components (ASR, TTS, IC, IR). In the speech chain, sequence-to-sequence ASR and sequence-to-sequence TTS models are jointly trained in a loop connection, and in the visual chain, neural image captioning and neural embedding-based image retrieval models are also jointly trained in a loop connection. Furthermore, the two chains (speech and visual components) can collaborate through the text modality.

The sequence-to-sequence model in closed-loop architecture allows us to train our entire model in a semi-supervised fashion by concatenating both the labeled and unlabeled data. To further clarify the learning process, we describe the mechanism based on the availability condition of the training data:

  1. Paired speech-text-image data exist: separately train ASR, TTS, IC, and IR (supervised learning)
    Given a complete multimodal dataset, we can set up the speech utterances and their corresponding text transcriptions to separately train both the ASR and TTS models in a supervised manner; the ASR and TTS losses are calculated directly with teacher-forcing. We can also set up the images and captions to separately train the IC and IR models with supervised learning. The IC model is trained with teacher-forcing on the reference caption, and the IR model is trained with a pairwise rank loss on the reference image and its contrastive sample.

  2. Unpaired speech, text, and image data exist: jointly train ASR & TTS in the speech chain and IC & IR in the visual chain (unsupervised learning)
    In this case, although speech, text, and image data are available, they are unpaired.

    1. Using speech data only: unrolled process ASR→TTS in the speech chain
      Here we only use the speech utterances; ASR generates a transcription for TTS to reconstruct back into speech. The reconstruction loss between the original and reconstructed speech is then calculated and used to update the model parameters.

    2. Using image data only: unrolled process IC→IR in the visual chain
      Using only the images, image captions are generated with the IC model. These captions are then used by the IR model to update its multimodal space with the pairwise rank loss.

    3. Using text data only: unrolled processes TTS→ASR in the speech chain and IR→IC in the visual chain
      Given only the text, TTS generates speech utterances for the ASR, which reconstructs them back into text so that the reconstruction loss between the original and reconstructed text can be calculated. In parallel, the captions retrieve images through the IR model, which are then reconstructed into text by the IC model, and the loss between the original and reconstructed captions is calculated.

  3. Single-modality data (either speech, text, or images) exist: train ASR & TTS jointly in the speech chain and IC & IR in the visual chain (unsupervised learning)
    In this case, only a single modality (speech, text, or images) is available, and the others are empty.

    1. Only text data exist: train the speech and visual chains, as in 2(c)
      If only text data are available, we can separately perform the unrolled process TTS→ASR in the speech chain and IR→IC in the visual chain.

    2. Only image data exist: visual chain → speech chain
      If only image data are available, we first perform the unrolled process IC→IR in the visual chain (see 2(b)). The generated image captions are then used to perform the unrolled process TTS→ASR in the speech chain (see 2(c)).

    3. Only speech data exist: speech chain → visual chain
      If only speech data are available, we first perform the unrolled process ASR→TTS in the speech chain (see 2(a)). The generated text transcriptions are then used to perform IR→IC in the visual chain (see 2(c)).
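The training schedule above can be sketched as a dispatcher that selects an update rule based on which modalities a batch provides. This is a simplified sketch: the model objects and their interfaces (`supervised_loss`, `decode`, `retrieve`) are hypothetical stand-ins, not our exact implementation.

```python
def train_step(batch, asr, tts, ic, ir):
    """Dispatch one update based on the modalities present in the batch."""
    losses = {}
    speech, text, image = batch.get("speech"), batch.get("text"), batch.get("image")

    if speech is not None and text is not None and image is not None:
        # Case 1: fully paired data -> supervised training of all four models.
        losses["asr"] = asr.supervised_loss(speech, text)
        losses["tts"] = tts.supervised_loss(text, speech)
        losses["ic"] = ic.supervised_loss(image, text)
        losses["ir"] = ir.supervised_loss(text, image)
    elif speech is not None:
        # Cases 2(a)/3(c): speech only -> unrolled ASR->TTS reconstruction.
        hyp_text = asr.decode(speech)
        losses["tts"] = tts.supervised_loss(hyp_text, speech)
    elif image is not None:
        # Cases 2(b)/3(b): image only -> unrolled IC->IR reconstruction.
        hyp_caption = ic.decode(image)
        losses["ir"] = ir.supervised_loss(hyp_caption, image)
    elif text is not None:
        # Cases 2(c)/3(a): text only -> unrolled TTS->ASR and IR->IC.
        hyp_speech = tts.decode(text)
        losses["asr"] = asr.supervised_loss(hyp_speech, text)
        hyp_image = ir.retrieve(text)
        losses["ic"] = ic.supervised_loss(hyp_image, text)
    return losses
```

In practice each loss would be backpropagated through its own model only; the dispatcher simply encodes which loop is unrolled for a given data condition.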

We combine all of the losses and update the ASR and TTS models, along with the IC and IR models:

    L_SC = α · L_paired^SC + β · L_unpaired^SC
    L_VC = γ · L_paired^VC + δ · L_unpaired^VC

which results in the losses L_SC and L_VC for the speech and visual chains, respectively. Parameters α, β, γ, and δ are hyper-parameters that scale the loss between the supervised (paired) and unsupervised (unpaired) losses in each chain.
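The combination is a simple weighted sum per chain, which can be sketched as follows; the numeric loss and weight values below are placeholders for illustration, not our actual settings.

```python
def combine_chain_loss(l_paired, l_unpaired, w_paired, w_unpaired):
    """Scale the supervised (paired) and unsupervised (unpaired) losses."""
    return w_paired * l_paired + w_unpaired * l_unpaired

# Speech chain: L_SC = alpha * paired + beta * unpaired
l_sc = combine_chain_loss(1.2, 0.8, w_paired=0.5, w_unpaired=0.5)
# Visual chain: L_VC = gamma * paired + delta * unpaired
l_vc = combine_chain_loss(0.9, 1.1, w_paired=0.5, w_unpaired=0.5)
```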

3 Multimodal Chain Components

In this section, we briefly describe all of the components inside the multimodal chain framework.

3.1 Sequence-to-sequence ASR

We use a sequence-to-sequence ASR model with attention, with an architecture similar to the one used in [9], which is in turn based on the LAS framework [3]. It directly models the conditional probability of a transcription given the speech features. For the speech features, MFCCs or a mel-spectrogram are encoded by a bidirectional LSTM encoder. The hidden representations are then attended to by an LSTM or GRU decoder that decodes a sequence of characters or phonemes.
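The attention step inside such a decoder can be sketched numerically: the current decoder state is scored against every encoder frame, and the resulting alignment probabilities form a context vector. The dimensions and the additive (Bahdanau-style) scoring function below are illustrative choices, not our exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, enc_dim, dec_dim, att_dim = 50, 8, 6, 5    # frames and hidden sizes (toy)

h = rng.normal(size=(T, enc_dim))             # encoder states (speech frames)
s = rng.normal(size=dec_dim)                  # current decoder state
W_h = rng.normal(size=(att_dim, enc_dim))     # projection of encoder states
W_s = rng.normal(size=(att_dim, dec_dim))     # projection of decoder state
v = rng.normal(size=att_dim)                  # scoring vector

# One additive attention score per encoder frame, then a softmax.
scores = np.tanh(h @ W_h.T + W_s @ s) @ v
a = np.exp(scores - scores.max())
a /= a.sum()                                  # alignment probabilities
context = a @ h                               # weighted sum of encoder states
```

The context vector is then concatenated with the decoder state to predict the next character or phoneme.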

3.2 Sequence-to-sequence TTS

A sequence-to-sequence TTS is a parametric TTS that generates a sequence of speech features from a transcription. We also used an architecture similar to the one in [9], which is based on Tacotron [11] with slight modifications. Tacotron produces a mel-spectrogram given the text utterance, which is further transformed into a linear spectrogram so that the speech signal can be reconstructed using the Griffin-Lim algorithm [12].
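As a toy illustration of the Griffin-Lim step, the sketch below reconstructs a waveform from a magnitude spectrogram using a minimal NumPy STFT/ISTFT pair. The window, hop, and iteration count are illustrative choices, and this simplified transform is a stand-in for a production STFT, not our actual pipeline.

```python
import numpy as np

def stft(x, win, hop):
    """Hann-windowed short-time Fourier transform (one frame per row)."""
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(spec, win, hop):
    """Overlap-add inverse with squared-window normalization."""
    w = np.hanning(win)
    n = hop * (len(spec) - 1) + win
    x, norm = np.zeros(n), np.zeros(n)
    for k, frame_spec in enumerate(spec):
        frame = np.fft.irfft(frame_spec, n=win)
        x[k * hop:k * hop + win] += frame * w
        norm[k * hop:k * hop + win] += w ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, win=64, hop=16, n_iter=30):
    """Iteratively estimate a phase consistent with the given magnitudes."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        x = istft(magnitude * phase, win, hop)
        phase = np.exp(1j * np.angle(stft(x, win, hop)))
    return istft(magnitude * phase, win, hop)
```

Each iteration alternates between enforcing the target magnitudes and projecting onto the set of signals with a consistent STFT, which is the core of the algorithm in [12].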

3.3 Image Captioning

Figure 2: Caption model

An image captioning model receives an image and learns to produce a caption. We utilized an architecture similar to [13], where the image is encoded by a series of convolutional neural network layers, resulting in a high-level feature representation over a number of regions that represent parts of the image. During decoding, an attention mechanism grounds each decoded word in the correlated image region by calculating alignment probabilities over the decoder states. Unlike Xu et al.'s model, we use ResNet [14] instead of VGG [15].

3.4 Image Retrieval

Figure 3: Retrieval model

Neural IR models [5, 6, 7, 16] are implemented by realizing a multimodal embedding between an image and its caption. The image embedding is usually extracted by a series of pretrained convolutional neural network layers followed by a linear transformation, while a recurrent neural network encoder generates the caption embedding. These two embedding representations are then trained with a pairwise rank loss function that combines them into a single multimodal embedding space.

As shown in Eq. (7), this procedure reduces the squared distance between each image i and its related caption c, and increases its distance from the unrelated caption c'. The margin M is used to separate already-similar pairs, providing space to optimize the hard-positive examples:

    L_IR = Σ max(0, M + d(i, c) − d(i, c'))    (7)

where d is the squared distance in the multimodal embedding space.
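A minimal sketch of this loss on toy 2-D embeddings follows; the margin value and embedding vectors are illustrative, not our trained representations.

```python
import numpy as np

def pairwise_rank_loss(img, cap_pos, cap_neg, margin=0.2):
    """Hinge on squared distances: max(0, M + d(i, c) - d(i, c'))."""
    d_pos = np.sum((img - cap_pos) ** 2)    # distance to the related caption
    d_neg = np.sum((img - cap_neg) ** 2)    # distance to the contrastive sample
    return float(max(0.0, margin + d_pos - d_neg))

i = np.array([1.0, 0.0])                    # image embedding (toy)
c_related = np.array([0.9, 0.1])            # nearby related caption
c_unrelated = np.array([-1.0, 1.0])         # distant contrastive caption
loss = pairwise_rank_loss(i, c_related, c_unrelated)
```

When the contrastive caption is already farther away than the margin, the loss is zero and no gradient flows, which is what focuses training on the hard examples.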
4 Experimental Set-Up

4.1 Corpus Dataset

Dataset                       | #Amount | Type
Paired speech-text-image      | 2,000   | 1 (paired)
Unpaired speech, text, image  | 7,000   | 2 (unpaired)
Speech only                   | 10,000  | 3 (unpaired)
Image only                    | 10,000  | 3 (unpaired)
Table 1: Training data partition for Flickr30k

In this study, we used the Flickr30k dataset [17] in our experiments. It contains 31,014 photos of everyday activities, events, and scenes. As in other image captioning datasets, each image has five captions. However, since we use this dataset not only for captioning but also for retrieval, we need to maintain a balance between the source and the target. For image captioning, we use one caption per image to keep the learning target consistent and avoid one-to-many confusion. Conversely, for image retrieval we used all five captions, because there the learning target is already consistent. To train the speech counterpart of our proposed architecture, we generated speech from the Flickr30k captions using Google TTS.

To show the capability of our model for semi-supervised learning, we formulated our dataset in four parts. The first part was used to train each model in a supervised manner (Type 1). Next, the unpaired dataset was used to separately train the speech chain and the visual chain on each correlated modality pair (Type 2). The last two parts are assumed to be single-modality corpora that contain only speech and only images, without any transcriptions or captions. By decoding the image-only data into captions and the speech-only data into transcriptions, we can use the generated data to further improve each model in a semi-supervised manner (Type 3). Without our proposed architecture, these monomodal data could not be used, because their modality is completely unrelated to the chain of the other modality pair. For more details, see Section 2.
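The four-way partition in Table 1 can be sketched as a simple split over example indices; the index lists below stand in for the real Flickr30k examples.

```python
def partition(indices, sizes=(2000, 7000, 10000, 10000)):
    """Split example indices into paired, unpaired, speech-only, image-only."""
    parts, start = [], 0
    for n in sizes:
        parts.append(indices[start:start + n])
        start += n
    return dict(zip(["paired", "unpaired", "speech_only", "image_only"], parts))

splits = partition(list(range(29000)))  # 29k training examples in total
```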

4.2 Model Details

We set the loss-scaling hyper-parameters α, β, γ, and δ to fixed values for the speech and visual chains. We used the Adam optimizer with a learning rate of 1e-3 for the ASR model, 2.5e-4 for the TTS model, and 1e-4 for the IC model. For the IR model, we used stochastic gradient descent with a learning rate of 0.1.

In the image chain, we implemented the IC and IR models as described in Sections 3.3 and 3.4. For the convolutional part that extracts the image features, we used a ResNet [14] pretrained on the ImageNet task [18]. In the IC model, we removed the last two layers of the ResNet, which yields a 14x14x512 hidden representation that the decoder can attend to. For the IR model, we removed the final classification layer, giving a 2048-dimensional hidden representation that can be regarded as the image representation. These representations are then linearly transformed into 300-dimensional image embeddings. The text embeddings were generated by a single-layer bidirectional LSTM with a hidden size of 256 in each direction.

Transcriptions in the speech chain are decoded using beam search with a beam size of three. Similarly, during the visual chain operation, the IC model produces its hypotheses using beam search with the same beam size. To simulate sampling in the IR hypotheses, we randomly sampled one hypothesis from five candidates.
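The beam-search decoding used by both the ASR and IC models can be sketched as below; `next_probs` is a hypothetical stand-in for a model's next-token distribution given a decoded prefix, not our actual decoder.

```python
import math

def beam_search(next_probs, eos, beam_size=3, max_len=10):
    """Keep the `beam_size` best partial hypotheses at each decoding step."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, score))  # keep finished hypotheses
                continue
            for tok, p in next_probs(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]

# A tiny deterministic "language model" over four possible prefixes.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"</s>": 1.0},
    ("b",): {"c": 1.0},
    ("b", "c"): {"</s>": 1.0},
}
best = beam_search(lambda seq: TABLE[tuple(seq)], eos="</s>")
```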

5 Experiment Results

Data WER(%) L2-norm
Baseline: ASR & TTS
(Supervised learning - Type 1)
2k* 81.31 0.874
Proposed: speech chain ASR→TTS and TTS→ASR
(Semi-supervised learning - Type 2(a)&2(c))
+7k 10.60 0.714
Proposed: visual chain → speech chain
(Semi-supervised learning - Type 3(b)&3(a))
+10k 7.97 0.645
Topline: ASR & TTS separately
(Supervised learning - Full Data)
29k 2.37 0.398
Table 2: ASR and TTS performance
footnotetext: *We trained our baseline model with only 2k paired examples to simulate the real-world condition in which paired datasets are usually small, and to show that our chain can semi-supervisedly improve a very poor initial model with as little data as possible.
Data BLEU-1 R@10 med r
Baseline: IC & IR
(Supervised learning - Type 1)
2k* 33.91 26.88 34
Proposed: visual chain IC→IR and IR→IC
(Semi-supervised learning - Type 2(b)&2(c))
+7k 42.11 28.14 31
Proposed: speech chain → visual chain
(Semi-supervised learning - Type 3(c)&3(a))
+10k 43.08 28.44 30
Topline: IC & IR separately
(Supervised learning - Full data)
29k 59.63 62.42 5
Table 3: IC and IR performance

Table 2 shows the ASR and TTS results for the scenarios in Section 2. First, we trained both models in a supervised manner on 2k paired data, as shown in the first block. This model serves as the foundation for the speech chain. We continued training in the speech chain using 7k unpaired data and achieved 10.60% WER and an L2-norm of 0.714. Finally, using the IC model that was trained semi-supervisedly through Types 2(b)&2(c), we decoded the image-only dataset, which enables it to be used in the speech chain. In this way, we achieved about 2.6% absolute WER improvement over the original speech chain [9], which was trained using only the speech and text datasets. This result shows that cross-modal data augmentation from the image modality into the speech chain correlates positively with model quality. Our proposed strategy improved ASR and TTS, even without any additional speech or text data, with the help of the visual chain.

On the other hand, Table 3 shows the IC and IR results in similar scenarios, with the improvement coming from the speech chain. First, we trained on the 2k paired data and obtained the baseline scores shown in the first block. Next, we trained the IC and IR models semi-supervisedly within the visual chain mechanism, producing an 8.2-point BLEU-1 improvement, a 1.26-point recall-at-10 (R@10) improvement, and a 3-point improvement in the median r metric. Finally, the third block shows that the visual chain can also be improved using speech data with the help of the speech chain: there is about a 1-point improvement in BLEU-1 for IC (higher is better) and in median r for IR (lower is better). This result also implies that with our proposed learning strategy, the IC and IR models can be improved even without additional image or text datasets. Thus, the approach works not only from the image modality to the speech modality, but also in reverse.

In the last block of Tables 2 and 3, we list the topline of each model. This shows that in the fully supervised scenario, our models perform comparably to published results in each field (speech and image processing).

6 Related Works

Approaches that learn from source-to-target and vice versa, as well as through feedback links, remain challenging. He et al. [19] and Cheng et al. [20] recently published work that addresses a mechanism called dual learning in neural machine translation (NMT). Their systems have dual tasks: source-to-target language translation (primal) versus target-to-source language translation (dual), which lets them leverage monolingual data to improve neural machine translation. The central idea is to reconstruct monolingual corpora with an autoencoder in which the source-to-target and target-to-source translation models serve as the encoder and decoder, respectively.

In image processing, several methods have also been proposed to achieve unsupervised joint distribution matching without any paired data, including DiscoGAN [21], CycleGAN [22], and DualGAN [23]. These frameworks learn to translate an image from a source domain to a target domain without paired examples by using a cycle-consistent adversarial network. An implementation for voice conversion applications has also been investigated [24]. However, most of these approaches only work when the source and target share the same domain.

The speech chain framework [9, 10] may be the first framework constructed on different domains (speech versus text). This mechanism, which also mirrors human speech perception and production behaviors, utilizes a primal model (ASR) that transcribes text given speech and a dual model (TTS) that synthesizes speech given text. Recently, to take advantage of the duality between image and text, Huang et al. proposed a turbo learning approach that implements a turbo butterfly architecture for the joint training of image captioning and image generation [25].

In our work, we constructed the first framework that accommodates a triangle of modalities (speech, text, and image) and addresses the problems of speech-to-text, text-to-speech, text-to-image, and image-to-text. Furthermore, our work mimics the mechanism of the entire human communication system, with both auditory and visual sensors.

7 Conclusion

We described a novel approach for cross-modal data augmentation that upgrades a speech chain into a multimodal chain. We proposed a visual chain in which IC and IR models are jointly trained in a loop connection, so that they can learn semi-supervisedly from an unpaired image-text dataset. We then improved the speech chain using an image-only dataset, bridged by our visual chain, and vice versa. We showed that each model in the two chains can assist the others given an incomplete dataset by leveraging data augmentation among modalities. In the future, we will jointly train both the speech and visual chains so that they can be updated together.

8 Acknowledgements

Part of this work was supported by JSPS KAKENHI Grant Numbers JP17H06101 and JP17K00237.


  • [1] S. Petridis, Y. Wang, Z. Li, and M. Pantic, “End-to-end audiovisual fusion with LSTMs,” in Auditory-Visual Speech Processing (AVSP 2017), Stockholm, Sweden, 2017, pp. 36–40.
  • [2] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Lip reading sentences in the wild,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 3444–3453.
  • [3] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964.
  • [4] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep audio-visual speech recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [5] J. Dong, X. Li, and C. G. M. Snoek, “Word2VisualVec: Cross-media retrieval by visual feature prediction,” CoRR, vol. abs/1604.06838, 2016.
  • [6] A. Vilalta, D. Garcia-Gasulla, F. Parés, E. Ayguadé, J. Labarta, E. U. Moya-Sánchez, and U. Cortés, “Studying the impact of the full-network embedding on multimodal pipelines,” Semantic Web, pp. 1–15.
  • [7] L. Ma, Z. Lu, L. Shang, and H. Li, “Multimodal convolutional neural networks for matching image and sentence,” in 2015 IEEE International Conference on Computer Vision (ICCV). IEEE Computer Society, 2015.
  • [8] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks,” 2018.
  • [9] A. Tjandra, S. Sakti, and S. Nakamura, “Listening while speaking: Speech chain by deep learning,” CoRR, vol. abs/1707.04879, 2017.
  • [10] A. Tjandra, S. Sakti, and S. Nakamura, “Machine speech chain with one-shot speaker adaptation,” in Interspeech 2018, Hyderabad, India, 2018, pp. 887–891.
  • [11] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” 2017.
  • [12] D. W. Griffin and J. S. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, April 1984.
  • [13] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning, 2015, pp. 2048–2057.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [15] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [16] I. Calixto and Q. Liu, “Sentence-level multilingual multi-modal embedding for natural language processing,” in Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2017). Varna, Bulgaria: INCOMA Ltd., Sep. 2017, pp. 139–148.
  • [17] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV). Washington, DC, USA: IEEE Computer Society, 2015, pp. 2641–2649.
  • [18] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li, “ImageNet large scale visual recognition challenge,” CoRR, vol. abs/1409.0575, 2014.
  • [19] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T.-Y. Liu, and W.-Y. Ma, “Dual learning for machine translation,” in Advances in Neural Information Processing Systems, 2016, pp. 820–828.
  • [20] Y. Cheng, Z. Tu, F. Meng, J. Zhai, and Y. Liu, “Towards robust neural machine translation,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 1756–1766.
  • [21] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, “Learning to discover cross-domain relations with generative adversarial networks,” in Proceedings of the 34th International Conference on Machine Learning, vol. 70. Sydney, Australia: PMLR, 2017, pp. 1857–1865.
  • [22] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
  • [23] Z. Yi, H. R. Zhang, P. Tan, and M. Gong, “DualGAN: Unsupervised dual learning for image-to-image translation,” in 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017, pp. 2868–2876.
  • [24] K. Tanaka, T. Kaneko, N. Hojo, and H. Kameoka, “Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks,” in 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 2018, pp. 632–639.
  • [25] Q. Huang, P. Zhang, D. O. Wu, and L. Zhang, “Turbo learning for captionbot and drawingbot,” CoRR, vol. abs/1805.08170, 2018.