Improved Speech Reconstruction from Silent Video
Speechreading is the task of inferring phonetic information from visually observed articulatory facial movements, and is a notoriously difficult task for humans to perform. In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible and natural-sounding acoustic speech signal from silent video frames of a speaking person. We train our model on speakers from the GRID and TCD-TIMIT datasets, and evaluate the quality and intelligibility of reconstructed speech using common objective measurements. We show that speech predictions from the proposed model attain scores which indicate significantly improved quality over existing models. In addition, we show promising results towards reconstructing speech from an unconstrained dictionary.
1 Introduction
Human speech is inherently an articulatory-to-auditory mapping in which mouth, vocal tract and facial movements produce an audible acoustic signal containing phonetic units of speech (phonemes), which together comprise words and sentences. Speechreading (commonly called lipreading) is the task of inferring phonetic information from these facial movements by visually observing them. Since speech is the primary method of human communication, people who are deaf or have hearing loss find that speechreading can help overcome many of the barriers they face when communicating with others. However, since several phonemes often correspond to a single viseme (visual unit of speech), it is a notoriously difficult task for humans to perform.
We believe that machine speechreading may be best approached using the same form of articulatory-to-acoustic mapping that creates the natural audio signal, even though not all relevant information is available visually (e.g. vocal cord and most tongue movement). In addition to the perceptual sense this approach makes, modeling the task as an acoustic regression problem has several advantages over visual-to-textual or classification modeling: (1) an acoustic speech signal contains information which is often difficult or impossible to express in text, such as emotion, prosody and word emphasis; (2) this form of cross-modal regression, in which one modality is used to generate another, can be trained using “natural supervision”, which leverages the natural synchronization in a video of a talking person, so that recorded video frames and recorded sound do not require any segmentation or labeling; (3) by regressing very short units of speech, we can learn to reconstruct words comprised of these units which were not “seen” during training. While the last point is also achievable by classifying short visual units into their corresponding phonemes or characters, in practice generating labeled training data for that task is difficult.
Several applications come to mind for automatic video-to-speech systems: enabling videoconferencing from within a noisy environment; facilitating conversation at a party with loud music between people having wearable cameras and earpieces; and possibly even using surveillance video as a long-range listening device. In another paper we have successfully used the generated sound, together with the original noisy sound, for speech enhancement and separation.
Our technical approach builds on recent work in neural networks for speechreading and speech synthesis, which we extend to the problem of generating natural sounding speech from silent video frames of a speaking person. To the best of our knowledge, there has been relatively little work for reconstructing high quality speech using an unconstrained dictionary. Our work is also closely related to efforts to extract textual information from a video of a person speaking, i.e. the visual-to-textual problem, as the output of our model can potentially also be used as input to a speech-to-text model.
In this paper, we: (1) present and compare multiple CNN-based encoder-decoder models that predict the speech audio signal of a silent video of a person speaking, and significantly improve both intelligibility and quality of speech reconstructions of existing models; (2) show significant progress towards reconstructing words from an unconstrained dictionary.
2 Related work
Much work has been done in the area of automating speechreading by computers [32, 29, 44]. There are two main approaches to this task. The first, and the one most widely attempted in the past, consists of modeling speechreading as a visual-to-textual mapping. In this approach, the input video is manually segmented into short clips which contain either whole words from a predefined dictionary, or parts of words comprising phonemes, visemes or characters. Then, visual features are extracted from the frames and fed to a classifier. Assael et al., Chung et al. and others [42, 39, 33] have all recently shown state-of-the-art word and sentence-level classification results using neural network-based models.
The second approach, and the one used in this work, is to model speechreading as a visual-to-acoustic mapping problem in which the “label” of each short video segment is a corresponding feature vector representing the audio signal. Kello and Plaut  and Hueber and Bailly  attempted this approach using various sensors to record mouth movements, whereas Cornu and Milner  used active appearance model (AAM) visual features as input to a recurrent neural network.
Our approach is closely related to recent speaker-dependent video-to-speech work by Ephrat and Peleg, in which a convolutional neural network (CNN) is trained to map raw pixels of a speaker’s face directly to audio features, which are subsequently converted into an intelligible waveform. The differences between our approach and theirs can be broken down into two parts: improvement of the encoder and redesign of the decoder.
The goal of our encoder modification is to improve analysis of facial movements. It consists of a preprocessing step which registers the face to a canonical pose, the addition of an optical flow branch, and the replacement of the VGG-based architecture with a ResNet-based one. Our decoder is designed to remedy a major flaw in that earlier model, namely the unnatural sound of the reconstructed speech it produces. To this end, we use the sound representation and post-processing network of Tacotron, which introduces longer-range dependency into the final speech reconstruction, resulting in smoother, higher quality speech. Section 4 expounds on the above differences, and Section 6.1 contains a comparison of the earlier model’s results to ours.
Our work also builds upon recent work in neural sound synthesis using predicted spectrogram magnitude, including Tacotron (Wang et al.)  for speech synthesis, and the baseline model of NSynth (Engel et al.) for music synthesis. While Tacotron focuses on building a single-speaker text-to-speech (TTS) system, our paper focuses on building a single-speaker video-to-speech system.
This work complements and improves upon previous efforts in a number of ways. Firstly, we explore how to better analyze the visual input, i.e. silent video frames, in order to produce an encoding which can be subsequently decoded into speech features. Secondly, while prior work has predicted only the output corresponding to a single video frame, we jointly generate audio features for a sequence of frames, as depicted in Figure 1, which improves the smoothness of the resulting audio. Thirdly, the work of Ephrat and Peleg focused on maximizing intelligibility at the expense of natural-sounding speech on a relatively limited-vocabulary dataset. We aim to overcome the challenges of the more complex TCD-TIMIT dataset, while optimizing for both intelligibility and natural-sounding speech.
3 Data representation
3.1 Visual representation
Our goal is to reconstruct a single audio representation vector which corresponds to the duration of a single video frame . However, instantaneous lip movements such as those in isolated video frames can be significantly disambiguated by using a temporal neighborhood as context. Therefore, the encoder module of our model takes two inputs: a clip of consecutive grayscale video frames, and a “clip” of consecutive dense optical flow fields corresponding to the motion in directions for pixels of consecutive grayscale frames.
Each clip is registered to a canonical frame of reference. We start by detecting facial points (two eyes, nose, and two corners of the mouth). We use the points on the eyes to compute a similarity transform alignment between each frame and the central frame of the clip. Following , we then crop the speaker’s full face to a size of pixels, and we use the entire face region rather than using only the region of the mouth. This results in an input volume of size scalars. The second input, dense optical flow, adds an additional volume of scalar inputs. It has been shown that optical flow can improve the performance of neural networks when combined with raw pixel values for a variety of applications [36, 12], and it has even been successfully used as a stand-alone network input. Optical flow is beneficial in this case as well, as we show later.
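The eye-based registration step can be illustrated with a short sketch. This is not the authors' code: the function names and the example coordinates are hypothetical. It only shows one way to obtain a similarity transform (rotation, uniform scale and translation) from two point correspondences, by treating 2-D points as complex numbers.

```python
import numpy as np

def similarity_from_two_points(src, dst):
    """Return a 2x3 matrix mapping the two src points onto the two dst points.

    Only rotation, uniform scale and translation are allowed (a similarity).
    Hypothetical helper, written for illustration.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    # A similarity in the plane is z -> a * z + b over the complex numbers.
    s = src[:, 0] + 1j * src[:, 1]
    d = dst[:, 0] + 1j * dst[:, 1]
    a = (d[1] - d[0]) / (s[1] - s[0])  # rotation and scale
    b = d[0] - a * s[0]                # translation
    return np.array([[a.real, -a.imag, b.real],
                     [a.imag,  a.real, b.imag]])

def apply_transform(M, pts):
    """Apply a 2x3 affine matrix to an (N, 2) array of points."""
    pts = np.asarray(pts, dtype=float)
    return pts @ M[:, :2].T + M[:, 2]

# Made-up eye locations (x, y) in some frame and in the central frame of the clip.
frame_eyes = [(30.0, 40.0), (70.0, 44.0)]
central_eyes = [(32.0, 42.0), (72.0, 42.0)]
M = similarity_from_two_points(frame_eyes, central_eyes)
aligned_eyes = apply_transform(M, frame_eyes)
```

The resulting 2x3 matrix has the layout expected by common image-warping routines, so each frame of a clip could be warped with its own matrix before cropping.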
3.2 Speech representation
The challenge of finding a suitable representation for an acoustic speech signal which can be estimated by a neural network on one hand, and synthesized back into intelligible audio on the other, is not trivial. Use of raw waveform as network output was ruled out for lack of a suitable loss function with which to train the network.
Line Spectrum Pairs (LSP)  are a representation of Linear Predictive Coding (LPC) coefficients which are more stable and robust to quantization and small deviations. LSPs are therefore useful for speech coding and transmission over a channel, and were used by  as output from their video-to-speech model. However, without the original excitation, the reconstruction using unvoiced excitation (random noise) results in somewhat intelligible, albeit robotic and unnatural sounding speech.
Given the above, we sought to use a representation which retains speech information vital for an accurate reconstruction into waveform. We experiment with both spectrogram magnitude and a reduced-dimensionality mel-scale spectrogram as our regression target, which can subsequently be transformed back into waveform by using a phase reconstruction algorithm such as Griffin-Lim.
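To make the phase reconstruction step concrete, the following is a minimal NumPy sketch of Griffin-Lim-style iteration. It is an illustration under stated assumptions rather than the implementation used in this work: the helper names, the Hann window, and the `n_fft`, `hop` and `n_iter` parameters are all choices made for the example.

```python
import numpy as np

def stft(x, n_fft, hop):
    """Short-time Fourier transform with a Hann window, shape (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft, hop):
    """Least-squares inverse STFT (windowed overlap-add, normalized)."""
    window = np.hanning(n_fft)
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        sl = slice(i * hop, i * hop + n_fft)
        out[sl] += frames[i] * window
        norm[sl] += window ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_fft, hop, n_iter=50, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    # Start from random phase, then alternate between the time domain and
    # the STFT domain, keeping the target magnitude fixed at every step.
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        signal = istft(magnitude * phase, n_fft, hop)
        phase = np.exp(1j * np.angle(stft(signal, n_fft, hop)))
    return istft(magnitude * phase, n_fft, hop)
```

Given a predicted linear-scale magnitude spectrogram `mag` of shape (frames, n_fft // 2 + 1), a call such as `griffin_lim(mag, n_fft=256, hop=64)` returns a time-domain signal. Any log compression or normalization applied to the network target would have to be undone first.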
4 Model architecture
At a high level, as shown in Figure 2, our model comprises an encoder-decoder architecture which takes silent video frames of a speaking person as input, and outputs a sound representation corresponding to the speech spoken during the input.
It is important to note that our proposed approach is speaker-dependent, i.e. a new model needs to be trained for each new speaker. Achieving speaker-independent speech reconstruction is a non-trivial task, and is out of the scope of this work.
4.1 Encoder
The encoder module of our model consists of a dual-tower residual neural network (ResNet) which takes the aforementioned video clip and its optical flow as inputs and encodes them into a latent vector representing the visual features. Each of the inputs is processed with a column of residual block stacks. Each tower comprises ten consecutive blocks consisting of kernels. Following the last layer, a global spatial average () is performed, after which the two towers are concatenated into one -neuron layer which is essentially a latent representation of our visual features.
4.2 Decoder and post-processing
The latent vector is fed into a series of two fully connected layers with neurons each. The last layer of our CNN is of size , where is the number of audio windows predicted given a single video clip, and corresponds to the size of the sound representation vectors we wish to predict. For example, an output of coefficients of Mel frequency with and window size of , results in an dimensional output vector. The output of our CNN is fed into the post-processing network used by , consisting of one CBHG module, which is described as a powerful module for extracting representations from sequences. The CBHG module comprises several convolutional, Highway  and Bidirectional GRU  layers whose goal is to take several consecutive sound representations as input and output a higher temporal resolution version. The input clips are then packed in mini-batches of consecutive samples. As in the implementation of , the post processing network takes these consecutive mel-scale spectrogram vectors as input, and outputs a consecutive linear-scale spectrogram. The entire model is trained end-to-end with a two-term loss, one on the decoder output and one on the output of the post-processing network. Although the model is trained end-to-end, we keep all convolutional layers of the CNN frozen during training of the second network.
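The shape of the decoder output and the two-term loss can be sketched as follows. The sizes `T` and `D` below are hypothetical placeholders, not the values used in this work, and the helper names are illustrative. The equal weighting of the two MSE terms follows the description in Section 5.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration (not the paper's values).
T = 4    # audio windows predicted per video clip
D = 80   # coefficients per sound representation vector

def reshape_decoder_output(flat_output):
    """Split flat decoder outputs of length T * D into T vectors of size D."""
    flat_output = np.asarray(flat_output, dtype=float)
    return flat_output.reshape(-1, T, D)

def mse(pred, target):
    """Mean squared error between two arrays of the same shape."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))

def two_term_loss(mel_pred, mel_true, lin_pred, lin_true,
                  w_decoder=1.0, w_postnet=1.0):
    """One term on the decoder (mel) output, one on the post-net (linear) output."""
    return w_decoder * mse(mel_pred, mel_true) + w_postnet * mse(lin_pred, lin_true)
```

With these placeholder sizes, a batch of two flat decoder outputs of length `T * D = 320` becomes an array of shape `(2, 4, 80)`, i.e. four consecutive mel-scale vectors per clip that the post-processing network can consume as a short sequence.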
4.3 Generating a waveform
We consider several methods for generating a waveform from our model’s predicted mel and linear scale sound features. The first is the spectrogram synthesis approach used by [43, 10] in which the Griffin-Lim algorithm is used to reconstruct the phase of the predicted linear-scale spectrogram. Inverse STFT is then used to convert the complex spectrogram back into waveform. We found that the result is intelligible and smooth, albeit somewhat unnatural and robotic sounding.
Therefore, we also consider an example-based synthesis method similar to the one used by , in which we replace predicted sound features with their closest exemplar in the training set. We search for the nearest neighbor to both mel-scale and linear-scale predicted features, as measured by distance, and replace it with the neighbor’s corresponding linear-scale spectrogram feature. The full spectrogram is then converted into waveform using the procedure described above. We find that in most cases, mel-scale gives better results, which are more natural-sounding, but less smooth than using the predicted linear spectrogram itself.
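The example-based synthesis step amounts to a nearest-neighbor lookup, which can be sketched as below. The distance is assumed here to be Euclidean (L2), and the function and argument names are hypothetical. The sketch covers only the mel-scale variant: predicted mel vectors are matched against training mel vectors and replaced by the linear-scale vectors paired with the matches.

```python
import numpy as np

def example_based_synthesis(pred_mel, train_mel, train_linear):
    """Replace each predicted vector with its nearest training exemplar.

    pred_mel:     (n_pred, d_mel) predicted mel-scale vectors
    train_mel:    (n_train, d_mel) mel-scale vectors from the training set
    train_linear: (n_train, d_lin) linear-scale vectors paired with train_mel
    Returns the selected linear-scale vectors and the exemplar indices.
    """
    pred_mel = np.asarray(pred_mel, dtype=float)
    train_mel = np.asarray(train_mel, dtype=float)
    train_linear = np.asarray(train_linear, dtype=float)
    # Squared Euclidean distances between all predictions and all exemplars,
    # expanded as |p|^2 - 2 p.t + |t|^2 to avoid an explicit double loop.
    d2 = (np.sum(pred_mel ** 2, axis=1)[:, None]
          - 2.0 * pred_mel @ train_mel.T
          + np.sum(train_mel ** 2, axis=1)[None, :])
    nearest = np.argmin(d2, axis=1)
    return train_linear[nearest], nearest
```

Because every output vector is copied from real training audio, the result tends to sound natural. Neighboring frames are chosen independently, however, which is consistent with the reduced smoothness noted above.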
5 Model details
We use the method of , with the code provided from their website, to detect facial features. The speaker’s face is cropped to pixels. Using frames as input worked best. This results in an input volume of size scalars, from which we subtract the mean. We use the method of  with python wrapper provided by  to compute an optical flow vector for every image pixel. Optical flow is not normalized as its mean is approximately zero, and its std is in the range of pixels. [23, 31].
We use the code provided by  to compute log magnitude of both linear and mel-scale spectrograms which are peak normalized to be between and . For videos with a frame rate of FPS we downsample the original audio to kHz and use ms windows with ms frame shift. For videos with a frame rate of FPS we downsample the original audio to Hz and use ms windows with ms frame shift.
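The relation between the linear-scale and mel-scale targets can be illustrated with a small NumPy sketch. It is not the code referred to above: the triangular filterbank, the common 2595 * log10(1 + f / 700) mel formula, and the sizes `N_FFT`, `N_MELS` and `SAMPLE_RATE` are assumptions made for the example. Peak normalization is omitted.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                            n_mels + 2)
    hz_edges = mel_to_hz(mel_edges)
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = hz_edges[i:i + 3]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

# Hypothetical sizes, for illustration only.
N_FFT, N_MELS, SAMPLE_RATE = 512, 80, 16000
fb = mel_filterbank(N_MELS, N_FFT, SAMPLE_RATE)

# Stand-in for a linear magnitude spectrogram of 10 audio windows.
linear_mag = np.abs(np.random.default_rng(0).standard_normal((10, N_FFT // 2 + 1)))
mel_mag = linear_mag @ fb.T                   # reduced-dimensionality target
log_mel = np.log(np.maximum(mel_mag, 1e-5))   # log magnitude, floored for stability
```

In this example each 257-dimensional linear-scale vector is reduced to an 80-dimensional mel-scale vector, which is the kind of reduced-dimensionality target that a decoder can predict before a post-processing network maps it back to the linear scale.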
Our network implementation is based on the Keras library  built on top of TensorFlow . Network weights are initialized using the initialization procedure suggested by He et al. . Before each activation layer Batch Normalization  is performed. We use Leaky ReLU  as the non-linear activation function in all layers but the last two, in which we use the hyperbolic tangent (tanh) function. Adam optimizer  is used with an initial learning rate of , which is reduced several times during training. Dropout  is used to prevent overfitting, with a rate of after convolutional layers and after fully connected ones. We use mini-batches of training samples each and stop training when the validation loss stops decreasing (around epochs). The network is trained with backpropagation using mean squared error () loss for both decoder and post-processing net outputs, which have equal weights. To improve the temporal smoothness of the output, after generating spectrogram coefficients for consecutive frames (), we move one step forward and do the same for (), which creates an overlap of frames, thus creating exactly predictions for each frame. We then calculate a weighted average over the predictions for a given frame using a Gaussian.
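The overlapping-prediction averaging described above can be sketched as follows. The Gaussian is assumed to be centered on the middle of each predicted clip, and `sigma` and the function name are hypothetical; neither is specified by the text. Frames near the start and end of a sequence are covered by fewer clips, so the sketch normalizes by the sum of the weights actually received.

```python
import numpy as np

def gaussian_overlap_average(clip_preds, sigma=1.0):
    """Fuse overlapping per-clip predictions into one vector per frame.

    clip_preds: (n_clips, T, D); clip k holds predictions for frames k..k+T-1.
    Returns an array of shape (n_clips + T - 1, D).
    """
    clip_preds = np.asarray(clip_preds, dtype=float)
    n_clips, T, D = clip_preds.shape
    # Weight of each position inside a clip: largest at the clip center.
    offsets = np.arange(T) - (T - 1) / 2.0
    weights = np.exp(-0.5 * (offsets / sigma) ** 2)
    total = np.zeros((n_clips + T - 1, D))
    weight_sum = np.zeros(n_clips + T - 1)
    for k in range(n_clips):
        total[k:k + T] += weights[:, None] * clip_preds[k]
        weight_sum[k:k + T] += weights
    return total / weight_sum[:, None]
```

When all clips agree on the value of a frame, the weighted average returns that value unchanged. When they disagree, predictions made near the center of a clip dominate, which is one plausible way to favor frames with the most temporal context on both sides.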
Previous works performed experiments with the GRID audiovisual sentence corpus , a large dataset of audio and video (facial) recordings of sentences spoken by talkers ( male, female). Each sentence consists of a six word sequence of the form shown in Table 1, e.g. “Place green at H now”. Although this dataset contains a fair volume of high quality speech videos, it has several limitations, the most important being its extremely small vocabulary. In order to compare our method with previous ones, we too perform experiments on this dataset.
In order to better demonstrate the capability of our model, we also perform experiments on the TCD-TIMIT dataset . This dataset consists of volunteer speakers with around videos each, as well as three lipspeakers, people specially trained to speak in a way that helps the deaf understand their visual speech. The speakers are recorded saying various sentences from the TIMIT dataset , and are recorded using both front-facing and degree cameras.
We evaluated the quality and intelligibility of the reconstructed speech using four well-known objective scores: STOI and ESTOI for estimating the intelligibility of the reconstructed speech, and the automatic mean opinion score (MOS) tests PESQ and VisQOL, which indicate the quality of the speech. While all objective scores have their faults, we found that these metrics correlate relatively well with our perceived audio quality, as well as with our model’s loss function. However, we strongly encourage readers to watch and listen to the supplementary video on our project web page (examples of reconstructed speech can be found at http://www.vision.huji.ac.il/vid2speech), which conclusively demonstrates the superiority of our approach.
6.1 Sound prediction tasks
For this task we trained our model on a random train/test split of the videos of speakers (male) and (female), and made sure that all GRID words were represented in each set. The resulting representation vectors were converted back into waveform using the aforementioned mel-scale spectrogram example-based synthesis (Mel-synth) and predicted linear spectrogram synthesis (Lin-synth).
For this task we trained our model on a random train/test split of the videos of Lipspeakers (female). This results in less than minutes of video used for training, which is only around of the amount of data used in the previous task. In this task, many of the words in the test set do not appear in the training set. We would like our model to learn to reconstruct these words by recognizing the combinations of short visual segments which comprise them. Here too, the resulting audio vectors were converted back into waveform using mel-scale spectrogram example-based synthesis (Mel-synth) and predicted linear spectrogram synthesis (Lin-synth).
Table 4 holds the results of this difficult task. The reconstructed speech is natural-sounding, albeit not entirely intelligible. Given the limited amount of training data, we believe our results are promising enough to indicate that fully intelligible reconstructed speech from unconstrained dictionaries is a feasible task.
6.2 Ablation analysis
We conducted a few ablation studies on GRID speaker to understand the key components in our model. We compare our full model with (1) a model using only an optical flow stream; (2) a model using only a pixel stream; (3) a model with no post-processing network, which outputs a mel-scale spectrogram. Table 5 shows the results of this analysis. Our analysis shows that pixel intensities provide most of the information needed for reconstructing speech, while adding optical flow and a post-processing network gives slightly better results.
Table 5: Ablation analysis. Rows: Optical flow only; Pixels + optical flow; Pixels + OF + postnet.
7 Concluding remarks
A two-tower CNN-based model is proposed for reconstructing intelligible and natural-sounding speech from silent video frames of a speaking person. The model is trained end-to-end with a post-processing network which fuses together multiple CNN outputs in order to obtain a longer range speech representation. We have shown that the proposed model obtains significantly higher quality reconstructions than previous works, and even shows promise towards reconstructing speech from an unconstrained dictionary.
The work described in this paper can be improved upon by increasing intelligibility of speech reconstruction from an unconstrained dictionary, and extending an existing model to unknown speakers. It can also be used as a basis for various speech related tasks such as speaker separation and enhancement.
We plan to use the visually reconstructed speech in order to enhance speech recorded in a noisy environment, and to separate mixed speech in scenarios like the “cocktail party” where the face of the speaking person is visible .
Acknowledgment. This research was supported by Israel Science Foundation, by DFG, and by Intel ICRI-CI.
-  Keras. Software available from https://github.com/fchollet/keras.
-  Tensorflow. Software available from http://tensorflow.org/.
-  Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas. Lipnet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599, 2016.
-  H. L. Bear and R. Harvey. Decoding visemes: Improving machine lip-reading. In ICASSP’16, pages 2009–2013, 2016.
-  D. Burnham, R. Campbell, G. Away, and B. Dodd. Hearing Eye II: The Psychology Of Speechreading And Auditory-Visual Speech. Psychology Press, 2013.
-  J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
-  J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman. Lip reading sentences in the wild. arXiv preprint arXiv:1611.05358, 2016.
-  M. Cooke, J. Barker, S. Cunningham, and X. Shao. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 120(5):2421–2424, 2006.
-  T. L. Cornu and B. Milner. Generating intelligible audio speech from visual speech. In IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017.
-  J. Engel, C. Resnick, A. Roberts, S. Dieleman, D. Eck, K. Simonyan, and M. Norouzi. Neural audio synthesis of musical notes with wavenet autoencoders. arXiv preprint arXiv:1704.01279, 2017.
-  A. Ephrat and S. Peleg. Vid2speech: speech reconstruction from silent video. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.
-  C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
-  A. Gabbay, A. Ephrat, T. Halperin, and S. Peleg. Seeing through noise: Speaker separation and enhancement using visually-derived speech. arXiv:1708.06767, 2017.
-  J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett. Darpa timit acoustic-phonetic continous speech corpus cd-rom. nist speech disc 1-1.1. NASA STI/Recon technical report n, 93, 1993.
-  D. Griffin and J. Lim. Signal estimation from modified short-time fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
-  N. Harte and E. Gillen. Tcd-timit: An audio-visual corpus of continuous speech. IEEE Trans. Multimedia, 17:603–615, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV’15, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
-  A. Hines, J. Skoglund, A. Kokaram, and N. Harte. Visqol: an objective speech quality model. EURASIP Journal on Audio, Speech, and Music Processing, 2015 (13):1–18, 2015.
-  T. Hueber and G. Bailly. Statistical conversion of silent articulation into audible speech using full-covariance hmm. Comput. Speech Lang., 36(C):274–293, Mar. 2016.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
-  F. Itakura. Line spectrum representation of linear predictor coefficients of speech signals. The Journal of the Acoustical Society of America, 57(S1):S35–S35, 1975.
-  V. Jampani, R. Gadde, and P. V. Gehler. Video propagation networks. arXiv preprint arXiv:1612.05478, 2016.
-  J. Jensen and C. H. Taal. An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(11):2009–2022, 2016.
-  C. T. Kello and D. C. Plaut. A neural network model of the articulatory-acoustic forward mapping trained on recordings of articulatory parameters. The Journal of the Acoustical Society of America, 116(4):2354–2364, 2004.
-  D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
-  C. Liu et al. Beyond pixels: exploring new representations and applications for motion analysis. PhD thesis, Massachusetts Institute of Technology, 2009.
-  A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML’13, 2013.
-  I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and R. Harvey. Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):198–213, 2002.
-  A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman. Visually indicated sounds. In CVPR’16, 2016.
-  D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan. Learning features by watching objects move. arXiv preprint arXiv:1612.06370, 2016.
-  E. D. Petajan. Automatic lipreading to enhance speech recognition (speech reading). PhD thesis, University of Illinois at Urbana-Champaign, 1984.
-  S. Petridis, Z. Li, and M. Pantic. End-to-end visual speech recognition with lstms. 2017.
-  Y. Poleg, A. Ephrat, S. Peleg, and C. Arora. Compact cnn for indexing egocentric videos. In Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on, pages 1–9. IEEE, 2016.
-  A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra. Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’01), volume 2, pages 749–752. IEEE, 2001.
-  K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
-  N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
-  R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
-  T. Stafylakis and G. Tzimiropoulos. Combining residual networks with lstms for lipreading. arXiv preprint arXiv:1703.04105, 2017.
-  Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3476–3483, 2013.
-  C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen. A short-time objective intelligibility measure for time-frequency weighted noisy speech. 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4214–4217, 2010.
-  M. Wand, J. Koutník, and J. Schmidhuber. Lipreading with long short-term memory. In ICASSP’16, pages 6115–6119, 2016.
-  Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135, 2017.
-  Z. Zhou, G. Zhao, X. Hong, and M. Pietikäinen. A review of recent advances in visual speech decoding. Image and vision computing, 32(9):590–605, 2014.