Audio Super-Resolution using Neural Nets
We introduce a new audio processing technique that increases the sampling rate of signals such as speech or music using deep convolutional neural networks. Our model is trained on pairs of low and high-quality audio examples; at test-time, it predicts missing samples within a low-resolution signal in an interpolation process similar to image super-resolution. Our method is simple and does not involve specialized audio processing techniques; in our experiments, it outperforms baselines on standard speech and music benchmarks at upscaling ratios of 2×, 4×, and 6×. The method has practical applications in telephony, compression, and text-to-speech generation; it demonstrates the effectiveness of convolutional architectures on an audio generation task.
Volodymyr Kuleshov, S. Zayd Enam, and Stefano Ermon
Department of Computer Science,
The generative modeling of audio signals is a fundamental problem at the intersection of signal processing and machine learning; recent learning-based algorithms have enabled advances in speech recognition (Hinton et al., 2012), audio synthesis (van den Oord et al., 2016; Mehri et al., 2016), music recommendation systems (Coviello et al., 2012; Wang & Wang, 2014; Liang et al., 2015), and in many other areas (Acevedo et al., 2009). Audio processing also raises basic research questions pertaining to time series and generative modeling (Haykin & Chen, 2005; Bilmes, 2004).
One of the most significant recent advances in machine learning-based audio processing has been the ability to directly model raw signals in the time domain using neural networks (van den Oord et al., 2016; Mehri et al., 2016). Although this affords us the maximum modeling flexibility, it is also computationally expensive, requiring us to handle thousands of audio samples every second.
In this paper, we explore new lightweight modeling algorithms for audio. In particular, we focus on a specific audio generation problem called bandwidth extension, in which the task is to reconstruct high-quality audio from a low-quality, down-sampled input containing only a small fraction (15-50%) of the original samples. We introduce a new neural network-based technique for this problem that is inspired by image super-resolution algorithms (Dong et al., 2016), which use machine learning techniques to interpolate a low-resolution image into a higher-resolution one. Learning-based methods often perform better in this context than general-purpose interpolation schemes such as splines because they leverage sophisticated domain-specific models of the appearance of natural signals.
As in image super-resolution, our model is trained on pairs of low and high-quality samples; at test-time, it predicts the missing samples of a low-resolution input signal. Unlike recent neural networks for generating raw audio, our model is fully feedforward and can be run in real-time. In addition to having multiple practical applications, our method also suggests new ways to improve existing generative models of audio.
From a practical perspective, our technique has applications in telephony, compression, text-to-speech generation, forensic analysis, and in other domains. It outperforms baselines at 2×, 4×, and 6× upscaling ratios, while also being significantly simpler than previous methods. Whereas most existing audio enhancement methods make substantial use of signal processing theory, our approach is conceptually very simple and requires no specialized knowledge to implement. Our neural networks are simply trained to map one audio time series into another. Our approach is also among the first to use convolutional architectures for bandwidth extension; as a result, it scales better with dataset size and computational resources relative to current alternatives.
From a generative modeling perspective, our work demonstrates that purely feedforward architectures operating in a non-discretized output space can achieve good performance on an important audio generation task. This hints at the possibility of designing improved generative models for audio that combine both feedforward and recurrent components.
2 Setup and background
We represent an audio signal as a function $s(t) : [0, T] \to \mathbb{R}$, where $T$ is the duration of the signal (in seconds) and $s(t)$ is the amplitude at $t$. Taking a digital measurement of $s$ requires us to discretize the continuous function $s(t)$ into a vector $x(t) : \{\tfrac{1}{R}, \tfrac{2}{R}, \dots, \tfrac{RT}{R}\} \to \mathbb{R}$. We refer to $R$ as the sampling rate of $x$ (in Hz). Sampling rates may range from 4 kHz (low-quality telephone speech) to 44 kHz (high-fidelity music).
In this work, we interpret $R$ as the resolution of $x$; our goal is to increase the resolution of audio samples by predicting $x$ from a fraction of its samples taken at a lower sampling rate. Note that by basic signal processing theory, this is equivalent to predicting the higher frequencies of $x$.
Audio upsampling has been studied in the audio processing community under the name bandwidth extension (Ekstrand, 2002; Larsen & Aarts, 2005). Several learning-based approaches have been proposed, including Gaussian mixture models (Cheng et al., 1994; Park & Kim, 2000) and neural networks (Li et al., 2015). These methods typically involve hand-crafted features and use relatively simple models (e.g., neural networks with at most 2-3 densely connected layers) that are often part of larger, more complex systems. In comparison, our method is conceptually simple (operating directly on the raw audio signal), scalable (our neural networks are fully convolutional and fully feed-forward), more accurate, and is also among the few to have been tested on non-speech audio.
Given a low-resolution signal $x$ sampled at a rate $R_1$, our goal is to reconstruct a high-resolution version $y$ of $x$ that has a sampling rate $R_2 > R_1$. For example, $x$ may be a voice signal transmitted via a standard telephone connection at 4 kHz; $y$ may be a high-resolution 16 kHz reconstruction of the original. We use $r = R_2 / R_1$ to denote the upsampling ratio of the two signals, which in our work equals $r = 2, 4, 6$. We thus expect that $y_{rt/R_2} \approx x_{t/R_1}$ for $t = 1, 2, \dots, T R_1$.
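As a rough illustration of how the upsampling ratio relates the two signals, the sketch below keeps every $r$-th sample of a toy high-resolution series and reports the fraction retained. The helper name `subsample`, the toy signal, and the ratios 2, 4, and 6 are assumptions made for illustration; no anti-aliasing filter is applied here.

```python
def subsample(y, r):
    """Keep every r-th sample of y (illustration only: no low-pass filter)."""
    # With 1-based indexing, low-resolution sample t corresponds to
    # high-resolution sample r * t, i.e. 0-based index r * t - 1.
    return y[r - 1::r]


y = list(range(1, 25))  # toy high-resolution signal with 24 samples: 1..24

for r in (2, 4, 6):     # assumed upsampling ratios
    x = subsample(y, r)
    kept = len(x) / len(y)
    print(f"r={r}: {len(x)} of {len(y)} samples kept ({kept:.1%})")
```

Under these assumptions the retained fraction is $1/r$, so ratios between 2 and 6 leave roughly half to one sixth of the original samples.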
To recover the under-defined signal, we learn a model $p(y \mid x)$ of the higher-resolution $y$, conditioned on its low-resolution instantiation $x$. We assume that the relationship between the time series follows the equation $y = f_\theta(x) + \epsilon$, where $\epsilon$ is Gaussian noise and $f_\theta$ is a model parametrized by $\theta$. Our framework also extends to more complex noise models which the user may provide as a prior or that may be themselves parametrized by the model (similarly to how one parametrizes the normal distribution in a variational autoencoder).
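The above formulation naturally leads to a mean squared error (MSE) objective
$$\ell(\mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \| y_i - f_\theta(x_i) \|_2^2$$
for determining the parameters $\theta$ based on a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ of source/target time series pairs. Since our model is fully convolutional, we may take the $x_i, y_i$ to be small patches sampled from the full time series.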
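The objective can be sketched in a few lines, assuming the standard form of the MSE (the mean over pairs of the squared L2 distance between target and prediction) and assuming source and target patches have equal length. The names `mse_objective` and `sample_patches`, and the two toy stand-ins for the network, are hypothetical.

```python
import random


def mse_objective(model, pairs):
    """Mean over (x, y) pairs of the squared L2 distance between y and model(x)."""
    total = 0.0
    for x, y in pairs:
        pred = model(x)
        total += sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    return total / len(pairs)


def sample_patches(x, y, patch_len, n, seed=0):
    """Draw n aligned patches of length patch_len from equal-length series."""
    rng = random.Random(seed)
    patches = []
    for _ in range(n):
        start = rng.randrange(len(x) - patch_len + 1)
        patches.append((x[start:start + patch_len], y[start:start + patch_len]))
    return patches


# Toy data: a target series y and a degraded source x of the same length.
y = [float(i % 5) for i in range(100)]
x = [v + 0.5 for v in y]


def identity(s):
    """Hypothetical stand-in for an untrained network: returns its input."""
    return s


def shifted(s):
    """Hypothetical stand-in that undoes the toy degradation exactly."""
    return [v - 0.5 for v in s]


patches = sample_patches(x, y, patch_len=10, n=8)
print(mse_objective(identity, patches))  # 10 samples * 0.25 each = 2.5
print(mse_objective(shifted, patches))   # 0.0
```

Because the loss is computed per patch, the training set can be built from short windows rather than from full recordings.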
3.2 Model Architecture
We parametrize the function with a deep convolutional neural network with residual connections; our neural network architecture is based on ideas from Shi et al. (2016), Dong et al. (2016), and Isola et al. (2016), and is shown in Figure 1. We highlight its main features below.
Our model contains $B$ successive downsampling and upsampling blocks: each performs a convolution, applies batch normalization, and applies a ReLU non-linearity. Downsampling block $b = 1, 2, \dots, B$ contains convolutional filters with a stride of 2; upsampling block $b$ contains a corresponding bank of filters. The number and length of the filters vary with the block index $b$.
Thus, at a downsampling step, we halve the spatial dimension and double the filter size; during upsampling, this is reversed. This bottleneck architecture is inspired by auto-encoders, and is known to encourage the model to learn a hierarchy of features. For example, on an audio task, bottom layers may extract wavelet-style features, while higher ones may correspond to phonemes (Aytar et al., 2016). Note that the model is fully convolutional, and may run on input sequences of arbitrary length.
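The halving and doubling pattern can be traced with a small shape calculator. This is only bookkeeping, not a network: the function name `trace_shapes`, the starting count of 64 filters, and the choice of 4 blocks are assumptions, while the length 6000 matches the training patch length reported in the experiments.

```python
def trace_shapes(length, filters, num_blocks):
    """Return the (filters, length) shape after each block of the bottleneck."""
    if length % (2 ** num_blocks) != 0:
        raise ValueError("length must be divisible by 2 ** num_blocks")
    shapes = [(filters, length)]
    for _ in range(num_blocks):   # downsampling: halve length, double filters
        filters, length = filters * 2, length // 2
        shapes.append((filters, length))
    for _ in range(num_blocks):   # upsampling: the reverse
        filters, length = filters // 2, length * 2
        shapes.append((filters, length))
    return shapes


# Assumed configuration: 64 starting filters and 4 blocks.
shapes = trace_shapes(length=6000, filters=64, num_blocks=4)
for shape in shapes:
    print(shape)
```

With these assumed values the bottleneck has shape (1024, 375), and the output returns to the input shape (64, 6000).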
When the source series $x$ is similar to the target $y$, downsampling features will also be useful for upsampling (Isola et al., 2016). We thus add skip connections which stack the tensor of $b$-th downsampling features with the $(B - b + 1)$-th tensor of upsampling features, where $B$ is the number of blocks. We also add an additive residual connection from the input to the final output: the model thus only needs to learn $y - x$, which in practice speeds up training.
Subpixel shuffling layer.
In order to increase the time dimension during upscaling, we have implemented a one-dimensional version of the Subpixel layer of Shi et al. (2016), which has been shown to be less prone to produce artifacts (Odena et al., 2016).
An upscaling block’s convolution maps an input tensor of dimension $F \times d$ into one of size $F/2 \times d$. The subpixel layer reshuffles this tensor into another one of size $F/4 \times 2d$ (while preserving the tensor entries intact); these are concatenated with $F/4$ features from the downsampling stage, for a final output of size $F/2 \times 2d$. Thus, we have halved the number of filters and doubled the spatial dimension.
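A one-dimensional subpixel shuffle can be sketched on plain nested lists (channels by time). The sketch assumes the shape bookkeeping $F \times d \to F/2 \times d \to F/4 \times 2d \to F/2 \times 2d$, which is consistent with halving the number of filters and doubling the spatial dimension. The function name `subpixel_shuffle_1d` and the interleaving order of channels are assumptions; implementations differ on that convention.

```python
def subpixel_shuffle_1d(tensor, r=2):
    """Rearrange a (channels x time) tensor of shape (C, d) into (C // r, d * r).

    Output channel c interleaves input channels c*r, ..., c*r + r - 1 along
    the time axis. The interleaving order is an assumed convention.
    """
    channels, length = len(tensor), len(tensor[0])
    if channels % r != 0:
        raise ValueError("number of channels must be divisible by r")
    out = []
    for c in range(channels // r):
        row = []
        for t in range(length):
            for j in range(r):
                row.append(tensor[c * r + j][t])
        out.append(row)
    return out


# Toy upscaling block with F = 8 input filters and d = 3 time steps.
conv_out = [[10 * c + t for t in range(3)] for c in range(4)]  # F/2 x d  = 4 x 3
shuffled = subpixel_shuffle_1d(conv_out)                        # F/4 x 2d = 2 x 6
skip = [[-1] * 6 for _ in range(2)]       # placeholder downsampling features
stacked = shuffled + skip                 # concatenate channels: F/2 x 2d = 4 x 6

print(shuffled[0])                    # [0, 10, 1, 11, 2, 12]
print(len(stacked), len(stacked[0]))  # 4 6
```

The shuffle only moves entries: every value of `conv_out` appears exactly once in `shuffled`, so no information is created or discarded by the layer itself.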
We use the VCTK dataset (Yamagishi, 2012), which contains 44 hours of data from 108 different speakers, and the Piano dataset of Mehri et al. (2016) (10 hours of Beethoven sonatas). We generate low-resolution audio signals from the 16 kHz originals by applying an order 8 Chebyshev type I low-pass filter before subsampling the signal by the desired scaling ratio.
We evaluate our method in three regimes. The SingleSpeaker task trains the model on the first 223 recordings of VCTK Speaker 1 (about 30 mins) and tests on the last 8 recordings. The MultiSpeaker task assesses our ability to generalize to new speakers. We train on the first 99 VCTK speakers and test on the 8 remaining ones; our recordings feature different voices and accents (Scottish, Indian, etc.). Lastly, the Piano task extends audio super-resolution to non-vocal data; we use the standard 88%-6%-6% data split.
We compare our method against two baselines: a cubic B-spline, which corresponds to the bicubic upsampling baseline used in image super-resolution, and the recent neural network-based technique of Li et al. (2015).
The latter approach takes as input the short-time Fourier transform (STFT) of the input and directly predicts the phase and the magnitudes of the high frequency components using a dense neural network with three hidden layers of size 2048 and ReLU nonlinearities. Li et al. (2015) have shown that this method is preferred over Gaussian Mixture Models in 84% of cases in a user study. This model requires that the scaling ratio be a power of 2, hence it is not applicable when $r = 6$.
We instantiate our model with blocks and train it for 400 epochs on patches of length 6000 (in the high-resolution space) using the ADAM optimizer with a learning rate of . To ensure source/target series are of the same length, the source input is pre-processed with cubic upscaling. We do not compare against previously-proposed matrix factorization techniques (Bansal et al., 2005; Liang et al., 2013), as they are typically trained on 10 input examples (Sun & Mazumder, 2013) (due to the cost of jointly factorizing a large number of matrices), and do not scale to the size of our datasets.
Given a reference signal $y$ and an approximation $x$, the Signal to Noise Ratio (SNR) is defined as
$$\mathrm{SNR}(x, y) = 10 \log_{10} \frac{\| y \|_2^2}{\| x - y \|_2^2}.$$
The SNR is a standard metric used in the signal processing literature. The Log-spectral distance (LSD) (Gray & Markel, 1976) measures the reconstruction quality of individual frequencies as follows:
$$\mathrm{LSD}(x, y) = \frac{1}{L} \sum_{\ell=1}^{L} \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left( X(\ell, k) - \hat{X}(\ell, k) \right)^2},$$
where $X$ and $\hat{X}$ are the log-spectral power magnitudes of $y$ and $x$, respectively. These are defined as $X = \log |S|^2$, where $S$ is the short-time Fourier transform (STFT) of the signal. We use $\ell$ and $k$ to index frames and frequencies, respectively; in our experiments, we used frames of length 2048.
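Both metrics can be sketched with the standard library alone. The sketch assumes the usual SNR definition (ratio of reference energy to error energy, in dB) and an LSD that averages over frames the root-mean-square difference of log power spectra. The natural logarithm, the non-overlapping unwindowed frames, the naive DFT, and the small epsilon inside the logarithm are all simplifying assumptions; the toy example uses short frames because the naive DFT is slow at length 2048.

```python
import cmath
import math


def snr(y, x):
    """Signal-to-noise ratio in dB of approximation x against reference y."""
    signal = sum(v * v for v in y)
    noise = sum((a - b) ** 2 for a, b in zip(x, y))
    return 10.0 * math.log10(signal / noise)  # undefined when x equals y


def log_power_spectrum(signal, frame_len):
    """Log power spectra of non-overlapping frames, via a naive DFT."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            coeff = sum(v * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, v in enumerate(frame))
            # The epsilon avoids log(0) on empty frequency bins (an assumption).
            spectrum.append(math.log(abs(coeff) ** 2 + 1e-10))
        frames.append(spectrum)
    return frames


def lsd(y, x, frame_len=2048):
    """Mean over frames of the RMS difference between log power spectra."""
    spec_y = log_power_spectrum(y, frame_len)
    spec_x = log_power_spectrum(x, frame_len)
    per_frame = [math.sqrt(sum((a - b) ** 2 for a, b in zip(fy, fx)) / len(fy))
                 for fy, fx in zip(spec_y, spec_x)]
    return sum(per_frame) / len(per_frame)


# Toy reference with a low and a high frequency; the approximation lacks the
# high-frequency component, mimicking an interpolated signal.
n_samples = 64
low = [math.sin(2 * math.pi * 3 * n / 32) for n in range(n_samples)]
high = [0.5 * math.sin(2 * math.pi * 11 * n / 32) for n in range(n_samples)]
y = [a + b for a, b in zip(low, high)]
x = low

print(round(snr(y, x), 2))      # energy ratio 40 / 8 = 5, about 6.99 dB
print(lsd(y, y, frame_len=32))  # 0.0 for identical signals
print(lsd(y, x, frame_len=32) > 0.0)
```

In this toy case the SNR stays moderate even though an entire frequency band is missing, whereas the LSD is dominated by that missing band.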
The results of our experiments are summarized in Table 2. Our objective metrics show an improvement of 1-5 dB over the baselines, with the strongest improvements at higher upscaling factors. Although the spline baseline achieves a high SNR, its signal often lacks higher frequencies; the LSD metric is better at identifying this problem. Our technique also improves over the DNN baseline; our convolutional architecture appears to use our modeling capacity more efficiently than a dense neural network, and we expect such architectures will soon be more widely used in audio generation tasks.
Next, we confirmed our objective experiments with a study in which human raters were asked to assess the quality of super-resolution using a MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor) test. For each trial, an audio sample was upscaled using different techniques (our set of samples is posted at https://kuleshov.github.io/audio-super-res/). We collected four VCTK speaker recordings from the MultiSpeaker testing set. For each recording, we collected the original utterance, a downsampled version at , as well as signals super-resolved using Splines, DNNs, and our model (six versions in total). We recruited 10 subjects and used an online survey to ask each of them to rate each sample on a scale of 0 (extremely bad) to 100 (excellent) reconstruction. The results from the experiment are summarized in Table 1. Our method ranked as the best of the three upscaling techniques.
|LPF (Test)||No LPF (Test)|
|No LPF (Train)||0.43||4.4||33.2||3.3|
We tested the sensitivity of our method to out-of-distribution input via an audio super-resolution experiment in which the training set did not use a low-pass filter, while the test set did, and vice-versa. We focused on the Piano task and . The output from the model was noisier than expected, indicating that generalization is an important practical concern. We suspect this behavior may be common in super-resolution algorithms, but has not been widely documented. A potential solution would be to train on data that has been generated using multiple techniques.
In addition, we examined the ability of our model to generalize from speech to music and vice versa. We found that switching domains produced noisy output, again highlighting the specialization of the model.
We examined the importance of our various architectural design choices via an ablation analysis on the MultiSpeaker audio super-resolution task using an upscaling ratio of . The adjacent figure displays the result: the green-ish line displays the validation set loss of the original model over time; the yellow curve removes the additive residual connection; the green curve further removes the additive skip connection (while preserving the same total number of filters). This shows that symmetric skip connections are crucial for attaining good performance; additive connections add a further small, but perceptible, improvement.
Our model is computationally efficient and can be run in real time. On the Piano task (where all input signals are 12s in length), our method processed a single second of audio in 0.11s on average on a Titan X GPU. Training our models, however, required about 2 days for the MultiSpeaker task. Unlike sequence-to-sequence architectures, our model does not require the complete input sequence in order to begin generating an output sequence.
Finally, to explore the limits of our approach, we evaluated our method on the MagnaTagATune dataset, which consists of about 200 hours of music from 188 different genres. This dataset is larger and much more diverse than the ones we considered so far. We found that our model underfit the dataset, with very little reduction in the training error, and no improvement over the spline baseline. Other learning-based baselines fared similarly. However, we expect improved results with a larger model and more computational resources.
5 Previous Work and Discussion
Time series modeling.
In the machine learning literature, time series signals have most often been modeled with auto-regressive models, of which variants of recurrent networks are a special case (Gers et al., 2001; Maas et al., 2012; Mehri et al., 2016). Our approach instead generalizes conditional modeling ideas used in computer vision for tasks such as image super-resolution (Dong et al., 2016; Ledig et al., 2016) or colorization (Zhang et al., 2016).
We identify a broad class of conditional time series modeling problems that arise in signal processing, biomedicine, and other fields and that are characterized by a natural alignment among source/target series pairs and differences that are well-represented by local transformations. We propose a general architecture for such problems and show that it works well in different domains.
Existing learning-based approaches include Gaussian mixture models (Cheng et al., 1994; Park & Kim, 2000; Pulakka et al., 2011), linear predictive coding (Bradbury, 2000), and neural networks (Li et al., 2015). Our work proposes the first convolutional architecture, which we find to scale better with dataset size and outperform recent, specialized methods. Moreover, while existing techniques involve many hand-crafted features (see, e.g., Pulakka et al., 2011), our approach is fully domain-agnostic.
In telephony, commercial efforts are underway to transmit voice at higher rates (typically 16 kHz) in specific handsets; audio super-resolution is a step towards recreating this experience in software. Similar applications could be found in compression, text-to-speech generation, and forensic analysis. More generally, our work demonstrates the effectiveness of feedforward convolutional architectures on an audio generation task.
Machine learning techniques based on deep neural networks have been successful at solving under-defined problems in signal processing such as image super-resolution, colorization, in-painting, and many others. Learning-based methods often perform better in this context than general-purpose algorithms because they leverage sophisticated domain-specific models of the appearance of natural signals.
In this work, we proposed new techniques that use this insight to upsample audio signals. Our technique extends previous work on image super-resolution to the audio domain; it outperforms previous bandwidth extension approaches on both speech and non-vocal music. Our approach is fast and simple to implement, and has applications in telephony, compression, and text-to-speech generation. It also demonstrates the effectiveness of feedforward architectures on an important audio generation task, suggesting new directions for generative audio modeling.
- Acevedo et al. (2009) Miguel A Acevedo, Carlos J Corrada-Bravo, Héctor Corrada-Bravo, Luis J Villanueva-Rivera, and T Mitchell Aide. Automated classification of bird and amphibian calls using machine learning: A comparison of methods. Ecological Informatics, 4(4):206–214, 2009.
- Aytar et al. (2016) Yusuf Aytar, Carl Vondrick, and Antonio Torralba. Soundnet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 892–900, 2016. URL http://papers.nips.cc/paper/6146-soundnet-learning-sound-representations-from-unlabeled-video.
- Bansal et al. (2005) Dhananjay Bansal, Bhiksha Raj, and Paris Smaragdis. Bandwidth expansion of narrowband speech using non-negative matrix factorization. In Proc. Interspeech, 2005.
- Bilmes (2004) Jeffrey A Bilmes. Graphical models and automatic speech recognition. In Mathematical foundations of speech and language processing, pp. 191–245. Springer, 2004.
- Bradbury (2000) Jeremy Bradbury. Linear predictive coding. McGraw-Hill, 2000.
- Cheng et al. (1994) Yan Ming Cheng, Douglas O’Shaughnessy, and Paul Mermelstein. Statistical recovery of wideband speech from narrowband speech. IEEE Transactions on Speech and Audio Processing, 2(4):544–548, 1994.
- Coviello et al. (2012) Emanuele Coviello, Yonatan Vaizman, Antoni B Chan, and Gert RG Lanckriet. Multivariate autoregressive mixture models for music auto-tagging. In ISMIR, pp. 547–552, 2012.
- Dong et al. (2016) Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell., 38(2):295–307, February 2016. ISSN 0162-8828. doi: 10.1109/TPAMI.2015.2439281. URL http://dx.doi.org/10.1109/TPAMI.2015.2439281.
- Ekstrand (2002) Per Ekstrand. Bandwidth extension of audio signals by spectral band replication. In Proceedings of the 1st IEEE Benelux Workshop on Model Based Processing and Coding of Audio (MPCA’02). Citeseer, 2002.
- Gers et al. (2001) Felix A Gers, Douglas Eck, and Jürgen Schmidhuber. Applying lstm to time series predictable through time-window approaches. In International Conference on Artificial Neural Networks, pp. 669–676. Springer, 2001.
- Gray & Markel (1976) Augustine Gray and John Markel. Distance measures for speech processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(5):380–391, 1976.
- Haykin & Chen (2005) Simon Haykin and Zhe Chen. The cocktail party problem. Neural computation, 17(9):1875–1902, 2005.
- Hinton et al. (2012) Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
- Isola et al. (2016) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arxiv, 2016.
- Larsen & Aarts (2005) Erik Larsen and Ronald M Aarts. Audio bandwidth extension: application of psychoacoustics, signal processing and loudspeaker design. John Wiley & Sons, 2005.
- Ledig et al. (2016) Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. CoRR, abs/1609.04802, 2016. URL http://arxiv.org/abs/1609.04802.
- Li et al. (2015) Kehuang Li, Zhen Huang, Yong Xu, and Chin-Hui Lee. Dnn-based speech bandwidth expansion and its application to adding high-frequency missing features for automatic speech recognition of narrowband speech. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
- Liang et al. (2013) Dawen Liang, Matthew D. Hoffman, and Daniel P. W. Ellis. Beta process sparse nonnegative matrix factorization for music. In Alceu de Souza Britto Jr., Fabien Gouyon, and Simon Dixon (eds.), Proceedings of the 14th International Society for Music Information Retrieval Conference, ISMIR 2013, Curitiba, Brazil, November 4-8, 2013, pp. 375–380, 2013. ISBN 978-0-615-90065-0. URL http://www.ppgia.pucpr.br/ismir2013/wp-content/uploads/2013/09/229_Paper.pdf.
- Liang et al. (2015) Dawen Liang, Minshu Zhan, and Daniel PW Ellis. Content-aware collaborative music recommendation using pre-trained neural networks. In ISMIR, pp. 295–301, 2015.
- Maas et al. (2012) Andrew Maas, Quoc V. Le, Tyler M. ONeil, Oriol Vinyals, Patrick Nguyen, and Andrew Y. Ng. Recurrent neural networks for noise reduction in robust asr. In INTERSPEECH, 2012.
- Mehri et al. (2016) Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. Samplernn: An unconditional end-to-end neural audio generation model, 2016. URL http://arxiv.org/abs/1612.07837.
- Odena et al. (2016) Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 2016. doi: 10.23915/distill.00003. URL http://distill.pub/2016/deconv-checkerboard.
- Park & Kim (2000) Kun-Youl Park and Hyung Soon Kim. Narrowband to wideband conversion of speech using gmm based transformation. In Acoustics, Speech, and Signal Processing, 2000. ICASSP’00. Proceedings. 2000 IEEE International Conference on, volume 3, pp. 1843–1846. IEEE, 2000.
- Pulakka et al. (2011) Hannu Pulakka, Ulpu Remes, Kalle Palomäki, Mikko Kurimo, and Paavo Alku. Speech bandwidth extension using gaussian mixture model-based estimation of the highband mel spectrum. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pp. 5100–5103. IEEE, 2011.
- Shi et al. (2016) Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. pp. 1874–1883, 2016. doi: 10.1109/CVPR.2016.207. URL http://dx.doi.org/10.1109/CVPR.2016.207.
- Sun & Mazumder (2013) Dennis L. Sun and Rahul Mazumder. Non-negative matrix completion for bandwidth extension: A convex optimization approach. In IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2013, Southampton, United Kingdom, September 22-25, 2013, pp. 1–6. IEEE, 2013. doi: 10.1109/MLSP.2013.6661924. URL http://dx.doi.org/10.1109/MLSP.2013.6661924.
- van den Oord et al. (2016) Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. CoRR, abs/1609.03499, 2016. URL http://arxiv.org/abs/1609.03499.
- Wang & Wang (2014) Xinxi Wang and Ye Wang. Improving content-based and hybrid music recommendation using deep learning. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 627–636. ACM, 2014.
- Yamagishi (2012) Junichi Yamagishi. English multi-speaker corpus for cstr voice cloning toolkit, 2012. URL http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html.
- Zhang et al. (2016) Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. ECCV, 2016.