Joint Time-Frequency Scattering for Audio Classification
Abstract
We introduce the joint time-frequency scattering transform, a time-shift invariant descriptor of time-frequency structure for audio classification. It is obtained by applying a two-dimensional wavelet transform in time and log-frequency to a time-frequency wavelet scalogram. We show that this descriptor successfully characterizes complex time-frequency phenomena such as time-varying filters and frequency-modulated excitations. State-of-the-art results are achieved for signal reconstruction and phone segment classification on the TIMIT dataset.
Index Terms: audio classification, invariant descriptors, time-frequency structure, wavelets, convolutional networks
1 Introduction
Signal representations for classification need to capture discriminative information from signals while remaining invariant to irrelevant variability. This allows accurate classifiers to be trained using a limited set of labeled examples. In audio classification, classes are often invariant to time shifts, making time-shift invariant descriptors particularly useful.
Mel-frequency spectral coefficients are time-frequency descriptors invariant to time shifts up to and form the basis for the popular mel-frequency cepstral coefficients (MFCCs) [1]. These can be seen as the time-averaging of a wavelet scalogram, which is obtained by constant-Q wavelet filtering followed by a complex modulus [2]. The time scattering transform refines this while maintaining invariance by further decomposing each frequency band in the wavelet scalogram using another scalogram [2, 3]. The result can be seen as the output of a multilayer convolutional network [3]. Classification experiments have demonstrated the importance of this second layer, which captures amplitude modulation [2]. Yet because it decomposes each frequency band separately, it fails to capture more complex time-frequency structure such as time-varying filters and frequency modulation, which are important in many classification tasks.
Section 2 introduces the joint time-frequency scattering transform, which extends time scattering by replacing the second-layer wavelet transform in time with a two-dimensional wavelet transform in time and log-frequency. This is inspired by the neurophysiological models of S. Shamma, where the scalogram-like output of the cochlea is decomposed using two-dimensional Gabor filters [4]. Section 3 shows that joint time-frequency scattering better captures the time-frequency structure of the scalogram by adequately characterizing time-varying filters and frequency modulation. This is illustrated in Section 4, which presents signal reconstruction results from joint time-frequency scattering coefficients that are comparable to state-of-the-art algorithms and superior to time scattering reconstruction. In Section 5, the joint time-frequency scattering transform is shown to achieve state-of-the-art performance for phone segment classification on the TIMIT dataset, demonstrating the importance of properly describing time-frequency structure. All figures and numerical results are reproducible using a MATLAB software package available at http://www.di.ens.fr/data/scattering/.
2 Joint time-frequency scattering
The wavelet scalogram of a signal represents time-frequency structure through a wavelet decomposition, which filters a signal using a constant-Q wavelet filter bank. A time scattering transform captures the temporal evolution of each frequency band by another set of wavelet convolutions in time. It does not fully capture the time-frequency structure of the scalogram since it neglects correlation across frequencies. The joint time-frequency scattering remedies this by replacing the one-dimensional wavelet transform in time with a two-dimensional wavelet transform in time and log-frequency.
We denote the Fourier transform of a signal by . An analytic mother wavelet is a complex filter whose Fourier transform is concentrated over the frequency interval . Dilations of this mother wavelet define a family of filters centered at frequencies for , given by
(1) 
Letting denote the base-two logarithm of , we observe that samples each octave uniformly with wavelets. The temporal support of is approximately , so to ensure that the support does not exceed some fixed window size , we define using (1) only when . The low-frequency interval is covered by linearly spaced filters of constant bandwidth . However, to simplify explanations, we shall treat all filters as dilations of .
The wavelet transform convolves a signal with a wavelet filter bank. Its complex modulus is the wavelet scalogram
(2) 
an image uniformly sampled in and . Here represents time-frequency intensity in the interval of duration centered at and the frequency band of bandwidth centered at . Figure 1(a) shows a sample scalogram.
While a rich descriptor of time-frequency structure, the scalogram is not time-shift invariant. The scattering transform ensures invariance to time shifts smaller than by time-averaging with a lowpass filter of support , giving
(3) 
known as first-order scattering coefficients. These approximate mel-frequency spectral coefficients [2].
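As a rough numerical illustration of the scalogram and first-order coefficients described above, the following NumPy sketch builds an approximately analytic constant-Q filter bank in the Fourier domain, takes the modulus, and averages in time. The Gaussian filter shape, the parameter values, the circular (FFT-based) convolutions, and all function names are assumptions made for illustration; they are not the filters of the MATLAB package mentioned in the introduction.

```python
import numpy as np

def constant_q_bank(n, q=4, n_octaves=5, xi_max=0.35):
    """Gaussian band-pass filters in the Fourier domain (assumed filter shape).

    Center frequencies are xi_max * 2**(-k / q) cycles per sample, so each
    octave holds q filters and the bandwidth scales with the center frequency.
    """
    omega = np.fft.fftfreq(n)                        # cycles per sample
    centers = xi_max * 2.0 ** (-np.arange(q * n_octaves) / q)
    sigma = centers * (2.0 ** (1.0 / q) - 1.0)       # constant-Q bandwidth
    bank = np.exp(-0.5 * ((omega[None, :] - centers[:, None]) / sigma[:, None]) ** 2)
    return bank, centers

def gaussian_lowpass(n, T):
    """Fourier transform of a Gaussian window of duration roughly T samples."""
    omega = np.fft.fftfreq(n)
    return np.exp(-2.0 * (np.pi * omega * T / 4.0) ** 2)

def scalogram(x, bank):
    """Modulus of the wavelet transform, one row per frequency band."""
    return np.abs(np.fft.ifft(np.fft.fft(x)[None, :] * bank, axis=1))

def first_order(x, bank, T):
    """Time-averaged scalogram (first-order scattering coefficients)."""
    x1 = scalogram(x, bank)
    phi = gaussian_lowpass(x.size, T)
    return np.real(np.fft.ifft(np.fft.fft(x1, axis=1) * phi[None, :], axis=1))

def relative_change(a, b):
    """Relative Euclidean distance between two arrays."""
    return np.linalg.norm(a - b) / np.linalg.norm(a)

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
x_shifted = np.roll(x, 8)                            # circular shift, much smaller than T
bank, centers = constant_q_bank(x.size)

change_x1 = relative_change(scalogram(x, bank), scalogram(x_shifted, bank))
change_s1 = relative_change(first_order(x, bank, T=256),
                            first_order(x_shifted, bank, T=256))
print(f"scalogram change: {change_x1:.3f}, first-order change: {change_s1:.3f}")
```

In this sketch the shift is small compared to the averaging window, so the first-order coefficients should change much less than the scalogram itself, which is the invariance property the averaging is meant to provide.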
To recover the high frequencies lost when averaging by in (3), is convolved with a second set of wavelets . Computing the modulus gives
(4) 
As before, averaging in time creates invariance and yields
(5) 
These are called second-order time scattering coefficients. They supplement the first order (and by extension mel-frequency spectral coefficients) by capturing the temporal variability of the scalogram [3]. Higher-order coefficients can also be computed by repeating the same procedure.
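The cascade of wavelet convolution, modulus, and averaging can be sketched in a few lines of NumPy. The example below is only an illustration under stated assumptions: the filters are Gaussian bumps in the Fourier domain, convolutions are circular, the parameter values are arbitrary, and the function names are hypothetical. It applies the cascade to an amplitude-modulated sinusoid and checks which second-layer filter responds most strongly in the band containing the carrier.

```python
import numpy as np

def constant_q_bank(n, xi_max, q=4, n_octaves=5):
    """Gaussian band-pass filters in the Fourier domain (assumed filter shape)."""
    omega = np.fft.fftfreq(n)
    centers = xi_max * 2.0 ** (-np.arange(q * n_octaves) / q)
    sigma = centers * (2.0 ** (1.0 / q) - 1.0)
    bank = np.exp(-0.5 * ((omega[None, :] - centers[:, None]) / sigma[:, None]) ** 2)
    return bank, centers

def gaussian_lowpass(n, T):
    """Fourier transform of a Gaussian window of duration roughly T samples."""
    omega = np.fft.fftfreq(n)
    return np.exp(-2.0 * (np.pi * omega * T / 4.0) ** 2)

def time_scattering(x, bank1, bank2, phi):
    """First- and second-order time scattering; each band is processed separately."""
    x1 = np.abs(np.fft.ifft(np.fft.fft(x)[None, :] * bank1, axis=1))           # (n1, n)
    x1_hat = np.fft.fft(x1, axis=1)
    s1 = np.real(np.fft.ifft(x1_hat * phi, axis=1))
    x2 = np.abs(np.fft.ifft(x1_hat[:, None, :] * bank2[None, :, :], axis=2))   # (n1, n2, n)
    s2 = np.real(np.fft.ifft(np.fft.fft(x2, axis=2) * phi, axis=2))
    return s1, s2

# Amplitude-modulated sinusoid: carrier and modulation rate in cycles per sample.
n = 2048
t = np.arange(n)
carrier, rate = 410 / n, 20 / n
x = (1.0 + 0.8 * np.cos(2 * np.pi * rate * t)) * np.cos(2 * np.pi * carrier * t)

bank1, centers1 = constant_q_bank(n, xi_max=0.35)    # first layer: acoustic frequencies
bank2, centers2 = constant_q_bank(n, xi_max=0.05)    # second layer: modulation frequencies
s1, s2 = time_scattering(x, bank1, bank2, gaussian_lowpass(n, T=512))

band = np.argmin(np.abs(centers1 - carrier))         # band closest to the carrier
profile = s2[band].mean(axis=1)                      # average response per second-layer filter
best = centers2[np.argmax(profile)]
print(f"modulation rate {rate:.4f}, strongest second-layer filter at {best:.4f}")
```

For simplicity the sketch computes all pairs of first- and second-layer filters; a practical implementation would typically keep only the pairs that carry significant energy.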
A representation similar to second-order time scattering is the constant-Q modulation spectrogram, which computes the spectrogram of each frequency band and averages using a constant-Q scale [5]. The cascade structure of alternating convolutions and modulus nonlinearities is also shared by convolutional neural networks, which enjoy significant success in many classification tasks [6, 7].
In addition to time-shift invariance, the scattering transform is also stable to time warping due to the constant-Q structure of the wavelets [3]. This is useful in audio classification where small deformations do not alter class membership.
In many audio classification tasks, such as speech recognition, classes are invariant to frequency transposition. In this case classifiers benefit from transposition-invariant descriptors. The time scattering transform is made invariant to transposition by computing a frequency scattering transform along , improving classification accuracy for such tasks [2].
While the time scattering transform successfully describes the average spectral envelope and amplitude modulation of a signal [2], it decomposes and averages each frequency band separately and so cannot capture the relationship between local temporal structure across frequency. Hence it does not adequately characterize more complex time-frequency phenomena, such as time-varying filters and frequency modulation.
To capture the variability of the scalogram across both time and log-frequency, we replace the one-dimensional wavelet transform in time with a two-dimensional wavelet transform in time and log-frequency. This follows the cortical model introduced by S. Shamma, where a sound is decomposed by the cochlea into a wavelet scalogram which is then convolved by two-dimensional Gabor filters in the auditory cortex [4]. Representations based on this cortical model have performed well in audio classification [8, 9], but often lack a mathematical justification.
Let us define the two-dimensional wavelet
(6) 
where for and . The time wavelet is calculated with a dilation by for as in (1), giving a Fourier transform centered at . For these wavelets, , although the notation remains the same. Similarly, we abuse notation and define the log-frequency wavelet by dilating a mother wavelet to get
(7) 
The identity of the wavelet will be clear from context.
The Fourier transform of is centered at the frequency . We shall refer to this “frequency” parameter associated with the log-frequency variable as a “quefrency,” with units of cycles per octave. Note that this is different from the standard quefrency, which is measured in seconds.
Since the two-dimensional Fourier transform of is centered at , it oscillates along the slope . Its support in time and log-frequency is by . Sample wavelets are shown in Figure 1(b). To ensure invertibility of the wavelet transform, the Fourier transforms of must cover a half-plane, hence the requirement that take negative values. The sign of determines the direction of oscillation.
The wavelet transform of is calculated through a two-dimensional convolution with . Taking the modulus gives
(8) 
where . Similarly to (5), second-order time-frequency scattering coefficients are computed by time-averaging, which yields
(9) 
Higher-order coefficients are obtained as before by repeating the above process. In contrast to the time scattering transform, the joint descriptor successfully captures the two-dimensional structure of the scalogram at time scales below .
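To make the role of the two-dimensional wavelet concrete, the sketch below applies a separable time and log-frequency wavelet to a toy image standing in for the scalogram. Everything here is an illustrative assumption: the Gaussian filters, the global time average used in place of a windowed lowpass filter, the synthetic plane-wave "scalograms", the sign convention for the quefrency, and the function names. It shows that the sign of the quefrency separates rising from falling ridges, whereas a wavelet transform applied along time to each band separately gives the same time-averaged output for both.

```python
import numpy as np

def gaussian_bump(freqs, center, rel_width):
    """Gaussian in the Fourier domain centered at `center` (which may be negative)."""
    return np.exp(-0.5 * ((freqs - center) / (rel_width * abs(center))) ** 2)

def joint_wavelet_modulus(x1, alpha, beta, rel_width=0.3):
    """Modulus of a 2-D wavelet transform of the image x1[band, time].

    The wavelet is a product of a time wavelet centered at alpha > 0 (cycles per
    sample) and a log-frequency wavelet centered at the quefrency beta (cycles
    per band), where beta may take either sign.
    """
    n_bands, n_time = x1.shape
    psi_time = gaussian_bump(np.fft.fftfreq(n_time), alpha, rel_width)
    psi_freq = gaussian_bump(np.fft.fftfreq(n_bands), beta, rel_width)
    psi_hat = psi_freq[:, None] * psi_time[None, :]
    return np.abs(np.fft.ifft2(np.fft.fft2(x1) * psi_hat))

def joint_scattering(x1, alphas, betas):
    """Time-averaged joint coefficients, indexed by (alpha, beta, band)."""
    out = np.empty((len(alphas), len(betas), x1.shape[0]))
    for i, alpha in enumerate(alphas):
        for j, beta in enumerate(betas):
            out[i, j] = joint_wavelet_modulus(x1, alpha, beta).mean(axis=1)
    return out

def time_wavelet_modulus(x1, alpha, rel_width=0.3):
    """Modulus of a 1-D wavelet transform along time, applied band by band."""
    psi_time = gaussian_bump(np.fft.fftfreq(x1.shape[1]), alpha, rel_width)
    return np.abs(np.fft.ifft(np.fft.fft(x1, axis=1) * psi_time, axis=1))

# Toy images: ridges whose band index increases (rising) or decreases (falling) with time.
n_bands, n_time = 32, 256
k, t = np.meshgrid(np.arange(n_bands), np.arange(n_time), indexing="ij")
a, b = 8 / n_time, 4 / n_bands
rising = 1.0 + np.cos(2 * np.pi * (a * t - b * k))
falling = 1.0 + np.cos(2 * np.pi * (a * t + b * k))

coeffs_rising = joint_scattering(rising, [a], [-b, b])[0].mean(axis=1)    # one value per beta
coeffs_falling = joint_scattering(falling, [a], [-b, b])[0].mean(axis=1)
print("rising  (beta = -b, +b):", coeffs_rising)
print("falling (beta = -b, +b):", coeffs_falling)
```

With the sign convention of this sketch, the rising pattern responds at the negative quefrency and the falling pattern at the positive one.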
To obtain frequency transposition invariance, it would suffice to average both and along using a frequency window. However, the amount of invariance needed may differ between classes. Since the invariant is created through a linear mapping – averaging along – a discriminative linear classifier can learn the proper amount of invariance for each class [2].
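The argument that a linear classifier can absorb the averaging rests on a simple identity: a linear score computed on averaged features equals a linear score computed on the unaveraged features with transformed weights. The NumPy sketch below checks this for a moving average; the window length, the circular boundary handling, and the random data are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bands = 16
s = rng.standard_normal(n_bands)     # stand-in for coefficients along log-frequency

# Moving average over 5 adjacent bands written as a matrix (circular boundary).
A = np.zeros((n_bands, n_bands))
for i in range(n_bands):
    for shift in range(-2, 3):
        A[i, (i + shift) % n_bands] = 1.0 / 5.0

w = rng.standard_normal(n_bands)     # weights of a linear classifier on averaged features
w_equiv = A.T @ w                    # equivalent weights acting on unaveraged features

score_averaged = w @ (A @ s)
score_unaveraged = w_equiv @ s
print(score_averaged, score_unaveraged)
```

Because the two scores coincide, a classifier trained on unaveraged coefficients can in principle represent any amount of averaging along log-frequency.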
Just as time scattering is invariant to deformation in time, the two-dimensional wavelet decomposition ensures that the frequency-averaged joint scattering transform is invariant to deformation of the scalogram in time and log-frequency. This is useful for many audio classification tasks, where classes are often invariant under small deformations of the scalogram.
3 Scattering time-frequency structure
We apply the joint time-frequency scattering transform to two signal models: a fixed excitation convolved with a time-varying filter and an unfiltered frequency-modulated excitation. Both represent nonseparable time-frequency structure and are insufficiently captured by the time scattering transform but well characterized by joint time-frequency scattering. These models do not account for more advanced structures such as polyphony and inharmonicity, but allow us to explore the basic properties of the joint scattering transform.
3.1 Time-varying filter
Let us consider a harmonic excitation
(10) 
of pitch . The signal is then given by applying a time-varying filter to , defined as
(11) 
Parseval’s theorem now gives
(12) 
where is the Fourier transform of along . Thus is the inverse Fourier transform of multiplied by a time-varying transfer function . These transforms are also known as pseudo-differential operators.
Time-varying filters appear in many audio signals and carry important information. For example, during speech production the vocal tract is deformed to produce a sequence of phones. This produces amplitude modulation, but also shifts formants in the spectral envelope, which can be modeled by a time-varying filter. Similarly, much of the instrument-specific information in a musical note is contained in the attack, which is often characterized by a changing spectral envelope. For these reasons, it is important for an audio descriptor to adequately capture time-varying filters.
For a suitable choice of we can show that
(13) 
where is the index of the partial closest to , while for small enough quefrencies
(14) 
where does not depend on . Here is a weighted and log-scaled version of given by . First-order coefficients thus provide the time-averaged amplitude of sampled at the partials since is non-negligible only for . Furthermore, the second order approximates the two-dimensional scattering coefficients of the modified filter transfer function , capturing its time-frequency structure.
In contrast, the time scattering transform only characterizes separable time-varying filters that can be written as the product of an amplitude modulation in time and a fixed filter. In this case the model reduces to the amplitude-modulated, filtered excitation considered in [2]. Time scattering and joint time-frequency scattering thus differ in that the latter captures the nonseparable structure of while the former only describes its separable structure.
To justify (13) and (14), we proceed as in [2], convolving (12) with and taking the modulus to obtain
(15) 
for smooth enough and . In this case at most one partial is found in the support of the wavelet so the sum only contains one nonnegligible term when . Averaging in time yields (13). Furthermore, we note that as a function of , the sequence of partials can be approximated at large scale by . For small , is very regular in . If is also smooth enough along , we can therefore replace the sum of partials by when convolving with . Rewriting the convolution using then yields
(16) 
Taking the modulus and averaging then gives (14).
3.2 Frequency modulation
We now consider an excitation of varying pitch
(17) 
At time , has instantaneous pitch and relative pitch variation . This carries important information in many sounds, such as tonal speech, bioacoustic signals, and music (e.g. for vibratos and glissandi). A good audio descriptor should therefore adequately describe such pitch changes.
For appropriate and , we can show that
(18) 
where as before. Furthermore, for large,
(19) 
where is independent of .
While first-order joint scattering coefficients provide an average of the instantaneous pitch over the interval of duration , the second order describes the rate of pitch variation . Indeed, for fixed and , is maximized along the line , and so captures this frequency modulation structure. The time scattering transform, in contrast, only captures the bumps in each frequency band induced by the varying pitch, ignoring its frequency structure.
To see why (18) and (19) hold, we linearize over the support of when decomposing (17), which gives
(20) 
provided that . As before, only the partial is contained in the frequency support of . Averaging in time gives (18). Each partial traces a curve along , so locally the scalogram can be approximated by sliding Dirac functions for some . Convolving along with for large enough to capture only one line gives . For a fixed , this is a complex exponential of instantaneous frequency multiplied by an envelope. Convolving this in time with a wavelet on whose support the envelope is approximately constant then gives
(21) 
Taking the modulus, we can replace with the lowpass filter . Assuming that is almost constant over an interval of duration , averaging gives (19).
We note that the time-varying filter and frequency modulation models in (12) and (17) are complementary. For small quefrencies , the joint scattering coefficients capture time-frequency structure over large frequency intervals, which is given by time-varying filters. Larger describe more localized behavior in log-frequency, like frequency modulation. This scale separation allows the joint scattering transform to simultaneously characterize both types of structures.
4 Time-shift invariant reconstruction
After having analyzed a given signal with a scattering transform, synthesizing a new signal from the invariant coefficients and highlights what information is captured in the representation — and, conversely, what is lost. In this section, we use a backpropagation algorithm on stationary audio textures to qualitatively compare the joint scattering transform with other architectures.
The reconstruction is first initialized with random noise, and then iteratively updated to converge to a local minimum of the functional
(22) 
with respect to . Since the forward computation of scattering coefficients consists of an alternated sequence of linear operators (wavelet convolutions) and modulus nonlinearities, the chain rule for gradient backpropagation yields a sequence of closed-form derivatives in the reverse order. The modulus nonlinearities are backpropagated by applying . In turn, the backpropagation of the wavelet transforms consists of convolving each frequency band by the complex conjugate of the corresponding wavelet and summing across bands [10].
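The backpropagation steps can be sketched for the simplest case of first-order coefficients only. The example below is an illustration under several assumptions that are not taken from the text: the objective is taken to be half the squared Euclidean distance between coefficients, the filters are Gaussian bumps in the Fourier domain with circular convolutions, the modulus is backpropagated by multiplying with the phase of the wavelet coefficients, and the descent uses a simple step-halving rule. All function names are hypothetical.

```python
import numpy as np

def constant_q_bank(n, q=4, n_octaves=4, xi_max=0.35):
    """Gaussian band-pass filters in the Fourier domain (assumed filter shape)."""
    omega = np.fft.fftfreq(n)
    centers = xi_max * 2.0 ** (-np.arange(q * n_octaves) / q)
    sigma = centers * (2.0 ** (1.0 / q) - 1.0)
    return np.exp(-0.5 * ((omega[None, :] - centers[:, None]) / sigma[:, None]) ** 2)

def gaussian_lowpass(n, T):
    """Fourier transform of a Gaussian window of duration roughly T samples."""
    return np.exp(-2.0 * (np.pi * np.fft.fftfreq(n) * T / 4.0) ** 2)

def first_order(y, bank, phi):
    """Wavelet coefficients u and first-order scattering coefficients of y."""
    u = np.fft.ifft(np.fft.fft(y)[None, :] * bank, axis=1)
    s1 = np.real(np.fft.ifft(np.fft.fft(np.abs(u), axis=1) * phi, axis=1))
    return u, s1

def loss_and_grad(y, target, bank, phi):
    """Half squared error between first-order coefficients, and its gradient in y."""
    u, s1 = first_order(y, bank, phi)
    r = s1 - target
    # Averaging: phi is real and even, so the filter is its own adjoint.
    g_m = np.real(np.fft.ifft(np.fft.fft(r, axis=1) * phi, axis=1))
    # Modulus: multiply by the phase u / |u| (guarded against division by zero).
    g_u = g_m * u / np.maximum(np.abs(u), 1e-12)
    # Wavelets: filter with the conjugate transfer functions and sum across bands.
    g_y = np.real(np.fft.ifft(np.fft.fft(g_u, axis=1) * np.conj(bank), axis=1)).sum(axis=0)
    return 0.5 * np.sum(r ** 2), g_y

def reconstruct(target, bank, phi, n, n_iter=200, step=0.5, seed=0):
    """Gradient descent from random noise; the step is halved whenever the loss rises."""
    y = 0.1 * np.random.default_rng(seed).standard_normal(n)
    loss, grad = loss_and_grad(y, target, bank, phi)
    history = [loss]
    for _ in range(n_iter):
        candidate = y - step * grad
        new_loss, new_grad = loss_and_grad(candidate, target, bank, phi)
        if new_loss < loss:
            y, loss, grad = candidate, new_loss, new_grad
        else:
            step *= 0.5
        history.append(loss)
    return y, history

n = 512
t = np.arange(n)
x = (1.0 + 0.8 * np.cos(2 * np.pi * 4 * t / n)) * np.cos(2 * np.pi * 0.2 * t)
bank, phi = constant_q_bank(n), gaussian_lowpass(n, T=128)
_, target = first_order(x, bank, phi)
y, history = reconstruct(target, bank, phi, n)
print(f"initial loss {history[0]:.4f}, final loss {history[-1]:.4f}")
```

Extending the sketch to second-order or joint coefficients would repeat the same three backpropagation steps once per layer, in reverse order of the forward computation.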
To illustrate, we have synthesized a bird song recording using different scattering transforms. Here and is of the order of three bird calls (see Figure 2(a)). First-order coefficients yield the reconstruction in Figure 2(c). This fits the averaged mel-frequency spectrum of the target sound. Although this is sufficient when is the realization of a Gaussian process, it does not convey the typical intermittency in natural sounds. This is partly mitigated by adding second-order coefficients, giving the reconstruction in Figure 2(d), since these encode the amplitude modulation spectra in each acoustic subband. However, these spectra are not synchronized across subbands, so time scattering tends to synthesize auditory textures made of decorrelated impulses. In contrast, we observe that the reconstruction from joint scattering coefficients in Figure 2(e) is able to capture coherent structures in the time-frequency plane, such as joint modulations in amplitude and frequency. Notably, because of their chirping structure, bird calls are better synthesized with joint scattering. Indeed, recalling (19), chirps are represented with few nonzero coefficients in the basis of joint time-frequency wavelets. We believe that audio resynthesis is greatly helped by this gain in sparsity. More experiments are available at http://www.di.ens.fr/data/scattering/audio/.
McDermott and Simoncelli [11] have built an audio texture synthesis algorithm based on a scattering-like transform along time, of which they compute cross-correlation statistics across and across , as well as marginal moments (variance and skewness). Their representation is also able to synchronize frequency bands and recover amplitude modulation. Nevertheless, asymmetry in frequency modulation is lost. Indeed, while all bird calls from the original recording have an ascending instantaneous frequency, some of the chirps reconstructed with their method descend instead. Moreover, the higher-order statistics on which they rely are unstable to deformations and hence not suitable for classification purposes. In this section, we have shown that joint scattering may achieve comparable or better quality in audio resynthesis while using only stable features.
On the negative side, it must be noted that joint scattering is insufficient to capture temporal changes in harmonic structure. Indeed, partial tones which are several octaves apart are not likely to be correctly in tune, a limitation that we shall specifically address in future work.
5 Classification
We evaluate the performance of the joint time-frequency scattering representation on phone segment classification using the TIMIT dataset [12]. The corpus consists of phrases, each of which has its constituent phone segments labeled with their position, duration, and identity. Given a position and duration, we want to identify the phone contained in the segment. This task is easier than the problem of continuous speech recognition, but provides a straightforward framework when evaluating signal representations for speech.
We follow the same setup as in [2]. Each phone is represented by a given descriptor applied to a millisecond window centered on the phone along with the phone’s log-duration. A Gaussian support vector machine (SVM) is used as a classifier through the LIBSVM library [13].
The SVM is a discriminatively trained, locally linear classifier. This means that, given enough training data, an SVM can learn the amount of averaging needed along to gain the desired invariance [2]. We therefore present results for scattering transforms without averaging along .
Table 1 shows the results of the classification task. MFCCs calculated over the segment with a window size of and concatenated to yield a single feature vector provide a baseline error rate of . The non-scattering state of the art achieves and is obtained using a committee-based hierarchical discriminative classifier on MFCC descriptors [14]. A convolutional network classifier applied to the log-scalogram with learned filters obtains [7].
Representation  Error rate (%) 

MFCCs  
State of the art (excl. scattering) [14]  
Time Scattering  
Time Scattering + Freq. Scattering  
Joint Time-Freq. Scattering 
The time scattering transform is computed with and up to the second order. As in previous experiments, we compute the logarithm of the scattering [2]. Since it better captures amplitude modulation, results improve with respect to MFCCs, achieving an error of .
Applying an unaveraged frequential scattering transform along up to a scale of octaves and computing the logarithm yields an error rate of . As discussed earlier, transposition invariance counters speaker variability, and so improves performance. However, the frequency scattering is computed along of a time scattering transform which has been averaged in time, so its discriminability also suffers from not capturing local correlations across frequencies.
Computing the joint time-frequency scattering transform for octaves yields an error of , an improvement compared to the time scattering transform with scattering along log-frequency. This illustrates the importance of the complex time-frequency structure that is captured by the joint scattering transform, and can be partly explained by the fact that the onset of many phones is characterized by rapid changes in formants, which can be modeled by time-varying filters. As we saw earlier, these are better described by time-frequency scattering compared to time scattering. However, the small window size limits the loss of time-frequency structure in the time scattering transform. We therefore expect a greater improvement for tasks involving larger time scales.
The previous state of the art was obtained at using a scattering transform with multiple Q factors [2]. This more ad hoc descriptor has many similarities with the joint scattering transform, but is difficult to study analytically.
6 Conclusion
We introduced the joint time-frequency scattering transform, which is a time-shift invariant representation stable to time-frequency warping. This representation characterizes time-varying filters and frequency modulation. Reconstruction experiments show how it successfully captures complex time-frequency structures of locally stationary signals. Finally, phone segment classification results demonstrate the value of adequately representing these structures for classification.
References
 [1] S.B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 357–366, 1980.
 [2] J. Andén and S. Mallat, “Deep scattering spectrum,” IEEE Trans. Sig. Proc., vol. 62, pp. 4114–4128, 2014.
 [3] S. Mallat, “Group invariant scattering,” Comm. Pure Appl. Math., vol. 65, no. 10, pp. 1331–1398, 2012.
 [4] T. Chi, P. Ru, and S. Shamma, “Multiresolution spectrotemporal analysis of complex sounds,” J. Acoust. Soc. Am., vol. 118, no. 2, pp. 887–906, 2005.
 [5] J. Thompson and L. Atlas, “A nonuniform modulation transform for audio coding with increased time resolution,” in IEEE Int. Conf. on Acoust. Speech, and Sig. Proc., 2003, vol. 5, pp. V–397.
 [6] Y. LeCun, K. Kavukcuoglu, and C. Farabet, “Convolutional networks and applications in vision,” in IEEE Int. Symp. on Circuits and Syst., 2010.
 [7] H. Lee, P. Pham, Y. Largman, and A. Ng, “Unsupervised feature learning for audio classification using convolutional deep belief networks,” in Proc. NIPS, 2009.
 [8] N. Mesgarani, M. Slaney, and S. Shamma, “Discrimination of speech from nonspeech based on multiscale spectrotemporal modulations,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 3, pp. 920–930, 2006.
 [9] M. Kleinschmidt and D. Gelbart, “Improving word accuracy with Gabor feature extraction,” in Interspeech, 2002.
 [10] J. Bruna and S. Mallat, “Audio texture synthesis with scattering moments,” arXiv:1311.0407, 2013.
 [11] J. McDermott and E. Simoncelli, “Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis,” Neuron, vol. 71, no. 5, pp. 926–940, 2011.
 [12] W.M. Fisher, G.R. Doddington, and K.M. Goudie-Marshall, “The DARPA speech recognition research database: specifications and status,” in Proc. DARPA Workshop on Speech Recognition, 1986, pp. 93–99.
 [13] C. Chang and C. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. on Intell. Syst. and Technol., vol. 2, pp. 27:1–27:27, 2011, Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
 [14] H. Chang and J. Glass, “Hierarchical large-margin Gaussian mixture models for phonetic classification,” in Proc. ASRU. IEEE, 2007, pp. 272–277.