# Optimizing Short-Time Fourier Transform Parameters via Gradient Descent

## Abstract

The Short-Time Fourier Transform (STFT) has been a staple of signal processing, often being the first step for many audio tasks. A very familiar process when using the STFT is the search for the best STFT parameters, since poor choices can significantly degrade the resulting representation. These parameters are often defined in terms of an integer number of samples, which makes their optimization non-trivial. In this paper we show an approach that allows us to obtain a gradient for STFT parameters with respect to arbitrary cost functions, and thus enables gradient descent optimization of quantities like the STFT window length, or the STFT hop size. We do so for parameter values that stay constant throughout an input, but also for cases where these parameters have to dynamically change over time to accommodate varying signal characteristics.

An Zhao, Krishna Subramani, Paris Smaragdis

University of Illinois at Urbana-Champaign, Adobe Research

**Keywords:** STFT, gradient descent, adaptive transforms


## 1 Introduction

With the advent of deep learning we have seen a dramatic shift in signal processing towards incorporating neural net-like learning (e.g. with Differentiable DSP research [4]). Although many parts of signal processing fit well into that framework, some parameters that we often use are not as easy to optimize. These usually include parameters defined in terms of samples, such as frame sizes, hop sizes, etc. The Short-Time Fourier Transform [1] is a prime example of this. Previous work has focused on automatically finding optimal STFT parameters, e.g. using dynamic programming to obtain the best window positions [8], or detecting signal non-stationarity to adjust the analysis parameters [2]. However, most of that work is based on heuristics and local search, and is not compatible with gradient descent-style optimization that can be used to jointly optimize entire end-to-end systems.

Here, we propose a couple of approaches for optimizing STFT parameters via gradient descent. We show that these can be used for any appropriately defined loss function, and can be easily incorporated in larger systems.

## 2 An optimizable STFT

As is well known, due to the time/frequency trade-off, picking the wrong STFT window size can result in increased smearing across the time or frequency axis, which in turn creates a poor representation of the data. Having the wrong parameters can not only result in a non-legible transform, but also provide a poor feature representation for further processing. Here we will present a formulation of the STFT that will allow us to directly optimize parameters such as the window length, and optimize it with respect to an arbitrary differentiable loss function.

For an input signal $x(n)$, we will define the STFT analysis as:

$$X(t,f) = \sum_{n} x(n)\, w_t(n)\, e^{-j 2 \pi f n / N} \qquad (1)$$

where $n$ is the sample position, $N$ is the DFT length, $w_t(n)$ is the analysis window function centered at the $t$-th sample, and $X(t,f)$ is the resulting transform at time $t$ and frequency $f$. For constant-sized windows and hop sizes, all the $w_t$ are the same function for all $t$ (e.g. a Hann window) and the STFT will sample $t$ at fixed intervals. We use this notation, however, to facilitate the use of windows whose shape is dependent on $t$, as we will in later sections.
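As a reference point, the constant-window case of this analysis can be sketched in a few lines of NumPy (`stft` here is our own helper for illustration, not a library call, and the frame/hop values are arbitrary):

```python
import numpy as np

def stft(x, window, hop):
    """Constant-window STFT sketch: slide the window by `hop` samples
    and take the DFT of each windowed frame."""
    N = len(window)
    n_frames = 1 + (len(x) - N) // hop
    frames = np.stack([x[t * hop : t * hop + N] * window
                       for t in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)   # X[t, f]

x = np.sin(2 * np.pi * 0.05 * np.arange(4000))
X = stft(x, np.hanning(256), hop=64)
print(X.shape)   # (59, 129): 59 frames, 129 non-negative frequency bins
```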

Using this definition, we will consider two distinct cases: one in which we are trying to estimate STFT parameters that are constant throughout the analysis (i.e. a fixed hop size, window size, or window shape), and later on the case where the STFT parameters dynamically change in order to adapt to the input signal. We start with the former since it is an easier formulation that can help lead to the next one.

## 3 Optimizing for constant STFT parameters

Here we describe how we can optimize the STFT assuming the window parameters are constant throughout the transform (e.g. using a constant size transform throughout the duration of the signal). We will outline the steps for obtaining a gradient for integer parameters like the window size, and will demonstrate this using a sparsity cost function.

Traditionally the STFT uses integer window sizes and integer hop sizes, and the window function, as used in equation 1, is a fixed function of the window size. If we wish to optimize these quantities using gradient descent, this is a problem, since the involved variables are not continuous. By using an underlying continuous variable to derive both the window function and the window size, we can make the STFT parameters differentiable, even though the computed sizes remain discrete. In the case of the STFT this is relatively straightforward, as shown below.

In order to obtain a meaningful differentiable setup, we can define $w_t(n)$ in equation 1 to be a Gaussian window function:

$$w_t(n) = e^{-\frac{(n-t)^2}{2 \sigma^2}} \qquad (2)$$

Note that by doing this we do not make direct use of the window length; we instead use the continuous parameter $\sigma$ as a proxy. Since the value of this window is effectively zero far from its center, when computing the transform that uses this window we can safely truncate it to zero for, e.g., $|n-t| > 3\sigma$, and assume that it has a length of $6\sigma$ samples (the non-zero region between $t-3\sigma$ and $t+3\sigma$). By computing the results using the truncated window, but optimizing with respect to the infinite one, we can optimize the effective DFT length of the transform. In the examples here we fix the hop size to a constant fraction of the window length, but that can be changed to any desired value.
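A minimal NumPy sketch of such a truncated Gaussian window, assuming a $3\sigma$ truncation point so that the effective length is about $6\sigma$ samples (`gaussian_window` is our own helper name):

```python
import numpy as np

def gaussian_window(sigma, trunc=3.0):
    """Truncated Gaussian window: effective length ~ 2*trunc*sigma samples.
    sigma is the continuous proxy; the computed length remains discrete."""
    half = int(np.ceil(trunc * sigma))
    n = np.arange(-half, half + 1)
    return np.exp(-n**2 / (2.0 * sigma**2))

w = gaussian_window(sigma=20.0)
print(len(w))               # 121 samples for sigma=20, trunc=3
print(w[len(w) // 2], w[0])  # 1.0 at the center, ~e^{-4.5} at the edge
```

The key point is that `sigma` can move continuously while `half` (the discrete length) is re-derived from it each time.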

### 3.1 Optimizing for sparsity

As an illustrative example we will consider the case where we wish to find STFT parameters that result in the sparsest STFT magnitudes. This is often a desirable property in time-frequency analysis [5, 3, 11], and it allows us to show how we can take a loss function and directly optimize an STFT parameter using gradient descent.

We will evaluate the sparsity of the transform using the following measure:

$$\kappa_t = \frac{\| X_t \|_4}{\| X_t \|_2} \qquad (3)$$

which is a measure of concentration as defined in [6] and [7] that is derived from the kurtosis (itself a measure of sparsity). In the equation above, $\| X_t \|_4$ and $\| X_t \|_2$ are the $\ell_4$ and $\ell_2$ norms of each time slice $X_t = X(t, \cdot)$ of the STFT:

$$\| X_t \|_p = \Big( \sum_f | X(t,f) |^p \Big)^{1/p} \qquad (4)$$
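The $\ell_4/\ell_2$ concentration described above is easy to compute per frame; a NumPy sketch with our own helper name, using two extreme spectra to show that a one-hot frame scores higher than a flat one:

```python
import numpy as np

def concentration(X):
    """Per-frame l4/l2 ratio of the magnitudes: larger = sparser frame."""
    mag = np.abs(X)
    l4 = (mag**4).sum(axis=-1) ** 0.25
    l2 = (mag**2).sum(axis=-1) ** 0.5
    return l4 / l2

# A one-hot spectrum is maximally concentrated, a flat one is not:
sparse = concentration(np.array([[0.0, 0.0, 1.0, 0.0]]))
flat = concentration(np.ones((1, 4)) / 2.0)
print(sparse[0], flat[0])   # 1.0 vs ~0.707
```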

In practice, we will sample $t$ under the constraint that the windows have reasonable overlap, and will then sum their concentration as the final measure. The sparsity loss function over the entire input with the proposed window function is therefore defined as:

$$\mathcal{L}(\sigma) = \sum_t \kappa_t \qquad (5)$$

And since the entire analysis is now differentiable, we can easily propagate gradients and optimize $\sigma$ to maximize $\mathcal{L}$. By doing so we can find the sparsest representation by effectively adjusting the transform's window length.
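The whole pipeline can be sketched with PyTorch autograd. This is a simplified illustration under our own assumptions (a $3\sigma$ truncation, a quarter-window hop, Adam, a pure-tone input, and a clamp to keep $\sigma$ in a valid range), not the paper's implementation; the essential trick is that the discrete frame length comes from `sigma.item()` (no gradient) while gradients flow through the window values:

```python
import torch

def sparsity_loss(x, sigma, hop_frac=0.25):
    """Negative summed frame concentration of a Gaussian-window STFT."""
    half = int(3 * sigma.item())                 # discrete length, no gradient
    n = torch.arange(-half, half + 1, dtype=torch.float32)
    w = torch.exp(-n**2 / (2 * sigma**2))        # differentiable in sigma
    hop = max(1, int(hop_frac * len(n)))
    frames = x.unfold(0, len(n), hop) * w        # overlapping windowed frames
    X = torch.fft.rfft(frames, dim=-1).abs()
    kappa = (X**4).sum(-1) ** 0.25 / (X**2).sum(-1) ** 0.5
    return -kappa.sum()                          # minimize = maximize sparsity

x = torch.sin(2 * torch.pi * 0.01 * torch.arange(8000))
sigma = torch.tensor(20.0, requires_grad=True)
opt = torch.optim.Adam([sigma], lr=0.1)
for _ in range(40):
    opt.zero_grad()
    sparsity_loss(x, sigma).backward()
    opt.step()
    with torch.no_grad():                        # keep sigma positive
        sigma.clamp_(min=2.0)
print(float(sigma))
```

Each iteration re-derives the truncated support from the current $\sigma$, so the effective window length tracks the continuous parameter.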

Applying this to various signals reliably results in a window length that creates the most appropriate STFT representation, as in, for example, the middle plot in Figure 1. Of course, in this particular case, a direct search over the window size would be faster and easier to perform. In order to demonstrate a more realistic use of this idea we will present it within a more complex context in the next section.

### 3.2 Classification Experiments

We will now examine a more involved case in which we want to tune the STFT parameters in order to best serve a subsequent estimator that is trained jointly. We will do so in a simple sound classification setting. Our goal this time is to find the STFT parameters that optimize a classifier's ability to discern between two sound classes. Instead of the sparsity measure, we will now use the classifier's loss as the cost function, and we will be optimizing the size of the STFT windows as well as the classifier parameters simultaneously.

More formally, we will use the same STFT formulation with the Gaussian window as in the previous section, and a classifier function that operates on each frame of the STFT and provides a prediction $\hat{y}_c(t)$ for each class $c$ and time $t$. For our experiments ahead, we will consider a simple linear classifier mapping each input spectral frame (appropriately zero padded to ensure a constant input dimension to the classifier) to class scores, followed by a softmax to give the output $\hat{y}$. We can write a loss function that describes the accuracy of the classifier, and we will also add to it a regularizing term to avoid very small DFT sizes and ensure efficient processing. The overall loss then becomes:

$$\mathcal{L} = -\sum_t \sum_c y_c(t) \log \hat{y}_c(t) + \frac{\lambda}{\sigma} \qquad (6)$$

which is the typical cross-entropy loss ($y$ being the ground truth and $\hat{y}$ being the network output) with an extra term that penalizes small window sizes. The constant $\lambda$ defines the strength of the regularizer (which in this experiment is set to 0.1). Using this loss as a guide, we want to find the optimal STFT window size, through $\sigma$, along with the optimal parameters for the classifier simultaneously.
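A PyTorch sketch of such a joint loss, assuming the window-size penalty takes a $\lambda/\sigma$ form (small $\sigma$ means a large penalty); `joint_loss` is a hypothetical helper with made-up frame counts, not the paper's code:

```python
import torch
import torch.nn.functional as F

def joint_loss(logits, labels, sigma, lam=0.1):
    """Cross-entropy on the classifier output plus a term that
    discourages very small window sizes (assumed lambda/sigma form)."""
    return F.cross_entropy(logits, labels) + lam / sigma

logits = torch.randn(8, 2)              # 8 STFT frames, 2 classes
labels = torch.randint(0, 2, (8,))
sigma = torch.tensor(32.0, requires_grad=True)
joint_loss(logits, labels, sigma).backward()
print(float(sigma.grad))   # negative: gradient descent pushes sigma up
```

Because the regularizer is the only $\sigma$-dependent term here, its gradient $-\lambda/\sigma^2$ is always negative, so descent grows the window until the classification term pushes back.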

Let us consider a simple input which consists of 2 alternating sinusoids of frequencies $f_1$ and $f_2$, both having the same segment length $L$. We want to build a classifier which, when given input STFT frames of the signal, would classify them as being $f_1$ or $f_2$ respectively. We know that to discriminate these signals from their spectra, we need to consider a window length $N \le L$, because any window length greater than $L$ would straddle both sinusoids and smear the spectra over time, which would make it tough for the classifier to distinguish the two classes. Likewise, for very short windows we will observe increasing smearing along the frequency axis, which can also impede classification (examples of each case can be seen in Figure 2). However, for a sufficiently long window with $N \le L$, the obtained spectral frames contain enough information to successfully discriminate the input signals. The more general case of this problem is a common issue when performing sound classification (or other tasks in the time/frequency domain), since the choice of the STFT parameters can significantly bias the results. We use this simple example as an illustration of the problem with no loss of generality.

Instead of picking $\sigma$ a priori (which in practice we would not know how to do), we obtain a gradient for it by differentiating $\mathcal{L}$, and then let the model figure out what the optimal value should be as it simultaneously adjusts the classification parameters. We verify that this indeed behaves as expected with a simple experiment: starting the optimization from multiple initial values of $\sigma$, we observe that they all quickly converge to the optimal value as the classifier is being trained (Figure 3).

At this point we need to make a very important observation. Had we wanted to perform an exhaustive search for the optimal value of $\sigma$, we would have to retrain the classifier for each candidate value. Instead, by jointly optimizing both $\sigma$ and the classifier, we vastly reduce the number of forward passes that we need to perform. For this experiment the classifier converges after about 1,500 iterations, regardless of whether we use a fixed $\sigma$ or one that is concurrently optimized. In effect, we speed up the search for an optimal $\sigma$ by a factor as large as the number of candidate values we would otherwise have to evaluate. Given that today we often work with systems that can take days to train, such a speedup can be significant.

## 4 Optimizing for dynamically changing parameters

Often, an input signal changes over time and that necessitates changing the STFT parameters in response. For example, a signal might exhibit low-frequency elements that move slowly, which suggests a long analysis window, but at a later segment it might contain short-term events which necessitate shorter analysis windows. In this section we will show how one can optimize for a continuously changing window (or hop) size using gradient descent. We will use again the sparsity cost as above, but this time the input signal will necessitate different settings at varying times. Instead of obtaining one parameter value that is globally optimal, we will instead produce a set of locally optimal values resulting in an STFT with a dynamically changing analysis window. In order to achieve such dynamically distributed windows, our framework needs to have three degrees of freedom: number of windows, length of each window, and overlaps between windows.

To accommodate that, we introduce the idea of a mapping function, which maps the index of each window to its corresponding sample position in the input sound. For example, for equally spaced windows this function would be a simple linear relation between the order index of a window and the input time index on which that window is centered. So the first window (order index $0$) would be centered at sample index $0$, whereas the $i$-th window would be centered at sample $i \cdot h$, where $h$ is the STFT hop size. If the slope of this relationship gets steeper (larger $h$), the windows become longer and more sparsely distributed, whereas a shallower slope (smaller $h$) results in closely packed windows. If this mapping function is not a straight line, then the windows will not be uniformly distributed, which is what we will use in this section.
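As a small numerical illustration of this mapping idea (with made-up hop values), a linear map yields the usual uniform spacing, while any other monotonic map spaces the window centers non-uniformly, with the local hop given by consecutive differences:

```python
import numpy as np

def centers(m, n_windows):
    """Window centers under a mapping function m: index -> sample position."""
    return m(np.arange(n_windows))

uniform = centers(lambda i: 256 * i, 5)        # linear map: constant hop 256
warped = centers(lambda i: (i ** 1.5) * 100, 5)  # convex map: growing hops
print(uniform)           # [0 256 512 768 1024]
print(np.diff(warped))   # local hops grow along the signal
```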

In this setting, $w_t$ from the parameterized STFT in equation 1 will contain such a mapping function, which maps each window to an arbitrary location of our input signal. We will use the trapezoid window as an example for our adaptive STFT, since it allows us to incorporate a variable hop size while ensuring a constant overlap between our windows [9]. Depending on one's constraints, other window formulations are also possible. Our trapezoid window function is defined as:

$$w_i(n) = \begin{cases} \dfrac{n - q_{i-1}}{p_i - q_{i-1}} & q_{i-1} \le n < p_i \\[4pt] 1 & p_i \le n \le q_i \\[4pt] \dfrac{p_{i+1} - n}{p_{i+1} - q_i} & q_i < n \le p_{i+1} \\[4pt] 0 & \text{otherwise} \end{cases} \qquad (7)$$

where $p_i$ is the sample position representing the start of the flat region of the window with order index $i$, and $q_i$ is the sample position at the end of that flat region, which is also the start of the rising slope for the next window with index $i+1$. For each window $i$, $q_{i-1}$ is the beginning of the window and $p_{i+1}$ is the end of the window. By using this formulation we can adjust the slopes of the windows so that they overlap-add to 1. The relationship between the mapping function and the trapezoid windows is illustrated in Figure 4. Note that the mapping function only needs to encode the $p_i$; we get all the $q_i$ via a different mechanism.^{2}
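The overlap-add property is easy to check numerically. Below is a NumPy sketch with hypothetical breakpoints $p_i$, $q_i$ (the values are made up for illustration); each window rises from $q_{i-1}$ to $p_i$, stays flat until $q_i$, and falls until $p_{i+1}$, so interior windows sum to exactly 1:

```python
import numpy as np

def trapezoid_windows(p, q, length):
    """Build interior trapezoid windows from flat-region breakpoints."""
    n = np.arange(length)
    wins = []
    for i in range(1, len(p) - 1):
        rise = np.clip((n - q[i - 1]) / (p[i] - q[i - 1]), 0, 1)
        fall = np.clip((p[i + 1] - n) / (p[i + 1] - q[i]), 0, 1)
        wins.append(np.minimum(rise, fall))   # 0 outside [q_{i-1}, p_{i+1}]
    return np.stack(wins)

p = np.array([0, 30, 90, 180, 300], float)    # starts of flat regions
q = np.array([10, 50, 120, 220, 310], float)  # ends of flat regions
W = trapezoid_windows(p, q, 320)
total = W.sum(axis=0)
print(np.allclose(total[30:221], 1.0))   # True in the fully covered region
```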

Since the estimation of a flexible mapping function should not be constrained by a simple parametric form, we use Unconstrained Monotonic Neural Networks (UMNN) [10] to represent it. UMNNs produce monotonic functions by integrating neural networks with strictly positive outputs, which matches our expectation that windows are ordered from left to right. Using this formulation we can represent arbitrary mapping functions while maintaining the ability to differentiate the entire process. Since there is no enforced constraint that the first and last windows will be perfectly positioned at the start and end of the input signal, in practice we estimate the mapping function and zero pad the ends accordingly to accommodate windows that extend past the range of our input.

We would also like this system to be able to freely push windows out of, and squeeze windows in from, both ends of the signal. To achieve that, we map the zero window index to the center of the input sequence (as opposed to the start). That allows the mapping function to freely introduce or remove windows at either end of the signal by manipulating the map on both sides. Had we instead clamped the zero window index at the start of the signal, we would not be able to introduce new windows at the beginning, which would constrain our optimization. This is more of an implementation detail, and does not change the UMNN model, since it simply involves reinterpreting the window index. While the UMNN provides us with estimates of the $p_i$ for each window, we use a simple feedforward neural network that processes these and outputs their corresponding $q_i$. This makes the entire process fully differentiable, and once we define a cost function we can directly optimize dynamically changing window and hop sizes.
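The monotonicity idea can be illustrated with a crude PyTorch stand-in for the UMNN: instead of a proper numerical integration we accumulate strictly positive network outputs with a cumulative sum, which likewise guarantees an increasing index-to-position map (all names here are ours, and this is not the paper's model):

```python
import torch

class MonotonicMap(torch.nn.Module):
    """Simplified UMNN-style map: m(i) accumulates positive step sizes,
    so window positions are guaranteed to stay ordered left to right."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(1, hidden), torch.nn.Tanh(),
            torch.nn.Linear(hidden, 1), torch.nn.Softplus())  # output > 0

    def forward(self, i):
        steps = self.net(i.unsqueeze(-1)).squeeze(-1)  # positive increments
        return torch.cumsum(steps, dim=0)              # strictly increasing

m = MonotonicMap()
idx = torch.arange(10, dtype=torch.float32)
pos = m(idx)
print(bool((pos[1:] > pos[:-1]).all()))   # True: centers stay ordered
```

The true UMNN integrates the positive network over a continuous variable, which this discrete cumulative sum only approximates, but the differentiability and monotonicity properties carry over.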

### 4.1 Sparsity Experiments

To verify that the proposed method works, we once again optimize for a sparse STFT output as an illustration, but this time using a signal that requires different analysis settings at each part. The loss we will use is now defined as:

$$\mathcal{L} = \sum_i \kappa_i \qquad (8)$$

i.e. we are simply adding the frame-wise concentrations. One last issue to address is that this particular loss function produces many local optima, since different window distributions can result in similar levels of sparsity. This usually happens when a single frame dominates the summation in equation 8. In order to address this issue we clip large values of $\kappa_i$ to the Frobenius norm of the concentrations of all frames, which for this particular loss function eliminates issues with local optima.

We show results from three example inputs. All of these examples include a signal that locally necessitates a different window size, so using a fixed STFT window size throughout would be suboptimal in certain sections. First we use a simple signal, consisting of an alternating chirp and a constant-frequency sinusoid, as shown in Figure 5. The chirp portions are best described by a short analysis window that captures the temporal changes, whereas the constant-frequency parts are best described by a longer window that minimizes frequency smearing. By optimizing the window sizes we get the results shown in the same figure. We see that training via gradient descent added more windows in the chirp sections and used much longer windows in the sinusoid sections.

A second example is also shown in Figure 5. Here we have an exponential chirp, which means that as the frequency of the input rises the speed of the frequency change also grows. This means that we will need increasingly smaller windows to properly represent the rapid change of frequency without smearing the spectral estimates over time. As can be seen from the plots, by using gradient descent our approach results in an optimal decomposition, where the window size shrinks over time to accommodate the input’s characteristics.

Lastly, we show a third example using real sounds. In this case we have some drum sounds in the first half and piano chords in the second half. The section with the drum sounds requires a range of window sizes, from large to accommodate pitched drums (at the start), to short for impulsive sounds (around sample index 30,000). The piano section requires longer windows to best describe the low sustained chords. As shown in Figure 6 our proposed approach again finds appropriate windows for each section.

## 5 Discussion

In this paper we showed various approaches that allow us to use gradient descent optimization on the parameters of an STFT analysis. The significance of this work is that it provides a way to include STFT analysis parameters in broader optimization contexts, e.g. as trained parameters of a neural net that accepts the resulting STFT as input. As shown in our experiments, by jointly optimizing the STFT with the subsequent task at hand, we can obtain the optimal STFT values with far fewer evaluations than performing an exhaustive search. We hope that with this approach one can incorporate the search for optimal STFT values in a global optimization setting, and thus eliminate what is almost always a slow, manual, exhaustive search.

### Footnotes

- Code: https://github.com/SubramaniKrishna/STFTgrad
- We do not encode the $q_i$ in the mapping function so that we can facilitate other types of window parameters which might not be location-based; e.g. the second parameter could instead have been a window length factor.

### References

- [1] A unified approach to short-time Fourier analysis and synthesis. Proceedings of the IEEE, 65(11), pp. 1558–1564, 1977.
- [2] An adaptive time-frequency analysis scheme for improved real-time speech enhancement. In Proc. IEEE ICASSP, pp. 6265–6269, 2014.
- [3] Sparse time-frequency representations for polyphonic audio based on combined efficient fan-chirp transforms. Journal of the Audio Engineering Society, 67(11), pp. 894–905, 2019.
- [4] DDSP: Differentiable Digital Signal Processing. arXiv:2001.04643, 2020.
- [5] Sparse time-frequency representations. Proceedings of the National Academy of Sciences, 103(16), pp. 6094–6099, 2006.
- [6] A simple scheme for adapting time-frequency representations. IEEE Transactions on Signal Processing, 42(12), pp. 3530–3535, 1994.
- [7] Kurtosis based time-frequency analysis scheme for stationary or non-stationary signals with transients. Information Technology Journal, 12, pp. 1394–1399, 2013.
- [8] Adaptive short-time analysis-synthesis for speech enhancement. In Proc. IEEE ICASSP, pp. 4905–4908, 2008.
- [9] Spectral Audio Signal Processing. W3K Publishing, 2011. http://ccrma.stanford.edu/~jos/sasp
- [10] Unconstrained monotonic neural networks. In Advances in Neural Information Processing Systems 32, pp. 1545–1555, 2019.
- [11] Sparse time-frequency representations in audio processing, as studied through a symmetrized lognormal model. In Proc. 15th European Signal Processing Conference, pp. 355–359, 2007.