Cross-modal variational inference for bijective signal-symbol translation
Abstract
Extraction of symbolic information from signals is an active field of research enabling numerous applications, especially in the Music Information Retrieval domain. This complex task, which is also related to other topics such as pitch extraction or instrument recognition, is a demanding subject that gave birth to numerous approaches, mostly based on advanced signal-processing algorithms. However, these techniques are often non-generic, allowing the extraction of definite physical properties of the signal (pitch, octave), but not arbitrary vocabularies or more general annotations. On top of that, these techniques are one-sided, meaning that they can extract symbolic data from an audio signal, but cannot perform the reverse process of symbol-to-signal generation. In this paper, we propose a bijective approach for signal/symbol translation by turning this problem into a density estimation task over signal and symbolic domains, considered both as related random variables. We estimate this joint distribution with two different variational autoencoders, one for each domain, whose inner representations are forced to match with an additive constraint, allowing both models to learn and generate separately while enabling signal-to-symbol and symbol-to-signal inference. In this article, we test our models on pitch, octave and dynamics symbols, which constitute a fundamental step towards music transcription and label-constrained audio generation. In addition to its versatility, this system is rather light during training and generation while allowing several interesting creative uses that we outline at the end of the article.
Axel Chemla–Romeu-Santos^{1,2}, Stavros Ntalampiras^{1}, Philippe Esling^{2}, Goffredo Haus^{1}, Gérard Assayag^{2}
^{1} Laboratorio di Informatica Musicale (LIM)
UNIMI, Milano, Italy
axel.chemla@unimi.it stavros.ntalampiras@unimi.it goffredo.haus@unimi.it
^{2} IRCAM - CNRS UMR 9912
Sorbonne Université, Paris, France
esling@ircam.fr assayag@ircam.fr
1 Introduction
Music Information Retrieval (MIR) is a growing domain of audio processing that aims to extract information (labels, symbolic or temporal features) from audio signals [downie2003music, 8665366]. This field embeds both musical and scientific challenges, paving the way for a large variety of tasks. Its abundant industrial and creative applications [casey2008content] have attracted the interest of a large number of researchers with plentiful results. Among the diverse subtasks included in MIR, music transcription comprises an active research field [klapuri2007signal, benetos2013automatic] which is not only interesting by itself but also finds generic applicability as a subtask for other MIR objectives (cover recognition, key detection, symbolic analysis). Music transcription can be described as associating symbols to audio signals composed of one or more musical instruments. Thus, this field embeds pitch and multi-pitch estimation tasks but also other musical dimensions, such as dynamics. Currently, most pitch estimation techniques are based on fundamental frequency detection [de2002yin]. However, such approaches may prove insufficient in multi-pitch contexts, where the need for more sophisticated approaches appears crucial.
In parallel, the recent rise of generative systems provided interesting alternatives to supervised machine learning approaches focusing on classification [bengio2013generalized]. These unsupervised learning models aim to discover the inner structure of a dataset based on a reconstruction task. Such methods usually combine probabilistic density estimation, Bayesian inference and auto-encoding structures. Among those, the Variational AutoEncoder (VAE) provides a powerful framework, which explicitly targets the construction of a latent space [kingma2013auto]. Such spaces are high-level representations with the ability to reveal interesting properties about the inner structure of different types of data [kingma2013auto, rezende2014stochastic], and more recently also in audio [esling2018generative]. Such learning procedures can be mixed with supervised learning to perform label extraction and conditional generation, showing the flexibility and efficiency of this approach. Last but not least, latent spaces can also be explicitly shared by several systems acting on different data domains, providing an elegant way of performing domain-to-domain translation or multimodal learning [liu2017unsupervised].
In this article, we propose a generative modeling approach to musical transcription by formulating it as a density estimation problem. Our approach directly models pairs (x, s), where x represents the spectral features and s represents the corresponding musical annotations. Following a multimodal approach inspired by Higgins et al. [higgins2017scan], we train two different VAEs on these separate domains, whose latent representations are progressively shared through explicit distribution matching. In addition to providing a Bayesian formulation of musical transcription compatible with arbitrary vocabularies, our method also naturally handles the reverse audio generation process, and thus allows both signal-to-symbol and symbol-to-signal inference. Furthermore, direct data/symbol generation is also available by latent space exploration, providing an interesting method for creative audio synthesis. Finally, we bind our transcription approach with a novel source-separation approach, based on explicit source decomposition with disjoint decoders. The idea behind our method is to use the knowledge previously acquired on individual instruments in order to ease their recognition in the mixture signal. A novel form of inference network is trained on the product space of the decoders' latent spaces, with additional latent dimensions that perform Bayesian inference directly over mixture coefficients.
2 State of the art
Here, we provide a brief state of the art of the most common approaches to musical transcription. Then, we introduce variational autoencoders and detail their use for cross-modal inference and generation.
2.1 Automatic music transcription
Automatic music transcription (AMT) aims at closing the gap between acoustic music signals and their corresponding musical notation. The main problem in AMT is detecting multiple, possibly overlapping-in-time pitches. Classical approaches for pitch and multi-pitch extraction are mostly based on spectral analysis using fundamental harmonics localization [drugman2011joint], such as the YIN algorithm [de2002yin]. As these methods were originally conceived for monophonic signals, their extension to multi-pitch estimation contexts often implies recursive processes (multi-fundamental recognition, harmonic subtraction) that reduce their efficiency. In parallel, other methods relying on spectrogram factorization have been proposed. These are based on the decomposition of the spectrogram into a linear combination of non-negative factors, and include Non-negative Matrix Factorisation (NMF) [Lee1999] and probabilistic latent component analysis (PLCA) [Shashanka2008]. However, spectrogram factorization methods usually fail to identify a global optimum, a limitation which led many researchers to hypothesize the need for supplementary external knowledge to attain more accurate decompositions [5957256, 4959583].
Recently, deep learning approaches have been proposed to address the multi-pitch detection problem. For instance, the piano transcription task has been tackled via a variety of neural networks [journals/corr/KelzDKBAW16, Sigtia:2016:ENN:2992480.2992488, hawthorne2017onsets, hawthorne2018enabling]. Interestingly, the MusicNet dataset [thickstun2017learning] includes multi-instrument music conveniently structured to address polyphonic music transcription. Finally, a method based on convolutional neural networks is presented in [Bittner2017DeepSR], which aims at learning meaningful representations allowing accurate pitch approximation in polyphonic audio recordings.
2.2 Generative models and variational autoencoders
Variational inference
Generative models define a class of unsupervised machine learning approaches aiming to recover the probability density p(x) underlying a given dataset. This density is usually conditioned on another set of random variables z, called latent variables. This set acts as a higher-level representation that controls the generation in the data domain. Formally, generative models can be described as modeling the joint probability p(x, z) = p(x|z)p(z), where p(z) acts as a Bayesian prior over the latent variables. The generative process p(x|z) takes a latent position z to produce the corresponding probability density in the data domain. Conversely, we also want to estimate the posterior distribution p(z|x), which gives the latent distribution corresponding to a data sample x. Retrieving this posterior distribution from a given generative process is called Bayesian inference, and is known to be a very robust inference framework. Unfortunately, this inference is generally intractable for complex distributions or requires limiting assumptions on both generative and inference processes. Variational inference (VI) is a framework that overcomes this intractability by turning Bayesian inference into an optimization problem [jaakkola2000bayesian]. To do so, variational inference posits a parametric distribution q_φ(z|x) that can be freely designed, and optimizes this distribution to approximate the real posterior p(z|x). This optimization is performed thanks to the following bound
log p(x) ≥ E_{q_φ(z|x)}[log p_θ(x|z)] − D_KL[q_φ(z|x) ‖ p(z)]    (1)
where D_KL denotes the Kullback-Leibler divergence. We can see that maximizing the right term of this inequality inherently optimizes the evidence of our model. This bound, called the Evidence Lower BOund (ELBO), can be interpreted as the sum of a likelihood term E_{q_φ(z|x)}[log p_θ(x|z)] and of a divergence term that enforces the approximated posterior q_φ(z|x) to match the prior p(z). This variational formulation is less restrictive than direct Bayesian inference, as it only requires the tractability of these two terms. Thus, we are able to model complex dependencies between x and z for both p_θ(x|z) and q_φ(z|x) while retaining the benefits of a Bayesian formulation [bishop].
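As a concrete illustration (not part of the original paper's code), the divergence term of the ELBO has a closed form when the approximate posterior is a diagonal Gaussian and the prior is a standard normal; a minimal numpy sketch:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

# A posterior identical to the prior incurs zero divergence:
print(kl_to_standard_normal(np.zeros(8), np.zeros(8)))  # -> 0.0
```

Any shift of the posterior mean or change of its variance makes this term strictly positive, which is what pulls the latent code distribution toward the prior during training.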
Variational autoencoders and cross-modal learning
To define the approximate distribution, we can model both generative and inference models as normal distributions

q_φ(z|x) = N(μ_φ(x), σ_φ(x)),   p_θ(x|z) = N(μ_θ(z), σ_θ(z))

such that the parameters μ and σ are respectively obtained by deterministic functions parametrized by φ and θ. When these functions are parametrized as neural networks, we obtain the original Variational AutoEncoder (VAE) formulation proposed by Kingma et al. [kingma2013auto]. The prior is usually defined as an isotropic normal distribution p(z) = N(0, I), which acts as a regularizer to enforce the independence of latent dimensions. Similar to auto-encoding architectures, q_φ(z|x) and p_θ(x|z) are respectively called the encoder and the decoder of the system. These functions are jointly trained until convergence on the parameters {φ, θ} with a back-propagation algorithm. Despite the apparent simplicity of its formulation, this system allows very expressive encoding and generative processes while providing a highly structured latent space, whose smoothness is enforced by the regularization term.
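To make the encoder/decoder plumbing concrete, the following is a schematic numpy forward pass through a toy VAE (all layer sizes and random weights are arbitrary illustrations, not the architectures used in this paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    h = np.tanh(x @ w1 + b1)      # single hidden layer
    return h @ w2 + b2

d_in, d_h, d_z = 400, 64, 8       # toy sizes, chosen arbitrarily
enc = (rng.normal(0, 0.1, (d_in, d_h)), np.zeros(d_h),
       rng.normal(0, 0.1, (d_h, 2 * d_z)), np.zeros(2 * d_z))
dec = (rng.normal(0, 0.1, (d_z, d_h)), np.zeros(d_h),
       rng.normal(0, 0.1, (d_h, d_in)), np.zeros(d_in))

x = rng.normal(size=d_in)
stats = mlp(x, *enc)
mu, log_var = stats[:d_z], stats[d_z:]   # encoder outputs q(z|x) parameters
# Reparameterization trick: z = mu + sigma * eps keeps sampling differentiable.
z = mu + np.exp(0.5 * log_var) * rng.normal(size=d_z)
x_mean = mlp(z, *dec)                    # mean of p(x|z)
print(x_mean.shape)  # (400,)
```

In practice the weights are trained by backpropagation on the ELBO; the sketch only shows how a data point flows encoder → latent sample → decoder.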
3 Cross-modal VAE for music transcription
3.1 Signal/symbol transfer through shared latent spaces
In this paper, we propose to reformulate the audio transcription problem as the estimation of a joint probability density p(x, s), where x represents the spectral information of the analyzed audio signal and s represents the corresponding set of symbolic information. Previous works showed the efficiency of VAEs for audio processing when used on spectral frames, in terms of both representational and generative abilities [esling2018generative, bitton2018modulated]. However, we intend here to estimate not only the probability density p(x), but also the joint probability density p(x, s). Considering s as label information, some approaches proposed to include an additional discriminator on the latent space, which is jointly trained during the learning process [kingma2014semi]. Here, we take inspiration from the SCAN approach proposed by Higgins et al., which trains a mirrored VAE on symbolic data whose latent representation is constrained to match the latent space obtained from the signal VAE [higgins2017scan]. Hence, modelling our symbolic information as binary vectors s, we can train this VAE over the label space
p_θ(s|z) = B(π_θ(z))  or  p_θ(s|z) = C(π_θ^{1..K}(z))

where B(π) denotes a Bernoulli distribution of mean π for binary symbols, and C(π^{1..K}) denotes a Categorical distribution of class-wise probabilities π^{1..K} in the case of multi-label symbols. We enforce its latent representation to fit the one obtained with the signal VAE by adding a term to the ELBO
L = E_{q_φ(z|s)}[log p_θ(s|z)] − D_KL[q_φ(z|s) ‖ p(z)] − α D_KL[q_φ(z|x) ‖ q_φ(z|s)]    (2)
such that the latent distributions provided by the two inference processes match for a given pair (x, s). The ordering of terms in the Kullback-Leibler divergence is chosen such that the distribution q_φ(z|s) is forced to cover the whole mass of q_φ(z|x). Hence, the correct label for a given x is encouraged even for low-probability areas of q_φ(z|x). Both VAEs are jointly trained, so that the latent representation obtained is a compromise between both auto-encoder performances. It should be noted that, as both signal and symbolic VAEs are independent, we are still able to perform semi-supervised learning for incomplete pairs by training only one of the two auto-encoders.
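Assuming Gaussian posteriors for both encoders, the combined training objective can be sketched as follows (a hypothetical illustration; `alpha` corresponds to the scaling factor of the matching term mentioned in Sec. 4.2):

```python
import numpy as np

def kl_diag_gaussians(mu_a, lv_a, mu_b, lv_b):
    """KL( N(mu_a, exp(lv_a)) || N(mu_b, exp(lv_b)) ), diagonal covariances."""
    return 0.5 * np.sum(
        lv_b - lv_a + (np.exp(lv_a) + (mu_a - mu_b) ** 2) / np.exp(lv_b) - 1.0,
        axis=-1)

def joint_objective(elbo_signal, elbo_symbol, mu_x, lv_x, mu_s, lv_s, alpha=10.0):
    # Maximize both ELBOs while pulling the symbol posterior q(z|s)
    # over the mass of the signal posterior q(z|x).
    return elbo_signal + elbo_symbol - alpha * kl_diag_gaussians(mu_x, lv_x, mu_s, lv_s)

mu = np.zeros(4)
print(joint_objective(-1.0, -2.0, mu, mu, mu, mu))  # matched posteriors -> -3.0
```

When the two posteriors coincide, the matching penalty vanishes and training reduces to the two independent ELBOs, which is what allows semi-supervised training on incomplete pairs.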
3.2 Bidirectional signal-to-symbol mappings
Our approach extends the multi-pitch detection problem in several aspects. First, our model is inherently bidirectional, as we can recover symbolic inference with the process

x → q_φ(z|x) → p_θ(s|z)
This can be understood as a Bayesian formulation of audio semantic labeling. Hence, multi-pitch transcription is simply a special case of our formulation, where s is defined as being solely the pitch information. Furthermore, we can also naturally handle signal generation from symbolic constraints, by taking the reverse process

s → q_φ(z|s) → p_θ(x|z)
such that we can recover the appropriate spectral distribution from the symbolic data, as depicted in Fig. 1. Another interesting property of our method is its applicability to arbitrary symbols. In this paper, we model symbolic information as a triplet [pitch class, octave, dynamics], where we add dynamics estimation to the pitch estimation task. Thus, we have p(s|z) = p(s_pitch|z) p(s_octave|z) p(s_dynamics|z), where each factor is defined as a categorical distribution. We use this property to extend this method to multi-pitch applications, where x is a mixture signal with n different sources. Hence, we formulate the symbolic information as a product s = (s_1, ..., s_n), where each s_i follows the previous specification. In addition to performing multi-pitch estimation, this also specifies the corresponding instrument if a given symbolic ordering is held during training. Finally, our formulation can be extended to polyphonic instruments in a straightforward manner. In this case, we simply replace the above categorical conditioning on pitch by a Bernoulli distribution over a one-hot pitch vector.
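The two inference directions above can be sketched as simple compositions of encoders and decoders (the `encode_*`/`decode_*` callables are hypothetical stand-ins for trained networks, not the paper's actual API):

```python
import numpy as np

# Hypothetical components: encoders return (mu, log_var) of the posterior,
# decoders map a latent code to distribution parameters in their domain.
def signal_to_symbol(x, encode_signal, decode_symbol):
    mu, _ = encode_signal(x)
    return decode_symbol(mu)          # decode at the posterior mean

def symbol_to_signal(s, encode_symbol, decode_signal, rng=None):
    mu, log_var = encode_symbol(s)
    if rng is not None:               # optionally sample for signal diversity
        mu = mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)
    return decode_signal(mu)

# Dummy stand-ins just to show the plumbing:
enc = lambda v: (np.asarray(v, dtype=float) * 0.5, np.zeros(len(v)))
dec = lambda z: z * 2.0
print(signal_to_symbol([1.0, 2.0], enc, dec))  # -> [1. 2.]
```

Sampling from the symbol posterior (rather than decoding its mean) is what exposes the diversity of signals retained under one label, a property exploited in the creative applications of Sec. 6.2.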
4 Experiments
4.1 Datasets
To evaluate our approach, we use the Studio On Line (SOL) library [ballet1999studio], a database that contains solo instrument recordings for every note across their tessitura. Each note is recorded over a range of different dynamics (ff, mf, pp). Here, we selected five instruments: violin, alto saxophone, flute, C trumpet and piano, for a total amount of 800 files. First, audio files are all resampled to a sample rate of 22050 Hz. Then, we transform the raw audio data to the spectral domain using a Non-Stationary Gabor Transform (NSGT) [velasco2011constructing]. Interestingly, this multi-resolution spectral transform allows the definition of custom frequency scales, while remaining invertible. Here, we use a constant-Q scale with 48 bins per octave. For each model training, we split our dataset with 80% as training and 20% as test sets. As our dataset is composed of monophonic signals, we randomly create instrument signal mixtures during training such that every combination is seen during the training.
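The on-the-fly mixture creation described above could be implemented along the following lines (a hypothetical helper; the paper does not specify its exact sampling procedure):

```python
import numpy as np

def random_mixture(frames, n_sources, rng):
    """Sum one random spectral frame from each of n distinct instruments.
    `frames` maps instrument name -> array of shape (n_frames, n_bins)."""
    names = list(rng.choice(sorted(frames), size=n_sources, replace=False))
    picked = [frames[n][rng.integers(len(frames[n]))] for n in names]
    return np.sum(picked, axis=0), names

rng = np.random.default_rng(0)
toy = {"violin": np.ones((10, 48)), "flute": 2 * np.ones((10, 48))}
mix, sources = random_mixture(toy, n_sources=2, rng=rng)
print(sorted(sources), mix[0])  # ['flute', 'violin'], each bin sums to 3.0
```

Drawing instrument subsets uniformly at random each step is one simple way to guarantee that every combination is eventually seen during training.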
4.2 Models
To show the efficiency of our proposal, we rely on VAEs with very simple architectures. Nevertheless, depending on the complexity of the input data, we adjust the dimensionality of both the latent space and hidden layers. For single-instrument models we use 32 dimensions for the latent space, and define both encoding and decoding functions of the signal VAE as 2-layer multi-layer perceptrons (MLP) with 2000 hidden units. For the symbolic auto-encoder, encoding and decoding MLPs have 2 layers and 800 hidden units. For mixtures of two different instruments, the number of hidden units for the signal encoders/decoders is set to 5000. For the mixture of three instruments, hidden layers have 5000 and 1500 units for the signal and symbolic encoders/decoders respectively. All models are trained using the ADAM optimizer, and we use a warm-up procedure that slowly brings the regularization from 0 to 1 during the first 100 epochs. As recommended by Higgins et al., the additional term presented in (2) is scaled up by a factor of 10. The learning rate is initially set to 1e-3, and is progressively reduced as the derivative of the error decreases.
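The warm-up procedure can be sketched as a simple linear ramp (one common implementation; the exact ramp shape is not specified in the text beyond "0 to 1 during the first 100 epochs"):

```python
def kl_warmup(epoch, warmup_epochs=100):
    """Linear warm-up of the KL regularization weight over the
    first `warmup_epochs` epochs, then held at 1."""
    return min(1.0, epoch / warmup_epochs)

print(kl_warmup(0), kl_warmup(50), kl_warmup(150))  # 0.0 0.5 1.0
```

Starting with a near-zero regularization weight lets the autoencoders first learn good reconstructions before the prior-matching constraint tightens.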
4.3 Evaluation
In addition to performing a standard evaluation on the test set, we also evaluate our model on a separate dataset containing recordings of flute arpeggios, scales and melodies [elena_agullo_cantos_2018_1408985], with source audio files and aligned MIDI files. Unfortunately, this dataset does not provide information about symbolic dynamics, so we do not evaluate dynamics inference on this set. We compare the efficiency of our model with results obtained from a baseline approach. To this end, we rely on an architecture similar to our model, but designed in a supervised way, to emphasize the gain provided by our model. This baseline classifier first performs a Principal Component Analysis (PCA) on the signal data to perform dimensionality reduction, mimicking the compression between the input data and the latent space. Then, we use a 2-layer MLP with the same number of hidden units as the corresponding symbolic decoder, to output the desired labels. The whole system is trained with a standard cross-entropy loss until convergence, using the same optimization strategy.
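The PCA compression step of the baseline can be sketched via a singular value decomposition (an illustrative numpy implementation; the paper does not state which PCA routine was used):

```python
import numpy as np

def pca_project(X, n_components):
    """Project data onto its top principal axes via SVD,
    mimicking the baseline's dimensionality-reduction step."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # scores on top components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 48))      # e.g. 100 spectral frames, 48 bins
Z = pca_project(X, n_components=32)
print(Z.shape)  # (100, 32)
```

The 32 retained components mirror the 32-dimensional latent space of the single-instrument VAE, making the comparison with the learned representation as direct as possible.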
5 Results
In this section, we present the results of our method. The source code, audio examples and additional figures and results are available on our support page: https://domkirke.github.io/latenttranscription/.
5.1 Signal reconstruction and transfer performances
Table 1: Log-likelihood (LL) and Itakura-Saito Divergence (ISD) scores for signal-to-signal reconstruction (left) and symbol-to-signal transfer (right).

Instrument         LL (rec.)   ISD (rec.)   LL (transfer)   ISD (transfer)
Alto Sax (Sax)     694.1       0.093        416.6           0.177
Violin (Vn)        671.4       0.104        551.1           0.151
Trumpet C (TpC)    706.9       0.073        276.71          0.35
Flute (Fl)         706.2       0.076        379.2           0.147
Piano (Pn)         813.5       0.044        361.13          0.112
Sax + Vn           358.71      0.364        27.37           0.852
Sax + Vn + Fl      268.7       0.624        692.4           3.813
First, we analyze the results obtained on the SOL examples. Signal reconstruction and transfer scores are provided in Table 1, relying on two evaluation metrics. The first metric is the log-likelihood of the original spectrum with respect to the distribution decoded by the model. The second is the Itakura-Saito Divergence (ISD), a metric that reflects the perceptual dissimilarity between the original and reconstructed spectra [6797100]. Both scores are presented for signal-to-signal reconstruction (left) and symbol-to-signal inference (right). In addition to these scores, reconstruction examples are depicted in Fig. 2. We can see that performance in both signal reconstruction and transfer decreases with the number of instruments, as the complexity of the incoming signal increases. Both reconstruction and symbol-to-signal transfer scores are almost perfect in the case of solo instruments, providing convincing and high-quality sound sample generation. In the case of mixtures of two or more instruments, reconstruction scores maintain an acceptable performance, but symbol-to-signal transfer scores clearly decrease. This observation correlates with the decrease in performance observed in the symbolic domain, as discussed in the following subsection.
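For reference (not the paper's own evaluation code), the Itakura-Saito divergence between two magnitude spectra can be computed as:

```python
import numpy as np

def itakura_saito(p, q, eps=1e-8):
    """Itakura-Saito divergence between two magnitude spectra.
    Zero iff the spectra are identical; asymmetric in (p, q)."""
    r = (p + eps) / (q + eps)
    return float(np.sum(r - np.log(r) - 1.0))

spec = np.abs(np.random.default_rng(0).normal(size=48)) + 0.1
print(itakura_saito(spec, spec))          # identical spectra -> 0.0
print(itakura_saito(spec, 2 * spec) > 0)  # any deviation is penalized
```

Because the divergence depends on the ratio of the two spectra rather than their difference, low-energy bins contribute as much as high-energy ones, which is why it is often preferred as a perceptually-motivated spectral metric.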
5.2 Symbolic inference performances
Table 2: Symbolic likelihood and classification scores (success ratio, loose ratio, and baseline loose ratio) for each configuration: Sax, Vn, TpC, Fl, Pn, Sax + Vn, and Sax + Vn + TpC. Reconstruction scores are given with transfer scores in parentheses.
Here, we evaluate the performances of our model in the symbolic domain. We provide in Table 2 four different classification scores, separately for each family of labels: octave, pitch class and dynamics. In the case of multi-instrument mixtures, these losses are averaged over every instrument of the mixture. Every column (except for the baseline) shows two scores: the first is the score obtained symbol-to-symbol (reconstruction), and the second, within parentheses, is the one obtained signal-to-symbol (transfer).

The first score is the likelihood of the true labels with respect to the distributions decoded by the symbolic part of the VAE. The percentage scores located to the right of the likelihood correspond to classification scores, obtained by taking the highest probability of the categorical distribution and computing the corresponding ratio of well-classified symbols. The first column, called success ratio, denotes the classification score obtained by the symbolic VAE. The second column, called loose ratio, is specific to the case of mixed instruments, and considers a label to be correct regardless of the instrument (we will come back to the motivation behind this score). Finally, the last column displays the scores obtained by our baseline classifier, for which symbol-to-symbol scores do not apply.
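The classification score described above amounts to comparing the argmax of each decoded categorical distribution against the true class index; a minimal sketch (illustrative, not the paper's evaluation code):

```python
import numpy as np

def classification_ratio(probs, targets):
    """Ratio of well-classified symbols: argmax over the decoded
    categorical distributions compared against the true class indices."""
    return float(np.mean(np.argmax(probs, axis=-1) == targets))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
print(classification_ratio(probs, np.array([0, 1, 2])))  # -> 1.0
print(classification_ratio(probs, np.array([0, 1, 0])))  # one miss -> 0.666...
```

The loose ratio for mixtures would relax the comparison by accepting a predicted label if it matches the target of any instrument in the mixture, rather than the instrument-aligned one.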
We note that symbolic reconstruction and signal-to-symbol scores are almost perfect in the case of single-source signals, outperforming the equivalent baseline system. We argue that this is due to two main aspects of the proposed approach. First, thanks to the reconstruction task, the construction of the latent space is organized to reflect the inner structure of both signal and symbol domains. The latent space can thus be understood as a feature space, carrying higher-level information that allows signal/symbol coupling to be more efficient. Second, the Bayesian approach to matching the latent spaces allows a smoother and more efficient mapping than a deterministic approach, which would just provide pairwise mappings between incoming examples.
In the case of instrument mixtures, scores decrease as the complexity of both the spectral and symbolic distributions increases. While the system still performs convincingly with two-instrument mixtures, it struggles with mixtures of three or more instruments. We argue that this is partly due to a combinatorial problem, as can be seen when analyzing the loose classification ratio: indeed, it becomes harder for the model to accurately assign a label to the corresponding instrument as the number of different possible sources increases. This can be seen with the loose ratio in Table 2, where the classification ratio increases significantly if a label is considered correct regardless of the assigned instrument. This effect is also visible in Fig. 2, where some peaks are correct but unfortunately attributed to the wrong instrument. A more subtle strategy to tackle this effect has to be considered; we leave this to future work.
5.3 Monophonic flute transcription
Table 3: Reconstruction and transfer scores on the external flute dataset (transfer scores in parentheses), alongside classification ratio and baseline scores.

Likelihood: 2648 (1057)    ISD: 1.065 (0.632)
Here, we analyze the results obtained with an external dataset of flute recordings, as depicted in Table 3. Performance in symbolic inference is still convincing, showing that our model does not suffer from strong over-fitting. Compared to the results obtained with the reference dataset (Table 1), the reconstruction results obtained here have decreased. This is due to several points. First, we have trained the model solely on the stationary part of each instrument signal, such that the attack and release of the signal are not understood by our model. This anomaly is clearly perceptible when listening to the reconstructions. Second, a more subtle comparison between the reference dataset and this dataset revealed important differences in terms of harmonic content. Specifically, a harmonic one octave below the fundamental is globally present in this dataset, but not in the reference one. This may explain the important decrease in octave classification, and may indicate that a larger variety of instruments of the same type is required to enforce the generalization of the model.
6 Discussion and future works
6.1 Performance aspects
We think that the efficiency of the proposed approach mainly relies on the hybrid nature of its learning process, which combines both unsupervised and supervised learning. Indeed, while each encoder learns to extract domain-dependent features in an unsupervised manner, latent spaces are matched by enforcing a supervised coupling of signal/symbol pairs. This process thus intends to learn transferable features, which are then used by each decoder to project them back into their respective data domains. Furthermore, this process allows the model to train on incomplete data, such that each domain's encoding/decoding functions can still be trained individually even if some signal/symbol couplings are missing. This means that the training method is scalable to bigger datasets where some symbolic information may be absent, such that incomplete data can still be used to reinforce the reconstruction abilities of the system.
However, in spite of the strengths of the proposed approach, the current state of the model suffers from some issues that we aim to tackle in the future. The first main issue is that, while the system performs well in the single-instrument case, its performance weakens with two instruments and clearly fails when applied to more. We think this is due to several reasons. First, we suspect the capacity of the model, as we still use very simple systems even for complex signals such as the 3-instrument case. Second, the complexity of the problem is such that, as we showed when comparing the loose and non-loose versions of the classification ratio, the system struggles to allocate the correct label to the correct instrument. We think that the incoming signal representation may not be precise enough to alleviate some ambiguities, for example in the case of octaves or fifths, where instrument identification may be hard to disentangle. Furthermore, the model does not prevent instrument-wise symbolic outputs from focusing on the same spectral components, and thus from performing redundant symbol predictions, which may also lead to a permutation problem.
Finally, another issue with the proposed model is that temporal evolution is not considered by the system. Including temporal features could bring decisive enhancements: in addition to allowing full-sound generation and improved pitch and dynamics inference, it may even be mandatory for applying our model to custom symbolic dictionaries (playing modes, temporal symbols such as trills, etc.) and would provide a substantial advantage over more conventional pitch-detection methods.
6.2 Creative aspects
Finally, an important aspect of the proposed model is the diversity of creative applications it provides (see Fig. 3). As the generation of both symbolic and signal content is based on the latent space, one may use it as a continuous control space and meaningfully explore it in either an unsupervised or semi-supervised fashion. Indeed, this space can be explored in a fully unsupervised manner by direct interaction: both signal and symbol information are then generated, such that the user receives direct symbolic feedback on the data they are generating. Alternatively, it can also be used in a semi-supervised fashion, constraining the navigation to the distribution inferred by a given symbol or a given sound. For example, in our case, we can directly generate a note with a given pitch, octave and dynamics by inferring a distribution with the symbolic encoder, and then navigate inside it to access the diversity of signals retained under the corresponding label information. This allows us, by first translating MIDI information into pitch/octave/dynamics triplets and then transferring this symbolic information to the signal domain, to generate audio content from a MIDI file. We list below various use cases that can be carried out by our model:

- sequence generation: we can use a sequence of labels to recover the corresponding distribution in the latent space, which we can freely sample and/or navigate;
- spectral morphing: we can take two latent target distributions, and draw a trajectory between them that we sample regularly to obtain a smooth transformation between the two target sounds;
- free trajectory: we can take a totally free trajectory in the latent space;
- symbol extraction: we can infer symbolic information from an incoming signal, and still train the corresponding signal encoder/decoder with the incoming data. This could be a particularly interesting feature in real-time contexts.
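The spectral morphing strategy above can be sketched as interpolation between two latent targets, each interpolated point being decoded into a spectral frame (a minimal illustration using linear interpolation between posterior means; spherical interpolation is another common choice):

```python
import numpy as np

def morph_trajectory(mu_a, mu_b, n_steps):
    """Evenly sampled straight line between two latent targets; each
    point can be decoded into a spectral frame to morph between sounds."""
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - t) * mu_a + t * mu_b     # shape (n_steps, latent_dim)

traj = morph_trajectory(np.zeros(32), np.ones(32), n_steps=5)
print(traj[:, 0])  # -> [0.   0.25 0.5  0.75 1.  ]
```

Because the latent space is smooth, decoding consecutive points of such a trajectory yields a gradual spectral transformation rather than an abrupt crossfade.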
Corresponding examples for each of the above navigation strategies are given on the support webpage. Finally, also note that, in our case, the vocabulary is easy to learn, such that retrieving the underlying distribution of the symbolic data itself is not really useful. Indeed, the different labels are all independent, and are approximately equally distributed in their own domain. However, our system can also learn on much more complex vocabularies, where learning the underlying distribution of the symbolic domain itself is of interest, and thus opens additional perspectives for its use in creative and/or MIR applications.
7 Conclusion
In this paper, we proposed a novel formulation for bijective signal/symbol translation, based on the latent space matching of domain-wise variational architectures. We studied the benefits and drawbacks of the proposed system, and concluded that, while improvable, this model performs well, offers a very interesting alternative to signal-to-symbol algorithms, and provides additional applications that were not possible with previous models. Indeed, our method is bidirectional, and performs well in both audio-to-symbol and symbol-to-audio prediction. Furthermore, our method is compatible with any kind of arbitrary symbolic information, and is thus open to user-defined vocabularies. Besides, as our model is based on a latent space that can be considered as a continuous control space, it is also open to diverse creative uses such as sequence generation, sound interpolation, or free navigation, whether in a supervised or semi-supervised manner guided by symbolic information. For future work, we plan to solve the symbolic ambiguities that arise in the case of numerous instruments, to incorporate temporal features to allow dynamical feature extraction, and to design user interfaces to make our model compatible with artistic practices.