Abstract

This paper describes a new unsupervised machine learning method for simultaneous phoneme and word discovery from multiple speakers. Human infants can acquire knowledge of phonemes and words from interactions with their mothers as well as with others surrounding them. From a computational perspective, phoneme and word discovery from multiple speakers is a more challenging problem than that from one speaker because the speech signals from different speakers exhibit different acoustic features. This paper proposes an unsupervised phoneme and word discovery method that simultaneously uses a nonparametric Bayesian double articulation analyzer (NPB-DAA) and a deep sparse autoencoder with parametric bias in the hidden layer (DSAE-PBHL). We assume that an infant can recognize and distinguish speakers based on certain other features, e.g., visual face recognition. DSAE-PBHL is designed to subtract speaker-dependent acoustic features and extract speaker-independent features. An experiment demonstrated that DSAE-PBHL can subtract distributed representations of acoustic signals, enabling extraction based on the types of phonemes rather than on the speakers. Another experiment demonstrated that a combination of NPB-DAA and DSAE-PBHL outperformed the available methods in phoneme and word discovery tasks involving speech signals with Japanese vowel sequences from multiple speakers.

1 Keywords:

word discovery, phoneme discovery, parametric bias, Bayesian model, neural network


Unsupervised Phoneme and Word Discovery from Multiple Speakers Using Double Articulation Analyzer and Neural Network with Parametric Bias

Ryo Nakashima, Ryo Ozaki, and Tadahiro Taniguchi

2 Introduction

Infants can discover phonemes and words from speech signals uttered by individuals surrounding them without transcribed data, i.e., labeled data, in a manner that differs from most automatic speech recognition systems (ASRs) developed recently (Saffran et al., 1996a, b). This study is aimed at creating a machine learning method that can discover phonemes and words from unlabeled data for developing a constructive model of language acquisition by human infants and for leveraging the large amount of unlabeled data spoken by multiple speakers in the context of developmental robotics (Taniguchi et al., 2016a).

Most available ASRs are trained with transcribed data that need to be prepared separately from the learning process (Sugiura et al., 2015; Kawaharay et al., 2000; Dahl et al., 2012). By using certain supervised learning methods and model architectures, an ASR can be developed from a very large transcribed speech corpus, i.e., a set of pairs of text data and acoustic data. However, human infants can discover phonemes and words through their developmental process. They do not need transcribed data. Moreover, they discover phonemes and words before they have developed the capability to read text data. This evidence implies that infants discover phonemes and words in an unsupervised manner, i.e., from their sensorimotor information.

It is widely established that eight-month-old children can infer chunks of phones, i.e., word-like units, from the distribution of acoustic signals (Saffran et al., 1996b). Caregivers generally utter a sequence of words rather than an isolated word in their infant-directed speech (Aslin et al., 1995). Therefore, word segmentation and discovery are essential for language acquisition. Saffran et al. described that human infants use three types of cues for word segmentation: prosodic, distributional, and co-occurrence cues (Saffran et al., 1996b). In this study, we focus on distributional cues. Saffran et al. reported that eight-month-old infants can perform word segmentation from continuous speech by using solely distributional cues (Saffran et al., 1996a). Thiessen et al. reported that distributional cues appear to be used by human infants by the age of seven months (Thiessen and Saffran, 2003). This is earlier than for other cues.

However, computational models that discover phonemes and words from human speech signals have not been completely explored in the fields of developmental robotics and natural language or speech processing (Lee and Glass, 2012; Lee et al., 2013, 2015; Kamper et al., 2015; Taniguchi et al., 2016b, c). The unsupervised word segmentation problem has been studied for a long time (Brent, 1999; Venkataraman, 2001; Goldwater et al., 2006, 2009; Mochihashi et al., 2009; Johnson and Goldwater, 2009; Chen et al., 2014; Magistry, 2012; Sakti et al., 2011; Takeda and Komatani, 2017). However, these models are known to be incapable of providing satisfactory results when applied to phoneme sequences recognized by a phoneme recognizer, which usually involve many phoneme recognition errors. This is because they do not consider phoneme recognition errors or the posterior distribution of phonemes, i.e., probabilistic modeling of phoneme recognition. Neubig et al. extended the sampling procedure proposed by Mochihashi to handle word lattices that can be obtained from an ASR system (Neubig et al., 2012). However, the improvement was limited, and they did not consider phoneme acquisition. It has been indicated that feedback information from segmented words is essential in phonetic category acquisition (Feldman et al., 2013). Subsequent to these studies, several others have been conducted to develop unsupervised phoneme and word discovery (Lee et al., 2015; Kamper et al., 2015; Taniguchi et al., 2016b, c). This type of research is mostly equivalent to the development of unsupervised learning of a speech recognition system, which transforms speech signals into sequences of words. The development of an unsupervised machine learning method that can discover words and phonemes is also important to provide fresh insight into developmental studies from a computational perspective. In this study, we employ the nonparametric Bayesian double articulation analyzer (NPB-DAA) (Taniguchi et al., 2016b).

The double articulation structure existing in spoken language is a characteristic structural feature of human language (Chandler, 2002). When we develop an unsupervised machine learning method based on probabilistic generative models, i.e., a Bayesian approach, it is critical to clarify our assumption about the latent structure embedded in the observation data. The double articulation structure is a two-layer hierarchical structure; i.e., a sentence is generated by stochastic transitions between words, a word corresponds to a deterministic sequence of phonemes, and a phoneme exhibits similar acoustic features. This double articulation structure is universal across languages. NPB-DAA was developed to enable a robot to obtain knowledge of phonemes and words in an unsupervised manner even if the robot does not know the number of phonemes and words, the lists of phonemes and words, or the transcription of the speech signals. Taniguchi et al. introduced a deep sparse autoencoder (DSAE) to improve the performance of NPB-DAA; they demonstrated that it also outperformed a conventional off-the-shelf ASR system trained using transcribed data (Taniguchi et al., 2016c). Although it did not outperform state-of-the-art deep learning-based ASR systems, the performance was remarkable considering that the main research purpose of developing NPB-DAA with DSAE was to develop an unsupervised phoneme and word discovery system that can be regarded as a computational explanation of the process of human language acquisition, rather than to develop a high-performance ASR system.

However, the experiments conducted in (Taniguchi et al., 2016b, c) used speech data obtained from only one speaker. The NPB-DAA with DSAE did not assume learning environments where a robot modeling a human infant learns phonemes and words from multiple speakers. Human infants do not acquire knowledge of phonemes and words from their mothers alone; they also do so from the multiple speakers surrounding them. Therefore, the direct application of NPB-DAA with DSAE to the multi-speaker scenario is highly likely to be ineffective. How to extend NPB-DAA with DSAE to the multi-speaker scenario is the research question of this paper.

In studies of unsupervised phoneme and word discovery, learning from speech signals obtained from multiple speakers has been recognized as challenging (Dunbar et al., 2017; Kamper et al., 2017). To explain the essential challenge of the problem, let us consider an example of the discrimination of “a” from “i.” Figure 1 provides a schematic view of the explanation that follows. Fundamentally, the phoneme discovery problem can be regarded as a type of clustering problem. A machine learning method for unsupervised phoneme and word discovery should be capable of identifying clusters of “a” and “i,” and distinguishing them. If the acoustic feature distributions of “a” and “i” are sufficiently different, a proper unsupervised machine learning method can form two clusters, i.e., acoustic categories. For example, DSAE can form reasonable feature representations, and NPB-DAA can simultaneously categorize phonemes and words. If explicit feature representations are formed, a standard clustering method, e.g., a Gaussian mixture model, can also perform phoneme discovery to a certain extent. However, in a multi-speaker setting, the acoustic feature distribution of each phoneme can differ depending on the speaker. That is, “a” from the first speaker and “a” from the second speaker exhibit different feature distributions in the feature space. The direct application of a clustering method to such data tends to form different clusters, i.e., phoneme categories, for “a” from the first and second speakers. To enable a robot to acquire phonemes and words from speech signals obtained from multiple speakers, it needs to omit, cancel, or subtract speaker-dependent information from the observed speech signals. In Figure 1, the speaker-dependent features are subtracted, and the speaker-independent features are extracted. If speaker-independent feature representations can be formed in this way, a clustering method, e.g., NPB-DAA, is likely to identify phonemes from the extracted features.

Figure 1: Schematic view of speaker-dependent and speaker-independent acoustic features and the clustering results obtained from each.

How to omit, cancel, or subtract speaker-dependent information is a crucial challenge in unsupervised phoneme and word discovery from multiple speakers. Conventional studies on ASR, which can use transcribed data, adopt an approach that omits the difference between multiple speakers by using transcribed data. Although “a” from speakers A and B exhibit different distributions, by using label data, the pattern recognition system can learn that both distributions should be mapped to the label “a.” In the scenario of supervised learning, deep learning-based speech recognition systems adopt these types of approaches by exploiting a considerable amount of labeled data and the flexibility of neural networks (Chan et al., 2016; Chiu et al., 2018; Amodei et al., 2016; Hannun et al., 2014). This approach was not suitable for this study because the research question is different; through this study, we intended to investigate unsupervised phoneme and word discovery.

The system should not use transcription. Instead, this study focused on the speaker index, i.e., “who is speaking now,” to subtract speaker-dependent acoustic features. It is widely established that infants can distinguish individuals around them in their early developmental stage. Therefore, the assumption that they can sense “who is speaking now,” i.e., the speaker index, is reasonable from the developmental perspective.

To use the speaker index to subtract speaker-dependent information from acoustic features, we employed the concept of parametric bias from the study of neural networks. Neural networks have been demonstrated to exhibit a rich representation learning capability and have been widely used for the past decade (Le et al., 2011; Krizhevsky et al., 2012; Liu et al., 2014; Bengio, 2009a; Hinton and Salakhutdinov, 2006). In the context of developmental robotics, Tani, Ogata, and their colleagues proposed and explored the recurrent neural network with parametric bias (Tani et al., 2004; Yokoya et al., 2007; Ogata et al., 2007). Parametric bias is an additional input that functions as a switch that can modulate the function of the neural network. In our study, the speaker index is provided as the input of the parametric bias. Moreover, the characteristic of neural networks wherein each neuron encodes independent feature information if the network is trained under suitable conditions is called disentanglement. The property of disentanglement has been attracting attention in recent studies (Chen et al., 2016; Higgins et al., 2017; Bengio, 2009b). The arithmetic manipulability rooted in this characteristic of neural networks has also been gaining attention. It was demonstrated that Word2Vec, i.e., skip-gram for word embedding, can predict the representation vector of “Paris” by subtracting the vector of “Japan” from that of “Tokyo” and adding that of “France” (Mikolov et al., 2013b, a). Considering these concepts, we propose the DSAE with parametric bias in the hidden layer (DSAE-PBHL) to subtract speaker-dependent information.
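For illustration only, the embedding arithmetic mentioned above can be reproduced with an off-the-shelf word-embedding toolkit. The following sketch uses gensim and a pretrained vector file whose name is a placeholder; neither is part of the proposed method.

```python
# Minimal sketch of the word-vector arithmetic described above (gensim is assumed;
# the pretrained vector file path is a placeholder, not something used in this paper).
from gensim.models import KeyedVectors

# Load pretrained word2vec vectors in the word2vec binary format.
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# vector("Tokyo") - vector("Japan") + vector("France") should land near "Paris".
result = kv.most_similar(positive=["Tokyo", "France"], negative=["Japan"], topn=3)
print(result)
```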

The overview of our approach, unsupervised phoneme and word discovery using NPB-DAA with DSAE-PBHL, is schematically depicted in Figure 2. First, a robot observes spoken utterances with speaker indexes obtained using a speaker recognition method, e.g., face recognition. DSAE-PBHL, which accepts speaker-dependent features and the speaker index as input, extracts speaker-independent feature representations and passes them to NPB-DAA. NPB-DAA segments the feature sequences and identifies words and phonemes, i.e., language and acoustic models, in an unsupervised manner.

Figure 2: Overview of the proposed method, NPB-DAA with DSAE-PBHL. First, a robot observes spoken utterances with speaker indexes obtained using a speaker recognition method, e.g., face recognition. DSAE-PBHL, which accepts speaker-dependent features and the speaker index as input, extracts speaker-independent feature representations and passes them to NPB-DAA. NPB-DAA segments the feature sequences and identifies words and phonemes, i.e., language and acoustic models, in an unsupervised manner.

Our contribution is that we propose an unsupervised learning method that can identify words and phonemes directly from speech signals uttered by multiple speakers. The method based on NPB-DAA and DSAE-PBHL is an unsupervised learning method except for the use of an index of a speaker, which is assumed to be estimated by the robot, i.e., a model of a human infant.

The remainder of this paper is organized as follows: Section 3 briefly describes the proposed method, a combination of NPB-DAA and DSAE-PBHL. Section 4 describes two experiments that evaluate the effectiveness of the proposed method using actual sequential Japanese vowel speech signals. Section 5 concludes this paper.

3 Methods

The proposed method consists of NPB-DAA and DSAE-PBHL (see Figure 2). First, we briefly introduce NPB-DAA (Taniguchi et al., 2016b). Secondly, we describe DSAE-PBHL after introducing DSAE (Ng, 2011a; Liu et al., 2015; Taniguchi et al., 2016c).

3.1 NPB-DAA

Hierarchical Dirichlet process hidden language model (HDP-HLM) is a probabilistic generative model that models double articulation structure (i.e., two-layer hierarchy, a characteristic of human spoken language) (Taniguchi et al., 2016b). Mathematically, HDP-HLM is a natural extension of hierarchical Dirichlet process hidden semi-Markov model (HDP-HSMM), which is a type of generalization of hidden Markov model (Johnson and Willsky, 2013). NPB-DAA is the name of an unsupervised learning method for phoneme and word discovery based on HDP-HLM.

Figure 3: Graphical model of HDP-HLM (Taniguchi et al., 2016b)

Whereas HDP-HMM assumes that the latent variables transit following a Markov process, HDP-HLM assumes that the latent variables, i.e., indices of phonemes, transit according to a word bigram language model. In HDP-HSMM, a superstate persists for a certain duration determined by the duration distribution and outputs observations using a corresponding emission distribution; meanwhile, in HDP-HLM, a latent word persists for a certain duration, and the model outputs observations through a sequential transition of latent letters, i.e., phonemes. Note that in the HDP-HLM terminology, the variable corresponding to a phoneme is called a latent letter, and the variable corresponding to a word is called a latent word.

As an HMM-based ASR has language and acoustic models, HDP-HLM has both of these as latent variables in its generative model. Because of the nature of Bayesian nonparametrics, i.e., the Dirichlet process prior, HDP-HLM can determine the number of phonemes and words through the inference process. It is not necessary to fix the number of phonemes and words, i.e., the number of latent letters and latent words, beforehand.

In the graphical model, the $s$-th latent word corresponds to the superstate $z_s$. Superstate $z_s$ has a sequence of latent letters $w_{z_s} = (w_{z_s,1}, \ldots, w_{z_s,L_{z_s}})$; here, $w_{z_s,k}$ is the index of the $k$-th latent letter of the latent word $z_s$, and $L_{z_s}$ represents the string length of $w_{z_s}$. The generative process of HDP-HLM is as follows:

\begin{align}
\beta^{LM} &\sim \mathrm{GEM}(\gamma^{LM}) \tag{1}\\
\pi_i^{LM} &\sim \mathrm{DP}(\alpha^{LM}, \beta^{LM}) && i = 1, 2, \ldots \tag{2}\\
\beta^{WM} &\sim \mathrm{GEM}(\gamma^{WM}) \tag{3}\\
\pi_j^{WM} &\sim \mathrm{DP}(\alpha^{WM}, \beta^{WM}) && j = 1, 2, \ldots \tag{4}\\
w_{i,k} &\sim \pi^{WM}_{w_{i,k-1}} && k = 1, \ldots, L_i \tag{5}\\
\theta_j &\sim H && j = 1, 2, \ldots \tag{6}\\
\omega_j &\sim G && j = 1, 2, \ldots \tag{7}\\
z_s &\sim \pi^{LM}_{z_{s-1}} && s = 1, 2, \ldots \tag{8}\\
l_{s,k} &= w_{z_s,k} && k = 1, \ldots, L_{z_s} \tag{9}\\
D_{s,k} &\sim g(\omega_{l_{s,k}}) \tag{10}\\
x_t &\sim h(\theta_{l_{s,k}}) && t = t^{\mathrm{s}}_{s,k}, \ldots, t^{\mathrm{e}}_{s,k} \tag{11}
\end{align}

Here, $\mathrm{GEM}(\cdot)$ represents a stick-breaking process (SBP), and $\mathrm{DP}(\cdot)$ represents a Dirichlet process (DP). $\beta^{WM}$ represents the base measure of the Dirichlet process for the word model, and $\alpha^{WM}$ and $\gamma^{WM}$ are the hyperparameters of the DP and SBP. A word model is a prior distribution over the sequence of latent letters composing a latent word. $\pi_j^{WM}$ is a transition probability, i.e., a categorical distribution over the latent letter that follows the $j$-th latent letter. Similarly, $\beta^{LM}$, $\alpha^{LM}$, and $\gamma^{LM}$ represent the base measure of the Dirichlet process for the language model and the hyperparameters of the DP and SBP, respectively. $\pi_i^{LM}$ is a transition probability, i.e., a categorical distribution over the latent word that follows the $i$-th latent word. The notations $\pi^{LM}$ and $\pi^{WM}$ represent the language and word models, respectively. The emission distribution $h(\cdot)$ and the duration distribution $g(\cdot)$ have parameters $\theta_j$ and $\omega_j$ drawn from the base measures $H$ and $G$, respectively. The variable $z_s$ is the $s$-th word in the latent word sequence. Moreover, $D_s$ is the duration of $z_s$, $l_{s,k}$ is the $k$-th latent letter of the $s$-th latent word, and $D_{s,k}$ is its duration. Variables $x_t$ and $l_t$ represent the observation and the latent state corresponding to a latent letter at time $t$. The times $t^{\mathrm{s}}_{s,k}$ and $t^{\mathrm{e}}_{s,k}$ represent the start time and end time, respectively, of $l_{s,k}$.

If we assume the duration distribution of a latent letter to follow a Poisson distribution, the model exhibits an effective mathematical feature because of the reproductive property of Poisson distributions. The duration $D_{s,k}$ is drawn from $g(\omega_{l_{s,k}})$. Therefore, the duration of $z_s$ is $D_s = \sum_{k=1}^{L_{z_s}} D_{s,k}$. If we assume $g$ to follow a Poisson distribution, i.e., $g(\omega_{l_{s,k}})$ is a Poisson distribution with mean $\omega_{l_{s,k}}$, then $D_s$ also follows a Poisson distribution. In this case, the parameter of the Poisson duration distribution of $z_s$ becomes $\lambda_s = \sum_{k=1}^{L_{z_s}} \omega_{l_{s,k}}$. The observation $x_t$ corresponding to $l_{s,k}$ is generated from $h(\theta_{l_{s(t),k(t)}})$; here, $s(t)$ and $k(t)$ are mappings that indicate the corresponding word and letter at time $t$.

Following the process described above, HDP-HLM can generate time series data exhibiting a latent double articulation structure. In this study, we assume that the observation $x_t$ corresponds to the acoustic features. In summary, the emission parameters $\{\theta_j\}$ together with the duration parameters $\{\omega_j\}$ represent the acoustic model, and $\pi^{LM}$, $\pi^{WM}$, and the latent words $\{w_i\}$ represent the language model. The inference of the latent variables of this generative model corresponds to the simultaneous discovery of phonemes and words.
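To make the generative process concrete, the following Python sketch samples data from a heavily simplified, weak-limit version of HDP-HLM: the DP and SBP priors are replaced by finite Dirichlet distributions, latent words have a fixed length, emissions are one-dimensional Gaussians, and durations are Poisson. All sizes and hyperparameter values are illustrative assumptions and do not correspond to the settings used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Weak-limit truncation: finite numbers of latent words and letters (illustrative values).
num_words, num_letters, word_len = 5, 8, 3

# Language model: word-bigram transition probabilities (one row per previous word).
pi_LM = rng.dirichlet(np.ones(num_words), size=num_words)

# Word model: each latent word is a fixed sequence of latent letters,
# drawn here from a letter-bigram prior for simplicity.
pi_WM = rng.dirichlet(np.ones(num_letters), size=num_letters)
words = []
for _ in range(num_words):
    letters = [rng.integers(num_letters)]
    for _ in range(word_len - 1):
        letters.append(rng.choice(num_letters, p=pi_WM[letters[-1]]))
    words.append(letters)

# Acoustic model: per-letter Gaussian emission mean (theta) and Poisson duration rate (omega).
theta = rng.normal(0.0, 3.0, size=num_letters)
omega = rng.integers(3, 8, size=num_letters)

# Generate one observation sequence: sample latent words, expand them into latent
# letters, sample a duration per letter, and emit one observation per frame.
obs, z_prev = [], rng.integers(num_words)
for _ in range(4):                                  # four latent words per sequence
    z = rng.choice(num_words, p=pi_LM[z_prev])
    for letter in words[z]:
        duration = 1 + rng.poisson(omega[letter])   # at least one frame per letter
        obs.extend(rng.normal(theta[letter], 0.5, size=duration))
    z_prev = z

obs = np.array(obs)
print(obs.shape)
```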

An inference procedure for HDP-HLM was proposed in (Taniguchi et al., 2016b). This procedure is based on the blocked Gibbs sampler for HDP-HSMM proposed by Johnson (Johnson and Willsky, 2013). The pseudo code of the procedure is described in Algorithm 1. In this paper, we omit the details of the procedure. For further details, please refer to the original paper (Taniguchi et al., 2016b).

  Initialize all parameters.
  Observe time series data $\{x^{(n)}_{1:T_n}\}_{n=1}^{N}$.
  repeat
     for $n = 1$ to $N$ do
        // Backward filtering procedure
        For each latent word $i$, initialize the messages $B_{T_n}(i) := 1$.
        for $t = T_n - 1$ to $1$ do
           For each latent word $i$, compute the backward messages $B_t(i)$ and $B^*_t(i)$ (see Taniguchi et al. (2016b))
        end for
        // Forward sampling procedure
        Initialize $t := 0$ and $s := 1$
        while $t < T_n$ do
           // Sampling a superstate representing a latent word
           $z_s \sim p(z_s \mid z_{s-1}, x^{(n)}_{t+1:T_n}) \propto \pi^{LM}_{z_{s-1}}(z_s)\, B^*_t(z_s)$
           // Sampling duration of the superstate
           $D_s \sim p(D_s \mid z_s, x^{(n)}_{t+1:T_n})$
           $t := t + D_s$
           $s := s + 1$
        end while
        $S := s - 1$
        // Sampling a tentative latent letter sequence
        for $s = 1$ to $S$ do
           $l_{s,1:L_{z_s}} \sim p(\,\cdot \mid z_s, x^{(n)}_{t^{\mathrm{s}}_s : t^{\mathrm{e}}_s})$
        end for
     end for
     // Update model parameters
     Sample the acoustic model parameters $\{\theta_j, \omega_j\}$ on the basis of the tentatively sampled latent letter sequences $\{l_{s,k}\}$.
     Sample the language model parameter $\pi^{LM}$ on the basis of the sampled superstates $\{z_s\}$, i.e., latent words.
     Sample a word inventory $\{w_i\}$ using a sampling importance re-sampling procedure (see Taniguchi et al. (2016b)).
     Sample a word model $\pi^{WM}$ on the basis of the sampled word inventory $\{w_i\}$.
  until a predetermined exit condition is satisfied.
Algorithm 1 Blocked Gibbs sampler for HDP-HLM

3.2 Deep sparse auto-encoder with parametric bias

3.2.1 Deep sparse auto-encoder

In the previous paper (Taniguchi et al., 2016c), features extracted using DSAE were used as the input of NPB-DAA. DSAE is a representation learning method. It consists of several sparse autoencoders (SAEs) (Ng, 2011b). By stacking several autoencoders and assigning penalty terms to the loss function for improving robustness and sparsity, DSAE is obtained. In DSAE, each SAE attempts to minimize the reconstruction errors and learn efficient and essential representations of the input data (speech signals in this study).

Figure 4: Overview of DSAE

Figure 4 shows an overview of DSAE. In this study, we assume that the original speech signals are converted into Mel-frequency cepstral coefficients (MFCCs) following the process described in the previous work (Taniguchi et al., 2016c). The time series data are obtained as a matrix $X = [\boldsymbol{x}_1, \ldots, \boldsymbol{x}_T]$; here, $T$ is the number of data points. The acoustic feature at time $t$ is represented by $\boldsymbol{x}_t$ as follows:

\begin{equation}
\boldsymbol{x}_t = (x_{t,1}, x_{t,2}, \ldots, x_{t,D})^\top \tag{12}
\end{equation}

where $D$ represents the dimension of the vector $\boldsymbol{x}_t$.

In this study, the hyperbolic tangent function was used as the activation function of the SAE. To fit the input data to the range $[-1, 1]$ for reconstruction, the input vector was normalized as follows:

\begin{equation}
\tilde{x}_{t,i} = 2\,\frac{x_{t,i} - x_{\min,i}}{x_{\max,i} - x_{\min,i}} - 1 \tag{13}
\end{equation}

where $x_{\max,i}$ and $x_{\min,i}$ are the maximum and minimum values, respectively, of the $i$-th dimension over all the data $\{\boldsymbol{x}_t\}$.

Each SAE has an encoder and a decoder. The encoder of the $l$-th SAE in the DSAE is

\begin{equation}
\boldsymbol{h}^{(l+1)}_t = \tanh\!\left(\boldsymbol{W}^{(l)} \boldsymbol{h}^{(l)}_t + \boldsymbol{b}^{(l)}\right) \tag{14}
\end{equation}

Following this function, for the $t$-th data point, the vector of the $l$-th layer $\boldsymbol{h}^{(l)}_t$ (with $\boldsymbol{h}^{(1)}_t = \tilde{\boldsymbol{x}}_t$) is transformed into the vector of the $(l+1)$-th hidden layer $\boldsymbol{h}^{(l+1)}_t$. Each decoder is represented as follows: the reconstruction $\hat{\boldsymbol{h}}^{(l)}_t$ of the $l$-th layer is obtained from the vector of the $(l+1)$-th hidden layer.

\begin{equation}
\hat{\boldsymbol{h}}^{(l)}_t = \tanh\!\left(\hat{\boldsymbol{W}}^{(l)} \boldsymbol{h}^{(l+1)}_t + \hat{\boldsymbol{b}}^{(l)}\right) \tag{15}
\end{equation}

where $\boldsymbol{W}^{(l)} \in \mathbb{R}^{d_{l+1} \times d_l}$ in (14) is the weight matrix and $\boldsymbol{b}^{(l)} \in \mathbb{R}^{d_{l+1}}$ is the bias of the encoder. Moreover, $d_l$ and $d_{l+1}$ represent the dimensions of the input and hidden layers, respectively. Similarly, $\hat{\boldsymbol{W}}^{(l)} \in \mathbb{R}^{d_l \times d_{l+1}}$ in (15) is the weight matrix of the decoder, and $\hat{\boldsymbol{b}}^{(l)} \in \mathbb{R}^{d_l}$ is the bias.

The loss function was defined as follows:

\begin{equation}
E = \frac{1}{T} \sum_{t=1}^{T} \left\| \boldsymbol{h}^{(l)}_t - \hat{\boldsymbol{h}}^{(l)}_t \right\|^2 + \alpha \left( \left\|\boldsymbol{W}^{(l)}\right\|^2 + \left\|\hat{\boldsymbol{W}}^{(l)}\right\|^2 \right) + \beta \sum_{j=1}^{d_{l+1}} \mathrm{KL}\!\left(\rho \,\middle\|\, \hat{\rho}_j\right) \tag{16}
\end{equation}

Because the dimensions of the weight matrices $\boldsymbol{W}^{(l)}$ and $\hat{\boldsymbol{W}}^{(l)}$ are high, the penalty terms $\alpha(\|\boldsymbol{W}^{(l)}\|^2 + \|\hat{\boldsymbol{W}}^{(l)}\|^2)$ (L2 norm) and $\beta \sum_j \mathrm{KL}(\rho \| \hat{\rho}_j)$ (sparse term) were added to prevent overfitting; the latter is the Kullback–Leibler divergence between two Bernoulli distributions having $\rho$ and $\hat{\rho}_j$ as their parameters. This type of DSAE is introduced in (Ng, 2011b). The sparse term is detailed as follows:

\begin{equation}
\mathrm{KL}\!\left(\rho \,\middle\|\, \hat{\rho}_j\right) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \tag{17}
\end{equation}

where $\rho$ is a parameter that regulates sparsity. Moreover, $\hat{\rho}_j$ represents the average activation of the $j$-th dimension of the hidden layer. The vector $\hat{\boldsymbol{\rho}}$ is defined by combining the $\hat{\rho}_j$. In this study, to calculate the sparse term, the activation $\boldsymbol{h}^{(l+1)}_t$ was normalized from $[-1, 1]$ to $[0, 1]$; this was because $\tanh$ was used as the activation function. To optimize the DSAE, the standard back-propagation method was simply used (Rumelhart et al., 1985).

As described above, we could obtain the weight matrices $\boldsymbol{W}^{(l)}$ and biases $\boldsymbol{b}^{(l)}$ used to compute $\boldsymbol{h}^{(l+1)}_t$. By stacking the optimized SAEs, high-level feature representations could be obtained.
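As a rough illustration of the layer described above, the following PyTorch sketch implements a single SAE with a tanh encoder and decoder, an L2 weight penalty, and a KL-divergence sparsity penalty computed on activations rescaled from [-1, 1] to [0, 1]. The layer sizes, penalty weights, sparsity target, and the use of the Adam optimizer (instead of plain back-propagation with gradient descent) are illustrative assumptions rather than the settings used in this study.

```python
import torch
import torch.nn as nn

class SparseAutoencoderLayer(nn.Module):
    """One SAE layer: tanh encoder/decoder with L2 and KL-sparsity penalties."""

    def __init__(self, dim_in, dim_hidden, weight_decay=1e-4, beta=1e-2, rho=0.05):
        super().__init__()
        self.encoder = nn.Linear(dim_in, dim_hidden)
        self.decoder = nn.Linear(dim_hidden, dim_in)
        self.weight_decay, self.beta, self.rho = weight_decay, beta, rho

    def forward(self, x):
        h = torch.tanh(self.encoder(x))        # hidden representation in [-1, 1]
        x_hat = torch.tanh(self.decoder(h))    # reconstruction of the normalized input
        return h, x_hat

    def loss(self, x):
        h, x_hat = self.forward(x)
        recon = ((x - x_hat) ** 2).sum(dim=1).mean()
        l2 = (self.encoder.weight ** 2).sum() + (self.decoder.weight ** 2).sum()
        # Rescale tanh activations from [-1, 1] to [0, 1] before the sparsity term.
        rho_hat = ((h + 1.0) / 2.0).mean(dim=0).clamp(1e-6, 1 - 1e-6)
        kl = (self.rho * torch.log(self.rho / rho_hat)
              + (1 - self.rho) * torch.log((1 - self.rho) / (1 - rho_hat))).sum()
        return recon + self.weight_decay * l2 + self.beta * kl

# Example: train one layer on random stand-in data (a real run would use normalized MFCCs).
layer = SparseAutoencoderLayer(dim_in=39, dim_hidden=20)
optimizer = torch.optim.Adam(layer.parameters(), lr=1e-3)
data = torch.rand(256, 39) * 2 - 1             # stand-in inputs already scaled to [-1, 1]
for _ in range(100):
    optimizer.zero_grad()
    loss = layer.loss(data)
    loss.backward()
    optimizer.step()
```

Stacking then proceeds by freezing one trained layer and feeding its hidden activations as the input of the next layer, as in the DSAE described above.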

3.2.2 DSAE-PBHL

Figure 5: Overview of DSAE-PBHL

This section describes DSAE-PBHL, which is aimed at subtracting speaker-dependent features in the latent space.

DSAE-PBHL is a DSAE whose final layer is modified so that a part of the layer receives speaker index information from another network. This layer is used to subtract speaker-dependent information in a self-organizing manner. Figure 5 shows an overview of DSAE-PBHL. The final layer receives a parametric bias input $\boldsymbol{p}_t$, i.e., the speaker index, from a different network (see the nodes on the right of the network in Figure 5).

The vital aspect of DSAE-PBHL is that only a part of the nodes in the final layer receives a projection from the network representing the speaker index information.

The input vector of the final SAE consists of the parametric bias $\boldsymbol{p}_t$ and the feature vector $\boldsymbol{h}_t$ obtained from the preceding SAE:

\begin{equation}
\boldsymbol{a}_t = \left[\boldsymbol{h}_t^\top, \boldsymbol{p}_t^\top\right]^\top \tag{18}
\end{equation}

where $d_h$ and $d_p$ represent the dimensions of $\boldsymbol{h}_t$ and $\boldsymbol{p}_t$, respectively. Note that $\boldsymbol{a}_t \in \mathbb{R}^{d_h + d_p}$.

Next, the vector of the final hidden layer $\boldsymbol{y}_t$ is defined using a speaker-independent part $\boldsymbol{u}_t$ and a parametric-bias part $\boldsymbol{q}_t$, as follows:

\begin{equation}
\boldsymbol{y}_t = \left[\boldsymbol{u}_t^\top, \boldsymbol{q}_t^\top\right]^\top \tag{19}
\end{equation}

where $d_u$ and $d_q$ represent the dimensions of $\boldsymbol{u}_t$ and $\boldsymbol{q}_t$, respectively. Note that $\boldsymbol{y}_t \in \mathbb{R}^{d_u + d_q}$.

The encoder of the final SAE uses (14) in the same way as the general DSAE. However, the weight matrix of the encoder is structured and trained to map the input vectors $\boldsymbol{h}_t$ and $\boldsymbol{p}_t$ to the latent vectors $\boldsymbol{u}_t$ and $\boldsymbol{q}_t$ in the hidden layer and to generate a speaker-independent feature representation and a speaker-identifiable representation, respectively:

\begin{equation}
\boldsymbol{y}_t = \tanh\!\left(\begin{bmatrix} \boldsymbol{W}_{uh} & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{W}_{qp} \end{bmatrix} \boldsymbol{a}_t + \begin{bmatrix} \boldsymbol{b}_u \\ \boldsymbol{b}_q \end{bmatrix}\right) \tag{20}
\end{equation}

where $\boldsymbol{W}_{uh} \in \mathbb{R}^{d_u \times d_h}$, $\boldsymbol{W}_{qp} \in \mathbb{R}^{d_q \times d_p}$, $\boldsymbol{b}_u \in \mathbb{R}^{d_u}$, and $\boldsymbol{b}_q \in \mathbb{R}^{d_q}$; i.e., $\boldsymbol{u}_t$ is computed from $\boldsymbol{h}_t$, and $\boldsymbol{q}_t$ is computed from $\boldsymbol{p}_t$ alone.

Similarly, the decoder function (15) is used, and the weight matrix of the decoder function is modified as follows:

\begin{equation}
\hat{\boldsymbol{a}}_t = \left[\hat{\boldsymbol{h}}_t^\top, \hat{\boldsymbol{p}}_t^\top\right]^\top = \tanh\!\left(\begin{bmatrix} \hat{\boldsymbol{W}}_{hu} & \hat{\boldsymbol{W}}_{hq} \\ \boldsymbol{0} & \hat{\boldsymbol{W}}_{pq} \end{bmatrix} \boldsymbol{y}_t + \begin{bmatrix} \hat{\boldsymbol{b}}_h \\ \hat{\boldsymbol{b}}_p \end{bmatrix}\right) \tag{21}
\end{equation}

where $\hat{\boldsymbol{W}}_{hu} \in \mathbb{R}^{d_h \times d_u}$, $\hat{\boldsymbol{W}}_{hq} \in \mathbb{R}^{d_h \times d_q}$, $\hat{\boldsymbol{W}}_{pq} \in \mathbb{R}^{d_p \times d_q}$, $\hat{\boldsymbol{b}}_h \in \mathbb{R}^{d_h}$, and $\hat{\boldsymbol{b}}_p \in \mathbb{R}^{d_p}$; i.e., the acoustic feature is reconstructed from both $\boldsymbol{u}_t$ and $\boldsymbol{q}_t$, whereas the parametric bias is reconstructed from $\boldsymbol{q}_t$ alone.

Furthermore, the error function and the optimization method are identical to those of the general DSAE.

After the training phase, $\boldsymbol{u}_t$ is obtained by excluding $\boldsymbol{q}_t$ from the vector of the final hidden layer $\boldsymbol{y}_t$, and $\boldsymbol{u}_t$ is used as the feature vector, i.e., the observation, of NPB-DAA.

The reason we consider it likely that $\boldsymbol{u}_t$ encodes a speaker-independent feature representation is that the network is trained to cause $\boldsymbol{q}_t$ to carry a speaker-identifiable representation; this is because $\boldsymbol{q}_t$ alone has to contribute to reconstructing the speaker index information, i.e., the parametric bias. As Figure 5 shows, $\boldsymbol{q}_t$ is connected only to the parametric bias input, i.e., the speaker index. If $\boldsymbol{u}_t$ involves speaker-dependent information that can be used to predict the speaker index, the representation is redundant. Therefore, such speaker-dependent information is likely to be mapped onto the $\boldsymbol{q}_t$ pathway. As a result, $\boldsymbol{u}_t$ is likely to encode information that does not contribute to the speaker identification task (i.e., it becomes speaker-independent information).
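The following PyTorch sketch illustrates this final-layer idea under the connectivity assumed above: the parametric-bias part q is driven only by the speaker code, the speaker code is reconstructed from q alone, and the acoustic reconstruction uses both u and q. The 3 + 3 split of the hidden layer and the four-speaker one-hot code follow the experiments reported later, whereas the input dimension, the omission of the sparsity and L2 penalties, and the optimizer settings are simplifications for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class PBHLLayer(nn.Module):
    """Final SAE layer with parametric bias: the hidden vector splits into a
    speaker-independent part u and a parametric-bias part q (a sketch of the concept)."""

    def __init__(self, dim_h, dim_p, dim_u, dim_q):
        super().__init__()
        self.enc_u = nn.Linear(dim_h, dim_u)          # u is computed from the acoustic feature
        self.enc_q = nn.Linear(dim_p, dim_q)          # q is computed from the speaker code only
        self.dec_h = nn.Linear(dim_u + dim_q, dim_h)  # acoustic reconstruction uses both u and q
        self.dec_p = nn.Linear(dim_q, dim_p)          # speaker code is reconstructed from q alone

    def forward(self, h, p):
        u = torch.tanh(self.enc_u(h))
        q = torch.tanh(self.enc_q(p))
        h_hat = torch.tanh(self.dec_h(torch.cat([u, q], dim=1)))
        p_hat = torch.tanh(self.dec_p(q))
        return u, q, h_hat, p_hat

# Example: 6-dimensional features from the preceding SAE (a hypothetical size),
# 4 speakers as a one-hot parametric bias, and a 3 + 3 split of the hidden layer.
layer = PBHLLayer(dim_h=6, dim_p=4, dim_u=3, dim_q=3)
optimizer = torch.optim.Adam(layer.parameters(), lr=1e-3)

h = torch.rand(256, 6) * 2 - 1                   # stand-in features in [-1, 1]
p = torch.eye(4)[torch.randint(0, 4, (256,))]    # one-hot speaker codes
for _ in range(200):
    optimizer.zero_grad()
    u, q, h_hat, p_hat = layer(h, p)
    loss = ((h - h_hat) ** 2).sum(dim=1).mean() + ((p - p_hat) ** 2).sum(dim=1).mean()
    loss.backward()
    optimizer.step()

# After training, only the speaker-independent part u is passed to NPB-DAA.
with torch.no_grad():
    u, _, _, _ = layer(h, p)
```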

4 Experiment

To evaluate the proposed method, we conducted two experiments. First, we tested whether DSAE-PBHL can extract speaker-independent feature representations using speech signals representing isolated Japanese vowels and an elementary clustering method. Secondly, we tested whether NPB-DAA with DSAE-PBHL can successfully perform unsupervised phoneme and word discovery from speech signals obtained from multiple speakers.

4.1 Common conditions

In the following two experiments, we used the common dataset. The procedure of creating the data is identical to that in the previous related papers (Taniguchi et al., 2016b, c).

We asked two male and two female Japanese speakers to read 30 artificial sentences aloud once at a natural speed and recorded their voices using a microphone. In total, 120 audio data items were recorded. We name the two female datasets K-DATA and M-DATA and the two male datasets H-DATA and N-DATA.

The 30 artificial sentences were prepared using five artificial words {aioi, aue, ao, ie, uo} consisting of five Japanese vowels {a, i, u, e, o}. By reordering the words, 25 two-word sentences, e.g., “ao aioi,” “uo aue,” and “aioi aioi,” and five three-word sentences, i.e., “uo aue ie,” “ie ie uo,” “aue ao ie,” “ao ie ao,” and “aioi uo ie,” were prepared. The set of two-word sentences comprised all the feasible ordered pairs of the five words (5 × 5 = 25). The set of three-word sentences was determined manually.
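For reference, the two-word sentence set can be enumerated directly from the word list; the following sketch simply reproduces the reordering described above.

```python
from itertools import product

words = ["aioi", "aue", "ao", "ie", "uo"]

# All ordered pairs of the five words give the 25 two-word sentences.
two_word_sentences = [" ".join(pair) for pair in product(words, repeat=2)]

# The five three-word sentences were fixed manually in this study.
three_word_sentences = ["uo aue ie", "ie ie uo", "aue ao ie", "ao ie ao", "aioi uo ie"]

print(len(two_word_sentences))   # 25
```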

The input speech signals were provided as MFCCs, which have been widely used in ASR studies. The recorded data were encoded into MFCC time series data using the Hidden Markov Model Toolkit (HTK, http://htk.eng.cam.ac.uk/). The frame size and shift were set to ms and ms, respectively. Twelve-dimensional MFCC data were obtained as the input data by eliminating the power information from the original 13-dimensional MFCC data. As a result, 12-dimensional time series data at a frame rate of Hz were obtained.
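Although the features in this study were extracted with HTK, a comparable MFCC pipeline can be sketched with librosa as follows. The file name, sampling rate, frame parameters, and the use of delta and delta-delta features to reach 39 dimensions are assumptions for illustration, not the exact HTK configuration used here.

```python
import librosa
import numpy as np

# Load one recorded utterance (path and sampling rate are placeholders).
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per frame; frame length and shift below are illustrative values.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

# Delta and delta-delta features are commonly appended to obtain a 39-dimensional vector.
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, delta, delta2]).T   # shape: (num_frames, 39)
```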

In DSAE-PBHL, the 39-dimensional MFCC features were compressed by the DSAE. The speaker index was provided to the final layer as a four-dimensional parametric bias input. In the final layer, the dimensions of $\boldsymbol{u}_t$ and $\boldsymbol{q}_t$ were three and three, respectively. We used $\boldsymbol{u}_t$ as the input of the clustering methods, e.g., k-means, GMM, and NPB-DAA.

In DSAE, the 39-dimensional MFCC features were likewise compressed by the DSAE, without the parametric bias input. The DSAE hyperparameters $\alpha$, $\beta$, and $\rho$ (Section 3.2.1) were set to fixed values.

4.2 Experiment 1: Vowel clustering based on DSAE-PBHL

This experiment evaluates whether DSAE-PBHL can extract speaker-independent representations, from the perspective of a phoneme clustering task rather than a word discovery task.

4.2.1 Conditions

For quantitative evaluation, we applied two elementary clustering methods (k-means and GMM) to the extracted feature vectors to examine whether the DSAE-PBHL extracts speaker-independent feature representations. If the elementary clustering methods can identify clusters corresponding to each vowel, it implies that each phoneme forms clustered distributions to a certain extent. The clustering performance was quantified with the adjusted Rand index (ARI), which is a standard evaluation criterion of clustering.

We also tested three types of coding of parametric bias, i.e., sparse coding and codings 1 and 2 (Table 1).

As a baseline method, we employed DSAE.

Furthermore, we applied DSAE and the clustering methods separately to the four datasets (H-DATA, K-DATA, M-DATA, and N-DATA) and calculated the average of the ARI. This result can be considered as an upper limit of the performance.

The implementations in scikit-learn (http://scikit-learn.org/stable/) were used for k-means and GMM. The number of clusters was fixed to five, i.e., the exact number of vowels. For the other hyperparameters, the default settings of scikit-learn were used.
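The evaluation described above can be reproduced with scikit-learn as in the following sketch; the feature vectors and ground-truth vowel labels are replaced here by random stand-in data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

# Stand-in data: feature vectors (e.g., the speaker-independent DSAE-PBHL features)
# and their ground-truth vowel labels {a, i, u, e, o} encoded as integers 0-4.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 3))
true_labels = rng.integers(0, 5, size=500)

# k-means with the true number of clusters (five vowels).
kmeans_labels = KMeans(n_clusters=5, random_state=0).fit_predict(features)
print("k-means ARI:", adjusted_rand_score(true_labels, kmeans_labels))

# GMM with five components.
gmm = GaussianMixture(n_components=5, random_state=0).fit(features)
print("GMM ARI:", adjusted_rand_score(true_labels, gmm.predict(features)))
```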

4.2.2 Results

Table 1 presents the ARI averaged over 20 trials for k-means and GMM for each method. This result demonstrates that DSAE-PBHL exhibited significantly higher performance than DSAE and MFCC in the representation learning of acoustic features from multiple speakers for phoneme clustering. Among the three coding methods, sparse coding, i.e., the one-hot vector, achieved the best score. In numerous cases in deep learning, sparse coding exhibits effective characteristics; therefore, this result appears consistent. Even with the other encoding methods, however, DSAE-PBHL outperformed the other methods. As expected, DSAE-PBHL did not attain the upper limit, although it reduced the gap.

Method                      | k-means | GMM   | PB: [H-PB], [K-PB], [M-PB], [N-PB]
DSAE-PBHL (Sparse Coding)   | 0.536   | 0.519 | [0,0,0,1], [0,0,1,0], [0,1,0,0], [1,0,0,0]
DSAE-PBHL (Coding 1)        | 0.514   | 0.429 | [0,0,0,1], [0,0,1,0], [0,0,1,1], [0,1,0,0]
DSAE-PBHL (Coding 2)        | 0.448   | 0.362 | [0,0,1,1], [0,1,1,0], [1,1,0,0], [1,0,0,1]
DSAE                        | 0.212   | 0.222 |
MFCC                        | 0.243   | 0.182 |
Upper Limit                 | 0.626   | 0.599 |
Table 1: ARI in the phoneme clustering task

Figures 6, 7, 8, and 9 visualize the feature representations extracted by DSAE and by DSAE-PBHL with the three types of codings. The final three-dimensional representation is mapped to a two-dimensional space using principal component analysis (PCA) for visualization. In each figure, the plot on the left is the scatter plot of the data from the four speakers, and the one on the right is the scatter plot of the data from H-DATA and K-DATA, i.e., a male and a female speaker.
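The two-dimensional projections shown in Figures 6-9 can be produced with a standard PCA projection, e.g., as in the following sketch with stand-in data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Stand-in data: three-dimensional features from the final layer and their vowel labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 3))
vowel_labels = rng.integers(0, 5, size=500)

# Project the final three-dimensional representation onto its first two principal components.
projected = PCA(n_components=2).fit_transform(features)
plt.scatter(projected[:, 0], projected[:, 1], c=vowel_labels, cmap="tab10", s=8)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```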

It was observed that DSAE formed speaker-dependent distributions (see Figure 6). For example, “a” from H-DATA and “a” from K-DATA formed entirely different clusters in the feature space.

In contrast, DSAE-PBHL could form speaker-independent representations to a certain extent.

Figure 6: Feature representations extracted by DSAE visualized using PCA. (Left) all data, (right) H-DATA and K-DATA.
Figure 7: Feature representations extracted by DSAE-PBHL (Sparse Coding) visualized using PCA. (Left) all data, (right) H-DATA and K-DATA.
Figure 8: Feature representations extracted by DSAE-PBHL (Coding 1) visualized with PCA. (Left) all data, (right) H-DATA and K-DATA.
Figure 9: Feature representations extracted by DSAE-PBHL (Coding 2) visualized with PCA. (Left) all data, (right) H-DATA and K-DATA.

4.3 Experiment 2: simultaneous phoneme and word discovery from multiple speakers using NPB-DAA with DSAE-PBHL

This experiment evaluates whether NPB-DAA with DSAE-PBHL can discover phonemes and words from speech signals from multiple speakers.

4.3.1 Conditions

The hyperparameters for the latent language model were set to and ; the maximum number of words was set to seven for weak-limit approximation. The hyperparameters of the duration distributions were set to and ; those of the emission distributions were set to and dimension.

The Gibbs sampling procedure was iterated 100 times for NPB-DAA. Twenty trials were performed using different random number seeds. Sparse coding of parametric bias was employed as the coding method of speaker index.

We compared NPB-DAA with DSAE-PBHL, NPB-DAA with MFCC, and NPB-DAA with DSAE. Similarly as in Experiment 1, we calculated the performance of NPB-DAA with DSAE trained on each speaker separately as an upper limit of the model. Moreover, we used the off-the-shelf speech recognition system Julius (http://julius.sourceforge.jp/) with a pre-existing true dictionary consisting of {aioi, aue, ao, ie, uo} to output reference values of the ARIs. We used two types of Julius: one is the HMM-based Julius, and the other is the deep neural network (DNN)-based Julius, namely Julius DNN.

4.3.2 Results

As in Experiment 1, Table 2 presents the ARIs for each condition. The rows marked (MAP) list the score of the sample with the highest likelihood among the NPB-DAA trials; the other rows list the average score of the 20 trials. The column SS represents the single-speaker setting, in which speech signals from different speakers are input separately and learned independently; this condition is considered an upper limit of the proposed model. The columns AM and LM indicate whether the method uses pre-trained acoustic and language models, i.e., transcribed data, respectively.

This demonstrates that NPB-DAA with DSAE-PBHL (MAP), i.e., our proposed method, outperformed the previous models, although it did not outperform the upper-limit method or Julius DNN. Nevertheless, it is noteworthy that the speaker-dependent NPB-DAA with DSAE (MAP), i.e., the upper limit, outperformed Julius, which was trained in a supervised manner.

Method                                                    | Letter ARI | Word ARI | SS | AM | LM
NPB-DAA with DSAE-PBHL (MAP)                              | 0.597      | 0.373    |    |    |
NPB-DAA with DSAE-PBHL                                    | 0.445      | 0.308    |    |    |
NPB-DAA with DSAE (MAP)                                   | 0.161      | 0.073    |    |    |
NPB-DAA with DSAE                                         | 0.234      | 0.139    |    |    |
NPB-DAA with MFCC (MAP)                                   | 0.281      | 0.115    |    |    |
NPB-DAA with MFCC                                         | 0.297      | 0.104    |    |    |
Upper Limit (speaker-dependent): NPB-DAA with DSAE (MAP)  | 0.621      | 0.627    | ✓  |    |
Upper Limit (speaker-dependent): NPB-DAA with DSAE        | 0.523      | 0.448    | ✓  |    |
Julius (triphone + word dictionary)                       | 0.552      | 0.599    |    | ✓  | ✓
Julius DNN (triphone + word dictionary)                   | 0.693      | 0.791    |    | ✓  | ✓
Table 2: ARIs in the phoneme and word discovery task

This result indicates that DSAE-PBHL can reduce the adverse effect of obtaining speech signals from multiple speakers and that its combination with NPB-DAA can achieve direct phoneme and word discovery from speech signals obtained from multiple speakers, to a certain extent.

5 Conclusion

This paper proposed a new method, NPB-DAA with DSAE-PBHL, for direct phoneme and word discovery from multiple speakers. In particular, DSAE-PBHL was developed to reduce the negative effect of speaker-dependent acoustic features in an unsupervised manner by using a speaker index that is assumed to be obtained through a separate speaker recognition method. This can be regarded as a more natural computational model of phoneme and word discovery by a human infant because it does not use transcription. Human infants can acquire knowledge of phonemes and words from interactions not only with their mothers but also with other individuals surrounding them. We assumed that an infant can recognize and distinguish speakers by considering certain other features, e.g., visual face recognition. The study was aimed at enabling DSAE-PBHL to subtract speaker-dependent acoustic features and extract speaker-independent features. The first experiment demonstrated that DSAE-PBHL can subtract speaker-dependent information from distributed representations of acoustic signals, enabling the extraction of speaker-independent feature representations to a certain extent; the performance was quantitatively evaluated, and the extracted representations depended on the types of phonemes rather than on the speakers. The second experiment demonstrated that the combination of NPB-DAA and DSAE-PBHL outperformed the available unsupervised learning methods in phoneme and word discovery tasks with speech signals containing Japanese vowel sequences from multiple speakers.

The future challenges are as follows. The experiments were performed on vowel signals; applying NPB-DAA to more natural speech corpora, which involve consonants exhibiting more dynamic features than vowels, is our future challenge. Achieving unsupervised phoneme and word discovery from natural corpora including consonants and common vocabularies continues to be a challenging problem. Tada et al. applied NPB-DAA with a variety of feature extraction methods (Tada et al., 2017), but they obtained limited performance. Therefore, in this study, we focused on vowel data. Extending our studies to more natural spoken language is one of our intentions.

Applying the method to larger corpora is another challenge. In this regard, the computational cost is high, and methods to address data from multiple speakers have been problematic. We consider our proposed method to have overcome one of these barriers. Recently, Ozaki and Taniguchi significantly reduced the computational cost of NPB-DAA (Ozaki and Taniguchi, 2018). Therefore, we consider our contribution to be effective for the further study of unsupervised phoneme and word discovery.

This paper proposed DSAE-PBHL as a proof of concept. DSAE-PBHL can be regarded as a type of conditioned neural network. Recently, the relationship between autoencoders and probabilistic generative models has been clarified via the variational autoencoder (Kingma and Welling, 2013). From a broader perspective, our proposal is to use conditioned deep generative models to obtain disentangled representations, i.e., to extract speaker-independent acoustic representations. In the field of speech synthesis, voice conversion methods using generative adversarial networks have been studied (Kameoka et al., 2018). We intend to explore the relationship between our proposal and such studies and to integrate them in future work.

In the current model, DSAE-PBHL and NPB-DAA are trained separately. However, as end-to-end learning in numerous deep learning-based models has indicated, the simultaneous optimization of the feature extractor and the post-processing is essential. We also intend to study the simultaneous optimization of representation learning and phoneme and word discovery in the future.

Funding

This work was supported by MEXT/JSPS KAKENHI Grant Numbers 16H06569 in #4805 (Correspondence and Fusion of Artificial Intelligence and Brain Science) and 15H05319.

Data Availability Statement

The datasets used for this study are available in our GitHub repository.

References

  • Amodei et al. (2016) Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., et al. (2016). Deep speech 2: End-to-end speech recognition in english and mandarin. In International conference on machine learning. 173–182
  • Aslin et al. (1995) Aslin, R. N., Woodward, J. Z., LaMendola, N. P., and Bever, T. G. (1995). Models of word segmentation in fluent maternal speech to infants. In Signal to syntax: Bootstrapping from speech to grammar in early acquisition, eds. J. L. Morgan and K. Demuth (Psychology Press). 117–134
  • Bengio (2009a) Bengio, Y. (2009a). Learning Deep Architectures for AI. Foundations and Trends in Machine Learning 2, 1–127
  • Bengio (2009b) Bengio, Y. (2009b). Learning deep architectures for ai. Foundations and Trends in Machine Learning 2, 1–127. doi:10.1561/2200000006
  • Brent (1999) Brent, M. R. (1999). An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning 34, 71–105
  • Chan et al. (2016) Chan, W., Jaitly, N., Le, Q., and Vinyals, O. (2016). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE), 4960–4964
  • Chandler (2002) Chandler, D. (2002). Semiotics the Basics (Routledge)
  • Chen et al. (2014) Chen, M., Chang, B., and Pei, W. (2014). A joint model for unsupervised Chinese word segmentation. In Conference on Empirical Methods in Natural Language Processing (EMNLP). 854–863
  • Chen et al. (2016) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. (2016). Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems. 2172–2180
  • Chiu et al. (2018) Chiu, C.-C., Sainath, T. N., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., et al. (2018). State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE), 4774–4778
  • Dahl et al. (2012) Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2012). Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition. IEEE Transactions on Audio, Speech, and Language Processing 20, 30–42
  • Dunbar et al. (2017) Dunbar, E., Cao, X. N., Benjumea, J., Karadayi, J., Bernard, M., Besacier, L., et al. (2017). The zero resource speech challenge 2017. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (IEEE), 323–330
  • Feldman et al. (2013) Feldman, N. H., Griffiths, T. L., Goldwater, S., and Morgan, J. L. (2013). A role for the developing lexicon in phonetic category acquisition. Psychological review 120, 751–78
  • Goldwater et al. (2009) Goldwater, S., Griffiths, T. L., and Johnson, M. (2009). A Bayesian framework for word segmentation: exploring the effects of context. Cognition 112, 21–54
  • Goldwater et al. (2006) Goldwater, S., Griffiths, T. L., Johnson, M., and Griffiths, T. (2006). Contextual dependencies in unsupervised word segmentation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. 673–680
  • Hannun et al. (2014) Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., et al. (2014). Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567
  • Higgins et al. (2017) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., et al. (2017). beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations
  • Hinton and Salakhutdinov (2006) Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science 313, 504–507
  • Johnson and Goldwater (2009) Johnson, M. and Goldwater, S. (2009). Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars. In Annual Conference of the North American Chapter of the Association for Computational Linguistics. 317–325
  • Johnson and Willsky (2013) Johnson, M. J. and Willsky, A. S. (2013). Bayesian Nonparametric Hidden Semi-Markov Models. Journal of Machine Learning Research 14, 673–701
  • Kameoka et al. (2018) Kameoka, H., Kaneko, T., Tanaka, K., and Hojo, N. (2018). Stargan-vc: Non-parallel many-to-many voice conversion with star generative adversarial networks. arXiv preprint arXiv:1806.02169
  • Kamper et al. (2015) Kamper, H., Jansen, A., and Goldwater, S. (2015). Fully Unsupervised Small-Vocabulary Speech Recognition Using a Segmental Bayesian Model. In INTERSPEECH
  • Kamper et al. (2017) Kamper, H., Livescu, K., and Goldwater, S. (2017). An embedded segmental k-means model for unsupervised segmentation and clustering of speech. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (IEEE), 719–726
  • Kawaharay et al. (2000) Kawaharay, T., Lee, A., Kobayashi, T., Takeda, K., Minematsu, N., Sagayama, S., et al. (2000). Free software toolkit for Japanese large vocabulary continuous speech recognition. In International Conference on Spoken Language Processing (ICSLP). 3073–3076
  • Kingma and Welling (2013) Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. In Advances In Neural Information Processing Systems (NIPS). 1–9
  • Le et al. (2011) Le, Q. V., Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G. S., et al. (2011). Building high-level features using large scale unsupervised learning. In International Conference in Machine Learning (ICML)
  • Lee et al. (2015) Lee, C.-y., Donnell, T. J. O., and Glass, J. (2015). Unsupervised Lexicon Discovery from Acoustic Input. Transactions of the Association for Computational Linguistics 3, 389–403
  • Lee and Glass (2012) Lee, C.-Y. and Glass, J. (2012). A nonparametric Bayesian approach to acoustic model discovery. In Annual Meeting of the Association for Computational Linguistics:. 40–49
  • Lee et al. (2013) Lee, C.-y., Zhang, Y., and Glass, J. (2013). Joint learning of phonetic units and word pronunciations for ASR. In Conference on Empirical Methods in Natural Language Processing. 182–192
  • Liu et al. (2014) Liu, H., Taniguchi, T., Takano, T., Tanaka, Y., Takenaka, K., and Bando, T. (2014). Visualization of driving behavior using deep sparse autoencoder. In IEEE Intelligent Vehicles Symposium (IV). 1427–1434. doi:10.1109/IVS.2014.6856506
  • Liu et al. (2015) Liu, H., Taniguchi, T., Tanaka, Y., Takenaka, K., and Bando, T. (2015). Essential feature extraction of driving behavior using a deep learning method. In IEEE Intelligent Vehicles Symposium (IV)
  • Magistry (2012) Magistry, P. (2012). Unsupervized word segmentation : the case for Mandarin Chinese. In Annual Meeting of the Association for Computational Linguistics. vol. 2, 383–387
  • Mikolov et al. (2013a) Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
  • Mikolov et al. (2013b) Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119
  • Mochihashi et al. (2009) Mochihashi, D., Yamada, T., and Ueda, N. (2009). Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP). 100–108
  • Neubig et al. (2012) Neubig, G., Mimura, M., Mori, S., and Kawahara, T. (2012). Bayesian learning of a language model from continuous speech. IEICE Transactions on Information and Systems E95-D, 614–625
  • Ng (2011a) Ng, A. (2011a). Sparse autoencoder. CS294A Lecture notes , 1–19
  • Ng (2011b) Ng, A. (2011b). Sparse autoencoder. CS294A Lecture notes 72
  • Ogata et al. (2007) Ogata, T., Murase, M., Tani, J., Komatani, K., and Okuno, H. G. (2007). Two-way translation of compound sentences and arm motions by recurrent neural networks. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems (IEEE), 1858–1863
  • Rumelhart et al. (1985) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1985). Learning internal representations by error propagation. Tech. rep., DTIC Document
  • Ozaki and Taniguchi (2018) Ozaki, R. and Taniguchi, T. (2018). Accelerated nonparametric bayesian double articulation analyzer for unsupervised word discovery. In The 8th Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics 2018 (Tokyo, Japan), 238–244
  • Saffran et al. (1996a) Saffran, J. R., Aslin, R. N., and Newport, E. L. (1996a). Statistical learning by 8-month-old infants. Science 274, 1926–1928
  • Saffran et al. (1996b) Saffran, J. R., Newport, E. L., and Aslin, R. N. (1996b). Word Segmentation: The Role of Distributional Cues. Journal of Memory and Language 35, 606–621
  • Sakti et al. (2011) Sakti, S., Finch, A., Isotani, R., Kawai, H., and Nakamura, S. (2011). Unsupervised determination of efficient Korean LVCSR units using a Bayesian Dirichlet process model. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4664–4667
  • Sugiura et al. (2015) Sugiura, K., Shiga, Y., Kawai, H., Misu, T., and Hori, C. (2015). A cloud robotics approach towards dialogue-oriented robot speech. Advanced Robotics 29, 449–456
  • Takeda and Komatani (2017) Takeda, R. and Komatani, K. (2017). Unsupervised segmentation of phoneme sequences based on pitman-yor semi-markov model using phoneme length context. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers). vol. 1, 243–252
  • Tani et al. (2004) Tani, J., Ito, M., and Sugita, Y. (2004). Self-organization of distributedly represented multiple behavior schemata in a mirror system: reviews of robot experiments using rnnpb. Neural Networks 17, 1273–1289
  • Taniguchi et al. (2016a) Taniguchi, T., Nagai, T., Nakamura, T., Iwahashi, N., Ogata, T., and Asoh, H. (2016a). Symbol emergence in robotics: A survey. Advanced Robotics 30, 706–728
  • Taniguchi et al. (2016b) Taniguchi, T., Nagasaka, S., and Nakashima, R. (2016b). Nonparametric bayesian double articulation analyzer for direct language acquisition from continuous speech signals. IEEE Transactions on Cognitive and Developmental Systems 8, 171–185. doi:10.1109/TCDS.2016.2550591
  • Taniguchi et al. (2016c) Taniguchi, T., Nakashima, R., Liu, H., and Nagasaka, S. (2016c). Double articulation analyzer with deep sparse autoencoder for unsupervised word discovery from speech signals. Advanced Robotics 30, 770–783. doi:10.1080/01691864.2016.1159981
  • Thiessen and Saffran (2003) Thiessen, E. D. and Saffran, J. R. (2003). When cues collide: use of stress and statistical cues to word boundaries by 7- to 9-month-old infants. Developmental psychology 39, 706–716
  • Venkataraman (2001) Venkataraman, A. (2001). A statistical model for word discovery in transcribed speech. Computational Linguistics 27, 351–372
  • Yokoya et al. (2007) Yokoya, R., Ogata, T., Tani, J., Komatani, K., and Okuno, H. G. (2007). Experience-based imitation using rnnpb. Advanced Robotics 21, 1351–1367
  • Tada et al. (2017) Tada, Y., Hagiwara, Y., and Taniguchi, T. (2017). Comparative study of feature extraction methods for direct word discovery with npb-daa from natural speech signals. In IEEE International Conference on Development and Learning and on Epigenetic Robotics (ICDL-EpiRob) (Lisbon, Portugal)