# Powering Hidden Markov Model by Neural Network based Generative Models

## Abstract

The hidden Markov model (HMM) has been successfully used for sequential data modeling problems. In this work, we propose to power the modeling capacity of the HMM by bringing in neural network based generative models. The proposed model is termed GenHMM. In GenHMM, each HMM hidden state is associated with a neural network based generative model whose likelihood is exactly tractable and efficient to compute. A generative model in GenHMM consists of a mixture of generators that are realized by flow models. A learning algorithm for GenHMM is proposed in the expectation-maximization framework, and its convergence is analyzed. We demonstrate the efficiency of GenHMM on classification tasks with practical sequential data.

## 1 Introduction

Sequential data modeling is a challenging topic in pattern recognition and machine learning. For many applications, the assumption of independent and identically distributed (i.i.d.) data points is too strong to model data properly. Hidden Markov model (HMM) is a classic way to model sequential data without the i.i.d. assumption. HMM has been widely used in different practical problems, including applications in reinforcement learning [7, 19], natural language modeling [15, 12], biological sequence analysis such as proteins [1] and DNA [24], etc.

An HMM is a statistical representation of a sequential data generating process. Each state of an HMM is associated with a probabilistic model that represents the relationship between that state and the sequential data input. The typical choice is a Gaussian mixture model (GMM) per state [2], where the GMMs connect the states of the HMM to the sequential data input. The GMM based HMM (GMM-HMM) has become a standard model for sequential data and has been widely employed in practical applications, especially in speech recognition [10, 5].

Despite the success of GMM-HMM, it is not efficient for modeling data on a nonlinear manifold. Research attempts at training HMMs with neural networks have been made to boost the modeling capacity of the HMM. A successful line of work brings a deep neural network (DNN) defined by restricted Boltzmann machines (RBMs) [14] into HMM based models [13, 20, 23]. An RBM based HMM is trained with a hierarchical scheme consisting of multiple steps of unsupervised learning, construction of a classification network, and then supervised learning. The hierarchical procedure comes from empirical expertise in this domain. More specifically, the hierarchical learning scheme of RBM/DNN based HMM consists of: i) RBMs are trained one after the other in an unsupervised fashion and stacked together as one deep neural network model; ii) a final softmax layer is added to the stack of RBMs to represent the probability of an HMM state given a data input; iii) discriminative training is performed for the final tuning of the model.

Another track of related work is the hybrid of temporal neural network models and HMM. In [21, 4, 18], a long short-term memory (LSTM) model or recurrent neural network (RNN) is combined with an HMM as a hybrid. Hierarchical training is carried out by: i) training an HMM first; ii) then performing modified training of the LSTM using the trained HMM. This hierarchical training procedure is motivated by the intuition of using the LSTM or RNN to fill in the gap where the HMM cannot learn.

The above works improve the modeling capacity of HMM based models by bringing in neural networks. A softmax layer is usually used to represent probability whenever a conditional distribution is needed. These hierarchical schemes are built on intuition from domain knowledge, so training them usually requires expertise in the specific area to carry out the hierarchical training procedure and to apply the resulting model.

In this work, we propose a generative model based HMM, termed as GenHMM. Specifically, a generative model in our GenHMM is generator-mixed, where a generator is realized by a neural network to help the model gain high modeling capacity. Our proposed model, GenHMM,

• has high modeling capacity of sequential data, due to the neural network based generators;

• is easy to train. Training of GenHMM employs expectation maximization (EM) framework. Therefore, training a GenHMM is as easy as training a GMM-HMM model, while configuration of GenHMM is flexible;

• is able to compute loglikelihood exactly and efficiently.

Instead of using softmax for probability representation, our GenHMM has tractability of exact loglikelihood of given sequential data, which is based on the change of variable formula. To make the loglikelihood computation efficient, neural network based generators of GenHMM are realized as flow models.

Our contributions in the paper are as follows.

• Proposing a neural network based HMM for sequential data modeling, i.e. GenHMM. GenHMM has the tractability of exact likelihood.

• Designing practical algorithm for training GenHMM under EM framework. Stochastic gradient search in batch fashion is embedded in this algorithm.

• Giving convergence analysis for GenHMM under the proposed learning algorithm.

• Verifying the proposed model on practical sequential data.

## 2 Generator-mixed HMM (GenHMM)

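Our framework is an HMM. An HMM $\boldsymbol{H}$ defined in a hypothesis space $\mathcal{H}$, i.e. $\boldsymbol{H} \in \mathcal{H}$, is capable of modeling a time-span signal $\underline{x} = [x_1, \cdots, x_T]^\intercal$, where $x_t \in \mathbb{R}^N$ is the $N$-dimensional signal at time $t$, $[\cdot]^\intercal$ denotes transpose, and $T$ denotes the time length (see footnote 2). We define the hypothesis set of HMM as $\mathcal{H} := \{\boldsymbol{H} \mid \boldsymbol{H} = \{\mathcal{S}, q, A, p(x|s;\Phi_s)\}\}$, where

• $\mathcal{S}$ is the set of hidden states of $\boldsymbol{H}$.

• $q = [q_1, q_2, \cdots, q_{|\mathcal{S}|}]^\intercal$ is the initial state distribution of $\boldsymbol{H}$, with $|\mathcal{S}|$ as the cardinality of $\mathcal{S}$. For $i \in \mathcal{S}$, $q_i = p(s_1 = i; \boldsymbol{H})$. We use $s_t$ to denote the state at time $t$.

• matrix $A$ of size $|\mathcal{S}| \times |\mathcal{S}|$ is the transition matrix of states in $\boldsymbol{H}$. That is, $\forall i, j \in \mathcal{S}$, $A_{i,j} = p(s_{t+1} = j \mid s_t = i; \boldsymbol{H})$.

• For a given hidden state $s$, the density function of the observable signal is $p(x|s;\Phi_s)$, where $\Phi_s$ is the parameter set that defines this probabilistic model. Denote $\Phi = \{\Phi_s \mid s \in \mathcal{S}\}$.

Using an HMM for signal representation is illustrated in Figure 1. The model assumption is that each instant signal of $\underline{x}$ is generated by a signal source associated with a hidden state of the HMM. In the framework of HMM, at each time instance $t$, signal $x_t$ is assumed to be generated by a distribution with density function $p(x_t|s_t;\Phi_{s_t})$, and $s_t$ is decided by the hidden Markov process. Putting these together gives us the probabilistic model $p(\underline{x};\boldsymbol{H})$.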
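The generating process described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `sample_hmm`, the two-state parameters, and the 1-D Gaussian emission are all assumptions made for the example (GenHMM itself uses a mixture of flow-based generators per state).

```python
import random

def sample_hmm(q, A, emit, T, rng):
    """Draw a state sequence and a signal of length T from an HMM.

    q    : initial state distribution (list of probabilities)
    A    : row-stochastic transition matrix (list of rows)
    emit : emit(s, rng) draws one observation from the model of state s
    """
    states, signal = [], []
    s = rng.choices(range(len(q)), weights=q)[0]          # s_1 ~ q
    for _ in range(T):
        states.append(s)
        signal.append(emit(s, rng))                        # x_t ~ p(x | s_t)
        s = rng.choices(range(len(q)), weights=A[s])[0]    # s_{t+1} ~ A[s_t, :]
    return states, signal

# Hypothetical 2-state example; the Gaussian emission is a stand-in only.
q = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
means = [-2.0, 2.0]

def emit(s, rng):
    return rng.gauss(means[s], 1.0)

states, signal = sample_hmm(q, A, emit, T=10, rng=random.Random(0))
```

Each observation depends on the sequence only through the hidden state active at that time, which is the conditional independence that the learning procedure relies on.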

### 2.1 Generative Model of GenHMM

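In this section, we introduce the neural network based state probabilistic model of our GenHMM. Recall that $x \in \mathbb{R}^N$. Subscript $t$ is omitted when it does not cause ambiguity. The probabilistic model of GenHMM for each hidden state is a mixture of $K$ neural network based generators, where $K$ is a positive integer. The probabilistic model of a state $s \in \mathcal{S}$ is then given by

$$p(x|s;\Phi_s) = \sum_{\kappa=1}^{K} \pi_{s,\kappa}\, p(x|s,\kappa;\theta_{s,\kappa}), \tag{1}$$

where $\kappa$ is a random variable following a categorical distribution, with probability $\pi_{s,\kappa} = p(\kappa|s;\boldsymbol{H})$. Naturally $\sum_{\kappa=1}^{K} \pi_{s,\kappa} = 1$. Denote $\pi_s = [\pi_{s,1}, \cdots, \pi_{s,K}]^\intercal$. In (1), $p(x|s,\kappa;\theta_{s,\kappa})$ is defined as the distribution induced by a generator $g_{s,\kappa}: \mathbb{R}^N \to \mathbb{R}^N$, such that $x = g_{s,\kappa}(z)$, where $z$ is a latent variable following a distribution with density function $p_{s,\kappa}(z)$. Generator $g_{s,\kappa}$ is parameterized by $\theta_{s,\kappa}$. Let us denote the collection of the parameter sets of generators for state $s$ as $\theta_s = \{\theta_{s,\kappa} \mid \kappa = 1, \cdots, K\}$. Assuming $g_{s,\kappa}$ is invertible, by change of variable, we have

$$p(x|s,\kappa;\theta_{s,\kappa}) = p_{s,\kappa}(z) \left| \det\left( \frac{\partial g_{s,\kappa}(z)}{\partial z} \right) \right|^{-1}. \tag{2}$$

The signal flow of the probability distribution for a state of GenHMM is shown in Figure 2, in which the generator identity is determined by the random variable $\kappa$.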
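To make (1) and (2) concrete, the sketch below evaluates the state density for a toy one-dimensional case. It assumes, purely for illustration, an affine generator $g(z) = az + b$ with a standard Gaussian latent variable; the function names and parameter values are hypothetical and are not taken from the paper.

```python
import math

def std_normal_logpdf(z):
    """Log-density of the standard Gaussian latent variable."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def generator_logpdf(x, a, b):
    """log p(x | s, kappa) for the toy generator g(z) = a*z + b, following (2)."""
    z = (x - b) / a                                   # z = g^{-1}(x)
    return std_normal_logpdf(z) - math.log(abs(a))    # log p(z) - log|dg/dz|

def state_logpdf(x, weights, params):
    """log p(x | s) for the mixture of generators in (1), via log-sum-exp."""
    terms = [math.log(w) + generator_logpdf(x, a, b)
             for w, (a, b) in zip(weights, params)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Hypothetical state with K = 2 generators.
weights = [0.3, 0.7]
params = [(1.0, -1.0), (2.0, 3.0)]     # (a, b) for each generator
log_density = state_logpdf(0.7, weights, params)
```

With an affine generator the induced density is simply a Gaussian with mean $b$ and standard deviation $|a|$, which makes the result easy to check by hand; a flow model replaces the affine map with a deeper invertible network.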

### 2.2 Learning in EM framework

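Assume the sequential signal $\underline{x}$ follows an unknown distribution $p(\underline{x})$. We would like to use GenHMM to model this distribution. Equivalently, we are looking for the solution to the problem

$$\min_{\boldsymbol{H} \in \mathcal{H}} KL\big( p(\underline{x}) \,\|\, p(\underline{x};\boldsymbol{H}) \big), \tag{3}$$

where $KL(\cdot\|\cdot)$ denotes the Kullback-Leibler divergence. For practical consideration, we only have access to samples of $p(\underline{x})$, i.e. the dataset drawn from this distribution. For the given dataset, we denote its empirical distribution by $\hat{p}(\underline{x}) = \frac{1}{R}\sum_{r=1}^{R} \delta_{\underline{x}^r}(\underline{x})$, where $R$ denotes the total number of sequential samples and superscript $(\cdot)^r$ denotes the index of the $r$-th sequential signal. The KL divergence minimization problem can be reduced to a likelihood maximization problem

$$\underset{\boldsymbol{H} \in \mathcal{H}}{\arg\max}\ \frac{1}{R}\sum_{r=1}^{R} \log p(\underline{x}^r;\boldsymbol{H}). \tag{4}$$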
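The reduction from the KL divergence to the likelihood can be spelled out in one step. The following is an informal sketch: it replaces the unknown distribution by the empirical distribution of the dataset and treats the first term as a constant that does not depend on the model, which is a simplification because the empirical distribution is a sum of point masses.

```latex
\begin{aligned}
KL\big(\hat{p}(\underline{x}) \,\|\, p(\underline{x};\boldsymbol{H})\big)
&= \mathbb{E}_{\hat{p}(\underline{x})}\big[\log \hat{p}(\underline{x})\big]
 - \mathbb{E}_{\hat{p}(\underline{x})}\big[\log p(\underline{x};\boldsymbol{H})\big] \\
&= \text{const} - \frac{1}{R}\sum_{r=1}^{R} \log p(\underline{x}^r;\boldsymbol{H}).
\end{aligned}
```

Minimizing the left-hand side over the model is therefore the same as maximizing the average loglikelihood in (4).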

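For the likelihood maximization, the first problem that we need to address is the hidden sequential variables of model $\boldsymbol{H}$, namely $\underline{s}$ and $\underline{\kappa}$. For a sequential observable variable $\underline{x}$, $\underline{s}$ is the hidden state sequence corresponding to $\underline{x}$, and $\underline{\kappa}$ is the hidden variable sequence representing the generator identity sequence that actually generates $\underline{x}$.

Since directly maximizing the likelihood in (4) is not an option, we address this problem in the expectation maximization (EM) framework. This divides our problem into two iterative steps: i) using the joint posterior of the hidden variable sequences $\underline{s}$ and $\underline{\kappa}$ to obtain an "expected likelihood" of the observable variable sequence $\underline{x}$, i.e. the E-step; ii) maximizing the expected likelihood with regard to (w.r.t.) the model $\boldsymbol{H}$, i.e. the M-step. Assuming model $\boldsymbol{H}$ is at a configuration $\boldsymbol{H}^{\mathrm{old}}$, we formulate these two steps as follows.

• E-step: the expected likelihood function

$$\mathcal{Q}(\boldsymbol{H};\boldsymbol{H}^{\mathrm{old}}) = \mathbb{E}_{\hat{p}(\underline{x}),\, p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})}\big[ \log p(\underline{x},\underline{s},\underline{\kappa};\boldsymbol{H}) \big], \tag{5}$$

where $\mathbb{E}_{\hat{p}(\underline{x}),\, p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})}[\cdot]$ denotes the expectation operator by distribution $\hat{p}(\underline{x})$ and $p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})$.

• M-step: the maximization step

$$\max_{\boldsymbol{H}} \mathcal{Q}(\boldsymbol{H};\boldsymbol{H}^{\mathrm{old}}). \tag{6}$$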

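The problem (6) can be reformulated as

$$\max_{\boldsymbol{H}} \mathcal{Q}(\boldsymbol{H};\boldsymbol{H}^{\mathrm{old}}) = \max_{q} \mathcal{Q}(q;\boldsymbol{H}^{\mathrm{old}}) + \max_{A} \mathcal{Q}(A;\boldsymbol{H}^{\mathrm{old}}) + \max_{\Phi} \mathcal{Q}(\Phi;\boldsymbol{H}^{\mathrm{old}}), \tag{7}$$

where the decomposed optimization problems are

$$\mathcal{Q}(q;\boldsymbol{H}^{\mathrm{old}}) = \mathbb{E}_{\hat{p}(\underline{x}),\, p(\underline{s}|\underline{x};\boldsymbol{H}^{\mathrm{old}})}\big[ \log p(s_1;\boldsymbol{H}) \big], \tag{8}$$

$$\mathcal{Q}(A;\boldsymbol{H}^{\mathrm{old}}) = \mathbb{E}_{\hat{p}(\underline{x}),\, p(\underline{s}|\underline{x};\boldsymbol{H}^{\mathrm{old}})}\Big[ \sum_{t=1}^{T-1} \log p(s_{t+1}|s_t;\boldsymbol{H}) \Big], \tag{9}$$

$$\mathcal{Q}(\Phi;\boldsymbol{H}^{\mathrm{old}}) = \mathbb{E}_{\hat{p}(\underline{x}),\, p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})}\big[ \log p(\underline{x},\underline{\kappa}|\underline{s};\boldsymbol{H}) \big]. \tag{10}$$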

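We can see that the solution of $\boldsymbol{H}$ depends on the posterior probability $p(\underline{s}|\underline{x};\boldsymbol{H}^{\mathrm{old}})$. Though the evaluation of the posterior according to Bayes' theorem is straightforward, the computational complexity of $p(\underline{s}|\underline{x};\boldsymbol{H}^{\mathrm{old}})$ grows exponentially with the length of $\underline{s}$. Therefore, we employ the forward-backward algorithm [3] to do the posterior computation efficiently. As we detail in the next section, what is actually needed to formulate the problem are the posteriors $p(s_t|\underline{x};\boldsymbol{H}^{\mathrm{old}})$ and $p(s_t, s_{t+1}|\underline{x};\boldsymbol{H}^{\mathrm{old}})$. The joint posterior $p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})$ can be computed by Bayes' rule once the posterior of the hidden state is available.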
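The forward-backward recursion mentioned above can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the authors' code: the function name `state_posteriors` is hypothetical, the per-state emission likelihoods are passed in as numbers, and all entries of the initial distribution, transition matrix and emissions are assumed strictly positive so that logarithms are defined.

```python
import math

def _lse(values):
    """Numerically stable log-sum-exp."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def state_posteriors(q, A, emis):
    """Forward-backward in the log domain.

    emis[t][i] = p(x_t | s_t = i). Returns
      gamma[t][i] = p(s_t = i | x),
      xi[t][i][j] = p(s_t = i, s_{t+1} = j | x),
      loglik      = log p(x).
    """
    T, S = len(emis), len(q)
    la = [[0.0] * S for _ in range(T)]   # forward messages (log)
    lb = [[0.0] * S for _ in range(T)]   # backward messages (log), last row is log 1
    for i in range(S):
        la[0][i] = math.log(q[i]) + math.log(emis[0][i])
    for t in range(1, T):
        for j in range(S):
            la[t][j] = _lse([la[t - 1][i] + math.log(A[i][j]) for i in range(S)]) \
                       + math.log(emis[t][j])
    for t in range(T - 2, -1, -1):
        for i in range(S):
            lb[t][i] = _lse([math.log(A[i][j]) + math.log(emis[t + 1][j]) + lb[t + 1][j]
                             for j in range(S)])
    loglik = _lse(la[T - 1])
    gamma = [[math.exp(la[t][i] + lb[t][i] - loglik) for i in range(S)] for t in range(T)]
    xi = [[[math.exp(la[t][i] + math.log(A[i][j]) + math.log(emis[t + 1][j])
                     + lb[t + 1][j] - loglik)
            for j in range(S)] for i in range(S)] for t in range(T - 1)]
    return gamma, xi, loglik

# Hypothetical 2-state model and emission likelihoods for a length-3 sequence.
q = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
emis = [[0.5, 0.1], [0.4, 0.3], [0.9, 0.2]]
gamma, xi, loglik = state_posteriors(q, A, emis)
```

The cost is linear in the sequence length and quadratic in the number of states, in contrast to the exponential cost of enumerating all state sequences.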

With such a solution framework ready for GenHMM, there are still remaining problems to address before it can be employed for practical usage, including

• how to realize GenHMM by neural network based generators such that likelihood of their induced distributions can be computed explicitly and exactly?

• how to train GenHMM to solve problem in (4) using practical algorithm?

• would the training of GenHMM converge?

We tackle these problems in the following section.

## 3 Solution for GenHMM

In this section, we detail the solution for realizing and learning GenHMM. The convergence of GenHMM is also discussed in this section.

### 3.1 Realizing $g_{s,\kappa}$ by a Flow Model

Each generator $g_{s,\kappa}$ is realized as a feed-forward neural network. We define $g_{s,\kappa}$ as an $L$-layer neural network and formulate its mapping by layer-wise concatenation: $g_{s,\kappa} = g^{[L]}_{s,\kappa} \circ g^{[L-1]}_{s,\kappa} \circ \cdots \circ g^{[1]}_{s,\kappa}$, where superscript $[l]$ denotes the layer index and $\circ$ denotes mapping concatenation. Assume $g_{s,\kappa}$ is invertible and denote its inverse mapping as $f_{s,\kappa} = g^{-1}_{s,\kappa}$. For a latent variable $z$ with density function $p_{s,\kappa}(z)$, the generated signal $x = g_{s,\kappa}(z)$ follows an induced distribution with density function (2). We illustrate the signal flow between latent variable $z$ and observable variable $x$ as

$$z = h_0 \underset{f^{[1]}_{s,\kappa}}{\overset{g^{[1]}_{s,\kappa}}{\rightleftarrows}} h_1 \underset{f^{[2]}_{s,\kappa}}{\overset{g^{[2]}_{s,\kappa}}{\rightleftarrows}} \cdots \underset{f^{[L]}_{s,\kappa}}{\overset{g^{[L]}_{s,\kappa}}{\rightleftarrows}} h_L = x,$$

where $f^{[l]}_{s,\kappa}$ is the $l$-th layer of $f_{s,\kappa}$. We have $f_{s,\kappa} = f^{[1]}_{s,\kappa} \circ f^{[2]}_{s,\kappa} \circ \cdots \circ f^{[L]}_{s,\kappa}$. If every layer of $g_{s,\kappa}$ is invertible, the full feed-forward neural network is invertible. The flow model, proposed in [9] as an image generating model, is such an invertible layer-wise feed-forward neural network. It was further improved in subsequent works [8, 17] for high-fidelity and high-resolution image generation and representation. As shown in (2), the challenge lies in the computation of the Jacobian determinant. Another track of flow models uses continuous-depth models instead. There, the variable change is defined by an ordinary differential equation (ODE) implemented by a neural network [6, 11], and the key becomes solving the ODE. We use the layer-wise flow model to model the variable change in (2), for which efficient Jacobian computation is available.

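For a flow model, let us assume that the feature $h_l$ at the $l$-th layer has two subparts, $h_l = [h_{l,a}^\intercal, h_{l,b}^\intercal]^\intercal$. The efficient invertible mapping of the flow model comes from the following forward and inverse relations between the $(l-1)$-th and $l$-th layers

$$\begin{aligned} h_l &= \begin{bmatrix} h_{l,a} \\ h_{l,b} \end{bmatrix} = \begin{bmatrix} h_{l-1,a} \\ \big(h_{l-1,b} - m_b(h_{l-1,a})\big) \oslash m_a(h_{l-1,a}) \end{bmatrix}, \\ h_{l-1} &= \begin{bmatrix} h_{l-1,a} \\ h_{l-1,b} \end{bmatrix} = \begin{bmatrix} h_{l,a} \\ m_a(h_{l,a}) \odot h_{l,b} + m_b(h_{l,a}) \end{bmatrix}, \end{aligned} \tag{11}$$

where $\odot$ denotes element-wise product, $\oslash$ denotes element-wise division, and $m_a(\cdot)$, $m_b(\cdot)$ can be complex non-linear mappings (implemented by neural networks). For the flow model, the determinant of the Jacobian matrix is

$$\det(\nabla f_{s,\kappa}) = \prod_{l=1}^{L} \det(\nabla f^{[l]}_{s,\kappa}), \tag{12}$$

where $\nabla f^{[l]}_{s,\kappa}$ is the Jacobian of the mapping from the $l$-th layer to the $(l-1)$-th layer, i.e., the inverse transformation. We compute the determinant of the Jacobian matrix as

$$\det(\nabla f^{[l]}_{s,\kappa}) = \det \begin{bmatrix} I & 0 \\ \frac{\partial h_{l-1,b}}{\partial h_{l,a}} & \mathrm{diag}\big(m_a(h_{l,a})\big) \end{bmatrix} = \det\Big( \mathrm{diag}\big(m_a(h_{l,a})\big) \Big), \tag{13}$$

where $I$ is the identity matrix and $\mathrm{diag}(\cdot)$ returns a square matrix with the elements of its argument on the main diagonal.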

Equation (11) describes a coupling layer in a flow model. A flow model is basically a stack of multiple coupling layers. However, directly concatenating multiple such coupling mappings leaves part of the input mapped identically through the whole model. This issue can be addressed by alternating the hidden signal order after each coupling layer.
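A coupling layer and the alternating order can be sketched in plain Python as below. The sketch assumes the two subparts have equal size, and the scale and shift functions `m_a` and `m_b` are hypothetical fixed functions standing in for the neural networks; the released PyTorch code may be organized differently.

```python
import math

def m_a(h_a):
    """Hypothetical positive scale mapping (a neural network in practice)."""
    return [math.exp(math.tanh(v)) for v in h_a]

def m_b(h_a):
    """Hypothetical shift mapping (a neural network in practice)."""
    return [0.5 * v for v in h_a]

def coupling_g(prev_a, prev_b):
    """Generator direction h_{l-1} -> h_l of (11): part a passes through unchanged."""
    scale, shift = m_a(prev_a), m_b(prev_a)
    return prev_a, [(b - t) / s for b, t, s in zip(prev_b, shift, scale)]

def coupling_f(h_a, h_b):
    """Inverse direction h_l -> h_{l-1}, plus log|det| of its Jacobian."""
    scale, shift = m_a(h_a), m_b(h_a)
    prev_b = [s * b + t for s, b, t in zip(scale, h_b, shift)]
    log_det = sum(math.log(abs(s)) for s in scale)   # diagonal block only
    return h_a, prev_b, log_det

def block_g(a, b):
    """Two coupling layers with the roles of the subparts swapped in between."""
    a, b = coupling_g(a, b)     # transforms b, conditioned on a
    b, a = coupling_g(b, a)     # transforms a, conditioned on the new b
    return a, b

def block_f(a, b):
    """Inverse of block_g; log-determinants of the layers add up as in (12)."""
    b, a, ld2 = coupling_f(b, a)
    a, b, ld1 = coupling_f(a, b)
    return a, b, ld1 + ld2

a0, b0 = [0.3, -1.2], [2.0, 0.5]
a1, b1 = block_g(a0, b0)
a2, b2, log_det = block_f(a1, b1)   # recovers (a0, b0) up to rounding
```

Because the Jacobian of each layer is block triangular, only the scale outputs enter the log-determinant, and no matrix determinant is ever formed explicitly.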

### 3.2 Learning of GenHMM

In this subsection, we address the problem of learning GenHMM.

#### Generative Model Learning

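The generative model learning amounts to solving the problem in (10), which can be further divided into two subproblems: i) generator learning; ii) learning the mixture weights of the generators. Let us define the notations $\Pi = \{\pi_s \mid s \in \mathcal{S}\}$ and $\Theta = \{\theta_s \mid s \in \mathcal{S}\}$. Then the problem in (10) becomes

$$\max_{\Phi} \mathcal{Q}(\Phi;\boldsymbol{H}^{\mathrm{old}}) = \max_{\Pi} \mathcal{Q}(\Pi;\boldsymbol{H}^{\mathrm{old}}) + \max_{\Theta} \mathcal{Q}(\Theta;\boldsymbol{H}^{\mathrm{old}}), \tag{14}$$

where

$$\mathcal{Q}(\Pi;\boldsymbol{H}^{\mathrm{old}}) = \mathbb{E}_{\hat{p}(\underline{x}),\, p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})}\big[ \log p(\underline{\kappa}|\underline{s};\boldsymbol{H}) \big], \tag{15}$$

$$\mathcal{Q}(\Theta;\boldsymbol{H}^{\mathrm{old}}) = \mathbb{E}_{\hat{p}(\underline{x}),\, p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})}\big[ \log p(\underline{x}|\underline{s},\underline{\kappa};\boldsymbol{H}) \big]. \tag{16}$$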

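We first address the generator learning problem, i.e. $\max_{\Theta} \mathcal{Q}(\Theta;\boldsymbol{H}^{\mathrm{old}})$. This boils down to maximizing the cost function of the neural networks, which can be formulated as

$$\begin{aligned} \mathcal{Q}(\Theta;\boldsymbol{H}^{\mathrm{old}}) &= \frac{1}{R}\sum_{r=1}^{R} \sum_{\underline{s}^r} \sum_{\underline{\kappa}^r} p(\underline{s}^r,\underline{\kappa}^r|\underline{x}^r;\boldsymbol{H}^{\mathrm{old}}) \sum_{t=1}^{T^r} \log p(x^r_t|s^r_t,\kappa^r_t;\boldsymbol{H}) \\ &= \frac{1}{R}\sum_{r=1}^{R} \sum_{t=1}^{T^r} \sum_{s^r_t=1}^{|\mathcal{S}|} \sum_{\kappa^r_t=1}^{K} p(s^r_t|\underline{x}^r;\boldsymbol{H}^{\mathrm{old}})\, p(\kappa^r_t|s^r_t,\underline{x}^r;\boldsymbol{H}^{\mathrm{old}}) \log p(x^r_t|s^r_t,\kappa^r_t;\boldsymbol{H}), \end{aligned} \tag{17}$$

where $T^r$ is the length of the $r$-th sequential data. In (17), the state posterior $p(s^r_t|\underline{x}^r;\boldsymbol{H}^{\mathrm{old}})$ is computed by the forward-backward algorithm. The posterior of $\kappa$ is

$$p(\kappa|s,\underline{x};\boldsymbol{H}^{\mathrm{old}}) = \frac{p(\kappa,\underline{x}|s;\boldsymbol{H}^{\mathrm{old}})}{p(\underline{x}|s;\boldsymbol{H}^{\mathrm{old}})} = \frac{\pi^{\mathrm{old}}_{s,\kappa}\, p(x|s,\kappa;\boldsymbol{H}^{\mathrm{old}})}{\sum_{\kappa'=1}^{K} \pi^{\mathrm{old}}_{s,\kappa'}\, p(x|s,\kappa';\boldsymbol{H}^{\mathrm{old}})}, \tag{18}$$

where the last equation is due to the fact that $x_t$ among sequence $\underline{x}$ only depends on $s_t$ and $\kappa_t$.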
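The generator posterior in (18) is a normalized product of mixture weight and generator likelihood, which in practice is best evaluated in the log domain. A minimal sketch follows; the function name and the example numbers are illustrative assumptions.

```python
import math

def generator_posterior(log_weights, log_liks):
    """p(kappa | s, x) as in (18), from log pi_{s,kappa} and log p(x | s, kappa)."""
    terms = [lw + ll for lw, ll in zip(log_weights, log_liks)]
    m = max(terms)                                  # subtract the max for stability
    unnorm = [math.exp(t - m) for t in terms]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical state with K = 2 generators.
post = generator_posterior(
    [math.log(0.25), math.log(0.75)],   # mixture weights
    [math.log(0.4), math.log(0.2)],     # generator likelihoods of one observation
)
```

Here the unnormalized products are 0.1 and 0.15, so the posterior is 0.4 and 0.6.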

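By substituting (2) and (12) into (17), we have the cost function for the neural networks as

$$\mathcal{Q}(\Theta;\boldsymbol{H}^{\mathrm{old}}) = \frac{1}{R}\sum_{r=1}^{R} \sum_{t=1}^{T^r} \sum_{s^r_t=1}^{|\mathcal{S}|} \sum_{\kappa^r_t=1}^{K} p(s^r_t|\underline{x}^r;\boldsymbol{H}^{\mathrm{old}})\, p(\kappa^r_t|s^r_t,\underline{x}^r;\boldsymbol{H}^{\mathrm{old}}) \Big[ \log p_{s^r_t,\kappa^r_t}\big(f_{s^r_t,\kappa^r_t}(x^r_t)\big) + \sum_{l=1}^{L} \log\big|\det(\nabla f^{[l]}_{s^r_t,\kappa^r_t})\big| \Big]. \tag{19}$$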

The generators of GenHMM simply use the standard Gaussian distribution for the latent variables $z$. Since the training dataset can be too large for whole-dataset iterations, mini-batch stochastic gradient descent can be used to maximize $\mathcal{Q}(\Theta;\boldsymbol{H}^{\mathrm{old}})$ w.r.t. the parameters of the generators.

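In what follows we address the mixture weight problem $\max_{\Pi} \mathcal{Q}(\Pi;\boldsymbol{H}^{\mathrm{old}})$ in our generative model learning. The conditional distribution of hidden variable $\kappa$, $\pi_{s,\kappa} = p(\kappa|s;\boldsymbol{H})$, is obtained by solving the following problem

$$\pi_{s,\kappa} = \underset{\pi_{s,\kappa}}{\arg\max}\ \mathcal{Q}(\Pi;\boldsymbol{H}^{\mathrm{old}}) \quad \text{s.t.}\ \sum_{\kappa=1}^{K} \pi_{s,\kappa} = 1,\ \forall s = 1, 2, \cdots, |\mathcal{S}|. \tag{20}$$

To solve problem (20), we formulate its Lagrange function as

$$\mathcal{L} = \mathcal{Q}(\Pi;\boldsymbol{H}^{\mathrm{old}}) + \sum_{s=1}^{|\mathcal{S}|} \lambda_s \Big( 1 - \sum_{\kappa=1}^{K} \pi_{s,\kappa} \Big). \tag{21}$$

Solving $\frac{\partial \mathcal{L}}{\partial \pi_{s,\kappa}} = 0$ gives

$$\pi_{s,\kappa} = \frac{1}{\lambda_s} \sum_{r=1}^{R} \sum_{t=1}^{T^r} p(s^r_t = s, \kappa^r_t = \kappa|\underline{x}^r;\boldsymbol{H}^{\mathrm{old}}). \tag{22}$$

With condition $\sum_{\kappa=1}^{K} \pi_{s,\kappa} = 1$, we have

$$\lambda_s = \sum_{\kappa=1}^{K} \sum_{r=1}^{R} \sum_{t=1}^{T^r} p(s^r_t = s, \kappa^r_t = \kappa|\underline{x}^r;\boldsymbol{H}^{\mathrm{old}}). \tag{23}$$

Then the solution to (20) is

$$\pi_{s,\kappa} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T^r} p(s^r_t = s, \kappa^r_t = \kappa|\underline{x}^r;\boldsymbol{H}^{\mathrm{old}})}{\sum_{\kappa'=1}^{K} \sum_{r=1}^{R} \sum_{t=1}^{T^r} p(s^r_t = s, \kappa^r_t = \kappa'|\underline{x}^r;\boldsymbol{H}^{\mathrm{old}})}, \tag{24}$$

where

$$p(s, \kappa|\underline{x};\boldsymbol{H}^{\mathrm{old}}) = p(s|\underline{x};\boldsymbol{H}^{\mathrm{old}})\, p(\kappa|s,\underline{x};\boldsymbol{H}^{\mathrm{old}}). \tag{25}$$

Here $p(s|\underline{x};\boldsymbol{H}^{\mathrm{old}})$ can be computed by the forward-backward algorithm, while $p(\kappa|s,\underline{x};\boldsymbol{H}^{\mathrm{old}})$ is given by (18).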
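The mixture weight update described above is an accumulate-and-normalize step: the joint posterior of state and generator is the product of the state posterior and the generator posterior, summed over time and over sequences, then normalized per state. A minimal sketch, with hypothetical names and nested lists standing in for tensors:

```python
def update_mixture_weights(gammas, resps, n_states, n_gen):
    """Closed-form update of the generator mixture weights.

    gammas[r][t][s]   = p(s_t = s | x^r)              (from forward-backward)
    resps[r][t][s][k] = p(kappa_t = k | s_t = s, x^r) (generator posterior)
    Assumes every state receives some posterior mass.
    """
    counts = [[0.0] * n_gen for _ in range(n_states)]
    for gamma, resp in zip(gammas, resps):            # loop over sequences r
        for g_t, r_t in zip(gamma, resp):             # loop over time steps t
            for s in range(n_states):
                for k in range(n_gen):
                    counts[s][k] += g_t[s] * r_t[s][k]    # joint posterior of (s, k)
    return [[c / sum(row) for c in row] for row in counts]

# Hypothetical example: one sequence of length 2, one state, two generators.
gammas = [[[1.0], [1.0]]]
resps = [[[[0.2, 0.8]], [[0.6, 0.4]]]]
new_pi = update_mixture_weights(gammas, resps, n_states=1, n_gen=2)
```

In this example the accumulated counts are 0.8 and 1.2, giving weights 0.4 and 0.6.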

With the generative model learning obtained, it remains to solve the initial distribution update and the transition matrix update of the HMM in GenHMM, i.e. the problems (8) and (9). These are two constrained optimization problems whose solutions are available in the literature [3]. To keep the learning algorithm for GenHMM complete, we give the update rules for $q$ and $A$ as follows.

#### Initial Probability Update

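The problem in (8) can be reformulated as

$$\mathcal{Q}(q;\boldsymbol{H}^{\mathrm{old}}) = \frac{1}{R}\sum_{r=1}^{R} \sum_{s^r_1=1}^{|\mathcal{S}|} p(s^r_1|\underline{x}^r;\boldsymbol{H}^{\mathrm{old}}) \log p(s^r_1;\boldsymbol{H}). \tag{26}$$

Here $p(s^r_1;\boldsymbol{H})$ is the probability of the initial state of GenHMM for the $r$-th sequential sample, i.e. $p(s^r_1 = i;\boldsymbol{H}) = q_i$, $i = 1, 2, \cdots, |\mathcal{S}|$. The solution to the problem

$$q = \underset{q}{\arg\max}\ \mathcal{Q}(q;\boldsymbol{H}^{\mathrm{old}}), \quad \text{s.t.}\ \sum_{i=1}^{|\mathcal{S}|} q_i = 1,\ q_i \geq 0,\ \forall i, \tag{27}$$

is

$$q_i = \frac{1}{R}\sum_{r=1}^{R} p(s^r_1 = i|\underline{x}^r;\boldsymbol{H}^{\mathrm{old}}), \quad \forall i = 1, 2, \cdots, |\mathcal{S}|. \tag{28}$$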

#### Transition Probability Update

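The problem (9) can be reformulated as

$$\mathcal{Q}(A;\boldsymbol{H}^{\mathrm{old}}) = \frac{1}{R}\sum_{r=1}^{R} \sum_{\underline{s}^r} p(\underline{s}^r|\underline{x}^r;\boldsymbol{H}^{\mathrm{old}}) \sum_{t=1}^{T^r-1} \log p(s^r_{t+1}|s^r_t;\boldsymbol{H}). \tag{29}$$

Since $p(s^r_{t+1} = j|s^r_t = i;\boldsymbol{H}) = A_{i,j}$ is an element of the transition matrix $A$, the solution to the problem

$$A = \underset{A}{\arg\max}\ \mathcal{Q}(A;\boldsymbol{H}^{\mathrm{old}}) \quad \text{s.t.}\ A \cdot \mathbf{1} = \mathbf{1},\ A_{i,j} \geq 0\ \forall i, j, \tag{30}$$

is

$$A_{i,j} = \frac{\bar{\xi}_{i,j}}{\sum_{k=1}^{|\mathcal{S}|} \bar{\xi}_{i,k}}, \tag{31}$$

where

$$\bar{\xi}_{i,j} = \sum_{r=1}^{R} \sum_{t=1}^{T^r-1} p(s^r_t = i, s^r_{t+1} = j|\underline{x}^r;\boldsymbol{H}^{\mathrm{old}}). \tag{32}$$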
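Both HMM updates reduce to averaging or normalizing posteriors that the forward-backward algorithm provides. The sketch below is illustrative only: the function name is hypothetical, posteriors are passed as nested lists, and every state is assumed to have a nonzero accumulated transition count.

```python
def update_initial_and_transition(gammas, xis, n_states):
    """Closed-form updates of the initial distribution q and transition matrix A.

    gammas[r][t][i] = p(s_t = i | x^r)
    xis[r][t][i][j] = p(s_t = i, s_{t+1} = j | x^r)
    """
    R = len(gammas)
    # q_i: average posterior of the first state over the R sequences.
    q = [sum(g[0][i] for g in gammas) / R for i in range(n_states)]
    # xi_bar[i][j]: pairwise posteriors accumulated over time and sequences, as in (32).
    xi_bar = [[0.0] * n_states for _ in range(n_states)]
    for xi in xis:
        for xi_t in xi:
            for i in range(n_states):
                for j in range(n_states):
                    xi_bar[i][j] += xi_t[i][j]
    # Row-normalize as in (31).
    A = [[v / sum(row) for v in row] for row in xi_bar]
    return q, A

# Hypothetical posteriors for two sequences of length 2 and two states.
gammas = [[[0.9, 0.1], [0.5, 0.5]], [[0.3, 0.7], [0.2, 0.8]]]
xis = [[[[0.4, 0.5], [0.1, 0.0]]], [[[0.1, 0.2], [0.1, 0.6]]]]
q_new, A_new = update_initial_and_transition(gammas, xis, n_states=2)
```

Transitions that never receive posterior mass stay at zero after the update, so a structural constraint such as an upper triangular initialization is preserved across iterations.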

### 3.3 On Convergence of GenHMM

In pursuit of representing a dataset by GenHMM, we are interested in whether the learning solution discussed in subsection 3.2 converges. The convergence property of GenHMM is analyzed as follows.

###### Proposition 1.

Assume that parameter $\Theta$ is in a compact set, and that $f_{s,\kappa}$ and $\nabla f_{s,\kappa}$ are continuous w.r.t. $\theta_{s,\kappa}$ in GenHMM. Then GenHMM converges.

###### Proof.

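We begin with the comparison of the loglikelihood evaluated under $\boldsymbol{H}^{\mathrm{new}}$ and $\boldsymbol{H}^{\mathrm{old}}$. The loglikelihood of the dataset given by $\hat{p}(\underline{x})$ can be reformulated as

$$\mathbb{E}_{\hat{p}(\underline{x})}\big[ \log p(\underline{x};\boldsymbol{H}^{\mathrm{new}}) \big] = \mathbb{E}_{\hat{p}(\underline{x}),\, p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})}\Big[ \log \frac{p(\underline{x},\underline{s},\underline{\kappa};\boldsymbol{H}^{\mathrm{new}})}{p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})} \Big] + \mathbb{E}_{\hat{p}(\underline{x})}\Big[ KL\big( p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}}) \,\|\, p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{new}}) \big) \Big],$$

where the first term on the right hand side of the above equation can be further written as

$$\mathbb{E}_{\hat{p}(\underline{x}),\, p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})}\Big[ \log \frac{p(\underline{x},\underline{s},\underline{\kappa};\boldsymbol{H}^{\mathrm{new}})}{p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})} \Big] = \mathcal{Q}(\boldsymbol{H}^{\mathrm{new}};\boldsymbol{H}^{\mathrm{old}}) - \mathbb{E}_{\hat{p}(\underline{x}),\, p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})}\big[ \log p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}}) \big].$$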

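According to subsection 3.2, the optimization problems give

$$\begin{aligned} \mathcal{Q}(q^{\mathrm{new}};\boldsymbol{H}^{\mathrm{old}}) &\geq \mathcal{Q}(q^{\mathrm{old}};\boldsymbol{H}^{\mathrm{old}}), \\ \mathcal{Q}(A^{\mathrm{new}};\boldsymbol{H}^{\mathrm{old}}) &\geq \mathcal{Q}(A^{\mathrm{old}};\boldsymbol{H}^{\mathrm{old}}), \\ \mathcal{Q}(\Pi^{\mathrm{new}};\boldsymbol{H}^{\mathrm{old}}) &\geq \mathcal{Q}(\Pi^{\mathrm{old}};\boldsymbol{H}^{\mathrm{old}}), \\ \mathcal{Q}(\Theta^{\mathrm{new}};\boldsymbol{H}^{\mathrm{old}}) &\geq \mathcal{Q}(\Theta^{\mathrm{old}};\boldsymbol{H}^{\mathrm{old}}). \end{aligned}$$

Since

$$\mathcal{Q}(\boldsymbol{H}^{\mathrm{new}};\boldsymbol{H}^{\mathrm{old}}) = \mathcal{Q}(q^{\mathrm{new}};\boldsymbol{H}^{\mathrm{old}}) + \mathcal{Q}(A^{\mathrm{new}};\boldsymbol{H}^{\mathrm{old}}) + \mathcal{Q}(\Pi^{\mathrm{new}};\boldsymbol{H}^{\mathrm{old}}) + \mathcal{Q}(\Theta^{\mathrm{new}};\boldsymbol{H}^{\mathrm{old}}),$$

it gives

$$\mathcal{Q}(\boldsymbol{H}^{\mathrm{new}};\boldsymbol{H}^{\mathrm{old}}) \geq \mathcal{Q}(\boldsymbol{H}^{\mathrm{old}};\boldsymbol{H}^{\mathrm{old}}).$$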

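With the above inequality, and the fact that $\mathbb{E}_{\hat{p}(\underline{x}),\, p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})}[\log p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})]$ is independent of $\boldsymbol{H}^{\mathrm{new}}$, we have the inequality

$$\mathbb{E}_{\hat{p}(\underline{x}),\, p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})}\Big[ \log \frac{p(\underline{x},\underline{s},\underline{\kappa};\boldsymbol{H}^{\mathrm{new}})}{p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})} \Big] \geq \mathbb{E}_{\hat{p}(\underline{x}),\, p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})}\Big[ \log \frac{p(\underline{x},\underline{s},\underline{\kappa};\boldsymbol{H}^{\mathrm{old}})}{p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})} \Big].$$

Due to $KL(\cdot\|\cdot) \geq 0$, we have

$$\mathbb{E}_{\hat{p}(\underline{x})}\big[ \log p(\underline{x};\boldsymbol{H}^{\mathrm{new}}) \big] \geq \mathbb{E}_{\hat{p}(\underline{x}),\, p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})}\Big[ \log \frac{p(\underline{x},\underline{s},\underline{\kappa};\boldsymbol{H}^{\mathrm{old}})}{p(\underline{s},\underline{\kappa}|\underline{x};\boldsymbol{H}^{\mathrm{old}})} \Big] = \mathbb{E}_{\hat{p}(\underline{x})}\big[ \log p(\underline{x};\boldsymbol{H}^{\mathrm{old}}) \big].$$

Since $f_{s,\kappa}$ and $\nabla f_{s,\kappa}$ are continuous w.r.t. $\theta_{s,\kappa}$ in GenHMM and $\Theta$ is in a compact set, $\mathbb{E}_{\hat{p}(\underline{x})}[\log p(\underline{x};\boldsymbol{H})]$ is bounded. The above inequality shows that $\mathbb{E}_{\hat{p}(\underline{x})}[\log p(\underline{x};\boldsymbol{H})]$ is non-decreasing in the learning of GenHMM. Therefore, GenHMM will converge.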

### 3.4 Algorithm of GenHMM

To summarize the learning solution in subsection 3.2, we wrap our algorithm into pseudocode as shown in Algorithm 1. We use the Adam [16] optimizer for optimization w.r.t. the parameters of the generators in GenHMM. As shown in Algorithm 1, mini-batch stochastic gradient descent can be naturally embedded into the learning algorithm of GenHMM.

As described by the pseudocode in Algorithm 1, the learning of GenHMM is divided into optimizations w.r.t. the generators' parameters $\Theta$, the initial probability $q$ of the hidden state, the transition matrix $A$, and the generator mixture weights $\Pi$. Different from the optimizations w.r.t. $\Pi$, $q$ and $A$, which have optimal solutions, generator learning usually cannot give an optimal solution to the problem $\max_{\Theta} \mathcal{Q}(\Theta;\boldsymbol{H}^{\mathrm{old}})$. In fact, even if no optimal $\Theta$ is obtained, the learning of GenHMM can still converge as long as the quantity $\mathcal{Q}(\Theta;\boldsymbol{H}^{\mathrm{old}})$ improves over the iterations in Algorithm 1, in which case the inequalities in the proof of Proposition 1 still hold. Therefore an optimal $\Theta$ in each iteration is not required for the convergence of GenHMM, as long as $\mathcal{Q}(\Theta;\boldsymbol{H}^{\mathrm{old}})$ keeps improving.

## 4 Experiments

To show the validity of our model, we implement our model in PyTorch and test it with sequential data. We first discuss the experimental setups and then show the experimental results. Code for experiments is available at https://github.com/FirstHandScientist/genhmm.

### 4.1 Experimental Setup

The dataset used for sequential data modeling and classification is TIMIT, where the speech signal is sampled at 16 kHz. The TIMIT dataset consists of phoneme-labeled speech utterances, which are partitioned into a training set and a test set. There are in total 61 different types of phones in TIMIT. We performed experiments in two cases: i) the full 61-phoneme classification case; ii) the 39-phoneme classification case, where the 61 phonemes are folded onto 39 phonemes as described in [22].

For extraction of feature vectors, we use ms frame length and ms frame shift to convert sound track into standard Mel-frequency cepstral coefficients (MFCCs) features. Experiments using the deltas and delta-deltas of the features are also carried out.

Our experiments are performed for: i) standard classification tasks (Tables 2, 3, 4, 5); ii) classification under noise perturbation (Tables 6, 7). The criteria used to report the results include accuracy, precision and F1 scores. In all experiments, the generators of GenHMM are implemented as flow models. Specifically, our generator structure follows that of RealNVP described in [8]. As discussed, the coupling layer shown in (11) maps a part of its input signal identically. The implementation is such that layer $l+1$ alternates the input signal order of layer $l$, so that no signal remains the same after two consecutive coupling layers. We term such a pair of consecutive coupling layers a flow block. In our experiments, each generator consists of four flow blocks. The density of samples in the latent space is defined as Normal, i.e. $p_{s,\kappa}(z)$ is the density function of the standard Gaussian. The configuration for each generator is shown in Table 1.

For each GenHMM, the number of states is adapted to the training dataset. The exact number of states is decided by computing the average length of MFCC frames per phone in the training dataset and clipping this average into a fixed range. The transition matrix $A$ is initialized as an upper triangular matrix for GenHMM.

### 4.2 Experimental Results

We first show phoneme classification using 39-dimensional MFCC features (MFCC coefficients, deltas, and delta-deltas) to validate one possible usage of our proposed model. Since generative training is carried out in our experiments, GMM-HMM is trained and tested as a reference model. GMM-HMM is trained and tested under the same conditions as GenHMM: the dataset usage is the same for both, and the number of states of GMM-HMM is the same as that of GenHMM in modeling each phoneme. Apart from setting up the reference model, we also run experimental comparisons with different total numbers of mixture components.

Tables 2 and 3 show the results for these experiments, in which we test both the folded 39-phoneme classification case (the conventional way) in Table 2 and the 61-phoneme classification case in Table 3. In both the 39-phoneme and 61-phoneme cases, GenHMM gets significantly higher accuracy than GMM-HMM for the same number of mixture components. The comparisons with regard to precision and F1 scores show similar trends and also demonstrate a significant improvement in GenHMM's performance. As expected, GenHMM has better modeling capacity for sequential data, since the neural network based generators that we bring into GenHMM should be able to represent complex relationships between the states of the HMM and the sequential data. Apart from the gain of using neural network based generative models, accuracy, precision and F1 scores also increase as the number of mixture components in GenHMM is increased. The sequential dependency of the data is modeled by the HMM itself, while each state of the HMM can have a better representation using a mixture probabilistic model if the data represented by the state is multi-modal. Comparing the results in the 39-phoneme and 61-phoneme cases, GenHMM gets higher accuracy for 39-phoneme classification than it does for 61-phoneme classification. The total training dataset size remains the same when the 61 phonemes are folded into 39 phonemes. There is less training data available per phoneme and there are more classes to be recognized in the 61-phoneme case, which makes the task more challenging.

Similar experiments are carried out using only the MFCC coefficients as feature input (excluding deltas and delta-deltas). The results are shown in Tables 4 and 5. The superior performance of GenHMM over the reference model GMM-HMM remains, with regard to accuracy, precision and F1 scores. The gain from using mixture generators is also present in this set of experiments, while the difference between the 39-phoneme and 61-phoneme cases is similar to that in the set of experiments in Tables 2 and 3.

Apart from standard classification testing, we also test the robustness of our model to noise perturbations. We train GenHMM with by clean TIMIT training data in the case of folded phonemes with dimensional features. The testing dataset is perturbed by either the same type of noise with different signal-to-noise ratio (SNR) as shown in Table 6, or different type of noises with the same SNR as shown in Table 7. The noise data is from NOISEX-92 database. The baseline of these two sets of experiments is the accuracy testing of GenHMM and GMM-HMM on clean testing data in the same experimental condition, where GenHMM has and GMM-HMM gets as shown in Table 2. Similar superior performance of GenHMM with regarding to precision and F1 scores is also shown. It is shown in Table 6 that GMM-HMM’s performance degenerates more than GenHMM’s performance at the same level of noise perturbation, though the accuracy of both models increases along the increase of SNR. Especially, for SNR=dB, the accuracy of GenHMM drops only about (from to ), while GMM-HMM encounters more than decrease (from to ) due to the noise perturbation. In Table 7, the SNR remains constant and GenHMM is tested with perturbation of different noise types. It is shown that GenHMM still remain higher performance scores at different types of noise perturbations than GMM-HMM. Among these four types of noise, white noise shows most significant impact to GenHMM while the impact of volvo noise is negligible.

## 5 Conclusion

In this work, we proposed a generative model based HMM (GenHMM) whose generators are realized by neural networks. We provided the training method for GenHMM. The validity of GenHMM was demonstrated by the experiments of classification tasks on practical sequential dataset. The learning method in this work is based on generative training. For future work, we would consider discriminative training for classification tasks of sequential data.

## 6 Acknowledgments

We would like to thank Dr. Minh Thành Vu for his discussions and comments on the algorithm analysis, which helped improve this paper considerably.

### Footnotes

1. E-mail: {doli, honore, sach, lkra}@kth.se
2. The length for sequential data varies.

### References

1. N.M.R. Ashwin, Leonard Barnabas, Amalraj Ramesh Sundar, Palaniyandi Malathi, Rasappa Viswanathan, Antonio Masi, Ganesh Kumar Agrawal, and Randeep Rakwal. Comparative secretome analysis of colletotrichum falcatum identifies a cerato-platanin protein (epl1) as a potential pathogen-associated molecular pattern (pamp) inducing systemic resistance in sugarcane. Journal of Proteomics, 169:2 – 20, 2017. 2nd World Congress of the International Plant Proteomics Organization.
2. Bing-Hwang Juang, S. Levinson, and M. Sondhi. Maximum likelihood estimation for multivariate mixture observations of markov chains (corresp.). IEEE Transactions on Information Theory, 32(2):307–309, March 1986.
3. Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.
4. Jan Buys, Yonatan Bisk, and Yejin Choi. Bridging hmms and rnns through architectural transformations. In 32nd Conference on Neural Information Processing Systems, IRASL workshop. 2018.
5. S. Chatterjee and W. B. Kleijn. Auditory model-based design and optimization of feature vectors for automatic speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 19(6):1813–1825, Aug 2011.
6. Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 6571–6583. Curran Associates, Inc., 2018.
7. W. Ding, S. Li, H. Qian, and Y. Chen. Hierarchical reinforcement learning framework towards multi-agent navigation. In 2018 IEEE International Conference on Robotics and Biomimetics (ROBIO), pages 237–242, Dec 2018.
8. L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. ArXiv e-prints, May 2016.
9. Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: non-linear independent components estimation. CoRR, abs/1410.8516, 2014.
10. M. Gales and S. Young. Application of Hidden Markov Models in Speech Recognition. now, 2008.
11. Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: free-form continuous dynamics for scalable reversible generative models. CoRR, abs/1810.01367, 2018.
12. Trienani Hariyanti, Saori Aida, and Hiroyuki Kameda. Samawa language part of speech tagging with probabilistic approach: Comparison of unigram, HMM and TnT models. Journal of Physics: Conference Series, 1235:012013, jun 2019.
13. G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, Nov 2012.
14. Geoffrey E. Hinton. A Practical Guide to Training Restricted Boltzmann Machines, pages 599–619. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
15. Wahab Khan, Ali Daud, Jamal A Nasir, and Tehmina Amjad. A survey on the state-of-the-art machine learning models in the context of nlp. Kuwait journal of Science, 43(4), 2016.
16. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
17. Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 10215–10224. Curran Associates, Inc., 2018.
18. Viktoriya Krakovna and Finale Doshi-Velez. Increasing the Interpretability of Recurrent Neural Networks Using Hidden Markov Models. arXiv e-prints, page arXiv:1606.05320, Jun 2016.
19. Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. CoRR, abs/1805.00909, 2018.
20. L. Li, Y. Zhao, D. Jiang, Y. Zhang, F. Wang, I. Gonzalez, E. Valentin, and H. Sahli. Hybrid deep neural network–hidden markov model (dnn-hmm) based speech emotion recognition. In 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, pages 312–317, Sep. 2013.
21. Larkin Liu, Yu-Chung Lin, and Joshua Reid. Improving the performance of the LSTM and HMM models via hybridization. CoRR, abs/1907.04670, 2019.
22. Carla Lopes and Fernando Perdigao. Phoneme recognition on the timit database. In Ivo Ipsic, editor, Speech Technologies, chapter 14. IntechOpen, Rijeka, 2011.
23. Yajie Miao and Florian Metze. Improving low-resource cd-dnn-hmm using dropout and multilingual dnn training. In INTERSPEECH, 2013.
24. S. Ren, V. Sima, and Z. Al-Ars. Fpga acceleration of the pair-hmms forward algorithm for dna sequence analysis. In 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1465–1470, Nov 2015.