Powering Hidden Markov Model by Neural Network based Generative Models
Abstract
The hidden Markov model (HMM) has been successfully used for sequential data modeling problems. In this work, we propose to increase the modeling capacity of HMM by bringing in neural network based generative models. The proposed model is termed GenHMM. In GenHMM, each HMM hidden state is associated with a neural network based generative model whose exact likelihood is tractable and efficient to compute. A generative model in GenHMM consists of a mixture of generators that are realized by flow models. A learning algorithm for GenHMM is proposed in the expectation-maximization framework. The convergence of the learning algorithm is analyzed. We demonstrate the efficiency of GenHMM by classification tasks on practical sequential data.
1 Introduction
Sequential data modeling is a challenging topic in pattern recognition and machine learning. For many applications, the assumption of independent and identically distributed (i.i.d.) data points is too strong to model data properly. Hidden Markov model (HMM) is a classic way to model sequential data without the i.i.d. assumption. HMM has been widely used in different practical problems, including applications in reinforcement learning [Ding et al.2018, Levine2018], natural language modeling [Khan et al.2016, Hariyanti, Aida, and Kameda2019], biological sequence analysis such as proteins [Ashwin et al.2017] and DNA [Ren, Sima, and AlArs2015], etc.
A HMM is a statistical representation of a sequential data generating process. Each state of a HMM is associated with a probabilistic model, which represents the relationship between that state and the sequential data input. The typical way is to use a Gaussian mixture model (GMM) per state of HMM [BingHwang Juang, Levinson, and Sondhi1986], where GMMs are used to connect states of HMM to sequential data input. GMM based HMM (GMM-HMM) has become a standard model for sequential data modeling, and has been employed widely in practical applications, especially in speech recognition [Gales and Young2008, Chatterjee and Kleijn2011].
Despite the success of GMM-HMM, it is not efficient for modeling data that lie on a nonlinear manifold. Research attempts at training HMM with neural networks have been made to boost the modeling capacity of HMM. A successful line of work has brought the restricted Boltzmann machine (RBM) [Hinton2012] into HMM based models [Hinton et al.2012, Li et al.2013, Miao and Metze2013]. In RBM based HMM, a hierarchical learning scheme is used: i) RBMs are trained one after the other in an unsupervised fashion, and are stacked together as one model, ii) a final softmax layer is then added to the stack of RBMs to represent the probability of a HMM state given a data input, iii) discriminative training is performed for the final tuning of the model.
Another track of related work is the hybrid method of temporal neural network models and HMM. In [Liu, Lin, and Reid2019, Buys, Bisk, and Choi2018, Krakovna and DoshiVelez2016], a long short-term memory (LSTM) model or recurrent neural network (RNN) is combined with HMM as a hybrid. A hierarchical training is carried out by: i) training a HMM first, ii) then doing modified training of the LSTM using the trained HMM. This hierarchical training procedure is motivated by the intuition of using LSTM or RNN to fill in the gap where HMM cannot learn.
The above works help improve modeling capacity of HMM based models by bringing in neural networks. A softmax layer is usually used to represent probability whenever a conditional distribution is needed. These hierarchical schemes are built based on intuition of domain knowledge. Training of these hierarchical models usually requires expertise in specific areas to be able to proceed with the hierarchical procedure of training and application usage.
In this work, we propose a generative model based HMM, termed GenHMM. Specifically, a generative model in our GenHMM is generator-mixed, where each generator is realized by a neural network to help the model gain high modeling capacity. Our proposed model, GenHMM:

has high modeling capacity of sequential data, due to the neural network based generators;

is easy to train. Training of GenHMM employs the expectation-maximization (EM) framework. Therefore, training a GenHMM is as easy as training a GMM-HMM model, while configuration of GenHMM is flexible;

is able to compute log-likelihood exactly and efficiently.
Instead of using softmax for probability representation, our GenHMM has tractability of exact log-likelihood of given sequential data, which is based on the change of variable formula. To make the log-likelihood computation efficient, neural network based generators of GenHMM are realized as flow models.
Our contributions in the paper are as follows:

Proposing a neural network based HMM for sequential data modeling, i.e. GenHMM. GenHMM has the tractability of exact likelihood.

Designing a practical algorithm for training GenHMM under the EM framework. Stochastic gradient search in batch fashion is embedded in this algorithm.

Giving convergence analysis for GenHMM under the proposed learning algorithm.

Verifying the efficiency of the proposed model on practical sequential data.
2 Generator-mixed HMM (GenHMM)
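Our framework is a HMM. A HMM $H$ defined in a hypothesis space $\mathcal{H}$, i.e. $H \in \mathcal{H}$, is capable of modeling a time-span signal $\underline{x} = [x_1, \dots, x_T]^\top$, where $x_t$ is the signal at time $t$, $[\cdot]^\top$ denotes transpose, and $T$ denotes the time length (the length varies across sequential data). We define the hypothesis set of HMM as $\mathcal{H} := \{H \mid H = \{\mathcal{S}, q, A, p(x|s; \Phi_s)\}\}$, where

$\mathcal{S}$ is the set of hidden states of $H$.

$q = [q_1, \dots, q_{|\mathcal{S}|}]^\top$ is the initial distribution of $H$ with $|\mathcal{S}|$ as cardinality of $\mathcal{S}$. For $i \in \mathcal{S}$, $q_i = p(s_1 = i; H)$. We use $s_t$ to denote the state at time $t$.

matrix $A$ of size $|\mathcal{S}| \times |\mathcal{S}|$ is the transition matrix of $H$. That is, $\forall i, j \in \mathcal{S}$, $A_{i,j} = p(s_{t+1} = j \mid s_t = i; H)$.

For a given hidden state $s$, the density function of the observable signal is $p(x|s; \Phi_s)$, where $\Phi_s$ is the parameter set that defines this probabilistic model. Denote $\Phi = \{\Phi_s \mid s \in \mathcal{S}\}$.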
Using HMM for signal representation is illustrated in Figure 1. The model assumption is that different instant signal of $\underline{x}$ is generated by a different signal source associated with a hidden state of HMM. In the framework of HMM, at each time instance $t$, signal $x_t$ is assumed to be generated by a distribution with density function $p(x_t|s_t; \Phi_{s_t})$, and $s_t$ is decided by the hidden Markov process. Putting these together gives us the probabilistic model $p(\underline{x}; H)$.
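The generating process described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `sample_hmm`, `q`, `A`, `emit` and the two-state scalar Gaussian example are assumptions made only for this sketch.

```python
import random

def sample_hmm(q, A, emit, T, rng):
    """Draw a hidden state path and observations of length T from an HMM."""
    states, obs = [], []
    s = rng.choices(range(len(q)), weights=q)[0]          # first state drawn from q
    for _ in range(T):
        states.append(s)
        obs.append(emit(s, rng))                          # observation drawn from the state's source
        s = rng.choices(range(len(q)), weights=A[s])[0]   # next state drawn from row s of A
    return states, obs

# Hypothetical two-state example with scalar Gaussian emissions.
q = [0.6, 0.4]
A = [[0.9, 0.1],
     [0.2, 0.8]]
means = [-2.0, 3.0]

def emit(s, rng):
    return rng.gauss(means[s], 1.0)

states, obs = sample_hmm(q, A, emit, T=10, rng=random.Random(0))
```

In GenHMM the per-state source would be a mixture of flow-based generators rather than a single Gaussian; the sampling order (state first, then observation) stays the same.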
2.1 Generative Model of GenHMM
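In this section, we introduce the neural network based state probabilistic model of our GenHMM. Recall that $H = \{\mathcal{S}, q, A, p(x|s; \Phi_s)\}$. The probabilistic model of GenHMM for each hidden state is a mixture of $K$ neural network based generators, where $K$ is a positive integer. The probabilistic model of a state $s \in \mathcal{S}$ is then given by
(1) $p(x|s; \Phi_s) = \sum_{\kappa=1}^{K} \pi_{s,\kappa}\, p(x|s, \kappa; \theta_{s,\kappa}),$
where $\kappa$ is a random variable following a categorical distribution, with probability $\pi_{s,\kappa} = p(\kappa|s; H)$. Naturally $\sum_{\kappa=1}^{K} \pi_{s,\kappa} = 1$. Denote $\pi_s = [\pi_{s,1}, \dots, \pi_{s,K}]^\top$. In (1), $p(x|s, \kappa; \theta_{s,\kappa})$ is defined as the induced distribution by a generator $g_{s,\kappa}$, such that $x = g_{s,\kappa}(z)$, where $z$ is a latent variable following a distribution with density function $p_{s,\kappa}(z)$. Generator $g_{s,\kappa}$ is parameterized by $\theta_{s,\kappa}$. Let us denote the collection of the parameter sets of generators for state $s$ as $\theta_s = \{\theta_{s,\kappa} \mid \kappa = 1, \dots, K\}$. Assuming $g_{s,\kappa}$ is invertible, by change of variable, we have
(2) $p(x|s, \kappa; \theta_{s,\kappa}) = p_{s,\kappa}(z) \left| \det\left( \frac{\partial g_{s,\kappa}(z)}{\partial z} \right) \right|^{-1}, \quad z = g_{s,\kappa}^{-1}(x).$
The signal flow of the probability distribution for a state $s$ of GenHMM is shown in Figure 2, in which the generator identity is up to the random variable $\kappa$.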
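To make the mixture in (1) and the change of variable in (2) concrete, the sketch below evaluates a state density in one dimension. It assumes toy affine generators of the form x = a*z + b with a standard Gaussian latent, which stand in for the flow models used in the paper; the function names and parameter values are hypothetical.

```python
import math

def std_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def affine_generator_logpdf(x, a, b):
    """Log-density of x = g(z) = a*z + b with z ~ N(0, 1), via change of variable."""
    z = (x - b) / a                                   # z = g^{-1}(x)
    return std_normal_logpdf(z) - math.log(abs(a))    # log p(z) - log|dg/dz|

def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def state_logpdf(x, weights, generators):
    """Log of the mixture density: log sum_k weight_k * p(x | generator k)."""
    return logsumexp([math.log(w) + affine_generator_logpdf(x, a, b)
                      for w, (a, b) in zip(weights, generators)])

weights = [0.3, 0.7]                      # hypothetical mixture weights
generators = [(2.0, -1.0), (0.5, 4.0)]    # hypothetical (a, b) pairs
lp = state_logpdf(0.5, weights, generators)
```

With affine generators each component is just a Gaussian with mean b and standard deviation |a|, which makes the result easy to check against a closed form. A flow model replaces the affine map while keeping the same two terms: latent log-density and log-determinant.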
2.2 Learning in EM framework
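Assume the sequential signal $\underline{x}$ follows an unknown distribution $p(\underline{x})$. We would like to use GenHMM to model this distribution. Alternatively, we are looking for the answer to the question
(3) $\min_{H \in \mathcal{H}} \mathrm{KL}\big(p(\underline{x}) \,\|\, p(\underline{x}; H)\big),$
where $\mathrm{KL}(\cdot \| \cdot)$ denotes the Kullback-Leibler divergence. For practical consideration, we only have access to the samples of $p(\underline{x})$, i.e. the dataset of this distribution. For the given dataset, we denote its empirical distribution by $\hat{p}(\underline{x}) = \frac{1}{R} \sum_{r=1}^{R} \delta_{\underline{x}^{r}}(\underline{x})$, where superscript $(\cdot)^{r}$ denotes the index of the sequential signal. The KL divergence minimization problem can be reduced to a likelihood maximization problem
(4) $\max_{H \in \mathcal{H}} \frac{1}{R} \sum_{r=1}^{R} \log p(\underline{x}^{r}; H).$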
For the likelihood maximization, the first problem that we need to address is to deal with the hidden sequential variables of model $H$, namely $\underline{s}$ and $\underline{\kappa}$. For a sequential observable variable $\underline{x}$, $\underline{s}$ is the hidden state sequence corresponding to $\underline{x}$, and $\underline{\kappa}$ is the hidden variable sequence representing the generator identity sequence that actually generates $\underline{x}$.
Since directly maximizing likelihood is not an option for our problem in (4), we address this problem in the expectation-maximization (EM) framework. This divides our problem into two iterative steps: i) using the joint posterior of hidden variable sequences $\underline{s}$ and $\underline{\kappa}$ to obtain an “expected likelihood” of the observable variable sequence $\underline{x}$, i.e. the E-step; ii) maximizing the expected likelihood with regard to (w.r.t.) the model $H$, i.e. the M-step. Assume model $H$ is at a configuration of $H^{\mathrm{old}}$, we formulate these two steps as follows:

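E-step: the expected likelihood function
(5) $Q(H; H^{\mathrm{old}}) = \mathbb{E}_{\hat{p}(\underline{x}),\, p(\underline{s}, \underline{\kappa}|\underline{x}; H^{\mathrm{old}})}\left[\log p(\underline{x}, \underline{s}, \underline{\kappa}; H)\right],$ where $\mathbb{E}_{\hat{p}(\underline{x}),\, p(\underline{s}, \underline{\kappa}|\underline{x}; H^{\mathrm{old}})}[\cdot]$ denotes the expectation operator by the empirical distribution $\hat{p}(\underline{x})$ and the posterior $p(\underline{s}, \underline{\kappa}|\underline{x}; H^{\mathrm{old}})$.

M-step: the maximization step
(6) $\max_{H} Q(H; H^{\mathrm{old}}).$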
We can see that the solution of depends on the posterior probability . Though the evaluation of the posterior according to Bayes' theorem is straightforward, the computational complexity of grows exponentially with the length of . Therefore, we employ the forward-backward algorithm [Bishop2006] to do the posterior computation efficiently. As we detail in the next section, what is needed to formulate the problem are actually the and . For the joint posterior , it can be computed by Bayes' rule when the posterior of the hidden state is available.
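As an illustration of how the forward-backward recursion avoids the exponential cost, here is a minimal scaled version in plain Python, following the textbook formulation in [Bishop2006]. The names `forward_backward`, `q`, `A`, `B` and the toy numbers are assumptions for this sketch; in GenHMM the entries of `B` would come from the per-state generator mixture.

```python
import math

def forward_backward(q, A, B):
    """Scaled forward-backward recursion.

    q: initial distribution (length S); A: S x S transition matrix;
    B[t][s]: emission density of observation t under state s.
    Returns gamma[t][s] (state posterior at time t) and the sequence log-likelihood.
    """
    T, S = len(B), len(q)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            a = [q[s] * B[0][s] for s in range(S)]
        else:
            a = [B[t][s] * sum(alpha[t - 1][i] * A[i][s] for i in range(S))
                 for s in range(S)]
        c = sum(a)                       # normalizer of the forward message at time t
        scale.append(c)
        alpha.append([v / c for v in a])
    beta = [[1.0] * S for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[s][j] * B[t + 1][j] * beta[t + 1][j] for j in range(S))
                   / scale[t + 1] for s in range(S)]
    gamma = [[alpha[t][s] * beta[t][s] for s in range(S)] for t in range(T)]
    return gamma, sum(math.log(c) for c in scale)

# Hypothetical toy problem: 2 states, 3 time steps.
q = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.1], [0.2, 0.9], [0.3, 0.4]]
gamma, loglik = forward_backward(q, A, B)
```

The cost is linear in the sequence length and quadratic in the number of states, compared with enumerating every state path.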
With such a solution framework ready for GenHMM, there are still remaining problems to address before it can be employed for practical usage, including

how to realize GenHMM by neural network based generators such that the likelihood of their induced distributions can be computed explicitly and exactly?

how to train GenHMM to solve problem in (4) using practical algorithm?

would the training of GenHMM converge?
We tackle these problems in the following section.
3 Solution for GenHMM
In this section, we detail the solution for realizing and learning GenHMM. The convergence of GenHMM is also discussed in this section.
3.1 Realizing by a Flow Model
Each generator is realized as a feed-forward neural network. We define as a layer neural network and formulate its mapping by layer-wise concatenation: , where superscript denotes the layer index and denotes mapping concatenation. Assume is invertible and denote its inverse mapping as . For a latent variable with density function , the generated signal follows an induced distribution with density function (2). We illustrate the signal flow between latent variable and observable variable as
where is the th layer of . We have . If every layer of is invertible, the full feed-forward neural network is invertible. The flow model, proposed in [Dinh, Krueger, and Bengio2014] as an image generating model, is such an invertible feed-forward neural network. It is further improved in subsequent works [Dinh, SohlDickstein, and Bengio2016, Kingma and Dhariwal2018] for high-fidelity and high-resolution image generation and representation. The flow model also has advantages such as efficient Jacobian computation and low computational complexity.
For a flow model, let us assume that the feature at the ’th layer has two subparts as . The efficient invertible mapping of flow model comes from following forward and inverse relations between ’th and ’th layers
(11) 
where denotes elementwise product, denotes elementwise division, and can be complex nonlinear mappings (implemented by neural networks). For the flow model, the determinant of Jacobian matrix is
(12) 
where is the Jacobian of the mapping from the th layer to the th layer, i.e., the inverse transformation. We compute the determinant of the Jacobian matrix as
(13) 
where is identity matrix and returns a square matrix with the elements of on the main diagonal.
(11) describes a coupling layer in a flow model. A flow model is basically a stack of multiple coupling layers. However, directly concatenating multiple such coupling mappings leaves part of the input mapped identically through the whole model. This issue can be addressed by alternating the hidden signal order after each coupling layer.
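The following sketch shows a generic affine coupling layer and a pair of layers with alternated signal order. It is an illustration under stated assumptions, not the paper's exact layer: `scale_fn` and `shift_fn` are hypothetical stand-ins for the nonlinear mappings (which would be neural networks), the input dimension is assumed even, and the log-determinant returned is that of the latent-to-signal direction.

```python
import math

def scale_fn(a):
    """Hypothetical positive elementwise scale computed from the untouched half."""
    return [math.exp(math.tanh(v)) for v in a]

def shift_fn(a):
    """Hypothetical elementwise shift computed from the untouched half."""
    return [0.5 * v for v in a]

def _split(h, flip):
    d = len(h) // 2
    return (h[d:], h[:d]) if flip else (h[:d], h[d:])

def coupling_forward(h, flip):
    """One half passes through unchanged, the other is scaled and shifted."""
    a, b = _split(h, flip)
    s, m = scale_fn(a), shift_fn(a)
    b_out = [si * bi + mi for si, bi, mi in zip(s, b, m)]
    logdet = sum(math.log(si) for si in s)    # Jacobian is triangular
    return (b_out + a if flip else a + b_out), logdet

def coupling_inverse(h, flip):
    a, b = _split(h, flip)
    s, m = scale_fn(a), shift_fn(a)
    b_in = [(bi - mi) / si for si, bi, mi in zip(s, b, m)]
    return b_in + a if flip else a + b_in

def flow_block_forward(z):
    """Two coupling layers with alternated order, so every component is transformed."""
    h, ld1 = coupling_forward(z, flip=False)
    x, ld2 = coupling_forward(h, flip=True)
    return x, ld1 + ld2

def flow_block_inverse(x):
    return coupling_inverse(coupling_inverse(x, flip=True), flip=False)

z = [0.3, -1.2, 0.8, 2.0]
x, logdet = flow_block_forward(z)
z_rec = flow_block_inverse(x)
```

Because the transformed half only depends on the untouched half, inversion needs no matrix solve and the log-determinant is a sum of log-scales.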
3.2 Learning of GenHMM
In this subsection, we address the problem of learning GenHMM.
Generative Model Learning
The generative model learning is actually to solve the problem in (10), which can be further divided into two subproblems: i) generator learning; ii) mixture weights of generators learning. Let us define notations: , . Then the problem in (10) becomes
(14) 
where
(15)  
(16) 
We first address the generator learning problem, i.e. . This boils down to maximizing the cost function of neural networks, which can be formulated as
(17) 
where is the length of the th sequential data. In (17), the state posterior is computed by the forward-backward algorithm. The posterior of is
(18) 
where the last equation is due to the fact that among sequence only depends on .
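A small sketch of the generator-identity posterior for one frame and one state, computed by Bayes' rule in the log domain. It assumes the posterior is proportional to the mixture weight times the generator density, which is the standard form for a mixture; the function name and inputs are hypothetical.

```python
import math

def generator_posterior(log_weights, log_densities):
    """Posterior over generators for one frame and one state.

    log_weights[k]: log mixture weight of generator k.
    log_densities[k]: log-density of the frame under generator k.
    """
    joint = [lw + ld for lw, ld in zip(log_weights, log_densities)]
    m = max(joint)                                  # subtract max for numerical stability
    unnorm = [math.exp(j - m) for j in joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]

post = generator_posterior([math.log(0.5), math.log(0.5)], [-1.0, -3.0])
```

Working with log-densities matters here because flow models return log-likelihoods that can be far below the range where direct exponentiation is safe.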
By substituting (2) and (12) into (17), we have the cost function for neural networks as
(19) 
The generators of GenHMM simply use the standard Gaussian distribution for latent variables . Since the training dataset can be too large for whole-dataset iterations, batch-size stochastic gradient descent can be used to maximize w.r.t. the parameters of generators.
In what follows we address the problem in our generative model learning. The conditional distribution of hidden variable , , is obtained by solving the following problem
(20)  
To solve problem (20), we formulate its Lagrange function as
(21) 
Solving gives
(22) 
With condition , we have
(23) 
Then the solution to (20) is
(24) 
where . can be computed by the forward-backward algorithm, while is given by (18).
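The closed-form weight update can be sketched as a ratio of accumulated posteriors. This assumes the standard result for a constrained weighted categorical likelihood (expected generator counts divided by expected state counts); the array layouts `gamma[r][t][s]` and `resp[r][t][s][k]` are hypothetical conventions chosen for this sketch.

```python
def update_mixture_weights(gamma, resp, state):
    """Re-estimate the generator mixture weights of one state.

    gamma[r][t][s]: posterior of state s at frame t of sequence r.
    resp[r][t][s][k]: posterior of generator k given state s and that frame.
    """
    K = len(resp[0][0][state])
    num = [0.0] * K
    for g_seq, r_seq in zip(gamma, resp):
        for g_t, r_t in zip(g_seq, r_seq):
            for k in range(K):
                num[k] += g_t[state] * r_t[state][k]
    den = sum(num)          # equals the accumulated state posterior, since resp sums to 1 over k
    return [n / den for n in num]

# Hypothetical single sequence with two frames, two states, two generators.
gamma = [[[0.8, 0.2], [0.4, 0.6]]]
resp = [[[[0.5, 0.5], [0.1, 0.9]],
         [[1.0, 0.0], [0.3, 0.7]]]]
w0 = update_mixture_weights(gamma, resp, state=0)
```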
With the generative model learning obtained, it remains to solve the initial distribution update and transition matrix update of HMM in GenHMM, i.e. the problem (8) and (9). These two problems are basically two constrained optimization problems. The solutions to them are available in literature [Bishop2006]. But to keep learning algorithm for GenHMM complete, we give the update rules for and as follows.
Initial Probability Update
The problem in (8) can be reformulated as
(25) 
is the probability of initial state of GenHMM for th sequential sample. Actually , . Solution to the problem
(26) 
is
(27) 
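In code, the initial probability update amounts to averaging the first-frame state posteriors over sequences. This is the textbook form from [Bishop2006]; the function name and the `gamma[r][t][s]` layout are assumptions of this sketch.

```python
def update_initial(gamma):
    """gamma[r][t][s]: state posterior; returns the average of first-frame posteriors."""
    R, S = len(gamma), len(gamma[0][0])
    return [sum(g[0][s] for g in gamma) / R for s in range(S)]

# Hypothetical posteriors for two sequences of different lengths.
q_new = update_initial([[[0.9, 0.1], [0.5, 0.5]],
                        [[0.3, 0.7]]])
```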
Transition Probability Update
The problem (9) can be reformulated as
(28) 
Since is the element of transition matrix , the solution to the problem
(29) 
is
(30) 
where
(31) 
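The transition update can likewise be sketched as row-normalized expected transition counts, the textbook Baum-Welch form in [Bishop2006]. The pairwise posterior layout `xi[r][t][i][j]` and the function name are assumptions of this sketch; it also assumes every state receives nonzero posterior mass, otherwise a row sum would be zero.

```python
def update_transition(xi):
    """xi[r][t][i][j]: posterior of being in state i at t and state j at t+1, sequence r."""
    S = len(xi[0][0])
    counts = [[0.0] * S for _ in range(S)]
    for seq in xi:
        for x_t in seq:
            for i in range(S):
                for j in range(S):
                    counts[i][j] += x_t[i][j]       # expected number of i -> j transitions
    return [[c / sum(row) for c in row] for row in counts]

# Hypothetical pairwise posteriors for one sequence with two transitions.
xi = [[[[0.4, 0.1], [0.2, 0.3]],
       [[0.2, 0.3], [0.1, 0.4]]]]
A_new = update_transition(xi)
```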
3.3 On Convergence of GenHMM
In pursuit of representing a dataset by GenHMM, we are interested in whether the learning solution discussed in subsection 3.2 converges. The convergence properties of GenHMM are analyzed as follows.
Assume that parameter is in a compact set, and are continuous w.r.t. in GenHMM. Then GenHMM converges.
Proof.
We begin with the comparison of loglikelihood evaluated under and . The loglikelihood of dataset given by can be reformulated as
where the first term on the right hand side of the above inequality can be further written as
According to subsection 3.2, the optimization problems give
Since
it gives
With the above inequality, and the fact that is independent of , we have the inequality
Due to , we have
Since and are continuous w.r.t. in GenHMM, is bounded. The above inequality shows is nondecreasing in learning of GenHMM. Therefore, GenHMM will converge.
∎
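The monotonicity argument used in the proof is the standard EM inequality. As a sketch in assumed notation (one observed sequence $\underline{x}$, all hidden sequences collected into $\underline{h}$, $Q$ the expected likelihood of the E-step, and $H^{\mathrm{old}}$, $H^{\mathrm{new}}$ the model before and after one iteration), it can be written as:

```latex
\begin{align}
\log p(\underline{x}; H)
  &= \mathbb{E}_{p(\underline{h}|\underline{x}; H^{\mathrm{old}})}
     \big[\log p(\underline{x}, \underline{h}; H)\big]
   - \mathbb{E}_{p(\underline{h}|\underline{x}; H^{\mathrm{old}})}
     \big[\log p(\underline{h}|\underline{x}; H)\big], \\
\log p(\underline{x}; H^{\mathrm{new}}) - \log p(\underline{x}; H^{\mathrm{old}})
  &= Q(H^{\mathrm{new}}; H^{\mathrm{old}}) - Q(H^{\mathrm{old}}; H^{\mathrm{old}})
   + \mathrm{KL}\big(p(\underline{h}|\underline{x}; H^{\mathrm{old}})
     \,\|\, p(\underline{h}|\underline{x}; H^{\mathrm{new}})\big) \\
  &\geq Q(H^{\mathrm{new}}; H^{\mathrm{old}}) - Q(H^{\mathrm{old}}; H^{\mathrm{old}})
   \;\geq\; 0 .
\end{align}
```

The first inequality uses the non-negativity of the KL divergence. The second only requires that the M-step does not decrease $Q$, not that it maximizes it.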
3.4 Algorithm of GenHMM
To summarize the learning solution in subsection 3.2, we wrap our algorithm into pseudocode as shown in Algorithm 1. We use the Adam [Kingma and Ba2014] optimizer for optimization w.r.t. the parameters of generators in GenHMM. As shown from line to in Algorithm 1, batch-size stochastic gradient descent can be naturally embedded into the learning algorithm of GenHMM.
As described by the pseudocode in Algorithm 1, the learning of GenHMM is divided into optimizations w.r.t. generators’ parameters , initial probability of hidden state, transition matrix , and generator mixture weights . Unlike the optimizations w.r.t. , and , which have optimal solutions, generator learning usually cannot give an optimal solution to problem . In fact, even if no optimal is obtained, learning of GenHMM can still converge as long as the quantity is improving over the iterations in Algorithm 1, in which case the inequalities in Proposition 3.3 still hold. Therefore an optimal in each iteration is not required for convergence of GenHMM as long as the loss in is improved.
4 Experiments
To show the validity of our model, we implement our model in PyTorch and test it with sequential data. We first discuss the experimental setups and then show the experimental results.
4.1 Experimental Setups
The dataset used for sequential data modeling and classification is TIMIT, where the speech signal is sampled at kHz. The TIMIT dataset consists of phoneme-labeled speech utterances which are partitioned into two sets: a training set consisting of utterances, and a test set consisting of utterances. There are in total different types of phones in TIMIT. We performed experiments in two cases: i) full phoneme classification case; ii) phoneme classification case, where phonemes are folded onto phonemes as described in [Lopes and Perdigao2011].
For extraction of feature vectors, we use ms frame length and ms frame shift to convert sound track into standard Melfrequency cepstral coefficients (MFCCs) features. Experiments using the deltas and deltadeltas of the features are also carried out.
Our experiments are performed for: i) standard classification tasks (table 1, 2), ii) classification under noise perturbation (table 3, 4). In all experiments, generators of GenHMM are implemented as flow models. Specifically, our generator structure follows that of a RealNVP described in [Dinh, SohlDickstein, and Bengio2016]. As discussed, the coupling layer shown in (11) maps a part of its input signal identically. The implementation is such that layer would alternate the input signal order of layer such that no signal remains the same after two consecutive coupling layers. We term such a pair of consecutive coupling layers a flow block. In our experiments, each generator consists of four flow blocks. The density of samples in the latent space is defined as Normal, i.e. is the density function of the standard Gaussian.
For each GenHMM, the number of states is adapted to the training dataset. The exact number of states is decided by computing the average length of MFCC frames per phone in training dataset, and clipping the average length into . Transition matrix is initialized as upper triangular matrix for GenHMM.
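The state-count rule and the transition initialization can be sketched as follows. The clipping bounds `lo` and `hi` are left as parameters and the values in the example are hypothetical. Spreading each row uniformly over the allowed transitions is also an assumption of this sketch, since the text only states that the matrix is upper triangular.

```python
def num_states(avg_frames_per_phone, lo, hi):
    """Round the average frame count and clip it into [lo, hi]."""
    return max(lo, min(hi, round(avg_frames_per_phone)))

def init_upper_triangular(S):
    """Row-stochastic upper-triangular matrix: state i may stay or move to a later state."""
    return [[1.0 / (S - i) if j >= i else 0.0 for j in range(S)] for i in range(S)]

S = num_states(7.4, lo=3, hi=20)      # hypothetical bounds, for illustration only
A0 = init_upper_triangular(3)
```

An upper-triangular transition matrix encodes a left-to-right topology. Zero entries stay zero under the expected-count transition update, so the topology is preserved during training.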
4.2 Experimental Results
Model  K=1  K=3  K=5 
GMMHMM  
GenHMM 
Model  K=1  K=3  K=5 
GMMHMM  
GenHMM 
We first show the phoneme classification using 39 dimensional MFCC features (MFCC coefficients, deltas, and delta-deltas), to validate one possible usage of our proposed model. Since generative training is carried out in our experiments, GMM-HMM is trained and tested as a reference model. GMM-HMM is trained and tested under the same conditions as GenHMM. Dataset usage for GenHMM and GMM-HMM is the same, and the number of states for GMM-HMM is the same as that for GenHMM in modeling each phoneme. Apart from setting the reference model, we also run the experiment comparisons with different total numbers of mixture components.
Table 1 shows the results for these experiments, in which we test both the folded phoneme classification case (the conventional way) in table 1 and the phoneme classification case in table 1. As shown in both phoneme and phoneme cases, GenHMM gets significantly higher accuracy than GMM-HMM for the same number of mixture components. As expected, GenHMM should have better modeling capacity for sequential data since we bring neural network based generators into GenHMM, which should be able to represent complex relationships between states of HMM and sequential data. There is some increase of accuracy as the total number of mixture components in GenHMM is increased from to . The sequential dependency of data is modeled by HMM itself, while each state of HMM can have a better representation using a mixture probabilistic model if the data represented by the state is multimodal. Comparing the results in phoneme and phoneme cases, GenHMM gets higher accuracy for phoneme classification than it does for phoneme classification. The total training dataset size remains the same as phonemes are folded into phonemes. There is less training data available per phoneme and there are more classes to be recognized in the phoneme case, which makes the task more challenging.
Model  K=1  K=3  K=5 
GMMHMM  
GenHMM 
Model  K=1  K=3  K=5 
GMMHMM  
GenHMM 
Similar experiments in table 2 are carried out by using only the MFCC coefficients as feature input (excluding deltas and delta-deltas). The results are shown in table 2. GenHMM again outperforms the reference model GMM-HMM. The gain from using mixture generators is also present in this set of experiments, while the difference between phoneme and phoneme cases is similar to the set of experiments in table 1.
SNR  15dB  20dB  25dB  30 dB 
GMMHMM  
GenHMM 
Noise Type  White  Pink  Babble  Volvo 
GMMHMM  
GenHMM 
Apart from standard classification testing, we also test the robustness of our model to noise perturbations. We train GenHMM with by clean TIMIT training data in the case of folded phonemes with dimensional features. The testing dataset is perturbed by either the same type of noise with different signal-to-noise ratio (SNR) as shown in table 3, or different types of noise with the same SNR as shown in table 4. The noise data is from the NOISEX-92 database. The baseline of these two sets of experiments is the accuracy of GenHMM and GMM-HMM on clean testing data in the same experimental condition, where GenHMM has and GMM-HMM gets as shown in table 1. It is shown in table 3 that GMM-HMM’s performance degrades more than GenHMM’s performance at the same level of noise perturbation, though the accuracy of both models increases as the SNR increases. In particular, for SNR=dB, the accuracy of GenHMM drops only about (from to ), while GMM-HMM encounters more than decrease (from to ) due to the noise perturbation. In table 4, the SNR remains constant and GenHMM is tested with perturbation of different noise types. It is shown that GenHMM still has higher accuracy than GMM-HMM under different types of noise perturbation. Among these four types of noise, white noise shows the most significant impact on GenHMM while the impact of Volvo noise is negligible.
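For reference, one common way to perturb a signal at a prescribed SNR is to rescale the noise so that the power ratio matches the target. The sketch below illustrates that convention only. The text does not state how the noise was mixed, so the function name, the power definition (mean square over the whole signal) and the synthetic signals are assumptions.

```python
import math
import random

def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so that signal power / noise power equals 10^(snr_db / 10), then add it.

    Assumes `signal` and `noise` have the same length.
    """
    p_sig = sum(v * v for v in signal) / len(signal)
    p_noise = sum(v * v for v in noise) / len(noise)
    gain = math.sqrt(p_sig / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(signal, noise)]

# Hypothetical signals: a sine wave perturbed by white Gaussian noise at 20 dB.
rng = random.Random(0)
signal = [math.sin(0.1 * i) for i in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
noisy = add_noise_at_snr(signal, noise, snr_db=20.0)
```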
5 Conclusion
In this work, we proposed a generative model based HMM (GenHMM) whose generators are realized by neural networks. We provided the training method for GenHMM. The validity of GenHMM was demonstrated by the experiments of classification tasks on practical sequential dataset. The learning method in this work is based on generative training. For future work, we would consider discriminative training for classification tasks of sequential data.
References
 [Ashwin et al.2017] Ashwin, N.; Barnabas, L.; Sundar, A. R.; Malathi, P.; Viswanathan, R.; Masi, A.; Agrawal, G. K.; and Rakwal, R. 2017. Comparative secretome analysis of colletotrichum falcatum identifies a cerato-platanin protein (epl1) as a potential pathogen-associated molecular pattern (pamp) inducing systemic resistance in sugarcane. Journal of Proteomics 169:2–20. 2nd World Congress of the International Plant Proteomics Organization.
 [BingHwang Juang, Levinson, and Sondhi1986] Bing-Hwang Juang; Levinson, S.; and Sondhi, M. 1986. Maximum likelihood estimation for multivariate mixture observations of markov chains (corresp.). IEEE Transactions on Information Theory 32(2):307–309.
 [Bishop2006] Bishop, C. M. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag.
 [Buys, Bisk, and Choi2018] Buys, J.; Bisk, Y.; and Choi, Y. 2018. Bridging hmms and rnns through architectural transformations. In 32nd Conference on Neural Information Processing Systems, IRASL workshop.
 [Chatterjee and Kleijn2011] Chatterjee, S., and Kleijn, W. B. 2011. Auditory model-based design and optimization of feature vectors for automatic speech recognition. IEEE Transactions on Audio, Speech, and Language Processing 19(6):1813–1825.
 [Ding et al.2018] Ding, W.; Li, S.; Qian, H.; and Chen, Y. 2018. Hierarchical reinforcement learning framework towards multiagent navigation. In 2018 IEEE International Conference on Robotics and Biomimetics (ROBIO), 237–242.
 [Dinh, Krueger, and Bengio2014] Dinh, L.; Krueger, D.; and Bengio, Y. 2014. NICE: nonlinear independent components estimation. CoRR abs/1410.8516.
 [Dinh, SohlDickstein, and Bengio2016] Dinh, L.; Sohl-Dickstein, J.; and Bengio, S. 2016. Density estimation using Real NVP. ArXiv e-prints.
 [Gales and Young2008] Gales, M., and Young, S. 2008. Application of Hidden Markov Models in Speech Recognition. now.
 [Hariyanti, Aida, and Kameda2019] Hariyanti, T.; Aida, S.; and Kameda, H. 2019. Samawa language part of speech tagging with probabilistic approach: Comparison of unigram, HMM and TnT models. Journal of Physics: Conference Series 1235:012013.
 [Hinton et al.2012] Hinton, G.; Deng, L.; Yu, D.; Dahl, G. E.; Mohamed, A.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T. N.; and Kingsbury, B. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29(6):82–97.
 [Hinton2012] Hinton, G. E. 2012. A Practical Guide to Training Restricted Boltzmann Machines. Berlin, Heidelberg: Springer Berlin Heidelberg. 599–619.
 [Khan et al.2016] Khan, W.; Daud, A.; Nasir, J. A.; and Amjad, T. 2016. A survey on the state-of-the-art machine learning models in the context of nlp. Kuwait journal of Science 43(4).
 [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980.
 [Kingma and Dhariwal2018] Kingma, D. P., and Dhariwal, P. 2018. Glow: Generative flow with invertible 1x1 convolutions. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31. Curran Associates, Inc. 10215–10224.
 [Krakovna and DoshiVelez2016] Krakovna, V., and Doshi-Velez, F. 2016. Increasing the Interpretability of Recurrent Neural Networks Using Hidden Markov Models. arXiv e-prints arXiv:1606.05320.
 [Levine2018] Levine, S. 2018. Reinforcement learning and control as probabilistic inference: Tutorial and review. CoRR abs/1805.00909.
 [Li et al.2013] Li, L.; Zhao, Y.; Jiang, D.; Zhang, Y.; Wang, F.; Gonzalez, I.; Valentin, E.; and Sahli, H. 2013. Hybrid deep neural network–hidden markov model (dnn-hmm) based speech emotion recognition. In 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, 312–317.
 [Liu, Lin, and Reid2019] Liu, L.; Lin, Y.; and Reid, J. 2019. Improving the performance of the LSTM and HMM models via hybridization. CoRR abs/1907.04670.
 [Lopes and Perdigao2011] Lopes, C., and Perdigao, F. 2011. Phoneme recognition on the timit database. In Ipsic, I., ed., Speech Technologies. Rijeka: IntechOpen. chapter 14.
 [Miao and Metze2013] Miao, Y., and Metze, F. 2013. Improving low-resource cd-dnn-hmm using dropout and multilingual dnn training. In INTERSPEECH.
 [Ren, Sima, and AlArs2015] Ren, S.; Sima, V.; and Al-Ars, Z. 2015. Fpga acceleration of the pair-hmms forward algorithm for dna sequence analysis. In 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 1465–1470.