Parsimonious HMMs for Offline Handwritten Chinese Text Recognition
Abstract
Recently, hidden Markov models (HMMs) have achieved promising results for offline handwritten Chinese text recognition. However, due to the large vocabulary of Chinese characters, with each character modeled by a uniform and fixed number of hidden states, a high demand of memory and computation is required. In this study, to address this issue, we present parsimonious HMMs via state tying, which can fully utilize the similarities among different Chinese characters. A two-step algorithm with a data-driven question set is adopted to generate the tied-state pool using a likelihood measure. The proposed parsimonious HMMs, with both Gaussian mixture models (GMMs) and deep neural networks (DNNs) as the emission distributions, not only lead to a compact model but also improve the recognition accuracy via data sharing among the tied states and reduced confusion among state classes. Tested on the ICDAR 2013 competition database, in the best configured case, the new parsimonious DNN-HMM yields a relative character error rate (CER) reduction of 6.2%, a 25% reduction of model size and a 60% reduction of decoding time over the conventional DNN-HMM. In the compact setting of an average of one state per character HMM, our parsimonious DNN-HMM significantly outperforms the conventional DNN-HMM with a relative CER reduction of 35.5%.
I Introduction
Offline handwritten Chinese text recognition (HCTR) is a challenging topic due to the large vocabulary and unconstrained writing styles [1]. Most existing techniques can be classified into two categories: over-segmentation-based and segmentation-free approaches. Over-segmentation-based approaches often need to explicitly segment a text line into a sequence of primitive image patches and then merge them to form a candidate lattice [2, 3, 4, 5]. In contrast, segmentation-free approaches do not require explicit segmentation of the text line. [6] adopted the Gaussian mixture model based hidden Markov model (GMM-HMM) for text line modeling. With the success of deep learning [7], deep neural networks (DNNs) have been widely applied to HCTR. Recently, [12] successfully used a multidimensional long short-term memory recurrent neural network (MDLSTM-RNN) with connectionist temporal classification (CTC) [13] for HCTR. More recently, [9, 10, 11] proposed hybrid neural network based HMMs (NN-HMMs) for HCTR, which achieved the best performance on the ICDAR 2013 competition database [1] among existing segmentation-free approaches.
The success of NN-HMMs [9, 11] is attributed to two aspects. First, the DNN or convolutional neural network (CNN) [8] is powerful in modeling the emission distributions, just like the MDLSTM-RNN [12]. Second, a left-to-right HMM [14] with a set of hidden states is adopted to represent each character class, as illustrated in Fig. 1. Accordingly, to model a text line as an observation sequence of frames generated by sliding windows, the character HMMs are concatenated as shown in Fig. 1. However, there is one main problem in conventional HMM-based HCTR: each character is modeled with a uniform and fixed number of hidden states, e.g., 5 states in Fig. 1. Due to the large vocabulary of Chinese characters, this setting requires a high demand of memory and computation. Moreover, the uniform setting of the state number is unreasonable, as the similarities among different characters and the diversity of appearances are not well considered. Chinese characters, which are mainly logographic and consist of basic radicals, constitute the oldest continuously used system of writing in the world, which differs from purely sound-based writing systems [15] such as Greek, Hebrew, etc. For example, in Fig. 2, the regions in the red dashed boxes of the left and middle handwritten Chinese characters are quite similar as they belong to the same radical.
In this study, to address the above-mentioned problem of the conventional DNN-HMM approach, we present parsimonious DNN-HMMs via state tying, which can fully utilize the similarities among different Chinese characters. We adopt a two-step algorithm with a data-driven question set to generate the tied-state pool using a likelihood measure, inspired by similar ideas in the speech recognition area [16, 17, 18]. The proposed parsimonious DNN-HMMs not only lead to a compact model but also improve the recognition accuracy via data sharing among the tied states and reduced confusion among state classes. Tested on the ICDAR 2013 competition database, in the best configured case, the new parsimonious DNN-HMM yields a relative character error rate (CER) reduction of 6.2%, a 25% reduction of model size and a 60% reduction of decoding time over the conventional DNN-HMM. In the compact setting of an average of one state per character HMM, our parsimonious DNN-HMM significantly outperforms the conventional DNN-HMM with a relative CER reduction of 35.5%.
II Overview of Parsimonious DNN-HMM
The proposed framework aims to search for the optimal character sequence \hat{C} for a given feature sequence X = \{x_1, x_2, \ldots, x_T\} extracted from a text line, which can be formulated according to Bayesian decision theory as follows:
\hat{C} = \arg\max_{C} P(C|X) = \arg\max_{C} p(X|C) P(C)    (1)
where p(X|C) is the conditional probability of X given C, which is named the character model, while P(C) is the prior probability of C, which is named the language model. As one implementation of this Bayesian framework, we use an HMM to model one character class; accordingly, a text line is modeled by a sequence of HMMs. An HMM has a set of states, and each frame is supposed to be assigned to one underlying state. For each state, an emission distribution describes the statistical property of the observed frame. With HMMs, we rewrite p(X|C) as:
p(X|C) = \sum_{S} p(X, S|C)    (2)
       = \sum_{S} p(S|C)\, p(X|S, C)    (3)
       = \sum_{S} \pi(s_1) \prod_{t=2}^{T} a_{s_{t-1} s_t} \prod_{t=1}^{T} p(x_t|s_t)    (4)
Here S = \{s_1, s_2, \ldots, s_T\} is one underlying state sequence of X to represent C, \pi(s_1) is the prior probability of the initial state, and a_{s_{t-1} s_t} is the transition probability from state s_{t-1} at frame t-1 to state s_t at frame t. p(x_t|s_t) is the emission probability, which can be directly calculated (e.g., by a GMM in [6]) or indirectly obtained via the state posterior probability (e.g., by a DNN in [9]).
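As an illustrative sketch (not part of the paper), the marginalization over all state sequences in Eqs. (2)-(4) can be computed efficiently with the standard forward algorithm. The sketch below assumes the initial, transition and per-frame emission probabilities are supplied as log-domain NumPy arrays:

```python
import numpy as np

def logsumexp(a, axis=None):
    """Numerically stable log(sum(exp(a))) along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    s = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return s.squeeze(axis=axis) if axis is not None else s.item()

def forward_log_likelihood(log_pi, log_A, log_B):
    """log p(X|C): sum over all state sequences S, as in Eqs. (2)-(4).

    log_pi: (S,)   log initial-state probabilities log pi(s_1)
    log_A:  (S, S) log transition probabilities log a_{ij}
    log_B:  (T, S) log emission probabilities log p(x_t|s) per frame
    """
    T, _ = log_B.shape
    alpha = log_pi + log_B[0]                    # log alpha_1(s)
    for t in range(1, T):
        # recursion: alpha_t(j) = [sum_i alpha_{t-1}(i) a_{ij}] * b_j(x_t)
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_B[t]
    return logsumexp(alpha)                      # marginalize the final state
```

For left-to-right character HMMs as in Fig. 1, log_A is banded (only self-loops and forward transitions), and concatenating character HMMs amounts to stacking their transition blocks.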
Within this framework, the main procedure to train parsimonious DNN-HMMs is summarized in Algorithm 1. In the recognition stage, after feature extraction of the unknown handwritten text line, the final recognition result is generated by a weighted finite-state transducer (WFST) [23, 24] based decoder integrating both the character model and the language model. Note that the number of output-layer neurons in the DNN corresponds to the number of tied states, which is controlled by the state tying results. In the next section, we elaborate the state tying algorithm.
III Two-step state tying
To better explain state tying, Fig. 2 shows an example of three Chinese characters with the final tied-state pool. Each character in this figure is initially modeled by an HMM with 5 states. After parsimonious modeling, the first two states of the left and middle characters are tied together, while the last three states of the middle and right characters are tied together. This is reasonable as these tied states correspond to the similar regions in the dashed boxes.
We adopt a two-step algorithm with a data-driven question set to generate the tied-state pool. We introduce the first-step algorithm, the second-step algorithm and the question-set building separately in the following subsections.
III-A The first-step data-driven method for state tying
In the first step, a binary decision tree is adopted for state tying, with each node partitioned by a question. Each question is related to a set of Chinese characters, as described in Section III-C. One tree is constructed for each HMM state position (e.g., state 1 to state 5 in Fig. 2) to cluster the corresponding states of all associated characters. Because the number of Chinese characters used in this study is 3,980, the whole tree on each state is very large. In Fig. 3, we show only a fragment of the decision tree for tying the first state of the HMMs, where five clusters correspond to five leaf nodes, each associated with a set of tied character classes. Similar to [17], the basic principle is to partition states recursively to maximize the increase in expected log-likelihood. All states with the same position in the HMMs are initially grouped together at the root node and the expected log-likelihood of the training data is calculated. This node is then split into two subsets based on the question which partitions the states to maximize the increase in expected log-likelihood. A maximum priority queue is maintained to store the expected log-likelihood improvements from splitting each parent node into two children nodes. Each node is then recursively partitioned until the threshold on the tied-state number is reached.
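A minimal sketch of this first-step procedure (an illustration, not the paper's implementation), assuming a hypothetical helper split_gain that evaluates the Eq. (7) gain of a question and returns the two resulting subsets; the maximum priority queue is realized with heapq on negated gains:

```python
import heapq
from itertools import count

def tie_states_top_down(root_states, questions, split_gain, max_leaves):
    """First-step state tying (sketch): greedy top-down tree splitting.

    root_states: all states with the same HMM position, pooled at the root.
    questions:   iterable of character sets (Section III-C).
    split_gain(states, question) -> (gain, yes, no): hypothetical helper
        returning the expected log-likelihood increase of the split and
        the two resulting subsets.
    """
    def best_split(states):
        best = None
        for q in questions:
            gain, yes, no = split_gain(states, q)
            if yes and no and (best is None or gain > best[0]):
                best = (gain, yes, no)
        return best

    leaves = [root_states]
    tie = count()                        # tiebreaker for equal gains
    heap = []                            # max-queue via negated gains
    s = best_split(root_states)
    if s:
        heapq.heappush(heap, (-s[0], next(tie), 0, s[1], s[2]))
    while heap and len(leaves) < max_leaves:
        _, _, idx, yes, no = heapq.heappop(heap)
        leaves[idx] = yes                # replace the parent by its children
        leaves.append(no)
        for i, child in ((idx, yes), (len(leaves) - 1, no)):
            s = best_split(child)
            if s:
                heapq.heappush(heap, (-s[0], next(tie), i, s[1], s[2]))
    return leaves                        # each leaf becomes one tied state
```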
III-B The second-step data-driven method for state tying
To obtain the final tied-state pool, the tied states generated in the first step are re-clustered in this step: the clusters in the leaf nodes obtained in the first step are re-clustered by a bottom-up procedure using sequential greedy optimization. Similar to [18], the expected log-likelihood decrease from combining every two clusters is calculated. A minimum priority queue is maintained to merge the two clusters with the minimum log-likelihood decrease into a new cluster. This process is repeated until the target tied-state number N is reached. Finally, the tied-state pool is composed of the re-clustered tied states. We illustrate this second step in Fig. 4.
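The second step can be sketched as agglomerative merging (again an illustration only), assuming a hypothetical helper merge_cost that returns the expected log-likelihood decrease of merging two clusters:

```python
import heapq
from itertools import combinations, count

def recluster_bottom_up(clusters, merge_cost, target_n):
    """Second-step re-clustering (sketch): greedily merge the pair of
    clusters with the smallest expected log-likelihood decrease until
    target_n clusters remain.

    merge_cost(c1, c2) -> (cost, merged): hypothetical helper returning
        the likelihood decrease of the merge and the merged cluster.
    """
    active = {i: c for i, c in enumerate(clusters)}
    fresh = count(len(clusters))         # ids for newly merged clusters
    heap = []                            # min-queue keyed by merge cost
    for i, j in combinations(active, 2):
        cost, merged = merge_cost(active[i], active[j])
        heapq.heappush(heap, (cost, i, j, merged))
    while len(active) > target_n and heap:
        cost, i, j, merged = heapq.heappop(heap)
        if i not in active or j not in active:
            continue                     # stale entry: a member was merged
        del active[i], active[j]
        k = next(fresh)
        for m in active:                 # costs of merging the new cluster
            c, mg = merge_cost(active[m], merged)
            heapq.heappush(heap, (c, m, k, mg))
        active[k] = merged
    return list(active.values())
```

Stale heap entries (pairs whose members were already merged) are simply skipped on pop, a standard lazy-deletion trick for priority-queue clustering.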
The expected log-likelihood in the above-mentioned two steps can be calculated on a feature vector x based on a Gaussian distribution assumption with d-dimensional mean vector \mu and covariance matrix \Sigma:

\log \mathcal{N}(x; \mu, \Sigma) = -\frac{1}{2} \left[ d \log(2\pi) + \log|\Sigma| + (x-\mu)^{\top} \Sigma^{-1} (x-\mu) \right]    (5)

Let \Omega be a cluster with n training feature vectors. With \mu and \Sigma estimated on \Omega by maximum likelihood, the expected log-likelihood on this cluster is given by:

\mathcal{L}(\Omega) = \sum_{x \in \Omega} \log \mathcal{N}(x; \mu, \Sigma) = -\frac{n}{2} \left[ d \log(2\pi) + \log|\Sigma| + d \right]    (6)

If we partition \Omega into two subsets \Omega_1 and \Omega_2, with n_1 and n_2 feature vectors, mean vectors \mu_1 and \mu_2, and covariance matrices \Sigma_1 and \Sigma_2 respectively, then the expected log-likelihood increase after splitting becomes:

\Delta\mathcal{L} = \mathcal{L}(\Omega_1) + \mathcal{L}(\Omega_2) - \mathcal{L}(\Omega) = \frac{n}{2} \log|\Sigma| - \frac{n_1}{2} \log|\Sigma_1| - \frac{n_2}{2} \log|\Sigma_2|    (7)
Similarly, we can obtain the expected log-likelihood decrease for the second-step re-clustering accordingly. The statistics required in these equations can be calculated from the training data.
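Eqs. (6) and (7) can be evaluated directly from per-cluster Gaussian statistics; a small NumPy sketch (using the identity that the summed Mahalanobis terms equal nd when the mean and covariance are the maximum-likelihood estimates on the cluster):

```python
import numpy as np

def expected_log_likelihood(X):
    """Eq. (6): expected log-likelihood of cluster data X (n x d) under a
    Gaussian with the cluster's own maximum-likelihood mean/covariance."""
    n, d = X.shape
    cov = np.atleast_2d(np.cov(X, rowvar=False, bias=True))  # ML estimate
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet + d)

def split_gain(X, mask):
    """Eq. (7): expected log-likelihood increase from splitting X into
    X[mask] and X[~mask]; nonnegative for nonsingular covariances."""
    return (expected_log_likelihood(X[mask])
            + expected_log_likelihood(X[~mask])
            - expected_log_likelihood(X))
```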
III-C Data-driven question set generation
The question set used for state tying is built via a top-down tree-based method as in [18]. Initially, all characters are placed in the root node and the expected log-likelihood of all the training data is calculated. Then k-means clustering [21] with k = 2 is conducted several times with different initial assignments, and the best clustering result is selected to split the root node. A maximum priority queue is maintained to store the likelihood increases from splitting each parent node into two children nodes. This splitting process is performed recursively until each leaf node contains only one character class. Each node corresponds to one question, which consists of all the leaves that this node can reach when traversing the tree. Finally, the question set consists of these questions.
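A simplified sketch of this procedure (illustrative, not the paper's code), assuming each character is summarized by a single feature vector via a hypothetical embed mapping (e.g. the mean frame vector of its training samples). Since splitting continues until every leaf holds exactly one class, a plain depth-first recursion produces the same final question set as the priority-queue order:

```python
import numpy as np

def build_question_set(chars, embed, seed=0):
    """Question-set generation (sketch): recursively split the character
    set with 2-means; every node's character set becomes one question.

    embed: hypothetical mapping from each character to a feature vector.
    """
    rng = np.random.default_rng(seed)

    def two_means(items, restarts=5):
        X = np.stack([embed[c] for c in items])
        best = None
        for _ in range(restarts):                # several random inits
            centers = X[rng.choice(len(X), size=2, replace=False)]
            for _ in range(20):                  # plain Lloyd iterations
                dist = np.linalg.norm(X[:, None] - centers[None], axis=2)
                assign = dist.argmin(axis=1)
                if assign.min() == assign.max():
                    break                        # one cluster went empty
                centers = np.stack([X[assign == k].mean(0) for k in (0, 1)])
            if assign.min() == assign.max():
                continue
            sse = ((X - centers[assign]) ** 2).sum()
            if best is None or sse < best[0]:    # keep the best clustering
                best = (sse, assign)
        return best

    questions = []
    stack = [list(chars)]
    while stack:                                 # split until each leaf
        node = stack.pop()                       # holds one class
        questions.append(frozenset(node))
        if len(node) < 2:
            continue
        res = two_means(node)
        if res is None:
            continue
        _, assign = res
        stack.append([c for c, a in zip(node, assign) if a == 0])
        stack.append([c for c, a in zip(node, assign) if a == 1])
    return questions
```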
IV Experiments
In this section, we present experiments on recognizing offline handwritten Chinese text lines with the Kaldi toolkit [20], for the purpose of evaluating and comparing the proposed parsimonious HMMs with the conventional HMMs [9]. We use the public CASIA-HWDB database [19] for training, including the HWDB1.0, HWDB1.1, HWDB2.0, HWDB2.1, and HWDB2.2 datasets. HWDB1.0 and HWDB1.1 are offline isolated handwritten Chinese character datasets, while HWDB2.0–HWDB2.2 are offline handwritten Chinese text datasets. In total, there are 3,980 classes (Chinese characters, symbols, garbage) with 4,091,599 samples. Here, the "garbage" classes represent the short blank model between characters and the long blank model at the beginning or end of a text line. The ICDAR 2013 competition set [1] is adopted as the evaluation set. The gradient-based feature [22] extracted from one frame of the text line is a 256-dimensional vector, which is reduced to 50 dimensions by PCA. This feature vector is directly used for the GMM-HMM systems, while an augmented version of 7 concatenated frames is fed to the DNN-HMM systems.
For GMM-HMM systems, each character class is modeled by a left-to-right HMM with each state modeled by a GMM with 40 Gaussian components. For DNN-HMM systems, the input size of the DNN is 350. The mini-batch size is 256. The initial step size is set to 0.008, which is halved after each iteration if the loss on the cross-validation set is reduced. 16 iterations are conducted.
As for language modeling, a 3-gram language model is adopted, trained on the transcriptions of both the CASIA database and other corpora, including 208 MB of texts from Guangming Daily between 1994 and 1998, 115 MB of texts from People's Daily between 2000 and 2004, 129 MB of texts from other newspapers, and 93 MB of texts from Sina News. The evaluation measure is the CER, which is the ratio between the total number of substitution/insertion/deletion errors and the total number of character samples in the evaluation set.
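The CER defined above is the edit (Levenshtein) distance between the reference and the hypothesis, divided by the reference length; a minimal sketch:

```python
def cer(ref, hyp):
    """Character error rate: (substitutions + insertions + deletions) of
    the optimal alignment divided by the reference length."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))                 # edit-distance DP, row 0
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                               # deletion
                         cur[j - 1] + 1,                            # insertion
                         prev[j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
        prev = cur
    return prev[n] / m
```

For example, one substitution in a four-character reference gives a CER of 0.25.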
IV-A Experiments on different settings of tied states
In this subsection, five conventional GMM-HMM systems are built with the fixed number of HMM states per character ranging from 1 to 5. Four parsimonious GMM-HMM (denoted as GMM-PHMM) systems are generated by state tying from the 5-state GMM-HMM system, yielding an average of 1 to 4 tied states per character. Accordingly, five conventional DNN-HMM systems are trained from the five conventional GMM-HMM systems, while four parsimonious DNN-HMM (denoted as DNN-PHMM) systems are trained based on the four GMM-PHMM systems. For DNN-HMM and DNN-PHMM, 6 hidden layers with 2,048 nodes per hidden layer are used, and the number of neurons of the DNN output layer, corresponding to the total number of (tied) states, varies from 3,980 (1 state per character) to 19,900 (5 states per character).
TABLE I: CER (%) comparison on the evaluation set with different numbers of (tied) states per character.

Average #states per character |   5   |   4   |   3   |   2   |   1
GMM-HMM                       | 20.04 | 19.94 | 21.94 | 24.92 | 30.34
GMM-PHMM                      |  --   | 19.41 | 18.83 | 18.14 | 18.49
DNN-HMM                       |  6.73 |  6.80 |  7.11 |  8.21 | 11.09
DNN-PHMM                      |  --   |  6.37 |  6.31 |  6.48 |  7.15
Table I lists a CER comparison of the HMM systems on the evaluation set with different settings of the number of tied states per character. Several observations could be made. First, for both GMM-PHMM and DNN-PHMM, with the decreasing number of tied states, the CERs first decreased and then increased. This implied that too many states increased the confusion among state classes, while too few states reduced the discrimination among character classes. Second, GMM-PHMM/DNN-PHMM systems consistently and significantly outperformed the corresponding GMM-HMM/DNN-HMM systems with the same tied-state number, demonstrating the effectiveness of the proposed state tying algorithm. For example, in the most compact case, namely 1 tied state per character, GMM-PHMM yielded a relative CER reduction of 39.1% over GMM-HMM, while DNN-PHMM achieved a relative CER reduction of 35.5% over DNN-HMM. This indicated that the tied-state allocation for different character classes could be much more reasonable after state tying by fully utilizing the similarities among different characters. Finally, in the best configured cases, a relative CER reduction of 9.5% was achieved by GMM-PHMM over GMM-HMM, while a relative CER reduction of 6.2% was achieved by DNN-PHMM over DNN-HMM. Moreover, a 40% reduction of the total number of tied states was obtained in DNN-PHMM compared with DNN-HMM.
TABLE II: CER (%) comparison of parsimonious HMMs with the average number of tied states per character below 1.

Average #tied states per character |  0.9  |  0.8  |  0.7  |  0.6  |  0.5
GMM-PHMM                           | 18.66 | 19.17 | 19.92 | 21.28 | 22.54
DNN-PHMM                           |  7.34 |  7.50 |  7.97 |  8.80 |  9.52
One more advantage of DNN-PHMM is that a much more compact design can be achieved by setting the average number of tied states per character below 1, as shown in Table II, whereas for DNN-HMM the minimum setting is 1 state per character. We could observe from Table II that, even in such extreme settings, the recognition performance of GMM-PHMM and DNN-PHMM declined only gradually, unlike the sharp performance drop of GMM-HMM and DNN-HMM from the 2-state setting to the 1-state setting in Table I. With an average of 0.5 tied states per character, the corresponding DNN-PHMM outperformed both the DNN-HMM with the 1-state setting and the MDLSTM-RNN (with a CER of 10.6% in [12]), yielding relative CER reductions of 14.2% and 10.2%, respectively.
IV-B Experiments on parsimonious modeling
TABLE III: Comparison of the best configured DNN-HMM and DNN-PHMM systems with different DNN structures (#hidden units, #hidden layers). Model size and run-time latency are normalized by those of the (2048, 6) DNN-HMM.

(#units, #layers)      | (1024, 4) | (1024, 6) | (2048, 6)
DNN-HMM   CER (%)      |   7.15    |   6.91    |   6.73
          Model size   |   0.38    |   0.42    |   1
          Latency      |   0.82    |   0.93    |   1
DNN-PHMM  CER (%)      |   6.78    |   6.48    |   6.31
          Model size   |   0.25    |   0.27    |   0.74
          Latency      |   0.28    |   0.31    |   0.40
To further address practical issues such as the demand of memory and computation, the performance comparison of the best configured DNN-HMM and DNN-PHMM systems with different DNN structures is listed in Table III. Clearly, with fewer hidden units and layers, DNN-PHMM could still maintain competitive performance while the corresponding model size and run-time latency were largely reduced. For example, DNN-PHMM with the (1024, 4) setting achieved a CER comparable to DNN-HMM with the (2048, 6) setting, while reducing the model size by 75% and the run-time latency by 72%.
IV-C Results analysis
To explain why the proposed parsimonious HMMs are so effective, we first show examples of state-tying results in Fig. 5. The first column shows sets of characters tied together, from the first state to the fifth state of the 5-state HMMs, with the corresponding radical structures and similarities described in the second and third columns. From these results, we observed that although the vocabulary of Chinese characters can be quite large (tens of thousands), most characters consist of basic radicals and spatial structures drawn from only a few hundred categories. Accordingly, Chinese characters with the same or similar radicals were easily tied by the proposed algorithm. This is the reason that the proposed DNN-PHMM can maintain high recognition performance with a quite compact design, as shown in Tables II and III.
To give readers a better understanding of why DNN-PHMM can improve the recognition accuracy over DNN-HMM, a recognition example is shown in Fig. 6, where DNN-HMM generates one substitution error (marked in red) while DNN-PHMM generates the correct result matching the ground truth. This can be explained as follows: in the DNN-HMM system, there are too few training samples with a left radical written like the misclassified one in the red dashed box. In DNN-PHMM, however, through state tying, this unusual writing style of the left radical can be shared from samples of other handwritten Chinese characters to train this specific character class.
V Conclusion
In this paper, we present parsimonious DNN-HMMs to reduce model redundancy and capture the similarities among different Chinese characters. Note that the model is a left-to-right HMM and the features are extracted from left to right, so the similarities captured by state tying mainly reflect the left-to-right structure. In the future, we plan to investigate parsimonious modeling for 2D-HMM based HCTR to capture more structural information.
Acknowledgment
This work was supported in part by the National Key R&D Program of China under contract No. 2017YFB1002202, the National Natural Science Foundation of China under Grants No. 61671422 and U1613211, the Key Science and Technology Project of Anhui Province under Grant No. 17030901005, and the MOE-Microsoft Key Laboratory of USTC. This work was also funded by Huawei Noah's Ark Lab. The authors would like to thank Mr. Yannan Wang for the contributions to the detailed discussions of GMM-HMM.
References
 [1] F. Yin, Q.-F. Wang, X.-Y. Zhang, and C.-L. Liu, "ICDAR 2013 Chinese handwriting recognition competition," Proc. ICDAR, 2013, pp. 1464–1470.
 [2] Q. Fu, X.-Q. Ding, T. Liu, Y. Jiang, and Z. Ren, "A novel segmentation and recognition algorithm for Chinese handwritten address character strings," Proc. ICPR, 2006, vol. 2, pp. 974–977.
 [3] N. Li and L. Jin, "A Bayesian-based probabilistic model for unconstrained handwritten offline Chinese text line recognition," Proc. IEEE SMC, 2010, pp. 3664–3668.
 [4] Q.-F. Wang, F. Yin, and C.-L. Liu, "Handwritten Chinese text recognition by integrating multiple contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 8, pp. 1469–1481, 2012.
 [5] Y.-C. Wu, F. Yin, and C.-L. Liu, "Improving handwritten Chinese text recognition using neural network language models and convolutional neural network shape models," Pattern Recognition, vol. 65, pp. 251–264, 2017.
 [6] T.-H. Su, T.-W. Zhang, D.-J. Guan, and H.-J. Huang, "Off-line recognition of realistic Chinese handwriting using segmentation-free strategy," Pattern Recognition, vol. 42, no. 1, pp. 167–182, 2009.
 [7] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
 [8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [9] J. Du, Z.-R. Wang, J.-F. Zhai, and J.-S. Hu, "Deep neural network based hidden Markov model for offline handwritten Chinese text recognition," Proc. ICPR, 2016, pp. 3428–3433.
 [10] Z.-R. Wang and J. Du, "Writer code based adaptation of deep neural network for offline handwritten Chinese text recognition," Proc. ICFHR, 2016, pp. 548–553.
 [11] Z.-R. Wang, J. Du, J.-S. Hu, and Y.-L. Hu, "Deep convolutional neural network based hidden Markov model for offline handwritten Chinese text recognition," Proc. ACPR, 2017, pp. 816–821.
 [12] R. Messina and J. Louradour, "Segmentation-free handwritten Chinese text recognition with LSTM-RNN," Proc. ICDAR, 2015, pp. 171–175.
 [13] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," Proc. ICML, 2006, pp. 369–376.
 [14] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
 [15] D. Jurafsky and J. Martin, Speech and Language Processing, vol. 3, Pearson, London, 2014.
 [16] L. Bahl, P. Gopalakrishnan, D. Nahamoo, M. Picheny, et al., "Decision trees for phonological rules in continuous speech," Proc. ICASSP, 1991, pp. 185–188.
 [17] S. Young, J. Odell, and P. Woodland, "Tree-based state tying for high accuracy acoustic modelling," Proceedings of the Workshop on Human Language Technology, 1994, pp. 307–312.
 [18] D. Povey, "Phonetic context-dependent model training," lecture 3, 2010.
 [19] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, "CASIA online and offline Chinese handwriting databases," Proc. ICDAR, 2011, pp. 37–41.
 [20] D. Povey, A. Ghoshal, et al., "The Kaldi speech recognition toolkit," Proc. ASRU, 2011, no. EPFL-CONF-192584.
 [21] J. Hartigan and M. Wong, "Algorithm AS 136: A k-means clustering algorithm," Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
 [22] C.-L. Liu, "Normalization-cooperated gradient feature extraction for handwritten character recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 8, pp. 1465–1469, 2007.
 [23] M. Mohri, F. Pereira, and M. Riley, "Weighted finite-state transducers in speech recognition," Computer Speech & Language, vol. 16, no. 1, pp. 69–88, 2002.
 [24] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, "OpenFst: A general and efficient weighted finite-state transducer library," Implementation and Application of Automata, pp. 17–23, 2007.
 [25] A. Rencher, Methods of Multivariate Analysis, John Wiley & Sons, vol. 492, 2003.
 [26] X.-Y. Zhang, Y. Bengio, and C.-L. Liu, "Online and offline handwritten Chinese character recognition: A comprehensive study and new benchmark," Pattern Recognition, vol. 61, pp. 348–360, 2017.