Document Image Coding and Clustering for Script Discrimination
Department of Computer Engineering, Modeling, Electronics and Systems, University of Calabria,
Via P. Bucci Cube 44, 87036 Rende (CS), Italy
College of Applied Technical Sciences, Aleksandra Medvedeva 20, Niš 18000, Serbia
The paper introduces a new method for discriminating documents written in different scripts. The document is mapped into a uniformly coded text of numerical values, derived from the position of the letters in the text line according to their typographical characteristics. Each code is considered as a gray level. Accordingly, the coded text determines a 1-D image, on which texture analysis by run-length statistics and local binary patterns is performed. This analysis defines feature vectors representing the script content of the document. A modified clustering approach applied to the document feature vectors groups documents written in the same script. Experiments performed on two custom-oriented databases of historical documents in old Cyrillic, angular and round Glagolitic, as well as Antiqua and Fraktur scripts, demonstrate the superiority of the proposed method over well-known state-of-the-art methods.
Keywords: Historical documents, Feature extraction, Script recognition, Clustering
1 Introduction
Script recognition is of great importance in document image analysis and optical character recognition. Typically, it is the automatic recognition by computer of the script used in scanned documents. This process usually reduces the number of different symbol classes that are then considered for classification.
The methods proposed for script recognition have been classified as global or local. Global methods divide the image of the document into larger blocks, which are normalized and cleaned of noise. Then, statistical or frequency-domain analysis is employed on the blocks. In contrast, local methods divide the document image into small blocks of text, called connected components, on which feature analysis, e.g., of black pixel runs, is applied. Local methods are much more computationally demanding than global ones, but better able to deal with noisy document images. In any case, previously proposed methods reach an accuracy in script identification between 85% and 95%.
In this paper, we present a new method for discrimination of documents written in different scripts. In contrast to many previous methods, it can be used prior to or during the preprocessing stage. It is primarily based on features extracted from the bounding box of each connected component, namely its height and the position of its center point in the text line. Hence, there is no need to identify single characters in order to differentiate scripts. For this reason, it is particularly useful when the documents are noisy. Furthermore, it maps the connected components of the text to only 4 different codes, similarly to previous work that used character shape codes. In this way, the number of variables is considerably reduced, which yields a computationally light procedure. A modified version of a clustering method is proposed and applied to the extracted features for grouping documents written in the same script. Experiments performed on Balkan medieval documents in old Cyrillic, angular and round Glagolitic scripts, and on German documents in Antiqua and Fraktur scripts, yield an accuracy of up to 100%. The main application of the proposed approach is in the cultural heritage area, i.e., in script recognition and classification of historical documents, which includes determining their origin as well as the influence of different cultural centers on them.
The paper is organized as follows. Section 2 introduces the coding phase and the mapping of the text to a 1-D image. Section 3 presents the clustering method. Section 4 describes and discusses the experiment. Finally, Section 5 draws conclusions.
2 Script Coding
The coding phase transforms the script into a uniformly coded text, which is then subjected to feature extraction. It is composed of two main steps: (i) mapping of the text into an image based on typographical features, by means of text line segmentation, blob extraction, and detection of blob heights and center points; (ii) extraction of features from the image based on run-length and local binary pattern analysis.
2.1 Mapping based on typographical features
First, the text of the document is transformed into a 1-D image based on its typographical features. The text is segmented into text lines by employing the horizontal projection profile, which is also used to detect a central reference line for each text line. A bounding box is traced around each blob, i.e., letter, and is used to derive the distribution of the blob heights and center points. The typographical classification of the text is based on these extracted features. Figure 1 shows this step of the algorithm on a short medieval document from the Balkan region written in old Cyrillic script.
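The line segmentation step can be illustrated with a minimal sketch. It assumes a binarized page given as a list of rows (1 = ink, 0 = background) and takes the middle row of each band as the central reference line; both choices, and the function name `segment_text_lines`, are illustrative assumptions rather than the procedure used in the paper.

```python
def segment_text_lines(image):
    """Split a binary image (list of rows, 1 = ink) into text lines.

    Returns a list of (top, bottom, center) row positions, bottom inclusive.
    """
    # Horizontal projection profile: number of ink pixels in every row.
    profile = [sum(row) for row in image]

    bands = []
    top = None
    for y, count in enumerate(profile):
        if count > 0 and top is None:
            top = y                      # a text line starts here
        elif count == 0 and top is not None:
            bands.append((top, y - 1))   # the line ended on the previous row
            top = None
    if top is not None:                  # line touching the bottom border
        bands.append((top, len(profile) - 1))

    # Central reference line: simply the middle of each band (an assumption).
    return [(t, b, (t + b) / 2) for t, b in bands]


page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0],
]
print(segment_text_lines(page))  # [(1, 2, 1.5), (4, 6, 5.0)]
```

On real scans, a small threshold on the profile instead of the strict `count > 0` test would probably be needed to tolerate noise.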
Bounding box heights and center point locations determine the categorization of the corresponding blobs into the following classes: (i) base letter (0), (ii) ascender letter (1), (iii) descender letter (2), and (iv) full letter (3). Figure 2 depicts the classification based on typographical features.
Starting from this classification, the text is transformed into a gray-level 1-D image. The following mapping is realized: base letter to 0, ascender letter to 1, descender letter to 2, and full letter to 3. It codes the text into a long sequence of the numerical codes 0, 1, 2 and 3. Each code corresponds to a gray level, which determines the 1-D image. Figure 3 shows the procedure of text coding.
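A possible way to turn bounding boxes into the four codes is sketched below. The decision rule is an assumption, since the text does not state one: a blob is taken to ascend if its top lies above the upper boundary of the base-letter band (`x_top`) and to descend if its bottom lies below the baseline (`baseline`), with a tolerance `tol`. All of these names are hypothetical, and image coordinates are assumed to grow downward.

```python
def code_blob(center_y, height, x_top, baseline, tol=1.0):
    """Map one blob to 0 (base), 1 (ascender), 2 (descender) or 3 (full)."""
    # Recover the vertical extent of the bounding box from center and height.
    top = center_y - height / 2.0
    bottom = center_y + height / 2.0
    ascends = top < x_top - tol          # reaches above the base-letter band
    descends = bottom > baseline + tol   # reaches below the baseline
    if ascends and descends:
        return 3
    if ascends:
        return 1
    if descends:
        return 2
    return 0


def code_text_line(blobs, x_top, baseline, tol=1.0):
    """Code a text line given as a list of (center_y, height) pairs."""
    return [code_blob(c, h, x_top, baseline, tol) for c, h in blobs]


# Base-letter band between rows 10 and 20 (illustrative values).
blobs = [(15.0, 10.0), (12.5, 15.0), (17.5, 15.0), (15.0, 20.0)]
print(code_text_line(blobs, x_top=10.0, baseline=20.0))  # [0, 1, 2, 3]
```

The resulting list of codes is the 1-D image: each entry can be read directly as one of four gray levels.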
2.2 Feature extraction
Texture analysis is adopted to compute statistical measures useful for differentiating the images. Run-length analysis is applied to the obtained 1-D image to create a feature vector of 11 elements representing the document. It computes the following features: (i) short run emphasis (SRE), (ii) long run emphasis (LRE), (iii) gray-level non-uniformity (GLN), (iv) run length non-uniformity (RLN), (v) run percentage (RP), (vi) low gray-level run emphasis (LGRE), (vii) high gray-level run emphasis (HGRE), (viii) short run low gray-level emphasis (SRLGE), (ix) short run high gray-level emphasis (SRHGE), (x) long run low gray-level emphasis (LRLGE), and (xi) long run high gray-level emphasis (LRHGE).
Local Binary Pattern (LBP) analysis yields only 4 different features, from ‘00’ to ‘11’, when the document is represented by an image with 4 gray levels. This number of features is not sufficient for good discrimination. Hence, LBP is extended to Adjacent Local Binary Pattern (ALBP), which is the horizontal co-occurrence of LBP. It determines 16 features, from ‘0000’ to ‘1111’, whose histogram is computed as a 16-dimensional feature vector. Run-length feature vectors and ALBP feature vectors can then be employed for classification and discrimination of scripts in text documents.
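Both feature vectors can be sketched on a 1-D sequence of codes. The run-length statistics below follow the usual definitions of these measures, with gray levels shifted from 0..3 to 1..4 so that the low gray-level terms are defined. The 1-D LBP compares only the left and right neighbors with the center. Both choices are assumptions made for illustration, as are the function names, and may differ from the exact variant used in the paper.

```python
from collections import Counter
from itertools import groupby


def run_length_features(codes):
    """Return 11 run-length statistics of a 1-D sequence of codes 0..3."""
    # Each run is (gray level, run length). Levels are shifted to 1..4 so
    # that divisions by g**2 are defined (an assumption of this sketch).
    runs = [(g + 1, len(list(group))) for g, group in groupby(codes)]
    n_runs, n_pixels = len(runs), len(codes)

    def mean(term):
        return sum(term(g, r) for g, r in runs) / n_runs

    runs_per_level = Counter(g for g, _ in runs)
    runs_per_length = Counter(r for _, r in runs)
    return {
        "SRE": mean(lambda g, r: 1 / r**2),
        "LRE": mean(lambda g, r: r**2),
        "GLN": sum(c**2 for c in runs_per_level.values()) / n_runs,
        "RLN": sum(c**2 for c in runs_per_length.values()) / n_runs,
        "RP": n_runs / n_pixels,
        "LGRE": mean(lambda g, r: 1 / g**2),
        "HGRE": mean(lambda g, r: g**2),
        "SRLGE": mean(lambda g, r: 1 / (g**2 * r**2)),
        "SRHGE": mean(lambda g, r: g**2 / r**2),
        "LRLGE": mean(lambda g, r: r**2 / g**2),
        "LRHGE": mean(lambda g, r: g**2 * r**2),
    }


def albp_histogram(codes):
    """16-bin histogram of horizontally adjacent 1-D LBP codes."""
    # 1-D LBP: one bit per neighbor (left, right), set when neighbor >= center.
    lbp = [
        2 * (codes[i - 1] >= codes[i]) + (codes[i + 1] >= codes[i])
        for i in range(1, len(codes) - 1)
    ]
    hist = [0] * 16
    # Two adjacent 2-bit LBP codes form one 4-bit pattern, 0000..1111.
    for left, right in zip(lbp, lbp[1:]):
        hist[4 * left + right] += 1
    return hist


codes = [0, 0, 1, 3, 3, 3, 0, 2, 2, 0]
features = run_length_features(codes)
print(features["RP"], features["GLN"])  # 0.6 2.0
print(albp_histogram(codes))
```

In the example the sequence has 6 runs over 10 positions, which gives the run percentage of 0.6.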
3 Clustering Analysis
Discrimination of feature vectors representing documents in different scripts is performed by an extension of the Genetic Algorithms Image Clustering for Document Analysis (GA-ICDA) method. GA-ICDA is a bottom-up evolutionary strategy, for which the document database is represented as a weighted graph $G = (V, E, W)$. Nodes $V$ correspond to documents and edges $E$ to weighted connections, where $W$ is the set of weights, modeling the affinity degree among the nodes. A node $v \in V$ is linked to a subset of its $h$-nearest neighbor nodes $nn_v^h$. They represent the documents most similar to the document of that node. Similarity is based on the $L_1$ norm of the corresponding feature vectors, while the parameter $h$ influences the size of the neighborhood. Hence, the similarity $w(i,j)$ between two documents $i$ and $j$ is expressed as:

$$w(i,j) = e^{-\frac{d(i,j)^2}{a^2}} \qquad (1)$$

where $d(i,j)$ is the $L_1$ norm between the feature vectors of $i$ and $j$, and $a$ is a local scale parameter.
Then, a node ordering $f$ is established, which is a one-to-one association between graph nodes and integer labels, $f: V \rightarrow \{1, 2, \ldots, n\}$, $n = |V|$. Given a node $v \in V$, the difference is computed between its label $f(v)$ and the labels of the nodes in $nn_v^h$. Hence, edges are considered only between $v$ and the nodes in $nn_v^h$ for which the label difference is less than a threshold $T$. This is done for each node in $V$, to obtain an adjacency matrix of $G$ with low bandwidth. It represents a graph in which the connected components, which are the clusters of documents in a given script, are more clearly visible.
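The effect of the label threshold on the neighbor graph can be sketched as follows. The sketch assumes the L1 (city-block) distance between feature vectors, takes the plain h nearest nodes as the neighborhood, and uses the position in the input list as the node label unless `labels` is given; these simplifications and the function names are assumptions, not the GA-ICDA implementation.

```python
def l1_distance(u, v):
    """City-block distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def banded_neighbor_edges(vectors, h, threshold, labels=None):
    """Edges between each node and those of its h nearest neighbors whose
    integer label differs from its own by less than `threshold`."""
    n = len(vectors)
    labels = list(range(n)) if labels is None else labels
    edges = set()
    for i in range(n):
        # Rank all other nodes by distance and keep the h closest ones.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: l1_distance(vectors[i], vectors[j]),
        )
        for j in others[:h]:
            if abs(labels[i] - labels[j]) < threshold:
                edges.add((min(i, j), max(i, j)))
    return sorted(edges)


vectors = [[0], [1], [50], [52], [2]]
print(banded_neighbor_edges(vectors, h=2, threshold=2))
# [(0, 1), (2, 3), (3, 4)]
```

In this toy example node 4 is close to nodes 0 and 1 in feature space but far from them in the ordering, so those edges are dropped. This suggests why the quality of the node ordering matters for the resulting banded adjacency matrix.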
Finally, $G$ is subjected to an evolutionary clustering method to detect clusters of nodes. Then, to refine the obtained solution, a merging procedure is applied to the clusters. At each step, the pair of clusters with minimum mutual distance is selected and merged, until a fixed number of clusters is reached. The distance between two clusters $C_i$ and $C_j$ is computed as the $L_1$ norm between the two farthest document feature vectors, one from each cluster.
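The merging procedure described above can be sketched in a few lines. The L1 distance between feature vectors is assumed here, and the function names are hypothetical. The sketch covers only the refinement step, not the evolutionary clustering that produces the initial clusters.

```python
def l1_distance(u, v):
    """City-block distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cluster_distance(c1, c2):
    """Distance between the two farthest vectors, one from each cluster."""
    return max(l1_distance(u, v) for u in c1 for v in c2)


def merge_clusters(clusters, target):
    """Merge the closest pair of clusters until `target` clusters remain."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > target:
        # Evaluate every pair and pick the one with minimum mutual distance.
        pairs = [
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        _, i, j = min(pairs)
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


initial = [[[0, 0]], [[1, 0]], [[10, 10]], [[11, 10]], [[0, 1]]]
print(merge_clusters(initial, target=2))
# [[[0, 0], [1, 0], [0, 1]], [[10, 10], [11, 10]]]
```

Using the farthest pair as the inter-cluster distance is the same criterion as in complete linkage, so the refinement tends to favor compact clusters.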
A modification is introduced in the base version of GA-ICDA to make it more suitable for complex discrimination tasks, such as the differentiation of historical documents written in different scripts. It consists of extending the similarity concept expressed in Equation (1) to a more general characterization. This is realized by substituting the exponent ‘2’ in Equation (1) with a parameter $\alpha$, to obtain a “smoothed” similarity computation between the nodes in $V$ when necessary. It is very useful in such a complex context, where documents are heterogeneous and their mutual distance can be particularly high even if they belong to the same script typology. Because, for distances greater than 1, a lower exponent in Equation (1) yields a higher similarity value for the same distance value, it mitigates the problem.
Hence, the similarity between two documents $i$ and $j$ is now defined as:

$$w(i,j) = e^{-\frac{d(i,j)^{\alpha}}{a^2}} \qquad (2)$$
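A small numerical sketch makes the effect of the exponent concrete. It assumes the similarity has the form exp(-d^alpha / a^2), with d the L1 distance between feature vectors and a the local scale parameter; the function name and the sample values are illustrative.

```python
import math


def similarity(u, v, a=1.0, alpha=2.0):
    """Similarity exp(-d**alpha / a**2), with d the L1 distance (assumed)."""
    d = sum(abs(x - y) for x, y in zip(u, v))
    return math.exp(-(d ** alpha) / a ** 2)


u, v = [0.0, 0.0], [2.0, 1.0]              # L1 distance d = 3
base = similarity(u, v, a=2.0, alpha=2.0)   # exponent of the base version
smooth = similarity(u, v, a=2.0, alpha=1.0) # lower exponent
print(round(base, 3), round(smooth, 3))     # 0.105 0.472
```

For this pair the lower exponent raises the similarity from about 0.105 to about 0.472. The ordering reverses for distances below 1, so the smoothing helps precisely when documents of the same script lie far apart.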
4 Experimental Results
The proposed method is evaluated on two complex custom-oriented databases. The first one is a collection of labels from the Balkan region, hand-engraved in stone and hand-printed on paper, written in old Cyrillic, angular Glagolitic and round Glagolitic scripts. The database contains 5 labels in old Cyrillic, 10 labels in angular Glagolitic and 5 labels in round Glagolitic, for a total of 20 labels. The second database is composed of 100 historical German documents, mainly from J. W. von Goethe’s poems, written in Antiqua and Fraktur scripts. The experiment consists of applying the modified GA-ICDA to the run-length and ALBP feature vectors computed from the documents in the two databases, in order to test its efficacy in correctly differentiating the script types. The modified GA-ICDA is compared with four other clustering methods: the base version of GA-ICDA, Complete Linkage Hierarchical clustering, Self-Organizing Map (SOM) and K-Means, which are well known for document categorization. A trial-and-error procedure is applied to benchmark documents, different from those in the databases, for tuning the parameters of the methods. The parameter values providing the best solution on the benchmark are employed for clustering. Accordingly, the exponent parameter $\alpha$ introduced by the modification is fixed to 1. Precision, Recall, F-Measure (computed for each script class) and Normalized Mutual Information (NMI) are adopted as performance measures for clustering evaluation. Each method has been executed 100 times, and the average value of each measure together with its standard deviation has been computed.
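The performance measures can be sketched as follows. The conventions are assumptions, since the text does not fix them: each script class is matched to the cluster giving the best F-Measure, and NMI is normalized by the arithmetic mean of the two entropies. Function names are hypothetical.

```python
import math
from collections import Counter


def class_scores(truth, pred, cls):
    """Precision, Recall and F-Measure of class `cls` against the cluster
    that yields the highest F-Measure (matching rule assumed)."""
    n_class = truth.count(cls)
    best = (0.0, 0.0, 0.0)  # stored as (F, P, R) so max() ranks by F first
    for k in set(pred):
        n_cluster = pred.count(k)
        n_both = sum(1 for t, p in zip(truth, pred) if t == cls and p == k)
        if n_both == 0:
            continue
        p, r = n_both / n_cluster, n_both / n_class
        best = max(best, (2 * p * r / (p + r), p, r))
    f, p, r = best
    return p, r, f


def nmi(truth, pred):
    """Normalized Mutual Information, 2 * I / (H_truth + H_pred)."""
    n = len(truth)

    def entropy(labels):
        return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

    count_t, count_p = Counter(truth), Counter(pred)
    mutual = sum(
        c / n * math.log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in Counter(zip(truth, pred)).items()
    )
    h_sum = entropy(truth) + entropy(pred)
    return 1.0 if h_sum == 0 else 2 * mutual / h_sum


truth = ["Cyrillic", "Cyrillic", "Glagolitic", "Glagolitic", "Glagolitic"]
pred = [0, 0, 0, 1, 1]
print(class_scores(truth, pred, "Glagolitic"))
print(nmi(truth, pred))
```

Averaging these values over repeated runs, together with their standard deviation, would reproduce the kind of summary reported in the evaluation.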
Figure 4 shows the corresponding results in graphical form. It is worth noting that the modified GA-ICDA performs considerably better than the other clustering methods on both databases, and that the adopted modification improves the final result with respect to the base version of GA-ICDA. Also, the standard deviation is always zero, which confirms the stability of the obtained results.
5 Conclusion
The paper proposed a new method for differentiation of script type in text documents. In the first step, the document was mapped into a uniformly coded text. Then, it was transformed into a 1-D gray-level image, from which texture features were extracted. A modified version of the GA-ICDA method was applied to the feature vectors for document discrimination based on script typology. Extensive experiments on two complex databases of historical documents proved the effectiveness of the proposed method.
Future work will extend the experiment to larger datasets of labels engraved on different materials, such as bronze, and will compare the method with other classification algorithms.
References
-  D. Ghosh, T. Dube and A. Shivaprasad, Script recognition – A review, IEEE Trans. Pattern Analysis and Machine Intelligence, vol.32, no.12, pp.2142-2161, 2010.
-  U. Pal and B. B. Chaudhuri, Indian script character recognition: A survey, Pattern Recognition, vol.37, no.9, pp.1887-1899, 2004.
-  G. Nagy, Twenty years of document image analysis in PAMI, IEEE Trans. Pattern Analysis and Machine Intelligence, vol.22, no.1, pp.38-62, 2000.
-  G. D. Joshi, S. Garg and J. Sivaswamy, A generalised framework for script identification, International Journal of Document Analysis and Recognition, vol.10, no.2, pp.55-68, 2007.
-  P. Sibun and A. L. Spitz, Language determination: Natural language processing from scanned document images, Proc. of the 4th Conference on Applied Natural Language Processing, Las Vegas, USA, pp.423-433, 1995.
-  A. W. Zramdini and R. Ingold, Optical font recognition using typographical features, IEEE Trans. Pattern Analysis and Machine Intelligence, vol.20, no.8, pp.877-882, 1998.
-  D. Brodić, Z. N. Milivojević and Č. A. Maluckov, An approach to the script discrimination in the Slavic documents, Soft Computing, vol.19, no.9, pp.2655-2665, 2015.
-  M. M. Galloway, Texture analysis using gray level run lengths, Computer Graphics and Image Processing, vol.4, no.2, pp.172-179, 1975.
-  A. Chu, C. M. Sehgal and J. F. Greenleaf, Use of gray value distribution of run lengths for texture analysis, Pattern Recognition Letters, vol.11, no.6, pp.415-419, 1990.
-  B. V. Dasarathy and E. B. Holder, Image characterizations based on joint gray-level run-length distributions, Pattern Recognition Letters, vol.12, no.8, pp.497-502, 1991.
-  T. Ojala, M. Pietikäinen and D. Harwood, A comparative study of texture measures with classification based on featured distributions, Pattern Recognition, vol.29, no.1, pp.51-59, 1996.
-  R. Nosaka, Y. Ohkawa and K. Fukui, Feature extraction based on co-occurrence of adjacent local binary patterns, Proc. of the 5th Pacific Rim Symposium on Image and Video Technology, Gwangju, South Korea, pp.82-91, 2011.
-  D. Brodić, A. Amelio and Z. N. Milivojević, Classification of the scripts in medieval documents from Balkan region by run-length texture analysis, Proc. of the 22nd Conference on Neural Information Processing, Istanbul, Turkey, pp.442-450, 2015.
-  D. Brodić, A. Amelio and Z. N. Milivojević, Characterization and distinction between closely related south Slavic languages on the example of Serbian and Croatian, Proc. of the 16th International Conference on Computer Analysis of Images and Patterns, Valletta, Malta, pp.654-666, 2015.
-  C. C. Aggarwal and C. Zhai, Mining Text Data, Springer USA, 2012.
-  N. O. Andrews and E. A. Fox, Recent Developments in Document Clustering, Tech. rep., Computer Science, Virginia Tech., 2007.