Next integrated result modelling for stopping the text field recognition process in a video using a result model with per-character alternatives
In the field of document analysis and recognition using mobile devices for capturing, and in the field of object recognition in a video stream, an important problem is determining the moment when the capturing process should be stopped. Efficient stopping influences not only the total time spent on recognition and data entry, but also the expected accuracy of the result. This paper aims to extend the stopping method based on next integrated recognition result modelling so that it can be used within a string recognition result model with per-character alternatives. The stopping method and notes on its extension are described, and an experimental evaluation is performed on the open dataset MIDV-500. The method was compared with previously published methods based on clustering of input observations. The obtained results indicate that the stopping method based on next integrated result modelling achieves higher accuracy, even when compared with the best achievable configuration of the competing methods.
1 Introduction
Modern document entry systems automate the extraction of data from various documents, whether business, regulatory, or personal. Such systems are used for creating digital archives of historical documents [1], recognition of small-scale documents such as business cards [2], ID documents, driving licences, and passports [3], as well as large-scale business documents [4].
The increasing computational power of mobile devices and the improving technical characteristics of small-scale digital cameras have led to increased interest in methods for automatic document entry using mobile devices [5, 6, 7, 8, 9]. As a rule, regular smartphones are used for document recognition, owing to their relatively low cost, sufficient computational power for recognition tasks, and ability to capture video (or a sequence of images). The ability to capture video is one of the most important advantages over traditional scanners, since more information can be retrieved from a video than from a single image, and each newly acquired document image may be used to improve the recognition result [10]. Figure 1 illustrates an example of combining per-frame recognition results in a video stream. As can be seen, the correct integrated result may be acquired even before any individual frame result is correct.
While processing the sequence of frames and combining the per-frame recognition results into a single, more precise one, a problem arises: when should this process terminate? In the general case the capturing process might not be naturally limited, and if a sufficiently good combination strategy is employed, the expected precision of the result increases with the number of integrated observations [11]. However, the time required to perform recognition and output the final result is also very important, and thus efficient strategies for stopping video stream recognition should be developed and studied further.
Optimal stopping problems occupy a special place in mathematical statistics and decision theory [12, 13]. Some methods have also been proposed for the video stream recognition problem [14, 15]. In [14] a method was presented which consisted of clustering the set of per-frame field recognition results, estimating a confidence score for each cluster, and making a stopping decision based on three parameters: cluster size, cluster confidence, and the total number of processed observations. The method can be applied in two ways: the clusters may be formed from the initial per-frame recognition results, or from the integrated results obtained at each stage. This method, however, is not fully formalized and raises the question of how to tune the clustering parameters. In [15] a method is proposed which treats video stream recognition stopping as a monotone sequential decision problem. It presents a stopping strategy derived from the properties of monotone stopping problems; however, it was tested only on text recognition results represented as simple strings, without any per-character alternatives. At the same time, an extended string recognition result model with per-character alternatives is an important way of representing text recognition results: it is used for recognition result post-processing [16] and was shown to be valuable for improving the precision of the integrated result [11].
The goal of this paper is to investigate the applicability of the stopping strategy introduced in [15] to text string recognition with per-character alternatives and to compare it with alternative methods which are already adapted to such a recognition result model. In section 2 a brief description of the stopping method is given, and in section 3 an experimental evaluation and comparison of stopping methods is presented.
2 Method description
In order to provide a description of the stopping method, let us consider text string recognition in a video stream as a sequential decision problem. Let $\mathbb{X}$ represent the set of all possible text string recognition results, and let the task be to recognize a text string with correct value $X^* \in \mathbb{X}$ given a sequence of images $I_1, I_2, \ldots$ which are obtained one at a time. At stage $n$ the image $I_n$ is recognized and the per-frame recognition result $x_n \in \mathbb{X}$ is obtained. After $x_n$ is obtained, the results $x_1, \ldots, x_n$ are combined using some combination algorithm to produce an integrated result $R_n \in \mathbb{X}$. The stopping decision is now either to stop the process and use $R_n$ as the final recognition result, or to continue the process in an effort to obtain an integrated result with higher expected accuracy in the future. If the process is stopped at stage $n$, a penalty is paid in the form of a linear combination of the distance from the obtained result to the correct one (a “price for error”) and the number of frames processed (a “price for time”):
$$L_n = \rho(R_n, X^*) + c \cdot n, \tag{1}$$
where $\rho$ is a metric function on the set $\mathbb{X}$, and $c > 0$ is a constant representing the price paid for each observation.
The stopping rule can be formally expressed as a random variable $N$ (the stopping time), and the stopping method defines the distribution which the variable $N$ takes given the observations $x_1, x_2, \ldots$. The stopping problem is an optimization problem with the goal of minimizing the expected loss, which can be expressed as follows:
$$V(N) = \mathrm{E}(L_N(X_1, \ldots, X_N)) \to \min_N, \tag{2}$$
where $\mathrm{E}(\cdot)$ is the mathematical expectation, and $X_1, X_2, \ldots$ are random recognition results with identical joint distribution with $X^*$, of which $x_1, x_2, \ldots$ are the realizations observed at stages $1, 2, \ldots$.
The stopping method proposed in [15] relies on the assumption that the expected distances between two consecutive integrated results $R_n$ and $R_{n+1}$, measured with the metric $\rho$, decrease over time:
$$\mathrm{E}(\rho(R_n, R_{n+1})) \ge \mathrm{E}(\rho(R_{n+1}, R_{n+2})) \quad \forall n > 0, \tag{3}$$
which allows the text field recognition problem in a video stream to be treated as a monotone stopping problem. In monotone stopping problems, if at some stage the loss is not higher than the expected loss at the next stage, then this remains true for all later stages as well. For monotone stopping problems with a finite horizon the optimal stopping rule is myopic, i.e. it calls for stopping at stage $n$ if the current loss is not higher than the loss that would be suffered if the process were stopped at stage $n+1$.
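Since the correct value $X^*$ is unknown at the moment of making the stopping decision (and thus the loss function (1) cannot be computed), in [15] it is proposed to use the triangle inequality and to threshold an upper bound on the loss difference by estimating the expected distance between the current integrated result and the next one. To achieve this, modelling of the next integrated result is proposed, defining the following stopping rule:
$$N_\Delta = \min\{n \ge 1 : \hat{\Delta}_n \le c\}, \qquad \hat{\Delta}_n = \frac{1}{n+1}\left(\delta + \sum_{i=1}^{n} \rho(R_n, R(x_1, \ldots, x_n, x_i))\right), \tag{4}$$
where $c$ is the observation cost (essentially, a threshold parameter of the stopping method), $\delta$ is an external parameter, $\rho$ is the metric on recognition results, $R_n$ is the integrated result at stage $n$, and $R(x_1, \ldots, x_n, x_i)$ is the modelled integration result of all consecutive observations obtained by stage $n$ concatenated with the $i$-th observation.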
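The step from the myopic condition to a thresholded distance estimate can be written out explicitly. The following is a worked restatement based only on the form of the loss (1) and the triangle inequality for the metric; the symbols ($R_n$ for the integrated result at stage $n$, $X^*$ for the correct value, $\rho$ for the metric, $c$ for the per-observation cost) are the notation assumed in this sketch.

```latex
\begin{align*}
L_n - L_{n+1}
  &= \bigl(\rho(R_n, X^*) + c\,n\bigr) - \bigl(\rho(R_{n+1}, X^*) + c\,(n+1)\bigr) \\
  &= \rho(R_n, X^*) - \rho(R_{n+1}, X^*) - c \\
  &\le \rho(R_n, R_{n+1}) - c
  \qquad \text{since } \rho(R_n, X^*) \le \rho(R_n, R_{n+1}) + \rho(R_{n+1}, X^*).
\end{align*}
```

Hence, whenever the expected distance between the current and the next integrated result does not exceed $c$, the expected loss at stage $n$ is not higher than at stage $n+1$, and the bound no longer involves the unknown $X^*$.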
It is noted in [15] that the concrete method of modelling the next integrated result might depend on the nature of the combination algorithm and other specifics of the problem; however, the proposed method can still be used in a quite general case by replacing the recognition result combination method and the metric function $\rho$. In the original paper the experiments were conducted using Tesseract [17] as the recognition algorithm, a simple string of characters as the recognized string representation, a normalized Levenshtein distance [18] as the metric function $\rho$, and ROVER [19] as the combination algorithm. It was not clear whether this stopping method would be effective for an extended string recognition result model containing per-character classification alternatives. In the extended model, the string recognition result can be represented as a matrix of alternatives:
$$X = \begin{pmatrix} (c_{11}, q_{11}) & (c_{12}, q_{12}) & \cdots & (c_{1K}, q_{1K}) \\ \vdots & \vdots & \ddots & \vdots \\ (c_{M1}, q_{M1}) & (c_{M2}, q_{M2}) & \cdots & (c_{MK}, q_{MK}) \end{pmatrix}, \tag{5}$$
where $c_{jk}$ are character labels, $q_{jk}$ are the class membership estimations for each character, $K$ is the size of the alphabet, and $M$ is the length of the string. The combination algorithms for string recognition results in this extended model can be viewed as a generalization of the ROVER approach, and to define the metric function $\rho$ a generalized Levenshtein distance may be used after defining a metric on the individual character classification results.
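Because the stopping rule touches the recognition results only through the combination algorithm and the metric, it can be sketched with both passed in as functions. The sketch below is illustrative only: the names `combine`, `rho`, `delta_hat` and `run_until_stop` are hypothetical, the estimate is assumed to have the form $(\delta + \sum_i \rho(R_n, R(x_1, \ldots, x_n, x_i)))/(n+1)$, and the toy voting combiner and plain-string metric are stand-ins, not ROVER and not the extended model.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Plain edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def rho(a: str, b: str) -> float:
    """Normalized edit distance in [0, 1]; a stand-in for the metric."""
    d = levenshtein(a, b)
    return 0.0 if d == 0 else 2.0 * d / (len(a) + len(b) + d)


def combine(results: Sequence[str]) -> str:
    """Toy combiner (NOT ROVER): per-position vote among strings of the modal length."""
    modal_len = Counter(len(r) for r in results).most_common(1)[0][0]
    same = [r for r in results if len(r) == modal_len]
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*same))


def delta_hat(seen: List[str], combine_fn: Callable, rho_fn: Callable, delta: float) -> float:
    """Assumed estimate of the expected distance to the next integrated result."""
    current = combine_fn(seen)
    # Model the next result by re-adding each already seen observation in turn.
    total = sum(rho_fn(current, combine_fn(seen + [x])) for x in seen)
    return (delta + total) / (len(seen) + 1)


def run_until_stop(frames: Sequence[str], combine_fn: Callable, rho_fn: Callable,
                   c: float, delta: float) -> Tuple[int, str]:
    """Process a non-empty sequence of frames until the estimate drops to c or below."""
    seen: List[str] = []
    for x in frames:
        seen.append(x)
        if delta_hat(seen, combine_fn, rho_fn, delta) <= c:
            break
    return len(seen), combine_fn(seen)


frames = ["HELL0", "HELLO", "HELLO", "HELLO", "HELLO"]
stage, result = run_until_stop(frames, combine, rho, c=0.022, delta=0.1)
print(stage, result)  # stops at stage 4 with "HELLO"
```

Switching to the model with per-character alternatives would then amount to passing a different `combine_fn` and `rho_fn`, leaving `delta_hat` and `run_until_stop` unchanged.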
3 Experimental evaluation
To compare different stopping rules, expected performance profiles can be used, a methodology from the field of anytime algorithms [20]. Expected performance profiles are plots which show the dependence of the expected accuracy on the expected time required to obtain it.
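As a minimal sketch of how one point of such a profile could be computed, assume that for every clip the distance of the integrated result to the correct value is known at each stage. The function names and the data layout (`errors_by_stage[clip][n - 1]` is the distance at stage `n`) are assumptions made for illustration.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def profile_point(stop_stages: Sequence[int],
                  errors_by_stage: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """One profile point: (mean number of observations, mean distance at stopping)."""
    mean_n = mean(stop_stages)
    mean_err = mean(errs[n - 1] for n, errs in zip(stop_stages, errors_by_stage))
    return mean_n, mean_err


def fixed_stage_profile(errors_by_stage: Sequence[Sequence[float]],
                        max_stage: int) -> List[Tuple[float, float]]:
    """Profile of the baseline rule that always stops after a fixed number of frames."""
    clips = len(errors_by_stage)
    return [profile_point([k] * clips, errors_by_stage) for k in range(1, max_stage + 1)]


# Two toy clips with three stages each (made-up distances).
errors = [[0.5, 0.25, 0.0], [0.25, 0.25, 0.25]]
baseline = fixed_stage_profile(errors, 3)
adaptive = profile_point([1, 3], errors)  # a rule that stops at different stages per clip
```

For a rule with a single threshold, sweeping the threshold and collecting one point per value traces out its profile.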
In order to evaluate the stopping method described in section 2 we used the open dataset MIDV-500 [3], which contains 500 video sequences of 50 types of identity documents with ground truth. Each original clip contained 30 frames. Frames on which the document was not fully visible were removed from consideration, and the resulting clip was repeated in a loop until the original size of 30 frames was reached.
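The clip preparation described above can be sketched in a few lines. The function name `loop_to_length` and the per-frame boolean `visible` flag are assumptions of this example; how full visibility is determined is not covered here.

```python
from itertools import cycle, islice
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def loop_to_length(frames: Sequence[T], visible: Sequence[bool], target: int = 30) -> List[T]:
    """Drop frames flagged as not fully visible, then repeat the rest in a loop."""
    kept = [frame for frame, ok in zip(frames, visible) if ok]
    if not kept:
        return []  # nothing usable in this clip
    return list(islice(cycle(kept), target))


# Toy clip of 6 frames, two of which are rejected, looped back to 10 frames.
clip = loop_to_length(list(range(6)), [True, False, True, True, False, True], target=10)
```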
The ground truth in MIDV-500 contains both the ideal values for text field recognition and the ideal geometric coordinates, i.e. for each field its geometric position within the document boundaries is known, making it possible to crop the field from any frame of the dataset. Text fields were cropped with margins equal in width to 30% of the smallest side of the text field bounding box. Since the physical dimensions of each document type in the MIDV-500 dataset are also known, it is possible to crop each field at a uniform resolution. For recognition, all text fields were cropped at a resolution of 300 DPI. After cropping, each text field was recognized using the text string recognition subsystem of the Smart IDReader document recognition software [10], obtaining the recognized value as a sequence of character classification results with alternatives. For the combination of per-frame recognition results a method from [11] was used, which can be regarded as a generalization of the ROVER [19] approach to string recognition results with per-character alternatives. As a distance metric, a normalized version of the generalized Levenshtein distance was used, with a taxicab metric for individual character classification results.
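A distance of this kind could look roughly as follows. This is a sketch under several assumptions that the text does not confirm: each character result is a dict from label to membership estimation, the substitution cost is half the taxicab distance (so that it lies in [0, 1] when estimations sum to one), insertions and deletions cost 1, and the normalization is $2d / (|x| + |y| + d)$ in the style of the normalized Levenshtein distance.

```python
from typing import Dict, List

CharResult = Dict[str, float]  # label -> class membership estimation


def char_distance(a: CharResult, b: CharResult) -> float:
    """Scaled taxicab distance between two character classification results."""
    labels = set(a) | set(b)
    return 0.5 * sum(abs(a.get(l, 0.0) - b.get(l, 0.0)) for l in labels)


def generalized_levenshtein(x: List[CharResult], y: List[CharResult]) -> float:
    """Edit distance whose substitution cost is char_distance (assumed indel cost 1)."""
    prev = [float(j) for j in range(len(y) + 1)]
    for i, cx in enumerate(x, 1):
        cur = [float(i)]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1.0,                        # deletion
                           cur[j - 1] + 1.0,                     # insertion
                           prev[j - 1] + char_distance(cx, cy)))  # substitution
        prev = cur
    return prev[-1]


def normalized_distance(x: List[CharResult], y: List[CharResult]) -> float:
    """Assumed normalization of the generalized distance to the range [0, 1]."""
    d = generalized_levenshtein(x, y)
    total = len(x) + len(y) + d
    return 0.0 if total == 0 else 2.0 * d / total


# Two readings of "OK" that disagree on how likely the first character is "O" vs "0".
x = [{"O": 0.75, "0": 0.25}, {"K": 1.0}]
y = [{"O": 0.25, "0": 0.75}, {"K": 1.0}]
d_raw = generalized_levenshtein(x, y)
d_norm = normalized_distance(x, y)
```

Unlike a plain string comparison, the two results above are at a nonzero distance even though their top alternatives may coincide after combination, which is the information the extended model adds.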
In [14] a stopping method was proposed which was based on clustering the set of text field recognition results and making a stopping decision based on some properties of the most populous cluster. The method proposed in [15] and described in section 2 was compared with this method in the original paper; however, since that paper focused on a simplified string recognition result model, not all features of the stopping rule presented in [14] were used, as per-character alternatives were not available when using Tesseract as the text string recognition algorithm.
The observations are clustered by length (i.e. by the number of characters in the obtained string recognition results). For each cluster a confidence value is computed according to the following formula:
where $C_l$ denotes a cluster of observations with the same length $l$. The stopping decision is made by thresholding three values: the size of the largest cluster, the confidence of the largest cluster, and, if there is more than one cluster, the difference between the confidences of the two largest clusters. Such thresholding means that the stopping rule has three parameters (three thresholds).
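The three-threshold decision can be sketched as follows. The cluster confidence is deliberately left as a caller-supplied function because its formula is not reproduced here; the `agreement` function used in the example is a made-up stand-in, and all names (`cluster_by_length`, `should_stop`, `min_size`, `min_conf`, `min_gap`) are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence


def cluster_by_length(observations: Sequence[str]) -> Dict[int, List[str]]:
    """Group observations by the number of characters they contain."""
    clusters: Dict[int, List[str]] = defaultdict(list)
    for obs in observations:
        clusters[len(obs)].append(obs)
    return dict(clusters)


def should_stop(observations: Sequence[str],
                confidence: Callable[[List[str]], float],
                min_size: int, min_conf: float, min_gap: float) -> bool:
    """Apply the three thresholds to the largest (and second largest) cluster."""
    clusters = sorted(cluster_by_length(observations).values(), key=len, reverse=True)
    if not clusters:
        return False
    largest = clusters[0]
    if len(largest) < min_size or confidence(largest) < min_conf:
        return False
    if len(clusters) > 1 and confidence(largest) - confidence(clusters[1]) < min_gap:
        return False
    return True


def agreement(cluster: List[str]) -> float:
    """Stand-in confidence (NOT the paper's formula): share of the most frequent member."""
    return Counter(cluster).most_common(1)[0][1] / len(cluster)


obs = ["AB12", "AB12", "AB12", "A81", "AB1"]
decision = should_stop(obs, agreement, min_size=3, min_conf=0.9, min_gap=0.3)
```

Passing per-frame results as `observations` corresponds to clustering the inputs, while passing the sequence of integrated results corresponds to clustering the integrated results.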
Two variations of the stopping method proposed in [14] can be realized: the first, denoted hereinafter as $N_{CX}$, treats the input observations as the strings from which clusters are composed, and the second, $N_{CR}$, treats the integrated results as the observations and components of the clusters. Figure 2 illustrates the quality maps of both approaches under variation of all three thresholds: each data point represents the mean number of observations processed before stopping and the mean distance of the integrated result to the correct value.
One of the main disadvantages of these stopping methods is that it is unclear how to jointly select the values of all thresholds to achieve the highest efficiency. In Figure 2 the black line represents the best option constructed a posteriori, which will be used for comparison with the method described in section 2.
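One plausible reading of the a posteriori best option is a lower envelope over the quality map: for each limit on the mean number of observations, take the smallest mean distance among all threshold configurations that stay within the limit. The sketch below implements that reading; the function name `best_envelope` and the sample points are invented for illustration and are not values from the experiments.

```python
from typing import List, Optional, Sequence, Tuple


def best_envelope(points: Sequence[Tuple[float, float]],
                  limits: Sequence[float]) -> List[Tuple[float, Optional[float]]]:
    """For each limit, the lowest mean distance among configurations within the limit.

    points: (mean number of observations, mean distance) per threshold configuration.
    """
    envelope = []
    for limit in limits:
        feasible = [err for mean_n, err in points if mean_n <= limit]
        envelope.append((limit, min(feasible) if feasible else None))
    return envelope


# Made-up quality map with five threshold configurations.
quality_map = [(2.0, 0.30), (3.5, 0.22), (3.0, 0.25), (5.0, 0.24), (6.0, 0.10)]
envelope = best_envelope(quality_map, limits=[1, 3, 5, 6])
```

Such an envelope can only be built after all configurations have been evaluated, which is why it serves as an optimistic reference rather than as a usable parameter selection procedure.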
Figure 3 compares the expected performance profiles of the best achievable versions of the stopping rules $N_{CX}$ and $N_{CR}$, the stopping rule $N_\Delta$ based on modelling of the next integrated result, described in section 2, and, as a baseline, a simple stopping rule $N_k$ which stops after observing the $k$-th per-frame result. It can be seen that even though the best versions of the clustering stopping rules were evaluated, without a clear understanding of how to obtain these jointly optimal threshold values, the method $N_\Delta$ still outperforms them.
Table 1: Mean distance of the integrated result to the correct value at stopping time, for each stopping method, under a limit on the average number of observations.
Table 1 shows the mean integrated result accuracy (in terms of distance to the correct value) achieved at stopping time by the evaluated stopping rules under restrictions on the mean number of processed observations. It can be seen that the method based on modelling the next integrated result and thresholding the estimate of the expected distance from the current result to the next one ($N_\Delta$) outperforms the other methods. In particular, it achieves higher result quality with the same average number of processed observations, even when compared with the best achievable version of the previously proposed clustering method.
4 Conclusion
The paper describes the problem of stopping the process of text line recognition in a video stream. Previously presented stopping methods were described and their properties analyzed. A method based on modelling of the next integrated result is described and applied to the model of a text recognition result as a matrix of alternatives with extended per-character classification results. The applicability of the stopping method in these conditions is shown, and a comparative evaluation is performed against previously published methods. It was shown that the next integrated result modelling method outperforms the previously published clustering methods, even in their best achievable configurations.
Acknowledgements. This work is partially financially supported by the Russian Foundation for Basic Research (projects 17-29-03170 and 19-29-09055).
- [1] T. Van Phan, K. Cong Nguyen, and M. Nakagawa, “A nom historical document recognition system for digital archiving,” International Journal on Document Analysis and Recognition (IJDAR) 19(1), 49–64 (2016). doi:10.1007/s10032-015-0257-8.
- [2] B. A. Dangiwa and S. S. Kumar, “A business card reader application for iOS devices based on Tesseract,” in 2018 International Conference on Signal Processing and Information Security (ICSPIS), 1–4 (2018). doi:10.1109/CSPIS.2018.8642727.
- [3] V. V. Arlazarov, K. Bulatov, T. Chernov, and V. L. Arlazarov, “A dataset for identity documents analysis and recognition on mobile devices in video stream,” arXiv:1807.05786 (2018).
- [4] D. Esser, K. Muthmann, and D. Schuster, “Information extraction efficiency of business documents captured with smartphones and tablets,” in Proceedings of the 2013 ACM Symposium on Document Engineering, DocEng ’13, 111–114, ACM, New York, NY, USA (2013). doi:10.1145/2494266.2494302.
- [5] V. V. Arlazarov and D. Slugin, “Text fields extraction based on image processing,” Proc. Institute for Systems Analysis RAS 67(4), 65–73 (2017). (In Russian).
- [6] K. Ravneet, “Text recognition applications for mobile devices,” Journal of Global Research in Computer Science 9(4), 20–24 (2018).
- [7] M. Povolotskiy and D. Tropin, “Dynamic programming approach to template-based OCR,” in Proc. SPIE (ICMV 2018), 11041 (2019). doi:10.1117/12.2522974.
- [8] N. Skoryukina, J. Shemiakina, V. L. Arlazarov, and I. Faradjev, “Document localization algorithms based on feature points and straight lines,” in Proc. SPIE (ICMV 2017), 10696 (2018). doi:10.1117/12.2311478.
- [9] Y. Chernyshova, M. Aliev, E. Gushchanskaia, and A. Sheshkus, “Optical font recognition in smartphone-captured images and its applicability for id forgery detection,” in Proc. SPIE (ICMV 2018), 11041 (2019). doi:10.1117/12.2522955.
- [10] K. Bulatov, V. V. Arlazarov, T. Chernov, O. Slavin, and D. Nikolaev, “Smart IDReader: Document recognition in video stream,” in 14th International Conference on Document Analysis and Recognition (ICDAR), 6, 39–44, IEEE (2017). doi:10.1109/ICDAR.2017.347.
- [11] K. Bulatov, “A method to reduce errors of string recognition based on combination of several recognition results with per-character alternatives,” Bulletin of the South Ural State University. Ser. Mathematical Modelling, Programming & Computer Software 12(3), 74–88 (2019). doi:10.14529/mmp190307.
- [12] T. Ferguson and M. Klass, “House-hunting without second moments,” Sequential Analysis 29(3), 236–244 (2010). doi:10.1080/07474946.2010.487423.
- [13] S. Christensen and A. Irle, “The monotone case approach for the solution of certain multidimensional optimal stopping problems,” arXiv:1705.01763 (2019).
- [14] V. V. Arlazarov, K. Bulatov, T. Manzhikov, O. Slavin, and I. Janiszewski, “Method of determining the necessary number of observations for video stream documents recognition,” in Proc. SPIE (ICMV 2017), 10696 (2018). doi:10.1117/12.2310132.
- [15] K. Bulatov, N. Razumnyi, and V. V. Arlazarov, “On optimal stopping strategies for text recognition in a video stream as an application of a monotone sequential decision model,” International Journal on Document Analysis and Recognition (IJDAR) 22, 303–314 (Sep 2019). doi:10.1007/s10032-019-00333-0.
- [16] R. Llobet, J. Cerdan-Navarro, J. Perez-Cortes, and J. Arlandis, “OCR post-processing using weighted finite-state transducers,” in 2010 20th International Conference on Pattern Recognition, 2021–2024 (2010). doi:10.1109/ICPR.2010.498.
- [17] R. Smith, “An overview of the Tesseract OCR engine,” in Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 02, ICDAR ’07 2, 629–633, IEEE Computer Society (2007).
- [18] L. Yujian and L. Bo, “A normalized levenshtein distance metric,” IEEE Trans Pattern Anal Mach Intell 29(6), 1091–1095 (2007). doi:10.1109/TPAMI.2007.1078.
- [19] J. G. Fiscus, “A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER),” in IEEE Workshop Autom Speech Recognit Underst, 347–354 (1997). doi:10.1109/ASRU.1997.659110.
- [20] S. Zilberstein, “Using anytime algorithms in intelligent systems,” AI Magazine 17(3), 73–83 (1996). doi:10.1609/aimag.v17i3.1232.