Learning Surrogate Models of Document Image Quality Metrics for Automated Document Image Processing


Prashant Singh, Ekta Vats and Anders Hast
Department of Information Technology
Uppsala University, SE-751 05 Uppsala, Sweden
Email: prashant.singh@it.uu.se; ekta.vats@it.uu.se; anders.hast@it.uu.se

Computation of document image quality metrics often depends upon the availability of a ground truth image corresponding to the document. This limits the applicability of quality metrics in applications such as hyperparameter optimization of image processing algorithms that operate on-the-fly on unseen documents. This work proposes the use of surrogate models to learn the behavior of a given document quality metric on existing datasets where ground truth images are available. The trained surrogate model can later be used to predict the metric value on previously unseen document images without requiring access to ground truth images. The surrogate model is empirically evaluated on the Document Image Binarization Competition (DIBCO) and the Handwritten Document Image Binarization Competition (H-DIBCO) datasets.

surrogate models; document image quality metrics; hyperparameter optimization

I Introduction

Document image quality metrics are objective measures that enable assessment and quantification of characteristics of a given document image. Such metrics are crucial for enabling automatic document processing applications, such as fully-automatic document image binarization. Specifically, document image processing algorithms involve hyperparameters that must be optimized to achieve the best possible resulting image. Hyperparameter optimization techniques such as Bayesian optimization [1] require formulation of an objective function to be maximized. Document image quality metrics are natural candidates as objective functions.

In general, document image quality is calculated by comparing the image in question to a noise-free replica of the document image, known as the ground truth reference image. Several popular image quality metrics exist in the literature [2]. The vast majority of methods use Optical Character Recognition (OCR) results as document quality metrics [3, 4, 5]. Simple techniques for measuring image quality, such as the Mean Squared Error (MSE), do not suffice due to the complex and degraded nature of the images, so more sophisticated methods are needed. Popular document quality evaluation measures [6, 7] include the F-Measure, the Peak Signal-to-Noise Ratio (PSNR), the Distance Reciprocal Distortion metric (DRD) [8], and the Negative Rate Metric (NRM). Computation of such metrics requires a corresponding distortion-free ground truth reference image for any given document image.

In addition to ground truth images, human opinion scores have been used as ground truth in [8, 9, 10, 11] to automatically compute the document image quality metrics. A full reference document image quality assessment technique based on texture similarity index was introduced in [12] with promising results for OCR text images. There have been recent efforts to formulate image quality metrics that are not dependent on availability of ground truths. Xu et al. [13] presented a no-reference image quality metric for document image quality assessment.

However, no-reference image quality metrics such as [13] are typically designed for document images with OCR text. They focus on specific degradations that are mostly character-level distortions (e.g., noise around a character, partial or overlapping characters), and are not suitable for quantifying the high levels of degradation found in historical handwritten texts. Machine-printed documents have simple layouts and fonts, unlike handwritten documents, which have complex layouts and variability in writing style. Handwritten documents suffer from degradations such as paper stains, ink bleed-through, missing or faded data, poor contrast and warping effects, which hamper document readability and pose challenges for document image processing algorithms [14].

Such variability and severity of degradations are better captured using ground truth based document image quality metrics such as F-Measure, PSNR and DRD. Ground truth images offer a reference point, relative to which candidate images can be ranked. This greatly helps image processing algorithms in automatically evaluating the quality of processed images.

However, the reliance on the availability of ground truth images is also severely limiting. In fact, ground truth generation is itself one of the applications in the target domain of automated document image processing. Therefore, it is impractical to expect access to ground truths corresponding to previously unseen document images that are to be processed on-the-fly. It is, however, possible to have access to a training set of document images and corresponding ground truth images.

This work explores a novel methodology wherein document quality metric scores computed using ground truth images as reference are used to train a model that learns the relationship between the difference in image quality represented by two images, and the corresponding metric score. Given two document images, an initial image and a processed image for which the quality metric is to be computed, the trained model can be used as a surrogate that predicts the value of the metric. Training the surrogate model is a one-time investment, and requires access to input images with corresponding ground truth images. Post training, evaluation of the surrogate model is near-instant and does not require access to the ground truth image for any given test image.

This paper is organized as follows. Section II describes the concrete problem statement. Section III discusses various document quality metrics available in the literature. Section IV explains the proposed surrogate modeling approach in detail. Section V demonstrates the efficacy of the proposed approach on the DIBCO and H-DIBCO datasets. Section VI discusses an alternative deep learning formulation for surrogate model training. Section VII concludes the paper.

II Problem Statement

The performance of document image processing tasks such as document binarization, filtering, enhancement, text or line segmentation, and high level applications such as word spotting in a document, significantly depends on the associated hyperparameter values. In general, an automated document image processing algorithm involves automatic selection of control parameters on-the-fly. This work uses document binarization as a running example throughout the text.

Although there exist several automated document image processing methods in the literature [15, 16], a ground truth reference image is required to tune the associated hyperparameters. For example, an automatic document image binarization method is proposed in [15], where Bayesian optimization is used to infer the hyperparameters on-the-fly. The values of the hyperparameters are chosen such that the quality metrics corresponding to the binarized image (such as F-Measure and PSNR) are maximized, or the error is minimized. However, the optimization of quality metrics such as the F-Measure, PSNR, DRD and NRM is dependent upon the availability of a ground truth reference image. This limits the applicability of such methods in real world document image processing applications.

This work explores the use of surrogate models to approximate any given document image quality metric. Let $D = \{d_1, \ldots, d_n\}$ be a set of $n$ document images. Let $G = \{g_1, \ldots, g_n\}$ be the corresponding ground truth images. Let $D' = \{d'_1, \ldots, d'_n\}$ be the set of processed images corresponding to $D$, obtained after processing $D$ using an algorithm $A$, for example, a binarization algorithm. It is possible to compute and assign various quality metrics to $D'$ using $G$ as a reference. Let $\mathbf{y}$ be a vector of values computed for any such quality metric $m$ corresponding to $D'$.

Let $d_*$ be a previously unseen test document image, with $d'_*$ being the processed image obtained using a given algorithm $A$. The goal is to learn a surrogate model $\hat{m}$ that can predict the value of the metric $m$ for a given pair $(d_*, d'_*)$. Such a model will enable instant on-the-fly performance feedback for the algorithm $A$ without the availability of the corresponding ground truth $g_*$. The model $\hat{m}$ is, in effect, a surrogate of the quality metric $m$. The following section explores popular document image quality metrics.

III Quality Metrics

The most popularly used document image quality metrics include F-Measure, PSNR, DRD and NRM. These evaluation measures compute the image quality by comparing the document image with the corresponding ground truth reference image [6, 7].

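III-A F-Measure

F-Measure captures accuracy, defined as the weighted harmonic mean of Precision and Recall,

$$FM = \frac{2 \times Recall \times Precision}{Recall + Precision},$$

where $Recall = \frac{TP}{TP + FN}$ and $Precision = \frac{TP}{TP + FP}$. TP, FN and FP denote True Positives, False Negatives and False Positives, respectively.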
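As an illustration, the F-Measure between a binarized image and its ground truth can be sketched in a few lines of NumPy. This is a minimal sketch rather than the evaluation code used in the competitions; it assumes both images are boolean masks in which True marks a foreground (text) pixel, and the function name `f_measure` is chosen here for illustration.

```python
import numpy as np

def f_measure(binarized, ground_truth):
    """F-Measure in [0, 1] between two boolean masks.

    Assumption: True marks a foreground (text) pixel in both arrays.
    """
    b = np.asarray(binarized, dtype=bool)
    g = np.asarray(ground_truth, dtype=bool)
    tp = np.sum(b & g)     # text pixels detected as text
    fp = np.sum(b & ~g)    # background pixels detected as text
    fn = np.sum(~b & g)    # text pixels that were missed
    if tp == 0:
        return 0.0         # no correct detections: recall or precision is zero
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return float(2.0 * recall * precision / (recall + precision))

# Toy 2x4 example: three correct text pixels, one false alarm, one miss.
gt = np.array([[1, 1, 0, 0],
               [1, 1, 0, 0]], dtype=bool)
out = np.array([[1, 1, 1, 0],
                [1, 0, 0, 0]], dtype=bool)
print(f_measure(out, gt))
```

In the toy example both recall and precision equal 3/4, so the F-Measure is also 3/4.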

III-B Peak Signal-to-Noise Ratio (PSNR)

PSNR is a popularly used metric to measure how close an image is to another image. The higher the value of PSNR, the higher the similarity between two images. PSNR is defined via the mean squared error (MSE). Given a noise-free $M \times N$ image $I$ and its noisy approximation $K$, MSE is defined as,

$$MSE = \frac{1}{MN} \sum_{x=1}^{M} \sum_{y=1}^{N} \left( I(x,y) - K(x,y) \right)^2,$$

and PSNR is defined as,

$$PSNR = 10 \log_{10} \left( \frac{C^2}{MSE} \right),$$

where $C$ is the difference between the foreground and the background of the image.
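The MSE and PSNR computations can be sketched as follows. This is a minimal sketch under stated assumptions: the two images have the same shape, the logarithm is taken to base 10, and the parameter `c` (the foreground/background difference) defaults to 1, which fits images with pixel values 0 and 1. The function names are illustrative.

```python
import numpy as np

def mse(reference, candidate):
    """Mean squared error between two images of identical shape."""
    ref = np.asarray(reference, dtype=float)
    cand = np.asarray(candidate, dtype=float)
    return float(np.mean((ref - cand) ** 2))

def psnr(reference, candidate, c=1.0):
    """PSNR in dB; c is the foreground/background difference (assumed 1 for 0/1 images)."""
    err = mse(reference, candidate)
    if err == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(c ** 2 / err))

# One of eight pixels differs, so MSE = 1/8 and PSNR = 10 * log10(8).
ref = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1]])
noisy = np.array([[0, 1, 1, 1],
                  [0, 0, 1, 1]])
print(mse(ref, noisy), psnr(ref, noisy))
```

For 8-bit grayscale images, `c=255` would be the natural choice.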

III-C Distance Reciprocal Distortion metric (DRD)

DRD is used to measure the visual distortion for all the flipped pixels in binary document images [8], and is defined as,

$$DRD = \frac{\sum_{k} DRD_k}{NUBN},$$

where $DRD_k$ is the distortion of the $k$-th flipped pixel, calculated using a $5 \times 5$ normalized weight matrix $W_{Nm}$ as,

$$DRD_k = \sum_{i=-2}^{2} \sum_{j=-2}^{2} \left| GT_k(i,j) - B_k(x,y) \right| \times W_{Nm}(i,j).$$

$DRD_k$ denotes the weighted sum of the pixels in the $5 \times 5$ block of the ground truth $GT$ that differ from the centered $k$-th flipped pixel at $(x,y)$ in the binarized image $B$. NUBN is the number of non-uniform (not all black/white pixels) $8 \times 8$ blocks in the ground truth image.
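A possible NumPy sketch of DRD is given below. It follows the description in [8] as commonly used in the DIBCO evaluations, and rests on several assumptions that are not stated in the text above: a 5x5 weight matrix with reciprocal-distance weights and a zero center, 8x8 blocks for NUBN, images encoded as 0/1 arrays, and edge replication for neighborhoods that extend past the image border. All function names are illustrative.

```python
import numpy as np

def drd_weight_matrix(size=5):
    """Normalized reciprocal-distance weights with a zero center weight."""
    half = size // 2
    ii, jj = np.mgrid[-half:half + 1, -half:half + 1]
    dist = np.sqrt(ii ** 2 + jj ** 2)
    weights = np.zeros((size, size))
    weights[dist > 0] = 1.0 / dist[dist > 0]
    return weights / weights.sum()

def count_nonuniform_blocks(ground_truth, block=8):
    """NUBN: number of block x block tiles that are not all 0 or all 1."""
    gt = np.asarray(ground_truth, dtype=int)
    count = 0
    for r in range(0, gt.shape[0], block):
        for c in range(0, gt.shape[1], block):
            tile = gt[r:r + block, c:c + block]
            if tile.min() != tile.max():
                count += 1
    return count

def drd(binarized, ground_truth):
    """Distance Reciprocal Distortion for 0/1 images of identical shape."""
    b = np.asarray(binarized, dtype=int)
    gt = np.asarray(ground_truth, dtype=int)
    weights = drd_weight_matrix(5)
    # Assumption: neighbors outside the image copy the nearest edge pixel.
    padded = np.pad(gt, 2, mode="edge")
    total = 0.0
    for x, y in zip(*np.nonzero(b != gt)):       # loop over flipped pixels
        window = padded[x:x + 5, y:y + 5]        # 5x5 ground truth block around (x, y)
        total += float(np.sum(np.abs(window - b[x, y]) * weights))
    nubn = count_nonuniform_blocks(gt)
    if nubn == 0:
        raise ValueError("ground truth has no non-uniform blocks")
    return total / nubn

# 16x16 ground truth with text in the 10 left-most columns; one pixel is flipped.
gt = np.zeros((16, 16), dtype=int)
gt[:, :10] = 1
out = gt.copy()
out[5, 3] = 0
print(drd(out, gt))
```

In this toy case the flipped pixel sits inside a uniform text region, so its distortion equals the sum of the normalized weights, and two of the four 8x8 blocks are non-uniform.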

(a) Surrogate model training framework.
(b) Prediction using the trained surrogate model on previously unseen data without access to ground truth images.
Fig. 1: Surrogate modeling framework for learning document image quality metrics.

III-D Negative Rate Metric (NRM)

NRM measures the pixel-wise mismatch rate between the ground truth image and the resultant binarized image. NRM is defined as,

$$NRM = \frac{NR_{FN} + NR_{FP}}{2},$$

where $NR_{FN} = \frac{N_{FN}}{N_{FN} + N_{TP}}$, $NR_{FP} = \frac{N_{FP}}{N_{FP} + N_{TN}}$.

$NR_{FN}$ denotes the false negative rate, $NR_{FP}$ denotes the false positive rate, $N_{TP}$ is the number of true positives, $N_{FP}$ is the number of false positives, $N_{TN}$ is the number of true negatives and $N_{FN}$ is the number of false negatives. The lower the value of NRM, the better the binarized image quality.
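NRM can be sketched in the same style. As before, this is an illustrative sketch that assumes boolean masks with True marking foreground (text) pixels; the guards against empty classes are an added convention, not part of the metric definition.

```python
import numpy as np

def nrm(binarized, ground_truth):
    """Negative Rate Metric: mean of the false negative and false positive rates."""
    b = np.asarray(binarized, dtype=bool)
    g = np.asarray(ground_truth, dtype=bool)
    n_tp = int(np.sum(b & g))
    n_fp = int(np.sum(b & ~g))
    n_tn = int(np.sum(~b & ~g))
    n_fn = int(np.sum(~b & g))
    # Assumption: a rate is taken as 0 when its class is absent from the ground truth.
    nr_fn = n_fn / (n_fn + n_tp) if (n_fn + n_tp) else 0.0
    nr_fp = n_fp / (n_fp + n_tn) if (n_fp + n_tn) else 0.0
    return (nr_fn + nr_fp) / 2.0

# One missed text pixel out of four, one false alarm out of four background pixels.
gt = np.array([[1, 1, 0, 0],
               [1, 1, 0, 0]], dtype=bool)
out = np.array([[1, 1, 1, 0],
                [1, 0, 0, 0]], dtype=bool)
print(nrm(out, gt))
```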

IV Surrogate Models for Learning Document Quality Metrics

Surrogate modeling [17] has emerged as a popular methodology to obtain a fast-to-evaluate approximation of a computationally expensive or data-scarce function. Since the surrogate model allows fast evaluation, it can be used in applications such as optimization, parameter space exploration and sensitivity analysis where a large number of repeated calls to the target function are required.

For example, complex simulation codes are often used during the design process of electronic devices such as antennae, microwave filters, etc. In order to study and test the effect of varying design parameters, repeated calls to simulation codes are made. Each of these calls may take several minutes to evaluate, and this hampers the design space exploration process. Globally accurate surrogate models offer near-instant evaluation and can be used in place of such simulation codes. Obtaining such a surrogate involves preparing training data by evaluating the simulation code on a carefully selected set of parameter combinations or points, which is chosen according to a statistical design or a sampling algorithm [17].

Automated document image processing algorithms that make use of ground truth-based image quality metrics are an excellent use-case for surrogate models. Since ground truth images are scarce, it makes sense to train an accurate surrogate model of a specified image metric using the limited quantity of available ground truth images. The surrogate model can then be used to estimate the value of the image quality metric on-the-fly for any input test image and corresponding processed image.

IV-A Surrogate Model Types

Numerous surrogate model types exist in the literature, with Artificial Neural Networks (ANN), Gaussian Processes (GP) and Support Vector Machines (SVM) being popular [18]. ANNs [19] have shown excellent results in recent years, especially in applications involving visual data and problems involving large training sets. GPs [20] are very popular in design optimization applications and global surrogate modeling owing to their capability of providing the variance of a prediction in addition to the prediction itself. This aids adaptive sampling algorithms in quickly searching for optima within a mathematically principled framework. Support vector regression (SVR) models [21], the regression variant of SVMs, formulate the learning problem as an optimization problem that can be solved in a straightforward manner. SVR models have proven to be robust and stable in a variety of problems, and can deal with both small and large datasets. Consequently, SVR models are a reliable choice for general use in global modeling problems. This work uses ANN, GP and SVR models as surrogates for the purpose of experiments. However, the framework and methodology proposed herein are independent of any particular model type. A detailed discussion of the model types is out of scope in this work, and the reader is referred to [22, 23] for SVR, [20] for GPs and [19] for ANNs.

IV-B Model Training

Let each document image $d_i$ and processed image $d'_i$ be represented as a matrix. The surrogate model learns the mapping $(d_i, d'_i) \mapsto y_i$. The target $y_i$ is the value of the document image quality metric. The metric may also be a user-defined score. Intuitively, the inputs must represent the quantity of change or transformation that the image processing algorithm $A$ has brought about in the original image $d_i$ to obtain $d'_i$. The surrogate must be able to learn the value of a given image quality metric associated with the difference and nature of the transformation from $d_i$ to $d'_i$. This transformation can also be represented as a vector of metrics that represent the differences between $d_i$ and $d'_i$. Possible candidates to measure such a transformation include the metrics explained in Section III, e.g., F-Measure, PSNR, DRD, etc.

Let $m_1, \ldots, m_k$ be $k$ metrics. For any given document image $d_i$ and corresponding processed image $d'_i$, the vector $\mathbf{x}_i$ represents the values of the $k$ metrics as,

$$\mathbf{x}_i = \left[ m_1(d_i, d'_i), \; m_2(d_i, d'_i), \; \ldots, \; m_k(d_i, d'_i) \right].$$

The complete $n \times k$ matrix $X$, whose rows are $\mathbf{x}_1, \ldots, \mathbf{x}_n$, represents the input variables to be learned by the surrogate model. The use of quality metrics as input variables immensely simplifies the learning problem as compared to the case of using raw images as input. The target vector $\mathbf{y}$ simply represents the values of a specific document image quality metric $m$, computed against the ground truth image $g_i$ of each $d_i$ as,

$$\mathbf{y} = \left[ m(d'_1, g_1), \; m(d'_2, g_2), \; \ldots, \; m(d'_n, g_n) \right]^T.$$

The training set for the surrogate model is then $(X, \mathbf{y})$. The framework of the proposed approach is pictorially described in Fig. 1.
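The training procedure can be sketched as follows. This is a minimal sketch and not the authors' implementation: the inputs are metric values computed between each original image and its processed version, the targets are metric values computed between each processed image and its ground truth, and the surrogate is the posterior mean of a GP with a Gaussian kernel. The kernel length scale and noise level are fixed here as an assumption (the experiments in this paper tune GP hyperparameters by maximum likelihood), and `build_training_set`, `GPSurrogate` and the toy metrics are hypothetical names introduced for illustration.

```python
import numpy as np

def build_training_set(images, processed, truths, input_metrics, target_metric):
    """Inputs: metrics between original and processed image. Targets: metric against ground truth."""
    X = np.array([[m(d, p) for m in input_metrics]
                  for d, p in zip(images, processed)], dtype=float)
    y = np.array([target_metric(p, g) for p, g in zip(processed, truths)], dtype=float)
    return X, y

def rbf_kernel(a, b, length_scale):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale ** 2)

class GPSurrogate:
    """GP regression posterior mean with fixed hyperparameters (assumption: no MLE tuning)."""

    def __init__(self, length_scale=1.0, noise=1e-6):
        self.length_scale = length_scale
        self.noise = noise

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        self.shift_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0) + 1e-12           # standardize each input metric
        self.X_ = (X - self.shift_) / self.scale_
        self.y_mean_ = y.mean()
        K = rbf_kernel(self.X_, self.X_, self.length_scale)
        K += self.noise * np.eye(len(K))              # jitter keeps the system solvable
        self.alpha_ = np.linalg.solve(K, y - self.y_mean_)
        return self

    def predict(self, X):
        Xs = (np.asarray(X, dtype=float) - self.shift_) / self.scale_
        return rbf_kernel(Xs, self.X_, self.length_scale) @ self.alpha_ + self.y_mean_

# Toy stand-ins for images and metrics, only to exercise the pipeline.
rng = np.random.default_rng(0)
images = [rng.random((8, 8)) for _ in range(20)]
processed = [(d > 0.5).astype(float) for d in images]
truths = [(d > 0.4).astype(float) for d in images]
input_metrics = [lambda d, p: np.mean((d - p) ** 2),
                 lambda d, p: np.mean(np.abs(d - p))]
target_metric = lambda p, g: np.mean(p == g)          # toy pixel accuracy

X, y = build_training_set(images, processed, truths, input_metrics, target_metric)
model = GPSurrogate().fit(X, y)
print(model.predict(X[:3]), y[:3])
```

At prediction time only `input_metrics` are evaluated, so no ground truth image is needed for a new pair of original and processed images.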

V Experiments

V-A Dataset

The proposed surrogate-based approach is empirically evaluated on the images from seven well-known competition datasets: DIBCO 2009 [6], H-DIBCO 2010 [24], DIBCO 2011 [25], H-DIBCO 2012 [26], DIBCO 2013 [27], H-DIBCO 2014 [28] and H-DIBCO 2016 [7]. These datasets contain machine-printed and handwritten historical document images suffering from various kinds of degradations including stained paper, faded ink or ink bleed-through, wrinkles and unknown graphical symbols. In total there are 86 document images, out of which 63 randomly chosen images are used for training and 23 images for testing. As an example, the framework is applied to perform automatic image binarization using Bayesian optimization as proposed in [15]. The document image quality metrics used as inputs for the surrogate models are PSNR, DRD and NRM. The target image quality metric to be approximated using surrogates is the F-Measure.
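A random split of this kind can be sketched as below. The seed and the function name are assumptions made for reproducibility of the illustration; the actual partition used in the experiments is not specified in the text.

```python
import random

def split_indices(n_images=86, n_train=63, seed=0):
    """Randomly partition image indices into training and test sets."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)   # seed is an arbitrary choice for repeatability
    return sorted(indices[:n_train]), sorted(indices[n_train:])

train_idx, test_idx = split_indices()
print(len(train_idx), len(test_idx))
```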

V-B Experimental Results

The $\epsilon$-SVR variant [23] with the Sequential Minimal Optimization (SMO) [29] solver is used for the following experiments. The hyperparameters of the SVR model are optimized using Bayesian optimization [1]. The GP model uses a Gaussian kernel with hyperparameters being optimized using Maximum-Likelihood Estimation (MLE) [20]. The variant of ANN used is a feed-forward backpropagation neural network [19] trained using the Levenberg-Marquardt algorithm [30].

The error metrics used to test the accuracy of the surrogate models are Root Relative Square Error (RRSE), Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) [31].
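For reference, the three error measures can be written compactly as below. This sketch uses the standard definitions: RRSE normalizes the squared error by that of a predictor that always outputs the mean of the actual values, so a value below 1 indicates an improvement over that baseline. The function names are illustrative.

```python
import numpy as np

def mae(actual, predicted):
    """Mean Absolute Error."""
    a, p = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(p - a)))

def rmse(actual, predicted):
    """Root Mean Square Error."""
    a, p = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((p - a) ** 2)))

def rrse(actual, predicted):
    """Root Relative Square Error, relative to always predicting the mean of actual."""
    a, p = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.sum((p - a) ** 2) / np.sum((a - a.mean()) ** 2)))

actual = [80.0, 90.0, 100.0]
predicted = [82.0, 90.0, 97.0]
print(mae(actual, predicted), rmse(actual, predicted), rrse(actual, predicted))
```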

Model Type   RRSE     MAE      RMSE
ANN          0.8781   2.9980   4.1733
SVR          0.8053   2.7107   3.8272
GP           1.0633   3.3506   25.5363
Ensemble     0.8979   2.9477   5.0533
TABLE I: Error estimates for the surrogate models.

Model Type   Training Time (s)   Prediction Time (s)
TABLE II: Model training and prediction times for a test dataset of document images distinct from the training set.

Table I lists error estimates corresponding to the proposed model training approach described in Section IV-B. All three error measures indicate that ANN and SVR outperform the GP surrogate for the given dataset. The table also contains error estimates corresponding to an ensemble model that simply averages the predictions of the ANN, SVR and GP models. It can be observed from Table I that SVR emerges as the single best performing model type.
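The averaging ensemble is simple enough to sketch in a few lines. The models below are hypothetical stand-ins that return fixed predictions; in practice each would be a trained surrogate exposing a prediction function.

```python
import numpy as np

def ensemble_predict(models, X):
    """Average the predictions of several models, each a callable on the input matrix."""
    predictions = np.stack([np.asarray(m(X), dtype=float) for m in models])
    return predictions.mean(axis=0)

# Hypothetical stand-ins for three trained surrogates predicting on two test inputs.
X_test = np.zeros((2, 3))
models = [lambda X: [80.0, 90.0],
          lambda X: [84.0, 90.0],
          lambda X: [82.0, 96.0]]
print(ensemble_predict(models, X_test))
```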

Table II reports the time in seconds taken to train the surrogate models and the total time taken by the models to predict F-Measure values of unseen test document images. It can be seen that once the model is trained, predictions are made almost instantly. This makes the surrogate model assisted approach well suited for use on-the-fly in image processing algorithms. The time taken for preprocessing and model training is a one-time investment. The relatively high training time for SVR is due to the time taken to optimize hyperparameters using Bayesian optimization. This was to ensure that the hyperparameters are as close to optimal as possible, given a relatively small training set.

Fig. 2: Predicted value of F-Measure by each surrogate model type for test images. Surrogate models are accurate in general except for test instances 2, 10, 18 and 22.

Figure 2 depicts the values of F-Measure predicted by different model types following the proposed model training approach. It can be seen that there are relatively large errors made by all model types for test instance numbers 2, 10 and 18. However, all models have been able to capture the general trend of the test images, except for test instances 2 and 22. Even though the error is large for test instances 10 and 18, the models have been able to learn the 'downward'-leaning behavior of F-Measure therein.

Fig. 3: Test images having high prediction errors.

Figure 3 shows test instances 2, 10 and 18, on which all surrogate models struggled. It can be seen that test image 2 has high variation in image contrast and intensity. Test image 10 suffers from paper wrinkles and fold marks, in addition to pen strokes of varying intensities. Test image 18 also contains variation in pen stroke intensities. Test image 22 (not shown) includes text written with multiple inks. These characteristics are not well-represented in the training set, leading to large errors in the prediction of the corresponding F-Measure. Having a larger training set that captures a wide variation of paper degradations, writing styles, pen stroke intensities, etc. will improve the performance of surrogate models.

Figure 4 shows a sample test document image binarized using the method of [15]. The hyperparameters of the binarization algorithm are optimized using Bayesian optimization [1] as described in [15]. The objective function to be maximized using Bayesian optimization is the F-Measure (as predicted by the SVR surrogate model trained above). The resultant image in Figure 4(b) is clean and validates the accurate modeling of F-Measure by the SVR surrogate.
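The way a surrogate can serve as the objective during hyperparameter tuning is sketched below. To keep the example self-contained it deliberately swaps in simpler components: a global threshold stands in for the binarization algorithm of [15], an exhaustive search over candidate thresholds stands in for Bayesian optimization, and the surrogate is a toy callable rather than a trained SVR. The functions `binarize`, `input_features` and `tune_threshold` are hypothetical names.

```python
import numpy as np

def binarize(image, threshold):
    """Global thresholding: pixels darker than the threshold become foreground (True)."""
    return np.asarray(image, dtype=float) < threshold

def input_features(image, processed):
    """Ground-truth-free inputs comparing the original image with its binarization."""
    img = np.asarray(image, dtype=float)
    rendered = np.where(processed, 0.0, 1.0)   # foreground black (0), background white (1)
    return np.array([np.mean((img - rendered) ** 2), processed.mean()])

def tune_threshold(image, surrogate, candidates):
    """Pick the threshold whose result the surrogate scores highest (grid search stand-in)."""
    best_t, best_score = None, -np.inf
    for t in candidates:
        score = float(surrogate(input_features(image, binarize(image, t))))
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy surrogate (assumption): prefers binarizations that stay close to the original image.
toy_surrogate = lambda x: 1.0 - x[0]

# Grayscale values in [0, 1]: one dark text pixel on a light background.
image = np.array([[0.1, 0.9],
                  [0.9, 0.9]])
best_t, best_score = tune_threshold(image, toy_surrogate, [0.05, 0.5, 0.95])
print(best_t, best_score)
```

No ground truth image appears anywhere in the loop: each candidate is scored from the original image and its processed version alone.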

VI Discussion: Raw Images as Input

The approach discussed herein represents the inputs as image quality metrics measuring the difference between a document image $d_i$ and its processed version $d'_i$. This keeps the learning problem within a handful of input parameters, and allows highly efficient learning and inference. It is also possible to consider the input and processed images themselves as input, without any post-processing to calculate quality metrics. The surrogate model will then learn the mapping $(d_i, d'_i) \mapsto y_i$, where each $d_i$ and $d'_i$ is a matrix and $y_i$ is the value of the quality metric. The representation of inputs as images is an ideal use-case of deep learning inspired surrogate models such as convolutional neural networks (CNNs) [32]. The caveat herein is that the training set must be sufficiently large to allow meaningful learning to proceed.
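One way to prepare raw image pairs for such a model is to stack each original image and its processed version as two channels of a single array. The sketch below only covers this input preparation, not a CNN itself, and makes an assumption the text does not: images of different sizes are padded with a background value to a common size rather than resized. The function names are illustrative.

```python
import numpy as np

def pad_to(image, height, width, fill=1.0):
    """Place the image in the top-left corner of a (height, width) canvas of background."""
    img = np.asarray(image, dtype=np.float32)
    canvas = np.full((height, width), fill, dtype=np.float32)
    canvas[:img.shape[0], :img.shape[1]] = img
    return canvas

def stack_pairs(originals, processed):
    """Return an array of shape (n, 2, H, W): channel 0 is original, channel 1 is processed."""
    height = max(np.shape(img)[0] for img in originals)
    width = max(np.shape(img)[1] for img in originals)
    return np.stack([np.stack([pad_to(d, height, width), pad_to(p, height, width)])
                     for d, p in zip(originals, processed)])

originals = [np.zeros((2, 3)), np.zeros((4, 2))]
processed = [np.ones((2, 3)), np.ones((4, 2))]
batch = stack_pairs(originals, processed)
print(batch.shape)
```

The resulting array uses the channels-first layout expected by several deep learning frameworks; the target for each pair would remain the ground truth based metric value.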

(a) Original document image.
(b) Resultant binarized document image.
Fig. 4: Automatic document image binarization performed by the algorithm described in [15]. Hyperparameters of the binarization algorithm are optimized to maximize the F-Measure approximated using the SVR surrogate model.

VII Conclusion

A novel approach is presented in this paper that uses surrogate models to learn a given document image quality metric. The surrogate model is trained on a dataset comprising inputs that quantify differences in image quality between raw input images and corresponding processed images obtained using an image processing algorithm. The target to be approximated by the surrogate model is the value of a given document image quality metric that is computed for the training set by comparing the processed candidate images to corresponding ground truth images. Post training, the surrogate can be used to quickly predict the value of the document image quality metric for any given test pair of raw and processed document images, without any need for corresponding ground truth images. The methodology is tested on well-known publicly available document image datasets. Experimental evaluation indicates that the surrogate model is able to accurately learn the relationship between differing image quality and corresponding variation in document image quality metric value. Future work includes obtaining and experimenting with larger training sets, and exploring regression convolutional neural networks as surrogate models.


This work was funded by the Göran Gustafsson foundation, the eSSENCE strategic collaboration on eScience, and the Riksbankens Jubileumsfond (Dnr NHS14-2068:1).


  • [1] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian optimization of machine learning algorithms,” in Advances in Neural Information Processing Systems, 2012, pp. 2951–2959.
  • [2] P. Ye and D. Doermann, “Document image quality assessment: A brief survey,” in Document Analysis and Recognition (ICDAR), 2013 12th International Conference on.   IEEE, 2013, pp. 723–727.
  • [3] C. Hale and E. Barney-Smith, “Human image preference and document degradation models,” in Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on, vol. 1.   IEEE, 2007, pp. 257–261.
  • [4] L. Kang, P. Ye, Y. Li, and D. Doermann, “A deep learning approach to document image quality assessment,” in Image Processing (ICIP), 2014 IEEE International Conference on.   IEEE, 2014, pp. 2570–2574.
  • [5] N. Nayef and J.-M. Ogier, “Metric-based no-reference quality assessment of heterogeneous document images,” in SPIE 9402, Document Recognition and Retrieval XXII, 2015, p. 94020L.
  • [6] B. Gatos, K. Ntirogiannis, and I. Pratikakis, “ICDAR 2009 document image binarization contest (DIBCO 2009),” in Document Analysis and Recognition, 2009. ICDAR’09. 10th International Conference on.   IEEE, 2009, pp. 1375–1382.
  • [7] I. Pratikakis, K. Zagoris, G. Barlas, and B. Gatos, “ICFHR2016 handwritten document image binarization contest (H-DIBCO 2016),” in Frontiers in Handwriting Recognition (ICFHR), 2016 15th International Conference on.   IEEE, 2016, pp. 619–623.
  • [8] H. Lu, A. C. Kot, and Y. Q. Shi, “Distance-reciprocal distortion measure for binary document images,” IEEE Signal Processing Letters, vol. 11, no. 2, pp. 228–231, 2004.
  • [9] T. Obafemi-Ajayi and G. Agam, “Character-based automated human perception quality assessment in document images,” IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 42, no. 3, pp. 584–595, 2012.
  • [10] J. Kumar, F. Chen, and D. Doermann, “Sharpness estimation for document and scene images,” in Pattern Recognition (ICPR), 2012 21st International Conference on.   IEEE, 2012, pp. 3292–3295.
  • [11] A. Alaei, D. Conte, and R. Raveaux, “Document image quality assessment based on improved gradient magnitude similarity deviation,” in Document Analysis and Recognition (ICDAR), 2015 13th International Conference on.   IEEE, 2015, pp. 176–180.
  • [12] A. Alaei, D. Conte, M. Blumenstein, and R. Raveaux, “Document image quality assessment based on texture similarity index,” in Document Analysis Systems (DAS), 2016 12th IAPR Workshop on.   IEEE, 2016, pp. 132–137.
  • [13] J. Xu, P. Ye, Q. Li, Y. Liu, and D. Doermann, “No-reference document image quality assessment based on high order image statistics,” in Image Processing (ICIP), 2016 IEEE International Conference on.   IEEE, 2016, pp. 3289–3293.
  • [14] A. P. Giotis, G. Sfikas, B. Gatos, and C. Nikou, “A survey of document image word spotting techniques,” Pattern Recognition, vol. 68, pp. 310–332, 2017.
  • [15] E. Vats, A. Hast, and P. Singh, “Automatic document image binarization using Bayesian optimization,” in Proceedings of the 2017 Workshop on Historical Document Imaging and Processing (In Press).   ACM, 2017.
  • [16] N. R. Howe, “Document binarization with automatic parameter tuning,” International Journal on Document Analysis and Recognition, vol. 16, no. 3, pp. 247–258, 2013.
  • [17] D. Gorissen, I. Couckuyt, P. Demeester, T. Dhaene, and K. Crombecq, “A surrogate modeling and adaptive sampling toolbox for computer based design,” Journal of Machine Learning Research, vol. 11, no. Jul, pp. 2051–2055, 2010.
  • [18] P. Singh, I. Couckuyt, K. Elsayed, D. Deschrijver, and T. Dhaene, “Shape optimization of a cyclone separator using multi-objective surrogate-based optimization,” Applied Mathematical Modelling, vol. 40, no. 5, pp. 4248–4259, 2016.
  • [19] S. S. Haykin, Neural networks and learning machines, 3rd ed.   Upper Saddle River, NJ, USA: Pearson, 2009.
  • [20] C. E. Rasmussen and C. K. Williams, Gaussian processes for machine learning.   Cambridge, MA, USA: MIT Press, 2006.
  • [21] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp. 273–297, 1995.
  • [22] A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,” Statistics and computing, vol. 14, no. 3, pp. 199–222, 2004.
  • [23] D. Basak, S. Pal, and D. C. Patranabis, “Support vector regression,” Neural Information Processing-Letters and Reviews, vol. 11, no. 10, pp. 203–224, 2007.
  • [24] I. Pratikakis, B. Gatos, and K. Ntirogiannis, “H-DIBCO 2010: handwritten document image binarization competition,” in Frontiers in Handwriting Recognition (ICFHR), 2010 International Conference on.   IEEE, 2010, pp. 727–732.
  • [25] I. Pratikakis, B. Gatos, and K. Ntirogiannis, “ICDAR 2011 document image binarization contest (DIBCO 2011),” in Document Analysis and Recognition, 2011. ICDAR’11. 11th International Conference on.   IEEE, 2011, pp. 1506–1510.
  • [26] I. Pratikakis, B. Gatos, and K. Ntirogiannis, “ICFHR 2012 competition on handwritten document image binarization (H-DIBCO 2012),” in Frontiers in Handwriting Recognition (ICFHR), 2012 International Conference on.   IEEE, 2012, pp. 817–822.
  • [27] I. Pratikakis, B. Gatos, and K. Ntirogiannis, “ICDAR 2013 document image binarization contest (DIBCO 2013),” in Document Analysis and Recognition (ICDAR), 2013 12th International Conference on.   IEEE, 2013, pp. 1471–1476.
  • [28] K. Ntirogiannis, B. Gatos, and I. Pratikakis, “ICFHR2014 competition on handwritten document image binarization (H-DIBCO 2014),” in Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on.   IEEE, 2014, pp. 809–813.
  • [29] J. Platt, “Sequential minimal optimization: A fast algorithm for training support vector machines,” 1998.
  • [30] H. B. Demuth, M. H. Beale, O. De Jess, and M. T. Hagan, Neural network design.   Martin Hagan, 2014.
  • [31] M. Graczyk, T. Lasota, and B. Trawiński, “Comparative analysis of premises valuation models using KEEL, RapidMiner, and WEKA,” Computational Collective Intelligence. Semantic Web, Social Networks and Multiagent Systems, pp. 800–812, 2009.
  • [32] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.