Radial Line Fourier Descriptor for Handwritten Word Representation
Automatic recognition of historical handwritten manuscripts is a daunting task due to paper degradation over time. The performance of information retrieval algorithms depends heavily on feature detection and representation methods. Although there exist popular feature descriptors such as Scale Invariant Feature Transform and Speeded Up Robust Features, in order to represent handwritten words in a document, a robust descriptor is required that is not over-precise. This is because handwritten words across different documents are indeed similar, but not identical. Therefore, this paper introduces a Radial Line Fourier (RLF) descriptor for handwritten word feature representation, which is fast to construct and short-length with 32 elements only. The effectiveness of the proposed RLF descriptor is empirically evaluated using the VLFeat benchmarking framework (VLBenchmarks), and for handwritten word image representation using a historical marriage records dataset.
Automatic recognition of heavily degraded handwritten text is a daunting task due to complex layouts and paper degradation over time. Typically, an old manuscript suffers from degradations such as paper stains, faded ink and ink bleed-through. In addition, writing styles vary, and text and symbols may be written in an unknown language. This hampers document readability and makes tasks like word spotting more challenging. Moreover, the performance of information retrieval algorithms, as well as other computer vision applications, depends heavily on the appropriate selection of feature detection and representation methods.
Efforts have been made in the recent past towards research on feature detection and representation. Some popular methods include the Scale Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF) and Histograms of Oriented Gradients (HoG). SIFT and HoG contributed significantly towards the progress of several visual recognition systems in the last decade. In a word spotting scenario, the performance of different features was evaluated using Dynamic Time Warping (DTW) and Hidden Markov Models (HMMs). It was found that local gradient histogram features outperform other geometrical or profile-based features. These methods generally match features from evenly distributed locations over normalised words, where no nearest neighbour search is necessary because each point in a word has its corresponding point in the other word located in the very same position. Recently, a method based on feature matching of keypoints derived from the words was proposed, which requires a nearest neighbour search. In this case, a robust descriptor is required that is not too precise, since the handwritten words are not normalised. This is due to the complex characteristics of handwritten words, unlike simple OCR text. Handwritten words across different documents are indeed similar, but not identical, due to variability in writing styles.
This paper proposes a Radial Line Fourier (RLF) descriptor for handwritten word feature representation. In general, the RLF descriptor is based on the idea of radial line integration. The RLF descriptor is tailor-made for word spotting applications, with fast feature representation and robustness to degradations. However, it can also be used flexibly in other applications for feature representation of challenging images, with promising results. The VLFeat benchmarking framework, called VLBenchmarks, is used to test the descriptor performance. The RLF descriptor is capable of handling viewpoint changes, scale changes to a limited extent, and conditions such as illumination changes, defocus and image compression. This paper evaluates the RLF descriptor on degraded word images and on challenging scene images under varying imaging conditions. In addition, RLF is compared in detail with other popular methods such as SIFT and SURF using VLBenchmarks, in order to demonstrate the effectiveness of the proposed descriptor.
Appropriate selection of interest points and descriptors is indispensable for the performance of a word spotting system. This section discusses some popular interest point detection and feature representation methods with reference to word spotting systems.
2.1 Interest Point Detection
Feature detection, or interest point detection, refers to finding key points in an image that contain crucial information. The selection of interest point detectors has a great impact on the performance of an information retrieval algorithm. Several methods have been suggested in the literature for interest point detection. The Harris corner detector is popularly used for corner point detection. It computes a combination of the eigenvalues of the structure tensor such that the corners are located in an image. The Shi-Tomasi corner detector is a modified version of the Harris detector. The minimum of the two eigenvalues is computed, and a point is considered a corner point if this minimum value exceeds a certain threshold. The FAST detector, based on the SUSAN detector, uses a circular mask to test against the central pixel. MSER detects keypoints such that all pixels inside the extremal region are either darker or brighter than all the outer boundary pixels.
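The difference between the Harris and Shi-Tomasi criteria can be illustrated with a minimal sketch that works on the three entries of a 2x2 structure tensor. The function names and the constant `k = 0.04` (a commonly used empirical value) are assumptions made for illustration; they are not taken from this paper.

```python
import math

def tensor_eigenvalues(sxx, sxy, syy):
    """Eigenvalues of the symmetric structure tensor [[sxx, sxy], [sxy, syy]]."""
    half_trace = 0.5 * (sxx + syy)
    # The discriminant is never negative for a symmetric 2x2 matrix.
    root = math.sqrt((0.5 * (sxx - syy)) ** 2 + sxy ** 2)
    return half_trace + root, half_trace - root

def harris_response(sxx, sxy, syy, k=0.04):
    """Harris measure det(M) - k * trace(M)^2, a combination of both eigenvalues."""
    l1, l2 = tensor_eigenvalues(sxx, sxy, syy)
    return l1 * l2 - k * (l1 + l2) ** 2

def shi_tomasi_response(sxx, sxy, syy):
    """Shi-Tomasi measure: the smaller of the two eigenvalues."""
    return min(tensor_eigenvalues(sxx, sxy, syy))

# Corner-like tensor: both eigenvalues are large.
corner = (2.0, 0.0, 2.0)
# Edge-like tensor: one eigenvalue is large, the other is zero.
edge = (4.0, 0.0, 0.0)
print(harris_response(*corner), shi_tomasi_response(*corner))
print(harris_response(*edge), shi_tomasi_response(*edge))
```

In this toy example the corner-like tensor scores high under both measures, while the edge-like tensor gets a negative Harris response and a zero Shi-Tomasi response, so a threshold on either measure would reject it.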
In general, interest point based feature matching is done using a single type of interest point detector. SIFT and SURF are the most popular ones and capture blob-type features in the image. SIFT uses the Difference of Gaussians (DoG), which computes the difference between Gaussian-blurred images using different values of $\sigma$, where $\sigma$ defines the amount of Gaussian blur from a continuous point of view. SURF computes the Determinant of the Hessian (DoH) matrix, which equals the product of its eigenvalues. Several different combinations of any number of interest point detectors can be chosen depending upon the application. On the other hand, the RLF descriptor proposed in this work is independent of the choice of interest point detector. Any efficient interest point detection method can be flexibly employed with the RLF descriptor.
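For reference, the two blob responses mentioned above can be written in their standard textbook form. The symbols are notation assumed here rather than taken from this paper: $I$ is the image, $G$ a Gaussian kernel, $k$ the ratio between neighbouring scales, $L_{xx}$, $L_{xy}$, $L_{yy}$ the second derivatives of the smoothed image, and $\lambda_1$, $\lambda_2$ the eigenvalues of the Hessian $\mathcal{H}$.

```latex
D(x, y, \sigma) = \bigl( G(x, y, k\sigma) - G(x, y, \sigma) \bigr) * I(x, y)
\qquad
\det \mathcal{H} = L_{xx} L_{yy} - L_{xy}^{2} = \lambda_1 \lambda_2
```

The second identity holds because the determinant of a symmetric matrix equals the product of its eigenvalues.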
After a set of interest points has been detected, a suitable representation of their values has to be defined to allow matching between a query word image and the document image. In general, a feature descriptor is constructed from the pixels in the local neighborhood of each interest point. Fixed-length feature descriptors are most commonly used; they generate a feature vector of fixed length, which can be easily compared using standard distance metrics (e.g. the Euclidean distance). Sometimes, fixed-length feature vectors are computed directly from the extracted features without the need for a learning step.
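As a minimal sketch of how fixed-length descriptors can be compared with the Euclidean distance, the following brute-force nearest neighbour search pairs each query descriptor with its closest target descriptor. The function names, the optional `max_distance` threshold and the toy two-element vectors are hypothetical and only serve as illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbour_matches(query, target, max_distance=float("inf")):
    """For each query descriptor, find the closest target descriptor.

    Returns (query_index, target_index, distance) triples and drops
    pairs whose distance exceeds max_distance.
    """
    matches = []
    if not target:
        return matches
    for qi, q in enumerate(query):
        ti, dist = min(
            ((j, euclidean(q, t)) for j, t in enumerate(target)),
            key=lambda pair: pair[1],
        )
        if dist <= max_distance:
            matches.append((qi, ti, dist))
    return matches

# Toy descriptors with two elements each (real descriptors are longer).
query = [[0.0, 0.0], [1.0, 1.0]]
target = [[1.0, 1.1], [0.0, 0.1], [5.0, 5.0]]
print(nearest_neighbour_matches(query, target))
```

A brute-force search costs one distance computation per query-target pair, which is one reason why a short descriptor speeds up matching.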
Gradient-based feature descriptors tend to be superior, and include the SIFT, HoG and SURF descriptors. The 128-dimensional SIFT descriptor is formed from histograms of local gradients. SIFT is both scale and rotation invariant, and includes a complex underlying framework to ensure this. Similarly, HoG computes a histogram of gradient orientations in a certain local region. However, SIFT and HoG differ in the sense that HoG normalizes the histograms in overlapping blocks, which creates a redundant representation. The SURF descriptor is generally faster than SIFT, and is created by concatenating Haar wavelet responses in sub-regions of an oriented square window. SIFT and SURF are invariant to rotation changes, unlike HoG. There are several variants of these descriptors that have been effectively employed for word spotting.
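The common building block of these gradient-based descriptors, a histogram of gradient orientations, can be sketched as follows. This is a simplified illustration and not the SIFT or HoG implementation: the function name, the central-difference gradients, the magnitude weighting and the choice of 8 bins are assumptions made here.

```python
import math

def orientation_histogram(patch, n_bins=8):
    """Magnitude-weighted histogram of gradient orientations in a patch.

    patch is a list of rows of grey values; border pixels are skipped
    because central differences need both neighbours.
    """
    rows, cols = len(patch), len(patch[0])
    hist = [0.0] * n_bins
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = 0.5 * (patch[y][x + 1] - patch[y][x - 1])
            gy = 0.5 * (patch[y + 1][x] - patch[y - 1][x])
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % (2.0 * math.pi)
            bin_index = int(angle / (2.0 * math.pi) * n_bins) % n_bins
            hist[bin_index] += magnitude
    return hist

# A patch whose intensity grows from left to right: all gradients point
# along the positive x axis, so all mass falls into the first bin.
ramp = [[float(x) for x in range(5)] for _ in range(5)]
print(orientation_histogram(ramp))
```

SIFT concatenates such histograms over a grid of sub-regions, while HoG additionally normalizes them in overlapping blocks, as described above.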
The KAZE detector uses a nonlinear scale space created using nonlinear diffusion filtering instead of Gaussian blurring. An accelerated version, AKAZE, uses a faster method for creating the scale space and a binary descriptor. Many feature descriptors use the local image content in square areas around each interest point to form a feature vector. Both scale and rotation invariance can be obtained in different ways. The Fourier transform has been used to compute descriptors that are illumination and rotation invariant, and scale invariant to a certain extent. In order to overcome dimensionality issues, binary descriptors have been introduced that are faster but less precise, for example the BRISK and FREAK descriptors.
The choice of feature descriptor depends upon the target application. For handwritten word representation in a document, a fast and robust descriptor that is not over-precise, such as the RLF descriptor, is required. The RLF descriptor is discussed in detail as follows.
3 Radial Line Fourier Descriptor
The Radial Line Fourier (RLF) descriptor is inspired by a variant of the Scale Invariant Descriptor (SID), namely SID-Rot. In general, the idea is to perform log-polar sampling in a circular neighborhood around each keypoint. The Fourier transform is then applied over scales, making it rotation sensitive (hence the name SID-Rot). A desirable property is that it becomes less sensitive to scale changes. However, in order to achieve this, the descriptor is dense, with a very large radius and a correspondingly large length.
The RLF descriptor addresses this issue and computes a fast and short-length feature vector of 32 dimensions, to be able to perform quick matching in the nearest neighbour search. The RLF descriptor is formed directly from the interest points extracted, without the need to involve a learning step. It characterizes an image region as a whole using a single feature vector of fixed size.
To begin with, the Fast Fourier Transform (FFT) is simplified, as it is rather slow and requires $O(N \log N)$ computations for a discrete series $f(x)$ with $N$ elements. The modification is such that each element needed is computed directly using the Discrete Fourier Transform (DFT). Therefore, rewriting using Euler’s formula, the computation required is
$$F(u) = \sum_{x=0}^{N-1} f(x) \left[ \cos\left(\frac{2\pi u x}{N}\right) - i \sin\left(\frac{2\pi u x}{N}\right) \right]$$
The value of $u$ determines the frequency used to compute the Fourier element, where $u = 0, 1, \dots, N-1$. Since noise has higher frequencies than the main structures in the image, which have lower frequencies, the second ($u = 1$) and third ($u = 2$) elements of the Fourier transform are selected to form the descriptor. Hence, the algorithm now requires only $O(N)$ computations per element. Note that the DC (zero-frequency) component is obtained for $u = 0$ and is less informative. The trigonometric functions in the DFT do not have to be evaluated at each step, as they can be efficiently computed using a few additions and multiplications by the Chebyshev recurrence relation, just as is done in the case of the FFT.
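A single Fourier element can be computed directly in $O(N)$ operations, with the cosine and sine values generated by the Chebyshev recurrence $\cos((x+1)\theta) = 2\cos\theta\,\cos(x\theta) - \cos((x-1)\theta)$ (and the same form for the sine) instead of a trigonometric call per sample. The following is a minimal sketch under the assumption that the element has the standard DFT form $F(u) = \sum_x f(x)(\cos(2\pi u x/N) - i\sin(2\pi u x/N))$; the function name and the test signal are hypothetical.

```python
import math

def dft_element(samples, u):
    """Direct DFT element F(u) of a real sequence, using O(N) operations.

    Only one cosine and one sine are evaluated; all other values follow
    from the Chebyshev recurrence with a few multiplications and additions.
    """
    n = len(samples)
    theta = 2.0 * math.pi * u / n
    two_cos = 2.0 * math.cos(theta)
    # Seed the recurrence with the values at x = -1 and x = 0.
    c_prev, c_curr = math.cos(theta), 1.0    # cos(-theta), cos(0)
    s_prev, s_curr = -math.sin(theta), 0.0   # sin(-theta), sin(0)
    real = 0.0
    imag = 0.0
    for value in samples:
        real += value * c_curr
        imag -= value * s_curr
        c_prev, c_curr = c_curr, two_cos * c_curr - c_prev
        s_prev, s_curr = s_curr, two_cos * s_curr - s_prev
    return complex(real, imag)

# One full cosine period over 8 samples: all energy sits at u = 1.
signal = [math.cos(2.0 * math.pi * x / 8) for x in range(8)]
for u in (0, 1, 2):
    print(u, abs(dft_element(signal, u)))
```

For long sequences the recurrence accumulates rounding error, so this sketch is best suited to the short sample series considered here.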
The RLF descriptor is constructed by computing the amplitude of each selected Fourier element $F(u)$, where $u$ is the frequency index:
$$|F(u)| = \sqrt{\operatorname{Re}(F(u))^2 + \operatorname{Im}(F(u))^2}$$
Forming the descriptor using only $u = 1$ already works well and yields a very short descriptor. However, adding a second element for $u = 2$ noticeably improves the quality of the subsequent matching, even though the feature vector becomes twice as long. The advantage is that the more accurate descriptor makes it possible to sample in a smaller neighborhood while still obtaining the same number of corresponding matches. Adding a third element for $u = 3$ did not improve the accuracy significantly, and was found not to be worth the extra computational effort.
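The construction can be sketched as follows, assuming that the intensities sampled along each radial line form one discrete series and that the amplitudes of the second and third Fourier elements of every line are concatenated. The function names, the ordering of the elements and the toy sample values are assumptions made for illustration; the paper does not specify this layout.

```python
import cmath
import math

def fourier_amplitude(samples, u):
    """Amplitude |F(u)| of the DFT element at frequency index u."""
    n = len(samples)
    coeff = sum(
        value * cmath.exp(-2j * math.pi * u * x / n)
        for x, value in enumerate(samples)
    )
    return abs(coeff)

def rlf_like_descriptor(radial_lines, frequencies=(1, 2)):
    """Concatenate the amplitudes of the chosen frequencies for every line.

    radial_lines holds one list of sampled intensities per radial line,
    so 16 lines and two frequencies give 32 elements.
    """
    descriptor = []
    for line in radial_lines:
        for u in frequencies:
            descriptor.append(fourier_amplitude(line, u))
    return descriptor

# Toy input: 16 radial lines with 8 samples each (values are arbitrary).
lines = [[float((3 * a + x * x) % 7) for x in range(8)] for a in range(16)]
descriptor = rlf_like_descriptor(lines)
print(len(descriptor))
```

Because the zero-frequency element is left out, adding a constant brightness offset to all samples of a line leaves these amplitudes unchanged, which is one way to read the robustness to illumination changes.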
When sampling is done in a log-polar fashion, some kind of interpolation is required as coordinates seldom are in pixel centres. One could for instance use bilinear interpolation to achieve higher accuracy. However, interpolation in a 3x3 neighborhood using a Gaussian is chosen instead.
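A minimal sketch of log-polar sampling with Gaussian-weighted interpolation in a 3x3 neighborhood is given below. The function names, the Gaussian width `sigma`, the radius range and the numbers of lines and samples are hypothetical choices made for illustration, not values from this paper.

```python
import math

def gaussian_sample(image, x, y, sigma=0.8):
    """Gaussian-weighted average of the 3x3 pixels around position (x, y).

    image is a list of rows; pixels outside the image are ignored.
    """
    rows, cols = len(image), len(image[0])
    cx, cy = int(round(x)), int(round(y))
    total = 0.0
    weight_sum = 0.0
    for j in range(cy - 1, cy + 2):
        for i in range(cx - 1, cx + 2):
            if 0 <= j < rows and 0 <= i < cols:
                dist2 = (i - x) ** 2 + (j - y) ** 2
                w = math.exp(-dist2 / (2.0 * sigma ** 2))
                total += w * image[j][i]
                weight_sum += w
    return total / weight_sum if weight_sum > 0 else 0.0

def log_polar_points(cx, cy, n_lines=16, n_samples=8, r_min=1.0, r_max=16.0):
    """Sample positions on radial lines with logarithmically spaced radii."""
    lines = []
    for a in range(n_lines):
        angle = 2.0 * math.pi * a / n_lines
        line = []
        for s in range(n_samples):
            radius = r_min * (r_max / r_min) ** (s / (n_samples - 1))
            line.append((cx + radius * math.cos(angle),
                         cy + radius * math.sin(angle)))
        lines.append(line)
    return lines

# Sample a small synthetic image around its centre.
image = [[float(x + y) for x in range(41)] for y in range(41)]
samples = [[gaussian_sample(image, px, py) for px, py in line]
           for line in log_polar_points(20.0, 20.0)]
print(len(samples), len(samples[0]))
```

Each inner list of `samples` is one radial series of intensities, which is the kind of input a per-line Fourier analysis would operate on.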
The RLF descriptor is illumination and rotation invariant, and also scale invariant to a limited extent. These are important characteristics that a feature vector must possess in order to handle different kinds of words with varying size, shape, slanted characters, etc. The RLF descriptor is resistant to high-frequency changes, such as those due to residuals from neighboring words, as it is based on the low-frequency content in the local neighborhood. Likewise, it is insensitive to small differences in form and shape as long as they are more or less the same, i.e. as long as the low frequencies are sufficiently similar. The RLF descriptor presents a clear advantage over other feature representations, as is experimentally validated in the next section.
This section describes the datasets used in the experiments and empirically evaluates the proposed RLF descriptor.
In order to evaluate the performance of the RLF descriptor in representing a word in a degraded historical document, a subset of the Barcelona Historical Handwritten Marriages (BH2M) database, i.e. the Esposalles dataset, is used for the experiments. The Esposalles dataset consists of handwritten marriage records stored in the archives of Barcelona cathedral, written between 1617 and 1619 by a single writer in old Catalan. In total, there are 174 handwritten pages corresponding to volume 69. For the experiments, 50 pages from the 17th century, written by a single author, are selected.
4.2 VGG Affine Dataset
The VGG Affine dataset consists of a set of test images under varying imaging conditions. It comprises eight scene types, i.e. graffiti, wall, boat, bark, bikes, trees, ubc and leuven, where each category contains images with different conditions. It is employed in the VLBenchmarks framework for testing image feature detectors and descriptors. This dataset helps in testing the performance of the RLF descriptor on a variety of test images, and in comparing it with other feature descriptors such as SIFT and SURF.
The descriptor is evaluated under five different imaging conditions: viewpoint angle change, scale change, image blur, JPEG compression and illumination change. For viewpoint change, scale change and blur, the same change in imaging conditions is applied to two different scene categories. This means that the effect of varying the imaging conditions can be separated from the effect of varying the scene type. The structured scene category consists of homogeneous regions with distinctive edge boundaries (e.g. graffiti and buildings), and the textured scene category consists of repeated textures of different forms.
In the test for viewpoint angle change, the camera varies from frontal parallel view to significant foreshortening at approximately 60 degrees to the camera. The illumination variations are introduced by changing the camera aperture. The scale change is acquired by varying the camera zoom and it changes by about a factor of four. The blur sequences are acquired by varying the camera focus. The JPEG compression sequence is generated using a standard xv image browser with the image quality parameter varying from 40% to 2%. Each of the test sequences contains six images with a gradual geometric or photometric transformation. All images are of medium resolution (approximately 800 x 640 pixels).
To evaluate the performance of the RLF descriptor for word image representation, the number of matched feature points and the number of inliers are calculated. Let $N_m$ denote the number of matched points and $N_i$ the number of inliers, i.e. true matches; the inlier ratio is then defined as:
$$\text{inlier ratio} = \frac{N_i}{N_m}$$
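The inlier ratio described above, i.e. the number of inliers divided by the number of matched points, can be computed as in the following minimal sketch. The function name, the handling of the case with no matches and the example counts are assumptions for illustration and are not results from the experiments.

```python
def inlier_ratio(num_inliers, num_matches):
    """Fraction of matched points that are inliers (1.0 is the ideal case)."""
    if num_inliers < 0 or num_inliers > num_matches:
        raise ValueError("inliers must lie between 0 and the number of matches")
    if num_matches == 0:
        # Assumption: no matches at all is reported as a ratio of zero.
        return 0.0
    return num_inliers / num_matches

# Hypothetical counts: 18 of 24 matched points survive outlier rejection.
print(inlier_ratio(18, 24))
```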
In the ideal case, the inlier ratio should be 1. In the first set of experiments, the RLF descriptor is compared with the HoG descriptor, as it is the most widely used in word spotting applications. For the same query word, the number of matching interest points found using the RLF and HoG descriptors is calculated and quantitatively evaluated in Table ?. In order to perform word matching and find all inliers, a simple preconditioner-based clustering method is employed. Figure 1 presents sample results obtained using the RLF and HoG feature descriptors for the two variants of the example query word reberé. The matching keypoints (i.e. inliers) are shown in green, and the matches discarded by the preconditioner (i.e. outliers) are shown in red. The number of inliers found using the RLF descriptor is clearly larger than the number found using the HoG descriptor, as quantified in Table ?. The matching algorithm divides the word into three parts in order to avoid mismatching the same letter occurring in several places. The table shows that HoG produces noticeably fewer matching points, which causes the algorithm to miss some words. In the experiments, three occurrences were found in the search using RLF, while some were missed by HoG, and in one case none was found. This could partly be solved by relaxing the threshold, but then some incorrect words are found instead. This shows the advantage of RLF over HoG: it is less precise, in the sense that the neighborhoods forming the descriptors can be similar without being identical, yet it is robust enough not to cause too many mismatches, i.e. it yields a high inlier ratio.
In the next set of experiments, the performance of the RLF descriptor is evaluated using the VLBenchmarks framework. Table ? and Table ? present the number of matches and match scores, respectively, obtained using six different combinations of feature detectors and descriptors, i.e. SIFT descriptor, 64-dimensional RLF descriptor with SIFT keypoints (SIFT+RLF64 with 32 radial lines), 32-dimensional RLF descriptor with SIFT keypoints (SIFT+RLF32 with 16 radial lines), SURF descriptor, 64-dimensional RLF descriptor with SURF keypoints (SURF+RLF64) and 32-dimensional RLF descriptor with SURF keypoints (SURF+RLF32). The varying imaging conditions taken into account include viewpoint angle change, JPEG compression, decreasing light (illumination changes) and increasing blur (defocus). The RLF descriptor is scale invariant to a certain extent, and will be further investigated in future work.
Figure 2 and Figure 3 graphically illustrate the matching performance in terms of matching scores and the number of matches obtained using different combinations of feature detectors and descriptors with varying viewpoint angles. In Figure 2, it can be seen that the RLF descriptor performs fairly well under viewpoint angle changes in comparison with the SIFT descriptor. Figure 3 highlights the performance of the RLF descriptor under viewpoint change in comparison with the SURF descriptor. The RLF descriptor clearly performs better than the SURF descriptor (length 64) in terms of matching scores and the number of matches obtained.
Furthermore, tests are conducted using MSER and DoG keypoint detectors with different feature descriptor combinations, i.e. SIFT, RLF64, RLF32 and SURF. Figure 4 graphically presents the matching performance comparison between MSER keypoints represented using SIFT, RLF and SURF descriptors, and DoG keypoints represented using the same set of descriptors, under varying viewpoint conditions. Table ? quantitatively evaluates the performance of the various descriptors and presents the results obtained using MSER and DoG keypoints with these descriptors in challenging imaging conditions. It can be seen that RLF outperforms SURF in nearly all the categories and varying conditions. In comparison with SIFT, however, RLF performs comparably but not always better, depending upon the input images; improving this is left as future work.
This paper presented a fast and robust Radial Line Fourier descriptor. State-of-the-art feature descriptors such as SIFT, HoG and SURF include a rather complicated framework that is not needed in all applications, such as word spotting. Therefore, a much simpler yet effective descriptor is proposed that is not too precise for handwritten word representation. The novelty of the proposed descriptor lies in its lower computation time, shorter feature vector, invariance to rotation, robustness to viewpoint angle changes and, to a certain extent, scale changes, and robustness to other conditions such as illumination changes, defocus and image compression. The experimental results on a historical marriage records dataset and on the VGG Affine dataset demonstrate the effectiveness of the proposed descriptor for handwritten word image representation and for test scene images from the VLBenchmarks framework. As future work, the ideas presented herein will be scaled up to aid word feature representation for heavily degraded archival databases.
- A. P. Giotis, G. Sfikas, B. Gatos, and C. Nikou, “A survey of document image word spotting techniques,” Pattern Recognition, vol. 68, pp. 310–332, 2017.
- D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
- H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (surf),” Computer vision and image understanding, vol. 110, no. 3, pp. 346–359, 2008.
- N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
- J. A. Rodriguez and F. Perronnin, “Local gradient histogram features for word spotting in unconstrained handwritten documents,” Frontiers in Handwriting Recognition (ICFHR), 1st International Conference on, pp. 7–12, 2008.
- J. A. Rodríguez-Serrano and F. Perronnin, “Handwritten word-spotting using hidden markov models and universal vocabularies,” Pattern Recognition, vol. 42, no. 9, pp. 2106–2116, 2009.
- A. Papandreou and B. Gatos, “Slant estimation and core-region detection for handwritten latin words,” Pattern Recognition Letters, vol. 35, pp. 16–22, 2014.
- A. Hast and A. Fornés, “A segmentation-free handwritten word spotting approach by relaxed feature matching,” in Document Analysis Systems (DAS), 2016 12th IAPR Workshop on. IEEE, 2016, pp. 150–155.
- S. Leutenegger, M. Chli, and R. Y. Siegwart, “Brisk: Binary robust invariant scalable keypoints,” in Proceedings of the 2011 International Conference on Computer Vision, ser. ICCV ’11. IEEE Computer Society, 2011, pp. 2548–2555.
- C. Schmid, R. Mohr, and C. Bauckhage, “Evaluation of interest point detectors,” International Journal of Computer Vision, vol. 37, no. 2, pp. 151–172, Jun. 2000.
- M. Zuliani, C. Kenney, and B. S. Manjunath, “A mathematical comparison of point detectors,” in 2nd IEEE Image and Video Registration Workshop, Jun 2004, pp. 172–178.
- T. Tuytelaars and K. Mikolajczyk, “Local invariant feature detectors: A survey,” Foundations and Trends in Computer Graphics and Vision, vol. 3, no. 3, pp. 177–280, Jul. 2008.
- C. Harris and M. Stephens, “A combined corner and edge detector,” in Proceedings of The Fourth Alvey Vision Conference, 1988, pp. 147–151.
- J. Shi and C. Tomasi, “Good features to track,” in Computer Vision and Pattern Recognition, 1994. Proceedings CVPR, 1994 IEEE Computer Society Conference on. IEEE, Jun. 1994, pp. 593–600.
- E. Rosten and T. Drummond, “Machine learning for high-speed corner detection,” in Proceedings of the 9th European Conference on Computer Vision - Volume Part I, ser. ECCV’06. Berlin, Heidelberg: Springer-Verlag, 2006, pp. 430–443.
- S. M. Smith and J. M. Brady, “Susan - a new approach to low level image processing,” Int. J. Comput. Vision, vol. 23, no. 1, pp. 45–78, May 1997.
- J. Matas, O. Chum, M. Urban, and T. Pajdla, “Robust wide baseline stereo from maximally stable extremal regions,” in Proceedings of the British Machine Vision Conference. BMVA Press, 2002, pp. 36.1–36.10.
- K. Terasawa and Y. Tanaka, “Slit style hog feature for document image word spotting,” in Document Analysis and Recognition, 2009. ICDAR’09. 10th International Conference on. IEEE, 2009, pp. 116–120.
- P. F. Alcantarilla, A. Bartoli, and A. J. Davison, “Kaze features,” in Proceedings of the 12th European Conference on Computer Vision - Volume Part VI, ser. ECCV’12. Berlin, Heidelberg: Springer-Verlag, 2012, pp. 214–227.
- P. F. Alcantarilla, J. Nuevo, and A. Bartoli, “Fast explicit diffusion for accelerated features in nonlinear scale spaces,” in Proceedings of the British Machine Vision Conference, 2013.
- S. Gauglitz, T. Höllerer, and M. Turk, “Evaluation of interest point detectors and feature descriptors for visual tracking,” International journal of computer vision, vol. 94, no. 3, pp. 335–360, 2011.
- G. Carneiro and A. D. Jepson, “Phase-based local features,” in European Conference on Computer Vision (ECCV), 2002, pp. 282–296.
- ——, “Multi-scale phase-based local features,” in Proceedings of the 2003 IEEE computer society conference on Computer vision and pattern recognition, ser. CVPR’03, 2003, pp. 736–743.
- I. Ulusoy and E. R. Hancock, “A statistical approach to sparse multi-scale phase-based stereo,” Pattern Recogn., vol. 40, no. 9, pp. 2504–2520, Sep. 2007.
- A. Alahi, R. Ortiz, and P. Vandergheynst, “Freak: Fast retina keypoint,” in Computer Vision and Pattern Recognition, 2012 IEEE Conference on, June 2012, pp. 510–517.
- I. Kokkinos and A. L. Yuille, “Scale invariance without scale selection,” in Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
- E. Trulls, I. Kokkinos, A. Sanfeliu, and F. Moreno-Noguer, “Dense segmentation-aware descriptors,” in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, ser. CVPR ’13, 2013, pp. 2890–2897.
- I. Kokkinos, M. Bronstein, and A. Yuille, “Dense Scale Invariant Descriptors for Images and Surfaces,” INRIA, Research Report RR-7914, Mar. 2012.
- R. L. Burden and J. D. Faires, Numerical Analysis. Brooks/Cole, Thomson Learning, 2001.
- T. Barrera, A. Hast, and E. Bengtsson, “Incremental spherical linear interpolation,” in Sigrad 2004, 2004, pp. 7–10.
- D. Fernández-Mota, J. Almazán, N. Cirera, A. Fornés, and J. Lladós, “Bh2m: The barcelona historical, handwritten marriages database,” in Pattern Recognition (ICPR), 2014 22nd International Conference on. IEEE, 2014, pp. 256–261.
- V. Romero, A. Fornés, N. Serrano, J. A. Sánchez, A. H. Toselli, V. Frinken, E. Vidal, and J. Lladós, “The esposalles database: An ancient marriage license corpus for off-line handwriting recognition,” Pattern Recognition, vol. 46, no. 6, pp. 1658–1669, 2013.
- K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool, “A comparison of affine region detectors,” International journal of computer vision, vol. 65, no. 1-2, pp. 43–72, 2005.