HoughNet: neural network architecture for vanishing points detection

Alexander Sheshkus (1,2,3), Anastasia Ingacheva (2,6), Vladimir Arlazarov (2,4) and Dmitry Nikolaev (2,5)
1 Email: astdcall@gmail.com, Phone: +7 (905) 569-85-08
2 LLC Smart Engines Service, 117312, Moscow, Russia
3 Federal Research Center "Computer Science and Control" RAS, 119333, Moscow, Russia
4 Moscow Institute of Physics and Technology (National Research University), 117303, Moscow, Russia
5 Institute for Information Transmission Problems (Kharkevich Institute) RAS, 127051, Moscow, Russia
6 National Research University Higher School of Economics, 101000, Moscow, Russia
Abstract

In this paper we introduce a novel neural network architecture based on a Fast Hough Transform layer. A layer of this type allows our neural network to accumulate features from linear areas across the entire image instead of local areas. We demonstrate its potential by solving the problem of vanishing point detection in images of documents. This problem arises when dealing with camera shots of documents in uncontrolled conditions, where the document image can suffer several specific distortions, including projective transform. To train our model we use the MIDV-500 dataset and provide testing results. The strong generalization ability of the suggested method is proven by applying it to the completely different ICDAR 2011 dewarping contest dataset. Previously published papers considering this dataset measured the quality of vanishing point detection by counting words correctly recognized with the open OCR engine Tesseract. To compare with them, we reproduce this experiment and show that our method outperforms the state-of-the-art result.

Keywords: Vanishing point, Fast Hough Transform, Convolutional Neural Network, Document Rectification, Deep Learning.

I Introduction

The development of mobile photography and the ubiquity of mobile devices capable of producing images of quality acceptable for text recognition have led to various documents being photographed for digital processing. Since these images are taken in uncontrolled conditions, they suffer from various camera-specific distortions, including projective ones. Therefore, we have to solve the task of document rectification prior to any other analysis. The purpose of this step is to find an image transformation which makes the document rectangular and correctly oriented. Even though recognition algorithms which work on distorted images exist [1], mainstream methods include this step since it makes most of the subsequent document understanding simpler. To rectify the document, we have to calculate the parameters of the homography matrix, which is possible if we know the correspondence between four points (two coordinates each) in the original and rectified images. These points can be selected in different ways. One choice is to find the document quadrangle itself using one of the many known methods [2, 3, 4] and then transform it to a rectangular shape (a code sketch of this warp is given after Fig. 1). This approach is not always applicable since the borders of the document may be obscured by irrelevant objects, blend into the background, or even lie outside the frame. The alternative is to find two vanishing points. This method works for documents since they typically contain strings of text in a specific direction, which makes it possible to estimate the horizontal and vertical vanishing points (see Fig. 1). Using these vanishing points one can calculate the homography matrix and rectify the image up to shift and scale.

Fig. 1: Horizontal and vertical vanishing points for the document.
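
To make the quadrangle-based rectification concrete, here is a minimal sketch (Python with OpenCV and numpy; the function name, corner ordering and output size are our own choices, not the paper's code) of warping a known document quadrangle to an upright rectangle:

    import cv2
    import numpy as np

    def rectify(image, quad, out_w=600, out_h=400):
        """Warp a document quadrangle to an upright out_w x out_h rectangle.

        quad: four (x, y) corners in the source image, ordered
        top-left, top-right, bottom-right, bottom-left.
        """
        src = np.asarray(quad, dtype=np.float32)
        dst = np.array([[0, 0], [out_w - 1, 0],
                        [out_w - 1, out_h - 1], [0, out_h - 1]],
                       dtype=np.float32)
        H = cv2.getPerspectiveTransform(src, dst)  # 3x3 homography
        return cv2.warpPerspective(image, H, (out_w, out_h))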

Vanishing point detection is a well-known problem which has been addressed multiple times in multiple contexts. A classical approach is demonstrated in Barnard's work [5]. He uses a Gaussian sphere centered at the origin of coordinates, while the image plane is placed at the focal length distance from the origin. Every point on the image can be mapped to a point on the sphere and then treated as a radius-vector. With this trick, one can map points at infinity into a finite space and deal with them using regular methods. To detect vanishing points, we have to find the intersection points of all lines in the image and then merge them into clusters; all lines belonging to the same cluster represent a pencil of parallel lines in some perspective. This technique is used not only for vanishing point detection: for example, in [6] the authors suggest finding text baselines and symbol skew using clusters of points on the Gaussian sphere. Still, these algorithms have task-specific parameters and therefore lack general robustness. In [7] the Gaussian sphere is also used, but as an input for a deep convolutional neural network for vanishing point detection. The main advantage of this method is that it is possible to generate synthetic training data, since the neural network uses projections on the sphere instead of input images. The drawback of this approach is that line segments in the input image still have to be found somehow.
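
The Gaussian-sphere construction itself amounts to treating an image point as a direction vector and normalizing it, which keeps even points at infinity in a bounded space. A minimal numpy sketch, assuming image coordinates centered at the principal point (all names are ours):

    import numpy as np

    def to_gaussian_sphere(points_xy, f):
        """Map image points (centered at the principal point) to unit
        radius-vectors on the Gaussian sphere, given focal length f."""
        pts = np.asarray(points_xy, dtype=float)
        vecs = np.column_stack([pts[:, 0], pts[:, 1],
                                np.full(len(pts), float(f))])
        return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

    # A vanishing point "at infinity" along direction (dx, dy) maps to the
    # finite sphere point (dx, dy, 0) / ||(dx, dy)||.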

With the evolution and specialization of algorithms, three essentially different settings of the problem have emerged. A vanishing point detection method for road scenes (images taken with a dashboard camera) typically has to find the target point within the image [8], while in the "Manhattan world" [9] three orthogonal vanishing points exist. When dealing with documents, we expect to find two vanishing points located outside of the image. It is, of course, possible to capture the document otherwise, but we do not consider this case, since the distortions are then very strong and the contained data is hardly recognizable even after the rectification procedure. Since we are looking for vanishing points outside the image, it is impossible to use the direct convolutional neural network approach which has become popular in road scene vanishing point detection [10].

Although there are methods for vanishing point detection based on clustering intersection points [11], the baseline method for the task relies on the double Hough transform [12]. This method is simple and clear, but it is not robust to image distortions and corruptions and is applicable only to high-quality input data. Most algorithms for vanishing point detection are based on this approach to some extent. Since the Hough transform is an integral operator on the image, it is worth mentioning that integral operators are used in a wide range of algorithms, from skew angle calculation [13] to image reconstruction in computed tomography [14].

In [15] the authors use a combination of direct and inverse Radon transforms to calculate candidates for vanishing points, and [16] improves this method with a RANSAC scheme. The results are promising, but the method is highly dependent on the amount of text and, more importantly, the vertical vanishing point is less confident, since there are far fewer vertical lines in a document.

In [17] the authors train a recognition neural network using the result of the Hough transform [18] as a feature map. In [19] the authors describe a method for computing the Hough transform with a neural network, and in [20] a Hough voting procedure is used to exploit not only the NN answers but also descriptors from the second-to-last fully connected layer. Paper [21] defines a Fast Hough Transform (FHT) layer as a linear operator for feature space transformation. We develop this idea further and introduce a new neural network architecture based on this type of layer.

II Vanishing points detection using FHT

In this paper, we consider the problem of vanishing point detection in images of documents using the FHT, which has a wide range of applications [22]. These images typically contain two vanishing points: the first one is obtained from the top and bottom borders along with the text strings, while the second one emerges from the left and right borders of the document along with the borders of some elements of the document content. Specifically, we consider only the cases when both vanishing points are located outside the image.

The FHT algorithm calculates four parts of the output image separately. If the input image is a square, these parts correspond to the angle ranges [-45°, 0°), [0°, 45°), [45°, 90°) and [90°, 135°) respectively. We will refer to them as H_v^-, H_v^+, H_h^+ and H_h^-. We will also use H_v and H_h for the vertically joined results of H_v^- with H_v^+ and of H_h^+ with H_h^- respectively. Consecutive FHT applications will be referred to as H_xy, where x and y specify the angle ranges of the first and the second transformations respectively.

Every line in the input image transforms into a specific point in the first FHT image. For lines intersecting in one point, the corresponding points in the first FHT image are collinear, and the line containing all of them transforms into a point in the subsequent FHT image. For a better understanding of the basics of the suggested method, consider an example. In Fig. 2(a) there are four lines which all intersect in one point somewhere far above the image. Calculating H_v from this image, we obtain Fig. 2(b), where every local maximum corresponds to a certain line in the input image. Calculating H_h from the image with these points, we obtain Fig. 2(c): there is only one local maximum, and it represents the intersection point of the lines in the input image (a toy implementation of this double transform is sketched after Fig. 2). In our problem we deal only with vanishing points outside the image; otherwise, we would consider H_v for the second Hough operator.

(a) Input image
(b) First FHT image, H_v
(c) Second FHT image, H_vh
Fig. 2: Hough transform for vanishing point detection.
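
A deliberately naive sketch of this double-transform idea is given below (Python/numpy; brute force over one angle quadrant only, with our own parameterization, not the fast algorithm used in the paper):

    import numpy as np

    def hough_vertical(img):
        """Brute-force Hough for mostly vertical lines: a line is given by
        its top intersection x0 and the shift t = x1 - x0 accumulated over
        the full image height (a slow reference, not the fast FHT)."""
        h, w = img.shape
        out = np.zeros((w, w))              # axes: x0, shift t in [0, w)
        ys = np.arange(h)
        for x0 in range(w):
            for t in range(w):              # wrap-around plays the role of skew
                xs = (x0 + ys * t // max(h - 1, 1)) % w
                out[x0, t] = img[ys, xs].sum()
        return out

    def detect_vp_response(img):
        """Two consecutive transforms: lines -> points -> one point."""
        h1 = hough_vertical(img)
        # The maxima of h1 are collinear for a pencil of lines; a second
        # transform over the (transposed) Hough image turns that line into
        # a single peak whose coordinates encode the vanishing point.
        h2 = hough_vertical(h1.T)
        return np.unravel_index(np.argmax(h2), h2.shape)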

III Suggested approach

Our solution has to work correctly with vanishing points outside the image, including points at infinity. To overcome the difficulties of this case, we suggest a new neural network architecture based on the FHT layer introduced in [21]. After the vanishing points are detected, we perform the rectification procedure as described in [15].

III-A Fast Hough Transform layer

Our neural network architecture is based on the FHT layer, which performs the transformation for a specific angle range. Since this layer is a linear operator, back-propagating the gradient through it does not require any additional effort. We also note that this layer has no trainable coefficients and is needed only for feature map transformation. For H_v^+ (which corresponds to mostly vertical lines in the input image) we can calculate a point (x_f, y_f) from the line segment ((x_0, 0), (x_1, h - 1)), where h is the image height, using the equations:

  x_f = x_0,  y_f = x_1 - x_0 + s(x_0, x_1)    (1)

For H_v^- the line segment is defined as ((x_0, h - 1), (x_1, 0)) and the point can be calculated with:

  x_f = x_0,  y_f = x_1 - x_0 + s'(x_0, x_1)    (2)

In equations (1), (2), s and s' represent image skewing [23] and can be written as:

  s(x_0, x_1) = w if x_1 < x_0, else 0;  s'(x_0, x_1) = w if x_1 > x_0, else 0    (3)

where w is the image width.

This skewing is useful since it makes the feature map continuous. Fig. 3 shows an example: a simple input image (Fig. 3(a)), its transform with zero skew (Fig. 3(b)), and with skew according to equation (3) (Fig. 3(c)); see also the sketch after Fig. 3.

(a) Input image
(b) Without skew
(c) With skew
Fig. 3: Examples of H_v with zero skew and with skew by eq. (3).
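
In code, the skewing of eq. (3) is just a wrap-around of the shift coordinate. A tiny sketch of our reading of eqs. (1) and (3) (names and conventions are ours):

    def skew(x0, x1, w):
        """Wrap-around term of eq. (3): a negative raw shift re-enters the
        Hough image from the other side, keeping the map continuous."""
        return w if x1 < x0 else 0

    def line_to_point(x0, x1, w):
        """Hough-image point for the segment (x0, 0)-(x1, h-1), eq. (1)."""
        return x0, x1 - x0 + skew(x0, x1, w)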

One can also evaluate the correspondence of points and lines between the input and the resulting feature maps. Every point in the input image transforms into a line on each of the H_v^+ and H_v^- parts of the output image. If we have a point (x_p, y_p) in the input image, the corresponding lines on the Hough image can be calculated according to the equations:

  x_f = x_p - y_f · y_p / (h - 1)    (4)

  x_f = x_p + y_f · y_p / (h - 1)    (5)

III-B Vanishing points evaluation

Using equations (1), (2), (3), (4), (5) it is possible to calculate the correspondence between points in the coordinate space of the second transform H_vh and points in the coordinate space of the first transform H_v:

(6)
(7)

Equation (6) corresponds to H_v^+, while equation (7) corresponds to H_v^-.

With known x and y, one can evaluate the point coordinates in the original image coordinate system with the following equations (for H_vh and H_hv respectively):

(8)
(9)

IV HoughNet NN architecture

The main purpose of the proposed neural network architecture is to transform the feature space and allow convolutional layers to operate with linear areas across the image instead of local areas. For the vanishing point detection problem, the motivation for such a transformation is that, in general, the task cannot be solved by operating on local areas only; it becomes solvable once we operate with linear objects and their properties. Hence, a neural network based on the FHT layer is a natural fit for the task.

For the Hough transform it has been proven that there is no single transformation covering the entire range of angles [24]; therefore we build a two-branched neural network (one branch for the vertical and one for the horizontal vanishing point). Every branch consists of three convolutional blocks with two FHT layers between them. Table I contains a detailed description of the layers; both branches are identical except for the fourth layer and have the same number of trainable parameters. A sketch of such an FHT layer is given after Table I.

Our neural network architecture is mostly inspired by the following points:

  • convolutional layers between Hough layers act as peak detectors;

  • line detection with the Hough transform in the presence of noise and outliers in the data implies the usage of convolutions [25];

  • a multichannel Hough map followed by a non-linear function approximator allows the NN to operate not only with the accumulated value along a given line but also with its statistics, for example its dispersion.

#  | Type | Parameters                                                  | Activation function
1  | conv | 12 filters 5×5, stride 1×1, no padding                      |
2  | conv | 12 filters 5×5, stride 2×2, no padding                      |
3  | conv | 12 filters 5×5, stride 1×1, no padding                      |
4  | FHT  | H_v for the vertical branch, H_h for the horizontal branch  | -
5  | conv | 12 filters 3×9, stride 1×1, no padding                      |
6  | conv | 12 filters 3×5, stride 1×1, no padding                      |
7  | conv | 12 filters 3×9, stride 1×1, no padding                      |
8  | conv | 12 filters 3×5, stride 1×1, no padding                      |
9  | FHT  | H_h for both branches                                       | -
10 | conv | 16 filters 5×5, stride 3×3, no padding                      |
11 | conv | 16 filters 5×5, stride 3×3, no padding                      |
12 | conv | 1 filter 5×5, stride 1×1, no padding                        |
TABLE I: HoughNet architecture
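
Below is a sketch of how such a fixed FHT layer could look (PyTorch; a Brady-style recursion for mostly vertical lines with non-negative shifts, our reconstruction rather than the authors' implementation; it assumes a power-of-two height):

    import torch

    class FHTLayer(torch.nn.Module):
        """Fixed (non-trainable) fast Hough transform over mostly vertical
        lines with shifts in [0, n); input (B, C, n, m), n a power of two.
        Column indices wrap around, playing the role of the skew in eq. (3)."""

        def forward(self, x):
            a = x.unsqueeze(3)              # (B, C, blocks=n, shifts=1, m)
            while a.shape[2] > 1:           # merge adjacent block pairs
                top, bot = a[:, :, 0::2], a[:, :, 1::2]
                k = a.shape[3]
                parts = []
                for s in range(2 * k):
                    h, d = s // 2, s - s // 2   # half-shift, bottom offset
                    parts.append(top[:, :, :, h]
                                 + torch.roll(bot[:, :, :, h], -d, dims=-1))
                a = torch.stack(parts, dim=3)
            return a[:, :, 0]               # (B, C, n shifts, m columns)

    # usage: responses = FHTLayer()(torch.rand(1, 12, 64, 64))

Since every step is a sum of (rolled) slices, the layer is linear, has no trainable coefficients, and lets gradients pass through unchanged, matching the properties stated in Sec. III-A.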

The neural network yields two images: one for the vertical and one for the horizontal vanishing point. From every image we take the point with maximum intensity as the answer and transform its coordinates back into the original image coordinate space using equations (8) and (9), combined with the coordinate transforms induced by the convolutional layers.

The neural network is trained by minimizing the distance between the produced and an ideal answer. For the ideal answer we used zero-filled images with a small one-filled rectangle at the position of the correct answer. The convergence rate is very low, and the network had to be trained for a large number of epochs. Even so, the training process took a matter of days on a single-GPU PC, which is acceptable. Moreover, this neural network can be quite universal and will not require retraining from scratch for every new case. In future work, we plan to address this problem and develop a new cost function that reduces the number of required epochs.
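
A sketch of the training target and peak extraction described above (PyTorch; the rectangle half-size, the MSE choice of distance and all names are our assumptions):

    import torch

    def target_map(height, width, cx, cy, half=2):
        """Zero-filled ideal answer with a small one-filled rectangle at the
        ground-truth peak (the rectangle half-size is our assumption)."""
        t = torch.zeros(height, width)
        t[max(cy - half, 0):cy + half + 1,
          max(cx - half, 0):cx + half + 1] = 1.0
        return t

    def peak(response):
        """Coordinates (x, y) of the maximum of a (H, W) response map."""
        iy, ix = divmod(int(response.argmax()), response.shape[1])
        return ix, iy

    # one training step, schematically:
    # loss = torch.nn.functional.mse_loss(branch(images), targets)
    # loss.backward(); optimizer.step()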

V Used datasets

For training and evaluation of our neural network, we use two different open datasets. The first is the identity document dataset MIDV-500 [26]. It consists of images of different documents captured with mobile devices, which therefore exhibit projective distortion (see examples in Fig. 4). The documents are typically made of plastic or hard paper and are therefore planar; in other words, this dataset is well suited for our task. We use the first 30 document types for training and the last 20 types for testing. All images with more than one document quadrangle corner outside the frame were removed, and the remaining images were scaled homothetically to a constant width before applying the NN.

(a)
(b)
Fig. 4: Examples of images from MIDV-500 dataset.

To evaluate against a baseline, we use the second dataset, from the ICDAR 2011 dewarping contest [27]. These images were also scaled to a fixed width before applying the NN. Even though our method addresses only projective distortions, tolerance to other distortions is required for real-world use. This dataset contains distorted images (projective distortion and page curl) of different pages in binary format (see samples in Fig. 5). Even having been trained on a different dataset, our NN still outperforms the state-of-the-art result, which demonstrates the strong generalization ability of the suggested approach.

(a)
(b)
Fig. 5: Examples of images from ICDAR 2011 dataset.

VI Experimental results

Our experiments consisted of two parts. In the first part we applied our neural network to the testing part of the MIDV-500 dataset and evaluated the quality of vanishing point detection: using the two vanishing points we rectify the images and estimate how rectangular the documents become. To do so we compute two distances, D_1 and D_2, according to equations (10) and (11), where N is the number of images, α_ij is the j-th corner angle of the document quadrangle in the i-th image, and β_ij is the deviation of the j-th document edge in the i-th image from its expected (horizontal or vertical) direction:

  D_1 = (1 / 4N) Σ_{i=1..N} Σ_{j=1..4} |α_ij - 90°|    (10)

  D_2 = (1 / 4N) Σ_{i=1..N} Σ_{j=1..4} |β_ij|    (11)

Distance D_1 represents the average deviation of the document quadrangle corners from 90° regardless of its orientation (10). Distance D_2 allows us to estimate how well we managed to correct the orientation of the document (11). Table II presents results for both corrected and original images to underline the impact of the rectification. We also provide results for the training part to show that our neural network does not suffer from overfitting.
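
For reference, these metrics can be computed as follows (Python/numpy; D_2 follows our reading of eq. (11) as the deviation of each edge from the nearest axis direction, and all names are ours):

    import numpy as np

    def corner_angles(quad):
        """Inner angles (degrees) of a quadrangle given as 4 (x, y) corners."""
        q = np.asarray(quad, dtype=float)
        angles = []
        for i in range(4):
            a, b, c = q[i - 1], q[i], q[(i + 1) % 4]
            u, v = a - b, c - b
            cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
            angles.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
        return np.array(angles)

    def d1(quads):
        """Mean absolute deviation of the corner angles from 90 degrees."""
        return float(np.mean([np.abs(corner_angles(q) - 90.0).mean()
                              for q in quads]))

    def d2(quads):
        """Mean deviation of each edge from the nearest axis direction."""
        devs = []
        for q in quads:
            q = np.asarray(q, dtype=float)
            for i in range(4):
                dx, dy = q[(i + 1) % 4] - q[i]
                ang = np.degrees(np.arctan2(dy, dx)) % 90.0
                devs.append(min(ang, 90.0 - ang))
        return float(np.mean(devs))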

Fig. 6: Rectification process steps. The first row – ICDAR 2011 dataset, the second row – MIDV-500 dataset.

The second part of the experiment was performed to show both the high accuracy of our approach in comparison with previously published results and its strong generalization ability. For that purpose, we took the ICDAR 2011 dewarping contest dataset consisting of 100 binary images and measured the quality of recognition by the open OCR engine Tesseract (version 3.02) [28] after the image rectification procedure. We compared our results with [15, 16]: even having been trained on completely different images, our method still outperforms the state-of-the-art result (Table III). This is possible because, although the datasets differ, the relevant features (base and cap lines of text strings, line beginnings and endings, straight strokes of the characters) are similar. Another important point is that since these features are not scale invariant, we have to know the approximate text scale in the dataset and resize the images accordingly.
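
A sketch of the word-counting side of this protocol (Python, assuming the pytesseract wrapper around Tesseract; matching recognized words against ground truth follows [15, 16] and is not reproduced here):

    import pytesseract                     # assumed third-party wrapper
    from PIL import Image

    def recognized_words(image_path):
        """Count non-empty word tokens Tesseract reads from a page."""
        text = pytesseract.image_to_string(Image.open(image_path))
        return len([w for w in text.split() if w.strip()])

    # The contest metric is then the share of ground-truth words that are
    # correctly recognized, accumulated over all 100 pages.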

Fig. 6 illustrates the image rectification process with the proposed neural network. The first column shows the input image, the second column shows the input of the first FHT layer (branch for horizontal vanishing point detection), the third shows the input of the second FHT layer (same branch), and the final column shows the image after rectification using the found vanishing points. We show images from the horizontal branch since the illustrations are clearer due to the higher presence of linear segments. The second and third columns each contain the single most illustrative channel of the multichannel feature map.

    | Train | Train corrected | Test | Test corrected
D_1 | 2.56  | 1.63            | 2.70 | 1.65
D_2 | 5.76  | 0.93            | 4.77 | 0.91
TABLE II: Results on the MIDV-500 dataset
              | Distorted | By [15] | By [16] | By our method
Recognized, % | 31.3      | 49.6    | 50.1    | 59.7
TABLE III: Results on the ICDAR 2011 dewarping contest dataset

VII Conclusion

In this paper, we introduced the new HoughNet architecture based on the FHT layer, which allows convolutional filters to use features from linear areas across the image instead of local areas. The results show very good quality in the task of vanishing point detection in document images. Our method outperforms the state-of-the-art result on the ICDAR 2011 dewarping contest dataset while the neural network was trained on the MIDV-500 dataset, which demonstrates its strong generalization ability and inspires us to develop this topic further towards a solution for other variants of the task. The suggested approach is also robust to the document image origin and to complex backgrounds.

For future work, we plan to expand our solution to all configurations of vanishing points and to improve the convergence rate along with the accuracy. Another direction is to merge the two neural network branches into one to reduce the number of trainable parameters.

Acknowledgment

This work is partially supported by the Russian Foundation for Basic Research (projects 18-29-26027 and 17-29-03161).

References

  • [1] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai, “Robust scene text recognition with automatic rectification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4168–4176.
  • [2] N. Skoryukina, J. Shemiakina, V. L. Arlazarov, and I. Faradjev, “Document localization algorithms based on feature points and straight lines,” in Tenth International Conference on Machine Vision (ICMV 2017), vol. 10696.   International Society for Optics and Photonics, 2018, p. 106961H.
  • [3] Z. Zhang and L.-W. He, “Whiteboard scanning and image enhancement,” Digit. Signal Process., vol. 17, no. 2, pp. 414–432, Mar. 2007. [Online]. Available: http://dx.doi.org/10.1016/j.dsp.2006.05.006
  • [4] A. Hartl and G. Reitmayr, “Rectangular target extraction for mobile augmented reality applications,” 2012, international Conference on Pattern Recognition ; Conference date: 11-11-2012 Through 15-11-2012.
  • [5] S. T. Barnard, “Interpreting perspective images,” Artificial intelligence, vol. 21, no. 4, pp. 435–462, 1983.
  • [6] X.-C. Yin, H.-W. Hao, J. Sun, and S. Naoi, “Robust vanishing point detection for mobilecam-based documents,” in 2011 International Conference on Document Analysis and Recognition.   IEEE, 2011, pp. 136–140.
  • [7] F. Kluger, H. Ackermann, M. Y. Yang, and B. Rosenhahn, “Deep learning for vanishing point detection using an inverse gnomonic projection,” in German Conference on Pattern Recognition.   Springer, 2017, pp. 17–28.
  • [8] S. Lee, J. Kim, J. Shin Yoon, S. Shin, O. Bailo, N. Kim, T.-H. Lee, H. Seok Hong, S.-H. Han, and I. So Kweon, “Vpgnet: Vanishing point guided network for lane and road marking detection and recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1947–1955.
  • [9] M. Zhai, S. Workman, and N. Jacobs, “Detecting vanishing points using global image context in a non-manhattan world,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5657–5665.
  • [10] S. Lee, J. Kim, J. S. Yoon, S. Shin, O. Bailo, N. Kim, T.-H. Lee, H. S. Hong, S.-H. Han, and I. S. Kweon, “Vpgnet: Vanishing point guided network for lane and road marking detection and recognition,” in 2017 IEEE International Conference on Computer Vision (ICCV).   IEEE, 2017, pp. 1965–1973.
  • [11] G. McLean and D. Kotturi, “Vanishing point detection by line clustering,” IEEE Transactions on pattern analysis and machine intelligence, vol. 17, no. 11, pp. 1090–1095, 1995.
  • [12] X. Chen, R. Jia, H. Ren, and Y. Zhang, “A new vanishing point detection algorithm based on hough transform,” in Computational Science and Optimization (CSO), 2010 Third International Joint Conference on, vol. 2.   IEEE, 2010, pp. 440–443.
  • [13] P. Bezmaternykh, D. Nikolaev, and V. Arlazarov, “Textual blocks rectification method based on fast hough transform analysis in identity documents recognition,” in Tenth International Conference on Machine Vision (ICMV 2017), vol. 10696.   International Society for Optics and Photonics, 2018, p. 1069606.
  • [14] A. Kak and M. Slaney, Principles of Computerized Tomographic Imaging.   IEEE Press, 1988.
  • [15] Y. Takezawa, M. Hasegawa, and S. Tabbone, “Camera-captured document image perspective distortion correction using vanishing point detection based on radon transform,” in Pattern Recognition (ICPR), 2016 23rd International Conference on.   IEEE, 2016, pp. 3968–3974.
  • [16] ——, “Robust perspective rectification of camera-captured document images,” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 6.   IEEE, 2017, pp. 27–32.
  • [17] A. Sheshkus, E. Limonova, D. Nikolaev, and V. Krivtsov, “Combining convolutional neural networks and hough transform for classification of images containing lines,” in Ninth International Conference on Machine Vision (ICMV 2016), vol. 10341.   International Society for Optics and Photonics, 2017, p. 103411C.
  • [18] P. Hough, “Method and means for recognizing complex patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 430–438, Dec 1962.
  • [19] M. Köppen, A. Soria-Frisch, and R. Vicente-García, “Neurohough: A neural network for computing the hough transform,” in Artificial Neural Nets and Genetic Algorithms.   Springer, 2001, pp. 197–200.
  • [20] F. Milletari, S.-A. Ahmadi, C. Kroll, A. Plate, V. Rozanski, J. Maiostre, J. Levin, O. Dietrich, B. Ertl-Wagner, K. Bötzel et al., “Hough-cnn: deep learning for segmentation of deep brain regions in mri and ultrasound,” Computer Vision and Image Understanding, vol. 164, pp. 92–102, 2017.
  • [21] A. Sheshkus, A. Ingacheva, and D. Nikolaev, “Vanishing points detection using combination of fast hough transform and deep learning,” in Tenth International Conference on Machine Vision (ICMV 2017), vol. 10696.   International Society for Optics and Photonics, 2018, p. 106960H.
  • [22] D. P. Nikolaev, S. M. Karpenko, I. P. Nikolaev, and P. P. Nikolayev, “Hough transform: underestimated tool in the computer vision field,” in Proceedings of the 22th European Conference on Modelling and Simulation, 2008, pp. 238–246.
  • [23] M. Aliev, E. Ershov, and D. Nikolaev, “On the use of fht, its modification for practical applications and the structure of hough image,” arXiv preprint arXiv:1811.06378, 2018.
  • [24] P. Bhattacharya, A. Rosenfeld, and I. Weiss, “Point-to-line mappings as hough transforms,” Pattern Recognition Letters, vol. 23, no. 14, pp. 1705–1710, 2002.
  • [25] N. Kiryati and A. M. Bruckstein, “Heteroscedastic hough transform (htht): An efficient method for robust line fitting in the ‘errors in the variables’ problem,” Computer Vision and Image Understanding, vol. 78, no. 1, pp. 69–83, 2000.
  • [26] V. V. Arlazarov, K. Bulatov, T. Chernov, and V. L. Arlazarov, “Midv-500: A dataset for identity documents analysis and recognition on mobile devices in video stream,” arXiv preprint arXiv:1807.05786, 2018.
  • [27] H. El Abed, L. Wenyin, and V. Margner, “International conference on document analysis and recognition (icdar 2011)-competitions overview,” in Document Analysis and Recognition (ICDAR), 2011 International Conference on.   IEEE, 2011, pp. 1437–1443.
  • [28] R. Smith, “An overview of the tesseract ocr engine,” in Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on, vol. 2.   IEEE, 2007, pp. 629–633.