Improving Word Recognition using Multiple Hypotheses and Deep Embeddings

Abstract

We propose a novel scheme for improving word recognition accuracy using word image embeddings. We use a trained text recognizer, which can predict multiple text hypotheses for a given word image. Our fusion scheme improves the recognition process by utilizing the word image and text embeddings obtained from a trained word image embedding network. We propose EmbedNet, which is trained using a triplet loss for learning a suitable embedding space where the embedding of the word image lies closer to the embedding of the corresponding text transcription. The updated embedding space thus helps in choosing the correct prediction with higher confidence. To further improve the accuracy, we propose a plug-and-play module called Confidence based Accuracy Booster (cab). The cab module takes in the confidence scores obtained from the text recognizer and the Euclidean distances between the embeddings to generate an updated distance vector. The updated distance vector has lower distance values for the correct words and higher distance values for the incorrect words. We systematically evaluate our proposed method on a collection of books in the Hindi language. Our method achieves an absolute improvement of around 10% in terms of word recognition accuracy.

Index Terms—Word recognition, word image embedding, EmbedNet

I Introduction

The task of word recognition involves converting the text in an image to a machine-readable format. Word recognition is an important use case of computer vision that finds various applications in digitizing old books, making self-driving cars understand signboard instructions, and creating assistive applications for people with special needs. All these tasks rely on accurate word recognition that is robust to extreme variations in lighting conditions, fonts, sizes and overall typography. To ensure the availability of the word recognizer to a broader audience, it should also be able to function for various languages and have low computational costs.

Fig. 1: Word recognition based methods struggle with new and rare characters and with incorrectly segmented word images, and they provide low recall. However, these methods excel at differentiating between visually similar characters, whereas the reverse holds for methods using word image embeddings. We aim to use the methods proposed in this work, EmbedNet and cab, for exploiting the complementary properties of these methods. Diagram best viewed in color.

In this work, we focus on improving word recognition for the Hindi language, which is agglutinative and inflectional. Hindi contains vowels and consonants, and a horizontal line, referred to as the Shirorekha, runs across the words. If a consonant is followed by a vowel, the shape of the consonant is modified; such characters are referred to as vowel modifiers. A compound character is formed when a consonant follows one or more consonants. Due to these modifiers and compound characters, the number of distinct shapes in Hindi is far greater than in Latin scripts [1]. This makes word recognition for Hindi difficult, and hence it is necessary to devise more intricate techniques that improve word recognition accuracy for the Hindi language.

Traditionally, word recognition methods fall under two major categories: (a) methods directly converting a word image to its textual transcription [2, 3, 4, 5, 6], and (b) methods converting word images to embeddings and then performing recognition using these embeddings [7, 8, 9, 10]. In this work, we refer to methods in category (a) as word recognition and methods in category (b) as word image embedding, respectively. We assume that the word images are already segmented. A word recognition based method aims at directly converting the word image to its corresponding textual transcription. Despite the wide availability of high-grade open-source ocr engines [11], using them with degraded images from historical documents is difficult. On the other hand, word image embedding methods focus on converting the word image and the corresponding text to a holistic representation where a word image and its corresponding text lie close to one another. After projecting these images and texts to the learned representation space, they can be compared using an appropriate distance metric, and recognition can be performed restricted to a lexicon [7].

Word recognition methods (ocr) perform reasonably well when the text present in the image is reasonably clean. However, if the ocr encounters a rare character or an image with higher degradation, it struggles to generate the correct prediction. In such cases, word image embedding methods prove to be more useful. The reason is that they do not identify each character but instead focus on converting the word image to an embedding/representation where words with visually similar characters lie close in the embedding space, resulting in better predictions in challenging situations. However, word image embedding methods find it difficult to distinguish between two different words with an approximately similar set of characters, a task at which word recognition based approaches excel. Also, word recognition methods provide high recall, whereas word image embedding methods provide high precision [12]. Inaccurate word segmentation degrades performance in word recognition based methods [13]. Inaccurate segmentation, however, does not hinder the performance of word image embedding, as a slightly degraded word (due to a cut) still lies close to its textual transcription's embedding. In Fig. 1, we show how we propose to use the complementary properties of both methods for improving word recognition.

Designing a pipeline that can exploit the complementary properties of word recognition and word image embedding methods can further enhance word recognition. In our previous work [14], we propose to use the complementary information of both methods to create a more reliable and robust word recognition algorithm. We propose to fuse multiple hypotheses generated by the word recognizer with the embeddings generated from the End2End network ('E2E') [15] to make use of the complementary information. Using the beam search decoding algorithm, we produce multiple predictions for a word image from a ctc [16] based word recognizer. We show that as the number of predictions increases, the word recognition accuracy increases. Even though we proposed multiple rule-based methods for using this information and improving word recognition accuracy, we do not explore learning-based techniques in [14].

In this work, we improve upon the methods presented in [14] and propose EmbedNet and a novel plug-and-play module called Confidence based Accuracy Booster (cab) for improving word recognition. Fig. 3 presents the flowchart of the entire process which includes EmbedNet, cab, and their roles in the word recognition pipeline. Here, EmbedNet attempts to learn an updated Euclidean space where the embeddings of the word image and its correct textual transcription lie closer together, while the incorrect ones lie farther away. The cab boosts the word recognition accuracy by using the updated representation made available by EmbedNet. For accelerating future research, we release the code and models used in this work on our webpage1.

II Related Works

In this work, we are interested in devising deep learning methods for fusing existing methods from the word recognition and word image embedding realms, also referred to as text recognition and word spotting, respectively. This section reviews the previous work done in these domains.

II-A Text Recognition

A typical text recognizer involves a feature extractor for the input image containing text and a sequential encoder for learning the temporal information. Modern text recognizers use a Convolutional Neural Network (cnn) as the feature extractor and a Recurrent Neural Network (rnn) as the sequential encoder, which helps in modeling the text recognition problem as a Seq2Seq problem. Architectures using both a cnn and an rnn for this purpose are called Convolutional Recurrent Neural Networks (crnn) [2]. Previous works have used a wide range of recurrent networks for encoding the temporal information. Adak et al. [6] perform sequential classification using an rnn, whereas Garain et al. [5] use a Bi-directional Long Short-Term Memory (blstm) network for sequential classification with the Connectionist Temporal Classification (ctc) loss [16]. Sun et al. [4] propose to use a multi-directional lstm as the recurrent unit, whereas Chen et al. [3] use a Separable Multi-Dimensional Long Short-Term Memory for the same. These methods attempt to address word recognition by directly converting an input document to machine-readable text.

II-B Word Spotting

Word spotting [17] is an alternative to word recognition in which we formulate a matching task. The fundamental problem in word spotting is learning an appropriate feature representation for word images that is suitable for matching within collections of document images. In this paper, we consider the word-level segmentation to be available a priori, and thereby limit our discussion to works in the domain of segmentation-based word spotting. An initial method [18] represents the word image using profile features and then uses different distance metrics for comparing them. Other works use handcrafted features [19] and Bag of Visual Words [20] for word spotting. Most of these early representations were learned in an unsupervised way. Later methods drifted towards learning in a supervised setting and presented more robust representation schemes. One of the classical methods in this space is from Almazan et al. [21], which introduced an attribute framework referred to as the Pyramidal Histogram of Characters (phoc) for representing both images and text. More recently, various deep learning based approaches [22, 23] have improved word spotting. VGGNet [24] was adopted by Poznanski et al. [25] for recognising phoc attributes. Many other methods in the word spotting domain successfully explored using phoc as the embedding space through different cnn architectures [7, 8, 9, 10]. In the Indian language document community, methods like [26, 27, 20] attempt word spotting on Indian texts.

In this work, we propose to combine the cnn-rnn architecture proposed in [28] with the embeddings generated from the E2E network proposed in [15] using learning-based methods. By combining two different approaches, we aim to assimilate the best attributes of both methods.

III Methods for improving word recognition

In this section, we elaborate on the proposed method using EmbedNet and cab. The section is organised as follows: Sections III-A and III-B briefly describe the crnn [28] and the End2End network [15], respectively. In Section III-C, we motivate and describe EmbedNet, and Section III-D proposes a novel Confidence based Accuracy Booster (cab), a plug-and-play module for boosting the word recognition accuracy.

III-A Word Recognition

We use a standard cnn-rnn (crnn) hybrid architecture that was proposed in [28]. The network converts the textual contents of an image to textual transcriptions. Fig. 3 shows the crnn architecture used in our work. It consists of a spatial transformer layer (stn) followed by residual convolutional blocks. These blocks are responsible for learning a sequence of feature maps using ResNet18 [29]. The feature sequences serve as input to a stacked bi-directional long short-term memory (blstm) network. Connectionist Temporal Classification (ctc) [16] is then used for decoding the target label sequence over all the frames.
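For concreteness, the snippet below is a minimal sketch of such a cnn-rnn-ctc recognizer: ResNet18 features are collapsed along the height and read as a width-wise frame sequence by a blstm, whose per-frame logits are trained with the ctc loss. The layer sizes, the character-set size, and the omission of the stn are our own assumptions for illustration, not the exact configuration of [28].

import torch
import torch.nn as nn
from torchvision.models import resnet18

class CRNNSketch(nn.Module):
    """Illustrative CNN-RNN hybrid: ResNet18 features -> BLSTM -> per-frame
    character logits trained with CTC. All sizes are assumed placeholders."""
    def __init__(self, num_chars=110, hidden=256):
        super().__init__()
        backbone = resnet18()                       # randomly initialised
        # keep the convolutional stages only; drop average pooling and classifier
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.rnn = nn.LSTM(512, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_chars + 1)   # +1 for the CTC blank

    def forward(self, images):                      # images: (B, 3, H, W)
        f = self.features(images)                   # (B, 512, H', W')
        f = f.mean(dim=2)                           # collapse height -> (B, 512, W')
        f = f.permute(0, 2, 1)                      # frames along width -> (B, W', 512)
        out, _ = self.rnn(f)
        return self.classifier(out)                 # (B, W', num_chars + 1)

# training step with the CTC loss (dummy shapes, purely illustrative)
model = CRNNSketch()
log_probs = model(torch.randn(2, 3, 64, 256)).log_softmax(-1)   # (B, T, C)
targets = torch.randint(1, 110, (2, 6))                         # label indices
ctc = nn.CTCLoss(blank=110)
loss = ctc(log_probs.permute(1, 0, 2), targets,                 # (T, B, C)
           input_lengths=torch.full((2,), log_probs.size(1)),
           target_lengths=torch.full((2,), 6))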

III-B Word Spotting

Fig. 3 shows the End2End network proposed in [15], which learns the textual and visual word image embeddings. The network consists of two major input streams: real and label. In the real stream, a ResNet34 generates the features for the real word images. The label stream is further divided into (a) a synthetic image stream and (b) a text stream. Features of the synthetic image are extracted with the help of a shallow cnn, while the textual features are generated using a phoc extractor. The features from these two streams are appended, treated as a conditional label, and merged using a fully connected network that preserves information from both modalities. After this, the embedding layer projects the embeddings from the real and label streams to a common subspace.
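The following is a very rough sketch of this two-stream idea: a real stream built on ResNet34, a label stream that fuses shallow-cnn features of a synthetic rendering with a phoc vector, and a shared embedding layer projecting both to a common subspace. The layer sizes, the phoc dimensionality, and the fusion details are assumptions made for illustration; they do not reproduce the exact E2E network of [15].

import torch
import torch.nn as nn
from torchvision.models import resnet34

class E2ESketch(nn.Module):
    """Rough two-stream embedding sketch: real images vs. (synthetic image +
    PHOC) labels, both projected to a common subspace. Sizes are assumptions."""
    def __init__(self, phoc_dim=604, embed_dim=2048):
        super().__init__()
        real = resnet34()                                   # randomly initialised
        real.fc = nn.Linear(real.fc.in_features, embed_dim)
        self.real_stream = real
        self.synth_cnn = nn.Sequential(                     # shallow CNN for synthetic images
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.label_fc = nn.Sequential(                      # merge synthetic and PHOC features
            nn.Linear(64 + phoc_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim))
        self.embedding = nn.Linear(embed_dim, embed_dim)    # common subspace projection

    def embed_image(self, real_img):
        return self.embedding(self.real_stream(real_img))

    def embed_label(self, synth_img, phoc):
        fused = torch.cat([self.synth_cnn(synth_img), phoc], dim=1)
        return self.embedding(self.label_fc(fused))

e2e = E2ESketch()
img_emb = e2e.embed_image(torch.randn(2, 3, 64, 256))   # word image embeddings
txt_emb = e2e.embed_label(torch.randn(2, 3, 64, 256), torch.rand(2, 604))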

III-C EmbedNet

In this work, we propose EmbedNet for projecting the embeddings to an updated embedding space and cab for boosting the word recognition accuracy. We use a set of word images for which we want to obtain the textual transcriptions. As shown in Fig. 3, these images are passed through the real stream of the E2E network to generate the word image embeddings. The predictions generated for each word image are converted to text embeddings using the E2E network's label stream. We use a fixed number of predictions per word image throughout the paper, unless otherwise specified.

Fig. 2: During training, the EmbedNet takes in a word image's embedding, the correct text's embedding, and an incorrect text's embedding, one at a time. The corresponding output embeddings are passed through the triplet loss for training. Once trained, it takes in the word image and text embeddings and generates their updated counterparts. Tuples underneath the blocks represent the input and output sizes of the corresponding block. See text for notation.
Fig. 3: For generating the textual transcription, we pass the word image through the crnn [28] and the End2End network ('E2E') [15] simultaneously. The crnn generates multiple textual transcriptions for the input image, whereas the E2E network generates the word image's embedding. The textual transcriptions generated by the crnn are passed through the E2E network to generate their embeddings. We pass these embeddings through the EmbedNet proposed in this work. The EmbedNet projects the input embeddings to an updated Euclidean space, from which we obtain the updated word image embedding and transcriptions' embeddings. We calculate the Euclidean distance between the word image embedding and each of the textual transcriptions' embeddings. We then pass the distance values through the novel Confidence based Accuracy Booster (cab), which uses them and the confidence scores from the crnn to generate an updated list of Euclidean distances, which helps in selecting the correct prediction. Diagram best viewed in color.

We generate the updated word image and text embeddings by providing the original embeddings as inputs to the EmbedNet. Fig. 2 shows the EmbedNet architecture; it projects the input embeddings to an updated space using a linear input layer, a hidden layer, and a linear output layer. We add a PReLU activation function after each layer, which introduces non-linearity to the model. L2 normalization is performed on the final layer's output to project the embedding onto a hyper-sphere. We train the EmbedNet with early stopping acting as a regularizer and use the Adam optimizer with a constant learning rate.
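A minimal PyTorch sketch of this architecture is given below: three linear layers, each followed by a PReLU, with the output L2-normalised onto the unit hyper-sphere. The 2048/1024 dimensions are assumed placeholders for the E2E embedding size, not the exact values used in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbedNet(nn.Module):
    """Sketch of EmbedNet: linear layers with PReLU, then L2 normalisation."""
    def __init__(self, in_dim=2048, hidden_dim=1024, out_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.PReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.PReLU(),
            nn.Linear(hidden_dim, out_dim), nn.PReLU(),
        )

    def forward(self, x):
        # project to the updated space and place the result on the unit hyper-sphere
        return F.normalize(self.net(x), p=2, dim=-1)

embed_net = EmbedNet()
updated = embed_net(torch.randn(4, 2048))   # e.g. a batch of E2E embeddings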

Let EmbedNet be a function that maps an input embedding to a compact Euclidean space in which the embedding of the correct text lies closer to the word image's embedding and the embeddings of incorrect texts lie farther away. We achieve this compact Euclidean space by training the EmbedNet using triplet pairs, as originally proposed in [30]. Each triplet consists of three embeddings. The first is the embedding of a word image from the train set for which we want to generate the textual transcription; we refer to it as the anchor, denoted $a$. The second is the embedding of the correct textual transcription; we call it the positive, denoted $p$. The third is the embedding of an incorrect textual transcription; we refer to it as the negative, denoted $n$. The anchor is sampled from the word image embeddings, while the positives and negatives are sampled from the text embeddings.

The triplets are further classified into:

Hard Negatives

Equation 1 shows the condition for hard negatives: the Euclidean distance between the anchor and the positive is greater than the distance between the anchor and the negative. Due to this, they contribute the most while training the EmbedNet.

$\|a - p\|_2^2 > \|a - n\|_2^2$   (1)

Semi-hard Negatives

Equation 2 defines the condition for semi-hard negatives; it relies on the margin $m$. Here, the Euclidean distance between the anchor and the negative is greater than the distance between the anchor and the positive, but smaller than that distance plus the margin.

$\|a - p\|_2^2 < \|a - n\|_2^2 < \|a - p\|_2^2 + m$   (2)

Easy Negatives

Equation 3 shows the condition for easy negatives: the Euclidean distance between the anchor and the positive, even after adding the margin, is smaller than the distance between the anchor and the negative.

$\|a - p\|_2^2 + m < \|a - n\|_2^2$   (3)

Easy negatives do not contribute while training the EmbedNet, as the condition that the distance between the anchor and the positive be smaller than the distance between the anchor and the negative is already satisfied by more than the margin, and the loss for them is zero. We train the EmbedNet using the triplet loss, defined as:

$L = \max\left(\|a - p\|_2^2 - \|a - n\|_2^2 + m,\ 0\right)$   (4)

here $a$, $p$, and $n$ are the anchor, positive, and negative embeddings, respectively, and $m$ is the margin.

The triplet pairs are updated after every epoch. For updating, we pass the word image and text embeddings through the EmbedNet and identify the anchors, positives, and negatives. These are then further divided into hard negatives, semi-hard negatives, and easy negatives using Equations 1, 2, and 3, respectively.
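A short sketch of this mining step and of the loss in Equation 4 is shown below; the margin value and embedding size are assumed placeholders.

import torch
import torch.nn.functional as F

def categorize_and_loss(a, p, n, margin=0.2):
    """Classify triplets as in Eqs. (1)-(3) and compute the loss of Eq. (4).
    a, p, n are batches of anchor, positive, and negative EmbedNet outputs."""
    d_ap = (a - p).pow(2).sum(dim=-1)                     # squared Euclidean distances
    d_an = (a - n).pow(2).sum(dim=-1)
    hard = d_ap > d_an                                    # Eq. (1)
    semi_hard = (d_ap < d_an) & (d_an < d_ap + margin)    # Eq. (2)
    easy = d_ap + margin < d_an                           # Eq. (3)
    loss = F.relu(d_ap - d_an + margin).mean()            # Eq. (4)
    return loss, hard, semi_hard, easy

# only hard and semi-hard triplets are kept for the next training epoch
loss, hard, semi_hard, easy = categorize_and_loss(
    torch.randn(8, 2048), torch.randn(8, 2048), torch.randn(8, 2048))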

After training the EmbedNet, we generate the word image and text embeddings for the test set and pass them through the EmbedNet to obtain their updated counterparts; these updated embeddings help in selecting the correct predictions with much higher confidence. The reason is that, in the updated embedding space, the correct text's embedding lies closer to the input word image's embedding, and the wrong texts' embeddings lie farther away from it. For predicting the text in a given word image, we query the updated text embeddings using the updated word image embedding to generate a ranked list of predictions in increasing order of Euclidean distance. We consider the word with the least Euclidean distance as the new prediction.
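This ranking step amounts to a nearest-neighbour search over the hypotheses' updated embeddings, as in the sketch below; the embedding size and the number of hypotheses (20) are illustrative assumptions.

import torch

def rank_hypotheses(word_image_emb, text_embs):
    """Rank hypothesis embeddings by Euclidean distance to the word image
    embedding; the first index of the returned order is the prediction."""
    dists = torch.cdist(word_image_emb.unsqueeze(0), text_embs).squeeze(0)
    order = torch.argsort(dists)
    return order, dists

order, dists = rank_hypotheses(torch.randn(2048), torch.randn(20, 2048))
best = order[0]   # index of the predicted transcription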

III-D Confidence based Accuracy Booster (CAB)

Fig. 4: The Confidence based Accuracy Booster (cab) takes in the confidence scores and the Euclidean distances and generates the updated distances. The number shown in red indicates the word with the lowest Euclidean distance from the word image embedding. The number circled in orange shows the correct word, which ideally should have the lowest Euclidean distance. According to the original Euclidean distances, the predicted word is at position 2, whereas the correct word is actually at position 1 (circled in orange). At position 1 in the confidence scores vector, we have a high confidence value. After using cab, we get updated distances where the distance for the word at position 1 is the least. Therefore, cab helps in incorporating the confidence scores and getting reliable distance values. Diagram best viewed in color.

As shown in Fig. 4, cab uses a vector of confidence scores, which are a measure of the crnn's confidence in each particular prediction. The authors of [14] add these confidence scores to the Euclidean distances to improve the word recognition accuracy. We improve on this and introduce a novel Confidence based Accuracy Booster (cab) as a plug-and-play module. Mathematically, cab can be defined as:

$\phi(C, D) = \beta\,(\mathbf{1} - C) \oplus \gamma\,D$   (5)

here, $\phi$ is the cab function, $C$ denotes the vector of confidence scores, $D$ denotes the vector of Euclidean distances, $\mathbf{1}$ denotes a vector of ones of the same length, $\beta$ is the boost coefficient, $\gamma$ is the distance coefficient, and $\oplus$ is the element-wise addition operation. $\beta$ and $\gamma$ are fixed to constant values. The cab takes in $C$ and $D$ and generates an updated list of distances in which the embeddings of words with higher confidence scores have smaller distance values from the word image's embedding. We fix $\beta$ and $\gamma$ to the values that achieve the highest word recognition accuracy on the validation set, unless otherwise stated.

A primary motivation behind creating and using cab is to incorporate the confidence scores generated by the crnn. As the number of hypotheses increases, noise in the predictions increases, which leads to lower confidence scores. Thus, by updating the distance values using the confidence scores, we can filter out the noisy predictions and select more relevant ones.
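A small sketch of one plausible instantiation of this module is given below; the exact values of the boost and distance coefficients are assumptions, since in the paper they are selected on the validation set.

import numpy as np

def cab(confidences, distances, beta=1.0, gamma=1.0):
    """Confidence based Accuracy Booster sketch: hypotheses with higher CRNN
    confidence get a smaller updated distance. beta and gamma are placeholders."""
    confidences = np.asarray(confidences, dtype=float)
    distances = np.asarray(distances, dtype=float)
    return beta * (np.ones_like(confidences) - confidences) + gamma * distances

conf = [0.9, 0.4, 0.1]        # CRNN confidences for three hypotheses
dist = [0.35, 0.30, 0.80]     # Euclidean distances to the word image embedding
updated = cab(conf, dist)     # -> [0.45, 0.90, 1.70]
prediction = int(np.argmin(updated))   # the high-confidence hypothesis now wins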

IV Experiments and Results

IV-A Dataset and evaluation metric details

Language Annotated # Pages # Word Images
Hindi Yes Train Validation Test
TABLE I: The dataset consists of pages sampled from books in the Hindi language. The pages are annotated at the word level. The annotated words are further divided into train, validation, and test sets.

We perform all the experiments on books in the Hindi language, sampled and annotated from the dli [31] collection. These books come from different periods and contain a variety of fonts, font sizes, and a few degraded pages. As summarised in Table I, the sampled books consist of annotated pages and words. We further divide these words into train, validation, and test sets for training and evaluating the EmbedNet. We use a pre-trained word recognizer (crnn [28]) and a pre-trained End2End ('E2E') network for all the experiments. We report the word recognition accuracy (wra) for all the experiments performed. wra is defined as

$\text{wra} = \frac{n_c}{N} \times 100$   (6)

where $n_c$ represents the number of correctly recognised words and $N$ is the total number of words. The wra for methods using multiple hypotheses is calculated after generating the re-ranked list of predictions. This list is arranged in increasing order of Euclidean distance with respect to the query. The word at the first position of the list is used for calculating the wra.
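As a small worked example of Equation 6, the helper below computes wra from the top-ranked predictions; the two sample words are illustrative only.

def word_recognition_accuracy(predictions, ground_truths):
    """WRA as in Eq. (6): percentage of words whose top-ranked prediction
    matches the ground-truth transcription."""
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return 100.0 * correct / len(ground_truths)

print(word_recognition_accuracy(["शब्द", "पुस्तक"], ["शब्द", "किताब"]))   # 50.0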

# triplets
# hard negatives # semi-hard negatives # easy negatives
TABLE II: Summary of the number of triplet pairs generated for different values of the margin $m$; we further classify them into hard, semi-hard, and easy negatives. We use hard and semi-hard negatives for training the EmbedNet. The number of hard, semi-hard, and easy negatives changes as the training progresses.

As described in Section III-C, we generate the triplets and categorize them for training the EmbedNet. Table II summarises the number of hard, semi-hard, and easy negatives for different margins ($m$). With an increase in the margin, we observe an increase in the number of semi-hard negatives and a decrease in easy negatives. It is beneficial to have a larger value of $m$, as it maximizes the number of hard negative and semi-hard negative samples. Easy negatives do not contribute to EmbedNet's training, so we do not use them for training the network. While training the EmbedNet, the number of hard, semi-hard, and easy negatives changes. Initially, as the network starts from a random initialization, the number of easy negatives is the smallest, while the numbers of hard and semi-hard negatives are larger. As the training progresses, the number of easy negatives starts to increase while the other two categories decrease.

IV-B Selection of the best value for the margin

Sr. No. Highest wra (at )
1. ()
2. ()
3. ()
4. ()
TABLE III: EmbedNet's performance for various values of the margin $m$.

We train and validate multiple EmbedNets for different values of the margin $m$ using the train and validation sets defined in Table I. The aim here is to select the best value of $m$. For that, we perform the experiments on four different values of $m$. The results are reported in Table III. The EmbedNet with the largest margin has the highest wra on the validation set as compared to lower values of $m$. The reason for this is that a small value of $m$ yields a low count of semi-hard negatives (Table II), which results in fewer triplets for training the EmbedNet. For the rest of the paper, we use this best-performing value of $m$ unless otherwise stated.

IV-C Results and Comparison with various methods

This section presents the baseline methods used for assessing the improvement obtained by using the EmbedNet with and without cab. We also compare the wra of the baselines and the methods proposed in this work. The baseline methods are:

Open-source OCR

For the first baseline, we use a pre-trained open-source word recognizer: Tesseract [11]. The motive here is to compare against an ocr that is not trained on noisy document images.

Crnn

The second baseline score shows the performance of the crnn [28] trained using the best path decoding algorithm. It was trained on the dataset defined in [14]. Here, we generate a single prediction for each test image.

E2E+C

We generate the third baseline score using the method proposed in [14]; for that, we use the embeddings generated from the E2E network and the multiple hypotheses generated from the crnn. For calculating the wra of a given word image, we perform a nearest neighbour search over the hypotheses' embeddings using the word image's embedding and add the confidence information to the distances obtained from the search; this provides us with a re-ranked list, from which we consider the word with the least Euclidean distance as the new prediction. We refer to this method as 'E2E+C'.

Fig. 5: Comparison between the wra for E2E+C, mlp, and EmbedNet with and without cab. For the experiments not using cab, the wra first increases and then starts to decrease. The reason for this trend is that, as the number of hypotheses increases, the noise in the crnn's predictions increases, leading to lower wra. However, using cab helps avoid this issue, as it uses the confidence scores from the crnn, which decrease as the noise increases. We achieve the highest wra using EmbedNet + cab.

Multi-Layered Perceptron

For calculating the last baseline score, we train a Multi-Layered Perceptron (mlp) on the train data defined in Table I. The mlp projects the word image embedding to an updated embedding space in which the Euclidean distance between the projected word image embedding and the correct text's embedding is smaller than the distance to the incorrect ones. The mlp consists of three linear layers, and the ReLU activation function follows each layer to introduce non-linearity. Mean Squared Error (mse) is used as the loss function for training the mlp. We train the network with early stopping acting as a regularizer and use the Adam optimizer with a constant learning rate. For calculating the word recognition accuracy, we query the text embeddings using the projected word image embedding to get a ranked list of predictions in increasing order of Euclidean distance. We consider the word with the minimum Euclidean distance as the new prediction.
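The following sketch illustrates this baseline; the layer sizes and learning rate are assumed placeholders rather than the exact configuration.

import torch
import torch.nn as nn

class MLPBaseline(nn.Module):
    """Three linear layers with ReLU, trained with MSE to pull the word image
    embedding towards its correct text embedding. Sizes are assumptions."""
    def __init__(self, dim=2048, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

mlp = MLPBaseline()
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-3)   # assumed learning rate
criterion = nn.MSELoss()

word_image_emb = torch.randn(4, 2048)      # from the E2E real stream
correct_text_emb = torch.randn(4, 2048)    # from the E2E label stream
optimizer.zero_grad()
loss = criterion(mlp(word_image_emb), correct_text_emb)
loss.backward()
optimizer.step()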

Sr. No. Method wra
1. Tesseract [11]
2. crnn [28]
3. E2E+C [14]
4. E2E+C + cab (ours)
5. mlp (ours)
6. EmbedNet (ours)
7. mlp + cab (ours)
8. EmbedNet + cab (ours)
TABLE IV: Comparison between the wra of all the baselines and the methods proposed in this work. For each method, we report the number of hypotheses at which it achieves the highest wra and, in parentheses, the maximum number of hypotheses used in that experiment.

Table IV contrasts the wra of all the baselines with the methods proposed in this work. We observe the lowest wra for the methods not using multiple hypotheses, i.e., methods that generate only a single prediction per word image. As the open-source ocr [11] is not trained on the noisy documents that we are using, it performs the worst; this shows that the data we are using contains highly degraded word images which are difficult to recognise. On the other hand, we train a crnn [28] on the train split defined in Table I and achieve a higher wra on the test set.

We observe an improved wra for the methods using multiple hypotheses. We observe the wra plateauing on the validation data for larger numbers of hypotheses; due to this, we limit the maximum number of hypotheses used in all the experiments. E2E+C achieves its maximum wra at a small number of hypotheses (two), and, as we observe in Fig. 5, the wra begins to decrease as more hypotheses are added. So, when using E2E+C, one cannot use more than two hypotheses, making it impractical to use. We add cab to E2E+C and observe a performance gain and more consistent wra values for larger numbers of hypotheses; using E2E+C + cab, we achieve the highest wra for this baseline at a larger number of hypotheses. Fig. 5 shows the change in wra as the number of hypotheses increases; we observe a steady increase at first, after which the wra starts to decrease. However, this decrease in the wra is very small compared to E2E+C without cab. Even at the largest number of hypotheses, E2E+C + cab achieves a higher wra than E2E+C at its best. The reason for such stability is the usage of the confidence scores. As the number of hypotheses increases, the noise present in the ocr's predictions also increases, leading to lower confidence scores for the noisy predictions. cab uses this fact and results in better and more consistent wra.

We observe an improvement in the wra for mlp and EmbedNet without cab as compared to E2E+C, crnn, and Tesseract [11]. However, as observed in the case of E2E+C and shown in Fig. 5, the wra starts to decrease as the number of hypotheses is increased, making them impractical to use. Upon using cab with mlp and EmbedNet, we observe large gains in the wra for larger numbers of hypotheses, as shown in Fig. 5. mlp + cab attains a higher maximum wra than mlp without cab. As pointed out in Section III-C, EmbedNet not only helps in bringing the correct text's embedding closer to the word image's embedding but also pushes the incorrect embeddings farther away. cab utilizes this fact, and EmbedNet + cab obtains a wra higher than that of E2E+C without cab.

Hence, by using cab, we observe substantial gains in wra as the number of hypotheses increases, and we also see a steadier wra across all settings; this enables us to freely choose the number of hypotheses without any loss of wra, which was not possible while using E2E+C, mlp, and EmbedNet without cab.

IV-D Qualitative Results

Fig. 6: Qualitative results on randomly chosen word images after processing using mlp and EmbedNet with and without cab. Diagram best viewed in color.

Fig. 6 shows qualitative results on some randomly chosen words. The words in Fig. 6 (a) and (b) are long and contain half consonants. Both of these words are recognised perfectly by mlp and EmbedNet with and without cab. However, the words in Fig. 6 (c) and (d) contain rare characters and are distorted; due to this, mlp (with and without cab) and EmbedNet fail to predict the correct word. EmbedNet + cab performs well in these cases and is able to predict the correct word. This shows the ability of EmbedNet to use the complementary information provided by word recognition and word image embedding methods.

IV-E Computational Costs

Mode Process Average time (in milliseconds) Dependent on
Offline Text from crnn
Text embeddings’ generation
Image embeddings’ generation
Online EmbedNet pass Network’s size
mlp pass Network’s size
wra calculation
wra calculation with cab
TABLE V: Time taken by various processes in the word recognition pipeline. The time for processes that depend on the number of hypotheses is calculated at a fixed number of hypotheses. Values are reported for a single word's image/text.

Table V shows the time taken by various processes in the entire pipeline. All the experiments are performed on Intel Xeon E5-2640 v4 processors and an NVIDIA GeForce GTX 1080 Ti gpu. For calculating the time taken, we run the experiments multiple times and average the time across all the runs. The process of calculating the word accuracies for all numbers of hypotheses is parallelizable, which reduces the time taken.

The majority of our experiments run in one of two modes. The first is the offline mode, which involves computations required only once. It includes the time taken in generating the ocr output for all the word images and the time taken by the End2End network in generating the word image and text embeddings. The second is the online mode, which includes computations that are required every time we calculate the wra. It includes the time taken in passing the embeddings through the mlp and the EmbedNet, as well as the time taken in calculating the wra with and without cab.

V Conclusion

To summarise, in this work, we aim at fusing the word recognition and word image embedding approaches for word recognition. To achieve this, we propose EmbedNet, which helps in learning an updated Euclidean space. We also propose cab for exploiting the updated Euclidean space and boosting the wra by approximately 10%. We show that learning-based approaches for fusion give more promising results than rule-based fusion. As a future task, we plan to develop an end-to-end architecture capable of fusing word recognition and word image embedding approaches.

Footnotes

  1. http://cvit.iiit.ac.in/research/projects/cvit-projects/word-recognition

References

  1. K. Dutta, P. Krishnan, M. Mathew, and C. V. Jawahar, “Towards Accurate Handwritten Word Recognition for Hindi and Bangla,” in Computer Vision, Pattern Recognition, Image Processing, and Graphics, 2018.
  2. B. Shi, X. Bai, and C. Yao, “An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  3. Z. Chen, Y. Wu, F. Yin, and C. Liu, “Simultaneous script identification and handwriting recognition via multi-task learning of recurrent neural networks,” in International Conference on Document Analysis and Recognition (ICDAR), 2017.
  4. Z. Sun, L. Jin, Z. Xie, Z. Feng, and S. Zhang, “Convolutional multi-directional recurrent network for offline handwritten text recognition,” in International Conference on Frontiers in Handwriting Recognition (ICFHR), 2016.
  5. U. Garain, L. Mioulet, B. B. Chaudhuri, C. Chatelain, and T. Paquet, “Unconstrained Bengali handwriting recognition with recurrent models,” in International Conference on Document Analysis and Recognition (ICDAR), 2015.
  6. C. Adak, B. B. Chaudhuri, and M. Blumenstein, “Offline Cursive Bengali Word Recognition Using CNNs with a Recurrent Model,” in International Conference on Frontiers in Handwriting Recognition (ICFHR), 2016.
  7. P. Krishnan, K. Dutta, and C. V. Jawahar, “Deep Feature Embedding for Accurate Recognition and Retrieval of Handwritten Text,” in International Conference on Frontiers in Handwriting Recognition (ICFHR), 2016.
  8. S. Sudholt and G. A. Fink, “PHOCNet: A Deep Convolutional Neural Network for Word Spotting in Handwritten Documents,” in 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2016.
  9. T. Wilkinson and A. Brun, “Semantic and Verbatim Word Spotting Using Deep Neural Networks,” in International Conference on Frontiers in Handwriting Recognition (ICFHR), 2016.
  10. S. Sudholt and G. Fink, “Attribute CNNs for Word Spotting in Handwritten Documents,” International Journal on Document Analysis and Recognition (IJDAR), 2017.
  11. R. Smith, “An Overview of the Tesseract OCR Engine,” in International Conference on Document Analysis and Recognition (ICDAR), 2007.
  12. P. Krishnan, R. Shekhar, and C. Jawahar, “Content level access to Digital Library of India pages,” in ACM International Conference Proceeding Series (ICPS), 2012.
  13. A. Gordo, J. Almazán, N. Murray, and F. Perronin, “LEWIS: Latent Embeddings for Word Images and Their Semantics,” in International Conference on Computer Vision (ICCV), 2015.
  14. S. Bansal, P. Krishnan, and C. V. Jawahar, “Fused Text Recogniser and Deep Embeddings Improve Word Recognition and Retrieval,” in Document Analysis Systems (DAS), 2020.
  15. P. Krishnan, K. Dutta, and C. V. Jawahar, “Word Spotting and Recognition Using Deep Embedding,” in IAPR International Workshop on Document Analysis Systems (DAS), 2018.
  16. A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” in International Conference on Machine Learning (ICML), 2006.
  17. R. Manmatha, C. Han, and E. M. Riseman, “Word spotting: A new approach to indexing handwriting,” in Computer Vision and Pattern Recognition, CVPR, ser. CVPR ’96, 1996, p. 631.
  18. T. Rath and R. Manmatha, “Word spotting for historical documents,” in International Journal of Document Analysis and Recognition (IJDAR), 2007.
  19. A. Balasubramanian, M. Meshesha, and C. V. Jawahar, “Retrieval from document image collections,” in Document Analysis Systems (DAS), 2006.
  20. R. Shekhar and C. V. Jawahar, “Word Image Retrieval Using Bag of Visual Words,” in Document Analysis Systems (DAS), 2012.
  21. J. Almazán, A. Gordo, A. Fornés, and E. Valveny, “Word spotting and recognition with embedded attributes,” PAMI, 2014.
  22. M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Synthetic data and artificial neural networks for natural scene text recognition,” in Workshop on Deep Learning, NIPS, 2014.
  23. M. Jaderberg, A. Vedaldi, and A. Zisserman, “Deep Features for Text Spotting,” in European Conference on Computer Vision (ECCV), 2014.
  24. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.
  25. A. Poznanski and L. Wolf, “CNN-N-Gram for Handwriting Word Recognition,” in Computer Vision and Pattern Recognition (CVPR), 2016.
  26. A. Bhardwaj, S. Kompalli, S. Setlur, and V. Govindaraju, “An OCR based approach for word spotting in Devanagari documents,” in Document Recognition and Retrieval Conference (DRR), 2008.
  27. S. Chaudhury, G. Sethi, A. Vyas, and G. Harit, “Devising interactive access techniques for Indian language document images,” in International Conference on Document Analysis and Recognition (ICDAR), 2003.
  28. K. Dutta, P. Krishnan, M. Mathew, and C. V. Jawahar, “Improving CNN-RNN Hybrid Networks for Handwriting Recognition,” in International Conference on Frontiers in Handwriting Recognition (ICFHR), 2018.
  29. K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” CoRR, 2015.
  30. F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  31. V. Ambati, N. Balakrishnan, R. Reddy, L. Pratha, and C. V. Jawahar, “The Digital Library of India Project: Process, Policies and Architecture,” in Second International Conference on Digital Libraries (ICDL), 2007.