AI Learns to Recognize Bengali Handwritten Digits: Bengali.AI Computer Vision Challenge 2018

Solving problems with artificial intelligence in a competitive manner has long been absent in Bangladesh and the Bengali-speaking community. Moreover, no well-structured database of Bengali handwritten digits has been available for mass public use. To bring together the best minds working in machine learning and use their expertise to create models that can recognize Bengali handwritten digits, we organized the Bengali.AI Computer Vision Challenge. The challenge saw unprecedented effort from both local and international participating teams.

Index Terms: Bengali Handwritten Digits, Machine Learning, Digit Recognition, Competition

I Introduction

It had been a long time since any well-structured Bengali handwritten digit dataset was open sourced for research purposes. Though there are many handwritten datasets for Arabic numerals such as MNIST [1], and for French [2], Chinese [3], and Urdu [4] scripts, a massive void existed for Bengali handwritten digits, let alone characters. Several attempts were made to build such a dataset, like the one collected by the Indian Statistical Institute (ISI) [5], but only a small portion was released for public access. Another attempt was made by the Center for Microprocessor Application for Training Education and Research (CMATER) group [6], but it was kept for internal research purposes only. In 2018, NumtaDB [7] was released, the first of its kind: a well-structured dataset for Bengali handwritten digits. Collected from multiple sources and open sourced for mass public use, the 85,000+ digit dataset is the largest of its kind for Bengali handwritten digits. Most remarkably, all the cleaning, standardization, and sorting was done by hand over a span of 5 months.

The natural next step was to organize a competition with this large-scale data to connect the machine learning practitioners of the country and bring out their best work. The Bengali.AI CV Challenge commenced in June. A grand total of 57 teams participated from 26 local and international institutions, ranging from colleges to universities. The overall number of submissions was around 650, peaking at 100 in the last week before the deadline. The challenge was hosted on Kaggle [8], which enabled the teams to use its computing resources and subsystems for processing data and executing algorithms. Furthermore, teams could use the discussion forums for any queries or problems they faced. Kaggle kernels enabled veterans to teach new machine learning practitioners and also let the participating teams share their winning algorithms.

Most of the algorithms tried to address not only the recognition of digits but also the orientation and style in which the digits were written. The data was sorted such that it contained different natural settings that can easily confuse a hard-coded algorithm while remaining recognizable to humans. The challenge was to produce machine learning algorithms sophisticated enough to be on par with their human counterparts. The winning solutions succeeded in producing such models by training on large amounts of heavily augmented and variable data.

This competition is the first of a series of computer vision challenges hosted by Bengali.AI. The aim is to create a large artificial intelligence and machine learning community in Bangladesh, similar to the communities created by computer vision challenges like ImageNet [9], PASCAL VOC [10], and MS-COCO [11], which have become cornerstones for participation and knowledge sharing in computer vision. Frequent challenges and competitions will also pave the way for the machine learning community in Bangladesh to grow and prosper, and will motivate it to take part in international challenges.

II Data Accumulation and Structuring

II-A Data from Different Sources

The data was gathered from six different sources, as detailed in [7]. The first source was the Bengali Handwritten Digits Database (BHDDB), which initially consisted of 23,400 samples; 209 samples were later removed as illegible. It was contributed by students of the Department of Computer Science and Engineering at Bangladesh University of Engineering and Technology (BUET). The data was collected on forms with a distinct grid pattern and stored in RGB format.

The second dataset was also collected from BUET. It is called the BUET101 Database (B101DB) and had 435 samples, of which 7 were removed. The data was cropped and labelled by hand after collection.

OngkoDB was another contribution from the Department of Computer Science and Engineering at BUET, consisting of 28,900 samples, from which 321 were removed. Unlike the other collected data, the images were in gray-scale format.

Students from the Institute of Statistical Research and Training (ISRT), Dhaka University, contributed the fourth dataset (ISRTHDB). It consisted of 13,133 samples, of which 277 were rejected. It had the same format as BHDDB, though it was much more pristine and less noisy.

The fifth dataset was BanglaLekha-Isolated [12], which consists of numerals and alphabets combined; we selected only the numerals for digit classification. The data was cleaned thoroughly and erroneous labels were fixed after accumulation. The total number of samples was 20,319, collected in binary format; 572 of the original samples were removed.

UIUDB, a dataset prepared by students from United International University (UIU), was the last set of samples we collected for the competition. It had 576 samples, of which 81 were removed.

Table I shows the contribution of each source and the number of samples included in the dataset.

Dataset                     Source       Samples   Removed   Format
BHDDB                       CSE, BUET    23,400    209       RGB
B101DB                      BUET         435       7         RGB
OngkoDB                     CSE, BUET    28,900    321       Gray-scale
ISRTHDB                     ISRT, DU     13,133    277       RGB
BanglaLekha-Isolated [12]   -            20,319    572       Binary
UIUDB                       UIU          576       81        -
TABLE I: Data Sources

II-B Train and Test Data

The dataset was split into train and test sets. The training set consists of 72,044 samples and the test set of 13,552 samples. Apart from UIUDB, each dataset was split into 85% training and 15% test data; the UIUDB dataset in its entirety was assigned to the test set. The resulting training and test sets were named a, b, c, d, e, and f. Table II details the training and test datasets.
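The paper does not publish the preparation code, but the per-source 85%-15% split can be sketched as follows; the function and variable names are illustrative assumptions:

```python
import random

def split_by_source(samples, test_fraction=0.15, seed=42):
    """Split one source's samples into train/test, as in the 85%-15% scheme.

    `samples` is a list of (image_path, label) pairs; names are illustrative.
    """
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)  # shuffle before splitting so both sets mix writers
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

# Toy source with 100 samples: yields an 85/15 split.
toy = [(f"img_{i}.png", i % 10) for i in range(100)]
train, test = split_by_source(toy)
```

Running the split per source (rather than on the pooled data) keeps each source's proportion of training and test samples fixed, matching Table II's per-split ratios.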

II-C Augmentation on Test Set

The test data was further augmented to simulate practical challenges of the task. Different augmentation techniques were applied to increase the variability of the data. They are:

  • Spatial Transformations: Rotation, Translation, Shear, Height/Width Shift, Channel Shift, Zoom.

  • Brightness, Contrast, Saturation, Hue shifts, Noise.

  • Occlusions.

  • Superimposition (to simulate the effect of text being visible from the other side of a page).

The augmented data was derived from test sets 'a' and 'c'. No augmentation was performed on the training set or on the remaining test sets. The numbers of samples in 'aug-a' and 'aug-c' are 2,168 and 2,106 respectively. Table II details the augmented sets alongside the original non-augmented train and test splits.
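Two of the less common augmentations, occlusion and superimposition, can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the organizers' actual pipeline; the patch sizes and blending weight are assumptions:

```python
import numpy as np

def occlude(img, rng, max_frac=0.3):
    """Zero out a random rectangular patch (simulates occlusion)."""
    h, w = img.shape
    ph = rng.integers(1, int(h * max_frac) + 1)
    pw = rng.integers(1, int(w * max_frac) + 1)
    y = rng.integers(0, h - ph + 1)
    x = rng.integers(0, w - pw + 1)
    out = img.copy()
    out[y:y + ph, x:x + pw] = 0
    return out

def superimpose(img, other, alpha=0.3):
    """Blend a horizontally mirrored second digit at low opacity,
    simulating ink showing through from the other side of a page."""
    return np.clip(img + alpha * other[:, ::-1], 0, 255)

rng = np.random.default_rng(0)
digit = rng.integers(0, 256, size=(28, 28)).astype(float)
backside = rng.integers(0, 256, size=(28, 28)).astype(float)
aug = superimpose(occlude(digit, rng), backside)
```

Applying such transforms only to the test sets, as done here for 'aug-a' and 'aug-c', measures robustness without letting models train on the same distortions.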

Split   Train-Test   Train Samples   Test Samples   Augmented Samples
a       85%-15%      19,703          3,490          -
b       85%-15%      359             70             -
c       85%-15%      24,298          4,381          -
d       85%-15%      10,908          1,948          -
e       85%-15%      16,778          2,970          -
f       0%-100%      -               495            -
aug-a   0%-100%      -               -              2,168
aug-c   0%-100%      -               -              2,106
TABLE II: Training and Test Data

III Bengali.AI CV Challenge 2018

III-A Competition Evaluation Metrics

Six test sets, a, b, c, d, e, and f, were derived from the six different sources. Additionally, two augmented datasets were produced from test sets a and c. To account for the non-homogeneity of the sub-sets, we opted to use Unweighted Average Accuracy (UAA) as the evaluation metric for the competition:


\[ \mathrm{UAA} = \frac{1}{N}\sum_{i=1}^{N} a_i \tag{1} \]

In (1), $a_i$ is the model accuracy on the $i$'th test sub-set and $N$ is the number of test sub-sets. We also compare the Weighted Average Accuracy (WAA) with UAA in Table III.
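Both metrics are straightforward to compute; a minimal sketch follows (the toy accuracies and sub-set sizes below are illustrative, not competition results):

```python
def uaa(accuracies):
    """Unweighted Average Accuracy: plain mean over the N test sub-sets."""
    return sum(accuracies) / len(accuracies)

def waa(accuracies, sizes):
    """Weighted Average Accuracy: each sub-set weighted by its sample count."""
    correct = sum(a * n for a, n in zip(accuracies, sizes))
    return correct / sum(sizes)

# Toy example: a model that is perfect on a tiny sub-set but weaker on a
# large one. UAA treats both sub-sets equally; WAA is dominated by the
# larger sub-set.
accs = [1.0, 0.90]
sizes = [70, 4381]  # e.g. the sizes of sub-sets b and c from Table II
print(uaa(accs))         # 0.95
print(waa(accs, sizes))  # ~0.9016
```

Because the sub-sets vary widely in size (from 70 to 4,381 samples), UAA prevents a model from scoring well by excelling only on the largest sub-sets.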

III-B Participating Teams and Corresponding Scores

The competition saw the largest number of teams participating in any computer-science-related competition in Bangladesh. In total, 92 teams participated from 19 different institutions. Of those, 57 teams submitted at least one viable model for digit recognition. The total number of model entries was 685, i.e., 12 entries per team on average. The winning teams were Backpropers, Digit_Branch, and Dekhi_ki_hoi. Table III lists the top 20 teams, their affiliations, corresponding scores, and numbers of entries. Note that the relative difference between the top two teams in terms of WAA is only 0.01%. The 1st-place winner made a staggering 60 entries, while Digit_Branch came 2nd with fewer than a third as many. Another notable observation is that all 20 teams surpassed 90% UAA, and the top 3 teams achieved competitive scores within 0.2% UAA of each other.

Rank  Team             Affiliation  No. of Entries  UAA      WAA
1     Backpropers      BUET         60              0.99359  0.99484
2     Digit_Branch     CUET         17              0.99296  0.99478
3     Dekhi_ki_hoi     BUET         26              0.99177  0.99336
4     Sabbir Ahmed     BUET         55              0.98080  0.98530
5     Sannin           DU           28              0.97889  0.98428
6     Lets Try         NSU          39              0.97606  0.98269
7     Diversense       KUET         28              0.96694  0.97134
8     AUST_Benzema     AUST         47              0.96188  0.97305
9     Kola             BUET         19              0.95236  0.96550
10    Rafizunaed       BUET         8               0.94393  0.96777
11    Osprishyo        SUST         17              0.93942  0.94547
12    Numta_ai         BUET         17              0.93631  0.94780
13    RUETvision       RUET         33              0.92723  0.94167
14    Eyes on you      CUET         8               0.92701  0.93793
15    RUET_13          RUET         12              0.92426  0.93662
16    Yellowchrom      DU           11              0.91855  0.93691
17    Halum            KUET         13              0.91619  0.93407
18    Sadman Sakib     BUET         3               0.91387  0.94281
19    Code_crawlers    BUET         10              0.91269  0.93004
20    Md Asadul Islam  UAlberta     5               0.90296  0.91972
TABLE III: Standing of Top 20 teams
Fig. 1: Testing set images which were frequently misclassified by competition entries.

III-C Winning Algorithms


The winning team, Backpropers, used multiple architectures such as ResNet-34 and ResNet-50 [13]. Transfer learning was performed using pre-trained weights, and an ensemble was built. Backpropers used the PyTorch [14] deep learning library to create the models, and the whole training and validation pipeline was executed in Google Colaboratory [15]. In total, they ensembled 6 models and averaged their scores to obtain the final prediction.
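The team's exact code is not reproduced here; a minimal NumPy sketch of averaging predictions across an ensemble (the array shapes and names are assumptions) might look like:

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average class probabilities from several models and take the argmax.

    `prob_list`: list of (n_samples, n_classes) arrays, one per model.
    """
    avg = np.mean(prob_list, axis=0)  # element-wise mean over the models
    return avg.argmax(axis=1)         # predicted digit per sample

# Toy example with 2 "models", 3 samples, and 10 digit classes.
rng = np.random.default_rng(1)
m1 = rng.dirichlet(np.ones(10), size=3)
m2 = rng.dirichlet(np.ones(10), size=3)
preds = ensemble_predict([m1, m2])
```

Averaging probabilities rather than hard labels lets a confident model outvote several uncertain ones, which is one common reason such ensembles outperform their individual members.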


The runner-up, Digit_Branch, used ResNeXt-50 [16] with pre-trained weights for training their models. In total, 3 separate yet similar architectures were used. A data generation scheme called Overlay was employed, which generated a fixed set of 5,000 images from the training data; digits 0 and 4 were excluded from being added as mirrored images in the Overlay. Those 5,000 images were used along with the original training set to train the three models. Data augmentation such as rotation, shear, scaling, translation, noise, salt-and-pepper noise, and contrast normalization was also performed.
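A plausible sketch of the Overlay idea is given below. The blending weight and the selection logic are assumptions; only the exclusion of digits 0 and 4 as mirrored overlays comes from the team's description:

```python
import numpy as np

def overlay_sample(base, others, labels, rng, alpha=0.35):
    """Overlay a faded, mirrored digit from another sample onto `base`.

    Digits labeled 0 or 4 are excluded from being the mirrored overlay,
    following the scheme's exclusion rule; the blending details here are
    an assumption, not the team's exact recipe.
    """
    candidates = [img for img, y in zip(others, labels) if y not in (0, 4)]
    pick = candidates[rng.integers(len(candidates))]
    return np.clip(base + alpha * pick[:, ::-1], 0, 255)

rng = np.random.default_rng(0)
imgs = [rng.integers(0, 256, (28, 28)).astype(float) for _ in range(5)]
labs = [0, 4, 7, 2, 9]
out = overlay_sample(imgs[2], imgs, labs, rng)
```

Generating a fixed pool of such composites (5,000 images in the team's case) and appending it to the training set exposes the models to the superimposition artifacts present in the augmented test sets.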


The 3rd-place entrant, Dekhi_ki_hoi, ensembled 4 different models trained end-to-end. The first model was DenseNet-121 [17] with weights pre-trained on ImageNet. The next two models combined dilated convolution [18] blocks stacked in different orders. Lastly, a repetitive plain convolution block was used to create the fourth model. The final prediction was averaged over all four models' predictions. Data augmentation included Gaussian blur, additive Gaussian noise, channel-wise random scaling, affine scaling, translation, rotation, shear, contrast normalization, and additive pepper noise.
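To illustrate what a dilated convolution computes, here is a naive NumPy version. This is a sketch for intuition only; the entrant's models would use a framework implementation such as PyTorch's `nn.Conv2d` with its `dilation` argument:

```python
import numpy as np

def dilated_conv2d(x, k, dilation=2):
    """Naive 2-D convolution with a dilated kernel (no padding, stride 1).

    A dilation of d inserts d-1 gaps between kernel taps, enlarging the
    receptive field of a k x k kernel to (k-1)*d + 1 without extra weights.
    """
    kh, kw = k.shape
    eff_h = (kh - 1) * dilation + 1  # effective kernel height
    eff_w = (kw - 1) * dilation + 1  # effective kernel width
    oh, ow = x.shape[0] - eff_h + 1, x.shape[1] - eff_w + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Strided slicing picks every `dilation`-th pixel in the window.
            patch = x[i:i + eff_h:dilation, j:j + eff_w:dilation]
            out[i, j] = (patch * k).sum()
    return out

x = np.arange(36, dtype=float).reshape(6, 6)
k = np.ones((3, 3))
y = dilated_conv2d(x, k, dilation=2)  # 3x3 kernel sees a 5x5 field -> 2x2 output
```

The enlarged receptive field lets a small stack of layers see a whole digit, which is the appeal of dilated blocks in compact recognition models.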

Fig. 2: Confusion matrix of competition entries surpassing 99% UAA.
Fig. 3: Accuracy comparison of testing sets on top submissions.

IV Evaluating Label Noise in NumtaDB

Large collections of labeled images have powered the recent advances in image classification. Large datasets are prone to label noise in both the training and testing sets [19]. Deep learning models trained with large supervised datasets are robust to mislabeled data in the training set [20], but noisy labels in the testing set can introduce biases that need to be resolved. To find noisy labels, we manually inspected the digits most frequently misclassified by the submissions to the Bengali.AI CV Challenge (Fig. 1). We found 69 mislabeled testing-set instances, of which 45 were from testing-e. The mislabeled digits have been replaced with correctly labeled data. The updated dataset is available at

V Results Analysis

To compare the accuracy of models on different datasets, we took the top 335 submissions. We defined the accuracy of each sample as the percentage of models that predicted it correctly out of all submissions, and calculated the distribution of per-sample accuracy grouped by dataset.
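This per-sample metric can be written directly in NumPy; the toy predictions below are illustrative, not competition data:

```python
import numpy as np

def per_sample_accuracy(preds, labels):
    """For each sample, the fraction of submitted models that got it right.

    `preds`: (n_models, n_samples) array of predicted digits.
    `labels`: (n_samples,) array of true digits.
    """
    # Broadcasting compares every model's row against the label vector,
    # then the mean over axis 0 averages across models per sample.
    return (preds == labels).mean(axis=0)

preds = np.array([[3, 5, 1],
                  [3, 5, 2],
                  [3, 0, 2],
                  [3, 5, 2]])
labels = np.array([3, 5, 2])
acc = per_sample_accuracy(preds, labels)  # [1.0, 0.75, 0.75]
```

Samples with low values under this metric are exactly the "frequently misclassified" instances inspected for label noise in Section IV.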

As the data was assembled from various sources, accuracy on the testing sets varied with the source. Analyzing the non-augmented sets, we found that testing-e was by far the most difficult, followed by testing-f. This was expected, as testing-e and testing-f were collected manually, so many samples naturally contained noise and deformations. In contrast, the other testing sets were much easier for the models to decipher, with testing-b being the easiest. Testing-auga and testing-augc had similar error distributions and accounted for the majority of the errors (Fig. 3).
An analysis of the influence of the different kinds of augmentation was also done. This information is invaluable during training, as it can point out where the most focus is needed when augmenting the training set. We used a combination of several augmentations (translate, coarse dropout, perspective transform, hue and white-balance shift, pixelation, superposition, contrast normalization, salt-and-pepper noise addition, Gaussian blur, rotate, and shear). Our augmented test datasets had a total of 385 unique combinations of augmentations. In our analysis of the augmented testing sets testing-auga and testing-augc, we calculated the average number of incorrect predictions made by the top 279 submissions, all of which scored above 90% accuracy. We found that the combination of perspective transform, Gaussian blur, and shear was the hardest for models to predict, with an average of 213 incorrect predictions per sample. The combination of shear, hue/white-balance shift, and perspective transform was the next hardest, with 205 mistakes per sample. All other combinations averaged fewer than 149 incorrect predictions per sample.
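The grouping underlying this analysis can be sketched as follows, assuming each augmented sample carries the set of augmentations applied to it and a count of models that got it wrong (the numbers below are toy values, not the competition data):

```python
from collections import defaultdict

def mistakes_per_combination(samples):
    """Average number of incorrect predictions per sample, grouped by the
    set of augmentations applied to that sample.

    `samples`: list of (augmentations, n_incorrect) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # combo -> [sum of errors, count]
    for augs, n_wrong in samples:
        key = frozenset(augs)  # order of augmentations doesn't matter
        totals[key][0] += n_wrong
        totals[key][1] += 1
    return {k: s / c for k, (s, c) in totals.items()}

data = [
    (("perspective", "gaussian_blur", "shear"), 220),
    (("perspective", "gaussian_blur", "shear"), 206),
    (("rotate",), 40),
]
avg = mistakes_per_combination(data)
```

Ranking the resulting averages identifies the hardest combinations, which in turn suggests which augmentations deserve the most weight when building a training pipeline.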

VI Discussions & Conclusion

The Bengali.AI Computer Vision Challenge saw a large number of teams competing with hundreds of algorithms to clinch the title of the best Bengali handwritten digit recognition model. The teams implemented both state-of-the-art and traditional machine learning techniques, and through the competition gained valuable experience with competitiveness, practical challenges, and solving computer-vision-related problems. This challenge is the first of a series of competitions that aim to encourage both beginner and veteran artificial intelligence practitioners to solve real-life problems using machine learning and beyond.


Acknowledgments

Bengali.AI is a community working to solve the scarcity of standardized datasets for Bengali Natural Language Processing and Computer Vision research. We are grateful to all the members of the community who took part in data contribution and the successful organizing of the event. The Bengali.AI Computer Vision Challenge was exclusively sponsored by Apurba Technologies Inc.


References

  1. Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  2. Christian Viard-Gaudin, Pierre Michel Lallican, Philippe Binter, and Stefan Knerr. The ireste on/off (ironoff) dual handwriting database. In icdar, page 455. IEEE, 1999.
  3. Lianwen Jin, Yan Gao, Gang Liu, Yunyang Li, and Kai Ding. Scut-couch2009—a comprehensive online unconstrained chinese handwriting database and benchmark evaluation. International Journal on Document Analysis and Recognition (IJDAR), 14(1):53–64, 2011.
  4. Malik Waqas Sagheer, Chun Lei He, Nicola Nobile, and Ching Y Suen. A new large urdu database for off-line handwriting recognition. In International Conference on Image Analysis and Processing, pages 538–546. Springer, 2009.
  5. BB Chaudhuri. A complete handwritten numeral database of bangla–a major indic script. In Tenth International Workshop on Frontiers in Handwriting Recognition. Suvisoft, 2006.
  6. Nibaran Das, Ram Sarkar, Subhadip Basu, Mahantapas Kundu, Mita Nasipuri, and Dipak Kumar Basu. A genetic algorithm based region sampling for selection of local features in handwritten digit recognition application. Applied Soft Computing, 12(5):1592–1606, 2012.
  7. Samiul Alam, Tahsin Reasat, Rashed Mohammad Doha, and Ahmed Imtiaz Humayun. Numtadb-assembled bengali handwritten digits. arXiv preprint arXiv:1806.02452, 2018.
  8. P Dugan, W Cukierski, Y Shiu, A Rahaman, and C Clark. Kaggle competition. Cornell Univerity, The ICML, 2013.
  9. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  10. Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  11. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  12. Mithun Biswas, Rafiqul Islam, Gautam Kumar Shom, Md Shopon, Nabeel Mohammed, Sifat Momen, and Anowarul Abedin. Banglalekha-isolated: A multi-purpose comprehensive dataset of handwritten bangla isolated characters. Data in brief, 12:103–107, 2017.
  13. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  14. Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
  15. Kumarjit Pathak, Jitin Kapila, Nikit Gawande, et al. Incremental learning framework using cloud computing. arXiv preprint arXiv:1805.04754, 2018.
  16. Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
  17. Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  18. Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
  19. Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: a survey. IEEE transactions on neural networks and learning systems, 25(5):845–869, 2014.
  20. David Rolnick, Andreas Veit, Serge Belongie, and Nir Shavit. Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694, 2017.