Research Article
A Quranic Dataset for Text Recognition
@INPROCEEDINGS{10.4108/eai.18-7-2019.2287842,
  author={Idris Saleh Al-Sheikh and Masnizah Mohd},
  title={A Quranic Dataset for Text Recognition},
  proceedings={Proceedings of the 1st International Conference on Informatics, Engineering, Science and Technology, INCITEST 2019, 18 July 2019, Bandung, Indonesia},
  publisher={EAI},
  proceedings_a={INCITEST},
  year={2019},
  month={10},
  keywords={text recognition dataset quranic arabic ocr},
  doi={10.4108/eai.18-7-2019.2287842}
}
Authors: Idris Saleh Al-Sheikh, Masnizah Mohd
Year: 2019
INCITEST
EAI
DOI: 10.4108/eai.18-7-2019.2287842
Abstract
Any text recognition or Optical Character Recognition (OCR) system requires a dataset from which to learn how to recognize text. Due to the lack of a standard benchmark, most studies in this field were conducted on private datasets, which prevents a fair comparison. In this work, we used the standard Mushaf al Madinah as a benchmark. It follows certain rules of writing style: for example, each page must start at the beginning of a verse and end at the end of a verse. Following these rules makes the size of words and paragraphs vary from page to page. These characteristics make recognizing Quranic text more challenging than recognizing normal Arabic text, and state-of-the-art systems fail to recognize Quranic text. Therefore, a Quranic OCR dataset is presented in this study. It contains 604 images at the page level and 8927 images at the text-line level. The dataset is public and free to use for research purposes, and it is intended to help researchers in the field of Arabic OCR.