A Quranic Dataset for Text Recognition

Idris Al-Sheikh; Masnizah Mohd

Proceedings of the 1st International Conference on Informatics, Engineering, Science and Technology, INCITEST 2019, 18 July 2019, Bandung, Indonesia

Research Article


@INPROCEEDINGS{10.4108/eai.18-7-2019.2287842,
    author={Idris Saleh Al-Sheikh and Masnizah  Mohd},
    title={A Quranic Dataset for Text Recognition},
    proceedings={Proceedings of the 1st International Conference on Informatics, Engineering, Science and Technology, INCITEST 2019, 18 July 2019, Bandung, Indonesia},
    publisher={EAI},
    proceedings_a={INCITEST},
    year={2019},
    month={10},
    keywords={text recognition dataset quranic arabic ocr},
    doi={10.4108/eai.18-7-2019.2287842}
}


Idris Saleh Al-Sheikh¹,*, Masnizah Mohd¹

1: Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Malaysia

*Contact email: alshikh@gmail.com

Abstract

Any text recognition or Optical Character Recognition (OCR) system requires a dataset from which to learn how to recognize text. Due to the lack of a standard benchmark, most studies in this field have been conducted on private datasets, which prevents fair comparison. In this work, we use the standard Mushaf al Madinah as a benchmark. Its writing style follows certain rules; for example, each page must start at the beginning of a verse and end at the end of a verse. Following these rules makes words and paragraphs vary in size across pages. These characteristics make recognizing Quranic text more challenging than recognizing ordinary Arabic text, and state-of-the-art systems fail to recognize it. Therefore, a Quranic OCR dataset is presented in this study. It contains 604 page-level images and 8927 text-line-level images. The dataset is public and free to use for research purposes, and it is intended to help researchers in the field of Arabic OCR.

Keywords: text recognition, dataset, quranic, arabic ocr

Published: 2019-10-01
Publisher: EAI

DOI: http://dx.doi.org/10.4108/eai.18-7-2019.2287842
