Proceedings of the 1st International Conference on Informatics, Engineering, Science and Technology, INCITEST 2019, 18 July 2019, Bandung, Indonesia

Research Article

A Quranic Dataset for Text Recognition

  • @INPROCEEDINGS{10.4108/eai.18-7-2019.2287842,
        author={Idris Saleh Al-Sheikh and Masnizah Mohd},
        title={A Quranic Dataset for Text Recognition},
        booktitle={Proceedings of the 1st International Conference on Informatics, Engineering, Science and Technology, INCITEST 2019, 18 July 2019, Bandung, Indonesia},
        publisher={EAI},
        proceedings_a={INCITEST},
        year={2019},
        month={10},
        keywords={text recognition dataset quranic arabic ocr},
        doi={10.4108/eai.18-7-2019.2287842}
    }
    
  • Idris Saleh Al-Sheikh
    Masnizah Mohd
    Year: 2019
    A Quranic Dataset for Text Recognition
    INCITEST
    EAI
    DOI: 10.4108/eai.18-7-2019.2287842
Idris Saleh Al-Sheikh1,*, Masnizah Mohd1
  • 1: Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Malaysia
*Contact email: alshikh@gmail.com

Abstract

Any text recognition or Optical Character Recognition (OCR) system requires a dataset from which to learn how to recognize text. Because no standard benchmark exists, most studies in this field have been conducted on private datasets, which prevents a fair comparison. In this work, we use the standard Mushaf al Madinah as a benchmark. Its writing style follows certain rules; for example, each page must start at the beginning of a verse and end at the end of a verse. Following these rules makes the words and paragraphs vary in size from page to page. These characteristics make Quranic text more challenging to recognize than ordinary Arabic text, and state-of-the-art systems fail to recognize it. Therefore, a Quranic OCR dataset is presented in this study. It contains 604 images at the page level and 8927 images at the text-line level. The dataset is public and free to use for research purposes, and it is intended to help researchers in the field of Arabic OCR.