Supervised Urdu Word Segmentation Model Based on POS Information

Sadiq Khan; Khairullah Khan; Wahab Khan

sis 18(19): e2

Research Article



@ARTICLE{10.4108/eai.19-6-2018.155444,
    author={Sadiq Nawaz Khan and Khairullah Khan and Wahab Khan},
    title={Supervised Urdu Word Segmentation Model Based on POS Information},
    journal={EAI Endorsed Transactions on Scalable Information Systems},
    volume={5},
    number={19},
    publisher={EAI},
    journal_a={SIS},
    year={2018},
    month={9},
    keywords={Urdu, Word segmentation, supervised learning, conditional random fields},
    doi={10.4108/eai.19-6-2018.155444}
}


Sadiq Nawaz Khan¹,*, Khairullah Khan¹, Wahab Khan²

1: Department of Computer Science, University of Science & Technology Bannu, Pakistan
2: Department of Computer Science & Software Engineering, IIU, Islamabad 44000, Pakistan

*Contact email: sadiqnawaz97@gmail.com

Abstract

Urdu is the national language of Pakistan and one of the most widely spoken languages in the world. Successful Urdu NLP requires robust, high-performance tools and resources. Word segmentation plays a central role for morphologically rich languages such as Urdu across diverse NLP tasks, including named entity recognition, sentiment analysis, part-of-speech tagging, and information retrieval. The morphological richness of Urdu adds to the difficulty of word segmentation, because a single word can consist of zero or a few prefixes, a stem, and zero or a few suffixes. In this paper we present a supervised Urdu word segmentation scheme based on the part-of-speech (POS) information of the corresponding words. The experiments use conditional random fields (CRF) with contextual features. The proposed system is evaluated on 300K words, and the results show evident improvements over the baseline approach.
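The abstract does not spell out the feature set, so the following is only a minimal sketch of what contextual features built from tokens and their POS tags might look like for a CRF-style sequence labeller. The function names, the window size, the feature keys, and the tag names (`NN`, `PSP`, `VB`) are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List


def contextual_features(tokens: List[str], pos_tags: List[str],
                        i: int, window: int = 1) -> Dict[str, str]:
    """Feature dict for position i: the token, its POS tag, and its neighbours.

    Hypothetical feature template; the paper's actual features are not given.
    """
    feats = {"bias": "1", "tok": tokens[i], "pos": pos_tags[i]}
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        if 0 <= j < len(tokens):
            # Neighbouring token and its POS tag, keyed by signed offset.
            feats[f"tok[{off:+d}]"] = tokens[j]
            feats[f"pos[{off:+d}]"] = pos_tags[j]
        else:
            # Mark positions that fall outside the sentence.
            feats[f"pad[{off:+d}]"] = "BOS" if j < 0 else "EOS"
    return feats


def sentence_features(tokens: List[str], pos_tags: List[str],
                      window: int = 1) -> List[Dict[str, str]]:
    """One feature dict per position in the sentence."""
    return [contextual_features(tokens, pos_tags, i, window)
            for i in range(len(tokens))]


# Placeholder tokens stand in for Urdu text; the POS tags are illustrative.
example = sentence_features(["w1", "w2", "w3"], ["NN", "PSP", "VB"])
```

Per-position dictionaries of this kind are a common input shape for CRF toolkits; the labels paired with them (for example, word-boundary tags) would come from the annotated training data.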

Keywords: Urdu, Word segmentation, supervised learning, conditional random fields

Received: 2018-05-10
Accepted: 2018-09-04
Published: 2018-09-10
Publisher: EAI

DOI: http://dx.doi.org/10.4108/eai.19-6-2018.155444

Copyright © 2018 Sadiq Nawaz Khan et al., licensed to EAI. This is an open access article distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/3.0/), which permits unlimited use, distribution and reproduction in any medium so long as the original work is properly cited.
