sis 18(19): e5

Research Article

Unsupervised Machine Learning based Documents Clustering in Urdu

  • @ARTICLE{10.4108/eai.19-12-2018.156081,
        author={Atta Ur Rahman and Khairullah Khan and Wahab Khan and Aurangzeb Khan and Bibi Saqia},
        title={Unsupervised Machine Learning based Documents Clustering in Urdu},
        journal={EAI Endorsed Transactions on Scalable Information Systems},
        volume={5},
        number={19},
        publisher={EAI},
        journal_a={SIS},
        year={2018},
        month={12},
        keywords={Urdu; Documents clustering; Similarity Measures; K-Means Algorithm},
        doi={10.4108/eai.19-12-2018.156081}
    }
    
Atta Ur Rahman¹*, Khairullah Khan¹, Wahab Khan², Aurangzeb Khan¹, Bibi Saqia¹
  • 1: Department of Computer Science, University of Science & Technology Bannu, Pakistan
  • 2: Department of Computer Science & Software Engineering, IIU, Islamabad 44000, Pakistan
*Contact email: attacs9@gmail.com

Abstract

The volume of data on the web is growing rapidly owing to the proliferation of news sources, blogs, journals and other content. Like other languages, Urdu has seen tremendous growth on the internet. As the volume of data expands, information retrieval (IR) becomes more complicated. Document clustering is an unsupervised machine learning (ML) approach that groups a large number of dispersed documents into a small number of meaningful and coherent clusters, thus providing a basis for indexing, IR and browsing mechanisms. Document clustering has a long tradition in English and similar Western languages, but Urdu lags behind in terms of sophisticated natural language processing (NLP) tools and resources for document clustering. Document clustering is a challenging task in Urdu because of the language's rich morphology, particular structure, syntactic peculiarities and cursive script. In this study, we develop a framework for document clustering and analyse various similarity measures for Urdu documents. We also examine the effect of stop-word removal on Urdu document clustering.
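
To make the pipeline described in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes whitespace tokenization, raw term-frequency vectors, cosine similarity as the single similarity measure, and a K-Means variant that assigns each document to its most similar centroid. The stop-word list, the sample documents and all function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical, tiny Urdu stop-word list (illustration only, not the authors' list).
STOP_WORDS = {"کا", "کی", "کے", "ہے", "میں", "اور"}

def tokenize(text, remove_stop_words=True):
    """Split on whitespace and optionally drop stop words."""
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def kmeans(vectors, k, iterations=10):
    """K-Means using cosine similarity; the first k vectors seed the centroids."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: pick the most similar centroid for each document.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                total = Counter()
                for m in members:
                    total.update(m)
                centroids[c] = {t: n / len(members) for t, n in total.items()}
    return labels

# Hypothetical sample documents: two about cricket, two about the economy.
docs = [
    "کرکٹ میچ میں ٹیم کی جیت",
    "حکومت کا بجٹ اور معیشت",
    "ٹیم کے کھلاڑی اور کرکٹ",
    "معیشت میں بجٹ کا خسارہ",
]
vectors = [Counter(tokenize(d)) for d in docs]
labels = kmeans(vectors, k=2)
```

Passing `remove_stop_words=False` to `tokenize` gives a simple way to compare clusterings with and without stop words, and swapping `cosine` for another function is one way to compare similarity measures.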

Keywords: Urdu; Documents clustering; Similarity Measures; K-Means Algorithm
Received: 2018-09-30
Accepted: 2018-12-04
Published: 2018-12-19
Publisher: EAI
http://dx.doi.org/10.4108/eai.19-12-2018.156081

Copyright © 2018 Atta Ur Rahman et al., licensed to EAI. This is an open access article distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/3.0/), which permits unlimited use, distribution and reproduction in any medium so long as the original work is properly cited.
