Artificial Intelligence and Digitalization for Sustainable Development. 10th EAI International Conference, ICAST 2022, Bahir Dar, Ethiopia, November 4-6, 2022, Proceedings

Research Article

Amharic Text Complexity Classification Using Supervised Machine Learning

  • @INPROCEEDINGS{10.1007/978-3-031-28725-1_1,
        author={Gebregziabihier Nigusie and Tesfa Tegegne},
        title={Amharic Text Complexity Classification Using Supervised Machine Learning},
        proceedings={Artificial Intelligence and Digitalization for Sustainable Development. 10th EAI International Conference, ICAST 2022, Bahir Dar, Ethiopia, November 4-6, 2022, Proceedings},
        proceedings_a={ICAST},
        year={2023},
        month={3},
        keywords={Text complexity, Supervised classification, Lexical complexity},
        doi={10.1007/978-3-031-28725-1_1}
    }
    
  • Gebregziabihier Nigusie
    Tesfa Tegegne
    Year: 2023
    Amharic Text Complexity Classification Using Supervised Machine Learning
    ICAST
    Springer
    DOI: 10.1007/978-3-031-28725-1_1
Gebregziabihier Nigusie1,*, Tesfa Tegegne2
  • 1: ICT4D Research Center, Faculty of Computing, Bahir Dar Institute of Technology
  • 2: ICT4D Research Center, Bahir Dar Institute of Technology
*Contact email: gerenigusie138@gmail.com

Abstract

The volume of Amharic documents has grown tremendously with the proliferation of the internet. These documents draw on a wide variety of lexical items, some of which may be unfamiliar to second-language learners and low-literacy readers, making the ideas difficult to comprehend. Text complexity concerns how difficult or easy a text is to read and understand given the reader's level of knowledge. A text is appropriate for a given learner group only if it matches their proficiency level. A document that contains complex lexical items can also reduce the performance of NLP tasks such as machine translation. A complexity classification model for Amharic text therefore helps address text complexity both for a target population and for NLP applications. In this paper, we develop a complexity classification model for Amharic texts using supervised machine learning. The experiments use 5126 sentences. The text is vectorized with TF-IDF and bag-of-words (BOW) features with bigram language modeling, and Support Vector Machine (SVM), Random Forest (RF), and Naïve Bayes (NB) algorithms are used for classification. SVM achieves the best classification accuracy, 87.1%, using BOW feature extraction and 10-fold cross-validation. The RF and NB algorithms reach accuracies of 83% and 80.3%, respectively. For error analysis, we use the Mean Square Error (MSE) and Root Mean Square Error (RMSE) metrics. This study addresses the classification of Amharic text complexity; we recommend the simplification of the identified complex texts as future research.
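The feature extraction and error metrics named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: it assumes whitespace tokenization, a vocabulary of word unigrams plus bigrams, and binary labels encoded as 1 (complex) and 0 (non-complex), none of which the abstract specifies. The function names and the placeholder tokens `w1`…`w4` are hypothetical and stand in for real Amharic words.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams (joined by a space) found in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(sentences, max_n=2):
    """Assign a column index to every 1..max_n-gram seen in the corpus."""
    vocab = {}
    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace tokenization
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                vocab.setdefault(gram, len(vocab))
    return vocab


def bow_vector(sentence, vocab, max_n=2):
    """Count-based bag-of-words vector; unseen n-grams are ignored."""
    tokens = sentence.split()
    counts = Counter(
        gram for n in range(1, max_n + 1) for gram in ngrams(tokens, n)
    )
    vector = [0] * len(vocab)
    for gram, count in counts.items():
        if gram in vocab:
            vector[vocab[gram]] = count
    return vector


def mse(y_true, y_pred):
    """Mean Square Error between true and predicted numeric labels."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


# Placeholder corpus (hypothetical tokens, not data from the paper).
corpus = ["w1 w2 w3", "w2 w3 w4"]
vocab = build_vocabulary(corpus)
features = bow_vector("w2 w3 w4", vocab)

# Hypothetical labels: 1 = complex, 0 = non-complex (an assumed encoding).
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
error_mse = mse(y_true, y_pred)
error_rmse = rmse(y_true, y_pred)
```

Under the assumed 0/1 label encoding, the MSE equals the fraction of misclassified sentences, so it carries the same information as accuracy; the vectors produced by `bow_vector` are the kind of input a classifier such as an SVM would be trained on.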

Keywords
Text complexity, Supervised classification, Lexical complexity
Published
2023-03-19
Appears in
SpringerLink
http://dx.doi.org/10.1007/978-3-031-28725-1_1