Amharic Text Complexity Classification Using Supervised Machine Learning

Gebregziabihier Nigusie; Tesfa Tegegne

Artificial Intelligence and Digitalization for Sustainable Development. 10th EAI International Conference, ICAST 2022, Bahir Dar, Ethiopia, November 4-6, 2022, Proceedings

Research Article

@INPROCEEDINGS{10.1007/978-3-031-28725-1_1,
    author={Gebregziabihier Nigusie and Tesfa Tegegne},
    title={Amharic Text Complexity Classification Using Supervised Machine Learning},
    proceedings={Artificial Intelligence and Digitalization for Sustainable Development. 10th EAI International Conference, ICAST 2022, Bahir Dar, Ethiopia, November 4-6, 2022, Proceedings},
    proceedings_a={ICAST},
    year={2023},
    month={3},
    keywords={Text complexity, Supervised classification, Lexical complexity},
    doi={10.1007/978-3-031-28725-1_1}
}

Gebregziabihier Nigusie
Tesfa Tegegne
Year: 2023
Amharic Text Complexity Classification Using Supervised Machine Learning
ICAST
Springer
DOI: 10.1007/978-3-031-28725-1_1

Gebregziabihier Nigusie¹,*, Tesfa Tegegne²

1: ICT4D Research Center, Faculty of Computing, Bahir Dar Institute of Technology
2: ICT4D Research Center, Bahir Dar Institute of Technology

*Contact email: gerenigusie138@gmail.com

Abstract

The number of Amharic documents has increased tremendously with the proliferation of the internet. These documents use a wide variety of lexical items, some of which may be unfamiliar to second-language learners and low-literacy readers, making the ideas difficult to comprehend. Text complexity concerns how difficult or easy a text is to read and understand given the reader’s level of knowledge. A text intended for a particular learner group should match that group’s proficiency level. A document that contains complex lexical items can also reduce the performance of NLP tasks such as machine translation. A complexity classification model for Amharic text therefore helps address text complexity both for a target population and for NLP applications. In this paper, we develop a complexity classification model for Amharic texts using supervised machine learning. The experiment uses 5126 sentences. The text is vectorized using TF-IDF and bag-of-words (BOW) with bigram language modeling, and Support Vector Machine (SVM), Random Forest (RF), and Naïve Bayes (NB) algorithms are used for classification. SVM achieves the best classification accuracy, 87.1%, using BOW feature extraction and 10-fold cross-validation. The RF and NB algorithms achieve accuracies of 83% and 80.3%, respectively. For error analysis, we use the Mean Square Error (MSE) and Root Mean Square Error (RMSE) metrics. This study addresses the classification of Amharic text complexity; simplifying the texts identified as complex is our recommendation for future research.
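Two of the building blocks named in the abstract, bag-of-words features with bigrams and the MSE/RMSE error metrics, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the whitespace tokenizer, the placeholder tokens (`w1`, `w2`, `w3`), the helper names, and the encoding of the complexity labels as 0 and 1 are all assumptions made for illustration, and the classifiers themselves (SVM, RF, NB) are omitted.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all n-grams from unigrams up to n_max-grams as space-joined strings."""
    feats = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


def fit_vocabulary(sentences, n_max=2):
    """Assign a column index to every n-gram, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for gram in ngrams(sentence.split(), n_max):
            vocab.setdefault(gram, len(vocab))
    return vocab


def bow_vector(sentence, vocab, n_max=2):
    """Count how often each vocabulary n-gram occurs in the sentence."""
    counts = Counter(ngrams(sentence.split(), n_max))
    vector = [0] * len(vocab)
    for gram, count in counts.items():
        if gram in vocab:  # n-grams unseen during fitting are ignored
            vector[vocab[gram]] = count
    return vector


def mse(y_true, y_pred):
    """Mean Square Error between true and predicted numeric labels."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical toy sentences; placeholder tokens stand in for Amharic words.
sentences = ["w1 w2 w3", "w2 w3 w2 w3"]
vocab = fit_vocabulary(sentences)
vectors = [bow_vector(s, vocab) for s in sentences]

# Hypothetical labels (assumed encoding: 0 = simple, 1 = complex).
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
error_mse = mse(y_true, y_pred)
error_rmse = rmse(y_true, y_pred)
```

Under the assumed 0/1 label encoding, each misclassification contributes a squared error of exactly 1, so the MSE equals the fraction of misclassified sentences and the RMSE is its square root. With a different label encoding that relationship would not hold.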

Keywords: Text complexity, Supervised classification, Lexical complexity

Published: 2023-03-19
Appears in: SpringerLink
