
Research Article
Amharic Sentence-Level Word Sense Disambiguation Using Transfer Learning
@INPROCEEDINGS{10.1007/978-3-031-28725-1_14, author={Neima Mossa and Million Meshesha}, title={Amharic Sentence-Level Word Sense Disambiguation Using Transfer Learning}, proceedings={Artificial Intelligence and Digitalization for Sustainable Development. 10th EAI International Conference, ICAST 2022, Bahir Dar, Ethiopia, November 4-6, 2022, Proceedings}, proceedings_a={ICAST}, year={2023}, month={3}, keywords={Word sense disambiguation, Transfer learning, Neural network, Pre-trained language model, Natural language preprocessing, Morphological analyzer, Amharic WSD}, doi={10.1007/978-3-031-28725-1_14} }
- Neima Mossa
- Million Meshesha
Year: 2023
Amharic Sentence-Level Word Sense Disambiguation Using Transfer Learning
ICAST
Springer
DOI: 10.1007/978-3-031-28725-1_14
Abstract
Word sense disambiguation (WSD) plays an important role in increasing the performance of NLP applications such as information extraction, information retrieval, and machine translation. Manual disambiguation by humans is tedious, error-prone, and expensive. Recent research on Amharic WSD has mostly used handcrafted rules. Such approaches cannot automatically learn different representations of the target word from data, and they consider only a limited window of surrounding words in the sentence. The main drawback of previous work is that the sense of a word cannot be detected from the synset list unless the word is explicitly mentioned. Our study explores and designs an Amharic WSD model employing transformer-based contextual embeddings, namely AmRoBERTa. As there is no standard sense-tagged Amharic text dataset for the Amharic WSD task, we first compiled 800 ambiguous words and collected more than 33k sentences containing those words. These 33k sentences are used to fine-tune our transformer-based AmRoBERTa model. We conduct two types of annotation for our WSD experiments. First, with linguistic experts, we annotate 10k sentences for seven types of word relations (synonymy, hyponymy, hypernymy, meronymy, holonymy, toponymy, and homonymy). For the WSD disambiguation experiment, we choose 10 target words and annotate a total of 1000 sentences with their correct sense using the WebAnno annotation tool. For the classification task, the CNN, Bi-LSTM, and BERT-based classification models achieve accuracies of 90%, 88%, and 93%, respectively. For the WSD task, we conduct two experiments. When we use the masking technique of the pre-trained contextual embedding to find the correct sense, it attains 70% accuracy. However, when we use the FLAIR document embedding framework to embed the target sentences and glosses separately and compute their similarity, our model achieves 71% accuracy in correctly disambiguating target words.
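The gloss-similarity approach described in the abstract can be illustrated with a minimal sketch: embed the target sentence and each candidate sense gloss, then pick the sense whose gloss embedding is closest to the sentence embedding. The toy vectors and the `disambiguate` helper below are purely illustrative stand-ins (in the paper, FLAIR document embeddings over Amharic text would produce the vectors).

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def disambiguate(sentence_vec, gloss_vecs):
    """Return the sense whose gloss embedding is most similar to the sentence embedding."""
    scores = {sense: cosine(sentence_vec, v) for sense, v in gloss_vecs.items()}
    return max(scores, key=scores.get)

# Toy vectors standing in for document embeddings (illustrative only).
sentence = np.array([0.9, 0.1, 0.2])
glosses = {
    "sense_1": np.array([0.8, 0.2, 0.1]),  # gloss close to the sentence context
    "sense_2": np.array([0.1, 0.9, 0.7]),  # gloss far from the sentence context
}
print(disambiguate(sentence, glosses))  # → sense_1
```

The same selection rule applies regardless of which embedding model produces the vectors; only the quality of the embeddings changes the accuracy.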