
Research Article
Evaluating the Effect of Corpus Normalisation in Topics Coherence
@INPROCEEDINGS{10.1007/978-3-030-77417-2_15,
    author={Luana da Silva Sousa and Vinicius Melquiades de Sousa and Rogerio de Aquino Silva and Gustavo Medeiros de Ara\'{u}jo},
    title={Evaluating the Effect of Corpus Normalisation in Topics Coherence},
    proceedings={Data and Information in Online Environments. Second EAI International Conference, DIONE 2021, Virtual Event, March 10--12, 2021, Proceedings},
    proceedings_a={DIONE},
    year={2021},
    month={6},
    keywords={Corpus normalisation, LDA, Topic coherence, Ontology, Natural language processing},
    doi={10.1007/978-3-030-77417-2_15}
}
Luana da Silva Sousa
Vinicius Melquiades de Sousa
Rogerio de Aquino Silva
Gustavo Medeiros de Araújo
Year: 2021
DIONE
Springer
DOI: 10.1007/978-3-030-77417-2_15
Abstract
Probabilistic topic models are widely used to better understand the content of documents. Because topic models are fully unsupervised, statistical, and data-driven, they may produce topics that are not always meaningful. This work is based on the hypothesis that, since LDA takes the number of word occurrences into account, we can affect the quality of topics by semantically normalising the text so that each concept is represented by the same word. Using a knowledge base, we obtain a formal description of the lexemes found in the text and extract the various forms in which each lexeme is mentioned in order to normalise the corpus. To quantify the influence of semantic corpus normalisation on topics, we use the topic coherence metric, which represents the semantic interpretability of the terms used to describe a particular topic. The first tests of the semantic normalisation framework showed promising results, which shall be investigated in depth in future work.
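The normalisation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' framework: the `LEXEME_FORMS` table, the helper names, and the three-sentence corpus are hypothetical stand-ins for a real knowledge base and corpus. The sketch only shows how mapping every surface form of a lexeme to one canonical word consolidates the occurrence counts that a count-based model such as LDA would see.

```python
import re
from collections import Counter

# Hypothetical knowledge-base extract: each canonical lexeme maps to the
# surface forms that mention it. These entries are illustrative only.
LEXEME_FORMS = {
    "car": ["automobile", "motorcar", "cars"],
    "physician": ["doctor", "doctors", "medical practitioner"],
}


def build_lookup(lexeme_forms):
    """Invert the lexeme table into a surface-form -> canonical-word map."""
    lookup = {}
    for canonical, forms in lexeme_forms.items():
        for form in forms:
            lookup[form.lower()] = canonical
    return lookup


def tokenise(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def normalise(text, lookup):
    """Replace every known surface form with its canonical word, then tokenise."""
    text = text.lower()
    # Longer forms are replaced first so multi-word mentions are handled
    # before any shorter form they might contain.
    for form in sorted(lookup, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(form) + r"\b", lookup[form], text)
    return tokenise(text)


# Toy corpus (assumed for illustration).
docs = [
    "The doctor bought an automobile.",
    "Doctors recommend cars with airbags.",
    "A medical practitioner sold the motorcar.",
]

lookup = build_lookup(LEXEME_FORMS)
raw_counts = Counter(tok for doc in docs for tok in tokenise(doc))
normalised_counts = Counter(tok for doc in docs for tok in normalise(doc, lookup))

# Before normalisation each concept is spread over several words, each
# occurring once; afterwards the counts are concentrated on one word.
print(raw_counts["doctor"], raw_counts["doctors"])             # 1 1
print(normalised_counts["physician"], normalised_counts["car"])  # 3 3
```

In this toy corpus each concept appears under three different forms, so the raw counts never exceed one per form, while the normalised counts reach three per concept. A real pipeline would pass the normalised token lists to a topic model and compare topic coherence with and without this step.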