Data Homogeneity Dependent Topic Modeling for Information Retrieval

Keerthana Sureshbabu Kashi; Abigail A. Antenor; Gabriel Isaac L. Ramolete; Adrienne Heinrich

Intelligent Systems and Machine Learning. First EAI International Conference, ICISML 2022, Hyderabad, India, December 16-17, 2022, Proceedings, Part II

Research Article



@INPROCEEDINGS{10.1007/978-3-031-35081-8_6,
    author={Keerthana Sureshbabu Kashi and Abigail A. Antenor and Gabriel Isaac L. Ramolete and Adrienne Heinrich},
    title={Data Homogeneity Dependent Topic Modeling for Information Retrieval},
    proceedings={Intelligent Systems and Machine Learning. First EAI International Conference, ICISML 2022, Hyderabad, India, December 16-17, 2022, Proceedings, Part II},
    proceedings_a={ICISML PART 2},
    year={2023},
    month={7},
    keywords={Topic modeling, Topic Discovery, Technique selection, Information retrieval, NMF, LDA, LSA, BERTopic, Homogeneity, Heterogeneity},
    doi={10.1007/978-3-031-35081-8_6}
}

Keerthana Sureshbabu Kashi, Abigail A. Antenor, Gabriel Isaac L. Ramolete, Adrienne Heinrich (2023). Data Homogeneity Dependent Topic Modeling for Information Retrieval. ICISML PART 2. Springer. DOI: 10.1007/978-3-031-35081-8_6

Keerthana Sureshbabu Kashi¹,*, Abigail A. Antenor¹, Gabriel Isaac L. Ramolete¹, Adrienne Heinrich¹

1: Aboitiz Data Innovation, Goldbell Towers

*Contact email: keerthana.sureshbabu@aboitiz.com

Abstract

Different topic modeling techniques have been applied over the years to categorize and make sense of large volumes of unstructured textual data. We observe that no single technique works well for all domains or for a general use case. We hypothesize that the performance of these algorithms depends on the variation and heterogeneity of topics mentioned in free text, and we investigate this effect in our study. Our proposed methodology comprises i) the calculation of a homogeneity score to measure the variation in the data, and ii) the selection of the algorithm with the best performance for the calculated homogeneity score. For each homogeneity score, the performances of popular topic modeling algorithms, namely NMF, LDA, LSA, and BERTopic, were compared using accuracy and Cohen’s kappa score. Our results indicate that for highly homogeneous data, BERTopic outperformed the other algorithms (Cohen’s kappa of 0.42 vs. 0.06 for LSA). For data with medium and low homogeneity, NMF was superior to the other algorithms (for medium homogeneity, Cohen’s kappa was 0.3 for NMF vs. 0.15 for LDA, 0.1 for BERTopic, and 0.04 for LSA).
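As an illustration of the evaluation and selection steps the abstract describes, the following is a minimal Python sketch, not the authors' implementation. It computes accuracy and Cohen's kappa from their standard definitions, and encodes the reported outcome (BERTopic for high homogeneity, NMF for medium and low) as a lookup. The function names, the band labels "high", "medium", and "low", and the toy labels are assumptions made for illustration; the abstract does not state how the homogeneity score is computed or how it maps to bands, so that step is deliberately left out.

```python
from collections import Counter

# Best-performing algorithm per homogeneity band, as reported in the abstract.
# The band names are hypothetical labels; the mapping from a numeric
# homogeneity score to a band is not given in the abstract.
BEST_BY_HOMOGENEITY = {"high": "BERTopic", "medium": "NMF", "low": "NMF"}


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted topic matches the reference label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance, (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Expected agreement if labels were assigned independently
    # with the observed marginal frequencies.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Degenerate case (both sides use one identical label); kappa is
        # undefined, and returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    return (p_o - p_e) / (1.0 - p_e)


def select_algorithm(homogeneity_level):
    """Return the algorithm the abstract reports as best for the given band."""
    return BEST_BY_HOMOGENEITY[homogeneity_level.lower()]


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    reference = ["a", "a", "b", "b"]
    predicted = ["a", "b", "b", "b"]
    print(accuracy(reference, predicted))     # 0.75
    print(cohen_kappa(reference, predicted))  # 0.5
    print(select_algorithm("high"))           # BERTopic
```

In the toy example the observed agreement is 0.75 and the chance agreement is 0.5, so kappa is 0.5. Kappa is useful alongside accuracy because it discounts agreement that would occur by chance when a few topics dominate the data.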

Keywords: Topic modeling, Topic Discovery, Technique selection, Information retrieval, NMF, LDA, LSA, BERTopic, Homogeneity, Heterogeneity

Published: 2023-07-10
Appears in: SpringerLink

http://dx.doi.org/10.1007/978-3-031-35081-8_6
