
Research Article
Data Homogeneity Dependent Topic Modeling for Information Retrieval
@INPROCEEDINGS{10.1007/978-3-031-35081-8_6,
  author={Keerthana Sureshbabu Kashi and Abigail A. Antenor and Gabriel Isaac L. Ramolete and Adrienne Heinrich},
  title={Data Homogeneity Dependent Topic Modeling for Information Retrieval},
  proceedings={Intelligent Systems and Machine Learning. First EAI International Conference, ICISML 2022, Hyderabad, India, December 16-17, 2022, Proceedings, Part II},
  proceedings_a={ICISML PART 2},
  year={2023},
  month={7},
  keywords={Topic modeling Topic Discovery Technique selection Information retrieval NMF LDA LSA BERTopic Homogeneity Heterogeneity},
  doi={10.1007/978-3-031-35081-8_6}
}
Keerthana Sureshbabu Kashi
Abigail A. Antenor
Gabriel Isaac L. Ramolete
Adrienne Heinrich
Year: 2023
ICISML PART 2
Springer
DOI: 10.1007/978-3-031-35081-8_6
Abstract
Different topic modeling techniques have been applied over the years to categorize and make sense of large volumes of unstructured textual data. We observe that no single technique works well for all domains or for a general use case. We hypothesize that the performance of these algorithms depends on the variation and heterogeneity of the topics mentioned in free text, and we investigate this effect in our study. Our proposed methodology comprises (i) the calculation of a homogeneity score to measure the variation in the data and (ii) the selection of the algorithm with the best performance for the calculated homogeneity score. For each homogeneity score, the performance of popular topic modeling algorithms, namely NMF, LDA, LSA, and BERTopic, was compared using accuracy and Cohen’s kappa. Our results indicate that for highly homogeneous data, BERTopic outperformed the other algorithms (Cohen’s kappa of 0.42 vs. 0.06 for LSA). For data of medium and low homogeneity, NMF was superior to the other algorithms (for medium homogeneity, Cohen’s kappa was 0.3 for NMF vs. 0.15 for LDA, 0.1 for BERTopic, and 0.04 for LSA).
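The abstract compares algorithms using accuracy and Cohen's kappa. As a reading aid, the following is a minimal sketch of how these two metrics are commonly computed from a list of reference topic labels and a list of predicted topic labels. It is not the authors' code: the function names, the label lists, and the handling of the degenerate case are assumptions made for illustration.

```python
from collections import Counter
import math


def accuracy(reference, predicted):
    """Fraction of items on which the two label sequences agree."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)


def cohens_kappa(reference, predicted):
    """Cohen's kappa: agreement corrected for agreement expected by chance."""
    observed = accuracy(reference, predicted)
    n = len(reference)
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Chance agreement: sum over labels of the product of marginal frequencies.
    expected = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / (n * n)
    if expected == 1.0:
        # Both sequences use one identical label; kappa is undefined here.
        # Returning NaN is a choice made for this sketch.
        return math.nan
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels, for illustration only (not data from the paper).
reference = ["billing", "billing", "delivery", "delivery"]
predicted = ["billing", "delivery", "delivery", "delivery"]
print(accuracy(reference, predicted), cohens_kappa(reference, predicted))
```

Because kappa subtracts the agreement expected by chance, it can be much lower than accuracy when one topic dominates the data, which is one reason to report both metrics.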