Intelligent Systems and Machine Learning. First EAI International Conference, ICISML 2022, Hyderabad, India, December 16-17, 2022, Proceedings, Part II

Research Article

Data Homogeneity Dependent Topic Modeling for Information Retrieval

Cite
BibTeX
@INPROCEEDINGS{10.1007/978-3-031-35081-8_6,
    author={Keerthana Sureshbabu Kashi and Abigail A. Antenor and Gabriel Isaac L. Ramolete and Adrienne Heinrich},
    title={Data Homogeneity Dependent Topic Modeling for Information Retrieval},
    proceedings={Intelligent Systems and Machine Learning. First EAI International Conference, ICISML 2022, Hyderabad, India, December 16-17, 2022, Proceedings, Part II},
    proceedings_a={ICISML PART 2},
    year={2023},
    month={7},
    keywords={Topic modeling Topic Discovery Technique selection Information retrieval NMF LDA LSA BERTopic Homogeneity Heterogeneity},
    doi={10.1007/978-3-031-35081-8_6}
}
Keerthana Sureshbabu Kashi (1, *), Abigail A. Antenor (1), Gabriel Isaac L. Ramolete (1), Adrienne Heinrich (1)
  • 1: Aboitiz Data Innovation, Goldbell Towers
*Contact email: keerthana.sureshbabu@aboitiz.com

Abstract

Different topic modeling techniques have been applied over the years to categorize and make sense of large volumes of unstructured textual data. We observe that no single technique works well across all domains or for a general use case. We hypothesize that the performance of these algorithms depends on the variation and heterogeneity of the topics mentioned in free text, and we investigate this effect in our study. Our proposed methodology comprises (i) the calculation of a homogeneity score to measure the variation in the data and (ii) the selection of the algorithm with the best performance for the calculated homogeneity score. For each homogeneity score, the performance of popular topic modeling algorithms, namely NMF, LDA, LSA, and BERTopic, was compared using accuracy and Cohen’s kappa. Our results indicate that for highly homogeneous data, BERTopic outperformed the other algorithms (Cohen’s kappa of 0.42 vs. 0.06 for LSA). For data of medium and low homogeneity, NMF was superior to the other algorithms (for medium homogeneity, Cohen’s kappa was 0.3 for NMF vs. 0.15 for LDA, 0.1 for BERTopic, and 0.04 for LSA).
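
The evaluation and selection steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`accuracy`, `cohen_kappa`, `select_algorithm`), the band labels ("high", "medium", "low"), and the toy label lists are assumptions made for illustration. The abstract does not define how the homogeneity score is computed, so the sketch takes the homogeneity band as given and only encodes the best-performing algorithm per band as reported above.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted topic matches the reference topic."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal label frequencies of both labelings.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when both labelings use one identical label
    return (p_o - p_e) / (1.0 - p_e)


# Best performer per homogeneity band, as reported in the abstract.
# The band names themselves are hypothetical labels chosen for this sketch.
BEST_BY_HOMOGENEITY = {"high": "BERTopic", "medium": "NMF", "low": "NMF"}


def select_algorithm(band):
    """Return the algorithm the abstract reports as best for a homogeneity band."""
    return BEST_BY_HOMOGENEITY[band]


# Toy example (made-up labels, not data from the paper).
reference = ["a", "a", "b", "b"]
predicted = ["a", "b", "b", "b"]
toy_accuracy = accuracy(reference, predicted)
toy_kappa = cohen_kappa(reference, predicted)
```

Kappa is lower than accuracy on the toy labels because it discounts the agreement that two labelings with these marginal frequencies would reach by chance, which is why it is a useful companion metric when topic sizes are unbalanced.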

Keywords
Topic modeling, Topic Discovery, Technique selection, Information retrieval, NMF, LDA, LSA, BERTopic, Homogeneity, Heterogeneity
Published
2023-07-10
Appears in
SpringerLink
http://dx.doi.org/10.1007/978-3-031-35081-8_6