Collaborative Computing: Networking, Applications and Worksharing. 15th EAI International Conference, CollaborateCom 2019, London, UK, August 19-22, 2019, Proceedings

Research Article

Extracting Topics from Semi-structured Data for Enhancing Enterprise Knowledge Graphs

  • @INPROCEEDINGS{10.1007/978-3-030-30146-0_8,
        author={Neda Abolhassani and Lakshmish Ramaswamy},
        title={Extracting Topics from Semi-structured Data for Enhancing Enterprise Knowledge Graphs},
        proceedings={Collaborative Computing: Networking, Applications and Worksharing. 15th EAI International Conference, CollaborateCom 2019, London, UK, August 19-22, 2019, Proceedings},
        proceedings_a={COLLABORATECOM},
        year={2019},
        month={8},
        keywords={Topic modeling, Knowledge graphs, Semi-structured data, MongoDB, Gibbs Sampling, Variational Bayes},
        doi={10.1007/978-3-030-30146-0_8}
    }
    
  • Neda Abolhassani
    Lakshmish Ramaswamy
    Year: 2019
    Extracting Topics from Semi-structured Data for Enhancing Enterprise Knowledge Graphs
    COLLABORATECOM
    Springer
    DOI: 10.1007/978-3-030-30146-0_8
Neda Abolhassani1,*, Lakshmish Ramaswamy1,*
  • 1: University of Georgia
*Contact email: neda@cs.uga.edu, laks@cs.uga.edu

Abstract

Unifying information across organizational data silos that lack documentation, structure, and automated semantic discovery has attracted intense interest in recent years. The enterprise knowledge graph is a common tool for data integration and knowledge discovery, and it has become a backbone for APIs that demand access to structured knowledge. One piece that has previously gone unnoticed in building enterprise knowledge graphs is an abstract layer of themes and concepts mapped to the various documents stored as semi-structured files in databases. Augmenting enterprise knowledge graphs with concepts helps companies find trends in their data and gain a holistic view of their entire data stores. The major challenge in extracting topics from semi-structured data is the lack of a corpus or description. In this research, we investigate the impact of self-supplementation of words and documents on probabilistic topic modeling of semi-structured data. Another contribution of this paper is finding the tuning of probabilistic topic modeling that best fits semi-structured data. The extracted topics are potential summaries of and concepts about the dataset. Moreover, they can be mapped to their sources of origin in order to extend the enterprise knowledge graph. We consider two inference techniques and demonstrate the results on real data pools from Open City data and Kaggle data, containing 7.5 GB and 1.15 GB of data, respectively, stored in MongoDB collections. We also propose a selection heuristic for effectively identifying topics hidden in various data sources.
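
To make the idea of topic extraction from semi-structured records more concrete, the following is a minimal sketch, not the authors' implementation. It assumes records shaped like MongoDB documents (nested dictionaries), flattens field names and values into word tokens, and runs a small collapsed Gibbs sampler for LDA, one of the inference techniques named in the keywords. The sample records, the helper names `flatten` and `gibbs_lda`, and the hyperparameters `alpha` and `beta` are all hypothetical choices for illustration. The paper's self-supplementation step and selection heuristic are not reproduced here.

```python
import random
from collections import Counter


def flatten(record):
    """Turn a nested dict/list record into a flat list of lowercase tokens.

    Field names are kept as tokens as well (an assumption of this sketch),
    since semi-structured records often carry meaning in their keys.
    """
    tokens = []
    if isinstance(record, dict):
        for key, value in record.items():
            tokens.extend(str(key).lower().replace("_", " ").split())
            tokens.extend(flatten(value))
    elif isinstance(record, (list, tuple)):
        for item in record:
            tokens.extend(flatten(item))
    elif record is not None:
        tokens.extend(str(record).lower().split())
    return tokens


def gibbs_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Returns per-document topic counts and per-topic word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Hypothetical records standing in for documents read from a collection.
records = [
    {"permit_type": "building", "work": ["roof", "repair"], "status": "issued"},
    {"permit_type": "building", "work": ["roof", "replacement"], "status": "open"},
    {"crime_type": "theft", "location": {"district": "central"}, "status": "open"},
    {"crime_type": "burglary", "location": {"district": "north"}, "status": "closed"},
]

docs = [flatten(r) for r in records]
doc_topic, topic_word = gibbs_lda(docs, num_topics=2)

# Map each source record to its dominant topic, the link that could be
# added to a knowledge graph as a record-to-concept edge.
dominant = [max(range(2), key=lambda k: row[k]) for row in doc_topic]
for k, counts in enumerate(topic_word):
    top = [w for w, c in counts.most_common(4) if c > 0]
    print(f"topic {k}: {top}")
print("dominant topic per record:", dominant)
```

Because each record yields only a handful of tokens, the resulting topics on data this small are unstable, which illustrates the lack-of-corpus challenge the abstract describes. The sampler is seeded so that repeated runs give the same assignment.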