Intelligent Systems and Machine Learning. First EAI International Conference, ICISML 2022, Hyderabad, India, December 16-17, 2022, Proceedings, Part II

Research Article

AI/ML Based Sensitive Data Discovery and Classification of Unstructured Data Sources

Cite
  • @INPROCEEDINGS{10.1007/978-3-031-35081-8_31,
        author={Shravani Ponde and Akshay Kulkarni and Rashmi Agarwal},
        title={AI/ML Based Sensitive Data Discovery and Classification of Unstructured Data Sources},
        proceedings={Intelligent Systems and Machine Learning. First EAI International Conference, ICISML 2022, Hyderabad, India, December 16-17, 2022, Proceedings, Part II},
        proceedings_a={ICISML PART 2},
        year={2023},
        month={7},
        keywords={Data Discovery, Data Protection, Sensitive Data Classification, Data Privacy, Unstructured Data Discovery, Classification Model},
        doi={10.1007/978-3-031-35081-8_31}
    }
    
  • Shravani Ponde
    Akshay Kulkarni
    Rashmi Agarwal
    Year: 2023
    AI/ML Based Sensitive Data Discovery and Classification of Unstructured Data Sources
    ICISML PART 2
    Springer
    DOI: 10.1007/978-3-031-35081-8_31
Shravani Ponde1,*, Akshay Kulkarni1, Rashmi Agarwal1
  • 1: RACE, REVA University
*Contact email: shravanip.ba03@reva.edu.in

Abstract

The amount of data produced every day is enormous. According to Forbes, 2.5 quintillion bytes of data are created daily (Marr, 2018). The volume of unstructured data is also multiplying daily, forcing organizations to spend significant time, effort, and money to manage and govern their data assets. This volume of unstructured data also creates data privacy challenges in handling, auditing, and meeting the regulatory requirements imposed by governments, auditors, and data protection laws and standards such as the General Data Protection Regulation (GDPR), those of the Basel Committee on Banking Supervision (BCBS), the Health Insurance Portability and Accountability Act (HIPAA), and the California Consumer Privacy Act (CCPA).

Organizations must set up a robust data protection and governance framework to identify, classify, protect, and monitor the sensitive data residing in unstructured data sources. Data discovery and classification is the process of scanning an organization’s data sources, both structured and unstructured, that could contain sensitive or regulated data.

Most organizations use various data discovery and classification tools to scan structured and unstructured sources. However, gaps in how these tools scan and discover sensitive data elements in unstructured sources leave organizations unable to meet their overall privacy and protection needs. Hence, they resort to manual methods to fill these gaps.

The main objective of this study is to build a solution that systematically scans an unstructured data source, detects the sensitive data elements, automatically classifies them according to the data classification categories, and visualizes the results on a dashboard. This solution uses Machine Learning (ML) and Natural Language Processing (NLP) techniques to detect the sensitive data elements contained in the unstructured data sources. It can be used as a first step before performing data encryption, tokenization, anonymization, and masking as part of the overall data protection journey.
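
The scan, detect, classify, and summarize flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: plain regular expressions stand in for the ML and NLP models mentioned in the abstract, and the detector names, patterns, category labels (Public, Confidential, Restricted), and function names are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical detectors: regular expressions standing in for the ML/NLP
# models mentioned in the abstract. Names and patterns are assumptions.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Hypothetical mapping from detected element type to classification category.
CATEGORY = {
    "EMAIL": "Confidential",
    "CREDIT_CARD": "Restricted",
    "US_SSN": "Restricted",
}

# Assumed ordering of categories, from least to most restrictive.
RANK = {"Public": 0, "Confidential": 1, "Restricted": 2}


def discover(text):
    """Return a list of (element_type, matched_text) pairs found in text."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((label, match.group()))
    return findings


def classify(findings):
    """Return the most restrictive category among findings, else 'Public'."""
    level = "Public"
    for label, _ in findings:
        if RANK[CATEGORY[label]] > RANK[level]:
            level = CATEGORY[label]
    return level


def summarize(documents):
    """Count documents per category, the kind of tally a dashboard could chart."""
    return Counter(classify(discover(text)) for text in documents.values())


if __name__ == "__main__":
    # Made-up sample documents keyed by a hypothetical file name.
    docs = {
        "memo.txt": "Lunch is at noon on Friday.",
        "hr.txt": "Contact jane.doe@example.com for details.",
        "claims.txt": "SSN 123-45-6789 was filed by bob@example.org.",
    }
    for name, text in docs.items():
        print(name, classify(discover(text)), discover(text))
    print(dict(summarize(docs)))
```

Pattern matching of this kind only catches elements with a fixed shape. Free-text elements such as names or addresses are where a trained NLP model, as the abstract proposes, would replace the `discover` step while the classification and summary steps stay the same.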

Keywords
Data Discovery, Data Protection, Sensitive Data Classification, Data Privacy, Unstructured Data Discovery, Classification Model
Published
2023-07-10
Appears in
SpringerLink
http://dx.doi.org/10.1007/978-3-031-35081-8_31