Research Article
An Improved Approach of Unstructured Text Document Classification Using Predetermined Text Model and Probability Technique
@INPROCEEDINGS{10.4108/eai.16-5-2020.2304041,
  author={S Kumar Sreedhar and Syed Ahmed and P Mercy Flora and LS Hemanth and J Aishwarya and Rahul Gopal Naik},
  title={An Improved Approach of Unstructured Text Document Classification Using Predetermined Text Model and Probability Technique},
  proceedings={Proceedings of the First International Conference on Advanced Scientific Innovation in Science, Engineering and Technology, ICASISET 2020, 16-17 May 2020, Chennai, India},
  publisher={EAI},
  proceedings_a={ICASISET},
  year={2021},
  month={1},
  keywords={classification classifier keyword based document classification (kbdc) predetermined irrelevant text model probability technique pre-determined keyword text pattern model (pktpm)},
  doi={10.4108/eai.16-5-2020.2304041}
}
S Kumar Sreedhar
Syed Ahmed
P Mercy Flora
LS Hemanth
J Aishwarya
Rahul Gopal Naik
Year: 2021
ICASISET
EAI
DOI: 10.4108/eai.16-5-2020.2304041
Abstract
Document classification is the task of splitting a document set into distinct, highly related classes or groups based on the nature of the document contents. Here, an improved approach to document classification called keyword-based document classification (KBDC) is introduced. It focuses on splitting an unstructured text document set into K dissimilar classes based on K predetermined keyword text models, using an improved probability technique. The new system comprises three stages: pre-processing, classification, and the classifier stage. First, the proposed system (KBDC) recognizes all the irrelevant content in the input text document through a constructed Predetermined Irrelevant Text Pattern Model (PITPM). Next, it divides the pre-processed document set into K different groups or classes using K Pre-determined Keyword Text Pattern Models (PKTPMs) through the probability technique, where K denotes the number of groups, classes, or models. Finally, the KBDC system assigns an unlabeled test document to one of the existing groups based on the K class models (PKTPMs). Experimental results show that KBDC is suitable for splitting an unstructured text document set into K distinct, highly related classes and for identifying the class of an unlabeled document.
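The three stages described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the probability technique, so the sketch assumes a simple keyword-hit ratio. The names `IRRELEVANT`, `KEYWORD_MODELS`, `preprocess`, `class_probabilities`, and `classify`, along with the example keyword sets, are hypothetical stand-ins for the PITPM and the K PKTPMs.

```python
import re
from collections import Counter

# Hypothetical stand-in for the PITPM: tokens treated as irrelevant content.
IRRELEVANT = {"the", "a", "an", "of", "and", "is", "in", "to", "for"}

# Hypothetical stand-ins for the K PKTPMs: one keyword set per class (K = 2 here).
KEYWORD_MODELS = {
    "sports": {"match", "team", "score", "player"},
    "finance": {"bank", "stock", "market", "loan"},
}


def preprocess(text):
    """Pre-processing stage: lowercase, tokenize, drop irrelevant tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in IRRELEVANT]


def class_probabilities(tokens, models=KEYWORD_MODELS):
    """Assumed probability technique: share of keyword hits per class."""
    counts = Counter(tokens)
    hits = {label: sum(counts[k] for k in kws) for label, kws in models.items()}
    total = sum(hits.values())
    if total == 0:
        # No keyword from any model appears in the document.
        return {label: 0.0 for label in models}
    return {label: h / total for label, h in hits.items()}


def classify(text, models=KEYWORD_MODELS):
    """Classifier stage: pick the class with the highest probability, or None."""
    probs = class_probabilities(preprocess(text), models)
    best = max(probs, key=probs.get)
    return best if probs[best] > 0 else None
```

Under these assumptions, `classify("The team won the match with a late score")` selects the `sports` model, and a document that contains none of the keywords is left unassigned (`None`). A real system would need a rule for ties and far larger keyword models.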