Research Article
Security Level Classification of Confidential Documents Written in Turkish
@INPROCEEDINGS{10.1007/978-3-642-12630-7_41, author={Erdem Alparslan and Hayretdin Bahsi}, title={Security Level Classification of Confidential Documents Written in Turkish}, proceedings={Mining User-Generated Content for Security}, proceedings_a={MINUCS}, year={2012}, month={10}, keywords={document classification security Turkish support vector machine na{\"i}ve bayes TF-IDF stemming data loss prevention}, doi={10.1007/978-3-642-12630-7_41} }
- Erdem Alparslan
- Hayretdin Bahsi
Year: 2012
MINUCS
Springer
DOI: 10.1007/978-3-642-12630-7_41
Abstract
This article introduces a methodology for classifying confidential documents written in Turkish by security level. Internal documents of TUBITAK UEKAE, holding various security levels (unclassified, restricted, secret), were classified using Support Vector Machines (SVMs) [1] and naïve Bayes classifiers [3][9]. To represent term-document relations, the recommended "TF-IDF" metric [2] was chosen to construct a weight matrix. Compared with English, Turkic languages pose a particularly difficult natural language processing problem: stemming. A Turkish stemming tool, "zemberek", was used to extract features stripped of their suffixes. The article closes with experimental results and success metrics.
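The TF-IDF weight matrix mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which TF-IDF variant the authors used, so the code below assumes the common form (raw term frequency multiplied by log(N / document frequency)). The function name `tfidf_matrix` and the toy documents are hypothetical; the tokens are assumed to be already stemmed and are not taken from the paper's corpus.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, weight matrix) for a list of tokenized documents.

    Assumed weighting: raw term frequency * log(N / document frequency).
    This is one common TF-IDF variant, not necessarily the paper's.
    """
    n_docs = len(docs)
    # Fixed column order: the sorted set of all terms in the collection.
    vocab = sorted({term for doc in docs for term in doc})
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    matrix = []
    for doc in docs:
        tf = Counter(doc)  # missing terms count as 0
        matrix.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, matrix


# Hypothetical, already-stemmed toy documents (illustration only).
docs = [
    ["gizli", "rapor", "proje"],
    ["rapor", "belge", "rapor"],
    ["proje", "belge", "gizli", "gizli"],
]
vocab, weights = tfidf_matrix(docs)
```

Each row of `weights` is one document and each column one stemmed term; rows of this kind could then serve as feature vectors for a classifier such as an SVM or naïve Bayes.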