Mining User-Generated Content for Security

Research Article

Security Level Classification of Confidential Documents Written in Turkish

  • @INPROCEEDINGS{10.1007/978-3-642-12630-7_41,
        author={Erdem Alparslan and Hayretdin Bahsi},
        title={Security Level Classification of Confidential Documents Written in Turkish},
        proceedings={Mining User-Generated Content for Security},
        proceedings_a={MINUCS},
        year={2012},
        month={10},
        keywords={document classification security Turkish support vector machine na\"{i}ve Bayes TF-IDF stemming data loss prevention},
        doi={10.1007/978-3-642-12630-7_41}
    }
    
  • Erdem Alparslan
    Hayretdin Bahsi
    Year: 2012
    Security Level Classification of Confidential Documents Written in Turkish
    MINUCS
    Springer
    DOI: 10.1007/978-3-642-12630-7_41
Erdem Alparslan1,*, Hayretdin Bahsi1,*
  • 1: National Research Institute of Electronics and Cryptology-TUBITAK
*Contact email: ealparslan@uekae.tubitak.gov.tr, bahsi@uekae.tubitak.gov.tr

Abstract

This article introduces a methodology for classifying the security level of confidential documents written in Turkish. Internal documents of TUBITAK UEKAE, holding various security levels (unclassified, restricted, secret), were classified using Support Vector Machines (SVMs) [1] and naïve Bayes classifiers [3][9]. To represent term-document relations, the recommended "TF-IDF" metric [2] was chosen to construct a weight matrix. Compared with English, Turkic languages pose a particularly difficult natural language processing problem: stemming. A Turkish stemming tool, "zemberek", was used to extract features with their suffixes removed. Experimental results and success metrics are presented at the end of the article.
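
As an illustration of the TF-IDF weight matrix mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes each document has already been reduced to a list of stemmed tokens (the stemming step itself is not shown), and it uses one common TF-IDF variant, relative term frequency multiplied by log(N / document frequency); the abstract does not state which variant the paper uses. The function name `tfidf_matrix` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] is the TF-IDF weight
    of term vocabulary[j] in document i.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vocabulary = sorted(df)
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        row = [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocabulary]
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical, already-stemmed token lists standing in for three documents.
docs = [
    ["gizli", "rapor", "proje"],
    ["rapor", "plan", "duyuru"],
    ["gizli", "anahtar", "proje", "gizli"],
]
vocabulary, weights = tfidf_matrix(docs)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` is the feature vector of one document, which is the form a classifier such as an SVM or naïve Bayes model would take as input. Terms that occur in every document receive weight zero under this variant, so they contribute nothing to the classification.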