Proceedings of the First International Conference on Advanced Scientific Innovation in Science, Engineering and Technology, ICASISET 2020, 16-17 May 2020, Chennai, India

Research Article

Support Vector Machine based Breast Cancer Classification using Next Generation Sequences

    @INPROCEEDINGS{10.4108/eai.16-5-2020.2303953,
        author={Babymol Kurian and V. L. Jyothi},
        title={Support Vector Machine based Breast Cancer Classification using Next Generation Sequences},
        proceedings={Proceedings of the First International Conference on Advanced Scientific Innovation in Science, Engineering and Technology, ICASISET 2020, 16-17 May 2020, Chennai, India},
        publisher={EAI},
        proceedings_a={ICASISET},
        year={2021},
        month={1},
        keywords={support vector machine, supervised machine learning, breast cancer, multiple classification, next generation sequencing},
        doi={10.4108/eai.16-5-2020.2303953}
    }
    
Babymol Kurian1,*, V. L. Jyothi2
  • 1: Sathyabama Institute of Science and Technology, Chennai
  • 2: Department of Computer Science & Applications, Guru Shree Shanthi Vijai Jain College, Chennai
*Contact email: babymolkurian@gmail.com

Abstract

Next Generation Sequencing is essential to better approaches for predicting and curing diseases with a high success rate within a reasonable time. Modern techniques such as machine learning support medical research with high speed and accuracy, from disease prediction to cure. In this paper, a supervised learning model, the Support Vector Machine (SVM), is applied to next generation sequences for the prediction of breast cancer. Ten basic features of DNA sequences are used to frame the feature vector: the average counts of the individual nucleobases A, G, C and T, AT-content, GC-content, AT/GC composition, G-quadruplex occurrence, ORF (Open Reading Frame) count and MR (Mutation Rate). The feature vectors, together with the class value, form the dataset for supervised learning. Each sequence is labelled with a class value: ‘0’ for normal sequences, ‘1’ for BRCA1 cancer sequences and ‘2’ for BRCA2 cancer sequences. Four datasets are prepared, with 50, 100, 150 and 200 sequences for each of the three classes (normal, BRCA1 and BRCA2). As the dataset size increases, the outliers, distribution and scatter of the data are also analysed. Each dataset is split into training and testing sets in an 80:20 ratio, and an SVM model in Python is applied for supervised classification.
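
As an illustration of the ten-feature vector described in the abstract, a minimal Python sketch is given below. It is not the authors' implementation. The abstract does not define how each feature is computed, so every definition here is an assumption: the "average count" of a nucleobase is taken as its fraction of the sequence length, G-quadruplex occurrence is counted with the commonly used G3+N1-7 motif, ORFs are counted on the forward strand only, and the mutation rate is the fraction of positions differing from a reference sequence. The function names are hypothetical.

```python
import re

# Assumed G-quadruplex motif: four runs of 3+ G separated by loops of 1-7 bases.
G4_PATTERN = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def count_orfs(seq):
    """Count forward-strand ORFs: an ATG followed in frame by a stop codon."""
    count = 0
    for frame in range(3):
        in_orf = False
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if not in_orf and codon == "ATG":
                in_orf = True
            elif in_orf and codon in STOP_CODONS:
                count += 1
                in_orf = False
    return count


def mutation_rate(seq, reference):
    """Fraction of compared positions that differ from the reference (assumed definition)."""
    n = min(len(seq), len(reference))
    if n == 0:
        return 0.0
    return sum(a != b for a, b in zip(seq, reference)) / n


def feature_vector(seq, reference):
    """Return the ten features, in a fixed order, for one DNA sequence."""
    seq, reference = seq.upper(), reference.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    a, g, c, t = (seq.count(base) / n for base in "AGCT")
    at, gc = a + t, g + c
    at_gc_ratio = at / gc if gc else 0.0  # AT/GC composition
    g4 = len(G4_PATTERN.findall(seq))     # non-overlapping motif matches
    return [a, g, c, t, at, gc, at_gc_ratio, g4,
            count_orfs(seq), mutation_rate(seq, reference)]
```

Under these assumptions, each sequence yields one row of ten numbers, to which the class value (0, 1 or 2) is attached. Such rows could then be split 80:20 with scikit-learn's `train_test_split(X, y, test_size=0.2)` and fitted with `sklearn.svm.SVC`. The abstract does not state the kernel or other SVM parameters, so those would have to be taken from the full paper.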