Research Article
Support Vector Machine based Breast Cancer Classification using Next Generation Sequences
@INPROCEEDINGS{10.4108/eai.16-5-2020.2303953,
  author={Babymol Kurian and V. L. Jyothi},
  title={Support Vector Machine based Breast Cancer Classification using Next Generation Sequences},
  proceedings={Proceedings of the First International Conference on Advanced Scientific Innovation in Science, Engineering and Technology, ICASISET 2020, 16-17 May 2020, Chennai, India},
  publisher={EAI},
  proceedings_a={ICASISET},
  year={2021},
  month={1},
  keywords={support vector machine supervised machine learning breast cancer multiple classification next generation sequencing},
  doi={10.4108/eai.16-5-2020.2303953}
}
- Babymol Kurian
- V. L. Jyothi
Year: 2021
ICASISET
EAI
DOI: 10.4108/eai.16-5-2020.2303953
Abstract
Next Generation Sequencing is indispensable for developing better approaches to predicting and curing diseases with a high success rate within a reasonable timeline. Modern technologies such as machine learning support medical research with high speed and accuracy, from disease prediction to cure. In this paper, a supervised learning model, the Support Vector Machine (SVM), is applied to next generation sequences for the prediction of breast cancer. Ten basic features of DNA sequences are used to build the feature vector: the average count of each individual nucleobase (A, G, C, T), AT-content, GC-content, AT/GC composition, G-quadruplex occurrence, ORF (Open Reading Frame) count and MR (Mutation Rate). The feature vectors, together with their class values, form the dataset for supervised learning. Each sequence is labelled with a class value of ‘0’ for normal sequences, ‘1’ for BRCA1 cancer sequences and ‘2’ for BRCA2 cancer sequences. Four datasets are prepared, with 50, 100, 150 and 200 sequences for each of the three classes (normal, BRCA1 and BRCA2). As the dataset size increased, the outliers, distribution and scatter of the data were also analysed. Each dataset is split into training and testing sets in an 80:20 ratio for classification. An SVM model implemented in Python is used for the supervised classification.
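The ten-feature vector described in the abstract could be assembled roughly as in the sketch below. The abstract does not define the features precisely, so every definition here is an assumption made for illustration: base "average counts" are taken as per-base frequencies, the G-quadruplex motif is a commonly used regular expression, ORFs are counted on the forward strand only, and the mutation rate is the fraction of positions that differ from a reference sequence. The names `feature_vector`, `G4_PATTERN` and `ORF_PATTERN` are hypothetical and do not come from the paper.

```python
import re

# Assumed G-quadruplex motif: four runs of 3+ G separated by loops of 1-7 bases.
G4_PATTERN = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}"
)
# Assumed ORF definition: ATG, then in-frame codons up to the first stop codon.
ORF_PATTERN = re.compile(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)")


def feature_vector(sequence, reference):
    """Return the ten features in the order listed in the abstract.

    [A, G, C, T, AT-content, GC-content, AT/GC, G4 count, ORF count, MR]
    All definitions are illustrative assumptions, not the authors' method.
    """
    seq = sequence.upper()
    ref = reference.upper()
    if not seq:
        raise ValueError("sequence must not be empty")

    n = len(seq)
    # Per-base frequencies (assumed meaning of "average count").
    a, g, c, t = (seq.count(base) / n for base in "AGCT")
    at_content = a + t
    gc_content = g + c
    # Guard against division by zero for sequences without G or C.
    at_gc_ratio = at_content / gc_content if gc_content else 0.0

    # Non-overlapping motif matches on the forward strand only.
    g4_count = sum(1 for _ in G4_PATTERN.finditer(seq))
    orf_count = sum(1 for _ in ORF_PATTERN.finditer(seq))

    # Assumed mutation rate: mismatching positions over compared positions.
    compared = min(len(seq), len(ref))
    mismatches = sum(1 for x, y in zip(seq, ref) if x != y)
    mutation_rate = mismatches / compared if compared else 0.0

    return [a, g, c, t, at_content, gc_content, at_gc_ratio,
            g4_count, orf_count, mutation_rate]


# Toy example: one ORF (ATG AAA TAA) followed by one G-quadruplex motif.
sample = "ATGAAATAAGGGAGGGAGGGAGGG"
reference = "ATGAAATAAGGGAGGGAGGGAGGC"  # differs from sample at the last base
vector = feature_vector(sample, reference)
label = 1  # class value: 0 = normal, 1 = BRCA1, 2 = BRCA2
```

Each such vector, paired with its class value, would form one row of the dataset. The rows could then be split 80:20 and passed to an SVM implementation, for example scikit-learn's `SVC`; the abstract does not say which Python library the authors used.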