Data Driven Prognosis of Cervical Cancer Using Class Balancing and Machine Learning Techniques

INTRODUCTION: The integration of technology with health care services has brought substantial benefits. Artificial intelligence and machine learning techniques continue to improve on existing statistical methods in the medical field, and these improved methods can help health care providers deliver intelligent medical services. OBJECTIVES: Cervical cancer is the fourth most common cancer among women, and it is preventable with early diagnosis. This motivates the present work: an efficient and timely prognosis of cervical cancer requires a computer-assisted algorithm. METHODS: This paper applies machine learning techniques to the diagnosis of cervical cancer. The algorithms used are K Nearest Neighbour, Support Vector Machine and Random Forest Tree. These classifiers are combined with class balancing techniques, namely undersampling, oversampling and SMOTE. RESULTS: The evaluation metrics used for the comparative analysis are accuracy, sensitivity, specificity, negative predictive accuracy and positive predictive accuracy. The results show that the Random Forest algorithm with SMOTE delivers more promising results than SVM and KNN for all four target variables: Schiller, Biopsy, Hinselmann and Cytology. CONCLUSION: The proposed model yields promising results even though the available data are limited and imbalanced.


Introduction
Cancer is the second leading cause of death worldwide. Approximately 9.6 million deaths in 2018 were due to cancer [1]; globally, one out of six deaths is due to this disease. Cervical cancer is the fourth most common cancer among women. It is a cancer of the cervix caused by human papillomavirus (HPV). The virus causes abnormal development of cells in the cervix, which can progress to the malignant stage. The progression from the pre-cancerous to the cancerous stage takes one or two decades, so the disease is preventable with timely diagnosis and prognosis. Machine learning techniques have achieved state-of-the-art results in various medical applications. This motivates the development of a computer-based algorithm that can assist health care providers in giving a timely diagnosis and thereby reduce mortality. The experimental work presented in this paper uses the cervical cancer dataset publicly available in the UCI repository [2]. Several machine learning classification algorithms are applied with the aim of achieving high sensitivity, so that no patient with cancer is left untreated. The classifiers used are KNN, SVM-Linear, SVM-Poly and random forest. They are applied together with several class balancing techniques: random undersampling, random oversampling and SMOTE. The ensemble algorithm random forest with SMOTE gives more promising results than the other classification methods. The distinguishing feature of this study is that the proposed model yields promising results even though the available data are limited and imbalanced. The work has implications for health care providers, who can use the model to serve a large population by giving patients a timely diagnosis.
The remainder of the paper is organized as follows. Section 2 surveys recently published papers. Section 3 covers machine learning and the approaches used in the experiments. Section 4 describes the dataset, the class balancing techniques and the experimental results. Finally, the conclusion and future work are discussed in Section 5.

Related Work
This section reviews publications that work on numerical clinical values. The authors of [3] described a cervical cancer prediction model using multilayer perceptron and random forest approaches. The experiment was conducted on the cervical cancer dataset publicly available in the UCI repository. The dataset consists of 858 patients and 4 target variables. The random forest classifier reported the highest prediction accuracy, 97.6%, for the Hinselmann target variable.
In 2019, Dhwaani Parikh and Vineet Menon [4] demonstrated the use of various machine learning algorithms on the same cervical cancer dataset from the UCI repository. The authors reported that the K-NN model gives better accuracy, F1-score, recall and precision. In a different study, Wu and Zhou [5] examined principal component analysis (PCA) for dimensionality reduction with an SVM classifier for the prediction of cervical cancer. They achieved an accuracy of 90%, a sensitivity of 100% and a specificity of 88%, but their study lacks an explanation of the use of PCA for feature selection.
In [6] the authors presented a comparative analysis of 15 machine learning algorithms for diagnosing cervical cancer. They used the Pap smear benchmark database prepared by Herlev University Hospital. The 15 algorithms were applied to two datasets, an old and a new one, which consist of 500 and 917 single-cell Pap smear images respectively. Among the 15 algorithms, Ensemble of Nested Dichotomies (END) performed best on both datasets, with an accuracy of 77.38% on the first dataset and 78.28% on the second. The study also shows that Naive Bayes is the worst performer, with an accuracy of about 50% and 60% on the first and second dataset respectively. The study in [8] deals with the segmentation of cervical images. The segmentation is obtained by applying a fuzzy edge detection method that separates the cytoplasm from the nucleus. The method was applied to the old Pap smear dataset (Herlev University Hospital), which originally consisted of 917 images that are further described using 20 features. After segmentation, neural networks were applied to the modified dataset to classify the data into two classes, normal and abnormal. Each network was trained, tested and validated with 70%, 15% and 15% of the samples respectively. Performance was evaluated using the mean squared error (MSE), and the algorithm with the lowest MSE was considered the best. The study shows that RBF has the highest classification accuracy, 100%, but also a high MSE value of 1, so RBF cannot be used for classification because its output would be more error-prone. MLP, in contrast, gives a classification accuracy of 92.03% with a small MSE value of 0.0616. The work concludes that the MLP network outperformed the other three networks.
The work in [9] differs from the studies above in that it uses biopsy test data instead of Pap smear data for the prediction of cervical cancer. To classify the data into normal or cancerous cervix, the authors applied data mining algorithms to numerical biopsy data. The data were collected from NCBI (National Center for Biotechnology Information) and consist of 500 records and 61 biopsy features, including a gene identifier. A sample of 100 records was selected for training and testing. On this sample, the Classification and Regression Tree algorithm (CART), the Random Forest Tree algorithm (RFT) and RFT with the K-Means learning algorithm were applied to classify the data into normal and cancerous cervix. The study reveals that the proposed hybrid algorithm, RFT with K-Means, outperformed CART and RFT, with an accuracy of 96.77% on the NCBI biopsy data.
EAI Endorsed Transactions on Energy Web 09 2020 -11 2020 | Volume 7 | Issue 30 | e2
In [10] Prableen Kaur and Manik Sharma presented a comprehensive review of supervised machine learning and nature-inspired computing techniques used in the analysis of human psychological disorders. Their analysis revealed that supervised machine learning techniques achieve an accuracy of more than 90% in identifying psychological disorders.
In [11] Manogaran et al. demonstrated the use of big data analytics and machine learning algorithms for identifying changes in DNA sequencing. Big data analytics is commonly used in applications related to the study of DNA. The work justifies the combination of big data analytics with machine learning techniques by achieving an accuracy of 86.55%.
In [12] the authors review the available literature on the diagnosis of cancer and diabetes using five insect-based optimization techniques: Ant Colony Optimization (ACO), Artificial Bee Colony (ABC), Glow-Worm Swarm Optimization (GSO), Firefly Algorithm (FA) and Ant Lion Optimization (ALO). The study highlights two findings. First, most disease diagnosis work has been carried out using ACO, whereas GSO was found to be the least explored. Second, high predictive accuracy was achieved by hybridizing ACO with a neural network.
In [13] the author presented a detailed review of machine learning algorithms used in the prognosis of breast cancer. The commonly used algorithms are the support vector machine, decision tree, K-nearest neighbor and artificial neural network. The data used for the experiments were drawn from the Wisconsin Breast Cancer Database (WBCD), a benchmark dataset for breast cancer.
This survey of recently published studies justifies the use of ML techniques not only for cancer prediction but also for other chronic diseases. After analyzing the results and architectures discussed in these studies, the best-suited algorithms were chosen for this research so that a reliable machine learning model can be developed to predict cervical cancer. Most of the recently published studies do not report sensitivity, which is an important evaluation measure because the UCI cervical cancer dataset suffers from a heavy class imbalance. The work presented in this paper therefore uses ML techniques for the early diagnosis and prognosis of cervical cancer and evaluates them using sensitivity along with accuracy and specificity.

Machine learning (ML) and its Approaches
Machine learning is a technique for solving artificial intelligence problems. It consists of a set of learning algorithms that are further classified into supervised and unsupervised learning. Supervised learning algorithms work with labelled data, whereas unsupervised algorithms work with unlabelled data. The most commonly used ML techniques are K-Nearest Neighbour (KNN), Support Vector Machine (SVM) and Random Forest (RF), which are briefly described in this section.

K-Nearest Neighbour (KNN)
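K Nearest Neighbour is a classification technique that assigns an instance to the class most common among its k nearest training instances. It is non-parametric, as it does not build an explicit model. The value of k is chosen based on the training data. Nearness is measured with a distance function; common choices are the Euclidean, Manhattan, Minkowski and Hamming distances. The Euclidean, Manhattan and Minkowski distances are used when the values in the dataset are continuous, whereas the Hamming distance is used when the data are categorical. For two instances x and y with n attributes, the distance functions are given below.
$$ D_{\mathrm{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \quad [16] \quad (1) $$
$$ D_{\mathrm{Manhattan}}(x, y) = \sum_{i=1}^{n} |x_i - y_i| \quad [17] \quad (2) $$
$$ D_{\mathrm{Minkowski}}(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^{q} \right)^{1/q} \quad [18] \quad (3) $$
$$ D_{\mathrm{Hamming}}(x, y) = \sum_{i=1}^{n} [x_i \neq y_i] \quad (4) $$
The Hamming distance counts the attributes on which two instances differ. When numerical and categorical variables are mixed, the numerical variables are standardized between 0 and 1 by normalization.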
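The distance functions and the neighbour vote described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: the function names, the toy data and the choice k = 3 are assumptions made for the example.

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, q):
    """Minkowski distance of order q; q=1 is Manhattan, q=2 is Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def hamming(x, y):
    """Number of positions at which two categorical vectors differ."""
    return sum(a != b for a, b in zip(x, y))

def knn_predict(train_X, train_y, query, k=3, distance=euclidean):
    """Label `query` by majority vote among its k nearest training points."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): 0 = benign, 1 = malignant.
X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9)]
y = [0, 0, 0, 1, 1]
prediction = knn_predict(X, y, (3.8, 4.0), k=3)
```

In the toy example the two nearest neighbours of the query belong to class 1 and the third to class 0, so the vote returns class 1.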

Random Forest Tree (RFT)
Random Forest Tree (RFT) is a supervised machine learning technique widely used in cervical cancer prediction. The technique was introduced by Leo Breiman [20]. RFT is used for solving both classification and regression problems. In this technique, multiple trees are generated and each tree casts a "vote" for the target class. In a classification problem the forest selects the class that receives the most votes, and in a regression problem the average of the outputs of the individual trees is computed.

For predicting a continuous variable, the trees of a Random Forest are grown depending on a random vector $\Theta$ such that the tree predictor $h(x, \Theta)$ takes on numeric values. The values of the response variable are numeric, and the training sample is assumed to be drawn independently from the distribution of the random vector $(X, Y)$. Equation (5) gives the mean squared generalization error for a numeric predictor $h(x)$.
$$ E_{X,Y}\,(Y - h(X))^2 \quad [21] \quad (5) $$
The Random Forest predictor is constructed by taking the mean over k of the trees $\{h(x, \Theta_k)\}$. Given the right kind of randomness, Random Forests tend to be accurate and effective in prediction.
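The aggregation step of a forest, a majority vote for classification and a mean over the k trees for regression, can be illustrated as follows. The sketch covers only the aggregation, not tree growing or bootstrap sampling, and the "trees" are hypothetical threshold rules standing in for trained tree predictors.

```python
from collections import Counter
from statistics import mean

def forest_classify(trees, x):
    """Each tree casts one vote; the forest returns the class with the most votes."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

def forest_regress(trees, x):
    """The forest prediction is the mean of the k individual tree outputs."""
    return mean(tree(x) for tree in trees)

def mean_squared_error(predictor, X, Y):
    """Empirical counterpart of the generalization error E[(Y - h(X))^2]."""
    return mean((y - predictor(x)) ** 2 for x, y in zip(X, Y))

# Hypothetical stand-ins for trained trees: simple rules on a single feature.
class_trees = [lambda x: int(x > 2.0), lambda x: int(x > 3.0), lambda x: int(x > 5.0)]
reg_trees = [lambda x: 2.0 * x, lambda x: 2.0 * x + 1.0, lambda x: 2.0 * x - 1.0]

label = forest_classify(class_trees, 4.0)   # votes are 1, 1, 0
value = forest_regress(reg_trees, 3.0)      # mean of 6.0, 7.0, 5.0
error = mean_squared_error(lambda x: forest_regress(reg_trees, x), [1.0, 2.0], [2.0, 5.0])
```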

Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised machine learning technique introduced by Vapnik [22]. It works for both linear and non-linear datasets. The principle behind SVM is to maximize predictive accuracy while avoiding overfitting. Data that are not linearly separable are mapped into a higher-dimensional space in which they become linearly separable. SVM then constructs hyperplanes to classify the dataset.
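The core of the SVM algorithm is the minimization
$$ \min_{\theta}\; C \sum_{i=1}^{m} \left[ y^{(i)}\,\mathrm{cost}_1(\theta^{T} x^{(i)}) + (1 - y^{(i)})\,\mathrm{cost}_0(\theta^{T} x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^{2} \quad [23] \quad (6) $$
where C is the error penalty parameter and $\mathrm{cost}_1$ and $\mathrm{cost}_0$ are the cost functions for y = 1 and y = 0 respectively.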
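To make the roles of C and the two cost functions concrete, the objective can be evaluated numerically. The sketch below assumes the common hinge-shaped costs, max(0, 1 - z) for y = 1 and max(0, 1 + z) for y = 0, and an unregularized bias term theta[0]; these choices, the function names and the toy data are assumptions of the example and are not taken from the paper.

```python
def cost_1(z):
    """Assumed cost for a positive example (y = 1): zero once z >= 1."""
    return max(0.0, 1.0 - z)

def cost_0(z):
    """Assumed cost for a negative example (y = 0): zero once z <= -1."""
    return max(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    """C times the summed per-example cost, plus half the squared weight norm.

    theta[0] is the bias and is left out of the regularization term
    (a common convention, assumed here).
    """
    data_term = 0.0
    for features, label in zip(X, y):
        z = theta[0] + sum(t * f for t, f in zip(theta[1:], features))
        data_term += label * cost_1(z) + (1 - label) * cost_0(z)
    reg_term = 0.5 * sum(t * t for t in theta[1:])
    return C * data_term + reg_term

# Toy one-feature data (made up): the third point lies inside the margin.
X = [(2.0,), (-2.0,), (0.5,)]
y = [1, 0, 1]
objective_small_c = svm_objective([0.0, 1.0], X, y, C=1.0)
objective_large_c = svm_objective([0.0, 1.0], X, y, C=10.0)
```

A larger C weights the margin violation of the third point more heavily relative to the regularization term, which is why the second value is larger than the first.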

Dataset
The cervical cancer dataset used in the experiment is obtained from the UCI repository. It contains the clinical history of 858 patients, described by 32 attributes and 4 labels (Schiller, Hinselmann, Biopsy, and Cytology), as listed in Table 2. The four target variables are described below.
• Hinselmann: This test was developed by Mr. Hinselmann, who created a tool used for visual inspection of the cervix at a magnified scale [24].

Proposed Architecture
The proposed architecture, depicted in Figure 1, describes the flow of the study. The data consist of 858 clinical records and 32 attributes. The data suffer from two major issues: class imbalance and missing values. The first step taken to address these issues is data pre-processing. Before the data are fed to the predictive model, the missing value issue is resolved by eliminating two features, "STDs: Time since first diagnosis" and "STDs: Time since the last diagnosis", as they do not contain enough data. After these two columns are eliminated, the 190 instances that still contain missing values ('?', Null) are also dropped, so 668 records of the raw dataset remain for the experiment. The imbalance issue is resolved through three class balancing techniques: RUS, ROS and SMOTE. The balanced data obtained with each of these methods are shown in Figures 2-5. The cleaned data are then divided into training and test sets in a 70:30 ratio. The training set is fed to four predictive models: the random forest classifier, K-nearest neighbor, support vector machine with a linear kernel and support vector machine with a polynomial kernel. The performance of each model is analyzed on the test data by comparing the predicted results with the actual results.
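The cleaning and splitting steps can be sketched without any external library. The snippet is a minimal illustration on made-up records; the helper names, the fixed random seed and the toy columns are assumptions, and the column name is simply copied from the description above.

```python
import random

MISSING = ("?", None)

def clean(records, drop_columns):
    """Drop the sparse columns, then drop every record that still has a missing value."""
    kept = []
    for record in records:
        row = {k: v for k, v in record.items() if k not in drop_columns}
        if all(v not in MISSING for v in row.values()):
            kept.append(row)
    return kept

def split_train_test(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and split it into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy records (made up): one sparse column and one record with a missing age.
sparse = "STDs: Time since first diagnosis"
toy = [{"Age": 20 + i, sparse: "?", "Biopsy": i % 2} for i in range(10)]
toy[3]["Age"] = "?"

cleaned = clean(toy, drop_columns=(sparse,))
train, test = split_train_test(cleaned)
```

Dropping the sparse column before filtering rows matters: if the rows were filtered first, every toy record would be discarded because of the "?" in that column.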

Class Balancing Techniques
The data are highly imbalanced for all four targets, so they need to be balanced before being fed to a predictive model in order to obtain good results. Three class balancing techniques are therefore applied: random under-sampling, random over-sampling and SMOTE. The distribution of the data after applying the class balancing techniques to all four target variables is shown in Figure 2 to Figure 5. The first set of bars in Figures 2-5 corresponds to the original data distribution, which shows the imbalance between benign and malignant records. The second and third sets of bars represent equal numbers of malignant and benign instances. In the fourth set of bars, the malignant cases are synthetically scaled up to match the benign cases.

Random Under Sampling (RUS)
In this technique the number of instances of the majority class is reduced in order to balance the data. Instances are selected at random from the majority class until their number equals the number of instances in the minority class, which remains unchanged. This is also known as the down-sampling technique.
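A minimal sketch of random under-sampling is given below. The function name, the fixed seed and the toy labels are assumptions made for illustration only.

```python
import random
from collections import Counter

def random_under_sample(X, y, seed=0):
    """Randomly keep as many instances of each class as the minority class has."""
    rng = random.Random(seed)
    n_min = min(Counter(y).values())
    keep = []
    for label in set(y):
        indices = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(indices, n_min))  # minority class is kept whole
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data (made up): 8 benign (0) and 2 malignant (1) records.
X = list(range(10))
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_under_sample(X, y)
```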

Random Over Sampling (ROS)
In this technique, the number of instances of the minority class is increased to match the number of instances of the majority class. This is done by duplicating random instances of the minority class. All the features of the original data set are thus preserved [28], as no instance is dropped. This technique is also known as up-sampling.
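Random over-sampling can be sketched in the same spirit. Again, the function name, the seed and the toy labels are illustrative assumptions.

```python
import random
from collections import Counter

def random_over_sample(X, y, seed=0):
    """Duplicate random instances of each smaller class until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_max = max(counts.values())
    X_out, y_out = list(X), list(y)  # every original instance is kept
    for label, count in counts.items():
        indices = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(n_max - count):
            X_out.append(X[rng.choice(indices)])
            y_out.append(label)
    return X_out, y_out

# Toy data (made up): 8 benign (0) and 2 malignant (1) records.
X = list(range(10))
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_over_sample(X, y)
```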

Synthetic Minority Over-sampling Technique (SMOTE)
Like ROS, this technique increases the number of instances of the minority class. The difference is that SMOTE generates new samples synthetically using a nearest neighbor approach [29], whereas ROS simply duplicates samples already available in the minority class.
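The nearest neighbor interpolation behind SMOTE can be sketched as follows. This is a simplified variant written for illustration: it picks a random minority sample for every synthetic point rather than iterating over all minority samples, and the function name, k, the seed and the toy points are assumptions.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority points on the segments between a sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        others = [p for j, p in enumerate(minority) if j != base_idx]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority class (made up): four points on the corners of a unit square.
minority = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
new_points = smote(minority, n_synthetic=6, k=2)
```

Because every synthetic point lies between two existing minority points, the new samples stay inside the region already occupied by the minority class instead of repeating existing records.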

Simulation Experimental Results and Analysis
The evaluation measures used for the model are accuracy, sensitivity, specificity, positive predictive accuracy and negative predictive accuracy. Because the dataset is imbalanced, accuracy cannot be taken as the only criterion for evaluating the performance of the model. Sensitivity and specificity therefore play a major role in identifying true positive and false negative cases. The formulas used to calculate accuracy, sensitivity, specificity, positive predictive accuracy and negative predictive accuracy are given in
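For reference, the standard confusion-matrix definitions of these five measures can be sketched as below. The function names and the toy labels are assumptions; positive and negative predictive accuracy are taken here to mean TP / (TP + FP) and TN / (TN + FN) respectively.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, true negatives and false negatives."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def ratio(numerator, denominator):
    """Return NaN instead of raising when a measure is undefined."""
    return numerator / denominator if denominator else float("nan")

def evaluate(y_true, y_pred, positive=1):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "positive_predictive": ratio(tp, tp + fp),
        "negative_predictive": ratio(tn, tn + fn),
    }

# Toy labels (made up): 3 malignant (1) and 7 benign (0) records.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
scores = evaluate(y_true, y_pred)
naive = evaluate(y_true, [0] * 10)  # a model that always predicts benign
```

The always-benign model in the toy example still reaches 70% accuracy while its sensitivity is zero, which illustrates why accuracy alone is misleading on imbalanced data.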

Comparative Analysis
The comparative analysis of accuracy for the four machine learning classifiers (random forest, KNN, SVM-Poly and SVM-Linear) on all four target variables is shown in Figure 6. It shows that the Random Forest Tree with the SMOTE class balancing technique gives the highest accuracy for Cytology, Schiller, Hinselmann and Biopsy. The KNN classifier performs worst for Cytology, Schiller and Biopsy. Similarly, Figure 7 presents the analysis of sensitivity for the four classifiers. The random forest classifier gives the highest sensitivity for all four target variables, and the KNN classifier gives low sensitivity. Thus, for both accuracy and sensitivity, the random forest algorithm with the SMOTE balancing technique outperformed all the other classifiers.

Conclusion
This paper described several ML classifiers and class balancing techniques. The most commonly used ML techniques, namely K-NN, SVM and random forest tree, were chosen for the experimental work. The data used in the experiment are the cervical cancer dataset publicly available in the UCI repository. The data obtained from the repository were imbalanced, so three class balancing techniques, ROS, RUS and SMOTE, were used to balance the data before feeding them to the proposed model. The outcome of the study is a predictive model for cervical cancer. The results show that the random forest algorithm performs best with SMOTE for all four target variables (Schiller, Biopsy, Hinselmann and Cytology), whereas KNN is the weakest performer. As future work, we will combine dimensionality reduction techniques with the classifiers to study their influence. A second direction is multi-class classification over the four targets.