A Robust Feature Extraction Technique for Breast Cancer Detection using Digital Mammograms based on Advanced GLCM Approach
L Kanya Kumari, B Naga Jagadesh

INTRODUCTION: Breast cancer is among the most dangerous diseases affecting women worldwide. Mammographic imaging is a simple, cost-effective, and efficient screening method used to find breast abnormalities and detect breast cancer in its early stages, so that the patient's health can be improved. OBJECTIVES: The main challenge is to extract features from pre-processed images using a novel technique called Advanced Gray-Level Co-occurrence Matrix (AGLCM) and to classify the images using machine learning algorithms. METHODS: To achieve this, we propose a four-step process: image acquisition, pre-processing, feature extraction, and classification. First, a pre-processing technique called Contrast Limited Adaptive Histogram Equalization (CLAHE) is used to increase the contrast of the images. The features are then retrieved using AGLCM, which extracts texture, intensity, and shape-based features, as these are important for identifying abnormalities. RESULTS: In our framework, the eXtreme Gradient Boosting (XGBoost) classifier is applied to mammograms, and the results are compared with other classifiers such as Random Forest (RF), K-Nearest Neighbor (KNN), Artificial Neural Networks (ANN), and Support Vector Machine (SVM). The experiments are carried out on the Mammographic Image Analysis Society (MIAS) dataset. CONCLUSION: The outcome achieved with the CLAHE+AGLCM+XGBoost classifier is better than that of the existing methods. In future work, we plan to experiment on larger datasets and to concentrate on optimal feature selection to improve classification.


Introduction
Breast Cancer (BC) is a prominent cancer among women in the 40-year age group [1]. The chances of survival are very low once it reaches an advanced stage. The survival rate of BC patients is very low in India because of delays in the detection of tumors. At present, several imaging modalities such as mammography, tomosynthesis, magnetic resonance imaging, and ultrasonography are used for BC detection. Among these modalities, mammography is the most cost-effective for detecting BC in the early stages [2]. But it is a [...]
[...] which is otherwise difficult for radiologists, pathologists, or physicians. It not only helps in diagnosis but also helps to determine the stage of cancer, thereby enabling doctors to give appropriate drugs to the patients [7]. The Machine Learning (ML) techniques that are most helpful in cancer detection are Artificial Neural Networks (ANN) [8] and Decision Trees (DT) [9]. For cancer prediction and prognosis, the commonly used ML techniques are K-Nearest Neighbors (KNN), Bayesian Networks (BN), and Support Vector Machines (SVM) [10]. SVM has been used to detect heart disease, breast cancer, ovarian cancer, multiple myeloma, and leukemia [11].
The main contributions of the proposed methodology are:
a. The mammogram images are pre-processed using CLAHE.
b. We propose a new feature extraction technique called AGLCM, which extracts texture (GLCM), intensity (entropy), and shape-based (Fourier descriptor) features. These features are fed to different classifiers for experimental comparison.
c. The proposed methodology (CLAHE+AGLCM) is evaluated using the XGBoost classifier and compared with other classifiers such as KNN, ANN, SVM, and RF.
d. The performance of these methods is evaluated using confusion matrix parameters and the misclassification rate.
e. The results show that CLAHE+AGLCM with XGBoost is superior to previous works by other authors [12] [13] [16].
The proposed methodology provides better mammogram classification using CLAHE+AGLCM.
The remainder of the paper is arranged as follows. Section 2 presents the related work. Section 3 discusses the proposed methodology, describing the dataset, the CLAHE pre-processing technique, the AGLCM feature extraction technique, and the XGBoost classifier. Section 4 presents the results, followed by the conclusion and future scope.

Related Work
Extensive research has been done in the domain of mammogram classification. The majority of the literature uses feature extraction techniques based on texture, intensity, shape, or a combination of these features.
In [12], the Gray Level Co-occurrence Matrix (GLCM) was used for feature extraction. Feature selection techniques were applied, and an accuracy of 94.27% was obtained with a neural network classifier. In [13], the authors used Gabor features extracted from the Digital Database for Screening Mammography (DDSM) dataset. These features were optimized using Particle Swarm Optimization (PSO) and classified using SVM. The accuracy was 93.95% in classifying the images as benign or malignant. In [14], features were extracted using an intensity histogram and feature radial distance. The Enhanced Cuckoo Search (ECS) algorithm was used for feature selection, and the authors concluded that KNN with ECS achieved 99.13% accuracy and the Minimum Distance Classifier with ECS achieved 98.75% accuracy.
Global thresholding was used for pre-processing and classifying the mammogram images [15]. Features were extracted using Laws' texture energy measures. Particle Swarm Optimized Wavelet Neural Networks were used to select the features. The sensitivity, specificity, and misclassification rate obtained were 94.167%, 92.105%, and 0.063, respectively.
The authors of [16] proposed a model to classify mammogram images based on the CLAHE pre-processing technique and Histogram of Oriented Gradients (HOG) features. They obtained a classification accuracy of 66% with the RF classifier.
In [17], the authors extracted GLCM features from the MIAS dataset. These features were passed to a hybrid classifier combining KNN and SVM, which achieved 94% classification accuracy. The authors of [18] used Gaussian filtering for pre-processing and extracted features using GLCM and the Gray Level Run Length Matrix (GLRLM). They achieved a sensitivity of 98% and a specificity of 97.8% with a feedforward network classifier.
The authors of [19] presented mammogram classification based on spiculation index, fractional concavity, and compactness features and achieved 80% accuracy. In [20], masses were detected by extracting GLCM features, and an Area Under the Curve (AUC) of 0.79 was obtained. Local descriptors have also played an important role in mammogram classification. The authors of [21] classified parenchymal tissue using local descriptors and probabilistic Latent Semantic Analysis (pLSA) and obtained an accuracy of 95.42%.
Breast density classification was performed in [22] based on local descriptors such as the Scale-Invariant Feature Transform (SIFT), Local Binary Patterns (LBP), and texton histograms. The feature vector was classified using SVM, and an accuracy of 93% was obtained. The authors of [23] focused on extracting Speeded-Up Robust Features (SURF) descriptors for mammogram classification and reported 92.3% accuracy.
In [24], a CAD system that extracted GLCM features was designed for tumor detection. These features were fed to an SVM classifier to classify the tumors, and 92.3% accuracy was reported. To detect breast cancer in the early stages, the authors of [25] extracted Hough transform features and classified them using SVM. They reported 94% accuracy for early detection.
From the literature, it can be seen that an effective and efficient Computer-Aided Diagnosis (CAD) system is required for BC detection in the early stages. Furthermore, texture features combined with a pre-processing technique give better results in the existing methodologies. In the literature, neural networks, SVM, and RF classifiers are widely used for classification. Further, the existing methods are based on the MIAS dataset. Based on the above factors, an improved feature extraction technique is needed to increase the overall performance of the CAD framework. We therefore propose a novel method called AGLCM to extract texture, intensity, and shape-based features, as all of these are helpful in the detection of breast cancer. Based on these features, the images are classified as normal or abnormal.

Proposed Methodology
The proposed CAD methodology is a four-step process: image acquisition, pre-processing, feature extraction, and classification. In our methodology, CLAHE is applied as a pre-processing technique that increases the contrast of the images so that better features can be extracted.
To extract the features, the AGLCM technique is used. AGLCM is a novel feature extraction technique that extracts texture, intensity, and shape-based features of the tumor in mammogram images. These features are used to classify the mammograms as normal or abnormal (0 indicates normal and 1 indicates abnormal). The efficiency of the proposed methodology is evaluated using a confusion matrix and the misclassification rate. Figure 1 depicts the framework of the proposed methodology.

Contrast Limited Adaptive Histogram Equalization
Pre-processing is an important phase that enhances the image features needed for further processing. It plays a vital role in medical imaging, as it leads to better extraction of features such as masses and tumors.
In the literature, authors have proposed a variety of contrast enhancement techniques such as Histogram Equalization (HE) [27], Median Filtering [28], and filtering with morphological operators and unsharp masking [29] to improve the visual content of mammograms [30]. In [31], the authors used Local Contrast Enhancement (LCE) to enhance the contrast in images and achieved better results.
In the same manner, we use a pre-processing technique called CLAHE to enhance the contrast in images, as it overcomes the problems of HE and Adaptive Histogram Equalization (AHE) [27]. The over-enhancement in AHE is reduced by using CLAHE [32]. CLAHE is an improvement of AHE in which the contrast enhancement is limited by a user-defined clip level. This method reduces the noise and edge-shadowing generated in homogeneous regions and was designed for medical imaging [33] [34]. It is used to remove artifacts such as wedges, labels, and markers in mammograms, and it makes suspicious or hidden regions more visible.
In CLAHE, the image is split into small parts called tiles, and the contrast of each tile is enhanced. To combine the tiles, bilinear interpolation is used to eliminate artifacts at the tile borders. The CLAHE steps are explained in Algorithm 1. The clip limit is set to 6.0 and the window size to 8×8. Algorithm 1 is applied to randomly chosen images from the MIAS dataset, namely mdb304, mdb076, mdb099, and mdb241. Table 1 shows these MIAS images together with their contrast-enhanced (HE, AHE, and CLAHE) versions.
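The clip-limit idea described above can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions and not a reproduction of Algorithm 1: the function names are hypothetical, the clip limit is interpreted as a multiple of the mean histogram bin height, the 8×8 setting is interpreted as a grid of tiles, and the bilinear blending between neighbouring tiles is omitted for brevity. The input is a synthetic low-contrast image, not a MIAS mammogram.

```python
import numpy as np

def clipped_equalization_map(tile, clip_limit=6.0, n_bins=256):
    """Build a gray-level lookup table for one uint8 tile."""
    hist = np.bincount(tile.ravel(), minlength=n_bins).astype(np.float64)
    # Assumption: the clip limit is a multiple of the mean bin height.
    limit = max(1.0, clip_limit * tile.size / n_bins)
    excess = np.sum(np.maximum(hist - limit, 0.0))
    # Clip the peaks and spread the clipped counts evenly over all bins.
    hist = np.minimum(hist, limit) + excess / n_bins
    cdf = np.cumsum(hist) / tile.size
    return np.round(cdf * (n_bins - 1)).astype(np.uint8)

def equalize_tiles(image, grid=(8, 8), clip_limit=6.0):
    """Apply the clipped mapping tile by tile (no interpolation)."""
    out = np.empty_like(image)
    row_edges = np.linspace(0, image.shape[0], grid[0] + 1, dtype=int)
    col_edges = np.linspace(0, image.shape[1], grid[1] + 1, dtype=int)
    for r0, r1 in zip(row_edges[:-1], row_edges[1:]):
        for c0, c1 in zip(col_edges[:-1], col_edges[1:]):
            tile = image[r0:r1, c0:c1]
            out[r0:r1, c0:c1] = clipped_equalization_map(tile, clip_limit)[tile]
    return out

# Synthetic low-contrast image used only for illustration.
rng = np.random.default_rng(0)
image = rng.integers(100, 140, size=(64, 64), dtype=np.uint8)
enhanced = equalize_tiles(image)
print("input range:", int(image.max()) - int(image.min()))
print("output range:", int(enhanced.max()) - int(enhanced.min()))
```

Library implementations such as OpenCV's `cv2.createCLAHE(clipLimit=..., tileGridSize=...)` additionally interpolate between the tile mappings, which removes the visible tile borders that this simplified version can produce.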

Advanced Gray Level Co-Occurrence Matrix
The features are extracted from the pre-processed images using the AGLCM technique. The performance of the classifier depends on how well the feature vector is calculated. The feature extraction technique examines the images to extract features that distinguish the classes. These features are given as input to the classifier, which assigns a class label to the test data.
In our proposed methodology, the features are extracted using the AGLCM technique, in which texture, intensity, and shape-based approaches are used, as these features are crucial in detecting tumors or masses in mammograms [35]. Texture features describe the spatial distribution of gray levels in an image, whereas shape features describe the lesion boundaries (rounded, spiculated, or stellate). Texture features are extracted using the Gray Level Co-occurrence Matrix (GLCM), entropy is used to extract intensity-based features [36], and the Fourier Descriptor is used for shape-based features. The combination of all these features is named AGLCM.
GLCM is a commonly used technique for extracting texture features [37]. It yields the distribution of gray-level pixel pairs in the image. The spatial relationship between two pixels is computed from a reference pixel and a neighbouring pixel. In GLCM, the matrix of gray-value pairs is called the Co-occurrence Matrix (CM). This matrix represents the relative frequencies of neighbouring pixels separated by the distance 'd'. The values in the matrix represent the frequency variations in the pixel intensities. The co-occurrence probabilities are calculated in 8 different directions 'θ' (0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°) at distance 'd'. The working procedure of the GLCM technique is represented in Figure 3.
Entropy gives information about the contents of the image. It measures the uncertainty or randomness of the intensity levels in the image [38]. It is calculated as in Equation 6 and is added to the GLCM feature vector as another feature.
Images from the MIAS dataset are considered, of which some are normal and some abnormal. GLCM features are effective in detecting breast cancer [39]. In our framework, texture and intensity-based features are combined with shape-based features, because the intensity and shape of the tumor, in addition to its texture, are important in finding abnormalities in mammograms. The shape of a tumor helps to identify whether it is normal or abnormal. Several shape descriptors are available in the literature for identifying the shape of a tumor. Among them, Fourier Descriptors (FDs) are extremely useful for pattern recognition [40] and are used to recognize the shape of the tumor in mammograms. An FD uses the Fourier-transformed boundary as the shape feature. These features are effective in representing shape and are invariant to rotation, translation, and scaling [14] [41]. [...] obtain the feature vector.
The step-by-step process of the AGLCM algorithm is described in Algorithm 2. Algorithm 2 is applied to the contrast-enhanced images to extract the features. The AGLCM feature vector for sample MIAS images is shown in Table 2.
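To make the co-occurrence computation concrete, the following Python sketch builds a normalized GLCM for a given pixel offset and derives the five texture properties named later in the paper, plus an entropy value. It is a sketch under assumptions rather than the authors' implementation: all function names are hypothetical, the matrix is made symmetric (so four offsets already cover the eight directions listed above), energy is taken as the square root of the sum of squared entries, and entropy is computed as the Shannon entropy of the gray-level histogram because Equation 6 is not reproduced in this text. The 4×4 image is a toy example with four gray levels.

```python
import numpy as np

def glcm(image, dr, dc, levels):
    """Normalized, symmetric co-occurrence matrix for pixel offset (dr, dc)."""
    rows, cols = image.shape
    counts = np.zeros((levels, levels), dtype=np.float64)
    # Reference pixels whose neighbour at (r + dr, c + dc) lies inside the image.
    r0, r1 = max(0, -dr), min(rows, rows - dr)
    c0, c1 = max(0, -dc), min(cols, cols - dc)
    ref = image[r0:r1, c0:c1]
    nbr = image[r0 + dr:r1 + dr, c0 + dc:c1 + dc]
    np.add.at(counts, (ref.ravel(), nbr.ravel()), 1)
    counts += counts.T  # count each pair in both directions
    return counts / counts.sum()

def texture_features(p):
    """Five texture properties of a normalized co-occurrence matrix p."""
    i, j = np.indices(p.shape)
    mu_i, mu_j = np.sum(i * p), np.sum(j * p)
    sd_i = np.sqrt(np.sum((i - mu_i) ** 2 * p))
    sd_j = np.sqrt(np.sum((j - mu_j) ** 2 * p))
    return {
        "dissimilarity": np.sum(np.abs(i - j) * p),
        "energy": np.sqrt(np.sum(p ** 2)),
        "homogeneity": np.sum(p / (1.0 + (i - j) ** 2)),
        "correlation": np.sum((i - mu_i) * (j - mu_j) * p) / (sd_i * sd_j),
        "contrast": np.sum((i - j) ** 2 * p),
    }

def shannon_entropy(image, levels):
    """Entropy (in bits) of the gray-level histogram; an assumed definition."""
    prob = np.bincount(image.ravel(), minlength=levels) / image.size
    prob = prob[prob > 0]
    return -np.sum(prob * np.log2(prob))

image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 2, 2, 2],
                  [2, 2, 3, 3]])

# Offsets (row, column) for 0, 45, 90 and 135 degrees at distance d = 1.
offsets = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}
p0 = glcm(image, *offsets[0], levels=4)
features = [value
            for dr, dc in offsets.values()
            for value in texture_features(glcm(image, dr, dc, 4)).values()]
features.append(shannon_entropy(image, 4))
print(len(features))  # 20 texture values + 1 entropy value = 21
```

With five properties at four angles this yields 20 texture values, to which the entropy is appended; the shape feature would be added separately.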
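The shape component of the feature vector can be illustrated with a short Python sketch of a Fourier descriptor. This is a minimal sketch under assumptions, not Algorithm 2: the function name and the number of retained coefficients are hypothetical, the boundary is a synthetic closed contour traversed counterclockwise (how the tumor boundary is traced is not specified here), and invariance is obtained in the usual way by discarding the zero-frequency term, keeping only magnitudes, and dividing by the first harmonic.

```python
import numpy as np

def fourier_descriptor(boundary, n_coeffs=8):
    """boundary: (N, 2) array of ordered (x, y) points on a closed contour."""
    z = boundary[:, 0] + 1j * boundary[:, 1]  # contour as a complex signal
    coeffs = np.fft.fft(z)
    # Skipping coeffs[0] removes translation, magnitudes remove rotation and
    # start-point dependence, and dividing by the first harmonic removes scale.
    mags = np.abs(coeffs[1:n_coeffs + 1])
    return mags / mags[0]

# Synthetic three-lobed contour, for illustration only.
t = np.linspace(0.0, 2.0 * np.pi, 128, endpoint=False)
radius = 1.0 + 0.3 * np.cos(3.0 * t)
shape = np.column_stack([radius * np.cos(t), radius * np.sin(t)])

# The same contour rotated by 40 degrees, scaled by 2.5 and shifted.
angle = np.deg2rad(40.0)
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])
moved = 2.5 * shape @ rotation.T + np.array([10.0, -4.0])

fd_a = fourier_descriptor(shape)
fd_b = fourier_descriptor(moved)
shape_feature = fd_a.mean()  # mean of the descriptor as a single shape feature
print(np.allclose(fd_a, fd_b))
```

Both contours give the same descriptor, which is the invariance property the text relies on; taking the mean reduces the descriptor to a single value that can be appended to the feature vector.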

Classification - Extreme Gradient Boosting Classifier
Classification is the process of finding a class label based on existing data. A classifier is constructed on training data for which the class label is already known. Based on this information, the classifier learns the properties of each class and predicts the labels for the test data. In our framework, we have two classes: normal and abnormal. The extracted AGLCM features are classified as normal or abnormal using several classification algorithms, namely KNN, ANN, SVM, RF, and XGBoost. Among these, XGBoost supports automatic parallel computation and gives good results on many datasets. The XGBoost classifier is implemented using gradient boosting with additional characteristics, such as a more regularized model that controls overfitting, which leads to better results [43]. XGBoost combines software and hardware optimizations so that it uses fewer computing resources and less time [44]. This model offers a good balance of prediction performance and processing speed compared to other algorithms. It is more effective than deep learning techniques when only a limited number of training samples is available [6].
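The regularized boosting idea mentioned above can be sketched in a few lines of Python. The code below is not the XGBoost library and not the configuration used in the experiments; it is a didactic NumPy sketch, assuming depth-one trees (stumps), the logistic loss, and the standard second-order formulas in which a leaf with gradient sum G and Hessian sum H receives the weight -G/(H + λ) and a split is kept only if its regularized gain exceeds γ. All function names, hyperparameter values, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def best_stump(X, g, h, lam, gamma):
    """Return (feature, threshold, w_left, w_right) with the largest positive gain."""
    G, H = g.sum(), h.sum()
    best_gain, best = 0.0, None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f])[:-1]:  # keep the right side non-empty
            left = X[:, f] <= thr
            GL, HL = g[left].sum(), h[left].sum()
            GR, HR = G - GL, H - HL
            gain = 0.5 * (GL ** 2 / (HL + lam) + GR ** 2 / (HR + lam)
                          - G ** 2 / (H + lam)) - gamma
            if gain > best_gain:
                best_gain = gain
                best = (f, thr, -GL / (HL + lam), -GR / (HR + lam))
    return best

def fit_boosted_stumps(X, y, n_rounds=20, eta=0.3, lam=1.0, gamma=0.0):
    margin = np.zeros(len(y))
    stumps = []
    for _ in range(n_rounds):
        p = sigmoid(margin)
        g, h = p - y, p * (1.0 - p)  # gradient and Hessian of the log loss
        stump = best_stump(X, g, h, lam, gamma)
        if stump is None:  # no split has positive gain
            break
        f, thr, wl, wr = stump
        stumps.append((f, thr, eta * wl, eta * wr))  # store shrunken weights
        margin += np.where(X[:, f] <= thr, eta * wl, eta * wr)
    return stumps

def predict(stumps, X):
    margin = np.zeros(X.shape[0])
    for f, thr, wl, wr in stumps:
        margin += np.where(X[:, f] <= thr, wl, wr)
    return (sigmoid(margin) >= 0.5).astype(int)

# Synthetic two-class data (0 = normal, 1 = abnormal), split 70% / 30%.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)
order = rng.permutation(200)
train, test = order[:140], order[140:]

model = fit_boosted_stumps(X[train], y[train])
train_accuracy = np.mean(predict(model, X[train]) == y[train])
test_accuracy = np.mean(predict(model, X[test]) == y[test])
print(train_accuracy, test_accuracy)
```

Larger values of λ shrink the leaf weights and larger values of γ reject weak splits, which is how this family of models controls overfitting. In practice the XGBoost library would be used instead of a hand-written loop.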

Results and Discussion
In our methodology, pre-processing is an important phase for extracting better features. The pre-processing techniques (HE, AHE, and CLAHE) discussed in Section 3.2 are applied to the MIAS dataset images. The performance of these pre-processing techniques is measured by Mean Square Error (MSE) and Peak Signal to Noise Ratio (PSNR) [45] [46] [47]. MSE, the most widely used image quality measure, is the mean of the squared differences between the original and the pre-processed image. PSNR measures the quality of the processed image relative to the original image. Mathematically, they are represented as in Equations 8 and 9, which are used to calculate the MSE and PSNR values between the original image and the pre-processed images. A small PSNR value represents poor quality, whereas a greater value indicates a good-quality image. The PSNR value ranges for the HE, AHE, and CLAHE pre-processing techniques are 20 dB to 30 dB, 31 dB to 35 dB, and 36 dB to 45 dB, respectively, as shown in Table 3. From Table 3, it can be inferred that the PSNR values of CLAHE are greater than those of the other pre-processing techniques. Hence, the image quality is improved by using the CLAHE technique, and pre-processing is an important phase in mammogram classification.
The contrast-enhanced images are fed to the AGLCM feature extraction algorithm discussed in Section 3.3. Because the abnormality in the MIAS mammograms lies in the middle of the image, a Region of Interest (ROI) of 256×256 is considered for feature extraction [48]. The feature extraction algorithm extracts texture, intensity, and shape-based features from the pre-processed images. A total of 22 features are extracted from the MIAS dataset and are shown in Table 2 for sample images. In this table, each row corresponds to an image and the columns hold the features and the class label.
Among these 22 features, the first 20 are GLCM texture-based features, namely dissimilarity, energy, homogeneity, correlation, and contrast (five properties at four angles, 0°, 45°, 90°, and 135°, with distance d=1). The remaining two are the intensity-based entropy and the shape-based feature, the mean of the Fourier descriptor. The extracted features are classified using the XGBoost algorithm, and the results are compared with other classifiers such as KNN, ANN, SVM, and RF. The dataset is divided into 70% for training and 30% for testing.
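The two image quality measures used to compare the pre-processing techniques can be sketched as follows. Since Equations 8 and 9 are not reproduced in this text, the sketch assumes the standard definitions: MSE is the mean squared pixel difference, and PSNR is 10·log10(MAX²/MSE) with MAX = 255 for 8-bit images. The function names and the tiny test images are hypothetical.

```python
import numpy as np

def mse(original, processed):
    """Mean squared error between two images of the same shape."""
    diff = original.astype(np.float64) - processed.astype(np.float64)
    return np.mean(diff ** 2)

def psnr(original, processed, max_value=255.0):
    """Peak signal-to-noise ratio in dB (assumes 8-bit images by default)."""
    error = mse(original, processed)
    if error == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / error)

original = np.full((4, 4), 100, dtype=np.uint8)
processed = original + 5  # every pixel differs by 5 gray levels

print(mse(original, processed))   # 25.0
print(psnr(original, processed))
```

Converting to floating point before subtracting avoids the wrap-around that unsigned 8-bit arithmetic would otherwise cause when the processed pixel is darker than the original.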
The effectiveness of the methodology is validated using several performance measures, namely sensitivity, specificity, precision, F1-score, accuracy, and misclassification rate, which are calculated using Equations 10 to 14. They are calculated from the confusion matrix, which is shown in Table 4 [49].
Sensitivity measures the true positive rate, whereas specificity measures the true negative rate. Precision is the proportion of predicted positives that are correct, and the F1-score is the harmonic mean of precision and sensitivity. A publicly available dataset called MIAS is used for experimentation. The results are given in Table 5.
It is observed that mammogram classification accuracy is improved by using CLAHE as the pre-processing technique and AGLCM as the feature extraction technique with the XGBoost classifier. This work shows that pre-processing combined with a good feature extraction technique positively impacts the accuracy of classifiers. We also use the misclassification rate as a performance measure to evaluate the proposed AGLCM feature extraction technique. It is the ratio of the number of misclassified samples (FP + FN) to the total number of samples, as given in Equation 15.
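The performance measures discussed above follow directly from the four confusion matrix counts. The Python sketch below uses the standard definitions, since Equations 10 to 15 are not reproduced in this text; the function name is hypothetical and the counts are made-up values for illustration, not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation measures from confusion matrix counts."""
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1_score": f1_score,
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
    }

# Illustrative counts only (hypothetical, not taken from the experiments).
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy and misclassification rate always sum to one, so reporting both is a consistency check rather than independent evidence.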
The misclassification rates are reported for HE, AHE, and CLAHE as pre-processing techniques and for GLCM and AGLCM as feature extraction techniques. They are calculated for the XGBoost classifier and compared with other classifiers such as KNN, ANN, SVM, and RF. The graphical representation is given in Figure 5.
The lowest misclassification rate and the highest accuracy are obtained for the proposed methodology, and the CLAHE+AGLCM+XGBoost classifier performs better than the state-of-the-art methods, as shown in Table 6.

Conclusion
In our research, an improved diagnostic system is designed to address the challenges in breast cancer detection. In the proposed framework, a pre-processing technique called CLAHE is first applied to increase the contrast in mammograms. It is followed by a feature extraction technique called AGLCM, which combines texture, intensity, and shape-based features. These features are classified using the XGBoost technique, and the results are compared with the KNN, ANN, SVM, and RF classifiers. The experiments are carried out on the MIAS dataset. The results show that our framework achieves better performance in terms of confusion matrix parameters and misclassification rate. The highest accuracy and the lowest misclassification rate are obtained for CLAHE+AGLCM with the XGBoost classifier. Designing a CAD system for BC detection remains an open research problem. Some directions that might improve our research in the future are:
1. The method is to be applied to large databases.
2. The optimal features are to be selected from the extracted features for better classification.