Feature Extraction using CNN for Peripheral Blood Cells Recognition on Scalable Information Systems

INTRODUCTION: The diagnosis of hematological diseases is based on the morphological differentiation of the peripheral blood cell types. OBJECTIVES: In this work, a hybrid model based on CNN feature extraction and machine learning classifiers is proposed to improve peripheral blood cell image classification. METHODS: First, a CNN model composed of four convolution layers and three fully connected layers was proposed. Second, the features from the deeper layers of the CNN classifier were extracted. Third, several models were trained and tested on the data by combining the CNN with traditional machine learning classifiers: CNN_KNN, CNN_SVM (Linear), CNN_SVM (RBF), and CNN_AdaboostM1. The proposed methods were validated on two public datasets. The first contains 12,444 images of four types of leukocytes (eosinophil, lymphocyte, monocyte, and neutrophil) and was used to find the best optimizer function. The second contains 17,092 images divided into eight groups (lymphocytes, neutrophils, monocytes, eosinophils, basophils, immature granulocytes, erythroblasts, and platelets) and was used to find the best combination of CNN and machine learning algorithms. RESULTS: The results reveal that the CNN combined with the AdaBoost decision tree classifier provided the best cell recognition performance, with an accuracy of 88.8%. CONCLUSION: The obtained results show that the proposed system can be used in clinical practice.


Introduction
Peripheral blood cells can be divided into three classes: erythrocytes (red blood cells), leukocytes (white blood cells), and thrombocytes (platelets). Leukocytes are the cells of the immune system responsible for protecting the body from infections and foreign invaders. Each cell is composed of a nucleus and cytoplasm. The white blood cell (WBC) differential count measures the WBC types present in the bone marrow. This paper proposes a hybrid model based on CNN feature extraction and an AdaboostM1 decision tree classifier for peripheral blood cell recognition. We use the CNN to extract features from the training images. The network constructs a hierarchical representation of the input images: early layers contain lower-level features that build up into higher-level features as the network goes deeper. The CNN model is composed of four convolutional layers and three fully connected layers. The features extracted from the third fully connected layer are classified by CNN, CNN_KNN, CNN_SVM with linear kernel, CNN_SVM with RBF kernel, and CNN_AdaboostM1. The remainder of this paper is organized as follows: Section 2 presents the related works, Section 3 presents the material and methods used in this work, Section 4 presents the obtained results and discussion, and Section 5 presents the conclusion and future work.

Related works
Many works have been proposed for white blood cell classification based on deep learning methods. Among these methods, Convolutional Neural Network (CNN) architectures are widely used.
In [1], a new CNN architecture called WBCNet was proposed. This model combines batch normalization, a residual convolution architecture, and an improved activation function, allowing it to fully extract features from the microscopic WBC image. The WBCNet model contains 33 layers and is remarkably faster than a traditional CNN model during training. The accuracy on the test dataset was 83%. In [2], the authors introduced the PatternNet-fused Ensemble of Convolutional Neural Networks (PECNN), a new architecture for WBC classification. The proposed method fuses the outputs of n randomly generated CNNs using an ensemble method, PatternNet, which captures the strengths of each participating model while remaining insensitive to outliers. The authors in [3] presented an improved hybrid approach for efficiently classifying WBC leukemia. The model is based on feature extraction from WBC images using VGGNet. The approach then uses a statistically enhanced Salp Swarm Algorithm (SESSA) to filter the extracted features. They reported some of the best-ranked results on the datasets used and outperformed several convolutional network models. In [4], the CNN was paired one at a time with Densenet201, Alexnet, GoogleNet, and Resnet50. In addition, the dataset images were treated with Gaussian and median filters separately, then classified by the CNN models. The new dataset with filtered images showed better results than the original data. An improved algorithm based on feature weight adaptive K-means clustering for extracting complex leukocytes was proposed in [5]. First, the color space was decomposed. Then, the image was segmented using a combination of the color space decomposition and K-means clustering. Next, the watershed algorithm was used to separate adherent complex white blood cells. Finally, classification tasks based on the CNN were performed, and the obtained results were compared with other methods. The proposed method achieved a segmentation accuracy of 95.8%.
In the study published in [6], two convolutional neural networks, VGG-16 and ResNet-50, were employed to classify five categories of white blood cells: basophil, eosinophil, neutrophil, lymphocyte, and monocyte. The results showed that ResNet-50 provided the best performance, with an accuracy of 88.29%. The authors in [7] applied a convolutional neural network (CNN) to the image classification of four types of leukocytes, namely eosinophils, lymphocytes, monocytes, and neutrophils. They used the Genetic Algorithm (GA) to optimize the CNN's hyperparameters. The optimized CNN obtained a classification accuracy of 91% on the validation set and 99% on the training set. It also achieved a sensitivity of 91% and a specificity of 97%. In [8], the authors used a data augmentation method based on generative adversarial networks (GAN), then used VGG-16, ResNet, and DenseNet for the recognition of leukocytes. The proposed method improved the classification performance for WBCs, and the DenseNet-169 model provided the best accuracy (98.8%). Another framework [9] was developed based on a comparative study between convolutional neural networks and other architectures. The proposed method improved the accuracy rate and loss in both detection and classification of WBCs. The system proposed in [10] consisted of a deep convolutional neural network (DCNN) enhanced with a modified loss function and regularization. It improved the classification accuracy from 96.1% to 98.92% and decreased the processing time from 0.354 to 0.216 s. In [11], an automatic approach for WBC detection and classification from peripheral blood images was proposed. The detection process first used local binary pattern (LBP) features and an SVM to separate basophils and eosinophils.
As a second step, a convolutional neural network was used to automatically extract high-level features from WBCs, and the features were then passed through a random forest to recognize the other types of WBCs. Another work published in [12] explored the utility of a deep learning model to detect lymphocyte cells and their subtypes; the authors proposed using a CNN classifier. This approach was compared to another method based on SVM. The obtained results showed that the proposed classifier offered better performance in identifying normal lymphocytes and pre-B cells. However, the sensitivity on the T-cells was quite low. Recently, the authors of [13] presented a system based on a convolutional neural network (CNN) model to identify ALL. They applied different data augmentation techniques to expand the training data. The proposed model achieved an accuracy of 95.54%. In [14], a VGG19 model was optimized by the SSPSO algorithm, and an accuracy of 98% was obtained. Another work based on the histogram of oriented gradients feature of nuclei shapes was proposed in [15]. The segmentation method used the YCbCr color space and the K-means clustering technique. Finally, the extracted features were classified with an SVM and a backpropagation artificial neural network (ANN). Hegde et al. proposed in [16] a robust model to detect the WBC nuclei in peripheral blood smear images.
EAI Endorsed Transactions Scalable Information Systems 10 2021 -01 2022 | Volume 9 | Issue 34 | e12
As mentioned in this section, the segmentation of white blood cells in blood smears has been largely addressed by classic machine learning methods or by CNN models. On the other hand, the detection of the various components circulating in the blood has received little attention in the literature. In this work, we propose a CNN model composed of four convolution layers and three fully connected layers. This model is used as a feature extractor, and the classification is performed with KNN, SVM with a linear kernel, SVM with an RBF kernel, and the AdaboostM1 algorithm.

Convolutional neural network (CNN)
The convolutional neural network (CNN) is the basis of the most popular algorithms in deep learning. It has been applied with success in different fields; for example, CNN models have been used in image segmentation, classification, and object detection. The CNN model proposed in this work has 22 layers, including an input layer, 4 convolution layers, 3 max-pooling layers, and 3 fully connected layers.
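To illustrate how feature maps shrink through a stack of four convolution and three max-pooling layers, the following minimal sketch traces the spatial size layer by layer using the standard output-size formula. The input size (64), the 3x3 kernels with padding 1, the 2x2 pooling with stride 2, and the ordering of the layers are all hypothetical choices made for illustration; the paper does not state these hyperparameters.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(input_size, layers):
    """Return the spatial size after the input and after each layer."""
    sizes = [input_size]
    for _name, kernel, stride, padding in layers:
        sizes.append(conv_out(sizes[-1], kernel, stride, padding))
    return sizes


# Hypothetical hyperparameters (assumed, not taken from the paper):
# (name, kernel, stride, padding)
LAYERS = [
    ("conv1", 3, 1, 1),
    ("pool1", 2, 2, 0),
    ("conv2", 3, 1, 1),
    ("pool2", 2, 2, 0),
    ("conv3", 3, 1, 1),
    ("pool3", 2, 2, 0),
    ("conv4", 3, 1, 1),
]

shapes = trace_shapes(64, LAYERS)
for (name, *_), size in zip(LAYERS, shapes[1:]):
    print(f"{name}: {size} x {size}")
```

With these assumed settings, each convolution preserves the spatial size and each pooling layer halves it, so the fully connected layers would receive an 8 x 8 map per channel.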

Adaptive moment estimation optimizer
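Adaptive Moment Estimation (Adam) [17] is a way of computing adaptive learning rates for each parameter. It combines two stochastic gradient descent approaches and keeps exponentially decaying averages of past gradients and past squared gradients:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

where $g_t$ is the gradient at step $t$, $m_t$ and $v_t$ are estimates of the mean and the uncentered variance of the gradients, and $\beta_1$ and $\beta_2$ are the corresponding decay rates. The authors of Adam propose to compute the bias-corrected first and second-moment estimates:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

The authors show that Adam provides good results when the default values of $\beta_1 = 0.9$ and $\beta_2 = 0.999$ are used.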
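As an illustration of the Adam update, here is a minimal scalar sketch in Python. It follows the standard published form of the algorithm (moment estimates, bias correction, then a step scaled by the square root of the second moment). The function names, the small constant `eps`, and the toy quadratic objective are assumptions introduced for this example and are not taken from the paper's implementation.

```python
import math


def adam_step(theta, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


def minimize_quadratic(steps=2000, lr=0.01):
    """Minimize the toy objective f(theta) = (theta - 3)^2 with Adam."""
    theta, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * (theta - 3.0)
        theta, m, v = adam_step(theta, grad, m, v, t, lr=lr)
    return theta


theta_final = minimize_quadratic()
```

Because of the bias correction, the very first step has a magnitude close to the learning rate regardless of the gradient scale, which is one reason the method is relatively insensitive to how the loss is scaled.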

Stochastic gradient descent with momentum optimizer
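The stochastic gradient descent algorithm makes a parameter update for each training example $x^{(i)}$ and label $y^{(i)}$:

$$\theta = \theta - \eta \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$$

where $\theta$ denotes the model parameters, $\eta$ the learning rate, and $J$ the loss function. With momentum (SGDM), a fraction of the previous update is added to the current one.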
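The per-example update with momentum can be sketched as follows. This is a minimal illustration in the common "velocity" formulation, fitted to a toy one-parameter regression; the function names, the momentum value of 0.9, the learning rate, and the data are assumptions chosen for the example and do not come from the paper.

```python
def sgdm_step(theta, grad, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update for a scalar parameter."""
    velocity = momentum * velocity + lr * grad  # accumulate past updates
    theta = theta - velocity
    return theta, velocity


def fit_slope(samples, epochs=200, lr=0.01, momentum=0.9):
    """Fit y = w * x by per-example SGDM on the squared error."""
    w, velocity = 0.0, 0.0
    for _ in range(epochs):
        for x, y in samples:
            grad = 2.0 * (w * x - y) * x  # derivative of (w*x - y)^2 w.r.t. w
            w, velocity = sgdm_step(w, grad, velocity, lr, momentum)
    return w


# Toy data generated by y = 2x (hypothetical, for illustration only).
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = fit_slope(samples)
print(f"estimated slope: {w:.6f}")
```

Each parameter update here uses a single `(x, y)` pair, which is what distinguishes stochastic gradient descent from the batch variant that averages the gradient over the whole training set.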

Support Vector Machines
Cortes and Vapnik [19], [20] invented support vector machines (SVM) in 1995. They are used to solve various learning problems, including pattern recognition, text classification, and even medical diagnosis. The SVM is based on the ideas of the maximum margin and the kernel function, the latter making it possible to tackle nonlinear discrimination problems. The margin is the distance between the separation boundary and the support vectors, which are the closest samples. In more technical terms, a support vector machine creates a hyperplane or group of hyperplanes in a high-dimensional space that may be used to classify data.
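To make the role of the kernel function concrete, the sketch below defines a linear and an RBF (Gaussian) kernel and evaluates a kernel decision function of the usual form f(x) = sum_i c_i K(s_i, x) + b. The four XOR-style points, their coefficients, and the value of `gamma` are hand-picked for illustration and are not the result of training; they only show that an RBF kernel can separate a configuration that no single hyperplane in the input space can.

```python
import math


def linear_kernel(x, z):
    """Dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def decision_value(x, support_vectors, coeffs, bias, kernel):
    """Kernel decision function; the predicted class is the sign of the result."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coeffs)) + bias


# XOR-style toy problem (hypothetical): opposite corners share a label.
points = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
labels = [-1, -1, 1, 1]
# Hand-picked coefficients (label times a unit multiplier), not trained values.
coeffs = [float(y) for y in labels]

rbf_scores = [decision_value(p, points, coeffs, 0.0, rbf_kernel) for p in points]
rbf_predictions = [1 if s > 0 else -1 for s in rbf_scores]
print(rbf_predictions)
```

In a trained SVM the coefficients and bias come from solving the margin-maximization problem; only the form of the decision function is shown here.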

Adaptive Boosting with Decision Trees
Yoav Freund and Robert Schapire published AdaBoost (for Adaptive Boosting) in 1995 [21]. It is a meta-algorithm that employs the boosting principle to enhance the performance of classifiers. The goal is to give each example of the learning set a weight. All examples start with the same weight, but as the iterations progress, the weights of misclassified examples grow while those of correctly classified examples decrease. AdaBoost can be applied on top of any classifier to learn from its limitations and develop a more accurate model, and it has been described as the "best out-of-the-box classifier." The most suited and most common algorithm used with AdaBoost is the decision tree with one level, frequently called a decision stump.
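The reweighting idea can be sketched as a single boosting round. The snippet below follows the AdaBoost.M1 rule as commonly stated: with weighted error e, the weights of correctly classified examples are multiplied by beta = e / (1 - e) and all weights are then renormalized, which raises the relative weight of the misclassified examples. The function name and the toy inputs are assumptions for illustration; this is not the paper's implementation.

```python
def adaboost_m1_update(weights, correct):
    """One AdaBoost.M1 reweighting round.

    weights: current example weights, assumed to sum to 1
    correct: booleans, True where the weak learner classified the example correctly
    Returns the renormalized weights and the factor beta.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0.0 < error < 0.5:
        raise ValueError("AdaBoost.M1 needs a weak learner with 0 < error < 0.5")
    beta = error / (1.0 - error)
    # Shrink the weights of correctly classified examples, keep the others.
    scaled = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return [w / total for w in scaled], beta


# Toy round: four equally weighted examples, the last one misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
new_weights, beta = adaboost_m1_update(weights, correct)
print(new_weights, beta)
```

After the update, the misclassified examples carry half of the total weight, so the next weak learner (for instance a decision stump) is pushed to focus on them.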

Datasets
In this work, we used two datasets. The first dataset is divided into four folders, each corresponding to a different blood cell type and containing approximately 2500 images for training and 600 images for testing. The training folder contains 2497 eosinophil images, 2483 lymphocyte images, 2478 monocyte images, and 2499 neutrophil images. The testing folder contains 623 eosinophil images, 620 lymphocyte images, 620 monocyte images, and 624 neutrophil images. The dataset is publicly available at https://www.kaggle.com/paultimothymooney/blood-cells. The second dataset is available in the Mendeley repository [22][23]: "A dataset for microscopic peripheral blood cell images for the development of automatic recognition systems" (figure 1).
The evaluation of the classifiers is based on the following parameters: accuracy, sensitivity, and negative predictive value (NPV).
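For reference, the sketch below computes global accuracy, per-class sensitivity, and per-class negative predictive value from a multi-class confusion matrix using their standard one-versus-rest definitions. That the paper uses exactly these definitions is an assumption, and the 2 x 2 matrix is invented toy data, not a result from the study.

```python
def accuracy(confusion):
    """Global accuracy: correctly classified samples over all samples."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total


def per_class_metrics(confusion, cls):
    """Sensitivity and NPV for one class, treated one-versus-rest.

    confusion[i][j] is the number of samples of true class i predicted as j.
    Assumes the class has at least one sample and one negative prediction.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp
    fp = sum(confusion[i][cls] for i in range(n)) - tp
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn)   # true positive rate
    npv = tn / (tn + fn)           # negative predictive value
    return sensitivity, npv


# Invented toy confusion matrix for two classes.
toy_confusion = [
    [8, 2],
    [1, 9],
]
toy_accuracy = accuracy(toy_confusion)
toy_sensitivity, toy_npv = per_class_metrics(toy_confusion, 0)
print(toy_accuracy, toy_sensitivity, toy_npv)
```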

Influence of the maximum number of epochs
We used the first dataset to study the impact of the maximum number of epochs on the results. This parameter was set to 10 and then to 20. The global accuracy (figure 2), the sensitivity (figure 3), and the negative predictive value, NPV (figure 4), were estimated. The Adam function was used with 'InitialLearnRate' = 3e-4 and 'SquaredGradientDecayFactor' = 0.99. The accuracy reached 70.5% after 10 epochs and 85.9% after 20 epochs. These findings lead to the conclusion that the maximum number of epochs impacts the performance of the classifier. The negative predictive value after 10 epochs was 66.8% for the eosinophil, 83.60% for the lymphocyte, 73.90% for the monocyte, and 61.5% for the neutrophil. After 20 epochs, it was 83% for the eosinophil, 93.00% for the lymphocyte, 87.60% for the monocyte, and 80% for the neutrophil. The sensitivity after 10 epochs was 56.30% for the eosinophil, 71.70% for the lymphocyte, 80.5% for the monocyte, and 73.6% for the neutrophil. After 20 epochs, it was 74.20% for the eosinophil, 88.90% for the lymphocyte, 97.2% for the monocyte, and 83.3% for the neutrophil.

Influence of the training algorithm
In this section, the influence of the optimizer function on the performance of the CNN classifier is studied. Two algorithms were compared: ADAM and SGDM. We tested the CNN model on the first dataset, which contains approximately 3100 images for each of the four leukocyte types. The results are reported in table 1, together with the parameters used.
From table 1, we can see that using the ADAM optimizer function consistently improves the accuracy and decreases the training time compared to using the SGDM function. For 120 epochs, we obtained an accuracy of 92.23%. We reported in table 2 the results obtained with the CNN, CNN_KNN, CNN_SVM (Linear), CNN_SVM (RBF), and CNN_AdaboostM1 models using the second dataset. The best accuracy was obtained with CNN_Adaboost (88.8%). The best sensitivities for the eight components are listed as follows:
• Neutrophil: Sensitivity = 99.6% using CNN.
In practice, the diagnosis of hematological diseases is based on the morphological differentiation of peripheral blood cells under the microscope in laboratories; this traditional method is very tedious and time-consuming. Therefore, an automatic differential counting system is preferred. Automatic recognition of peripheral blood cells using classical machine-learning approaches has been widely treated in the literature [24,25]. With the emergence of CNN architectures, many works have been proposed to perform the automatic segmentation and classification of white blood cells [8][9][10]. The classification of the eight components of the peripheral blood cells was proposed in [22]. The authors collected a dataset that contained the eight components: neutrophils, eosinophils, basophils, lymphocytes, monocytes, immature granulocytes (metamyelocytes, myelocytes, and promyelocytes), erythroblasts, and platelets (thrombocytes). They used VGG16 and InceptionV3 as feature extractors and the SVM as classifier.
In our work, we have proposed a CNN model composed of four convolution layers and three fully connected layers. This model has been used as a feature extractor, and the classification was performed with KNN, SVM with a linear kernel, SVM with an RBF kernel, and the AdaboostM1 algorithm. We found that the AdaboostM1 classifier provides the best results. The CNN_KNN combination also provided high accuracy (88.6%) and ranked second after AdaboostM1. Moreover, using the CNN alone showed a competitive performance in cell recognition, with an accuracy of 88.4%. On the other hand, the combination of SVM and CNN provided different results depending on the nature of the kernel function. With a Gaussian (RBF) kernel function, the accuracy was promising (88.5%) due to the suitability of the RBF function. With a linear kernel, however, the accuracy was poor (44.3%). This is explained by the fact that blood cell recognition is not a linearly separable problem.
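The hybrid scheme discussed here, in which feature vectors produced by the CNN are handed to a separate classifier, can be sketched with a small k-nearest-neighbors example. The two-dimensional feature vectors, the labels, and the choice of k = 3 are invented for illustration; in the actual system the features would come from a fully connected layer of the trained CNN and would have many more dimensions.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Classify a feature vector by majority vote among its k nearest neighbors."""
    distances = sorted(
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Invented stand-ins for CNN feature vectors and their cell-type labels.
train_features = [(0.1, 0.2), (0.0, 0.3), (0.9, 0.8), (1.0, 1.0)]
train_labels = ["lymphocyte", "lymphocyte", "monocyte", "monocyte"]

predicted = knn_predict(train_features, train_labels, (0.2, 0.1), k=3)
print(predicted)
```

The same feature vectors could be passed to an SVM or a boosted ensemble instead; only the final classifier changes, while the CNN feature extractor stays fixed.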

Conclusion
In this work, we have proposed a feature extractor based on a CNN combined with machine learning methods to recognize peripheral blood cells automatically. First, we studied the influence of training parameters on the performance of the CNN classifier, namely the optimizer function (ADAM and SGDM) and the maximum number of epochs. The best accuracy reached was 92.23%, obtained with the ADAM function and 120 epochs. In the second step, we extracted the features from the deeper layers of the CNN classifier and classified them using the CNN, CNN_KNN, CNN_SVM (Linear), CNN_SVM (RBF), and CNN_AdaboostM1 classifiers. The best accuracy was achieved with the AdaboostM1 algorithm (88.8%). In future work, we propose to test other CNN models such as VGG19, Resnet, and Inception.