Convolutional Neural Network for Multi-class Classification of Diabetic Eye Disease

Prompt examination increases the chances of effective treatment of Diabetic Eye Disease (DED) and reduces the likelihood of permanent deterioration of vision. A key tool commonly used for the initial diagnosis of patients with DED or other eye disorders is the screening of retinal fundus images. Manual detection with these images is, however, labour intensive and time consuming. As deep learning (DL) has recently been demonstrated to provide impressive benefits to clinical practice, researchers have attempted to use DL methods to detect retinal eye diseases from retinal fundus photographs. DL techniques in machine learning (ML) have achieved state-of-the-art performance in the binary classification of healthy and diseased retinal fundus images, while the classification of multi-class retinal eye diseases remains an open challenge. This study therefore considers multi-class DED and seeks to develop an automated classification framework for it. Detecting multiple DEDs from retinal fundus images is an important research topic with practical consequences. Our proposed model was tested on various retinal fundus images gathered from publicly available datasets and annotated by an ophthalmologist. The experiment was conducted using a new convolutional neural network (CNN) model. Our proposed model for multi-class classification achieved a maximum accuracy of 81.33%, sensitivity of 100%, and specificity of 100%.


Introduction
DED is a category of eye disorders that can affect people with diabetes. Diabetic retinopathy (DR), diabetic macular edema (DME), glaucoma (Gl), and cataract (Ca) are among these diseases (see Fig. 1). Diabetes can cause eye damage over time, resulting in blurred vision or even vision impairment. However, by keeping track of diabetes, one can avoid DED or keep it from worsening [1]. Around one-third of people with diabetes are likely to be diagnosed with DED. According to the World Health Organization (WHO), 2.2 billion people worldwide have blindness or vision loss, and at least 1 billion have a vision impairment which could have been reversed. Identifying and diagnosing these diseases for treatment is critical.
* Corresponding author. Email: rubina.sarki@live.vu.edu.au
This work is motivated by the need to develop successful detection and preventive measures that address the wide range of lifelong needs associated with retinal conditions and visual impairments. As deep learning (DL) has recently been demonstrated to provide impressive benefits to many real applications [2][3][4][5], researchers have attempted to use DL methods to detect retinal eye diseases from retinal fundus photographs [6,7]. DL techniques in machine learning (ML) have achieved state-of-the-art performance in the binary classification of healthy and diseased retinal fundus images, while the classification of multi-class retinal eye diseases remains an open challenge. To solve these problems, automated DED diagnostic techniques using DL are vital [8,9]. Because screening for DED is time and labour intensive [10], a prognosis that depends solely on a competent ophthalmologist can be inaccurate. While DL has generally achieved high accuracy for healthy versus diseased (binary) classification, the results of multi-class classification, particularly for early-stage disease, are less impressive [11,12]. We therefore present in this paper an automated DED classification model based on a deep convolutional neural network (DCNN) that can distinguish healthy images from disease pathology.
Initially, various DCNN architectures were evaluated to determine the best performing system for the mild and multi-class DED classification tasks [13]. We aspired to achieve the highest standards of output that previous works have recorded. The following is a list of the research's main contributions.
• We trained the proposed CNN model and evaluated a multi-class classification model to improve sensitivities for the various DED levels.
• We incorporated various pre-processing methods.
• We applied augmentation approaches to improve the accuracy of the results and the adequacy of the sample volume for the dataset.
Early identification of retinal diseases is important; however, diagnosing these diseases using neural networks takes a tremendous amount of time and memory. To improve accuracy, additional data must be provided, but this requires high computing capacity and a large amount of time. The method therefore benefits from a pre-trained model, which adapts an existing design to minimize losses. Pre-trained models, or models for transfer learning (TL) [14], have already shown promising results in the classification and detection of medical images [15][16][17][18]. In this phase of the research, following TL theory, we used state-of-the-art CNN models pre-trained on the sizeable public image repository ImageNet, which was also used in our previous study [13,19]. The deep neural network's top layers were trained on the publicly available fundus image corpora, using initialized weights, for mild and multi-class classification. We proposed a new CNN model to solve the problem of multi-class classification. Initially, the highest-performing CNN model was developed based on a comprehensive experiment conducted in previous studies [20,21]. We then evaluated a set of performance improvements, including image processing and optimizer selection. Finally, the proposed model achieved an ideal specificity and sensitivity, thus facilitating an efficient and effective fully automated DL-based system that improves the outcomes of mass screening services for the at-risk population.
The remainder of this article is arranged as follows. Section 3 covers the study design, including the datasets, image processing, and DL algorithms. Section 4 addresses the experiments performed in two scenarios. Section 5 compares the results based on the classification evaluation. Section 6 discusses the findings. Finally, Section 7 concludes the paper.
EAI Endorsed Transactions on Scalable Information Systems 04 2022 -08 2022 | Volume 9 | Issue 4 | e5

Literature Review
Several previous studies focused on automatic retinal disease detection, using machine learning algorithms [22][23][24][25] to classify large numbers of fundus images captured from ocular scanning programs [26,27]. Various machine learning techniques have been introduced for the automatic detection of retinal diseases [15]: Artificial Neural Network (ANN), K-nearest neighbor algorithm, Support Vector Machine (SVM), and Naive Bayes classifier. Multiple studies have introduced ANN models to distinguish between glaucoma and non-glaucoma [28,29]. A glaucoma research team used DL feedforward neural networks (FNN) to detect preperimetric glaucoma from visual field assessments [30]. A DCNN was applied to the grading of nuclear cataracts [31]. In the field of ML techniques, DL has emerged as a common solution for different classification problems [32][33][34]. An advanced DL model capable of diagnosing Diabetes Mellitus Retinopathy (DMR) [8] has been developed by the Google research community. Research on Age-related Macular Degeneration (AMD) was performed using identical DL techniques with fundus photographs and optical coherence tomography [35,36]. However, all of these retinal image classification studies chose binary classification, addressing problems of normal versus one disease [37]. In addition, Lam et al. [38] used pre-trained networks (GoogleNet and AlexNet) on the Messidor dataset to distinguish mild and multi-class diabetic retinopathy; they applied selective contrast-limited adaptive histogram equalization (CLAHE) and reported improved recognition of subtle characteristics. Multi-class DL architectures have been tested by Choi et al. [39] for automated detection of ten retinal disorders. Their findings showed that existing DL algorithms failed to distinguish retinal images from small multi-group datasets.
While research in this field has reported high classification success in structured experimental settings, it is fundamentally difficult to apply a binary classification model in real medical practice, where visiting patients suffer from different DEDs. Indeed, there have been few studies on mild and multi-class groups aimed at recognizing DED. This study used a state-of-the-art CNN for fundus image analysis to adapt TL to mild and multi-class DED settings. This paper reports a pilot study intended to evaluate TL on mild and multi-class classification using a small open-source fundus retinal image database.

Study Design
The aim of this study is to increase performance in automated multi-class DED classification and detection through an experimental assessment of different classification improvement techniques. The associated priorities and goals are as follows.
1. Performance analysis of the proposed CNN framework with multiple datasets collected from multiple sources;
2. Impact of the optimizer on the performance of the proposed framework;
3. Visual representation of the performance of the model using heat-maps;
4. Analysis of image quality improvement using contrast enhancement techniques for further classification improvements on the multi-class DED detection task.
The DL process is illustrated in Fig. 2, which shows the steps that were followed.
Data Collection. Open-source data was collected, including the publicly available Messidor, Messidor-2, DRISHTI-GS, and Kaggle cataract datasets. Despite its relatively limited size, the Messidor dataset contains high-resolution photos with accurate labeling. Similarly, Messidor-2 is a public dataset used by others for analyzing the output of DED algorithms. It comprises 1,748 photographs of 874 subjects. Messidor-2 differs from the original 1,200-image Messidor dataset in that each subject has two images, one for each eye. The cataract and glaucoma images were acquired from the retina dataset available on Kaggle. This dataset consists of 100 images of cataract and 100 images of glaucoma.
Image Processing. Mathematical morphology was used to accomplish contrast enhancement in the retinal fundus images. Mathematical morphology approaches are based on the structural properties of objects. To extract the components of an image, these methods use set relationships and mathematical principles, which help describe regions. In morphological operators, the input consists of two data sets. The first set contains the original image, and the second describes the structural element (SE), which is also called a mask. The original image is grey-level or binary, and the mask is a matrix of 0s and 1s [40]. If the grey-level image matrix is represented by I(x, y) and the SE by S(u, v), the erosion and dilation operators are defined as in equations 1 and 2 [41].
The erosion operator decreases the size of objects, increases the size of holes in an image, and eliminates very small details from that image. It makes the final image appear darker than the original image by removing bright areas under the SE. The dilation operator operates in reverse: the size of objects increases and the size of holes in the image decreases. The opening operator is equivalent to applying erosion followed by dilation to the same image (equation 3), while the closing operator applies the two operations in the reverse order (equation 4).
The opening operator eliminates weak connections between objects and small details, while the closing operator eliminates small holes and fills gaps. The size and shape of an SE are usually chosen arbitrarily; however, a disk-shaped SE is used more frequently than other masks for medical images.
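The four operators described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: it assumes a flat, symmetric 0/1 mask whose centre element is 1, treats pixels outside the image as absent, and the function names (`erode`, `dilate`, `opening`, `closing`) and the cross-shaped mask are hypothetical choices made for the example.

```python
def _apply(image, se, reduce_fn):
    """Slide a flat 0/1 structuring element over the image and combine
    the covered pixels with reduce_fn (min for erosion, max for dilation)."""
    rows, cols = len(image), len(image[0])
    se_rows, se_cols = len(se), len(se[0])
    cy, cx = se_rows // 2, se_cols // 2  # SE origin assumed at its centre
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            values = []
            for u in range(se_rows):
                for v in range(se_cols):
                    yy, xx = y + u - cy, x + v - cx
                    # Keep only pixels under a 1 in the mask and inside the image
                    if se[u][v] and 0 <= yy < rows and 0 <= xx < cols:
                        values.append(image[yy][xx])
            out[y][x] = reduce_fn(values)
    return out


def erode(image, se):
    return _apply(image, se, min)


def dilate(image, se):
    return _apply(image, se, max)


def opening(image, se):
    # Erosion followed by dilation: removes small bright details
    return dilate(erode(image, se), se)


def closing(image, se):
    # Dilation followed by erosion: fills small dark holes
    return erode(dilate(image, se), se)


# Illustrative 3x3 cross-shaped mask (a small stand-in for a disk-shaped SE)
CROSS = [[0, 1, 0],
         [1, 1, 1],
         [0, 1, 0]]

# A single bright pixel on a dark background is removed by opening
spot = [[0] * 5 for _ in range(5)]
spot[2][2] = 9
opened = opening(spot, CROSS)
```

In this toy case the isolated bright pixel disappears after opening, while closing an image that contains a single dark pixel fills it in, which matches the behaviour described in the text.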
Proposed Convolutional Neural Network. The convolutional layer comprises a set of filters (kernels). Each filter is convolved with the input image and extracts features by creating a new layer. Each layer signifies some of the important features or characteristics of the input image. The * symbol denotes the convolution operation. The output (or feature map) $K(t)$, where t is time, is defined below for an input $M_n(t)$ convolved with a filter or kernel $k(a)$, where a is the measurement. We obtain a new function K to estimate the position. This operation is called convolution.

$$K(t) = (M_n * k)(t) = \int M_n(a)\, k(t - a)\, da$$

If t can only take integer values, the discrete convolution is given by the following equation:

$$K(t) = (M_n * k)(t) = \sum_{a=-\infty}^{\infty} M_n(a)\, k(t - a)$$

The above assumes a one-dimensional convolution operation. A two-dimensional convolution operation with input $M_n(m, n)$ and a kernel $k(a, b)$ is defined as:

$$K(m, n) = (M_n * k)(m, n) = \sum_{a} \sum_{b} M_n(a, b)\, k(m - a, n - b)$$

By the commutative law, the kernel is flipped and the above is equivalent to:

$$K(m, n) = (k * M_n)(m, n) = \sum_{a} \sum_{b} M_n(m - a, n - b)\, k(a, b)$$

Neural networks implement the cross-correlation function, which is the same as convolution but without flipping the kernel:

$$K(m, n) = \sum_{a} \sum_{b} M_n(m + a, n + b)\, k(a, b)$$

Because there is less variation in the range of valid values of a and b, this last form is usually easier to implement in a machine learning framework.
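The relationship between convolution and cross-correlation can be checked with a small pure-Python sketch. It assumes "valid" mode (the kernel is only placed where it fits entirely inside the input); the function names and the toy input are illustrative and are not taken from the paper.

```python
def cross_correlate2d(image, kernel):
    """'Valid' cross-correlation: slide the kernel over the image without flipping it."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[m + a][n + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for n in range(out_w)]
            for m in range(out_h)]


def flip(kernel):
    """Rotate the kernel by 180 degrees (reverse rows and columns)."""
    return [row[::-1] for row in kernel[::-1]]


def convolve2d(image, kernel):
    """True convolution is cross-correlation with a flipped kernel."""
    return cross_correlate2d(image, flip(kernel))


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

corr = cross_correlate2d(image, kernel)  # [[-4, -4], [-4, -4]]
conv = convolve2d(image, kernel)         # [[4, 4], [4, 4]]
```

For a kernel that is symmetric under a 180 degree rotation the two results coincide. Because a CNN learns its kernels, the choice between the two operations does not change what the network can represent.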
Rectified Linear Unit (ReLU) Layer. This layer is an activation function that sets negative input values to zero, which simplifies and speeds up computation and training, and helps prevent the gradient from vanishing. Mathematically, this is described as:

$$f(x) = \max(0, x)$$

where x is the input to the neuron.
Maxpooling Layer. This layer is a sample-based discretization method. It is employed to down-sample an input representation (input image, hidden-layer output matrix, etc.), reducing its dimensionality and allowing assumptions to be made about the features contained in the binned sub-regions. This decreases the number of learnable parameters and provides basic translation invariance to the internal representation, thus further reducing the cost of computation. Our model adopted a kernel size of 3 × 3 for the Maxpooling process. After the final convolution block, the network was flattened to one dimension.
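A short sketch of both operations follows. The 3 × 3 pooling window comes from the text; the stride is not stated in the paper, so the code assumes non-overlapping windows (stride equal to the window size), and the function names and the 6 × 6 toy feature map are hypothetical.

```python
def relu(x):
    """Rectified linear unit: negative inputs become zero."""
    return x if x > 0 else 0.0


def max_pool2d(feature_map, size=3, stride=None):
    """Keep the maximum of each size x size window.
    Assumption: stride defaults to the window size (non-overlapping windows)."""
    stride = stride or size
    rows, cols = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, cols - size + 1, stride)]
            for r in range(0, rows - size + 1, stride)]


# Toy 6x6 feature map with values 0..35, pooled down to 2x2
feature_map = [[r * 6 + c for c in range(6)] for r in range(6)]
pooled = max_pool2d(feature_map)  # [[14, 17], [32, 35]]
```

Each output value summarises nine inputs, so the spatial size shrinks by a factor of three in each direction under this stride assumption.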
Batch Normalization. Batch normalization enables every layer of the network to learn a little more independently of the other layers. It normalizes the output of the previous activation layer by subtracting the batch mean and dividing by the batch standard deviation [42], which improves the stability of the neural network.
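The normalization step can be illustrated for a single feature across a mini-batch. This is a sketch only: the learnable scale `gamma`, shift `beta`, and the small constant `eps` are standard parts of batch normalization but are not mentioned in the text, so they are assumptions here, as is the function name.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch: subtract the batch mean and
    divide by the batch standard deviation, then scale and shift.
    gamma, beta and eps are assumed defaults, not values from the paper."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
```

After this step the values have approximately zero mean and unit variance, regardless of the scale of the incoming activations.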
Fully Connected Layer. This layer takes the output of the previous layer (Convolutional, ReLU, or Pooling) as its input and calculates the probability values for classification into the various groups.
Loss Function. This layer applies a softmax function to the input data sample and is used for the final prediction. Our loss function is therefore given as:

$$L = -\log\left(\frac{e^{\beta_y}}{\sum_{j=1}^{c} e^{\beta_j}}\right)$$

where $\beta_j$ is the jth element of the vector of class scores $\beta$, $\beta_y$ is the CNN score for the positive class, and c is the number of classes for each image. The softmax ensures a proper prediction probability inside the log of the equation.
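As a hedged illustration of the softmax loss described above, the following sketch computes the negative log of the softmax probability assigned to the true class. The function names are hypothetical, and subtracting the maximum score before exponentiating is a common numerical-stability step that the paper does not mention.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to one."""
    shift = max(scores)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(scores, true_class):
    """Negative log of the probability assigned to the true class."""
    return -math.log(softmax(scores)[true_class])


# Five classes, as in the multi-class DED task; the scores are made up
probs = softmax([2.0, 1.0, 0.1, 0.0, -1.0])
uniform_loss = softmax_loss([0.0] * 5, true_class=2)  # equals log(5)
```

With equal scores for all five classes the loss is log(5), the value expected from a classifier that has learned nothing; the loss approaches zero as the score of the true class dominates.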
Regularization. An efficient regularization method named dropout is employed. This strategy was proposed by Srivastava et al. [43]. During training, dropout is performed by keeping a neuron active with a certain probability P or setting it to 0 otherwise. In our study, we set this hyperparameter to 0.50 because it results in the maximum amount of regularization [44].
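A minimal sketch of dropout with P = 0.5 is shown below. It uses the "inverted" variant, in which surviving activations are divided by the keep probability so that no rescaling is needed at test time; that scaling is an assumption of this example (it is how common frameworks behave) and is not described in the paper. The function name and the `rng` parameter are hypothetical.

```python
import random


def dropout(activations, keep_prob=0.5, training=True, rng=random):
    """Keep each activation with probability keep_prob, otherwise set it to 0.
    Assumption: inverted dropout, i.e. kept values are scaled by 1 / keep_prob."""
    if not training:
        return list(activations)  # dropout is disabled at test time
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)  # fixed seed so the example is repeatable
dropped = dropout([1.0] * 1000, keep_prob=0.5, rng=rng)
```

Roughly half of the activations are zeroed on each training pass, so the network cannot rely on any single neuron.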

Classification Description.
Training and testing were performed with the new CNN architecture described above, using the Scenario I classification model.
• Scenario I: In this scenario, we classify five classes of DED retinal fundus images, namely healthy, DR, DME, Gl, and Ca (multi-class classification).
Classification Performance Analysis. Retinal fundus images are evaluated as follows. The efficacy of each CNN is calculated using various metrics that measure the true and false categories for the diagnosed DED. Next, the cross-validation estimator [45] is used, resulting in a confusion matrix. Accuracy (equation 5) is essentially the fraction of the confusion matrix's total values that are true positives and true negatives. The elements of the confusion matrix therefore determine the classifier's efficacy in our present framework.

$$\text{Accuracy}(\%) = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \tag{5}$$

Sensitivity (Recall): A test's sensitivity, often called the true positive rate (TPR), is the percentage of truly positive samples for which the test in question gives a positive result.

$$\text{Sensitivity (Recall)} = \frac{TP}{TP + FN}$$

Specificity: The specificity, also referred to as the true negative rate (TNR), is the percentage of genuinely negative samples that test negative using the test in question.

$$\text{Specificity} = \frac{TN}{TN + FP}$$

Precision: Precision is the ratio of true positive test samples to all test samples classified as positive by the model.

$$\text{Precision} = \frac{TP}{TP + FP}$$
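For a multi-class problem these metrics are usually computed per class in a one-vs-rest fashion from the confusion matrix. The sketch below shows one way to do this; the function names are hypothetical and the 3 × 3 confusion matrix is made-up illustrative data, not a result from the paper.

```python
def one_vs_rest_counts(confusion, k):
    """confusion[i][j] = number of images of true class i predicted as class j.
    Returns (TP, TN, FP, FN) when class k is treated as the positive class."""
    n = len(confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(confusion[i][k] for i in range(n)) - tp
    tn = sum(map(sum, confusion)) - tp - fn - fp
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Metrics as fractions in [0, 1]; multiply by 100 for percentages.
    Raises ZeroDivisionError if a denominator is zero (e.g. an empty class)."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Made-up 3-class confusion matrix, for illustration only
confusion = [[8, 2, 0],
             [1, 9, 0],
             [0, 0, 10]]
class0 = metrics(*one_vs_rest_counts(confusion, 0))
class2 = metrics(*one_vs_rest_counts(confusion, 2))
```

In this toy matrix class 2 is never confused with another class, so its sensitivity and specificity are both 1.0 even though the overall classifier is not perfect.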
ROC curve: A receiver operating characteristic (ROC) curve is a graphical representation that demonstrates how a test's sensitivity and specificity vary with respect to each other. To create a ROC curve, the test is applied to samples known to be positive or negative. The TPR (sensitivity) is then plotted against the FPR (1 - specificity) for specified cut-off values. Preferably, a point is selected at the curve's shoulder, reducing false positives while maximizing true positives.
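The points of a ROC curve can be generated by sweeping the cut-off over the predicted scores, as in the following sketch. It treats one class as positive (label 1) and everything else as negative (label 0); the function name and the four example scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per cut-off, from the strictest cut-off
    to the most lenient. A sample is called positive when score >= cut-off."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cut-off above every score: nothing is called positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Made-up scores for two positive and two negative images
points = roc_points([0.9, 0.8, 0.7, 0.3], [1, 1, 0, 0])
# [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```

Here the point (0.0, 1.0) is the shoulder of the curve: a cut-off of 0.8 detects every positive image without any false positive.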

Experiment Setup
All experiments were conducted using Python and the Keras library with TensorFlow as the back-end. The resolution of the images was standardized to a uniform size, following each model's input requirements. The number of epochs was set to 50 because our experiments used a new CNN whose weights had to be trained. The dataset was split 80/20 into training and testing sets. Stratified sampling was used to ensure a nearly identical class distribution in both sets. The mini-batch size was set to 32, and the categorical cross-entropy loss function was selected due to its suitability for multi-class classification tasks. The optimizer was RMSprop with its default settings. Accuracy, specificity, and sensitivity on the test data were used as the primary assessment metrics for final score validation.
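The stratified 80/20 split mentioned above can be sketched without any library. The function name, the seed, and the class counts in the example are illustrative assumptions; the paper does not describe how its split was implemented.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train/test so that every class keeps roughly
    the same proportion in both parts."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class before cutting
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


# Made-up class counts, for illustration only
labels = ["healthy"] * 50 + ["DR"] * 30 + ["Ca"] * 20
train_idx, test_idx = stratified_split(labels)
```

Splitting per class rather than over the whole dataset keeps small classes, such as the 100-image cataract and glaucoma sets, represented in the test data.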

Performance enhancement
In this article, a CNN is proposed for the classification of the DED dataset. A deep convolutional neural network (CNN) transforms a feature vector with a fixed weight matrix to obtain specific feature representations without losing spatial arrangement information [46]. Optimizer selection is a vital component of neural network training, and understanding how optimizers function helps in picking which one to use for the model. Several hyperparameters can be modified to increase the efficiency of the neural network. Not all of them, however, have a significant influence on the efficiency of the network. The optimizer is one of the parameters that can determine whether the algorithm converges or diverges. We picked from among the most widely used optimizers.
RMSprop Optimizer. RMSprop is an unpublished method based on an adaptive learning rate, proposed by Geoff Hinton [47]. The RMSprop optimizer is similar to the gradient descent algorithm with momentum. This optimizer limits the oscillations in the vertical direction. We can therefore increase the learning rate, and the algorithm takes larger steps in the horizontal direction and converges faster. The difference between RMSprop and gradient descent lies in how the gradients are calculated. For RMSprop with momentum, the following equations illustrate how the gradients are determined:

$$E[g^2]_t = \gamma\, E[g^2]_{t-1} + (1 - \gamma)\, g_t^2$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$$

where $g_t$ is the gradient at step t, $\theta$ denotes the parameters, $\gamma$ is the momentum term, and $\epsilon$ is a small constant that avoids division by zero. Hinton suggested that the momentum value is normally set to 0.9, and a good default value for the learning rate η is 0.001. The gist of RMSprop is to maintain a moving average of the squared gradient and divide the gradient by the root of this average.
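A single RMSprop update for one scalar parameter can be sketched as follows. The defaults of 0.9 and 0.001 are the values quoted in the text (the text calls the 0.9 value "momentum"; the code names it `decay`). The small constant `eps`, the function name, and the toy objective f(w) = w² are assumptions made for this illustration.

```python
import math


def rmsprop_step(param, grad, avg_sq, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSprop update: keep a moving average of the squared gradient and
    divide the gradient by the root of that average.
    eps is an assumed small constant that avoids division by zero."""
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2
    param = param - lr * grad / math.sqrt(avg_sq + eps)
    return param, avg_sq


# Toy problem: minimise f(w) = w**2, whose gradient is 2 * w
w, avg_sq = 1.0, 0.0
first_w, _ = rmsprop_step(w, 2 * w, avg_sq)
for _ in range(3000):
    w, avg_sq = rmsprop_step(w, 2 * w, avg_sq)
```

Because the gradient is divided by its own running magnitude, the step size stays close to the learning rate whatever the scale of the gradient, which is why larger learning rates are tolerated than with plain gradient descent.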

Results
The proposed CNN model was evaluated for accuracy on the test dataset. The performance achieved by the model demonstrates that multi-class classification of DED can be improved by improving the quality of the training images and using the right parameters for the model. The proposed model achieved its highest sensitivity and specificity with the RMSprop optimizer. Finally, the accuracy obtained is shown in Table 3.
This analysis is a study of multi-class DED classification using a DL algorithm. Under the British Diabetic Association (BDA) guidelines, any method must achieve a minimum of 80 percent sensitivity and 95 percent specificity for sight-threatening DED identification [48]. In Scenario I, we achieved a maximum sensitivity of 100 percent and a maximum specificity of 100 percent after evaluating our approach on DED detection tasks. Thus, according to the BDA criteria, the sensitivity and specificity of multi-class DED detection are adequate.

Visualizing feature map
The feature maps, or activation maps, record the result of applying filters to an input, such as the source image or other feature maps. The purpose of visualizing feature maps for particular source images is to explain which attributes are detected or retained in the feature maps. The idea is that feature maps near the input detect fine-grained or small details, while feature maps near the model output capture more distinctive characteristics. The first layers of a CNN typically learn features such as edges, line patterns, and color, while deeper layers identify more complex features such as pathological lesions. Later layers construct their features by merging features from previous layers. To analyze the visualization of feature maps, we used the highest-performing model on fundus retinal images, i.e., the proposed CNN model, to create activations. The activations for the CNN model are shown in Fig. 5.

Explaining Proposed Model using Grad-CAM
A range of work has been performed to make deep learning more practical and explainable [49][50][51]. In various deep learning applications linked to medical imaging, it is also essential to make the deep neural network more interpretable. Selvaraju et al. [52] developed Gradient-weighted Class Activation Mapping (Grad-CAM), a technique that provides an illustrative view of deep learning models. Grad-CAM offers a visual explanation for any deep neural network, which helps in understanding the model when it performs identification or prediction tasks. A retinal fundus image is given as input, and the proposed model is used as the detection method. After calculating the predicted label using the full model, Grad-CAM is applied to the last convolution layer. Fig. 6 shows the heatmap visualization on various retinal fundus images by the proposed model.
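The core Grad-CAM computation is small once the activations of the last convolution layer and the gradients of the predicted class score with respect to them are available. The sketch below assumes both are supplied as nested lists (in practice a DL framework would compute them); the function name, the toy 2 × 2 maps, and the final rescaling to the range [0, 1] for display are illustrative assumptions.

```python
def grad_cam(feature_maps, gradients):
    """feature_maps[k][i][j]: activation of channel k at position (i, j) in the
    last convolution layer. gradients[k][i][j]: derivative of the predicted
    class score with respect to that activation. Returns a 2-D heatmap."""
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    # Channel importance: global average of each channel's gradients
    weights = [sum(map(sum, g)) / (rows * cols) for g in gradients]
    # Weighted sum of the channels, followed by ReLU to keep positive evidence only
    heatmap = [[max(0.0, sum(w * fm[i][j] for w, fm in zip(weights, feature_maps)))
                for j in range(cols)]
               for i in range(rows)]
    peak = max(map(max, heatmap))
    if peak > 0:  # rescale to [0, 1] for display (a presentation choice)
        heatmap = [[v / peak for v in row] for row in heatmap]
    return heatmap


# Two toy channels: the first supports the class, the second counts against it
feature_maps = [[[1.0, 0.0], [0.0, 0.0]],
                [[0.0, 0.0], [0.0, 2.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]],
             [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(feature_maps, gradients)  # [[1.0, 0.0], [0.0, 0.0]]
```

The resulting low-resolution map is normally upsampled to the input size and overlaid on the fundus image to highlight the regions that drove the prediction.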

Discussion
This research focuses on DL algorithms to classify multi-class DED automatically. Previous studies in this field found that the new DL algorithms are ineffective in the multi-class classification of fundus images from small datasets and failed to produce realistic and efficient results for computer-aided medical applications. Consequently, because of each disease's significance, this article adapted optimized DL architectures for the automated classification of healthy, DR, DME, Gl, and Ca images to create an automatic classification framework.
In multi-class DED classification with the proposed CNN model, the performance was reduced by 3 percent [53]. As DED fundus images contain subtle features that can be crucial for diagnosis, this finding is not unexpected. Interestingly, the most widely implemented architectures were designed to classify features based on objects such as those found in the ImageNet dataset. However, a new mechanism, such as a lesion-based one (e.g., exudates), may be required to diagnose diseases using CNN models. Our future goals include multi-class classification of segmented DED lesions (regions of interest) [54][55][56] to improve the accuracy of multi-class DED identification and to progress to more complicated and beneficial multi-grade disease identification.
Automated DED recognition has been the topic of several studies in the past, with the main emphasis on healthy/unhealthy binary retinal classification [57]. In the case of multi-class classification, the performance of DL models declines as the number of categories grows. As the number of categories increases, the expected accuracy of random guessing decreases. This finding corresponds to previous studies [58]. Recent research using the GoogLeNet model to identify skin cancer has shown that performance drops as the number of classes increases (with an accuracy of 72.1% for a three-class problem and 55.4% for a nine-class problem) [59]. It is thus essential to create disease-specific strategies to differentiate between DEDs in order to enhance the efficiency of multi-class classification. This research therefore proposed a system that focuses entirely on the identification of multi-class DED among healthy instances, as discussed in previous studies. Given the empirical nature of DL, a variety of performance optimization techniques were applied: (i) optimizer choice, (ii) data augmentation, and (iii) contrast enhancement. In addition, the study used combined datasets from different sources to evaluate the system's robustness and its flexibility in coping with real-world scenarios. As Wan et al. [60] have pointed out, a single data collection environment presents difficulties in the validation of accurate models [61,62].

Conclusions
Images of the retinal fundus are a popular and useful instrument for providing accurate DED details. Such photographs can readily show injuries and anomalies and help to prevent permanent loss of vision. However, it is challenging to identify DEDs accurately from retinal fundus images, and even highly experienced ophthalmologists are prone to misdiagnosing eye lesions. Severe DEDs, which require rapid diagnosis and treatment, cause irreversible vision loss, visual impairment, and vision distortion disorders. Consequently, it is essential to use DL methods to assist in the diagnosis. A single collection of fundus photos may represent multiple diseases. Diagnosis from a single fundus image, which is well examined by traditional methods, is inaccurate. Furthermore, marking a series of retinal fundus images one by one requires a large amount of time. For this analysis, we therefore used publicly accessible and annotated fundus photographs. This research proposed a model that learns the characteristics of retinal fundus images and their feature dependencies for multi-class classification. We grouped the retinal fundus images into five types of DED. The findings presented in this work show that DL algorithms can automatically identify the type of DED. This technology may have clinical applications and may enhance healthcare delivery by identifying different acute DED diseases.
Figure 6. Visualisation on fundus retinal images of Normal/DR/DME/Glaucoma/Cataract infected using Grad-CAM on the proposed model.