Alzheimer’s disease diagnosis via 5-layer Convolutional Neural Network and Data Augmentation

OBJECTIVES: Alzheimer's disease (AD) is a progressive neurodegenerative disease with insidious onset and one of the biggest challenges in geriatrics. Because the cause of the disease is unknown and there is currently no cure, early diagnosis of AD is particularly important. METHODS: In this paper, we built a 5-layer convolutional neural network based on deep learning technology. We used six data augmentation methods to increase the size of the training set. Batch normalization and dropout are also used: batch normalization is attached to each convolutional layer to form a convolution batch normalization (CB) block, and dropout is attached to each fully connected layer to form a dropout fully connected (DOFC) block. RESULTS: Our 5-layer CNN has shown excellent results on the training set, with a sensitivity of 94.80%, a specificity of 93.98%, a precision of 94.04% and an accuracy of 94.39%, and it performs well compared with several other state-of-the-art methods. CONCLUSION: In terms of classification performance, our method performs better than 8 state-of-the-art approaches and better than human observers. Therefore, the proposed method is effective in the detection of Alzheimer’s disease.


Introduction
Alzheimer's disease (AD) is a clinical syndrome characterized by progressive deterioration of cognitive and memory abilities. The clinical manifestations are generally memory loss, language impairment, and cognitive decline, and they appear mostly in old age. However, the disease is not exclusive to the elderly: it may begin slowly as much as 20 years before symptoms appear. According to statistics, a new case of Alzheimer's disease occurs somewhere in the world every 3 seconds. The incidence of AD is closely linked with age. The prevalence is 5% among people over 65 years old, rises to 10% among people over 80, and is as high as 50% by the age of 90. That is to say, about half of people aged 90 and above have Alzheimer's disease.
[...] completely lost; patients can no longer take care of themselves, depend entirely on a caregiver, and eventually die. Although the incidence of Alzheimer's disease is so high, there is still no effective cure, and only drugs and nursing care can be used to slow down the condition. The disease can, however, be guarded against in advance. There are six main aspects of preventing Alzheimer's disease: diet, sleep, exercise, social interaction, stress reduction, and brain use. For treatment, medications such as antidepressants, anti-anxiety drugs, and antipsychotics can be used for symptomatic relief, and they can be combined with cognitive rehabilitation therapy, vocational training, and cognitive training to slow down the condition. Since Alzheimer's disease cannot be cured, early diagnosis is particularly important.
Scholars have proposed many methods based on computer vision and artificial intelligence for medical image classification, which can generally be divided into traditional machine learning methods and deep learning methods. Traditional machine learning methods for medical image classification mainly include the artificial neural network (ANN), the genetic algorithm (GA), and the support vector machine (SVM). Macedo Firmino et al. [1] extracted two-dimensional textures and used an SVM classifier to estimate the likelihood of malignancy in lung cancer with some success, but the system does not handle cases with severe lesions well. Michael C. Lee et al. [2] used a two-step supervised learning method that combines GA and the random subspace method (RSM) in a classifier ensemble algorithm. On 125 lung nodules and 125 pulmonary nodules, it performed better than a single-step approach, but the generality of the model is weak. Yu et al. [3] classified non-small cell lung cancer using fully automated microscopic pathological image features, but the method is not very effective for large cell carcinoma. Suzuki et al. [4] developed a massive-training artificial neural network (MTANN) based on ANN pattern recognition technology, which achieved a sensitivity of 80.3%. These methods have achieved reasonable detection results for medical image classification, but they cannot capture medical image features well, nor can they extract information that carries medical meaning, so their overall classification performance remains limited.
With the continuous advancement of technology, deep learning methods have become the preferred approach for analyzing medical images, providing new ideas for solving the problems of traditional machine learning in medical image classification. Medical image research based on deep learning has therefore received extensive attention from scholars. Esteva et al. [5] used a data set containing 129,450 clinical images to train a deep convolutional neural network. Experimental results showed that the CNN's ability to diagnose skin cancer is comparable to that of dermatologists. Ehsan et al. [6] proposed a deep 3D convolutional neural network (3D-CNN) to classify AD patients. This classifier is superior to conventional classifiers in terms of accuracy and robustness. Bidart et al. [7] developed a fully convolutional neural network (FCN) to automatically locate nuclei in breast cancer tissue slices and then classify them into three categories, benign epithelial (BE), lymphocyte (L) and malignant epithelial (ME) cells, with a final classification accuracy of 94.6%. This shows that deep learning has achieved great success in medical image classification. However, medical images contain far more feature information than natural images, and models adapt poorly to them. Therefore, when constructing deep learning models, we must fully consider the characteristics of medical images in order to build the most suitable model and to locate and extract feature information effectively.
For AD identification specifically, Savio and Grana [8] put forward a deformation-based morphometry (DBM) method. Yuan [9] proposed an eigenbrain (EB) method. Zhang [10] used the displacement field (DF) together with machine learning (ML) methods. Liu et al. [11] extended this to a three-dimensional displacement field (3D-DF) method. Zhang et al. [12] extended the 2D eigenbrain to 3D to form a 3D eigenbrain (3D-EB), combined it with the Pol-SVM classifier, and achieved 92.81% accuracy. Li [13] combined wavelet entropy (WE) and biogeography-based optimization (BBO). Du [14] combined the pseudo Zernike moment (PZM) and linear regression classification (LRC) for AD diagnosis. Gorriz [15] presented a predator-prey particle swarm optimization (PPPSO). Jiang and Chang [16] combined batch normalization and dropout in a convolutional neural network. Zhou [17] presented a novel network with the Convolutional Block Attention Module (CBAM).
In this paper, we analyze the characteristics of Alzheimer's disease images and construct a five-layer convolutional neural network (5L-CNN) that automatically extracts effective features from the image as accurately as possible. To address the small size of medical image data sets, we use data augmentation (DA) methods to expand our data set and achieve better training results.
The rest of this paper is organized as follows. In the second section, we describe the dataset used and introduce the pre-processing steps. In the third section, the method we used is explained in detail. The experimental results and discussion are given in Section 4. Finally, the fifth section summarizes the paper.

Subjects
The brain imaging data come from two sources. (i) The first is the "Open Access Series of Imaging Studies (OASIS)". This dataset contains cross-sectional data from 416 male and female subjects aged 18 to 96. For each subject, 3 or 4 individual T1-weighted MRI scans obtained in a single scan session are included. The subjects are all right-handed. Because the clinical data of some subjects were incomplete, we finally selected 126 subjects with complete records, of which 28 were AD patients and 98 were healthy control (HC) subjects. (ii) The second is four local hospitals: Zhong-Da Hospital of Southeast University, Children's Hospital of Nanjing Medical University, First Affiliated Hospital of Nanjing Medical University, and Affiliated Nanjing Brain Hospital of Nanjing Medical University. We selected 70 AD patients from them to balance the data set. The inclusion criteria were: patients who meet the diagnostic criteria of AD, have a Clinical Dementia Rating (CDR) score of 1 to 2, are over 60 years old, and have a Mini-Mental State Examination (MMSE) score of less than 24. The scoring criteria are shown in Table 1 and Table 2. The exclusion criteria were contraindications to MR scanning, incomplete clinical data, and mental illness. The study was approved by the medical ethics committee of the local hospital, and the patients and their families were informed and signed the informed consent before the study.
The 70 AD patients from the local hospitals were scanned with the MR Prisma 3.0 magnetic resonance instrument produced by Siemens, Germany. Before the scan, the procedure was explained to the patients in detail. All patients lay on their backs, and earplugs were used to reduce the impact of noise from the scanning equipment. The imaging parameters of the scan were: TE = 2.48 ms, TR = 1900 ms, TI = 900 ms, FA = 9°, FOV = 256 mm × 256 mm, matrix size = 256 × 256, slice thickness = 1 mm. Through the MP-RAGE sequence, a total of 176 sagittal slices covering the whole brain were obtained.
The OASIS data are combined with the local hospital data to form our dataset. It contains 196 images in total: 98 AD and 98 HC. The detailed information is listed in Table 3.

Preprocessing
The purpose of preprocessing the data is to output the region of interest (ROI), because the data are inevitably affected by external interference and may lose characteristic information [18]. Preprocessing enhances the image quality and highlights image details, which leads to better experimental results. Fig. 3 shows two samples of preprocessed data, of which (a) shows an AD sample and (b) shows a healthy control sample. Fig. 2 shows a schematic diagram of preprocessing. First, the brain extraction tool is used to extract the brain; then FLIRT and FNIRT are used for spatial normalization, and the result is smoothed with a Gaussian kernel. We choose the hippocampus as the slice selection criterion because, compared with HC, the hippocampus of AD patients is relatively small. Slice selection is implemented in the MNI space at Z = −22 mm. Because our data come from different sources, the gray-level distribution of the images may be uneven, so we perform histogram stretching to standardize the scanned images. We denote the original data set as $E_1$, which consists of 98 AD and 98 HC images. The i-th image $e_1(i)$ becomes a new image $e_2(i)$ through histogram stretching:

$$e_2(i) = \frac{e_1(i) - e_{\min}(i)}{e_{\max}(i) - e_{\min}(i)}$$

where $e_{\min}(i)$ and $e_{\max}(i)$ represent the minimum (0%) and maximum (100%) intensity values of the original image. In practice, the 5% and 95% intensity values are used instead of 0% and 100%, because pixels with the minimum and maximum values are more susceptible to noise. Finally, we obtain the histogram-stretched data set $E_2$.
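The histogram stretching step can be illustrated with a minimal NumPy sketch. It assumes the 5% and 95% intensity values are taken as percentiles of the image, and it clips values outside that range to [0, 1]; the clipping and the function name `histogram_stretch` are assumptions made for illustration and are not stated in the paper.

```python
import numpy as np


def histogram_stretch(image, low_pct=5.0, high_pct=95.0):
    """Linearly rescale intensities so the low/high percentiles map to 0 and 1.

    Hypothetical helper: clipping to [0, 1] is an assumption, not the paper's rule.
    """
    image = np.asarray(image, dtype=np.float64)
    lo, hi = np.percentile(image, [low_pct, high_pct])
    if hi <= lo:
        # Flat image: nothing to stretch.
        return np.zeros_like(image)
    stretched = (image - lo) / (hi - lo)
    # Intensities below the low percentile or above the high one saturate.
    return np.clip(stretched, 0.0, 1.0)


if __name__ == "__main__":
    demo = np.arange(101.0)  # synthetic intensities 0, 1, ..., 100
    out = histogram_stretch(demo)
    print(out.min(), out[50], out.max())
```

Using percentiles rather than the raw minimum and maximum keeps a few extreme, possibly noisy pixels from compressing the useful intensity range.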

Methodology
In this section, we will elaborate on the convolutional network and data augmentation methods we used.

Convolutional Neural Network
A convolutional neural network (CNN) usually consists of two stages [19,20]: feature extraction and classification. As shown in Fig. 4, the feature extraction stage consists of alternating convolutional layers (CL) and pooling layers (PL), and the classification stage consists of fully connected layers (FCL) and a softmax layer.

Convolutional Layer
The most basic operation in a CNN is convolution, and a CL performs 2D convolution along the width and height directions [21]. Assume there are an input matrix, a kernel, and an output, whose sizes are defined as [22,23]

$$\text{Input}: X_I \times Y_I \times C, \qquad \text{Kernel}: X_F \times Y_F \times C, \qquad \text{Output}: X_O \times Y_O \times K$$

where X and Y represent the length and height of the matrix [24], C represents the number of input channels, the subscripts I, F and O represent the input, filter and output respectively, and K represents the number of filters [25]. The outputs of the K filters are stacked together, and the output length and height are

$$X_O = 1 + \left\lfloor \frac{2D + X_I - X_F}{A} \right\rfloor, \qquad Y_O = 1 + \left\lfloor \frac{2D + Y_I - Y_F}{A} \right\rfloor$$

where A represents the stride size and D the margin. Therefore, the activation map has size $X_O \times Y_O \times K$, as shown in Fig. 5.
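As a worked illustration of the output-size rule (one plus the floor of (2 × margin + input size − filter size) / stride), the following sketch computes the activation map shape. The function names and the example values (a 256 × 256 input, 3 × 3 filters, 32 filters) are hypothetical and are not taken from the paper's architecture table.

```python
def conv_output_size(size_in, size_filter, stride=1, margin=0):
    """Output length along one axis: 1 + floor((2*margin + in - filter) / stride)."""
    if stride <= 0:
        raise ValueError("stride must be a positive integer")
    if 2 * margin + size_in < size_filter:
        raise ValueError("filter is larger than the padded input")
    return 1 + (2 * margin + size_in - size_filter) // stride


def activation_map_shape(x_in, y_in, x_f, y_f, num_filters, stride=1, margin=0):
    """Shape (X_O, Y_O, K) of the stacked output of K filters."""
    return (
        conv_output_size(x_in, x_f, stride, margin),
        conv_output_size(y_in, y_f, stride, margin),
        num_filters,
    )


if __name__ == "__main__":
    # Hypothetical layer: 3 x 3 filters, stride 2, margin 1, 32 filters.
    print(activation_map_shape(256, 256, 3, 3, num_filters=32, stride=2, margin=1))
```

With stride 1 and a margin of 1, a 3 × 3 filter leaves the spatial size unchanged, which is the usual reason for padding a convolutional layer.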

Activation function
A bias is added to the stacked output, which is then passed through a nonlinear activation function to obtain the output activation map. Traditional shallow neural networks use the sigmoid activation function [26], which is defined as

$$f_{\text{sig}}(x) = \frac{1}{1 + e^{-x}}$$

The sigmoid function maps a continuous real-valued input to an output between 0 and 1. In deep neural networks [27,28], gradients may explode or vanish during backpropagation. With sigmoid, the probability of gradient explosion is very small, but the probability of gradient vanishing is relatively large. In addition, its analytical formula contains an exponential, which is relatively time-consuming to compute. For large-scale deep networks, this greatly increases the training time. To solve this problem, the rectified linear unit (ReLU) has become popular. It not only alleviates the vanishing gradient problem, but is also very fast to compute: it only needs to judge whether the input is greater than 0, and its convergence is much faster than that of sigmoid. Therefore, we choose the ReLU activation function. It is defined as

$$f_{\text{ReLU}}(x) = \max(0, x)$$

However, the output of ReLU is not zero-centered, and in the dead ReLU problem certain neurons are never activated [29], so the corresponding parameters are never updated. Therefore, the leaky ReLU (LReLU) has been proposed, which keeps the advantages of ReLU while mitigating the zero-centering and dead ReLU problems. LReLU is a variant of ReLU with a small positive slope $\alpha$ for negative inputs. It is defined as

$$f_{\text{LReLU}}(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases}$$

However, there is no firm conclusion on the effect of LReLU: it performed well in some experiments but not in others. The difference between the sigmoid and ReLU activation functions is shown in Fig. 6.

Pooling Layer
Because the activation map in a CNN contains many features of the image, it is usually very large, which may lead to excessive computational cost and overfitting.
Therefore, we need to use a pooling layer (PL) to reduce the size of the activation map, perform feature dimensionality reduction [30], reduce overfitting, and improve the fault tolerance of the model. In addition, a PL provides a degree of invariance to rotation, scale and translation.
The common pooling operations are average pooling (AP) and maximum pooling (MP). We first define a spatial neighborhood (window). AP takes the average value of the rectified feature map within the window as the pooled value of the region, and MP takes the maximum value of the rectified feature map within the window as the pooled value of the region. Fig. 7 shows two examples of pooling operations. A feature map is divided into 4 regions; that is, the pooling window has a size of 2 × 2 and a stride of 2.
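The two pooling operations can be sketched in NumPy as follows. This is a plain loop-based illustration, assuming a single-channel feature map and the 2 × 2 window with stride 2 described above; the function name `pool2d` and the sample values are hypothetical.

```python
import numpy as np


def pool2d(feature_map, window=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2D feature map (illustrative helper)."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    fm = np.asarray(feature_map, dtype=np.float64)
    height, width = fm.shape
    out_h = (height - window) // stride + 1
    out_w = (width - window) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # Window of the feature map covered by this output position.
            patch = fm[r * stride:r * stride + window,
                       c * stride:c * stride + window]
            out[r, c] = reduce_fn(patch)
    return out


if __name__ == "__main__":
    fm = np.array([[1, 2, 5, 6],
                   [3, 4, 7, 8],
                   [9, 10, 13, 14],
                   [11, 12, 15, 16]])
    print(pool2d(fm, mode="max"))
    print(pool2d(fm, mode="avg"))
```

A 4 × 4 map pooled this way yields a 2 × 2 map, one value per region, so each pooling layer cuts the number of activations by a factor of four.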

Fig. 8 Pooling graph
Both pooling operations have their own pros and cons. Average pooling preserves the background well, but it tends to blur the picture. Maximum pooling retains texture features well and reduces the impact of useless information, but its output value may fluctuate greatly. A comparison of the two operations is shown in Fig. 8. In the left image, the foreground brightness is lower than the background brightness, and maximum pooling fails in that case. In practice, however, most foreground objects are brighter than the background, so maximum pooling tends to be used more often.

Batch Normalization
From a machine learning perspective, if the input distribution of a layer changes, its parameters need to be relearned, which may degrade network performance. This phenomenon is called internal covariate shift (ICS) [31]. To solve this problem, the input distribution of each layer of the network should be kept consistent during training [32,33]. This motivates batch normalization (BN), which normalizes the layer's inputs A over every mini-batch.
For a layer with a mini-batch of size m, the training set A is defined as

$$A = \{a_1, a_2, \dots, a_m\}$$

To ensure that the normalized output $B = \{b_1, b_2, \dots, b_m\}$ has a consistent distribution, we first compute the mean and variance of the mini-batch:

$$\mu_A = \frac{1}{m} \sum_{i=1}^{m} a_i, \qquad \sigma_A^2 = \frac{1}{m} \sum_{i=1}^{m} (a_i - \mu_A)^2$$

and then normalize each input, where $\epsilon$ is a small constant for numerical stability:

$$b_i = \frac{a_i - \mu_A}{\sqrt{\sigma_A^2 + \epsilon}}$$

This standard normalization concentrates the inputs near 0. If the sigmoid activation function is used, this interval lies in its nearly linear region, which weakens the nonlinear transformation of the neural network [34]. So that normalization does not harm the network's representation ability, an additional scaling and translation transformation is used to change the value range:

$$y_i = \gamma b_i + \beta$$

where $\gamma$ and $\beta$ are the parameter vectors of scaling and translation respectively, and both are learnable. This is the whole process of batch normalization. The normalized value $b_i$ remains in the current layer, and the transformed output $y_i$ is passed to the next layer [35,36]. In this paper, we propose to construct the CB block from one convolutional layer and one batch normalization layer.
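The training-time computation of batch normalization (mini-batch mean and variance, normalization, then a learnable scale and shift) can be sketched as follows. The names `batch_norm_forward`, `gamma`, `beta` and the stabilizing constant `eps` are assumptions for illustration; the sketch omits the running statistics that frameworks typically keep for inference.

```python
import numpy as np


def batch_norm_forward(a, gamma, beta, eps=1e-5):
    """Normalize a mini-batch `a` of shape (m, features), then scale and shift.

    Returns (y, b): the transformed output and the normalized values.
    """
    a = np.asarray(a, dtype=np.float64)
    mu = a.mean(axis=0)                 # per-feature mini-batch mean
    var = a.var(axis=0)                 # per-feature mini-batch variance (1/m)
    b = (a - mu) / np.sqrt(var + eps)   # normalized values, kept inside the layer
    y = gamma * b + beta                # learnable scale and shift
    return y, b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch = rng.normal(loc=5.0, scale=2.0, size=(8, 3))
    y, b = batch_norm_forward(batch, gamma=np.ones(3), beta=np.zeros(3))
    print(b.mean(axis=0), b.std(axis=0))
```

Setting `gamma` to the batch standard deviation and `beta` to the batch mean would undo the normalization, which is why the scale and shift preserve the layer's representation ability.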

Dropout Layer
In previous studies, the convolutional layers were used for feature learning, and the fully connected layer (FCL) was used as the classifier. An FCL multiplies its input by a weight matrix and then adds a bias vector to the result [37]. Because it is connected to all neurons in the previous layer, it has a large number of parameters, which may cause overfitting. Many solutions to overfitting have been proposed, among which dropout is simple and very effective. In this paper, we add a dropout layer (DOL) before each FCL to form a dropout fully connected (DOFC) block, which prevents overfitting and improves the generalization ability of the model [38].
During training with the dropout method, neurons are randomly removed from the network with a certain probability [39,40]. Fig. 9 illustrates the behavior of dropout neurons in the training and test stages. Each neuron is kept at random with a retention probability $p_r$ (default 0.5):

$$T(i, j) = \begin{cases} \hat{T}(i, j), & \text{with probability } p_r \\ 0, & \text{otherwise} \end{cases}$$

where $T$ denotes the weights after dropout and $\hat{T}$ the weights before dropout. During the test, the network does not use dropout, but its weights are scaled down by the retention probability, $T = p_r \hat{T}$.
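The dropout behavior in the two stages can be sketched as below. This follows the classic formulation in which units are kept with the retention probability during training and the full network is scaled by that probability at test time; the sketch applies the mask to a layer's activations, which has the same effect as zeroing the outgoing weights of the dropped neurons. The function name and arguments are hypothetical. Many current frameworks instead use "inverted" dropout, which rescales during training and leaves the test pass unchanged.

```python
import numpy as np


def dropout_forward(x, retain_prob=0.5, training=True, rng=None):
    """Classic dropout: random masking in training, scaling by retain_prob at test."""
    if not 0.0 < retain_prob <= 1.0:
        raise ValueError("retain_prob must be in (0, 1]")
    x = np.asarray(x, dtype=np.float64)
    if not training:
        # No units are dropped at test time; outputs are scaled instead.
        return x * retain_prob
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) < retain_prob  # True means the unit is kept
    return x * mask


if __name__ == "__main__":
    activations = np.ones(10)
    print(dropout_forward(activations, rng=np.random.default_rng(0)))
    print(dropout_forward(activations, training=False))
```

Scaling at test time keeps the expected input to the next layer the same in both stages.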

Data Augmentation
In the field of medical imaging, a lack of data is common. Our dataset has only 196 images, which is too few to train the network adequately, so the model cannot achieve the expected performance [41]. To solve this problem, we use data augmentation (DA) [42] to expand the data set. First, the entire data set E is randomly divided into a training set C of 98 images and a test set D of 98 images, as shown in Table 4. A training set of 98 images is too small for the convolutional neural network to converge, so we apply DA to the training set. For each training image $c(i) \in C$, $i = 1, 2, \dots, 98$, we use six types of data augmentation (DA-6): rotation, scaling, random translation, Gamma correction, noise injection and random affine.
(5) Noise injection (NI). Zero-mean Gaussian noise with a variance of 0.01 is injected into each image to generate thirty new images.
Combining the above six DA methods, 180 new images are generated for each original image, i.e. 98 × 180 = 17,640 new images. Together with the 98 original images, our training data set has a total of 17,738 images. The results are shown in Table 5.
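Noise injection and the resulting training-set size can be illustrated with a short NumPy sketch. It assumes image intensities are scaled to [0, 1] and clips the noisy result back to that range; both are assumptions, as are the function names. Only the zero-mean Gaussian noise, the variance of 0.01, the thirty copies, and the counts 98 and 180 come from the text.

```python
import numpy as np


def noise_injection(image, n_new=30, variance=0.01, rng=None):
    """Return n_new noisy copies of `image` using zero-mean Gaussian noise.

    Assumes intensities in [0, 1]; clipping back to that range is an assumption.
    """
    rng = np.random.default_rng() if rng is None else rng
    image = np.asarray(image, dtype=np.float64)
    sigma = np.sqrt(variance)  # standard deviation from the stated variance
    noise = rng.normal(0.0, sigma, size=(n_new,) + image.shape)
    return np.clip(image[None, ...] + noise, 0.0, 1.0)


def augmented_training_size(n_original, new_per_image):
    """Total images after augmentation: generated images plus the originals."""
    return n_original * new_per_image + n_original


if __name__ == "__main__":
    demo = np.full((64, 64), 0.5)  # synthetic mid-gray image
    noisy = noise_injection(demo, rng=np.random.default_rng(0))
    print(noisy.shape)
    print(augmented_training_size(98, 180))
```

The second helper reproduces the bookkeeping in the text: 98 originals with 180 generated images each give 17,640 new images and 17,738 in total.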

Measurement
To verify the performance of our model, we evaluate it on the test set of 98 images. We select sensitivity, specificity [43,44], precision, accuracy, MCC, F1, and FMI as performance indicators, and we obtain them from the confusion matrix (CM). In an ideal confusion matrix, all off-diagonal entries are zero, i.e. no sample is misclassified. We assume that the confusion matrix is

$$\text{CM} = \begin{bmatrix} TP & FN \\ FP & TN \end{bmatrix}$$

where the positive category is AD and the negative category is HC, and TP, FN, FP and TN denote the numbers of true positives, false negatives, false positives and true negatives. From it we obtain the following performance indicators.
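(1) Sensitivity. It represents the proportion of all positive samples that the model correctly judges to be positive, and it measures the classifier's ability to recognize positive samples: $\text{Sensitivity} = \frac{TP}{TP + FN}$.
(2) Specificity. It represents the proportion of all negative samples that the model correctly judges to be negative, and it measures the classifier's ability to recognize negative samples: $\text{Specificity} = \frac{TN}{TN + FP}$.
(3) Precision. It represents the proportion of samples judged positive that are truly positive: $\text{Precision} = \frac{TP}{TP + FP}$.
(4) Accuracy. It represents the proportion of all samples that are classified correctly: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$.
(5) MCC. The Matthews correlation coefficient is $\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$.
(6) F1-score. It is a single indicator that combines precision and sensitivity. When precision and sensitivity are both high, a higher F1-score is obtained. The maximum value is 1 and the minimum value is 0: $F_1 = \frac{2\,TP}{2\,TP + FP + FN}$.
(7) FMI. The Fowlkes-Mallows index is defined as the geometric mean of precision and recall (sensitivity): $\text{FMI} = \sqrt{\frac{TP}{TP + FP} \times \frac{TP}{TP + FN}}$. It is an external indicator of clustering performance. The larger the value, the better the result.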
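The seven indicators can be computed from the four confusion-matrix counts using their standard definitions, as in the sketch below. The function name and the example counts (46, 3, 3, 46) are hypothetical and are not results from the paper.

```python
import math


def classification_metrics(tp, fn, fp, tn):
    """Standard binary-classification indicators from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when its denominator is 0.
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    fmi = math.sqrt(precision * sensitivity)  # geometric mean of precision and recall
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "mcc": mcc,
        "f1": f1,
        "fmi": fmi,
    }


if __name__ == "__main__":
    # Hypothetical counts for a 98-image test set (AD is the positive class).
    for name, value in classification_metrics(tp=46, fn=3, fp=3, tn=46).items():
        print(f"{name}: {value:.4f}")
```

For a perfectly classified test set (no false negatives or false positives), every indicator equals 1.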

CNN
The details of our proposed 5-layer CNN architecture are shown in Table 6. It includes three convolutional layers as the feature extraction stage and two fully connected layers as the classification stage. The padding value of each layer is the same. In the convolution stage, each block is named "Convolution Batch Normalization (CB)". In the fully connected stage, each block is named "Dropout fully connected (DOFC)". Here the parameter $a \times b / c$ represents a filter of size $a \times b$ and a pooling size of c. The last column of the table shows the size of the activation map (AM), as shown in Fig. 11.

Statistical Analysis
In our experiment, the original 98 images and the 17,640 images generated by data augmentation, 17,738 images in total, were used to train our model. To evaluate the performance of our 5L-CNN model, we obtain the sensitivity, specificity, precision, accuracy, F1, MCC, and FMI indicators from the confusion matrix. We perform ten runs of 5L-CNN; the results of the ten runs are shown in Table 7 and Fig. 12. The overall performance of our method is good: an accuracy of 94.39% is achieved, which shows the effectiveness of our 5L-CNN.

Comparison with State-of-the-art Approaches
To further demonstrate the effectiveness of our 5L-CNN method, we compare it with several state-of-the-art approaches: DBM, EB, DF-ML, 3D-DF, 3D-EB, WE-BBO, PZM-LRC, and PPPSO. The detailed test-set results of the ten runs are shown in Table 8. Fig. 13 [...] 1.91% [...] 3D-EB. We use data augmentation to expand our dataset so that the model is better trained, and we use batch normalization and dropout layers to improve the stability and generalization ability of the model. As a result, our method is superior to the other methods. In the future, our method may be applied to other fields.

Conclusion
In this research on AD diagnosis, we proposed the 5L-CNN. We use DA-6 to address the small size of the dataset and add batch normalization and dropout layers to the network: each convolutional layer is combined with a batch normalization layer to form a CB block, and DOFC blocks replace the traditional fully connected layers. Our method also has shortcomings. Although the dataset is expanded by data augmentation, it is still relatively small, and some newer network techniques have not been tested. In future research, we can use more advanced data augmentation techniques and try transfer learning or pretrained models.