Towards exploration and evaluation of sleep staging classification schemes for healthy and patient subjects

INTRODUCTION: Sleep stage classification is an important task for the timely diagnosis of sleep-related disorders, which are among the most common indicators of illness. OBJECTIVE: An automated sleep scoring implementation with promising generalization capabilities is presented, aiming to eliminate the tedious procedure of manual sleep scoring. METHODS: Two Electroencephalogram (EEG) channels and the Electrooculogram (EOG) channel are utilized as inputs for feature extraction in both the time and frequency domains, while temporal feature changes are utilized in order to capture contextual information of the signals. An ensemble tree-based approach and a neural network approach are presented for the classification process. RESULTS: A total of 66 subjects belonging to three different groups (healthy, placebo, drug intake) were included in the study. The tree-based classification method outperforms the neural network in all cases. CONCLUSION: State-of-the-art results are achieved, and it is highlighted that jointly using the healthy and patient subject datasets boosts the model's accuracy and generalization capability. Received on 30 June 2020; accepted on 01 October 2020; published on 19 October 2020


Introduction
Sleep is a state characterized by loss of consciousness and greatly reduced responsiveness to external stimuli. It is distinguished from coma or anaesthesia by its rapid reversibility [1,2]. The quality of sleep affects the maintenance of a person's wellbeing and cognitive performance, as well as mood and behavior. Sufficient sleep is an essential ingredient for good health. Manual sleep scoring is a task that requires considerable work, and it is affected by the scorer's experience and fatigue. The percentage of agreement between two human scorers is often below 90%, highlighting the need for automated scoring [3,4]. Sleep scoring follows specific rules, based on predefined standards. Two of the most widely used standards are the Rechtschaffen and Kales (R&K) rules [5], published in 1968, and a more recent guideline revised in 2012 by the American Academy of Sleep Medicine (AASM) [6]. The current study is based on the AASM standard, which includes the following sleep stages that alternate cyclically every 90 to 110 minutes during a human's sleep [7,8]:
• Awake stage (W) is characterized by alpha frequency bands as well as frequent eye movements.
• N1 stage corresponds to light sleep. It is characterized by alpha or faster frequency bands occupying more than 50% of the epoch, while theta activity and slow eye movements are evident.
• In N2, the eyes stop moving, the brain waves become slower and sleep spindles or K-complexes are noted.
• N3 corresponds to deep sleep, where no eye movement or muscle activity exists, while delta activity is detected in over 20% of the epoch length.
• In the Rapid Eye Movement (REM) stage, the breathing rate increases and the eyes move rapidly.
According to the standard, the scorer must assign a sleep stage every 30 seconds for the whole duration of a subject's sleep (7-8 hours). The process cannot be easily repeated due to its high cost, the subject's inconvenience and the demanding nature of the scorer's task. Moreover, polysomnography (PSG) recordings may differ considerably from person to person depending on age, health condition, sleep condition etc. It is common for even the same person to have different PSG recordings on different days, mainly due to differences in health and sleeping condition. Machine learning algorithms have been applied to the sleep scoring problem for many years, offering automated sleep scoring services. Thanks to the great progress made in recent years in computational power, more complex deep learning approaches have been employed, delivering promising results. Despite the above, the vast majority of sleep physicians remain skeptical of sleep scoring algorithms, preferring the tedious task of manual scoring over the automated process [9]. From the machine learning point of view, sleep staging is an imbalanced classification problem. Class imbalance is the main cause of the differing predictive accuracy across classes.
The current study extends the work presented in [10] in terms of the number of subjects used in the experiments, the classification methods, as well as the comparison of two different evaluation methods.
The rest of the paper is organized as follows: The related work is presented in Section 2, while Section 3 describes the data and methods that were implemented. Section 4 presents and analyzes the experimental results. Section 5 discusses what has been presented in the paper and proposes potential future work. Finally, conclusions are drawn in Section 6.

Related work
Automated sleep stage scoring is a topic that is gradually getting more attention in the literature, as increasing computational power makes more complex and computationally expensive models feasible. Although access to sleep lab data is restricted, a number of recordings have been made available in online repositories [11][12][13], serving as research datasets for a number of studies. The data sampling frequency usually ranges between 100 and 256 Hz, while a whole night's sleep recording lasts for about 8 hours, resulting in a massive amount of data. As a result, most studies use a relatively small number of sleep subjects, sometimes fewer than 10.
A review of the literature shows that benchmarking and comparing different studies is a challenging task, since a number of parameters are determined by the authors when defining the problem. First of all, the available datasets comprise different subjects in terms of age, sex, sleep disorders etc. Some of the most well-known datasets are available online, like Sleep-EDFx [11], the CAP sleep dataset [12] and the St. Vincent's University Hospital / University College Dublin Sleep Apnea Database [14]. There are also datasets provided by sleep clinics or universities that are not freely accessible, like [15]. All of the datasets contain multiple subjects, and each study usually selects a subset of those subjects, ignoring the rest. Sleep is affected by many factors such as age or health condition, so different subjects may provide results that vary significantly even if the same algorithm is used. Some datasets may also contain both healthy and patient subjects, while patients, depending on the dataset, may suffer from sleep apnea, insomnia, mild disorders, REM sleep behavior disorder etc. The results that derive from patients with different disorders are not directly comparable with each other. Another issue is that a PSG may include different EEG channels, and potentially EOG and EMG. Even if the most common data source is used, namely EEG, data may derive from different channels in terms of the electrode positioning on the subject's head. In some cases, the data sampling rate varies as well. The ground truth of each dataset may follow a specific standard, but as already pointed out in the introduction, there is a large percentage of disagreement between experts.
Taking a closer look at the approaches followed by sleep staging studies, the feature extraction process seems to include three main types of features: i) time-domain, ii) frequency-domain, iii) non-linear features. Time-domain features include statistical features such as first, second and higher order statistics of the raw time-series signal [16,17]. The Fourier transform (FT) has been extensively used for feature extraction in the frequency domain [18,19]. Recent studies tend to use the wavelet transform (WT) more often, as it has a strong advantage compared to FT: it is able to localize the frequency components in the time domain [19,20]. Non-linear features are commonly used with EEG data since they are able to portray the non-linear dependencies of different parameters associated with EEG [21].
Another key point that can be used to fairly distinguish between the studies found in the literature is the method used to evaluate the suggested model. Two methods are predominantly used. In the context of this study, the first method is named Internal Subjects Evaluation and the second External Subjects Evaluation. In internal evaluation, a part of each subject's sleep is used for training and the rest is kept for testing. In external evaluation, the model is tested on totally unseen subjects. The type of testing is not explicitly declared by all studies, as cross-validation may imply the first or the second method, depending on whether the tested subjects are completely kept out of the training process (leave-one-out cross validation).
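The difference between the two evaluation methods can be illustrated with a minimal Python sketch. The subject identifiers, the toy epochs and the helper names `internal_split` and `external_split` are hypothetical and are not taken from the study; the sketch only shows that an epoch-wise split lets the same subject appear on both sides, while a subject-wise split does not.

```python
import random

def internal_split(epochs, test_fraction=0.2, seed=0):
    """Epoch-wise split: a subject may contribute to both train and test."""
    rng = random.Random(seed)
    shuffled = epochs[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def external_split(epochs, test_subjects):
    """Subject-wise split: held-out subjects never appear in training."""
    held_out = set(test_subjects)
    train = [e for e in epochs if e["subject"] not in held_out]
    test = [e for e in epochs if e["subject"] in held_out]
    return train, test

# Toy data: 3 hypothetical subjects with 10 epochs each.
epochs = [{"subject": s, "epoch": i} for s in ("s1", "s2", "s3") for i in range(10)]

train_int, test_int = internal_split(epochs)
train_ext, test_ext = external_split(epochs, test_subjects=["s3"])

# Subjects that appear on both sides of each split.
shared_int = {e["subject"] for e in train_int} & {e["subject"] for e in test_int}
shared_ext = {e["subject"] for e in train_ext} & {e["subject"] for e in test_ext}
```

In this toy case `shared_int` is non-empty while `shared_ext` is empty, which is the property that separates the two evaluation schemes.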
Starting with shallow machine learning approaches, in [27] an HMM is used in an attempt to model the transitions between sleep stages. It is deduced that predictive accuracy varies among sleep stages. Stage 1 is underrepresented, as it represents a small part of a whole night's sleep. Its accuracy is usually lower compared to the other stages. In that study it is below 50%, while the remaining stages are close to 90%. The same issue regarding the accuracy of sleep stage 1 is also noticed in [16], where a set of statistical features of the Pz-Oz electrode are fed into an SVM classifier. In [18], frequency domain features such as spectral energy band, central frequency, bandwidth and Itakura Distance are evaluated in the context of the sleep stage classification task. The extraction of multiple features in the time and frequency domain may lead to models that have better generalization capabilities. In [30], EEG, EOG and EMG signals are converted into the frequency domain and band features are extracted. An MLP classifier outperformed other approaches, but the accuracy obtained remains relatively low, as the study was conducted on patients with sleep apnea. A comparison between FT and WT is attempted in [19]. It is found that WT is more efficient, mainly because EEG signals are non-stationary, so small changes may not be captured by FT and the analysis may change depending on the length of data. A method based entirely on WT performed on the EOG signal is introduced in [31]. Db4 is selected as the mother wavelet and features are extracted from the WT's detail and approximation coefficients. SVM and tree-based under-sampling boosting classifiers were used, and internal validation was carried out. A comparison of probabilistic classifiers on healthy and patient subjects, using external validation, is analyzed in [15].
Conditional Random Fields (CRF) classifiers prove to be superior, providing moderate sleep stage classification results for patients with apnea and outperforming earlier work. However, the low predictive accuracy of stage 1 is also highlighted by the authors. The authors of [25] put more emphasis on the feature selection process, performing feature selection using mRMR after extracting feature importances with the Kruskal-Wallis statistical test. Very high accuracy is obtained for the 6-class and 5-class sleep stage classification problems, as a result of the internal validation. Moreover, the wake stage is not discarded from the EEG recordings, boosting the model's accuracy. An alternative approach is presented in [23], where graph domain features are extracted from an EEG channel. The signal segments are mapped into visibility graphs, whose features are fed into an SVM classifier evaluated with internal validation.
Deep learning approaches have been extensively tried out in the field of sleep stage classification, as the majority of the published studies over the last few years are based on ANNs. The first study that utilized WT for feature extraction relied on a feedforward NN for the classification task [20]. Computational power capabilities at that time only allowed the existence of one hidden layer with 10 neurons, while the input layer comprised 13 neurons. The method provided relatively poor results but paved the way for future deep learning applications in the sleep staging field. Coming to more recent research, in [28] the implementation of a complex-valued CNN is examined. The selection of a CNN is backed by the claim that the construction of handcrafted features able to reveal information about sleep patterns is a process that requires an experienced domain expert. The complex CNN outperforms classical CNN approaches, but poor accuracy for stage N1 remains a problem. The authors of [29] also present a CNN approach, utilizing the Fractional Discrete Fourier Transform (F-DFT) in order to fully utilize the local frequency domain information of EEG signals. The Wavelet Transform is also used in an effort to depict the low frequency structure information of local signals and better classify deep sleep. State-of-the-art performance is achieved, but the model is tested only with internal validation. A comparison of three different NN classifiers is introduced in [32]. A recurrent classifier, a feedforward NN and a probabilistic NN are compared. As expected, the temporal information enclosed in the PSG recording time-series data boosts the recurrent model's accuracy, making it by far more accurate than the others. The same logic also applies to [33], where LSTM is utilized for classification. The most encouraging finding after the internal validation process is the improvement of stage 1 predictive accuracy compared to other methods.

Data Acquisition and Preprocessing
The dataset used in this paper is the Sleep-EDF Database [Expanded], which is publicly available online from PhysioNet [14]. The database contains 197 PSGs with accompanying hypnograms, acquired from two studies. The first one, named SC*, is a study of age effects on sleep; it contains 153 recordings from healthy Caucasian males aged 25-101. The second one, named ST*, is a study of temazepam effects on sleep in 22 Caucasian subjects without other medication; the subjects in that study had mild difficulty falling asleep but were otherwise healthy. Each subject was recorded for 2 nights, one of which was after temazepam intake and the other after placebo intake.
The recordings contain two EEG channels (from the Fpz-Cz and Pz-Oz locations), EOG (horizontal), submental chin EMG, and an event marker. EEG and EOG signals were sampled at 100 Hz. Each recording was scored by well-trained experts according to the R&K manual, but based on Fpz-Cz/Pz-Oz EEGs instead of the suggested C4-A1/C3-A2 (Fig. 1) [5]. Annotations of every 30-second epoch contain 1, 2, 3, 4, W, R, M and ?, which represent stages S1, S2, S3, S4, Awake, REM, 'Movement' and 'not scored' respectively. The hypnograms are converted to the AASM scoring standard for the needs of the current study. Movement and not-scored epochs were completely removed from the dataset.
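The conversion of the annotations can be sketched as follows. The text states only that the hypnograms are converted to the AASM standard and that movement and not-scored epochs are dropped; the exact mapping below, in particular merging S3 and S4 into N3, is the usual convention and is assumed here rather than taken from the study. The names `RK_TO_AASM` and `convert_hypnogram` are hypothetical.

```python
# Assumed mapping from R&K annotation symbols to AASM stage names
# (S3 and S4 are merged into N3, following common practice).
RK_TO_AASM = {"W": "W", "1": "N1", "2": "N2", "3": "N3", "4": "N3", "R": "REM"}

# Movement ('M') and not-scored ('?') epochs are removed, as in the text.
DISCARDED = {"M", "?"}

def convert_hypnogram(rk_labels):
    """Return (epoch_index, aasm_stage) pairs, skipping discarded epochs."""
    converted = []
    for index, label in enumerate(rk_labels):
        if label in DISCARDED:
            continue
        converted.append((index, RK_TO_AASM[label]))
    return converted

# Toy hypnogram with one annotation per 30-second epoch.
rk = ["W", "1", "2", "3", "4", "M", "R", "?"]
aasm = convert_hypnogram(rk)
```

Keeping the original epoch index alongside the stage makes it possible to drop the matching signal segments when the discarded epochs are removed.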
The subjects were separated into four different groups (Table 1). The first group (A) contains totally healthy subjects, obtained from the SC* study of the Sleep-EDFx dataset. The second and third groups contain subjects from the ST* study, who have difficulty falling asleep. The second group (B) contains the placebo intake nights, while the third group (C) corresponds to the temazepam intake nights. Finally, in the fourth group (D) all three groups described above are joined into one group containing both healthy and patient subjects.
Temazepam belongs to the class of medications called benzodiazepines and it is used for the treatment of short-term sleeping problems. The effects of temazepam on human EEG have been studied by several researchers in the past. In [34], twenty healthy males aged 21-26 years were recorded both for placebo and temazepam intake nights. It was found that, compared to the placebo condition, temazepam significantly reduced the interval between lights-off and the first occurrence of stage 2 NREM sleep. Moreover, total sleep time was significantly longer in the temazepam condition and, comparing the first 6 hours of sleep for the two nights, it was noticed that temazepam significantly reduced REM sleep but did not reduce slow-wave sleep or stage 4 NREM sleep (R&K scoring). A similar study [35], which was also conducted with healthy volunteers on placebo and medication nights, detected changes in the recorded EEGs using mean power density spectra and t tests. The distinction between placebo and temazepam nights was absolutely clear. It is consequently deduced that the separation between placebo and drug intake nights is meaningful, as temazepam changes the EEG characteristics and sleep structure.
In the current study, the sleep scoring standard of the American Academy of Sleep Medicine was used [6], as this is the standard that is followed by most of the recent studies. An inconsistency between AASM

Feature Extraction
The fact that raw PSG signals are non-stationary and their statistics change over time is taken into account during the feature extraction process. Time-domain analysis is not sufficient, so most studies also use frequency, time-frequency and non-linear features [36].
For the current study, the features were extracted at epochs of 30 seconds and they are briefly presented below. Time domain features are extracted in order to capture information regarding how the signal changes over time. The first and second moment statistics, namely the mean value (eq. 1) and the variance (eq. 2), measure the epoch's average value and the spread of the data points around this value. Moving to higher order statistics, skewness (eq. 3) measures the asymmetry of a distribution. The skewness of a normal distribution is zero, while positive and negative skewness indicate that the data are skewed right and left respectively. Kurtosis (eq. 4) is a statistical measure that describes the degree to which the data points are concentrated around the peak or the tails compared to the normal distribution. All of the statistical features described above are extracted both for the EEG and EOG channels.
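The four statistical features can be sketched in plain Python as below. The equations referenced in the text are not reproduced here, so the specific estimators are assumptions: population (biased) moments are used, and kurtosis is reported in its non-excess form, for which a normal distribution gives 3. The function name `time_domain_features` is hypothetical.

```python
import math

def time_domain_features(x):
    """Mean, variance, skewness and kurtosis of one epoch (non-constant signal)."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    # Population variance (divide by n); an assumption, not the paper's eq. 2.
    variance = sum(c * c for c in centered) / n
    std = math.sqrt(variance)
    # Standardized third and fourth central moments.
    skewness = sum(c ** 3 for c in centered) / (n * std ** 3)
    kurtosis = sum(c ** 4 for c in centered) / (n * std ** 4)
    return {"mean": mean, "variance": variance,
            "skewness": skewness, "kurtosis": kurtosis}

# Toy epoch; a real 30-second epoch at 100 Hz would hold 3000 samples.
features = time_domain_features([1.0, 2.0, 3.0, 4.0, 5.0])
```

For the symmetric toy epoch the skewness is zero, as expected for data without left or right skew.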
Entropy (eq. 5) is a measure of randomness describing the lack of order or predictability. High entropy denotes a stochastic process that does not form a specific pattern.
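As an illustration, one common choice is the Shannon entropy of the amplitude histogram of an epoch. The text does not state which entropy definition or binning is used, so both the definition and the `n_bins` parameter below are assumptions, and `shannon_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter

def shannon_entropy(x, n_bins=10):
    """Shannon entropy (in bits) of the amplitude histogram of a signal."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return 0.0  # a constant signal is perfectly predictable
    width = (hi - lo) / n_bins
    # Assign each sample to a bin; the maximum value is clamped into the last bin.
    bins = Counter(min(int((v - lo) / width), n_bins - 1) for v in x)
    n = len(x)
    return -sum((c / n) * math.log2(c / n) for c in bins.values())

# Two equally likely amplitude levels carry exactly one bit of entropy.
two_level = shannon_entropy([0.0, 0.0, 1.0, 1.0], n_bins=2)
flat = shannon_entropy([3.0, 3.0, 3.0])
```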
In terms of spectral content, EEG waveforms are characterised by components belonging to five different frequency bands (Fig. 2). The EEG signal energy in each of these frequency bands is calculated via FT. Signal energy has been successfully used as a feature in many machine learning problems related to EEG analysis, from classification of sleep stages [18] to epilepsy detection, human emotion recognition [37] and cognitive performance [38].
Power Spectral Density (PSD) is also a frequency domain feature, describing how the power of a signal or a time-series is distributed over frequency. The power is defined as the squared value of the signal. The unit of PSD is energy per frequency, and it is computed directly from the FT.
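A minimal sketch of band energies computed from the Fourier transform is given below. The band limits are common textbook values and are assumptions, since the limits shown in the referenced figure are not reproduced in the text. A naive discrete Fourier transform is used to keep the example dependency-free; in practice an FFT routine would be used, and the normalisation of the power (here |X_k|^2 / n) is one of several conventions.

```python
import cmath
import math

# Assumed band limits in Hz (lower bound inclusive, upper bound exclusive).
BANDS = {"delta": (0.5, 4.0), "theta": (4.0, 8.0), "alpha": (8.0, 13.0),
         "beta": (13.0, 30.0), "gamma": (30.0, 50.0)}

def band_energies(x, fs):
    """Sum the periodogram values |X_k|^2 / n that fall inside each band."""
    n = len(x)
    energies = dict.fromkeys(BANDS, 0.0)
    for k in range(1, n // 2 + 1):
        freq = k * fs / n  # frequency of DFT bin k in Hz
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power = abs(coeff) ** 2 / n
        for name, (lo, hi) in BANDS.items():
            if lo <= freq < hi:
                energies[name] += power
    return energies

# Toy signal: 2 seconds of a 10 Hz sine sampled at 100 Hz.
fs = 100
signal = [math.sin(2 * math.pi * 10 * t / fs) for t in range(2 * fs)]
energies = band_energies(signal, fs)
```

Because the toy signal oscillates at 10 Hz, practically all of its energy lands in the alpha band.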
Fractal dimension is a ratio that provides a statistical index of complexity, comparing how detail in a pattern changes depending on the scale at which it is measured. Petrosian Fractal Dimension (PFD) is a feature extracted with the PyEEG library [39], which is an open-source Python module for EEG/MEG feature extraction.
Lag features of PSG recordings. PSG recordings are essentially time-series data, so extraction of temporal information is expected to boost our models' accuracy. This concept has been successfully implemented in the literature, mostly in deep learning models that used LSTM layers [40,41]. The integration of that information into a static model is feasible through the generation of time-delay features from the original ones. These so-called lagged features are feature vectors containing data from previous time steps. The batches of lag features are finally concatenated with the original features, eventually shaping the dataset used for training (Fig. 3).
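The construction of lag features can be sketched as follows. The helper name `add_lag_features` is hypothetical, and the handling of the first epochs, which lack a full history, is an assumption: here the earliest available epoch is repeated as padding, whereas the study may have handled them differently.

```python
def add_lag_features(features, n_lags):
    """Append the feature vectors of the n_lags previous epochs to each epoch.

    `features` is a list of per-epoch feature vectors in time order,
    taken from a single recording.
    """
    rows = []
    for t in range(len(features)):
        row = list(features[t])
        for lag in range(1, n_lags + 1):
            # Epochs without enough history reuse the first epoch (assumption).
            row.extend(features[max(t - lag, 0)])
        rows.append(row)
    return rows

# Toy recording: 4 epochs with 2 features each.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
X_lagged = add_lag_features(X, n_lags=2)
```

Each output row holds the current features followed by those of the previous epochs, so a static classifier sees a short window of context. Lags should be built per recording so that epochs from different nights or subjects are never mixed.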
Feature selection. In Fig. 4 the feature importance of the 252 extracted features is presented, utilizing the built-in method of Extreme Gradient Boosting (XGBoost). Feature importance represents how much each feature contributes to decreasing node impurity, weighted by the probability of reaching that node. The more a feature is used for decision making in a tree, the higher its relative importance. The final value for each feature is calculated by averaging importances across

Classification Algorithms
Two classification approaches with different characteristics are compared. The first approach is a static model based on decision trees and the boosting of weak learners, called XGBoost. High speed and performance make XGBoost stand out among other ensemble methods. Since this is a static model, temporal information is incorporated into the model using the lag features method described above. The second approach is based on NNs and more specifically Recurrent NNs. The LSTM network is a dynamic model able to process input sequences of variable length. It is a model widely used with time-series data, since it is able to learn from important events that occurred at past time steps.
XGBoost. XGBoost is an optimized implementation of Gradient Boosted Trees (GBT) designed to be highly efficient, flexible and portable. GBT is a specific type of gradient boosting model, a technique usually used for regression and classification problems, producing a prediction model in the form of an ensemble of weak prediction models (called weak learners). Weak learners are trained sequentially, each one correcting the errors made by its predecessor. In the case of GBT, the weak learners are decision trees. GBT aims to minimise an objective function that combines a convex loss function and a penalty term for model complexity.

EAI Endorsed Transactions on Bioengineering and Bioinformatics Online First
The training process proceeds iteratively, adding new trees that predict the residual errors of the prior trees; these are then combined with the previous trees to make the final prediction. The simplified form of the objective function for the new tree $f_t$ is [42]:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

where $g_i$ and $h_i$ are the first and second order gradient statistics on the loss function $l$. They are defined as follows:

$$g_i = \partial_{\hat{y}_i^{(t-1)}} \, l\left(y_i, \hat{y}_i^{(t-1)}\right), \qquad h_i = \partial^2_{\hat{y}_i^{(t-1)}} \, l\left(y_i, \hat{y}_i^{(t-1)}\right)$$

The second term of the objective function, $\Omega(f_t)$, is a regularisation term in charge of seeking the appropriate final weights to avoid over-fitting. For the implementation presented in the current study, the XGBoost parameters (estimators, learning rate, max depth) were optimized using grid search. Following the same logic, the optimal number of lag features fed into the model was found to be 5 time steps for each feature.
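The role of the gradient statistics g_i and h_i can be illustrated with a small second-order boosting loop written in plain Python. This is a didactic sketch under stated assumptions and not the XGBoost implementation: it uses one feature, depth-one trees (stumps), a binary logistic loss, an L2 penalty `lam` on leaf weights as the only regularisation, and the closed-form leaf weight w = -G / (H + lam) from [42]. All function names and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaf_weight(g_sum, h_sum, lam):
    # Optimal leaf weight of the second-order objective: w = -G / (H + lambda).
    return -g_sum / (h_sum + lam)

def score(g_sum, h_sum, lam):
    return g_sum * g_sum / (h_sum + lam)

def fit_stump(x, g, h, lam):
    """Choose the threshold with the largest gain; return (thr, w_left, w_right)."""
    G, H = sum(g), sum(h)
    best = None
    for thr in sorted(set(x))[1:]:
        gl = sum(gi for xi, gi in zip(x, g) if xi < thr)
        hl = sum(hi for xi, hi in zip(x, h) if xi < thr)
        gain = 0.5 * (score(gl, hl, lam) + score(G - gl, H - hl, lam) - score(G, H, lam))
        if best is None or gain > best[0]:
            best = (gain, thr, leaf_weight(gl, hl, lam), leaf_weight(G - gl, H - hl, lam))
    return best[1:]

def boost(x, y, n_trees=20, eta=0.3, lam=1.0):
    raw = [0.0] * len(x)  # raw (pre-sigmoid) predictions
    trees = []
    for _ in range(n_trees):
        p = [sigmoid(r) for r in raw]
        g = [pi - yi for pi, yi in zip(p, y)]   # first-order gradients of log loss
        h = [pi * (1.0 - pi) for pi in p]       # second-order gradients of log loss
        thr, wl, wr = fit_stump(x, g, h, lam)
        trees.append((thr, wl, wr))
        raw = [r + eta * (wl if xi < thr else wr) for r, xi in zip(raw, x)]
    return trees

def predict_proba(trees, xi, eta=0.3):
    return sigmoid(sum(eta * (wl if xi < thr else wr) for thr, wl, wr in trees))

# Toy one-dimensional binary problem.
x = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 1.8, 2.0]
y = [0, 0, 0, 0, 1, 1, 1, 1]
trees = boost(x, y)
```

Each new stump is fitted to the current gradients, so later trees correct the errors left by earlier ones; the learning rate `eta` shrinks each correction.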
LSTM. Classification of sequential data is a problem commonly tackled using recurrent neural networks (RNNs). The idea behind RNNs is that, given a sequence of states, the network learns patterns across the sequence. Most RNNs suffer from the vanishing and exploding gradient problems. The Long Short-Term Memory (LSTM) network is a more complex variation of a typical recurrent neural network that overcomes the vanishing gradient problem but not the exploding gradient problem. LSTMs use "forget gate" units in order to decide whether previous information should be kept or forgotten. The most basic hyperparameters that should be taken into consideration when training an LSTM are the batch size of each iteration, the number of time steps and the number of hidden units of the LSTM itself. The number of time steps is the number of previous inputs that are fed into the network. The number of hidden units refers to the dimensionality of the hidden state and of the output state.
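The effect of the forget gate can be seen in a single step of a one-unit LSTM cell, sketched below with scalar input and state. The weights are illustrative values chosen by hand, not trained parameters from the study, and the names `lstm_step` and `make_params` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One forward step of a single-unit LSTM with scalar input and state."""
    def gate(name, activation):
        w, u, b = params[name]  # input weight, recurrent weight, bias
        return activation(w * x + u * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # candidate value to write
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

def make_params(forget_bias):
    # All weights are zero so that only the forget-gate bias matters here.
    zero = (0.0, 0.0, 0.0)
    return {"forget": (0.0, 0.0, forget_bias), "input": zero,
            "output": zero, "candidate": zero}

# A large positive bias keeps the previous cell state, a large negative one erases it.
_, c_keep = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=make_params(20.0))
_, c_drop = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=make_params(-20.0))
```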
The network of the current study consists of an LSTM layer followed by a fully connected layer, followed by a softmax activation layer. In order to decide on the hyperparameters, 900 different configurations were tried, resulting in an optimal configuration of 50 hidden units and 3 time steps.

Internal Validation
The first set of experiments uses the internal validation process for testing. 10-fold cross-validation with stratified splits was applied separately to each of the 4 groups of subjects. The average results of the 10-fold CV for every group and every sleep stage are presented in Tables 3 and 4, where the XGBoost and LSTM approaches are examined.
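Stratified splitting can be sketched as below: each class is distributed evenly over the folds so that rare stages such as N1 are present in every fold. The label counts are hypothetical and the helper `stratified_folds` is not the study's code; for brevity no shuffling is done, whereas a practical implementation would normally shuffle within each class first.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds):
    """Return a fold index per sample, spreading every class evenly over the folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        # Deal the samples of each class round-robin across the folds.
        for position, index in enumerate(indices):
            fold_of[index] = position % n_folds
    return fold_of

# Hypothetical, imbalanced epoch labels.
labels = ["N2"] * 50 + ["W"] * 30 + ["N1"] * 10 + ["REM"] * 10
folds = stratified_folds(labels, n_folds=10)
```

With these counts every fold receives 5 N2, 3 W, 1 N1 and 1 REM epochs, so each test fold mirrors the overall class proportions.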
The XGBoost algorithm outperforms LSTM in all cases, achieving higher classification accuracy across all groups and all sleep stages. As expected, better results are obtained on group A, which comprises healthy subjects. More specifically, classification accuracy reaches 91%, while on groups B and C it drops by roughly 10 percentage points, reaching 80% and 82% respectively. An interesting finding is that group D, which contains all of the subjects, has an accuracy of 87%. It therefore seems that combining healthy and patient subjects in the training set may help the model generalize better, a claim that is investigated more thoroughly in the external validation. As commonly reported in the literature, stage N1 has the lowest accuracy among all sleep stages. This is confirmed by the current study, and N1 is also the sleep stage that LSTM struggles most to predict. Groups B and C do not show any significant differences regarding the accuracy scores of each sleep stage; however, they both have low scores for the Awake stage compared to the healthy subjects.

External Validation
The second set of experiments was tested with external validation. This means that the subjects that were used for testing were completely kept out of the training process. Initial groups were split into training set and test set, as seen in Table 2. In group D, a model was trained, using subjects from groups A, B and C. Then, the external validation is done separately on subjects of those three groups (Tables 7, 8).
The accuracy drops dramatically in all cases compared to the internal validation. This happens because during internal validation one part of each subject's sleep was used in the training phase, so using the rest of the same subject's sleep for testing results in higher accuracy. XGBoost surpasses LSTM again in this case, while the problem with the N1 stage's accuracy becomes even worse. The accuracy of group A falls to 82%, while groups B and C drop to 57% and 55% respectively. A significant improvement in the results was observed when a single model was trained with subjects that belonged to group D. The predictive accuracy improved for each group separately (A, B and C), compared to the case in which each group was trained and tested only with subjects that belonged to that specific group. That confirms the initial claim that models are able to generalize better when they are trained jointly with healthy and patient subjects.
Compared to the state of the art, the proposed approach of the current study for healthy subjects ranks second among six studies (ranging between 71% and 87%) that implemented external validation utilizing the Sleep-EDFx dataset [43]. The same study presents results for patient subjects which cannot be compared with this study, since the subjects were suffering from a different sleep-related disease; however, the reported accuracy ranges between 49% and 69%.

Discussion
The current study attempts to propose a reliable sleep stage classification algorithm, contributing towards the replacement of manual sleep scoring with automated solutions. The contribution of frequency domain features extracted from non-stationary PSG signals, in combination with the temporal information incorporated from time delay features, results in a robust model that achieves high accuracy with the appropriate configuration of a tree boosting algorithm. XGBoost surprisingly outperforms LSTM in all cases. This could be because the feature extraction process was not based on the concept of long and short range correlations. A different LSTM setup, like separating each epoch into narrower time windows, or utilizing LSTM as an auto-encoder on raw data, could probably improve its poor accuracy.
Most studies are based only on healthy subjects; nonetheless, the results obtained highlight that adding patient subjects may improve the model's generalization capability. The need for efficient sleep scoring is in any case greater for subjects that suffer from sleep-related problems. Moreover, internal validation may lead to somewhat biased results that can make one overestimate a model's capabilities. That is why the authors suggest that research on sleep staging should target the external validation of patient subjects. State-of-the-art results suggest that there is a large margin of improvement at this [...] benefit ANNs, since the limited number of samples in the N1 stage of sleep is reflected in the limited ability of the network to identify it correctly. Another way to avoid the problem of the limited N1 stage samples is to use class weights, thus indirectly forcing the network to focus more on the underrepresented class. Including data from additional inputs besides the PSG recording (EEG, EOG, EMG, HR) could also be a promising approach. In addition to actigraphy and heart rate, based on consumer electronics, Electrodermal Activity (EDA) sensors, which measure the changes in skin conductance resulting from sympathetic nervous system activity, have been proposed for sleep monitoring [44][45][46]. Studies utilizing EDA data mainly target sleep/wake discrimination and sleep quality characterization, and are mainly applicable in environments outside the sleep lab. In combination with PSG recordings, it is assumed that this extra piece of information on the autonomic function could slightly improve sleep staging predictive accuracy, while at the same time increasing the problem's complexity due to the higher dimensionality of the feature space. However, EDA information is not included in the dataset used in the current study.
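The class-weight idea mentioned above can be sketched with the widely used inverse-frequency heuristic, weight_c = n_samples / (n_classes * count_c). This particular formula is an assumption chosen for illustration; the text does not prescribe a weighting scheme, and the label counts below are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: rare classes receive proportionally larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {stage: n_samples / (n_classes * count) for stage, count in counts.items()}

# Hypothetical, imbalanced epoch labels in which N1 is underrepresented.
labels = ["N2"] * 50 + ["W"] * 30 + ["N1"] * 10 + ["REM"] * 10
weights = balanced_class_weights(labels)
```

Here an N1 epoch is weighted five times more heavily than an N2 epoch, so misclassifying the rare stage costs more in a weighted loss function.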
Even though there are numerous studies regarding automatic sleep scoring, there are far fewer real-life applications in which the algorithms are actually used. Future research should focus on the development of such real-life applications and on the issues that could arise. Hospitals or sleep clinics equipped with automatic sleep scorers could diagnose patients' sleep-related issues faster and more easily, since a human scorer (doctor) would no longer be necessary. A number of issues were also identified that make comparing and benchmarking different solutions a demanding task. The diversity between the datasets used in different studies leads to results that are not easily compared. The use of different EEG channels by the datasets can also prevent the comparison of results. Finally, as shown in the related work section and in accordance with the current study's results, the models' evaluation method plays an important role, as internal evaluation yields far more accurate results.

Conclusion
In this work, a sleep stage classification study for healthy and patient subjects is presented. The proposed approach utilizes a mixture of time-domain and frequency-domain features extracted from two EEG channels and the EOG channel. Two different classification approaches are presented, the first based on tree boosting and the second on LSTM NNs. Sleep staging results are presented for two different evaluation methods, and the differences between those methods are highlighted.
The suggested tree boosting model achieves results that rank among the state-of-the-art models found in the literature. It is also deduced that predictive accuracy is improved when healthy and patient subjects are used jointly in the training process. One possible limitation of the current study is that it does not seem fully suitable for implementation on a real-time wearable sleep scoring device, as it would be rather complex due to the fact that it utilizes 3 different channels as inputs.