1 Introduction
Sleep is a universal recurring dynamical and physiological activity in mammals. According to American Academy of Sleep Medicine (AASM) Iber2007 ; berry2012aasm that generalizes Rechtschaffen and Kales (R&K) criteria Rechtschaffen_Kales:1968 , the sleep dynamics can be divided into two broad stages, the rapid eye movement (REM) and the nonrapid eye movement (NREM), and the NREM stage is further divided into shallow sleep (stage N1 and N2) and deep sleep (stage N3). Up to now, we have accumulated a lot of physiological knowledge about sleep dynamics Saper2013 and a lot of research is actively carried out to explore the unknowns Kanda2016 . Despite those unknowns, it has been well known that a distortion of sleep dynamics could lead to catastrophic outcomes. For example, REM disturbance slows down the perceptual skill improvement Karni1994 , deprivation of slow wave sleep is associated with Alzheimer’s disease Kang2009 , insufficient N2 sleep is associated with weaning failure RocheCampo2010 , several public disasters are caused by low sleep quality Leger_Bayon_Laaban_Philip:2012 , etc. Therefore, in addition to the interest stemming from physiological aspects, we have a lot of clinical applications from knowing the sleep dynamics.
The polysomnography (PSG) is the gold standard of evaluating the sleep dynamics. However, scoring the overnight sleep stage from the PSG outputs by human experts is time consuming and errorprone due to the huge signal loading norman2000interobserver
. Due to the advance of the technology and computational power, in the past decades a lot of effort has been devoted to establish an artificial intelligence (AI) system for the automatic sleep stage annotation purpose. In addition to being able to accurately score sleep stages, an ideal AI system should also be able to “selflearn” or accumulate knowledge from the historical database. In other words, when the database with experts’ annotations grows, an ideal AI system should be able to provide a more accurate results. Note that despite the well accepted homeostasis assumption of physiological system, physiological dynamics vary from subject to subject. Therefore, the main challenge of this selflearning capability is handling the interindividual variability. This challenge is actually ubiquitous – for a newarriving subject, how to utilize the existing database with the expert annotations?
In this paper, we focus on the electroencephalogram (EEG) signal, and propose a novel algorithm to achieve the abovementioned challenges – an accurate automatic sleep stage scoring AI system that can selflearn. The feature extraction part of the algorithm is based on a combination of modern signal processing tools, including the synchrosqueezing transform (SST)
Daubechies_Lu_Wu:2011 ; Wu:2011Thesis ; Chen_Cheng_Wu:2014 and diffusion map (DM) coifman2006diffusion or diffusionbased sensor fusion (alternating DM (ADM) lederman2015alternating ; Talmon_Wu:2016 and multiview DM Linderbaum_Yeredor_Salhov_Averbuch:2015 ). The learning part of the algorithm for a new subject is based on finding “similar subjects” in the existing database, and establishing a model tailored for the new subject by the hidden Markov model (HMM) fraser2008hidden .The EEG signal, as a time series, is a summation of several oscillatory components with diverse spectra, each of which depicts a portion of the brain dynamics. SST is a nonlineartype timefrequency analysis technique, which allows us to “decouple” the spectra of those oscillatory components from the EEG signal. This decoupling is attained by nonlinearly deform the spectrogram provided by the short time Fourier transform (STFT) according to the EEG phase information. Compared with the spectrogram, SST provides a sharpened spectrogram, and alleviates the blurring effect of the spectrogram caused by the uncertainty principle. We call the outcome the
synchrosqueezed EEG spectral features. From a different perspective, the information hidden in the synchrosqueezed EEG spectral features contain not only the magnitude of STFT (the spectrogram), but also the EEG phase information.The synchrosqueezed EEG spectral feature, however, is in general different from the intrinsic sleep dynamics. This difference comes from various resources. For example, the EEG sensor could be viewed as a nonlinear deformation of the inaccessible brain dynamics. Thus, even if we could extract all pieces of information from the EEG signal via the synchrosqueezed EEG spectral feature, an extra step is needed to recover the intrinsic sleep dynamics. We consider DM coifman2006diffusion or diffusionbased sensor fusion, like ADM lederman2015alternating ; Talmon_Wu:2016 and multiview DM Linderbaum_Yeredor_Salhov_Averbuch:2015 , depending on how many EEG channels we have, for this purpose. The synchrosqueezed EEG spectral features are reorganized by the DM/diffusionbased sensor fusion to recover the intrinsic sleep dynamics. We call the output features of the EIGbased DM/diffusionbased sensor fusion the intrinsic sleep dynamical features.
For the learning step, based on the physiological knowledge, we first take the available phenotype information of the newarriving subject to determine “similar subjects” from the existing database. Then, based on the intrinsic sleep dynamical features and the annotations of these similar subjects, a prediction model for the new subject is established by the HMM.
To evaluate the performance of the proposed algorithm, we consider the publicly available benchmark database, the PhysioNet SleepEDF SC database Goldberger_Amaral_Glass_Hausdorff_Ivanov_Mark_Mietus_Moody_Peng_Stanley:2000 , which consists of healthy subjects without any sleeprelated medication, the PhysioNet SleepEDF ST Goldberger_Amaral_Glass_Hausdorff_Ivanov_Mark_Mietus_Moody_Peng_Stanley:2000 , which consists of subjects who had mild difficulty falling asleep. To further evaluate the algorithm, we analyze the sleep database collected from the sleep center at Chang Gung Memorial Hospital (CGMH), Linkou, Taoyuan, Taiwan, which consists of subjects with different levels of sleep apnea syndrome.
1.1 Related Work
There have been many proposed algorithms for the sake of automatic sleep stage scoring. Among many papers, we shall distinguish two common cross validation (CV) schemes. According to whether the training data and the testing data come from different subjects, the literature is divided into two groups, leaveonesubjectout and nonleaveonesubjectout
CV. When the validation set does not overlap the set used for the training purpose, we call it the leaveonesubjectout CV scheme; otherwise we call it the nonleaveonesubjectout CV scheme. The main challenge of the leaveonesubjectout CV scheme comes from the interindividual variability, but this scheme is close to the realworld scenario – how to predict the sleep dynamics of a new arrival subject from a given annotated database. On the other hand, in the nonleaveonesubjectout CV scheme, the training set and the testing set are dependent, and the performance might be overestimated. In this work, to better evaluate the proposed algorithm caused by the interindividual variability issue, we focus on the leaveonesubjectout CV scheme, and only summarize papers considering the leaveonesubjectout CV scheme and single or two channel EEG signal.
, timedomain and frequencydomain features are designed to predict awake, REM, light NREM stage (N1 and N2 are combined) and deep sleep stage by support vector machine. The performance is evaluated in a private database with 4 subjects. In
MemarFaradji2018, 104 features are extracted from the EEG signals and selected features are used to classify 4 sleep stages (N1 and N2 are combined) by the random forest. The following papers classify 5 sleep stages like our work. In
2016autoencoder, timedomain and frequencydomain features are used as the input layer of the stacked sparse autoencoder, which is a specific type of neural network models, with the sigmoid activation function. The performance of the stacked sparse autoencoder was evaluated in the publicly available benchmark database, SleepEDF SC
Goldberger_Amaral_Glass_Hausdorff_Ivanov_Mark_Mietus_Moody_Peng_Stanley:2000 , and the overall accuracy was 78.9%. Instead of extracting features based on the domain knowledge, features in tsinalis2016automaticare automatically learned by the convolutional neural networks (CNNs). The overall accuracies was 74.8% for the same SleepEDF SC
database. In DeepSleepNet, the authors also proposed a deep learning model, named DeepSleepNet, which utilized CNNs to extract features and apply the bidirectional Long ShortTerm Memory to learn the stage transition rules from EEG epochs. The DeepSleepNet algorithm reaches the stateofart 82.0% of overall accuracy on the SleepEDF SC
database. In Vilamala2017 , a similar approach based on the deep CNN with modifications is considered and achieves a compatible result.The rest of this paper is organized as follows. In Section 2, we introduce the proposed algorithm, where we summarize the theoretical background of STFTbased SST and introduce the synchrosqueezed EEG spectrum as the dynamical features in Section 2.1, we describe how to apply the diffusion geometry to integrate and fuse the dynamical features in Section 2.2, and the HMM is summarzied to automatically score the sleep stages in Section 2.3. The results of applying the proposed algorithm to two databases are shown in Section 4.
2 Method
The algorithm is composed of two essential steps – the feature extraction and the learning, and is summarized in Figure 1. In this work, we only consider the spectral information of EEG to generate our features. In the feature extraction step, we extract spectral information from the EEG signal by the SST (indicated by Part 1 in Figure 1), and apply the DM or ADM (indicated by Part 2 in Figure 1) with a properly designed metric to determine the final features. In the learning step, we take the HMM to build up the prediction model. Below, we detail the algorithm step by step, and postpone several technical details to the Online Supplementary.
2.1 Feature extraction step, part 1: Synchrosqueezed EEG spectrogram
In this subsection, we introduce the synchrosqueezed EEG spectrogram based on the STFTbased SST Wu:2011Thesis algorithm. The basic idea underlying SST is utilizing the phase information in the STFT of the EEG signal to sharpen the spectrogram. There are two benefits. First, the phase information that is commonly ignored in the spectrogram is preserved. Second, the uncertainty principle intrinsic to the spectrogram is alleviated and the spectrogram is sharpened, which can prevent the “energy leakage” caused by the blurring effect inherited in the uncertainty principle associated with the STFT. These two benefits allow us a better quantification of the EEG dynamics. The SST algorithm is composed of three steps – extract the local dynamics of the EEG signal by the STFT, manipulate the phase information, and sharpen the spectrogram according to the extracted phase.
Take to be the digitalized EEG signal sampled uniformly every second from a subject during his/her sleep. For , the STFT at th second is directly implemented by the weighted Fourier transform:
(1) 
where , is a parameter used to adjust the frequency resolution, is the standard Gaussian function, and is the bandwidth of the Gaussian window. The SST is then implemented as
(2) 
where
and is the derivative of . Note that is the synchrosqueezed EEG spectrogram at th second, which can be viewed as a dynamical spectral feature for the sleep dynamics. We refer the reader with interest to Section SI.7 in Online Supplementary for more details of SST.
Finally, we convert the synchrosqueezed EEG spectrogram into the features we have interest. In this work, we follow the standard sleep stage scoring, the RK Rechtschaffen_Kales:1968 or AASM classifications berry2012aasm , and take an epoch for the sleep stage evaluation to be 30 seconds long. Therefore, we have epochs. For the th epoch, we get a 10dimension vector that include the total energy
(3) 
and the energy ratios on 9 frequency regions in the spectral domain, including , , , , , , , , and . Specifically,
(4) 
In practice we observed that suitably dividing the beta wave frequency range, Hz, and the lowgamma wave frequency range, , into finer bands leads to higher classification accuracy, so we consider the frequency bands in this work. We call , , the synchrosqueezed EEG spectral feature of the th epoch.
2.2 Feature extraction step, part 2: (Alternating) diffusion map
In the optimal situation, the sleep dynamics at each epoch can be faithfully reflected by the synchrosqueezed EEG spectral features. However, the synchrosqueezed EEG spectral features might be erroneous due to the inevitable noise, other sensorspecific artifacts and the information distortion caused by the observation procedure. We then have to stabilize these features to better quantify the intrinsic sleep dynamics. For this purpose, we apply DM. DM is a nonlinear manifold learning algorithm that not only is robust to noise but respects the nonlinear structure of the intrinsic dynamics. We summarize it below and its generalization to fuse information from multiple channels.
2.2.1 Integrate single channel information – Diffusion map
Take the synchrosqueezed EEG spectral features , which is a point cloud in an Euclidean space. First, from the point cloud , we build a graph with being vertices. The affinity between the features and is defined
(5) 
where is chosen by the user and is the metric chosen by the user to compare two synchrosqueezed EEG spectral features. Thus, we have a affinity matrix . Note that according to the noise analysis in ElKaroui_Wu:2016b , when the signal to noise ratio is small, it is beneficial to set the diagonal terms of the affinity matrix to 0. Next, define the degree matrix of size , which is diagonal, as
(6) 
With matrices and , we could define a random walk on the point cloud with the transition matrix given by the formula
(7) 
Since is similar to the symmetric matrix
, it has a complete set of right eigenvectors
with corresponding eigenvalues
, where . Indeed, from the eigendecomposition , where is a orthonormal matrix and is a diagonal matrix, we have , where and . With the decomposition , the DM is defined as(8) 
where and is the diffusion time chosen by the user, and is an estimate of the dimension of the intrinsic state. Here, can be obtained according to the spectral gap in the decay of the eigenvalues . Clearly, consists of the second to coordinates of , where is the unit dim vector with the th entry . In other words, the synchrosqueezed EEG spectral features are converted into a set of new features in via the DM. Therefore, we obtain our final feature for the recorded EEG signal , which is denoted as . We call them the intrinsic sleep features for the EEG signal . In this work, we choose .
Note that the diffusion process plays an essential role in the whole algorithm, since (8) is based on the diffusion nature of the random walk with the transition matrix (7). Based on a series of theoretical papers Berard_Besson_Gallot:1994 ; VonLuxburg_Belkin_Bousquet:2008 ; Singer_Wu:2016 , we know that the Riemannian manifold hosting the inaccessible intrinsic sleep features is almost isometrically recovered by DM (8). In addition to recovering the intrinsic state space, another important property of DM is its robustness to noise. Essentially, the diffusion process could be viewed as a nonlinear “averaging” process. This viewpoint has been captured and proved in ElKaroui:2010a ; ElKaroui_Wu:2016b . We refer readers with interest to Section SI.8 in the Online Supplementary for a summary of the theoretical background of DM and more literature citations. In this work, we choose the “local Mahalanobis” distance singer2008non ; TalmonPNAS as the metric in (5). The local Mahalanobis distance is defined in (SI.23) and details for this distance can be found in Section SI.9 in the Online Supplementary. See Figure 1 for an example of DM when . It is clear that epochs of different sleep stages are clustered and separated, and this shows the reason we call the intrinsic sleep features.
2.2.2 Fuse two channels via diffusion
If only one EEG signal is available, we proceed to the learning step in Section 2.3 with the intrinsic sleep features for channel . If we have two or more channels, we can take them into account simultaneously. In general, this problem is understood as the sensor fusion problem. While different EEG channels capture information from the same brain, the information recorded might vary and they might be contaminated by brainactivity irrelevant artifacts from different resources, including noise and other sensorspecific nuisance. These artifacts not only deteriorate the quality of the extracted features but might also mislead the analysis result. The main purpose of sensor fusion is distilling the brain information and removing those unwanted artifacts.
To simply the discussion, we assume that we have two simultaneously recorded EEG channels and , and the discussion could be directly generalized to the case with more channels Katz_Wu_Lo_Talmon:2016 . For each channel, we apply Sections 2.1 and 2.2 to obtain its intrinsic sleep features, denoted as and respectively, where might be different from . A naive way to combine information from and is simply concatenating features and form a new feature of size . However, due to the inevitable sensorspecific artifacts or errors, such a concatenating scheme might not be the optimal route to fuse sensors Katz_Wu_Lo_Talmon:2016 . We consider the recently developed diffusion approaches to fuse sensor information, including ADM lederman2015alternating ; Talmon_Wu:2016 and coclustering dhillon2001co (a special case of multiview DM Linderbaum_Yeredor_Salhov_Averbuch:2015 ). Essentially, the information from different sensors are “diffused” to integrate the common information, and simultaneously eliminate artifacts or noise specific to each sensor.
For ADM, we first form a new transition matrix:
(9) 
where and are the transition matrices (7) associated with the channel and . Under the assumption that both EEG channels and share an isometric common manifold, following the analysis in lederman2015alternating , the distance between the th row and the th row of serves as a distance between the “common information” associated with the th and the th epochs. We call this distance the common metric. Then, run DM with in (5) replaced by the common metric between the th epoch and the th epoch. Denote the first nontrivial eigenvectors of the associated transition matrix by , where is chosen to be in this work. For the coclustering, we choose
(10) 
and define a diagonal matrix so that its th diagonal entry is the sum of the th row of . Denote to be the th left eigenvector of the transition matrix . Since the transition matrix is associated with the random walk on the bipartite graph formed from epochs of and , its eigenvectors “coclustering” two channels dhillon2001co , and hence we can use its eigenvectors as features for the sleep stage. We provide more technical details of coclustering in Section SI.10 in the Online Supplementary. Since for each , and correspond to the th epoch for each , we could consider the vector, , to be another set of features associated with the sleep stage of the th epoch, where and is chosen to be in this work.
With ADM and coclustering, call the dimensional vector
the common intrinsic sleep feature associated with the th epoch. Denote . An illustration of the result of ADM with is shown in Figure 1.
2.3 Learning step: Sleep Stage Classification by the Hidden Markov Model
In this work, we choose the standard and widely used algorithm HMM for the classification purpose. HMM is particularly powerful if we want to model a sequence of variables that changes randomly in time. Although it is standard, to make the paper selfcontained, we provide a summary of the HMM and the numerical detail of establishing the model in Section SI.11 in the Online Supplementary.
2.4 Statistics
To evaluate the performance of the proposed automatic sleep scoring algorithm, we consider the leaveonesubjectout cross validation (CV). For each database, one subject is randomly chosen as the testing set and the other subjects form the training set. For the testing subject, we take the phenotype information to find the most similar subjects to establish the HMM model. The impact of age on the sleep dynamics Vitiello2004 and EEG signal BoselliParrino1998 ; VanCauter2000 is wellknown, so the EEG information from subjects with similar age will provide more information. While the sleep dynamics is influenced by other phenotype information, since age is the common information among databases we consider, we determine the most similar subjects by the age. Note that this approach imitate the real scenario – for a newarriving subject, we can score its sleep stages by taking the existing database with annotation into account.
All performance measurements used in this paper are computed through the unnormalized confusion matrix
. For , the entry represents the number of expertassigned class epochs, which were predicted to the class. The precision (), recall (), and F1score () of the th class, where , are computed respectively through(11) 
The overall accuracy (ACC), macro F1 score (MacroF1) and kappa () coefficient are computed respectively through
(12) 
where
(13) 
3 Material
To evaluate the proposed algorithm, we consider the following two databases, one is from the public repository PhysioNet and the other one is collected from the sleep lab.
The first dataset is SleepEDF Database [Expanded] from Physionet Goldberger_Amaral_Glass_Hausdorff_Ivanov_Mark_Mietus_Moody_Peng_Stanley:2000 . It contains two subsets (marked as SC and ST). The first subset SC comes from healthy subjects without any sleeprelated medication. The subset SC contains FpzCz/PzOz EEG signals recorded from 10 males and 10 females without any sleeprelated medication, and the age range is 2534 yearold. There are two approximately 20hour recordings per subject, apart from a single subject for whom there is only a single recording. The EEG signals were recorded during two subsequent daynight periods at the subjects’ home. The sampling rate is 100 Hz. The second subset ST was obtained in a 1994 study of temazepam effects on the sleep of subjects with mild difficulty falling asleep. The subset ST contains FpzCz/PzOz EEG signals recorded from 7 males and 15 females, who had mild difficulty falling asleep. Since this data set is originally used for studying the effects of temazepam, the EEG signals were recorded in the hospital for two nights, one of which was after temazepam intake. Only their placebo nights can be downloaded from Goldberger_Amaral_Glass_Hausdorff_Ivanov_Mark_Mietus_Moody_Peng_Stanley:2000 . The sampling rate is 100 Hz. For both SC and ST sets, each 30s epoch of EEG data has been annotated into the classes Awake, REM, N1,N2, N3 and N4. The epochs corresponding to movement and unknown stages were excluded and the the epochs labeled by N4 are relabeled to N3 according to the AASM standard. For more details of the database, we refer the reader to https://www.physionet.org/physiobank/database/sleepedfx/.
The second database is from the sleep center at Chang Gung Memorial Hospital (CGMH), Linkou, Taoyuan, Taiwan, where the subjects have different severity level of sleep apnea syndrome. A standard PSG study was performed with at least 6 hours of sleep to confirm the presence or absence of OSA from the clinical subjects suspected of sleep apnea at the sleep center. The Institutional Review Board of CGMH approved the study protocol (No. 1014968A3). The whole night signals were recorded by Alice 5 data acquisition system (Philips Respironics, Murrysville, PA). Standard signals, such as EEG signals (O1A2, O2A1, C3A2 and C4A1), oralnasal airflows, piezobased abdominal and thoracic movement signals, electrocardiogram (ECG), electromyography (EMG), and photoplethysmography (PPG), were recorded. We focus on the EEG signals here, which are recorded at the sampling rate 200 Hz. We collected 45 subjects. The overnight sleep stages for all 30 seconds epochs, including Awake, REM, N1, N2 and N3, are provided by two sleep experts. Among 45 subjects, there are 39 males and 6 females, and there are 3, 10 and 32 subjects with mild (5AHI15), moderate (15AHI30) and severe SAS (AHI30) respectively. The ages of the mild, moderate and severe groups are , and , respectively. The average of sleep recording time is 6.16 0.22 hours.
4 Results
We report the results of applying the proposed algorithm to the abovementioned two databases. To have a fair comparison, the commonly considered features from the spectrogram are considered. The spectrogram is evaluated by the pwelch method, and we consider two scenarios for the comparison. First, we take the features from the spectrogram for the training and testing in the learning step. Second, we apply the EIG and DM (or ADM) to estimate the intrinsic dynamics from the spectrogram.
4.1 SleepEDF database (SC*)
We start from showing the visualization of the intrinsic sleep features and the common intrinsic sleep features from 12 different subjects. See Figure 2 for a visualization of DM, ADM and multiview DM. Clearly, we see that Awake, REM, N2 and N3 stages are well clustered in all plots, while N1 is less clustered and tends to mixed up with other stages. Moreover, in ADM and multiview DM, we further see a “circle” with a hole in the middle, and the sleep stages are organized on the circle and follow the usual sleep dynamics pattern.
Since there are long periods of wakefulness at the start and the end of recordings, when a subject is not sleeping, DeepSleepNet only includes 30 minutes of such periods just before and after the sleep periods. To have a fair comparison, we also follow this truncation rule in this work. In the end, the labeled epochs are imbalanced, with 42.4% epochs labeled N2 and only 6.6% epochs labeled N1.
We run the leaveonesubjectout CV with . For each subject in the testing set, we take epochs of nine most similar subjects collected during the first night into account, and balance classes by taking the epochs from the second night. The averaged confusion matrix of the proposed algorithm over 20 subjects is shown in Table 1. The overall accuracy is and the macro F1 is , with Cohen’s kappa . Note that the N1 prediction is relatively low compared with other stages. Particularly, the F1 is and most N1 epochs are classified as REM or N2. This misclassification is related to the scattered N1 epochs in Figure 2 that can be visually observed, and it is the main reason to drag down the overall accuracy and macro F1. We also note that N3 is commonly classified wrongly as N2, Awake is commonly classified wrongly as N1, and REM is commonly classified wrongly as N2.
Predicted  Perclass Metrics  

Awake  REM  N1  N2  N3  PR  RE  F1  
Awake (17%)  6052 (86%)  233 (3%)  562 (8%)  212 (3 %)  21 (0%)  88.9  85.4  87.2 
REM (19%)  54 (1%)  6687 (86 %)  217 (3%)  750 (10%)  3 (0%)  76.5  86.7  81.3 
N1 (7%)  494 (18%)  933 (34%)  803 (29%)  539 (19%)  16 (0 %)  38.9  28.8  33.1 
N2 (43%)  139 (1%)  883 (5%)  461 (3%)  15360 (87 %)  786 (4 %)  87.5  87.1  87.3 
N3 (14%)  69 (1%)  1 (0%)  18 (0%)  690 (12 %)  4851 (86%)  85.4  86.1  85.8 
Authors of 2016autoencoder ; tsinalis2016automatic ; DeepSleepNet handle the imbalanced data issue by setting the number of epochs perstage per recording belonging the training data (consisting of 19 subjects) equal to the number of epochs of the least represented stage, N1. To have a fair comparison and test the stability of the proposed algorithm, we apply the same scheme. The averaged confusion matrix of the proposed algorithm over 20 subjects is shown in Table 2. The overall accuracy is and the macro F1 is , with Cohen’s kappa . Note that the N1 prediction accuracy is slightly better compared with that reported in Table 1, while F1 being is still relatively low.
Predicted  Perclass Metrics  

Awake  REM  N1  N2  N3  PR  RE  F1  
Awake (17%)  5958 (84%)  298 (4%)  697 (10%)  115 (2%)  12 (0%)  92.6  84.1  88.2 
REM (19%)  16 (0%)  6836 (89 %)  231 (3%)  623 (8%)  5 (0%)  75.0  88.7  81.2 
N1 (7%)  354 (13%)  914 (33%)  996 (36%)  507 (18%)  14 (0 %)  38.4  35.8  37.0 
N2 (43%)  69 (0%)  1063 (6%)  639 (4%)  14825 (84%)  1033 (6 %)  89.2  84.1  86.5 
N3 (14%)  39 (0%)  1 (0%)  34 (1%)  557 (10 %)  4998 (89%)  82.4  88.8  85.5 
To appreciate the importance of each step in the algorithm, we report the results when some steps of the algorithm are removed. We consider the following different combinations (confusion matrices not shown).

If we consider the spectral features directly from the spectrogram but not SST, and then apply sensor fusion to reorganize features, the overall accuracy equals 82.1%, the macro F1 score equals 72.3% and Cohen’s kappa equal 0.748.

If the ADM is not applied to reorganize the synchrosqueezed EEG features, i.e., we apply HMM directly to predict sleep stages from the concatenated synchrosqueezed EEG features from two channels, the overall accuracy equals 79.7%, the macro F1 score equals 70.0% and Cohen’s kappa equal 0.716.

If we do not use the local Mahalanobis distance to measure the similarity between synchrosqueezed EEG spectral features but the ordinary distance for sensor fusion, the overall accuracy is 73.6%, the macro F1 score is 65.6%, with Cohen’s kappa 0.63.

If we only consider multiview DM to fuse information, the overall accuracy is with the macro F1 score . On the other hand, if we replace the sensor fusion step by a direct concatenation of intrinsic sleep features from both sensors, the overall accuracy is with the macro F1 score . This result is consistent with that reported in Katz_Wu_Lo_Talmon:2016 .

The above results are all based on two channels. If we only apply the analysis to the single FpzCz channel; that is, we run HMM on the intrinsic sleep features extracted from FpzCz, the overall accuracy equals 78.5%, the macro F1 score equals 67.9% and Cohen’s kappa equal 0.701. If we take the single PzOz channel, the overall accuracy equals 79.3%, the macro F1 score equals 70.3% and Cohen’s kappa equal 0.71.
Compared these results with that reported in Table 1, we see that the accuracy is consistently downgraded.
Lastly, in Table 3, we compare performance of the proposed algorithm and several reported algorithms validated by the leaveonesubjectout CV scheme.
Author  EEG Channel  Extracted features  ModelAlgorithm  ACCMF1 

DeepSleepNet  SleepEDF SC  
FpzCz  
(41950 epochs)  
PzOz  
(41950 epochs)  Raw data without any prior processing  Recurrent Neural Networks with the classbalanced random sampling and the stochastic gradient descend  82.0%76.9%  
(=0.76)  
79.8%73.1%  
(=0.72)  
2016autoencoder  SleepEDF SC  
FpzCz  
(37022 epochs)  (a) Complex Morlet wavelets  
(b) Timedomain amplitude characteristics  
(c) Pearson correlation coefficient between each pair of frequencyband signals  
(d) Autocorrelation in the time domain signal for 0.5s lags  Stacked sparse autoencoders induced by the classbalanced random sampling  78.9%73.7%  
tsinalis2016automatic  SleepEDF SC  
FpzCz  
(37022 epochs)  Raw data without any prior processing  Convolutional Neural Networks with the classbalanced random sampling and the stochastic gradient descend  74.8%69.8%  
Ours  SleepEDF SC  
FpzCz + PzOz  
(40351 epochs)  
SleepEDF ST  
FpzCz + PzOz  
(21076 epochs)  Short time Fourier transform based  
synchrosqueezed transform  
(alternating) diffusion map  Hidden Markov Model  82.66%74.95%  
(=0.76)  
76.1%71.0%  
(=0.66) 
4.2 SleepEDF database (ST*)
See Figure 3 for a visualization of ADM and multiview DM of the ST* database. While Awake, REM, N2 and N3 stages are still well clustered in all plots, compared with the normal subjects in SC* database, the separation and the “circle” are less clear.
We run the leaveonesubjectout CV with . The averaged confusion matrix of the proposed algorithm over 22 subjects is shown in Table 4. The overall accuracy is and the macro F1 is , with Cohen’s kappa . In this database, there are epochs labeled as N1. Although it is slightly higher than that of SC database, the prediction performance of N1 is , which is still relatively low. Also, note that the prediction performance of N3 is lower, and a significant portion of N3 is misclassified as N2. If we run HMM directly on the synchrosqueezed EEG spectral feature without running ADM, the overall accuracy becomes , and the macro F1 score becomes with Cohen’s kappa . If we only consider multiview DM to fuse information, the overall accuracy is with the macro F1 score ; if we replace the sensor fusion step by a direct concatenation of intrinsic sleep features from both sensors, the overall accuracy is with the macro F1 score . Again, the overall performance is downgraded.
Predicted  Perclass Metrics  

Awake  REM  N1  N2  N3  PR  RE  F1  
Awake (11%)  1671 (79%)  73 (3%)  361 (17%)  19 (1%)  0 (0%)  77.5  78.7  78.1 
REM (20%)  12 (0%)  3649 (89%)  111 (3%)  346 (8%)  5 (0%)  77.7  88.5  82.8 
N1 (10%)  374 (19%)  560 (28%)  783 (40%)  297 (15%)  1 (0%)  45.3  38.9  41.9 
N2 (45%)  96 (1%)  414 (4%)  429 (5%)  8004 (85%)  509 (5%)  81.1  84.7  82.9 
N3 (15%)  4 (0%)  0 (0%)  43 (1%)  1203 (38%)  1891 (61%)  78.6  60.2  68.2 
4.3 CGMH database
See Figure 4 for a visualization of ADM and multiview DM of the CGMH database. Clearly, compared with those from the SC* and ST* databases, while the sleep stages Awake, REM, N2 and N3 are still well clustered, the “circle” does not exist.
We again run the leaveonesubjectout CV with . The averaged confusion matrix of the proposed algorithm over 45 subjects is shown in Table 5. The overall accuracy is and the macro F1 is , with Cohen’s kappa . At first glance, this result is not ideal. However, note that this database is composed of subjects with heterogeneous background, particularly the sleep apnea, and this classification result is consistent with the clinical finding that the more severe the sleep apnea is, the worse the consistency is among experts. Physiologically it is not surprising. The sleep dynamics of a subject with severe sleep apnea is impacted by the apnea events, and hence the more irregular pattern inside the EEG signal.
To have a closer look at this effect, in Figure 5
we show a scatter plot of prediction accuracy versus apneahypopnea index (AHI). The linear regression fit is
with and both the intercept and AHI covariate are statistically significant, with the significant level set to . This result shows that the higher the AHI is, the more severe the sleep apnea is. It is clear that higher the sleep apnea is, the lower the prediction accuracy is. For the 13 subjects with AHI30, the overall accuracy is 70.5%, the macro F1 score is 55.7%, and Cohen’s kappa is . For the 32 subjects with AHI30, the overall accuracy is 60.2%, the macro F1 score is 54.5%, and Cohen’s kappa is 0.432.In this database, since we have more than 2 channels, we could take three channels, O1A2, O2A1 and C3A2, into account to obtain more sleep dynamics information and hence a better prediction accuracy. As predicted, the overall performance increases so that the overall accuracy is 68.4%, the macro F1 score is 54.8%, with Cohen’s kappa 0.54 (confusion matrix not shown).
Predicted  Perclass Metrics  

Awake  REM  N1  N2  N3  PR  RE  F1  
Awake (19%)  4773 (75%)  153 (2%)  782 (12%)  677 (11%)  13 (0%)  63  75  69 
REM (9%)  469 (15%)  831 (26%)  936 (30%)  913 (29%)  0 (0%)  35  26  30 
N1 (23%)  1686 (22%)  865 (11%)  2327 (30%)  2843 (37%)  5 (0%)  48  30  37 
N2 (44%)  517 (3%)  493 (3%)  763 (5%)  12911(87%)  196 (1%)  71  87  78 
N3 (4%)  77 (6%)  8 (1%)  26 (2%)  895 (67%)  324 (24%)  60  24  35 
5 Discussion
The proposed feature, the common intrinsic sleep feature, is shown to well capture the geometric structure of the sleep dynamics. Specifically, by the modern nonlineartype TF analysis and diffusion technique, the inaccessible intrinsic state represented by the common intrinsic sleep feature is then robustly recovered from the EEG signal. When combined common intrinsic sleep features with the learning algorithm HMM, we provide an accurate prediction model, and the result is at least compatible with, or better than, several stateoftheart algorithms. The stateoftheart result we compare with depends on neural network (NN). While NN works well in general, it is often criticized to work as black boxes. For medical problems and datasets, due to the nonnegligible portion of ambiguous annotations Norman2000 , we need to be careful when interpreting its performance. The algorithm we proposed, on the other hand, has a clear and clean interpretation with solid mathematical supports. It is anyway interesting to ask if combining the designed feature extraction algorithm and NN could lead to a better prediction model, and we leave this problem to our future work.
Although our overall prediction accuracy is compatible with the stateoftheart prediction algorithm in the SleepEDF SC* database, like DeepSleepNet , we see that the prediction accuracy of N1 is relatively low by our algorithm (F1 is 37% by our method and 46.6% in DeepSleepNet ), and this low N1 accuracy downgrades the overall accuracy. This low prediction rate of N1 is also seen in the SleepEDF ST* and CGMH databases. This low prediction rate partially comes from the relatively small size of N1 epochs, and partially comes from the algorithm and available channels. Based on the AASM criteria Iber2007 ; berry2012aasm , to distinguish N1 and REM, we need electrooculogram and electromyogram, which are not available in the dataset. The EEG backgrounds of N1 and N2 are the same, and experts distinguish N1 and N2 by the Kcomplex or spindle, as well as the 3minute rule. While the synchrosqueezed EEG spectral features capture the Kcomplex or spindle behavior, the 3minute rule is not considered in the algorithm. Note that in order to handle the interindividual variability for the interindividual prediction, the temporal information among epochs is not fully utilized in this work. How to incorporate the temporal information into the algorithm will be explored in the future. Furthermore, while the vertex sharp is a common “landmark” indicating transition from N1 to N2, this feature is not always present and a rulebased approach is needed to include this temporally transient feature. On the other hand, it is well known that different EEG leads provide different information for N1. For example, the occipital lead has a stronger alpha wave strength, compared with the central lead, when transiting from wakefulness to N1. When there are multiple EEG leads, this lead information could be taken into account to further improve the accuracy. On the other hand, it is interesting to see how deep neural network approach can achieve 46.6% accuracy under the same condition. This suggests that some unclear distinguishable EEG structure of N1 that is not sensible by human can be efficiently extracted by the deep neural network framework proposed in DeepSleepNet . Since the proposed features depend solely on the spectral information, we may need features of other category to capture this unseen N1 structure. Note that in addition to N1, the prediction performance of N3 is also lower in the SleepEDF ST* database, where the subjects take temazepam before data recording. It has been well known that in general benzodiazepine hypnotics Borbely1985 reduces the low frequency activity and enhances spindle frequency. Since our features are mainly based on the spectral information, a N3 epoch might look more like N2 epochs, and hence the confusion and the lower performance. This outcome emphasizes the importance of the drug history when design the algorithm.
The performance of the proposed algorithm in the CGMH database is lower than that in the SleepEDF database. While the prediction accuracies of awake and N2 are reasonable, the prediction accuracies of REM, N1 and N3 are all low. Specifically, REM and N1 are highly mixed with Awake and N2. Also, a significant portion of N3 is misclassified as N2. The main reason is that all subjects in that database have sleep apnea of different degree of severity, which interrupts the sleep dynamics with different frequency. The low accuracy of N1 has been discussed above, and the same reason carries over here. It is well known that REM is vulnerable to apnea events and hence sleep dynamics fluctuation due to the lower arousal threshold. As a consequence, pathophysiologically, the fluctuation and transition among REM, N1 and Awake is frequent due to the apnea events. The transition can be so frequent that different features of different stages show up in one epoch. Due to these possible bizarre EEG patterns, sleep experts need a more delicate information to determine the sleep stage, including different channels and temporal information. For example, to determine arousal events from REM, the muscle activity information is necessary, but this information is not available to us. To distinguish N2 and REM, the temporal relationship of features of REM and N2 needs to be taken into account, but in our algorithm the temporal relationship is not used. These facts explain why REM gets confused with N1, Awake and N2. Ideally, since N3 has a distinguishable spectral behavior, we should be able to easily classify it. The low accuracy of N3 might come from the “fragile EEG pattern” of the interrupted N3 due to sleep apnea. For example, if an arousal event happens at 20th second of a N3 epoch so that the last 10 seconds is filled in by the alpha wave, there delta wave information might be limited and the overall behavior of the epoch becomes similar to N2, particularly when a K complex happens. In short, the more sleep apnea events, the sleep dynamics is more frequently interrupted, and hence the fluctuation. This has been shown in Figure 5
. In short, while the proposed algorithm provides a reasonable prediction result for subjects with sleep apnea, it is not good enough, particularly for subjects with AHI greater than 30. Since sleep apnea is an influential factor of the sleep dynamics and the interaction between sleep dynamics and sleep apnea is intricate, a new analysis approach is needed. Probably we even need different sensor for this purpose, but this is out of the scope of the current paper and will be explored in our future work.
In Figures 2, 3 and 4 , we show the underlying geometric structure of the sleep dynamics captured by the proposed algorithm – the awake, REM, N2 and N3 are well clustered, with N1 scattered around awake, REM, and N2. Furthermore, in Figures 2, 3, we can even visualize a “circle”, while it is less clear in Figure 3. An interesting physiological finding from these plots is the close relationship between N3 and awake. Note that the same result is also shown in Figure 1. This geometric relationship indicates the similarity between the common intrinsic sleep features of N3 and awake stages. This similarity comes from the well known fact that before arousal, particularly across the descending part of sleep cycles and in the first cycle, we can observe the “deltalike burst” that mimics the delta wave specific for N3 stages BonnetCarley1992 ; Halasz2004 . Note that epochs from one subject are used to generate Figure 1, and 12 different subjects are pooled together to generate Figures 2 and 3, and we see the same geometric structure. This finding indicates that this distribution is consistent across subjects. On the other hand, due to the sleep apnea disturbance, this “circle” no longer exists. This indicates the interruption of sleep dynamics by sleep apnea and hence the frequent random transition from one sleep stage to another.
One main clinical application of sleep dynamics prediction is for a more accurate sleep apnea screening. With the sleep stage information, the apneahypopnea index could be calculated. For that purpose, it is a common consensus to consider fewer channels for the patient to wear. While more channels provide more information, they may disturb the sleep. The respiratory flow is a common channel that people consider to screen the sleep apnea at home. In addition to provide the sleep apnea information, it has been known that the respiratory flow signal also contains abundant information about the sleep stage, which is based on different physiological mechanism compared with the EEG signal Thompson2001 . In this scenario, the proposed algorithm has the potential to incorporate the information hidden in the flow signal and design a more accurate prediction system.
In this work, for each given subject, we take the age into account to find “similar subjects” to train the model to automatically annotate the sleep stage of the given subject. This choice is based on the limitation of the available databases. This idea allows us to establish a “selfevolving” artificial intelligence system for the automatic annotation purpose. What we considered in this paper is a special case of the general picture. Note that the interindividual variability among individuals is inevitable, and this variability is the main obstacle toward a selfevolving system that can selflearn like a human being. Our solution is respecting the physicians’ decision making process and clinical experience to build up the system. Specifically, in this paper we take the impact of age on the sleep dynamics Vitiello2004 and EEG signal BoselliParrino1998 ; VanCauter2000 into account. Another benefit of this phenotype based approach is its flexibility for the growing database. While it is widely believed that the larger the database is, the more accurate model we can establish, we should take the limited computational resource into account. By only choosing those subjects sharing similar phenotype to establish the prediction model, the computational efficiency can be achieved. To further validate and develop this proposed selfevolving system, in the future study we will include the electrical health record information, and mimic physicians’ decisionmaking rules to determine the similarity between subjects.
Despite the strength of the proposed method, the discussion is not complete without mentioning its limitations. While we testes the algorithm on two databases, and compare our results with those of stateoftheart algorithms, those databases are small. To draw a conclusion and confirm its clinical applicability, a large scale and prospective study is needed. While with two channels our algorithm could outperform that reported in DeepSleepNet that depends on only one channel, when we have only one channel, our algorithm does not perform better (For FpzCz, the accuracy and F1 of our algorithm are 78.5% and 67.9%, while the accuracy and F1 reported in DeepSleepNet are 82% and 76.9%. For PzOz, the accuracy and F1 of our algorithm are 79.3% and 70.3%, while the accuracy and F1 reported in DeepSleepNet are 79.8% and 73.1%). As discussed above, this limitation comes from the poor features for N1 classification. We need to find features that can better quantify N1 dynamics, and better understand how and why the deep neural network achieves the accuracy. Although extended theoretical understandings of applied algorithms have been established in the past decade, there are still open problems we need to explore. For example, the statistical property of SST on EEG and the relationship between ADM and multiview DM. Particularly, while ADM and multiview DM look similar, they are developed under different motivations, and the consequence is never discussed. Another open problem is how to take the temporal information into account when we deal with the interindividual prediction. Note that in the current algorithm, although the temporal relationship of epochs is considered in the HMM model, it is not taken into account to design the feature. The above mentioned limitations will be studied and reported in the future work.
6 Conclusion
A novel sleep stage prediction algorithm based on nonlineartype timefrequency analysis and diffusion geometry with solid mathematical background and interpretable features based on two channel EEG signals is proposed and tested on two databases. In addition to visually seeing the relationship between different sleep stages, the prediction result suggests its potential in practical applications.
References
References
 (1) C. Iber, S. AncoliIsreal, A. C. Jr., , S. Quan, The AASM Manual for Scoring of Sleep and Associated EventsRules: Terminology and Technical Specification, American Academy of Sleep Medicine, 2007.
 (2) R. B. Berry, D. G. Budhiraja, et al., Rules for scoring respiratory events in sleep: update of the 2007 AASM manual for the scoring of sleep and associated events, J Clin Sleep Med 8 (5) (2012) 597–619.
 (3) A. Rechtschaffen, A. Kales, A Manual of Standardized Terminology, Techniques and Scoring System for Sleep Stages of Human Subjects, Washington: Public Health Service, US Government Printing Office, 1968.
 (4) C. B. Saper, The Neurobiology of Sleep, Continuum 19 (1) (2013) 19–31.
 (5) T. Kanda, N. Tsujino, E. Kuramoto, Y. Koyama, E. A. Susaki, S. Chikahisa, H. Funato, Sleep as a biological problem: an overview of frontiers in sleep research, The Journal of Physiological Sciences 66 (1) (2016) 1–13.
 (6) A. Karni, D. Tanne, B. S. Rubenstein, J. J. Askenasy, D. Sagi, Dependence on REM sleep of overnight improvement of a perceptual skill, Science 265 (5172) (1994) 679–682.
 (7) J.E. Kang, M. M. Lim, R. J. Bateman, J. J. Lee, L. P. Smyth, J. R. Cirrito, N. Fujiki, S. Nishino, D. M. Holtzman, Amyloidb Dynamics are regulated by Orexin and the sleepwake cycle, Science 326 (Nov 13) (2009) 1005–1007.
 (8) F. Roche Campo, X. Drouot, A. W. Thille, F. Galia, B. Cabello, M.P. D’Ortho, L. Brochard, Poor sleep quality is associated with late noninvasive ventilation failure in patients with acute hypercapnic respiratory failure., Critical care medicine 38 (2) (2010) 477–85.
 (9) D. Leger, V. Bayon, J. Laaban, P. Philip, Impact of sleep apnea on economics, Sleep medicine reviews 16 (5) (2012) 455–62.
 (10) R. G. Norman, I. Pal, C. Stewart, J. A. Walsleben, D. M. Rapoport, Interobserver agreement among sleep scorers from different centers in a large dataset., Sleep 23 (7) (2000) 901–908.
 (11) I. Daubechies, J. Lu, H.T. Wu, Synchrosqueezed wavelet transforms: An empirical mode decompositionlike tool, Appl. Comput. Harmon. Anal. 30 (2011) 243–261.
 (12) H.T. Wu, Adaptive Analysis of Complex Data Sets, Ph.D. thesis, Princeton University (2011).

(13)
Y.C. Chen, M.Y. Cheng, H.T. Wu, Nonparametric and adaptive modeling of dynamic seasonality and trend with heteroscedastic and dependent errors, J. Roy. Stat. Soc. B 76 (3) (2014) 651–682.
 (14) R. R. Coifman, S. Lafon, Diffusion maps, Appl. Comput. Harmon. Anal. 21 (1) (2006) 5–30.
 (15) R. R. Lederman, R. Talmon, Learning the geometry of common latent variables using alternatingdiffusion, Appl. Comp. Harmon. Anal.doi:10.1016/j.acha.2015.09.002.
 (16) R. Talmon, H.T. Wu, Discovering a latent common manifold with alternating diffusion for multimodal sensor data analysis, Appl. Comput. Harmon. Anal., In press.
 (17) O. Lindenbaum, A. Yeredor, M. Salhov, A. Averbuch, MultiView Diffusion Maps, ArXiv eprintsarXiv:1508.05550.
 (18) A. M. Fraser, Hidden Markov models and dynamical systems, SIAM, 2008.
 (19) A. Goldberger, L. Amaral, L. Glass, J. Hausdorff, P. Ivanov, R. Mark, J. Mietus, G. Moody, C.K. Peng, H. Stanley, Physiobank, physiotoolkit, and physionet: Components of a new research resource for complex physiologic signals., Circulation 101 (23) (2000) e215–e220.

(20)
S. Gudmundsson, T. Runarsson, S. Sigurdsson, Automatic Sleep Staging using Support Vector Machines with Posterior Probability Estimates, in: CIMCAIAWTIC’06, 2006, pp. 2–7.
 (21) P. Memar, F. Faradji, A novel multiclass EEGbased sleep stage classification system, IEEE Transactions on Neural Systems and Rehabilitation Engineering 26 (1) (2018) 84–95.
 (22) O. Tsinalis, P. M. Matthews, Y. Guo, Automatic Sleep Stage Scoring Using TimeFrequency Analysis and Stacked Sparse Autoencoders, Annals of Biomedical Engineering 44 (5) (2016) 1587–1597.
 (23) O. Tsinalis, P. M. Matthews, Y. Guo, S. Zafeiriou, Automatic sleep stage scoring with singlechannel EEG using convolutional neural networks, arXiv:1610.01683.
 (24) A. Supratak, H. Dong, C. Wu, Y. Guo, DeepSleepNet: a model for automatic sleep stage scoring based on raw singlechannel EEG, IEEE Trans. Neural. Syst. Rehabil. Eng. 25 (2017) 1998–2008.

(25)
A. Vilamala, K. H. Madsen, L. K. Hansen, Deep Convolutional Neural Networks for Interpretable Analysis of EEG Sleep Stage Scoring, in: 2017 IEEE international workshop on machine learning for signal processing, 2017.
 (26) N. El Karoui, H.T. Wu, Connection graph Laplacian methods can be made robust to noise, Ann. Stat. 44 (1) (2016) 346–372.
 (27) P. Bérard, G. Besson, S. Gallot, Embedding Riemannian manifolds by their heat kernel, Geom. Funct. Anal. 4 (1994) 373–398.

(28)
U. von Luxburg, M. Belkin, O. Bousquet, Consistency of spectral clustering, Ann. Stat. 36 (2) (2008) 555–586.
 (29) A. Singer, H.T. Wu, Spectral convergence of the connection Laplacian from random samples, Information and Inference: A Journal of the IMA 6 (1) (2017) 58–123.
 (30) N. El Karoui, On information plus noise kernel random matrices, Ann. Stat. 38 (5) (2010) 3191–3216.

(31)
A. Singer, R. R. Coifman, Nonlinear independent component analysis with diffusion maps, Appl. Comput. Harmon. Anal. 25 (2) (2008) 226–239.
 (32) R. Talmon, R. Coifman, Empirical intrinsic geometry for nonlinear modeling and time series filtering, Proc. Nat. Acad. Sci. 110 (31) (2013) 12535–12540.

(33)
O. Katz, R. Talmon, Y.L. Lo, H.T. Wu, Diffusionbased nonlinear filtering for multimodal data fusion with application to sleep stage assessment, Information Fusion,
In press.  (34) I. S. Dhillon, Coclustering documents and words using bipartite spectral graph partitioning, in: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2001, pp. 269–274.
 (35) M. V. Vitiello, L. H. Larsen, K. E. Moe, Agerelated sleep change, Journal of Psychosomatic Research 56 (5) (2004) 503–510.
 (36) M. Boselli, L. Parrino, A. Smerieri, M. G. Terzano, Effect of age on EEG arousals in normal sleep, Sleep 21 (4) (1998) 361–367.
 (37) E. Van Cauter, R. Leproult, L. Plat, Agerelated changes in slow wave sleep and rem sleep and relationship with growth hormone and cortisol levels in healthy men, JAMA 284 (7) (2000) 861–868.
 (38) R. G. Norman, I. Pal, C. Stewart, J. A. Walsleben, D. M. Rapoport, Interobserver Agreement Among Sleep Scorers From Different Centers in a Large Dataset, Sleep 23 (7) (2000) 901–908.
 (39) A. Borbély, P. Mattmann, M. Loepfe, I. Strauch, D. Lehmann, Effect of benzodiazepine hypnotics on allnight sleep EEG spectra, Hum Neurobiol 4 (3) (1985) 189–94.
 (40) M. Bonnet, D. Carley, et. al., EEG arousals : Scoring rules and examples. A preliminary report from the Sleep Disorders Atlas Task Force of the American Sleep Disorders Association, Sleep 15 (2) (1992) 173–184.
 (41) P. Halasz, M. Terzano, L. Parrino, R. Bodizs, The nature of arousal in sleep, J Sleep Res 13 (2004) 1–23.
 (42) S. R. Thompson, U. Ackermann, R. L. Horner, Sleep as a teaching tool for integrating respiratory physiology and motor control, Advances in phyiology education 25 (2) (2001) 29–44.
 (43) M. Belkin, P. Niyogi, Towards a theoretical foundation for Laplacianbased manifold methods, in: Proceedings of the 18th Conference on Learning Theory (COLT), 2005, pp. 486–500.
 (44) M. Hein, J. Audibert, U. von Luxburg, From graphs to manifolds  weak and strong pointwise consistency of graph Laplacians, in: COLT, 2005, pp. 470–485.
 (45) A. Singer, From graph to manifold Laplacian: The convergence rate, Appl. Comput. Harmon. Anal. 21 (1) (2006) 128–134.
 (46) E. Giné, V. Koltchinskii, Empirical graph Laplacian approximation of laplacebeltrami operators: Large sample results, in: A. Bonato, J. Janssen (Eds.), IMS Lecture Notes, Vol. 51 of Monograph Series, The Institute of Mathematical Statistics, 2006, pp. 238–259.
 (47) M. Belkin, P. Niyogi, Convergence of Laplacian eigenmaps, in: Adv. Neur. In.: Proceedings of the 2006 Conference, Vol. 19, The MIT Press, 2007, p. 129.

(48)
P. W. Jones, M. Maggioni, R. Schul, Manifold parametrizations by eigenfunctions of the Laplacian and heat kernels., P. Natl. Acad. Sci. USA 105 (6) (2008) 1803–8.
 (49) J. Bates, The embedding dimension of Laplacian eigenfunction maps, Appl. Comput. Harmon. Anal. 37 (3) (2014) 516–530.
 (50) J. W. Portegies, Embeddings of Riemannian manifolds with heat kernels and eigenfunctions, Comm. Pure Appl. Math.ArXiv:1311.7568 [math.DG].
 (51) N. Garcia Trillos, D. Slepcev, A variational approach to the consistency of spectral clustering, Appl. Comput. Harmon. Anal. (2015) 1–39.
 (52) X. Wang, Spectral Convergence Rate of Graph Laplacian, ArXiv:1510.08110.
 (53) P. Bérard, Spectral Geometry: Direct and Inverse Problems, Springer, 1986.
 (54) A. Singer, H.T. Wu, Vector diffusion maps and the connection Laplacian, Comm. Pure Appl. Math. 65 (8) (2012) 1067–1144.
 (55) H.T. Wu, R. Talmon, Y.L. Lo, Assess sleep stage by modern signal processing techniques, IEEE Trans. Biomed. Eng. 62 (4) (2015) 1159–1168.
 (56) R. Talmon, R. R. Coifman, Empirical intrinsic geometry for nonlinear modeling and time series filtering, PNAS 110 (31) (2013) 12535–12540.
 (57) J. Malik, C. Shen, N. Wu, H.T. Wu, Connecting dots – from covariance to geodesics, empirical intrinsic geometry, and locally linear embedding, submitted.
 (58) F. Chung, Spectral Graph Theory, American Mathematical Society, 1996.
 (59) J. R. Lee, S. O. Gharan, L. Trevisan, Multiway spectral partitioning and higherorder Cheeger inequalities, J. ACM 61 (6) (2014) 37:1–37:30.
 (60) A. Buzo, A. Gray, R. Gray, J. Markel, Speech coding based upon vector quantization, IEEE Transactions on Acoustics, Speech, and Signal Processing 28 (5) (1980) 562–574.
si.7 More discussion of SST
In this section, we provide a continuous version of SST. For a given recorded EEG signal , its STFT is defined by
(SI.14) 
where is the chosen window, is the time, and is the frequency. As a complexvalued function, the local phase information of could be approximated by . Intuitively, the phase function records the number of oscillations, and this intuition leads us to calculate the following reassignment rule:
(SI.15) 
where means taking the imaginary part. The spectrogram of is finally sharpened by reallocating the coefficients of the spectrogram according to the reassignment rule:
(SI.16) 
where is the frequency. Intuitively, at each time , we identify all spectrogram entities that contain oscillatory components at frequency by , and put all identified entities in the slot. We call the synchrosqueezed EEG spectrogram. In addition to providing a sharp and concentrated spectrogram, it has been proved in Chen_Cheng_Wu:2014 that SST is robust to different kinds of noise. This property is desirable since the EEG signal is commonly noisy. For theoretical developments and more discussions, we refer readers to Daubechies_Lu_Wu:2011 ; Wu:2011Thesis ; Chen_Cheng_Wu:2014 .
To have a better insight of the SST algorithm, look at the following toy example. Take a harmonic function , where is constant. By a direct calculation, we have . From (SI.15), we know that the instantaneous frequency of can be recovered from the phase information of ; that is . A direct calculation leads to , which means that the spectra energy near is spread from for any . Thus, to obtain a sharp spectrogram, we only need to reallocate the spectrogram to the right frequency . And this motivates the design of the SST.
The numerical algorithm detailed in Section 2.1 in the main article is a direct discretization of the abovementioned continuous setup for SST, and hence the synchrosqueezed EEG spectrogram.
si.8 Summary of mathematical theory beyond diffusion map
Graph Laplacian based algorithms have attracted a lot of attention in the supervised learning society. Diffusion map
coifman2006diffusion is one of those successful algorithms. To better understand the theoretical foundation of DM, in the past decade, several works have been done based on the differential geometry framework; see, for example, Berard_Besson_Gallot:1994 ; Belkin_Niyogi:2005 ; Hein_Audibert_Luxburg:2005 ; Singer:2006 ; Gine_Koltchinskii:2006 ; Belkin_Niyogi:2007 ; VonLuxburg_Belkin_Bousquet:2008 ; Jones2008 ; Bates:2014 ; Portegies:2015 ; GarciaTrillos_Slepcev:2015 ; Wang:2015 ; Singer_Wu:2016 . In this section we summarize those theoretical results.Supposing a data set is independently and identically sampled from a random vector with the range supported on a low dimensional manifold embedded in with some mild conditions; for example, the manifold is smooth, the metric is induced from and could be well approximated from the dataset and the sampling density function is smooth enough. When , the transition matrix defined in (7) in the main article converges to a continuous diffusion operator defined on the manifold when the metric is chosen to be . If the sampling scheme is uniform, we can further approximate the LaplaceBeltrami operator of the manifold, denoted as , when ; if the sampling scheme is nonuniform but smooth enough, by estimating the density function we could correct the diffusion process by the normalization scheme proposed in coifman2006diffusion , and again approximate the LaplaceBeltrami operator of the manifold when . The convergence happens both in the pointwise sense and spectral sense GarciaTrillos_Slepcev:2015 ; Wang:2015 ; Singer_Wu:2016 . Therefore, we can view the eigenvectors and eigenvalues of as approximation of the eigenfunctions and eigenvalues of the LaplaceBeltrami operator associated with the manifold.
With the LaplaceBeltrami operator, we could apply the spectral embedding theory Berard:1986 ; Berard_Besson_Gallot:1994 to embed the manifold (and hence the data) using the eigenfunctions of the diffusion operator. Specifically, if we have , where , the spectral embedding is defined as
(SI.17) 
where is the diffusion time, and is a constant depending on . Note that the embedding defined in (8) in the main article is a discretization of . We also define to be the diffusion distance. This embedding allows us to reveal the geometric and topological structure of the manifold, and hence the structure of the dataset. In particular, it is shown in Berard_Besson_Gallot:1994 that is an almost isometric embedding when is small enough; in Singer_Wu:2012 , it is shown that the local geodesic distance of the manifold could be well approximated by the diffusion distance, when the diffusion time is long enough compared with the geodesic distance. Lastly, when combined with the finite dimensional embedding result of the spectral embedding theory of the LaplaceBeltrami operator shown in Bates:2014 ; Portegies:2015 , we could guarantee that with a finite sampling points, when is large enough, we can reconstruct the manifold with a given accuracy. This is the embedding property that we expect to reorganize the dataset.
The ability to reconstruct the underlying intrinsic structure is not the only significant strength of DM. It has been shown in ElKaroui:2010a ; ElKaroui_Wu:2016b that DM is also robust to noise in the following sense. Suppose the data point comes from contaminating a clean sample by some noise . Suppose
satisfies some mild conditions, like the finite variance and the noise level is reasonably controlled. Note that we do not require
to be identical from point to point. Denote the transition matrix built up from as . Under this condition, the deviation of from is well controlled by the noise level in the norm sense. Thus, we conclude that the eigenvectors of with sufficiently large eigenvalues could be well reconstructed from . This is the robustness property we expect from DM to analyze the noisy data.With the embedding property and the robustness property, we can well approximate the underlying geometric structure from the noisy dataset. To show this, recall that the clean points are sampled from a manifold as discussed above. By the Weyl’s law Berard:1986 , and spectral convergence of DM, the eigenvalues of decays exponentially fast. Therefore, by the robustness property, the first few eigenvectors and eigenvalues could be well reconstructed. However, the eigenvectors with small eigenvalues will be highly contaminated. Since eigenvalues of the clean data decay fast, those eigenvectors with small eigenvalues are not that informative from the spectral embedding viewpoint – although DM is a nonlinear map, when is chosen large enough, the finite dimensional embedding result guarantees that we can reconstruct the manifold, and hence DM is an almost isometric embedding. Therefore, we conclude that DM with the truncation scheme allows us to a well reconstruction of the clean data up to a tolerable error. We call this property the recovery property.
si.9 Local Mahalanobis distance and empirical intrinsic geometry
To recover the intrinsic sleep dynamics, we need a sophisticated metric to organize the synchrosqueezed EEG spectral features. In this paper, we consider the local Mahalanobis distance as the metric. To motive how the local Mahalanobis distance is designed and how it is incorporated into the interindividual prediction framework, below we review the dynamics system model, the empirical intrinsic geometry (EIG) model singer2008non ; TalmonPNAS . Then we show how to generalize the EIG model to carry out the interindividual sleep dynamics prediction, and detail the desired local Mahalanobis distance.
We start from recalling the EIG model as the motivation. The EIG model is designed specifically for a single subject. For each subject, we assume that the synchrosqueezed EEG spectral feature comes from a diffeomorphic deformation , where is the dimension of the synchrosqueezed EEG spectral feature, which maps the inaccessible intrinsic dynamical system that describes the brain activity into the synchrosqueezed EEG spectral feature; that is, , where is the dimension of the inaccessible dynamical space. We call the observation transform. Furthermore, assume that the inaccessible intrinsic state is the value of the process at the th epoch and satisfies the stochastic differential equation
(SI.18) 
where is an unknown drift function and is a standard dim Brownian motion. This model is called the EIG. Based on (SI.18), it is shown in (singer2008non, , Section 3) that when the intrinsic distance of and is sufficiently small, we could recover the intrinsic distance between and by
(SI.19) 
where is the covariance matrix associated with the deformed Brownian motion, up to the error term . Furthermore, it is shown in singer2008non ; TalmonPNAS that can be estimated by the covariance matrix of , where is a predetermined integer. We call the local Mahalanobis distance between and , which is a generalization of Mahalanobis distance from the statistical viewpoint.^{1}^{1}1The appearance of the Mahalanobis distance definition here is slightly different from the commonly used one, but it is actually a local version of the commonly used Mahalanobis distance. We thus call it the local Mahalanobis distance for the sake of simplicity.
While the EIG model works well for a single subject Katz_Wu_Lo_Talmon:2016 , it may not be suitable for the intersubject sleep assessment mission we have interest. Indeed, since different subjects have different sleep dynamics, and the observation transforms are different, physiologically it is not reasonable to quantify the intrinsic state dynamics of different by (SI.18). Indeed, it does not make sense from the physiological viewpoint to integrate the temporal sleep dynamics of different subjects into one; that is, there is no temporal relationship between epochs from different subjects. In this paper, motivated by the success of (SI.18) in the intrasubject sleep stage prediction wu2015assess , we consider the following generalization of the EIG model.
Based on the physiological assumption that the EEG signals of different subjects share similar spectral behavior, we assume that the synchrosqueezed EEG spectral features from different subjects comes from one map , which maps the inaccessible intrinsic state , where is the dimension of the inaccessible space hosting the state space of the dynamics, which is assumed to be the same among subjects, to the space hosting the synchrosqueezed EEG spectral features. We further assume that and is an identity from the state space to its range perturbed by a stationary random perturbation that has mean 0 and the covariance . To simplify the terminology, we still call the observation transform. Note that this is the simplified EIG model with noise in the observation considered in Talmon2013 . Note that due to the lack of temporal information between epochs from different subjects, this model is nothing but assuming that the synchrosqueezed EEG spectral features are noisy and the noise is independent and identical distributed noise.
Under this simplified EIG model, we can get an estimate for similar to (SI.19) by modifying the proof of singer2008non . Denote the neighborhood of by for each , where and the ratio is predetermined. Based on the assumption of , when the data is supported on a dimensional smooth manifold embedded in , we have
(SI.20) 
where means the truncated pseudoinverse of defined by , is the eigendecomposition, , and with . Similarly, the covariance of the stationary random perturbation associated with becomes
(SI.21)  
Combining (SI.20) and (SI.21) yields
(SI.22) 
Therefore, following the observation in (SI.19), for the interindividual prediction mission we have interest, we thus consider the following metric to compare/organize the synchrosqueezed EEG spectral features:
(SI.23) 
which we also call the local Mahalanobis distance for the sake of simplicity. In this work, we take . We mention that this approach is nothing but obtaining a more accurate geodesic distance estimation by a direct hard threshold of the noisy covariance matrix to remove the influence of noise. A more general discussion can be found in JohnNan2018 .
si.10 Multivew DM and coclustering
In this section, we discuss the relationship between the sensor fusion algorithm, multiview DM, and the well known coclustering algorithm based on the bipartite graph model dhillon2001co . Denote (resp. ) to be the set of vertices representing the feature vectors extracted from the first (resp. second) channel. Form a bipartite graph , where consists of vertices from and and is the set of edges between vertices in and vertices in ; that is, edges only exist between and and there is no edge inside and there is no edge inside . Then, by assigning weights on all edges, for all , we obtain a bipartite affinity graph. Denote to be the affinity matrix associated with the constructed bipartite affinity graph. Note that the first columns and rows of M are associated with the first sensor, and the last columns and rows are associated with the second sensor.
In dhillon2001co , based on , it is argued how we can determine the corresponding clusters in based on the clusters in , and vice versa. More precisely, given disjoint clusters for vertices in , where is a predetermined integer, the clusters for vertices in can be formed by
(SI.24) 
where . The motivation beyond this assignment is intuitive – a given epoch from the second channel has a higher chance to belong to the th cluster if it more likely belongs to the th cluster of the first channel than other clusters of the first channel. Similarly, given disjoint clusters for vertices in , the induced clusters for vertices in are determined by
(SI.25) 
The information extracted from the first and second channels iteratively interacts through (SI.24) and (SI.25). This approach is called the “coclustering” algorithm.
To understand the relationship of this coclustering algorithm and the multiview DM, we illustrate how to bipartition the vertices of a bipartite graph . In short, it is directly related to the traditional spectral clustering algorithm and the well known Cheeger’s inequality Fan:1996 , and its multiway clustering is supported by the recently developed theory for the multiway spectral clustering algorithm LeeTrevisan2014
. Take the biclustering for the illustration purpose. First of all, define a loss function, normal cut,
for a set of clusters for vertices in by(SI.26) 
where
(SI.27) 
and is a diagonal matrix, whose th diagonal entry represents the degree of the th vertex; that is, . Given two disjoint sets and with , the generalized partition vector is defined by
(SI.28) 
It is shown in (dhillon2001co, , Theorem 3) that
(SI.29) 
which implies that the problem of finding a balanced partition with small cut value can be relaxed and cast as an eigenvalue problem as follows
(SI.30) 
The minimizer of (SI.30) is the eigenvector of corresponding to the second largest eigenvalue
, and the bipartition is achieved by running kmean on
with . The first entries of (as well as other eigenvectors) are associated with the clustering of the first sensor, and the last entries are associated with the clustering of the second sensor. The result is the “cocluster” of the two sensors. To cocluster vertices associated with two sensors into more than two clusters, or usually called multiway clustering, we can directly generalize the whole idea, including the normal cut functional and the spectral clustering. Then take eigenvectors, , into account and run kmean on . The theoretical guarantee based on the generalized Cheeger’s inequality is provided by LeeTrevisan2014 .In this work, for two EEG sensors and , we define the affinity matrix by taking the product of affinities of two sensors by
Comments
There are no comments yet.