Modeling Binary Time Series Using Gaussian Processes With Application to Predicting Sleep States
Abstract
Motivated by the problem of predicting sleep states, we develop a mixed effects model for binary time series with a stochastic component represented by a Gaussian process. The fixed component captures the effects of covariates on the binaryvalued response. The Gaussian process captures the residual variations in the binary response that are not explained by covariates and past realizations. We develop a frequentist modeling framework that provides efficient inference and more accurate predictions. Results demonstrate the advantages of improved prediction rates over existing approaches such as logistic regression, generalized additive mixed model, models for ordinal data, gradient boosting, decision tree and random forest. Using our proposed model, we show that previous sleep state and heart rates are significant predictors for future sleep states. Simulation studies also show that our proposed method is promising and robust. To handle computational complexity, we utilize Laplace approximation, golden section search and successive parabolic interpolation. With this paper, we also submit an Rpackage (HIBITS) that implements the proposed procedure.
Keywords: Binary time series; classification; Gaussian process; latent process; sleep state.
1 Introduction
The American Academy of Sleep Medicine indicates that humans go through several cycles during sleep with each cycle comprised of different stages. It is important to study sleep in humans because the lack of sleep is associated with psychiatric diseases (e.g. depression and ADHD) and chronic diseases (e.g. diabetes, heart disease and hypertension). In particular, understanding sleep state (asleep versus awake) and uncovering its latent pattern play a critical role in people’s daily routine. For example, many young mothers wonder if their infant’s sleep state can be predicted in advance; physicians are interested in forecasting their patient’s anesthesia level/sleep state for surgery. The goal of this paper is to develop statistical inference for studying changes in the sleep state (in particular, asleep versus awake) and the potential roles of covariates such as heart rate, respiration rate and body temperature on sleep states. A plot of the sleep states and the exogenous time series of heart rate and temperature, given in Figure (1), suggest a leadlag depenence between sleep states and the exogenous time series. In this paper, we develop a model that formally tests for these leadlag dependence and predict future sleep states.
Various approaches have been proposed to model and predict sleep states. Since sleep states can be measured sequentially in time during an experiments or in observational studies, typical strategies for modeling categorical time series have been implemented. Caiado et al. (2006) introduced new measurements in classifying time series based on periodograms. Maharaj (2002) put forward a framework of comparing time series in frequency domain. Wavelet based clustering method was also introduced by Maharaj et al. (2010). Jacobs and Lewis (1978) proposed a discrete autoregressivemoving average (DARMA) model by utilizing probabilistic mixtures. A comprehensive modeling framework based on generalized linear models and partial likelihood inference have been developed in Fokianos and Kedem (2002) and Fokianos and Kedem (2003). Fokianos and Kedem (1998) extended the partial likelihood analysis to nonstationary categorical time series including stochastic time dependent covariates. With the Markovian structure, Meyn and Tweedie (2012), Bonney (1987) and Keenan (1982) developed inferential procedures based on the conditional likelihood. These previous studies provide inference on binary time series. Their main drawback is that they involve massive computation for high dimensional integrals, which results in poor prediction accuracy. Lindquist and McKeague (2009) introduced a logistic regression model with functional predictors and extended it to generalized linear model. Their substantial work was superior in detecting sensitive and interpretable time points that were most predictive to the response. However, when it is applied to this study, the drawbacks are: (1) the Brownian motion assumption is unlikely to be satisfied in practice because the covariates in this study hardly have the property of increment independence; (2) the influence of covariates on responses is assumed to spread across the entire trajectory and hence implies the nonexistence of “sensitive time points”; (3) prediction of the time series is not developed, which could be a serious limitation for this project since we are also interested in such predictions.
From the view of the machine learning community, typical classification methodologies such as decision tree, random forest and strategies such as boosting can also be used for predicting sleep states. Although such approaches are able to achieve predictions with high accuracy, the major drawback is that they give very little guidance to sleep researchers who are measuring the impact of previous heart rate, temperature and respiratory rate on future sleep state. In this paper, we develop a statistical model that can provide us simultaneously with convincing inference and interpretation at the same time produce prediction accuracy that is higher than that achieved by typical machine learning classification approaches.
This work is inspired by Keenan (1982) which developed a binary time series using a latent strictly stationary process. The focus here is to provide an accurate, interpretable, efficient yet computationally less demanding approach for estimation and prediction. When prior information indicates that a binary time series is determined by a process comprised with fixed and random components, we decompose the unobserved latent process into linear and stochastic effects with different covariates. On stage one, inference on the fixed effects is conducted using maximum likelihood estimation. On stage two, conditioned on the estimated fixed effect, a Gaussian process will be used to represent the random components. Predictions are obtained by combining inference on these two components. In addition, based on the results from these two stages, we use parametric bootstrap samplers from the estimated Gaussian process to obtain the final point and interval estimates of parameters.
Using the proposed procedure, we can identify the dependence of the endogenous time series (sleep state) on potential covariates (e.g., heart rate and body temperature) by providing the point and interval estimates of the coefficients from linear effects based on the results from the twostage algorithm. Inference can be easily and directly performed by maximum likelihood using existing software. Moreover, results are easily interpretable under the framework of generalized linear model. On stage two, which is derived from Gaussian process classification strategy, we can predict the sleep state with high accuracy. Laplace approximation was implemented to reduce the computation cost. This work is also inspired by Brillinger (1983) which, to the best of our knowledge, is the first to introduce this notion of a Gaussian random effect as random intercept in a logit model. Here, we generalize this by representing the random component as a stochastic process rather than just a scalar random variable.
The main advantages of our proposed approach, which we call the hybrid inference method for binary time series (HIBITS), are the following: (1) it accounts for the linear and nonlinear stochastic effects of covariates and endogenous variables on sleep states; (2) it provides efficient point and interval estimates of the coefficients from the linear effects while maintaining type I error rates; (3) it produces more accurate predictions compared to other existing approaches; (4) it is easily implemented with low computational cost; and (5) unlike other classification approaches, it gives more straightforward interpretation of the results.
The remainder of this paper is organized as follows. Section 2 is devoted to brief introduction of Gaussian process and its existing applications in regression and classification. In Section 3, we develop our proposed methodology and discuss the motivation and the technical derivation of the proposed HIBITS method. A complete algorithm that yields prediction and inference on the coefficients of covariates and endogenous variables is also provided. Model selection strategy is also developed to address application problems. Section 4 presents the simulation results that show the benefits of the proposed method over the existing methods in terms of the significant higher prediction accuracy and narrower confidence intervals. In Section 5, we apply our proposed model and inference procedure to identify predictors of sleep states and to predict future sleep states. The results are promising in terms of prediction accuracy at low computational cost and interpretability. Moreover, the proposed method can also be modified when there are missing values.
2 Background on Gaussian Processes in Binary Time Series
2.1 Gaussian process and regression models
Gaussian process have been widely developed in spatialtemporal modeling (Williams and Rasmussen, 2006; Banerjee et al., 2008, 2014; Gelfand et al., 2005; Quick et al., 2013; Stein, 2012; Zhou et al., 2015; VandenbergRodes and Shahbaba, 2015; Wang and Gelfand, 2014). It provides a framework that can capture the nonlinear and stochastic components of exogenous and endogenous variables based on generalized linear models, which makes it useful for modeling binary time series and classification.
The definition of a Gaussian process is as follows.
Definition 1.
A stochastic process is a Gaussian process if and only if for every finite set of indices in the index set , is a multivariate Gaussian random variable.
We will write the Gaussian process as where Let us now denote the observed data to be , where and is the response data. We split the dataset into training points and testing points. Let () represent the testing datasets and () represent the training datasets respectively. Define It follows that
(1) 
The distribution of the response can be determined by Equation (1). Point estimates, interval estimates and sampling distribution of can be derived accordingly.
2.2 Gaussian process in modeling binary time series
Model formulation
Denote the observed training data as , where and . For our data in this paper, denotes the sleep state at time point and can be heart rate or body temperature at time point . We define a latent Gaussian process indexed by as . The relationship between and is characterized by where is a link function that determines the relation between and the probability of the sleep state. To name a few, can be a logit, probit or complementary loglog link functions (McCullagh, 1984).
Classification method
For a given link, the inferential procedure will be divided into two steps. First, we compute the distribution of the latent process on the test data
(2) 
where . Then, we estimate the conditional probability of by
(3) 
which is approximately a weighted average of the probability of over all possible realizations of predicted stochastic components that is a Gaussian process.
It should be pointed out that both of the two integrands in Equations (2) and (3) do not have closed forms. For Equation (3), following the argument in Williams and Rasmussen (2006), numerical tools such as Monte Carlo method can be used to obtain the approximate value of the integral given . To obtain Equation (2), Williams and Barber (1998) introduced Laplace approximation for this problem. Minka (2001) proposed an alternative expectation propagation(EP). Besides these methods, a number of MCMC algorithms have also been considered. In the following section, we will follow the direct Laplace approximation.
From Equation (2), we can write the approximate distribution of as where is the MLE of the distribution and is the observed Fisher information matrix. To find the value of , Newton’s method can be implemented, where in each iteration , where and is the covariance matrix of . Thus, the distribution can be approximated by
Opper and Winther (1999) suggested the conditional expectation of could be obtained by Following similar arguments, the conditional variance of can be obtained by Given the mean and variance, at the last step, the probability of can be approximated by It should be pointed out that the Gaussian process essentially captures information beyond those provided by past value of both endogenous and exogenous time series.
Remark 2.
takes the following forms for the logit and probit links, respectively,
Here and are the normal probability density function and the cumulative distribution function, respectively.
3 HIBITS: The hybrid estimation method for modeling and predicting binary time series
Building on the established theoretical foundations of Gaussian processes, we now develop a novel twostage inference and classification method. This section is organized as follows: in Section 3.1, we discuss the motivation of using the hybrid strategy in modeling sleep stage; followed by details of the twostage hybrid method in Section 3.2; in Section 3.3, we discuss our model selection strategy; and in Section 3.4, we provide a method in providing point and interval estimates of the coefficients of the covariates and endogenous variables.
3.1 Motivation
The common approach is to use a Gaussian distribution with zero mean value as a random effect if the latent process yields, equally likely, positive and negative fluctuations around 0 (Kuss, 2006). Yet, when it comes to real data, this set up overlooks the linear structure between covariates and the actual response of interest. For instance, to model the binary sleep state, scientists believe that body temperature and heart rate should be involved as potential predictors. In Fokianos and Kedem (2002), a regressionbased approach for modeling covariates is proposed. However, if we naively utilize the existing Gaussian distribution with zero mean function to model the data, the latent process equally produces positive and negative value fluctuating around 0 which can produce misleading results because it will render the effects of covariates (body temperature and heart rate) to be insignificant. In addition, incorporating those covariates in the covariance function is a reasonable approach to modeling the association. However, the interpretation is complicated. Much work has been done to overcome the aforementioned limitations. To name a few, Snelson et al. (2004) proposed an approach to transform data in agreement with the Gaussian process model. Their work generalized the Gaussian process by warping the observational space. Although the transformed data can be fitted by Gaussian process, it leads to difficulty in the interpretation of the transform. Another drawback is that the effects of particular covariates could be lost (or difficult to interpret). Cornford (1998) suggested a Gaussian process regression model with mean function . Their work incorporates the effect of particular covariates. The main drawback is the computational burden that results from the choice of hyperparameter and MCMC sampler when it applies to classification problem. Building on the prior work, we develop a twostage method that takes advantage of the strengths of the existing methods. It is able to model the linear association with particular covariates while maintaining computational efficiency.
3.2 The proposed HIBITS method
Consider the data where Here, are the covariates in the fixed effects part and are covariates in the stochastic part. Then, We now propose the systematic component of the generalized linear model to take the form
where and The systematic component with fixed and random effects follow a linear mixed effect model with the first part capturing the fixed effect and the second part describing the randomness that is not covered by the first part. Note that does not include an intercept term on this stage. Following the same notation as previous sections, we denote as the training dataset and as the testing subsets. The proposed inference method proceeds as follows.
Stage 1. Inference on the fixed effect.
The joint likelihood function can be written as
(4) 
On the first stage, we consider the latent Gaussian process fixed across time . Numerical algorithms such as NewtonRaphson method can be used to obtain , the MLE of the joint likelihood function. In fact, in this stage, we regard the latent Gaussian process as the timeinvariant intercept of the logistic regression, which is considered fixed but unknown.
Stage 2. Inference on the stochastic components.
On the second stage, we make use of the result of inference on the fixed effect from Stage 1 and adjust the estimates by introducing the latent Gaussian process . Conditional on we define then it follows that
Here, we model the stochastic component as a Gaussian process with covariance function
(5) 
and takes value 1 when and 0 otherwise. The parameters and are estimated by the strategy proposed by Section 3.3 and we will not specify any prior on those parameters. Since is known, we can implement the strategy in Section 2.2 in dealing with the predictive probability from Equation (3). The complete hybrid method can be summarized in the following Algorithm 1.
Remark 3.
The Hessian matrix is a diagonal matrix with the following elements for the logit and probit link respectively,
3.3 Model selection
Strategies on model selection are also presented in two steps.
Step 1. In this study, we will use exploratory analysis to choose variables. Alternatively, we could use AIC or BIC focusing on the fixed effects. Using automatic variable selection strategies based on AIC or BIC, we can choose a model with a subset of predictors. AIC value is defined as and BIC is defined as , where is the number of parameters, is the number of observations and is the maximum value of likelihood.
Step 2. We select the parameters for the covariance matrix by maximum likelihood estimation. The strategy is inspired by the work of Williams and Rasmussen (2006). Our work is similar in terms of maximizing the marginal likelihood but differs in the way that the both fixed and random effects are involved.
We denote as the parameters in the covariance structure . The approximate log marginal likelihood is
(6) 
where and is defined in Section 2.2.2. The strategy is to choose the value of that maximizes Equation (6). Note that the covariance matrix ()and involve parameters , the partial derivative of is therefore
where and are defined as follows
NewtonRaphson method or coordinate descent will be applied to optimize the log marginal likelihood in Equation (6).
In this study, the parameters from Equations (5) are and Through our simulation studies, we specify the parameters and and apply the aforementioned strategy on estimating for the following reasons: (1) it might lead to identifiability problem if we do not fix some of the parameters in this frequentist setting; (2) results do not show much difference if parameters and are not fixed; (3) computation will be demanding if no parameter is fixed.
3.4 Inference on the effects of covariates
We propose to use bootstrap sampler to provide point and confidence intervals of the linear coefficients of the covariates . Our approach is based on resampling the stochastic component and maximum likelihood. The algorithm is summarized in Algorithm 2.
3.5 Summary
In summary, the proposed method on inference, prediction and model selection maintain the following strengths: (1.) it uses linear and nonlinear stochastic components to model the effect of the covariates on the response; (2.) it provides point and interval estimates of the linear effects that are more efficient than the existing methods as demonstrated in Section 4; (3.) it is able to make accurate predictions as shown in Section 4; (4.) the computational cost is not demanding; (5.) similarly to generalized linear models, it provides results that are straightforward to interpret.
4 Simulations
In this section, simulations are implemented to test the performance of the proposed method. In Section 4.1, binary time series are generated by the logit model. We compared the classification error rates derived from the proposed method with 6 other competing statistical and machine learning approaches, namely, the ordinal model, logistic regression, generalized additive mixed model, random forest, decision tree and gradient boosting. We also compute the point and confidence intervals of the coefficients of the covariates and endogenous variables in comparison with other existing methods. To test the robustness of our method, in Section 4.2, we generate time series from the probit model but use the logit model to fit the data. In Section 4.3, we utilize mixture kernels to generate the Gaussian process and then apply the proposed HIBITS method. Classification error rates, point estimates and confidence intervals are also utilized as measures for comparison.
4.1 Prediction and inference performance on logit model
To evaluate the prediction power and robustness of the proposed method, binary time series are generated under two scenarios:

Scenario 1 (with a stochastic process).
; 
Scenario 2 (without a stochastic process).
.
Here, follows Gaussian process with
and takes value 1 when and 0 otherwise. The parameter controls the strength of dependence on previous realizations and it denotes the log odds ratios of versus . is the linear coefficients with respect to covariates at current time point. is the parameter that determines the strength of dependence across adjacent time points. In this simulation, parameters and vary in different scenarios. simulations are conducted in each scenario. Figure 2 shows plots of the simulated data. In this scenario, 500 sleep stages were generated.
Alternative Methods. To evaluate the prediction power of the proposed method, we compare the classification error rates with other six competing approaches. In general, those approaches include regression and tree based classification strategies. Generalized linear model with logit link is fitted as the first competing method. Further, to respect the correlated structure of the binary time series, we consider the generalized additive mixed models (GAMMs) as the second regression based competing approach. In the work of Lin and Zhang (1999), linear structures of covariates are extended to be smooth functions. Following the notation in Section 3, the GAMM model is defined as
where denotes the component of vector , is a centered twicedifferentiable smooth function, the random effects are assumed to be distributed as and is the variance components. Lin and Zhang (1999) estimated nonparametric functions and parameters by using smoothing splines and marginal quasilikelihood. In this simulation, R package ‘gamm4’ was implemented to test the performance of GAMMs. We also considered the regression models for nominal and ordinal time series introduced by Fokianos and Kedem (2002). As is discussed in their concrete work, we implemented ordinal time series model in the simulation. It should be pointed out that due to the binary response, ordinal time series model is degenerated into logistic regression. Simulation results also suggest the equivalence of these two approaches. In addition, we compared our method to treebased classification approaches. In general, we split the feature space (heart rate and previous sleep states in this study) into “subspaces” and fit simple models within each region. Following the derivation in Friedman et al. (2001), for a node denoting a region with observations, we let
where class is either or and is the indicator function. We assign the observations in node to class Measures of node impurity, denoted as , can be chosen as the misclassification error, Gini index and crossentropy or deviance.
To further extend the decision tree approach, we also consider random forest and gradient boosting algorithms in the simulation. The essence of random forest is to average many noisy but asymptotically unbiased classifiers and hence reduce the variation. It requires bootstrapping samples and selection features from the training dataset. Since there exist only a few features in this model, the benefit from using random forest approach is mainly derived from the bootstrapping strategy. For each bootstrap training sample set, we grow a random forest tree The final output is the ensemble of trees and then predictions are made by majority vote. In addition to random forest, gradient boosting is another extension of decision tree based method. Similar to the general boosting methods, gradient boosting searches for a strategy to combine multiple weak classifiers in an iterative manner. As discussed in Friedman (2001) and Friedman et al. (2001), the method generically starts from a model with constant value. At iteration , suppose the classifier is denoted as , we calculate pseudoresiduals by
where is a loss function. Then, we fit a classifier to the pseudoresiduals and implement a line search algorithm in solving the optimization problem
At the end of this iteration, we update the model by
We keep repeating the full sweep until convergence. The final classifier is denoted as .
Model Evaluation. To formally evaluate the performance of all the aforementioned approaches, we calculate the classification error rates under both scenarios. In particular, we fit the results in linear mixed effect model to account for the correlation among classification errors across different methods that result from the same simulated dataset. We consider the model
where denotes the classification error rate of approach on dataset ; is the mean error rate of method , which is welldefined by the law of large numbers. and We calculate the simultaneuous Bonferroni confidence intervals of to detect the difference in the mean error rates between the proposed method with all the other approaches. In particular, denote the mean error rates of HIBITS, Ordinal model (logistic regression), GAMMs, Random forest, Gradient boosting and Decision tree respectively.
Table 1 provides a summary of the simulation studies for various parameters. It can be seen that for datasets with Gaussian process, there is statistically significant difference in comparison with the competing methods. In particular, the proposed HIBITS method produces significantly lower prediction error rates compared to existing methods. The advantage of the proposed approach is even more obvious when compared with gradient boosting and decision tree approaches. The results show that the proposed HIBITS method captures effective information from covariates , and also the stochastic process. The covariate serves as a significant predictor as we increase the ratio of over .
For the datasets generated without the Gaussian process (Scenario 2) shown in Table 2, the accuracy prediction from the twostage approach is significantly higher than some of the existing approaches such as decision tree and gradient boosting. Among all the other competitors, the proposed method behaves equally competitive. This shows the robustness of the proposed approach when data have no Gaussian process components. This is partly due to the strategy on choosing hyperparemeters. By controlling their values, the effects of Gaussian process will be adjusted to the data.
Scenario 1  
Parameters()  Competing Method  95% confidence interval of 
Ordinal model*  
GAMMs  
Random forest  
Gradient boosting  
Decision tree  
Ordinal model  
GAMMs  
Random forest  
Gradient boosting  
Decision tree  
Ordinal model  
GAMMs  
Random forest  
Gradient boosting  
Decision tree  
Ordinal model  
GAMMs  
Random forest  
Gradient boosting  
Decision tree  
* For binary time series, ordinal model is equivalent to logistic regression. 
Scenario 2  
Parameters()  Competing Method  95% confidence interval of 
Ordinal model  
GAMMs  
Random forest  
Gradient boosting  
Decision tree  
Ordinal model  
GAMMs  
Random forest  
Gradient boosting  
Decision tree 
We also evaluate the performance of modeling the linear effects of covariates , by comparing the confidence intervals of and with the corresponding interval estimates from the other existing methods. Table 3 summarizes the results under the same scenarios in Table 1. It shows that compared with ordinal model, the proposed HIBITS method produces narrower confidence intervals of parameters while maintaining high capture rates of the true values. The length difference is obvious and it can gain almost shorter confidence intervals in some scenario. It should be noted that using ordinal model, the capture rate of is extremely low while HIBITS method provides promising performance. The same pattern can also be found in Table 4. Under Scenario 2, the benefits of using HIBITS is even more obvious in terms of shorter confidence interval length and high capture rate.
Scenario 1  
Parameters()  Method  95% confidence interval of  
HIBITS method  
Ordinal model  
HIBITS method  
Ordinal model  
HIBITS method  
Ordinal model  
HIBITS method  
Ordinal model 
Scenario 2  
Parameters()  Method  95% confidence interval of  
HIBITS method  
Ordinal model  
HIBITS method  
Ordinal model 
Overall, the proposed method outperforms competing approaches when comparing the results from data both with and without Gaussian process. Through the model selection strategy discussed in Section 3.3, the proposed approach can adjust the covariance matrix to the data, which in return produces lower prediction error rate and more efficients inference on covariates than existing methods.
4.2 Investigating robustness of the estimation method
Our goal is to investigate robustness of the proposed model by applying the logisticbased model on data that are generated using a probit model. We generate binary time series following the scenarios:

Scenario 3 (with a stochastic process).
; 
Scenario 4 (without a stochastic process).
. is the cumulative distribution function of standard normal distributions and is defined in the same manner as in Section 4.1.
Parameters and vary in different scenarios. simulations are conducted in each scenario. We fit the same linear mixed effect model discussed in Section 4.1. Tables 5 and 6 show the summary of confidence intervals Similar to the results in Section 4.1, for dataset with Gaussian process, most of the confidence intervals do not cover 0. The negative values of the classification error rates imply remarkable benefits of using the proposed method over the other competing methods. Note that when comparing with the gradient boosting and decision tree approaches, the proposed method behaves significantly better in terms of extraordinary higher prediction accuracy. In Scenario 4, we tested the proposed method on the dataset without Gaussian process. It is shown that although there is no significant difference in comparison with other competing methods, the proposed approach produces the same prediction power as other competing methods, which implies the robustness with regard to various link functions. In addition, Table 7 shows the confidence intervals of the coefficients derived from the proposed method and the ordinal model. Similar to the results in Section 4.1, the proposed method yields much narrower confidence intervals while maintaining good properties of capturing true values.
Scenario 3  
Parameters()  Competing Method  95% confidence interval of 
Ordinal model  
GAMMs  
Random forest  
Gradient boosting  
Decision tree  
Ordinal model  
GAMMs  
Random forest  
Gradient boosting  
Decision tree  
Ordinal model  
GAMMs  
Random forest  
Gradient boosting  
Decision tree  
Ordinal model  
GAMMs  
Random forest  
Gradient boosting  
Decision tree 
Scenario 4  
Parameters()  Competing Method  95% confidence interval of 
Ordinal model  
GAMMs  
Random forest  
Gradient boosting  
Decision tree  
Ordinal model  
GAMMs  
Random forest  
Gradient boosting  
Decision tree 
Scenario 3  
Parameters()  Method  95% confidence interval of 