A IRLS estimation of \alpha following a GEE analysis with FSE

Marginal analysis of longitudinal count data in long sequences: Methods and applications to a driving study

Abstract

Most of the available methods for longitudinal data analysis are designed and validated for the situation where the number of subjects is large and the number of observations per subject is relatively small. Motivated by the Naturalistic Teenage Driving Study (NTDS), which represents the exact opposite situation, we examine standard and propose new methodology for marginal analysis of longitudinal count data in a small number of very long sequences. We consider standard methods based on generalized estimating equations, under working independence or an appropriate correlation structure, and find them unsatisfactory for dealing with time-dependent covariates when the counts are low. For this situation, we explore a within-cluster resampling (WCR) approach that involves repeated analyses of random subsamples with a final analysis that synthesizes results across subsamples. This leads to a novel WCR method which operates on separated blocks within subjects and which performs better than all of the previously considered methods. The methods are applied to the NTDS data and evaluated in simulation experiments mimicking the NTDS.

\kwd
\doi

10.1214/11-AOAS507 \volume6 \issue1 2012 \firstpage27 \lastpage54 \setattributecopyrightownerPublic Domain

\runtitle

Marginal analysis of longitudinal count data

\thankstext

T1Supported by the Intramural Research Program of the National Institutes of Health (NIH), Eunice Kennedy Shriver National Institute of Child Health and Human Development. The computation was facilitated by the Biowulf cluster computer system made available by the Center for InformationTechnology at the NIH.

{aug}

A]\fnmsZhiwei \snmZhang\correflabel=e1]zhiwei.zhang@fda.hhs.gov, B]\fnmsPaul S. \snmAlbertlabel=e2]albertp@mail.nih.gov and C]\fnmsBruce \snmSimons-Mortonlabel=e3]mortonb@mail.nih.gov

Correlation \kwdgeneralized estimating equation \kwdmultiple outputation \kwdoverdispersion \kwdrandom effect \kwdseparated blocks \kwdwithin-cluster resampling.

1 Introduction

In this paper we consider the analysis of longitudinal data that arise in the form of a small number of long sequences. Our interest in this problem is motivated by the Naturalistic Teenage Driving Study (NTDS), an observational study of teenage driving performance and characteristics [Simons-Morton et al. (2011a; 2011b)]. In the study, 42 newly licensed teenage drivers in Virginia were monitored continuously during their first 18 months of independent driving using in-vehicle data recording systems. The instrumentation included accelerometers, video cameras, a global positioning system, a front radar and a lane tracker. The NTDS is a first of its kind, at least for teenage drivers in the United States. The study provides valuable information on risky driving behavior, which can be assessed in terms of elevated gravitational force (-force) events (rapid start, hard stop, hard turn and yaw). Counts of -force events are available for each trip (defined as ignition on to ignition off), and their incidence rates represent different aspects of risky driving behavior. The NTDS data set comprises more than 68,000 trips by the 42 teen subjects, with an average of 1,626 trips per subject. Figure 1 provides a summary of the NTDS data by subject in terms of the number of trips made and the total number of miles driven. An important goal in our analysis of the NTDS data is to understand how risky driving is associated with subject-level covariates (i.e., individual characteristics such as gender) as well as trip-level or time-dependent covariates (e.g., time since licensure, presence of passengers).

Figure 1: Summary of NTDS data by subject: the number of trips made and the total number of miles driven, both plotted on the logarithmic scale.

There is an extensive literature on longitudinal data analysis; see, for example, Diggle et al. (2002), Fitzmaurice et al. (2008) and McCulloch, Searle and Neuhaus (2008). Common approaches to longitudinal data analysis include random effect models [Laird and Ware (1982), McCulloch, Searle and Neuhaus (2008)] and generalized estimating equations (GEE) [Liang and Zeger (1986), Zeger and Liang (1986), Zeger, Liang and Albert (1988)]. In general, the two approaches have different interpretations (subject-specific versus marginal), although that distinction is not important when the log link is used for count data. For analyzing the NTDS data, it is natural to consider generalized linear mixed models (GLMM) with appropriate random effects to account for population heterogeneity, serial correlation and/or possible overdispersion [Chan and Ledolter (1995), McCulloch (1997), McCulloch, Searle and Neuhaus (2008)]. However, a realistic GLMM for the NTDS data would involve several random components, including a latent process that induces serial correlation (see Section 5), and the computational demand of such a GLMM analysis can be prohibitive with thousands of trips per subject. Even if the computation is feasible, the resulting inference may be sensitive to modeling assumptions that are difficult to verify. The computational burden can be reduced by resorting to a marginal analysis using the GEE approach, especially under a working independence assumption. The GEE approach only requires specification of the first two moments and not the entire distribution. The robust variance estimate can be used to make asymptotically valid inference that only requires correct specification of the mean structure, assuming that the number of subjects is large relative to the number of observations per subject. The exact opposite situation occurs in the NTDS, and Albert and McShane (1995) have shown that the robust variance estimate can perform poorly in such a situation and that the model-based variance estimate may be preferable. Numerous methods have been proposed to improve upon the robust variance estimate, including jackknife [Paik (1988), Lipsitz, Laird and Harrington (1990)], bias correction [Mancl and DeRouen (2001)] and window subsampling techniques [Sherman (1996), Heagerty and Lumley (2000)].

Given the well-known advantages and potential issues of the GEE approach, it is natural to ask how to perform a simple and valid marginal analysis of the NTDS data with reasonable efficiency and robustness. We attempt to address this question in the present paper by examining some existing methods and developing new ones. We find that it is generally helpful to include a fixed effect for each subject in the GEE model. Once the subject effects are estimated, they can be treated as the response variable in a subsequent linear regression analysis for estimating the effects of subject-level covariates. These fixed effects also help with trip-level covariates by removing the correlation due to population heterogeneity. If the counts are large (with a marginal mean of 1, say), the effects of trip-level covariates can be estimated from a conventional GEE analysis using an estimated covariance matrix together with the robust variance estimate. For small counts (with a marginal mean of 0.1), the conventional approach is not satisfactory, and we explore a within-cluster resampling (WCR) approach [Hoffman, Sen and Weinberg (2001), Follmann, Proschan and Leifer (2003)]. The WCR approach was originally proposed to deal with informative cluster sizes, which is not of concern in the NTDS. Our motivation for considering WCR is to improve the performance of GEE methods by altering the data structure. To this end, we consider extensions of WCR that reduce serial correlation within clusters or increase the number of (approximately) independent clusters. This leads to a WCR method involving separated blocks which performs better than the conventional methods.

The rest of the paper is organized as follows. In the next section we set up the notation, formulate the problem, and discuss potential issues. Then we examine the standard GEE methods in Section 3 and explore some WCR methods in Section 4. The NTDS data are analyzed in Section 5 using the appropriate methods. The paper concludes with a discussion in Section 6.

2 Formulation

Let (; ) denote the number of events that occur during the th trip by the th subject. For reasons that will become clear later, we distinguish subject-level covariates such as gender from trip-level covariates such as time since licensure, writing for the former group of covariates and for the latter. Note that may include interactions between subject-level and trip-level covariates. With denoting the mileage of the th trip by the th subject, the marginal model of interest to us may be written as

(1)

where , and are unknown parameters, the latter two being of primary importance. Without loss of generality, we treat the covariates as fixed.

Estimation of the parameters in model (1) is challenged by the fact that the from the same subject tend to be correlated due to considerable variability between drivers as well as serial correlation (over time) within drivers. These sources of correlation are often accounted for using random effects [e.g., Zeger (1988), Davis, Dunsmuir and Wang (2000)]. For example, one might postulate that, conditional on the random effects , and , the are independent and each follows a Poisson distribution with conditional mean

(2)

where induces some heterogeneity between drivers (beyond that explained by ), generates serial correlation among trips close in time, and accounts for any additional overdispersion relative to the Poisson model. It is often assumed that the for a given subject arise from a subject-specific stochastic process through the relationship , where denotes the time of the th trip. Note that model (2) is consistent with model (1) because it implies the same mean structure after integrating out the random effects, provided the distributions of the random effects do not depend on the covariates.

For the NTDS data, it seems reasonable to equip model (2) with the following distributional assumptions. One might assume that , , and is a zero-mean Gaussian process with a covariance structure given by

(3)

The random effects and and the process are assumed independent of each other, although the are necessarily correlated within each subject. The parameter determines how rapidly the serial correlation decreases with the gap time. The process is known as the Ornstein–Uhlenbeck process [Uhlenbeck and Ornstein (1930)], and the above distributional assumptions will be collectively referred to as the Gauss–Ornstein–Uhlenbeck–Poisson (GOUP) model. It is possible to perform a maximum likelihood analysis under the GOUP model [e.g., Chan and Ledolter (1995)]. However, such an analysis can be computationally demanding because of the serial correlation, and the resulting inference may be sensitive to the distributional assumptions involved. The methods to be considered for our marginal analysis will not require the full strength of the GOUP model, and they may or may not involve the distributional assumptions in the GOUP model. Some of the methods we consider do incorporate such distributional assumptions into estimating equations through a working covariance matrix, in which case we also assess the robustness of the resulting inference against misspecification of the correlation structure. Our goal is to make valid and reasonably efficient inference on and in model (1) with minimal dependence on the distributional assumptions in the GOUP model.

Without the distributional assumptions, model (2) does play a crucial role in this paper, as a way to conceptualize the different sources of variability. We assume that the different subjects are independent of each other and that the random effects and and the random process are independent of each other within each subject. It follows that the () are conditionally independent given and . Another important implication is that

(4)

where

(5)

This shows that the correlation due to can be removed by treating subject as a fixed effect. Obviously, this approach would not work in the usual asymptotic theory assuming a large number of subjects and a limited number of observations per subject. However, it might be appropriate when the number of subjects is small and the number of observations per subject is large as in the NTDS. Note that the subject-specific model (4) does not directly involve , which has been absorbed into the subject-specific intercept ; thus,  cannot be estimated directly by fitting model (4). Nonetheless, equation (5) suggests that once has been estimated, say, by , it should be possible to estimate from a subject-level linear model regressing on .

3 Standard GEE methods

3.1 GEE with working independence

Assuming working independence, a standard GEE analysis can be performed with or without fixed subject effects (FSE), using either the robust variance estimate or the model-based variance estimate. This gives rise to four possible methods for estimating . As discussed earlier, is not directly estimable from a GEE analysis with FSE but can be recovered from a subsequent linear regression analysis based on (5). Let denote the GEE estimate of , and assume that is approximately unbiased, that is, . This, together with (5), implies that

An approximately unbiased estimate of can then be obtained from a subject-level linear model regressing on . A standard least squares (LS) algorithm can be used to fit this model if the can be treated as independent. In general, the are correlated due to correlated errors in GEE estimation, even though the are indeed independent of each other. This correlation can be taken into account using an iteratively reweighted least squares (IRLS) algorithm described in Appendix A. Thus, under working independence,  can also be estimated using four different methods (LS and IRLS with FSE, robust and model-based without FSE).

These methods are evaluated and compared in a simulation study mimicking the NTDS. Specifically, each simulated data set consists of 40 subjects and 60,000 trips (1500 per subject). The study duration is rescaled to the unit interval, over which the trips are uniformly distributed. The offset  follows a normal distribution with mean 1 and variance 1,  is generated as a Bernoulli variable with success probability 0.5, and  is taken to be the trip time (i.e., ). Given the covariates, the outcome  is generated according to the GOUP model described in Section 2, with and or 300, corresponding to longer- and shorter-lived serial correlation, respectively. These values are based on the NTDS data (see Section 5) and the wide range for reflects a large amount of variability in its estimation (see Section 3.2). We set for simplicity because the results change little over a range of realistic values for these parameters. We choose in (2) such that the marginal mean of  equals a specified value (0.1 or 1). This range for covers most kinematic measures in the NTDS with varying thresholds. Some measures are associated with larger counts, which are generally easier to deal with. In each scenario (combination of parameter values), 1,000 replicate samples are generated and analyzed using the methods described in the preceding paragraph.

Mean Serial Include Further Emp. Emp. Median Emp.
count correlation FSE? option bias SD SE CP
1 Short No Robust 0.39 0.33 0.90
No Model-based 0.40 0.03 0.14
Yes LS 0.33 0.32 0.95
Yes IRLS 0.32 0.31 0.95
Long No Robust 0.42 0.34 0.90
No Model-based 0.42 0.03 0.11
Yes LS 0.33 0.33 0.94
Yes IRLS 0.32 0.32 0.95
0.1 Short No Robust 0.41 0.33 0.90
No Model-based 0.40 0.04 0.17
Yes LS 0.35 0.33 0.94
Yes IRLS 0.36 0.32 0.93
Long No Robust 0.42 0.34 0.89
No Model-based 0.40 0.04 0.15
Yes LS 0.35 0.33 0.94
Yes IRLS 0.33 0.32 0.95
Table 1: Estimation of , the effect of a subject-level covariate, using GEE methods with working independence. The methods are described in Section 3.1 and Appendix A, and compared in terms of empirical bias, standard deviation (SD), median standard error (SE) and coverage probability (CP) of intended 95% confidence intervals. Each entry is based on 1,000 replicates

Table 1 compares the four methods for estimating in terms of empirical bias, standard deviation, median standard error and coverage probability of intended 95% confidence intervals. All methods in Table 1 are nearly unbiased, suggesting that bias is not of concern here. In terms of efficiency, the most important factor is clearly the use of FSE, which consistently results in smaller standard deviations. With FSE in the model, one might expect the IRLS estimate of , which accounts for the correlation among the , to be more efficient than the LS estimate. However, their difference seems rather small in Table 1, for two reasons. First, the variability due to estimating the is dominated by the variability in the , resulting in weak correlation among the in this particular example. Second, a referee pointed out that any correlation that does exist among the should be approximately exchangeable, in which case the LS estimate is still efficient. The precision in estimating appears insensitive to the length of the serial correlation (specified through ), although the FSE estimates do seem to become more variable for low counts []. The use of FSE also helps with variance estimation. Without FSE, the model-based variance estimate is clearly disastrous, as expected, and even the robust variance estimate is unsatisfactory, with sub-nominal coverage (90%). Including FSE in the GEE model leads to reasonable variance estimates and nearly correct coverage, under both (LS and IRLS) approaches. Thus, it seems that the key to valid and efficient inference about in similar situations is to include FSE in a GEE analysis followed by a subject-level linear regression analysis for the estimated FSE.

Mean Serial Include Variance Emp. Emp. Median Emp.
count correlation FSE? estimate bias SD SE CP %NA
Working independence
1 Short No Robust 0.12 0.09 0.91 0.0
No Model-based 0.12 0.05 0.67 0.0
Yes Robust 0.12 0.09 0.90 0.0
Yes Model-based 0.12 0.04 0.47 0.0
Long No Robust 0.21 0.16 0.92 0.0
No Model-based 0.21 0.05 0.41 0.0
Yes Robust 0.21 0.16 0.91 0.0
Yes Model-based 0.21 0.04 0.28 0.0
0.1 Short No Robust 0.13 0.10 0.90 0.0
No Model-based 0.13 0.07 0.74 0.0
Yes Robust 0.13 0.10 0.90 0.0
Yes Model-based 0.13 0.06 0.65 0.0
Long No Robust 0.21 0.17 0.90 0.0
No Model-based 0.21 0.07 0.49 0.0
Yes Robust 0.21 0.17 0.91 0.0
Yes Model-based 0.21 0.06 0.42 0.0
Estimated covariance matrix
1 Short Yes Robust 0.07 0.07 0.95 0.0
Yes Model-based 0.07 0.07 0.95 0.0
Long Yes Robust 0.12 0.12 0.94 0.0
Yes Model-based 0.12 0.11 0.90 0.1
0.1 Short Yes Robust 0.09 0.10 0.92 4.6
Yes Model-based 0.10 0.10 0.95 4.9
Long Yes Robust 0.15 0.16 0.90 3.5
Yes Model-based 0.15 0.12 0.87 3.9
Misspecified covariance matrix
1 Varying Yes Robust 0.09 0.09 0.93 0.0
Yes Model-based 0.09 0.08 0.92 0.0
0.1 Varying Yes Robust 0.11 0.12 0.90 3.7
Yes Model-based 0.11 0.11 0.93 2.9
Table 2: Estimation of , the effect of a trip-level covariate, using standard GEE methods. The methods are described in Sections 3.13.3 and compared in terms of empirical bias, standard deviation (SD), median standard error (SE), coverage probability (CP) of intended 95% confidence intervals and the percentage of estimates that are not available (NA) due to numerical problems such as noninvertible matrices. In each scenario, 1,000 replicates are generated, and the NA estimates are counted and then excluded in calculating the other summary statistics

Estimation of is a different story, as shown in the top section of Table 2. As in the case of estimating , bias is not a major issue in estimating . The precision in estimating depends mostly on the length of the serial correlation, with better precision for shorter-lived serial correlation, and seems insensitive to other factors (e.g., FSE, mean count). This is clearly different from the situation in Table 1, and an intuitive explanation is the following. In general, estimating the effect of a subject-level covariate is essentially comparing one group of subjects with another, while estimating the effect of a trip-level covariate is essentially comparing one group of trips with another group of trips by the same subjects. With , estimation of is basically a comparison of later trips with earlier trips, and it seems natural that serial correlation has a larger impact on this comparison than on a comparison of boys with girls, say. The latter comparison, on the other hand, is more likely to benefit from the use of FSE to separate the relevant information (i.e., overall incidence rates of individual drivers) from the noise (i.e., within-subject variation). In Table 2, none of the four methods is satisfactory in terms of coverage, though the robust variance estimate performs better than the model-based one. Several alternatives to the robust variance estimate have also been explored with little success (see Section 6).

3.2 GEE based on an estimated covariance matrix

We now consider GEE methods incorporating the covariance matrix for the . Under the GOUP model described in Section 2, it is straightforward to show, by appealing to well-known properties of normal and Poisson distributions, that

(6)
(8)

These expressions provide the marginal covariance matrix relevant in a GEE analysis without FSE. With FSE in the model, we need to condition on and the relevant formulas become

(9)
(11)

with defined in Section 2. Note that is not involved in the covariance matrix in a GEE analysis with FSE. Unknown parameters in the covariance matrix (, , and possibly ) can be estimated by applying moment methods and nonlinear regression techniques to residuals from a preliminary GEE analysis with working independence (see Appendix B for details). This preliminary GEE analysis can be performed with or without FSE, regardless of the primary GEE analysis for estimating . With FSE in the preliminary GEE analysis, the aforementioned techniques do not provide an estimate of . If desired, an estimate of  can be obtained from the IRLS estimation of  or simply as the error variance in the LS analysis, ignoring the fact that is an error-prone estimate of (see Section 3.1 and Appendix A for details). Thus, the covariance matrix can be estimated using three methods: FSE-LS, FSE-IRLS and no FSE, with the first two differing only in the estimate of .

Serial correlation Emp. bias Emp. SD
Method % NA
Short FSE-LS 1.1E01 0.23 0.10 0.11 1.4E02
 () FSE-IRLS 1.1E01 0.23 0.10 0.11 1.4E02
No FSE 4.1E02 0.25 0.29 0.22 5.9E03
Long FSE-LS 2.0E01 0.24 0.07 0.08 4.9E01
 () FSE-IRLS 2.0E01 0.24 0.07 0.08 4.9E01
No FSE 1.8E02 0.26 0.40 0.17 2.1E03
Short FSE-LS 1.2E02 0.23 0.21 0.23 2.9E03
 () FSE-IRLS 1.2E02 0.23 0.21 0.23 2.9E03
No FSE 2.6E02 0.23 0.24 0.21 4.5E03
Long FSE-LS 3.4E01 0.25 0.10 0.21 8.5E01
 () FSE-IRLS 3.4E01 0.24 0.10 0.21 8.5E01
No FSE 2.1E02 0.28 0.97 0.20 1.5E03
Short FSE-LS 6.3E02 2.97 0.51 1.18 4.1E03
 () FSE-IRLS 6.3E02 2.97 0.51 1.18 4.1E03
No FSE 6.5E02 0.35 0.39 0.39 3.6E03
Long FSE-LS 1.0E03 1.65 0.51 1.17 1.4E04
 () FSE-IRLS 1.0E03 1.24 0.51 1.17 1.4E04
No FSE 2.2E03 0.33 0.58 0.35 3.3E04
Table 3: Estimation of parameters in the covariance matrix using moment and nonlinear regression methods. The methods are described in Appendix B and compared in terms of empirical bias, standard deviation (SD) and the percentage of estimates that are not available (NA) due to convergence failure in nonlinear regression. In each scenario, 1,000 replicates are generated, and the NA estimates are counted and then excluded in calculating the other summary statistics

Table 3 shows the performance of the above three methods for estimating parameters in the covariance structure, in terms of bias, standard deviation and convergence. The methods are evaluated in the same scenarios as the previous simulation experiments, with the addition of a higher mean count (10) to show a more complete picture. For a mean count of 10, the FSE methods are nearly unbiased for the variance components but biased for in a way that results in underestimation of the length of the serial correlation. The bias for is larger for longer-lived serial correlation. The no FSE method is more biased for (in the same direction) and even visibly biased for . With decreasing counts, the bias problem becomes more severe, especially for , to the extent that estimation of the serial correlation is meaningless for a mean count of . The problem is compounded by a large amount of variability in estimating . Another issue with low counts is convergence in the nonlinear regression. For a mean count of , the convergence failure rates are approximately 8% and 13% (with short- and long-lived serial correlation, resp.) for the FSE methods and even worse for the no FSE method. The FSE methods appear more reasonable than the no FSE method, and we will use FSE-LS, which is simpler than FSE-IRLS, in the subsequent simulations.

GEE methods based on covariance matrices estimated through FSE-LS are evaluated in the same setting described in Section 3.1. When the nonlinear regression in FSE-LS fails to converge, the data-driven initial values will be taken as the final estimates [see equation (17) in Appendix B]. With thousands of trips per subject, inverting a covariance matrix is time-consuming, and iterating until convergence in the usual Fisher scoring algorithm [Fitzmaurice et al. (2008), Section 3.2.4] would be prohibitive. We therefore settle for a one-step estimate of the regression coefficients based on just one iteration in the Fisher scoring algorithm. Specifically, we start with a GEE analysis with working independence, use the residuals to estimate parameters in the covariance matrix, and update the regression coefficients just once using the estimated covariance matrix. Under this approach, it is necessary to include FSE in each GEE analysis; otherwise meaningful simulation results cannot be produced. This is not surprising because inversion of large covariance matrices can be difficult when the correlation is strong (without FSE) and poorly estimated. This is not a serious limitation because the use of FSE leads to better efficiency and coverage anyway, which we have observed in simulations where the true covariance matrix is used in GEE (results not shown). Once FSE are included, numerical problems become much less frequent (3–5% for a mean count of 0.1, 1% for higher counts).

We have found that incorporating the covariance matrix into the estimating equation does little to improve the estimation of in terms of efficiency and coverage. We therefore omit the results for estimating and henceforth focus on estimating . The results for the latter, reported in the middle section of Table 2, show that better efficiency is attained using an estimated covariance matrix than under working independence. In terms of coverage, the robust variance estimate performs well for higher counts but not for lower ones, especially when the serial correlation is long-lived. The model-based variance estimate performs well for short-lived serial correlation and poorly for long-lived serial correlation. The former case shows that underestimating the length of the serial correlation has little effect when the serial correlation is already negligibly short. Although not shown in Table 2, the model-based variance estimate has been found to work well in all cases if the true covariance matrix is used. Thus, the root of the problem appears to be the difficulty in estimating .

3.3 GEE based on a misspecified covariance matrix

It is natural to ask how the GEE methods based on the GOUP model would perform if the model is misspecified. For the NTDS, the GOUP model seems plausible for the most part. It is possible, however, that the length of the serial correlation, which is assumed constant, might change over time. To formalize this notion of changing length, note that the covariance structure given by (3) can be generalized into

(12)

for a function . For a constant , this reduces to the original GOUP model. In the next set of simulation experiments, we will generate data using the generalized GOUP model with taken to be a straight line from to . Thus, instead of working exclusively with short- or long-lived serial correlation, we allow it to be short-lived at the beginning and to gradually become long-lived at the end. This is motivated by the conjecture that teenage driving will tend to be haphazard at the beginning and will become more consistent over time. Unfortunately, it appears difficult to verify this conjecture directly using the data. Indeed, even when  is a constant, we have seen in Table 3 that it can be really difficult to estimate, especially when the counts are low. It seems reasonable to expect that a time-varying will be even more difficult to estimate. We therefore take a sensitivity analysis approach to this issue. The bottom section of Table 2 shows the performance in the present setting of GEE methods based on the original GOUP model. The point estimates are more (or less) efficient in this setting than if the serial correlation is consistently long-lived (or short-lived). The coverage property of the model-based variance estimate is also intermediate between the two extremes. As before, the robust variance estimate works well for large counts but not for small ones.

3.4 Summary

To summarize, valid inference on can be made through a GEE analysis with FSE (regardless of the covariance matrix) followed by a linear regression analysis, while inference on requires more work. For large counts, valid inference on can be made by estimating the covariance matrix and using the robust variance estimate; in this case, the inference seems robust with respect to the serial correlation structure. For small counts, the robust variance estimate is unsatisfactory and the model-based variance estimate performs well only for short-lived serial correlation. Because the serial correlation is difficult to estimate with small counts, one would not know in practice whether the model-based variance estimate is appropriate for a particular application. Therefore, it is necessary to develop methods that perform well (or better) for small counts with short- and long-lived serial correlation. This is the focus of the next section, where we explore WCR-type methods.

4 WCR and extensions

4.1 The standard WCR approach

WCR is a resampling approach for marginal analysis of clustered data, originally designed to deal with informative cluster sizes [Hoffman, Sen and Weinberg (2001), Follmann, Proschan and Leifer (2003)]. In its original form, the approach can be applied to the present problem as follows. Suppose we sample a trip from each subject randomly and form a subset of nonclustered data

(13)

where is uniformly distributed on , independently across . There is no correlation in the subsample (13), which can therefore be analyzed using a quasi-Poisson model with mean structure given by (1). This can be repeated a large number (say, ) of times to yield , where denotes the point estimate of based on the th subsample, and is the associated variance estimate. Then the WCR estimate of  is given by

(14)

and the associated variance estimate is

(15)

The first term on the right-hand side of (15) measures the variability within subsamples, while the second term measures the variability between subsamples.

In general, for any WCR method to work, it is essential that valid estimates be obtained from each subsample. Note that the estimates  are identically distributed. If is biased for , then given by (14) is also biased by the same amount. Similarly, if underestimates the sampling variance of , then given by (15) will underestimate the sampling variance of by a similar amount, because the second term on the right-hand side of (15) will be approximately unbiased for large . Furthermore, in terms of relative bias, the underestimation problem of is generally worse than that of , simply because the true variance of is smaller than that of . In light of these observations, we will study WCR methods in two steps, always considering a single subsample before moving to repeated sampling.

4.2 WCR with simple random sampling (WCR-SRS)

There is no reason why a WCR subsample must consist of exactly one observation from each subject. In fact, to fit a model with FSE, there has to be more than one observation from each subject. Without FSE, the standard WCR approach is feasible but, as we shall see, does not work well. We therefore consider an extended WCR approach where a simple random sample (SRS) is taken from each subject to form an WCR subsample. Thus, instead of a single number , we take an SRS from the index set for each . The resulting subsample is

for which a standard GEE analysis can be performed, under working independence or using an estimated covariance matrix, with or without FSE, using the robust or model-based variance estimate. This can be repeated a large number of times, and the resulting estimates can be combined using (14) and (15). If the covariance matrix for the is used, it will be estimated once from the full sample and the estimate will be applied to all subsamples.

Serial Include Variance Emp. Emp. Median Emp.
correlation FSE? estimate bias SD SE CP %NA
Working independence
Short No Robust 1.59 0.64 0.4
0.99 0.86 0.0
0.52 0.91 0.0
0.28 0.91 0.0
0.15 0.92 0.0
0.00 0.16 0.0
Yes Robust 1.16 0.87 0.2
0.50 0.87 0.1
0.28 0.90 0.1
0.14 0.91 0.0
0.00 0.18 0.0
Long No Robust 1.64 0.64 0.8
1.00 0.89 0.0
0.53 0.89 0.0
0.31 0.92 0.0
0.20 0.91 0.0
0.10 0.52 0.0
Yes Robust 1.13 0.85 0.1
0.52 0.89 0.1
0.30 0.90 0.0
0.20 0.90 0.0
0.10 0.56 0.0
Table 4: Estimation of , the effect of a trip-level covariate, using the WCR-SRS approach with different values of (number of trips to sample from each subject) and (number of repetitions in WCR). The methods are described in Sections 4.1 and 4.2 and compared in terms of empirical bias, standard deviation (SD), median standard error (SE), coverage probability (CP) of intended 95% confidence intervals and the percentage of estimates that are not available (NA). The mean count is fixed at 0.1. In each scenario, 1,000 replicates are generated, and the NA estimates are counted and then excluded in calculating the other summary statistics
Serial Include Variance Emp. Emp. Median Emp.
correlation FSE? estimate bias SD SE CP %NA
Estimated covariance matrix
Short Yes Robust 1.44 0.89 0.75 0.4
0.59 0.48 0.85 1.4
0.30 0.29 0.91 3.3
0.15 0.15 0.92 4.6
0.09 0.21 0.70 2.6
Model-based 1.49 1.19 0.91 0.2
0.58 0.55 0.94 1.1
0.29 0.30 0.94 2.2
0.15 0.15 0.97 4.1
0.10 0.12 0.68 1.4
Long Yes Robust 1.48 0.93 0.76 0.4
0.61 0.51 0.87 1.2
0.31 0.30 0.90 2.3
0.18 0.19 0.90 3.1
0.15 0.24 0.69 2.8
Model-based 1.67 1.22 0.92 0.4
0.59 0.54 0.93 1.9
0.32 0.31 0.94 2.4
0.18 0.17 0.92 3.2
0.16 0.16 0.72 2.6
Table 4: (Continued)

This WCR-SRS approach is evaluated in Table 4. Given the findings in the last section, this evaluation is focused on small counts. Under working independence, only the robust variance estimate is considered because the model-based variance estimate is clearly invalid. As before, no meaningful results are available when an estimated covariance matrix is used without FSE. For each method, we begin by analyzing a single subsample, with and different values of (size of the SRS from each subject). Table 4 shows poor performance for (i.e., standard WCR), apparently due to erratic estimates, which are likely to arise when the subsample is very small. Even for , the point estimate is visibly biased in some cases, probably because the subsample is still small. The coverage probability generally increases with , at least for . The best coverage is seen when an estimated covariance matrix is used together with FSE and the model-based variance estimate. As explained in Section 3.2, this method tends to work well when the serial correlation is weak, which is precisely what happens to a randomly selected subset of trips, which tend to be farther apart from each other than in the full sample. This explains why better coverage is seen here than in the corresponding portion of Table 2. Of course, this trend will be reversed when the number of selected trips gets too large, which explains the slight drop in coverage probability when increases further to 500. Overall, it appears that would be a reasonable choice in terms of bias and coverage. Efficiency is not a major consideration here because the efficiency loss due to can be recovered through repeated sampling. Table 4 also shows the performance of the WCR-SRS method with and , whose efficiency levels appear similar to the corresponding results in Table 2, as expected. For the working independence methods, the WCR variance estimate is seriously biased downward and the coverage probability is far below the nominal level, confirming the conjecture in Section 4.1 that the variance underestimation problem of  is generally worse than that of in terms of relative bias. For methods based on an estimated covariance matrix, the WCR variance estimate is not necessarily biased downward but the coverage properties are not good. This behavior is probably due to erratic estimates which arise frequently in repeated analyses of subsamples when a poorly estimated covariance matrix is used.

Table 4 does suggest a potential benefit of WCR, that is, the ability to produce a sparse subset of trips with weaker serial correlation. This advantage can be exploited further by selecting trips that are separated from each other by a specified amount. Specifically, for a given positive integer , we can choose a trip randomly among the first trips and every st trip from there on, until the end of the sequence. This can be done for each subject and the process can be repeated many times as before. This approach does lead to good coverage in single outputation when an estimated covariance matrix is used together with FSE and the model-based variance estimate (results not shown). However, upon moving to WCR, this approach exhibits the same strange behavior as seen in Table 4 (results not shown), probably for the same reason.

4.3 WCR with separated blocks (WCR-SB)

The previous experiments with WCR show that variance underestimation can be a serious problem, especially when the variance of is much smaller than that of . The latter difference can be reduced by including more observations in each subsample, and the question then becomes how to ensure reasonable performance of the variance estimate based on a subsample. To this end, we propose the following WCR approach involving separated blocks. With FSE in the model, the only source of correlation is serial correlation, which is weaker for trips farther apart. Thus, with an appropriate amount of separation, trips from the same subject can be treated as approximately independent. Let denote the block size (i.e., the number of consecutive trips to be analyzed together as a block) and the separation (i.e., the number of trips used to separate the blocks). For a given subject, one possible set of separated blocks can be formed by taking the first trips, skipping the next trips, taking the next , skipping the next , and so on. Each group of trips sampled together will be treated as a block and the different blocks will be treated as independent in the subsequent GEE analysis. A random shift can be added to this sampling process, and the random sampling process can be repeated many times as before.

Preliminary simulation results, as well as those in Table 4, suggest that a reasonable block size would be . Simple calculations based on (9) and (11) show that, with , the correlation between trips in different blocks is typically well below 0.05 in all scenarios considered in this paper. With these choices for and , the WCR-SB approach is evaluated through simulations and the results are presented in Table 5.

Serial Emp. Emp. Median Emp.
correlation bias SD SE CP %NA
Short 0.00 0.15 0.13 0.94 0.0
0.00 0.12 0.10 0.92 0.0
Long 0.00 0.22 0.19 0.92 0.0
0.01 0.20 0.17 0.92 0.0
Varying 0.00 0.18 0.15 0.92 0.1
0.00 0.15 0.13 0.92 0.0
Table 5: Estimation of , the effect of a trip-level covariate, using the WCR-SB approach with working independence and the robust variance estimate, implemented with (block size), (separation) and (single subsample) or 50. The method is evaluated in terms of empirical bias, standard deviation (SD), median standard error (SE), coverage probability (CP) of intended 95% confidence intervals and the percentage of estimates that are not available (NA). The mean count is fixed at 0.1. In each scenario, 1,000 replicates are generated, and the NA estimates are counted and then excluded in calculating the other summary statistics

We focus on working independence to avoid numerical stability issues, and only consider the robust variance estimate. The method appears to have better coverage when applied to a single set of separated blocks than to the full sample (Table 2), apparently owing to a larger number of approximately independent clusters (400 vs. 40). Note that the efficiency based on a single outputation is quite close to that in a full sample analysis. Considering this and the computational demand, the WCR version is implemented with repetitions, and the resulting WCR-SB estimates are virtually as efficient as the full sample estimates. In the case of long-lived serial correlation, better coverage (92%) is achieved here using the working independence method than in any of the previous simulations. The improvement (2% over Table 2) is not dramatic but still substantial, and similar phenomena have been observed consistently in other simulation experiments under similar scenarios. Table 5 also shows the performance of the WCR-SB method when the serial correlation is generated according to (12) rather than (3). The only noticeable effect of this change is an intermediate level of efficiency (between the extreme cases of short- and long-lived serial correlation).

5 Analysis of NTDS data

Our analysis of the NTDS data concerns the following elevated -force events: rapid start (longitudinal ), hard stop (longitudinal ), hard left/right turn (lateral deceleration/), yaw (5 degrees within 3 seconds), as well as a composite measure defined as the totality of these 5 types of events. Yaw is a measure of correction after a turn and is calculated as the absolute change in angle between an initial turn and the correction. For each specific -force event, the threshold (in parenthesis) is chosen to allow meaningful differences between drivers and different driving conditions to be revealed and estimated with a sufficient number of events. The mean counts based on the chosen thresholds are 0.13 (rapid start), 0.18 (hard stop), 0.20 (hard left turn), 0.15 (hard right turn), 0.04 (yaw) and 0.70 (composite measure). The mean count for yaw is relatively low, perhaps too low for the methods discussed in this paper, but the corresponding threshold is already the lowest for which we have data available. This limitation should be kept in mind when interpreting the results.

From the viewpoint of behavior science, it is reasonable to expect some serial correlation because risky driving may be related to the weather and the driver’s mental and physical conditions (e.g., mood and fatigue). All of these factors could result in serial correlation unless they are all included in the model, which is impossible in the NTDS due to the lack of this information. To explore the nature of the serial correlation in the NTDS data, a marginal model with FSE and all relevant covariates (to be described later) is fit for the composite measure under working independence. Pairwise products of standardized residuals are calculated for consecutive trips (because the number of all possible pairs is too large). After sorting by gap time, these pairs are grouped into 100 bins for further reduction. Within each bin, we calculate the median gap time and the mean product of standardized residuals, and the results are plotted in Figure 2 together

Figure 2: Serial correlation for the composite measure in the NTDS. Pairwise products of standardized residuals are calculated for consecutive trips and, after sorting by gap time, grouped into 100 bins. Each circle represents the median gap time and the mean product of standardized residuals in a bin, and the solid line is a lowess smooth.

with a lowess smooth. Figure 2 suggests that some serial correlation is present in the data, although a precise characterization of the serial correlation seems difficult. Under the GOUP model and using the techniques described in Appendix B, the estimated variance components are generally close to 1 (for the composite measure and other -force events), while the estimate of varies wildly and depends heavily on the initial value.

Figure 3: Incidence rates of various -force events (rapid start, hard stop, hard left/right turn, yaw, and the composite measure) as functions of time since licensure, estimated using regression splines (one node for every 3 months).

Figure 3 presents the estimated incidence rate (IR) per 100 miles of each -force event as a function of time since licensure. The IR is defined by (1) with replaced by 100, empty, and consisting of indicators of calendar month and a set of regression splines for (one knot for every 3 months). Calendar month is included in the model to adjust for a possible seasonal effect, which may confound with the effect of time since licensure because the enrollment of subjects was not uniform with respect to calendar month, with more subjects enrolled in the fall and fewer in the spring. The estimates shown in Figure 3 represent geometric means of the monthly estimates. Note that the IR considered here is a marginal quantity which involves the marginal intercept and which should therefore be estimated in a model without FSE. The estimates in Figure 3 are obtained under working independence, because the GOUP model without FSE is problematic to fit and its performance is not well understood (see Section 3.2). Under the regression spline model, there is some ambiguity as to whether the IR for the composite measure changes over time. The -value for the regression splines is 0.49 based on the robust variance estimate under working independence, although the validity of this test may be questionable (see Section 3). Under the GOUP model with FSE, the IR for the composite measure does appear to change over time ( for the robust variance estimate, for the model-based variance estimate). The latter observation is in contrast to a previous report that risky driving largely remains constant [Simons-Morton et al. (2011a)], and the difference is probably due to the lower thresholds used here, which lead to more events and perhaps more relevant information. Despite the ambiguity about statistical significance, the main conclusion from our marginal analysis is fairly consistent with the findings of Simons-Morton et al. (2011a). The latter reference reports IRs for adults/parents that are consistently and substantially lower than those for novice teenagers driving the same vehicles. In Figure 3, the estimated IR of the composite measure for teenage drivers increases over the first 6 months, declines over the next few months, and then increases again, suggesting the establishment of a risky driving style. This general impression is reinforced by the plots for the specific -force events, which show some temporal changes but no clear trends.

As noted by a referee, some instances of risky driving (e.g., mistakes due to the lack of skills) may depend more on the amount of practice, which can be quantified by accumulated mileage, than on time since licensure. This raises the question of which measure is more appropriate to adjust for. Accumulated mileage is a measure of driving experience, while time since licensure reflects both experience and maturity. From a public health perspective, we believe that time since licensure is more relevant to policy makers. Nonetheless, all analyses in this section have been repeated with time since licensure replaced by accumulated mileage, and the results are very similar and therefore omitted. With regard to Figure 3, this similarity suggests that persistent risky driving among teenage drivers is not only due to inexperience but also to immaturity (i.e., aspects of adolescent development that motivate novice young drivers to seek excitement and under-recognize risks).

Each -force event is further analyzed in a larger model that simultaneously adjusts for the following covariates: driver’s gender, risky friends, time of day, passenger condition, calendar month, and time since licensure. The risky behavior of a teenage driver’s friends is assessed using a 7-item index asking “How many of your friends would you estimate… smoke cigarettes, drink alcohol, get drunk at least once a week, use marijuana, drive after having two or more drinks in the previous hour, exceed speed limits, and do not use safety belts (none, a few, some, most, all).” The assessment was made at 4 time points (baseline, 6, 12 and 18 months), the 4 scores were averaged for each driver, and the average score was then dichotomized according to the median split among all drivers in the study. Night driving was determined from video data by visual observation of the ambient natural lighting at the start of the trip. Late night was defined as 10 pm to 6 am, and night trips that did not start at late night were considered early night trips. The presence and relative age (adult or teen) of each passenger was determined by the coders examining the video data. In summary, gender and risky friends enter the marginal model as subject-level binary covariates, time of day is a trip-level covariate with 3 levels (day, early night and late night), passenger condition is also a trip-level covariate with 3 levels (none, teens, at least one adult), calendar month is a trip-level covariate with 12 levels, and time since licensure is represented by the same set of regression splines as before. No signs of model misspecification are observed in residual plots (not shown).

Table 6 presents estimates of incidence rate ratios (IRR) that quantify the association of each -force event with the aforementioned covariates (except calendar month and time since licensure). An IRR is just the result of exponentiating the corresponding regression coefficient in the marginal model. The estimates are obtained using standard GEE methods (working independence or the GOUP model, robust or model-based variance estimate) as well as the proposed WCR-SB method (working independence, , , ). All of these methods include FSE. In a sensitivity analysis, a different separation size () for the WCR-SB method is used to produce similar results, which are therefore omitted. Since the point estimates depend primarily on the working correlation structure, only 2 sets of point estimates are shown in Table 6. For the subject-level covariates, the LS method is used to obtain 2 sets of 95% confidence intervals (for working independence and the GOUP model). For the trip-level covariates, Table 6 presents 4 sets of 95% confidence intervals corresponding to the robust variance estimate and the WCR-SB method under working independence as well as the robust and model-based variance estimates under the GOUP model. Table 6 contains some unrealistically large values (100, labeled as ) for the IRRs associating risky friends and gender with yaw and rapid start, probably due to very few events in the subgroup of teenage drivers acting as the denominator.

GOUP model Working independence
Covariate Comparison IRR Model-based Robust IRR Robust WCR-SB
Composite measure ()
Gender Male vs. female
Risky friends More vs. fewer
Time of day Early night vs. day
Late night vs. day
Passengers Teen vs. none
Adult vs. none
Rapid start ()
Gender Male vs. female
Risky friends More vs. fewer
Time of day Early night vs. day
Late night vs. day
Passengers Teen vs. none
Adult vs. none
Hard stop ()
Gender Male vs. female
Risky friends More vs. fewer
Time of day Early night vs. day
Late night vs. day
Passengers Teen vs. none
Adult vs. none
Table 6: Analysis of NTDS data using standard GEE methods (working independence or the GOUP model, robust or model-based variance estimate) and the proposed WCR-SB method (working independence, , , ). Each outcome variable (rapid start, hard stop, hard left/right turn, and a composite measure) is analyzed separately in a marginal model that simultaneously adjusts for gender, risky friends, time of day, passenger condition, calendar month1 and time since licensure (the last two as confounders). The results are summarized in terms of incidence rate ratios (IRRs) and 95% confidence intervals. The symbol denotes a value that is unrealistically large (100)
GOUP model Working independence
Covariate Comparison IRR Model-based Robust IRR Robust WCR-SB
Hard left turn ()
Gender Male vs. female 1.36 1.29
Risky friends More vs. fewer 2.41 2.20
Time of day Early night vs. day 0.53 0.64
Late night vs. day 0.78 1.00
Passengers Teen vs. none 0.81 0.76
Adult vs. none 0.38 0.32
Hard right turn ()
Gender Male vs. female 1.39 1.31
Risky friends More vs. fewer 2.04 1.92
Time of day Early night vs. day 0.63 0.69
Late night vs. day 0.83 0.92
Passengers Teen vs. none 0.64 0.62
Adult vs. none 0.26 0.22
Yaw ()
Gender Male vs. female 0.07 0.07
Risky friends More vs. fewer
Time of day Early night vs. day 0.92 0.94
Late night vs. day 0.88 0.92
Passengers Teen vs. none 1.19 1.17
Adult vs. none 0.54 0.44
Table 6: (Continued)

Despite some numerical differences between the different methods, the results in Table 6 indicate clearly that teenage risky driving is not associated with gender, positively associated with risky friends, negatively associated with early night, and negatively associated with the presence of passengers (more strongly for adults than for teens). The evidence is not conclusive with regard to late night, whose association with risky driving is significant in some cases but not in others. The association of risky friends with risky driving points to potentially destructive effects of risk-accepting social norms and social identity on risky behavior. Both the perception and the actuality of having risk-taking friends could contribute to the perceived acceptability of risky behavior. The association of early night with risky driving suggests that teens do recognize the danger of night driving (at least partially) and drive more carefully (or rather, less carelessly) at night. There is no contradiction between this finding and the apparent ambiguity about late night because late night trips often take place under unusual circumstances. Finally, it is worth noting that passengers, especially adult passengers, appear protective with respect to risky driving. Further research is warranted to confirm, better characterize and utilize such protective effects.

6 Discussion

Even with a great variety of methods available for longitudinal data analysis, it can still be difficult to analyze longitudinal data in long sequences, especially when the serial correlation is strong and long-lived. In this paper, we examine standard GEE methods and propose new ones for marginal analysis of longitudinal count data in a small number of very long sequences. The methods are evaluated and compared in simulation experiments mimicking the NTDS, and the main findings can be summarized as follows. We consider the use of FSE in this particular situation, a simple technique with important practical implications. It allows the effects of subject-level covariates to be estimated easily from a linear regression analysis for the estimated FSE, and it also helps with trip-level covariates by removing the correlation due to population heterogeneity. For trip-level covariates, we find that a standard GEE analysis under working independence can lead to inefficient estimates and serious undercoverage, and that both problems can be alleviated by incorporating a properly specified correlation structure. The latter approach works well for large counts but not for small counts, mainly because the serial correlation is hard to estimate with small counts. We therefore explore an alternative approach (WCR) for the case of small counts. The original version of WCR and an extension to simple random sampling seem unsatisfactory, however, because of numerical instability in repeated analyses of small samples and a bias magnification effect of WCR that results in variance underestimation. To address these issues, we propose an WCR-SB approach that involves separated blocks and that performs better than all of the previously considered methods (WCR and standard GEE).

In Table 6 the WCR-SB analyses are not dramatically different from the other analyses. This is reassuring to the scientists, but it also suggests that the NTDS data set is not ideal for demonstrating the advantage of the WCR-SB approach. To put the latter point in perspective, we note that better performance in the frequentist sense does not imply better results in every realization. The performance of the WCR-SB approach has been evaluated in simulation experiments which are, in fact, designed to mimic the NTDS. Most of the uncertainty in designing the simulation experiments is associated with the length of the serial correlation, which is very difficult to estimate with small counts, as shown in Table 3. It is certainly possible that the serial correlation in the NTDS is not sufficiently long-lived for the new method to make a material difference. In that case, a better example than the NTDS might be a proposed follow-up study that involves 100 teenage drivers to be followed for at least 3 years. The larger amount of data from the latter study will afford a better understanding of both the nature and the length of the serial correlation. Furthermore, as teenage drivers gain experience and perspective over time, their driving behavior may become less haphazard and more stable, in which case the serial correlation will gradually become longer-lived toward the end of the (longer) study duration. Thus, in terms of the length of the serial correlation, the follow-up study will provide a better opportunity than the NTDS to demonstrate the advantage of the WCR-SB approach.

The WCR-SB approach is designed for longitudinal data in a small number of long sequences. The strength of this approach relative to standard GEE methods is the ability to handle strong and long-lived serial correlation when the mean count is low. The weaknesses of the WCR-SB approach include increased computational demand (relative to standard GEE methods) and the need for information about the length of the serial correlation (in order to specify the separation size ). Because can be very difficult to estimate, one may need to consult the subject-matter scientist or perform a sensitivity analysis with different values of , as we did in analyzing the NTDS data (see Section 5).

As mentioned earlier at the end of Section 3.1, we have actually explored several existing alternatives to the robust variance estimate in the present situation. No appreciable improvement has been achieved using a standard bootstrap procedure (i.e., sampling subjects with replacement) and jackknife methods [Paik (1988), Lipsitz, Laird and Harrington (1990)]. A block bootstrap procedure [Künsch (1989)] appears to work well for short-lived serial correlation but not for long-lived serial correlation. In a simple setting that permits closed-form calculations of the finite-sample target of the robust variance estimate (with sample quantities replaced by their population counterparts) under working independence, we have found that the target can be far below the sampling variance of the point estimate observed in simulations, suggesting that the asymptotic theory may not provide a reasonable approximation in this situation. Given that, it seems unlikely that a substantial improvement can be made using the available bias correction methods [e.g., Mancl and DeRouen (2001)]. Another possible approach to variance estimation is window subsampling [Sherman (1996), Heagerty and Lumley (2000), Oman et al. (2007)]. A practical difficulty with this approach is specification of the window size, which has been found to have a large impact on the variance estimate. Heagerty and Lumley (2000) have suggested taking the maximum among variance estimates based on several window sizes, but it remains unclear how to choose the set of window sizes for the maximization. Maximizing over a large set of window sizes can lead to a very conservative variance estimate, as the maximum among several underestimators need not be an underestimator itself.

Under the WCR approach, we have considered various sampling schemes based on time. An interesting possibility, suggested by a referee, is to sample trips using a mechanism that involves other covariates than time and possibly the outcome. In the latter case, appropriate adjustments will be necessary to account for the outcome-dependent nature of the subsample. Further research is needed to explore the potential benefits of the more sophisticated sampling schemes.

This article has been focused on valid inference in the sense of (nearly) correct coverage. We have not discussed the issue of generalization from a sample of subjects to the target population, which is always important and especially so when the number of subjects is small. Such generalizations should be easier to justify when the within-subject variability is large relative to the between-subject variability, which appears to be the case in the NTDS.

Appendix A IRLS estimation of following a GEE analysis with FSE

Write and , and let denote the variance estimate from the GEE analysis. As the notation indicates, estimates the conditional variance because the GEE model with FSE is conditional on . The marginal variance of is given by

where denotes the identity matrix. A natural estimate of can be obtained as

(16)

provided a reasonable estimate is available. This suggests the following IRLS algorithm. We start by obtaining an initial estimate of , say, the LS estimate. Then we proceed to the following steps: {longlist}[Step 1.] Step 1. Calculate a moment estimate of , say, the mean squared residual minus the mean diagonal element of ; Step 2. Substitute the estimate of into (16) and re-estimate from a weighted least squares analysis based on the updated . These steps can be iterated a number of times or until convergence.

Appendix B Estimation of parameters in the covariance matrix

Let denote the fitted values from a preliminary GEE analysis without FSE. Then (6) shows that a moment estimate of the sum of , and can be obtained as

where is the total number of observations. Further, equation (8) suggests that , and can be estimated from a nonlinear regression analysis with as the response variable and with mean function

The above quantities could be modified using bias correction adjustments [e.g., Davis, Dunsmuir and Wang (2000), Section 3.2] and/or smoothing techniques prior to the nonlinear regression analysis. However, such modifications have not been found helpful in our experiments. Ideally, the nonlinear regression analysis should include all pairs of trips within subjects. However, this may be impractical for the NTDS data because a subject with thousands of trips would contribute millions of pairs, resulting in too many data points. As a compromise, we propose to base the nonlinear regression analysis on two types of pairs: consecutive trips (with ) and symmetric trips (with ). The pairs of consecutive trips are informative about short-lived serial correlation, while the pairs of symmetric trips help characterize the overall correlation over the entire range of the gap time .

Similar techniques can be used for a preliminary GEE analysis with fixed subject effects. With denoting the fitted values with fixed subject effects, the sum of and can be estimated by

while the separate values of and can be estimated from another nonlinear regression analysis with as the response variable and with mean function

These are justified by (9) and (11), respectively.

Initial values of the unknown parameters are required for fitting the above nonlinear regression models. In our experience, a reasonable way to obtain initial values appears to be the following linear regression method. Consider the case with FSE, and note that equation (11) can be rewritten as