[
Abstract
Microbial ecology serves as a foundation for a wide range of scientific and biomedical studies. Rapidlyevolving highthroughput sequencing technology enables the comprehensive search for microbial biomarkers using longitudinal experiments. Such experiments consist of repeated biological observations from each subject over time and are essential in accounting for the high betweensubject and withinsubject variability.
Unfortunately, many of the statistical tests based on parametric models rely on correctly specifying temporal dependence structure which is unavailable in most microbiome data.
In this paper, we propose an extension of the nonparametric bootstrap method that enables inference on these types longitudinal data. The proposed moving block bootstrap (MBB) method accounts for withinsubject dependency by using overlapping blocks of repeated observations within each subject to draw valid inferences based on approximately pivotal statistics. Our simulation studies show an increase in power compared to mergebysubject (MBS) strategies. We also show that compared to tests that presume independent samples (PIS), our proposed method reduces false microbial biomarker discovery rates.
In this paper, we illustrated the MBB method using three different pregnancy data and an oral microbiome data. We provide an opensource R package bootLong to make our method accessible and the study in this paper reproducible.
Block bootstrap for microbiome data]The Block Bootstrap Method for Longitudinal Microbiome Data Pratheepa Jeganathan et al.]Pratheepa Jeganathan Pratheepa Jeganathan et al.]Susan P. Holmes
1 Introduction
Highthroughput markergene and metagenomics sequencing of environmental DNA provide a costeffective and cultureindependent description of microbial communities (Lozupone et al., 2012; Yatsunenko et al., 2012). The adoption of MGS methods has transformed the study of the complex microbial communities, or “microbiomes”, that inhabit environments ranging from the human body to wastewater treatment plant on the hypersaline desert (Turnbaugh et al., 2007; Gilbert et al., 2010).
Microbial communities vary significantly over time and space, and among subjects, in studies of hostassociated microbiomes (Eckburg et al., 2005; Turnbaugh et al., 2007; Lozupone et al., 2012; Flores et al., 2014; Proctor & Relman, 2017). Microbiome researchers have increasingly adopted longitudinal experimental designs consisting of repeated biological observations from each subject (Schloissnig et al., 2013; Flores et al., 2014; Fukuyama et al., 2017; Proctor et al., 2018) for purposes such as the detection of microbial biomarkers of disease. Longitudinal designs help experimenters overcome some of the difficulties caused by the high temporal and subjecttosubject variability of microbiomes, and allow subjects to be used as their own ‘controls’ (Cook & Ware, 1983; Diggle, 2002).
Currently, the analysis of longitudinal microbiome data is limited by the lack of appropriate statistical methods that account for both the temporal dependency due to repeated observations, and specific characteristics such as sparsity and high dimensionality of microbiome data (Xia & Sun, 2017). These data typically have fewer samples than the number of measured bacteria, and technical variability due to batch effects and unequal sequencing depth of samples can be comparable to the levels of biological variability. Moreover, meanvariance dependencies and variation in withinsubject correlation profiles introduce heteroscedasticity (McMurdie & Holmes, 2014).
The standard statistical analysis in microbiome studies is differential abundance testing where the association of bacterial abundances with two or more experimental states or conditions is tested in a set of independent samples (e.g. health and disease). Recent methods have been developed to account for the sparsity and heteroscedasticity. In particular, the similarity between testing for differential gene expression and differentially abundant bacteria was leveraged by McMurdie & Holmes (2014) in their proposal to use gammaPoisson mixtures and adapting the R package DESeq2 (Love et al., 2014) to the microbiome context. This was shown to be effective in accounting for the technical variability of microbiome measurements due to unequal sequencing depth.
However, presuming independent samples (PIS) leads to high false positive rates and findings that cannot be replicated when applied to longitudinal data. Parametric mixture models that account for dependencies can only be applied if one can specify the temporal dependence structure within a subject (Sun et al., 2016). Varin & Czado (2009) introduced a pairwise likelihood approach for making inferences on longitudinal discrete data, however their method requires prior knowledge of the bivariate marginals, which is unavailable in our case.
For longitudinal data with a response variable, either continuous or discrete, Diggle (2002) proposed generalized marginal, mixed effect, or transitional modelbased asymptotic inferences. These generalized linear models can be extended to longitudinal microbiome data. A recent negative binomial mixed effects model approach for detecting differentially expressed genes used permutation testing (Sun et al., 2016). Although the modeling accounts for dependency due to repeated observations as a random effect, permutation testing for inference is only valid for exchangeable or conditionally exchangeable observations. In longitudinal microbiome data, repeated observations within a subject have different autocorrelations that prevent these types of exchangeability.
A simple conservative alternative (used for instance in DiGiulio et al. (2015)) is to merge abundance measurements by subject, thereby collapsing the temporal dependence structure. Mergingbysubject (MBS) refers to averaging abundances withinsubject. In the MBS method, temporal dynamics and the individual values of multiple withinsubject observations are ignored, thereby eliminating withinsubject dependency but also substantially reducing the sample size and power.
To overcome the issues stated above, we propose a moving block bootstrap (MBB) method with an optimal block size step. The bootstrap is a computationally intensive method which enables inferences when analytic derivations are unavailable for the sampling distributions of estimators. The original bootstrap method (Efron, 1979), developed for independent and identical variables has been widely extended to many applications where the samples are dependent. This has been done either by modeling the dependency structure and resampling residuals (parametric bootstrap) or by changing the resampling mechanism to account for these dependencies (nonparametric bootstrap) (Efron & Tibshirani, 1994; Davison & Hinkley, 1997; Davison & Hall, 1993; Hall et al., 1995; Berkowitz & Kilian, 2000; Lahiri, 1999; Politis & White, 2004).
The MBB accounts for autocorrelation within a time series by a blocking and resampling procedure (Lepage & Billard, 1992; Davison & Hall, 1993; Lahiri, 1999), in which blocks of temporally contiguous observations are constructed and resampled with replacement to make bootstrap realizations that approximate the distribution of the chosen statistic. Ku̇nsch’s and Liu & Singh’s overlapping and Carlstein’s nonoverlapping methods are two such blocking procedures for time series (Hall et al., 1995). Lahiri (1999) showed that the overlapping blocking procedure minimizes mean squared error of most estimators. Because the number of repeated observations is small in biomedical experiments, the overlapping block bootstrap is preferred here.
Crucial to the MBB method is identifying the optimal block size that best captures the temporal autocorrelation of the data. Intuitively, this means that we are trying to evaluate the “effective sample number of observations” of a timeseries (Box & Jenkins, 1976). If we have dependent observations, the effective size will be smaller than but how small depends on the level of correlation between observations. At one extreme if the correlation is almost perfect, then we cannot do better than the MBS procedure. As the dependency decreases we will come closer to the PIS model.
In time series analysis, Hall et al. (1995) proposed an empirical subsampling method to choose an optimal block size that minimizes the average squared difference between the bootstrap estimator (bias, variance, onesided and twosided probabilities) computed with the full time series and the one constructed from many subsets of overlapping subseries (hereafter, subsamples). Further, Berkowitz & Kilian (2000) proposed a method to use Monte Carlo trials from a fitted model and a loss function to determine the block size. In addition, Bühlmann & Künsch (1999) and Politis & White (2004) (correction version Patton et al. (2009)) proposed datadriven block size selection procedures based on spectral density in the context of variance estimation. Later, Lahiri et al. (2007) proposed a nonparametric plugin estimator method with a reduced computational burden using an explicit formula and a jackknifeafterbootstrap method. Recently, Nordman et al. (2014) compared the convergence rate of empirical block size selection methods. Here we propose a modified empirical subsampling method that identifies the optimal block size by minimizing the mean squared error (MSE) of a block bootstrap estimator of twosided probability.
In addition to the proposed MBB procedure, we describe appropriate preprocessing and transformation of the data, as well as the construction and interpretation of statistical tests. As a first step, we recommend eliminating bacteria with very low counts; this minimizes the number of multiple hypotheses tested. We then transform the data. Log transformations are often used in microbiome studies. However they are undefined in the presence of zero abundances and lead to a lack of power in detecting changes between the lower abundance bacteria. As the data often follow a gammaPoisson mixture model (McMurdie & Holmes, 2014), we can use the optimal variancestabilising transform for that distribution (arcsinh as originally proposed by Anscombe (1948)). Then, we choose a pivotal statistic to measure the discrepancy between the null hypothesis and sample information. A pivotal statistic is a function of the observed data and the parameter of interest that has a distribution that does not depend on unknown parameters under the null hypothesis (Cox & Hinkley, 1979). For example, the tstatistic is a pivotal quantity when the parameter of interest is the population mean. Finally, both confidence intervals or pvalues can be used to determine statistical significance at a desired threshold.
In section 2, we describe the MBB method for differential abundance analysis. We then provide the subsampling algorithm and the method used for choosing the optimal block size. In section 3, we demonstrate the MBB method on the specific problem of inferring differentially abundant bacteria between case and control cohorts using synthetic and real longitudinal microbiome data, and compare the MBB method with the MBS and PIS approaches. In section 4, we conclude with perspectives for future work.
2 Method
2.1 Moving Block Bootstrap Method for Microbiome Data
In this section, we illustrate how the moving block bootstrap (MBB) enables inference in the presence of an unknown order of withinsubject dependence in longitudinal bacteria abundances.
A count matrix from a longitudinal microbiome study is displayed in Table 1. In this design, denotes the abundance of an th bacterium on the th subject at time point . Moreover, is the number of subjects, is the number of repeated observations from the th subject, is the total number of bacteria, is the total number of biological samples.
Table 2 shows the associated sample information and covariates that are included in the matrix .
We shall illustrate MBB based inference for differential abundance analysis in this section. To simplify the illustration, we assume that we are interested in the effect of a factor variable with two levels such as case and control.
In order to identify a statistic , which measures the effect of on the abundance of th bacterium, we consider the following marginal model for each bacterium :
(1)  
where , , , is the mean abundance, is the effect of explanatory variable on the average population abundance of th bacterium, and is a suitable variancestabilizing transformation. Further, we assume that there is some order of dependence within each subject, so that
(2) 
where is some covariance function. Note that (1) resembles the marginal model described in Diggle (2002). First, we account for the unequal library size between samples using the medianratio method (Anders & Huber, 2010). Then, we fit a generalized linear model (GLM) with an inverse hyperbolic sine () link function. When we can write model (1) as:
(3)  
where, is the normalization constant that accounts for the unequal library sizes and
is the transformation of the expected abundance after accounting for library size.
For each bacterium, we estimate using the generalized estimating equation approach (Liang & Zeger, 1986). However, for the inferences on more than one bacterium, we obtain shrinkage estimators of by using the empirical Bayes method (Robbins, 1956; Stephens, 2016), which is more stable than unshrunken estimators(Efron, 2012; Love et al., 2014). Now, we define the studentized version of :
(4) 
where SE stands for standard error.
To avoid the influence of nuisance parameters on the distribution of a statistic (), we use the studentized version of that makes our test statistic approximately pivotal. We approximate the sampling distribution of and using the MBB method with a nested double bootstrap.
We consider that each row in Table 1 is independent time series of length . Let us assume that all the subjects have the same number of repeated observations . For the MBB method, we draw block bootstrap realizations by preserving the dependency withinsubject. Thus, for all the rows together in Table 1, we define the overlapping blocks of columns for each subject.
For each subject, consider to be the column vectors. We define the th block for each subject with the block size , where . The total number of blocks within a subject is where is between one and . Therefore, each subject is identified with set of blocks of observations. We resample numbers with replacement from set and obtain . We first choose observations out of to create a bootstrap realization of . Here, we do the pairwise bootstrap sampling to get the covariates corresponding to the bootstrap realization.
We repeat the blocking procedure with block size and pairwise resampling a large number of times, obtaining bootstrap realizations. From this resampling procedure, we compute bootstrap estimates for each bacterium. These provided the approximated sampling distribution of . These estimates can be used to construct a confidence interval for .
As a refinement of the above method, the sampling distribution of is found by using the bootstrap samples and as many double bootstrap resamples for each bootstrap realization. Thus, the standard error is estimated from the double MBB. Finally, the sampling distribution of is used to compute the pvalues. These pvalues are then adjusted according to Benjamini & Hochberg (1995)’s algorithm.
There is no analytical solution for the distribution of , so we check the invariance by simulation. In Figure S1, we show by a simulation that is approximately pivotal in the context of differential abundance analysis. We simulated the first dataset with realistic parameters and then perturbed the variance of observations to simulate the second set of data. Note that in Figure S1, almost all points in each plot fall on the straight line . This suggests that the sampling distribution of agrees in both datasets and thus, is a pivotal statistic.
Our next task is to determine the optimal block size for the given data, for which we propose a modified empirical subsampling procedure.
2.2 Choosing an Optimal Block Size
For each row in Table 1, let be the rejection probability of the bootstrap test with the test statistic as our statistical functional. Because we are interested in a twotailed test, we assume that is the twosided probability , where is an observed value of based on the full dataset. Let denote the estimate of using the moving block bootstrap (MBB) method with block size . In addition, we define bias and variance to compute the mean squared error (MSE) of . The optimal block size for each row in Table 1 is determined by balancing the tradeoff between bias and variance of . That is,
(5) 
In the differential abundance analysis, there are different bacteria. Thus, we define the optimal block size such that minimizes the norm of MSE in estimating . That is,
(6) 
where
Note that in (6) depends on the sampling distribution of
Thus, we define a modified subsampling method to compute .
We shall begin the subsampling procedure by choosing an initial block size . We obtain bootstrap values of to compute for all bacteria by applying the MBB procedure on Table 1. Then, we let be the number of repeated observations from each subject to create a subsample of Table 1.
Now, we choose a different block size than , say, to compute bootstrap values of by applying the MBB procedure on the subsample. From these bootstrap estimates, we compute for all bacteria. We can now compute the MSE of estimating with , the number of subsamples and the block size :
(7) 
(8) 
We repeat the above procedure with different choices of . Then, we determine the optimal block size for the subsample, which minimizes the norm of :
(9) 
Note that in (9) is optimal block size for subsample with repeated observations in each subject. Because MSE is proportional to the number of repeated observations (Hall et al., 1995), we scale up the optimal block size for the subsample until obtaining the optimal block size for the original data in Table 1:
(10) 
where is five if is the twosided probability.
In the differential abundance context, because the number of repeated observations per subject is different in practice, we choose a reasonable percentage of repeated observations from each subject instead of choosing repeated observations to compute the MSE. Thus, the optimal block size is
(11) 
It is necessary to make sure that MSE measures the uncertainty associated with different block sizes. Based on the simulation study, we recommend choosing to compute MSE over at least five subsamples.
In practice, we choose using correlogram. First, we use the variancestabilizing transformation on the original abundance table. Then, for each bacterium, we compute the average transformed abundance at each time point. Finally, we plot withinsubject autocorrelation for top six abundances bacteria. Using the autocorrelation plots, we consider the first occurrence of a lag with zero autocorrelation of top six total abundances bacteria as initial block size . There might be a spurious higher autocorrelation at larger lags due to small number of observations or trend.
We also choose using lagplot that identifies autocorrelation patterns in time series data. We extend this plot to visualize the withinsubject autocorrelation in longitudinal microbiome data. For each bacterium, we plot transformed abundances at different lags. The lagplot is interpreted as follows: If there is 1) no pattern in the lag plot, all the observations are not autocorrelated; if there is a linear pattern along the diagonal and if there is 2) less clustering of points along the line, then there is a weak to moderate autocorrelation; and if there is a linear pattern along the diagonal and 3) points tightly cluster along the line, then there is a high autocorrelation.
The variogram is another graphical tool to explore the withinsubject dependence structure when the repeated biological samples are collected at regular or irregular time intervals (Diggle, 2002, pg. 51). For each bacterium, the variogram at lag is directly related to autocorrelation as follows:
(12) 
where is the variance of abundances of th bacterium as in (2).
If there is a betweensubject variability, (12) is modified as follows:
(13) 
where is betweensubject variance.
The variogram with the variance overly on the plot is interpreted as follows: 1) autocorrelation approaches to zero when approaches to the variance; 2) autocorrelation decreases when increases; and 3) vice versa. We consider the first lag where variogram approaches the variance or gets flat as initial block size for top six total abundances bacteria. Because of the betweensubject variability, the variogram may not reach the variance of the abundances.
In the next section, we illustrate the MBB method on synthetic and real data.
3 Results and Discussion
In this section, we illustrate the moving block bootstrap (MBB) inference for longitudinal microbiome data with different orders of dependence using synthetic and real data.
3.1 Synthetic Data
In this assessment, we simulated longitudinal microbiome data at two levels (case/control) of our factor variable, although the MBB method could be used for testing multiple levels. We start by estimating realistic parameters for the gammaPoisson on real data from study DiGiulio et al. (2015).
We simulated stable and unstable abundances over time in two different settings. In SettingNZ, stable refers to the data with no zero abundances. In SettingZ, we simulated data with few zero abundances, this mostly resembles microbiome data. Within each setting, we use three different order of dependence:
(i) (deporder1)
(ii) (deporder2) and
(iii) (deporders1&2).
We simulated ten repeated observations from 15 subjects for each level (case/control). Giving a total of 300 simulated observations. Each such setup was simulated 50 times and provided estimates of the true and false positive rates.
In the MBB, we defined the pivotal quantity as the studentized shrinkage estimator and used the subsampling procedure to determine the optimal block size as explained in Section 2. We used the correlogram and variogram at different lags to specify the initial block size.
Figure 1 shows an example (SettingNZ, deporders1&2) of a correlogram for top six abundances taxa. In the top six taxa, we observed that correlation approaches zero at lag four, which is related to a block size of five. We noticed a spurious larger autocorrelation at larger lags and this may indicate trend in the abundances. Figure S2 shows an example (SettingNZ,deporders1&2) of a variogram for the same top six abundances taxa. We noticed in Figure S2 that variogram approaches the variance at lag four. In contrast, the increasing variogram after lag four may indicate trend in the abundances. In this SettingNZ (deporders1&2), we chose the initial block size as five.
We used 60% of repeated observations from each subject to create subsamples. In our simulation setup, because there were ten repeated observations from each subject, we computed the average mean squared error of twosided probability over five overlapping subsamples for block sizes 1, 2, 3, and 4.
Table 2 shows the uncertainty associated with optimal block size in each setup in SettingNZ. We noticed the optimal block size distribution has variability in each setup. In SettingNZ, we simulated data with no zero abundances. In addition, we set only temporal dependency for differentially abundant taxa. Thus, we observed the variability in the optimal block size, which accounts for the overall order of dependence.
For each simulation run in SettingNZ, we ran the MBB method to compute the adjusted pvalues with the optimal block size. Then, we computed the false positive rate (FPR) and true positive rates (TPR) based on the truth. Because we used block bootstrap samples, we computed false discoveries at FDR of .1 and .05. Note that the minimum of FDR that we can use to access the MBB method is with 200 block bootstrap realizations.
Figure 2 shows the FPR and TPR for SettingNZ (deporders1&2). This shows that TPR is same in all three analysis (MBB, MBS  mergingbysubject, PIS  Presume independent samples). The smallest FPR was obtained with MBS analysis and the next smallest FPR is obtained using the MBB method. The PIS method produced the largest FPR.
We also accessed the effect of over and underestimation of the optimal block size in SettingNZ (deporders1&2). Figure S3 shows that the median FPR is slightly increased when the optimal block size is underestimated whereas the FPR and TPR remain the same for the overestimation.
In SettingZ, we simulated data which to be unstable over time as in almost all microbiome studies. In particular, we simulated data with zero abundances. For this SettingZ, we used the autocorrelation plot and variogram to choose the initial block size and 60% of repeated observations for the subsampling procedure.
Table 4 shows that the subsampling method identified a consistent optimal block size for the unstable abundance data in SettingZ. The variability in the optimal block size is less when the withinsubject dependency is weak to moderate.
Figure 3 shows a gain in power using MBB method than MBS and PIS. In addition, the FPR is minimized in the MBB method by accounting for withinsubject dependency rather than presuming all are independent samples (PIS).
Figure 4 shows the specificity and sensitivity of the MBB method using ROC curves and areas under the curve (AUC) in SettingZ (deporder1). The area under the curve indicates that the MBB method accurately identifies the differentially abundant taxa.
3.2 Differential Abundance Analysis of Stanford Pregnancy DataA
The Stanford pregnancy data (DiGiulio et al., 2015, hereafter, StanfordA) were the key motivation for the development of this method. The cohort consists of 761 biological vaginal samples collected from 40 pregnant women over gestation. These data included 33 term and 7 preterm subjects.
DiGiulio et al. (2015) discussed a differential abundance analysis based on a mergebysubject (MBS) approach and mentioned the loss in power compared to a potential analysis which would account for dependency within each subject. DiGiulio et al. (2015) defined the community state type 4 (CST4) samples and showed that only Gardnerella and Anaerococcus species were significantly abundant by MBS analysis.
We reanalyzed the StanfordA data without subsetting CST 4 samples. We identified 598 taxa in all samples. We excluded marginal preterm subjects, which were defined as women undergoing labor during the 37th gestational week. This process resulted in 678 samples, including 28 term and 7 preterm subjects. Then, we chose 109 taxa that were present in at least 5% of samples. We tested each taxon to identify differentially abundant taxa.
Table S1 shows differentially abundant taxa from the MBS method. We observed that increased Gardnerella and Prevotella and decreased Lactobacillus gasseri abundances were risk factors for preterm birth at FDR .05 in the StanfordA data.
We used the MBB method to identify the differentially abundant taxa in the StanfordA data. First, we identified the initial block size as 10 using the correlogram in Figure S4. We used 70% of repeated observations within each subject to determine the optimal block size. The mean squared error for each block size was calculated over 12 overlapping subsamples. Finally, with the optimal block size nine as in Figure S5, we used MBB realizations and double MBB realizations to find differentially abundant taxa at FDR .05.
This analysis was done at the taxonomic level. We observed more than one strain with the same name in Table S2 and used the taxon identification number to distinguish among these variants. At FDR .05, subjects with subsequent preterm labor in StanfordA cohort were associated with significant increased abundances in bacterial vaginosisrelated (BV) species and decreased Lactobacillus species abundances. The large number of taxa with increased abundances in the subjects with preterm labor that were identified with the MBB method make sense, given the suspected high diversity of the riskassociated Gardnerellarich, Lactobacilluspoor vaginal communities and their assumed ecological interactions.
3.3 Differential Abundance Analysis of Stanford Pregnancy DataB
Next, we considered the followup Stanford pregnancy data (Callahan et al., 2017, hereafter, StanfordB) which were generated to test the findings in DiGiulio et al. (2015) in a similar population. This consisted of vaginal samples from 29 term and nine preterm birth subjects. These 897 vaginal samples were collected at each gestational week. We used the higher resolution amplicon sequence variants (ASVs) identification method dada2 (Callahan et al., 2016) and identified 1537 ASVs. For the differential abundance analysis, we filtered ASVs that were present in at least 10% of 897 samples, resulting in 97 ASVs.
Table S3 shows the differentially abundant ASVs at FDR .05 in the StanfordB cohort using MBS analysis and noted that two different strains of decreased Lactobacillus crispatus and Lactobacillus jensenii abundances were risk factors for preterm.
For the MBB procedure, we used the correlogram in Figure S6 to determine the initial block size as ten. Then, with the optimal block size three as in Figure S7, we identified 19 differentially abundant ASVs as shown in Table S4.
Table S4 shows that the risk factors of preterm birth are increased abundances of bacterial vaginosisrelated species and decreased Lactobacillus species, except Lactobacillus gasseri. Moreover, we observed that increased abundances of Gardnerella is a risk factor in both StanfordA and B cohorts. Tables S2 and S4 show that the diversity of ASVs that are enriched in preterm subjects is greater than those ASVs that are depleted in preterm subjects.
3.4 Differential Abundance Analysis of UAB Pregnancy Data
We analyzed the University of Alabama (UAB) pregnant women cohort to identify the ASVs related to the risk of preterm labor in this different population. The StanfordB and UAB cohorts have different racial profiles (Callahan et al., 2017). Using the DADA2 pipeline, we identified 2316 ASVs with prevalence of at least one. This cohort consisted of 1282 biological samples collected from 96 pregnant women (41 preterm and 55 term) weekly during gestation. We filtered ASVs that were present in at least 10% of total biological samples. This filtering reduced the number of ASVs to 183. Table S5 shows that the only reduced Lactobacillus gasseri from Lactobacillus vaginal community is differentially abundant using the MBS method.
We used the MBB method to analyze the UAB cohort data. We used the correlogram in Figure S6 to identify the initial block size as twelve. With the optimal block size nine as in Figure S7, we identified 35 differentially abundant ASVs at FDR .05.
Table S6 shows that in contrast to the MBS analysis, the risk of preterm is significantly associated with increased abundances of bacterial vaginosisrelated species and decreased abundances of Lactobacillus species in the UAB cohort.
In Figure S10, we compare results from the StanfordA, StanfordB, and UAB cohorts. The intersection matrix in Figure S10 shows that some of the bacterial vaginosisrelated genera associated with preterm labor are risk factors of at least two cohorts while increased Gardnerella, Prevotella, Dialister genera abundances are common risk factors in Stanford cohorts. We also noted that decreased different Lactobacillus species abundances are risk factors of preterm labor in three cohorts.
These findings suggest that the microbial biomarkers for risk of preterm labor are increased bacterial vaginosisrelated species and decreased Lactobacillus species. In addition, diversity is increased in preterm subjects.
3.5 Differential Abundance Analysis of Human Oral Cavity
Next, we did the differentially abundance analysis of oral microbial communities of eight dentally healthy individuals from whom samples of the molars and incisors were collected on a daily basis over 2931 consecutive days (Figure S11). Previous analysis of these data suggested not only that communities inhabiting the molars and incisors differ from one another but also that they tend to be relatively stable over the course of a month as assayed by the RV coefficient, a multivariate generalization of the correlation coefficient (Proctor et al., 2018).
Here, we sought to determine the individual ASVs that differ in abundance by tooth class (molar vs. incisor) while exhibiting similar or different patterns of temporal variation. Since the initial analysis of these data indicated that communities differ on opposing aspects (buccal, lingual) of individual teeth we analyzed only communities on the lingual, or tonguefacing, tooth aspect. Similarly, the initial analysis indicated communities vary based on jaw (upper, lower) so we restricted the present analysis to molars and incisors in the upper jaw. ASVs were filtered to retain only taxa that were present in at least 25% of total biological samples to reduce the effect of bias due to sampling. We took the mean abundances of 90 ASVs on the first molars (teeth 3,14) and the central incisors (teeth 8, 9) to represent molar and incisor communities, respectively.
The MBS method identified 30 ASVs that were found to be differentially abundant between the molars and incisors at FDR .05 (Table S7. Of the 12 taxa most strongly found to be enriched on the molars compared to the incisors 8 represented different strains of Prevotella. In contrast, no Prevotella strain was found to be overrepresented on the incisors compared to the molars; rather, a variety of genera were represented (e.g., Actinomyces, Capnocytophaga, Corynebacterium, Neisseria, Rothia, Streptococcus and Abiotrophia) with any given genus represented by at most 3 strains.
Examining the distribution of individual taxa over time revealed 4 notable patterns (Figure S15). First, several taxa found by the MBS method were present in high abundance at the molars and incisors, including Corynebacterium durum strain 16, Rothia dentocariosa strain 1, Rothia strain 11, and Actinomyces strain 10. Second, some taxa including Prevotella pallens strain 70 and Prevotella strain 76 appeared to spike and crash in abundance regardless of whether they were more abundant on the molars or incisors. Third, other taxa appeared to be stable on the molars but experience temporal fluctuation on the incisors, such as Campylobacter concisus strain 55, Prevotella nigrescens strain 49 and Prevotella melaninogenica strain 37. Finally, rarelys did differences between individuals appear to be the predominant factor driving the differences between the distributions. One exception includes Streptococcus sanguinis strain 5 which appeared to fluctuate around a stable mean for both molars and incisors in most individuals with the exception of Subject 1107 who experienced periodic extinctions at the incisors. Another exception includes Abiotrophia defectiva strain 9 which appeared to crash to zero abundance at several time points on the incisors in subject 1104. In these cases, the temporal variability may be a result of sampling or other technical artifact rather than biological variability, though it is not possible for us to test this hypothesis explicitly.
Utter et al. (2016) showed that individual ASVs in dental plaque vary between subjects and over time. Thus, we used the MBB method to improve the inference by accounting the underlying temporal variability within subjects. We used the correlogram to identify the initial block size as ten (Figure S12) and the optimal optimal block size as two (Figure S13). The MBB method identified 54 ASVs that were differentially abundant when comparing molars and incisors, including 23 that were overrepresented on the molars and 16 that were overrepresented on the incisors (Table S8). In addition to identifying the 8 Prevotella species that were more abundant on the molars, the MBB identified an additional 3 Prevotella species that were overrepresented on the molars. Likewise, the MBB method identified not just 1 Veillonella strain, as the MBS method did, but 3 strains that were overrepresented on the molars. In general, the MBB method identified a higher number of strains that distinguished between the molars and incisors. These observations are consistent with prior reports of high variability and interindividuality of the oral microbiome in healthy individuals, which appeared to be mediated by strain level diversity (Utter et al., 2016).
To gain an intuitive sense of the difference between the MBS and MBB method we plotted ASV distributions over time by subject for all taxa identified as significant by just the MBB method (Figure S14). Several taxa the MBB method detected were generally low abundance taxa, including Prevotella oris strain 87, Mogibacterium strain 72, Solobacterium moorei strain 68, and Peptostreptococcus stomatis strain 59. Most distributions appear to be relatively stable with the exception of the incisors or molars in certain individuals. For example, Veillonella strain 75 appears to fluctuate around a stable mean for both molars and incisors but is not observed at the incisors of 2 individuals (P19 and 1104) at one or more time points. Other taxa that follow similar population crashes include Streptococcus strain 20, Streptococcus strain 6 and Haemophilus strain 35. It is unclear, at this point, whether these taxa are not observed in these samples due to technical artifact or whether population level extinctions occur in the healthy human oral cavity.
One technical artifact we can test is whether the observed ASVs defined as different Veillonella strains arise from independent organisms. Veillonella species have 4 rRNA operons exhibiting notable sequence heterogeneity (Marchandin et al., 2003), which may give rise to ASVs that are classified as different organisms merely due to technical artifact. In such cases, the different ASVs would be autocorrelated in their distributions across time within sites. When analyzing the distributions of the distinct Veillonella species strains 21 and 75 it is possible to see that they differ from one another by subject (Figure S14), suggesting that the sequence variations indeed arose from independent organisms. Thus, the different ASVs assigned to Veillonella are unlikely to be spurious. Whether other technical artifacts may give rise to the observed patterns of temporal fluctuation in a select subset of individuals is unclear. A carefully designed follow up study should be undertaken to evaluate the source of these temporal fluctuations.
4 Conclusion
Longitudinal microbiome data are used to model either abundance over time or compare the abundances of bacteria between two or more cohorts. We have devised a method for making nonparametric inferences in longitudinal microbiome data in the latter case. With the optimal block size computed using subsampling, moving block bootstrap (MBB) resamples with replacement the overlapping blocks within each subject to make bootstrap realizations. Then, the MBB method computes the sampling distribution of the chosen pivotal quantity to draw valid inferences. Finally, it ranks bacteria for followup clinical studies based on the adjusted pvalues.
We use exploratory tools to setup the two tuning parameters: (i) initial block size and (ii) number of repeated observations for subsampling. These two tuning parameters may limit the MBB method to use for longitudinal design with at least five repeated observations. In addition, temporal variability in microbiome data may consist of technical and biological variabilities. Thus, we use appropriate prefiltering to remove the unwanted noise. For example, we used 5% , 10% , 10% and 25% prefiltering for StanfordA, StanfordB, UAB, and oral data, respectively.
Although our method is computationally intensive, an accurate inference can be executed using parallel computing in R which as implemented in the opensource R package https://github.com/PratheepaJ/bootLong.
Compared with parametric modelbased approaches, MBB has flexibility in handling heterogeneity and temporal dependence structure, whereas parametric methods have to define a complete model dependency. In addition, compared with MBS (mergebysubject) and PIS (presume independent samples) methods, the advantages of the MBB method are a gain in statistical power and reduced false positive rates, respectively.
By using the MBB method, we account for the temporal dependency withinsubjects, minimize false positive discoveries, and increase the power of the statistical test. The proposed method can be easily extended to make inferences on longitudinal “omics” data making biomedical research more reproducible.
Acknowledgments
This work was partially supported by the NIH (R01 grant AI112401 to P.J., D.A.R., S.P.H.), the NSF (DMS 1501767 to S.P.H.), and the March of Dimes Prematurity Research Center (to B.J.C. and P.J.).
References
 (1)
 Anders & Huber (2010) Anders, S. & Huber, W. (2010), ‘Differential expression analysis for sequence count data’, Genome biology 11(10), R106.
 Anscombe (1948) Anscombe, F. J. (1948), ‘The transformation of poisson, binomial and negativebinomial data’, Biometrika 35(3/4), 246–254.
 Benjamini & Hochberg (1995) Benjamini, Y. & Hochberg, Y. (1995), ‘Controlling the false discovery rate: a practical and powerful approach to multiple testing’, Journal of the royal statistical society. Series B (Methodological) pp. 289–300.
 Berkowitz & Kilian (2000) Berkowitz, J. & Kilian, L. (2000), ‘Recent developments in bootstrapping time series’, Econometric Reviews 19(1), 1–48.
 Box & Jenkins (1976) Box, G. E. & Jenkins, G. M. (1976), Time series analysis: forecasting and control, revised ed, HoldenDay.
 Bühlmann & Künsch (1999) Bühlmann, P. & Künsch, H. R. (1999), ‘Block length selection in the bootstrap for time series’, Computational Statistics & Data Analysis 31(3), 295–310.
 Callahan et al. (2017) Callahan, B. J., DiGiulio, D. B., Goltsman, D. S. A., Sun, C. L., Costello, E. K., Jeganathan, P., Biggio, J. R., Wong, R. J., Druzin, M. L., Shaw, G. M. et al. (2017), ‘Replication and refinement of a vaginal microbial signature of preterm birth in two racially distinct cohorts of us women’, Proceedings of the National Academy of Sciences p. 201705899.
 Callahan et al. (2016) Callahan, B. J., McMurdie, P. J., Rosen, M. J., Han, A. W., Johnson, A. J. A. & Holmes, S. P. (2016), ‘Dada2: highresolution sample inference from illumina amplicon data’, Nature methods .
 Cook & Ware (1983) Cook, N. R. & Ware, J. H. (1983), ‘Design and analysis methods for longitudinal research’, Annual Review of Public Health 4(1), 1–23.
 Cox & Hinkley (1979) Cox, D. R. & Hinkley, D. V. (1979), Theoretical statistics, CRC Press.
 Davison & Hall (1993) Davison, A. C. & Hall, P. (1993), ‘On studentizing and blocking methods for implementing the bootstrap with dependent data’, Australian & New Zealand Journal of Statistics 35(2), 215–224.
 Davison & Hinkley (1997) Davison, A. C. & Hinkley, D. V. (1997), ‘Bootstrapping methods and their application’.
 Diggle (2002) Diggle, P. (2002), Analysis of longitudinal data, Oxford University Press.
 DiGiulio et al. (2015) DiGiulio, D. B., Callahan, B. J., McMurdie, P. J., Costello, E. K., Lyell, D. J., Robaczewska, A., Sun, C. L., Goltsman, D. S., Wong, R. J., Shaw, G. et al. (2015), ‘Temporal and spatial variation of the human microbiota during pregnancy’, Proceedings of the National Academy of Sciences 112(35), 11060–11065.
 Eckburg et al. (2005) Eckburg, P. B., Bik, E. M., Bernstein, C. N., Purdom, E., Dethlefsen, L., Sargent, M., Gill, S. R., Nelson, K. E. & Relman, D. A. (2005), ‘Diversity of the human intestinal microbial flora’, science 308(5728), 1635–1638.

Efron (1979)
Efron, B. (1979), ‘Bootstrap methods: Another
look at the jackknife’, The Annals of Statistics 7(1), 1–26.
http://dx.doi.org/10.1214/aos/1176344552  Efron (2012) Efron, B. (2012), Largescale inference: empirical Bayes methods for estimation, testing, and prediction, Vol. 1, Cambridge University Press.
 Efron & Tibshirani (1994) Efron, B. & Tibshirani, R. J. (1994), An introduction to the bootstrap, CRC press.
 Flores et al. (2014) Flores, G. E., Caporaso, J. G., Henley, J. B., Rideout, J. R., Domogala, D., Chase, J., Leff, J. W., VázquezBaeza, Y., Gonzalez, A., Knight, R. et al. (2014), ‘Temporal variability is a personalized feature of the human microbiome’, Genome biology 15(12), 531.
 Fukuyama et al. (2017) Fukuyama, J., Rumker, L., Sankaran, K., Jeganathan, P., Dethlefsen, L., Relman, D. A. & Holmes, S. P. (2017), ‘Multidomain analyses of a longitudinal human microbiome intestinal cleanout perturbation experiment’, PLoS Computational Biology 13(8), e1005706.
 Gilbert et al. (2010) Gilbert, J. A., Meyer, F., Jansson, J., Gordon, J., Pace, N., Tiedje, J., Ley, R., Fierer, N., Field, D., Kyrpides, N. et al. (2010), ‘The earth microbiome project: meeting report of the â1 st emp meeting on sample selection and acquisitionâ at argonne national laboratory october 6 th 2010’, Standards in genomic sciences 3(3), 249.
 Hall et al. (1995) Hall, P., Horowitz, J. L. & Jing, B.Y. (1995), ‘On blocking rules for the bootstrap with dependent data’, Biometrika 82(3), 561–574.
 Lahiri et al. (2007) Lahiri, S., Furukawa, K. & Lee, Y.D. (2007), ‘A nonparametric plugin rule for selecting optimal block lengths for block bootstrap methods’, Statistical Methodology 4(3), 292–321.
 Lahiri (1999) Lahiri, S. N. (1999), ‘Theoretical comparisons of block bootstrap methods’, Annals of Statistics pp. 386–404.
 Lepage & Billard (1992) Lepage, R. & Billard, L. (1992), Exploring the limits of bootstrap, Vol. 270, John Wiley & Sons.
 Liang & Zeger (1986) Liang, K.Y. & Zeger, S. L. (1986), ‘Longitudinal data analysis using generalized linear models’, Biometrika 73(1), 13–22.
 Love et al. (2014) Love, M. I., Huber, W. & Anders, S. (2014), ‘Moderated estimation of fold change and dispersion for rnaseq data with deseq2’, Genome biology 15(12), 1.
 Lozupone et al. (2012) Lozupone, C. A., Stombaugh, J. I., Gordon, J. I., Jansson, J. K. & Knight, R. (2012), ‘Diversity, stability and resilience of the human gut microbiota’, Nature 489(7415), 220–230.
 Marchandin et al. (2003) Marchandin, H., Teyssier, C., de Buochberg, M. S., JeanPierre, H., Carriere, C. & JumasBilak, E. (2003), ‘Intrachromosomal heterogeneity between the four 16s rrna gene copies in the genus veillonella: implications for phylogeny and taxonomy’, Microbiology 149(6), 1493–1501.
 McMurdie & Holmes (2014) McMurdie, P. J. & Holmes, S. (2014), ‘Waste not, want not: why rarefying microbiome data is inadmissible’, PLoS Comput Biol 10(4), e1003531.
 Nordman et al. (2014) Nordman, D. J., Lahiri, S. N. et al. (2014), ‘Convergence rates of empirical block length selectors for block bootstrap’, Bernoulli 20(2), 958–978.
 Patton et al. (2009) Patton, A., Politis, D. N. & White, H. (2009), ‘Correction to âautomatic blocklength selection for the dependent bootstrapâ by d. politis and h. white’, Econometric Reviews 28(4), 372–375.
 Politis & White (2004) Politis, D. N. & White, H. (2004), ‘Automatic blocklength selection for the dependent bootstrap’, Econometric Reviews 23(1), 53–70.
 Proctor et al. (2018) Proctor, D. M., Fukuyama, J. A., Loomer, P. M., Armitage, G. C., Lee, S. A., Davis, N. M., Ryder, M. I., Holmes, S. P. & Relman, D. A. (2018), ‘A spatial gradient of bacterial diversity in the human oral cavity shaped by salivary flow’, Nature Communications 9(1), 681.
 Proctor & Relman (2017) Proctor, D. M. & Relman, D. A. (2017), ‘The landscape ecology and microbiota of the human nose, mouth, and throat’, Cell Host & Microbe 21(4), 421–432.
 Robbins (1956) Robbins, H. (1956), An empirical bayes approach to statistics, Technical report, Columbia University, New York City, United States.
 Schloissnig et al. (2013) Schloissnig, S., Arumugam, M., Sunagawa, S., Mitreva, M., Tap, J., Zhu, A., Waller, A., Mende, D. R., Kultima, J. R., Martin, J. et al. (2013), ‘Genomic variation landscape of the human gut microbiome’, Nature 493(7430), 45–50.
 Stephens (2016) Stephens, M. (2016), ‘False discovery rates: a new deal’, Biostatistics 18(2), 275–294.
 Sun et al. (2016) Sun, X., Dalpiaz, D., Wu, D., Liu, J. S., Zhong, W. & Ma, P. (2016), ‘Statistical inference for time course rnaseq data using a negative binomial mixedeffect model’, BMC bioinformatics 17(1), 324.
 Turnbaugh et al. (2007) Turnbaugh, P. J., Ley, R. E., Hamady, M., FraserLiggett, C., Knight, R. & Gordon, J. I. (2007), ‘The human microbiome project: exploring the microbial part of ourselves in a changing world’, Nature 449(7164), 804.
 Utter et al. (2016) Utter, D. R., Mark Welch, J. L. & Borisy, G. G. (2016), ‘Individuality, stability, and variability of the plaque microbiome’, Frontiers in microbiology 7, 564.
 Varin & Czado (2009) Varin, C. & Czado, C. (2009), ‘A mixed autoregressive probit model for ordinal longitudinal data’, Biostatistics 11(1), 127–138.
 Xia & Sun (2017) Xia, Y. & Sun, J. (2017), ‘Hypothesis testing and statistical analysis of microbiome’, Genes & Diseases .
 Yatsunenko et al. (2012) Yatsunenko, T., Rey, F. E., Manary, M. J., Trehan, I., DominguezBello, M. G., Contreras, M., Magris, M., Hidalgo, G., Baldassano, R. N., Anokhin, A. P. et al. (2012), ‘Human gut microbiome viewed across age and geography’, Nature 486(7402), 222–227.
Supplemental Material
We provide https://github.com/PratheepaJ/bootLong_project, a Github repository to reproduce the results in this section.