Random forests for high-dimensional longitudinal data
Abstract
Random forests is a state-of-the-art supervised machine learning method which behaves well in high-dimensional settings, although some limitations may appear when $p$, the number of predictors, is much larger than the number of observations $n$. Repeated measurements can help by offering additional information, but no approach has been proposed for high-dimensional longitudinal data. Random forests have been adapted to standard (i.e., $n > p$) longitudinal data by using a semiparametric mixed-effects model, in which the nonparametric part is estimated using random forests. We first propose a stochastic extension of the model which allows the covariance structure to vary over time. Furthermore, we develop a new method which takes intra-individual covariance into consideration to build the forest. Simulations reveal the superiority of our approach compared to existing ones. The method has been applied to an HIV vaccine trial including 17 HIV-infected patients with 10 repeated measurements of 20000 gene transcripts and the blood concentration of human immunodeficiency virus RNA at the time of antiretroviral treatment interruption. The approach selected 21 gene transcripts for which the association with HIV viral load was fully relevant and consistent with results observed during primary infection.
Keywords: stochastic mixed-effects model, tree-based methods, high-dimensional data, repeated measurements.
1 Introduction
Random forests (RF henceforth), introduced by Breiman (2001), is a nonparametric statistical learning method from the ensemble decision trees approaches and is one of the state-of-the-art machine learning methods for prediction and classification (Fernández-Delgado et al., 2014). It shows a good behavior in applications in high-dimensional settings, where the number $p$ of predictors is larger than the number $n$ of observations (e.g., Cutler et al., 2007; Chen and Ishwaran, 2012).
The theoretical background of RF has also been recently improved (Scornet et al., 2015; Mentch and Hooker, 2016; Wager, 2014; Biau and Scornet, 2016). Scornet et al. (2015) proved a consistency result for RF in the context of additive regression models. Mentch and Hooker (2016) and Wager (2014) studied asymptotic normality of RF predictions and hence proposed confidence intervals for those predictions. We refer to Biau and Scornet (2016) for further reading on that matter.
However, some limitations may appear when $p$ is much larger than $n$ (Statnikov et al., 2008), i.e. $p \gg n$. In this situation, parameters, specifically mtry, the number of variables randomly picked at each node of a tree, must be carefully tuned to control the randomness of the method (Genuer et al., 2008). Some recent improvements have been suggested to deal specifically with high-dimensional data (Zhu et al., 2015; Linero, 2018). Zhu et al. (2015) used ideas from reinforcement learning within the tree-based model framework to focus on relevant variables. Linero (2018) developed Bayesian regression trees (Chipman et al., 1998) where sparsity was implemented by using appropriate priors.
Besides methodological developments dealing with the issue of high dimension, the situation may be improved by additional observations available on each individual from repeated measurements, leading to longitudinal data. There is already a large bulk of work on RF for survival (i.e., censored) data (Hothorn et al., 2005; Ishwaran et al., 2008, 2010; Steingrimsson et al., 2018). Hothorn et al. (2005) proposed a general survival ensembles framework based on an inverse probability of censoring weighted loss function, while Ishwaran et al. (2008, 2010) introduced random survival forests by using a split criterion adapted to censored data, and then derived variable selection strategies for high-dimensional survival data. More recently, Steingrimsson et al. (2018) proposed new survival trees and forests based on the censoring unbiased transformations theory.
Less has been done to adapt random forest approaches to repeated measurement settings. The analysis of repeated measurements requires taking into account the specific correlation structure, as done with mixed-effects models (Laird and Ware, 1982; Verbeke and Molenberghs, 2009). The first approaches dealing with longitudinal and clustered data involved tree-based methods (Segal, 1992; Hajjem et al., 2011; Sela and Simonoff, 2012).
Then, Hajjem et al. (2014) proposed a method based on RF. The main idea was to iterate between the fixed part and the random part to estimate the parameters through an Expectation-Maximization (EM) algorithm. All these approaches can be viewed as semiparametric mixed-effects models where the fixed-effects part is modeled through tree-based methods. Of note, other approaches based on smoothing splines have been proposed for semiparametric mixed-effects models, such as in Jacqmin-Gadda et al. (2002); Zhang et al. (1998); Wang (1998).
The main contributions of this paper are threefold. First, we extend existing methods by using a stochastic semiparametric mixed-effects model (Section 2). Second, we introduce a new method based on RF to handle high-dimensional longitudinal data and derive theoretical guarantees for predictions of our model in an asymptotic framework (Section 3). Third, we compare existing methods with ours in an extensive simulation study (Section 5) and analyze a therapeutic vaccine trial in HIV-infected patients in Section 6.
All existing and proposed methods have been implemented together in an R package called longituRF, available at https://github.com/Lcapitaine/longituRF.
2 The semiparametric stochastic mixed-effects model
Let us consider longitudinal data with $n$ individuals, the $i$-th individual having $n_i$ observations over time. Suppose that $Y_{ij}$ (for all $i = 1, \dots, n$ and $j = 1, \dots, n_i$), the response of the $i$-th individual at time $t_{ij}$, satisfies
$Y_{ij} = f(X_{ij}) + Z_{ij}^T b_i + \omega_i(t_{ij}) + \varepsilon_{ij} \qquad (1)$
where $X_{ij}$ is the vector of covariates, $f$ is the unknown mean behavior function, $b_i$ is a vector of random effects associated with the vector of covariates $Z_{ij}$, $\omega_i$ is a stochastic process used to model serial correlation and $\varepsilon_{ij}$ denotes a measurement error. The $b_i$, $i = 1, \dots, n$, are independent, as well as the $\omega_i$ and the $\varepsilon_{ij}$ for all $i, j$. We also assume that $b_i$, $\omega_i$ and $\varepsilon_{ij}$ are mutually independent.
We suppose that the $b_i$ are normally distributed as $b_i \sim \mathcal{N}(0, B)$, where $B$ is a positive definite matrix; $\omega_i$ is a centered Gaussian process whose covariance function depends on a parameter $\gamma^2$; and the $\varepsilon_{ij}$ are normally distributed as $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$.
We consider in model (1) that the evolution of the response variable for the $i$-th individual over time varies around a mean behavior given by $f$. These variations specify the individual trajectories around $f$ and are given by the random effects $b_i$ and the stochastic process $\omega_i$ for the $i$-th individual.
We suppose here that the number $p$ of covariates for the mean behavior is much larger than the total number $N = \sum_{i=1}^{n} n_i$ of observations, which places us in a high-dimensional context. Zhang et al. (1998) considered a semiparametric stochastic model close to model (1) in small dimension with a one-dimensional function of time, hence (1) can be seen as a generalization of their model. Hajjem et al. (2014) considered a model similar to (1), but without the stochastic processes $\omega_i$, and developed a method based on the EM algorithm (McCulloch, 1997) to estimate the mean behavior function with any regression method (splines, RF, kernel-based methods, etc.). Note that if the function $f$ is assumed linear, then model (1) reduces to the linear stochastic model of Diggle and Hutchinson (1989).
3 Estimation
We suppose that the covariance matrix of the process $\omega_i$ verifies $\mathrm{Var}(\omega_i) = \gamma^2 K_i$, with $K_i$ a positive definite matrix only depending on the observation times $t_{ij}$, $j = 1, \dots, n_i$. We also consider the case where $K_i$ depends on an additional parameter $\alpha$ in Section 3.2.
In the same spirit as Hajjem et al. (2014), we use a variant of the EM algorithm to estimate the parameters. The main principle of the method is given in Algorithm 1, while a detailed version can be found in the supplementary materials.

Algorithm 1 (main principle). Initialize the predictions $\widehat{b}_i$, $\widehat{\omega}_i$ and the variance parameters $B$, $\gamma^2$, $\sigma^2$, then iterate until convergence:

1. Estimate $f$ in the standard regression mode on the transformed outcome $Y_{ij} - Z_{ij}^T \widehat{b}_i - \widehat{\omega}_i(t_{ij})$, where $\widehat{b}_i$ and $\widehat{\omega}_i$ are the current predictions. Then, for given $\widehat{f}$, $B$, $\gamma^2$ and $\sigma^2$, predict $b_i$ and $\omega_i$.

2. For given $\widehat{f}$, $\widehat{b}_i$ and $\widehat{\omega}_i$, update $B$, $\gamma^2$ and $\sigma^2$.
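To make the alternation concrete, here is a minimal numpy sketch of this EM-style loop for a random-intercept-only model. The stochastic process is omitted for brevity, and the nonparametric learner (a random forest in the paper) is replaced by a plain least-squares fit; the data, dimensions and stand-in learner are ours, not the longituRF implementation (which is in R).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy longitudinal data: n individuals, n_i measurements each, random
# intercept only (Z_ij = 1); true mean behavior is linear in covariate 0.
n, n_i, p = 20, 6, 3
X = rng.normal(size=(n, n_i, p))
b_true = rng.normal(size=n)
y = 2.0 * X[..., 0] + b_true[:, None] + rng.normal(scale=0.5, size=(n, n_i))

def fit_mean(Xf, yf):
    """Stand-in for the nonparametric learner (a random forest in the
    paper): a plain least-squares fit, purely for illustration."""
    beta, *_ = np.linalg.lstsq(Xf.reshape(-1, p), yf.reshape(-1), rcond=None)
    return lambda Xq: Xq.reshape(-1, p) @ beta

B, sigma2 = 1.0, 1.0      # variance of the random intercept, residual variance
b = np.zeros(n)           # current predictions of the random intercepts
ones = np.ones(n_i)
for _ in range(20):
    # Step 1a: remove the current random part, refit the mean behavior f.
    f_hat = fit_mean(X, y - b[:, None])
    resid = y - f_hat(X).reshape(n, n_i)              # Y_i - f_hat(X_i)
    # Step 1b: BLUPs of b_i, with V_i = B 11' + sigma2 I (random intercept).
    V = B * np.outer(ones, ones) + sigma2 * np.eye(n_i)
    Vinv = np.linalg.inv(V)
    b = np.array([B * ones @ Vinv @ resid[i] for i in range(n)])
    # Step 2: update variance components (conditional-expectation updates).
    eps = resid - b[:, None]
    sigma2 = np.mean([(eps[i] @ eps[i]
                       + sigma2 * (n_i - sigma2 * np.trace(Vinv))) / n_i
                      for i in range(n)])
    B = np.mean([b[i] ** 2 + B - B**2 * (ones @ Vinv @ ones)
                 for i in range(n)])
```

Both variance updates stay positive by construction, which keeps the iteration well defined regardless of the starting values.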
3.1 Mean behavior function estimation
At step 1 of Algorithm 1, we consider the variance parameters known and given by the estimations of the previous iteration. The mean behavior function $f$ can be estimated with any regression method. When $f$ is estimated with a CART tree, Hajjem et al. (2011) refer to MERT (Mixed Effects Regression Trees). CART (Breiman et al., 1984) consists in partitioning in a recursive way the explanatory variable space to obtain the best partition for prediction. At each step of the partitioning, the space is cut into two subparts. Hence, the obtained partition can naturally be associated with a binary tree, which is called a CART tree. Furthermore, we stress that each split is optimized among all explanatory variables and that the CART algorithm works in two steps: the maximal tree building, followed by the pruning step, in order to give the best predictor in terms of prediction error.
Similarly, Hajjem et al. (2014) refer to MERF (Mixed Effects Random Forests) when $f$ is estimated with RF. RF is an aggregation of multiple randomized CART trees, where the aggregation consists in taking the mean of the individual tree predictions. Each tree is a maximal tree built using a random perturbation: first, it is built on a bootstrap sample of the learning set, and second, at each step of the partitioning, the best split is optimized among a randomly drawn subset of explanatory variables. The size of this subset, often called mtry, is the most important parameter of the method. RF naturally estimates the prediction error with the Out-Of-Bag (OOB) error as follows: to predict the response of one particular observation of the learning set, only trees built on bootstrap samples not containing this observation are aggregated. Furthermore, OOB samples (made of observations not selected in bootstrap samples) are also used to compute a variable importance (VI) score. For a fixed variable, the VI score is defined as the mean increase of the error of a tree on its associated OOB sample after a random permutation of the values of this variable.
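The permutation VI score just described can be sketched generically for any fitted predictor. The following numpy illustration uses a single evaluation set instead of per-tree OOB samples, and a hand-made linear predictor standing in for a fitted forest; names and data are ours.

```python
import numpy as np

rng = np.random.default_rng(3)

def permutation_vi(predict, X, y, var):
    """VI score of variable `var`: increase of the mean squared error after
    randomly permuting that variable's values (computed here on one holdout
    set rather than per-tree OOB samples, for brevity)."""
    base = np.mean((y - predict(X)) ** 2)
    Xp = X.copy()
    Xp[:, var] = rng.permutation(Xp[:, var])
    return np.mean((y - predict(Xp)) ** 2) - base

# Toy setting: the outcome depends on column 0 only.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
predict = lambda X: 3.0 * X[:, 0]        # stand-in for a fitted forest
vi = [permutation_vi(predict, X, y, v) for v in range(5)]
```

As expected, only the informative variable gets a large score: permuting a variable the predictor ignores leaves the error unchanged.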
For the considered model (1) we denote by SMERF (Stochastic Mixed Effects Random Forests) the generalization of MERF that we propose.
When the mean behavior function is estimated with a CART tree, Sela and Simonoff (2012) proposed to use the partition associated with the tree to fit a linear mixed-effects model. Let $\Phi_i$ be the indicator matrix defined by $(\Phi_i)_{j,l} = \mathbb{1}_{\{X_{ij} \in A_l\}}$, where $A_l$ is the $l$-th leaf of the tree. Considering the model $Y_i = \Phi_i \mu + Z_i b_i + \omega_i + \varepsilon_i$,
the estimation of the values $\mu$ associated with the leaves of the tree at step 1 is given by the generalized least squares estimator

$\widehat{\mu} = \Big( \sum_{i=1}^{n} \Phi_i^T V_i^{-1} \Phi_i \Big)^{-1} \sum_{i=1}^{n} \Phi_i^T V_i^{-1} Y_i,$
with $V_i = \mathrm{Var}(Y_i) = Z_i B Z_i^T + \gamma^2 K_i + \sigma^2 I_{n_i}$ for all $i = 1, \dots, n$.
With this method, the values associated with the leaves of the tree are computed by taking into account the intra-individual covariance matrix, instead of taking the simple mean of the values in the leaf. The fitted tree is called a REEMtree.
Following this work, we propose a novel method, named REEMforest, which aggregates a collection of REEMtrees. Given $M$ randomized trees, let $\Phi_i^{(m)}$ be the indicator matrix associated with the $m$-th random tree and $\widehat{\mu}^{(m)}$ the fitted leaves of the $m$-th tree estimated with the stochastic linear mixed-effects model. The REEMforest estimator is given by the mean of the fitted REEMtrees:

$\widehat{f}(X_i) = \frac{1}{M} \sum_{m=1}^{M} \Phi_i^{(m)} \widehat{\mu}^{(m)}.$

Finally, when the considered model is (1), we denote by SREEMforest the method with the additional estimation of the stochastic process.
Once $\widehat{f}$ has been computed, the predictions of the random effects and the stochastic processes for known variance parameters are obtained by taking their conditional expectations given the data $Y_i$. The best linear unbiased predictors (BLUP) are

$\widehat{b}_i = B Z_i^T V_i^{-1} \big( Y_i - \widehat{f}(X_i) \big), \qquad \widehat{\omega}_i = \gamma^2 K_i V_i^{-1} \big( Y_i - \widehat{f}(X_i) \big).$
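Once $V_i$ is formed, these BLUPs are plain matrix computations. A small numpy sketch for one individual, with made-up values and a Brownian-motion covariance $K_i[j,k] = \min(t_{ij}, t_{ik})$ (all numbers are illustrative, not from the paper):

```python
import numpy as np

# One individual with four measurement times; random-intercept design.
t = np.array([1.0, 2.0, 4.0, 7.0])         # observation times t_ij
Z = np.ones((4, 1))                        # random-intercept design matrix
B = np.array([[0.8]])                      # Var(b_i), here a 1x1 matrix
gamma2, sigma2 = 0.5, 0.3                  # process and residual variances

K = np.minimum.outer(t, t)                 # Brownian-motion covariance K_i
V = Z @ B @ Z.T + gamma2 * K + sigma2 * np.eye(4)   # V_i = Var(Y_i)

resid = np.array([0.6, -0.2, 1.1, 0.4])    # Y_i - f_hat(X_i), made up
Vinv = np.linalg.inv(V)
b_hat = B @ Z.T @ Vinv @ resid             # BLUP of the random effects
omega_hat = gamma2 * K @ Vinv @ resid      # BLUP of the stochastic process
```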
3.2 Variance components estimation
At step 2 of Algorithm 1, the estimations of the variance parameters are obtained by taking the conditional expectation of their maximum likelihood estimators given the data $Y_i$. Thanks to the conditional independence between the individuals, we can write, for fixed $\widehat{f}$, the likelihood function associated with model (2) as

$L(B, \gamma^2, \sigma^2) = \prod_{i=1}^{n} p(Y_i),$

with $p$ the density function of the vector $Y_i$, which is Gaussian with mean $\widehat{f}(X_i)$ and covariance matrix $V_i$. Moreover, since $Y_i = \widehat{f}(X_i) + Z_i b_i + \omega_i + \varepsilon_i$, by using the independence of $b_i$, $\omega_i$ and $\varepsilon_i$ we can easily write the likelihood function in this closed form.
Using that $\varepsilon_i = Y_i - \widehat{f}(X_i) - Z_i b_i - \omega_i$, the maximum likelihood estimators of $\sigma^2$ and $B$ follow. Because $b_i$, $\omega_i$ and $\varepsilon_i$ are unknown, these estimators are not computable; this is why we take their expectation given the data $Y_i$. The conditional expectations of the estimators $\widehat{\sigma}^2$ and $\widehat{B}$ are given in Wu and Zhang (2006):

$\widehat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{n} \Big\{ \widehat{\varepsilon}_i^T \widehat{\varepsilon}_i + \sigma^2 \big[ n_i - \sigma^2 \,\mathrm{tr}(V_i^{-1}) \big] \Big\}, \qquad \widehat{B} = \frac{1}{n} \sum_{i=1}^{n} \Big\{ \widehat{b}_i \widehat{b}_i^T + \big[ B - B Z_i^T V_i^{-1} Z_i B \big] \Big\},$

with $\widehat{\varepsilon}_i = Y_i - \widehat{f}(X_i) - Z_i \widehat{b}_i - \widehat{\omega}_i$. The conditional expectation of the maximum likelihood estimator of $\gamma^2$ given the data $Y_i$ is

$\widehat{\gamma}^2 = \frac{1}{N} \sum_{i=1}^{n} \Big\{ \widehat{\omega}_i^T K_i^{-1} \widehat{\omega}_i + \gamma^2 \big[ n_i - \gamma^2 \,\mathrm{tr}(K_i V_i^{-1}) \big] \Big\}.$
The estimators of the variance parameters $B$, $\gamma^2$ and $\sigma^2$ at step 2 are thus given by $\widehat{B}$, $\widehat{\gamma}^2$ and $\widehat{\sigma}^2$.
Gaussian processes such as the Ornstein-Uhlenbeck process and the fractional Brownian motion have a variance-covariance function which depends on two parameters, $\gamma^2$ and $\alpha$. This covariance function can be written as $\gamma^2 K_i(\alpha)$, with $K_i$ depending on $\alpha$. There is no analytic maximum likelihood estimator of $\alpha$. However, for a fixed value of $\alpha$, the estimation procedure described in this section holds. Thus, for an ensemble $\mathcal{A}$ of possible values of the parameter $\alpha$, the estimator of $\alpha$ is

$\widehat{\alpha} = \underset{\alpha \in \mathcal{A}}{\arg\max}\ \ell(\alpha),$

where $\ell$ is the log-likelihood function.
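This grid search can be sketched as follows for a fractional Brownian motion, where $\alpha$ is the Hurst exponent: the Gaussian log-likelihood of one individual's residuals is profiled over a grid of candidate values while the other parameters are held fixed. All numerical values are illustrative assumptions, not the longituRF implementation.

```python
import numpy as np

def fbm_cov(t, H):
    """Fractional-Brownian-motion covariance with Hurst exponent H:
    K[s, u] = 0.5 * (s^(2H) + u^(2H) - |s - u|^(2H))."""
    s, u = np.meshgrid(t, t, indexing="ij")
    return 0.5 * (s**(2 * H) + u**(2 * H) - np.abs(s - u)**(2 * H))

def profile_loglik(resid, t, H, gamma2=0.5, sigma2=0.3):
    """Gaussian log-likelihood (up to a constant) of one individual's
    residuals Y_i - f_hat(X_i) for a fixed H, other parameters frozen."""
    V = gamma2 * fbm_cov(t, H) + sigma2 * np.eye(len(t))
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * (logdet + resid @ np.linalg.solve(V, resid))

t = np.array([1.0, 2.0, 4.0, 7.0])
resid = np.array([0.6, -0.2, 1.1, 0.4])      # made-up residuals
grid = np.linspace(0.1, 0.9, 9)              # candidate Hurst exponents
H_hat = grid[np.argmax([profile_loglik(resid, t, H) for H in grid])]
```

With $H = 1/2$ the kernel reduces to the standard Brownian motion $\min(s, u)$, which connects this sketch to the application in Section 6.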
4 Asymptotic analysis
4.1 Convergence of the expected outcome estimation
Given a fitted MERF forest $\widehat{f}$, the associated random effects $\widehat{b}_i$ and the stochastic processes $\widehat{\omega}_i$ obtained with Algorithm 1, the fitted response variable for the $i$-th individual is

$\widehat{Y}_i = \widehat{f}(X_i) + Z_i \widehat{b}_i + \widehat{\omega}_i.$
Considering the variance components known, the expectations of the random effects $\widehat{b}_i$ and $\widehat{\omega}_i$ are related to the bias of $\widehat{f}$ by

$\mathbb{E}\big[\widehat{b}_i\big] = B Z_i^T V_i^{-1}\, \mathbb{E}\big[ Y_i - \widehat{f}(X_i) \big], \qquad \mathbb{E}\big[\widehat{\omega}_i\big] = \gamma^2 K_i V_i^{-1}\, \mathbb{E}\big[ Y_i - \widehat{f}(X_i) \big].$

This leads to

$\mathbb{E}\big[\widehat{Y}_i\big] = \mathbb{E}\big[\widehat{f}(X_i)\big] + \big( Z_i B Z_i^T + \gamma^2 K_i \big) V_i^{-1}\, \mathbb{E}\big[ Y_i - \widehat{f}(X_i) \big].$
Under some regularity conditions on the function $f$, thanks to Scornet et al. (2015) we conclude that $\widehat{f}$ is consistent for $f$, so that the bias term $\mathbb{E}[Y_i - \widehat{f}(X_i)]$ vanishes asymptotically. This means that when the number of individuals is large enough, the mean fit tends to the true mean behavior $f$.
4.2 Convergence of out-of-sample outcome predictions
The prediction of the response variable for the $i$-th individual at time $t$ is

$\widehat{Y}_i(t) = \widehat{f}(x) + z^T \widehat{b}_i + \widehat{\omega}_i(t) \qquad (3)$

with $x$ and $z$ the fixed and random effects explanatory variables for the $i$-th individual at time $t$, and $\widehat{\omega}_i(t)$ the best linear prediction of the stochastic process at time $t$, obtained from the covariance of $\omega_i(t)$ with the vector of $\omega_i$ at the observed times. With this definition, whether $t$ coincides with an observed measurement time or not, the expectation of $\widehat{Y}_i(t)$ is a linear function of $\mathbb{E}\big[ Y_i - \widehat{f}(X_i) \big]$. According to Scornet et al. (2015), for a new observation of the individual $i$ at time $t$, the bias of the prediction $\widehat{Y}_i(t)$ therefore vanishes as the number of individuals grows.
5 Simulation study
5.1 Simulation model
5.1.1 Explanatory variables
In this section we present the approach used to simulate the data matrix of the explanatory variables. Our choices are motivated by the characteristics of the data coming from our application, which are transcriptomic data from a phase 1/2 vaccine trial (see Section 6 for more details).
Since we are in a high-dimensional context, we assume that a large majority of variables are not associated with the response variable (also known as a sparsity assumption). In our study, those variables are simulated as i.i.d. random draws from a multivariate Gaussian distribution $\mathcal{N}(0, I_N)$, where $I_N$ denotes the identity matrix of size $N$ (recall that $N$ denotes the total number of observations).
Moreover, since we deal with longitudinal data in the context of gene expression, we assume that some explanatory variables vary over time and that some explanatory variables are clustered into groups (which correspond to genes involved in the same biological pathway). Hejblum et al. (2015) highlighted ten examples of groups of genes with different temporal behaviors in the DALIA trial, and we mimic some of these trends by setting the following six behaviors over time in our simulations:
(4) 
The explanatory variables with a temporal behavior are then simulated as follows:
where the first term is the behavior of the group of the $j$-th covariate, the second term corresponds to a random translation at the group level, and the last term is an additional time-dependent variability.
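As an illustration, a time-varying covariate of this kind can be simulated as a group trend, plus a group-level random translation, plus time-dependent noise. The six trend functions below are hypothetical stand-ins of our own, since the exact behaviors of Equation (4) are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical group trends standing in for the six behaviors of
# Equation (4); any smooth functions of time would play the same role.
trends = [np.sin, np.cos, lambda t: t / 10, lambda t: -t / 10,
          lambda t: np.sqrt(t), lambda t: np.exp(-t / 5)]

def simulate_covariate(t, group, tau=0.5, noise=0.2):
    """One time-varying covariate: group trend + random group-level
    translation (scale tau) + individual time-dependent variability."""
    shift = rng.normal(scale=tau)
    return trends[group](t) + shift + rng.normal(scale=noise, size=t.shape)

t = np.linspace(0, 10, 8)          # measurement times for one individual
x = simulate_covariate(t, group=0)
```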
In the following, we investigate two situations with different values of the total number of variables $p$, as well as different sizes of each group of variables with temporal behavior.
5.1.2 Outcome variable
The two following models, which are special cases of model (2), are used to simulate the outcome variable $Y$. For all $i = 1, \dots, n$ and $j = 1, \dots, n_i$:
$Y_{ij} = f(X_{ij}) + Z_{ij}^T b_i + \varepsilon_{ij} \qquad (5)$
$Y_{ij} = f(X_{ij}) + Z_{ij}^T b_i + \omega_i(t_{ij}) + \varepsilon_{ij} \qquad (6)$
where $b_i \sim \mathcal{N}(0, B)$, $\omega_i$ is a Brownian motion with volatility $\gamma^2$, and $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$. In these models, the random effect $b_i$ is associated with an exogenous variable $Z_{ij}$.
The mean behavior function depends on the dimension of the simulated data:

In the low-dimensional case (with $p = 6$):
(7) 
In the high-dimensional case (with $p = 8000$ and with at least 20 variables in the first two groups of explanatory variables):
(8) where the two sums run over two sets of 20 genes randomly picked from the first and the second group of explanatory variables, respectively.
The mean behavior function is essentially the same in the two situations. The difference lies in the fact that in the high-dimensional case, 40 variables are related to the response variable, against 2 in the low-dimensional case. It is indeed reasonable, in high dimension, to assume that several genes coming from the same group are linked to the mean behavior function $f$.
5.2 Squared bias and prediction error
The different methods are compared in terms of squared bias (associated with each estimated quantity) and prediction performance, computed over the repetitions of the simulation.
Squared biases are defined, for each estimated quantity, as the average over the repetitions of the squared difference between the estimate and the true value, with:

- a fixed grid of time points (different from the measurement times) on which the mean behavior function is evaluated;

- the fitted random forest returned by the algorithm after convergence on each repetition;

- the estimations of the variance parameters $B$, $\sigma^2$ and, when appropriate, $\gamma^2$ on each repetition.
To evaluate prediction performance, the data associated with one simulation are randomly split into a learning set and a test set, the split being repeated several times. Each test set is obtained by randomly picking two measurements of each individual. With $\mathcal{T}_i$ denoting the indices of the held-out measurements of the $i$-th individual in a given test set, we define the prediction error as

$\mathrm{PE} = \frac{1}{2n} \sum_{i=1}^{n} \sum_{j \in \mathcal{T}_i} \big( Y_{ij} - \widehat{Y}_{ij} \big)^2,$

where $\widehat{Y}_{ij}$ is the predicted response variable, defined in (3), for the $j$-th measurement of the $i$-th individual, given by the fitted random forest returned by the algorithm after convergence.
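The train/test construction and the prediction error can be sketched as follows; the helper names and toy data are ours, and a constant predictor stands in for the fitted model.

```python
import numpy as np

rng = np.random.default_rng(2)

def split_by_individual(times_per_indiv, n_test=2):
    """For each individual, hold out n_test of its measurements, mirroring
    the evaluation scheme of Section 5.2."""
    train, test = [], []
    for i, n_i in enumerate(times_per_indiv):
        idx = rng.permutation(n_i)
        test += [(i, j) for j in idx[:n_test]]
        train += [(i, j) for j in idx[n_test:]]
    return train, test

def prediction_error(y_true, y_pred, test):
    """Mean squared error over the held-out (individual, measurement) pairs."""
    errs = [(y_true[i][j] - y_pred[i][j]) ** 2 for i, j in test]
    return float(np.mean(errs))

# Toy check with 3 individuals and a constant prediction.
y = {0: [1.0, 2.0, 3.0], 1: [0.5, 1.5, 2.5, 3.5], 2: [2.0, 2.0, 2.0]}
y_hat = {i: [2.0] * len(v) for i, v in y.items()}
train, test = split_by_individual([3, 4, 3])
err = prediction_error(y, y_hat, test)
```

Holding out whole measurements per individual (rather than whole individuals) matches the goal here: predicting new time points for individuals already seen during training.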
5.3 Results
The number of individuals is fixed to $n = 17$ (the same as in the DALIA trial) all along the simulation study, and the number of measurements $n_i$ for the $i$-th individual is randomly chosen (with uniform distribution over a fixed range) for every $i$, leading to an unbalanced design. We recall that the total number of observations is denoted by $N$.
5.3.1 A low-dimensional case
We start by considering a low-dimensional example where $p = 6$. Hence, we have 6 explanatory variables in the dataset and all of them have a temporal behavior (given by Equation (4)). This framework allows comparing different tree-based methods as well as a linear mixed model for longitudinal data in a standard framework. First, we simulate one dataset under model (5) using the mean behavior function defined in Equation (7) and study the behavior of the MERF method on that dataset.
Figure 1 shows that the convergence of the EM algorithm for the MERF method is quite affected by the randomness of the random forests. Indeed, the MERF method converges to the maximum likelihood only for high values of the mtry parameter (the number of variables randomly drawn before optimizing the split of a node of a tree composing the RF). It is expected that in an iterative algorithm such as the EM algorithm, an update with too much randomness compromises the convergence of the algorithm. Hence, we advocate large values of mtry in those EM-based RF methods, even if the trees in the RF are then more similar to each other.
We now simulate 100 datasets (again under model (5) with mean behavior function (7)) and study the squared biases of the estimations of the quantities of interest ($f$, $B$, $\sigma^2$ and, when appropriate, $\gamma^2$), given by the four tree-based methods (MERT and REEMtree for trees, MERF and REEMforest for forests), and also compare with the Linear Mixed Effects Model (LMEM) method. In addition, we simulate 100 additional datasets under the stochastic model (6) (still with $f$ defined by Equation (7)) and compare the stochastic counterparts of the five methods described above, respectively denoted by SMERT, SREEMtree, SMERF, SREEMforest and SLMEM.
Table 1: Squared biases of the estimations of $f$, $B$, $\gamma^2$ and $\sigma^2$, averaged over the 100 simulated datasets (a star indicates a parameter absent from the model).

Non-stochastic model:
Method      | $f$   | $B$   | $\gamma^2$ | $\sigma^2$
LMEM        | 1.738 | 0.441 | *          | 2.517
MERT        | 1.831 | 0.254 | *          | 0.240
REEMtree    | 1.751 | 0.243 | *          | 0.231
MERF        | 0.480 | 0.204 | *          | 0.009
REEMforest  | 0.360 | 0.185 | *          | 0.008

Stochastic model:
Method       | $f$   | $B$   | $\gamma^2$ | $\sigma^2$
SLMEM        | 2.413 | 0.567 | 0.082      | 3.920
SMERT        | 2.600 | 0.371 | 0.059      | 1.081
SREEMtree    | 2.969 | 0.321 | 0.001      | 0.361
SMERF        | 0.891 | 0.287 | 0.011      | 0.003
SREEMforest  | 0.853 | 0.253 | 0.021      | 0.0002
As shown in Table 1, the LMEM, MERT and REEMtree methods are close to each other in terms of bias on $f$, while MERF and REEMforest, which use random forests, provide a much better mean behavior estimation. Moreover, the squared bias on $f$ for REEMforest is about 25% lower than for MERF, whereas the squared bias on $f$ of REEMtree is only 4% lower than the one obtained with MERT. Hence, in this framework, taking into account the intra-individual covariance structure to evaluate the tree leaf values generates a much more important decrease of the squared bias on $f$ when random forests are used compared with CART. Furthermore, the squared biases obtained on the random effects covariance matrix $B$ and the residual variance parameter $\sigma^2$ are lower for all four tree-based methods compared to LMEM, with forests estimating $\sigma^2$ much better than trees. Finally, REEMforest gives slightly lower biases than MERF.
For stochastic methods, tree-based methods (SMERT and SREEMtree) led to a higher squared bias on $f$ and lower squared biases on the estimated variance parameters compared to SLMEM. However, forest methods (SMERF and SREEMforest) still performed better than trees and SLMEM, with almost all squared biases much lower.
Finally, we compare the different methods on their prediction capacity by computing prediction errors on 100 simulated datasets, either under model (5) or (6). For each dataset, a test set is formed by randomly picking, for each individual, two observations among its $n_i$ observations. This gives a test set containing $2n$ observations and a learning set with $N - 2n$ observations. Breiman's RF is also included in this study, in addition to the five methods already mentioned, to illustrate the gain of taking into account the intra-individual correlation.
As expected, the left part of Figure 2 shows that Breiman's RF performed the worst in terms of prediction error, because it assumes that all observations are independent, and that LMEM reached a poor prediction ability, because $f$ is not linear. MERT and REEMtree gave intermediate performance, whereas MERF and REEMforest reached the lowest prediction errors, with similar performances. Overall, those comments remain valid for the stochastic case, illustrated on the right part of Figure 2.
As a conclusion, we demonstrated the benefits of RF approaches for longitudinal data analysis in a low-dimensional case, especially in terms of prediction error. REEMforest exhibited a slight advantage compared to MERF in terms of quality of the estimation of the mean behavior function $f$ and of the other parameters $B$, $\gamma^2$ and $\sigma^2$.
5.4 A high-dimensional case
For the high-dimensional context, we kept $n = 17$ but set $p = 8000$. We also set the size of each of the six groups containing explanatory variables with temporal behaviors (given by Equation (4)) to 266, leading to a total of 1596 variables that change over time among the 8000 variables in the dataset.
First of all, according to Figure 1 for the low-dimensional case and some preliminary experiments, we fixed the mtry parameter to a large fraction of $p$ in all RF runs. This ensures convergence of the EM-based methods and avoids the heavy computational burden of a systematic optimization of mtry over several values for each RF.
Table 2: Squared biases of the estimations of $f$, $B$, $\gamma^2$ and $\sigma^2$ in the high-dimensional setting (a star indicates a parameter absent from the model).

Non-stochastic scheme:
Method      | $f$   | $B$   | $\gamma^2$ | $\sigma^2$
MERT        | 2.967 | 0.359 | *          | 0.596
REEMtree    | 2.658 | 0.292 | *          | 0.475
MERF        | 0.941 | 0.269 | *          | 0.216
REEMforest  | 0.917 | 0.260 | *          | 0.211

Stochastic scheme:
Method       | $f$   | $B$   | $\gamma^2$ | $\sigma^2$
SMERT        | 7.191 | 0.807 | 0.019      | 1.767
SREEMtree    | 4.861 | 0.596 | 0.007      | 0.487
SMERF        | 1.768 | 0.353 | 0.00001    | 0.562
SREEMforest  | 1.754 | 0.333 | 0.0001     | 0.549
We simulated 50 datasets under model (5) (and 50 other datasets under model (6)) with the mean behavior function given by Equation (8), and computed the squared biases of the estimations given by the four tree-based methods: (S)MERT, (S)REEMtree, (S)MERF and (S)REEMforest. We no longer compare with LMEM, which is not adapted to the high-dimensional setting.
For the non-stochastic scheme, the squared biases on $f$ and all parameters obtained with REEMforest were slightly lower than the ones obtained with the existing MERF method. For the stochastic model (6), the two forest-based methods gave similar biases on all parameters. As in the low-dimensional context, forests systematically led to lower biases on all estimations compared to trees, especially for the estimation of $f$. However, in this high-dimensional setting, (S)MERF and (S)REEMforest performed approximately the same.
We estimated prediction errors of the various methods by randomly selecting test sets in each of the simulated datasets (in the same manner as in the low-dimensional case of the previous section). As illustrated in Figure 3, forests reached very low prediction error estimations compared to trees. This last result was expected, because it is well known that RF performs better than trees for high-dimensional data. It can also be seen that Breiman's RF (which assumes independence between all observations in the data) is competitive with trees, especially with (S)MERT. Hence, in that case, the gain of using RF instead of trees roughly compensates the fact that Breiman's RF does not take into account the longitudinal feature of the data.
Finally, variable importance (VI) scores computed with the RF returned after convergence of the REEMforest method are plotted in decreasing order of VI in Figure 4 (only the 65 most important variables are plotted for the sake of clarity). This graph shows that the most important variables belong to one of the first three groups of explanatory variables. This result is satisfactory because the mean behavior function (defined by Equation (8)) depends on variables that belong to the first two groups only, and the third group is very close to the second one in terms of dynamics (see Equation (4)).
6 Application to the DALIA vaccine trial
DALIA is a therapeutic phase 1/2 vaccine trial including 19 HIV-infected patients who received an HIV vaccine before interruption of their antiretroviral treatment (HAART). For a full description of the DALIA vaccine trial, we refer to Lévy et al. (2014).
At each harvest time, 32979 gene transcripts were measured, as well as the plasma HIV RNA concentration (which was log-transformed), for every patient. In this application, we were interested in finding the gene transcripts associated with the HIV viral load dynamics after antiretroviral treatment interruption. The analysis was performed on the 17 patients with available data at the time of treatment interruption.
Figure 5 illustrates the dynamics of the viral replication after antiretroviral treatment interruption, with a large between-individual variability.
A random intercept and a Gaussian process were included in the model:

$Y_{ij} = f(X_{ij}) + b_i + \omega_i(t_{ij}) + \varepsilon_{ij}, \qquad (9)$

and we will refer to methods only using a random intercept (i.e., without the stochastic process) as non-stochastic methods (MERF and REEMforest in the following).
Prediction errors were evaluated with 25 random training/test set splits. As in the simulation study, a test set was obtained by randomly drawing two observations for each individual. We chose the stochastic process (between an Ornstein-Uhlenbeck process and a fractional Brownian motion) that minimized the estimated prediction error. Hence, the fractional Brownian motion with Hurst exponent $H = 1/2$, which is the standard Brownian motion, was selected. Finally, the mtry parameter was fixed in all experiments of this section according to the likelihood profile (Figure 6).
As illustrated in Figure 7, Breiman's RF was comparable in terms of prediction error with MERF and REEMforest, which only included a random intercept. However, SMERF and SREEMforest outperformed simple RF and trees, with a slight advantage for SREEMforest. This confirms, on this real dataset, that taking precisely into account the longitudinal aspect of the data in RF leads to a significant drop of the prediction error.
6.1 Variable selection using random forests
Once the algorithm (e.g., SREEMforest) has converged, a variable selection process is applied to select the genes most associated with the viral load dynamics. More precisely, the last estimations of $\widehat{b}_i$ and $\widehat{\omega}_i$ (which are outputs of the algorithm) are subtracted from the output variable $Y_{ij}$, for all $i, j$ (as in step 1 of Algorithm 1), to come back to a classical regression framework (i.e., with independent observations). Hence, the Variable Selection Using Random Forests method from Genuer et al. (2010) can be applied by using the VSURF package (Genuer et al., 2015).
This method is a fully automatic variable selection procedure based on RF, designed to deal with high-dimensional data in a regression framework as well as in supervised classification. It works in three steps: i) first, the variables are sorted in decreasing order of RF variable importance (VI), then a data-driven threshold is computed to eliminate variables with low VI; ii) the remaining variables are then introduced (one by one, according to the previous order) in nested RF models, and the model minimizing the OOB error is selected; iii) a refined ascending sequence of RF models (obtained in a stepwise way) is then built, and the last model of this sequence is finally returned.
6.2 Stability of the selected variables set
We illustrate the stability of the selected variables set by introducing a stability score and studying the behavior of this score against the RF parameter mtry.
Let two sequences of variables be sorted in decreasing order of the variable importance obtained with two runs of the SREEMforest method. Due to the randomness of the RF, SREEMforest is random and the two ordered sequences may differ. Hence, we introduce a stability score which measures, for a neighborhood size $d$, the proportion of variables whose ranks in the two ordered sequences differ by at most $d$.
This score measures the proportion of variables ranked in the same neighborhood ($d$ handles the size of the neighborhood). To stabilize the results, the score was computed over 30 pairs of sequences and the mean of the obtained stability scores is reported.
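One concrete reading of this score can be sketched in Python, counting the proportion of the top-ranked variables whose ranks in two runs differ by at most $d$ (function name and toy sequences are ours):

```python
def stability_score(seq1, seq2, d=4, top=50):
    """Proportion of the `top` most important variables of seq1 whose rank
    in seq2 differs by at most d (one reading of the Section 6.2 score)."""
    rank2 = {v: r for r, v in enumerate(seq2)}
    hits = sum(1 for r1, v in enumerate(seq1[:top])
               if v in rank2 and abs(r1 - rank2[v]) <= d)
    return hits / top

# Toy check: run 2 shifts every variable's rank by one place.
a = list(range(50))                   # variables sorted by VI, run 1
b = list(range(1, 50)) + [0]          # run 2
score = stability_score(a, b, d=4)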
The computation of these stability scores was restricted to the 50 most important variables given by different runs of SREEMforest applied to the DALIA vaccine trial dataset. In Figure 8, we note that, except for mtry set to 1000, we obtained a stability score around 0.5 for a neighborhood size of 4. This means that for two lists of the 50 most important variables obtained with SREEMforest, approximately 50% of them were at the same rank (up to 4 ranks). For a neighborhood size larger than 8, the score can exceed 75%. In conclusion, for a wide range of mtry values, variable ranking results were quite consistent.
6.3 Biological results
The 21 variables selected by VSURF (after convergence of SREEMforest) were mainly transcripts (OAS, LY6E, HERC5, IFI/IFIT, EPSTI1, MX1, RSAD2, EIF2AK2, XAF1) associated with the interferon pathway. For instance, they all belong to the Chaussabel modules M1.2 and M3.4 annotated "Interferon" (Chaussabel and Baldwin, 2014). The interferon pathway is highly correlated with viral replication, as demonstrated previously (Bosinger et al., 2009). Only two transcripts were not associated with the interferon pathway (EPSTI1 and SAMD9L). The commitment of the interferon pathway reflects the immune response to viral infection. The relevance of these results is another argument for the validation of the proposed approach.
7 Discussion
In this article, we introduced a new RF-based methodology suited for the analysis of high-dimensional longitudinal data. We also generalized existing methods so they can now be used in the stochastic semiparametric mixed-effects model. The simulation study revealed the superiority of our approach and of the generalizations we introduced, and the method has been applied to a complex dataset coming from an HIV vaccine trial, illustrating the effectiveness and interest of our approach in such a high-dimensional longitudinal context.
One important aspect of the results of our study is the choice of the mtry parameter. Our advice is to choose a large value for mtry, for two reasons: first, as we are in a (very) high-dimensional context, the number of variables selected at each node of the trees must not be too small, to avoid choosing only non-informative variables too often; and second, since the different approaches are based on an iterative EM algorithm, the inner RF model must be stabilized, preventing the EM algorithm from diverging. Taking a too small value for mtry could lead to the non-convergence of the method and hence to very suboptimal results, as illustrated by Figure 6.
Another key step in these approaches is the choice of the model, and more particularly the choice of the random effects. Driven by our application, we only used a random intercept (in addition to the stochastic process), given the number of individuals and the number of time points in the vaccine trial. However, in a context with more individuals and/or fewer time points, it could be interesting to add random effects on the different time points. This should make the model more flexible and hence increase the capacity of the method to accurately estimate the inter-individual variability.
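The model used in the application, a random intercept plus a stochastic process, can be written as follows. This is a sketch with assumed notation, in the spirit of Zhang et al. (1998); the paper's exact symbols may differ:

```latex
Y_i(t_{ij}) = f\bigl(X_i(t_{ij})\bigr) + b_i + \omega_i(t_{ij}) + \varepsilon_{ij},
\qquad j = 1, \dots, n_i, \quad i = 1, \dots, n,
```

with $b_i \sim \mathcal{N}(0, \sigma_b^2)$ the random intercept, $\omega_i(\cdot)$ a centered Gaussian stochastic process whose covariance structure may vary over time, $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$ the measurement error, and $f$ the nonparametric mean function estimated by random forests. Adding random effects on time would amount to replacing $b_i$ with $Z_i(t_{ij})^\top b_i$ for a suitable design vector $Z_i$.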
There are several issues that require further research. First, the theoretical analysis of such complex methods (nonparametric estimates plugged into an EM algorithm) seems rather difficult and remains, to the best of our knowledge, an open issue. Second, on the methodological side, another way of adapting RF to longitudinal data could be to change the split criterion used when building the trees, as has been done, e.g., for survival RF (Ishwaran et al., 2008).
Appendix

The estimation algorithm alternates between the two following steps until convergence:

1. For given …, estimate … on the model …; build the matrices … for every tree composing ….
2. For given … and …, update … and …,

with ….
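The alternating scheme outlined in the appendix (estimate the forest on the outcome minus the current random-effect predictions, then update those predictions and the variance parameters) can be sketched under strong simplifying assumptions. The sketch below assumes a random intercept only, with no stochastic process, and uses moment-style variance updates rather than the paper's exact EM updates; the function name `fit_reemforest` and its parameters are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_reemforest(X, y, groups, n_iter=5, sigma2=1.0, B=1.0):
    """Illustrative REEMforest-style iteration: random intercept only,
    simplified variance updates (NOT the paper's exact EM updates)."""
    b = {g: 0.0 for g in np.unique(groups)}
    rf = None
    for _ in range(n_iter):
        # Step 1: remove the current random-intercept predictions and
        # fit the forest on the resulting working outcome.
        resid = y - np.array([b[g] for g in groups])
        rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, resid)
        marginal = y - rf.predict(X)
        # Step 2: BLUP-type update of each group's intercept; for a single
        # random intercept, b_i = B * n_i * mean(r_i) / (n_i * B + sigma2).
        for g in b:
            r_g = marginal[groups == g]
            n_g = r_g.size
            b[g] = B * n_g * r_g.mean() / (n_g * B + sigma2)
        # Update the variance parameters from residuals and intercepts.
        eps = marginal - np.array([b[g] for g in groups])
        sigma2 = float(np.mean(eps ** 2))
        B = float(np.mean([v ** 2 for v in b.values()])) + 1e-8
    return rf, b, sigma2, B
```

On synthetic data with a strong group-level intercept, the estimated intercepts track the true ones closely after a few iterations, since the forest, which only sees the covariates, cannot absorb the group effects.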
References
 Biau, G. and Scornet, E. (2016). A random forest guided tour. Test, 25(2):197–227.
 Bosinger, S. E., Li, Q., Gordon, S. N., Klatt, N. R., Duan, L., Xu, L., Francella, N., Sidahmed, A., Smith, A. J., Cramer, E. M., et al. (2009). Global genomic analysis reveals rapid control of a robust innate response in SIV-infected sooty mangabeys. The Journal of Clinical Investigation, 119(12):3556–3572.
 Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
 Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Chapman & Hall.
 Chaussabel, D. and Baldwin, N. (2014). Democratizing systems immunology with modular transcriptional repertoire analyses. Nature Reviews Immunology, 14(4):271.
 Chen, X. and Ishwaran, H. (2012). Random forests for genomic data analysis. Genomics, 99(6):323–329.
 Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443):935–948.
 Cutler, D. R., Edwards, T. C., Beard, K. H., Cutler, A., Hess, K. T., Gibson, J., and Lawler, J. J. (2007). Random forests for classification in ecology. Ecology, 88(11):2783–2792.
 Diggle, P. J. and Hutchinson, M. F. (1989). On spline smoothing with autocorrelated errors. Australian & New Zealand Journal of Statistics, 31(1):166–182.
 Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15(1):3133–3181.
 Genuer, R., Poggi, J.-M., and Tuleau, C. (2008). Random forests: some methodological insights. arXiv preprint arXiv:0811.3619.
 Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14):2225–2236.
 Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2015). VSURF: an R package for variable selection using random forests. The R Journal, 7(2):19–33.
 Hajjem, A., Bellavance, F., and Larocque, D. (2011). Mixed effects regression trees for clustered data. Statistics & Probability Letters, 81(4):451–459.
 Hajjem, A., Bellavance, F., and Larocque, D. (2014). Mixed-effects random forest for clustered data. Journal of Statistical Computation and Simulation, 84(6):1313–1328.
 Hejblum, B. P., Skinner, J., and Thiébaut, R. (2015). Time-course gene set analysis for longitudinal gene expression data. PLoS Computational Biology, 11(6):e1004310.
 Hothorn, T., Bühlmann, P., Dudoit, S., Molinaro, A., and Van Der Laan, M. J. (2005). Survival ensembles. Biostatistics, 7(3):355–373.
 Ishwaran, H., Kogalur, U. B., Blackstone, E. H., Lauer, M. S., et al. (2008). Random survival forests. The Annals of Applied Statistics, 2(3):841–860.
 Ishwaran, H., Kogalur, U. B., Gorodeski, E. Z., Minn, A. J., and Lauer, M. S. (2010). High-dimensional variable selection for survival data. Journal of the American Statistical Association, 105(489):205–217.
 Jacqmin-Gadda, H., Joly, P., Commenges, D., Binquet, C., and Chêne, G. (2002). Penalized likelihood approach to estimate a smooth mean curve on longitudinal data. Statistics in Medicine, 21(16):2391–2402.
 Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, pages 963–974.
 Lévy, Y., Thiébaut, R., Montes, M., Lacabaratz, C., Sloan, L., King, B., Pérusat, S., Harrod, C., Cobb, A., Roberts, L. K., et al. (2014). Dendritic cell-based therapeutic vaccine elicits polyfunctional HIV-specific T-cell immunity associated with control of viral load. European Journal of Immunology, 44(9):2802–2810.
 Linero, A. R. (2018). Bayesian regression trees for high-dimensional prediction and variable selection. Journal of the American Statistical Association, 113(522):626–636.
 McCulloch, C. E. (1997). Maximum likelihood algorithms for generalized linear mixed models. Journal of the American Statistical Association, 92(437):162–170.
 Mentch, L. and Hooker, G. (2016). Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. The Journal of Machine Learning Research, 17(1):841–881.
 Scornet, E., Biau, G., and Vert, J.-P. (2015). Consistency of random forests. The Annals of Statistics, 43(4):1716–1741.
 Segal, M. R. (1992). Tree-structured methods for longitudinal data. Journal of the American Statistical Association, 87(418):407–418.
 Sela, R. J. and Simonoff, J. S. (2012). RE-EM trees: a data mining approach for longitudinal and clustered data. Machine Learning, 86(2):169–207.
 Statnikov, A., Wang, L., and Aliferis, C. F. (2008). A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics, 9(1):319.
 Steingrimsson, J. A., Diao, L., and Strawderman, R. L. (2018). Censoring unbiased regression trees and ensembles. Journal of the American Statistical Association, 0(0):1–14.
 Verbeke, G. and Molenberghs, G. (2009). Linear Mixed Models for Longitudinal Data. Springer.
 Wager, S. (2014). Asymptotic theory for random forests. arXiv preprint arXiv:1405.0352.
 Wang, Y. (1998). Smoothing spline models with correlated random errors. Journal of the American Statistical Association, 93(441):341–348.
 Wu, H. and Zhang, J.-T. (2006). Nonparametric Regression Methods for Longitudinal Data Analysis: Mixed-Effects Modeling Approaches. John Wiley & Sons.
 Zhang, D., Lin, X., Raz, J., and Sowers, M. (1998). Semiparametric stochastic mixed models for longitudinal data. Journal of the American Statistical Association, 93(442):710–719.
 Zhu, R., Zeng, D., and Kosorok, M. R. (2015). Reinforcement learning trees. Journal of the American Statistical Association, 110(512):1770–1784.