Title

# Title

Firstname Lastname
Affiliation
Sarah Brockhaus
LMU Munich &David Rügamer
LMU Munich &Sonja Greven
LMU Munich

# Boosting Functional Regression Models with FDboost

Firstname Lastname
Affiliation
Sarah Brockhaus
LMU Munich &David Rügamer
LMU Munich &Sonja Greven
LMU Munich
###### Abstract

The R add-on package FDboost is a flexible toolbox for the estimation of functional regression models by model-based boosting. It provides the possibility to fit regression models for scalar and functional response with effects of scalar as well as functional covariates, i.e., scalar-on-function, function-on-scalar and function-on-function regression models. In addition to mean regression, quantile regression models as well as generalized additive models for location scale and shape can be fitted with FDboost. Furthermore, boosting can be used in high-dimensional data settings with more covariates than observations. We provide a hands-on tutorial on model fitting and tuning, including the visualization of results. The methods for scalar-on-function regression are illustrated with spectrometric data of fossil fuels and those for functional response regression with a data set including bioelectrical signals for emotional episodes.

Boosting Functional Regression Models with FDboost

Sarah Brockhaus LMU Munich                        David Rügamer LMU Munich                        Sonja Greven LMU Munich

Keywords: functional data analysis, function-on-function regression, function-on-scalar regression, gradient boosting, model-based boosting, scalar-on-function regression.

## 1 Introduction

With the progress of technology today, we have the ability to observe more and more data of a functional nature, such as curves, trajectories or images (Ramsay and Silverman 2005). Functional data can be found in many scientific fields like demography, biology, medicine, meteorology and economics (see, e.g., Ullah and Finch 2013). In practice, the functions are observed on finite grids. In this paper, we deal with one-dimensional functional data that are observed over a real valued interval. Examples for such data are growth curves over time, acoustic signals, temperature curves and spectrometric measurements in a certain range of wavelengths. Regression models are a versatile tool for data analysis and various models have been proposed for regression with functional variables; see Morris (2015) and Greven and Scheipl (2017) for recent reviews of functional regression models. One can distinguish between three different types of functional regression models: scalar-on-function regression, a regression with scalar response and functional covariates, function-on-scalar regression referring to models with functional response and scalar covariates and function-on-function regression, which is used when both response and covariates are functional. Models for scalar-on-function regression are sometimes also called signal regression.

Greven and Scheipl (2017) lay out a generic framework for functional regression models including the three mentioned model types. Many types of covariate effects are discussed including linear and non-linear effects of scalar covariates as well as linear effects of functional covariates and interaction terms. They describe that estimation can be based on a mixed models framework (Scheipl et al. 2015, 2016) or on component-wise gradient boosting (Brockhaus et al. 2015, 2017). In this paper, we describe the latter approach and provide a hands-on tutorial for its implementation in R (R Core Team 2017) in the comprehensive R package FDboost (Brockhaus and Rügamer 2017).

Boosting estimates the model by iteratively combining simple models and can be seen as a method that conducts gradient descent (Bühlmann and Hothorn 2007). Boosting is capable of estimating models in high-dimensional data settings and implicitly does variable selection. The modeled features of the conditional response distribution can be chosen quite flexibly by minimizing different loss functions. The framework includes linear models (LMs), generalized linear models (GLMs) as well as quantile and expectile regression. Furthermore, generalized additive models for location, scale and shape (GAMLSS, Rigby and Stasinopoulos 2005) can be fitted (Mayr et al. 2012). GAMLSS model all distribution parameters of the conditional response distribution simultaneously depending on potentially different covariates. Brockhaus et al. (2016) discuss GAMLSS with scalar response and functional covariates. Stöcker et al. (2017) introduce GAMLSS for functional response. Due to variable selection and shrinkage of the coefficient estimates, no classical inference concepts are available for the boosted models. However, it is possible to quantify uncertainty by bootstrap (Efron 1979) and stability selection (Meinshausen and Bühlmann 2010). The main advantages of the boosting approach are the possibility to fit models in high dimensional data settings with variable selection and to estimate not only mean regression models but also GAMLSS and quantile regression models. The main disadvantage is the lack of formal inference.

Other frameworks for flexible regression models with functional response exist. Morris and Carroll (2006) and Meyer et al. (2015) use a basis transformations approach and Bayesian inference to model functional variables. Usually, loss-less transformations like a wavelet transformation are used. See Morris (2017) for a detailed comparison of the two frameworks.

In this tutorial, we present the R package FDboost (Brockhaus and Rügamer 2017), which is designed to fit a great variety of functional regression models by boosting. FDboost builds on the R package mboost (Hothorn et al. 2016) for statistical model-based boosting. Thus, in the back-end we rely on a well-tested implementation. FDboost provides a comprehensive implementation of the most important methods for boosting functional regression models. In particular, the package can be used to conveniently fit models with functional response. For effects of scalar covariates on functional responses, we provide base-learners with suitable identifiability constraints. In addition, base-learners that model effects of functional covariates are implemented. The package also contains functions for model tuning and for visualizing results.

As a case study for scalar-on-function regression, we use a dataset on fossil fuels, which was analyzed in Fuchs et al. (2015) and Brockhaus et al. (2015) and is part of the FDboost package. In this application, the heat value of fossil fuels should be predicted based on spectral data. As a case study for function-on-scalar and function-on-function regression, we use the emotion components data set, which is analyzed in Rügamer et al. (2016) in the context of factor-specific historical effect estimation and which is provided in an aggregated version in FDboost.

The remainder of the paper is structured as follows. We shortly review the generic functional regression model (Section 2) for scalar and for functional response. Then the boosting algorithm used for model fitting is introduced in Section 3. In Section 4, we give details on the infrastructure of the package FDboost. Scalar-on-function regression with FDboost is described in Subsection 4.1. Regression models for functional response with scalar and/or functional covariates are described in Subsection 4.2. We present possible covariate effects as well as discuss model tuning and show how to extract and display results. In Subsection 4.3, we discuss regression models that model other characteristics of the response distribution than the mean, in particular median regression and GAMLSS. In Subsection 4.4, we shortly comment on stability selection in combination with boosting. We conclude with a discussion in Section 5. The paper is structured such that the subsections on functional response can be skipped if one is only interested in scalar-on-function regression.

## 2 Functional regression models

In Subsection 2.1 we first introduce a generic model for scalar response with functional and scalar covariates. Afterwards, we deal with models with functional response in Subsection 2.2.

### 2.1 Scalar response and functional covariates

Let the random variable be the scalar response with realization . The covariate set can include both scalar and functional variables. We denote a generic scalar covariate by and a generic functional covariate by , with and , . We assume that we observe data pairs , where comprises the realizations of scalar covariates as well as the realizations of . In practice, is observed on a grid of evaluation points , such that each curve is observed as a vector . While different functional covariates may be observed on different grid points over different intervals, we do no introduce additional indices here for ease of notation.

We model the expectation of the response by an additive regression model

 E(Yi|Xi=xi)=h(xi)=J∑j=1hj(xi), (1)

where is the additive predictor containing the additive effects . Each effect can depend on one or more covariates in . Possible effects include linear, non-linear and interaction effects of scalar covariates as well as linear effects of functional covariates. Moreover, group-specific effects and interaction effects between scalar and functional variables are possible. To give an idea of possible effects , Table 1 lists effects of functional covariates that are currently implemented in FDboost.

The effects are linearized using a basis representation:

 hj(xi)=bj(xi)⊤θj,  j=1,…,J, (2)

with basis vector and coefficient vector that has to be estimated. The design matrix for the th effect consists of rows for all observations . A ridge-type penalty term is used for regularization, where is a suitable penalty matrix for and is a non-negative smoothing parameter. The smoothing parameter controls the effective degrees of freedom of the effect.

Consider, for example, a linear effect of a functional covariate . Using , this effect is computed as

 ∫Sxi(s)β(s)ds =∫Sxi(s)Kj∑k=1ϕk(s)θjk≈β(s)ds ≈R∑r=1(Δ(sr)xi(sr)Kj∑k=1ϕk(sr)θjk) =Kj∑k=1(R∑r=1Δ(sr)xi(sr)ϕk(sr)entries in bj(xi)θjk) =bj(xi)⊤θj,

where first, the smooth effect is expanded in basis functions, second, the integration is approximated by a weighted sum and, third, the terms are rearranged such that they fit into the scheme . The basis is thus computed as

 bj(xi)⊤ =[R∑r=1Δ(sr)xi(sr)ϕ1(sr) ⋯ R∑r=1Δ(sr)xi(sr)ϕKj(sr)] (3) ≈[∫Sxi(s)ϕ1(s)ds ⋯ ∫Sxi(s)ϕKj(s)ds],

with spline functions , , for the expansion of the smooth effect in direction and integration weights for numerical computation of the integral. The penalty matrix  is chosen such that it is suitable to regularize the splines . To set up a P-spline basis (Eilers and Marx 1996) for the smooth effect, in (3) are B-splines and the penalty  is a squared difference matrix.

#### 2.1.1 Case study: Heat value of fossil fuels

The aim of this application is to predict the heat value of fossil fuels using spectral data (Fuchs et al. 2015, Siemens AG). For samples, the dataset contains the heat value, the percentage of humidity and two spectral measurements, which can be thought of as functional variables observed over and observed over . One spectrum is ultraviolet-visible (UVVIS), the other a near infrared spectrum (NIR). For both spectra, the observation points are not equidistant. The dataset is contained in the R package FDboost. \MakeFramed\@setminipage

# load the data
data("fuelSubset", package = "FDboost")

# look at the structure of the data
str(fuelSubset)


List of 7
$heatan : num [1:129] 26.8 27.5 23.8 18.2 17.5 ...$ h2o         : num [1:129] 2.3 3 2 1.85 2.39 ...
$nir.lambda : num [1:231] 800 803 805 808 810 ...$ NIR         : num [1:129, 1:231] 0.2818 0.2916 -0.0042 -0.034 -0.1804 ...
$uvvis.lambda: num [1:134] 250 256 261 267 273 ...$ UVVIS       : num [1:129, 1:134] 0.145 -1.584 -0.814 -1.311 -1.373 ...
$h2o.fit : num [1:129] 2.58 3.43 1.83 2.03 3.07 ...  \endMakeFramed Figure 1 shows the two spectral measurements colored according to the heat value. Predictive models for the heat values, discussed in the next sections, will include scalar-on-function terms to accommodate the spectral covariates. ### 2.2 Functional response We denote the functional response by , where is the evaluation point at which the function is observed. We assume that , where is a real-valued interval , for example a time-interval. All response curves can be observed on one common grid or on curve-specific grids. For responses observed on one common grid, we write for the observations, with denoting the grid of evaluation points. For curve-specific evaluation points, the observations are denoted by , with . As above, the covariate set can contain both scalar and functional variables. As in model (1), we model the conditional expectation of the response. In this case, the expectation is modeled for each point :  E(Yi(t)|Xi=xi)=h(xi,t)=J∑j=1hj(xi,t). (4) As the response is a function of , the linear predictor as well as the additive effects are functions of . Each effect can depend on one or more covariates in as well as on . To give an idea of possible effects , Table 2 lists some effects that are currently implemented. All effects mentioned in Table 2 are varying over but can also be modeled as constant in . The upper part of the table contains linear, smooth and interaction effects for scalar covariates. The middle part of the table gives possible effects of functional covariates and interaction effects between scalar and functional covariates. The lower part of the table in addition shows some group-specific effects. In practice, all effects are linearized using a basis representation (Brockhaus et al. 2017):  hj(xi,tig)=bjY(xi,tig)⊤θj, j=1,…,J, (5) where the basis vector depends on covariates and the observation-point of the response . 
The corresponding coefficient vector has to be estimated. The design matrix for the th effect consists of rows for all observations and all time-points , . In the following, we will use a modularization of the basis into a first part depending on covariates and a second part that only depends on . This modular structure reduces the problem of specifying the basis to that of creating two suitable marginal bases. For many effects, the marginal bases are easy to define as they are known from regression with scalar response. First, we focus on responses observed on one common grid which does not depend on . In this case, we represent the effects using the Kronecker product of two marginal bases (Brockhaus et al. 2015)  hj(xi,tg)=(bj(xi)⊤⊗bY(tg)⊤)θj, (6) where the marginal basis vector , , depends on covariates in and the marginal basis vector , , depends on the grid point . The design matrix is computed as the Kronecker product of the two marginal design matrices, which have dimensions and . If the effect can be represented as in (6) it fits into the framework of linear array models (Currie et al. 2006). The representation as array model has computational advantages, saving time and memory. Brockhaus et al. (2015) discuss array models in the context of functional regression. Note that representation (6) is only possible for responses observed on one common grid, as otherwise depends on the curve-specific grid points . In this case, the marginal bases are combined by the row-wise tensor product (Scheipl et al. 2015; Brockhaus et al. 2017). This is a rather technical detail and is thoroughly explained in Brockhaus et al. (2017), also for the case where the basis for the covariates depends on such as for historical effects. We regularize the effects by a ridge-type penalty term . The penalty matrix for the composed basis can be constructed as (Wood 2006, Sec. 4.1.8)  PjY=λj(Pj⊗IKY)+λY(IKj⊗PY), (7) where is a suitable penalty for and is a suitable penalty for . 
The non-negative smoothing parameters and determine the degree of smoothing in each direction. The anisotropic penalty in (7) can be simplified in the case of an isotropic penalty depending on only one smoothing parameter :  PjY=λj(Pj⊗IKY+IKj⊗PY). (8) In this simplified case only one instead of two smoothing parameters has to be estimated. If in (8), this results in a penalty that only penalizes the marginal basis in direction:  PjY=λj(IKj⊗PY). (9) Consider, for example, a linear effect of a functional covariate . The basis vector and the penalty are the same as in (3). For the basis in direction, we use a spline representation  bY(tg)⊤=[ϕ1(tg)⋯ϕKY(tg)] (10) with spline functions , and the penalty matrix has to be chosen such that it is suitable for the chosen spline basis. Using P-splines again, are B-splines and is a squared difference matrix (Eilers and Marx 1996). The complete basis is  bj(xi)⊤⊗bY(tg)⊤=[∫Sxi(s)ϕ1(s)ds⋯∫Sxi(s)ϕKj(s)ds]⊗[ϕ1(tg)⋯ϕKY(tg)]. This choice expands in a tensor-product spline basis and approximates the integral using numerical integration. #### 2.2.1 Case study: Emotion components data with EEG and EMG The emotion components data set is based on a study of Gentsch et al. (2014), in which brain activity (EEG) as well as facial muscle activity (EMG) was simultaneously recorded during a computerised game. As the facial muscle activity should be traceable to the brain activity for a certain game situation, Rügamer et al. (2016) analyzed the synchronization of EEG and EMG signal using function-on-function regression models with factor-specific historical effects. 
During the gambling rounds, three binary game conditions were varied, resulting in a total of different study settings: • the goal conduciveness (game_outcome) corresponding to the monetary outcome (gain or loss) at the end of each game round, • the power setting, which determined whether the player was able or not able to change the final outcome in her favor (high or low, respectively) and • the control setting, which was manipulated to change the participant’s subjective feeling about her ability to cope with the game outcome. The player was told to frequently have high power in rounds with high control and have frequently low power in low control situations. We focus on the EMG of the frontalis muscle, which is used to raise the eyebrow. The EMG signal is a functional response , with ms, which is measured at a frequency of resulting in equidistant observed time points given by the vector t. The experimental conditions are scalar covariates. The EEG signal is observed over the same time interval as the EMG signal. We use the EEG signal from the Fz electrode, which is in the center front of the head. In the following, we consider an aggregated version of the data, in which the EEG and EMG signals are aggregated per subject and game condition. One participant is excluded, yielding subjects. \MakeFramed\@setminipage # load the data data("emotion", package = "FDboost") # look at the structure of the data str(emotion)  List of 8$ power       : Factor w/ 2 levels "high","low": 1 1 2 2 1 1 2 2 1 1 ...
$game_outcome: Factor w/ 2 levels "gain","loss": 1 2 1 2 1 2 1 2 1 2 ...$ control     : Factor w/ 2 levels "high","low": 1 1 1 1 2 2 2 2 1 1 ...
$subject : Factor w/ 23 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 2 2 ...$ EEG         : num [1:184, 1:384] -0.14 -0.318 -0.497 -0.601 -0.61 ...
$EMG : num [1:184, 1:384] -2.5581 -0.0671 2.9337 1.4511 -0.881 ...$ s           : int [1:384] 1 2 3 4 5 6 7 8 9 10 ...
$t : int [1:384] 1 2 3 4 5 6 7 8 9 10 ...  \endMakeFramed In order to fit simple and meaningful models for function-on-function regression, we define a subset of the data that contains only the observations for a certain game condition. We use the game condition with high control, gain and low power: \MakeFramed\@setminipage # define subset for a certain game condition subset <- emotion$control == "high" &
emotion$game_outcome == "gain" & emotion$power == "low"

emotionHGL <- list()

# subset scalar variables
emotionHGL$subject <- emotion$subject[subset]

# subset functional variables
emotionHGL$EMG <- emotion$EMG[subset,]
emotionHGL$EEG <- emotion$EEG[subset,]

# keep the evaluation points
emotionHGL$s <- emotionHGL$t <- emotion$t  \endMakeFramed In Figure 2 the EEG and EMG signal is depicted for each of the 23 participants and the 384 observation points. ## 3 Estimation by gradient boosting Boosting originates from the field of machine learning and aims at combining many weak learners to form a single strong learner (e.g., Friedman et al. 2000; Schapire and Freund 2012). In the boosting context, the weak learners are called base-learners. In our context, each effect corresponds to one base-learner, leading to base-learners in total. Boosting was originally designed for binary classification problems and was extended in various directions (see, e.g., Mayr et al. 2014b). Nowadays it is also used to estimate statistical models (Mayr et al. 2014a). Component-wise gradient boosting minimizes the expected loss (risk) via gradient descent in a step-wise procedure. In each boosting step, each base-learner is fitted separately to the negative gradient and only the best fitting base-learner is selected for the model update; hence the term ’component-wise’. To fit a model for the expectation, like models (1) and (4), the squared error loss ( loss) is minimized. In this case, the negative gradient corresponds to the residuals. In each step, the current additive predictor is updated, such that the estimated effect of the selected base-learner is added by a proportion of the estimate in this iteration controlled by the step-length . A typical choice is (see, e.g., Bühlmann and Hothorn 2007). Usually the algorithm is stopped before convergence, which is called early stopping. This leads to regularized effect estimates and therefore yields more stable predictions. Since some of the base-learners are never selected in the course of all iterations, boosting also performs variable selection. The optimal stopping iteration can be determined by resampling methods like cross-validation, sub-sampling or bootstrap. 
For each fold, the empirical out-of-bag risk is computed and the stopping iteration that yields the lowest empirical risk is chosen. As resampling must be conducted on the level of independent observations, this is done on the level of curves for functional response. By representing all base-learners as linear effects of covariates (if necessary, by using a basis representation for non-linear effects), base-learners also define the covariate effects in the sense of additive regression models and can be associated with a specific hat matrix as well as a certain number of degrees of freedom. To ensure a fair selection of base-learners within the course of all iterations, it is important to specify equal degrees of freedom for each base-learner. If base-learners exhibit different degrees of freedom, selection is biased towards more flexible base-learners with higher degrees of freedom as they are more likely to yield larger improvements of the fit in each iteration (see Hofner et al. 2011, for details). It is recommended to use a rather small number of degrees of freedom for all base-learners to work with weak learners (Kneib et al. 2009; Hofner et al. 2011). While the pre-defined degrees of freedom of each base-learner are fixed and relatively small, the additive model components in the final model fit adapt to the necessary complexity and can have various effective degrees of freedom depending on the frequency of their (repeated) selection. #### 3.0.1 Functional Response To adapt boosting for a functional response, we compute the loss at each point and integrate it over the domain of the response (Brockhaus et al. 2015). For the loss the optimization problem for functional response aims at minimizing  N∑i=1∫[yi(t)−h(xi,t)]2dt, (11) which is approximated by numerical integration. To obtain identifiable models, suitable identifiability constraints for the base-learners are necessary and implemented. 
FDboost also contains base-learners that model the effects of functional covariates. For a discussion of both points, please see Brockhaus et al. (2015). ## 4 The package FDboost Fitting functional regression models via boosting is implemented in the R package FDboost. The package uses the fitting algorithm and other infrastructure from the R package mboost (Hothorn et al. 2016). All base-learners and distribution families that are implemented in mboost can be used within FDboost. Many naming conventions and methods in FDboost are implemented in analogy to mboost. Thus, we recommend users of FDboost to first familiarize themselves with mboost. A tutorial for mboost can be found in Hofner et al. (2014). The main fitting function to estimate functional regression models, like models (1) and (4), is called FDboost(). The interface of FDboost() is as follows:111Note that for the presentation of functions we restrict ourselves to the most important function arguments. For the full list of arguments, we refer to the corresponding manuals. \MakeFramed\@setminipage FDboost(formula, timeformula, id = NULL, numInt = "equal", data, offset = NULL, ...)  \endMakeFramed First, we focus on the arguments that are necessary for regression models both with scalar and with functional response. formula specifies the base-learners for the covariate effects and timeformula specifies , which is the basis along . Per default, this basis is the same for all effects . The data is provided in the data argument as a data.frame or a named list. The data-object has to contain the response, all covariates and the evaluation points of functional variables. Prior to the model fit, an offset is subtracted from the response to center it. This corresponds to initializing the fit with this offset, e.g. an overall average, and leads to faster convergence and better stability of the boosting algorithm. 
For mean regression, by default the offset is the smoothed point-wise mean of the response over time without taking into account covariates. This offset is part of the intercept and corresponds to an initial estimate that is then updated. In the dots-argument, ’...’, further arguments passed to mboost() and mboost_fit() can be specified; see the manual of mboost() for details. An important argument is family determining the loss- and link-function for the model which is fitted. The default is family = Gaussian(), which minimizes the squared error loss and uses the identity as link function. Thus, per default a mean regression model for continuous response is fitted. For the duality of loss-function and the family argument, we refer to Section 4.3. Another important argument is control, which determines the number of boosting iterations and the step-length of the boosting algorithm. #### 4.0.1 Specification for scalar response For scalar response, we set timeformula = NULL as no expansion of the effects in direction is necessary. formula specifies the base-learners for the covariates effects as in (2). The arguments id and numInt are only needed for functional responses. For scalar response, offset = NULL results in a default offset, as, for example, the overall mean for mean regression. #### 4.0.2 Arguments needed for functional response For functional response, the set-up of the covariate effects generally follows (6) by separating the effects into two marginal parts. The marginal effects , , are represented in the formula as y b_1 + b_2 + …+ b_J. The marginal effect is represented in the timeformula, which has the form b_Y. The base-learners for the marginal effects also contain suitable penalty matrices. Internally, the base-learners specified in formula are combined with the base-learner specified in timeformula as in (6) and a suitable penalty matrix is constructed according to (8). Per default, the response is expected to be a matrix. In this case id = NULL. 
The matrix representation is not possible for a response which is observed on curve specific grids. In this case the response is provided as vector in long format and id specifies which position in the vector is attributed to which curve; see section 4.2 for details. The argument numInt provides the numerical integration scheme for computing the integral of the loss over in (11). Per default, numInt = "equal", and thus all integration weights are set to one; for numInt = "Riemann" Riemann sums are used. For functional response, offset = NULL induces a smooth offset varying over . For offset = "scalar", a scalar offset is computed. This corresponds to an offset that is constant along . For more details and the full list of arguments, see the manual of FDboost(). ### 4.1 Scalar response and functional covariates In this subsection, we give details on models with scalar response and functional covariates like model (1). Such models are called scalar-on-function regression models. As case study the data on fossil fuels is used. #### 4.1.1 Potential covariate effects: base-learners In order to fit a scalar-on-function model (1), the timeformula is set to NULL and potential covariate effects are specified in the formula argument. The effects of scalar covariates can be linear or non-linear. A linear effect is obtained using the base-learner bols(), which is also suitable for factor variables, in which case dummy variables are constructed for each factor level (Hofner et al. 2014). Per default, bols() contains an intercept. If the specified degrees of freedom are less than the number of columns in the design matrix, bols() penalizes the linear effect by a ridge penalty with the identity matrix as penalty matrix. The base-learner brandom() for factor variables sets up an effect, which is centered around zero and is penalized by a ridge penalty, having similar properties to a random effect. See the web appendix of Kneib et al. (2009) for a discussion on brandom(). 
A non-linear effect expanded by P-splines is obtained by the base-learner bbs(). For details on base-learners with scalar covariates, we refer to Hofner et al. (2014). Potential base-learners for functional covariates can be seen in Table 3. In this table exemplary linear predictors are listed in the left column. In the right column, the corresponding call to formula is given. For simplicity, only one possible parameterization leads to simple interpretations and one corresponding model call are shown, although FDboost allows to specify several parameterizations. For a linear effect of a functional covariate , two base-learners exist that use different basis expansions. Assuming to be smooth, bsignal() uses a P-spline representation for the expansion of . Assuming that the main modes of variation in the functional covariate are the important directions for the coefficient function , a representation with functional principal components is suitable (Ramsay and Silverman 2005). In the base-learner bfpc(), the coefficient function and the functional covariate are both represented by an expansion in the estimated functional principal components of . As penalty matrix, the identity matrix is used. In Appendix B, technical details on the representation of functional effects are given. The specification of a model with an interaction term between a scalar and a functional covariate is given at the end of Table 3. The interaction term is centered around the main effect of the functional covariate using bolsc for the scalar covariate (as is the linear effect of the scalar covariate around the intercept). Thus, the main effect of the functional covariate has to be included in the model. For more details on interaction effects, we refer to Brockhaus et al. (2015) and Rügamer et al. (2016). The interaction is formed using the operator %X% that builds the row-wise tensor product of the two marginal bases, see Appendix C. 
As explained in Section 3, all base-learners in a model should have equal and rather low degrees of freedom. The number of degrees of freedom that can be given to a base-learner is restricted. On the one hand, the maximum number is bounded by the number of columns of the design matrix (more precisely by the rank of the design matrix). On the other hand, for rank-deficient penalties, the minimum number of degrees of freedom is given by the rank of the null space of the penalty matrix. The interface of bsignal() is as follows: \MakeFramed\@setminipage bsignal(x, s, knots = 10, degree = 3, differences = 1, df = 4, lambda = NULL, check.ident = FALSE)  \endMakeFramed The arguments x and s specify the name of the functional covariate and the name of its argument. knots gives the number of inner knots for the P-spline basis, degree the degree of the B-splines and differences the order of the differences that are used for the penalty. Thus, per default, 14 cubic P-splines with first order difference penalty are used. The argument df specifies the number of degrees of freedom for the effect and lambda the smoothing parameter. Only one of those two arguments can be supplied. If check.ident = TRUE identifiability checks proposed by Scheipl and Greven (2016) for functional linear effects are additionally performed. The interface of bfpc() is: \MakeFramed\@setminipage bfpc(x, s, df = 4, lambda = NULL, pve = 0.99, npc = NULL)  \endMakeFramed The arguments x, s, df and lambda have the same meaning as in bsignal(). The two other arguments allow to control how many functional principal components are used as basis. Per default the number of functional principal components is chosen such that the proportion of the explained variance is 99%. This proportion can be changed using the argument pve (proportion variance explained). Alternatively, the number of components can be set to a specific value using npc (number principal components). 
The interface of bolsc() is very similar to that of bols(), which is laid out in detail in Hofner et al. (2014). In contrast to bols(), bolsc() centers the design matrix such that the resulting linear effect is centered around zero. More details on bolsc() are given in Section 4.2.

```r
bolsc(..., df = NULL, lambda = 0, K = NULL)
```

In the dots argument, ..., one or more covariates can be specified. For factor variables, bolsc() sets up a design matrix in dummy-coding. The arguments df and lambda have the same meaning as above. If lambda > 0, or if df is smaller than the number of columns of the design matrix, a ridge penalty is applied. Per default, K = NULL, the penalty matrix is the identity matrix. Setting the argument K to another matrix allows for customized penalty matrices.

#### Case study (ctd.): Fossil fuel data

For the heat values we fit the model

$$\mathds{E}(Y | x) = \beta_0 + f(z_{\text{H2O}}) + \int_{\mathcal{S}_{\text{NIR}}} x_{\text{NIR}}(s_{\text{NIR}}) \, \beta_{\text{NIR}}(s_{\text{NIR}}) \, ds_{\text{NIR}} + \int_{\mathcal{S}_{\text{UV}}} x_{\text{UV}}(s_{\text{UV}}) \, \beta_{\text{UV}}(s_{\text{UV}}) \, ds_{\text{UV}}, \quad (12)$$

with water content $z_{\text{H2O}}$ and centered spectral curves $x_{\text{NIR}}$ and $x_{\text{UV}}$, which are observed over the wavelength ranges $\mathcal{S}_{\text{NIR}}$ and $\mathcal{S}_{\text{UV}}$. We center the NIR and the UVVIS measurements per wavelength, such that the corresponding functional effects have mean zero. This does not affect the interpretation of $\beta_{\text{NIR}}$ and $\beta_{\text{UV}}$; it only changes the interpretation of the intercept of the regression model. If all effects are centered, the intercept can be interpreted as overall mean and the other effects as deviations from the overall mean. Note that the functional covariates have to be supplied as number of curves by number of evaluation points matrices. The non-linear effect of the scalar variable H2O is specified using the bbs() base-learner. For the linear functional effects of NIR and UVVIS, we use the base-learner bsignal(). The degrees of freedom are set to 4 for each base-learner.
For the functional effects, we use a P-spline basis with 20 inner knots. Because the response is scalar, timeformula = NULL.

```r
# center the functional covariates per
# observed wavelength to center their effects
fuelSubset$UVVIS <- scale(fuelSubset$UVVIS, scale = FALSE)
fuelSubset$NIR <- scale(fuelSubset$NIR, scale = FALSE)

# fit the scalar-on-function regression model
sof <- FDboost(heatan ~ bbs(h2o, df = 4)
    + bsignal(UVVIS, s = uvvis.lambda, knots = 20, df = 4)
    + bsignal(NIR, s = nir.lambda, knots = 20, df = 4),
  timeformula = NULL, data = fuelSubset)
```

#### 4.1.2 Model tuning and early stopping

Boosting iteratively selects base-learners to update the additive predictor. Fixing the base-learners and the step-length, the model complexity is controlled by the number of boosting iterations: with more boosting iterations the model becomes more complex (Bühlmann and Yu 2003). The step-length $\nu$ is chosen sufficiently small in the interval $(0, 1]$, usually as $\nu = 0.1$, which is also the default. For a smaller step-length, more boosting iterations are required and vice versa (Friedman 2001). Note that the default number of boosting iterations is 100. This is arbitrary and in most cases not adequate. The number of boosting iterations and the step-length can be specified in the argument control, which must be supplied as a call to boost_control(). For example, control = boost_control(mstop = 50, nu = 0.2) implies 50 boosting iterations and step-length $\nu = 0.2$.

The most important tuning parameter is the number of boosting iterations. For regression with scalar response, the function cvrisk.FDboost() can be used to determine the optimal stopping iteration. This function directly calls cvrisk.mboost() from the mboost package, which performs an empirical risk estimation using a specified resampling method. The interface of cvrisk.FDboost() is:

```r
cvrisk.FDboost(object,
  folds = cvLong(id = object$id, weights = model.weights(object)),
  grid = 1:mstop(object))
```

In the argument object, the fitted model object is specified. grid defines the grid on which the optimal stopping iteration is searched. Per default, the grid from 1 to the current stopping iteration of the model object is used as search grid, but it is also possible to specify a larger grid, e.g., 1:5000. The argument folds expects an integer weight matrix with dimension (<number of observations> times <number of folds>). Depending on the values in the weight matrix, different types of resampling are performed. For example, if the weights in each column sum to the number of observations but also take values larger than one, the resampling scheme corresponds to bootstrap, while a $B$-fold cross-validation is employed by using an incidence matrix, whose rows each sum to $B - 1$. If not manually specified, mboost and FDboost provide convenience functions, cv() and cvLong(), that construct such matrices on the basis of the given model object. The function cvLong() is suited for functional response and treats scalar response as the special case with one observation per curve. For scalar response, the function cv() from package mboost can be used, which has a simpler interface.

```r
cv(weights, type = c("bootstrap", "kfold", "subsampling"),
   B = ifelse(type == "kfold", 10, 25))
```

The argument weights is used to specify the weights of the original model, which can be extracted using model.weights(object). Usually all model weights are one. Via argument type the resampling scheme is defined: "bootstrap" for non-parametric bootstrap, "kfold" for cross-validation and "subsampling" for resampling half of all observations for each fold. The number of folds is defined by B. Per default, 10 folds are used for cross-validation and 25 folds for bootstrap as well as for subsampling.
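The different weight schemes are easy to illustrate in base R. The following sketch is our own illustration of the two kinds of matrices described above, not the actual cv() implementation:

```r
# Sketch of the two weight-matrix types described above (illustration only)
N <- 6
set.seed(1)

# bootstrap: each column holds resampling counts that sum to N,
# with some weights larger than one
w_boot <- replicate(25, tabulate(sample(N, replace = TRUE), nbins = N))
all(colSums(w_boot) == N)  # TRUE

# 3-fold cross-validation: a 0/1 incidence matrix; in each column one
# fold is held out, so every row sums to B - 1 = 2
fold_id <- sample(rep(1:3, length.out = N))
w_kfold <- sapply(1:3, function(k) as.integer(fold_id != k))
all(rowSums(w_kfold) == 2)  # TRUE
```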

The function cvLong() is especially suited for functional response and has the additional argument id, which is used to specify which observations belong to the same response curve. For scalar response, id = 1:N.
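The expansion from curve-level weights to the long format can be sketched in base R. This is our own illustration of the idea behind cvLong(), not its implementation: each curve's resampling weight is simply repeated for all of its observations via the id vector.

```r
# Expanding curve-level resampling weights to long format via id (sketch)
id <- rep(1:3, times = c(2, 3, 4))  # 3 curves with 2, 3 and 4 observations
w_curve <- c(1, 0, 2)               # one resampling weight per curve
w_long <- w_curve[id]               # weight repeated for each observation
w_long  # 1 1 0 0 0 2 2 2 2
```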

#### Case study (ctd.): Fossil fuel data

To tune the scalar-on-function regression model (12), we search the optimal stopping iteration via bootstrap with 10 folds. First, the bootstrap folds are created using the function cv(). Second, for each bootstrap fold, the out-of-bag risk is computed for models with 1 to 1000 boosting iterations. The choice of the grid is independent of the number of boosting iterations of the fitted model object.

```r
# create folds for 10-fold bootstrap for the model object "sof"
set.seed(123)
folds_sof <- cv(weights = model.weights(sof), type = "bootstrap", B = 10)

# compute out-of-bag risk on each fold for 1 to 1000 boosting iterations
cvm_sof <- cvrisk(sof, folds = folds_sof, grid = 1:1000)
```

The object cvm_sof contains the out-of-bag risk of each fold for all iterations.

#### 4.1.3 Methods to extract and visualize results from the resampling object

For a cvrisk-object as created by cvrisk(), the method mstop() extracts the estimated optimal number of boosting iterations, which corresponds to the number of boosting iterations yielding the minimal mean out-of-bag risk. plot() generates a plot of the estimated out-of-bag risk per stopping iteration in each fold. In addition, the mean out-of-bag risk per stopping iteration is displayed. The estimated optimal stopping iteration is marked by a dashed vertical line. In such a plot, the convergence behavior can be graphically examined.
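The selection rule can be sketched in base R: given a folds-by-iterations matrix of out-of-bag risks (here random toy values, not output of an actual model fit), the estimated optimal stopping iteration is the column with minimal mean risk.

```r
# Toy illustration of how the optimal mstop is read off a risk matrix
set.seed(1)
oob_risk <- matrix(runif(10 * 100), nrow = 10)  # 10 folds, 100 iterations
mean_risk <- colMeans(oob_risk)                 # mean out-of-bag risk per mstop
best_mstop <- which.min(mean_risk)              # estimated optimal iteration
```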

#### Case study (ctd.): Fossil fuel data

We generate a plot that displays the estimated out-of-bag risk per stopping iteration for each fold; see Figure 3.

```r
# plot the out-of-bag risk for all bootstrap folds
plot(cvm_sof, ylim = c(2, 15))
```

For small numbers of boosting iterations, the out-of-bag risk declines sharply with a growing number of boosting iterations. With more and more iterations the model gets more complex and the out-of-bag risk starts to slowly increase. The dashed vertical line marks the estimated optimal stopping iteration of 511, which can be accessed using the function mstop():

```r
# get estimated optimal stopping iteration
mstop(cvm_sof)

[1] 511
```

Fitted FDboost objects inherit methods from class mboost. Thus, all methods available for mboost objects can also be applied to models fitted by FDboost(). The design and penalty matrices constructed by the base-learners can be extracted using the extract() function. For example, extract(object, which = 1) returns the design matrix of the first base-learner and extract(object, which = 1, what = "penalty") the corresponding penalty matrix. The number of boosting iterations of an FDboost object can be changed afterwards using the subset operator; e.g., object[50] sets the number of boosting iterations of object to 50. Note that the subset operator directly changes object, and hence no assignment is necessary. One can access the estimated coefficients by the coef() function. For smooth effects, coef() returns the estimated effects evaluated on a regular grid; the spline coefficients of smooth effects can be obtained by object$coef(). The estimated effects can be graphically displayed by the plot() function. The coefficient plots can be customized by various arguments. For example, coefficient surfaces can be displayed as image plots, setting pers = FALSE, or as perspective plots, setting pers = TRUE. To plot only some of the base-learners, the argument which can be used. For instance, plot(object, which = c(1, 3)) plots the estimated effects of the first and the third base-learner. The fitted values and predictions for new data can be obtained by the methods fitted() and predict(), respectively.

#### Case study (ctd.): Fossil fuel data

In order to continue working with the optimal model, we set the number of boosting iterations to the estimated optimal value.

```r
# number of boosting iterations is set to the estimated optimal mstop
sof <- sof[mstop(cvm_sof)]
# note that the following call also directly changes sof
sof[mstop(cvm_sof)]
```

Then, we use plot() to display the estimated effects.
Per default, plot() only displays effects of base-learners that were selected at least once. See Figure 4 for the resulting plots.

```r
par(mfrow = c(1, 3))
# plot the effects of all selected base-learners
plot(sof, ask = FALSE, ylab = "")
```

The mean heat value is estimated to be higher for higher water content (see Figure 4, left). High values of the UVVIS spectrum at wavelengths around 500 and 850 nm are associated with higher heat values, while high values at wavelengths around 300 and 750 nm are associated with lower heat values (see Figure 4, middle). The effect of the NIR spectrum can be interpreted analogously.

#### 4.1.5 Bootstrapped coefficient estimates

To obtain a measure of the uncertainty associated with the estimated coefficient functions, one can employ a nested bootstrap, in which the optimal number of boosting iterations in each bootstrap fold is estimated by an inner resampling procedure. As boosting shrinks coefficients towards zero due to early stopping, the bootstrapped coefficients are shrunken towards zero as well. Thus, the resulting bootstrap “confidence” intervals are biased towards zero but still capture the variability of the coefficient estimates. In FDboost, the function bootstrapCI() can be used to conveniently compute bootstrapped coefficients:

```r
bootstrapCI(object, B_outer = 100, B_inner = 25, ...)
```

The argument object is the fitted model object. The maximal number of boosting iterations for each bootstrap fold is the number of boosting iterations of the model object. Per default, bootstrap is used with B_outer = 100 outer folds and B_inner = 25 inner folds. The dots argument, ..., can be used to pass further arguments to applyFolds(), which is used for the outer bootstrap.
In particular, setting the argument mc.cores to an integer greater than 1 will run the outer bootstrap in parallel on the specified number of cores (this does not work under Windows, as the parallelization is based on the function mclapply()).

#### Case study (ctd.): Fossil fuel data

We recompute the model on 100 bootstrap samples to compute bootstrapped coefficient estimates. In each bootstrap fold, the optimal number of boosting iterations is estimated by an inner bootstrap with 10 folds. In contrast to analytic inference concepts, employing the bootstrap for coefficient uncertainty is much more time consuming, but it can easily be parallelized. See the help page of bootstrapCI() for example code. The resulting estimated coefficients can be seen in Figure 5.

```r
# compute bootstrapped coefficients with 100 folds for outer bootstrap,
# parallelized on 10 cores (use mc.cores = 1 on Windows)
# this takes some time!
set.seed(123)
sof_bootstrapCI <- bootstrapCI(sof[1000], B_outer = 100, B_inner = 10,
  mc.cores = 10)

# plot the bootstrapped coefficient estimates
par(mfrow = c(1, 3))
plot(sof_bootstrapCI, ask = FALSE, commonRange = FALSE, ylab = "")
```

### 4.2 Functional response

In this subsection, we explain how to fit models with functional response like model (4). Models with scalar and functional covariates are treated, thus covering function-on-scalar and function-on-function regression models.

#### 4.2.1 Specification of functional response

If a functional variable is observed on one common grid, its observations can be represented by a matrix. In FDboost, such functional variables have to be supplied as number of curves by number of evaluation points matrices. That is, a functional response with $N$ curves observed at $G$ evaluation points is stored in an $N \times G$ matrix with cases in rows and evaluation points in columns. This corresponds to a data representation in wide format. The observation grid must be given as the vector $(t_1, \ldots, t_G)$ of evaluation points.
For the functional response, curve-specific observation grids are also possible, i.e., the $i$th response curve is observed at evaluation points $t_{ig}$, $g = 1, \ldots, G_i$, specific to each curve $i$. In this case, three pieces of information must be supplied: the values of the response, the evaluation points and the curve to which each observation belongs. The response is supplied as a vector of length $\sum_{i=1}^N G_i$ containing all observed values, together with a vector containing the corresponding evaluation points $t_{ig}$. The argument id contains the information on which observation corresponds to which response curve; it must be supplied as a right-sided formula, id = ~ idvariable.

#### Case study (ctd.): Emotion components data

In the following, we give an example of a model fit with functional response. In the first model fit, the response is stored in the matrix EMG, in the second in the vector EMG_long. We fit an intercept model by defining the formula as EMG ~ 1 and the timeformula as ~ bbs(t).

```r
# fit intercept model with response matrix
fos_intercept <- FDboost(EMG ~ 1, timeformula = ~ bbs(t, df = 3),
  data = emotionHGL)
```

The corresponding mathematical formula is

$$\mathds{E}(Y_{\text{EMG}}(t)) = \beta_0(t),$$

i.e., we simply estimate the mean curve of the functional EMG signal. To fit a model with the response in long format, we first have to convert the data into the corresponding format. We therefore construct a data set emotion_long that contains the response in long format. Usually, the long format specification is only necessary for responses that are observed on curve-specific grids. We provide this version for illustrative purposes; in this example, the following model specification is equivalent to the previous model fit fos_intercept.
```r
# create data set with response in long format
emotion_long <- emotionHGL
# save the curve information in a long vector
emotion_long$EMG_long <- as.vector(emotion_long$EMG)
# compute t in long format
emotion_long$time_long <- rep(emotionHGL$t, each = nrow(emotionHGL$EMG))
# compute the id-variable for the curves
emotion_long$curveid <- rep(1:nrow(emotionHGL$EMG), ncol(emotionHGL$EMG))

# fit intercept model for response in long format
fos_intercept_long <- FDboost(EMG_long ~ 1,
  timeformula = ~ bbs(time_long, df = 3),
  id = ~ curveid, data = emotion_long)
```

#### Effects in the formula that are combined with the timeformula

Many covariate effects can be represented by the Kronecker product of two marginal bases as in (6). The response and the bases in covariate direction are specified in formula as y ~ b_1 + ... + b_J. The base-learner for the expansion along $t$ is specified in timeformula as ~ b_Y. Each base-learner in formula is combined with the base-learner in timeformula using the operator %O%, which implements the Kronecker product of two basis vectors as in (6) with isotropic penalty matrix as in (8). Consider, for example, formula = y ~ b_1 + b_2 and timeformula = ~ b_Y. This yields y ~ b_1 %O% b_Y + b_2 %O% b_Y. If the effect should only be penalized in $t$ direction, the operator %A0% can be used, as it sets up the penalty as in (9). If formula contains base-learners that are already composed of two base-learners by %O% or %A0%, those effects are not expanded with timeformula, allowing for model specifications with different effects in $t$ direction. This can be used, for example, to model some effects linearly and others non-linearly in $t$, or to construct effects using %A0%. For further details on these operators and their use, we refer to Appendix C.

We start with base-learners for the timeformula. In principle, any base-learner that models the effect of a continuous variable can be used. For an effect linear in $t$, the base-learner bols() can be used. Usually, the effects are assumed to be smooth along $t$. In this case, the base-learner bbs() can be used, which represents the smooth effect by P-splines (Schmid and Hothorn 2008a). Thus, bbs() uses a B-spline representation for the design matrix and a squared difference matrix as penalty matrix.
Using the bbs() base-learner in the timeformula corresponds to using a marginal basis as described in equation (10). Base-learners that can be used in formula are listed in Table 4. In this table, a selection of additive predictors that can be represented within the array framework is listed in the left column; the corresponding formula is given in the right column. The timeformula is set to ~ bbs(t) to model all effects as smooth in $t$. For offset = NULL, the model contains a smooth offset, which is computed prior to the model fit as smoothed population minimizer of the loss; for mean regression, this is the smoothed mean over $t$. The specification offset = "scalar" yields a constant offset. The resulting intercept in the final model is the sum of the offset and the smooth intercept specified in the formula as ~ 1.

The upper part of Table 4 gives examples of linear predictors with scalar covariates. A linear effect of a scalar covariate is specified using the base-learner bolsc(), which works for continuous as well as factor variables. A smooth effect of a continuous covariate is obtained by the base-learner bbsc(). The base-learners bolsc() and bbsc() are similar to the base-learners bols() and bbs() from the mboost package, but enforce pointwise sum-to-zero constraints to ensure identifiability in models with functional response (the suffix 'c' refers to 'constrained'). Since, for example, such an effect contains a smooth intercept as special case, the model would not be identifiable without constraints; see Appendix A for more details. We use the constraint $\sum_{i=1}^N h_j(x_i, t) = 0$ for all $t$, where $h_j$ denotes the $j$th covariate effect, which centers each effect at each time point (Scheipl et al. 2015). This implies that effects varying over $t$ can be interpreted as deviations from the smooth intercept, and that the intercept can be interpreted as global mean if all effects are centered in this way.
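The effect of such a centering can be sketched in base R. This is a simplified illustration of the constraint idea, ignoring the $t$-direction; the actual bolsc()/bbsc() implementation works via a transformation of the design matrix:

```r
# Sketch: column-centering a design matrix makes any fitted linear effect
# sum to zero over the observations (simplified, without the t-direction)
set.seed(2)
x <- rnorm(10)
Z  <- cbind(x)                                # uncentered design matrix
Zc <- scale(Z, center = TRUE, scale = FALSE)  # centered design matrix
effect <- Zc %*% c(0.5)                       # effect for some coefficient
sum(effect)                                   # numerically zero
```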
It is possible to check whether all covariate effects sum to zero for all time points by setting check0 = TRUE in the FDboost() call. To specify interaction effects of two scalar covariates, the base-learners for the two covariates are combined using the operator %Xc%, which applies the sum-to-zero constraint to the interaction effect. The lower part of Table 4 gives examples of linear predictors with functional covariates. In analogy to models with scalar response, a linear functional effect can be fitted by bsignal() or bfpc(), and interaction effects are formed using the operator %X% (see the explanations for Table 3).

#### Case study (ctd.): Emotion components data

For the emotion components data with the EMG signal as functional response, we fit models with scalar and functional covariate effects in the following.

#### Function-on-scalar regression

We specify a model for the conditional expectation of the EMG signal using a random intercept curve for each subject and a linear effect of the study setting power:

$$\mathds{E}(Y_{\text{EMG}}(t) | x) = \beta_0(t) + \sum_{k=1}^{23} I(x_{\text{subject}} = k) \, \beta_{\text{subject},k}(t) + x_{\text{power}} \, \beta_{\text{power}}(t), \quad (13)$$

with subject taking values 1 to 23 for the participants of the study, and power coding the low and high power settings. Both covariate effects in the model are specified using centered base-learners: the linear effect of the factor variable subject and the effect of power are both specified with the bolsc() base-learner. Therefore, the effects sum to zero at each time point over all observations, i.e., $\sum_{i=1}^N h_j(x_i, t) = 0$ for all $t$.

```r
# fit function-on-scalar model with centered linear effects
fos_random_power <- FDboost(EMG ~ 1 + bolsc(subject, df = 2)
    + bolsc(power, df = 1) %A0% bbs(t, df = 6),
  timeformula = ~ bbs(t, df = 3), data = emotion)
```

As described in Section 3, it is important that all base-learners have the same number of degrees of freedom. In this model, the degrees of freedom for each covariate base-learner are 6: the subject effect is expanded by the timeformula, giving $2 \times 3 = 6$, while the power effect specifies $1 \times 6 = 6$ directly via %A0%.
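That the marginal degrees of freedom multiply mirrors the structure of the design: the Kronecker product of the two marginal bases has as many columns (coefficients) as the product of the marginal basis dimensions. A quick base R check with toy matrices (our own illustration, not FDboost internals):

```r
# The Kronecker product multiplies the marginal basis dimensions
Bz <- matrix(rnorm(5 * 3), nrow = 5)  # marginal basis in covariate direction
Bt <- matrix(rnorm(4 * 6), nrow = 4)  # marginal basis in t direction
ncol(kronecker(Bz, Bt))               # 3 * 6 = 18 coefficients
nrow(kronecker(Bz, Bt))               # 5 * 4 = 20 rows
```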
By specifying the bolsc base-learner with df = 2 for subject, the subject effect is estimated with a ridge penalty, similar to a random effect, whereas the power effect is estimated unpenalized in covariate direction due to the use of the %A0% operator. Analogously, a model with the response in long format could be specified as in fos_intercept_long, using the formula of fos_random_power.

#### Function-on-function regression

For the data subset of one specific game condition, we use the EEG signal to model the EMG signal:

$$\mathds{E}(Y_{\text{EMG}}(t) | x) = \beta_0(t) + \int_{\mathcal{S}} x_{\text{EEG}}(s) \, \beta_{\text{EEG}}(s, t) \, ds. \quad (14)$$

In this model, each time point $s$ of the covariate potentially influences each time point $t$ of the response. We center the EEG signal per time point to center its effect per time point.

```r
# center the functional covariate per time-point to center its effect
emotionHGL$EEG <- scale(emotionHGL$EEG, scale = FALSE)

# fit function-on-function model with intercept and functional effect of EEG
fof_signal <- FDboost(EMG ~ 1 + bsignal(EEG, s = s, df = 2),
  timeformula = ~ bbs(t, df = 3), data = emotionHGL)
```

We will show and interpret plots of the estimated coefficients later on. Assuming that the brain activity (measured via the EEG) triggers the muscle activity (measured via the EMG), it is reasonable to assume that EMG signals are only influenced by past EEG signals. Such a relationship can be represented by a historical effect, which is discussed in the following paragraph.

#### Effects in the formula comprising both the effect in covariate and t-direction

If the covariate varies with $t$, the effect cannot be separated into a marginal basis depending on the covariate and a marginal basis depending only on $t$. In this case, the effects are represented as in equation (5). Examples of such effects are historical and concurrent functional effects, as discussed in Brockhaus et al. (2017).
In Table 5 we give an overview of possible additive predictors containing such effects. The concurrent effect is only meaningful if the functional response and the functional covariate are observed over the same domain. Models with concurrent effects can be seen as varying-coefficient models (Hastie and Tibshirani 1993), where the effect varies over $t$. The base-learner bconcurrent() expands the smooth concurrent effect in P-splines. The historical effect uses only covariate information up to the current observation point $t$ of the response. The base-learner bhist() expands the coefficient surface in $s$ and in $t$ direction using P-splines to fit the historical effect. In Appendix B, details on the representation of functional effects are given. The interface of bhist() is:

```r
bhist(x, s, time, limits = "s<=t", knots = 10, degree = 3,
      differences = 1, df = 4, lambda = NULL, check.ident = FALSE)
```

Most arguments of bhist() are analogous to those of bsignal(). bhist() has the additional argument time to specify the observation points of the response. Via the argument limits, the user can specify integration limits depending on $t$. Per default, a historical effect with limits $s \leq t$ is used. Other integration limits can be specified by a function with arguments s and t that returns TRUE for combinations of s and t lying within the integration interval and FALSE otherwise. In the following, we give examples of functions that can be used for limits:

```r
# historical effect; corresponds to default limits = "s<=t"
limitsHist <- function(s, t) {
  s <= t
}
# lag effect with lag delta = 5
limitsLag <- function(s, t, delta = 5) {
  s >= t - delta & s <= t
}
# lead effect with lead delta = 5
limitsLead <- function(s, t, delta = 5) {
  s <= t - delta
}
```

The base-learner bhistx() is especially suited to form interaction effects such as factor-specific historical effects (Rügamer et al.
2016), as bhist() cannot be used in combination with the row-wise tensor product operator %X% to form interaction effects. bhistx() requires the data to be supplied as an object of type hmatrix; see the manual of bhistx() for its setup.

#### Case study (ctd.): Emotion components data

Again, we use the subset of the data for one specific game condition. We start with a simple function-on-function regression model by specifying a concurrent effect of the EEG signal on the EMG signal:

$$\mathds{E}(Y_{\text{EMG}}(t) | x) = \beta_0(t) + x_{\text{EEG}}(t) \, \beta(t).$$

A concurrent effect is obtained by the base-learner bconcurrent(), which is not expanded by the base-learner in timeformula. In this model, timeformula is only used to expand the smooth intercept.

```r
# fit function-on-function model with intercept and concurrent EEG effect
fof_concurrent <- FDboost(EMG ~ 1 + bconcurrent(EEG, s = s, time = t, df = 6),
  timeformula = ~ bbs(t, df = 3), data = emotionHGL,
  control = boost_control(mstop = 200))
```

Assuming that the activity in the muscle can be completely traced back to previous activity in the brain, a more appropriate model is a historical model including a historical effect:

$$\mathds{E}(Y_{\text{EMG}}(t) | x) = \beta_0(t) + \int_{l(t)}^{u(t)} x_{\text{EEG}}(s) \, \beta_{\text{EEG}}(s, t) \, ds. \quad (15)$$

From a neuro-anatomical perspective, the signal from the brain requires time to reach the muscle. We therefore allow only EEG time points with $s \leq t - 3$ to influence the EMG signal at time $t$, i.e., $u(t) = t - 3$, which is in line with Rügamer et al. (2016).

```r
# fit function-on-function model with intercept and historical EEG effect
# where limits specifies the used lag between EMG and EEG signal
fof_historical <- FDboost(EMG ~ 1
    + bhist(EEG, s = s, time = t, limits = function(s, t) s <= t - 3, df = 6),
  timeformula = ~ bbs(t, df = 3), data = emotionHGL,
  control = boost_control(mstop = 200))
```

More complex historical models are discussed in Rügamer et al. (2016).
In particular, they consider a model containing random effects for the participants, effects for the game conditions, and game condition- as well as subject-specific historical effects of the EEG signal. It is also possible to combine effects listed in Table 4 and Table 5 to form more complex models. In particular, base-learners with and without array structure can be combined within one model. As each base-learner is evaluated separately in the component-wise boosting procedure, the array structure of the Kronecker product base-learners can still be exploited in such hybrid models.

#### 4.2.2 Model tuning and early stopping

For a fair selection of base-learners, additional care is needed for functional responses, as only some of the base-learners in the formula are expanded by the base-learner in timeformula. In particular, all base-learners listed in Table 4 are expanded by timeformula, whereas the base-learners given in Table 5 are not. For the row-wise tensor product and the Kronecker product of two base-learners, the degrees of freedom of the combined base-learner are computed as the product of the two marginally specified degrees of freedom. For instance, formula = y ~ bbsc(z, df = 3) + bhist(x, s = s, df = 12) and timeformula = ~ bbs(t, df = 4) imply $3 \times 4 = 12$ degrees of freedom for the first combined base-learner and $12$ degrees of freedom for the second base-learner. The call extract(object, "df") displays the degrees of freedom for each base-learner in an FDboost object. For other tuning options, such as the number of iterations and the specification of the step-length, see Section 4.1. To find the optimal number of boosting iterations for a model fit with functional response, FDboost provides two resampling functions.
Depending on the specified model, some parameters are computed from the data prior to the model fit: per default, a smooth functional offset is computed (offset = NULL in FDboost()), and for linear and smooth effects of scalar variables, defined by bolsc() and bbsc(), transformation matrices for the sum-to-zero constraints are computed. The function cvrisk.FDboost() uses the smooth functional offset and the transformation matrices from the original model fit in all folds. Thus, these parameters are treated as fixed, and the uncertainty induced by their estimation is not accounted for in the resampling. The function applyFolds(), in contrast, recomputes the whole model in each fold. The two resampling methods are equal if no smooth offset is used and the model does not contain any base-learner with a sum-to-zero constraint (i.e., neither bolsc() nor bbsc()). In general, we recommend the function applyFolds() to determine the optimal number of boosting iterations for a model with functional response. The interface of applyFolds() is:

```r
applyFolds(object,
  folds = cv(rep(1, length(unique(object$id))), type = "bootstrap"),
  grid = 1:mstop(object))
```

The interface is analogous to that of cvrisk(). In the argument object, the fitted model object is specified. grid defines the grid on which the optimal stopping iteration is searched. Via the argument folds, the resampling folds are defined by suitable weights. The function applyFolds() expects resampling weights that are defined on the level of curves, i.e., the folds must contain one weight per response curve $i = 1, \ldots, N$, which can easily be generated using the function cv().

#### 4.2.3 Methods to extract and display results

Methods to extract and visualize results are the same irrespective of scalar or functional response. Thus, we refer to the corresponding paragraphs at the end of Section 4.1.

#### Case study (ctd.): Emotion components data

As for scalar response, the plot() function can be used to display the estimated effects in a function-on-function regression. In the following, we compare the three basic types of functional covariate effects that can be used in conjunction with a functional response. We first determine the optimal number of stopping iterations for all three presented models.

```r
# create folds for 5-fold cross-validation: one weight per fold for each curve
set.seed(123)
folds_bs <- cv(weights = rep(1, fof_signal$ydim[1]), type = "kfold", B = 5)

# compute out-of-bag risk on the 5 folds for 1 to 200 boosting iterations
cvm_concurrent <- applyFolds(fof_concurrent, folds = folds_bs, grid = 1:200)
ms_conc <- mstop(cvm_concurrent)  # ms_conc = 50
fof_concurrent <- fof_concurrent[ms_conc]

cvm_signal <- applyFolds(fof_signal, folds = folds_bs, grid = 1:200)
ms_signal <- mstop(cvm_signal)  # ms_signal = 15
fof_signal <- fof_signal[ms_signal]

cvm_historical <- applyFolds(fof_historical, folds = folds_bs, grid = 1:200)
ms_hist <- mstop(cvm_historical)  # ms_hist = 14
fof_historical <- fof_historical[ms_hist]
```

Then, we plot the estimated effects into one figure:

```r
# plot the effect of the functional covariate EEG in the three different models
par(mfrow = c(1, 3))
plot(fof_concurrent, which = 2, main = "Concurrent EEG effect")
plot(fof_signal, which = 2, main = "Signal EEG effect",
  n1 = 80, n2 = 80, zlim = c(-0.02, 0.025), col = terrain.colors(20))
plot(fof_historical, which = 2, main = "Historical EEG effect",
  n1 = 80, n2 = 80, zlim = c(-0.02, 0.025), col = terrain.colors(20))
```

The concurrent effect corresponds to the diagonal of the other two surfaces in Figure 6 and assumes that off-diagonal time points have no association. Due to the temporal lag between EEG and EMG discussed for model (15), this model has no meaningful interpretation here; the effect is shown for demonstrative purposes only. The historical effect corresponds to the assumption that the upper triangle of the signal EEG effect surface should be zero, as future brain activity should not influence present muscle activity. The results in Figure 6 (right panel) can be interpreted in the same manner as results of a scalar-on-function regression when keeping a certain time point fixed.
For a fixed time point of the EMG signal, for example, some earlier time points of the EEG signal show no effect, whereas others have a positive estimated effect on the expected EMG signal. For a detailed description of the interpretation of historical effect surfaces as shown in Figure 6, we refer to the online appendix of Rügamer et al. (2016). Careful interpretation has to take into account that this data set has a rather small signal-to-noise ratio due to the oscillating nature of both signals. In such cases, it is recommended to check the uncertainty of the estimated effects via bootstrap, e.g., by using the bootstrapCI() function as exemplarily shown in Figure 7.

# calculate bootstrap interval based on 100 bootstrap folds (default)
# and use 10-fold cross-validation to determine mstop on the inner level
fof_historical_bci <- bootstrapCI(fof_historical, mc.cores = 2,
                                  B_inner = 10, type_inner = "kfold")

# plot results
par(mfrow = c(1, 3))
plot(fof_historical_bci, which = 2, ask = FALSE, pers = FALSE,
     col = terrain.colors(20), probs = c(0.05, 0.5, 0.95))

### 4.3 Functional regression models beyond the mean

Using boosting for model estimation, it is possible to optimize loss functions other than the squared error loss. This allows fitting, e.g., generalized linear models (GLMs) and quantile regression models (Koenker 2005; Fenske et al. 2011). It is also possible to fit models for several parameters of the conditional response distribution in the framework of generalized additive models for location, scale and shape (GAMLSS, Rigby and Stasinopoulos 2005; Mayr et al. 2012). For the estimation of these more general models, a suitable loss function in accordance with the modeled characteristic of the response distribution is defined and optimized. The absolute error loss ($L_1$ loss), for instance, implies median regression, and minimizing the $L_2$ loss yields mean regression. In FDboost(), the regression type is specified by the family argument.
The family argument expects an object of class Family, which implements the respective loss function with its corresponding negative gradient and link function. The default is family = Gaussian(), which yields $L_2$-boosting (Bühlmann and Yu 2003). This means that the mean squared error loss is minimized, which is equivalent to maximizing the log-likelihood of the normal distribution. Table 6 lists some loss functions currently implemented in mboost that can be directly used in FDboost (see Hofner et al. 2014, for more families). Hofner et al. (2014) also give an example of how to implement new families via the function Family(). See also the help page ?Family for more details on all families. For a continuous response, several model types are available (Bühlmann and Hothorn 2007): $L_2$-boosting yields mean regression; a more robust alternative is median regression, which optimizes the absolute error loss; the Huber loss is a combination of the $L_1$ and $L_2$ losses (Huber 1964); quantile regression (Koenker 2005) can be used to model a certain quantile of the conditional response distribution; and expectile regression (Newey and Powell 1987) models an expectile. For a non-negative continuous response, models assuming the gamma distribution can be useful. A binary response can be modeled in a GLM framework with a logit model or by minimizing the exponential loss, which corresponds to the first boosting algorithm, 'AdaBoost' (Friedman 2001). Count data can be modeled assuming a Poisson or negative binomial distribution (Schmid et al. 2010). For functional response, we compute the loss point-wise and integrate over the domain of the response. The following models can only be applied for scalar, not functional, response. For ordinal response, a proportional odds model can be used (Schmid et al. 2011). For categorical response, the multinomial logit model is available.
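As an illustration of the ingredients such a family provides, the following base-R sketch shows the check (pinball) loss minimized by quantile regression together with its negative gradient; the helper names check_loss() and neg_gradient() are our own, and in FDboost one would simply pass family = QuantReg(tau):

```r
# Check (pinball) loss of quantile regression for quantile tau and its
# negative gradient with respect to the fit f (our own toy helpers).
check_loss <- function(y, f, tau) {
  u <- y - f
  u * (tau - (u < 0))
}
neg_gradient <- function(y, f, tau) {
  # subgradient: tau if y > f, tau - 1 otherwise (incl. the tie y == f)
  ifelse(y > f, tau, tau - 1)
}

# tau = 0.5 recovers half the absolute error loss, i.e., median regression
y <- c(-2, 0, 3); f <- c(0, 0, 0)
check_loss(y, f, tau = 0.5)
neg_gradient(y, f, tau = 0.5)
```

The boosting algorithm repeatedly fits base-learners to this negative gradient, so changing the family changes the modeled characteristic of the response distribution without changing the fitting machinery.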
For survival models, boosting Cox proportional hazards models and accelerated failure time models have been introduced by Schmid and Hothorn (2008b).

#### Case study (ctd.): Emotion components data

So far, we fitted a model for the conditional mean of the response. As a more robust alternative, we consider median regression by setting family = QuantReg(tau = 0.5). We use the update() function to refit the functional model with the new family.

fof_signal_med <- update(fof_signal, family = QuantReg(tau = 0.5))

For median regression, the smooth intercept is the estimated median at each time point and the effects are deviations from the median. The combination of GAMLSS with functional variables is discussed in Brockhaus et al. (2016) and Stöcker et al. (2017). For GAMLSS models, FDboost builds on the package gamboostLSS (Hofner et al. 2017), in which families are implemented to fit GAMLSS. For details on the boosting algorithm to fit GAMLSS, see Mayr et al. (2012) and Thomas et al. (2017). The families in gamboostLSS need to model at least two distribution parameters. For an overview of currently implemented response distributions for GAMLSS, we refer to Hofner et al. (2016). In FDboost, the function FDboostLSS() implements GAMLSS with functional data. The interface of FDboostLSS() is:

FDboostLSS(formula, timeformula, data = list(), families = GaussianLSS(), ...)

In formula, a named list of formulas is supplied. Each list entry specifies the potential covariate effects for one of the distribution parameters. The names of the list are the names of the distribution parameters. The argument families specifies the assumed response distribution with its modeled distribution parameters. The default families = GaussianLSS() yields a Gaussian location scale model. Further arguments passed to FDboost() can be supplied in the dots argument.
The model object fitted by FDboostLSS() is a list of FDboost model objects. It is not possible to automatically fit a smooth offset within FDboostLSS(); per default, a scalar offset value is used for each distribution parameter. For functional response, it can thus be useful to center the response prior to the model fit. All integration weights for the loss function are set to one, corresponding to the negative log-likelihood of the observation points. For model objects fitted by FDboostLSS(), methods to estimate the optimal stopping iterations as well as methods for plotting and prediction exist. For more details on boosting GAMLSS models, we refer to Hofner et al. (2016), which is a tutorial for the package gamboostLSS.

#### Case study (ctd.): Fossil fuel data

We fit a Gaussian location scale model for the heat value. Such a model is obtained by setting families = GaussianLSS(), where the expectation is modeled using the identity link and the standard deviation using a log-link. Mean and standard deviation of the heat value are modeled by different covariates:

$$Y_i \mid x_i \sim N(\mu_i, \sigma_i^2),$$
$$\mu_i = \beta_0 + f(z_{\mathrm{h2o},i}) + \int_{\mathcal{S}_{\mathrm{NIR}}} x_{\mathrm{NIR},i}(s_{\mathrm{NIR}})\, \beta_{\mathrm{NIR}}(s_{\mathrm{NIR}})\, ds_{\mathrm{NIR}} + \int_{\mathcal{S}_{\mathrm{UV}}} x_{\mathrm{UV},i}(s_{\mathrm{UV}})\, \beta_{\mathrm{UV}}(s_{\mathrm{UV}})\, ds_{\mathrm{UV}},$$
$$\log \sigma_i = \alpha_0 + \alpha_1 z_{\mathrm{h2o},i}.$$

The mean is modeled depending on the water content as well as on the NIR and UVVIS spectra. The standard deviation is modeled using a log-link and a linear predictor based on the water content. The formula has to be specified as a list of two formulas with names mu and sigma for the mean and standard deviation of the normal distribution. We use the noncyclic fitting method introduced by Thomas et al. (2017).

# center the variable water content
# such that the linear effect in the variance is centered around zero
fuelSubset$h2o_center <- fuelSubset$h2o - mean(fuelSubset$h2o)

library(gamboostLSS)
# fit Gaussian location scale model
sof_ls <- FDboostLSS(list(mu = heatan ~ bbs(h2o, df = 4)
                            + bsignal(UVVIS, uvvis.lambda, knots = 40, df = 4)
                            + bsignal(NIR, nir.lambda, knots = 40, df = 4),
                          sigma = heatan ~ 1 + bols(h2o_center, df = 2)),
                     timeformula = NULL, data = fuelSubset,
                     families = GaussianLSS(), method = "noncyclic")

# model object is list containing model for mu and sigma
names(sof_ls)

[1] "mu"    "sigma"


The optimal number of boosting iterations is searched on a grid of 1 to 2000 boosting iterations. In each boosting iteration, the algorithm updates the base-learner that best fits the negative gradient; thus, in each iteration, the additive predictor of only one of the distribution parameters is updated.
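The component-wise selection step described above can be sketched in a few lines of base R; this is a toy example with two linear base-learners and squared error loss, not FDboost code:

```r
# One noncyclic boosting step: fit every candidate base-learner to the
# current negative gradient, then update only the best-fitting one.
set.seed(1)
x1 <- rnorm(100); x2 <- rnorm(100)
y  <- 2 * x1 + rnorm(100, sd = 0.1)

f <- rep(0, 100)   # current additive predictor
u <- y - f         # negative gradient of the squared error loss

base_learners <- list(
  function(u) lm(u ~ x1),
  function(u) lm(u ~ x2)
)
rss  <- sapply(base_learners, function(bl) sum(residuals(bl(u))^2))
best <- which.min(rss)                            # best-fitting base-learner
f <- f + 0.1 * fitted(base_learners[[best]](u))   # step length nu = 0.1
best
```

In a GAMLSS fit, the candidate set contains the base-learners of all distribution parameters, so each iteration improves exactly one parameter's predictor.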

# find optimal stopping iterations by 5-fold bootstrap
# takes some time, easy to parallelize on Linux
set.seed(123)
cvm_sof_ls <- cvrisk(sof_ls, folds = cv(model.weights(sof_ls[[1]]), B = 5),
                     grid = 1:2000, trace = FALSE)


The estimated coefficients for the expectation are similar to the effects resulting from the pure mean model. The water content has a negative effect on the standard deviation, with higher water content being associated with lower variability.

### 4.4 Variable selection by stability selection

Variable selection can be refined using stability selection (Meinshausen and Bühlmann 2010; Shah and Samworth 2013). Stability selection is a procedure to select influential variables while controlling the expected number of falsely selected variables. For component-wise gradient boosting, it is implemented in mboost in the function stabsel() (Hofner et al. 2015), which can also be used for model objects fitted by FDboost(). Brockhaus et al. (2017) compute function-on-function regression models with more functional covariates than observations and perform variable selection by stability selection. Thomas et al. (2017) discuss stability selection for GAMLSS estimated by boosting.

## 5 Discussion

The R add-on package FDboost provides a comprehensive implementation for fitting functional regression models by gradient boosting. The implementation allows fitting regression models with scalar or functional response depending on many covariate effects. The framework includes mean regression (with or without link function), median and quantile regression models as well as GAMLSS. Various covariate effects are implemented, including linear and smooth effects of scalar covariates, linear effects of functional covariates and interaction effects, also between scalar and functional covariates (Rügamer et al. 2016). The linear functional effects can have flexible integration limits, for example, to form historical or lag effects (Brockhaus et al. 2017). Whenever possible, the effects are represented in the structure of linear array models (Currie et al. 2006) to increase computational efficiency (Brockhaus et al. 2015). Component-wise gradient boosting makes it possible to fit models in high-dimensional data situations and performs data-driven variable selection. FDboost builds on the well-tested and modular implementation of mboost (Hothorn et al. 2016). This facilitates the implementation of further base-learners to fit new covariate effects, as well as of new families modeling other characteristics of the conditional response distribution.

## Appendix A Constraints for effects of scalar covariates

Consider a model for functional response with smooth intercept $\beta_0(t)$ and an effect $h_j(x_i, t)$ that contains a smooth intercept as special case, and define the mean effect at each point $t$ as $\bar{h}_j(x, t) = N^{-1} \sum_{i=1}^N h_j(x_i, t)$. This model can be parametrized in different ways, e.g., as

$$\mathbb{E}(Y_i(t)) = \beta_0(t) + h_j(x_i, t) = \left[\beta_0(t) + \bar{h}_j(x, t)\right] + \left[h_j(x_i, t) - \bar{h}_j(x, t)\right] = \tilde{\beta}_0(t) + \tilde{h}_j(x_i, t).$$

The problem arises as $\bar{h}_j(x, t)$ (or any other smooth function in $t$) can be shifted between the intercept and the covariate effect. At the level of the design matrices of these effects, this can be explained by the fact that the columns of the design matrix of the covariate effect and the columns of the design matrix of the functional intercept are linearly dependent. To obtain identifiable effects, Scheipl et al. (2015) propose to center such effects at each point $t$. The centering is achieved by setting the point-wise expectation of the covariate effect to zero, i.e., $\mathbb{E}_x(h_j(x, t)) = 0$ for all $t$, approximated by the sum-to-zero constraint $\sum_{i=1}^N h_j(x_i, t) = 0$ for all $t$. How to enforce such constraints is described in Appendix A of Brockhaus et al. (2015). Other constraints to obtain identifiable models are possible. However, this sum-to-zero constraint for each point $t$ yields an intuitive interpretation: the intercept can be interpreted as the global mean and the covariate effects as deviations from the smooth intercept.

The constraint is enforced by a basis transformation of the design and penalty matrix. As shown in Brockhaus et al. (2015), it is sufficient to apply the constraint on the covariate part of the design and penalty matrix. Thus, it is not necessary to transform the basis in the $t$ direction.
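The basis transformation can be sketched in base R: a matrix Z spanning the null space of the constraint is obtained from a QR decomposition, and the design and penalty matrices are transformed with it. This is a minimal sketch in our own notation, following the idea of the constraint described above:

```r
# Find Z such that each column of X %*% Z sums to zero over observations,
# i.e., the fitted effect fulfills the sum-to-zero constraint.
set.seed(1)
X <- cbind(1, matrix(rnorm(50 * 4), 50, 4))  # toy design matrix, 5 columns
C <- crossprod(X, rep(1, nrow(X)))           # constraint vector: t(X) %*% 1
Z <- qr.Q(qr(C), complete = TRUE)[, -1]      # basis of the null space of t(C)
Xc <- X %*% Z                                # constrained design matrix
round(colSums(Xc), 10)                       # all columns sum to zero
# a penalty matrix K would be transformed accordingly: Kc <- t(Z) %*% K %*% Z
```

The constrained design matrix has one column less than the original, which is why the intercept has to be included explicitly in the model.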

## Appendix B Base-learners for functional covariates

The base-learner bsignal() sets up a linear effect of a functional variable using P-splines. We approximate the integral numerically as a weighted sum using integration weights $\Delta(s_r)$ (Wood 2011), see equation (3):

$$b_j(x_i)^\top = \left[\, \sum_{r=1}^R \Delta(s_r)\, x_i(s_r)\, \phi_1(s_r) \;\; \cdots \;\; \sum_{r=1}^R \Delta(s_r)\, x_i(s_r)\, \phi_{K_j}(s_r) \,\right]$$
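The integration weights $\Delta(s_r)$ can, for example, be trapezoidal weights. The following base-R sketch illustrates the idea; trapezoid_weights() is our own helper, as bsignal() computes suitable weights internally:

```r
# Trapezoidal integration weights Delta(s_r) for a (possibly non-uniform)
# grid s_1 < ... < s_R, so that sum(w * x) approximates integral(x(s) ds).
trapezoid_weights <- function(s) {
  R <- length(s)
  c(s[2] - s[1], s[3:R] - s[1:(R - 2)], s[R] - s[R - 1]) / 2
}

s <- seq(0, 1, length.out = 101)  # observation grid of the functional covariate
w <- trapezoid_weights(s)
x <- s^2                          # toy functional covariate on the grid
sum(w * x)                        # close to the integral of s^2 on [0,1] = 1/3
```

The same weights multiply the basis functions $\phi_k(s_r)$ in the design vector above, so a finer grid yields a better approximation of the functional effect.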