# Data-driven HRF estimation for encoding and decoding models

###### Abstract

Despite the common usage of a canonical, data-independent, hemodynamic response function (HRF), it is known that the shape of the HRF varies across brain regions and subjects. This suggests that a data-driven estimation of this function could lead to more statistical power when modeling BOLD fMRI data. However, unconstrained estimation of the HRF can yield highly unstable results when the number of free parameters is large. We develop a method for the joint estimation of activation and HRF by means of a rank constraint, forcing the estimated HRF to be equal across events or experimental conditions, yet permitting it to differ across voxels. Model estimation leads to an optimization problem that we propose to solve with an efficient quasi-Newton method, exploiting fast gradient computations. This model, called GLM with Rank-1 constraint (R1-GLM), can be extended to the setting of GLM with separate designs which has been shown to improve decoding accuracy in brain activity decoding experiments. We compare 10 different HRF modeling methods in terms of encoding and decoding score on two different datasets. Our results show that the R1-GLM model outperforms competing methods in both encoding and decoding settings, positioning it as an attractive method both from the points of view of accuracy and computational efficiency.

###### keywords:

Functional MRI (fMRI), Hemodynamic response function (HRF), machine learning, optimization, BOLD, Finite inpulse response (FIR), decoding, encodingAND

## 1 Introduction

The use of machine learning techniques to predict the cognitive state of a subject from their functional MRI (fMRI) data recorded during task performance has become a popular analysis approach for neuroimaging studies over the last decade (Cox2003; Haynes2006). It is now commonly referred to as brain reading or decoding. In this setting, the BOLD signal is used to predict the task or stimulus that the subject was performing. Although it is possible to perform decoding directly on raw BOLD signal (Mourao-Miranda2007; Miyawaki2008), the common approach in fast event-related designs consists in extracting the activation coefficients (beta-maps) from the BOLD signal to perform the decoding analysis on these estimates. Similarly, in the voxel-based encoding models (Kay2008; Naselaris2011), the activation coefficients are extracted from the BOLD signal, this time to learn a model to predict the BOLD response in a given voxel, based on a given representation of the stimuli. In addition, a third approach, known as representational similarity analysis or RSA (kriegeskorte2008representational) takes as input the activation coefficients. In this case a comparison is made between the similarity observed in the activation coefficients, quantified by a correlation measure, and the similarity between the stimuli, quantified by a similarity measure defined from the experimental setting.

These activation coefficients are computed by means of the General Linear Model (GLM) (Friston1995). While this approach has been successfully used in a wide range of studies, it does suffer from limitations (Poline2012). For instance, the GLM commonly relies on a data-independent canonical form of the hemodynamic response function (HRF) to estimate the activation coefficient. However it is known (Handwerker2004; Badillo2013) that the shape of this response function can vary substantially across subjects and brain regions. This suggests that an adaptive modeling of this response function should improve the accuracy of subsequent analysis.

To overcome the aforementioned limitation, Finite Impulse Response (FIR) models have been proposed within the GLM framework (Dale1999; Glover1999). These models do not assume any particular shape for the HRF and amount to estimating a large number of parameters in order to identify it. While the FIR-based modeling makes it possible to estimate the activation coefficient and the HRF simultaneously, the increased flexibility has a cost. The estimator is less robust and prone to overfitting, i.e. to generalize badly to unseen data. In general, FIR models are most appropriate for studies focused on the characterization of the shape of the hemodynamic response, and not for studies that are primarily focused on detecting activation (Poldrack, Chapter 5).

Several strategies aiming at reducing the number of degrees of freedom of the FIR model - and thus at limiting the risk of overfitting - have been proposed. One possibility is to constrain the shape of the HRF to be a linear combination of a small number of basis functions. A common choice of basis is formed by three elements consisting of a reference HRF as well as its time and dispersion derivatives (friston1998nonlinear), although it is also possible to compute a basis set that spans a desired function space (Woolrich2004). More generally, one can also define a parametric model of the HRF and estimate the parameters that best fit this function (Lindquist2007). However, in this case the estimated HRF may no longer be a linear function of the input parameters.

Sensitivity to noise and overfitting can also be reduced through regularization. For example, temporal regularization has been used in the smooth FIR (Goutte2000; Ciuciu2003; Casanova2008) to favor solutions with small second order time derivative. These approaches require the setting of one or several hyperparameters, at the voxel or potentially at the parcel level (if several voxels in a pre-defined parcel are assumed to share some aspects of the HRF timecourse). Even if efficient techniques such as generalized cross-validation (golub1979generalized) can be used to choose the regularization parameters, these methods are inherently more costly than basis-constrained methods. Basis-constrained methods also require setting the number of basis elements; however, this parameter is not continuous (as in the case of regularized methods), and in practice only few values are explored: for example the 3-element basis set formed by a reference HRF plus derivatives and the FIR model. This paper focuses on basis-constrained regularization of the HRF to avoid dealing with hyperparameter selection with the goal of remaining computationally attractive. A different approach to increase robustness of the estimates consists in linking the estimated HRFs across a predefined brain parcel, taking advantage of the spatially dependent nature of fMRI (Wang2013). However, hemodynamically-informed parcellations (Chaari2012; Badillo2013a) rely on the computation of a large number of estimations at the voxel or sub-parcel level. In this setting, the development of voxel-wise estimation procedures is complementary to the development of parcellation methods in that more robust estimation methods at the voxel level would naturally translate into more robust parcellation methods. In this paper we focus on voxel-wise estimation methods.

We propose a method for the simultaneous estimation of HRF and activation coefficients based on low-rank modeling. Within this model, and as in (Makni2008; Kay2008; vincent2010spatially; Degras2014), the HRF is constrained to be equal across the different conditions, yet permitting it to be different across voxels. Unlike previous works, we formulate this model as a constrained least squares problem, where the vector of coefficients is constrained to lie within the space of rank one matrices. We formulate the model within the framework of smooth optimization and use quasi-Newton methods to find the vector of estimates. This model was briefly presented in the conference paper (Pedregosa2013). Here we provide more experimental validation and a more detailed presentation of the method. We also added results using a GLM with separate designs (Mumford2012). Ten alternative approaches are now compared on two publicly available datasets. The solver has also been significantly improved to scale to full brain data.

The contributions of this paper are two-fold. First, we quantify the importance of HRF estimation in encoding and decoding models. While the benefit of data-driven estimates of the HRF have already been reported in the case of decoding (Turner2012) and encoding approaches (vu2011encoding), we here provide a comprehensive comparison of models. Second, we evaluate a method called GLM with Rank-1 constraint (R1-GLM) that improves encoding and decoding scores over state-of-the-art methods while remaining computationally tractable on a full brain volume. We propose an efficient algorithm for this method and discuss practical issues such as initialization. Finally, we provide access to an open source software implementation of the methods discussed in this paper.

Notation: and denote the Euclidean and infinity norm for vectors. We use lowercase boldface letter to denote vectors and uppercase boldface letter to denote matrices. denotes the identity matrix, denotes the vector of ones of size , denotes the Kronecker product and denotes the concatenation of the columns of a matrix into a single column vector. denotes the Moore-Penrose pseudoinverse. Given the vectors with for each , we will use the notation to represents the columnwise concatenation of the vectors into a matrix of size . We will use Matlab-style colon notation to denote slices of an array, that is will denote the first elements of .

## 2 Methods

In this section we describe different methods for extracting the HRF and activation coefficients from BOLD signals. We will refer to each different stimulus as condition and we will call trial a unique presentation of a given stimulus. We will denote by the total number of stimuli, the BOLD signal at a single voxel and the total number of images acquired.

### 2.1 The General Linear Model

The original GLM model (Friston1995) makes the assumption that the hemodynamic response is a linear transformation of the underlying neuronal signal. We define the -matrix as the columnwise stacking of different regressors, each one defined as the convolution of a reference HRF (Boynton1996; Glover1999) with the stimulus onsets for the given condition. In this work we used as reference HRF the one provided by the software SPM 8 (friston2011statistical). Assuming additive white noise, and to be full rank, the vector of estimates is given by , where is a vector of size representing the amplitude of each one of the conditions in a given voxel.

A popular modification of this setting consists in extending the GLM design matrix with the temporal and width derivatives of the reference HRF. This basis, formed by the reference HRF and its derivatives with respect to time and width parameters, will be used throughout this work. We will refer to it as the 3HRF basis. In this case, each one of the basis elements is convolved with the stimulus onsets of each condition, obtaining a design matrix of size . This way, for each condition, we estimate the form of the HRF as a sum of basis functions that correspond to the first order Taylor expansion of the parametrization of the response function. Another basis set that will be used is the Finite Impulse Response (FIR) set. This basis set spans the complete ambient vector space (of dimension corresponding to the length of the impulse response) and it is thus a flexible model for capturing the HRF shape. It consists of the canonical unit vectors (also known as stick function) for the given duration of the estimated HRF. Other basis functions such as FMRIB’s Linear Optimal Basis Sets (Woolrich2004) are equally possible but were not considered in this work.

More generally, one can extend this approach to any set of basis functions. Given the matrix formed by the stacking of basis elements , the design matrix is formed by successively stacking the regressors obtained by convolving each of the basis elements with the stimulus onsets of each condition. This results in a matrix of size and under the aforementioned conditions the vector of estimates is given by . In this case, is no longer a vector of size : it has length instead and can no longer be interpreted as the amplitude of the activation. One possibility to recover the trial-by-trial reponse amplitude is to select the parameters from a single time point as done by some of the models considered in (Mumford2012), however this procedure assumes that the peak BOLD response is located at that time point. Another possibility is to construct the estimated HRF and take as amplitude coefficient the peak amplitude of this estimated HRF. This is the approach that we have used in this paper.

### 2.2 GLM with rank constraint

In the basis-constrained GLM model, the HRF estimation is performed independently for each condition. This method works reliably whenever the number of conditions is small, but in experimental designs with a large number of conditions it performs poorly due to the limited conditioning of the problem and the increasing variance of the estimates.

At a given voxel, it is expected that for similar stimuli the estimated HRF are also similar (Henson2002). Hence, a natural idea is to promote a common HRF across the various stimuli (given that they are sufficiently similar), which should result in more robust estimates (Makni2008; vincent2010spatially). In this work we consider a model in which a common HRF is shared across the different stimuli. Besides the estimation of the HRF, a unique coefficient is obtained per column of our event matrix. This amounts to the estimation of free parameters instead of as in the standard basis-constrained GLM setting.

The novelty of our method stems from the observation that the formulation of the GLM model with a common HRF across conditions translates to a rank constraint on the vector of estimates. This assumption amounts to enforcing the vector of estimates to be of the form for some HRF and a vector of coefficients . More compactly, this can be written as . This can be seen as a constraint on the vector of coefficients to be the vectorization of a rank-one matrix, hence the name Rank-1 GLM (R1-GLM).

In this model, the coefficients have no longer a closed form expressions, but can be estimated by minimizing the mean squared error of a bilinear model. Given and as before, a matrix of nuisance parameters such as drift regressors, we define to be the objective function to be minimized. The optimization problem reads:

(1) |

The norm constraint is added to avoid the scale ambiguity between and and the sign is chosen so that the estimated HRF correlates positively with a given reference HRF . Otherwise the signs of the HRF and can be simultaneously flipped without changing the value of the cost function. Within its feasible set, the optimization problem is smooth and is convex with respect to , and , however it is not jointly convex in variables , and .

From a practical point of view this formulation has a number of advantages. First, in contrast with the GLM without rank-1 constraint the estimated coefficients are already factored into the estimated HRF and the activation coefficients. That is, once the estimation of the model parameters from Eq. (1) is obtained, is a vector of size and is a vector of size that can be both used in subsequent analysis, while in models without rank-1 constraint only the vector of coefficients (equivalent to in rank-1 constrained models) of size is estimated. In the latter case, the estimated HRF and the beta-maps still have to be extracted from this vector by methods such as normalization by the peak of the HRF, averaging or projecting to the set of Rank-1 matrices.

Second, it is readily adapted to prediction on unseen trials. While for classical (non rank-1 models) the HRF estimation is performed per condition with no HRF associated with unseen conditions, in this setting, because the estimated HRF is linked and equal across conditions it is natural to use this estimate on unseen conditions. This setting occurs often in encoding models where prediction on unseen trials is part of the cross-validation procedure.

This model can also be extended to a parametric HRF model. That is, given the hemodynamic response defined as a function of some parameters , we can formulate the analogous model of Eq. (1) as an optimization over the parameters and with the design matrix given by the convolution of the event matrix with the FIR basis:

(2) |

In section 2.4 we will discuss optimization strategies for both models.

### 2.3 Extension to separate designs

An extension to the classical GLM that improves the estimation with correlated designs was proposed in (Mumford2012). In this setting, each voxel is modeled as a linear combination of two regressors in a design matrix . The first one is the regressor associated with a given condition and the second one is the sum of all other regressors. This results in design matrices, one for each condition. The estimate for a given condition is given by the first element in the two-dimensional array , where is the design matrix for condition . We will denote this model GLM with separate designs (GLMS). It has been reported to find a better estimate in rapid event designs leading to a boost in accuracy for decoding tasks (Mumford2012; Schoenmakers2013; Lei2013).

This approach was further extended in (Turner2012) to include FIR basis instead of the predefined canonical function. Here we employ it in the more general setting of a predefined basis set. Given a set of basis functions we construct the design matrix for condition as the columnwise concatenation of two matrices and . is given by the columns associated with the current condition in the GLM matrix and is the sum of all other columns. In this case, the vector of estimates is given by the first vectors of . See (Turner2012) for a more complete description of the matrices and .

It is possible to use the same rank-1 constraint as before in the setting of separate designs, linking the HRF across conditions. We will refer to this model as Rank-1 GLM with separate designs (R1-GLMS). In this case the objective function has the form , where is a vector representing the activation of all events except the event of interest and will not be used in subsequent analyses. We can compute the vector of estimates as the solution to the optimization problem

(3) |

### 2.4 Optimization

For the estimation of rank-1 models on a full brain volume, a model is estimate at each voxel separately. Since a typical brain volume contains more than 40,000 voxels, the efficiency of the estimation at a single voxel is of great importance. In this section we will detail an efficient procedure based on quasi-Newton methods for the estimation of R1-GLM and R1-GLMS models on a given voxel.

One approach to minimize (1) is to alternate the minimization with respect to the variables , and . By recalling the Kronecker product identities (horn1991topics, Chapter 4.3), and using the identity we can rewrite the objective function (1) to be minimized as:

(4) | |||

(5) | |||

(6) |

Updating , or sequentially thus amounts to solving a (constrained) least squares problem at each iteration. A similar procedure is detailed in (Degras2014). However, this approach requires computing the matrices and at each iteration, which are typically dense, resulting in a high computational cost per iteration. Note also that the optimization problem is not jointly convex in variables , therefore we cannot apply convergence guarantees from convex analysis.

We rather propose a more efficient approach by optimizing both variables jointly. We define a global variable as the concatenation of into a single vector, , and cast the problem as an optimization with respect to this new variable. Generic solvers for numerical optimization (nocedal2006numerical) can then be used. The solvers that we will consider take as input an objective function and its gradient. In this case, the partial derivatives with respect to variable can be written as , whose expression can be easily derived using the aforementioned Kronecker product identities:

If instead a parametric model of the HRF is used as in Eq. (2), the equivalent partial derivatives can be easily computed by the chain rule.

For the sake of efficiency, it is essential to avoid evaluating the Kronecker products naively, but rather reformulate them using the above mentioned Kronecker identities. For example, the matrix should not be computed explicitly but should rather be stored as a linear operator such that when applied to a vector it computes , avoiding thus the explicit computation of .

Similar equations can be derived for the rank-1 model with separate designs of Eq. (3) (R1-GLMS), in which case the variable is defined as the concatenation of , i.e. . The gradient of with respect to can be computed as . The partial derivatives read:

A good initialization plays a crucial role in the convergence of any iterative algorithm. Furthermore, for non-convex problems a good initialization prevents the algorithm from converging to undesired local minima. We have used as initialization for the R1-GLM and R1-GLMS models the solution given by the GLM with separate designs (GLMS). Since the GLM with separate designs scales linearly in the number of voxels, this significantly reduces computation time whenever an important number of voxels is considered.

Whenever the design matrix has more rows than columns (as is the case in both datasets we consider with the 3HRF basis), it is possible to find an orthogonal transformation that significantly speeds up the computation of the Rank-1 model. Let be the “thin” QR decomposition of , that is, with an orthogonal matrix and a triangular matrix. Because of the invariance of the Euclidean norm to orthogonal transformations, the change of variable , yields a Rank-1 model in Eq. (1) with equivalent solutions. This reduces the size of the design matrix to a square triangular matrix of size (instead of ) and reduces the explained variable to a vector of size (instead of ). After this change of variable, the convergence of the Rank-1 model is significantly faster due to the faster computation of the objective function and its partial derivatives. We have observed that the total running time of the algorithm can be reduced by 30% using this transformation.

Some numerical solvers such as L-BFGS-B (liu1989limited) require the constraints to be given as box constraints. While our original problem includes an equality constraint we can easily adapt it to use convex box constraints instead. We replace the equality constraint by the convex inequality constraint , which is equivalent to the box constraint supported by the above solver. However, this change of constraint allows solutions in which can be arbitrarily close to zero. To avoid such degenerate cases we add the smooth term to the cost function. Since there is a free scale parameter between and , this does not bias the problem, but forces to lie as far as possible from the origin (thus saturating the box constraints). Once a descent direction has been found by the L-BFGS-B method we perform a line search procedure to determine the step length. The line-search procedure was implemented to satisfy the strong Wolfe conditions (nocedal2006numerical). Finally, when the optimization algorithm has converged to a stationary point, we rescale the solution setting to ensure that the equality constraint. This still leaves a sign ambiguity between the estimated HRF and the associated beta-maps. To make these parameters identifiable, the sign of the estimated HRF will be chosen so that these correlate positively with the reference HRF.

We have compared several first-order (Conjugate Gradient), quasi-Newton (L-BFGS) and Newton methods on this problems and found that in general quasi-Newton methods performed best in terms of computation time. In our implementation, we adopt the L-BFGS-B as the default solver.

In Algorithm 1 we describe an algorithm based on L-BFGS that can be used to optimize R1-GLM and R1-GLMS models (a reference implementation for the Python language is described in subsection Software). Variable is only used for the R1-GLMS method and its use is denoted within parenthesis, i.e. , so that for the R1-GLM it can simply be ignored.

The full estimation of the R1-GLM model with 3HRF basis for one subject of the dataset described in section Dataset 2: decoding of potential gain levels ( conditions, time points, voxels) took 14 minutes in a 8-cores Intel Xeon 2.67GHz machine. The total running time for the 17 subjects was less than four hours.

### 2.5 Software

We provide a software implementation of all the models discussed in this section
in the freely available (BSD licensed) pure-Python package hrf_estimation
^{13}^{13}13https://pypi.python.org/pypi/hrf_estimation.

## 3 Data description

With the aim of making the results in this paper easily reproducible, we have chosen two freely available datasets to validate our approach and to compare different HRF modeling techniques.

### 3.1 Dataset 1: encoding of visual information

The first dataset we will consider is described in (Kay2008; naselaris2009bayesian; kay2011data). It contains BOLD fMRI responses in human subjects viewing natural images. As in (Kay2008), we performed prediction of BOLD signal following the visual presentation of natural images and compared it against the measured fMRI BOLD signal. As the procedure consists of predicting the fMRI data from stimuli descriptors, it is an encoding model. This dataset is publicly available from http://crcns.org

Two subjects viewed 1750 training images, each presented twice, and 120 validation images, each presented 10 times, while fixating a central cross. Images were flashed 3 times per second (200 ms on-off-on-off-on) for one second every 4 seconds, leading to a rapid event-related design. The data were acquired in 5 scanner sessions on 5 different days, each comprising 5 runs of 70 training images –each image being presented twice within the run– and 2 runs of validation images showing 12 images, 10 times each. The images were recorded from the occipital cortex at a spatial resolution of 2mm2mm2.5mm and a temporal resolution of 1 second. Every brain volume for each subject has been aligned to the first volume of the first run of the first session for that subject. Across-session alignment was performed manually. Additionally, data were temporally interpolated to account for slice-timing differences. See (Kay2008) for further preprocessing details.

We performed local detrending using a Savitzky-Golay filter (savitzky1964smoothing) with a polynomial of degree 4 and a window length of 91 TR. The activation coefficients (beta-map) and HRF were extracted from the training set by means of the different methods we would like to compare. The training set consisted of 80% of the original session (4 out of 5 runs). This resulted in estimated coefficients (beta-map) for each of the images in the training set.

We proceed to train the encoding model. The stimuli are handled as local image contrasts, that are represented by spatially smoothed Gabor pyramid transform modulus with 2 orientations and 4 scales. Ridge regression (regularization parameter chosen by Generalized Cross-Validation (golub1979generalized)) was then used to learn a predictor of voxel activity on the training set. By using this encoding model and the estimated HRF it is possible to predict the BOLD signal for the 70 images in the test set (20 % of the original session). We emphasize that learning the HRF on the training set instead of on the full dataset is necessary to avoid overfitting while assessing the quality of the estimated HRF by any HRF-learning method: otherwise, the estimation of the HRF may incorporate specificities of the test set leading to artificially higher scores.

In a first step, we perform the image identification task from (Kay2008). From the training set we estimate the activation coefficients that will be used to compute the activation maps. We use an encoding model using Gabor filters that predicts the activation coefficient from the training stimuli. From the stimuli in the validation set we predict the activation coefficients that we then use to identify the correct image. The predicted image is the one yielding the highest correlation with the measured activity. This procedure mimics the one presented in (Kay2008, Supplementary material).

In a second step, we report score as the Pearson correlation between the measurements and the predicted BOLD signal on left out data. The prediction of BOLD signal on the test set is performed from conditions that were not present in the train set. In order to do this, an HRF for these conditions is necessary. As highlighted in the methods section, the construction of an HRF for these conditions is ambiguous for non Rank-1 methods that perform HRF estimation on the different stimuli. In these cases we chose to use the mean HRF across conditions as the HRF for unseen conditions. Finally, linear predictions on the left out fold were compared to the measured BOLD signals.

### 3.2 Dataset 2: decoding of potential gain levels

The second dataset described in (Tom2007) is a gambling task where each of the 17 subjects was asked to accept or reject gambles that offered a 50/50 chance of gaining or losing money. The magnitude of the potential gain and loss was independently varied across 16 levels between trials. Each gamble has an amount of potential gains and potential losses that can be used as class label. In this experiment, we only considered gain levels. This leads to the challenge of predicting or decoding the gain level from brain images. The dataset is publicly available from http://openfmri.org under the name mixed-gambles task dataset.

The data preprocessing included slice timing, motion correction, coregistration to the
anatomical images, tissue segmentation, normalization to MNI space and was
performed using the SPM 8 software through the
Pypreprocess^{14}^{14}14https://github.com/neurospin/pypreprocess
interface.

For all subjects three runs were recorded, each consisting of 240 images with a repetition time (TR) of 2 seconds and a stimulus presentation at every 4 seconds. In order to perform HRF estimation on more data than what is available on a single run, we performed the estimation on the three runs simultaneously. This assumes HRF consistency across runs, which was obtained by concatenating the data from the three runs and creating a block-diagonal design matrix correspondingly (each block is the design of one run).

After training a regression model on 90% of the data, we predict the gain level on the remaining 10%. As a performance measure we use Kendall tau rank correlation coefficient (kendall1938new) between the true gain levels and the predicted levels, which is a measure for the orderings of the data. We argue that this evaluation metric is better suited than a regression loss for this task because of the discrete and ordered nature of the labels. Also, this loss is less sensible to shrinkage of the prediction that might occur when penalizing a regression model (bekhti:hal-01032909). The Kendall tau coefficient always lies within the interval , with being perfect agreement between the two rankings and perfect disagreement. Chance level lies at zero. This metric was previously proposed for fMRI decoding with ordered labels in (Doyle2013).

## 4 Results

In order to compare the different methods discussed previously, we ran the same encoding and decoding studies while varying the estimation method for the activation coefficients (beta-maps). The methods we considered are standard GLM (denoted GLM), GLM with separate designs (GLMS), Rank-1 GLM (R1-GLM) and Rank-1 GLM with separate designs (R1-GLMS). For all these models we consider different basis sets for estimating the HRF: a set of three elements formed by the reference HRF and its time and dispersion derivative, a FIR basis set (of size 20 in the first dataset and of size 10 in the second dataset) formed by the canonical vectors and the single basis set formed by the reference HRF (denoted “fixed HRF”), which in this case is the HRF used by the SPM 8 software.

It should be reminded that the focus of this study is not the study of the HRF in itself (such as variability across subjects, tasks or regions) but instead its possible impact on the accuracy of encoding and decoding paradigms. For this reason we report encoding and decoding scores but we do not investigate any of the possible HRF variability factors.

### 4.1 Dataset 1: encoding of visual information

In the original study, 500 voxels were used to perform image identification. These voxels were selected as the voxels with the highest correlation with the true BOLD signal on left-out data using a (classical) GLM with the reference HRF. These voxels are therefore not the ones naturally benefiting the most from HRF estimation.

We first present the scores obtained in the image identification task for different variants of the GLM. This can be seen in Figure 1. The displayed score is the count of correctly identified images over the total number of images (chance level is therefore at 1/120). The identification algorithm here only uses the beta-maps obtained from the train and validation set. This makes the estimation of the HRF an intermediate result in this model. However, we expect that a correct estimation of the HRF directly translates into a better estimation of the activation coefficients in the sense of being able to acheive higher predictive accuracy. Our results are consistent with this hypothesis and in this task the rank-one (R1) and glm-separate (GLMS) models outperform the classical GLM model. The benefits range from 0.9% for R1-GLM in subject 2 to 8.2% for the same method and subject 1. It is worth noticing that methods with FIR basis obtain a higher score than methods using the 3HRF basis.

In order to test whether this increase is statistically significant we performed the following statistical test. The success of recovering the correct image can be modeled as a binomial distribution, with being be the probability of recovering the correct image with method A and be the probability of recovering the correct image with method B. We define the null hypothesis as the statement that both probabilities are equal, , and the alternate hypothesis that both probabilities and not equal, (this test is sometimes known as the binomial proportion test (rohmel1999unconditional)). The score test statistic for the one-tailed test is , where and is the number of repetitions, in this case . This statistic is normally distributed for large . The p-value associated with this statistical test when comparing every model (by order of performance) with the model “GLM with with fixed HRF” is for the first subject and for the second.

We will now use a different metric for evaluating the performance of the encoding model. This metric is the Pearson correlation between the BOLD predicted by the encoding model and the true BOLD signal, averaged across voxels. We will compute the this metric on a left-out session, which results in five scores for each method, corresponding to each of the cross-validation folds. Given two methods, a Wilcoxon signed-rank test can be used on these cross-validation scores to assess whether the score obtained by the two methods are significantly different. This way, irrespective of the variance across voxels, which is inherent to the study, we can reliably assess the relative ranking of the different models. In Figure 2 we show the scores for each method (averaged across sessions) and the p-value corresponding the Wilcoxon test between a given method and the previous one by order of performance.

We observed in Figure 2 that methods that learn the HRF together with some sort of regularization (be it Rank-1 constraint or induced by separate designs) perform noticeably better than methods that perform unconstrained HRF estimation, highlighting the importance of a robust estimation of the HRF as opposed to a free estimation as performed by the standard GLM model with FIR basis. This suggests that R1 and GLMS methods permit including FIR basis sets while minimizing the risk of overfitting inherent to the classical GLM model.

We also observed that models using the GLM with separate designs from (Mumford2012) perform significantly better on this dataset than the standard design, which is consistent with the purpose of these models. It improves estimation in highly correlated designs. The best performing model for both subjects in this task is the R1-GLMS with FIR basis, followed by the R1-GLM with FIR basis model for subject 1 and GLMS with FIR basis for subject 2. The difference between both models (Wilcoxon signed-rank test) was significant with a p-value . Since the results for both subjects are similar, we will only use subject 1 for the rest of the figures.

To further inspect the results, we investigated the estimation and
encoding scores at the voxel level. This provides some valuable information.
For example, parameters such as time-to-peak, width and undershoot of the
estimated HRF can be used to characterize the mis-modeling of a reference HRF
for the current study. Also, a voxel-wise comparison of the different methods
can be used to identify which voxels exhibit a greater improvement for a given
method. In the upper part of Figure 3 we show the HRF
estimated on the first subject by our best performing method (the Rank-1 with separate designs and
FIR basis). For comparison we also present two commonly used
reference HRFs: one used in the software SPM and one defined in (Glover1999, auditory
study) and used by software such as
NiPy^{15}^{15}15http://nipy.org and
fmristat^{16}^{16}16http://www.math.mcgill.ca/keith/fmristat/ . Because the HRF
estimation will fail on voxels for which there is not enough signal, we only
show the estimated HRF for voxels for which the encoding score is above the
mean encoding score. In this plot the time-to-peak of the estimated HRF is
color coded. One can observe a substantial variability in the time to peak,
confirming the existence of a non-negligeable variability
of the estimated HRFs, even within a single subject and a single task. In
particular, we found that only 50% of the estimated HRFs on the full brain volume
peaked between 4.5 and 5.5 seconds.

In the lower part of Figure 3 we can see a scatter plot in which the coordinates of each point are the encoding scores with two different methods. The first coordinate (X-axis) is given by the score using a canonical GLM whilst the second coordinate (Y-axis) corresponds to the Rank-1 separate with FIR basis. Points above the black diagonal exhibit a higher score with our method than with a canonical GLM. As previously, the color represents the time to peak of the estimated HRF. From this plot we can see that voxels that have a low correlation score using a canonical GLM do not gain significant improvement by using a Rank-1 Separate FIR model instead. However, voxels that already exhibit a sufficiently high correlation score using a canonical GLM () see a significant increase in performance when estimated using our method.

These results suggest as a strategy to limit the computational cost of learning the HRF on an encoding study to perform first a standard GLM (or GLMS) on the full volume and then perform HRF estimation only on the best performing voxels.

The methods that we have considered for HRF estimation can be subdivided according to the design matrices they use (standard or separate) and the basis they use to generate the estimated HRF (3HRF and FIR). We now focus on the performance gains of each of these individual components. In the upper part of Figure 4 we consider the top-performing model, the Rank-1 GLMS, and compare the performance of two different basis sets: FIR with 20 elements in the Y-axis and the reference HRF plus its time and dispersion derivatives (3HRF) in the X-axis. The abundance of points above the diagonal demonstrates the superiority of the FIR basis on this dataset. The color trend in this plot suggests that the score improvement of the FIR basis with respect to the 3HRF basis becomes more pronounced as the time-to-peak of the estimated HRF deviates from the reference HRF (peak at 5s), which can be explained by observing that the 3HRF basis corresponds to a local model around the time-to-peak. In the bottom part of this figure we compare the different design matrices (standard or separate). Here we can see the voxel-wise encoding score for two Rank-1 models with FIR basis and different design matrices: separate design on the Y-axis and classical design on the X-axis. Although both models give similar results, a Wilcoxon signed-rank test on the leave-one-session-out cross-validation score confirmed the superiority of the separate designs model in this dataset with p-value .

In Figure 5 we can see the voxel-wise encoding score on a single acquisition slice. In the upper column, the score is plotted on each voxel and thresholded at a value of 0.045, which would correspond to a p-value for testing non-correlation assuming each signal is normally distributed, while in the bottom row the 0.055 contour (p-value ) for the same data is shown as a green line. Here it can be seen how the top performing voxels follow the gray matter. A possible hypothesis to explain the increase of the encoding score between the method R1-GLMS with FIR basis and the same method with 3HRF basis could be related either to the shape of the HRF deviating more from a canonical shape in lateral visual areas or to the higher signal-to-noise ratio often found in the visual cortex when compared to lateral visual areas.

### 4.2 Dataset 2: decoding of potential gain levels

The mean decoding score was computed over 50 random splittings of the data, with a test set of size 10%. The decoding regression model consisted of univariate feature selection (ANOVA) followed by a Ridge regression classifier as implemented in scikit-learn (Pedregosa2011). Both parameters, number of voxels and amount of regularization in Ridge regression, were chosen by cross-validation.

The mean score for the 10 models considered can be seen in Figure 6. Similarly to how we assessed superiority of a given method in encoding, we will say that a given method outperforms another if the paired difference of both scores (this time across folds) is significantly greater than zero. This is computed by performing a Wilcoxon signed rank test across voxels. For this reason we report p-values together with the mean score in Figure 6.

As was the case in encoding, Rank-1 constrained methods obtain the highest scores. In this case however, methods with 3HRF basis outperform methods using FIR basis. This can be explained by factors such as smaller sample size of each of the runs, smaller number of trials in the dataset and experimental design.

## 5 Discussion

We have compared different HRF modeling techniques and examined their generalization score on two different datasets: one in which the main task was an encoding task and one in which it was a decoding task. We compared 10 different methods that share a common formulation within the context of the General Linear Model. This includes models with canonical and separate designs, with and without HRF estimation constrained by a basis set, and with and without rank-1 constraint. We have focused on voxel-independent models of the HRF, possibly constrained by a basis set, and have omitted for efficiency reasons other possible models such as Bayesian models (marrelec2003robust; Ciuciu2003; Makni2005) and regularized methods (Goutte2000; Casanova2008).

Other models such as spatial models (vincent2010spatially), and multi-subject methods (Zhang2012; Zhang2013) that adaptively learn the HRF across several subjects are outside the scope of this work. The latter models are more relevant in the case of standard group studies and second level analysis.

Our first dataset consists of an encoding study and revealed that it is possible to boost the encoding score by appropriately modeling the HRF. We used two different metrics to assess the quality of our estimates. The first metric is the fraction of correctly identified images by an encoding model. For this we computed the activation coefficients on both the training and validation dataset. We then learned a predictive model of the activation coefficients from the stimuli. This was used to identify a novel image from a set of 120 potential images from which the activation coefficients were previously computed. The benefits range from 0.9% points to 8.2% points across R1-constrained methods and subjects. The best-performing model in this task is the R1-GLM with FIR basis. The second metric is the Pearson correlation. By considering the voxel-wise score on a full brain volume we observed that the increase in performance obtained by estimating the HRF was not homogeneous across voxels and more important for voxels that already exhibited a good score with a classical design (GLM) and a fixed HRF. The best-performing method is the Rank-1 with separate designs (R1-GLMS) and FIR basis model, providing a significant improvement over the second best-performing model. We also found substantial variability of the shape in the estimated HRF within a single subject and a single task.

The second dataset consists of a decoding task and the results confirmed that constrained (rank-1) estimation of the HRF also increased the decoding score of a classifier. The metric here is Kendall tau. However, in this case the best performing basis was no longer FIR basis consisting of ten elements but the three elements 3HRF basis (HRF and derivatives) instead, which can be explained by factors such as differences in acquisition parameters, signal-to-noise ratio or by the regions involved in the task.

A higher performance increase was observed when considering the correlation score within the encoding model. This higher sensitivity to a correct (or incorrect) estimation of the HRF can be explained by the fact that the estimation of the HRF is used to generate the BOLD signal on the test set. The metric is the correlation between the generated signal and the BOLD signal. It is thus natural to expect that a correct estimation of the HRF has a higher impact on the results.

In the decoding setup, activation coefficients (beta-map) are computed but the evaluation metric is the accuracy at predicting the stimulus type. The validation metric used for decoding is less sensitive to the HRF estimation procedure than the correlation metric from the encoding study, although it allowed us to observe a statistically significant improvement.

## 6 Conclusion

We have presented a method for the joint estimation of HRF and activation coefficients within the GLM framework. Based on ideas from previous literature (Makni2008; vincent2010spatially) we assume the HRF to be equal across conditions but variable across voxels. Unlike previous work, we cast our model as an optimization problem and propose an efficient algorithm based on quasi-Newton methods. We also extend this approach to the setting of GLM with separate designs.

We quantify the improvement in terms of generalization score in both encoding and decoding settings. Our results show that the rank-1 constrained method (R1-GLM and R1-GLMS) outperforms competing methods in both encoding and decoding settings.

Acknowledgements This work was supported by grants IRMGroup ANR-10-BLAN-0126-02 and BrainPedia ANR-10-JCJC 1408-01. We would like to thank our colleagues Ronald Phlypo and Gael Varoquaux for fruitful discussions.