This paper develops tools to characterize how species are affected by environmental variability, based on a functional single index model relating a response such as growth rate or survival to environmental conditions. In ecology, the curvature of such responses are used, via Jensen’s inequality, to determine whether environmental variability is harmful or beneficial, and differing nonlinear responses to environmental variability can contribute to the coexistence of competing species.
Here, we address estimation and inference for these models with observational data on individual responses to environmental conditions. Because nonparametric estimation of the curvature (second derivative) in a nonparametric functional single index model requires unrealistic sample sizes, we instead focus on directly estimating the effect of the nonlinearity, by comparing the average response to a variable environment with the response at the expected environment, which we call the Jensen Effect. We develop a test statistic to assess whether this effect is significantly different from zero. In doing so we re-interpret the SiZer method of Chaudhuri and Marron (1999) by maximizing a test statistic over smoothing parameters. We show that our proposed method works well both in simulations and on real ecological data from the long-term data set described in Drake (2005).
The Jensen Effect and Functional Single Index Models:
Estimating the Ecological Implications of Nonlinear Reaction Norms
Zi Ye111PhD Candidate, Department of Statistical Science, Cornell University, email@example.com, Giles Hooker222Associate Professor, Department of Statistical Science, Cornell University, firstname.lastname@example.org, Stephen Ellner333Professor, Department of Ecology and Evolutionary Biology, Cornell University, email@example.com
1.1 Ecological Questions
In natural ecosystems, environmental conditions are highly variable over time and space (e.g., Vasseur and McCann, 2007) and many classical questions in ecology and evolution are therefore concerned with the potential consequences of this variation. Two important topics have been how species’ traits and life histories evolve so that species can persist in environments that may be favorable at some times and unfavorable at others (see Cohen (1966) and Koons et al. (2008)). Nonlinear responses to environmental variability can contribute to maintaining the biodiversity of competing species, allowing them to coexist stably (see for example Hutchinson (1961); Chesson and Warner (1981); Ellner (1987); Chesson (1994, 2000b, 2000a)). Nonlinear responses to environmental conditions are also important for forecasting responses to climate change, as environmental variability can either increase population growth rate (Drake (2005); Koons et al. (2009)) or decrease it (Lewontin and Cohen (1969)), depending on the shape of the norm of reaction between environmental variables and the components of population growth rate (survival, growth, and reproduction).
The goal of this paper is to develop methods for determining whether environmental variability is beneficial or harmful for some component of population growth rate, which will depend on whether the response curve is concave-up or concave-down over the range of environmental variability. Common statistical models for species growth and survival make parametric assumptions involving specific forms of nonlinearity, which can pre-determine the effect of environmental variability in the model. For example, in a logistic regression model for survival where the logit(survival) is a linear function of covariates, small variance in one covariate about its mean will increase average survivorship if survival at the mean covariate is below 0.5, and decrease average survivorship if survival at the mean covariate is above 0.5. The conclusion is then driven more by statistical convention than by the data. Recent statistical research in semi-parametric or nonparametric methods inspires us to understand the effect of environmental variation via nonparametric models that do not impose these assumptions. Specifically, we consider spline-based methods (see Wood (2000), Ramsay (2006), Ramsay et al. (2009) and Ruppert et al. (2003)) to predict nonlinear responses under environmental fluctuation.
The nonparametric model considered here is
where and are the growth (or future size) and environment of a plant. The function is the link function to be estimated, and is random error. We assume that the environmental is described by a climate history, such as temperature or precipitation observed at a fine time-scale, such as daily resolution. Following Teller et al. (2016), these are thought of as functional covariates, leading to representation of as a functional linear term
where is the coefficient function to be estimated, and is the observed climate history.
To assess the impact of variability in on the growth rate , we need to compare (growth under constant conditions) and . By the Jensen’s inequality, we have that if is a convex function, and a varying environment will accelerate the growth. Otherwise, if the link function is concave, a constant environment at the average of environmental factor would be more beneficial.
1.2 Model Formulation
Combining and , a functional model for observed species growth is
This is the Functional Single Index model introduced in Chen et al. (2011) and Ma (2016). The functional single index model is an extension of the single index model to a functional covariate via . This model allows a nonlinear relationship between response and covariate function . In addition, it improves stability in estimating link function through imposing smoothness on .
To assess the impact of environmental variability, we need to test the convexity or concavity of the link function . A first approach would be to estimate the curvature of link function over a domain. We illustrate our estimation procedure in Section , where we show that estimating second derivatives requires unrealistic sample sizes.
However, we observed that estimate of the link function obtained by our procedure is quite accurate. So instead of evaluating the curvature of , we examine the “Jensen Effect” directly, and estimate the quantity , or the sign of Jensen’s inequality, which only involves an estimate of . Inspired by the SiZer method Chaudhuri and Marron (1999), we derive a test based on the maximum estimated value over a range of smoothing parameters. We show that our estimate and hypothesis test work well on both simulated and real data set.
1.3 Related Literature
There is a large literature on the single index model examining both applied methodology and theoretical properties. The link function and the coefficient vector have been estimated by three different methods. (1) The most widely used is the Projection Pursuit Regression (PPR) approach introduced in Hardle et al. (1993). This method is a nested estimation procedure, with the link function estimated by local polynomial approximation and the coefficient function by minimizing the MSE. Theoretical properties were studied in Hardle et al. (1993) and Ichimura (1993). (2) The Average Derivative approach was introduced in Hristache et al. (2001). (3) Li (1991) introduced the Sliced Inverse Regression method, which considered the estimation of the coefficient vector as a dimension-reduction problem.
In contrast, there are few studies of the functional single index model. A counterpart to the Projection Pursuit Regression was introduced in Chen et al. (2011), where the coefficient function was approximated by a spline basis and the coefficient vector is estimated. In addition, a convergence rate was found for this method. Ma (2016) used two spline bases to approximate the coefficient function and the link function, respectively, and derived some asymptotic properties of the resulting estimate.
The SiZer (SIgnificant ZERo crossings of derivative) method that we adopt to assess significance, introduced in Chaudhuri and Marron (1999), was designed to bypass smoothing parameter selection by comparing the estimates of a curve over a range of smoothing parameters. Our test statistic is inspired by the SiZer method. Instead of trying to select an optimal smoothing parameter for estimation and inference, we examine estimates over a range of smoothing parameters and do inference based on maximizing a test statistic over that range.
2 Estimating Curvature
This section provides a short demonstration of the difficulties of estimating curvature nonparametrically in a functional single index model, which would be necessary for directly using Jensen’s inequality to examine the ecological consequences of environmental variability. We first define an estimate for and discuss some of the numerical challenges it involves, and then provide some results that attempt to estimate directly when selecting smoothing parameters by GCV.
2.1 Estimation Procedure
We consider estimating , and via a smoothed basis expansion. Assume that independent and identically distributed data pairs are observed where is a real-valued function on [0,1]. The Functional Single Index model is
where the coefficient function is, like X(t), defined on the interval . is assumed to be Gaussian random error. This integral may need to be evaluated numerically, depending on the representations used for and , and we assume that this is done up to ignorable error throughout the calculations below. To ensure identifiability of the model, we require that .
We use a -dimensional B-spline basis for the link function . For any in the range of possible values, the link function can be written as
where is a -dimensional column coefficient vector.
We use a -dimensional Fourier basis for the coefficient function , such that
where is a -dimensional column coefficient vector, and . The constraint is reduced to requiring in this case.
The coefficient vectors and are estimated by minimizing a penalized sum of squares
where and the penalty matrices and are available analytically for most common choices of basis expansion.
Equation specifies a nonlinear optimization problem, which we solve numerically using built-in optimizers in R (see below). Denoting the estimated coefficients as and , the estimates are
2.2 Bases, Optimization and Cross-Validation
Our objective criterion (4) requires nonlinear numerical optimization. In our experiments below we have used the R function optim with some additional modifications. The simulations reported in Section 4 used the BFGS gradient-based optimizer. However, at large values of we find that very tight convergence criteria are needed to reduce the numerical error below that of the estimated noise. This was mitigated with two strategies:
We initialize our optimization with chosen so that is exactly linear and is obtained from functional linear regression.
We re-initialize BFGS once it converges, and run it a second time. BFGS uses a sequentially-calculated approximate Hessian, and can stop early due to poor estimation of this Hessian. Re-initialization resets the approximate Hessian to the identity, so that optimization is restarted with a steepest descent step.
To maintain identifiability of our model, we normalize our estimate of within each evaluation of the objective function and multiply by .
In order to represent with a basis expansion, we need to control the range of its arguments. Throughout our estimates below, we have used the identifiability requirement that to use a range of where is the largest score for the maximum eigenvalue from a principal components decomposition of the . If we replace the argument with the corresponding end-point of the range and add a penalty of to the objective (4). In practice, while we find that this exceedence can occur during optimization, it never appears in the final result.
To illustrate the impact of initial conditions on our estimates, as well as the challenge of estimating curvature in functional single index models, Section 2.3 below presents examples of estimating with initial conditions either given by the true basis coefficients and or by starting from , using Nelder-Mead optimization. In each case, different initialization procedures yield quite different performance statistics, but all perform poorly in estimating curvature.
While our assessment of statistical significance avoids selecting , it will be useful to have a value for visualization and for an estimate of residual variance. To choose , we define a smoother matrix associated with , and the GCV value for selecting is calculated from
where is the -dimensional identity matrix. We derive from a Taylor expansion in (5) below.
2.3 A Simulated Demonstration
We present here a brief simulation study to observe the accuracy of curvature estimates. The covariate function was generated based on a Fourier basis
where , , , , , and are i.i.d with , , , . The coefficient function is
We observe that the coefficients for satisfy , under an orthonormal basis. The random errors
are simulated as i.i.d. Gaussian noise with mean and .
We selected the sample size as and examined three link functions:
To measure the performance of our estimators we define the MSE of the estimated and to be
where for .
Of particular concern in the results (Table 1) is the substantial discrepancy between estimates from different initial conditions. Ye and Hooker (2018) similarly observed that second derivative estimates were highly sensitive to the effort placed into optimization.
The plots in Figure 1 provide an example of our results. The estimate of the link function nearly overlaps the true curve, indicating that our estimate of the link function is quite accurate. However, for the second derivative, the estimate deviates from the true curve, in fact becoming negative towards the right-hand limit. This reduced accuracy is also evident in the results in Table 1. These plots indicate that our estimate of the curvature is not good enough to use as a basic for decisions on the convexity of . In addition, the performance of the estimators varies a lot from different initial values. In Figure 2 we see that different initial conditions can lead to either over- or under-fitting . Further examples are provided in Appendix A. We therefore directly assess the Jensen Effect via other methods, introduced in the next section.
3 Jensen Effect
The ecological interest in is in the comparison of and . Because reliable estimate of requires unrealistic sample sizes, we instead compare these quantities directly to estimate what we call the “Jensen Effect”.
We define a difference statistic by
where . If the link function is convex, then which indicates better growth with a varying environment; otherwise, the difference and a constant environment is better for growth. However, this estimate still depends on the smoothing parameters and .
To account for the choice of smoothing parameters, we extend the SiZer method, first introduced for local linear regression in Chaudhuri and Marron (1999), and then extended to smoothing splines setting in Marron and Zhang (2005). SiZer is a graphical tool to make inference about features on an estimated curve, and detect features of that curve in an innovative way. Traditional nonparametric methods have difficulties in selecting a appropriate bandwidth in a local linear approximation, or a smoothing parameter in a smoothing spline. Classical methods such that -fold cross-validation and generalized cross-validation provide point estimates, but it is difficult to incorporate their uncertainty into inference about the resulting curves. The SiZer method skips the process of selecting a tuning parameter. Instead, it examines the estimate of a curve over a range of tuning parameters, and observes features of the curve for each combination of observation point and tuning parameter. Inspired by the SiZer method, we look the difference over a range of values for and , and generate hypothesis tests using the maximum or minimum value of as a function of .
3.1 Hypothesis Test 1: Nonparametric Smoothing
We begin by briefly developing our SiZer-inspired test for a standard smoothing spline (treating the environment as known) before developing the test for a functional single index model. Here defining to be matrix of evaluations, and to be the second derivative penalty matrix, the standard smoothing spline estimate is
Define the -dimensional column vector and the augmented set of evaluation points with corresponding evaluation matrix . We can write
which we can standardize to obtain the -statistic
We treat the value of as a Gaussian process over with covariance function
which involves no unknown parameters. We can thus use as a test statistic, obtaining critical values by simulating from the Gaussian process . In practice, we need to estimate which we do based on the value selected by GCV.
An important consideration here is that we expect to inherit smoothing bias, but this should result in under-estimation of the Jensen Effect because it will shrink the estimated second derivative. By examining over the whole range of we can assess this effect at various levels of smoothing while maintaining a conservative test. We do still need to choose by GCV in our estimate . We expect to be relatively insensitive to the specific chosen; maintaining a constant in the -statistic removes the need to account for changes in across .
3.2 Hypothesis Test 2: Functional Single Index Models
The functional single index model complicates the process described above by including two smoothing parameters and nonlinear effects of , necessitating a Taylor expansion to approximate the recipe above. For each pair of smoothing parameters , we obtain an estimate of , , and , denoted as . Defining
the estimated difference function given is
To construct a t-statistic to test the significance of , an estimate of the variance of the difference function is needed. The estimated difference function is defined on an estimate of and , which are the coefficients of and respectively. Therefore, we need to calculate the covariance of the estimated and .
Recall (4), the penalized least squares criterion to be minimized, and define the matrices of linear basis effects and evaluations of the link function bases and derivatives with taken at its expected estimate. We derive gradients of PLS as
and expected Hessian
We can now obtain the sandwich covariance
where we estimate from
As as in Ruppert et al. (2003), the residual degree of freedom is defined as
is an approximate smoother matrix in which we use the values of selected by GCV.
We now define a t-statistic for as a function of ,
where is given by
Defining the -dimensional row vector as
the estimated covariance matrix of is . is therefore approximately a Gaussian process indexed by and we can get the estimated variance of from .
We want to test if . Denote the number of the smoothing parameters as , we test : . Under ,
, where is a -dimensional column vector, and the covariance matrix is -dimensional with the term equals to . The test statistic that we examine is .
In order to obtain a critical value for this statistic, we repeatedly simulate from and obtain a distribution for .
4 Simulation Study
In this section, we use simulated data to explore the power of our test for both single index and functional single index models.
4.1 Single Index Model
We test for a Jensen Effect by calculating the difference function over a range of smoothing parameters. If the link function is convex, the function will be positive for most of values, although it may have high variance at low and high bias at high . For each simulation, we conduct the hypothesis test introduced in previous section.
Our simulation study starts with the single index model with covariates generated uniformly on [-0.5, 0.5], and the coefficient so that .
To illustrate the Jensen Effect, we choose three different link functions, (1) , (2) , (3) . We represented by a 25-dimensional quintic B-spline basis. For each link function, we simulated data sets of size 100, with error standard deviation 0.1. We obtained critical values for our test by simulating normal samples from the null distribution. Figure 3 presents a sample of and functions functions versus for ; plots for the other link functions are in Appendix B. The rejection rates for these functions are: , and respectively.
4.2 Functional Single Index Model
To define a distribution for the functional covariates, we use a 25-dimensional Fourier basis , where . The covariate functions are generated as
where . The coefficient function is
Again we used the three link functions , , . We represented by a 25-dimensional quintic B-spline basis. For each link function, we generated simulated data sets of size 100 with error standard deviation 0.1, and for each such data set we generated normal samples from the null distribution to obtain critical values.
A plot of the and functions for is presented in Figure 4. We have placed equivalent plots for and in Appendix C. The rejection rates for the three link functions were , and , showing very good power with a reasonable sample size and close to nominal rate when the null hypothesis is true (no curvature).
4.3 Power Analysis
To investigate the power of our test in more detail we consider a series of increasingly nonlinear link functions
with for the single index model and for the functional single index model. As increases, becomes strongly convex. For each , we generate simulated data sets and again used normal samples under the null distribution to obtain critical values.
Figure 5 presents the rejection rate plotted against . We observe a sharp increase as increases, as expected. As the link function becomes more and more convex, the rejection rate will converge to 1.
5 Application to real ecological data
To demonstrate application of our tests, we analyze the North Temperate Lakes LTER: Zooplankton - Trout Lake Area data set (https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-ntl&identifier=37&revision=29). An earlier version of these data were analyzed by Drake (2005) to examine the temperature-dependence of copepod populations. In our data set, the density of the populations of nine species of copepods and rotifers, along with water temperature, were recorded from to in different lakes. Our choice of species was determined based on the number and length of observations available and differs from those studied in Drake (2005). Note that the growth response in these data is not the growth (in size) of an individual, but the growth (in numbers) of a population, but our functional single index model is still appropriate for this setting.
The values recorded in the original data set are:
: species’ density at a specific time and lake.
: record of temperature corresponding to the same time as . The temperature was recorded on irregular time points among different years and lakes, so we pre-processed the temperature data by fitting a smoothing spline.
At the time of each observed response, we used lake temperature values over the preceding days as the climate history covariate . For each lake, we fit a penalized spline functional single index model for the growth in population density as a function of temperature history in each lake. We used 37 fourier Basis functions to represent in each case and a 21-dimensional cubic B-spline basis to represent , and searched over values of from to .
The -values for 9 species are given in Table 2; plots of the estimates of , and functions are given in Appendix D. Kellicottia longispina and Polyarthra remata have significantly nonzero Jensen effects (), indicating evidence of nonlinear responses by those species. The function of these two species are different from over the prespecified range of smoothing parameters. In the plots of the function of these two species, we observe that their functions are above zero over the range of , showing that their link functions are convex. Thus, their average growth rate is increased by the presence of environmental variation.
To determine whether a species’ growth rate is increased or decreased by environmental variability, we examine the Jensen Effect. Our first attempt in this direction was to estimate the curvature of the link function in a Functional Single Index model. In our penalized spline based method, we found that unrealistic sample size was needed to obtain accurate estimates. So instead we investigated the net effect of Jensen’s inequality, by comparing the expected response (averaged across the environmental variation) to the response at the expected environment. Inspired by the SiZer method, our test is based on maximizing across a wide range encompassing all plausible smoothing parameter values, thus avoiding the need for smoothing parameter selection. We have shown that our method and test work well on both simulated and real data.
There are multiple potential extensions of this methodology. We have used observed data as representative of the covariates of interest. However, the test can be conducted for any assumed distribution of covariates and it may be of interest to describe regions of single index values in which the estimated link function exhibits Jensen’s Inequality. We have also applied this model to growth data, using least-squares. For survival data all standard link functions impose particular curvature structures. These can be modified by adding a nonparametric function within the link, but obtaining equivalent tests – since the relevant effect is on survival itself, not its logit-transform – requires further development.
- Chaudhuri and Marron (1999) Chaudhuri, P. and J. S. Marron (1999). Sizer for exploration of structures in curves. Journal of the American Statistical Association 94(447), 807–823.
- Chen et al. (2011) Chen, D., P. Hall, H.-G. Müller, et al. (2011). Single and multiple index functional regression models with nonparametric link. The Annals of Statistics 39(3), 1720–1747.
- Chesson (1994) Chesson, P. (1994). Multispecies competition in variable environments. Theoretical population biology 45(3), 227–276.
- Chesson (2000a) Chesson, P. (2000a). General theory of competitive coexistence in spatially-varying environments. Theoretical population biology 58(3), 211–237.
- Chesson (2000b) Chesson, P. (2000b). Mechanisms of maintenance of species diversity. Annual review of Ecology and Systematics 31(1), 343–366.
- Chesson and Warner (1981) Chesson, P. L. and R. R. Warner (1981). Environmental variability promotes coexistence in lottery competitive systems. The American Naturalist 117(6), 923–943.
- Cohen (1966) Cohen, D. (1966). Optimizing reproduction in a randomly varying environment. Journal of theoretical biology 12(1), 119–129.
- Drake (2005) Drake, J. M. (2005). Population effects of increased climate variation. Proceedings of the Royal Society of London B: Biological Sciences 272(1574), 1823–1827.
- Ellner (1987) Ellner, S. (1987). Alternate plant life history strategies and coexistence in randomly varying environments. In Theory and models in vegetation science, pp. 199–208. Springer.
- Hardle et al. (1993) Hardle, W., P. Hall, H. Ichimura, et al. (1993). Optimal smoothing in single-index models. The annals of Statistics 21(1), 157–178.
- Hristache et al. (2001) Hristache, M., A. Juditsky, and V. Spokoiny (2001). Direct estimation of the index coefficient in a single-index model. Annals of Statistics 29(3), 595–623.
- Hutchinson (1961) Hutchinson, G. E. (1961). The paradox of the plankton. The American Naturalist 95(882), 137–145.
- Ichimura (1993) Ichimura, H. (1993). Semiparametric least squares (sls) and weighted sls estimation of single-index models. Journal of Econometrics 58(1-2), 71–120.
- Koons et al. (2008) Koons, D. N., C. J. E. Metcalf, and S. Tuljapurkar (2008). Evolution of delayed reproduction in uncertain environments: a life-history perspective. The American Naturalist 172(6), 797–805.
- Koons et al. (2009) Koons, D. N., S. Pavard, A. Baudisch, J. E. Metcalf, et al. (2009). Is life-history buffering or lability adaptive in stochastic environments? Oikos 118(7), 972–980.
- Lewontin and Cohen (1969) Lewontin, R. C. and D. Cohen (1969). On population growth in a randomly varying environment. Proceedings of the National Academy of Sciences 62(4), 1056–1060.
- Li (1991) Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association 86(414), 316–327.
- Ma (2016) Ma, S. (2016). Estimation and inference in functional single-index models. Annals of the Institute of Statistical Mathematics 68(1), 181–208.
- Marron and Zhang (2005) Marron, J. and J. T. Zhang (2005). Sizer for smoothing splines. Computational Statistics 20(3), 481–502.
- Ramsay (2006) Ramsay, J. O. (2006). Functional data analysis. Wiley Online Library.
- Ramsay et al. (2009) Ramsay, J. O., G. Hooker, and S. Graves (2009). Functional data analysis with R and MATLAB. Springer Science & Business Media.
- Ruppert et al. (2003) Ruppert, D., M. P. Wand, and R. J. Carroll (2003). Semiparametric regression. Cambridge university press.
- Teller et al. (2016) Teller, B. J., P. B. Adler, C. B. Edwards, G. Hooker, and S. P. Ellner (2016). Linking demography with drivers: climate and competition. Methods in Ecology and Evolution 7(2), 171–183.
- Vasseur and McCann (2007) Vasseur, D. A. and K. S. McCann (2007). The Impact of Environmental Variability on Ecological Systems. Springer Netherlands.
- Wood (2000) Wood, S. N. (2000). Modelling and smoothing parameter estimation with multiple quadratic penalties. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62(2), 413–428.
- Ye and Hooker (2018) Ye, Z. and G. Hooker (2018). Local quadratic estimation of the curvature in a functional single index model. arXiv preprint arXiv:1803.09321.
Appendix A Plots of Estimated Curvature
This section provides plots provide examples of the accuracy of estimating a second derivative in a functional single index model. In each we plot the true and estimated link function and its second derivative. We also plot our estimated functional single index values versus their truth and the distortion that produces for the estimated second derivative.
Appendix B Diagnostic Plots for the Jensen Effect: Single Index Model
These plots give example functions using a single index model and the corresponding functions for links and respectively.
Appendix C Diagnostic Plots for the Functional Single Index Model
These plots give example functions using a single index model and the corresponding functions for links and respectively.
Appendix D Plots for the copepod data
Plots 12 through 20 provide descriptive plots for each species from the copepod study. We provide a plot of versus with the value selected by GCV in red. We also provide plots of , and at the value of selected by GCV along with approximate pointwise confidence intervals.