A Fully Nonparametric Modeling Approach to Binary Regression

A Fully Nonparametric Modeling Approach to Binary Regression

Abstract

We propose a general nonparametric Bayesian framework for binary regression, which is built from modeling for the joint response-covariate distribution. The observed binary responses are assumed to arise from underlying continuous random variables through discretization, and we model the joint distribution of these latent responses and the covariates using a Dirichlet process mixture of multivariate normals. We show that the kernel of the induced mixture model for the observed data is identifiable upon a restriction on the latent variables. To allow for appropriate dependence structure while facilitating identifiability, we use a square-root-free Cholesky decomposition of the covariance matrix in the normal mixture kernel. In addition to allowing for the necessary restriction, this modeling strategy provides substantial simplifications in implementation of Markov chain Monte Carlo posterior simulation. We present two data examples taken from areas for which the methodology is especially well suited. In particular, the first example involves estimation of relationships between environmental variables, and the second develops inference for natural selection surfaces in evolutionary biology. Finally, we discuss extensions to regression settings with multivariate ordinal responses.

KEY WORDS: Bayesian nonparametrics; Dirichlet process mixture model; Identifiability; Markov chain Monte Carlo; Ordinal regression

1Introduction

Binary responses measured along with covariates are present in several problems in science and engineering. From a modeling perspective, interest centers on determining the regression relationship between the response and covariates. Standard approaches to this problem – both classical and Bayesian – involve potentially restrictive distributional assumptions as well as those of linearity in relating the response to the covariates. Common modeling techniques result in a small range of monotonic, symmetric trends for the probability response curve, and assume that covariate effects are additive.

There has been substantial effort devoted to relaxing the functional form of the linear predictor through the use of basis functions, including spline based approaches [?], and generalized additive models [?]. The latter modify the linear predictor by applying a smoothing function to each covariate separately and assuming the transformed covariates are additive in their effects. However, the underlying distributional assumption is still present through the link function.

The motivation for Bayesian nonparametric methodology lies in the notion that the model should support a wide range of distributional shapes and regression relationships. In an effort to create more flexible models to combat overdispersion and asymmetry, which the standard links can not, several Bayesian semiparametric binary regression methods have been developed. Early work has targeted either the link, treating it as a random function with a nonparametric prior [?], or linearity, for instance, by viewing the intercept of the linear predictor as arising from an unknown distribution [?]. More recently, [?] relaxed the linearity assumption by placing a Gaussian process prior on the argument of the inverse link. [?] assumed each binary response to arise from a random colored tessellation, and placed a Dirichlet process (DP) prior [?] on the space of colored tessellations.

[?], [?], and [?] have proposed nonparametric solutions to the regression problem with categorical responses. These approaches build off the work of [?], which modeled the joint distribution of continuous responses and covariates with a DP mixture of normal distributions, inducing a flexible model for . The idea of inducing a regression model through the joint response-covariate distribution is attractive, since in many settings the covariates are not fixed prior to sampling, including several applications in the environmental, biomedical, and social sciences.

We target problems of this type, proposing a flexible model for fully nonparametric binary regression, in which the responses and covariates arise together as random vectors, requiring a model for their joint distribution. The foundation of the proposed methodology is different from the existing nonparametric modeling approaches. We elaborate further in Section 4, but here note that a key feature of the proposed model involves the introduction of latent continuous responses, in similar spirit to parametric probit models; see, for instance, [?]. Let denote the data, where each observation consists of a binary response along with a vector of covariates, . The continuous auxiliary variables, , determine the observed binary responses by their sign, such that if and only if . Instead of seeking a nonparametric model for the regression function, we estimate the joint distribution of latent responses and covariates, , using a DP mixture of multivariate normal distributions, which induces a flexible model for the regression relationship, . In addition to providing a general modeling platform, the latent responses are conceptually meaningful in many applications. The proposed model is shown to be identifiable provided the variance of within each mixture component is fixed, a restriction implemented through a square-root-free Cholesky decomposition of the mixture kernel covariance matrix. This aspect of the model formulation retains computational efficiency in posterior simulation while enabling the use of priors more flexible than the inverse-Wishart distribution. We develop two approaches to prior specification for the covariance parameters, one which involves prior simulation and can be used for problems with a small number of covariates, and a second which is more straightforward to apply as the dimension of the covariate space increases.

In Section 2, we formulate the mixture model for binary regression. We discuss identifiability for the parameters of the mixture kernel distribution, as well as prior specification approaches, and give details for posterior inference. In Section 3, the methodology is applied to problems from environmetrics and evolutionary biology, using two data sets from the literature for illustration. The latter example involves estimation of fitness surfaces, a problem for which our method is particularly powerful relative to existing approaches. Section 4 contains further discussion to place our contribution within the existing literature, and to indicate possible extensions. Technical details on the identifiability result, prior specification and posterior simulation, and the expressions for the model comparison criterion used in Section 3 are provided in the appendices.

2Methodology

2.1The modeling approach

Focusing on continuous covariates, , and a single binary response , with corresponding latent continuous response , a normal distribution is a natural choice for the kernel in a mixture representation for . The DP is then used as a prior for the random mixing distribution , to create a mixture model of the form: , , where is the DP precision parameter, and the parameters of the DP centering distribution.

According to the DP constructive definition [?], a realization is almost surely of the form , with independent realizations from , and arising through stick-breaking from beta random variables. In particular, let be independent , and define , and , for ; moreover, and are independent sequences of random variables. Applying the constructive definition with , the model admits a representation as a countable mixture of multivariate normals, .

For the normal kernel distribution, let denote the mean of , denote the mean of , and partition the covariance matrix such that , , a matrix, and , a row vector of length . Then, integrating over the latent response , the induced model for the observables assumes the form

where denotes the standard normal cumulative distribution function.

Flexible inference for the binary regression functional can be obtained through . Marginalizing over in , the marginal distribution for is . Hence, the implied conditional regression function can be expressed as a weighted sum of the form , with covariate-dependent weights , and probabilities

which have the probit form with component-specific intercept and slope parameters.

The dependence structure of the mixture kernel in is key to obtaining general inference for the implied binary regression function. However, is it sensible to leave all elements of the kernel covariance matrix unrestricted? In the case of a single mixture component, which arises in the limit as , the regression function has the form a single normal cumulative distribution function, as given in (Equation 2). This function takes the same value for any when and are scaled by a positive constant , and by , indicating that different combinations of and result in the same probability of positive response. Hence, there is an identification problem if and are unrestricted. This limiting case of our model is a parametric probit model, albeit with random covariates. In this setting, if identification constraints are not imposed, then prior distributions become increasingly important yet difficult to specify, and the use of noninformative priors can be problematic and create computational difficulties [?]. In addition, empirical evidence based on simulated data suggests that, without parameter restrictions, the correlations implied by the covariance matrices are not representative of the correlations that generated the data, and undesirable behavior is present in the uncertainty bands of the binary regression functional at the extreme regions of the covariate space. For these reasons, and the fundamental belief that within a particular cluster or mixture component the corresponding parameters should be identifiable, we now focus on restricting the kernel of the mixture.

Here, we employ the standard definition of likelihood identifiability, such that a parameter for a family of distributions is identifiable if distinct values of correspond to distinct probability density functions, that is, if , then is not the same function of as . Under our setting, the focus is on the kernel of the mixture model for the observed data, , which has the form

with . Note that if and are independent in the mixture kernel, the probability in the Bernoulli response becomes ; hence, a restriction – for instance, on – is required for identifiability. This is in fact the only restriction necessary to obtain an identifiable kernel, and we thus retain the ability to estimate , which is significant in capturing the dependence of on under the mixture distribution. The specific result is given in the following lemma whose proof can be found in Appendix A.

LEMMA 1. The parameters are identifiable in the model for observed data which has the form in (Equation 3), provided is fixed to a constant.
While intuitively straightforward, fixing to a constant is challenging operationally. The usual conditionally conjugate inverse-Wishart choice for does not offer the solution, due to the single degree of freedom parameter in the inverse-Wishart distribution, which does not allow for one element of to be fixed while freely estimating the rest of the matrix. One way to overcome this problem is to reparameterize , using a square-root-free Cholesky decomposition. This decomposition is useful for modeling longitudinal data [?], as well as specifying conditional independence assumptions for the elements of a normal random vector [?]. Let be a unit lower triangular matrix, and let be a diagonal matrix with positive elements, , such that . Hence, , where is also lower triangular with all its diagonal elements equal to 1, and . Moreover, , and thus the identifiability restriction can be implemented by setting the first element of equal to a constant value; is used from this point forward. Instead of mixing directly on , the mixing takes place on and the free elements of , . Hence, the mixture model for the joint density of the latent response and covariates is now written as:

While this decomposition of allows for the necessary flexibility in viewing only part of the covariance matrix as random, its real utility lies in the existence of a conditionally conjugate centering distribution , which enables development of an efficient Gibbs sampler for posterior simulation. In particular, a multivariate normal component for the vector, , of free elements of , and independent inverse-gamma components for result in full conditional distributions which are multivariate normal and inverse-gamma, respectively. Therefore, comprises independent components for , and , such that it has the form .

2.2Posterior inference for binary regression

In order to simulate from the full posterior distribution, we utilize the blocked Gibbs sampler [?]. As a consequence of the constructive definition of the DP, any distribution it generates can be represented as a countable mixture of point masses. This definition motivates the blocked Gibbs sampler, as it is based on a finite truncation approximation to . Specifically, is truncated to , where , and are defined through stick-breaking as in the original DP definition, whereas . Introducing configuration variables , each taking values in , the hierarchical version of the DP mixture model for the data given the latent continuous responses, , becomes

where , and the prior implied for by the stick-breaking construction defined through beta random variables corresponds to a generalized Dirichlet distribution (Connor & Mosimann, 1969). The full Bayesian model is completed with a prior for , with mean , and with conditionally conjugate hyperpriors for , specifically: , , , , and , for . Here, indicates that the positive definite matrix follows an inverse-Wishart distribution with density proportional to . The notation is used for element of the vector corresponding to the diagonal of . Moreover, where convenient, we use the notation for the structured covariance matrix, where the elements of are computed through .

A key feature of the modeling approach is that simulation from the full posterior distribution, , is possible via Gibbs sampling. We next discuss posterior simulation details focusing on a result that enables Gibbs sampling updates for the parameters that define the covariance matrices of the normal mixture components.

The updates for and are generic for any choice of mixture kernel; see [?]. Each , , is sampled from a discrete distribution on , with probabilities proportional to , for . The full conditional distributions for the components of are easily found using standard conjugate updating. The full conditional distribution for each is a truncated version of the normal distribution , with the restriction if , and if .

Letting be the vector of distinct values of , the full conditional distribution for is proportional to . If , then . If , then the full conditional distribution for each element of arises from the product of a normal likelihood component, based on , and the base distribution . Therefore, when , for , the full conditional for is multivariate normal with mean vector and covariance matrix , where is the size of mixture component .

Lemma 2, whose proof can be found in Appendix A, provides the result for the posterior full conditional distributions of the and the , for . Before stating the lemma, we fix the required notation. As discussed earlier, vector consists of the lower triangle of free elements of matrix . For instance, if , the mixture kernel is a trivariate normal, and the free elements of are , corresponding to . The matrix contains vector on its diagonal. Let represent the dimension of the mixture kernel. Let be a vector of length , containing nonzero terms, occurring in elements for . Let be a block diagonal matrix of dimension with blocks, which can be constructed from square matrices of dimensions . Matrix occurs in rows and columns to of .

LEMMA 2. Consider the following Bayesian probability model: with a multivariate normal prior for , independent inverse-gamma priors on the diagonal elements of , , , and a multivariate normal prior on the vector comprising the lower triangular elements of , . Then, the posterior full conditional distribution for , , is an inverse-gamma distribution with shape parameter and scale parameter . In addition, the posterior full conditional for is multivariate normal with mean vector and covariance matrix . Here, the non-zero elements of are , and the -th element of matrix , for , is given by , for .

This lemma provides the information necessary to obtain the remaining full conditional distributions, which are available in closed form. Let denote the augmented latent response-covariate vector, such that and , for . Then, when , for , the full conditional distribution for is inverse-gamma with shape parameter and scale parameter . The full conditional for is multivariate normal with covariance matrix , and mean vector . The non-zero terms in the vector are , and for , the matrix contains elements , .

The mixing distribution , approximated by , is imputed as a component of the posterior simulation algorithm, enabling full inference for any functional of . The binary regression functional is the main quantity of interest, and is estimated by , where , with given in (Equation 2), and . Therefore, full inference for can be readily obtained for any covariate value , providing a point estimate along with uncertainty quantification for the binary regression function. Inference can also be obtained for the covariate distribution, , as well as the covariate distribution conditional on a particular value of , , which we refer to as inverse inferences, discussed further in the context of the data example of Section 3.1.

2.3Prior specification

We discuss two approaches to hyperprior specification considering the limiting case of the model as , which corresponds to a single mixture component. Both approaches use an approximate range and center of , say and , both vectors of length , with the objective being to center and scale the mixture kernel appropriately using only a small amount of prior information. Under the assumption of a single mixture component, the marginal moments are given by , and . We therefore set , and let , using and as proxies for the marginal mean and variance of , for . We set , which yields a dispersed prior for albeit with finite prior expectation, and determine such that . Next, we must determine values for the prior hyperparameters associated with and the , and this is where the two approaches differ.

The first approach uses prior simulation to induce approximately uniform priors on all correlations of the mixture kernel covariance matrix, while appropriately centering the variances. Note that the number of correlations grows at a rate of , making this approach practically feasible only for a small number of covariates. In particular, with a single covariate the kernel covariance matrix comprises correlation, , and variance, . Here, and are scalar parameters with components and , respectively, and the hyperpriors are: , , and . We set , and build the specification for the other hyperparameters from . We first fix the shape parameters , and to values that yield relatively large prior dispersion, for instance, results in infinite prior variance for the inverse-gamma distributions. Next, using as a proxy for , we find constants , where , such that , , and , while at the same time the induced prior on is approximately uniform on . Finally, with specified, , , and can be determined accordingly.

While this approach is attractive when a relatively noninformative prior is desired, it is difficult to implement with a moderate to large number of covariates. An alternative strategy arises from studying the distribution which is implied for if is inverse-Wishart distributed. Using properties of partitioned Wishart and inverse-Wishart matrices [?], it can be shown that implies inverse-gamma distributions for the , and a normal distribution for given the . It is customary to specify noninformative priors on the inverse-Wishart scale, usually fixing the degrees of freedom parameter to a small value, and the inverse scale parameter to be a diagonal matrix. Here, we use the smallest possible integer value for that ensures a finite expectation for the distribution, that is, , and set . Then, as shown in Appendix B, the distributions implied on , for , are . Hence, we let , and ; for the data examples of Section 3, we worked with exponential priors for the resulting in . Moreover, the distribution implies a normal distribution for the -th row of matrix , given ; see Appendix B. This can be translated into a distribution for conditionally on the , specifically, a normal distribution with zero mean vector and covariance matrix , which denotes a block diagonal matrix with elements , for . Now, after marginalizing out , the prior component for becomes . We therefore specify to be equal to the zero mean vector, and since we have a further prior on , and is a function of , we set , where is a proxy for obtained by replacing with its marginal prior mean. Finally, and can be specified to be equal to each other or assigned different portions of .

3Data Illustrations

3.1Ozone data

Ozone is a gas which has detrimental consequences when it occurs near the Earth’s surface. Ground-level ozone is a harmful pollutant, making up most of the smog which is visible in the sky over large cities. Because of the effects ozone has on the environment and our health, its concentration is monitored by environmental agencies. Rather than recording the actual concentration, presence or absence of an exceedance over a given ozone concentration threshold may be measured, and the number of ozone exceedances in a particular area is of interest.

We work with data set ozone from the “ElemStatLearn” R package. The data set includes measurements of ozone concentration in parts per billion, wind speed in miles per hour, temperature in degrees Fahrenheit, and radiation in langleys, recorded over 111 days from May to September of 1973 in New York. To construct a binary ozone exceedance response, we define an exceedance as an ozone concentration which is larger than 70 parts per billion. Therefore, we can model the probability of an ozone exceedance as a function of wind speed, temperature, and radiation, using the DP mixture binary regression model. In addition, the modeling approach is evidently appropriate here, since it is natural to estimate conditional relationships between the four environmental variables through modeling the stochastic mechanism for their joint distribution. We are not suggesting dichotomizing a continuous response in practice, but use this example to illustrate a practically relevant setting in which a binary response may arise as a discretized version of a continuous response. Moreover, the existence of the continuous ozone concentrations enables comparison of inferences from the binary regression model with a model based on the actual continuous responses.

Prior specification was performed using the first approach discussed in Section 2.3 that favors uniform priors for the correlations of the kernel covariance matrix. Although the corresponding priors were not all close to the uniform on under the inverse-Wishart prior specification approach, both methods resulted in prior mean estimates for that were, for each of the three random covariates, constant around , with interval bands that essentially span the unit interval. All posterior inference results discussed below were robust to the prior choice.

Ozone data. Posterior mean (solid line) and 90\% uncertainty bands (in gray) for probability of exceedance versus wind speed (left panels), temperature (middle panels), and radiation (right panels). The top row plots results under the binary regression model, including the binary response data in each panel. The bottom row shows results under the density estimation model. Refer to Section 3.1 for further details. Ozone data. Posterior mean (solid line) and 90\% uncertainty bands (in gray) for probability of exceedance versus wind speed (left panels), temperature (middle panels), and radiation (right panels). The top row plots results under the binary regression model, including the binary response data in each panel. The bottom row shows results under the density estimation model. Refer to Section 3.1 for further details. Ozone data. Posterior mean (solid line) and 90\% uncertainty bands (in gray) for probability of exceedance versus wind speed (left panels), temperature (middle panels), and radiation (right panels). The top row plots results under the binary regression model, including the binary response data in each panel. The bottom row shows results under the density estimation model. Refer to Section 3.1 for further details.
Ozone data. Posterior mean (solid line) and 90\% uncertainty bands (in gray) for probability of exceedance versus wind speed (left panels), temperature (middle panels), and radiation (right panels). The top row plots results under the binary regression model, including the binary response data in each panel. The bottom row shows results under the density estimation model. Refer to Section 3.1 for further details. Ozone data. Posterior mean (solid line) and 90\% uncertainty bands (in gray) for probability of exceedance versus wind speed (left panels), temperature (middle panels), and radiation (right panels). The top row plots results under the binary regression model, including the binary response data in each panel. The bottom row shows results under the density estimation model. Refer to Section 3.1 for further details. Ozone data. Posterior mean (solid line) and 90\% uncertainty bands (in gray) for probability of exceedance versus wind speed (left panels), temperature (middle panels), and radiation (right panels). The top row plots results under the binary regression model, including the binary response data in each panel. The bottom row shows results under the density estimation model. Refer to Section 3.1 for further details.

The marginal binary response curves for the probability of exceedance as a function of wind speed, temperature, and radiation, are shown in the top row of Figure ?. There is a decreasing trend in probability as wind speed increases, with the probability being essentially 0 when wind speed is greater than 15 mph. The opposite trend is observed with temperature, as the probability of exceedance is near 0 when temperature is less than 75 degrees, and above 0.8 when temperature exceeds 90 degrees. A non-monotonic unimodal response curve is obtained as a function of radiation, with peak probability occurring at moderate values of radiation, and declining with higher and lower values. Bivariate surfaces indicating probability of exceedance as a function of temperature and wind speed, as well as radiation and wind speed, are shown in Figure ?. An attractive feature of the joint modeling approach is that interactions and dependence between covariates are naturally accounted for, without the need to make simplifying assumptions, such as additivity in covariate effects, or to accommodate interactions with additional terms.

For this illustrative data example, the continuous ozone concentration responses are also available. We can therefore compare the binary regression model inferences for with the ones for , under the corresponding density estimation model – a DP mixture based on a four-dimensional normal kernel with unrestricted covariance matrix – applied to the original data set . Results are shown in the bottom row of Figure ?, based on a prior choice for the density estimation model that induces prior estimates for the curves that are similarly diffuse to the ones for . Save for some differences in the uncertainty bands, the density estimation model reveals similar trends for the regression functions to the ones uncovered by the binary regression model.

Ozone data. Posterior mean surface for probability of exceedance versus temperature and wind speed (left panel), and radiation and wind speed (right panel). Probabilities ranging from 0 to 1 are indicated by a spectrum of colors from white to red. Ozone data. Posterior mean surface for probability of exceedance versus temperature and wind speed (left panel), and radiation and wind speed (right panel). Probabilities ranging from 0 to 1 are indicated by a spectrum of colors from white to red.
Ozone data. Posterior mean estimates (solid lines) and 90\% uncertainty bands (dashed lines) for the density of wind speed (left panel), temperature (middle panel), and radiation (right panel), given an ozone concentration exceedance (blue) and non-exceedance (black). Ozone data. Posterior mean estimates (solid lines) and 90\% uncertainty bands (dashed lines) for the density of wind speed (left panel), temperature (middle panel), and radiation (right panel), given an ozone concentration exceedance (blue) and non-exceedance (black). Ozone data. Posterior mean estimates (solid lines) and 90\% uncertainty bands (dashed lines) for the density of wind speed (left panel), temperature (middle panel), and radiation (right panel), given an ozone concentration exceedance (blue) and non-exceedance (black).

As another appealing consequence of estimating the joint response-covariate distribution, we can obtain inference for the distribution of covariates conditional on a particular value of . These inverse inferences may be of interest in many settings, as they indicate how the covariate distribution differs given a positive versus a negative binary response. Such inferences are not possible under a model directly for the conditional response distribution (with the implicit assumption of fixed covariates). Figure ? shows estimates for the density of each covariate conditional on the binary exceedance response, and , for . Note that when an exceedance occurs, temperature is generally higher and wind speed lower. In addition, the conditional densities associated with an exceedance have smaller dispersion than those associated with a non-exceedance, indicating that a smaller range of covariate values are supported when an exceedance occurs.

Recall from Section 2.1 that if we make the simplifying assumption for the covariance matrix of the kernel in , we obtain a kernel for that comprises independent components and . The implied conditional regression function is again a weighted sum of probabilities with the same covariate-dependent weights as the proposed model, but probabilities which are not functions of ; the probability in expression (Equation 2) reduces to . Mixtures of this product-kernel form have been previously proposed in the literature; see, for instance, [?].

We fitted the simpler product-kernel model to the ozone data, using hyperpriors that induce similarly diffuse prior estimates for the regression functions with the general binary regression model. Differences in the response probabilities produced by the product-kernel mixture model (not shown here) tend to occur at peaks or low points of the curves in Figure ?. In general, the product-kernel model underestimates the probability surface or curve when it takes a high value, and overestimates regions of low probability. In addition, the uncertainty bands from the product-kernel model are generally wider than those produced by the proposed model.

For a more formal comparison, we use the posterior predictive loss criterion of [?]. The criterion favors the model that minimizes the predictive loss measure , with penalty term , and goodness of fit term . Here, is the mean under model of the posterior predictive distribution for replicated response with corresponding covariate value . The variance is similarly defined. Details involving expressions contributing to for each model are given in Appendix C, but note that computations are based on the conditional posterior predictive distribution of given . The penalty term under the product-kernel model is , while it is under the proposed model, and the goodness of fit terms are and , respectively. Hence, regardless of the choice for constant , the criterion favors the general DP mixture binary regression model.

3.2Estimating natural selection functions in song sparrows

In addition to enabling more general modeling of binary regression relationships, the latent variables may be practically relevant in specific applications. Often, we may only observe whether or not some event occurred, although there exists an underlying continuous response which drives the binary observation. The ozone data was used to illustrate an environmental application for which the latent continuous responses are actually present. In applications in biology, the latent response may represent maturity, which is recorded on a discretized scale, or an unobservable trait or measure of health. In general, the continuous responses may be latent either because they are actually unobservable, or as consequence of recording taking place on a discretized scale. As an example of the former scenario, consider a binary response which represents survival. While we only observe survival on a binary scale, it is meaningful to conceptualize an underlying process which drives survival. Quantifying the probability of survival as a function of phenotypic traits is of great interest in evolutionary biology [?]. Survival can be thought of as a measure of fitness, and the fitness surface describes the relationship between phenotypic traits and fitness. The proposed methodology is particularly well-suited for this area of application, as it allows flexible inference for the shape of the fitness surface and for the distribution of population traits under a joint modeling framework that incorporates the scientifically relevant latent fitness responses.

As an illustration, we consider a standard data set from the relevant literature that records overwinter mortality along with six morphological traits in a population of female song sparrows [?]. The traits measured consist of weight, wing length, tarsus length, beak length, beak depth, and beak width. Our initial analysis included four traits – weight, wing length, tarsus length, and beak length – as beak width and depth are highly discretized, correlated with beak length, and did not appear to be associated with a trend in survival. This analysis revealed tarsus length and beak length to be the main targets of selection, which is consistent with the findings of [?]. A key objective in this example is to obtain inferences for functionals used to assess the strength and form of natural selection acting on phenotypic traits, and we thus focus on the two traits associated with survival.

The model was applied with standardized covariates tarsus length () and beak length (), measured in millimeters, using the second approach to prior specification involving the inverse-Wishart distribution. The estimated selection curves are shown in Figure ?, revealing a strong decreasing trend in fitness over tarsus length, in which a sparrow with tarsus length millimeters has a lower probability of surviving overwinter than a sparrow with tarsus length just mm shorter. The opposite trend in fitness is present over beak length, as longer beaks are associated with higher probabilities of survival. The posterior median estimate for the probability of survival as a function of both traits (Figure ?, left panel) confirms that the combination of long beaks and short tarsi is optimal for fitness; importantly, it also indicates that a short tarsus provides the more significant contribution to higher probability of survival. The corresponding posterior interquartile range estimate (Figure ?, right panel) depicts more uncertainty in the survival probability surface for sparrows having both a short beak and short tarsus, and those with both a long beak and long tarsus.

Song sparrows data. Posterior mean (solid line) and 90\% uncertainty bands (in gray) for the probability of survival as a function of tarsus length (left panel) and beak length (right panel). Plotted in each panel are the corresponding observations. Song sparrows data. Posterior mean (solid line) and 90\% uncertainty bands (in gray) for the probability of survival as a function of tarsus length (left panel) and beak length (right panel). Plotted in each panel are the corresponding observations.
Song sparrows data. Posterior median surface (left panel) and interquartile range surface (right panel) for the probability of survival as a function of tarsus length and beak length. Song sparrows data. Posterior median surface (left panel) and interquartile range surface (right panel) for the probability of survival as a function of tarsus length and beak length.

For each of the two traits, we estimated the standardized directional selection differential, , which provides a measure of selection intensity representing the change in mean value of a phenotype produced by selection (Lande and Arnold, 1983). Here, is the mean value of phenotypic trait before selection, and is the mean value after selection; the marginal probability is referred to as mean absolute fitness. Under our model, , the mean absolute fitness is given by , and is approximated with a Riemann sum. The posterior mean estimate for the standardized selection differential for tarsus length was , with a posterior credible interval of . For beak length, the posterior mean and credible interval for the standardized selection differential were and . Note that these intervals do not contain zero. Combined with the estimated regression curves, these results give strong evidence that directional selection is acting on tarsus length and beak length, favoring sparrows with long beaks and short tarsi.

The average gradient of the selection surface, weighted by the phenotype distribution, is given under our model by the vector

Under a linear regression structure with a multivariate normal distribution for the phenotypic traits, the selection gradient is equivalent to the vector of linear regression slopes [?]. [?] do not incorporate in their approach a distributional assumption for , and approximate the -th selection gradient by . Our joint mixture modeling approach avoids the assumption of normality for the phenotypic distribution, as well as the need to estimate the integral by assuming the sample represents the population distribution. The integrand of the -th component of the selection gradient vector can be written as , for . We omit the specific expressions for each of these two terms, but note that both are analytically available as a consequence of the mixture of normals representation for . Finally, the average gradient of the relative selection surface, also referred to as the directional selection gradient by [?], is obtained by dividing each element of the selection gradient vector by mean absolute fitness. We obtained posterior mean estimates of and , with corresponding credible intervals of and , for the directional selection gradient associated with tarsus length and beak length, respectively.

The presence of stabilizing or disruptive selection can be explored by considering the change in the phenotypic variance-covariance matrix due to selection, that is, the change from the pre-selection covariance matrix , with elements , to the post-selection covariance matrix , with elements . The stabilizing selection differential matrix is given by + (Lande and Arnold, 1983), where negative values for a particular trait indicate the presence of stabilizing selection, while positive values indicate disruptive selection. The posterior mean for the matrix element corresponding to tarsus length is , that for beak length is , and the off-diagonal element has a posterior mean of . The posterior credible intervals for each element of the matrix all include zero, indicating lack of significant evidence for stabilizing or disruptive selection acting on either trait.

Song sparrows data. Posterior predictive samples for \mathrm{corr}(z,x_1) (left panel), \mathrm{corr}(z,x_2) (middle panel), and \mathrm{corr}(x_1,x_2) (right panel).

Finally, as a means to check if a kernel with independent components for and would be adequate for this data example, we study in posterior predictive space the correlations between the latent response and the two traits. Denoting by the vector comprising all model parameters, the joint posterior predictive distribution is given by