Bayesian Modeling and MCMC Computation in Linear Logistic Regression for Presence-only Data

Bayesian Modeling and MCMC Computation in Linear Logistic Regression for Presence-only Data

Fabio Divino,Natalia Golini,Giovanna Jona Lasinio    Antti Penttinen
1-Division of Physics, Computer Science and Mathematics, University of Molise
2-Department of Statistical Sciences, University of Rome "La Sapienza"
3-Department of Mathematics and Statistics, University of Jyväskylä
Abstract

Presence-only data are referred to situations in which, given a censoring mechanism, a binary response can be observed only with respect to on outcome, usually called presence. In this work we present a Bayesian approach to the problem of presence-only data based on a two levels scheme. A probability law and a case-control design are combined to handle the double source of uncertainty: one due to the censoring and one due to the sampling. We propose a new formalization for the logistic model with presence-only data that allows further insight into inferential issues related to the model. We concentrate on the case of the linear logistic regression and, in order to make inference on the parameters of interest, we present a Markov Chain Monte Carlo algorithm with data augmentation that does not require the a priori knowledge of the population prevalence. A simulation study concerning 24,000 simulated datasets related to different scenarios is presented comparing our proposal to optimal benchmarks.

Keywords:

Bayesian modeling, case-control design, data augmentation, logistic regression, Markov Chain Monte Carlo, population prevalence, presence-only data, simulation.

1 Introduction

There is a significant body of literature in statistics, econometrics and ecology dealing with the modeling of discrete responses under biased or preferential sampling designs. They are particularly popular in the natural sciences when species distributions are studied. Such sample schemes may reduce the survey cost especially when one of the responses is rare. A large part of statistical literature concerns the case-control design, retrospective, choice-based or response-based sampling (Lancaster and Imbens, 1996). In the simplest case a sample of cases and a sample of controls are available and for each observation a set of “attributes/covariates” is observed in both samples. Then inference is carried out following standard statistical procedures (Armenian, 2009).
A case that has received increasing attention in the literature is the situation where the sample of controls is a random sample from the whole population with information only on the attributes and not on the response (Lancaster and Imbens, 1996). This situation is fairly common in ecological studies where only species’ presence is recorded when field surveys are carried out. In the ecological literature, since the 1990’s such data are called presence only data (see Araùjo and Williams, 2000, and references therein). Pearce and Boyce (2006) define presence-only data as “consisting only of observations of the organism but with no reliable data where the species was not found”. Atlases, museum and herbarium records, species lists, incidental observation databases and radio-tracking studies are examples of such data.
In recent years we find a considerably growing literature describing approaches to the modeling of this type of data, among the many ecological papers we recall Keating and Cherry (2004), Pearce and Boyce (2006), Elith et al. (2006), Elith and Leathwick (2009), Franklin (2010) and, most notably, in the statistical literature Ward et al. (2009), Warton and Shepherd (2010), Chakraborty et al. (2011), Di Lorenzo et al. (2011) and Dorazio (2012). While in Warton and Shepherd (2010) and Chakraborty et al. (2011) to model the presence-only data Poisson point processes are considered in the likelihood and Bayesian framework respectively, in Ward et al. (2009) and Di Lorenzo et al. (2011) a modified case-control logistic model is adopted in the likelihood and Bayesian perspective respectively, in both papers there is no account for possible dependence structure in the observations. In Dorazio (2012) the asymptotic relations between the two approaches are discussed.
A different approach, MaxEnt, is based on the maximum entropy principle (Jaynes, 1957). In MaxEnt (Phillips et al., 2006; Elith et al., 2011) the relative entropy between the distribution of covariates at locations where the species is present and the unconditional background distribution of covariates is maximized subject to some constrains concerning empirical statistics (see Philips et al., 2006, for details). As pointed by Dorazio (2012) “the MaxEnt method requires knowledge of species’ prevalence for its estimator of occurrence to be consistent”.
In what follows we are going to use the name presence-only data when referring to the above sketched general problem of having information on the presence and covariates jointly on a sample from a population, while information on only the covariates is available on any sample from the same population. This work is developed in the same discrete setting as in Ward et al. (2009) and Di Lorenzo et al. (2011), i.e., we have a population of independent units, no dependence structure, such as spatial correlation, is anticipated. We defer the treatment of this extension to a subsequent work.
The main contribution of the paper is a new rigorous formalization of the logistic regression model with presence-only data that allows further insight into the inferential issues. This leads us to an algorithmic procedure that, among other results, returns a MCMC approximation of the response prevalence under general knowledge of the process generating the data. We also present a large simulation study involving 24,000 simulated datasets and comparing our approach to other two models representing optimal benchmarks.
The paper is organized as follows. Section 2 introduces a general framework for the presence-only data problems, Section 3 presents our Bayesian approach, Section 4 describes the MCMC algorithm while results related to the simulation study are reported in Section 5. Finally in Section 6 some conclusions are drawn and future developments briefly described.

2 Linear logistic regression for presence-only data.

The analysis of a binary response related to a set of explicative covariates is usually carried out through the use of the logistic regression where the logit of the conditional probability of occurrence is modeled as a function of covariates. In this section, we first introduce a general framework for the modeling of presence-only data and then consider the case of the linear logistic regression. The approach proposed is built on two levels and we partially follow the formulation introduced by Ward et al. (2009) but adopting a Bayesian scheme as in Divino et al. (2011).

2.1 A two level approach.

Let be a binary variable informing on the presence () or absence () of a population’s attribute and let denote a set of highly informative, on the same attribute, covariates which are available on the same population. Then, the presence-only problem can be formalized by considering a censorship mechanism that acts when observing the response , so that part of the population units are not reachable. In particular, we refer to the situation in which we are able to detect only a partial set of units on which the attribute of interest is present while the information on the covariates is available on the entire population. In this situation, we have to consider two types of uncertainty: the uncertainty due to the mechanism of censorship and the uncertainty due to the sampling procedure. Moreover, since we are not able to collect a random sample of observable data, we need to adjust for the sampling mechanism through the use of a case-control scheme (Breslow and Dey, 1980; Breslow, 2005; Armenian, 2009) .
In order to build a Bayesian model, in this framework we adopt the following conceptual scheme in two levels.

Level 1.

Given the population of interest of size , the binary responses are generated independently by a probability law .

Level 2.

Let be the subset of where we observe . A modified case-control design is applied so that a sample of presences, considered as cases, is selected from and a sample of “contaminated” controls (Lancaster and Imbens, 1996) is selected from the whole population , with all the covariates but no information on .

Here, we cannot approach the model construction using only a finite population approach (Särndal, 1978) because of the censoring mechanism that “masks” distributional information on already at the population level. By the introduction of Level 1 we can describe the censored observations as random quantities generated by the model . Hence, the problem of presence-only data can be formalized as a problem of missing data (Rubin, 1976; Little and Rubin, 1987).

2.2 The model generating population data.

At the first level, we assume that the law is defined in terms of the conditional probability of occurrence , denoted by , when the covariates are . Moreover, we consider that the relation between and is formalized through a regression function on the logit scale

(1)

that is

(2)

When the data are independently generated from , we denote by the empirical prevalence of the binary response in , expressed as the ratio of the number of presences to the size of the population, that is

2.3 The modified case-control design.

At the second level, we adopt a case-control design modified for presence-only data (Lancaster and Imbens, 1996) in order to account for the specific sampling procedure considered. The use of the case-control scheme is necessary at all times when it is appropriate to select observations in fixed proportions with respect to the values of the response variable. This can occur when the attribute of interest represents a phenomenon that is rare among the units of the population as for example a rare disease or a rare exposure in epidemiological studies (Woodward, 2005).
Now, let be a binary indicator of inclusion into the sample ( denotes that a unit is in the sample), let and be the inclusion probability of the absences and the presences, respectively. Under the assumption that, given , the sampling mechanism is independent from the covariates , the conditional probability of occurrence is modified through the Bayes rule as

(3)

Hence, the corresponding case-control regression function defined as the logit of (3) is given by

(4)

In particular, if the selection of cases () and controls () is made independently without replacement, the inclusion probabilities are given in terms of the empirical prevalence by

and

so that the equation (4) becomes

(5)

In our framework, since the response variable is already censored at the population level, the standard case-control design cannot be adopted but it should be modified in such a way that a sample of presences is matched with an independent sample drawn from the entire population, named the background sample (Zaniewski et al., 2002; Ward et al., 2009). Remark that in this sample the response variable is unobserved and only the covariates are available.
In this way, the complete sample is composed by a set of independent background data, where the response is not observed, drawn from the entire and by a set of independent observations selected from the sub-population of presences . This procedure implies that the reference population is augmented with its subset so that the total number of observations considered in the sampling scheme becomes . To illustrate the sampling framework we are going to adopt here, let us consider the following situation: we can label population units of type only when they are isolated from units of type . This can be formalized by introducing a binary stratum variable such that indicates when an observation is drawn from the entire population while denotes the sampling from the sub-population . Remark that implies whilst implies that is an unknown value . Moreover, by construction is independent from the covariates , given the response . The introduction of the variable allows us to define the structure of the data at the population level and at the sample level in terms of presences/absences () and known/unknown data (), as reported in Table 1 and Table 2.

Y/Z Total
Total
Table 1: Data structure at the population level.
Y/Z Total
Total
Table 2: Data structure at the sample level.

In Table 1, is the number of absences in the population while in Table 2, and respectively denote the unknown frequencies of absences and presences in the sub-sample . Remark that, in the above described situation, the inclusion probability of units with or without the mentioned attribute changes. In fact, while an absence can be drawn only when sampling from , a presence can be selected when sampling both from and from . Thus, one has

(6)

and

(7)

The introduction of the stratum variable allows us also to exactly derive the logistic regression model under the case-control design modified for presence-only data. In fact, when we consider the population augmented with its subset , the model represents the conditional probability to mark a presence only when , that is . On the other hand, when , we simply have . We can prove the following result.

Proposition 1.

Under the assumption that is independent from given , one has

(8)

Proof.

From the hypothesis of conditional independence it results

that can be express also as

Let consider the case with and , one has

The probabilities enclosed in the second term can be derived from Table 1 and one has

(9)

In the case and one similarly obtains

(10)

From (10) it results and by substituting in (9), one can derive that and hence . Now, it is simple to obtain that


If we assume that, given , the inclusion into the sample () is independent from the covariates , one has 111see Appendix for the detailed proof.

(11)

and

(12)

Then, from the ratio of (12) to (11), it results

and by plugging the quantities and , as defined in (6) and in (7), into the logit of , one obtains the following relation

(13)

that represents the logistic regression model under the case-control design for presence-only data. As well, we can now formalize the presence-only data regression function as

(14)

Although the derivation is substantially different, we end with the same formulation as in Ward et al. (2009). Now, in order to make parameter estimation possible, we need to handle the ratio

(15)

where the quantities and are unknown ().
In the recent literature, two main approaches have been proposed. The first one by Ward et al. (2009) replace the ratio with the ratio of the expected numbers of presences and absences in the sample, that is

(16)

These authors adopt a likelihood approach and computation is carried out via the EM algorithm. As they underline, this approximation can be easily implemented if the empirical population prevalence is known a priori. They discuss also the possibility to estimate jointly with the regression function when the prevalence is identifiable, as for example in the linear logistic regression, and with respect to this case they present a simulation example. The difficulty in obtaining efficient joint estimates because of the correlation between and the intercept of the linear regression term is discussed. Notice that Ward et al. (2009) considers a slightly different representation of the ratio (16), omitting the multiplier “2” in the denominator.
Di Lorenzo et al. (2011), dealing with a problem of abundance data, use the approximation (16), but they adopt a Bayesian approach and consider the population prevalence as a further parameter in the model. They choose an informative Beta prior for , but their MCMC algorithm contains an unusual weakness since the simulation of is performed from its prior and not from the posterior that can be derived through the interaction between the parameter and the regression function .
A different approximation of the ratio (15) can be obtained by considering the sample prevalence in (the background sample)

where

Due to the censorship process, this quantity is unknown but it would be the maximum likelihood estimator for if the data could be observed. Now, replacing by in (15) one obtains

(17)

that allows to formulate a computable version of the regression function for presence-only data as

(18)

This function depends on the data in which are not directly observable, but if is treated as missing data one can enclose it into the estimation process and then obtain a consistent approximation for . In particular, in a Bayesian framework, this idea can be performed by using a Markov Chain Monte Carlo computation with data augmentation. Moreover from the use of MCMC simulations we can also obtain an approximation of and therefore an estimate of the empirical population prevalence . Details are given in Section 4.
The approximation (17) can, in principle, be always adopted, but some care must be used as identifiability issues are present. We follow the recommendation in Ward et al. (2009) to approach jointly estimates of and only when the latter is identifiable with respect to the regression function, as for example in the linear regression case (see Ward et al., 2009, for mathematical details).

2.4 The linear logistic regression.

If we consider a linear regression function , where is the vector of the regression parameters, a computable model for presence-only data can be defined through the following approximation

(19)

or equivalently through the approximation of the conditional probability of occurrence at the sample level

(20)

In this particular case, all the unknowns of the model are the linear parameters vector and the missing data in the background sample .

3 The hierarchical Bayesian model.

Due to the censorship process affecting the data, we can acquire complete information only on the stratum variable and not on the binary response . Then, it seems natural to model as the observable variable. If we consider the conditional joint distribution of and

(21)

through the marginalization over , the probability can be obtained and we can express the relation between presences and covariates in terms of regression of respect to . Notice that, while can be obtained from (20), the term , due to the conditional independence between and given , simply reduce to be equal to that can be derived from Table 2.
We point out that, even if the response does not play an explicit role after the marginalization step, we need to keep it in the model as a hidden variable in order to obtain the approximation for the quantity , necessary to correct the linear regression function for presence-only data.
Now, we can formalize the hierarchical Bayesian model to estimate the parameters of a linear logistic regression under the case-control scheme adjusted for presence-only data. In order to better explain the conditional relationship underlying the hierarchy, we introduce the graph in Figure 1. The dashed node indicates a variable hidden with respect to the conditional relationships.

Figure 1: Graphical representation of the hierarchical Bayesian model.

The priors.

At the top of the hierarchy, we assume the hyper parameter distributed as . At the second level, we consider the prior probability distribution on depending on the hyper parameter , that is . At the third level, the unobserved data in are considered latent parameters with prior distribution Bernoulli (denoted by ) with probability of occurrence given by the approximation in (20), that is

This point is important for deriving the predictive distribution of the unobserved data necessary in the estimation algorithm.

The likelihood.

At the lowest level of the hierarchy, we have the likelihood, defined with respect to the observable stratum variable . Recalling that from the Table 2 we have and , when (21) is marginalized over , one obtains the approximation

(22)

and hence

(23)

Thus, we can assume that for all the conditional distribution of is Bernoulli with probability of occurence given by (22), that is

Recalling that for all while for all , the likelihood function can be written as

Ward et al. (2009) defines this function as the observed likelihood versus the full likelihood that, instead, considers the distribution of the stratum variable jointly with the response .

The posterior.

Now, through the Bayes rule we derive the full posterior

(24)

that can be used to make inference on the quantities of interest.

4 The MCMC computation.

Samples from (24) can be obtained via Markov Chain Monte Carlo simulation (Robert and Casella, 2004; Liu, 2008). While it seems quite standard to implement a direct sampler for the vector and the hyper parameter , we need to sample also the latent . For this reason we introduce a step of data augmentation (Tanner and Wong, 1987; Tanner, 1996) in the estimation procedure. The basic idea of the data augmentation technique is to augment the set of observed data to a set of completed data that follow a simpler distribution (Liu and Wu, 1999). In our framework, we need to augment the observations of the stratum variable with the missing values in order to have, at each iteration, a consistent value of the quantity , necessary to adjust the regression function for presence-only data. The following result allows for an easy implementation of the data augmentation step.

Proposition 2.

Using the approximation (17) of the ratio (15), the posterior predictive probability of occurrence for an unobserved response in the sub-sample is approximated by the model that generates the data at the population level, that is

(25)

Proof.

From the conditional independence between and given , the predictive probability of occurrence in is given by

From Table 2 we have that and hence

(26)

Now, recalling that in the general case one has

(27)

and

(28)

by substituting (27) and (28) in (26), one obtains


4.1 The data augmentation algorithm.

A general MCMC scheme to perform inference on a linear regression model for presence-only data can be defined as follow.

 

Step 0.

Initialize , and

Step 1.

Set

Step 2.

Sample from

Step 3.

Sample from

Step 4.

Sample from for all

Goto Step 1

 

After the initialization of all the arrays (Step 0), Step 1 sets a current value for the quantity to adjust the regression function . Step 2 and Step 3 consider the sampling from the posterior of the hyper parameter and the regression parameter , respectively, and they can be performed by Metropolis-Hasting schemes (Robert and Casella, 2004). Step 4 concerns the data augmentation for the unobserved in order to update consistently the quantity at the following iteration. From the result (25), this simulation can be obtained by a Gibbs sampler (Robert and Casella, 2004) since the posterior predictive distribution for all is approximated by Bernoulli with parameter of occurrence .

4.2 The estimation of the prevalence .

From the data augmentation algorithm we can obtain a MCMC estimate of the population prevalence . In fact, if at each iteration , after the Markov chain has reached the equilibrium, we save the current value , we can obtain a consistent MCMC approximation of the sample prevalence in by

(29)

where is the ergodic mean of the augmentations over the Markov chain, that is

Therefore, since would be a consistent estimator for , represents also a consistent estimation of the empirical population prevalence.

5 A comparative simulation study.

We present a simulation experiment to evaluate the performances of the model (20). To this aim we generate several datasets in the way described below and we compare our proposal with respect to two models acting in two different situations: (a) the censorship process does not act on the population so that the data are completely observed; (b) the censorship is present, but we assume known the population prevalence so that approximation (16) can be used. In (a) we are able to estimate a linear logistic model (denoted by ), no correction is required and . In (b) we consider a linear logistic model for presence-only data, denoted by , with regression function . Model (20) (denoted by ) is estimated when the censorship process acts on the data and no information is available on the population prevalence. In this case, the regression function is given by . Remark that model can be estimated when the least amount of information is available, requires less information than but more than and can be used only in the ideal situation of complete information. We assume as benchmark model in the case of presence-only data.

The generation of data.

In order to set the simulation study, we need to generate the covariates and the binary response . In particular, we consider two covariates: , giving strong information on the distribution of the response , and , representing a term of noise, not available in the estimation step. We assume distributed as a mixture of two Gaussian densities (denoted by N), centred in and respectively, and with equal variances , that is

The weight is a realization of a Bernoulli random variable with probability of occurrence fixed to . has standard Gaussian distribution . Finally, the binary response , given the covariates and , is Bernoulli distributed with probability of occurrence

We generate covariates and binary response with respect to a population of size . Three general scenarios with different level of complexity have been considered:

(i)

, , : only the informative covariate generates the data;

(ii)

, , : a term of noise is added to the informative covariate;

(iii)

, , : , and a constant effect generate the data.

The case-control sampling.

For each scenario, we sample under the case-control design with a ratio of presence/unobserved equal to and with respect to eight different sample sizes

For example, if the sample size is equal to , we build the corresponding simulated experiment by extracting a random sample from of presences and a random sample from of unobserved values, covariates are available for the whole sample . We consider independent replications of each experiment. In summary, we generate a database of 24,000 datasets (8 sample sizes, 3 scenarios and 1000 replications). With respect to the generating of the data, we considered a quite general framework since the contribution of an informative covariate was combined with a constant effect and a white Gaussian noise. With respect to the three scenarios, we obtained empirical population prevalences respectively , and .

The MCMC estimation.

The estimation is performed in a Bayesian framework for all the models , and . The likelihood function we use in the estimation is based on a model that does not always replicate the model used to generate data. More precisely for all experiments (i), (ii) and (iii) the estimation model is:

(30)

than with scenario (i) the model that generates the data and the one defining the likelihood are the same, whilst for scenarios (ii) and (iii) the likelihood model becomes increasingly different from the one that generates the data. Notice that we consider a simpler structure than the one shown in Figure (1) as we choose a Gaussian prior for all regression parameters (, ) and no hyper parameter is considered. Then, MCMC estimates are computed using 5000 runs after 10000 iterations of burn-in, no thinning is applied as samples autocorrelation is negligible.

Results.

In what follows we report Figures and Tables built on scenario (iii) as it represents the most complex of the three alternatives and it is our “worst” case. In each replicate of an experiment, point estimates are computed as posterior means over 5000 iterations. In Figure 2 boxplots describing point estimates behaviour are reported, horizontal lines corresponding to the “true” values are drawn. The first box corresponds to procedure , the second to and the third to our proposal . In the prevalence is estimated as the ratio of the observed presences in to the sample size . In , although is assumed known a priori, we consider its posterior prediction in . Finally in , the prevalence is obtained at each MCMC step as described in section 4.2 and then the mean over 5000 runs is taken. In Table 3 further details of the point estimates are reported: the median and in parenthesis the first and third quartiles. From the Figures and the values we can see that the three procedure lead to “comparable” values with the obvious reduction of variability when increases. Remark that the estimates for , although affected by a larger variability with small sample sizes, rapidly approaches and behaviour with increasing sample size. This can be seen more clearly in Figure 3 where rooted mean square errors (rmse) are reported. As far as is concerned the lack of knowledge on leads to biased point estimates regardless the estimation procedure. Tables 5 and 6 in Appendix report point estimates for scenarios (i) and (ii). For scenarios (i) unbiased estimates are obtained while (ii) is affected by the same distortion as (iii) but with smaller variability.

Model
50 1.42 (0.68 ; 2.33) 1.15 (0.88 ; 1.55) 0.28 (0.25 ; 0.35)
3.13 (1.78 ; 4.46) 1.69 (1.17 ; 2.28) 0.31 (0.26 ; 0.35)
1.79 (-3.38 ; 4.26) 1.44 (0.76 ; 2.12) 0.24 (0.13 ; 0.34)
100 1.14 (0.72 ; 1.62) 1.00 (0.86 ; 1.22) 0.29 (0.25 ; 0.33)
2.12 (1.11 ; 3.51) 1.30 (0.97 ; 1.80) 0.30 (0.26 ; 0.34)
1.92 (0.16 ; 3.87) 1.24 (0.89 ; 1.78) 0.28 (0.21 ; 0.36)
200 1.01 (0.72 ; 1.36) 0.94 (0.83 : 1.06) 0.29 (0.26 ; 0.31)
1.53 (0.89 ; 2.39) 1.08 (0.87 ; 1.37) 0.29 (0.27 ; 0.32)
1.49 (0.59 ; 2.62) 1.07 (0.83 ; 1.37) 0.29 (0.24 ; 0.34)
500 0.94 (0.75 ; 1.15) 0.89 (0.82 ; 0.96) 0.29 (0.27 ; 0.30)
1.12 (0.78 ; 1.57) 0.94 (0.82 ; 1.10) 0.29 (0.28 ; 0.31)
1.17 (0.62 ; 1.82) 0.94 (0.80 ; 1.12) 0.29 (0.26 ; 0.32)
1000 0.91 (0.78 ; 1.04) 0.88 (0.83 ; 0.92) 0.28 (0.28 ; 0.30)
1.03 (0.79 ; 1.34) 0.91 (0.83 ; 1.01) 0.29 (0.28 ; 0.30)
1.05 (0.68 ; 1.49) 0.91 (0.82 ; 1.03) 0.29 (0.27 ; 0.31)
1500 0.89 (0.80 ; 1.00) 0.86 (0.83 ; 0.91) 0.29 (0.28 ; 0.29)
1.00 (0.78 ; 1.24) 0.89 (0.82 ; 0.98) 0.29 (0.28 ; 0.30)
1.01 (0.71 ; 1.35) 0.90 (0.82 ; 0.99) 0.29 (0.27 ; 0.31)
2000 0.89 (0.82 ; 0.98) 0.87 (0.84 ; 0.90) 0.29 (0.28 ; 0.29)
0.96 (0.79 ; 1.15) 0.89 (0.83 ; 0.95) 0.29 (0.28 ; 0.29)
0.96 (0.71 ; 1.23) 0.88 (0.82 ; 0.96) 0.29 (0.27 ; 0.30)
3000 0.90 (0.83 ; 0.97) 0.87 (0.84 ; 0.89) 0.29 (0.28 ; 0.29)
0.94 (0.82 ; 1.09) 0.88 (0.84 ; 0.93) 0.29 (0.28 ; 0.29)
0.95 (0.76 ; 1.17) 0.88 (0.83 ; 0.94) 0.29 (0.28 ; 0.30)
Table 3: Scenario (iii): point estimates of regression parameters and prevalence computed as medians over 1000 replicates with increasing sample sizes and different models (, and ). In parenthesis distributions quartiles are reported.
50
100
200
500
1000
1500
2000
3000
Figure 2: Scenario (iii): boxplots of simulations with increasing sample sizes and different models(, and ).
(a)
(b)
(c)
Figure 3: Scenario (iii): root mean squared errors for different models (, and ) over the 1000 replications, plots with increasing sample sizes for (a), (b) and (c). Dashed trajectories are reported to show the patterns.

From Ward et al. (2009) we know that pairwise correlation between parameters is present. In Table 4 we report the empirical pairwise correlation measures, obtained as the averages with respect to the 1000 samples, with increasing sample sizes across the different models. No significant differences in the pattern of correlation between the models and while the correlation has a general weaker pattern in than . With respect to the correlation more significant difference are present between and . In Figure 4 scatterplots of versus in the 1000 replicates are plotted with equal axis across estimation procedures and sample sizes. These pictures help us to understand how this correlation evolves with increasing sample sizes. produces the most positive correlated point estimates this being an advantage whenever the model is properly specified.

Model
50 0.65 0.26 -0.09 0.59 0.10 -0.30 0.68 0.81 0.31
100 0.75 0.24 -0.12 0.89 0.29 0.02 0.82 0.76 0.37
200 0.78 0.34 -0.04 0.94 0.39 0.18 0.90 0.78 0.48
500 0.79 0.38 0.00 0.95 0.41 0.24 0.92 0.77 0.51
1000 0.77 0.38 -0.01 0.94 0.46 0.27 0.91 0.81 0.54
1500 0.78 0.42 0.00 0.95 0.48 0.28 0.92 0.81 0.55
2000 0.77 0.35 -0.06 0.94 0.49 0.30 0.92 0.81 0.55
3000 0.81 0.37 -0.01 0.95 0.43 0.23 0.91 0.80 0.52
Table 4: Scenario (iii): pairwise parameters correlation (average over the 1000 replicates) with increasing sample sizes and different models (, and ).
50
100
200
500
1000
1500
2000
3000
Figure 4: Scenario (iii): scatterplot of versus with increasing sample sizes and different models (, and ).

To verify the predictive performance we considered relative measures of specificity and sensitivity (Fawcett, 2006) build as the ratio of the same measures for (numerator) and for (denominator) respectively. In Figure 5 the obtained values are reported versus sample sizes. Remark that rapidly reaches the same level of performance as with increasing sample size.

Figure 5: Scenario (iii): relative specificity and sensitivity computed as ratios between and specificity and sensitivity measures with increasing sample sizes. Dashed trajectories are reported to show the patterns.

6 Conclusions

In this work, we presented a Bayesian procedure to estimate the parameters of logistic regressions for presence-only data. The approach we proposed is based on a two levels scheme where a generating probability law is combined with a case-control design adjusted for presence-only data. The new formalization allows to consider rigorously all the mathematical details of the model as for instance the approximation of the ratio (15) that represents the crucial point when modeling presence-only data in the finite population setting. We want to point out that our formalization is substantially different from the work by Ward et al. (2009), although we end with the same statistical model. We concentrated on the case of the linear logistic regression because we were aware that some care is necessary to handle the identifiability issues present in the model.
The comparative simulation study considered three scenarios with different levels of complexity across increasing sample sizes. We presented detailed results with respect to the most difficult case where the contribution of an informative covariate was mixed with a constant effect and a white Gaussian noise. In term of point estimation, the estimates based on our model were comparable to those obtained under the presence-only data benchmark in which the empirical population prevalence was assumed to be known. On the other hand, this lack of information on the population prevalence affected the efficiency of the estimates, that resulted smaller for our model than for the benchmark. This difference was significant only when the sample size was smaller than 1000, i.e. when the number of observed presences was smaller than 200. From the predictive point of view, our model performed as well as the benchmark already for sample sizes about , i.e. for a number of observed presences at least . Also the pairwise correlation between and , that represents an important issue as pointed by Ward et al. (2009), became negligible with increasing sample sizes.
From the computational point of view, the procedure were carried out through a MCMC scheme with data augmentation and implemented in Fortran codes.
Future work will investigate the possibility of adding dependence structures among the population units into the model as, for instance, through the use of regression functions with structured random effects.

Appendix

Proposition 3.

Under the assumption that, given , the inclusion into the sample () is independent from the covariates , it results

and

Proof.

In general we have that

(31)

From the conditional independence between and given , the (31) becomes

Recalling that and the definitions of and the proofs for and can be derive by simple algebra.

Model
50 0.40 (-0.31 ; 1.45) 1.56 (1.08 ; 2.72) 0.20 (0.18 ; 0.25)
2.19 (0.68 ; 3.57) 2.19 (1.37 ; 3.74) 0.23 (0.18 ; 0.27)
1.03 (-2.51 ; 3.35) 2.00 (1.07 ; 3.44) 0.19 (0.12 ; 0.26)
100 0.31 (-0.18 ; 0.88) 1.23 (0.99 ; 1.61) 0.21 (0.19 ; 0.25)
1.24 (0.29 ; 2.36) 1.55 (1.12 ; 2.22) 0.23 (0.19 ; 0.26)
1.22 (-0.40 ; 2.69) 1.50 (1.07 ; 2.22) 0.16 (0.16 ; 0.27)
200 0.11 (-0.20 ; 0.46) 1.08 (0.95 ; 1.28) 0.22 (0.19 ; 0.24)
0.46 (0.00 ; 1.27) 1.23 (0.99 ; 1.56) 0.22 (0.20 ; 0.24)
0.48 (-0.24 ; 1.55) 1.23 (0.96 ; 1.59) 0.22 (0.19 ; 0.25)
500 0.06 (-0.10 ; 0.25) 1.02 (0.94 ; 1.12) 0.22 (0.20 ; 0.23)
0.17 (-0.09 ; 0.47) 1.04 (0.92 ; 1.19) 0.22 (0.20 ; 0.23)
0.14 (-0.26 ; 0.61) 1.03 (0.89 ; 1.20) 0.22 (0.19 ; 0.24)
1000 0.04 (-0.05 ; 0.17) 1.01 (0.95 ; 1.07) 0.22 (0.21 ; 0.22)
0.04 (-0.13 ; 0.26) 0.99 (0.91 ; 1.08) 0.21 (0.21 ; 0.22)
0.03 (-0.22 ; 0.34) 0.98 (0.90 ; 1.09) 0.21 (0.20 ; 0.23)
1500 0.05 (-0.04 ; 0.15) 0.99 (0.95 ; 1.04) 0.21 (0.21 ; 0.22)
0.01 (-0.12 ; 0.18) 0.97 (0.91 ; 1.05) 0.21 (0.21 ; 0.22)
0.00 (-0.24 ; 0.23) 0.97 (0.90 ; 1.05) 0.21 (0.20 ; 0.22)
2000 0.03 (-0.04 ; 0.10) 0.99 (0.95 ; 1.03) 0.21 (0.21 ; 0.22)
0.00 (-0.12 ; 0.14) 0.97 (0.92 ; 1.03) 0.21 (0.21 ; 0.22)
-0.02 (-0.22 ; 0.14) 0.96 (0.90 ; 1.02) 0.21 (0.20 ; 0.22)
3000 0.03 (-0.02 ; 0.10) 0.98 (0.96 ; 1.02) 0.21 (0.21 ; 0.22)
0.00 (-0.10 ; 0.09) 0.96 (0.92 ; 1.00) 0.21 (0.21 ; 0.22)
-0.03 (-0.18 ; 0.11) 0.95 (0.91 ; 1.00) 0.21 (0.20 ; 0.22)
Table 5: Scenario (i): point estimates of regression parameters and prevalence computed as medians over 1000 replicates with increasing sample sizes and different models (, and ). In parenthesis distributions quartiles are reported.
Model
50 0.42 (-0.33 ; 1.42) 1.34 (0.94 ; 2.13) 0.23 (0.18 ; 0.28)
2.12 (0.78 ; 3.39) 1.95 (1.25 ; 2.96) 0.24 (0.20 ; 0.28)
1.26 (-2.99 ; 3.33) 1.75 (0.95 ; 2.79) 0.20 (0.13 ; 0.28)
100 0.19 (-0.20 ; 0.73) 1.07 (0.88 ; 1.35) 0.23 (0.19 ; 0.25)
1.13 (0.36 ; 2.40) 1.39 (1.00 ; 1.96) 0.23 (0.21 ; 0.26)
1.03 (-0.40 ; 2.65) 1.34 (0.96 ; 1.96) 0.22 (0.17 ; 0.28)
200 0.13 (-0.18 ; 0.45) 0.97 (0.84 ; 1.12) 0.23 (0.20 ; 0.25)
0.48 (0.02 ; 1.17) 1.08 (0.88 ; 1.36) 0.23 (0.21 ; 0.25)
0.48 (-0.38 ; 1.56) 1.07 (0.83 ; 1.41) 0.23 (0.18 ; 0.27)
500 0.09 (-0.07 ; 0.27) 0.92 (0.85 ; 1.00) 0.22 (0.21 ; 0.24)
0.22 (-0.04 ; 0.53) 0.95 (0.84 ; 1.09) 0.23 (0.21 ; 0.24)
0.23 (-0.24 ; 0.69) 0.95 (0.81 ; 1.09) 0.22 (0.20 ; 0.25)
1000 0.08 (-0.02 ; 0.20) 0.90 (0.86 ; 0.95) 0.22 (0.21 ; 0.23)
0.09 (-0.07 ; 0.31) 0.90 (0.83 ; 0.99) 0.22 (0.21 ; 0.23)
0.08 (-0.19 ; 0.38) 0.89 (0.82 ; 1.00) 0.22 (0.21 ; 0.24)
1500 0.08 (0.00 ; 0.18) 0.90 (0.86 ; 0.94) 0.22 (0.22 ; 0.23)
0.08 (-0.05 ; 0.23) 0.89 (0.84 ; 0.96) 0.22 (0.22 ; 0.23)
0.05 (-0.17 ; 0.30) 0.89 (0.82 ; 0.96) 0.22 (0.21 ; 0.23)
2000 0.07 (0.00 ; 0.15) 0.89 (0.86 ; 0.92) 0.22 (0.22 ; 0.23)
0.06 (-0.06 ; 0.20) 0.89 (0.84 ; 0.95) 0.22 (0.22 ; 0.23)
0.02 (-0.17 ; 0.24) 0.88 (0.82 ; 0.94) 0.22 (0.21 ; 0.23)
3000 0.07 (0.01 ; 0.13) 0.89 (0.87 ; 0.91) 0.22 (0.22 ; 0.23)
0.04 (-0.04 ; 0.15) 0.88 (0.84 ; 0.92) 0.22 (0.22 ; 0.23)
0.02 (-0.13 ; 0.17) 0.87 (0.83 ; 0.92) 0.22 (0.21 ; 0.23)
Table 6: Scenario (ii): point estimates of regression parameters and prevalence computed as medians over 1000 replicates with increasing sample sizes and different models (, and ). In parenthesis distributions quartiles are reported.

References

  • Araùjo and Williams (2000) Araùjo, M. and Williams, P. (2000). Selecting areas for species persistence using occurrence data. Biological Conservation, 96, 331–345.
  • Armenian (2009) Armenian, H. (2009). The Case-Control Method: Design And Applications. Oxford University Press, New York, USA.
  • Breslow (2005) Breslow, N. E. (2005). Handbook of Epidemiology, chapter 6: Case-Control Studies, pages 287–319. Springer, New York, USA.
  • Breslow and Dey (1980) Breslow, N. E. and Dey, N. E. (1980). Statistical Methods In Cancer Research, Volume 1 - The analysis of case-control studies. WHO International Agency for Research on Cancer, Lyon, France.
  • Chakraborty et al. (2011) Chakraborty, A., Gelfand, A. E., Wilson, A. M., Latimer, A. M., and Silander, J. A. (2011). Point pattern modelling for degraded presence-only data over large regions. Journal of the Royal Statistical Society: Series C (Applied Statistics), 5, 757–776.
  • Di Lorenzo et al. (2011) Di Lorenzo, B., Farcomeni, A., and Golini, N. (2011). A Bayesian model for presence-only semicontinuous data with application to prediction of abundance of Taxus Baccata in two Italian regions. Journal of Agricultural, Biological and Environmental Statistics, 16(3), 339–356.
  • Divino et al. (2011) Divino, F., Golini, N., Jona Lasinio, G., and Pettinen, A. (2011). Data augmentation approach in bayesian modelling of presence-only data. Procedia Environmental Sciences, 7, 38–43.
  • Dorazio (2012) Dorazio, R. M. (2012). Predicting the geographic distribution of a species from presence-only data subject to detection errors. Biometrics, 68, 1303–1312.
  • Elith and Leathwick (2009) Elith, J. and Leathwick, J. R. (2009). Species distribution models: ecological explanation and prediction across space and time. Annual Review of Ecology, Evolution and Systematics, 40, 677–697.
  • Elith et al. (2006) Elith, J., Graham, C. H., Anderson, R. P., Dudik, M., Ferrier, S., Guisan, A., Hijmans, R. J., Huettmann, F., Leathwick, J. R., Lehmann, A., Li, J., Lohmann, L. G., Loiselle, B. A., Manion, G., Moritz, C., Nakamura, M., Nakazawa, Y., Overton, J. M., Peterson, A. T., Phillips, S. J., Richardson, K. S., Scachetti-Pereira, R., Schapire, R. E., Soberon, J., Williams, S., Wisz, M. S., and Zimmermann, N. E. (2006). Novel methods improve prediction of species’ distribution from occurence data. Ecography, 29, 129–151.
  • Elith et al. (2011) Elith, J., Phillips, S. J., Hastie, T., Dudík, M., Chee, Y. E., and Yates, C. J. (2011). A statistical explanation of MaxEnt for ecologists. Diversity and Distributions, 17, 43–57.
  • Fawcett (2006) Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letter, 27, 861–874.
  • Franklin (2010) Franklin, J. (2010). Mapping Species Distributions: Spatial Inference And Prediction. Cambridge University Press, Cambridge, UK.
  • Jaynes (1957) Jaynes, E. T. (1957). Information theory and statistical mechanics. The Physical Review, 106(4), 620–630.
  • Keating and Cherry (2004) Keating, K. A. and Cherry, S. (2004). Use and interpretation of logistic regression in habitat-selection studies. Journal of Wildlife Management, 68, 774–789.
  • Lancaster and Imbens (1996) Lancaster, T. and Imbens, G. (1996). Case-control studies with contaminated controls. Journal of Econometrics, 71, 145–160.
  • Little and Rubin (1987) Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis With Missing Data. John Wiley & Sons, New York, USA.
  • Liu (2008) Liu, J. S. (2008). Monte Carlo Strategies In Scientific Computing. Springer, New York, USA.
  • Liu and Wu (1999) Liu, S. Y. and Wu, Y. N. (1999). Parameter expansion for data augmentation. Journal of American Statistical Association, 94, 1264–1274.
  • Pearce and Boyce (2006) Pearce, J. L. and Boyce, M. S. (2006). Modelling distribution and abundance with presence-only data. Journal of Applied Ecology, 43, 405–412.
  • Phillips et al. (2006) Phillips, S. J., Anderson, R. P., and Schapire, R. E. (2006). Maximum entropy modeling of species geographic distributions. Ecological Modelling, 190, 231–259.
  • Robert and Casella (2004) Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Metods. Springer, New York, USA.
  • Rubin (1976) Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.