Bayesian Sparse Propensity Score Estimation for Unit Nonresponse
Abstract
Nonresponse weighting adjustment using propensity score is a popular method for handling unit nonresponse. However, including all available auxiliary variables into the propensity model can lead to inefficient and inconsistent estimation, especially with highdimensional covariates. In this paper, a new Bayesian method using the SpikeandSlab prior is proposed for sparse propensity score estimation. The proposed method is not based on any model assumption on the outcome variable and is computationally efficient. Instead of doing model selection and parameter estimation separately as in many frequentist methods, the proposed method simultaneously selects the sparse response probability model and provides consistent parameter estimation. Some asymptotic properties of the proposed method are presented. The efficiency of this sparse propensity score estimator is further improved by incorporating related auxiliary variables from the full sample. The finitesample performance of the proposed method is investigated in two limited simulation studies, including a partially simulated real data example from the Korean Labor and Income Panel Survey.
Key words: Approximate Bayesian computation, Data augmentation, High dimensional data, Missing at random, SpikeandSlab prior.
1 Introduction
{sec::Intro}
Nonresponse in the collected data is a common problem in survey sampling, clinical trials, and many other areas of research. Ignoring nonresponse can lead to biased estimation unless the response mechanism is completely missing at random (Rubin, 1976). To handle nonresponse, various statistical methods have been developed. Little and Rubin (2002) and Kim and Shao (2013) provide comprehensive overviews of the statistical methods for handling missing data.
The propensity score weighting is one of the most popular tools for adjusting for nonresponse bias, which builds on a model for the response probability only and uses the inverse of the estimated response probabilities as weights for estimating parameters. The propensity score weighting method is well established in the literature. See Rosenbaum (1987), Flanders and Greenland (1991), Robins et al. (1994), Robins et al. (1995), Paik (1997) and Kim and Kim (2007). However, when the dimension of the covariates for the propensity score is high, the full response model including all the covariates may have several problems. First, the computation for parameter estimation can be problematic as it involves high dimensional matrix inversion and the convergence is not guaranteed. Second, estimating zero coefficients in the propensity model increases the variability of the propensity scores and thus leads to inefficient estimation of the model parameters. Furthermore, the asymptotic normality of the propensity score estimator is not guaranteed if the dimension of the covariates is high. That is, the assumptions for the Central Limit Theorem may not be satisfied if we include all the covariates into the propensity model. Therefore, model selection to obtain a sparse propensity model is a challenging but important practical problem. While sparse model estimation is well studied in the literature (Tibshirani, 1996; Fan and Li, 2001; Zou and Hastie, 2005; Zou, 2006; Park and Casella, 2008; Kyung et al., 2010), to the best of our knowledge, not much work has been done for sparse propensity score estimation in the missing data context.
Our main goal is to develop a valid inference procedure for estimating parameters with the sparse propensity score adjustment in a high dimensional setup. In this paper, we propose a new Bayesian approach for sparse propensity score estimation. One advantage of the Bayesian approach is that both model selection and parameter estimation can be simultaneously performed in the posterior inference. To develop the sparse posterior distribution, we use stochastic search variable selection with the SpikeandSlab prior, which is a mixture of flat distribution and degenerate distribution at zero, or a mixture of their approximations (Mitchell and Beauchamp, 1988; George and McCulloch, 1993, 1997; Narisetty et al., 2014). However, implementing the Bayesian variable selection method to propensity score (PS) estimation is challenging, because the likelihood function for the parameter of interest is not available as the outcome model is unspecified. To resolve this issue, we derive an approximate likelihood from the sampling distribution of the PS estimator before applying SpikeandSlab prior for the PS model selection. Note that, selecting a correct propensity model does not necessarily achieve efficient estimation. Incorporating auxiliary variables observed from the full sample (Zhou and Kim, 2012) using generalized method of moments technique, however, can achieve optimal estimation. Thus, to achieve the optimal PS estimation in a Bayesian way, we propose using a working outcome model and the SpikeandSlab prior to select only relevant auxiliary variables. The proposed Bayesian method is implemented by data augmentation algorithm (Tanner and Wong, 1987; Wei and Tanner, 1990) and the computation of posterior distribution is fast and efficient.
The rest of this paper is organized as follows. In Section 2, we introduce the basic setup of the PS estimation. The proposed method is fully described in Section 3. Some asymptotic theories including model selection consistency are established in Section 4. The optimal sparse PS estimator is introduced in Section 5. The performance of the proposed method is examined through extensive simulation studies in Section 6. Some concluding remarks are made in Section 7. All technical proofs are relegated to Appendix.
2 Setup
{sec::setup}
Let be independent and identically distributed (IID) realizations of random vector , where is a scalar response variable and is a dimensional vector of covariates. The dimension is allowed to increase with the sample size. Suppose we are interested in estimating parameter , which can be estimated by , under complete response. Note that no distribution assumptions are made on .
To handle the missing data problem, the response propensity model can be used. To introduce this PS method, suppose that are fully observed and are subject to missingness. Let be the response indicator of , that is,
Assume that are independently distributed from a Bernoulli distribution with the success probability . We further assume that the response mechanism is missing at random (MAR) in the sense that
Furthermore, we assume a parametric model for the response probability
(1) {phi} 
where is a known distribution function and is a dimensional unknown parameter. Then the propensity score estimator of , say , can be obtained by solving
(2) {sec2:eq4} 
with respect to , where is a consistent estimator of in (1). From the response model in (1), the maximum likelihood estimator (MLE) of is obtained by maximizing the loglikelihood function,
(3) {eq:likelihood_phi} 
where and the score equation for obtaining can be written as
(4) {sec2:eq6} 
However, when is sparse, that is, contains many zero values, the MLE from the fully saturated model often increases its variance and fails to be consistent (Zou, 2006). Such phenomenon unfavorably leads to poor inference on the parameter of interest . In addition, the propensity model with unnecessary covariates may increase the variance of the resulting PS estimator. However, including important covariates into the PS model is still critical to obtain consistency.
Penalized likelihood estimation techniques have been proposed to overcome the drawbacks of MLE for high dimensional regression problems. Thus, we may achieve sparse and consistent estimation for by adding a suitable penalty function to (3). For example, LASSO (Tibshirani, 1996) produces a sparse estimator of via penalization,
(5) {lasso} 
where is a predetermined parameter to control the degree of penalization. Thus, we can easily obtain a penalized PS estimate of by solving (2) for a given . However, the penalized likelihood method is limited to the point estimation in the PS method. The derivation of the variance estimator of is very challenging under the penalization approach (Lee et al., 2016; Tibshirani et al., 2016). More importantly, the resulting PS estimator can be inefficient as it does not fully incorporate all available information. That is, the penalized likelihood estimation technique can only select the covariates in the true PS model, which, as shown in Section 4, does not necessarily lead to efficiency gain. To get efficient PS estimation, it is better to include covariates correlated with , even if they are not selected in the true PS model.
All the aforementioned concerns have motivated us to tackle the sparse propensity estimation problem under a Bayesian framework. We use Bayesian stochastic variable search and approximate Bayesian computation (Beaumont et al., 2002; Soubeyrand and HaonLasportes, 2015) for the sparse propensity score estimation and optimal PS estimation. The details are discussed in the following section.
3 Bayesian Sparse Propensity Score Estimation
{sec:proposed}
To formulate our proposal, we first introduce the Bayesian PS estimation discussed in Sang and Kim (2018). Note that is the solution to the joint estimating equations in (2) and (4). Using asymptotic normality
(6) {eq:asym} 
where is a nonstochastic, symmetric, and positivedefinite matrix, as , the approximate posterior distribution of proposed by Sang and Kim (2018) is given by
(7) {post1} 
where is the density function of the sampling distribution in (6), and are the prior distributions of and , respectively. Note that, in (7), instead of using the explicit likelihood function of in (3), the asymptotic distribution of the estimating equations is used to replace the likelihood function as we do not make any distribution assumption on .
To extend the method of Sang and Kim (2018) to the sparse propensity score model, we introduce a vector of latent variables , such that
(8) 
Thus, is an indicator function for including the th covariate into the response probability model in (1).
To account for the sparsity of the response model, we assign the SpikeandSlab Gaussian mixture prior for and an independent Bernoulli prior for as follows:
(9) {ss:prior}  
(10) {ber:prior} 
where , , and are deterministic hyperparameters. To induce sparsity for , the scale hyperparameters and need to be small and large fixed values, respectively. For example, we use and in the simulation study in Section 6. The mixing probability can be interpreted as the prior probability that is nonzero. Under the absence of prior information for , we can set for all or set a uniform prior for . Note that (9) is the prior distribution of for a given model and can be denoted by .
Now, under the given model , the posterior distribution in (7) can be written as
(11) {post2} 
Note that, in (11), we can express the joint density as a product of the marginal density of and the conditional density of given . That is,
(12) {post0} 
where are the density functions derived from the joint asymptotic normality in (6).
Thus, combining (11) with (12), the posterior distribution in (11) can be written as
(13) {post3} 
where
(14) {step2} 
and
(15) {step3} 
Therefore, following the standard Bayesian procedure, the posterior distribution of can be obtained from
(16) {Joint_P}  
where is the prior distribution of in (10), is the posterior distribution of in (14), and is the posterior distribution of in (15).
Using the Gibbs sampling (Casella and George, 1992) procedures, our proposed Bayesian sparse propensity score (BSPS) method can be described by the following two steps:

Step 1 (Model step): Given , generate model from .

Step 2 (Posterior step): Given , generate from .
Step 1 is the new step for model selection. Step 2 is already discussed in Sang and Kim (2018).
We first discuss Step 1. Using (13), the posterior distribution of given can be derived as
where is the likelihood function of . Thus, using (9) and (10), Step 1 can be simplified as generating from
(17) {step 1} 
where denotes a Gaussian density function with mean and variance . Thus, Step 1 does not require any iterative algorithm and hence computationally efficient.
For Step 2, given , we can use (13) to generate the posterior values by the following two steps:
For Step 2a, since the likelihood of is known, we can use
(18) {post_phi1} 
to generate the posterior of given the model and data. In Step 2b, are generated from in (15), where the conditional distribution is derived from the joint normality in (6). The computational details of generating the posterior values from Step 2 efficiently are described in Appendix A.
4 Asymptotic properties
To establish the asymptotic properties, we first assume the regularity conditions for the existence of the unique solution to , as discussed in Silvapulle (1981). To establish the asymptotic properties of the PS estimator under a high dimensional setup, assume can be decomposed as , where satisfy and . Note that is not helpful in explaining . Let be the dimension of , respectively, such that . Let and . We now make the following assumption:
{sec::asymp}

(A1) In a neighborhood of the true parameters , assume , and hold.
Condition (A1) is the usual regularity conditions for PS estimation. Define to be the PS estimator of using the covariate for the PS model and for computing the PS estimator. Let for simplicity.
Theorem 1
Assume that the solution to (4) is unique. Under assumptions (A1) and MAR assumption in (2), we can establish the following.

{high_PS}

The bias of the PS estimator satisfies
(19) {bias} 
The variance of the PS estimator has
(20) {variance} 
The PS estimator including is more efficient than the PS estimator including only, in the sense of
The proof of Theorem 1 is shown in Appendix C. In (19), the bias of the PS estimator depends on the order of . If is bounded, then the bias of PS estimator is asymptotically negligible. If increases with , then the PS estimator can be significant biased. From the first two statements of Theorem 1, we can see that the PS estimator is significantly biased and its variance keeps increasing as increases. Under the sparsity setup, the true response model is not necessarily optimal. That is, including which is correlated with helps to improve the performance of the PS estimator.
We now establish the model selection consistency under the Bayesian framework. The Bayesian model selection consistency is satisfied if the posterior probability of the true model converges to one as the sample size goes to infinity (Casella et al., 2009). To achieve the model selection consistency or Oracle property (Fan and Li, 2001; Zou, 2006), we further assume the following condition.

(A2) Assume and .

(A3) In the SpikeSlab prior in (9), , , and .
{asump:a6}
Condition (A1) is the sparsity assumption. The choice of represents a noninformative prior for each covariate component. The following theorem establishes the Oracle property of the proposed Bayesian sparse propensity score method.
Theorem 2
Under assumptions (A1)–(A3), and the MAR assumption in (2), we have
in probability, where is the true response model and is the marginal posterior probability in (17).
{thm:model_consistnecy}
The proof of Theorem 2 is given in Appendix D. According to Theorem 2, we observe that the probability that Step 1 selects the true model becomes very close to one when the sample size is sufficiently large. Thus, the proposed Bayesian method can effectively eliminate irrelevant covariates and select important ones to adjust for nonresponse bias.
Note that, in Theorem 2, we assume , which can be extended with a small modification in Step 1. Instead of using noninformative priors, we use this assumption to make the prior distribution to satisfy . That can be implemented as dropping the generated candidate models until we obtain . Thus, our proposed method can be easily extended to and the ultrahigh dimensional setup of Chen and Chen (2008).
Since we assume the true response model is sparse, is fixed with increasing . Thus, the asymptotic normality can be established under the regularity conditions.
Theorem 3
Sang and Kim (2018) have already established the asymptotic normality of the Bayesian propensity score (BPS) estimator under the correctly specified response model. By Theorem 2, the probability that Step 1 selects the true model converges to one. Consequently, the asymptotic distribution of our BSPS estimator is the same as the asymptotic distribution of BPS estimator under the true model, which leads to the asymptotic normality of the BSPS estimator.
Remark 1
From Theorem 2, we can see that the model uncertainty of vanishes as . However, in finite samples, model selection always contributes to the variability of . The advantage of the proposed Bayesian method lies in its capture of the variability of the model uncertainty in the finite sample case. The posterior distribution of in Step 1 incorporates the model selection uncertainty automatically. By the Law of Large Numbers (LLN), we can show that
where is generated from Step 2 given model . In the finite sample, represents the variability due to the model uncertainty. Bt Theorem 2, , as , which leads to .
5 Optimal Bayesian sparse propensity score estimation
In Section 4, Theorem 1 points to the efficiency gain in including the covariates , which are correlated with the outcome variable, into the PS model. However, the sparse Bayesian method in Section 3 only selects the covariates involved in the true response model. Therefore, in this section, we propose an optimal Bayesian sparse propensity score estimation that improves the efficiency by incorporating the relevant auxiliary variables from the full sample.
{sec::extension}
To select the covariates correlated with given that is selected, we propose to use the following “working” outcome model
(21) {working} 
where and independently. Let be the density function for the working model in (21). Note that, the purpose of the outcome model in (21) is to select important covariates in addition to to improve efficiency. The validity of the resulting PS estimator does not require the outcome model assumption (21) to hold. Therefore, the same model selection method using the SpikeandSlab prior can also be used. Let , where
To select additional relevant variables given , we can assign
(22) {uz} 
where and are fixed to be small and large values, respectively. The prior distribution for is informative for , as we want to include in advance before including .
Let the prior distribution of , say , be the inverse gamma distribution with parameters . We set to create a noninformative prior. Then, the posterior distribution
(23) {post_u} 
where . To generate from the posterior distribution, the same data augmentation method (Tanner and Wong, 1987) can be used. The implementation of (23) can be described as follows.

Istep: Given and , generate from
(24) {Istep} for
Given the augmented PS model , the following reduced estimating equations can be used to estimate the optimal without any model assumptions on .
where and are respectively subvectors of and corresponding to the chosen model . Given , let . To generate the posterior distribution of given and , approximate Bayesian computation can be used. Under some regularity conditions, we can establish that
(26) {asym3} 
Using the asymptotic sampling distribution in (26) to replace the role of likelihood, the posterior distribution of can be generated from
(27) {pos_zeta} 
where is the density function from (26) and is a flat prior.
6 Simulation studies
{sec::simulation}
In this section, we conduct two simulation studies to examine the finite sample performance of the proposed Bayesian method. The first simulation study investigates the proposed Bayesian method under the IID setup. In the second simulation study, we apply our proposed method using real data obtained from the 2006 Korean Labor and Income Panel Study (KLIPS).
6.1 Simulation study I
In the first simulation, our data generation process consists of the following two parts.

Generate a random sample of size , , from each of the following models:
where with , and the errors are generated independently from . The covariates are independently generated from , where . We use two values for : for independent covariates and for correlated covariates. Also, we use and .

For , generate the response indicator of from the following response mechanism:
Note that in our setup controls the amount of sparsity on the propensity score. As increases, the propensity score becomes more sparse. We are interested in estimating .
For each setup, we generate Monte Carlo samples of size and we apply the following methods:

PS: The traditional PS estimator, say , is obtained by solving the joint estimating equations
where . The variance of is estimated by Taylor linearization. The 95% confidence intervals are constructed from the asymptotic normal distribution of .

TPS: The true propensity score (TPS) method in which the ordinary PS method is applied using the covariates in the true response mechanism. The 95% confidence intervals are constructed from the asymptotic normal distribution of

LASSO: We first apply the LASSO method to select the response model with in (5) chosen by the default 5fold crossvalidation method. The algorithm is implemented in “glmnet” (Friedman et al., 2009). Then we apply the traditional PS estimation method to the selected response model. Variances and confidence intervals are obtained by using the asymptotic normal distribution of for the selected response model.

BSPS: The Bayesian sparse PS method proposed in Section 3. In BSPS, we set , , and to induce noninformative priors. Using the formula in Section 3, we compute the BSPS estimate and its variance estimate based on the posterior sample of size after burnin iterations. The 95% confidence intervals are constructed from the quantiles of the posterior sample.

OBSPS: The optimal Bayesian sparse PS method proposed in Section 5. In OBSPS, we use the same setup in BSPS and let .
To assess the variable selection performance of BSPS, OBSPS, and LASSO methods, we compute true positive rate (TPR) and true negative rate (TNR), where TPR is the proportion of the regression coefficients that are correctly identified as nonzero and TNR is the proportion of the regression coefficients that are correctly identified as zero. The coverage probabilities of each methods are computed by counting how often the confidence intervals contain the true parameter values. In BSPS and LASSO, we present TPR and TNR for the response model. In OBSPS, we show TPR and TNR for the working model to select correlated covariates. The simulation results for models and are presented in Tables 1 and 2, respectively.
Table 1 shows the numerical results for . Overall, the proposed methods perform similarly between correlated covariates and independent covariates . When dimension is low, specifically, when p=10, PS and TPS have similar performance in terms of bias and standard errors. TPS are more efficient than PS due to sparsity. LASSO and BSPS can select the true response model with large probabilities. However, LASSO and BSPS obtain larger standard errors than TPS due to additional model uncertainty under finite samples. Overall, BSPS outperforms LASSO in term of model consistency. OBSPS always provides the most efficient estimators by incorporating relevant auxiliary variables from the full sample. For small , all methods achieve approximately 95% coverage probabilities for corresponding confidence intervals or credible intervals for Bayesian models.
When increases to in , the PS estimator using all variables shows large standard errors. Moreover, the average of the estimated standard errors for the PS estimator is much smaller than the true standard error of the PS estimator, which leads to biased interval estimation and low coverage probability. Note that TPS is the gold standard method, where we pretend that we know the truth. Thus, TPS is invariant for large . BSPS performs better than LASSO for selecting the true response model. Therefore, the model uncertainty of BSPS and OBSPS are much smaller than LASSO. Table 1 shows that Monte Carlo standard error of LASSO is much larger than estimated standard error due to model uncertainty. However, the increased variances of BSPS and OBSPS are not as obvious as LASSO due to better model selection performance. In summary, BSPS obtains comparable estimator and inference with TPS. OBSPS is still most efficient relative to all other methods and it identifies the relevant covariates with probability one.
When increases to , PS fails to achieve convergence in solving score equation of . Thus, no numerical results are presented for PS. LASSO obtains low coverage probabilities, because of large model uncertainty. However, BSPS still works well and obtains similar performance with TPS. OBSPS outperforms by far all other methods.
Table 2 presents the numerical results for , where the outcome model is quadratic but our working model is still linear. The same conclusion from Table 1 can be made for the results of Table 2. OBSPS can only correctly identify without , since is not correlated with , even though the true outcome model has . Overall, OBSPS is the most efficient method. The model uncertainty of LASSO keeps increasing as increase, which leads to low coverage probabilities and biased estimation for standard errors. BSPS achieves comparable results with TPS.
6.2 Simulation study II
We also apply the Bayesian sparse propensity score method to the 2006 Korean Labor and Income Panel Survey (KLIPS) data. A breif description of the panel survey can be found at http://www.kli.re.kr/klips/en/about/introduce.jsp. In KLIPS data, there are 2,506 regular wage earners. The study variable is the monthly income in 2006. The auxiliary variables include the average monthly income in previous year and demographic variables. We grouped age into three levels: .
In this simulation study, we use the KLIPS data as a finite population. The realized sample is then obtained from the population by Simple Random Sampling (SRS) with sample size independently. Since the KLIPS data are fully observed, we artificially create a nonresponse scheme by applying the missing mechanism in (28). Note the two major differences here compared with the first simulation study. One is the mixed data types of the auxiliary variables. Another is the unknown outcome regression model. The simulation process is described in the following:

Obtain 200 samples from the KLIPS data by SRS.

Apply the response mechanism to the sample, so that the auxiliary variables are fully observed and the study variable is subject to missingness.

Apply the PS, LASSO, BSPS and OBSPS methods in simulation study I to the incomplete sample.

Repeat Step 1–3 for times.
The true response function is
(28) {R} 
where , is average monthly income in the previous year, and the response rate is approximately 70%. Suppose we are interested in the average monthly income . To fit the response model, we assume the response mechanism is
which is known up to the parameter . Thus, the joint estimating equations are
(29) {eq:joint_real_data} 
The analysis result is summarized in Table 3.
Method  MSE  Rbias 

PS  320.8  78.3 
LASSO  321.8  38.2 
BSPS  323.0  3.6 
OBSPS  311.2  4.4 
From Table 3, the mean square errors of all four methods are similar. However, the proposed optimal Bayesian sparse propensity score method (OBSPS) is most efficient, because OBSPS incorporates the relevant auxiliary variables in propensity score estimation. Due to large dimensions of auxiliary variables, the traditional propensity score (PS) estimation including all variables fails to provide a consistent variance estimator, as explained in Theorem 1. The propensity score model using LASSO also highly underestimates the variance due to large model uncertainty. In summary, the proposed BSPS and OBSPS provide consistent variance estimators uniformly regardless of the dimension of covariates.
7 Discussion
This paper presents a Bayesian approach to PS estimation using the SpikeandSlab prior for the response propensity model. Through the proposed BSPS method, model selection consistency holds and the uncertainty in model selection is fully captured by the Bayesian framework. The efficiency of the PS estimation can be further improved by incorporating relevant auxiliary variables, the socalled optimal Bayesian sparse propensity score (OBSPS) method. The simulation study in Section 6 shows that the Bayesian approach provides valid frequentist coverage probabilities in finite samples. Since the PS estimation is widely used in causal inference (Morgan and Winship, 2014; Hudgens and Halloran, 2008), applying the proposed methods to the sparse Bayesian causal inference can be developed similarly. Also, our proposed method is developed under the assumption of MAR. Extension of our proposed method to nonignorable nonresponse is a topic for future research.
{sec::discussion}
Appendices
In these appendices, we present the technical derivations and proofs for all stated theorems in this paper.
Appendix A Computational details
{App::A}
To generate from (18), the computation using the MetropolisHastings algorithm (Chib and Greenberg, 1995) can be quite heavy. Thus, instead of using the likelihood function of directly, we propose to use the Laplace approximation method. To discuss the approximation of (18), let be the maximizer of . From the SpikeandSlab prior in (9), is a Gaussian distribution with mean 0 and variance , where . Thus, maximizing is equivalent to solving
(30) {laplace_eq} 
Denote
(31) {V_phi} 
where is the negative fisher information matrix of defined as
Note that is always positive definite. Using second oder Taylor expansion, the Laplace approximation is
Therefore, generating from (18) is approximately equivalent to generating from , where is a consistent estimator with plugged in .
For Step 2b, note that, under some regularity conditions, we can establish that
(32) {asym2} 
Correspondingly, can be decomposed as
Thus, the asymptotic distribution in (32) implies that goes to a normal distribution