C_{p} criterion for semiparametric approach in causal inference

# Cp criterion for semiparametric approach in causal inference

## Abstract

For marginal structural models, which recently play an important role in causal inference, we consider a model selection problem in the framework of a semiparametric approach using inverse-probability-weighted estimation or doubly robust estimation. In this framework, the modeling target is a potential outcome which may be a missing value, and so we cannot apply the AIC nor its extended version to this problem. In other words, there is no analytical information criterion obtained according to its classical derivation for this problem. Hence, we define a mean squared error appropriate for treating the potential outcome, and then we derive its asymptotic unbiased estimator as a criterion from an asymptotics for the semiparametric approach and using an ignorable treatment assignment condition. In simulation study, it is shown that the proposed criterion exceeds a conventionally derived existing criterion in the squared error and model selection frequency. Specifically, in all simulation settings, the proposed criterion provides clearly smaller squared errors and higher frequencies selecting the true or nearly true model. Moreover, in real data analysis, we check that there is a clear difference between the selections by the two criteria.

Keywords: Doubly robust estimation; Inverse-probability-weighted estimation; Marginal structural model; Missing data analysis; Model selection; Statistical asymptotic theory

## 1 Introduction

The marginal structural model (Robins 1997, Robins et al. 2000) is one of the most basic models in causal inference. This is a potential outcome model, and the data are regarded to be partly missed. Therefore, if we do estimation naively despite that the outcome and missing mechanism are correlated, the estimator will have a large bias. While this bias is removed if we can correctly specify the correlation, it is common to rely on a semiparametric approach using inverse-probability-weighted estimation (Robins et al. 1994) or doubly robust estimation (Scharfstein et al. 1999, Bang and Robins 2005) without the difficult modeling.

As an example, let us consider a simple marginal structural model (Platt et al. 2013, Talbot et al. 2015), where is a potential outcome for the -th sample with the treatment , is an indicator which is if the treatment is received and otherwise, and is an error. In this model, with is regarded as being missed. Therefore, if we estimate the regression form by the least squares method in spite of existing the correlation between and , a bias yields as a matter of course. Then, supposing that a confounder between and is observed, a semiparametric approach using the propensity score (Rosenbaum and Rubin 1983) is commonly used. Under this setting, we treat a model selection problem for the regression form of interest, which is the selection problem of the order in the polynomial in this example.

To be surprising, there is no information criterion made by adjusting classical ones to this basic problem except for one. The valuable one is QIC in Platt et al. (2013). This criterion is made by replacing the goodness-of-fit term in QIC (Pan 2001), the quasi-maximum log-likelihood, with a quasi-maximum weighted log-likelihood in order to cope with the missing values. That is, QIC uses the same penalty term as in QIC although QIC does not cope with the missing values. In this paper, we show that if we evaluate the penalty term based on the original definition of an information criterion, it becomes quite different term from QIC’s.

As written in Platt et al. (2013), while the model selection problem for the regression form is little treated, the confounder selection problem is treated in, for example, Brookhart and van der Laan (2006) and Vansteelandt et al. (2012). These papers use a cross-validation-type method with a high computational cost or the FIC (Claeskens and Hjort 2003) based on a special assumption of local misspecification. In this paper, it is not considered to develop them for our problem, and we construct a method without relying on such a computational cost or special assumption.

In Section 2, the model and assumption are explained, and we introduce the inverse-probability-weighted estimation and doubly robust estimation under them. In Section 3, first we give two kinds of mean squared errors, MwSE and MuSE, appropriate for treating missing mechanism, and then we get goodness-of-fit and penalty terms similarly to in the derivation of the conventional criterion. Note that the goodness-of-fit term in MwSE becomes the same one as in QIC. Next, we asymptotically evaluate the penalty terms for the inverse-probability-weighted and doubly robust estimations by using techniques similar to in showing the consistency of these estimators. As a result, this asymptotic takes the form we can easily evaluate, and we set it as our proposed criterion. In Sections 4 and 5, we compare the performances of the existing and proposed criteria through simulation studies under basic situations as mentioned above and real data analysis, respectively. In Section 6, to explore the possibility for improvement and generalization of the proposed criterion, we mention about modifying the mean squared error and applying it to missing data analysis.

## 2 Preliminary

### 2.1 Model and assumptions

The marginal structural mean model is a model for the marginal means of potential outcomes. Let us assume that there are kinds of treatments, and we denote a potential outcome for the -th treatment by , and let be a random indicator which is if the -th treatment is received and otherwise (). Then, we consider a marginal structural model

 y=H∑h=1t(h)y(h)=H∑h=1t(h)(X(h)β+ε),

which assumes a linear regression model by each potential outcome. In the right hand side, is an independent variable matrix, is an error vector whose mean is and dispersion matrix is , where is a zero vector or a zero matrix and is an identity matrix. Note that in the left hand side is an observed outcome. In this model, potential outcomes, ’s with , are regarded as missing values. Therefore, if we estimate naively from observed outcomes, the estimator will have a bias because in general. In this paper, we suppose that a confounder vector between and is observed so that this bias can be removed.

For this model, we make several basic assumptions. First, let us consider . Although we consider a non-random variable as the components of in the example in Section 1, here we allow it to include a part of confounder vector in order to treat more general setting. In addition, to reduce the complexity of expressions, we assume that these independent variables are standardized so that . This assumption is not essential, and actually the final form of the derived criterion in the following does not depend on whether we make this assumption or not. Next, we assume a weakly ignorable treatment assignment condition (Imbens 2000)

 y(h)⊥t(h)∣z(h∈{1,2,…,H}),

which is to assure that we can remove the above-mentioned bias. Note that we can replace with in this condition.

Now we have samples following this model, and we put subindex in variables for the -th sample. In addition, let , , and , and then we can express the model by

 ~y=H∑h=1T(h)~y(h)=H∑h=1T(h)(~X(h)β+~ε).

Here, we assume that the samples are independent each other, that is,

 Missing dimension or its units for \hskip

From this, it holds as a matter of course. Moreover, we assume that and are independent as done for conventional regression models.

### 2.2 Estimation method

If the relationship between the potential outcome and confounder is correctly modeled, we can easily give a consistent estimator of the marginal mean for under the ignorable treatment assignment condition. However, this modeling is difficult in general. Therefore, in recent years, it is often the case that we rely on a semiparametric approach using so-called the propensity score, , which does not depend on the correct modeling. Here, is a parameter vector relating to the propensity score. In this paper, we treat two kinds of estimation methods basic in this approach.

The first one is the inverse-probability-weighted estimation (Robins et al. 1994). In this method, missing values are restored through weighting the observed values by the inverse of the propensity score, and then a conventional estimation is used. Specifically, we define a weighted squared loss function as

 H∑h=1(~y−~X(h)β)TW(h)(α)(~y−~X(h)β) (1)

using a weight matrix , and then the inverse-probability-weighted estimator

 ^βIPW(α)≡{H∑h=1~X(h)TW(h)(α)~X(h)}−1H∑h=1~X(h)TW(h)(α)~y (2)

is given by minimizing the loss function with respect to . If is unknown, we obtain the maximum likelihood estimator through , the conditional probability function of given , and we use it in place of , where . This inverse-probability-weighted estimator is consistent under the ignorable treatment assignment condition.

While is correlated with in general in the marginal structural model, the inverse-probability-weighted estimation does not directly use the information of for estimating the marginal mean of . The doubly robust estimation (Scharfstein et al. 1999, Bang and Robins 2005) implements it to improve the inverse-probability-weighted estimation, and it uses , the conditional probability density function of given . Here, is a parameter vector relating to the conditional distribution. Denoting the expectation based on this conditional distribution by , the doubly robust estimator is given by minimizing with respcet to the expression which is made by adding

 H∑h=1(E[~y(h)∣~z;γ]−~X(h)β)T{I−W(h)(α)}(E[~y(h)∣~z;γ]−~X(h)β)

to (1). In the framework of the doubly robust estimation, usually and are unknown, and so we replace them with the maximum likelihood estimators and which are obtained through and , respectively. To avoid complex statements, hereafter we omit these arguments. Then, the doubly robust estimator is expressed as

 ^βDR≡(H∑h=1~X(h)T~X(h))−1H∑h=1{~X(h)TW(h)~y+~X(h)T(I−W(h))E[~y(h)∣~z]}.

This estimator not only improves the inverse-probability-weighted estimator but also achieves to be semiparametrically efficient (Robins and Rotnitzky 1995). In addition, when either the propensity score or the conditional distribution is correctly specified, the estimator is consistent.

## 3 Proposed model selection criteria

### 3.1 Mean squared errors for causal inference

Before defining a mean squared error for causal inference, we will explain about QIC proposed by Platt et al. (2013). When there are no missing data and the dispersion matrix of is , the criterion in Pan (2001) is written as

 QIC=H∑h=1(~y(h)−~X(h)^β)T(~y(h)−~X(h)^β)+2σ2p,

where is a quasi-maximum likelihood estimator. This is an unbiased estimator of so-called a quasi-likelihood version of the Kullback-Leibler divergence, in other words, this is a criterion derived from the conventional mean squared error, and so QIC is regarded as a reasonable criterion. On the other hand, when there are missing data, QIC cannot be obtained and

 QICw=H∑h=1(~y(h)−~X(h)^β)TW(h)(~y(h)−~X(h)^β)+2σ2p

is proposed. This criterion is based on the fact that if is the above-mentioned quasi-maximum likelihood estimator, it holds under the ignorable treatment assignment condition because . However, if is the inverse-probability-weighted estimator or the doubly robust estimator, it is not conditionally independent of , and so we have in general. Even more important is that is a penalty for an estimator ignoring the existence of missing data and not for the semiparmetric estimator, and it becomes a problem in using QIC. Actually, the variance of the latter estimator is much larger than that of the former estimator, and so we need to enlarge the penalty for the latter estimator.

Hence, let us consider two kinds of appropriate mean squared errors for the case where there are missing data. As the first kind, we define a mean weighted squared error by

 MwSE= Missing or unrecognized delimiter for \right = H∑h=1E[(~y−~X(h)^β)TW(h)(~y−~X(h)^β)] −H∑h=1E[(~y−E[~y(h)∣~X(h)])TW(h)(~y−E[~y(h)∣~X(h)])] +2H∑h=1E[(~y−E[~y(h)∣~X(h)])TW(h)(~X(h)^β−E[~y(h)∣~X(h)])]. (3)

According to the derivation of the conventional criterion, this is decomposed into three terms after the definition. This sum of weighted squared differences can be regard as the sum of squared differences between the expectations for the data restored by using the weight, which is also used in the inverse-probability-weighted estimation, and their estimators. Actually, the first term in the decomposition is the expectation of (1). That is, we consider the same loss function in the derivation and in the error evaluation for the estimator, and so it is natural in that term. As the second kind, we define a mean unweighted squared error by

 MuSE= Missing or unrecognized delimiter for \right = H∑h=1E[(~y−~X(h)^β)TT(h)(~y−~X(h)^β)] −H∑h=1E[(~y−E[~y(h)∣~X(h)])TT(h)(~y−E[~y(h)∣~X(h)])] +2H∑h=1E[(~y−E[~y(h)∣~X(h)])TT(h)(~X(h)^β−E[~y(h)∣~X(h)])]. (4)

This is the sum of squared differences between the expectations for observed data themselves and their estimators. In term of the improvement of estimation accuracy for observed data, this loss function may be more natural than before. According to the derivation of the conventional criterion, we remove the expectation in the first term, ignore the second term independent of models and asymptotically estimate the third term after setting , and we propose it as a criterion in causal inference. In the asymptotic evaluation, a main term is extracted from the contents of the expectation, and we take its expectation explicitly. Then, we denote the criteria derived from MwSE and MuSE by w and u, respectively.

### 3.2 Criterion for inverse-probability-weighted estimation with known propensity scores

Let us derive w for the inverse-probability-weighted estimation when is known. In (2), the inversed matrix divided by is expressed as

 1NH∑h=1N∑i=1t(h)ie(h)iX(h)TiX(h)i=H∑h=1E[t(h)e(h)X(h)TX(h)]{1+oP(1)}=I{1+oP(1)}. (5)

The second equality holds because of the assumption for and because the expectation is written as from the ignorable treatment assignment condition. In addition, using , the error of the inverse-probability-weighted estimator is expressed as

 ^βIPW−β=1NH∑h=1~X(h)TW(h)~ε{1+oP(1)}=1NH∑h=1N∑i=1t(h)ie(h)iX(h)Tiεi{1+oP(1)}. (6)

Therefore, replacing with this main term in the third term in the right hand side of (3), the expectation in it is asymptotically evaluated as

 E⎡⎢⎣~εTW(h)~X(h)1NH∑k=1N∑j=1t(h)je(h)jX(h)Tjεj⎤⎥⎦=1NH∑k=1N∑i,j=1E⎡⎢⎣t(h)ie(h)iεTiX(h)it(k)je(k)jX(k)Tjεj⎤⎥⎦. (7)

This expectation for the case where is the product of the expectations for and from the independence among samples, and it can be ignored because it holds from the ignorable treatment assignment condition and the independence between and that

 E⎡⎣t(h)ie(h)iεTiX(h)i⎤⎦=E⎡⎣E⎡⎣t(h)ie(h)i∣zi⎤⎦E[εi∣zi]TX(h)i⎤⎦=E[εTiX(h)i]=0. (8)

Therefore, we have only to consider the case where . When , we can ignore the expectation for the case of because in this case , and so (7) is expressed as

 1NN∑i=1E⎡⎣t(h)2ie(h)2iεTiX(h)iX(h)Tiεi⎤⎦=1NN∑i=1E⎡⎣1e(h)iεTiX(h)iX(h)Tiεi⎤⎦.

We obtain this equality from using and the ignorable treatment assignment condition similarly to in the derivation of (8). Because ’s are identically distributed, w in the following theorem is derived as a result. For the derivation of u, which is also given in the theorem, see Appendix.

###### Theorem 1.

For the case where the propensity score is known, the criteria for the inverse-probability-weighted estimation are given as follows:

 {\rm w}Cp=H∑h=1(~y−~X(h)^βIPW)TW(h)(~y−~X(h)^βIPW)+2H∑h=1E[1e(h)εTX(h)X(h)Tε]

and

 {\rm u}Cp=H∑h=1(~y−~X(h)^βIPW)TT(h)(~y−~X(h)^βIPW)+2σ2p.

Although the expectation in the penalty term for w cannot be calculated in general, we can easily give its consistent estimator such as . Also in the followings, we propose to use such simple consistent estimators in place of the penalty terms.

Speaking of the forms of criteria, the penalty term for QIC is the same as for u. We can say that the increase of the penalty owing to considering the inverse-probability-weighted estimation and the decrease of the penalty owing to considering the loss function only for observed data are the same amount. On the other hand, the goodness-of-fit term for QIC is the same as for w. Considering that , the penalty in w is almost the inversed propensity score times the penalty for QIC. Thus, we can predict that the performances of w and QIC are quite different.

### 3.3 Criterion for inverse-probability-weighted estimation with unknown propensity scores

Let us derive w for the inverse-probability-weighted estimation when is unknown. As written in Section 2.2, we use the maximum likelihood estimator based on as . Then, letting and , as indicated in Hoshino et al. (2006), the error of the inverse-probability-weighted estimator is expressed as

 ^βIPW−β=1NH∑h=1N∑i=1⎛⎝t(h)ie(h)iX(h)Tiεi−Λ(h)J−1H∑k=1t(k)ie(k)i∂e(k)i∂α⎞⎠{1+oP(1)} (9)

(see Appendix). Using this in the third term in the right hand side of (3), the expectation is asymptotically evaluated as the expression which is made by adding

 −1NH∑k,l=1N∑i,j=1E⎡⎢⎣t(h)ie(h)iεTiX(h)iΛ(k)J−1t(l)je(l)j∂e(l)j∂α⎤⎥⎦=−H∑k=1E[εTX(h)Λ(k)J−11e(h)∂e(h)∂α]

to (7), and then w in the following theorem is derived. This equality is obtained from the fact that the samples are independently and identically distributed and the ignorable treatment assignment condition. See Appendix for more detail, which derives u in a similar way.

###### Theorem 2.

For the case where the propensity score is unknown, the criteria for the inverse-probability-weighted estimation are given as follows:

 {\rm w}Cp= H∑h=1(~y−~X(h)^βIPW)TW(h)(~y−~X(h)^βIPW)+2H∑h=1E[1e(h)εTX(h)X(h)Tε] −2H∑k,h=1tr⎛⎝E[1e(k)X(k)Tε∂e(k)∂αT]E[H∑l=11e(l)∂e(l)∂α∂e(l)∂αT]−1E[1e(h)X(h)Tε∂e(h)∂αT]T⎞⎠

and

 {\rm u}Cp= H∑h=1(~y−~X(h)^βIPW)TT(h)(~y−~X(h)^βIPW)+2σ2p −2H∑k,h=1tr⎛⎝E[1e(k)X(k)Tε∂e(k)∂αT]E[H∑l=11e(l)∂e(l)∂α∂e(l)∂αT]−1E[X(h)Tε∂e(h)∂αT]T⎞⎠.

From Theorems 1 and 2, it can be seen that the penalty for the unknown propensity score tends to be smaller than that for the known propensity score. It is known that the asymptotic variance for the inverse-probability-weighted estimator becomes smaller if the propensity score is estimated even for the case where it is known (see, e.g., Henmi and Eguchi 2004). The property of the penalties is consistent with this fact.

### 3.4 Criterion for doubly robust estimation

Let us derive w for the doubly robust estimation. In a similar way in Hoshino (2007), which derived the asymptotic distribution of the doubly robust estimator for a structural equation model with a missing mechanism, the error of the doubly robust estimator is shown to be expressed as

 ^βDR−β=1NH∑h=1N∑i=1⎧⎨⎩t(h)ie(h)iX(h)Tiεi+⎛⎝1−t(h)ie(h)i⎞⎠X(h)TiE[εi∣zi]⎫⎬⎭{1+oP(1)} (10)

(see Appendix). Its main term does not include the score function for , which indicates that is semiparametrically efficient. Using it in the third term in the right hand side of (3), the expectation is asymptotically evaluated as the expression which is made by adding

 1NH∑k=1N∑i,j=1E⎡⎢⎣t(h)ie(h)iεTiX(h)i⎛⎜⎝1−t(k)je(k)j⎞⎟⎠X(k)TjE[εj∣zj]⎤⎥⎦ Missing or unrecognized delimiter for \left

to (7), and then w in the following theorem is derived. This equality is obtained from the fact that the samples are independently and identically distributed and the ignorable treatment assignment condition. Note that unlike the case of the inverse-probability-weighted estimation, the expectation does not become even if . See Appendix for more detail, which derives u in a similar way.

###### Theorem 3.

The criteria for the doubly robust estimation are given as follows:

 {\rm w}Cp= H∑h=1(~y−~X(h)^βDR)TW(h)(~y−~X(h)^βDR)+2H∑h=1E[1e(h)εTX(h)X(h)Tε] Missing \left or extra \right

and

 {\rm u}Cp= H∑h=1(~y−~X(h)^βDR)TT(h)(~y−~X(h)^βDR)+2σ2p +2H∑h,k=1E[e(h)E[ε∣z]TX(h)X(k)TE[ε∣z]]−2H∑h=1E[E[ε∣z]TX(h)X(h)TE[ε∣z]].

## 4 Simulation study

### 4.1 Setup

Let us evaluate the performance of the proposed criterion through simulation study using a marginal structural model , which is introduced in Section 1. According to the setting in Platt et al. (2013), we set . In addition, letting , we consider a polynomial model whose order is at most because . As the true structure, let us consider

 y(h)=1+x(h)+bx(h)2+z1+ϵ,

and we set is , or to examine a second-order polynomial structure which is far from or close to first-order polynomial model. We assume that and are independently distributed according to a uniform distribution and a Gaussian distribution , respectively, and then is a noise with mean and variance . As for the propensity score, letting the true value of be , we assumed that

 e(h)∝exp(1{h≠1}αh−1z1).

In addition, we consider or as sample size.

Under this setting, the experiment of selecting from by each criterion is repeated times. In the -th order polynomial model, using nonsingular matrix such that , we set and . Then, we can express and it holds , which enable us to calculate all the criteria.

### 4.2 Results

First, let us investigate whether the asymptotic evaluation approximates the penalty well or not. For the third terms in the right hand side of (3) and (4), in Table 1, we compare Monte Carlo evaluations and the asymptotic evaluations in Theorems 1, 2 and 3. Note that the asymptotic evaluation in Theorem 1 for u is in this setting. From the table, we can check that the accuracy of evaluations tends to become high as the sample size increases. Considering that the penalty in QIC is , we can say that the penalty in w is more than enough close to the Monte Carlo evaluation even if .

Next, to compare the performances of these criteria, we evaluate the average of weighted or unweighted squared errors for the model selected by each criterion. In Tables 2, 3 and 4, the values are respectively for the inverse-probability-weighted estimation with known propensity scores, for the inverse-probability-weighted estimation with unknown propensity scores and for the doubly robust estimation. In all cases, w provides clearly smaller squared errors than QIC. On the other hand, u provides larger squared errors than QIC when the true structure is close to the first-order polynomial, while it is sometimes superior to w. Thus, basically we propose to use w.

Let us check the selection frequencies of the optimal model, which are given as a reference in the tables. Note that, in all tables, the true structure is second-order polynomial. When the true structure is extremely close to first-order polynomial, however, it must be appropriate to select the first-order polynomial considering a prediction. Therefore, a high selection frequency of the first-order polynomial model does not necessarily indicate an unreasonable model selection. Meanwhile, a high selection frequency of more than third-order polynomial is clearly unreasonable. In this view point, obviously QIC has a problem. On the other hand, we can see that w always selects the true second-order polynomial with high frequency.