Semiparametric theory
Abstract
In this paper we give a brief review of semiparametric theory, using as a running example the common problem of estimating an average causal effect. Semiparametric models allow at least part of the data-generating process to be unspecified and unrestricted, and can often yield robust estimators that nonetheless behave similarly to those based on parametric likelihood assumptions, e.g., fast rates of convergence to normal limiting distributions. We discuss the basics of semiparametric theory, focusing on influence functions.
Keywords: causal inference, double robustness, efficiency theory, functional estimation, influence functions, nonparametric theory, robust inference, tangent space.
In this paper we give a review of semiparametric theory, using as a running example the common problem of estimating an average causal effect. Our review draws heavily on foundational work in general semiparametrics by \textcite{begun1983information, bickel1993efficient, pfanzagl1982contributions, van2000asymptotic, van2002semiparametric}, among others \autocite{newey1994asymptotic, kosorok2007introduction}, as well as many modern developments in missing data and causal inference problems by Robins and van der Laan \autocite{robins1986new, robins1994estimation, robins1995analysis, robins2000marginal, van2003unified, van2011targeted, van2014higher, robins2017minimax}, and colleagues \autocite{hahn1998role, tsiatis2006semiparametric}. We refer to \textcite{tsiatis2006semiparametric} for a very readable review with more details.
1 Setup
A standard setup in semiparametric theory is as follows. We suppose we observe a sample of independent and identically distributed observations $(Z_1, \dots, Z_n)$ distributed according to some unknown probability distribution $\mathbb{P}$. Then our goal is estimation and inference for a real-valued target parameter, or functional, $\psi = \psi(\mathbb{P}) \in \mathbb{R}^p$.
In this paper we focus on an example where we observe $n$ observations of $Z = (X, A, Y)$, with $X$ covariates, $A$ a binary treatment (or missingness) indicator, and $Y$ an outcome of interest, and our goal is to estimate the “treatment effect”
\[ \psi = \mathbb{E}\{\mu(X)\} = \int \mu(x) \, p(x) \, dx \tag{1} \]
with $\mu(x) = \mathbb{E}(Y \mid X = x, A = 1)$. Under causal assumptions such as no unmeasured confounding, the statistical parameter $\psi$ represents the causal quantity $\mathbb{E}(Y^1)$, i.e., the mean outcome had everyone taken treatment. If we let $Y = AY^*$ then $\psi$ also represents the mean of the partially observed outcome $Y^*$ under a missing at random assumption, where here $A$ is a missingness indicator. In what follows we write $p(V = t)$ for the density of a variable $V$ at $t$, but when there is no ambiguity we let $p(v) = p(V = v)$.
Other archetypal functionals considered in the literature include:

integral density functionals: $\psi = \int g\{p(z)\} \, dz$, e.g., $g(t) = t^2$ gives the integrated square density and $g(t) = -t \log t$ gives the entropy

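instrumental variable effect: $\psi = \dfrac{\mathbb{E}\{\mathbb{E}(Y \mid X, R = 1) - \mathbb{E}(Y \mid X, R = 0)\}}{\mathbb{E}\{\mathbb{E}(A \mid X, R = 1) - \mathbb{E}(A \mid X, R = 0)\}}$ for a binary instrument $R$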

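hazard ratio: $\psi(t) = \lambda(t \mid A = 1) / \lambda(t \mid A = 0)$ where $\lambda(t \mid a) = p(Y = t \mid A = a) / \mathbb{P}(Y \geq t \mid A = a)$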

optimal treatment regime value: $\psi = \max_d \mathbb{E}[\mathbb{E}\{Y \mid X, A = d(X)\}]$ over all $d : \mathcal{X} \to \{0, 1\}$

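g-formula: $\psi = \int \cdots \int \mathbb{E}(Y \mid \bar{X}_T = \bar{x}_T, \bar{A}_T = \bar{a}_T) \prod_{t=1}^{T} p(x_t \mid \bar{x}_{t-1}, \bar{a}_{t-1}) \, dx_t$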
The central feature of semiparametrics is that at least part of the data-generating process can be unrestricted or unspecified. This is crucial because knowledge of the true distribution $\mathbb{P}$ is typically lacking in practice, especially when the observations include numerous and/or continuous components. Luckily, it turns out that in many functional estimation problems there exist estimators that are $\sqrt{n}$-consistent and asymptotically normal, even in large nonparametric models that put minimal restrictions on $\mathbb{P}$. In other words, estimating functionals is typically an easier statistical problem than estimating all of $\mathbb{P}$.
2 Semiparametric Models
A statistical model $\mathcal{P}$ is a set of possible probability distributions, which is assumed to contain the true observed data distribution $\mathbb{P}$. Using a parametric model amounts to assuming the true distribution is known up to a finite-dimensional real-valued parameter $\theta \in \mathbb{R}^q$, e.g., we may have $\mathcal{P} = \{P_\theta : \theta \in \mathbb{R}^q\}$ with $\psi \subseteq \theta$. For example, if $Z$ is a scalar random variable one might assume it is normally distributed with unknown mean and variance, $Z \sim N(\mu, \sigma^2)$, in which case the model is indexed by $\theta = (\mu, \sigma^2) \in \mathbb{R} \times \mathbb{R}^+$. Semiparametric models are simply sets of probability distributions that cannot be indexed by only a Euclidean parameter, i.e., models that are indexed by an infinite-dimensional parameter in some way. Semiparametric models can vary widely in the amount of structure they impose; for example, they can range from nonparametric models for which $\mathcal{P}$ consists of all possible probability distributions, to simple regression models that characterize the regression function parametrically but leave the residual error distribution unspecified.
For the treatment effect functional $\psi$ in (1), and in other general causal inference and missing data problems, one may often want to incorporate some knowledge about the treatment mechanism $p(a \mid x)$ but leave the other components $p(x)$ and $p(y \mid x, a)$ unspecified. This is because the covariate/outcome mechanisms are often complex natural processes outside of human control, whereas the treatment mechanism is known in randomized trials, and can be well understood in some observational settings (for example, when a medical treatment is assigned in a standardized way, which is communicated by physicians to researchers). In an experiment where the propensity score $\mathbb{P}(A = 1 \mid X = x)$ is set to be a known function $\pi(x)$, for example, this amounts to the restriction
\[ p(z) = p(y \mid x, a) \, \pi(x)^a \{1 - \pi(x)\}^{1-a} \, p(x) \]
where $\{p(y \mid x, a), p(x)\}$ are viewed as unspecified infinite-dimensional nuisance parameters.
Of course it is not always the case that there is substantive information available about some component of $\mathbb{P}$, such as the treatment mechanism in the above model. In many studies no parts of the data-generating process are under human control, and all components may be unknown and possibly very complex (e.g., in studies where even the exposure itself is a disease or other medical condition). It would then often be more appropriate to consider inference under a nonparametric model that makes no parametric assumptions about the distribution $\mathbb{P}$. For instance, in the treatment effect example, one would thus also allow the propensity score $\pi(x) = \mathbb{P}(A = 1 \mid X = x)$ to be an unrestricted nuisance function. However, in order to obtain estimators that converge at $\sqrt{n}$ rates in nonparametric models, nuisance functions will often have to satisfy some structural conditions, such as Hölder smoothness or bounded variation.
Semiparametric models can also arise via parametric assumptions about non-Euclidean functionals. For example, the causal assumptions that identify $\mathbb{E}(Y^1)$ also imply $\mathbb{E}(Y^a \mid V = v) = \mathbb{E}\{\mathbb{E}(Y \mid X, A = a) \mid V = v\}$ for any $V \subseteq X$; thus one might employ a parametric assumption of the form $\mathbb{E}(Y^a \mid V = v) = m(v, a; \beta)$ but leave the rest of $\mathbb{P}$ unrestricted. Similarly, the famous Cox proportional hazards model assumes that the hazard ratio follows the parametric form $\lambda(t \mid A = 1) / \lambda(t \mid A = 0) = \exp(\beta)$. These restrictions are somewhat similar in spirit to classical parametric models. Unlike in an experiment where the treatment mechanism is known by design, however, these assumptions are not guaranteed by the study design.
3 Influence Functions
Here we discuss influence functions, foundational objects in nonparametric efficiency theory that allow us to characterize a wide range of possible estimators and their efficiency. There are two notions of an influence function: one corresponds to estimators and one corresponds to parameters. To distinguish these cases we will call the former influence functions and the latter influence curves; we focus on the former in this section.
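Let $\mathbb{P}_n = n^{-1} \sum_{i=1}^n \delta_{Z_i}$ denote the empirical distribution of the data, with $\delta_z$ the Dirac measure, so that sample averages can be written as $n^{-1} \sum_{i=1}^n f(Z_i) = \int f(z) \, d\mathbb{P}_n(z) = \mathbb{P}_n\{f(Z)\}$. Then an estimator $\hat\psi$ is asymptotically linear and has influence function $\varphi$ if it can be approximated by an empirical average of $\varphi$, i.e.,
\[ \hat\psi - \psi = \mathbb{P}_n\{\varphi(Z)\} + o_{\mathbb{P}}(1/\sqrt{n}) \tag{2} \]
where $\varphi$ has mean zero and finite variance (i.e., $\mathbb{E}\{\varphi(Z)\} = 0$ and $\mathbb{E}\{\varphi(Z) \varphi(Z)^T\} < \infty$). Here $o_{\mathbb{P}}(1/\sqrt{n})$ employs the usual stochastic order notation so that $X_n = o_{\mathbb{P}}(1/r_n)$ means $r_n X_n \to_p 0$ where $\to_p$ denotes convergence in probability.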
Importantly, by the central limit theorem, (2) implies $\hat\psi$ is asymptotically normal with
\[ \sqrt{n}(\hat\psi - \psi) \rightsquigarrow N\big(0, \operatorname{var}\{\varphi(Z)\}\big) \tag{3} \]
where $\rightsquigarrow$ denotes convergence in distribution. Thus if we know the influence function for an estimator, we know its asymptotic distribution and can easily construct confidence intervals and hypothesis tests. Also, the influence function for an asymptotically linear estimator is almost surely unique, so in this sense the influence function contains all information about an estimator’s asymptotic behavior (up to $o_{\mathbb{P}}(1/\sqrt{n})$ error).
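To make the link between influence functions and inference concrete, the following is a minimal sketch, assuming a scalar parameter and that estimated influence function values are available for each observation. The helper name `wald_interval` and the toy data are illustrative assumptions, not part of the original text. The toy case uses the sample mean, whose influence function is $\varphi(z) = z - \psi$.

```python
import numpy as np

def wald_interval(estimate, phi_values, z_crit=1.96):
    """Normal-approximation interval based on estimated influence function values.

    Uses the asymptotic variance var{phi(Z)} / n implied by asymptotic linearity.
    """
    phi_values = np.asarray(phi_values, dtype=float)
    n = phi_values.shape[0]
    se = np.sqrt(np.var(phi_values, ddof=1) / n)
    return estimate - z_crit * se, estimate + z_crit * se

# Toy illustration (simulated data, chosen only for this sketch):
# the sample mean has influence function phi(z) = z - psi.
rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, scale=3.0, size=5000)
psi_hat = z.mean()
lo, hi = wald_interval(psi_hat, z - psi_hat)
print(f"estimate {psi_hat:.3f}, approximate 95% interval ({lo:.3f}, {hi:.3f})")
```

For the sample mean this reduces to the familiar interval based on the sample standard deviation divided by $\sqrt{n}$.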
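Consider our running example where $\psi$ is defined as in (1). When the propensity score is known to be $\pi(x) = \mathbb{P}(A = 1 \mid X = x)$, a simple weighting estimator is given by
\[ \hat\psi^*_{ipw} = \mathbb{P}_n\left\{ \frac{AY}{\pi(X)} \right\}. \]
It is straightforward to check using iterated expectation that $\mathbb{E}(\hat\psi^*_{ipw}) = \psi$. Then the influence function for $\hat\psi^*_{ipw}$ is simply given by $\varphi^*_{ipw}(Z) = AY/\pi(X) - \psi$ since $\hat\psi^*_{ipw} - \psi = \mathbb{P}_n\{\varphi^*_{ipw}(Z)\}$ exactly, without any approximation error. Interestingly, it can be shown that the estimator $\hat\psi_{ipw}$ that uses an estimated propensity score $\hat\pi$ in place of the true $\pi$ (for example, fit by maximum likelihood in a correctly specified parametric model) is at least as efficient as the estimator $\hat\psi^*_{ipw}$ that uses the true $\pi$. This follows from the fact that the influence function for $\hat\psi_{ipw}$ equals that of $\hat\psi^*_{ipw}$ minus its projection onto the tangent space of the propensity score model, so that the variance of the former influence function must be less than or equal to that of the latter, by the Pythagorean theorem.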
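The weighting estimator with a known propensity score, which averages $AY/\pi(X)$ over the sample, can be sketched in a few lines. The simulated data-generating process below (uniform covariate, logistic propensity score, linear outcome regression with true $\psi = 1$) is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Hypothetical simulation design, assumed only for this sketch.
x = rng.uniform(-1.0, 1.0, size=n)
pi_x = 1.0 / (1.0 + np.exp(-x))            # known propensity score P(A=1 | X=x)
a = rng.binomial(1, pi_x)                  # treatment (or observation) indicator
y = 1.0 + 2.0 * x + rng.normal(size=n)     # mu(x) = 1 + 2x, so psi = E{mu(X)} = 1

# Inverse-probability-weighted estimator; only outcomes with A = 1 contribute.
psi_ipw = np.mean(a * y / pi_x)

# Estimated influence function values and the implied standard error.
phi_ipw = a * y / pi_x - psi_ipw
se_ipw = phi_ipw.std(ddof=1) / np.sqrt(n)
print(f"IPW estimate {psi_ipw:.3f} (standard error {se_ipw:.3f})")
```

Because the estimator is an exact sample average, the standard error follows directly from the sample variance of the influence function values.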
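Now consider the so-called doubly robust estimator $\hat\psi_{dr} = \mathbb{P}_n\{f(Z; \hat\eta)\}$ where
\[ f(Z; \hat\eta) = \frac{A\{Y - \hat\mu(X)\}}{\hat\pi(X)} + \hat\mu(X) \]
for $\hat\mu$ an estimator of the regression $\mu$, and $\hat\eta = (\hat\pi, \hat\mu)$. What is the influence function for $\hat\psi_{dr}$? Consider the decomposition
\[ \hat\psi_{dr} - \psi = (\mathbb{P}_n - \mathbb{P})\{f(Z; \hat\eta) - f(Z; \eta)\} + (\mathbb{P}_n - \mathbb{P}) f(Z; \eta) + \mathbb{P}\{f(Z; \hat\eta) - f(Z; \eta)\} \]
where $\mathbb{P}\{f(Z; \hat\eta)\} = \int f(z; \hat\eta) \, d\mathbb{P}(z)$ denotes the expected value of an estimated $f$ over a new observation, conditional on the data used to construct it. The above decomposition in fact holds even replacing $\eta = (\pi, \mu)$ with $\bar\eta = (\bar\pi, \mu)$ or $\bar\eta = (\pi, \bar\mu)$, i.e., if either $\hat\pi$ or $\hat\mu$ is consistent for its true target (not necessarily both), since then $\mathbb{P}\{f(Z; \bar\eta)\} = \psi$. This is the famous property called double robustness.
The first term in the decomposition will be $o_{\mathbb{P}}(1/\sqrt{n})$ under empirical process conditions, e.g., if the estimators are regular enough so that $f(\cdot\,; \hat\eta)$ lies in a Donsker class, or if sample splitting is used so that $\hat\eta$ is constructed on separate data. The second term is simply a centered sample average, and thus converges to a normal distribution after $\sqrt{n}$ scaling, by the central limit theorem. The third term is the really interesting one. For special estimators like $\hat\psi_{dr}$, it can be $o_{\mathbb{P}}(1/\sqrt{n})$ even when nuisance estimators converge at slower nonparametric rates. For example, with $\hat\psi_{dr}$ a sufficient condition is that $\hat\pi$ and $\hat\mu$ converge to $\pi$ and $\mu$ at a faster than $n^{-1/4}$ rate in $L_2$ norm. Under these kinds of conditions ensuring that the first and third terms in the decomposition are $o_{\mathbb{P}}(1/\sqrt{n})$, the influence function of the estimator $\hat\psi_{dr}$ will be given by $\varphi(Z) = f(Z; \eta) - \psi$, since $\hat\psi_{dr} - \psi = (\mathbb{P}_n - \mathbb{P}) f(Z; \eta) + o_{\mathbb{P}}(1/\sqrt{n}) = \mathbb{P}_n\{\varphi(Z)\} + o_{\mathbb{P}}(1/\sqrt{n})$.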
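A doubly robust estimator with sample splitting can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the simulated data, the two-fold split, and the simple parametric nuisance fits (`fit_linear`, `fit_logistic`, both hypothetical helpers written here) stand in for whatever regression methods one would actually use.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit of y on (1, x); returns a prediction function."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda x_new: coef[0] + coef[1] * x_new

def fit_logistic(x, a, n_iter=25):
    """Logistic regression of a on (1, x) by Newton's method; returns a prediction function."""
    design = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-design @ beta))
        grad = design.T @ (a - prob)
        hess = (design * (prob * (1.0 - prob))[:, None]).T @ design
        beta = beta + np.linalg.solve(hess, grad)
    return lambda x_new: 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x_new)))

# Hypothetical simulation design with true psi = E{mu(X)} = 1.
rng = np.random.default_rng(2)
n = 20000
x = rng.uniform(-1.0, 1.0, size=n)
a = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
y = 1.0 + 2.0 * x + rng.normal(size=n)

# Two-fold sample splitting: nuisances are fit on one fold, evaluated on the other.
folds = rng.permutation(n) % 2
f_hat = np.empty(n)
for k in (0, 1):
    train, held_out = folds != k, folds == k
    treated = train & (a == 1)
    mu_hat = fit_linear(x[treated], y[treated])     # regression among A = 1
    pi_hat = fit_logistic(x[train], a[train])       # propensity score
    m, p = mu_hat(x[held_out]), pi_hat(x[held_out])
    f_hat[held_out] = a[held_out] * (y[held_out] - m) / p + m

psi_dr = f_hat.mean()
se_dr = (f_hat - psi_dr).std(ddof=1) / np.sqrt(n)
print(f"doubly robust estimate {psi_dr:.3f} (standard error {se_dr:.3f})")
```

Fitting the nuisance functions on data separate from the fold where they are evaluated is what allows the first term of the decomposition to be controlled without Donsker-type conditions.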
So far we have seen that, given an estimator $\hat\psi$, we can learn about its asymptotic properties by considering its influence function $\varphi$. But we can also use influence functions to find or construct estimators, for example by solving estimating equations that use the putative influence function as an estimating function. There is a deep connection between (asymptotically linear) estimators for a given model and functional, and the corresponding influence functions. In some sense, if we know one then we know the other. This leads to the notion of an influence function for a parameter $\psi$, which we call an influence curve.
4 Tangent Spaces & Influence Curves
Here we use the term influence curves to denote influence functions for parameters. These are essentially putative influence functions: functions that could be the influence function of a properly constructed estimator, but which may not correspond to an estimator at all, and yet still exist and can be characterized based on the form of the functional $\psi$. First, though, we need to understand tangent spaces and parametric submodels.
As discussed in the previous section, influence functions (now called influence curves) are functions of the observed data $Z$, and have mean zero and finite variance. Such functions reside in the Hilbert space $\mathcal{H}$ of measurable functions $g : \mathcal{Z} \to \mathbb{R}^p$ with $\mathbb{E}\{g(Z)\} = 0$ and $\mathbb{E}\{g(Z)^T g(Z)\} < \infty$, equipped with covariance inner product $\langle g_1, g_2 \rangle = \mathbb{E}\{g_1(Z)^T g_2(Z)\}$. The space of influence curves will be a subspace of this Hilbert space. A Hilbert space is a complete inner product space, and generalizes usual Euclidean space; it provides a notion of distance and direction for spaces whose elements are potentially infinite-dimensional functions.
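A fundamentally important subspace of $\mathcal{H}$ in semiparametric problems is the tangent space. For parametric models indexed by real-valued parameter $\theta \in \mathbb{R}^q$, the tangent space $\mathcal{T}$ is defined as the linear subspace of $\mathcal{H}$ spanned by the score vector, i.e.,
\[ \mathcal{T} = \{ B S_\theta(Z; \theta_0) : B \in \mathbb{R}^{p \times q} \} \]
where $S_\theta(z; \theta_0) = \partial \log p(z; \theta) / \partial \theta \,|_{\theta = \theta_0}$. If we can decompose $\theta = (\psi, \eta)$ then we can equivalently write $\mathcal{T} = \mathcal{T}_\psi \oplus \mathcal{T}_\eta$ for
\[ \mathcal{T}_\psi = \{ B_1 S_\psi(Z; \theta_0) : B_1 \in \mathbb{R}^{p \times p} \}, \qquad \mathcal{T}_\eta = \{ B_2 S_\eta(Z; \theta_0) : B_2 \in \mathbb{R}^{p \times (q - p)} \} \]
where $S_\psi(z; \theta_0) = \partial \log p(z; \theta) / \partial \psi \,|_{\theta = \theta_0}$ is the score function for the target parameter, and similarly $S_\eta$ is the score for the nuisance parameter $\eta$ ($\mathcal{A} \oplus \mathcal{B}$ denotes the direct sum $\{a + b : a \in \mathcal{A}, b \in \mathcal{B}\}$). In the above formulation, the space $\mathcal{T}_\eta$ is called the nuisance tangent space. Influence curves for $\psi$ reside in the orthogonal complement of the nuisance tangent space, denoted by $\mathcal{T}_\eta^\perp$. In such parametric settings, this orthogonal space is
\[ \mathcal{T}_\eta^\perp = \{ g \in \mathcal{H} : g \perp \mathcal{T}_\eta \} = \{ h - \Pi(h \mid \mathcal{T}_\eta) : h \in \mathcal{H} \} \]
where $\Pi(h \mid \mathcal{S})$ denotes projections of $h$ on the space $\mathcal{S}$, i.e., $\langle h - \Pi(h \mid \mathcal{S}), s \rangle = 0$ for all $s \in \mathcal{S}$. The subspace of influence curves is the set of elements $\varphi \in \mathcal{T}_\eta^\perp$ that satisfy $\mathbb{E}(\varphi S_\psi^T) = I$. The efficient influence curve is the influence curve with the smallest covariance $\mathbb{E}(\varphi \varphi^T)$, and is given by $\varphi_{eff} = \mathbb{E}(S_{eff} S_{eff}^T)^{-1} S_{eff}$, where $S_{eff}$ is the efficient score, given by $S_{eff} = S_\psi - \Pi(S_\psi \mid \mathcal{T}_\eta)$.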
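As a worked illustration (added here, not taken from the original text), consider the normal model $Z \sim N(\mu, \sigma^2)$ with target parameter $\psi = \mu$ and nuisance parameter $\eta = \sigma^2$. Writing $S_\psi$ and $S_\eta$ for the two scores, $\Pi(\cdot \mid \mathcal{T}_\eta)$ for projection onto the span of $S_\eta$, and $S_{eff}$ for the efficient score, the calculation runs as follows.

```latex
\begin{align*}
S_\psi(z) &= \frac{\partial}{\partial \mu} \log p(z; \mu, \sigma^2) = \frac{z - \mu}{\sigma^2}, \\
S_\eta(z) &= \frac{\partial}{\partial \sigma^2} \log p(z; \mu, \sigma^2) = \frac{(z - \mu)^2 - \sigma^2}{2 \sigma^4}, \\
\mathbb{E}\{S_\psi(Z) S_\eta(Z)\} &= \frac{\mathbb{E}\{(Z - \mu)^3\} - \sigma^2 \, \mathbb{E}(Z - \mu)}{2 \sigma^6} = 0
  \quad \Longrightarrow \quad S_{eff} = S_\psi - \Pi(S_\psi \mid \mathcal{T}_\eta) = S_\psi, \\
\varphi_{eff}(z) &= \mathbb{E}(S_{eff}^2)^{-1} S_{eff}(z) = \sigma^2 \cdot \frac{z - \mu}{\sigma^2} = z - \mu .
\end{align*}
```

The efficient influence curve $z - \mu$ is exactly the influence function of the sample mean, so in this model the sample mean attains the efficiency bound $\sigma^2$.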
Thus if we can characterize the nuisance tangent space and its orthogonal complement, then we can characterize influence curves. In fact, one can show that all regular asymptotically linear estimators have influence functions $\varphi$ that reside in the orthogonal complement $\mathcal{T}_\eta^\perp$ of the nuisance tangent space with $\mathbb{E}(\varphi S_\psi^T) = I$, and conversely any element in this space corresponds to the influence function for some regular asymptotically linear estimator. Thus characterizing the nuisance tangent space allows us to also characterize all potential (regular asymptotically linear) estimators.
In parametric models the tangent space is defined as the span of the score vector $S_\theta$. However, in semiparametric models the nuisance parameter is infinite-dimensional, and so we cannot define scores analogously, as it would require differentiation with respect to this nuisance parameter. How do we extend tangent spaces to infinite-dimensional semiparametric models? The answer lies in a clever device called a parametric submodel.
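A parametric submodel is a set of distributions $\mathcal{P}_\epsilon = \{P_\epsilon : \epsilon \in \mathbb{R}\}$ contained in a larger model $\mathcal{P}$, which also contains the truth, i.e., $\mathbb{P} \in \mathcal{P}_\epsilon \subseteq \mathcal{P}$. A typical example of a parametric submodel is given by
\[ p_\epsilon(z) = p(z) \{1 + \epsilon g(z)\} \]
where $\mathbb{E}\{g(Z)\} = 0$ and $\sup_z |g(z)| < M$ so that $p_\epsilon(z) \geq 0$ when $|\epsilon| < 1/M$. Note that $g(z) = \partial \log p_\epsilon(z) / \partial \epsilon \,|_{\epsilon = 0}$ is the score function for the above submodel. One intuition behind parametric submodels comes via efficiency bounds. First note that it is an easier problem to estimate $\psi$ under the smaller parametric submodel $\mathcal{P}_\epsilon$ than it is to estimate $\psi$ under the entire larger semiparametric model $\mathcal{P}$. Therefore the efficiency bound under the larger model $\mathcal{P}$ must be at least as large as the efficiency bound under any parametric submodel. In fact the efficiency bound for semiparametric models is typically defined in exactly this way, as the supremum of all such parametric submodel efficiency bounds.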
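The submodel $p_\epsilon(z) = p(z)\{1 + \epsilon g(z)\}$ can be checked numerically for discrete data. The sketch below assumes a three-point distribution and a direction $g$ chosen arbitrarily for illustration; it confirms that the perturbed masses remain a valid distribution for small $\epsilon$ and that a finite-difference derivative of $\log p_\epsilon$ at $\epsilon = 0$ recovers $g$.

```python
import numpy as np

# Hypothetical three-point distribution and perturbation direction.
p = np.array([0.2, 0.3, 0.5])
h = np.array([1.0, -2.0, 0.5])
g = h - np.sum(p * h)            # center so that E_p{g(Z)} = 0
M = np.max(np.abs(g))            # bound on |g|

def p_eps(eps):
    """Perturbed probability masses p(z) * {1 + eps * g(z)}."""
    return p * (1.0 + eps * g)

# Valid distribution whenever |eps| < 1 / M.
inside = p_eps(0.5 / M)
print("sums to one:", np.isclose(inside.sum(), 1.0), "nonnegative:", bool((inside >= 0).all()))

# Central finite difference of log p_eps at eps = 0 approximates the score g.
step = 1e-6
score_fd = (np.log(p_eps(step)) - np.log(p_eps(-step))) / (2.0 * step)
print("score matches g:", np.allclose(score_fd, g, atol=1e-6))
```

The centering of $g$ is what keeps the total mass equal to one for every $\epsilon$, and the bound on $|g|$ is what keeps the masses nonnegative.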
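Now, just as the tangent space is defined as the linear span of the score vector in parametric models, in semiparametric models the tangent space $\mathcal{T}$ is defined as the (closure of the) linear span of scores of the parametric submodels, i.e., $\mathcal{T} = \overline{\operatorname{span}}\{ S_\epsilon : \mathcal{P}_\epsilon \subseteq \mathcal{P} \}$ with $S_\epsilon(z) = \partial \log p_\epsilon(z) / \partial \epsilon \,|_{\epsilon = 0}$. Similarly, the nuisance tangent space for a semiparametric model is the set of scores in $\mathcal{T}$ that do not vary the target parameter $\psi$, i.e.,
\[ \mathcal{T}_\eta = \left\{ S_\epsilon \in \mathcal{T} : \frac{\partial \psi(P_\epsilon)}{\partial \epsilon} \Big|_{\epsilon = 0} = 0 \right\}. \]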
Importantly, in nonparametric models the tangent space is the whole Hilbert space of mean zero functions. For more restrictive semiparametric models the tangent space will be a proper subspace.
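Now we can define influence curves, in much the same way as in parametric models. A parameter $\psi$ is pathwise differentiable with influence curve $\varphi$ if
\[ \frac{\partial \psi(P_\epsilon)}{\partial \epsilon} \Big|_{\epsilon = 0} = \mathbb{E}\{\varphi(Z) S_\epsilon(Z)\} \tag{4} \]
for any regular parametric submodel $P_\epsilon$ with scores $S_\epsilon(z) = \partial \log p_\epsilon(z) / \partial \epsilon \,|_{\epsilon = 0}$ in the tangent space. The efficient influence curve is the unique influence curve that is also an element of the tangent space $\mathcal{T}$ (and thus can be defined as the projection of any influence curve on the tangent space $\mathcal{T}$). It is also the influence curve with the smallest covariance, in the sense that $\mathbb{E}(\varphi_{eff} \varphi_{eff}^T) \leq \mathbb{E}(\varphi \varphi^T)$ for all influence curves $\varphi$, and can further be expressed as $\varphi_{eff} = \mathbb{E}(S_{eff} S_{eff}^T)^{-1} S_{eff}$, where $S_{eff}$ is the efficient score, i.e., the projection of the score for $\psi$ onto the orthogonal complement of the nuisance tangent space, $S_{eff} = \Pi(S_\psi \mid \mathcal{T}_\eta^\perp)$.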
5 Finding & Using Influence Curves
Characterizing the influence curves, i.e., the putative influence functions, for a particular functional and model is a critical task with very important ramifications. The efficient influence curve gives the efficiency bound for estimating $\psi$, thus providing a benchmark against which estimators can be compared. Perhaps more importantly, influence curves can be used to construct estimators with very favorable properties, such as double robustness and general second-order bias, and which improve on naive plug-in estimators that require stronger smoothness conditions and often impractical bandwidth choices. In particular, given an influence curve $\varphi(Z; \eta, \psi)$ depending on nuisance functions $\eta$ and the parameter of interest $\psi$, one can construct an estimator $\hat\psi$ by solving the estimating equation $\mathbb{P}_n\{\varphi(Z; \hat\eta, \hat\psi)\} = 0$ based on the estimated influence curve. The resulting estimator can be shown to have influence function $\varphi(Z; \eta, \psi)$ using the logic from Section 3.
In semiparametric models the tangent space is a proper subspace of the Hilbert space $\mathcal{H}$ of mean-zero, finite-variance functions, and deriving influence curves can be a delicate task that generally requires characterizing the nuisance tangent space and its complement. In nonparametric models the situation is often more hopeful: then there is only one influence curve, it is efficient, and it can often be computed directly via derivative calculations. For example, one can temporarily assume discrete data and compute the Gâteaux derivative $\partial \psi\{(1 - \epsilon)\mathbb{P} + \epsilon \delta_z\} / \partial \epsilon \,|_{\epsilon = 0}$ along the submodel $P_\epsilon = (1 - \epsilon)\mathbb{P} + \epsilon \delta_z$ for $\delta_z$ the Dirac measure at $Z = z$, which yields the influence curve evaluated at $z$. Then it is typically straightforward to see the corresponding influence curve in the general case, e.g., by replacing probability mass functions with densities. For the treatment effect example, these calculations show that the efficient influence curve for the effect $\psi$ is exactly the influence function of the doubly robust estimator given in Section 3, namely $A\{Y - \mu(X)\}/\pi(X) + \mu(X) - \psi$.
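The Gâteaux derivative calculation can be checked numerically in a small discrete example. The sketch below assumes binary $X$, $A$ and $Y$ with an arbitrarily chosen joint probability mass function (an illustrative assumption); it perturbs the distribution toward a point mass at each $z = (x, a, y)$, differentiates the treatment effect functional by finite differences, and compares the result with the closed form $a\{y - \mu(x)\}/\pi(x) + \mu(x) - \psi$.

```python
import numpy as np
from itertools import product

# Hypothetical joint pmf p[x, a, y] on {0, 1}^3 (entries sum to one).
p = np.array([[[0.10, 0.05], [0.15, 0.10]],
              [[0.20, 0.10], [0.05, 0.25]]])

def nuisances(q):
    """Return the marginal of X, the propensity score pi(x), and mu(x) = E(Y | X=x, A=1)."""
    q_x = q.sum(axis=(1, 2))
    q_x1 = q[:, 1, :].sum(axis=1)
    return q_x, q_x1 / q_x, q[:, 1, 1] / q_x1

def psi(q):
    """Treatment effect functional: sum over x of p(x) * mu(x)."""
    q_x, _, mu = nuisances(q)
    return float(np.sum(q_x * mu))

def influence_curve(q, x, a, y):
    """Closed-form candidate: a{y - mu(x)}/pi(x) + mu(x) - psi."""
    _, pi, mu = nuisances(q)
    return a * (y - mu[x]) / pi[x] + mu[x] - psi(q)

eps = 1e-6
max_gap = 0.0
for x, a, y in product((0, 1), repeat=3):
    delta = np.zeros_like(p)
    delta[x, a, y] = 1.0                      # point mass at z = (x, a, y)
    direction = delta - p                     # derivative direction of (1 - eps) p + eps delta
    gateaux = (psi(p + eps * direction) - psi(p - eps * direction)) / (2.0 * eps)
    max_gap = max(max_gap, abs(gateaux - influence_curve(p, x, a, y)))

print(f"largest gap between numerical derivative and closed form: {max_gap:.2e}")
```

Agreement at every support point is a useful sanity check on a hand-derived influence curve before moving from probability mass functions to densities.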
6 References