# Quasi-Bayesian analysis of nonparametric instrumental variables models

## Abstract

This paper develops a quasi-Bayesian analysis of the nonparametric instrumental variables model, with a focus on the asymptotic properties of quasi-posterior distributions. Instead of imposing a distributional assumption on the data generating process, we consider a quasi-likelihood induced from the conditional moment restriction, and put priors on the function-valued parameter. We call the resulting posterior the quasi-posterior, which corresponds to the “Gibbs posterior” in the literature. Here we focus on priors constructed on slowly growing finite-dimensional sieves. We derive rates of contraction and a nonparametric Bernstein–von Mises type result for the quasi-posterior distribution, and rates of convergence for the quasi-Bayes estimator defined by the posterior expectation. We show that, with suitably chosen priors, the quasi-posterior distribution (resp. the quasi-Bayes estimator) attains the minimax optimal rate of contraction (resp. convergence). These results greatly sharpen previous related work.

DOI: 10.1214/13-AOS1150. *The Annals of Statistics*, Volume 41, Number 5 (2013), 2359–2390.

Running head: Quasi-Bayes for NPIV.

Supported by the Grant-in-Aid for Young Scientists (B) (25780152) from the JSPS.

Kengo Kato (kkato@e.u-tokyo.ac.jp)

AMS subject classifications: 62G08, 62G20. Keywords: asymptotic normality, inverse problem, nonparametric instrumental variables model, quasi-Bayes, rates of contraction.

## 1 Introduction

### 1.1 Overview

Let $(Y, X, W)$ be a triplet of scalar random variables, where $Y$ is a dependent variable, $X$ is an endogenous variable and $W$ is an instrumental variable. Without losing much generality, we assume that the support of $X$ is contained in $[0,1]$. The support of $W$ may be unbounded. We consider the nonparametric instrumental variables (NPIV) model of the form

(1)   $\mathrm{E}[Y - g_0(X) \mid W] = 0,$

where $g_0$ is an unknown structural function of interest. Alternatively, we can write the model in the more conventional form

$Y = g_0(X) + U, \qquad \mathrm{E}[U \mid W] = 0,$

where $U$ is potentially correlated with $X$ and hence $\mathrm{E}[U \mid X] \neq 0$.
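To make the endogeneity structure concrete, here is a small simulation sketch, writing $Y$ for the outcome, $X$ for the endogenous regressor and $W$ for the instrument. The structural function `g0` and all distributional choices below are hypothetical and for illustration only: the error `U` is independent of the instrument but correlated with the regressor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def g0(x):
    # Hypothetical structural function, chosen only for illustration.
    return np.sin(2 * np.pi * x)

# Instrument W, first-stage noise V, and a structural error U that is
# correlated with V (hence with X) but independent of W.
W = rng.standard_normal(n)
V = rng.standard_normal(n)
U = 0.8 * V + 0.6 * rng.standard_normal(n)

# Endogenous regressor with support in (0, 1), driven by W and V.
X = 1.0 / (1.0 + np.exp(-(W + V)))

Y = g0(X) + U

corr_UW = np.corrcoef(U, W)[0, 1]   # near 0: the instrument is valid
corr_UX = np.corrcoef(U, X)[0, 1]   # clearly nonzero: X is endogenous
```

In this design a naive nonparametric regression of `Y` on `X` would be biased for `g0`, which is exactly why the conditional moment restriction on `W` is needed.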

A model of the form (1) is of principal importance in
econometrics (see [28, 31]). From a statistical perspective,
the problem of recovering the structural function is
challenging, since it is an *ill-posed* inverse problem with the
additional difficulty of an *unknown* operator [the operator in (2)
ahead]. Statistical inverse problems, including the current problem,
have attracted considerable interest in statistics and econometrics
(see, e.g., [8, 9]). For mathematical background on inverse
problems, we refer to [43].

To see that the problem of recovering the structural function $g_0$ is an ill-posed inverse problem, suppose that $(X, W)$ has a square-integrable joint density $f_{X,W}$ and denote by $f_W$ the marginal density of $W$. Define the linear operator $T$ by

$(Tg)(w) = \int_0^1 g(x) \frac{f_{X,W}(x, w)}{f_W(w)} \, dx.$

Then the NPIV model (1) is equivalent to the operator equation

(2)   $T g_0 = r,$

where $r(w) = \mathrm{E}[Y \mid W = w]$. Suppose that $T$ is injective
to guarantee identification of $g_0$.^{1}
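The source of the ill-posedness can be sketched by a standard argument (stated here under the assumption that the kernel below is square integrable with respect to the relevant measures):

```latex
% Write (Tg)(w) = \int_0^1 g(x)\,k(x,w)\,dx with kernel
% k(x,w) = f_{X,W}(x,w)/f_W(w). If \iint k(x,w)^2\,dx\,\mu(dw) < \infty
% for the relevant measure \mu, then by the Cauchy--Schwarz inequality,
\|Tg\|^2 \;\le\; \|g\|_2^2 \iint k(x,w)^2 \,dx\,\mu(dw),
% so T is Hilbert--Schmidt and hence compact. A compact injective operator
% on an infinite-dimensional space has an unbounded inverse: small
% perturbations of r in (2) can correspond to large perturbations of g_0.
```

This is why a regularization device (here, the slowly growing sieve dimension) is needed for consistent recovery of the structural function.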

Approaches to estimating the structural function are roughly
classified into two types: methods involving Tikhonov
regularization [28, 16] and sieve-based methods
[45, 2, 5, 32].^{2}

This paper aims at developing a quasi-Bayesian analysis of the NPIV model, with a focus on the asymptotic properties of quasi-posterior distributions. The approach taken is quasi-Bayes in the sense that it neither needs to assume any specific distribution of the error term, nor has to put a nonparametric prior on the unknown likelihood function. The analysis is then based upon a quasi-likelihood induced from the conditional moment restriction. The quasi-likelihood is constructed by first estimating the conditional moment function in a nonparametric way, and then treating the resulting criterion as if it were a likelihood. For this quasi-likelihood, we put a prior on the function-valued parameter. By doing so, formally, a posterior distribution may be defined, which we call the “quasi-posterior distribution.” This posterior corresponds to what [35] called the “Gibbs posterior,” and has a substantive interpretation (see Proposition 1 ahead). The quasi-Bayesian approach in this paper builds upon [12], where the dimension of the parameter of interest is finite and fixed.

We focus here on priors constructed on slowly growing
finite-dimensional sieves (called “sieve or series priors”), where
the dimensions of the sieve spaces (which grow with the sample size)
play the role of regularization to deal with the problem of
ill-posedness. Potentially, there are several choices of sieve spaces,
but we choose to use wavelet bases to form them. Wavelet bases
are useful for treating smoothness classes such as
Hölder–Zygmund and Sobolev spaces in a unified and convenient way.
We also use wavelet series estimation of the conditional moment
function.^{3}

Under this setup, we study the asymptotic properties of the quasi-posterior distribution. The results obtained are summarized as follows. First, we derive rates of contraction for the quasi-posterior distribution and establish conditions on priors under which the minimax optimal rate of contraction is attained. Here the contraction is stated in the standard $L^2$-norm. Second, we show asymptotic normality of the quasi-posterior of the first $K$ generalized Fourier coefficients, where $K$ is the dimension of the sieve space. This may be viewed as a nonparametric Bernstein–von Mises type result (see [54], Chapter 10, for the classical Bernstein–von Mises theorem for regular parametric models). Third, we derive rates of convergence of the quasi-Bayes estimator defined by the posterior expectation and show that under some conditions it attains the minimax optimal rate of convergence. Finally, we give some specific sieve priors for which the quasi-posterior distribution (resp. the quasi-Bayes estimator) attains the minimax optimal rate of contraction (resp. convergence). These results greatly sharpen the previous work of, for example, [44], as we will review below.

### 1.2 Literature review and contributions

Closely related are [20] and [44]. The former paper worked on the reduced-form equation and assumed the error to be normally distributed. They considered a Gaussian prior on the structural function, and the posterior distribution is also Gaussian (conditionally on the variance of the error). They proposed to “regularize” the posterior and studied the asymptotic properties of the “regularized” posterior distribution and its expectation. Clearly, the present paper differs largely from [20] in that (i) we do not assume normality of the “error”; and (ii) roughly speaking, Florens and Simoni’s method is tied to the Tikhonov regularization method, while ours is tied to the sieve-based method with slowly growing sieves. We note that the settings of [19, 18] are largely different from that of the present paper; moreover, in the NPIV example, some high-level conditions on estimated operators are assumed in [19, 18], and hence they are not directly comparable to the present paper. Liao and Jiang [44] developed an important unified framework for estimating conditional moment restriction models based on a quasi-Bayesian approach, and their scope is more general than ours. They analyzed NPIV models in detail in their Section 4. Their posterior construction is similar to ours in, for example, the use of sieve priors, but differs in detail. For example, [44] transformed the conditional moment restriction into unconditional moment restrictions with an increasing number of restrictions. We, on the other hand, work directly on the conditional moment restriction, although whether Liao and Jiang’s approach loses any efficiency in the frequentist sense is not formally clear.

Importantly and substantially, neither [20] nor [44] established sharp contraction rates for their (quasi-)posterior distributions, nor asymptotic normality results. It is unclear whether Florens and Simoni’s [20] rates (in their Theorem 2) are optimal, since their assumptions are substantially different from those in the past literature such as [28] and [11]; moreover, strictly speaking, [20] did not formally derive contraction rates for their regularized posterior when the operator is unknown (note that [19, 18], though not directly comparable to the present paper, also did not formally derive posterior contraction rates in the NPIV example). Liao and Jiang [44] only established posterior consistency. Here we focus on a simple but important model, and establish sharper asymptotic results for the quasi-posterior distribution. Notably, a wide class of (finite-dimensional) sieve priors is shown to lead to the optimal contraction rate. Moreover, in [44], a point estimator of the structural function is not formally analyzed. Hence, the primary contribution of this paper is to considerably deepen the understanding of the asymptotic properties of the quasi-Bayesian procedure for the NPIV model.

The present paper deals with a quasi-Bayesian analysis of an infinite-dimensional model. The literature on theoretical studies of Bayesian analysis of infinite-dimensional models is large. See [24, 50, 26, 38, 25, 27] for general contraction rate results for posterior distributions in infinite-dimensional models. Note that these results do not directly apply to our case: the proof of the main general theorem (Theorem 1) depends on the construction of suitable “tests” (see the proof of Proposition 4), but how to construct such tests for a specific problem in a nonlikelihood framework is nontrivial, especially in the current NPIV model, where we have to deal with the ill-posedness of the inverse problem. Moreover, Proposition 4 alone is not sufficient for obtaining sharp contraction rates, and additional work is needed (see the proof of Theorem 1).

There is also a large literature on the Bayesian analysis of (ill-posed) inverse problems. One stream of research on this topic lies in the applied mathematics literature; see [51] and references therein. However, their models and scopes are substantially different from those of the present paper; for example, [29, 30] considered (ill-conditioned) finite-dimensional linear regression models with Gaussian errors and priors, and contraction rates of posterior distributions are not formally studied there. In the statistics literature, we may refer to [15, 40, 41, 1, 39] (in addition to [44, 19, 20, 18], which are already discussed), although their results are not applicable to the analysis of NPIV models because of the model's particular structure (in particular, the operator is unknown, and non-Gaussian “errors” and priors are allowed). Hence the present paper provides a further contribution to the Bayesian analysis of ill-posed inverse problems.

Our asymptotic normality result builds upon the previous work on asymptotic normality of (quasi-)posterior distributions for models with increasing number of parameters [22, 23, 4, 3, 7, 13, 6]. Related is [6], in which the author established Bernstein–von Mises theorems for Gaussian regression models with increasing number of regressors and improved upon the earlier work of [22] in several aspects. Reference [6] covered nonparametric models by taking into account modeling bias in the analysis. However, none of these papers covered the NPIV model, nor more generally linear inverse problems.

Finally, while we here assume injectivity of the operator in (2), as one of the anonymous referees pointed out, this condition is not a trivial assumption (see also the discussion after Assumption 3.2 in Section 3.2), and there are a number of works that relax the injectivity assumption and explore a partial identification approach, such as [46, 44, 42] and [10], Appendix A.

### 1.3 Organization and notation

The remainder of the paper is organized as follows. Section 2 gives an informal discussion of the quasi-Bayesian analysis of the NPIV model. Section 3 contains the main results of the paper where general theorems on contraction rates and asymptotic normality for quasi-posterior distributions, as well as convergence rates for quasi-Bayes estimators, are stated. Section 4 analyzes some specific sieve priors. Section 5 contains the proofs of the main results. Section 6 concludes with some further discussions. The Appendix contains some omitted technical results. Because of the space limitation, the Appendix is contained in the supplemental file [36].

Notation: For any given (random or nonrandom, scalar or vector) sequence $\{Z_i\}_{i=1}^n$, we use the notation $\mathbb{E}_n[Z] = n^{-1}\sum_{i=1}^n Z_i$, which should be distinguished from the population expectation $\mathrm{E}[Z]$. For any vector $z$, let $|z|^2 = z'z$, where $z'$ is the transpose of $z$. For any two sequences of positive constants $a_n$ and $b_n$, we write $a_n \lesssim b_n$ if the ratio $a_n/b_n$ is bounded, and $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$. Let $L^2[0,1]$ denote the usual $L^2$ space with respect to the Lebesgue measure for functions defined on $[0,1]$. Let $\|\cdot\|_2$ denote the $L^2$-norm, that is, $\|g\|_2 = (\int_0^1 g(x)^2\,dx)^{1/2}$. The inner product in $L^2[0,1]$ is denoted by $\langle\cdot,\cdot\rangle$, that is, $\langle f, g\rangle = \int_0^1 f(x)g(x)\,dx$. Let $C[0,1]$ denote the metric space of all continuous functions on $[0,1]$, equipped with the uniform metric. The Euclidean norm is denoted by $|\cdot|$. For any matrix $A$, let $s_{\min}(A)$ and $s_{\max}(A)$ denote the minimum and maximum singular values of $A$, respectively. Let $\|A\|_{\mathrm{op}}$ denote the operator norm of a matrix [i.e., $\|A\|_{\mathrm{op}} = s_{\max}(A)$]. Denote by $\phi(\cdot; \mu, \Sigma)$ the density of the multivariate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$.

## 2 Quasi-Bayesian analysis: Informal discussion

In this section, we outline a quasi-Bayesian analysis of the NPIV model (1). The discussion here is informal. The formal discussion is given in Section 3.

Let $\mathcal{G}$ be a parameter space (say, some smoothness class of functions, such as a Hölder–Zygmund or Sobolev space), for which we assume $g_0 \in \mathcal{G}$. We assume that $\mathcal{G}$ is at least contained in $L^2[0,1]$. Define the conditional moment function as $m(w, g) = \mathrm{E}[Y - g(X) \mid W = w]$. Then $g_0$ satisfies the conditional moment restriction

(3)   $m(W, g_0) = 0$ almost surely.

Equivalently, we have $\mathrm{E}[m^2(W, g_0)] = 0$.

In this paper, for the purpose of robustness, no specific distribution of the error is assumed, which we believe is more practical in statistical and econometric applications. So a Bayesian analysis in the standard sense is not applicable here, since a proper likelihood is not available. Instead, we use a quasi-likelihood induced from the conditional moment restriction (3).

Let $(Y_1, X_1, W_1), \dots, (Y_n, X_n, W_n)$ be i.i.d. observations of $(Y, X, W)$. By (3), a plausible candidate for the quasi-likelihood would be

$\exp\Big( -\frac{1}{2} \sum_{i=1}^n m^2(W_i, g) \Big),$

since its population counterpart is maximized at the true structural function $g_0$. However, this is infeasible, since $m(\cdot,\cdot)$ is unknown. Instead of using $m$, we replace $m$ by a suitable estimate $\widehat{m}$ and use the quasi-likelihood of the form

$\exp\Big( -\frac{1}{2} \sum_{i=1}^n \widehat{m}^2(W_i, g) \Big).$

Below we use a wavelet series estimator of $m$.

The quasi-Bayesian analysis considered here uses this quasi-likelihood as if it were a proper likelihood and puts priors on the structural function. In this paper, as in [44], we shall use sieve priors (more precisely, priors constructed on slowly growing sieves; [44] indeed considered another class of priors as well, see their supplementary material). The basic idea is to construct a sequence of finite-dimensional sieves that well approximates the parameter space (i.e., each function in the parameter space is well approximated by some function in the sieve as the sieve dimension becomes large), and to put priors concentrating on these sieves. Each sieve space is a subset of a linear space spanned by some basis functions. Hence the problem reduces to putting priors on the coefficients of those basis functions. Such priors are typically called “(finite-dimensional) sieve priors” (or “series priors”) and have been widely used in nonparametric Bayesian and quasi-Bayesian analysis (see, e.g., [24, 48, 25]).

Let $\Pi_n$ be a so-constructed prior on $\mathcal{G}$. Then, formally, the posterior-like distribution of $g$ given the data may be defined by

(4)   $\Pi_n(dg \mid \mathrm{data}) = \frac{\exp\big( -\frac{1}{2}\sum_{i=1}^n \widehat{m}^2(W_i, g) \big)\,\Pi_n(dg)}{\int \exp\big( -\frac{1}{2}\sum_{i=1}^n \widehat{m}^2(W_i, g) \big)\,\Pi_n(dg)},$

which we call the “quasi-posterior distribution.” The quasi-posterior distribution is not a proper posterior distribution in the strict Bayesian sense, since the quasi-likelihood is not a proper likelihood. Nevertheless, $\Pi_n(\cdot \mid \mathrm{data})$ is a proper distribution, that is, $\Pi_n(\mathcal{G} \mid \mathrm{data}) = 1$. Similar to proper posterior distributions, contraction of the quasi-posterior distribution around $g_0$ intuitively means that it contains more and more accurate information about the true structural function as the sample size increases. Hence, as for proper posterior distributions, it is of fundamental importance to study rates of contraction of quasi-posterior distributions. Here we say that the quasi-posterior contracts around $g_0$ at rate $\varepsilon_n$ if $\Pi_n(\{g : \|g - g_0\|_2 \geq \varepsilon_n\} \mid \mathrm{data}) \to 0$ in probability.

This quasi-posterior corresponds to what [58] called the “Gibbs
algorithm” and what [35] called the “Gibbs posterior.”
The framework of the quasi-posterior (Gibbs posterior) allows us
flexibility, since a stringent distributional assumption, such as
normality, on the data generating process is not required. Such a
framework extends the Bayesian approach to a broad range of statistical
problems.^{4}

###### Proposition 1

Let $\lambda > 0$ be a fixed constant. Let $\Pi$ be a prior distribution for $g$ defined on, say, the Borel $\sigma$-field of $C[0,1]$. Suppose that the data are fixed and that the maps $g \mapsto \widehat{m}(W_i, g)$ are measurable with respect to the Borel $\sigma$-field of $C[0,1]$. Then, the distribution

$\rho^*(dg) \propto \exp\Big( -\frac{\lambda}{2} \sum_{i=1}^n \widehat{m}^2(W_i, g) \Big) \, \Pi(dg)$

minimizes the empirical information complexity defined by

(5)   $\rho \mapsto \frac{\lambda}{2} \int \sum_{i=1}^n \widehat{m}^2(W_i, g) \, \rho(dg) + K(\rho, \Pi)$

over all distributions $\rho$ absolutely continuous with respect to $\Pi$. Here

$K(\rho, \Pi) = \int \log\frac{d\rho}{d\Pi} \, d\rho$

is the Kullback–Leibler divergence from $\rho$ to $\Pi$.

Immediate from [57], Proposition 5.1.

The proposition shows that, given the data and a prior $\Pi$ on $g$, the quasi-posterior defined in (4) is obtained as the minimizer of the empirical information complexity defined by (5) with $\lambda = 1$. This gives a rationale for using (4) as a quasi-posterior since, among all possible “quasi-posteriors,” it optimally balances the average of the natural loss function and its complexity (or deviation) relative to the initial prior distribution, measured by the Kullback–Leibler divergence. The scaling constant (“temperature”) $\lambda$ is typically treated as a fixed constant (see, e.g., [58, 35]). An alternative is to choose $\lambda$ in a data-dependent manner, for example by cross validation, as mentioned in [58]. It is not difficult to see that the theory below can be extended to the case where $\lambda$ is even random, as long as it converges in probability to a fixed positive constant. However, for the sake of simplicity, we take $\lambda = 1$ as a benchmark choice (note that as long as $\lambda$ is a fixed positive constant, the analysis can be reduced to the case $\lambda = 1$ by renormalization).
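The variational characterization in Proposition 1 can be checked numerically in a finite toy setting. The sketch below (all numbers hypothetical) takes a discrete parameter with per-point losses standing in for the negative log quasi-likelihood, forms the Gibbs posterior, and verifies that no other distribution achieves a smaller information complexity:

```python
import numpy as np

rng = np.random.default_rng(1)

# Discrete toy setting: the parameter takes J values, with losses ell_j
# (standing in for -log quasi-likelihood) and prior pi_j.
J = 5
ell = rng.uniform(0.0, 3.0, size=J)      # per-parameter empirical loss
pi = rng.dirichlet(np.ones(J))           # prior weights
lam = 1.0                                # temperature

def complexity(rho):
    """lam * E_rho[loss] + KL(rho || pi)."""
    kl = np.sum(rho * np.log(rho / pi))
    return lam * np.sum(rho * ell) + kl

# Gibbs posterior: rho*_j proportional to pi_j * exp(-lam * ell_j).
w = pi * np.exp(-lam * ell)
rho_star = w / w.sum()

# Any other distribution has (weakly) larger complexity.
for _ in range(1000):
    rho = rng.dirichlet(np.ones(J))
    assert complexity(rho) >= complexity(rho_star) - 1e-12
```

The loop passes because the complexity of any $\rho$ equals the complexity of the Gibbs posterior plus $K(\rho, \rho^*) \geq 0$, which is the content of the proposition.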

The quasi-posterior distribution provides point estimators of the structural function. The most natural one would be the estimator defined by the posterior expectation (the expectation with respect to the quasi-posterior distribution), that is,

(6)   $\widehat{g}(x) = \int g(x) \, \Pi_n(dg \mid \mathrm{data}),$

where the integral is understood pointwise.

Quasi-Bayesian approaches (not necessarily in the present form) are widely used, and there have been several other attempts at giving a probabilistic interpretation to such approaches. See, for example, [37], where the “limited information likelihood” is derived as the “best” (in a suitable sense) approximation to the true likelihood function under a set of moment restrictions and a Bayesian analysis with the limited information likelihood is developed ([44] adapted this approach to conditional moment restriction models), and [47], where a version of the empirical likelihood is interpreted in a Bayesian framework.

## 3 Main results

In this section, we study the asymptotic properties of the quasi-posterior distribution and the quasi-Bayes estimator. In doing so, we have to specify certain regularity properties, such as the smoothness of the structural function and the degree of ill-posedness of the problem. How to characterize the “smoothness” of the structural function is important, since it is related to how priors are put. For this purpose, we find wavelet theory useful, and use sieve spaces constructed from wavelet bases.

### 3.1 Posterior construction

To construct quasi-posterior distributions, we have to estimate the conditional moment function and construct a sequence of sieve spaces on which priors concentrate. For the former purpose, we use a (wavelet) series estimator, as in [2] and [10]. For the latter purpose, we construct a sequence of sieve spaces formed by the wavelet basis.

We begin by stating the parameter space and the wavelet basis used. We assume that the parameter space is either a Hölder–Zygmund space or a Sobolev space, realized as a Besov space of functions on $[0,1]$ (the Besov parameter generally corresponds to “smoothness”). See Appendix A.2 in the supplemental file [36] for the definition of Besov spaces. We assume that the smoothness parameter is large enough that the parameter space is embedded in $C[0,1]$.

Fix a (sufficiently large) resolution level, and let an $S$-regular Cohen–Daubechies–Vial (CDV) wavelet basis for $L^2[0,1]$ be given [14], where $S$ is a positive integer larger than the smoothness level. See Appendix A.1 in the supplemental file [36] for CDV wavelet bases. For notational convenience, we index the basis functions linearly. Here and in what follows:

Take and fix an $S$-regular CDV wavelet basis of $L^2[0,1]$, |

and we keep this convention. Let $V_K$ be the linear subspace of $L^2[0,1]$ spanned by the first $K$ basis functions, and denote by $P_K$ the projection operator onto $V_K$. In what follows, vectors formed from the first $K$ basis functions or coefficients are understood to be of dimension $K$.

(Approximation property.) For either the Hölder–Zygmund or the Sobolev case with smoothness $p$, we have $\|g - P_K g\|_2 \leq C K^{-p}$ for all $g$ in the parameter space. Here the constant $C$ depends only on $p$ and the corresponding Besov norm of $g$.

The use of CDV wavelet bases is not crucial, and one may use other reasonable bases such as the Fourier or Hermite polynomial bases. The theory below can be extended to such bases with some modifications. However, CDV wavelet bases are particularly well suited to approximating (not necessarily periodic) smooth functions, which is why we use them here. The Fourier basis, on the other hand, is only appropriate for approximating periodic functions, and it is often not natural to assume that the structural function is periodic.

We shall now move to the posterior construction. For , define the -dimensional vector of functions by

Let be a sequence of positive integers such that and . Then a wavelet series estimator of is defined as

where we replace the inverse matrix by the generalized inverse if the former does not exist; the probability of such an event converges to zero as under the assumptions below. We use this wavelet series estimator throughout the analysis.
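In short, the series estimator regresses the residual $Y - g(X)$ on the first $K$ basis functions of the instrument by least squares. A minimal sketch follows; it uses a cosine basis on a logistic transform of the instrument as a stand-in for the CDV wavelet basis, and a hypothetical data generating process, so only the least-squares structure mirrors the text:

```python
import numpy as np

def basis(w, K):
    """Cosine basis evaluated at points w mapped into (0, 1) via a logistic
    transform (a stand-in for a wavelet basis on the instrument space)."""
    u = 1.0 / (1.0 + np.exp(-w))
    return np.cos(np.pi * np.outer(u, np.arange(K)))   # n x K design matrix

def mhat(W, X, Y, g, K):
    """Series estimate of m(w, g) = E[Y - g(X) | W = w], as a function of w."""
    Psi = basis(W, K)
    resid = Y - g(X)
    coef, *_ = np.linalg.lstsq(Psi, resid, rcond=None)  # least squares fit
    return lambda w: basis(np.atleast_1d(w), K) @ coef

# Smoke example with simulated data (hypothetical g0; error independent of W).
rng = np.random.default_rng(2)
n = 5000
W = rng.standard_normal(n)
X = 1.0 / (1.0 + np.exp(-(W + rng.standard_normal(n))))
g0 = lambda x: np.sin(2 * np.pi * x)
Y = g0(X) + 0.5 * rng.standard_normal(n)

m = mhat(W, X, Y, g0, K=8)
vals = m(np.linspace(-2.0, 2.0, 9))   # at the truth, m(., g0) should be near zero
```

Evaluating the estimate at $g = g_0$ returns values close to zero, reflecting the conditional moment restriction (3); evaluating it at other candidate functions produces the residual fit that enters the quasi-likelihood.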

For the same $K$, we shall take $V_K$ as a sieve space for the structural function. We consider priors that concentrate on this sieve. Formally, priors on the sieve are understood to be defined on its Borel $\sigma$-field (hence the quasi-posterior is understood to be defined on the same $\sigma$-field, which is possible since the quasi-likelihood map is continuous there). Since the map from coefficient vectors to functions in the sieve is a homeomorphism, putting priors on the sieve is equivalent to putting priors on the coefficients (the latter priors are of course defined on the Borel $\sigma$-field of $\mathbb{R}^K$). Practically, priors on the sieve are induced from priors on the coefficients. For later purposes, it is useful to fix the correspondence between priors for these two parameterizations. Unless otherwise stated, we follow the convention that a prior on the coefficients is called the *generating prior*, and the prior it induces on the sieve is called the *induced prior*.

Correspondingly, the quasi-posterior for the coefficient vector is defined. With a slight abuse of notation, for a coefficient vector $b$, we write $\widehat{m}(w, b)$ for $\widehat{m}(w, g_b)$, where $g_b$ is the sieve function with coefficients $b$, and take $\exp( -\frac{1}{2}\sum_{i=1}^n \widehat{m}^2(W_i, b))$ as a quasi-likelihood for $b$. Note that in this particular setting, the log quasi-likelihood is quadratic in $b$. Let the quasi-posterior distribution for $b$ be denoted as follows:

(7)   $\pi_n(db \mid \mathrm{data}) \propto \exp\Big( -\frac{1}{2}\sum_{i=1}^n \widehat{m}^2(W_i, b) \Big)\,\pi(b)\,db.$

For the quasi-Bayes estimator defined by (6), since for every $x$ the evaluation map $g \mapsto g(x)$ is continuous on the sieve, and conditional on the data the quasi-posterior is a Borel probability measure on the sieve, the integral in (6) exists as soon as the quasi-posterior expectation of the coefficients exists. Furthermore, $\widehat{g}$ can be computed by using the relation

$\widehat{g}(x) = \psi^K(x)' \int b \, \pi_n(db \mid \mathrm{data}),$

as soon as the integral on the right-hand side exists. Hence, practically, it is sufficient to compute the expectation of the coefficients under the quasi-posterior.
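Because the log quasi-likelihood is quadratic in the coefficients, a Gaussian generating prior makes the quasi-posterior Gaussian, so the coefficient expectation has a closed form. The sketch below illustrates this with cosine bases and a hypothetical data generating process standing in for the wavelet construction in the text (prior scale, bases and design are all illustrative choices, not the paper's):

```python
import numpy as np

def cos_basis(u, K):
    # Cosine series on (0, 1); a stand-in for the wavelet basis.
    return np.cos(np.pi * np.outer(u, np.arange(K)))

def quasi_bayes_mean(W, X, Y, K, prior_sd=10.0):
    """Closed-form quasi-posterior mean of the sieve coefficients under a
    N(0, prior_sd^2 I) generating prior."""
    Phi = cos_basis(X, K)                        # basis for g, X in (0, 1)
    Psi = cos_basis(1.0 / (1.0 + np.exp(-W)), K) # basis for the instrument
    Q, _ = np.linalg.qr(Psi)                     # orthonormal basis of span(Psi)
    B = Q @ (Q.T @ Phi)                          # P_Psi Phi
    z = Q @ (Q.T @ Y)                            # P_Psi Y
    # log quasi-likelihood = -0.5 * ||z - B b||^2, quadratic in b, so the
    # quasi-posterior is N(A^{-1} B'z, A^{-1}) with A as below.
    A = np.eye(K) / prior_sd**2 + B.T @ B
    bhat = np.linalg.solve(A, B.T @ z)           # quasi-posterior mean of b
    return lambda x: cos_basis(np.atleast_1d(x), K) @ bhat

# Hypothetical endogenous design with a reasonably strong instrument.
rng = np.random.default_rng(3)
n = 20000
W = rng.standard_normal(n)
V = rng.standard_normal(n)
X = 1.0 / (1.0 + np.exp(-(1.5 * W + 0.5 * V)))
U = 0.7 * V + 0.5 * rng.standard_normal(n)
g0 = lambda x: np.sin(2 * np.pi * x)
Y = g0(X) + U

ghat = quasi_bayes_mean(W, X, Y, K=6)
grid = np.linspace(0.05, 0.95, 19)
err = float(np.sqrt(np.mean((ghat(grid) - g0(grid)) ** 2)))
```

In this conjugate special case the posterior expectation coincides with a ridge-regularized sieve minimum distance fit; for non-Gaussian generating priors one would instead compute the coefficient expectation by MCMC.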

The use of the same wavelet basis to estimate the conditional moment function and to construct the sieve spaces for the structural function is not essential and can be relaxed. Suppose that we have another CDV wavelet basis and use it to estimate the conditional moment function. Then all the results below apply with only notational changes. To keep the notation simple, we use the same wavelet basis.

However, the use of the same resolution level for the two purposes is essential (at least at the proof level) in establishing the asymptotic properties of the quasi-posterior distribution. This may be a technical artifact, but we do not extend the theory in this direction, since there is no clear theoretical benefit in doing so (note that in the purely frequentist estimation case, [10] allowed different cut-off levels for approximating the structural function and the conditional moment function).

### 3.2 Basic assumptions

We state some basic assumptions. We do not state here the assumptions on priors, which will be stated in the theorems below. In what follows, let $C$ be a sufficiently large constant.

**Assumption.** (i) $(X, W)$ has a joint density satisfying the stated bound. (ii) The stated conditional moment bound holds. (iii) The stated summability condition on the basis holds.

Assumption 3.2 is a usual restriction in the literature, up to minor differences (see [28, 32]). Denote by $f_X$ and $f_W$ the marginal densities of $X$ and $W$, respectively. Then Assumption 3.2(i) implies corresponding bounds on the marginal densities. A primitive regularity condition on the density guarantees Assumption 3.2(iii). To see this, we have

(8)

where we have used the fact that is orthonormal in .

For identification of the structural function, we assume:

**Assumption.** The linear operator in (2) is injective.

For smoothness of the structural function, as mentioned before, we assume:

**Assumption.** The structural function belongs to the parameter space, which is either the Hölder–Zygmund or the Sobolev space defined in Section 3.1.

The identification condition (Assumption 3.2) is equivalent to the “completeness” of the conditional distribution of $X$ given $W$ [45]. We refer the reader to [49, 17] and [34] for discussions of the completeness condition. We should note that restricting the domain of the operator to a “small” set, such as a Sobolev ball, would substantially relax Assumption 3.2, which however requires a different analysis. For the sake of simplicity, we assume injectivity of the operator on the full domain.

As discussed in the Introduction, solving (2) is an ill-posed inverse problem. Thus, the statistical difficulty of estimating the structural function depends on the difficulty of continuously inverting the operator $T$, which is usually referred to as the “ill-posedness” of the inverse problem (2). Typically, the ill-posedness is characterized by the decay rate of the singular values of $T$, which would be plausible if $T$ were known and its singular value decomposition were used (see [9]). However, here $T$ is unknown, and the known wavelet basis is used instead of the singular value system. Thus, it is suitable to quantify the ill-posedness using the wavelet basis. To this end, define

$s_K = \inf_{g \in V_K,\, g \neq 0} \frac{\|Tg\|}{\|g\|_2}.$

This quantity corresponds to (the reciprocal of) what is called the “sieve measure of ill-posedness” in the literature [5, 32]. We at least have to assume that $s_K > 0$ for all $K$. Note however that

by which, necessarily, this quantity tends to zero as $K \to \infty$. For this quantity, we assume:

**Assumption.** (i) (Mildly ill-posed case) the stated polynomial lower bound holds, or (severely ill-posed case) the stated exponential lower bound holds;

(ii)

Assumption 3.2(i) lower bounds the sieve measure of ill-posedness as $K \to \infty$, and thereby quantifies the ill-posedness. We cover both the “mildly ill-posed” and “severely ill-posed” cases (this terminology is due to [31, 32]). The severely ill-posed case arises, for example, when the joint density is analytic (see [43], Theorem 15.20).

Assumption 3.2(ii) is a “stability” condition on the bias, which states that the bias term is sufficiently “small” relative to the leading term. Note that in the (ideal) case in which, for example, the operator is self-adjoint and the wavelet basis is its eigen-basis, Assumption 3.2(ii) is trivially satisfied. Assumption 3.2(ii) allows more general situations, in which the operator may not be self-adjoint and the wavelet basis may not be its eigen-basis, by allowing for a certain “slack.” This assumption, although it looks technical, is common in the study of rates of convergence for estimation of the structural function. Indeed, essentially similar conditions have appeared in the past literature, such as [5, 11, 32]. For example, [5], Assumption 6, essentially states a condition that implies our Assumption 3.2(ii) by Bessel’s inequality.
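To make the sieve measure of ill-posedness concrete, the following Monte Carlo sketch approximates the cross-moment matrix of the first $K$ basis functions of the instrument and of the regressor, and tracks its minimum singular value as $K$ grows. All design choices (cosine bases, logistic transforms, the data generating process) are illustrative stand-ins, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000                                           # Monte Carlo sample size

W = rng.standard_normal(n)
X = 1.0 / (1.0 + np.exp(-(W + rng.standard_normal(n))))  # smooth dependence on W
Uw = 1.0 / (1.0 + np.exp(-W))                            # instrument mapped to (0, 1)

def cos_basis(u, K):
    # Non-constant cosine basis functions, frequencies 1..K.
    return np.cos(np.pi * np.outer(u, np.arange(1, K + 1)))

tau = {}
for K in (2, 4, 8, 12):
    # Monte Carlo approximation of the K x K matrix E[psi_K(W) phi_K(X)'].
    A = cos_basis(Uw, K).T @ cos_basis(X, K) / n
    tau[K] = np.linalg.svd(A, compute_uv=False).min()
```

The minimum singular value shrinks rapidly as $K$ grows, because high-frequency components of the regressor are only weakly correlated with functions of the instrument; this decay is exactly what the mildly/severely ill-posed dichotomy classifies.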

For given values of the smoothness and ill-posedness parameters, let $\mathcal{P}$ denote the set of all distributions of $(Y, X, W)$ satisfying Assumptions 3.2–3.2. It is shown in [28, 11] that the minimax rate of convergence (in $L^2$) for estimation of the structural function over this distribution class is polynomial in $n$ in the mildly ill-posed case and logarithmic in $n$ in the severely ill-posed case as the sample size grows (the assumption on the conditional second moment of the error given the instrument is not binding; that is, replacing Assumption 3.2(ii) by a stronger one, with a bound determined outside the class of distributions, does not alter these minimax rates).

By Theorem 2.5 of [24], it is readily seen that these rates are the fastest possible rates of contraction of (general) quasi-posterior distributions in this setting. More formally, we can state the following assertion:

*Let the quasi-posterior distribution be defined on, say, the Borel
$\sigma$-field of $L^2[0,1]$, constructed
from putting a suitable prior on the parameter space to the quasi-likelihood* (the prior here need not be a sieve prior).
*Suppose now that the quasi-posterior contracts at some rate uniformly over the class $\mathcal{P}$.
Then there exists a point estimator that converges (in probability)
at least as fast as that rate uniformly in $\mathcal{P}$.*

The proof is just a small modification of that of Theorem 2.5 in [24] and hence omitted. Importantly, the quasi-posterior cannot contract at a rate faster than the optimal rate of convergence for point estimators ([24], page 507, lines 19–20). Hence, in the minimax sense, the fastest possible rate of contraction of the quasi-posterior distribution is the minimax rate in each of the mildly and severely ill-posed cases (Proposition 2 in Section 4 ahead shows that these rates are indeed attainable for suitable sieve priors).

### 3.3 Main results: General theorems

This section presents general theorems on contraction rates and asymptotic normality for the quasi-posterior distribution, as well as convergence rates for the quasi-Bayes estimator. In what follows, let the data be i.i.d. observations of $(Y, X, W)$. Denote by $b_0$ the vector of the first $K$ generalized Fourier coefficients of the structural function. Let $\|\cdot\|_{\mathrm{TV}}$ denote the total variation distance between two distributions.

###### Theorem 1

Suppose that Assumptions 3.2–3.2 are satisfied. Take $K$ in such a way that the stated rate conditions hold. Let a sequence of positive constants with the stated properties be given. Suppose that the generating priors have densities and satisfy the following conditions:

(P1) (Small ball condition.) There exists a constant such that, for all sufficiently large $n$, the stated small ball bound holds.

(P2) (Prior flatness condition.) There exists a sequence of constants growing sufficiently slowly such that, for all sufficiently large $n$, the prior density is positive on the relevant set and the stated flatness bound holds.

Then, for every sequence as stated, we have

(9)

Furthermore, under the stated additional condition, we have the asserted total variation approximation, where the centering is at a “maximum quasi-likelihood estimator,” that is,

(10)

See Section 5.1.

The condition appears essentially because the operator is unknown. In our setup, this results in estimating the matrix by its empirical counterpart . In the proof, we have to suitably lower bound the minimum singular value of , denoted by , which is an empirical counterpart of the sieve measure of ill-posedness . By Lemma 1, we have , so that to make the estimation effect in negligible, we need .