Confidence Intervals for Projections of Partially Identified Parameters

We are grateful to Elie Tamer and three anonymous reviewers for very useful suggestions that substantially improved the paper. We thank for their comments Ivan Canay and seminar and conference participants at Bonn, BC/BU joint workshop, Brown, Cambridge, Chicago, Columbia, Cornell, CREST, Kobe, Maryland, Michigan, Michigan State, NYU, Penn State, Royal Holloway, Syracuse, Toronto, UCLA, UCSD, UPenn, Vanderbilt, Vienna, Yale, Wisconsin, CEME, ES-NAWM 2015, Frontiers of Theoretical Econometrics Conference, ES-World Congress 2015, ES-Asia Meeting 2016, KEA-KAEA International Conference, Verein für Socialpolitik Ausschuss für Ökonometrie, and ES-ESM 2017. We are grateful to Zhonghao Fu, Debi Mohapatra, Sida Peng, Talal Rahim, Matthew Thirkettle, and Yi Zhang for excellent research assistance. A MATLAB package implementing the method proposed in this paper, Kaido, Molinari, Stoye, and Thirkettle (2017), is available at https://molinari.economics.cornell.edu/programs/KMSportable_V3.zip. We are especially grateful to Matthew Thirkettle for his contributions to this package. Finally, we gratefully acknowledge financial support through NSF grants SES-1230071 (Kaido), SES-0922330 (Molinari), and SES-1260980 (Stoye).
We propose a bootstrap-based calibrated projection procedure to build confidence intervals for single components and for smooth functions of a partially identified parameter vector in moment (in)equality models. The method controls asymptotic coverage uniformly over a large class of data generating processes.
The extreme points of the calibrated projection confidence interval are obtained by extremizing the value of the component (or function) of interest subject to a proper relaxation of studentized sample analogs of the moment (in)equality conditions. The degree of relaxation, or critical level, is calibrated so that the component (or function) of $\theta$, not $\theta$ itself, is uniformly asymptotically covered with prespecified probability. This calibration is based on repeatedly checking feasibility of linear programming problems, rendering it computationally attractive.
Nonetheless, the program defining an extreme point of the confidence interval is generally nonlinear and potentially intricate. We provide an algorithm, based on the response surface method for global optimization, that approximates the solution rapidly and accurately. The algorithm is of independent interest for inference on optimal values of stochastic nonlinear programs. We establish its convergence under conditions satisfied by canonical examples in the moment (in)equalities literature.
Our assumptions and those used in the leading alternative approach (a profiling-based method) are not nested. An extensive Monte Carlo analysis confirms the accuracy of the solution algorithm and the good statistical as well as computational performance of calibrated projection, including in comparison to other methods.
Keywords: Partial identification; Inference on projections; Moment inequalities; Uniform inference.
This paper provides theoretically and computationally attractive confidence intervals for projections and smooth functions of a parameter vector $\theta \in \Theta \subset \mathbb{R}^d$ that is partially or point identified through a finite number of moment (in)equalities. The values of $\theta$ that satisfy these (in)equalities constitute the identification region $\Theta_I(P)$.
Until recently, the rich literature on inference in this class of models focused on confidence sets for the entire vector $\theta$, usually obtained by test inversion as
$$CS_n \equiv \big\{\theta \in \Theta : T_n(\theta) \le \hat c_n(\theta)\big\},$$
where $T_n(\theta)$ is a test statistic that aggregates violations of the sample analog of the moment (in)equalities, and $\hat c_n(\theta)$ is a critical value that controls asymptotic coverage, often uniformly over a large class of data generating processes (DGPs). In point identified moment equality models, this would be akin to building confidence ellipsoids for $\theta$ by inversion of the test statistic proposed by Anderson and Rubin (1949).
However, applied researchers are frequently primarily interested in a specific component (or function) of $\theta$, e.g., the returns to education. Even if not, they may simply want to report separate confidence intervals for components of a vector, as is standard practice in other contexts. Thus, consider the projection $p'\theta$, where $p$ is a known unit vector. To date, it has been common to report as confidence interval for the projection of $\theta$:
$$\Big[\inf_{\theta \in CS_n} p'\theta,\ \sup_{\theta \in CS_n} p'\theta\Big],$$
where $n$ denotes sample size; see for example Ciliberto and Tamer (2009), Grieco (2014) and Dickstein and Morales (2016). Such projection is asymptotically valid, but typically yields conservative and therefore needlessly large confidence intervals. The potential severity of this effect is easily appreciated in a point identified example. Given a $\sqrt{n}$-consistent estimator $\hat\theta_n \in \mathbb{R}^d$ with limiting covariance matrix equal to the identity matrix, a 95% confidence interval for $\theta_1$ is obtained as $[\hat\theta_{n,1} - 1.96/\sqrt{n},\ \hat\theta_{n,1} + 1.96/\sqrt{n}]$. In contrast, if $d = 10$, then projection of a 95% Wald confidence ellipsoid yields $[\hat\theta_{n,1} - 4.28/\sqrt{n},\ \hat\theta_{n,1} + 4.28/\sqrt{n}]$, with true coverage of essentially $1$. We refer to this problem as projection conservatism.
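The arithmetic behind this example is easy to reproduce. The sketch below (our own illustration, not part of the paper's replication package) contrasts the per-component half-width of a direct 95% interval with that of a projected 95% Wald ellipsoid:

```python
import numpy as np
from scipy.stats import norm, chi2

def half_widths(d):
    """Half-widths (in units of 1/sqrt(n)) of a 95% CI for one component of
    a point identified d-dimensional parameter with identity limiting
    covariance: a direct interval uses the normal quantile, while projecting
    a 95% Wald ellipsoid uses the chi-square quantile with d degrees of
    freedom."""
    direct = norm.ppf(0.975)                    # ~1.96
    projected = np.sqrt(chi2.ppf(0.95, df=d))   # ~4.28 for d = 10
    return direct, projected

direct, projected = half_widths(d=10)
# True per-component coverage of the projected interval:
coverage = 1 - 2 * norm.sf(projected)           # essentially 1
```

For d = 10 the projected half-width is more than twice the direct one, which is the projection conservatism the text describes.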
Our first contribution is to provide a bootstrap-based calibrated projection method that largely anticipates and corrects for projection conservatism. For each candidate $\theta$, a critical level $\hat c_n(\theta)$ is calibrated so that across bootstrap repetitions the projection of $\theta$ is covered with at least some pre-specified probability. Computationally, this bootstrap is relatively attractive because we linearize all constraints around $\theta$, so that coverage of $p'\theta$ corresponds to the projection of a stochastic linear constraint set covering zero. We then propose the confidence interval
$$CI_n \equiv \big[\, -s_n(-p),\ s_n(p) \,\big],$$
with $s_n(p)$ the optimal value of a nonlinear program that maximizes $p'\theta$ subject to the relaxed studentized moment (in)equalities; see Section 2 for details.
We prove that $CI_n$ asymptotically covers $p'\theta$ with probability at least $1 - \alpha$ uniformly over a large class of DGPs and that it is weakly shorter than (1.2) for each $n$.[1] This comparison is based on projection of the confidence set of Andrews and Soares (2010) and holds the choice of tuning parameters and criterion function in (1.2) and (1.3) constant across methods. We also provide simple conditions under which it is asymptotically strictly shorter.
Our second contribution is a general method to accurately and rapidly compute projection-based confidence intervals. These can be our calibrated projection confidence intervals, but they can also correspond to projection of many other methods for inference on either $\theta$ or its identified set $\Theta_I(P)$. Examples include Chernozhukov, Hong, and Tamer (2007), Andrews and Soares (2010), or (for conditional moment inequalities) Andrews and Shi (2013). Projection-based inference extends well beyond its application in partial identification, hence our computational method proves useful more broadly. For example, Freyberger and Reeves (2017a, b, Section S.3) use it to construct uniform confidence bands for an unknown function of interest under (nonparametric) shape restrictions.
We propose an algorithm that is based on the response surface method and thus resembles an expected improvement algorithm (see e.g. Jones, 2001; Jones, Schonlau, and Welch, 1998, and references therein). Bull (2011) established convergence of the expected improvement algorithm for unconstrained optimization problems where the objective is a "black box" function. Building on his results, we show convergence of our algorithm for constrained optimization problems in which the constraint functions are "black box" functions, assuming that they are sufficiently smooth. We then verify this smoothness condition for canonical examples in the moment (in)equalities literature. Our extensive Monte Carlo experiments confirm that the algorithm is fast and accurate.[2] Freyberger and Reeves (2017b, Section S.3) similarly find our method to be accurate and to considerably reduce computational time.
Previous implementations of projection-based inference were based on approximating the set $CS_n$ by searching for vectors $\theta \in \Theta$ such that $T_n(\theta) \le \hat c_n(\theta)$ (using, e.g., grid-search or simulated annealing with no cooling), and reporting the smallest and largest value of $p'\theta$ among parameter values that were "guessed and verified" to belong to $CS_n$. This becomes computationally cumbersome as $d$ increases because it typically requires a number of evaluation points that grows exponentially with $d$. In contrast, our method typically requires a number of evaluation points that grows linearly with $d$.
The main alternative inference procedure for projections was introduced in Romano and Shaikh (2008) and significantly advanced in Bugni, Canay, and Shi (2017, BCS henceforth). It is based on profiling a test statistic over the components of $\theta$ not under test. The classes of DGPs for which our procedure and the profiling-based method of BCS (BCS-profiling henceforth) can be shown to be uniformly valid are non-nested. We show that in well behaved cases, calibrated projection and BCS-profiling are asymptotically equivalent. We also provide conditions under which calibrated projection has lower probability of false coverage, thereby establishing that the two methods' power properties are non-ranked. Computationally, calibrated projection has the advantage that the bootstrap iterates over linear, as opposed to nonlinear, programming problems. While the "outer" optimization problems in (1.3) are potentially intricate, our algorithm is geared toward them. Our Monte Carlo simulations suggest that these two factors give calibrated projection a considerable computational edge over BCS-profiling, with an average speed gain of about 78 times.
In an influential paper, Pakes, Porter, Ho, and Ishii (2011) also use linearization but, subject to this approximation, directly bootstrap the sample projection.[3] The published version, i.e. Pakes, Porter, Ho, and Ishii (2015), does not contain the inference part. This is valid only under stringent conditions, and we show that calibrated projection can be much simplified under those conditions. Other related papers that explicitly consider inference on projections include Andrews, Berry, and Jia (2004), Beresteanu and Molinari (2008), Bontemps, Magnac, and Maurin (2012), Chen, Tamer, and Torgovitsky (2011), Kaido (2016), Kitagawa (2012), Kline and Tamer (2015), and Wan (2013). However, some are Bayesian, as opposed to our frequentist approach, and none of them establish uniform validity of confidence sets. Chen, Christensen, and Tamer (2017) establish uniform validity of MCMC-based confidence intervals for projections, but these are aimed at covering the projection of the entire set $\Theta_I(P)$, whereas we aim at covering the projection of each $\theta \in \Theta_I(P)$. Finally, Gafarov, Meier, and Montiel-Olea (2016) have used our insight in the context of set identified spatial VARs.
Structure of the paper. Section 2 sets up notation and describes our approach in detail. Section 3 describes computational implementation of the method and establishes convergence of our proposed algorithm. Section 4 lays out our assumptions and, under these assumptions, establishes uniform validity of calibrated projection for inference on projections and smooth functions of $\theta$. It also shows that more stringent conditions allow for several simplifications to the method, including that it can suffice to evaluate $\hat c_n(\theta)$ at only two values of $\theta$ and that one can dispense with a tuning parameter. The section closes with a formal comparison of calibrated projection and BCS-profiling. Section 5 reports the results of Monte Carlo simulations. Section 6 draws conclusions. The proof of convergence of our algorithm is in Appendix A. All other proofs, background material for our algorithm, and additional results are in the Online Appendix.[4] Section B provides convergence-related results and background material for our algorithm and describes how to compute $\hat c_n(\theta)$. Section C verifies, for a number of canonical moment (in)equality models, the assumptions that we invoke to show validity of our inference procedure and for our algorithm. Section D contains proofs of the Theorems in this paper's Section 4. Section E collects Lemmas supporting the preceding proofs. Section F provides further comparisons with the profiling method of Bugni, Canay, and Shi (2017), including an example where calibrated projection has higher power in finite sample. Section G provides comparisons with "uncalibrated" projection of the confidence region in Andrews and Soares (2010), including simple conditions under which $CI_n$ is asymptotically strictly shorter than that projection.
2 Detailed Explanation of the Method
Let $X$ be a random vector with distribution $P$, let $\Theta \subset \mathbb{R}^d$ denote the parameter space, and let $m_j : \mathcal{X} \times \Theta \to \mathbb{R}$ for $j = 1, \dots, J_1 + J_2$ denote measurable functions characterizing the model, known up to parameter vector $\theta$. The true parameter value is assumed to satisfy the moment inequality and equality restrictions
$$E_P[m_j(X_i, \theta)] \le 0,\ \ j = 1, \dots, J_1, \qquad E_P[m_j(X_i, \theta)] = 0,\ \ j = J_1 + 1, \dots, J_1 + J_2,$$
with sample analogs given by the sample moments $\bar m_{n,j}(\theta) \equiv n^{-1}\sum_{i=1}^n m_j(X_i, \theta)$ and the analog estimators $\hat\sigma_{n,j}(\theta)$ of the population moment functions' standard deviations $\sigma_{P,j}(\theta)$.[5] Under Assumption 4.3-(b), in equation (2.5) instead of $\hat\sigma_{n,j}$ we use the estimator specified in (E.188) in Lemma E.10 of the Online Appendix (with the relevant quantities defined in the assumption). In equation (3.2) we use $\hat\sigma_{n,j}$ for all $j$. To ease notation, we distinguish the two only where needed.
The confidence interval in (1.3) then becomes $CI_n = [-s_n(-p),\ s_n(p)]$, where
$$s_n(p) \equiv \sup_{\theta \in \Theta}\ p'\theta \quad \text{s.t.} \quad \frac{\sqrt{n}\,\bar m_{n,j}(\theta)}{\hat\sigma_{n,j}(\theta)} \le \hat c_n(\theta),\ \ j = 1, \dots, J,$$
and similarly for $s_n(-p)$. Here, we define $J \equiv J_1 + 2J_2$ moments, where $m_{J_1+J_2+k}(X_i, \theta) = -m_{J_1+k}(X_i, \theta)$ for $k = 1, \dots, J_2$. That is, we split moment equality constraints into two opposing inequality constraints and relax them separately.[6] For a simple analogy, consider a point identified model defined by a single moment equality, where $\theta$ is a scalar. In this case, the equality is replaced by the pair of opposing inequalities, each of which is then relaxed; the upper endpoint of the confidence interval is the optimal value of the program that maximizes $\theta$ subject to these relaxed studentized constraints, and similarly for the lower endpoint.
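The splitting of equalities into opposing inequalities can be sketched in a few lines (illustrative code with hypothetical array inputs, not the authors' package):

```python
import numpy as np

def split_equalities(m_ineq, m_eq):
    """Map J1 inequality moments and J2 equality moments into J = J1 + 2*J2
    inequality moments: each equality E_P[m_j] = 0 becomes the pair
    E_P[m_j] <= 0 and E_P[-m_j] <= 0, which can then be relaxed separately."""
    m_ineq = np.atleast_1d(np.asarray(m_ineq, dtype=float))
    m_eq = np.atleast_1d(np.asarray(m_eq, dtype=float))
    return np.concatenate([m_ineq, m_eq, -m_eq])
```

For example, one inequality moment and two equality moments yield a stacked vector of five inequality moments.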
For a class of DGPs $\mathcal{P}$ that we specify below, define the asymptotic size of $CI_n$ by
$$\liminf_{n \to \infty}\ \inf_{P \in \mathcal{P}}\ \inf_{\theta \in \Theta_I(P)} P\big(p'\theta \in CI_n\big). \tag{2.6}$$
Our goal is to calibrate $\hat c_n$ so that (2.6) is at least equal to a prespecified level $1 - \alpha$ while anticipating projection conservatism. To build intuition, fix $(\theta, P)$ s.t. $\theta \in \Theta_I(P)$. The projection of $\theta$ is covered when
$$p'\theta \in CI_n \;\Longleftrightarrow\; -s_n(-p) \le p'\theta \le s_n(p) \;\Longleftrightarrow\; \min_{\lambda \in \Lambda_n(\theta)} p'\lambda \ \le\ 0\ \le \max_{\lambda \in \Lambda_n(\theta)} p'\lambda, \tag{2.7}$$
where $\Lambda_n(\theta) \equiv \big\{\lambda : \theta + \lambda/\sqrt{n} \in \Theta,\ \sqrt{n}\,\bar m_{n,j}(\theta + \lambda/\sqrt{n})/\hat\sigma_{n,j}(\theta + \lambda/\sqrt{n}) \le \hat c_n(\theta + \lambda/\sqrt{n})\ \forall j\big\},$
and the second equivalence follows from substituting $\theta + \lambda/\sqrt{n}$ for the optimization variable and taking $\lambda$ to be the choice parameter. (Intuitively, we localize $\Theta$ around $\theta$ at rate $1/\sqrt{n}$.)
We control asymptotic size by finding $\hat c_n$ such that 0 asymptotically lies within the optimal values of the NLPs in (2.7) with probability $1 - \alpha$. To reduce computational burden, we approximate the event in equation (2.7) through linear expansion in $\lambda$ of the constraint set. To each constraint $j$, we add and subtract the studentized population moment and apply the mean value theorem to obtain
$$\frac{\sqrt{n}\,\bar m_{n,j}(\theta + \lambda/\sqrt{n})}{\hat\sigma_{n,j}(\theta + \lambda/\sqrt{n})} \;\approx\; \mathbb{G}_{n,j}\big(\theta + \lambda/\sqrt{n}\big) \;+\; D_{P,j}(\bar\theta)\,\lambda \;+\; \sqrt{n}\,\gamma_{1,P,j}(\theta), \tag{2.8}$$
up to multiplication by a variance ratio that converges to one.
Here $\mathbb{G}_{n,j}(\cdot) \equiv \sqrt{n}\,(\bar m_{n,j}(\cdot) - E_P[m_j(X_i, \cdot)])/\sigma_{P,j}(\cdot)$ is a normalized empirical process indexed by $\theta$, $D_{P,j}(\cdot) \equiv \nabla_\theta\big\{E_P[m_j(X_i, \cdot)]/\sigma_{P,j}(\cdot)\big\}$ is the gradient of the normalized moment, $\gamma_{1,P,j}(\cdot) \equiv E_P[m_j(X_i, \cdot)]/\sigma_{P,j}(\cdot)$ is the studentized population moment, and the mean value $\bar\theta$ lies componentwise between $\theta$ and $\theta + \lambda/\sqrt{n}$.[7] The mean value changes with $j$ but we omit the dependence to ease notation.
Calibration of $\hat c_n$ requires careful analysis of the local behavior of the moment restrictions at each point in the identification region. This is because the extent of projection conservatism depends on (i) the asymptotic behavior of the sample moments entering the inequality restrictions, which can change discontinuously depending on whether they bind at $\theta$ ($\gamma_{1,P,j}(\theta) = 0$) or not, and (ii) the local geometry of the identification region at $\theta$, i.e. the shape of the constraint set formed by the moment restrictions, and its relation to the level set of the objective function $p'\theta$. Features (i) and (ii) can be quite different at different points in $\Theta_I(P)$, making uniform inference for the projection challenging. In particular, (ii) does not arise if one only considers inference for the entire parameter vector, and hence is a new challenge requiring new methods.[8] This is perhaps best expressed in the testing framework: Inference for projections entails a null hypothesis specifying the value of a single component (or a function) of $\theta$. The components of $\theta$ not under test become additional nuisance parameters, and dealing with them presents challenges that one does not face when testing hypotheses that specify the value of the entire vector $\theta$. This is where this paper's core theoretical innovation lies.
An important component of this innovation is to add to (2.7) the constraint that $\lambda \in \rho B^d$, where $B^d \equiv [-1, 1]^d$ and $\rho > 0$ is a tuning parameter. This is slightly conservative but regularizes the effect of the local geometry of $\Theta_I(P)$ at $\theta$ on the inference problem. See Section 4.3 for further discussion. We show that the probability of the event in (2.7), with $\lambda$ restricted to be in $\rho B^d$, is asymptotically approximated by the probability that $0$ lies between the optimal values of two programs that are linear in $\lambda$. The constraint sets of these programs are characterized by (i) a Gaussian process evaluated at $\theta$ (that we can approximate through a simple nonparametric bootstrap), (ii) a gradient (that we can uniformly consistently estimate[9] See Online Appendix C for proposal of such estimators in some canonical moment (in)equality examples. on compact sets), and (iii) the parameter $\gamma_{1,P,j}(\theta)$ that measures the extent to which each moment inequality is binding (that we can conservatively estimate using insights from Andrews and Soares (2010)). This suggests a computationally attractive bootstrap procedure based on linear programs.
3 Computing Calibrated Projection Confidence Intervals
3.1 Computing the Critical Level
For a given $\theta \in \Theta$, we calibrate $\hat c_n(\theta)$ through a bootstrap procedure that iterates over linear programs.[10] If $\Theta$ is defined through smooth convex (in)equalities, these can be linearized too. Define
$$\Lambda^b_n(\theta, \rho, c) \equiv \big\{\lambda \in \rho B^d : \mathbb{G}^b_{n,j}(\theta) + \hat D_{n,j}(\theta)\lambda + \varphi_j(\hat\xi_{n,j}(\theta)) \le c,\ \ j = 1, \dots, J\big\}, \tag{3.1}$$
where $\mathbb{G}^b_{n,j}$ is a bootstrap analog of $\mathbb{G}_{n,j}$,[11] Bugni, Canay, and Shi (2017) approximate the stochastic process $\mathbb{G}_n$ using a multiplier bootstrap with i.i.d. weights. This approximation is equally valid in our approach, and can be computationally faster as it avoids repeated evaluation of the moment functions across bootstrap replications. $\hat D_{n,j}(\theta)$ is a consistent estimator of $D_{P,j}(\theta)$, $\rho > 0$ is a constant chosen by the researcher (see Section 4.3), $B^d = [-1, 1]^d$, and $\hat\xi_{n,j}(\theta)$ is defined by
$$\hat\xi_{n,j}(\theta) \equiv \begin{cases} \kappa_n^{-1}\sqrt{n}\,\bar m_{n,j}(\theta)/\hat\sigma_{n,j}(\theta), & j = 1, \dots, J_1, \\ 0, & j = J_1 + 1, \dots, J, \end{cases} \tag{3.2}$$
where $\kappa_n$ is a user-specified thresholding sequence such that $\kappa_n \to \infty$, and $\varphi(\xi) = (\varphi_1(\xi_1), \dots, \varphi_J(\xi_J))$ is one of the generalized moment selection (GMS) functions proposed by Andrews and Soares (2010), applied component-wise. A common choice of $\varphi$ is given component-wise by
$$\varphi_j(\xi_j) = \begin{cases} 0, & \xi_j \ge -1, \\ -\infty, & \xi_j < -1. \end{cases} \tag{3.3}$$
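A minimal sketch of this hard-thresholding rule (our own illustration, not the authors' MATLAB package):

```python
import numpy as np

def gms_hard_threshold(xi):
    """'Hard thresholding' GMS function, applied component-wise: a moment
    whose scaled studentized value xi_j is at least -1 is treated as
    (possibly) binding, so phi_j = 0 leaves its constraint in place; a
    moment with xi_j < -1 looks slack, so phi_j = -inf effectively drops
    its constraint from the linear program."""
    xi = np.asarray(xi, dtype=float)
    return np.where(xi >= -1.0, 0.0, -np.inf)
```

In the bootstrap linear programs, a `-inf` entry makes the corresponding right-hand side `+inf`, so that row never binds.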
Restrictions on $\varphi$ and the rate at which $\kappa_n$ diverges are imposed in Assumption 4.2.
For concreteness, in (3.3) we write out the "hard thresholding" GMS function. As we establish below, our results apply to all but one of the GMS functions in Andrews and Soares (2010).[12] These are the GMS functions in Andrews and Soares (2010) that depend only on $\hat\xi_{n,j}(\theta)$. We do not consider the GMS function in Andrews and Soares (2010) which depends also on the covariance matrix of the moment functions.
Heuristically, the random convex polyhedral set $\Lambda^b_n(\theta, \rho, c)$ in (3.1) is a local (to $\theta$) linearized bootstrap approximation to the random constraint set in (2.7). To see this, note that the bootstrapped empirical process and the estimator of the gradient approximate the first two terms in the constraint in (2.7) as linearized in (2.8). Next, for $j = 1, \dots, J_1$, the GMS function conservatively approximates the local slackness parameter $\sqrt{n}\,\gamma_{1,P,j}(\theta)$. This is needed because the $\sqrt{n}$-scaling precludes consistent estimation of this parameter. The problem is resolved by shrinking estimated intercepts toward zero, thereby tightening constraints and hence increasing $\hat c_n(\theta)$. As with other uses of GMS, the resulting conservative distortion vanishes pointwise but not uniformly. Finally, restricting $\lambda$ to the "$\rho$-box" $\rho B^d$ has a strong regularizing effect: It ensures uniform validity in challenging situations, including several that are assumed away in most of the literature. We discuss this point in more detail in Section 4.3.
The critical level $\hat c_n(\theta)$ to be used in (1.3) is the smallest value of $c$ that makes the bootstrap probability of the event
$$\min_{\lambda \in \Lambda^b_n(\theta, \rho, c)} p'\lambda \ \le\ 0\ \le \max_{\lambda \in \Lambda^b_n(\theta, \rho, c)} p'\lambda$$
at least $1 - \alpha$. Because $\Lambda^b_n(\theta, \rho, c)$ is convex, we have
$$\Big\{\min_{\lambda \in \Lambda^b_n(\theta, \rho, c)} p'\lambda \le 0 \le \max_{\lambda \in \Lambda^b_n(\theta, \rho, c)} p'\lambda\Big\} = \big\{\Lambda^b_n(\theta, \rho, c) \cap \{p'\lambda = 0\} \ne \emptyset\big\},$$
so that we can equivalently define
$$\hat c_n(\theta) \equiv \inf\big\{c \in \mathbb{R}_+ : P^*\big(\Lambda^b_n(\theta, \rho, c) \cap \{p'\lambda = 0\} \ne \emptyset\big) \ge 1 - \alpha\big\},$$
where $P^*$ denotes the law of the random set $\Lambda^b_n(\theta, \rho, c)$ induced by the bootstrap sampling process, i.e. by the distribution of the bootstrap draws, conditional on the data. Importantly, the event in this definition can be assessed by repeatedly checking feasibility of a linear program.[13] We implement a feasibility program in $\mathbb{R}^d$ for simplicity but, because $\{p'\lambda = 0\}$ defines a linear subspace, one could reduce this to $\mathbb{R}^{d-1}$. We describe in detail in Online Appendix B.4 how we compute $\hat c_n(\theta)$ through a root-finding algorithm.
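The feasibility check and the root-finding step can be sketched as follows (illustrative Python with hypothetical inputs; the authors' package is in MATLAB and handles many details omitted here):

```python
import numpy as np
from scipy.optimize import linprog

def lambda_set_nonempty(G_b, D_hat, phi, p, rho, c):
    """Is there a lambda in the rho-box, with p'lambda = 0, satisfying
    G_b[j] + (D_hat @ lambda)[j] + phi[j] <= c for every moment j?
    Inputs are illustrative stand-ins: G_b (J,) is one bootstrap draw of
    the moment process at theta, D_hat (J, d) the gradient estimate,
    phi (J,) the GMS values (0 or -inf), p (d,) the direction."""
    J, d = D_hat.shape
    b_ub = c - G_b - phi              # rows with phi = -inf never bind ...
    keep = np.isfinite(b_ub)          # ... and are dropped here
    if not keep.any():
        return True                   # lambda = 0 is trivially feasible
    res = linprog(np.zeros(d),        # no objective: pure feasibility check
                  A_ub=D_hat[keep], b_ub=b_ub[keep],
                  A_eq=p.reshape(1, d), b_eq=[0.0],
                  bounds=[(-rho, rho)] * d, method="highs")
    return res.status == 0            # 0 = a feasible point was found

def c_hat(G_draws, D_hat, phi, p, rho, alpha, c_max=10.0, tol=1e-3):
    """Bisect for the smallest c at which the bootstrap frequency of the
    event reaches 1 - alpha; the frequency is nondecreasing in c, and
    c_max is an assumed upper bound on the critical level."""
    freq = lambda c: np.mean([lambda_set_nonempty(G, D_hat, phi, p, rho, c)
                              for G in G_draws])
    lo, hi = 0.0, c_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if freq(mid) >= 1 - alpha else (mid, hi)
    return hi
```

Each bootstrap iteration is thus one small linear program, which is what makes the calibration computationally light relative to profiling over nonlinear programs.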
3.2 Computation of the Outer Maximization Problem
Projection-based methods as in (1.2) and (1.3) have nonlinear constraints involving a critical value which in general is an unknown function of $\theta$. Moreover, in all methods, including ours and that of Andrews and Soares (2010), the gradients of the critical values with respect to $\theta$ are not available in closed form. When the dimension of the parameter vector is large, directly solving optimization problems with such constraints can be expensive even if evaluating the critical value at each $\theta$ is cheap.
To mitigate this issue, we provide an algorithm that is a contribution to the moment (in)equalities literature in its own right and that can be helpful for implementing other approaches.[14] This algorithm is based on the response surface method used in the optimization literature; see Jones (2001), Jones, Schonlau, and Welch (1998), and references therein. We apply it to constrained optimization problems of the following form:
$$\max_{\theta \in \Theta}\ p'\theta \quad \text{s.t.} \quad g_j(\theta) \le c(\theta),\ \ j = 1, \dots, J,$$
where $\theta^*$ denotes an optimal solution of the problem, $g_j$, $j = 1, \dots, J$, are known functions, and $c$ is a function that requires a higher computational cost. In our context, $g_j(\theta) = \sqrt{n}\,\bar m_{n,j}(\theta)/\hat\sigma_{n,j}(\theta)$ and, for calibrated projection, $c(\theta) = \hat c_n(\theta)$. Conditional on the data, these functions are considered deterministic. A key feature of the problem is that the function $c$ is relatively costly to evaluate.[15] Here we assume that computing the sample moments is less expensive than computing the critical value. When computation of moments is also very expensive, our proposed algorithm can be used to approximate these too. Our algorithm evaluates $c$ on finitely many values of $\theta$. For other values, it imposes a probabilistic model that gets updated as specific values are computed and that is used to determine the next evaluation point. Under reasonable conditions, the resulting sequence of approximate optimal values converges to the true optimal value $p'\theta^*$.
Specifically, after drawing an initial set of evaluation points that grows linearly with the dimensionality of parameter space, the algorithm has three steps called E, A, and M below.
- Initialization-step: Draw an initial set of evaluation points $\theta^{(1)}, \dots, \theta^{(k)}$ randomly (uniformly) over $\Theta$. We suggest letting $k$ grow linearly with $d$.
- E-step: (Evaluation)
Evaluate $c(\theta^{(\ell)})$ for $\ell = 1, \dots, L$, where $L \ge k$ is the current number of evaluation points. Set $\Upsilon^{(\ell)} \equiv c(\theta^{(\ell)})$. The current estimate of the optimal value can be computed using
$$z^{*,L} \equiv \max\big\{p'\theta^{(\ell)} : \theta^{(\ell)} \in C^L\big\},$$
where $C^L \equiv \{\theta^{(\ell)},\ 1 \le \ell \le L : g_j(\theta^{(\ell)}) \le c(\theta^{(\ell)})\ \text{for all}\ j\}$ is the set of feasible evaluation points.
- A-step: (Approximation)
Approximate $\theta \mapsto c(\theta)$ by a flexible auxiliary model. We use a Gaussian-process regression model (or kriging), which for a mean-zero Gaussian process $\epsilon(\cdot)$ indexed by $\theta$ and with constant variance $\varsigma^2$ specifies
$$c(\theta) = \mu + \epsilon(\theta), \tag{3.8}$$
$$\mathrm{Corr}\big(\epsilon(\theta), \epsilon(\theta')\big) = K_\beta(\theta - \theta'), \quad \theta, \theta' \in \Theta, \tag{3.9}$$
where $K_\beta$ is a kernel with parameter vector $\beta$, e.g. $K_\beta(\theta - \theta') = \exp\big(-\sum_{k=1}^d |\theta_k - \theta'_k|^2/\beta_k\big)$. The unknown parameters $(\mu, \varsigma^2)$ can be estimated by running a GLS regression of $(\Upsilon^{(1)}, \dots, \Upsilon^{(L)})'$ on a constant with the given correlation matrix. The unknown parameters $\beta$ can be estimated by a (concentrated) MLE.
The (best linear) predictor of the critical value and its gradient at an arbitrary point $\theta$ are then given by
$$c_L(\theta) = \hat\mu + r_L(\theta)'\mathbf{R}_L^{-1}(\Upsilon - \hat\mu\mathbf{1}), \qquad \nabla_\theta c_L(\theta) = \big(\nabla_\theta r_L(\theta)\big)'\mathbf{R}_L^{-1}(\Upsilon - \hat\mu\mathbf{1}),$$
where $r_L(\theta)$ is a vector whose $\ell$-th component is $K_{\hat\beta}(\theta - \theta^{(\ell)})$ as given above with estimated parameters, $\Upsilon \equiv (\Upsilon^{(1)}, \dots, \Upsilon^{(L)})'$, and $\mathbf{R}_L$ is an $L$-by-$L$ matrix whose $(\ell, \ell')$ entry is $K_{\hat\beta}(\theta^{(\ell)} - \theta^{(\ell')})$ with estimated parameters. This approximating (or surrogate) model has the property that its predictor satisfies $c_L(\theta^{(\ell)}) = c(\theta^{(\ell)})$, $\ell = 1, \dots, L$. Hence, it provides an analytical interpolation to the evaluated critical values together with an analytical gradient.[16] See details in Jones, Schonlau, and Welch (1998). We use the DACE Matlab kriging toolbox (http://www2.imm.dtu.dk/projects/dace/) for this step in our Monte Carlo experiments. Further, the amount of uncertainty left in $c(\theta)$ at an arbitrary point is captured by the following variance:
$$\hat\varsigma^2 s_L^2(\theta) \equiv \hat\varsigma^2\Big(1 - r_L(\theta)'\mathbf{R}_L^{-1}r_L(\theta) + \frac{\big(1 - \mathbf{1}'\mathbf{R}_L^{-1}r_L(\theta)\big)^2}{\mathbf{1}'\mathbf{R}_L^{-1}\mathbf{1}}\Big).$$
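The interpolation property of the kriging predictor can be illustrated with a minimal sketch (our own code with a Gaussian kernel standing in for the estimated kernel; the DACE toolbox that the paper uses also estimates the kernel parameters):

```python
import numpy as np

def kriging_fit_predict(theta_new, Theta_pts, C_vals, beta=1.0):
    """Ordinary-kriging sketch: GLS mean on a constant plus the best linear
    predictor mu_hat + r(theta)' R^{-1} (C - mu_hat * 1), using the Gaussian
    correlation K(t) = exp(-||t||^2 / beta).  Theta_pts is (L, d), C_vals is
    (L,) of evaluated critical values, theta_new is (d,)."""
    def corr(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / beta)
    R = corr(Theta_pts, Theta_pts)
    ones = np.ones(len(C_vals))
    # GLS regression of C_vals on a constant with correlation matrix R:
    mu_hat = (ones @ np.linalg.solve(R, C_vals)) / (ones @ np.linalg.solve(R, ones))
    r = corr(theta_new[None, :], Theta_pts).ravel()
    return mu_hat + r @ np.linalg.solve(R, C_vals - mu_hat * ones)
```

At any evaluation point the correlation vector r equals the corresponding row of R, so the prediction reproduces the evaluated value exactly, which is the interpolation property described above.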
- M-step: (Maximization)
With probability $1 - \epsilon$, maximize the expected improvement function to obtain the next evaluation point, with:
$$\theta^{(L+1)} \in \arg\max_{\theta \in \Theta}\ \mathbb{EI}_L(\theta), \qquad \mathbb{EI}_L(\theta) \equiv (p'\theta - z^{*,L})_+\Big(1 - \Phi\Big(\frac{\bar g(\theta) - c_L(\theta)}{\hat\varsigma\, s_L(\theta)}\Big)\Big), \tag{3.13}$$
where $\bar g(\theta) \equiv \max_{j = 1, \dots, J} g_j(\theta)$. This step can be implemented by standard nonlinear optimization solvers, e.g. KNITRO called from Matlab (see Appendix B.3 for details). With probability $\epsilon$, draw $\theta^{(L+1)}$ randomly from a uniform distribution over $\Theta$.
Once the next evaluation point is determined, one adds it to the set of evaluation points and iterates the E-A-M steps. This yields an increasing sequence of approximate optimal values $z^{*,L}$, $L = k+1, k+2, \dots$. Once a convergence criterion is met, the value $z^{*,L}$ is reported as the end point of $CI_n$. We discuss convergence criteria in Section 5.
The advantages of E-A-M are as follows. First, we control the number of points at which we evaluate the critical value. Since the evaluation of the critical value is the relatively expensive step, controlling the number of evaluations is important. One should also note that the E-step with the initial $k$ evaluation points can easily be parallelized. For any additional E-step (i.e. for $L > k$), one needs to evaluate $c(\cdot)$ only at a single point $\theta^{(L)}$. The M-step is crucial for reducing the number of additional evaluation points. To determine the next evaluation point, one needs to take into account the trade-off between "exploitation" (i.e. the benefit of drawing a point at which the optimal value is high) and "exploration" (i.e. the benefit of drawing a point in a region in which the approximation error of $c_L$ is currently large). The expected improvement function in (3.13) quantifies this trade-off, and draws a point only in an area where one can expect the largest improvement in the optimal value, yielding substantial computational savings.[17] It is also possible to draw multiple points in each iteration. See Schonlau, Welch, and Jones (1998).
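The exploitation/exploration trade-off can be sketched in code (our own illustration with hypothetical argument names: the objective gain is weighted by the surrogate model's posterior probability that the candidate point is feasible):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(theta_grid, p, z_best, g_bar, c_pred, s_pred):
    """For each candidate row of theta_grid (n, d): the gain of p'theta over
    the current best feasible value z_best, times the posterior probability,
    under the kriging model, that the most-violated standardized moment
    g_bar(theta) lies below the predicted critical level c_pred(theta) with
    posterior s.d. s_pred(theta).  Vectorized over the candidate grid."""
    gain = np.maximum(theta_grid @ p - z_best, 0.0)
    prob_feasible = norm.cdf((c_pred - g_bar) / np.maximum(s_pred, 1e-12))
    return gain * prob_feasible

# The next evaluation point is the maximizer of this criterion (over a grid
# here; in practice a nonlinear solver with the analytical gradient).
```

A candidate that cannot improve on the incumbent gets zero weight, and a candidate that is almost surely infeasible gets essentially zero probability, so the search concentrates where improvement is plausible.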
Second, the proposed algorithm simplifies the M-step by providing constraints and their gradients for program (3.13) in closed form. Availability of analytical gradients greatly aids fast and stable numerical optimization. The price is the additional approximation step. In the numerical exercises of Section 5, this price turns out to be low.
3.3 Convergence of the E-A-M Algorithm
We now provide formal conditions under which $z^{*,L}$ converges to the true end point of $CI_n$ as $L \to \infty$.[18] We build on Bull (2011), who proves a convergence result for the algorithm proposed by Jones, Schonlau, and Welch (1998) applied to an unconstrained optimization problem in which the objective function is unknown outside the evaluation points. Our convergence result recognizes that the parameters of the Gaussian process prior in (3.8) are estimated for each iteration of the A-step using the "observations" $(\theta^{(\ell)}, c(\theta^{(\ell)}))$, $\ell = 1, \dots, L$, and hence change with $L$. Because of this, a requirement for convergence is that $c(\cdot)$ is a sufficiently smooth function of $\theta$. We show that a high-level condition guaranteeing this level of smoothness ensures a general convergence result for the E-A-M algorithm. This is a novel contribution to the literature on response surface methods for constrained optimization.
In the statement of Theorem 3.1 below, $\mathcal{H}_\beta$ is the reproducing kernel Hilbert space (RKHS) on $\Theta$ determined by the kernel used to define the correlation functional in (3.9). The norm on this space is $\|\cdot\|_{\mathcal{H}_\beta}$; see Online Appendix B.2 for details. Also, the expectation is taken with respect to the law of the evaluation points $(\theta^{(1)}, \theta^{(2)}, \dots)$ determined by the Initialization-step and the M-step, holding the sample fixed. See Appendix A for a precise definition of this law and a proof of the theorem.
Theorem 3.1: Suppose $\Theta$ is a compact hyperrectangle with nonempty interior and that $\epsilon \in (0, 1)$. Let the evaluation points be drawn according to the Initialization and the M steps. Let $K_\beta$ in (3.9) be a Matérn kernel with smoothness index $\nu > 1$, $\nu \notin \mathbb{N}$. Let $c$ satisfy $\|c\|_{\mathcal{H}_\beta} \le R$ for some $R > 0$. Then $z^{*,L}$ converges in expectation to the true end point of $CI_n$ as $L \to \infty$.
The requirement that $\Theta$ is a compact hyperrectangle with nonempty interior can be replaced by a requirement that $\Theta$ belongs to the interior of a closed hyperrectangle in $\mathbb{R}^d$ such that $c$ satisfies the smoothness requirement in Theorem 3.1 on that rectangle.
In order to apply Theorem 3.1 to calibrated projection, we provide low level conditions (Assumption B.1 in Online Appendix B.1.1) under which the map $\theta \mapsto \hat c_n(\theta)$ uniformly stochastically satisfies a Lipschitz-type condition. To get smoothness, we work with a mollified version of $\hat c_n$, denoted $\hat c_{n,\delta}$ and provided in equation (B.1), with $\delta > 0$ a mollification parameter.[19] For a discussion of mollification, see e.g. Rockafellar and Wets (2005, Example 7.19). Theorem B.1 in the Online Appendix shows that $\hat c_n$ and $\hat c_{n,\delta}$ can be made uniformly arbitrarily close to each other and that $\hat c_{n,\delta}$ yields valid inference in the sense of equation (2.6). In practice, one may therefore directly apply the E-A-M steps to $\hat c_n$.
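For intuition, a schematic rendering of standard mollification in the sense of Rockafellar and Wets (2005), rather than the paper's exact equation (B.1):

```latex
\hat c_{n,\delta}(\theta) \;=\; \int_{\mathbb{R}^d} \hat c_n(\theta - \nu)\,\phi_\delta(\nu)\,d\nu ,
```

where $\phi_\delta$ is a smooth density supported on $\{\|\nu\| \le \delta\}$. Convolution with $\phi_\delta$ makes $\hat c_{n,\delta}$ infinitely differentiable in $\theta$, while the Lipschitz-type condition on $\hat c_n$ keeps $\sup_\theta |\hat c_{n,\delta}(\theta) - \hat c_n(\theta)|$ arbitrarily small as $\delta \downarrow 0$.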
The key condition imposed in Theorem B.1 is Assumption B.1. It requires that the GMS function used is Lipschitz in its argument, and that the standardized moment functions are Lipschitz in $\theta$. In Online Appendix C.1 we establish that the latter condition is satisfied by some canonical examples in the moment (in)equality literature, namely the mean with missing data, linear regression and best linear prediction with interval data (and discrete covariates), and entry games with multiple equilibria (and discrete covariates).[20] It can also be shown to hold in semi-parametric binary regression models with discrete or interval valued covariates under the assumptions of Magnac and Maurin (2008).
4 Asymptotic Validity of Inference
We posit that $P$, the distribution of the observed data, belongs to a class of distributions denoted by $\mathcal{P}$. We write stochastic order relations that hold uniformly over $\mathcal{P}$ using the notations $o_{\mathcal{P}}$ and $O_{\mathcal{P}}$; see Online Appendix D.1 for the formal definitions. Below, generic constants may be different in different appearances but cannot depend on $P$. Given a square matrix $A$, we write $\mathrm{eig}(A)$ for its smallest eigenvalue.
Assumption 4.1: (a) $\Theta \subset \mathbb{R}^d$ is a compact hyperrectangle with nonempty interior.
(b) All distributions $P \in \mathcal{P}$ satisfy the following:
(i) $E_P[m_j(X_i, \theta)] \le 0$, $j = 1, \dots, J_1$, and $E_P[m_j(X_i, \theta)] = 0$, $j = J_1 + 1, \dots, J_1 + J_2$, for some $\theta \in \Theta$;
(ii) $\{X_i, i \ge 1\}$ are i.i.d., and $\sigma^2_{P,j}(\theta) \in (0, \infty)$ for $j = 1, \dots, J$ for all $\theta \in \Theta$;
(iii) For some $\delta > 0$ and $M \in (0, \infty)$ and for all $j$, $E_P\big[\sup_{\theta \in \Theta} |m_j(X_i, \theta)/\sigma_{P,j}(\theta)|^{2+\delta}\big] \le M$.
Assumption 4.1-(a) requires that $\Theta$ is a hyperrectangle, but it can be replaced with the assumption that $\Theta$ is defined through a finite number of nonstochastic inequality constraints smooth in $\theta$ and such that $\Theta$ is convex. Compactness is a standard assumption on $\Theta$ for extremum estimation. We additionally require convexity as we use mean value expansions of the studentized moments in $\theta$; see (2.8). Assumption 4.1-(b) defines our moment (in)equalities model. Assumption 4.2 constrains the GMS function $\varphi$ and the rate at which its tuning parameter $\kappa_n$ diverges. Both 4.1-(b) and 4.2 are based on Andrews and Soares (2010) and are standard in the literature,[21] Continuity of the GMS function is restrictive for only one of the GMS functions in Andrews and Soares (2010). although typically with a faster admissible divergence rate for $\kappa_n$. The slower rate is satisfied for the popular choice, recommended by Andrews and Soares (2010), of $\kappa_n = (\ln n)^{1/2}$.
Next, and unlike some other papers in the literature, we impose restrictions on the correlation matrix of the moment functions. These conditions can be easily verified in practice because they are implied when the correlation matrices of the moment equality functions and of the moment inequality functions specified below have determinants larger than a predefined constant for each $\theta \in \Theta$.
All distributions $P \in \mathcal{P}$ satisfy one of the following two conditions for some constants:
Let . Denote
The functions are defined on . There exists , , and measurable functions , such that for each ,
For each and any choice , denoting , where