Agglomerative Hierarchical Clustering for Selecting Valid Instrumental Variables

Abstract

We propose an instrumental variable (IV) selection procedure that combines the agglomerative hierarchical clustering method with the Hansen-Sargan overidentification test to select valid instruments for IV estimation from a large set of candidate instruments. Some of the instruments may be invalid in the sense that they fail the exclusion restriction. We show that, under the plurality rule, our method achieves oracle selection and estimation results. Compared with previous IV selection methods, ours has the advantages that it deals effectively with the weak-instruments problem and that it extends easily to settings with multiple endogenous regressors and heterogeneous treatment effects. We conduct Monte Carlo simulations to examine the performance of our method and compare it with two existing methods, the Hard Thresholding method (HT) and the Confidence Interval method (CIM). The simulations show that our method achieves oracle selection and estimation results in both the single and the multiple endogenous regressor settings in large samples when all the instruments are strong. It also works well when some of the candidate instruments are weak, outperforming HT and CIM. We apply our method to the estimation of the effect of immigration on wages in the US.


1 Introduction

Instrumental variables estimation is a widely used statistical method for analysing the causal effects of treatment variables on an outcome variable when the causal pathway between them is confounded. Consistent IV estimation requires that all instruments are valid. This requires that (a) instruments are associated with the endogenous variables (relevance condition) and (b) instruments do not affect the outcome directly or through unobserved factors (exclusion restrictions). In practice, a main challenge in IV estimation is that when there is a large number of candidate instruments, some of them may be invalid in the sense that they fail the exclusion restrictions. Many IV applications select valid instruments from the set of potential instruments merely on the basis of economic intuition, or even simply include all the candidate instruments in the IV estimation. This practice is problematic because including invalid instruments can lead to severely biased results. It is therefore important to develop IV selection methods for the presence of possibly invalid instruments, when complete knowledge about the candidate instruments' validity is absent.

The importance of developing IV selection techniques can be illustrated by a class of IV applications: shift-share IV estimation in international economics, where the instrument is constructed from many class-specific share variables. For example, Apfel (2019) estimates the effect of immigration on wages in the US labor market. The instrument for the contemporaneous immigration pattern is the lagged immigration pattern, which is constructed from 19 origin-country-specific share variables. Research in this area has documented that, for the shift-share IV to be valid, each of the 19 share variables must satisfy the exclusion restriction. However, some of the shares may violate the exclusion restriction, as they may affect the wage variable directly through long-term dynamic adjustment processes, or be correlated with unobserved demand shocks.

In this paper, we propose an IV selection and estimation method which combines the agglomerative hierarchical clustering algorithm, a machine learning algorithm typically employed in cluster analysis, with the Sargan test for overidentifying restrictions. The estimator that we develop relies on the plurality rule (Guo2018Confidence), which states that the largest group of IVs consists of the valid instruments, where instruments form a group if their instrument-specific just-identified estimators converge to the same value. Under the plurality rule, our method achieves oracle selection, meaning that the valid instruments are selected consistently and that the IV estimator using the selected valid instruments has the same limiting distribution as the estimator that uses the true set of valid instruments.

Previous work has tackled the IV selection problem in the single endogenous variable case. Kang2016Instrumental propose a selection method based on the least absolute shrinkage and selection operator (LASSO). Windmeijer2019Use improve on this with an adaptive-LASSO-based method under the assumption that more than half of the candidate instruments are valid (the majority rule), which theoretically guarantees consistent selection. Guo2018Confidence propose the Hard Thresholding method under the sufficient and necessary identification condition (the plurality rule), which is a relaxation of the majority rule. Under the same identification condition, Windmeijer2020Confidence propose the Confidence Interval method, which has better finite-sample performance. Our research adds to the literature in five ways:

  1. We combine agglomerative hierarchical clustering with a traditional statistical test, the Sargan over-identification test, to yield a novel downward testing algorithm for IV selection. This new method provides the theoretical guarantee that it can select the true set of valid instruments consistently, and is computationally feasible.

  2. We extend the method to settings with multiple endogenous regressors. Such an extension is not available for the aforementioned methods, but it is straightforward in our setting.

  3. Our method performs well in the presence of weak valid or invalid instruments, which is an advantage over existing methods.

  4. We also discuss the application of our method to a setting with heterogeneous treatment effects.

  5. Our algorithm is computationally less complex than the CIM. Also, compared with the commonly used K-means clustering, our method does not need to pre-specify the number of clusters or any starting points; the only pre-specified parameter is the critical value for the Sargan test, for which the existing theory on consistent selection provides well-established choices. This makes our method easy to implement in practice.

We conduct Monte Carlo simulations to examine the performance of our method and compare it with two existing methods, the Hard Thresholding method (HT) and the Confidence Interval method (CIM). The simulation results show that our method achieves oracle performance in both the single and the multiple endogenous regressor settings in large samples when all the instruments are strong. Our method also works well when some of the candidate instruments are weak, outperforming HT and CIM. We apply our method to shift-share IV estimation to estimate the dynamic effects of immigration on wages in the US.

The remainder of this paper is structured as follows. In Section 2, we state the model and assumptions and illustrate some well-established properties of the just-identified 2SLS estimator. In Section 3, we describe the basic method and the algorithm, and investigate its asymptotic properties. In Section 4, we present extensions to settings with multiple endogenous regressors and heterogeneous treatment effects, and discuss our method's performance in the presence of weak instruments. In Section 5, we provide Monte Carlo simulation results. In Section 6, we apply our method to estimate the effects of immigration on wages. Section 7 concludes.

2 Model and Assumptions

2.1 Setup

In the following we introduce notational conventions used throughout this paper. Matrices are in bold; vectors are in bold italics. Let $\boldsymbol{y}$ be the $n$-vector of the observed outcome, let $\boldsymbol{d}_1, \dots, \boldsymbol{d}_P$ be the endogenous regressor vectors (each $n \times 1$), which can be subsumed in the $n \times P$ matrix $\mathbf{D}$, and let $\boldsymbol{z}_1, \dots, \boldsymbol{z}_L$ be the instrument vectors, which can be subsumed in the $n \times L$ matrix $\mathbf{Z}$. Let the error terms be $\boldsymbol{u}$ and $\boldsymbol{\varepsilon}_p$ for $p = 1, \dots, P$, which are all $n$-vectors and are correlated, with $\mathrm{cov}(u_i, \varepsilon_{p,i}) \neq 0$. The latter covariances measure the endogeneity of the regressors in $\mathbf{D}$. The coefficient vector of interest is $\boldsymbol{\beta}$ ($P \times 1$). The $L \times P$ matrix $\Gamma$ contains the first-stage coefficients. Let $L_{\mathcal{A}}$ be the number of invalid instruments in set $\mathcal{A}$, $L_{\mathcal{V}}$ the number of valid instruments in set $\mathcal{V}$, and $L$ the total number of instruments. The mean of a variable $x$ is $\bar{x}$, $\|\cdot\|_2$ denotes the L2-norm and $|\cdot|$ denotes cardinality. The projection matrix is $\mathbf{P}_{\mathbf{Z}} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'$, the annihilator matrix is $\mathbf{M}_{\mathbf{Z}} = \mathbf{I}_n - \mathbf{P}_{\mathbf{Z}}$, and $\hat{\mathbf{D}} = \mathbf{P}_{\mathbf{Z}}\mathbf{D}$ are the first-stage fitted values.

2.2 Model

We start from the model setup with a single endogenous regressor, i.e. $P = 1$ throughout Sections 2 and 3. The extension of our method to multiple endogenous regressors can be found in Section 4.1. Following previous research, we adopt the following observed-data model, which takes potentially invalid instruments into account:

$$\boldsymbol{y} = \boldsymbol{d}\beta + \mathbf{Z}\boldsymbol{\alpha} + \boldsymbol{u}, \qquad (1)$$

and the first stage reduced form is

$$\boldsymbol{d} = \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\varepsilon}. \qquad (2)$$

$\boldsymbol{\alpha}$ is an $L$-vector with entries $\alpha_j$, each of which is associated with an instrument. A non-zero entry indicates that the corresponding instrument has a direct effect on the outcome variable and hence is invalid. An IV which is associated with a zero entry in $\boldsymbol{\alpha}$ is valid (Guo2018Confidence). The number of valid IVs is $L_{\mathcal{V}} = |\{j : \alpha_j = 0\}|$.

2.3 Assumptions

The first assumption makes sure that the just-identified estimators all exist.

Assumption 1.

Existence of just-identified estimators.
For all $j = 1, \dots, L$, we assume $\gamma_j \neq 0$.

Assumption 2.

Rank assumption

Assumption 3.

Error structure

Assumption 4.

The next assumption is key. It states that the largest group is composed of valid IVs. A group of IVs is defined as a set of instruments whose just-identified estimators converge to the same value $q$, i.e. $\mathcal{G}_q = \{j : \alpha_j/\gamma_j = q\}$ (Guo2018Confidence).

Assumption 5.

Plurality Rule

The assumptions above will be modified when there is more than one endogenous regressor.

2.4 Properties of Just-identified Estimators

From (1) and (2), we have the outcome-instrument reduced form

$$\boldsymbol{y} = \mathbf{Z}\boldsymbol{\Gamma}_y + \boldsymbol{v},$$

where $\boldsymbol{\Gamma}_y = \boldsymbol{\gamma}\beta + \boldsymbol{\alpha}$ and $\boldsymbol{v} = \boldsymbol{u} + \boldsymbol{\varepsilon}\beta$. There are $L$ just-identified 2SLS estimators. We write these estimators as in Windmeijer2020Confidence:

$$\hat\beta_j = \frac{\hat\Gamma_{y,j}}{\hat\gamma_j}, \qquad j = 1, \dots, L,$$

where $\hat\Gamma_{y,j}$ and $\hat\gamma_j$ are the OLS estimators for $\Gamma_{y,j}$ and $\gamma_j$ respectively. Then we have

$$\hat\beta_j \xrightarrow{p} \beta + \frac{\alpha_j}{\gamma_j} \equiv \beta + q_j.$$

Hence, the inconsistency of $\hat\beta_j$ is $q_j = \alpha_j/\gamma_j$. We define a group following the definition in Guo2018Confidence as

$$\mathcal{G}_q = \{j : \alpha_j/\gamma_j = q\}.$$

Then the group consisting of all valid instruments is

$$\mathcal{V} = \mathcal{G}_0 = \{j : \alpha_j = 0\}.$$

Let there be $Q$ groups.
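To fix ideas, the following minimal simulation sketch (our own illustration with made-up parameter values, not the paper's design) generates data from (1)-(2) and computes the $L$ reduced-form ratio estimators; the valid instruments' estimates cluster around $\beta$:

```python
# Hypothetical illustration: simulate model (1)-(2) and compute the L
# just-identified estimators beta_hat_j = Gamma_hat_{y,j} / gamma_hat_j.
import numpy as np

rng = np.random.default_rng(0)
n, L, beta = 5000, 6, 0.5
alpha = np.array([0.0, 0.0, 0.0, 0.4, 0.4, 0.8])  # last three IVs invalid
gamma = np.full(L, 0.5)                           # first-stage coefficients

Z = rng.normal(size=(n, L))
u = rng.normal(size=n)
eps = 0.5 * u + rng.normal(size=n)                # endogeneity: cov(u, eps) != 0
d = Z @ gamma + eps
y = d * beta + Z @ alpha + u

Gy = np.linalg.lstsq(Z, y, rcond=None)[0]         # OLS of y on Z
g = np.linalg.lstsq(Z, d, rcond=None)[0]          # OLS of d on Z
print(np.round(Gy / g, 2))                        # valid IVs cluster near beta = 0.5;
                                                  # IV j drifts to beta + alpha_j / gamma_j
```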

3 IV Selection and Estimation Method

We explore clustering methods for IV selection and estimation. First we fit the general clustering framework to the IV selection problem, which is summarized in the minimisation problem in (3). This general method needs a pre-specified parameter $K$, the number of clusters. We show that when $K$ equals the number of groups $Q$, it can achieve consistent selection. However, the dependence of consistent selection on $K$ makes this difficult to implement in practice, as we have no prior knowledge of the number of groups. If $K$ is too large (larger than $Q$), the largest group will be split; if $K$ is too small, the largest group might end up in a cluster with some other group. To tackle this problem, we propose a downward testing procedure which combines the agglomerative hierarchical clustering method (Ward's method) with the Sargan test for overidentifying restrictions to select the valid instruments, which allows us to select $K$ systematically.

Note that in this section, we develop our method and its properties for $P = 1$. In Section 4.1 we extend the method to cases where $P > 1$. All proofs in the Appendix are for general $P$.

3.1 Clustering Method for IV Selection

Let $\{S_1, \dots, S_K\}$ be a partition of the $L$ just-identified estimators into $K$ cluster cells, where $S_k$ is the set of identities of the just-identified estimators in cluster $k$. The clustering result is the solution to the following minimization problem:

$$\min_{S_1, \dots, S_K} \sum_{k=1}^{K} \sum_{j \in S_k} \big(\hat\beta_j - \bar\beta_{S_k}\big)^2, \qquad (3)$$

where $\bar\beta_{S_k}$ is the mean of all just-identified estimators in cluster $k$.
Based on Assumption 5, the group that consists of valid IVs is selected as the cluster that contains the largest number of just-identified estimators:

$$\hat{\mathcal{V}} = S_{\hat{k}}, \qquad \hat{k} = \arg\max_k |S_k|.$$

Then the set of invalid instruments is $\hat{\mathcal{A}} = \{1, \dots, L\} \setminus \hat{\mathcal{V}}$.

Now we show that when the number of clusters $K$ is equal to the number of groups $Q$, the partition minimizing the sum in (3) is such that each cluster is formed by exactly one group. Define this partition as the true partition $S^*$.

To see this, first note that if the partition is such that each cluster coincides with one group, then for every $j \in S_k$ the estimator $\hat\beta_j$ converges to the common group value, so all within-cluster deviations, and hence the whole objective in (3), converge to zero in probability. Second, if the partition is such that some cluster contains estimators from two different groups, then the deviations of those estimators from their cluster mean converge to non-zero constants, so the objective remains bounded away from zero. This means that as $n \to \infty$ there is a unique solution to (3), which is the true partition $S^*$. Based on Assumption 5, the valid instruments are those contained in the largest cluster. This of course relies on the correct choice $K = Q$.
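As a sanity check of this argument, a brute-force evaluation of (3) on a toy set of estimates (values are illustrative) confirms that with $K = Q$ the minimising partition separates the groups:

```python
# Brute-force minimisation of objective (3) over all K-labelings of a toy
# set of just-identified estimates (two groups, K = 2).
import numpy as np
from itertools import product

est = np.array([0.51, 0.49, 0.50, 1.30, 1.28])    # groups {0,1,2} and {3,4}
K = 2

def within_ss(lab):
    lab = np.asarray(lab)
    return sum(((est[lab == k] - est[lab == k].mean()) ** 2).sum()
               for k in set(lab.tolist()))

ss, labels = min((within_ss(lab), lab) for lab in product(range(K), repeat=len(est)))
print(ss, labels)   # the optimal labeling separates {0,1,2} from {3,4}
```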

3.2 Ward’s Algorithm for IV Selection

To tackle the difficulty of choosing the correct value of $K$ without prior knowledge of the number of groups, we propose a selection method which combines Ward's algorithm, a general agglomerative hierarchical clustering procedure proposed by Ward1963Hierarchical, with the Sargan test for overidentification. Our selection algorithm has two parts. The first part is Ward's algorithm, listed in Algorithm 1, which generates a clustering of the just-identified estimators for each $K$. After obtaining the clusters for each $K$, we use a downward testing procedure based on the Sargan test (Algorithm 2) to select the set of valid instruments.

Algorithm 1.

Ward’s algorithm

  1. Input: the $L$ just-identified point estimates $\hat\beta_1, \dots, \hat\beta_L$.

  2. Initialization: each just-identified estimate forms its own cluster; the total number of clusters at the start is hence $L$.

  3. Joining: the two clusters which are closest, as measured by the Euclidean distance between their means, are joined into a new cluster.

  4. Iteration: the joining step is repeated until all just-identified point estimates are in one cluster.
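A compact sketch of Algorithm 1 using SciPy's off-the-shelf Ward implementation as a stand-in for the procedure above (the input estimates are illustrative); `cut_tree` recovers the partition at every $K$ along the path:

```python
# Ward's agglomerative clustering on one-dimensional just-identified estimates.
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

beta_just = np.array([0.51, 0.49, 0.50, 1.32, 1.28, 2.11])  # illustrative values
path = linkage(beta_just.reshape(-1, 1), method="ward")     # full joining path

for K in range(len(beta_just), 0, -1):
    labels = cut_tree(path, n_clusters=K).ravel()
    print(K, labels)
# At K = 3 the partition separates {0,1,2}, {3,4} and {5}: each cluster
# corresponds to one group of instruments.
```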

This yields a path of $L - 1$ joining steps, along which there are partitions with $K = L, L-1, \dots, 1$ clusters. After generating the whole clustering path by Algorithm 1, we select the set of valid instruments following Algorithm 2:

Algorithm 2.

Selection

  1. Starting from $K = 1$, find the cluster that contains the largest number of just-identified estimators.

  2. Perform the Sargan test on the instruments contained in the largest cluster, using the rest as controls.

  3. Repeat for each $K = 2, \dots, L$.

  4. Select the largest cluster (in terms of number of just-identified estimators) that does not get rejected by the Sargan test. If there are multiple such clusters, select the one with the smallest Sargan statistic.

  5. Select the instruments contained in the cluster from Step 4 as valid instruments.

The Sargan statistic in Step 4 is given by

$$S\big(\hat{\mathcal{V}}\big) = \frac{\hat{\boldsymbol{u}}'\mathbf{P}_{\mathbf{Z}}\hat{\boldsymbol{u}}}{\hat{\boldsymbol{u}}'\hat{\boldsymbol{u}}/n},$$

where $\hat{\boldsymbol{u}}$ is the residual of the 2SLS estimator that uses the instruments contained in the largest cluster for each $K$ as valid instruments and controls for the rest of the instruments. We show later that to guarantee consistent selection, the critical value for the Sargan test, denoted by $\xi_n$, should satisfy $\xi_n \to \infty$ and $\xi_n/n \to 0$. In practice, we choose this quantity following Windmeijer2020Confidence.

Figure 1: Illustration of the algorithm
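Putting the two algorithms together, here is a sketch of the downward testing procedure (our own rendering, not the authors' code; for simplicity it uses a fixed chi-squared critical value rather than one growing with $n$):

```python
# Sketch of Algorithm 2: walk the Ward path, treat the largest cluster at each
# K as the candidate valid set, keep the largest non-rejected candidate set.
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree
from scipy.stats import chi2

def sargan(y, d, Z_valid, Z_ctrl):
    """Sargan statistic for 2SLS of y on (d, Z_ctrl) with instruments (Z_valid, Z_ctrl)."""
    X = np.column_stack([d, Z_ctrl])
    Zf = np.column_stack([Z_valid, Z_ctrl])
    Xhat = Zf @ np.linalg.lstsq(Zf, X, rcond=None)[0]      # P_Z X without an n x n matrix
    b = np.linalg.lstsq(Xhat, y, rcond=None)[0]            # 2SLS coefficients
    u = y - X @ b
    Pu = Zf @ np.linalg.lstsq(Zf, u, rcond=None)[0]        # P_Z u
    return len(y) * (u @ Pu) / (u @ u)

def select_valid(y, d, Z, level=0.05):
    """Return a boolean mask of the instruments selected as valid."""
    bj = np.linalg.lstsq(Z, y, rcond=None)[0] / np.linalg.lstsq(Z, d, rcond=None)[0]
    path = linkage(bj.reshape(-1, 1), method="ward")
    best_size, best_S, best_mask = 0, np.inf, None
    for K in range(1, Z.shape[1] + 1):
        labels = cut_tree(path, n_clusters=K).ravel()
        valid = labels == np.argmax(np.bincount(labels))   # largest cluster
        df = valid.sum() - 1                               # overidentifying restrictions
        if df <= 0:
            continue
        S = sargan(y, d, Z[:, valid], Z[:, ~valid])
        # The paper lets the critical value grow slowly with n; a fixed
        # chi-squared critical value is used here only for simplicity.
        if S < chi2.ppf(1 - level, df) and (valid.sum(), -S) > (best_size, -best_S):
            best_size, best_S, best_mask = valid.sum(), S, valid
    return best_mask
```

Applied to data simulated from (1)-(2), the returned mask should in large samples coincide with the true set of valid instruments.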

The procedure is illustrated in Figure 1. Here, we have a situation with six instruments. Three of them are valid, as they affect the outcome variable only through the endogenous regressor, while this is not the case for the other three invalid instruments. In the graph, the circles above the real line denote the just-identified estimators of the coefficient estimated with each of the six instruments. From left to right, we number these estimators and their corresponding instruments No. 1 to No. 6.

In the initial step (0) of the clustering process, each just-identified estimator has its own cluster. In step 1, we join the two estimators which are closest in terms of their Euclidean distance, i.e. those estimated with instruments No. 3 and No. 4 (the two orange circles). These two estimators now form one cluster and we have only five clusters. We re-check the distance between every two of the five clusters and merge the closest two into a new cluster. We continue with this procedure until there is only one cluster left in the bottom-right graph. We evaluate the Sargan test at each step, using the instruments contained in the largest cluster. When the p-value is larger than a certain threshold, say 0.05, we stop the procedure. Ideally this will be the case at step 2 or 3 of the algorithm, because here the largest group (in orange) is formed only by valid IVs (2, 3 and 4). If this is the case, only the valid IVs are selected as valid.

3.3 Oracle Selection and Estimation Property

In this section, we state the theoretical properties of the IV selection results obtained by Algorithms 1 and 2 and of the post-selection estimators. See Section 4.1 for detailed theoretical results for the general case $P \geq 1$. We establish that our method achieves oracle properties in the sense that it selects the valid instruments consistently, and that the post-selection IV estimator has the same limiting distribution as if we knew the true set of valid instruments.

Theorem 1.

Consistent selection.
Let $\xi_n$ be the critical value for the Sargan test in Algorithm 2 and let $\hat{\mathcal{V}}$ be the set of instruments selected by Algorithms 1 and 2. Under Assumptions 1 - 5, for $\xi_n \to \infty$ and $\xi_n/n \to 0$,

$$\lim_{n \to \infty} \Pr\big(\hat{\mathcal{V}} = \mathcal{V}\big) = 1.$$

The post-selection 2SLS estimator using the selected valid instruments and controlling for the selected invalid instruments has the same asymptotic distribution as the oracle estimator:

Theorem 2.

Let $\hat{\mathcal{A}}$ and $\hat{\mathcal{V}}$ be the selected invalid and valid instruments respectively, with $\mathbf{Z} = [\mathbf{Z}_{\hat{\mathcal{A}}}, \mathbf{Z}_{\hat{\mathcal{V}}}]$. Let $\hat\beta(\hat{\mathcal{V}})$ be the 2SLS estimator that uses $\mathbf{Z}_{\hat{\mathcal{V}}}$ as instruments while controlling for $\mathbf{Z}_{\hat{\mathcal{A}}}$. Under Assumptions 1 - 5, the limiting distribution of $\hat\beta(\hat{\mathcal{V}})$ is

$$\sqrt{n}\big(\hat\beta(\hat{\mathcal{V}}) - \beta\big) \xrightarrow{d} N\big(0, \sigma^2_{or}\big),$$

where $\sigma^2_{or}$ is the asymptotic variance of the oracle 2SLS estimator $\hat\beta(\mathcal{V})$, with $\mathcal{A}$ being the true set of invalid instruments.

The proof of Theorem 2 follows from the proof in Guo2018Confidence.

3.4 Computational Complexity

Recent implementations of the agglomerative hierarchical clustering algorithm have a computational cost of order $L^2$ (Amorim2016Ward). In the downward testing procedure, a maximum of $L$ different models needs to be tested, so the computational cost of the downward testing algorithm is also of order $L^2$. This is an improvement over the CIM, which has a higher time complexity and a larger maximal number of tests.

4 Extensions

In this section, we propose extensions of the method to a setting with multiple endogenous regressors and to a setting with heterogeneous effects. We also discuss the performance of our method in the presence of weak instruments, as compared with the HT and CI methods.

4.1 Multiple Endogenous Regressors

One shortcoming of previous methods that select invalid instruments is that they allow for only one endogenous regressor. In this section we therefore show how our method extends naturally to the selection of invalid instruments when $P > 1$. First of all, the inputs of our method, the just-identified estimators, are now estimated from all combinations of $P$ instruments out of the $L$ candidates; hence we have $\binom{L}{P}$ instead of $L$ just-identified estimators. Let $S$ be a set of identities of any $P$ instruments, such that the model is exactly identified with these instruments, and let $\mathbf{Z}_S$ denote the corresponding instrument matrix. To guarantee that all the just-identified estimators exist, we modify Assumption 1 as follows:

Assumption 1.a.

Existence of just-identified estimators.
For every set $S$ of $P$ instrument identities, let $\Gamma_S$ be the $P \times P$ submatrix of $\Gamma$ formed by the rows indexed by $S$. Then we assume that $\Gamma_S$ is invertible.

The plurality assumption also needs modification for $P > 1$. For $P = 1$, Assumption 5 states that the valid instruments form the largest group, where instruments form a group if their just-identified estimators converge to the same value. If we find the largest set of just-identified estimators that converge to the same value, then this set automatically identifies the largest group of instruments, as each just-identified estimator is estimated with a single instrument. When $P > 1$, however, each just-identified estimator is estimated with multiple instruments, so the equivalence between the largest set of just-identified estimators and the largest group of instruments may not hold. In this case, we modify the plurality rule so that it is based on combinations of instruments instead of individual instruments. The modification starts with revisiting the asymptotics of the just-identified estimators for $P > 1$.

There are $\binom{L}{P}$ just-identified models. We write the corresponding just-identified estimators analogously to the proof of Proposition A1 in Windmeijer2020Confidence for the case $P = 1$. First, for an arbitrary set $S$, partition the instrument matrix as $\mathbf{Z} = [\mathbf{Z}_S, \mathbf{Z}_{-S}]$, where $\mathbf{Z}_S$ is the $n \times P$ matrix containing the columns of $\mathbf{Z}$ indexed by $S$, and $\mathbf{Z}_{-S}$ is the $n \times (L - P)$ matrix containing the remaining columns; $[\Gamma_S', \Gamma_{-S}']'$ is the equivalent partition of the matrix of first-stage coefficients.

The just-identified 2SLS estimator $\hat{\boldsymbol{\beta}}_S$ uses $\mathbf{Z}_S$ as instruments and controls for the remaining instruments. By Assumption 4, we have the following asymptotics:

$$\hat{\boldsymbol{\beta}}_S \xrightarrow{p} \boldsymbol{\beta} + \Gamma_S^{-1}\boldsymbol{\alpha}_S \equiv \boldsymbol{\beta} + \boldsymbol{q}_S.$$

Hence, the inconsistency of $\hat{\boldsymbol{\beta}}_S$ is $\boldsymbol{q}_S = \Gamma_S^{-1}\boldsymbol{\alpha}_S$, and there are $\binom{L}{P}$ inconsistency terms. As for $P > 1$ not every IV is associated with a single just-identified estimator, we introduce the concept of a family: a family is a set of IV combinations whose just-identified estimators converge to the same vector, $\mathcal{F}_{\boldsymbol{q}} = \{S : \boldsymbol{q}_S = \boldsymbol{q}\}$.

Then the family that consists of IV combinations generating consistent estimators is

$$\mathcal{F}_0 = \{S : \boldsymbol{q}_S = \boldsymbol{0}\}.$$

Let there be $\tilde{Q}$ families. Note that when $P = 1$, a group of IVs automatically is a family.
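To see families concretely, a small numerical sketch (all values our own illustration) enumerates the inconsistency terms $\boldsymbol{q}_S = \Gamma_S^{-1}\boldsymbol{\alpha}_S$ for $P = 2$; exactly the combinations built from valid IVs return $\boldsymbol{q}_S = \boldsymbol{0}$:

```python
# Population inconsistency terms q_S for all C(L, P) instrument pairs.
import numpy as np
from itertools import combinations

L, P = 5, 2
Gamma = np.array([[1.0, 0.2],
                  [0.3, 1.1],
                  [0.8, 0.5],
                  [1.2, 0.4],
                  [0.6, 0.9]])                 # L x P first-stage coefficients
alpha = np.array([0.0, 0.0, 0.0, 0.7, -0.3])   # the last two IVs are invalid

for S in combinations(range(L), P):
    q = np.linalg.solve(Gamma[list(S)], alpha[list(S)])
    print(S, np.round(q, 3))
# Exactly the C(3, 2) = 3 pairs drawn from the valid IVs {0, 1, 2} give
# q_S = 0, so F_0 has binom(L_V, P) = 3 elements.
```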

Analogously to Assumption 5, we assume that $\mathcal{F}_0$ is the largest family.

We show in Appendix B that a combination of IVs is an element of $\mathcal{F}_0$ if and only if all of the IVs used for its estimation are in fact valid. This means that the family of valid IVs consists of all combinations of $P$ IVs out of the $L_{\mathcal{V}}$ valid ones, and hence $|\mathcal{F}_0| = \binom{L_{\mathcal{V}}}{P}$. Therefore, the plurality assumption can be modified to

Assumption 5.a.

New plurality

The inconsistency terms of the other families depend on the first-stage coefficient vectors, and hence there is no direct relation from $\boldsymbol{\alpha}_S$ to $\boldsymbol{q}_S$. This means that one family can be estimated with IV combinations which have different vectors $\boldsymbol{\alpha}_S$. We show this in Appendix C.

One way in which the new plurality could be fulfilled is

Assumption 5.b.

where $S$ and $S'$ are two different sets of IVs.

The second part of the assumption makes sure that the family of valid IVs consists of more than one element, without which Assumption 5.a cannot be fulfilled. The third part makes sure that sets containing at least one invalid IV do not converge to the same value, so that the largest group of valid IVs also translates into the largest family. This can be seen as a technical assumption and is stronger than needed.

The procedure to estimate $\boldsymbol{\beta}$ is analogous to the one in the preceding section (see Appendix A for an illustration), only that now we need to account for the presence of families. The valid IVs are then selected as those that are involved in estimating the largest cluster.

The cluster containing the valid instruments is chosen as the one for which the number of estimates in the cluster is maximal and the Sargan test is not rejected (Sargan statistic smaller than the threshold $\xi_n$). If there are multiple such clusters, we select the cluster in which more IVs are involved.

One ambiguity arises in finite samples: asymptotically, following Assumption 5.b, the largest group (in terms of direct effects) being valid also implies that the largest family is valid. In finite samples, however, the number of IVs involved in the estimation of the largest cluster might be smaller than the number involved in the estimation of another cluster. Therefore, we could also select the valid IVs by directly selecting the family associated with the maximal number of IVs instead of the largest cluster. It is an empirical question which of the two methods should be used.

The method has the oracle properties stated in Theorem 1 and Theorem 2. Here we formally establish the theoretical results for the general case $P \geq 1$. See Appendix D for proofs of all theorems. The following lemma establishes that when a just-identified estimator $\hat{\boldsymbol{\beta}}_S$ can be assigned either to (1) a cluster formed only by just-identified estimators from the same family as $\hat{\boldsymbol{\beta}}_S$, or (2) a cluster containing at least one element from a different family, asymptotically $\hat{\boldsymbol{\beta}}_S$ will be assigned to the first type of cluster.

Lemma 1.

Let $\hat{\boldsymbol{\beta}}_S$ be a just-identified estimator such that $S \in \mathcal{F}_{\boldsymbol{q}}$, and let $C_1$ be a cluster such that all elements of $C_1$ are from $\mathcal{F}_{\boldsymbol{q}}$. Let $C_2$ be a cluster such that at least one element of $C_2$ is from a family different from $\mathcal{F}_{\boldsymbol{q}}$. Under Assumptions 1 to 5 and Algorithm 1, if $\hat{\boldsymbol{\beta}}_S$ is assigned to either $C_1$ or $C_2$, it is assigned to $C_1$ with probability converging to 1.

In Algorithm 1, we start with each just-identified estimator in its own cluster. At each step onward, according to Step 3 of Algorithm 1, two clusters join to form a new cluster. For a cluster that contains more than one just-identified estimator, the cluster mean can be viewed as a single estimator averaging over all the estimators in it. Therefore, the joining step in Algorithm 1 can be viewed as assigning an estimator to a cluster. Note that a cluster consisting of just-identified estimators from the same family can itself be viewed as an estimator from this family. Based on Lemma 1, along the path of Algorithm 1, estimators from different families will not be joined with each other until all the estimators from the same family have merged into one cluster. If, for each family, the just-identified estimators contained in it have merged into the same cluster, then the total number of clusters is $\tilde{Q}$. This implies that when the number of clusters is no smaller than $\tilde{Q}$, the current clusters must be subsets of families.

Corollary 1.

Under assumptions 1 to 4, in steps 3 and 4 of Algorithm 1:

To better understand why this is the case, consider the following analogy. There are $\binom{L}{P}$ guests (the just-identified estimates) who belong to $\tilde{Q}$ families. These people live in a hotel, which has as many rooms as guests (clusters). Each day, one room disappears, and one of the guests needs to move into the room of some other guest. The people in a family have closer ties, so the person whose room disappears will move into the room of somebody from their own family. This goes on until each family is living in one crowded room. The hotel then continues to shrink; only now are people from different families merged into the same rooms. The largest family can be detected when all people from the same family have been merged into one room, but people from other families have not yet been completely merged into one room (or have each just been merged into one room respectively).

In Algorithm 1, the number of clusters starts at $\binom{L}{P}$ and ends at 1. At each step in between, the number of clusters decreases by 1, hence there must be a step where $K = \tilde{Q}$. Based on Lemma 1 and Corollary 1, estimators from different families are joined together only when all elements of their own family have been completely joined into their clusters. This implies in particular that when $K = \tilde{Q}$, there is a cluster equal to $\mathcal{F}_0$. Therefore, the path generated by Algorithm 1 contains the true partition, as there must be one step such that $K = \tilde{Q}$.

Corollary 2.

When $K = \tilde{Q}$, each cluster coincides with one family, i.e. the partition at this step is the true partition.

The theoretical results above establish that the selection path generated by Algorithm 1 covers the true set of valid instruments $\mathcal{V}$. Next we show that, by Algorithm 2, we can locate this set and select the valid instruments consistently. This consistent selection property is summarized in Theorem 1, which holds for $P \geq 1$ under Assumptions 1 (1.a) to 5 (5.a, 5.b). This is also the case for Theorem 2.

4.2 Weak Instruments

A major advantage of our method is that it deals effectively with the presence of weak instruments (valid or invalid). The intuition is that the instrument-specific just-identified estimators of weak instruments tend to have much larger magnitude than those of the strong instruments. Hence, the Euclidean distance between these two types of estimators tends to be large, making them less likely to be joined with each other. The existing methods, HT and CIM, can face problems in the presence of weak instruments: CIM always selects the weak instruments as valid, while the first-stage hard thresholding of the HT method might keep invalid instruments and rule out valid ones under certain correlation structures among the instruments.
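A toy check of this intuition (our own illustration, not the paper's design): shrinking one valid instrument's first-stage coefficient makes its just-identified estimate erratic and typically far from the cluster of strong valid IVs, so Ward's algorithm joins it late:

```python
# Weak-instrument illustration: the weak IV's ratio estimate is noisy/extreme.
import numpy as np

rng = np.random.default_rng(1)
n, beta = 2000, 0.5
gamma = np.array([0.5, 0.5, 0.5, 0.02])   # last (valid) IV is weak
Z = rng.normal(size=(n, 4))
u = rng.normal(size=n)
d = Z @ gamma + 0.5 * u + rng.normal(size=n)
y = d * beta + u

bj = np.linalg.lstsq(Z, y, rcond=None)[0] / np.linalg.lstsq(Z, d, rcond=None)[0]
print(np.round(bj, 2))   # the weak IV's estimate is typically an outlier
```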

4.3 Heterogeneous Treatment Effects

The instrumental variable estimator also has a local average treatment effect (LATE) interpretation: it estimates the average treatment effect for the sub-population whose treatment status is changed by the instrument (Imbens1994Estimation). Hence, LATEs naturally vary with the instruments. For example, an increase in the minimum school-leaving age and proximity to a school will induce different populations to increase their schooling. In this section we present such a setting and argue that our method can retrieve the largest group associated with a given LATE, or the whole set of different LATEs.

For simplicity, we look at a setting with a binary treatment $D$, a binary instrument $Z$ and potential outcomes $Y(1)$ and $Y(0)$. The outcome and the treatment can be written as

$$Y = Y(0) + D\big(Y(1) - Y(0)\big), \qquad D = Z D(1) + (1 - Z) D(0),$$

where $D(1)$ and $D(0)$ are the potential treatment statuses.

Assumption 6.

Independence

Assumption 7.

First Stage

Assumption 8.

Monotonicity

If the last three assumptions are fulfilled, Imbens1994Estimation show that the IV estimand is the average treatment effect of compliers:

$$\beta_{IV} = E\big[Y(1) - Y(0) \mid D(1) > D(0)\big]. \qquad (4)$$
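For reference, with a single binary instrument the sample analogue of (4) is the familiar Wald ratio (standard notation, not the paper's):

$$\hat\beta_{Wald} = \frac{\bar{Y}_{Z=1} - \bar{Y}_{Z=0}}{\bar{D}_{Z=1} - \bar{D}_{Z=0}},$$

the reduced-form effect of the instrument on the outcome divided by its first-stage effect on the treatment.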

In the following, we show a setting in which the LATEs depend on one potentially unobserved variable $X$. For this, we make use of the setting in Angrist2010Extrapolate.

The treatment is determined by the following latent-index assignment mechanism:

$$D = \mathbb{1}\{\gamma_0 + \gamma_1 Z > V\}, \qquad (5)$$

where $V$ is a latent error, and the potential outcomes depend on the variable $X$:

$$Y(d) = g_d(X) + \varepsilon_d, \qquad d \in \{0, 1\},$$

where the errors $\varepsilon_d$ have mean zero.

Angrist2010Extrapolate then assume

Assumption 9.

Conditional Effect Ignorability:

The authors then show that under this assumption the LATE can be written as a function of $X$:

$$\beta_j = E\big[\tau(X) \mid D_j(1) > D_j(0)\big], \quad \text{where } \tau(X) \equiv E\big[Y(1) - Y(0) \mid X\big]. \qquad (6)$$

Next, we are interested in a setting where the by-IV treatment effects form groups:

$$\beta_j \in \{b_1, \dots, b_Q\}, \qquad j = 1, \dots, L. \qquad (7)$$

This might be the case when different compliant populations have the same distribution of $X$, or when different values of $X$ lead to the same $\tau(X)$. Keep in mind that the number of groups is $Q \leq L$.

Lemma 2.

Under Assumptions 6, 7 and 8, in steps 3 and 4 of Algorithm 1:

(8)

This follows by the same assumptions as above. In the same way:

Theorem 3.

Consistent selection of LATE groups.
Let $\xi_n$ be the critical value for the Sargan test in Algorithm 2. Under Assumptions 1 - 5 and Lemma 2, for $\xi_n \to \infty$ and $\xi_n/n \to 0$, there is at least one step such that the selected cluster coincides with $\mathcal{G}_{max}$ with probability converging to 1,

where $\mathcal{G}_{max}$ is the largest group.

The difference from the setting with invalid IVs is that in the LATE setting not only the largest cluster contains valuable information; the smaller clusters also contain coefficient estimates obtained with valid instruments.

5 Monte Carlo Simulations

First, we apply our method to simulated data. In the single-regressor setting, we want to compare the performance of the new clustering method with that of the existing Confidence Interval method and the Two-Stage Hard Thresholding method. We therefore run simulations which closely follow the setting in Windmeijer2020Confidence: there are 21 IVs, twelve of which are invalid, while nine are valid; the entries of $\boldsymbol{\alpha}$ are non-zero for the twelve invalid instruments and zero for the nine valid ones, and the first-stage coefficients, the true $\beta$ and the error distribution are chosen as in that paper.

The results are in Table 1. The oracle 2SLS estimator has the lowest bias, and the coverage rate of its 95% confidence interval is 0.948. The naive 2SLS estimator has a median absolute error of about 1.034 and never covers the true value; as expected, this does not change when the sample size is increased to 10,000.

When using two-stage hard thresholding (HT) with 500 observations, the MAE is larger than for naive 2SLS and the method never chooses the oracle model, so none of the confidence intervals covers the true value. When using CIM, the MAE is already low at $N = 500$, the number of IVs chosen as invalid is close to twelve, the frequency with which the oracle model is selected is 0.96, and the coverage rate is about 0.94. Results are very similar for our agglomerative clustering method. When the sample size increases, the performance of all three selection methods improves and the MAE equals that of the oracle estimator in all cases. The coverage rate is then very close to the correct 95% for all estimators, and the probability of selecting the oracle model is close to one, except for HT, where it is 0.83; this, however, does not deteriorate HT's performance in terms of MAE.

MAE SD # invalid p allinv Coverage p oracle
N=500
oracle 0.017 0.024 12 1 0.948 1
naive 1.034 0.026 0 0 0 0
HT 1.174 0.124 12.932 0 0 0
CIM 0.018 0.215 12.043 0.992 0.936 0.963
HC 0.018 0.183 12.073 0.987 0.928 0.974
N=5000
oracle 0.005 0.007 12 1 0.952 1
naive 1.053 0.007 0 0 0 0
HT 0.005 0.008 12.321 1 0.952 0.73
CIM 0.005 0.008 12.012 1 0.952 0.988
HC 0.005 0.125 12.05 0.993 0.937 0.981
N=10,000
oracle 0.004 0.005 12 1 0.941 1
naive 1.05 0.005 0 0 0 0
HT 0.004 0.006 12.206 1 0.929 0.83
CIM 0.004 0.005 12.012 1 0.939 0.988
HC 0.004 0.104 12.047 0.996 0.926 0.979
This table reports the median absolute error, standard deviation, number of IVs selected as invalid, frequency with which all invalid IVs have been selected as invalid, coverage rate of the 95% confidence interval, and frequency with which the oracle model has been selected. The WLHB and invalid-weak settings are described in the text. 1000 repetitions per setting.
Table 1: WLHB
MAE SD # invalid p allinv Coverage p oracle
N=500
oracle 0.017 0.024 12 1 0.947 1
naive 0.204 0.054 0 0 0.463 0
HT 0.017 0.024 12 1 0.947 1
CIM 13.522 8.767 11.509 0.023 0.079 0
HC 0.017 1.146 12.012 0.991 0.934 0.983
N=5000
oracle 0.005 0.007 12 1 0.959 1
naive 0.357 0.018 0 0 0 0
HT 0.005 0.007 12 1 0.959 1
CIM 20.401 4.613 10.329 0.014 0.016 0.002
HC 0.006 4.296 11.967 0.923 0.876 0.913
N=10,000
oracle 0.003 0.005 12 1 0.947 1
naive 0.356 0.013 0 0 0 0
HT 0.005 0.056 10.84 0.798 0.756 0.798
CIM 21.895 5.451 10.952 0.054 0.053 0.044
HC 0.004 4.46 11.978 0.928 0.862 0.908
This table reports the median absolute error, standard deviation, number of IVs selected as invalid, frequency with which all invalid IVs have been selected as invalid, coverage rate of the 95% confidence interval, and frequency with which the oracle model has been selected. The WLHB and invalid-weak settings are described in the text. 1000 repetitions per setting.
Table 2: Invalid weak
MAE SD # invalid p allinv Coverage p oracle
N=500
oracle 0.138 0.192 12 1 0.939 1
naive 1.84 0.04 0 0 0 0
HT 1.714 0.19 9.953 0 0 0
CIM 1.264 0.571 6.121 0 0 0
HC 1.224 0.577 11.43 0.077 0.075 0.006
N=5000
oracle 0.05 0.071 12 1 0.954 1
naive 1.841 0.012 0 0 0 0
HT 0.268 0.577 16.642 0.596 0.557 0
CIM 1.241 0.477 10.501 0.147 0.139 0.147
HC 0.201 0.766 13.047 0.541 0.49 0.16
N=10,000
oracle 0.036 0.052 12 1 0.959 1
naive 1.845 0.008 0 0 0 0
HT 0.122 0.256 16.256 0.951 0.783 0.004
CIM 0.055 0.567 11.73 0.715 0.686 0.71
HC 0.055 0.589 12.655 0.828 0.754 0.536
This table reports the median absolute error, standard deviation, number of IVs selected as invalid, frequency with which all invalid IVs have been selected as invalid, coverage rate of the 95% confidence interval, and frequency with which the oracle model has been selected. The WLHB and valid-weak settings are described in the text. 1000 repetitions per setting.
Table 3: Valid weak
MAE SD # invalid p allinv Coverage p oracle
P=2
N=500
Oracle 0.025 0.041 0 0 0.92 0
Naive 0.104 0.056 12 1 0.01 1
AC 0.073 0.137 12.28 0.53 0.47 0.3
N=5000
Oracle 0.008 0.012 0 0 0.95 0
Naive 0.092 0.014 12 1 0 1
AC 0.008 0.011 12.04 1 0.96 0.96
N=10000
Oracle 0.006 0.009 0 0 0.95 0
Naive 0.092 0.011 12 1 0 1
AC 0.006 0.01 12.08 1 0.92 0.94
P=3
N=500
Oracle 0.036 0.053 0 0 0.89 0
Naive 0.679 0.104 12 1 0 1
AC 0.112 0.208 10.95 0.21 0.19 0.19
N=5000
Oracle 0.01 0.015 0 0 0.88 0
Naive 0.592 0.029 12 1 0 1
AC 0.012 0.053 12.02 0.86 0.77 0.83
N=10000
Oracle 0.007 0.011 0 0 0.91 0
Naive 0.592 0.022 12 1 0 1
AC 0.007 0.025 12.05 0.97 0.86 0.95
This table reports the median absolute error, standard deviation, number of IVs selected as invalid, frequency with which all invalid IVs have been selected as invalid, coverage rate of the 95% confidence interval, and frequency with which the oracle model has been selected. For the first two, means over the statistics for each regressor are taken. The settings are described in the text. 100 repetitions per setting.
Table 4: Multiple endogenous regressors

Following the discussion of weak instruments in Section 4.2, we expect the performance of the CIM to deteriorate in settings where the invalid IVs are weaker than the valid ones, while the HT should detect the invalid instruments. To obtain such a setting, we divide the first-stage coefficients of the invalid IVs by 10. The results can be found in Table 2. The HT method attains oracle performance, though it deteriorates when the sample size reaches 10,000. As expected, the CIM has low coverage and a probability of selecting the true model of almost zero for all sample sizes; its MAE can be as high as 20, when that of the naive estimator is just 0.35. Our AC method is close to oracle performance for all sample sizes. Hence, in this setting HT and AC outperform CIM.

What happens when the valid IVs are weak? Following our earlier discussion, we would expect the HT method to use only valid and relevant IVs and hence classify too few IVs as valid. The CIM should work well, while our AC method could deteriorate because of the poor finite-sample behavior of the point estimates that use weak valid IVs. The results are in Table 3. Indeed, HT ends up selecting 16 instead of 12 IVs as invalid (or irrelevant), leading to an oracle rate close to zero; for the largest sample size its MAE is four times the oracle MAE. The CIM performs well when the sample size is 10,000, but much worse for smaller sizes. Our method weakly outperforms the other two in terms of MAE, and the reported statistics approach oracle performance.

Now we inspect the performance of our method when there are multiple endogenous regressors; the existing selection methods do not allow for such an extension. We again draw 21 IVs, adding first-stage coefficient vectors for the second and, where applicable, third endogenous regressor. The rest of the parameters are the same as before.

With this setting we estimate the model over 100 replications. The results can be found in Table 4. Again, the performance of our estimator approaches that of the oracle estimator as the sample size grows large.

These simulations illustrate the two key advantages of our method. First, in settings with weak instruments our method can outperform the existing methods; if it is not known whether weak IVs are valid or invalid, it is preferable to use AC. In the WLHB setting, the performance of the three methods is comparable. Second, our method is applicable to the case of multiple endogenous variables.

6 Application: Effect of Immigration on Wages

In this section we apply our method to the estimation of the effects of immigration on wages in the US. We first describe the setting and then discuss the results.

Many recent studies have tried to estimate the causal effects of immigration on labor market outcomes. Most papers in the literature estimate only the contemporaneous effects of immigration on these outcomes. Jaeger2020Shift point out that there might be general-equilibrium adjustments that affect wages in the long run, for example through the attraction of capital or the response of native labor. This calls for including lagged immigration in the regression equation. However, the instrumental variables typically used in estimation might be invalid because of correlation with unobservable shocks or direct effects.

We estimate the following linear model:

$$\Delta \ln w_{zt} = \beta_1 \Delta m_{zt} + \beta_2 \Delta m_{z,t-1} + \tau_t + \varepsilon_{zt}, \qquad (9)$$

as in Basso2015Association.

Here, there are three time periods $t$ and 722 commuting zones $z$. The dependent variable $\Delta \ln w_{zt}$ is the change in log weekly wages of high-skilled workers. The independent variables are $\Delta m_{zt}$ and its lag $\Delta m_{z,t-1}$, where $\Delta m_{zt}$ is the change in the share of immigrants in employment; the coefficients of interest are $\beta_1$ and $\beta_2$. Decade fixed effects are captured by $\tau_t$, and $\varepsilon_{zt}$ is the error term. Commuting-zone fixed effects are eliminated through first-differencing. This regression is canonical in migration economics. The authors use data from the Census Integrated Public Use Micro Samples and the American Community Survey.

The key econometric challenge is that migrants choose where to live endogenously; for example, they might choose their location based on the economic conditions in a region. This creates a bias in the estimates. A much-used estimation strategy to address this issue is the shift-share instrumental variable, also known as the Bartik instrument after Bartik1991Who.

The key idea is to interact shares of previous migrants in a base period with current, aggregate-level shifts, or inflows, of migrants. This identification strategy dates back to Altonji1991effects in migration economics. Goldsmith-Pinkham2020Bartik show that the validity of the shift-share instrument depends on the validity of all shares, and that an over-identified model with all shares as instruments can be used equivalently to the just-identified model. Therefore, we use all shares of migrants from a certain origin country $o$ at a base period $b$ in region $z$. We use origin-specific shares from 19 origin-country groups and base years 1970 and 1980 as separate IVs and obtain $L = 2 \times 19 = 38$ IVs.
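Schematically (in our notation, not necessarily the original papers'), the shift-share instrument aggregates base-period shares with aggregate inflows, whereas the share-based approach uses the shares themselves as separate instruments:

$$B_{zt} = \sum_{o=1}^{19} s_{zo,b}\, m_{ot} \qquad \text{versus} \qquad \{s_{zo,b}\}_{o = 1, \dots, 19;\ b \in \{1970, 1980\}},$$

where $s_{zo,b}$ is the base-year-$b$ share of migrants from origin group $o$ in commuting zone $z$ and $m_{ot}$ is the aggregate inflow from origin $o$ in decade $t$.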

The main drawback of Bartik-type designs is that all shares need to be valid instruments. Why might these instruments be invalid? Jaeger2020Shift show that the shift-share IV estimator might be inconsistent, first, because of correlation of the IVs with unobserved demand shocks and, second, because of dynamic adjustment processes. For validity, neither of these channels may play a role. However, it is quite plausible that some origin-country groups did not locate randomly in the past or have had direct effects on wages. The second challenge can be partly addressed by including lagged immigration as an additional regressor. Of course, this regressor is subject to the same endogeneity problem as before and hence should also be instrumented. To circumvent these problems, we apply the new estimator, which allows for direct effects of many shares on wages by selecting the invalid shares. This alternative estimation of shift-share designs has also been proposed in Apfel2019Relaxing.

OLS 2SLS 2SLS AHC

$\Delta m_{zt}$ 0.586 0.877 1.522
(0.0935) (0.460) (0.292)

$\Delta m_{z,t-1}$ -0.197 -0.249 -0.771
(0.0814) (0.321) (0.246)

Nr inv - 0 2
P-value - .0126 .0447
N = 2166 (722 CZ x 3), L = 38. Standard errors in parentheses. Observations weighted by beginning-of-period population. Significance level in testing procedure: 0.013.
Table 5: Impact of Immigration

Results

The results can be found in Table 5. The first column shows the results for ordinary least squares: the contemporaneous effect is 0.586, while the lagged effect is smaller and negative. When using all shares as valid IVs, both effects are larger in absolute terms, but only the contemporaneous effect is marginally statistically significant. The Hansen-Sargan test for this model gives a p-value of 0.0126, which is lower than the significance level of 0.013 used in the testing procedure.

When using AHC with this significance level in the downward testing procedure, two origin-country shares are selected as invalid: the shares of Mexicans in the US in 1970 and in 1980. The two selected IVs are thus a priori similar, in that they are shares from the same origin country. These shares are likely to be invalid because Mexican migrants were attracted to border states such as Texas and California by the good economic conditions in those states, both in the base year and in later periods. California's economy has a large agricultural sector, and both states are among the wealthiest in the US. It is therefore likely that wages or unobserved productivity shocks that drove the initial settlement are correlated over time, invalidating the initial shares. Moreover, Goldsmith-Pinkham2020Bartik find that Mexico has the highest sensitivity-to-misspecification weight; that is, the overall bias is most sensitive to any invalidity stemming from the Mexican share. Indeed, after controlling for the Mexican shares, the contemporaneous effect almost doubles, while the lagged effect triples.

7 Conclusion

We have proposed a novel method to select valid instruments and applied it to the estimation of the effect of immigration on wages in the US. The method can easily be applied to any other setting in which there are many candidate instruments. Another application that we plan to include is Mendelian randomization, the use of instrumental variables in epidemiology.

The advantages of our method are that it extends straightforwardly to the setting with multiple endogenous regressors and to the setting with heterogeneous treatment effects. It also performs well in the presence of weak instruments.

Ways to further improve the method would be to account for the variance of each just-identified estimator in the selection algorithm, and to apply it in nonlinear models. We also plan to further explore applications of our method to models with richer forms of heterogeneity.

Appendix A Illustration of the IV Selection Procedure for $P = 2$

In Figure 2, the procedure is illustrated. Here, we have a situation with four IVs and two endogenous regressors. Instrument No. 1 is invalid because it is directly correlated with the outcome, while the remaining three IVs (2, 3, 4) are related to the outcome only through the endogenous regressors and are hence valid.

In the first graph on the top left, we have plotted each just-identified estimate. The horizontal and vertical axes represent the coefficient estimates of the effects of the first and second regressor, respectively. Each point has been estimated with two IVs, in this case with the IV pairs 1-2, 1-3, 1-4, 2-3, 2-4 and 3-4, because there are $\binom{4}{2} = 6$ such pairs among the four candidate IVs.

In the initial step (0), each just-identified estimate has its own cluster. In step 1, we join the estimates which are closest in terms of their Euclidean distance, e.g. those estimated with pairs 2-3 and 2-4. These two estimates now form one cluster and we have only five clusters. We re-estimate the distance to this new cluster and continue with this procedure until there is only one cluster left in the bottom-right graph. We evaluate the Sargan test at each step, using the IVs which are involved in the estimation of the largest cluster. When the p-value is larger than a certain threshold, say 0.05, we stop the procedure. Ideally this will be the case at step 2 or 3 of the algorithm, because here the largest cluster (in orange) is formed only by combinations of valid IVs (2, 3 and 4). If this is the case, only the valid IVs are selected as valid.

Figure 2: Illustration of the algorithm

Appendix B $\mathcal{F}_0$ consists of valid IVs only

Next, we show that the family $\mathcal{F}_0$ with $\boldsymbol{q}_S = \boldsymbol{0}$ is composed of combinations of valid IVs, i.e. IVs with $\alpha_j = 0$, only.

Remark 1.

$\boldsymbol{\alpha}_S = \boldsymbol{0}$ is necessary and sufficient for $\boldsymbol{q}_S = \boldsymbol{0}$.

Proof:

First, prove sufficiency (direct proof): assume $\boldsymbol{\alpha}_S = \boldsymbol{0}$ holds. Then $\boldsymbol{q}_S = \Gamma_S^{-1}\boldsymbol{\alpha}_S = \boldsymbol{0}$ follows directly.
Second, prove necessity (proof by contraposition): assume $\boldsymbol{\alpha}_S \neq \boldsymbol{0}$; then $\boldsymbol{q}_S = \Gamma_S^{-1}\boldsymbol{\alpha}_S \neq \boldsymbol{0}$. The latter holds because otherwise the columns of $\Gamma_S$ would be linearly dependent, so $\Gamma_S$ would not be invertible and $\boldsymbol{q}_S$ would not exist, which it clearly does by Assumption 1.a.

This also implies that $\mathcal{F}_0$ consists of combinations of valid IVs only, and that all sets of valid IVs of cardinality $P$ are elements of $\mathcal{F}_0$. Hence, the following remark directly follows:

Remark 2.

$|\mathcal{F}_0| = \binom{L_{\mathcal{V}}}{P}$.

Appendix C One family can consist of different $\boldsymbol{\alpha}$ vectors

We have shown that the number of valid IVs determines the size of the family $\mathcal{F}_0$. However, this relation between the direct effects and the family size is available only when $\boldsymbol{q}_S = \boldsymbol{0}$.

Remark 3.

The mapping $(\Gamma_S, \boldsymbol{\alpha}_S) \mapsto \boldsymbol{q}_S = \Gamma_S^{-1}\boldsymbol{\alpha}_S$ is non-injective.

Proof:

Proof by counter-example: we show that there is more than one element of the domain which leads to the same image. Take any invertible $\Gamma_S$ and any $\boldsymbol{q} \neq \boldsymbol{0}$, and define $\boldsymbol{\alpha}_S = \Gamma_S \boldsymbol{q}$. For any other invertible matrix $\Gamma_{S'} \neq \Gamma_S$, define $\boldsymbol{\alpha}_{S'} = \Gamma_{S'} \boldsymbol{q}$. Then

$$\Gamma_S^{-1}\boldsymbol{\alpha}_S = \boldsymbol{q} = \Gamma_{S'}^{-1}\boldsymbol{\alpha}_{S'},$$

although $(\Gamma_S, \boldsymbol{\alpha}_S) \neq (\Gamma_{S'}, \boldsymbol{\alpha}_{S'})$. This means that all $(\Gamma_S, \boldsymbol{\alpha}_S)$ such that $\boldsymbol{\alpha}_S = \Gamma_S \boldsymbol{q}$ lead to the same solution $\boldsymbol{q}$.

Hence, even if the number of IVs with the same direct effect is smaller than $L_{\mathcal{V}}$, the largest family might still consist of combinations of invalid IVs, because the first-stage coefficient matrix also determines $\boldsymbol{q}_S$.
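For concreteness, a minimal numerical instance of this non-injectivity with $P = 1$ (our own example): $(\gamma_j, \alpha_j) = (1, 0.2)$ and $(\gamma_k, \alpha_k) = (2, 0.4)$ both yield $q = \alpha/\gamma = 0.2$, so two instruments with different direct effects belong to the same family.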

Appendix D Oracle Properties

This section gives proofs for Lemma 1 and Theorem 1. All proofs apply to the general case $P \geq 1$.

D.1 Proof of Lemma 1

Proof.

Consider the just-identified estimator $\hat{\boldsymbol{\beta}}_S$ with $S \in \mathcal{F}_{\boldsymbol{q}}$ and the clusters $C_1$ and $C_2$ defined in Lemma 1.

Under Assumptions 1 - 5:

$$\hat{\boldsymbol{\beta}}_S \xrightarrow{p} \boldsymbol{\beta} + \boldsymbol{q}_S \qquad (15)$$

and hence the distance between $\hat{\boldsymbol{\beta}}_S$ and the mean of $C_1$ converges to zero, while its distance to the mean of $C_2$ converges to a positive constant.

We show that the probability that $\hat{\boldsymbol{\beta}}_S$ is assigned to a cluster with elements of its own family goes to 1. $\hat{\boldsymbol{\beta}}_S$ is assigned to a cluster with elements of its own family iff its distance to the mean of $C_1$ is smaller than its distance to the mean of $C_2$. The following two events are hence equivalent: the assignment of $\hat{\boldsymbol{\beta}}_S$ to $C_1$, and this distance inequality. Under (15)