Agglomerative Hierarchical Clustering for Selecting Valid Instrumental Variables
Abstract
We propose an instrumental variable (IV) selection procedure which combines the agglomerative hierarchical clustering method and the Hansen-Sargan overidentification test for selecting valid instruments for IV estimation from a large set of candidate instruments. Some of the instruments may be invalid in the sense that they may fail the exclusion restriction. We show that under the plurality rule, our method can achieve oracle selection and estimation results. Compared with previous IV selection methods, our method deals effectively with the weak-instruments problem and extends easily to settings with multiple endogenous regressors and heterogeneous treatment effects. We conduct Monte Carlo simulations to examine the performance of our method and compare it with two existing methods, the Hard Thresholding method (HT) and the Confidence Interval method (CIM). The simulation results show that our method achieves oracle selection and estimation results in both single and multiple endogenous regressor settings in large samples when all the instruments are strong. Our method also works well when some of the candidate instruments are weak, outperforming HT and CIM. We apply our method to the estimation of the effect of immigration on wages in the US.
1 Introduction
Instrumental variables estimation is a widely used statistical method for analysing the causal effects of treatment variables on an outcome when the causal pathway between them is confounded. Consistent IV estimation requires that all instruments are valid. This requires that (a) instruments are associated with the endogenous variables (relevance condition) and (b) instruments do not affect the outcome directly or through unobserved factors (exclusion restrictions). In practice, a main challenge in IV estimation is that when there is a large number of candidate instruments, some of them may be invalid in the sense that they may fail the exclusion restrictions. Many IV applications select valid instruments from the set of potential instruments merely on the basis of economic intuition, or even simply include all the candidate instruments in IV estimation. This practice is problematic because including invalid instruments may lead to severely biased results. It is therefore important to develop IV selection methods for settings with possibly invalid instruments, where complete knowledge about the candidate instruments' validity is absent.
The importance of developing IV selection techniques can be illustrated by a class of IV applications: shift-share IV estimation in international economics, where the instruments are constructed from many class-specific share variables. For example, Apfel (2019) estimates the effect of immigration on wages in the US labor market. The instrument for the contemporaneous immigration pattern is the lagged immigration pattern, which is constructed from 19 origin-country-specific share variables. Research in this area has documented that for the IV to be valid, each of the 19 share variables must satisfy the exclusion restriction. However, some of the shares may violate the exclusion restrictions, as they may affect the wage variable directly through a long-term dynamic adjustment process, or be correlated with unobserved demand shocks.
In this paper, we propose an IV selection and estimation method which combines the agglomerative hierarchical clustering algorithm, a machine learning algorithm typically employed in cluster analysis, with the Sargan test for overidentifying restrictions. The estimator that we develop relies on the plurality rule (Guo et al., 2018), which states that the largest group of IVs consists of valid instruments, where instruments form a group if their instrument-specific just-identified estimators converge to the same value. Under the plurality rule, our method can achieve oracle selection, which means that the valid instruments can be selected consistently, and the IV estimator using the selected valid instruments has the same limiting distribution as the estimator we would use if the true set of valid instruments were known.
Previous work has tackled the IV selection problem in the single endogenous variable case. Kang et al. (2016) propose a selection method based on the least absolute shrinkage and selection operator (LASSO). Windmeijer et al. (2019) improve on this by proposing an adaptive-Lasso-based method under the assumption that more than half of the candidate instruments are valid (the majority rule), which theoretically guarantees consistent selection. Guo et al. (2018) propose the Hard Thresholding method under a sufficient and necessary identification condition (the plurality rule), which relaxes the majority rule. Under the same identification condition, Windmeijer et al. (2020) propose the Confidence Interval method, which has better finite sample performance. Our research adds to the literature in five ways:

We combine agglomerative hierarchical clustering with a traditional statistical test, the Sargan overidentification test, to yield a novel downward testing algorithm for IV selection. This new method provides the theoretical guarantee that it can select the true set of valid instruments consistently, and is computationally feasible.

We extend the method to settings with multiple endogenous regressors. Such an extension is not available for the aforementioned methods, but it is straightforward in our setting.

Our method performs well in the presence of weak valid or invalid instruments, which is an advantage over existing methods.

We also discuss the application of our method to a setting with heterogeneous treatment effects.

Our algorithm is computationally less complex than the CIM. Also, compared with the commonly used K-means clustering, our method does not need to pre-specify the number of clusters or any starting points – the only pre-specified parameter for our algorithm is the critical value for the Sargan test, which is well established in the existing theory to guarantee consistent selection, making our method easy to implement in practice.
We conduct Monte Carlo simulations to examine the performance of our method and compare it with two existing methods, the Hard Thresholding method (HT) and the Confidence Interval method (CIM). The simulation results show that our method achieves oracle performance in both single and multiple endogenous regressor settings in large samples when all the instruments are strong. Our method also works well when some of the candidate instruments are weak, outperforming HT and CIM. We apply our method to shift-share IV estimation to estimate the dynamic effects of immigration on wages in the US.
The remainder of this paper is structured as follows. In Section 2, we state the model and assumptions and illustrate some of the well-established properties of the 2SLS just-identified estimator. In Section 3, we describe the basic method and the algorithm, and investigate its asymptotic properties. In Section 4, we present extensions to settings with multiple endogenous regressors and heterogeneous treatment effects, and discuss our method's performance in the presence of weak instruments. In Section 5, we provide Monte Carlo simulation results. In Section 6, we apply our method to estimate the effects of immigration on wages. Section 7 concludes.
2 Model and Assumptions
2.1 Setup
In the following we introduce notational conventions used throughout this paper. Matrices are in bold. Vectors are in bold and italic. Let be an vector of the observed outcome, , …, be endogenous regressor vectors (each ), which can be subsumed in an  matrix , and , …, be instrument vectors, which can be subsumed in an  matrix . Let the error terms be and for , which are all error vectors and are correlated with cov. The latter covariances measure the endogeneity of the regressors in . The coefficient vector of interest is (). The matrix contains the first-stage coefficients. Let be the number of invalid instruments in set , be the number of valid instruments in set and be the total number of instruments in set . The mean of a variable is , denotes the L2-norm and denotes cardinality. The projection matrix is , the annihilator matrix is and are the fitted values.
2.2 Model
We start from the model setup with a single endogenous regressor, i.e. throughout Section 2 and Section 3, . The extension of our method to the cases with multiple endogenous regressors can be found in Section 4.1. Following previous research, we adopt the following observed data model which takes the potentially invalid instruments into account:
(1) 
and the first stage reduced form is
(2) 
is a vector with entries , each of which is associated with an instrument. Each entry indicates whether the corresponding instrument has a direct effect on the outcome variable and hence is invalid. An IV which is associated with a zero entry in is valid (Guo et al., 2018). The number of valid IVs is .
2.3 Assumptions
The first assumption ensures that the just-identified estimators all exist.
Assumption 1.
Existence of justidentified estimators.
For all , we assume
Assumption 2.
Rank assumption
Assumption 3.
Error structure
Assumption 4.
The next assumption is key. It states that the largest group is composed of valid IVs. A group of IVs is defined as a set whose just-identified estimators converge to the same value , i.e. (Guo et al., 2018).
Assumption 5.
Plurality Rule
The assumptions above will be modified when there is more than one endogenous regressor.
2.4 Properties of Just-identified Estimators
From and , we have the outcome-instrument reduced form
where . There are just-identified 2SLS estimators. We write these estimators as in Windmeijer et al. (2020).
where and are the OLS estimators for and , respectively. Then we have
Hence, the inconsistency of is . Following the definition in Guo et al. (2018), we define a group as:
Then the group consisting of all valid instruments is
Let there be groups.
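To make this concrete, the construction of the just-identified estimators can be sketched in a short simulation (our own illustration: the instrument count, coefficient values and error structure below are assumptions, not taken from the paper). Each just-identified estimate is the ratio of the reduced-form coefficient to the first-stage coefficient on that instrument, with the remaining instruments controlled for because both regressions include the full instrument matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, kz = 2000, 6
gamma = np.full(kz, 0.5)                # first-stage strengths (assumed)
alpha = np.array([0, 0, 0, 1, 1, 1.0])  # direct effects: last three IVs invalid (assumed)
beta = 0.5                              # true causal effect (assumed)

Z = rng.standard_normal((n, kz))
u = rng.standard_normal(n)
v = 0.3 * u + rng.standard_normal(n)    # endogeneity through a shared error component
d = Z @ gamma + v
y = Z @ alpha + beta * d + u

# Reduced-form and first-stage OLS coefficients on the full instrument set;
# the j-th just-identified estimate is their ratio.
Gamma_hat = np.linalg.lstsq(Z, y, rcond=None)[0]
gamma_hat = np.linalg.lstsq(Z, d, rcond=None)[0]
beta_j = Gamma_hat / gamma_hat
print(np.round(beta_j, 2))
# valid IVs cluster near beta = 0.5; invalid ones near beta + alpha_j / gamma_j = 2.5
```

The invalid instruments' estimates are inconsistent by exactly the ratio of their direct effect to their first-stage strength, which is the inconsistency term discussed above.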
3 IV Selection and Estimation Method
We explore clustering methods for IV selection and estimation. First we fit the general clustering framework to the IV selection problem, which is summarized in the minimisation problem in 3. This general method needs a pre-specified parameter , the number of clusters. We show that when equals the number of groups, it can achieve consistent selection. However, the fact that consistent selection depends on makes the method difficult to implement in practice, as we do not have prior knowledge of the number of groups. If K is too large (larger than the number of groups), the largest group will be split. If K is too small, the largest group might end up in a cluster with some other group. To tackle this problem, we propose a downward testing procedure which combines the agglomerative hierarchical clustering method (Ward's method) with the Sargan test for overidentifying restrictions to select the valid instruments, allowing us to select systematically.
Note that in this section, we develop our methods and properties with . In Section 4.1 we extend the method to cases where . All proofs in the Appendix are for a general .
3.1 Clustering Method for IV Selection
Let be a partition of the just-identified estimators into cluster cells. Let be the set of identities of the just-identified estimators which are in cluster . The clustering result is the solution to the following minimization problem:
(3) 
is the mean of all just-identified estimators in cluster .
Based on Assumption 5, the group that consists of valid IVs is selected as the cluster that contains the largest number of just-identified estimators:
Then the set of invalid instruments is
Now we show that when the number of clusters is equal to the number of groups , , then the partition minimizing the sum in 3 is such that , i.e. each cluster is formed by a group. Define this partition as the true partition .
To see that, first note that if the partition is such that ,
, we have , and . This is the case for all , hence . Second, if the partition is such that some , i.e. , then for some and . This means that when there is a unique solution for 3, which is such that . Based on Assumption 5, the valid instruments are those contained in the largest cluster. This of course relies on the correct choice of which satisfies .
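The objective in 3 is a within-cluster sum of squares over the one-dimensional just-identified estimates. A minimal sketch (our own illustration, with hypothetical estimate values) confirms the argument above: a partition that respects the groups attains a smaller objective than one that splits the largest group:

```python
import numpy as np

def wcss(estimates, partition):
    """Objective (3): within-cluster sum of squared deviations from cluster means."""
    est = np.asarray(estimates)
    return sum(np.sum((est[list(cell)] - est[list(cell)].mean()) ** 2)
               for cell in partition)

est = [0.49, 0.51, 0.52, 2.4, 2.6]   # three estimates in one group, two in another
grouped = [(0, 1, 2), (3, 4)]        # clusters coincide with the groups
split = [(0, 1, 3), (2, 4)]          # the larger group is split across clusters
assert wcss(est, grouped) < wcss(est, split)
```

Mixing estimates from different groups inflates the deviations from the cluster means, so the group-respecting partition minimises the objective.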
3.2 Ward’s Algorithm for IV Selection
To tackle the difficulty of choosing the correct value of without prior knowledge of the number of groups, we propose a selection method which combines Ward's algorithm, a general agglomerative hierarchical clustering procedure proposed by Ward (1963), with the Sargan test for overidentification. Our selection algorithm has two parts. The first part is Ward's algorithm, listed in Algorithm 1, which generates clusters of the just-identified estimators for each . After obtaining the clusters for each , we use a downward testing procedure based on the Sargan test (Algorithm 2) to select the set of valid instruments.
Algorithm 1.
Ward’s algorithm

Input: Each just-identified point estimate is calculated.

Initialization: Each just-identified estimate has its own cluster. The total number of clusters at the beginning hence is .

Joining: The two clusters which are closest, as measured by the Euclidean distance between their means, are joined into a new cluster.

Iteration: The joining step is repeated until all just-identified point estimates are in one cluster.
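The steps above can be sketched directly (our own illustration; the estimate values are hypothetical). Ward's criterion joins the pair of clusters whose merge least increases the within-cluster sum of squares:

```python
import numpy as np

def ward_path(estimates):
    """Agglomerative clustering with Ward's criterion on 1-D estimates.

    Returns the list of partitions from singleton clusters down to one
    cluster, mirroring the steps of Algorithm 1.
    """
    est = np.asarray(estimates, dtype=float)
    clusters = [[i] for i in range(len(est))]
    path = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na, nb = len(clusters[a]), len(clusters[b])
                gap = est[clusters[a]].mean() - est[clusters[b]].mean()
                # Ward's criterion: increase in within-cluster sum of squares.
                d = na * nb / (na + nb) * gap ** 2
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        path.append([list(c) for c in clusters])
    return path

# Hypothetical estimates: IVs 0-3 form the largest (valid) group.
path = ward_path([0.48, 0.50, 0.52, 0.49, 2.1, 2.2])
for partition in path:
    print(partition)
```

Along the path, the four close estimates merge first; only near the end are the two distant estimates joined with them, which is the behaviour the selection step exploits.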
This yields a path of steps, on which there are clusters of size . After generating the whole clustering path by Algorithm 1, we select the set of valid instruments following Algorithm 2:
Algorithm 2.
Selection

Starting from , find the cluster that contains the largest number of just-identified estimators.

Perform the Sargan test on the instruments contained in the largest cluster, using the rest as controls.

Repeat for each .

Select the largest cluster (in terms of the number of just-identified estimators) that is not rejected by the Sargan test. If there are multiple such clusters, select the one with the smallest Sargan statistic.

Select the instruments contained in the cluster from Step 4 as valid instruments.
The Sargan statistic in Step 4 is given by
where is the 2SLS estimator using the instruments contained in the largest cluster for each as valid instruments and controlling for the rest of the instruments, and is the 2SLS residual. We show later that to guarantee consistent selection, the critical value for the Sargan test, denoted by , should satisfy and . In practice, we choose the critical value following Windmeijer et al. (2020).
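A minimal sketch of the Sargan statistic computation (our own illustration: this uses the standard form n times u'P_Z u / u'u with 2SLS residuals, and the simulated data-generating values below are assumptions):

```python
import numpy as np

def sargan(y, d, Z, valid):
    """Sargan statistic treating Z[:, valid] as instruments and the remaining
    columns of Z as included (invalid) controls."""
    W = np.column_stack([d, np.delete(Z, valid, axis=1)])   # second-stage regressors
    proj = lambda M: Z @ np.linalg.solve(Z.T @ Z, Z.T @ M)  # projection on Z
    W_hat = proj(W)                                         # first-stage fitted values
    coef = np.linalg.solve(W_hat.T @ W, W_hat.T @ y)        # 2SLS coefficients
    u = y - W @ coef                                        # 2SLS residuals
    return len(y) * (u @ proj(u)) / (u @ u)

rng = np.random.default_rng(1)
n = 2000
Z = rng.standard_normal((n, 6))
eps = rng.standard_normal(n)
d = 0.5 * Z.sum(axis=1) + 0.3 * eps + rng.standard_normal(n)
y = Z[:, 3:] @ np.ones(3) + 0.5 * d + eps   # the last three IVs have direct effects

s_ok = sargan(y, d, Z, valid=[0, 1, 2])     # correct set: moderate statistic
s_bad = sargan(y, d, Z, valid=[0, 1, 2, 3]) # wrongly treats an invalid IV as valid
print(round(s_ok, 1), round(s_bad, 1))
```

Treating an invalid instrument as valid leaves its direct effect in the residual, which the overidentification test picks up, inflating the statistic.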
The procedure is illustrated in Figure 2. Here, we have a situation with six instruments. Three of them are valid, as they affect the outcome variable only through the endogenous regressor, while this is not the case for the other three invalid instruments. In the graph, the circles above the real line denote the just-identified estimators for the coefficient estimated with each of the six instruments. From left to right, we number these estimators and their corresponding instruments as No.1 to No.6.
In the initial Step (0) of the clustering process, each just-identified estimator has its own cluster. In Step 1, we join the two estimators which are closest in terms of their Euclidean distance, i.e. those estimated with instruments No.3 and No.4 (the two orange circles). These two estimators now form one cluster and we only have five clusters. We recheck the distance between each pair of the five clusters and merge the closest two into a new cluster. We continue this procedure until there is only one cluster left in the bottom right graph. We evaluate the Sargan test at each step, using the instruments contained in the largest cluster. When the p-value is larger than a certain threshold, say , we stop the procedure. Ideally this will be the case at Step 2 or 3 of the algorithm, because here the largest group (in orange) is formed only by valid IVs (No.2, No.3 and No.4). If this is the case, only the valid IVs are selected as valid.
3.3 Oracle Selection and Estimation Property
In this section, we state the theoretical properties of the IV selection results obtained by Algorithm 1 and Algorithm 2 and of the post-selection estimators. See Section 4.1 for detailed theoretical results developed for the general case . We establish that our method can achieve oracle properties in the sense that it can select the valid instruments consistently, and that the post-selection IV estimator has the same limiting distribution as if we knew the true set of valid instruments.
Theorem 1.
The post-selection 2SLS estimator using the selected valid instruments and controlling for the selected invalid instruments has the same asymptotic distribution as the oracle estimator:
Theorem 2.
Let with , being the selected invalid and valid instruments respectively. Let be the 2SLS estimator given by
Under Assumptions 1-5, the limiting distribution of is
where is the asymptotic variance for the oracle 2SLS estimator given by
with being the true set of invalid instruments.
The proof of Theorem 2 follows the proof in Guo et al. (2018).
3.4 Computational Complexity
Recent implementations of the hierarchical agglomerative clustering algorithm have a computational cost of (Amorim, 2016). In the downward testing procedure, a maximum of different models needs to be tested. Therefore, the computational cost of the downward testing algorithm is . This is an improvement on the CIM, which has a time complexity of and for which the maximal number of tests is .
4 Extensions
In this section, we propose extensions of the method to a setting with multiple endogenous regressors and to a setting with heterogeneous effects. We also discuss the performance of our method in the presence of weak instruments, compared with that of the HT and CI methods.
4.1 Multiple Endogenous Regressors
One shortcoming of previous methods that select invalid instruments is that they only allow for one endogenous regressor. In this section we therefore show how our method can be naturally extended to select invalid instruments when . First of all, the inputs of our method, the just-identified estimators, are now estimated from all the combinations of instruments from . Hence we now have instead of just-identified estimators. Let be a set of identities of any instruments such that the model is exactly identified with these instruments. Let denote the corresponding instrument matrix. To guarantee that all the just-identified estimators exist, we modify Assumption 1 as follows:
Assumption 1.a.
Existence of justidentified estimators
For all possible values of , let be the combinations of the row of for all . Then we assume
The plurality assumption also needs modification for . For , Assumption 5 states that the valid instruments form the largest group, where instruments form a group if their just-identified estimators converge to the same value. If we find the largest set of just-identified estimators that converge to the same value, then this set is automatically the largest group of instruments, as each just-identified estimator is estimated with a single instrument. However, when , each just-identified estimator is estimated with multiple instruments, so the equivalence between the largest set of just-identified estimators and the largest group of instruments may not hold. In this case, we modify the plurality rule so that it is based on combinations of instruments instead of individual instruments. The modification starts with revisiting the asymptotics of the just-identified estimators for .
There are just-identified models. We write the corresponding just-identified estimators for and analogously to the proof of Proposition A1 in Windmeijer et al. (2020) for the case . First, for an arbitrary , partition the matrix , where is a matrix containing the th columns of , and is a matrix containing the remaining columns of . is the equivalent partition of the matrix of first-stage coefficients. , then , with
The just-identified 2SLS estimators using as instruments and controlling for the remaining instruments can be written as
Note that is equal to
By Assumption 4, we have the following asymptotics
Hence, the inconsistency of is and there are inconsistency terms . Let be the just-identified 2SLS estimator estimated with . As not every IV is associated with a single when , we introduce the concept of a family: a family is a set of IV combinations whose associated just-identified estimators converge to the same vector .
Then the family that consists of IV combinations which generate consistent estimators is
Let there be families. Note that when , a group of IVs is automatically a family.
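The counting behind the family of valid combinations can be illustrated with hypothetical set sizes (our own sketch): every p-element subset of the instruments indexes one just-identified estimator, and a combination belongs to the valid family exactly when all of its members are valid.

```python
from itertools import combinations
from math import comb

kz, p = 6, 2                 # hypothetical: 6 IVs, 2 endogenous regressors
valid = {0, 1, 2, 3}         # hypothetical set of valid IVs

# Every p-element subset of instruments yields one just-identified estimator.
subsets = list(combinations(range(kz), p))
assert len(subsets) == comb(kz, p)   # C(6, 2) = 15 just-identified models

# A combination belongs to the valid family iff all of its IVs are valid.
valid_family = [s for s in subsets if set(s) <= valid]
assert len(valid_family) == comb(len(valid), p)   # C(4, 2) = 6
```
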
Analogously to Assumption 5, we assume that is the largest family:
We show in Appendix B, that a combination of IVs is an element of if and only if all of the IVs used for its estimation are in fact valid. This means that the family of valid IVs consists of all combinations that use IVs out of and hence . Therefore, the plurality assumption can be modified to
Assumption 5.a.
New plurality
The inconsistency term of the other families depends on the firststage coefficient vectors and hence there is no direct relation from to . This means that one family can be estimated with IV combinations which have different vectors . We show this in Appendix C.
One way in which the new plurality could be fulfilled is
Assumption 5.b.
and
where and are two different sets of IVs.
The second part of the assumption makes sure that the family of valid IVs consists of more than one element, without which Assumption 5.a. cannot be fulfilled. The third part of this assumption makes sure that sets containing at least one invalid IV do not converge to the same value, so that the largest group of valid IVs also translates into the largest family. This can be seen as a technical assumption and is stronger than needed.
The procedure to estimate is analogous to the one in the preceding section (see Appendix A for an illustration), except that now we need to account for the presence of families.
The valid IVs are then selected as those that are involved in estimating the largest cluster.
The cluster containing the valid instruments is chosen as the one where the number of estimates in the cluster is maximal and the Sargan test is not rejected (Sargan statistic smaller than the threshold ). In cases where there are multiple such clusters, we select the cluster in which more IVs are involved.
One ambiguity arises in finite samples: asymptotically, following Assumption 5.b, the largest group (in terms of direct effects) being valid also implies that the largest family is valid. In finite samples, however, the number of IVs involved in the estimation of the largest cluster might be smaller than the number involved in the estimation of another cluster. Therefore, we could also select the valid IVs by directly selecting the family associated with the maximal number of IVs instead of the largest cluster. Which of the two methods should be used is an empirical question.
The method has oracle properties as stated in Theorem 1 and Theorem 2. Here we formally establish the theoretical results for the general case . See Appendix D for the proofs of all theorems. The following lemma establishes that when assigning a just-identified estimator to either (1) clusters that are formed by just-identified estimators from the same family as , or (2) clusters that contain at least one element from a different family than , asymptotically will be assigned to the first type of cluster.
Lemma 1.
Let be a just-identified estimator such that , and be a cluster such that all the elements in are from , . For , let be a cluster such that at least one element in is from a different family than , and . Under Assumptions 1 to 5 and Algorithm 1, when assigning to either or , is assigned to with probability converging to 1.
In Algorithm 1, we start with clusters. At each subsequent step, according to Step 3 of Algorithm 1, two clusters are joined to form a new cluster. For a cluster that contains more than one just-identified estimator, the cluster mean can be viewed as a single estimator averaging over all the estimators in it (e.g. and in Step 3 of Algorithm 1). Therefore, the joining process at each step of Algorithm 1 can be viewed as assigning an estimator to a cluster. Note that a cluster that consists of just-identified estimators from the same family can be viewed as an estimator from this family. Based on Lemma 1, along the path of Algorithm 1, estimators from different families will not be joined with each other until all the estimators from the same family have merged into one cluster. If, for each family, the just-identified estimators contained in it have merged into the same cluster, then the total number of clusters is . This implies that when the number of clusters is smaller than , the current clusters must be subsets of families.
Corollary 1.
Under Assumptions 1 to 4, in Steps 3 and 4 of Algorithm 1:
To better understand why this is the case, consider the following analogy. There are guests ( just-identified estimates) who belong to families. These people live in a hotel, which has rooms (clusters). Each day, one room disappears, and one of the guests needs to move into the room of some other guest. The people in a family have closer ties, so the person whose room disappears will move into the room of somebody from their own family. This goes on until each family is living in its own crowded room. The hotel now continues to shrink. Only now are people from different families merged into the same rooms. The largest family can be detected when all people from the same family have been merged into one room, but the people from the other families have not yet all been merged into one room (or have each just been merged into one room).
In Algorithm 1, the number of clusters starts at and ends at . At each step in between, the number of clusters decreases by 1, so there must be a step where . Based on Lemma 1 and Corollary 1, estimators from different families are joined together only after all elements of their own family have been completely joined into their clusters. This implies in particular that when , there is a cluster . Therefore, the path generated by Algorithm 1 contains the true partition, as there must be one step such that .
Corollary 2.
When , .
The theoretical results above establish that the selection path generated by Algorithm 1 covers the true set of valid instruments . Next we show that, via Algorithm 2, we can locate this set and select the valid instruments consistently. This consistent selection property is summarized in Theorem 1, which holds for under Assumptions 1 (1.a.) to 5 (5.a., 5.b.). The same is true for Theorem 2.
4.2 Weak Instruments
A major advantage of our method is that it can deal with the presence of weak instruments (valid or invalid) effectively. The intuition is that for weak instruments, the instrument-specific just-identified estimators tend to have much larger magnitude than those of the strong instruments. Hence, the Euclidean distance between these two types of estimators tends to be large, making them less likely to be joined with each other. The existing methods, HT and CIM, can face problems in the presence of weak instruments. CIM always selects the weak instruments as valid, while the first-stage hard thresholding of the HT method might keep invalid instruments and rule out valid ones under certain correlation structures among the instruments.
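This intuition can be checked in a small simulation (our own illustration; all parameter values are assumptions): the just-identified estimates of a weak instrument are far more dispersed around the true value than those of a strong instrument, so they rarely land inside the tight cluster formed by strong valid IVs.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, beta = 500, 200, 0.5

def just_identified(gamma):
    """One just-identified IV estimate with first-stage strength gamma."""
    z = rng.standard_normal(n)
    u = rng.standard_normal(n)
    d = gamma * z + 0.5 * u + rng.standard_normal(n)   # endogenous regressor
    y = beta * d + u
    return (z @ y) / (z @ d)

strong = np.array([just_identified(0.5) for _ in range(reps)])
weak = np.array([just_identified(0.02) for _ in range(reps)])

print(np.median(np.abs(strong - beta)))  # tightly concentrated around beta
print(np.median(np.abs(weak - beta)))    # far more dispersed
```
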
4.3 Heterogeneous Treatment Effects
The instrumental variable estimator also has a local average treatment effect (LATE) interpretation: it estimates the average treatment effect for the subpopulation whose treatment can be changed by the instrument (Imbens and Angrist, 1994). Hence, LATEs naturally vary with the instruments. For example, an increase in the minimum school-leaving age and proximity to school will see different populations increase their schooling. In this section we describe such a setting and argue that our method can retrieve the largest group associated with a given LATE, or the whole set of different LATEs.
For simplicity, we look at a setting with a binary treatment , a binary instrument and potential outcomes and . The outcome and the treatments can be written as
Assumption 6.
Independence
Assumption 7.
First Stage
Assumption 8.
Monotonicity
If the last three assumptions are fulfilled, Imbens and Angrist (1994) show that the IV estimand is the average treatment effect of the compliers:
(4) 
In the following, we present a setting in which the LATEs depend on one potentially unobserved variable . For this, we make use of the setting in Angrist and Fernandez-Val (2010).
The treatment is determined by the following latent-index assignment mechanism
(5) 
where and the potential outcomes depend on the variable :
where the errors are .
Angrist and Fernandez-Val (2010) then assume
Assumption 9.
Conditional Effect Ignorability:
The authors then show that under this assumption the LATE can be written as a function of :
(6) 
Next, we are interested in a setting where the by-IV treatment effects form groups:
(7) 
This might be the case when different compliant populations have the same , or different lead to the same . Keep in mind that the number of groups is .
This follows by the same assumptions as above. In the same way:
Theorem 3.
Consistent selection of LATE groups
Let be the critical value for the Sargan test in Algorithm 2. Under Assumptions 1 to 5 and Lemma 2, for and , there is at least one step such that and
where is the largest group.
The difference from the setting with invalid IVs is that in the LATE setting not only the largest cluster contains valuable information; the smaller clusters also contain coefficient estimates obtained with valid instruments.
5 Monte Carlo Simulations
First, we apply our method to simulated data. In the single regressor setting, we want to compare the performance of the new clustering method with that of the existing Confidence Interval method and the Two-Stage Hard Thresholding method. We therefore run simulations which closely follow the setting in Windmeijer et al. (2020): there are 21 IVs, twelve of which are invalid, while nine are valid with , , where is an vector of ones and is an vector of zeros. We set and . The true is and with . Errors are
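A data-generating process of this shape can be sketched as follows (our own illustration: the 21-IV split into 9 valid and 12 invalid follows the text, but every coefficient value below is an assumption rather than the exact WLHB parameterisation):

```python
import numpy as np

def simulate(n, seed=0):
    """Simulated data with 21 IVs: 9 valid, 12 invalid (assumed coefficient values)."""
    rng = np.random.default_rng(seed)
    kz = 21
    alpha = np.concatenate([np.zeros(9), np.full(12, 0.4)])  # direct effects (assumed)
    gamma = np.full(kz, 0.4)                                  # first-stage strength (assumed)
    beta = 0.1                                                # true effect (assumed)
    Z = rng.standard_normal((n, kz))
    u = rng.standard_normal(n)
    v = 0.25 * u + rng.standard_normal(n)                     # endogeneity (assumed)
    d = Z @ gamma + v
    y = Z @ alpha + beta * d + u
    return y, d, Z

y, d, Z = simulate(500)
```
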
The results are in Table 1. The oracle 2SLS estimator has the lowest bias, and the coverage rate of its 95% confidence interval is 0.948. The naive 2SLS estimator has a median absolute error of about 1.034 and never covers the true value. As expected, this does not change when the sample size is increased to 10,000.
When using two-stage hard thresholding (HT) with 500 observations, the MAE is larger than for naive 2SLS and the method never chooses the oracle model, so none of the confidence intervals covers the true value. When using CIM, the MAE is already low when , the number of IVs chosen as invalid is close to twelve, the frequency with which the oracle model is selected is 0.96, and the coverage rate is about 0.94. Results are very similar for our agglomerative clustering method. When the sample size increases, the performance of all three selection methods improves and the MAE equals that of the oracle estimator in all cases. The coverage rate is now very close to the correct 95% for all estimators, and the probability of selecting the oracle model is close to one, except for HT where it is 0.83, which, however, does not worsen HT's performance in terms of MAE.
MAE  SD  # invalid  p all-inv  Coverage  p oracle  
N=500  
oracle  0.017  0.024  12  1  0.948  1 
naive  1.034  0.026  0  0  0  0 
HT  1.174  0.124  12.932  0  0  0 
CIM  0.018  0.215  12.043  0.992  0.936  0.963 
HC  0.018  0.183  12.073  0.987  0.928  0.974 
N=5000  
oracle  0.005  0.007  12  1  0.952  1 
naive  1.053  0.007  0  0  0  0 
HT  0.005  0.008  12.321  1  0.952  0.73 
CIM  0.005  0.008  12.012  1  0.952  0.988 
HC  0.005  0.125  12.05  0.993  0.937  0.981 
N=10,000  
oracle  0.004  0.005  12  1  0.941  1 
naive  1.05  0.005  0  0  0  0 
HT  0.004  0.006  12.206  1  0.929  0.83 
CIM  0.004  0.005  12.012  1  0.939  0.988 
HC  0.004  0.104  12.047  0.996  0.926  0.979 
This table reports the median absolute error, standard deviation, number of IVs selected as invalid, frequency with which all invalid IVs have been selected as invalid, coverage rate of the 95% confidence interval, and frequency with which the oracle model has been selected. The true coefficient is as specified in the text. The WLHB and invalid-weaker settings are described in the text. 1000 repetitions per setting.
Table 2
MAE  SD  # invalid  p allinv  Coverage  p oracle
N=500  
oracle  0.017  0.024  12  1  0.947  1 
naive  0.204  0.054  0  0  0.463  0 
HT  0.017  0.024  12  1  0.947  1 
CIM  13.522  8.767  11.509  0.023  0.079  0 
HC  0.017  1.146  12.012  0.991  0.934  0.983 
N=5000  
oracle  0.005  0.007  12  1  0.959  1 
naive  0.357  0.018  0  0  0  0 
HT  0.005  0.007  12  1  0.959  1 
CIM  20.401  4.613  10.329  0.014  0.016  0.002 
HC  0.006  4.296  11.967  0.923  0.876  0.913 
N=10,000  
oracle  0.003  0.005  12  1  0.947  1 
naive  0.356  0.013  0  0  0  0 
HT  0.005  0.056  10.84  0.798  0.756  0.798 
CIM  21.895  5.451  10.952  0.054  0.053  0.044 
HC  0.004  4.46  11.978  0.928  0.862  0.908 
This table reports the median absolute error, standard deviation, number of IVs selected as invalid, frequency with which all invalid IVs have been selected as invalid, coverage rate of the 95% confidence interval, and frequency with which the oracle model has been selected. The true coefficient is as specified in the text. The WLHB and invalid-weaker settings are described in the text. 1000 repetitions per setting.
Table 3
MAE  SD  # invalid  p allinv  Coverage  p oracle
N=500  
oracle  0.138  0.192  12  1  0.939  1 
naive  1.84  0.04  0  0  0  0 
HT  1.714  0.19  9.953  0  0  0 
CIM  1.264  0.571  6.121  0  0  0 
HC  1.224  0.577  11.43  0.077  0.075  0.006 
N=5000  
oracle  0.05  0.071  12  1  0.954  1 
naive  1.841  0.012  0  0  0  0 
HT  0.268  0.577  16.642  0.596  0.557  0 
CIM  1.241  0.477  10.501  0.147  0.139  0.147 
HC  0.201  0.766  13.047  0.541  0.49  0.16 
N=10,000  
oracle  0.036  0.052  12  1  0.959  1 
naive  1.845  0.008  0  0  0  0 
HT  0.122  0.256  16.256  0.951  0.783  0.004 
CIM  0.055  0.567  11.73  0.715  0.686  0.71 
HC  0.055  0.589  12.655  0.828  0.754  0.536 
This table reports the median absolute error, standard deviation, number of IVs selected as invalid, frequency with which all invalid IVs have been selected as invalid, coverage rate of the 95% confidence interval, and frequency with which the oracle model has been selected. The true coefficient is as specified in the text. The WLHB and invalid-weaker settings are described in the text. 1000 repetitions per setting.
Table 4
MAE  SD  # invalid  p allinv  Coverage  p oracle
P=2  
N=500  
Oracle  0.025  0.041  0  0  0.92  0 
Naive  0.104  0.056  12  1  0.01  1 
AC  0.073  0.137  12.28  0.53  0.47  0.3 
N=5000  
Oracle  0.008  0.012  0  0  0.95  0 
Naive  0.092  0.014  12  1  0  1 
AC  0.008  0.011  12.04  1  0.96  0.96 
N=10000  
Oracle  0.006  0.009  0  0  0.95  0 
Naive  0.092  0.011  12  1  0  1 
AC  0.006  0.01  12.08  1  0.92  0.94 
P=3  
N=500  
Oracle  0.036  0.053  0  0  0.89  0 
Naive  0.679  0.104  12  1  0  1 
AC  0.112  0.208  10.95  0.21  0.19  0.19 
N=5000  
Oracle  0.01  0.015  0  0  0.88  0 
Naive  0.592  0.029  12  1  0  1 
AC  0.012  0.053  12.02  0.86  0.77  0.83 
N=10000  
Oracle  0.007  0.011  0  0  0.91  0 
Naive  0.592  0.022  12  1  0  1 
AC  0.007  0.025  12.05  0.97  0.86  0.95 
This table reports the median absolute error, standard deviation, number of IVs selected as invalid, frequency with which all invalid IVs have been selected as invalid, coverage rate of the 95% confidence interval, and frequency with which the oracle model has been selected. For the first two, means over the statistics for each regressor are taken. The true coefficient is as specified in the text. The settings are described in the text. 100 repetitions per setting.
Following the discussion of weak instruments in Section 4.2, we expect the performance of CIM to deteriorate in settings where the invalid IVs are weaker than the valid ones, while HT should detect the invalid instruments. To create such a setting, we divide the first-stage coefficients of the invalid IVs by 10. The results can be found in table 2. The HT method attains oracle performance for the smaller samples but deteriorates when the sample size is 10,000. As expected, CIM has low coverage, and its probability of selecting the true model is almost zero for all sample sizes. Its MAE can be as high as 20, while that of the naive estimator is just 0.35. Our AC method is close to oracle performance for all sample sizes. Hence, in this setting HT and AC outperform CIM.
What happens when the valid IVs are weak? Following the discussion above, we would expect the HT method to use only valid and relevant IVs and hence classify too few IVs as valid. The CIM should work well, while our AC method could deteriorate because of the poor finite-sample behavior of the point estimates that use valid IVs. The results are in table 3. Indeed, HT ends up selecting around 16 instead of 12 IVs as invalid (or not relevant), driving the oracle rate close to zero. For the largest sample size, its MAE is about four times the oracle MAE. The CIM performs well when the sample size is 10,000, but much worse for smaller sizes. Our method weakly outperforms the other two in terms of MAE, and the reported statistics approach oracle performance.
Now we want to inspect the performance of our method when there are multiple endogenous regressors; the existing selection methods do not allow for such an extension. We again draw 21 IVs, with a first-stage coefficient vector for each endogenous regressor and an additional one when there is a third endogenous regressor. The rest of the parameters are the same as before.
With this setting we estimate the model for P = 2 and P = 3 over 100 replications. The results can be found in table 4. Again, the performance of our estimator approaches that of the oracle estimator as the sample size grows.
These simulations illustrate the two key advantages of our method. First, in settings with weak instruments our method can outperform the existing methods. If it is not known whether the weak IVs are valid or invalid, it is preferable to use AC instead of the existing methods. In the WLHB setting, the performance of the methods is comparable. Second, our method is applicable to the case of multiple endogenous variables.
6 Application: Effect of Immigration on Wages
In this section we apply our method to the estimation of the effects of immigration on wages in the US. We first describe the setting and then discuss the results.
Many recent studies have tried to estimate the causal effects of immigration on labor market outcomes.
We estimate the following linear model:
(9)  Δ ln w_zt = β1 Δm_zt + β2 Δm_z,t-1 + τ_t + ε_zt
as in Basso2015Association.
Here, t indexes three decadal periods and z the 722 commuting zones. The dependent variable Δ ln w_zt is the change in log weekly wages of high-skilled workers. The regressors are Δm_zt, the change in the share of immigrants in employment, and its lag Δm_z,t-1. The coefficients of interest are β1 and β2. Decade fixed effects are captured by τ_t, and ε_zt is the error term. Commuting-zone fixed effects are eliminated through first-differencing. This regression is canonical in migration economics. The authors use data from the Census Integrated Public Use Micro Samples and the American Community Survey.
The key econometric challenge is that migrants select where to live endogenously. For example, migrants might choose where to live based on the economic conditions in a region. This creates a bias in the estimates. A much-used estimation strategy to address this issue is a shift-share instrumental variable, also known as a Bartik instrument after Bartik1991Who.
The key idea is to interact shares of previous migrants in a base period with current, aggregate-level shifts, or inflows, of migrants. This identification strategy dates back to Altonji1991effects in migration economics. GoldsmithPinkham2020Bartik show that the validity of the shift-share instrument depends on the validity of all shares, and that an overidentified model with all shares as instruments can be used equivalently to the just-identified model. Therefore, we use all shares of migrants from each origin country at a base period in each region. Using origin-specific shares from 19 origin country groups and the base years 1970 and 1980 as separate IVs, we obtain L = 2 × 19 = 38 IVs.
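The construction of the instrument set can be sketched as follows. The data here are simulated placeholders with the dimensions of the application (722 commuting zones, 19 origin groups, base years 1970 and 1980), so only the shapes, not the values, are meaningful:

```python
import numpy as np

rng = np.random.default_rng(1)

n_cz, n_origin = 722, 19   # commuting zones, origin country groups

# shares[z, o]: share of migrants from origin o in zone z at the base year
# (placeholder draws; in the application these come from Census data)
shares_1970 = rng.dirichlet(np.ones(n_origin), size=n_cz)
shares_1980 = rng.dirichlet(np.ones(n_origin), size=n_cz)

# Our instrument set: every origin-by-base-year share as a separate IV
Z = np.hstack([shares_1970, shares_1980])   # shape (722, 38), so L = 38

# The conventional alternative: a single shift-share (Bartik) IV that
# aggregates base-year shares with national inflows m[o] per origin
m = rng.random(n_origin)                    # placeholder national shifts
bartik = shares_1970 @ m                    # one instrument per zone
```

Using the 38 shares separately turns the just-identified shift-share design into an overidentified one, which is what makes the Hansen-Sargan-based selection applicable.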
The main drawback of Bartik-type designs is that all instruments need to be valid. Why might these instruments be invalid? Jaeger2020Shift show that the shift-share IV estimator can be inconsistent, first, because of correlation of the IVs with unobserved demand shocks and, second, because of dynamic adjustment processes. Ideally, neither of these channels plays a role. However, it is quite plausible that some origin country groups did not locate randomly in the past or have had direct effects on wages. The second challenge can be partly addressed by including lagged immigration as an additional regressor. Of course, this regressor is subject to the same endogeneity problem as before and hence should also be instrumented.
Table 5
                           OLS        2SLS       2SLS AHC
Δ immigrant share          0.586      0.877      1.522
                           (0.0935)   (0.460)    (0.292)
lagged Δ immigrant share   -0.197     -0.249     -0.771
                           (0.0814)   (0.321)    (0.246)
Nr. invalid                           0          2
Sargan p-value                        .0126      .0447
N = 2166 (722 CZ × 3), L = 38. Standard errors in parentheses. Observations weighted by beginning-of-period population. Significance level in testing procedure: 0.013.
Results
The results can be found in Table 5. The first column shows the ordinary least squares results: the contemporaneous effect is 0.586, while the lagged effect is smaller and negative. When all shares are used as valid IVs, both effects are larger in absolute terms, but only the contemporaneous effect is marginally statistically significant. The Hansen-Sargan test for this model gives a p-value of 0.0126, which is below the significance level of 0.013 used in the testing procedure.
When using AHC with this significance level in the downward testing procedure, two origin country shares are selected as invalid: the shares of Mexicans in the US in 1970 and in 1980. The two IVs selected as invalid are thus similar a priori, in that they are shares from the same origin country. These shares are plausibly invalid, because Mexican migrants were attracted to border states such as Texas and California by the good economic conditions in those states, both in the base year and in later periods. California's economy has a large agricultural sector, and both states are among the wealthiest in the US. It is therefore likely that the wages or unobserved productivity shocks that drove the initial settlement are correlated over time, invalidating the initial shares. Moreover, GoldsmithPinkham2020Bartik find that Mexico has the highest sensitivity-to-misspecification weight, that is, the overall bias is most sensitive to any invalidity stemming from the Mexican share. Indeed, after controlling for the Mexican shares, the contemporaneous effect almost doubles, while the lagged effect roughly triples.
7 Conclusion
We have proposed a novel method for selecting valid instruments and applied it to the estimation of the effect of immigration on wages in the US. The method can easily be applied in any other setting with many candidate instruments. Another application that we plan to include is Mendelian randomization, the use of instrumental variables in epidemiology.
The advantages of our method are that it extends straightforwardly to settings with multiple endogenous regressors and with heterogeneous treatment effects, and that it performs well in the presence of weak instruments.
Ways to further improve the method would be to account for the variance of each just-identified estimator in the selection algorithm and to apply it in nonlinear models. We also plan to further explore applications of our method to models with richer forms of heterogeneity.
Appendix A Illustration of the IV Selection Procedure for P = 2
Figure 2 illustrates the procedure for a situation with four IVs and two endogenous regressors. Instrument no. 1 is invalid, because it is directly correlated with the outcome, while the remaining three IVs (2, 3, 4) are related to the outcome only through the endogenous regressors and are hence valid.
In the first graph on the top left, we plot each just-identified estimate. The horizontal and vertical axes represent the coefficient estimates of the effects of the first and second regressor, respectively. Each point has been estimated with two IVs, in this case with the IV pairs 1-2, 1-3, 1-4, 2-3, 2-4 and 3-4, because there are four candidate IVs.
In the initial step (0), each just-identified estimate forms its own cluster. In step 1, we join the two estimates that are closest in terms of their Euclidean distance, here those estimated with the pairs 2-3 and 2-4. These two estimates now form one cluster, leaving five clusters in total. We recompute the distances to this new cluster and continue in this fashion until only one cluster is left, as in the bottom-right graph. At each step we evaluate the Sargan test, using the IVs involved in the estimation of the largest cluster. When the p-value exceeds a chosen threshold, say 0.05, we stop the procedure. Ideally this happens at step 2 or 3 of the algorithm, because there the largest cluster (in orange) is formed only by the valid IVs (2, 3 and 4). In that case, exactly the valid IVs are selected as valid.
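The steps above can be sketched in code for the simpler single-regressor case (P = 1), where each just-identified estimate uses one IV instead of a pair. This is our own simplified implementation using scipy's hierarchical clustering, not the authors' code:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import chi2

def sargan_p(Z, x, y):
    """p-value of the Sargan overidentification test after 2SLS with instruments Z."""
    n, L = Z.shape
    xhat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]   # first-stage fitted values
    beta = (xhat @ y) / (xhat @ x)
    e = y - beta * x
    ehat = Z @ np.linalg.lstsq(Z, e, rcond=None)[0]   # residuals projected on Z
    return float(chi2.sf(n * (ehat @ ehat) / (e @ e), df=L - 1))

def ahc_select(Z, x, y, threshold=0.05):
    """Cluster the just-identified estimates agglomeratively and test downward,
    stopping when the IVs in the largest cluster pass the Sargan test."""
    L = Z.shape[1]
    # with P = 1, each single IV j gives the just-identified estimate z_j'y / z_j'x
    b = np.array([(Z[:, j] @ y) / (Z[:, j] @ x) for j in range(L)])
    tree = linkage(b.reshape(-1, 1), method="average")
    for k in range(1, L):                      # k clusters: coarse to fine
        labels = fcluster(tree, t=k, criterion="maxclust")
        largest = np.argmax(np.bincount(labels)[1:]) + 1
        members = np.flatnonzero(labels == largest)
        if len(members) > 1 and sargan_p(Z[:, members], x, y) > threshold:
            return members                     # selected as valid
    return np.arange(L)                        # fallback: no subset passed

# small demo: 6 IVs, the first two invalid through direct effects, true beta = 0
rng = np.random.default_rng(3)
n = 5000
Z = rng.standard_normal((n, 6))
u = rng.standard_normal(n)
x = Z @ np.ones(6) + 0.5 * u + rng.standard_normal(n)
y = Z @ np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0]) + u
selected = ahc_select(Z, x, y, threshold=0.01)
```

In this demo the two invalid IVs produce just-identified estimates near 1 while the four valid ones cluster near 0, so the procedure typically stops at two clusters and selects IVs 2-5.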
Appendix B The largest family consists of valid IVs only
Next, we show that the largest family is composed of valid IVs only.
Remark 1.
is necessary and sufficient for .
Proof:
First, we prove sufficiency. Direct proof: assume the condition holds; the claim then follows directly.
Second, we prove necessity. Proof by contraposition: assume the condition fails. The latter inequality then holds, because otherwise the columns of the matrix would be linearly dependent, so the matrix would not be invertible and its inverse would not exist, which it clearly does by Assumption 1.a.
This also implies that the family consists of valid IVs only and that all sets of the given cardinality are elements of it. Hence, the following remark directly follows:
Remark 2.
.
Appendix C One family can consist of different vectors
We have shown that the number of valid IVs determines the size of the family. However, this relation holds only under the stated condition.
Remark 3.
The function is noninjective.
Proof:
Proof by counterexample: We show that more than one element of the domain leads to the same image, i.e.
Define . Then, and . Assume .
(10)  
(11) 
Therefore
(12)  
(13) 
This means that all s.t. lead to the same solution .
Hence, even though the number of IVs with the same value is smaller than , the largest family might still consist of combinations of invalid IVs, because the first-stage coefficient matrix also determines .
Appendix D Oracle Properties
This section gives the proofs of Lemma 1 and Theorem 1. All proofs apply to the general case.
D.1 Proof of Lemma 1
Proof.
Consider
Under Assumptions 1-5:
(15) 
and hence
We show that the probability that the estimate is assigned to a cluster with elements of its own family goes to one. It is assigned to such a cluster iff the corresponding condition holds. The following two statements are hence equivalent:
Under (15):