Estimating Heterogeneous Causal Effects in the Presence of Irregular Assignment Mechanisms
Abstract
This paper provides a link between causal inference and machine learning techniques (specifically, Classification and Regression Trees, CART) in observational studies where the receipt of the treatment is not randomized, but the assignment to the treatment can be assumed to be randomized (irregular assignment mechanism). The paper contributes to the growing applied machine learning literature on causal inference by proposing a modified version of the Causal Tree (CT) algorithm to draw causal inference from an irregular assignment mechanism. The proposed method is developed by merging the CT approach with the instrumental variable framework for causal inference, hence the name Causal Tree with Instrumental Variable (CTIV). Compared to CT, the main strength of CTIV is that it can deal more efficiently with the heterogeneity of causal effects, as demonstrated by a series of numerical results obtained on synthetic data. The proposed algorithm is then used to evaluate a public policy implemented by the Tuscan Regional Administration (Italy), which aimed at easing access to credit for small firms. In this context, CTIV breaks fresh ground for target-based policies, identifying interesting heterogeneous causal effects.
I Introduction
Modern statistics is experiencing the growth of machine learning techniques, such as Classification and Regression Trees (CART) [References] and Random Forests [References], which can be applied to a wide range of statistical problems. In order to use these techniques to answer relevant statistical questions, it is appropriate to highlight some important features of many machine learning methods. These methods are largely about making good predictions and finding the model that fits the data best. Furthermore, their importance lies in the ability to deal with complex datasets, where both the number of units and the number of features associated with a single unit are large. In this framework, causality is often de-emphasized. However, in the last decades, the availability of increasingly large datasets has brought to attention a new important problem for causal inference, which machine learning techniques can “easily” solve. As a matter of fact, the need to deal with the heterogeneity of treatment effects is stronger than in the past: the availability of large datasets makes it possible to customize causal effect estimates for subsets of the population and even for individuals. In the past, the subsets analyzed in causal inference problems were specified in advance by the trials’ protocols, while with the machine learning technique presented in this paper, the subsets are selected by the algorithm itself in a data-driven way. Classical approaches to the analysis of heterogeneous effects are nonparametric methods, such as nearest neighbour matching, kernel methods, and series estimation [References]. These techniques usually offer good results in terms of estimation accuracy. Their drawback is that they perform well only as long as the number of covariates is low. Machine learning techniques outperform other nonparametric methods when the number of covariates is relatively high.
This can be seen as the reason that recently led to the application of machine learning techniques to causal discovery and inference. A good example of the use of machine learning techniques in these fields, and a very important inspiration for the present work, is the recently published paper [References], where an adaptation to causal inference of CART in its regression version, named Causal Tree (henceforth, CT), was developed to estimate causal effects with instrumental variables. While the goal of the method proposed in [References] is very similar to that of the algorithm we develop in this paper (namely, to draw proper causal inference in the presence of irregular assignment mechanisms), the CT algorithm can identify the heterogeneity of causal effects only with respect to a particular subset of selected covariates, where the selection needs to be done by the researcher herself. Conversely, our algorithm, named Causal Tree with Instrumental Variable (henceforth, CTIV), provides a data-driven way to shed light on the heterogeneity of the treatment effects. The paper is structured as follows. Section II provides a background on the causal inference framework, its link with machine learning as it is modeled via the CT algorithm, and basic concepts about instrumental variables. In Section III, we describe our proposed CTIV algorithm. In Section IV, we provide a comparison on synthetic data between the CT and CTIV algorithms in the presence of an irregular assignment mechanism, showing numerically the advantages of the latter. Section V concludes the paper with a case study on firm-level data where the proposed algorithm is used to assess the heterogeneity of the effects of an employment policy implemented by the Tuscan Regional Administration (Italy).
II Background
A. Rubin’s Causal Model. In order to set up the method presented in this paper, it is important to recall some notions and notation of Rubin’s potential outcome framework [References, References]. Rubin’s framework is a milestone of causal inference. Together with Pearl’s causality approach [References], it is the most widely used model in the scientific literature on causal inference.
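Given a set of $N$ units, indexed by $i = 1, \dots, N$, let $W_i$ be the binary indicator of the receipt of the treatment:
(1) $W_i = \begin{cases} 1 & \text{if unit } i \text{ receives the treatment} \\ 0 & \text{otherwise} \end{cases}$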
In order to develop a proper causal inference framework, one needs to assume that the potential outcomes for any unit do not vary with the treatments assigned to other units, and that, for each unit, there are no different forms or versions of each treatment level, which may lead to different potential outcomes [References, References]. This assumption is referred to in the literature as the Stable Unit Treatment Value Assumption (henceforth, SUTVA). Given SUTVA, one can postulate the existence of a pair of potential outcomes for each unit:
(2) $Y_i(W_i) = \begin{cases} Y_i(0) & \text{if } W_i = 0 \\ Y_i(1) & \text{if } W_i = 1 \end{cases}$
Starting from the notion of potential outcomes, one can define a unit-level causal effect as the difference between the potential outcome under treatment and the one under control:
(3) $\tau_i = Y_i(1) - Y_i(0)$
The problem of this approach to causal inference is that one can observe just one potential outcome for every unit. It is impossible to observe both potential outcomes for the same unit at the same time. Therefore, from this perspective, causal inference is a missing data problem [References].
Is it then impossible to estimate any causal effect? No, it is not but, in order to draw proper causal inference, one needs to introduce the central concepts of Rubin’s Causal Model [References]. Let $X_i$ be the vector of features (also called covariates or pre-treatment variables) associated with the $i$-th unit, and known not to be affected by the treatment. Let $\mathbf{X}$ be the $N \times K$ matrix of covariate values (where $N$ is the number of units, and $K$ the number of covariates per unit), $\mathbf{W}$ the $N$-dimensional vector of binary assignments to the treatment, and $\mathbf{Y}(0)$ and $\mathbf{Y}(1)$ the $N$-dimensional vectors of potential outcomes. Imbens and Rubin [References] define the assignment mechanism $\Pr(\mathbf{W} \mid \mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1))$, the unit-level assignment probability $p_i(\mathbf{X}, \mathbf{Y}(0), \mathbf{Y}(1))$ and the propensity score $e(x)$, which is the probability for a unit to be treated, conditional on its covariates [References].
Following [References], one defines a classical randomized experiment as an assignment mechanism that has the following four properties:
1) it is individualistic, meaning that the treatment assignment for any unit is a function only of its own covariates and potential outcomes;
2) it is probabilistic, meaning that the unit-level assignment probability belongs to the open interval $(0, 1)$;
3) it is unconfounded, meaning that it does not depend on the potential outcomes;
4) it has a functional form that is known (and, to some extent, controlled) by the researcher.
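Suppose that one is interested in the population average treatment effect:
(4) $\tau = E[Y_i(1) - Y_i(0)] = E[Y_i(1)] - E[Y_i(0)]$
where $E[Y_i(1)]$ is the expected value of $Y_i(1)$, and $E[Y_i(0)]$ is the expected value of $Y_i(0)$. In the case of a classical randomized experiment, an unbiased estimator of $\tau$ is:
(5) $\hat{\tau} = \bar{Y}_t^{obs} - \bar{Y}_c^{obs}$
In the equation above, $\bar{Y}_t^{obs} = \frac{1}{N_t} \sum_{i: W_i = 1} Y_i^{obs}$, where $N_t$ is the number of units assigned to the treated group, and $\bar{Y}_c^{obs} = \frac{1}{N_c} \sum_{i: W_i = 0} Y_i^{obs}$, where $N_c$ is the number of units assigned to the control group. Finally, $Y_i^{obs} = Y_i(W_i)$ is the observed outcome of unit $i$.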
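As a minimal sketch of the difference-in-means estimator described above, the following Python snippet simulates a randomized experiment and compares the estimate with the true effect. The data-generating process, the function names (`difference_in_means`, `simulate`) and the value of the true effect are assumptions made for illustration only; they do not come from the paper.

```python
import random
from statistics import mean


def difference_in_means(y_obs, w):
    """Average observed outcome of treated units minus that of control units."""
    treated = [y for y, wi in zip(y_obs, w) if wi == 1]
    control = [y for y, wi in zip(y_obs, w) if wi == 0]
    return mean(treated) - mean(control)


def simulate(n=20000, true_effect=2.0, seed=0):
    """Hypothetical data: constant additive effect, coin-flip assignment."""
    rng = random.Random(seed)
    y_obs, w = [], []
    for _ in range(n):
        y0 = rng.gauss(0.0, 1.0)               # potential outcome under control
        y1 = y0 + true_effect                  # potential outcome under treatment
        wi = 1 if rng.random() < 0.5 else 0    # randomized treatment indicator
        w.append(wi)
        y_obs.append(y1 if wi == 1 else y0)    # only one potential outcome is observed
    return y_obs, w


y_obs, w = simulate()
tau_hat = difference_in_means(y_obs, w)
print(f"estimated average treatment effect: {tau_hat:.3f}")
```

Only one potential outcome per unit enters `y_obs`, which mirrors the missing-data view of causal inference recalled earlier in this subsection.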
By relaxing the fourth property of a known assignment mechanism, one ends up in a scenario that [References] defines as a Regular Assignment Mechanism. Is it still possible to draw causal inference in such a scenario? The central property that needs to be invoked in order to do so is the unconfoundedness property 3) defined above. Unconfoundedness can be formalized as the conditional independence of the assignment variable from the potential outcomes given (conditioning on) the covariates vector:
(6) $W_i \perp\!\!\!\perp (Y_i(0), Y_i(1)) \mid X_i$
The importance of this assumption is that, conditional on covariates, one can treat observations as if they came from a randomized experiment. Let the Conditional Average Treatment Effect (CATE) be defined as:
(7) $\tau(x) = E[Y_i(1) - Y_i(0) \mid X_i = x]$
where is the expected value of given . Then it can be proven, by the law of iterated expectations, that:
(8) 
It follows that is identified if and are identified over the support of . Under unconfoundedness, it can be proven that and are identified [References]. This gives the possibility to the researcher, if all the important confounding covariates are present in the data, to draw causal inference even when the assignment mechanism is not randomized but is regular. This is the typical case of observational studies, where the researcher does not know beforehand the assignment mechanism (i.e., property 4) above does not hold). Moreover, in observational studies, the assignment to the treatment may be different from the receipt of the treatment. In this scenario, where one allows for noncompliance between the treatment assigned and the treatment received, one can assume that the assignment is itself unconfounded, while the receipt is confounded. Following [References], this assignment mechanism is named Irregular Assignment Mechanism. How to draw inference in the presence of an irregular assignment mechanism will be the focus of Subsection II.C, and also the focus of our applied machine learning algorithm in Section III.
Going back to the CATE, there is a variety of reasons for researchers to estimate $\tau(x)$ (see formula (7)). One is strictly related to the magnitude of the benefits of the treatment, which can vary with the features of the individuals. For instance, one can imagine the extreme case where the average treatment effect of a drug is positive on the overall population (in terms of curing a specific disease), but for a subpopulation of patients with certain characteristics the average treatment effect is null, or even negative. For these reasons, it is important to find a proper way to estimate causal effects not only on the entire population, but also on specific subsets of the population.
B. Regression Trees for Causal Inference. Machine learning offers new ways to investigate heterogeneous effects (i.e., ones that depend on the covariates vector $x$, see (7)), as suggested in [References, References]. Machine learning techniques developed so far in the literature provide useful tools to achieve this goal in scenarios where the assignment mechanism is randomized or regular.
A machine learning technique that has been applied to this task is the CART method [References]. CART is suitable for this goal because, on the one hand, it is a fully supervised machine learning technique and, on the other hand, it is a rather flexible method that can be adapted to various learning tasks. Here, due to page constraints, we limit ourselves to providing an overview of the basic ideas behind this method, referring the reader to [References] for further details. The primary goal of CART is to estimate the conditional expectation of an observed outcome, $E[Y_i^{obs} \mid X_i = x]$, on the basis of the information on features and outcomes for units in the training sample, and to compare the resulting estimates on a test sample. Practically, one can estimate these values by building a suitable tree (a classification or a regression tree, depending on the specific problem). The different admissible tree models one can construct entail alternative splits of the tree, based on the values of the features in the data. A possible way to choose the best among various admissible trees is provided by the following procedure, whose initial step consists in dividing the dataset into two different samples:
a) a first sample, called training sample (or training set), which is used to construct a maximal depth tree, performing the splits using an in-sample goodness-of-fit measure $Q^{is}$. The size of this training sample is indicated by $N^{tr}$. Then, the maximal depth tree is pruned, with the aim of maximizing another criterion function $Q^{crit}$, for various choices of a suitable penalty parameter $\alpha$ on which $Q^{crit}$ depends;
b) a second sample, called validation sample (or validation set), which is used, for each choice of $\alpha$, to validate the associated pruned tree, through the use of an out-of-sample goodness of fit $Q^{os}$. The size of this second sample is indicated by $N^{va}$.
Here, we consider the case in which a single training set and a single validation set are used. In the machine learning literature, this procedure is called the holdout method, and is a particular form of cross-validation. In this case, $\alpha$ is chosen by maximizing $Q^{os}$ with respect to it, and the tree itself is retrained using the full dataset, for the resulting value of $\alpha$. Finally, a different sample, called test sample (or test set), with cardinality $N^{te}$, is used to assess the performance of the resulting model.
In the following, we describe the Causal Tree (CT) method [References], which is a modification of the original CART method in its regression version, tailored to causal inference. The CT method differs from CART in the following features:

the CATE transformation of the outcome;

a rework of the in-sample goodness of fit;

a rework of the out-of-sample goodness of fit.
II.1 The CATE Transformation
First of all, one needs to address the big issue of constructing an algorithm that leads to an accurate estimate of the conditional average treatment effect. In an ideal world, one would measure the quality of the estimator by looking at the value of the following goodness of fit measure, defined in terms of the mean squared error:
(9) 
However, it is infeasible to evaluate this criterion, because one does not know the values of both potential outcomes for each unit, so that the unit-level effect $\tau_i$ is unobservable. To address this issue, one can transform the observed outcome using the treatment indicator variable $W_i$ and the propensity score $e(X_i)$, as proposed by Athey and Imbens [References]:
(10) $Y_i^{*} = Y_i^{obs} \cdot \frac{W_i - e(X_i)}{e(X_i)\,(1 - e(X_i))}$
Since $Y_i^{obs}$ is equivalent to $Y_i(W_i)$ then, using (1), one can express (10) as:
(11) $Y_i^{*} = Y_i(1)\,\frac{W_i}{e(X_i)} - Y_i(0)\,\frac{1 - W_i}{1 - e(X_i)}$
What is the strength of this transformation? Athey and Imbens prove that, if the unconfoundedness assumption holds, then:
(12) $E[Y_i^{*} \mid X_i = x] = \tau(x)$
where $Y_i^{*}$ in (11) is computed in practice by replacing the propensity score with a suitable estimate $\hat{e}(X_i)$ (obtained, e.g., via logistic regression). However, there are some issues in building a tree using a straightforward transformation of the outcome like $Y_i^{*}$. In fact, Athey and Imbens argue that the within-leaf sample average of the transformed outcome is not the most efficient estimator of the treatment effect and, moreover, that the proportion of treated and control units within a leaf can be quite different from the overall sample proportion. An easy way to solve this issue, proposed in [References], is to weight the CATE transformation in a manner similar to the one developed in [References]. Every partition of the covariates space is identified by a set of leaves, and the treatment effect for the covariates vector $X_i$ belonging to a generic leaf $\mathbb{X}_j$ is estimated as (see Footnote 1):
(13) $\hat{\tau}^{tr}(X_i) = \frac{\sum_{k \in tr:\, X_k \in \mathbb{X}_j} Y_k^{obs}\, W_k / \hat{e}(X_k)}{\sum_{k \in tr:\, X_k \in \mathbb{X}_j} W_k / \hat{e}(X_k)} - \frac{\sum_{k \in tr:\, X_k \in \mathbb{X}_j} Y_k^{obs}\, (1 - W_k) / (1 - \hat{e}(X_k))}{\sum_{k \in tr:\, X_k \in \mathbb{X}_j} (1 - W_k) / (1 - \hat{e}(X_k))}$
Footnote 1: As with formulas (30) and (39) below, (13) can also be applied to the validation sample and to the entire (training and validation) sample, replacing the superscript “tr” with “va” and with the superscript of the entire sample, respectively.
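The transformed outcome and its weighted within-leaf counterpart can be illustrated with a short, hedged Python sketch. It assumes a single leaf, a known (rather than estimated) propensity score and a constant treatment effect; the simulated data and the names `transformed_outcome` and `weighted_leaf_effect` are hypothetical and serve only to show that both quantities recover the effect under unconfoundedness.

```python
import random
from statistics import mean


def transformed_outcome(y_obs, w, e):
    """Transformed outcome: y * (w - e) / (e * (1 - e)) for one unit."""
    return y_obs * (w - e) / (e * (1.0 - e))


def weighted_leaf_effect(ys, ws, es):
    """Inverse-propensity weighted difference of means within one leaf."""
    num_t = sum(y * w / e for y, w, e in zip(ys, ws, es))
    den_t = sum(w / e for w, e in zip(ws, es))
    num_c = sum(y * (1 - w) / (1 - e) for y, w, e in zip(ys, ws, es))
    den_c = sum((1 - w) / (1 - e) for w, e in zip(ws, es))
    return num_t / den_t - num_c / den_c


# Hypothetical data: the propensity depends on one covariate x (assumed known here).
rng = random.Random(0)
true_effect = 1.5
ys, ws, es = [], [], []
for _ in range(40000):
    x = rng.random()
    e = 0.3 + 0.4 * x                          # propensity score in [0.3, 0.7]
    w = 1 if rng.random() < e else 0           # assignment unconfounded given x
    y = rng.gauss(0.0, 1.0) + true_effect * w  # observed outcome
    ys.append(y)
    ws.append(w)
    es.append(e)

mean_transformed = mean(transformed_outcome(y, w, e) for y, w, e in zip(ys, ws, es))
weighted = weighted_leaf_effect(ys, ws, es)
print(f"mean of transformed outcome: {mean_transformed:.3f}")
print(f"weighted leaf estimate:      {weighted:.3f}")
```

In applications the propensity score would be replaced by an estimate, and the computation would be repeated separately within each leaf.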
II.2 In-Sample Goodness of Fit
The second component of the algorithm, which also differs from the corresponding component in the original CART algorithm, is the in-sample goodness of fit. The main difficulty in defining a proper criterion function for the in-sample goodness of fit is that, in the causal inference framework, the criterion (9), and even its sample approximation, which is what would be implemented by a direct application of the original CART algorithm, are infeasible. Hence, [References] proposes to approximate (9) by:
(14) 
and to use the corresponding criterion function
(15) 
where is a penalty parameter, and is the number of leaves in the tree.
II.3 Out-of-Sample Goodness of Fit
For cross-validation, no significant additional computational effort is needed, given that one has already obtained an estimate of the causal effect defined in terms of the training sample (see (13)), and one just needs to compare it with the causal effect drawn from the validation sample used for the cross-validation. One could easily rework the mean squared error with the transformed outcome to get the Transform-The-Outcome (TOT) loss function:
(16) 
The out-of-sample goodness of fit can be reworked in different ways, following [References]. It seems to us that the TOT-based out-of-sample goodness of fit in (16) is better suited to those frameworks in which the number of covariates would make alternative estimators very computationally demanding.
II.4 Causal Inference with Causal Tree
Due to the specific construction of the Causal Tree, the learning problem reduces to that of estimating the treatment effect in each member of a partition of the covariate space. For the problem of estimating the treatment effect in each leaf of the partition, standard methods are valid. Once one has constructed the tree , one can consider the leaf that corresponds to the subset (henceforth, identified with itself). The tree is defined as a partition of the feature space , and one can write:
(17) 
where indicates the number of leaves in the tree. Within the leaf , the average treatment effect is:
(18) 
which can be estimated as follows, by subtracting the average outcome of the control units from the average outcome of the treated units, both evaluated over the test sample, which is distinct from the training and validation samples used for the cross-validation:
(19) $\hat{\tau}_{\ell} = \bar{Y}^{te}_{t,\ell} - \bar{Y}^{te}_{c,\ell}$
One can also estimate, for each leaf $\ell$, the variance of this estimator using the following Neyman estimator [References]:
(20) $\hat{V}(\hat{\tau}_{\ell}) = \frac{s^2_{t,\ell}}{N^{te}_{t,\ell}} + \frac{s^2_{c,\ell}}{N^{te}_{c,\ell}}$
where $\bar{Y}^{te}_{t,\ell}$ and $\bar{Y}^{te}_{c,\ell}$ are the average outcomes of the treated and control units of the test set falling in leaf $\ell$, $s^2_{t,\ell}$ represents the sample variance of the treated group in the test set, $N^{te}_{t,\ell}$ its size, $s^2_{c,\ell}$ the sample variance of the control group in the test set, and $N^{te}_{c,\ell}$ its size. This estimator of the variance is unbiased, with respect to the finite-sample distribution of the test sample, if the treatment effect can be assumed to be additive and constant within a leaf [References]. However, it can be used to construct confidence intervals only under the normal approximation, which is typically reliable when the number of units inside a leaf is large enough.
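The leaf-level estimate, its Neyman variance and the corresponding normal-approximation confidence interval can be sketched as follows. This is an illustrative snippet only: the function names and the tiny hand-made samples are assumptions, and a real leaf would need many more units for the normal approximation to be reliable.

```python
from math import sqrt
from statistics import mean, variance


def leaf_effect_and_variance(y_treated, y_control):
    """Difference in means and Neyman variance estimate for one leaf."""
    tau_hat = mean(y_treated) - mean(y_control)
    # statistics.variance is the sample variance (divides by n - 1).
    var_hat = variance(y_treated) / len(y_treated) + variance(y_control) / len(y_control)
    return tau_hat, var_hat


def normal_confidence_interval(tau_hat, var_hat, z=1.96):
    """Approximate 95% interval based on the normal approximation."""
    half_width = z * sqrt(var_hat)
    return tau_hat - half_width, tau_hat + half_width


# Hypothetical test-sample outcomes of the units falling in one leaf.
y_treated = [3.0, 5.0, 4.0, 6.0]
y_control = [1.0, 2.0, 3.0, 2.0]

tau_hat, var_hat = leaf_effect_and_variance(y_treated, y_control)
low, high = normal_confidence_interval(tau_hat, var_hat)
print(tau_hat, var_hat, (low, high))
```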
C. General Instrumental Variable Framework. In observational studies, the assignment mechanism may be irregular. For example, dependence of the assignment on the potential outcomes may not be ruled out even after conditioning on a rich set of covariates. These are the cases where the unconfoundedness assumption is violated. In these settings, instrumental variable methods [References] can still help to estimate causal effects. To make the context clear, one can consider the following example of an irregular assignment mechanism: in a study population of $N$ units, a certain number of individuals are randomized to receive a treatment (e.g., a drug), but not all the units that are assigned to receive it are actually treated.
Let us denote by $W_i$ the receipt of the treatment, and by $Z_i$ the assignment to the treatment (instrumental variable). Throughout this paper, we will assume both $W_i$ and $Z_i$ to be binary, even if one could get similar results by relaxing this assumption. In the following, $W_i(Z_i)$ represents the treatment received as a function of the treatment assigned. This leads one to distinguish four different subpopulations of units: those that always comply with the assignment (compliers), those that always do the opposite of the assignment (defiers), those that take the treatment even if not assigned to it (always-takers), and those that do not take the treatment even if assigned to it (never-takers). Formally, one can identify these subpopulations as follows:

Compliers ($C$): $W_i(0) = 0$ and $W_i(1) = 1$;

Defiers ($D$): $W_i(0) = 1$ and $W_i(1) = 0$;

Always-takers ($A$): $W_i(0) = 1$, $W_i(1) = 1$;

Never-takers ($N$): $W_i(0) = 0$, $W_i(1) = 0$.
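The four compliance types can be encoded as a lookup on the pair of potential treatment receipts under the two assignments. The sketch below is purely illustrative (the function name `compliance_type` is an assumption); in real data only one of the two potential receipts is observed for each unit, so the type of a single unit is generally not identifiable.

```python
def compliance_type(w_if_not_assigned, w_if_assigned):
    """Map the pair of potential treatment receipts to a compliance type."""
    types = {
        (0, 1): "complier",
        (1, 0): "defier",
        (1, 1): "always-taker",
        (0, 0): "never-taker",
    }
    return types[(w_if_not_assigned, w_if_assigned)]


for pair in [(0, 1), (1, 0), (1, 1), (0, 0)]:
    print(pair, compliance_type(*pair))
```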
How can one conduct causal inference in such a setting, if one decides to use the CART method? The first thing to do is to assume that the classical Instrumental Variable (IV) assumptions hold [References]: monotonicity, existence of compliers, unconfoundedness of the instrument, and exclusion restriction. These four assumptions can be written in detail as follows:

monotonicity: $W_i(1) \geq W_i(0)$;

existence of compliers: $P(W_i(0) < W_i(1)) > 0$;

unconfoundedness of the instrument (expressed in terms of conditional independence notation): $Z_i \perp\!\!\!\perp (Y_i(0,0), Y_i(0,1), Y_i(1,0), Y_i(1,1), W_i(0), W_i(1)) \mid X_i$;

exclusion restriction: $Y_i(0) = Y_i(1)$ for always-takers and never-takers, where, for each subpopulation and $z \in \{0, 1\}$, the shortened notation $Y_i(z)$ is used to denote $Y_i(z, W_i(z))$.
The monotonicity assumption leads us to exclude the existence of units that do exactly the opposite of what they are assigned to (i.e., defiers). In the case of one-sided noncompliance, when units that are not assigned to take the drug cannot take it, this assumption is automatically satisfied, as $W_i(0) = 0$ for each unit, which excludes the presence of both defiers and always-takers. In the case of two-sided noncompliance, when treated and control units can access the opposite treatment status, the monotonicity assumption is very plausible but not directly verifiable. The second assumption is the so-called “existence of compliers” assumption, which states that the subpopulation of compliers has positive probability. The third assumption states that the instrument is unconfounded. As we saw in Subsection II.A, the importance of unconfoundedness is that, conditional on covariates, the assignment to the treatment is as good as randomized. Last but not least, the exclusion restriction rules out any direct effect of the assignment $Z_i$ on the outcome $Y_i$. According to this assumption, the assignment has no effect on the outcome when it has no effect on the treatment received, the latter being the treatment of primary interest.
1) Complier Average Causal Effect: In the setting above, what “one can get from the data” (without invoking any of the previous assumptions) is the Intention To Treat effect $ITT_Y$, which is defined as the effect of the intention to treat a unit on the outcome of the same unit (effect of the assignment):
(21) $ITT_Y = E[Y_i(Z_i = 1) - Y_i(Z_i = 0)]$
If one does not assume any of the classical IV assumptions above to hold, then the global $ITT_Y$ may be written as the weighted average of the effects across the four subpopulations of compliers ($C$), defiers ($D$), always-takers ($A$) and never-takers ($N$):
(22) $ITT_Y = \pi_C\, ITT_{Y,C} + \pi_D\, ITT_{Y,D} + \pi_A\, ITT_{Y,A} + \pi_N\, ITT_{Y,N}$
where $ITT_{Y,G}$ is the effect of the treatment assignment on units of type $G \in \{C, D, A, N\}$ and $\pi_G$ is the proportion of units of type $G$.
We can then proceed by adding the four assumptions step by step. The first assumption that we impose is the exclusion restriction. If it holds, then we get
(23) $ITT_Y = \pi_C\, ITT_{Y,C} + \pi_D\, ITT_{Y,D}$
since, for both always-takers and never-takers, one has
(24) $ITT_{Y,A} = ITT_{Y,N} = 0$
If for an individual the assignment has no effect on the treatment received, then it has also no effect on the outcome. This is a substantial assumption, and is not implied by the design. It is generally stated as the assignment not affecting the outcome other than through the treatment received, as we saw above. Such an assumption can be used to attribute the effect of assignment to the treatment received as follows, taking into account only compliers and defiers:
(25) 
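Under monotonicity, we rule out the existence of defiers, so that their proportion is zero: $\pi_D = 0$. If we add the unconfoundedness assumption, we can estimate the distribution of compliance types as follows:

proportion of always-takers: $\pi_A = P(W_i = 1 \mid Z_i = 0)$, estimated as $\hat{\pi}_A = \frac{1}{N_0} \sum_{i: Z_i = 0} W_i$;

proportion of never-takers: $\pi_N = P(W_i = 0 \mid Z_i = 1)$, estimated as $\hat{\pi}_N = \frac{1}{N_1} \sum_{i: Z_i = 1} (1 - W_i)$;

proportion of compliers: $\pi_C = 1 - \pi_A - \pi_N$, estimated as $\hat{\pi}_C = 1 - \hat{\pi}_A - \hat{\pi}_N$,
where $N_1$ is the number of units assigned to the treatment and $N_0$ is the number of units assigned to the control. Once one has estimated the proportion of compliers, and one also adds the “existence of compliers” assumption, one finally gets:
(26) $ITT_Y = \pi_C\, ITT_{Y,C}$
From this formula, $\pi_C$ being strictly positive, it follows that $ITT_{Y,C}$, the so-called Complier Average Causal Effect (CACE), is [References]:
(27) $CACE = ITT_{Y,C} = \frac{ITT_Y}{\pi_C}$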
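The ratio above can be illustrated on simulated data with two-sided noncompliance and no defiers. The following Python sketch is an assumption-laden illustration, not the paper's experimental setup: the compliance shares, the baseline outcome shifts by type and the function names `wald_cace` and `simulate` are hypothetical.

```python
import random
from statistics import mean


def wald_cace(y, w, z):
    """Return (ITT on the outcome, estimated complier share, CACE) for binary w, z."""
    y1 = [yi for yi, zi in zip(y, z) if zi == 1]
    y0 = [yi for yi, zi in zip(y, z) if zi == 0]
    w1 = [wi for wi, zi in zip(w, z) if zi == 1]
    w0 = [wi for wi, zi in zip(w, z) if zi == 0]
    itt_y = mean(y1) - mean(y0)
    pi_a = mean(w0)            # always-takers: treated although not assigned
    pi_n = 1.0 - mean(w1)      # never-takers: untreated although assigned
    pi_c = 1.0 - pi_a - pi_n   # compliers, under monotonicity (no defiers)
    return itt_y, pi_c, itt_y / pi_c


def simulate(n=40000, effect=2.0, seed=1):
    """Hypothetical population: 60% compliers, 20% always-takers, 20% never-takers."""
    rng = random.Random(seed)
    y, w, z = [], [], []
    for _ in range(n):
        u = rng.random()
        kind = "C" if u < 0.6 else ("A" if u < 0.8 else "N")
        zi = 1 if rng.random() < 0.5 else 0                    # randomized assignment
        wi = zi if kind == "C" else (1 if kind == "A" else 0)  # treatment receipt
        base = {"C": 0.0, "A": 1.0, "N": -1.0}[kind]           # receipt is confounded by type
        y.append(base + rng.gauss(0.0, 1.0) + effect * wi)     # no direct effect of z on y
        w.append(wi)
        z.append(zi)
    return y, w, z


y, w, z = simulate()
itt_y, pi_c, cace = wald_cace(y, w, z)
print(f"ITT = {itt_y:.3f}, complier share = {pi_c:.3f}, CACE = {cace:.3f}")
```

Because the outcome depends on the assignment only through the treatment received, the simulated data satisfy the exclusion restriction by construction, and the intention-to-treat effect is smaller in magnitude than the effect on compliers.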
In general, the global may be viewed as a lower bound on the treatment effect on the compliers: with the assumptions , , and that both and are strictly less than , one gets . The complier average treatment effect, , is a local effect, since it makes reference just to the population of compliers, hence it can also be referred as a Local Average Treatment Effect (LATE). It can be estimated as the coefficient associated with the instrumental variable regression [References] as we will see in detail in Subsection III.B. Invoking unconfoundedness, exclusion restriction and monotonicity, one can also infer the outcome distribution for compliers, , and . Under the same assumptions, one can estimate the entire marginal distribution of and for compliers.
III Causal Tree with Instrumental Variable
A. Causal Tree with Randomized Instrumental Variable. In the following, we extend the CT algorithm to the case of an irregular assignment mechanism where the assignment-to-the-treatment variable is itself randomized, but its receipt is not. If we assume the instrumental variable to be randomized, we can draw causal inference from a Causal Tree by making some changes in the structure of the tree. The first difference is that we need to rework the outcome variable, substituting in (10) the indicator variable $W_i$ with the instrumental variable $Z_i$, as follows:
(28) $Y_i^{*} = Y_i^{obs} \cdot \frac{Z_i - e(X_i)}{e(X_i)\,(1 - e(X_i))}$
where the propensity score is now reworked as $e(x) = P(Z_i = 1 \mid X_i = x)$. In this case, when the assignment mechanism corresponds to a classical randomized experiment, the propensity score is a constant (i.e., $e(x) = p$ for all $x$), and the transformation above simplifies to:
(29) $Y_i^{*} = Y_i^{obs} \cdot \frac{Z_i - p}{p\,(1 - p)}$
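As a hedged illustration of the transformation with a randomized instrument, the snippet below checks numerically that the average of the transformed outcome approximates the intention-to-treat effect. The simulated population (half compliers, half never-takers), the assignment probability of 0.5 and the function name `transformed_outcome_iv` are assumptions made for this example.

```python
import random
from statistics import mean


def transformed_outcome_iv(y_obs, z, p):
    """Transformed outcome with the instrument z and constant assignment probability p."""
    return y_obs * (z - p) / (p * (1.0 - p))


# Hypothetical population: 50% compliers, 50% never-takers, effect 2.0 on the treated.
rng = random.Random(2)
p = 0.5
y_star = []
for _ in range(40000):
    complier = rng.random() < 0.5
    z = 1 if rng.random() < p else 0       # randomized assignment
    w = z if complier else 0               # never-takers stay untreated
    y = rng.gauss(0.0, 1.0) + 2.0 * w      # observed outcome
    y_star.append(transformed_outcome_iv(y, z, p))

itt_estimate = mean(y_star)                # expected to be close to 0.5 * 2.0 = 1.0
print(f"mean transformed outcome (ITT estimate): {itt_estimate:.3f}")
```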
As in (13), one can also use a weighted version of the transformation of the outcome to provide an estimate of the intention to treat $ITT_Y(x)$, for $x$ belonging to a generic leaf, as follows:
(30) 
where is the estimated value of . Again, following [References], (30) is an unbiased and efficient estimator of (21) within every leaf.
We also need to rework the in-sample and out-of-sample goodness-of-fit measures:

In-sample goodness of fit:
(31) 
Out-of-sample goodness of fit:
(32)
For the sake of clarity, here (and in the following subsection), to fit the instrumental variable framework, we have reworked the out-of-sample goodness of fit based on TOT (see (16)). This rework could easily be adapted to other out-of-sample goodness-of-fit measures.
The last part of our algorithm based on the instrumental variable focuses on the estimation of the complier average treatment effects. As we highlighted before, by using the instrumental variable $Z_i$, we are essentially assuming four different types in our population: compliers, always-takers, never-takers, and defiers. As before, our interest lies in the effect on the compliers. Within every leaf, the complier average causal effect is:
(33) 
This formula is analogous to (27), and can be estimated in every leaf assuming the existence of compliers. Then, can be estimated as:
(34) 
where is estimated following (30), and can be estimated as:
(35) 
where and are the numbers of units assigned respectively to the treated and control group within a certain leaf , and is the number of units within the leaf.
B. Causal Tree with Unconfounded Instrumental Variable. Now, we extend the analysis above to the case of an irregular assignment mechanism, where both the assignment and receipt of the treatment are not randomized, but the assignment can be assumed to be unconfounded when conditioning on important covariates. When the instrumental variable is not randomized a priori, the property of unconfoundedness of the instrument does not necessarily hold. If we think of as our assignment mechanism, then the unconfoundedness of the instrument holds when:
(36) 
Due to the propensity score properties, this assumption holds even conditioning on the propensity score:
(37) 
When the assumption (37) holds, one can rework the transformed outcome variable in a similar way as in the previous subsection, obtaining
(38) 
Assuming that the exclusion restriction and the monotonicity assumptions hold, it is possible to provide an estimate of the intention to treat for belonging to a generic leaf , as follows:
(39)  
where $\hat{e}(X_i)$ is the estimated value of the propensity score $e(X_i)$. The difference between (30) and (39) is that, given the complete randomization of the instrument, in (30) the assignment probability was fixed to a constant for any given unit, while in (39) the assignment-to-the-treatment probability is modelled by the estimated propensity score $\hat{e}(X_i)$. Finally, the complier average treatment effect in each leaf is still estimated using (34), replacing (30) with (39) to determine the estimate of the intention to treat.
III.1 Overall CACE
Starting from all the leaves, one can reconstruct the overall effect as a weighted average of the estimates over every leaf $\ell$. One can represent this weighted average as
(40) $\widehat{CACE} = \sum_{\ell = 1}^{L} \frac{N_{C,\ell}}{N_C}\, \widehat{CACE}_{\ell}$
where $L$ represents the number of leaves, $N_{C,\ell}$ the number of compliers in leaf $\ell$, and $N_C$ the overall number of compliers in all the leaves. One can also compute the proportion of compliers in every leaf simply as:
(41) 
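A minimal sketch of this aggregation is given below. Since compliers are not individually observed, the sketch assumes (as an illustrative choice, not something stated in the paper) that the number of compliers in a leaf is approximated by the estimated complier share of the leaf times the leaf size; the function name `overall_cace` and the input values are hypothetical.

```python
def overall_cace(leaf_cace, leaf_complier_share, leaf_size):
    """Weighted average of leaf-level CACE estimates, weighted by complier counts."""
    # Assumption: complier count in a leaf = estimated share * number of units.
    n_compliers = [share * n for share, n in zip(leaf_complier_share, leaf_size)]
    total_compliers = sum(n_compliers)
    return sum(c * nc for c, nc in zip(leaf_cace, n_compliers)) / total_compliers


# Two hypothetical leaves with 50 estimated compliers each.
aggregate = overall_cace(
    leaf_cace=[1.0, 3.0],
    leaf_complier_share=[0.5, 0.25],
    leaf_size=[100, 200],
)
print(aggregate)
```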
III.2 Estimating CACE in Every Leaf with Two Stage Least Squares Regressions
A suitable possibility to estimate the treatment effect in every leaf is to use, within every leaf of the tree , the Two Stage Least Squares (henceforth, TSLS) method for the estimation of the effect on the complier population, as it is presented in [References]. If one assumes the receipt of the treatment variable and the instrumental variable to be binary variables, our problem can be expressed in terms of 2 simultaneous regressions:
(42)  
(43) 
In the econometric terminology, the explanatory variable is , while the IV variable is .
The logic of IV regression is that one can estimate the above two reduced form regressions in the case of a single instrument by least squares. In particular, one can estimate through TSLS, as the following ratio [References, References]:
(44) 
where the resulting ratio is a consistent (though, in finite samples, not generally unbiased) estimator of the average causal effect on the population of compliers. If one runs a TSLS regression within every leaf of the tree, then one is able to obtain an estimate for every such leaf.
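With a single binary instrument and no additional covariates, the TSLS estimate can be computed as a ratio of sample covariances, and it coincides with the ratio between the intention-to-treat effect on the outcome and the one on the treatment receipt. The sketch below verifies this numerically on a tiny hand-made leaf; the data and the function names (`covariance`, `iv_ratio`, `wald_ratio`) are assumptions for illustration.

```python
from statistics import mean


def covariance(a, b):
    """Sample covariance (population normalization; it cancels in the ratio)."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (v - mb) for x, v in zip(a, b)) / len(a)


def iv_ratio(y, w, z):
    """IV estimate with one instrument: cov(y, z) / cov(w, z)."""
    return covariance(y, z) / covariance(w, z)


def wald_ratio(y, w, z):
    """(mean y | z=1 - mean y | z=0) / (mean w | z=1 - mean w | z=0)."""
    dy = mean(v for v, zi in zip(y, z) if zi == 1) - mean(v for v, zi in zip(y, z) if zi == 0)
    dw = mean(v for v, zi in zip(w, z) if zi == 1) - mean(v for v, zi in zip(w, z) if zi == 0)
    return dy / dw


# Hypothetical units falling in one leaf.
z = [1, 1, 1, 1, 0, 0, 0, 0]
w = [1, 1, 1, 0, 0, 0, 0, 1]
y = [5.0, 6.0, 4.0, 1.0, 2.0, 1.0, 0.0, 5.0]

iv_estimate = iv_ratio(y, w, z)
wald_estimate = wald_ratio(y, w, z)
print(iv_estimate, wald_estimate)
```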
A possible extension of (43) would be to include in the first stage regression all the possible confounding variables available in the dataset (in this case, denotes a scalar product):
(45) 
The idea is that, if the instrument is unconfounded only conditional on confounding variables, then one could include these covariates in the estimation of the treatment effect on the complier population within each leaf.
In every leaf, using the TSLS method, we can also obtain an estimate of the variance of our estimator, which corresponds to the Neyman estimated variance for the leaf of the tree (see (20)).
C. The CTIV Algorithm. Our proposed CTIV algorithm is summarized as follows.
Causal Tree with Instrumental Variable (CTIV)
Inputs: $N$ units $(X_i, Z_i, W_i, Y_i^{obs})$, where $X_i$ is the feature vector, $Z_i$ is the treatment assignment (instrumental variable), $W_i$ is the treatment receipt, and $Y_i^{obs}$ is the observed response.
Outputs: 1) a Causal Tree (determined by the use of the instrumental variable), and 2) estimates of the Complier Average Causal Effects on its leaves.

First Step of the Algorithm (Building the Tree)

Draw a random subsample from without replacement and divide it into two disjoint sets: a training set () and a validation set () of size with .

Grow a Causal Tree using the following procedure, which accounts for the presence of the instrumental variable:

estimate the propensity score, i.e., the probability of being assigned to the treatment;

drop units with an estimated propensity score below 0.1 or above 0.9 (so that units with extreme values of the estimated propensity score do not receive excessive weight);

grow a tree by maximizing the following in-sample goodness-of-fit criterion, for several values of the penalty parameter:
where is estimated on the training sample as in (30) in the case of randomization of the instrument or as in (39) if the instrument is not randomized, is the penalty parameter, and is the number of leaves, which measures the complexity of the model;



Second Step of the Algorithm (Estimating the Complier Average Causal Effects)

The complier average causal effect within a leaf can be estimated on the entire sample in two alternative ways:

if the instrument is not randomized but can be assumed to be unconfounded (Subsection III.B), then run a TSLS regression, conditioning on the confounding covariates in the first-stage regression.
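The preprocessing parts of the first step above (subsampling without replacement, splitting into training and validation sets, and trimming on the estimated propensity score) can be sketched as follows. The function names, the default split fraction, and the seed are assumptions made for illustration; the propensity scores are taken as given, since the estimation method is not specified here.

```python
import random


def split_train_validation(units, subsample_size, train_fraction=0.5, seed=0):
    """Draw a subsample without replacement and split it into two disjoint sets.

    train_fraction is a hypothetical default, not a value from the paper.
    """
    rng = random.Random(seed)
    subsample = rng.sample(units, subsample_size)  # sampling without replacement
    n_train = int(round(train_fraction * subsample_size))
    return subsample[:n_train], subsample[n_train:]


def trim_by_propensity(units, propensity, lower=0.1, upper=0.9):
    """Drop units whose estimated propensity score is below lower or above upper."""
    return [u for u, e in zip(units, propensity) if lower <= e <= upper]


# Usage with hypothetical unit identifiers and propensity scores.
train, validation = split_train_validation(list(range(100)), subsample_size=40)
kept = trim_by_propensity(["a", "b", "c", "d"], [0.05, 0.1, 0.5, 0.95])
print(len(train), len(validation), kept)
```

The thresholds 0.1 and 0.9 are the ones stated in the algorithm; everything else in the sketch is a placeholder.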

IV Comparison of the CT and CTIV Algorithms on Synthetic Data
In this section, we conduct simulations on synthetic data to compare the performance of the proposed CTIV algorithm with that of CT. As a goodness-of-fit measure, we use the negative of the Mean Squared Error of prediction (MSE) on the test set. To assess the relative performance of the two algorithms, we consider the following relative gap measure based on this MSE [References]:
(46) 
Moreover, we run some robustness checks. In particular, we examine what happens in the presence of a weak instrument, namely when the instrument is only weakly correlated with the treatment variable, and what happens when the instrument directly affects the outcome. While the presence of weak instruments is directly testable (typically, with an F-test on the first-stage regression), a violation of the exclusion restriction at the leaf level is not testable and is potentially harmful. Alternative algorithms, such as the one in [References], take the exclusion restriction into account only at a global level, whereas in this paper we take it into account at the leaf level. In a non-synthetic scenario, this assumption is not directly testable, but our algorithm appears to be more “transparent” than other algorithms, because it takes possible violations of this assumption into account.
A. Synthetic Data Construction. To compare our CTIV algorithm with the CT one, we first consider some scenarios where the assignment mechanism is irregular. As we saw in Subsection II.C, this means that the assignment to the treatment is randomized, but the receipt of the treatment is not. The general model that we use for our data simulation is a variation of the typical IV setting reported in (42) and (43). The major differences are that we introduce in the main equation (47) a nuisance term and an interaction term between the regressors and the treatment indicator, in order to make the treatment effects heterogeneous. The nuisance term can be thought of as an unobservable feature that affects both the treatment assignment and the outcome. The general setting looks as follows:
(47)  
(48) 
where is a dimensional vector of covariates, highlights those covariates that have an effect on the outcome, and (with ) those covariates that affect the treatment effect. We consider various functional forms for and and for the error distribution in the main equation (47), as well as for in the first stage equation (48). The designs investigated (with for design 1, and for the other cases) are reported in Table I.
Design  Form of the Model  Error 

1  
2  
3  
4  
5 
We train all five models on samples of increasing size (500, 1,000, 5,000, and 50,000 units). We implement a holdout cross-validation, assigning half of the observations to the training set (and validation set) and the other half to the test set. We let (considering independent features), , , and the nuisance parameter be a white noise. Moreover, we set the correlations between and and and to be respectively and . To make the trees comparable, we set the maximal depth of the tree to 2, and the minimal leaf size to one tenth of the sample size.
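To make the irregular-assignment setting concrete, the following sketch simulates a deliberately simplified data-generating process and compares two leaf-level estimates. It is not one of the designs in Table I: the functional forms, the coefficients, the compliance rule, and the two "leaves" (defined by the sign of a single covariate) are all assumptions chosen for illustration. The assignment is randomized, the receipt depends on the assignment and on an unobserved nuisance term, and the nuisance term also enters the outcome.

```python
import random
from statistics import fmean


def simulate(n, seed=0):
    """Simulate (x, z, w, y) tuples under an illustrative irregular assignment."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)              # observed covariate
        u = rng.gauss(0.0, 1.0)              # unobserved nuisance term
        z = 1 if rng.random() < 0.5 else 0   # randomized assignment
        w = 1 if (-1.0 + 2.0 * z + u) > 0 else 0  # receipt depends on z and u
        tau = 2.0 if x > 0 else 1.0          # heterogeneous effect (assumed)
        y = x + tau * w + u + rng.gauss(0.0, 1.0)
        data.append((x, z, w, y))
    return data


def diff_in_means(values, groups):
    """Mean of values where groups == 1 minus mean where groups == 0."""
    treated = [v for v, g in zip(values, groups) if g == 1]
    control = [v for v, g in zip(values, groups) if g == 0]
    return fmean(treated) - fmean(control)


def leaf_estimates(rows):
    """Return (comparison by receipt, IV ratio) for the units in one leaf."""
    z = [r[1] for r in rows]
    w = [r[2] for r in rows]
    y = [r[3] for r in rows]
    by_receipt = diff_in_means(y, w)  # ignores the instrument
    iv = diff_in_means(y, z) / diff_in_means(w, z)
    return by_receipt, iv


data = simulate(40000, seed=1)
for name, in_leaf in (("x > 0", lambda x: x > 0), ("x <= 0", lambda x: x <= 0)):
    by_receipt, iv = leaf_estimates([r for r in data if in_leaf(r[0])])
    print(name, round(by_receipt, 2), round(iv, 2))
```

In this toy setting the comparison by receipt is biased upwards, because units with a large nuisance term are both more likely to receive the treatment and have higher outcomes, whereas the IV ratio stays close to the leaf-specific effect.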
B. Simulation Results. The results of the simulations are reported in Table II, in terms of the mean squared error of prediction on the test set. As the relative gaps in Table II show, the CTIV algorithm outperforms the CT one in all designs. Comparing the various models by column, one observes that, with respect to the baseline case (design 1), the relative gap between the CTIV and CT algorithms widens as we add covariates (design 2), change the error distribution (designs 3 and 4), or change the functional form (design 5); this pattern is clearest at the smallest sample size (500). Moreover, as the sample size increases, the relative gap generally widens as well, most markedly between 500 and 1,000 units. From the values of the MSE it seems that, while the CT performance is quite stable, the CTIV performance improves as the sample size grows larger. This is especially true in designs 1 and 5.
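Table II: MSE of prediction on the test set and relative gap, by design and sample size (columns report the sample size).

| Design | Approach | 500 | 1,000 | 5,000 | 50,000 |
|---|---|---|---|---|---|
| 1 | MSE (CTIV) | 0.369 | 0.038 | 0.067 | 0.066 |
| 1 | MSE (Causal Tree) | 0.857 | 0.727 | 1.073 | 0.973 |
| 1 | Relative Gap | 57% | 94% | 94% | 93% |
| 2 | MSE (CTIV) | 0.239 | 0.058 | 0.058 | 0.052 |
| 2 | MSE (Causal Tree) | 0.778 | 0.787 | 1.028 | 0.994 |
| 2 | Relative Gap | 69% | 93% | 94% | 95% |
| 3 | MSE (CTIV) | 0.058 | 0.041 | 0.037 | 0.062 |
| 3 | MSE (Causal Tree) | 0.872 | 1.186 | 1.044 | 1.004 |
| 3 | Relative Gap | 93% | 97% | 96% | 94% |
| 4 | MSE (CTIV) | 0.072 | 0.053 | 0.051 | 0.050 |
| 4 | MSE (Causal Tree) | 0.851 | 1.052 | 1.098 | 1.004 |
| 4 | Relative Gap | 92% | 95% | 95% | 95% |
| 5 | MSE (CTIV) | 0.122 | 0.030 | 0.070 | 0.058 |
| 5 | MSE (Causal Tree) | 0.893 | 0.866 | 0.720 | 1.014 |
| 5 | Relative Gap | 86% | 96% | 90% | 94% |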
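The relative gaps can be related to the MSE values with a short computation. The sketch below assumes that the relative gap is the reduction in MSE achieved by CTIV, expressed as a fraction of the CT MSE; this definition is an assumption, although it agrees with the reported percentages to within about one percentage point (the published MSE values are rounded). The function name is hypothetical, and the numbers are the design 1 entries of Table II.

```python
def relative_gap(mse_ct, mse_ctiv):
    """Assumed definition: share of the CT MSE that is removed by CTIV."""
    return (mse_ct - mse_ctiv) / mse_ct


# Design 1 of Table II: sample size -> (MSE of CT, MSE of CTIV).
design_1 = {
    500: (0.857, 0.369),
    1000: (0.727, 0.038),
    5000: (1.073, 0.067),
    50000: (0.973, 0.066),
}

for size, (mse_ct, mse_ctiv) in design_1.items():
    print(size, f"{100 * relative_gap(mse_ct, mse_ctiv):.1f}%")
```

A gap close to 100% means that the CTIV MSE is negligible compared with the CT MSE, while a gap of 0% would mean that the two algorithms predict equally well.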
C. Robustness Checks. Having checked that the CTIV algorithm outperforms the CT algorithm on synthetic data, it is worth asking what happens when some of the assumptions on which the consistency of CTIV is built are partially violated. The main problem that can arise when applying the CTIV method is a well-known issue in the econometric literature, the weak instrument problem [References]. In our framework, this problem arises because the number of compliers within a leaf can be particularly small. In an econometric framework, the goal that one would like to achieve is to ensure that is bounded away from zero. In the following, we test what happens when the instrument is weak on the overall population, and what happens when the exclusion restriction is violated in a specific subpopulation. We test these violations on the second model design in Table I. In particular, we consider two scenarios. In the first scenario, we let the instrument be weak on the overall population, by setting . In the second scenario, we impose a partial violation of the exclusion restriction, by letting the instrumental variable directly affect the outcome when the feature satisfies the condition .
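Table III: MSE of prediction on the test set and relative gap in the two robustness scenarios, by sample size (columns report the sample size).

| Scenario | Approach | 500 | 1,000 | 5,000 | 50,000 |
|---|---|---|---|---|---|
| 1 | MSE (CTIV) | 0.439 | 0.157 | 0.120 | 0.194 |
| 1 | MSE (Causal Tree) | 0.881 | 0.898 | 1.270 | 1.252 |
| 1 | Relative Gap | 50% | 82% | 90% | 85% |
| 2 | MSE (CTIV) | 0.198 | 0.040 | 0.118 | 0.143 |
| 2 | MSE (Causal Tree) | 0.244 | 0.329 | 0.452 | 0.311 |
| 2 | Relative Gap | 19% | 87% | 74% | 54% |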
The results from the simulations, reported in Table III, show that the CTIV algorithm outperforms CT even in the presence of weak instruments. It is important to notice that, within every leaf, the weak-instrument test leads to the rejection of the null hypothesis of a weak instrument: our algorithm is able to identify those leaves where there is no weak-instrument problem. Moreover, our algorithm is robust even when the exclusion restriction is partially violated (second scenario). In this case, although the CT algorithm performs better than in the first scenario, thereby partially reducing the relative gap, CTIV still performs better in terms of the mean squared error of prediction. Since the causal effects are estimated in a second stage, after the tree has been built, our algorithm seems to handle well the possible problems due to a violation of the exclusion restriction within individual leaves. This might not hold if the exclusion restriction is taken into account only at a global level, as in [References].
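A leaf-level weak-instrument check of the kind mentioned above can be sketched as the F statistic of the first-stage regression of the treatment receipt on the instrument. With a single instrument and an intercept, this F statistic equals the squared t statistic of the slope. The function name and the toy data are hypothetical; the threshold against which the statistic is compared is not specified in the text (a common rule of thumb in the econometric literature treats values below about 10 as a sign of a weak instrument).

```python
from statistics import fmean


def first_stage_f(w, z):
    """F statistic for the slope in the regression of receipt w on instrument z."""
    n = len(w)
    w_bar, z_bar = fmean(w), fmean(z)
    s_zz = sum((zi - z_bar) ** 2 for zi in z)
    s_wz = sum((wi - w_bar) * (zi - z_bar) for wi, zi in zip(w, z))
    slope = s_wz / s_zz
    intercept = w_bar - slope * z_bar
    rss = sum((wi - intercept - slope * zi) ** 2 for wi, zi in zip(w, z))
    if rss == 0:
        return float("inf")  # perfect compliance: the fit is exact
    # Squared slope over its estimated variance, with n - 2 degrees of freedom.
    return slope ** 2 * s_zz / (rss / (n - 2))


# Hypothetical leaves with the same assignment pattern but different compliance.
z = [0, 0, 0, 0, 1, 1, 1, 1]
weaker_leaf = [0, 0, 0, 1, 1, 1, 1, 0]
stronger_leaf = [0, 0, 0, 0, 1, 1, 1, 0]
print(first_stage_f(weaker_leaf, z), first_stage_f(stronger_leaf, z))
```

Running the check separately in every leaf makes it possible to flag leaves with few compliers, where the denominator of the IV ratio is close to zero.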
V Case Study
A. Programs for the Development of Crafts in Tuscany (Italy). During the years 2003-2005, the Tuscan Regional Administration (Italy) introduced the “Programs for the Development of Crafts” (henceforth, PDC). These programs were aimed at small-sized Tuscan handicraft firms, with the goal of promoting innovation and regional development [References, References]. Firms could access the PDC through a voluntary application, subject to eligibility criteria. The objective of the PDC was to ease access to credit for small-sized firms, in order to boost investments, sales, and employment levels. The PDC call guaranteed soft loans to the firms that were considered eligible for the grant. Eligibility was evaluated on the basis of an investment project. The minimal admissible investment cost was 25 000 Euros, and the grant covered 60% of the financed investment [References]. Among firms participating in the PDC, the large majority of the projects were funded, and the percentage of insolvencies was lower than 3%. Data are available for firms that received the PDC, firms that applied for the funding but were not eligible, and firms that did not apply for the PDC. For our analysis, we use an integrated dataset including information collected by the “Artigian Credito Toscano” and information coming from the archives of the Chamber of Commerce (). The data are available for assisted firms (participating in 2003/05 PDC) and