Model Selection for Treatment Choice:
Penalized Welfare Maximization
This paper studies a new statistical decision rule for the treatment assignment problem. Consider a utilitarian policy maker who must use sample data to allocate one of two treatments to members of a population, based on their observable characteristics. In practice, it is often the case that policy makers do not have full discretion on how these covariates can be used, for legal, ethical or political reasons. We treat this constrained problem as a statistical decision problem, where we evaluate the performance of decision rules by their maximum regret. We focus on settings in which the policy maker may want to select amongst a collection of such constrained classes: examples we consider include choosing the number of covariates over which to perform best-subset selection, and model selection when approximating a complicated class via a sieve. We adapt and extend results from statistical learning to develop a decision rule which we call the Penalized Welfare Maximization (PWM) rule. We establish an oracle inequality for the regret of the PWM rule which shows that it is able to perform model selection over the collection of available classes. We then use this oracle inequality to derive relevant bounds on maximum regret for PWM. We illustrate the model-selection capabilities of our method with a small simulation exercise, and conclude by applying our rule to data from the Job Training Partnership Act (JTPA) study.
KEYWORDS: Treatment Choice, Minimax-Regret, Statistical Learning
JEL classification codes: C01, C14, C44, C52
1 Introduction

Footnote (continued from previous page): …NASMES 2017, and the Bristol Econometrics Study Group for helpful comments, as well as Nitish Keskar for help in implementing EWM. This research was supported in part through the computational resources and staff contributions provided for the Social Sciences Computing Cluster (SSCC) at Northwestern University. All mistakes are our own.
This paper develops a new statistical decision rule for the treatment assignment problem. A major goal of treatment evaluation is to provide policy makers with guidance on how to assign individuals to treatment, given experimental or quasi-experimental data. Following the literature inspired by Manski (2004) (a partial list in econometrics includes Dehejia, 2005; Schlag, 2007; Hirano and Porter, 2009; Stoye, 2009; Chamberlain, 2011; Bhattacharya and Dupas, 2012; Tetenov, 2012; Stoye, 2012; Kasy, 2014; Athey and Wager, 2017; Kitagawa and Tetenov, 2017; Armstrong and Shen, 2015), we treat the treatment assignment problem as a statistical decision problem of maximizing population welfare. Like many of the above papers, we evaluate our decision rule by its maximum regret.
Often, policy makers have observable characteristics at their disposal on which to base treatment; however, they may not always have full discretion over how these covariates can be used. For example, policy makers may face exogenous constraints on how they can use covariates for legal, ethical, or political reasons. Even in cases where policy makers have leeway in how they assign treatment, plausible modelling assumptions may imply certain restrictions on assignment. Kitagawa and Tetenov (2017) develop what they call the Empirical Welfare Maximization (or EWM) rule, whose primary feature is its ability to solve the treatment choice problem when exogenous constraints are placed on assignment. EWM will play an important role in the development of our rule, which we call the Penalized Welfare Maximization (PWM) rule.
The PWM rule is designed to address situations in which the policy maker can choose amongst a collection of constrained allocations. To be concrete, suppose we have two treatments, and we denote assignment into these treatments by partitioning the covariate space into two pieces. We can then think of constraints on assignment as constraints on the allowable subsets we can consider for the partition. Kitagawa and Tetenov (2017) focus on deriving bounds on maximum regret of the EWM rule for a fixed class of subsets of finite VC dimension (see Györfi et al. (1996) for a definition). In this paper, we consider settings where it may be beneficial to allow the planner to choose amongst a sequence of such classes. We establish an oracle inequality for the regret of the PWM rule which shows that it behaves as if we knew the “correct” class to use in the sequence. We then use this result to derive bounds on the maximum regret of the PWM rule in two important empirical settings.
The first setting we consider is one where the constraints imposed on the planner may not generate a class of finite VC dimension; in particular, we argue that the constraints implied by some reasonable assumptions may generate classes of infinite VC dimension. To solve this problem, we approximate the class by a sieve of classes of finite VC dimension. The strength of the PWM rule in this application is to provide a data-driven method by which to select an appropriate approximating class. In doing so, we derive bounds on the maximum regret of the PWM rule for a large set of classes of infinite VC dimension.
The second setting we consider is when the class of allocations may have large VC dimension relative to the size of the sample. This could arise, for example, if the planner has many covariates on which to base assignment. As is shown in Kitagawa and Tetenov (2017), when the constraints placed on assignment are too flexible relative to the sample size available, the EWM rule may suffer from overfitting, which can result in inflated values of regret. By the same mechanism that allows PWM to select an appropriate approximating class in our first application, we can use PWM in order to select amongst simpler subclasses in this setting as well. We illustrate PWM’s ability to reduce regret in a simulation study where the policy maker has many covariates on which to base treatment assignment, but does not know how many to use when performing best-subset selection.
The PWM rule is heavily inspired by the literature on model selection in classification: see for example the seminal work of Vapnik and Chervonenkis (1974), as well as Györfi et al. (1996), Koltchinskii (2001), Bartlett et al. (2002), Scott and Nowak (2002), Boucheron et al. (2005), Bartlett (2008), Koltchinskii (2008) among many others. The theoretical contribution of our paper is to modify and extend some of these tools to the setting of treatment choice. As pointed out in Kitagawa and Tetenov (2017), there are substantive differences between classification and treatment choice: observed outcomes are real-valued in the setting of treatment choice, and only one of the potential outcomes is observed for any given individual. When we say that we extend these tools, we mean that we prove results for settings where the data available to the policy maker is observational or quasi-experimental. As we will see, in such a setting the policy maker’s objective function contains an estimated quantity, which is not an issue that arises in the classification problem. In deciding which tools to extend, we have attempted to strike a balance between ease of use for practitioners, theoretical appeal, and performance in simulations. The connection between classification and treatment choice has been explored in various fields, including machine learning, under the label of policy learning (see Zadrozny, 2003; Beygelzimer and Langford, 2009; Swaminathan and Joachims, 2015; Kallus, 2016, among others), and in epidemiology under the label of individualized treatment rules (examples include Qian and Murphy, 2011; Zhao et al., 2012). Kitagawa and Tetenov (2017) and Athey and Wager (2017) provide a discussion on the link between these various literatures.
The remainder of the paper is organized as follows. In Section 2, we set up the notation and formally define the problem that the policy maker (i.e. the social planner) is attempting to solve. In Section 3, we introduce the PWM rule and present general results about its maximum regret. In Section 4, we perform a small simulation study to highlight PWM's ability to reduce regret when performing best-subset selection. In Section 5, we derive bounds on the maximum regret of the PWM rule when the planner is constrained to what we call monotone allocations, and then illustrate these results in an application to the JTPA study. Section 6 concludes.
2 Setup and Notation
Let $Y_i \in \mathbb{R}$ denote the observed outcome of a unit $i$, and let $D_i \in \{0,1\}$ be a binary variable which denotes the treatment received by unit $i$. Let $Y_i(1)$ denote the potential outcome of unit $i$ under treatment $1$ (which we will sometimes refer to as “the treatment”), and let $Y_i(0)$ denote the potential outcome of unit $i$ under treatment $0$ (which we will sometimes refer to as “the control”). The observed outcome for each unit is related to their potential outcomes through the expression:

$$Y_i = Y_i(1) D_i + Y_i(0)(1 - D_i). \qquad (1)$$
Let $X_i \in \mathcal{X} \subseteq \mathbb{R}^{d_x}$ denote a vector of observed covariates for unit $i$. Let $P$ denote the distribution of $(Y_i(1), Y_i(0), D_i, X_i)$; then we assume that the planner observes a size-$n$ i.i.d. random sample

$$\{(Y_i, D_i, X_i)\}_{i=1}^{n},$$

where $Y_i$ is jointly determined by $(Y_i(1), Y_i(0), D_i)$ and the expression in (1). Throughout the paper we will assume unconfoundedness, i.e.
(Unconfoundedness) The distribution $P$ satisfies:

$$(Y_i(1), Y_i(0)) \perp D_i \mid X_i.$$
This assumption asserts that, once we condition on the observable covariates, the treatment is exogenous. This assumption will hold in a randomized controlled trial (RCT), which is our primary application of interest, since the treatment is exogenous by construction. This assumption is sometimes also made (possibly tenuously) in observational studies; it is a key identifying assumption when using matching or regression estimators in policy evaluation settings with observational data (Imbens, 2004, provides a review of these techniques, and discusses the validity of Assumption 2.1 in economic applications).
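To fix ideas, here is a hypothetical simulated RCT consistent with this setup: treatment is a coin flip independent of the potential outcomes given covariates, and the observed outcome combines the potential outcomes as in expression (1). All distributional choices below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

X = rng.uniform(0, 1, size=n)                 # observed covariate
Y0 = X + rng.normal(0, 0.1, size=n)           # potential outcome under control
Y1 = X + 0.5 * (X > 0.5) + rng.normal(0, 0.1, size=n)  # under treatment
e = 0.5                                       # known propensity score (RCT)
D = rng.binomial(1, e, size=n)                # randomized assignment

Y = D * Y1 + (1 - D) * Y0                     # observed outcome, as in (1)
```

Because assignment is randomized independently of $(Y_0, Y_1)$ given $X$, unconfoundedness holds here by construction.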
The planner’s goal is to optimally assign the treatment to the population. The objective function we consider is utilitarian welfare, which is defined by the average of the individual outcomes in the population:

$$\tilde{W}(G) := E\left[Y_i(1)\,1\{X_i \in G\} + Y_i(0)\,1\{X_i \notin G\}\right],$$

where $G \subseteq \mathcal{X}$ represents the covariate values of those individuals to whom treatment 1 is assigned. The planner is tasked with choosing a treatment allocation $G$ using the empirical data. Using Assumption 2.1, we can rewrite the welfare criterion as:

$$\tilde{W}(G) = E\left[\frac{Y_i(1-D_i)}{1-e(X_i)}\right] + E\left[\left(\frac{Y_i D_i}{e(X_i)} - \frac{Y_i(1-D_i)}{1-e(X_i)}\right) 1\{X_i \in G\}\right],$$

where $e(x) := P(D_i = 1 \mid X_i = x)$ is the propensity score. Since the first term of this expression does not depend on $G$, we define the planner’s objective function given a choice of treatment allocation $G$ as:

$$W(G) := E\left[\left(\frac{Y_i D_i}{e(X_i)} - \frac{Y_i(1-D_i)}{1-e(X_i)}\right) 1\{X_i \in G\}\right].$$
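As a sketch of this objective in code, the following function computes the inverse-propensity-weighted sample analog of the planner's objective for a candidate allocation, assuming a known constant propensity score (as in an RCT). The function name and interface are our own illustration, not part of the paper's implementation.

```python
import numpy as np

def empirical_welfare(Y, D, in_G, e=0.5):
    """IPW sample analog of the planner's objective W(G).

    in_G is a 0/1 array with in_G[i] = 1{X_i in G}.
    """
    score = Y * D / e - Y * (1 - D) / (1 - e)   # conditional mean is tau(X)
    return np.mean(score * in_G)
```

For example, with `Y = [1, 2, 3, 4]`, `D = [1, 0, 1, 0]`, `in_G = [1, 1, 0, 0]`, and `e = 0.5`, the scores are `[2, -4, 6, -8]` and the empirical welfare of `G` is `-0.5`.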
Let $\mathcal{G}$ be the class of feasible treatment allocations $G \subseteq \mathcal{X}$. Here, when we say feasible, we mean that the planner may be restricted in what kinds of allocations they can (or want to) consider. For instance, it could be the case that the planner is not able to select certain treatment allocations for legal, ethical, or political reasons, or it could be that a specific application justifies only certain types of allocations. Consider the following three examples of $\mathcal{G}$:
Example 2.1. $\mathcal{G}$ could be the set of all measurable subsets of $\mathcal{X}$. This is the largest possible class of admissible allocations. It is straightforward to show that the optimal allocation in this case is as follows: define the conditional average treatment effect

$$\tau(x) := E[Y_i(1) - Y_i(0) \mid X_i = x];$$

then the optimal allocation is given by

$$G^{*} := \{x \in \mathcal{X} : \tau(x) \geq 0\}.$$
Example 2.2. Suppose $\mathcal{X} \subseteq \mathbb{R}$, and consider the class of threshold allocations:

$$\mathcal{G}_T := \{\{x : x \geq t\} : t \in \mathbb{R}\} \cup \{\{x : x \leq t\} : t \in \mathbb{R}\}.$$

Such a class could be reasonable, for example, when assigning scholarships to students: suppose the only covariate available to the planner is a student’s GPA; then it may be school policy that only threshold-type rules are to be considered.
Example 2.3. Suppose $\mathcal{X} = [0,1]^2$, and consider the class of monotone allocations: sets of the form

$$G_f := \{(x_1, x_2) \in [0,1]^2 : x_1 \geq f(x_2)\}$$

for some non-decreasing function $f$.
As an example, consider again the setting of assigning scholarships to students, but now suppose that the covariates available to the planner are a student’s GPA and parental income. It could then be school policy that the treatment allocation be such that the GPA requirement for a scholarship increases the higher the student’s parental income. In fact, even if the planner were not exogenously constrained to such an allocation, it may be the case that reasonable assumptions justify the use of such a restriction. For example, suppose that the outcome of interest depends only on a student’s innate “ability”, which is unobservable, and whether or not they receive a scholarship. Further suppose that the planner must use information on GPA and parental income to assign scholarships, which have a per-unit cost. Under some assumptions about the outcome equation, and the relationship between the distributions of ability, GPA, and parental income, it can be shown that the optimal allocation is in fact monotone. In Appendix B we work through this example in detail.
Given a feasible class $\mathcal{G}$, we denote the highest attainable welfare by:

$$W^{*}_{\mathcal{G}} := \sup_{G \in \mathcal{G}} W(G).$$
A decision rule $\hat{G}$ is a function mapping the observed data into the set of admissible allocations $\mathcal{G}$. We call the rule that we develop and study in this paper the Penalized Welfare Maximization (or PWM) rule. As in much of the literature that follows the work of Manski (2004), we assume that the planner is interested in rules that, on average, are close to the highest attainable welfare. To that end, the criterion by which we evaluate a decision rule is given by what we call maximum $\mathcal{G}$-regret:

$$\sup_{P \in \mathcal{P}} E_{P^n}\left[\sup_{G \in \mathcal{G}} W(G) - W(\hat{G})\right],$$

where $\mathcal{P}$ is the class of data generating processes under consideration.
We note that, in contrast to many papers on statistical treatment rules which employ maximum-regret criteria, this notion of regret is defined relative to the optimum attained in $\mathcal{G}$, which is not necessarily the first-best unrestricted optimum. Kitagawa and Tetenov (2017) and Athey and Wager (2017) are recent papers which also focus on the $\mathcal{G}$-regret criterion.
3 Penalized Welfare Maximization
In this section, we present the main results of our paper. In Section 3.1, we review the properties of the empirical welfare maximization (EWM) rule of Kitagawa and Tetenov (2017), which will motivate the PWM rule and serve as an important building block in its construction. In Section 3.2, we define the penalized welfare maximization rule and present bounds on its maximum $\mathcal{G}$-regret for general penalties. In Section 3.3, we illustrate these results by applying them to some specific penalties. In Section 3.4, we present results for a modification of the PWM rule for applications where the propensity score is not known and must be estimated.
3.1 Empirical Welfare Maximization: a Review and Some Motivation
The idea behind the EWM rule is to solve a sample analog of the population welfare maximization problem:

$$\hat{G}_{EWM} := \arg\max_{G \in \mathcal{G}} W_n(G), \qquad W_n(G) := \frac{1}{n}\sum_{i=1}^{n}\left(\frac{Y_i D_i}{e(X_i)} - \frac{Y_i(1-D_i)}{1-e(X_i)}\right) 1\{X_i \in G\}.$$
In general, this problem could be computationally challenging. However, Kitagawa and Tetenov (2017) show that solving it is practically feasible for many applications by formulating it as a Mixed Integer Linear Program (MILP): see Appendix C for details. Note that to solve this optimization problem, the planner must know the propensity score $e(\cdot)$. This assumption is reasonable if the data come from a randomized experiment, but clearly could not be made in a setting where the planner is using observational data. Kitagawa and Tetenov (2017) derive results for a modified version of the EWM rule where the propensity score is estimated, which we will review in Section 3.4.
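For the one-dimensional threshold class of Example 2.2, the sample analog can also be maximized by brute-force search over candidate thresholds, which coincides with the MILP solution for this simple class. The following sketch (with an assumed known constant propensity score, and restricted to upward thresholds for brevity) illustrates the idea:

```python
import numpy as np

def ewm_threshold(Y, D, X, e=0.5):
    """Return the threshold t maximizing empirical welfare of G = {x >= t}."""
    score = Y * D / e - Y * (1 - D) / (1 - e)
    # Candidate thresholds: -inf (treat everyone) plus each observed value.
    candidates = np.concatenate(([-np.inf], np.sort(X)))
    welfares = [np.mean(score * (X >= t)) for t in candidates]
    best = int(np.argmax(welfares))
    return candidates[best], welfares[best]
```

Only thresholds at observed covariate values (plus "treat everyone") need to be checked, since the empirical objective is constant between adjacent data points.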
To derive their non-asymptotic bounds on the maximum $\mathcal{G}$-regret of the EWM rule, Kitagawa and Tetenov (2017) make the following additional assumptions, which we will also require for our results:
(Bounded Outcomes and Strict Overlap) The set of distributions $\mathcal{P}$ has the following properties:
(i) There exists some $M < \infty$ such that the support of the outcome variable is contained in $[-M/2, M/2]$.
(ii) There exists some $\kappa \in (0, 1/2)$ such that $e(x) \in [\kappa, 1-\kappa]$ for all $x \in \mathcal{X}$.
The first assumption asserts that the outcome is bounded. Since the implementation of both the EWM rule and the PWM rule does not require that the planner know this bound, and since the existence of some bound on outcomes of interest to economics seems plausible, we feel that this assumption is tenable. The second assumption is standard when imposing unconfoundedness. In an RCT, this assumption will hold by design, but it may be violated in settings with observational data.
In order to derive their results, Kitagawa and Tetenov (2017) also make the following assumption, which we will not require:
(Finite VC Dimension) The class $\mathcal{G}$ has finite VC dimension $v$. (Footnote 1: It should be possible to derive analogous results by assuming that the class of treatment allocations has sufficiently small bracketing entropy (as in Tsybakov, 2004), or Hamming entropy (as in Athey and Wager, 2017). We will also not require these types of assumptions.)
Such an assumption may or may not be restrictive depending on the application in question. Consider Example 2.2, the class of threshold allocations on the real line. This class has VC dimension 2, and so the finite VC dimension assumption holds. On the other hand, it can be shown that the class of monotone allocations on the unit square introduced in Example 2.3 has infinite VC dimension (see Györfi et al. (1996)).
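The VC dimension claim for thresholds can be checked numerically: one-sided threshold rules achieve all four labelings of some two-point set, but at most six of the eight labelings of any three points on the line. A small self-contained illustration:

```python
def achievable(points):
    """Labelings of sorted `points` achievable by one-sided threshold rules."""
    # Candidate thresholds: below all points, between neighbors, above all.
    cand = ([points[0] - 1.0]
            + [(a + b) / 2.0 for a, b in zip(points, points[1:])]
            + [points[-1] + 1.0])
    labs = set()
    for t in cand:
        labs.add(tuple(int(x >= t) for x in points))  # allocations {x >= t}
        labs.add(tuple(int(x <= t) for x in points))  # allocations {x <= t}
    return labs

# Two points admit all 2^2 = 4 labelings, so the VC dimension is at least 2...
assert len(achievable([0.0, 1.0])) == 4
# ...but three points admit only 6 < 2^3 labelings (e.g. (1, 0, 1) is
# unachievable), so no three-point set on the line is shattered.
assert len(achievable([0.0, 1.0, 2.0])) == 6
```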
Under these assumptions, Kitagawa and Tetenov (2017) establish the following upper bound on the maximum $\mathcal{G}$-regret of the EWM rule:

$$\sup_{P \in \mathcal{P}} E_{P^n}\left[\sup_{G \in \mathcal{G}} W(G) - W(\hat{G}_{EWM})\right] \leq C \frac{M}{\kappa} \sqrt{\frac{v}{n}}, \qquad (3)$$

for some universal constant $C$. Moreover, when $X$ has sufficiently large support, they derive the following lower bound for any decision rule $\hat{G}$:

$$\sup_{P \in \mathcal{P}} E_{P^n}\left[\sup_{G \in \mathcal{G}} W(G) - W(\hat{G})\right] \geq c\, M \sqrt{\frac{v}{n}}, \qquad (4)$$

for a universal constant $c$ and $n$ sufficiently large. This shows that the rate of convergence of maximum $\mathcal{G}$-regret implied by (3) is the best possible, i.e. that no other decision rule can achieve a faster rate without imposing additional assumptions.
In fact, Theorem 2.2 in Kitagawa and Tetenov (2017), which establishes (4), implies another interesting result: if $X$ is continuously distributed and we do not impose additional restrictions on the distribution $P$, then it is impossible to derive a uniform rate on maximum $\mathcal{G}$-regret for any rule, for classes of infinite VC dimension. This is in line with the results derived in Stoye (2009), where he shows that, for any sample size, flipping a coin to assign individuals is minimax-regret optimal, despite this rule not even being pointwise consistent. Since we will be interested in classes of infinite VC dimension, we will revisit this problem later in Section 3.
As pointed out in Kitagawa and Tetenov (2017), the EWM rule is not invariant to positive affine transformations of the outcomes, and thus the researcher could manipulate the treatment rule in settings where they have leeway in how to code the outcome variable. To deal with this issue they suggest solving a demeaned version of the welfare maximization problem. In Appendix B we discuss the demeaned version of EWM and repeat the exercises of Sections 4 and 5 using a demeaned version of EWM and PWM.
As pointed out in Kitagawa and Tetenov (2017), because the bound given in (3) is valid for every $n$, it is immediate that we could use it to derive a rate of convergence for EWM over a sequence of classes whose VC dimension grows with the sample size at rate $n^{\alpha}$ for some $\alpha \in (0,1)$. We return to this observation in Remark 3.5 below.
3.2 Penalized Welfare Maximization: General Results
For the PWM rule, we consider situations in which the planner may want to choose amongst a sequence of subclasses of $\mathcal{G}$:

$$\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3, \ldots, \qquad \text{with } \mathcal{G}_k \subseteq \mathcal{G} \text{ for each } k.$$
Let $\hat{G}_k$ denote the EWM rule in the class $\mathcal{G}_k$. Then we can decompose the $\mathcal{G}$-regret of the rule $\hat{G}_k$ as follows:

$$\sup_{G \in \mathcal{G}} W(G) - W(\hat{G}_k) = \underbrace{\left[\sup_{G \in \mathcal{G}_k} W(G) - W(\hat{G}_k)\right]}_{\text{estimation error}} + \underbrace{\left[\sup_{G \in \mathcal{G}} W(G) - \sup_{G \in \mathcal{G}_k} W(G)\right]}_{\text{approximation error}}.$$

Given this decomposition, we call the first bracketed term the estimation error of the rule $\hat{G}_k$ in the class $\mathcal{G}_k$, and we call the second bracketed term the approximation error of the class $\mathcal{G}_k$. In typical applications, the larger $k$ is, the larger the estimation error and the smaller the approximation error. If we could derive sharp uniform bounds on these errors, then an appropriate choice of $k$ would be one that balances the tradeoff according to these bounds. In Theorem 3.1, we derive an oracle inequality which shows that PWM selects such a $k$ in a data-driven fashion. We use this feature of PWM to derive bounds on maximum regret in two settings of empirical interest.
The first setting we address is when $\mathcal{G}$ has infinite VC dimension (consider Examples 2.1 and 2.3). We will study situations where it is possible to “approximate” $\mathcal{G}$ with a sequence of classes of finite VC dimension in which EWM can be applied. We present examples of relevant approximating sequences in Examples 3.2 and 3.3 below. In Corollary 3.1 we establish a bound on maximum regret in this setting.
The second setting we address is when the sample size is small but the class $\mathcal{G}$ is relatively complex. This situation could arise when the planner has many covariates on which to base treatment. The bound on regret given by (3) is worse the larger $v$ is and the smaller $n$ is: this is because complex classes can “overfit” the data in small samples. In a situation where $v$ is large relative to $n$, it may be beneficial to perform EWM in a class $\mathcal{G}_k$ of smaller VC dimension, so that the bound on the estimation error

$$\sup_{G \in \mathcal{G}_k} W(G) - W(\hat{G}_k)$$

is small. On the other hand, this will only be useful if the approximation error

$$\sup_{G \in \mathcal{G}} W(G) - \sup_{G \in \mathcal{G}_k} W(G)$$

is small as well: here we face the same tradeoff between estimation and approximation error as above. We consider the example of selecting the number of covariates over which to perform best-subset selection with threshold allocations in Example 3.1 below. In Corollary 3.2 we establish a bound on maximum regret in this setting.
We consider the following assumption on our sequence of classes, which we call a sieve sequence of $\mathcal{G}$:

The sequence of classes $\{\mathcal{G}_k\}_{k \geq 1}$, with $\mathcal{G}_k \subseteq \mathcal{G}$ for each $k$, is such that each class $\mathcal{G}_k$ has VC dimension $v_k$, which is finite. (Footnote 2: Kitagawa and Tetenov (2017) additionally assume that their class is countable so as to avoid potential measurability concerns. We instead choose not to address these concerns explicitly, as is done in most of the literature on classification. See Van Der Vaart and Wellner (1996) for a discussion of possible resolutions to this issue.)
Interesting sequences may be finite or countable. We illustrate this with some examples:
Example 3.1. Recall the class of threshold allocations introduced in Example 2.2. Let $\mathcal{X} = [0,1]^2$, and define $\mathcal{T}_1$ to be the class of threshold allocations in the first covariate and $\mathcal{T}_2$ to be the class of threshold allocations in the second covariate. We can now define the set of two-dimensional threshold allocations on $[0,1]^2$:

$$\mathcal{G}^{(2)} := \left\{ G_1 \cap G_2 : G_1 \in \mathcal{T}_1,\ G_2 \in \mathcal{T}_2 \right\}.$$

To make this concrete, suppose $x_1$ is an age covariate and $x_2$ is an income covariate; then this class contains allocations of the form, for example, “receive treatment if age is above $t_1$ and income is below $t_2$” for some thresholds $t_1$ and $t_2$.
With $d$ available covariates, it is straightforward to extend this definition to the class of $d$-dimensional threshold allocations. For large $d$, this class could become complex relative to our sample size, and so we may want to base treatment only on a subset of the covariates: this is a variant of the best-subset selection problem, which has recently been studied in the classification context by Chen and Lee (2016). However, the question still remains as to how many covariates should be considered (that is, the size of the subset). In the two-covariate case, an interesting sieve sequence is given by the following: let $\mathcal{G}_0$, $\mathcal{G}_1$, and $\mathcal{G}_2$ be defined as

$$\mathcal{G}_0 := \{\emptyset, \mathcal{X}\}, \qquad \mathcal{G}_1 := \mathcal{G}_0 \cup \mathcal{T}_1 \cup \mathcal{T}_2, \qquad \mathcal{G}_2 := \mathcal{G}_1 \cup \mathcal{G}^{(2)}.$$

The sequence $(\mathcal{G}_0, \mathcal{G}_1, \mathcal{G}_2)$ corresponds to the threshold allocations that use zero, one, and two covariates, respectively (that each class has finite VC dimension follows from the fact that a class of threshold allocations in one dimension has finite VC dimension, and that unions of classes of finite VC dimension have finite VC dimension; see Dudley (1999)). (Footnote 3: Note that in this example, it is actually the case that $\mathcal{G}_1$ and $\mathcal{G}_2$ have the same VC dimension. This will not be the case when we move to settings in higher dimensions.) We thus see that PWM could be used to select the number of covariates over which to perform best-subset selection. We will revisit this example in the simulation study of Section 4, in a setting where the planner must select from potentially many covariates.
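The indexing of such a sieve can be sketched in code: class number k collects the threshold allocations that use at most k of the d available covariates, which we represent here simply by enumerating the admissible covariate subsets (the EWM step within each class is omitted, and the function name is our own illustration):

```python
from itertools import combinations

def sieve_subsets(d, k_max):
    """For k = 0..k_max, list the covariate subsets defining class G_k.

    The empty subset () corresponds to the "no covariates" allocations
    {empty set, X}; G_k nests G_0, ..., G_{k-1} by construction.
    """
    sieve = {}
    for k in range(k_max + 1):
        subsets = []
        for j in range(k + 1):
            subsets.extend(combinations(range(d), j))
        sieve[k] = subsets
    return sieve
```

With d = 3 covariates, class 2 of the sieve is built from the empty subset, the 3 singletons, and the 3 pairs, i.e. 7 covariate subsets in total.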
Example 3.2. Recall the class of monotone allocations introduced in Example 2.3. Suppose that $\mathcal{X} = [0,1]^2$, so that $\mathcal{G}$ has infinite VC dimension (see Györfi et al. (1996) for a proof of this fact). We will construct a useful sieve sequence for $\mathcal{G}$, where we approximate sets in $\mathcal{G}$ with sets that feature monotone, piecewise-linear boundaries. We proceed in three steps.
First define, for an integer $m \geq 1$ and $j \in \{0, 1, \dots, m\}$, the following function $f_{m,j} : [0,1] \to [0,1]$:

$$f_{m,j}(x) := \max\{0,\ 1 - |mx - j|\}.$$

The function $f_{m,j}$ is simply a triangular kernel whose base shifts with $j$ and is scaled by $1/m$. For example, $f_{m,1}$ is a triangular kernel with base $[0, 2/m]$, and $f_{m,2}$ is a triangular kernel with base $[1/m, 3/m]$. Next, using these functions, we define the following classes $\mathcal{F}_m$:

$$\mathcal{F}_m := \left\{ \left\{ (x_1, x_2) \in [0,1]^2 : x_1 \geq \sum_{j=0}^{m} \beta_j f_{m,j}(x_2) \right\} : \beta \in \mathbb{R}^{m+1} \right\},$$

where $\beta := (\beta_0, \beta_1, \dots, \beta_m)$. These are a special case of what Kitagawa and Tetenov (2017) call generalized eligibility scores, which, as shown in Dudley (1999), have finite VC dimension. The intuition behind the class $\mathcal{F}_m$ is that it divides the covariate space into treatment and control regions such that the boundary is a piecewise-linear curve. Finally, to construct our approximating class $\mathcal{G}_m$, we will modify the class $\mathcal{F}_m$ so as to ensure that the resulting treatment allocations are monotone.
For $m$ an integer, let $D_m$ be the following $m \times (m+1)$ differentiation matrix:

$$D_m := \begin{pmatrix} -1 & 1 & 0 & \cdots & 0 \\ 0 & -1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -1 & 1 \end{pmatrix}.$$

Then $\mathcal{G}_m$ is defined as follows:

$$\mathcal{G}_m := \left\{ \left\{ (x_1, x_2) \in [0,1]^2 : x_1 \geq \sum_{j=0}^{m} \beta_j f_{m,j}(x_2) \right\} : \beta \in \mathbb{R}^{m+1},\ D_m \beta \geq 0 \right\},$$

for each integer $m$. Note that the purpose of the constraint $D_m \beta \geq 0$ is to ensure that $\beta_{j+1} \geq \beta_j$ for all $j$, which is what imposes monotonicity on the allocations. This construction, which we borrow from Beresteanu (2004), is useful as it imposes monotonicity through a linear constraint, which is ideal for our implementation of this sequence in Section 5. Proposition 5.1 provides a uniform rate at which the approximation error of $\mathcal{G}_m$ vanishes under some additional regularity conditions, and Corollary 5.1 derives the corresponding bound on maximum $\mathcal{G}$-regret of the PWM rule. It is important to mention that, under the regularity conditions we will impose, the class of monotone allocations is an example of a class for which bounds on maximum $\mathcal{G}$-regret exist for EWM, despite this class having infinite VC dimension (see Proposition 5.2). We will compare the bounds we derive for PWM to these bounds in the discussion following Corollary 5.1. In Section 5, we study the use of this sequence of approximating classes in an application to the JTPA study.
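The ingredients of this construction can be sketched in code, with the caveat that the exact basis and scaling used in the paper are not reproduced here; this is one standard variant of a triangular (hat) basis with a first-difference monotonicity constraint:

```python
import numpy as np

def hat(m, j, x):
    """Triangular kernel peaking at j/m, with base [(j-1)/m, (j+1)/m]."""
    return np.maximum(0.0, 1.0 - np.abs(m * x - j))

def boundary(beta, x, m):
    """Piecewise-linear boundary sum_j beta_j * hat(m, j, x)."""
    return sum(b * hat(m, j, x) for j, b in enumerate(beta))

def diff_matrix(m):
    """First-difference matrix D with (D beta)_j = beta_{j+1} - beta_j."""
    return np.eye(m + 1, k=1)[:m] - np.eye(m + 1)[:m]

# Non-decreasing coefficients satisfy the linear constraint D beta >= 0,
# which makes the boundary x1 = boundary(beta, x2, m) monotone in x2.
beta = np.array([0.2, 0.5, 0.9])            # m = 2
assert np.all(diff_matrix(2) @ beta >= 0)
```

At the knots x = j/m the boundary equals the coefficient beta_j, so requiring the coefficients to be non-decreasing is exactly what forces the piecewise-linear boundary to be non-decreasing.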
Example 3.3. Suppose the planner faces no restrictions on treatment assignment, so that $\mathcal{G}$ is the class of all measurable subsets of $\mathcal{X}$. Recall from Example 2.1 that the optimal allocation in this case is given by $G^{*} = \{x \in \mathcal{X} : \tau(x) \geq 0\}$. In this setting it may seem natural to employ the plug-in decision rule:

$$\hat{G}_{plug} := \{x \in \mathcal{X} : \hat{\tau}(x) \geq 0\},$$

where $\hat{\tau}$ is a non-parametric estimate of $\tau$. Under Assumption 2.1, many non-parametric estimators of $\tau$ are well understood. The Penalized Welfare Maximization rule could provide an interesting alternative to plug-in rules in this setting by considering a sequence of sieve classes formed by decision trees. Decision trees are popular rules in classification because of their natural interpretability. Intuitively, a decision tree recursively partitions the covariate space in such a way that the resulting decision rule can be understood as a series of “yes-or-no” questions involving the covariates. Using decision trees for the estimation of causal effects has recently become a popular idea (see for example Athey and Imbens, 2016; Wager and Athey, 2015). Although we do not explore decision trees extensively in this paper, in Appendix B we explain how we could accommodate them in our framework and relate them to recent work on the use of decision trees for treatment assignment, as presented in Kallus (2016) and Athey and Wager (2017). We also provide a preliminary comparison to plug-in decision rules.
Given a sieve sequence $\{\mathcal{G}_k\}$, let

$$\hat{G}_k := \arg\max_{G \in \mathcal{G}_k} W_n(G)$$

be the EWM rule in the class $\mathcal{G}_k$, where $W_n(\cdot)$ denotes the empirical welfare objective. Our goal is to select the appropriate class in which to perform EWM. We do this in the following way: for each class $\mathcal{G}_k$, suppose we had some measure $\hat{C}_n(k)$ of the amount of “overfitting” that results from using the rule $\hat{G}_k$ (we will be more precise about the nature of $\hat{C}_n(k)$ in a moment). Given such a measure, let $\{x_k\}$ be any increasing sequence of real numbers, and define the following penalized objective function:

$$\hat{W}_n(k) := W_n(\hat{G}_k) - \hat{C}_n(k) - \frac{x_k}{\sqrt{n}}.$$

Then the penalized welfare maximization rule is defined as follows:

$$\hat{G}_{PWM} := \hat{G}_{\hat{k}}, \qquad \hat{k} := \arg\max_{k} \hat{W}_n(k).$$

In words, the PWM rule considers the estimated welfare obtained by the EWM rule in each class $\mathcal{G}_k$ and penalizes it by a term that captures how much $\hat{G}_k$ may “overfit” the data; it then selects the class which best balances this tradeoff.
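The selection step itself is simple once the per-class empirical welfares and penalty estimates are in hand. The following schematic (which omits the additional vanishing term in the objective, and so corresponds to the finite-sequence case) picks the class maximizing the penalized empirical welfare; the interface is our own illustration:

```python
import numpy as np

def pwm_select(W_hat, C_hat):
    """Return the index k maximizing penalized empirical welfare W_hat[k] - C_hat[k]."""
    penalized = np.asarray(W_hat) - np.asarray(C_hat)
    return int(np.argmax(penalized))

# Example: the most complex class has the highest raw empirical welfare,
# but the middle class wins once overfitting is penalized.
k_hat = pwm_select([0.20, 0.50, 0.60], [0.00, 0.10, 0.40])
```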
Note that the PWM objective function includes an additional small term involving an increasing sequence $\{x_k\}$. This component of the objective is required for technical reasons when the approximating sequence is infinite, as it ensures that the classes get penalized at a sufficiently fast rate as $k$ increases. Ideally, we would like the penalty to be completely intrinsic to each class, but this technical device seems to be unavoidable, and similar devices are pervasive throughout the literature on model selection in classification: see Koltchinskii (2001), Bartlett et al. (2002), Boucheron et al. (2005), and Koltchinskii (2008). We will make three comments about this term. First, our results hold for any increasing sequence $\{x_k\}$, and the choice is reflected explicitly in the bounds that we derive. Second, if one is only interested in using PWM in settings where the sequence of classes is finite, then we will show that this term is not required. Third, in our experience this additional term is so small relative to the penalty that its presence is irrelevant when performing PWM in applications. For simplicity, unless otherwise specified, we will present all of our results with this term included, for a fixed choice of $\{x_k\}$.
It is worth commenting on why such a penalty is necessary in the first place. It seems reasonable that, given a sieve sequence $\{\mathcal{G}_k\}$, one could construct a deterministic sequence of class indices such that the resulting decision rule achieves an optimal balance between the estimation and approximation error (this is in fact just a small extension of the observation made in Remark 3.3). However, it is impossible to construct such a sequence that would apply to every set of allocations $\mathcal{G}$, approximating sequence $\{\mathcal{G}_k\}$, and class of data generating processes $\mathcal{P}$; the appropriate sequence would depend on the VC dimensions of the sieve sequence, which may be hard to bound, as well as on knowledge of the uniform rate of convergence of $\sup_{G \in \mathcal{G}_k} W(G)$ to $\sup_{G \in \mathcal{G}} W(G)$, which will depend on $\mathcal{G}$, $\{\mathcal{G}_k\}$, and the regularity conditions we are willing to impose on $\mathcal{P}$. Our method provides a way to find the appropriate class in which to maximize in general, and in a data-driven fashion.
Before stating our main results about the $\mathcal{G}$-regret of the PWM rule, we must formalize how the penalty $\hat{C}_n$ should behave. In this section we present high-level conditions that the penalty must satisfy, and in Section 3.3 we provide specific examples. We make the following assumption on the penalty $\hat{C}_n$:
There exist positive constants $c_1$ and $c_2$ such that $\hat{C}_n$ satisfies the following tail inequality for every $n$ and $k$, and for every $\epsilon > 0$:

$$P^n\left( W_n(\hat{G}_k) - W(\hat{G}_k) > \hat{C}_n(k) + \epsilon \right) \leq c_1 e^{-c_2 n \epsilon^2}.$$
We will provide some intuition for this assumption. Given an EWM rule $\hat{G}_k$, the value of the empirical welfare is given by $W_n(\hat{G}_k)$. From the perspective of $\mathcal{G}$-regret, what we would really like to know is the value of the population welfare $W(\hat{G}_k)$. Although this is not knowable, suppose we could define $\hat{C}_n(k)$ as $W_n(\hat{G}_k) - W(\hat{G}_k)$; then the penalized objective would be exactly $W(\hat{G}_k)$, up to the vanishing term. Since implementing such a penalty is impossible, we require our penalty to be a good upper bound on this quantity in order to obtain our results. We are now ready to state our main workhorse result: an oracle inequality that characterizes the $\mathcal{G}$-regret of the PWM rule:
This result forms the basis of all the results we present in Sections 3.2 and 3.3. It says that, at least from the perspective of pointwise (as opposed to maximum) $\mathcal{G}$-regret, the PWM rule is able to balance the tradeoff between the penalty and the approximation error, at the cost of adding two additional terms that vanish as $n \to \infty$. We comment on the nature of these terms in Remark 3.6 below. Note that this result as stated does not quite accomplish our initial goal of balancing the estimation and approximation error along our sieve sequence: it is possible to choose a $\hat{C}_n$ that satisfies Assumption 3.4 for which the penalty itself is too large. For this reason, we also impose the requirement that any penalty we consider should have the following additional property:
There exists a positive constant $c_3$ such that, for every $n$ and $k$, $\hat{C}_n(k)$ satisfies

$$E_{P^n}\left[\hat{C}_n(k)\right] \leq c_3 \frac{M}{\kappa} \sqrt{\frac{v_k}{n}},$$

where $v_k$ is the VC dimension of $\mathcal{G}_k$.

This assumption ensures that $\hat{C}_n(k)$ is comparable to the bound on the estimation error for EWM derived in (3).
Theorem 3.1 shows that the PWM rule is able to balance the tradeoff between the estimation and approximation errors, but the bound we derive introduces two additional terms. The second of these terms, at this level of generality, is hard to quantify. We will attempt to shed light on this term, for specific penalties, in Section 3.3.
The next result we present addresses our first setting of interest: choosing the appropriate approximating class when $\mathcal{G}$ has infinite VC dimension. It shows that, if there exists a uniform bound on the approximation error, then the maximum regret of the PWM rule behaves as if the class along the sieve had been selected appropriately. First, we restrict ourselves to a set of distributions for which there exists a uniform bound on the approximation error:
Let be a set of distributions such that
for a sequence , and non-decreasing as , as .
The first assumption asserts that we have a uniform bound on the approximation error. As we pointed out in Remark 3.1, an assumption of this type is necessary to derive a bound on maximum regret when the class has infinite VC dimension. The second assumption is made to highlight the following possibility: although Assumption 3.5 guarantees that we can satisfy this restriction with , it is possible that, once we have imposed that must lie in , an even tighter bound may exist on . We make this point to emphasize that PWM will balance the tradeoff between the estimation and approximation error according to the tightest possible bounds on and , regardless of whether or not we know these bounds for a given application.
As mentioned in Remark 3.5, if and were known, then we could achieve such a result with a deterministic sequence . The strength of the PWM rule, then, is that it achieves the same behavior for any class and approximating sequence without requiring knowledge of these quantities in practice. We will illustrate this result in our application in Section 5, in the setting of Example 3.2.
Our final result of Section 3.2 addresses our second setting of interest: the appropriate selection of a subclass when is complex relative to the sample size (for example, when selecting amongst many covariates in best-subset selection):
This result shows that, if the distribution is such that the optimum is achieved in , then the upper bound on regret for PWM is as if we had performed EWM in , even though this class cannot be known in practice.
3.3 Penalized Welfare Maximization: Some Examples of Penalties
This section serves two purposes. First, it illustrates the results of Section 3.2 with two concrete choices for the penalty . Second, the results help quantify the size of the extraneous term in the bound of Theorem 3.1 for these penalties, so as to address the concerns presented in Remark 3.6. The first penalty we present, the Rademacher penalty, is theoretically elegant but computationally burdensome. The second penalty we present, the holdout penalty, is very intuitive and much more tractable in applications. However, the holdout penalty involves a sample-splitting procedure that some may find unappealing. Both of the penalties share the property that they do not require the practitioner to know the VC dimensions of the approximating classes, which we feel is important to make the method broadly applicable.
3.3.1 The Rademacher Penalty
The first penalty we present is very attractive from a theoretical perspective, but is computationally burdensome. Let be the observed data. Then the Rademacher penalty is given by
where is defined as in equation (2), and are a sequence of i.i.d. Rademacher variables, i.e. they take on the values +1 and -1, each with probability one half.
To clarify the origin of this penalty, recall that must be a good upper bound on , which is the requirement of Assumption 3.4. Bounding such quantities is common in the study of empirical processes, and the usual first step is to use what is known as symmetrization, which gives the following bound:
It is thus this inequality that motivates the definition of . The concept of Rademacher complexity is pervasive throughout the statistical learning literature (see, for example, Koltchinskii (2001), Bartlett and Mendelson (2002), and Bartlett et al. (2002)). Note that the standard definition of Rademacher complexity differs slightly from the definition of our penalty: following Bartlett et al. (2002), we do not include the absolute value in our definition. Intuitively, Rademacher complexity measures a notion of complexity that is finer than VC dimension, and is at the same time computable from the data at hand. Furthermore, unlike the holdout penalty introduced in the next subsection, it allows both the objective function and the penalty to be estimated with all of the data.
We are thus able to refine Theorem 3.1 to the case of the Rademacher penalty.
We can now revisit the comment we made in Remark 3.6, about quantifying the size of the constants in the extraneous term of the bound. In Appendix B we perform a back-of-the-envelope calculation that provides insight into the size of , and compares it to the size of the universal constant derived in Kitagawa and Tetenov (2017).
Despite this penalty being theoretically appealing, implementing it in practical applications is problematic. The standard approach suggested in the statistical learning literature is to compute by simulation: first, we repeatedly draw samples of , then we solve the problem
for each draw, and then average the result. Unfortunately, the optimization problem to be solved in the second step is computationally demanding for most classes of interest, so that repeatedly solving it for multiple draws of is impractical. Moreover, this procedure must be repeated for each class , which makes it even more prohibitive.
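The simulation procedure just described can be sketched as follows. This is an illustrative toy implementation, not the paper's: `rademacher_penalty`, the argument names, and the brute-force inner maximization over an explicitly enumerated class are all assumptions made for clarity; for realistic classes the inner step is itself an EWM-type optimization, which is exactly the computational burden noted above.

```python
import numpy as np

def rademacher_penalty(scores, rules_on_x, n_draws=100, seed=0):
    """Approximate the Rademacher penalty by simulation.
    `scores`: the n-vector of IPW welfare contributions (assumed notation).
    `rules_on_x`: a (num_rules, n) 0/1 matrix, each row one candidate rule
    in the class evaluated at the sample covariates."""
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # i.i.d. Rademacher draws
        # inner maximization over the class: brute force, feasible only
        # because the class is explicitly enumerated here
        vals = rules_on_x @ (sigma * scores) / n
        total += vals.max()
    return total / n_draws
```

Since this must be recomputed for each class in the sieve, the cost compounds, as the text notes.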
In the next section, we present a penalty that is not only conceptually very simple, but easy to implement as well.
3.3.2 The Holdout Penalty
The second penalty we introduce is motivated by the following idea. First fix some number such that (for expositional clarity, suppose this is an integer; the results would continue to hold if one were instead to define ), and let . Given our original sample , let denote what we call the estimating sample, and let denote the testing sample. Now, using , compute for each . It seems intuitive that we could get a sense of the efficacy of by applying this rule to the subsample and computing the empirical welfare . We could then select the class that results in the highest empirical welfare .
It turns out this idea can be formalized in our framework by treating it as a PWM-rule on the estimating sample, with the following penalty: for each EWM rule estimated on , let
be the empirical welfare of the rule on and let
be the empirical welfare of the rule on . We define the holdout penalty to be
Now, recall that the PWM rule is given by
which, given the definition of , simplifies to
Hence we see that the PWM rule with the holdout penalty reproduces the intuition presented above (with the usual addition of the term; see Remark 3.4).
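The intuition above translates directly into code. The following is a minimal sketch under assumed notation, not the paper's implementation: the function name `holdout_select`, the user-supplied `fit_ewm` and `welfare` callables, and the random split are all hypothetical choices made for illustration.

```python
import numpy as np

def holdout_select(ks, y, t, x, e, fit_ewm, welfare, split=0.75, seed=0):
    """Holdout-based class selection: fit each class's EWM rule on the
    estimating sample, evaluate it on the testing sample, and return the
    class index and rule with the highest test-sample empirical welfare."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_est = int(split * len(y))          # e.g. three quarters of the sample
    est, test = idx[:n_est], idx[n_est:]
    best = None
    for k in ks:
        rule = fit_ewm(k, y[est], t[est], x[est], e)
        w = welfare(rule, y[test], t[test], x[test], e)
        if best is None or w > best[0]:
            best = (w, k, rule)
    return best[1], best[2]
```

The selected rule can then be refit on the full sample if desired; the theory above treats the procedure as a PWM rule on the estimating sample.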
We can perform the same analysis as in Remark 3.7. In doing so, we see that the difference between this result and the result in Proposition 3.1 is that sample-splitting introduces distortions into the constant terms through . Indeed, the tradeoff involved in splitting the sample into an estimating sample and a testing sample is reflected in these constants.
As noted in Remark 3.8, the bound we derive for the holdout penalty is similar to what we derive for the Rademacher penalty, but with inflated constants. The benefit of the holdout penalty, however, is that it is much more practical to implement. The only remaining issue is how to split the data. Deriving a data-driven procedure to choose the proportion is beyond the scope of this paper, but as a rule of thumb we have found that accurate estimation of the rule matters much more than the computation of . In other words, we recommend that the estimating sample constitute a large proportion of the original sample . Throughout Sections 4 and 5, we designate three quarters of the sample as the estimating sample.
3.4 Penalized Welfare Maximization: Estimated Propensity Score
In this section we present a modification of the PWM rule where the propensity score is not known and must be estimated from the data. This situation would arise if the planner had access to observational data instead of data from a randomized experiment. Before describing our modification of the PWM rule, we must review results about the corresponding modification of the EWM rule in Kitagawa and Tetenov (2017). The modification we consider here is what they call the e-hybrid EWM rule. Recall the EWM objective function as defined in equation (2). To define the e-hybrid EWM rule we modify this objective function by replacing with
where is an estimator of the propensity score, and is a trimming parameter such that for some . The e-hybrid EWM objective function is defined as follows:
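A minimal sketch of this modified objective, under assumed notation, is given below. The trimming step and the IPW form follow the description above, but everything else is hypothetical: the names `ehybrid_welfare` and `nw_propensity` are ours, and the crude Nadaraya-Watson first step is used purely as a stand-in for a nonparametric propensity-score estimator, not as the estimator analyzed in the paper.

```python
import numpy as np

def ehybrid_welfare(rule, y, t, x, e_hat, trim):
    """Empirical welfare with an estimated propensity score e_hat(x),
    discarding observations whose estimated score falls outside
    [trim, 1 - trim] (the trimming step); averages over retained
    observations for simplicity."""
    e = e_hat(x)
    keep = (e >= trim) & (e <= 1 - trim)
    g = rule(x[keep])
    score = (y[keep] * t[keep] / e[keep]
             - y[keep] * (1 - t[keep]) / (1 - e[keep]))
    return np.mean(score * g) if keep.any() else 0.0

def nw_propensity(x_train, t_train, bandwidth=0.1):
    """A crude Nadaraya-Watson (Gaussian-kernel) propensity estimator,
    a stand-in for the nonparametric first step."""
    def e_hat(x):
        d = (x[:, None] - x_train[None, :]) / bandwidth
        w = np.exp(-0.5 * d ** 2)
        return w @ t_train / w.sum(axis=1)
    return e_hat
```

The trimming guards against dividing by estimated scores near zero or one, which would otherwise make the IPW weights explode.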
In a recent paper, Athey and Wager (2017) argue that more sophisticated estimators of the welfare objective can improve performance relative to the e-hybrid rule, and derive corresponding bounds on the maximum regret of their procedure that feature smaller constants. Modifying our method using their techniques would be an interesting direction for future work.
Since we are now estimating the propensity score, we must impose additional regularity conditions on to guarantee a uniform rate of convergence. We make the following high-level assumption:
Given an estimator , let be a class of data generating processes such that
Although we do not explore low-level conditions that satisfy this assumption here, Kitagawa and Tetenov (2017) do so in their paper. To summarize their results: they show that if is a local polynomial estimator, and and the marginal distribution of satisfy certain smoothness conditions, then Assumption 3.7 is satisfied with , where is a constant that determines the smoothness of (more precisely, is the degree of the Hölder class to which must belong).
Let be the solution to the e-hybrid problem in a class of finite VC dimension; then Kitagawa and Tetenov (2017) derive the following bound on maximum -regret:
With a nonparametric estimator of , will generally be slower than , and will hence determine the rate of convergence.
We are now ready to present the construction of the corresponding e-hybrid PWM estimator. Let be an arbitrary class of allocations, and let be some approximating sequence for . Let be the hybrid EWM rule in the class . Let be our penalty for the hybrid PWM rule. We now require that the penalty satisfies the following properties:
(Assumptions on )
In addition to making assumptions about , we assume there exists an “infeasible penalty” with the following properties:
There exist positive constants and such that satisfies the following tail inequality for every , and for every :