Uniform Inference in High-Dimensional Dynamic Panel Data Models

Abstract

We establish oracle inequalities for a version of the Lasso in high-dimensional fixed effects dynamic panel data models. The inequalities are valid for the coefficients of the dynamic and exogenous regressors. Separate oracle inequalities are derived for the fixed effects. Next, we show how one can conduct uniformly valid simultaneous inference on the parameters of the model and construct a uniformly valid estimator of the asymptotic covariance matrix which is robust to conditional heteroskedasticity in the error terms. Allowing for conditional heteroskedasticity is important in dynamic models as the conditional error variance may be non-constant over time and depend on the covariates. Furthermore, our procedure allows for inference on high-dimensional subsets of the parameter vector of an increasing cardinality. We show that the confidence bands resulting from our procedure are asymptotically honest and contract at the optimal rate. This rate is different for the fixed effects than for the remaining parts of the parameter vector.

1 Introduction

Dynamic panel data models are widely used in economics and the social sciences. They are popular because workers, firms, and countries often differ due to unobserved factors. Furthermore, in many modern applications these units are sampled repeatedly over time, which makes it possible to model their dynamic development. However, so far no work has been done on how to conduct inference in the high-dimensional dynamic fixed effects model

(1.1)

where the presence of lags of the dependent variable allows for autoregressive dependence on its own past. The model further contains a vector of exogenous variables, individual specific fixed effects, and idiosyncratic error terms. Applications of panel data are widespread, ranging from wage regressions where one seeks to explain workers' salaries to models of economic growth determining the factors that impact growth over time for a panel of countries, as in Islam (1995).
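For orientation, a model of the kind just described is commonly written as in the sketch below. The symbols ($y_{i,t}$, $\alpha_l$, $x_{i,t}$, $\beta$, $\eta_i$, $\varepsilon_{i,t}$, $L$, $N$, $T$) are notation assumed for this illustration and need not coincide with the notation of (1.1).

```latex
% A dynamic fixed effects panel model with L lags (assumed notation):
%   alpha_l : coefficients on the lagged dependent variable
%   beta    : coefficients on the exogenous variables x_{i,t}
%   eta_i   : individual specific fixed effects
%   eps_{i,t}: idiosyncratic error terms
\[
  y_{i,t} \;=\; \sum_{l=1}^{L} \alpha_l\, y_{i,t-l} \;+\; x_{i,t}'\beta \;+\; \eta_i \;+\; \varepsilon_{i,t},
  \qquad i = 1,\dots,N, \quad t = 1,\dots,T .
\]
```

In this form the static panel data model is obtained by dropping the sum over lags.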

Recent years have witnessed a surge in availability of big data sets including many explanatory variables. For example, De Neve et al. (2012) have considered the effect of genes on happiness/life satisfaction. Controlling for many genes simultaneously clearly results in a vast set of explanatory variables, hence calling for techniques which can handle such a setting. High-dimensionality may also arise out of a desire to control for flexible functional forms by including various transformations, such as cross products, of the available explanatory variables. In the specific context of panel data models Andersen et al. (2012) investigated the causal effect of lightning density on economic growth using a US panel data set. These authors had access to a big set of control variables compared to the sample size. For this reason, they decided to investigate the effect of lightning using several subsets of control variables instead of including all control variables simultaneously as one would ideally do. In this paper we show how one can achieve this ideal by proposing an inferential procedure for high-dimensional dynamic panel data models.

Much progress has also been made on the methodological side in the last decade. Among the most popular procedures is the Lasso of Tibshirani (1996) which sparked a lot of research on its properties. However, until recently, not much work had been done on inference in high-dimensional models for Lasso-type estimators as these possess a rather complicated distribution even in the low dimensional case, see Knight and Fu (2000). This problem has been cleverly approached by unpenalized estimation after double selection by Belloni et al. (2012); Belloni et al. (2014) or by desparsification in Zhang and Zhang (2014); van de Geer et al. (2014); Javanmard and Montanari (2013); Caner and Kock (2014).

The focus in the above mentioned work has been almost exclusively on independent data and often on the plain linear regression model, while high-dimensional panel data has not been treated. Exceptions are Kock (2013) and Belloni et al. (2014) who have established oracle inequalities and asymptotically valid inference for a low-dimensional parameter in static panel data models, respectively. Caner and Zhang (2014) have studied the properties of penalized GMM, which can be used to estimate dynamic panel data models, in the case of fewer parameters than observations. To the best of our knowledge, no research has been conducted on inference in high-dimensional dynamic panel data models. Note that high-dimensionality may arise from three sources in the dynamic panel data model (1.1). These sources are the coefficients pertaining to the lagged left hand side variables, the coefficients pertaining to the exogenous variables, and the fixed effects. In particular, we shall see that (joint) inference involving a fixed effect behaves markedly differently from inference involving only the coefficients of the lagged left hand side variables and the exogenous variables. Furthermore, panel data differ from the classic linear regression model in that one does not have independence across time periods for any individual, as consecutive observations in time can be highly correlated. Ignoring this dependence may lead to gravely misleading inference even in low-dimensional panel data models. For that reason we shall make no assumptions on this dependence structure across time. Static panel data models are a special case of (1.1) corresponding to the absence of lagged left hand side variables.

Traditional approaches to inference in low-dimensional static panel data models have considered the fixed effects as nuisance parameters which have been removed by taking either first differences or demeaning the data over time for each individual; see, e.g., Wooldridge (2010); Arellano (2003); Baltagi (2008). In this paper we take the stand that the fixed effects may be of intrinsic interest. Thus we do not remove them by first differencing or demeaning. This allows us to test hypotheses simultaneously involving the fixed effects and the remaining parameters.

The two most common assumptions on the unobserved heterogeneities are the random and fixed effects frameworks. In the former, the unobserved heterogeneities are required to be uncorrelated with the remaining explanatory variables, while the latter does not impose any restrictions. Ruling out any correlation between the unobserved heterogeneities and the other covariates is often unreasonable. In this paper we strike a middle ground between the random and fixed effects settings: we do not require zero correlation between the unobserved heterogeneities and the other covariates, but we shall impose that the vector of unobserved heterogeneities is weakly sparse in a sense to be made precise in Section 2.2. We still refer to them as fixed effects as we treat them as parameters to be estimated, as is common in fixed effects settings. However, the reader should keep in mind that our setting is actually intermediate between the random and fixed effects settings.

In an interesting recent paper dealing with the low-dimensional case, Bonhomme and Manresa (2014) have assumed a different type of structure, namely grouping, on the fixed effects. However, in the high-dimensional setting we are considering, weak sparsity works well as just explained.

Our inferential procedure is closest in spirit to the one in van de Geer et al. (2014), which in turn builds on Zhang and Zhang (2014), who cleverly used nodewise regressions to desparsify the Lasso and to construct an approximate inverse of the non-invertible sample Gram matrix in the context of the linear regression model. In particular, we show how nodewise regressions can be used to construct one of the blocks of the approximate inverse of the empirical Gram matrix in dynamic panel data models. As opposed to van de Geer et al. (2014), we do not require the inverse covariance matrix of the covariates to be exactly sparse. It suffices that the rows of the inverse covariance matrix are weakly sparse. Thus, none of its entries needs to be zero.

We contribute by first establishing an oracle inequality for a version of the Lasso in dynamic panel data models for all groups of parameters. As can be expected, the fixed effects turn out to behave differently from the remaining parameters. Next, we show how joint asymptotically gaussian inference may be conducted on the three types of parameters in (1.1). In particular, we show that hypotheses involving an increasing number of parameters can be tested and provide a uniformly consistent estimator of the asymptotic covariance matrix which is robust to conditional heteroskedasticity. Thus, we introduce a feasible procedure for inference in high-dimensional heteroskedastic dynamic panel data models. Allowing for conditional heteroskedasticity is important in dynamic models like the one considered here as the conditional variance is known to often depend on the current state of the process, see e.g. Engle (1982). Thus, assuming the error terms to be independent of the covariates with a constant variance is not reasonable. Next, we show that confidence bands constructed by our procedure are asymptotically honest (uniform) in the sense of Li (1989) over a certain subset of the parameter space. Finally, we show that the confidence bands have uniformly the optimal rate of contraction for all types of parameters. Thus, the honesty is not bought at the price of wide confidence bands as is the case for sparse estimators, cf. Pötscher (2009). Simulations reveal that our procedure performs well in terms of size, power, and coverage rate of the constructed intervals.

The rest of the paper is organized as follows. Section 2 introduces the estimator and provides an oracle inequality for all types of parameters. Next, Section 3 shows how limiting gaussian inference may be conducted and provides a feasible estimator of the covariance matrix which is robust to heteroskedasticity, even in the case where the number of parameter estimates for which we seek the limiting distribution diverges with the sample size. Section 4 shows that confidence intervals constructed by our procedure are honest and contract at the optimal rate for all types of parameters. Section 5 studies our estimator in Monte Carlo experiments while Section 6 concludes. All the proofs of our results are deferred to Appendix A; Appendix B contains further auxiliary lemmas needed in Appendix A.

2 The Model

2.1 Notation

For , let , , and denote the , , and norms, respectively. Let denote the unit column vector with th entry being 1 in some Euclidean space whose dimension depends on the context. If the argument of is a matrix, then denotes the maximal absolute element of the matrix. For some generic set , let denote the vector obtained by extracting the elements of whose indices are in , where denotes the cardinality of ; . For an matrix , denotes the submatrix consisting of the rows and columns indexed by . is the Kronecker product. Let and denote and , respectively. For two real sequences and , means that for some fixed, finite and positive constant for all . For two deterministic sequences and we write if there exist constants such that for all . is the sign function. and are the maximal and minimal eigenvalues of the argument, respectively. For some vector , gives a diagonal matrix with supplying the diagonal entries.

The model in (1.1) can be rewritten as

(2.1)

where and are vectors (). Note that the dimensions of , and can vary with dimensions and but in general we suppress this dependence where no confusion arises. We assume that initial observations are available for .

The three sources of high-dimensionality in (2.1) are , and as all of these can be increasing sequences. Sometimes one thinks of the number of lags, , as being fixed and in that case only two sources remain. Next, (2.1) may be written more compactly as

where is a matrix, , , and is a vector of ones. Then, one can write

where , and . contains the fixed effects, , and . Finally, contains all parameters of the model. Thus the dynamic panel model (1.1) can be written more compactly as something resembling a linear regression model. There are several differences, however. First, blocks of rows in the data matrix may be heavily dependent. Second, we shall see that and have markedly different properties as a result of the fact that the probabilistic properties of the blocks of a properly scaled version of the Gram matrix pertaining to are very different. Third, imposing weak sparsity only on implies that the oracle inequalities which we use as a stepping stone towards inference do not follow directly from the technique in, e.g., Bickel et al. (2009). In fact, we do not get explicit expressions for the upper bounds but instead characterize them as solutions to certain quadratic equations in two variables.

2.2 Weak Sparsity and the Panel Lasso

Let denote the active set of lagged left hand side variables and with . is said to be (exactly or ) sparse when is small compared to . Exact sparsity is by now a standard assumption in high-dimensional econometrics and statistics. The unobserved heterogeneity is usually modeled as either random or fixed effects. The former rules out correlation between the unobserved heterogeneity and the remaining covariates. This is often too restrictive. In the fixed effects approach no restrictions are imposed on this correlation. As explained in the introduction, our fixed effects approach is in fact a middle ground between pure random and fixed effects approaches. We choose to call it a fixed effects approach as the unobserved heterogeneity is treated as a parameter to be estimated. However, it is not entirely unrestricted and is assumed to be weakly sparse in the sense

for some and . Weak sparsity does not require any of the fixed effects to be zero but instead restricts the "sum" of all the fixed effects. This bound can be large in the sense that it tends to infinity, but the smaller it is, the sharper our results will be. It is appropriate to stress that the fixed effects cannot be entirely unrestricted; that is why our setting is a middle ground between random and fixed effects. Thus, our framework also excludes many models of interest. We believe, however, that our results provide a useful first step towards uniform inference in high-dimensional dynamic panel data models, and we certainly allow for more correlation between the fixed effects and the covariates than the random effects assumption does.

Note that the presence of many control variables in a high-dimensional model leaves less variation to be explained by the unobserved heterogeneities, and these are therefore likely to be small in magnitude, making the weak sparsity assumption reasonable. Thus, weak sparsity actually becomes more reasonable the larger the number of control variables is.

Weak sparsity is a strict generalization of exact sparsity in the sense that if only elements of are non-zero and none of these exceeds a constant , then . Thus, works. Alternatively, exact sparsity of can be handled as the boundary case upon defining such that will equal the number of non-zero entries of .
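The relation between exact and weak sparsity can be checked numerically. The sketch below assumes that the weak sparsity measure is the sum of the fixed effects' absolute values raised to a power strictly between 0 and 1; the names `weak_sparsity_sum`, `eta`, `nu`, `s` and `C` are hypothetical and chosen only for this illustration.

```python
def weak_sparsity_sum(eta, nu):
    """Return sum_i |eta_i|**nu for 0 < nu < 1 (assumed form of the weak sparsity measure)."""
    if not 0.0 < nu < 1.0:
        raise ValueError("nu must lie strictly between 0 and 1")
    # Zero entries contribute nothing to the sum.
    return sum(abs(e) ** nu for e in eta if e != 0.0)


# Exactly sparse case: s = 3 non-zero effects, none exceeding C = 2 in absolute value.
s, C, nu = 3, 2.0, 0.5
eta_sparse = [0.0] * 97 + [2.0, -1.5, 0.5]
sparse_sum = weak_sparsity_sum(eta_sparse, nu)
assert sparse_sum <= s * C ** nu  # the bound s * C**nu is attained at worst

# Weakly but not exactly sparse: every effect is non-zero, yet the sum stays bounded.
eta_dense = [1.0 / (i + 1) ** 4 for i in range(100)]
dense_sum = weak_sparsity_sum(eta_dense, nu)
print(sparse_sum, dense_sum)
```

The second vector has no zero entries at all, so it is not exactly sparse for any sparsity index below 100, while its weak sparsity sum remains small.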

2.3 The Objective Function and Assumptions

Our starting point for inference is the minimiser of the following panel Lasso objective function

(2.2)

As usual is a positive regularization sequence. Note that we penalize and differently to reflect the fact that we have observations to estimate for while only observations are available to estimate each . Penalizing the fixed effects is not new and was already done in Koenker (2004) and Galvao and Montes-Rojas (2010) in a low dimensional panel-quantile model. Furthermore, the penalization fits well with the weak sparsity assumption on the fixed effects and may increase efficiency of as found in Galvao and Montes-Rojas (2010).

For practical implementation it is very convenient that we only have one penalty parameter instead of having separate penalty parameters for and . The minimization problem can be solved easily as it simply corresponds to a weighted Lasso with known weights. However, the probabilistic analysis of the properly scaled Gram matrix is different from the one for the standard Lasso as it must be broken into several steps. We now turn to the assumptions needed for our inferential procedure.
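The remark that the problem is a weighted Lasso with known weights can be illustrated with a small coordinate descent routine in which every coefficient carries its own penalty. This is only a sketch under stated assumptions: the function names, the toy data, the penalty level `lam` and in particular the relative weight `lam / N ** 0.5` on the fixed effects are hypothetical choices made for the illustration and are not taken from (2.2).

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of t * |.|."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso(X, y, weights, sweeps=200):
    """Minimise (1/n)*||y - X b||^2 + 2*sum_j weights[j]*|b_j| by coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # residual y - X b, kept up to date after every coordinate update
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * b[j]
            new = soft_threshold(rho, weights[j]) / col_sq[j]
            delta = new - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new
    return b


def panel_design(Z, N, T):
    """Append one dummy column per individual to the rows of Z (ordered by i, then t)."""
    return [Z[i * T + t] + [1.0 if k == i else 0.0 for k in range(N)]
            for i in range(N) for t in range(T)]


# Toy data: N individuals, T periods, p regressors, small fixed effects.
random.seed(0)
N, T, p = 8, 10, 3
Z = [[random.gauss(0, 1) for _ in range(p)] for _ in range(N * T)]
eta = [0.1 * random.gauss(0, 1) for _ in range(N)]
y = [2.0 * Z[i * T + t][0] + eta[i] + 0.1 * random.gauss(0, 1)
     for i in range(N) for t in range(T)]

X = panel_design(Z, N, T)
lam = 0.05                                   # hypothetical penalty level
weights = [lam] * p + [lam / N ** 0.5] * N   # assumed relative weights, illustration only
coef = weighted_lasso(X, y, weights)
gamma_hat, eta_hat = coef[:p], coef[p:]
```

Because the weights are known in advance, a single tuning parameter governs both blocks of coefficients, which is the practical convenience noted above.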

Assumption 1.
  • is an independent sequence and

Assumption 1 imposes independence across individuals, which is standard in the panel data literature, see e.g. Wooldridge (2010) or Arellano (2003). Note, however, that we do not assume the data to be identically distributed across individuals. Assumption 1 also implies, by iterated expectations, that the error terms form a martingale difference sequence with respect to the filtration generated by the variables in the above conditioning set and thus restricts the degree of dependence in the error terms across time (in particular, they are uncorrelated). However, it still allows for considerable dependence over time, as higher moments than the first are not restricted. Furthermore, the error terms need not be identically distributed over time for any individual. Note that the increasing number of lags of the dependent variable also whitens the error terms. We also note that Assumption 1 does not rule out that the error terms are conditionally heteroskedastic. In particular, they may be autoregressively conditionally heteroskedastic (ARCH). In panel data terminology, both the lags of the dependent variable and the covariates are called predetermined or weakly exogenous. Finally, one can of course also include lags of the covariates as these are also weakly exogenous.
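To see how error terms can be uncorrelated over time and still dependent through higher moments, one may simulate an ARCH(1) process, which is a martingale difference sequence with a conditional variance that depends on the past. The sketch below is purely illustrative; the function names and the parameter values `omega=0.7` and `a=0.3` are assumptions made for this example.

```python
import random


def simulate_arch1(n, omega, a, seed=0):
    """Simulate eps_t = sigma_t * z_t with sigma_t^2 = omega + a * eps_{t-1}^2, z_t ~ N(0, 1)."""
    rng = random.Random(seed)
    eps, prev = [], 0.0
    for _ in range(n):
        sigma2 = omega + a * prev ** 2  # conditional variance depends on the previous error
        prev = sigma2 ** 0.5 * rng.gauss(0.0, 1.0)
        eps.append(prev)
    return eps


def lag1_autocorr(x):
    """Sample first-order autocorrelation of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[t] - mean) * (x[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in x)
    return num / den


eps = simulate_arch1(50000, omega=0.7, a=0.3)
acf_levels = lag1_autocorr(eps)                     # near zero: the errors are uncorrelated
acf_squares = lag1_autocorr([e * e for e in eps])   # clearly positive: the variance is predictable
print(acf_levels, acf_squares)
```

The levels show no linear dependence while the squares do, which is the kind of conditional heteroskedasticity that a constant-variance assumption would rule out.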

    In order to introduce the next assumption define the scaled empirical Gram matrix

    When , is singular. However, to conduct inference it suffices that a compatibility type condition tailored to the panel data structure is satisfied. To be precise, define for integers and

    which is reminiscent of the restricted eigenvalue condition in Bickel et al. (2009). We will need to be bounded away from zero for and being a sequence made precise in Appendix A depending on the degree of weak sparsity of the fixed effects. To bound away from zero consider where

    We will see that in order for to be bounded away from zero it suffices that is bounded away from zero and being close to in an appropriate sense. Writing , where and , note that by the block diagonal structure of

    The above estimates are useful as they show that we really only have to consider minimization over the upper left submatrix in the definition of . To be precise,

    (2.3)

    Thus, is a uniform lower bound for and in order for to be bounded away from zero it suffices to assume that

    Assumption 2.

    is uniformly bounded away from zero.

    Assumption 2 is rather innocent, as it is trivially satisfied when the relevant population second moment matrix is positive definite, a condition which is typically imposed. Compatibility type conditions are standard in the literature and various versions and their interrelationship have been investigated in van de Geer et al. (2009).

    Assumption 3.

    There exist positive constants and such that

    1. are uniformly subgaussian; that is, for every , and .

    2. are uniformly subgaussian; that is, for every , , and .

    In the context of the plain static regression model it is common practice to assume the error terms as well as the covariates to be subgaussian. However, this assumption is not as innocent in the context of the dynamic panel data model (1.1), as the dependent variable is generated by the model and its properties are thus completely determined by those of the covariates and the error terms as well as the parameters of the model. Lemma 2 in Appendix A shows that the dependent variable is subgaussian if the covariates and the error terms satisfy this property and the parameters are well-behaved. In particular, a wide class of (causal) stationary processes is included. Note also that Assumption 3 imposes subgaussianity of the initial values for all individuals. Caner and Kock (2014) have derived results similar to ours in a cross sectional setting without the subgaussianity assumption. However, the dimension of their model cannot increase as fast as here.

    2.4 The Oracle Inequalities

    With the above assumptions in place we are ready to state our first result. Defining , one has

    Theorem 1 (Oracle inequalities).

    Let Assumptions 1 - 3 hold. Then, choosing for some , the following inequalities are valid with probability at least

    for positive constants and and ,

    Moreover, the above bounds are valid uniformly over .

    Theorem 1 provides oracle inequalities for the prediction error as well as the estimation error of the parameter vectors. While these bounds are of independent interest, we primarily use them as a means towards our ultimate end of conducting (joint) inference on the parameters of the model. We stress that the bounds in Theorem 1 are finite sample bounds; they hold for any fixed values of and . The novel feature of our oracle inequalities is that the "size" of the fixed effects, as measured by the weak sparsity bound of Section 2.2, is allowed to grow even when we want the upper bounds to go to zero. The special case of exact sparsity of corresponds to and being the sparsity index, say , of .

    We also note that the oracle inequalities are not obtained in an entirely standard manner, as the mixture of exact and weak sparsity in dynamic panel data models calls for a different proof technique which yields the upper bounds as solutions to certain quadratic equations. Furthermore, we remark that in analogy to oracle inequalities in the plain linear regression model the number of covariates may increase at an exponential rate in without preventing the right hand sides of the oracle inequalities from being small. Finally, we do not assume independence across for any individual thus altering the standard probabilistic analysis as well. Instead we use concentration inequalities for martingales to obtain bounds almost as sharp as in the completely independent case. If one restricts the dependence structure of for every to be, e.g., strongly mixing then one can use concentration inequalities for mixing processes such as in Merlevède et al. (2011). Restricting the dependence structure this way will allow and to increase faster. The focus on the -norm in the oracle inequalities for and is due to the fact that an upper bound in this norm will be particularly useful when developing our uniformly valid inference procedure in the following sections.

    3 Inference

    In this section we show how to conduct inference on and first discuss how desparsification as proposed in van de Geer et al. (2014) works in our context.

    3.1 The Desparsified Lasso Estimator

    First, observe that in (2.2) is convex in and in order for to be a minimiser of , 0 must belong to the subdifferential of at , i.e.

    where and are and vectors, respectively, such that with if for . Similarly, with if for . Hence,

    (3.1)

    Using that and multiplying by from the left yields

    In order to derive the limiting distribution of one would usually proceed by isolating which implies inverting . However, when , is not invertible. The idea of van de Geer et al. (2014) and Javanmard and Montanari (2013) is to circumvent this problem by using an approximate inverse of and controlling the asymptotic approximation error. Suppose that a matrix is a reasonable approximation to the inverse of . We shall explicitly construct in the next section. Then we may write

    where is the error resulting from using an approximate inverse of as opposed to an exact inverse. The term in the above display is the bias incurred by due to shrinkage of the parameters in (2.2). As this bias term is known one may add it back to in order to define the debiased estimator

    The new estimator is no longer sparse, as a bias correction term has been added to the sparse Lasso estimator. Therefore, we will also refer to it as the desparsified Lasso estimator in the dynamic panel context.
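As a point of comparison, the analogous construction in the plain linear regression model $y = X\beta + \varepsilon$ with $n$ observations, as in van de Geer et al. (2014), can be written out as follows. The symbols $\hat\beta$ (Lasso estimator), $\hat\Theta$ (approximate inverse), $\hat\Sigma = X^{\top}X/n$ and $\hat b$ (desparsified estimator) are generic notation assumed for this sketch; the panel version above additionally involves the block structure and scaling of the Gram matrix.

```latex
% Desparsified Lasso in the plain linear regression model (assumed generic notation).
\[
  \hat b \;=\; \hat\beta \;+\; \hat\Theta\, X^{\top}\bigl(y - X\hat\beta\bigr)/n .
\]
% Substituting y = X\beta + \varepsilon and \hat\Sigma = X^{\top}X/n:
\[
  \hat b - \beta
  \;=\; (\hat\beta - \beta) + \hat\Theta X^{\top}\varepsilon/n - \hat\Theta\hat\Sigma(\hat\beta - \beta)
  \;=\; \hat\Theta X^{\top}\varepsilon/n \;-\; \bigl(\hat\Theta\hat\Sigma - I\bigr)(\hat\beta - \beta).
\]
```

The first term is the one to which a central limit theorem applies, while the second is the approximation error from using $\hat\Theta$ in place of an exact inverse and must be shown to be asymptotically negligible.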

    (3.2)

    For any vector with we shall study the asymptotic behaviour of

    (3.3)

    A central limit theorem for as well as asymptotic negligibility of will yield asymptotically gaussian inference. Furthermore, we shall provide a uniformly consistent estimator of the asymptotic variance of even in the presence of conditional heteroskedasticity. A leading special case of (3.3) is when one is only interested in the asymptotic distribution of corresponding to being the th basis vector of . In general, we will be interested in the asymptotic distribution of a subset of the indices of with cardinality and shall show that asymptotically honest (uniformly valid) gaussian inference is possible in the presence of heteroskedasticity even for and simultaneously involving elements of and .

    3.2 Construction of

    As is clear from the discussion above we need a good choice for . In particular we shall show that

    works well. Here will be constructed using nodewise regressions as in van de Geer et al. (2014) and we show that this is possible even when the rows of are not independent and identically distributed. The construction of parallels the one in van de Geer et al. (2014) to a high extent but importantly for our context we do not need the rows of to be sparse for the nodewise regressions to work well. We will discuss the importance of this, once we have properly constructed . First, define

    (3.4)

    where is the th column of , is the submatrix of with ’s th column removed, and the vector . Thus, is the Lasso estimator resulting from regressing on . Next, define

    and as well as . Finally, we set . Let denote the th row of and let denote the th row of , both written as column vectors. Then, . For any , the KKT conditions for a minimum in (3.4) are

    (3.5)

    where is the subdifferential of evaluated at . Using this, the definition of , and yields

    (3.6)

    Thus, by the definition of , and as is bounded away from zero (we shall later argue rigorously for this)

    (3.7)

    Furthermore, the KKT conditions (3.5) can also be written as

    (3.8)

    which implies . Combining with (3.7) yields

    (3.9)

    which together with an oracle inequality for provides an upper bound on the th entry of in (3.3). In other words, (3.9) will be used to show the required asymptotic negligibility of in (3.3) by arguments made rigorous in the appendix.
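The nodewise construction and the resulting sup-norm bound can be sketched in code for the plain regression form used by van de Geer et al. (2014). Everything below is an illustration under assumptions: the function names, the toy design `Z`, the common penalty `lam`, and the normalisation of the Lasso objective are hypothetical and are not taken from (3.4) to (3.9).

```python
import random


def lasso_cd(X, y, lam, sweeps=500):
    """Minimise (1/n)*||y - X g||^2 + 2*lam*||g||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    g, resid = [0.0] * p, list(y)
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * g[j]
            shrunk = max(abs(rho) - lam, 0.0)
            new = (shrunk if rho > 0 else -shrunk) / col_sq[j]
            delta = new - g[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                g[j] = new
    return g, resid


def nodewise_inverse(Z, lam):
    """Row j of Theta is (1 in position j, -gamma_j elsewhere) divided by tau_j^2."""
    n, p = len(Z), len(Z[0])
    Theta, tau2 = [], []
    for j in range(p):
        others = [k for k in range(p) if k != j]
        X_minus = [[row[k] for k in others] for row in Z]
        gamma, resid = lasso_cd(X_minus, [row[j] for row in Z], lam)
        t2 = sum(r * r for r in resid) / n + lam * sum(abs(v) for v in gamma)
        theta_row = [0.0] * p
        theta_row[j] = 1.0 / t2
        for pos, k in enumerate(others):
            theta_row[k] = -gamma[pos] / t2
        Theta.append(theta_row)
        tau2.append(t2)
    return Theta, tau2


# Toy design with mildly correlated columns.
random.seed(1)
n, p, lam = 60, 5, 0.1
Z = []
for _ in range(n):
    common = random.gauss(0, 1)
    Z.append([0.5 * common + random.gauss(0, 1) for _ in range(p)])

Theta, tau2 = nodewise_inverse(Z, lam)
gram = [[sum(row[a] * row[b] for row in Z) / n for b in range(p)] for a in range(p)]

# Check the KKT consequence: max_k |(Theta_j' * gram)_k - 1{k = j}| <= lam / tau_j^2.
slack = []
for j in range(p):
    dev = max(abs(sum(Theta[j][m] * gram[m][k] for m in range(p)) - (1.0 if k == j else 0.0))
              for k in range(p))
    slack.append(dev - lam / tau2[j])
worst_slack = max(slack)
print(worst_slack)
```

The check confirms numerically that each row of the approximate inverse times the Gram matrix is within a penalty-sized band of the corresponding unit vector, which is what drives the negligibility of the approximation error.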

    3.3 Asymptotic Properties of the Approximate Inverse

    In order to show that is asymptotically gaussian one needs to understand the limiting behaviour of constructed above. We show that is close to

    in an appropriate sense. To this end, note that by Yuan (2010)

    (3.10)

    where is the th diagonal entry of , is the vector obtained by removing the th entry of the th row of , is the submatrix of with the th row and column removed, is the th row of with its th entry removed, is the th column of with its th entry removed. Next, let be the th element of and be all elements except the th. Define the vector

    such that

    (3.11)

    This shows that and differ only by a multiplicative constant. In particular, the th row of is exactly sparse if and only if is exactly sparse. More generally, we shall exploit below that weak sparsity of one implies weak sparsity of the other. Furthermore, defining we may write

    where by the definition of

    (3.12)

    Thus, in light of Theorem 1, it is sensible that the Lasso estimator defined in (3.4) is close to the population regression coefficients (we shall make this more formal in Appendix A). Next, defining

    observe . Thus, we can write where and is defined similarly to but with replacing for . Finally, let denote the th row of written as a column vector. In Lemma 1 below we will see that and are close to and , respectively such that is close to which is the desired control of . Write with , where and . Hence define

    with , and . In dynamic panel data models it may not be reasonable to assume that the rows of the inverse second moment matrix , i.e. are sparse. Paralleling Section 2.2 we shall instead assume that the are weakly sparse and assume that

    (3.13)

    for some and . Define