# Theory and methods of panel data models with interactive effects

## Abstract

This paper considers the maximum likelihood estimation of panel data models with interactive effects. Motivated by applications in economics and other social sciences, a notable feature of the model is that the explanatory variables are correlated with the unobserved effects. The usual within-group estimator is inconsistent. Existing methods for consistent estimation are either designed for panel data with short time periods or are less efficient. The maximum likelihood estimator has desirable properties and is easy to implement, as illustrated by the Monte Carlo simulations. This paper develops the inferential theory for the maximum likelihood estimator, including consistency, rate of convergence and the limiting distributions. We further extend the model to include time-invariant regressors and common regressors (cross-section invariant). The regression coefficients for the time-invariant regressors are time-varying, and the coefficients for the common regressors are cross-sectionally varying.

DOI: 10.1214/13-AOS1183. Volume 42, Issue 1 (2014), pages 142–170.

Jushan Bai (jushan.bai@columbia.edu) and Kunpeng Li (likp.07@sem.tsinghua.edu.cn)

Jushan Bai is supported by the NSF (SES-0962410). Kunpeng Li is supported by NSFC (71201031) and Humanities and Social Sciences of Chinese Ministry of Education (12YJCZH109).

AMS subject classifications: Primary 60F12, 60F30; secondary 60H12.

Keywords: Factor error structure, factors, factor loadings, maximum likelihood, principal components, within-group estimator, simultaneous equations.

## 1 Introduction

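This paper studies the following panel data models with unobservable interactive effects:

$$y_{it} = \alpha_i + x_{it}\beta + \lambda_i' f_t + e_{it}, \qquad i = 1, \ldots, N;\ t = 1, \ldots, T,$$

where $y_{it}$ is the dependent variable; $x_{it}$ is a $1 \times K$ row vector of explanatory variables; $\alpha_i$ is an intercept; the term $\lambda_i' f_t + e_{it}$ is unobservable and has a factor structure, $\lambda_i$ is an $r \times 1$ vector of factor loadings, $f_t$ is a vector of factors and $e_{it}$ is the idiosyncratic error. The interactive effects ($\lambda_i' f_t$) generalize the usual additive individual and time effects; for example, if $\lambda_i \equiv 1$, then $\alpha_i + \lambda_i' f_t = \alpha_i + f_t$.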

A key feature of the model is that the regressors $x_{it}$ are allowed to be correlated with the unobserved effects $(\alpha_i, \lambda_i, f_t)$. This situation is commonly encountered in economics and other social sciences, in which some of the regressors are decision variables that are influenced by the unobserved individual heterogeneities. The practical relevance of the model will be further discussed below. The objective of this paper is to obtain consistent and efficient estimation of the regression coefficient $\beta$ in the presence of correlations between the regressors and the factor loadings and factors.

The usual pooled least squares estimator or even the within-group estimator is inconsistent for the regression coefficient $\beta$. One method to obtain a consistent estimator is to treat the unobserved effects $(\alpha_i, \lambda_i, f_t)$ as parameters and estimate them jointly with $\beta$. The idea is “controlling through estimating” (controlling the effects by estimating them). This is the approach used in [8, 23] and [30]. While there are some advantages, an undesirable consequence of this approach is the incidental parameters problem. There are too many parameters being estimated, and the incidental parameters bias arises; see [26]. In [1, 2] and [17] the authors consider the generalized method of moments (GMM). The GMM method is based on a nonlinear transformation known as quasi-differencing that eliminates the factor errors. Quasi-differencing increases the nonlinearity of the model, especially with more than one factor. The GMM method works well when the number of time periods $T$ is small. When $T$ is large, the number of moment equations will be large, and the so-called many-moment bias arises. In [27], the author considers an alternative method by augmenting the model with additional regressors $\bar y_t$ and $\bar x_t$, which are the cross-sectional averages of $y_{it}$ and $x_{it}$. These averages provide an estimate for the factors $f_t$. The estimator of [27] becomes inconsistent when the factor loadings in the $y$ equation are correlated with those in the $x$ equation, as shown in [32]. A further approach to controlling the correlation between the regressors and factor errors is to use the Mundlak–Chamberlain projection ([24] and [15]). The latter method projects $\alpha_i$ and $\lambda_i$ onto the regressors $(x_{i1}, \ldots, x_{iT})$, so that $\lambda_i$ is written as a linear function of the regressors, with coefficients that are parameters to be estimated, plus a projection residual (a similar projection is done for $\alpha_i$). The projection residuals are uncorrelated with the regressors so that a variety of approaches can be used to estimate the model. This framework is designed for small $T$ and is studied by [9].
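The inconsistency of the within-group estimator can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes a single factor, a scalar regressor, standard normal errors and hypothetical values for the sample sizes and for the true coefficient. When the regressor contains the interactive effect, demeaning over time removes the additive individual effect but not the product of loading and factor, so the slope estimate stays away from the truth.

```python
import numpy as np

def within_group_estimate(correlated, N=200, T=50, beta=1.0, seed=0):
    """Within-group (individual-demeaned) OLS slope on one simulated panel.

    All design choices (one factor, normal draws, N, T, beta) are
    illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    lam = 1.0 + rng.normal(size=N)        # factor loadings lambda_i
    f = 1.0 + rng.normal(size=T)          # factors f_t
    alpha = rng.normal(size=N)            # additive individual effects alpha_i
    common = np.outer(lam, f)             # interactive effects lambda_i * f_t
    v = rng.normal(size=(N, T))           # idiosyncratic part of the regressor
    e = rng.normal(size=(N, T))           # idiosyncratic error of y
    x = common + v if correlated else v   # regressor may share the factor structure
    y = alpha[:, None] + beta * x + common + e
    # Within transformation: subtract each individual's time average.
    xd = x - x.mean(axis=1, keepdims=True)
    yd = y - y.mean(axis=1, keepdims=True)
    return float((xd * yd).sum() / (xd * xd).sum())

beta_true = 1.0
est_correlated = within_group_estimate(correlated=True)
est_uncorrelated = within_group_estimate(correlated=False)
print("regressor correlated with interactive effects:", round(est_correlated, 3))
print("regressor uncorrelated with interactive effects:", round(est_uncorrelated, 3))
```

In the correlated design the estimate is biased upward by a margin that does not shrink with the sample size, while in the uncorrelated design the same estimator is close to the true value.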

In this paper we consider the pseudo-Gaussian maximum likelihood method under large $N$ and large $T$. The theory does not depend on normality. In view of the importance of the MLE in the statistical literature, it is of both practical and theoretical interest to examine the MLE in this context. We develop a rigorous theory for the MLE. We show that there is no incidental parameters bias for the MLE of the regression coefficient $\beta$.

We allow time-invariant regressors such as education, race and gender in the model. The corresponding regression coefficients are time-dependent. Similarly, we allow common regressors, which do not vary across individuals, such as prices and policy variables. The corresponding regression coefficients are individual-dependent so that individuals respond differently to policy or price changes. In our view, this is a sensible way to incorporate time-invariant and common regressors. For example, wages associated with education and with gender are more likely to change over time rather than remain constant. In our analysis, time-invariant regressors are treated as the components of the factor loadings $\lambda_i$ that are observable, and common regressors as the components of the factors $f_t$ that are observable. This view fits naturally into the factor framework in which part of the factor loadings and factors are observable, and the maximum likelihood method imposes the corresponding loadings and factors at their observed values.

While the theoretical analysis of the MLE is demanding, the limiting distributions of the MLE are simple and have intuitive interpretations. The computation is also easy and can be implemented by adapting the ECM (expectation and constrained maximization) of [22]. In addition, the maximum likelihood method allows restrictions to be imposed on the factor loadings $\lambda_i$ or on the factors $f_t$ to achieve more efficient estimation. These restrictions can take the form of known values, being either zeros or other fixed values. Part of the rigorous analysis includes setting up the constrained maximization as a Lagrange multiplier problem. This approach provides insight into which kinds of restrictions provide efficiency gain and which kinds do not.

Panel data models with interactive effects have wide applicability in economics. In macroeconomics, for example, $y_{it}$ can be the output growth rate for country $i$ in year $t$; $x_{it}$ represents production inputs, and $f_t$ is a vector of common shocks (technological progress, financial crises); the common shocks have heterogeneous impacts across countries through the different factor loadings $\lambda_i$; $e_{it}$ represents the country-specific unmeasured growth rates. In microeconomics, and especially in earnings studies, $y_{it}$ is the wage rate for individual $i$ for period $t$ (or for cohort $t$), $x_{it}$ is a vector of observable characteristics such as marital status and experience; $\lambda_i$ is a vector of unobservable individual traits such as ability, perseverance, motivation and dedication; the payoff to these individual traits is not constant over time, but time varying through $f_t$; and $e_{it}$ is idiosyncratic variations in the wage rates. In finance, $y_{it}$ is stock $i$'s return in period $t$, $x_{it}$ is a vector of observable factors, $f_t$ is a vector of unobservable common factors (systematic risks) and $\lambda_i$ is the exposure to the risks; $e_{it}$ is the idiosyncratic returns. Factor error structures are also used as a flexible approach to trend modeling, as in [20]. Most panel data analyses assume cross-sectional independence; see, for example, [6, 13] and [18]. The factor structure is also capable of capturing the cross-sectional dependence arising from the common shocks $f_t$. Further motivation can be found in [7, 28, 29].

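Throughout the paper, the norm of a vector or matrix is that of Frobenius, that is, $\|A\| = [\operatorname{tr}(A'A)]^{1/2}$ for matrix $A$; $\operatorname{diag}(A)$ is a column vector consisting of the diagonal elements of $A$ when $A$ is a matrix, but $\operatorname{diag}(A)$ represents a diagonal matrix when $A$ is a vector. In addition, we use $\dot v_t$ to denote $v_t - \frac{1}{T}\sum_{t=1}^T v_t$ for any column vector $v_t$ and $M_{wv}$ to denote $\frac{1}{T}\sum_{t=1}^T \dot w_t \dot v_t'$ for any vectors $w_t$ and $v_t$.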
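The norm and the two meanings of the diag operator map directly onto standard NumPy calls. The following is a minimal sketch with an arbitrary example matrix; the variable names are illustrative only.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Frobenius norm computed from the trace definition and from NumPy's built-in.
fro_from_trace = np.sqrt(np.trace(A.T @ A))
fro_builtin = np.linalg.norm(A, "fro")

# diag of a matrix: the vector of its diagonal elements.
diag_vector = np.diag(A)

# diag of a vector: the diagonal matrix with that vector on the diagonal.
diag_matrix = np.diag(diag_vector)

print(fro_from_trace, fro_builtin)
print(diag_vector)
print(diag_matrix)
```

`np.diag` switches behavior on the dimension of its argument in the same way as the notation: a two-dimensional input yields a vector and a one-dimensional input yields a matrix.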

The rest of the paper is organized as follows. Section 2 introduces a common shock model and the maximum likelihood estimation. Consistency, rate of convergence and the limiting distributions of the MLE are established. Section 3 shows that if some factors do not affect the $y$ equation but only the $x$ equation, more efficient estimation can be obtained. Section 4 extends the analysis to time-invariant regressors and common regressors; the corresponding coefficients are time varying and cross-section varying, respectively. The computing algorithm is discussed in Section 5, and simulation results are reported in Section 6. The last section concludes. The theoretical proofs are provided in the supplementary document [10].

## 2 A common shock model

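In the common-shock model, we assume that both $y_{it}$ and $x_{it}$ are impacted by the common shocks $f_t$, so the model takes the form

$$
\begin{aligned}
y_{it} &= \alpha_i + x_{it1}\beta_1 + x_{it2}\beta_2 + \cdots + x_{itK}\beta_K + \lambda_i' f_t + e_{it}, \\
x_{itk} &= \mu_{ik} + \gamma_{ik}' f_t + v_{itk}
\end{aligned}
\tag{1}
$$

for $k = 1, 2, \ldots, K$. In cross-country output studies, for example, output and inputs (labor and capital) are both affected by the common shocks.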

The parameter of interest is the regression coefficient $\beta$. We also estimate the intercepts and the factor loadings in both equations. By treating the latter as parameters, we also allow arbitrary correlations between the loadings in the $y$ equation and those in the $x$ equations. Although we also treat the factors $f_t$ as fixed parameters, there is no need to estimate the individual $f_t$, but only the sample covariance of $f_t$. This is an advantage of the maximum likelihood method, which eliminates the incidental parameters problem in the time dimension. This kind of maximum likelihood method was used for pure factor models in [3, 4] and [11]. By symmetry, we could also estimate the individual factors $f_t$, but then we would only estimate the sample covariance of the factor loadings. The idea is that we do not simultaneously estimate the factor loadings and the factors (which would be the case for the principal components method). This reduces the number of parameters considerably. If the cross-section size $N$ is much smaller than the number of time periods $T$, treating factor loadings as parameters is preferable since there are fewer parameters.

Because of the correlation between the regressors and regression errors in the $y$ equation, the $y$ and $x$ equations form a simultaneous equation system; the MLE jointly estimates the parameters in both equations. The joint estimation avoids the Mundlak–Chamberlain projection and thus is applicable for large $N$ and large $T$.

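We assume the number of factors $r$ is fixed and known. Determining the number of factors is discussed in Section 6, where a modified information criterion proposed by [12] is used. Let $x_{it} = (x_{it1}, x_{it2}, \ldots, x_{itK})'$ collect the regressors, $\gamma_{ix} = (\gamma_{i1}, \gamma_{i2}, \ldots, \gamma_{iK})$ their factor loadings, $v_{itx} = (v_{it1}, v_{it2}, \ldots, v_{itK})'$ their idiosyncratic errors and $\mu_{ix} = (\mu_{i1}, \mu_{i2}, \ldots, \mu_{iK})'$ their intercepts. The second equation of (1) can be written in matrix form as

$$x_{it} = \mu_{ix} + \gamma_{ix}' f_t + v_{itx}.$$

Further let $\Gamma_i = (\lambda_i, \gamma_{ix})$, $z_{it} = (y_{it}, x_{it}')'$, $\varepsilon_{it} = (e_{it}, v_{itx}')'$, $\mu_i = (\alpha_i, \mu_{ix}')'$. Then model (1) can be written as

$$\begin{bmatrix} 1 & -\beta' \\ 0 & I_K \end{bmatrix} z_{it} = \mu_i + \Gamma_i' f_t + \varepsilon_{it}.$$

Let $B$ denote the coefficient matrix of $z_{it}$ in the preceding equation. Let $z_t = (z_{1t}', z_{2t}', \ldots, z_{Nt}')'$, $\Gamma = (\Gamma_1, \Gamma_2, \ldots, \Gamma_N)'$, $\varepsilon_t = (\varepsilon_{1t}', \varepsilon_{2t}', \ldots, \varepsilon_{Nt}')'$ and $\mu = (\mu_1', \mu_2', \ldots, \mu_N')'$. Stacking the equations over $i$, we have

$$(I_N \otimes B) z_t = \mu + \Gamma f_t + \varepsilon_t. \tag{2}$$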

To analyze this model, we make the following assumptions.

### 2.1 Assumptions

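###### Assumption A

The factor process $f_t$ is a sequence of constants. Let $M_{ff} = T^{-1}\sum_{t=1}^T \dot f_t \dot f_t'$, where $\dot f_t = f_t - T^{-1}\sum_{t=1}^T f_t$. We assume that $\overline{M}_{ff} = \lim_{T \to \infty} M_{ff}$ is a strictly positive definite matrix.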

The nonrandomness assumption for the factors $f_t$ is not crucial. In fact, $f_t$ can be a sequence of random variables such that $E(\|f_t\|^4) \le C < \infty$ uniformly in $t$, and $f_t$ is independent of the idiosyncratic errors $\varepsilon_s$ for all $s$. The fixed $f_t$ assumption conforms with the usual fixed effects assumption in the panel data literature and, in a certain sense, is more general than random $f_t$.

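###### Assumption B

The idiosyncratic errors are such that:

(B.1) The error $e_{it}$ of the $y$ equation is independent and identically distributed over $t$ and uncorrelated over $i$ with $E(e_{it}) = 0$ and $E(e_{it}^4) < \infty$ for all $i$ and $t$. Let $\Sigma_{iie}$ denote the variance of $e_{it}$.

(B.2) The error vector $v_{itx}$ of the $x$ equations is also independent and identically distributed over $t$ and uncorrelated over $i$ with $E(v_{itx}) = 0$ and $E(\|v_{itx}\|^4) < \infty$ for all $i$ and $t$. We use $\Sigma_{iix}$ to denote the variance matrix of $v_{itx}$.

(B.3) $e_{it}$ is independent of $v_{jsx}$ for all $(i, j, t, s)$. Let $\Sigma_{ii}$ denote the variance matrix of $\varepsilon_{it} = (e_{it}, v_{itx}')'$. So we have $\Sigma_{ii} = \operatorname{diag}(\Sigma_{iie}, \Sigma_{iix})$, a block-diagonal matrix.

###### Remark 2.1

Let $\Sigma_{\varepsilon\varepsilon}$ denote the variance of $\varepsilon_t = (\varepsilon_{1t}', \ldots, \varepsilon_{Nt}')'$. Due to the uncorrelatedness of $\varepsilon_{it}$ over $i$, we have $\Sigma_{\varepsilon\varepsilon} = \operatorname{diag}(\Sigma_{11}, \Sigma_{22}, \ldots, \Sigma_{NN})$, a block-diagonal matrix. Assumption B is more general than the usual assumption in factor analysis. In a traditional factor model, the variance of the idiosyncratic error terms is assumed to be a diagonal matrix. In the present setting, the variance of $\varepsilon_t$ is a block-diagonal matrix. Even without explanatory variables, this generalization is of interest. The factor analysis literature has long explored block-diagonal idiosyncratic variance, known as multiple battery factor analysis; see [31]. The maximum likelihood estimation theory for high-dimensional factor models with block-diagonal covariance matrix has not been previously studied. The asymptotic theory developed in this paper not only provides a way of analyzing the coefficient $\beta$, but also a way of analyzing the factors and loadings in the multiple battery factor models. This framework is of independent interest.

###### Assumption C

There exists a sufficiently large constant $C$ such that:

(C.1) $\|\Gamma_j\| \le C$ for all $j = 1, \ldots, N$;

(C.2) $C^{-1} \le \tau_{\min}(\Sigma_{jj}) \le \tau_{\max}(\Sigma_{jj}) \le C$ for all $j = 1, \ldots, N$, where $\tau_{\min}(\Sigma_{jj})$ and $\tau_{\max}(\Sigma_{jj})$ denote the smallest and largest eigenvalues of the matrix $\Sigma_{jj}$, respectively;

(C.3) there exists an $r \times r$ positive matrix $Q$ such that $Q = \lim_{N \to \infty} N^{-1}\Gamma'\Sigma_{\varepsilon\varepsilon}^{-1}\Gamma$, where $\Gamma$ is defined earlier.

###### Assumption D

The variances $\Sigma_{ii}$ for all $i$ and $M_{ff}$ are estimated in a compact set, that is, all the eigenvalues of $\hat\Sigma_{ii}$ and $\hat M_{ff}$ are in an interval $[C^{-1}, C]$ for a sufficiently large constant $C$.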

### 2.2 Identification restrictions

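It is a well-known result in factor analysis that the factors and loadings can only be identified up to a rotation; see, for example, [5, 21]. The models considered in this paper can be viewed as extensions of the factor models. As such they inherit the same identification problem. We show that identification conditions can be imposed on the factors and loadings without loss of generality. To see this, model (2) can be rewritten as

$$(I_N \otimes B) z_t = (\mu + \Gamma \bar f) + \bigl(\Gamma M_{ff}^{1/2} R\bigr)\bigl(R' M_{ff}^{-1/2}(f_t - \bar f)\bigr) + \varepsilon_t, \tag{3}$$

where $\bar f = T^{-1}\sum_{t=1}^T f_t$, $M_{ff}$ is the sample covariance matrix of the factors, and $R$ is an orthogonal matrix, which we choose to be the matrix consisting of the eigenvectors of $M_{ff}^{1/2}\bigl(\frac{1}{N}\Gamma'\Sigma_{\varepsilon\varepsilon}^{-1}\Gamma\bigr)M_{ff}^{1/2}$ associated with the eigenvalues arranged in descending order. Treating $\mu + \Gamma \bar f$ as the new $\mu$, $\Gamma M_{ff}^{1/2} R$ as the new $\Gamma$ and $R' M_{ff}^{-1/2}(f_t - \bar f)$ as the new $f_t$, we have

$$(I_N \otimes B) z_t = \mu + \Gamma f_t + \varepsilon_t$$

with $M_{ff} = I_r$ and $\frac{1}{N}\Gamma'\Sigma_{\varepsilon\varepsilon}^{-1}\Gamma$ being a diagonal matrix. Thus we impose the following restrictions for model (2), which we refer to as IB (identification restrictions for Basic models).

(IB1) $M_{ff} = I_r$;

(IB2) $\frac{1}{N}\Gamma'\Sigma_{\varepsilon\varepsilon}^{-1}\Gamma = D$, where $D$ is a diagonal matrix with its diagonal elements distinct and arranged in descending order;

(IB3) $\bar f = \frac{1}{T}\sum_{t=1}^T f_t = 0$.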
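The normalization argument can be checked numerically. The sketch below assumes the standard construction: demean the factors, whiten them with the inverse symmetric square root of their sample covariance, and rotate by the eigenvectors of the whitened weighted loading matrix. The function and variable names are hypothetical, the dimensions are arbitrary, and a diagonal error precision matrix stands in for the block-diagonal one as a special case.

```python
import numpy as np

def sym_sqrt(M):
    """Symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

def normalize_ib(mu, Gamma, F, Sigma_inv, n_units):
    """Re-express (mu, Gamma, F) so that the factors have zero mean and
    identity sample covariance, and the weighted loading matrix is diagonal
    with diagonal elements in descending order."""
    T = F.shape[0]
    f_bar = F.mean(axis=0)
    Fd = F - f_bar                                # demeaned factors, T x r
    M_ff = Fd.T @ Fd / T                          # sample covariance of factors
    M_half = sym_sqrt(M_ff)
    M_half_inv = np.linalg.inv(M_half)
    A = M_half @ (Gamma.T @ Sigma_inv @ Gamma / n_units) @ M_half
    w, V = np.linalg.eigh(A)
    R = V[:, np.argsort(w)[::-1]]                 # eigenvectors, descending eigenvalues
    mu_new = mu + Gamma @ f_bar
    Gamma_new = Gamma @ M_half @ R
    F_new = Fd @ M_half_inv @ R                   # row t is (R' M^{-1/2} (f_t - f_bar))'
    return mu_new, Gamma_new, F_new

rng = np.random.default_rng(1)
n_units, K, T, r = 6, 2, 40, 2                    # illustrative sizes
n = n_units * (K + 1)
mu = rng.normal(size=n)
Gamma = rng.normal(size=(n, r))
F = rng.normal(size=(T, r)) + 0.5
Sigma_inv = np.diag(1.0 / rng.uniform(0.5, 2.0, size=n))

mu_n, Gamma_n, F_n = normalize_ib(mu, Gamma, F, Sigma_inv, n_units)
common_old = mu[:, None] + Gamma @ F.T            # mu + Gamma f_t, one column per t
common_new = mu_n[:, None] + Gamma_n @ F_n.T
M_ff_new = F_n.T @ F_n / T
D = Gamma_n.T @ Sigma_inv @ Gamma_n / n_units
print(np.round(M_ff_new, 6))
print(np.round(D, 6))
```

The systematic part of the model is unchanged by the transformation, which is the sense in which the restrictions are imposed without loss of generality.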

### 2.3 Estimation

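The objective function considered in this section is

$$\ln L = -\frac{1}{2N}\ln|\Sigma_{zz}| - \frac{1}{2N}\operatorname{tr}\bigl[(I_N \otimes B) M_{zz} (I_N \otimes B')\Sigma_{zz}^{-1}\bigr], \tag{4}$$

where $\Sigma_{zz} = \Gamma M_{ff}\Gamma' + \Sigma_{\varepsilon\varepsilon}$ and $M_{zz} = \frac{1}{T}\sum_{t=1}^T \dot z_t \dot z_t'$. The latter is the data matrix. The parameters are $\theta = (\beta, \Gamma, \Sigma_{\varepsilon\varepsilon}, M_{ff})$. The MLE is defined as

$$\hat\theta = \operatorname*{argmax}_{\theta \in \Theta} \ln L(\theta),$$

where the parameter space $\Theta$ is defined to be a closed and bounded subset containing the true parameter as an interior point; $\Sigma_{ii}$ and $M_{ff}$ are positive definite matrices, as in Assumption D. The boundedness of $\Theta$ implies that the elements of $\beta$ and $\Gamma$ are bounded. This is for theoretical purposes and is usually assumed for nonconvex optimizations, as in [19] and [25]. In actual computation with the EM algorithm, we do not find the need to impose an upper or lower bound for the parameter values. The likelihood function involves simple functions and is continuous on $\Theta$ (in fact differentiable), so the MLE exists because a continuous function achieves its extreme value on a closed and bounded subset.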
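To make the objective concrete, the following sketch evaluates a Gaussian quasi-log-likelihood of the kind described here. It rests on assumptions that should be read as such: the objective is taken to be minus the log determinant of the implied covariance plus the trace of the transformed data covariance times the inverse implied covariance, all divided by twice the cross-section size; the implied covariance is taken to be the loadings times the factor covariance times the loadings transposed, plus the error covariance; and the coefficient matrix is taken to be the unit upper triangular matrix with minus the regression coefficients in its first row. The function name and arguments are hypothetical.

```python
import numpy as np

def quasi_loglik(beta, Gamma, M_ff, Sigma_ee, M_zz, n_units):
    """Evaluate an assumed form of the quasi-log-likelihood.

    beta     : (K,) regression coefficients
    Gamma    : (n_units*(K+1), r) stacked loadings
    M_ff     : (r, r) factor covariance
    Sigma_ee : (n, n) idiosyncratic covariance, n = n_units*(K+1)
    M_zz     : (n, n) sample covariance of the stacked data
    """
    K = len(beta)
    B = np.eye(K + 1)
    B[0, 1:] = -beta                         # first row is (1, -beta')
    IB = np.kron(np.eye(n_units), B)         # block-diagonal transformation
    Sigma_zz = Gamma @ M_ff @ Gamma.T + Sigma_ee
    _, logdet = np.linalg.slogdet(Sigma_zz)  # log determinant, numerically stable
    quad = np.trace(IB @ M_zz @ IB.T @ np.linalg.inv(Sigma_zz))
    return -(logdet + quad) / (2 * n_units)

# Degenerate sanity check: no regressor effect, no factors, identity covariances.
n_units, K, r = 3, 2, 1
n = n_units * (K + 1)
value = quasi_loglik(
    beta=np.zeros(K),
    Gamma=np.zeros((n, r)),
    M_ff=np.eye(r),
    Sigma_ee=np.eye(n),
    M_zz=np.eye(n),
    n_units=n_units,
)
print(value)
```

In the degenerate case the log determinant vanishes and the trace equals the dimension of the stacked vector, so the value reduces to minus one half of the number of equations per unit.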

Note that the determinant of $I_N \otimes B$ is 1, so the Jacobian term does not depend on $B$. If $\varepsilon_t$ and $f_t$ are independent and normally distributed, the likelihood function for the observed data has the form of (4). Here recall that $f_t$ are fixed constants, and $\varepsilon_t$ are not necessarily normal; (4) is a pseudo-likelihood function.
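The unit-determinant property is easy to verify numerically. The sketch assumes the coefficient matrix has the block form obtained by moving the regressors to the left-hand side of the $y$ equation, that is, an identity matrix whose first row is $(1, -\beta')$; the coefficient values and dimensions are arbitrary.

```python
import numpy as np

def build_B(beta):
    """Assumed coefficient matrix: identity with first row (1, -beta')."""
    K = len(beta)
    B = np.eye(K + 1)
    B[0, 1:] = -np.asarray(beta, dtype=float)
    return B

beta = np.array([0.5, -2.0, 3.0])
B = build_B(beta)
stacked = np.kron(np.eye(4), B)          # I_N ⊗ B with N = 4 units

det_B = np.linalg.det(B)
det_stacked = np.linalg.det(stacked)

# Applying B to z = (y, x')' leaves x unchanged and replaces y by y - x'beta.
z = np.array([10.0, 1.0, 2.0, 3.0])
transformed = B @ z
print(det_B, det_stacked, transformed)
```

Because the matrix is upper triangular with ones on the diagonal, the determinant is one for every value of the coefficients, and the Kronecker product with an identity matrix inherits this property.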

For further analysis, we partition the matrix and as

where for any , and are both matrices.

Let and denote the MLE. The first order condition for satisfies

(5) |

where . The first order condition for satisfies

(6) |

Post-multiplying on both sides of (6) and then taking summation over , we have

(7) |

The first order condition for satisfies

(8) |

where is a matrix such that its upper-left and lower-right submatrices are both zero, but the remaining elements are undetermined. The undetermined elements correspond to the zero elements of . These first order conditions are needed for the asymptotic representation of the MLE.

### 2.4 Asymptotic properties of the MLE

Theorem 2.1 states the convergence rates of the MLE. The consistency is implied by the theorem.

###### Theorem 2.1 (Convergence rate)

Bai [8] considers an iterated principal components estimator for model (1). His derivation shows that, in the presence of heteroscedasticity over the cross section, the PC estimator for $\beta$ has a bias of order $1/N$. By comparison, Theorem 2.1 shows that the MLE is robust to heteroscedasticity over the cross section. So if $N$ is fixed, the estimator in [8] is inconsistent unless there is no heteroscedasticity, but the estimator here is still consistent.

Let $M(X)$ denote the projection matrix onto the space orthogonal to $X$, that is, $M(X) = I - X(X'X)^{-1}X'$. We have

###### Theorem 2.2 (Asymptotic representation)

Under the assumptions of Theorem 2.1, we have

where is a matrix whose element with being the element of matrix .

In Appendix A.3 of the supplement [10], we show that the asymptotic expression of can be alternatively expressed as

(9) | |||||

where is (the data matrix for the th regressor, ); is ; with and ; ; where is a vector with all 1’s.

Theorem 2.2 shows that the asymptotic expression of $\hat\beta - \beta$ only involves variations in the idiosyncratic errors $e_{it}$ and $v_{itx}$. Intuitively, this is due to the fact that the error terms of the $y$ equation share the same factors with the explanatory variables. The variations from the common factor part of $x_{itk}$ (i.e., $\gamma_{ik}' f_t$) do not provide information for $\beta$ since this part of information is offset by the common factor part of the error terms (i.e., $\lambda_i' f_t$) in the $y$ equation.

###### Corollary 2.1 (Limiting distribution)

Matrix can be consistently estimated by

where is the data matrix for the th regressor,

(10) |

with and

(11) |

Here and are the maximum likelihood estimators.

## 3 Common shock models with zero restrictions

The basic model in Section 2 assumes that the explanatory variables $x_{it}$ share the same factors with $y_{it}$. This section relaxes this assumption. We assume that the regressors are impacted by additional factors that do not affect the $y$ equation. An alternative view is that some factor loadings in the $y$ equation are restricted to be zero. Consider the following model:

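$$
\begin{aligned}
y_{it} &= \alpha_i + x_{it1}\beta_1 + x_{it2}\beta_2 + \cdots + x_{itK}\beta_K + \psi_i' g_t + e_{it}, \\
x_{itk} &= \mu_{ik} + \gamma_{ik}^{g\prime} g_t + \gamma_{ik}^{h\prime} h_t + v_{itk}
\end{aligned}
\tag{12}
$$

for $k = 1, 2, \ldots, K$, where $g_t$ is an $r_1 \times 1$ vector representing the shocks affecting both $y_{it}$ and $x_{it}$, and $h_t$ is an $r_2 \times 1$ vector representing the shocks affecting $x_{it}$ only. Let $\lambda_i = (\psi_i', 0_{r_2 \times 1}')'$, $\gamma_{ik} = (\gamma_{ik}^{g\prime}, \gamma_{ik}^{h\prime})'$ and $f_t = (g_t', h_t')'$; the above model can be written as

$$
\begin{aligned}
y_{it} &= \alpha_i + x_{it1}\beta_1 + x_{it2}\beta_2 + \cdots + x_{itK}\beta_K + \lambda_i' f_t + e_{it}, \\
x_{itk} &= \mu_{ik} + \gamma_{ik}' f_t + v_{itk},
\end{aligned}
$$

which is the same as model (1) except that $r_2$ elements of $\lambda_i$ are restricted to be zeros. For further analysis, we introduce some notation. We define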

We also define and similarly as , that is, , . This implies that . The presence of zero restrictions in (12) requires different identification conditions.
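The equivalence between the restricted model and the basic model with zero loadings can be seen in a few lines. In this sketch the names `psi`, `G` and `H` are hypothetical stand-ins for the loadings on the shared factors, the shared factors and the factors that affect the regressors only; the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
r1, r2, T = 2, 3, 5                  # shared factors, regressor-only factors, periods

psi = rng.normal(size=r1)            # loadings of y on the shared factors
G = rng.normal(size=(T, r1))         # shared factors, one row per period
H = rng.normal(size=(T, r2))         # factors entering the regressors only

# Embed into the basic model: pad the loadings with zeros, stack the factors.
lam = np.concatenate([psi, np.zeros(r2)])
F = np.hstack([G, H])

restricted_part = G @ psi            # factor part of y in the restricted model
embedded_part = F @ lam              # factor part of y in the basic model
print(np.max(np.abs(restricted_part - embedded_part)))
```

The zero block in the padded loading vector is what distinguishes this case from the basic model, and it is this extra structure that reduces the number of identification restrictions required.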

### 3.1 Identification conditions

Zero loading restrictions alleviate rotational indeterminacy. Instead of the $r^2$ restrictions needed for $r$ unrestricted factors, we only need to impose $r^2 - r_1 r_2$ restrictions, where $r_1$ is the number of factors affecting both equations and $r_2$ is the number of factors affecting the regressors only. These restrictions are referred to as IZ restrictions (Identification conditions with Zero restrictions). They are:

(IZ1) ;

(IZ2) and , where and are both diagonal matrices with distinct diagonal elements in descending order;

(IZ3) and .

In addition, we need an additional assumption for our analysis.

###### Assumption E

The matrix of loadings in the $y$ equation on the factors affecting both equations is of full column rank.

Identification conditions IZ are less stringent than IB of the previous section. Assumption E says that the factors affecting both equations are pervasive for the $y$ equation. In Appendix B of the supplement [10], we explain why $r^2 - r_1 r_2$ restrictions are sufficient.

### 3.2 Estimation

The likelihood function is now maximized under three sets of restrictions, that is, , and where denotes the zero factor loading matrix in the equation. The likelihood function with the Lagrange multipliers is

where ; is and is , both are symmetric Lagrange multipliers matrices with zero diagonal elements; is a Lagrange multiplier matrix of dimension .

Let . Notice is a symmetric matrix. The first order condition on gives

Post-multiplying yields

Since is a symmetric matrix, the above equation implies that is also symmetric. But is a diagonal matrix. So the th element of is , where is the th element of and is the th diagonal element of . Given is symmetric, we have for all . However, is also symmetric, so . This gives . Since by IZ2, we have for all . This implies since the diagonal elements of are all zeros.

Let with , and , a block diagonal matrix of dimension. We partition the matrix and define the matrix as

where is a matrix, and