EM estimation of a Structural Equation Model
In this work, we propose a new estimation method of a Structural Equation Model. Our method is based on the EM likelihood-maximization algorithm. We show that this method provides estimators, not only of the coefficients of the model, but also of its latent factors. Through a simulation study, we investigate how fast and accurate the method is, and then apply it to real environmental data.
keywords:EM algorithm, Factor model, Latent Variable, Structural Equation Model.
Structural Equation Models (SEM) are widely used is as various research domains as psychology, social and behavioral sciences, ecology, chemometrics, etc. A SEM formalizes the interdependence of many Observed numeric Variables (OV) through fewer unobserved ones, referred to as Latent Variables (LV). Every LV is assumed to be underlying a specific set of OVs, which depend on it as well as on extra observed covariates. A SEM is structured through two types of equations, termed measurement equations and structural equations. A measurement equation relates a LV to the corresponding OV’s. A structural equation states a hypothesized relationship between LV’s. 1 graphs an example of SEM.
Literature widely presents two competing families of methods that deal with SEM’s: factor-methods, and component-methods. Among the former family are the classical Factor Analysis, and Jöreskog’s SEM estimation technique Jöreskog (1970) implemented in the LISREL software. These methods use factors as LV’s, i.e. variables of which we merely assume to know the distribution (typically standardized normal). They base their estimation on the structure of the covariance matrix of the data according to the model, within a likelihood maximization approach. They estimate all coefficients in the model (linear combination coefficients and variances), but not the values of the factors, which therefore remain unknown. The component-model family of methods assumes that every LV is a component, i.e. a linear combination, of its OV’s. Note that such a constraint is stronger than the distribution-assumption made on factors. This family includes the classical Principal Component Analysis, Canonical Correlation Analysis and Partial Least Squares (PLS) methods Jöreskog and Sörbom (1982), Wangen and Kowalski (1989), Lohmöller (2013), W. W. Chin (1999), Vinzi et al. (2010), but also more recent techniques as Generalized Structured Component Analysis Hwang and Takane (2004), Generalized Regularized Canonical Correlation Analysis Tenenhaus and Tenenhaus (2011) and THEME Bry et al. (2012), Bry and Verron (2015).
Factor-methods and PLS-type ones have been compared in several works Jöreskog and Sörbom (1982). The gist is that the latter encounter less convergence problems than the former with small samples. A second advantage is that, since they express every LV as a linear combination of its OV’s, and yield the estimated coefficients of that combination, the values of the LV’s are estimated, and can also be forecast on new samples, opening the way to easy cross-validation. Such is not the case of factor-methods, which do have yet the theoretical advantage to be based on a proper statistical distribution-based model of data, contrary to PLS-type methods, thus allowing standard statistical tests, which are not possible with the latter.
In many areas, it is of essence to be able to estimate the values of LV’s on statistical units, since these values allow to efficiently analyze the disparities of units on a reduced number of dimensions.
Therefore, we are interested in estimating these values even in the factor-model context. In this work, we adapt the EM algorithm to the SEM estimation problem, in order to get estimates of the factor values. The paper is organized as follows. Section 2 formally introduces the equations of the SEM we deal with. Section 3 applies the EM algorithm to the SEM and derives the estimation formulas. Section 4 first presents a simulation-based study of the performance of the method, with comparison to more classical methods, and then an application to environmental data.
2 The model
The data consists in blocks of OV’s describing the same units.
We consider the following data-matrices and notations:
; , is the matrix coding the dependent block of OV’s , identified with its column-vectors.
; , , is the matrix coding the -explanatory block of OV’s . Value of variable for unit is denoted . Variable-blocks will be referred to through the corresponding matrix.
(resp. ) refers to a (resp. , …, ) matrices of covariates.
We assume that:
The units, hence the rows of matrices are independent multivariate normal vectors.
The SEM we handle here is a restricted one, in that it contains only one structural equation, relating a dependent factor , underlying a block of OV’s, to explanatory factors respectively underlying blocks of OV’s (cf. fig.1). The main assumptions of this model are the following:
Factors are standard normal, i.e. and .
In each block (e.g. ), the OV’s (e.g. ) depend linearly on the block’s factor (e.g. ) and a block of extra-covariates (e.g. ), conditional on which they are independent.
Factor is normal with zero-mean, and its expectation conditional on is a linear combination of them.
The SEM consists of measurement equations and one structural equation. It is graphed on (cf. fig.1).
2.2 Measurement equations
As formerly mentioned, each measurement equations relates the variables in a block (respectively ) to the block’s factor (resp. ). This link may also involve covariates (resp. ): each OV is expressed as a linear combination of the factor, the covariates and some noise. Hence the measurement equations:
where (resp. ) is a (resp. ) parameter matrix, (resp. ) a (resp. ) parameter matrix, and (resp. ) an (resp. ) measurement-error matrix.
We impose that the first column of as well as of each matrix is equal to constant vector having all elements equal to 1. Thus, the first row of and of each contains mean-parameters.
As far as distributions are concerned, we assume that
(0, ) , where .
: (0, ), where and that
and , are independent.
As to the factors, we assume that:
: with independent.
2.3 Structural equations
The structural equation we consider relates dependent factor to explanatory factors (cf. fig.1) through a linear model:
where , is a scalar parameter, and is a disturbance vector.
We assume that
is independent of and , .
N.B. The unit-variance of disturbance serves an identification purpose.
Hence we have the overall model:
where the set of parameters is such as is -dimensional.
Thus, when all matrices are diagonal, we have:
2.4 A simplified model
But in order to avoid heavy formulas in the development of the algorithm, we shall use in the sequel, with no loss of generality, a simplified model involving explanatory blocks and . The corresponding equation set, for a given unit , reads:
Such as, . Thus, in this case (cf. (2)), the dimension of is:
3 Estimation using the EM algorithm
In this work, likelihood maximization is carried out via an iterative EM algorithm (Dempster et al. (1977), section 4.7). Each iteration of the algorithm involves an Expectation (E)-step followed by a Maximization (M)-step. Dempster et al. (1977) prove that the EM algorithm yields maximum likelihood estimates. Moreover, they proved that even if the starting point is one where the likelihood is not convex, if an instance of the algorithm converges, it will converge to a (local) maximum of the likelihood. Another major advantage of the EM algorithm is that it can be used to “estimate” missing data. Thus, if we consider LV’s as missing data, the EM algorithm will prove a general technique to maximize the likelihood of statistical models with LV’s, but also to estimate these LV’s. In our SEM framework, LV’s correspond to factors. Thus, we will be able to estimate the factors at unit-level. We shall present the algorithm on the simplified model (cf. section 2.4) with no loss of generality.
3.1 The EM algorithm
Let be the OV’s and the LV’s. The EM algorithm is based on the log-likelihood associated with the complete data .
The complete log-likelihood function
Let denote the probability density of the complete data. The corresponding log-likelihood function is:
Where is the -dimensional set of model parameters and a constant. However, because of the simplification made in the section 2.4, in our case . Indeed, , and .
Estimation in the SEM
To maximize this function, in the framework of EM algorithm, we have to solve:
This demands that we know the derivatives of the log-likelihood function and the distribution of conditional on for each observation . Let us introduce the following notation:
For all , we denote .
The parameters of the gaussian distribution are explicit and have the following form:
These results are demonstrated in AppendixB. Expressions of the first-order derivatives of with respect to are also established in AppendixB and written in the following forms with :
So, here formula (4) develops into:
System of equations (6) is easy to solve and the obtained solutions will be given in the next section.
To estimate parameters in , we propose the following EM-algorithm. We denote the -iteration of the algorithm.
2= choice of the initial parameter values .
Current iteration , until stopping condition is met:
E-step: with ,
Calculate explicitly distribution for each .
Estimate the factor-values , , .
Calculate and , .
Update to by injecting , and , , into the formulas in (7).
We used the following stopping condition with the smallest possible:
where is the -dimensional vector containing the scalar values in all parameters in .
4 Numerical results on simulated data
4.1 Data generation
We consider units and . Therefore, the OV’s are simulated so as to be structured respectively around three factors . Factors and are explanatory of . Besides, we consider i.e covariates are simulated for each covariate matrix , and . The data is simulated as follows.
Choice of :
a) matrices filled in row-wise with the ordered integer sequence ranging from to (indeed: ).
ordered integer sequence ranging from to .
Simulation of factors
Simulate vectors and of normally distributed random numbers with mean 0 and variance 1 (abbreviated ).
We simulate according to distribution .
We then calculate as
Simulation of noises , , Each element of matrix , (respectively , ) is simulated independently from distribution (respectively, , ).
Simulation of covariate matrices , ,
Each element of matrices , , is simulated according to the standard normal distribution.
Construction of , , , , are eventually calculated through formulas in the model (1).
This simulation scheme was performed 100 times, each time yielding a set of simulated data matrices . Then for each simulated data, we ran an estimation routine with a threshold value , yielding the average results presented in section 4.2. Thus from scalar elements of data, we will estimate scalar elements of factors plus scalar parameters, i.e: scalars.
Convergence was observed in almost all cases in less than five iterations. We assess the quality of the estimations as follows.
On the one hand, we calculate the absolute relative deviation between each simulated scalar parameter in and its estimation, and then average these deviations over the simulations. We then produce a box-plot of the average absolute relative deviations (cf. fig. 2). This makes the interpretation easier, since we only need to look at the box-plot’s values and check that they are positive (because of the absolute value) and close to .
On the other hand, to assess the quality of the factor estimations, we compute the values of square correlations between the simulated concatenated factors (respectively) and the corresponding estimations (). Once again, we produce a box-plot of these correlations (cf. fig. 3) and check that it indicates values close to .
Figures 2 and 3 show clearly that the estimations are very close to the actual quantities. Indeed, on figure 2, the median of average absolute relative deviations is , the first quartile is and the third quartile is . On figure 3, the median of square correlations is , the first quartile is and the third quartile is . So, factor (respectively and ) turn out to be drawn towards the principal direction underlying the bundles made up by observed variables (respectively and ). Now, we may legitimately wonder how the quality of estimations could be affected by the number of observations and the number of OV’s in each block. In the following section we give a sensitivity analysis performed to investigate this.
4.3 Sensitivity analysis of estimations
We performed a sensibility analysis on the simulated data presented in section 4.1. The purpose was to study the influence of the block-sizes (, , , ) on the quality of estimation, both of the parameters and the factors. To simplify the analysis, we imposed and varied and separately, i.e. studied the cases with and with . Each case was simulated times. Therefore, we simulated data-sets.
Sensitivity with respect to the number of observations
In this section, we study the evolution with of the average estimation of structural coefficients and and parameter with respect to their actual values, all equal to , and that of the correlations between factors and their estimates. The number of OV’s is fixed to in each block. Figures 4, 5 and 6 graph these evolutions (average value of estimate in plain line), including a confidence-interval about each average estimate (dotted line). These figures show that the biases and the standard deviations are, as expected, more important for little values of , but also that the quality of estimation is already quite good for .
As for the factors, figure 7 shows that their correlations increase and get close to one as increases, with a dispersion decreasing to 0. However, even for , the correlations are mostly above , indicating that the factors are correctly reconstructed.
Sensitivity with respect to the number of OV’s in each block
Likewise, we study the evolution of the average estimates of , , and the correlation between factors and their estimates for different values of , with fixed to . We observe that, unsurprisingly, the biases and the standard deviations decrease as increases (cf. figures 8, 9 and 10). We observe that they stabilize even faster with than with , particularly . Indeed, from on, the confidence interval is narrow enough.
As for the factors, figure 11 shows that their correlations are already very close to for , with a very small variance, and keep increasing with .
To sum things up, the sample size proved to have more impact on the quality of parameter estimation and factor reconstruction than the number of OV’s. Now, the quality of factor reconstruction remains high for rather small values of or . We advise to use a minimal sample size of to obtain really stable structural coefficients. Above this threshold, has but little impact on the biases and standard deviations of estimations.
5 An application to environmental data
5.1 Data presentation
We apply our model to the data-set genus provided in the R-package SCGLR by Mortier et al. (2014). Data-set genus was built from the CoForChange database. It gives the abundances of 27 common tree genera present in the tropical moist forest of the Congo-Basin, and the measurements of 40 geo-referenced environmental variables, for inventory plots (observations). Some of the geo-referenced environmental variables describe 16 physical factors pertaining to topography, geology and rainfall description. The remaining variables characterize vegetation through the enhanced vegetation index (EVI) measured on 16 dates.
In this section, we aim at modeling the tree abundances from the other variables, while reducing the dimension of data. The dependent block of variables therefore consists of the tree species counts divided by the plot-surface. A PCA of the geo-referenced environmental variables and the photosynthetic activity variables confirms that EVI measures are clearly separated from the other variables (cf. Fig. 12). Indeed, Fig. 12 shows two variable-bundles with almost orthogonal central directions. This justifies using our model (cf. section 5.2) with explanatory groups, one of them () gathering rainfall measures and location variables (longitude, latitude and altitude), and the second one (), the EVI measures. Besides, in view of the importance of the geological substrate on the spatial distribution of tree species in the Congo Basin, showed by Fayolle et al. (2012), we chose to put nominal variable geology in a block . This block therefore contains constant 1 plus all the indicator variables of geology but one, which will be the reference value. Geology having 5 levels, T has thus 5 columns.
5.2 Model with geologic covariates
Here is the model used with the variable-blocks designed in section 5.1.: