# Interpretation of linear regression coefficients under mean model miss-specifications

## Abstract.

Linear regression is a frequently used tool in statistics; however, its validity and interpretability rely on strong model assumptions. While robust estimates of the coefficients’ covariance extend the validity of hypothesis tests and confidence intervals, a clear interpretation of the coefficients is lacking if the mean structure of the model is misspecified. We therefore suggest a new intuitive and mathematically rigorous interpretation of the coefficients that is independent of specific model assumptions. It relies on a new population-based measure of association. The idea is to quantify how much the population mean of the dependent variable Y can be changed by changing the distribution of the independent variable X. Restriction to linear functions for the distributional changes in X provides the link to linear regression. It leads to a conservative approximation of the newly defined and generally non-linear measure of association. The conservative linear approximation can then be estimated by linear regression. We show how this interpretation can be extended to multiple regression and how far and in which sense it leads to an adjustment for confounding. We point to perspectives for new analysis strategies and illustrate the utility and limitations of the new interpretation and strategies by examples and simulations.

Keywords. association, confounding, quasi likelihood, robust regression, sandwich estimate

## 1. Introduction

Linear regression is one of the oldest and still most widely used statistical methods to investigate the association between a metric response and a number of independent variables (also called covariates later on). Linear regression is very easy to apply, and it provides a simple and straightforward understanding of the covariates’ effects on the response in terms of regression slopes. However, the application and interpretation of classical linear regression presumes strong modeling assumptions that are rarely known to be satisfied in practice. Statisticians have therefore made several attempts to extend the validity of linear regression and have suggested a number of generalizations. For stochastically independent observations, probably the most far-reaching relaxation of classical modeling assumptions was provided by White (1980) and earlier, in the more general framework of maximum likelihood estimation, by Huber (1967); see also White (1982a, 1982b). Roughly speaking, Huber and White showed that, under weak regularity assumptions, the least squares regression coefficients (and, more generally, maximum likelihood estimates) are consistent and approximately normally distributed estimates of specific population parameters that are mathematically well defined even if the model has been misspecified.

In the case of linear regression, the limiting population parameters are the coefficients from the linear least squares approximation of the response in the population. To see this, assume that the response and covariates are multivariate i.i.d. observations $(Y_i, X_i)$, $i=1,\dots,n$, with finite variances. Here $Y_i$ is the response and $X_i=(1,X_{i1},\dots,X_{im})$ is the covariate vector of individual $i$. Note that the assumption of finite variances implies that $Y_i$ and the components of $X_i$ belong to the space $L^2$ of square integrable random variables. It follows from geometric arguments in the Hilbert space $L^2$ that the population square loss $E[(Y_i-X_i\theta)^2]$ is minimized by a unique regression coefficient $\theta=(\theta_0,\theta_1,\dots,\theta_m)^T$. White (1980) showed that the least squares estimate $\hat\theta$ is a consistent estimate of $\theta$ with the property that $\sqrt{n}(\hat\theta-\theta)$ is approximately multivariate normally distributed with mean vector $0$ and a covariance matrix that can be consistently estimated by what is nowadays called the “Huber-White sandwich estimate”. This permits, for instance, asymptotic hypothesis tests and confidence intervals for each $\theta_j$ under model misspecifications.
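The sandwich idea can be made concrete in the simplest case of one covariate plus an intercept. The following is a minimal sketch, not code from the paper: it computes the least squares slope together with the classical variance estimate and the HC0 form of the sandwich variance estimate for the slope. The data and all function and variable names are hypothetical.

```python
def slope_with_variances(x, y):
    """Least squares slope with classical and HC0 sandwich variance estimates."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Classical estimate: presumes a correct linear mean and constant error variance.
    classical = sum(r * r for r in resid) / (n - 2) / sxx
    # HC0 sandwich estimate: each squared residual is weighted by the
    # squared deviation of its own covariate value from the mean.
    sandwich = sum((xi - x_bar) ** 2 * r * r for xi, r in zip(x, resid)) / sxx ** 2
    return slope, classical, sandwich


# Hypothetical data (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.4, 3.2, 6.5, 4.1]
slope, classical, sandwich = slope_with_variances(x, y)
```

The two variance estimates generally differ; only the sandwich form remains asymptotically valid when the linear mean model or the constant-variance assumption fails.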

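Since $X_i\theta$ is the orthogonal projection of $Y_i$ onto the linear subspace spanned by the components of $X_i$, the error term $\tilde U_i=Y_i-X_i\theta$ and $X_i$ are orthogonal in $L^2$, i.e.,

$$E(\tilde U_i X_i)=\bigl(E(\tilde U_i),\,E(\tilde U_i X_{i1}),\,\dots,\,E(\tilde U_i X_{im})\bigr)=0.$$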

Therefore, whenever the dependent and independent variables have finite variances, then

$$Y_i=X_i\theta+\tilde U_i \tag{1}$$

where the error term $\tilde U_i$ has mean zero and is uncorrelated to each $X_{ij}$, $j=1,\dots,m$. White (1980) defined $\theta$ directly by identity (1) with uncorrelated $\tilde U_i$ and $X_i$, and he considered the more general situation of independent but not necessarily identically distributed observations $(Y_i,X_i)$. For simplicity, we will stick to the assumption of i.i.d. observations.

Identity (1) seems to imply that we can always claim a linear relationship between $Y_i$ and $X_i$, at least under mild regularity assumptions, like square integrability. However, identity (1) can be misleading because the assumption that $\tilde U_i$ and $X_i$ are uncorrelated is much weaker than the classical assumption of stochastic independence. To see this, assume a non-linear regression relationship $Y_i=g(X_i)+U_i$ with the non-linear function $g$ and an error term $U_i$ that is stochastically independent from $X_i$. In this case, identity (1) holds with $\tilde U_i=g(X_i)-X_i\theta+U_i$. Due to the non-linearity of $g$, the error term $\tilde U_i$ is functionally dependent on $X_i$, even though it is uncorrelated to $X_i$. Hence, the interpretation of the linear regression vector $\theta$ in (1) is rather unclear.
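The difference between "uncorrelated" and "independent" in identity (1) can be checked exactly on a small population. The sketch below is an illustration under stated assumptions, not an example from the paper: the covariate is uniform on {-1, 0, 1}, the regression function is the hypothetical choice g(x) = x², and the independent mean-zero error drops out of all moments used.

```python
from fractions import Fraction as F

# Hypothetical population: X uniform on {-1, 0, 1}, Y = g(X) + U with g(x) = x**2.
# U is independent of X with mean zero, so it does not enter the moments below.
support = [F(-1), F(0), F(1)]
prob = {x: F(1, 3) for x in support}
g = lambda x: x * x


def mean(h):
    """Population expectation E[h(X)]."""
    return sum(prob[x] * h(x) for x in support)


ex = mean(lambda x: x)
var_x = mean(lambda x: (x - ex) ** 2)
cov_xy = mean(lambda x: (x - ex) * g(x))  # Cov(X, Y) = Cov(X, g(X))
theta1 = cov_xy / var_x                   # population least squares slope
theta0 = mean(g) - theta1 * ex            # population least squares intercept

# Conditional mean of the projection error given X = x.
resid_mean = {x: g(x) - theta0 - theta1 * x for x in support}
# Covariance between the projection error and X.
cov_resid_x = mean(lambda x: (x - ex) * resid_mean[x])
```

Here the projection error has zero covariance with the covariate, yet its conditional mean is -2/3 at x = 0 and 1/3 at x = ±1, so it clearly depends on the covariate.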

A similar concern has been formulated by Freedman (2006) in the more general context of maximum likelihood estimation. He states already in his abstract that “…if the model is seriously in error, the sandwich [estimate of the covariance matrix] may help on the variance side, but the parameters being estimated by the MLE are likely to be meaningless …”. He acknowledged that Huber and White made important contributions to mathematical statistics; however, he criticized the practical application of misspecified models in connection with robust covariance estimates. Without a general and convincing interpretation of $\theta$ under mean model misspecifications, this skepticism is well justified. It is the goal of this paper to provide such an interpretation for linear regression models. Of course, a convincing interpretation would strengthen the application of linear regression in general.

Our interpretation is based on a new perspective of statistical association. We take a population-based point of view and ask how much the marginal population mean of $Y$ can be changed by changing the marginal distribution of $X$ in the population. If $Y$ and $X$ are stochastically independent, then the conditional mean $E(Y|X)$ equals the constant $E(Y)$ and therefore the marginal mean of $Y$ (which is the expectation of $E(Y|X)$ with regard to the distribution of $X$) is not affected by any distributional changes in $X$. Otherwise, if $E(Y|X)$ depends on $X$, then it appears likely that we find a distributional change of $X$ that will lead to a change in the marginal mean of $Y$. Therefore, it is natural to consider, as a measure for the statistical association between $Y$ and $X$, the maximum possible change in the marginal mean of $Y$ that is achievable by (suitably standardized) changes in the distribution of $X$. We will see in the next section that this is indeed a sensible association parameter. Furthermore, we believe that this parameter is intuitive and understandable also for non-statisticians. We will then show that linear regression (with robust covariance estimates) provides a method to estimate the new association in a conservative fashion, and we will provide a clear interpretation of the regression slopes in terms of this parameter.

The paper is organized as follows. In the next section we formally introduce the new population-based association measures for the bivariate case with a single independent variable, discuss their properties and provide the interpretation of linear regression slopes in terms of these association measures. In Section 3 we consider the case of multiple independent variables and extend our population-based interpretation to multiple linear regression coefficients. In Section 4 we discuss how far and in which sense the new population-based association parameters introduced in Section 3 are robust against confounding. In Section 5 we illustrate the new association parameters and our interpretation of linear regression slopes for typical examples. We also provide an alternative, more explicit interpretation if the independent variables are related by linear regression models themselves, as is the case, for instance, for a multivariate normal vector of independent variables. In Section 6 we point to new perspectives for strategies of analyzing the association of an independent variable with a dependent variable while accounting for potential confounding. In particular, we suggest a new procedure that aims to account for as many confounding variables as possible by a specific, data-dependent sequence of nested linear models. We argue that this procedure controls the multiple type I error rate asymptotically and illustrate its finite sample size properties with the results of a simulation study in Section 7. We close with a discussion and a number of future perspectives in Section 8.

## 2. Mean impact, linear mean impact and regression analysis

We start with the mathematical definition and major properties of the new association parameter in the bivariate case with a single, real-valued independent variable $X$. We will also show how this parameter can be estimated in a conservative way by bivariate linear regression. This will provide the new interpretation of the linear least squares regression slope in terms of an association parameter.

### 2.1. Mean impact

As before, let $(Y_i,X_i)$ be i.i.d. with finite variances and the pair of random variables $(Y,X)$ be distributed as $(Y_i,X_i)$. We denote by $f$ the density of $X$ with regard to the Lebesgue measure, the counting or any other sigma-finite dominating measure. Assume that the density $f$ is changed to some density $f^*$ (with the same or smaller support than $f$) and let $\delta(x)=\{f^*(x)-f(x)\}/f(x)$. Then $f^*(x)=f(x)\{1+\delta(x)\}$ and we call $\delta$ a “distributional disturbance” of $X$. We will assume $E[\delta(X)]=0$ and $E[\delta^2(X)]=1$. The first identity follows from the fact that $f^*$ is a density; the second will be justified immediately. The distributional disturbance of $X$ leads to a change in the expectation $E(Y)$ which is equal to $E[Y\delta(X)]$. Therefore, we can quantify the maximum effect of a change in the distribution of $X$ by

$$\iota_X(Y)=\sup_{\delta(X)\in L^2(\mathbb{R}),\ E[\delta(X)]=0,\ E[\delta^2(X)]=1} E[Y\delta(X)]. \tag{2}$$

We call $\iota_X(Y)$ the “mean impact” of $X$ on $Y$. The condition $E[\delta^2(X)]=1$ is required to obtain a finite measure of association with (2).

At this point, one may argue that we have overlooked an important constraint for $\delta$, namely $\delta(x)\ge -1$ for all $x$, such that the disturbed density is non-negative. We show in the appendix that there is no need to introduce this constraint because, when accounting for it, we end up with essentially the same supremum, and the mathematical arguments are much easier without it.

The mean impact has the following appealing properties.

###### Theorem 1.

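Let $X$ and $Y$ be square integrable. Then

(a) $\iota_X(Y)=\mathrm{SD}[E(Y|X)]$,

(b) $\iota_X(Y)=0$ if and only if $E(Y|X)$ is independent of $X$,

(c) $0\le\iota_X(Y)\le\iota_Y(Y)$ where $\iota_Y(Y)=\mathrm{SD}(Y)$,

(d) $\iota_X(Y)=\iota_Y(Y)$ if and only if $Y$ depends on $X$ deterministically, i.e., $Y=g(X)$ for a measurable function $g$,

(e) if $Y=g(X)+U$ where $g$ is measurable and $U$ and $X$ are stochastically independent, then $\iota_X(Y)=\mathrm{SD}[g(X)]$.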

###### Proof.

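(a) follows from the Cauchy-Schwarz inequality in $L^2$, which implies that for all $\delta(X)\in L^2$ with $E[\delta(X)]=0$ and $E[\delta^2(X)]=1$

$$E[Y\delta(X)]=E[E(Y|X)\,\delta(X)]=E[\{E(Y|X)-E(Y)\}\,\delta(X)]\le\mathrm{SD}[E(Y|X)].$$

For $\delta(X)=\{E(Y|X)-E(Y)\}/\mathrm{SD}[E(Y|X)]$ we obtain $E[\delta(X)]=0$, $E[\delta^2(X)]=1$, and $E[Y\delta(X)]=\mathrm{SD}[E(Y|X)]$. Therefore $\iota_X(Y)=\mathrm{SD}[E(Y|X)]$. Properties (b) to (e) follow from (a) and the variance decomposition $\mathrm{Var}(Y)=\mathrm{Var}[E(Y|X)]+E[\mathrm{Var}(Y|X)]$.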

Note that the proof of (a) also shows that the supremum in (2) is actually a maximum.
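Property (a) and the maximizing disturbance from the proof can be verified numerically on a finite population. The following sketch is illustrative only: the support points, probabilities and conditional means are hypothetical values, and `delta_other` is just one arbitrary competing standardized disturbance.

```python
import math

# Hypothetical discrete population (illustration only).
p = [0.2, 0.5, 0.3]    # P(X = xs[k])
xs = [0.0, 1.0, 2.0]   # support of X
m = [1.0, 3.0, 2.0]    # m[k] = E(Y | X = xs[k])


def E(vals):
    """Expectation of a function of X given by its values on the support."""
    return sum(pk * v for pk, v in zip(p, vals))


def standardize(vals):
    """Center and scale so that E[delta] = 0 and E[delta^2] = 1."""
    mu = E(vals)
    sd = math.sqrt(E([(v - mu) ** 2 for v in vals]))
    return [(v - mu) / sd for v in vals]


def mean_change(delta):
    """E[Y delta(X)] = E[E(Y|X) delta(X)]."""
    return E([mk * dk for mk, dk in zip(m, delta)])


sd_cond_mean = math.sqrt(E([(v - E(m)) ** 2 for v in m]))
delta_opt = standardize(m)     # the maximizer used in the proof of (a)
delta_other = standardize(xs)  # another admissible standardized disturbance
impact = mean_change(delta_opt)
other = abs(mean_change(delta_other))
```

In this example the optimal disturbance reproduces the standard deviation of the conditional mean (about 0.78), while the competing disturbance yields a smaller mean change (about 0.24).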

### 2.2. Extension to multivariate associations

We sometimes aim to quantify the overall dependence of $Y$ on a whole set of independent variables $X_1,\dots,X_m$. We consider here the vector $X=(X_1,\dots,X_m)$ without the constant $1$, because it is not required in this section. A natural extension of definition (2), which we call the “mean impact” of $X$ on $Y$, is given by

$$\iota_X(Y)=\sup_{\delta(X)\in L^2(\mathbb{R}),\ E[\delta(X)]=0,\ E[\delta^2(X)]=1} E[Y\delta(X)].$$

This parameter quantifies the effect of changes in the joint distribution of $X_1,\dots,X_m$ on $E(Y)$. More generally, we can define for a sub sigma-algebra $\mathcal{A}$ of the sample probability space the parameter $\iota_{\mathcal{A}}(Y)$ by consideration of all $\delta\in L^2$ that are measurable with respect to $\mathcal{A}$. This quantifies the overall dependence of $Y$ on the set of random variables generating $\mathcal{A}$. This points to perspectives for extensions of the concept to stochastic processes (like point processes) with time-varying covariates. We have not yet followed up this idea.

The properties of $\iota_X(Y)$ in Theorem 1 apply also to the multivariate $\iota_X(Y)$ and to $\iota_{\mathcal{A}}(Y)$, whereby in (e) of Theorem 1 we replace $g(X)$ by $g(X_1,\dots,X_m)$ or, more generally, by a real-valued function on the underlying probability space that is measurable with respect to $\mathcal{A}$. The proof of Theorem 1 remains essentially the same.

### 2.3. A non-linear measure of determination

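Property (e) of Theorem 1 implies $\iota_X^2(Y)=\mathrm{Var}[g(X)]$ if $Y$ follows a regression model $Y=g(X)+U$ with independent $U$ and $X$. Hence,

$$\mathrm{MoD}_X(Y)=\iota_X^2(Y)/\mathrm{Var}(Y)=\{\iota_X(Y)/\iota_Y(Y)\}^2 \tag{3}$$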

provides a natural definition for a (generally non-linear) measure of determination. Definition (3) is also useful without the regression assumption in (e), because (b) to (d) imply $0\le\mathrm{MoD}_X(Y)\le 1$, with $\mathrm{MoD}_X(Y)=0$ iff $E(Y|X)$ is independent of $X$, and $\mathrm{MoD}_X(Y)=1$ iff $Y$ depends on $X$ deterministically. Hence, $\mathrm{MoD}_X(Y)$ has the basic properties of a measure of determination. Moreover, $\iota_Y(Y)=\iota_{(X,Y)}(Y)$ by (a) of Theorem 1 and its extension to multivariate associations mentioned in Section 2.2. Therefore, $\iota_Y(Y)$ is the maximum change in $E(Y)$ that is reachable by changing the distribution of the data $(X,Y)$, and $\iota_X(Y)/\iota_Y(Y)$ is the fraction of the maximum mean change that is attributable to changes in the marginal distribution of $X$ only.

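A similar (non-linear) measure of association can be defined for the covariate vector $X=(X_1,\dots,X_m)$ or a sub sigma-algebra $\mathcal{A}$ by $\iota_X^2(Y)/\mathrm{Var}(Y)$ and $\iota_{\mathcal{A}}^2(Y)/\mathrm{Var}(Y)$, respectively.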

### 2.4. Linear mean impact and bivariate linear regression

We now discuss the estimation of $\iota_X(Y)$ and $\mathrm{MoD}_X(Y)$ from i.i.d. observations $(Y_i,X_i)$, $i=1,\dots,n$. Replacing the population distribution of $(Y,X)$ by the empirical distribution of the data gives the naive estimate

$$\hat\iota^{(0)}_X(Y)=\sup_{\delta(X)\in L^2(\mathbb{R}),\ \sum_{i=1}^n\delta(X_i)=0,\ (1/n)\sum_{i=1}^n\delta^2(X_i)=1}\ \frac{1}{n}\sum_{i=1}^n Y_i\,\delta(X_i).$$

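Unfortunately, this is not a sensible estimate, because (at least when the observed $X_i$ are pairwise distinct) it always equals its maximum $\widehat{\mathrm{SD}}(Y)$, where $\widehat{\mathrm{SD}}^2(Y)=(1/n)\sum_{i=1}^n(Y_i-\bar Y)^2$. This can be seen by the Cauchy-Schwarz inequality in $\mathbb{R}^n$ and arguments similar to those in the proof of Theorem 1. The failure of the naive estimate is closely related to the problem of over-fitting in statistical modeling.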
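The failure of the naive estimate can be illustrated with simulated data in which the response is generated independently of the covariate. The sketch below rests on assumptions: the sample size, the seed and the distributions are arbitrary, and the covariate values are pairwise distinct, so that a function of the covariate may take any value at each observation.

```python
import math
import random

random.seed(0)
n = 50
x = [random.random() for _ in range(n)]         # pairwise distinct covariate values
y = [random.gauss(0.0, 2.0) for _ in range(n)]  # generated independently of x

y_bar = sum(y) / n
sd_y = math.sqrt(sum((v - y_bar) ** 2 for v in y) / n)

# With distinct x values, delta can be chosen freely at every observed x_i.
# Choosing the standardized response attains the Cauchy-Schwarz bound.
delta = {xi: (yi - y_bar) / sd_y for xi, yi in zip(x, y)}

sum_delta = sum(delta[xi] for xi in x)           # empirical mean constraint
mean_sq = sum(delta[xi] ** 2 for xi in x) / n    # empirical standardization
naive = sum(yi * delta[xi] for xi, yi in zip(x, y)) / n
```

Although there is no association in the population, the naive estimate equals the empirical standard deviation of the response, which shows why the set of admissible disturbances has to be restricted.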

For a sensible estimate, we need to restrict the set of standardized distributional disturbances $\delta$, for instance, to linear functions or polynomials of a specific degree. Any restriction of $\delta$ leads to a potential underestimation of $\iota_X(Y)$, as the supremum in (2) becomes smaller with additional constraints. Therefore, additional restrictions on $\delta$ will, in general, provide conservative estimates of $\iota_X(Y)$.

In the rest of this paper, we will focus on linear $\delta$, because this provides the link to linear regression. Since the constraints $E[\delta(X)]=0$ and $E[\delta^2(X)]=1$ permit only the two linear functions $\delta(X)=\pm\{X-E(X)\}/\mathrm{SD}(X)$, we obtain from the linear disturbances the (smaller) association parameter

$$\iota^{\mathrm{lin}}_X(Y)=\sup_{\delta(x)=a+bx,\ E[\delta(X)]=0,\ E[\delta^2(X)]=1}E[Y\delta(X)]=\frac{|E[Y\{X-E(X)\}]|}{\mathrm{SD}(X)}=|\mathrm{Cov}(Y,X)|/\mathrm{SD}(X). \tag{4}$$

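We will call $\iota^{\mathrm{lin}}_X(Y)$ the “linear mean impact” of $X$ on $Y$. We know that $\iota^{\mathrm{lin}}_X(Y)\le\iota_X(Y)$. Moreover, if $E(Y|X)$ is a linear function itself, then one can see from (a) of Theorem 1 that $\iota^{\mathrm{lin}}_X(Y)$ equals $\iota_X(Y)$. The linear mean impact can be consistently estimated by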

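$$\hat\iota^{\mathrm{lin}}_X(Y)=|\widehat{\mathrm{Cov}}(Y,X)|/\widehat{\mathrm{SD}}(X) \tag{5}$$

where $\widehat{\mathrm{Cov}}(Y,X)$ and $\widehat{\mathrm{SD}}(X)$ are consistent estimates of $\mathrm{Cov}(Y,X)$ and $\mathrm{SD}(X)$.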

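Recall that the slope of the least squares regression line can also be written in terms of $\widehat{\mathrm{Cov}}(Y,X)$ and $\widehat{\mathrm{SD}}(X)$, namely as

$$\hat\theta_1=\widehat{\mathrm{Cov}}(Y,X)/\widehat{\mathrm{Var}}(X).$$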

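Therefore $|\hat\theta_1|=\hat\iota^{\mathrm{lin}}_X(Y)/\widehat{\mathrm{SD}}(X)$. Because $\iota^{\mathrm{lin}}_X(X)=\mathrm{SD}(X)$ and $\hat\iota^{\mathrm{lin}}_X(X)=\widehat{\mathrm{SD}}(X)$, we obtain that

$$|\hat\theta_1|=\hat\iota^{\mathrm{lin}}_X(Y)/\hat\iota^{\mathrm{lin}}_X(X).$$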

This is a consistent estimate of the parameter $|\theta_1|=\iota^{\mathrm{lin}}_X(Y)/\iota^{\mathrm{lin}}_X(X)$, which is the maximum possible change in $E(Y)$ divided by the maximum possible change in $E(X)$, when changing the marginal distribution of $X$ by standardized linear disturbances. The signs of $\theta_1$ and $\hat\theta_1$ are those of the population and empirical covariances between $X$ and $Y$.
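The relation between the plug-in estimate (5) and the least squares slope can be checked directly. The sketch below uses hypothetical data and helper names chosen for illustration; it is not the authors' code.

```python
import math


def cov(a, b):
    """Empirical covariance with divisor n."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n


def linear_mean_impact(y, x):
    """Plug-in estimate |Cov(Y, X)| / SD(X), as in (5)."""
    return abs(cov(y, x)) / math.sqrt(cov(x, x))


# Hypothetical data (illustration only).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 0.5, 2.5, 2.0, 4.0]

slope = cov(y, x) / cov(x, x)  # least squares slope
# Estimated linear mean impact of X on Y, in units of the impact of X on X.
ratio = linear_mean_impact(y, x) / linear_mean_impact(x, x)
```

For these data the slope is 0.75 and the ratio of the two estimated linear mean impacts gives the same value; the sign of the slope has to be read off the covariance separately.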

Because $\iota^{\mathrm{lin}}_X(Y)\le\iota_X(Y)$ and $\iota^{\mathrm{lin}}_X(X)=\iota_X(X)$, the absolute coefficient $|\hat\theta_1|$ is also a conservative estimate of

$$\tau_X(Y)=\iota_X(Y)/\iota_X(X),$$

which is the maximum possible change in $E(Y)$ divided by the maximum possible change in $E(X)$, when changing the marginal distribution of $X$ by arbitrary standardized disturbances. We call $\tau_X(Y)$ the “mean (impact) slope” of $X$ for $Y$. Because we can consider $|\theta_1|$ as a conservative (i.e. smaller), linear version of $\tau_X(Y)$, we call $|\theta_1|$ the “linear mean (impact) slope”.

To summarize, we have suggested a new, generally non-linear measure of association $\iota_X(Y)$ defined as the maximum possible change in $E(Y)$ achievable by standardized changes in the marginal distribution of $X$. We have then shown that, if the true mean structure is non-linear, $|\theta_1|$ and $|\hat\theta_1|$ have an interpretation as conservative estimates of $\tau_X(Y)$, i.e., the mean impact of $X$ on $Y$ in units of the maximum possible change in $E(X)$. If the mean structure is linear, then $|\theta_1|=\tau_X(Y)$ and $|\hat\theta_1|$ is consistent for $\tau_X(Y)$. In general, $|\hat\theta_1|$ can be considered as a consistent estimate of the smaller version of $\tau_X(Y)$, in which the distributional disturbances of $X$ are restricted to linear functions.

### 2.5. Conservative estimation of the non-linear measure of determination

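We can also use linear regression to conservatively estimate the generally non-linear measure of determination $\mathrm{MoD}_X(Y)$. Because $\iota^{\mathrm{lin}}_X(Y)\le\iota_X(Y)$ and $\iota_Y(Y)=\mathrm{SD}(Y)$, any consistent estimate of $\{\iota^{\mathrm{lin}}_X(Y)/\iota_Y(Y)\}^2$ will provide a conservative estimate of $\mathrm{MoD}_X(Y)$. One can easily verify from the formulas in the previous section that $\hat\iota^{\mathrm{lin}}_X(Y)/\widehat{\mathrm{SD}}(Y)$ is equal to the absolute empirical correlation between $X$ and $Y$. Hence, the classical linear measure of determination $R^2$ is a conservative estimate of the non-linear measure of determination $\mathrm{MoD}_X(Y)$. Moreover, if $E(Y|X)$ is linear in $X$, then $\iota^{\mathrm{lin}}_X(Y)=\iota_X(Y)$, and $R^2$ is a consistent estimate of $\mathrm{MoD}_X(Y)$.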
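The conservativeness of the linear measure of determination can be seen exactly at the population level. The following sketch assumes a hypothetical population (covariate uniform on {0, 1, 2}, quadratic regression function, independent error with unit variance) and compares the squared correlation with the non-linear measure of determination from (3).

```python
from fractions import Fraction as F

# Hypothetical population: X uniform on {0, 1, 2}, Y = X**2 + U,
# with U independent of X, E(U) = 0 and Var(U) = 1.
support = [F(0), F(1), F(2)]
p = F(1, 3)
var_u = F(1)


def E(h):
    """Population expectation E[h(X)]."""
    return sum(p * h(x) for x in support)


def cov(h1, h2):
    """Population covariance between h1(X) and h2(X)."""
    return E(lambda x: h1(x) * h2(x)) - E(h1) * E(h2)


g = lambda x: x * x
ident = lambda x: x

var_y = cov(g, g) + var_u   # Var(Y) = Var[g(X)] + Var(U)
mod = cov(g, g) / var_y     # squared mean impact over Var(Y), using Theorem 1(e)
# Squared population correlation; Cov(Y, X) = Cov(g(X), X) because U is independent of X.
r2 = cov(g, ident) ** 2 / (cov(ident, ident) * var_y)
```

In this example the squared correlation is 24/35, slightly below the non-linear measure of determination 26/35.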

### 2.6. Examples

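We determine the mean impact and mean slope for $Y=\vartheta_0+\vartheta_1X+\vartheta_2X^2+U$, i.e., when $E(Y|X)$ is quadratic and $X$, $U$ are stochastically independent. By (e) of Theorem 1, we obtain $\iota_X(Y)=\mathrm{SD}(\vartheta_1X+\vartheta_2X^2)$. The linear mean impact can be calculated by (2.4) as $|\mathrm{Cov}(\vartheta_1X+\vartheta_2X^2,X)|/\mathrm{SD}(X)$. We can also express the linear impact in terms of central moments of $X$,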

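$$\iota^{\mathrm{lin}}_X(Y)=\Bigl|\vartheta_1+\vartheta_2\bigl\{2E(X)+E([X-E(X)]^3)/\mathrm{Var}(X)\bigr\}\Bigr|\;\iota^{\mathrm{lin}}_X(X),$$

which shows that

$$|\theta_1|=\iota^{\mathrm{lin}}_X(Y)/\iota^{\mathrm{lin}}_X(X)=|\vartheta_1+2\vartheta_2E(X)|\quad\text{if } E(\{X-E(X)\}^3)=0,$$

as is the case for a normally distributed $X$.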
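The central-moment expression for the quadratic case can be checked with exact arithmetic. The sketch below is an illustration under assumptions: the skewed three-point distribution and the coefficient values `v1`, `v2` (standing for the linear and quadratic coefficients of the mean structure) are hypothetical.

```python
from fractions import Fraction as F

# Hypothetical skewed population: X uniform on {0, 1, 3}.
support = [F(0), F(1), F(3)]
p = F(1, 3)
v1, v2 = F(1, 2), F(2)  # hypothetical linear and quadratic coefficients


def E(h):
    """Population expectation E[h(X)]."""
    return sum(p * h(x) for x in support)


mu = E(lambda x: x)
var_x = E(lambda x: (x - mu) ** 2)
third = E(lambda x: (x - mu) ** 3)  # third central moment of X

# Direct population slope Cov(Y, X) / Var(X) for a quadratic conditional mean.
cov_yx = E(lambda x: (v1 * x + v2 * x * x) * (x - mu))
slope_direct = cov_yx / var_x

# Slope obtained from the central-moment expression.
slope_formula = v1 + v2 * (2 * mu + third / var_x)
```

Both computations give 95/14 here. Because the third central moment (20/27) is non-zero, the slope differs from the value the symmetric-case formula would give.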

Figure 1 shows $\theta_0+\theta_1x$, the linear least squares approximation of $E(Y|X=x)$, for three different populations with different distributions of $X$.

## 3. Partial mean impact and multiple regression

We turn now to the interpretation of the regression coefficients $\theta_k$, $k=1,\dots,m$, from a least squares multiple regression analysis with independent variables $X_1,\dots,X_m$ if the model, including the mean structure, has been misspecified.

### 3.1. Partial mean impact

The usual interpretation of the coefficient $\theta_k$ is that it describes the linear influence of $X_k$ on $Y$ when all other $X_j$ ($j\neq k$) are fixed. To translate this interpretation to our population-based point of view, we consider changes in the distribution of $X$ that leave the mean of all $X_j$ for $j\neq k$ unchanged. More precisely, we define the set of distributional disturbances

$$H_k=\bigl\{\delta(X)\in L^2(\mathbb{R}):\ E[\delta(X)]=0,\ E[\delta^2(X)]=1,\ E[X_j\delta(X)]=0\ \text{for all } j\neq k\bigr\}$$

and the maximum mean change

$$\iota_{X_k|X_j,j\neq k}(Y)=\sup_{\delta(X)\in H_k}E[Y\delta(X)]. \tag{6}$$

We call $\iota_{X_k|X_j,j\neq k}(Y)$ the “partial mean impact” of $X_k$ on $Y$. The partial mean impact has the following major property. The proof can be found in the appendix.

###### Theorem 2.

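Let $Y$ and all $X_j$, $j=1,\dots,m$, be square integrable. Then $\iota_{X_k|X_j,j\neq k}(Y)=0$ if and only if $E(Y|X)=\eta_0+\sum_{j\neq k}\eta_jX_j$ for some constants $\eta_0$ and $\eta_j$, $j\neq k$.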

### 3.2. Linear partial mean impact and multiple regression

Again we have to think of ways to estimate $\iota_{X_k|X_j,j\neq k}(Y)$. As in the bivariate case, this requires further restrictions of the set $H_k$ for $\delta$. To link the approach to multiple linear regression, we consider the set of linear disturbances

$$H^{\mathrm{lin}}_k=\Bigl\{\delta(X)=\eta_0+\sum_{j=1}^m\eta_jX_j:\ E[\delta(X)]=0,\ E[\delta^2(X)]=1,\ E[X_j\delta(X)]=0\ \text{for all } j\neq k\Bigr\} \tag{7}$$

and the linear version of the partial mean impact

$$\iota^{\mathrm{lin}}_{X_k|X_j,j\neq k}(Y)=\sup_{\delta(X)\in H^{\mathrm{lin}}_k}E[Y\delta(X)], \tag{8}$$

which we call the “partial linear mean impact” of $X_k$ on $Y$. The following theorem summarizes the most important properties of this association parameter. Its proof can be found in the appendix.

###### Theorem 3.

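Let $Y$ and all $X_j$, $j=1,\dots,m$, be square integrable. Then the following statements are true:

(a) If $\tilde X_k$ denotes the error term of the linear least squares projection of $X_k$ onto the constant and the $X_j$, $j\neq k$, and $\mathrm{Var}(\tilde X_k)>0$, then

$$\iota^{\mathrm{lin}}_{X_k|X_j,j\neq k}(Y)=\iota^{\mathrm{lin}}_{\tilde X_k}(Y).$$

(b) If $\mathrm{Var}(\tilde X_k)>0$, then

$$|\theta_k|=\iota^{\mathrm{lin}}_{X_k|X_j,j\neq k}(Y)/\iota^{\mathrm{lin}}_{X_k|X_j,j\neq k}(X_k).$$

(c) We have $\iota^{\mathrm{lin}}_{X_k|X_j,j\neq k}(Y)\le\iota_{X_k|X_j,j\neq k}(Y)$, and $\iota^{\mathrm{lin}}_{X_k|X_j,j\neq k}(X_k)=\iota_{X_k|X_j,j\neq k}(X_k)$.

(d) If $X_k$ and $(X_j)_{j\neq k}$ are independent, then $\iota^{\mathrm{lin}}_{X_k|X_j,j\neq k}(Y)=\iota^{\mathrm{lin}}_{X_k}(Y)$.

(e) If $E(Y|X)$ is linear in $X$, then $\iota_{X_k|X_j,j\neq k}(Y)=\iota^{\mathrm{lin}}_{X_k|X_j,j\neq k}(Y)$.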

Note that $\tilde X_k$ in (a) of Theorem 3 is the error term of White’s linear model (1) with $X_k$ as dependent and $X_j$, $j\neq k$, as independent variables. Mathematically speaking, it is the orthogonal complement of the projection of $X_k$ onto the space spanned by $X_j$, $j\neq k$, and the constant $1$. The theorem says that $\iota^{\mathrm{lin}}_{X_k|X_j,j\neq k}(Y)$ equals the (non-partial) linear mean impact of $\tilde X_k$ on $Y$. A similar result is known for linear regression, see e.g. Hastie et al. (2009; Section 3.2.3).
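The residualization statement can be illustrated with two covariates. In the sketch below, which uses hypothetical data and helper names, the coefficient of the first covariate from the multiple regression is compared with the ratio of the linear mean impacts of the residualized covariate on the response and on the covariate itself.

```python
def cov(a, b):
    """Empirical covariance with divisor n."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n


# Hypothetical data with two correlated covariates (illustration only).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0, 7.0]

# Multiple regression coefficient of x1 from the 2x2 normal equations (Cramer's rule).
det = cov(x1, x1) * cov(x2, x2) - cov(x1, x2) ** 2
theta1 = (cov(y, x1) * cov(x2, x2) - cov(y, x2) * cov(x1, x2)) / det

# Residual of x1 after least squares projection onto x2 and the constant.
b = cov(x1, x2) / cov(x2, x2)
a = sum(x1) / len(x1) - b * sum(x2) / len(x2)
x1_tilde = [u - a - b * v for u, v in zip(x1, x2)]

# Linear mean impacts of the residualized covariate on y and on x1.
sd_tilde = cov(x1_tilde, x1_tilde) ** 0.5
impact_y = abs(cov(y, x1_tilde)) / sd_tilde
impact_x1 = abs(cov(x1, x1_tilde)) / sd_tilde
ratio = impact_y / impact_x1
```

The ratio coincides with the absolute multiple regression coefficient, and the impact of the residualized covariate on the covariate itself equals the standard deviation of the residual.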

Statements (b) and (c) of the theorem show that the linear population coefficient $|\theta_k|$ is a conservative version of the generally non-linear measure of association

$$\tau_{X_k|X_j,j\neq k}(Y)=\iota_{X_k|X_j,j\neq k}(Y)/\iota_{X_k|X_j,j\neq k}(X_k),$$

which is the maximum change in $E(Y)$ divided by the maximum change in $E(X_k)$, both achievable by all standardized distributional changes in $X$ that leave the expectations $E(X_j)$ for $j\neq k$ unchanged. The parameter

$$\tau^{\mathrm{lin}}_{X_k|X_j,j\neq k}(Y)=\iota^{\mathrm{lin}}_{X_k|X_j,j\neq k}(Y)/\iota^{\mathrm{lin}}_{X_k|X_j,j\neq k}(X_k)$$

has the same interpretation but with linear (standardized) distributional disturbances. By (b) and (c) of the above theorem and the results in White (1980), the absolute least squares regression coefficient $|\hat\theta_k|$ is a consistent estimate of $\tau^{\mathrm{lin}}_{X_k|X_j,j\neq k}(Y)$ and a conservative estimate of $\tau_{X_k|X_j,j\neq k}(Y)$.

According to (d) of Theorem 3, the partial and non-partial linear impact coincide for stochastically independent covariates. By (e) the partial (non-linear) and partial linear mean impacts coincide when the conditional expectation of $Y$ is linear in $X$. In this case $|\theta_k|=\tau_{X_k|X_j,j\neq k}(Y)$, and $|\hat\theta_k|$ is a consistent estimate of $\tau_{X_k|X_j,j\neq k}(Y)$.

## 4. Partial mean impact and confounding

One common and important goal of fitting a multiple linear regression model is to adjust for potential confounding. Roughly speaking, confounding means that we find an association between $Y$ and an independent variable, say $X_1$, that is solely driven by the influence of other independent variables ($X_j$, $j\ge 2$) on $Y$ and $X_1$. An example of confounding is given, for instance, if the true mean structure $E(Y|X)$ is linear and does not include $X_1$ as an independent variable. However, if $E(Y|X)$ depends on $X_j$ for at least one $j$ ($j\ge 2$) with $\mathrm{Cov}(X_1,X_j)\neq 0$, then $E(Y|X_1)$ depends (in general) on $X_1$ as well, and the slope of the bivariate regression line would erroneously indicate an association between $Y$ and $X_1$. Estimation of the multiple regression coefficient $\theta_1$ instead of the bivariate slope will uncover the spurious association.
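A population-level toy calculation makes this kind of spurious association explicit. The sketch below is an assumption-laden illustration, not an example from the paper: the response depends only on the second covariate, the first covariate equals the second plus independent noise, and all moments are hypothetical.

```python
from fractions import Fraction as F

# Hypothetical population moments: X2 has variance 1, X1 = X2 + e with Var(e) = 1,
# and Y = 2*X2 + U, where e, U and X2 are mutually independent.
var_x2 = F(1)
var_x1 = var_x2 + F(1)
cov_x1_x2 = var_x2
cov_y_x2 = 2 * var_x2
cov_y_x1 = 2 * cov_x1_x2

# Slope of the bivariate regression of Y on X1 alone.
bivariate_slope = cov_y_x1 / var_x1

# Coefficients of the multiple regression of Y on X1 and X2 (Cramer's rule).
det = var_x1 * var_x2 - cov_x1_x2 ** 2
theta1 = (cov_y_x1 * var_x2 - cov_y_x2 * cov_x1_x2) / det
theta2 = (cov_y_x2 * var_x1 - cov_y_x1 * cov_x1_x2) / det
```

The bivariate slope equals 1 although the first covariate has no influence on the response, whereas the multiple regression assigns coefficient 0 to the first and 2 to the second covariate.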

A more formal and more general way of defining confounding is by cases where the conditional mean of $Y$ given $X$ is independent of $X_1$, i.e., where we can write

$$E(Y|X)=g(X_2,\dots,X_m) \tag{9}$$

for some measurable function $g$. The mathematically rigorous meaning of (9) is that $E(Y|X)$ is measurable with respect to the $\sigma$-algebra generated by $X_2,\dots,X_m$. If the population association measure in question (e.g. the population regression coefficient or the mean impact) indicates an association between $Y$ and $X_1$ even though (9) is true, then one would speak of confounding. By this definition, confounding is a property (or weakness) of the population association measure. Note that confounding is defined relative to a set of covariates $X_2,\dots,X_m$. It may appear or disappear when adding or removing covariates, respectively.

The choice of the set of covariates $X_2,\dots,X_m$, relative to which confounding is considered, is not primarily a statistical question. It depends on the scientific context, the interpretation of association in this context, a priori scientific knowledge and practical constraints. Note that confounding relative to $X_2,\dots,X_m$ implies confounding relative to any larger set of covariates.

Given a set of covariates $X_2,\dots,X_m$, a parameter for the association between $Y$ and $X_1$ is free of confounding if it does not indicate an association whenever (9) is true for a measurable (and square integrable) $g$.

By (a) of Theorem 2, the partial mean impact (6) is zero (indicating no association) when $g$ in (9) is a linear function. Of course, the same is true for the partial linear mean impact.

Unfortunately, the (non-linear) partial mean impact is not completely free of confounding, because it can be positive for non-linear functions $g$. Assume, for instance, that and where is exponentially distributed with mean 1 and for some and a random variable which is distributed as and stochastically independent from . Assume also that and let . Then , and .

Furthermore, for all . Hence, even though can be written as function of only .

Note that also in the above example, because can be rewritten as linear function in , . However, we can see that for a multivariate normal identity (9) always implies , and thereby . This follows from the fact that for multivariate normal and linear , the identities and for all , imply that and are stochastically independent. Consequently, every is uncorrelated to every square integrable . We show in the appendix that is free of confounding if and only if is linear in .

When $E(X_1|X_2,\dots,X_m)$ is non-linear, we can define association measures that are more robust against confounding by adding functions of $X_2,\dots,X_m$ to the set of covariates in the definitions of the partial mean impact and the partial linear mean impact in (6) and (8). For instance, adding all squares $X_j^2$ and two-fold products $X_jX_l$ for $2\le j<l\le m$ as additional covariates, the partial mean impact is zero under (9) for multivariate polynomials $g$ of degree 2, and the linear partial mean impact is completely free of confounding if $E(X_1|X_2,\dots,X_m)$ is quadratic in $X_2,\dots,X_m$. The corresponding association measures can be estimated by the $X_1$-slope of the regression model that is linear in $X_1$ and multivariate quadratic in $X_2,\dots,X_m$.

## 5. Examples and interpretation under regression dependent covariates

We can provide an even more intuitive and complete interpretation of the partial linear mean impact under the assumption that $X_1$ and $X_2,\dots,X_m$ are related by a linear regression relationship

$$X_1=\beta_0+\sum_{j=2}^m\beta_jX_j+\tilde X_1, \tag{10}$$

whereby $\tilde X_1$ and $(X_2,\dots,X_m)$ are stochastically independent. Because the conditional expectation of $X_1$ is linear in $X_2,\dots,X_m$, the partial linear mean impact is completely free of confounding under this assumption. Note that $\tilde X_1$ in (10) and in (a) of Theorem 3 are identical.

A multivariate normal $X$ is a typical example for (10). However, we will not assume that $X$ or $X_j$, $j=1,\dots,m$, are normally distributed, because there is only little gain in clarity from such additional assumptions. At a single (and well indicated) point we will additionally assume that $E(\tilde X_1^3)=0$, which follows when $\tilde X_1$ is normally or, more generally, symmetrically distributed.

This condition on the third moment of $\tilde X_1$ indicates that we will sometimes need to assume integrability or square integrability for specific functions of $X$. We will make these assumptions whenever required without stating them explicitly.

We will now present some examples and afterwards the more complete interpretation of the partial linear mean impact and linear regression slope.

### 5.1. Semi-linear additive mean structure

We start with the case where for some possibly non-linear (measurable) function . By assumption (10), for stochastically independent and . Recall from (a) of Theorem 3 that and . Note that by the stochastic independence between and , we get with intercept

 ϑ∗0=ϑ0+ϑ1β0+ϑ1m∑j=2