# The conditionality principle in high-dimensional regression

David Azriel
July 11, 2019
###### Abstract

Consider a high-dimensional linear regression problem, where the number of covariates is larger than the number of observations. The interest is in estimating $\sigma^2$, the conditional variance of the response variable given the covariates. Two frameworks are considered, conditional and unconditional, where conditioning is with respect to the covariates, which are ancillary to $\sigma^2$. In recent papers, a consistent estimator was developed in the unconditional framework when the marginal distribution of the covariates is normal with known mean and variance. In this work a Bayesian hypothesis testing problem is formulated under the conditional framework, and it is shown that the Bayes risk is a constant. However, when the marginal distribution of the covariates is normal, a rule based on the above consistent estimator can be constructed such that its Bayes risk converges to zero, with probability converging to one. This means that with high probability, its Bayes risk would be smaller than that of the Bayes rule. It follows that even in the conditional setting, information about the marginal distribution of an ancillary statistic may have a significant impact on statistical inference. The practical implication in the context of high-dimensional regression models is that additional observations where only the covariates are given are potentially very useful and should not be ignored.

## 1 Introduction

An ancillary statistic is one whose distribution does not depend on the parameters of the model. In Cox and Hinkley (1974) notation and words, if C is an ancillary statistic then “the conditionality principle is that the conclusion about the parameter of interest is to be drawn as if C were fixed at its observed value c” (p. 38). This principle has two implications:

1. Conditional inference: Statistical inference should be conditioned on an ancillary statistic.

2. Ignorability of the marginal distribution: The true marginal distribution of an ancillary statistic should be ignored in any estimation procedure.

Focusing on the conditional inference implication (Point 1), Brown (1990) presented an “ancillarity paradox”: in a certain regression problem, the standard estimator is admissible in the conditional setting for every value of the ancillary statistic, but it is inadmissible in the unconditional setting. Brown argues that the common practice of considering only conditional inference is sometimes misleading and that the unconditional risk function should also be taken into account. This work goes one step further with respect to the ignorability implication (Point 2). It is shown, within the conditional setting, that an estimator that uses the true marginal distribution of an ancillary statistic has, with large probability, a smaller Bayes risk than the Bayes estimator. That is, even if one carries out conditional inference, one can still benefit from the marginal distribution of an ancillary statistic, which contradicts the ignorability implication. Moreover, while in Brown’s setting the advantage of the new estimator becomes negligible as the sample size grows, in the asymptotic regime considered here the gap between the Bayes estimator and the improved one grows with the sample size. This is because a high-dimensional regression is considered here, in which the sampling distribution itself changes with the sample size.

The regression linear model is , where is a (column) vector of the response variables, is a matrix of covariates, is a vector of unknown parameters and is a vector of the residuals. It is assumed that are independently and identically distributed, where is the -th row of the matrix ; denotes a “generic” observation that should be read as for some . Here the focus is on estimating . When the linear model is true, i.e., , then is the coefficient of the conditional expectation; when the conditional expectation is not linear in , then is still meaningful in the sense it can be defined as the coefficient of the best linear predictor; see Buja et al. (2014) for exact definitions.

Consider an additional sample of size where only observations from the marginal distribution of are given. In the machine learning literature this situation is called “semi-supervised” learning; see Zhou and Belkin (2014) and references therein. Such a situation arises when the observations are costly and may require a skilled human agent and the ’s are easy to obtain. A typical example is web document classification, where the classification is done by a human agent while there are many more unlabeled on-line documents. In such situations, is much larger than , and hence the marginal distribution of can be assumed known.

While there is a large body of literature on this problem in the machine learning world, there are very few statistical papers concerning this topic. The standard approach in the statistical literature may be best summarized by the following quote from Little (1992):

The related problem of missing values in the outcome was prominent in the early history of missing-data methods, but is less interesting in the following sense: If the ’s are complete and the missing values of are missing at random, then the incomplete cases contribute no information to the regression of on .

This approach is justified by the conditionality principle since $X$ is ancillary to the parameters of interest. However, Buja et al. (2014) show that $X$ is ancillary with respect to $\beta$ if and only if the linear model is actually true. Indeed, for low-dimensional regression ($p$ being fixed), Chakrabortty and Cai (2018) and Azriel et al. (2016) construct semi-supervised estimators that asymptotically dominate the least squares estimator, as the latter is based only on the labeled data set. The asymptotic variance of the new estimators is smaller than that of the least squares estimator when a certain non-linearity condition holds; otherwise, the new estimators are equivalent to least squares. In other words, improvement can be made only in the non-linear case, where $X$ is no longer ancillary. The case of high-dimensional regression is different, as discussed below.

In high-dimensional regression is larger than , i.e., there are more parameters than observations. In this setting, the Lasso estimator suggested by Tibshirani (1996) has gained much popularity and many extensions were suggested. That line of research is related to sparsity assumptions where most of the parameters are assumed to be zero or close to zero. When those assumptions do not hold then estimation of the entire vector of is not feasible. However, it is possible to estimate the signal-to-noise-ratio and . Dicker (2014) and Lucas et al. (2017) suggested estimators when assuming that the marginal distribution of is i.i.d standard normal. In the context of semi-supervised learning and when the marginal distribution of is say, then by a linear transformation, which does not change or the signal-to-noise ratio, could become i.i.d standard normal, justifying their assumptions.

The interesting point here is that the marginal distribution of plays an important role in the estimation procedure even though it is ancillary to the parameters of interest. This point was the motivation for the present paper. The focus here is on the estimator of Dicker (2014) since it is easier to analyze (although it might be inferior to the estimator of Lucas et al. (2017)). It is shown that under an asymptotic regime where and under a certain Bayesian formulation in the conditional setting, the Bayes risk of the optimal Bayes rule equals 0.5, while the Bayes risk of a rule based on Dicker’s estimate is arbitrarily close to zero with probability converging to one. Under the Bayesian framework an optimal rule is well defined and it is somewhat paradoxical that one can find a rule that has a smaller Bayes risk than the Bayes rule. This paradox implies that information on the marginal distribution of an ancillary statistic can be helpful even in the conditional setting contradicting the ignorability implication of the conditionality principle that was mentioned above.

## 2 Preliminaries

Assume that are i.i.d from the distribution

$$y \mid x = x_0 \sim N(\beta^T x_0, \sigma^2) \quad \text{and} \quad x \sim N(0, I), \tag{1}$$

for some unknown vector and is the identity matrix. The purpose is to estimate . This work studies Model (1) under an asymptotic regime in which both the dimension and the number of observations converge to infinity with as , where . It is also assumed that and are of order of a constant as .

Estimation of is being considered in two settings:

• Conditional setting: the distribution is conditioned upon the ’s. That is, , with . Since , there are no constraints on the (the parameter space for the ’s is ) as is of full rank with probability 1. To clarify the notation we use , and to denote conditional probability, expectation and variance.

• Unconditional setting as above. Here one can use the known marginal distribution of the ’s.

In the conditional setting, since the ’s are unrestricted and plays no role. This is different from low-dimensional regression () where conditioning on restrict the ’s to lie in a -dimensional sub-space of . This is the reason why there is such a difference in conditioning in low- and high-dimensional regression. Unlike the unconditional setting, in the conditional one, it is shown below that there exists no consistent estimator for . This implies that in a high-dimensional regression model, when estimation of is of interest, the unconditional setting should be preferred.

Dicker (2014) considered the unconditional setting and noticed that under Model (1),

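$$E\left(\frac{1}{n}\|Y\|^2\right) = \|\beta\|^2 + \sigma^2 \quad \text{and} \quad E\left(\frac{1}{n^2}\|X^T Y\|^2\right) = \frac{p+n+1}{n}\|\beta\|^2 + \frac{p}{n}\sigma^2,$$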

where these expectations are unconditional. Solving these equations yields the estimators

$$\hat\sigma^2_{Dicker} = \frac{p+n+1}{n(n+1)}\|Y\|^2 - \frac{1}{n(n+1)}\|X^T Y\|^2 \quad \text{and} \quad \widehat{\|\beta\|^2} = \frac{\|X^T Y\|^2 - p\|Y\|^2}{n(n+1)}. \tag{2}$$

Here the focus is on $\hat\sigma^2_{Dicker}$. It follows that $E(\hat\sigma^2_{Dicker}) = \sigma^2$; the variance is

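$$\mathrm{Var}(\hat\sigma^2_{Dicker}) = \frac{2}{n}\left\{\frac{p}{n}\left(\sigma^2 + \|\beta\|^2\right)^2 + \sigma^4 + \|\beta\|^4\right\}\{1 + O(1/n)\}. \tag{3}$$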

Conditional analysis of Dicker’s estimate is carried out below.
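The estimators in (2) are simple to compute. The following is a minimal sketch in plain Python (standard library only), not the author's code: the function names `dicker_estimates`, `simulate_model_1` and `mean_sigma2_hat`, as well as the sample sizes used in the usage note, are our own illustrative choices. It also includes a small Monte Carlo check of unbiasedness under Model (1) in the unconditional setting.

```python
import random


def dicker_estimates(X, Y):
    """Return (sigma2_hat, beta_norm2_hat) as defined in equation (2).

    X is a list of n rows, each a list of p floats; Y is a list of n floats.
    """
    n, p = len(X), len(X[0])
    norm_y2 = sum(y * y for y in Y)  # ||Y||^2
    # ||X^T Y||^2: inner product of each column of X with Y, squared and summed.
    norm_xty2 = sum(
        sum(X[i][j] * Y[i] for i in range(n)) ** 2 for j in range(p)
    )
    sigma2_hat = ((p + n + 1) * norm_y2 - norm_xty2) / (n * (n + 1))
    beta_norm2_hat = (norm_xty2 - p * norm_y2) / (n * (n + 1))
    return sigma2_hat, beta_norm2_hat


def simulate_model_1(n, p, beta, sigma2, rng):
    """Draw one data set (X, Y) from Model (1) with x ~ N(0, I)."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    sd = sigma2 ** 0.5
    Y = [sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, sd) for row in X]
    return X, Y


def mean_sigma2_hat(n, p, beta, sigma2, reps, seed=0):
    """Average sigma2_hat over `reps` independent draws of (X, Y)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        X, Y = simulate_model_1(n, p, beta, sigma2, rng)
        total += dicker_estimates(X, Y)[0]
    return total / reps
```

For instance, `mean_sigma2_hat(30, 60, [(1 / 60) ** 0.5] * 60, 1.0, reps=200)` averages the estimate over 200 data sets with $\|\beta\|^2 = \sigma^2 = 1$ and should land near 1, since the estimator is unbiased under Model (1). A single estimate can be negative, because it is a difference of two quadratic forms. The two estimates always add up to $\|Y\|^2/n$, which gives a quick sanity check.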

## 3 The Bayes risk

Consider now the conditional setting. Suppose we want to determine whether or . Let be a rule and denote by the true value. The risk is and it is a function of the unknown parameters . We show below that for any

$$\int R_\delta \, d\pi(\mu_1,\ldots,\mu_n,\sigma^2) \ge 1/2, \tag{4}$$

where $\pi$ is a certain probability measure on the parameter space to be defined below. The probability measure in (4), $\pi$, can be thought of as a prior density for the parameters. For any rule $\delta$, the integral in (4) is bounded below by the Bayes risk. However, in the next section we show that under a rule based on Dicker’s estimate, the integral in (4) can be arbitrarily small. This rule uses the knowledge of the (known) marginal distribution of $X$. This demonstrates the usefulness of the information on the distribution of an ancillary statistic in this context.

In order to show (4), it is enough to compute the Bayes risk. The idea is to define $\pi$ as a mixture of two priors

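$$\mu_1,\ldots,\mu_n \overset{iid}{\sim} N(0,\eta_0^2),\ \sigma^2=\sigma_0^2 \quad \text{and} \quad \mu_1,\ldots,\mu_n \overset{iid}{\sim} N(0,\eta_1^2),\ \sigma^2=\sigma_1^2.$$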

When $\eta_0^2+\sigma_0^2=\eta_1^2+\sigma_1^2$, the $y_i$’s have the same marginal distribution under the two priors and therefore no Bayesian procedure can distinguish between them, leading to (4).

Specifically, consider a Bayesian formulation where . Given ,

$$\mu_1,\ldots,\mu_n \mid J=0 \overset{iid}{\sim} N(0,\eta_0^2) \quad \text{and} \quad \sigma^2 \mid J=0 \text{ equals } \sigma_0^2;$$

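and similarly for $J=1$ with $(\eta_1^2,\sigma_1^2)$ replacing $(\eta_0^2,\sigma_0^2)$. Let $\pi$ be the induced probability measure on $(\mu_1,\ldots,\mu_n,\sigma^2)$. Lemma 3.1 describes the Bayes rule under $\pi$. Specifically, it is shown that if $\eta_0^2+\sigma_0^2=\eta_1^2+\sigma_1^2$, the posterior probability of $J=0$ is the same as the prior, implying (4).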

###### Lemma 3.1.

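If $\eta_0^2+\sigma_0^2=\eta_1^2+\sigma_1^2$ then

$$P_{x,\pi}(J=0 \mid y_1,\ldots,y_n)=P_{x,\pi}(J=1 \mid y_1,\ldots,y_n)=1/2,$$

where $P_{x,\pi}$ denotes the conditional probability under the prior $\pi$. Otherwise, $P_{x,\pi}(J=0 \mid y_1,\ldots,y_n)$ is higher if

$$\frac{1}{n}\sum_{i=1}^n y_i^2 > \frac{\log\left(\dfrac{\eta_0^2+\sigma_0^2}{\eta_1^2+\sigma_1^2}\right)}{\dfrac{1}{\eta_1^2+\sigma_1^2}-\dfrac{1}{\eta_0^2+\sigma_0^2}}.$$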
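The posterior probability in Lemma 3.1 can be evaluated directly from the closed-form marginal likelihoods derived in the proof. The sketch below is illustrative only: the function name `posterior_prob_j0` and its argument names are ours, and it assumes equal prior weights $P(J=0)=P(J=1)=1/2$, which is how we read the lemma.

```python
import math


def posterior_prob_j0(y, eta0_sq, sigma0_sq, eta1_sq, sigma1_sq):
    """Posterior probability of J = 0 given y, assuming P(J = 0) = 1/2.

    After integrating out the mu_i, the y_i are i.i.d. N(0, eta_j^2 + sigma_j^2)
    under J = j, so only the two total variances enter the computation.
    """
    n = len(y)
    sum_sq = sum(v * v for v in y)
    v0 = eta0_sq + sigma0_sq
    v1 = eta1_sq + sigma1_sq
    # log f(y | J = 0) - log f(y | J = 1)
    log_ratio = 0.5 * n * math.log(v1 / v0) + 0.5 * sum_sq * (1.0 / v1 - 1.0 / v0)
    # Numerically stable logistic transform of the log-likelihood ratio.
    if log_ratio >= 0:
        return 1.0 / (1.0 + math.exp(-log_ratio))
    e = math.exp(log_ratio)
    return e / (1.0 + e)
```

When the two total variances coincide, for example `posterior_prob_j0(y, 1.0, 2.0, 2.0, 1.0)`, the log-likelihood ratio is zero for every `y` and the function returns exactly one half, which is the situation exploited in (4).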

Since (4) holds for any rule $\delta$, it follows that there exists no consistent estimator for $\sigma^2$ in the conditional setting. This result is summarized in the next corollary.

###### Corollary 3.1.

Under Model (1) in the conditional setting there exists no consistent estimator for $\sigma^2$.

## 4 Dicker’s estimate in the conditional setting

Recall Dicker’s estimate, which is defined in (2) and consider the problem of the previous section to determine whether or . Assume without loss of generality that . Define a rule based on Dicker’s estimate

$$\delta_{Dicker} = \begin{cases} 1 & \hat\sigma^2_{Dicker} > \dfrac{\sigma_0^2+\sigma_1^2}{2}, \\ 0 & \text{otherwise.} \end{cases}$$

Define the conditional risk . We show below that with high probability (over ), is small. This implies that the Bayes risk of is smaller than the Bayes rule with high probability.
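As a small illustration, the threshold rule can be written as a function of the estimate. The sketch below is hypothetical: the names `dicker_rule` and `misclassification_rate` are ours, it assumes $\sigma_0^2 < \sigma_1^2$ so that a large estimate points to $J=1$, and it reads the risk as the probability of choosing the wrong label, estimated by a simple average over simulated data sets.

```python
def dicker_rule(sigma2_hat, sigma0_sq, sigma1_sq):
    """Decide 1 when the estimate exceeds the midpoint of the two candidates, else 0."""
    return 1 if sigma2_hat > (sigma0_sq + sigma1_sq) / 2.0 else 0


def misclassification_rate(decisions, truths):
    """Fraction of data sets in which the decision differs from the true label J."""
    if len(decisions) != len(truths):
        raise ValueError("decisions and truths must have the same length")
    wrong = sum(1 for d, j in zip(decisions, truths) if d != j)
    return wrong / len(decisions)
```

With candidates 1 and 2, an estimate of 1.6 gives decision 1, while an estimate exactly at the midpoint 1.5 gives decision 0 because the inequality is strict.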

###### Theorem 4.1.

Consider model (1), and assume that as for , and that are bounded as . Then, there exist a sequence of sets and a constant , such that as and for and any ,

$$P_x\left(|\hat\sigma^2_{Dicker}-\sigma^2| \ge \xi\right) \le \frac{C\, g(c,n,\mu,\sigma^2)}{\xi^2\sqrt{n}}, \tag{5}$$

where .

Consider the conditional probability of the event that but is actually true, i.e., ; then,

$$P_x\left\{\hat\sigma^2_{Dicker} \ge (\sigma_1^2+\sigma^2)/2\right\} \le P_x\left(|\hat\sigma^2_{Dicker}-\sigma^2| \ge \xi\right),$$

where . Theorem 4.1 implies that for ,

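$$\int P_x\left\{\hat\sigma^2_{Dicker} \ge (\sigma_1^2+\sigma^2)/2\right\} d\pi(\mu_1,\ldots,\mu_n,\sigma^2) \le \frac{C \int g(c,n,\mu,\sigma^2)\, d\pi(\mu_1,\ldots,\mu_n,\sigma^2)}{\xi^2\sqrt{n}}.$$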

Since has finite expectation with respect to , then the conditional probability converges to zero. It follows that for , converges to zero at rate .
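The step from the decision threshold to the deviation bound (5) can be spelled out. The following is a sketch under two assumptions that are our reading of the text rather than statements taken from it: $\sigma_0^2 < \sigma_1^2$, and $\xi = (\sigma_1^2-\sigma_0^2)/2$. When $J=0$ is true, so that $\sigma^2=\sigma_0^2$:

```latex
\begin{align*}
\left\{\hat\sigma^2_{Dicker} \ge \tfrac{\sigma_1^2+\sigma_0^2}{2}\right\}
  &= \left\{\hat\sigma^2_{Dicker} - \sigma_0^2 \ge \tfrac{\sigma_1^2-\sigma_0^2}{2}\right\}
   = \left\{\hat\sigma^2_{Dicker} - \sigma_0^2 \ge \xi\right\} \\
  &\subseteq \left\{\left|\hat\sigma^2_{Dicker} - \sigma_0^2\right| \ge \xi\right\},
\intertext{so that, by (5),}
P_x\left\{\hat\sigma^2_{Dicker} \ge \tfrac{\sigma_1^2+\sigma_0^2}{2}\right\}
  &\le \frac{C\, g(c,n,\mu,\sigma_0^2)}{\xi^2\sqrt{n}}
   = \frac{4C\, g(c,n,\mu,\sigma_0^2)}{(\sigma_1^2-\sigma_0^2)^2\sqrt{n}}.
\end{align*}
```

The case where $J=1$ is true is symmetric, with the event $\hat\sigma^2_{Dicker} \le (\sigma_1^2+\sigma_0^2)/2$ and the same $\xi$. The bound makes explicit that the closer the two candidate values are, the larger $n$ must be for the error probability to be small.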

## 5 Simulations

The discussion in the previous section was based on asymptotic arguments. Here the purpose is to evaluate for certain values of and using simulations. Table 1 reports simulation estimates of under various scenarios. For each scenario, was sampled once from a standard normal distribution with independent entries. In all the scenarios considered here, so that the Bayes risk is 1/2; also . For each scenario, 10,000 simulated data sets were sampled where in 5,000 data sets the ’s are i.i.d and , while in the other 5,000 data sets, the distribution is and . Each was standardized so that and . Hence, if are i.i.d then each is distributed . In each data set it was recorded whether correctly identifies . The mean of those indicators over the simulated data sets is a simulation estimate of .

In all scenarios but varies from to . Similarly, is fixed to be 1 and goes from to . As becomes farther from and as grows, gets smaller. When it is the closest to , but even then and even when , is smaller than the Bayes risk, which is 1/2. This demonstrates that even for relatively small values of , the asymptotic findings of the previous section hold in the sense that is smaller than the Bayes risk with large probability.

The results of Table 1 are random since they depend on the matrix ; different ’s yield different results for . In order to assess this randomness, the scenario with was repeated 1000 times, where in each repetition 10,000 data sets were simulated as above. The resulting histogram is given in Figure 1. It is demonstrated that in this setting, where and are not large and also and are relatively close, the probability that is smaller than the Bayes risk, which is 0.5, is practically 1.

## 6 Conclusions

To sum up, a Bayesian hypothesis testing problem is considered in the conditional setting, where it is shown that the corresponding Bayes risk is constant, while a rule based on Dicker’s estimate attains a smaller Bayes risk with high probability. More generally, by Corollary 3.1, there exists no consistent estimator for $\sigma^2$ in the conditional setting, whereas Dicker’s estimate is consistent in the unconditional setting. Since $X$ is ancillary in the above model, this finding demonstrates that the ignorability implication of the conditionality principle that was mentioned in the Introduction is inadequate under Model (1), at least in the context of the estimation of $\sigma^2$.

Unlike in Brown (1990), here it is shown that an estimator based on information about the marginal distribution of an ancillary statistic is helpful also in the conditional setting. On the other hand, Brown considers the conditional setting for every possible value of $X$, while here the claims are given only with high probability with respect to the distribution of $X$.

The results of this paper are close in spirit to the work of Robins and Ritov (1997). They consider a certain semi-parametric problem and show that when the marginal distribution of an ancillary statistic is known, a consistent estimator exists, but otherwise consistent estimation is impossible. Robins and Wasserman (2000) argue that this latter result calls for a rethinking of fundamental principles in the presence of modern data sets where high-dimensional or infinite-dimensional models are natural. This work is another example where low-dimensional models differ from high-dimensional models with respect to the conditionality principle.

The purpose of this work is not only to make general comments on the conditionality principle but also to emphasize the importance of the semi-supervised framework in the context of high-dimensional regression. As mentioned in the Introduction, this problem has received little attention in the statistical literature. The current work aims at changing this situation.

## 7 Proofs

### Proof of Lemma 3.1

All the probabilities and expectations below are with respect to and the subscript is suppressed in the notation. The difference of the log posterior probabilities is,

$$\log P_x(J=0 \mid y_1,\ldots,y_n) - \log P_x(J=1 \mid y_1,\ldots,y_n) = \log f(y_1,\ldots,y_n \mid J=0) - \log f(y_1,\ldots,y_n \mid J=1),$$

where is the conditional density; notice that the prior does not play a role since . Furthermore,

$$f(y_1,\ldots,y_n \mid J=0) = E\left\{f(y_1,\ldots,y_n \mid \mu,\sigma^2) \mid J=0\right\},$$

where the expectation is over (the ’s are constants and is also constant given ). Now,

$$E_x\left\{f(y_1,\ldots,y_n \mid \mu,\sigma^2) \mid J=0\right\} = E\left\{\frac{1}{\left(\sqrt{2\pi\sigma_0^2}\right)^n}\exp\left[-\sum_{i=1}^n (y_i-\mu_i)^2/2\sigma_0^2\right]\right\},$$

where the latter expectation is over (the ’s are constants). To compute the latter expectation consider the integral

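$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left[-\frac{(y-\mu)^2}{2\sigma_0^2}\right]\frac{1}{\sqrt{2\pi\eta_0^2}}\exp\left[-\frac{\mu^2}{2\eta_0^2}\right]d\mu.$$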

By trivial algebra we have

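$$\frac{(y-\mu)^2}{\sigma_0^2}+\frac{\mu^2}{\eta_0^2} = \left(\frac{1}{\sigma_0^2}+\frac{1}{\eta_0^2}\right)\left(\mu-\frac{y}{1+\sigma_0^2/\eta_0^2}\right)^2+\frac{y^2}{\eta_0^2+\sigma_0^2}.$$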

Therefore,

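$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left[-\frac{(y-\mu)^2}{2\sigma_0^2}\right]\frac{1}{\sqrt{2\pi\eta_0^2}}\exp\left[-\frac{\mu^2}{2\eta_0^2}\right]d\mu = \frac{1}{\sqrt{2\pi(\sigma_0^2+\eta_0^2)}}\exp\left(-\frac{y^2}{2(\eta_0^2+\sigma_0^2)}\right).$$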

Hence,

$$\log f(y_1,\ldots,y_n \mid J=0) = -\frac{n}{2}\log(2\pi)-\frac{n}{2}\log(\eta_0^2+\sigma_0^2)-\frac{\sum_{i=1}^n y_i^2}{2(\eta_0^2+\sigma_0^2)}$$

and

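$$\log f(y_1,\ldots,y_n \mid J=0) - \log f(y_1,\ldots,y_n \mid J=1) = \frac{n}{2}\log\left(\frac{\eta_1^2+\sigma_1^2}{\eta_0^2+\sigma_0^2}\right)+\frac{1}{2}\sum_{i=1}^n y_i^2\left(\frac{1}{\eta_1^2+\sigma_1^2}-\frac{1}{\eta_0^2+\sigma_0^2}\right).$$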

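Therefore, when $\eta_0^2+\sigma_0^2=\eta_1^2+\sigma_1^2$, the posterior probabilities that $J=0$ and $J=1$ are equal (no matter what the values of the $y_i$’s are). Otherwise, the posterior probability that $J=0$ is higher when

$$\frac{1}{n}\sum_{i=1}^n y_i^2 \ge \frac{\log\left(\dfrac{\eta_0^2+\sigma_0^2}{\eta_1^2+\sigma_1^2}\right)}{\dfrac{1}{\eta_1^2+\sigma_1^2}-\dfrac{1}{\eta_0^2+\sigma_0^2}}.$$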
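The key step of the proof is that integrating a $N(\mu,\sigma_0^2)$ density in $y$ against a $N(0,\eta_0^2)$ prior on $\mu$ gives a $N(0,\sigma_0^2+\eta_0^2)$ density. This can be checked numerically. The sketch below uses a plain trapezoid rule; the function names and the integration window (`half_width`, `steps`) are illustrative choices of ours, not part of the paper.

```python
import math


def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def marginal_density_numeric(y, sigma0_sq, eta0_sq, half_width=12.0, steps=20000):
    """Trapezoid-rule value of the integral over mu of N(y; mu, sigma0^2) N(mu; 0, eta0^2)."""
    # The window covers +/- half_width prior standard deviations of mu.
    limit = half_width * math.sqrt(eta0_sq)
    h = 2.0 * limit / steps
    total = 0.0
    for k in range(steps + 1):
        mu = -limit + k * h
        weight = 0.5 if k in (0, steps) else 1.0  # end points get half weight
        total += weight * normal_pdf(y - mu, sigma0_sq) * normal_pdf(mu, eta0_sq)
    return total * h


def marginal_density_closed_form(y, sigma0_sq, eta0_sq):
    """N(0, sigma0^2 + eta0^2) density at y, the claimed value of the integral."""
    return normal_pdf(y, sigma0_sq + eta0_sq)
```

Comparing `marginal_density_numeric(0.7, 1.5, 0.5)` with `marginal_density_closed_form(0.7, 1.5, 0.5)` should show agreement to many decimal places. In particular, the marginal density depends on $(\sigma_0^2,\eta_0^2)$ only through their sum, which is why the two priors cannot be told apart when the sums are equal.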

### Proof of Corollary 3.1

Assume that a consistent estimate exists and denote it by . Without loss of generality, assume that ; now, define the corresponding rule

$$\delta^* = \begin{cases} 1 & \hat\sigma^2_* > \dfrac{\sigma_0^2+\sigma_1^2}{2}, \\ 0 & \text{otherwise.} \end{cases}$$

By consistency, converges to zero for any sequence and for . However, since is bounded, by Lebesgue’s dominated convergence theorem,

$$\lim_{n\to\infty}\int R_{\delta^*}\, d\pi(\mu_1,\ldots,\mu_n,\sigma^2)=0,$$

which contradicts (4).

### Proof of Theorem 4.1

The proof is based on the decomposition

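$$\mathrm{Var}(\hat\sigma^2_{Dicker}) = E\left\{\mathrm{Var}_x(\hat\sigma^2_{Dicker})\right\} + \mathrm{Var}\left\{E_x(\hat\sigma^2_{Dicker})\right\}, \tag{6}$$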

which connects the unconditional variance of and the conditional expectation and variance. Also, by (3), there exists a constant such that for every

$$\mathrm{Var}(\hat\sigma^2_{Dicker}) \le C_1 V/n, \tag{7}$$

where ; notice that under our asymptotic regime, is a constant (does not depend on ).

Now, define and . We have that

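$$\varepsilon_n^{(1)} = \sqrt{n}\,E\left\{\mathrm{Var}_x(\hat\sigma^2_{Dicker})\right\} \le \sqrt{n}\,\mathrm{Var}(\hat\sigma^2_{Dicker}) \le C_1 V/\sqrt{n},$$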

where the first inequality follows from (6) and the second follows from (7). Similarly, .

By Markov’s inequality,

and also . Therefore, for

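$$\tilde{A}_n = \left\{X : \mathrm{Var}_x(\hat\sigma^2_{Dicker}) \le \varepsilon_n^{(1)} \text{ and } \left\{E_x(\hat\sigma^2_{Dicker})-\sigma^2\right\}^2 \le \varepsilon_n^{(2)}\right\};$$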

we have that . Now, for , the conditional Markov’s inequality implies that

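$$P_x\left(|\hat\sigma^2_{Dicker}-\sigma^2| \ge \xi\right) \le \frac{E_x\left\{(\hat\sigma^2_{Dicker}-\sigma^2)^2\right\}}{\xi^2} = \frac{\mathrm{Var}_x(\hat\sigma^2_{Dicker})+\left\{E_x(\hat\sigma^2_{Dicker})-\sigma^2\right\}^2}{\xi^2} \le \frac{\varepsilon_n^{(1)}+\varepsilon_n^{(2)}}{\xi^2} \le \frac{2C_1V}{\xi^2\sqrt{n}}. \tag{8}$$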

I now bound . Recall the notation . I first show that . Indeed, and

$$\mathrm{Var}\left(\frac{1}{n}\|\mu\|^2\right) = \frac{E\left([x^T\beta]^4\right)-\|\beta\|^4}{n} = \frac{2\|\beta\|^4}{n},$$

since

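$$E\left([x^T\beta]^4\right) = \sum_j E(x_j^4)\beta_j^4 + 3\sum_{j\ne j'}\beta_j^2\beta_{j'}^2 = \sum_j\left\{E(x_j^4)-3\right\}\beta_j^4 + 3\|\beta\|^4 = 3\|\beta\|^4,$$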

where in the second equality I used the identity and the last equality follows from normality. Hence, as . Therefore, also as .

We have that

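$$\begin{aligned} V &= 2\left\{c\left(\sigma^4+\|\beta\|^4+2\sigma^2\|\beta\|^2\right)+\sigma^4+\|\beta\|^4\right\} \\ &= 2\left[c\left\{\|\beta\|^4-\left(\tfrac{1}{n}\|\mu\|^2\right)^2+2\sigma^2\left(\|\beta\|^2-\tfrac{1}{n}\|\mu\|^2\right)\right\}+\|\beta\|^4-\left(\tfrac{1}{n}\|\mu\|^2\right)^2\right] \\ &\quad + 2(c+1)\left\{\left(\tfrac{1}{n}\|\mu\|^2\right)^2+\sigma^4\right\}+\frac{4\sigma^2}{n}\|\mu\|^2+2\sigma^4. \end{aligned}$$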

Since and as , then

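$$\bar{A}_n = \left\{X \;\middle|\; V \le 1+2(c+1)\left\{\left(\tfrac{1}{n}\|\mu\|^2\right)^2+\sigma^4\right\}+\frac{4\sigma^2}{n}\|\mu\|^2+2\sigma^4\right\}, \tag{9}$$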

satisfies as .

Let , then as . It follows from (8) and (9), that for

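$$P_x\left(|\hat\sigma^2_{Dicker}-\sigma^2| \ge \xi\right) \le \frac{2C_1V}{\xi^2\sqrt{n}} \le 2C_1\,\frac{1+2(c+1)\left\{\left(\tfrac{1}{n}\|\mu\|^2\right)^2+\sigma^4\right\}+\frac{4\sigma^2}{n}\|\mu\|^2+2\sigma^4}{\xi^2\sqrt{n}}. \qquad \blacksquare$$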

## References

• Azriel et al. (2016) Azriel, D., Brown, L. D., Sklar, M., Berk, R., Buja, A., and Zhao, L. (2016). Semi-supervised linear regression. ArXiv e-prints.
• Brown (1990) Brown, L. D. (1990). An ancillarity paradox which appears in multiple linear regression. Ann. Statist., 18(2):471–493.
• Buja et al. (2014) Buja, A., Berk, R., Brown, L., George, E., Pitkin, E., Traskin, M., Zhan, K., and Zhao, L. (2014). Models as approximations, part I: a conspiracy of nonlinearity and random regressors in linear regression. ArXiv e-prints.
• Chakrabortty and Cai (2018) Chakrabortty, A. and Cai, T. (2018). Efficient and adaptive linear regression in semi-supervised settings. The Annals of Statistics.
• Cox and Hinkley (1974) Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics. London: Chapman & Hall.
• Dicker (2014) Dicker, L. H. (2014). Variance estimation in high-dimensional linear models. Biometrika, 101(2):269–284.
• Little (1992) Little, R. J. (1992). Regression with missing x’s: a review. Journal of the American Statistical Association, 87(420):1227–1237.
• Lucas et al. (2017) Lucas, J., Foygel, B. R., and Emmanuel, C. (2017). Eigenprism: inference for high dimensional signal-to-noise ratios. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):1037–1065.
• Robins and Wasserman (2000) Robins, J. and Wasserman, L. (2000). Conditioning, likelihood, and coherence: A review of some foundational concepts. Journal of the American Statistical Association, 95(452):1340–1346.
• Robins and Ritov (1997) Robins, J. M. and Ritov, Y. (1997). Toward a curse of dimensionality appropriate (coda) asymptotic theory for semi-parametric models. Statistics in Medicine, 16(3):285–319.
• Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.
• Zhou and Belkin (2014) Zhou, X. and Belkin, M. (2014). Semi-supervised learning. In Academic Press Library in Signal Processing, volume 1, pages 1239–1269. Elsevier.