Scalable Bayesian regression in high dimensions
with multiple data sources
Abstract
Current applications of high-dimensional regression in biomedicine often involve multiple sources or types of covariates. We propose methodology for this setting, focusing on the "wide data" regime with large total dimensionality p and modest sample size n. As a starting point, we formulate a flexible ridge-type prior with shrinkage levels that are specific to each data type or source. These multiple shrinkage levels are set automatically using empirical Bayes. Importantly, all estimation, including the setting of shrinkage levels, can be formulated mainly in terms of inner product matrices of size n x n, rendering computation fast and scalable in the wide data setting, even for millions of predictors, and the resulting procedures are free of user-set tuning parameters. We consider sparse extensions via constrained minimization of a certain Kullback-Leibler divergence. This includes a relaxed variant that scales to large p, allows adaptive and source-specific shrinkage and has a closed-form solution. We compare these approaches to standard high-dimensional methods in a simulation study based on biological data.
Keywords: high-dimensional regression, multiple data types, prediction, shrinkage priors, ridge regularization
1 Introduction
Advances in data acquisition and storage have meant that current and emerging studies routinely include multiple sources of covariates, such as different data types, with one or more of the sources being high-dimensional. To fix ideas, consider a biomedical setting in which samples indexed by i = 1, …, n each have a response y_i and covariates of S types (representing, say, genetic data, imaging, clinical covariates and so on) with respective dimensionalities p_1, …, p_S. We refer to the different types of covariate as sources. The p_s's are the source-specific dimensionalities and p = p_1 + … + p_S is the total dimensionality. We consider a specific example of this kind below, in the context of Alzheimer's disease.
Constructing regression models using such data is challenging, because the relevance of the sources may be quite unequal (and unknown in advance) and the total dimension p may be large. This motivates a need for methodology that can cope with multiple sources and that scales to high dimensions.
Methods for high-dimensional regression are now well-established and include penalized least-squares approaches such as the lasso and extensions (tibshirani; tibshirani2005; Yuan_Lin; zou_hastie_2005), SCAD (fan_li_2001) and Bayesian analogues (see e.g. Kyung for a review). A range of Bayesian approaches have been proposed, notably those based on shrinkage priors, often coupled with variable selection; see for instance Yuan_Lin_2005; park_casella; hans_2010; griffin_brown_2010; carvalho_etal_2010 and armagan_etal_2012, among others. However, in the very large p case many available methods become computationally cumbersome or intractable, and effective source-specific shrinkage remains hard to achieve.
In this paper we put forward an approach to regression in the multiplesource, highdimensional setting. Specifically:

- We consider a generalized ridge-type prior with shrinkage that adjusts to individual data sources, with the corresponding shrinkage levels estimated from the data.
- We show that estimation can be formulated in a way that renders computation efficient for "wide" data, even for large p and over multiple sources, including automatic setting of all tuning parameters.
- We introduce sparsifications that achieve competitive prediction performance and that provide a fast yet multivariate technique for discarding non-influential predictors.
Thus, we consider the case of data from multiple sources with source-specific dimensionalities that could differ by many orders of magnitude (e.g. between clinical and genetic data), with total dimension p large and a priori unknown source-specific importance.
There has been much interesting work on group selection approaches in regression (reviewed in huang2012). The group lasso (Yuan_Lin) allows specification of covariate subsets that can then be selected as groups; however, applying the group lasso in the current setting (by identifying groups with sources) would not be useful, because sources would then simply be either included or excluded (without within-source regularization). The sparse group lasso (simon2013) permits additional regularization via within-group sparsity, but its use here would require a non-trivial extension to source-specific penalties whose tuning would be difficult if not intractable in the very high-dimensional, multiple-source setting. Dondelinger_Mukherjee consider the case of penalized regression over multiple subgroups of samples; this is quite different from the present setting of multiple sources of covariates (i.e. we focus on the columns, not the rows); also, the authors do not tackle the very high-dimensional case.
Ridge-type estimators are among the oldest and best studied regularized regression tools, whether from a penalized likelihood or Bayesian viewpoint. Our results build on these classical tools to deal with multiple-source, high-dimensional problems. Our main focus is on prediction and not variable selection per se. The sparse solutions proposed should be regarded mainly as fast and scalable multivariate filtering procedures which can potentially reduce dimensionality enough to open the way for established variable selection methods.
The remainder of this paper is organized as follows. In Section 2 we introduce the scalable Bayesian regression (SBR) approach, describing model formulation, prior specification and tuning of shrinkage levels. Section 3 deals with the sparse extension of the methodology, sparse SBR (SSBR), including a general solution and a relaxed variant. The latter is tractable for large p and permits an adaptive approach to regulate the induced sparsity. Results and comparisons with standard penalized likelihood approaches from a simulation study are presented in Section 4. The paper concludes with a discussion in Section 5.
2 Scalable Bayesian regression
2.1 Model
Let y = (y_1, …, y_n)^T be a vector of responses and X_1, …, X_S a set of covariate or feature matrices from S data sources. Each X_s is of size n x p_s, so that the total number of potential predictors is p = p_1 + … + p_S. We consider the normal linear model

y | β_1, …, β_S, σ² ~ N_n(X_1 β_1 + … + X_S β_S, σ² I_n),
where each β_s is a p_s x 1 vector of regression coefficients, N_d(m, V) denotes a d-dimensional multivariate normal density with mean m and covariance V, and I_n is the n x n identity matrix. Without loss of generality we assume throughout that all data are standardized. Let X = [X_1 … X_S] and β = (β_1^T, …, β_S^T)^T denote the respective global predictor matrix and vector of regression coefficients (here, "global" means with all sources taken together). Then, with prior p(β, σ²) = p(β | σ²) p(σ²), the full model under consideration is

y | β, σ² ~ N_n(X β, σ² I_n), with β | σ² ~ p(β | σ²) and p(σ²) ∝ σ⁻².
The improper prior p(σ²) ∝ σ⁻² (Jeffreys' prior) is a common option for linear regression models. The crucial aspect of prior formulation for the multiple-source, high-dimensional setting under consideration is the construction of p(β | σ²). This we discuss in detail next.
2.2 The prior on
The SBR approach is based on a natural generalization of the standard ridge prior. Specifically, the prior on β is
(1)  β | σ² ~ N_p(0, σ² Λ⁻¹),
where Λ = diag(Λ_1, …, Λ_S) and Λ_s = λ_s I_{p_s} with λ_s > 0, for s = 1, …, S. Here each λ_s is a source-specific shrinkage level on the corresponding β_s. The special case λ_1 = … = λ_S = λ recovers the standard ridge prior with just one shrinkage level (and indeed the solutions presented below could be used to give a scalable implementation of classical ridge with a single λ). However, when dealing with multiple data sources one would expect different contributions from the various sources. This motivates a need for source-specific penalties that can adjust to account for these differences and that additionally provide potentially useful information about the relevance of specific data sources.
At this point it is useful to define the quantity

(2)  K = Σ_{s=1}^S λ_s⁻¹ X_s X_s^T.
All formulas presented in the remainder of this Section are cast in terms of K. Importantly, this means that the key computations under SBR can be formulated so as to require only a one-off computation of the individual inner product (Gram) matrices X_s X_s^T, each of size n x n (these calculations can easily be implemented in parallel), followed mainly by operations on those matrices. As we show below, for wide data with large p, this gives a practical way to implement SBR.
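To make this concrete, the following sketch (an illustration with arbitrary dimensions and shrinkage values, not the authors' implementation) forms the per-source Gram matrices once and combines them into K:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical sources with n = 30 samples and p_s chosen so that p >> n.
n, p_sizes = 30, [5, 200, 1000]
Xs = [rng.standard_normal((n, ps)) for ps in p_sizes]
lam = [1.0, 5.0, 50.0]                    # source-specific shrinkage levels lambda_s

# One-off computation of the n x n Gram matrices X_s X_s^T; in practice these
# could be formed in parallel, and they are all that later SBR steps need.
grams = [X @ X.T for X in Xs]

# K = sum_s lambda_s^{-1} X_s X_s^T, the n x n matrix driving all SBR formulas.
K = sum(G / l for G, l in zip(grams, lam))
```

Note that no p x p object is ever formed; memory and compute scale with n² per source.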
2.3 Inference
Under the conjugate prior in (1) the posterior distribution of β is given by
(3)  β | σ², y ~ N_p(β̂, σ² Σ),
where β̂ = (X^T X + Λ)⁻¹ X^T y and Σ = (X^T X + Λ)⁻¹. Calculating the posterior mode directly thus involves a p x p matrix inversion. For p > n we instead use
(4)  β̂ = Λ⁻¹ X^T w,  w = (I_n + K)⁻¹ y,
where w is an n x 1 vector whose calculation involves only an n x n matrix inversion. The derivation of (4) is provided in Appendix A. For very large p the computation of the posterior mode can be done in parallel; additionally, we draw attention to the useful source-wise expression
(5)  β̂_s = λ_s⁻¹ X_s^T (I_n + K)⁻¹ y,
for s = 1, …, S. Having obtained the posterior mode, prediction from an available X* of dimensionality n* x p is straightforward via ŷ* = X* β̂. In addition, when interest lies solely in prediction the corresponding calculation can be simplified to
(6)  ŷ* = K* (I_n + K)⁻¹ y,  where K* = Σ_{s=1}^S λ_s⁻¹ X*_s X_s^T.
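The equivalence between the n-space route in (4) and the direct p x p computation (and the prediction shortcut in (6)) is easy to check numerically; the sketch below uses small arbitrary dimensions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p_sizes = 25, [4, 150, 600]
lam = [0.5, 2.0, 20.0]                               # source-specific lambda_s
Xs = [rng.standard_normal((n, ps)) for ps in p_sizes]
y = rng.standard_normal(n)

X = np.hstack(Xs)                                    # global n x p matrix
lam_inv = np.concatenate([np.full(ps, 1.0 / l) for ps, l in zip(p_sizes, lam)])

# n-space route, per (4): beta_hat = Lambda^{-1} X^T (I_n + K)^{-1} y
K = sum((Xi @ Xi.T) / l for Xi, l in zip(Xs, lam))
w = np.linalg.solve(np.eye(n) + K, y)                # only an n x n solve
beta_hat = lam_inv * (X.T @ w)

# direct p-space route (infeasible for very large p): (X^T X + Lambda)^{-1} X^T y
beta_direct = np.linalg.solve(X.T @ X + np.diag(1.0 / lam_inv), X.T @ y)

# prediction shortcut, per (6): y_star = K_star (I_n + K)^{-1} y
Xs_star = [rng.standard_normal((10, ps)) for ps in p_sizes]
K_star = sum((Xst @ Xi.T) / l for Xst, Xi, l in zip(Xs_star, Xs, lam))
pred = K_star @ w
```

Both routes agree to machine precision, while the first never touches a p x p matrix.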
Calculating the posterior covariance matrix can also be simplified, through the formula

(7)  Σ = Λ⁻¹ − Λ⁻¹ X^T (I_n + K)⁻¹ X Λ⁻¹.
For details see Appendix A. In practice we are not interested in evaluating the entire covariance matrix (for very large p this might in fact be difficult due to memory limitations). However, the methodology considered in Section 3.2 requires the diagonal elements of Σ, and in this case the formula in (7) facilitates computation as it allows for fast and parallel block matrix computations. Again, note that Λ⁻¹ X^T stacks the source-wise blocks λ_s⁻¹ X_s^T, so the magnitude of each p_s, for s = 1, …, S, can guide us in determining whether to use block computations or not. Moreover, the posterior distribution of σ² is given by
(8)  σ² | y ~ IG(n/2, RSS/2),

i.e. an inverse-gamma distribution with shape n/2 and scale RSS/2, where

(9)  RSS = y^T (I_n + K)⁻¹ y

is the residual sum of squares.
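Since (I_n + K)⁻¹ = I_n − K(I_n + K)⁻¹, the quadratic form in (9) can equivalently be computed from the fitted values; a small numeric check (illustrative values, single source for simplicity):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, lam = 20, 80, 2.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

K = (X @ X.T) / lam
Binv = np.linalg.inv(np.eye(n) + K)
rss = y @ Binv @ y                       # RSS as in (9)

# equivalent form: RSS = y^T (y - X beta_hat), with beta_hat from (4)
beta_hat = (X.T @ (Binv @ y)) / lam
rss_alt = y @ (y - X @ beta_hat)
```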
Below we will make use of the marginal likelihood p(y | λ), where λ = (λ_1, …, λ_S). This is the likelihood obtained by integrating over the parameter space with respect to the joint prior distribution (here, over β and σ²). Using the prior specification and results above we have

(10)  p(y | λ) ∝ |I_n + K|^{−1/2} RSS^{−n/2}.
2.4 Automatic setting of shrinkage levels
Specification of penalty parameters is often handled through cross-validation (CV) or generalized CV in a frequentist framework (tibshirani), while Bayesian methods typically rely on empirical Bayes (EB) point estimates or data-dependent hyperpriors; see e.g. Yuan_Lin_2005; park_casella; bala_madigan_2010; hans_2010 and griffin_brown_2010. An alternative approach is considered by lykou_ntzoufras_2013, who tune the Bayesian lasso penalty based on Pearson correlations at the limit of significance determined by Bayes factors. Furthermore, fully Bayesian shrinkage methods include the horseshoe prior (carvalho_etal_2010) and the generalized double Pareto (armagan_etal_2012).
Here, the tuning parameter λ = (λ_1, …, λ_S) is vector-valued and for the applications we consider we would like fast and efficient approaches by which to set it. We propose three EB point estimators, obtained by (i) minimizing the leave-one-out CV error, (ii) maximizing the marginal likelihood and (iii) locating the posterior mode under a data-dependent prior. All three require no user input and are computationally fast. We discuss each in turn.
Leave-one-out cross-validation (CV) estimator: As in the case of ordinary least squares and of ridge regression, the leave-one-out CV error in our case can be computed without refitting, as

(11)  CV(λ) = Σ_{i=1}^n [(y_i − ŷ_i) / (1 − h_ii)]²,

where ŷ = H y, with H = K (I_n + K)⁻¹ the hat matrix and h_ii its ith diagonal element; the estimator λ̂_CV minimizes CV(λ).
A proof is provided in Appendix B.
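The shortcut in (11) avoids n refits entirely; the sketch below (illustrative, single source and arbitrary values) verifies it against brute-force leave-one-out:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 20, 100, 3.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# shortcut: hat matrix H = K (I_n + K)^{-1}, then the identity in (11)
K = (X @ X.T) / lam
H = K @ np.linalg.inv(np.eye(n) + K)
resid = y - H @ y
loo_short = np.sum((resid / (1.0 - np.diag(H))) ** 2)

# brute-force leave-one-out: refit the ridge posterior mode n times
loo_brute = 0.0
for i in range(n):
    idx = np.arange(n) != i
    Xi, yi = X[idx], y[idx]
    beta_i = np.linalg.solve(Xi.T @ Xi + lam * np.eye(p), Xi.T @ yi)
    loo_brute += (y[i] - X[i] @ beta_i) ** 2
```

The two quantities agree exactly (up to floating point), which is what makes CV-based tuning cheap here.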
Marginal likelihood (ML) estimator: From Eq. (10), the quantity that maximizes the ML is given by

(12)  λ̂_ML = argmax_λ |I_n + K|^{−1/2} RSS^{−n/2}.
Posterior mode (PM) estimator: We consider a product-exponential data-dependent prior for λ with prior mode at zero, prior mean equal to λ̂_CV as obtained from (11) and prior variance λ̂_CV², i.e. λ_s ~ Exp(1/λ̂_CV,s) independently for s = 1, …, S. The rationale is that a smaller individual estimated penalty λ̂_CV,s corresponds to a stronger belief that the corresponding matrix X_s contains useful signal, and therefore to a smaller prior variance (especially when λ̂_CV,s < 1). On the other hand, as λ̂_CV,s increases we let the quadratic prior variance account for the chance that there is actually some useful signal in X_s which passes undetected by the leave-one-out CV approach. The resulting posterior mode estimate is

(13)  λ̂_PM = argmax_λ p(y | λ) Π_{s=1}^S exp(−λ_s / λ̂_CV,s).
Note that the number of available data sources S will typically not be large. Therefore, the aforementioned solutions are easy to find through standard optimization routines.
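Because S is small, even a crude search over λ is viable; the sketch below (illustrative data; a proper optimizer could replace the grid) maximizes the marginal likelihood (10) on the log scale:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
n = 40
Xs = [rng.standard_normal((n, ps)) for ps in (10, 300)]   # two hypothetical sources
beta1 = rng.standard_normal(10)                           # only source 1 carries signal
y = Xs[0] @ beta1 + 0.5 * rng.standard_normal(n)

def neg_log_ml(lams):
    # -2 log p(y | lambda) up to a constant, per (10):
    #   log|I_n + K| + n log RSS, with RSS = y^T (I_n + K)^{-1} y
    K = sum((X @ X.T) / l for X, l in zip(Xs, lams))
    B = np.eye(n) + K
    _, logdet = np.linalg.slogdet(B)
    rss = y @ np.linalg.solve(B, y)
    return logdet + n * np.log(rss)

# crude grid search over (lambda_1, lambda_2); feasible since S is small
grid = [10.0 ** k for k in range(-3, 4)]
lam_ml = min(product(grid, grid), key=neg_log_ml)
```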
3 Sparse SBR
The SBR posterior mode in (4) is non-sparse ("dense") in the sense that no regression coefficients are set to exactly zero. In this Section we propose methodology for "sparsifying" SBR. The idea is to find a sparse approximation to the full (dense) Bayesian solution that is closest to it in a Kullback-Leibler (KL) sense. To do so, we minimize the KL divergence with respect to the posterior distribution of the regression vector, subject to a lasso-type constraint to ensure sparsity. We first present a general solution that is suitable for small to moderate p, and then go on to consider a relaxed solution that is applicable to the large p case. The solutions presented below bear a resemblance to methods proposed by Antoniadis_Fan2001, in the context of wavelet regression, and by Aseervatham2011 for the special case of ridge logistic regression for text categorization. However, those are rooted in different arguments and are not equivalent to the KL-based approach below.
3.1 Sparsification using the KL divergence
Let N_p(β | β̂, σ²Σ) denote the true posterior over β, conditional on σ², with mode β̂ and covariance σ²Σ as in Eqs. (4) and (7), respectively, and let N_p(β | β̃, σ²Σ) denote an approximate conditional posterior, where β̃ is the approximate mode (this will provide a sparsification of β̂). The idea is to minimize the KL divergence between these two distributions under an l1 penalty on the vector β̃ to induce sparsity. It is easy to show that this KL divergence is
(14)  KL(β̃) = (1/(2σ²)) (β̃ − β̂)^T Σ⁻¹ (β̃ − β̂).
Note that the divergence in (14) is a true distance metric (satisfying non-negativity, symmetry and the triangle inequality). Note also that the presence of the nuisance parameter σ² cannot be ignored when the minimization also involves a penalty on β̃. In principle, one could work with the marginal posterior distribution of β (a multivariate t distribution) in order to avoid consideration of σ². However, in this case working with the KL divergence is not straightforward. Another option would be to use a plug-in posterior point estimate of σ² in (14), such as the mode or mean of p(σ² | y). Instead, here we pursue a tuning-free approach in which σ² is integrated out; specifically, we work with
(15)  E_{σ²|y}[KL(β̃)] = (c/2) (β̃ − β̂)^T Σ⁻¹ (β̃ − β̂),  with c = E_{σ²|y}[1/σ²],

where the posterior of σ² is given in (8). The constant c depends on sample size and the residual sum of squares, i.e. c = n/RSS, with RSS as given in Eq. (9). Note that using the posterior mean or mode of σ² instead results in c = (n − 2)/RSS and c = (n + 2)/RSS, respectively. Using (15), the general solution is
(16)  β̃ = argmin_β { (c/2)(β − β̂)^T Σ⁻¹ (β − β̂) + μ ||β||_1 },

where μ ≥ 0 controls the sparsity of β̃. Clearly, the SSBR solution implies a lasso-type model with the particularity of a saturated design where the analogue to sample size equals p. This means that SSBR can include at most p predictors (unlike classical lasso, which for p > n can include at most n predictors).
The solution in (16) involves computations with the inverse covariance matrix Σ⁻¹ and is, therefore, best suited to settings where p is not large. Large p cases can be treated through the approach presented next, which leads to a closed-form expression for β̃.
3.2 A relaxed solution applicable to the large p case
Instead of the KL divergence used above, consider the KL divergence between N_p(β | β̃, σ²D) and the product Π_{j=1}^p N(β_j | β̂_j, σ²Σ_jj), with D = diag(Σ_11, …, Σ_pp). This amounts to setting as target distribution the product of the marginal posterior densities. The use of independent posterior factorizations is common in various settings; for instance, in marginal likelihood estimation (e.g. Botev_2013; perrakis_marglik), in expectation-propagation algorithms (e.g. Minka_2001) and in variational Bayes (Bishop). Working with the diagonal matrix D leads to the following minimization
(17)  β̃ = argmin_β Σ_{j=1}^p { (c/2)(β_j − β̂_j)²/Σ_jj + μ_j |β_j| },

where Σ_jj is the jth element, for j = 1, …, p, of the main diagonal of Σ. Note that the main diagonal elements are feasible to calculate even for very large p; this can be achieved using Eq. (7) with parallel block computations. Moreover, the minimization in (17) has a closed-form solution, namely

(18)  β̃_j = sign(β̂_j) max(0, |β̂_j| − μ_j Σ_jj / c),  for j = 1, …, p.
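The closed form in (18) is coordinate-wise soft-thresholding, which can be checked against a direct minimization of (17); all values below are made-up stand-ins for the SBR quantities:

```python
import numpy as np

rng = np.random.default_rng(4)
p = 50
beta_hat = rng.standard_normal(p)        # stand-in for the SBR posterior mode
Sigma_diag = rng.uniform(0.1, 1.0, p)    # stand-in for the diagonal of Sigma in (7)
c = 2.5                                  # c = n / RSS in the text
mu = np.full(p, 1.2)                     # l1 penalties mu_j

# (18): soft-thresholding with coordinate-wise threshold mu_j * Sigma_jj / c
thresh = mu * Sigma_diag / c
beta_tilde = np.sign(beta_hat) * np.maximum(np.abs(beta_hat) - thresh, 0.0)

# sanity check for one coordinate: minimise the objective in (17) over a fine grid
j = 0
grid = np.linspace(-5.0, 5.0, 200001)
obj = 0.5 * c * (grid - beta_hat[j]) ** 2 / Sigma_diag[j] + mu[j] * np.abs(grid)
b_num = grid[np.argmin(obj)]
```

Since the objective separates over coordinates, the whole solution costs O(p) once the Σ_jj's are available.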
Note that for fixed μ_j and Σ_jj we obtain β̃_j → β̂_j as n → ∞, which makes sense from an asymptotic perspective. However, when μ_j is a constant not depending on n either directly or indirectly (e.g. through β̂), a "non-sparsifying" effect may be triggered even for moderate sample size, which is in contrast to our initial intent. Setting of μ_j is discussed next.
3.3 Tuning of μ
We will mainly focus on the specification of the μ_j's in the relaxed approach leading to the solution in (18), which is of main interest as it is directly applicable to the large p case. (For the general solution in (16) we note that one could potentially borrow ideas from Antoniadis_Fan2001 using a universal-type threshold; additionally, one could incorporate information from the SBR part through source-specific penalties based on the λ̂_s estimates from the initial SBR phase.) For the relaxed solution we propose parameter-specific and source-specific adaptive penalties μ_j(s) for each β_j(s), where j = 1, …, p_s and s = 1, …, S. Specifically, we consider penalties of the form
(19)  μ_j(s) = |β̂_j(s)|^{−γ_s},  with γ_s ∈ (0, 1].

The rationale in (19) is that the larger the magnitude of β̂_j(s), the smaller the corresponding penalty. In addition, restricting γ_s to (0, 1] guarantees reasonable shrinkage when |β̂_j(s)| > 1 and avoids extreme shrinkage when |β̂_j(s)| < 1 (which is the common case). Here the γ_s's act as further tuning parameters, with values close to one or zero encouraging sparse or dense solutions, respectively. At this point we borrow information from the available source-specific λ̂_s's and treat the γ_s's as power-weights, namely setting them equal to

(20)  γ_s = λ̂_s / (λ̂_1 + … + λ̂_S).
With this approach we take advantage of all available information from the previous SBR step, i.e. parameter-specific shrinkage through β̂_j(s) and source-specific shrinkage through λ̂_s. Arguably, this strategy may result in undesirable non-sparse solutions, but that will be in the rare, and rather unrealistic, case where S is large and all sources are equally important, in the sense that the λ̂_s's will be more or less the same; a setting where in fact a single SBR approach is more suitable.
As a final comment, we remark that despite the fact that this penalization approach depends indirectly on sample size, through the regression coefficients in (19) and the shrinkage parameters in (20), it may still be sensitive to the "non-sparsifying" effect discussed at the end of Section 3.2. Controlling this effect requires scaling the penalty in (19) by a sample-size-dependent factor; however, automatic tuning of this factor is not straightforward. Empirical results (see Section 4) suggest that a suitable choice can lead to a reasonable balance between sparsity and predictive performance. We will call this extension "controlled" SSBR or cSSBR (as it "controls" for sample size). Table 1 summarizes the different variants of SBR that we consider.
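A minimal sketch of the adaptive penalty construction, under the forms in (19) and (20) (all numeric values below are made up for illustration):

```python
import numpy as np

lam_hat = np.array([0.2, 5.0, 80.0])      # hypothetical estimated shrinkage per source
gamma = lam_hat / lam_hat.sum()           # power-weights as in (20), in (0, 1)

# heavily penalised (less relevant) sources get gamma nearer 1, hence sparser fits;
# important sources (small lambda_hat) get gamma near 0, hence denser fits
beta_hat_s = [np.array([1.5, -0.02]),
              np.array([0.3, -0.9]),
              np.array([0.01, 0.4])]      # stand-ins for source-wise SBR modes
mu = [np.abs(b) ** (-g) for b, g in zip(beta_hat_s, gamma)]   # penalties as in (19)
```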
Solution   Sparsity control   Shrinkage estimator
                              Cross-validation   Marginal likelihood   Posterior mode
Dense      –                  SBR                SBR                   SBR
Sparse     No                 SSBR               SSBR                  SSBR
Sparse     Yes                cSSBR              cSSBR                 cSSBR
4 Simulation study
In this Section we present a simulation study aimed at mimicking a typical modern biomedical application involving multiple data types. Reflecting the relative ease with which multiple data modalities can now be acquired, such designs are becoming common, with examples including the Cancer Genome Atlas (https://cancergenome.nih.gov), the Alzheimer's Disease Neuroimaging Initiative (http://adni.loni.usc.edu) and the Rhineland Study (http://www.rheinlandstudie.de), among many others.
4.1 Setup
The problem. We consider a regression problem with covariates from three sources, namely clinical (CL), gene-expression (RNA) and genetic (single nucleotide polymorphism, or SNP) data, with respective (simulated) feature matrices X_CL, X_RNA and X_SNP. The number of covariates in each data source is held fixed, with the SNP source by far the largest. Although the methods we propose can cope with larger p, we restrict total p in this Section to facilitate empirical comparison with standard methods.
Covariates. The covariate matrices for the clinical and gene-expression variables are generated as X_CL ~ N(0, Ω_CL) and X_RNA ~ N(0, Ω_RNA), respectively. Here Ω_CL and Ω_RNA are covariance matrices estimated from (real) phenotype and gene-expression data from the Drosophila Genetic Reference Panel (DGRP) (drosophila) (data available online at http://dgrp2.gnets.ncsu.edu/data.html). To simulate the genetic data we use a block-diagonal covariance structure (HapMap). We specify Ω_SNP = diag(Ω_1, …, Ω_B), where each block Ω_b is of size (p_SNP/B) x (p_SNP/B) and generated from an inverse-Wishart distribution with identity scale matrix, for b = 1, …, B. As X_SNP dominates in terms of dimensionality, the specification of Ω_SNP essentially controls the overall correlation level. We consider two simulation scenarios: (i) B = 1000, corresponding to 1000 blocks of size 100 ("low-correlation scenario"), and (ii) B = 100, corresponding to 100 blocks of size 1000 ("high-correlation scenario"); in both cases p_SNP = 100,000. We first generate continuous data and then discretize in correspondence to the common SNP encoding 0/1/2 (homozygous major allele/heterozygous/homozygous minor allele). The discretization is tuned in order to give a reasonable empirical distribution of SNPs.
Regression coefficients and sparsity. For the (true) regression vectors β_CL, β_RNA and β_SNP we consider source-specific levels of sparsity (the fraction of non-zero β's). Varying the sparsity gives three scenarios for overall sparsity: (i) a sparse scenario, (ii) a medium scenario and (iii) a dense scenario. Let q_CL, q_RNA and q_SNP denote the respective numbers of non-zero coefficients in the subvectors of β_CL, β_RNA and β_SNP. The non-zero betas are generated from the generalized normal distribution (GND). Following the parameterization in Mineo_2003, the probability density function of a GND with location m, scale σ and shape s is

f(x) = [2 σ s^{1/s} Γ(1 + 1/s)]⁻¹ exp(−|x − m|^s / (s σ^s)).

The GND includes as special cases the normal (s = 2) and the double exponential (s = 1) distributions. To avoid these particular cases (which could potentially bias the simulation towards ridge or lasso, respectively), we set the shape strictly between these two values and generate the non-zero effects from zero-location GNDs, separately for each of the three sources. The signal strength is controlled via the scale parameter (this is downscaled by a factor of 1.5 for the SNP coefficients to control the total amount of signal in the SNPs). To complete the specification of the simulation we set this scale parameter by considering the finite-sample risk in a simplified CL-only oracle-like setup. Specifically, we consider the correlation induced between predictions (under the OLS estimate using the low-dimensional CL data only) and out-of-sample test data (with the data-generating mechanism being a linear model with error variance equal to unity). We set the scale so as to obtain an average out-of-sample correlation of 0.6 when n = 100 and the test size is 5000.
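A numpy-only GND sampler can be sketched under Mineo's parameterization (the shape value 1.5 below is illustrative of a value strictly between the Laplace and normal cases, not the paper's exact setting):

```python
import numpy as np

rng = np.random.default_rng(7)

def rgnd(size, m=0.0, sigma=1.0, shape=1.5):
    """Sample a GND(m, sigma, shape): under the density above,
    |x - m|^shape / (shape * sigma^shape) follows a Gamma(1/shape, 1),
    and the sign is symmetric."""
    g = rng.gamma(1.0 / shape, 1.0, size=size)
    sign = rng.choice([-1.0, 1.0], size=size)
    return m + sign * (shape * sigma ** shape * g) ** (1.0 / shape)

draws = rgnd(200_000, sigma=0.8)
```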
Given the above configurations (low/high correlation and sparse/medium/dense scenarios) we generate data from the model

y = X_CL β_CL + X_RNA β_RNA + X_SNP β_SNP + ε,  ε ~ N_n(0, σ² I_n) with σ² = 1.

The test sample size always equals 5000. Each simulation scenario is repeated 50 times.
Methods under comparison.
We consider the SBR method under the three EB estimates proposed in Section 2.4 and the corresponding SSBR solutions with the penalty terms in (19), the power-weights in (20) and no control for the effect of sample size. Furthermore, we consider cSSBR approaches under three choices of the sample-size control factor; we present results obtained from the choice that led to a good balance between sparsity and predictive performance.
We compare to standard ridge and lasso (with penalty parameters set to minimize the mean squared error from 5-fold CV using the package glmnet in R).
4.2 Results
Boxplots of out-of-sample correlations between predictions and test data under the low-correlation and high-correlation simulations are presented in Figures 1 and 2, respectively. In general, all proposed SBR methods (CV/ML/PM) demonstrate good and stable predictive performance, without any apparent differences among them. Specifically, we see:

- In the sparse scenario, SBR methods match or outperform lasso.
- In the medium scenario, SBR methods generally outperform ridge and lasso.
- In the dense scenario, SBR methods match or outperform ridge.
Concerning the sparse extensions of the SBR approaches we see the following:

- In the low-correlation simulations, SSBR/cSSBR more or less match the performance of the non-sparse methods.
- In the high-correlation simulations, SSBR performs well overall in the medium and dense scenarios for larger sample sizes. The cSSBR approach leads to more stable predictions in comparison to SSBR.
Figure 3 shows the resulting values of λ̂_s⁻¹ for each source (higher values correspond to lower penalty, i.e. higher estimated importance) in the low-correlation simulations, and provides useful insights concerning the behaviour of the SBR methods. Evidently, all estimates (CV/ML/PM) adjust well through appropriate source-specific penalization. This key attribute allows SBR generally to outperform ridge and lasso in prediction when dealing with multiple data sources. Note also that the PM estimator tends to penalize less. The corresponding plots from the high-correlation scenarios (not shown) are very similar.
As noted above, the sparse SSBR solutions (with or without control) can in certain cases deliver predictive performance on par with the dense SBR solutions, while employing fewer parameters. Table 2 shows the average sparsity (over the 50 repetitions) induced by the SSBR methods under the various simulations. The solutions display an adjustment to the true underlying sparsity. In addition, controlling for the effect of sample size yields much sparser models.
Method    Low-correlation               High-correlation
          sparse   medium   dense      sparse   medium   dense
SSBR      17%      48%      56%        15%      45%      51%
          9%       40%      70%        9%       47%      66%
          9%       53%      69%        4%       56%      77%
cSSBR     6%       20%      24%        4%       19%      23%
          2%       17%      35%        2%       19%      33%
          1%       21%      35%        2%       20%      35%
SSBR      19%      50%      56%        9%       47%      53%
          10%      41%      69%        5%       46%      67%
          10%      56%      73%        4%       56%      77%
cSSBR     7%       22%      24%        3%       19%      23%
          3%       17%      35%        1%       17%      34%
          2%       22%      38%        0%       18%      38%
SSBR      14%      47%      57%        8%       42%      53%
          10%      50%      71%        4%       57%      73%
          14%      69%      77%        13%      70%      79%
cSSBR     4%       18%      25%        2%       16%      22%
          2%       24%      35%        1%       25%      38%
          4%       35%      43%        3%       33%      42%

(Within each block, rows correspond to the three sample sizes considered.)
4.3 Computational performance
We conclude by examining computational burden as a function of the total dimension p and in comparison with the lasso. To do so we add a fourth "data source" which is simply Gaussian noise. The number of these additional covariates is varied so as to increase the total number of predictors p. Sample size is set equal to 100.
Computations were carried out on a compute server with 128 cores (2.28GHz) and 1TB of RAM.
For lasso we treat the binding of the individual matrices into one data matrix (an operation not needed for SBR) as a preprocessing step and do not include it in the reported runtimes for lasso. We use the parallel option in glmnet for estimation of the penalty parameter via 10-fold CV (the default option).
For our methods we consider the SBR approach with the PM estimator, which requires evaluation of both the CV criterion in (11) and the PM estimate in (13) (and is thus in principle the slowest), and also its corresponding sparse extension. We include the formation of transpose matrices and the calculation of Gram matrices in the reported runtimes (although these could be regarded as preprocessing steps). We do not consider ridge because it can be seen as a special case of SBR and is hence equally fast when implemented as described here.
Figure 4 shows the average runtimes (in minutes) over five fits on a single simulated dataset. SBR and sparse SBR are considerably faster than lasso, with the gap increasing with p.
We note that by adding random noise variables as described above we in a way "favour" the lasso implementation in glmnet, as the screening rules that are used by default can relatively easily exclude these covariates.
SBR is the fastest method, with an average runtime of approximately 5 minutes for the largest p considered, net of all steps. Running sparse SBR with suitable parallel block-matrix calculations makes this method also very fast; its average runtime was approximately 10 minutes for the largest p. We note that our current implementations of the methods are certainly not optimal in terms of computational efficiency.
We note also that our methods can fully utilize available cores for parallel computation; hence they should continue to gain in runtime as more cores become available, even with no increase in clock speed.
5 Discussion
The aim of this paper was to introduce a framework for high-dimensional regression using multiple data sources that allows efficient and fast computation in the wide data, very large p setting. We showed that the SBR method is effective in this setting, while the sparse extension SSBR gives parsimonious solutions that can be attractive in many cases.
Concerning the predictive performance of the various SBR variants, our empirical results suggest that the choice of shrinkage estimator (CV/ML/PM) is not very crucial; all three estimators seem able to adjust to individual data sources. As a consequence, all SBR approaches showed effective and stable prediction performance in both sparse and dense conditions. The corresponding SSBR solutions have the potential to achieve similar predictive performance but with explicit sparsity. In addition, using some control for the effect of sample size on sparsity is desirable, as it can lead to enhanced sparsity with no loss in terms of prediction. In this case, cSSBR using the PM shrinkage estimator together with a sample-size control factor seems to be a promising option.
As shown, the proposed methodology is very fast even when the number of predictors is in the millions. We note that SBR/SSBR can take advantage of large numbers of cores (regardless of clock speed) because parallelization is used for matrix multiplication (this is not the case when parallelization is used, e.g., for cross-validation, as in standard implementations of lasso, among others).
Software
R code for SBR/SSBR implementation, together with an illustrative example, and complete documentation are available at https://github.com/mukherjeegroup/SBR.
References
Appendices
A The posterior mode
B The leaveoneout CV estimator
The leave-one-out CV estimate is obtained by minimizing

(B.1)  CV(λ) = Σ_{i=1}^n (y_i − x_i^T β̂_{(−i)})².

Here β̂_{(−i)} is the posterior mode from the regression of y_{(−i)} (y without the ith element) on X_{(−i)} (X without the ith row), and x_i^T denotes the ith row of X.
First, set A = X^T X + Λ and observe that

(B.2)  (A − x_i x_i^T)⁻¹ = A⁻¹ + (A⁻¹ x_i x_i^T A⁻¹) / (1 − x_i^T A⁻¹ x_i),

which follows from the Sherman-Morrison formula (harville_97, p. 424). Note that x_i^T A⁻¹ x_i = h_ii, i.e. the ith element of the main diagonal of the hat matrix H = X A⁻¹ X^T. Since β̂_{(−i)} = (A − x_i x_i^T)⁻¹ (X^T y − x_i y_i), from the result in (B.2) we have

(B.3)  y_i − x_i^T β̂_{(−i)} = (y_i − x_i^T β̂) / (1 − h_ii).
Plugging (B.3) into the quantity we wish to minimize in (B.1) and switching to matrix notation, we have

(B.4)  CV(λ) = || {diag(I_n − H)}⁻¹ (I_n − H) y ||².
The derivation up to (B.4) is as in meijer, who considers the OLS case. Now let us further examine the hat matrix H. From (A.1) we have A⁻¹ = Λ⁻¹ − Λ⁻¹ X^T (I_n + K)⁻¹ X Λ⁻¹. As a result

(B.5)  H = X A⁻¹ X^T = K − K (I_n + K)⁻¹ K = K (I_n + K)⁻¹.
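The identity (B.5) is easy to confirm numerically (an illustrative single-source example with arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, lam = 15, 60, 4.0
X = rng.standard_normal((n, p))

K = (X @ X.T) / lam
H_n = K @ np.linalg.inv(np.eye(n) + K)                      # per (B.5)
H_p = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # definition of H
```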
C The relaxed SSBR solution
Let g(β_j) = (c/2)(β_j − β̂_j)²/Σ_jj + μ_j |β_j|, with Σ_jj > 0, and let t_j = μ_j Σ_jj / c, for j = 1, …, p. It is straightforward to see that

For β_j > 0:  g′(β_j) = c(β_j − β̂_j)/Σ_jj + μ_j,  (C.1)

For β_j < 0:  g′(β_j) = c(β_j − β̂_j)/Σ_jj − μ_j.  (C.2)

From (C.1) and (C.2) we have that any non-zero minimizer satisfies β̃_j = β̂_j − t_j when β̃_j > 0 and β̃_j = β̂_j + t_j when β̃_j < 0. In addition, g is convex and neither case can hold for |β̂_j| ≤ t_j, therefore

(C.3)  β̃_j = 0 whenever |β̂_j| ≤ t_j.

Thus, when β̂_j > t_j, from (C.1) and (C.3) we have that β̃_j = β̂_j − t_j, and when β̂_j < −t_j, from (C.2) and (C.3) we have that β̃_j = β̂_j + t_j. This concludes the proof: combining the cases gives β̃_j = sign(β̂_j) max(0, |β̂_j| − μ_j Σ_jj/c), i.e. the solution in (18).