Multiple Output Regression with Latent Noise
Abstract
In high-dimensional data, structured noise, caused by observed and unobserved factors affecting multiple target variables simultaneously, imposes a serious challenge for modeling by masking the often weak signal. Therefore, (1) explaining away the structured noise in multiple-output regression is of paramount importance. Additionally, (2) assumptions about the correlation structure of the regression weights are needed. We note that both can be formulated in a natural way in a latent variable model, in which both the interesting signal and the noise are mediated through the same latent factors. Under this assumption, the signal model then borrows strength from the noise model by encouraging similar effects on correlated targets. We introduce a hyperparameter for the latent signal-to-noise ratio, which turns out to be important for modelling weak signals, and an ordered infinite-dimensional shrinkage prior that resolves the rotational unidentifiability in reduced-rank regression models. Simulations and prediction experiments with metabolite, gene expression, fMRI measurement, and macroeconomic time series data show that our model equals or exceeds the state-of-the-art performance and, in particular, outperforms the standard approach of assuming independent noise and signal models.
Jussi Gillberg, Pekka Marttinen and Samuel Kaski
Keywords: Bayesian reduced-rank regression, latent variable models, latent signal-to-noise ratio, multiple-output regression, nonparametric Bayes, shrinkage priors, structured noise, weak effects
1 Introduction
Explaining away structured noise is one of the cornerstones for successful modeling of high-dimensional output data in the regression framework (Fusi et al., 2012; Klami et al., 2013; Rai et al., 2012; Rakitsch et al., 2013; Stegle et al., 2012; Virtanen et al., 2011). Structured noise refers to dependencies between the response variables which are unrelated to the dependencies of interest between the response variables and the covariates. It is noise caused by observed and unobserved confounders that affect multiple variables simultaneously. Common observed confounders in medical and biological data include the age and sex of an individual, whereas unobserved confounders include, for example, the state of the cell being measured, measurement artefacts influencing multiple probes, or other unrecorded experimental conditions. When not accounted for, structured noise may both hide interesting relationships and result in spurious findings (Leek and Storey, 2007; Kang et al., 2008).
The effects of known confounders can be removed straightforwardly by using supervised methods. For the unobserved confounders, a routinely used approach for explaining away structured noise has been to assume a priori independent effects for the interesting and uninteresting factors. For example, in the factor regression setup (Bernardo et al., 2003; Stegle et al., 2010; Fusi et al., 2012), the target variables are assumed to have been generated as
$Y = XB + H\Lambda + E$,  (1)
where $Y$ is the $n \times p$ matrix of target variables (or dependent variables) and $X$ is the $n \times q$ matrix of covariates (or independent variables), for the $n$ observations. The model parameters $H$ and $\Lambda$ comprise the unknown latent factors and the factor loadings, which are used to model away the structured noise. The term $E$ represents independent unstructured noise and the elements of $E$ are independently distributed, $e_{ij} \sim N(0, \sigma_j^2)$. In this paper we call this model independent-noise BRRR. To reduce the effective number of parameters in the regression coefficient matrix $B$, a low-rank structure may be assumed:
$B = \Psi\Gamma$,  (2)
where the rank $K$ of the parameters $\Psi \in \mathbb{R}^{q \times K}$ and $\Gamma \in \mathbb{R}^{K \times p}$ is substantially lower than the number of target variables $p$ and the number of covariates $q$. The low-rank decomposition of the regression coefficient matrix (2) may be given an interpretation whereby the covariates affect $K$ latent components with coefficients specified in $\Psi$, and the components, in turn, affect the targets with coefficients specified in $\Gamma$.
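To make the two-stage interpretation concrete, the following sketch (ours, with arbitrary illustrative dimensions) shows how the reduced-rank parameterization cuts the number of free parameters from $qp$ to $K(q + p)$ and factors prediction through the $K$ latent components:

import numpy as np

# A minimal sketch of the reduced-rank parameterization B = Psi @ Gamma (Eq. 2).
# The dimensions below are illustrative only.
q, p, K = 30, 96, 5                      # covariates, targets, rank
rng = np.random.default_rng(1)
Psi = rng.standard_normal((q, K))        # covariate -> component coefficients
Gamma = rng.standard_normal((K, p))      # component -> target coefficients

B = Psi @ Gamma                          # full q x p regression coefficient matrix
assert np.linalg.matrix_rank(B) == K
print(f"free parameters: full {q * p}, reduced-rank {K * (q + p)}")

x = rng.standard_normal(q)
# Prediction factors through the K components: x -> x @ Psi -> (x @ Psi) @ Gamma.
np.testing.assert_allclose(x @ B, (x @ Psi) @ Gamma)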
Figure 1: Model structures for (a) structured noise a priori independent of the signal, and (b) latent noise mediated through the same latent factors as the signal.
Another line of work in multiple-output prediction has focused on borrowing information from the correlation structure of the target variables when learning the regression model. The intuition stems from the observation that correlated targets are often seen to be affected similarly by the covariates, for example in genetic applications (see, e.g., Davis et al., 2014; Inouye et al., 2012). One popular method, GFlasso (Kim et al., 2009), learns the regression coefficients using
$\hat{B} = \arg\min_{B} \sum_{k} \|y_k - X\beta_k\|_2^2 + \lambda \sum_{k} \|\beta_k\|_1 + \gamma \sum_{(m,l) \in \mathcal{E}} f(r_{ml}) \sum_{j} |\beta_{jm} - \mathrm{sign}(r_{ml})\,\beta_{jl}|$,  (3)
where the $\beta_k$ are the columns of $B$. Two regularization parameters are introduced: $\lambda$ represents the standard Lasso penalty, and $\gamma$ encourages the effects $\beta_{jm}$ and $\beta_{jl}$ of the $j$th covariate on correlated outputs $m$ and $l$ to be similar. Here $r_{ml}$ represents the correlation between the $m$th and $l$th phenotypes. The set $\mathcal{E}$ is an a priori specified correlation graph for the output variables, with edges representing correlations to be accounted for in the model.
In this paper we propose a model that simultaneously learns the structured noise and encourages the sharing of information between the noise and the regression models. To motivate the new model, we note that by assuming independent prior distributions on $B$ and $\Lambda$ in model (1), one implicitly assumes independence of the interesting and uninteresting effects, caused by the covariates $X$ and the unknown factors $H$, respectively (Fig. 1a). The assumption is appealing for example when explaining away batch effects (Fusi et al., 2012) in high-dimensional data, but may be inadequate in the presence of other types of noise in molecular biology, where gene expression and metabolomics measurements record concentrations of compounds generated by ongoing latent biological processes. In situations of this kind, a limited set of covariates, such as single nucleotide polymorphisms (SNPs), determines the activity of the latent process only partially, and all other activity of the process is due to unrecorded factors. In such cases, the noise affects the measurement levels through the very same process as the interesting signal (Fig. 1b), and rather than assuming independence of the effects, an assumption about parallel effects would be more appropriate. We refer to this type of noise as latent noise, as it can be considered to affect the same latent subspace as the interesting effects. We note that in practice both types of structured noise are likely to be present. In this work, our main focus is on the latent noise, but we also present a comparison with a model that includes both types of structured noise simultaneously.
A natural way to encode the assumption of latent noise is to use the following model structure:
$Y = (X\Psi + H)\Gamma + E$,  (4)
where $H$ is an $n \times K$ matrix consisting of unknown latent factors. In (4), $\Gamma$ mediates the effects of both the interesting and uninteresting signals on the target variables. We note that the change required in the model structure is small, and has in fact been presented earlier (Bo and Sminchisescu 2009; recently extended with an Indian Buffet Process prior on the latent space by Bargi et al. 2014). We now proceed to using the structure (4) for GFlasso-type sharing of information (3) between the regression and noise models while simultaneously explaining away structured noise. To see that the information sharing between the noise and regression models follows immediately from model (4), one can consider simulations generated from the model. The a priori independence assumption of model (1) results in uncorrelated regression weights regardless of the correlations between target variables (Figure 2a). The assumption of latent noise (4), however, encourages the regression weights to be correlated in a similar way as the target variables are (Figure 2c).
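The following small simulation (our illustration, not the paper's code; all dimensions and variance scales are arbitrary) makes this concrete: under (4), strongly correlated targets necessarily receive similar regression weights, because both the signal and the latent noise are driven through the same columns of $\Gamma$:

import numpy as np

# Simulate from the latent-noise model (Eq. 4) and inspect how the regression
# weights of correlated targets behave. Dimensions and scales are arbitrary.
rng = np.random.default_rng(0)
n, q, p, K = 2000, 10, 30, 3

X = rng.standard_normal((n, q))
Psi = rng.standard_normal((q, K))        # covariate -> component weights
Gamma = rng.standard_normal((K, p))      # component -> target weights
H = 3.0 * rng.standard_normal((n, K))    # latent noise, stronger than the signal
E = 0.5 * rng.standard_normal((n, p))    # unstructured noise

Y = (X @ Psi + H) @ Gamma + E            # latent-noise model (Eq. 4)

# The implied q x p regression coefficients and the induced correlations.
B = Psi @ Gamma
corr_targets = np.corrcoef(Y, rowvar=False)
corr_weights = np.corrcoef(B, rowvar=False)

# The most correlated target pair also has strongly correlated weight vectors,
# since both targets load on the same latent factors through Gamma.
j, k = np.unravel_index(np.abs(np.triu(corr_targets, 1)).argmax(),
                        corr_targets.shape)
print(f"targets {j},{k}: corr(Y) = {corr_targets[j, k]:.2f}, "
      f"corr(weights) = {corr_weights[j, k]:.2f}")

Replacing the shared mapping by a separate loading matrix $\Lambda$, as in model (1) with $Y = X\Psi\Gamma + H\Lambda + E$, removes this coupling: the target correlations are then driven by $\Lambda$, while the regression weights depend only on $\Psi\Gamma$.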
Figure 2: Simulations from the models: the a priori independence assumption of model (1) yields uncorrelated regression weights (a), whereas the latent-noise model (4) yields regression weights that are correlated similarly to the target variables (c).
In this work, we focus on modelling weak signals in high-dimensional data with structured noise, where we consider effects that explain only a tiny portion of the variance of the target variables as weak. We have hypothesized above that a model with the structure (4) might be particularly well-suited for this purpose. Additionally, (i) particular emphasis must be put on defining adequate prior distributions to distinguish the weak effects from noise as effectively as possible, and (ii) scalability to large sample sizes is needed in order to have any chance of learning the weak effects. For (i), we define the latent signal-to-noise ratio as a generalization of the standard signal-to-noise ratio in the latent space:
$\mathrm{SNR}_{\mathrm{latent}} = \dfrac{\mathbb{E}\,\|X\Psi\|_F^2}{\mathbb{E}\,\|H\|_F^2}$.  (5)
We use the latent signal-to-noise ratio as a hyperparameter in our model, and show that it is a key parameter affecting model performance. It can either be learned or set using prior knowledge. In addition, we introduce an ordered infinite-dimensional shrinkage prior that resolves the inherent rotational ambiguity in the model (4) by sorting both signal and noise components by their importance. Finally, we present efficient inference methods for the model.
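As an illustration of how the hyperparameter can be set, the sketch below (ours; it assumes the reading of Eq. (5) as a ratio of expected squared Frobenius norms, and the function name is hypothetical) converts a desired latent signal-to-noise ratio into a prior variance for the elements of $H$:

import numpy as np

def h_variance_for_latent_snr(X, Psi, snr_latent):
    """Per-element variance of H that gives the requested latent SNR,
    reading Eq. (5) as E||X Psi||_F^2 / E||H||_F^2."""
    n = X.shape[0]
    K = Psi.shape[1]
    signal_power = np.sum((X @ Psi) ** 2)          # ||X Psi||_F^2
    # E||H||_F^2 = n * K * var_h  =>  var_h = signal_power / (snr * n * K)
    return signal_power / (snr_latent * n * K)

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 10))
Psi = 0.1 * rng.standard_normal((10, 3))
var_h = h_variance_for_latent_snr(X, Psi, snr_latent=1 / 25)
H = np.sqrt(var_h) * rng.standard_normal((500, 3))
print(np.sum((X @ Psi) ** 2) / np.sum(H ** 2))     # approximately 1/25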
2 Related work
Simultaneously solving multiple real-valued prediction tasks with the same set of covariates is called multiple-output regression (Breiman and Friedman, 1997); more generally, sharing of statistical strength between related tasks is called multitask learning (Baxter, 1996; Caruana, 1997). The data consist of input-output pairs $(x_i, y_i)$, $i = 1, \dots, n$; the $q$-dimensional input vectors (covariates) are used for predicting the $p$-dimensional vectors of target variables. The common approach to dealing with structured noise due to unobserved confounders is to apply factor regression modeling (1) (Bernardo et al., 2003) and to explain away the structured noise using a noise model that is assumed to be a priori independent of the regression model (Stegle et al., 2010; Fusi et al., 2012; Rai et al., 2012; Virtanen et al., 2011; Klami et al., 2013; Rakitsch et al., 2013). A recent Bayesian reduced-rank regression (BRRR) model (Marttinen et al., 2014) implements the routine assumption of the independence of the regression and noise models; we will include it in the comparison studies of this paper.
Methods for multiple-output regression without the structured noise model have been proposed in other fields. In the application fields of genomic selection and multi-trait quantitative trait loci mapping, solutions (Yi and Banerjee, 2009; Xu et al., 2009; Calus and Veerkamp, 2011; Stephens, 2013) for low-dimensional target variable vectors have been proposed, but these methods do not scale up to the currently emerging needs of analyzing higher-dimensional target variable data. Additionally, sparse multiple-output regression models have been proposed for the prediction of phenotypes from genomic data (Kim et al., 2009; Sohn and Kim, 2012).
Many methods for multitask learning have been proposed in the field of kernel methods (Evgeniou and Pontil, 2007). These methods do not, however, scale up to data sets with several thousands of samples, which are required for predicting the weak effects. Other relevant work includes a recent method based on the BRRR presented by Foygel et al. (2012), but it does not scale to the dimensionalities of our experiments either. Methods for high-dimensional phenotypes have been proposed in the field of expression quantitative trait loci mapping (Bottolo et al., 2011) for the related task of finding associations (and avoiding false positives) rather than prediction, which is our main focus. Also functional assumptions (Wang et al., 2012) have been used to constrain related learning problems.
3 Model
In this Section, we present the details of our new model, Bayesian reduced-rank regression with latent noise ('latent-noise BRRR'), show how the hyperparameters can be set using the latent signal-to-noise ratio, and analyze theoretically some properties of the infinite-dimensional shrinkage prior.
3.1 Model details: latent-noise BRRR
Our model is given by
$Y = (X\Psi + H)\Gamma + E$,  (6)
where $Y \in \mathbb{R}^{n \times p}$ contains the $p$-dimensional response variables for the $n$ observations, and $X \in \mathbb{R}^{n \times q}$ contains the predictor variables. The product $\Psi\Gamma$, of $\Psi \in \mathbb{R}^{q \times K}$ and $\Gamma \in \mathbb{R}^{K \times p}$, results in a regression coefficient matrix with rank $K$. The matrix $H \in \mathbb{R}^{n \times K}$ contains unknown latent factors representing the latent noise. Finally, $E = (e_1, \dots, e_p)$ with $e_j \sim N(0, \sigma_j^2 I_n)$ is a matrix of uncorrelated target-variable-specific noise vectors. Figure 3 displays graphically the structure of the model. In the figure, the node corresponding to the parameter $\Gamma$, which is shared by the regression and noise models, is highlighted with green.
Similarly to a recent BRRR model (Marttinen et al., 2014) and the Bayesian infinite sparse factor analysis model (Bhattacharya and Dunson, 2011), we assume the number of components connecting the covariates to the targets to be infinite. Accordingly, the number of rows in the weight matrix $\Gamma$, and the numbers of columns in $\Psi$ and $H$, are infinite. The low-rank nature of the model is enforced by shrinking the columns of $\Psi$ and $H$ and the rows of $\Gamma$ increasingly with the growing column/row index, such that only a small number of columns/rows are influential in practice. The increasing shrinkage also solves any rotational unidentifiability issues by enforcing the model to mediate the strongest effects through the first columns/rows. In Section 3.4 we explore the basic properties of the infinite-dimensional prior, to ensure its soundness.
The hierarchical priors for the projection weight matrix $\Gamma = \{\gamma_{hj}\}$, where $h = 1, \dots, \infty$ indexes the components and $j = 1, \dots, p$ the target variables, are set as follows:
$\gamma_{hj} \mid \phi_{hj}, \tau_h \sim N(0, \phi_{hj}^{-1}\tau_h^{-1}), \quad \phi_{hj} \sim \mathrm{Ga}(\nu/2, \nu/2), \quad \tau_h = \prod_{l=1}^{h}\delta_l, \quad \delta_1 \sim \mathrm{Ga}(a_1, 1), \quad \delta_l \sim \mathrm{Ga}(a_2, 1), \; l \geq 2$.  (7)
Here $\tau_h$ is a global shrinkage parameter for the $h$th row of $\Gamma$ and the $\phi_{hj}$ are local shrinkage parameters for the individual elements of $\Gamma$, to provide additional flexibility over the global shrinkage priors. The same parameters are used to shrink the columns of the matrices $\Psi$ and $H$, because the scales of $\Gamma$ and $\Psi$ (or $H$) are not identifiable separately:
$\psi_{\cdot h} \sim N(0, \tau_h^{-1} I_q), \qquad h_{\cdot h} \sim N(0, \alpha^2 \tau_h^{-1} I_n),$

where $\alpha^2$ is a parameter that specifies the amount of latent noise, which is used to regularize the model (see the next Section). With the priors specified, the hidden factors $H$ can be integrated out analytically, yielding
$y_i \mid x_i \sim N\!\left(\Gamma^T \Psi^T x_i, \; \tilde{\Gamma}^T \tilde{\Gamma} + \Sigma\right), \qquad \Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_p^2),$  (8)
where $\tilde{\Gamma}$ is obtained from $\Gamma$ by multiplying the rows of $\Gamma$ with the shrinkage scales of the corresponding columns of $H$.
Finally, conjugate prior distributions
$\sigma_j^{-2} \sim \mathrm{Ga}(a_\sigma, b_\sigma)$  (9)
are placed on the noise parameters of the target variables.
3.2 Regularization of latent-noise BRRR through the variance of $H$
The latent signal-to-noise ratio in Equation (5) has an intuitive interpretation: given our prior distributions for $\Psi$ and $H$, the prior latent SNR indicates the extent to which we believe the noise $H$ to explain variation in $Y$, as compared to the variance explained by the covariates $X$. Thus, the latent SNR acts as a regularization parameter: when the latent variables $H$ are allowed to have a large variance, the data will be explained by the noise model rather than the covariates. We note that this approach to regularization is non-standard and it may have favourable characteristics compared to the commonly used L1/L2 regularization of regression weights. First of all, the regression weights remain relatively unbiased, as they need not be forced to zero to control for overfitting. This is important when the effects are weak: if the effects were shrunk towards zero, they might be lost completely.
Secondly, while regularizing with the a priori selected latent SNR, the regularization parameter itself remains interpretable: every value of the variance parameter of $H$ can be immediately interpreted as the percentage of variance explained by the noise model as compared to the covariates. In our experiments, we use cross-validation to select the variance of $H$, and the interpretability of the parameter makes it easy to express beliefs about plausible values based on prior knowledge. Making similar educated guesses for L1/L2 regularization parameters is not straightforward.
3.3 Difference between latent-noise BRRR and independent-noise BRRR
We call the standard Bayesian reduced-rank regression (Equation 1), which assumes independent noise and signal models, the independent-noise BRRR. The new latent-noise BRRR differs from it in two ways: in the latent-noise BRRR

(1) the structure of the model is different in that the noise model uses the same projection parameters $\Gamma$ as the regression model, and

(2) the model is regularized by modifying the variance of the noise model. This is achieved by learning the latent signal-to-noise ratio parameter.
In Section 5.5 we show that both of these improvements are needed to reach the performance differences observed.
We emphasize that although the technical difference between the two models is minor, the models are very different from the conceptual point of view, as discussed in the Introduction, as well as from the practical point of view. In particular, it has been reported before that with weak effects the independent-noise BRRR may suffer from severe instability, resulting from a highly multimodal posterior distribution and, consequently, poor convergence and mixing properties of the learning algorithms (Koop et al., 2006; Marttinen et al., 2014). In Section 5.11, we demonstrate how the latent noise assumption provides just the required additional regularization to make formal Bayesian inference tractable even with weak effects.
As both independent structured noise and latent noise could be present, a logical extension to the models presented so far is to consider both noise types simultaneously,
$Y = (X\Psi + H)\Gamma + \tilde{H}\Lambda + E$,  (10)
where the distributional assumptions for $H$ and $\Gamma$ are the same as in the latent-noise BRRR, and those for $\tilde{H}$ and $\Lambda$ follow the independent-noise BRRR. The Gibbs updates for this model are straightforward modifications of those for the latent-noise BRRR and the independent-noise BRRR. We have implemented this model as well and study its performance in Section 5.9.
We note that the latent-noise model is, in principle, able to express data generated by the independent-noise BRRR model, and vice versa. The latent-noise BRRR model may learn noise components that are in practice independent of the signal, having negligible contribution from the regression part $X\Psi$. On the other hand, nothing prevents the independent-noise model from learning some correlated regression and noise components. Therefore, the family of models defined by Equation (10), which simultaneously includes both kinds of structured noise, may have redundancy in its parameters. Indeed, the experiments in Section 5.9 demonstrate only minor improvements from this model.
3.4 Proofs of the soundness of the infinite prior
In this Section we verify the sensibility of the infinite nonparametric prior, which we introduce for ordering the components according to decreasing importance, and of a computational approximation resulting from truncation of the infinite model.
It has been proven that in Bayesian factor models a similar condition on the shrinkage hyperparameters is sufficient for the elements of the implied covariance matrix to have finite variance in a Bayesian factor model (1), even if an infinite number of columns with a prior similar to ours (defined in Eq. 7) is assumed for the loading matrix $\Lambda$ (Bhattacharya and Dunson, 2011). In this Section we present similar characteristics for the infinite reduced-rank regression model. The detailed proofs can be found in the Supplementary material. First, in analogy to the infinite Bayesian factor analysis model, we show that the condition
$a_2 > 2 \quad \text{and} \quad \nu > 2$  (11)
is sufficient for the prediction of any of the response variables to have finite variance under the prior distribution (Proposition 1). Second, we show that the underestimation of uncertainty (variance) resulting from using a finite-rank approximation to the infinite reduced-rank regression model decays exponentially with the rank of the approximation (Proposition 2). For notational clarity, let $\gamma_{\cdot j}$ denote the $j$th column of the matrix $\Gamma$ in the following. With this notation, the prediction for the $j$th response variable can be written as

$\hat{y}_j = x^T \Psi \gamma_{\cdot j} = \sum_{h=1}^{\infty} (x^T \psi_{\cdot h})\, \gamma_{hj}.$
Furthermore, let $\Gamma(\cdot)$ below denote the gamma function (not to be confused with the matrix $\Gamma$ used in all other Sections of this paper).
Proposition 1 (Finite variance of predictions): Suppose that $a_2 > 2$ and $\nu > 2$. Then
$\mathrm{Var}\!\left(x^T \Psi \gamma_{\cdot j}\right) < \infty$.  (12)
A detailed proof is provided in the Supplementary material.
Proposition 2 (Truncation error of the finite-rank approximation): Let $\hat{y}_j^{(K)}$ denote the prediction for the $j$th target variable when using an approximation for $\Psi$ and $\Gamma$ consisting of the first $K$ columns or rows only, respectively. Then,

$\mathrm{Var}\!\left(\hat{y}_j\right) - \mathrm{Var}\!\left(\hat{y}_j^{(K)}\right) \leq C\, c^{K} \quad \text{for some constants } C > 0 \text{ and } 0 < c < 1,$

that is, the reduction in the variance of the prediction resulting from using the approximation, relative to the infinite model, decays exponentially with the rank of the approximation. A detailed proof is provided in the Supplementary material.
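To see informally where the exponential decay comes from, consider the expected shrinkage of the $h$th component under the prior (7). The following sketch (ours; it uses the stronger illustrative assumption $a_2 > 2$, whereas the precise conditions are given in the Supplementary material) shows the geometric tail:

% For \delta \sim \mathrm{Ga}(a, 1) with a > 1, \mathbb{E}[\delta^{-1}] = 1/(a - 1).
% By the independence of the \delta_l in (7),
\mathbb{E}\big[\tau_h^{-1}\big]
  = \prod_{l=1}^{h} \mathbb{E}\big[\delta_l^{-1}\big]
  = \frac{1}{a_1 - 1}\left(\frac{1}{a_2 - 1}\right)^{h-1}.
% The prior variance contributed by component h is proportional to such terms,
% so with r = (a_2 - 1)^{-1} < 1 (i.e., a_2 > 2) the variance omitted by a
% rank-K truncation is bounded by a geometric tail,
\sum_{h=K+1}^{\infty} c\, r^{\,h-1} = \frac{c\, r^{K}}{1 - r},
% which decays exponentially in K.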
4 Efficient computation by reparameterization
For estimating the parameters of the latent-noise BRRR, we use Gibbs sampling, updating the parameters one by one by sampling them from their conditional posterior probability distributions, given the current values of all other parameters. The bottleneck of the computation is in updating the matrix $\Psi$, and below we present a novel efficient update for this parameter.
Update of $\Gamma$
The conditional distribution of the parameter matrix $\Gamma$ of the latent-noise BRRR can be updated using a standard result for Bayesian linear models (Bishop, 2006), which states that if
$p(x) = N(x \mid \mu_0, \Lambda_0^{-1}) \quad \text{and} \quad p(y \mid x) = N(y \mid Ax, L^{-1}),$  (13)
then
$p(x \mid y) = N(x \mid \mu, \Lambda^{-1}),$  (14)
where
$\Lambda = \Lambda_0 + A^T L A \quad \text{and} \quad \mu = \Lambda^{-1}\!\left(A^T L y + \Lambda_0 \mu_0\right)$.  (15)
Because in our model (6) the columns of the noise matrix $E$ are assumed independent with variances $\sigma_j^2$, we get
$p(Y \mid \Psi, H, \Gamma, \Sigma) = \prod_{j=1}^{p} N\!\left(y_j \mid (X\Psi + H)\,\gamma_{\cdot j}, \; \sigma_j^2 I_n\right)$.  (16)
Thus, by substituting $A = X\Psi + H$ and $L = \sigma_j^{-2} I_n$ into (13), together with the prior covariance derived from (7), we immediately obtain the posterior of each column $\gamma_{\cdot j}$ of $\Gamma$ from (14) and (15).
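For concreteness, a minimal sketch of this update for a single column of $\Gamma$ is given below (our illustration with hypothetical variable names; the global and local shrinkage parameters from (7) enter only through the diagonal prior precision):

import numpy as np

def sample_gamma_column(y_j, A, prior_prec_diag, sigma2_j, rng):
    """Draw one column gamma_j from its conditional posterior in the model
    y_j = A @ gamma_j + e, e ~ N(0, sigma2_j * I), with prior
    gamma_j ~ N(0, diag(prior_prec_diag)^{-1}); cf. Eqs. (13)-(15)."""
    # Posterior precision (Eq. 15): Lambda = Lambda_0 + A^T A / sigma2_j.
    Lam = np.diag(prior_prec_diag) + A.T @ A / sigma2_j
    L = np.linalg.cholesky(Lam)
    # Posterior mean (Eq. 15): mu = Lambda^{-1} A^T y_j / sigma2_j.
    mu = np.linalg.solve(Lam, A.T @ y_j / sigma2_j)
    # A draw with covariance Lambda^{-1}: mu + L^{-T} z.
    z = rng.standard_normal(len(mu))
    return mu + np.linalg.solve(L.T, z)

rng = np.random.default_rng(3)
n, K = 200, 4
A = rng.standard_normal((n, K))          # A = X Psi + H in the latent-noise model
gamma_true = rng.standard_normal(K)
y_j = A @ gamma_true + 0.1 * rng.standard_normal(n)
print(sample_gamma_column(y_j, A, np.ones(K), 0.01, rng))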
Updates of $H$ and the hyperparameters
The updates of the hyperparameters are the same as in Bayesian reduced-rank regression, and the conditional posterior distributions of the hyperparameters can be found in the Supplementary material of Marttinen et al. (2014). The matrix $H$ has the same conditional posterior distribution as the corresponding latent-factor parameter of Marttinen et al. (2014).
Improved update of $\Psi$
The computational bottleneck of the naïve Gibbs sampler is the update of the parameter $\Psi$, which has $qK$ elements with a joint multivariate Gaussian distribution, conditionally on the other parameters (Geweke, 1996; Marttinen et al., 2014). Thus, the inversion of the precision matrix of the joint distribution has a computational cost of $O(q^3 K^3)$. To remove the bottleneck, we reparameterize our model, after which a linear algebra trick by Stegle et al. (2011) can be used to reduce the computational cost of the bottleneck to $O(q^3 + K^3)$. When sampling $\Psi$ we also integrate over the distribution of $H$, following the standard result from Equation (8). The reparameterization and the new posteriors are presented in the Supplementary material.
In brief, the trick is that the eigenvalue decomposition of a matrix of the form
$C \otimes R + \sigma^2 I$  (17)
can be evaluated inexpensively given the eigenvalue decompositions of the small matrices $C$ and $R$. After reparameterizing the model in the proposed way, the posterior covariance matrix of $\Psi$ becomes of the form (17), and the eigenvalue decomposition can then be used to efficiently generate samples from the posterior distribution of $\Psi$. We note that the trick can also be applied to the original formulation of the Bayesian reduced-rank regression model by Geweke (1996), and the R code published with this article allows generating samples from the original model as well. In the next Section, we compare the computational cost of the algorithm using the naïve Gibbs sampler and the improved version that uses the new parameterization.
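The following sketch (ours) illustrates the trick of Equation (17): the eigendecomposition of the large matrix is assembled from the eigendecompositions of the two small matrices, which is what makes the improved update cheap:

import numpy as np

# Eigendecomposition of C (x) R + s2*I from the small decompositions of C and R.
rng = np.random.default_rng(4)
K, q, s2 = 3, 5, 0.7
C = (lambda M: M @ M.T)(rng.standard_normal((K, K)))   # K x K positive definite
R = (lambda M: M @ M.T)(rng.standard_normal((q, q)))   # q x q positive definite

wC, UC = np.linalg.eigh(C)
wR, UR = np.linalg.eigh(R)
U = np.kron(UC, UR)                      # eigenvectors of C (x) R
w = np.kron(wC, wR) + s2                 # eigenvalues of C (x) R + s2*I

M = np.kron(C, R) + s2 * np.eye(K * q)
np.testing.assert_allclose(U @ np.diag(w) @ U.T, M, atol=1e-8)
# Only the K x K and q x q eigendecompositions are ever computed: O(K^3 + q^3)
# instead of O((Kq)^3) for factorizing the Kronecker product directly.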
Sampling the maximum rank of the model
The sparse infinite factor analysis model presented by Bhattacharya and Dunson (2011) uses an adaptation procedure to update the maximum rank, i.e., the truncation point of their infinite-rank factor model. The idea is to update the maximum rank occasionally during the algorithm such that ranks having all elements of the corresponding projection vectors within some prespecified distance from zero are removed from the model and, if none of the ranks has all elements within the threshold, another rank is added into the model. We have implemented a modification of this approach where we adapt the maximum rank of our infinite reduced-rank regression model using a prespecified cutoff for the amount of variance explained by the corresponding rank, as sketched below. With a slight abuse of terminology, we shall call this updating of the rank 'sampling' in the sequel.
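A minimal sketch of one such adaptation step is given below (our illustration; the cutoff, the proposal scale, and all names are hypothetical):

import numpy as np

def adapt_rank(Psi, Gamma, X, cutoff=1e-3, rng=None):
    """Prune components whose share of the explained variance falls below the
    cutoff; if every component matters, grow the truncation by one rank."""
    rng = np.random.default_rng() if rng is None else rng
    comp_var = np.array([np.var(X @ Psi[:, [h]] @ Gamma[[h], :])
                         for h in range(Gamma.shape[0])])
    share = comp_var / comp_var.sum()
    keep = share >= cutoff
    if keep.all():                       # every rank is influential: add one
        q, p = Psi.shape[0], Gamma.shape[1]
        Psi = np.hstack([Psi, 0.01 * rng.standard_normal((q, 1))])
        Gamma = np.vstack([Gamma, 0.01 * rng.standard_normal((1, p))])
    else:                                # drop the negligible ranks
        Psi, Gamma = Psi[:, keep], Gamma[keep, :]
    return Psi, Gamma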
5 Experiments
We start with a basic validation of the latent-noise BRRR model, and its relative merits over alternatives in a prediction task, using simulations with the ground truth available (Section 5.3) and a real-world omics data set (Section 5.4). Section 5.5 analyses these results in more detail and identifies the characteristics of the proposed latent-noise BRRR model that are responsible for the performance differences observed, by considering the impact of each novel model aspect in isolation. Section 5.7 investigates another application domain, the detection of multivariate associations. In order to assess the prediction performance more generally, we analyse several additional real-world data sets from different domains in Section 5.8.
Different aspects of the inference algorithm are considered in three subsections: sampling vs. cross-validation of the rank and the latent signal-to-noise ratio (Section 5.6), the speed-up resulting from the proposed reparameterization of the algorithm (Section 5.10), and convergence diagnostics (Section 5.11). To assess the value of further extensions, Section 5.9 considers a model that includes both latent and independent structured noise simultaneously. Finally, Section 5.12 summarizes the findings on all real data sets.
5.1 Data sets
Experiments were performed on the following data sets:

NFBC1966: The NFBC1966 data set comprises genome-wide SNP data along with metabolomics measurements for a cohort of 4,702 individuals (Rantakallio, 1969; Soininen et al., 2009). With these data, 96 metabolites belonging to the subclasses VLDL, IDL, LDL and HDL (Inouye et al., 2012) were used as the target variables, and SNPs known to be associated with lipid metabolism (Teslovich et al., 2010; Kettunen et al., 2012; Global Lipids Genetics Consortium, 2013) were used as the covariates. Effects of age, sex, and lipid-lowering medication were regressed out from the metabolomics data as a preprocessing step. For the genotype data, SNPs with low minor allele frequency (< 0.01) were removed as a preprocessing step. For this data set, the comparison method GFlasso required excessive training time and we used 5-fold cross-validation to evaluate test set performances. Where cross-validation was needed for selecting model parameter values, the validation data performance was measured as an average over 3 validation sets, each comprising a subset of the training samples.

DILGOM (metabolomics and gene expression prediction from SNPs): The DILGOM data set (Inouye et al., 2010) consists of genome-wide SNP data along with metabolomics and gene expression measurements. For details concerning metabolomics and gene expression data collection, see Soininen et al. (2009) and Kettunen et al. (2012). In total, 509 individuals had all three measurement types. The DILGOM metabolomics data comprise 137 metabolites, most of which represent NMR-quantified levels of lipoproteins classified into 4 subclasses (VLDL, IDL, LDL, HDL), together with quantified levels of amino acids, some serum extracts, and a set of quantities derived as ratios of the aforementioned metabolites. All 137 metabolites were used simultaneously as prediction targets. In gene expression prediction, in total 387 probes corresponding to curated gene sets of 8 KEGG lipid metabolism pathways were used as the prediction targets. A separate model was learnt for each pathway. The average number of probes in a pathway was 48. For details about the pathways, see the Supplementary material. On these data sets, 10-fold cross-validation was used to evaluate test set performances. To select values of the parameters that required evaluation on validation data, the training data were further divided into 9 folds, on which cross-validation was performed to select parameters according to averaged validation set performance.

fMRI: The cognitive neuroscience data set (Wehbe et al., 2014) consists of a time series of fMRI measurements from 8 subjects reading a chapter from "Harry Potter and the Sorcerer's Stone" using Rapid Serial Visual Presentation: the words of the text are presented one by one in the center of a screen. Brain voxel activations were measured every 2 seconds. The 250 most accurately predictable voxels (see Supplementary material of Wehbe et al., 2014) of the fMRI measurements were used as prediction targets. The fMRI measurements from all subjects were predicted simultaneously from features of the words being shown, such as semantic and syntactic properties, visual properties, and discourse-level features. The data were divided into 10 folds, only two of which were used to measure test data performance. This computational compromise was needed as the preprocessing (Wehbe et al., 2014) for each fold required about 10,000 hours of computation. To select the values of parameters that required evaluation on validation data, the training data were further divided into 10 folds, on which cross-validation was performed to select parameters according to averaged validation set performance.

econ: The macroeconomic time series data set (Stock and Watson, 2006) consists of monthly values of 52 macroeconomic indicators. Prediction performance for these values from their earlier values was measured with different lags (1 month, 2 months, etc.). The data were processed as described by Carriero et al. (2011). Data for each month were used as a test set (395 test sets in total), while using the data from the previous 10 years for training. Where cross-validation was needed for learning the values of model parameters, data from the last 2 years before the month-to-be-predicted were used for validation and data from the previous 8 years for training.
5.2 Methods included in comparison
We compared the latent-noise BRRR with a state-of-the-art sparse multiple-output regression method, Graph-guided Fused Lasso ('GFlasso') (Kim and Xing, 2009); the BRRR/factor regression model (Marttinen et al., 2014) with and without the a priori independent noise model ('independent-noise BRRR', 'BRRR without noise model'); the standard Bayesian linear model ('blm') (Gelman et al., 2004); elastic-net-penalized multitask learning ('L2/L1 MTL'); kernel regression with linear and Gaussian kernels combined with a process for removing confounding factors (Stegle et al., 2012) ('KRR with linear kernel + PEER', 'KRR with Gaussian kernel + PEER'); and a baseline method of predicting with the target data mean. GFlasso constitutes a suitable comparison as it encourages sharing of information between correlated responses, like our model, but does so within the Lasso-type penalized regression framework, without the use of a noise model to explain away the structured noise. L2/L1 MTL is a multitask regression method implemented in the glmnet package (Friedman et al., 2010) that allows elastic net regularization. It does not use a noise model to explain away confounders either. The blm was selected as a simple single-task baseline.
In one of the experiments, an association study, the latent-noise BRRR is compared with the independent-noise BRRR and canonical correlation analysis ('cca'), considered the state-of-the-art methods for the detection of multivariate associations (Marttinen et al., 2013, 2014). Additionally, the simple univariate linear model ('lm') is included, as it represents the common baseline in association analysis.
We compare the latent-noise BRRR also with two other new models for structured noise modeling. In the simulations, we study the performance of correlated Bayesian reduced-rank regression ('correlated BRRR'), which is presented in more detail in the Supplementary material. In brief, in the correlated BRRR, the correlation structure of the target variables learnt by an a priori independent noise model is used as a prior for the regression weight parameters. With the NFBC1966 data and the macroeconomic time series data sets, we also study the performance of the method presented in Equation (10) in Section 3.3 that explicitly models both latent and independent structured noise, abbreviated as 'latent+independent-noise BRRR'.
Parameters for the different methods were specified as follows:

GFlasso: The regularization parameters of the gw2 model were selected from the default grid using cross-validation. The method has been developed for genomic data, indicating that the default values should be appropriate. However, for the NFBC1966 data, we were unable to run the method with the smallest values of the regularization parameters due to lengthy runtimes with these values. With this computational compromise of leaving out these three values, the average training time for the largest training data sets was 650 h. With the NFBC1966 data, the prespecified correlation network required by GFlasso was constructed to match the VLDL, IDL, LDL, and HDL metabolite clusters from Inouye et al. (2012). Within these clusters, the correlation network was fixed to the empirical correlations, and to 0 otherwise. With the DILGOM data, we used the empirical correlation network, with correlations below 0.8 fixed to 0 to reduce the number of edges in the network for computational speed-up.

independent-noise BRRR, BRRR without noise model: The hyperparameters $a_1$ and $a_2$ of all the BRRR models were fixed to 10 and 4, respectively. In total 1,000 MCMC samples were generated and 500 were discarded as burn-in. In preliminary tests, similar results were obtained with 50,000 samples. The remaining samples, thinned by a factor of 10, were used for prediction. The maximum rank of the infinite-rank BRRR model was learned using cross-validation from separate grids of values for the NFBC1966 data set, for the metabolomics prediction task on the DILGOM data set, and for the gene expression prediction task on the DILGOM data set. These grids were selected based on initial experiments. For the fMRI response prediction, the set of possible values for the maximal rank was limited in order to save computational time. For the econometrics data set, a separate grid of maximum ranks was used. In the association detection task, the rank of the independent-noise BRRR was fixed to 1, as this was already sufficient for the task.

latent-noise BRRR: With the NFBC1966 data, the latent signal-to-noise ratio was selected using cross-validation from a wide range of values, in order to thoroughly evaluate the sensitivity of the model to this parameter. For the other data sets/tasks, smaller task-specific sets of values were used; for the fMRI response prediction, the set of values was limited to save computation time. Other parameters, including the number of iterations, were set as for the independent-noise BRRR. The performance of the model was evaluated both by sampling the maximum rank and by learning it with cross-validation from the same range of values as with the independent-noise BRRR. The shrinkage hyperparameters were set to noninformative values, 10 and 4, similarly to the corresponding parameters $a_1$ and $a_2$ of the independent-noise BRRR.

blm: The variance hyperparameter of the blm was integrated over using MCMC. The variance hyperparameter was assigned a Gamma prior with both shape and rate parameters set to 1. In total 1,000 posterior samples were generated and 500 were discarded as burn-in.

L1/L2 MTL: The effects of different types of regularization penalties are an active research topic, and we ran a continuum of mixtures of L1 and L2 penalties ranging from group lasso to ridge regression. The mixture parameter controlling the balance between L1 and L2 regularization was evaluated on the grid [0, 0.1, ..., 0.9, 1.0] and selected using 10-fold cross-validation. The default convergence threshold parameters of glmnet were used and no warnings/numerical problems occurred.

KRR with linear kernel + PEER: First, the PEER software (Stegle et al., 2012) was used to remove the effects of confounders using 15 components. Then kernel ridge regression with a normalized linear kernel (Bishop, 2006) was applied using the residuals from PEER as the target variables. Kernel ridge regression was regularized according to the standard approach of adding a parameter $\lambda$ to the diagonal elements of the kernel matrix. The value of $\lambda$ was selected using cross-validation from a set of 10 values ranging from 0.1 to 100. To share information between the different target variables, the approach of using the same kernel for all target variables was adopted (a code sketch is given after this list).

KRR with Gaussian kernel + PEER: Kernel ridge regression using a Gaussian kernel was used. Regularization and the use of PEER were otherwise similar to KRR with linear kernel + PEER. The radius parameter of the Gaussian kernel was selected using cross-validation from a set of 30 values ranging from 0.001 to 1000.

cca: This is the conventional classical canonical correlation analysis, which attempts to identify linear combinations of the columns of the input and output matrices that are maximally correlated with each other.

correlated BRRR: The rank and the shrinkage hyperparameters were set as with the independent-noise BRRR. This model is presented in detail in Supplementary Section 1.

latent+independent-noise BRRR: With the NFBC1966 data, the shrinkage hyperparameters were set as with the independent-noise BRRR. The latent signal-to-noise ratio was selected using cross-validation from a range of values, and the maximum rank was fixed to 10. For the econometrics data set, a separate grid of maximum ranks was used and the signal-to-noise ratio was selected using cross-validation. For both data sets, the variance parameter of the a priori independent noise was selected from a set of values including the extreme value corresponding to the latent-noise BRRR.
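Below is a minimal sketch of the kernel ridge regression step used in the two 'KRR + PEER' baselines above (our illustration; the PEER residual-extraction step is omitted, the regularization parameter would be selected by cross-validation as described, and all names are hypothetical):

import numpy as np

def krr_fit_predict(X_train, Y_train, X_test, lam):
    """Kernel ridge regression with a normalized linear kernel, sharing the
    same kernel across all target variables."""
    Xn_tr = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)
    Xn_te = X_test / np.linalg.norm(X_test, axis=1, keepdims=True)
    K_tr = Xn_tr @ Xn_tr.T               # normalized linear kernel
    K_te = Xn_te @ Xn_tr.T
    # Regularize by adding lam to the diagonal; solve for all targets at once.
    A = np.linalg.solve(K_tr + lam * np.eye(len(K_tr)), Y_train)
    return K_te @ A

rng = np.random.default_rng(5)
X, Y = rng.standard_normal((100, 20)), rng.standard_normal((100, 5))
preds = krr_fit_predict(X[:80], Y[:80], X[80:], lam=1.0)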
5.3 Simulation experiment: impact of the noise model assumptions
In this Section, we study the implications of different noise model assumptions. The performances of models with different noise model assumptions are measured on simulated data sets generated from a continuum of models between the two extremes of assuming either latent noise or a priori independent regression and noise models. The synthetic data are generated according to
$Y = X\Psi\Gamma + \lambda H\Gamma + (1 - \lambda)\tilde{H}\Lambda + E$,  (18)
where $\lambda \in [0, 1]$ is a parameter defining the proportion of variance attributed to the latent noise versus independent structured noise. We study a continuum of problems with different values of the parameter $\lambda$. The parameters $\Gamma$ and $\Lambda$ are orthogonalized using Gram-Schmidt orthogonalization. The parameters are scaled so that the covariates explain 3% of the variance of $Y$ through $X\Psi\Gamma$, the diagonal Gaussian noise $E$ explains 20% of the total variance of $Y$, and the structured noise explains the remaining 77% of the total variance of $Y$. The simulation was repeated 100 times and training data sets of 500 and 2000 samples were generated for each replicate. To compare the methods, the performance in mean squared error (MSE) of the models learned with each method was compared to that of the true model on a test set of 15,000 samples. The number of covariates was fixed to 30 and the number of dependent variables to 60. The rank of the regression coefficient matrix and the structured noise was set to 3 when simulating the data sets.
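A condensed sketch of this generative design is given below (our illustration; it omits the Gram-Schmidt orthogonalization of $\Gamma$ and $\Lambda$ and rescales each component empirically to hit the stated variance shares):

import numpy as np

def simulate(n, q=30, p=60, K=3, lam=0.5, rng=None):
    """Generate data from Eq. (18): lam controls the proportion of latent
    versus independent structured noise; variance shares are 3% signal,
    77% structured noise, and 20% diagonal noise."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n, q))
    Psi, Gamma = rng.standard_normal((q, K)), rng.standard_normal((K, p))
    H, H_tilde = rng.standard_normal((n, K)), rng.standard_normal((n, K))
    Lambda = rng.standard_normal((K, p))

    def scaled(M, share):                # rescale a component to a variance share
        return M * np.sqrt(share / M.var())

    signal = scaled(X @ Psi @ Gamma, 0.03)
    structured = scaled(lam * H @ Gamma + (1 - lam) * H_tilde @ Lambda, 0.77)
    E = np.sqrt(0.20) * rng.standard_normal((n, p))
    return X, signal + structured + E

X, Y = simulate(n=500, lam=1.0)          # the pure latent-noise end of the continuum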
For the independent-noise BRRR, the rank of the regression coefficient parameters $\Psi$ and $\Gamma$ was fixed to the true value, while the rank of the noise model was learnt from the data. For the latent-noise BRRR, the performance of the model was evaluated both by fixing the rank of the regression coefficient matrix to its true value and by learning it from the data. The variance of $H$ was selected using 10-fold cross-validation. The grid of latent signal-to-noise ratios was chosen according to the interpretation given in Section 5.8; it corresponds to assuming that the latent noise explains 5 to 15 times the variance explained by the covariates.
Figures 4 (a) and (b) present the results of the simulation study with training sets of 500 and 2000 samples, respectively. When the structured noise is generated according to the conventional assumption of independent signal and noise, the model making the independence assumption (i.e., the independent-noise BRRR) performs equally well to the true model with both 500 and 2000 samples. However, when the assumption is violated and the proportion of latent noise increases, the performance of the independent-noise BRRR breaks down, whereas the latent-noise BRRR performs consistently well. The method that does not explain away the structured noise at all (BRRR without noise model) is always inferior to the null model with the training set of 500 samples. When the number of training samples is increased to 2000 and the noise is generated according to the latent-noise assumption, this model, however, outperforms even the independent-noise BRRR. Thus, having no noise model is in this case better than having a noise model based on the incorrect independence assumption, which emphasizes the importance of the assumptions on which the noise model is based. Interestingly, with $n = 2000$ the BRRR without noise model is among the best performing methods, whereas with $n = 500$ it is clearly the worst, highlighting the fact that the smaller $n$ gets, the more important the right assumptions become.
The latent-noise end of the continuum appears to be more difficult for the methods that do not account for the structured noise (blm, BRRR without noise model). This weak but consistent trend can be seen in Figure 4(b), where the difference between the oracle and these methods increases with the percentage of latent noise. This behaviour is, however, rather intuitive in terms of Equation (18); by rewriting

$Y = (X\Psi + \lambda H)\Gamma + (1 - \lambda)\tilde{H}\Lambda + E,$

it is obvious that as $\lambda \to 1$, the structured noise (coming from $H$ and $\Gamma$) will with certainty be projected on the particular target variables that are affected by the covariates $X$. In other words, latent noise blurs exactly the relationships of interest, being very disruptive.
Figure 4 also shows results for an alternative novel model that shares information between the noise and regression models (correlated BRRR; see the Supplementary material for a detailed description). The model includes a separate noise model for the structured noise, as in (1), but achieves the information sharing by assuming a joint prior for the noise and regression models. In detail, conditional on the noise model, the current residual correlation matrix between the response variables is used as a prior for the rows of $\Gamma$. This way the correlations between the target variables are propagated into the corresponding regression weights; however, the strongest noise components are not automatically coupled with the strongest signal components. Notably, the performance of the correlated BRRR model is very similar to the regular BRRR model that does not have any dependence between the noise and signal components.
Figure 4: Results of the simulation study with training sets of (a) 500 and (b) 2000 samples.
5.4 NFBC1966: metabolomics prediction
In this Section, the models accounting for latent noise are evaluated in terms of predictive performance on the NFBC1966 data with different training set sizes. Figure 5 presents the test data MSE for the different methods. With all training set sizes, the latent-noise BRRR outperforms the other methods. The blm performs worse than the baseline (null model) even with the largest training data set containing 3,761 individuals, and the BRRR without noise model requires the largest training set size in order to outperform the baseline. A paired t-test for the performance difference between the latent-noise BRRR and the independent-noise BRRR yields a p-value of 0.03, suggesting a statistically significant difference.
5.5 Differences between latent-noise BRRR and independent-noise BRRR on NFBC1966 metabolomics prediction
The two differences between our new approach, the latent-noise BRRR, and the independent-noise BRRR are (1) the model structure (the latent-noise BRRR shares parameters between the regression and noise models) and (2) using the latent signal-to-noise ratio parameter to regularize the model. In order to identify how these developments lead to the observed performance differences on the NFBC1966 data, we performed a sensitivity analysis for the two methods with respect to the assumed amount of variance attributed to the noise model.
Figure 6 presents the results of this sensitivity analysis. For the latent-noise BRRR, the assumed variance of the noise model, controlled by the a priori signal-to-noise ratio, affects performance in a consistent way, whereas for the independent-noise BRRR the impact appears random. If the performance difference stemmed mainly from controlling the variance of the noise model, controlling that parameter for both models should lead to similar results. On the other hand, if the difference in the model structure alone sufficed to explain the performance difference, the difference should not be sensitive to the variance of the noise model. Hence, we conclude that, on this data set, both the new model structure and regularization by using the latent signal-to-noise ratio are required for improved performance.
5.6 Evaluation of the chosen inference procedures for rank and noise parameters
Inference for the proposed model could naturally be done in several alternative ways. In this Section we justify the proposed inference procedure.
In the simulations (Section 5.3), sampling the maximum rank of the infinite prior worked well, measured in terms of predictive performance. Figure 4 shows that sampling the maximum rank actually improves performance, as compared to fixing it to the value used in the generative process, when the latent noise assumption is wrong (left end), both when $n = 500$ and when $n = 2000$. When the latent noise assumption holds (right end), the two inference procedures perform equally well. With the NFBC1966 data set (Figure 5), learning the maximum rank of the infinite prior by sampling (latent-noise BRRR, sample rank) or by cross-validation (latent-noise BRRR) results in very similar test set performances, similarly to the simulation experiment in Section 5.3. Hence, we conclude that for learning the maximum rank, both sampling and cross-validation are appropriate techniques. We also ran the independent-noise BRRR on these data so that the rank was sampled instead of being selected using cross-validation. However, the results were poor, with the test data MSE equal to 1.019 with $n = 1880$ and 1.005 with $n = 3761$; the corresponding lines were omitted from the figure for clarity. We hypothesize that the problems with the instability of the model (see Section 5.11) were accentuated when the rank was sampled.
The key parameter of our model, the latent signal-to-noise ratio, was estimated using cross-validation. In the simulations, the cross-validation based scheme allowed estimation of the latent signal-to-noise ratio to a reasonable accuracy. The estimated values are included in Figure 2 in the Supplementary material. While the latent signal-to-noise ratio of the generative process was 1/25, the estimated posterior latent signal-to-noise ratios remained close to this value in the parts of the domain where the percentage of correlated structured noise was 100-80%. When the percentage of correlated structured noise was 0-10%, the model correctly learnt a lower variance for the latent noise and a correspondingly stronger latent signal-to-noise ratio.
We also studied the performance of the latent-noise BRRR while sampling the variance of the noise model. A noninformative prior was assigned for the variance of $H$. The performance of this model is presented in Figure 6. The performance of the latent-noise BRRR when sampling the variance of $H$ is consistently worse than when using cross-validation to select the value of the latent signal-to-noise ratio. Hence, we conclude that, as opposed to the other parameters, cross-validation is needed to learn the latent signal-to-noise ratio to reach the improved performance.
5.7 NFBC1966: multivariate association detection
Detection of associations between multiple SNPs and metabolites is a topic that has received attention recently (see, e.g., Kim et al., 2009; Inouye et al., 2012; Marttinen et al., 2014). Here we demonstrate the potential of the new method in this task using two illustrative example genes for which the ground truth is available. Associations between the SNPs within two genes, LIPC and XRCC4, and the metabolites in the NFBC1966 data are investigated in the experiment. Note that the covariates (SNPs) used in this experiment are different from the ones used in the prediction experiment: here SNPs in individual genes are used, whereas in the prediction experiment all known lipid-associated SNPs were used. LIPC was selected as a reference because it is one of the most strongly lipid-associated genes. On the contrary, XRCC4 was discovered only recently using three cohorts of individuals (Marttinen et al., 2014), and it was selected to serve as an example of a complex association detectable only by associating multiple SNPs with multiple metabolites, and not visible using simpler methods.
We use the proportion of total variance explained (PTVE) as the test score (Marttinen et al., 2014), and sample 100 permutations to measure the power to detect the associations. Furthermore, we use downsampling to evaluate the impact of the amount of training data. For comparison, we select the BRRR, the exhaustive pairwise (univariate) linear regression ('lm'), and canonical correlation analysis (CCA) (Ferreira and Purcell, 2009), these being the methods that have been proposed for the task and that have a sensible runtime in putative genome-wide applications. For lm, the minimum p-value of the regression coefficient over all SNP-metabolite pairs, and for the CCA, the minimum p-value over all SNPs (each SNP associated with all metabolites jointly), are used as the test scores. The association involving the XRCC4 gene was originally detected using the BRRR model; however, unlike here, informative priors were used for the regression coefficients.
Table 1 presents the ranking of the original data among the permuted data with different sample sizes and methods. Ten MCMC chains were computed for both models to account for sampling variability in this difficult and relatively strongly collinear data. The association score was obtained by averaging over the scores of the different chains. As expected, all methods were able to detect the association involving LIPC with both training set sizes. However, the latent-noise BRRR had the highest power to detect the XRCC4 gene.
Table 1: Ranking of the original data among the permuted data for the XRCC4 and LIPC genes with different sample sizes.

Method                  | XRCC4, n=4702 | XRCC4, n=2351 | LIPC, n=4702 | LIPC, n=2351
latent-noise BRRR       | 0.98          | 0.94          | 1            | 1.00
independent-noise BRRR  | 0.41          | 0.32          | 1            | 0.99
lm                      | 0.62          | 0.74          | 1            | 1.00
cca                     | 0.20          | 0.24          | 1            | 1.00
5.8 Results: other real-world data sets
To thoroughly study the empirical value of the new method, we compared it to the alternative methods on macroeconomic time series prediction, on the metabolomics and gene expression prediction experiments on the DILGOM data set, and on fMRI response prediction. In these domains, explaining away structured noise is of crucial importance.
With the DILGOM data, the prediction of the weak effects was challenging for all methods. Indeed, we noticed that the null model using the average training data value for prediction was better than any other method in terms of MSE over all target variables, with the single exception of L1/L2 MTL, which set all regression coefficients to zero, thus reducing to the null model. However, a detailed investigation of the results revealed that while many of the target variables could not be predicted at all (as indicated by a worse-than-null-model MSE), some of the target variables could still be predicted better than the null model, and by focusing the analysis on the MSE computed over the predictable target variables (i.e., those that could be predicted better than the null by at least one method), comparisons regarding the model performances could still be made. For consistency, both metrics were computed also with the fMRI and econometrics data sets. To save computation time, we chose to evaluate only the cross-validation based variant of our model for the fMRI data, as this approach had already been identified as the most promising implementation of our method.
Table 2 and Supplementary Table 1 present the results of the macroeconomic time series prediction experiment, the metabolomics and gene expression prediction experiments on the DILGOM data set, and the fMRI response prediction experiments. The results have been normalized so that the score for the null model (prediction using the mean) is 1. Table 2 presents the results for the predictable target variables and Supplementary Table 1 presents the results obtained by averaging the test data MSE over all target variables.
The latent-noise BRRR outperforms the independent-noise BRRR consistently on the gene expression (on 8/10 folds), metabolomics (10/10 folds) and fMRI response prediction (2/2 folds) tasks on both scores. In the fMRI response prediction, the latent-noise BRRR and L1/L2 MTL are the only methods that outperform the null model. With the DILGOM metabolomics data, none of the methods outperformed the null model when averaged over all target variables, and when concentrating on the predictable target variables, only the latent-noise BRRR and KRR with Gaussian kernel were able to outperform the null model. With gene expression prediction, the latent-noise BRRR (sample rank), GFlasso and the kernel methods outperform the null model, the latent-noise BRRR being the best. The econometrics data set is the only case in which the independent-noise BRRR is more accurate than the latent-noise BRRR on both metrics, with the latent-noise BRRR being the second best method. In this data set, however, the effects appear rather strong, as the different methods explain up to 10-29% of the variance of the target variables.
On the small DILGOM data sets, L1/L2 MTL sets all regression weights to zero, as noted above. This demonstrates the need to develop new alternatives to L1/L2 regularization: when modeling weak effects on small data sets, using L1/L2 penalties can prevent the analysis altogether. Regularization by making the noise model stronger, as in the latent-noise BRRR, avoids this problem.
Table 2: Test data MSE on the predictable target variables (mean ± standard deviation over test folds), normalized so that the score for the null model is 1. Missing entries indicate methods that were not run on the corresponding task.

Method                          | econometrics      | DILGOM metabolomics | DILGOM gene expression | fMRI
latent-noise BRRR               | 0.73320 ± 0.22564 | 0.99990 ± 0.00057   | 1.00046 ± 0.00130      | 0.99798 ± 0.00282
latent-noise BRRR (sample rank) | 0.73453 ± 0.21219 | 1.00039 ± 0.00107   | 0.99995 ± 0.00100      | -
independent-noise BRRR          | 0.71072 ± 0.20549 | 1.00051 ± 0.00038   | 1.04163 ± 0.03781      | 1.00215 ± 0.00183
L1/L2 MTL                       | 0.75035 ± 0.15651 | 1.00000 ± 0.00000   | 1.00000 ± 0.00000      | 0.99786 ± 0.00090
GFlasso                         | -                 | 1.00010 ± 0.00106   | 0.99996 ± 0.00221      | -
KRR with linear kernel + PEER   | 0.88138 ± 0.11021 | 1.00093 ± 0.00057   | 0.99995 ± 0.00006      | 1.00236 ± 0.00112
KRR with Gaussian kernel + PEER | 0.90497 ± 0.09707 | 0.99985 ± 0.00016   | 0.99998 ± 0.00004      | 1.00649 ± 0.00179
BRRR without noise model        | 0.81818 ± 0.34747 | 1.00568 ± 0.00274   | 1.30795 ± 0.08802      | 1.06722 ± 0.05586
blm                             | 1.59040 ± 1.23041 | 1.04245 ± 0.00914   | 1.52573 ± 0.08859      | 2.08650 ± 0.25396
null model                      | 1.00000 ± 0.00000 | 1.00000 ± 0.00000   | 1.00000 ± 0.00000      | 1.00000 ± 0.00000
5.9 Results: simultaneous modeling of both latent and independent structured noise
As both latent and independent structured noise can be present simultaneously, we evaluated the possible gains from taking both noise types into account simultaneously. A model that incorporates both latent and independent structured noise, here called latent+independent-noise BRRR, was evaluated on the metabolomics prediction task on the NFBC1966 data and on the macroeconomic time series prediction task, the strong domains of the methods of interest.
Results of this experiment are presented in Table 3. In metabolomics prediction, accounting for both noise types improved the results slightly on the smallest training data size as compared to the best performing method, the latent-noise BRRR. On the larger training data sets, the more flexible latent+independent-noise BRRR model performed worse than the latent-noise BRRR that only accounts for latent noise. On the macroeconomic time series prediction task, accounting for both noise types improved performance, as compared to only accounting for the dominant noise type (independent structured noise), on the smaller training data set. In summary, even though slight performance improvements were seen with the smallest training set sizes, the results indicate that as the size of the training data set increases, the advantages disappear. We hypothesize that the potential underidentifiability issues discussed in Section 3.3 hinder model performance more than the increased flexibility improves it.







Table 3: Test data MSE (mean ± standard deviation) on the NFBC1966 metabolomics and macroeconomic time series prediction tasks, with larger and smaller training sets.

Method                        | NFBC1966 (larger) | NFBC1966 (smaller) | econ (larger)   | econ (smaller)
latent-noise BRRR             | 0.9833 ± 0.0077   | 0.9949 ± 0.0037    | 0.7536 ± 0.2143 | 0.8374 ± 0.1816
latent+independent-noise BRRR | 0.9840 ± 0.0072   | 0.9947 ± 0.0019    | 0.7445 ± 0.1918 | 0.7889 ± 0.1561
independent-noise BRRR        | 0.9849 ± 0.0078   | 0.9980 ± 0.0059    | 0.7339 ± 0.1977 | 0.8097 ± 0.2085
5.10 Improvement in computational efficiency resulting from the reparameterization of the model
To confirm the computational speed-up resulting from the reparameterization presented in Section 4, we performed an experiment where the algorithm implementing the naïve Gibbs sampling updates for the Bayesian reduced-rank regression (Geweke, 1996; Karlsson, 2012) was compared with the new algorithm that uses the reparameterization. Similar improvements were achieved with all the other BRRR models as well.
Ten simulated data replicates were generated from the prior. The number of samples in the training set was fixed to 5,000 and the number of target variables was set to 12. The rank of the regression coefficient matrix was 2. Runtime was measured as a function of the number of covariates, which was varied from 100 to 300; 1,000 posterior samples were generated. The new algorithm that reparameterizes the model clearly outperformed the naïve Gibbs sampler (Figure 7). As a sanity check, the regression coefficient matrices estimated by the two algorithms were compared and found to be similar.
5.11 Efficiency of the algorithm
To investigate the efficiency of the proposed algorithm and to compare it with the alternative methods, we recorded wall-clock run times on the NFBC1966 data set, shown in Figure 8. In addition, we studied conventional convergence diagnostics. To assess convergence and mixing, we ran four MCMC chains of 2,000 posterior samples each for every BRRR method, and computed averaged effective sample sizes (ESS) and potential scale reduction factors (PSRF) for 200 randomly selected parameters of the regression coefficient matrix (Gelman et al., 2004). These results are presented in Table 4.
All BRRR methods except independent-noise BRRR converge (PSRF < 1.1) and mix acceptably well. Independent-noise BRRR, however, showed poor mixing and convergence: in initial experiments, its PSRF did not necessarily ever reach values indicating convergence, even when sampled for 15,000 iterations. We therefore used the same number of MCMC iterations for every method in our experiments. The reason for the bad behaviour was multimodality of the posterior distribution, caused by the overly flexible structure of the independent-noise model, with the different chains converging to different modes. (A minimal sketch of how such diagnostics can be computed follows Table 4.)
Table 4: Convergence diagnostics for the BRRR methods, computed over 200 randomly selected regression-weight parameters (mean ± standard deviation): potential scale reduction factors (top) and effective sample sizes (bottom). The first column corresponds to the poorly mixing independent-noise BRRR discussed in the text; the labels of the remaining three columns, one of which is latent-noise BRRR, were not recoverable from the source.

PSRF:
1000 samples  4.46 ± 0.32  1.03 ± 0.03  1.06 ± 0.05  1.01 ± 0.004
2000 samples  3.42 ± 0.18  1.02 ± 0.02  1.05 ± 0.05  1.01 ± 0.003

ESS:
1000 samples  4.32 ± 0.32  43.88 ± 0.93  44.83 ± 0.25  40.74 ± 1.18
2000 samples  5.15 ± 0.66  84.39 ± 1.61  86.40 ± 0.43  77.77 ± 1.25
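For reference, such diagnostics can be computed with the R package coda; the following is a minimal sketch in which dummy Gaussian draws stand in for the actual posterior samples of the regression weights.

library(coda)
set.seed(1)
# Four chains of 2000 draws for 200 selected regression-weight parameters.
chains <- mcmc.list(lapply(1:4, function(i)
  mcmc(matrix(rnorm(2000 * 200), nrow = 2000, ncol = 200))))
# Potential scale reduction factor (point estimate) for each parameter.
psrf <- gelman.diag(chains, multivariate = FALSE)$psrf[, "Point est."]
# Effective sample size for each parameter, combined over the chains.
ess <- effectiveSize(chains)
c(mean_PSRF = mean(psrf), mean_ESS = mean(ess))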
To further demonstrate the difference between the latent-noise and independent-noise BRRR methods, we visualized the MCMC trace of the association metric used in Section 5.7. The instability of independent-noise BRRR is strikingly visible in Figure 9: the chains converge to different modes and mix very slowly. The latent-noise BRRR, in contrast, mixes adequately and always converges to the same mode, except for one of the ten chains with the XRCC4 gene, which converges to a mode with a lower value of the explained variance.
5.12 Results: summary of performance across the real data sets
To provide an overview of the performance of the different methods on the various data sets and tasks, the methods were ranked for each task/data set. For the prediction tasks, methods were ranked by the MSE on the test set; when none of the methods outperformed the null model, the scores on the predictable target variables were compared instead. In the association detection task, estimated statistical power was used as the ranking criterion. Table 6 presents the overview.
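Concretely, the ranking scheme amounts to the following sketch in R; the MSE values and method names are made up for illustration.

# Rank the methods within each task by test-set MSE (lower is better),
# then average the ranks across tasks.
mse <- rbind("method A"   = c(0.95, 0.98, 0.73),
             "method B"   = c(0.99, 1.00, 0.71),
             "null model" = c(1.00, 1.00, 1.00))
ranks <- apply(mse, 2, rank)   # per-task ranks
sort(rowMeans(ranks))          # average rank per method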
Averaged over all data sets and tasks, latent-noise BRRR outperforms the comparison methods. In particular, latent-noise BRRR outperforms independent-noise BRRR on all setups except the macroeconomic time series prediction task, where independent-noise BRRR is the best method and the two variants of latent-noise BRRR follow. The difference between latent-noise and independent-noise BRRR is consistent: it is present on 4/5 test folds in NFBC1966 metabolite prediction, 8/10 test folds in DILGOM gene expression prediction, 10/10 test folds in DILGOM metabolite prediction, and 2/2 folds in fMRI response prediction. In macroeconomic time series prediction, independent-noise BRRR is better on 218/395 test folds. In the association detection task, latent-noise BRRR has higher power with both training set sizes on the challenging XRCC4 gene (0.94 vs. 0.32 with n=2,351; 0.98 vs. 0.41 with n=4,702).
Simultaneously accounting for both latent and independent structured noise improves performance on the smallest training data sets considered in the macroeconomic time series and metabolomics (NFBC1966) prediction tasks, as compared to accounting for only one type of noise. With the larger training set sizes, however, the models with just the dominant noise type perform better than the model including both noise types simultaneously.
Selecting the rank of latent-noise BRRR by sampling or by cross-validation yields comparable performance: the average performance ranks of the cross-validation-based and sampling-based inferences are 2.5 and 3.5, respectively. For the NFBC1966 data set and the gene expression prediction task on the DILGOM data set, cross-validation performs better; for metabolomics prediction on DILGOM and the macroeconomic time series prediction, the sampling-based approach works better. Intriguingly, as in the simulations, the sampling-based variant of the model works better than the cross-validation-based approach in the presence of independent structured noise (macroeconomic time series prediction).
Latent-noise BRRR outperforms the null model in all test cases except metabolomics prediction on the DILGOM data. Even on that data set, however, the variant of the model that samples the maximum rank of the infinite prior outperforms the null model. We hypothesize that the poor performance resulted from convergence to an inferior mode of the posterior distribution; this can happen to latent-noise BRRR (as demonstrated in Figure 9), although the sharing of information between the signal and noise models makes it substantially more stable than independent-noise BRRR.
Table 6: Performance ranks of the methods on the individual data sets/tasks; each row lists the per-task ranks followed by the method's average rank. Methods with fewer entries were evaluated only on a subset of the tasks. Labels of the unmarked rows, and the ordering of the task columns, were not recoverable from the source.

(label missing)  1, 2, 2, 7, 2, 1; average 2.5
(label missing)  2, 3, 6, 1; average 3.0
(label missing)  3, 1, 7, 8, 4, 3; average 4.3
L1/L2 MTL        7, 4, 4, 5, 1; average 4.2
GFlasso          4, 5, 3; average 4.0
(label missing)  6, 6, 8, 2, 5; average 5.4
(label missing)  5, 7, 1, 4, 6; average 4.6
(label missing)  8, 5, 9, 9, 7; average 7.6
blm              10, 9, 10, 10, 8; average 9.4
null model       9, 8, 3, 6, 3; average 5.8
cca              4
lm               2
6 Discussion
In this work, we evaluated the performance of multiple-output regression under different assumptions about the structured noise. While most existing methods assume a priori independence between the interesting effects and the uninteresting structured noise, we started from the opposite assumption of strong dependence between the components of the model. This assumption may be deemed appropriate, for instance, for the molecular biological data sets often analyzed with such methods. Using simulations, we demonstrated the harmfulness of the independence assumption when latent noise is present. In real-data experiments, the model assuming latent noise outperformed state-of-the-art methods in predicting metabolite measurements from genotype (SNP) data and in fMRI response prediction, and showed consistently good performance across the different domains. In an illustrative multivariate association detection task, the latent noise model had increased power to detect associations invisible to the other methods. To better address the computational needs, we presented a new algorithm that reduces the runtime considerably and improves the scalability of the BRRR models as the number of variables increases. The prior distributions were parameterized in terms of the new concept of latent signal-to-noise ratio, which turned out to be a key ingredient for optimal model performance. In addition, the rotational unidentifiability of the model was resolved using ordered infinite-dimensional shrinkage priors. We also demonstrated that the two modifications made to the existing state-of-the-art noise modeling approach (the model structure and the regularization through the latent signal-to-noise ratio) were both needed to reach optimal performance.
In real data, both latent and independent structured noise can be present. We studied a model incorporating both types simultaneously and concluded, based on these results, that the possible gains in predictive power over modeling only the dominant type of noise were not worthwhile. In fact, results were found to degrade when both noise types were included, which we hypothesize to result from the poor identifiability of the corresponding model.
The new model implementing the concept of latent noise was studied using high-dimensional data containing weak signal (weak effects). The model exploits a ubiquitous characteristic of such data: while the interesting effects are weak, the noise is strong. Latent-noise BRRR borrows statistical strength from the noise model to facilitate learning of the weak effects, by automatically enforcing the regression coefficients of correlated target variables to be correlated. This intuitive characteristic can be seen as a counterpart of the powered correlation priors (Krishna et al., 2009) in the target variable space: Krishna et al. used the correlation structure of the covariates as a prior for the regression weights, enforcing correlated covariates to have correlated weights.
The latent-noise BRRR is an extension of several common model families. With the covariates removed, the model reduces to a standard factor analysis model, which explains the output data with underlying factors. The latent-noise BRRR can thus be seen as a reversed analogue of PCA regression (Bernardo et al., 2003): where PCA regression uses components of the input space as covariates in prediction, latent-noise BRRR predicts components derived from the output space using the covariates (see Bo and Sminchisescu, 2009). Allowing the noise term to affect the latent space directly yields interesting connections to linear mixed models (LMMs) and best linear unbiased prediction (BLUP) (Robinson, 1991): using the latent noise formulation, the model can explain away bias in the residuals, as in BLUP. LMMs, on the other hand, have a random term for each sample and target variable. While LMMs are not computationally feasible to generalize to high-dimensional targets, due to the number of random effect parameters and the associated inversion of a large covariance matrix, the latent-noise BRRR can be seen as a low-rank generalization of LMMs for high-dimensional target variables: the covariates are used for prediction in the latent space, and in this space there is a noise term for each sample and dimension. Therefore, the number of random effect parameters stays at $n \times K$, where $K$ is the dimension of the latent space, and inference remains tractable.
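Writing $p$ for the number of target variables and reusing the illustrative notation introduced after Table 3, the correspondence can be summarized as follows: a multivariate LMM carries a random-effect matrix $U \in \mathbb{R}^{n \times p}$, one term per sample and target,
$$Y = XB + U + E,$$
whereas the latent-noise model carries $H \in \mathbb{R}^{n \times K}$ with $K \ll p$,
$$Y = (X\Psi + H)\,\Gamma + E,$$
so the random-effect dimension, and with it the cost of inference, is reduced by a factor of roughly $p/K$.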
In summary, our findings extend the existing literature on modeling structured noise in an important way, by showing that structured noise can, and should, be taken advantage of when learning the interesting effects between the covariates and the target variables, and by showing how this can be done. Code in R for the new method is available for download at http://users.ics.aalto.fi/lgillber/latentnoise/
Acknowledgments
This work was financially supported by the Academy of Finland (the Finnish Centre of Excellence in Computational Inference Research COIN; grant numbers 259272 and 286607 to PM; grant number 257654 to MP; grant number 140057 to SK).
NFBC1966 received financial support from the Academy of Finland (project grants 104781, 120315, 129269, 1114194, 24300796, Center of Excellence in Complex Disease Genetics and SALVE), University Hospital Oulu, Biocenter, University of Oulu, Finland (75617), NHLBI grant 5R01HL087679-02 through the STAMPEED program (1RL1MH083268-01), NIH/NIMH (5R01MH63706:02), the ENGAGE project and grant agreement HEALTH-F4-2007-201413, EU FP7 EurHEALTHAgeing 277849, and the Medical Research Council, UK (G0500539, G0600705, G1002319, PrevMetSyn/SALVE).
The development and applications of the quantitative serum NMR metabolomics platform are supported by the Academy of Finland, the Sigrid Juselius Foundation, Strategic Research Funding from the University of Oulu, the British Heart Foundation, the Wellcome Trust and the Medical Research Council, UK.
Disclosure: AJK, PS and MAK are shareholders of Brainshake Ltd., a company offering NMR-based metabolite profiling.
References
 A. Bargi, R. Y. Xu, Z. Ghahramani, and M. Piccardi. A nonparametric conditional factor regression model for multidimensional input and response. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33, pages 77–85. JMLR W&CP, 2014.
 J. Baxter. A Bayesian/information theoretic model of bias learning. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, COLT ’96, pages 77–88, New York, NY, USA, 1996. ACM.
 J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West. Bayesian factor regression models in the "large p, small n" paradigm. Bayesian Statistics, 7:733–742, 2003.
 A. Bhattacharya and D. Dunson. Sparse Bayesian infinite factor models. Biometrika, 98(2):291–306, 2011.
 C. M. Bishop. Pattern recognition and machine learning. Springer, 2006.
 L. Bo and C. Sminchisescu. Supervised spectral latent variable models. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5, pages 33–40. JMLR W&CP, 2009.
 L. Bottolo, E. Petretto, S. Blankenberg, F. Cambien, S. Cook, L. Tiret, and S. Richardson. Bayesian detection of expression quantitative trait loci hot spots. Genetics, 189(4):1449–1459, 2011.
 L. Breiman and J. Friedman. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(1):3–54, 1997.
 M. Calus and R. Veerkamp. Accuracy of multi-trait genomic selection using different methods. Genetics Selection Evolution, 43(1):26, 2011.
 A. Carriero, G. Kapetanios, and M. Marcellino. Forecasting large datasets with Bayesian reduced rank multivariate models. Journal of Applied Econometrics, 26(5):735–761, 2011.
 R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
 O. Davis, G. Band, M. Pirinen, C. Haworth, E. Meaburn, Y. Kovas, N. Harlaar, S. Docherty, K. Hanscombe, M. Trzaskowski, et al. The correlation between reading and mathematics ability at age twelve has a substantial genetic component. Nature Communications, 5, 2014.
 A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems 19, pages 41–48, Cambridge, MA, 2007. The MIT Press.
 M. Ferreira and S. Purcell. A multivariate test of association. Bioinformatics, 25(1):132–133, 2009.
 R. Foygel, M. Horrell, M. Drton, and J. Lafferty. Nonparametric reduced rank regression. In Advances in Neural Information Processing Systems 25, pages 1637–1645, 2012.
 J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
 N. Fusi, O. Stegle, and N. Lawrence. Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies. PLoS Computational Biology, 8(1):e1002330, 2012.
 A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian data analysis. Chapman & Hall/CRC, 2004.
 J. Geweke. Bayesian reduced rank regression in econometrics. Journal of Econometrics, 75(1):121–146, 1996.
 Global Lipids Genetics Consortium. Discovery and refinement of loci associated with lipid levels. Nature Genetics, 45(11):1274–1283, 2013.
 M. Inouye, J. Kettunen, P. Soininen, K. Silander, S. Ripatti, L. S. Kumpula, E. Hämäläinen, P. Jousilahti, A. J. Kangas, S. Männistö, et al. Metabonomic, transcriptomic, and genomic variation of a population cohort. Molecular Systems Biology, 6(1), 2010.
 M. Inouye, S. Ripatti, J. Kettunen, L. Lyytikäinen, N. Oksala, P. Laurila, A. Kangas, P. Soininen, M. Savolainen, J. Viikari, et al. Novel loci for metabolic networks and multitissue expression studies reveal genes for atherosclerosis. PLoS Genetics, 8(8):e1002907, 2012.
 H. M. Kang, C. Ye, and E. Eskin. Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Genetics, 180(4):1909–1925, 2008.
 S. Karlsson. Conditional posteriors for the reduced rank regression model. Technical Report Working Papers 2012:11, Örebro University Business School, 2012.
 J. Kettunen, T. Tukiainen, A.-P. Sarin, A. Ortega-Alonso, E. Tikkanen, L.-P. Lyytikäinen, A. J. Kangas, P. Soininen, P. Würtz, K. Silander, et al. Genome-wide association study identifies multiple loci influencing human serum metabolite levels. Nature Genetics, 44(3):269–276, 2012.
 S. Kim and E. Xing. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genetics, 5(8):e1000587, 2009.
 S. Kim, K. Sohn, and E. Xing. A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics, 25(12):i204–i212, 2009.
 A. Klami, S. Virtanen, and S. Kaski. Bayesian canonical correlation analysis. The Journal of Machine Learning Research, 14(1):965–1003, 2013.
 G. Koop, R. W. Strachan, H. K. van Dijk, and M. Villani. Bayesian approaches to cointegration. In T. C. Mills and K. Patterson, editors, Palgrave Handbook of Econometrics, volume 1, pages 871–898. Palgrave Macmillan, 2006.
 A. Krishna, H. D. Bondell, and S. K. Ghosh. Bayesian variable selection using an adaptive powered correlation prior. Journal of Statistical Planning and Inference, 139(8):2665–2674, 2009.
 J. T. Leek and J. D. Storey. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3(9):e161, 2007.
 P. Marttinen, J. Gillberg, A. Havulinna, J. Corander, and S. Kaski. Genomewide association studies with highdimensional phenotypes. Statistical Applications in Genetics and Molecular Biology, 12(4):413–431, 2013.
 P. Marttinen, M. Pirinen, A.-P. Sarin, J. Gillberg, J. Kettunen, I. Surakka, A. J. Kangas, P. Soininen, P. O'Reilly, M. Kaakinen, M. Kähönen, T. Lehtimäki, M. Ala-Korpela, O. T. Raitakari, V. Salomaa, M. R. Järvelin, S. Ripatti, and S. Kaski. Assessing multivariate gene-metabolome associations with rare variants using Bayesian reduced rank regression. Bioinformatics, 2014.
 P. Rai, A. Kumar, and H. Daumé III. Simultaneously leveraging output and task structures for multiple-output regression. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 3194–3202. Curran Associates, Inc., 2012.
 B. Rakitsch, C. Lippert, K. Borgwardt, and O. Stegle. It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1466–1474. Curran Associates, Inc., 2013.
 P. Rantakallio. Groups at risk in low birth weight infants and perinatal mortality. Acta Paediatrica Scandinavica, Suppl. 193, 1969.
 G. K. Robinson. That BLUP is a good thing: The estimation of random effects. Statistical Science, 6(1):15–32, 1991.
 K.-A. Sohn and S. Kim. Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization. In N. Lawrence and M. Girolami, editors, International Conference on Artificial Intelligence and Statistics, volume 22, pages 1081–1089. JMLR W&CP, 2012.
 P. Soininen, A. Kangas, P. Würtz, T. Tukiainen, T. Tynkkynen, R. Laatikainen, M. Järvelin, M. Kähönen, T. Lehtimäki, J. Viikari, et al. High-throughput serum NMR metabonomics for cost-effective holistic studies on systemic metabolism. Analyst, 134(9):1781–1785, 2009.
 O. Stegle, L. Parts, R. Durbin, and J. Winn. A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Computational Biology, 6(5):e1000770, 2010.
 O. Stegle, C. Lippert, J. M. Mooij, N. D. Lawrence, and K. M. Borgwardt. Efficient inference in matrix-variate Gaussian models with iid observation noise. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 630–638. Curran Associates, Inc., 2011.
 O. Stegle, L. Parts, M. Piipari, J. Winn, and R. Durbin. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nature Protocols, 7(3):500–507, 2012.
 M. Stephens. A unified framework for association analysis with multiple related phenotypes. PLoS ONE, 8(7):e65245, 2013.
 J. H. Stock and M. W. Watson. Forecasting with many predictors. In Handbook of Economic Forecasting, volume 1, pages 515–554. Elsevier, 2006.
 T. Teslovich, K. Musunuru, A. Smith, A. Edmondson, I. Stylianou, M. Koseki, J. Pirruccello, S. Ripatti, D. Chasman, C. Willer, et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature, 466(7307):707–713, 2010.
 S. Virtanen, A. Klami, and S. Kaski. Bayesian CCA via group sparsity. In L. Getoor and T. Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 457–464, New York, NY, 2011. ACM.
 W. Wang, V. Baladandayuthapani, J. Morris, B. Broom, G. Manyam, and K. Do. Integrative Bayesian analysis of highdimensional multiplatform genomics data. Bioinformatics, 29(2):149–159, 2012.
 L. Wehbe, B. Murphy, P. Talukdar, A. Fyshe, A. Ramdas, and T. Mitchell. Simultaneously uncovering the patterns of brain regions involved in different story reading subprocesses. PLoS ONE, 9(11):e112575, 2014. doi: 10.1371/journal.pone.0112575.
 C. Xu, X. Wang, Z. Li, and S. Xu. Mapping QTL for multiple traits using Bayesian statistics. Genetical Research, 91(1):23–37, 2009.
 N. Yi and S. Banerjee. Hierarchical generalized linear models for multiple quantitative trait locus mapping. Genetics, 181(3):1101–1113, 2009.