# Estimating Bayesian networks for high-dimensional data with complex mean structure and random effects

###### Abstract

The estimation of Bayesian networks given high-dimensional data, in particular gene expression data, has been the focus of much recent research. Whilst there are several methods available for the estimation of such networks, these typically assume that the data consist of independent and identically distributed samples. However, it is often the case that the available data have a more complex mean structure plus additional components of variance, which must then be accounted for in the estimation of a Bayesian network. In this paper, score metrics that take account of such complexities are proposed for use in conjunction with score-based methods for the estimation of Bayesian networks. We propose firstly, a fully Bayesian score metric, and secondly, a metric inspired by the notion of restricted maximum likelihood. We demonstrate the performance of these new metrics for the estimation of Bayesian networks using simulated data with known complex mean structures. We then present the analysis of expression levels of grape berry genes adjusting for exogenous variables believed to affect the expression levels of the genes. Demonstrable biological effects can be inferred from the estimated conditional independence relationships and correlations amongst the grape-berry genes.

Bayesian network; complex mean structure; exogenous variable; grape-berry gene expression; regulatory network; score-based metric; variance components.

## 1 Introduction

The inner workings of a cell are very complex, with many interacting components. Determining how the genes within a cell interact with each other is an important, but difficult, field of research, often requiring the application of advanced statistical methods. Systems of these gene interactions are known as genetic regulatory networks, and the extent to which such networks may be inferred from observational gene expression data remains largely undetermined. To explore this question carefully and quantitatively, high-dimensional multivariate models, including Bayesian networks, need to be considered. The use of Bayesian networks for the modelling of genetic regulatory networks has been discussed by several authors: see for example [3, 7, 8, 18]. Their popularity lies in the provision of a flexible framework for the estimation of conditional dependence relationships, thereby providing a means to estimate a covariance matrix given a high-dimensional sample when maximum likelihood methods are unavailable, [5]. Estimation of such structures allows insight into how the expression levels of large groups of genes are related to one another, which, in turn, should help shed light on genetic regulatory networks involving the genes.

For the most part, it has been assumed that the data used to estimate the networks are independent and identically distributed. In the present paper, we consider the important case where the assumption of independent and identically distributed samples is not satisfied, and propose new methods to allow for the estimation of effects of interest given such complexity. Our theoretical development has been motivated by an observational time course microarray study, involving expression levels of grape-berry genes observed over time and known to be associated with changes in temperature. The grapes were sampled from three vineyards in different regions of southern Australia and data on the ambient temperatures during the times leading up to the picking of each sample of grapes was also measured. We want to investigate the conditional dependence structure of the genes, adjusting for the exogenous effects of temperature and vineyards, and we aim to do this through the estimation of a Bayesian network. If the effect of temperature is unaccounted for in the estimation of a Bayesian network for these genes, because of their common relationship with temperature, many pairs of genes will exhibit strong correlations. Unless the gross effects of vineyard and temperature are removed, one cannot hope to detect more subtle associations between genes.

There are many methods available for the estimation of Bayesian networks given microarray and other high-dimensional datasets, and these may be divided into two broad categories, namely, score-based and constraint-based methods, [22]. Score-based methods attempt to maximize some score metric associated with the estimated Bayesian network, whilst constraint-based methods estimate conditional independence relationships directly from the data, and combine these to form a Bayesian network. Constraint-based methods test for conditional independence relationships, so the networks obtained through their application can be quite sensitive to Type I and Type II errors, particularly when the sample sizes are small. Score-based methods on the other hand are not as sensitive to small sample sizes, and instead of finding the best local structure for each node, find the best global structure given the data, often resulting in more parsimonious models. Given that gene expression data sets tend to be high-dimensional with the attendant ‘small , large ’ problem, we approach the problem of Bayesian network estimation from a score-based perspective, and extend these to include exogenous variables and dependent data.

The outline of the paper is as follows. In Section 2, Bayesian networks and score metrics are briefly reviewed, and our two new score metrics for datasets with complex mean structure and random effects are presented. The new score metrics are used to estimate Bayesian networks for simulated datasets with a known complex mean structure in Section 3.2, and then applied to the analysis of the grape-berry gene expression data in Section 4. In Section 5, we present a brief summary of our overall findings.

## 2 Bayesian networks and Score Metrics

### 2.1 BGe, the basic Bayesian score metric

Consider a random vector . A Bayesian network for consists of two components: a directed acyclic graph with , often written as , and assumed conditional distributions . The set is the set of parents of in and is the set of parameters associated with the conditional distributions. The graph and conditional distributions then specify a joint distribution for :

(1) |

Bayesian networks encode information about the conditional independence relationships between the variables in . The directed Markov properties, as described in [17], for example, allow conditional independence statements about to be read from the graph . Additionally, when the available data set is high-dimensional, and maximum likelihood estimation of the covariance matrix of is unavailable, estimation of a Bayesian network allows the estimation of the covariance matrix.

Here we consider a vector of pre-processed, normalized expression levels for genes, and suppose that , where is unknown. We consider a data set , where is the vector containing the samples of the expression levels of gene . The estimation of a Bayesian network for given this data set consists of learning the structure of a directed acyclic graph encoding the conditional independence relationships between the variables, and the estimation of the parameters .

As described in the Introduction, a score-based approach to learning the structure of the graph encoding the conditional dependence relationships of is taken. After deciding on a score metric, we want to find a graph that maximises that score metric, and an obvious choice is the likelihood function of directed acyclic graphs given the data. The maximum likelihood score turns out not to be a good idea, however, as it inevitably assigns the complete graph, encoding no conditional independence relationships, the maximum score. For more detail, readers are directed to Section 18.3 of [15].

To avoid the problems with overfitting associated with the maximum likelihood score, the Bayesian score was developed. Following [9] among others, the Bayesian score metric for the estimation of a directed acyclic graph given some data set is proportional to the posterior probability of that graph:

(2) |

Here the focus is on the second component of this score, the marginal model likelihood of the data given the graph , where the density of given and is assumed to be an -dimensional normal density, with mean vector and covariance matrix . As per Equation (1), this joint density may be decomposed into a product of conditional densities. When the data set consists of independent and identically normally distributed samples, , and

Given normal-inverse gamma priors for each , or an equivalent inverse Wishart prior on , the score metric of Equation (2) can be written as the product of the prior density on the space of directed acyclic graphs, , and multivariate densities. This score metric, only appropriate in the case of independent and identically distributed samples, is known as the BGe metric: “Bayesian metric for Gaussian networks with score equivalence”.

### 2.2 BGeCM, the score metric for data sets with complex mean structure

We now consider the case of a more complex data set , that does not consist of independent and identically distributed samples, such as the grape-berry gene data set described in Section 1. As explained there, contained within that data set is information about exogenous variables thought to affect the expression levels of the genes under study. Given such a data set, we now express the model for the vector of expression levels of gene as

(3) |

where is the -vector of the effects of the exogenous variables on gene , are the parameters associated with the (as yet to be selected) prior distribution for , and is the matrix containing the data associated with the exogenous variables. It can be seen that in this specification, we retain linear dependence upon expression levels of parent genes, but now, in addition to that dependence, more complex sampling schemes and the influence of exogenous variables are accounted for through the linear dependence of expression levels upon .

Including exogenous variables in the estimation of a Bayesian network is important in order to obtain an unbiased estimate of the conditional dependence relationships between the genes of interest. For example, the expression levels of two genes may both be dependent upon changes in an exogenous variable, but conditionally independent of each other. If dependence upon exogenous variables is not accounted for, an edge between these two genes is likely to be present in an estimated graph. By accounting for the effects of exogenous variables, we can have more confidence that the conditional dependence relationships obtained represent actual dependence relationships, and are not due to common relationships with exogenous variables.

As can be seen by the definition of the Bayesian score metric given by Equation (2), a joint prior distribution for and is required for the calculation of a score metric. Care is required in the specification of this prior distribution, since if priors are not properly selected, a score metric that gives different scores to directed acyclic graphs encoding equivalent conditional independence restrictions will be induced. A score metric that does not discriminate between equivalent directed acyclic graphs is called an equivalent score metric. Discrimination between equivalent graphs is tantamount to assigning causal meaning to the directed edges of , and since the emphasis here is on the estimation of graphs given observational data, the assignation of causal meaning to the estimated relationships is not appropriate.

Extension of the results in [10] to our model indicates that the joint prior distribution for given must have a normal-inverse gamma form, and that the effects of the exogenous variables on one gene must be a priori independent of the effects upon another gene, for the induced score metric to satisfy equivalence. The following system of priors are used:

There are several possible assumptions about the form of the prior distribution of the variance of the random effects for gene , . Among other choices, the variance of the random effects could be assumed known, a uniform prior could be placed on , or an inverse gamma prior could be placed on . However, by an extension of the results in [10], when , any choice of prior distribution on will result in a marginal model likelihood without a closed form, requiring numerical integration to compute and slowing down computations.

A simulation study in [14] showed that the learnt network structure is quite robust to the misspecification of the prior density for , provided the magnitude of is correctly specified. Hence, for computational simplicity, the following prior for the variance of the effects of exogenous variables is used:

where is some positive parameter that is constant from gene to gene. If , then and are independent and identically distributed. Taking implies that the are less variable than the , while implies that the are more variable than the . If is taken to be very large, this is equivalent to assuming that the effects of exogenous variables do not contribute much to the overall variability of the expression levels of genes.

Although it will be application dependent, it may be that the assumption the variance of the exogenous variables is related to the variance of the regression parameters in the same way for each gene is not be valid. In this situation, a separate could be specified for each gene, but such specification would require information that is most probably unavailable. Alternatively, a hyperprior distribution could be placed upon the . However, any choice of such a distribution would lead to a score metric without an exact form, again requiring numerical integration to compute.

When , the marginal model likelihood for a particular random variable given its parents in the graph , can be shown to be

with

(4) |

We call the resultant score metric the BGeCM metric: “Bayesian metric for Gaussian networks having score equivalence for data sets with a complex mean structure”.

Posterior distributions of the parameters , and allow a detailed analysis of the relationships between random effects and the expression levels of the genes of interest. The posterior distributions of given , given and are given by

where is as given in Equation (4). Further,

and

Note that instead of using a score metric as developed above to allow for the inclusion of exogenous variables in the model, an extended directed acyclic graph could be learnt, where exogenous variables are included as vertices in the graph. There are however, a couple of difficulties presented by such an approach. The first is that if the exogenous variables are discrete, methods for Bayesian networks on both continuous and discrete variables are required. Additionally, many algorithms for learning directed acyclic graphs incorporate sparsity constraints, and if it is believed that many of the genes are affected by these exogenous variables, these sparsity constraints will require modification.

### 2.3 Removal of random effects through analysis of residuals

In the derivation of the BGeCM score metric, it was assumed that the effects of exogenous variables on gene expression were of intrinsic interest. However, in many situations, the effects of exogenous variables can be thought of as nuisance variables, complicating the estimation of Bayesian networks for the given gene expression levels. It may be desirable to ignore the possible influences of such effects upon gene expression levels, and on the relationships between genes. Of course, simply ignoring such effects is not recommended. Instead, we develop a non-parametric approach that adjusts for the effects of exogenous variables, without making assumptions about the form of their distributions. This approach, instead of directly using the gene expression data, is based upon the use of linear combinations of residuals left over after the data is regressed upon the effects of the exogenous variables. We call this the “residual approach”, and it is inspired by the restricted maximum likelihood procedure used in inference for mixed linear models; see for example Section 12.2 of [2], or Speed [21], which provides a good overview of REML.

The utility of the residual approach is that it makes no assumptions about the distributional form of the random effects of interest. Since no such assumptions are made, the approach is correct no matter what the true distribution of the random effects may be. Hence, in situations when the assumption that is not satisfied, the residual approach provides a useful alternative to the BGeCM score metric, and, as we demonstrate below, is considerably easier to implement.

We consider an random variable , where is an matrix such that

Hence,

and the score metric associated with this set of marginal model likelihoods is invariant to the choice of . Implementation of the residual approach to the estimation of Bayesian networks is therefore simple: after selection of an appropriate matrix and computation of for , the BGe score metric may be applied to this reduced data set in conjunction with the score-based method of choice.

A drawback of the residual approach is that posterior estimates of the random effects are not admitted. However, any potential loss of information about the underlying covariance matrix when the residual approach is used, compared to the ‘full’ BGeCM score metric, has been investigated in [13], and found to be typically small.

## 3 Numerical study of BGeCM and the residual score metrics

### 3.1 Implementation of BGeCM and the residual score metrics

In this section, the necessity of score metrics that take account of complex mean structure are demonstrated through the application of the residual approach and the BGeCM score metric to simulated and real data sets.

First, a note on implementation. The BGeCM score metric and the residual approach may be incorporated into any score-based algorithm for the estimation of Bayesian networks, without the need for any additional programming. In the case of the residual approach, all that is required is the calculation of the matrix , satisfying the conditions in Equation (2.3). Then, instead of inputting into the algorithm of choice, the augmented data set is input. Similarly, when the BGeCM score metric is used, an augmented data set will be the input into the algorithm, where is a matrix such that , where is as given in Equation (4).

Here we apply the residual approach and BGeCM score metrics in conjunction with the high-dimensional Bayesian covariance selection algorithm, [4], a score-based method for the estimation of Bayesian networks. This algorithm works by constructing and combining regression models for each .

### 3.2 Simulated data sets

Example 1: In this first example, 10 data sets were generated according to the following system of linear recursive equations:

The values of were obtained by sampling from an Inverse Gamma distribution, and are constant for each of the samples generated. Similarly, , , are fixed across data sets, obtained by sampling from

corresponding to . The non-zero mean structure of this example corresponds to two groups, and the true underlying directed acyclic graph is the empty graph.

Example 2: The system of linear recursive equations governing this example is

Ten data sets were generated according to this system of equations, and the parameters , and were assumed constant across these data sets. The values of these parameters were obtained by sampling from the following distributions:

Similarly, the random effects , were constant across the 10 data sets generated, obtained by sampling from

again corresponding to .

The true model for each variable may be written as

where

and

In this case, elements of the matrix consist of random samples from the standard normal distribution, treated as known constants in the analysis, and the true underlying graph has three edges.

Bayesian networks were estimated for each of the data sets generated according to Examples 1 and 2, using the BGe and BGeCM score metrics and the residual approach in conjunction with the High-dimensional Bayesian Covariance Selection algorithm. After assessing the performance of the BGeCM score metric under ideal conditions, we assess the sensitivity of this metric to the misspecification of .

For each of these analyses, the number of spurious and correct edges in the highest-scoring network found by the algorithm was recorded. The results are summarized in Tables 1 and 2. Table 1 gives the mean and standard deviation of the number of spurious and correct edges in the highest-scoring Bayesian networks found when BGe, BGeCM and the residual approach, with set at the correct value, that is , are used to estimate the true graph encoding the conditional independence relationships. Table 2 gives the numbers of correct and spurious edges when BGeCM is used given a range of values of .

Example 1 | Example 2 | ||
---|---|---|---|

Score Metric | Spurious Edges | Correct Edges | Spurious Edges |

BGe | 117 2(5 25) | 1 8(0 63) | 2 4(1 17) |

BGeCM | 1 0(0 94) | 2 2(0 92) | 0 4(0 70) |

Residual Approach | 1 7(1 16) | 1 1(0 32) | 0 0(0 00) |

Example | 00001 | 0001 | 001 | 01 | ||
---|---|---|---|---|---|---|

1 - Spurious | 1 4(097) | 14(097) | 14(097) | 13(082) | 10(082) | 14(107) |

2 - Spurious | 0 9(110) | 08(092) | 10(082) | 09(099) | 04(070) | 06(070) |

2 - Correct | 1 9(057) | 22(063) | 22(063) | 19(074) | 22(092) | 21(074) |

Example | 1000 | |||||

1 - Spurious | 3 8(239) | 158(346) | 456(576) | 846(417) | 1017(442) | 1194(288) |

2 - Spurious | 0 4(052) | 11(088) | 13(116) | 21(088) | 23(125) | 22(148) |

2 - Correct | 2 5(053) | 21(074) | 22(079) | 20(082) | 18(079) | 20(082) |

Comparing the results obtained when the BGe score metric is used to analyse the simulated data sets demonstrates the utility of both the BGeCM score metric and the residual approach. The two new score metrics result in the estimation of structure which is much closer to the true structure. In addition, Table 2 shows that the results obtained from the BGeCM score metric are quite robust to the misspecification of , producing accurate results when the value of selected departs as much as one or two orders of magnitude from its true value. As gets larger, the highest scoring graphs obtained become more and more similar to those obtained when the BGe metric is used. This is a result of the fact that as approaches , the limit of the BGeCM metric is the BGe metric, [13].

## 4 Analysis of the grape-berry microarray data

The data analysed here consisted of samples of gene expression levels for grape genes measured over a four-week period. The gene expression levels were derived from grape-berry tissue samples grown in three different vineyards in three different wine-growing regions of southern Australia. Twenty samples were taken from a vineyard in Clare, from the Wingara Vineyard in Mildura and from a vineyard in Willunga. Table 3 provides the reference numbers for the grape genes, together with a brief summary of their functions. All the genes in Table 3 are known to code for heat shock proteins (HSPs), [24], which are responsible for protecting the grapes against heat-induced stress. In addition to data on the gene expression levels, temperature in degrees celcius was recorded during the time leading up to the picking of the grapes.

These data are part of a larger dataset on grape-berry tissue samples measured between 2003 and 2005. At each vineyard, grapes were sampled roughly weekly over the period of development of the berries, i.e., from the time buds formed on the vines, to the time when the grapes were ripe. In general, grape berries follow a double sigmoidal pattern of growth that consists of two distinct growth phases, with a lag period between these phases (see [1, 19]). The second stage of grape-berry growth commences upon the occurrence of veraison, when the grape berries start to change colour. Robinson and Davies, [19], suggest that at veraison and during ripening, there are many changes in the expression levels of many different genes in grape berries. The observed developmental time period from bud formation to grape ripeness differed between vineyards in the present study. The shorter four-week time period we analysed occurred after fruit set, but well before veraison for all three vineyards. We restricted attention to samples corresponding to the third to seventh sampling weeks at each vineyard because the relationships between genes are thought to be more stable during this period, and the modelling assumption of identically distributed samples is therefore more likely to be valid.

mRNA expression levels for each of the grape tissue samples was measured using Affymetrix Vitis vinifera oligonucleotide arrays. Background subtraction and normalisation was carried out using robust microarray analysis (RMA), as described in Irizarry et al, [11]. Note that all samples were processed at the same laboratory.

Understanding the stress tolerance mechanisms of plants is important, and the heat shock protein network, as discussed by [16] and [24], is very complex. The heat shock protein network of plants is believed to consist of interactions between small Hsps, Hsp60, Hsp70, Hsp90 and Hsp100, [24]. Precisely how Hsps interact with one another and how they protect against heat stress is not yet completely understood, and here we seek to gain some insight into the heat shock protein network by examining the conditional dependence structure of the genes given in Table 3.

Ref | Affymetrix | NCBI | Protein Identity |
---|---|---|---|

1 |
1616246at | Vvi.9142 | Heat shock protein 70, ATP binding |

2 | 1607002at | Vvi.4801 | Heat shock protein 70, ATP binding, |

luminal binding protein, glucose regulated | |||

3 |
1610684at | Vvi.2869 | chloroplast HSP 70-1, ATP binding |

4 |
1611740at | Vvi.295 | unknown |

5 |
1620985at | Vvi.4530 | HSP21 chloroplast |

6 |
1616995at | Vvi.23518 | Ubiquitin conjugating enzyme 4e |

7 | 1614132at | Vvi.863 | |

8 | 1618265at | Vvi.15427 | |

9 |
1608052sat | Vvi.9085 | HSP81(early response to dehydration) |

10 | 1618009at | Vvi.9085 | |

11 | 1619931sat | Vvi.7394 | |

12 |
1608701at | Vvi.2083 | 10 kDa chaperonin |

13 |
1608164at | Vvi.6787 | Cytosolic class II 17.6 HSP |

14 | 1611052at | Vvi.6787 | |

15 | 1611192at | Vvi.6787 | |

16 | 1610032at | Vvi.6787 | |

17 | 1614330at | Vvi.6787 | |

18 |
1620956at | Vvi.3921 | 17.6 kDa class I small HSP |

19 | 1616538at | Vvi.7869 | |

20 | 1609554at | Vvi.7044 | |

21 | 1620960aat | Vvi.7044 | |

22 | 1621652at | Vvi.4464 | |

23 |
1622165at | Vvi.6156 | 17.4kDa class I small HSP |

24 | 1612385at | Vvi.4422 | |

25 |
1622628at | Vvi.5040 | 17.4kDa class III small HSP |

26 |
1610700at | Vvi.2537 | 23.6K mitochondrial small HSP |

Given the known functions of the genes considered in this study and the climatic and geographic disparities between the regions where the grape berries were sampled, it would incorrect to ignore the effects of vineyard and temperature in the estimation of a Bayesian network for the grape genes. The essential point is that if the expression levels of these genes are strongly influenced by these exogenous variables, then accounting for variation due to such variables in the estimation of a Bayesian network should result in a network that more accurately encodes the dependence structure of the genes. A further important point is that given the grape gene expression levels analysed are observational data, causal interpretations should not be applied to the directed edges present in any network estimated given the data. Hence, moralized versions of directed acyclic graphs are used to summarize conditional independence relationships of the grape genes.

To begin, the initial (null) model omitted the effects of vineyard and temperature on the expression levels of the genes. That is, if is the 50-vector of the expression levels for grape gene , it is assumed that

where is a matrix. The columns of this matrix consist of the expression levels of the grape genes in the dataset that the expression level of gene is dependent upon, and is the effect of the expression level of gene on the expression level of gene . Following the analysis of Affymetrix gene expression data in [4], and .

The highest scoring Bayesian network found through the application of the high-dimensional Bayesian covariance selection algorithm to the full dataset (ignoring the exogenous variables of vineyard and temperature) has edges, and the moralized version has edges. The moralized version is shown in Figure 1(a).

Next, graphs were estimated separately for each of the three vineyards. The highest-scoring directed acyclic graphs obtained for the Clare, Wingara and Willunga vineyards had, respectively, and edges. These graphs were quite different from one another, with the three graphs having only two edges in common, the Wingara and Clare graphs sharing eight edges, and the Willunga graph having three edges in common with the Wingara and Clare graphs. Given the paucity of the data and the complexity of the models, this lack of concordance between the graphs obtained separately for each vineyard is not surprising.

In order to make more efficient use of the data, models incorporating data from all three vineyards simultaneously were then considered.

The question of how best to include temperature and vineyard effects in the model for gene expression was investigated using linear regression models with forward and backward selection. The largest model fitted for each gene contains separate intercepts for the data from each vineyard, and terms for each of the temperatures recorded and minutes before the grapes were picked. We also considered the model including vineyard and temperature main effects and two-way temperature interactions. For the full model with interactions, it was observed that the adjusted of many of the regressions was above , indicating that some over-fitting was taking place. We therefore exclude two-way temperature interaction effects in what follows.

Results of the stepAIC function in R, [23], indicate that there is not a single backwards elimination step that would apply to all genes. That is, each of the vineyard or temperature variables is significant in at least one of the regression models estimated. In any case, use of separate regression models for each gene is beyond the scope of the present score metrics. As such, the largest model considered is as follows:

where and is the effect of vineyard on the expression level of gene , are the temperature effects.

Histograms of the marginal standard deviations of the expression data for each gene, and the residual standard errors from the regressions containing vineyard, and vineyard and temperature as covariates, are shown in Figure 2. Note that there are three genes with very small standard deviations. These plots show that vineyard variables account for only some of the variation in the gene expression levels. Changes in temperature and vineyard account for much more of the observed variation, but there remains some residual variation to be explained. On the basis of these histograms, we expect that the graph obtained when only vineyard is included as an exogenous variable will be somewhat similar to that obtained when the exogenous variables are ignored, whilst we would expect to see a reasonably different structure when both vineyard and temperature are accounted for.

In accounting for the effects of vineyard and temperature, we find high-scoring Bayesian networks using the BGeCM score metric, first fitting the model with vineyard effects only, then the model with vineyard and main temperature effects. The highest-scoring Bayesian networks found for and were recorded and their moral graphs summarized in Table 4. It can be seen that as more covariates are included in the model, more of the variation in the expression levels of the grapes is explained, and the highest-scoring graphs obtained have fewer edges. Edges that are removed as more exogenous variables are included in the model can be interpreted as being explained by common relationships of genes with these additional covariates.

The BGeCM score metric assumes that the effects of the exogenous variables are independent and identically distributed, an assumption that must be questioned. The effects of temperature and vineyard are almost certainly not iid. However, there is little information available to provide a useful estimate of the covariance structure of the effects of these exogenous variables. Therefore, the residual approach, which makes no assumptions about the covariance structure of the effects included in the model, is preferred here.

BGeCM | ||||

Residual | ||||

Included Covariates | ||||

Vineyard | 66 | 76 | 89 | 68 |

Vineyard and temperature | 57 | 63 | 68 | 41 |

The number of edges in the moralized versions of the highest-scoring networks obtained using the residual method are summarized in Table 4. When only the effects of vineyards are included in the model, the results obtained using the residual method are similar to those obtained when the BGeCM score is used, as expected on the basis of the histograms in Figure 2. When the effects of temperature are included in the model, the residual method produces high-scoring graphs with fewer edges than the BGeCM score metric. This indicates that whilst the BGeCM score metric may account correctly for the covariance structure of the effects of vineyard, the effects of temperature may have a more complicated variance structure, that is not adequately modelled by the iid assumption.

The moralized graphs obtained from the residual method are displayed in Figure 1. These graphs, drawn using GraphViz, [6], show that as more of the variation in gene expression due to exogenous sources is accounted for in the model, the moral graphs of the highest-scoring networks obtained have fewer edges. The graph obtained by including both temperature and vineyard as exogenous variables, Figure 1(c), is preferable to that obtained when only vineyard is included, Figure 1(b). For most genes, very little variation in the gene expression values is accounted for by the relationship with vineyard alone.

There are a number of interesting features to be observed in graph Figure 1(c), which is the graph obtained when both vineyard and temperature effects are accounted for in the model. We observe that seven nodes in this graph are completely disconnected from all other nodes, which implies that once relationships with temperature and vineyard have been accounted for, the expression levels of each of these genes are independent of the expression levels of all other genes. (Recall that absence of an edge between two genes in Figure 1(c) indicates that the expression levels of these nodes are independent, a relationship which can be refined through application of Markov properties.)

It is apparent that three of these seven disconnected nodes, corresponding to genes , and , are already disconnected from the rest of the graph when only vineyard is included in the model; see Figure 1(b), where it is observed that these are the only three unconnected nodes. The expression levels of these three genes are observed to have the lowest standard deviations of all the genes, at and respectively, and are in fact the three genes at the left-most end of the rug in Figure 2(a). When these three genes are regressed on vineyard, the residual standard deviations are even smaller ( and ). In other words, there is no variation in the expression levels of these genes to be explained by relationships with other genes, so it is not surprising that they are unconnected in the graph. Very small gene standard deviations can be problematic in microarray data analysis, and methods have been proposed for adjusting the standard deviation estimates upwards by adding a constant term or by application of empirical Bayes methods when constructing -tests, for example, [20]. Such adjustment is beyond the scope of our present analysis however. Note that the fact the three genes are connected in Figure 1(a) is suggestive of overfitting when the BGe metric is used.

Genes , and are also disconnected in the final graph, Figure 1(c), and correspond to Hsp81, which is an early response to dehydration. According to the KEGG data base, [12], they are predicted to be similar to Hsp90. The role of these sets of genes in the heat shock network of grapes is not entirely understood. The role of Hsp81 proteins in Arabidopsis thaliana, more commonly known as thale cress, has been discussed in [25], who note that an increase in the expression level of Hsp81-1 is possibly caused by a regulatory pathway other than the heat shock pathway. Our analysis supports this finding for Vitis Vinifera, indicating that Hsp81 may not be implicated in the heat shock protein network of grapes, at least over the four-week time period studied. We have established that variation in the Hsp81 genes is accounted for directly by the effects of the exogenous variables, and that they are uncorrelated with the other HSPs in the final graph.

The seventh gene, number , which is a mitochondrial small Hsp, is not implicated in the final network either. This gene is the only mitochondrial gene considered. That it is unconnected from the rest of the network indicates that variation in this gene is explained purely by exogenous temperature and vineyard effects, and is not dependent upon any of the other genes in the dataset. This suggests that the mitochondrial HSP are not regulated in the same way as other cellular HSPs.

On the whole, relatively little is known about the heat shock regulatory network for grapes. Typically in the representation of the heat response network for plants, relationships between classes of genes, such as small Hsps or Hsp70s are discussed, [24]. The graph obtained here provides a good starting point for the development of a finer structure, which can then be further developed. The edges between the genes in Figure 1(c) can be interpreted as encoding conditional dependence relationships. This graph is the moralized version of the directed acyclic graph found, and more detail is available through consideration of the class partially directed acyclic graph, or class PDAG, of the underlying directed acyclic graph. However, since we are analysing observational data, we will consider here edges in the moralized version of the graph only. We observe that there are two pairs of genes connected by a single edge. The first pair of nodes is , representing a chaperone gene (gene ) which promotes the folding and unfolding of proteins and a class I small HSP (gene ). This is an undirected edge, but the two genes are correlated after adjusting for the exogenous variables, and the chaperone gene has no other connecting edges. The second pair of connected nodes is , representing a glucose regulated HSP (gene ) and the ubiquitin conjugating enzyme 4e (gene ); again these two genes are correlated after adjusting for the exogenous variables, and the enzyme gene has no other connecting edges.

Further investigation of the connected nodes and possible regulatory heat shock mechanisms is beyond the scope of the present paper, and would require further biological evidence and possible investigation. It is clear from the detection of the unconnected nodes together with the plausible relationships between nodes connected by a single edge, that analysis of the data using the new score metrics has demonstrable utility and has detected real biological effects.

## 5 Discussion

The BGeCM score metric and the residual approach presented in this paper enable Bayesian network structures to be learnt given datasets that do not consist of independent and identically distributed samples, and may be used in conjunction with any score-based method for the estimation of a Bayesian network. Furthermore, the residual approach allows the estimation of a Bayesian network for datasets with a complex mean structure without the need to specify the variance structure of the mean effects. This approach proved useful for the analysis of the grape-berry gene data, where it could not reasonably be supposed that the effects of the exogenous variables were independent and identically distributed. Our analysis of the grape-berry gene microarray data has resulted in biologically plausible conclusions on the heat shock regulatory network of grape genes. These inferences could not have been drawn without the availability of suitable score metrics to account for the effects of exogenous variables.

## References

- [1] B. G. Coombe. The regulation of set and development of the grape berry. Acta Horticulturae, 34:261–271, 1973.
- [2] A. C. Davison. Statistical Models. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2003.
- [3] H. de Jong. Modeling and simulation of genetic regulatory systems: A literature review. Journal of Computational Biology, 9(1):67–103, 2002.
- [4] A. Dobra, C. Hans, B. Jones, J.R. Nevins, and M. West. Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis, 90(1):196–212, 2004.
- [5] R. L. Dykstra. Establishing the positive definiteness of the sample covariance matrix. Annals of Mathematical Statistics, 41(6):2153–2154, 1970.
- [6] J. Ellson, E. R. Gansner, E. Koutsofios, S. C. North, and G. Woodhull. Graph drawing software. chapter Graphviz and dynagraph - static and dynamic graph drawing tools, pages 127–148. Springer-Verlag, 2004.
- [7] N. Friedman. Inferring cellular networks using probabilistic graphical models. Science, 303:799–805, February 2004.
- [8] N. Friedman, M. Linial, I. Nachman, and D. Pe’er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7:601–620, 2000.
- [9] D. Geiger and D. Heckerman. Learning Gaussian networks. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, 1994.
- [10] D. Geiger and D. Heckerman. Parameter priors for directed acyclic graphical models and the characterization of several probability distributions. The Annals of Statistics, 30(5):1412–1440, October 2002.
- [11] R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf, and T. P. Speed. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4(2):249–264, 2003.
- [12] M. Kanehisa, S. Goto, M. Furumichi, M. Tanabe, and M. Hirakawa. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic acids research, 38:D355–D360, 2010.
- [13] J. Kasza and P. Solomon. Kullback Leibler divergence for Bayesian networks with complex mean structure. arXiv:1009.1463.
- [14] J. E. Kasza. Bayesian networks for high-dimensional data with complex mean structure. PhD thesis, The University of Adelaide, 2009.
- [15] D. Koller and N. Friedman. Probabilistic graphical models. MIT Press, 2009.
- [16] S. Kotak, J. Larkindale, U. Lee, P. von Koskull-Döring, E. Vierling, and K.-D. Scharf. Complexity of the heat stress response in plants. Current opinion in plant biology, 10:310–316, 2007.
- [17] S. L. Lauritzen. Graphical Models. Clarendon Press, Oxford, 2004.
- [18] F. Markowetz and R. Spang. Inferring cellular networks - a review. BMC Bioinformatics, 8(Suppl 6):S5, September 2007.
- [19] S. P. Robinson and C. Davies. Molecular biology of grape berry ripening. Australian Journal of Grape and Wine Research, 6:175–188, 2000.
- [20] G. K. Smyth. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3(1), 2004.
- [21] T. P. Speed. Encyclopedia of Statistical Sciences Update Volume 1, chapter Restricted maximum likelihood (REML). Wiley, 1997.
- [22] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer-Verlag, 1993.
- [23] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2007.
- [24] W. Wang, B. Vinocur, O. Shoseyov, and A. Altman. Role of plant heat-shock proteins and molecular chaperones in the abiotic stress response. Trends in Plant Science, 9(5):244–252, May 2004.
- [25] N. Yabe, T. Takahashi, and Y. Komeda. Analysis of tissue-specific expression of Arabidopsis thaliana HSP90-family gene HSP81. Plant cell physiology, 35(8):1207–1219, 1994.