Inferring genetic networks: an information theoretic approach
In the postgenome era many efforts have been dedicated to systematically elucidate the complex web of interacting genes and proteins. These efforts include experimental and computational methods. Microarray technology offers an opportunity for monitoring gene expression level at the genome scale. By recourse to information theory, this study proposes a mathematical approach to reconstruct gene regulatory networks at coarse-grain level from high throughput gene expression data. The method provides the a posteriori probability that a given gene regulates positively, negatively or does not regulate each one of the network genes. This approach also allows the introduction of prior knowledge and the quantification of the information gain from experimental data used in the inference procedure. This information gain can be used to chose genes to be perturbed in subsequent experiments in order to refine the knowledge about the architecture of an underlying gene regulatory network. The performance of the proposed approach has been studied by in numero experiments. Our results suggest that the approach is suitable for focusing on size-limited problems, such as, recovering a small subnetwork of interest by performing perturbation over selected genes.
Gene expression is regulated by proteins that enhance or block polymerase binding at the promoter region. These biochemical reactions constitute the edges of the gene regulatory networks. One of the key issues in modern biology is the elucidation of the structure and function of gene regulatory circuits at the system level hood (). To address this challenge many efforts have been devoted to the task of developing computational methods capable of inferring the interaction between genes from expression levels both on small pathways luna (); jjc () as on genome-wide scale (see rev1 () for a review). Several models for gene regulatory networks have been proposed in order to infer network interactions holter (); tegner (), such as Bayesian networks friedman (); husmeier (); beal (), Boolean networks akutsu (), linear model dhaeseleer1 (); dhaeseleer2 (); vansomeren (). Once a regulatory network model has been chose, it is possible, in principle, to recover its parameters with some accuracy. Of course, more detailed models will require more extensive experimental data. In general this data is not available for the genome-wide scale assuming complex model. However, we can concentrate on simpler task, such as: who is regulating whom? and, Is that an up-regulation or a down-regulation? The idea behind restricting our questions at this qualitative information level, is to reduce the amount of data needed to infer valuable and robust biological knowledge even when dealing with noise data. In any case, the detailed information offered by more detailed modeling is not useful without a careful significance analysis of these predictions. In this sense, this study proposes a mathematical approach to infer gene networks at the coarse grain level. The inference process is to be accomplished according to Ockham’s razor, i.e., with the minimum number of assumptions compatible with the available data. To do that, the information theory (IT) is used within the framework of the maximum entropy principle sh (); j (). IT has proved to be of utility in devising techniques for analyzing gene expression and network reconstruction lezon (); martinez (), where gene expression levels were regarded as random variables. Here, complementing these previous works, each putative interaction has been considered as a random variable. In numero experiments show that, in this case, the IT parlance also provides a powerful framework to discuss questions related to the modeling process such as: (i) how to incorporate a priori information about the gene interaction; (ii) how to asses the likelihood of the inferred paths; (iii) how to quantify the information provided by the experimental data; and (iv) how to design experiments in order to identify subnetworks.
The IT approach
In general a genetic network can be modeled by a set of non-linear differential equations , where is the expression level of gene at time and is the regulatory function governing the expression of gene yang (). Near a steady state the nonlinear system can be approximated by a set of linear differential equations, where , is a weighted connectivity matrix yeung (). In order to uncover the connectivity matrix, we can apply a stimulus , then measure simultaneously the mRNA levels relative to genes and estimate the derivative . Repeating the procedure times we get a measurement matrix where columns denote the experiments and where rows indicate individual genes. Thus, we can approximate the dynamics by
where and follow the same notation as .
Usually, inferring genetic network attempts to retrieve the weight matrix , where the elements describe the type and strength of influence of gene on gene ( indicates activation, indicates repression, and a zero indicates no influence). However, without a careful assessment of the significance of the weight-values, this could lead to the conclusion that the network is fully connected in contradiction to the well-known fact that gene regulatory networks are sparse networks.
In the present work, the maximum entropy principle is applied to obtain the probability distribution from the data , over the possible matrix . After that, using maximum likelihood criterion, the gene interaction matrix is selected. The elements can take only three values, depending on the type of influence of gene on gene , for activation (direct or indirect), for repression and when gene does not have influence on gene . In order to infer weights consistent with , it is assumed that each set of weights is realized with probability . In other words, a normalized probability distribution is introduced over the possible sets , which satisfy
The relative entropy related to an a priori probability distribution , is given by
where is an appropriate a priori distribution. The negative relative entropy , known as Kullback-Leibler distance lt (), defines the information gained after has been used in the inference procedure. Thus, in this framework, the inference process takes place through a modification of the probability distribution on weights space due to incoming data.
Among all possible distributions consistent with , has been selected which comprises no unjustified prejudice. Thus, following the central tenets of the maximum entropy principle, relative entropy is maximized subject to the constraints Eq. 2. Thus, the a posteriori probability distribution yields,
where is Lagrange multiplier associated to the normalization condition, and the Lagrange multipliers associated to the constraints Eq. 2, which are determined once is properly selected.
In order to select , it is assumed that the weights are restricted to the values of i.e. and then a three-peaked a priori distribution is used, which is described by
where is the a priori probability for gene to regulate positively (), negatively () or to not regulate () gene . Of course for each pair . The parameter can be regarded as a constraint smoothness parameter. Replacing this choice in Eq. 4 the a posteriori probability distribution is obtained as a sum of three Gaussians,
where is the a posteriori probability for gene regulate positively (), negatively () or to not regulate () gene . These probabilities are defined by , and , where guarantee normalization. Furthermore, the relative entropy of the a posteriori distribution Eq. 3 is given by
where is the information gain of gene with respect to obtained from using the data which is defined by
The multipliers are obtained after solving the equation
where are subject to the constraints imposed by . Our central idea is that of reinterpreting, following the information in in a particular fashion,
Thus, all of the possible networks that are consistent with Eq. 10, can be written as
is an matrix, where is zero if and is otherwise an arbitrary scalar coefficient. , and correspond to the singular value decomposition of matrix , i.e. where is a unitary matrix of left eigenvectors, is diagonal matrix containing the eigenvalues , and is a unitary matrix of right eigenvectors. Without loss of generality, let all non-zero elements of be listed at the end and in Eq. 11 are taken to be zero if . The general solution 11 can be written as
where is the particular solution with the smallest norm. If , many weights are compatible with the available information. The information contained in the data set can be used in different ways. Each of these leads to a different probability distribution which exhibits diverse properties. In this sense, following the prescription in Eq. 12, the knowledge that gene regulatory networks are sparse can be made use of. Thus, we have , which is an overdetermined problem yeung (). This particular solution will be denoted as . Of course the is obtained solving Eq. 9 using or . In the following sections these alternatives will be considered independently. Notice that for , .
After determining the a posteriori distribution, the gene interaction matrix must be selected. In order to do that, the maximum likelihood criterion is taken into account, i.e. the selection is accomplished choosing the highest a posteriori probability from for each pair . For example if is greater than and , then indicating that gene activates the gene .
In order to achieve the best model, the idea is to use the information contained in and the knowledge that gene regulatory networks are sparse. The formalism presented here offers an alternative to the prescription which selects from all possible solutions 11. This alternative consists in setting . In this way the knowledge that gene regulatory network is sparse can be introduced by assigning a much lower value to the a priori probabilities of interaction than the a priori probabilities of absence of interaction. Furthermore, as the inference processes occur row by row, any other relevant a priori information about the gene in consideration (such as known interactions, type of gene, etc.) could be included in these probabilities. For example, if gene encode a helix-turn-helix or a zinc finger protein, high probabilities can be assigned for column ( and ).
In order to systematically benchmark the inference performance of this method, a linear data-generating model was used. The random inputs (the columns of matrix ) were generated in the range [-1,1] and was computed as the system response, where is the matrix to be reconstructed. Thus pairs constitute the available information . In the simulation, it was observed that the mean performance depends on size and the degree of connectivity and not on the network type. For this reason, random sparse linear networks will be considered, where each gene has entries in average. To build the connectivity matrix , following procedure was used: for each matrix element a random number between (0,1) was sorted, if a negative random value chosen from a uniform distribution was assigned to the matrix element, if the matrix element was a positive random number, and otherwise the matrix element was zero. The condition ensures sparseness.
By using singular value decomposition and interior point method for regression was computed. Subsequently the set of uncoupled nonlinear equation 9 was solved and the a posteriori probability for each putative interaction was evaluated. After this procedure the most likelihood can be selected. The performance of the inference procedure was measured by the prediction error , where is defined by
Figure 1 depicts the prediction error as a function of defined as the ratio of number experiments and number of genes, i.e. . These have been tested in three different size networks with , in which all a priori probabilities are assumed to be equals (i.e. ) and . For small values of the method mistakenly infers a percentage of interaction which depends on the network size and . However, the prediction error decays rapidly as increases and the gene interaction matrix is completely recovered with a value that decreases with the network size. This performance was obtained using prescription. Similar simulations (data not shown) performed with the prescription, reveal that in these cases the prediction error remains close to unit until , where they decay abruptly. Dependence of performance on the network topology has not been detected and similar results were obtained for scale free networks which have more biological appeal than the random networks used here. However, these simulations present a greater error bar due to the fact that the network building algorithm used here does not make networks with a uniform node degree.
Many times, when dealing with an incomplete data set , only a percentage of the interactions is inferred correctly. If the likelihood of the inferred paths cannot be assessed, this partial reconstruction has small predictive value in real life. The methodology proposed here can assess the likelihood of the predicted interaction straightforwardly through the a posteriori probability. In this sense, only those predicted interactions with an a posteriori probability which is greater than some significance level can be selected. To illustrate this issue, a network with 60 genes with was simulated. The related connectivity matrix is represented in Fig. 2(left), row corresponds to the genes that regulate the activity of gene , while column corresponds to the genes regulated by gene . The weight values are depicted following a linear gray scale, where white(black) corresponds to the maximum(minimum) values of weights and the gray background represents the absence of interaction. This network is random perturbed in 24 different experiments (). With this amount of data usually about of the interactions are predicted correctly (see Fig. 1). Nevertheless, which interactions were inferred correctly and which were inferred wrongly is unknown. By mean of the information theory approach, the a posteriori probabilities were computed and the inferred interaction matrix and the associated likelihood were derived. Fig. 2(right) represents the inferred connectivity matrix , by assuming that all a priori probabilities are equal (i.e. ). Red circles indicate wrong predictions ( of the interactions), while green circles indicate the interactions with a a posteriori probability greater than 0.99. In this case there are 98(108) interactions where the maximum a posteriori probability is greater 0.99(0.95). A more detailed study (data not show) revealed that interactions related to higher weight values, are associated to high a posteriori probability values. These results suggest that gene networks can be partially recovered even with small amounts of data, mainly for those genes that interact strongly.
Unfortunately all measurements are subject to observational noise, consequently it is important to asses to what extent the performance of the inference procedure is affected by noise. To simulate this condition in the numerical experiment, the available information (both input and output) was corrupted by an additive Gaussian noise with mean zero and standard deviation . This inference procedure was performed for networks with , in the same condition as for the previous assessment ( and ). However, in this case the method based on the prescription of sparseness assumed in could not correctly recovery the gene interaction matrix when the noise level was (even for smaller ). Figure 3(top) indicates the prediction error by using both and assuming that the a priori probability for activation, repression or absence of interaction are equal. This clearly shows that the prediction power decreases as more data becomes available.
However, the network can be partially reconstructed by using an alternative constraint of sparsity. This alternative consists in introducing the knowledge of sparseness of the matrix through the a priori probabilities. That is achieved by setting in the inference procedure. Fig. 3(middle) depicts the prediction error as a function of when the a priori probabilities were set to . The sum of these probability values corresponds to the percentage of genes that are regulated by one gene. With such a priori information, it is possible to reconstruct almost the complete structure of the network (around 95 of edges) using more experiments than the number of genes, . The mean node degree of the network is generally not known in advance. Notwithstanding, the prediction ability is robust for underestimations of the a priori probabilities. Figure 3(bottom) depicts the prediction error as function of when the a priori probabilities were set to . The result is almost the same as the previous one. This implies that it is possible to partially recover the interaction matrix even with noise data, by setting low values for the a priori probabilities . In the last two cases, the prediction performance obtained by the prescription is comparable with that obtained by using prescription, in contrast to the case which deals with clean data. Furthermore, when data are corrupted by noise, it was observed that prediction error has a peak around , this peak arises because some singular values, associated to the SVD, take small values as consequence of noise.
The partial recovery referenced above does not pursue recover a closed subnetwork, which mainly infer strong interactions around the whole network. However, in many cases this is crucial to recover the complete subnetwork associated to a given gene or path of interest. The inference approach and information gain tool presented in this study, could be used to establish new relationships between genes and to propose new experiments. By means of cycles of experiments-datamining, the knowledge about the subnetwork can be refined until its complete recovery, even in presence of observational noise. For that purpose the following protocol could be used: i) perform an initial perturbation where the gene of interest is overexpressed, and obtain the genome expression profile; ii) compute the information gain for each gene with this experimental data; iii) select the genes for which the information gain is greater than a given threshold; iv) iterate first two steps perturbing each one of the genes which were selected in the third step and which have still no been perturbed, until no new gene has an information gain greater than the threshold. Figure 4A illustrates the result of three of the experiments-datamining cycles. Firstly, the gene which belongs to the subnetwork of interest, gene g1, is initially overexpressed (level of 10.0 while the other gene levels are random in the range [-0.5,0.5]), then the input-output network is measured, this measurement is subject to observational noise with . The information gain of this experiment is computed for each gene using as an a priori probability. Subsequently, those genes with greater than 1.0 are selected. suggests that gene g6 is regulated by g1. Repeating the above step with gene g6, the results indicate that genes g2, g3 are regulated by g6. The above step is repeated with gene g2 and subsequent genes with high information gain values in ensuing experiments, until no new gene with an information gain greater than threshold appears. Fig. 4B illustrates a list of experiments where the first column corresponds to the gene that was perturbed in the experiment, and the second column corresponds to the genes which appear to be regulated by the perturbed gene. In the last two experiments no new regulated genes appeared (which were not indicated in the first column list). The above analysis provides a causal link between two genes, but it does not indicate if the regulation is positive or negative. In order to extract this information, the inference analysis was performed using the ten ”overexpression experiments” pooled in (). When the inference procedure was applied with this data, 19 out of 24 interactions in the subnetwork were inferred correctly, 10 of them with an posteriori probability greater than 0.99. However, the a priori probabilities provided by the information contained in list of Fig. 4B are included, setting (or 1/3) for all the pairs indicated in the list, and otherwise, 23 out of 24 interactions in the subnetwork are inferred, 19 of them with a posteriori probability greater than 0.99, Fig. 4C. The performance above obtained does not differ if the inference procedure is implemented using or prescription, of course is computationally cheaper than which requires linear programming optimization.
The above example about subnetwork inference suggests that this novel scheme can be re-used regarding further subnetworks until the whole network is recovered with experiments.
Discussion and Conclusions
A novel approach for regulatory network inference is presented in this study. Differently to other methods, this approach pursues to infer the type of interaction rather than a weight which characterizes the interaction quantitatively. Three main features of the proposed method are pointed out. First, it allows to introduce global a priori information about the network, as sparseness, and other gene dependent available information, as illustrated in the last example Fig. 4C. Second, the information theory formalism provides a way to quantify the likelihood of the inferred paths, by using the a posteriori probabilities computed with the method. Last, but not least, information theory formalism also quantifies the information gained with the set of data to be used in the inference procedure.
Furthermore, the IT approach seems to offer promising perspective as a network inference protocol; the methodology presented here introduces an information gain measure as a bonus. The way in which this quantity could be a useful tool to identify the downstream regulated genes in overexpression experiments is illustrated in this study. This feature allows a datamining-assisted way of uncovering the whole network with a number of experiments equal to the number of genes, even when dealing with a high level of observational noise. This IT approach enables the effective use of all the available information, in which each experiment is used as an individual constraint. Thus, the ensuing observation level becomes much richer than the standard one, where all data define a fitness function to be optimized. Efficient management leads to more realistic results in inference.
The learning protocol presented here constitutes an additional inference technique, which should be of interest not only for basic research but also as an application to many interesting real world problems without paying an excessive computational cost.
The author thanks Christina McCarthy for critical review of the manuscript. L.D. is researcher of CONICET (Argentina).
- (1) Ideker T, Galitski T, Hood L (2001) A new approach to decoding life: systems biology. Annu Rev Genomics Hum Genet 2:343–372.
- (2) Tienda-Luna IM, Yin Y, Carrion MC, Huang Y, Cai H, Sanchez M, Wang Y (2008) Inferring the skeleton cell cycle regulatory network of malaria parasite using comparative genomic and variational Bayesian approaches. Genetica 132:131–142.
- (3) Gardner TS, Di Bernardo D, Lorenz D, Collins JJ (2003) Inferring genetic networks and identifying compound mode of action via expression profiling. Science 301:102-105.
- (4) Tegnér J, and Björkegren J (2007) Perturbations to uncover gene networks. Trends in Genetics 23:34–41.
- (5) Holter NS, Maritan A, Cieplak M, Fedoroff NV, Banavar JR (2001) Dynamic modeling of gene expression data. Proc. Natl. Acad. Sci. USA 98:1693-1698.
- (6) Tegnér J, Yeung MK, Hasty J, Collins JJ (2003) Reverse engineering gene networks: integrating genetic perturbations with dynamical modeling. Proc. Natl. Acad. Sci. USA 100:5944-5949.
- (7) Pe´Er D, Regev A, Elidan G, Friedman N (2001). Inferring subnetworks from perturbed expression profiles. Bioinformatics 17:S215-S224.
- (8) Husmeier D (2003) Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics 19:2270-2282.
- (9) Beal MJ, Falciani F, Ghahramani Z, Rangel C, Wild DL (2005) A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics 21:349-356.
- (10) Akutsu T, Miyano S, Kuhara S (2000) Inferring qualitative relations in genetic networks an metabolic pathways. Bioinformatics 16:727-734.
- (11) D´Haeseleer P, Liang S, Somogyi R (2000) Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics 16:707–726.
- (12) D´Haeseleer P, Wen X, Fuhrman S, Somogyi R (1999) Linear modeling of MMA expression levels during CNS development and injury. Pac. Symp. Biocomput 41-52.
- (13) van Someren EP, Wessels LFA, Backer E, Reinders MJT (2002) Genetic network modeling. Pharmacogenomics 3:507–525.
- (14) Shannon CE and Weaver W (1949) The Mathematical Theory of Communication (University of Illinois Press, Chicago).
- (15) Jaynes ET (1957) Phys Rev 108:171–190.
- (16) Lezon T, Banavar J, Cieplak M, Maritan A, and Fedoroff NV (2006) Using the principle of entropy maximization to infer genetic interaction networks from gene expression patterns. Proc. Natl. Acad. Sci. USA 103:19033-19038.
- (17) Martínez O, Reyes-Valdés MH (2008) Defining diversity, specialization, and gene specificity in transcriptomes through information theory Proc. Natl. Acad. Sci. USA 105:9709–9714.
- (18) Wang Y, Joshi T, Zhang X-S, Xu D, Chen L (2006) Inferring gene regulatory networks from multiple microarray datasets. Bioinformatics 22:2413–2420.
- (19) Yeung MK, Tegner J, Collins JJ (2002) Reverse engineering gene networks using singular value decomposition and robust regression. Proc. Natl. Acad. Sci. USA 99:6163-6168.
- (20) Levine RD, Tribus M (1978) The Maximum Entropy Principle (MIT Press, Boston MA).