A mRNA and protein networks

Abstract

Genes are connected in regulatory networks, often modelled by ordinary differential equations. Changes in expression of a gene propagate to other genes along paths in the network. At a stable state, the system’s Jacobian matrix confers information about network connectivity. To disclose the functional properties of genes, knowledge of network connections is essential. We present a new method to reconstruct the Jacobian matrix of models for gene regulatory systems from equilibrium protein concentrations. In a recent paper we defined propagation and feedback functions describing how genetic variation at one gene propagates to the other genes in the network and possibly also back to itself. Here we show how propagation and feedback functions provide relations between equilibrium protein levels which are in principle observable, and Jacobi elements which are not directly observable. We establish exact formulae expressing the Jacobian in terms of derivatives of propagation and feedback functions. Approximating these derivatives from perturbed and unperturbed protein levels, we derive formulae for estimating the Jacobian. We apply the method to models of the Drosophila segment polarity network and randomly generated gene networks. Genes could be perturbed in two ways: by modifying mRNA degradation rates, or by allele knockout in diploid models. Comparison with the true Jacobians shows that for noiseless data we obtain hit rates close to 100% in the former case and in the range 80-90% in the latter. Our method adds to the network interference toolbox and provides a sign estimate of the Jacobian from steady state data, and a value estimate of the Jacobian if protein degradation rates are known. Also the approach identifies some predicted connections as much more reliable than others, and could point to further experiments for resolving uncertainties in the less dependable Jacobian elements.

Keywords: Gene regulatory networks; Reverse engineering; Jacobian; Feedback

Background

Living organisms contain large numbers of complex networks in which genes, mRNA molecules, protein, metabolites etc. interact to maintain essential functions and to react to a wide range of external impacts and conditions. Genes interact when the protein output of a gene enters into the cell’s biochemical system, and through interactions with other chemical species, frequently by long and intricate pathways, influence the expression of other genes by enhancing or inhibiting transcription or modifying translation. Thus a gene is not an independent object whose functions are only dependent on its internal structure and properties and the external conditions, but is part of a network, reacts to input from other genes in the network and in turn affect other components of the network (Emmert-Streib and Dehmer, 2011). Feedback loops and feedforward motifs are important building blocks in gene networks (Alon, 2007).

Systems of ordinary differential equations are frequently used to model the dynamics of such networks. In a dynamic system with a stable point, all solutions within its basin of attraction usually decay exponentially towards the stable point, which is then called hyperbolic. Close to it, dynamic trajectories can be computed approximately once the Jacobian matrix is known. The Jacobian matrix also confers information about the network structure of the system: all the system’s actions and interactions that are operative in the neighbourhood of the stable point, can be read out from its elements.

For a system for which no validated model exists, a basic question is whether it is possible to obtain information on the network topology and connections, which are not directly observable, from the concentrations of mRNA and proteins, which are directly observable. In the literature one finds a large number of papers describing different reverse engineering methods, see e.g. reviews by Brazhnik (2005); Camacho et al. (2007); Cho et al. (2007); Emmert-Streib et al. (2012); Goutsias and Lee (2007); Ross (2008); Stark et al. (2003a, b); Tirosh and Barkai (2011); Yip et al. (2010); Chai et al. (2014). There are essentially two main classes of methods: using time series data, or equilibrium concentrations. The present paper belongs to the latter class.

By repeatedly perturbing the system to induce a shift of equilibrium values of the state variables, one may hopefully be able to infer the Jacobian matrix elements. The observation that expression of gene is affected by a perturbation of gene does not in itself say much about . From this fact alone one cannot tell whether the effect is direct or mediated by one or several other genes. This enigma might be solved by performing more perturbations, but by an ad hoc procedure one soon gets lost. A systematic approach is necessary.

The present paper is a continuation of our recent analysis of propagation of genetic variation in diploid networks (Plahte et al., 2013). There we developed a formalism for describing how a change of genotype of one gene in a gene regulatory network propagates to other, downstream genes and possibly also back to the gene itself, and how this propagation effect is related to the structure of the network.

In the present paper we consider the opposite situation in which so far no network model for some gene regulatory system exists. How, and to what extent, can allele knockouts or other perturbations yield information by which a model of the network can be constructed?

In the following the term “gene” should be considered as a functional module, “an entity of known/unknown genes, proteins or metabolites, grouped together and internally connected by complex physico-chemical interactions” (Yalamanchili et al., 2006). A module may take inputs from many other modules, including itself, but we assume each module is delimited such that it only produces a single output. The process of obtaining information about the local interactions, i.e. the direct effect of a perturbation of one module on another, from the global effects that result from the web of network connections between the modules, has been called Modular Response Analysis (MRA) (Sontag, 2008), and was developed by Kholodenko et al. (2002) (see also Andrec et al. (2005); Cho et al. (2005); Yalamanchili et al. (2006)). Albeit similar, our approach differs from the orginal MRA in several respects. In the original MRA, the authors were only able to determine the rows of the Jacobian up to a scalar multiple. This shortcoming is related to the fact that the equilibrium conditions are unchanged if each rate function is multiplied with a nonzero constant, while the Jacobian is not. However, if the MRA is supplemented by non-steady state data, the full Jacobian can be estimated (Sontag et al., 2004). Using only steady state data, we are able to determine the correct sign of the elements of the Jacobian, and if the protein degradation rates are known, their numeric values as well.

We consider two particular approaches to modify or perturb a gene and by that modifying the functioning of the network. The first approach involves perturbing the mRNA degradation rates, for instance through RNA interference methods. The second approach is to knock out one of the two alleles of a diploid gene. We illustrate both approaches by in silico experiments. We show that if the protein degradation rates can be measured or estimated, both approaches can be used to infer the Jacobian from unperturbed and perturbed protein concentration data from which the Jacobian of the system can be inferred.

Mathematical analyses and results

Model framework of gene regulatory networks

We consider a set of genes believed to be part of a network of nodes or loci , , where and . A real gene regulatory network can be modelled by a dynamic model designed according to the following lines. A non-negative variable describes the concentration or amount of the output of , its time rate of change being given by

(1)

where . The differentiable rate function represents the production rate or dose-response function of , and is its constant relative degradation rate. The quantity , , represents a set of parameters defining the system’s genotype, the subset defining the genotype of . This model framework is commonly used to model gene regulatory networks. In fact, it has been around for decades (de Jong, 2002). Frequently, the dose-response functions are modelled by means of sigmoidal functions, for example the well-known Hill function. We assume that Eqs. (1) have a single, hyperbolic equilibrium point . Our goal is to estimate the Jacobian of Eqs. (1) in in terms of experimentally observable quantities.

Notation: Vector and matrix components are indicated by subscripts as usual. Superscripts are used extensively as indices and except in a few obvious cases, never indicate a power. For vectors and matrices, superscripts in parentheses indicate that the enclosed component has been excluded. A superscript to a matrix, for example , indicates that row number and column number in have been deleted. Similarly, is the set (or vector) with deleted, . Superscripts in brackets indicate the value when a node has been perturbed. For example, is the equilibrium value of when node has been perturbed. We also use a set notation for subscripts. For example, if is a subset of , then is the vector with components , . If is a matrix, denotes row number in . The equilibrium condition of is denoted by . The Jacobian of Eq. (1) is , and . If is a square matrix, is the diagonal matrix with the same main diagonal as . The superscript to a matrix denotes its transpose.

Eq. (1) could be seen as a simplification of a more realistic regulatory system model comprising mRNA, proteins and metabolites.Often gene outputs do not act directly as transcription factors regulating the expression rates of the genes. Rather, there are frequently long and complicated pathways that propagate and modify the regulatory processes. A philosophy behind Eq. (1) is that all these complicated reactions can be condensed into the response functions (Brazhnik et al., 2002). The segment polarity network model of von Dassow et al. (2000) is an example of a model framework in which the concentrations of mRNA and protein for each gene in the network are modelled independently. This model framework, which has been used by a number of authors (see e.g. Lewis (2003); Ichinose et al. (2008); Polynikis et al. (2009)), is

(2)

. Here and are the concentrations of protein and mRNA of gene number , respectively, is the production rate (dose-response function) of mRNA, dependent on the concentration of the input proteins, is the mRNA-protein conversion rate, and and are positive relative degradation rates. The gene products might act directly as transcription factors, or the function might implicitly contain chains of reactions from the gene products to the real transcription factors so that is the combined effect of these chains and the transcription.

Of course the mRNA-protein conversion rate might not be constant, but some nonlinear function of . However, our analysis is local around the stable point , and the second of Eqs. (2) would then represent a locally valid linear approximation to the nonlinear transcription model.

As mRNA molecules are in general less stable than the corresponding protein molecules, we can safely assume that for all , . With a number of reasonable assumptions we can make the quasi-stationarity hypothesis , . This can be justified in a rigorous way by means of singular perturbation theory (see Appendix A), and leads to the reduced model

(3)

where . This equation is of the general form Eq. (1) on which all our derivations are based. For our purposes, Eq. (3) is equivalent to Eq. (2).

In a diploid organism the transcriptional machinery of each gene is composed of two alleles, each allele residing in one of the two chromosomes and transcribing mRNA at a certain rate which depends on the genotype and the concentrations of the gene’s active transcription factors. Due to small differences in the two alleles’ nucleotide sequences, transcription in the two alleles may proceed at (slightly) different transcription rates. The regulatory properties of the two products might also be different. However, a number of experimental results indicate that in many cases, the two alleles of a gene differ only in their regulatory domain without any variation in the coding region (Capon et al., 2004; Chamary and Hurst, 2005; Duan and Antezana, 2003; Gehring et al., 2001; Hoogendoorn et al., 2003; Jones et al., 2012; Mayo et al., 2006; Peng et al., 2005; Wang et al., 1999; Rosenfeld et al., 2005). In particular it seems reasonable to assume that for a homozygous gene, the two identical alleles are regulated in the same way and produce identical mRNAs.

Let the amounts or concentrations of mRNA produced by the two chromosomes of gene be and , respectively. The total amount (concentration) of mRNA is . The same modelling approach as the one leading to Eq. (2) then gives

(4)

where is the vector of protein concentrations . By addition this leads to

(5)

Applying the quasi-stationarity hypothesis to mRNA production leads finally to a single equation for the protein output concentration of node :

(6)

where again . Eq. (6), or alternatively Eq. (3), is our final model for which we want to estimate the Jacobian . The stationarity condition of Eq. (6),

(7)

will be denoted as in (Plahte et al., 2013).

Propagation functions, feedback functions and the Jacobian

For an investigation of how genetic variation at a locus propagates to the other loci in the network, it is easier and more fruitful to express all equilibrium values as functions of the equilibrium value of the perturbed node than to express them by the values of perhaps unknown parameters. We showed in Plahte et al. (2013) that the stationarity conditions of Eq. (1) define propagation functions expressing how a change of the equilibrium value for the locus propagates via the network connections to any other locus . The relation

(8)

where is the set of parameters not specific to , expresses an important property of ; for a given , the propagation functions are invariant under genetic variation of (Plahte et al., 2013).

To determine the changes in any , , due to modification imposed on , all we need is the resulting change in and the propagation function . We do not need a model for how a genotypic change in affects the rate function . This important property is a consequence of a theoretical result (Radulescu et al., 2006) stating that the propagation function is defined by all except , which is the only one containing the parameters specific to . It follows that the derivative of with respect to can be easily estimated by the ratio of a small change in divided by the change in due to a small genotypic variation in . On the other hand, we showed in Plahte et al. (2013) that can be expressed in terms of the elements of the Jacobian :

(9)

Accordingly, the propagation functions provide links between the observable protein concentrations and the Jacobian.

Because there are only propagation functions, Eq. (9) alone is not sufficient to determine the elements of completely from observable quantities. It does not tell how a genetic variation of affects itself. To determine this is a more difficult task because it requires knowledge of how a change of a parameter in directly influences , which in general may require a detailed model of the transcription and translation process. This information is in principle contained in the so-called feedback functions defined in Plahte et al. (2013).

Let . Expressing all , , as by means of all and inserting this into gives

(10)

The right-hand side of this equation is what we call the feedback function for . The stationarity condition for is then

(11)

The feedback function describes and quantifies the feedback effects of changes in the equilibrium value of on itself. If , where the prime denotes the derivative with respect to , there is an effective feedback of on itself, mediated by one or more feedback loops. We showed in Plahte et al. (2013) that

(12)

Combining Eqs. (9) and (12) with the well-known formula for the matrix inverse in terms of determinant and minors, we get for

(13)

or

(14)

where is a square matrix defined by , and is the diagonal matrix with diagonal elements , . It is easy to see that Eq. (13) is valid for as well because .

Let us for a moment assume that is known and invertible while and are unknown. The effect of is to multiply each row in by a constant. Let be the nonzero diagonal elements of . Because the system given by , , has the same stationary states as system given by , while their Jacobians are related by , it might seem impossible to determine from equilibrium values alone (see e.g. Sontag (2008)).

The problem is related to the fact that the invariance property of with respect to genotypic variation in is not shared by the feedback function . The function itself changes, and unless this dependence is known, Eq. (11) cannot be used to estimate . Accordingly, it is not trivial to obtain an estimate of by an arbitrary genotypic variation of . If one were able to change some node-specific parameter in , the effect on would depend explicitly on this parameter in a way that would require a model for the transcription and translation of the gene.

While our main object is to develop methods to estimate , we note in passing that if were known, Eq. (13) would yield an expression for . Because all , it follows that , hence

(15)

This formula expresses the total derivative of any variable with respect to any other in terms of the partial derivatives of the dose-response functions and the degradation rates, taking all the network connections into account. This can be interpreted as follows: For the set of differentiable functions , , assume the set of equations has a solution , and let its Jacobian be , . For each , assume that the set of all the equations except define all except in terms of in an open domain around . (If all have the particular form assumed in the present paper, this is ensured if a few additional conditions are fulfilled (Radulescu et al., 2006).) Then for any , is given by Eq. (15). The formula could be useful for computing the derivates of functions defined implicitly by a set of equations.

Eq. (14) is the basis for reconstructing the Jacobian from observable equilibrium data. The main problem is to find a way to estimate . Below we present two different approaches. One is to perturb the degradation rate which does not enter into , but enters into Eq. (11) in a known and simple way. The other approach is by allele knockout in diploid, homozygous genes, in which case we may assume that knocking out one of the alleles reduces the production rate of the gene by 50%.

Reconstruction of by perturbing the mRNA degradation rates

We first analyse the problem of estimating by perturbing the mRNA degradation rates in Eqs. (3) and (6) and recording the effects on the equilibrium concentrations. Note that occurs in the dose-response function of in a known way. This will lead to an estimate of .

Selecting one node and keeping all other parameters fixed, we perturb the degradation rate from to and record the unperturbed equilibrium values and the perturbed values . The index in the bracket indicates which node has been perturbed. Because occurs in in a known way, we are able to derive a formula for for any given . Again we let . Expressing in terms of their propagation functions (which are defined by all except ), the equation

(16)

defines implicitly as a function of . Implicit differerentiation with respect to gives readily

Combined with Eq. (13) this yields

(17)

Changing to induces a change from to . If and its effect on are small, we can approximate the derivatives by

(18)

where and are the effects on and of perturbing the degradation rate of . This gives

(19)
(20)

where is the estimated Jacobian, is the square matrix with elements , and is the diagonal matrix with diagonal elements . If the are perturbed by different values , we get

(21)

where is the diagonal matrix with .

A convenient property of Eqs. (20) and (21) is that if the sign of each is known, the sign of the elements in can be determined even if the values of the are unknown, because all diagonal elements in are positive and would be known. In other words,

(22)

The advantage of this approach from the mathematical point of view is that the degradation rates can be perturbed by different and small amounts. If there is no noise in the data, the induced errors when derivatives are approximated by ratios of finite differences can be made arbitrarily small. Nevertheless, errors in may occur if some nonzero elements in are very small, or if is very close to singular. In the latter case, arbitrarily large errors may occur in no matter how small is.

If recordings for several perturbed values of can be obtained, more precise estimates of the derivatives could be computed by more advanced mathematical methods. This is a great advantage of this method compared to the allele knockout method considered in the next subsection.

Reconstruction of by allele knockouts in diploid loci

The allele knockout method is based on the reasonable assumption that if one of the alleles is knocked out, the production rate of the gene for fixed amounts of its regulators is reduced to one half its unperturbed value. Our starting point is again Eq. (13), where now is the Jacobian of the system Eq. (6). However, in the derivation of (13) is derived from , the dose-response function for a single allele. As the dose-response function for the homozygous diploid gene is , Eq. (13) must be replaced by

(23)

To find an approximation to we use that the unperturbed and knock-out values and are the solutions of

(24)

Eqs. (24) give

If is small compared to , expanding the last term to first order leads to

(25)

Combined with Eq. (18) and Eq. (23) this finally gives

(26)
(27)

where is the diagonal matrix with diagonal elements , and was defined in the previous subsection. Note the similarity between Eqs. (20) and (27). Also note that even if the protein degradation rates are unknown, the dummy values will give the same sign to the elements of as Eq. (27).

Conditions on

One should check that each eigenvalue of has a negative real part. When is estimated by allele knockout, a further condition on the estimated Jacobian follows from Plahte et al. (2013). For a homozygous locus the allele interaction value (Gjuvsland et al., 2010) is defined by

(28)

Let be the sum of all the terms in in which there is a real regulation of . This definition inplies that does not contain , rather, each term in contains a factor representing a regulation of , i.e. a factor for some . Each term in is a loop product of a feedback loop (called circuit in Plahte et al. (2013)) in . Then

(29)

If is any of these loop products, its contribution to is , where the signature factor is the number of subloops in with an even number of elements. If for some loop has the same sign as , the loop is sign dominant, and

(30)

An estimate of can be computed by analysing the loop structure of , and can be computed directly from the observed equilibrium values. If the signs do not match with Eq. (29) or Eq. (30), there could be an error in the loop structure of , or the sign of or could be wrong due to noise. It is not obvious how to extract the most useful and reliable information from these inequalities. In the present paper we have made no effort to check our simulation results agains these sign rules.

Discrepancy measures of estimated Jacobians

In tests of the method on systems with a known true Jacobian , the estimate could be compared numerically to in many ways to produce a measure of how well has been reconstructed. Due to the degradation terms in the rate functions, the diagonal elements of are generally nonzero except in the unlikely case that a positive autoregulation cancels the degradation term . Because our method presupposes that the are known, we can in fact decide whether a nonzero diagonal element in indicates an autoregulation or only linear degradation. For this reason, it is more informative to compare with , where is the diagonal matrix with .

The choice of error measure should reflect the main purpose of the reconstruction. In our view, the main objective is to reveal the connectivity of the gene network, while the dynamic properties are of less interest because the true system is most likely highly nonlinear. Our prime concern is therefore whether reproduces the right sign structure of , using the sign function if , . Predicting the correct numeric magnitude of the matrix elements comes as a subordinate goal. We used the following classification for an estimated element when the true value is known. If , it is called a true zero when and false zero when . If , it is called a true nonzero when and , a false nonzero when and a wrong sign when and .

Our priorities are reflected in the discrepancy measure matrix defined by the squared relative difference

(31)

The very small positive number , much less than the desired accuracy, is included to make the definition valid also if both elements are zero. It is a matter of elementary algebra to show that satisfies the requirements of a distance measure in . Obviously, for a correct estimate, while for a false nonzero or a false zero in or if comes with the wrong sign. In all other cases, , a small value indicating a good estimate. If the error is small, is approximately equal to half the relative error in . The average discrepancy measure is

(32)

Simulation results

Estimating by modifying mRNA degradation rates

When the equilibrium concentrations are noise-free and is estimated by perturbing the mRNA degradation rates, it should in theory be equal to within computational accuracy, which depend on the user-defined relative perturbation of , integration tolerance, etc. Numerical simulations on randomly generated gene regulatory networks (see below) show that choosing of the order to gives a discrepancy measure of the same order of magnitude. By reducing sufficiently, can in theory be brought down to zero.

With real data, however, this is unattainable due to noise and experimental inaccuracy. Too small values of will lead to large uncertainties in the estimates of the derivatives. To obtain better estimates for the derivatives a fairly large number of observations with different perturbation levels would be needed. Many methods to estimate derivatives from noisy data can be found in the literature. Considering this to be part of the experimental and data processing setup, we do not elaborate this point any further. Instead, we limit ourselves to assuming that a single perturbation level is used, and that this experiment is repeated a certain number of times to produce a distribution of observed values. With this approach one must seek an optimal tradeoff between reducing the error in the derivatives due to finite differences (using a small ) and minimising the noise to signal ratio. The differences in the equilibrium concentrations should not vary too much due to the noise, while should not be so large that nonlinear effects jeopardise the estimates of the derivatives. Intuitively, the perturbation level should be considerably larger than the standard deviation of the distributions of equilibrium levels.

Estimating Jacobian for the segment polarity network by means of allele knockouts

With its crude estimate of the derivatives it is less obvious that the allele knockout approach will work with an acceptable accuracy. To test this, we applied this method to the single cell segment polarity network (Tegnér et al., 2003). We computed the protein equilibrium values by integrating the rate equations until the stable state was reached with high accuracy. Employing total least squares, we used the equilibrium data to compute a matrix from Eq. (27), then computed , and finally subjected to two kinds of cutoff to arrive at our final estimate (see Methods for details). To simulate the effect of noisy data, we also repeated the computation of after uniform noise had been added to . More precisely, before estimating we added a noise term to each equilibrium concentration , where was uniformly distributed in and increased in steps of 0.05 from 0 to 0.25. For each noise level we ran simulations.

Figure 1: The protein connectivity network for the single cell segment polarity network according to Tegnér et al. (2003). Arrow heads signify activation, crossbars inhibition. Colour codes refer to our estimated without noise. Black arcs are correctly predicted in with the right sign, blue arc are predicted with the wrong sign, green arcs signify false zeros (predicting no arc where there is one), and red arcs signify false nonzeros (predicting an arc where there is none). Thus, in the correct graph, black, blue and green arcs should be included, red arcs excluded.

The true and the predicted network connections obtained when no noise has been added to the protein equilibrium data, are shown in Figure 1. The cutoff value appeared to give the lowest number of false elements. The elements are colour coded to show false nonzero elements (red), false zero elements (green) and nonzero but false sign (blue). With the same colour coding the discrepancy measure is

(33)

with average value . For noisy data with noise level up to 0.25 the results are similar, with roughly the same average discrepancy measure (see Appendix C). Apart from the false nonzeros, false zeros and false signs for which , the estimates are all of the right order of magnitude.

Estimating Jacobian for randomly generated gene regulatory networks by allele knockouts

The segment polarity network model has fixed network structure and parameter values. To complement this we performed large-scale simulations of gene regulatory network models with varying number of genes (). For each network size we ran 100 Monte Carlo simulations, sampling network connectivities and wild-type parameter values. The model structure and simulation setup is explained in detail in the Methods section. For each of the 100 randomly generated systems we simulated sets of single knockout experiments, added noise () to the steady state expression levels and computed by means of ordinary least squares, trimming the estimates by the cutoffs and (see Methods).

Figure 2: Scatterplots of (x-axis) versus (y-axis) for in silico single-knockout experiments on 100 gene regulatory systems with genes. The left panel shows results without noise on steady state expression levels while the right panel shows results from repeated measurements with noise level . Observations where are not shown (68 and 85 of 10000 cases for the left and right panels, respectively).

Figure 2 shows scatterplots of estimated versus true values of elements in for . For noise-free steady-state expression levels (left panel) we observe mainly true nonzeros in the 1st and 3rd quadrants and false zeros on the -axis. When noise is added to the steady-state expression levels (right panel), false nonzeros appear on the -axis. Estimates with wrong sign would appear in the 2nd and 4th quadrants, but are rarely seen. Similar patterns are observed for systems with and (see Appendix D).

Figure 3: Summary of true and false discoveries of the signs of Jacobi elements for randomly generated gene regulatory networks with genes. Each panel summarizes 10,000 pairs from in silico single-knockout experiments on 100 simulated systems. (A) Results for simulations without noise on steady state expression levels. The pairs are sorted into subsets (x-axis) based on . The subset named Zero contains 8,495 pairs with , while the remaining pairs are sorted into 10 subsets NZ, , with boundaries corresponding to the th and th 10-quantiles of the 1,505 values. (B) Results for simulations without noise on steady state expression levels. The pairs are sorted into subsets (x-axis) based on . The subset Zero contains 7,394 pairs with , while the remaining pairs are sorted into 6 subsets; NZ0 which contains elements 1706 with and NZ, , with boundaries corresponding to the th and th 5-quantiles of the remaining 900 elements with . (C) Results for simulations with noise level on steady state expression levels. The pairs are sorted into subsets (x-axis) based on , as for (A). The Zero subset contains 9,202 pairs. (D) Results for simulations with noise level on steady state expression levels. The sorting of pairs is the same as in (B).

Not visible in the origin of Figure 2 is a large number of true zeros, and the figure does not convey much information about the relative numbers of true and false sign estimates. Figure 3 gives an overview of this by subdividing the pairs of values into subsets based on their absolute values and displaying relative number of true and false positives for each subset. Using our methods on noise-free steady-state expression levels we find that slightly over 80% of the cases where are true zeros, while the proportion of non-zero that are true non-zeros increase with from  50% for the smallest estimates to  90% for the largest (Figure 3A). When dividing pairs into subsets based on (Figure 3B), we find that more than 90% of the zero elements in are correctly identified as zeros. Furthermore, for the nonzero elements in , false zeros are only a major problem in the group NZ0 where . The most important effects of adding noise to the steady-state expression values is (i) a considerable increase in the proportion of large elements in that are incorrectly identified as zeros (Figure 3D, NZ1 to NZ5) and (ii) a reduction in the number of false non-zeros (Figure 3C, D). The overall pattern is similar for systems with and (see Appendix D).

The distribution of is shown in Table 1. The values of decrease with increasing number of nodes. The main reason is that the proportion of elements in that are zero increases, and as seen in Figure 2 these zero elements are often correctly identified. For , equilibrium values with a low noise-level () lead to slightly higher discrepancy measures, but for networks with 10 and 20 nodes there are no clear differences.

Table 1: Summary statistics of for simulated allele knockouts in gene regulatory netw th varying number of genes () and noise level ().
5 0 0.177 0.384 0.393 0.671
5 0.1 0.233 0.421 0.423 0.77
10 0 0.165 0.236 0.242 0.394
10 0.1 0.173 0.241 0.244 0.363
20 0 0.09 0.133 0.137 0.229
20 0.1 0.102 0.137 0.139 0.182

Conclusions

We have presented a method for estimating the Jacobian of an ODE model of a dynamic gene regulatory system based on the stable equilibrium values of the protein concentrations. The method is developed from our previous analysis of propagation of genetic variation (Plahte et al., 2013). Together with known relative protein degradation rates, the observed shifts in the protein concentrations due to perturbations of the genes are sufficient to obtain an estimate of the true Jacobian . We have analysed two experimentally feasible ways of perturbing the genes: perturbing the relative mRNA degradation rates in haploid or diploid systems, and allele knockout in diploid systems.

The reader should keep in mind that when one talks about a real system, the Jacobian always refers to a model of the system, not to the system itself. Our model framework is very general and includes explicit modelling of both mRNA and protein. We assume that the conversion rate from mRNA to protein is linear. However, as the Jacobian also is a linear representation valid around a stationary point of a usually nonlinear system, this linear conversion could be considered as a linear approximation of some nonlinear mRNA-protein response function.

Major advantages of our method are:

  1. The estimate can be obtained from Eq. (20) or Eq. (27). Both equations are derived from Eq. (14). In this equation, may be estimated from the shifts in equilibrium values obtained by any kind of perturbation according to , c.f. Eq. (18). Knowing the relative degradation rates is not necessary. The effect of is to multiply each row in by some factor proportional to the relative degradation rate of the corresponding protein. Thus, varying does not change the sign of the elements in , only their magnitudes. Accordingly, all the connectivities except autoregulations can be estimated with their right sign even if the are unknown. We consider this a major asset of the method.

  2. If the protein relative degradation rates are known, may be estimated by modifying the mRNA relative degradation rates or by allele knockouts. In those cases can be computed, a non-negative value of pointing to an autoregulation in .

  3. Already existing knowledge about the action of one node on itself or other nodes can easily be incorporated. For example, if is not autoregulated or negatively autoregulated, the value of should be such that . Only a sufficiently strong positive autoregulation may give . Furthermore, if the sign of the action is already known from experiments and the estimate of has the opposite sign, then necessarily if the sign of the estimate can be trusted. This fixes the sign of all the other nonzero elements in row in . Such knowledge may also be used to set admissible ranges for the cutoffs.

Errors in could be due to observational noise or occur because derivatives have been approximated by ratios of finite differences. The effects of noisy data on the computation of have been analysed by Andrec et al. (2005). Because is most likely sparse, in particular for large , there will be a large number of false nonzero elements in before the cutoffs have been applied. If is known, as it were in our numeric simulations, the values of the cutoffs and can be chosen to obtain an optimal fit. Our sign fluctuation analysis in terms of is of course quite simple-minded, and could be replaced or supplemented by more elaborate statistical analyses. In an experimental situation the optimal fit must instead be searched by repeated switching between experiment, theoretical analysis, and new, testable hypotheses based on high and low cutoff values.

A minimum value of can always be set because too small values of will be interpreted as indicating no effective regulatory action. However, the best cutoff value may be larger than this minimum. Large cutoffs gives fewer false nonzeros, but more false zeros (predicting a zero element in where the corresponding value in is nonzero), and vice versa. Setting large cutoffs will lead to just a few nonzero elements in which most likely are correct, but a large number of false zeros. Similarly, setting small cutoffs will lead to near-certain zeros in , but many false nonzeros. The optimal cutoff values are a trade-off between these two extremes.

Even if the true Jacobian is unknown, the estimate could be subjected to a few tests. All the eigenvalues of the optimal should have negative real parts. This is not a strong criterion because the negative diagonal elements stemming from the degradation terms tend to make stable. A second consistency check is to use the sign conditions Eq. (29) relating the sign of the allele interaction values to the dominant loops.

Estimating derivatives from noisy data is notoriously difficult and error-prone. However, if the genes may be perturbed by small amounts and with many different values (for instance by using RNA interference), more elaborate methods could be used to estimate the derivatives , even when data are noisy. The optimal method would depend on available experimental data, and is therefore outside the scope of this paper. Allele knockout, on the other hand, does not admit these options. An allele is either present or knocked out, leaving the researcher with just two data points from which the derivative may be approximated by finite differences. Accordingly, allele knockout is inherently a less precise method than degradation rate perturbation. For this reason, our simulations and discussion are mainly devoted to this suboptimal method in an attempt to determine its potential. Of course, combinations of the two methods could also be envisaged. Thus, our statistical analyses of the equilibrium data should be considered more as examples than recipes on how the data should be analysed. Our object has not been to develop the optimal way of analysing specific data, but to illustrate different approaches and to show that our method actually works.

Using steady state data, our method gives a sign estimate of the Jacobian, and if the protein degradation rates are known, a value estimate. Some connections are predicted more reliably than others, and should point to further experiments that may resolve the inconsistencies and uncertainties in the more uncertain estimates. However, with real data it is not to be expected that one reconstruction method alone will give the complete and correct answer. Where one methods is inaccurate or fails, another may work, perhaps leading to conflicting conclusions that have to be resolved by further experimentation and analysis.

Networks and models

We tested both approaches for reconstruction—allele knockouts and degradation rate manipulation—on in silico networks. In all the networks considered, protein and mRNA are modelled separately according to our general model framework. The stable protein concentration values were obtained by numeric integration of the rate equations until convergence to a steady state. In a few cases the solution approached a limit cycle. These cases were discarded. We compared estimated by the approaches described above to the “true” Jacobian estimated directly from the rate equations in the standard numeric way by estimating the partial derivatives of the rate functions in the reduced model Eq. (3) (without the factor if the system is not diploid).

Segment polarity network

One of the systems we used to test the method is the segment polarity network model of von Dassow et al. (2000) as it was adapted to a single cell (Tegnér et al., 2003). In this model and are the mRNA and protein concentrations of the genes engrailed (en) (), wingless (Wnt) (), Patched (Ptc) (), cubitus interruptus (ci) (), and repressor fragment of cubitus interruptus (), respectively. Application of the quasi-stationarity hypothesis to the equations of motion given in Tegnér et al. (2003, see Supplement), leads to a realisation of Eqs. (3) for all and

(34)

Here is the vector of protein concentrations, are the mRNA dose-response functions, and , , and are constant parameters (see Tegnér et al. (2003) for explicit formulae, parameter values and other details). Due to the presence of in the equation for , this system does not fit completely with our assumptions. A genotypic variation in will affect the dose-response function of directly, implying that the parameters describing the genotype of are not completely node-specific. It is interesting to see whether this fact will jeopardise our reconstruction or the system’s Jacobian.

Random network models

As a further test we ran a series of numerical simulations with and without noise on a range of systems defined by Eqs. (6) of varying dimension and feedback structure for randomly generated dose-response functions and parameter values. For each system size () we sampled 100 systems. For each system we first set up the overall connectivity as follow: for each node we sampled two regulator nodes , , and a mode of regulation (activation or repression) for each regulator. This simplifies the rate equation Eq. (6) to

(35)

We used the dose-response function

(36)

where is the basal and the maximal mRNA production rate, and is the algebraic equivalent of a Boolean AND or OR function. We set if activates and if represses , where is the Hill function with threshold and steepness . We sampled parameter values uniformly in the following ranges: , and .

Estimation methods

Least squares

The question of how to work out the best estimate from allele knockout or degradation rate perturbation data is not trivial. In all cases we get conditions of the kind , where and are square and is diagonal. If there is just a single set of measurements and is invertible, our estimated Jacobian is uniquely given by . If data are noisy and we have a set of observations leading to and , , it is a matter of statistics to decide on an optimal estimation procedure. Obvious options are ordinary least squares (OLS) and total least squares (TLS) (Markovsky and Van Huffel, 2007). OLS assumes no measurement error in , but works well as long as the measurement errors are small (Montgomery et al., 2012). In its simple form, TLS assumes equal variances in and , which is at best only approximately fulfilled in our case. We used TLS for the segment polarity network and OLS for the randomly generated systems. (See Appendix B for details on OLS and TLS.) We do not claim that these are the optimal estimation procedures. Rather, they should be considered as examples used to show that the method actually works. More sophisticated estimates could certainly be found, but considering this as a problem belonging to the experimental and data processing setup, we do not elaborate this point any further.

Cutoffs

Since real networks, in particular large ones, seem to have sparse Jacobians, will in general probably contain many elements with small absolute values. The question is then whether these are just the consequence of approximation errors in the estimation procedure, or correspond to weak couplings in the network. Our algorithm does not make any qualitative distinction between an element in the Jacobian with a small absolute value and a zero element. This is reasonable in light of the probabilistic nature of transcription suggested by Bintu et al. (2005), who express the value of the dose-response function by the binding probabilities of transcription factors and polymerase. However, in common deterministic models an action of one gene on another either exists or does not exist. To relate our estimated to this kind of models we therefore have to apply some kind of cutoff to small elements in , the estimate without cutoffs computed by one of the above estimation procedures..

In our simulations we applied two cutoffs to . First we set all elements in with an absolute value less than a cutoff to zero. Secondly, for the cases of noisy data, we used the estimates obtained by means of Eq. (20) or Eq. (27) for all data set to define the average sign matrix . The elements that are consistently equal to zero or have strongly varying sign for varying , will come out as zero in or with small absolute values. If is smaller than a chosen sign variation cutoff , the corresponding element in is most likely zero. In these cases we set . In simulation studies when the true is known, an optimal can be found by comparing with for different . In real cases, a combination of intuition and repeated experiments may help in choosing the optimal value.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

EP conceived the study and carried out the mathematical analysis. ABG designed and ran the numerical simulations and performed the statistical analysis. Both authors took part in evaluating the results and drafting the manuscript, and approved the final version.

Acknowledgements

We thank Stig W. Omholt for encouragement and comments. This work has been supported in part by The Research Council of Norway, project number 178901/V30, “Bridging the gap: disclosure, understanding and exploitation of the genotype-phenotype map”.

Appendices

The appendices contain details of the singular perturbation analysis of the mRNA-protein system and of all derivations and details of the simulation procedure that are not included in the main document, and additional simulation results that are too lengthy to be included in the main document.

Appendix A mRNA and protein networks

The segment polarity network model of von Dassow et al. (2000) analysed in the paper is an example of a model framework in which the concentrations of mRNA and protein for each gene in the network are modelled independently. In this section we show how such models can be reduced by singular perturbation theory. For a network of genes the model framework is Eq. (2), repeated here for convenience:

Here and are the concentrations of protein and mRNA of gene number , respectively, is the production rate (dose-response function) of mRNA, dependent on the concentration of the input proteins, is the mRNA-protein conversion rate, and and are positive relative degradation rates. The gene products might act directly as transcription factors, or the function might implicitly contain chains of reactions from the gene products to the real transcription factors so that is the combined effect of these chains and the transcription. This framework has been used by a number of authors, see e.g. Polynikis et al. (2009) for a review.

As mRNA molecules are in general less stable than the corresponding protein molecules, we can safely assume that for all , . We define . Then , and by a suitable renumbering of the genes we can alway achieve .

Using standard terminology in singular perturbation theory we call Eq. (2) the full model. To introduce in the equations we transform them to non-dimensional form by scaling the variables and the time according to

(A.1)

For convenience we continue to use a dot to denote time derivatives, but now with respect to the scaled time . This leads to the dimensionless equations

(A.2)

where , .

When is small and with a number of reasonable assumptions, we can make the quasi-stationarity hypothesis

(A.3)

This leads to the reduced model

(A.4)

where . Obviously, the full and the reduced model have the same stationary states. The above derivation can be justified in a rigorous way by means of singular perturbation theory. In terms of the fast time , Eqs. (A.2) are

(A.5)

where the prime denotes differentiation with respect to . The crucial necessary assumption for the singular perturbation to be valid is that the stationary point of Eq. (A.5) is asymptotically stable for fixed values of . In the present case this is ensured by assumption. Singular perturbation theory also ensures that when , the solution of the reduced system approaches the solution of the full system for all except in a narrow, initial time interval.

Appendix B Estimate of by least squares

In this section we consider briefly how to analyse the gene perturbation data using ordinary least squares (OLS) and total least squares (TLS). According to the main paper, both perturbation methods lead to

(B.1)

where the subscript indicates the data obtained from experiment number , and is the diagonal matrix . For each , in the case of degradation rate perturbation and in the case of allele knockout, is the matrix with elements , in the case of degradation rate perturbation and in the case of allele knockout. Finally, is the Jacobian estimated from dataset number , and .

The problem is to derive the best fit to from the total dataset. Eq. (B.1) can be rewritten as

(B.2)

In this form the equation is amenable to an ordinary least squares solution because all noise is confined to the right-hand side. If also has a nonzero variance, this is no longer so, and a more careful analysis is necessary. Assuming that is known with negligible inaccuracy, we form the system by stacking all the right-hand sides of Eq. (B.2) into the matrix , and stacking -matrices into the matrix . Computing the least squares solution is straightforward and leads to

(B.3)

Combining this with Eq. (B.1) we readily arrive at

(B.4)

from which we find our final estimate by inversion.

The advantage of estimating rather than is that the latter alternative would require the inverses of each , while the above method essentially inverts the average of all the , which has a lower variance.

To compute the TLS solution we define and . Then Eq. (B.1) can be written or

(B.5)

where is the error matrix. Because there are