Understanding the physics of oligonucleotide microarrays: the Affymetrix spike-in data reanalysed
The Affymetrix U95 and U133 Latin Square spike-in datasets are reanalysed, together with a dataset from a version of the U95 spike-in experiment without a complex non-specific background. The approach uses a physico-chemical model which includes the effects the specific and non-specific hybridisation and probe folding at the microarray surface, target folding and hybridisation in the bulk RNA target solution, and duplex dissociation during the post-hybridisatoin washing phase. The model predicts a three parameter hyperbolic response function that fits well with fluorescence intensity data from all three datasets. The importance of the various hybridisation and washing effects in determining each of the three parameters is examined, and some guidance is given as to how a practical algorithm for determining specific target concentrations might be developed.
A number of papers [15, 16, 7, 4, 6, 10, 14, 8, 17] have used chemical adsorption models to analyse data from two well-known Affymetrix Latin-Square spike-in experiments , known as the U95 and U133 datasets. The immediate aim of these papers has been to study the physical processes responsible for converting concentrations of specific target RNA in prepared solutions hybridised onto microarrays to measured fluorescence intensities. Their ultimate aim has been to provide biologists with a practical algorithm for estimating absolute specific target concentrations in the presence of a complex non-specific background from fluorescence intensity data. Even though there are a number of existing algorithms (or “expression measures”) of varying degees of statistical sophistication currently available to and widely used by biologists, such algorithms pay little attention to the detailed physics and chemistry of hybridisation at the microarray surface. As a result they are prone to underestimating fold changes at high target concentrations because saturation effects are not modelled, and at low target concentrations because of a failure to account adequately for non-specific hybridisation . Furthemore, the measures reported are not target concentrations as such, but surrogates derived directly from fluorescence intensities. At best, they could be described as an unknown increasing function of specific target concentrations, modulo statistical noise.
Before a reliable algorithm for inferring target concentrations can be developed, an accurate model of the physics and chemistry of the process is needed. In a recent review of chemical adsorption effects at the microarray surface , Binder has described in detail a number of processes influencing fluorescence intensity measurements, including bulk dimerisation of target molecules in solution, non-specific hybridisation and probe folding at the microarray surface, and partial zippering of duplexes. Analyses of the Affymetrix spike-in data have provided strong evidence that these effects cannot be ignored. For instance, Binder and Preibisch  have isolated the effects of specific and non-specific binding, Carlon and Heim and Heim et al.  have stressed the importance of bulk hybridisation and target folding in solution, and by analysing other data sets, Matveeva et al. have produced evidence for the importance of probe folding.
More recently, analyses of the U95 dataset [8, 17] and subsequent experimental evidence  have demonstrated the significant influence of the post-hybridisation washing step in determining fluorescence intensities. Although the washing step has an important scaling effect over the whole range of target concentrations, it is particularly noticeable as a probe sequence dependent scaling of the asymptote of the measured intensity isotherm at saturation concentrations. Because adsorption theories of the hybridisation step which neglect the post-hybridisation washing predict a common asymptote for these isotherms, many adsorption-model-based analyses of the Affymetrix Latin Square data sets have forced the data to fit a common asymptote [16, 10, 14, 6], in spite of strong statistical evidence to the contrary [7, 15].
In this paper we reanalyse both the U95 and U133 datasets, and a third dataset which is analogous to the U95 dataset, but without the complex background. Our analysis uses a chemical adsorption model which includes a number of chemical reactions occurring during the hybridisation and post-hybridisation washing steps. Details of the datasets are summarised in Section 2, and all three datasets are shown to be consistent with the empirical observations previously observed for the U95 dataset, namely that measured fluorescence intensities follow a hyperbolic isotherm with probe-sequence dependent parameters . Our physico-chemical model is described in Section 3 and demonstrated to be quantitatively consistent with these observations in Section 4. A quantitative analysis of how well the model fits the three datasets is given in Section 5. An analysis in Section 6 gives some indication of how intensity measurements over a whole microarray might be used to infer some of the isotherm parameters for each feature on the microarray, which in turn has the potential to contribute towards development of a practical algorithm for estimating target concentrations. Concluding comments are given in Section 7.
2 Datasets and empirical fits
Three datasets are analysed in this paper, the two publicly available Affymetrix Latin-Square spike-in experiments , known as the U95 and U133 datasets, and a version of the U95 dataset without the complex human pancreas background, kindly made available to us by Affymetrix. We will refer to these datasets as numbers I, II and III respectively.
In the manufacture of Affymetrix GeneChip® arrays, single strand DNA probes, 25 bases in length are synthesized base by base onto a quartz substrate using a photolithographic process. They are attached to the substrate via short covalently bonded linker molecules roughly 10 nanometres apart. A microarray chip surface is divided into some hundreds of thousands of regions called features, commonly 11 to 20 microns square, and with the single strand DNA probes within each feature being synthesized to a specific nucleotide sequence.
The main step in the laboratory process of gene detection with microarrays is the hybridization of cRNA target molecules, fractionated to lengths of typically 50 to 200 bases and with biotin labels attached to their U and C bases, onto the single strand DNA probes. The hybridisation is performed at C for 16 hours. The microarray is then washed to remove excess cRNA target, the biotin labels stained with fluorescent dye, and the density of hybridized probe-target duplexes in each feature detected via intensity measurements of the fluorescent dye. Each gene or EST is represented by a set of pairs of features (16 pairs in the case of the U95 dataset and 11 pairs for U133) using sequences of length 25 selected for their predicted hybridization properties and specificity to the target gene. The first element of the pair, termed the perfect match (PM), is designed to be an exact match to the target sequence, while the second element, the mismatch (MM), is identical except for the middle (13th) base being replaced by its complement.
In the Affymetrix spike-in experiments, RNA transcripts were spiked in at cyclic permutations of a set of known concentrations together with a complex background of cRNA extracted from human pancreas (dataset I), human adenocarcinoma cell line (dataset II), or no background (dataset III). Each of datasets I and III consist of PM and MM fluorescence intensity values from a set of 14 probe sets corresponding to 14 separate genes, each containing 16 probe pairs. For each probe set a set of fluorescence intensity values was obtained for the 14 spiked-in concentrations 0, 0.25, 0.5, 1,2, …, 1024 pM. In common with previous analyses of dataset I, the following analysis of datasets I and III is restricted to 12 of the 14 genes, omitting data from the two defective probesets, 36889_at and 407_at. Dataset II consists of PM and MM fluorescence intensity values from a set of 38 probe sets corresponding to 38 separate genes, each containing 11 probe pairs. For each probe set a set of fluorescence intensity values was obtained for the 14 spiked-in concentrations 0, 0.125, 0.25, 0.5, 1, …, 512 pM. Dataset II also contains data from a further 4 bacterial gene probe sets each containing 20 probe pairs, which we do not include in the current analysis.
In a previous paper  it was demonstrated that the dataset I is consistent with the empirical observation that the measured fluorescence intensities at spike-in concentration are sampled from a Gamma distribution with constant coefficient of variation about a mean given by a hyperbolic response curves of the form
Importantly, it was further shown using a thorough statistical analysis that each of the parameters , and is feature (i.e. probe-sequence) dependent, and that the asymptote, , is almost invariably lower for the MM feature than the neighbouring perfect-match PM feature within any PM/MM pair.
On the assumption that by far the greatest proportion of the intensity signal across a whole microarray in either dataset I or II comes from the complex background, we have preprocessed the data by carrying out a quantile normalisation across each of these two datasets . Thus all microarrays within a particular dataset have a common distribution of fluorescence intensities after preprocessing. Because of the absence of a complex background, we have not carried out this step for dataset III. Instead we have included in the fit the “wafer dependent scaling” described as Model E in ref.  to allow for the fact that the three replicates of the experiment used microarrays constructed from three separate wafers. That is to say, we fit to the data a model of the form , where the index labels the three replicates, the index labels a feature within a probe set and the scaling factors satisfy .
In Figures 1 and 2 are plotted fits of fluorescence intensity data to the response curve Eq. 1 for one of the spiked-in transcripts for datasets I and III. A complete set of analogous plots for all transcripts from all three datasets can be found at http://dayhoff.anu.edu.au/~conrad/Spike_in_Isotherms/.
For a small minority of features the fit gives negative values to some of the parameters , or , whereas the physical model set out in Section 3 predicts that all three parameter should be positive. This problem is more slightly more prevalent for MM features than for PM features, and is most acute for dataset II. In some cases, such as probe number 3 in Fig. 1, it appears that the data does not extend far enough into the high concentration, i.e. saturation, limit to allow an acceptable fit. In other cases, such as probe number 9 of Fig. 2, the data may be too noisy. The range of spike-in concentrations used in Dataset II is shifted downwards from that of datasets I and III, and as such may not be adequately probing the saturation region to give acceptable fits in all cases. Furthermore, the spike-in concentrations at the lower end may be probing the regime in which the target concentration is depleted during the hybridisation step, which is beyond the applicability of the model leading to Eq. 1 which we describe below. The analyses in Sections 4 and 5 below are restricted to the subset of fits to Eq. 1 for which all three parameters , and are positive. Table 1 gives the coefficient of variation of the fitted data for each dataset, and the percentage of features for which an acceptable fit with three positive parameters to the hyperbolic response function was obtained. In general, agreement with a hyperbolic response curve with positive parameters is excellent for datasets I and III, and reasonable for dataset II.
|Dataset||Coef. of||% of accepted fits|
3 The model
Consider the response of a given feature to a spike-in concentration of a particular RNA transcript in the presence of an unknown complex background of non-specific target RNA. We write the measured fluorescence intensity in the form
where is a physical background due to effects unrelated to fluorescent label carrying duplexes, such as reflection from the glass surface of the microarray, and is the maximum fluorescence intensity, that is, the contribution from fluorescent dye if all probes on the feature were occupied with labelled probe-target duplexes. It is argued in ref.  that the maximum intensity should vary only weakly due to differing probe sequences. Throughout this paper we assume and to be fixed constants for a given microarray. The coverage fraction, , is the fraction of probes on the feature carrying probe-target duplexes at the time of scanning. It satisfies .
The coverage fraction is determined by a number of reactions between various chemical species. The species and reactions considered in our model are set out in Tables 2 and 3 respectively. The first five reactions in Table 3 are assumed to reach equilibrium during the hybridisation step. The rate constants and are the ratio of the forward to backward rates for each reaction. The washing phase, which is primarily designed to remove unbound targets before scanning, also dissociates some duplexes[8, 17]. Thus the last two reactions are unidirectional as dissociated duplexes are continuously flushed out of the cartridge and replaced with a buffer solution containing no RNA.
|Single strand PM-feature-specific RNA target in solution|
|Single strand non-specific RNA target in solution of species|
|Bound RNA/RNA duplex in solution unavailable for hybridisation|
|Folded specific target in solution rendered unavailable for hybridisation|
|Unbound probe at the microarray surface|
|Duplex formed from PM-feature-specific RNA target binding to probe|
|Duplex formed from non-specific RNA target binding to probe|
|Folded probe at the microarray surface rendered unavailable for hybridisation|
|In bulk solution||Non-specific hybridisation|
|Specific target folding|
|At the microarray||Specific hybridisation|
|During the||Dissociation of specific duplexes|
|washing phase||Dissociation of non-specfic duplexes|
The effect of the first two reactions, bulk hybridisation and specific target folding in solution, is to reduce the concentration of specific target available for hybridisation onto the microarray from its spike-in value of to a value , that is the concentration of single strand RNA target in solution. (Following the usual convention, square brackets indicate concentrations.) For these reactions we follow the analysis of ref. . The label in Table 3 ranges over all possible subsequences of the 25-mer part of the specific target RNA sequence complementary to the PM probe. The species in this reaction includes any target RNA molecule containing a subsequence complementary to the th subsequence. At equilibrium, we have
The relation then gives
The next three reactions, occurring at the microarray surface, determine the duplex coverage fraction of the feature at the end of the hybridisation step, and before washing. Let the fraction of probes on the feature that have formed duplexes with either specific or non-specific target mRNA molecules and survived to a time after the commencement of the washing process be . We split the fraction of probes which have formed duplexes at the end of the hybridisation step and before washing into two contributions:
The first contribution, , is that due to duplexes formed with specific mRNA targets, and the second contribution, , is that due to duplexes which have formed with non-specific mRNA targets, the sum being over targets containing a subsequence complimentary to the th subsequence of the probe.
where, following the notation of ref., we define
The calculation leading to these isotherms assumes negligible depletion of target molecules in bulk solution during the hybridisation process.
Note that, in the asymptotic limit of high spike-in concentration, namely while holding constant, Eqs. 4 to 9 imply and , implying that the feature becomes saturated with specific duplexes. This is contrary to the differing isotherm asymptotes observed empirically. To explain the differing asymptotes, we include in our model the final two reactions in Table 3, namely the washing step . Define the probability that a given probe-target duplex has survived up to a washing time to be for a specific duplex and for a non-specific duplex of species . The survival functions and depend on probe and target base sequences, satisfy , are positive and are monotonically decreasing. Defining an average non-specific survival function by
the coverage fraction at washing time is then
Eqs. 14 to 17 (with the help of Eqs. 5, 9 and 10) relate the empirically fitted parameters , and to underlying physical quantities, namely , , the concentrations of chemical species in Table 2, reaction rates in Table 3 and washing survival functions and .
Physical effects which have not been included in the model include target depletion, which should only manifest at extremely low target concentrations, incomplete probe synthesis during the manufacturing process , and probe-probe interactions. Each of these effects will, in theory, cause the response curve to deviate from a hyperbolic form. A discussion of probe-probe interactions, for instance, can be found in ref. . The choice of model in this paper is guided by a desire to balance complexity of the problem with practicality.
4 Qualitative behaviour of the fits
Before considering a detailed analysis of the ability of the model to explain the parameters of the empirical fits, one can carry out a number of simple qualitative checks. The first three panels of Figure 3 compare the fitted saturation asymptotes for PM/MM pairs of features for each of the three datasets. Consistent with the hypothesis that a portion the bound probe-target duplexes are removed during the washing step, the asymptotes cover a broad range of values. The MM asymptote is almost always less than its PM partner, consistent with the scenario that a saturated feature of PM duplexes will lose less duplexes to washing than the partner saturated feature of less tightly bound MM duplexes . The observed pattern breaks down at higher values of , as the fits must be extrapolated further past the highest spike-in concentration to estimate the asymptote, and numerical accuracy is lost. This is most evident for dataset II, for which spike-in concentrations only extend to 512 pM, compared with 1024 pM for datasets I and III.
From Eqs. 15 and 17, the saturation asymptote is given by . This depends only on the rate at which specific duplexes are dissociated by the washing process, and not on the properties of non-specific duplexes. Thus the model predicts that the asymptote of the response function is unaffected by non-specific hybridisation. The fourth panel of Figure 3, which compares the asmptote for dataset I, (U95 with a complex non-specific background), with that for dataset III, (U95 without a non-specific background), confirms this. That is, the asymptote for any feature is the same for dataset I as for dataset III to within the standard errors of the isotherm fits.
The parameter in Eq. 1 is the baseline intensity estimate at zero spike-in concentration. From Eq. 17, it consists of a physical background component , and a component due to non-specific hybridisation, . Consistent with this, the values, shown in Fig. 4, are spread over a much broader range for datasets I and II in which a complex mRNA background was present than for dataset III with no background and therefore little non-specific hybridisation.
An obvious pattern, which emerges from comparing from PM/MM pairs of features in datasets I and II, is that non-specific hybridisation is stronger for a probe whose middle base is a pyrimidine (C or T) than for its partner probe whose middle base is a purine (A or G). This effect has been noted previously for microarray intensity data generally, and there is some debate about the its physical origins [20, 5, 11]. The effect is better understood at the level of individual letter frequencies. Binder et al. have noted that probe sensitivities increase with C-content, and decrease with A-content, while the G- or T-content of the probe has little effect. There are probably two effects contributing to this pattern. Firstly, the averaged contributions to DNA/RNA binding energies calculated from nearest neighbour stacking models  are ordered as [3, 11]
causing the substitution of a pyrimidine by a purine to decrease probe sensitivity and vice versa. Secondly, there is the simple geometric effect that pyrimidines, having a small single ring structure, will tolerate mismatches more easily than purines, which have a double ring structure, and would therefore need to deform the molecular backbone to bind with a target closely matching the probe sequence either side of the central base. No obvious pyrimidine/purine asymmetry is observed in the values from dataset III. This is to be expected as the parameter in this case is essentially the physical background without hybridisation contributions.
The fourth panel of Fig. 4 shows a histogram of values from dataset III, for which there is no complex background present. There is very little non-specific hybridisation, and most of this data represents statistical noise in the physical background parameter (defined in Eq. 2). Ignoring the tail, which we assume to be due to a small amount of non-specific hybridisation from the other spiked-in transcripts in the latin square protocol, we estimate by fitting the left hand part of the histogram to a gamma distribution in the unlogged data. The fitted distribution has a mean of 59 and a coefficient of variation of 0.057.
Various comparisons of the effective adsorption rate constant are shown in Fig. 5. This parameter is modelled in terms of more fundamental rate constants via Eq. 16. One can reasonably assume the differences between PM and MM for the indirect effective rate constants , and to be small compared with that for the specific rate constant , because of averaging over large numbers of non-specific species. Thus, to a reasonable approximation, , where is the difference in specific binding energies between a specific mRNA target and a PM and MM probe respectively. This empirical result has been noted previously as a shift away from the diagonal in a plot of versus for dataset I  and for dataset II , and is confirmed here for all three datasets.
Estimates of were obtained by taking a weighted average of over fitted isotherms for each of the central base letters. The weighting , where is the standard error in estimated from the standard error in arising from the isotherm fit is chosen to minimise the error in the average. The results, given in Table 4, are consistent with the ordering of binding energies calculated from nearest neighbour stacking models in Eq. 18.
The third panel of Fig. 5 compares the effective rate constant for the U95 experiments with and without the complex human pancreas background. From Eq. 16, one sees that removing the background, which has the effect of setting and to zero, should increase . This is confirmed in the plot, and is also evident as a shift in the inflection points to the left between Figs. 1 and 2. The fourth panel compares the amount by which increases as the complex background is removed for PM and MM. We observe that removing the effects of and affects for a PM probe and its MM partner by a similar factor. The small number of points away from the diagonal line to the left of the plot are caused by the difficulty in fitting the MM isotherm when is beyond the upper limit of the range of spike-in concentrations.
5 Quantitative behaviour of the fits
In this section we explore the ability of the model to explain the quantitative relationship between the fitting parameters of the hyperbolic response curve Eq. 1 and known physical properties of microarrays. We divide the analysis into two parts: The parameters and which set the vertical scale of the hyperbolic response curve, and the parameter which sets its horizontal scale.
5.1 Vertical scale parameters
The parameters and are explained in the model in terms of the more fundamental quantities , the physical background and , saturation intensity above background (which together set the intensity scale in terms of the dimensionless duplex coverage fraction) and and (which are driven by the chemical reactions in Table 3).
We begin with and , which are assumed to be fixed for an entire microarray. In Fig. 6 are plotted histograms of measured fluorescence intensities across microarrays from datasets I and II. Because these data have been quantile normalised, a common histogram will apply to all microarrays within a given dataset. In the absence of statistical noise, the parameters and should provide bounding limits for these histograms. However, given the coefficients of variation reported in Table 1, the raw intensity measurements could well extend beyond these limits. According to Eq. 17 the parameter should also be a lower cutoff on the fitted parameter (corresponding to the case of negligible non-specific hybridisation), while the analysis of in Section 4 for dataset III (see the fourth panel of Fig. 4) suggests a much smaller coefficient of variation for the fitted value of than for the intensity data generally. For these reasons we use as an estimate of the minimum over all fits within a dataset of the parameter . These estimates are shown in Fig. 6, together with bars extending two standard deviations either side using the coefficients of variation in Table 1.
|Defining equation||Parameter||Dataset I||Dataset II||Dataset III|
|( Model 2)|
|( Model 5)|
|( Model 7)||69.4|
To estimate the saturation parameter from fits to the hyperbolic response function, and to explain the observed values of the combination , we make use of the model’s prediction that the asymptotic intensity at high spike-in concentration, , is determined by the washing-step survival function of PM-specific duplexes, . Thus from Eqs. 15 and 17 we have
where we assume a uniform washing rate that depends only on the probe and target nucleotide sequences via their binding affinity. Following Ref. , we model in terms of the RNA/DNA duplex free binding energy in bulk solution:
Here is calculated using the nearest neighbour stacking model and parameters of Ref. , is the gas constant, the absolute temperature and and are undetermined constants. We use the convention that is negative for a bound state. Rearranging gives
where the and are determined for each feature from fitted hyperbolic response functions, has been estimated above, and , and are fitting parameters. Fits to datasets I and II are shown in Fig. 7, and fitting parameters listed in Table 5. The fits were done by minimising the sum of the squares of the residuals with respect to the three fitting parameters.
The parameter is in principle predicted by the model in terms of fundamental physical constants via Eqs. 10 and 14. It is determined mainly by non-specific hybridisation, including a strong influence from the probe sequence pyrimidine content as suggested by Fig. 4, and by probe folding. Without a detailed knowledge of the composition of the non-specific target solution, a direct evaluation of is of course impossible. However, one can aim for an ansatz in terms of those quantities which are known. Following the reasoning used above, the washing-step survival function of the th non-specific species can be assumed to take the form of a double exponential function of the corresponding free binding energy . The value of the double exponential function () undergoes a changeover from 1 for to 0 for over a narrow range of its argument. Thus the factor in Eq. 10 can be thought of as a switch with removes from the sum any binding configuration less tightly bound than some threshold.
The numerator of Eq. 14 is then a sum of reaction rate constants , weighted by the number of target molecules in solution with nucleotide sequences complementary to the th subsequence of the probe. Numerical estimates carried out in the context of bulk hybridisation in solution show that such a sum can be well approximated as an exponential function of free binding energy of the entire probe sequence to its complement. Assuming then that the behaviour of is dominantly exponential in , and taking into account the added effect of the probe’s pyrimidine count, we have tested the following nested models:
where are fitting parameters to be determined, , is a count of the number of pyrimidines in the 25-mer probe sequence and is the residual error, which is assumed Gaussian.
To include the effect of probe folding, we approximate in Eq. 14 by a single exponential term
where and are fitting parameters and , where is calculated for each 25-mer probe sequence from Zuker’s Mfold web server  with the temperature set to C and other parameters set to their default values. The Mfold web server calculates the free energy of the most energetic folding configuration of a given single strand DNA sequence, though ideally one should include a Boltzman weighted sum over all possible folding configurations. Models 1 and 2 are then nested within Models 4 and 5 respectively:
The above nested models can be tested for the significance of the extra parameters introduced in going from a less to a more complicated model. For instance, to test the significance of the extra parameter distinguishing model from the simpler , consider for each model the residual sums of squares , where the sum is taken over fitted data points. Under the null hypothesis that the extra complexity is not significant, and assuming the data points to be independent, the test statistic defined by
(where and count the residual degrees of freedom of and respectively) has an F distribution with degrees of freedom equal to and . This allows us to assign a p-value to the significance of model over .
|Parameter||Dataset I||Dataset II|
|model 0 to model 1:|
|model 1 to model 2:|
|model 2 to model 3:||0.0098||0.648|
|model 1 to model 4:||0.60|
|model 2 to model 5:||0.81|
|model 4 to model 5:||0.00064|
We have fitted each of the five models to the combination using and from Table 5 and from the hyperbolic response function fits of both PM and MM data for datasets I and II separately. The number of data points fitted, that is, the number of fitted hyperbolic response functions for which all three parameters , and are positive, was 364 for dataset I and 460 for dataset II. Table 6 gives the calculated p-values.
The parameters and modelling linear effects in and respectively are highly significant in both datasets. The parameter defining a mixed effect is barely significant at the 1% level in dataset I and not significant in dataset II, and we shall ignore it from here on.
The probe folding effect is highly significant for dataset I, but not significant for dataset II. Note that Models 4 and 5 contain the functional form
for . Thus the probe folding effect “switches on” once the energy of a folded probe is below some threshold , at which point the effect becomes linear in . From Table 5, the fitted value of in Model 5 for dataset I is 4.57, whereas the range of probe folding energies calculated by Mfold for the probe sequences of the U95 microarray is . Thus is well above the folding energy of any of the probes, implying that the probe folding effect is effectively linear for dataset I. On the other hand, the fitted value of in Model 5 for dataset II is , which is below the range of calculated probe folding energies for the U133 microarray, implying that the probe folding is switched off for dataset II. There is no obvious reason why the probe folding parameter should should shift markedly from one spike-in experiment to another, given that, although the U133 probeset nucleotide sequences are completely redesigned from those of the U95 microarray, the experimental protocol and geometry of the microarray should be similar. One, albeit prosaic, explanation may simply be that dataset II is noisier (see Table 1) and the probe folding effect has been lost in the noise.
5.2 Horizontal scale parameter
The parameter sets the horizontal scale of the hyperbolic response function Eq. 1. is an estimate of the spike-in concentration required to give a fluorescence intensity half way between the background, zero concentration level and the asymptotic, infinite concentration level. Our model in Section 3 explains as an effective rate reaction constant which is determined by the reactions occurring during the hybridisation step, and which is unaffected by the washing step. As was the case for the vertical scale parameters, much of the information required to evaluate from first principles is unknown, and so we try for an ansatz based on probe sequences and free binding energies.
Eqs. 16, 5 and 9 give in terms of reaction rate constants and concentrations of reactants. In general, each term or occurring in Eq. 16, where labels one of the reactions in Table 3, is a sum of terms of the form , weighted by the concentration of reactant . Once again we will be guided by Heim et al.’s numerical estimate of , and approximate each sum as a single exponential term. Thus we write
where the and are fitting parameters, and , with the effective binding energy for each reaction is estimated from some external physical model. With the sign convention that is defined negative for a bound state, each is expected to be positive.
Consider first dataset III, for which the complex non-specific background is absent. In Eq. 16 we can set the non-specific binding terms and to zero, giving
The rate constants are modelled as
For the free binding energy we use the nearest neighbour stacking model parameters of Sugimoto et al., for we use Xia et al.’s nearest neighbour stacking parameters for RNA binding to RNA , and for we use Zuker’s Mfold web server . The Mfold web server also has the facility to calculate folding energies of RNA targets. However, since for RNA target folding we are interested in the propensity for the 25-mer stretch of target complimentary to a given probe to bind with any segment of the much longer target RNA (or possibly another RNA molecule), we believe it is more appropriate to model target folding in solution using an RNA-to-RNA binding energy.
To try to understand the relative importance of each of the effects contributing to the effective rate constant we have analysed a set of models containing nested pairs, the relationship between which is illustrated in Fig. 9:
where is the residual error, which is assumed Gaussian. The first term in each of Models 1, 4, 5 and 7 could equally well be written as to make the nesting explicit, but for convenience we use a parameterisation based on Eq. 33. Models 1 to 3 include only the effects of specific hybridisation, target folding in bulk solution and probe folding respectively. Models 4 to 6 include pairwise effects, and Model 7 includes all three effects. A recurring theme in these models is the functional form of Eq. 30. Thus the target and probe folding effects “switch on” once the binding energy is below (i.e. more tightly binding than) thresholds and respectively, and have the effect of reducing the effective rate constant .
|Parameter||Dataset I||Dataset II||Dataset III|
|model 0 to model 1:|
|model 0 to model 2:|
|model 0 to model 3:|
|model 1 to model 4:|
|model 1 to model 5:|
|model 2 to model 4:|
|model 2 to model 6:|
|model 3 to model 5:|
|model 3 to model 6:|
|model 4 to model 7:|
|model 5 to model 7:|
|model 6 to model 7:|
|model 7 to model 8:|
P-values calculated from the F-statistic, Eq. 29, testing the pairwise comparative significance of these models are given for Dataset III in the right hand column of Table 7 (see also Fig. 9). Results are shown for PM data only as no complete set of stacking model parameters for with mismatches is available. Comparisons of Model 0 with Models 1 to 3 indicate that, taken in isolation, the specific hybridisation and bulk target folding effects appear not to be significant, whereas the probe-folding effect appears to be highly significant. This is also illustrated in Fig. 10. However, when taken in combination with probe folding, the analysis shows specific binding and target folding to be significant at the 1% level (see the comparisons Model 3 to 5 and 3 to 6 in Table 7). Thus we accept Model 7 for Dataset III. The fitted parameters are given in the right hand column of Table 5. Note that each is positive as required of the model.
For Datasett III, the apparent non-significance of the specific hybridisation and bulk target folding effects in Models 1and 2 can be explained as follows. In Model 7, the bulk folding is a stronger effect than specific hybridisation by a factor of 2 ( in Table 5). Furthermore, from Eq. 30, the bulk folding effect is opposite in sign to the specific hybridisation effect, and only comes into effect for . Also, it turns out that and are very highly correlated, with a Pearson correlation coefficient of 0.89. Examination of the first two plots in Fig. 10 shows a tendency of the data to increase with to start with, while the bulk target folding dominates, and then to decrease once the bulk folding effect switches off and the specific hybridisation effect takes over. Attempting to fit a straight line through data which first increases and then decreases has resulted in the conclusion that the term linear in in Model I is not significant. A related statement has been made by Carlon and Heim , namely that the effective target concentration needs to be appropriately “rescaled” for those targets with a high binding affinity in bulk solution in order to see the expected relationship between and .
We now turn to Datasets I and II. In the presence of a complex non-specific background, and are reinstated in Eq. 16. The bulk hybridisation effect will be a sum of exponentials of , and its modelling can be absorbed into that for bulk target folding, while the non-specific effect will be a sum of exponentials of . Thus we set
which suggests one further model:
Turning to Table 7, columns I and II, we discover that the extra parameters introduced to account for non-specific probe-target binding are not significant at the 1% level (see also Fig. 9). This surprising result can be explained by the fact that the fitted values of are in both cases close to the maximum value of within the dataset, so most of the fitted points fall into the regime of Eq. 30, and the effect is adequately covered by the term of Model 7. To further illustrate the point, fits to Models 1, 2 and 3 are plotted In Fig. 11. If Model 1 is taken in isolation, appears to have the “wrong” sign, as the non-specific probe-target binding and target folding and binding in solution all combine to dominate the specific binding effect. A similar result is observed for dataset II.
The generally small p-values in the first column of Table 7 indicate that Model 7 is an appropriate description of the parameter for dataset I. For dataset II the picture is less clear. In agreement with the analysis of the parameter , the probe folding is in general less significant. Nevertheless, for consistency we list the fitting parameters of Model 7 to both datasets in Table 5, while acknowledging there is redundancy in the Dataset II parameters.
5.3 How close are the fits?
Fig. 12 gives some idea of how much information has been lost in the above fits. The plot compares estimated parameters , and of the fitted hyperbolic response curves, such as those in Fig. 1, with those which would be predicted by the fitting constants and parameters listed in Table 5, namely
Dotted lines either side of the diagonal are the boundary of the region within which predicted values do not differ from the original fitted parameters by more than a factor of 2. A clear majority of estimates of and fall within this range. Clearly the most difficult parameter to explain adequately is the horizontal scale , owing to the large number of contributing chemical reactions. In general, dataset II has proved to be more problematic than dataset I, probably because the concentration range tested does not extend far enough into the saturation regime to demonstrate a clear hyperbolic isotherm.
6 Parameter prediction
For the above model to be of value in constructing a practical algorithm for inferring target concentrations, some or all of its parameters should ideally be predictable using only information available to experimental biologists. That available information consists of fluorescence intensities for the complete set of features on each microarray used in an experiment, the probe sequences of each feature, and any parameters associated with the experimental protocol. By contrast, the fitted parameters of Table 5 were obtained from spike-in experiments. Comparing datasets I and III in Figs. 4 and 5, one sees for instance that the unknown nature of the complex background has a profound effect on the parameters and of the hyperbolic response function. At first sight it appears one may need a new set of spike-in data for each experiment, which is clearly not a practical consideration. However, we argue here that if one exploits the distribution of fluorescence intensities from the entire microarray, an estimation of vertical scale parameters at least may be possible.
In the following qualitative description we propose a two step process for the vertical scale parameters, in which the physical background and maximum intensity for a microarray are first determined from the entire distribution of intensities over the microarray. The intensities can then be scaled to the dimensionless coverage fraction via Eq. 2, and one is left with the remaining problem of estimating the parameters and , which are driven by the chemical reactions of Table 3.
To estimate and , consider the histograms in Fig. 6. For both datasets I and II, our estimate of the physical background , based on hyperbolic response curves derived from spike-in data, is close to two standard deviations above the minimum measured fluorescence intensity. Assuming an experiment consisting of a number of technical replicates of each hybridisation setup, the data can be quantile normalised across replicates. A representative minimum intensity can be obtained by fitting a suitable smooth curve to the logged histogram (i.e. the lower panel of Fig. 6), and the coefficient of variation in the data easily estimated from the replicate intensity values over the whole microarray. then gives an estimate of .
Estimating from the histogram proves to be quite difficult because of the gradual tail at its the right hand end. With some experimentation we find that a cubic fit to the logged histogram over the range , where and are the lower and upper extremities of the histogram, crosses the line close to two standard deviations above the previously obtained estmate of . Calling this point , an estimate of is then . However, we find that such a method is highly sensitive to the range over which the cubic is fitted.
To gain some insight into how and may be estimated, consider the scatter plot, Fig. 13, of fluorescence intensities against the theoretical free binding energy obtained from the probe letter sequences using the nearest neighbour stacking model . Superimposed on these plots are the fits from Eqs. 47 and 48 to the asymptotic saturation intensity and background intensity at zero spike-in concentration , using the parameters of Table 5. According to our model, the asymptote curve should form an upper envelope to the data, with some slight leakage across the envelope due to the finite coefficient of variation in the data. Because the vast majority of the genes are not expressed in RNA samples taken from a typical cell, most of the data is expected to lie along, or close to, the lower set of background intensity curves. Indeed this is precisely what is seen. Conversely, in the absence of spike-in data, there is potential to estimate the upper, asymptote intensity curve by fitting an envelope to the data and the lower, background intensity curve by fitting a curve through the ridge of the scatter plot’s contour lines. In principle, these fitted curves, together with and then determine estimates of and for each feature on the microarray.
7 Conclusions and outlook
This paper concentrates mainly on the immediate aim of understanding the physical processes at work in the operation of microarrays, but in doing so highlights some of the challenges and, we hope, gives some guidance, for meeting the ultimate aim of providing an algorithm for converting the set raw fluorescence intensities from microarrays to absolute target concentrations.
The model we have examined includes the effects of specific and non-specific hybridisation, folding and hybridisation in bulk solution of target RNA, the folding of probes at the microarray surface, and the removal of signal during the post-hybridisation step. It leads to the hyperbolic response curve (or Langmuir isotherm) of Eq. 1 with three fitting parameters , and , which depend on a set underlying physical physical parameters including chemical reaction rate constants, washing survival functions and RNA target concentrations. In more practical terms, all three parameters will depend on the probe letter sequence, whether the probe is PM or MM, the nature of the complex non-specific background, and experimental protocols such as hybridisation temperature and washing times. Determining the parameters only from information likely to be known to biologists in a practical situation, as opposed to a highly controlled spike-in experiment, remains a formidable task.
The model is tested against the Affymetrix U95 spike-in datasets with and without a complex non-specific background and the Affymetrix U133 spike-in dataset. In general, agreement with a hyperbolic response curve is excellent for the U95 spike-in datasets and reasonable for the U133 spike-in dataset (see Table 1).
The response function parameters and set the vertical scale of the isotherm, that is, the scale of the measured fluorescence intensities. The parameter is a combination of a relatively straightforward physical background, and a non-trivial contribution from the complex non-specific background. It is important to understand the nature of the non-specific background component as it is responsible for the “bright mismatches” problem which complicates the naive PM MM subtraction scheme used in the MAS5 algorithm, for instance, for dealing with non-specific hybridisation. Our analysis shows that the DNA/RNA binding energy, pyrimidine content, and (in the case of U95 dataset) the folding of probes, all contribute significantly to the value of this parameter. The dependence of on binding energies and pyrimidine count is illustrated in Fig. 8.
The parameter (or, more precisely, the combination , where in general) is mainly concerned with the asymptote at high spike-in concentration, which, according to our model, is driven by the washing step. The qualitative prediction that it should be less for a mismatch feature than for a perfect match feature in a PM/MM pair is verified for all three datasets in Fig. 3. Its quantitative behaviour as a function of specific probe-target binding energies, using bulk solution free binding energies as a guide, is verified in Fig. 7.
The parameter can be thought of a an effective overall reaction rate. It sets the horizontal scale of the isotherm, that is, the scale of the specific target concentration. If an algorithm to determine absolute target concentrations is to be constructed, it is necessary to understand this parameter. Because of the large number of hybridisation reactions which have the potential to contribute it is by far the hardest of the three parameters to explain effectively. The model predicts that it is affected by all of the reactions listed in Table 3 occurring in the bulk hybridisation solution and at the microarray surface, but is unaffected by the dissociation during the washing phase. Analysis to determine which reactions are significant is complicated by the fact that the effect on of non-specific hybridisation and probe and target folding act in the opposite direction to that of specific binding, so obscuring the effects. Nevertheless, our analysis indicates that all of the hybridisation reactions considered are potentially significant contributors.
A common practice in previous studies [15, 16, 7, 10, 14, 17] has been to invert fits of hyperbolic response curves to recover spike-in target concentrations in order to test the predictive ability of models. We have deliberately refrained from doing so here, as one of the results of this study has been to demonstrate the strong dependence of the parameters of the isotherm on the (in practice unknown) complex non-specific background. Recovering spike-in concentrations using fitting parameters which implicitly contain information about the background belonging to a particular dataset is an inherently circular argument and is guaranteed to give unrealistically good results.
Instead, in Section 6, we address the problem of determining the hyperbolic response function parameters from information likely to be available to biologists in a typical microarray experiment, that is, fluorescence intensities for the complete set of features on each microarray, the probe sequences, and parameters associated with the experimental protocol. We argue that information for the vertical scale parameters is in principal implicitly contained in the distribution of intensities across the microarray by partitioning the intensities by quantities which can be estimated from probe sequences such as probe-target binding energies, probe folding energies and probe pyrimidine content. Determination of the horizontal scale parameter is a more formidable and open problem.
Ultimately, our aim is find practical algorithms for biological analysis through an improved understanding of the physics of microarrays. A problem encountered in in this paper, particularly with analysis of the horizontal scale parameter of the isotherm , has been the difficulty encountered unravelling mutual correlations and compensations between competing effects. In such cases one can never be totally certain, given the available data, that the physical interpretation is correct. However, one could argue that is not necessary, nor possibly even helpful, to know all of the contributing physical effects in order to meet our ultimate aim. If, for instance, our interpretation of the significance analysis of various models of the parameter in Section 5.2 is correct, we discover that we often do just as well by modelling a number of physical effects by a single effective term. In general, the lesson is that it is important to strike a balance between attempting to understand and incorporate all physical aspects on the one hand, and relying on empirically determined effective models on the other.
The above analysis has concentrated on Affymetrix gene expression microarrays. Genomic microarrays come in a variety of types for a variety of tasks. As well as gene expression arrays there are tiling arrays, which interrogate large contiguous tracts of genome rather than specific genes, resequencing arrays designed to detect the location of mutations and single nucleotide polymorphisms (SNPs), SNP arrays which are used for genotyping, and non-DNA arrays such as protein arrays designed to identify protein-protein interactions and antibody arrays for detecting antigens. In addition to this there exist a number of technologies including photolithographic deposition, inkjet printing and fabrication of probes onto micro-beads.
In each case they share the common property that they detect large biological molecules of specific known letter sequences via their binding to complementary sequences attached to a solid surface. In each case they will almost invariably share the common problems of non-specific hybridisation, saturation and bulk solution target hybridisation and folding dealt with in this work. The physico-chemical model we have explored is consistent with spike-in data over a broad range of target concentrations and should serve as a starting point for a variety of microarray types and platforms.
Thanks to Hans Binder, Susan Wilson and Yvonne Pittelkow for useful discussions and advice.
Physical background intensity measurement from factors such as reflection off the microarray surface and photomultiplier dark current. Assumed to be constant for all features on a microarray.
One of three parameters in the hyperbolic response curve Eq. 1 fitted to the measured fluorescence intensity data. estimates the (background) fluorescence intensity at zero PM-specific spike-in concentration.
Saturation fluorescence intensity above the physical background before washing, in a hypothetical situation in which all probes on a feature have formed biotin label carrying duplexes. Assumed to be constant for all features on a microarray.
One of three parameters in the hyperbolic response curve Eq. 1 fitted to the measured fluorescence intensity data. estimates the asymptotic saturation fluorescence intensity at infinite PM-specific spike-in concentration.
Measured fluorescence intensity signal at PM-specific spike-in concentration .
One of three parameters in the hyperbolic response curve Eq. 1 fitted to the measured fluorescence intensity data. estimates the PM-specific spike-in concentration required to give a fluorescence intensity half way between the background level and asymptotic level .
The specific washing survival function, i.e. the probability that a duplex formed with a PM-feature-specific mRNA target existing at the beginning of the washing step will survive to a washing time .
The non-specific washing survival function, i.e. the probability that a duplex formed with a PM-feature-non-specific mRNA target of species existing at the beginning of the washing step will survive to a washing time .
The washing time.
() Spike-in concentration of mRNA PM-specific target.
Binding free energies of a DNA/RNA duplex, a RNA/RNA duplex and of DNA probe self-folding. We use the convention that is negative for a bound state. are the corresponding dimensionless binding free energies , where is the gas constant and the absolute temperature.
The fraction of probes on the feature carrying probe-target duplexes after a washing time of at zero spike-in concentration . See Eq. 13.
is the fraction of probes on the feature carrying probe-target duplexes after a washing time of at infinite spike-in concentration . See Eq. 13.
The fraction of probes on the feature carrying probe-target duplexes after a washing time of , as a result of a spike-in concentration of mRNA specific to the PM feature. At the end of the washing time, that is, at the time of scanning, we write simply .
the fraction of probes on the feature carrying PM-specific and PM-nonspecific duplexes respectively at , i.e., after the hybridisation step and before the washing step.
The reversible chemical reaction by which target molecules in solution bind to probes attached to the microarray surface to form duplexes.
A high-throughput device for detecting the presence of large biological molecules (DNA, RNA or proteins) of specific known letter sequences via their binding to molecules of complementary sequences attached to a solid surface. They are high-throughput in the sense that large numbers of sequences are tested for in a single device. The microarrays discussed here are oligonucleotide gene expression microarrays, that is, they have short DNA probes and are intended for the detection of expressed genes through their messenger RNA.
- Non-specific hybridisation.
The hybridisation of target molecules with sequences other than those of the intended sequence. Sometimes ‘non-specific’ is used to mean ‘non-PM-specific’, that is, hybridisation of target molecules which are not complementary to the PM sequence, irrespective of whether they are binding to the PM or MM member of a probe pair. PM and MM are defined in Section 2.
A biological molecule attached to the microarray surface during fabrication.
- Spike-in experiment.
An experiment in which known concentrations of a specific set of target molecules are artificially added to a solution not otherwise containing those specific targets, and the solution hybridised onto microarrays.
A biological molecule in the solution hybridised onto the microarray during a laboratory experiment.
-  http://www.affymetrix.com/support/technical/sample_data/datasets.affx.
-  H. Binder. Thermodynamics of competitive surface adsorption on DNA microarrays. J. Phys. (Condens. Matter), 18:S491–S523, 2006.
-  H. Binder, T. Kirsten, I. L. Hofacker, P. F. Stadler, and M. Loeffler. Interactions in oligonucleotide hybrid duplexes on microarrays. Journal of Physical Chemistry, 108:18015–18025, 2004.
-  H. Binder, T. Kirsten, M. Loeffler, and P. F. Stadler. Sensitivity of microarray oligonucleotide probes: variability and effect of base composition. Journal of Physical Chemistry, 108:18003–18014, 2004.
-  H. Binder and S. Preibisch. Specific and nonspecific hybridization of oligonucleotide probes on microarrays. Biophys. J., 89:337–352, 2005.
-  H. Binder and S. Preibisch. Genechip microarrays – signal intensities, RNA concentrations and probe intensities. J. Phys. (Condens. Matter), 18:S537–S566, 2006.
-  C. J. Burden, Y. E. Pittelkow, and S. R. Wilson. Statistical analysis of adsorption models for oligonucleotide microarrays. Statistical Applications in Genetics and Molecular Biology, 3:Article 35, 2004.
-  C. J. Burden, Y. E. Pittelkow, and S. R. Wilson. Adsorption models of hybridisation and post-hybridisation behaviour on oligonucleotide microarrays. J. Phys. (Condens. Matter), 18:5545–5565, 2006.
-  C. J. Burden, Y. E. Pittelkow, and S. R. Wilson. Statistical analysis and physical modelling of oligonucleotide microarrays. In I. A. Deutsch, L. Brusch, H. Byrne, G. de Vries, and H.-P. Herzel, editors, Mathematical Modeling of Biological Systems, volume I, pages 333–346. Birkhäuser, Boston, MA, USA, 2007.
-  E. Carlon and T. Heim. Thermodynamics of RNA/DNA hybridization in high-density oligonucleotide microarrays. Physica A, 362:433–449, 2006.
-  E. Carlon, T. Heim J. Klein Wolterink, and G.T. Barkema. Comment on “Solving the riddle of the bright mismatches: Labelling and effective binding in oligonucleotide arrays”. Phys. Rev. E, 73:063901, 2006.
-  J. E. Forman, I. D. Walton, D. Stern, R. P. Rava, and M. O. Trulson. Thermodynamics of duplex formation and mismatch discrimination on photolithographically synthesised oligonucleotide arrays. In N. B. Leontis and J. SantaLucia, editors, Molecular Modeling of Nucleic Acids, ACS Symposium Series, volume 682, pages 206–228. Am. Chem. Soc., Washington, DC, USA, 1998.
-  A. Halperin, A. Buhot, and E. B. Zhulina. Sensitivity, specificity and the hybridization isotherms of DNA chips. Biophysical Journal, 86:718–730, 2004.
-  T. Heim, L.-C. Tranchevent, E. Carlon, and G.T. Barkema. Physical-chemistry-based analysis of Affymetrix microarray data. J. Phys. Chem. B, 110:22786–22795, 2006.
-  D. Hekstra, A. R. Taussig, M. Magnasco, and F. Naef. Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays. Nucleic Acids Research, 31:1962–1968, 2003.
-  G. A. Held, G. Grinstein, and Y. Tu. Modeling of DNA microarray data by using physical properties of hybridization. Proceedings of the National Academy of Science, 100:7575–7580, 2003.
-  G. A. Held, G. Grinstein, and Y. Tu. Relationship between gene expression and observed intensities in DNA microarrays - a model study. Nucleic Acids Research, 34:e70, 2006.
-  R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, and et al. Exploration, normalization and summaries of high density oligonucleotide array probe level data. Biostatistics, 4:249–264, 2003.
-  O.V. Matveeva, S.A. Shabalina, V.A. Nemtsov, A.D. Tsodikov, R.F. Gesteland, and J.F.Atkins. Thermodynamic calculations and statistical correlations for oligo-probe design. Nucl. Acids Res., 31:4211–4217, 2003.
-  F. Naef and M. O. Magnasco. Solving the riddle of the bright mismatches: Labelling and effective binding in oligonucleotide arrays. Phys. Rev. E, 68:011906, 2003.
-  D. Skvortsov, D. Abdueva, C. Curtis, B. Schaub, and S. Tavaré. Explaining differences in saturation levels for Affymetrix Genechip® arrays. Nucleic Acids Research, 35:4154–4163, 2007.
-  N. Sugimoto, S. Nakano, M. Katoh, A. Matsumura, H. Nakamuta, and T. Ohmichi. Thermodynamic parameters to predict stability of RNA/DNA hybrid duplexes. Biochemistry, 34:11211–11216, 1995.
-  T. Xia, J. SantaLucia, M. E. Burkard, R. Kierzek, S. J. Schroeder, X. Jiao, C. Cox, and D. H. Turner. Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochem., 37:14719–14735, 1998.
-  M. Zuker. Mfold web server for nucleic acid folding and hybridization prediction. Nucl. Acids Res., 31:3406–3415, 2003. http://frontend.bioinfo.rpi.edu/applications/mfold/.