Transcription Factor-DNA Binding Via Machine Learning Ensembles

Abstract

Motivation:

The network of interactions between transcription factors (TFs) and their regulatory gene targets governs many of the behaviors and responses of cells. Construction of a transcriptional regulatory network involves three interrelated problems, defined for any regulator: finding (1) its target genes, (2) its binding motif and (3) its DNA binding sites. Many tools have been developed in the last decade to solve these problems. However, performance of algorithms for these has not been consistent for all transcription factors. Because machine learning algorithms have shown advantages in integrating information of different types, we investigate a machine-based approach to integrating predictions from an ensemble of commonly used motif exploration algorithms.

Results:

We have developed an ensemble method in a machine learning (ML) framework that combines predictions from five known motif and binding site exploration algorithms. For a given TF, the ensemble starts with position weight matrices (PWMs) for the motif, collected from the component algorithms. The collected ensemble of PWMs is used as a dimension reduction tool, identifying significant PWM-based subspaces for analysis. Within each subspace a machine classifier is built for identifying the TF’s gene (promoter) targets (problem 1 above). These PWM-based subspaces form an ML-based sequence analysis tool, particularly useful in small sample situations. Problem 2 (binding motifs) is solved by agglomerating k-mer (string) features from the PWM-based subspaces that stand out in identifying gene targets. We approach Problem 3 (binding sites) with a novel native machine learning approach, the w-scanning model. This uses each gene promoter’s string features and their ML importance scores in a classification algorithm to locate binding sites across the genome.

For target gene identification, averaged over 88 yeast TFs, this method improves performance (measured by the F1 score) by about 10 percentage points over the (a) motif scanning method and (b) the coexpression-based association method.

For binding motif identification, the top motif predictions from this method are reasonably similar to the known motifs for 62 out of 88 TFs, which outperformed 5 component algorithms as well as two other common algorithms (BEST and DEME).

For identifying individual binding sites on a benchmark cross species database (Tompa et al., 2005) of 56 TFs, this method achieved a similar performance to the best performer without much human intervention. It also improved the performance on mammalian TFs, for which the training sample size is often larger.

Conclusion:

The ensemble can integrate orthogonal information from different weak learners (potentially using entirely different types of features) into a machine learner that can perform consistently better for more transcription factors. The TF gene target identification component (problem 1 above) can be particularly useful in constructing a more complete transcriptional regulatory network from a smaller sub-network based on known TF-target associations. The ensemble is easily extendable to include more tools as well as future PWM-based information, possibly from new motif algorithms.

Introduction

The expression of genes in tissue and biological pathways is primarily controlled by transcriptional regulation [2, 3, 4]. Transcription factors (TFs, also known as regulators) are proteins that initiate transcription of a gene (the target) by binding, with different levels of affinity, to nearby binding sites, usually 6-15 base pairs wide. These sites are often located in DNA within so-called promoter regions that are hundreds to thousands of base pairs long, usually upstream of the gene [4, 5, 6]. Current problems have included the identification, for every known regulator, of (1) its target genes, (2) its binding motif (the general DNA pattern to which it binds) and (3) its DNA binding sites or locations. As genomes have been sequenced, new methods have provided significant results toward solving these problems.

Current methods [7, 8, 9, 10, 11, 12, 13, 14] typically begin by identifying motifs for a given TF, searching for common DNA patterns in a collection of promoter regions of known or suspected target genes (we denote these as the positive set). A binding motif is usually represented as a position weight matrix (PWM) [15, 16, 17] whose $i$-th column consists of the four probabilities of the DNA bases A, C, G, and T in position $i$ of the motif. The motif can then be used to detect new target genes and corresponding binding sites by rescanning its PWM through the promoter regions of candidate genes [15, 16, 17]. The scanning (or rescanning) score for each position in the promoter measures how well the subsequence starting there fits the probability distribution defined by the PWM. A promoter position with a rescanning score higher than a given threshold is reported as a new binding site and suggests a new gene target for the given TF.
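The rescanning step can be sketched in a few lines. The sketch below is illustrative only and is not the scoring used by any of the cited tools: it assumes a log-odds score against a uniform background, a small pseudocount, and a toy PWM; the names `pwm_log_odds`, `scan_promoter` and `toy_pwm` are hypothetical.

```python
import math

BASES = "ACGT"

def pwm_log_odds(pwm, background=0.25, pseudo=1e-3):
    """Convert per-position base probabilities into log-odds scores.

    Assumes a uniform background and a small pseudocount (both are
    illustrative choices, not taken from the paper).
    """
    return [
        {b: math.log((col[b] + pseudo) / background) for b in BASES}
        for col in pwm
    ]

def scan_promoter(promoter, pwm, threshold):
    """Return (position, score) for every window scoring above threshold.

    The promoter is assumed to contain only the letters A, C, G, T.
    """
    scores = pwm_log_odds(pwm)
    width = len(pwm)
    hits = []
    for start in range(len(promoter) - width + 1):
        window = promoter[start:start + width]
        score = sum(scores[i][base] for i, base in enumerate(window))
        if score > threshold:
            hits.append((start, score))
    return hits

# Toy 3-column PWM strongly preferring "TGA" (made-up values).
toy_pwm = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]
hits = scan_promoter("CCTGACCTGA", toy_pwm, threshold=2.0)
hit_positions = [pos for pos, _ in hits]
```

Here the two occurrences of TGA are the only windows that clear the threshold; in practice the threshold choice drives the false positive rate of the scan.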

More recently, high-dimensional machine learning methods have been used for such analysis and have in some cases achieved state-of-the-art performance [18, 1, 19]. Several algorithms that form a suite for investigating transcription regulation have been developed based on a common machine learning framework [18, 19]. In this paper, we introduce an ensemble machine method built on this framework, integrating information from diverse motif discovery algorithms.

Because of the nature of machine learning, the framework is initially based on solving a classification problem, in this case separating gene targets from non-targets (of a given TF) in a training or test data set. To determine whether a gene is a target, its promoter region is mapped into one or more high-dimensional feature spaces, using maps capturing promoter properties that determine TF binding. A common and powerful feature space is the so-called $k$-mer spectrum (or string) space. This represents a gene (actually its promoter) as a vector whose components are counts of $k$-mers (short consecutive DNA strings of length $k$) in the promoter region [18, 19]. For a fixed TF, the feature map $\Phi$ takes the promoter sequence $s$ of a potential target gene into a feature vector $x = \Phi(s)$ in $F$ (the $k$-mer feature space), whose $i$-th component $x_i$ counts occurrences in $s$ of the $i$-th $k$-mer (in an indexed list of all $k$-mers). Each sequence is also labeled as $y = \pm 1$, indicating whether it represents a regulatory target (positive) or a non-target (negative). Target/non-target data are typically determined under a union of different experimental conditions.

A machine learning classifier function $f$ is trained on the dataset $\{(x_j, y_j)\}$ of known target/non-target genes. The function $f$ is selected so that, for training examples $(x_j, y_j)$, the value $f(x_j)$ is positive for positive samples (where $y_j = +1$) and negative for negative ones ($y_j = -1$). The trained $f$ is then used to classify a new gene with promoter $s$ as positive (target) or negative (non-target), based on the classification of its feature vector $x = \Phi(s)$ (see [18]). The ML algorithm used to train $f$ here is the SVM; similar procedures have been used to classify proteins via amino acid sequences [1].
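As an illustration of the feature map and decision rule just described, the following minimal sketch builds a $k$-mer count vector and applies a linear discriminant. The weights here are set by hand purely for demonstration (a real classifier would be trained by an SVM), and the function names `kmer_index`, `kmer_spectrum` and `classify` are hypothetical.

```python
from itertools import product

def kmer_index(k):
    """Index all 4**k DNA k-mers in lexicographic order."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}

def kmer_spectrum(promoter, k):
    """Map a promoter string to its vector of k-mer counts."""
    index = kmer_index(k)
    counts = [0] * len(index)
    for start in range(len(promoter) - k + 1):
        kmer = promoter[start:start + k]
        if kmer in index:  # skip windows containing N or other symbols
            counts[index[kmer]] += 1
    return counts

def classify(x, w, b):
    """Linear discriminant: +1 (target) if w.x + b is positive, else -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

# "ACGTAC" contains the 2-mers AC, CG, GT, TA, AC.
x = kmer_spectrum("ACGTAC", k=2)

# Hand-set weights (an assumption for illustration): only "AC" matters.
w = [0.0] * len(x)
w[kmer_index(2)["AC"]] = 1.0
label = classify(x, w, b=-1.5)
```

The dimension of the full space grows as $4^k$, which is one reason dimension reduction becomes relevant for longer strings.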

As mentioned, an important goal of DNA binding analysis is to understand the transcriptional regulatory relationships between TFs and genes, with the purpose of identifying a new or extending a known regulatory gene network. Note that in the machine learning framework, identifying TF regulatory target genes does not require a prior choice of motif. To identify new targets of a TF, the classifier can be used directly [18].

Identification of new TF gene targets using training sets of known targets has previously been based on two common methods: the above-mentioned motif scanning method [17] and coexpression-based association analysis [20]. The latter method and other machine learning methods for identifying targets have been shown to use information more efficiently and achieve better gene classification performance [18, 21].

The strongest $k$-mer string features used by our gene target classifier (for a given TF) are usually related or identical to the $k$-mer strings in DNA that in fact bind the TF. Based on this fact, a tool, SVMotif, was developed to search for binding motifs (PWMs) by agglomerating the most discriminative $k$-mers into binding motifs [19] within a structured machine learning framework. Specifically, for a fixed length $k$, the discriminative $k$-mer features (separating targets and non-targets) will overlap to form a longer (motif) sequence that can be used to form a candidate PWM.

Tested on the yeast genome, this method was able to discover 57 true motifs out of 100 TFs among its top three predictions per TF [19] - this included 49 ungapped motifs and 8 gapped motifs. The results were comparable to or better than two other commonly used algorithms, AlignACE and BioProspector. In contrast to many conventional motif-finding methods, SVMotif uses a set of known or hypothesized negatives (non-target genes) in addition to positives to provide more specific genomic background information. For choosing negative examples, randomly selected sequences can be used as representatives of the genome-wide background. Alternatively, simulated sequences can also be used to mimic a statistical background sequence model.

A support vector machine (SVM) classifier has the form of a linear discriminant, i.e., $f(x) = w \cdot x + b$, predicting that the gene with feature vector $x$ is a positive (target) if $f(x)$ is positive. The $w$-vector has components $w_i$ that are largest when the feature $x_i$ (counting occurrences of the $i$-th listed $k$-mer in the promoter) is most significant in discriminating whether the gene is a target. Thus the magnitude of $w_i$ can also score the likelihood that the $i$-th $k$-mer appears in the binding motif, helping to construct the PWM. Since this identifies the most likely binding $k$-mers, it can also be used to score potential binding sites of the TF in the promoter. This can be done by scanning the promoter and scoring each location using the $w$-vector directly, instead of the derived PWM. We will describe this $w$-scanning method and test it on a well-known benchmark dataset [22].

An analysis of 14 existing motif discovery algorithms [22] suggested that no single algorithm can perform consistently well for every transcription factor. To take advantage of different strengths, ensemble methods, which combine predictions from different algorithms, have been of interest for refining predicted motifs. One type of ensemble approach scores motifs in a candidate pool by measuring their “goodness”. WebMOTIFS (the web interface of TAMO; [23, 24]) assesses each candidate motif with several statistics (e.g. the hypergeometric enrichment score; [25]) and reports a ranked list.

An existing ensemble approach for motif finding [26, 27, 28, 29] starts with multiple predicted binding sites of the TF based on several algorithms. The locations agreed on by the most algorithms are reported as binding sites. The motif matrices are then formed by agglomerating the subsequences at those locations into a PWM. A scoring scheme is then used to rank predicted motif matrices based on their information content [30] and their matching frequency in the positive set. Some algorithms then optimize their motif score by locally adjusting binding locations to improve accuracy [31, 32, 33]. One ensemble to which we will compare our algorithm, BEST [31], is of this type. At the end the algorithm reports motifs and the subsequences used to form them. Such ensemble algorithms can show a large improvement over their individual components. Nevertheless, because these ensemble methods are used only to select a single ‘best’ motif that is ultimately used, out-of-training TF target and binding site identification can still have a high false positive rate if conventional PWM scanning methods are used [34] in the ensemble.

In this paper, we use the above-mentioned ML representations of candidate promoter targets of a TF to develop a modular and extendable ensemble machine framework, SVMotif Ensemble. Using this we develop approaches for all three of the above problems (identifying (1) target genes, (2) binding motifs and (3) binding sites), within this framework, and improve performance for all three. In particular the approach produces an ensemble-based classifier for out-of-training identification of TF target genes and binding sites, replacing commonly used PWM scanning models. Here we will focus more on the framework of this algorithm rather than selecting and tuning its individual components to obtain the most accurate predictions. Thus only five widely-used algorithms, BioProspector [10], AlignACE [9], MEME [12, 11], Weeder [35] and SVMotif [19] are integrated, without any fine-tuning of their combinations. The resulting ensemble is fully scalable to allow other motif discovery tools to be added as future components.

We mention several points about the algorithm’s properties as related to the above problems.

(1) In applying the algorithm we first validated the ensemble machine on discovery of S. cerevisiae transcriptional regulatory gene targets. For target gene identification, averaged over 88 TFs, this method improves combined precision and recall (measured by the $F_1$ score) by about 10 percentage points over (a) the motif scanning method using PWMs generated by the five aforementioned individual algorithms.

We also compared the ensemble as a gene target identifier with (b) the coexpression-based association method [21] and (c) a previously developed machine-based $k$-mer method [18, 36]. The ensemble produces improvements, varying over different transcription factors.

(2) We have used the same dataset as in (1) for testing performance of SVMotif Ensemble in binding motif identification; the top motif predictions are highly similar to the known motifs for 20 out of 88 TFs and reasonably similar to the known motifs for another 42 out of 88. SVMotif Ensemble outperformed not only each of its 5 component algorithms, but also another ensemble algorithm, BEST [31], and another discriminative method, DEME [37].

(3) We also tested the function of identifying individual binding sites on a standard benchmark database [22], containing 56 datasets from the human, mouse, D. melanogaster, and S. cerevisiae genomes. The ensemble is comparable to the best performer (Weeder with a special ad-hoc binding site selection procedure [22]) as measured by both nucleotide-level and site-level performance measures (see [22] for measure definitions). In addition, for mammalian datasets, which usually contain more training examples of TF-gene interactions (such large datasets are becoming much more prevalent in current research; see [38]), there is also an improvement over Weeder in identification of binding sites.

Some methodological points here are worth noting. The first involves a connection between machine learning classifiers like $f$ (here the SVM linear discriminant) and PWM rescanning-based classifiers, for both TF target gene identification and binding site identification (see Section Using the $k$-mer Feature Space and SVM Classifier). This connection will show that SVM-based TF target gene classifiers essentially form a superset of the gene classifiers that are based on PWM rescanning. Based on this, it will be possible to conclude that such machine classifiers form a strictly better alternative to PWM rescanning in solving out-of-training identification problems (for both TF target gene and binding site identification). Specifically, consider the SVM-based classifier. Let $f(x) = w \cdot x + b$ be the SVM-based gene classification function (for a given TF), for $x$ in the $k$-mer feature space $F$. Note that the coefficient vector $w$ contains the information relevant to the TF target gene problem, and hence to the motif finding problem. We will show (Section Using the $k$-mer Feature Space and SVM Classifier) that, given any PWM $M$ for the TF, it is possible to emulate $M$-based TF target gene classifications using a classifier $f_M$ whose classifications of genes are effectively identical to those based on motif $M$. To this extent we can consider the machine classifier to be a strict generalization of the PWM scanning classifier.

In particular, the set of vectors $w$ giving classifiers that are compatible as above with rescanning by a fixed motif $M$ forms only a subset of possible $w$-vectors. The latter can be searched in constructing all SVM candidates $f$, from which the best SVM is selected. Given that the SVM optimizes a loss function (based on errors in the training samples) by searching a collection of classifiers larger than just those based on PWMs, it should be able to perform at least as well on the training objective as any PWM-based algorithm. We believe that if the sample size is large enough, the SVM algorithm generalizes significantly better than PWM methods in predicting out-of-training targets of a TF.

The second point involves an advantage of the ML approach to ensemble learning, based on the nature of the feature space $F$. This space (here consisting of $k$-mer count feature vectors) provides a common source-independent framework into which information from component algorithms is imported. We note also that its dimension can be reduced dramatically using the ensemble technique. Each candidate PWM $M$ obtained from the component PWM algorithms of the ensemble can be used to generate a reduced low-dimensional subspace $F_M$ of $F$, the PWM subspace. This is obtained by extracting from the PWM a selection of $k$-mer features [19] that are most likely under the PWM. The reduced space $F_M$ is the span of these $k$-mer features. This dimension reduction approach in our ML ensemble framework is initially used for identifying TF target genes. The choice of PWM subspace $F_M$ carries the information in $M$ to the feature space $F$. A reduced machine classifier $f_M$ can be trained on the subspace $F_M$ (yielding a coefficient vector $w_M$ within $F_M$), derived from the ensemble via $M$. Thus each ensemble motif algorithm is a component dimension reduction tool producing subspaces $F_M$. We will call the map from PWM $M$ to subspace $F_M$ a subspace-valued weak learner, weak because the resulting dimension-reduced mapping is a relatively small component of the full learning algorithm producing the machine classifier $f$ in $F$. We remark that the discrimination power of the subspace $F_M$ in finding TF gene targets is a good measure of the quality of the PWM $M$ itself; it provides an alternative to the area under the ROC curve (AUROC) score of a PWM measuring binding site identification, first introduced by [39].
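One possible reading of the PWM-to-subspace map is sketched below. The selection rule is an assumption made for illustration (the procedure of [19] may differ): each $k$-mer is scored by its best probability over ungapped offsets within the PWM, and the highest-scoring $k$-mers span the subspace. The names `kmer_likelihood`, `pwm_subspace` and `project` are hypothetical.

```python
from itertools import product

def kmer_likelihood(kmer, pwm):
    """Best probability of the k-mer over all ungapped offsets in the PWM."""
    k, width = len(kmer), len(pwm)
    best = 0.0
    for offset in range(width - k + 1):
        p = 1.0
        for i, base in enumerate(kmer):
            p *= pwm[offset + i][base]
        best = max(best, p)
    return best

def pwm_subspace(pwm, k, n_features):
    """Return the n_features k-mers most likely under the PWM (assumed rule)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    ranked = sorted(kmers, key=lambda s: kmer_likelihood(s, pwm), reverse=True)
    return ranked[:n_features]

def project(counts_by_kmer, subspace):
    """Restrict a k-mer count dictionary to the coordinates of the subspace."""
    return [counts_by_kmer.get(kmer, 0) for kmer in subspace]

def column(preferred):
    """Toy PWM column putting probability 0.85 on one base (made-up values)."""
    return {b: (0.85 if b == preferred else 0.05) for b in "ACGT"}

toy_pwm = [column(b) for b in "TGAC"]          # 4-column motif "TGAC"
subspace = pwm_subspace(toy_pwm, k=3, n_features=2)
reduced = project({"TGA": 2, "AAA": 5}, subspace)
```

With this toy motif the two selected 3-mers are the two windows of TGAC, so a promoter is described by 2 coordinates instead of 64.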

Third, finding gene targets using the (SVM-derived) linear discriminant $f$ extends directly to a method for finding specific binding sites, extending the standard PWM scanning binding site identification method. We call this method $w$-scanning; it uses the above feature space $F$ to find binding site positions. Like standard PWM scanning, the approach scans a promoter by scoring each of its $k$-mers using the above SVM $w$-vector. As mentioned above, this strictly extends the capabilities of PWM-type scanning methods, in particular avoiding the implicit assumption that binding site positions are independent. This has a two-sided effect. On one hand, if the independence assumption is invalid, $w$-scanning can improve accuracy over PWM-based models. On the other hand, the approach needs a relatively large training set of known positives, because learning complexity is higher. In particular, the method may overfit noise (false dependences among motif positions) when trained on small sample sizes.
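A minimal sketch of the scanning idea follows, under the assumption that each window is scored simply by the weight of its $k$-mer and reported when that weight exceeds a threshold. How overlapping windows are merged into sites is not specified here, and the weights below are made up; `w_scan` and `toy_weights` are hypothetical names.

```python
def w_scan(promoter, weights, k, threshold):
    """Score every k-mer window by its SVM weight; keep those above threshold."""
    hits = []
    for start in range(len(promoter) - k + 1):
        kmer = promoter[start:start + k]
        score = weights.get(kmer, 0.0)  # k-mers absent from the model score 0
        if score > threshold:
            hits.append((start, kmer, score))
    return hits

# Made-up weights standing in for components of a trained w-vector.
toy_weights = {"TGAC": 1.8, "GACT": 1.1, "AAAA": -0.7}
sites = w_scan("CCTGACTAAAA", toy_weights, k=4, threshold=1.0)
site_windows = [(start, kmer) for start, kmer, _ in sites]
```

Unlike a PWM score, the weight of a $k$-mer is not a sum of independent per-position terms, which is where the extra modeling capacity (and the extra risk of overfitting) comes from.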

Last (but not least), the machine learning approach easily combines information sources that go beyond sequence information. This can include information like experimental mRNA co-expression, phylogenetic sequence conservation, and nucleosome positioning. In [18], such information was combined with $k$-mer features to find TF-target associations. Thus the ensemble can be expanded to include PWM information from new algorithms, as well as other sources such as gene expression and sequence conservation.

Our approach uses each component PWM algorithm to provide candidate motifs/PWMs. Each PWM $M$ generates a large number of strings ‘typical’ for it, which then form the basis for an associated synopsis subspace $F_M$ of the full string space $F$. Machine training on the set of positive and negative promoters restricted to $F_M$ yields, for each test promoter’s feature vector $x$, a so-called synopsis score $z_M(x)$. This is the SVM discriminant score based only on $F_M$. The ensemble of individual scores themselves form a reduced feature vector $z(x)$ with one feature (the value $z_M(x)$) for each $M$, giving an extensive dimension reduction from $F$. This reduced vector $z(x)$ with components $z_M(x)$ is called the synopsis vector, and the space of these is denoted as the synopsis space. We should mention that other ways to combine machine learning information can be used instead; these include adding kernels corresponding to different subspaces (kernel addition), and forming direct sums of the feature spaces [40]. Compared with these, however, this method is computationally efficient. Combined with a sub-feature selection tool (selecting only important synopsis features $z_M$), this maintains scalability, leaving room for more useful future information.
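The construction of the synopsis vector can be illustrated as follows. This sketch assumes that a linear discriminant (weights and bias) has already been trained on each PWM subspace; the per-subspace models below are invented for demonstration, and `synopsis_score` and `synopsis_vector` are hypothetical names.

```python
def synopsis_score(counts, model):
    """Linear discriminant score restricted to one PWM subspace."""
    weights, bias = model
    return sum(w * counts.get(kmer, 0) for kmer, w in weights.items()) + bias

def synopsis_vector(counts, models):
    """One synopsis score per candidate PWM, in the order of `models`."""
    return [synopsis_score(counts, m) for m in models]

# Each model is (k-mer weights on its subspace, bias); values are made up.
models = [
    ({"TGA": 1.0, "GAC": 0.5}, -1.0),  # subspace from hypothetical PWM 1
    ({"CCC": 2.0}, -0.5),              # subspace from hypothetical PWM 2
]
counts = {"TGA": 2, "GAC": 1, "AAA": 4}  # k-mer counts of one promoter
z = synopsis_vector(counts, models)
```

A second-stage classifier would then be trained on vectors like `z`, whose length equals the number of candidate PWMs rather than the number of $k$-mers.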

SVMotif Ensemble is a machine learning software suite for solving the above problems. It takes known target and non-target promoter sequences (the training set) as input, and automatically runs the input algorithms of the ensemble (e.g. BioProspector [10], AlignACE [9]) to get potential PWMs. These PWMs are then used to reduce the dimensionality of the full machine learning string feature space $F$. On the reduced spaces $F_M$, classifier functions are trained to distinguish target and non-target promoters, forming the trained ensemble machine.

As a suite for transcription regulation analysis, the trained SVMotif Ensemble predicts target genes of the TF, outputs a binding consensus motif matrix, and predicts potential binding sites near each target. The user can store the learned ensemble machine, which contains the learned subspace information as well as information from training samples, for future use. Compared to the traditional way of using PWMs as the direct information source, the ensemble contains more information, and more accurate information, for our identification problems. The software suite is available for download from our website at http://cagt10.bu.edu/SVMotif.

Results

Experimental Protocol

To identify gene targets of a given TF, we used a benchmark dataset of TF-DNA interactions from [18] that contains positives (known gene targets) and negatives (genes with large $p$-values in ChIP-chip experiments; [41]), based on information for 163 yeast transcription factors. We also downloaded PWMs for 102 TFs (out of the 163) that are available from the UCSC database [42], to test against performance of the PWM scanning model using known PWMs. We excluded those TFs with fewer than 20 known targets in our dataset, since ML performance is unreliable for small numbers of positive examples. This left 88 TFs to be tested. For each TF, we selected all known positives (targets) and an equal number of (presumed) negatives as our experimental dataset. A 5-fold cross validation was performed on each dataset (for an individual TF), dividing the target/non-target genes into 5 equal groups. Specifically, to ensure full isolation of training and test data (failure to do this would overstate performance measures), the promoter sequences (including positives and negatives) were randomly divided into 5 portions. For each withheld test fold, we used the remaining 4 folds of data to train individual weak learners (here using AlignACE, BioProspector, MEME, SVMotif and Weeder without the special ad-hoc procedure) and 3 different ensemble methods, and tested the resulting target gene classifiers on the withheld test fold. This was repeated withholding each of the 5 folds (as the test fold) one at a time, giving cross-validation predictions for all genes in the dataset. The SVM used output probability values between 0 and 1 (probability of membership in one of the classes), which were used in computing the scores [43]. The $F_1$ score, defined by

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$

was used as an overall measure of prediction quality. Performance is discussed in Section Ensemble Classification of Gene Targets.
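For concreteness, the $F_1$ score (the harmonic mean of precision and recall) can be computed from cross-validation predictions as in the short sketch below. The labels are invented, and the convention of returning 0 when there are no true positives is an assumption of this sketch.

```python
def f1_score(true_labels, predicted_labels):
    """F1 for the positive class, with labels +1 (target) and -1 (non-target)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    if tp == 0:
        return 0.0  # convention assumed here when precision/recall are 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: 3 true targets, 2 non-targets; 2 targets found, 1 false alarm.
truth = [1, 1, 1, -1, -1]
predicted = [1, 1, -1, 1, -1]
score = f1_score(truth, predicted)
```

In this toy case precision and recall are both 2/3, so the $F_1$ score is also 2/3.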

For the second task of identifying binding motifs, the same 88 yeast transcription factors were tested. We used all known positives and a randomly selected equal number of negatives to train both weak learners and ensembles. As a performance measure, we calculated the motif similarity (Section Identifying Binding Motifs with Ensembles) between the UCSC PWMs and each of the top 3 predictions from all of the tested ensemble algorithms. Performance is discussed in Section Ensemble Binding Motif Identification.

For the final task of identifying binding sites, we chose benchmark TF binding site datasets from [22]. These datasets covered 4 species, including human, mouse, Drosophila melanogaster and Saccharomyces cerevisiae data. Only positive sequences were originally provided in each of the datasets. For a training set of negative (non-target) genes, we downloaded 1000 base pair upstream sequences. This data was obtained for each of the four species from the whole-genome database from the UCSC genome browser. Randomly selected sequences from this collection were used as negatives. Because the number of positive sequences was small, we selected twice as many negatives as positives for training the ensemble classifiers. Performance is discussed in Section Binding Site Identification.

As mentioned earlier, five commonly used motif exploration algorithms were combined as weak learners: AlignACE [9], BioProspector [10], MEME [12], Weeder [35] (without any ad-hoc binding site selection procedure), and the SVMotif algorithm [19] based on the full $k$-mer feature space. We selected top-ranked PWMs from each algorithm based on their own ranking scores. The selected candidate pool of motif matrices contained the top 5 motifs from AlignACE; the top 2 motifs from each different run of BioProspector (each based on a single motif width ranging from 7 to 12, with 12 PWMs in total); the top motif from separate runs of MEME, each using a different width ranging from 7 to 12 (6 PWMs in total); the top 5 motifs from SVMotif; and the top motif from Weeder. Since Weeder can only output individual DNA strings rather than PWMs, an in-house string agglomeration algorithm was applied to build PWMs. This setup yielded 29 motif matrices as the candidate pool for each transcription factor.
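The pool size follows directly from the per-algorithm counts above, as the following tally shows (the dictionary layout is only an illustrative bookkeeping device):

```python
widths = range(7, 13)  # motif widths 7 through 12, i.e. 6 separate runs

# Number of PWMs contributed by each component algorithm.
pool = {
    "AlignACE": 5,                     # top 5 motifs
    "BioProspector": 2 * len(widths),  # top 2 motifs per width
    "MEME": 1 * len(widths),           # top motif per width
    "SVMotif": 5,                      # top 5 motifs
    "Weeder": 1,                       # top motif (built from strings)
}
total = sum(pool.values())
```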

Ensemble Classification of Gene Targets

To benchmark prediction quality of gene targets of a TF, we first tested the performance of individual learners in the ensemble. Because the component algorithms generally output binding motifs rather than gene targets, we combined motif predictions with the PWM scanning model to identify TF binding targets (see Methods), based on aggregated PWM predictions of the weak learners. For component algorithms with multiple PWM predictions, we selected the PWM predictions whose scanning scores did best at identifying binding sites on the training portion of the dataset. As also observed in [22], no single learning method could perform consistently well for all transcription factors (Fig. 1).

Figure 1: Performance of component algorithms: Plot of the $F_1$ score for each transcription factor and each component algorithm. The number in parentheses after each component is the number of transcription factors for which the component showed the best performance.

For comparison, we also tested another class of TF target identification algorithms, based on gene co-expression studies. Such studies have been used as tools for gene regulatory network construction using various algorithms, for example in [20] and [21]. For such coexpression studies we used the SVM algorithms in [21] (using the coexpression databases in [18]). As pointed out in [21], this method had previously performed reasonably well in predicting E. coli regulatory relationships. In this test, these expression-based classifiers achieved on average a 57% $F_1$ score on our yeast data.

We also compared our ensemble algorithm with the previously developed SVMotif algorithm based on the full $k$-mer feature space, with . On the same dataset as above, the full $k$-mer space method achieved a 66% $F_1$ score. In [18], SVM performance using this full $k$-mer space dominated performances using other classes of feature spaces based on 25 other information sources, including co-expression data. However, the full $k$-mer space method can reach computational limitations as the size of the motif becomes larger.

As mentioned, the ensemble method was tested on each TF. Its average F1 score (over 88 TFs) is approximately 10 percentage points higher than that of the best component algorithm (70% versus 57% for AlignACE; Fig. 2).

Figure 2: Performance measures of ensemble algorithm versus individual components in gene target identification: Precision, recall and F1 scores are computed for each transcription factor individually. This box plot shows that the ensemble method outperforms the individual methods in F1 score by about 10 percentage points on average.

Looking at performances on individual TFs, for 75/88 TFs the ensemble outperformed the best performing of its five component algorithms in TF gene target identification (Fig. 3). Thus the integration of orthogonal (very different) algorithms can not only preserve best performance, but also improve overall performance.

Figure 3: Ensemble versus best performing component: In terms of F1 score, the ensemble method outperforms the best-performing individual component for 75 of the 88 transcription factors tested.

Ensemble Binding Motif Identification

Under a PWM rescanning model, any new identifications (of target genes or binding sites) rely on a good PWM estimate. Though the machine classifier initially scores genes as potential TF targets and does not estimate a PWM, it is possible to build PWMs from the classifier by ranking and merging its most informative features affecting the SVM score. To assess the performance in predicting binding motifs, we computed the motif similarity (Section Identifying Binding Motifs with Ensembles) between our prediction and the UCSC [42] standard motif for each TF.

We first considered similarities among PWMs in the candidate pool (from the component weak learners) to the standard motif. The best performing motif matrix among these and its similarity score to the standard motif were used as an ensemble testing benchmark. An ideal ensemble should reproduce this best performing matrix at the top of its list. We considered performance of both the top single and the top three ensemble predictions. The top prediction and the top three predictions had similarity scores to the standard motif that exceeded the best individual component scores (see Section Identifying Binding Motifs with Ensembles) for 46 and 55 transcription factors, respectively (among 87 TFs tested, Fig. 4). The top ensemble prediction also outperformed the top predictions from another ensemble method, the BEST algorithm [31], and from another supervised method, DEME [37], for most transcription factors (Fig. 5).

Figure 4: Selected motif similarities: ensemble vs. component weak learners. Benchmark denotes the PWM from any component algorithm with greatest similarity to the standard PWM. Top prediction and best among 3 predictions refer to predictions of the ensemble algorithm. The top ensemble prediction was the same as the benchmark for 46 out of 87 TFs.
Figure 5: Selected motif similarities compared to BEST and DEME. Missing values for BEST indicate it did not output a result.

To confirm that our predictions were biologically meaningful, we also looked at curated information from the Transfac database [44], which reported binding motifs for 26 TFs out of the full list of 87. Among these, three of the weak learners (Bioprospector, AlignACE, and SVMotif) could jointly predict the correct motif for 21 TFs with their selected predictions (i.e., at least one of the three had the correct prediction among its pooled PWMs in 21 cases out of the 26). The ensemble method predicted the correct motifs for 18 out of these 21 cases as its top ranked prediction. This shows the ability of the subspace ensemble to integrate a variety of information from different weak learners and to pick out the most meaningful parts. If a sub-learner predicts the true motif, an ideal ensemble machine should give the right prediction, which occurred in over 75 percent of these cases.

Binding Site Identification

We also used the above datasets to test how the ensemble method performs with sparse training sets (with small numbers of known gene targets). Since the number of training samples in such a benchmark is limited, a long motif pattern is difficult to detect. For example, if the true pattern is a 10-mer, this requires a learning machine to identify it within a million-dimensional space of possibilities (4^10 is about 10^6). Since the number of training data points (positives) might be as small as 10, identifying such a subtle signal in a high-dimensional space is difficult. The present ensemble method collects PWM outputs from sub-learners, largely to reduce the search dimensionality to several dozen, with the effect of increasing the signal/noise ratio.

Nucleotide Level
Notation: Definition and Formula
nTP: number of nucleotide positions in both known sites and predicted sites
nFP: number of nucleotide positions not in known sites but in predicted sites
nTN: number of nucleotide positions in neither known sites nor predicted sites
nFN: number of nucleotide positions in known sites but not in predicted sites
Sensitivity / Recall: nSn = nTP / (nTP + nFN)
Positive Predictive Value / Precision: nPPV = nTP / (nTP + nFP)
Specificity: nSP = nTN / (nTN + nFP)
Performance Coefficient: nPC = nTP / (nTP + nFN + nFP)
Correlation Coefficient: nCC = (nTP * nTN - nFN * nFP) / sqrt((nTP + nFN)(nTN + nFP)(nTP + nFP)(nTN + nFN))
F1 score: F1 = 2 * nSn * nPPV / (nSn + nPPV)
Site Level (“Overlap” indicates overlapping by at least 1/4 of a given site)
Notation: Definition and Formula
sTP: number of known sites overlapped by predicted sites
sFN: number of known sites not overlapped by predicted sites
sFP: number of predicted sites not overlapped by known sites
Sensitivity / Recall: sSn = sTP / (sTP + sFN)
Positive Predictive Value / Precision: sPPV = sTP / (sTP + sFP)
Average Site Performance: sASP = (sSn + sPPV) / 2
F1 statistics: F1 = 2 * sSn * sPPV / (sSn + sPPV)
Table 1: Definitions of performance metrics used in the assessment of motif discovery algorithms
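As a hedged illustration of the metrics in Table 1, the sketch below computes them from raw counts. It assumes the standard confusion-count definitions (sensitivity = TP/(TP+FN), precision = TP/(TP+FP), F1 as their harmonic mean, and so on); the function names, the dictionary keys and the example counts are hypothetical and are not taken from the benchmark in [22].

```python
import math

def _ratio(num, den):
    # Return 0.0 when a denominator is empty (e.g. no predicted sites at all).
    return num / den if den else 0.0

def f1(sens, ppv):
    # Harmonic mean of sensitivity (recall) and precision (PPV).
    return _ratio(2 * sens * ppv, sens + ppv)

def nucleotide_metrics(n_tp, n_fp, n_tn, n_fn):
    """Nucleotide-level metrics from the four confusion counts."""
    sens = _ratio(n_tp, n_tp + n_fn)
    ppv = _ratio(n_tp, n_tp + n_fp)
    cc_den = math.sqrt((n_tp + n_fn) * (n_tn + n_fp) * (n_tp + n_fp) * (n_tn + n_fn))
    return {
        "nSn": sens,
        "nPPV": ppv,
        "nSP": _ratio(n_tn, n_tn + n_fp),
        "nPC": _ratio(n_tp, n_tp + n_fn + n_fp),
        "nCC": _ratio(n_tp * n_tn - n_fn * n_fp, cc_den),
        "F1": f1(sens, ppv),
    }

def site_metrics(s_tp, s_fn, s_fp):
    """Site-level metrics; there is no true-negative count at this level."""
    sens = _ratio(s_tp, s_tp + s_fn)
    ppv = _ratio(s_tp, s_tp + s_fp)
    return {"sSn": sens, "sPPV": ppv, "sASP": (sens + ppv) / 2, "F1": f1(sens, ppv)}

# Hypothetical counts, for illustration only.
example = nucleotide_metrics(n_tp=30, n_fp=10, n_tn=950, n_fn=10)
```

An algorithm that predicts no true site gets an F1 of zero under this convention, which is the "no discovery power" case counted in Table 2.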

The overall result of this test over the four species shows that the sensitivity of the ensemble method surpasses the best among those tested in [22]. The best nucleotide-level sensitivity is below for all other algorithms tested, while the ensemble method gives . Looking at site level sensitivity, our method has a value of , indicating the ensemble successfully predicts about of true binding sites among the 4 species. The precision (a.k.a. PPV in [22]) is still at a similar level to other algorithms. The F1 score, which combines precision and sensitivity, is comparable to the best component score, that of the Weeder algorithm with a special ad-hoc binding site selection procedure [35] (Fig. 6).

Because the ensemble starts with multiple motif models, it predicts more sites (signals along the promoter) in its first stage. It then excludes false positives by restricting to just the primary motif cluster in the ad-hoc analysis (Section Identifying Binding Sites with Ensembles). From a dimension reduction point of view, the predictions in the first stage can effectively reduce the search space from the entire promoter region down to about 5 potential sites per sequence. The sensitivity of the ensemble is high, so the true site is more likely to be included in this initial list. In addition to sensitivity we also looked at the discovery power, measured by the proportion of TFs for which the algorithm has a non-zero F1 score. A non-zero F1 score suggests that the algorithm is able to predict at least one true binding site within the given datasets. Table 2 shows that, at the site level, the ensemble is able to identify at least one true binding site (non-zero F1 score) for 22 out of 56 TFs, the best among all algorithms tested.

In order to distinguish the functioning from non-functioning binding sites, some ad-hoc analysis is needed, such as a database search or conservation analysis. Experimental or computational approaches will also make sense under such circumstances. In addition, a more refined background model can also be used at this point (e.g., see discussion of Weeder’s special ad-hoc procedure on the [22] website) to select the most overrepresented binding sites. Such ad-hoc analyses will produce more useful results if the sensitivity (discovery power) of the computational algorithm is high.

In addition to our overall performance comparison using the 56 datasets based on cross-averaged statistics, we also compared the algorithms on these datasets individually. Using the F1 score as a metric we ranked 16 algorithms, including 14 tested in [22], alongside SVMotif and Ensemble, based on performance on each of the 56 TFs in the dataset. For each of the algorithms we counted the number of TFs for which its performance on binding site identification was ranked at the top, within the top 2 or within the top 3 algorithms. Among the 56 TFs, SVMotif Ensemble had the best performance on 11 and 9 of them, at nucleotide and site level respectively. It outperformed all other algorithms substantially (see Table 2).

We also noted that the ensemble method in particular performed better than other methods on mammalian datasets. As seen from Fig. 7, both nucleotide level and site level scores surpass those of all other algorithms. Because of the nature of the machine learning approach, it performs better when sample size is relatively large. The correlation coefficient between the ensemble method’s site level score and the number of positive examples per TF is for all four species, while the value for Weeder with special ad-hoc procedure is only . Hence for a dataset with large sample size (say ), the machine learning method is more predictive. This is important given the large numbers of positive instances of TF binding sites obtained, e.g., in ENCODE [38].

Figure 6: Performance metrics on all Tompa datasets for binding site identification.
Figure 7: Performance metrics on all Tompa mammalian datasets for binding site identification.
Algorithms (columns 2 to 5: Nucleotide Level; columns 6 to 9: Site Level)
Algorithm >0 Top Top2 Top3 >0 Top Top2 Top3
AlignACE 10 2 2 7 8 2 2 6
ANN-Spec 25 5 8 12 20 2 6 7
Consensus 7 1 1 2 6 1 2 4
GLAM 15 0 2 4 11 0 3 5
Improbizer 22 1 4 4 21 1 3 5
MEME 23 3 7 12 20 5 9 13
MEME3 19 3 5 7 15 3 5 8
MITRA 16 1 3 4 10 1 4 5
MotifSampler 21 4 8 10 18 4 7 11
oligodyad-analysis 15 2 3 5 12 2 3 4
QuickScore 14 1 5 6 7 0 1 1
SesiMCMC 19 4 6 11 19 4 8 12
WEEDER 18 4 6 7 18 7 9 11
YMF 20 3 8 10 18 3 5 7
SVMotif 20 5 11 17 19 2 6 8
SVMotif Ensemble 24 11 15 16 22 9 12 13
Table 2: Performance (F1 scores) of 16 algorithms on individual Tompa datasets. Columns 2 and 6 indicate the number of datasets on which the algorithm has discovery power (i.e., produces non-zero nucleotide and site level F1). Columns 3 to 5 and 7 to 9 indicate the number of datasets on which the algorithm performs best, within the top 2, or within the top 3 among the 16 (in terms of F1 score).

The machine classifier is an alternative to the PWM model

Identifying target genes of a TF is central to reconstructing regulatory networks, which has been of recent interest [20, 41, 25, 45, 18]. Up to now the majority of TF binding analysis has been based on PWM models. In conventional PWM models the rescanning score can be a powerful tool for identifying new target genes. As shown in Section Using the k-mer Feature Space and SVM Classifier, however, the PWM rescanning algorithm effectively forms a linear classifier, even when maximum local PWM scores are used. A PWM cannot capture all the information in its training sequences, and the accuracy of such a classifier is not optimal on out-of-training samples. The machine classifier uses known target information directly and represents its decision rule in a high-dimensional machine that effectively captures dependence information between bases. To compare the two models on gene target identification, we tested the PWM model against the ensemble machine classifier on UCSC motifs [42]. Among the 88 transcription factors tested, the ensemble method performed better for 70 datasets in terms of F1 score (Fig. 8).

Figure 8: Ensemble classifier performance versus PWM model-based performance on identification of target genes for 88 TFs. F1 scores are based on balanced positive/negative target sets.

Conclusion

We have presented an ensemble method for solving the problem of integrating information on transcription factor/gene interactions, and making predictions on this basis. The approach begins with identifying regulatory target genes and ends with predicting binding locations of TFs. Computational results show the ensemble method can successfully integrate information from five known motif-finding algorithms, each of which focuses on a different optimization problem. The performance is consistently good over a variety of data in predicting regulatory target genes and binding site motifs. Generally, machine learning methods have limitations for small sample sizes. The ensemble method together with a useful dimension reduction strategy (using weak learners) gives a tool with high sensitivity (recall) in predicting TF binding sites. The machine learning framework is also an alternative to the PWM model for all three identification problems mentioned earlier. It improves accuracy in identifying a TF’s gene targets and binding motifs, and can also improve binding site identification. The algorithm is scalable, so that more component algorithms and standard motif databases can be added to provide more comprehensive results. There are a number of new TF-DNA databases based on recent experiments [38] which can contribute significantly to these methods for solving such problems. With more sophisticated models and methodologies as well as additional data types, this method can scale up and can be of further use.

Methods

Overview

For a given TF, the computational model we consider is initially based on supervised two-way classification of its target genes (positives) against non-target genes (negatives). This machine learning framework, based on k-mer feature spaces, was implemented in the development of the SVMotif algorithm [19]. In SVMotif Ensemble, the individual component motif-finding algorithms are first run on the training dataset. This consists of a given set of promoter sequences of genes, both positive (known target) and negative (known non-target or presumed non-target), along with their (known) labels. For the TF, candidate PWMs are collected from the available algorithms in the ensemble into an initial candidate pool. In practice, candidates could additionally be obtained from standard databases such as Transfac [44] and JASPAR [46].

For each available PWM in the candidate pool, we construct two (independent) classification sub-models. Each takes information from the PWM and classifies every novel gene as a target or non-target based on its promoter sequence. The first sub-model is the standard PWM scanning model based on the PWM (Section Using the k-mer Feature Space and SVM Classifier), while the second is the k-mer space SVM model (Section Subspace Synopsis Features and the w-Scanning Model). Two kinds of scalar synopsis features are extracted from each promoter using the two models: the rescanning synopsis and the subspace synopsis. Each synopsis feature (rescanning and subspace) is a scalar (one-dimensional) machine learning (ML) score based on the training samples and a PWM-based sub-classifier. The sub-classifier is trained by one of the above two (scanning or subspace) methods to predict the labels (see below; recall that a positive label indicates a target and a negative label a non-target). For the TF, the basic gene target finding problem (1 above) and the two additional problems of motif (2) and binding site (3) identification (see Sections Identifying Binding Motifs with Ensembles and Identifying Binding Sites with Ensembles) are solved using the above classifiers as follows.

  1. For identification of target genes (problem 1) of the TF, an SVM is trained on the ensembles of combined synopsis features. Thus each promoter sequence is mapped into the synopsis feature space via

    combining classification scores based on all PWMs in the candidate pool, of both types (the scanning scores and the subspace scores). The scanning scores are based on scoring the promoter sequences using standard scanning with each PWM (see below). The (w-vector) subspace scores, on the other hand, are based on scoring the same sequences using the SVM classifier score computed from their feature vectors in the k-mer feature space. See Sections PWM Scanning Models and PWM synopsis features and Subspace Synopsis Features and the w-Scanning Model for a fuller description. The ensemble-based linear (boosting) classifier trained in the reduced synopsis feature space then has the form of a single SVM score

    used to score out-of-training test sequences .

  2. For motif identification (problem 2), a feature selection algorithm is used on the synopsis space to select the most discriminative features (PWMs) in the candidate pool. Those PWMs corresponding to the top-ranked synopsis features (based on discrimination of positive and negative genes) are themselves better PWMs for the true binding motif. A PWM agglomeration algorithm is then used to collapse similar PWMs into a single motif.

  3. Finding individual binding sites (problem 3) is then done by both standard PWM rescanning (based on the PWMs in (2)) and the w-scanning method (see Section Subspace Synopsis Features and the w-Scanning Model). The union of these two sub-models (based on the scanning and subspace scores) for each PWM produces a score for every location on the promoter. These local scores are aggregated through a dot product with the coefficients learned from the ensemble target classifier (problem 1). The local peaks of this score are identified and reported as predicted binding sites.

PWM Scanning Models and PWM synopsis features

The PWM model is a widely used motif model in sequence binding analysis [15, 17]. A PWM $P$ corresponding to a TF is a $4 \times l$ matrix, whose $j$-th column defines the probability distribution of base $b \in \{A, C, G, T\}$ appearing at position $j$ in the set of all binding sites of length $l$, i.e.,

$$P_{b,j} = \Pr(\text{base } b \text{ at position } j), \qquad \sum_{b} P_{b,j} = 1, \quad j = 1, \dots, l.$$

The matrix is usually generated empirically from a large number of likely binding sites by aligning and then counting frequencies of residues at given positions. In order to identify new potential binding sites within a sequence (problem 3), the given weight matrix is first changed into a log-ratio scoring matrix measuring likelihood against a background distribution $q$, so

$$S_{b,j} = \log \frac{P_{b,j}}{q_b},$$

where $q_b$ is the background probability of observing DNA base $b$. For any subsequence $x = x_1 x_2 \cdots x_l$ of length $l$ in the promoter, the matrix can score $x$ using the PWM scanning score

$$S_P(x) = \sum_{j=1}^{l} S_{x_j, j} = \sum_{j=1}^{l} \log \frac{P_{x_j, j}}{q_{x_j}},$$

i.e., $S_P(x)$ is the log-likelihood ratio of observing $x$ under motif model $P$ versus observing it under background $q$. A high score suggests that $x$ is more likely to be a binding site. For a given $P$, PWM scanning scores are computed for all $l$-mers in the promoter sequence. A threshold is then used to identify significant binding $l$-mers.
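The scanning procedure described above can be sketched as follows. This is a minimal illustration, assuming a PWM stored as a list of per-position base probabilities and a log-likelihood-ratio score summed over positions; the toy PWM, the uniform background, the zero threshold and all function names are hypothetical, not the authors' implementation.

```python
import math

BASES = "ACGT"

def log_ratio_matrix(pwm, background):
    """Convert column probabilities into log-ratio scores against the background."""
    return [{b: math.log(col[b] / background[b]) for b in BASES} for col in pwm]

def scan_score(log_ratio, window):
    """Log-likelihood ratio of one window (same length as the PWM)."""
    return sum(col[base] for col, base in zip(log_ratio, window))

def scan_promoter(log_ratio, promoter):
    """Score every window of the promoter; returns (position, score) pairs."""
    width = len(log_ratio)
    return [(i, scan_score(log_ratio, promoter[i:i + width]))
            for i in range(len(promoter) - width + 1)]

def column(preferred, p=0.7):
    # Toy PWM column: probability p on one base, the rest shared equally.
    rest = (1.0 - p) / 3
    return {b: (p if b == preferred else rest) for b in BASES}

toy_pwm = [column("A"), column("C"), column("G")]      # hypothetical motif "ACG"
uniform = {b: 0.25 for b in BASES}                      # assumed background
scores = scan_promoter(log_ratio_matrix(toy_pwm, uniform), "TTACGTT")
best_position, best_score = max(scores, key=lambda item: item[1])
hits = [pos for pos, s in scores if s > 0.0]            # hypothetical threshold of 0
```

Real PWMs estimated from counts usually need pseudocounts so that no probability is zero before the logarithm is taken.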

PWM models have also been used to explore gene targets (problem 1) using certain nonlinear gene candidate scoring systems. An example is use of the maximal local PWM scanning score (point 3 at the end of section 2.1 above) over the entire promoter sequences, with a (different) threshold . The explicit form of such a PWM scanning classifier is

with the maximum over all -mer strings in , and a value above 0 indicating the gene is a target. Because this maximum is not always robust against noise and outliers, some alternatives based on functions of the -mer scanning scores , for , have been developed to score a promoter . We can write , where can have several forms in addition to the above maximum:

  1. Linear scoring:

  2. -trimmed linear scoring:

where the latter means the sum is formed not of all rescanning scores, but just the top scores within the candidate promoter . Below we have chosen the trimmed linear thresholding score with .
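The three ways of turning per-window scanning scores into a single promoter score (maximum, linear sum, and trimmed linear sum over the top-scoring windows) can be sketched as below. The number of windows kept by the trimmed score, here `n_top`, is a hypothetical parameter, as are the example scores.

```python
def max_score(window_scores):
    """Maximal local scanning score over the promoter."""
    return max(window_scores)

def linear_score(window_scores):
    """Sum of the scanning scores of all windows."""
    return sum(window_scores)

def trimmed_linear_score(window_scores, n_top):
    """Sum of only the n_top highest scanning scores."""
    return sum(sorted(window_scores, reverse=True)[:n_top])

# Hypothetical scanning scores of the windows of one promoter.
window_scores = [0.5, -1.0, 3.0, 2.0]
```

The trimmed score interpolates between the other two: with `n_top = 1` it equals the maximum, and with `n_top` equal to the number of windows it equals the linear sum.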

A trivial (one dimensional) SVM algorithm can be used on training data consisting of the single feature to determine the best and . Selection of does not change the ranking power of or area under the ROC curve of the resulting classifier . We call the rescanning synopsis feature of corresponding to PWM .

Using the k-mer Feature Space and SVM Classifier

String (k-mer) representations of sequences have been studied for a number of years [1, 18, 19]. Any DNA sequence $s$ is an ordered list from the alphabet $\Sigma = \{A, C, G, T\}$. Let $\Sigma^k$ be all DNA strings of length $k$ (denoted as k-mers). Without considering the appearance order or partial overlap of the k-mers in the sequence $s$, we map $s$ into a feature space $F_k$ with feature map

$$\Phi_k(s) = \big(c_x(s)\big)_{x \in \Sigma^k},$$

where the vector $\Phi_k(s)$ has one component corresponding to each string $x$ of length $k$. Thus the element $c_x(s)$ of sequence $s$ is a count in this sequence of occurrences of k-mer $x$. The feature map $\Phi_k$ is the spectrum map [1], as it contains counts of all possible k-mers in $s$. The feature space $F_k$ of all possible $\Phi_k(s)$ is the k-mer (or spectrum) feature space. Combining several such spaces with different values of $k$ yields a full string feature space, $F = \bigoplus_k F_k$, where $\bigoplus$ denotes direct sum over $k$. Thus a feature vector in $F$ is a concatenation of feature vectors in the spaces $F_k$ for allowed $k$; without confusion $F$ will sometimes also be called the k-mer space.
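A minimal sketch of the spectrum feature map follows: it counts every overlapping k-mer in a sequence, and concatenating the counts for several values of k gives the combined string feature space. The function names and the example sequence are illustrative assumptions.

```python
from collections import Counter
from itertools import product

def spectrum(sequence, k):
    """Sparse k-mer spectrum: counts of every overlapping k-mer in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def spectrum_vector(sequence, k, alphabet="ACGT"):
    """Dense spectrum with one component per possible k-mer (4**k entries)."""
    counts = spectrum(sequence, k)
    return [counts["".join(letters)] for letters in product(alphabet, repeat=k)]

def full_spectrum(sequence, ks):
    """Direct sum over several k: k-mers of different lengths never collide."""
    merged = Counter()
    for k in ks:
        merged.update(spectrum(sequence, k))
    return merged

example = spectrum("ACGACG", 3)
```

The sparse `Counter` form is usually preferable in practice, since the dense vector has 4**k components and most of them are zero for a single promoter.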

The full feature map from the space of sequences into $F$, also denoted as $\Phi$, maps the training samples $\{(s_i, y_i)\}$ ($s_i$ is the $i$-th gene promoter sequence and $y_i \in \{+1, -1\}$ its label) to the more convenient dataset $D = \{(\Phi(s_i), y_i)\}$. Here $\Phi(s_i)$ is the k-mer spectrum of $s_i$. An SVM classifier trained on $D$ can classify new out-of-training samples $s$ [18] by computing the discriminant

$$f(s) = \mathbf{w} \cdot \Phi(s) + b.$$

If $f(s) > 0$, the corresponding gene is classified as a binding target. In testing on more than 100 yeast transcription factors, the SVM was trained on a space combining 4, 5 and 6-mer spaces. The gene classification accuracy was roughly 70% on test sets balanced between positives and negatives [36].

The w-vector of the SVM classifier has a component $w_x$ that measures the importance of feature $x$ (the count of the k-mer $x$ in the promoter) [47, 48]. Thus $w_x$ measures the power of feature $x$ in discriminating positive from negative training samples. Note that here the index of a feature and the k-mer it counts are considered the same. Table 3 shows the level of association between top-ranking k-mers and known binding motifs of TFs. Based on this, the SVMotif algorithm [19] was developed to extract binding motifs from the w-vector by agglomerating k-mers with top feature importance scores [19]. Tested on 85 yeast TFs with known binding motifs [25, 45], SVMotif was able to correctly predict 40 standard motifs in its top prediction and 57 of them within its top 3 predictions [19].

Rank GCN4 UME6 MIG1 STE12
1 gagtca 8.099 gccgcc 5.262 cccgc 0.858 gtttca 2.764
2 agtcat 4.434 agccgc 5.103 ggggaa 0.658 cgagaa 1.084
3 gactca 4.094 cggcta 4.253 acccca 0.622 gaaaca 1.014
4 agtca 1.679 gccgc 3.04 ccccgc 0.616 cattcc 0.934
5 gactc 1.66 ccgcc 2.718 ccgga 0.604 tcctaa 0.79
6 agtcac 1.127 ccgccg 2.05 acccc 0.603 agtatg 0.708
7 cattag 0.933 cgccga 1.857 ccgg 0.597 acattc 0.649
8 cttatc 0.886 agccg 1.634 ccccac 0.568 aaacag 0.609
9 actca 0.832 cgccg 1.124 ccgta 0.556 atgaaa 0.562
10 catgac 0.734 gcgcc 0.845 gcaaca 0.524 taggaa 0.556
Table 3: Top 10 k-mers in the output for sample yeast transcription factors GCN4(TGACTCA), UME6(TAGCCGCCSA), MIG1(WWWWSYGGGG), and STE12(TGAAACA). True (standard) binding motifs, retrieved from YeastGenome [www.yeastgenome.org], are listed in UPPER case following the gene name above. k-mers matching the corresponding motifs are highlighted in bold. The feature importance scores for k-mer features are listed next to the corresponding k-mers.

We note, as briefly mentioned earlier, that this SVM classifier can be viewed as a generalization of the standard PWM re-scanning classifier with linear scanning score (section PWM Scanning Models and PWM synopsis features). Specifically, if each w-component were to be artificially constructed to equal the PWM scanning score of the corresponding k-mer (so $w_x = S_P(x)$), then the linear motif rescanning classifier could be written as

$$f(s) = \sum_{x \subset s,\ |x| = k} S_P(x) + b = \sum_{x \in \Sigma^k} c_x(s)\, w_x + b = \mathbf{w}_P \cdot \Phi(s) + b.$$

Here the first sum is over all substrings $x$ of $s$ of length $k$, while the second is over all possible k-mers $x$, with $c_x(s)$ the number of occurrences of $x$ in $s$. Each element $w_x$ of $\mathbf{w}_P$ is the linear PWM scanning score of $P$ on k-mer $x$. From this viewpoint any PWM $P$ corresponds to a w-vector, which we denote as $\mathbf{w}_P$, such that the SVM-based classifier is equivalent to the PWM scanning classifier function with linear score. We call such vectors PWM-compatible w-vectors.
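The equivalence between linear PWM rescanning and a linear classifier on k-mer counts can be checked numerically. The sketch below builds a weight for every k-mer equal to its PWM scanning score and verifies that the dot product with the k-mer counts equals the sum of window scores; the toy PWM, the uniform background of 0.25 and the function names are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product

BASES = "ACGT"

def kmer_score(log_ratio, kmer):
    """PWM scanning score of one k-mer (k equals the PWM width)."""
    return sum(col[base] for col, base in zip(log_ratio, kmer))

def linear_scan_score(log_ratio, promoter):
    """Sum of scanning scores over all windows of the promoter."""
    k = len(log_ratio)
    return sum(kmer_score(log_ratio, promoter[i:i + k])
               for i in range(len(promoter) - k + 1))

def pwm_compatible_w(log_ratio):
    """One weight per possible k-mer, equal to its PWM scanning score."""
    k = len(log_ratio)
    return {"".join(t): kmer_score(log_ratio, t) for t in product(BASES, repeat=k)}

def spectrum_dot(w, promoter, k):
    """Dot product of the weight vector with the k-mer counts of the promoter."""
    counts = Counter(promoter[i:i + k] for i in range(len(promoter) - k + 1))
    return sum(w[x] * c for x, c in counts.items())

# Toy two-column PWM (hypothetical) against a uniform background.
probs = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
         {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}]
toy_log_ratio = [{b: math.log(col[b] / 0.25) for b in BASES} for col in probs]
promoter = "ACACGTTACA"
direct = linear_scan_score(toy_log_ratio, promoter)
via_w = spectrum_dot(pwm_compatible_w(toy_log_ratio), promoter, k=2)
```

The two numbers agree because summing over windows and summing over distinct k-mers weighted by their counts are the same sum grouped differently.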

On the other hand, in standard SVM training on a sample set, the optimal SVM w-vector is obtained by minimizing the Lagrangian

Here is the standard SVM hinge loss function; recall that

Since this optimization searches the entire w-vector space (including all PWM-compatible w-vectors), this optimized SVM classifier will be systematically as good as or better than (by the optimization criterion) any PWM-based scanning classifier with linear thresholding function.

However, not all PWM re-scanning classifiers are based on linear scoring. The classifier with maximum scanning score

defined earlier in Section PWM Scanning Models and PWM synopsis features, can similarly be emulated in this ML framework using another sub-class of SVM-based classifiers, as shown below.

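This time we replace $\Phi$ by a modified feature map $\tilde{\Phi}$ in which each component $\tilde{c}_x(s)$ of the feature vector is binary, i.e. either value 1 or 0. This depends on whether k-mer $x$ does or does not appear in the promoter $s$. In practice it has been shown [19] that such a modified feature space can improve accuracy of motif identification. We begin with the mathematical fact that the $L^p$ norm approaches the $L^\infty$ norm as $p \to \infty$, i.e., for any numbers $a_1, \dots, a_n$,

$$\lim_{p \to \infty} \Big( \sum_{i=1}^{n} |a_i|^p \Big)^{1/p} = \max_{1 \le i \le n} |a_i|.$$

Based on this we can approximate the maximum scanning score (with large $p$) by

$$\Big( \sum_{x \in \Sigma^k} \tilde{c}_x(s)\, S_P(x)^p \Big)^{1/p}.$$

Here $S_P(x)^p$ is the $p$-th power of the PWM scanning score of k-mer $x$, and $\tilde{c}_x(s) = 1$ or $0$ indicates whether $x \in s$. For very large $p$, the term approximates the maximum scanning score of $s$, i.e.,

$$\Big( \sum_{x \in \Sigma^k} \tilde{c}_x(s)\, S_P(x)^p \Big)^{1/p} \approx \max_{x \in s} S_P(x);$$

recall that $S_P(x)$ denotes the (re)-scanning score with PWM $P$ of k-mer $x$.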

The monotonic increase in of the function , ensures that the SVM scoring function always ranks the samples in the same way as . Thus the approximation

with choice of threshold can be constructed to select the same targets as the original PWM-based thresholding classifier based on the maximum scanning score.

Note that the new -vector , with (), is still in the search space of the SVM optimization. Therefore after finding the minimizing the above Lagrangian over the space of all candidate ’s (including the above PWM-based -vectors), the quality of the resulting SVM (as defined by lower values of itself) is also as good as or better than the maximum score PWM-based classifier.
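The approximation of the maximum scanning score by a sum of p-th powers can be illustrated numerically. This sketch assumes non-negative scanning scores (so that p-th powers and p-th roots are well defined), a binary presence feature for each k-mer, and hypothetical toy scores; it is an illustration of the limiting argument, not the authors' code.

```python
def power_weights(kmer_scores, p):
    """Weight for each k-mer: the p-th power of its (non-negative) scanning score."""
    return {x: s ** p for x, s in kmer_scores.items()}

def binary_svm_score(weights, present_kmers):
    """Linear score on binary features: sum of weights of the k-mers present."""
    return sum(weights[x] for x in present_kmers)

def approx_max(kmer_scores, present_kmers, p):
    """p-th root of the linear score; tends to the maximum score as p grows."""
    return binary_svm_score(power_weights(kmer_scores, p), present_kmers) ** (1.0 / p)

# Hypothetical scanning scores and two promoters given by their k-mer sets.
toy_scores = {"AAA": 1.0, "ACG": 3.0, "CGT": 2.0, "TTT": 0.5}
promoter_a = {"AAA", "ACG"}          # true maximum scanning score 3.0
promoter_b = {"AAA", "CGT", "TTT"}   # true maximum scanning score 2.0
```

Because the p-th root is monotonically increasing, ranking promoters by the linear score on powered weights is the same as ranking them by `approx_max`, which is why the root itself never needs to be computed inside the classifier.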

PWM subspaces

In the PWM rescanning model, a gene is classified as a TF target if its promoter contains k-mers that are high-scoring with respect to the PWM motif of the TF. The frequency of k-mers with high PWM scores in a promoter is strongly correlated with the PWM-based classifier predicting the promoter as a target. Thus in the k-mer feature space, an ML algorithm based on a PWM can be dimensionally reduced (without much loss) to operate within the subspace that is spanned only by basis k-mers that are high-scoring with respect to that PWM.

For a given PWM, we define the set of such high-scoring k-mers to be the profile set. We call the feature subspace spanned by the profile set the profile subspace corresponding to the PWM. A motif discovery algorithm producing such dimensionally reduced subspaces of the full k-mer feature space (e.g., based as above on a PWM) is then called a subspace-valued weak learner. For a single PWM, the ML term ‘weak learner’ refers to the fact that the subspace learned from it is a small part of a sum of such subspaces aggregated from different candidate PWMs for the TF (it does not necessarily denote weakness in its usual sense). Unlike standard weak learners, which make individual predictions that are then aggregated, these weak learners produce dimensionally reduced candidate subspaces for further ML search. This is based on the ML notion that individual weak learners can be aggregated into stronger machines.

Given a PWM $P$, the k-mers in the profile set spanning the profile subspace are selected as follows. Suppose $P$ is a $4 \times l$ PWM. From $P$ we can randomly generate length-$l$ DNA base sequences by selecting the letters from a probability distribution, with the $j$-th column vector of $P$ giving the probability distribution of bases (A, C, G or T) at position $j$. To generate shorter sequences ($k < l$), a smaller block, which consists of $k$ consecutive columns of $P$, is randomly selected and used to replace $P$. On the other hand, to generate longer sequences ($k > l$), columns generated by a background nucleotide distribution can be concatenated to expand $P$ on both sides.

After generating a representative set of k-mers from the PWM in this way, an additional step is to score each k-mer using the log-ratio scoring matrix above and to eliminate those that score below a set threshold. This ensures that the remaining k-mers follow a pattern significantly different from the background. Among many different ways to choose the threshold, we used


For k = 6, this method results in a k-mer subspace of dimension 50 to 200, out of an original dimensionality of 2080 (the number of distinct 6-mers when reverse complements are identified). In our implementation we consider k-mers of length from 4 to 10. If a PWM succeeds in capturing the pattern of binding sites of the TF, the k-mers in its profile set should be sufficient to separate binding target sequences from non-targets. Thus the SVM built in the dimensionally reduced profile subspace should perform similarly to that built in the full feature space, or better if noise reduction is taken into account. From this point of view the component algorithms providing the PWMs can be seen as dimension reduction tools that filter out noisy k-mers from the feature space. Because the dimension is significantly reduced, machine learning methods can now also be used to examine longer k-mer patterns.
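The construction of a profile set can be sketched as follows: k-mers are sampled from the PWM (from a random block of columns when k is shorter than the PWM, or with background columns padded on both sides when k is longer), scored by log-ratio, and kept only if they reach a threshold. The threshold value, the number of samples, the toy PWM and all names below are hypothetical choices made for illustration; the threshold actually used in this work is not reproduced here.

```python
import math
import random

BASES = "ACGT"

def choose_columns(pwm, k, rng, background):
    """Pick k columns: a random block of the PWM, or the PWM padded with background."""
    width = len(pwm)
    if k <= width:
        start = rng.randrange(width - k + 1)
        return pwm[start:start + k]
    left = rng.randrange(k - width + 1)
    right = k - width - left
    return [background] * left + pwm + [background] * right

def profile_set(pwm, k, background, threshold, n_samples=500, seed=0):
    """Sample k-mers from the PWM and keep those whose log-ratio score passes."""
    rng = random.Random(seed)
    kept = set()
    for _ in range(n_samples):
        cols = choose_columns(pwm, k, rng, background)
        kmer = "".join(rng.choices(BASES, weights=[c[b] for b in BASES])[0] for c in cols)
        score = sum(math.log(c[ch] / background[ch]) for c, ch in zip(cols, kmer))
        if score >= threshold:   # hypothetical cutoff against the background model
            kept.add(kmer)
    return kept

def column(preferred, p=0.97):
    rest = (1.0 - p) / 3
    return {b: (p if b == preferred else rest) for b in BASES}

toy_pwm = [column("A"), column("C"), column("G")]   # sharply peaked toy motif "ACG"
uniform = {b: 0.25 for b in BASES}
profile_3 = profile_set(toy_pwm, 3, uniform, threshold=0.0)
profile_5 = profile_set(toy_pwm, 5, uniform, threshold=0.0)
```

With a sharply peaked toy PWM the profile set is tiny; a realistic PWM with more degenerate positions yields the larger sets that span a useful profile subspace.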

We give a numerical example to illustrate the connections among the above-mentioned types of classifiers. These include PWM rescanning classifiers (using maximum scanning scores), the SVM classifiers

linear SVM classifiers (trained on full -mer feature space ), and subspace-based SVM classifiers (trained on the profile subspace; see Using the -mer Feature Space and SVM Classifier and PWM subspaces). The PWM for a well studied yeast TF, GCN4, was retrieved from the UCSC database [42], and positive and negative samples of target and non-target promoters were obtained (based on ChIP-chip experiment data [41]).

Note that the above PWM-based rescanning classifier and the SVM-based classifiers with a PWM-compatible w-vector do not involve training (after selection of the PWM), since the w-vector is obtained strictly from PWM-based k-mer scanning scores, i.e., each component of the w-vector is determined by the scanning score of the corresponding k-mer. Thus they are tested on the whole dataset directly. The two SVM classifiers in Table 4 were assessed under a 5-fold cross-validation protocol, where all samples are randomly divided into 5 portions, with the classifier trained on 4 portions and scored (tested) on the remaining portion; this is repeated 5 times, rotating the test portion, until all samples are scored. Table 4 shows the area under the ROC curve for each algorithm, and shows that for sufficiently large p the SVM-based classifier has about the same performance as the PWM-scanning classifier. This is consistent with the argument that the search space of the second classifier is in fact contained in the other (Section Using the k-mer Feature Space and SVM Classifier). Interestingly, the dimension-reduced SVM classifier trained on the profile subspace (the subspace of k-mers generated by the PWM, here restricted to 35 dimensions) outperforms the SVM classifier built on the full 7-mer space, suggesting the effectiveness of dimension reduction through PWM-subspaces (Section PWM subspaces).

PWM SVM-based(p=1) SVM-based(p=4) SVM-based(p=10) SVM(full) SVM(subspace)
0.8428 0.6691 0.8282 0.8441 0.6833 0.8195
Table 4: Area under ROC curves (AUROC) for different classifiers. The PWM column presents the AUROC of the PWM maximum scanning score-based classifier. Columns 2 through 4 present the AUROC of the above-mentioned SVM classifier using the p-th powers of PWM-compatible w-vectors (p = 1, 4, 10). The two columns on the right present the AUROC for two linear SVMs, one (full) trained on the full 7-mer feature space and one (subspace) on the PWM-derived 7-mer subspace.
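For readers who want to reproduce this style of evaluation, the sketch below shows one common way to compute an AUROC (as the fraction of positive/negative pairs ranked correctly) and to split samples into five rotating test folds. It assumes labels of +1 and -1; the function names and the example scores are hypothetical and this is not the evaluation code used for Table 4.

```python
import random

def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def five_fold_indices(n_samples, seed=0):
    """Randomly split sample indices into 5 disjoint test folds."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[i::5] for i in range(5)]

# Hypothetical classifier scores and labels.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, -1, 1, -1]
toy_auc = auroc(toy_scores, toy_labels)
folds = five_fold_indices(23)
```

Since the AUROC depends only on the ranking of the scores, adding a constant threshold or bias to every score leaves it unchanged.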

Subspace Synopsis Features and the w-Scanning Model

The profile set and profile subspace corresponding to a PWM $P$, defined in the previous section, are used to reduce dimension, focusing classifiers on the parts of the feature space with the most information. Here we define some tools arising from profile subspaces. First, the profile set of a good PWM contains k-mers whose frequency counts in a promoter easily discriminate positives (targets) from negatives (non-targets). Thus the discrimination power of the SVM classifier trained only in the profile subspace of $P$, given as

$$f_P(x) = w_P \cdot x_P + b_P,$$

where $x_P$ denotes the k-mer count vector of promoter $x$ restricted to that subspace, also provides a goodness measure of the corresponding $P$.

Second, because the shift $b_P$ does not affect gene rankings, we use scores without $b_P$, given by

$$g_P(x) = w_P \cdot x_P,$$

as a subspace synopsis feature of the classifier built from $P$. A single synopsis feature such as this summarizes information from the profile subspace of $P$, while the union of such features (over different $P$ from the ensemble) forms the feature vector for discriminating the TF’s targets.

Third, for binding site detection we use a w-vector scanning model to find the binding sites in a promoter $x$. Instead of using the standard PWM scanning-based log-ratio scoring matrix (Section PWM Scanning Models and PWM synopsis features) to score each k-mer string in $x$ (as a potential binding site), the w-scanning model provides each k-mer $j$ with an SVM feature importance score [48],

$$I_j = w_j \left(m_j^{+} - m_j^{-}\right).$$

Here the $w_j$ are coefficients in the trained classification function, and $m_j^{+}$ and $m_j^{-}$ are the average numbers of appearances of k-mer $j$ in the positive and negative training sets, respectively. Scanning along the promoter $x$, successive feature importance scores give the k-mer at each position $i$ an importance score. As with PWM-based rescanning, a threshold is determined for selecting k-mers that give significant predictions (Section Identifying Binding Sites with Ensembles).
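To make the w-scanning idea concrete, the following sketch computes a per-k-mer importance score of the form weight times (mean count in positives minus mean count in negatives) and then reads it off along a promoter. It is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, the weights are supplied by hand rather than taken from a trained SVM, and reverse complements are merged into one feature as described in the footnotes.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Merge a k-mer with its reverse complement into a single feature name."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def kmer_counts(seq, k):
    """Count canonical k-mers in one sequence."""
    counts = {}
    for i in range(len(seq) - k + 1):
        key = canonical(seq[i:i + k])
        counts[key] = counts.get(key, 0) + 1
    return counts

def mean_counts(seqs, k):
    """Average number of appearances of each canonical k-mer per sequence."""
    totals = {}
    for seq in seqs:
        for key, count in kmer_counts(seq, k).items():
            totals[key] = totals.get(key, 0) + count
    return {key: total / len(seqs) for key, total in totals.items()}

def importance_scores(weights, positives, negatives, k):
    """Importance of k-mer j: w_j * (m_j_plus - m_j_minus); weights use canonical keys."""
    m_pos, m_neg = mean_counts(positives, k), mean_counts(negatives, k)
    return {key: w * (m_pos.get(key, 0.0) - m_neg.get(key, 0.0))
            for key, w in weights.items()}

def w_scan(promoter, importance, k):
    """Importance score of the k-mer starting at each promoter position."""
    return [importance.get(canonical(promoter[i:i + k]), 0.0)
            for i in range(len(promoter) - k + 1)]

# Toy data (hypothetical): "TGA" and its reverse complement "TCA" share one feature.
toy_importance = importance_scores({"TCA": 2.0}, ["AAATGA", "TGATTT"], ["CCCCCC"], k=3)
toy_profile = w_scan("GGTGAGG", toy_importance, k=3)
```

Positions whose score exceeds a chosen threshold would then be kept as candidate sites; the threshold itself is not specified in this sketch.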

Identifying Gene Targets with Ensembles Using Synopsis Feature Spaces

With the basics above we construct in more detail the ensemble methods for our identification problems. We use five component ensemble algorithms, or weak learners, each generating PWMs to be used in the ensemble algorithm.

For each TF we trained these five components on the training data sequences (known positives and negatives), without any parameter tuning. For BioProspector and MEME, the size of the motif is a required parameter, and we therefore used multiple runs of each algorithm with motif sizes from 7 to 12. For each TF we collected 29 PWMs (see Section Experimental Protocol) from the five algorithms into a candidate pool. With these PWMs we constructed the (synopsis) feature spaces defined below for their ensemble estimates of the best PWM.

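Rescanning Features: For each PWM $P_i$ in the candidate pool ($i = 1, \dots, 29$), we used the PWM rescanning synopsis score, written here as $r_{P_i}(x)$, as a feature, mapping both training and test data into the rescanning synopsis feature space with feature vectors

$$\Phi_R(x) = \left(r_{P_1}(x), \dots, r_{P_{29}}(x)\right).$$

Subspace Features: In addition to using PWM rescanning synopsis scores as features, we also studied the subspace-based synopsis features. We first trained a dimensionally reduced SVM in each of the (previously mentioned) subspaces determined by the PWMs $P_i$. Then each sequence $x$ (in the training or test data) was mapped into the subspace synopsis feature space with feature vectors

$$\Phi_S(x) = \left(g_{P_1}(x), \dots, g_{P_{29}}(x)\right),$$

where $g_{P_i}(x)$ is the subspace synopsis score (the SVM score without its shift) for $P_i$.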

This formed a second set of features for discriminating genes as targets or non-targets of the TF.

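Ensemble Features: We combined the features from the rescanning and subspace synopses to get the full synopsis feature vector

$$\Phi(x) = \left(\Phi_R(x), \Phi_S(x)\right),$$

the concatenation of the 29 rescanning synopsis scores $\Phi_R(x)$ and the 29 subspace synopsis scores $\Phi_S(x)$ of sequence $x$.

The resulting full (but still low-dimensional) synopsis feature space thus has 58 dimensions. This space allows standard low-dimensional statistical methods to be used to classify targets of the TF based on these features.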
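A small sketch may help show how a promoter is turned into such a synopsis vector. It assumes, for illustration only, that a PWM is stored as a list of per-position log-odds dictionaries, that the rescanning score is the maximum window score, and that the subspace score is the weighted k-mer count without shift; reverse complements are ignored for brevity, and all names are hypothetical.

```python
def max_scan_score(promoter, log_odds):
    """Best log-odds score over all windows; promoter must be at least as long as the PWM."""
    width = len(log_odds)
    return max(
        sum(column[base] for column, base in zip(log_odds, promoter[i:i + width]))
        for i in range(len(promoter) - width + 1)
    )

def subspace_score(promoter, weights, k):
    """Dot product w . x over the weighted k-mers only (no shift term)."""
    return sum(weights.get(promoter[i:i + k], 0.0)
               for i in range(len(promoter) - k + 1))

def synopsis_vector(promoter, pwms, subspace_weights, k):
    """Concatenate rescanning features (one per PWM) with subspace features (one per PWM)."""
    rescanning = [max_scan_score(promoter, pwm) for pwm in pwms]
    subspace = [subspace_score(promoter, w, k) for w in subspace_weights]
    return rescanning + subspace

# Toy inputs (hypothetical): one 2-column PWM favouring "AG" and one weight table.
toy_pwm = [{"A": 1, "C": -1, "G": -1, "T": -1},
           {"A": -1, "C": -1, "G": 1, "T": -1}]
toy_weights = {"AG": 0.5, "TT": -0.25}
toy_vector = synopsis_vector("TTAGTT", [toy_pwm], [toy_weights], k=2)
```

With 29 PWMs and 29 matching weight tables the same call returns a 58-component vector, which is the input to the ensemble classifier.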

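We can now use the above feature spaces to solve problem (1), the discrimination of gene targets of the TF. Our feature maps send training and test samples into the rescanning synopsis space, the subspace synopsis space, or (for the combined map) the full synopsis space. An SVM is built in each of these spaces to form an ensemble-based classifier determining the TF targets and non-targets from the original weak classifiers (the PWM rescanning scores or the subspace SVM scores). Thus the full ensemble SVM discriminant function (separating targets and non-targets) is $F(x) = W \cdot \Phi(x) + B$, where $\Phi(x)$ is the 58-dimensional synopsis feature vector of sample $x$ and $W$ and $B$ are the weight vector and shift of the trained SVM.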

Below we describe the algorithms for identifying binding motifs and binding sites based on this ensemble. For definiteness we assume the sample sequence is mapped into the synopsis feature space using the above combined feature map.

Identifying Binding Motifs with Ensembles

Our ML approach formalizes the problem of finding binding motifs for a given TF as a feature selection problem. Specifically, a sample sequence $x$ (among the known targets or non-targets) is represented in the combined synopsis space by the feature vector

$$\Phi(x) = \left(r_{P_1}(x), \dots, r_{P_{29}}(x),\; g_{P_1}(x), \dots, g_{P_{29}}(x)\right),$$

whose components are the rescanning synopsis scores $r_{P_i}$ and the subspace synopsis scores $g_{P_i}$ of the candidate PWMs $P_i$.

Selection of the best PWM then means finding the one giving the most discriminative feature in this feature set. Here discrimination is measured by target/non-target separation in the training set of known targets (positives) and non-targets (negatives). Thus we select the set of PWMs whose synopsis features (rescanning or subspace) jointly form the best set for discriminating targets.

Among a number of feature selection methods for SVM, we tested RSVM [48] within the combined synopsis feature space. In each run the machine generates an importance score for each synopsis feature (rescanning or subspace), from which we rank the PWMs.

We repeated the feature selection procedure multiple times with different sets of negatives (presumed non-target sequences selected at random from the genome) together with the known set of positives. The averaged RSVM importance score was finally used to rank the PWMs in the candidate pool. Because some PWMs in the pool are highly similar to each other, we also clustered the top 10 PWMs and re-ranked them using their weighted entropy scores (see [19] for the score definition).
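The repeat-and-average step can be sketched as below. The feature-selection run itself is left as a caller-supplied function (`score_features`, a hypothetical placeholder for an RSVM run that returns one importance score per feature); the number of runs, the size of each negative set, and the seed are illustrative assumptions, since the text does not state them.

```python
import random

def rank_by_average_importance(positives, background, score_features,
                               n_runs=10, n_negatives=None, seed=0):
    """Average per-feature importance over runs with freshly sampled negatives, then rank."""
    rng = random.Random(seed)
    if n_negatives is None:
        n_negatives = len(positives)  # assumption: balanced classes in each run
    totals = {}
    for _ in range(n_runs):
        negatives = rng.sample(background, n_negatives)
        for name, score in score_features(positives, negatives).items():
            totals[name] = totals.get(name, 0.0) + score
    averaged = {name: total / n_runs for name, total in totals.items()}
    # Highest average importance first.
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)

# Stub standing in for one feature-selection run; a real run would train an SVM.
def stub_scores(positives, negatives):
    return {"P1": 1.0, "P2": 3.0, "P3": 2.0}

toy_ranking = rank_by_average_importance(
    ["ACGT", "TGCA"], ["AAAA", "CCCC", "GGGG", "TTTT"], stub_scores)
```

Averaging over several negative sets reduces the dependence of the ranking on any single random draw of presumed non-targets.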

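The reported PWMs were compared with UCSC motifs [42] using the following standard motif similarity measure. Because each PWM column defines a distribution over the nucleotides $\{A, C, G, T\}$, we first define the similarity $\mathrm{sim}(c, c')$ between two PWM columns $c$ and $c'$ based on the Jensen-Shannon divergence (a symmetrized Kullback-Leibler divergence), defined as the distance

$$D_{JS}(p, q) = \tfrac{1}{2} D_{KL}(p \,\|\, m) + \tfrac{1}{2} D_{KL}(q \,\|\, m), \qquad m = \tfrac{1}{2}(p + q),$$

between the two probability distributions, where $D_{KL}(p \,\|\, q) = \sum_a p(a) \log \frac{p(a)}{q(a)}$. Here $p(a)$ and $q(a)$ represent the probabilities of points $a$ in the probability space, corresponding to distributions $p$ and $q$. Then, given an alignment $A$ (giving a correspondence) between the columns of two PWMs $P$ and $Q$, the similarity is the sum of similarity scores between pairs of aligned columns $c_i$ of $P$ and $c'_j$ of $Q$, i.e.,

$$\mathrm{Sim}_A(P, Q) = \sum_{(i,j) \in A} \mathrm{sim}(c_i, c'_j).$$

The final similarity is the maximum among all possible non-gapped alignments,

$$\mathrm{Sim}(P, Q) = \max_A \mathrm{Sim}_A(P, Q).$$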
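The motif comparison can be illustrated with a short sketch. Two choices here are assumptions made for the example rather than details taken from the text: the column similarity is taken to be one minus the base-2 Jensen-Shannon divergence (so it lies between 0 and 1), and columns left unaligned by an offset contribute nothing to the sum. PWMs are represented as lists of probability columns in the order A, C, G, T.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence to the midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def column_similarity(p, q):
    # Assumption: similarity = 1 - JS divergence (base 2), so identical columns score 1.
    return 1.0 - js_divergence(p, q)

def pwm_similarity(pwm_a, pwm_b):
    """Maximum summed column similarity over all ungapped alignments (offsets)."""
    best = float("-inf")
    for offset in range(-(len(pwm_b) - 1), len(pwm_a)):
        total = 0.0
        for j, column_b in enumerate(pwm_b):
            i = offset + j
            if 0 <= i < len(pwm_a):  # unaligned overhang contributes nothing
                total += column_similarity(pwm_a[i], column_b)
        best = max(best, total)
    return best

# One-hot toy columns (hypothetical motifs "ACG" and "CG").
A, C, G = [1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0]
toy_similarity = pwm_similarity([A, C, G], [C, G])
```

Sliding the shorter motif across the longer one and keeping the best offset is what "maximum among all non-gapped alignments" amounts to in code.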

Identifying Binding Sites with Ensembles

We have discussed the PWM scanning model (Section PWM Scanning Models and PWM synopsis features) and the w-scanning model (Section Subspace Synopsis Features and the w-Scanning Model) for identifying binding sites. Either model can generate a series of binding strength scores at the local k-mer level, i.e., for each k-mer $t$ within promoter $x$. Each consecutive k-mer $t$ in $x$ yields a local feature vector

$$\phi(t) = \left(\phi_1(t), \dots, \phi_{58}(t)\right)$$

of 58 scores: 29 PWM scanning scores and 29 w-scanning scores. Compared with the global synopsis feature vector defined for the entire promoter, this vector is a local version, applied to k-mers in $x$. The local ensemble score of the k-mer is defined as a linear combination of these 58 scores with the same coefficients as the ensemble SVM classifier for finding gene targets (Section Identifying Gene Targets with Ensembles Using Synopsis Feature Spaces). Thus the coefficients of the classifier for gene target identification, the components of its weight vector $W$, are now used locally with the above feature vectors to score consecutive k-mers as candidate binding sites. The local ensemble score for each $t$ is thus computed as

$$W \cdot \phi(t) = \sum_{j=1}^{58} W_j \, \phi_j(t).$$

We use the same scheme to score locally both positives and negatives. The scores for negatives provide a background distribution for this class of comprehensive local scanning scores. The scores of positives are then standardized by subtracting the mean of the background scores and dividing by their standard deviation. For an approximately normal background, the threshold 2.575 is the two-sided 1% critical value, so only about 0.5% of the negatives are expected to score above it, which keeps the false positive rate low. The k-mer locations in the promoter yielding local scoring peaks above 2.575 are identified as potential binding sites at the first stage of the algorithm.
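The first-stage scoring can be sketched as three small steps: a linear local ensemble score, standardization against the background of negative scores, and selection of local peaks above the threshold. The function names, the use of the population standard deviation, and the peak definition (a position at least as high as both neighbours) are assumptions made for this illustration.

```python
import statistics

def local_ensemble_score(local_features, coefficients):
    """Linear combination of the local scores with the ensemble classifier's coefficients."""
    return sum(c * f for c, f in zip(coefficients, local_features))

def standardize(scores, background_scores):
    """Z-scores relative to the mean and standard deviation of the negatives' scores."""
    mu = statistics.mean(background_scores)
    sigma = statistics.pstdev(background_scores)  # assumption: population std. dev.
    return [(s - mu) / sigma for s in scores]

def peak_positions(z_scores, threshold=2.575):
    """Positions that exceed the threshold and are not lower than either neighbour."""
    peaks = []
    for i, z in enumerate(z_scores):
        left = z_scores[i - 1] if i > 0 else float("-inf")
        right = z_scores[i + 1] if i + 1 < len(z_scores) else float("-inf")
        if z > threshold and z >= left and z >= right:
            peaks.append(i)
    return peaks

# Toy numbers (hypothetical): background has mean 1 and standard deviation 1.
toy_z = standardize([1, 4, 5, 4, 1], [0, 2, 0, 2])
toy_peaks = peak_positions(toy_z)
```

Only the single highest position in a run of high scores is reported here, which mirrors the idea of keeping scoring peaks rather than every k-mer above the threshold.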

These raw potential sites usually self-aggregate into several motif patterns. Therefore, in order to further remove false positives, we cluster the motif patterns with a greedy algorithm [19]. The potential binding sites matched to the primary cluster are finally reported as second stage predicted binding sites.

Because the algorithm used in the second (clustering) stage is greedy, some k-mers that temporarily match the pattern of the primary cluster may not match it at the end of the clustering process. We therefore use the following method to select the final (third stage) matched sites. We scan each potential binding site $t$ from the first stage with the PWM of the primary cluster from the second stage. A normalized rescanning score is then calculated as

$$\hat{s}(t) = \frac{s(t) - \mathrm{MIN}}{\mathrm{MAX} - \mathrm{MIN}},$$

where $s(t)$ is the rescanning score of the site $t$ (see Section PWM Scanning Models and PWM synopsis features) and MAX and MIN are the best and worst rescanning scores obtained by scanning all possible substrings with the PWM of the primary cluster. If $\hat{s}(t)$ is above the chosen cutoff, then $t$ is kept in the final (third stage) prediction.
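The third-stage filter can be written down compactly. The sketch below assumes an additive log-odds PWM (a list of per-position dictionaries), for which the best and worst possible scores are simply the sums of the column maxima and minima; the cutoff value used in the toy call (0.7) is purely illustrative and is not taken from the text.

```python
def site_score(site, log_odds):
    """Additive PWM score of a site whose length equals the PWM width."""
    return sum(column[base] for column, base in zip(log_odds, site))

def score_range(log_odds):
    """Best and worst achievable scores over all possible strings of the PWM width."""
    best = sum(max(column.values()) for column in log_odds)
    worst = sum(min(column.values()) for column in log_odds)
    return best, worst

def normalized_rescanning_score(site, log_odds):
    """Rescale the site score to [0, 1]: (score - MIN) / (MAX - MIN)."""
    best, worst = score_range(log_odds)
    return (site_score(site, log_odds) - worst) / (best - worst)

def filter_sites(sites, log_odds, cutoff):
    """Keep the candidate sites whose normalized score reaches the cutoff."""
    return [s for s in sites if normalized_rescanning_score(s, log_odds) >= cutoff]

# Toy PWM of width 2 (hypothetical values); 0.7 is an illustrative cutoff only.
toy_log_odds = [{"A": 2, "C": 0, "G": 0, "T": -2},
                {"A": -1, "C": 1, "G": -1, "T": -1}]
toy_kept = filter_sites(["AC", "CC", "TA"], toy_log_odds, cutoff=0.7)
```

Normalizing by the score range makes the cutoff comparable across PWMs of different widths and information content.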



Footnotes

  1. This work was partially supported by NIH grant 1R21CA13582-01 and NIH grant 1R01GM080625-01A1.
  2. They also predict binding sites for training samples. However, this cannot be used to identify target genes in new samples.
  3. Table 1 lists the definitions of performance metrics used in [22] and this paper.
  4. k-mers forming reverse complements are treated as the same feature, so the dimension is smaller than the standard $4^k$.

References

  1. Middendorf M, Kundaje A, Wiggins C, Freund Y, Leslie C: A classification-based framework for predicting and analyzing gene regulatory response. BMC Bioinformatics 2006, 7.
  2. Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, Korbel JO, Emanuelsson O, Zhang ZD, Weissman S, Snyder M: What is a gene, post-ENCODE? History and updated definition. Genome Research 2007, 17(6):669–681.
  3. Davidson EH, McClay DR, Hood L: Regulatory gene networks and the properties of the developmental process. Proceedings of the National Academy of Sciences of the United States of America 2003, 100(4):1475–1480.
  4. Levine M, Tjian R: Transcription regulation and animal diversity. Nature 2003, 424(6945):147–151.
  5. Warner JB, Philippakis AA, Jaeger SA, He FS, Lin J, Bulyk ML: Systematic identification of mammalian regulatory motifs’ target genes and functions. Nature Methods 2008, 5(4):347–353.
  6. Das M, Dai HK: A survey of DNA motif finding algorithms. BMC Bioinformatics 2007, 8(Suppl 7):S21+.
  7. Hertz G, Hartzell G, Stormo G: Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Computational and Applied Biosciences 1990, 6:81–92.
  8. Workman CT, Stormo GD: ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. Pacific Symposium on Biocomputing 2000, :467–478.
  9. Roth F, Hughes J, Estep P, Church G: Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nature Biotechnology 1998, 16:939–945.
  10. Liu X, Brutlag D, Liu J: BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Proceedings of the Sixth Pacific Symposium on Biocomputing 2001, :127–138.
  11. Bailey T, Elkan C: Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning 1995, 21:51–80.
  12. Bailey TL, Williams N, Misleh C, Li WW: MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Research 2006, 34(suppl_2):W369–373.
  13. Liu XS, Brutlag DL, Liu JS: An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nature Biotechnology 2002, 20:835–839.
  14. Lawrence C, Altschul S, Boguski M, Liu J, Neuwald A, Wootton J: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 1993, 262:208–214.
  15. Hertz GZ, Hartzell GW III, Stormo GD: Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Computational and Applied Biosciences 1990, 6(2):81–92.
  16. Hertz GZ, Stormo GD: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 1999, 15(7):563–577.
  17. Stormo GD: DNA binding sites: representation and discovery. Bioinformatics 2000, 16:16–23.
  18. Holloway DT, Kon MA, DeLisi C: Machine learning methods for transcription data integration. IBM Journal of Research and Development 2006, 50(6):631–644.
  19. Kon MA, Fan Y, Holloway D, DeLisi C: SVMotif: A Machine Learning Motif Algorithm. In ICMLA ’07: Proceedings of the Sixth International Conference on Machine Learning and Applications, Washington, DC, USA: IEEE Computer Society 2007:573–580.
  20. Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS: Large-Scale Mapping and Validation of Escherichia coli Transcriptional Regulation from a Compendium of Expression Profiles. PLoS Biology 2007, 5:e8+.
  21. Mordelet F, Vert JP: SIRENE: supervised inference of regulatory networks. Bioinformatics 2008, 24(16):76–82.
  22. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Regnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology 2005, 23:137–144.
  23. Romer KA, Kayombya GR, Fraenkel E: WebMOTIFS: automated discovery, filtering and scoring of DNA sequence motifs using multiple programs and Bayesian approaches. Nucleic Acids Research 2007, 35(suppl_2):W217–220.
  24. Gordon DB, Nekludova L, McCallum S, Fraenkel E: TAMO: a flexible, object-oriented framework for analyzing transcriptional regulation using DNA-sequence motifs. Bioinformatics 2005, 21(14):3164–3165.
  25. Harbison C, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, Pokholok DK, Kellis M, Rolfe PA, Takusagawa KT, Lander ES, Gifford DK, Fraenkel E, Young RA: Transcriptional regulatory code of a eukaryotic genome. Nature 2004, 431(7004):99–104.
  26. Hu J, Yang Y, Kihara D: EMD: an ensemble algorithm for discovering regulatory motifs in DNA sequences. BMC Bioinformatics 2006, 7:342.
  27. Reddy TE, Shakhnovich BE, Roberts DS, Russek SJ, DeLisi C: Positional clustering improves computational binding site detection and identifies novel cis-regulatory sites in mammalian GABAA receptor subunit genes. Nucleic Acids Research 2007, 35(3):e20.
  28. Reddy TE, Delisi C, Shakhnovich BE: Binding Site Graphs: A New Graph Theoretical Framework for Prediction of Transcription Factor Binding Sites. PLoS Computational Biology 2007, 3(5):e90+.
  29. Yanover C, Singh M, Zaslavsky E: M are better than one: an ensemble-based motif finder and its application to regulatory element prediction. Bioinformatics 2009, 25(7):868–874.
  30. Pavesi G, Mauri G, Pesole G: In silico representation and discovery of transcription factor binding sites. Brief Bioinformatics 2004, 5:217–236.
  31. Che D, Jensen ST, Cai L, Liu JS: BEST: Binding-site Estimation Suite of Tools. Bioinformatics 2005, 21(12):2909–2911.
  32. Wei Z, Jensen ST: GAME: detecting cis-regulatory elements using a genetic algorithm. Bioinformatics 2006, 22(13):1577–1584.
  33. Jensen ST, Liu JS: BioOptimizer: a Bayesian scoring function approach to motif discovery. Bioinformatics 2004, 20(10):1557–1564.
  34. Elnitski L, Jin VX, Farnham PJ, Jones SJM: Locating mammalian transcription factor binding sites: A survey of computational and experimental techniques. Genome Research 2006, 16(12):1455–1464.
  35. Pavesi G, Mauri G, Pesole G: An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics 2001, 17(suppl_1):S207–214.
  36. Holloway D, Kon M, DeLisi C: In silico regulatory analysis for exploring human disease progression. Biology Direct 2008, 3:24+.
  37. Redhead E, Bailey T: Discriminative motif discovery in DNA and protein sequences using the DEME algorithm. BMC Bioinformatics 2007, 8:385.
  38. Consortium TEP: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007, 447(7146):799–816.
  39. Clarke ND, Granek JA: Rank order metrics for quantifying the association of sequence features with gene regulation. Bioinformatics 2003, 19(2):212–218.
  40. Fan Y, Kon MA, DeLisi C: Ensemble Machine Methods for DNA Binding. In Proceedings of the 2008 Seventh International Conference on Machine Learning and Applications, Washington, DC, USA: IEEE Computer Society 2008:709–716.
  41. Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tagne JB, Volkert TL, Fraenkel E, Gifford DK, Young RA: Transcriptional Regulatory Networks in Saccharomyces cerevisiae. Science 2002, 298(5594):799–804.
  42. Rhead B, Karolchik D, Kuhn RM, Hinrichs AS, Zweig AS, Fujita PA, Diekhans M, Smith KE, Rosenbloom KR, Raney BJ, Pohl A, Pheasant M, Meyer LR, Learned K, Hsu F, Hillman-Jackson J, Harte RA, Giardine B, Dreszer TR, Clawson H, Barber GP, Haussler D, Kent WJ: The UCSC Genome Browser database: update 2010. Nucleic Acids Research 2010, 38(suppl_1):D613–619.
  43. Platt J: Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers, MIT Press 2000:61–74.
  44. Wingender E, Dietze P, Karas H, Knüppel R: TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Research 1996, 24:238–241.
  45. MacIsaac K, Wang T, Gordon DB, Gifford D, Stormo G, Fraenkel E: An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics 2006, 7:113+.
  46. Sandelin A, Alkema W, Engström P, Wasserman WW, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic acids research 2004, 32(Database issue):91–94.
  47. Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using support vector machines. Machine Learning 2002, 46:389–422.
  48. Zhang X, Lu X, Shi Q, Xu XQ, Leung HCE, Harris LN, Iglehart JD, Miron A, Liu JS, Wong WH: Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data. BMC Bioinformatics 2006, 7:197.