EBIC: an artificial intelligence-based parallel biclustering algorithm for pattern discovery

Patryk Orzechowski*, Moshe Sipper, Xiuzhen Huang & Jason H. Moore*

Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Automatics and Biomedical Engineering, AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Krakow, Poland; Department of Computer Science, Ben-Gurion University, Beer Sheva 8410501, Israel; Department of Computer Science, Arkansas State University, Jonesboro, AR 72467, USA

*Correspondence and requests for materials should be addressed to P.O. (email: patryk.orzechowski@gmail.com) and J.H.M. (email: jhmoore@upenn.edu).

Abstract

In this paper a novel biclustering algorithm based on artificial intelligence (AI) is introduced. The method, called EBIC, aims to detect biologically meaningful, order-preserving patterns in complex data. The proposed algorithm is probably the first one capable of discovering multiple complex patterns in real gene expression datasets with accuracy exceeding 50%. It is also one of the very few biclustering methods designed for parallel environments with multiple graphics processing units (GPUs). We demonstrate that EBIC outperforms state-of-the-art biclustering methods, in terms of recovery and relevance, on both synthetic and genetic datasets. EBIC also yields results over 12 times faster than the most accurate reference algorithms. The proposed algorithm is anticipated to be added to the repertoire of unsupervised machine learning algorithms for the analysis of datasets, including those from large-scale genomic studies.

Copyright year: 2017

1 Introduction

Discovering meaningful patterns in complex and noisy data, especially biological data, is a challenge. Traditional clustering approaches such as k-means or hierarchical clustering are expected to group similar objects together and to separate dissimilar objects into distinct groups. These methods assume that all object features contribute to the classification result, which renders clustering a valuable technique for global similarity detection. Clustering does not, however, succeed when only some subset of features is important to a specific cluster.

The inability to capture local patterns is one of the main reasons for the advent of biclustering techniques, where biclusters – subsets of rows and columns – are sought. Both row and column subsets may contain elements that are not necessarily adjacent to each other, which differentiates biclustering from other problems of pattern matching (e.g., image recognition) and makes the task unsuitable for deep learning [9]. Biclustering has its roots in data partitioning into subgroups of approximately constant values (Morgan, 1963) [27] and simultaneously clustering rows and columns of a matrix (Hartigan, 1972) [17]; this was later called biclustering (Mirkin, 1996) [25]. For the last two decades biclustering has been applied to multiple domains, including biomedicine, genomics (especially gene expression analysis), text-mining, marketing, dimensionality reduction, and others [7, 12].

Designing biclustering algorithms involves many challenges. First, although over fifty biclustering algorithms have been proposed (many more when derivatives are considered), no method has proven capable of detecting – with sufficient accuracy – six major types of patterns that are commonly present in gene expression data. Most biclustering algorithms find only one or a few of these patterns [24, 13, 34, 39, 31]: column-constant, row-constant, shift (i.e., additive coherent), scale (i.e., multiplicative coherent), shift and scale (i.e., simultaneous coherent), and order-preserving. Detection of order-preserving patterns is especially important, because it may be considered a generalization of the five other patterns [2, 13, 39]. Many biclustering algorithms are also unable to detect negative correlations or capture exact biclusters instead of providing approximations. Second, biclustering algorithms fail to properly separate partially overlapping biclusters. The performance of these algorithms on overlapping problems usually drops dramatically with increasing levels of overlap [39].

A third drawback of current biclustering methods is their limited success in assessing which biclusters are the most relevant. Multiple measures for assessing the quality of biclusters have been used so far [28, 35]. Some algorithms yield only a single bicluster at a time, rendering their application cumbersome [34]. Other methods output a high number of biclusters (e.g., BiMax [36] and PBBA [30]). This usually produces many overlaps and degrades the overall performance of the algorithm [13].

Providing the proper balance between local and global context within the data is also difficult. The methods that model global relations are typically able to deliver only a limited number of results (e.g., Plaid [22], FABIA [19], and ISA [4]), or tend to exhibit decreased accuracy with each result (e.g., CC [8]). On the other hand, algorithms that focus on local similarities are susceptible to losing global reference (e.g., BiMax, PBBA, or UniBic [39]). For example, UniBic, which sorts pairs of values and column indices of each row in order to identify the longest common subsequences, is able to detect the longest order-preserving pattern between each pair of rows, irrespective of the order of columns, but it fails to capture narrow biclusters containing many rows but only a few columns.

As the biclustering problem is NP-hard, designing an efficient and accurate parallel biclustering algorithm remains a challenge. Most of the reference biclustering algorithms are purely sequential. The reason for this is that the methods either require intensive computations, which limit their application to datasets of smaller size, or are fast but at the cost of lower accuracy.

2 Materials and Methods

In this paper a novel biclustering algorithm based on Artificial Intelligence – EBIC – is introduced that overcomes the above shortcomings. It is likely the first biclustering algorithm capable of detecting all aforementioned types of meaningful patterns with very high accuracy. EBIC is also one of very few parallel biclustering methods. We show that the proposed algorithm outperforms the most established methods in the field with respect to accuracy and relevance on both synthetic and real genomic datasets. An open-source, multi-GPU, parallel implementation of the algorithm is also provided.

The algorithm is designed for environments with at least a single GPU and requires the installation of CUDA. The algorithm was developed in C++11 with OpenMP, with CUDA used for parallelization. EBIC is open-source and available on GitHub at https://github.com/EpistasisLab/ebic.

2.1 Motivation.

The design of the algorithm is motivated by the following observation. Given the input matrix $A$ with $n$ rows and $m$ columns, consider counting the number of rows $r$ with the property that the value in column $i$ is smaller than the value in column $j$, i.e. $a_{r,i} < a_{r,j}$. If the values in the dataset are generated randomly from a uniform distribution, half of the rows on average are expected to have this property, and half are not. Adding another column $l$ to the series, such that the values in this column are larger than the values in column $j$, i.e. $a_{r,j} < a_{r,l}$, should reduce the number of rows by at least half again. Thus, for data without any signal, each addition of a column to the series reduces the number of concordant rows by at least half. On the other hand, if the distribution of the data is not uniform and there exists a monotonic relationship between rows in some subset of conditions, adding a pattern-specific column will not eliminate the rows belonging to this pattern. Thus, the algorithm attempts to intelligently manipulate multiple series of columns and assigns higher scores to those series in which column additions do not result in total reduction of the rows.
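The observation above can be illustrated with a minimal Python sketch. It is not part of the EBIC implementation: the matrix size, the implanted column series, and the helper names `count_concordant_rows` and `make_matrix` are assumptions chosen for illustration. For independent continuous values the expected fraction of rows that increase along k columns is 1/k!, so every added column removes at least half of the remaining rows, whereas rows carrying an implanted trend survive each addition.

```python
import random


def count_concordant_rows(matrix, series):
    """Count rows whose values strictly increase along the column series."""
    return sum(
        all(row[a] < row[b] for a, b in zip(series, series[1:]))
        for row in matrix
    )


def make_matrix(n_rows, n_cols, pattern_rows, pattern_cols, seed=0):
    """Uniform random matrix; the first `pattern_rows` rows get an
    increasing trend implanted along `pattern_cols` (illustrative only)."""
    rng = random.Random(seed)
    matrix = [[rng.random() for _ in range(n_cols)] for _ in range(n_rows)]
    for r in range(pattern_rows):
        values = sorted(rng.random() for _ in pattern_cols)
        for col, value in zip(pattern_cols, values):
            matrix[r][col] = value
    return matrix


pattern_cols = [7, 2, 9, 4]   # hypothetical implanted series of columns
noise_cols = [0, 1, 3, 5]     # columns that carry no implanted signal
matrix = make_matrix(2000, 12, 200, pattern_cols)

for length in range(2, 5):
    with_signal = count_concordant_rows(matrix, pattern_cols[:length])
    without_signal = count_concordant_rows(matrix, noise_cols[:length])
    print(length, with_signal, without_signal)
```

In this sketch the count along the implanted series never falls below the number of implanted rows, while the count along the signal-free series keeps shrinking as columns are added.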

The quality of each bicluster is determined by a function (called fitness), which takes into consideration the number of columns and, exponentially, the number of rows that follow the monotonically increasing trend represented by each series of columns. The design of the fitness function promotes incorporation of new columns to biclusters, provided there is a sufficient number of rows matching the trend (1).

(1)

where is the expected minimal number of rows that should be included within a bicluster.

EBIC uses a different representation compared with other evolutionary-based biclustering methods [11, 26, 1]. Instead of modeling a bicluster as a pair $(I, J)$, where $I$ corresponds to row indices and $J$ to column indices, biclusters in EBIC are represented by a series of column indices. The quality of a given series is calculated based on the number of rows that match the monotonic rules present within the series of columns. The modification of column series is performed using an AI-based technique known as genetic programming (GP) [21, 32]. Series of columns are expanded only when the rule they impose is matched by a sufficient number of rows.

EBIC belongs to the family of hybrid biclustering approaches [29] and features several techniques commonly used in evolutionary algorithms. The development of biclusters is driven by simple genetic operations: 1) four different types of mutations – insertion of a new column to the series (Fig. 1a), deletion of one of the columns from the series (Fig. 1b), swap of two columns within the series (Fig. 1c), and substitution of a column within the series (Fig. 1d); and 2) crossover (Fig. 1e). The individuals that are set to undergo genetic operations are determined using tournament selection. To obtain a diverse set of solutions, a variant of a technique called crowding is used, which limits the probability of selecting those individuals that share columns with those already added to the new generation [37]. More specifically, the fitness of individuals that take part in a tournament is decreased by the homogeneous penalty of , where corresponds to the average penalty of using each of the columns separately. The explanation for this value of the parameter is provided in Supplementary Material. The described penalty favors adding individuals with underrepresented columns to the population, which greatly increases population diversity.

Figure 1: Genetic operations in EBIC: (a) insertion mutation, (b) deletion mutation, (c) swap mutation, (d) substitution mutation, and (e) crossover.
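The five operations in Fig. 1 can be sketched on a series of column indices as follows. This is an illustrative Python sketch, not the authors' C++/CUDA code: the function names are hypothetical, each column is assumed to appear at most once in a series, and the one-point crossover shown here is only one plausible variant, since the text does not specify how the parents are combined.

```python
import random


def insertion(series, n_cols, rng):
    """Insert a column not yet in the series at a random position."""
    unused = [c for c in range(n_cols) if c not in series]
    if not unused:
        return list(series)
    pos = rng.randrange(len(series) + 1)
    return series[:pos] + [rng.choice(unused)] + series[pos:]


def deletion(series, rng):
    """Remove one randomly chosen column from the series."""
    if len(series) <= 1:
        return list(series)
    pos = rng.randrange(len(series))
    return series[:pos] + series[pos + 1:]


def swap(series, rng):
    """Exchange the positions of two columns within the series."""
    if len(series) < 2:
        return list(series)
    i, j = rng.sample(range(len(series)), 2)
    out = list(series)
    out[i], out[j] = out[j], out[i]
    return out


def substitution(series, n_cols, rng):
    """Replace one column of the series with a column not yet used."""
    unused = [c for c in range(n_cols) if c not in series]
    if not unused or not series:
        return list(series)
    out = list(series)
    out[rng.randrange(len(out))] = rng.choice(unused)
    return out


def crossover(parent_a, parent_b, rng):
    """Assumed one-point crossover: a head of parent_a followed by the
    columns of a tail of parent_b that are not already in the head."""
    cut_a = rng.randrange(1, len(parent_a) + 1)
    cut_b = rng.randrange(len(parent_b) + 1)
    head = parent_a[:cut_a]
    return head + [c for c in parent_b[cut_b:] if c not in head]


rng = random.Random(1)
series = [1, 4, 2]
print(insertion(series, 6, rng))
print(deletion(series, rng))
print(swap(series, rng))
print(substitution(series, 6, rng))
print(crossover(series, [4, 2, 3, 5], rng))
```

Every operation returns a new list and leaves its input untouched, which keeps the parents available for further tournaments.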

Individuals whose overall fitness is the highest are stored in the top-rank list, which is updated only if a newly found individual does not substantially overlap with an individual in the list. During the construction of a new population a variant of a tabu list is used, which forbids re-evaluation of previously analyzed biclusters [15, 16]. Elitism is used to clone a group of the best individuals found so far, so that the population is still able to search around local optima [33]. To limit the communication overhead, a Compressed Biclusters Format (CBF) is proposed for storing biclusters (see Fig. 2). The format was motivated by Compressed Row Storage (CRS), a popular representation of sparse matrices.

Figure 2: Compressed Bicluster Format (CBF) uses two arrays. The first array determines the starting positions of each of the biclusters, while the second one holds indexes of columns of biclusters. In this example the population consists of three biclusters (individuals): (1,4,2), (4,2), and (2,3,5,1,4), which start at indices 0, 3, and 5, respectively.
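The example from Fig. 2 can be reproduced with a short Python sketch. The function names `to_cbf` and `from_cbf` are hypothetical, and appending a final end offset to the first array is an assumption borrowed from the usual CRS layout; the caption itself only lists the starting positions 0, 3, and 5.

```python
def to_cbf(population):
    """Pack a list of column series into two flat arrays.

    Returns (starts, columns): starts[i] is the offset of bicluster i in
    `columns`; a final end offset is appended (an assumption, as in CRS).
    """
    starts, columns = [], []
    for series in population:
        starts.append(len(columns))
        columns.extend(series)
    starts.append(len(columns))
    return starts, columns


def from_cbf(starts, columns):
    """Unpack the two flat arrays back into a list of column series."""
    return [columns[starts[i]:starts[i + 1]] for i in range(len(starts) - 1)]


population = [[1, 4, 2], [4, 2], [2, 3, 5, 1, 4]]
starts, columns = to_cbf(population)
print(starts)    # [0, 3, 5, 10]
print(columns)   # [1, 4, 2, 4, 2, 2, 3, 5, 1, 4]
```

Two flat integer arrays can be copied to a device in a single transfer each, which is consistent with the stated goal of limiting communication overhead.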

2.2 EBIC Algorithm.

The basic concept of EBIC – a parallel biclustering algorithm based on Artificial Intelligence (AI) – is presented in Figure 3. The dataset is split into equal chunks of data and distributed across multiple GPUs. A population of different series of columns is generated on the CPU, stored in CBF format, and broadcast to multiple GPUs. Each GPU counts the number of rows which match the given series. The results are summarized on each GPU and sent back to the CPU in order to calculate fitness, which is used later to assess bicluster quality.

Figure 3: Overview of EBIC. After dispatching chunks of the input data to multiple GPUs, biclusters – represented by multiple series of columns and stored in CBF format – are broadcast to GPUs. Each GPU counts how many rows of its chunk match the series imposed by the columns. This is used to determine the fitness of each bicluster and to generate a new set of biclusters.
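The data-parallel scheme in Fig. 3 can be mimicked sequentially in Python. This is only a sketch under stated assumptions: each "GPU" is simulated by a plain function call on one chunk of rows, the population is passed as two flat arrays with a trailing end offset (an assumed CBF layout), and the names `split_rows`, `count_on_chunk`, and `count_all` are hypothetical.

```python
def split_rows(matrix, n_chunks):
    """Split the rows into n_chunks contiguous, nearly equal chunks."""
    size, extra = divmod(len(matrix), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(matrix[start:end])
        start = end
    return chunks


def count_on_chunk(chunk, starts, columns):
    """Work done per device: for every series in the population, count
    the rows of this chunk that strictly increase along the series."""
    counts = []
    for b in range(len(starts) - 1):
        series = columns[starts[b]:starts[b + 1]]
        counts.append(sum(
            all(row[x] < row[y] for x, y in zip(series, series[1:]))
            for row in chunk
        ))
    return counts


def count_all(matrix, starts, columns, n_chunks):
    """Host side: gather the per-chunk counts and sum them per series."""
    partial = [count_on_chunk(chunk, starts, columns)
               for chunk in split_rows(matrix, n_chunks)]
    return [sum(per_series) for per_series in zip(*partial)]


matrix = [[1, 2, 3], [3, 2, 1], [1, 3, 2], [2, 3, 4]]
starts, columns = [0, 3, 5], [0, 1, 2, 2, 1]   # series (0,1,2) and (2,1)
print(count_all(matrix, starts, columns, n_chunks=2))   # [2, 2]
```

Because rows are counted independently, the totals do not depend on how the rows are split, so only one small vector of counts per device has to travel back to the host.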
Step 1: Initialization.

Set up GPUs, divide the dataset proportionally by rows depending on the number of GPUs, and distribute the data across multiple GPUs. Generate initial population, calculate fitness on GPUs. Initialize top-rank list by sequentially adding unique (non-overlapping) series of columns with the highest fitness according to (1).

Step 2: Elitism.

Reproduce 1/4 of the best biclusters from the top-rank list, add them to the new population. Update penalties for using each column (each column addition to the population increases the penalty for using this column).

Step 3: Prepare population of biclusters.

Until the population reaches its required size, try to generate unique solutions (i.e., solutions that have not been previously analyzed). Select each new individual using tournament selection: select a solution randomly from the previous population and adjust its quality by applying the penalty for similarity with the previously accepted solutions. The penalty is calculated by averaging penalties incurred by selecting each column separately over the number of columns within the series. The final penalty is calculated using the value of . After selecting individuals, perform genetic operations (crossover and mutation). If the solution is novel (i.e., does not belong to the tabu list), add it to the population and the tabu list, and update penalties for using the solution’s columns. Store the population in Compressed Biclusters Format (CBF). If the solution was previously analyzed, increase the number of tabu-list hits. If this number is greater than the size of the population, finish calculations and go to Step 6 in order to report the previously found best patterns.

Step 4: Calculate quality of biclusters in parallel.

Dispatch the new population (i.e., sets of column series) to each of the GPUs. Determine how many rows match each of the series of columns. Collect the results from multiple GPUs and determine the fitness of each bicluster according to (1).

Step 5: Update top-rank list.

Sort the population according to fitness. Try to add new individuals to the top-rank list by checking that they do not substantially overlap with records with higher fitness. If a bicluster is added, remove from the top-rank list all records that have lower fitness and substantially overlap with the bicluster. After all individuals in the population are checked, remove from the top-rank list the records that have the lowest fitness, until the required size of the top-rank list is reached. If the maximal number of iterations has not been reached, go back to Step 2.

Step 6: Prepare biclusters.

Determine in parallel on each GPU the indices of rows that match each of the series of columns in the top-rank list.

Step 7: Expansion of biclusters.

Expand the biclusters that have approximate and negative trends. Output the required number of biclusters (or all biclusters from the top-rank list).
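The top-rank list update in Step 5 can be sketched in Python as follows. The overlap measure used here, the number of shared columns divided by the size of the smaller series, is an assumption: the text only states that the list uses the intersection of columns as its merging criterion. The names `overlap` and `update_top_rank` are hypothetical, and series are assumed to be non-empty.

```python
def overlap(cols_a, cols_b):
    """Assumed overlap ratio: shared columns over the smaller series."""
    a, b = set(cols_a), set(cols_b)
    return len(a & b) / min(len(a), len(b))


def update_top_rank(top_rank, population, max_overlap, max_size):
    """Merge a scored population into the top-rank list.

    Both arguments are lists of (fitness, series) pairs. A candidate is
    skipped if it overlaps too much with a record that is at least as
    fit; otherwise it replaces the less fit records it overlaps with.
    """
    ranked = list(top_rank)
    for fitness, series in sorted(population, key=lambda p: p[0], reverse=True):
        if any(f >= fitness and overlap(s, series) > max_overlap
               for f, s in ranked):
            continue
        ranked = [(f, s) for f, s in ranked
                  if not (f < fitness and overlap(s, series) > max_overlap)]
        ranked.append((fitness, series))
    ranked.sort(key=lambda p: p[0], reverse=True)
    return ranked[:max_size]   # drop the least fit records beyond max_size


top_rank = [(10.0, [1, 2, 3]), (5.0, [4, 5, 6])]
population = [(12.0, [1, 2, 7]), (3.0, [4, 5, 8]), (8.0, [9, 10, 11])]
print(update_top_rank(top_rank, population, max_overlap=0.5, max_size=3))
```

In this example the series (1, 2, 7) displaces the less fit (1, 2, 3), the series (9, 10, 11) is added because it shares no columns with any record, and (4, 5, 8) is rejected because it overlaps with the fitter (4, 5, 6).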

2.3 Pattern discovery on synthetic datasets.

The performance of EBIC is evaluated on the benchmark of synthetic datasets from [39] and compared with top biclustering methods: UniBic [39], OPSM [2], QUBIC [23], ISA [4], FABIA [19], CPB [6], and BicSPAM [18], as well as a newly published GPU-accelerated biclustering algorithm called Condition-dependent Correlation Subgroup (CCS) [5]. The latter has not yet been benchmarked on the established collection of datasets, either synthetic or genomic.

The test suite that was used to benchmark the algorithms contains three very popular biclustering problems: pattern discovery, biclusters overlap, and narrow biclusters detection. Recovery and relevance scores were determined using the Jaccard index [20] from the BiBench package [13], specifically (2) and (3) :

(2)
(3)
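A commonly used form of these two scores, assumed here to correspond to (2) and (3), averages the best Jaccard match of each bicluster in one set against the other set. Recovery takes the expected biclusters as the first set and relevance takes the found ones. The Python sketch below illustrates that definition; it is not the BiBench API, and the function names are hypothetical.

```python
def jaccard(b1, b2):
    """Jaccard index of two biclusters, each a (rows, cols) pair,
    computed over the matrix cells they cover."""
    cells1 = {(r, c) for r in b1[0] for c in b1[1]}
    cells2 = {(r, c) for r in b2[0] for c in b2[1]}
    union = cells1 | cells2
    return len(cells1 & cells2) / len(union) if union else 0.0


def match_score(set_a, set_b):
    """Average, over set_a, of the best Jaccard match found in set_b."""
    if not set_a or not set_b:
        return 0.0
    return sum(max(jaccard(a, b) for b in set_b) for a in set_a) / len(set_a)


def recovery(expected, found):
    """How well the expected (implanted) biclusters were recovered."""
    return match_score(expected, found)


def relevance(expected, found):
    """How well the found biclusters correspond to expected ones."""
    return match_score(found, expected)


expected = [((0, 1), (0, 1))]
found = [((0, 1), (0, 1)), ((5,), (5,))]
print(recovery(expected, found))    # 1.0
print(relevance(expected, found))   # 0.5
```

In the example the implanted bicluster is found exactly, so recovery is 1.0, while the spurious second result halves the relevance.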

The first set of problems verifies the ability of the algorithm to identify six different data patterns: trend-preserving, column-constant, row-constant, shift, scale, and shift-scale. The tests assess how accurately a biclustering algorithm detects three biclusters of size 15x15 implanted within a matrix of size 150x100, four biclusters of size 20x20 implanted within a matrix of size 200x150, and five biclusters of size 25x25 implanted within a matrix of size 300x200. Each problem consists of 5 different datasets for each of 6 patterns, which constitute 90 unit tests in total. The tests on overlapping patterns measure the ability of the algorithms to detect 5 biclusters of size 20x20 implanted within the matrix of size 200x150 that overlap with each other by 0x0, 3x3, 6x6, and 9x9 elements – 20 tests in total [39]. Narrow biclusters are biclusters with 100 rows and 10–30 columns implanted within a large matrix of size 1000x100 – 9 tests in total. The tests determine whether biclustering methods are capable of discovering patterns that feature multiple rows but only a small number of columns [39]. To show that the results do not depend on the starting seed, our method was run 10 times on all problems, each time with a different initial seed.

2.4 Enrichment analysis on genomic datasets.

The effectiveness of pattern discovery with EBIC was further evaluated on real-world gene expression datasets. For this purpose, BiBench software and a benchmark of genetic datasets from Eren et al. [13] were used. Details of the gene datasets used for the study are presented in Table 1. The same procedures of data acquisition, preprocessing, and analysis were followed. Thus, datasets were downloaded using GEOquery [10] and preprocessed using PCA imputation [38]. After completing biclustering, a gene enrichment analysis of each bicluster was performed using the R package GOstats [14]. Biclusters were considered significantly enriched if any of the p-values associated with a given GO term were lower than 0.05 after Benjamini-Hochberg correction [3]. Assessment of the results was based on the proportion of enriched biclusters to all biclusters reported. Each algorithm was allowed to return no more than 400 biclusters per dataset. The number of biclusters found and the proportion of significantly enriched results were compared to the study by [39] and are presented in Table 2. EBIC was tested with two overlap ratios, 0.5 and 0.75.

Dataset Genes Samples Description
GDS181 12626 84 Large-scale analysis of the human Transcriptome
GDS589 8799 122 Multiple normal tissue gene expression across strains
GDS1406 12488 87 Brain regions of various inbred strains
GDS1451 8799 94 Toxicants effect on liver: pooled and individual sample comparison
GDS1490 12488 150 Neural tissue profiling
GDS2520 12625 44 Head and neck squamous cell carcinoma
GDS3715 12626 110 Insulin effect on skeletal muscle
GDS3716 22283 42 Breast cancer: histologically normal breast epithelium
Table 1: Description of GDS datasets.

3 Results

The performance of EBIC was tested on both synthetic and real gene expression datasets. The synthetic benchmark was the publicly available data from Wang et al. [39], which can be downloaded from https://sourceforge.net/projects/unibic/files/data_result.zip. For biological validation, a well-established benchmark of eight genetic datasets from Eren et al. [13] was used.

3.1 Pattern discovery on synthetic datasets.

For synthetic datasets EBIC was set to stop either after 20,000 iterations or when the number of tabu-list hits exceeded the size of the population. All parameters were set to their defaults. Columns of biclusters were allowed to overlap by no more than 0.5, and the block size for the CUDA kernel was set to 64. This took a reasonable amount of computation time (1–25 minutes on an Intel Core i7-6950X CPU with a GeForce GTX 1070 GPU). A comparison of the accuracy of EBIC with selected biclustering methods in terms of recovery and relevance is presented in Fig. 4. CCS did not return any result for trend-preserving, row-constant, and column-constant patterns, so the method was excluded from the comparison. The CCS algorithm presented partial solutions for shift, scale, and shift-scale patterns only.

Figure 4: Comparison of the performance of biclustering algorithms on different types of patterns. Scores of the algorithms other than EBIC and CCS are quoted from [39].

The average recovery and relevance scores of EBIC are better than those reported by any of the previous methods. This difference is especially visible in order-preserving and shift-scale problems, which are considered to be the most biologically meaningful [39]. EBIC detected all trend-preserving patterns perfectly, while other methods reached 70% on average. The average relevance and recovery rates for shift-scale patterns were also much higher. As for scale and shift-scale patterns, EBIC attained high recovery/relevance scores across all tests (95.2%/85.5% for scale and 94.2%/84.5% for shift-scale patterns), although scores for the worst-case scenarios were comparable to other methods (75.1%/37.8% and 72.0%/46.7%, respectively). EBIC may be the first biclustering algorithm capable of detecting all aforementioned patterns with over 90% average recovery and relevance [34]. The results depended little on the initial seed and much more on the specific dataset. The recovery/relevance scores from algorithm runs with different initial seeds did not differ in a statistically significant way.

EBIC was also tested on the datasets provided by Bhattacharya et al. [5] and detected biclusters with recovery and relevance scores over 95%, whereas CCS reported those scores to vary from approximately 20% to nearly 90%.

Overlapping biclusters.

The second set of tests compares the deterioration of the accuracy of biclustering algorithms in detecting trend-preserving biclusters that overlap with each other. This set of problems contains tests of 3 biclusters of size 20x20 that overlap with each other by 0x0, 3x3, 6x6, and 9x9 within a matrix of size 200x150. Each problem is represented by 5 dataset variants, resulting in up to 20 tests in total [39]. The effect of the overlap on the recovery and relevance of different algorithms is presented in Fig. 5.

All biclustering methods tend to deteriorate if implanted biclusters start to significantly overlap with each other [39]. The performance of EBIC also decreased when a higher level of overlap was considered, but the decrease was small. The algorithm was still able to maintain recovery and relevance scores close to 90% on average.

Figure 5: Comparison of the performance of biclustering algorithms in scenarios with different levels of biclusters’ overlap. Scores of the algorithms other than EBIC and CCS are quoted from [39].
Narrow biclusters.

The last phase of our benchmark considers the detection of narrow biclusters comprising 100 rows and 10/20/30 columns, which were implanted within the matrix of size 1000×100. Each scenario contains 3 variants, resulting in up to 9 tests in total. The results are presented in Fig. 6.

Figure 6: Comparison of biclustering-algorithm performance in scenarios with narrow biclusters. The reference results are quoted from Wang et al.[39].

UniBic was reported to have low accuracy in detecting narrow biclusters within the dataset. CCS did not return any bicluster for any dataset in this test. In contrast to all other algorithms, EBIC performed much better in this task and discovered almost all implanted structures perfectly. For the narrowest biclusters, our algorithm was approximately twice as good as the second-best method, which is dedicated to finding narrow biclusters (BicSPAM), and nearly three times as good as the best overall algorithm (UniBic).

Our conclusion is that EBIC is not only capable of detecting different types of patterns, but also different sizes of patterns (i.e., wide or narrow patterns) with very high accuracy.

3.2 Enrichment analysis on genomic datasets.

For the genetic datasets, it was observed that the proportion of enriched biclusters obtained after approximately 5000 iterations depended strongly on the dataset (see Supplementary Material). Further iterations either improved or worsened the proportion. EBIC was run for 5000 iterations, and columns were allowed to overlap by 50% or 75%. The results of enrichment analyses are presented in Table 2.

Some memory management issues were encountered with CCS (both the sequential and parallel versions). The algorithm was unable to detect biclusters in some of the genomic datasets and terminated prematurely with an error. After fixing a bug, the algorithm, even in parallel mode, proved to be extremely slow. Although the dataset was of reasonable size, it took over 8 days of computation (on CPU+GPUs) to yield results for the most challenging genomic dataset (GDS1451). In contrast, EBIC needed less than 3 minutes to yield a higher number of significantly enriched biclusters for this dataset.

EBIC generated the highest percentage of enriched biclusters. With the more restrictive overlap ratio (0.5), EBIC generated a higher percentage of significantly enriched biclusters (52.4%) than any other method. The second best was CCS (43.8%), which, on the other hand, generated many more significantly enriched biclusters. With the less restrictive overlap ratio (0.75), EBIC outperformed all the methods included in our study in terms of both the number and the percentage of significantly enriched biclusters. It generated 20 more significantly enriched biclusters than the second-best method (323 vs 303 by CCS). More importantly, its percentage of significantly enriched biclusters was nearly 11 percentage points higher (54.8% vs 43.8%). This result is notable, considering that the difference between the second- and third-best methods was only 2.7 percentage points. In addition, the biclusters returned did not overlap substantially: the overlap ranged from less than 4% up to 31%, depending on the dataset.

Algorithm Found Enriched
EBIC, 0.75 589 323 (54.8%)
EBIC, 0.5 145 76 (52.4%)
CCS 691 303 (43.8%)
UniBic 151 62 (41.1%)
OPSM 163 48 (29.5%)
QUBIC 91 34 (37.4%)
ISA 217 71 (32.7%)
FABIA 80 22 (27.5%)
CPB 96 34 (35.4%)
Table 2: Significantly enriched biclusters found across all GDS datasets. Two overlap thresholds of EBIC are considered: 0.5 and 0.75. The scores of the algorithms other than EBIC and CCS are quoted from [39].

3.3 Complexity analysis.

The total time complexity of the sequential version of EBIC is on the order of , where m is the number of columns, n the number of rows, k the number of iterations, and b the number of biclusters (size of the population). Taking into consideration the fact that EBIC exploits data parallelism, its complexity is reduced to , where p is the number of processors.

The algorithm in each iteration sorts individuals based on their fitness, which takes up to . Usually, . The most time-consuming part of the algorithm is determining the fitness of the set of biclusters. This requires checking at every iteration, for each of the rows, the values of columns contained in CBF, which takes .

The theoretical worst-case scenario of the algorithm is related to performing an exhaustive search on all combinations of columns. There are possible combinations. Thus, the worst-case time complexity of EBIC, when performing an exhaustive search on all combinations of columns, is on the order of . Nonetheless, the actual complexity of the algorithm is much lower and driven by evolution, which allows checking only the most promising trends.

3.4 Scalability of the algorithm.

In order to assess the scalability of the methods, five datasets with 100 columns and different numbers of rows ranging between 5000 and 25000 were generated. Each algorithm was run five times on each of the datasets with default values of its parameters. The algorithms were allowed to yield up to 100 biclusters. All tests were performed on a machine with an Intel Core i7-6950X CPU and 64GB of RAM. A comparison of run times on a logarithmic scale is presented in Fig. 7. Due to its parallel implementation, EBIC maintained similar running times for all test cases. We noticed that GPU initialization and data transfer between CPU and GPU took a considerable amount of EBIC’s run time. Starting with 10,000 rows, EBIC began to run faster than both CCS and UniBic, the most precise methods so far. For problems with 25,000 rows, EBIC ran over 12 times faster than UniBic and over 20 times faster than CCS. With increasing data size, the running times of EBIC became comparable with those of OPSM and ISA. The actual performance of EBIC for larger datasets on multiple GPUs requires further investigation.

Figure 7: Comparison of running time of the algorithms on datasets with 100 columns and varying numbers of rows.

4 Discussion

EBIC is probably the first known algorithm capable of detecting multiple biologically important patterns in genetic data with accuracy exceeding 50%. It is also one of the very few parallel biclustering methods in the field designed for multi-GPU environments. In comparison with most state-of-the-art algorithms EBIC exhibited a number of advantages: (1) EBIC outperformed the state-of-the-art biclustering algorithms on established synthetic datasets. EBIC was the only algorithm to discover each of six types of major genetic patterns in synthetic datasets with over 95% average accuracy and the only one to maintain over 90% accuracy on narrow and overlapping biclusters. (2) The percentage of significantly enriched biclusters found by EBIC on a benchmark of 8 genomic datasets was nearly 11 percentage points higher than that of the second-best method (CCS). (3) EBIC yielded far more relevant (i.e. enriched) biclusters than any of the methods. (4) EBIC proved to be over 12 times faster than any of the most accurate methods (CCS or UniBic) on the largest datasets.

We would like to formulate the requirements for the next-generation of biclustering methods. Such algorithms are expected to meet the following criteria: (1) be powerful enough to discover the six major types of biclusters discussed above with high accuracy (over 75% on average); (2) be capable of handling overlapping, narrow, and approximate patterns with similar accuracy; (3) provide meaningful solutions for both synthetic and real datasets; (4) be scalable. In contrast to other methods described in this paper, EBIC with its average accuracies exceeding 90% certainly meets these requirements and could be called a next-generation biclustering method.

EBIC certainly has some limitations. First, the closer the overlap threshold is to 0, the less EBIC is able to capture different series that are present within the same columns. Instead, the series of columns that is supported by the largest number of rows will absorb all other permutations. The reason for this is the construction of the top-rank list. For performance purposes, the list uses the intersection of columns as the merging criterion, which makes the actual order of columns within the series irrelevant. A full overlap of biclusters within the top-rank list is possible, but discouraged.

Secondly, the application of EBIC to datasets that have fewer than 20 columns is discouraged. In this case an exhaustive search guarantees discovery of all meaningful patterns in a much shorter time.

The guidelines for the exact number of iterations to run EBIC, as well as the optimal level of overlap of biclusters in the top-rank list, need to be empirically defined. The number and percentage of significantly enriched biclusters depend on the overlap ratio. A more restrictive overlap threshold (0.5) allows the algorithm to detect fewer biclusters with fewer overlapping columns. Setting a less restrictive overlap threshold (0.75) returns more biclusters, which overlap more with each other. Unfortunately, the final level of overlap between biclusters cannot be controlled in EBIC. On the analyzed datasets, biclusters did not overlap with each other by more than 30% on average.

EBIC scores do not seem to improve with every iteration. The accuracy of pattern detection generally improves over time for synthetic datasets, but this did not hold for real genomic datasets. The highest proportion of significantly enriched biclusters oscillated or even slightly deteriorated for real-world genetic datasets after 100 iterations. For all genomic datasets, EBIC was stopped after 5,000 iterations, as it seemed to be a reasonable compromise between the percentage of enriched results and run time. Additional study on the influence of the size of the input matrix on the number of required iterations is needed.

Our initial tests using larger volumes of data indicate that the algorithm supports datasets of up to 60k rows per GPU. Achieving full scalability of EBIC and preparing the algorithm for big data challenges requires more work.

5 Conclusion

EBIC is anticipated to become a reference method for future studies in biclustering. EBIC may also prove beneficial in domains beyond genomics. The method may impact pattern detection in multiple other fields, such as medicine, applied informatics (e.g., web clickstream mining), economics (e.g., stock-market rule discovery, financial market analysis), biology (e.g., metabolite concentration comparison, protein complex interaction detection), and chemistry (e.g., drug-activity analysis). Extensive AI method development is necessary to fully realize the potential of AI for solving the most challenging big data problems.

6 Authors’ Contributions

P.O. conceived the study and designed and implemented the algorithm. P.O. and J.H.M. performed the analysis. M.S. and X.H. consulted on the project, analyzed the results, and participated in writing the manuscript. J.H.M. oversaw the project.

7 Acknowledgements

This research was supported in part by PL-Grid Infrastructure and by National Institutes of Health grants LM012601, TR001263, and ES013508.

7.0.1 Conflict of interest statement.

None declared.

References

  • [1] Wassim Ayadi, Ons Maâtouk, and Hend Bouziri. Evolutionary biclustering algorithm of gene expression data. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 206–210. IEEE, 2012.
  • [2] A. Ben-Dor, B. Chor, R. Karp, and Z. Yakhini. Discovering local structure in gene expression data: the order-preserving submatrix problem. J. Comput. Biol., 10(3-4):373–384, 2003.
  • [3] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the royal statistical society. Series B (Methodological), pages 289–300, 1995.
  • [4] Sven Bergmann, Jan Ihmels, and Naama Barkai. Iterative signature algorithm for the analysis of large-scale gene expression data. Physical review E, 67(3):031902, 2003.
  • [5] Anindya Bhattacharya and Yan Cui. A gpu-accelerated algorithm for biclustering analysis and detection of condition-dependent coexpression network modules. Scientific Reports, 7(1):4162, 2017.
  • [6] Doruk Bozdağ, Jeffrey D Parvin, and Umit V Catalyurek. A biclustering method to discover co-regulated genes using diverse gene expression datasets. In Bioinformatics and Computational Biology, pages 151–163. Springer, 2009.
  • [7] Stanislav Busygin, Oleg Prokopyev, and Panos M Pardalos. Biclustering in data mining. Computers & Operations Research, 35(9):2964–2987, 2008.
  • [8] Y. Cheng and G. M. Church. Biclustering of expression data. In Proceedings of the eighth international conference on intelligent systems for molecular biology, volume 8, pages 93–103, 2000.
  • [9] Travers Ching, Daniel S Himmelstein, Brett K Beaulieu-Jones, Alexandr A Kalinin, Brian T Do, Gregory P Way, Enrico Ferrero, Paul-Michael Agapow, Wei Xie, Gail L Rosen, et al. Opportunities and obstacles for deep learning in biology and medicine. bioRxiv, page 142760, 2017.
  • [10] Sean Davis and Paul S Meltzer. Geoquery: a bridge between the gene expression omnibus (geo) and bioconductor. Bioinformatics, 23(14):1846–1847, 2007.
  • [11] Federico Divina and Jesus S Aguilar-Ruiz. Biclustering of expression data with evolutionary computation. IEEE transactions on knowledge and data engineering, 18(5):590–602, 2006.
  • [12] Sara Dolnicar, Sebastian Kaiser, Katie Lazarevski, and Friedrich Leisch. Biclustering: Overcoming data dimensionality problems in market segmentation. Journal of Travel Research, 51(1):41–49, 2012.
  • [13] Kemal Eren, Mehmet Deveci, Onur Küçüktunç, and Ümit V Çatalyürek. A comparative analysis of biclustering algorithms for gene expression data. Briefings in bioinformatics, 14(3):279–292, 2013.
  • [14] Seth Falcon and Robert Gentleman. Using gostats to test gene lists for go term association. Bioinformatics, 23(2):257–258, 2007.
  • [15] Fred Glover. Tabu search—part i. ORSA Journal on computing, 1(3):190–206, 1989.
  • [16] Fred Glover. Tabu search—part ii. ORSA Journal on computing, 2(1):4–32, 1990.
  • [17] J. A. Hartigan. Direct clustering of a data matrix. Journal of the american statistical association, 67(337):123–129, 1972.
  • [18] Rui Henriques and Sara C Madeira. Bicspam: flexible biclustering using sequential patterns. BMC bioinformatics, 15(1):130, 2014.
  • [19] S. Hochreiter, U. Bodenhofer, M. Heusel, A. Mayr, A. Mitterecker, A. Kasim, T. Khamiakova, S. Van Sanden, D. Lin, W. Talloen, et al. FABIA: factor analysis for bicluster acquisition. Bioinformatics, 26(12):1520–1527, 2010.
  • [20] Paul Jaccard. Étude comparative de la distribution florale dans une portion des alpes et des jura. Bull Soc Vaudoise Sci Nat, 37:547–579, 1901.
  • [21] John R Koza. Genetic programming: on the programming of computers by means of natural selection, volume 1. MIT press, 1992.
  • [22] Laura Lazzeroni and Art Owen. Plaid models for gene expression data. Statistica sinica, pages 61–86, 2002.
  • [23] G. Li, Q. Ma, H. Tang, A. H. Paterson, and Y. Xu. QUBIC: a qualitative biclustering algorithm for analyses of gene expression data. Nucleic acids research, 37(15):e101–e101, 2009.
  • [24] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: a survey. Computational Biology and Bioinformatics, IEEE/ACM Transactions on, 1(1):24–45, 2004.
  • [25] Boris Mirkin. Mathematical classification and clustering, 1996.
  • [26] Sushmita Mitra and Haider Banka. Multi-objective evolutionary biclustering of gene expression data. Pattern Recognition, 39(12):2464–2477, 2006.
  • [27] James N Morgan and John A Sonquist. Problems in the analysis of survey data, and a proposal. Journal of the American statistical association, 58(302):415–434, 1963.
  • [28] Patryk Orzechowski. Proximity measures and results validation in biclustering–a survey. In International Conference on Artificial Intelligence and Soft Computing, pages 206–217. Springer, 2013.
  • [29] Patryk Orzechowski and Krzysztof Boryczko. Hybrid biclustering algorithms for data mining. In European Conference on the Applications of Evolutionary Computation, pages 156–168. Springer, 2016.
  • [30] Patryk Orzechowski and Krzysztof Boryczko. Propagation-based biclustering algorithm for extracting inclusion-maximal motifs. Computing & Informatics, 35(2), 2016.
  • [31] Victor A Padilha and Ricardo JGB Campello. A systematic comparative evaluation of biclustering techniques. BMC bioinformatics, 18(1):55, 2017.
  • [32] Riccardo Poli, William B Langdon, Nicholas F McPhee, and John R Koza. A field guide to genetic programming. Lulu.com, 2008.
  • [33] Riccardo Poli, Nicholas Freitag McPhee, and Leonardo Vanneschi. Elitism reduces bloat in genetic programming. In Proceedings of the 10th annual conference on Genetic and evolutionary computation, pages 1343–1344. ACM, 2008.
  • [34] Beatriz Pontes, Raúl Giráldez, and Jesús S Aguilar-Ruiz. Biclustering on expression data: A review. Journal of biomedical informatics, 57:163–180, 2015.
  • [35] Beatriz Pontes, Raúl Giráldez, and Jesús S Aguilar-Ruiz. Quality measures for gene expression biclusters. PloS one, 10(3):e0115497, 2015.
  • [36] A. Prelić, S. Bleuler, P. Zimmermann, A. Wille, P. Bühlmann, W. Gruissem, L. Hennig, L. Thiele, and E. Zitzler. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22(9):1122–1129, 2006.
  • [37] Bruno Sareni and Laurent Krahenbuhl. Fitness sharing and niching methods revisited. IEEE transactions on Evolutionary Computation, 2(3):97–106, 1998.
  • [38] Wolfram Stacklies, Henning Redestig, Matthias Scholz, Dirk Walther, and Joachim Selbig. pcamethods—a bioconductor package providing pca methods for incomplete data. Bioinformatics, 23(9):1164–1167, 2007.
  • [39] Zhenjia Wang, Guojun Li, Robert W Robinson, and Xiuzhen Huang. Unibic: Sequential row-based biclustering algorithm for analysis of gene expression data. Scientific reports, 6, 2016.