Object-Attribute Biclustering for Elimination of Missing Genotypes in Ischemic Stroke Genome-Wide Data
Missing genotypes can affect the efficacy of machine learning approaches to identify the risk genetic variants of common diseases and traits. The problem occurs when genotypic data are collected from different experiments with different DNA microarrays, each being characterised by its own pattern of uncalled (missing) genotypes. This can prevent a machine learning classifier from assigning the classes correctly. To tackle this issue, we used the well-developed notions of object-attribute biclusters and formal concepts that correspond to dense subrelations in a binary relation. The paper contains experimental results on applying a biclustering algorithm to a large real-world dataset collected for studying the genetic bases of ischemic stroke. The algorithm could identify large dense biclusters in the genotypic matrix for further processing, which in turn significantly improved the quality of machine learning classifiers. The proposed algorithm was also able to generate biclusters for the whole dataset without size constraints, in contrast to the In-Close4 algorithm for the generation of formal concepts.
Keywords:Formal Concept Analysis, Biclustering, Single Nucleotide Polymorphism, Missing Genotypes, Data Mining, Ischemic Stroke
1 Introduction
The recent progress in studying different aspects of human health and diversity (e.g., genetics of common diseases and traits, human population structure, and relationships) is associated with the development of high-throughput genotyping technologies, particularly with massive parallel genotyping of Single Nucleotide Polymorphisms (SNPs) by DNA-microarrays [pmid23288464]. These technologies allow the determination of hundreds of thousands and millions of SNPs in one experiment and were the basis for conducting genome-wide association studies (GWAS). Although thousands of genetic loci have been revealed in GWAS, there are practical problems with replicating the associations identified in different studies. They seem to be due to both limitations in the methodology of the GWAS approach itself and differences between various studies in data design and analysis [pmid29876890]. Machine learning (ML) approaches have been found to be quite promising in this field [pmid32351543].
Genotyping by microarrays is efficient and cost-effective, but missing data appear. GWAS is based on a comparison of frequencies of genetic variants among patients and healthy people. It assumes that all genotypes are provided (usually, their percentage is defined by a genotype calling threshold). In this article, we demonstrate that missing data can affect not only statistical analysis but also ML algorithms. The classifiers can fail because missing values (uncalled genotypes) are distributed non-randomly. We assume that each type of DNA-microarray can possess a specific pattern of missing values marking both the dataset of patients and that of healthy people. Therefore, the missing data need to be carefully estimated and processed without dropping too many SNPs that may contain crucial genetic information.
To overcome the problem of missing data, we aimed to apply a technique capable of discovering groupings in a dataset by looking at the similarity across all individuals and their genotypes. The raw datasets can be converted into an integer matrix, where individuals are in rows, SNPs are in columns, and cells contain genotypes. For each SNP, a person can have either the AA, AB, or BB genotype, where A and B are the alleles. Thus, the genotypes can be coded as 0, 1, and 2, representing the counts of allele B.
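As a toy illustration of this coding scheme (function and variable names are ours, not from the paper), the conversion can be sketched in Python:

```python
# Sketch of the genotype coding described above: AA, AB, BB become
# 0, 1, 2, i.e. the counts of allele B in the genotype.

def encode_genotype(genotype: str) -> int:
    """Return the count of allele B in a biallelic genotype string."""
    return genotype.count("B")

# A toy matrix: individuals in rows, SNPs in columns.
raw = [
    ["AA", "AB"],
    ["BB", "AA"],
]
encoded = [[encode_genotype(g) for g in row] for row in raw]
```
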
The proposed method can simultaneously cluster rows and columns in a data matrix to find homogeneous submatrices [tanay2002discovering], which can overlap. Each of these submatrices is called a bicluster [Mirkin:1996], and the process of finding them is called biclustering [tanay2002discovering, madeira2004biclustering, cheng2000biclustering, tanay2005biclustering, busygin2008biclustering].
Biclustering in genotype data allows identifying sets of individuals sharing SNPs with missing genotypes. A bicluster arises when there is a strong relationship between a specific set of objects and a specific set of attributes in a data table. A particular kind of bicluster is a formal concept in Formal Concept Analysis (FCA) [Ganter:1999:FCA]. A formal concept is a pair of the form (extent, intent), where the extent consists of all objects that share the attributes in the intent, and dually the intent consists of all attributes shared by the objects in the extent. Formal concepts have the desirable property of being homogeneous and closed in the algebraic sense, which has resulted in their extensive use in Gene Expression Analysis (GEA) [Besson:2005, Blachon:2007, Kaytoue:2011, Andrews:2013].
A concept-based bicluster (or object-attribute bicluster) [Ignatov:2010] is a scalable approximation of a formal concept with the following advantages:
Reduced number of patterns to analyze;
Reduced computational cost (polynomial vs. exponential);
Manual (interactive) tuning of bicluster density threshold;
Tolerance to missing (object, attribute) pairs.
In this paper, we propose an extended version of the biclustering algorithm of [ignatov2012concept] that can identify large biclusters with missing genotypes for categorical data (many-valued contexts with a selected value). This algorithm generates a smaller number of dense object-attribute biclusters than existing exact algorithms for formal concepts like the concept miner In-Close4 [Andrews:2011], and is, therefore, better suited for large datasets. Moreover, during experimentation with the ischemic stroke dataset, we found that the number of large dense biclusters identified by our algorithm is significantly lower than the number of formal concepts extracted by In-Close4.
The paper is organized as follows. In Section 2, we recall basic notions from Formal Concept Analysis and Biclustering. In Section 3, we introduce a method of FCA-based biclustering and its variants along with bicluster post-processing schemes, and discuss the complexity of the proposed algorithm. In Section 4, we describe a dataset that consists of a sample of patients and their SNPs collected from various (independent) groups of patients. Then we present the results obtained during experiments on this dataset in Section 5 and mention the hardware and software configuration used. Section 6 concludes the paper.
2 Basic notions
2.1 Formal Concept Analysis
A formal context in FCA [Ganter:1999:FCA] is a triple K = (G, M, I) consisting of two sets, G and M, and a binary relation I ⊆ G × M between G and M. The triple can be represented by a cross-table consisting of rows g ∈ G, called objects, and columns m ∈ M, called attributes, with crosses representing the incidence relation I. Here, gIm or (g, m) ∈ I means that the object g has the attribute m.
For A ⊆ G and B ⊆ M, let
A′ = {m ∈ M | gIm for all g ∈ A} and B′ = {g ∈ G | gIm for all m ∈ B}.
These two operators (·)′ are the derivation operators for K = (G, M, I).
Let (G, M, I) be a formal context; for subsets A, A1, A2 ⊆ G we have: (1) if A1 ⊆ A2, then A2′ ⊆ A1′; (2) A ⊆ A″; (3) A′ = A‴.
Similar properties hold for subsets of attributes.
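The derivation operators admit a direct set-based sketch in Python; the toy context below and all names are illustrative only:

```python
def prime_objects(A, I, M):
    """A' : all attributes shared by every object in A."""
    return {m for m in M if all((g, m) in I for g in A)}

def prime_attributes(B, I, G):
    """B' : all objects that have every attribute in B."""
    return {g for g in G if all((g, m) in I for m in B)}

# Toy context: objects G, attributes M, incidence relation I.
G = {"g1", "g2"}
M = {"m1", "m2", "m3"}
I = {("g1", "m1"), ("g1", "m2"), ("g2", "m2"), ("g2", "m3")}
```

For instance, the only attribute shared by both objects of the toy context is m2, so `prime_objects(G, I, M)` yields `{"m2"}`.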
A closure operator on a set S is a mapping φ: 2^S → 2^S with the following properties. Let X, Y ⊆ S; then: (1) φ(φ(X)) = φ(X) (idempotency); (2) X ⊆ φ(X) (extensity); (3) if X ⊆ Y, then φ(X) ⊆ φ(Y) (monotonicity).
For a closure operator φ, the set φ(X) is called the closure of X, while a subset X ⊆ S is called closed if φ(X) = X.
It is evident from the properties of derivation operators that for a formal context (G, M, I), the operators A ↦ A″ on objects and B ↦ B″ on attributes are closure operators.
A pair (A, B) is a formal concept of a formal context K = (G, M, I) iff A ⊆ G, B ⊆ M, A′ = B, and B′ = A.
The sets A and B are called the extent and the intent of the formal concept (A, B), respectively.
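A minimal check of this definition in Python (the toy context and all names are ours):

```python
def is_formal_concept(A, B, I, G, M):
    """(A, B) is a formal concept iff A' = B and B' = A."""
    A_prime = {m for m in M if all((g, m) in I for g in A)}
    B_prime = {g for g in G if all((g, m) in I for m in B)}
    return A_prime == B and B_prime == A

# Toy context for the check.
G = {"g1", "g2"}
M = {"m1", "m2", "m3"}
I = {("g1", "m1"), ("g1", "m2"), ("g2", "m2"), ("g2", "m3")}
```

Here ({g1, g2}, {m2}) is a formal concept, while ({g1}, {m1}) is not, because g1 also has m2.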
This definition says that every formal concept has two parts, namely, its extent and intent. It follows an old tradition in philosophical concept logic, as expressed in the Logic of Port Royal, 1662 [arnauld1662logique].
The set of all formal concepts is partially ordered by the relation ≤: (A1, B1) ≤ (A2, B2) iff A1 ⊆ A2 (equivalently, B2 ⊆ B1). This partially ordered set is called the concept lattice of the formal context K.
In case an object has properties like colour or age, the corresponding attributes should have values themselves.
A many-valued context (G, M, W, I) consists of sets G, M, and W and a ternary relation I ⊆ G × M × W for which it holds that (g, m, w) ∈ I and (g, m, v) ∈ I always imply w = v.
The elements of are called (many-valued) attributes and those of attribute values.
Since many-valued attributes can be considered as partial maps from G into W, it is convenient to write m(g) = w.
In [madeira2004biclustering], a bicluster is defined as a homogeneous submatrix of an input object-attribute matrix of, in general, real values. Consider a dataset as a matrix with a set of rows/objects/individuals and a set of columns/attributes/SNPs. A submatrix constructed from a subset of rows and a subset of columns is called a bicluster of the matrix [madeira2004biclustering]. The bicluster should satisfy some specific homogeneity properties, which vary from one method to another.
For instance, for the purpose of this research, we use the following FCA-based definition of a bicluster [ignatovphdthesis2010, Ignatov:2010, ignatov2012concept].
For a formal context K = (G, M, I), any biset (A, B) with A ⊆ G and B ⊆ M is called a bicluster. If (g, m) ∈ I, then the bicluster (m′, g′) is called an object-attribute or OA-bicluster with density ρ(m′, g′) = |I ∩ (m′ × g′)| / (|m′| · |g′|).
The density ρ(A, B) of a bicluster (A, B) is the bicluster quality measure that shows the number of non-empty pairs the bicluster contains divided by its size, i.e. ρ(A, B) = |I ∩ (A × B)| / (|A| · |B|).
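The density measure can be sketched as follows, with the relation given as a set of (object, attribute) pairs; the toy relation and names are illustrative:

```python
def density(A, B, I):
    """Fraction of non-empty (object, attribute) pairs in the submatrix A x B."""
    if not A or not B:
        return 0.0
    filled = sum(1 for g in A for m in B if (g, m) in I)
    return filled / (len(A) * len(B))

# Toy relation reused from the earlier examples.
I = {("g1", "m1"), ("g1", "m2"), ("g2", "m2"), ("g2", "m3")}
```

For the submatrix {g1, g2} × {m1, m2}, three of the four cells are filled, so the density is 0.75.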
Several basic properties of OA-biclusters are below.
For any bicluster (A, B), it is true that 0 ≤ ρ(A, B) ≤ 1;
An OA-bicluster (m′, g′) is a formal concept iff ρ(m′, g′) = 1;
If (A, B) is an OA-bicluster, there exists (at least one) generating pair (g, m) ∈ I such that (A, B) = (m′, g′);
If (m′, g′) is an OA-bicluster, then g″ ⊆ m′ and m″ ⊆ g′.
For every (g, m) ∈ I, the generating pair itself lies in its OA-bicluster, i.e. (g, m) ∈ m′ × g′; from property 3, it follows that ρ(m′, g′) > 0.
In Fig. 1, one can see an example of an OA-bicluster (m′, g′) for a particular pair (g, m) ∈ I of a certain context (G, M, I). In general, only the regions g″ × g′ and m′ × m″ are full of non-empty pairs, i.e. have maximal density ρ = 1, since they correspond to the object and attribute formal concepts, respectively. The black cells indicate non-empty pairs, which one may also find in the less dense white regions.
Let (A, B) be an OA-bicluster and ρ_min ∈ [0, 1]; then (A, B) is called dense if it satisfies the constraint ρ(A, B) ≥ ρ_min.
The number of OA-biclusters of a context can be much smaller than the number of its formal concepts (which may be exponential, up to 2^min(|G|, |M|)), as stated by the following propositions.
For a formal context K = (G, M, I), the largest number of OA-biclusters is equal to |I|, and all OA-biclusters can be generated in time O(|I| · (|G| + |M|)).
For a formal context K = (G, M, I) and a fixed ρ_min ∈ (0, 1], the largest number of dense OA-biclusters is equal to |I|, and all dense OA-biclusters can be generated in time O(|I| · |G| · |M|).
3 Model and algorithm description
3.1 Parallel OA-biclustering algorithm
Algorithm 1 is a straightforward implementation, which takes an initial many-valued formal context and a minimal density threshold as parameters and computes dense biclusters for each pair (g, m) in the relation that indicates which objects have SNPs with missing values. However, since OA-biclusters for many-valued contexts were not formally introduced previously, we use a derived formal context with one-valued attributes denoting missing attribute values of the original genotype matrix, so that the definition of a dense OA-bicluster applies correctly.
Let (G, M, W, I) be a many-valued context and w ∈ W be a selected value (e.g., denoting the absence of an SNP value); then its derived context for the value w is (G, M, J), where (g, m) ∈ J iff m(g) = w.
For genotype matrices with missing SNP values treated as many-valued contexts, a similar representation can be expressed in terms of co-domains of many-valued attributes (the absence of a value meaning that the corresponding SNP value is missing) or by means of nominal scaling with a single attribute for the missing value [Ganter:1999:FCA].
If we compare the number of output patterns for formal concepts and dense OA-biclusters, in the worst case these values are exponential (up to 2^min(|G|, |M|)) versus at most |I|. The time complexity of our algorithm is polynomial, versus exponential in the worst case for BiMax [Prelic:2006] or for the CbO algorithms family [kuznetsov:1996], whose running time depends on the number of generated concepts (also considered as biclusters), which is exponential in the worst case.
For calculating biclusters that fulfil a minimum density constraint, we need to perform several steps (see Algorithm 1). Steps 5-8 consist of applying the Galois operators to all objects in G, and Steps 9-12 apply them to all attributes in M within the induced context. The outer for-loops are parallel (the concrete implementation may differ), while the internal ones are ordinary for-loops. Then all biclusters are enumerated in a parallel manner as well, and only those that fulfil the minimal density requirement are retained (Steps 13-16). Again, an efficient implementation of the set data structure for storing biclusters and duplicate elimination on the fly in parallel execution mode are not addressed in the pseudo-code.
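As a hedged illustration, the procedure can be sketched sequentially in Python (the parallel outer loops of Algorithm 1 become ordinary loops here; the toy relation and all names are ours):

```python
def oa_biclusters(I, min_density):
    """Generate dense OA-biclusters (m', g') for every pair (g, m) in I.

    Sequential sketch; the paper's Algorithm 1 runs the outer loops in
    parallel and handles duplicate elimination on the fly.
    """
    # Precompute prime sets: g' (attributes of g) and m' (objects having m).
    g_prime, m_prime = {}, {}
    for g, m in I:
        g_prime.setdefault(g, set()).add(m)
        m_prime.setdefault(m, set()).add(g)
    seen, result = set(), []
    for g, m in I:
        A, B = frozenset(m_prime[m]), frozenset(g_prime[g])  # extent, intent
        if (A, B) in seen:  # duplicate elimination
            continue
        seen.add((A, B))
        filled = sum(1 for x in A for y in B if (x, y) in I)
        if filled / (len(A) * len(B)) >= min_density:
            result.append((A, B))
    return result

# Toy relation reused from the earlier examples.
I = {("g1", "m1"), ("g1", "m2"), ("g2", "m2"), ("g2", "m3")}
```

With the density threshold set to 1.0, only biclusters that are formal concepts survive; lowering the threshold admits the sparser ones as well.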
The novelties of this algorithm are the use of parallelization to generate OA-biclusters from a medium-sized input dataset, which makes the program run faster, and the possibility of working with selected values by reducing many-valued contexts to contexts with one-valued attributes.
3.2 One-pass version of the OA-biclustering algorithm
Let us describe the online problem of finding the set of prime OA-biclusters based on the online OAC-Prime Triclustering [Gnatyshak:2014]. Let K = (G, M, I) be a context. The user has no a priori knowledge of the elements and even the cardinalities of G, M, and I. At each iteration, we receive a set of pairs (“batch”) from I. After that, we must process it and get the current version of the set of all biclusters. It is important in this setting to consider two biclusters different if they have different generating pairs, even if their extents and intents are equal, because any further pair can change only one of them, thus making them different.
Also, the algorithm requires that the dictionaries containing the prime sets are implemented as hash-tables or similar efficient key-value structures. Because of this data structure, the algorithm can efficiently access prime sets.
The algorithm itself is also straightforward (Alg. 2). It takes a batch of pairs, the current version of the bicluster set, and the dictionaries containing prime sets as input, and outputs their modified versions. The algorithm processes each pair (g, m) of the batch sequentially (line 1). On each iteration, it modifies the corresponding prime sets: it adds m to g′ (line 2) and g to m′ (line 3).
Finally, it adds a new bicluster to the bicluster set. Note that this bicluster contains pointers to the corresponding prime sets (in the corresponding dictionaries) instead of their copies (line 4).
In effect, this algorithm is very similar to the original OA-biclustering algorithm with some optimizations. First of all, instead of computing prime sets at the beginning, we modify them on the spot, as adding a new pair to the relation modifies only two prime sets by one element each. Secondly, we remove the main loop by using pointers for the biclusters' extents and intents, as we can generate biclusters at the same step as we modify the prime sets. Thirdly, it uses only one pass through the pairs of the binary relation I, instead of enumerating different pairwise combinations of objects and attributes.
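The one-pass update can be sketched as follows; the biclusters store references to the shared prime sets rather than copies, which reproduces the pointer behaviour described above (all names are illustrative):

```python
def online_update(batch, biclusters, g_prime, m_prime):
    """Process a batch of (g, m) pairs, updating prime sets in place.

    Biclusters keep pointers to the shared prime sets, so earlier
    biclusters grow automatically as new pairs arrive.
    """
    for g, m in batch:
        g_prime.setdefault(g, set()).add(m)   # g' gains m
        m_prime.setdefault(m, set()).add(g)   # m' gains g
        # A bicluster is (extent pointer, intent pointer, generating pair).
        biclusters.append((m_prime[m], g_prime[g], (g, m)))
    return biclusters, g_prime, m_prime

# Two batches of a toy stream of pairs.
biclusters, g_prime, m_prime = online_update(
    [("g1", "m1"), ("g1", "m2")], [], {}, {})
online_update([("g2", "m1")], biclusters, g_prime, m_prime)
```

After the second batch, the first bicluster's extent has silently grown to {g1, g2}, because it is the same set object that the dictionary holds.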
Each step requires constant time: we need to modify two sets and add one bicluster to the set of biclusters. The total number of steps is equal to |I|; the time complexity is thus linear, O(|I|). Besides that, the algorithm is one-pass.
The memory complexity is the same: at each of the |I| steps, the size of each dictionary containing prime sets increases either by one element (if the required prime set is already present) or by one key-value pair (if not). Since each of these dictionaries requires O(|I|) memory, the overall memory complexity is also linear.
3.3 Post-processing constraints
Another important step, in addition to this algorithm, is post-processing. For instance, we may want to remove duplicate biclusters with the same extent and intent from the output. Simple constraints like a minimal support condition can be processed during this step without increasing the original complexity. It should be done only during the post-processing step, as the addition of a pair in the main algorithm can change the set of biclusters and, respectively, the values used to check the conditions. Finally, if we need to fulfil more difficult constraints like a minimal density condition, the time complexity of the post-processing will be higher than that of the original algorithm, but it can still be implemented efficiently.
To remove duplicate biclusters, we need to use an efficient hashing procedure, which can be improved by integrating it into the main algorithm. For this, for all prime sets, we need to keep their hash values with them in memory. Finally, when using hash functions other than an LSH function (Locality-Sensitive Hashing) [Leskovec:2020], we can calculate hash values of prime sets as some incremental function of their elements (for example, exclusive disjunction or sum). Then, when we modify a prime set, we just need to combine the previous result of this function with the new element. In this case, the hash value of a bicluster can be calculated as the same function of the hash values of its extent and intent.
Then it would be enough to implement the bicluster set as a hash-set in order to efficiently remove the additional entries of the same bicluster.
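A possible duplicate-elimination sketch along these lines, using an order-independent XOR combination of element hashes (Python's built-in hash is an assumption of ours; a production version would keep the hash alongside each prime set and update it incrementally):

```python
def dedupe(biclusters):
    """Remove duplicate (extent, intent) biclusters via XOR-combined hashes."""
    def xor_hash(s):
        # XOR is commutative, so the hash is independent of element order
        # and can be updated on the fly when the set grows.
        h = 0
        for x in s:
            h ^= hash(x)
        return h

    seen, unique = set(), []
    for extent, intent in biclusters:
        key = (xor_hash(extent), xor_hash(intent), len(extent), len(intent))
        if key not in seen:
            seen.add(key)
            unique.append((extent, intent))
    return unique

unique = dedupe([({"g1"}, {"m1"}), ({"g1"}, {"m1"}), ({"g2"}, {"m1"})])
```
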
The pseudo-code for the basic post-processing is given in Alg. 3.
If the names (codes) of the objects and attributes are small enough (the time complexity of computing their hash values is constant), the time complexity of the post-processing is linear in the number of biclusters if we do not need to calculate densities, and higher otherwise. Also, the basic version of the post-processing does not require any additional memory, so its memory complexity is constant.
Finally, the algorithm can be easily parallelised by splitting the set of input pairs into several subsets, processing each of them independently, and merging the resulting sets afterwards, which may lead to distributed computing schemes for larger datasets (cf. [Ignatov:2019]).
In case the output of the post-processing step is stored in a relational database along with the computed statistics and generating pairs, further usage of selection operators [Codd:1970] is convenient to consider only a specific subset of biclusters.
We use the following selection operator resulting in a specific subset of biclusters: a selection over the extent size, the intent size, and the density of an OA-bicluster, each with its own threshold. One more reason to use post-processing is the neither monotonic nor anti-monotonic character of the minimal density constraint in the sense of constraint pushing in pattern mining [Besson:2005, ignatov2012concept].
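Assuming the stored biclusters are records of (extent size, intent size, density), such a selection operator might look like the following sketch (names and record layout are ours):

```python
def select(biclusters, min_extent=0, min_intent=0, min_density=0.0):
    """Relational-style selection over stored bicluster statistics.

    Each record is (extent_size, intent_size, density); the thresholds
    mirror the selection operator described above.
    """
    return [b for b in biclusters
            if b[0] >= min_extent and b[1] >= min_intent and b[2] >= min_density]

records = [(100, 500, 0.95), (10, 20, 0.5), (120, 450, 0.92)]
large_dense = select(records, min_extent=50, min_density=0.9)
```
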
4 Data collection
Collection of patients with ischemic stroke and their clinical characterisation were made at the Pirogov Russian National Research Medical University. The DNA extraction and genotyping of the samples were described previously [pmid22677768].
The dataset contains samples corresponding to individuals, with a genetic portrait and a group label for each. The former represents the genotypes determined at many SNPs all over the genome. The latter takes values 0 or 1 depending on whether a person did not have or had a stroke. Each SNP is represented by a vector whose components can take values from {0, 1, 2, -1}, where 0, 1, and 2 denote the genotypes and -1 indicates a missing value.
We represent the dataset as a many-valued formal context. In the derived context K = (G, M, I), objects from G stand for samples, attributes from M stand for SNPs, and (g, m) ∈ I means that an individual g has a missing value of SNP m. The parameters of the context are the number of samples, the number of SNPs with missing values, and the number of pairs in I, which represents the total number of missing entries and covers 0.491% of the whole data matrix. The number of attributes without missing values is 40,067.
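Building the derived one-valued context from a genotype matrix, with -1 as the missing-value code, can be sketched as follows (the toy matrix and names are illustrative):

```python
def derived_context(matrix, missing=-1):
    """Binary relation I: pair (i, j) is present iff individual i misses SNP j."""
    return {(i, j)
            for i, row in enumerate(matrix)
            for j, genotype in enumerate(row)
            if genotype == missing}

# Toy matrix: rows are individuals, columns are SNPs; -1 marks a missing call.
matrix = [
    [0, 1, -1],
    [2, -1, -1],
]
```
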
The genotypic data were obtained with DNA-microarrays. The dataset was compiled from several experiments where different types of microarrays were applied. Not all genotypes are equally well measured during an experiment; thus, there is a certain instrumental error. The quality of DNA can also affect the output of the experiments. Fig. 2 shows how many individuals have exactly a given number of missing genotypes in the dataset.
For instance, many individuals have about 85 missing genotypes.
5 Experimental results
5.1 Hardware and software configuration
The experimental results with OA-biclustering generation and processing were obtained on an Intel(R) Core(TM) i5-8265U CPU @ 1.80 GHz with 8 GB of RAM and 64-bit Windows 10 Pro operating system. We used the following software releases to perform our experiments: Python 3.7.4 and Conda 4.8.2.
5.2 Identification of Biclusters with Missing SNPs
The following experiment was performed with ischemic stroke data collection: first of all, 383,733 OA-biclusters, with duplicates, were generated after applying the parallel biclustering algorithm to the dataset.
As we can see from the graph in Figure 3, there is a reasonable amount of biclusters with a density value greater than 0.9. The distributions of biclusters by extent and intent show that the majority of biclusters have about 90 samples and 2,600 SNPs, respectively.
For the selection of large dense biclusters, we set a minimal density constraint together with additional constraints on the extent size and the intent size. In total, we selected 98,529 OA-biclusters with missing values. For this selection, the graph in Fig. 4 shows the selected peaks of large dense biclusters for different extent sizes.
Example 1. Consider biclusters of the form (m′, g′). For one generating pair (g, m), the resulting bicluster covers a group of individuals and a group of SNPs in which 9,657 pairs out of 10,612 correspond to missing SNP values.
We studied the large dense biclusters further and, from each of the peaks identified in their distributions (Fig. 3), chose the densest ones with possibly larger sizes of their extents and intents.
Here are some examples of these subsets with their associated graphs.
Example 2. We can further narrow down the number of patterns in the previous selection by looking at the distribution of biclusters by their extent size and choosing proper boundaries. Thus, in Fig. 4, the third-largest peak of the number of biclusters is near the extent size of 125.
Under the corresponding constraints on the extent size, intent size, and density, a large dense bicluster with an intent size of 455 is identified and selected. Such a bicluster has a large number of missing genotypes, which are subject to elimination later on.
Example 3. The selection around the rightmost peak (see Fig. 4) and further refinement of the minimal density value resulted in a large bicluster with an extent size of 108 and an intent size of 166.
5.3 Elimination of Large Biclusters with Missing Genotypes
After applying the proposed biclustering algorithm to the collected dataset, all large biclusters with missing genotypes were identified and eliminated. That resulted in a new data matrix ready for further analysis.
As seen from Table 1, the application of the biclustering algorithm resulted in an improvement in terms of entries corresponding to SNPs with missing genotypes: the fraction of such entries was reduced by 29.88%. The total number of biclusters generated before and after eliminating SNPs with missing genotypes is 383,733 (with duplicates) and 259,440, respectively. The total time for generating these biclusters before and after deleting missing data by Algorithm 1 is reported as well. As for the online Algorithm 2, it processed the original context (before elimination) in 1.5 seconds, while the post-processing Algorithm 3 for density computation took 907 seconds in sequential and 651 seconds in parallel (six-core) mode, respectively.
Fig. 5 shows the distribution of missing values in the columns of the new dataset (after elimination of missing data), which now has a less ragged character.
5.4 Large Dense Biclusters Elimination and Classification Quality
We have conducted a number of machine learning experiments on our datasets to check the impact of eliminating the missing data identified by our proposed algorithm on the quality measures of supervised learning algorithms.
We chose to use gradient boosting on decision trees (GBDT). For this purpose, we selected two libraries where it is already implemented, CatBoost and LightGBM. Both implementations can handle missing values.
A genome can essentially be interpreted as a sequence of SNPs, so we decided to also use a Long Short-Term Memory (LSTM) network [lstm] as a strong approach to handling sequential data.
First dataset experiments. Firstly, we applied the GBDT algorithm from the CatBoost library to our initial dataset (before the elimination of SNPs with missing genotypes). The following parameters were taken for the classifier:
Maximum number of trees: 3;
Tree depth limit: 3;
Loss function: binary cross-entropy (log-loss).
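A hedged sketch of this configuration expressed as CatBoost parameters; the argument names (iterations, depth, loss_function) are the CatBoost API names we assume correspond to the settings listed above:

```python
# Assumed mapping of the reported settings to CatBoost argument names.
params = {
    "iterations": 3,             # maximum number of trees
    "depth": 3,                  # tree depth limit
    "loss_function": "Logloss",  # binary cross-entropy
}

# Usage sketch (not run here):
# from catboost import CatBoostClassifier
# model = CatBoostClassifier(**params).fit(X_train, y_train)
```
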
We also applied the LSTM approach in the following way: the initial sequence was resized to 100 elements by a fully-connected layer, then the layer output was passed to the LSTM module element-wise. The hidden state of the LSTM after the last element was passed to a fully-connected classification layer.
The scores on this dataset were evaluated with 3-fold cross-validation with stratified splits. Basic classification metrics' scores are presented in Table 2.
These unexpectedly high scores were unrealistic, since the GBDT model had one of the lowest possible complexity configurations, and the LSTM model, which handles the data in a different way, also achieved high accuracy. For many samples, the model learned to “understand” on which chip a sample was analysed by looking at the patterns of missing genotypes, so a data leak was present.
Second dataset experiments. This dataset was obtained after the identification of large dense biclusters by application of our proposed algorithm with subsequent elimination. Table 3 recaps the experiments conducted on the dataset. For the first and second experiments, we used the CatBoost classifier with a train/test split in the proportion of 8:2 and 3-fold cross-validation, respectively, while maintaining the balance of classes for model validation. In the third experiment, we used the LGBMClassifier with 3-fold cross-validation while maintaining the balance of classes for model validation. In the fourth experiment, the LSTM classifier described earlier was used with the aforementioned cross-validation.
From Table 3, one can see that the scores are more realistic in comparison to those of Table 2, showing that the data leak and the subsequent overfitting effects are gone. Our proposed biclustering algorithm successfully identified large submatrices with missing data, whose elimination removed the impact of the data leak and overfitting.
5.5 Detecting concepts of missing SNP values under size constraints
In-Close4 is an open-source software tool [Andrews:2017] that provides a highly optimised algorithm from the CbO family [kuznetsov:1996, Janostik:2020] to construct the set of concepts satisfying given constraints on the sizes of extents and intents. In-Close4 takes a context as input and outputs a reduced concept lattice: all concepts (A, B) satisfying the constraints |A| ≥ a and |B| ≥ b, where A and B are the extent and the intent of an output formal concept, and a and b are the given parameter values.
To deal with our large real-world dataset, we changed the maximum default values of the In-Close4 parameters used in the executable as follows:
#define MAX_CONS 30000000 // max number of concepts
#define MAX_COLS 90000    // max number of attributes
#define MAX_ROWS 2000     // max number of objects
From Tables 4 and 5, one can see that, in our case study, the number of concepts generated by In-Close4 becomes several times larger than the number of OA-biclusters. When we set the extent size constraint to 5 for the input context before the elimination of missing data, and the extent and intent size constraints to 20 and 0, respectively, after the elimination, the software crashed. Meanwhile, our proposed biclustering algorithms managed to output all OA-biclusters in both cases.
As the author of In-Close suggested in private communication, the tool was optimised for “tall” contexts with a large number of objects rather than attributes, while in bioinformatics the contexts are often “wide”, as in our case, where the number of SNPs is almost 57 times larger than the number of individuals. So, running on the transposed context with properly set compilation parameters allowed the whole context to be processed.
(Tables 4 and 5 list, for each combination of minimal intent and extent size constraints, the total running time in seconds and the number of concepts.)
Even if we do not know the exact number of output concepts for the context after the elimination of missing SNP values, their number is more than 10 times larger than that of OA-biclusters, which might be considered an argument in favour of using OA-biclusters for the studied problem with rather low or no size constraints.
6 Conclusion
A new approach to process the missing values in datasets of SNP genotypes obtained with DNA-microarrays is proposed. It is based on OA-biclustering. We applied the approach to the real-world datasets representing the genotypes of patients with ischemic stroke and healthy people. It allowed us to carefully estimate and eliminate the SNPs with missing genotypes. Results of the OA-biclustering algorithm showed the possibility of detecting relatively large dense biclusters, which significantly helped in removing the effects of data leaks and overfitting while applying ML algorithms.
We compared our algorithm with In-Close4. The number of OA-biclusters generated by our algorithm is significantly lower than the number of concepts (or biclusters) generated by In-Close4. Besides, our algorithm has the advantage of using OA-bicluster without the need to experiment with finding the best minimum support, as in the case of using In-Close4 for generating formal concepts.
Since the survey [Naulaerts:2015] mentioned frequent itemset mining (FIM) as a tool to identify strong associations between allelic combinations associated with diseases, the proposed algorithm needs further comparison with other approaches from FIM like DeBi [serin2011debi] and anytime discovery approaches like Alpine [Hu:17] tested on GEA datasets as well, though their use may become complicated if we need to keep information about object names for decision-makers. It also requires further time-complexity improvements to increase the scalability and quality of the extensive bicluster-finding process for massive datasets.
Another avenue for related studies lies in Boolean biclustering [Michalak:2019] and factorisation techniques [Belohlavek:2019].
Speaking about other possible applications of biclustering, we suggest the development of a new imputation technique. Since biclustering has been recently applied to impute the missing values in gene expression data [chowdhury2020ncbi], and both GED and SNP genotyping data are obtained with DNA-microarrays and represented as an integer matrix, it can potentially be applied to impute the genotypes, which would facilitate statistical analyses and empower ML algorithms.
This study was implemented in the Basic Research Program’s framework at the National Research University Higher School of Economics and the Laboratory of Models and Methods of Computational Pragmatics in 2020. The authors thank prof. Alexei Fedorov (University of Toledo College of Medicine, Ohio, USA) and prof. Svetlana Limborska (Institute of Molecular Genetics of National Research Centre “Kurchatov Institute”, Moscow, Russia) for insightful discussions of the results obtained, and anonymous reviewers.
The study was funded by RFBR (Russian Foundation for Basic Research) according to the research project No 19-29-01151. The foundation had no role in study design, data collection and analysis, writing the manuscript, and decision to publish.
- The equivalence classes are and .
- The last line in Table 4 and the last five lines in Table 5 correspond to the experiments conducted for the final version of the paper on the transposed contexts.