Object-Attribute Biclustering for Elimination of Missing Genotypes in Ischemic Stroke Genome-Wide Data
Abstract
Missing genotypes can affect the efficacy of machine learning approaches to identifying the risk genetic variants of common diseases and traits. The problem occurs when genotypic data are collected from different experiments with different DNA microarrays, each being characterised by its own pattern of uncalled (missing) genotypes. This can prevent a machine learning classifier from assigning classes correctly. To tackle this issue, we used the well-developed notions of object-attribute biclusters and formal concepts, which correspond to dense subrelations of the binary object-attribute relation. The paper contains experimental results on applying a biclustering algorithm to a large real-world dataset collected for studying the genetic bases of ischemic stroke. The algorithm could identify large dense biclusters in the genotypic matrix for further processing, which in turn significantly improved the quality of machine learning classifiers. The proposed algorithm was also able to generate biclusters for the whole dataset without size constraints, in contrast to the InClose4 algorithm for generation of formal concepts.
Keywords:
Formal Concept Analysis, Biclustering, Single Nucleotide Polymorphism, Missing Genotypes, Data Mining, Ischemic Stroke
1 Introduction
The recent progress in studying different aspects of human health and diversity (e.g., genetics of common diseases and traits, human population structure, and relationships) is associated with the development of high-throughput genotyping technologies, particularly with massive parallel genotyping of Single Nucleotide Polymorphisms (SNPs) by DNA microarrays [pmid23288464]. They allow the determination of hundreds of thousands to millions of SNPs in one experiment and were the basis for conducting genome-wide association studies (GWAS). Although thousands of genetic loci have been revealed in GWAS, there are practical problems with replicating the associations identified in different studies. They seem to be due to both limitations in the methodology of the GWAS approach itself and differences between various studies in data design and analysis [pmid29876890]. Machine learning (ML) approaches were found to be quite promising in this field [pmid32351543].
Genotyping by microarrays is efficient and cost-effective, but missing data appear. GWAS is based on a comparison of frequencies of genetic variants among patients and healthy people. It assumes that all genotypes are provided (usually, their percentage is defined by a genotype calling threshold). In this article, we demonstrate that missing data can affect not only statistical analysis but also ML algorithms. Classifiers can fail because missing values (uncalled genotypes) are distributed non-randomly. We assume that each set of DNA microarrays can possess a specific pattern of missing values marking both the dataset of patients and that of healthy people. Therefore, the missing data need to be carefully estimated and processed without dropping too many SNPs that may contain crucial genetic information.
To overcome the problem of missing data, we aimed to apply a technique capable of discovering some groupings in a dataset by looking at the similarity across all individuals and their genotypes. The raw datasets can be converted into an integer matrix, where individuals are in rows, SNPs are in columns, and cells contain genotypes. For each SNP, the person can have either AA, AB, or BB genotype, where A and B are the alleles. Thus the genotypes can be coded as 0, 1, and 2, representing the counts of allele B.
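A minimal sketch of this coding (the genotype calls below are illustrative; `None` stands for an uncalled genotype, which we encode as -1):

```python
# Map each genotype to the count of allele B.
GENOTYPE_CODE = {"AA": 0, "AB": 1, "BB": 2}

def encode(genotypes):
    """Encode a list of genotype strings; None (uncalled) becomes -1."""
    return [GENOTYPE_CODE[g] if g is not None else -1 for g in genotypes]

row = encode(["AA", "AB", None, "BB"])
print(row)  # [0, 1, -1, 2]
```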
The proposed method can simultaneously cluster rows and columns in a data matrix to find homogeneous submatrices [tanay2002discovering], which can overlap. Each of these submatrices is called a bicluster [Mirkin:1996], and the process of finding them is called biclustering [tanay2002discovering, madeira2004biclustering, cheng2000biclustering, tanay2005biclustering, busygin2008biclustering].
Biclustering in genotype data allows identifying sets of individuals sharing SNPs with missing genotypes. A bicluster arises when there is a strong relationship between a specific set of objects and a specific set of attributes in a data table. A particular kind of bicluster is a formal concept in Formal Concept Analysis (FCA) [Ganter:1999:FCA]. A formal concept is a pair of the form (extent, intent), where the extent consists of all objects that share the attributes in the intent, and dually the intent consists of all attributes shared by the objects in the extent. Formal concepts have the desirable property of being homogeneous and closed in the algebraic sense, which resulted in their extensive use in Gene Expression Analysis (GEA) [Besson:2005, Blachon:2007, Kaytoue:2011, Andrews:2013].
A concept-based bicluster (or object-attribute bicluster) [Ignatov:2010] is a scalable approximation of a formal concept with the following advantages:

Reduced number of patterns to analyze;

Reduced computational cost (polynomial vs. exponential);

Manual (interactive) tuning of bicluster density threshold;

Tolerance to missing (object, attribute) pairs.
In this paper, we propose an extended version of the biclustering algorithm of [ignatov2012concept] that can identify large biclusters with missing genotypes in categorical data (many-valued contexts with a selected value). This algorithm generates fewer dense object-attribute biclusters than existing exact algorithms for formal concepts, such as the concept miner InClose4 [Andrews:2011], and is therefore better suited for large datasets. Moreover, during experimentation with the ischemic stroke dataset, we found that the number of large dense biclusters identified by our algorithm is significantly lower than the number of formal concepts extracted by InClose4.
The paper is organized as follows. In Section 2, we recall basic notions from Formal Concept Analysis and Biclustering. In Section 3, we introduce a method of FCA-based biclustering and its variants along with bicluster post-processing schemes, and discuss the complexity of the proposed algorithm. In Section 4, we describe a dataset that consists of a sample of patients and their SNPs collected from various (independent) groups of patients. Then we present the results obtained during experiments on this dataset in Section 5 and mention the hardware and software configuration used. Section 6 concludes the paper.
2 Basic notions
2.1 Formal Concept Analysis
Definition 1
A formal context in FCA [Ganter:1999:FCA] is a triple $\mathbb{K} = (G, M, I)$ consisting of two sets, $G$ and $M$, and a binary relation $I \subseteq G \times M$ between $G$ and $M$. The triple can be represented by a cross-table consisting of rows $g \in G$, called objects, and columns $m \in M$, called attributes, and crosses representing the incidence relation $I$. Here, $gIm$ or $(g, m) \in I$ means that the object $g$ has the attribute $m$.
Definition 2
For $A \subseteq G$ and $B \subseteq M$, let
$$A' := \{m \in M \mid \forall g \in A\colon gIm\} \quad \text{and} \quad B' := \{g \in G \mid \forall m \in B\colon gIm\}.$$
These two operators $(\cdot)'$ are the derivation operators for $\mathbb{K} = (G, M, I)$.
Proposition 1
Let $(G, M, I)$ be a formal context; for subsets $A, A_1, A_2 \subseteq G$ we have

$A_2' \subseteq A_1'$ if $A_1 \subseteq A_2$,

$A \subseteq A''$,

$A' = A'''$ (hence, $A'' = A''''$),

$(A_1 \cup A_2)' = A_1' \cap A_2'$,

$(A_1 \cap A_2)' \supseteq A_1' \cup A_2'$.
Similar properties hold for subsets of attributes.
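The derivation operators and the listed properties can be illustrated in Python over a toy context stored as a set of (object, attribute) pairs (the context below is illustrative):

```python
# A toy formal context as a set of (object, attribute) pairs.
I = {("g1", "m1"), ("g1", "m2"), ("g2", "m1"), ("g3", "m3")}
G = {g for g, _ in I}
M = {m for _, m in I}

def prime_objects(A):
    """A' = attributes shared by all objects in A."""
    return {m for m in M if all((g, m) in I for g in A)}

def prime_attributes(B):
    """B' = objects having all attributes in B."""
    return {g for g in G if all((g, m) in I for m in B)}

A = {"g1", "g2"}
print(prime_objects(A))                    # {'m1'}
print(prime_attributes(prime_objects(A)))  # {'g1', 'g2'} — so A is contained in A''
```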
Definition 3
A closure operator on a set $S$ is a mapping $\varphi\colon 2^S \to 2^S$ with the following properties:
Let $X, Y \subseteq S$, then

$\varphi(\varphi(X)) = \varphi(X)$ (idempotency),

$X \subseteq \varphi(X)$ (extensity),

$X \subseteq Y \Rightarrow \varphi(X) \subseteq \varphi(Y)$ (monotonicity).
For a closure operator $\varphi$ the set $\varphi(X)$ is called the closure of $X$, while a subset $X$ is called closed if $\varphi(X) = X$.
It is evident from the properties of the derivation operators that, for a formal context $(G, M, I)$, the operators
$$A \mapsto A'' \ (A \subseteq G) \quad \text{and} \quad B \mapsto B'' \ (B \subseteq M)$$
are closure operators.
Definition 4
$(A, B)$ is a formal concept of a formal context $\mathbb{K} = (G, M, I)$ iff $A \subseteq G$, $B \subseteq M$, $A' = B$, and $B' = A$.
The sets $A$ and $B$ are called the extent and the intent of the formal concept $(A, B)$, respectively.
This definition says that every formal concept has two parts, namely, its extent and intent. It follows an old tradition in philosophical concept logic, as expressed in the Logic of Port Royal, 1662 [arnauld1662logique].
Definition 5
The set of all formal concepts is partially ordered by the relation $\leq$:
$(A_1, B_1) \leq (A_2, B_2)$ iff $A_1 \subseteq A_2$ (equivalently, $B_2 \subseteq B_1$); the resulting ordered set is called the concept lattice of the formal context $\mathbb{K}$.
In case an object has properties like colour or age, the corresponding attributes should have values themselves.
Definition 6
A many-valued context $(G, M, W, I)$ consists of sets $G$, $M$, and $W$ and a ternary relation $I \subseteq G \times M \times W$ for which it holds that $(g, m, w) \in I$ and $(g, m, v) \in I$ imply $w = v$.
The elements of $M$ are called (many-valued) attributes and those of $W$ attribute values.
Since many-valued attributes can be considered as partial maps from $G$ to $W$, it is convenient to write $m(g) = w$.
2.2 Biclustering
In [madeira2004biclustering], a bicluster is defined as a homogeneous submatrix of an input object-attribute matrix of, in general, real values. Consider a dataset as a matrix with a set of rows (objects, individuals) and a set of columns (attributes, SNPs). A submatrix constructed from a subset of rows $X$ and a subset of columns $Y$ is denoted by $(X, Y)$ and is called a bicluster of the matrix [madeira2004biclustering]. A bicluster should satisfy some specific homogeneity property, which varies from one method to another.
For instance, for the purpose of this research, we use the following FCAbased definition of a bicluster [ignatovphdthesis2010, Ignatov:2010, ignatov2012concept].
Definition 7
For a formal context $\mathbb{K} = (G, M, I)$, any biset $(A, B)$ with $A \subseteq G$ and $B \subseteq M$ is called a bicluster. If $(g, m) \in I$, then the bicluster $(m', g')$ is called an object-attribute bicluster or OA-bicluster, with density $\rho(m', g')$.
The density $\rho(A, B) = \frac{|I \cap (A \times B)|}{|A| \cdot |B|}$ of a bicluster $(A, B)$ is a quality measure showing the number of non-empty pairs the bicluster contains divided by its size.
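Density follows directly from this definition; a sketch over a toy relation stored as a set of pairs:

```python
def density(A, B, I):
    """rho(A, B): the fraction of cells of A x B present in the relation I."""
    if not A or not B:
        return 0.0
    filled = sum(1 for g in A for m in B if (g, m) in I)
    return filled / (len(A) * len(B))

I = {("g1", "m1"), ("g1", "m2"), ("g2", "m1")}
print(density({"g1", "g2"}, {"m1", "m2"}, I))  # 0.75 — 3 of 4 cells are filled
```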
Several basic properties of OAbiclusters are below.
Proposition 2

For any bicluster $(A, B)$ it is true that $0 \leq \rho(A, B) \leq 1$;

an OA-bicluster $(m', g')$ is a formal concept iff $\rho(m', g') = 1$;

if $(A, B)$ is an OA-bicluster, there exists (at least one) generating pair $(g, m) \in I$ such that $(A, B) = (m', g')$;

if $(m', g')$ is an OA-bicluster, then $(g'', g') \leq (m', m'')$;

for every $(\tilde{g}, \tilde{m}) \in [g] \times [m]$^{3}, it follows that $(\tilde{m}', \tilde{g}') = (m', g')$.
In Fig. 1, one can see an example of an OA-bicluster $(m', g')$ for a particular pair $(g, m) \in I$ of a certain context $(G, M, I)$. In general, only the regions $g'' \times g'$ and $m' \times m''$ are full of non-empty pairs, i.e. have maximal density $\rho = 1$, since they correspond to the object and attribute formal concepts, respectively. The black cells indicate non-empty pairs, which one may also find in the less dense white regions.
Definition 8
Let $(A, B)$ be an OA-bicluster and $\rho_{\min} \in [0, 1]$; then $(A, B)$ is called dense if it satisfies the constraint $\rho(A, B) \geq \rho_{\min}$.
The number of OA-biclusters of a context can be much smaller than the number of formal concepts (which may be exponential, up to $2^{\min(|G|, |M|)}$), as stated by the following propositions.
Proposition 3
For a formal context $\mathbb{K} = (G, M, I)$ the largest number of OA-biclusters is equal to $|I|$, and all OA-biclusters can be generated in time $O(|I| \cdot (|G| + |M|))$.
Proposition 4
For a formal context $\mathbb{K} = (G, M, I)$ and a threshold $\rho_{\min} > 0$ the largest number of dense OA-biclusters is equal to $|I|$; all dense OA-biclusters can be generated in time $O(|I| \cdot |G| \cdot |M|)$.
3 Model and algorithm description
3.1 Parallel OAbiclustering algorithm
Algorithm 1 is a straightforward implementation that takes an initial many-valued formal context and a minimal density threshold as parameters and computes dense biclusters for each pair $(g, m)$ of the relation $J$ that indicates which objects have SNPs with missing values. However, since OA-biclusters for many-valued contexts were not formally introduced previously, we use a derived formal context with one-valued attributes denoting missing attribute values of the original genotype matrix in order to correctly apply the definition of a dense OA-bicluster.
Definition 9
Let $(G, M, W, I)$ be a many-valued context and $v \in W$ be a selected value (e.g., denoting the absence of an SNP value); then its derived context for the value $v$ is $(G, M, J)$, where $(g, m) \in J$ iff $m(g) = v$.
For genotype matrices with missing SNP values viewed as many-valued contexts, a similar representation can be expressed in terms of codomains of many-valued attributes (the absence of $m(g)$ means the absence of the corresponding SNP value) or by means of nominal scaling with a single attribute for the missing value [Ganter:1999:FCA].
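Deriving such a one-valued context from a genotype matrix is a single scan; a sketch where the selected value $v = -1$ marks a missing genotype (the toy matrix is illustrative):

```python
def derive_context(matrix, v=-1):
    """Return the relation J of (row, column) pairs whose cell equals v."""
    return {(i, j)
            for i, row in enumerate(matrix)
            for j, cell in enumerate(row)
            if cell == v}

genotypes = [
    [0, 1, -1],   # individual 0: SNP 2 missing
    [2, -1, -1],  # individual 1: SNPs 1 and 2 missing
]
print(sorted(derive_context(genotypes)))  # [(0, 2), (1, 1), (1, 2)]
```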
If we compare the number of output patterns for formal concepts and dense OA-biclusters, in the worst case these values are $2^{\min(|G|, |M|)}$ versus $|I|$. The time complexity of our algorithm is polynomial, $O(|I| \cdot |G| \cdot |M|)$, versus exponential in the worst case for BiMax [Prelic:2006] or for algorithms of the CbO family [kuznetsov:1996] with complexity $O(|G|^2 \cdot |M| \cdot |L|)$, where $|L|$ is the number of generated concepts (also considered as biclusters), which is exponential in the worst case.
For calculating biclusters that fulfil a minimum density constraint, we need to perform several steps (see Algorithm 1). Steps 5–8 consist of applying the Galois operator to all objects in parallel, and Steps 9–12 apply it to all attributes within the induced context. The outer for loops are parallel (the concrete implementation may differ), while the internal ones are ordinary for loops. Then all biclusters are enumerated in a parallel manner as well, and only those that fulfil the minimal density requirement are retained (Steps 13–16). Again, efficient implementation of a set data structure for storing biclusters and duplicate elimination on the fly in parallel execution mode are not addressed in the pseudocode.
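A sequential sketch of these steps (the parallel outer loops are rendered here as ordinary loops, and the toy relation is illustrative, not the paper's dataset):

```python
def oa_biclusters(I, rho_min=0.9):
    """Generate dense OA-biclusters (m', g') for every (g, m) in I."""
    G = {g for g, _ in I}
    M = {m for _, m in I}
    # Precompute prime sets for all objects and attributes.
    obj_prime = {g: frozenset(m for m in M if (g, m) in I) for g in G}
    attr_prime = {m: frozenset(g for g in G if (g, m) in I) for m in M}
    seen, result = set(), []
    for g, m in I:
        A, B = attr_prime[m], obj_prime[g]   # candidate bicluster (m', g')
        if (A, B) in seen:
            continue  # duplicate elimination
        seen.add((A, B))
        filled = sum(1 for a in A for b in B if (a, b) in I)
        if filled / (len(A) * len(B)) >= rho_min:
            result.append((A, B))
    return result

I = {(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)}
print(len(oa_biclusters(I, rho_min=1.0)))  # 2
```

In a parallel implementation, the two prime-set computations and the density check can each be distributed over worker processes, since the iterations are independent.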
The novelties of this algorithm are the use of parallelisation to generate OA-biclusters for a medium-sized input dataset, which makes the program run faster, and the possibility of working with selected values, reducing a many-valued context to contexts with one-valued attributes.
3.2 Onepass version of the OAbiclustering algorithm
Let us describe the online problem of finding the set of prime OA-biclusters based on the online OAC-prime triclustering algorithm [Gnatyshak:2014]. Let $\mathbb{K} = (G, M, I)$ be a context. The user has no a priori knowledge of the elements and even the cardinalities of $G$, $M$, and $I$. At each iteration, we receive a set of pairs (a "batch") from $I$. After that, we must process it and obtain the current version of the set of all biclusters. It is important in this setting to consider every pair of biclusters different if they have different generating pairs, even if their extents and intents are equal, because any subsequent pair can change only one of them, thus making them different.
Also, the algorithm requires that the dictionaries containing the prime sets are implemented as hash tables or similar efficient key-value structures. Thanks to this data structure, the algorithm can access prime sets efficiently.
The algorithm itself is also straightforward (Alg. 2). It takes a set of pairs and the current versions of the bicluster set and of the dictionaries containing prime sets as input, and outputs their modified versions. The algorithm processes each pair $(g, m)$ of the batch sequentially (line 1). On each iteration it modifies the corresponding prime sets: it adds $m$ to $g'$ (line 2) and $g$ to $m'$ (line 3).
Finally, it adds a new bicluster $(m', g')$ to the bicluster set. Note that this bicluster contains pointers to the corresponding prime sets (in the corresponding dictionaries) instead of their copies (line 4).
In effect, this algorithm is very similar to the original OA-biclustering algorithm with some optimisations. First of all, instead of computing prime sets at the beginning, we modify them on the spot, as adding a new pair to the relation modifies only two prime sets by one element. Secondly, we remove the main loop by using pointers for the biclusters' extents and intents, as we can generate biclusters at the same step as we modify the prime sets. And thirdly, it uses only one pass through the pairs of the binary relation, instead of enumerating different pairwise combinations of objects and attributes.
Each step requires constant time: we need to modify two sets and add one bicluster to the set of biclusters. The total number of steps is equal to $|I|$; the time complexity is linear, $O(|I|)$. Besides that, the algorithm is one-pass.
The memory complexity is the same: for each of the $|I|$ steps the size of the dictionary containing prime sets is increased either by one element (if the required prime set is already present) or by one key-value pair (if not). Since these dictionaries jointly require $O(|I|)$ memory, the memory complexity is also linear.
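A sketch of this one-pass scheme: the prime-set dictionaries grow in place, and each emitted bicluster stores references to (not copies of) the corresponding sets, together with its generating pair (all names and the batch are illustrative):

```python
def process_batch(batch, biclusters, g_prime, m_prime):
    """One-pass update: each pair (g, m) grows two prime sets and
    emits one bicluster holding references to them."""
    for g, m in batch:
        g_prime.setdefault(g, set()).add(m)   # g' gains m
        m_prime.setdefault(m, set()).add(g)   # m' gains g
        # Store references plus the generating pair, not copies.
        biclusters.append((m_prime[m], g_prime[g], (g, m)))
    return biclusters, g_prime, m_prime

bs, gp, mp = process_batch([(0, 0), (0, 1), (1, 0)], [], {}, {})
extent, intent, pair = bs[0]  # the bicluster generated by (0, 0)
print(extent, intent)         # {0, 1} {0, 1} — the sets grew as later pairs arrived
```

Because the stored extents and intents are shared references, later pairs retroactively enlarge earlier biclusters, which is exactly why duplicate detection and density checks are deferred to the post-processing step.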
3.3 Postprocessing constraints
Another important step, in addition to this algorithm, is postprocessing. Thus, we may want to remove additional biclusters with the same extent and intent from the output. Simple constraints like minimal support condition can be processed during this step without increasing the original complexity. It should be done only during the postprocessing step, as the addition of a pair in the main algorithm can change the set of biclusters, and, respectively, the values used to check the conditions. Finally, if we need to fulfil more difficult constraints like minimal density condition, the time complexity of the postprocessing will be higher than that of the original algorithm, but it can be efficiently implemented.
To remove duplicate biclusters we need an efficient hashing procedure, which can be improved by integrating it into the main algorithm. For this, for all prime sets, we need to keep their hash values with them in memory. And finally, when using hash functions other than an LSH function (Locality-Sensitive Hashing) [Leskovec:2020], we can calculate hash values of prime sets as some function of their elements (for example, exclusive disjunction or sum). Then, when we modify a prime set, we just need to combine the current result of this function with the new element. In this case, the hash value of a bicluster can be calculated as the same function of the hash values of its extent and intent.
Then it would be enough to implement the bicluster set as a hashset in order to efficiently remove the additional entries of the same bicluster.
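A sketch of this incremental hashing scheme with exclusive disjunction (XOR) as the combining function (all names are illustrative):

```python
def xor_hash(elements):
    """Order-independent hash of a set: XOR of element hashes."""
    h = 0
    for e in elements:
        h ^= hash(e)
    return h

# Updating the hash when a prime set gains an element costs O(1):
s = {1, 2}
h = xor_hash(s)
s.add(3)
h ^= hash(3)            # incremental update instead of a full rehash
assert h == xor_hash(s)

# A bicluster's key combines the hashes of its extent and intent.
def bicluster_key(extent, intent):
    return (xor_hash(extent), xor_hash(intent), len(extent), len(intent))

seen = {bicluster_key({1, 2}, {"a"}), bicluster_key({2, 1}, {"a"})}
print(len(seen))  # 1 — the duplicate collapses in the hash set
```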
The pseudocode of the basic post-processing is given in Alg. 3.
If the names (codes) of the objects and attributes are small enough (the time complexity of computing their hash values is $O(1)$), the time complexity of the post-processing is $O(|I|)$ if we do not need to calculate densities, and $O(|I| \cdot |G| \cdot |M|)$ otherwise. Also, the basic version of the post-processing does not require any additional memory, so its memory complexity is $O(1)$.
Finally, the algorithm can be easily paralleled by splitting the subset of input pairs into several subsets, processing each of them independently, and merging the resulting sets afterwards, which may lead to distributed computing schemes for larger datasets (cf. [Ignatov:2019]).
In case the output of the postprocessing step is stored in a relational database along with the computed statistics and generating pairs, further usage of selection operators [Codd:1970] is convenient to consider only a specific subset of biclusters.
We use the following selection operator resulting in a specific subset of biclusters:
$$\sigma_{|A| \geq a,\ |B| \geq b,\ \rho \geq \rho_{\min}},$$
where $|A|$ is the extent size, $|B|$ is the intent size, and $\rho$ is the density of an OA-bicluster $(A, B)$, respectively. One more reason to use post-processing is the neither monotonic nor anti-monotonic character of the minimal density constraint in the sense of constraint pushing in pattern mining [Besson:2005, ignatov2012concept].
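Over a stored collection of biclusters with precomputed densities, such a selection operator amounts to a simple filter; a sketch with illustrative thresholds and data:

```python
def select(biclusters, min_extent, min_intent, rho_min):
    """sigma: keep biclusters with |A| >= a, |B| >= b, and density >= rho_min."""
    return [(A, B, rho) for A, B, rho in biclusters
            if len(A) >= min_extent and len(B) >= min_intent and rho >= rho_min]

stored = [({1, 2, 3}, {"s1", "s2"}, 0.95),
          ({1, 2}, {"s1"}, 1.00),       # extent too small
          ({1, 2, 3, 4}, {"s1", "s2", "s3"}, 0.60)]  # too sparse
kept = select(stored, min_extent=3, min_intent=2, rho_min=0.9)
print(len(kept))  # 1 — only the first bicluster passes all three constraints
```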
4 Data collection
Collection of patients with ischemic stroke and their clinical characterisation were made at the Pirogov Russian National Research Medical University. The DNA extraction and genotyping of the samples were described previously [pmid22677768].
The dataset contains samples corresponding to individuals, with a genetic portrait and a group label for each. The former represents the genotypes determined at many SNPs all over the genome. The latter takes values 0 or 1 depending on whether a person did not have or had a stroke. Each SNP is a vector whose components can take values from $\{0, 1, 2, -1\}$, where 0, 1, and 2 denote the genotypes, and $-1$ indicates a missing value.
We represent the dataset as a many-valued formal context. In the derived context $(G, S, J)$, where objects from $G$ stand for samples and attributes from $S$ stand for SNPs, $(g, s) \in J$ means that an individual $g$ has a missing value of SNP $s$. The pairs in $J$ give the total number of entries with missing values in the dataset and cover 0.491% of the whole data matrix. The number of attributes without missing values is 40,067.
The genotypic data were obtained with DNA microarrays. The dataset was compiled from several experiments where different types of microarrays were applied. Not all genotypes are equally well measured during an experiment; thus, there is a certain instrumental error. The quality of DNA can also affect the output of the experiments. Fig. 2 shows how many individuals have a given number of missing genotypes per SNP in the dataset.
For instance, many individuals have about 85 missing genotypes per SNP.
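The distribution behind Fig. 2 can be computed by counting missing entries per row of the genotype matrix; a sketch on a toy matrix (the real matrix is far larger):

```python
from collections import Counter

def missing_histogram(matrix, missing=-1):
    """How many individuals have exactly k missing genotypes."""
    counts = (row.count(missing) for row in matrix)
    return Counter(counts)

genotypes = [[0, -1, 2], [1, -1, -1], [0, 1, 2], [2, -1, 0]]
print(missing_histogram(genotypes))  # Counter({1: 2, 2: 1, 0: 1})
```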
5 Experiments
5.1 Hardware and software configuration
The experimental results with OAbiclustering generation and processing were obtained on an Intel(R) Core(TM) i58265U CPU @ 1.80 GHz with 8 GB of RAM and 64bit Windows 10 Pro operating system. We used the following software releases to perform our experiments: Python 3.7.4 and Conda 4.8.2.
5.2 Identification of Biclusters with Missing SNPs
The following experiment was performed with ischemic stroke data collection: first of all, 383,733 OAbiclusters, with duplicates, were generated after applying the parallel biclustering algorithm to the dataset.
As we can see from the graph in Figure 3, there is a reasonable amount of biclusters with a density value greater than 0.9. The distributions of biclusters by extent and intent show that the majority of biclusters have about 90 samples and 2,600 SNPs, respectively.
For the selection of large dense biclusters, we set the density constraint to be . Additional constraints were set as follows: for the extent size and for the intent size. In total, we selected 98,529 OAbiclusters with missing values. For this selection, the graph in Fig. 4 shows the selected peaks of large dense biclusters for different extent sizes.
Example 1. Biclusters of the form $(m', g')$.
For a generating pair $(g, m)$ we obtain a bicluster whose extent consists of individuals and whose intent consists of SNPs; 9,657 pairs out of 10,612 correspond to missing SNP values, i.e., its density is $\rho \approx 0.91$.
We studied further large dense biclusters and chose the densest ones with possibly larger sizes of their extents and intents from each of the peaks identified in their distributions, respectively (Fig. 3).
Here are some examples of these subsets with their associated graphs.
Example 2. We can further narrow down the number of patterns in the previous selection by looking at the distribution of biclusters by their extent size and choosing proper boundaries. Thus, in Fig. 4, the third-largest peak of the number of biclusters lies near extent size 125.
For the constraints below, a large dense bicluster with an intent size of 455 is identified and selected. Such a bicluster contains a large number of missing genotypes, which are subject to elimination later on.
Example 3. The selection around the rightmost peak (see Fig. 4) and further refinement of the minimal density value resulted in a large bicluster with an extent size of 108 and an intent size of 166.
5.3 Elimination of Large Biclusters with Missing Genotypes
After applying the proposed biclustering algorithm to the collected dataset, all large biclusters with missing genotypes were identified and eliminated. That resulted in a new data matrix ready for further analysis (Table 1).
                      no. samples   no. SNPs   no. NaNs   NaNs fraction
Before elimination    1,223         85,142     553,430    0.49%
After elimination     1,472         82,690     388,052    0.31%
As seen from Table 1, the application of the biclustering algorithm resulted in an improvement in terms of entries corresponding to SNPs with missing genotypes: the fraction of such entries is reduced by 29.88%. The total number of biclusters generated before and after eliminating SNPs with missing genotypes is 383,733 (with duplicates) and 259,440, respectively. The total amount of time for generating these biclusters before and after deleting missing data is and seconds (by Algorithm 1), respectively. As for the online Algorithm 2, it processed the original context (before elimination) in 1.5 seconds, while the post-processing Algorithm 3 for density computation took 907 seconds in sequential and 651 seconds in parallel (six-core) modes, respectively.
Fig. 5 shows the distribution of missing values in columns in the new data set (after elimination of missing data), which now has less ragged character.
5.4 Large Dense Biclusters Elimination and Classification Quality
We have conducted a number of machine learning experiments on our datasets to check the impact of eliminating missing data on the quality measures of supervised learning algorithms.
We choose to use gradient boosting on decision trees (GBDT). For this purpose, we selected two libraries where it is already implemented, CatBoost and LightGBM. Both implementations can handle missing values.
A genome can essentially be interpreted as a sequence of SNPs, so we made a decision to also use a Long Short-Term Memory (LSTM) network [lstm] as a strong approach to handling sequential data.
First dataset experiments. Firstly, we applied the GBDT algorithm from the CatBoost library to our initial dataset (before elimination of SNPs with missing genotypes). The following parameters were taken for the classifier:

Maximum number of trees: 3;

Tree depth limit: 3;

Loss function: binary cross-entropy (logloss).
We also applied the LSTM approach in the following way: the initial sequence was resized to 100 elements by a fully connected layer, then the layer output was passed to the LSTM module element-wise. The hidden state of the LSTM after the last element was passed to a fully connected classification layer.
The scores on this dataset were evaluated with 3-fold cross-validation with stratified splits. The basic classification metrics' scores are presented in Table 2.
                     accuracy   F1-score   precision   recall
CatBoostClassifier   0.966      0.9758     0.9558      0.9967
FC+LSTM              0.890      0.926      0.880       0.982
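The stratified splits used for these scores can be sketched without ML libraries (the labels below are hypothetical; in the actual experiments the libraries' own cross-validation utilities play this role):

```python
import random

def stratified_kfold(labels, k=3, seed=0):
    """Yield k folds of indices, preserving the class balance of labels."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):   # deal indices round-robin per class
            folds[j % k].append(i)
    return folds

labels = [0] * 9 + [1] * 6
for fold in stratified_kfold(labels, k=3):
    print(sorted(fold))  # each fold holds 3 of class 0 and 2 of class 1
```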
These unexpectedly high scores were unrealistic: the GBDT model had one of the lowest possible complexity configurations, and the LSTM model, which handles the data in a different way, also achieved high accuracy. For many samples, the model learned to "understand" on which chip a sample was analysed by looking at the patterns of missing genotypes, so a data leak was present.
Second dataset experiments. This dataset was obtained after the identification of large dense biclusters by application of our proposed algorithm with subsequent elimination. Table 3 recaps the experiments conducted on this dataset. For the first and second experiments, we used the CatBoost classifier with a train/test split in the proportion of 8:2 and with 3-fold cross-validation, respectively, while maintaining the balance of classes for model validation. In the third experiment, we used the LGBMClassifier with 3-fold cross-validation while maintaining the balance of classes for model validation. In the fourth experiment, the LSTM classifier described earlier was used with the aforementioned cross-validation.
                     no. trees   depth   accuracy   F1-score   precision   recall
CatBoostClassifier   2           2       0.715      0.834      0.715       1.000
                     5           2       0.773      0.862      0.761       0.995
                     5           3       0.773      0.862      0.761       0.995
CatBoostClassifier   4           3       0.768      0.859      0.990       0.759
                     5           3       0.768      0.859      0.990       0.759
LGBMClassifier       5           3       0.753      0.852      0.997       0.744
                     5           5       0.753      0.852      0.996       0.744
                     4           4       0.751      0.851      0.997       0.742
                     4           3       0.749      0.850      0.997       0.741
                     5           4       0.756      0.854      0.996       0.747
FC+LSTM              –           –       0.731      0.839      0.735       0.981
From Table 3, one can see that the scores are more realistic in comparison to those of Table 2, showing that the data leak and subsequent overfitting effects are gone. This confirms that our proposed biclustering algorithm successfully identified large submatrices with missing data, whose elimination removed the impact of the data leak and overfitting.
5.5 Detecting concepts of missing SNP values under size constraints
InClose4 is an open-source software tool [Andrews:2017], which provides a highly optimised algorithm from the CbO family [kuznetsov:1996, Janostik:2020] to construct the set of concepts satisfying given constraints on the sizes of extents and intents. InClose4 takes a context as input and outputs a reduced concept lattice: all concepts satisfying the constraints $|A| \geq a$ and $|B| \geq b$, where $A$ and $B$ are the extent and intent of an output formal concept, and $a$ and $b$ are the given minimal sizes.
To deal with our large real-world dataset, we changed the default maximum values of the following parameters in the InClose4 executable:
#define MAX_CONS 30000000
//max number of concepts
#define MAX_COLS 90000
//max number of attributes
#define MAX_ROWS 2000
//max number of objects
From Tables 4 and 5, one can see that the number of concepts generated by InClose4 becomes several times larger than that of OA-biclusters in our case study. When we set the minimal extent size constraint to 5 for the input context before elimination, and the minimal extent and intent size constraints to 20 and 0, respectively, for the context after the elimination of missing data, the software crashed. Meanwhile, our proposed biclustering algorithms managed to output all OA-biclusters in both cases.
As the author of InClose suggested in private communication, the tool was optimised for "tall" contexts with a large number of objects rather than attributes, while in bioinformatics the contexts are often "wide", as in our case, where the number of SNPs is almost 57 times larger than that of individuals. So, running on the transposed context along with properly set compilation parameters allowed us to process the whole context.
Min intent size   Min extent size   Total time, s   No. of concepts
0                 45                21.2            18,617
0                 40                23.6            34,400
0                 30                35.8            68,477
0                 20                46.1            165,864
0                 10                64.3            214,007
0                 5                 188.3           1,220,576
0                 0                 143.43          1,979,439
Min intent size   Min extent size   Total time, s   No. of concepts
0                 40                10.4            2,743
0                 30                10.6            4,196
0                 20                12.6            19,620
30                0                 5.8             352,257
25                0                 6.2             466,695
20                0                 7.4             695,962
15                0                 10.7            1,308,222
10                0                 18.3            3,226,277
Even if we do not know the number of output concepts for the context after elimination of missing SNP values, their number is more than 10 times larger than that of OA-biclusters, which might be considered as an argument in favour of using OA-biclusters for the studied problem with rather low or no size constraints.
6 Conclusion
A new approach to processing the missing values in datasets of SNP genotypes obtained with DNA microarrays is proposed. It is based on OA-biclustering. We applied the approach to real-world datasets representing the genotypes of patients with ischemic stroke and healthy people. It allowed us to carefully estimate and eliminate the SNPs with missing genotypes. The results of the OA-biclustering algorithm showed the possibility of detecting relatively large dense biclusters, which significantly helped in removing the effects of data leaks and overfitting while applying ML algorithms.
We compared our algorithm with InClose4. The number of OA-biclusters generated by our algorithm is significantly lower than the number of concepts (or biclusters) generated by InClose4. Besides, our algorithm has the advantage of using OA-biclusters without the need to experiment with finding the best minimal support, as in the case of using InClose4 for generating formal concepts.
Since the survey [Naulaerts:2015] mentions frequent itemset mining (FIM) as a tool to identify strong associations between allelic combinations associated with diseases, the proposed algorithm needs further comparison with other approaches from FIM, like DeBi [serin2011debi], and anytime discovery approaches, like Alpine [Hu:17], tested on GEA datasets as well; though their use may become complicated if we need to keep information about object names for decision-makers. It also requires further time complexity improvements to increase the scalability and quality of the extensive bicluster finding process for massive datasets.
Another avenue for related studies lies in Boolean biclustering [Michalak:2019] and factorisation techniques [Belohlavek:2019].
Speaking about other possible applications of biclustering, we suggest the development of a new imputation technique. Since biclustering has recently been applied to impute the missing values in gene expression data [chowdhury2020ncbi], and both GED and SNP genotyping data are obtained with DNA microarrays and represented as integer matrices, it can potentially be applied to impute the genotypes, which would facilitate statistical analyses and empower ML algorithms.
Acknowledgements.
This study was implemented in the Basic Research Program’s framework at the National Research University Higher School of Economics and the Laboratory of Models and Methods of Computational Pragmatics in 2020. The authors thank prof. Alexei Fedorov (University of Toledo College of Medicine, Ohio, USA) and prof. Svetlana Limborska (Institute of Molecular Genetics of National Research Centre “Kurchatov Institute”, Moscow, Russia) for insightful discussions of the results obtained, and anonymous reviewers.
Funding.
The study was funded by RFBR (Russian Foundation for Basic Research) according to the research project No 192901151. The foundation had no role in study design, data collection and analysis, writing the manuscript, and decision to publish.
References
Footnotes
 https://sourceforge.net/projects/inclose/
 http://conexp.sourceforge.net
 The equivalence classes are and .
 https://github.com/dimachine/OABicGWAS/
 The last line in Table 4 and the last five lines in Table 5 correspond to the experiments conducted for the final version of the paper on the transposed contexts.