An Online Learning Algorithm for a Neuro-Fuzzy Classifier with Mixed-Attribute Data
Abstract
The general fuzzy min-max neural network (GFMMNN) is an efficient neuro-fuzzy system for data classification. However, one downside of its original learning algorithms is their inability to handle and learn from mixed-attribute data. While encoding methods for categorical features can be used with the GFMMNN learning algorithms, they exhibit many shortcomings. Other approaches proposed in the literature are not suitable for online learning because they require the entire training data to be available in the learning phase. With the rapid change in the volume and velocity of streaming data in many application areas, constructed models are increasingly required to learn and adapt to continuous data changes in real time, without full retraining or access to historical data. This paper proposes an extended online learning algorithm for the GFMMNN. The proposed method can handle datasets with both continuous and categorical features. Extensive experiments confirmed the superior and stable classification performance of the proposed approach in comparison to other relevant learning algorithms for the GFMM model.
I Introduction
Classical batch learning algorithms usually require the complete data to be available at training time. These algorithms do not continuously incorporate new information into the built models. Instead, the model has to be reconstructed from scratch when the underlying data changes. This operation is time-consuming, especially for massive data, and the constructed models are likely to become outdated in dynamically changing environments. Take an advertising recommendation system as an example. Such a system constructs a customer preference model based on tracking information about the shopping and browsing behaviors of users. Buying activities and preferences are temporary and continuously changing. For example, a pandemic such as COVID-19 has dramatically changed the online shopping behaviors of customers, with people tending to purchase things they have never bought before. Consequently, learning models trained on consumer behavior data from before the pandemic have deteriorated or failed. As a result, these models need to be retrained on new (normal?) behavior data. In this context, and in many others characterised by streaming data in changing environments, it is desirable or even necessary to have online learning algorithms that can continuously learn new information without retraining from scratch.
With the increase in data volume and the rapid change of environmental conditions, online learning algorithms are in high demand [1, 2]. These algorithms require little or no data storage, as they need only one or a few of the newest training samples at a time to rapidly update the constructed model. Hence, online learning models are ideal candidates for systems that must be updated frequently. The general fuzzy min-max (GFMM) neural network [3, 4] is such an incremental learning model, which can be effectively used for data classification problems. This type of learning model combines artificial neural networks with fuzzy set theory to form a consolidated framework. The model creates new hyperboxes or adjusts existing hyperboxes to cover new samples in its structure. Each hyperbox is defined by its minimum and maximum points in an n-dimensional space. The degree-of-fit of an input pattern to a hyperbox is given by a membership function.
The GFMM neural network [3] is a significant enhancement of the FMNN [5]. Unlike the FMNN, the GFMM model can handle the uncertainty associated with the input data by accepting input patterns not only as single points but also as hyperboxes. In addition, it can handle both labeled and unlabeled data samples in a single model. The GFMMNN retains the online learning ability of the FMNN, using learning algorithms that make a single pass through the training samples to expand existing hyperboxes or create new ones. To avoid ambiguity in the classification phase, the original online learning algorithm proposed in [3] does not allow overlap between hyperboxes representing different classes. Therefore, after expanding a hyperbox to cover the input pattern, a hyperbox contraction procedure must be performed if there is an overlapping region between two hyperboxes belonging to different classes. However, the hyperbox contraction operation can lead to undesirable classification errors, as shown in the original paper and subsequent publications [3, 6, 7, 8, 9]. As a result, in a recent study, we proposed an improved online learning algorithm for the GFMM model (IOLGFMM) [9], which overcomes this limitation by not using the hyperbox contraction step during the learning process. This algorithm integrates the strong points of the batch learning algorithm proposed in [6] and the incremental learning ability of the original algorithm into a single algorithm.
However, both the original online learning algorithm [3] and the IOLGFMM algorithm [9] work well only on datasets with numerical features. To classify datasets with mixed-type features, we would need to use encoding methods to transform the categorical values into numerical values. As shown in a recent study [10], each encoding method has its own drawbacks and, except for the CatBoost [11] and label encoding techniques, all of the encoding approaches need the entire training set to encode the categorical features. Therefore, they are not appropriate for incremental learning algorithms, where new values can appear during operation. In addition, according to the empirical results in [10], the classification performance of the online learning algorithms for the GFMM model using the CatBoost or label encoding method is quite poor. This is because the label encoding method imposes an artificial distance metric on categorical groups, and this distance does not correspond to the correlation among the original categorical values [12]. Furthermore, the CatBoost encoding method is sensitive to the order in which training samples are presented, and a shift in the encoded values has been observed both between training and testing data and among training samples. For the same categorical value, its encoded value in the training data may differ from that in the testing data. Even within the training set, the same categorical value may be mapped to many different encoded values depending on the patterns seen before the current training pattern. The method proposed in this paper avoids all of these issues by not using any encoding methods for discrete attributes in the first place.
Many real-world datasets contain mixed-type features. Mixed-attribute data contain both continuous and discrete (or categorical) features. Such data are increasingly common in a wide range of applications, from credit approval to medical diagnosis [13]. Hence, to apply the GFMMNN to such problems, we need to extend its current learning algorithms so that they can deal effectively with mixed-attribute data. Although there are a large number of improved algorithms for the FMNN model, only two existing studies have focused on extending the learning algorithms to both categorical and numerical features, as shown in a recent survey paper [14]. The first study [15] (denoted by OnlnGFMMM1 in this paper) uses the correlation between the occurrence frequency of categorical values and classes to determine the similarity degrees among categorical values for each categorical feature. Based on these similarity degrees, the authors extended the original online learning algorithm [3] to mixed-attribute data. The second approach to extending the original online learning algorithm of the FMNN model to both numerical and categorical features was introduced in [16] and is called OnlnGFMMM2 in this paper. It uses the one-hot encoding method for the categorical features and logical operators such as AND and OR to operate on the categorical groups. However, the main weakness of both algorithms is that they use the entire training set to encode categorical values or to compute the similarity degree between them. If a value occurs that was not encountered during training, these algorithms cannot handle it and cannot produce a valid prediction. In contrast to these two approaches, this paper proposes a new incremental learning algorithm for both continuous and categorical features. The proposed method does not use any encoding methods for categorical values. Instead, it uses a set union operator to add a new categorical value to the current set of values in each categorical feature of a hyperbox. The decision on expanding a selected hyperbox to accommodate a new input pattern is based on the change in the entropy of each categorical feature. We also modify the membership function to handle both categorical and numerical attributes. The membership degree for the categorical features is computed as the average probability of the categorical values in the input sample with regard to the discrete values stored in the corresponding discrete features of the hyperbox. In short, the main contributions of this paper can be summarized as follows:

We propose a novel online learning algorithm for the GFMMNN that is able to learn from mixed-attribute data. To the best of our knowledge, this is the first online learning algorithm for the family of fuzzy min-max neural networks which can handle both continuous and categorical features without using any encoding methods.

We present and prove several properties of the proposed method with regard to the categorical/discrete attributes.

We conduct extensive experiments to prove the effectiveness of the proposed method in comparison to other relevant methods.

We assess the impact of hyperparameters on the classification performance of the proposed method and propose a simple method for the parameter estimation.
The rest of this paper is structured as follows. Section II briefly summarizes the architecture of the GFMMNN and its improved online learning algorithm. Section III describes the proposed method and its properties. Experimental results and discussion are presented in Section IV. Section V summarizes the key findings of this paper and outlines potential research directions.
II Preliminaries
II-A General fuzzy min-max neural network
The GFMMNN [3] is composed of three layers, i.e., input, hyperbox (hidden), and output layers. The input layer in the GFMM model can accept both real-valued point and interval (hyperbox-typed) input samples. If each input pattern has $n$ dimensions, there will be $2n$ nodes in the input layer, in which the first $n$ nodes are for the lower bounds and the remaining $n$ nodes represent the upper bounds. The hidden layer contains hyperboxes dynamically generated in the learning process. The connection weights between the lower bound nodes and a hyperbox form a vector storing the minimum coordinates of that hyperbox. Similarly, the connection weights from the upper bound input nodes to a hyperbox are represented by a vector containing the maximum coordinates of that hyperbox. The values of matrices V and W for all hyperboxes are tuned during the learning process. Each hyperbox in the hidden layer is fully connected to all output nodes. The connection weights between the hyperbox layer and the output layer are kept in a matrix U, and each of its elements is computed as follows:
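$$u_{ik} = \begin{cases} 1, & \text{if hyperbox } B_i \text{ represents class } c_k \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
where $c_k$ is the $k$-th class node in the output layer.
Each hyperbox $B_i = [V_i, W_i]$, where $V_i$ and $W_i$ are the minimum and maximum points respectively, is associated with a membership function $b_i(X)$. This membership function is used to calculate the degree-of-fit of each input pattern $X = [X^l, X^u]$ to the hyperbox $B_i$, where $X^l$ and $X^u$ are the lower and upper bounds of the input pattern, suitably normalised within the $n$-dimensional unit hypercube $[0, 1]^n$. The membership function is given as follows:
$$b_i(X) = \min_{j = 1, \ldots, n} \left( \min\left( \left[ 1 - f(x_j^u - w_{ij}, \gamma_j) \right], \left[ 1 - f(v_{ij} - x_j^l, \gamma_j) \right] \right) \right) \tag{2}$$
where $f$ is a ramp function defined in (3):
$$f(r, \gamma) = \begin{cases} 1, & \text{if } r\gamma > 1 \\ r\gamma, & \text{if } 0 \le r\gamma \le 1 \\ 0, & \text{if } r\gamma < 0 \end{cases} \tag{3}$$
with $\gamma = [\gamma_1, \ldots, \gamma_n]$ a sensitivity parameter controlling the decreasing speed of the membership degree, and $0 \le b_i(X) \le 1$. If $b_i(X) = 1$, then $X$ is fully contained in the core of the hyperbox $B_i$.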
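The membership computation described above can be sketched in a few lines. This is a minimal illustration, assuming the usual GFMM form in which the per-feature degree is one minus a ramp-clipped distance to the nearest hyperbox face and the overall degree is the minimum over features; the function names `ramp` and `membership` and the use of a single scalar sensitivity `gamma` are choices made here for brevity.

```python
def ramp(r, gamma):
    """Ramp threshold function: clips r * gamma to the interval [0, 1]."""
    return min(1.0, max(0.0, r * gamma))


def membership(x_low, x_up, v, w, gamma=1.0):
    """Degree-of-fit of the input [x_low, x_up] to the hyperbox [v, w].

    x_low, x_up, v and w are equal-length sequences of values in [0, 1];
    gamma is a single sensitivity value shared by all features (assumption).
    """
    degrees = []
    for xl, xu, vj, wj in zip(x_low, x_up, v, w):
        above = 1.0 - ramp(xu - wj, gamma)  # penalty for exceeding the maximum point
        below = 1.0 - ramp(vj - xl, gamma)  # penalty for falling below the minimum point
        degrees.append(min(above, below))
    return min(degrees)  # the worst feature determines the overall degree
```

A point inside the hyperbox gets a degree of exactly one, and the degree decreases linearly with the distance to the hyperbox at a rate set by `gamma`.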
II-B An improved online learning algorithm
The learning process in GFMM consists of creating and adjusting hyperbox fuzzy sets on the basis of the presented input patterns. A number of fundamental GFMM learning algorithms have been proposed in the literature, and they fall into one of two key categories: (i) incremental/online learning algorithms, where the hyperboxes are adjusted, if needed, after every presentation of a single pattern [3, 9], and (ii) batch learning algorithms, which assume the full training data is available to the training algorithm from the beginning of the training process [6, 17, 7]. The performance of the original online algorithm [3] is sensitive to the order of training data presentation and to the setting of the maximum hyperbox size hyperparameter. When an inappropriate maximum hyperbox size is selected and combined with the existing hyperbox contraction process, it can lead to undesired classification errors, as analysed and illustrated in [9]. Therefore, in a recent study, we proposed an improved version of the original online learning algorithm, in which, similarly to the agglomerative algorithms in [6], the contraction process is not used during the learning process. The algorithm contains only two main steps, i.e., creation or expansion of hyperboxes and the overlap test. In the original online learning algorithm, if a selected hyperbox candidate fulfills the expansion condition related to the maximum hyperbox size, it is expanded. Then, if the newly expanded hyperbox overlaps with existing hyperboxes belonging to other classes, the relevant hyperboxes are contracted. In contrast, in the IOLGFMM algorithm, if an undesired overlap would occur after the expansion, the selected candidate is not expanded.
For a training sample $X$, the algorithm first filters all existing hyperboxes with the same class as $X$. After that, the membership values between $X$ and these hyperboxes are computed and sorted in descending order. Hyperbox candidates are then checked against the expansion conditions, beginning with the hyperbox with the highest membership degree. If the maximum membership value is one, i.e. $X$ is contained in the hyperbox, the learning algorithm continues with the next training sample. Otherwise, the expansion condition checking process terminates only when there is a hyperbox candidate which can be expanded to cover $X$ or when no further hyperbox candidates exist. If none of the existing candidates can be expanded, a new hyperbox is generated with the same coordinates as $X$. The first expansion condition is the maximum hyperbox size. For the hyperbox candidate $B_i$, the maximum hyperbox size condition given in (4) is checked first:
$$\max(w_{ij}, x_j^u) - \min(v_{ij}, x_j^l) \le \theta, \quad \forall j \in [1, n] \tag{4}$$
where $n$ is the number of features and $\theta$ is the maximum hyperbox size. If this condition is met, the hyperbox $B_i$ is temporarily expanded to a new size as follows:
$$v_{ij}^{new} = \min(v_{ij}, x_j^l), \quad w_{ij}^{new} = \max(w_{ij}, x_j^u), \quad \forall j \in [1, n] \tag{5}$$
Then, the newly expanded $B_i$ is tested for undesired overlaps with all of the hyperboxes representing the other classes (i.e. classes different from the one associated with $B_i$). There are four overlap test cases shown in [3]. If no overlapping area occurs, the new size of $B_i$ is kept. Otherwise, $B_i$ is reverted to the coordinates it had before expanding, and the next hyperbox candidate is considered.
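The size check and the temporary expansion for numerical features can be sketched as follows. This is a minimal illustration, assuming the usual GFMM form of conditions (4) and (5), in which every expanded edge must stay within the maximum hyperbox size and the new minimum and maximum points are the element-wise minimum and maximum of the hyperbox and the input; the names `can_expand`, `expand`, and `theta` are hypothetical.

```python
def can_expand(x_low, x_up, v, w, theta):
    """Maximum hyperbox size test: every expanded edge must be at most theta."""
    return all(
        max(wj, xu) - min(vj, xl) <= theta
        for xl, xu, vj, wj in zip(x_low, x_up, v, w)
    )


def expand(x_low, x_up, v, w):
    """Return the new (minimum point, maximum point) after covering the input."""
    new_v = [min(vj, xl) for vj, xl in zip(v, x_low)]
    new_w = [max(wj, xu) for wj, xu in zip(w, x_up)]
    return new_v, new_w
```

Returning new coordinates instead of modifying the hyperbox in place makes the "temporary" expansion easy to discard when the subsequent overlap test fails.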
For an unseen pattern, its predicted class is the class of the hyperbox representing the highest membership value for that input pattern among all existing hyperboxes in the model. In the case when many hyperboxes representing different classes have the same maximum membership degree (), an additional criterion is used to find the appropriate class for . The final class of is the class with the highest score of given by:
(6) 
where and comprises the indexes of all hyperboxes with the maximum membership value of , is a subset of containing indexes of the  class, and is the number of training samples covered by hyperbox . We would like to refer the interested readers to references [9] and [18] for the detailed algorithm as well as its time complexity.
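The classification rule with its tie-breaking step can be sketched as below. This is an illustration under an assumption: the score of a class is taken to be the sum, over the tied hyperboxes of that class, of the number of covered training samples weighted by the membership value. The exact form of (6) should be taken from [9]; the function name `predict` and its arguments are hypothetical.

```python
from collections import defaultdict


def predict(memberships, labels, n_samples):
    """Return the class of the best-matching hyperbox.

    memberships[i], labels[i] and n_samples[i] describe hyperbox i. When several
    hyperboxes share the maximum membership value, the class whose tied
    hyperboxes cover the most training samples wins (assumed scoring rule).
    """
    best = max(memberships)
    winners = [i for i, b in enumerate(memberships) if b == best]
    scores = defaultdict(float)
    for i in winners:
        scores[labels[i]] += n_samples[i] * memberships[i]
    # Normalising by the total score would not change which class is largest.
    return max(scores, key=scores.get)
```

For example, with two hyperboxes of classes "A" and "B" tied at a membership of one and covering 3 and 7 samples respectively, the sketch returns "B".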
III Proposed Method
III-A Formal Description
Let be training patterns, where is the class of the  pattern, and are continuous attributes (determined in a unit hypercube ) of lower bound and upper bound for the  training sample, represent discrete attributes for the  training sample, is a categorical value of the  categorical feature () at the  training sample, , where is a domain of discrete values for the categorical attribute and is the number of symbolic values of . This paper proposes an online learning algorithm to train an efficient GFMM classifier from .
Architecture of GFMMNN for Mixed-Attribute Data
First of all, we need to expand the architecture of the GFMMNN for mixed-attribute data. Instead of using input nodes as in the GFMM model for continuous data, we will need nodes for the input layer. The first nodes are lower bound and upper bound nodes for numerical features, respectively. The last remaining nodes correspond to categorical features in each input pattern. These input nodes are connected to hyperboxes by connection weights stored in a matrix D. The new architecture of the GFMMNN is shown in Fig. 1. Besides the minimum points and the maximum points , each hyperbox in the hidden layer also contains a vector storing discrete-valued sets. Each element is a set of symbolic values with their cardinalities for the  categorical dimension of the hyperbox . For example, means that the first categorical feature of the hyperbox contains 5 values of apple and 1 value of orange. The values of vectors , , and for each hyperbox are generated and adjusted during the learning process. The membership function of each hyperbox with regard to each mixed-attribute input pattern is modified as follows:
(7) 
where is a trade-off factor regulating the contribution level of numerical features part and categorical features part to the membership score, and is a probability of encountering a symbolic value in the categorical attribute of the hyperbox . This probability is formally defined as follows:
(8) 
where is the cardinality of a set. For the above example, we obtain , , and . Unlike the numerical part in the membership function, we use an average operation for the categorical part to reduce sensitivity to the membership value. If we also use the operator for the categorical part, the membership value for the categorical features will get the value of zero when there is only one discrete feature getting a new symbolic value.
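The categorical part of the membership function can be sketched with a counter per categorical feature. This is a minimal illustration under two assumptions: the probability of a symbolic value is its relative frequency among the values stored in that feature of the hyperbox, and the overall membership is a convex combination of the numerical degree and the average categorical probability, weighted by the trade-off factor (called `eta` here, with 0 meaning categorical features only and 1 meaning numerical features only). The function names are hypothetical.

```python
from collections import Counter


def value_probability(value, counts):
    """Relative frequency of `value` among the values stored in one categorical feature."""
    total = sum(counts.values())
    return counts.get(value, 0) / total if total else 0.0


def mixed_membership(numeric_degree, x_cat, box_cat, eta=0.5):
    """Blend the numerical degree with the mean categorical probability.

    x_cat holds one symbolic value per categorical feature of the input;
    box_cat holds the matching Counter of values stored in the hyperbox.
    """
    probs = [value_probability(val, counts) for val, counts in zip(x_cat, box_cat)]
    categorical_degree = sum(probs) / len(probs)  # average, so one unseen value does not zero the score
    return eta * numeric_degree + (1.0 - eta) * categorical_degree


# The apple/orange example from the text: 5 apples and 1 orange in one feature.
fruit = Counter({"apple": 5, "orange": 1})
```

With this counter, `value_probability("apple", fruit)` is 5/6, `value_probability("orange", fruit)` is 1/6, and a value that has never been stored has probability zero.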
Extended Improved Online Learning Algorithm for Mixed-Attribute Data Classification
To create new hyperboxes or adjust existing hyperboxes when learning from mixed-attribute training samples in the GFMM model, we need to extend the improved online learning algorithm presented in Subsection II-B. The extended algorithm is denoted by EIOLGFMM in this paper. The proposed modifications include the expansion condition for categorical features, the way of accommodating a categorical value into the hyperbox, and the overlap test for categorical features.
For each training sample, , the algorithm first filters all of the existing hyperboxes representing the same class as . Then, the membership values of in these selected hyperboxes are calculated and sorted in descending order. After that, in turn, we select expandable hyperbox candidates starting from the hyperbox with the highest membership degree if the highest membership score is smaller than one. Assuming that is the currently considered hyperbox, the numerical features of are checked for the maximum hyperbox size condition as shown in (4). If the expansion condition for continuous features is satisfied, the algorithm continues to check the constraint for discrete features.
Entropy-based measures can be used to assess the heterogeneity of data in clusters, and they are appropriate for clustering of categorical data due to the lack of explicit distance measures between discrete values [19]. We propose to use the change in the entropy value of categorical values contained in the hyperbox to decide whether the current hyperbox can be expanded to accommodate the categorical values of a new training sample. Given a categorical attribute , let be the current entropy of hyperbox for the  categorical feature, computed from the probability of all current categorical values stored in the  attribute as follows:
(9) 
where is the number of different categorical values () in the  attribute, and is defined in (8). If we add a new sample to the hyperbox and most of the sample's categorical values already exist in the categorical attributes of the hyperbox, the change in the entropy of that hyperbox is small. In contrast, if most of the sample's categorical values are new symbolic values for the discrete attributes of the hyperbox, the homogeneity of this hyperbox changes significantly, and so the entropy increases. As a result, we can use the change in the entropy of the hyperbox as an expansion condition for categorical attributes. This entropy changing value is defined in (10) for each discrete feature of each hyperbox candidate .
(10) 
where is the entropy of the hyperbox on the  attribute after covering the input pattern , computed using (9), is the number of samples contained in the hyperbox ( is also equal to the summation of cardinalities of categorical values on the dimension ).
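The effect of a new sample on the entropy of one categorical feature can be illustrated numerically. The sketch below assumes Shannon entropy with a base-2 logarithm computed from the relative frequencies of the stored values. It only compares the entropy before and after adding a value and does not reproduce the exact entropy change measure in (10); the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (base 2, an assumption) of the value frequencies in one feature."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def entropy_after_adding(counts, value):
    """Entropy of the feature once `value` has been counted one more time."""
    updated = Counter(counts)  # copy, so the hyperbox itself is left unchanged
    updated[value] += 1
    return entropy(updated)


box = Counter({"apple": 5, "orange": 1})
before = entropy(box)                             # about 0.650
after_seen = entropy_after_adding(box, "apple")   # about 0.592
after_new = entropy_after_adding(box, "banana")   # about 1.149
```

In this example, adding a value that is already dominant leaves the entropy almost unchanged, whereas adding a previously unseen value raises it sharply, which is the behaviour the expansion condition relies on.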
Based on , we have two ways to construct the expansion condition for categorical attributes:

The first method is similar to the expansion condition for continuous features. The extended algorithm using this condition is denoted by EIOLGFMMv1 in this paper. We require the change in the entropy of every categorical attribute to be smaller than a maximum entropy changing threshold for all categorical dimensions:
(11) 
The second approach to building the expansion condition for categorical features uses a weaker condition than the first. The proposed online learning algorithm adopting this condition is called EIOLGFMMv2 in this paper. This method requires the average change in the entropy over all of the categorical attributes to be smaller than a maximum average entropy changing threshold :
(12)
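The difference between the two expansion conditions can be made concrete with a short sketch. It assumes that the per-feature entropy changes have already been computed and that "smaller than" is implemented as less than or equal to the threshold; the function and parameter names are hypothetical.

```python
def expandable_v1(entropy_changes, max_change):
    """Strict condition: every categorical feature must stay within the threshold."""
    return all(change <= max_change for change in entropy_changes)


def expandable_v2(entropy_changes, max_avg_change):
    """Weaker condition: only the average change over all features must stay within the threshold."""
    return sum(entropy_changes) / len(entropy_changes) <= max_avg_change


# One feature changes a lot (0.30), but the average change is only 0.15.
changes = [0.05, 0.30, 0.10]
```

With a threshold of 0.2, these changes fail the first condition because one feature exceeds the threshold, but satisfy the second because the average does not. Any set of changes accepted by the first condition is also accepted by the second for the same threshold.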
If both conditions for categorical and numerical features are met for the hyperbox , it will be temporarily expanded to new coordinates. The expansion of numerical features is performed using (5). Each categorical feature of is expanded as follows:
(13) 
where is a function to increase the number of elements of the categorical value by 1. After that, an overlap checking procedure is performed for the newly expanded to examine whether overlaps with any hyperboxes belonging to other classes. In the improved online learning algorithm for numerical features, only four overlap test cases are used as in the original online learning algorithm. However, these four cases are not sufficient to identify all potential overlap cases between two hyperboxes. Therefore, in this extended version, we will deploy a similarity measure between two hyperboxes based on their smallest gap introduced in [6] to check the overlap for numerical features between and other hyperboxes representing different classes. This similarity measure is defined in (14).
(14) 
where is the number of continuous features, is a ramp function given in (3). If and overlap with each other, ; otherwise, . If does not overlap with any representing other classes on the numerical features, we do not need to check the overlap conditions for their discrete features. Otherwise, we have to verify the overlap for the discrete features between and hyperboxes overlapping with on the continuous features. Let and be the set of categorical values on the  discrete attribute of two hyperboxes and , respectively. overlaps with on the  categorical feature if and only if:
(15) 
where is defined in (8). overlaps with on discrete attributes if the equation (15) is true for all of the categorical features of these two hyperboxes.
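The overlap check on numerical features can be sketched with a gap-based similarity. This is an illustration assuming the smallest-gap measure commonly used for agglomerative GFMM learning, in which the per-feature similarity is one minus the ramp-clipped gap between the two hyperboxes, so that a similarity of one indicates hyperboxes that overlap or touch on every feature. The names `ramp` and `gap_similarity` and the scalar `gamma` are assumptions, and the exact form of (14) should be taken from [6].

```python
def ramp(r, gamma):
    """Ramp threshold function: clips r * gamma to the interval [0, 1]."""
    return min(1.0, max(0.0, r * gamma))


def gap_similarity(v_i, w_i, v_k, w_k, gamma=1.0):
    """Similarity of hyperboxes [v_i, w_i] and [v_k, w_k] based on their smallest gap."""
    degrees = []
    for vi, wi, vk, wk in zip(v_i, w_i, v_k, w_k):
        gap_ik = 1.0 - ramp(vk - wi, gamma)  # positive gap when box k starts above box i
        gap_ki = 1.0 - ramp(vi - wk, gamma)  # positive gap when box i starts above box k
        degrees.append(min(gap_ik, gap_ki))
    return min(degrees)


def overlaps_numerically(v_i, w_i, v_k, w_k):
    """Two hyperboxes overlap on the numerical features when the similarity equals one."""
    return gap_similarity(v_i, w_i, v_k, w_k) == 1.0
```

Only hyperbox pairs for which `overlaps_numerically` is true need to go on to the categorical overlap test.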
If the hyperbox candidate does not overlap with any hyperboxes representing other classes on either categorical or continuous features, the new coordinates of remain unchanged and the algorithm continues with the next training sample. Otherwise, the coordinates of are reverted to the previous values and the hyperbox candidate with the next highest membership value is selected as an expandable hyperbox candidate, and the above steps are reiterated.
If none of the hyperbox candidates can be extended to accommodate the input pattern , a new hyperbox is generated as follows. For each numerical feature , we set , and for each categorical feature , we assign .
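The steps above can be put together in a schematic sketch of how one training sample is processed. This is not the authors' implementation: the membership function, the combined expansion condition, and the overlap test are passed in as callables, and the `Hyperbox` class, its field names, and the function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Hyperbox:
    v: list          # minimum point of the numerical features
    w: list          # maximum point of the numerical features
    d: list          # one Counter of categorical values per categorical feature
    label: object    # class represented by the hyperbox
    n_samples: int = 1


def new_hyperbox(x_low, x_up, x_cat, label):
    """Create a hyperbox that exactly covers a single training sample."""
    return Hyperbox(list(x_low), list(x_up), [Counter([val]) for val in x_cat], label)


def absorb(box, x_low, x_up, x_cat):
    """Return an expanded copy of `box` that also covers the sample."""
    return Hyperbox(
        v=[min(a, b) for a, b in zip(box.v, x_low)],
        w=[max(a, b) for a, b in zip(box.w, x_up)],
        d=[counts + Counter([val]) for counts, val in zip(box.d, x_cat)],
        label=box.label,
        n_samples=box.n_samples + 1,
    )


def learn_one(boxes, sample, label, membership, can_expand, overlaps):
    """Process one training sample (x_low, x_up, x_cat) and update `boxes` in place."""
    x_low, x_up, x_cat = sample
    # Candidates are the hyperboxes of the same class, best match first.
    candidates = [i for i, b in enumerate(boxes) if b.label == label]
    candidates.sort(key=lambda i: membership(boxes[i], sample), reverse=True)
    if candidates and membership(boxes[candidates[0]], sample) == 1.0:
        return boxes  # the sample is already covered
    others = [b for b in boxes if b.label != label]
    for i in candidates:
        if not can_expand(boxes[i], sample):
            continue
        trial = absorb(boxes[i], x_low, x_up, x_cat)
        if not any(overlaps(trial, other) for other in others):
            boxes[i] = trial  # keep the expansion
            return boxes
        # Otherwise the trial copy is discarded, which reverts the expansion.
    boxes.append(new_hyperbox(x_low, x_up, x_cat, label))
    return boxes
```

Working on a copy in `absorb` means that a rejected expansion needs no explicit rollback: the original hyperbox is simply left in the list.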
The classification phase of the EIOLGFMM algorithm is the same as in the original IOLGFMM algorithm. The way the EIOLGFMM algorithm works also makes its predictions explainable: each classification result can be traced back to the hyperbox with the maximum membership degree that was selected for the input pattern.
III-B Properties of the Change in Entropy of Categorical Features when Accommodating New Training Samples
This section presents several interesting properties related to the change of the entropy on each categorical attribute of a hyperbox when accommodating a new training sample .
Property 1.
When covering an input pattern, the change of the entropy on each discrete attribute of obtains its maximum value if and only if that attribute includes a new categorical value which does not exist in the list of its current categorical values. Formally,
(16) 
Proof.
See Section I in the supplemental material. ∎
Property 2.
The upper bound of the change in the entropy for every categorical dimension of depends on the current number of samples included in . That is:
(17) 
Proof.
See Section II in the supplemental material. ∎
Property 3.
The change of the entropy for each categorical dimension always falls in the range of [0, 1]:
Proof.
See Section III in the supplemental material. ∎
Property 3 also confirms that .
Property 4.
When the number of samples contained in approaches infinity, the change of the entropy for every categorical dimension will be limited at 0. Formally,
(18) 
Proof.
See Section IV in the supplemental material. ∎
Property 4 indicates that as the number of samples included in a hyperbox increases, the expansion condition for the categorical attributes of this hyperbox becomes easier to satisfy.
IV Experimental Results
The main purposes of the experiments in this section are to

Analyze the critical roles of parameters and on classification accuracy for the proposed method

Compare the performance of the proposed method to relevant approaches of GFMMNN for mixed-attribute data using fixed settings and tuning parameters

Assess several different methods to estimate the values of if we have sufficient samples at the training time.
These experiments were conducted on 14 datasets taken from the UCI machine learning repository.
IV-A Analyzing the Sensitivity of Parameters
There are three important parameters affecting the classification performance of the proposed method, i.e., the maximum hyperbox size for continuous attributes (), the tradeoff factor () regarding the contribution levels of continuous and discrete attributes to the membership function, and the maximum entropy changing threshold () for discrete attributes. The role of was analyzed in a recent study [21], in which smaller values of usually result in better performance than the use of larger values of does. However, the smaller values of are, the more complex the final model is (i.e. the larger number of generated hyperboxes). In this section, we only study the influence of two new parameters introduced in our proposed method, i.e., and .
Parameter
To evaluate the impact of on the performance of the EIOLGFMM algorithms, we varied the values of from 0 to 1 in steps of 0.1 and recorded the average CBA scores using stratified 4-fold cross-validation repeated 10 times for 11 mixed-attribute datasets. The impact of is studied for two cases, i.e., large-sized hyperboxes and small-sized hyperboxes. To obtain large-sized hyperboxes, we set the parameters so that the expansion process of hyperboxes is not constrained. To obtain small-sized hyperboxes, we used small values for both and , i.e., in this experiment. Fig. 2 shows the change in the CBA for different values of in the case of large-sized hyperboxes. For , the behaviors of the EIOLGFMMv1 and EIOLGFMMv2 algorithms are identical. Fig. 5 presents the change in the CBA results for different values of in the case of small-sized hyperboxes for both EIOLGFMM algorithms. We only present the results for a representative flag dataset. The results of the remaining datasets are presented in Figs. S1, S2 and S3 in the supplemental document.
In general, the CBA values of both proposed learning algorithms at (using categorical features only) and (using numerical features only) are usually smaller than the results of using both types of features. The impact of on the EIOLGFMMv1 and EIOLGFMMv2 algorithms is similar for many datasets. It can be observed that the influence of on the GFMM models with small-sized hyperboxes is significantly higher than on those with large-sized hyperboxes. This is demonstrated by the degree of oscillation in classification accuracy among different values of in Fig. 2 and Fig. 5. The reason is that the value of affects the results of the membership function, and the membership value in turn affects the selection of the final hyperbox for each unseen pattern. In the case of small-sized hyperboxes, the number of hyperboxes is high, and a small change in the membership value can lead to a significant change in the selected hyperbox.
As can be seen, the selection of can change the classification performance, and thus this parameter needs to be tuned in the learning process. However, for a large number of training samples, performing a hyperparameter tuning step for is time-consuming. In online learning, we also face another scenario, where we do not have sufficient samples at training time. In this case, we cannot conduct the tuning process using a cross-validation technique to select an appropriate for the learning algorithms. Therefore, we usually use a fixed setting for . From the empirical results in Figs. 2, 5 and Figs. S1, S2, and S3 in the supplemental document, it is interesting to observe that the highest CBA results are usually obtained for values of near the threshold . Therefore, in the case of using a fixed setting for , we will set . With this setting, each feature is treated as equally important in decision making.
Parameter
In this subsection, we assess the impact of the maximum entropy changing threshold () for discrete attributes on the classification performance. To rule out the influence of , we set so that numerical features can be expanded without any limitation. Based on the above experimental results, we used . Therefore, the performance of the learning algorithms depends only on the selection of . We varied from 0.05 to 0.1 and then in steps of 0.1 up to 1. The impact of on the EIOLGFMMv1 and EIOLGFMMv2 algorithms is illustrated in Fig. 8 for the flag dataset. The results for the remaining datasets can be found in Figs. S4 and S5 in the supplemental document.
From these results, it can be observed that the change in the CBA results for the EIOLGFMMv1 using is very small. For , its impact on the classification performance of the EIOLGFMMv1 on a number of datasets such as australian, heart, and post operative is significant, especially in the case of . In general, the classification error for in the EIOLGFMMv1 is relatively high.
For the EIOLGFMMv2, however, the change in CBA values is small for and . In contrast, the classification performance has significantly changed for the values of from 0.2 to 0.7. It can be seen that the impact of on the performance of the EIOLGFMMv2 algorithm is higher than that on the EIOLGFMMv1 algorithm. It is because the hyperbox expansion procedure for discrete features in the EIOLGFMMv2 algorithm can be performed much easier than that in the EIOLGFMMv1 algorithm. As a result, the number of generated hyperboxes in the EIOLGFMMv1 algorithm is higher than that of hyperboxes in the EIOLGFMMv2 algorithm, and so it can capture better the underlying data distribution. For the EIOLGFMMv1 algorithm, each discrete feature can only accommodate a new categorical value if the number of samples for the current categorical values is sufficiently large according to Properties 1 and 2. As a result the homogeneity for each categorical feature of hyperboxes in the EIOLGFMMv1 is higher compared to that in the EIOLGFMMv2. Therefore, the change in the classification performance among the different values of in the first version of the proposed method is smaller in comparison to the second version.
IV-B Comparing the Performance of the EIOLGFMM Algorithms with Other Methods Using the Fixed-Parameter Settings
Because we assume that a sufficient number of training samples will not be available up front in the considered online learning scenarios, as discussed earlier, we use a fixed setting for those hyperparameters that cannot be reliably tuned or optimised using the available data. This section assesses the proposed method against the other solutions for handling mixed-attribute data in the GFMMNN presented in [10], using the previously evaluated fixed values of hyperparameters. In particular, the proposed method is compared with two learning algorithms for the GFMM model that can handle mixed-attribute data, namely the OnlnGFMMM1 [15] and the OnlnGFMMM2 [16]. We also compare the proposed method with the original IOLGFMM algorithm combined with various encoding methods for categorical features.
Algorithms with Mixed-Attribute Learning Ability
In [10], the different learning methods were compared to each other using three different settings for , i.e., a small size , a large size , and an extreme case . To compare the proposed method to the previous solutions, we also used the same settings for the parameter. For the parameter, we set as recommended in [22]. In addition to and , the existing learning algorithms for the GFMMNN with mixedfeature handling ability have their own hyperparameters. The OnlnGFMMM1 algorithm depends on the parameter, which represents the maximum hyperbox size for discrete features. We used as shown in [10]. The OnlnGFMMM2 algorithm has the parameter to control the minimum number of categorical features matched between the selected hyperbox and the input pattern so that hyperbox can be expanded to cover the input pattern. Similarly to [10], we used of the total number of features for each dataset. To be fair in the comparison, we used the parameter and for the proposed learning algorithms.
The average CBA values over 10 repetitions of stratified 4-fold cross-validation with different parameter settings are shown in Table S.II in the supplemental document. It can be easily observed that in extreme cases (the largest values of parameters), our proposed method significantly outperforms the OnlnGFMMM1 and OnlnGFMMM2 algorithms. To facilitate the comparison of results, for each value of , we use the best results of the remaining parameter to rank the four algorithms over 14 datasets. For example, for , the OnlnGFMMM1 usually obtains the best performance using , the OnlnGFMMM2 achieves its best results with , and the two proposed methods attain their best results using . The average ranking of the algorithms using their best settings is shown in Table I.
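| Maximum hyperbox size | Method | Other parameters | Average rank |
|---|---|---|---|
| 0.1 | OnlnGFMMM1 | | 3 |
| 0.1 | OnlnGFMMM2 | | 3.214 |
| 0.1 | EIOLGFMMv1 | | 1.893 |
| 0.1 | EIOLGFMMv2 | | 1.893 |
| 0.7 | OnlnGFMMM1 | | 2.429 |
| 0.7 | OnlnGFMMM2 | | 3.857 |
| 0.7 | EIOLGFMMv1 | | 1.607 |
| 0.7 | EIOLGFMMv2 | | 2.107 |
| 1 | OnlnGFMMM1 | | 2.429 |
| 1 | OnlnGFMMM2 | | 3.857 |
| 1 | EIOLGFMMv1 | | 1.679 |
| 1 | EIOLGFMMv2 | | 2.036 |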
It can be seen that the classification performance of our proposed methods is better than that of the two existing algorithms with the mixed-attribute learning ability for all three maximum hyperbox size thresholds. To determine whether there are statistically significant differences among the algorithms, we carry out a non-parametric test procedure as recommended in [23], employing the Friedman rank-sum test with a confidence level of 95% (a significance level $\alpha = 0.05$). The null hypothesis is “there are no statistical differences between learning algorithms”; if this hypothesis is rejected, we perform the Nemenyi post-hoc test to determine the particular differences. For $N$ datasets and $k$ algorithms, the Friedman statistic $\chi_F^2$ is computed from the average rank $R_j$ of each algorithm as follows:
$$\chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4}\right] \tag{19}$$
From $\chi_F^2$, the statistic $F_F$, which follows an F-distribution with $k-1$ and $(k-1)(N-1)$ degrees of freedom, can be calculated using (20).
$$F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2} \tag{20}$$
The null hypothesis is rejected at the significance level $\alpha$ if $F_F$ is larger than the critical value of $F(k-1, (k-1)(N-1))$. In this experiment, we used 14 datasets and four learning algorithms, so $F_F$ is distributed according to the F-distribution with 3 and 39 degrees of freedom. The critical value of $F(3, 39)$ for $\alpha = 0.05$ is 2.845.
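As a worked check, the following sketch evaluates the Friedman statistic and its F-distributed correction in the standard form given in [23], using the average ranks reported in Table I. The function and variable names are illustrative assumptions rather than part of the original experiments, and because the tabulated ranks are rounded to three decimals, the computed statistics are only approximate.

```python
def friedman_statistics(avg_ranks, n_datasets):
    """Return (chi2_F, F_F) computed from the average ranks of k algorithms."""
    k = len(avg_ranks)
    n = n_datasets
    sum_sq = sum(r * r for r in avg_ranks)
    # Friedman statistic from average ranks.
    chi2_f = 12.0 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)
    # F-distributed correction with k-1 and (k-1)(n-1) degrees of freedom.
    f_f = (n - 1) * chi2_f / (n * (k - 1) - chi2_f)
    return chi2_f, f_f


# Average ranks from Table I, in the order
# OnlnGFMMM1, OnlnGFMMM2, EIOLGFMMv1, EIOLGFMMv2; keys are the first-column settings.
TABLE_I = {
    0.1: [3.0, 3.214, 1.893, 1.893],
    0.7: [2.429, 3.857, 1.607, 2.107],
    1.0: [2.429, 3.857, 1.679, 2.036],
}
CRITICAL_VALUE = 2.845  # F(3, 39) at the 0.05 level, as quoted in the text

for setting, ranks in TABLE_I.items():
    chi2_f, f_f = friedman_statistics(ranks, n_datasets=14)
    print(f"setting={setting}: chi2_F={chi2_f:.3f}, F_F={f_f:.3f}, "
          f"reject H0: {f_f > CRITICAL_VALUE}")
```

For all three settings the computed statistic exceeds the quoted critical value, which agrees with the rejection of the null hypothesis reported in the text.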
For , we obtain , and so the null hypothesis is rejected. This means that there are significant differences between the results of learning algorithms. Using the Nemenyi posthoc test, we obtain a critical difference (CD) diagram in Fig. 9. The groups of algorithms that are not significantly different from each other are connected by a solid line. We can see that our proposed methods are statistically better compared to the OnlnGFMMM2 algorithm with the selected settings. However, there is no statistically significant difference in the classification performance between the proposed methods and the OnlnGFMMM1 algorithm.
For , we have , and so the null hypothesis is also rejected. Applying the Nemenyi posthoc test, we have a CD diagram in Fig. 10. We can observe that there are no statistically significant differences among the obtained empirical results in the groups of two proposed learning algorithms and the OnlnGFMMM1 algorithm. However, the algorithms in this group significantly outperform the OnlnGFMMM2 algorithm.
Similarly, for , we obtain , and so the null hypothesis is rejected as well. Fig. 11 shows the CD diagram using the Nemenyi posthoc test. In this case, the statistical difference in the classification performance among the four methods is the same as in the case of .
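The pairwise conclusions read from the CD diagrams can be reproduced approximately from the average ranks alone. The sketch below assumes the usual Nemenyi rule from [23]: two algorithms differ significantly when their average ranks differ by more than the critical difference, which is the tabulated critical value multiplied by the square root of k(k+1)/(6N). The critical values are those tabulated in [23] for the 0.05 level; the helper names are hypothetical, and the ranks are taken from the first block of Table I.

```python
import math

# Two-tailed Nemenyi critical values at the 0.05 level, keyed by the
# number of compared algorithms k (values as tabulated in [23]).
Q_ALPHA_005 = {4: 2.569, 10: 3.164}


def critical_difference(k, n_datasets, q_alpha):
    """Smallest average-rank gap that counts as a significant difference."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


def significant_pairs(avg_ranks, n_datasets, q_alpha):
    """Return (better, worse) pairs whose rank gap exceeds the critical difference."""
    cd = critical_difference(len(avg_ranks), n_datasets, q_alpha)
    names = sorted(avg_ranks, key=avg_ranks.get)
    pairs = []
    for i, better in enumerate(names):
        for worse in names[i + 1:]:
            if avg_ranks[worse] - avg_ranks[better] > cd:
                pairs.append((better, worse))
    return pairs


# Average ranks from the first block of Table I (14 datasets).
table_i_first_block = {
    "OnlnGFMMM1": 3.0,
    "OnlnGFMMM2": 3.214,
    "EIOLGFMMv1": 1.893,
    "EIOLGFMMv2": 1.893,
}
print(critical_difference(4, 14, Q_ALPHA_005[4]))
print(significant_pairs(table_i_first_block, 14, Q_ALPHA_005[4]))
```

With these inputs only the two proposed algorithms are separated from the OnlnGFMMM2 by more than the critical difference, while their gap to the OnlnGFMMM1 stays below it.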
For the complexity of the resulting GFMM models using these learning algorithms, the average number of generated hyperboxes for each method is shown in Table S.III in the supplemental document. We can see that in most of the cases, the complexity of the model using the OnlnGFMMM2 algorithm is lowest, while the complexity of the GFMMNN using the EIOLGFMMv1 is highest. The number of generated hyperboxes using the EIOLGFMMv2 algorithm is usually smaller than that using the OnlnGFMMM1 algorithm.
Comparing the Proposed Method to the Original Learning Algorithm using Encoding Methods
In this subsection, we will compare the EIOLGFMM algorithms to the original IOLGFMM algorithm using different encoding techniques. In [10], there are eight encoding methods used to transform the categorical features into numerical features, i.e., LeaveOneOut (LOO), CatBoost, label, onehot, target, JamesStein, Helmert, and Sum encoding techniques. Similarly to the above experiment, we will consider three different thresholds for including 0.1, 0.7, and 1. To be fair in the comparison, we established the value of equal to .
The average CBA values over 10 times repeated stratified 4fold crossvalidation of these approaches are shown in Table S.IV, and their ranks are presented in Table S.V in the supplemental document. The average rank for these methods for different thresholds of is shown in Table II, in which the best results are highlighted in bold. In general, we can see that the average performance of our proposed method is better than that using the original IOLGFMM algorithm along with encoding techniques. One of the strong points of our proposed method is that it does not use any encoding method for discrete attributes.
| Method | Threshold 0.1 | Threshold 0.7 | Threshold 1 |
|---|---|---|---|
| IOLGFMM + CatBoost | 5.357 | 5.071 | 5 |
| IOLGFMM + Onehot | 8.179 | 8.036 | 7.536 |
| IOLGFMM + LOO | 4.857 | 4.786 | 5.643 |
| IOLGFMM + Label | 5.107 | 5.607 | 5.464 |
| IOLGFMM + Target | 5.464 | 4.464 | 5.429 |
| IOLGFMM + JamesStein | 5.250 | 4.536 | 5.214 |
| IOLGFMM + Helmert | 7.893 | 7.750 | 7.179 |
| IOLGFMM + Sum | 5.607 | 5.750 | 7.821 |
| EIOLGFMMv1 | 3.679 | **3.536** | **2.857** |
| EIOLGFMMv2 | **3.607** | 5.464 | **2.857** |
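Average ranks such as those in Table II are typically obtained by ranking the methods on every dataset (rank 1 for the highest CBA, tied methods sharing the mean of their positions) and then averaging the ranks over datasets. The following minimal sketch illustrates that procedure; the score matrix is hypothetical toy data, not results from this paper, and the function names are assumptions.

```python
def rank_descending(scores):
    """Rank scores so the highest gets rank 1; ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def average_ranks(score_table):
    """score_table holds one list of per-method scores for each dataset."""
    n_methods = len(score_table[0])
    totals = [0.0] * n_methods
    for row in score_table:
        for m, r in enumerate(rank_descending(row)):
            totals[m] += r
    return [t / len(score_table) for t in totals]


# Hypothetical CBA scores: 3 datasets (rows) x 3 methods (columns).
toy_scores = [
    [0.80, 0.75, 0.80],
    [0.60, 0.65, 0.55],
    [0.90, 0.85, 0.70],
]
print(average_ranks(toy_scores))
```

A useful sanity check is that the average ranks of m methods always sum to m(m+1)/2, which holds for each column of Table II (the ten ranks sum to 55).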
Similarly to the above experiments, we use a statistical test procedure to analyze the statistical differences among the methods. The critical value of the F-distribution with 9 and 117 degrees of freedom (10 methods and 14 datasets) at a significance level of 0.05 is 1.9608.
For , we obtain , and so there are statistically significant differences among methods. Using the Nemenyi posthoc test, we obtain a CD diagram in Fig. 12. We can see that the proposed method is significantly better than the original IOLGFMM algorithm using the onehot or Helmert encoding method. However, there are no statistical differences between the proposed methods and the IOLGFMM algorithm using the remaining encoding techniques.
For , we have , and so there are also statistically significant differences among methods in this case. Employing the Nemenyi posthoc test, we obtain a CD diagram in Fig. 13. In this case, the EIOLGFMMv1 is significantly better than the original IOLGFMM algorithm using the onehot or Helmert encoding method as well, but it does not statistically outperform the original algorithm using the remaining encoding approaches. Moreover, in this case, there is not sufficient evidence to conclude that the EIOLGFMMv2 is statistically better than the original algorithm employing encoding methods.
For the extreme case , we obtain , and so the null hypothesis is also rejected. Fig. 14 shows the CD diagram, in this case, using the Nemenyi posthoc test. It can be observed that there are no statistical differences among the original algorithms using encoding methods. Our proposed methods significantly outperform the original learning algorithm using the onehot, sum, or Helmert encoding method. However, there are no statistical differences between our proposed methods and the original IOLGFMM algorithm using the remaining encoding approaches.
IV-C Evaluating the Role of the Hyper-Parameter Tuning on the Performance of the EIOLGFMM Algorithms
HyperParameter Tuning and the Estimation of
When a large number of samples is available at the training time to build an initial model, we can perform a hyperparameter tuning step for , but this process is time-consuming. Meanwhile, the empirical results in subsection IV-A1 indicate a relation between suitable values of and the ratio of the number of continuous features over the total number of features. In this subsection, we propose a simple way to estimate the appropriate value of in a data-driven manner. The estimation method does not loop through predefined values of as in the tuning process, and so it runs faster than the hyperparameter tuning step for .
For each training fold , we split it into three inner folds. To estimate the value of for each training fold , we will repeat the learning process three times. Each time two inner folds are used to build the GFMM model and the remaining inner fold is used as a validation set. For each inner training fold , we split it into two separate parts, in which each part contains either continuous attributes or discrete attributes. Then, we will construct two separate GFMM models using the EIOLGFMM algorithm from these two training parts. After that, we compute the CBA value for each trained model using the inner validation fold . Let and be the CBA scores for the GFMM models trained on continuous features only and discrete features only, respectively. We have two ways to estimate the value of for each training fold . The first way uses both the CBA values and the number of features, denoted Estv1 in this paper, as follows:
(21) 
The second estimation way of uses only the obtained CBA values, called Estv2, as follows:
(22) 
Two ways of estimating are summarized in Fig. S6 in the supplemental material. After obtaining the value of , we use it to train a final model using the whole mixedattribute training fold and evaluate its performance using the  testing fold. The above process is repeated 40 times (10 times repeated stratified 4fold crossvalidation) to compare the average CBA values among different methods.
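The evaluation protocol of 10 repetitions of stratified 4-fold cross-validation yields 40 pairs of training and testing folds. The sketch below shows one simple way to generate such splits with the standard library only; it is an illustrative assumption about the splitting mechanics, not the authors' implementation, and the labels are hypothetical.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, n_splits=4, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each class is spread evenly over folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Deal the shuffled samples of this class round-robin to the folds.
            for pos, idx in enumerate(shuffled):
                folds[pos % n_splits].append(idx)
        for f in range(n_splits):
            test_idx = sorted(folds[f])
            train_idx = sorted(i for g in range(n_splits) if g != f for i in folds[g])
            yield train_idx, test_idx


# Hypothetical labels: 12 samples of class "a" and 8 samples of class "b".
labels = ["a"] * 12 + ["b"] * 8
splits = list(repeated_stratified_kfold(labels))
print(len(splits))  # 10 repeats x 4 folds
```

In practice, a library routine such as scikit-learn's `RepeatedStratifiedKFold` could serve the same purpose; the hand-written version is used here only to keep the example dependency-free.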
This section will compare the effectiveness of two above estimation methods to the fixed setting of and the parameter tuning method for . In the parameter tuning method, for each training fold , we split into three inner training folds. Two inner training folds are used to build the GFMM model using the proposed EIOLGFMM algorithm and the remaining fold is used as a validation fold. We will iterate this process three times to obtain three CBA values from three validation folds for each value . The value of resulting in the highest average CBA value over three inner validation folds is used to build the final GFMM model on the whole training fold , and the trained model is assessed by the CBA value on the  testing fold. The whole process is repeated 40 times for different training folds .
The average CBA results of 40 GFMM models trained using the proposed algorithms with two estimation methods of , the parameter tuning method and the fixed setting of for 11 datasets are shown in Table S.VI in the supplemental material. The rank for these methods is presented in Table S.VII in the supplemental document. It is noted that the results are reported over 11 out of 14 experimental datasets because these datasets contain both continuous and discrete features while three remaining datasets consist of only discrete features. The average rank over 11 datasets for different methods of finding the value of for the GFMM model trained using the proposed algorithm is shown in Table III. Similarly to subsection IVA1, we compare the methods of finding in two cases, i.e., smallsized hyperboxes () and largesized hyperboxes (). The best rank in each row is highlighted in bold. In the case of , the behavior of both EIOLGFMMv1 and EIOLGFMMv2 is the same, and so they lead to the same results.
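| Algorithm | Maximum hyperbox size | Tuning | Estv1 | Estv2 | Fixed setting |
|---|---|---|---|---|---|
| EIOLGFMMv1 | 0.1 | 2.727 | **2** | 3 | 2.273 |
| EIOLGFMMv2 | 0.1 | 2.545 | **2.318** | 2.727 | 2.409 |
| Both | 1 | **2.273** | 2.773 | **2.273** | 2.682 |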
We can observe that for small values of and , the estimation method using the CBA values from two separate models along with the number of features usually results in the best average CBA values in comparison to the second estimation method, the parameter tuning approach, and the fixed setting of for both learning algorithms. However, the second estimation method without using the number of features often leads to the worst results. Interestingly, in this case, the fixed value of shows slightly better results than the hyperparameter tuning method. In the case of generating the largest hyperbox sizes, the best predictive results belong to the models using the hyperparameter tuning method and the Estv2 method. Meanwhile, the first estimation method usually leads to the worst classification performance.
To explain these facts, we will examine the distribution of the obtained values of through 40 iterations and the change in the corresponding CBA values. Fig. 15 shows the distribution of the obtained values for different methods in the case of largestsized hyperboxes for the flag dataset. The results in the case of are presented in Fig. 18. Similar results for all of the remaining datasets can be found in Figs. S7, S8, and S9 in the supplemental material.
We can see that, in both cases, the hyperparameter tuning method returns a wide range of values for , in which the obtained median value of lies near the value resulting in the best classification result. In the case of small-sized hyperboxes, the deviation in the classification results among adjacent values of is high. Therefore, a wide range of values usually leads to a lower average classification result than a narrow range of values near the best results. We can see from Fig. 18 that the values obtained by the two estimation methods are distributed over a narrower range than those obtained by the hyperparameter tuning approach. Also, the range of the values obtained by the Estv2 method is wider than that of the Estv1 method. However, the range of the values obtained by the Estv1 is nearer to the value leading to the best classification performance than that of the Estv2. Hence, in this case, the Estv1 method usually gives the best classification results among the four methods.
In the case of largest-sized hyperboxes, the difference in the performance among different values of is small. Therefore, the wide range of values obtained by the hyperparameter tuning method regularly leads to a better average classification result than the other methods. As can be seen from Fig. S7 in the supplemental document, the two estimation methods return a narrower range of values than the hyperparameter tuning approach. However, in this case, the values obtained by the Estv2 usually lie nearer to the values leading to much better performance than those obtained by the Estv1 method. Therefore, the GFMM model using the Estv2 outperforms the one adopting the Estv1 method.
In short, the second estimation method is appropriate for the model having a small number of hyperboxes, while the first estimation method should be used in the case when the resulting model has a large number of hyperboxes.
Comparing the EIOLGFMM Algorithms to Other Algorithms with the MixedAttribute Learning Ability
In this experiment, we compare the classification performance of our proposed method and the two existing algorithms with the mixed-attribute learning ability, using the hyperparameter tuning procedure for the important parameters of each learning algorithm. For the value in all learning algorithms, we find its best parameter value in the range of for each training fold. The parameter for the OnlnGFMMM1 algorithm is searched in the range of . The searching range of the parameter for the OnlnGFMMM2 is of the total number of categorical features. For the two proposed algorithms in this paper, the parameter is searched in the range of , while the value is sought in the range of .
Each training fold is split into three inner folds, in which two inner folds are used for training a GFMM model using learning algorithms. Then, we use the remaining fold to obtain the CBA value. This process is repeated three times for every inner validation fold. The combination of parameters resulting in the best average CBA values through three validation folds is used to train the final GFMM model using the whole training fold . After that, this model is evaluated using the corresponding testing fold. This process is iterated 40 times (10 times repeated stratified 4fold crossvalidation) for each dataset. The average CBA results for four learning methods using the above hyperparameter tuning approach are shown in Table S.VIII and their ranks are presented in Table S.IX in the supplemental document. The average rank for each method over 11 mixedattribute datasets is shown in Table IV.
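The inner tuning loop described above can be sketched as a plain grid search over parameter combinations, scored on three inner validation folds. In the sketch below, `fit` and `score` are hypothetical callables standing in for a GFMM learning algorithm and the CBA metric, and the toy threshold classifier and its data are assumptions used only to make the example executable.

```python
from itertools import product


def tune_on_inner_folds(inner_folds, param_grid, fit, score):
    """Return the parameter combination with the best mean validation score.

    inner_folds: list of data chunks (here three) from one outer training fold.
    param_grid:  dict mapping a parameter name to its candidate values.
    fit(train, **params) -> model and score(model, valid) -> float are
    hypothetical stand-ins for the learner and the evaluation metric.
    """
    names = list(param_grid)
    best_params, best_mean = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for v, valid in enumerate(inner_folds):
            # Train on all inner folds except the current validation fold.
            train = [x for t, fold in enumerate(inner_folds) if t != v for x in fold]
            fold_scores.append(score(fit(train, **params), valid))
        mean = sum(fold_scores) / len(fold_scores)
        if mean > best_mean:
            best_params, best_mean = params, mean
    return best_params, best_mean


# Toy stand-ins: a 1-D threshold classifier and plain accuracy.
def fit(train, threshold):
    return lambda x: int(x > threshold)


def score(model, valid):
    return sum(model(x) == y for x, y in valid) / len(valid)


inner_folds = [
    [(0.1, 0), (0.9, 1)],
    [(0.2, 0), (0.8, 1)],
    [(0.4, 0), (0.6, 1)],
]
best_params, best_mean = tune_on_inner_folds(
    inner_folds, {"threshold": [0.3, 0.5, 0.7]}, fit, score
)
print(best_params, best_mean)
```

After the best combination is found, the final model would be retrained with it on the whole outer training fold and evaluated on the corresponding testing fold, as described in the text.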
| Algorithm | Average rank |
|---|---|
| OnlnGFMMM1 | 3.182 |
| OnlnGFMMM2 | 2.909 |
| EIOLGFMMv1 | 1.5 |
| EIOLGFMMv2 | 2.409 |
We can observe that the two proposed learning algorithms outperform the two existing learning algorithms with the mixed-attribute handling ability, with the best performance belonging to the EIOLGFMMv1 algorithm. From the experimental results in subsection IV-B1, we can see that the OnlnGFMMM1 algorithm is better than the OnlnGFMMM2 algorithm under the fixed-parameter settings. However, with the hyperparameter tuning method, the OnlnGFMMM2 algorithm outperforms the OnlnGFMMM1. This is because of the difference in distribution between the inner training set used to find the best combination of parameters and the training fold used to build the final model. The OnlnGFMMM1 needs the entire training data to compute the distance between categorical values based on the relationship between the occurrence frequency of discrete values and classes. These distance values are then used to build the membership functions. Therefore, when the training data change, the best combination of parameters found on the inner training folds no longer maintains its superior classification performance when used on the training fold . Our proposed methods do not use the training samples to build a similarity measure among categorical features, and so they still achieve the best performance, as in the case of the fixed-parameter settings.
Interestingly, the classification performance of the learning algorithms using the hyperparameter tuning method on several datasets such as cmc, zoo, australian, and japanese credit is worse than that obtained with the fixed-parameter settings presented in subsection IV-B1. This is because the representativeness and distribution of the inner validation sets used to find the best combination of parameters differ from those of the training and testing folds. Therefore, the best parameters obtained from the inner validation folds may not lead to the best classification accuracy on the testing set. As a result, the hyperparameter tuning method does not always result in better performance than the use of fixed parameters.
To verify the statistical difference in the performance among the learning algorithms, we use the Friedman rank-sum test described above. For 11 datasets and 4 learning algorithms, the test statistic is distributed according to the F-distribution with 3 and 30 degrees of freedom. The critical value at a significance level of 0.05 is 2.9223. In this case, we obtain . Therefore, there are statistically significant differences among the four algorithms considered. Using the Nemenyi post-hoc test, we obtain a CD diagram in Fig. 19.
We can see that there is a statistically significant difference in the classification performance between the EIOLGFMMv1 and the OnlnGFMMM1 algorithms in this case. At a significance level of 0.1, we can also conclude that the EIOLGFMMv1 algorithm is significantly better than the OnlnGFMMM2 algorithm. However, the EIOLGFMMv2 does not statistically outperform either of the existing learning algorithms with the mixed-attribute learning ability.
V Conclusion and Future Work
This paper presented a new online learning algorithm for the GFMMNN with mixed-attribute data. The proposed method extends the current membership function to cover both continuous and discrete features. We also extended the current architecture of the GFMM model for mixed-attribute data and introduced a new way of learning for categorical dimensions, based on the change in the entropy when accommodating new discrete values, without using any encoding methods. The experimental results confirmed the superior classification performance of our proposed method in comparison to the current solutions for handling mixed-type datasets with the GFMMNN.
Although the GFMMNN for mixed-attribute data can itself explain its predictions through the membership function used to select the appropriate hyperbox, future studies should extract and optimize if-then rule sets from the resulting hyperboxes for both continuous and discrete features to make the model user-friendly and easy to read. The interpretability of predictive models is a critical factor when applying machine learning algorithms to high-stakes applications such as medicine, finance, or criminal justice [24]. Furthermore, the classification accuracy depends on the selection of parameters, so future research should assess the use of optimization algorithms such as genetic algorithms [25] to evolve the hyperboxes and optimize their hyperparameters simultaneously. When online learning algorithms are applied in dynamically changing environments, they need to detect and adapt to changes in the underlying data distribution [4, 7, 26, 27, 28]. Therefore, one potential research direction is to integrate this adaptation ability into the proposed algorithm.
Acknowledgment
T.T. Khuat acknowledges FEIT-UTS for awarding his PhD scholarships (IRS and FEIT scholarships).
Footnotes
 https://archive.ics.uci.edu/ml/datasets.php
References
[1] B. Lakshminarayanan, D. M. Roy, and Y. W. Teh, “Mondrian forests: Efficient online random forests,” in Proceedings of the 27th International Conference on Neural Information Processing Systems, vol. 2, 2014, pp. 3140–3148.
[2] E. Lughofer and M. Pratama, “Online active learning in data stream regression using uncertainty sampling based on evolving generalized fuzzy models,” IEEE Transactions on Fuzzy Systems, vol. 26, no. 1, pp. 292–309, 2018.
[3] B. Gabrys and A. Bargiela, “General fuzzy min-max neural network for clustering and classification,” IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 769–783, 2000.
[4] B. Gabrys and A. Bargiela, “Neural networks based decision support in presence of uncertainties,” Journal of Water Resources Planning and Management, vol. 125, pp. 272–280, 1999.
[5] P. K. Simpson, “Fuzzy min-max neural networks. I. Classification,” IEEE Transactions on Neural Networks, vol. 3, no. 5, pp. 776–786, 1992.
[6] B. Gabrys, “Agglomerative learning algorithms for general fuzzy min-max neural network,” Journal of VLSI signal processing systems for signal, image and video technology, vol. 32, no. 1, pp. 67–82, 2002.
[7] ——, “Learning hybrid neuro-fuzzy classifier models from data: to combine or not to combine?” Fuzzy Sets and Systems, vol. 147, no. 1, pp. 39–56, 2004.
[8] A. Bargiela, W. Pedrycz, and M. Tanaka, “An inclusion/exclusion fuzzy hyperbox classifier,” International Journal of Knowledge-based and Intelligent Engineering Systems, vol. 8, no. 2, pp. 91–98, 2004.
[9] T. T. Khuat, F. Chen, and B. Gabrys, “An improved online learning algorithm for general fuzzy min-max neural network,” in Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), 2020, pp. 1–9.
[10] T. T. Khuat and B. Gabrys, “An in-depth comparison of methods handling mixed-attribute data for general fuzzy min-max neural network,” arXiv e-prints, p. arXiv:2009.00237, 2020.
[11] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin, “CatBoost: Unbiased boosting with categorical features,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS'18, 2018, pp. 6639–6649.
[12] R. K. Brouwer, “A feed-forward network for input that is both categorical and quantitative,” Neural Networks, vol. 15, no. 7, pp. 881–890, 2002.
[13] T. Huang, Y. He, D. Dai, W. Wang, and J. Z. Huang, “Neural network-based deep encoding for mixed-attribute data classification,” in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2019, pp. 153–163.
[14] T. T. Khuat, D. Ruta, and B. Gabrys, “Hyperbox based machine learning algorithms: A comprehensive survey,” Soft Computing, pp. 1–39, 2020.
[15] P. R. D. Castillo and J. Cardenosa, “Fuzzy min-max neural networks for categorical data: application to missing data imputation,” Neural Computing and Applications, vol. 21, no. 6, pp. 1349–1362, 2012.
[16] S. Shinde and U. Kulkarni, “Extracting classification rules from modified fuzzy min-max neural network for data with mixed attributes,” Applied Soft Computing, vol. 40, pp. 364–378, 2016.
[17] B. Gabrys, “Combining neuro-fuzzy classifiers for improved generalisation and reliability,” in Proceedings of the 2002 International Joint Conference on Neural Networks, vol. 3, 2002, pp. 2410–2415.
[18] T. T. Khuat and B. Gabrys, “Accelerated learning algorithms of general fuzzy min-max neural network using a novel hyperbox selection rule,” Information Sciences, vol. 547, pp. 887–909, 2021.
[19] T. Li, S. Ma, and M. Ogihara, “Entropy-based criterion in categorical clustering,” in Proceedings of the Twenty-First International Conference on Machine Learning (ICML), 2004, pp. 68–75.
[20] L. Mosley, “A balanced approach to the multi-class imbalance problem,” Ph.D. dissertation, Iowa State University, 2013.
[21] T. T. Khuat and B. Gabrys, “A comparative study of general fuzzy min-max neural networks for pattern classification problems,” Neurocomputing, vol. 386, pp. 110–125, 2020.
[22] S. Abe, “Dynamic fuzzy rule generation,” in Pattern Classification: Neuro-fuzzy Methods and Their Comparison. Springer London, 2001, pp. 177–196.
[23] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
[24] C. Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,” Nature Machine Intelligence, vol. 1, pp. 206–215, 2019.
[25] T. T. Khuat and M. H. Le, “A genetic algorithm with multi-parent crossover using quaternion representation for numerical function optimization,” Applied Intelligence, vol. 46, no. 4, pp. 810–826, 2017.
[26] Z. Sahel, A. Bouchachia, B. Gabrys, and P. Rogers, “Adaptive mechanisms for classification problems with drifting data,” in Proc. of the 11th International Conference on Knowledge-based Intelligent Engineering Systems (KES’2007). Springer, 2007, pp. 419–426.
[27] P. Kadlec and B. Gabrys, “Architecture for development of adaptive online prediction models,” Memetic Computing, vol. 1, no. 4, pp. 241–269, 2009.
[28] M. Salvador, M. Budka, and B. Gabrys, “Effects of change propagation resulting from adaptive preprocessing in multicomponent predictive systems,” Procedia Computer Science, vol. 96, pp. 713–722, 2016.