Explainable Ordinal Factorization Model: Deciphering the Effects of Attributes by Piecewise Linear Approximation
Abstract
Ordinal regression predicts the labels of objects whose classes exhibit a natural ordering, which is important to many managerial problems such as credit scoring and clinical diagnosis. In these problems, the ability to explain how the attributes affect the prediction is critical to users. However, most, if not all, existing ordinal regression models reduce such explanations to constant coefficients for the main and interaction effects of individual attributes. Such explanations cannot characterize the contributions of attributes at different value scales. To address this challenge, we propose a new explainable ordinal regression model, namely, the Explainable Ordinal Factorization Model (XOFM). XOFM uses piecewise linear functions to approximate the actual contributions of individual attributes and their interactions. Moreover, XOFM introduces a novel ordinal transformation process that assigns each object probabilities of belonging to multiple relevant classes, instead of fixing boundaries to differentiate classes. XOFM is based on Factorization Machines to handle the potential sparsity problem that results from discretizing the attribute scales. Comprehensive experiments with benchmark datasets and baseline models demonstrate that the proposed XOFM exhibits superior explainability and achieves state-of-the-art prediction accuracy.
1 Introduction
Ordinal regression (or ordinal classification) aims to learn a pattern for predicting the labels of objects (e.g., actions, items, products) when the labels exhibit a natural ordering [3]. Such problems differ from nominal classification problems because the ordering specifies the user's preference for each object [27]. For example, we can use an ordinal scale to estimate the condition of vehicles. The misclassification cost for assigning a vehicle to the class is greater than assigning it to the class. Taking this ordinal information into consideration leads to more accurate models. However, if standard nominal classification models are used without considering such information, we could obtain suboptimal solutions [12].
State-of-the-art methods for ordinal regression problems either transform the original problem into several binary ones or rely on threshold-based models, which approximate a preference value for each object and then use the trained thresholds to differentiate the classes. Ordinal regression plays an important role in many managerial decision-making problems such as clinical diagnosis [2], consumer preference analysis [10], nanoparticle synthesis assessment [15], age estimation [23], and credit scoring [16]. In these contexts, the ability to capture the detailed relationships between the predictions and different attribute value scales is as important as accuracy, because it helps users understand and utilize the underlying model.
Existing ordinal regression methods explain their results either by providing constant coefficients that measure the relative importance of the main and interaction effects of the attributes, as in the logistic-based models [22, 27], or by presenting the estimated thresholds and the proportions of each possible classification, as in the threshold-based models [30, 12] and ordinal binary decomposition approaches [3, 5]. Unfortunately, these methods cannot characterize the contributions of attributes at different value scales, which is critical to explaining how the model works. In addition, the boundary obtained by threshold-based methods often oversimplifies the condition, and thus may not work well for datasets where many objects are close to the boundary.
To fill the gaps mentioned above, we propose the Explainable Ordinal Factorization Model (XOFM) for ordinal regression problems. XOFM adopts the common assumption that there is a ‘preference’ value for each object [27]. XOFM uses piecewise linear functions to approximate the actual contributions of different attribute value scales to the prediction using the training data. The trained XOFM can then estimate preference values for each object in the testing set, and calculate the probabilities of multiple relevant classes by comparing with the training samples. In this way, XOFM generalizes the thresholds from fixed values to intervals. Eventually, XOFM assigns each object the label with the highest probability. Because XOFM discretizes an attribute’s value into multiple scales in the form of an attribute vector, it may lead to a sparsity problem. Thus, XOFM adopts a Factorization Machines (FMs)-based scheme to handle the sparsity problem [24]. The contributions are as follows:

The proposed XOFM introduces a novel ordinal regression transformation process that generalizes threshold-based ordinal regression procedures. It determines an interval for the thresholds based on the preference relationships among the objects. As a result, there is no need to initialize or estimate single values for the thresholds.

In addition to state-of-the-art prediction performance, XOFM is able to explain the actual contributions of the main effects and interaction effects at different attribute value scales by determining the shapes of piecewise linear functions. Such explainability provides detailed information that helps users decipher the relationships between attributes and predictions.

We formulate XOFM as an FMs-based scheme to handle the data sparsity problem, and extend FMs to handle ordinal regression problems. To the best of our knowledge, this is the first study that enhances the explainability of FMs in performing ordinal regression tasks.
2 Related Work
2.1 Ordinal regression
Ordinal regression approaches are commonly classified into three groups: naïve approaches, ordinal binary decomposition approaches, and threshold-based models [12]. The naïve approaches either do not consider the preference levels of classes (such as using the standard nominal classifiers), or map the class labels into real values (such as the support vector regression, SVR, model) [29]. However, mapping the class labels may hinder performance because the metric distances between ordinal scales are usually unknown [30].
Ordinal binary decomposition approaches solve the problem by decomposing an original ordinal regression problem into several binary classification problems. Cheng et al. [3] proposed a neural network-based method for ordinal regression (NNOR). The method decomposes and encodes the class labels, then trains a single neural network model to classify the objects. Similarly, the extreme learning machine (ELMOR), a single-layer feedforward neural network-based model, has also been adapted to ordinal regression problems [5]. More recently, CNNOR, a convolutional neural network (CNN)-based model, was proposed to handle ordinal regression problems. CNNOR is unique in being able to handle small datasets given the CNN structure [18]. These methods achieve good performance, but are limited in model explainability given their neural network scheme.
Threshold-based models are the most popular approaches for ordinal regression problems. They assume that there is a function measuring the ‘preference’ values of the objects and compare the preference values to a set of thresholds, which are either predefined or estimated from data. Following this common assumption, the proportional odds model (POM) uses a standard logistic function to predict the probability of an object being classified to a class [22]. POM has become the standard ordinal regression method, and the basis for most follow-up threshold-based models [12]. Typical examples include the ordinal logistic model with immediate-threshold variant (LIT) and all-threshold variant (LAT) [27], and kernel discriminant learning for ordinal regression (KDLOR) [30]. Unfortunately, none of these models can decipher the contributions of the attributes at different specific value scales.
2.2 Factorization Machines
Factorization machines (FMs) combine the advantages of support vector machines with factorization models. FMs factorize the interaction parameters instead of directly estimating them, which gives them an advantage in handling sparse data [24]. FMs have been widely applied to various machine learning problems including recommender systems [26], click prediction [33], and image recognition [17]. FMs can be used for ranking problems given the aforementioned characteristics. However, few studies have applied FMs to ordinal regression due to the difficulty in transforming ordinal regression problems into a ranking form. The ordinal factorization machine with hierarchical sparsity (OFMHS) utilizes FMs for ordinal regression by formulating it as a convex optimization problem. In particular, OFMHS can model the hierarchical structure behind the input variables [11]. Although the method achieves state-of-the-art performance, it does not explore the detailed relationship between the attributes and predictions.
2.3 Explainable Models
There are various definitions of model explainability. Here we focus on enhancing model induction as defined in [9]. Specifically, an explainable model should be able to characterize the contributions of the individual attributes and reveal the interaction effects of the attributes [19]. Prior work uses score (shape) functions to measure the main effects and maps the interaction effects to real values [20]. Finally, different link functions connect these score functions to various machine learning tasks [32, 31]. Unfortunately, these powerful explainable models cannot be directly adopted for ordinal regression problems. In this study, we use piecewise linear score functions to characterize the contributions of individual attributes. The mapping functions explore the interaction effects of the attributes, and the link function measures the ‘preference’ values of the objects.
3 Explainable Ordinal Factorization Model
3.1 Preliminaries
Consider an ordinal regression problem that concerns a set of objects , where , and the class label . The classes are in a natural ordering , where indicates that the objects in the class are preferred to those in . An attribute interaction, , is a subset of all attributes: . We denote be the set of all order interactions. More specifically, if , is the set of all pairwise interactions.
We assume the ‘preference’ value of each object is determined by a link function:
(1) 
where is the predefined highest order of interactions. denotes score function of th attribute and denotes a mapping function of the interacting attributes. We use a piecewise linear function to estimate each score function because any nonlinear functions can be approximated by sampling the curves and interpolating linearly between the points [13]. Let , where and , be the whole value scale of the th attribute. For each attribute, we partition the scale into equal subintervals , where are called characteristic points.
Definition 1.
The attribute vector of object is defined as follows:
(2) 
where and
Definition 2.
The marginal score vector is defined as:
(3) 
where and , is the difference between two consecutive characteristic points.
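The piecewise linear encoding behind Definitions 1 and 2 can be illustrated with a minimal Python sketch. It assumes one common encoding, in which an entry of the attribute vector is 1 for every sub-interval lying entirely below the attribute value, a fraction for the sub-interval containing it, and 0 otherwise, while the marginal score vector holds the score increments between consecutive characteristic points. The function names `attribute_vector` and `marginal_score` and the sample numbers are hypothetical, not taken from the paper.

```python
import numpy as np

def attribute_vector(x, lo, hi, n_sub):
    """Encode a scalar x on [lo, hi] over n_sub equal sub-intervals.

    Entry k is 1 if x lies beyond sub-interval k, the fractional position
    of x if it lies inside sub-interval k, and 0 if x lies before it.
    (Assumed encoding, for illustration only.)
    """
    width = (hi - lo) / n_sub
    pos = (np.clip(x, lo, hi) - lo) / width          # position in units of sub-intervals
    return np.clip(pos - np.arange(n_sub), 0.0, 1.0)

def marginal_score(x, lo, hi, increments):
    """Piecewise linear score: dot product of the attribute vector with
    the score increments between consecutive characteristic points."""
    increments = np.asarray(increments, dtype=float)
    return float(attribute_vector(x, lo, hi, len(increments)) @ increments)

# Hypothetical attribute on [0, 4] split into 4 sub-intervals.
vec = attribute_vector(2.5, 0.0, 4.0, 4)
score = marginal_score(2.5, 0.0, 4.0, [0.2, 0.1, 0.4, -0.3])
```

Under this encoding, the dot product coincides with linear interpolation between the cumulative scores at the characteristic points, which is why the learned increments can be read directly as the shape of the score function.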
Given Definitions 1 and 2, the first term in Eq.(1), namely the main effects of the attributes, can be reformulated. As for the interaction effects part, we consider the pairwise interactions () because this is the most common situation. Since directly estimating the individual mappings of may cause data sparsity problem [24], we model these interactions by factorizing them. The new link function is as follows:
(4) 
where is a dot product of size , and is the th element in vector . The Eq.(4) is in the form of FMs. The only difference is that the Eq.(4) does not have a global bias term, which does not affect the predictions (please refer to the following section).
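To make the FM form of the link function concrete, the sketch below scores one object with a linear term plus factorized pairwise interactions and no global bias. The names `fm_score`, `x`, `w`, and `V` are illustrative assumptions: `x` stands for the concatenated attribute vectors of one object, `w` for the stacked marginal score vectors, and each row of `V` for the latent factor vector of one discretized scale. The pairwise term uses the standard linear-time identity for factorization machines.

```python
import numpy as np

def fm_score(x, w, V):
    """Second-order FM score without a global bias term.

    x: (n,) input vector, w: (n,) linear weights, V: (n, k) latent factors.
    The pairwise term sum_{i<j} <V[i], V[j]> x[i] x[j] is computed in O(n k).
    """
    linear = x @ w
    xv = x @ V                       # (k,): sum_i V[i, f] * x[i]
    x2v2 = (x ** 2) @ (V ** 2)       # (k,): sum_i V[i, f]^2 * x[i]^2
    pairwise = 0.5 * np.sum(xv ** 2 - x2v2)
    return float(linear + pairwise)

def fm_score_naive(x, w, V):
    """Reference O(n^2 k) implementation, used only to check fm_score."""
    total = float(x @ w)
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            total += float(V[i] @ V[j]) * x[i] * x[j]
    return total

rng = np.random.default_rng(0)
x = rng.random(6)
w = rng.normal(size=6)
V = rng.normal(size=(6, 3))
fast, slow = fm_score(x, w, V), fm_score_naive(x, w, V)
```

Because most entries of the attribute vectors are 0 or 1 after discretization, factorizing the interaction weights through `V` lets scale pairs that rarely co-occur in the training data still receive an interaction estimate.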
3.2 Transform Ordinal Regression Problems
We introduce a new process for transforming ordinal regression problems. The process determines the labels of the objects in the testing set by comparing their ‘preference’ values to those of the objects in the training set [7]. The boundaries of each class are learned from these comparisons and are defined by intervals instead of fixed real values, which generalizes the threshold-based ordinal regression procedures.
Definition 3.
Given a link function , the ‘preference’ relationship between two objects is concordance with if and only if:
(5) 
and equivalently,
(6) 
where is a predefined positive margin.
Definition 3 assumes the labels of objects are concordant with their ‘preference’ values (scores), i.e., the greater the value of , the more likely the object is to be assigned to a better class. Such information is helpful for constructing loss functions for ordinal regression problems. Given the following definition, we determine the multiple relevant classes for objects :
Definition 4.
Given the link function , the class interval of an object is , where:
(7) 
and
(8) 
Proposition 1.
Given the link function , the interval of object is not empty.
Proof.
If , then , thus and the interval is not empty. Analogously, if , then , thus and the interval is not empty.
We prove it when and by contradiction. Assume , we have and such that . As stated in Definition 3, indicates that . Since , thus . Note that and , thus , which contradicts the assumption and concludes the proof. ∎
Given Definition 3 and Proposition 1, XOFM always provides an interval for each object. The interval contains either a single class or multiple relevant classes. The users can determine the final class based on either their domain knowledge or the following indicator. We define an indicator that favors the classification :
Definition 5.
Given a class interval , if , then . If not, the indicator , where and
Definition 5 provides a proportion of objects that are classified to a class either worse or better than . Obviously, the greater is, the more likely that . Hence, we classify to the class with the maximal .
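The class interval of Definition 4 can be sketched in Python under one plausible reading, in the spirit of the example-based sorting procedure of [7]: the lower end is the best class among training objects whose score does not exceed that of the test object, and the upper end is the worst class among training objects whose score is at least as high. Labels are assumed to be integers 1..`n_classes` with larger meaning better, the margin is ignored, and the function name `class_interval` and the sample data are hypothetical. The indicator of Definition 5 is not implemented here.

```python
import numpy as np

def class_interval(score, train_scores, train_labels, n_classes):
    """Return (lower, upper) class bounds for a test object.

    Assumed reading: lower = max label among training objects scoring <= score
    (1 if none), upper = min label among training objects scoring >= score
    (n_classes if none).
    """
    train_scores = np.asarray(train_scores, dtype=float)
    train_labels = np.asarray(train_labels, dtype=int)
    below = train_labels[train_scores <= score]
    above = train_labels[train_scores >= score]
    lower = int(below.max()) if below.size else 1
    upper = int(above.min()) if above.size else n_classes
    return lower, upper

# Hypothetical training scores that are concordant with their labels.
train_scores = [0.1, 0.4, 0.5, 0.9]
train_labels = [1, 2, 2, 3]
single = class_interval(0.45, train_scores, train_labels, 3)   # between two class-2 objects
spread = class_interval(0.70, train_scores, train_labels, 3)   # between class 2 and class 3
```

When the training scores are concordant with the labels, the lower bound never exceeds the upper bound, which matches the non-emptiness claim of Proposition 1. A test object that falls between two neighboring classes receives an interval rather than a forced single label.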
Proposition 2.
The proposed transformation procedure for ordinal regression generalizes the threshold-based procedure. The classifications determined by the proposed procedure can be obtained by fixing the thresholds within some intervals instead of at single values.
Proof.
From Definition 3, the ‘preference’ relationship between two objects is concordant with the value of the link function , i.e., . Assume the thresholds and are the upper and lower bounds of , respectively. If , then any object such that will be classified to class by the threshold-based procedure. Considering testing samples , we have two cases:
(1) and . Given Definition 4, and the proposed procedure will classify to class . Moreover, since , the threshold-based procedure will always classify to as long as the thresholds and satisfy and .
(2) and . Given Definition 4, and , the proposed procedure will classify to an interval . Moreover, if , the threshold-based procedure will classify to , and if the classification will be . In both cases, the threshold-based procedure fixes single values for the thresholds, and classifies an object to a single class within an interval that can be derived from the proposed procedure. Therefore, the proposed procedure is a general form of the threshold-based procedure. ∎
3.3 Learning XOFM
The parameters in the proposed XOFM, i.e., can be estimated under a standard FMs-based scheme with the following loss function:
(9) 
where if . All pairwise comparisons with less than the predefined margin are penalized. Since the loss depends only on the difference between the scores of two objects, the global bias term in traditional FMs can be discarded. The model parameters can be learned by gradient descent methods. The gradient of the parameters in XOFM is [25]:
(10) 
For direct optimization of the loss function, the derivatives are:
(11) 
The computational complexity for the training process is linear while the computational complexity for preprocessing the data is . We can use the standard optimization algorithms that have been proposed for other machine learning models to estimate the parameters in XOFM. Stochastic gradient descent (SGD) is an iterative method for optimizing an objective function with smoothness properties [28]. Since the loss function in XOFM is convex, it is suitable to use the SGD algorithm to optimize the parameters. Nevertheless, other more complex optimization algorithms such as advanced stochastic approximation algorithms and Markov Chain Monte Carlo inference are also suitable for the proposed XOFM. Algorithm 1 shows how to apply SGD to optimize XOFM.
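The pairwise loss described above can be illustrated with a short Python sketch. It assumes a hinge-style penalty: for every pair in which the first object has the better label, the loss is the amount by which the score difference falls short of the margin. The function name `pairwise_hinge_loss`, the parameter `margin`, and the sample scores are hypothetical, and the exact form of Eq. (9) may differ.

```python
import numpy as np

def pairwise_hinge_loss(scores, labels, margin=1.0):
    """Sum of max(0, margin - (f(a) - f(b))) over pairs where a is preferred to b.

    scores: (n,) link-function values, labels: (n,) integer classes
    (larger label assumed to mean a better class).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    diff = scores[:, None] - scores[None, :]        # diff[a, b] = f(a) - f(b)
    preferred = labels[:, None] > labels[None, :]   # True where a is preferred to b
    violations = np.maximum(0.0, margin - diff) * preferred
    return float(violations.sum())

scores = np.array([0.0, 2.0, 0.5])
labels = np.array([1, 3, 2])
loss = pairwise_hinge_loss(scores, labels, margin=1.0)
shifted_loss = pairwise_hinge_loss(scores + 5.0, labels, margin=1.0)
```

Adding a constant to every score leaves the loss unchanged, which illustrates why a global bias term carries no information for this objective.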
3.4 Explainable Model and Regularization
XOFM uses piecewise linear functions to approximate the actual contributions of the attributes at different value scales. More specifically, the vector characterizes the main effects by presenting a score function for each individual attribute over its value scale. The parameters in matrix decipher the interaction effects of discretized value scales between two attributes. These parameters can help us understand the pairwise interactions via an interaction matrix (visualized as a heatmap), in which the whole area is divided into small blocks and each block represents the interaction effect of the corresponding intervals of attribute scales.
To avoid the overfitting problem, we can modify the loss function of XOFM by adding regularization terms for both the main and interaction effects:
(12) 
where is the Frobenius norm. In addition to avoiding overfitting, the regularization terms can constrain the shape of the score functions and adjust the effect of the attribute interactions. Obviously, determines the complexity of the score functions of individual attributes. When increases, the model tends to penalize the difference between two consecutive characteristic points, thus the score functions change more smoothly. In contrast, determines the impact of the attribute interactions on the link function. A larger leads to less intense attribute interactions. The values of and can be predefined in accordance with the users’ domain knowledge. For instance, if the user is confident that the involved attributes are usually irrelevant to each other, can be set to a large value.
The capacity to explain the results makes XOFM helpful for managerial problems. For example, physicians need accurate readmission prediction models that can reveal the detailed effects of risk factors for individual patients. By visualizing the main and interaction effects of the risk factors, it is easier for physicians to examine the consistency between the underlying model and their prior knowledge. Such explainability is important to test the rationality of a prediction model.
3.5 XOFM with monotonicity constraints
In some real-world decision problems, the user’s prior knowledge about some attributes, for instance their monotonicity, must be respected [14]. The proposed XOFM can also be adapted to cases where the attributes are restricted to be monotonic. Note that in these cases, the monotonicity of the interaction effects of two monotonic attributes should also be maintained [8]. For this purpose, we reformulate the original XOFM with additional constraints:
(13) 
Many methods can be used to optimize Problem (13). In this study, we first substitute by and do not impose constraints on . In this way, the problems regarding are unconstrained, thus standard gradient descent algorithms can be used for optimizing and . We then use the projected gradient methods to optimize the parameters in [21]. More specifically, the constraints on dot products can be replaced with the following constraints regarding :
(14) 
where is the th element in vector . Note that the feasible region in Problem (14) is a convex subset of the feasible region in Problem (13). Although the new constraints are stricter, they are simpler to solve [34]. To solve Problem (13), we define an indicator function: and rewrite Problem (14) as:
(15) 
Note that the is replaced by in . In Problem (15), we first optimize the differentiable part of the objective function and then use a Euclidean projection to ensure that the solutions are in the feasible region. We add an extra step between steps 8 and 9 in Algorithm 1, i.e., , where for all .
We apply this specific model to a real dataset; the experimental results, including the obtained marginal value functions and all pairwise interaction effects, are presented in the online version.
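The projection step can be sketched as follows, under the assumption that the simplified constraints require every element of the latent factor vectors to be nonnegative, so that all pairwise dot products are nonnegative as well. In that case the Euclidean projection onto the feasible region is element-wise clipping at zero. The function name `projected_gradient_step` and the numbers are hypothetical, and the actual constraint set of Problem (14) may differ.

```python
import numpy as np

def projected_gradient_step(V, grad, lr):
    """One gradient step on the factor matrix V followed by the Euclidean
    projection onto the nonnegative orthant (element-wise max with 0)."""
    return np.maximum(V - lr * grad, 0.0)

V = np.array([[0.2, -0.1],
              [0.05, 0.3]])
grad = np.array([[1.0, 0.0],
                 [0.0, -1.0]])
V_new = projected_gradient_step(V, grad, lr=0.1)

# With nonnegative factors, every pairwise dot product is nonnegative too.
interaction = float(V_new[0] @ V_new[1])
```

Clipping is a stricter condition than requiring only the dot products to be nonnegative, which mirrors the trade-off noted in the text between a smaller feasible region and a simpler optimization problem.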
4 Experimental Analysis
4.1 Experimental Design
We evaluate the proposed XOFM on seven benchmark datasets and compare its performance with that of state-of-the-art baseline ordinal regression methods. The characteristics of the datasets are presented in Table 1 and the selected baselines are presented in Table 2. (Abalone, Auto Riskiness, Boston Housing, Stock and Skill are downloaded from https://www.gagolewski.com/resources/data/ordinalregression/, the Breast data is downloaded from UCI [6], and the Chinese University data is collected from http://www.shanghairanking.com/Chinese_Universities_Rankings/.) A five-fold cross-validation process is used to train the models. Note that for consistency, we do not add any regularization terms in this experiment and the parameters in XOFM are and . We use the standard stochastic gradient descent algorithm to optimize XOFM, but other algorithms introduced in [25] can also be applied. After the parameters in each model are determined, we randomly select 80% and 20% of the data as the training set and testing set, respectively, and then average the results over 30 trials to evaluate the performance. The code for the proposed XOFM is attached for review purposes, and will be made publicly available on GitHub.
Table 1: Characteristics of the datasets.
Dataset  #Obj.  #Attr.  #Classes

abalone ord (AO)  4,177  7  8 
auto riskiness (AR)  160  15  6 
breast (BR)  106  9  6 
Boston housing ord (BHO)  506  13  5 
Chinese university (CU)  600  10  5 
stock ord (SO)  950  9  5 
skill (SK)  3,302  18  6 
Table 2: Baseline methods and the proposed model.
Abbr.  Short description

LIT  Ordinal logistic model with immediate-threshold variant [27].
LAT  Ordinal logistic model with all-threshold variant [27].
KDLOR  Kernel discriminant learning for ordinal regression [30]. 
POM  Proportional odds model [22]. 
SVR  Support vector regression [29]. 
NNOR  Neural network with ordered partition for ordinal regression [3]. 
ELMOR  Extreme learning machine for ordinal regression [5]. 
CNNOR  Convolutional deep neural network for ordinal regression [18]. 
OFMHS  Ordinal factorization machine with hierarchical sparsity [11]. 
XOFM  Proposed explainable ordinal factorization model. 
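To evaluate the performance, we adopt two measures. First, we use Accuracy (Acc) to measure the overall performance; it does not take the order of the classes into account:
(16) $Acc = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = y_i\right]$
where $\hat{y}_i$ is the predicted label and $y_i$ is the actual label of the $i$-th of $N$ testing objects.
Second, we adopt the Mean Absolute Error (MAE) to measure the deviation of the predicted labels from the actual labels [1]:
(17) $MAE = \frac{1}{N}\sum_{i=1}^{N} \left|\hat{y}_i - y_i\right|$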
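The two measures can be computed as in the following minimal sketch, which assumes the class labels are encoded as consecutive integers so that label differences count the number of classes between prediction and truth. The function names and the sample labels are illustrative.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of objects whose predicted label equals the actual label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def mean_absolute_error(y_true, y_pred):
    """Average absolute distance, in classes, between predicted and actual labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = [1, 2, 3, 4]
y_pred = [1, 3, 3, 2]
acc = accuracy(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
```

In this example, both wrong predictions count equally toward Acc, while the prediction that is two classes away contributes twice as much to the MAE as the one that is one class away.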
4.2 Results Analysis
Table 3: Acc of each approach on the seven datasets.
Approaches  AO  AR  BR  BHO  CU  SO  SK

LIT  0.304±0.034  0.469±0.031  0.619±0.015  0.634±0.024  0.783±0.019  0.721±0.023  0.325±0.086
LAT  0.316±0.037  0.219±0.034  0.619±0.014  0.634±0.021  0.808±0.021  0.710±0.020  0.346±0.067
KDLOR  0.295±0.028  0.375±0.037  0.429±0.017  0.584±0.020  0.900±0.008  0.789±0.012  0.377±0.055
POM  0.347±0.029  0.406±0.029  0.667±0.013  0.653±0.022  0.942±0.009  0.689±0.021  0.428±0.018
SVR  0.322±0.031  0.625±0.030  0.476±0.017  0.584±0.021  0.917±0.006  0.845±0.018  0.362±0.022
NNOR  0.333±0.027  0.656±0.041  0.619±0.015  0.644±0.019  0.900±0.005  0.816±0.024  0.442±0.031
ELMOR  0.352±0.029  0.594±0.038  0.571±0.018  0.584±0.020  0.825±0.019  0.816±0.021  0.424±0.028
CNNOR  0.344±0.036  0.381±0.032  0.598±0.011  0.581±0.022  0.933±0.008  0.724±0.039  0.318±0.035
OFMHS  0.323±0.039  0.491±0.032  0.638±0.012  0.588±0.014  0.917±0.005  0.793±0.025  0.301±0.029
XOFM  0.348±0.023  0.719±0.031  0.682±0.021  0.637±0.017  0.942±0.009  0.858±0.033  0.395±0.024
Table 4: MAE of each approach on the seven datasets.
Approaches  AO  AR  BR  BHO  CU  SO  SK

LIT  1.256±0.097  0.594±0.071  0.524±0.065  0.475±0.053  0.217±0.011  0.295±0.042  0.964±0.079
LAT  1.081±0.103  0.875±0.081  0.619±0.057  0.475±0.043  0.200±0.018  0.305±0.044  0.826±0.077
KDLOR  1.268±0.067  0.906±0.062  0.669±0.053  0.485±0.045  0.101±0.008  0.216±0.029  0.856±0.067
POM  1.103±0.078  0.688±0.082  0.429±0.052  0.416±0.037  0.058±0.009  0.316±0.026  0.712±0.055
SVR  0.982±0.089  0.375±0.063  0.619±0.042  0.436±0.032  0.083±0.005  0.179±0.018  0.839±0.050
NNOR  1.034±0.083  0.406±0.042  0.381±0.030  0.426±0.041  0.100±0.013  0.184±0.022  0.685±0.058
ELMOR  0.996±0.093  0.438±0.035  0.667±0.036  0.475±0.030  0.175±0.018  0.184±0.021  0.694±0.033
CNNOR  1.326±0.029  0.681±0.013  0.488±0.020  0.497±0.019  0.092±0.007  0.298±0.028  1.109±0.051
OFMHS  1.398±0.041  0.619±0.040  0.420±0.021  0.502±0.020  0.108±0.008  0.282±0.016  1.182±0.031
XOFM  1.033±0.021  0.343±0.019  0.363±0.009  0.411±0.016  0.058±0.008  0.152±0.018  0.801±0.032
We report Acc and MAE in Table 3 and Table 4, respectively. The best result for each dataset is highlighted. From the results, the proposed XOFM achieves either the best (AR, BR, CU, and SO) or near-the-best (BHO) results for the small-sized datasets. Although XOFM does not perform the best on the large-sized datasets (AO and SK), its performance is better than most baselines and not much worse than the best one. Similar conclusions can be drawn from Table 4. The proposed XOFM attains the lowest MAE on five out of seven datasets. This indicates that the wrong predictions made by XOFM do not deviate much from the true labels. For example, XOFM is not the best method for the BHO dataset according to Acc; however, it achieves the lowest MAE. This indicates that the proposed XOFM can better take advantage of the ordinal information to reduce the error.
Traditional ordinal regression methods usually need to calculate the difference between two objects, which leads to sparse training data. This problem heavily affects the performance on smaller datasets given insufficient training samples and larger variance. Similarly, XOFM discretizes an attribute’s value into multiple scales in the form of an attribute vector and trains the parameters by determining the differences between every two attribute vectors. This process also leads to the sparsity problem. XOFM utilizes the FMs scheme to address this issue. The experimental results validate the effectiveness of XOFM on the five smaller-sized datasets. In the next subsection, we show that the proposed XOFM can provide meaningful explanations for predictions.
4.3 Explainability
The capability to decipher the detailed relationships between attributes and predictions is the key to explaining the model. To demonstrate such explainability of XOFM, we present the obtained score functions for the Breast data in Figure 1. This dataset contains electrical impedance measurements in samples of freshly excised tissue from the breast. XOFM is used to classify a sampled tissue into one of six ordered classes: Carcinoma, Fibroadenoma, Mastopathy, Glandular, Connective, and Adipose, where Carcinoma is the most severe class and Adipose is the safest. The involved attributes are described in Table 5.
Table 5: Attributes of the Breast data.
Attr.  Description

I0  impedivity (ohm) at zero frequency
PA500  phase angle at 500 kHz
HFS  high-frequency slope of phase angle
DA  impedance distance between spectral ends 
AREA  area under spectrum 
ADA  area normalized by DA 
MAXIP  maximum of the spectrum 
DR  distance between I0 and max frequency point 
P  length of the spectral curve 
In Figure 1, each score function is in a piecewise linear form and characterizes a single attribute’s contribution to the risk of breast cancer. For example, the attribute I0, the impedivity (ohm) at zero frequency, negatively affects the risk of an object belonging to the Carcinoma class because the marginal score decreases as the attribute value increases. Moreover, from the ranges of the marginal scores (the distance between the maximal and the minimal values on the vertical axis), we find that the attribute I0 is the most important one because its marginal score ranges from 0 to 1.25, which is the largest range among all attributes. That conclusion is consistent with previous clinical findings [4].
XOFM can also decipher the interaction effects of the attributes. For brevity, we report one of the pairwise interactions as a heatmap (Figure 2). The color represents the intensity of the interactions in different attribute value intervals. For example, when AREA is around 87,275.45, its interaction with DA is stronger, indicating that a breast tissue with AREA around 87,275.45 and DA within the corresponding interval is more likely to be malignant.
4.4 Modification
XOFM is a flexible model that can be modified by tuning the regularization terms. As introduced in the previous section, controls the complexity of the score functions and determines the intensity of the attribute interactions. In addition to determining the parameters by a cross-validation process, we can progressively adjust the parameters based on the user’s domain knowledge. For example, if a physician insists that the main effects are more important than the attribute interactions, we can increase and the resulting new heatmap is shown in Figure 3. Obviously, the intensity of the interaction is weaker than in the previous one. In contrast, if we increase , the score functions become flatter and some exhibit different curve shapes (Figure 4). In practice, we can determine the values of the regularization terms based on the performance, or by soliciting the opinion of domain experts.
5 Conclusion
In this study, we propose XOFM, a new factorization machines-based ordinal regression model. XOFM is able to provide state-of-the-art prediction performance and, more importantly, meaningful explainability that deciphers the detailed contributions of attributes and their interactions. Such explainability makes XOFM uniquely effective in providing decision-making support, where the ability to explain how the predictions are made is needed as much as good accuracy. Moreover, XOFM presents a general explainable-modeling framework that can be calibrated or modified for various prediction problems other than ordinal regression.
We appreciate the anonymous reviewers and PC members for their valuable suggestions and comments.
References
 [1] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani, ‘Evaluation Measures for Ordinal Regression’, in 2009 Ninth International Conference on Intelligent Systems Design and Applications, pp. 283–287. IEEE, (2009).
 [2] Ralf Bender and Ulrich Grouven, ‘Ordinal Logistic Regression in Medical Research’, Journal of the Royal College of Physicians of London, 31(5), 546, (1997).
 [3] Jianlin Cheng, Zheng Wang, and Gianluca Pollastri, ‘A Neural Network Approach to Ordinal Regression’, in 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1279–1284. IEEE, (2008).
 [4] J Estrela Da Silva, JP Marques De Sá, and J Jossinet, ‘Classification of Breast Tissue by Electrical Impedance Spectroscopy’, Medical and Biological Engineering and Computing, 38(1), 26–30, (2000).
 [5] Wan-Yu Deng, Qing-Hua Zheng, Shiguo Lian, Lin Chen, and Xin Wang, ‘Ordinal Extreme Learning Machine’, Neurocomputing, 74(1–3), 447–456, (2010).
 [6] Dheeru Dua and Casey Graff, ‘UCI Machine Learning Repository’, (2017).
 [7] Salvatore Greco, Vincent Mousseau, and Roman Słowiński, ‘Multiple Criteria Sorting with a Set of Additive Value Functions’, European Journal of Operational Research, 207(3), 1455–1470, (2010).
 [8] Salvatore Greco, Vincent Mousseau, and Roman Słowiński, ‘Robust Ordinal Regression for Value Functions Handling Interacting Criteria’, European Journal of Operational Research, 239(3), 711–730, (2014).
 [9] David Gunning, ‘Explainable Artificial Intelligence (XAI)’, Defense Advanced Research Projects Agency (DARPA), nd Web, 2, (2017).
 [10] Mengzhuo Guo, Xiuwu Liao, Jiapeng Liu, and Qingpeng Zhang, ‘Consumer Preference Analysis: A Data-driven Multiple Criteria Approach Integrating Online Information’, Omega, (2019).
 [11] Shaocheng Guo, Songcan Chen, and Qing Tian, ‘Ordinal Factorization Machine with Hierarchical Sparsity’, Frontiers of Computer Science, (Mar 2019).
 [12] Pedro Antonio Gutierrez, Maria Perez-Ortiz, Javier Sanchez-Monedero, Francisco Fernandez-Navarro, and Cesar Hervas-Martinez, ‘Ordinal Regression Methods: Survey and Experimental Study’, IEEE Transactions on Knowledge and Data Engineering, 28(1), 127–146, (2015).
 [13] Bernd Hamann and Jiann-Liang Chen, ‘Data Point Selection for Piecewise Linear Curve Approximation’, Computer Aided Geometric Design, 11(3), 289–301, (1994).
 [14] Eric Jacquet-Lagreze and Yannis Siskos, ‘Preference Disaggregation: 20 Years of MCDA Experience’, European Journal of Operational Research, 130(2), 233–245, (2001).
 [15] Miłosz Kadziński, Marco Cinelli, Krzysztof Ciomek, Stuart R Coles, Mallikarjuna N Nadagouda, Rajender S Varma, and Kerry Kirwan, ‘Co-constructive Development of a Green Chemistry-based Model for the Assessment of Nanoparticles Synthesis’, European Journal of Operational Research, 264(2), 472–490, (2018).
 [16] Kyoungjae Kim and Hyunchul Ahn, ‘A Corporate Credit Rating Model Using Multiclass Support Vector Machines with an Ordinal Pairwise Partitioning Approach’, Computers & Operations Research, 39(8), 1800–1811, (2012).
 [17] Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou, ‘Factorized Bilinear Models for Image Recognition’, in Proceedings of the IEEE International Conference on Computer Vision, pp. 2079–2087, (2017).
 [18] Yanzhu Liu, Adams Wai Kin Kong, and Chi Keong Goh, ‘Deep Ordinal Regression Based on Data Relationship for Small Datasets’, in Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI’17, pp. 2372–2378. AAAI Press, (2017).
 [19] Yin Lou, Rich Caruana, and Johannes Gehrke, ‘Intelligible Models for Classification and Regression’, in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 150–158. ACM, (2012).
 [20] Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker, ‘Accurate Intelligible Models with Pairwise Interactions’, in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 623–631. ACM, (2013).
 [21] David G Luenberger, Yinyu Ye, et al., Linear and Nonlinear Programming, volume 2, Springer, 1984.
 [22] Peter McCullagh, ‘Regression Models for Ordinal Data’, Journal of the Royal Statistical Society: Series B (Methodological), 42(2), 109–127, (1980).
 [23] Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, and Gang Hua, ‘Ordinal Regression with Multiple Output CNN for Age Estimation’, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4920–4928, (2016).
 [24] Steffen Rendle, ‘Factorization Machines’, in 2010 IEEE International Conference on Data Mining, pp. 995–1000. IEEE, (2010).
 [25] Steffen Rendle, ‘Factorization Machines with libFM’, ACM Transactions on Intelligent Systems and Technology (TIST), 3(3), 57, (2012).
 [26] Steffen Rendle, Zeno Gantner, Christoph Freudenthaler, and Lars Schmidt-Thieme, ‘Fast Context-aware Recommendations with Factorization Machines’, in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 635–644. ACM, (2011).
 [27] Jason DM Rennie and Nathan Srebro, ‘Loss Functions for Preference Levels: Regression with Discrete Ordered Labels’, in Proceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference Handling, pp. 180–186. Kluwer Norwell, MA, (2005).
 [28] Herbert Robbins and Sutton Monro, ‘A Stochastic Approximation Method’, The Annals of Mathematical Statistics, 400–407, (1951).
 [29] Alex J Smola and Bernhard Schölkopf, ‘A Tutorial on Support Vector Regression’, Statistics and Computing, 14(3), 199–222, (2004).
 [30] Bing-Yu Sun, Jiuyong Li, Desheng Dash Wu, Xiao-Ming Zhang, and Wen-Bo Li, ‘Kernel Discriminant Learning for Ordinal Regression’, IEEE Transactions on Knowledge and Data Engineering, 22(6), 906–910, (2009).
 [31] Michael Tsang, Hanpeng Liu, Sanjay Purushotham, Pavankumar Murali, and Yan Liu, ‘Neural Interaction Transparency (NIT): Disentangling Learned Interactions for Improved Interpretability’, in Advances in Neural Information Processing Systems, pp. 5804–5813, (2018).
 [32] Joel Vaughan, Agus Sudjianto, Erind Brahimi, Jie Chen, and Vijayan N Nair, ‘Explainable Neural Networks Based on Additive Index Models’, arXiv preprint arXiv:1806.01933, (2018).
 [33] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang, ‘Deep & Cross Network for Ad Click Predictions’, in Proceedings of the ADKDD’17, p. 12. ACM, (2017).
 [34] Jun Xu, Wei Zeng, Yanyan Lan, Jiafeng Guo, and Xueqi Cheng, ‘Modeling the Parameter Interactions in Ranking SVM with Low-rank Approximation’, IEEE Transactions on Knowledge and Data Engineering, 31(6), 1181–1193, (2018).