Ordinal Regression as Structured Classification

# Ordinal Regression as Structured Classification

Niall Twomey, Rafael Poyiadzi, Callum Mann, Raúl Santos-Rodríguez
University of Bristol, Bristol, United Kingdom
{niall.twomey, rp13102, cm13558, enrsr}@bristol.ac.uk
###### Abstract

This paper extends the class of Ordinal Regression (OR) models with a structured interpretation of the problem by applying a novel treatment of encoded labels. The net effect of this is to transform the underlying problem from an OR task to a (structured) classification task which we solve with conditional random fields, thereby achieving a coherent and probabilistic model in which all model parameters are jointly learnt. Importantly, we show that although we have cast OR to classification, our method still falls within the class of decomposition methods in the OR ontology. This is an important link since our experience is that many applications of machine learning to healthcare completely ignore the important nature of the label ordering, and hence these approaches should be considered naïve in this ontology. We also show that our model is flexible both in how it adapts to data manifolds and in terms of the operations that are available for practitioners to execute. Our empirical evaluation demonstrates that the proposed approach overwhelmingly produces superior and often statistically significant results over baseline approaches on forty popular OR models, and demonstrates that the proposed model significantly out-performs baselines on synthetic and real datasets. Our implementation, together with scripts to reproduce the results of this work, will be available on a public GitHub repository.

## 1 Introduction

OR is the task of learning to classify data-points into one of many interval classes. It can be understood as lying in between the canonical problems of classification and regression, as it is a classification task where the classes follow a pre-defined order. Model learning in these domains therefore requires particular care and attention since many assumptions underpinning standard classifiers are unsuitable in OR settings.

Let us consider Alzheimer’s disease (AD) as a motivating application for this work. When assessing the current state of AD, healthcare professionals utilise one of several well-known assessment questionnaires (cf. [1]). These questionnaires are designed to uncover the cognitive capacity of the persons and evaluate the risks of independent living. An emerging application area of machine learning has been to non-invasively predict questionnaire scores based on a person’s behaviour and circadian patterns of Activities of Daily Living (ADL) and Instrumental ADL (IADL) in a Smart Home (SH) [2, 3], or to assess cognitive ability from conversation analysis. These are challenging problems to model, and there has been some success in these areas already.

The standard machine learning approach is based on learning a mapping between samples and categories so that the probability of error is minimised. However, in the setting described here the categorisation of the scores into their groups is an ordinal operation (e.g. ‘severe’ diagnoses are more extreme than ‘moderate’ and ‘mild’), and indeed classifying a person with ‘severe’ AD as ‘mild’ is more costly than predicting ‘moderate’. Automated AD assessment presents an opportunity to produce valuable healthcare technology that can benefit vulnerable persons and their families, but also to benefit clinicians via the unprecedented and objective view into the effect of AD on routine and behaviour. In the authors’ experience, however, the vast majority of the experimental literature on ordinal medical domains ignores the ordinal nature of the data and recasts the problem into traditional binary or multiclass problems, with some notable exceptions [4].

In this work we introduce a structural interpretation of ordinal regression. The advantage of this interpretation is that significantly more flexibility is ascribed to the predictive model, and this flexibility permits the model to operate efficiently on linear and non-linear data manifolds, while the baseline methods considered were unable to achieve this. Additionally, our structured interpretation captures contextual information that the other baselines cannot.

The aims and contributions of this paper are as follows: We strongly advocate the selection of ordinal techniques for ordinal problems and review ordinal approaches in Section 2. We extend the class of ordinal regression models in this work with a new structural interpretation of the field (Section 3), outline empirical experiments (Section 4) and demonstrate its utility in our results (Section 5). We summarise and conclude in Sections 6 and 7.

## 2 Ordinal Regression

Within the published area of OR, there are several methodologies that are well established. We describe these with strong reference to the ‘ordinal regression ontology’ from [5] and then introduce the proposed approach after.

### 2.1 Naïve Models

Intuitively we can reduce an OR task to either a classification or a regression problem. In the case of classification, we ignore the nature of the classes, and proceed with a model that uses nominal classification. This is considered a naïve approach as the practitioner ignores prior knowledge (e.g. of class ordering) that could otherwise be used to increase the accuracy and predictive power of the model. For the case of regression, one may map the classes on the real line, employ regression techniques, then map back to the original classes. Unless the practitioner has a well considered way of computing the forward and backward mappings, this approach appears naïve.

A similar approach to the classification reduction, but more advanced, is that of Cost Sensitive Classification (CSC). CSC is a general treatment of models where the practitioner provides (potentially) unique penalties for each type of misclassification [6]. This is usually accomplished through the use of a cost matrix during learning. CSC can therefore be employed for OR by devising a cost matrix that depends on the distance between classes [7]. This would again be a sensible approach given that the practitioner has a good understanding of the distances between classes, and a principled way of transforming them to suitable costs.
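As a minimal sketch of the distance-based cost matrix idea, the snippet below builds an absolute-distance cost matrix and uses it to pick the class with the lowest expected cost. The function names are hypothetical, and the choice of $|i - j|$ as the cost is an assumption made for illustration; the text above applies the cost matrix during learning, whereas this sketch applies it only at decision time to a given vector of class probabilities.

```python
def absolute_cost_matrix(num_classes):
    """cost[i][j] is the cost of predicting class j when the true class is i.

    Assumption for illustration: the cost is the absolute distance |i - j|.
    """
    return [[abs(i - j) for j in range(num_classes)] for i in range(num_classes)]


def min_expected_cost_class(class_probs, cost):
    """Return the 0-based class index with the lowest expected cost."""
    num_classes = len(class_probs)
    expected = [
        sum(class_probs[i] * cost[i][j] for i in range(num_classes))
        for j in range(num_classes)
    ]
    return min(range(num_classes), key=expected.__getitem__)


cost = absolute_cost_matrix(3)
# With mass split between the two extreme classes, the middle class has the
# lowest expected cost even though it is the least probable class.
choice = min_expected_cost_class([0.4, 0.2, 0.4], cost)
```

The example illustrates why a distance-aware cost can disagree with a plain most-probable-class decision when the classes are ordered.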

### 2.2 Threshold Models

Threshold models are another approach to OR. We assume that there is a latent continuous random variable that gives rise to the observed discrete classes. With this formulation we can perform a reduction to a regression problem. As criticised earlier, this would be a naïve approach due to the lack of a principled way of mapping from the real line to the given classes. Approaches under the Threshold Models category aim to surpass this limitation by learning this map (i.e. where to ‘cut’ the real line) from data, as opposed to assuming knowledge of it a priori.

##### Ordered Logit:

The classical ordered logit model [8] is a simple model that assumes a real-valued latent variable ($y^*$) is defined by

$$y^* = w^\top x + \epsilon \tag{1}$$

where $x \in \mathbb{R}^D$ is a data point, $D$ is the dimensionality of the data, $w \in \mathbb{R}^D$ is a weight vector, and $\epsilon$ is a noise term following the logistic distribution with zero mean and unit variance. Assuming $K$ categories, and a set of thresholds $\theta_0, \theta_1, \dots, \theta_K$ (ordered by $\theta_{k-1} < \theta_k$), one can assign a response according to the interval into which $y^*$ falls with the function $f_k$:

$$f_k(y^*) = \begin{cases} 1 & \text{if } \theta_{k-1} < y^* \le \theta_k \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

Three of the thresholds are fixed ($\theta_0 = -\infty$, $\theta_1 = 0$ and $\theta_K = \infty$) to ensure that the process is identifiable [9]. The probability over the categories is computed by integrating the probability mass that falls between the intervals

$$P(y = k \mid x) = P(\theta_{k-1} < y^* \le \theta_k \mid x) = \sigma(\theta_k - w^\top x) - \sigma(\theta_{k-1} - w^\top x) \tag{3}$$

where $\sigma$ is the logistic function (i.e. the cumulative distribution function of the logistic distribution). The log-likelihood and its gradient with respect to the parameters ($w$ and $\theta$) are easily computed and can be optimised with standard optimisation techniques [8]. Previous work presents an approach based on the Support Vector Machine and a dataset constructed by considering all the pairwise difference vectors [10]. One of the main advantages of these models over simpler baselines (such as linear regression) is that the ordinal intervals are optimised during the learning routine and that the intervals can have arbitrary widths. It is important to understand that the primary assumption underpinning these models is that the data lies on a linear manifold, and in practice this is difficult to guarantee. Other approaches within the threshold models category include an adaptation of the online perceptron algorithm [11], as well as an approach based on a generative model, which uses Gaussian Processes [12].
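The interval probabilities of the ordered logit model can be sketched in a few lines. The snippet assumes the standard form in which the probability of category $k$ is the difference of two logistic CDF values evaluated at consecutive thresholds shifted by the linear score; the function names and the example numbers are hypothetical, and the caller supplies the $K-1$ finite thresholds in increasing order (the outer thresholds at $\pm\infty$ are implied).

```python
import math


def logistic(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def ordered_logit_probs(x, w, inner_thresholds):
    """Return [P(y=1|x), ..., P(y=K|x)] for K = len(inner_thresholds) + 1.

    inner_thresholds must be sorted in increasing order; the outer
    thresholds (-inf and +inf) contribute CDF values of 0 and 1.
    """
    score = sum(wi * xi for wi, xi in zip(w, x))
    cdf = [0.0] + [logistic(t - score) for t in inner_thresholds] + [1.0]
    # Each category receives the probability mass between adjacent thresholds.
    return [cdf[k] - cdf[k - 1] for k in range(1, len(cdf))]


# Hypothetical 2-D example with four categories.
probs = ordered_logit_probs([1.0, 2.0], [0.5, -0.25], [0.0, 1.0, 2.5])
```

Because the thresholds are free parameters, the intervals may have arbitrary widths, which is the property highlighted in the paragraph above.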

### 2.3 Decomposition Models

##### Ordinal Binary Decompositions:

Methods in this category are products of multiple binary models, or single models capable of multiple outputs. For example, in multi-class classification problems, one usually resorts to solving multiple smaller problems and then combining their predictions according to voting schemes such as One-Versus-One (OvO) or One-Versus-All (OvA). Considering a problem with $K$ classes, in the former setting one would need $K(K-1)/2$ ‘small’ learners, while in the latter $K$ ‘larger’ learners, where the distinction between small and large refers to the average size of the data they will be dealing with. OvA is also susceptible to the problem of class-imbalance. Based on the assumption of the ordering of the classes one could construct more developed voting schemes that reflect the practitioner’s prior knowledge and reduce the computational complexity of the overall algorithm. Examples of such ordinal voting schemes include one-vs-next, one-vs-followers, and decompositions based on Ordered Partitions (see Section 3.2 in [5]).

These decompositions are closely related to the concept of Error Correcting Output Codes (ECOC), which is used to reduce multi-class classification problems to combinations of binary tasks [13]. In this setting, every class is assigned to an ‘output code’, which usually contains binary values. When considering multiple binary models, each of the entries of this output code is generated by one of the models. The predicted class is the one whose output code is closest to the composition of predictions. A similar line of work keeps the connection between classes and output codes, but instead of training one model per ‘bit’, trains a model capable of multiple outputs on the whole code. In the simplest case this could amount to the output codes being of the form of the popular one-hot embedding, but ECOC provides a framework for more delicate codes to be utilised, such as ones reflecting the prior knowledge of the classes being ordered.

##### Nested Binary Classifiers:

A flexible ordinal model based on a decomposition of the label space can be produced with cascades of linear classifiers [14] by recasting the ordinal task into $K-1$ independent binary classification problems. The $k$-th binary problem re-partitions the dataset into two groups; the first group consists of all instances whose label is less than or equal to the value $k$, and the second group consists of all instances with label greater than the value $k$.

Using an equivalent rationale to that of Equation (3), the probability distributions over each partition are unified into a probability distribution over the categories with the following equation

$$P(Y = k \mid x) = P(Y > k - 0.5 \mid x) - P(Y > k + 0.5 \mid x) \tag{4}$$

with the base cases $P(Y > 0.5 \mid x) = 1$ and $P(Y > K + 0.5 \mid x) = 0$. The models are learnt independently, and only the two classifiers that ‘neighbour’ the correct label are used in prediction.

Although this model is simple and derived from an intuitive standpoint, it also carries several disadvantages. Firstly, the binary classifiers are learnt independently. While this brings gains in terms of concurrently learning each model it is unlikely that the final model will produce optimal decisions. Secondly, the mechanism for decision making shown in Equation 4 cannot guarantee consistency in classification and in general may require clipping and renormalisation for probabilistic predictions [15], and this is particularly clear if one envisages an ordinal classification task when the data lies along complex or nonlinear data manifolds.
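The inconsistency mentioned above can be made concrete with a small sketch. It assumes integer labels $1, \dots, K$ and takes as input the outputs of $K-1$ independently trained binary classifiers, each approximating $P(Y > k \mid x)$; the function name and the example numbers are hypothetical. Since nothing forces these outputs to be monotone in $k$, the differences can be negative, and clipping followed by renormalisation is one simple (assumed) repair.

```python
def combine_nested_binary(prob_greater):
    """Combine P(Y > k | x), k = 1..K-1, into a distribution over 1..K.

    Returns (raw, repaired): the raw differences, which may contain
    negative entries, and a clipped and renormalised distribution.
    """
    # Base cases: P(Y > 0) = 1 and P(Y > K) = 0.
    extended = [1.0] + list(prob_greater) + [0.0]
    raw = [extended[k] - extended[k + 1] for k in range(len(extended) - 1)]
    clipped = [max(p, 0.0) for p in raw]
    # The raw values sum to one, so the clipped total is at least one.
    total = sum(clipped)
    return raw, [p / total for p in clipped]


# Hypothetical non-monotone classifier outputs for a 5-category problem.
raw, repaired = combine_nested_binary([0.9, 0.4, 0.6, 0.1])
```

In this example the third raw entry is negative, which is the kind of inconsistent prediction that requires clipping and renormalisation.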

In the taxonomy of algorithms presented in [5], the Ordinal Binary Decompositions category has another sub-class of methods. Therein, a first group of methods takes advantage of the ordinal nature of the classes to devise clever decompositions, while the second group transforms the problem to a multi-target one, with ordinal encodings as targets. Models must be aware of the structured nature of the output space in order to take advantage of these encodings.

## 3 Methods

In this section we introduce our proposed technique for ordinal regression, Structured Ordinal Regression Modeling (StORM). We cast the ordinal regression task into a structured classification task. We use a simple encoding scheme for the labels which allows for a simple propagation of information through a CRF constructed from the label representation. Although classification methods in general are considered naïve in the ordinal regression ontology (cf. [5] and Section 2.1), the proposed method is further developed (and hence not naïve) since the label encoding deliberately captures several desirable properties of ordinal predictors. A key advantage of the application of CRFs to these encodings is that one model is produced and optimised to produce outputs, in contrast to many approaches from the threshold and decomposition strands of the OR ontology.

### 3.1 Label Encoding

A key enabler of the proposed approach is the symbiotic relationship between a bespoke encoding scheme for ordinal variables on one hand and the modelling framework that is used to infer and predict on the space of encoded labels on the other (next section). The encoding scheme that we use has previously been introduced for capturing resemblance measures for ordinal variables [16, Ch. 8] but we believe we are the first to incorporate this representation directly into the modelling framework.

We consider an ordinal problem as having $K$ categories, and our encoding scheme transforms these into a sequence of $K-1$ binary digits. The following function defines the value of the $k$-th bit of an encoded sequence:

$$\hat{f}_K(\hat{y}, k) = \begin{cases} 1 & \text{if } k < \hat{y} \\ 0 & \text{otherwise} \end{cases} \qquad (1 \le k \le K-1) \tag{5}$$

where the function subscript $K$ defines the support of the ordinal categories (i.e. $\{1, 2, \dots, K\}$) and $\hat{y}$ ($1 \le \hat{y} \le K$) is the ‘raw’ (i.e. un-encoded) label. As a concrete example, for $K = 7$ and $\hat{y} = 4$, the encoded label is given as:

$$f_7(4) = (1, 1, 1, 0, 0, 0) \tag{6}$$

where we have defined the new function

$$f_K(\hat{y}) = \left(\hat{f}_K(\hat{y}, 1), \hat{f}_K(\hat{y}, 2), \dots, \hat{f}_K(\hat{y}, K-1)\right) \tag{7}$$

To motivate this encoding scheme for OR, consider two instances with $\hat{y} = 3$ and $\hat{y} = 5$ and their encoded values:

$$f_7(3) = (1, 1, 0, 0, 0, 0) \tag{8}$$

$$f_7(5) = (1, 1, 1, 1, 0, 0) \tag{9}$$

Recalling that these are the encoded representations of the labels of two instances, we can see that even though the raw labels are distinct, four bits of the encoded labels are of the same value. Thus, we can split the encoded labels into three virtual segments: 1) the first two bits which are positive and identical; 2) the final two bits which are negative and identical; and 3) the middle two bits which disagree and encode the intrinsic differences between the instances. In the next section we introduce a framework for modelling sequences of data that obey the constraints of the encoding and thus capture ‘shared’ and ‘distinct’ aspects of the encoded labels above.
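The encoding is simple enough to sketch directly. The function names below are hypothetical; the rule implemented is the one stated above (bit $k$ is 1 exactly when $k$ is smaller than the raw label), and the decoder assumes it is given a valid code, i.e. a run of ones followed by a run of zeros.

```python
def encode_label(label, num_categories):
    """Return the (K - 1)-bit code of a raw label in 1..K."""
    if not 1 <= label <= num_categories:
        raise ValueError("label must lie in 1..num_categories")
    return tuple(1 if k < label else 0 for k in range(1, num_categories))


def decode_bits(bits):
    """Invert the encoding for a valid code (ones followed by zeros)."""
    return 1 + sum(bits)


def shared_bits(code_a, code_b):
    """Count the positions at which two codes agree."""
    return sum(a == b for a, b in zip(code_a, code_b))


code_3 = encode_label(3, 7)   # (1, 1, 0, 0, 0, 0)
code_5 = encode_label(5, 7)   # (1, 1, 1, 1, 0, 0)
agreement = shared_bits(code_3, code_5)  # four bits agree, two disagree
```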

### 3.2 Conditional Random Fields

We utilise the language of probabilistic modelling and Conditional Random Fields (CRFs) in our setting. CRFs constitute a structured modelling framework [17], and in this section we motivate and introduce a generalisation of the traditional linear-chain CRFs for OR. Linear-chain CRFs incorporate weight-sharing on all positions of a sequence since, for these models, the dynamics (i.e. predictive response as a function of input) are stationary [18]. In other words, the effect of one feature is equal at all positions of a sequence. This is a strong assumption, and it is particularly inappropriate with our encoded labels since a feature may need to have diminishing (or increasing) responses depending on the position in the sequence. For the remainder of this section we assume the reader has familiarity with CRFs and recommend the following as an introduction: [19].

To overcome this incompatibility, we use the CRF framework but, importantly, without weight sharing. We have a dataset $X$ that consists of $N$ observations of dimensionality $D$, i.e. $X \in \mathbb{R}^{N \times D}$. With $K$ ordinal quantities the encoded labels are $Y \in \{0, 1\}^{N \times (K-1)}$. In order to simplify mathematical notation for the remainder of this section we focus on one particular example/label pair ($x$, $y$) which can be considered as a single row of $X$ and $Y$ respectively. Of critical importance for this method is the fact that the label has been mapped from the ‘one-of-K’ encoding to the ‘up-to-k’ encoding, and hence the space of labels (and predictions) has become a sequence of binary variables for every instance. Although this might be viewed as an unnecessary complication (since no new information is introduced) we will later see the value that is introduced by this encoding.

#### 3.2.1 CRFs for OR

CRFs yield structured predictions over graphs. In our setting, the graph consists of $K-1$ nodes with $K-2$ edges linking the nodes together in a chain. Each node (indexed by $n$) contains its own set of weights, as does each edge (indexed by $e$). We follow standard potential and marginalisation methods from the CRF literature. First, node and edge potentials are computed. The $n$-th node potential is given by

$$\psi^{(n)} = \exp\{W^{(n)} x\} \tag{10}$$

where $x \in \mathbb{R}^{D}$ is the feature vector, $W^{(n)}$ contains the weights associated with the $n$-th node, and $\psi^{(n)} \in \mathbb{R}^{2}$ holds one entry per binary state. To simplify notation we assume that a ‘bias feature’ with constant value of 1 is contained in the feature vector $x$. Similarly, the potential of the $e$-th edge is given by

$$\Psi^{(e)} = \exp\{U^{(e)} x\} \tag{11}$$

where $U^{(e)}$ is the weight tensor associated with the $e$-th edge of the model and multiplication takes place on the outermost dimension, so that $\Psi^{(e)} \in \mathbb{R}^{2 \times 2}$.

Inference can be performed with standard message passing, which can efficiently be computed with the forward-backward dynamic program. The forward vector is given by

$$\alpha^{(n+1)} = \Psi^{(n)\top} \gamma^{(n)} \tag{12}$$

where $\gamma^{(n)} = \alpha^{(n)} \odot \psi^{(n)}$, $\top$ represents the matrix transpose, and $\odot$ represents the Hadamard product. The backward vector is calculated similarly with

$$\beta^{(n-1)} = \Psi^{(n-1)} \delta^{(n)} \tag{13}$$

where $\delta^{(n)} = \beta^{(n)} \odot \psi^{(n)}$, and the base cases for the forward and backward vectors are $\alpha^{(1)} = \mathbf{1}$ and $\beta^{(K-1)} = \mathbf{1}$. Note that marginalisation is often performed in the log domain with the log-sum-exp function for numerical stability, but identical marginal distributions are achieved to those above.

It can be shown that the forward and backward vectors yield sufficient information for exact marginal probability estimation [19] and the probability of the $n$-th position of the label is given by

$$P(y_n) = \alpha^{(n)} \odot \psi^{(n)} \odot \beta^{(n)} / Z \tag{14}$$

where $Z$ is the global normaliser of the sequence that can be calculated at any position, $Z = \mathbf{1}^\top \left(\alpha^{(n)} \odot \psi^{(n)} \odot \beta^{(n)}\right)$, and the probability across the $n$-th edge is

$$P(y_n, y_{n+1}) = \gamma^{(n)} \odot \Psi^{(n)} \odot \delta^{(n+1)\top} / Z \tag{15}$$

Figure 1 illustrates the inference procedure along a graph. Three nodes are shown here, and each of the intermediate quantities introduced earlier is shown.
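The forward-backward recursions and the marginal expressions of Equations 12 to 15 can be sketched for a chain of binary nodes as follows. This is an illustrative sketch rather than the authors' implementation: the function name is hypothetical, potentials are passed in directly (they would be obtained from Equations 10 and 11), indices are 0-based, and edge `n` is assumed to link nodes `n` and `n + 1`. Computation is done in the product domain, which is only advisable for short chains.

```python
def chain_marginals(node_pots, edge_pots):
    """Exact node and edge marginals for a chain of binary nodes.

    node_pots[n][i]    : potential of state i at node n
    edge_pots[n][i][j] : potential of states (i, j) at nodes (n, n + 1)
    Returns (node_marginals, edge_marginals, normaliser).
    """
    length = len(node_pots)
    states = range(2)

    # Forward pass: alpha[n] summarises everything to the left of node n.
    alpha = [[1.0, 1.0] for _ in range(length)]
    for n in range(length - 1):
        gamma = [alpha[n][i] * node_pots[n][i] for i in states]
        alpha[n + 1] = [
            sum(edge_pots[n][i][j] * gamma[i] for i in states) for j in states
        ]

    # Backward pass: beta[n] summarises everything to the right of node n.
    beta = [[1.0, 1.0] for _ in range(length)]
    for n in range(length - 1, 0, -1):
        delta = [beta[n][j] * node_pots[n][j] for j in states]
        beta[n - 1] = [
            sum(edge_pots[n - 1][i][j] * delta[j] for j in states) for i in states
        ]

    # The normaliser is evaluated at the last node, where beta is all ones.
    z = sum(alpha[-1][i] * node_pots[-1][i] for i in states)

    node_marg = [
        [alpha[n][i] * node_pots[n][i] * beta[n][i] / z for i in states]
        for n in range(length)
    ]
    edge_marg = [
        [
            [
                alpha[n][i] * node_pots[n][i] * edge_pots[n][i][j]
                * node_pots[n + 1][j] * beta[n + 1][j] / z
                for j in states
            ]
            for i in states
        ]
        for n in range(length - 1)
    ]
    return node_marg, edge_marg, z
```

For a two-node chain the result can be checked by hand: with node potentials `[[1, 2], [3, 1]]` and edge potentials `[[[1, 2], [3, 4]]]`, the four joint configurations have unnormalised scores 3, 2, 18 and 8, so the normaliser is 31.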

#### 3.2.2 Learning

Optimisation is performed by minimising the negative logarithm of the likelihood, i.e.

$$\mathrm{NLL} = -\sum_{j=1}^{N} \log P(Y_j \mid X_j, \Theta) \tag{16}$$

where $\Theta$ is the set of model parameters. It is easy to show that the gradient (normalised by the number of instances $N$) with respect to the $i$-th element of the $n$-th weight vector is:

$$\frac{\partial\, \mathrm{NLL}}{\partial W^{(n)}_i} = \frac{1}{N} \sum_{j=1}^{N} \left( P(Y_{j,n} = i \mid X_j, \Theta) - \mathbb{I}\{Y_{j,n} = i\} \right) X_j \tag{17}$$

where $\mathbb{I}$ is the indicator function. The derivation of the above follows similar methodology to that for other log-linear models [17, 19], and very similar expressions can be derived for the gradients with respect to the edge weights $U^{(e)}$. Standard gradient-based optimisation techniques can be used to minimise the negative log likelihood, e.g. L-BFGS.

It is interesting to note that even though log loss is optimised here, the structure of the labels can be seen to be functionally related to the absolute error between labels and predictions. One can view this either as a hybrid loss function or as the proposed methodology implicitly applying misclassification costs owing to the structure of the encoded label space.
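One way to see the link to absolute error is that, for two valid codes, the number of disagreeing bits equals the absolute difference of the raw labels. The sketch below checks this identity exhaustively for a small problem; the function names are hypothetical and the encoding rule (bit $k$ is 1 when $k$ is smaller than the label) is the one defined in Section 3.1.

```python
def encode_label(label, num_categories):
    """(K - 1)-bit code: bit k is 1 exactly when k < label."""
    return tuple(1 if k < label else 0 for k in range(1, num_categories))


def hamming_distance(code_a, code_b):
    """Number of positions at which two codes disagree."""
    return sum(a != b for a, b in zip(code_a, code_b))


def hamming_matches_absolute_error(num_categories):
    """Check the identity for every pair of labels in 1..K."""
    labels = range(1, num_categories + 1)
    return all(
        hamming_distance(encode_label(a, num_categories),
                         encode_label(b, num_categories)) == abs(a - b)
        for a in labels
        for b in labels
    )


identity_holds = hamming_matches_absolute_error(7)
```

Each mispredicted bit therefore corresponds to one unit of ordinal distance, which is consistent with viewing the per-bit log loss as implicitly cost-sensitive.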

#### 3.2.3 Comments on the Model

Since many aspects of this model are unexplored in the field of OR, we take a moment to comment on some of them in this setting.

Edges: We interpret the edges of the model as driving the ‘transitions’ between two adjacent encoded bits. In more traditional sequence learning settings, including natural language processing, it is very typical to direct bespoke features to the edges only. We ascribe a similar interpretation to the edges in our setting, i.e. the $e$-th edge primarily drives whether the $e$-th bit of the encoded label is sustained or transitions, whereas the node weights drive the basic identification of categories.

Predictive Distribution: CRFs facilitate several methods for producing predictions: forward filtering, Viterbi path, and the marginal probability distribution of the sequence [9]. Although in this work we consider the Viterbi path, we acknowledge existing literature that suggests other predictive functions be used when optimising for different performance metrics.

Errors: Not all paths are permissible with our encoding scheme, with $0 \to 1$ transitions in particular being forbidden. Several approaches can be incorporated to produce predictions consistent with the encoding. For example, one can explicitly forbid illegal transitions by setting the corresponding edge potential to zero, i.e. $\Psi^{(e)}_{0,1} = 0$. However, in our experimentation we recognised some evidence that there is a correspondence between invalid predictions and outlying data. This work is ongoing.
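The idea of forbidding illegal transitions can be sketched with a small Viterbi decoder over binary states. The function names are hypothetical, potentials are supplied directly, indices are 0-based with edge `n` linking nodes `n` and `n + 1`, and the decoder works in the product domain (an assumption that is reasonable only for short chains). Setting `forbid_rising=True` zeroes the potential of every 0-to-1 transition, so the decoded path is always a valid code.

```python
def is_valid_code(bits):
    """A valid code never moves from 0 to 1 along the sequence."""
    return all(not (a == 0 and b == 1) for a, b in zip(bits, bits[1:]))


def viterbi_path(node_pots, edge_pots, forbid_rising=False):
    """Most probable state sequence of a chain of binary nodes."""
    score = list(node_pots[0])
    back = []
    for n in range(1, len(node_pots)):
        new_score, pointers = [], []
        for j in (0, 1):
            candidates = []
            for i in (0, 1):
                edge = edge_pots[n - 1][i][j]
                if forbid_rising and i == 0 and j == 1:
                    edge = 0.0  # disallow the 0 -> 1 transition
                candidates.append(score[i] * edge)
            best = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best)
            new_score.append(candidates[best] * node_pots[n][j])
        score = new_score
        back.append(pointers)
    # Trace the best final state back to the start of the chain.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return tuple(reversed(path))


# Hypothetical potentials whose unconstrained optimum is an invalid code.
nodes = [[3.0, 1.0], [2.0, 1.0], [1.0, 4.0]]
edges = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
free_path = viterbi_path(nodes, edges)
safe_path = viterbi_path(nodes, edges, forbid_rising=True)
```

Comparing the two decoded paths for an instance gives a simple flag for the invalid predictions discussed above.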

#### 3.2.4 Implementation Notes

Due to the specific nature of our problem, some operations can be vectorised to reduce learning and inference time. Since all sequences are of the same length ($K-1$), the message passing procedure can be vectorised across all instances. In so doing, forward messages will be passed from position $n$ to $n+1$ across all instances. Similarly, backward messages can be passed in a similar vectorised manner. This is often not possible in general, since most sequence learning problems have instances of different lengths.

Furthermore, if the cardinality of the ordinal problem is small (below a threshold in our experiments), inference can be performed in the exponential domain without re-normalisation and without noticeable loss of fidelity in probability estimates. This yields significant gains in terms of computational time since neither the logarithm nor the exponential function is used for marginalisation.

## 4 Experiments

### 4.1 Models

We compare four different models that are linear in their parameters. We only consider linear models so that we can compare the proposed method with baselines in the ‘natural’ data representations. Practitioners that wish to generalise this work and consider nonlinear predictors may incorporate kernel functions (polynomial, for example) or explicitly parameterise nonlinear representations with deep network architectures. Hence, we consider the following linear models only:

1. Ordered Logit (OrdLog);

2. Nested Binary Ordinal Regression (BinNest);

3. Logistic Regression (LogReg); and

4. Structured Ordinal Regression Modeling (StORM).

These models are all log-linear. Regularisation is applied to the weight parameters, and the regularisation parameter (which controls the norm of the parameters) is selected on the training set using 5-fold cross-validation.

### 4.2 Datasets

Table 1 shows the characteristics of the datasets considered in the empirical evaluation in this paper. The table presents four categories of datasets (synthetic, UCI, large and health) and these are explained in the subsequent subsections.

#### 4.2.1 Synthetic

For our synthetic experiments, we project data onto the four following data manifolds: 1. Linear; 2. Sine; 3. Circle; and 4. Spiral. These data manifolds lie in 2D spaces, and we illustrate the predictive distributions of all models visually in order to understand the strengths and limitations of each model. Empirical validation is performed with $K = 5$ and $K = 10$. These datasets are shown visually in our results and discussion.

#### 4.2.2 UCI & Large

We follow [12] with two categories of datasets. The first comprises the following datasets from the UCI machine learning repository: AutoMPG, Diabetes, Abalone, BostonHousing, MachineCPU, Pyrimidines, StocksDomain, Triazines, and Wisconsin. Although many of these datasets are used to understand regression models, we applied equal-frequency binning to these datasets so that they can be used in ordinal tasks. Empirical validation is performed with $K = 5$ and $K = 10$. We also consider a second (larger) set of data that was also introduced in [12] as the ‘large’ dataset.
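Equal-frequency binning of a continuous regression target can be sketched as below. The function name and the example values are hypothetical, and the rank-based rule is an assumption made for illustration: it assigns labels by sorted position, so tied target values may be split across adjacent bins, which a production implementation might handle differently.

```python
def equal_frequency_bins(values, num_bins):
    """Map continuous targets to ordinal labels 1..num_bins.

    Labels are assigned by rank so that every bin receives roughly the
    same number of instances.
    """
    order = sorted(range(len(values)), key=values.__getitem__)
    labels = [0] * len(values)
    for rank, index in enumerate(order):
        labels[index] = 1 + (rank * num_bins) // len(values)
    return labels


# Hypothetical regression targets binned into five ordinal categories.
targets = [0.3, 2.5, 1.1, 9.0, 4.2, 0.7, 3.3, 6.8, 5.5, 7.1]
ordinal_labels = equal_frequency_bins(targets, 5)
```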

#### 4.2.3 Healthcare

Finally, we also evaluate our model on two AD datasets. DementiaBank [20] is a longitudinal dataset of multimedia interactions for the study of communication in dementia. The dataset contains transcript and audio files from interviews between patients and clinicians, and covers a range of diagnostic tests in mental health, such as Alzheimer’s dementia, Parkinson’s, and mild cognitive impairment. The transcripts and audio files were gathered as part of a larger protocol administered by the Alzheimer and Related Dementias Study at the University of Pittsburgh School of Medicine. We use the DementiaBank dataset in an ordinal regression setting to model the various stages of progression of AD: cognitively healthy, possible dementia, probable dementia, and dementia.

The Centre for Advanced Studies in Adaptive Systems (CASAS) research group produce models and datasets for smart-home behaviour modelling. Their datasets consist of sensor data (including Passive Infra-Red (PIR), temperature, door and object sensors) derived from naturalistic living in a SH environment. The ‘cognitive assessment activity dataset’ [2, 21, 22] consists of approximately 400 participants performing several ADLs and IADLs in the SH. The activities were graded by cognitive clinicians (domain experts) on a scale of 1-5, and predicting the assigned grade from sensor data is the task that we investigate here.

### 4.3 Performance Evaluation

The ‘synthetic’, ‘UCI’ and ‘healthcare’ datasets are partitioned randomly into 20 folds (cf. Table 1). Following the protocol of [12] we also performed 100 randomised splits for the ‘large’ datasets. Model hyperparameters are selected with 5-fold cross validation on the training set, and the selected parameters are used for performance evaluation on the test set. We follow [23, 5] in our evaluation metrics and use macro-averaged 0/1 loss and macro-averaged Mean Absolute Error (MAE).
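The two evaluation metrics can be sketched as follows. The sketch assumes the common definition of macro-averaging, in which the loss is first averaged within each true class and the per-class means are then averaged with equal weight; the function names and example labels are hypothetical.

```python
def macro_averaged_error(y_true, y_pred, loss):
    """Average the per-class mean loss over the classes present in y_true."""
    per_class = {}
    for truth, pred in zip(y_true, y_pred):
        per_class.setdefault(truth, []).append(loss(truth, pred))
    class_means = [sum(v) / len(v) for v in per_class.values()]
    return sum(class_means) / len(class_means)


def zero_one(truth, pred):
    """0/1 loss: one for any misclassification."""
    return float(truth != pred)


def absolute(truth, pred):
    """Absolute error between ordinal labels."""
    return float(abs(truth - pred))


# Hypothetical imbalanced example: class 1 dominates and is always correct.
y_true = [1, 1, 1, 1, 2, 3]
y_pred = [1, 1, 1, 1, 3, 1]
macro_zero_one = macro_averaged_error(y_true, y_pred, zero_one)
macro_mae = macro_averaged_error(y_true, y_pred, absolute)
```

In this example the unweighted (micro) averages would be 2/6 and 3/6, whereas the macro averages are 2/3 and 1, showing how macro-averaging prevents a dominant class from masking errors on rare classes.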

Additionally, significance of results is reported with the Wilcoxon signed-rank test [24] at a (fairly stringent) significance level. We illustrate the statistical significance with critical difference diagrams [25], which aid the understanding of statistical significance when multiple classifiers are compared over multiple datasets. An example is shown in Figure 2. Four classifiers are shown here (Model 1, Model 2, Model 3, and Model 4) and the average rank of each is marked on the number line. The groups of algorithms whose results are not statistically different are connected together with a heavy horizontal line, i.e. the difference between Models 2 and 3 is not statistically significant, whereas the difference between Models 1 and 2 is.
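The average ranks that are marked on the number line of a critical difference diagram can be sketched as below. The function name and the loss table are hypothetical; the sketch assumes that lower loss is better, that rank 1 is the best model on a dataset, and that tied models share the mean of the ranks they occupy. It does not perform the significance test itself.

```python
def average_ranks(losses):
    """losses[d][m] is the loss of model m on dataset d.

    Returns the mean rank of each model across datasets (1 = best).
    """
    num_models = len(losses[0])
    totals = [0.0] * num_models
    for row in losses:
        order = sorted(range(num_models), key=row.__getitem__)
        position = 0
        while position < num_models:
            # Extend the group while the next model has an identical loss.
            end = position
            while end + 1 < num_models and row[order[end + 1]] == row[order[position]]:
                end += 1
            shared = (position + end) / 2 + 1  # mean 1-based rank of the group
            for k in range(position, end + 1):
                totals[order[k]] += shared
            position = end + 1
    return [t / len(losses) for t in totals]


# Hypothetical losses of three models on three datasets.
ranks = average_ranks([[0.1, 0.2, 0.3], [0.2, 0.2, 0.1], [0.3, 0.1, 0.2]])
```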

## 5 Results

In this section we present and discuss the results from the synthetic, UCI, large and healthcare datasets and conclude by discussing the complete results together.

### 5.1 Synthetic Datasets

#### 5.1.1 Results

We first present our results on synthetic datasets visually since these datasets are in two dimensions. The upper two subfigures of Figure 3 present the results from the four classifiers considered (Ordered Logit (OrdLog), Nested Binary Ordinal Regression (BinNest), Logistic Regression (LogReg), and Structured Ordinal Regression Modeling (StORM)) on the Circle and Spiral datasets (we show the Linear and Sine predictions in the supplementary material). The dots represent instances in a two dimensional space, and the fill colour of each depicts the ground-truth label; darkest blue representing class 1 and darkest red representing class $K$. Additionally, the background colour in these figures represents the predicted ordinal quantities obtained from each model. The colour scheme is shared between the background and fill colours.

Figures 3(a) and 3(b) show the predictions obtained when the ordinal data lies on the Circle and Spiral manifolds respectively on the 5-category dataset. Clearly, due to the limitations of the OrdLog model it cannot perform optimally here. Additionally, the BinNest model does not adapt to the dynamics of the data manifold in this setting either since the greedy learning routine cannot resolve these manifolds (particularly with the Spiral dataset). The non-ordinal LogReg model and the proposed StORM are better able to adapt to the challenges with these data manifolds, with StORM adapting most efficiently.

We have observed noteworthy behaviour with StORM on all synthetic experiments, namely that the space of low-valued predictions tends to be ‘consumed’ by the domain of higher-valued predictions. This phenomenon is illustrated clearly in Figure 3(b) (right) with the Spiral dataset and the StORM model, but can also be observed in Figure 3(a). This is achieved due to the encoding of the labels and is a fundamental property of StORM. This is also a feature of many OrdLog models, but it cannot be guaranteed by the other baselines we consider, e.g. BinNest or LogReg.

Following [23, 5], we quantified performance using two metrics: macro-averaged mean absolute error and macro-averaged mean 0/1 loss. Figure 4 shows the critical difference diagram [25] for the mean zero-one loss (Figure 4(a)) and mean absolute error (Figure 4(b)). (For a description on how to read and interpret critical difference diagrams we refer the reader to Section 4.3 and more generally to [25].) We can see from this figure that the proposed approach is ranked best and that its performance is significantly better than those of the baselines on all performance metrics considered.

#### 5.1.2 Versatile Queries

Since the language of probabilistic graphical models underpins the proposed method, StORM may be queried in a variety of ways. In particular, here we will demonstrate how non-standard queries can be made by visualising the predictive distribution on an edge transition.

The probability distribution over the transition between the $i$-th and $(i+1)$-th positions is given by marginalising over all other positions, i.e.

$$P(Y_i, Y_{i+1}) = \sum_{Y_1} \sum_{Y_2} \dots \sum_{Y_{i-1}} \sum_{Y_{i+2}} \dots \sum_{Y_{K-1}} P(Y_1, Y_2, \dots, Y_{K-1}) \tag{18}$$

and this can be efficiently computed with forward and backward vectors (see Equation 15). Figure 5 depicts the probability distribution over the transition between positions 3 and 4 on a variation of the 10-category Spiral dataset. Regions shaded in blue and red represent regions of low and high predicted probability respectively. The left, middle and right figures each show the joint probability of one of the admissible bit pairs at these positions, i.e. $(Y_3, Y_4) \in \{(0, 0), (1, 0), (1, 1)\}$. These probability distributions can be interpreted as capturing $P(\hat{y} \le 3)$, $P(\hat{y} = 4)$ and $P(\hat{y} \ge 5)$ respectively. To understand why, we can consider the third and fourth tags of the encoded labels for several labels, and observe that the third and fourth tags for $\hat{y} \le 3$ are both 0, for $\hat{y} \ge 5$ are both 1, and for $\hat{y} = 4$ we observe the pair $(1, 0)$ corresponding to Figure 5.

Although the model itself is linear in its parameters, the predictive distribution has adapted to the nonlinear data manifold. In settings with large $K$ (i.e. many ordinal categories) one can easily execute more general queries (e.g. the probability that the label falls within an interval of categories). As discussed in Section 1, this is a common task in clinical settings, e.g. AD patients will first be graded on a large scale before these grades are reduced into important intervals. The predictive distribution over such an interval may be indicative of a particular grade (e.g. ‘moderate’ AD) and can be computed in our model. We demonstrate this visually for a variant of the Spiral dataset in Figure 6(a). Although the focus of this work is on linear settings, we demonstrate the effect of the Nyström kernel approximation [26] of the Radial Basis Function (RBF) kernel in Figure 6(b). We notice that the nature of the data manifold is better captured with this representation and the predictive distribution curves alongside the manifold.

### 5.2 Predictive Performance on UCI Datasets

Predictive performance was also evaluated on several datasets from the UCI Machine Learning Repository [27]. Figure 7 presents the critical difference diagrams for the 0/1 loss (Figure 7(a)) and mean absolute error (Figure 7(b)) over 5 (left) and 10 (right) categories. Figure 7 illustrates that StORM is the best performing model over all metrics with 5 categories, and its performance is significantly better on all metrics with the sole exception of mean squared error. Figure 7 also shows that other models are competitive with StORM on the 10-category datasets. StORM is never significantly worse than the winning model, and is significantly better than the LogReg and BinNest baselines.
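The two losses reported in Figure 7 can be sketched as follows, assuming ordinal labels are encoded as consecutive integers. The example labels are made up; they are chosen so that both prediction vectors have the same 0/1 loss while differing in mean absolute error, which is the distinction that matters for ordinal tasks.

```python
def zero_one_loss(y_true, y_pred):
    """Fraction of instances whose predicted category is wrong."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average distance, in categories, between prediction and truth."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1, 2, 3, 4, 5]
y_near = [1, 3, 3, 5, 5]  # two errors, each off by one category
y_far = [1, 5, 3, 1, 5]   # two errors, each off by three categories

print(zero_one_loss(y_true, y_near), mean_absolute_error(y_true, y_near))
print(zero_one_loss(y_true, y_far), mean_absolute_error(y_true, y_far))
```

The 0/1 loss treats both prediction vectors identically, whereas the mean absolute error penalises the second one three times as heavily because its mistakes land further from the true category.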

We test the performance of StORM with larger datasets (in terms of number of instances and features) with the ‘large’ dataset from [12], and the results of these are shown in Figure 8. These experiments were repeated over 100 randomised folds with 5 and 10 categories. StORM significantly outperforms baseline approaches.

### 5.3 Healthcare Datasets

Finally, we present results on the healthcare datasets. For the CASAS dataset we were unable to produce the same feature representations that were used in the original paper since some of the data is withheld to preserve anonymity. We extracted the duration of the activity, the number of unique sensors, the most commonly triggered sensor, and the number of sensors from each category (presence, door, object, etc.) that were triggered. The task of this dataset is to estimate the ‘incompleteness’ of an AD, with 5 meaning the task was not completed and 1 meaning good completion.

With DementiaBank, we analysed the transcripts of the interviews conducted with the participants and defined an ordinal task on the following order: cognitively healthy, possible dementia, probable dementia, dementia. The transcripts also included annotations of pausing and verbal disfluency. Data representation consisted of counting occurrences and normalising features.

Figure 9 presents the critical difference diagrams for the healthcare datasets. Note that in all cases the critical difference in these figures is larger than in the synthetic and UCI datasets due to the smaller number of datasets here. These experiments have produced a much more competitive set of results, with no one model consistently out-performing the others in a statistically meaningful manner. The StORM model is the best performing model over all tests conducted, even though its performance is not significantly better than ordinal regression.
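The dependence of the critical difference on the number of datasets can be sketched with the Nemenyi formula from Demšar [25], $CD = q_\alpha \sqrt{k(k+1)/(6N)}$, for $k$ models compared over $N$ datasets. This section does not state which post-hoc procedure produced the diagrams, so using this formula is an assumption, and the dataset counts below are hypothetical. The $q_\alpha$ values are the standard Nemenyi critical values at $\alpha = 0.05$.

```python
import math

# Nemenyi critical values q_alpha at alpha = 0.05, indexed by number of models.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def critical_difference(n_models, n_datasets):
    """Minimum average-rank gap that counts as significant (Nemenyi test)."""
    q_alpha = Q_ALPHA_005[n_models]
    return q_alpha * math.sqrt(n_models * (n_models + 1) / (6.0 * n_datasets))


# Four models (three baselines plus StORM); dataset counts are illustrative.
cd_few = critical_difference(4, 2)
cd_many = critical_difference(4, 20)
```

Because $N$ sits in the denominator under the square root, fewer datasets give a wider critical difference, so a given gap in average rank is less likely to register as significant.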

In Figure 10 we show feature embeddings of the CASAS dataset. We show two diagnostic categories from opposite ends of the cognitive spectrum: young volunteers and volunteers with dementia. This visualisation highlights two challenges with this dataset: 1) the class distribution is unequal (far fewer dementia data are available), and 2) there is significant overlap between the classes in this visualisation. As a result, it is not surprising that the difference in performance is not significant, since the task is challenging.

We present the raw classification tables for the healthcare datasets in Table 2. We can observe here that on average the 0/1 and MAE losses are much lower on DementiaBank than on the CASAS dataset. However, the losses are, on average, rather high, due to the challenging learning task.

## 6 Discussion

The main results presented here show that the proposed method (StORM) is a robust and winning model for the prediction of ordinal quantities in most of the settings considered here. On the synthetic datasets (which are used primarily to understand the model in comparison to baselines) we showed visually that our approach is able to adapt to non-linear and challenging data manifolds. Although it is highly unlikely that one will encounter manifolds of the exact form of Figure 3 in real datasets, we also find it highly unlikely that strictly linear manifolds will be encountered in real-life scenarios. We are confident in the utility of the proposed methodology given its robust adaptation to a variety of challenging data manifolds. Although the absolute performance of all models is slightly disappointing on the healthcare datasets, this is illustrative of the data representation challenges that still remain. Indeed, on these datasets some of the most important and discriminatory features (including health records) are withheld to preserve the anonymity of the participants, which makes the classification task harder still. Yet StORM is the best performing model.

StORM is shown to have higher performance in a statistically meaningful way on the synthetic, UCI and large datasets across all categories. In particular, we see that StORM achieves very good results on the large datasets (Figure 8). However, in the case of the UCI datasets we see that for the 10-category datasets StORM is still the highest-performing model but that the baseline ordinal regression model also performs well. It is worth pointing out that many of the datasets within the UCI group were converted into an ordinal task from a regression task. Although the converted datasets still constitute legitimate ordinal challenges, we believe the process of conversion is relatively ‘arbitrary’ and that the groupings given do not necessarily constitute meaningful groups of data. We believe this to be the reason for the absence of statistically meaningful results on the 10-category UCI datasets. However, on the large datasets we see that StORM is comfortably the best model amongst the baselines. We believe that this is driven primarily by the scale of the datasets here: StORM is better able to capture the training data distribution with larger datasets. This makes sense intuitively: since StORM has a larger number of parameters, it will typically require more data.
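To make the conversion from a regression task to an ordinal task concrete, the sketch below assigns continuous targets to equal-frequency bins. Whether the UCI datasets were discretised by equal frequency, equal width, or another rule is not stated here, so this choice is an assumption, and the function name and target values are hypothetical.

```python
def equal_frequency_bins(targets, n_bins):
    """Map continuous targets to ordinal labels 1..n_bins of near-equal size."""
    # Sort indices by target value; ties keep their original order.
    order = sorted(range(len(targets)), key=lambda j: targets[j])
    labels = [0] * len(targets)
    for rank, j in enumerate(order):
        labels[j] = rank * n_bins // len(targets) + 1
    return labels


# Illustrative regression targets, discretised into 5 ordinal categories.
targets = [3.2, 0.1, 7.5, 4.4, 9.9, 1.0, 5.0, 2.2, 8.1, 6.3]
labels = equal_frequency_bins(targets, 5)
```

The bin boundaries here are determined purely by rank, so two instances with nearly identical target values can end up in different categories. This is one way in which such groupings may fail to correspond to meaningful groups of data.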

## 7 Conclusion

In this paper we proposed a structured probabilistic architecture for ordinal regression that is based on a structured encoding of the target variables and undirected graphical models. We have shown empirically that the proposed method (structural ordinal regression modelling) performs significantly better than three baseline methods over several synthetic, UCI and healthcare datasets. Additionally, our proposed framework has several appealing properties: inference can be vectorised over the whole dataset to speed up optimisation, locally and globally consistent abstract queries can be executed on the data, and our model preserves several desirable monotonic features of ordinal models. Future work will investigate non-linear representation methods with the proposed system and compare the proposed techniques against more baseline methods.

## Acknowledgements

This research was conducted under the ‘Continuous Behavioural Biomarkers of Cognitive Impairment’ project funded by the UK Medical Research Council Momentum Awards under Grant MC/PC/16029.

## References

• [1] Sidney Katz. Assessing self-maintenance: activities of daily living, mobility, and instrumental activities of daily living. Journal of the American Geriatrics Society, 31(12):721–727, 1983.
• [2] Prafulla N Dawadi et al. Automated assessment of cognitive health using smart home technologies. Technology and health care, 21(4):323–343, 2013.
• [3] Prabitha Urwyler et al. Cognitive impairment categorized in community-dwelling older adults with and without dementia using in-home sensors that recognise activities of daily living. Scientific Reports, 7:42084, 2017.
• [4] Orla M Doyle et al. Predicting progression of Alzheimer’s disease using ordinal regression. PloS one, 9(8):e105542, 2014.
• [5] Pedro Antonio Gutiérrez et al. Ordinal regression methods: survey and experimental study. IEEE Transactions on Knowledge and Data Engineering, 28(1):127–146, 2016.
• [6] Han-Hsing Tu and Hsuan-Tien Lin. One-sided support vector regression for multiclass cost-sensitive classification. In ICML, pages 1095–1102, 2010.
• [7] Sotiris B Kotsiantis et al. A cost sensitive technique for ordinal classification problems. In Hellenic Conference on Artificial Intelligence, pages 220–229. Springer, 2004.
• [8] Peter McCullagh. Regression models for ordinal data. Journal of the royal statistical society. Series B (Methodological), pages 109–142, 1980.
• [9] Kevin Murphy. Machine learning: a probabilistic perspective, 2012.
• [10] Ralf Herbrich et al. Support vector learning for ordinal regression. IET Conference Proceedings, pages 97–102(5), January 1999.
• [11] Koby Crammer and Yoram Singer. Online ranking by projecting. Neural Computation, 17(1):145–175, 2005.
• [12] Wei Chu and Zoubin Ghahramani. Gaussian processes for ordinal regression. Journal of machine learning research, 6(Jul):1019–1041, 2005.
• [13] Thomas G Dietterich et al. Solving multiclass learning problems via error-correcting output codes. Journal of artificial intelligence research, 2:263–286, 1994.
• [14] Eibe Frank and Mark Hall. A simple approach to ordinal classification. Machine Learning: ECML 2001, pages 145–156, 2001.
• [15] Jaime S Cardoso et al. Learning to classify ordinal data: The data replication method. Journal of Machine Learning Research, 8(Jul):1393–1429, 2007.
• [16] Floriana Esposito, Donato Malerba, V Tamma, and HH Bock. Similarity and dissimilarity. In Analysis of Symbolic Data, pages 139–197. Springer, 2000.
• [17] John D. Lafferty et al. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
• [18] Niall Twomey, Tom Diethe, and Peter Flach. On the need for structure modelling in sequence prediction. Machine Learning, 104(2-3):291–314, 2016.
• [19] Charles Sutton et al. An introduction to conditional random fields. Foundations and Trends® in Machine Learning, 4(4):267–373, 2012.
• [20] James T Becker et al. The natural history of Alzheimer’s disease: description of study cohort and accuracy of diagnosis. Archives of Neurology, 51(6):585–594, 1994.
• [21] Aaron S Crandall and Diane J Cook. Smart home in a box: A large scale smart home deployment. In Intelligent Environments (Workshops), pages 169–178, 2012.
• [22] Prafulla Dawadi et al. An approach to cognitive assessment in smart home. In Proceedings of the 2011 workshop on Data mining for medicine and healthcare, pages 56–59. ACM, 2011.
• [23] Stefano Baccianella et al. Evaluation measures for ordinal regression. In Intelligent Systems Design and Applications, 2009. ISDA’09. Ninth International Conference on, pages 283–287. IEEE, 2009.
• [24] Alessio Benavoli et al. Should we really use post-hoc tests based on mean-ranks? The Journal of Machine Learning Research, 17(1):152–161, 2016.
• [25] Janez Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine learning research, 7(Jan):1–30, 2006.
• [26] Christopher KI Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in neural information processing systems, pages 682–688, 2001.
• [27] Arthur Asuncion et al. UCI machine learning repository, 2007.

## Appendix

Here we present additional visualisations and results tables for the interpretation and reproduction of the main results of this paper.

### .1 Supplementary Figures

Figure 11 shows the predictions of the baseline and proposed methods on the Linear and Sine datasets.

### .2 Supplementary Tables

Tables 3 and 4 present the results on the synthetic datasets on the 5 and 10 category splits respectively; Tables 5 and 6 present the results on the UCI datasets on the 5 and 10 category splits respectively. The first two columns show the dataset and prediction model, and the remaining columns show the scores on 0/1 loss and MAE.
