Abstract
While extraordinary progress has been made towards developing neural network architectures for classification tasks, commonly used loss functions such as the multi-category cross-entropy loss are inadequate for ranking and ordinal regression problems. To address this issue, approaches have been developed that transform ordinal target variables into a series of binary classification tasks, resulting in robust ranking algorithms with good generalization performance. However, to model ordinal information appropriately, ideally, a rank-monotonic prediction function is required such that confidence scores are ordered and consistent. We propose a new framework (Consistent Rank Logits, CORAL) with theoretical guarantees for rank-monotonicity and consistent confidence scores. Through parameter sharing, our framework benefits from low training complexity and can easily be implemented to extend common convolutional neural network classifiers for ordinal regression tasks. Furthermore, our empirical results support the proposed theory and show a substantial improvement compared to the current state-of-the-art ordinal regression method for age prediction from face images.
Consistent Rank Logits for Ordinal Regression with
Convolutional Neural Networks
Wenzhi Cao, Vahid Mirjalili, Sebastian Raschka
Ordinal regression, sometimes also referred to as ordinal classification, describes the task of predicting object labels on an ordinal scale. Here, a ranking rule or classifier $h$ maps each object $x_i \in \mathcal{X}$ into an ordered set, $h: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{Y} = \{r_1 \prec \dots \prec r_K\}$. In contrast to classification, the ranks include ordering information. In comparison with metric regression, which assumes that $\mathcal{Y}$ is a continuous random variable, ordinal regression regards $\mathcal{Y}$ as a finite sequence where the metric distance between ranks is not defined.
Along with age estimation (Niu et al., 2016), popular applications for ordinal regression include predicting the progression of various diseases, such as Alzheimer’s disease (Doyle et al., 2014), Crohn’s disease (Weersma et al., 2009), artery disease (Streifler et al., 1995), and kidney disease (Sigrist et al., 2007). Also, ordinal regression models are common choices for text message advertising (Rettie et al., 2005) and various recommender systems (Parra et al., 2011).
While the field of machine learning has developed many powerful algorithms for predictive modeling, most algorithms were designed for classification tasks. About ten years ago, Li and Lin proposed a general framework for ordinal regression via extended binary classification (Li & Lin, 2007), which has become the standard choice for extending state-of-the-art machine learning algorithms for ordinal regression tasks. However, implementations of extended binary classification for ordinal regression commonly suffer from classifier inconsistencies among the binary rankings (Niu et al., 2016). We address this issue with a new method and theorem for guaranteed classifier consistency that can easily be implemented in various machine learning algorithms. Furthermore, we present an empirical study of our approach on challenging real-world datasets for predicting the age of individuals from face images using our method with convolutional neural networks (CNNs).
The main contributions of our paper are as follows:

the Consistent Rank Logits (CORAL) framework for ordinal regression with theoretical guarantees for classifier consistency and well-defined generalization bounds with and without dataset- and task-specific importance weighting;

CNN architectures with CORAL formulation for ordinal regression tasks that come with the added side benefit of reducing the number of parameters to be trained compared to CNNs for classification;

experimental validation showing that the guaranteed classifier consistency leads to a substantial improvement over the state-of-the-art CNN for ordinal regression applied to age estimation from face images.
Several multivariate extensions of generalized linear models have been developed in the past for ordinal regression, including the popular proportional odds and the proportional hazards models (McCullagh, 1980). Moreover, ordinal regression has become a popular topic of study in the field of machine learning to extend classification algorithms by reformulating the problem to utilize multiple binary classification tasks. Early work in this regard includes the use of perceptrons (Crammer & Singer, 2002; Shen & Joshi, 2005) and Support Vector Machines (Herbrich et al., 1999; Shashua & Levin, 2003; Rajaram et al., 2003; Chu & Keerthi, 2005). A general reduction framework that unified the view of a number of these existing algorithms for ordinal regression was later proposed in (Li & Lin, 2007).
While earlier works on using CNNs for ordinal targets have employed conventional classification approaches (Levi & Hassner, 2015; Rothe et al., 2015), the general reduction framework from ordinal regression to binary classification by (Li & Lin, 2007) was recently adopted by (Niu et al., 2016). In (Niu et al., 2016), an ordinal regression problem with $K$ ranks was transformed into $K-1$ binary classification problems, with the $k$th task predicting whether the age label of a face image exceeds rank $r_k$, $k = 1, \dots, K-1$. Here, all $K-1$ tasks share the same intermediate layers but are assigned distinct weight parameters in the output layer. One issue with this architecture is that for some input images the outputs of the $K-1$ tasks do not agree with each other. Hence, the model does not guarantee that the predictions are consistent. For example, in an age estimation setting, it would be contradictory if the $k$th binary task predicted that the age of a person was larger than 30, but a previous task predicted it was not larger than 20. Such contradictions are suboptimal when the $K-1$ task predictions are combined to obtain the estimated age.
While the ordinal regression CNN yielded state-of-the-art results on an ordinal regression problem such as age estimation, the authors acknowledged the classifier inconsistency as not being ideal but also noted that ensuring that the $K-1$ binary classifiers are consistent would increase the training complexity substantially (Niu et al., 2016). Our proposed method addresses both of these issues with a theoretical guarantee for classifier consistency as well as a reduction of the training complexity.
Due to its broad utility in social networking, video surveillance, and biometric verification, age estimation from human faces is an area of active research. Various techniques have been developed for extracting facial features as inputs to classification or metric regression algorithms (O’Toole et al., 1999; Ramanathan et al., 2009b; Turaga et al., 2010; Kohail, 2012; Wu et al., 2012; Geng et al., 2013).
In recent years, CNN research has rapidly advanced, and CNNs now surpass most traditional methods on image-analysis tasks while not requiring feature extraction beyond standard image preprocessing steps (Krizhevsky et al., 2012; Parkhim & Zisserman, 2015; Canziani et al., 2016). Hence, most state-of-the-art age estimation methods are now utilizing CNN architectures (Rothe et al., 2015; Chen et al., 2016; Niu et al., 2016; Ranjan et al., 2017; Chen et al., 2017).
Related to the idea of training binary classifiers separately and combining the independent predictions for ranking (Frank & Hall, 2001), a modification of the ordinal regression CNN (Niu et al., 2016) was recently proposed for age estimation, called Ranking-CNN, that trains an ensemble of CNNs for binary classifications and aggregates the predictions to predict the age label of a given face image (Chen et al., 2017). The researchers showed that training a series of CNNs improves the predictive performance over a single CNN with multiple binary outputs. However, ensembles of CNNs come with a substantial increase in training complexity and do not guarantee classifier consistency, which means that the individual binary classifiers used for ranking can produce contradictory results. Another approach for utilizing binary classifiers for ordinal regression is the siamese CNN architecture by (Polania et al., 2018). Since this siamese CNN has only a single output neuron, comparisons between the input image and multiple, carefully selected anchor images are required to compute the rank.
Age distribution learning (Pan et al., 2018) has made other notable progress in age estimation; here, the researchers defined a new loss function to penalize the difference between estimated age distributions and the ground truth age labels. Recent research has also shown that training a multi-task CNN for various face analysis tasks, including face detection, gender prediction, and age estimation, can improve the overall performance across different tasks compared to a single-task CNN (Ranjan et al., 2017) by sharing lower-layer parameters. In (Chen et al., 2016), a cascaded convolutional neural network was designed to classify face images into age groups followed by regression modules for more accurate age estimation. In both studies, the authors used metric regression for the age estimation subtasks. While our paper focuses on the comparison of different ordinal regression approaches, we hypothesize that such all-in-one and cascaded CNNs can be further improved by our method, since, as shown in (Niu et al., 2016), ordinal regression CNNs outperform metric regression CNNs in age estimation tasks.
This section describes the proposed CORAL framework that addresses the problem of classifier inconsistency in ordinal regression CNNs based on multiple binary classification tasks for ranking.
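Let $D = \{x_i, y_i\}_{i=1}^{N}$ be the training dataset consisting of $N$ examples. Here, $x_i \in \mathcal{X}$ denotes the $i$th image and $y_i$ denotes the corresponding rank, where $y_i \in \mathcal{Y} = \{r_1, r_2, \dots, r_K\}$ with ordered rank $r_K \succ r_{K-1} \succ \dots \succ r_1$. The symbol $\succ$ denotes the ordering between the ranks. The ordinal regression task is to find a ranking rule $h: \mathcal{X} \to \mathcal{Y}$ such that some loss function $L(h)$ is minimized.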
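Let $C$ be a $K \times K$ cost matrix (Li & Lin, 2007), where $C_{y, r_k}$ is the cost of predicting an example $(x, y)$ as rank $r_k$. Typically, $C_{y, y} = 0$ and $C_{y, r_k} > 0$ for $y \neq r_k$. In ordinal regression, we generally prefer each row of the cost matrix to be V-shaped. That is, $C_{y, r_{k-1}} \geq C_{y, r_k}$ if $r_k \preceq y$ and $C_{y, r_k} \leq C_{y, r_{k+1}}$ if $r_k \succeq y$. The classification cost matrix has entries $C_{y, r_k} = \mathbb{1}\{y \neq r_k\}$, which does not consider ordering information. In ordinal regression, where the ranks are treated as numerical values, the absolute cost matrix is commonly defined by $C_{y, r_k} = |y - r_k|$.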
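As an illustration of the two cost matrices and the V-shape condition, the following minimal sketch builds both matrices for ranks encoded as the integers 0 to K-1 and checks each row. The function names and the integer rank encoding are assumptions made for this example and are not part of the original work.

```python
def classification_cost(K):
    """0/1 cost: every wrong rank costs 1, regardless of its distance to y."""
    return [[0 if y == k else 1 for k in range(K)] for y in range(K)]


def absolute_cost(K):
    """Absolute cost |y - r_k|, with ranks encoded as the integers 0..K-1."""
    return [[abs(y - k) for k in range(K)] for y in range(K)]


def is_v_shaped(C):
    """True if every row is non-increasing up to the diagonal entry
    and non-decreasing from the diagonal entry onwards."""
    K = len(C)
    for y in range(K):
        left = all(C[y][k - 1] >= C[y][k] for k in range(1, y + 1))
        right = all(C[y][k] <= C[y][k + 1] for k in range(y, K - 1))
        if not (left and right):
            return False
    return True
```

Both matrices satisfy the V-shape condition; only the absolute cost grows with the distance between the true and the predicted rank.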
In (Li & Lin, 2007), the researchers proposed a general reduction framework for extending an ordinal regression problem into several binary classification problems. This framework requires the use of a cost matrix that is convex in each row ($C_{y, r_{k+1}} - C_{y, r_k} \geq C_{y, r_k} - C_{y, r_{k-1}}$ for each $y$) to obtain a rank-monotonic threshold model. Since the cost-related weighting of each binary task is specific for each training example, this approach was described as infeasible in practice due to its high training complexity (Niu et al., 2016). Our proposed CORAL framework requires neither a cost matrix with convex-row conditions nor explicit weighting terms that depend on each training example to obtain a rank-monotonic threshold model and to produce consistent predictions for each binary task. Moreover, CORAL allows for an optional task importance weighting, e.g., to adjust for label and class imbalances, which makes it more applicable in practice.
We propose the Consistent Rank Logits (CORAL) model for multi-label CNNs with ordinal responses. Within this framework, the binary tasks produce consistently ranked predictions.
Given the training dataset $D = \{x_i, y_i\}_{i=1}^{N}$, we first extend a rank label $y_i$ into $K-1$ binary labels $y_i^{(1)}, \dots, y_i^{(K-1)}$ such that $y_i^{(k)} \in \{0, 1\}$ indicates whether $y_i$ exceeds rank $r_k$, i.e., $y_i^{(k)} = \mathbb{1}\{y_i > r_k\}$. The indicator function $\mathbb{1}\{\cdot\}$ is $1$ if the inner condition is true, and $0$ otherwise. Using the extended binary labels as training targets, we train a single CNN with $K-1$ binary classifiers in the output layer. Here, the $K-1$ binary tasks share the same weight parameter but have independent bias units, which solves the inconsistency problem among the predicted binary responses and reduces the model complexity.
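The label extension can be sketched in a few lines. The example below assumes that ranks are encoded by their index 1 to K, so that "exceeds rank r_k" becomes an integer comparison; the function name is hypothetical.

```python
def extend_rank_label(rank_index, num_ranks):
    """Return the K-1 binary labels [1{y > r_1}, ..., 1{y > r_{K-1}}].

    Assumes ranks are encoded by their index 1..K (an illustrative convention).
    """
    if not 1 <= rank_index <= num_ranks:
        raise ValueError("rank_index must lie in 1..num_ranks")
    return [1 if rank_index > k else 0 for k in range(1, num_ranks)]
```

For K = 5, the third rank is extended to `[1, 1, 0, 0]`, and the lowest and highest ranks map to all zeros and all ones, respectively. Adding 1 to the sum of the binary labels recovers the original rank index.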
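Based on the binary task responses, the predicted rank for an input $x_i$ is then obtained via
$$h(x_i) = r_q, \qquad q = 1 + \sum_{k=1}^{K-1} f_k(x_i), \tag{1}$$
where $f_k(x_i) \in \{0, 1\}$ is the prediction of the $k$th binary classifier in the output layer. We require that $\{f_k\}_{k=1}^{K-1}$ reflect the ordinal information and are rank-monotonic,
$$f_1(x_i) \geq f_2(x_i) \geq \dots \geq f_{K-1}(x_i), \tag{2}$$
which guarantees that the predictions are consistent.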
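The rank prediction and the consistency requirement can be written down directly. This is a minimal sketch with hypothetical function names, operating on a plain list of binary task outputs.

```python
def predict_rank_index(binary_outputs):
    """Eq. (1): q = 1 + sum_k f_k(x), with each f_k(x) in {0, 1}."""
    return 1 + sum(binary_outputs)


def is_rank_monotonic(binary_outputs):
    """Eq. (2): f_1(x) >= f_2(x) >= ... >= f_{K-1}(x)."""
    return all(a >= b for a, b in zip(binary_outputs, binary_outputs[1:]))
```

The outputs `[1, 1, 0, 0]` and `[1, 0, 1, 0]` both yield rank index 3, but only the first vector is rank-monotonic. The second one claims that the label exceeds the third rank but not the second, which is the kind of contradiction that the consistency requirement rules out.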
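Let $W$ denote the weight parameters of the neural network excluding the bias units of the final layer. The penultimate layer, whose output is denoted as $g(x_i, W)$, shares a single weight with all nodes in the final output layer. $K-1$ independent bias units are then added to $g(x_i, W)$ such that $\{g(x_i, W) + b_k\}_{k=1}^{K-1}$ are the inputs to the corresponding binary classifiers in the final layer. Let $s(z) = 1/(1 + \exp(-z))$ be the logistic sigmoid function. The predicted empirical probability for task $k$ is defined as
$$\hat{P}(y_i^{(k)} = 1) = s(g(x_i, W) + b_k). \tag{3}$$
For model training, we minimize the loss function
$$L(W, b) = -\sum_{i=1}^{N} \sum_{k=1}^{K-1} \lambda^{(k)} \left[ \log\big(s(g(x_i, W) + b_k)\big)\, y_i^{(k)} + \log\big(1 - s(g(x_i, W) + b_k)\big) \big(1 - y_i^{(k)}\big) \right], \tag{4}$$
which is the weighted cross-entropy of $K-1$ binary classifiers. For rank prediction (Eq. 1), the binary labels are obtained via
$$f_k(x_i) = \mathbb{1}\{\hat{P}(y_i^{(k)} = 1) > 0.5\}. \tag{5}$$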
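To make Eqs. (3) to (5) concrete, the following sketch evaluates them in plain Python for scalar penultimate-layer outputs. It is only an illustration under stated assumptions: the function names are hypothetical, `g` stands for the already computed value $g(x_i, W)$, and a practical implementation would use a numerically stable, vectorized formulation in a deep learning framework.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def coral_probabilities(g, biases):
    """Eq. (3): one shared value g = g(x, W) plus one bias per binary task."""
    return [sigmoid(g + b) for b in biases]


def coral_loss(g_values, binary_labels, biases, weights=None):
    """Eq. (4): importance-weighted binary cross-entropy over examples and tasks."""
    if weights is None:
        weights = [1.0] * len(biases)  # uniform task importance
    total = 0.0
    for g, labels in zip(g_values, binary_labels):
        probs = coral_probabilities(g, biases)
        for p, y, lam in zip(probs, labels, weights):
            total -= lam * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total


def binary_predictions(g, biases):
    """Eq. (5): threshold each task probability at 0.5."""
    return [1 if p > 0.5 else 0 for p in coral_probabilities(g, biases)]
```

Because every task sees the same value `g`, the task probabilities differ only through the biases. With non-increasing biases such as `[2.0, 0.0, -2.0]`, the probabilities are non-increasing as well, so the thresholded outputs cannot contradict each other.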
In Eq. (4), $\lambda^{(k)}$ denotes the weight of the loss associated with the $k$th classifier (assuming $\lambda^{(k)} > 0$). In the remainder of the paper, we refer to $\lambda^{(k)}$ as the importance parameter for task $k$. Some tasks may be less robust or harder to optimize, which can be taken into consideration by choosing a non-uniform task weighting scheme. Also, in many real-world applications, features between certain adjacent ranks may have more subtle distinctions. For example, facial aging is commonly regarded as a non-stationary process (Ramanathan et al., 2009a) such that face feature transformations could be more detectable during certain age intervals. Moreover, the relative predictive performance of the binary tasks may also be affected by the degree of binary data imbalance for a given task that occurs as a side effect of extending a rank label into $K-1$ binary labels. Hence, we hypothesize that choosing non-uniform task weighting schemes improves the predictive performance of the overall model. The choice of task importance parameters is covered in more detail below. Next, we provide a theoretical guarantee for classifier consistency under uniform and non-uniform task importance weighting given that the task importance weights are positive numbers.
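In the following theorem, we show that by minimizing the loss $L$ (Eq. 4), the learned bias units of the output layer are non-increasing such that $b_1 \geq b_2 \geq \dots \geq b_{K-1}$. Consequently, the predicted confidence scores or probability estimates of the $K-1$ tasks are non-increasing, i.e., $\hat{P}(y_i^{(1)} = 1) \geq \hat{P}(y_i^{(2)} = 1) \geq \dots \geq \hat{P}(y_i^{(K-1)} = 1)$ for all $i$, ensuring classifier consistency. Hence, $\{f_k\}_{k=1}^{K-1}$ given by Eq. 5 are also rank-monotonic.
Theorem 1 (ordered biases).
By minimizing the loss function defined in Eq. (4), the optimal solution $(W^*, b^*)$ satisfies $b_1^* \geq b_2^* \geq \dots \geq b_{K-1}^*$.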
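Proof.
Suppose $(W, b)$ is an optimal solution and $b_k < b_{k+1}$ for some $k$. Claim: by either replacing $b_k$ with $b_{k+1}$ or replacing $b_{k+1}$ with $b_k$, we can decrease the objective value $L$. Let
$$A_1 = \{n : y_n^{(k)} = y_n^{(k+1)} = 1\}, \quad A_2 = \{n : y_n^{(k)} = y_n^{(k+1)} = 0\}, \quad A_3 = \{n : y_n^{(k)} = 1,\ y_n^{(k+1)} = 0\}.$$
By the ordering relationship we have $A_1 \cup A_2 \cup A_3 = \{1, 2, \dots, N\}$. Denote $p_n(b_k) = s(g(x_n, W) + b_k)$ and
$$\delta_n = \log(p_n(b_{k+1})) - \log(p_n(b_k)), \qquad \delta'_n = \log(1 - p_n(b_k)) - \log(1 - p_n(b_{k+1})).$$
Since $p_n(b_k)$ is increasing in $b_k$, we have $\delta_n > 0$ and $\delta'_n > 0$.
If we replace $b_k$ with $b_{k+1}$, the loss terms related to the $k$th task are updated. The change of loss $L$ (Eq. 4) is given as
$$\Delta_1 L = \lambda^{(k)} \Big[ -\sum_{n \in A_1} \delta_n + \sum_{n \in A_2} \delta'_n - \sum_{n \in A_3} \delta_n \Big].$$
Accordingly, if we replace $b_{k+1}$ with $b_k$, the change of $L$ is given as
$$\Delta_2 L = \lambda^{(k+1)} \Big[ \sum_{n \in A_1} \delta_n - \sum_{n \in A_2} \delta'_n - \sum_{n \in A_3} \delta'_n \Big].$$
By adding $\frac{1}{\lambda^{(k)}} \Delta_1 L$ and $\frac{1}{\lambda^{(k+1)}} \Delta_2 L$, we have
$$\frac{1}{\lambda^{(k)}} \Delta_1 L + \frac{1}{\lambda^{(k+1)}} \Delta_2 L = -\sum_{n \in A_3} (\delta_n + \delta'_n) < 0$$
(the inequality is strict whenever $A_3$ is non-empty, i.e., at least one training example has rank $r_{k+1}$), and know that either $\Delta_1 L < 0$ or $\Delta_2 L < 0$. Thus, our claim is justified, and we conclude that any optimal solution $(W^*, b^*)$ that minimizes $L$ satisfies $b_1^* \geq b_2^* \geq \dots \geq b_{K-1}^*$. ∎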
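The swap argument in the proof can be checked numerically. The sketch below uses a hypothetical toy problem with K = 3 ranks (two binary tasks), deliberately mis-ordered biases, and arbitrary positive task weights; all numbers are assumptions chosen for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(g_values, labels, biases, weights):
    """Weighted binary cross-entropy of Eq. (4) for scalar values g(x_n, W)."""
    total = 0.0
    for g, ys in zip(g_values, labels):
        for b, y, lam in zip(biases, ys, weights):
            p = sigmoid(g + b)
            total -= lam * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total


# Hypothetical toy data: one example in each of the sets A_2, A_3, and A_1.
g_values = [-1.0, 0.2, 1.5]        # shared penultimate outputs g(x_n, W)
labels = [[0, 0], [1, 0], [1, 1]]  # extended binary labels per example
weights = [0.7, 1.0]               # positive task importance weights
biases = [-0.5, 0.8]               # violates the ordering: b_1 < b_2

base = loss(g_values, labels, biases, weights)
delta_1 = loss(g_values, labels, [biases[1], biases[1]], weights) - base  # b_1 <- b_2
delta_2 = loss(g_values, labels, [biases[0], biases[0]], weights) - base  # b_2 <- b_1

# Weight-normalized sum of both changes; only the A_3 example contributes.
combined = delta_1 / weights[0] + delta_2 / weights[1]
```

Since `combined` is negative and both weights are positive, at least one of the two replacements lowers the loss, so mis-ordered biases cannot be optimal for this toy problem.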
Note that the theorem for rank-monotonicity in (Li & Lin, 2007), in contrast to Theorem 1, requires the use of a cost matrix $C$ with each row $y_i$ being convex. Under this convexity condition, let $\lambda_{y_i}^{(k)} = |C_{y_i, r_k} - C_{y_i, r_{k+1}}|$ be the weight of loss of the $k$th task on the $i$th example, which depends on the label $y_i$. In (Li & Lin, 2007), the researchers proved that by using example-specific task weights $\lambda_{y_i}^{(k)}$, the optimal thresholds are ordered. This assumption requires that $\lambda_{y_i}^{(k)} \geq \lambda_{y_i}^{(k+1)}$ when $r_{k+1} \prec y_i$, and $\lambda_{y_i}^{(k)} \leq \lambda_{y_i}^{(k+1)}$ when $r_{k+1} \succ y_i$. Theorem 1 is free from this requirement and allows us to choose a fixed weight for each task that does not depend on the individual training examples, which greatly reduces the training complexity. Moreover, Theorem 1 allows for choosing either a simple uniform task weighting or taking dataset imbalances into account (as described below) while still guaranteeing that the predicted probabilities are non-increasing and the task predictions are consistent.
Based on well-known generalization bounds for binary classification, we can derive new generalization bounds for our ordinal regression approach that apply to a wide range of practical scenarios as we only require $C_{y, y} = 0$ and $C_{y, r_k} > 0$ for $y \neq r_k$. Moreover, Theorem 2 shows that if each binary classification task in our model generalizes well in terms of the standard 0/1-loss, the final rank prediction via $h$ (Eq. 1) also generalizes well.
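Theorem 2 (reduction of generalization error).
Suppose $C$ is the cost matrix of the original ordinal label prediction problem, with $C_{y, y} = 0$ and $C_{y, r_k} > 0$ for $y \neq r_k$. $P$ is the underlying distribution of $(x, y)$, i.e., $(x, y) \sim P$. If the binary classification rules $\{f_k\}_{k=1}^{K-1}$ are rank-monotonic, then
$$\mathbb{E}_{(x, y) \sim P}\, C_{y, h(x)} \leq \sum_{k=1}^{K-1} \mathbb{E}_{(x, y) \sim P} \Big[ |C_{y, r_k} - C_{y, r_{k+1}}| \, \mathbb{1}\{f_k(x) \neq y^{(k)}\} \Big]. \tag{6}$$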
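Proof.
For any $x \in \mathcal{X}$, we have
$$f_1(x) \geq f_2(x) \geq \dots \geq f_{K-1}(x).$$
If $h(x) = y$, then $C_{y, h(x)} = 0$.
If $h(x) = r_q \prec y = r_s$, then $q < s$. We have $f_1(x) = f_2(x) = \dots = f_{q-1}(x) = 1$ and $f_q(x) = f_{q+1}(x) = \dots = f_{K-1}(x) = 0$. Also, $y^{(1)} = y^{(2)} = \dots = y^{(s-1)} = 1$ and $y^{(s)} = y^{(s+1)} = \dots = y^{(K-1)} = 0$. Thus, $\mathbb{1}\{f_k(x) \neq y^{(k)}\} = 1$ if and only if $q \leq k \leq s-1$. Since $C_{y, y} = 0$,
$$C_{y, h(x)} = \sum_{k=q}^{s-1} (C_{y, r_k} - C_{y, r_{k+1}}) \, \mathbb{1}\{f_k(x) \neq y^{(k)}\} \leq \sum_{k=q}^{s-1} |C_{y, r_k} - C_{y, r_{k+1}}| \, \mathbb{1}\{f_k(x) \neq y^{(k)}\} \leq \sum_{k=1}^{K-1} |C_{y, r_k} - C_{y, r_{k+1}}| \, \mathbb{1}\{f_k(x) \neq y^{(k)}\}.$$
Similarly, if $h(x) = r_q \succ y = r_s$, then $q > s$ and
$$C_{y, h(x)} = \sum_{k=s}^{q-1} (C_{y, r_{k+1}} - C_{y, r_k}) \, \mathbb{1}\{f_k(x) \neq y^{(k)}\} \leq \sum_{k=1}^{K-1} |C_{y, r_k} - C_{y, r_{k+1}}| \, \mathbb{1}\{f_k(x) \neq y^{(k)}\}.$$
In any case, we have
$$C_{y, h(x)} \leq \sum_{k=1}^{K-1} |C_{y, r_k} - C_{y, r_{k+1}}| \, \mathbb{1}\{f_k(x) \neq y^{(k)}\}.$$
By taking the expectation on both sides with $(x, y) \sim P$, we arrive at Eq. (6). ∎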
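The pointwise inequality at the end of the proof can be verified exhaustively for a small cost matrix. The sketch below assumes 0-based rank indices and enumerates every true rank together with every rank-monotonic output vector (which is fully determined by the predicted rank); the function name is hypothetical.

```python
def bound_holds(C):
    """Check C[y][h(x)] <= sum_k |C[y][k] - C[y][k+1]| * 1{f_k(x) != y^(k)}
    for every true rank y and every rank-monotonic prediction."""
    K = len(C)
    for y in range(K):                                  # true rank (0-based index)
        y_bin = [1 if y > k else 0 for k in range(K - 1)]
        for q in range(K):                              # predicted rank (0-based index)
            f = [1 if q > k else 0 for k in range(K - 1)]  # rank-monotonic outputs
            rhs = sum(abs(C[y][k] - C[y][k + 1]) * (f[k] != y_bin[k])
                      for k in range(K - 1))
            if C[y][q] > rhs + 1e-12:
                return False
    return True
```

The check passes for the absolute and the 0/1 cost matrices as well as for a cost matrix without V-shaped rows, in line with the theorem requiring only a zero diagonal. It fails for a matrix with a non-zero diagonal, such as `[[1, 1], [1, 1]]`.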
In (Li & Lin, 2007), by assuming the cost matrix to have V-shaped rows, the researchers define generalization bounds by constructing a discrete distribution on $\{1, 2, \dots, K-1\}$ conditional on each $y$, given that the binary classifications are rank-monotonic or every row of $C$ is convex. However, the only case they provided for the existence of rank-monotonic binary classifiers was the ordered threshold model, which requires a cost matrix with convex rows and example-specific task weights. Our result does not rely on cost matrices with V-shaped or convex rows and can be applied to a broader variety of real-world use cases.
According to Theorem 1, minimizing the loss of the CORAL model guarantees that the bias units are non-increasing and thus the binary classifiers are consistent as long as the task importance parameters are positive ($\forall k: \lambda^{(k)} > 0$).
We first experimented with a weighting scheme proposed in (Niu et al., 2016) that aims to address the class imbalance in the face image datasets. However, compared to using a uniform scheme ($\forall k: \lambda^{(k)} = 1$), we found that it had a negative effect on the predictive performance for all models evaluated in this study.
Hence, we propose a weighting scheme that takes the rank distribution of the training examples into account but also considers the label imbalance for each classification task after extending the original ranks into binary labels. Specifically, our task weighting scheme (under which CORAL still guarantees classifier consistency) is defined as follows. Let $S_k = \sum_{i=1}^{N} \mathbb{1}\{y_i^{(k)} = 1\}$ be the number of examples whose ranks exceed $r_k$. By the rank ordering we have $S_1 \geq S_2 \geq \dots \geq S_{K-1}$. Let $M_k = \max(S_k, N - S_k)$ be the number of majority binary labels for each task. We define the importance of the $k$th task as the scaled $\sqrt{M_k}$:
$$\lambda^{(k)} = \frac{\sqrt{M_k}}{\max_{1 \leq i \leq K-1} \sqrt{M_i}}. \tag{7}$$
Under this weighting scheme, the general class imbalance of a dataset is taken into account. Moreover, in our examples, classification tasks corresponding to the edges of the distribution of unique rank labels receive a higher weight than the classification tasks that see more balanced rank label vectors during training (Figure 1), which may help improve the predictive performance of the model. The lowest weight may not always be assigned to the center rank: if $S_{K-1} > N - S_{K-1}$, the last task has the lowest weight, and if $S_1 < N - S_1$, the first task has the lowest weight. Note that the task importance weighting is only used for model parameter optimization; when computing the predicted rank by adding the binary results (Eq. 1), each task has the same influence on the final rank prediction. Since $M_k \geq N/2$ and therefore $\lambda^{(k)} \geq 1/\sqrt{2}$, the scheme prevents tasks from having negligible weights as in (Niu et al., 2016) when a dataset contains only a small number of examples for certain ranks. We provide an empirical comparison between a uniform task weighting and task weighting according to Eq. (7) in the experiments below.
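The weighting scheme of Eq. (7) amounts to a few counting steps. The following sketch assumes that training ranks are given as indices 1 to K; the function name and the toy label list in the usage note are hypothetical.

```python
import math


def task_importance_weights(rank_indices, num_ranks):
    """Eq. (7): lambda_k = sqrt(M_k) / max_i sqrt(M_i), with M_k = max(S_k, N - S_k)."""
    n = len(rank_indices)
    # S_k: number of training examples whose rank exceeds r_k, for k = 1..K-1.
    s = [sum(1 for y in rank_indices if y > k) for k in range(1, num_ranks)]
    # M_k: size of the majority binary class of task k.
    m = [max(s_k, n - s_k) for s_k in s]
    largest = math.sqrt(max(m))
    return [math.sqrt(m_k) / largest for m_k in m]
```

For the toy labels `[1, 2, 2, 3, 3, 3, 4, 4, 5, 5]` with K = 5, the counts are S = (9, 7, 4, 2) and M = (9, 7, 6, 8). The first task therefore receives weight 1, the third task receives the lowest weight, and no weight falls below $1/\sqrt{2}$.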
The MORPH-2 dataset (Ricanek & Tesafaye, 2006) (55,608 face images) was preprocessed by locating the average eye position in the respective dataset using facial landmark detection (Sagonas et al., 2016) via MLxtend (Raschka, 2018) and then aligning each image in the dataset to the average eye position. The faces were then re-aligned such that the tip of the nose was located in the center of each image. The age labels used in this study ranged from 16 to 70 years. The CACD database (Chen et al., 2014) was preprocessed similarly to MORPH-2 such that the faces spanned the whole image with the nose tip being in the center. The total number of images is 159,449 in the age range of 14 to 62 years.
Since the faces were already centered in the Asian Face Database (AFAD; 165,501 faces with age labels between 15 and 40 years) (Niu et al., 2016), no further alignment was applied. The UTKFace database (Zhang & Qi, 2017) was also available in a preprocessed form such that no additional steps were required. In this study, we considered face images with age labels between 21 and 60 years (16,434 images).
Each image database was randomly divided into 80% training data and 20% test data. All images were resized to 128x128x3 pixels and then randomly cropped to 120x120x3 pixels to augment the model training. During model evaluation, the 128x128x3 face images were center-cropped to a model input size of 120x120x3.
To evaluate the performance of CORAL for age estimation from face images, we chose the ResNet-34 architecture (He et al., 2016), which is a modern CNN architecture that is known for achieving good performance on a variety of image classification tasks. For the remainder of this paper, we refer to the original ResNet-34 CNN with cross-entropy loss as CE-CNN. To implement CORAL, we replaced the last output layer with the corresponding binary tasks (Figure 2) and refer to this CNN as CORAL-CNN. Similar to CORAL-CNN, we replaced the cross-entropy layer of the ResNet-34 with the binary tasks for ordinal regression described in (Niu et al., 2016) and refer to this architecture as OR-CNN.
For model evaluation and comparison, we computed the mean absolute error (MAE) and root mean squared error (RMSE), which are standard metrics used for crowd counting and age prediction:
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - h(x_i)|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \big(y_i - h(x_i)\big)^2}, \tag{8}$$
where $y_i$ is the ground truth rank of the $i$th test example and $h(x_i)$ is the predicted rank, respectively. The MAE and RMSE values reported in this study were computed on the test set after the last training epoch. The training was repeated three times with different random seeds for model weight initialization while the random seeds were consistent between the different methods to allow for fair comparisons. All CNNs were trained for 200 epochs with stochastic gradient descent via adaptive moment estimation (Kingma & Ba, 2015) using exponential decay rates $\beta_1$ and $\beta_2$ (PyTorch default) and learning rate $\alpha$.
In addition, we computed the Cumulative Score (CS) as the proportion of images for which the absolute differences between the predicted rank labels and the ground truth do not exceed a threshold $T$:
$$\mathrm{CS}(T) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{|y_i - h(x_i)| \leq T\}. \tag{9}$$
By varying the threshold $T$, CS curves were plotted to compare the predictive performances of the different age prediction models (the larger the area under the curve, the better).
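The three evaluation metrics can be computed as follows. This is a minimal sketch on plain Python lists with hypothetical function names; it assumes that the Cumulative Score counts errors up to and including the threshold.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ranks."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ranks."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def cumulative_score(y_true, y_pred, threshold):
    """Share of examples whose absolute error is at most `threshold`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= threshold)
    return hits / len(y_true)
```

For true ages `[20, 30, 40, 50]` and predictions `[22, 30, 37, 55]`, the absolute errors are 2, 0, 3, and 5, which gives an MAE of 2.5 and a Cumulative Score of 0.75 at a threshold of 3 years.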
All loss functions and neural network models were implemented in PyTorch 1.0 (Paszke et al., 2017) and trained on NVIDIA GeForce 1080Ti and Titan V graphics cards. The source code is available at https://github.com/Raschka-research-group/coral-cnn.
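| Method | Seed | MORPH-2 MAE | MORPH-2 RMSE | AFAD MAE | AFAD RMSE | UTKFace MAE | UTKFace RMSE | CACD MAE | CACD RMSE |
|---|---|---|---|---|---|---|---|---|---|
| CE-CNN | 0 | 3.40 | 4.88 | 3.98 | 5.55 | 6.57 | 9.16 | 6.18 | 8.86 |
| CE-CNN | 1 | 3.39 | 4.87 | 4.00 | 5.57 | 6.24 | 8.69 | 6.10 | 8.79 |
| CE-CNN | 2 | 3.37 | 4.87 | 3.96 | 5.50 | 6.29 | 8.78 | 6.13 | 8.87 |
| CE-CNN | AVG ± SD | 3.39 ± 0.02 | 4.89 ± 0.01 | 3.98 ± 0.02 | 5.54 ± 0.04 | 6.37 ± 0.18 | 8.88 ± 0.25 | 6.14 ± 0.04 | 8.84 ± 0.04 |
| OR-CNN | 0 | 2.98 | 4.26 | 3.66 | 5.10 | 5.71 | 8.11 | 5.53 | 7.91 |
| OR-CNN | 1 | 2.98 | 4.26 | 3.69 | 5.13 | 5.80 | 8.12 | 5.53 | 7.98 |
| OR-CNN | 2 | 2.96 | 4.20 | 3.68 | 5.14 | 5.71 | 8.11 | 5.49 | 7.89 |
| OR-CNN | AVG ± SD | 2.97 ± 0.01 | 4.24 ± 0.03 | 3.68 ± 0.02 | 5.13 ± 0.02 | 5.74 ± 0.05 | 8.08 ± 0.06 | 5.52 ± 0.02 | 7.93 ± 0.05 |
| CORAL-CNN | 0 | 2.68 | 3.75 | 3.49 | 4.82 | 5.46 | 7.61 | 5.56 | 7.80 |
| CORAL-CNN | 1 | 2.63 | 3.66 | 3.46 | 4.83 | 5.46 | 7.63 | 5.37 | 7.64 |
| CORAL-CNN | 2 | 2.61 | 3.64 | 3.52 | 4.91 | 5.48 | 7.63 | 5.25 | 7.53 |
| CORAL-CNN | AVG ± SD | 2.64 ± 0.04 | 3.68 ± 0.06 | 3.49 ± 0.03 | 4.85 ± 0.05 | 5.47 ± 0.01 | 7.62 ± 0.01 | 5.39 ± 0.16 | 7.66 ± 0.14 |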
We conducted a series of experiments on four independent face image datasets for age estimation (described above) to compare our CORAL approach (CORAL-CNN) with the ordinal regression approach described in (Niu et al., 2016), denoted as OR-CNN. All implementations were based on the ResNet-34 architecture as described above, including the standard ResNet-34 with cross-entropy loss (CE-CNN) as performance baseline.
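| Method | Task importance weighting | MORPH-2 MAE | AFAD MAE | UTKFace MAE | CACD MAE |
|---|---|---|---|---|---|
| OR-CNN | NO | 2.97 ± 0.01 | 3.68 ± 0.02 | 5.74 ± 0.05 | 5.52 ± 0.02 |
| OR-CNN | YES | 2.91 ± 0.02 | 3.65 ± 0.03 | 5.76 ± 0.19 | 5.49 ± 0.02 |
| CORAL-CNN | NO | 2.64 ± 0.04 | 3.49 ± 0.03 | 5.47 ± 0.01 | 5.39 ± 0.16 |
| CORAL-CNN | YES | 2.59 ± 0.03 | 3.48 ± 0.03 | 5.39 ± 0.07 | 5.35 ± 0.09 |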
First, we note that for all methods, the overall predictive performance on the different datasets appears in the following order: MORPH-2 > AFAD > CACD > UTKFace (Table 1 and Figure 3). Possible reasons why all approaches perform best on MORPH-2 are that MORPH-2 has the best overall image quality and relatively consistent lighting conditions and viewing angles. For instance, we found that AFAD includes some images of particularly low resolution (e.g., 20x20). While UTKFace and CACD also contain some lower-quality images, a possible reason why the methods perform worse on UTKFace compared to AFAD is that UTKFace is about ten times smaller than AFAD. While CACD has approximately the same size as AFAD, the lower performance can be explained by the wider age range that needs to be considered (14 to 62 years in CACD compared to 15 to 40 years in AFAD).
Across all datasets (Table 1 and Figure 3), we found that both OR-CNN and CORAL-CNN outperform the standard cross-entropy loss (CE-CNN) on these ordinal regression tasks, as expected. Similarly, as summarized in Table 1 and Figure 3, our CORAL method shows a substantial improvement over the current state-of-the-art ordinal regression method (OR-CNN) by (Niu et al., 2016), which does not guarantee classifier consistency. Moreover, we repeated each experiment three times using different random seeds for model weight initialization and dataset shuffling, to ensure that the observed performance improvement of CORAL-CNN over OR-CNN is reproducible and not coincidental. Furthermore, along with providing the theoretical proof for classifier consistency in CORAL-CNN (Theorem 1), we also empirically verified that the bias units of the CORAL-CNN output layers were indeed ordered after model training, in contrast to OR-CNN. From these results, we can conclude that guaranteed classifier consistency via CORAL has a substantial, positive effect on the predictive performance of an ordinal regression CNN.
While all results described in the previous section are based on experiments without task importance weighting (i.e., $\forall k: \lambda^{(k)} = 1$), we repeated all experiments using our weighting scheme proposed above (Eq. 7), which takes label imbalances into account. Note that according to Theorem 1, CORAL still guarantees classifier consistency under any chosen task weighting scheme as long as weights are assigned positive values. From the results provided in Table 2, we find that by using a task weighting scheme that also takes label imbalances into account, we can further improve the performance of CORAL-CNNs across all four datasets.
In this paper, we developed the CORAL framework for ordinal regression via extended binary classification with theoretical guarantees for classifier consistency. Moreover, we proved classifier consistency without requiring rank- or training label-dependent weighting schemes, which permits straightforward implementations and efficient model training. Furthermore, the theoretical generalization bounds assure that if the binary tasks generalize well, then the final rank prediction also generalizes well. We also showed that CORAL could be readily implemented to extend CNNs for ordinal regression tasks and evaluated it empirically on four large image databases for predicting the apparent age from face images. The results unequivocally showed that the guaranteed classifier consistency via CORAL substantially improved the predictive performance of CNNs for age estimation. While we evaluated the CORAL framework in an end-to-end learning approach using CNNs for age estimation, our method can be readily generalized to other ordinal regression problems and different types of neural network architectures, including multilayer perceptrons and recurrent neural networks.
Support for this research was provided by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin-Madison with funding from the Wisconsin Alumni Research Foundation. Also, we thank the NVIDIA Corporation for a generous donation via an NVIDIA GPU grant to support this study.
References
 Canziani et al. (2016) Canziani, A., Paszke, A., and Culurciello, E. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678, 2016.
 Chen et al. (2014) Chen, B.-C., Chen, C.-S., and Hsu, W. H. Cross-age reference coding for age-invariant face recognition and retrieval. In Proceedings of the European Conference on Computer Vision, pp. 768–783. Springer, 2014.
 Chen et al. (2016) Chen, J.-C., Kumar, A., Ranjan, R., Patel, V. M., Alavi, A., and Chellappa, R. A cascaded convolutional neural network for age estimation of unconstrained faces. In Proceedings of the IEEE Conference on Biometrics Theory, Applications and Systems, pp. 1–8, 2016.
 Chen et al. (2017) Chen, S., Zhang, C., Dong, M., Le, J., and Rao, M. Using Ranking-CNN for age estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5183–5192, 2017.
 Chu & Keerthi (2005) Chu, W. and Keerthi, S. S. New approaches to support vector ordinal regression. In Proceedings of the International Conference on Machine Learning, pp. 145–152. ACM, 2005.
 Crammer & Singer (2002) Crammer, K. and Singer, Y. Pranking with ranking. In Advances in Neural Information Processing Systems, pp. 641–647, 2002.
 Doyle et al. (2014) Doyle, O. M., Westman, E., Marquand, A. F., Mecocci, P., Vellas, B., Tsolaki, M., Kłoszewska, I., Soininen, H., Lovestone, S., Williams, S. C., et al. Predicting progression of Alzheimer’s disease using ordinal regression. PloS one, 9(8):e105542, 2014.
 Frank & Hall (2001) Frank, E. and Hall, M. A simple approach to ordinal classification. In Proceedings of the European Conference on Machine Learning, pp. 145–156. Springer, 2001.
 Geng et al. (2013) Geng, X., Yin, C., and Zhou, Z.-H. Facial age estimation by learning from label distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(10):2401–2412, 2013.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
 Herbrich et al. (1999) Herbrich, R., Graepel, T., and Obermayer, K. Support vector learning for ordinal regression. In Proceedings of the IET Conference on Artificial Neural Networks, volume 1, pp. 97–102, 1999.
 Kingma & Ba (2015) Kingma, D. P. and Ba, J. L. Adam: A method for stochastic optimization. In Proceedings of the Conference on Learning Representations, 2015.
 Kohail (2012) Kohail, S. N. Using artificial neural network for human age estimation based on facial images. In Proceedings of the IEEE Conference on Innovations in Information Technology, pp. 215–219, 2012.
 Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
 Levi & Hassner (2015) Levi, G. and Hassner, T. Age and gender classification using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 34–42, 2015.
 Li & Lin (2007) Li, L. and Lin, H.-T. Ordinal regression by extended binary classification. In Advances in Neural Information Processing Systems, pp. 865–872, 2007.
 McCullagh (1980) McCullagh, P. Regression models for ordinal data. Journal of the Royal Statistical Society. Series B (Methodological), pp. 109–142, 1980.
 Niu et al. (2016) Niu, Z., Zhou, M., Wang, L., Gao, X., and Hua, G. Ordinal regression with multiple output CNN for age estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4920–4928, 2016.
 O’Toole et al. (1999) O’Toole, A. J., Price, T., Vetter, T., Bartlett, J. C., and Blanz, V. 3D shape and 2D surface textures of human faces: The role of “averages” in attractiveness and age. Image and Vision Computing, 18(1):9–19, 1999.
 Pan et al. (2018) Pan, H., Han, H., Shan, S., and Chen, X. Mean-variance loss for deep age estimation from a face. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5285–5294, 2018.
 Parkhim & Zisserman (2015) Parkhi, O. M., Vedaldi, A., and Zisserman, A. Deep face recognition. In Proceedings of the British Machine Vision Conference, volume 3, 2015.
 Parra et al. (2011) Parra, D., Karatzoglou, A., Amatriain, X., and Yavuz, I. Implicit feedback recommendation via implicit-to-explicit ordinal logistic regression mapping. Proceedings of the CARS Workshop of the Conference of Recommender Systems, pp. 5, 2011.
 Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In Neural Information Processing Systems Autodiff Workshop, 2017.
 Polania et al. (2018) Polania, L., Wang, D., and Fung, G. Ordinal regression using noisy pairwise comparisons for body mass index range estimation. arXiv preprint arXiv:1811.03268, 2018.
 Rajaram et al. (2003) Rajaram, S., Garg, A., Zhou, X. S., and Huang, T. S. Classification approach towards ranking and sorting problems. In Proceedings of the European Conference on Machine Learning, pp. 301–312. Springer, 2003.
 Ramanathan et al. (2009a) Ramanathan, N., Chellappa, R., and Biswas, S. Age progression in human faces: A survey. Journal of Visual Languages and Computing, 15:3349–3361, 2009a.
 Ramanathan et al. (2009b) Ramanathan, N., Chellappa, R., and Biswas, S. Computational methods for modeling facial aging: A survey. Journal of Visual Languages & Computing, 20(3):131–144, 2009b.
 Ranjan et al. (2017) Ranjan, R., Sankaranarayanan, S., Castillo, C. D., and Chellappa, R. An all-in-one convolutional neural network for face analysis. In Proceedings of the IEEE Conference on Automatic Face & Gesture Recognition, pp. 17–24, 2017.
 Raschka (2018) Raschka, S. MLxtend: Providing machine learning and data science utilities and extensions to Python’s scientific computing stack. The Journal of Open Source Software, 3(24), 2018.
 Rettie et al. (2005) Rettie, R., Grandcolas, U., and Deakins, B. Text message advertising: Response rates and branding effects. Journal of Targeting, Measurement and Analysis for Marketing, 13(4):304–312, 2005.
 Ricanek & Tesafaye (2006) Ricanek, K. and Tesafaye, T. Morph: A longitudinal image database of normal adult age-progression. In Proceedings of the IEEE Conference on Automatic Face and Gesture Recognition, pp. 341–345, 2006.
 Rothe et al. (2015) Rothe, R., Timofte, R., and Van Gool, L. DEX: Deep expectation of apparent age from a single image. In Proceedings of the IEEE Conference on Computer Vision Workshops, pp. 10–15, 2015.
 Sagonas et al. (2016) Sagonas, C., Antonakos, E., Tzimiropoulos, G., Zafeiriou, S., and Pantic, M. 300 faces in-the-wild challenge: database and results. Image and Vision Computing, 47:3–18, 2016.
 Shashua & Levin (2003) Shashua, A. and Levin, A. Ranking with large margin principle: Two approaches. In Advances in Neural Information Processing Systems, pp. 961–968, 2003.
 Shen & Joshi (2005) Shen, L. and Joshi, A. K. Ranking and reranking with perceptron. Machine Learning, 60(1–3):73–96, 2005.
 Sigrist et al. (2007) Sigrist, M. K., Taal, M. W., Bungay, P., and McIntyre, C. W. Progressive vascular calcification over 2 years is associated with arterial stiffening and increased mortality in patients with stages 4 and 5 chronic kidney disease. Clinical Journal of the American Society of Nephrology, 2(6):1241–1248, 2007.
 Streifler et al. (1995) Streifler, J. Y., Eliasziw, M., Benavente, O. R., Hachinski, V. C., Fox, A. J., and Barnett, H. Lack of relationship between leukoaraiosis and carotid artery disease. Archives of Neurology, 52(1):21–24, 1995.
 Turaga et al. (2010) Turaga, P., Biswas, S., and Chellappa, R. The role of geometry in age estimation. In Proceedings of the IEEE Conference on Acoustics Speech and Signal Processing, pp. 946–949, 2010.
 Weersma et al. (2009) Weersma, R. K., Stokkers, P. C., van Bodegraven, A. A., van Hogezand, R. A., Verspaget, H. W., de Jong, D. J., Van Der Woude, C., Oldenburg, B., Linskens, R., and Festen, E. Molecular prediction of disease risk and severity in a large Dutch Crohn’s disease cohort. Gut, 58(3):388–395, 2009.
 Wu et al. (2012) Wu, T., Turaga, P., and Chellappa, R. Age estimation and face verification across aging using landmarks. Proceedings of the IEEE Conference on Transactions on Information Forensics and Security, 7:1780–1788, 2012.
 Zhang & Qi (2017) Zhang, Z., Song, Y., and Qi, H. Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.