Abstract
Ordinal regression is an important type of learning, which has properties of both classification
and regression. Here we describe a simple and effective approach to adapt a traditional neural network to
learn ordinal categories. Our approach is a generalization of the perceptron method for ordinal regression.
On several benchmark datasets, our method (NNRank) outperforms a
neural network classification method. Compared with the ordinal regression methods
using Gaussian processes and support vector machines, NNRank achieves comparable
performance. Moreover, NNRank has the advantages of traditional neural networks: learning
in both online and batch modes, handling very large training datasets, and making rapid predictions.
These features make NNRank a useful and complementary tool for large-scale data processing tasks such as information
retrieval, web page ranking, collaborative filtering, and protein ranking in bioinformatics.
A Neural Network Approach to Ordinal Regression
Jianlin Cheng jcheng@cs.ucf.edu
School of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816, USA
Ordinal regression (or ranking learning) is an important supervised problem of learning a ranking or ordering on instances, which has properties of both classification and metric regression. The learning task of ordinal regression is to assign data points to a finite set of ordered categories. For example, a teacher rates students' performance using A, B, C, D, and E ($A \succ B \succ C \succ D \succ E$) (Chu & Ghahramani, 2005a). Ordinal regression differs from classification because the categories are ordered. In contrast to metric regression, the response variables (categories) in ordinal regression are discrete and finite.
Research on ordinal regression dates back to the ordinal statistics methods of the 1980s (McCullagh, 1980; McCullagh & Nelder, 1983) and machine learning research of the 1990s (Caruana et al., 1996; Herbrich et al., 1998; Cohen et al., 1999). It has attracted considerable attention in recent years due to its potential applications in many data-intensive domains such as information retrieval (Herbrich et al., 1998), web page ranking (Joachims, 2002), collaborative filtering (Goldberg et al., 1992; Basilico & Hofmann, 2004; Yu et al., 2006), image retrieval (Wu et al., 2003), and protein ranking (Cheng & Baldi, 2006) in bioinformatics.
A number of machine learning methods have been developed or redesigned to address the ordinal regression problem (Rajaram et al., 2003), including the perceptron (Crammer & Singer, 2002) and its kernelized generalization (Basilico & Hofmann, 2004), neural networks with gradient descent (Caruana et al., 1996; Burges et al., 2005), Gaussian processes (Chu & Ghahramani, 2005b; Chu & Ghahramani, 2005a; Schwaighofer et al., 2005), large-margin classifiers (or support vector machines) (Herbrich et al., 1999; Herbrich et al., 2000; Joachims, 2002; Shashua & Levin, 2003; Chu & Keerthi, 2005; Aiolli & Sperduti, 2004; Chu & Keerthi, 2007), the k-partite classifier (Agarwal & Roth, 2005), boosting algorithms (Freund et al., 2003; Dekel et al., 2002), constraint classification (Har-Peled et al., 2002), regression trees (Kramer et al., 2001), Naive Bayes (Zhang et al., 2005), Bayesian hierarchical experts (Paquet et al., 2005), the binary classification approach (Frank & Hall, 2001; Li & Lin, 2006) that decomposes the original ordinal regression problem into a set of binary classifications, and the optimization of non-smooth cost functions (Burges et al., 2006).
Most of these methods can be roughly classified into two categories: the pairwise constraint approach (Herbrich et al., 2000; Joachims, 2002; Dekel et al., 2004; Burges et al., 2005) and the multi-threshold approach (Crammer & Singer, 2002; Shashua & Levin, 2003; Chu & Ghahramani, 2005a). The former converts the full ranking relation into pairwise order constraints. The latter learns multiple thresholds that divide the data into ordinal categories. Multi-threshold approaches can also be unified under the general, extended binary classification framework (Li & Lin, 2006).
The ordinal regression methods have different advantages and disadvantages. Prank (Crammer & Singer, 2002), a perceptron approach that generalizes the binary perceptron algorithm to the ordinal multi-class situation, is a fast online algorithm. However, like a standard perceptron method, its accuracy suffers when dealing with nonlinear data, although a quadratic kernel version of Prank greatly relieves this problem. One class of accurate large-margin classifier approaches (Herbrich et al., 2000; Joachims, 2002) converts the ordinal relations into $O(n^2)$ ($n$: the number of data points) pairwise ranking constraints for structural risk minimization (Vapnik, 1995; Schölkopf & Smola, 2002). Thus, it cannot be applied to medium-size datasets ($> 10{,}000$ data points) without discarding some pairwise preference relations. It may also overfit noise due to incomparable pairs.
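To make the quadratic growth of the pairwise formulation concrete, the following minimal sketch counts how many pairs of points carry different ordinal categories (and hence could yield an order constraint). The helper name `count_ordered_pairs` and the assumption of five equally frequent categories are illustrative only and do not come from the cited methods.

```python
from collections import Counter


def count_ordered_pairs(categories):
    """Number of pairs of points whose ordinal categories differ."""
    n = len(categories)
    # Pairs inside the same category carry no order information.
    same = sum(c * (c - 1) // 2 for c in Counter(categories).values())
    return n * (n - 1) // 2 - same


for n in (100, 1000, 10000):
    # Hypothetical labels: five equally frequent categories 1..5.
    labels = [1 + i % 5 for i in range(n)]
    print(n, count_ordered_pairs(labels))
# 100 4000
# 1000 400000
# 10000 40000000
```

Under this assumption the count is roughly $0.4\,n^2$, so a tenfold increase in data gives a hundredfold increase in candidate constraints.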
The other class of powerful large-margin classifier methods (Shashua & Levin, 2003; Chu & Keerthi, 2005) generalizes the support vector formulation for ordinal regression by finding thresholds on the real line that divide data into ordered categories. The size of this optimization problem is linear in the number of training examples. However, as with support vector machines used for classification, prediction is slow when the solution is not sparse, which makes the approach inappropriate for time-critical tasks. Similarly, another state-of-the-art approach, the Gaussian process method (Chu & Ghahramani, 2005a), also has difficulty handling large training datasets and suffers from slow prediction in some situations.
Here we describe a new neural network approach for ordinal regression that has the advantages of neural network learning: learning in both online and batch modes, training on very large datasets (Burges et al., 2005), handling nonlinear data, good performance, and rapid prediction. Our method can be considered a generalization of perceptron learning (Crammer & Singer, 2002) to multi-layer perceptrons (neural networks) for ordinal regression. Our method is also related to the classic generalized linear models (e.g., the cumulative logit model) for ordinal regression (McCullagh, 1980). Unlike the neural network method (Burges et al., 2005) trained on pairs of examples to learn pairwise order relations, our method works on individual data points and uses multiple output nodes to estimate the probabilities of ordinal categories. Thus, our method falls into the category of multi-threshold approaches. Learning proceeds as in traditional neural networks, using backpropagation (Rumelhart et al., 1986).
On the same benchmark datasets, our method performs better than standard classification neural networks and comparably to the state-of-the-art methods using support vector machines and Gaussian processes. In addition, our method can learn on very large datasets and make rapid predictions.
Let $D$ represent an ordinal regression dataset consisting of $n$ data points $(x, y)$, where $x \in \mathbb{R}^d$ is an input feature vector and $y$ is its ordinal category from a finite set $Y$. Without loss of generality, we assume that $Y = \{1, 2, \ldots, K\}$ with "$<$" as the order relation.
For a standard classification neural network that does not consider the order of categories, the goal is to predict the probability of a data point $x$ belonging to one category $k$ ($y = k$). The input is $x$ and the target encoding of category $k$ is a vector $t = (0, \ldots, 0, 1, 0, \ldots, 0)$, where only the element $t_k$ is set to 1 and all others to 0. The goal is to learn a function that maps the input vector $x$ to a probability distribution vector $o = (o_1, o_2, \ldots, o_k, \ldots, o_K)$, where $o_k$ is close to 1 and the other elements are close to 0, subject to the constraint $\sum_{i=1}^{K} o_i = 1$.
In contrast, like the perceptron approach (Crammer & Singer, 2002), our neural network approach considers the order of the categories. If a data point $x$ belongs to category $k$, it is classified automatically into the lower-order categories ($1, 2, \ldots, k-1$) as well. So the target vector of $x$ is $t = (1, 1, \ldots, 1, 0, \ldots, 0)$, where $t_i$ ($1 \le i \le k$) is set to 1 and the other elements to 0. Thus, the goal is to learn a function that maps the input vector $x$ to a probability vector $o = (o_1, o_2, \ldots, o_k, \ldots, o_K)$, where $o_i$ ($i \le k$) is close to 1 and $o_i$ ($i > k$) is close to 0. Here $\sum_{i=1}^{K} o_i$ is an estimate of the number of categories (i.e., $k$) that $x$ belongs to, instead of 1. The formulation of the target vector is similar to the perceptron approach (Crammer & Singer, 2002). It is also related to the classical cumulative probit model for ordinal regression (McCullagh, 1980), in the sense that we can consider the output probability vector $(o_1, o_2, \ldots, o_K)$ as a cumulative probability distribution on the categories $(1, 2, \ldots, K)$, i.e., $\frac{1}{K}\sum_{i=1}^{K} o_i$ is the proportion of categories that $x$ belongs to, starting from category 1.
The target encoding scheme of our method is related to, but different from, multi-label learning (Bishop, 1996) and multiple label learning (Jin & Ghahramani, 2003) because our method imposes an order on the labels (or categories).
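The two target encodings discussed above can be sketched as follows. This is a minimal illustration; the function names `one_hot_target` and `ordinal_target` are hypothetical and categories are assumed to be numbered 1 to K.

```python
def one_hot_target(k, num_categories):
    """Standard classification target: only position k (1-based) is 1."""
    return [1 if i == k else 0 for i in range(1, num_categories + 1)]


def ordinal_target(k, num_categories):
    """Ordinal target: positions 1..k are 1, the remaining positions are 0."""
    return [1 if i <= k else 0 for i in range(1, num_categories + 1)]


K = 5
for k in (1, 3, 5):
    print(k, one_hot_target(k, K), ordinal_target(k, K))
# 1 [1, 0, 0, 0, 0] [1, 0, 0, 0, 0]
# 3 [0, 0, 1, 0, 0] [1, 1, 1, 0, 0]
# 5 [0, 0, 0, 0, 1] [1, 1, 1, 1, 1]
```

Note that the ordinal target sums to k, whereas the one-hot target always sums to 1.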
Under this formulation, we can use almost exactly the same neural network machinery for ordinal regression. We construct a multi-layer neural network to learn ordinal relations from $D$. The neural network has $d$ inputs corresponding to the number of dimensions of the input feature vector $x$ and $K$ output nodes corresponding to the $K$ ordinal categories. There can be one or more hidden layers. Without loss of generality, we use one hidden layer to construct a standard two-layer feedforward neural network. As in a standard neural network for classification, input nodes are fully connected with hidden nodes, which in turn are fully connected with output nodes. Likewise, the transfer function of hidden nodes can be a linear function, a sigmoid function, or the tanh function that is used in our experiments. The only difference from a traditional neural network lies in the output layer. Traditional neural networks use the softmax (or normalized exponential) function $o_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$ for output nodes, satisfying the constraint that the sum of outputs $\sum_{i=1}^{K} o_i$ is 1. Here $z_i$ is the net input to the output node $O_i$.
In contrast, each output node $O_i$ of our neural network uses a standard sigmoid function $o_i = \frac{1}{1 + e^{-z_i}}$, without including the outputs from other nodes. Output node $O_i$ is used to estimate the probability $o_i$ that a data point belongs to category $i$ independently, without the normalization that traditional neural networks apply. Thus, for a data point $x$ of category $k$, the target vector is $(1, 1, \ldots, 1, 0, \ldots, 0)$, in which the first $k$ elements are 1 and the others 0. This sets the target value of output nodes $O_i$ ($i \le k$) to 1 and $O_i$ ($i > k$) to 0. The targets instruct the neural network to adjust weights to produce probability outputs as close as possible to the target vector. It is worth pointing out that using independent sigmoid functions for output nodes does not guarantee the monotonic relation ($o_1 \ge o_2 \ge \ldots \ge o_K$), which is not necessary but desirable for making predictions (Li & Lin, 2006). A more sophisticated approach is to impose the inequality constraints on the outputs to improve the performance.
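The difference between the two output layers can be illustrated with a small sketch. The net inputs below are made-up values, and the helper names are hypothetical; the example also shows that independent sigmoid outputs need not be monotonically non-increasing.

```python
import math


def softmax(z):
    """Normalized exponential: the outputs sum to 1 (standard classification)."""
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def independent_sigmoids(z):
    """Each output depends only on its own net input; no normalization."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def is_monotone_nonincreasing(o):
    """True if o_1 >= o_2 >= ... >= o_K."""
    return all(a >= b for a, b in zip(o, o[1:]))


z = [3.0, 1.5, -0.5, 0.2, -2.0]  # hypothetical net inputs z_1..z_5
o = independent_sigmoids(z)
print(sum(softmax(z)))               # 1 up to floating-point rounding
print(sum(o))                        # not constrained to 1
print(is_monotone_nonincreasing(o))  # False: here o_4 > o_3
```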
Training of the neural network for ordinal regression proceeds very similarly to that of standard neural networks. The cost function for a data point $x$ can be the relative entropy or the square error between the target vector and the output vector. For relative entropy, the cost function for the output nodes is $f_c = -\sum_{i=1}^{K} \left( t_i \log o_i + (1 - t_i) \log (1 - o_i) \right)$. For square error, the error function is $f_c = \sum_{i=1}^{K} (t_i - o_i)^2$. Previous studies (Richard & Lippman, 1991) on neural network cost functions show that relative entropy and square error functions usually yield very similar results. In our experiments, we use the square error function and standard backpropagation to train the neural network. The errors are propagated back to output nodes, and from output nodes to hidden nodes, and finally to input nodes.
Since the transfer function $f_t$ of output node $O_i$ is the independent sigmoid function $\frac{1}{1 + e^{-z_i}}$, the derivative of $f_t$ of output node $O_i$ is $\frac{\partial f_t}{\partial z_i} = \frac{e^{-z_i}}{(1 + e^{-z_i})^2} = o_i (1 - o_i)$. Thus, the net error propagated to output node $O_i$ is $\frac{\partial f_c}{\partial o_i} \frac{\partial f_t}{\partial z_i} = \frac{o_i - t_i}{o_i (1 - o_i)} \times o_i (1 - o_i) = o_i - t_i$ for the relative entropy cost function, and $\frac{\partial f_c}{\partial o_i} \frac{\partial f_t}{\partial z_i} = -2 (t_i - o_i) \times o_i (1 - o_i)$ for the square error cost function. The net errors are propagated through the neural network to adjust weights using gradient descent, as in traditional neural networks.
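As a sanity check on the output-node error terms, the following sketch compares closed-form derivatives with central finite differences. It assumes the relative entropy cost is written as a negative log-likelihood to be minimized (so the net error is $o_i - t_i$) and the square error is $\sum_i (t_i - o_i)^2$; the function names and the test values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(t, z):
    """Relative entropy cost, assumed here to be the negative log-likelihood."""
    return -sum(ti * math.log(sigmoid(zi)) + (1 - ti) * math.log(1 - sigmoid(zi))
                for ti, zi in zip(t, z))


def square_error(t, z):
    return sum((ti - sigmoid(zi)) ** 2 for ti, zi in zip(t, z))


def analytic_deltas(t, z):
    """Closed-form d(cost)/d(z_i) at the output nodes for both costs."""
    o = [sigmoid(zi) for zi in z]
    ce = [oi - ti for oi, ti in zip(o, t)]
    se = [-2.0 * (ti - oi) * oi * (1.0 - oi) for oi, ti in zip(o, t)]
    return ce, se


def numeric_deltas(cost, t, z, eps=1e-6):
    """Central finite-difference approximation of d(cost)/d(z_i)."""
    grads = []
    for i in range(len(z)):
        zp = list(z)
        zm = list(z)
        zp[i] += eps
        zm[i] -= eps
        grads.append((cost(t, zp) - cost(t, zm)) / (2 * eps))
    return grads


t = [1, 1, 0]          # ordinal target for category 2 of 3
z = [0.3, -1.2, 0.8]   # hypothetical net inputs
ce, se = analytic_deltas(t, z)
print(max(abs(a - b) for a, b in zip(ce, numeric_deltas(cross_entropy, t, z))))
print(max(abs(a - b) for a, b in zip(se, numeric_deltas(square_error, t, z))))
```

Both printed differences should be close to zero, which indicates that the closed-form error terms match the chosen cost definitions.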
Apart from the small difference in the transfer function and the computation of its derivative, the training of our method is the same as for traditional neural networks. The network can be trained on data in the online mode, where weights are updated per example, or in the batch mode, where weights are updated per batch of examples.
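A minimal sketch of online training under these rules is given below. It is not the NNRank implementation: the class `OrdinalNet`, the toy one-dimensional dataset, the weight initialization range, the learning rate, and the number of epochs are all assumptions chosen for illustration. It uses one tanh hidden layer, independent sigmoid outputs, and the square error cost.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ordinal_target(k, K):
    return [1.0 if i <= k else 0.0 for i in range(1, K + 1)]


class OrdinalNet:
    """One tanh hidden layer and K independent sigmoid outputs (illustrative)."""

    def __init__(self, d, hidden, K, seed=0):
        rng = random.Random(seed)
        # Each row holds the weights of one unit; the last entry is the bias.
        self.W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d + 1)] for _ in range(hidden)]
        self.W2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)] for _ in range(K)]

    def forward(self, x):
        h = [math.tanh(sum(w * v for w, v in zip(row, x + [1.0]))) for row in self.W1]
        o = [sigmoid(sum(w * v for w, v in zip(row, h + [1.0]))) for row in self.W2]
        return h, o

    def update(self, x, t, lr):
        """One online gradient-descent step on the square error of (x, t)."""
        h, o = self.forward(x)
        # Net error at output node i: d/dz_i of sum_i (t_i - o_i)^2.
        d_out = [-2.0 * (ti - oi) * oi * (1.0 - oi) for ti, oi in zip(t, o)]
        # Back-propagate to the hidden nodes (derivative of tanh is 1 - h^2).
        d_hid = [(1.0 - hj * hj) * sum(d_out[i] * self.W2[i][j] for i in range(len(o)))
                 for j, hj in enumerate(h)]
        for i, row in enumerate(self.W2):
            for j, v in enumerate(h + [1.0]):
                row[j] -= lr * d_out[i] * v
        for j, row in enumerate(self.W1):
            for m, v in enumerate(x + [1.0]):
                row[m] -= lr * d_hid[j] * v
        return sum((ti - oi) ** 2 for ti, oi in zip(t, o))


# Toy data (assumption): x in [0, 1) split into K = 3 ordered categories.
rng = random.Random(1)
K = 3
data = []
for _ in range(60):
    x = rng.random()
    data.append(([x], ordinal_target(min(K, 1 + int(x * K)), K)))

net = OrdinalNet(d=1, hidden=4, K=K)
first = sum(net.update(x, t, lr=0.1) for x, t in data)  # error during epoch 1
for _ in range(200):
    last = sum(net.update(x, t, lr=0.1) for x, t in data)
print(first, last)  # the summed square error should drop over training
```

A batch variant would accumulate the weight changes over several examples before applying them; the error terms themselves are unchanged.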
In the test phase, to make a prediction, our method scans the output nodes in the order $O_1, O_2, \ldots, O_K$. It stops when the output of a node is smaller than the predefined threshold $T$ (e.g., 0.5) or when no nodes are left. The index $k$ of the last node $O_k$ whose output is bigger than $T$ is the predicted category of the data point.
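The scanning rule can be sketched as below. The function name `predict_category` is hypothetical, and the fallback to category 1 when even the first output is below the threshold is an assumption, since the text does not specify that edge case.

```python
def predict_category(outputs, threshold=0.5):
    """Scan o_1..o_K in order and stop at the first output below the threshold."""
    k = 0
    for o in outputs:
        if o < threshold:
            break
        k += 1
    # Assumption: if no output passes the threshold, fall back to category 1.
    return max(k, 1)


print(predict_category([0.98, 0.91, 0.62, 0.30, 0.70]))  # 3
print(predict_category([0.90, 0.80, 0.70, 0.60, 0.55]))  # 5
```

In the first call the scan stops at the fourth node, so the fifth output (0.70) is never consulted even though it exceeds the threshold.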
We use eight standard datasets for ordinal regression (Chu & Ghahramani, 2005a) to benchmark our method. The eight datasets (Diabetes, Pyrimidines, Triazines, Machine CPU, Auto MPG, Boston, Stocks Domain, and Abalone) were originally used for metric regression. Chu and Ghahramani (2005a) discretized the real-valued targets into five equal intervals, corresponding to five ordinal categories. The authors randomly split each dataset into training and test sets and repeated the partition 20 times independently. We use exactly the same partitions as in (Chu & Ghahramani, 2005a) to train and test our method.
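As an illustration of how real-valued targets can be turned into ordinal categories, here is a minimal sketch that assumes "five equal intervals" means equal-width bins over the observed target range. The function name and the sample values are hypothetical, and the exact binning used for the benchmark may differ.

```python
def discretize_equal_width(targets, num_bins=5):
    """Map real-valued targets to categories 1..num_bins using equal-width bins."""
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / num_bins
    categories = []
    for y in targets:
        k = int((y - lo) / width) + 1 if width > 0 else 1
        categories.append(min(k, num_bins))  # the maximum value goes to the top bin
    return categories


print(discretize_equal_width([0.0, 1.0, 2.5, 4.9, 5.0, 7.5, 10.0]))
# [1, 1, 2, 3, 3, 4, 5]
```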
We use the online mode to train neural networks. The parameters to tune are the number of hidden units, the number of epochs, and the learning rate. We create a grid for these three parameters, where the hidden unit number is in the range , the epoch number in the set , and the initial learning rate in the range . During the training, the learning rate is halved if training errors continuously go up for a predefined number (40, 60, 80, or 100) of epochs. For experiments on each data split, the neural network parameters are fully optimized on the training data without using any test data.
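The learning-rate rule described above can be sketched as follows. The class name `HalvingSchedule` is hypothetical, "continuously go up" is interpreted as the training error increasing in consecutive epochs, and the patience of 3 in the example is chosen only to keep the demonstration short.

```python
class HalvingSchedule:
    """Halve the learning rate after `patience` consecutive epochs of rising error."""

    def __init__(self, initial_lr, patience):
        self.lr = initial_lr
        self.patience = patience
        self._prev_error = None
        self._rising = 0

    def step(self, training_error):
        """Record one epoch's training error and return the (possibly halved) rate."""
        if self._prev_error is not None and training_error > self._prev_error:
            self._rising += 1
        else:
            self._rising = 0
        self._prev_error = training_error
        if self._rising >= self.patience:
            self.lr /= 2.0
            self._rising = 0
        return self.lr


schedule = HalvingSchedule(initial_lr=0.1, patience=3)
for err in [1.0, 1.1, 1.2, 1.3, 1.25, 1.3]:
    rate = schedule.step(err)
print(rate)  # 0.05
```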
For each experiment, after the parameters are optimized on the training data, we train five models on the training data with the optimal parameters, starting from different initial weights. The ensemble of five trained models is then used to estimate the generalization performance on the test data. That is, the average output of the five neural network models is used to make predictions.
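A minimal sketch of the ensemble step: the output vectors of the individual models are averaged element-wise, and the averaged vector is then thresholded. The five output vectors below are made-up numbers, the function names are hypothetical, and the threshold of 0.5 is an assumption.

```python
def ensemble_outputs(model_outputs):
    """Average the K-dimensional output vectors of several models element-wise."""
    n = len(model_outputs)
    return [sum(column) / n for column in zip(*model_outputs)]


def predict_category(outputs, threshold=0.5):
    """Count leading outputs at or above the threshold (at least category 1)."""
    k = 0
    for o in outputs:
        if o < threshold:
            break
        k += 1
    return max(k, 1)


# Hypothetical outputs of five models for one data point (K = 5).
outputs = [
    [0.90, 0.80, 0.60, 0.20, 0.10],
    [0.95, 0.70, 0.40, 0.30, 0.10],
    [0.90, 0.90, 0.55, 0.10, 0.00],
    [0.85, 0.60, 0.45, 0.20, 0.10],
    [0.90, 0.75, 0.65, 0.20, 0.20],
]
average = ensemble_outputs(outputs)
print([predict_category(o) for o in outputs])  # [3, 2, 3, 2, 3]
print(predict_category(average))               # 3
```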
We evaluate our method using zero-one error and mean absolute error as in (Chu & Ghahramani, 2005a). Zero-one error is the percentage of wrong assignments of ordinal categories. Mean absolute error is the average absolute difference between the assigned categories ($k'$) and the true categories ($k$) of all data points. For each dataset, the training and evaluation process is repeated 20 times on the 20 data splits. Thus, we compute the average error and the standard deviation of the two metrics as in (Chu & Ghahramani, 2005a).
We first compare our method (NNRank) with a standard neural network classification method (NNClass). We implement both NNRank and NNClass in C++. NNRank and NNClass share most of their code, differing only in the transfer function of the output nodes and the computation of its derivative, as described in the method description above.
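The two evaluation metrics can be computed as in the following sketch, which takes mean absolute error to be the average absolute difference between predicted and true category indices. The function names and the example labels are hypothetical.

```python
def zero_one_error(predicted, actual):
    """Fraction of data points whose predicted category differs from the true one."""
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and true category indices."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


predicted = [1, 2, 3, 5, 4, 2]
actual = [1, 3, 3, 4, 4, 5]
print(zero_one_error(predicted, actual))       # 0.5
print(mean_absolute_error(predicted, actual))  # 5/6, about 0.833
```

The last prediction is off by three categories, which the mean absolute error penalizes more heavily than the zero-one error does.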
Table 1. Mean zero-one error and mean absolute error (± standard deviation over the 20 data splits) of NNRank and NNClass on the eight datasets.

Dataset      Zero-one error (NNRank)  Zero-one error (NNClass)  Absolute error (NNRank)  Absolute error (NNClass)
Stocks       12.68±1.8%               16.97±2.3%                0.127±0.01               0.173±0.02
Pyrimidines  37.71±8.1%               41.87±7.9%                0.450±0.09               0.508±0.11
Auto MPG     27.13±2.0%               28.82±2.7%                0.281±0.02               0.307±0.03
Machine      17.03±4.2%               17.80±4.4%                0.186±0.04               0.192±0.06
Abalone      21.39±0.3%               21.74±0.4%                0.226±0.01               0.232±0.01
Triazines    52.55±5.0%               52.84±5.9%                0.730±0.06               0.790±0.09
Boston       26.38±3.0%               26.62±2.7%                0.295±0.03               0.297±0.03
Diabetes     44.90±12.5%              43.84±10.0%               0.546±0.15               0.592±0.09
As Table 1 shows, NNRank outperforms NNClass in all but one case in terms of both the mean zero-one error and the mean absolute error. On some datasets the improvement of NNRank over NNClass is sizable. For instance, on the Stocks and Pyrimidines datasets, the mean zero-one error of NNRank is about 4 percentage points lower than that of NNClass; on four datasets (Stocks, Pyrimidines, Triazines, and Diabetes) the mean absolute error is reduced by about 0.05. The results show that the ordinal regression neural network generally achieves better performance than the standard classification neural network. To further verify the effectiveness of the neural network ordinal regression approach, we are currently evaluating NNRank and NNClass on very large ordinal regression datasets in the bioinformatics domain (work in progress).
To further evaluate the performance of our method, we compare NNRank with two Gaussian process methods (GP-MAP and GP-EP) (Chu & Ghahramani, 2005a) and a support vector machine method (SVM) (Shashua & Levin, 2003) implemented in (Chu & Ghahramani, 2005a). The results of the three methods are quoted from (Chu & Ghahramani, 2005a). Table 2 reports the zero-one error on the eight datasets. NNRank achieves the best results on Diabetes, Triazines, and Abalone; GP-EP on Pyrimidines, Auto MPG, and Boston; GP-MAP on Machine; and SVM on Stocks.
Table 3 reports the mean absolute error on the eight datasets. NNRank yields the best results on Diabetes and Abalone; GP-EP on Pyrimidines, Auto MPG, and Boston; GP-MAP on Triazines and Machine; and SVM on Stocks.
In summary, on the eight datasets, the performance of NNRank is comparable to that of the three state-of-the-art methods for ordinal regression.
Table 2. Mean zero-one error (± standard deviation) of NNRank, SVM, GP-MAP, and GP-EP on the eight datasets.

Data         NNRank       SVM          GP-MAP       GP-EP
Triazines    52.55±5.0%   54.19±1.5%   52.91±2.2%   52.62±2.7%
Pyrimidines  37.71±8.1%   41.46±8.5%   39.79±7.2%   36.46±6.5%
Diabetes     44.90±12.5%  57.31±12.1%  54.23±13.8%  54.23±13.8%
Machine      17.03±4.2%   17.37±3.6%   16.53±3.6%   16.78±3.9%
Auto MPG     27.13±2.0%   25.73±2.2%   23.78±1.9%   23.75±1.7%
Boston       26.38±3.0%   25.56±2.0%   24.88±2.0%   24.49±1.9%
Stocks       12.68±1.8%   10.81±1.7%   11.99±2.3%   12.00±2.1%
Abalone      21.39±0.3%   21.58±0.3%   21.50±0.2%   21.56±0.4%

Table 3. Mean absolute error (± standard deviation) of NNRank, SVM, GP-MAP, and GP-EP on the eight datasets.

Data         NNRank      SVM         GP-MAP      GP-EP
Triazines    0.730±0.07  0.698±0.03  0.687±0.02  0.688±0.03
Pyrimidines  0.450±0.10  0.450±0.11  0.427±0.09  0.392±0.07
Diabetes     0.546±0.15  0.746±0.14  0.662±0.14  0.665±0.14
Machine      0.186±0.04  0.192±0.04  0.185±0.04  0.186±0.04
Auto MPG     0.281±0.02  0.260±0.02  0.241±0.02  0.241±0.02
Boston       0.295±0.04  0.267±0.02  0.260±0.02  0.259±0.02
Stocks       0.127±0.02  0.108±0.02  0.120±0.02  0.120±0.02
Abalone      0.226±0.01  0.229±0.01  0.232±0.01  0.234±0.01
We have described a simple yet novel approach to adapt traditional neural networks for ordinal regression. Our neural network approach can be considered a generalization of the one-layer perceptron approach (Crammer & Singer, 2002) to multiple layers. On the standard benchmark of ordinal regression, our method outperforms standard neural networks used for classification. Furthermore, on the same benchmark, our method achieves performance similar to that of the two state-of-the-art approaches (support vector machines and Gaussian processes) for ordinal regression.
Compared with existing methods for ordinal regression, our method has several advantages of neural networks. First, like the perceptron approach (Crammer & Singer, 2002), our method can learn in both batch and online modes. The online learning ability makes our method a good tool for adaptive learning in real time. The multi-layer structure of the neural network and the nonlinear transfer function give our method a stronger fitting ability than perceptron methods.
Second, the neural network can be trained on very large datasets iteratively, whereas training support vector machines and Gaussian processes on such datasets is more complex. Since the training process of our method is the same as for traditional neural networks, average neural network users can use this method for their tasks.
Third, the neural network method can make rapid predictions once models are trained. The ability to learn on very large datasets and to predict quickly makes our method a useful and competitive tool for ordinal regression tasks, particularly for time-critical and large-scale ranking problems in information retrieval, web page ranking, collaborative filtering, and the emerging fields of bioinformatics. We are currently applying the method to rank proteins according to their structural relevance with respect to a query protein (Cheng & Baldi, 2006). To facilitate the application of this new approach, we have made both NNRank and NNClass accept a general input format and made them freely available at http://www.eecs.ucf.edu/jcheng/cheng_software.html.
There are several directions for further improving the neural network (or multi-layer perceptron) approach for ordinal regression. One direction is to design a transfer function that ensures the monotonic decrease of the outputs of the neural network; another is to derive general error bounds of the method under the binary classification framework (Li & Lin, 2006). Furthermore, other implementations of the multi-threshold multi-layer perceptron approach for ordinal regression are possible. Since ranking is a fundamental machine learning problem with wide applications in many diverse domains such as web page ranking, information retrieval, image retrieval, collaborative filtering, and bioinformatics, we believe that further exploration of the neural network (or multi-layer perceptron) approach for ranking and ordinal regression is worthwhile.
References
Agarwal, S., & Roth, D. (2005). Learnability of bipartite ranking functions. In Proc. of the 18th annual conference on learning theory (COLT05).
Aiolli, F., & Sperduti, A. (2004). Learning preferences for multiclass problems. In Advances in neural information processing systems 17 (NIPS).
Basilico, J., & Hofmann, T. (2004). Unifying collaborative and content-based filtering. In Proceedings of the twenty-first international conference on machine learning (ICML), 9. New York, USA: ACM Press.
Bishop, C. (1996). Neural networks for pattern recognition. USA: Oxford University Press.
Burges, C., Ragno, R., & Le, Q. V. (2006). Learning to rank with nonsmooth cost functions. In Advances in neural information processing systems (NIPS) 20. Cambridge, MA: MIT Press.
Burges, C. J. C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., & Hullender, G. (2005). Learning to rank using gradient descent. In Proc. of international conference on machine learning (ICML05), 89–97.
Caruana, R., Baluja, S., & Mitchell, T. (1996). Using the future to sort out the present: Rankprop and multitask learning for medical risk evaluation. In Advances in neural information processing systems 8 (NIPS).
Cheng, J., & Baldi, P. (2006). A machine learning information retrieval approach to protein fold recognition. Bioinformatics, 22, 1456–1463.
Chu, W., & Ghahramani, Z. (2005a). Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6, 1019–1041.
Chu, W., & Ghahramani, Z. (2005b). Preference learning with Gaussian processes. In Proc. of international conference on machine learning (ICML05), 137–144.
Chu, W., & Keerthi, S. (2005). New approaches to support vector ordinal regression. In Proc. of international conference on machine learning (ICML05), 145–152.
Chu, W., & Keerthi, S. (2007). Support vector ordinal regression. Neural Computation, 19.
Cohen, W. W., Schapire, R. E., & Singer, Y. (1999). Learning to order things. Journal of Artificial Intelligence Research, 10, 243–270.
Crammer, K., & Singer, Y. (2002). Pranking with ranking. In Advances in neural information processing systems (NIPS) 14, 641–647. Cambridge, MA: MIT Press.
Dekel, O., Keshet, J., & Singer, Y. (2004). Log-linear models for label ranking. In Proc. of the 21st international conference on machine learning (ICML06), 209–216.
Frank, E., & Hall, M. (2001). A simple approach to ordinal classification. In Proc. of the European conference on machine learning.
Freund, Y., Iyer, R., Schapire, R., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969.
Goldberg, D., Nichols, D., Oki, B., & Terry, D. (1992). Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35, 61–70.
Har-Peled, S., Roth, D., & Zimak, D. (2002). Constraint classification: a new approach to multiclass classification and ranking. In Advances in neural information processing systems 15 (NIPS).
Herbrich, R., Graepel, T., Bollmann-Sdorra, P., & Obermayer, K. (1998). Learning preference relations for information retrieval. In Proc. of ICML workshop on text categorization and machine learning, 80–84.
Herbrich, R., Graepel, T., & Obermayer, K. (1999). Support vector learning for ordinal regression. In Proc. of 9th international conference on artificial neural networks (ICANN), 97–102.
Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In A. J. Smola, P. Bartlett, B. Schölkopf and D. Schuurmans (Eds.), Advances in large margin classifiers, 115–132. Cambridge, MA: MIT Press.
Jin, R., & Ghahramani, Z. (2003). Learning with multiple labels. In Advances in neural information processing systems (NIPS) 15. Cambridge, MA: MIT Press.
Joachims, T. (2002). Optimizing search engines using clickthrough data. In D. Hand, D. Keim and R. Ng (Eds.), Proc. of 8th ACM SIGKDD international conference on knowledge discovery and data mining, 133–142.
Kramer, S., Widmer, G., Pfahringer, B., & DeGroeve, M. (2001). Prediction of ordinal classes using regression trees. Fundamenta Informaticae, 47, 1–13.
Li, L., & Lin, H. (2006). Ordinal regression by extended binary classification. In Advances in neural information processing systems (NIPS) 20. Cambridge, MA: MIT Press.
MacKay, D. J. C. (1992). A practical Bayesian framework for back propagation networks. Neural Computation, 4, 448–472.
McCullagh, P. (1980). Regression models for ordinal data. Journal of the Royal Statistical Society B, 42, 109–142.
McCullagh, P., & Nelder, J. A. (1983). Generalized linear models. London: Chapman and Hall.
Minka, T. P. (2001). A family of algorithms for approximate Bayesian inference. PhD Thesis, Massachusetts Institute of Technology.
Paquet, U., Holden, S., & Naish-Guzman, A. (2005). Bayesian hierarchical ordinal regression. In Proc. of the international conference on artificial neural networks.
Rajaram, S., Garg, A., Zhou, X., & Huang, T. (2003). Classification approach towards ranking and sorting problems. In N. Lavrac, D. Gamberger, H. Blockeel and L. Todorovski (Eds.), Machine learning: ECML 2003, vol. 2837 of Lecture Notes in Artificial Intelligence, 301–312. Springer-Verlag.
Richard, M., & Lippman, R. (1991). Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation, 3, 461–483.
Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Vol. I: Foundations, 318–362. Cambridge, MA: Bradford Books/MIT Press.
Schölkopf, B., & Smola, A. (2002). Learning with kernels: Support vector machines, regularization, optimization and beyond. Cambridge, MA: MIT Press.
Schwaighofer, A., Tresp, V., & Yu, K. (2005). Hierarchical Bayesian modelling with Gaussian processes. In Advances in neural information processing systems 17 (NIPS). MIT Press.
Shashua, A., & Levin, A. (2003). Ranking with large margin principle: two approaches. In Advances in neural information processing systems 15 (NIPS).
Vapnik, V. (1995). The nature of statistical learning theory. Berlin, Germany: Springer-Verlag.
Wu, H., Lu, H., & Ma, S. (2003). A practical SVM-based algorithm for ordinal regression in image retrieval. 612–621.
Yu, S., Yu, K., Tresp, V., & Kriegel, H. P. (2006). Collaborative ordinal regression. In Proc. of 23rd international conference on machine learning, 1089–1096.
Zhang, H., Jiang, L., & Su, J. (2005). Augmenting naive Bayes for ranking. In International conference on machine learning (ICML05).