Regression via Arbitrary Quantile Modeling
In the regression problem, L1 and L2 are the most commonly used loss functions, which produce mean predictions with different biases. However, the predictions are neither robust nor adequate, since they capture only a few statistics of the conditional distribution rather than the whole distribution, especially for small datasets. To address this problem, we propose arbitrary quantile modeling to regularize the prediction, which achieves better performance than traditional loss functions. More specifically, we propose a new distribution regression method, Deep Distribution Regression (DDR), to estimate arbitrary quantiles of the response variable. Our DDR method consists of two models: a Q model, which predicts the corresponding value for an arbitrary quantile, and an F model, which predicts the corresponding quantile for an arbitrary value. Furthermore, the duality between the Q and F models enables us to design a novel loss function for joint training and to perform a dual inference mechanism. Our experiments demonstrate that our DDR-joint and DDR-disjoint methods outperform previous methods such as AdaBoost, random forest, LightGBM, and neural networks in terms of both mean and quantile prediction.
In recent years, machine learning has made great progress, and regression methods such as AdaBoost, random forest, LightGBM and neural networks have gained popularity and been widely adopted. Regression methods are involved in almost every sub-field of machine learning, and the loss function plays an important role in each of them. Most regression methods, such as linear regression and random forest, use the L1 or L2 loss function to directly obtain mean predictions. However, these loss functions may produce poor results on small datasets due to overfitting. To alleviate this problem, many researchers propose regularization techniques, such as the L1-norm, the L2-norm and dropout. According to statistical learning theory, the large error bound on small datasets can be reduced by data augmentation, multi-task learning, one-shot learning and transfer learning. Here, data augmentation and transfer learning increase the number of training samples, while multi-task learning exploits helpful auxiliary supervised information. Despite these considerable advantages, all of these methods currently require additional datasets or human expertise.
Ordinary least-squares regression models the relationship between one or more covariates X and the conditional mean of the response variable Y given X = x. Quantile regression [16, 17], as introduced by Koenker and Bassett (1978), extends the regression model to the estimation of conditional quantiles of the response variable. The quantiles of the conditional distribution of the response variable are expressed as functions of observed covariates. Since conditional quantile functions completely characterize all that is observable about univariate conditional distributions, they provide a foundation for nonparametric structural models. Quantile regression methods are widely used in many risk-sensitive regression problems, but their performance on small datasets fluctuates, just like that of the L1 and L2 loss functions. In economics, the relationship between food expenditure and household income, the change of the wage structure, and many other problems are analyzed with quantile regression. In ecology, quantile regression has been proposed to discover more useful predictive relationships between variables with complex interactions, which lead to data with unequal variation of one variable over different ranges of another, as in growth charts, prey and predator size relationships, etc. The main advantages of quantile regression over ordinary least-squares regression are its flexibility for modeling data with heterogeneous conditional distributions and its greater robustness to outliers in response measurements. Moreover, the different measures of central tendency and statistical dispersion are useful for obtaining a more comprehensive analysis of the relationships between variables. For instance, Dunham et al. analyzed the abundance of Lahontan cutthroat trout against the ratio of stream width to depth. Quantile regression indicated a nonlinear, negative relationship for the upper 30% of cutthroat densities across 13 streams and 7 years. Had they used only mean regression estimates, the researchers would have mistakenly concluded that there was no relationship between trout densities and the ratio of stream width to depth.
Many traditional learning methods solve quantile regression problems by optimizing the quantile loss, such as the gradient boosting machine and the multi-layer perceptron. These efforts are intended to predict fixed quantile values, and their computation cost usually grows linearly with the number of fixed quantiles. Although we can leverage the structural advantage of the multi-layer perceptron to predict multiple quantile values simultaneously, we still cannot predict arbitrary quantile values, which is necessary for complicated tasks (e.g., conditional distribution learning). In addition, traditional methods struggle to avoid mathematically invalid phenomena such as quantile crossing [1, 7], mainly because the quantile values are estimated independently of each other. Since median prediction equals the 50% quantile prediction, it is natural to use a neural network with median regression as the backbone to achieve arbitrary-quantile regression, and then to boost the performance of median and quantile predictions in return. When arbitrary-quantile regression is achieved, quantile values are estimated universally and continuously, which means we can resolve quantile crossing by simply applying gradient regularization. Moreover, in contrast to multi-task learning, arbitrary-quantile regression requires no extra labels, yet it boosts performance by augmenting the dataset through enumerating arbitrary quantile inputs.
In this paper, we propose a Deep Distribution Regression (DDR) mechanism, which consists of a quantile regression model (Q model) and its dual model, a cumulative distribution regression model (F model). Both models are deep neural networks, which respectively predict the corresponding value of an arbitrary quantile and the corresponding quantile of an arbitrary value. The joint training of Q and F provides extra regularization. Extensive experiments demonstrate that DDR outperforms traditional methods based on fixed quantile losses or L1 / L2 losses, such as neural networks and ensemble trees. The key contributions of our proposed DDR mechanism cover three aspects. First, we design a single neural network for arbitrary quantile prediction, which achieves better quantile loss, L1 loss and even L2 loss compared to fixed-point quantile regression, mean absolute error (MAE) regression and mean squared error (MSE) regression. Second, our DDR method can predict the quantile of an arbitrary value of the response variable, which is useful for anomaly or outlier detection. Third, we further utilize the joint training and the ‘dual inference’ mechanism of the two dual models to obtain a better estimate of the whole conditional distribution.
The novelties of our method are as follows:
We treat the quantile as an input variable instead of a fixed constant, which implies that DDR uses an integral of loss functions as its training loss instead of a finite set of loss functions.
We introduce the mathematical constraint that the conditional quantile function should be the inverse function of the conditional c.d.f. in DDR, which is not considered in previous studies.
We introduce the mathematical constraint that the conditional quantile function and the conditional c.d.f. should both be monotonic in DDR, by adding regularization terms on the corresponding gradients.
We leverage our trained dual models to perform ‘dual inference’ and achieve better performance.
2.1 Distribution function and quantile function
Let Y be a continuous real-valued random variable with cumulative distribution function (c.d.f.) F:

F(y) = P(Y ≤ y).

For each τ strictly between 0 and 1, we define:

Q(τ) = F⁻¹(τ) = inf{ y : F(y) ≥ τ }.

The value Q(τ) is called the τ-quantile of Y, or the 100τ-th percentile of Y. The function Q defined here on the open interval (0, 1) is called the quantile function of Y.
In general cases, the quantile function (q.f.) Q is a continuous and strictly increasing function that takes a percentile τ as input and outputs the corresponding quantile y_τ:

Q(τ) = y_τ,

hence y_τ is the τ-quantile of Y. Clearly, the q.f. Q and the c.d.f. F are the inverse of each other:

Q(F(y)) = y,    F(Q(τ)) = τ.
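As a quick numerical illustration (a sketch, not part of the original method), the generalized inverse Q(τ) = inf{ y : F(y) ≥ τ } can be checked on an empirical sample; `empirical_cdf` and `empirical_quantile` are hypothetical helper names:

```python
import numpy as np

def empirical_cdf(sample, y):
    """F(y) = P(Y <= y), estimated from a sample."""
    return np.mean(sample <= y)

def empirical_quantile(sample, tau):
    """Q(tau) = inf{y : F(y) >= tau}, the generalized inverse of F."""
    ys = np.sort(sample)
    # smallest order statistic whose empirical c.d.f. reaches tau
    idx = np.searchsorted(np.arange(1, len(ys) + 1) / len(ys), tau)
    return ys[idx]

rng = np.random.default_rng(0)
sample = rng.normal(size=10_000)

q = empirical_quantile(sample, 0.5)   # close to the true median 0
tau_back = empirical_cdf(sample, q)   # F(Q(0.5)) recovers ~0.5
```

Applying F after Q (or vice versa) recovers the input, which is exactly the inverse relationship stated above.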
2.2 Regression analysis
In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables. In most cases, regression analysis estimates the conditional expectation of the dependent variable given the covariates, using a mean squared error (MSE) loss function. Meanwhile, relatively few studies are devoted to the analysis of a certain quantile; one typical example is estimating the conditional median using the mean absolute error (MAE) loss function.
In quantile regression, the τ-quantile Q(τ) can be characterized as the unique solution to the problem:

min_q E[ρ_τ(Y − q)],

where ρ_τ denotes the asymmetric absolute loss function

ρ_τ(u) = u (τ − I[u < 0]),

and I[u < 0] is the indicator function of the event {u < 0}:

I[u < 0] = 1 if u < 0, and 0 otherwise.

If τ = 1/2, then ρ_τ(u) = |u| / 2 and the minimizer q is the median of Y.

We seek a q that minimizes E[ρ_τ(Y − q)]; by differentiating with respect to q, we obtain the first-order condition

F(q) = τ.
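This characterization can be checked numerically: minimizing the empirical pinball risk over a grid of candidate values recovers the sample quantile. The grid search below is only an illustrative sketch, not part of the method:

```python
import numpy as np

def pinball(u, tau):
    """Asymmetric absolute loss rho_tau(u) = u * (tau - I[u < 0])."""
    return u * (tau - (u < 0))

rng = np.random.default_rng(1)
y = rng.exponential(size=5000)

tau = 0.9
grid = np.linspace(0.0, 6.0, 601)
risk = np.array([pinball(y - q, tau).mean() for q in grid])

q_star = grid[risk.argmin()]      # minimizer of the empirical risk
q_true = np.quantile(y, tau)      # agrees with the sample 0.9-quantile
```

The empirical risk is piecewise-linear and convex in q, so its minimizer coincides with the sample τ-quantile up to the grid resolution.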
Consider the problem of estimating the conditional distribution of a scalar random variable Y given a random vector X, when the available data are a sample from the joint distribution of (X, Y). One way is to estimate the conditional c.d.f.

F(y | x) = P(Y ≤ y | X = x).

The other way is to estimate the conditional quantile function (c.q.f.). The τ-th c.q.f. has the following form:

Q(τ | x) = inf{ y : F(y | x) ≥ τ }.

It can be easily verified that

F(Q(τ | x) | x) = τ,    Q(F(y | x) | x) = y.
The above statements mean that we can estimate the conditional expectation once we can estimate arbitrary conditional quantiles, which also shows that arbitrary quantile modeling is able to cover most regression problems, as evidenced below. Consider the conditional expectation of Y given X = x:

E[Y | X = x] = ∫ y dF(y | x).

Using the definition of the c.q.f. Q and the change of variables τ = F(y | x), we can conclude that

E[Y | X = x] = ∫₀¹ Q(τ | x) dτ.
Since it is intractable to calculate the exact integral when the c.q.f. has a complex form, we compute a numerical approximation using the trapezoidal rule to estimate the conditional expectation:

E[Y | X = x] ≈ Σ_{k=1}^{N} (Q(τ_{k−1} | x) + Q(τ_k | x)) / 2 · Δτ,

where the interval (0, 1) is partitioned into N equal subintervals, each of width Δτ = 1/N, such that τ_k = kΔτ, τ_0 = 0 and τ_N = 1.
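A minimal sketch of this trapezoidal estimate, using the empirical quantile function of a sample as a stand-in for a learned Q model; the slight truncation at the endpoints (an implementation choice not spelled out in the text) avoids unbounded quantiles for distributions with infinite support:

```python
import numpy as np

def mean_from_quantiles(Q, n=1000, eps=1e-3):
    """Approximate E[Y] = integral_0^1 Q(tau) dtau with the trapezoidal
    rule on [eps, 1 - eps]."""
    taus = np.linspace(eps, 1.0 - eps, n + 1)
    vals = Q(taus)
    return np.sum((vals[1:] + vals[:-1]) / 2.0 * np.diff(taus))

rng = np.random.default_rng(2)
sample = rng.normal(loc=3.0, size=50_000)
Q = lambda taus: np.quantile(sample, taus)  # stand-in for a learned Q model

est = mean_from_quantiles(Q)  # close to the sample mean, ~3.0
```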
Notice that the Q(τ_k | x) can all be calculated with our arbitrary quantile model, so the conditional expectation can be estimated with the equation above, which concludes the argument. In order to train a quantile model, according to the definition of the c.q.f., the quantile Q(τ | x) can be found by minimizing the expected loss

E[ρ_τ(Y − Q(τ | X))].

When actually applied in the training process, this formula can be expanded into an empirical form:

L_τ = (1/n) Σ_{i=1}^{n} [ τ · ReLU(y_i − Q(τ | x_i)) + (1 − τ) · ReLU(Q(τ | x_i) − y_i) ],

where ReLU is the rectified linear function:

ReLU(t) = max(t, 0).
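The ReLU form is equivalent to the asymmetric absolute loss ρ_τ defined earlier, which the following sketch verifies on random data:

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def pinball_relu(y, q, tau):
    """Empirical loss: mean of tau*ReLU(y - q) + (1 - tau)*ReLU(q - y)."""
    u = y - q
    return np.mean(tau * relu(u) + (1.0 - tau) * relu(-u))

def pinball_direct(y, q, tau):
    """Same loss written as mean of rho_tau(u) = u * (tau - I[u < 0])."""
    u = y - q
    return np.mean(u * (tau - (u < 0)))

rng = np.random.default_rng(3)
y = rng.normal(size=100)
same = np.isclose(pinball_relu(y, 0.3, 0.7), pinball_direct(y, 0.3, 0.7))
```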
Although some theories and methods have been proposed to leverage this loss function and estimate quantiles one at a time [5, 6, 12, 14, 23], few researchers have considered the panorama of arbitrary conditional quantiles of the response variable. There have been some significant efforts to estimate multiple quantiles in a single neural network with different outputs [3, 21], but these methods fail to estimate an infinite set of quantiles (i.e., arbitrary quantiles). To train the conditional c.d.f. regression model, we simply leverage maximum likelihood estimation and set the conditional c.d.f. loss to the negative log-likelihood of each batch. We denote by y_a, in the range of Y, the anchor at which we want to estimate the conditional c.d.f.; then it is clear that the likelihood can be represented by:

L = ∏_{i=1}^{n} F(y_a | x_i)^{I[y_i ≤ y_a]} · (1 − F(y_a | x_i))^{I[y_i > y_a]}.
One way to automatically constrain the output between 0 and 1 is not to model F(y | x) directly, but rather the log-odds h(y | x). Then F(y | x) is estimated by σ(h(y | x)), where σ is the sigmoid function:

σ(t) = 1 / (1 + e^(−t)).

Therefore, the likelihood can be reformulated as:

L = ∏_{i=1}^{n} σ(h(y_a | x_i))^{I[y_i ≤ y_a]} · (1 − σ(h(y_a | x_i)))^{I[y_i > y_a]}.

As a result, the loss function of the F model is the negative log-likelihood:

L_F = −(1/n) Σ_{i=1}^{n} [ I[y_i ≤ y_a] · log σ(h(y_a | x_i)) + I[y_i > y_a] · log(1 − σ(h(y_a | x_i))) ].
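In code, this negative log-likelihood is just binary cross-entropy between σ(h) and the indicator I[y_i ≤ y_a]; the sketch below assumes the model's log-odds are already computed (`logits` is a hypothetical name):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def f_model_loss(logits, y, y_anchor, eps=1e-12):
    """Negative log-likelihood of the conditional c.d.f. at anchor y_anchor.
    logits holds h(y_anchor | x_i); the 'label' is whether y_i <= y_anchor."""
    p = sigmoid(logits)                      # estimated F(y_anchor | x_i)
    label = (y <= y_anchor).astype(float)    # I[y_i <= y_anchor]
    return -np.mean(label * np.log(p + eps)
                    + (1.0 - label) * np.log(1.0 - p + eps))

# with zero log-odds, every prediction is 0.5 and the loss is log(2)
loss = f_model_loss(np.zeros(4), np.array([1.0, 2.0, 3.0, 4.0]), 2.5)
```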
The differences between our proposed DDR mechanism and these mechanisms lie in how we deal with the percentile τ in quantile regression and the anchor y_a in the conditional c.d.f. regression, and in how we introduce several regularization terms to force Q and F to be inverse functions of each other. We will discuss these in detail in §3 and describe how our proposed DDR mechanism improves regression performance by simultaneously training the Q and F models.
3 Arbitrary Quantile Modeling
In this section, we outline the process of building a quantile regression model with the proposed DDR mechanism, which predicts the quantile curves for arbitrary percentiles. For the two mutually inverse models mentioned above, the Q model and the F model, we have

Q(F(y | x) | x) = y,    F(Q(τ | x) | x) = τ.
Similar to neural machine translation, the c.q.f. model and the conditional c.d.f. model can be considered as samples from two model corpora. Unlike neural machine translation, the mapping between these two model corpora is mathematically restricted to an inverse mapping, which guarantees their alignment. This property allows us to train the Q model and the F model together with regularization that goes beyond any aligned data.
In the following, we first introduce how to train a single c.q.f. model and a single conditional c.d.f. model; we then discuss how to add constraints to enforce their inverse mapping; finally, we discuss how to use dual inference to further boost regression performance.
3.1 Arbitrary quantile and conditional c.d.f. regression
As mentioned in §2, quantile regression can be done with some given percentiles, while conditional c.d.f. regression can be done with some given anchors. However, regression with arbitrary percentiles and anchors is not supported by traditional methods, because they treat loss functions as ‘constants’ instead of ‘variables’. Therefore, we use an integral of the corresponding loss function as the loss function in our method. In the DDR mechanism, we model Q(τ | x) with τ as a direct input instead of training a separate model for each fixed τ, which implies that the Q model is not determined by a specific percentile τ but takes τ as input directly. In this case, the loss function should be an integral of the previous loss function:

L_Q = ∫₀¹ E[ρ_τ(Y − Q(τ | X))] dτ.
Since the integral is intractable, we simply use the Monte Carlo method to estimate this loss function in practice. Notice that different prior distributions assigned to τ lead to different ‘attention’ to what we want to solve. In our experiments, we generally sample τ from the uniform distribution U(0, 1). Similarly, we model F(y | x) with the anchor y as a direct input instead of fixing a finite set of anchors, using the loss function L_F. Since the lower and upper bounds of Y are unknown, we use empirical bounds from our training dataset. We denote x as the original features, y as the labels and D as the dataset, and let

y_min = min_{(x, y) ∈ D} y,    y_max = max_{(x, y) ∈ D} y.
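A sketch of the Monte Carlo estimate of the integral loss for the Q model; `Q` is any callable taking the percentile and the features (the signature is assumed for illustration):

```python
import numpy as np

def ddr_q_loss(Q, x, y, rng, n_tau=64):
    """Monte Carlo estimate of L = E_{tau ~ U(0,1)}[ rho_tau(y - Q(tau, x)) ]."""
    losses = []
    for tau in rng.uniform(0.0, 1.0, size=n_tau):
        u = y - Q(tau, x)
        losses.append(np.mean(u * (tau - (u < 0))))
    return float(np.mean(losses))

# sanity check on Y ~ U(0, 1), whose true quantile function is Q(tau) = tau:
rng = np.random.default_rng(4)
y = rng.uniform(size=5000)
loss_true = ddr_q_loss(lambda tau, x: tau, None, y, rng)   # near 1/12
loss_flat = ddr_q_loss(lambda tau, x: 0.5, None, y, rng)   # near 1/8
```

The true quantile function attains a lower integral loss than a constant predictor, as expected from the per-τ characterization in §2.2.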
In our experiments, we also sample the anchor from a uniform distribution, but other distributions can be used when more prior knowledge is available. Since arbitrary τ and y can be seen during training, both the Q model and the F model are expected to capture quantile curves and conditional c.d.f. curves under arbitrary percentiles and anchors. In practice, however, we may want to focus on specific percentiles and anchors and ensure their performance; consequently, ‘anchor losses’ are introduced to both the Q model and the F model. We first define two anchor sets for the percentiles and the conditional c.d.f. anchors that we are interested in:

T = {τ₁, τ₂, …, τ_K} ⊂ (0, 1),    A = {y₁, y₂, …, y_K} ⊂ [y_min, y_max],

where K is the number of anchors.
After sampling τ and y from U(0, 1) and U(y_min, y_max) during training, we additionally sample τ′ and y′ from the anchor sets T and A and calculate their losses, to ensure that our Q model and our F model focus more on these anchors. Moreover, since the Q model and the F model should both satisfy a monotonic constraint, we introduce gradient losses as regularization terms as well:

L_grad^Q = ReLU(−∂Q(τ | x)/∂τ),    L_grad^F = ReLU(−∂F(y | x)/∂y).
Notice that the Q model and the F model should satisfy the monotonic constraint not only on inputs from our training set, but also on any inputs sampled from the data distribution. As a result, the regularization terms above can be applied to synthetic inputs which are ‘close to’ our dataset but never appear in it (and maybe not in reality either). We believe this kind of synthetic regularization makes our model more robust to the quantile crossing problem.
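A finite-difference sketch of such a gradient regularizer; in practice the gradient ∂Q/∂τ would come from automatic differentiation, so the finite difference here is only for illustration:

```python
import numpy as np

def monotonic_penalty(Q, x, taus, delta=1e-3):
    """Penalize negative slopes of tau -> Q(tau, x):
    mean of ReLU(-(Q(tau + delta, x) - Q(tau, x)) / delta)."""
    slopes = (Q(taus + delta, x) - Q(taus, x)) / delta
    return float(np.mean(np.maximum(-slopes, 0.0)))

taus = np.linspace(0.1, 0.8, 8)
ok = monotonic_penalty(lambda t, x: 2.0 * t, None, taus)   # increasing: 0.0
bad = monotonic_penalty(lambda t, x: -t, None, taus)       # decreasing: ~1.0
```

A monotonically increasing Q incurs zero penalty, while any decreasing segment is penalized in proportion to its negative slope.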
3.2 Recover losses
For any τ and y, we introduce two recover losses as follows:

L_rec^y = |Q(F(y | x) | x) − y|,    L_rec^τ = |F(Q(τ | x) | x) − τ|,

where L_rec^y means recovering y from F(y | x), and L_rec^τ means recovering τ from Q(τ | x). These two losses express the ‘inverse mapping’ directly, hence forcing Q and F to become inverse functions of each other. We simply use the absolute error to minimize these losses.
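A sketch of the two recover losses; `Q` and `F` are callables with assumed signatures (value first, features second), and an exactly inverse pair drives both terms to zero:

```python
import numpy as np

def recover_loss(Q, F, x, taus, ys):
    """|Q(F(y, x), x) - y| + |F(Q(tau, x), x) - tau|, averaged over a batch:
    absolute errors pushing Q and F toward being mutual inverses."""
    loss_y = np.mean(np.abs(Q(F(ys, x), x) - ys))
    loss_tau = np.mean(np.abs(F(Q(taus, x), x) - taus))
    return float(loss_y + loss_tau)

# an exactly inverse pair gives zero recover loss
Q = lambda tau, x: 2.0 * tau
F = lambda y, x: y / 2.0
zero = recover_loss(Q, F, None,
                    np.linspace(0.1, 0.9, 9), np.linspace(0.2, 1.8, 9))
```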
3.3 Dual losses
Apart from the recover losses, we also want the mappings between Q and F to satisfy the monotonic constraints mentioned in Section 3.1. Specifically, we expect that the τ̂ = F(Q(τ | x) | x) recovered through the F model still satisfies the monotonic constraint after being fed into the Q model again, and that the ŷ = Q(F(y | x) | x) recovered through the Q model also satisfies the constraint. Therefore, we introduce two dual losses as follows:

L_dual^Q = ReLU(−∂Q(τ̂ | x)/∂τ̂),    L_dual^F = ReLU(−∂F(ŷ | x)/∂ŷ),

where L_dual^Q denotes the regularization term on Q obtained by inputting x and the recovered τ̂, and L_dual^F denotes the regularization term on F obtained by inputting x and the recovered ŷ. These two losses imply that Q and F remain consistent after the two mappings, hence forcing Q and F to focus more on their own domains.
3.4 Dual inference
Since we train the Q and F models simultaneously, when predicting the conditional quantile Q(τ | x), we can either calculate it directly with the Q model, or calculate it indirectly by solving an optimization problem with the F model:

ŷ_F(τ | x) = argmin_y ℓ(F(y | x), τ),

where the loss function ℓ can be a simple L1 or L2 loss, or a more specific loss function when prior knowledge is accessible.
In reality, however, we are not sure whether we have obtained a better Q model or a better F model during the training process, so it is natural to output a weighted sum of the outputs from both models, which we denote as dual inference:

ŷ(τ | x) = w_Q · Q(τ | x) + w_F · ŷ_F(τ | x).

In our experiments, we simply set w_Q = w_F = 1/2, but better performance might be achieved by selecting the weights according to the performance of the Q model and the F model on the cross-validation set.
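Since F is monotonic in y, the indirect prediction can be obtained by simple bisection rather than a general optimizer; the sketch below assumes scalar models, and the names and bounds are illustrative:

```python
def quantile_via_F(F, x, tau, lo, hi, iters=60):
    """Invert a monotone conditional c.d.f. by bisection:
    find y with F(y, x) ~= tau (the F model's indirect quantile)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if F(mid, x) < tau:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def dual_inference(Q, F, x, tau, lo, hi, w=0.5):
    """Weighted sum of the direct (Q) and indirect (F) predictions."""
    return w * Q(tau, x) + (1.0 - w) * quantile_via_F(F, x, tau, lo, hi)

# consistent toy models for Y ~ U(0, 1): F(y) = y, Q(tau) = tau
pred = dual_inference(lambda t, x: t, lambda y, x: y, None, 0.3, 0.0, 1.0)
```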
3.5 Model structure design and training strategies
Unlike quantile regression, which has different loss functions under different percentiles, median regression has an exclusive loss function, known as the L1 loss. Considering that median regression is a special case of quantile regression (where τ = 1/2), we build our model based on a median regression model. Since every neural network can be treated as a sequence of transformations from one latent space to another, we inject the information of different percentiles by projecting them into those latent spaces and adding them to the original network outputs.
We use a two-layer Multi-Layer Perceptron (MLP) with 256 hidden units and the GLU activation function to accomplish the projections, but a linear projection or a more complex MLP structure could be considered when tasks are simpler or more complicated.
In our case, we use structured datasets in the experiments, so the network outputs in different latent spaces are simply the activations of the different hidden layers of an MLP. Assume that our MLP is constructed with a single hidden layer of m neurons; then our median regression model calculates its output by

ŷ_med(x) = W₂ · a(W₁x + b₁) + b₂,

where W₁ and W₂ are weight matrices with shapes m × d and 1 × m, b₁ and b₂ are bias vectors with shapes m × 1 and 1 × 1, and a(·) is the activation function. Based on this median regression model, our Q model calculates its output by

Q(τ | x) = W₂ · a(W₁x + b₁ + w_τ · τ + b_τ) + b₂,

where w_τ and b_τ are a weight vector and a bias vector, both of shape m × 1. Similarly, our conditional c.d.f. regression model calculates its output by

h(y | x) = W₂ · a(W₁x + b₁ + w_y · y + b_y) + b₂,

where w_y and b_y are a weight vector and a bias vector, both of shape m × 1. These formulas can be extended recursively, so the information of τ and y is directly injected into every latent space in the sequential calculation.
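A numpy sketch of a single-hidden-layer Q model of this shape; the ReLU activation and the parameter layout are assumptions for illustration (the paper's feature part uses GLU):

```python
import numpy as np

def q_model_forward(x, tau, params):
    """Median backbone with tau injected into the hidden latent space:
    W2 @ relu(W1 @ x + b1 + w_tau * tau + b_tau) + b2."""
    W1, b1, W2, b2, w_tau, b_tau = params
    h = np.maximum(W1 @ x + b1 + w_tau * tau + b_tau, 0.0)
    return W2 @ h + b2

rng = np.random.default_rng(5)
d, m = 4, 8
params = (rng.normal(size=(m, d)), rng.normal(size=m),
          rng.normal(size=(1, m)), rng.normal(size=1),
          np.zeros(m), np.zeros(m))  # zero tau-projection for the check
x = rng.normal(size=d)

# with a zero tau-projection the model reduces to the median backbone,
# so the output is independent of tau
same = np.allclose(q_model_forward(x, 0.1, params),
                   q_model_forward(x, 0.9, params))
```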
Since our Q model and F model both rely on our median regression model, it is intuitive to first train the median regression model to an acceptable level before we start to train the Q model and the F model. Therefore, we introduce annealing strategies to control the combination of the losses, so as to guarantee that we focus more on the median regression model in general. Notice that this training strategy can be treated as hierarchical multi-task learning (hierarchical ‘infinite-task’ learning, to be exact, since the percentiles and anchors are sampled from continuous distributions), which has proven effective in many domains.
In addition, the proposed (linear) injection of the information of τ and y means sharing most of the parameters between the median regression model, the Q model and the F model; thus high-quality latent features in each latent space are required. Therefore, we separate our MLP into two parts: the feature part and the regression part. We use the GLU activation function in the feature part and the ReLU activation function in the regression part, and update the parameters in the feature part more frequently than those in the regression part.
4.1 Experiments setting
In the experiments, we compare the proposed DDR approach with LightGBM and fully-connected neural networks (FCNN) on general quantile regression problems and median regression problems. For the FCNN on general quantile regression problems, we run two groups of experiments, namely FCNN and FCNN-joint, to study the effect of shared parameters on quantile regression performance. Specifically, the FCNN method trains k models when we need k quantiles, while the FCNN-joint method trains a single model with k outputs to fetch the k quantiles together. We evaluate our algorithm on 6 real datasets and regression benchmarks collected from open sources.
As shown in Table 1, we first evaluate our algorithm on four synthetic 1-dimensional datasets as sanity checks with clear visualization. Since the constructed synthetic datasets are rather simple, we also evaluate our methods on the real-world datasets and regression benchmarks collected from open sources, which are listed in Table 2. The datasets fall into 6 classes.
Name | Sample num | Feature num | Description
4.3 Implementation details
In order to compare model performance on general quantile regression problems, we compare the quantile errors together. Here, for each task, we train a single model with our DDR mechanism, and 9 separate models based on 9 different quantiles with LightGBM. We denote the quantile loss averaged over the 9 percentiles as our target metric, which can be calculated by:

(1/9) Σ_{j=1}^{9} (1/n) Σ_{i=1}^{n} ρ_{τ_j}(y_i − Q(τ_j | x_i)),    τ_j = j/10.
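This averaged metric can be computed as below; `preds[j]` holds a model's predictions for percentile `taus[j]` (an assumed evaluation layout):

```python
import numpy as np

def quantile_metric(y_true, preds, taus):
    """Average pinball loss over the target percentiles."""
    total = 0.0
    for tau, y_hat in zip(taus, preds):
        u = y_true - y_hat
        total += np.mean(u * (tau - (u < 0)))
    return total / len(taus)

taus = [0.1 * k for k in range(1, 10)]           # the 9 percentiles
y_true = np.array([1.0, 2.0, 3.0])
perfect = [y_true.copy() for _ in taus]          # zero-error predictions
metric = quantile_metric(y_true, perfect, taus)  # 0.0 for perfect predictions
```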
We used grid search for LightGBM, and developed one set of hyper-parameters for DDR that is suitable for various tasks; we use these hyper-parameters throughout this section.
To compare model performance on specific regression problems, we compare the median regression performance using the L1 loss and the mean (expectation) regression performance using the L2 loss.
It should be noted that the proposed DDR mechanism consists of two main parts: the ‘infinite-task’ learning part and the ‘dual learning’ part. The ‘infinite-task’ learning strategy should boost regression performance by augmenting the data, but it is not clear whether training an additional F model, or adding constraints between the Q model and the F model, has a positive impact on the regression performance. It is also not clear whether using the ‘dual inference’ mechanism benefits the performance.
Therefore, we use ‘DDR-q’ to denote the DDR mechanism without training the F model, ‘DDR-disjoint’ to denote the DDR mechanism with the F model but without the ‘dual learning’ part, and ‘DDR-joint’ to denote the DDR mechanism with the ‘dual learning’ part. In Table 3, the superscript ‘*’ is used when dual inference with both the Q model and the F model outperforms direct prediction with the Q model on the cross-validation set (notice that DDR-q is not able to perform dual inference).
4.4 Experimental results and analysis
4.4.1 Results on synthetic dataset
We first experiment on synthetic datasets with 100k samples. Since our synthetic datasets are generated from known ground-truth functions, we can obtain the ground-truth quantile curves with statistical methods (e.g., sampling). In Figure 1, we visualize our model’s predictions and the corresponding ground truths to show that our model has enough capacity to learn the inner pattern of the quantile functions.
4.4.2 Quantile errors on test set and cross validation set
As shown in Table 3, dual inference boosts performance in most cases, and the DDR-based methods are rather outstanding. It is worth noticing that, in our experiments, a performance boost on the cross-validation set also translates into a performance boost on the test set, as shown in Table 4 and Table 5.
More interestingly, the performances of DDR-joint and DDR-disjoint remain consistent between the cross-validation set and the test set as well, which further demonstrates the high generalization ability of our DDR model.
4.4.3 Median regression performance
Apart from the overall performance of quantile regression, we also compare the median regression performance, measured by MAE, between our model and several common methods. The experimental results are shown in Table 6. For comparison, we highlight the top-3 performances in the experiments. We find that the DDR-joint method still provides better performance in almost all of our experiments.
4.4.4 Mean regression performance
Finally, we also compare the mean regression performance, namely MSE, between our model and several commonly used methods. Note that mean regression is rather implicit in the DDR model, since we can only access it through the trapezoidal approximation (equation (17)) mentioned in §2.2, where

E[Y | X = x] ≈ Σ_{k=1}^{N} (Q(τ_{k−1} | x) + Q(τ_k | x)) / 2 · Δτ.

We then use this approximation of E[Y | X = x] to calculate the MSE, and compare with commonly used models that are directly trained with MSE.
From Table 7, we can see that DDR obtains competitive results on the MSE metric, even though DDR does not model the conditional expectation directly. In order to demonstrate the importance of modeling the Q model and the F model together, we compare the quantile metric, the MAE metric and the MSE metric between DDR-disjoint, DDR-joint and DDR-q, where DDR-q only trains the Q model and drops the F model.
Compared with the quantile loss metric and the MAE metric, DDR-joint and DDR-disjoint outperform DDR-q more significantly on the MSE metric. This is reasonable because we obtain the mean predictions by numerical integration, which requires generally good performance across all possible quantiles, at percentiles ranging from 0 to 1. DDR-joint explicitly models the F model and adds constraints between Q and F, which may hurt performance at a specific percentile (e.g., the 0.5 quantile, i.e., the median) but boosts performance in a more general sense.
In summary, we propose a DDR mechanism to provide quantile regression, which is a generalization of MAE regression and MSE regression. With this mechanism, we obtain a model that can predict at arbitrary quantiles. Moreover, compared with other models, DDR performs better not only on the quantile loss but also on the L1 loss and the L2 loss. Even better estimates are obtained by utilizing dual inference.
In the future, we will continue to explore the following aspects. First, we have completed experiments on structured datasets and expect to apply DDR to unstructured datasets, e.g., to replace the MSE loss in common regression problems in computer vision such as object detection. Second, we can leverage DDR to develop risk prediction models, because our DDR-joint model provides an F model that is able to determine whether an observation is abnormal (i.e., whether the F model responds with a value close to 0 or 1).
-  (2010) Noncrossing quantile regression curve estimation. Biometrika 97 (4), pp. 825–838.
-  (1994) Changes in the US wage structure 1963-1987: application of quantile regression. Econometrica 62, pp. 405–405.
-  (2011) Quantile regression neural networks: implementation in R and application to precipitation downscaling. Computers & Geosciences 37 (9), pp. 1277–1284.
-  (1997) Multitask learning. Machine Learning 28 (1), pp. 41–75.
-  (2002) Nonparametric estimation of conditional quantiles using quantile regression trees. Bernoulli 8 (5), pp. 561–576.
-  (2016) XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
-  (2010) Quantile and probability curves without crossing. Econometrica 78 (3), pp. 1093–1125.
-  (2002) Influences of spatial and temporal variation on fish-habitat relationships defined by regression quantiles. Transactions of the American Fisheries Society 131 (1), pp. 86–98.
-  (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189–1232.
-  (1994) Data augmentation and dynamic linear models. Journal of Time Series Analysis 15 (2), pp. 183–202.
-  (1998) Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmospheric Environment 32 (14-15), pp. 2627–2636.
-  (2017) Short-term power load probability density forecasting method using kernel-based support vector quantile regression and copula theory. Applied Energy 185, pp. 254–266.
-  (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12 (1), pp. 55–67.
-  (2005) A simple quantile regression via support vector machine. In International Conference on Natural Computation, pp. 512–520.
-  (2017) LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pp. 3146–3154.
-  (1978) Regression quantiles. Econometrica: Journal of the Econometric Society, pp. 33–50.
-  (2001) Quantile regression. Journal of Economic Perspectives 15 (4), pp. 143–156.
-  (2006) One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (4), pp. 594–611.
-  (2010) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359.
-  (2002) On estimating conditional quantiles and distribution functions. Computational Statistics & Data Analysis 38 (4), pp. 433–447.
-  (2018) Beyond expectation: deep joint mean and quantile regression for spatio-temporal problems. arXiv preprint arXiv:1808.08798.
-  (1998) Inferring ecological relationships from the edges of scatter diagrams: comparison of regression techniques. Ecology 79 (2), pp. 448–460.
-  (2009) Support vector censored quantile regression under random censoring. Computational Statistics & Data Analysis 53 (4), pp. 912–919.
-  (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958.
-  (2011) Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73 (3), pp. 273–282.
-  (2013) The Nature of Statistical Learning Theory. Springer Science & Business Media.
-  (2006) Quantile regression methods for reference growth charts. Statistics in Medicine 25 (8), pp. 1369–1382.
-  (2017) Dual inference for machine learning. In IJCAI, pp. 3112–3118.