# Improving Regression Performance with Distributional Losses

## Abstract

There is growing evidence that converting targets to soft targets in supervised learning can provide considerable gains in performance. Much of this work has considered classification, converting hard zero-one values to soft labels—such as by adding label noise, incorporating label ambiguity or using distillation. In parallel, there is some evidence from a regression setting in reinforcement learning that learning distributions can improve performance. In this work, we investigate the reasons for this improvement, in a regression setting. We introduce a novel distributional regression loss, and similarly find it significantly improves prediction accuracy. We investigate several common hypotheses, around reducing overfitting and improved representations. We instead find evidence for an alternative hypothesis: this loss is easier to optimize, with better behaved gradients, resulting in improved generalization. We provide theoretical support for this alternative hypothesis, by characterizing the norm of the gradients of this loss.

## 1 Introduction

The choice of problem formulation for regression has a large impact on prediction performance on new data—generalization performance. There is an extensive literature on problem formulations to promote generalization, including robust losses (huber2011robust; ghosh2017robust; barron2017amore); proxy losses and reductions between problems (langford2006predicting); the addition of regularization to impose constraints or preferences on the solution; the addition of label noise (szegedy2016rethinking); and even ensuring multiple tasks are learned simultaneously, rather than separately, as in multi-task learning (caruana1998multitask). There is typically a goal in mind—such as classification accuracy or absolute error for regression—but those losses are not necessarily directly minimized.

In recent years, there has been a particular focus on learning representations with neural networks that generalize better. With fixed representations, the loss or problem formulation can only have so much impact, because the learned function is a linear function of inputs. With (deep) neural networks, however, the performance can vary widely, based even on simple modifications such as the initialization (glorot2010understanding). Particularly in classification, modifying the outputs can significantly improve performance. An extensive empirical study on classification and age prediction (gao2017deep), under label ambiguity, showed that data augmentation on the label side (putting a distribution over an ambiguous label) significantly improved test accuracy, a finding also validated by other work on age estimation (rothe2018deep). Work on model compression (ba2013do; urban2016do) and distillation (hinton2015distilling) highlights that a smaller student model can be trained to capture the generalization ability of a larger teacher model. In general, there is a growing literature on data augmentation and label smoothing that advocates for reduced overfitting and improved generalization from modifying the outputs (norouzi2016reward; szegedy2016rethinking; xie2016disturblabel; miyato2016distributional; pereyra2017regularizing), as well as work in reinforcement learning where learning distributional outputs, rather than means, improves performance (bellemare2017distributional).

There has been some work—though considerably less—towards understanding the impact of the properties of the loss that promote effective optimization. There is a recent insight that minimizing training time increases generalization performance (hardt2015train), motivating the design of losses that can be more easily optimized. Though not the focus in data augmentation, there have been some insights about loss properties. gao2017deep showed that their data augmentation approach provided a faster convergence rate (see their Figure 8). pereyra2017regularizing showed that label smoothing and their regularizer penalizing confident predictions for classification provided smoother gradient norms than without regularization. bellemare2017distributional hypothesized that the properties of the KL-divergence could have improved learning performance, in a reinforcement learning setting. These papers hint at something deeper occurring with the loss, and motivate investigation into not just the conversion of the problem but into the loss itself.

In this work, we show that the properties of the loss have a significant effect, and better explain the resulting increase in performance than preventing overfitting. We first propose a new loss for regression, called a Histogram Loss (HL). The targets are converted to a target distribution, and the KL-divergence taken between a histogram density and this target distribution. The choice of histogram density provides a relatively flexible prediction distribution, that nonetheless enables the KL-divergence to be computed efficiently. The prediction is then the expected value of this histogram density. This modification could be seen as converting the problem to a more difficult (multi-task) problem—from one output, to multiple values to represent the distribution—that promotes generalization in the learner and reduces overfitting. We show that instead of this hypothesis, the (optimization) properties of the HL seem to be the key factor in the resulting improved accuracy. We provide a series of empirical results to support this hypothesis. We also characterize the norm of the gradient of the HL which directly relates to sample complexity (hardt2015train). The bounds on the variability of the gradient help explain the positive empirical performance of the HL, and further motivate the use of this loss as an alternative for the standard loss for regression.

## 2 Distributional Losses for Regression

In this section, we introduce the Histogram Loss (HL), which generalizes beyond special cases of soft-target losses used in recent work (norouzi2016reward; szegedy2016rethinking; gao2017deep). We first introduce the loss and how it can be used for regression. We then relate it to other objectives, including maximum likelihood for regression and other methods that learn distributions.

### 2.1 Learning means and distributions

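In regression, it is common to use the squared-error loss, or $\ell_2$ loss. This corresponds to assuming that the continuous target variable $Y$ is Gaussian distributed, conditioned on inputs $\mathbf{x} \in \mathbb{R}^d$: $Y \sim \mathcal{N}(\mu = f(\mathbf{x}), \sigma^2)$ for a fixed variance $\sigma^2 > 0$ and some function $f: \mathbb{R}^d \to \mathbb{R}$ on the inputs, such as a linear function $f(\mathbf{x}) = \langle \mathbf{x}, \mathbf{w} \rangle$ for weights $\mathbf{w} \in \mathbb{R}^d$. The maximum likelihood function $f$ for $n$ samples $\{(\mathbf{x}_j, y_j)\}_{j=1}^n$ corresponds to minimizing the $\ell_2$ loss

$$\min_{f \in \mathcal{F}} \sum_{j=1}^{n} \left(f(\mathbf{x}_j) - y_j\right)^2 \tag{1}$$

with prediction $f(\mathbf{x}) \approx \mathbb{E}[Y \mid \mathbf{x}]$.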

Alternatively, one could consider learning a distribution over $Y$ directly, and then taking the mean of that distribution (or other statistics) to provide a prediction. This additional difficulty seems hardly worth the effort, considering only the mean is required for prediction. However, as motivated above, the increased difficulty could beneficially prevent overfitting and promote generalization.

There are many options for learning conditional distributions, even when only considering those that use neural networks (bishop1994mixture; tang2013learning; rothe2015dex; bellemare2017distributional). The goal of this work, however, is not to provide another method to learn distributions. Rather, the goal is to benefit from inducing a distribution over $Y$, even if that distribution will subsequently not be used, other than for computing a mean prediction. In our experiments, we will compare to an approach that learns distributions, but only to evaluate regression performance.

### 2.2 The Histogram Loss

Consider predicting a continuous target $Y$ with event space $\mathcal{Y}$, given inputs $\mathbf{x}$. Instead of directly predicting $Y$, we select a target distribution on $Y \mid \mathbf{x}$. This target distribution is selected upfront, by us, rather than being learned. Suppose the target distribution has support $[a, b]$, pdf $p$, and cdf $F$. We would like to learn a parameterized prediction distribution $q_{\mathbf{x}}$, conditioned on $\mathbf{x}$, by minimizing a KL-divergence to $p$. For an arbitrary $p$, however, this may be expensive. Further, depending on the parameterization of the prediction distribution, the objective may also be non-convex in those parameters.

We propose to restrict the prediction distribution $q_{\mathbf{x}}$ to be a histogram density. Assume $[a, b]$ has been uniformly partitioned into $k$ bins, with bin $i$ covering $[l_i, l_i + w_i]$ for width $w_i$, and let the function $f: \mathcal{X} \to [0,1]^k$ provide the $k$-dimensional vector $f(\mathbf{x})$ of coefficients indicating the probability that the target is in each bin, given $\mathbf{x}$. The density $q_{\mathbf{x}}$ corresponds to a (normalized) histogram, and has density value $f_i(\mathbf{x})/w_i$ in bin $i$. The KL-divergence between $p$ and $q_{\mathbf{x}}$ is

$$D_{KL}(p \,\|\, q_{\mathbf{x}}) = -\int_a^b p(y) \log \frac{q_{\mathbf{x}}(y)}{p(y)} \, dy = -\int_a^b p(y) \log q_{\mathbf{x}}(y) \, dy - \left( -\int_a^b p(y) \log p(y) \, dy \right).$$

The second term is the differential entropy, the extension of entropy to continuous random variables. Because the second term only depends on $p$, the aim is to minimize the first term: the cross-entropy between $p$ and $q_{\mathbf{x}}$. This loss simplifies, due to the form of $q_{\mathbf{x}}$:

$$-\int_a^b p(y) \log q_{\mathbf{x}}(y) \, dy = -\sum_{i=1}^{k} \int_{l_i}^{l_i + w_i} p(y) \log \frac{f_i(\mathbf{x})}{w_i} \, dy = -\sum_{i=1}^{k} \log \frac{f_i(\mathbf{x})}{w_i} \left( F(l_i + w_i) - F(l_i) \right).$$

In the minimization, the width itself can be ignored, because $\log \frac{f_i(\mathbf{x})}{w_i} = \log f_i(\mathbf{x}) - \log w_i$ and $\log w_i$ does not depend on $f$, giving the Histogram Loss

$$HL(p, q_{\mathbf{x}}) = -\sum_{i=1}^{k} c_i \log f_i(\mathbf{x}), \qquad c_i = F(l_i + w_i) - F(l_i). \tag{2}$$

This loss has several useful properties. One important property is that it is convex in $f_i(\mathbf{x})$; even if the loss is not convex in all network parameters, it is at least convex on the last layer. The other three benefits are due to restricting the form of the predicted distribution $q_{\mathbf{x}}$ to be a histogram density. First, the divergence to the full distribution $p$ can be efficiently computed. This contrasts with previous work, which samples the KL for a subset of $y$ values (norouzi2016reward; szegedy2016rethinking). Second, the choice of $p$ is flexible, as long as its CDF can be evaluated for each bin. The weighting $c_i$ can be computed offline once for each sample, making it inexpensive to query repeatedly for each sample during training. Third, different distributional choices simply result in different weightings in the cross-entropy. This simplicity facilitates interpreting the impact of changing the distributional assumptions on $Y$.
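As a rough illustration of Equation (2), the following sketch computes the bin coefficients from a CDF, evaluates the cross-entropy, and takes the mean of the histogram as the prediction. The helper names (`bin_edges`, `target_coefficients`, `histogram_loss`, `histogram_mean`), the toy uniform target, and the small `eps` inside the logarithm are assumptions made for this example rather than details from the paper.

```python
import numpy as np

def bin_edges(a, b, k):
    """Edges of k equal-width bins on [a, b]; returns an array of length k + 1."""
    return np.linspace(a, b, k + 1)

def target_coefficients(cdf, edges):
    """c_i = F(right edge of bin i) - F(left edge of bin i), for a callable CDF F."""
    F = np.array([cdf(e) for e in edges])
    return np.diff(F)

def histogram_loss(c, f, eps=1e-12):
    """HL = -sum_i c_i * log f_i; eps (an assumption) guards against log(0)."""
    return float(-np.sum(c * np.log(f + eps)))

def histogram_mean(f, edges):
    """Mean of the histogram density: sum_i f_i * (centre of bin i)."""
    centers = 0.5 * (edges[:-1] + edges[1:])
    return float(np.dot(f, centers))

def uniform_cdf(y):
    """Toy target: CDF of a uniform distribution on [0.2, 0.6]."""
    return min(max((y - 0.2) / 0.4, 0.0), 1.0)

edges = bin_edges(0.0, 1.0, 5)                 # support [0, 1] split into 5 bins
c = target_coefficients(uniform_cdf, edges)    # target mass in each bin
f = np.full(5, 0.2)                            # a uniform predicted histogram
loss = histogram_loss(c, f)
```

Because the coefficients depend only on the target and the bin edges, they can be cached per sample, which is the "computed offline once" property noted above.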

### 2.3 Target distributions and related objectives

Below, we consider some special cases for $p$ that are of interest and highlight connections to previous work.

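Truncated Gaussian on $Y \mid \mathbf{x}$ and HL-Gaussian. Consider a truncated Gaussian distribution, on support $[a, b]$, as the target distribution. The mean $\mu$ for this Gaussian is the datapoint $y_j$ itself, with fixed variance $\sigma^2$. The pdf $p$ is

$$p(y) = \frac{1}{Z \sigma \sqrt{2\pi}} \exp\left( -\frac{(y - \mu)^2}{2\sigma^2} \right)$$

where $Z = \frac{1}{2}\left( \operatorname{erf}\left( \frac{b - \mu}{\sqrt{2}\sigma} \right) - \operatorname{erf}\left( \frac{a - \mu}{\sqrt{2}\sigma} \right) \right)$, and the HL has, for bin $i$ with left edge $l_i$ and width $w_i$,

$$c_i = \frac{1}{2Z}\left( \operatorname{erf}\left( \frac{l_i + w_i - \mu}{\sqrt{2}\sigma} \right) - \operatorname{erf}\left( \frac{l_i - \mu}{\sqrt{2}\sigma} \right) \right).$$

This distribution enables significant smoothing over $Y$, through the variance parameter $\sigma^2$. We call this loss HL-Gaussian, defined by number of bins $k$ and variance $\sigma^2$. Based on positive empirical performance, it will be the main HL loss that we advocate for and analyze.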
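The HL-Gaussian coefficients can be sketched in a few lines, assuming equal-width bins and using the standard error function. The function name `hl_gaussian_coefficients` and the numeric values (10 bins on [0, 1], a target of 0.42, a standard deviation of 0.1) are illustrative choices, not settings taken from the experiments.

```python
import math
import numpy as np

def hl_gaussian_coefficients(y, edges, sigma):
    """Mass of N(y, sigma^2), truncated to [edges[0], edges[-1]], in each bin."""
    # Gaussian CDF at every bin edge, written with the error function.
    cdf = np.array([0.5 * (1.0 + math.erf((e - y) / (math.sqrt(2.0) * sigma)))
                    for e in edges])
    z = cdf[-1] - cdf[0]            # normaliser Z: Gaussian mass inside the support
    return np.diff(cdf) / z

edges = np.linspace(0.0, 1.0, 11)   # 10 bins of width 0.1
c = hl_gaussian_coefficients(y=0.42, edges=edges, sigma=0.1)
```

Dividing by the mass inside the support is what makes the target a truncated Gaussian, so the coefficients sum to one even when the datapoint is close to a boundary.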

Soft Targets and a Histogram Density on $Y \mid \mathbf{x}$. In classification, such as multinomial logistic regression, it is typical to assume $Y \mid \mathbf{x}$ is a categorical distribution, where $\mathcal{Y}$ is discrete. The goal is still to estimate this conditional distribution and, when training, hard 0-1 values for $Y$ are used in the cross-entropy. Soft labels, instead of 0-1 labels, can be used by adding label noise (norouzi2016reward; szegedy2016rethinking; pereyra2017regularizing). This can be seen as an instance of HL, but for discrete $Y$, where a categorical distribution is selected for the target distribution. Minimizing the cross-entropy to these soft labels corresponds to trying to match such a smoothed target distribution, rather than the original 0-1 categorical distribution.

Such soft targets have also been considered for ordinal regression, again motivated as label smoothing, for age prediction (gao2017deep; rothe2018deep). The outputs are smoothed using radial basis function similarities to a set of bin centers. This procedure can be seen as selecting a histogram density for the target distribution, where the coefficients for each bin are determined by these radial basis function similarities. The resulting loss is similar to HL-Gaussian, with slightly different coefficients $c_i$, though it was introduced as data augmentation to smooth (ordinal) targets.

Dirac delta on $Y \mid \mathbf{x}$. Finally, we consider the relationship to maximum likelihood. For classification, norouzi2016reward and szegedy2016rethinking used a combination of maximum likelihood and a KL-divergence to a (uniform) distribution. szegedy2016rethinking add uniform noise to the labels and norouzi2016reward sample from an exponentiated reward distribution, with a temperature parameter, for structured prediction. Both consider only a finite set for $\mathcal{Y}$, because they both address classification problems.

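The relationship between KL-divergence and maximum likelihood can be extended to continuous $Y$. The connection is typically in terms of statistical consistency: the maximum likelihood estimator approaches the minimum of the KL-divergence to the true distribution, if the distributions are of the same parametric form (wasserman2004all, Theorem 9.13). They can, however, be connected for finite samples with different distributions. Consider Gaussians centered around datapoints $y_j$, with arbitrarily small variances $\tfrac{1}{2}a^2$:

$$\delta_{a,j}(y) = \frac{1}{a\sqrt{\pi}} \exp\left( -\frac{(y - y_j)^2}{a^2} \right).$$

Let the target distribution have $p(y) = \delta_{a,j}(y)$ for each sample. Define function $c_{i,j}: (0, \infty) \to [0,1]$ as $c_{i,j}(a) = \int_{l_i}^{l_i + w_i} \delta_{a,j}(y) \, dy$, where bin $i$ covers $[l_i, l_i + w_i]$. For each $y_j$, as $a \to 0$, $c_{i,j}(a) \to 1$ if $y_j \in (l_i, l_i + w_i)$ and $c_{i,j}(a) \to 0$ otherwise. So, for $i_j$ s.t. $y_j \in (l_{i_j}, l_{i_j} + w_{i_j})$,

$$\lim_{a \to 0} HL(\delta_{a,j}, q_{\mathbf{x}_j}) = -\log f_{i_j}(\mathbf{x}_j).$$

The sum over samples for the HL to the Dirac delta on $Y \mid \mathbf{x}$, then, corresponds to the negative log-likelihood for $q_{\mathbf{x}}$:

$$\arg\min_{f} \; -\sum_{j=1}^{n} \log f_{i_j}(\mathbf{x}_j) = \arg\min_{f} \; -\sum_{j=1}^{n} \log q_{\mathbf{x}_j}(y_j),$$

since $q_{\mathbf{x}_j}(y_j) = f_{i_j}(\mathbf{x}_j)/w_{i_j}$ and the bin width does not depend on $f$.

Such a delta distribution on $Y \mid \mathbf{x}$ results in one coefficient $c_i$ being 1, reflecting the distributional assumption that $Y$ is certainly in a bin. In the experiments, we compare to this loss, which we call HL-OneBin.

Using a similar analysis to above, $p(y)$ can be considered as a mixture between $\delta_{a,j}(y)$ and a uniform distribution on $[a, b]$. For a weighting of $\epsilon$ on the uniform distribution and $k$ equal-width bins, the resulting loss HL-Uniform has $c_i = \epsilon/k$ for $i \neq i_j$, and $c_{i_j} = 1 - \epsilon + \epsilon/k$.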
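The two target choices just described can be written as coefficient vectors. The sketch below is one reading of the text, in which the HL-Uniform target places weight `eps` on a uniform density over the whole support and weight `1 - eps` on the one-bin target; the function names and the values (4 bins, a target of 0.3, `eps=0.2`) are illustrative assumptions.

```python
import numpy as np

def one_bin_coefficients(y, edges):
    """HL-OneBin: all target mass in the single bin that contains y."""
    k = len(edges) - 1
    # searchsorted locates the bin; clip so that y == edges[-1] maps to the last bin.
    i = int(np.clip(np.searchsorted(edges, y, side="right") - 1, 0, k - 1))
    c = np.zeros(k)
    c[i] = 1.0
    return c

def uniform_mix_coefficients(y, edges, eps):
    """Weight (1 - eps) on the one-bin target and eps on a uniform density."""
    widths = np.diff(edges)
    uniform = widths / widths.sum()        # uniform mass per bin (1/k for equal widths)
    return (1.0 - eps) * one_bin_coefficients(y, edges) + eps * uniform

edges = np.linspace(0.0, 1.0, 5)           # 4 equal-width bins on [0, 1]
c_one = one_bin_coefficients(0.3, edges)
c_uni = uniform_mix_coefficients(0.3, edges, eps=0.2)
```

With the one-bin target, the Histogram Loss reduces to the negative log of the predicted probability for the bin containing the datapoint, matching the limit derived above.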

## 3 Optimization properties of the HL

There are at least two motivations for this loss, in terms of promoting the search for effective solutions. The first is the stability of gradients, promoting stable gradient descent. The second is a connection to learning optimal policies in reinforcement learning. Both provide some insight that the properties of the HL, during optimization, promote better generalization performance.

Stable gradients for HL. hardt2015train have shown that the generalization performance for stochastic gradient descent is bounded by the number of steps that stochastic gradient descent takes during training, even for non-convex losses. The bound is also dependent on the properties of the loss. In particular, it is beneficial to have a loss function with small Lipschitz constant $L$, which bounds the norm of the gradient. Below, we discuss how the HL with a Gaussian distribution (HL-Gaussian) in fact promotes an improved bound on this norm, over both the $\ell_2$ loss and the HL with all weight in one bin (HL-OneBin).

In the proposition bounding the HL-Gaussian gradient, we assume

$$f_i(\mathbf{x}) = \frac{\exp\left( \phi_\theta(\mathbf{x})^\top \mathbf{w}_i \right)}{\sum_{j=1}^{k} \exp\left( \phi_\theta(\mathbf{x})^\top \mathbf{w}_j \right)} \tag{3}$$

for some function $\phi_\theta$ parameterized by a vector of parameters $\theta$. For example, $\phi_\theta(\mathbf{x})$ could be the last hidden layer in a neural network, with parameters $\theta$ for the entire network up to that layer. The proposition provides a bound on the gradient norm in terms of the current network parameters. Our goal is to understand how the gradients might vary locally for the parameters, as opposed to globally bounding the norm and characterizing the Lipschitz constant only in terms of the properties of the function class and loss.

###### Proposition 1 (Local Lipschitz constant for HL-Gaussian).

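Assume $\mathbf{x}, y$ are fixed, giving fixed coefficients $c_i$ in HL-Gaussian. Let $f_i(\mathbf{x})$ be as in (3), defined by the parameters $\mathbf{w} = \{\mathbf{w}_1, \ldots, \mathbf{w}_k\}$ and $\theta$, providing the predicted distribution $q_{\mathbf{x}}$. Assume for all $i$ that $\mathbf{w}_i^\top \phi_\theta(\mathbf{x})$ is locally $l$-Lipschitz continuous w.r.t. $\theta$:

$$\left\| \nabla_\theta \left( \mathbf{w}_i^\top \phi_\theta(\mathbf{x}) \right) \right\| \le l. \tag{4}$$

Then the norm of the gradient for HL-Gaussian, w.r.t. all the parameters in the network $\{\theta, \mathbf{w}\}$, is bounded by

$$\left\| \nabla_{\theta, \mathbf{w}} HL(p, q_{\mathbf{x}}) \right\| \le \left( l + \left\| \phi_\theta(\mathbf{x}) \right\| \right) \sum_{i=1}^{k} \left| c_i - f_i(\mathbf{x}) \right|. \tag{5}$$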

###### Proof.

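First consider the gradient of the HL, with explicit details on these computations in Appendix A:

$$\frac{\partial}{\partial \mathbf{w}_i} \sum_{j=1}^{k} c_j \log f_j(\mathbf{x}) = \left( c_i - f_i(\mathbf{x}) \right) \phi_\theta(\mathbf{x}).$$

The norm of the gradient of HL in Equation (2), w.r.t. $\mathbf{w}$, which is composed of all the weights $\mathbf{w}_i$, is

$$\left\| \nabla_{\mathbf{w}} HL(p, q_{\mathbf{x}}) \right\| \le \sum_{i=1}^{k} \left\| \left( c_i - f_i(\mathbf{x}) \right) \phi_\theta(\mathbf{x}) \right\| = \sum_{i=1}^{k} \left| c_i - f_i(\mathbf{x}) \right| \left\| \phi_\theta(\mathbf{x}) \right\|.$$

Similarly, the norm of the gradient w.r.t. $\theta$ is

$$\left\| \nabla_{\theta} HL(p, q_{\mathbf{x}}) \right\| = \left\| \sum_{i=1}^{k} \left( c_i - f_i(\mathbf{x}) \right) \nabla_\theta \left( \mathbf{w}_i^\top \phi_\theta(\mathbf{x}) \right) \right\| \le \sum_{i=1}^{k} \left| c_i - f_i(\mathbf{x}) \right| \, l.$$

Together, these bound the norm $\left\| \nabla_{\theta, \mathbf{w}} HL(p, q_{\mathbf{x}}) \right\|$. ∎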
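The last-layer part of this argument can be checked numerically. The sketch below compares the closed-form gradient of the loss, whose row $i$ is $(f_i - c_i)\phi$, against central finite differences, and then evaluates the bound on the norm with respect to the last-layer weights. The sizes, the random feature vector standing in for the representation, and the Dirichlet-sampled coefficients are assumptions chosen only to make the check self-contained.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def hl(W, phi, c):
    """Histogram loss -sum_i c_i log f_i with f = softmax(W @ phi)."""
    return float(-np.sum(c * np.log(softmax(W @ phi))))

rng = np.random.default_rng(0)
k, d = 6, 4                             # illustrative sizes: k bins, d features
W = rng.normal(size=(k, d))             # row i plays the role of w_i
phi = rng.normal(size=d)                # stands in for the representation of x
c = rng.dirichlet(np.ones(k))           # any target coefficients that sum to one

f = softmax(W @ phi)
analytic = np.outer(f - c, phi)         # gradient of the loss; row i is (f_i - c_i) * phi

# Central finite differences over every entry of W.
numeric = np.zeros_like(W)
h = 1e-6
for i in range(k):
    for j in range(d):
        E = np.zeros_like(W)
        E[i, j] = h
        numeric[i, j] = (hl(W + E, phi, c) - hl(W - E, phi, c)) / (2 * h)

grad_norm = np.linalg.norm(analytic)
bound = np.abs(c - f).sum() * np.linalg.norm(phi)
```

The sign differs from the expression in the proof only because the loss is the negative of the weighted log-probabilities; the norms, and hence the bound, are unaffected.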

The results by hardt2015train suggest it is beneficial for the local Lipschitz constant, or the norm of the gradient, to be small on each step. HL-Gaussian provides exactly this property. Besides the network architecture, which we are here assuming is chosen outside of our control, the HL-Gaussian gradient norm is proportional to $|c_i - f_i(\mathbf{x})|$. This number is guaranteed to be less than 1, but generally is likely to be even smaller, especially if $f_i(\mathbf{x})$ reasonably accurately predicts $c_i$. Further, the gradients should push the weights to stay within a range specified by $c_i$, rather than preferring to push some to be very small (close to 0) and others to be close to 1. For example, if $f_i(\mathbf{x})$ starts relatively uniform, then the objective does not encourage predictions $f_i(\mathbf{x})$ to get smaller than $c_i$. If the $c_i$ are non-negligible, this keeps $f_i(\mathbf{x})$ away from zero and the loss in a smaller range.

This contrasts with the norm of the gradient for both the $\ell_2$ loss and HL-OneBin. For the $\ell_2$ loss $\tfrac{1}{2}(f(\mathbf{x}) - y)^2$ with $f(\mathbf{x}) = \phi_\theta(\mathbf{x})^\top \mathbf{w}$, the gradient is $(f(\mathbf{x}) - y)\left[ \nabla_\theta \left( \mathbf{w}^\top \phi_\theta(\mathbf{x}) \right); \phi_\theta(\mathbf{x}) \right]$, giving gradient norm bound $\left( l + \| \phi_\theta(\mathbf{x}) \| \right) |f(\mathbf{x}) - y|$. The constant $|f(\mathbf{x}) - y|$, as opposed to $\sum_{i=1}^{k} |c_i - f_i(\mathbf{x})|$, can be much larger, even if $y$ is normalized to a bounded range, and can vary significantly more. HL-OneBin, on the other hand, shares the same constant as HL-Gaussian, but suffers from another problem. The Lipschitz constant $l$ in Equation (4) will likely be larger, because $c_i$ is frequently zero and so pushes $f_i(\mathbf{x})$ towards zero. This results in larger objective values and pushes $\mathbf{w}_i^\top \phi_\theta(\mathbf{x})$ to get larger, to enable $f_i(\mathbf{x})$ to get close to 1.

Connection to reinforcement learning. The HL can also be motivated through a connection to maximum entropy reinforcement learning. In reinforcement learning, an agent iteratively selects actions and transitions between states, to maximize (long-term) reward. The agent’s goal is to find an optimal policy, in as few interactions as possible. To do so, the agent begins by exploring more, to then enable more efficient convergence to optimal. Supervised learning can be expressed as a reinforcement learning problem (norouzi2016reward), where action selection conditioned on a state corresponds to making a prediction conditioned on a feature vector. An alternative view to minimizing prediction error is to search for a policy to make accurate predictions.

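One strategy to efficiently find an optimal policy is through a maximum entropy objective. The policy balances between selecting the action it believes to be optimal (making its current best prediction) and acting more randomly, with high entropy. For continuous action set $\mathcal{Y}$, the goal is to minimize the following objective

$$\int_{\mathcal{X}} p_s(\mathbf{x}) \left[ -\tau \, \mathbb{H}(q_{\mathbf{x}}) - \int_{\mathcal{Y}} q_{\mathbf{x}}(y) \, r(y, y_i) \, dy \right] d\mathbf{x} \tag{6}$$

where $\tau > 0$ is a temperature; $p_s$ is a distribution over states $\mathbf{x}$; $q_{\mathbf{x}}$ is the policy or distribution over actions for a given $\mathbf{x}$; and $r(y, y_i)$ is the reward function, such as the negative of the objective $r(y, y_i) = -\tfrac{1}{2}(y - y_i)^2$. Minimizing (6) corresponds to minimizing the KL-divergence across $\mathbf{x}$ between $q_{\mathbf{x}}$ and the exponentiated payoff distribution $p(y) = \frac{1}{Z} \exp\left( r(y, y_i)/\tau \right)$ where $Z = \int_{\mathcal{Y}} \exp\left( r(y, y_i)/\tau \right) dy$, because

$$D_{KL}(q_{\mathbf{x}} \,\|\, p) = -\mathbb{H}(q_{\mathbf{x}}) - \int_{\mathcal{Y}} q_{\mathbf{x}}(y) \log p(y) \, dy = -\mathbb{H}(q_{\mathbf{x}}) - \tau^{-1} \int_{\mathcal{Y}} q_{\mathbf{x}}(y) \, r(y, y_i) \, dy + \log Z.$$

The connection between the HL and maximum-entropy reinforcement learning is that both are minimizing a divergence to this exponentiated distribution $p$. The HL, however, is minimizing $D_{KL}(p \,\|\, q_{\mathbf{x}})$ instead of $D_{KL}(q_{\mathbf{x}} \,\|\, p)$. For example, a Gaussian target distribution with variance $\sigma^2$ corresponds to minimizing $D_{KL}(p \,\|\, q_{\mathbf{x}})$ with $r(y, y_i) = -\tfrac{1}{2}(y - y_i)^2$ and $\tau = \sigma^2$. These two KL-divergences are not the same, but a similar argument to norouzi2016reward could be extended for continuous $y$, showing $D_{KL}(q_{\mathbf{x}} \,\|\, p)$ is upper-bounded by $D_{KL}(p \,\|\, q_{\mathbf{x}})$ plus variance terms. The intuition, then, is that minimizing the HL is promoting an efficient search for an optimal (prediction) policy.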
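The relation between the maximum entropy objective and the KL-divergence to the exponentiated payoff distribution can be checked on a discretised action set, where integrals become sums. The grid, the temperature value, the target, and the randomly drawn policy below are all assumptions made for this check.

```python
import numpy as np

tau = 0.5                                    # temperature (illustrative value)
y_grid = np.linspace(-3.0, 3.0, 61)          # discretised action set
y_target = 0.7                               # plays the role of the observed target
r = -0.5 * (y_grid - y_target) ** 2          # reward: negative squared error

Z = np.exp(r / tau).sum()
p = np.exp(r / tau) / Z                      # exponentiated payoff distribution

rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(y_grid.size))      # an arbitrary policy over the grid

kl = float(np.sum(q * np.log(q / p)))        # KL(q || p)
entropy = float(-np.sum(q * np.log(q)))
objective = -tau * entropy - float(np.sum(q * r))   # per-state max-entropy objective

gap = tau * kl - (objective + tau * np.log(Z))      # zero up to rounding error
```

Since `tau * log(Z)` does not depend on the policy, minimising the objective over `q` is the same as minimising `KL(q || p)`; the Histogram Loss instead minimises the divergence in the opposite direction.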

Method | Train objective | Train MAE | Train RMSE | Test objective | Test MAE | Test RMSE |
---|---|---|---|---|---|---|
Linear Reg. | ||||||
HL-Gaussian | ||||||
$\ell_2$+Noise | ||||||
$\ell_2$+Clipping | ||||||
HL-OneBin | ||||||
HL-Uniform | ||||||
MDN | ||||||
$\ell_2$+Softmax | ||||||

## 4 Experiments

In this section, we investigate the utility of the HL-Gaussian for regression, compared to using an $\ell_2$ loss. We particularly investigate why the modification to this distributional loss improves performance, designing experiments to test if it is due to (a) the utility of learning distributions or smoothed targets, (b) a bias-variance trade-off from bin size or variance in the HL-Gaussian, (c) an improved representation, (d) nonlinearity introduced by the HL and (e) improved optimization properties of the loss.

Datasets and pre-processing.
All features are transformed to have zero mean and unit variance. We randomly split the data into train and test sets in each run.
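A minimal sketch of this pre-processing step is given below. It assumes that the scaling statistics are computed on the training split and reused for the test split, and that 20% of the data is held out; the text does not state either detail, and the function name `split_and_standardize` is hypothetical.

```python
import numpy as np

def split_and_standardize(X, y, test_fraction=0.2, seed=0):
    """Randomly split into train/test, then scale features with train-set statistics."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = perm[:n_test], perm[n_test:]

    mean = X[train_idx].mean(axis=0)
    std = X[train_idx].std(axis=0)
    std = np.where(std == 0.0, 1.0, std)     # leave constant features unscaled

    X_train = (X[train_idx] - mean) / std
    X_test = (X[test_idx] - mean) / std      # reuse the train statistics (assumption)
    return X_train, y[train_idx], X_test, y[test_idx]

# Synthetic data, only to exercise the function.
rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=3.0, size=(50, 3))
y = rng.normal(size=50)
X_train, y_train, X_test, y_test = split_and_standardize(X, y)
```

Passing a different `seed` for each run gives a different random split per run.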

The CT Position dataset is from CT images of patients (graf20112d), with 385 features
and the target set to the relative location of the image.

The Song Year dataset is a subset of The Million Song Dataset (bertin2011million), with 90 audio features for a song and target corresponding to the release year.

The Bike Sharing dataset (fanaee2014event), about hourly bike rentals for two years, has 16 features and target set to the number of rented bikes.

Root mean squared error (RMSE) and mean absolute error (MAE) are reported over 5 runs, with standard errors. We include both errors and objective values, on train and test, to provide a more complete picture of the causes of differences between the losses. For space, we only include in-depth results on CT Position in the main body. We summarize the overall conclusions on all three datasets below, and include the tables for Song Year and Bike Sharing in Appendix C and more dataset information in Appendix B.
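For reference, the reported quantities can be computed as in the sketch below. The standard error is taken here as the sample standard deviation across runs divided by the square root of the number of runs, which is the usual definition but is an assumption about how the tables were produced; the per-run values are made up for the example.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mean_and_standard_error(values):
    """Mean over runs and its standard error (sample std / sqrt(number of runs))."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std(ddof=1) / np.sqrt(len(values)))

# Hypothetical per-run test errors for 5 runs.
run_errors = [1.0, 2.0, 3.0, 4.0, 5.0]
mean_err, std_err = mean_and_standard_error(run_errors)
```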

Algorithms. We compared several regression strategies, distribution learning approaches and several variants of HL. All the approaches, except for Linear Regression, use the same neural network, with differences only in the output layer. The architecture for Song Year is 90-45-45-45-45-1 (4 hidden layers of size 45), for Bike Sharing is 16-64-64-64-64-1 and for CT Position is 385-192-192-192-192-1. All units employ ReLU activation, except the last layer, which uses linear activations. Unless specified otherwise, all networks using HL have 100 bins. Meta-parameters for comparison algorithms are chosen according to best Test MAE. Network architectures were chosen according to best Test MAE for $\ell_2$, with depth and width varied across 7 different values and with final choices being neither the biggest nor the smallest.

Linear Regression is included as a baseline, using ordinary least squares with the inputs.

Squared-error is the neural network trained using the $\ell_2$ loss. The targets are normalized to a fixed range, which was needed to improve stability and accuracy.

Absolute-error is the neural network using the $\ell_1$ loss.

$\ell_2$+Noise is the same as $\ell_2$, except Gaussian noise is added to the targets as a form of augmentation. The standard deviation of the noise is selected from a set of candidate values.

$\ell_2$+Clipping is the same as $\ell_2$, but with gradient norm clipping during training. The threshold for clipping is selected from a set of candidate values.
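Gradient norm clipping can be sketched as follows. This version rescales all gradient arrays jointly by their global norm, which is one common variant; whether the experiments clipped globally or per parameter tensor is not stated, and the function name and threshold values are assumptions.

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale a list of gradient arrays so that their joint norm is at most threshold."""
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, threshold / (total + 1e-12))    # no rescaling when already small
    return [g * scale for g in grads], total

grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]    # joint norm is 5
clipped, norm_before = clip_by_global_norm(grads, threshold=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
unchanged, _ = clip_by_global_norm(grads, threshold=10.0)
```

Clipping bounds the size of each update directly, whereas the HL-Gaussian analysis above bounds the gradient norm through the form of the loss itself.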

HL-OneBin is the HL, with Dirac delta target distribution.

HL-Uniform is the HL, with a target distribution that mixes between a delta distribution and the uniform distribution, with a weighting of $\epsilon$ on the uniform and $1 - \epsilon$ on the delta, where $\epsilon$ is selected from a set of candidate values.

HL-Gaussian is the HL, with a truncated Gaussian distribution as the target distribution. The variance $\sigma^2$ is set to the radius of the bins.

MDN is a Mixture Density Network (bishop1994mixture) that models the target distribution as a mixture of Gaussian distributions. The original model uses an exponential activation to model the standard deviations. However, inspired by lakshminarayanan2017simple, we used a softplus activation plus a small constant to avoid numerical instability. We selected the number of components from a set of candidate values. Predictions are made by taking the mean of the mixture model given by the MDN.
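The output parameterisation described for the MDN can be sketched as below: mixture weights from a softmax, standard deviations from a softplus plus a small constant, and the prediction as the mean of the mixture. The value `min_scale=1e-3` is a hypothetical placeholder, since the constant used in the experiments is not given here, and the function names are likewise illustrative.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def mdn_parameters(raw_weights, raw_means, raw_scales, min_scale=1e-3):
    """Map raw network outputs to mixture weights, means and standard deviations."""
    z = raw_weights - raw_weights.max()
    weights = np.exp(z) / np.exp(z).sum()            # softmax over components
    scales = softplus(raw_scales) + min_scale        # min_scale is a hypothetical value
    return weights, raw_means, scales

def mdn_mean(weights, means):
    """Mean of a Gaussian mixture: the weighted average of the component means."""
    return float(np.dot(weights, means))

weights, means, scales = mdn_parameters(
    raw_weights=np.array([0.0, 0.0]),
    raw_means=np.array([1.0, 3.0]),
    raw_scales=np.array([-50.0, 2.0]),
)
prediction = mdn_mean(weights, means)
```

The additive constant keeps every standard deviation strictly positive even when the raw output is very negative, which is the instability the softplus variant is meant to avoid.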

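$\ell_2$+Softmax uses a softmax layer with the $\ell_2$ loss, $\sum_{j=1}^{n} \left( f(\mathbf{x}_j)^\top \mathbf{m} - y_j \right)^2$, for bin centers $\mathbf{m}$, with otherwise the same settings as HL-Gaussian.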

We used Scikit-learn (scikit-learn) for the implementation of Linear Regression, and Keras (chollet2015keras) for the neural network models. All neural network models are trained with mini-batch size 256 using the Adam optimizer (kingma2014adam) with a learning rate of 1e-3, and the parameters are initialized according to the method suggested by lecun1998efficient. Dropout (srivastava2014dropout) is added to the input layer of all neural networks to avoid overfitting. We trained the networks for 1000 epochs on CT Position, 150 epochs on Song Year and 500 epochs on Bike Sharing.

Overall performance and conclusions (Tables 1, 4, 6). We first report the relative performance of all these models, on the CT Position dataset (Table 1) and, in Appendix C, the Song Year dataset (Table 4) and Bike Sharing dataset (Table 6). The overall conclusions are that the HL-Gaussian never harms performance—slightly improving performance on the Song Year dataset—and otherwise can significantly improve performance over alternatives—on both the CT Position and Bike Sharing datasets. We only report the full set of algorithms for CT Position, and more in-depth experiments understanding the result on that domain.

Learning other distributions is not effective (Table 1).

HL-Gaussian improves performance, but the other distribution-learning approaches appear to have little advantage, as shown in Table 1. HL-OneBin and HL-Uniform can actually do worse than $\ell_2$ Regression. MDN provides only minor gains over $\ell_2$ Regression. Interestingly, it has been shown that MDN suffers from numerical instabilities, making training difficult (oord2016pixel; rupprecht2016learning).

A related idea to learning the distribution explicitly is to use data augmentation, through label smoothing. We therefore also compared to directly modifying the labels and gradients, with $\ell_2$+Noise and $\ell_2$+Clipping. These models do perform slightly better than $\ell_2$ Regression for some settings, but do not achieve the same gains as HL-Gaussian.

The bias-variance trade-off in the loss definition is not significantly impacting performance (Figure 1).

If one fixes the possible range of the output variable, the distribution becomes more and more expressive as the number of bins increases. The model could have a higher chance of overfitting in this situation. Reducing the number of bins, on the other hand, introduces discretization error and increases the bias.
Further, the entropy parameter introduces a bias-variance trade-off, making the target distribution more uniform as entropy increases—likely resulting in lower variance—but also washing out the signal—incurring high bias.
The selection of these parameters, therefore, may provide an opportunity to influence this bias-variance trade-off, and improve performance by essentially optimizing the loss for a problem. The ability for the user to select these parameters could explain some of the performance gains in recent results (gao2017deep; bellemare2017distributional), compared to standard losses that cannot be tuned.

We tested the impact of varying the number of bins, and the entropy for HL-Gaussian. We found that these parameters, especially the entropy, can have an impact on performance, but that the results were much more robust to changing these parameters than might be expected (reported in more depth in Figure 1). It does not seem to be the case, therefore, that the tuning of these hyperparameters is the primary explanation for the improved performance.

Metric | Loss | Default | Fixed | Initialized | Random |
---|---|---|---|---|---|
Train MAE | $\ell_2$ Regression | ||||
 | HL-Gaussian | ||||
Train RMSE | $\ell_2$ Regression | ||||
 | HL-Gaussian | ||||
Test MAE | $\ell_2$ Regression | ||||
 | HL-Gaussian | ||||
Test RMSE | $\ell_2$ Regression | ||||
 | HL-Gaussian | ||||

The learned representation is not better (Table 2).

Learning a distribution, as opposed to a single statistic, provides a more difficult target—one that could require a better representation.
The hypothesis is that amongst the functions in the function class $\mathcal{F}$, there is a set of functions that can predict the targets almost equally well. To distinguish amongst these functions, a wider range of tasks can make it more likely to select the true function, or at least one that generalizes better.

We conducted three experiments to test the hypothesis that an improved representation is learned. We first trained with HL-Gaussian and $\ell_2$, to obtain their representations. We tested (a) swapping the representations and re-learning only the last layer, (b) initializing with the other's representation, and (c) using the same fixed random representation for both. For (a) and (c), the optimizations for both are convex, since the representation is fixed. The results in Table 2 are surprisingly conclusive: using the representation from HL-Gaussian does not improve performance of $\ell_2$, and even under a random representation, HL-Gaussian performs significantly better than $\ell_2$. This suggests that HL-Gaussian is not causing a more useful or more general representation to be learned, as otherwise $\ell_2$ should be able to take advantage of that representation.
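Setting (c), a shared fixed random representation, can be sketched on synthetic data. With the features fixed, the $\ell_2$ last layer is a least-squares problem and the HL-Gaussian last layer is a convex softmax regression onto soft targets. The data, sizes, step size, number of steps and the choice of one bin width for the smoothing scale are all assumptions for this illustration, not the paper's settings, and the sketch makes no claim about which loss obtains lower error.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, d, h, k = 200, 5, 32, 20                      # illustrative sizes only
X = rng.normal(size=(n, d))
y = 1.0 / (1.0 + np.exp(-X[:, 0]))               # synthetic targets in (0, 1)

# Fixed random representation: R is drawn once and never trained.
R = rng.normal(size=(d, h)) / np.sqrt(d)
Phi = np.hstack([np.maximum(X @ R, 0.0), np.ones((n, 1))])   # ReLU features + bias

# l2 last layer: convex least squares on the fixed features.
w_l2, *_ = np.linalg.lstsq(Phi, y, rcond=None)
pred_l2 = Phi @ w_l2

# HL-Gaussian last layer: soft targets from a truncated Gaussian around each y.
edges = np.linspace(0.0, 1.0, k + 1)
centers = 0.5 * (edges[:-1] + edges[1:])
sigma = edges[1] - edges[0]                      # assumed smoothing scale
erf = np.vectorize(math.erf)
cdf = 0.5 * (1.0 + erf((edges[None, :] - y[:, None]) / (math.sqrt(2.0) * sigma)))
C = np.diff(cdf, axis=1) / (cdf[:, -1:] - cdf[:, :1])        # each row sums to one

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def mean_hl(W):
    return float(-np.mean(np.sum(C * np.log(softmax_rows(Phi @ W)), axis=1)))

# Plain gradient descent; the problem is convex in W because Phi is fixed.
W = np.zeros((h + 1, k))
loss_start = mean_hl(W)
for _ in range(300):
    F = softmax_rows(Phi @ W)
    W -= 0.05 * Phi.T @ (F - C) / n
loss_end = mean_hl(W)
pred_hl = softmax_rows(Phi @ W) @ centers        # prediction = mean of the histogram

mae_l2 = float(np.mean(np.abs(pred_l2 - y)))
mae_hl = float(np.mean(np.abs(pred_hl - y)))
```

Settings (a) and (b) follow the same pattern, with `Phi` replaced by the representation learned under the other loss, either kept fixed or used as an initialisation.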

The softmax nonlinearity is not the main cause (Table 1).
The HL-Gaussian can be seen as a generalized linear model, where a small amount of nonlinearity is introduced from the transfer. The level of nonlinearity is similar to that in the cross-entropy loss, and the effect should be small because each transformed output has to predict a probability value. This contrasts with an alternative way to use a softmax layer, which we call $\ell_2$+Softmax, which gets to tune the softmax layer to directly predict $y$ given $\mathbf{x}$. Such a layer has additional parameters to predict one target (100 additional parameters, for 100 bins). This contrasts with the HL-Gaussian, which also has 100 bins but has to predict 100 targets instead of just one target.

Despite the differences between the role of the softmax in HL-Gaussian and $\ell_2$+Softmax, we provide this comparison to give some insight into potential nonlinearities introduced by the loss. The result in Table 1 shows that this softmax layer can improve performance (to 12.720), but not by as much as HL-Gaussian (8.992). This is particularly intriguing because, as mentioned above, $\ell_2$+Softmax can much more flexibly tune the nonlinear softmax layer. The ability to outperform $\ell_2$+Softmax emphasizes that there are properties of the HL causing improvements beyond the use of the softmax.

HL-Gaussian trains fast (Figure 4).

We trained $\ell_2$, HL-OneBin, and HL-Gaussian on the CT Position dataset with no dropout to find the role of the loss function on the rate of convergence. We also computed the norm of the gradient w.r.t. the parameters of the last layer after each epoch, and normalized the gradient norms of each model by their median to compare their variability. As shown in Figure 4, HL-Gaussian has significantly better behaved gradients than $\ell_2$. Correspondingly, it converges significantly faster and more smoothly. The other two methods that more carefully controlled gradients, $\ell_2$+Noise and $\ell_2$+Clipping, provided the next best gains to HL-Gaussian.
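The median normalisation used for this comparison is simple to state in code. In the sketch below, the per-epoch gradient norms are made-up numbers on deliberately different scales, and the standard deviation of the normalised norms is just one possible summary of variability; neither comes from the experiments.

```python
import numpy as np

def normalize_by_median(norms):
    """Divide per-epoch gradient norms by their median so that models on different scales are comparable."""
    norms = np.asarray(norms, dtype=float)
    return norms / np.median(norms)

# Hypothetical per-epoch last-layer gradient norms for two losses.
norms_a = [2.0, 4.0, 8.0, 4.0, 3.0]
norms_b = [200.0, 210.0, 190.0, 205.0, 195.0]

rel_a = normalize_by_median(norms_a)
rel_b = normalize_by_median(norms_b)
spread_a, spread_b = float(rel_a.std()), float(rel_b.std())
```

After normalisation the median of each series is 1, so the spread reflects how erratic the gradients are rather than their absolute magnitude.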

## 5 Conclusion

We introduced a novel loss for regression, called the Histogram Loss (HL), that explicitly constructs a distribution over targets to predict, rather than directly estimating the mean of the target conditioned on inputs. The loss involves minimizing the KL-divergence between a predicted distribution and this target distribution. To make this loss efficient to compute, without significantly reducing modeling power, we restrict the class of approximation densities to histogram densities. We highlight that for a particular setting of the HL, with a target Gaussian distribution, the norm of the gradient does not grow large or vary widely. Combined with recent results showing that reducing the number of training steps for stochastic gradient descent improves generalization, this provides some theoretical justification for why we observe such strong performance of HL-Gaussian in practice. We conduct a series of experiments to identify the source of this gain, with evidence that the main role is not due to reduced overfitting or an improved representation, but rather due to the fact that the HL can be optimized in a smaller number of steps, with smoother gradients.

The introduction of the HL provides several avenues to improve our choice of loss function. One direction is to more explicitly take advantage of the specification of the target distribution. In this work, we considered this loss only for a fixed set of bins, widths and variance parameter for the target distribution. To be more agnostic to these choices, we demonstrated performance across possible parameter settings. However, these parameters could be determined using meta-parameter optimization strategies, such as cross validation, or even learning strategies with particular objectives for these parameters. The key property to make the HL easy to specify and optimize was the use of a histogram to predict the target; the derivation does not prevent also optimizing the bin centers, widths and variances.

Overall, this work provides some unification of recent results using soft targets, through the introduction of the HL. We hope for it to facilitate discussion and development on the design of losses that promote learning, and direct further investigation into the importance of the optimization properties of these losses.

## Acknowledgments

We would like to thank Alberta Innovates for funding AMII (the Alberta Machine Intelligence Institute) and this research.

## Appendix A Explicit gradient computations

Let and . Then, since , for

For , we get

Consider now the gradient of the HL, w.r.t

Then

where is the Jacobian of .
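For orientation, the standard softmax cross-entropy gradient follows the same sequence of steps as above. The following is a sketch under assumed notation, not a recovery of the authors' exact equations: $\phi_\theta(x)$ denotes the last hidden layer, $w_i$ the output weights for bin $i$, $f_i$ the predicted bin probability, $p_i$ the target bin mass with $\sum_i p_i = 1$, and $k$ the number of bins.

```latex
% Assumed notation: z_i = \phi_\theta(x)^\top w_i are the bin logits.
f_i(x) = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}, \qquad
\mathrm{HL} = -\sum_{i=1}^{k} p_i \log f_i(x)

% Softmax derivative, separating the off-diagonal and diagonal cases.
\frac{\partial f_i}{\partial z_j} = -f_i f_j \quad (j \neq i), \qquad
\frac{\partial f_i}{\partial z_i} = f_i (1 - f_i)

% Gradient with respect to the logits, using \sum_i p_i = 1.
\frac{\partial\, \mathrm{HL}}{\partial z_j}
  = -\sum_{i=1}^{k} \frac{p_i}{f_i} \frac{\partial f_i}{\partial z_j}
  = -p_j (1 - f_j) + \sum_{i \neq j} p_i f_j
  = f_j - p_j

% Chain rule to the output weights and to the hidden-layer parameters,
% where J_\theta = \partial \phi_\theta(x) / \partial \theta.
\frac{\partial\, \mathrm{HL}}{\partial w_j} = (f_j - p_j)\, \phi_\theta(x), \qquad
\nabla_\theta \mathrm{HL} = \sum_{j=1}^{k} (f_j - p_j)\, J_\theta^\top w_j
```

Because each $f_j - p_j$ lies in $[-1, 1]$, the gradient with respect to the logits stays bounded however far the prediction is from the target, whereas a squared-error residual grows with that distance.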

## Appendix B Dataset details

Dataset | # train | # test | # features | target range |
---|---|---|---|---|
Song Year | 463715 | 51630 | 90 | [1922,2011] |
CT Position | 42800 | 10700 | 385 | [0,100] |
Bike Sharing | 13911 | 3478 | 16 | [0,1000] |

## Appendix C Additional experiments

We provide overall performance results for the two other datasets. We include the learning curves on test data for CT Position, corresponding to Figure 4 in the main text.

### C.1 Test Learning Curves for CT Position dataset

We include additional graphs for the variability in the RMSE and MAE, for the three objectives HL-Gaussian, HL-OneBin and $\ell_2$, for test data in Figure 4.

Method | Train objective | Train MAE | Train RMSE | Test objective | Test MAE | Test RMSE |
---|---|---|---|---|---|---|
Linear Reg. | | | | | | |
HL-Gaussian | | | | | | |
HL-OneBin | | | | | | |
$\ell_2$+Softmax | | | | | | |

Method | Train objective | Train MAE | Train RMSE | Test objective | Test MAE | Test RMSE |
---|---|---|---|---|---|---|
Linear Reg. | | | | | | |
HL-Gaussian | | | | | | |
HL-OneBin | | | | | | |

Method | Train objective | Train MAE | Train RMSE | Test objective | Test MAE | Test RMSE |
---|---|---|---|---|---|---|
HL-Gaussian | | | | | | |
HL-OneBin | | | | | | |
$\ell_2$+Softmax | | | | | | |

### C.2 Experiments on Song Year dataset

For the Song Year dataset, we report results both for random train-test splits and for the fixed train/test split recommended by the authors of the dataset, which avoids the effect of an artist having songs in both the train and test sets.

For this dataset, both HL-Gaussian and HL-OneBin outperform $\ell_2$ only slightly, and perform similarly to each other. The $\ell_2$ loss with a nonlinear softmax layer also performs about the same, suggesting the main (small) gain for this dataset is from this nonlinearity. This further suggests that $\ell_2$ is likely a suitable loss for this problem, and there is little to gain from switching to the HL. There is a slightly larger gain for HL-Gaussian in Table 5 for the training/test split suggested by the authors of this data, but still not nearly as large as for CT Position or Bike Sharing.

### C.3 Experiments on Bike Sharing dataset

We provide a comparison of performance on the Bike Sharing dataset in Table 6. We used early stopping to avoid overfitting, because on this dataset, dropout was ineffective.

The network for Bike Sharing uses four hidden layers of width 64, but we additionally tested a network architecture with four hidden layers of width 512. For this wider network, $\ell_2$ was able to get better final Test MAE and Test RMSE performance. However, performance was quite a bit more variable during learning, likely due to the overparameterization. Future work is to better understand the effect of different network architectures on the performance of the different losses.

### Footnotes

- It is possible that having 100 extra parameters in the last layer makes it possible to benefit from randomness, compared to $\ell_2$. We ran experiments enabling $\ell_2$ to have 100 outputs, each predicting the target but with different initial weights. Even selecting the best of the 100 outputs on the test data only slightly improved performance, with a test MAE of 18.421.