Abstract
We propose a neural hybrid model consisting of a linear model defined on a set of features computed by a deep, invertible transformation (i.e. a normalizing flow). An attractive property of our model is that both $p(\text{features})$, the features’ density, and $p(\text{targets} \mid \text{features})$, the predictive distribution, can be computed exactly in a single feed-forward pass. We show that our hybrid model, despite the invertibility constraints, achieves similar accuracy to purely predictive models. Moreover, the generative component remains a good model of the input features despite the hybrid optimization objective. This offers additional capabilities such as detection of out-of-distribution inputs and semi-supervised learning. The availability of the exact joint density $p(\text{targets}, \text{features})$ also allows us to compute many quantities readily, making our hybrid model a useful building block for downstream applications of probabilistic deep learning.
Hybrid Models with Deep and Invertible Features
Eric Nalisnick*, Akihiro Matsukawa*, Yee Whye Teh, Dilan Gorur, Balaji Lakshminarayanan
Preprint. Work in progress.
In the majority of applications, deep neural networks model conditional distributions of the form $p(y \mid \mathbf{x})$, where $y$ denotes a label and $\mathbf{x}$ features or covariates. However, modeling just the conditional distribution is insufficient in many cases. For instance, if we believe that the model may be subjected to inputs unlike those of the training data, a model for $p(\mathbf{x})$ can possibly detect an outlier before it is passed to the conditional model for prediction. Thus modeling the joint distribution $p(y, \mathbf{x})$ provides a richer and more useful representation of the data. Models defined by combining a predictive model with a generative one are known as hybrid models (Jaakkola & Haussler, 1999; Raina et al., 2004; Lasserre et al., 2006; Kingma et al., 2014). Hybrid models have been shown to be useful for novelty detection (Bishop, 1994), semi-supervised learning (Druck et al., 2007), and information regularization (Szummer & Jaakkola, 2003).
Crafting a hybrid model usually requires training two models, one for $p(y \mid \mathbf{x})$ and one for $p(\mathbf{x})$, that share a subset (Raina et al., 2004) or possibly all (McCallum et al., 2006) of their parameters. Unfortunately, training a high-fidelity $p(\mathbf{x})$ model alone is difficult, especially in high dimensions, and good performance requires using a large neural network (Brock et al., 2019). Yet principled probabilistic inference is hard to implement with neural networks since they do not admit closed-form solutions and running Markov chain Monte Carlo takes prohibitively long. Variational inference then remains as the final alternative, and this now introduces a third model, which usually serves as the posterior approximation and/or inference network (Kingma & Welling, 2014; Kingma et al., 2014). To make matters worse, the $p(y \mid \mathbf{x})$ model may also require a separate approximate inference scheme, leading to additional computation and parameters.
In this paper, we propose a neural hybrid model that overcomes many of the aforementioned computational challenges. Most crucially, our model supports exact inference and evaluation of $p(\mathbf{x})$. Furthermore, in the case of regression, Bayesian inference for $p(y \mid \mathbf{x})$ is exact and available in closed form as well. Our model is made possible by leveraging recent advances in deep invertible generative models (Rezende & Mohamed, 2015; Dinh et al., 2017; Kingma & Dhariwal, 2018). These models are defined by composing invertible functions, and therefore the change-of-variables formula can be used to compute exact densities. Moreover, these invertible models have been shown to be expressive enough to perform well on prediction tasks (Gomez et al., 2017; Jacobsen et al., 2018). We use the invertible function as a natural feature extractor and define a linear model at the level of the latent representation. With one feed-forward pass we can obtain both $p(\mathbf{x})$ and $p(y \mid \mathbf{x})$, with the only additional cost being the log-determinant-Jacobian term required by the change of variables. While this term could be expensive to compute for general functions, much recent work has been done on defining expressive invertible neural networks with easy-to-evaluate Jacobians (Dinh et al., 2015; 2017; Kingma & Dhariwal, 2018; Grathwohl et al., 2019).
In summary, our contributions are:
• Defining a neural hybrid model with exact inference and evaluation of $p(y, \mathbf{x})$, which can be computed in one feed-forward pass and without any Monte Carlo approximations.
• Evaluating the model’s predictive accuracy and uncertainty on both classification and regression problems.
• Using the model’s natural ‘reject’ rule based on the generative component to filter out-of-distribution inputs.
• Showing that our hybrid model performs well at semi-supervised classification.
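We begin by establishing notation and reviewing the necessary background material. We denote matrices with upper-case and bold letters (e.g. $\mathbf{X}$), vectors with lower-case and bold (e.g. $\mathbf{x}$), and scalars with lower-case and no bolding (e.g. $x$). Let the collection of all observations be denoted $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, with $\mathbf{x}_n$ representing a vector containing features and $y_n$ a scalar representing the corresponding label. We define a predictive model’s density function to be $p(y \mid \mathbf{x}; \boldsymbol{\theta})$ and a generative density to be $p(\mathbf{x}; \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ are the shared model parameters. Let the joint likelihood be denoted $p(y, \mathbf{x}; \boldsymbol{\theta}) = p(y \mid \mathbf{x}; \boldsymbol{\theta})\, p(\mathbf{x}; \boldsymbol{\theta})$.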
Deep invertible transformations are the first key building block in our approach. These are simply high-capacity, bijective transformations with a tractable Jacobian matrix and inverse. The best known models of this class are the real non-volume preserving (RNVP) transform (Dinh et al., 2017) and its recent extension, the Glow transform (Kingma & Dhariwal, 2018). The bijective nature of these transforms is crucial as it allows us to employ the change-of-variables formula for exact density evaluation:
$$p_x(\mathbf{x}) = p_z\big(f(\mathbf{x}; \boldsymbol{\phi})\big) \left| \frac{\partial f_{\boldsymbol{\phi}}}{\partial \mathbf{x}} \right| \qquad (1)$$
where $f(\cdot; \boldsymbol{\phi})$ denotes the transform with parameters $\boldsymbol{\phi}$, $|\partial f_{\boldsymbol{\phi}} / \partial \mathbf{x}|$ the absolute value of the determinant of the Jacobian of the transform, and $p_z(\mathbf{z} = f(\mathbf{x}; \boldsymbol{\phi}))$ a distribution on the latent variables computed from the transform. The modeler is free to choose $p_z$, and therefore it is often set as a factorized standard Gaussian for computational simplicity. The parameters $\boldsymbol{\phi}$ are estimated via maximizing the exact log-likelihood $\log p_x(\mathbf{X}; \boldsymbol{\phi})$. While the invertibility requirements imposed on $f$ may seem too restrictive to define an expressive model, recent work using invertible transformations for classification (Jacobsen et al., 2018) reports metrics comparable to non-invertible residual networks, even on challenging benchmarks such as ImageNet, and recent work by Kingma & Dhariwal (2018) has shown that invertible generative models can produce sharp samples.
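As a small sanity check of the change-of-variables formula in Equation 1, the sketch below uses a one-dimensional affine map $z = (x - m)/s$ with a standard Gaussian latent density. The map and the names `shift` and `scale` are illustrative assumptions, not part of the model described here; under them the flow density should match the closed-form Gaussian $\mathcal{N}(x; m, s^2)$.

```python
import math

def log_standard_normal(z):
    """Log-density of a standard Gaussian at a scalar z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def flow_log_density(x, shift, scale):
    """log p_x(x) via change of variables for z = f(x) = (x - shift) / scale."""
    z = (x - shift) / scale
    log_abs_det = -math.log(abs(scale))  # |df/dx| = 1 / |scale|
    return log_standard_normal(z) + log_abs_det

def gaussian_log_density(x, mean, std):
    """Reference value: closed-form log N(x; mean, std^2)."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))

x, shift, scale = 1.3, 0.5, 2.0
flow_value = flow_log_density(x, shift, scale)
reference_value = gaussian_log_density(x, shift, scale)
print(flow_value, reference_value)
```

The two printed values agree because the Jacobian term accounts exactly for how the map stretches space.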
The affine coupling layer (ACL) is the key building block of the RNVP and Glow transforms. One ACL performs the following operations (Dinh et al., 2017):
1. Splitting: the input $\mathbf{x}$ is split at dimension $d$ into two separate vectors $\mathbf{x}_{:d}$ and $\mathbf{x}_{d:}$ (using Python list syntax).
2. Identity and Affine Transformations: Given the split $\{\mathbf{x}_{:d}, \mathbf{x}_{d:}\}$, each undergoes a separate operation:
$$\text{identity: } \mathbf{h}_{:d} = \mathbf{x}_{:d}, \qquad \text{affine: } \mathbf{h}_{d:} = t(\mathbf{x}_{:d}; \boldsymbol{\phi}_t) + \mathbf{x}_{d:} \odot \exp\{s(\mathbf{x}_{:d}; \boldsymbol{\phi}_s)\} \qquad (2)$$
where $t(\cdot)$ and $s(\cdot)$ are translation and scaling operations with no restrictions on their functional form. We can compute them with neural networks that take as input $\mathbf{x}_{:d}$, the other half of the original vector, and since $\mathbf{x}_{:d}$ has been copied forward by the first operation, no information is lost that would jeopardize invertibility.
3. Permutation: Lastly, the new representation $\mathbf{h} = \{\mathbf{h}_{:d}, \mathbf{h}_{d:}\}$ is ready to be either treated as output or fed into another ACL. If the latter, then the elements should be modified so that $\mathbf{h}_{:d}$ is not again copied but rather subject to the affine transformation. Dinh et al. (2017) simply exchange the components (i.e. $\{\mathbf{h}_{d:}, \mathbf{h}_{:d}\}$) whereas Kingma & Dhariwal (2018) apply a $1 \times 1$ convolution, which can be thought of as a continuous generalization of a permutation.
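The three steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the translation and scaling functions are hypothetical single-layer `tanh` maps with random weights (built by the helper `make_toy_nets`), standing in for the neural networks used in practice, and the permutation is the simple half-swap.

```python
import numpy as np

def make_toy_nets(d, D, seed=0):
    """Hypothetical stand-ins for t(.) and s(.): one tanh layer each."""
    rng = np.random.default_rng(seed)
    W_t = rng.normal(scale=0.5, size=(D - d, d))
    W_s = rng.normal(scale=0.5, size=(D - d, d))
    return (lambda a: np.tanh(W_t @ a)), (lambda a: np.tanh(W_s @ a))

def acl_forward(x, d, t, s):
    """Copy x[:d]; shift and scale x[d:] using functions of x[:d]."""
    h_affine = t(x[:d]) + x[d:] * np.exp(s(x[:d]))
    return np.concatenate([x[:d], h_affine])

def acl_inverse(h, d, t, s):
    """Invert the layer: x[:d] is read off directly, then x[d:] is recovered."""
    x_id = h[:d]
    x_affine = (h[d:] - t(x_id)) * np.exp(-s(x_id))
    return np.concatenate([x_id, x_affine])

def swap_halves(h, d):
    """Half-swap permutation: exchange the copied and transformed parts."""
    return np.concatenate([h[d:], h[:d]])

D, d = 6, 3
t, s = make_toy_nets(d, D)
x = np.arange(D, dtype=float) / D
h = acl_forward(x, d, t, s)
x_rec = acl_inverse(h, d, t, s)
print(np.allclose(x_rec, x))
```

Because the first half is copied unchanged, the inverse can recompute `t` and `s` from it without ever inverting those functions, which is why they may be arbitrary networks.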
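Several ACLs are composed to create the final form of $f(\cdot; \boldsymbol{\phi})$, which is called a normalizing flow (Rezende & Mohamed, 2015). Crucially, the log-determinant of the Jacobian of these operations is efficient to compute, simplifying to the sum of all the scale transformations:
$$\log \left| \frac{\partial f_{\boldsymbol{\phi}}}{\partial \mathbf{x}} \right| = \log \exp\Big\{ \sum_{l=1}^{L} s_l(\mathbf{x}_{:d}; \boldsymbol{\phi}_s) \Big\} = \sum_{l=1}^{L} s_l(\mathbf{x}_{:d}; \boldsymbol{\phi}_s) \qquad (3)$$
where $l$ is an index over ACLs, and each term is understood as the sum over the output dimensions of $s_l$ evaluated at the input to layer $l$. The Jacobian of a $1 \times 1$ convolution does not have as simple an expression, but Kingma & Dhariwal (2018) describe ways to reduce computation. Sampling from a flow is done by first sampling from the latent distribution and then passing that sample through the inverse transform: $\hat{\mathbf{z}} \sim p_z$, $\hat{\mathbf{x}} = f^{-1}(\hat{\mathbf{z}}; \boldsymbol{\phi})$.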
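To illustrate Equation 3, the sketch below stacks a few toy coupling layers with half-swap permutations, accumulates the log-determinant as the sum of scale outputs, and compares it against a finite-difference Jacobian. The layer parameterization (`build_flow`, single `tanh` layers with random weights) is a hypothetical stand-in chosen for brevity, and the numerical Jacobian is used only as a check.

```python
import numpy as np

def build_flow(D, num_layers, seed=0):
    """Random toy coupling layers (hypothetical parameters; D must be even)."""
    rng = np.random.default_rng(seed)
    d = D // 2
    return [(rng.normal(scale=0.5, size=(d, d)),
             rng.normal(scale=0.5, size=(d, d))) for _ in range(num_layers)]

def flow_forward(x, layers):
    """Return z = f(x) and log|det df/dx| as the running sum of scale outputs."""
    d = x.shape[0] // 2
    h, log_det = x, 0.0
    for W_t, W_s in layers:
        a, b = h[:d], h[d:]
        scale = np.tanh(W_s @ a)
        b = np.tanh(W_t @ a) + b * np.exp(scale)
        log_det += scale.sum()
        h = np.concatenate([b, a])  # swap halves; a permutation has |det| = 1
    return h, log_det

def numerical_log_abs_det(x, layers, eps=1e-6):
    """Central finite-difference Jacobian, used only to verify the sum."""
    D = x.shape[0]
    J = np.zeros((D, D))
    for j in range(D):
        e = np.zeros(D)
        e[j] = eps
        J[:, j] = (flow_forward(x + e, layers)[0]
                   - flow_forward(x - e, layers)[0]) / (2.0 * eps)
    return np.linalg.slogdet(J)[1]

layers = build_flow(D=4, num_layers=3)
x = np.array([0.3, -1.2, 0.7, 0.1])
z, log_det = flow_forward(x, layers)
print(log_det, numerical_log_abs_det(x, layers))
```

The analytic sum costs one addition per transformed dimension per layer, whereas the generic determinant requires forming the full $D \times D$ Jacobian.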
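Generalized linear models (GLMs) (Nelder & Baker, 1972) are the second key building block that we employ. They model the expected response as follows:
$$\mathbb{E}[y_n \mid \mathbf{z}_n] = g^{-1}\big(\boldsymbol{\beta}^{T} \mathbf{z}_n\big) \qquad (4)$$
where $\mathbb{E}[y_n \mid \mathbf{z}_n]$ denotes the expected value of $y_n$, $\boldsymbol{\beta} \in \mathbb{R}^{D}$ a vector of parameters, $\mathbf{z}_n \in \mathbb{R}^{D}$ the covariates, and $g^{-1}(\cdot)$ the inverse link function, which maps the linear predictor to the mean of the response, $g^{-1}: \mathbb{R} \to \mu_{y \mid \mathbf{z}}$. For notational convenience, we assume a scalar bias has been subsumed into $\boldsymbol{\beta}$. A Bayesian GLM could be defined by specifying a prior $p(\boldsymbol{\beta})$ and computing the posterior $p(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{Z})$. When the link function is the identity (i.e. simple linear regression) and $\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1} \mathbf{I})$, then the posterior distribution is available in closed form:
$$p(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{Z}) = \mathcal{N}\big(\boldsymbol{\beta}; \boldsymbol{\mu}_{\beta}, \boldsymbol{\Sigma}_{\beta}\big), \qquad \boldsymbol{\Sigma}_{\beta} = \big(\alpha \mathbf{I} + \sigma_0^{-2} \mathbf{Z}^{T} \mathbf{Z}\big)^{-1}, \qquad \boldsymbol{\mu}_{\beta} = \sigma_0^{-2}\, \boldsymbol{\Sigma}_{\beta} \mathbf{Z}^{T} \mathbf{y} \qquad (5)$$
where $\sigma_0$ is the standard deviation of the response noise. In the case of logistic regression, the prior is no longer conjugate and the posterior has no closed form, but it can be closely approximated (Jaakkola & Jordan, 1997).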
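The conjugate posterior for the identity-link case can be computed in a few lines. The sketch below assumes an isotropic zero-mean Gaussian prior with precision `prior_precision` and Gaussian noise with standard deviation `noise_std`; both names and the synthetic data are illustrative, not taken from the experiments.

```python
import numpy as np

def blr_posterior(Z, y, noise_std, prior_precision):
    """Posterior N(mean, cov) over beta for y ~ N(Z beta, noise_std^2 I)
    with prior beta ~ N(0, I / prior_precision)."""
    d = Z.shape[1]
    precision = prior_precision * np.eye(d) + Z.T @ Z / noise_std**2
    cov = np.linalg.inv(precision)
    mean = cov @ Z.T @ y / noise_std**2
    return mean, cov

# Synthetic check: with many observations the posterior mean approaches beta_true.
rng = np.random.default_rng(0)
beta_true = np.array([1.5, -0.5])
Z = rng.normal(size=(2000, 2))
y = Z @ beta_true + 0.1 * rng.normal(size=2000)
mean, cov = blr_posterior(Z, y, noise_std=0.1, prior_precision=1.0)
print(mean)
```

The posterior mean coincides with the ridge-regression solution with penalty `noise_std**2 * prior_precision`, which gives a convenient cross-check.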
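We propose a neural hybrid model consisting of a deep invertible transform coupled with a GLM. Together the two define a deep predictive model with the ability to compute both $p(\mathbf{x})$ and $p(y \mid \mathbf{x})$ exactly, in a single feed-forward pass. The model defines the following joint distribution over a label-feature pair $(y_n, \mathbf{x}_n)$:
$$p(y_n, \mathbf{x}_n; \boldsymbol{\theta}) = p(y_n \mid \mathbf{x}_n; \boldsymbol{\beta}, \boldsymbol{\phi})\, p(\mathbf{x}_n; \boldsymbol{\phi}) = p\big(y_n \mid f(\mathbf{x}_n; \boldsymbol{\phi}); \boldsymbol{\beta}\big)\, p_z\big(f(\mathbf{x}_n; \boldsymbol{\phi})\big) \left| \frac{\partial f_{\boldsymbol{\phi}}}{\partial \mathbf{x}_n} \right| \qquad (6)$$
where $\mathbf{z}_n = f(\mathbf{x}_n; \boldsymbol{\phi})$ is the output of the invertible transformation, $p_z(\mathbf{z}_n)$ is the latent distribution, and $p(y_n \mid f(\mathbf{x}_n; \boldsymbol{\phi}); \boldsymbol{\beta})$ is a GLM with the latent variables serving as its input features. For simplicity, we assume a factorized latent distribution $p_z(\mathbf{z}) = \prod_{j=1}^{D} p(z_j)$, following previous work (Dinh et al., 2017; Kingma & Dhariwal, 2018). Note that $\boldsymbol{\phi}$ are the parameters of the generative model and that $\boldsymbol{\theta} = \{\boldsymbol{\phi}, \boldsymbol{\beta}\}$ are the parameters of the joint model. Sharing $\boldsymbol{\phi}$ between both components allows the conditional distribution to influence the generative distribution and vice versa. We term the proposed neural hybrid model the deep invertible generalized linear model (DIGLM). Given labeled training data $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ sampled from the true distribution of interest $p^{*}(\mathbf{x}, y)$, the DIGLM can be trained by maximizing the exact joint log-likelihood, i.e.
$$\mathcal{J}(\boldsymbol{\theta}) = \log p(\mathbf{y}, \mathbf{X}; \boldsymbol{\theta}) = \sum_{n=1}^{N} \log p(y_n, \mathbf{x}_n; \boldsymbol{\theta})$$
via gradient ascent. As per the theory of maximum likelihood, maximizing this log probability is equivalent to minimizing the Kullback-Leibler (KL) divergence between the true joint distribution and the model: $\mathrm{KLD}\big[p^{*}(\mathbf{x}, y) \,\|\, p(\mathbf{x}, y; \boldsymbol{\theta})\big]$.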
Figure 1 shows a diagram of the DIGLM. We see that the computation pipeline is essentially that of a traditional neural network but one defined by stacking ACLs. The input first passes through $f(\cdot; \boldsymbol{\phi})$, and the latent representation and the stored Jacobian terms are enough to compute $p(\mathbf{x})$. In particular, evaluating $p_z(\mathbf{z})$ has an $\mathcal{O}(D)$ run-time cost for factorized distributions, and $|\partial f_{\boldsymbol{\phi}} / \partial \mathbf{x}|$ has an $\mathcal{O}(LD)$ run-time for RNVP architectures, where $L$ is the number of affine coupling layers and $D$ is the input dimensionality. Evaluating the predictive model $p(y \mid f(\mathbf{x}; \boldsymbol{\phi}); \boldsymbol{\beta})$ adds another $\mathcal{O}(D)$ cost in computation, but this cost will be dominated by the prerequisite evaluation of $f(\mathbf{x}; \boldsymbol{\phi})$.
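The single-pass computation can be sketched for a toy DIGLM with one affine coupling layer, a factorized standard Gaussian latent density, and a logistic-regression head for a binary label. All shapes, weights, and function names (`toy_flow`, `diglm_log_joint`) are illustrative assumptions rather than the architecture used in the experiments.

```python
import numpy as np

def toy_flow(x, W_t, W_s):
    """One affine coupling layer; returns z and log|det dz/dx|."""
    d = x.shape[0] // 2
    scale = np.tanh(W_s @ x[:d])
    z = np.concatenate([x[:d], np.tanh(W_t @ x[:d]) + x[d:] * np.exp(scale)])
    return z, scale.sum()

def diglm_log_joint(x, y, W_t, W_s, beta):
    """Return (log p(y|x), log p(x)) for a binary label y in {0, 1}."""
    z, log_det = toy_flow(x, W_t, W_s)          # one pass through the flow
    log_pz = -0.5 * np.sum(z**2 + np.log(2.0 * np.pi))
    log_px = log_pz + log_det                    # change of variables
    logit = beta @ z                             # linear model on the latents
    log_py_given_x = y * logit - np.logaddexp(0.0, logit)
    return log_py_given_x, log_px

rng = np.random.default_rng(0)
W_t, W_s = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
beta = rng.normal(size=4)
x = np.array([0.2, -0.4, 1.0, 0.5])
lp1, log_px = diglm_log_joint(x, 1, W_t, W_s, beta)
lp0, _ = diglm_log_joint(x, 0, W_t, W_s, beta)
print(lp1 + log_px)  # log p(y = 1, x)
```

Both terms of the joint reuse the same latent vector `z`, so the predictive head adds only a dot product on top of the flow evaluation.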
In practice we found the DIGLM’s performance improved by introducing a scaling factor on the contribution of $p(\mathbf{x})$. The factor helps control for the effect of the drastically different dimensionalities of $y$ and $\mathbf{x}$. We denote this modified objective as:
$$\mathcal{J}_{\lambda}(\boldsymbol{\theta}) = \sum_{n=1}^{N} \Big( \log p(y_n \mid \mathbf{x}_n; \boldsymbol{\beta}, \boldsymbol{\phi}) + \lambda \log p(\mathbf{x}_n; \boldsymbol{\phi}) \Big) \qquad (7)$$
where $\lambda$ is the scaling constant. Weighted losses are commonly used in hybrid models (Lasserre et al., 2006; McCallum et al., 2006; Kingma et al., 2014; Tulyakov et al., 2017). Yet in our particular case, we can interpret the down-weighting as encouraging robustness to input variations. In other words, down-weighting the contribution of $\log p(\mathbf{x})$ can be considered a Jacobian-based regularization penalty. To see this, notice that the joint likelihood rewards maximization of $|\partial f_{\boldsymbol{\phi}} / \partial \mathbf{x}|$, thereby encouraging the model to increase the $\partial f_d / \partial x_d$ derivatives (i.e. the diagonal terms). This optimization objective stands in direct contrast to a long history of gradient-based regularization penalties (Girosi et al., 1995; Bishop, 1995; Rifai et al., 2011). These add the Frobenius norm of the Jacobian as a penalty to a loss function (or negative log-likelihood). Thus, we can interpret the de-weighting of $|\partial f_{\boldsymbol{\phi}} / \partial \mathbf{x}|$ as adding a Jacobian regularizer with weight $\tilde{\lambda} = 1 - \lambda$. If the latent distribution term is, say, a factorized Gaussian, the variance can be scaled by a factor of $\lambda$ to introduce regularization only to the Jacobian term.
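The regularization reading can be made explicit with a short rearrangement, written here as a sketch using the change-of-variables decomposition $\log p(\mathbf{x}; \boldsymbol{\phi}) = \log p_z(f(\mathbf{x}; \boldsymbol{\phi})) + \log |\partial f_{\boldsymbol{\phi}} / \partial \mathbf{x}|$. The weighted objective equals the unweighted joint log-likelihood minus a penalty of weight $1 - \lambda$ on the generative terms:

```latex
\begin{align*}
\mathcal{J}_{\lambda}(\boldsymbol{\theta})
  &= \sum_{n=1}^{N} \Big( \log p(y_n \mid \mathbf{x}_n; \boldsymbol{\beta}, \boldsymbol{\phi})
     + \lambda \log p(\mathbf{x}_n; \boldsymbol{\phi}) \Big) \\
  &= \sum_{n=1}^{N} \Big( \log p(y_n, \mathbf{x}_n; \boldsymbol{\theta})
     - (1 - \lambda) \log p_z\big(f(\mathbf{x}_n; \boldsymbol{\phi})\big)
     - (1 - \lambda) \log \Big| \frac{\partial f_{\boldsymbol{\phi}}}{\partial \mathbf{x}_n} \Big| \Big).
\end{align*}
```

For $0 \le \lambda < 1$ the last term penalizes large log-determinants, which is the sense in which the down-weighting acts like a Jacobian regularizer.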
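As mentioned in the introduction, having a joint density enables the model to be trained on data sets that do not have a label for every feature vector, i.e. semi-supervised data sets. When a label is not present, the principled approach to the situation is to integrate out the variable:
$$\int_{y} p(y, \mathbf{x}_n; \boldsymbol{\theta})\, dy = p(\mathbf{x}_n; \boldsymbol{\phi}) \int_{y} p(y \mid \mathbf{x}_n; \boldsymbol{\theta})\, dy = p(\mathbf{x}_n; \boldsymbol{\phi}) \qquad (8)$$
Thus, we should use the unpaired observations $\mathbf{x}_n$ to train just the generative component.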
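A minimal sketch of how a mixed batch might enter the weighted objective: every point contributes its generative term, while only labeled points contribute a predictive term. The function name, the boolean mask `has_label`, and the convention of storing `nan` for missing labels are assumptions made for illustration.

```python
import numpy as np

def semi_supervised_objective(log_py_given_x, log_px, has_label, lam):
    """Sum of lam * log p(x) over all points plus log p(y|x) over labeled points."""
    log_py_given_x = np.asarray(log_py_given_x, dtype=float)
    log_px = np.asarray(log_px, dtype=float)
    has_label = np.asarray(has_label, dtype=bool)
    # Entries for unlabeled points are ignored, so they may hold placeholders.
    predictive_term = np.where(has_label, log_py_given_x, 0.0).sum()
    generative_term = lam * log_px.sum()
    return predictive_term + generative_term

value = semi_supervised_objective(
    log_py_given_x=[-0.1, np.nan, -0.3],   # second point has no label
    log_px=[-2.0, -3.0, -4.0],
    has_label=[True, False, True],
    lam=0.5,
)
print(value)
```

Masking with `np.where` rather than multiplying by the mask keeps a `nan` placeholder from propagating into the sum.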
Equation 8 above also suggests a strategy for evaluating the model in real-world situations. One can imagine the DIGLM being deployed as part of a user-facing system and that we wish to have the model ‘reject’ inputs that are unlike the training data. In other words, the inputs are anomalous with respect to the training distribution, and we cannot expect the $p(y \mid \mathbf{x})$ component to make accurate predictions when $\mathbf{x}$ is not drawn from the training distribution. In this setting we have access only to the user-provided features $\mathbf{x}^{*}$, and thus should evaluate by way of Equation 8 again, computing $p(\mathbf{x}^{*}; \boldsymbol{\phi})$. This observation then leads to the natural rejection rule:
$$\text{if } p(\mathbf{x}^{*}; \boldsymbol{\phi}) < \tau, \text{ then reject } \mathbf{x}^{*} \qquad (9)$$
where $\tau$ is some threshold, which we propose setting as $\tau = \min_{\mathbf{x} \in \mathcal{D}} p(\mathbf{x}; \boldsymbol{\phi}) - c$, where the minimum is taken over the training set and $c$ is a free parameter providing slack in the margin. When rejecting a sample, we output uniform probabilities, denoted $p(y)$, hence the prediction for $\mathbf{x}^{*}$ is given by
$$p(y \mid \mathbf{x}^{*}) = p(y \mid \mathbf{x}^{*}; \boldsymbol{\theta})\, \mathbb{1}\big[p(\mathbf{x}^{*}; \boldsymbol{\phi}) \ge \tau\big] + p(y)\, \mathbb{1}\big[p(\mathbf{x}^{*}; \boldsymbol{\phi}) < \tau\big] \qquad (10)$$
where $\mathbb{1}[\cdot]$ denotes an indicator function. Similar generative-model-based rejection rules have been proposed previously (Bishop, 1994). This idea is also known as selective classification or classification with a reject option (Hellman, 1970; Cordella et al., 1995; Fumera & Roli, 2002; Herbei & Wegkamp, 2006; Geifman & El-Yaniv, 2017).
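The rejection rule and the fallback prediction can be sketched as below. One assumption is flagged loudly: the threshold and slack are applied to log-densities rather than densities, a choice made here only for numerical stability, so the `slack` parameter is not on the same scale as the constant $c$ in the text. The function names and toy numbers are hypothetical.

```python
import numpy as np

def rejection_threshold(train_log_px, slack=0.0):
    """Log-space threshold: smallest training log-density minus a slack term."""
    return np.min(train_log_px) - slack

def predict_with_reject(class_probs, log_px, log_tau, fallback=None):
    """Return p(y|x) if log p(x) >= log_tau, else the fallback distribution
    (uniform over the classes unless another fallback is supplied)."""
    class_probs = np.asarray(class_probs, dtype=float)
    if fallback is None:
        fallback = np.full_like(class_probs, 1.0 / class_probs.size)
    return class_probs if log_px >= log_tau else fallback

log_tau = rejection_threshold(np.array([-10.0, -12.0, -11.0]), slack=1.0)
accepted = predict_with_reject([0.9, 0.1], log_px=-12.5, log_tau=log_tau)
rejected = predict_with_reject([0.9, 0.1], log_px=-20.0, log_tau=log_tau)
print(log_tau, accepted, rejected)
```

A rejected input receives the maximally uncertain prediction, so downstream entropy-based checks flag it automatically.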
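We next describe a Bayesian treatment of the DIGLM, deriving some closed-form quantities of interest and discussing connections to Gaussian processes. The Bayesian DIGLM (B-DIGLM) is defined as follows:
$$\mathbf{z}_n = f(\mathbf{x}_n; \boldsymbol{\phi}), \qquad \boldsymbol{\beta} \sim p(\boldsymbol{\beta}), \qquad y_n \sim p(y_n \mid \mathbf{z}_n, \boldsymbol{\beta}) \qquad (11)$$
The material difference from the earlier formulation is that a prior $p(\boldsymbol{\beta})$ is now placed on the regression parameters. The B-DIGLM defines the joint distribution of three variables, $p(y_n, \mathbf{x}_n, \boldsymbol{\beta}; \boldsymbol{\phi})$, and to perform proper Bayesian inference, we should marginalize over $\boldsymbol{\beta}$ when training, resulting in the modified objective:
$$\mathcal{J}_{\lambda}(\boldsymbol{\phi}) = \log p(\mathbf{y} \mid \mathbf{Z}_{\boldsymbol{\phi}}) + \lambda \sum_{n=1}^{N} \log p(\mathbf{x}_n; \boldsymbol{\phi}), \qquad p(\mathbf{y} \mid \mathbf{Z}_{\boldsymbol{\phi}}) = \int p(\mathbf{y} \mid \mathbf{Z}_{\boldsymbol{\phi}}, \boldsymbol{\beta})\, p(\boldsymbol{\beta})\, d\boldsymbol{\beta} \qquad (12)$$
where $p(\mathbf{y} \mid \mathbf{Z}_{\boldsymbol{\phi}})$ is the marginal likelihood of the regression model.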
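While $p(\mathbf{y} \mid \mathbf{Z}_{\boldsymbol{\phi}})$ is not always available in closed form, it is in many practical cases. For instance, if we assume that the likelihood model is linear regression and that $\boldsymbol{\beta}$ is given a zero-mean Gaussian prior, i.e.
$$y_n \sim \mathcal{N}\big(y_n; \boldsymbol{\beta}^{T} \mathbf{z}_n, \sigma_0^{2}\big), \qquad \boldsymbol{\beta} \sim \mathcal{N}\big(\mathbf{0}, \alpha^{-1} \mathbf{I}\big),$$
then the marginal likelihood can be written as:
$$p(\mathbf{y} \mid \mathbf{Z}_{\boldsymbol{\phi}}) = \mathcal{N}\big(\mathbf{y}; \mathbf{0}, \sigma_0^{2} \mathbf{I} + \alpha^{-1} \mathbf{Z}_{\boldsymbol{\phi}} \mathbf{Z}_{\boldsymbol{\phi}}^{T}\big) \qquad (13)$$
where $\mathbf{Z}_{\boldsymbol{\phi}}$ is the matrix of all latent representations, which we subscript with $\boldsymbol{\phi}$ to emphasize that it depends on the invertible transform’s parameters.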
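From Equation 13 we see that B-DIGLMs are related to Gaussian processes (GPs) (Rasmussen & Williams, 2006). GPs are defined through their kernel function $k(\mathbf{x}_i, \mathbf{x}_j; \boldsymbol{\psi})$, which in turn characterizes the class of functions represented. The marginal likelihood under a GP is defined as
$$p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\psi}) = \mathcal{N}\big(\mathbf{y}; \mathbf{0}, \sigma_0^{2} \mathbf{I} + \mathbf{K}_{\boldsymbol{\psi}}\big)$$
with $\boldsymbol{\psi}$ denoting the kernel parameters. Comparing this equation to the B-DIGLM’s marginal likelihood in Equation 13, we see that they become equal by setting $\mathbf{K}_{\boldsymbol{\psi}} = \alpha^{-1} \mathbf{Z}_{\boldsymbol{\phi}} \mathbf{Z}_{\boldsymbol{\phi}}^{T}$, with $\alpha$ the precision of the Gaussian prior on $\boldsymbol{\beta}$, and thus we have the implied kernel $k(\mathbf{x}_i, \mathbf{x}_j) = \alpha^{-1} f(\mathbf{x}_i; \boldsymbol{\phi})^{T} f(\mathbf{x}_j; \boldsymbol{\phi})$. Perhaps there are even deeper connections to be made via Fisher kernels (Jaakkola & Haussler, 1999), which are kernel functions derived from generative models, but we leave this investigation to future work.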
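The implied kernel and the corresponding Gaussian marginal likelihood can be evaluated directly from a matrix of latent features. The sketch below assumes an isotropic prior with precision `prior_precision` and noise standard deviation `noise_std` (illustrative names), and uses a Cholesky factorization for the log-density.

```python
import numpy as np

def implied_kernel(Z, prior_precision):
    """K[i, j] = z_i^T z_j / prior_precision: a linear kernel on flow features."""
    return Z @ Z.T / prior_precision

def log_marginal_likelihood(y, Z, noise_std, prior_precision):
    """log N(y; 0, noise_std^2 I + K), evaluated via a Cholesky factor."""
    n = y.shape[0]
    C = noise_std**2 * np.eye(n) + implied_kernel(Z, prior_precision)
    L = np.linalg.cholesky(C)
    w = np.linalg.solve(L, y)  # L w = y, hence w^T w = y^T C^{-1} y
    return (-0.5 * w @ w - np.log(np.diag(L)).sum()
            - 0.5 * n * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 3))          # stand-in for latent representations
y = rng.normal(size=5)
K = implied_kernel(Z, prior_precision=2.0)
value = log_marginal_likelihood(y, Z, noise_std=0.7, prior_precision=2.0)
print(value)
```

When all features are zero the kernel vanishes and the expression reduces to independent Gaussian noise, which is a handy limiting case for testing.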
If the marginal likelihood is not available in closed form, then we must resort to approximate inference. In this case, understandably, our model loses the ability to compute exact marginal likelihoods. We can use one of the many lower bounds developed for variational inference to bypass the intractability. Using the usual variational Bayes evidence lower bound (ELBO) (Jordan et al., 1999), we have
$$\log p(\mathbf{y} \mid \mathbf{Z}_{\boldsymbol{\phi}}) \ge \mathbb{E}_{q(\boldsymbol{\beta})}\big[\log p(\mathbf{y} \mid \mathbf{Z}_{\boldsymbol{\phi}}, \boldsymbol{\beta})\big] - \mathrm{KLD}\big[q(\boldsymbol{\beta}) \,\|\, p(\boldsymbol{\beta})\big] \qquad (14)$$
where $q(\boldsymbol{\beta})$ is a variational approximation to the true posterior. We leave thorough investigation of approximate inference to future work, and in the experiments we use either conjugate Bayesian inference or point estimates for $\boldsymbol{\beta}$.
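The bound can be checked numerically in the conjugate linear-regression case, where everything is available in closed form: with a Gaussian $q(\boldsymbol{\beta})$ the ELBO is analytic, it equals the log marginal likelihood when $q$ is the exact posterior, and it is strictly smaller otherwise. The sketch below is such a check under assumed settings (isotropic prior precision, random features), not a recipe for the non-conjugate case.

```python
import numpy as np

def elbo(y, Z, m, S, noise_std, prior_precision):
    """ELBO for q(beta) = N(m, S), likelihood N(y; Z beta, noise_std^2 I),
    and prior beta ~ N(0, I / prior_precision)."""
    n, d = Z.shape
    resid = y - Z @ m
    expected_loglik = (-0.5 * n * np.log(2.0 * np.pi * noise_std**2)
                       - 0.5 * (resid @ resid + np.trace(Z @ S @ Z.T)) / noise_std**2)
    kl = 0.5 * (prior_precision * (np.trace(S) + m @ m) - d
                - d * np.log(prior_precision) - np.linalg.slogdet(S)[1])
    return expected_loglik - kl

def exact_log_evidence(y, Z, noise_std, prior_precision):
    """log N(y; 0, noise_std^2 I + Z Z^T / prior_precision)."""
    n = y.shape[0]
    C = noise_std**2 * np.eye(n) + Z @ Z.T / prior_precision
    return -0.5 * (y @ np.linalg.solve(C, y) + np.linalg.slogdet(C)[1]
                   + n * np.log(2.0 * np.pi))

rng = np.random.default_rng(1)
Z, y = rng.normal(size=(20, 3)), rng.normal(size=20)
S_post = np.linalg.inv(2.0 * np.eye(3) + Z.T @ Z / 0.5**2)  # exact posterior cov
m_post = S_post @ Z.T @ y / 0.5**2                           # exact posterior mean
tight = elbo(y, Z, m_post, S_post, 0.5, 2.0)
loose = elbo(y, Z, m_post + 0.3, S_post, 0.5, 2.0)           # shifted mean
evidence = exact_log_evidence(y, Z, 0.5, 2.0)
print(tight, loose, evidence)
```

The gap between `evidence` and `loose` is the KL divergence from the shifted $q$ to the exact posterior.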
One may ask: why stop the Bayesian treatment at the predictive component? Why not include a prior on the flow’s parameters as well? This could be done, but Riquelme et al. (2018) showed that Bayesian linear regression with deep features (i.e. computed by a deterministic neural network) is highly effective for contextual bandit problems, which suggests that capturing the uncertainty in the prediction parameters $\boldsymbol{\beta}$ is more important than the uncertainty in the representation parameters $\boldsymbol{\phi}$.
While several works have studied the trade-offs between generative and predictive models (Efron, 1975; Ng & Jordan, 2002), Jaakkola & Haussler (1999) were perhaps the first to meaningfully combine the two, using a generative model to define a kernel function that could then be employed by classifiers such as SVMs. Raina et al. (2004) took the idea a step further, training a subset of a naive Bayes model’s parameters with an additional predictive objective. McCallum et al. (2006) extended this framework to train all parameters with both generative and predictive objectives. Lasserre et al. (2006) showed that a simple convex combination of the generative and predictive objectives does not necessarily represent a unified model and proposed an alternative prior that better couples the parameters. Druck et al. (2007) empirically compared Lasserre et al. (2006)’s and McCallum et al. (2006)’s hybrid objectives specifically for semi-supervised learning.
Recent advances in deep generative models and stochastic variational inference have allowed the aforementioned frameworks to include neural networks as the predictive and/or generative components. Deep neural hybrid models have been defined by (at least) Kingma et al. (2014), Maaløe et al. (2016), Kuleshov & Ermon (2017), Tulyakov et al. (2017), and Gordon & Hernández-Lobato (2017). However, these models, unlike ours, require approximate inference to obtain the $p(\mathbf{x})$ component.
We are unaware of any work using normalizing flows as the generative component of a hybrid model. As mentioned in the Introduction, invertible residual networks have been shown to perform as well as non-invertible architectures on popular image benchmarks (Gomez et al., 2017; Jacobsen et al., 2018). However, while the change-of-variables formula could be calculated for these models, it is computationally difficult to do so, which prevents their application to generative modeling. The concurrent work of Behrmann et al. (2018) shows how to preserve invertibility in general residual architectures and describes a stochastic approximation of the volume element to allow for high-dimensional generative modeling. Hence their work could be used to define a hybrid model similar to ours, which they mention as an area for future work.
We now report experimental findings for a range of regression and classification tasks. Unless otherwise stated, we used the Glow architecture (Kingma & Dhariwal, 2018) to define the DIGLM’s invertible transform and factorized standard Gaussian distributions as the latent prior $p(\mathbf{z})$.
We first report a one-dimensional regression task to provide an intuitive demonstration of the DIGLM. We draw observations from a Gaussian mixture with parameters , , and equal component weights. We simulate responses with the function where denotes observation noise as a function of the mixture component . Specifically we chose . We train a B-DIGLM on 250 observations sampled in this way, use standard Normal priors for $p(\mathbf{z})$ and $p(\boldsymbol{\beta})$, and three planar flows (Rezende & Mohamed, 2015) to define $f(\mathbf{x}; \boldsymbol{\phi})$. We compare this model to a Gaussian process (GP) and a kernel density estimate (KDE), which both use squared exponential kernels.
Figure 2(a) shows the predictive distribution learned by the GP, and Figure 2(b) shows the DIGLM’s predictive distribution. We see that the models produce similar results, with the only conspicuous difference being that the GP has a stronger tendency to revert to its mean at the plot’s edges. Figure 2(c) shows the density learned by the DIGLM’s flow component (black line), and we plot it against the KDE (gray shading) for comparison. The single B-DIGLM is able to achieve comparable results to the separate GP and KDE models.
Thinking back to the rejection rule defined in Equation 9, this result, albeit on a toy example, suggests that density thresholding would work well in this case. All data observations fall within , and we see from Figure 2(c) that the DIGLM’s generative model smoothly decays to the left and right of this range, meaning that there does not exist an $x^{*}$ that lies outside the training support but has $p(x^{*}; \boldsymbol{\phi}) \ge \tau$.
Next we evaluate the model on a large-scale regression task using the flight delay data set (Hensman et al., 2013). The goal is to predict how long flights are delayed based on eight attributes. We use an RNVP transform as the invertible function and evaluate performance by measuring the root mean squared error (RMSE) and negative log-likelihood (NLL). Following Deisenroth & Ng (2015), we train using the first million data points and use the following as test data. We picked this split not only to illustrate the scalability of our method, but also because the test distribution is known to be slightly different from the training distribution, which poses challenges of non-stationarity.
To the best of our knowledge, state-of-the-art performance on this data set is a test RMSE of and a test NLL of (Lakshminarayanan et al., 2016). Our hybrid model achieves a slightly worse test RMSE of 40.46 but achieves a markedly better test NLL of 5.07. We believe that this superior NLL stems from the hybrid model’s ability to detect the non-stationarity of the data. Figure 3 shows a histogram of the $\log p(\mathbf{x})$ evaluations for the training data (blue bars) and test data (red bars). The leftward shift in the red bars confirms that the test data points indeed have lower density under the flow than the training points.
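Table 1: Results on MNIST (in-distribution) and NotMNIST (out-of-distribution).

| Model | MNIST BPD | MNIST error | MNIST NLL | NotMNIST BPD | NotMNIST NLL | NotMNIST Entropy |
|---|---|---|---|---|---|---|
| Discriminative ($\lambda = 0$) | 81.80* | 0.67% | 0.082 | 87.74* | 29.27 | 0.130 |
| Hybrid () | 1.83 | 0.73% | 0.035 | 5.84 | 2.36 | 2.300 |
| Hybrid () | 1.26 | 2.22% | 0.081 | 6.13 | 2.30 | 2.300 |
| Hybrid () | 1.25 | 4.01% | 0.145 | 6.17 | 2.30 | 2.300 |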
Moving on to classification, we train a DIGLM on MNIST using 16 Glow blocks ($1 \times 1$ convolution followed by a stack of ACLs) to define the invertible function. Inside of each ACL, we use a 3-layer Highway network (Srivastava et al., 2015) with hidden units to define the translation and scaling operations. We use batch normalization in the networks for simplicity in distributed coordination rather than actnorm as was used by Kingma & Dhariwal (2018). Optimization was done via Adam (Kingma & Ba, 2014) with an initial learning rate for k steps, then decayed by half at iterations k and k.
We compare the DIGLM to its discriminative component, which is obtained by setting the generative weight to zero (i.e. $\lambda = 0$). We report test classification error, NLL, and entropy of the predictive distribution. Following Lakshminarayanan et al. (2017), we evaluate on both the MNIST test set and the NotMNIST test set, using the latter as an out-of-distribution (OOD) set. The OOD test is a proxy for testing if the model would be robust to anomalous inputs when deployed in a user-facing system. The results are shown in Table 1. Looking at the MNIST results, the discriminative model achieves slightly lower test error, but the hybrid model achieves better NLL and entropy. As expected, $\lambda$ controls the generative-discriminative trade-off, with lower values favoring discriminative performance and higher values favoring generative performance.
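Table 2: Results on SVHN (in-distribution) and CIFAR-10 (out-of-distribution).

| Model | SVHN BPD | SVHN error | SVHN NLL | CIFAR-10 BPD | CIFAR-10 NLL | CIFAR-10 Entropy |
|---|---|---|---|---|---|---|
| Discriminative ($\lambda = 0$) | 15.40* | 4.26% | 0.225 | 15.20* | 4.60 | 0.998 |
| Hybrid () | 3.35 | 4.86% | 0.260 | 7.06 | 5.06 | 1.153 |
| Hybrid () | 2.40 | 5.23% | 0.253 | 6.16 | 4.23 | 1.677 |
| Hybrid () | 2.23 | 7.27% | 0.268 | 7.03 | 2.69 | 2.143 |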
Table 3: Semi-supervised results on MNIST.

| Model | MNIST error | MNIST NLL |
|---|---|---|
| 1000 labels only | 6.61% | 0.276 |
| 1000 labels + unlabeled | 2.21% | 0.168 |
| All labeled | 0.73% | 0.035 |
Next, we compare the generative density $p(\mathbf{x})$ of the hybrid model to that of the pure discriminative model ($\lambda = 0$), quantifying the results in bits per dimension (BPD). (Footnote 1: We report results for ; higher values are qualitatively similar.) Since the discriminative variant was not optimized to learn $p(\mathbf{x})$, we expect it to have a high BPD for both in- and out-of-distribution sets. This experiment is then a sanity check that a discriminative objective alone is insufficient for OOD detection and a hybrid objective is necessary. First examining the discriminative model’s BPD in Table 1, we see that it assigns similar values to MNIST and NotMNIST: 81.80 vs 87.74 respectively. While at first glance this difference suggests OOD detection is possible, a closer inspection of the per-instance histogram, which we provide in Subfigure 4(a), shows that the distributions of train and test set densities overlap heavily. Subfigure 4(b) shows the same histograms for the DIGLM trained with a hybrid objective. We now see conspicuous separation between the NotMNIST (red) and MNIST (blue) sets, which suggests the threshold rejection rule would work well in this case.
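For readers unfamiliar with the unit, bits per dimension is the negative log-density expressed in bits and averaged over input dimensions. The helper below is a minimal sketch of that conversion from nats; it deliberately omits any preprocessing offsets (for example, from dequantizing or rescaling pixel values) that a full image pipeline would typically add, and the function name is hypothetical.

```python
import math

def bits_per_dimension(log_px_nats, num_dims):
    """Convert a log-density in nats into bits per dimension (lower is better)."""
    return -log_px_nats / (num_dims * math.log(2.0))

# A density that spends exactly one bit on each of 784 dimensions.
example = bits_per_dimension(-784 * math.log(2.0), 784)
print(example)
```

Normalizing by the number of dimensions makes values comparable across data sets of different input sizes.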
Using the safe classification setup described earlier in Equation 10, we use the $p(y \mid \mathbf{x})$ head when $p(\mathbf{x}) > \tau$, where the threshold $\tau$ is set from the training-set densities as in Equation 9 and $p(y)$ is estimated using the label counts. The results are shown in Table 1. As expected, the hybrid model exhibits higher uncertainty and achieves better NLL and entropy on NotMNIST. To demonstrate that the hybrid model learns meaningful representations, we compute convex combinations of the latent variables $\mathbf{z} = f(\mathbf{x}; \boldsymbol{\phi})$. Figure 4(c) shows these interpolations in the MNIST latent space.
We move on to natural images, performing a similar evaluation on SVHN. For these experiments we use a larger network of Glow blocks and employ multi-scale factoring (Dinh et al., 2017) every blocks. We use a larger Highway network containing 300 hidden units. In order to preserve the visual structure of the image, we apply only a 3 pixel random translation as data augmentation during training. The rest of the training details are the same as those used for MNIST. We use CIFAR-10 for the OOD set.
Table 2 summarizes the classification results, reporting the same metrics as for MNIST. The trends are qualitatively similar to what we observe for MNIST: the discriminative model ($\lambda = 0$) has the best classification performance, but the hybrid model is competitive. Figure 5(a) reports the $\log p(\mathbf{x})$ evaluations for SVHN vs CIFAR-10. We see from the clear separation between the SVHN (blue) and CIFAR-10 (red) histograms that the hybrid model can detect the OOD CIFAR-10 samples. Figure 5(b) visualizes interpolations in latent space, again showing that the model learns coherent representations. Figure 5(c) shows confidence versus accuracy plots (Lakshminarayanan et al., 2017), using the selective classification rule of Equation 10, when tested on in-distribution and OOD data, which shows that the hybrid model is able to successfully reject OOD inputs.
As discussed earlier, one advantage of the hybrid model is the ability to leverage unlabeled data. We first performed a sanity check on simulated data, using interleaved half moons. Figure 6 shows samples from the model when trained without unlabeled data (left) and with unlabeled data (right). The rightmost figure shows noticeably less warping.
Next we present results on MNIST trained with only 1000 labeled points ( of the data set) while using the rest as unlabeled data. For the unlabeled points, we maximize $\log p(\mathbf{x})$ in the usual way and minimize the entropy for the $p(y \mid \mathbf{x})$ head, corresponding to entropy minimization (Grandvalet & Bengio, 2005). Table 3 shows the results. The semi-supervised hybrid model achieves a 2.21% classification error rate even though only a very small fraction of the data has been labeled. Techniques like virtual adversarial training (Miyato et al., 2018) and information regularization (Szummer & Jaakkola, 2003) might further improve the performance of semi-supervised learning. We leave this investigation to future work.
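The entropy term used for unlabeled points can be computed from the classifier head’s logits as sketched below. This is a generic, numerically stable softmax-entropy helper written for illustration (the function name is hypothetical); it says nothing about how the term was weighted in the experiments.

```python
import numpy as np

def predictive_entropy(logits):
    """Entropy in nats of softmax(logits), computed stably along the last axis."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)

uniform_entropy = predictive_entropy(np.zeros(10))          # maximal uncertainty
confident_entropy = predictive_entropy([100.0, 0.0, 0.0])   # nearly one-hot
print(uniform_entropy, confident_entropy)
```

Minimizing this quantity on unlabeled inputs pushes the head toward confident predictions, while a uniform prediction over ten classes attains the maximum value of $\log 10$ nats.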
We have presented a neural hybrid model created by combining deep invertible features and GLMs. We have shown that this model is competitive with discriminative models in terms of predictive performance but more robust to out-of-distribution inputs and non-stationary problems. The availability of exact $p(\mathbf{x}, y)$ allows us to simulate additional data, as well as compute many quantities readily, which could be useful for downstream applications of generative models, including but not limited to semi-supervised learning, active learning, and domain adaptation.
There are several interesting avenues for future work. Firstly, recent work has shown that deep generative models can assign higher likelihood to OOD inputs (Nalisnick et al., 2019; Choi & Jang, 2018), meaning that our rejection rule is not guaranteed to work in all settings. This is a challenge not just for our method but for all deep hybrid models. The DIGLM’s abilities may also be improved by considering flows constructed in other ways than stacking ACLs. Recent work on continuous-time flows (Grathwohl et al., 2019) and invertible residual networks (Behrmann et al., 2018) may prove to be more powerful than the Glow transform that we use, thereby improving our results. Lastly, we have only considered KL-divergence-based training in this paper. Alternative training criteria such as Wasserstein distance could potentially further improve performance.
Acknowledgements We thank Danilo Rezende for helpful feedback.
References
 Behrmann et al. (2018) Behrmann, J., Duvenaud, D., and Jacobsen, J.-H. Invertible Residual Networks. ArXiv e-Prints, 2018.
 Bishop (1994) Bishop, C. M. Novelty Detection and Neural Network Validation. IEE Proceedings - Vision, Image and Signal Processing, 141(4):217–222, 1994.
 Bishop (1995) Bishop, C. M. Training with Noise is Equivalent to Tikhonov Regularization. Neural Computation, 7(1):108–116, 1995.
 Brock et al. (2019) Brock, A., Donahue, J., and Simonyan, K. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In International Conference on Learning Representations (ICLR), 2019.
 Choi & Jang (2018) Choi, H. and Jang, E. Generative Ensembles for Robust Anomaly Detection. ArXiv e-Prints, 2018.
 Cordella et al. (1995) Cordella, L. P., De Stefano, C., Tortorella, F., and Vento, M. A Method for Improving Classification Reliability of Multilayer Perceptrons. IEEE Transactions on Neural Networks, 6(5):1140–1147, 1995.
 Deisenroth & Ng (2015) Deisenroth, M. P. and Ng, J. W. Distributed Gaussian Processes. In International Conference on Machine Learning (ICML), 2015.
 Dinh et al. (2015) Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-Linear Independent Components Estimation. ICLR Workshop Track, 2015.
 Dinh et al. (2017) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density Estimation Using Real NVP. In International Conference on Learning Representations (ICLR), 2017.
 Druck et al. (2007) Druck, G., Pal, C., McCallum, A., and Zhu, X. Semi-Supervised Classification with Hybrid Generative/Discriminative Methods. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 280–289. ACM, 2007.
 Efron (1975) Efron, B. The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis. Journal of the American Statistical Association, 70(352):892–898, 1975.
 Fumera & Roli (2002) Fumera, G. and Roli, F. Support Vector Machines with Embedded Reject Option. In Pattern Recognition with Support Vector Machines, pp. 68–82. Springer, 2002.
 Geifman & El-Yaniv (2017) Geifman, Y. and El-Yaniv, R. Selective Classification for Deep Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
 Girosi et al. (1995) Girosi, F., Jones, M., and Poggio, T. Regularization Theory and Neural Networks Architectures. Neural Computation, 7(2):219–269, 1995.
 Gomez et al. (2017) Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. The Reversible Residual Network: Backpropagation without Storing Activations. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
 Gordon & Hernández-Lobato (2017) Gordon, J. and Hernández-Lobato, J. M. Bayesian Semisupervised Learning with Deep Generative Models. ArXiv e-prints, 2017.
 Grandvalet & Bengio (2005) Grandvalet, Y. and Bengio, Y. Semi-Supervised Learning by Entropy Minimization. In Advances in Neural Information Processing Systems (NeurIPS), 2005.
 Grathwohl et al. (2019) Grathwohl, W., Chen, R. T. Q., Bettencourt, J., and Duvenaud, D. Scalable Reversible Generative Models with Free-form Continuous Dynamics. In International Conference on Learning Representations (ICLR), 2019.
 Hellman (1970) Hellman, M. E. The Nearest Neighbor Classification Rule with a Reject Option. IEEE Transactions on Systems Science and Cybernetics, 6(3):179–185, 1970.
 Hensman et al. (2013) Hensman, J., Fusi, N., and Lawrence, N. D. Gaussian Processes for Big Data. In Conference on Uncertainty in Artificial Intelligence (UAI), 2013.
 Herbei & Wegkamp (2006) Herbei, R. and Wegkamp, M. H. Classification with Reject Option. Canadian Journal of Statistics, 34(4):709–721, 2006.
 Jaakkola & Haussler (1999) Jaakkola, T. and Haussler, D. Exploiting Generative Models in Discriminative Classifiers. In Advances in Neural Information Processing Systems (NeurIPS), 1999.
 Jaakkola & Jordan (1997) Jaakkola, T. and Jordan, M. A Variational Approach to Bayesian Logistic Regression Models and their Extensions. In Sixth International Workshop on Artificial Intelligence and Statistics, volume 82, pp. 4, 1997.
 Jacobsen et al. (2018) Jacobsen, J.-H., Smeulders, A. W., and Oyallon, E. i-RevNet: Deep Invertible Networks. In International Conference on Learning Representations (ICLR), 2018.
 Jordan et al. (1999) Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An Introduction to Variational Methods for Graphical Models. Machine Learning, 37(2):183–233, 1999.
 Kingma & Ba (2014) Kingma, D. and Ba, J. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations (ICLR), 2014.
 Kingma & Dhariwal (2018) Kingma, D. P. and Dhariwal, P. Glow: Generative Flow with Invertible 1x1 Convolutions. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. International Conference on Learning Representations (ICLR), 2014.
 Kingma et al. (2014) Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. Semi-Supervised Learning with Deep Generative Models. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
 Kuleshov & Ermon (2017) Kuleshov, V. and Ermon, S. Deep Hybrid Models: Bridging Discriminative and Generative Approaches. In Conference on Uncertainty in Artificial Intelligence (UAI), 2017.
 Lakshminarayanan et al. (2016) Lakshminarayanan, B., Roy, D. M., and Teh, Y. W. Mondrian Forests for Large Scale Regression when Uncertainty Matters. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
 Lakshminarayanan et al. (2017) Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
 Lasserre et al. (2006) Lasserre, J. A., Bishop, C. M., and Minka, T. P. Principled Hybrids of Generative and Discriminative Models. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 1, pp. 87–94. IEEE, 2006.
 Maaløe et al. (2016) Maaløe, L., Sønderby, C. K., Sønderby, S. K., and Winther, O. Auxiliary Deep Generative Models. In International Conference on Machine Learning (ICML), 2016.
 McCallum et al. (2006) McCallum, A., Pal, C., Druck, G., and Wang, X. Multi-Conditional Learning: Generative / Discriminative Training for Clustering and Classification. In AAAI, pp. 433–439, 2006.
 Miyato et al. (2018) Miyato, T., Maeda, S.-i., Ishii, S., and Koyama, M. Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 Nalisnick et al. (2019) Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D., and Lakshminarayanan, B. Do Deep Generative Models Know What They Don’t Know? In International Conference on Learning Representations (ICLR), 2019.
 Nelder & Baker (1972) Nelder, J. A. and Baker, R. J. Generalized Linear Models. Wiley Online Library, 1972.
 Ng & Jordan (2002) Ng, A. Y. and Jordan, M. I. On Discriminative vs Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. In Advances in Neural Information Processing Systems (NeurIPS), 2002.
 Raina et al. (2004) Raina, R., Shen, Y., McCallum, A., and Ng, A. Y. Classification with Hybrid Generative / Discriminative Models. In Advances in Neural Information Processing Systems (NeurIPS), pp. 545–552, 2004.
 Rasmussen & Williams (2006) Rasmussen, C. E. and Williams, C. K. Gaussian Processes for Machine Learning. MIT Press, 2006.
 Rezende & Mohamed (2015) Rezende, D. and Mohamed, S. Variational Inference with Normalizing Flows. In International Conference on Machine Learning (ICML), 2015.
 Rifai et al. (2011) Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. Contractive Auto-Encoders: Explicit Invariance During Feature Extraction. In International Conference on Machine Learning (ICML), 2011.
 Riquelme et al. (2018) Riquelme, C., Tucker, G., and Snoek, J. Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling. In International Conference on Learning Representations (ICLR), 2018.
 Srivastava et al. (2015) Srivastava, R. K., Greff, K., and Schmidhuber, J. Training Very Deep Networks. In Advances in Neural Information Processing Systems (NeurIPS), 2015.
 Szummer & Jaakkola (2003) Szummer, M. and Jaakkola, T. S. Information Regularization with Partially Labeled Data. In Advances in Neural Information Processing Systems (NeurIPS), 2003.
 Tulyakov et al. (2017) Tulyakov, S., Fitzgibbon, A., and Nowozin, S. Hybrid VAE: Improving Deep Generative Models Using Partial Observations. In Advances in Neural Information Processing Systems (NeurIPS), 2017.