Semi-supervised Deep Kernel Learning:
Regression with Unlabeled Data by Minimizing Predictive Variance
Abstract
Large amounts of labeled data are typically required to train deep learning models. For many real-world problems, however, acquiring additional data can be expensive or even impossible. We present semi-supervised deep kernel learning (SSDKL), a semi-supervised regression model based on minimizing predictive variance in the posterior regularization framework. SSDKL combines the hierarchical representation learning of neural networks with the probabilistic modeling capabilities of Gaussian processes. By leveraging unlabeled data, we show improvements on a diverse set of real-world regression tasks over supervised deep kernel learning and semi-supervised methods such as VAT and mean teacher adapted for regression.
Neal Jean*, Sang Michael Xie*, Stefano Ermon (* denotes equal contribution)
Department of Computer Science, Stanford University, Stanford, CA 94305
{nealjean, xie, ermon}@cs.stanford.edu
Preprint. Work in progress.
1 Introduction
The prevailing trend in machine learning is to automatically discover good feature representations through end-to-end optimization of neural networks. However, most success stories have been enabled by vast quantities of labeled data [1]. This need for supervision poses a major challenge when we encounter critical scientific and societal problems where fine-grained labels are difficult to obtain. Accurately measuring the outcomes that we care about (e.g., childhood mortality, environmental damage, or extreme poverty) can be prohibitively expensive. Although these problems have limited data, they often contain underlying structure that can be used for learning; for example, poverty is strongly correlated over both space and time.
Semi-supervised learning approaches offer promise when few labels are available by allowing models to supplement their training with unlabeled data [2]. Mostly focusing on classification tasks, these methods often rely on strong assumptions about the structure of the data (e.g., cluster assumptions, low data density at decision boundaries) that generally do not apply to regression [3, 4, 5].
In this paper, we present semi-supervised deep kernel learning, which addresses the challenge of semi-supervised regression by building on previous work combining the feature learning capabilities of deep neural networks with the ability of Gaussian processes to capture uncertainty [6]. SSDKL incorporates unlabeled training data by minimizing predictive variance in the posterior regularization framework, a flexible way of encoding prior knowledge in Bayesian models [7, 8].
Our main contributions are the following:

We introduce semi-supervised deep kernel learning (SSDKL), a regression model that combines the strengths of heavily parameterized deep neural networks and nonparametric Gaussian processes. While the deep Gaussian process kernel induces structure in an embedding space, the model also allows a priori knowledge of structure (e.g., spatial or temporal) in the input features to be naturally incorporated through kernel composition.

By formalizing the semi-supervised variance minimization objective in the posterior regularization framework, we unify previous semi-supervised approaches such as minimum entropy and minimum variance regularization under a common framework. To our knowledge, this is the first paper connecting semi-supervised methods to posterior regularization.

We demonstrate that SSDKL can use unlabeled data to learn more generalizable features and improve performance on a range of regression tasks, outperforming the supervised deep kernel learning method and semi-supervised methods such as virtual adversarial training (VAT) and mean teacher [9, 10]. In a challenging real-world task of predicting poverty from satellite images, SSDKL outperforms the state-of-the-art on average, and incorporating prior knowledge of spatial structure increases the improvement further.
2 Preliminaries
We assume a training set of $n$ labeled examples $\{(x_i, y_i)\}_{i=1}^{n}$ and $m$ unlabeled examples $\{x_j\}_{j=n+1}^{n+m}$, with instances $x \in \mathcal{X}$ and labels $y \in \mathbb{R}$. Let $(X_L, y_L)$ and $X_U$ refer to the aggregated features and targets, where $X_L \in \mathbb{R}^{n \times d}$, $y_L \in \mathbb{R}^{n}$, and $X_U \in \mathbb{R}^{m \times d}$. At test time, we are given test examples $X_T$ that we would like to predict.
We will consider both inductive and transductive semi-supervised learning. In inductive semi-supervised learning, the labeled data and unlabeled data are used to learn a function $f: \mathcal{X} \to \mathbb{R}$ that generalizes well and is a good predictor on unseen test examples [2]. In transductive semi-supervised learning, the unlabeled examples are exactly the test data that we would like to predict, i.e., $X_U = X_T$ [11]. A transductive learning approach tries to find a function that predicts well on $X_U$, with no requirement of generalizing to additional test examples.
Gaussian processes
A Gaussian process (GP) is a collection of random variables, any finite number of which form a multivariate Gaussian distribution. Following the notation of [12], a Gaussian process defines a distribution over functions $f: \mathcal{X} \to \mathbb{R}$ from inputs to target values. If
$$f(x) \sim \mathcal{GP}\left(\mu(x),\, k_\theta(x, x')\right)$$
with mean function $\mu(\cdot)$ and covariance kernel function $k_\theta(\cdot, \cdot)$ parameterized by $\theta$, then any collection of function values is jointly Gaussian,
$$[f(x_1), \dots, f(x_n)]^\top \sim \mathcal{N}(\mu, K_{X,X}),$$
with mean vector and covariance matrix defined by the GP, s.t. $\mu_i = \mu(x_i)$ and $(K_{X,X})_{ij} = k_\theta(x_i, x_j)$. In practice, we often assume that observations include i.i.d. Gaussian noise, i.e., $y(x) = f(x) + \varepsilon$ where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, and the covariance function becomes
$$\mathrm{Cov}\left(y(x_i), y(x_j)\right) = k_\theta(x_i, x_j) + \sigma^2 \delta_{ij},$$
where $\delta_{ij}$ is the Kronecker delta. To make predictions at unlabeled points $X_U$, we can compute a Gaussian posterior distribution in closed form by conditioning on the observed data $(X_L, y_L)$. For a more thorough introduction, we refer readers to [13].
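As a concrete illustration of this closed-form conditioning, the following sketch (ours, not from the paper) computes the posterior mean and variance with plain NumPy, assuming an RBF kernel with unit signal variance and length scale:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, signal_var=1.0):
    """Squared exponential kernel k(a, b) = s^2 exp(-||a - b||^2 / (2 l^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise_var=0.1):
    """Closed-form GP posterior mean and pointwise variance at test inputs."""
    K = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    K_ss = rbf_kernel(X_test, X_test)
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_train          # posterior mean
    cov = K_ss - K_s.T @ K_inv @ K_s        # posterior covariance
    return mean, np.diag(cov)
```

Test points near the labeled data receive low posterior variance, while points far from any labeled example revert to the prior variance.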
Deep kernel learning
Deep kernel learning (DKL) combines neural networks with GPs by using a neural network embedding as input to a deep kernel [6]. Given input data $x \in \mathcal{X}$, a neural network parameterized by $w$ is used to extract features $h_w(x)$. The outputs are modeled as
$$f(x) \sim \mathcal{GP}\left(\mu(h_w(x)),\, k_\theta(h_w(x), h_w(x'))\right)$$
for some mean function $\mu(\cdot)$ and base kernel function $k_\theta(\cdot, \cdot)$ with parameters $\theta$. Parameters of the deep kernel are learned jointly by minimizing the negative log likelihood of the labeled data [12]:
$$\mathcal{L} = -\log p(y_L \mid X_L) \propto \frac{1}{2}\left[ y_L^\top \left(K_{X_L, X_L} + \sigma^2 I\right)^{-1} y_L + \log \left| K_{X_L, X_L} + \sigma^2 I \right| \right]. \quad (1)$$
For Gaussian distributions, the marginal likelihood is a closed-form, differentiable expression, allowing DKL models to be trained via backpropagation.
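A minimal NumPy sketch of this negative log marginal likelihood (our illustration, using a Cholesky factorization for numerical stability rather than an explicit inverse):

```python
import numpy as np

def gp_nll(K, y, noise_var=0.1):
    """Negative log marginal likelihood of GP targets y with covariance K:
    0.5 * (y^T (K + s^2 I)^{-1} y + log|K + s^2 I| + n log 2*pi)."""
    n = len(y)
    Ky = K + noise_var * np.eye(n)
    L = np.linalg.cholesky(Ky)            # stable solve and log-determinant
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    logdet = 2.0 * np.log(np.diag(L)).sum()
    return 0.5 * (y @ alpha + logdet + n * np.log(2.0 * np.pi))
```

In DKL, `K` would be built from the neural network embeddings, so gradients of this scalar flow back through both the kernel and network parameters.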
Posterior regularization
In probabilistic models, domain knowledge is generally imposed through the specification of priors. These priors, along with the observed data, determine the posterior distribution through the application of Bayes’ rule. However, it can be difficult to encode our knowledge in a Bayesian prior. Posterior regularization offers a more direct and flexible mechanism for controlling the posterior distribution.
Let $\mathcal{D}$ be a collection of observed data. [8] present a regularized optimization formulation called regularized Bayesian inference, or RegBayes. In this framework, the regularized posterior $q(\mathcal{M})$ is the solution of the following optimization problem:
$$\min_{q(\mathcal{M}) \in \mathcal{P}} \; \mathrm{KL}\left(q(\mathcal{M}) \,\|\, p(\mathcal{M} \mid \mathcal{D})\right) + \Omega\left(q(\mathcal{M})\right), \quad (2)$$
where the first term is the KL divergence between the desired post-data posterior $q(\mathcal{M})$ over models $\mathcal{M}$ and the standard Bayesian posterior $p(\mathcal{M} \mid \mathcal{D})$, and $\Omega$ is a regularization term. The goal is to learn a posterior distribution that is not too far from the standard Bayesian posterior while also fulfilling some requirements imposed by the regularization.
3 Semisupervised deep kernel learning
We introduce semi-supervised deep kernel learning (SSDKL) for problems where labeled data is limited but unlabeled data is plentiful. To learn from unlabeled data, we observe that a Bayesian approach provides us with a predictive posterior distribution, i.e., we are able to quantify predictive uncertainty. Thus, we regularize the posterior by adding an unsupervised loss term that minimizes the predictive variance at unlabeled data points:
$$\mathcal{L}_{\mathrm{semisup}} = \frac{1}{n}\,\mathcal{L} + \alpha\,\frac{1}{m}\,\mathcal{L}_{\mathrm{var}}, \quad (3)$$
$$\mathcal{L}_{\mathrm{var}} = \sum_{x \in X_U} \mathrm{Var}\left(f(x)\right), \quad (4)$$
where $n$ and $m$ are the numbers of labeled and unlabeled training examples and $\alpha$ is a hyperparameter controlling the trade-off between supervised and unsupervised components.
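A runnable sketch of this combined objective (our illustration, not the paper's implementation): a plain GP with an RBF kernel stands in for the deep kernel, so the inputs themselves play the role of the learned embedding.

```python
import numpy as np

def ssdkl_objective(X_L, y_L, X_U, alpha=0.5, noise_var=0.1, lengthscale=1.0):
    """Semi-supervised objective: (1/n) * NLL of labeled data plus
    (alpha/m) * summed posterior variance at unlabeled points.
    Plain-GP sketch; in SSDKL, inputs first pass through a neural net."""
    def k(A, B):
        d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d / lengthscale**2)

    n, m = len(X_L), len(X_U)
    Ky = k(X_L, X_L) + noise_var * np.eye(n)
    Ky_inv = np.linalg.inv(Ky)
    # supervised term: GP negative log marginal likelihood
    nll = 0.5 * (y_L @ Ky_inv @ y_L + np.linalg.slogdet(Ky)[1]
                 + n * np.log(2 * np.pi))
    # unsupervised term: posterior predictive variance at unlabeled points
    K_s = k(X_L, X_U)
    var = np.diag(k(X_U, X_U) - K_s.T @ Ky_inv @ K_s) + noise_var
    return nll / n + alpha * var.sum() / m
```

Embeddings that place unlabeled points near labeled ones lower the variance term, which is exactly the pressure the objective puts on the feature extractor.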
3.1 Variance minimization as posterior regularization
Let $X = X_L \cup X_U$ be the observed input data and $\mathcal{D} = (X_L, y_L)$ be the input data with labels for the labeled data. Let $\mathcal{F}$ denote a space of functions, where for $f \in \mathcal{F}$, $f$ maps from the inputs to the target values, and let $q(f, w)$ be a posterior distribution over the functions, where $w$ are parameters of $f$ from some parameter space $\mathcal{W}$ and $q$ belongs to a family of distributions $\mathcal{P}$ in which $q(w)$ is restricted to be a Dirac delta centered on some $w^*$. We assume that a likelihood density $p(\mathcal{D} \mid f, w)$ exists and let $p(f, w \mid \mathcal{D})$ denote the Bayesian posterior.
Instead of maximizing the marginal likelihood of the labeled training data in a purely supervised approach, we train our model in a semi-supervised fashion by minimizing the compound objective
$$\mathcal{L}_{\mathrm{semisup}} = \frac{1}{n}\,\mathcal{L} + \alpha\,\frac{1}{m} \sum_{x \in X_U} \mathrm{Var}\left(f(x)\right), \quad (5)$$
where $\alpha$ controls the trade-off between supervised and unsupervised components.
This semi-supervised variance minimization objective is a specific form of posterior regularization in the RegBayes framework. As in [8], we assume that the space of models is a complete separable metric space and that the Bayesian posterior is an absolutely continuous probability measure (with respect to a background measure) on the Borel $\sigma$-algebra, such that a posterior density exists.
Theorem 1. The semi-supervised variance minimization objective (5) is the solution of a RegBayes problem (2) in which the regularization term penalizes the posterior predictive variance at the unlabeled points $X_U$.
We include a formal derivation in Appendix A.1 and give a brief outline here. It can be shown that solving the variational optimization objective
$$\min_{q(f, w)} \; \mathrm{KL}\left(q(f, w) \,\|\, p(f, w)\right) - \mathbb{E}_{q(f, w)}\left[\log p(\mathcal{D} \mid f, w)\right]$$
is equivalent to minimizing the first term of the RegBayes objective in Theorem 1, and the minimizer is precisely the Bayesian posterior $p(f, w \mid \mathcal{D})$. When we restrict $q(w) = \delta_{w^*}(w)$, given any $w^*$, the optimizing value of $q(f \mid w^*)$ is the Bayesian posterior $p(f \mid \mathcal{D}, w^*)$.
In general, the optimal post-data posterior (after regularization) may take a different form than the Bayesian posterior. However, note that the variance regularizer depends only on $w^*$. In this case, the optimal post-data posterior in the regularized objective is still of the form $p(f \mid \mathcal{D}, w^*)\,\delta_{w^*}(w)$, and is modified by the regularization function only through changing $w^*$. From here, we can recover the variance minimization objective.
Intuition for variance minimization
By minimizing $\mathcal{L}_{\mathrm{semisup}}$, we trade off maximizing the likelihood of our observations against minimizing the posterior variance on unlabeled data that we wish to predict. Since the deep kernel parameters are jointly learned, the neural network is encouraged to learn a feature representation in which the unlabeled examples are similar to the labeled examples, thereby reducing the variance of our predictions. If we imagine the labeled data as "supports" for the surface representing the posterior mean, we are optimizing for embeddings in which unlabeled data tend to cluster around these labeled supports.
Another interpretation is that the semisupervised objective is a regularizer that reduces overfitting to labeled data. The model is discouraged from learning features from labeled data that are not also useful for making lowvariance predictions at unlabeled data points. In settings where unlabeled data provide additional variation beyond labeled examples, this can improve model generalization.
Training
Semi-supervised deep kernel learning scales well with large amounts of unlabeled data since the unsupervised objective naturally decomposes into a sum over conditionally independent terms. This allows for minibatch training on unlabeled data with stochastic gradient descent. Since all of the labeled examples are interdependent, computing exact gradients for the labeled examples requires full-batch gradient descent on the labeled data. However, previous work on approximate GP gradients using structured or sparse matrices allows for stochastic batched training and can be directly applied in our model, allowing SSDKL to scale with respect to both labeled and unlabeled data [14, 15, 16].
4 Experiments and results
We apply SSDKL to a variety of real-world regression tasks, beginning with eight datasets from the UCI repository [17]. We also explore the novel and challenging task of predicting local poverty measures from high-resolution satellite imagery [18]. In our reported results, we use the squared exponential or radial basis function kernel. We also experimented with polynomial kernels, but they generally performed worse. Additional training details are provided in Appendix A.3.
4.1 Baselines
We first compare SSDKL to the purely supervised DKL, showing the contribution of unlabeled data. Following previous work, the DKL model is initialized from an NN+GP model that holds the weights of a pretrained neural network fixed while optimizing the parameters of the GP.
In addition to the supervised DKL method, we compare against semi-supervised methods including co-training, consistency regularization, generative models, and label propagation. Since many of these methods were originally designed for semi-supervised classification, we adapt them for regression.
Coreg, or Co-training Regressors, uses two nearest neighbor (NN) regressors, each of which generates labels for the other during the learning process [19]. Unlike traditional co-training, which requires splitting features into sufficient and redundant views, Coreg achieves regressor diversity by using different distance metrics for its two regressors [20].
Consistency regularization methods aim to make model outputs invariant to local input perturbations [9, 21, 10]. For semi-supervised classification, [22] found that VAT and mean teacher were the best methods under fair evaluation guidelines. Virtual adversarial training (VAT) via local distributional smoothing (LDS) enforces consistency by training models to be robust to adversarial local input perturbations [9, 23]. Unlike adversarial training [24], the virtual adversarial perturbation is found without labels, making semi-supervised learning possible. We adapt VAT for regression by choosing the output distribution $p(y \mid x) = \mathcal{N}(h_w(x), \sigma^2)$ for input $x$, where $h_w$ is a parameterized mapping and $\sigma$ is fixed. Optimizing the likelihood term is then equivalent to minimizing squared error; the LDS term is the KL divergence between the model distribution and a perturbed Gaussian (see Appendix A.2). Mean teacher enforces consistency by penalizing deviation of the model outputs from those of a teacher model whose parameters are an exponentially weighted average of the model parameters over SGD iterations [10].
Label propagation defines a graph structure over the data with edges that define the probability for a categorical label to propagate from one data point to another [25]. If we encode this graph in a transition matrix $T$ and let the current class probabilities be $Y$, then the algorithm iteratively propagates $Y \leftarrow TY$, row-normalizes $Y$, clamps the labeled data to their known values, and repeats until convergence. We make the extension to regression by letting $Y$ be real-valued labels. As in [25], we use a fully connected graph and the radial basis kernel for edge weights. The kernel scale hyperparameter is chosen using a validation set.
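The regression adaptation above can be sketched in a few lines (our illustration; the transition matrix is row-normalized once up front, since real-valued labels need no per-step probability normalization):

```python
import numpy as np

def label_prop_regression(X, y, labeled_mask, gamma=1.0, n_iter=200):
    """Label propagation adapted to regression: propagate real-valued labels
    over an RBF-weighted fully connected graph, clamping labeled points."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    T = np.exp(-gamma * d)
    T = T / T.sum(axis=1, keepdims=True)   # row-normalized transition matrix
    f = np.where(labeled_mask, y, 0.0)
    for _ in range(n_iter):
        f = T @ f                          # propagate labels over the graph
        f[labeled_mask] = y[labeled_mask]  # clamp labeled data to known values
    return f
```

An unlabeled point ends up with a value dominated by its nearest labeled neighbors in the input space.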
Generative models such as the variational autoencoder (VAE) have shown promise in semi-supervised classification, especially for visual and sequential tasks [26, 27, 28, 29]. We compare against a semi-supervised VAE by first learning an unsupervised embedding of the data and then using the embeddings as input to a supervised multilayer perceptron.
Table 1: Percent reduction in RMSE compared to DKL for SSDKL, Coreg, label propagation, VAT, mean teacher, and VAE on the Skillcraft, Parkinsons, Elevators, Protein, Blog, CTslice, Buzz, and Electric datasets, with averages in the final row.
4.2 UCI regression experiments
We evaluate SSDKL on eight regression datasets from the UCI repository. For each dataset, we train on a small set of labeled examples, retain a held-out test set, and treat the remaining data as unlabeled examples. Following [22], the labeled data is randomly split 90-10 into training and validation samples, giving a realistically small validation set. We use the validation set for hyperparameter search and early stopping. We report test RMSE averaged over 10 trials of random splits to combat the small data sizes. Following [12], we choose a neural network with a similar [100-50-50-2] architecture and two-dimensional embedding. We reduced the size of the lower layers since our labeled training sets are much smaller, but as in [12], results were not sensitive to these choices. Following [22], we use this same base model for all deep models, including SSDKL, DKL, VAT, mean teacher, and the VAE encoder, in order to make results comparable across methods. Since label propagation creates a kernel matrix over all data points, we limit the number of unlabeled examples for label propagation to a maximum of 20,000 due to memory constraints. We initialize labels in label propagation with a kNN regressor to speed up convergence.
SSDKL performs at least as well as DKL across all datasets, and a Wilcoxon signed-rank test shows significance for at least one labeled training set size for 6 of the 8 datasets. SSDKL gives consistent average RMSE improvements over supervised DKL across labeled training set sizes, superior to other semi-supervised methods adapted for regression.
The same hyperparameters and initializations are used across all UCI datasets for SSDKL, with separate learning rates for the neural network and GP parameters. In Fig. 2 (right), we study the effect of varying $\alpha$ to trade off between maximizing the likelihood of labeled data and minimizing the variance of unlabeled data. A large $\alpha$ emphasizes minimization of the predictive variance, while a small $\alpha$ focuses on fitting labeled data. SSDKL improves on DKL across a wide range of $\alpha$ values, indicating that performance is not overly reliant on the choice of this hyperparameter. Fig. 2 (left) compares SSDKL to purely supervised DKL, Coreg, and VAT as we vary the labeled training set size.
Surprisingly, Coreg outperformed SSDKL on the Blog, CTslice, and Buzz datasets. We found that these datasets happen to be better suited for nearest-neighbors-based methods such as Coreg: a kNN regressor using only the labeled data outperformed DKL and SSDKL on all three datasets at the smaller labeled training set sizes. Since the kNN regressor already outperforms SSDKL with only labeled data, it is unsurprising that SSDKL is unable to close the gap on a semi-supervised nearest neighbors method like Coreg.
Representation learning
To gain some intuition about how the unlabeled data helps in the learning process, we visualize the neural network embeddings learned by the DKL and SSDKL models on the Skillcraft dataset. In Fig. 3 (left), we first train DKL on the labeled training examples and plot the two-dimensional neural network embedding that is learned. In Fig. 3 (right), we train SSDKL on the same labeled training examples along with additional unlabeled data points and plot the resulting embedding. In the left panel, DKL learns a poor embedding: different colors representing different output magnitudes are intermingled. In the right panel, SSDKL is able to use the unlabeled data for regularization and learns a better representation of the dataset.
Table 2: Percent reduction in RMSE for Spatial SSDKL, SSDKL, and DKL on poverty prediction in Malawi, Nigeria, Tanzania, Uganda, and Rwanda, with averages in the final row.
4.3 Poverty prediction
High-resolution satellite imagery offers the potential for cheap, scalable, and accurate tracking of changing socioeconomic indicators. In this task, we predict local poverty measures from satellite images using limited amounts of poverty labels. As described in [30], the dataset consists of villages across five African countries: Nigeria, Tanzania, Uganda, Malawi, and Rwanda. These include some of the poorest countries in the world (Malawi and Rwanda) as well as some that are relatively better off (Nigeria), making for a challenging and realistically diverse problem.
In this experiment, we use a limited number of labeled satellite images for training. With such a small dataset, we cannot expect to train a deep convolutional neural network (CNN) from scratch. Instead, we take a transfer learning approach as in [18], extracting 4096-dimensional visual features and using these as input. More details are provided in Appendix A.4.
Incorporating spatial information
In order to highlight the usefulness of kernel composition, we explore extending SSDKL with a spatial kernel. Spatial SSDKL composes two kernels by summing an image feature kernel and a separate location kernel that operates on location coordinates (latitude/longitude). By treating them separately, it explicitly encodes the knowledge that location coordinates are spatially structured and distinct from image features.
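A minimal sketch of this kind of kernel composition (our illustration, with hypothetical length scale parameters; a sum of valid kernels is itself a valid kernel):

```python
import numpy as np

def composed_kernel(img_a, img_b, loc_a, loc_b, l_img=1.0, l_loc=1.0):
    """Sum of an RBF kernel on image features and a separate RBF kernel on
    location coordinates, in the spirit of Spatial SSDKL."""
    def rbf(A, B, l):
        d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d / l**2)

    return rbf(img_a, img_b, l_img) + rbf(loc_a, loc_b, l_loc)
```

Two villages that are spatially close retain high covariance through the location term even when their image features differ, which is exactly the prior the composed kernel encodes.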
As shown in Table 2, all models outperform the baseline state-of-the-art ridge regression model from [30]. Spatial SSDKL significantly outperforms all other models, while simply concatenating location coordinates with the image features does not improve over the basic SSDKL model, which uses only image features. Spatial SSDKL outperforms the other models by directly modeling location coordinates as spatial features, showing that kernel composition can effectively incorporate prior knowledge of structure.
5 Related work
[31] introduced deep Gaussian processes, which stack GPs in a hierarchy by modeling the outputs of one layer with a Gaussian process in the next layer. Despite the suggestive name, these models do not integrate deep neural networks and Gaussian processes.
[6] proposed deep kernel learning, combining neural networks with the nonparametric flexibility of GPs and training endtoend in a fully supervised setting. Extensions have explored approximate inference, stochastic gradient training, and recurrent deep kernels for sequential data [14, 15, 16].
Our method draws inspiration from transductive experimental design, which chooses the most informative points (experiments) to measure by seeking data points that are both hard to predict and informative for the unexplored test data [32]. Similar prediction uncertainty approaches have been explored in semisupervised classification models, such as minimum entropy and minimum variance regularization, which can now also be understood in the posterior regularization framework [3, 33].
Recent work in generative adversarial networks (GANs) [26], variational autoencoders (VAEs) [27], and other generative models has achieved promising results on various semi-supervised classification tasks [28, 29]. However, we find that these models are not as well suited for generic regression tasks such as those in the UCI repository as they are for audiovisual tasks.
Consistency regularization posits that the model's output should be invariant to reasonable perturbations of the input [9, 21, 10]. Combining adversarial training [24] with consistency regularization, virtual adversarial training uses a label-free regularization term that allows for semi-supervised training [9]. Mean teacher adds a regularization term that penalizes deviation from an exponentially weighted average of the parameters over SGD iterations [10]. For semi-supervised classification, [22] found that VAT and mean teacher were the best methods across a series of fair evaluations.
Label propagation defines a graph structure over the data points and propagates labels from labeled data over the graph. The method must assume a graph structure and edge distances on the input feature space without the ability to adapt the space to the assumptions. Label propagation is also subject to memory constraints since it forms a kernel matrix over all data points, requiring quadratic space in general, although sparser graph structures can reduce this to a linear scaling.
Co-training regressors trains two kNN regressors with different distance metrics, which label each other's unlabeled data. This works when neighbors in the given input space are meaningful, but it cannot adapt the space. As a fully nonparametric method, inference requires retaining the full dataset.
Much of the previous work in semi-supervised learning addresses classification, and the assumptions do not generally translate to regression. Our experiments show that SSDKL outperforms other adapted semi-supervised methods on a battery of regression tasks.
6 Conclusions
Many important problems are challenging because of the limited availability of training data, making the ability to learn from unlabeled data critical. In experiments with UCI datasets and a realworld poverty prediction task, we find that minimizing posterior variance can be an effective way to incorporate unlabeled data when labeled training data is scarce. SSDKL models are naturally suited for many realworld problems, as spatial and temporal structure can be explicitly modeled through the composition of kernel functions. While our focus is on regression problems, we believe the SSDKL framework is equally applicable to classification problems—we leave this to future work.
Appendix A Appendix
A.1 Posterior regularization
Proof of Theorem 1.
Let $\mathcal{D}$ be a collection of observed data and $X$ the observed input data points. As in [8], we assume that the space of models is a complete separable metric space and that the prior is an absolutely continuous probability measure (with respect to a background measure) on the Borel $\sigma$-algebra, such that a density exists. Let $\mathcal{W}$ be a space of parameters to the model, where we treat $w \in \mathcal{W}$ as random variables. With regards to the notation in the RegBayes framework, the model $\mathcal{M}$ is the pair $(f, w)$. We assume as in [8] that the likelihood distribution is dominated by a finite measure for all models with positive density, such that a likelihood density $p(\mathcal{D} \mid f, w)$ exists.
We would like to compute the posterior distribution
$$p(f, w \mid \mathcal{D}) = \frac{p(f, w)\, p(\mathcal{D} \mid f, w)}{p(\mathcal{D})},$$
which involves an intractable integral in the normalizing constant $p(\mathcal{D})$. We introduce a variational distribution $q(f, w)$ which approximates $p(f, w \mid \mathcal{D})$, where $q$ belongs to a family $\mathcal{P}$ of approximating distributions such that $q(w)$ is restricted to be a Dirac delta centered on some $w^*$. We claim that the exact optimal solution of the following optimization problem, over unrestricted $q$, is precisely the Bayesian posterior $p(f, w \mid \mathcal{D})$:
$$\min_{q(f, w)} \; \mathrm{KL}\left(q(f, w) \,\|\, p(f, w)\right) - \mathbb{E}_{q(f, w)}\left[\log p(\mathcal{D} \mid f, w)\right].$$
We note that adding the constant $\log p(\mathcal{D})$ to the objective gives
$$\mathrm{KL}\left(q(f, w) \,\|\, p(f, w)\right) - \mathbb{E}_{q}\left[\log p(\mathcal{D} \mid f, w)\right] + \log p(\mathcal{D}) = \mathrm{KL}\left(q(f, w) \,\|\, p(f, w \mid \mathcal{D})\right),$$
which is minimized exactly at the Bayesian posterior, so the claim holds, and we see that the objective is equivalent to the first term of the RegBayes objective (Section 2.3). When we restrict $q(f, w) = q(f \mid w)\,\delta_{w^*}(w)$, the objective decomposes, and terms that do not vary with $q(f \mid w^*)$ can be removed from the inner optimization. For every $w^*$, the optimizing value is $q(f \mid w^*) = p(f \mid \mathcal{D}, w^*)$, which is the Bayesian posterior given the model parameters. Substituting this optimal value back in, the remaining optimization problem over $w^*$ reflects maximizing the likelihood $p(\mathcal{D} \mid w^*)$ of the data.
The regularization term in the RegBayes framework is expressed variationally as
$$\Omega(q) = \inf_{\xi} \left\{ U(\xi) : q \in \mathcal{P}(\xi) \right\},$$
where $\xi$ are slack variables, $U$ is a penalty function, and $\mathcal{P}(\xi)$ is a subspace of feasible distributions satisfying specified constraints. An equivalent formulation of the RegBayes problem is then
$$\min_{q, \xi} \; \mathrm{KL}\left(q(f, w) \,\|\, p(f, w)\right) - \mathbb{E}_{q}\left[\log p(\mathcal{D} \mid f, w)\right] + U(\xi) \quad \text{s.t.} \quad q \in \mathcal{P}(\xi). \quad (15)$$
Let the regularization function penalize the total posterior predictive variance at the unlabeled points, with $q$ restricted to the family of distributions
$$\mathcal{P} = \left\{ q : q(f, w) = p(f \mid \mathcal{D}, w^*)\,\delta_{w^*}(w) \right\},$$
where $p(f \mid \mathcal{D}, w^*)$ is the Bayesian posterior from the unregularized objective. Given $w^*$, the optimal $q$ is fixed, so the regularization function reduces to
$$\Omega(q) = \frac{\alpha}{m} \sum_{x \in X_U} \mathrm{Var}\left(f(x) \mid \mathcal{D}, w^*\right).$$
Note that the regularization function depends only on $w^*$. Therefore the optimal post-data posterior in the regularized objective is still of the form $p(f \mid \mathcal{D}, w^*)\,\delta_{w^*}(w)$, and is modified by the regularization function only through $w^*$. Thus, augmenting the unregularized objective with $\Omega$ and using the optimal post-data posterior, the regularized optimization objective recovers the semi-supervised variance minimization objective (5). ∎
A.2 Virtual Adversarial Training
Virtual adversarial training (VAT) is a general training mechanism which enforces local distributional smoothness (LDS) by optimizing the model to be less sensitive to adversarial perturbations of the input [9]. The VAT objective augments the marginal likelihood with an LDS objective:
$$\mathrm{LDS}(x, \theta) = -\mathrm{KL}\left(p(y \mid x, \theta) \,\|\, p(y \mid x + r_{\mathrm{adv}}, \theta)\right),$$
where
$$r_{\mathrm{adv}} = \arg\max_{r :\, \|r\|_2 \le \epsilon} \mathrm{KL}\left(p(y \mid x, \theta) \,\|\, p(y \mid x + r, \theta)\right).$$
Note that the LDS objective does not require labels, so unlabeled data can be incorporated. The experiments in the original paper are for classification, although VAT is general. We use VAT for regression by choosing $p(y \mid x, \theta) = \mathcal{N}(h_\theta(x), \sigma^2)$, where $h_\theta$ is a parameterized mapping (a neural network) and $\sigma$ is fixed. Optimizing the likelihood term is then equivalent to minimizing the squared error, and the LDS term is the KL divergence between the model's Gaussian distribution and a perturbed Gaussian distribution, which is also in the form of a squared difference. The adversarial perturbation is calculated with a second-order Taylor approximation at each step using a dominant-eigenvector calculation on the Hessian. The eigenvector calculation is done via a finite-difference approximation to the power iteration method. As in [9], one step of the finite-difference approximation was used in all of our experiments.
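One finite-difference power-iteration step for this regression setting can be sketched as follows (our adaptation, with hypothetical `f` and `grad_f` callables; for a fixed-variance Gaussian output, the KL term reduces to a squared difference whose Hessian at the origin is the outer product of the gradient of `f`, so the adversarial direction aligns with that gradient):

```python
import numpy as np

def vat_direction(f, grad_f, x, xi=1e-6):
    """One finite-difference power-iteration step approximating the dominant
    eigenvector of the Hessian of the squared-error LDS term at x."""
    d = np.random.randn(*x.shape)
    d /= np.linalg.norm(d)
    # H d is approximated by the gradient of 0.5 * (f(x) - f(x + r))^2
    # with respect to r, evaluated at r = xi * d and divided by xi.
    g = (f(x + xi * d) - f(x)) * grad_f(x + xi * d) / xi
    return g / np.linalg.norm(g)
```

For a linear model the returned unit vector is (up to sign) the weight direction, as expected from the outer-product Hessian.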
A.3 Training details
In our reported results, we use the standard squared exponential or radial basis function (RBF) kernel,
$$k(x, x') = \sigma_f^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right),$$
where $\sigma_f^2$ and $\ell$ represent the signal variance and characteristic length scale. We also experimented with polynomial kernels of various degrees, but found that performance generally decreased. To enforce positivity constraints on the kernel parameters and positive definiteness of the covariance, we train these parameters in the log domain. Although the information capacity of a nonparametric model increases with the dataset size, the marginal likelihood automatically constrains model complexity without additional regularization [13]. The parametric neural networks are regularized with L2 weight decay to reduce overfitting, and models are implemented and trained in TensorFlow using the ADAM optimizer [34, 35].
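The log-domain parameterization can be sketched as follows (our illustration): the optimizer is free to move the log-parameters anywhere on the real line, yet the kernel hyperparameters stay strictly positive after exponentiation.

```python
import numpy as np

def rbf_from_log_params(A, B, log_signal, log_length):
    """RBF kernel with parameters stored in the log domain, guaranteeing a
    positive signal variance and length scale for any real-valued inputs."""
    signal_var = np.exp(2.0 * log_signal)   # sigma_f^2 > 0 always
    lengthscale = np.exp(log_length)        # l > 0 always
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * d / lengthscale**2)
```

Gradients with respect to `log_signal` and `log_length` are well defined everywhere, so no projection or clipping step is needed during training.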
A.4 Poverty prediction
High-resolution satellite imagery offers the potential for cheap, scalable, and accurate tracking of changing socioeconomic indicators. The United Nations has set 17 Sustainable Development Goals (SDGs) for the year 2030; the first of these is the worldwide elimination of extreme poverty, but a lack of reliable data makes it difficult to distribute aid and target interventions effectively. Traditional data collection methods such as large-scale household surveys or censuses are slow and expensive, requiring years to complete and costing billions of dollars [36]. Because data on the outputs that we care about are scarce, it is difficult to train models on satellite imagery using traditional supervised methods. In this task, we attempt to predict local poverty measures from satellite images using limited amounts of poverty labels. As described in [30], the dataset consists of villages across five African countries: Nigeria, Tanzania, Uganda, Malawi, and Rwanda. These countries include some of the poorest in the world (Malawi, Rwanda) as well as regions of Africa that are relatively better off (Nigeria), making for a challenging and realistically diverse problem. The raw satellite inputs consist of RGB satellite images downloaded from Google Static Maps at zoom level 16. The target variable that we attempt to predict is a wealth index provided in the publicly available Demographic and Health Surveys (DHS) [37, 38].
References
 [1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [2] Xiaojin Zhu and Andrew B Goldberg. Introduction to semisupervised learning. Synthesis lectures on artificial intelligence and machine learning, 3(1):1–130, 2009.
 [3] Yves Grandvalet and Yoshua Bengio. Semisupervised learning by entropy minimization. In Advances in neural information processing systems, pages 529–536, 2004.
 [4] Olivier Chapelle and Alexander Zien. Semisupervised classification by low density separation. In AISTATS, pages 57–64, 2005.
 [5] Aarti Singh, Robert Nowak, and Xiaojin Zhu. Unlabeled data: Now it helps, now it doesn’t. In Advances in neural information processing systems, pages 1513–1520, 2009.
 [6] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. The Journal of Machine Learning Research, 2015.
 [7] Kuzman Ganchev, Jennifer Gillenwater, Ben Taskar, et al. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11(Jul):2001–2049, 2010.
 [8] Jun Zhu, Ning Chen, and Eric P Xing. Bayesian inference with posterior regularization and applications to infinite latent SVMs. Journal of Machine Learning Research, 15(1):1799–1847, 2014.
 [9] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing with virtual adversarial training. arXiv preprint arXiv:1507.00677, 2015.
 [10] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1195–1204. Curran Associates, Inc., 2017.
 [11] Andrew Arnold, Ramesh Nallapati, and William W. Cohen. A comparative study of methods for transductive transfer learning. In Proc. Seventh IEEE Int'l Conf. Data Mining Workshops, 2007.
 [12] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 370–378, 2016.
 [13] Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning. The MIT Press, 2006.
 [14] Andrew Gordon Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015, pages 1775–1784, 2015.
 [15] Andrew G Wilson, Zhiting Hu, Ruslan R Salakhutdinov, and Eric P Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pages 2586–2594, 2016.
 [16] Maruan Al-Shedivat, Andrew Gordon Wilson, Yunus Saatchi, Zhiting Hu, and Eric P Xing. Learning scalable deep kernels with recurrent structure. arXiv preprint arXiv:1610.08936, 2016.
 [17] M. Lichman. UCI machine learning repository, 2013.
 [18] Michael Xie, Neal Jean, Marshall Burke, David Lobell, and Stefano Ermon. Transfer learning from deep features for remote sensing and poverty mapping. AAAI Conference on Artificial Intelligence, 2016.
 [19] Zhi-Hua Zhou and Ming Li. Semi-supervised regression with co-training. In IJCAI, volume 5, pages 908–913, 2005.
 [20] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory, pages 92–100. ACM, 1998.
 [21] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations (ICLR), 2017.
 [22] Augustus Odena, Avital Oliver, Colin Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of semi-supervised learning algorithms. 2018.
 [23] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. arXiv preprint arXiv:1704.03976, 2017.
 [24] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
 [25] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Carnegie Mellon University, 2002.
 [26] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [27] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 [28] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.
 [29] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3546–3554, 2015.
 [30] Neal Jean, Marshall Burke, Michael Xie, W Matthew Davis, David B Lobell, and Stefano Ermon. Combining satellite imagery and machine learning to predict poverty. Science, 353(6301):790–794, 2016.
 [31] Andreas C. Damianou and Neil D. Lawrence. Deep Gaussian processes. The Journal of Machine Learning Research, 2013.
 [32] Kai Yu, Jinbo Bi, and Volker Tresp. Active learning via transductive experimental design. In International Conference on Machine Learning (ICML), 2006.
 [33] Chenyang Zhao and Shaodan Zhai. Minimum variance semi-supervised boosting for multi-label classification. In 2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 1342–1346. IEEE, 2015.
 [34] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
 [35] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, 2015.
 [36] Morten Jerven. Poor numbers: how we are misled by African development statistics and what to do about it. Cornell University Press, 2013.
 [37] ICF International. Demographic and health surveys (various) [datasets], 2015.
 [38] Deon Filmer and Lant H Pritchett. Estimating wealth effects without expenditure data—or tears: An application to educational enrollments in states of India. Demography, 38(1):115–132, 2001.