Semi-supervised Deep Kernel Learning: Regression with Unlabeled Data by Minimizing Predictive Variance


Neal Jean*,  Sang Michael Xie*,  Stefano Ermon
Department of Computer Science
Stanford University
Stanford, CA 94305
{nealjean, xie, ermon}@cs.stanford.edu
*Equal contribution
Abstract

Large amounts of labeled data are typically required to train deep learning models. For many real-world problems, however, acquiring additional data can be expensive or even impossible. We present semi-supervised deep kernel learning (SSDKL), a semi-supervised regression model based on minimizing predictive variance in the posterior regularization framework. SSDKL combines the hierarchical representation learning of neural networks with the probabilistic modeling capabilities of Gaussian processes. By leveraging unlabeled data, we show improvements on a diverse set of real-world regression tasks over supervised deep kernel learning and semi-supervised methods such as VAT and mean teacher adapted for regression.

 

Preprint. Work in progress.

1 Introduction

The prevailing trend in machine learning is to automatically discover good feature representations through end-to-end optimization of neural networks. However, most success stories have been enabled by vast quantities of labeled data [1]. This need for supervision poses a major challenge when we encounter critical scientific and societal problems where fine-grained labels are difficult to obtain. Accurately measuring the outcomes that we care about—e.g., childhood mortality, environmental damage, or extreme poverty—can be prohibitively expensive. Although these problems have limited data, they often contain underlying structure that can be used for learning; for example, poverty is strongly correlated over both space and time.

Semi-supervised learning approaches offer promise when few labels are available by allowing models to supplement their training with unlabeled data [2]. However, these methods mostly focus on classification tasks and often rely on strong assumptions about the structure of the data (e.g., cluster assumptions, low data density at decision boundaries) that generally do not apply to regression [3, 4, 5].

In this paper, we present semi-supervised deep kernel learning, which addresses the challenge of semi-supervised regression by building on previous work combining the feature learning capabilities of deep neural networks with the ability of Gaussian processes to capture uncertainty [6]. SSDKL incorporates unlabeled training data by minimizing predictive variance in the posterior regularization framework, a flexible way of encoding prior knowledge in Bayesian models [7, 8].

Our main contributions are the following:

  • We introduce semi-supervised deep kernel learning (SSDKL), a regression model that combines the strengths of heavily parameterized deep neural networks and nonparametric Gaussian processes. While the deep Gaussian process kernel induces structure in an embedding space, the model also allows a priori knowledge of structure (i.e., spatial or temporal) in the input features to be naturally incorporated through kernel composition.

  • By formalizing the semi-supervised variance minimization objective in the posterior regularization framework, we unify previous semi-supervised approaches such as minimum entropy and minimum variance regularization under a common framework. To our knowledge, this is the first paper connecting semi-supervised methods to posterior regularization.

  • We demonstrate that SSDKL can use unlabeled data to learn more generalizable features and improve performance on a range of regression tasks, outperforming the supervised deep kernel learning method and semi-supervised methods such as virtual adversarial training (VAT) and mean teacher [9, 10]. In a challenging real-world task of predicting poverty from satellite images, SSDKL outperforms the state-of-the-art on average, and incorporating prior knowledge of spatial structure increases the improvement further.

Figure 1: Depiction of the variance minimization approach behind semi-supervised deep kernel learning (SSDKL). The x-axis represents one dimension of a neural network embedding and the y-axis represents the corresponding output. Left: Without unlabeled data, the model learns an embedding by maximizing the likelihood of labeled data. The black and gray dotted lines show the posterior distribution after conditioning. Right: The embedding learned by SSDKL minimizes the predictive variance of unlabeled data, encouraging unlabeled embeddings to be near labeled embeddings. Observe that the representations of both labeled and unlabeled data are free to change.

2 Preliminaries

We assume a training set of $n$ labeled examples and $m$ unlabeled examples, with instances $x_i \in \mathbb{R}^d$ and labels $y_i \in \mathbb{R}$. Let $(X_L, y_L)$ and $X_U$ refer to the aggregated features and targets, where $X_L \in \mathbb{R}^{n \times d}$, $y_L \in \mathbb{R}^n$, and $X_U \in \mathbb{R}^{m \times d}$. At test time, we are given test examples $X_T$ that we would like to predict.

We will consider both inductive and transductive semi-supervised learning. In inductive semi-supervised learning, the labeled data $(X_L, y_L)$ and unlabeled data $X_U$ are used to learn a function $f: \mathbb{R}^d \to \mathbb{R}$ that generalizes well and is a good predictor on unseen test examples [2]. In transductive semi-supervised learning, the unlabeled examples are exactly the test data that we would like to predict, i.e., $X_U = X_T$ [11]. A transductive learning approach tries to find a function defined on the observed points, with no requirement of generalizing to additional test examples.

Gaussian processes

A Gaussian process (GP) is a collection of random variables, any finite number of which form a multivariate Gaussian distribution. Following the notation of [12], a Gaussian process defines a distribution over functions $f: \mathbb{R}^d \to \mathbb{R}$ from inputs to target values. If

$$f(x) \sim \mathcal{GP}\left(\mu(x),\, k_\gamma(x, x')\right)$$

with mean function $\mu(\cdot)$ and covariance kernel function $k_\gamma(\cdot, \cdot)$ parameterized by $\gamma$, then any collection of function values is jointly Gaussian,

$$f(X) = [f(x_1), \dots, f(x_n)]^\top \sim \mathcal{N}(\mu_X, K_{X,X}),$$

with mean vector and covariance matrix defined by the GP, s.t. $(\mu_X)_i = \mu(x_i)$ and $(K_{X,X})_{ij} = k_\gamma(x_i, x_j)$. In practice, we often assume that observations include i.i.d. Gaussian noise, i.e., $y(x) = f(x) + \varepsilon$ where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, and the covariance function becomes

$$\operatorname{cov}\left(y(x_p), y(x_q)\right) = k_\gamma(x_p, x_q) + \sigma^2 \delta_{pq},$$

where $\delta_{pq}$ is the Kronecker delta. To make predictions at unlabeled points $X_U$, we can compute a Gaussian posterior distribution in closed form by conditioning on the observed data $(X_L, y_L)$. For a more thorough introduction, we refer readers to [13].
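To make the conditioning step concrete, the following minimal NumPy sketch (our own illustration, not code from the paper) computes the posterior predictive mean and variance at unlabeled points under an assumed RBF kernel with placeholder hyperparameter values; the function and parameter names are ours.

    import numpy as np

    def rbf_kernel(A, B, signal_var=1.0, length_scale=1.0):
        # Squared exponential kernel k(x, x') = s^2 exp(-||x - x'||^2 / (2 l^2)).
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return signal_var * np.exp(-0.5 * sq / length_scale**2)

    def gp_posterior(X_L, y_L, X_U, noise_var=0.1):
        # Condition the GP prior on labeled data (X_L, y_L) to get the posterior
        # predictive mean and variance at the unlabeled points X_U.
        K_LL = rbf_kernel(X_L, X_L) + noise_var * np.eye(len(X_L))
        K_UL = rbf_kernel(X_U, X_L)
        K_inv = np.linalg.inv(K_LL)
        mean = K_UL @ K_inv @ y_L
        cov = rbf_kernel(X_U, X_U) - K_UL @ K_inv @ K_UL.T
        return mean, np.diag(cov)  # predictive variances are the diagonal of the covariance

    rng = np.random.RandomState(0)
    X_L, y_L, X_U = rng.randn(20, 2), rng.randn(20), rng.randn(5, 2)
    mean, var = gp_posterior(X_L, y_L, X_U)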

Deep kernel learning

Deep kernel learning (DKL) combines neural networks with GPs by using a neural network embedding as input to a deep kernel [6]. Given input data $x$, a neural network parameterized by $\theta$ is used to extract features $h_\theta(x)$. The outputs are modeled as

$$f(x) \sim \mathcal{GP}\left(\mu(h_\theta(x)),\, k_\gamma(h_\theta(x), h_\theta(x'))\right)$$

for some mean function $\mu(\cdot)$ and base kernel function $k_\gamma(\cdot, \cdot)$ with parameters $\gamma$. Parameters of the deep kernel are learned jointly by minimizing the negative log likelihood of the labeled data [12]:

$$\mathcal{L} = -\log p(y_L \mid X_L) = \tfrac{1}{2} y_L^\top (K_{L,L} + \sigma^2 I)^{-1} y_L + \tfrac{1}{2} \log \left| K_{L,L} + \sigma^2 I \right| + \tfrac{n}{2} \log 2\pi, \qquad (1)$$

where $K_{L,L}$ is the kernel matrix evaluated on the embedded labeled inputs. For Gaussian distributions, the marginal likelihood is a closed-form, differentiable expression, allowing DKL models to be trained via backpropagation.
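As a sketch of how the objective in Eq. (1) is evaluated, the snippet below computes the Gaussian negative log marginal likelihood on features produced by a stand-in embedding function (a random tanh layer, purely illustrative); in actual DKL, both the network weights and the kernel hyperparameters are updated by backpropagating through this quantity.

    import numpy as np

    def rbf_kernel(A, B, signal_var=1.0, length_scale=1.0):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return signal_var * np.exp(-0.5 * sq / length_scale**2)

    def embed(X, W):
        # Stand-in for the neural network h_theta; here a single tanh layer with random weights.
        return np.tanh(X @ W)

    def gp_nll(Z, y, noise_var=0.1):
        # Negative log marginal likelihood of a zero-mean GP on embeddings Z (cf. Eq. (1)).
        n = len(y)
        K = rbf_kernel(Z, Z) + noise_var * np.eye(n)
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * n * np.log(2 * np.pi)

    rng = np.random.RandomState(0)
    X, y = rng.randn(30, 5), rng.randn(30)
    W = rng.randn(5, 2)  # hypothetical network weights
    print(gp_nll(embed(X, W), y))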

Posterior regularization

In probabilistic models, domain knowledge is generally imposed through the specification of priors. These priors, along with the observed data, determine the posterior distribution through the application of Bayes’ rule. However, it can be difficult to encode our knowledge in a Bayesian prior. Posterior regularization offers a more direct and flexible mechanism for controlling the posterior distribution.

Let $\mathcal{D}$ be a collection of observed data. [8] present a regularized optimization formulation called regularized Bayesian inference, or RegBayes. In this framework, the regularized posterior is the solution of the following optimization problem:

$$\min_{q(M) \in \mathcal{P}} \; \mathcal{L}(q(M)) + \Omega(q(M)), \qquad (2)$$

where $\mathcal{L}(q(M))$ is defined as the KL-divergence between the desired post-data posterior $q(M)$ over models $M$ and the standard Bayesian posterior $p(M \mid \mathcal{D})$, i.e., $\mathcal{L}(q(M)) = \mathrm{KL}\left(q(M) \,\|\, p(M \mid \mathcal{D})\right)$, $\mathcal{P}$ is a family of feasible distributions, and $\Omega$ is a regularization functional. The goal is to learn a posterior distribution that is not too far from the standard Bayesian posterior while also fulfilling some requirements imposed by the regularization.

3 Semi-supervised deep kernel learning

We introduce semi-supervised deep kernel learning (SSDKL) for problems where labeled data is limited but unlabeled data is plentiful. To learn from unlabeled data, we observe that a Bayesian approach provides us with a predictive posterior distribution; that is, we are able to quantify predictive uncertainty. Thus, we regularize the posterior by adding an unsupervised loss term that minimizes the predictive variance at unlabeled data points:

$$\min_{\theta, \gamma, \sigma^2} \; \frac{1}{n} \mathcal{L}(X_L, y_L) + \alpha \, \mathcal{U}(X_U), \qquad (3)$$

$$\mathcal{U}(X_U) = \frac{1}{m} \sum_{x \in X_U} \operatorname{Var}\left[f(x)\right], \qquad (4)$$

where $\mathcal{L}$ is the negative log likelihood of Eq. (1), $\operatorname{Var}[f(x)]$ is the posterior predictive variance at $x$ after conditioning on the labeled data, $n$ and $m$ are the numbers of labeled and unlabeled training examples, and $\alpha$ is a hyperparameter controlling the trade-off between supervised and unsupervised components.
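The sketch below is a minimal NumPy rendering of the combined objective in Eqs. (3)-(4), assuming fixed embeddings and an RBF deep kernel: the supervised term is the GP negative log likelihood on labeled points and the unsupervised term is the average posterior predictive variance over unlabeled points, weighted by alpha. The default values and exact normalizations here are illustrative rather than the paper's settings.

    import numpy as np

    def rbf_kernel(A, B, signal_var=1.0, length_scale=1.0):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return signal_var * np.exp(-0.5 * sq / length_scale**2)

    def ssdkl_loss(Z_L, y_L, Z_U, alpha=0.1, noise_var=0.1):
        # Supervised term: GP negative log marginal likelihood on labeled embeddings.
        n = len(Z_L)
        K = rbf_kernel(Z_L, Z_L) + noise_var * np.eye(n)
        L = np.linalg.cholesky(K)
        a = np.linalg.solve(L.T, np.linalg.solve(L, y_L))
        nll = 0.5 * y_L @ a + np.sum(np.log(np.diag(L))) + 0.5 * n * np.log(2 * np.pi)
        # Unsupervised term: mean posterior predictive variance at the unlabeled embeddings.
        v = np.linalg.solve(L, rbf_kernel(Z_L, Z_U))
        pred_var = np.diag(rbf_kernel(Z_U, Z_U)) - np.sum(v**2, axis=0)
        return nll / n + alpha * np.mean(pred_var)

    rng = np.random.RandomState(0)
    Z_L, y_L, Z_U = rng.randn(20, 2), rng.randn(20), rng.randn(50, 2)
    print(ssdkl_loss(Z_L, y_L, Z_U))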

3.1 Variance minimization as posterior regularization

Let $X = X_L \cup X_U$ be the observed input data and $\mathcal{D} = (X_L, y_L)$ be the input data with labels for the labeled data. Let $\mathcal{F}$ denote a space of functions where each $f \in \mathcal{F}$ maps from the inputs to the target values, and let $q(f, \theta)$ be a posterior distribution over the functions, where $\theta$ are parameters of the deep kernel from some parameter space $\Theta$ and $q$ belongs to a family of distributions $\mathcal{Q}$ in which $q(\theta)$ is restricted to be a Dirac delta centered on some $\hat{\theta} \in \Theta$. We assume that a likelihood density $p(y_L \mid f, \theta, X_L)$ exists and let $q(f, \theta) = p(f \mid \theta, \mathcal{D})\, q(\theta)$, where $p(f \mid \theta, \mathcal{D})$ is the Bayesian posterior.

Instead of maximizing the marginal likelihood of the labeled training data in a purely supervised approach, we train our model in a semi-supervised fashion by minimizing the compound objective

$$\min_{q \in \mathcal{Q}} \; -\frac{1}{n}\, \mathbb{E}_{q}\left[\log p(y_L \mid f, \theta, X_L)\right] + \alpha \, \frac{1}{m} \sum_{x \in X_U} \operatorname{Var}_{q}\left[f(x)\right], \qquad (5)$$

where $\alpha$ controls the trade-off between supervised and unsupervised components.

This semi-supervised variance minimization objective is a specific form of posterior regularization in the RegBayes framework. As in [8], we assume that $\mathcal{F}$ is a complete separable metric space and $q$ is an absolutely continuous probability measure (with respect to a background measure) on $\mathcal{F}$ equipped with its Borel $\sigma$-algebra, such that a density for $q$ exists.

Theorem 1.

Given observed data $(X_L, y_L, X_U)$, a suitable space of functions $\mathcal{F}$, and parameter space $\Theta$, the semi-supervised variance minimization problem (5) is equivalent to the RegBayes optimization problem (2), where the regularization functional $\Omega$ is the weighted predictive variance on the unlabeled data and $q$ is restricted to a family of distributions $\mathcal{Q}$ in which $q(\theta)$ is a Dirac delta centered on some $\hat{\theta} \in \Theta$.

We include a formal derivation in Appendix A.1 and give a brief outline here. It can be shown that minimizing the KL-divergence between the variational distribution $q(f, \theta)$ and the Bayesian posterior is equivalent to minimizing the first term of the RegBayes objective in Theorem 1, and the minimizer is precisely the Bayesian posterior $p(f, \theta \mid \mathcal{D})$. When we restrict $q \in \mathcal{Q}$, then for any $\hat{\theta}$ the optimizing distribution takes the form $q(f, \theta) = p(f \mid \hat{\theta}, \mathcal{D})\, \delta_{\hat{\theta}}(\theta)$, where $p(f \mid \hat{\theta}, \mathcal{D})$ is the Bayesian posterior.

In general, the optimal post-data posterior (after regularization) may have a different form than the Bayesian posterior. However, the variance regularizer depends on $q$ only through $\hat{\theta}$. In this case, the optimal post-data posterior in the regularized objective is still of the form $p(f \mid \hat{\theta}, \mathcal{D})\, \delta_{\hat{\theta}}(\theta)$, and is modified by the regularization function only through the choice of $\hat{\theta}$. From here, we can recover the variance minimization objective.

Intuition for variance minimization

By minimizing the combined objective in Eq. (3), we trade off maximizing the likelihood of our observations with minimizing the posterior variance on unlabeled data that we wish to predict. Since the deep kernel parameters are jointly learned, the neural net is encouraged to learn a feature representation in which the unlabeled examples are similar to the labeled examples, thereby reducing the variance on our predictions. If we imagine the labeled data as “supports” for the surface representing the posterior mean, we are optimizing for embeddings where unlabeled data tend to cluster around these labeled supports.

Another interpretation is that the semi-supervised objective is a regularizer that reduces overfitting to labeled data. The model is discouraged from learning features from labeled data that are not also useful for making low-variance predictions at unlabeled data points. In settings where unlabeled data provide additional variation beyond labeled examples, this can improve model generalization.

Training

Semi-supervised deep kernel learning scales well with large amounts of unlabeled data since the unsupervised objective naturally decomposes into a sum over conditionally independent terms. This allows for mini-batch training on unlabeled data with stochastic gradient descent. Since all of the labeled examples are interdependent, computing exact gradients for labeled examples requires full-batch gradient descent on the labeled data. However, previous work on approximate GP gradients using structured or sparse matrices allows for stochastic batched training and can be applied directly in our model, allowing SSDKL to scale with respect to both labeled and unlabeled data [14, 15, 16].
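As a small illustration of this decomposition (not the authors' training code), the sketch below estimates the variance term on a random mini-batch of unlabeled points while keeping the full labeled set in every step; each batch element depends only on the labeled data, so the batch average is an unbiased estimate of the full unsupervised term.

    import numpy as np

    def rbf_kernel(A, B, length_scale=1.0):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-0.5 * sq / length_scale**2)

    def batched_variance_term(Z_L, Z_U_batch, noise_var=0.1):
        # Posterior predictive variance averaged over a mini-batch of unlabeled embeddings.
        K = rbf_kernel(Z_L, Z_L) + noise_var * np.eye(len(Z_L))
        L = np.linalg.cholesky(K)
        v = np.linalg.solve(L, rbf_kernel(Z_L, Z_U_batch))
        return np.mean(np.diag(rbf_kernel(Z_U_batch, Z_U_batch)) - np.sum(v**2, axis=0))

    rng = np.random.RandomState(0)
    Z_L, Z_U = rng.randn(20, 2), rng.randn(10000, 2)
    for step in range(5):  # each step touches only a small batch of the unlabeled pool
        batch = Z_U[rng.choice(len(Z_U), size=128, replace=False)]
        var_term = batched_variance_term(Z_L, batch)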

4 Experiments and results

We apply SSDKL to a variety of real-world regression tasks, beginning with eight datasets from the UCI repository [17]. We also explore the novel and challenging task of predicting local poverty measures from high-resolution satellite imagery [18]. In our reported results, we use the squared exponential or radial basis function kernel. We also experimented with polynomial kernels, but they generally performed worse. Additional training details are provided in Appendix A.3.

4.1 Baselines

We first compare SSDKL to the purely supervised DKL, showing the contribution of unlabeled data. Following previous work, the DKL model is initialized from a NN+GP model that holds the weights of a pre-trained neural network fixed while optimizing the parameters of the GP.

In addition to the supervised DKL method, we compare against semi-supervised methods including co-training, consistency regularization, generative models, and label propagation. Since many methods are originally for semi-supervised classification, we adapt them for regression.

Coreg, or Co-training Regressors, uses two k-nearest neighbor (kNN) regressors, each of which generates labels for the other during the learning process [19]. Unlike traditional co-training, which requires splitting features into sufficient and redundant views, Coreg achieves regressor diversity by using different distance metrics for its two regressors [20].
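For concreteness, here is a simplified sketch of the Coreg idea (our own, with hypothetical function names): two kNN regressors with different distance metrics take turns pseudo-labeling unlabeled points for each other. The confidence criterion below (distance to the labeled set) is a simplification; the published algorithm instead scores candidate points by how much they reduce the other regressor's training error.

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    def coreg_sketch(X_L, y_L, X_U, rounds=10, k=3):
        # Two kNN regressors with different distance metrics pseudo-label data for each other.
        sets = [(X_L.copy(), y_L.copy()), (X_L.copy(), y_L.copy())]
        metrics = ["euclidean", "manhattan"]
        pool = list(range(len(X_U)))
        for _ in range(rounds):
            for j in (0, 1):
                if not pool:
                    return sets
                Xj, yj = sets[j]
                reg = KNeighborsRegressor(n_neighbors=k, metric=metrics[j]).fit(Xj, yj)
                # Simplified confidence criterion: pick the pool point closest to this
                # regressor's labeled set.
                dists = np.min(np.linalg.norm(X_U[pool][:, None] - Xj[None], axis=-1), axis=1)
                i = pool.pop(int(np.argmin(dists)))
                y_hat = reg.predict(X_U[i:i + 1])
                other = 1 - j  # the pseudo-labeled point augments the *other* regressor's set
                sets[other] = (np.vstack([sets[other][0], X_U[i]]),
                               np.append(sets[other][1], y_hat))
        return sets

    rng = np.random.RandomState(0)
    X_L, y_L, X_U = rng.randn(30, 4), rng.randn(30), rng.randn(200, 4)
    (X1, y1), (X2, y2) = coreg_sketch(X_L, y_L, X_U)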

Consistency regularization methods aim to make model outputs invariant to local input perturbations [9, 21, 10]. For semi-supervised classification, [22] found that VAT and mean teacher were the best methods using fair evaluation guidelines. Virtual adversarial training (VAT) via local distributional smoothing (LDS) enforces consistency by training models to be robust to adversarial local input perturbations [9, 23]. Unlike adversarial training [24], the virtual adversarial perturbation is found without labels, making semi-supervised learning possible. We adapt VAT for regression by choosing the output distribution $p(y \mid x) = \mathcal{N}(y; h_\theta(x), \sigma^2)$ for input $x$, where $h_\theta$ is a parameterized mapping and $\sigma$ is fixed. Optimizing the likelihood term is then equivalent to minimizing squared error; the LDS term is the KL-divergence between the model distribution and a perturbed Gaussian (see Appendix A.2). Mean teacher enforces consistency by penalizing deviation from the outputs of a model whose parameters are an exponentially weighted average of the model parameters over SGD iterations [10].

Label propagation defines a graph structure over the data with edges that define the probability for a categorical label to propagate from one data point to another [25]. If we encode this graph in a transition matrix $T$ and let the current class probabilities be $Y$, then the algorithm iteratively propagates $Y \leftarrow TY$, row-normalizes $Y$, clamps the labeled data to their known values, and repeats until convergence. We make the extension to regression by letting $Y$ be real-valued labels. As in [25], we use a fully connected graph and the radial basis kernel for edge weights. The kernel scale hyperparameter is chosen using a validation set.
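A small NumPy sketch of this regression variant, under simplifying assumptions of our own: a fully connected graph with RBF edge weights, the transition matrix row-normalized once up front (real-valued targets are not probabilities, so per-step label renormalization is dropped), unlabeled targets initialized to zero rather than with a kNN regressor, and labeled targets clamped after every propagation step.

    import numpy as np

    def label_prop_regression(X_L, y_L, X_U, gamma=1.0, iters=100):
        # Build a fully connected graph over all points with RBF edge weights.
        X = np.vstack([X_L, X_U])
        sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
        T = np.exp(-gamma * sq)
        T /= T.sum(axis=1, keepdims=True)  # row-normalized transition matrix
        n = len(X_L)
        y = np.concatenate([y_L, np.zeros(len(X_U))])
        for _ in range(iters):
            y = T @ y        # propagate real-valued labels over the graph
            y[:n] = y_L      # clamp labeled points to their known values
        return y[n:]         # predictions for the unlabeled points

    rng = np.random.RandomState(0)
    X_L, y_L, X_U = rng.randn(20, 3), rng.randn(20), rng.randn(100, 3)
    print(label_prop_regression(X_L, y_L, X_U)[:5])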

Generative models such as the variational autoencoder (VAE) have shown promise in semi-supervised classification especially for visual and sequential tasks [26, 27, 28, 29]. We compare against a semi-supervised VAE by first learning an unsupervised embedding of the data and then using the embeddings as input to a supervised multilayer perceptron.

Table 1: Percent reduction in RMSE compared to the baseline supervised deep kernel learning (DKL) model for semi-supervised deep kernel learning (SSDKL), Coreg, label propagation, virtual adversarial training (VAT), mean teacher, and variational autoencoder (VAE) models on the UCI regression datasets Skillcraft, Parkinsons, Elevators, Protein, Blog, CTslice, Buzz, and Electric. Results are averaged across 10 trials for each dataset. Here $N$ is the total number of examples, $d$ is the input feature dimension, and $n$ is the number of labeled training examples. The final row shows the average percent reduction in RMSE achieved by using unlabeled data.

4.2 UCI regression experiments

We evaluate SSDKL on eight regression datasets from the UCI repository. For each dataset, we train on a small number $n$ of labeled examples, retain a held-out test set, and treat the remaining data as unlabeled examples. Following [22], the labeled data is randomly split 90-10 into training and validation samples, giving a realistically small validation set. We use the validation set for hyperparameter search and early stopping. We report test RMSE averaged over 10 trials of random splits to combat the small data sizes. Following [12], we choose a neural network with a similar [$d$-100-50-50-2] architecture and two-dimensional embedding, where $d$ is the input dimension. We reduced the size of the lower layers since our labeled training sets are much smaller, but as in [12], results were not sensitive to these choices. Following [22], we use this same base model for all deep models, including SSDKL, DKL, VAT, mean teacher, and the VAE encoder, in order to make results comparable across methods. Since label propagation creates a kernel matrix over all data points, we limit the number of unlabeled examples for label propagation to a maximum of 20000 due to memory constraints. We initialize labels in label propagation with a kNN regressor to speed up convergence.

SSDKL performs at least as well as DKL across all datasets, and a Wilcoxon signed-rank test shows a statistically significant improvement for at least one labeled training set size on 6 of the 8 datasets. SSDKL gives consistent average RMSE improvements over supervised DKL across labeled training set sizes, superior to the other semi-supervised methods adapted for regression.

Figure 2: Left: Average test RMSE vs. number of labeled examples $n$ for the UCI Parkinsons dataset. SSDKL outperforms supervised DKL, co-training regressors (Coreg), and virtual adversarial training (VAT). Right: SSDKL performance on poverty prediction (Section 4.3) as a function of $\alpha$, which controls the trade-off between labeled and unlabeled objectives. The dotted lines plot the performance of DKL and Coreg. VAT performed worse and is excluded for clarity. All results are averaged over 10 trials, and shading represents one standard deviation.

The same hyperparameters and initializations are used across all UCI datasets for SSDKL, with separate fixed learning rates for the neural network and GP parameters and a common initialization for all GP parameters. In Fig. 2 (right), we study the effect of varying $\alpha$ to trade off between maximizing the likelihood of labeled data and minimizing the variance of unlabeled data. A large $\alpha$ emphasizes minimization of the predictive variance while a small $\alpha$ focuses on fitting labeled data. SSDKL improves on DKL across a wide range of $\alpha$ values, indicating that performance is not overly reliant on the choice of this hyperparameter. Fig. 2 (left) compares SSDKL to purely supervised DKL, Coreg, and VAT as we vary the labeled training set size.

Surprisingly, Coreg outperformed SSDKL on the Blog, CTslice, and Buzz datasets. We found that these datasets happen to be better suited for nearest-neighbor methods such as Coreg. A kNN regressor using only the labeled data outperformed DKL and SSDKL on all three datasets at the smallest labeled training set size, beat DKL on all three at a larger size, and beat SSDKL on two of the three at the largest size. Since the kNN regressor already outperforms SSDKL with only labeled data, it is unsurprising that SSDKL is unable to close the gap on a semi-supervised nearest-neighbor method like Coreg.

Figure 3: Left: Two-dimensional embeddings learned by the supervised deep kernel learning (DKL) model on the Skillcraft dataset using 50 labeled training examples. The colorbar shows the magnitude of the outputs. Right: Embeddings learned by the semi-supervised deep kernel learning (SSDKL) model using the same 50 labeled training examples plus additional unlabeled examples. By using unlabeled data for regularization, SSDKL learns a better representation.

Representation learning

To gain some intuition about how the unlabeled data helps in the learning process, we visualize the neural network embeddings learned by the DKL and SSDKL models on the Skillcraft dataset. In Fig. 3 (left), we first train DKL on 50 labeled training examples and plot the two-dimensional neural network embedding that is learned. In Fig. 3 (right), we train SSDKL on the same 50 labeled training examples along with additional unlabeled data points and plot the resulting embedding. In the left panel, DKL learns a poor embedding: different colors representing different output magnitudes are intermingled. In the right panel, SSDKL is able to use the unlabeled data for regularization, and learns a better representation of the dataset.

Table 2: Percent reduction in RMSE on the poverty prediction task compared to the baseline ridge regression model used in [30], for DKL, SSDKL, and Spatial SSDKL on Malawi, Nigeria, Tanzania, Uganda, and Rwanda. SSDKL and DKL models use only satellite image data; Spatial SSDKL incorporates both location and image data through kernel composition. The final row shows the average RMSE reduction of each model over 10 trials.

4.3 Poverty prediction

High-resolution satellite imagery offers the potential for cheap, scalable, and accurate tracking of changing socioeconomic indicators. In this task, we predict local poverty measures from satellite images using limited amounts of poverty labels. As described in [30], the dataset consists of villages across five African countries: Nigeria, Tanzania, Uganda, Malawi, and Rwanda. These include some of the poorest countries in the world (Malawi and Rwanda) as well as some that are relatively better off (Nigeria), making for a challenging and realistically diverse problem.

In this experiment, we use a limited number of labeled satellite images for training. With such a small dataset, we cannot expect to train a deep convolutional neural network (CNN) from scratch. Instead, we take a transfer learning approach as in [18], extracting 4096-dimensional visual features and using these as input. More details are provided in Appendix A.4.

Incorporating spatial information

In order to highlight the usefulness of kernel composition, we explore extending SSDKL with a spatial kernel. Spatial SSDKL composes two kernels by summing an image feature kernel and a separate location kernel that operates on location coordinates (lat/lon). By treating them separately, it explicitly encodes the knowledge that location coordinates are spatially structured and distinct from image features.
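A brief sketch of the composition, with assumed RBF base kernels and illustrative length scales (our own names and values): one kernel operates on the (e.g., CNN-derived) image features and the other directly on the latitude/longitude coordinates, and their sum forms the GP covariance.

    import numpy as np

    def rbf(A, B, length_scale):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-0.5 * sq / length_scale**2)

    def spatial_image_kernel(feat_a, loc_a, feat_b, loc_b, ls_feat=1.0, ls_loc=1.0):
        # Sum of an image-feature kernel and a location kernel; each operates only on
        # its own block of inputs, encoding that spatial coordinates are distinct features.
        return rbf(feat_a, feat_b, ls_feat) + rbf(loc_a, loc_b, ls_loc)

    rng = np.random.RandomState(0)
    feats = rng.randn(10, 4096)           # e.g., CNN features of satellite images
    locs = rng.uniform(-10, 10, (10, 2))  # latitude / longitude coordinates
    K = spatial_image_kernel(feats, locs, feats, locs)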

As shown in Table 2, all models outperform the baseline state-of-the-art ridge regression model from [30]. Spatial SSDKL significantly outperforms all other models, while a kernel over the concatenated image and location features does not improve over the basic SSDKL model, which uses only image features. Spatial SSDKL outperforms the other models by directly modeling location coordinates as spatial features, showing that kernel composition can effectively incorporate prior knowledge of structure.

5 Related work

[31] introduced deep Gaussian processes, which stack GPs in a hierarchy by modeling the outputs of one layer with a Gaussian process in the next layer. Despite the suggestive name, these models do not integrate deep neural networks and Gaussian processes.

[6] proposed deep kernel learning, combining neural networks with the non-parametric flexibility of GPs and training end-to-end in a fully supervised setting. Extensions have explored approximate inference, stochastic gradient training, and recurrent deep kernels for sequential data [14, 15, 16].

Our method draws inspiration from transductive experimental design, which chooses the most informative points (experiments) to measure by seeking data points that are both hard to predict and informative for the unexplored test data [32]. Similar prediction uncertainty approaches have been explored in semi-supervised classification models, such as minimum entropy and minimum variance regularization, which can now also be understood in the posterior regularization framework [3, 33].

Recent work on generative adversarial networks (GANs) [26], variational autoencoders (VAEs) [27], and other generative models has achieved promising results on various semi-supervised classification tasks [28, 29]. However, we find that these models are not as well suited to generic regression tasks, such as those in the UCI repository, as they are to audio and visual tasks.

Consistency regularization posits that the model’s output should be invariant to reasonable perturbations of the input [9, 21, 10]. Combining adversarial training [24] with consistency regularization, virtual adversarial training uses a label-free regularization term that allows for semi-supervised training [9]. Mean teacher adds a regularization term that penalizes deviation from an exponentially weighted average of the parameters over SGD iterations [10]. For semi-supervised classification, [22] found that VAT and mean teacher were the best methods across a series of fair evaluations.

Label propagation defines a graph structure over the data points and propagates labels from labeled data over the graph. The method must assume a graph structure and edge distances on the input feature space, without the ability to adapt the space to these assumptions. Label propagation is also subject to memory constraints since it forms a kernel matrix over all data points, requiring quadratic space in general, although sparser graph structures can reduce this to linear scaling.

Co-training regressors trains two kNN regressors with different distance metrics, which label each other’s unlabeled data. This works when neighbors in the given input space are meaningful, but the method cannot adapt the space. As a fully non-parametric method, inference requires retaining the full dataset.

Much of the previous work in semi-supervised learning focuses on classification, and its assumptions do not generally translate to regression. Our experiments show that SSDKL outperforms other adapted semi-supervised methods on a battery of regression tasks.

6 Conclusions

Many important problems are challenging because of the limited availability of training data, making the ability to learn from unlabeled data critical. In experiments with UCI datasets and a real-world poverty prediction task, we find that minimizing posterior variance can be an effective way to incorporate unlabeled data when labeled training data is scarce. SSDKL models are naturally suited for many real-world problems, as spatial and temporal structure can be explicitly modeled through the composition of kernel functions. While our focus is on regression problems, we believe the SSDKL framework is equally applicable to classification problems—we leave this to future work.

Appendix A Appendix

A.1 Posterior regularization

Proof of Theorem 1.

Let $\mathcal{D}$ be a collection of observed data and let $X$ be the observed input data points. As in [8], we assume that $\mathcal{F}$ is a complete separable metric space and $q$ is an absolutely continuous probability measure (with respect to a background measure) on $\mathcal{F}$ equipped with its Borel $\sigma$-algebra, such that a density for $q$ exists. Let $\Theta$ be a space of parameters to the model, where we treat $\theta \in \Theta$ as random variables. With regards to the notation in the RegBayes framework, the model $M$ is the pair $(f, \theta)$. We assume, as in [8], that the likelihood distribution is dominated by a $\sigma$-finite measure for all models with positive density, such that a likelihood density exists.

We would like to compute the posterior distribution $p(f, \theta \mid \mathcal{D})$, which involves an intractable integral. We introduce a variational distribution $q(f, \theta)$ which approximates $p(f, \theta \mid \mathcal{D})$, where $\mathcal{Q}$ is a family of approximating distributions such that $q(\theta)$ is restricted to be a Dirac delta centered on some $\hat{\theta}$. We claim that the exact optimal solution of the following optimization problem is precisely the Bayesian posterior $p(f, \theta \mid \mathcal{D})$:

$$\min_{q} \; -\mathbb{E}_{q}\left[\log p(\mathcal{D} \mid f, \theta)\right] + \mathrm{KL}\left(q(f, \theta) \,\|\, p(f, \theta)\right).$$

We note that adding the constant $\log p(\mathcal{D})$ to the objective yields $\mathrm{KL}\left(q(f, \theta) \,\|\, p(f, \theta \mid \mathcal{D})\right)$, so that the claim holds, and we see that the objective is equivalent to the first term of the RegBayes objective (Section 2.3). When we restrict $q \in \mathcal{Q}$,

(6)
(7)
(8)
(9)
(10)
(11)

where in equation (8) we note that a term that does not vary with the optimization variables can be removed from the optimization, and similarly in equation (10) we remove a constant. For every $\hat{\theta}$, the optimizing value of $q(f \mid \theta)$ is the Bayesian posterior $p(f \mid \hat{\theta}, \mathcal{D})$ given the model parameters. Substituting this optimal value into (11),

(12)
(13)
(14)

using the form of the optimal posterior in (13). The resulting optimization problem over $\hat{\theta}$ reflects maximizing the likelihood of the data.

The regularization term in the RegBayes framework is expressed variationally as

$$\Omega(q(M)) = \inf_{\xi} \left\{ U(\xi) : q(M) \in \mathcal{P}(\xi) \right\},$$

where $\xi$ are slack variables, $U(\xi)$ is a penalty function, and $\mathcal{P}(\xi)$ is a subspace of feasible distributions satisfying constraints parameterized by $\xi$. An equivalent formulation of the RegBayes problem is then

$$\min_{q(M),\, \xi} \; \mathcal{L}(q(M)) + U(\xi) \quad \text{s.t.} \quad q(M) \in \mathcal{P}(\xi). \qquad (15)$$

Let the regularization function be

$$\Omega(q(f, \theta)) = \alpha \, \frac{1}{m} \sum_{x \in X_U} \operatorname{Var}_{q}\left[f(x)\right],$$

where $\alpha \geq 0$, and $q$ is restricted to the family of distributions

$$\mathcal{Q} = \left\{ q(f, \theta) = p(f \mid \theta, \mathcal{D})\, \delta_{\hat{\theta}}(\theta) : \hat{\theta} \in \Theta \right\},$$

where $p(f \mid \theta, \mathcal{D})$ is the Bayesian posterior from the unregularized objective. Note that given $\hat{\theta}$, the optimal $q(f \mid \theta)$ is this Bayesian posterior, so the regularization function reduces to the GP predictive variance at the unlabeled points. Note also that the regularization function depends on $q$ only through $\hat{\theta}$. Therefore the optimal post-data posterior in the regularized objective is still of the form $p(f \mid \hat{\theta}, \mathcal{D})\, \delta_{\hat{\theta}}(\theta)$, and is modified by the regularization function only through $\hat{\theta}$.

Thus, augmenting the objective from (11) with this regularizer and using the optimal post-data posterior $q(f, \theta) = p(f \mid \hat{\theta}, \mathcal{D})\, \delta_{\hat{\theta}}(\theta)$, the regularized optimization objective reduces to the semi-supervised variance minimization objective (5), where we use the form of the optimal posterior in the third-to-last equality. ∎

A.2 Virtual Adversarial Training

Virtual adversarial training (VAT) is a general training mechanism which enforces local distributional smoothness (LDS) by optimizing the model to be less sensitive to adversarial perturbations of the input [9]. The VAT objective augments the likelihood with an LDS term:

$$\frac{1}{n} \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) + \lambda \, \frac{1}{n + m} \sum_{x \in X_L \cup X_U} \mathrm{LDS}(x, \theta),$$

where

$$\mathrm{LDS}(x, \theta) = -D_{\mathrm{KL}}\left(p(y \mid x, \theta) \,\|\, p(y \mid x + r_{\mathrm{adv}}, \theta)\right), \qquad r_{\mathrm{adv}} = \arg\max_{\|r\|_2 \leq \epsilon} D_{\mathrm{KL}}\left(p(y \mid x, \theta) \,\|\, p(y \mid x + r, \theta)\right).$$

Note that the LDS objective does not require labels, so unlabeled data can be incorporated. The experiments in the original paper are for classification, although VAT is general. We use VAT for regression by choosing $p(y \mid x) = \mathcal{N}(y; h_\theta(x), \sigma^2)$, where $h_\theta$ is a parameterized mapping (a neural network) and $\sigma$ is fixed. Optimizing the likelihood term is then equivalent to minimizing the squared error, and the LDS term is the KL-divergence between the model's Gaussian distribution and a perturbed Gaussian distribution, which is also in the form of a squared difference. The adversarial perturbation is calculated with a second-order Taylor approximation at each step using a first dominant eigenvector calculation of the Hessian; the eigenvector calculation is done via a finite-difference approximation to the power iteration method. As in [9], one step of the finite-difference approximation was used in all of our experiments.

A.3 Training details

In our reported results, we use the standard squared exponential or radial basis function (RBF) kernel,

$$k(x, x') = \sigma_f^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right),$$

where $\sigma_f^2$ and $\ell$ represent the signal variance and characteristic length scale. We also experimented with polynomial kernels, $k(x, x') = (x^\top x' + c)^d$, but found that performance generally decreased. To enforce positivity constraints on the kernel parameters and positive definiteness of the covariance, we train these parameters in the log domain. Although the information capacity of a non-parametric model increases with the dataset size, the marginal likelihood automatically constrains model complexity without additional regularization [13]. The parametric neural networks are regularized with L2 weight decay to reduce overfitting, and models are implemented and trained in TensorFlow using the ADAM optimizer [34, 35].
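A minimal illustration of the log-domain parameterization described above, with placeholder values rather than the paper's settings: the raw trainable quantities live in the log domain, and exponentiation guarantees positive signal variance, length scale, and noise variance.

    import numpy as np

    # Raw parameters are unconstrained reals; exponentiation enforces positivity.
    log_params = {"log_signal_var": 0.0, "log_length_scale": 0.0, "log_noise_var": -2.3}

    def rbf_from_log_params(A, B, p):
        signal_var = np.exp(p["log_signal_var"])      # always > 0
        length_scale = np.exp(p["log_length_scale"])  # always > 0
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return signal_var * np.exp(-0.5 * sq / length_scale**2)

    rng = np.random.RandomState(0)
    X = rng.randn(8, 2)
    K = rbf_from_log_params(X, X, log_params) + np.exp(log_params["log_noise_var"]) * np.eye(8)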

A.4 Poverty prediction

High-resolution satellite imagery offers the potential for cheap, scalable, and accurate tracking of changing socioeconomic indicators. The United Nations has set 17 Sustainable Development Goals (SDGs) for the year 2030; the first of these is the worldwide elimination of extreme poverty, but a lack of reliable data makes it difficult to distribute aid and target interventions effectively. Traditional data collection methods such as large-scale household surveys or censuses are slow and expensive, requiring years to complete and costing billions of dollars [36]. Because data on the outputs that we care about are scarce, it is difficult to train models on satellite imagery using traditional supervised methods. In this task, we attempt to predict local poverty measures from satellite images using limited amounts of poverty labels. As described in [30], the dataset consists of villages across five African countries: Nigeria, Tanzania, Uganda, Malawi, and Rwanda. These countries include some of the poorest in the world (Malawi, Rwanda) as well as regions of Africa that are relatively better off (Nigeria), making for a challenging and realistically diverse problem. The raw satellite inputs are RGB satellite images downloaded from Google Static Maps at zoom level 16, corresponding to a ground resolution of a few meters per pixel. The target variable that we attempt to predict is a wealth index provided in the publicly available Demographic and Health Surveys (DHS) [37, 38].

References

  • [1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [2] Xiaojin Zhu and Andrew B Goldberg. Introduction to semi-supervised learning. Synthesis lectures on artificial intelligence and machine learning, 3(1):1–130, 2009.
  • [3] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pages 529–536, 2004.
  • [4] Olivier Chapelle and Alexander Zien. Semi-supervised classification by low density separation. In AISTATS, pages 57–64, 2005.
  • [5] Aarti Singh, Robert Nowak, and Xiaojin Zhu. Unlabeled data: Now it helps, now it doesn’t. In Advances in neural information processing systems, pages 1513–1520, 2009.
  • [6] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. The Journal of Machine Learning Research, 2015.
  • [7] Kuzman Ganchev, Jennifer Gillenwater, Ben Taskar, et al. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11(Jul):2001–2049, 2010.
  • [8] Jun Zhu, Ning Chen, and Eric P Xing. Bayesian inference with posterior regularization and applications to infinite latent svms. Journal of Machine Learning Research, 15(1):1799–1847, 2014.
  • [9] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing with virtual adversarial training. arXiv preprint arXiv:1507.00677, 2015.
  • [10] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1195–1204. Curran Associates, Inc., 2017.
  • [11] Andrew Arnold, Ramesh Nallapati, and William W. Cohen. A comparative study of methods for transductive transfer learning. In Proc. Seventh IEEE Int'l Conf. on Data Mining Workshops, 2007.
  • [12] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 370–378, 2016.
  • [13] Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning. The MIT Press, 2006.
  • [14] Andrew Gordon Wilson and Hannes Nickisch. Kernel interpolation for scalable structured gaussian processes (KISS-GP). In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 1775–1784, 2015.
  • [15] Andrew G Wilson, Zhiting Hu, Ruslan R Salakhutdinov, and Eric P Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pages 2586–2594, 2016.
  • [16] Maruan Al-Shedivat, Andrew Gordon Wilson, Yunus Saatchi, Zhiting Hu, and Eric P Xing. Learning scalable deep kernels with recurrent structure. arXiv preprint arXiv:1610.08936, 2016.
  • [17] M. Lichman. UCI machine learning repository, 2013.
  • [18] Michael Xie, Neal Jean, Marshall Burke, David Lobell, and Stefano Ermon. Transfer learning from deep features for remote sensing and poverty mapping. AAAI Conference on Artificial Intelligence, 2016.
  • [19] Zhi-Hua Zhou and Ming Li. Semi-supervised regression with co-training. In IJCAI, volume 5, pages 908–913, 2005.
  • [20] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory, pages 92–100. ACM, 1998.
  • [21] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. ICLR 2017.
  • [22] Augustus Odena, Avital Oliver, Colin Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of semi-supervised learning algorithms. 2018.
  • [23] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. arXiv preprint arXiv:1704.03976, 2017.
  • [24] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • [25] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, 2002.
  • [26] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [27] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [28] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.
  • [29] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3546–3554, 2015.
  • [30] Neal Jean, Marshall Burke, Michael Xie, W Matthew Davis, David B Lobell, and Stefano Ermon. Combining satellite imagery and machine learning to predict poverty. Science, 353(6301):790–794, 2016.
  • [31] Andreas C. Damianou and Neil D. Lawrence. Deep gaussian processes. The Journal of Machine Learning Research, 2013.
  • [32] Kai Yu, Jinbo Bi, and Volker Tresp. Active learning via transductive experimental design. The International Conference on Machine Learning (ICML), 2006.
  • [33] Chenyang Zhao and Shaodan Zhai. Minimum variance semi-supervised boosting for multi-label classification. In 2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 1342–1346. IEEE, 2015.
  • [34] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
  • [35] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 3rd International Conference for Learning Representations, 2015.
  • [36] Morten Jerven. Poor numbers: how we are misled by African development statistics and what to do about it. Cornell University Press, 2013.
  • [37] ICF International. Demographic and health surveys (various) [datasets], 2015.
  • [38] Deon Filmer and Lant H Pritchett. Estimating wealth effects without expenditure data—or tears: An application to educational enrollments in states of India. Demography, 38(1):115–132, 2001.