Icebreaker: Element-wise Active Information Acquisition with a Bayesian Deep Latent Gaussian Model
Abstract
In this paper we introduce the ice-start problem, i.e., the challenge of deploying machine learning models when little or no training data is initially available, and acquiring each feature element of the data is associated with a cost. This setting is representative of real-world machine learning applications. For instance, in the healthcare domain, when training an AI system to predict patient metrics from lab tests, obtaining every single measurement comes with a high cost. Active learning, where only the label is associated with a cost, does not apply to this problem, because performing all possible lab tests to acquire a new training datum would be costly, as well as unnecessary due to redundancy. We propose Icebreaker, a principled framework to approach the ice-start problem. Icebreaker uses a fully Bayesian Deep Latent Gaussian Model (BELGAM) with a novel inference method. Our proposed method combines recent advances in amortized inference and stochastic gradient MCMC to enable fast and accurate posterior inference. By utilizing BELGAM's ability to fully quantify model uncertainty, we also propose two information acquisition functions, one for imputation and one for active prediction problems. We demonstrate that BELGAM performs significantly better than previous VAE (variational autoencoder) based models when the data set size is small, using both machine learning benchmarks and real-world recommender system and healthcare applications. Moreover, based on BELGAM, Icebreaker further improves performance and demonstrates the ability to use a minimal amount of training data to obtain the highest test-time performance.
1 Introduction
Acquiring information is costly in many real-world applications. For example, to make a correct diagnosis and perform effective treatment, a medical doctor often needs to carry out a sequence of medical tests to gather information. Performing each of these tests is associated with a cost in terms of money, time, and health risks. To this end, an AI system should be able to suggest the information to be acquired in the form of "one measurement (feature) at a time" to enable accurate predictions (diagnoses) for any new users/patients. Recently, test-time active prediction methods, such as EDDI (Efficient Dynamic Discovery of high-value Information) (ma2018eddi), provide a solution for when a sufficient amount of training data is available. Unfortunately, in these scenarios, training data can also be challenging and costly to obtain. For example, new data needs to be collected by taking measurements of currently hospitalized patients with their consent. Ideally, we would like to deploy an AI system such as EDDI when no or only limited training data is available. We call this problem the ice-start problem. A method that actively selects training data elements for such tasks is therefore desired.
The key to addressing the ice-start problem is to have a scalable model that knows what it does not know, i.e., one that quantifies its epistemic uncertainty. This uncertainty can then be used to guide the acquisition of training data, e.g., the model would prefer to acquire unfamiliar but informative feature elements over familiar but uninformative ones. Such an approach can thus reduce the cost of acquiring training data. We refer to this as element-wise training-time active acquisition.
Training-time active acquisition is needed in a great range of applications. Apart from general prediction tasks, such as the healthcare example above, general imputation tasks such as recommender systems also need to address the ice-start problem, for example, when a new shop needs a recommender but no historical customer data is available. Thus, a framework that can handle different types of tasks is desired.
Despite the success of element-wise test-time active prediction (ma2018eddi; lewenberg2017knowing; shim2018joint; zannone2019odin), few works have provided a general and scalable solution to the ice-start problem. Additionally, the existing works (melville2004active; krause2010utility; krumm2019traffic) are commonly limited to specific application scenarios. An element-wise method needs to handle partial observations at any time. More importantly, we need to design new acquisition functions that take the model parameter uncertainty into account.
In this work, we propose Icebreaker, a principled framework to solve the ice-start problem. Icebreaker actively acquires informative feature elements during training and also performs active test-time prediction with only a small amount of training data. To enable Icebreaker, we contribute:

We propose a Bayesian Deep Latent Gaussian Model (BELGAM). Standard training of deep generative models uses point estimates of the parameters, whereas our approach applies a fully Bayesian treatment to the weights. Thus, during training-time acquisition, we can leverage the epistemic uncertainty. (Section 2)

We design a novel partial amortized inference method for BELGAM, named PA-BELGAM. We combine recent advances in amortized inference for the local latent variables with stochastic gradient MCMC for the model parameters, i.e., the weights of the neural network, to ensure high inference accuracy. (Section 2.2)

To complete Icebreaker, we propose two training-time information acquisition functions based on the uncertainties modeled by PA-BELGAM to identify informative elements. One acquisition function is designed for imputation tasks (Section LABEL:sec:impute), and the other for active prediction tasks. (Section LABEL:sec:predict)

We evaluate the proposed PA-BELGAM, as well as the entire Icebreaker approach, on well-used machine learning benchmarks and a real-world healthcare task. The method demonstrates clear improvements over multiple baselines and shows that it can effectively solve the ice-start problem. (Section LABEL:sec:exp)
2 Bayesian Deep Latent Gaussian Model (BELGAM) with Partial Amortized Inference
As discussed before, to enable a generic and scalable solution for the ice-start problem, we first need a flexible model with epistemic uncertainty quantification that can also handle missing values in the data. Here, we propose the Bayesian Deep Latent Gaussian Model (BELGAM) with scalable and accurate approximate inference.
2.1 Bayesian Deep Latent Gaussian Model (BELGAM)
We design a flexible fully Bayesian model which provides model uncertainty quantification. A Bayesian latent variable generative model, as shown in Figure 1, is a common choice, but previous works on such models are typically linear and not flexible enough to model highly complex data. A deep latent Gaussian model, which uses a neural network mapping, is flexible but not fully Bayesian, as the uncertainty of the model itself is ignored. We thus propose the Bayesian Deep Latent Gaussian Model (BELGAM), which uses a Bayesian neural network to generate observations from local latent variables, with global latent weights, as shown in Figure 1. The model is defined as:
p(X, Z, θ) = p(θ) ∏_{n=1}^{N} p(x_n | z_n, θ) p(z_n),   (1)

where X = {x_n}_{n=1}^{N} is the observed data. The goal is to infer the posterior p(θ, Z | X) for both the local latent variables Z = {z_n}_{n=1}^{N} and the global latent weights θ. Given the posterior, we can infer the missing data as in (ma2018eddi). Such a model is generally intractable, and approximate inference is needed (zhang2017advances; li2018approximate). Variational inference (VI) (wainwright2008graphical; zhang2017advances; jordan1999introduction; beal2003variational; li2018approximate) and sampling-based methods (andrieu2003introduction) are two types of approaches used for this task. Sampling-based approaches are known for accurate inference performance and theoretical guarantees.
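To make the generative process in Eq. (1) concrete, the following is a minimal hypothetical sketch, not the authors' implementation: global weights θ of a one-hidden-layer decoder are drawn from a standard Gaussian prior, each local latent z_n from N(0, I), and each observation x_n from a Gaussian centered at the network output. All sizes and the noise scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_belgam(n_data=5, z_dim=2, h_dim=8, x_dim=3, sigma=0.1):
    # Global latent weights theta: one sample from a standard Gaussian prior
    # over the parameters of a one-hidden-layer decoder network.
    theta = {
        "W1": rng.normal(0.0, 1.0, (z_dim, h_dim)),
        "b1": rng.normal(0.0, 1.0, h_dim),
        "W2": rng.normal(0.0, 1.0, (h_dim, x_dim)),
        "b2": rng.normal(0.0, 1.0, x_dim),
    }
    # Local latent variables z_n ~ N(0, I), one per data instance.
    Z = rng.normal(0.0, 1.0, (n_data, z_dim))
    # Bayesian neural network mapping from latents to observation means.
    hidden = np.tanh(Z @ theta["W1"] + theta["b1"])
    mean = hidden @ theta["W2"] + theta["b2"]
    # Observations x_n ~ N(f_theta(z_n), sigma^2 I).
    X = mean + sigma * rng.normal(0.0, 1.0, (n_data, x_dim))
    return X, Z, theta

X, Z, theta = sample_belgam()
print(X.shape)  # (5, 3)
```

Inference then targets the posterior over both θ (shared across the data set) and the per-instance Z, which is what distinguishes BELGAM from a decoder trained with point-estimated weights.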
However, sampling the local latent variables Z is computationally expensive, as the cost scales linearly with the data set size. To best trade off computational cost against inference accuracy, we propose to amortize the inference for Z and keep an accurate sampling-based approach for the global latent weights θ. Specifically, we use preconditioned stochastic gradient Hamiltonian Monte Carlo (SGHMC) (chen2016bridging) (see appendix for details).
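As a rough illustration of the sampling side, the following sketches a single plain (non-preconditioned) SGHMC update on a toy Gaussian target; the paper's preconditioned variant adds an adaptive mass matrix that this sketch omits. The step size, friction, and target are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sghmc_step(theta, momentum, grad_log_post, lr=1e-3, friction=0.05):
    """One plain SGHMC update: friction-damped momentum plus injected
    Gaussian noise, following the gradient of the log posterior."""
    noise = rng.normal(0.0, np.sqrt(2.0 * friction * lr), size=theta.shape)
    momentum = (1.0 - friction) * momentum + lr * grad_log_post(theta) + noise
    theta = theta + momentum
    return theta, momentum

# Toy target: standard Gaussian posterior, so grad log p(theta) = -theta.
theta, momentum = np.ones(4), np.zeros(4)
for _ in range(2000):
    theta, momentum = sghmc_step(theta, momentum, lambda t: -t)
print(theta.shape)  # (4,)
```

In BELGAM, `grad_log_post` would be a minibatch estimate of the gradient of log p(θ) plus the summed decoder log-likelihood terms, which is what makes the method scale to large data sets.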
2.2 Partial Amortized BELGAM
Revisiting amortized inference in the presence of missing data.
Amortized inference (kingma2014auto; Rezende2014StochasticBA) is an efficient extension of variational inference. It was originally proposed for deep latent Gaussian models where only local latent variables need to be inferred. Instead of using an individually parametrized approximation q(z_n) for each data instance x_n, amortized inference uses a deep neural network as a function estimator to compute q_φ(z_n | x_n) using x_n as input, with the parameters φ shared across all data instances. Thus, the estimation of the local latent variables does not scale with the data set size during model training.
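With missing data, the encoder must accept an arbitrary subset of observed features. Below is a hypothetical PointNet-style set encoder in the spirit of the Partial VAE of (ma2018eddi), not the authors' exact architecture: each observed (feature index, value) pair is embedded and the embeddings are summed, so the output q_φ(z | x_observed) is well defined for any observation pattern. All shapes and parameter names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def partial_encoder(x, mask, params):
    """Encode a partially observed instance x (mask[i]=1 if feature i is
    observed) into the mean and log-variance of a Gaussian q(z | x_obs)."""
    # Embed each observed feature as value * learned embedding, then
    # sum-pool so the result is invariant to which subset is observed.
    pooled = (mask[:, None] * x[:, None] * params["E"]).sum(axis=0)
    h = np.tanh(pooled @ params["W"] + params["b"])
    mu, log_var = np.split(h, 2)
    return mu, log_var

x_dim, e_dim, z_dim = 4, 6, 2
params = {
    "E": rng.normal(0.0, 1.0, (x_dim, e_dim)),  # per-feature embeddings
    "W": rng.normal(0.0, 1.0, (e_dim, 2 * z_dim)),
    "b": np.zeros(2 * z_dim),
}
x = np.array([0.5, -1.2, 0.0, 2.0])
mask = np.array([1.0, 1.0, 0.0, 0.0])  # only features 0 and 1 observed
mu, log_var = partial_encoder(x, mask, params)
print(mu.shape)  # (2,)
```

Because unobserved entries are zeroed out by the mask before pooling, the same encoder handles every partially observed instance encountered during element-wise acquisition.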
Footnotes
 Contributed during his internship at Microsoft Research
Now at Google AI, Berlin, Germany (work done while at Microsoft Research)
Correspondence to: Cheng Zhang and Wenbo Gong