Icebreaker: Element-wise Active Information Acquisition with Bayesian Deep Latent Gaussian Model
In this paper we introduce the ice-start problem, i.e., the challenge of deploying machine learning models when little or no training data is initially available and acquiring each feature element of the data is associated with a cost. This setting is representative of real-world machine learning applications. For instance, in the health-care domain, when training an AI system to predict patient metrics from lab tests, obtaining every single measurement comes with a high cost. Active learning, where only the label is associated with a cost, does not apply to this problem, because performing all possible lab tests to acquire a new training datum would be costly, as well as unnecessary due to redundancy. We propose Icebreaker, a principled framework for the ice-start problem. Icebreaker uses a fully Bayesian Deep Latent Gaussian Model (BELGAM) with a novel inference method that combines recent advances in amortized inference and stochastic gradient MCMC to enable fast and accurate posterior inference. Utilizing BELGAM's ability to fully quantify model uncertainty, we also propose two information acquisition functions, one for imputation and one for active prediction. We demonstrate that BELGAM performs significantly better than previous VAE (variational autoencoder) based models when the data set is small, on machine learning benchmarks as well as real-world recommender-system and health-care applications. Moreover, building on BELGAM, Icebreaker further improves performance and can reach the highest test-time performance with a minimal amount of training data.
Acquiring information is costly in many real-world applications. For example, to make a correct diagnosis and perform effective treatment, a medical doctor often needs to carry out a sequence of medical tests to gather information. Performing each of these tests is associated with a cost in terms of money, time, and health risks. To this end, an AI system should be able to suggest the information to be acquired in the form of "one measurement (feature) at a time" to enable accurate predictions (diagnoses) for new users/patients. Recently, test-time active prediction methods, such as EDDI (Efficient Dynamic Discovery of high-value Inference) (ma2018eddi), provide a solution when a sufficient amount of training data is available. Unfortunately, in these scenarios, training data can also be challenging and costly to obtain. For example, new data needs to be collected by taking measurements of currently hospitalized patients with their consent. Ideally, we would like to deploy an AI system such as EDDI when no or only limited training data is available. We call this the ice-start problem. A method that actively selects training data elements is therefore desired.
The key to addressing the ice-start problem is a scalable model that knows what it does not know, i.e., one that quantifies its epistemic uncertainty. This uncertainty can then be used to guide the acquisition of training data: the model would prefer to acquire unfamiliar but informative feature elements over familiar but uninformative ones. Such an approach can thus reduce the cost of acquiring training data. We refer to this as element-wise training-time active acquisition.
Training-time active acquisition is needed in a wide range of applications. Apart from general prediction tasks, such as the health-care example above, general imputation tasks such as recommender systems also need to address the ice-start problem; for example, a new shop needs a recommender before any historical customer data is available. Thus, a framework that can handle different types of tasks is desired.
Despite the success of element-wise test-time active prediction (ma2018eddi; lewenberg2017knowing; shim2018joint; zannone2019odin), few works have provided a general and scalable solution to the ice-start problem. Moreover, existing works (melville2004active; krause2010utility; krumm2019traffic) are commonly limited to a specific application scenario. An element-wise method needs to handle partial observations at any time. More importantly, we need to design new acquisition functions that take model parameter uncertainty into account.
In this work, we propose Icebreaker, a principled framework for the ice-start problem. Icebreaker actively acquires informative feature elements during training and also performs active prediction at test time with only a small amount of training data. To enable Icebreaker, we contribute:
We propose the Bayesian Deep Latent Gaussian Model (BELGAM). Standard training of deep generative models uses point estimates of the parameters, whereas our approach applies a fully Bayesian treatment to the weights. Thus, during training-time acquisition, we can leverage the epistemic uncertainty. (Section 2)
We design a novel partial amortized inference method for BELGAM, named PA-BELGAM. It combines recent advances in amortized inference for the local latent variables with stochastic gradient MCMC for the model parameters, i.e., the weights of the neural network, to ensure high inference accuracy. (Section 2.2)
To complete Icebreaker, we propose two training-time information acquisition functions based on the uncertainties modeled by PA-BELGAM to identify informative elements. One acquisition function is designed for imputation tasks (Section LABEL:sec:impute), and the other for active prediction tasks. (Section LABEL:sec:predict)
We evaluate the proposed PA-BELGAM as well as the entire Icebreaker approach on widely used machine learning benchmarks and a real-world health-care task. The method demonstrates clear improvements over multiple baselines and shows that it can be effectively used to solve the ice-start problem. (Section LABEL:sec:exp)
2 Bayesian Deep Latent Gaussian Model (BELGAM) with Partial Amortized Inference
As discussed above, a generic and scalable solution to the ice-start problem first requires a flexible model with epistemic uncertainty quantification that can also handle missing values in the data. Here, we propose the Bayesian Deep Latent Gaussian Model (BELGAM) together with scalable and accurate approximate inference.
2.1 Bayesian Deep Latent Gaussian Model (BELGAM)
We design a flexible fully Bayesian model that provides model uncertainty quantification. A Bayesian latent variable generative model, as shown in Figure 1, is a common choice, but previous models of this kind are typically linear and not flexible enough to model highly complex data. A deep latent Gaussian model, which uses a neural network mapping, is flexible but not fully Bayesian, as the uncertainty of the model itself is ignored. We thus propose the Bayesian Deep Latent Gaussian Model (BELGAM), which uses a Bayesian neural network with global latent weights $\theta$ to generate observations from local latent variables $z_n$, as shown in Figure 1. The model is thus defined as:

$$p(X, Z, \theta) = p(\theta) \prod_{n=1}^{N} p(x_n \mid z_n, \theta)\, p(z_n),$$

where $X = \{x_n\}_{n=1}^{N}$ is the observed data. The goal is to infer the posterior $p(Z, \theta \mid X)$ for both the local latent variables $Z = \{z_n\}_{n=1}^{N}$ and the global latent weights $\theta$. Given the posterior, we can infer the missing data as in (ma2018eddi). Such a model is generally intractable, and approximate inference is needed (zhang2017advances; li2018approximate). Variational inference (VI) (wainwright2008graphical; zhang2017advances; jordan1999introduction; beal2003variational; li2018approximate) and sampling-based methods (andrieu2003introduction) are the two main types of approaches for this task. Sampling-based approaches are known for accurate inference and theoretical guarantees.
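To make the generative process concrete, the following is a minimal NumPy sketch of sampling from a BELGAM-style model: global weights drawn from a Gaussian prior, local latents drawn from a standard normal, and observations produced by a small decoder network. All dimensions, the one-hidden-layer architecture, and the noise scale are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; not taken from the paper.
latent_dim, hidden_dim, data_dim, n = 2, 8, 5, 100

# Global latent weights theta ~ p(theta): a standard-Gaussian prior over
# the weights of a one-hidden-layer decoder network.
W1 = rng.normal(size=(latent_dim, hidden_dim))
W2 = rng.normal(size=(hidden_dim, data_dim))

def decode(Z, W1, W2):
    # Bayesian decoder: maps local latents z_n to the mean of p(x_n | z_n, theta).
    return np.tanh(Z @ W1) @ W2

# Local latent variables z_n ~ N(0, I), one per data point.
Z = rng.normal(size=(n, latent_dim))

# Observations x_n ~ N(decode(z_n), sigma^2 I).
sigma = 0.1
X = decode(Z, W1, W2) + sigma * rng.normal(size=(n, data_dim))
```

Inference then runs this process in reverse: given (possibly partially observed) `X`, recover the posterior over both `Z` and the decoder weights.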
However, sampling the local latent variables $Z$ is computationally expensive, as the cost scales linearly with the data set size. To best trade off computational cost against inference accuracy, we propose to amortize the inference for $Z$ and keep an accurate sampling-based approach for the global latent weights $\theta$. Specifically, we use preconditioned stochastic gradient Hamiltonian Monte Carlo (SGHMC) (chen2016bridging) (see appendix for details).
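The weight-sampling side of this hybrid scheme can be illustrated with a plain (unpreconditioned) SGHMC update; the paper's preconditioner and minibatch gradients are omitted, and the step size, friction, and toy standard-normal target below are arbitrary choices for demonstration.

```python
import numpy as np

def sghmc_step(theta, v, grad_log_post, lr, friction, rng):
    # One simplified SGHMC update (no preconditioning, as a sketch only).
    # grad_log_post(theta) is a (stochastic) gradient of the log posterior.
    noise = np.sqrt(2.0 * friction * lr) * rng.normal(size=theta.shape)
    v = (1.0 - friction) * v + lr * grad_log_post(theta) + noise
    return theta + v, v

# Toy target: a standard-normal "posterior", so grad log p(theta) = -theta.
rng = np.random.default_rng(0)
theta, v = np.array([3.0]), np.zeros(1)
samples = []
for t in range(20000):
    theta, v = sghmc_step(theta, v, lambda th: -th, lr=1e-3, friction=0.05, rng=rng)
    if t > 5000:  # discard burn-in before collecting samples
        samples.append(theta[0])
```

Collecting `samples` after burn-in yields an empirical approximation to the posterior over the weights; in PA-BELGAM these samples are what expose the epistemic uncertainty used by the acquisition functions.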
2.2 Partial Amortized BELGAM
Revisiting amortized inference in the presence of missing data.
Amortized inference (kingma2014auto; Rezende2014StochasticBA) is an efficient extension of variational inference. It was originally proposed for deep latent Gaussian models where only local latent variables need to be inferred. Instead of using an individually parametrized approximation $q(z_n)$ for each data instance $x_n$, amortized inference uses a deep neural network as a function estimator to compute $q(z_n \mid x_n)$ using $x_n$ as input. Thus, the cost of estimating the local latent variables does not scale with the data set size during model training.
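The shared-network idea can be sketched as follows: a single inference network maps any input to the parameters of its approximate posterior, so a batch of data points is encoded in one pass with no per-datum optimization. The architecture and weight names below are illustrative (randomly initialized, not trained), and handling of missing inputs is deliberately left out.

```python
import numpy as np

rng = np.random.default_rng(1)
data_dim, hidden_dim, latent_dim = 5, 16, 2

# Shared inference-network weights (phi); illustrative random initialization.
Wh = 0.1 * rng.normal(size=(data_dim, hidden_dim))
Wmu = 0.1 * rng.normal(size=(hidden_dim, latent_dim))
Wls = 0.1 * rng.normal(size=(hidden_dim, latent_dim))

def encode(X):
    # One shared network maps each x_n to the parameters (mu_n, log sigma_n)
    # of q(z_n | x_n) -- no per-datum variational parameters to optimize.
    H = np.tanh(X @ Wh)
    return H @ Wmu, H @ Wls

X = rng.normal(size=(100, data_dim))
mu, log_sigma = encode(X)  # posterior parameters for all 100 points at once
```

With missing data, the encoder must additionally accept an arbitrary subset of observed features, which is the problem the partial amortization in PA-BELGAM addresses.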
- Contributed during his internship at Microsoft Research
Now at Google AI, Berlin, Germany (contributed while being with Microsoft Research)
Correspondence to: Cheng Zhang and Wenbo Gong