The Effectiveness of Variational Autoencoders for Active Learning

Farhad Pourkamali-Anaraki
Department of Computer Science, University of Massachusetts Lowell, MA, USA

Michael B. Wakin
Department of Electrical Engineering, Colorado School of Mines, CO, USA
Abstract

The high cost of acquiring labels is one of the main challenges in deploying supervised machine learning algorithms. Active learning is a promising approach to control the learning process and address the difficulties of data labeling by selecting, from a large pool of unlabeled instances, which training examples to label. In this paper, we propose a new data-driven approach to active learning that chooses a small set of labeled data points that are both informative and representative. To this end, we present an efficient geometric technique to select a diverse core-set in a low-dimensional latent space obtained by training a Variational Autoencoder (VAE). Our experiments demonstrate an improvement in accuracy over two related techniques and, more importantly, highlight the representational power of generative modeling for developing new active learning methods in high-dimensional data settings.

1 Introduction

In many machine learning applications such as medical diagnosis, a major challenge is to collect sufficient amounts of supervised training data because the labeling process tends to require domain expertise and immense amounts of computational/experimental resources. A promising approach to overcome this problem is active learning [2, 23, 5, 8], which focuses on practical ways to choose a small subset of the data for labeling to train accurate predictive models. In the pool-based active learning setting, we have access to a large set of unlabeled instances, as data collection is often straightforward. The goal is then to query a small number of labels from an oracle, e.g., a human annotator. A recent work defined active learning as a core-set selection problem to improve the performance of existing methods [22]. Core-set construction has been a promising technique for large-scale learning such as classification and clustering [15, 12, 20].

From a geometric perspective, the idea behind core-sets in unsupervised settings is to find diverse subsets that best cover the entire data. Thus, in the context of active learning with a limited labeling budget, it is reasonable to query labels for only representative examples to train supervised models. However, most existing works, including [22], attempt to find such set covers in the original input space.

A notable drawback of prior work is that geometric methods often perform poorly on high-dimensional data sets, such as image data, because similarity measures such as the Euclidean distance become less informative in high dimensions [14]. Moreover, covering problems are often NP-hard, and heuristic solvers require multiple initializations. Therefore, solving the related geometric optimization problems in the high-dimensional input space is computationally expensive and impractical, especially for a large number of unlabeled instances.

In this paper, we propose a new approach to active learning by learning a mapping from the observed space to a lower-dimensional latent space. To learn an efficient compression of the input data, we use Variational Autoencoders (VAEs) [7, 21, 11]. VAEs are popular techniques to find complex latent-variable generative models for high-dimensional data sets such as image data. The success of VAEs stems from the representation power of neural networks for approximating complex functions and variational inference for Bayesian models [27].

Our second contribution is to design an efficient and easy-to-implement method for finding core-sets in the latent space of a VAE. The proposed method uses K-means clustering [18] to identify regions of interest with possibly homogeneous labels. Subsequently, the learner trains a supervised model, e.g., a classifier, in the latent space. Due to the generative nature of our framework, we can map data points outside the original pool into the latent space. Therefore, our active learning approach complements existing works focusing on choosing representative examples in the input space and paves the way for future improvements in terms of achieving higher accuracies and computational savings.

We also present experiments on the MNIST database of handwritten digits [9] to verify the performance of our proposed method empirically. The results of an active learning algorithm are typically depicted by a curve measuring the trade-off between the number of labeled points and classification accuracy. Unlike most existing works, we do not assume a fixed hypothesis or model for the classification task because an appropriate classifier is usually not known a priori. Instead, to ensure a fair comparison in our experiments, we allow the active learning method to select the best classifier according to the labeling budget.

The rest of this paper is organized as follows. Section 2 formulates the active learning problem and provides a brief overview of prior art. Section 3 introduces the proposed method. Extensive experiments in Section 4 demonstrate the effectiveness of VAEs for active learning, while Section 5 provides concluding remarks.

2 Relation to Prior Work

Suppose we are given a set of $n$ unlabeled points $\mathcal{U} = \{x_1,\ldots,x_n\} \subset \mathbb{R}^D$ with a label space $\mathcal{Y} = \{1,\ldots,C\}$, corresponding to classification with $C$ classes. We then acquire labels for a small set of $b$ points from $\mathcal{U}$ to train a classifier. Thus, the labeling budget is $b$ in the pool-based active learning framework.

In the classical setting, active learning methods choose a single point from the pool of unlabeled instances at each iteration until requesting labels for a total of $b$ points. To this end, a fixed model actively selects the data points for which it is most uncertain [10, 24, 6, 5, 1]. Intuitively, such a sampling strategy chooses more informative data points than sampling uniformly at random from the unlabeled pool [26]. For example, a binary classifier such as logistic regression selects, at each round, the point whose posterior probability is closest to $0.5$, i.e., has the maximum entropy.
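
To make this entropy-based selection rule concrete, the following sketch (our illustration, not the procedure of any cited paper) scores every pooled point by the entropy of a scikit-learn classifier's predicted class probabilities and returns the most uncertain one; the toy data, the use of logistic regression, and the helper name `entropy_query` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def entropy_query(model, X_pool):
    """Return the index of the pooled point with maximum predictive entropy."""
    proba = model.predict_proba(X_pool)                    # shape (n, C)
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)
    return int(np.argmax(entropy))

# Toy usage: fit on a few labeled points, then query the most uncertain one.
rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(20, 2))
y_labeled = (X_labeled[:, 0] > 0).astype(int)
X_pool = rng.normal(size=(500, 2))

clf = LogisticRegression().fit(X_labeled, y_labeled)
next_idx = entropy_query(clf, X_pool)
```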

Active learning methods using uncertainty information have two main shortcomings. First, for large-scale data, it is impractical to query a single point from $\mathcal{U}$ at each active learning iteration due to the high cost of sampling, e.g., calculating the entropy of the entire unlabeled set. Also, the model has to be retrained after acquiring each label, which can lead to computational and statistical inefficiencies. For example, a single point is likely to have no statistically significant impact on the accuracy of the trained classifier. The second issue is that highly correlated queries across different iterations of active learning reduce the overall efficiency, as a large part of the budget may be spent repeatedly choosing nearby points [25, 17]. This problem is exacerbated in multi-class classification, as standard active learning methods are often biased toward certain regions and miss critical information about the distribution of the unlabeled pool $\mathcal{U}$.

Therefore, there is now consensus on the need for improved active learning methods that take into account the diversity of the points selected for labeling. The authors of [22] proposed a geometric approach that finds a small subset $\mathcal{S} \subset \mathcal{U}$ that best covers the entire unlabeled pool $\mathcal{U}$. To be formal, given a candidate set $\mathcal{S}$ with size less than $b$, the next point to query is chosen as follows:

$$x^{*} \;=\; \arg\max_{x \in \mathcal{U} \setminus \mathcal{S}} \;\; \min_{s \in \mathcal{S}} \; d(x, s), \qquad (1)$$

where $d(\cdot,\cdot)$ is a distance metric in the input space $\mathbb{R}^D$. A critical component of this geometric approach is the choice of $d$. Previous work considered the Euclidean distance for simplicity. However, it remains an open question whether more meaningful metrics can be found for complex high-dimensional data.
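
As an illustration of how the selection rule in Eq. (1) is typically approximated in practice, the sketch below greedily adds, at each step, the pooled point farthest (in Euclidean distance) from the current candidate set. The function name, the random initial point, and the NumPy implementation are our assumptions rather than the exact procedure of [22].

```python
import numpy as np

def k_center_greedy(X, budget, seed=0):
    """Greedy approximation of Eq. (1): repeatedly add the point farthest
    from the current candidate set, using the Euclidean distance."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    selected = [int(rng.integers(n))]              # start from a random point
    # Distance of every point to its nearest selected point.
    min_dist = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(budget - 1):
        next_idx = int(np.argmax(min_dist))        # farthest point = next query
        selected.append(next_idx)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[next_idx], axis=1))
    return selected

# Example: choose 10 representative points from a random pool.
pool = np.random.default_rng(1).normal(size=(1000, 784))
queries = k_center_greedy(pool, budget=10)
```

This greedy farthest-point rule is the standard 2-approximation for the underlying k-center covering objective.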

In this paper, we propose a new approach to tackle this problem, based on learning an informative and low-dimensional representation of the unlabeled data set $\mathcal{U}$ before acquiring labels for $b$ points. The proposed framework has several advantages: it leads to significant improvements in computational and statistical efficiency, and it can be seamlessly integrated into other frameworks that seek diverse subsets of data points for active learning or other purposes.

3 Proposed Method

In this section, we explain why variational autoencoders (VAEs) are effective tools for finding a small set of diverse examples from the unlabeled data set $\mathcal{U}$ in the pool-based active learning framework. To this end, we briefly review the main ideas behind VAEs. Next, we present our proposed method in the latent space obtained by training a VAE on $\mathcal{U}$. While we focus on using the entire unlabeled set in this paper, an interesting future research direction is to train VAEs using only a portion of the unlabeled instances.

VAEs are powerful generative models capable of learning unsupervised latent representations of complex high-dimensional data. In the VAE framework, one approximates the intractable posterior distribution over a set of latent variables, i.e., $p_\theta(z|x)$, with another distribution $q_\phi(z|x)$. The Kullback-Leibler divergence between these two distributions [3], i.e., $D_{\mathrm{KL}}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big)$, is minimized by maximizing a lower bound on the marginal log-likelihood over the data, which takes the following form when the prior $p(z)$ is a predefined distribution such as an isotropic Gaussian:

$$\mathcal{L}(\theta, \phi; x) \;=\; \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right] \;-\; D_{\mathrm{KL}}\!\left(q_\phi(z|x) \,\|\, p(z)\right). \qquad (2)$$

From an information-theoretic perspective, the variables $z_i$, $i = 1,\ldots,n$, are latent representations of the input data points $x_i$. From the neural network viewpoint, VAEs consist of an encoder $q_\phi(z|x)$, a decoder $p_\theta(x|z)$, and a loss function that can be optimized by stochastic gradient descent. Therefore, the encoder network takes as input a data point in $\mathbb{R}^D$ and maps it into a lower-dimensional latent representation in $\mathbb{R}^d$, $d \ll D$. This feature transformation forms the main building block of our proposed active learning method by facilitating the design of effective geometric methods.
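
For concreteness, the following minimal NumPy sketch evaluates the two terms of the bound in Eq. (2) for a single example, assuming a Bernoulli decoder, an isotropic Gaussian prior, and a diagonal Gaussian approximate posterior; the function name and the single-sample approximation of the expectation are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def elbo_terms(x, x_hat, mu, log_var):
    """Illustrative evaluation of Eq. (2) for a single example.
    x, x_hat lie in [0, 1]; mu and log_var are the encoder outputs."""
    eps = 1e-12
    # Expected reconstruction log-likelihood E_q[log p(x|z)], approximated
    # with a single decoded sample x_hat.
    recon = np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))
    # Closed-form KL divergence between N(mu, diag(exp(log_var))) and N(0, I).
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return recon - kl          # the lower bound maximized during training
```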

The proposed geometric approach in the latent space of a VAE is an efficient method capable of providing a diverse set of points that covers the entire data. To be formal, we propose to employ a clustering algorithm, such as K-means, to partition the latent representations $z_1,\ldots,z_n$ into $K$ clusters ($K \leq b$); without loss of generality, we assume that $b/K$ is an integer. This step allows us to capture the underlying structure in the latent space; see [13] for a theoretical discussion. Next, we sample $b/K$ points uniformly at random from each cluster to create a set of $b$ points for labeling. Hence, our approach approximately solves the following optimization problem over a core-set $\mathcal{S} \subset \{z_1,\ldots,z_n\}$ that covers the full data in $\mathbb{R}^d$:

$$\min_{\mathcal{S} \subset \{z_1,\ldots,z_n\}:\, |\mathcal{S}| = b} \;\; \max_{i \in \{1,\ldots,n\}} \;\; \min_{z \in \mathcal{S}} \; \| z_i - z \|_2. \qquad (3)$$

That is, our proposed method aims to find $b$ data points in the latent space such that the largest Euclidean distance between any point and its nearest representative in the core-set is minimized. As a result, learning a latent representation of the input data has two benefits: it leads to significant computational savings for high-dimensional data, and it yields considerable improvements in performance when finding a meaningful metric for measuring distances in the original data space is difficult. Moreover, the introduced active learning method offers an elegant framework for massive data sets with thousands or even millions of points by employing non-uniform sampling methods within each of the $K$ clusters (instead of the uniform sampling currently used). Specifically, the proposed framework allows us to incorporate uncertainty information from a model within each of the partitions to achieve further improvements. However, in this work, our main focus is to design an agnostic active learning algorithm that works for any hypothesis class [4].
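
A minimal sketch of this selection step, assuming scikit-learn's K-means and latent codes `Z` produced by a trained encoder, is given below; the helper name `latent_core_set` and the toy latent codes are hypothetical, and only the overall procedure (cluster, then sample uniformly within each cluster) follows the description above.

```python
import numpy as np
from sklearn.cluster import KMeans

def latent_core_set(Z, budget, n_clusters, seed=0):
    """Cluster the VAE latent codes Z with K-means, then draw
    budget // n_clusters points uniformly at random from each cluster."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(Z)
    per_cluster = budget // n_clusters       # assumed integer, as in the text
    selected = []
    for k in range(n_clusters):
        members = np.flatnonzero(labels == k)
        take = min(per_cluster, members.size)
        selected.extend(rng.choice(members, size=take, replace=False))
    return selected

# Example with random latent codes standing in for the VAE embedding.
Z = np.random.default_rng(2).normal(size=(5000, 2))
query_indices = latent_core_set(Z, budget=30, n_clusters=3)
```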

4 Experimental Results

We demonstrate the effectiveness of VAEs for reducing the number of labeled data points required to reach a specified accuracy on the MNIST data set (loaded from Keras). To this end, we first present a visualization of the latent space corresponding to three classes and show the advantages of our proposed framework over existing techniques. The second set of experiments uses four and five classes from the MNIST data set to further show that our framework is appropriate for multi-class classification problems.

Classification with three classes: In this experiment, we consider the problem of classifying three digits, i.e., 0, 3, and 9. The data set contains the corresponding training and test images, each of size $28 \times 28$ ($D = 784$). Test images are used only for reporting the classification accuracy and not for training VAEs or classifiers.

Figure 1: Active learning with three classes from MNIST. (a) Visualization of the latent space obtained via the VAE. (b) Uniform sampling in the input space vs. the latent space. (c) Geometric sampling in the input space vs. the latent space.

We use the unlabeled training data to learn a low-dimensional latent space $\mathbb{R}^d$. To train a VAE on the image data, the encoder network consists of four convolutional layers [16] with square convolution windows (equal height and width). We use ReLU activation functions and the standard RMSProp optimizer. As a result, we map the $784$-dimensional input space into $\mathbb{R}^d$ with $d \ll 784$. Before reporting active learning results, we plot the latent space for the training data in Fig. 1(a), where we color-code the embedding by digit class for visualization purposes (although those labels were not used for training). As we see, the learned mapping provides an informative representation of the data according to their known digit class.
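
The sketch below outlines a convolutional VAE of this general shape in Keras (TensorFlow). The latent dimension, window size, filter counts, and decoder architecture are not fully specified in the text, so the values used here are assumptions for illustration; only the overall structure (four convolutional encoder layers, ReLU activations, RMSProp) follows the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

latent_dim = 2                 # assumed; the exact value of d is not stated
window = 3                     # assumed square convolution window size
filters = (32, 32, 64, 64)     # assumed filter counts (not given in the text)

class Sampling(layers.Layer):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    def call(self, inputs):
        mu, log_var = inputs
        eps = tf.random.normal(shape=tf.shape(mu))
        return mu + tf.exp(0.5 * log_var) * eps

# Encoder: four convolutional layers with ReLU activations.
enc_in = layers.Input(shape=(28, 28, 1))
h = enc_in
for f in filters:
    h = layers.Conv2D(f, window, padding="same", activation="relu")(h)
h = layers.Flatten()(h)
mu = layers.Dense(latent_dim)(h)
log_var = layers.Dense(latent_dim)(h)
z = Sampling()([mu, log_var])
encoder = Model(enc_in, [mu, log_var, z], name="encoder")

# Decoder: a small dense/deconvolution stack mapping z back to image space.
dec_in = layers.Input(shape=(latent_dim,))
h = layers.Dense(7 * 7 * 64, activation="relu")(dec_in)
h = layers.Reshape((7, 7, 64))(h)
h = layers.Conv2DTranspose(64, window, strides=2, padding="same", activation="relu")(h)
h = layers.Conv2DTranspose(32, window, strides=2, padding="same", activation="relu")(h)
dec_out = layers.Conv2DTranspose(1, window, padding="same", activation="sigmoid")(h)
decoder = Model(dec_in, dec_out, name="decoder")

class VAE(Model):
    def __init__(self, encoder, decoder, **kwargs):
        super().__init__(**kwargs)
        self.encoder, self.decoder = encoder, decoder

    def train_step(self, data):
        with tf.GradientTape() as tape:
            mu, log_var, z = self.encoder(data)
            recon = self.decoder(z)
            # Reconstruction term of Eq. (2) with a Bernoulli decoder.
            recon_loss = tf.reduce_mean(tf.reduce_sum(
                tf.keras.losses.binary_crossentropy(data, recon), axis=(1, 2)))
            # Closed-form KL term against the isotropic Gaussian prior.
            kl_loss = tf.reduce_mean(tf.reduce_sum(
                -0.5 * (1 + log_var - tf.square(mu) - tf.exp(log_var)), axis=1))
            loss = recon_loss + kl_loss
        grads = tape.gradient(loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        return {"loss": loss}

vae = VAE(encoder, decoder)
vae.compile(optimizer="rmsprop")        # RMSProp, as in the text
# vae.fit(x_unlabeled, epochs=30, batch_size=128)
```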

To exhibit the efficacy of VAEs in this example, we also perform K-means clustering in both the input space and the latent space, using the same number of clusters in each. While our main goal in this paper is to train supervised models, clustering accuracy is a sensible measure to confirm that the obtained latent representation is indeed able to extract the underlying structure of the data. Normalized mutual information (NMI) [19] is a popular clustering quality metric that ranges from 0 to 1, with larger values indicating higher-quality clustering. In this experiment, clustering in the latent space attains a noticeably higher NMI than clustering in the input space.
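
For reference, the NMI between K-means assignments and the ground-truth digit labels can be computed with scikit-learn as sketched below; the helper name and the commented-out comparison (with hypothetical arrays `x_train`, `z_train`, `y_train`) are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_nmi(X, y_true, n_clusters, seed=0):
    """NMI between K-means cluster assignments on X and the true labels;
    values closer to 1 indicate higher-quality clusters."""
    pred = KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=seed).fit_predict(X)
    return normalized_mutual_info_score(y_true, pred)

# Hypothetical comparison between raw pixels and the VAE latent codes:
# nmi_input  = clustering_nmi(x_train.reshape(len(x_train), -1), y_train, 3)
# nmi_latent = clustering_nmi(z_train, y_train, 3)
```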

In the active learning setting, we do not have access to all labels. Hence, we choose varying numbers of labeled points, and we then train a support vector classifier (SVC) on the labeled data (the SVC is loaded from scikit-learn and supports multi-class classification). The classifier is optimized by cross-validation using GridSearchCV and chooses between radial basis function and linear kernels.
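
A minimal sketch of this model-selection step with scikit-learn is shown below; the grid of regularization values `C` and the number of folds are assumptions, since the text only specifies cross-validation over radial basis function and linear kernels.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def fit_classifier(X_labeled, y_labeled, n_folds=3):
    """Cross-validated SVC that selects between RBF and linear kernels
    (and a few regularization strengths) on the current labeled set."""
    param_grid = {"kernel": ["rbf", "linear"], "C": [0.1, 1.0, 10.0]}
    search = GridSearchCV(SVC(), param_grid, cv=n_folds)
    search.fit(X_labeled, y_labeled)
    return search.best_estimator_

# Hypothetical usage with the latent codes and queried indices from above:
# clf = fit_classifier(Z[query_indices], y_pool[query_indices])
# accuracy = clf.score(Z_test, y_test)
```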

In Fig. 1(b), we report the mean and standard deviation of classification accuracy over independent trials as the number of labeled points varies. Here, we omit the geometric techniques from this comparison to demonstrate the improvement achieved by simply using a better data representation combined with uniform sampling. We observe that uniform sampling in the latent space consistently outperforms the same sampling method in the input space. For example, the averaged classification accuracy in the latent space reaches a given level with substantially fewer labeled points than are required without transforming the input data. Furthermore, uniform sampling in the latent space yields a greater reduction in the standard deviation of classification accuracy. Thus, the low-dimensional representation achieved by the VAE is beneficial for reducing the number of labeled points needed to achieve a desired accuracy.

In Fig. 1(c), we report the mean and standard deviation of the core-set sampling methods in both the input space and the latent space, with the number of partitions $K$ chosen as explained in Section 3. We see that our proposed sampling technique in the latent space attains high mean accuracy even when only a small number of labeled points are available, and it continues to outperform the other related techniques in mean accuracy as the number of labeled points grows.

Classification with four and five classes: In this experiment, we consider two problems: classifying four digits (0, 3, 7, and 9) and five digits (0, 2, 3, 7, and 9) from the MNIST data set. We use the same network architecture as in the previous case. However, we increase the dimension of the latent space slightly, using a modestly larger latent dimension $d$ for the four-digit data set and increasing it further for the five-digit data set. Empirically, we observe that growing the latent space dimensionality leads to higher-quality representations as the number of classes increases.

In Fig. 2, the mean and standard deviation of classification accuracy over independent trials are reported for varying numbers of labeled points. In Fig. 2(a) and (b), we set the number of partitions (see Section 3) for the four-class and five-class problems, respectively, choosing a larger value for the case with five classes to make sure that we have a few points from each class to train a classifier and tune the corresponding parameters using cross-validation. Similar to the previous case, our proposed sampling method in the latent space consistently outperforms the other methods in the input space. Therefore, the proposed active learning method offers great potential for reducing the labeling cost required to reach a given accuracy.

Figure 2: Active learning with four and five classes from MNIST. (a) Four classes. (b) Five classes.

5 Conclusion

In this paper, we presented a new approach to active learning based on finding compact representations of high-dimensional data using variational autoencoders. The proposed framework allows us to design efficient sampling strategies for querying labels, leading to cost-effective and accurate supervised models.

References

  • [1] W. Beluch, T. Genewein, A. Nürnberger, and J. Köhler (2018) The power of ensembles for active learning in image classification. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 9368–9377. Cited by: §2.
  • [2] D. Cohn, Z. Ghahramani, and M. Jordan (1996) Active learning with statistical models. Journal of artificial intelligence research 4, pp. 129–145. Cited by: §1.
  • [3] B. Dai, Y. Wang, J. Aston, G. Hua, and D. Wipf (2018) Connections with robust PCA and the role of emergent sparsity in variational autoencoder models. Journal of Machine Learning Research 19 (1), pp. 1573–1614. Cited by: §3.
  • [4] S. Dasgupta, D. Hsu, and C. Monteleoni (2008) A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems, pp. 353–360. Cited by: §3.
  • [5] Y. Gal, R. Islam, and Z. Ghahramani (2017) Deep Bayesian active learning with image data. In International Conference on Machine Learning, pp. 1183–1192. Cited by: §1, §2.
  • [6] A. Joshi, F. Porikli, and N. Papanikolopoulos (2009) Multi-class active learning for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2372–2379. Cited by: §2.
  • [7] D. Kingma and M. Welling (2014) Auto-Encoding Variational Bayes. In International Conference on Learning Representations, Cited by: §1.
  • [8] A. Krishnamurthy, A. Agarwal, T. Huang, H. Daumé, and J. Langford (2019) Active learning for cost-sensitive classification. Journal of Machine Learning Research 20 (65), pp. 1–50. Cited by: §1.
  • [9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §1.
  • [10] D. Lewis and J. Catlett (1994) Heterogeneous uncertainty sampling for supervised learning. In Machine Learning Proceedings, pp. 148–156. Cited by: §2.
  • [11] R. Lopez, J. Regier, M. Jordan, and N. Yosef (2018) Information constraints on auto-encoding variational Bayes. In Advances in Neural Information Processing Systems, pp. 6114–6125. Cited by: §1.
  • [12] M. Lucic, M. Faulkner, A. Krause, and D. Feldman (2017) Training Gaussian mixture models at scale via coresets. Journal of Machine Learning Research 18 (1), pp. 5885–5909. Cited by: §1.
  • [13] M. Meila (2018) How to tell when a clustering is (approximately) correct using convex relaxations. In Advances in Neural Information Processing Systems, pp. 7407–7418. Cited by: §3.
  • [14] E. Min, X. Guo, Q. Liu, G. Zhang, J. Cui, and J. Long (2018) A survey of clustering with deep learning: from the perspective of network architecture. IEEE Access 6, pp. 39501–39514. Cited by: §1.
  • [15] A. Munteanu, C. Schwiegelshohn, C. Sohler, and D. Woodruff (2018) On coresets for logistic regression. In Advances in Neural Information Processing Systems, pp. 6561–6570. Cited by: §1.
  • [16] V. Papyan, Y. Romano, and M. Elad (2017) Convolutional neural networks analyzed via convolutional sparse coding. Journal of Machine Learning Research 18 (1), pp. 2887–2938. Cited by: §4.
  • [17] R. Pinsler, J. Gordon, E. Nalisnick, and J. Hernández-Lobato (2019) Bayesian batch active learning as sparse subset approximation. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [18] F. Pourkamali-Anaraki and S. Becker (2017) Preconditioned data sparsification for big data with applications to PCA and K-means. IEEE Transactions on Information Theory 63 (5), pp. 2954–2974. Cited by: §1.
  • [19] F. Pourkamali-Anaraki and S. Becker (2019) Improved fixed-rank Nyström approximation via QR decomposition: practical and theoretical aspects. Neurocomputing 363, pp. 261–272. Cited by: §4.
  • [20] F. Pourkamali-Anaraki (2019) Large-scale sparse subspace clustering using landmarks. In IEEE International Workshop on Machine Learning for Signal Processing, Cited by: §1.
  • [21] D. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286. Cited by: §1.
  • [22] O. Sener and S. Savarese (2018) Active learning for convolutional neural networks: a core-set approach. In International Conference on Learning Representations, Cited by: §1, §1, §2.
  • [23] B. Settles (2012) Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6 (1), pp. 1–114. Cited by: §1.
  • [24] S. Tong and D. Koller (2001) Support vector machine active learning with applications to text classification. Journal of Machine Learning Research 2, pp. 45–66. Cited by: §2.
  • [25] Y. Yang, Z. Ma, F. Nie, X. Chang, and A. Hauptmann (2015) Multi-class active learning by uncertainty sampling with diversity maximization. International Journal of Computer Vision 113 (2), pp. 113–127. Cited by: §2.
  • [26] C. You, C. Li, D. Robinson, and R. Vidal (2018) Scalable exemplar-based subspace clustering on class-imbalanced data. In European Conference on Computer Vision, pp. 67–83. Cited by: §2.
  • [27] C. Zhang, J. Butepage, H. Kjellstrom, and S. Mandt (2018) Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 2008–2026. Cited by: §1.