Distribution Aware Active Learning
Abstract
Discriminative learning machines often need a large set of labeled samples for training. Active learning (AL) settings assume that the learner has the freedom to ask an oracle to label its desired samples. Traditional AL algorithms heuristically choose query samples about which the current learner is uncertain. This strategy does not make good use of the structure of the dataset at hand and is prone to being misguided by outliers. To alleviate this problem, we propose to distill the structural information into a probabilistic generative model which acts as a teacher in our model. The active learner uses this information effectively at each cycle of active learning. The proposed method is generic and does not depend on the type of learner and teacher. We then suggest a query criterion for active learning that is aware of the distribution of the data and is more robust against outliers. Our method can be combined readily with several other query criteria for active learning. We provide the formulation and empirically demonstrate our idea on toy and real examples.
Arash Mehrjou Department of Empirical Inference Max Planck Institute for Intelligent Systems arash.mehrjou@tuebingen.mpg.de Mehran Khodabandeh School of Computing Science Simon Fraser University mkhodaba@sfu.ca Greg Mori School of Computing Science Simon Fraser University mori@cs.sfu.ca
Indicates equal contribution. The author ordering was determined by a coin flip over Google Hangouts.
1 Introduction

“A rare pattern can contain an enormous amount of information, provided that it is closely linked to the structure of the problem space.”
—Allen Newell and Herbert A. Simon , Human problem solving, 1972
Active learning algorithms need to choose the most effective unlabeled data points to label in order to improve the current classifier. The core issue in designing an active learning algorithm is deciding on the notion of potential effectiveness of these data points.
At its heart, the intuition is the following. Effective data points are those for which the current classifier is uncertain. However, it is also crucial that these data points are common, in the sense that they come from high-density regions of the input space. Such data points are likely to have a higher impact on future labeling and lead to better generalization. Utilizing this structural information about the input space is important for active learning algorithms to focus labeling resources where they matter.
Merging structural information with AL has been studied before. However, due to the difficulty of capturing the manifold of high-dimensional distributions, approximate unsupervised learning methods such as clustering have been used. Recent advances in generative models [1], especially variational likelihood-based models [2] or those which give a scalar as an unnormalized density [3], have opened up a new window to efficiently and directly take unsupervised information into account.
We take advantage of the recent progress in nonparametric density estimators to inform active learning algorithms about the structural information of the dataset. The structural information of the dataset is learned offline by a density estimator. This information is then combined with the query criterion of a conventional active learning process.
1.1 Contribution
Our proposed method is simple and modular and can be combined with many existing AL algorithms. We show that this approach gives advantages in some current issues of active learning, i.e., 1) robustness against outliers, 2) choosing batches of data at each AL cycle, and 3) biased initial labeled sets. The method is explained in Sec. 3 and each of these motivations is discussed in Sec. 3.2. Each motivation is then supported by empirical results in Sec. 4.
2 Related Work
Active learning (AL) aims to ease the process of learning by interlacing the training of the classifier with data collection. To this end, it cleverly chooses the samples to be labeled by a trusted annotator called the oracle. To ease the presentation, we assume two machines in this learning framework: the learner and the selector. The learner performs the main classification task and the selector chooses which sample to query. In addition, there exists an oracle who provides the true label for each sample requested by the selector. Normally in AL, a small set of labeled samples is given alongside a large set of unlabeled samples. We can describe active learning as a multi-cycle learning procedure. Initially, the learner is trained on the labeled set at the first cycle. At each subsequent cycle, the selector decides which samples from the unlabeled set must be labeled by the oracle. Then the knowledge of the learner is updated with the new set of labeled samples, and the procedure repeats until some stopping criterion is met.
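The multi-cycle procedure described above can be sketched as a short loop. This is only an illustration under stated assumptions: the function names (`active_learning_loop`, `train`, `select`, `oracle`), the one-dimensional data, and the midpoint-threshold learner are all hypothetical stand-ins and are not taken from the paper.

```python
def active_learning_loop(labeled, pool, oracle, train, select, n_cycles):
    """Generic pool-based loop: train the learner, let the selector pick, ask the oracle."""
    labeled = list(labeled)
    pool = list(pool)
    model = train(labeled)
    for _ in range(n_cycles):
        if not pool:
            break
        idx = select(model, pool)        # selector picks a position in the pool
        x = pool.pop(idx)
        labeled.append((x, oracle(x)))   # oracle supplies the true label
        model = train(labeled)           # learner is updated with the new label
    return model, labeled, pool


# Hypothetical 1-D learner: threshold halfway between the two class means.
def train(labeled):
    neg = [x for x, y in labeled if y == 0]
    pos = [x for x, y in labeled if y == 1]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2.0


# Hypothetical selector: query the pool point closest to the current threshold.
def select(threshold, pool):
    return min(range(len(pool)), key=lambda i: abs(pool[i] - threshold))


# Hypothetical oracle: the true class is 1 for positive inputs, 0 otherwise.
def oracle(x):
    return int(x > 0.0)


initial_labeled = [(-4.0, 0), (6.0, 1)]
initial_pool = [-3.0, -1.0, 0.5, 2.0, 5.0]
threshold, labeled, pool = active_learning_loop(
    initial_labeled, initial_pool, oracle, train, select, n_cycles=2
)
```

Passing the learner, selector, and oracle as separate callables mirrors the modular view taken in the text, where the selector treats the learner as a black box.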
There are various strategies in AL based on the inputs and outputs of the selector. Assume the selector implements a function from an input space to an output space. The output space is the same as the feature space of the learner, since the query samples must be meaningful to the oracle. In the query-synthesis AL strategy, the selector synthesizes samples to query. These synthesized samples do not necessarily belong to the unlabeled set.
Recent advances in generative models, especially likelihood-free approaches such as generative adversarial networks [1], have increased the interest in this strategy, which used to be mostly of theoretical interest [4]. Another strategy is so-called selective or stream-based sampling, in which sampling from the data distribution is free: the selector sees one sample at each learning cycle and decides whether to query the sample or discard it [5]. Therefore, the input to the selector function at each cycle is a single sample.
For many real-world tasks, a large set of unlabeled samples can be collected at once. This motivates pool-based active learning, where the input to the selector function is the whole unlabeled set and the selector can choose the best query at each cycle among all unlabeled samples [6]. We build our work upon the pool-based strategy, i.e., we assume the set of unlabeled samples is available beforehand. Once the selector is given a single sample in stream-based sampling or a set of samples in pool-based sampling, it needs a method to choose a sample to query. Assume that, at each cycle of the active learning process, the learner holds an approximation of the conditional distribution of the label given the input. The selector at that cycle chooses the query sample from the unlabeled set by looking at the current learner knowledge embedded in this approximation. The selector may use it in many different ways and achieve different selection strategies.
One popular method, called uncertainty sampling, chooses the sample about which the classifier is least certain. There are several measures of uncertainty in the literature, but the most used one is entropy [7]. Uncertainty sampling has pros and cons. An intuitive interpretation and ease of implementation are among its positive properties. Moreover, it is modular and generic, since it sees the learner as a black box which is only asked by the selector for the confidence score of each sample. However, one major drawback of uncertainty sampling is that the decision of the selector at each cycle depends solely on the learner's current predictive distribution. Since this distribution is learned from only a small set of initially labeled samples, it may induce a significant bias in choosing the queries and consequently update the learner with biased samples. This condition may result in a myopic learner that has a tiny chance of seeing samples from distant regions of the feature space if the initial labeled set does not contain samples from those regions [8]. The general idea of our work is to enhance the vision of the learner to give it a broader view of the feature space.
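As a minimal sketch of entropy-based uncertainty sampling, the snippet below scores each pool sample by the Shannon entropy of its predicted label distribution and queries the maximizer. The predicted probabilities are made-up values standing in for the learner's output, and the function names are hypothetical.

```python
import math


def label_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertainty_sampling(pool_probs):
    """Index of the pool sample whose predicted label distribution has maximum entropy."""
    return max(range(len(pool_probs)), key=lambda i: label_entropy(pool_probs[i]))


# Hypothetical learner outputs for three unlabeled samples (two classes).
pool_probs = [[0.95, 0.05], [0.55, 0.45], [0.80, 0.20]]
query_index = uncertainty_sampling(pool_probs)
```

The selector needs nothing from the learner beyond these confidence scores, which is what makes the criterion modular.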
The idea of using structural information in AL has been studied in previous work. Current work generally uses clustering to encode structural information during active learning, mainly because approximating the data distribution is not tractable. The idea behind [9] is to pre-cluster the data and give more weight to the centroids of clusters as representative samples. In addition, repetitive labeling of samples belonging to the same cluster is suppressed. This is an improvement over previous works [10, 11, 12], which used cluster centroids as the most interesting samples but did not provide any measure to avoid repeated sampling from the same cluster. Clustering-based AL methods often make unrealistic assumptions about the distribution of the data and about the distribution of the labels of samples within each cluster. They often assume that the data is well separated into clusters and that once the cluster membership of a sample is known, its label is known as well. Another limitation is the simple distribution assumed for each cluster (e.g., Gaussian).
Active learning and semi-supervised learning have been combined in various ways to take advantage of structural information. For example, similarity between unlabeled points is modeled as a graph with weighted edges in [4], where a Gaussian field is constructed for which the generalization error can be efficiently approximated and used for query selection. However, the similarity between two samples is defined in terms of an RBF kernel, which can be remote from reality. A semi-supervised active learning framework is proposed in [13] to use class-central samples as a guide to choose better class-boundary samples. However, the full use of the data distribution is still missing. We propose a simple modular way to make use of an approximate data distribution that gives a better query selection strategy.
There are a couple of recent works which also take structural information into account, mainly in an indirect way. For example, [14, 15] use meta-learning to learn an active learning strategy which can be transferred to other tasks. Similarly, [16] learns an AL strategy, in the form of a regression method that predicts the reduction in classification error after labeling a sample, and then transfers this strategy to novel tasks. Even though structural information has augmented traditional AL heuristics in these works, the explicit value of the unnormalized probability has not been used. In addition, the proposed methods are not modular and cannot be easily combined with previous AL strategies.
The cost of AL is mainly defined as the labeling cost, which is proportional to the number of queried samples. However, this is not the only cost. Retraining the classifier at each AL cycle also charges the user. The work in [17] takes this cost into account and proposes a method to reduce it. Our proposed method reduces the retraining cost as well. It enables batch sampling in which the samples of each batch are more informative than in conventional batch sampling in AL.
The method proposed in this paper is modular and can be combined with other AL strategies. Moreover, with minor modifications, it can be used with unconventional oracles that do not directly provide correct labels for the queried samples [18, 19].
3 The proposed method: DAAL
The intuitive idea behind our work is that the selector can make wiser decisions in choosing queries if it has knowledge about the structure of the feature space, which is fully represented by the probability distribution of the data. However, we normally do not have direct access to this distribution and are instead given a set of unlabeled samples generated from it. Hence, the first step is to approximate the data distribution by some function called the teacher. The structural information of the unlabeled set is then distilled in the teacher and can be used to guide the selector to choose more effective query samples. Because query selection at each active learning cycle is informed by the distribution of the dataset, we call our method Distribution Aware Active Learning (DAAL). The information content of each sample is not fully determined by the uncertainty of the current learner about that sample alone. Formally speaking, uncertainty sampling (US) defines the information content of a sample with unknown label as the entropy of that label under the learner's current predictive distribution. DAAL augments this definition with structural information, i.e., the information content becomes a function of both the label uncertainty and the teacher's estimate of the data density. The detailed definitions are discussed later in this section.
Here we provide the logic behind our idea and its formulation in its generic form. Assume we have a base criterion $\mathcal{Q}_{\text{base}}(x)$ by which a conventional active learner scores candidate queries. We suggest a new proposal criterion $\mathcal{Q}(x)$ on top of this base criterion, formulated as follows:
$$\mathcal{Q}(x) = \mathcal{Q}_{\text{base}}(x)\,\mathcal{P}(x)^{\beta} \tag{1}$$
In this formulation, $\mathcal{P}(x)$ encapsulates the structural information of the dataset which is already distilled in the teacher component. The hyperparameter $\beta$ controls the attention of the selector to the teacher. A larger $\beta$ turns the selector's decision for querying samples more towards the knowledge of the teacher than the current knowledge of the learner. The criterion $\mathcal{Q}_{\text{base}}$ can be any simple active learning criterion; here it is assumed to be uncertainty sampling, i.e., it chooses samples which are closer to decision boundaries, where the labels are most ambiguous. The major question now is how to distill the structural information of the unlabeled set in the teacher component and use it to design $\mathcal{P}$. The next section presents one practical way to do so.
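The following sketch illustrates how such a combined criterion could rank a pool, assuming that Eq. 1 multiplies the base uncertainty score by the teacher's density score raised to the power of the attention hyperparameter. The function names and all numeric values are hypothetical; in particular, the first sample is constructed to play the role of a low-density outlier.

```python
import math


def label_entropy(probs):
    """Shannon entropy (in nats) used here as the base uncertainty score."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def daal_scores(pool_probs, densities, beta):
    """Base score times density score raised to the power beta (assumed form of Eq. 1)."""
    return [label_entropy(p) * (d ** beta) for p, d in zip(pool_probs, densities)]


def select_batch(scores, k):
    """Indices of the k highest-scoring pool samples."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical pool: sample 0 is maximally uncertain but lies in a low-density region.
pool_probs = [[0.5, 0.5], [0.6, 0.4], [0.9, 0.1]]
densities = [0.01, 0.80, 0.90]

blind_choice = select_batch(daal_scores(pool_probs, densities, beta=0.0), k=1)
aware_choice = select_batch(daal_scores(pool_probs, densities, beta=1.0), k=1)
```

With the attention set to zero the ranking reduces to plain uncertainty sampling and picks the outlier; with a positive attention the uncertain but typical sample is preferred instead.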
3.1 VAE as density estimator for DAAL
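The variational autoencoder (VAE) [2] is a framework for performing inference in a class of deep probabilistic models. The class of models can be almost any unsupervised density estimator with latent random variables. Here, we briefly present the essence of VAEs and show how they can be used in DAAL. As in any latent variable model of observed variables $x$, a new set of variables $z$ is introduced and the joint probability distribution over $(x, z)$ is factorized as $p(x, z) = p(x \mid z)\,p(z)$. The generative process is to first sample $z$ from the prior distribution $p(z)$ and then sample $x$ from the conditional distribution $p(x \mid z)$. Inference means computing the conditional distribution $p(z \mid x) = p(x, z)/p(x)$, which requires computing the evidence $p(x) = \int p(x, z)\,dz$. The evidence is hard to compute because the marginalizing integral is taken over exponentially many configurations of latent variables. Variational inference tries to approximate $p(z \mid x)$ with $q_\phi(z \mid x)$ chosen from a family of functions indexed by $\phi$. This can be done by minimizing the Kullback-Leibler divergence between these two distributions:
$$\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z \mid x)\big) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x)\big] - \mathbb{E}_{q_\phi(z \mid x)}\big[\log p(z \mid x)\big] \tag{2}$$
This criterion is hard to compute because of the presence of the intractable $p(z \mid x)$. Substituting $\log p(z \mid x) = \log p(x, z) - \log p(x)$ and rearranging the terms reveals the following equations, which are of most interest to us:
$$\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z \mid x)\big) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x)\big] - \mathbb{E}_{q_\phi(z \mid x)}\big[\log p(x, z)\big] + \log p(x) \tag{3}$$
$$\log p(x) = \mathcal{L}(\phi; x) + \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z \mid x)\big), \qquad \mathcal{L}(\phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p(x, z)\big] - \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x)\big] \tag{4}$$
Jensen's inequality ensures that the KL divergence is nonnegative. Given this, and because the left-hand side of Eq. 4 does not depend on $\phi$, we can maximize $\mathcal{L}(\phi; x)$ as an implicit way to minimize Eq. 2. Therefore, we have $\mathcal{L}(\phi; x) \le \log p(x)$ and, after the optimization is completed at $\phi^*$, $\mathcal{L}(\phi^*; x) \approx \log p(x)$. Exponentiating both sides, we obtain an approximation to the probability distribution of the observed data, $p(x) \approx \exp \mathcal{L}(\phi^*; x)$. This approximation is then used in DAAL to design the structural part of Eq. 1, denoted $\mathcal{P}(x)$. In practice, we observed that passing the value of $\mathcal{L}$ through a sigmoid function, i.e., $\mathcal{P}(x) = \sigma(\mathcal{L}(\phi^*; x))$, gives better performance, where $\sigma(t) = 1/(1 + e^{-t})$.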
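A minimal sketch of turning a variational lower bound into a bounded density score is given below. It assumes the common VAE setup of a diagonal Gaussian encoder and a standard normal prior, for which the KL term has a closed form, and it assumes the sigmoid is applied to the lower bound itself. The reconstruction log-likelihoods and encoder outputs are made-up numbers standing in for a trained model, and the function names are hypothetical.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in nats (assumed encoder and prior)."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def elbo(recon_log_lik, mu, logvar):
    """Single-sample lower bound: reconstruction log-likelihood minus the KL term."""
    return recon_log_lik - gaussian_kl(mu, logvar)


def sigmoid(t):
    """Numerically stable logistic function; avoids overflow for very negative t."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def density_score(elbo_value):
    """Squash the lower bound into (0, 1) to obtain a structural score."""
    return sigmoid(elbo_value)


# Hypothetical encoder outputs and reconstruction terms for a typical and a rare sample.
typical = density_score(elbo(-1.0, [0.1, -0.2], [0.0, 0.0]))
rare = density_score(elbo(-9.0, [2.0, 2.5], [0.0, 0.0]))
```

Because the sigmoid is monotone, the ordering of samples by the lower bound is preserved while the score stays bounded.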
3.2 Motivations for DAAL
In this section, we investigate the motivations for DAAL and how augmenting a base selector with structural data can lead to improvements over traditional AL methods. The motivations are listed and described below.
Robustness against outliers. The uncertainty-based criterion only cares about the distance of samples from the decision boundary. There is always a chance that a high uncertainty score belongs to an outlier, i.e., a sample from a region where the data density is close to zero. Choosing an outlier will misguide the learner and change the decision boundary dramatically. The query criterion of Eq. 1, on the other hand, takes the relative rareness of outliers into account through its structural term and prevents the selector from mistakenly choosing them and asking the oracle for their labels. This not only saves the decision boundary from being affected by the outliers, but also removes the extra cost imposed on the oracle to label a useless sample.
A practical example of this is object detection, where obtaining bounding boxes of objects in images is a costly process. Active learning can decrease this cost by proposing to the oracle bounding boxes that most likely contain an object and asking only for the label. However, finding those bounding boxes is not an easy task and is prone to outliers (bounding boxes with no objects). This is where our method can be useful: a density estimator can be learned that predicts the likelihood of bounding boxes containing objects.
Batch active learning. Uncertainty-based sampling can choose only one sample at a time. The reason for this limitation is clear. If the decision function of the classifier varies slowly with respect to the input, the entropy of the labels, and consequently the base uncertainty score, changes slowly as well:
(5) 
This implies that if a batch of samples is chosen by uncertainty sampling instead of a single sample, the batch lacks diversity, and the information content of each remaining sample in the batch decreases considerably once the label of a nearby sample is known to the learner. It is a known effect that choosing the highest-scoring samples from the AL pool gives samples with low diversity [20]. DAAL has an automatic means to mitigate this problem and enable the selector to choose multiple samples at each AL cycle. The reason for this higher diversity can be explained as follows. Assume the unlabeled samples are sorted by the criterion of Eq. 1, $\mathcal{Q}(x) = \mathcal{Q}_{\text{base}}(x)\,\mathcal{P}(x)^{\beta}$. For two samples $x_1$ and $x_2$ with close scores, we can write:
$$\mathcal{Q}_{\text{base}}(x_1)\,\mathcal{P}(x_1)^{\beta} \approx \mathcal{Q}_{\text{base}}(x_2)\,\mathcal{P}(x_2)^{\beta} \tag{6}$$
$$\left(\frac{\mathcal{P}(x_1)}{\mathcal{P}(x_2)}\right)^{\beta} \approx \frac{\mathcal{Q}_{\text{base}}(x_2)}{\mathcal{Q}_{\text{base}}(x_1)} \tag{7}$$
For this relation to hold, $x_1$ and $x_2$ do not need to be close to each other in the feature space. The left-hand side of Eq. 7 can be far from unity for a multimodal data distribution. This effect is magnified for large values of $\beta$, resulting in more diversity in the batch chosen by criterion $\mathcal{Q}$.
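A small numeric check can make this argument concrete. Assuming the combined score is the base score times the density score raised to the attention hyperparameter, two samples with very different base scores can still tie in the combined ranking when their density scores differ accordingly. All values and names below are hypothetical and chosen only to make the tie exact.

```python
import math

beta = 2.0

# Hypothetical sample 1: very uncertain, moderate density score.
q_base_1, p_1 = 0.60, 0.40
# Hypothetical sample 2: less uncertain, high density score (e.g., another mode).
q_base_2, p_2 = 0.15, 0.80


def combined_score(q_base, p, beta):
    """Base score times density score raised to the power beta (assumed form of Eq. 1)."""
    return q_base * p ** beta


score_1 = combined_score(q_base_1, p_1, beta)
score_2 = combined_score(q_base_2, p_2, beta)

# With equal combined scores, the density ratio raised to beta
# equals the inverse ratio of the base scores.
density_ratio = (p_1 / p_2) ** beta
base_ratio = q_base_2 / q_base_1
```

Both samples would enter the same top-ranked batch although their base scores differ by a factor of four, which is the source of the added diversity.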
Starting from scratch by annealing $\beta$. Many active learning algorithms depend on an initial set of labeled samples. This may bias the AL process towards small regions of the feature space, especially when the initial set is small or unrepresentative of the underlying distribution. DAAL deals with this problem with no need for manually selecting the initial set. Here we introduce a dynamical approach, inspired by non-autonomous dynamical systems, that changes the problem setting over time [21, 22, 23]. Assume that at the very beginning of the AL process, $\beta$ is large and the query criterion is mainly influenced by the structural term. This amounts to choosing the samples from the unlabeled set with the highest density scores, which are the most representative samples of the data distribution. For example, in a multimodal distribution, this criterion ensures that in the beginning the most representative samples of each mode are selected. As AL proceeds, we decrease the value of $\beta$. This results in a more prominent role for the uncertainty term of the criterion, that is, more focus on refining the decision boundaries in the regions of the feature space where there still exists ambiguity in terms of label uncertainty. Simply speaking, by annealing the attention hyperparameter $\beta$ from some large value towards a small one, the selector is highly attentive to the teacher in the beginning and selects a diverse set of representative samples. This results in finding coarse decision boundaries in the beginning. As $\beta$ decreases over AL cycles, the selector becomes less attentive to the teacher and more attentive to the learner. Being more attentive to the learner means choosing more samples from regions of the feature space that help resolve the ambiguity of the learner in those regions.
In the next section, we provide experiments to empirically show the aforementioned points.
4 Experiment
4.1 Toy example
We design a toy example to showcase the efficacy of our method and visualize its performance. Assume the task is to separate two classes. The class-conditional densities from which the samples of each class are generated are denoted by $p_1(x)$ and $p_2(x)$. In addition, when the active learner explores the world to query new samples, it may encounter outliers or noisy samples which come from neither $p_1$ nor $p_2$. To simulate this effect, we assume a third distribution, called the outlier distribution and denoted by $p_o(x)$. The active learner sees samples in the world that come from either the true classes or the outlier distribution. From the active learner's perspective, we assume the samples come from a single distribution $p(x)$, which is itself a mixture of true data and outliers with nonnegative mixture weights satisfying $\pi_1 + \pi_2 + \pi_o = 1$:
$$p(x) = \pi_1\,p_1(x) + \pi_2\,p_2(x) + \pi_o\,p_o(x) \tag{8}$$
The class-conditional distributions for the following experiment are represented as colored dots in Fig. 1. We assume the outlier distribution is uniform over a bounding box around the domain of the two class-conditional densities. A variational autoencoder is then trained on samples from the mixture, and the heatmap of the resulting density score (see Sec. 3.1) for different values of $\beta$ is depicted in Fig. 3. The selector uses Eq. 1 to sort the pool of unlabeled samples and find the best ones to query. As stated in Section 3, DAAL becomes normal active learning (blind to the distribution) for $\beta = 0$. As $\beta$ increases, the role of the distribution becomes more prominent. We have chosen an intermediate value in this experiment. Fig. 3 illustrates how the proposed sampling strategy of Eq. 1 influences a conventional active learning criterion (e.g., label entropy) at each AL cycle. Using Eq. 1, the outliers that receive large values of the base criterion will get a lower overall score due to the structural term and have a lower chance of being queried by the selector.
To show the effectiveness of our algorithm, we compare a normal active learning process ($\beta = 0$) with our method ($\beta > 0$). At each iteration, the selector queries data points. Qualitative results are shown in Fig. 5 and quantitative results are shown in Fig. 5. Fig. 5 shows the average accuracy of the classifier for runs trained on queried samples at each iteration of active learning. At the beginning (cycle ) of each run, the labeled set (of size ) is initialized at random ( sample per class). However, this initial labeled set is identical for both our method and the baseline. In all the experiments performed on the toy example we used a simple multilayer perceptron with two hidden layers containing and nodes, respectively, with ReLU as the activation. We used the VAE architecture of [2] as the density estimator.
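A pool of this kind can be simulated along the following lines. The sketch assumes two isotropic Gaussian classes as stand-ins for the class-conditional densities, whose actual shape is only shown in the figure, plus uniform outliers inside a bounding box. The class centers, spread, outlier fraction, and box limits are hypothetical values, as is the function name.

```python
import random


def sample_toy_pool(n, outlier_fraction, box, rng):
    """Draw n 2-D points from two Gaussian classes mixed with uniform outliers."""
    (xmin, xmax), (ymin, ymax) = box
    centers = {0: (-2.0, 0.0), 1: (2.0, 0.0)}  # hypothetical class means
    spread = 0.5                               # hypothetical standard deviation
    points, sources = [], []
    for _ in range(n):
        if rng.random() < outlier_fraction:
            # Outlier: uniform over the bounding box around the data domain.
            points.append((rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)))
            sources.append("outlier")
        else:
            label = rng.choice([0, 1])
            cx, cy = centers[label]
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            sources.append(label)
    return points, sources


box = ((-4.0, 4.0), (-3.0, 3.0))
points, sources = sample_toy_pool(200, 0.3, box, random.Random(0))
```

The `sources` list records which mixture component produced each point, which is useful for checking afterwards how many queried samples were outliers.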
4.2 High Dimensional Data
As a high-dimensional example, we test DAAL on MNIST, a dataset containing handwritten digits [25]. To mimic the notion of outliers and inliers, we use the first five digits as the dataset of interest and the remaining digits as outliers. We first randomly sample images of each digit in from the training set (in total ) to train the generative model, and use the rest of the training set combined with a portion of outliers as the pool from which the selector is to choose query samples (in our experiments the total number of outliers is two times more than inliers). The classifier is validated on the MNIST test set. Fig. 6 shows the performance of our method compared with the baseline. In all the experiments on MNIST, we used LeNet [26] as the classifier and a VAE [2] as the density estimator.
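The split described above can be sketched on label indices alone. The snippet assumes that the first five digits are 0 to 4 and that the pool holds twice as many outliers as inliers; the per-digit sample counts and the synthetic label list are hypothetical and much smaller than MNIST, and the function name is illustrative.

```python
import random


def build_splits(labels, inlier_digits, n_teacher_per_digit, outlier_ratio, rng):
    """Return (teacher_idx, pool_idx): teacher training indices and AL pool indices."""
    teacher_idx, pool_idx = [], []
    for digit in sorted(inlier_digits):
        idx = [i for i, y in enumerate(labels) if y == digit]
        rng.shuffle(idx)
        teacher_idx += idx[:n_teacher_per_digit]   # used to train the density estimator
        pool_idx += idx[n_teacher_per_digit:]      # remaining inliers go to the pool
    outlier_idx = [i for i, y in enumerate(labels) if y not in inlier_digits]
    n_outliers = int(outlier_ratio * len(pool_idx))
    pool_idx += rng.sample(outlier_idx, n_outliers)  # add a portion of the outliers
    return teacher_idx, pool_idx


# Synthetic labels: 10 images per inlier digit, 20 per outlier digit.
labels = [d for d in range(5) for _ in range(10)] + [d for d in range(5, 10) for _ in range(20)]
inliers = {0, 1, 2, 3, 4}
teacher_idx, pool_idx = build_splits(labels, inliers, 2, 2, random.Random(0))
```

Keeping the teacher's training indices disjoint from the pool means the density estimator is never trained on outliers.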
4.3 Annealing
In the previous experiments, we showed the usefulness of DAAL for a fixed value of $\beta$ as a measure that takes into account the structural information of a dataset. In this section, we investigate the effect of annealing $\beta$ from a large value at the beginning towards a small value as active learning proceeds. We show that DAAL is not only robust against outliers, but also gives better performance even in the absence of outliers. It allows us to start training from scratch, removing the need for an initial set of labeled samples. In addition, it enables the selector to choose a diverse batch of samples in every cycle of active learning, which allows faster convergence and consequently reduces the total cost of training. We have conducted an experiment to emphasize these points and quantitatively show how DAAL results in a lower cost throughout the course of active learning. Fig. 5(a) illustrates the results of this experiment. In this experiment, we tried three ways of obtaining the initial labeled set. (a) balanced: An oracle selects a balanced limited set of data from a large unlabeled pool and annotates them. (b) biased: An oracle selects a biased unbalanced set of training data from a large unlabeled pool. (c) beta (our method): We select a batch with the highest score from the unlabeled pool. In our method, we used a large value of $\beta$ at the beginning ( in this case) and used geometric annealing with some rate constant ( in this case), i.e., we update the attention parameter at each AL cycle. A batch of size is queried by the selector at each AL cycle. In this experiment we used samples from MNIST dataset to train the VAE and put the remaining samples in the active learning pool.
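One way to realize a geometric annealing schedule is to multiply the attention hyperparameter by a constant rate between zero and one at every AL cycle. The update rule in this form, the starting value, and the rate below are assumptions for illustration and are not the settings used in the experiment.

```python
def geometric_annealing(beta_start, rate, n_cycles):
    """Attention values beta_t = beta_start * rate**t for t = 0, ..., n_cycles - 1."""
    if not 0.0 < rate < 1.0:
        raise ValueError("rate must lie strictly between 0 and 1")
    return [beta_start * rate ** t for t in range(n_cycles)]


# Hypothetical schedule: start at 8 and halve the attention at every cycle.
betas = geometric_annealing(beta_start=8.0, rate=0.5, n_cycles=5)
```

Early cycles then rank the pool almost entirely by the teacher's density score, while later cycles approach plain uncertainty sampling.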
4.4 VAE latent space analysis
We introduce this experiment to better investigate what actually happens while AL cycles are conducted by DAAL when $\beta$ is annealed. We monitor the latent space of the employed VAE which is used by the selector. Fig. 7 illustrates, in the latent space of the VAE, the samples which are selected at each cycle of active learning. The label predicted by the classifier for each point before and after each cycle is shown with corresponding colors. At each AL cycle, the chosen samples show a clustered distribution where each cluster corresponds to ambiguity in some part of the decision boundary. For example, the decrease in the entropy of labels in locations at cycle 1 or at cycle 2 is observed when we move from the top row (before) to the bottom row (after). This shows that DAAL can reduce the ambiguity of the decision boundaries for distant areas in the feature space. Conventional AL methods, which are blind to structural information, can reduce the ambiguity of decision boundaries only at nearby locations or in one area at each cycle.



5 Discussion
We have proposed a simple modular method to augment conventional active learning algorithms with the structural information of unlabeled data. We used a variational autoencoder as a module that learns and encapsulates the structural information, which is later used by the active learner to decide which samples are more informative. The flexibility and generality of this modular approach separate our work from other studies that use some kind of structural information during active learning. Several experiments were conducted to illuminate different aspects of our proposed method. Experiments on synthetic and real datasets showed that our method is more robust against outliers. Even in the absence of outliers, having structural information enables the active learner to start with a less biased initial labeled set and take more diverse batches at each AL cycle. Future directions include combining other state-of-the-art generative models [27] and density estimators [28] with active training of discriminative models. Furthermore, the generative components of variational autoencoders and generative adversarial networks could enhance synthesized sampling strategies for active learning.
References
 [1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [2] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. The International Conference on Learning Representations (ICLR), 2013.
 [3] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
 [4] Jia-Jie Zhu and José Bento. Generative adversarial active learning. CoRR, abs/1702.07956, 2017.
 [5] Jasmina Smailović, Miha Grčar, Nada Lavrač, and Martin Žnidaršič. Stream-based active learning for sentiment analysis in the financial domain. Information sciences, 285:181–203, 2014.
 [6] David D Lewis and William A Gale. A sequential algorithm for training text classifiers. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, pages 3–12. Springer-Verlag New York, Inc., 1994.
 [7] Claude Elwood Shannon. A mathematical theory of communication. Bell system technical journal, 27(3):379–423, 1948.
 [8] Joseph W Richards, Dan L Starr, Henrik Brink, Adam A Miller, Joshua S Bloom, Nathaniel R Butler, J Berian James, James P Long, and John Rice. Active learning to overcome sample selection bias: application to photometric variable star classification. The Astrophysical Journal, 744(2):192, 2011.
 [9] Hieu T Nguyen and Arnold Smeulders. Active learning using pre-clustering. In Proceedings of the twenty-first International Conference on Machine learning, page 79. ACM, 2004.
 [10] Cha Zhang and Tsuhan Chen. An active learning framework for content-based information retrieval. IEEE transactions on multimedia, 4(2):260–268, 2002.
 [11] Tong Zhang and F Oles. The value of unlabeled data for classification problems. In Proceedings of the Seventeenth International Conference on Machine Learning,(Langley, P., ed.), pages 1191–1198. Citeseer, 2000.
 [12] Ji Zhu and Trevor Hastie. Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 14(1):185–205, 2005.
 [13] Yan Leng, Xinyan Xu, and Guanghui Qi. Combining active learning and semi-supervised learning to construct svm classifier. Knowledge-Based Systems, 44:121–131, 2013.
 [14] Philip Bachman, Alessandro Sordoni, and Adam Trischler. Learning algorithms for active learning. arXiv preprint arXiv:1708.00088, 2017.
 [15] Sachin Ravi and Hugo Larochelle. Meta-learning for batch mode active learning. OpenReview, 2018.
 [16] Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua. Learning active learning from real and synthetic data. arXiv preprint arXiv:1703.03365, 2017.
 [17] Yanyao Shen, Hyokun Yun, Zachary C Lipton, Yakov Kronrod, and Animashree Anandkumar. Deep active learning for named entity recognition. arXiv preprint arXiv:1707.05928, 2017.
 [18] Yichong Xu, Hongyang Zhang, Kyle Miller, Aarti Singh, and Artur Dubrawski. Noise-tolerant interactive learning using pairwise comparisons. In Advances in Neural Information Processing Systems, pages 2431–2440, 2017.
 [19] Keerthiram Murugesan and Jaime Carbonell. Active learning from peers. In Advances in Neural Information Processing Systems, pages 7011–7020, 2017.
 [20] Yuhong Guo and Dale Schuurmans. Discriminative batch mode active learning. In Advances in neural information processing systems, pages 593–600, 2008.
 [21] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.
 [22] Arash Mehrjou. Analysis of nonautonomous adversarial systems. arXiv preprint arXiv:1803.05045, 2018.
 [23] Arash Mehrjou, Bernhard Schölkopf, and Saeed Saremi. Annealed generative adversarial networks. arXiv preprint arXiv:1705.07505, 2017.
 [24] Sybren Ruurds De Groot and Peter Mazur. Nonequilibrium thermodynamics. Courier Corporation, 2013.
 [25] Yann LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
 [26] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [27] Dilin Wang and Qiang Liu. Learning to draw samples: With application to amortized mle for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.
 [28] Saeed Saremi, Arash Mehrjou, Bernhard Schölkopf, and Aapo Hyvärinen. Deep energy estimator networks. arXiv preprint arXiv:1805.08306, 2018.