Active learning procedure via sequential experimental design and uncertainty sampling
Classification is an important task in many fields, including biomedical research and machine learning. Traditionally, a classification rule is constructed from a set of labeled data. Recently, owing to technological innovation and automatic data collection schemes, we often encounter data sets containing large amounts of unlabeled samples. Because labeling each of them is usually costly and inefficient, how to utilize these unlabeled data in the construction of a classifier becomes an important problem. In the machine learning literature, active learning and semi-supervised learning are popular concepts discussed in this situation, where classification algorithms recruit new unlabeled subjects sequentially based on the information learned in previous stages of the learning process, and these new subjects are then labeled and included as new training samples. From a statistical viewpoint, these methods can be recognized as a hybrid of sequential design and stochastic approximation procedures. In this paper, we study sequential learning procedures for building efficient and effective classifiers, where only the selected subjects are labeled and included in the learning stage. The proposed algorithm combines the ideas of Bayesian sequential optimal design and uncertainty sampling. Computational issues of the algorithm are discussed, and numerical results using both synthesized data and real examples are reported.
keywords: Active learning, Uncertainty sampling, Sequential experimental design, D-optimal design, Bayes rule
Classification is an important task in many fields, including biomedical research, engineering, sociology and many others. How to construct a classification rule based on a labeled data set is a classical statistical problem. In the machine learning literature, several types of learning problems are discussed and, depending on how labeled subjects are included in the learning process, they are usually termed supervised, unsupervised or semi-supervised learning (Seeger, 2000; Settles, 2010). Recently, due to technical innovation, “big data” has become a buzz phrase in many fields, and we now often encounter data sets that contain huge amounts of unlabeled data. Hence, how to utilize these unlabeled data efficiently to construct a classification rule becomes an important problem. Because labeling each unlabeled subject is usually costly and inefficient, a common approach is active learning (see, for example, Cohn et al., 1996; Yu et al., 2006). This type of learning process queries the label information only for the “selected” subjects, which are usually chosen based on the information learned in the previous learning stages, and then includes the newly labeled subjects in its training set. A learning process usually continues until a prefixed criterion is reached, such as a prefixed total number of labeled subjects to be used in the training stage.
Moreover, because in an active learning process subjects are dynamically and sequentially selected, labeled and then added to the training set, this process is naturally related to sequential experimental designs in statistics, where a new observation/experiment is conducted at particular design points selected according to the information obtained from the data gathered up to the current stage. Since data are observed adaptively, this type of method is also related to stochastic approximation, which was first discussed in Robbins and Monro (1951). Their original procedure, called the Robbins-Monro (RM) procedure, can be viewed as a stochastic version of the Newton-Raphson method for nonlinear root-finding problems. Following Robbins and Monro (1951), sequential design methods have been intensively studied, and many papers have discussed modifications of the RM procedure and their corresponding convergence rates. Recently, Joseph (2004) further modified the RM procedure to improve its efficiency. This type of procedure is nonparametric in the sense that no parametric model assumption is presumed.
However, the RM procedure can also be derived in a parametric form. For example, using the maximum likelihood estimate (MLE) of a logistic model, Wu (1985) proposed a logit-MLE method for binary data that uses the currently available labeled data to fit a logistic model, and then selects the next input with the desired probability based on the fitted model. Because constructing a classification rule under the active learning framework can be formulated as estimating the threshold boundary between two groups, which can usually be defined via a probability quantile, it can also be viewed as a stochastic root-finding procedure as described above. Moreover, logistic models are commonly used in binary classification problems, and the properties of sequential estimation for generalized linear models (GLM) under general adaptive designs are well studied (Chang, 2001; Zacks, 2008). Hence, it is natural to construct a binary classification rule, sequentially and adaptively, by putting all these ingredients together. The active learning algorithm developed in Deng et al. (2009), which combines the logit-MLE of Wu (1985) with D-optimal design, is a successful example. This kind of method depends on the properties of the MLE. Although the existence and uniqueness of the MLE can be ensured after quite a few initial observations (Silvapulle, 1981), it may still suffer from severe bias when the sample size is small, which usually results in an inefficient learning process. More recently, Joseph et al. (2007) developed a Bayesian extension of Wu’s approach, using maximum a posteriori (MAP) estimates of the parameters of a logistic model rather than MLEs. Dror and Steinberg (2008) suggested a new sequential experimental design for GLMs, where observations are selected sequentially based on a Bayesian D-optimality criterion and Bayesian estimates of model parameters. These methods motivate us to study a novel modification of Deng et al. (2009).
As in conventional regression analysis, it is well known that when the dimension of the unknown parameter vector becomes large, its estimated information will be very unstable. Because active learning processes usually rely on such information, unstable parameter estimates will also affect the learning process. In the real example studied in Deng et al. (2009), the two variables were selected based on experts’ opinions. However, this situation is rare, and usually more variables must be considered in a real example. Thus, how to stabilize a learning process in the high dimensional case is a difficult and important problem. In this paper, we focus on higher dimensional data sets. A Bayesian sequential design is used, and the related computational issues are discussed. In addition, for practical usage, we also study the effects of using different sizes of labeled data sets as the initial training set of an active learning process. As to subject selection during a process, the major difference between a sequential design and an active learning process is that in a sequential design, an experiment can be conducted at the selected points, while in an active learning process with existing unlabeled data, we can only select points near the theoretical ones from an existing data set. Hence, how to select the next point based on the available information plays a key role in an active learning process. Deng et al. (2009) aimed at shortening the distance between the estimated boundary and the true one, so their subject selection scheme heavily depends on the initial model assumption. In practice, the form of the true model is usually unknown. Hence, in order to diminish the effect of model assumptions, we adopt a different design point selection scheme. The advantages of the proposed method will be discussed from both theoretical and practical aspects.
The rest of this paper is organized as follows. In Section 2, we first review the active learning algorithm of Deng et al. (2009), and then discuss the proposed algorithm and some modifications. Simulation results and numerical studies with real data sets are presented in Sections 3 and 4, respectively. Section 5 concludes. Technical details are given in the Appendix.
2.1 Model and Parameter Estimation
Let be the explanatory vector of a subject and let the variable or denote the category a subject belongs to. Suppose that is the probability model of given . Assume further that each variable has a positive relationship with the response; that is, the larger the value of , the higher the probability of . Then Deng et al. (2009) assumed that has the parametric form
where , for each , and . Let be a vector of parameters. Then following (1), for a given , is a Bernoulli random variable with mean . Model (1) can be re-written as a conventional logistic regression model:
where and . The Fisher information matrix of with a set of design points is
where is the regression matrix with th row equal to , and is a diagonal matrix with , . It is clear that this information matrix is nonlinear in and depends on the unknown only through .
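To make the computation concrete, the information matrix above can be sketched in a few lines of NumPy (the variable names here are ours, not the paper's; the intercept is carried as a column of ones in the regression matrix):

```python
import numpy as np

def fisher_information(X, beta):
    """Fisher information matrix X' W X for a logistic model.

    X    : (n, p) regression matrix (a column of ones gives the intercept).
    beta : (p,) coefficient vector.
    W is diagonal with entries p_i * (1 - p_i), where p_i is the fitted
    success probability at the i-th design point.
    """
    eta = X @ beta                      # linear predictor
    p = 1.0 / (1.0 + np.exp(-eta))      # fitted probabilities
    w = p * (1.0 - p)                   # Bernoulli variances (diagonal of W)
    return X.T @ (w[:, None] * X)       # X' W X

# Example: three design points, intercept plus one covariate
X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]])
beta = np.array([0.0, 1.0])
I = fisher_information(X, beta)
```

As the code makes explicit, the matrix depends on the unknown parameters only through the fitted probabilities, which is why it must be re-evaluated at each stage of a sequential procedure.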
Suppose that are the observed labeled data of size . Using this training set, we obtain MAP estimates of and , denoted by and , respectively. Using the current parameter estimates, the classification rule based on the estimate of becomes
with an estimated boundary
where when there is no extra information, such as , available. (In general, the cutting point for a logistic classification function is 0.5. However, when there is prior information about the event, such as the prevalence rate in an epidemiological study, the cutting point is usually adjusted accordingly. This will be discussed later.) Therefore, the active learning problem under this setup becomes how to recruit a set of training subjects efficiently such that when the learning process is stopped, the final classification function has good prediction power.
2.2 Subject Selection
Intuitively, in order to have an efficient learning process, we should learn the most uncertain subjects first, because doing so is likely to improve the classifier the most. Thus, when using a probabilistic learning model in an active learning framework, the most commonly used query strategy for acquiring new data is uncertainty sampling (Settles, 2010), where the active learner queries the label information of the instance whose class membership is least certain. For a binary classification problem, this simply means querying the instance whose membership probability is closest to (Lewis and Gale, 1994; Lewis and Catlett, 1994). Thus, in the binary classification case the uncertainty is usually measured by
where . (Note that in Deng et al. (2009), only one parameter is used for both measuring the uncertainty and adjusting the cutting point, and they noted that this parameter can be data dependent. However, our numerical studies show that for our method, using two different parameters, one for measuring uncertainty and one for adjusting the cutting point, usually performs better. Regarding this phenomenon, more discussion, from a statistical decision theory viewpoint, is given in Section 4.3.)
Let be the unlabeled data set. Then rank the points in in ascending order based on (6); an active learning procedure chooses the top ranked point as follows:
That is, we choose the point with estimated probability closest to 0.5 as the next one to be labeled. Because in high dimensional cases there may be many points with the same or similar , we first choose the top points as candidates, where , in our method, is decided by a local D-efficiency method using the locally optimal designs discussed in Woods et al. (2006) and Dror and Steinberg (2008). (For details of this method, please refer to the original papers.) As mentioned in Deng et al. (2009), using (7) as the only criterion cannot provide good estimates of the model parameters, and optimal design methods can be a good supplement to remedy this disadvantage. Thus, let be the set of candidate points screened out using (7). We then assess these candidates further with optimal experimental design criteria.
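The uncertainty-sampling shortlist can be sketched as follows (a minimal illustration with our own function and variable names; here the ranking criterion is simply the distance of the fitted probability from 0.5, and the shortlist size plays the role of the candidate set size chosen by the D-efficiency method):

```python
import numpy as np

def uncertainty_shortlist(X_pool, beta, k):
    """Rank unlabeled pool points by |p_hat - 0.5| (ascending) and
    return the indices of the k most uncertain ones, which are then
    passed on to the optimal-design step."""
    p = 1.0 / (1.0 + np.exp(-(X_pool @ beta)))   # fitted probabilities
    return np.argsort(np.abs(p - 0.5))[:k]

# Toy pool: intercept plus one covariate, current estimate beta
X_pool = np.array([[1.0, -3.0], [1.0, -0.1], [1.0, 0.05], [1.0, 4.0]])
beta = np.array([0.0, 1.0])
idx = uncertainty_shortlist(X_pool, beta, k=2)
```

In this toy pool, the two points whose linear predictors are nearest zero (fitted probability nearest 0.5) are shortlisted, while the two clearly classified points are skipped.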
One of the major differences between our method and that of Deng et al. (2009) is that we use uncertainty sampling instead of a distance-based scheme to select the candidate set. The effect of uncertainty sampling becomes obvious when the difference between the sample sizes of the two groups is large. This situation happens very often in problems that aim to detect a set of rare subjects within a large data set, or when the population sizes of the two groups are uneven. When the true model is exactly linear and the variables of the model are completely known, the two methods are the same. However, in practice, the form of the true model and the variables involved in it are usually unknown. For instance, in the example discussed in Deng et al. (2009), the two variables used in their model were selected from a large number of variables by experts, and in fact the true model may involve other variables. When the model is an approximation with some leftover random errors, the candidate set defined by a Euclidean distance-based method can be very different from the one obtained using an uncertainty measure. This situation is easily illustrated in Figure 1, where Figure 1(a) is the probability contour plot when the true model is linear, and Figure 1(b) is a contour plot of the probabilities for the same linear model plus a small nonlinear error term. That is, when some perturbation exists, the contour lines are no longer parallel. Thus, using the perpendicular distance to find a candidate set, as in Deng et al. (2009), may not be the best choice. This is why we use an uncertainty sampling scheme to define a candidate set first, and then use a (Bayesian) D-optimal design method to select the best subject for parameter estimation.
Moreover, when the dimensionality becomes larger, computing the determinant of the Fisher information matrix is difficult; in particular, when the size of the labeled data set is small, the information matrix will be singular or nearly singular, which provides little information for design. Thus, we adopt a Bayesian D-optimal design instead, which stabilizes the beginning stages of a learning process.
Computation of Fisher information matrix
An active learning process is sequential: each learning stage heavily relies on the information obtained from its predecessors. Naturally, an unstable initial stage will make the process inefficient and may even result in a biased classification rule. Hence, in order to have a stable learning process, especially for higher dimensional data, we adopt a Bayesian D-optimal design (see, for example, Chaloner and Larntz, 1989; Firth and Hinde, 1997), which extends the original D-optimality criterion by replacing the determinant of the Fisher information with
where denotes the prior distribution for , and the expectation is taken with respect to this prior. Computing the integral in (8) is not trivial, especially when the dimension of is high. This time-consuming step has to be repeated at each stage of an active learning process, so any simplification of it is beneficial. For this purpose, instead of the exact value of , Dror and Steinberg (2008) proposed using the following approximation to (8):
where are weights obtained with a Monte-Carlo method (see Remark 1). A new subject in the candidate set that maximizes is selected.
The weights ’s are computed using a simple Monte Carlo method (see Niederreiter, 1988). We first generate a large number of points, say M, from the prior , and denote them as . Let be large enough to represent the prior distribution. Then for a vector , , and observations taken at , the likelihood is
Normalizing the likelihood across the samples, we have weights . Hence, at each stage of the experiment, the likelihood for can be rapidly computed.
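This weighting step can be sketched as follows (our own function and variable names; the prior draws here come from a toy normal prior rather than the priors specified later, and log-likelihoods are used to avoid numerical underflow):

```python
import numpy as np

def mc_posterior_weights(X, y, beta_draws):
    """Normalized likelihood weights for prior draws `beta_draws`
    (M x p) given labeled data (X, y); these approximate the Bayesian
    D-optimality integral by a weighted sum over the Monte Carlo sample."""
    eta = beta_draws @ X.T                  # (M, n) linear predictors
    p = 1.0 / (1.0 + np.exp(-eta))
    # Bernoulli log-likelihood of each draw, summed over the n observations
    logL = (y * np.log(p) + (1 - y) * np.log(1 - p)).sum(axis=1)
    logL -= logL.max()                      # stabilize before exponentiating
    w = np.exp(logL)
    return w / w.sum()                      # normalize across the M draws

rng = np.random.default_rng(1)
betas = rng.normal(size=(500, 2))           # M = 500 draws from a toy prior
X = np.array([[1.0, -1.0], [1.0, 1.0]])
y = np.array([0, 1])
w = mc_posterior_weights(X, y, betas)
```

Because the draws are generated once and only the weights are updated as labels arrive, each stage of the experiment requires just one likelihood evaluation per draw, which is what makes the approximation fast.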
In Deng et al. (2009), a local D-optimality criterion is used to assess the unlabeled data in the candidate set (of a prefixed size) and select the subject that maximizes the determinant of the Fisher information matrix for , . This new subject is then labeled by experts and included in the learning process. It is clear that this determinant is numerically unstable when the number of design points is small and the dimension of is large. When the information matrix is singular, or even just nearly singular, it can hardly provide useful information for selecting the next design point. Thus, a Bayesian D-optimal design is a good alternative.
2.3 The proposed learning algorithm
Let be labeled data points at the initial stage. Then the proposed algorithm consists of the following steps:
Compute — the posterior estimate of with the currently available labeled data (When , we will use the prior median instead);
Rank the unlabeled data points in based on Equation (6). If the estimated posterior probabilities of all points are equal to either 0 or 1, then stop the iteration and use the current estimate as the final classifier; otherwise, go to S3;
Select a new unlabeled point from the set according to the following criteria:
If the design points up to the current stage form a nonsingular information matrix of , then choose the next point that maximizes in (9); that is,
If the information matrix is singular, then select the next point from that maximizes based on the cumulated points, -augmentation and the candidate point. That is,
We consider the case with , so a Dirichlet distribution is a reasonable prior for . Hence, the following priors are used:
Assume that , and are mutually independent; then the posterior distribution of , based on the labeled data points, is
where , and . Then, the MAP is
Note that the modified Bayesian D-optimal design in S3 is only used to determine (see Dror and Steinberg, 2008), and the reason for using it is its computational efficiency. In S4, a more precise criterion is used to evaluate the candidates found in the previous step.
At early stages, because only a few labeled data points are available, the information matrix may be singular; this is one of the reasons why S4 (ii) is adopted, which is similar to the method used in Dror and Steinberg (2008). In addition, at an early stage the estimated probabilities of all unlabeled data points may be close to or due to unstable coefficient estimates, which implies that the corresponding uncertainty measure provides little information. When this happens, we use the distance-based measurement of Deng et al. (2009) instead, until the coefficient estimates become stable.
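Putting the pieces together, the overall loop of S1–S4 can be sketched as follows. This is a simplified illustration with our own names: `fit_map` is a Gaussian-prior (ridge-penalized) stand-in for the Dirichlet-based posterior mode described above, and a plain D-criterion on the augmented design replaces the weighted Bayesian version for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_map(X, y, tau=1.0, iters=25):
    """Ridge-penalized logistic fit: a Gaussian-prior MAP stand-in
    for the Dirichlet-based posterior mode described in the text."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        g = X.T @ (y - p) - beta / tau**2          # penalized score
        W = p * (1.0 - p)
        H = X.T @ (W[:, None] * X) + np.eye(len(beta)) / tau**2
        beta = beta + np.linalg.solve(H, g)        # Newton step
    return beta

def active_learning(X_pool, oracle, init_idx, n_queries, k=5):
    """Fit on the labeled data, shortlist the k most uncertain pool
    points, then query the one whose augmented design maximizes the
    determinant of the information matrix (D-criterion).
    `oracle(i)` plays the role of the expert labeling pool point i."""
    labeled = list(init_idx)
    y_lab = [oracle(i) for i in labeled]
    unlabeled = [i for i in range(len(X_pool)) if i not in labeled]
    for _ in range(n_queries):
        beta = fit_map(X_pool[labeled], np.array(y_lab))
        p = sigmoid(X_pool[unlabeled] @ beta)
        cand = np.argsort(np.abs(p - 0.5))[:k]     # uncertainty shortlist
        best, best_det = None, -np.inf
        for j in cand:
            i = unlabeled[j]
            Xa = X_pool[labeled + [i]]             # augmented design
            w = sigmoid(Xa @ beta)
            w = w * (1.0 - w)
            d = np.linalg.det(Xa.T @ (w[:, None] * Xa))  # D-criterion
            if d > best_det:
                best, best_det = i, d
        labeled.append(int(best))
        y_lab.append(oracle(best))
        unlabeled.remove(best)
    return labeled, np.array(y_lab)

# Toy run: 60 pool points, label determined by the sign of the covariate
rng = np.random.default_rng(0)
X_pool = np.column_stack([np.ones(60), rng.normal(size=60)])
truth = (X_pool[:, 1] > 0).astype(int)
queried, y_all = active_learning(X_pool, lambda i: truth[i],
                                 [0, 1, 2, 3], n_queries=6, k=3)
```

The shortlist-then-design structure mirrors the two-step selection above: uncertainty sampling narrows the pool cheaply, and the determinant criterion is evaluated only on the few shortlisted candidates.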
3 Simulation Study
In this section, the performance of the proposed method is evaluated through simulation and compared with that of Deng et al. (2009) with two variables . (For short, we refer to their method as ALSD in the rest of this paper.) We evaluate the performance of the two methods with the same misclassification error formula used in Deng et al. (2009), which can be estimated by , where is the total number of data points, and and are the numbers of false-positive and false-negative subjects, respectively. (Note that Deng et al. (2009) have only one parameter, , in their paper. That is, they use all the time, and let when the event probability, , is not available. They also suggest that the parameter should be adjusted when there is information about the event probability. Since they have only one parameter , adjusting means adjusting both the uncertainty measure and the cutting threshold.)
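The misclassification error used for comparison is simply the number of false positives plus false negatives over the total number of data points; a minimal sketch (our own function name):

```python
def misclassification_error(y_true, y_pred):
    """(FP + FN) / N: false positives plus false negatives over the
    total number of data points."""
    fp = sum(1 for t, q in zip(y_true, y_pred) if t == 0 and q == 1)
    fn = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 0)
    return (fp + fn) / len(y_true)

# Example: one false negative and one false positive out of four points
err = misclassification_error([1, 1, 0, 0], [1, 0, 1, 0])
```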
We also assess the closeness between the estimated boundaries and the true boundary based on the distance-based measurement used in Deng et al. (2009), which is defined as follows: let
where is a set of points that lie evenly on the true boundary, ranging from -3 to 3 on the coordinate of , and is the distance of to the estimated boundary for . Using (15), a distance-based performance measure is
where is the number of simulations, and is the distance defined in (15) for the -th simulation.
3.1 Synthesized Data
We first compare the proposed method with ALSD using a two-dimensional data set generated with the following steps:
(1) Data Generation: We generate simulation data from model (1) with parameters , and . Let , and , where . We then uniformly generate from each interval , which are referred to as . The variable is then calculated according to . Using and , we then generate the response with probability based on the specified logistic model.
(2) Priors: The priors for , and are described below. First, consider the prior for . Assuming the mean of to be , we set , which implies . To get a flat prior, we take . We then consider the priors for and . Based on the lowest and highest values of (denoted and , respectively) and using the formula , we choose two extreme points, and . Let and be the suspicion levels for and , respectively. Plugging these values into (1), we have
Solving the equations above, we obtain and below:
Take as the sample variance of , , where . Then we complete the prior specification for all three parameters.
(3) For each method, we select points sequentially among the total points. Based on the current labeled points, we estimate the classification function and calculate the misclassification error and the distance by Equation (15) for both methods.
(4) Repeat the process times. The final results are based on the average of 100 runs.
According to the previous design, an example set of simulated data points is illustrated in Figure 2, where circles and squares denote the two different groups. From this figure, we can see that the two labeled groups are mixed together, and the range of remains in . Responses with are observed mostly when the two explanatory variables are both large.
Based on 100 runs, the curves of the misclassification error and the distance-based measure are shown in Figures 3 (a) and (b), respectively. The misclassification errors of the proposed method are slightly smaller than those of ALSD starting from around . The estimated boundary of the proposed method also moves toward the true boundary faster than that of ALSD. Note that the size of the candidate set in ALSD is , fixed for all stages. In our simulation, the size of the candidate set for the proposed method is , which varies according to the criterion of the local efficiency method mentioned before and is smaller than . Hence, we have to assess fewer candidate points, which is usually computationally more efficient.
3.2 Advantages of a small amount of initial learning subjects
Deng et al. (2009) did not discuss the effects of using more than one labeled subject in the initial training set. However, because active learning algorithms are sequential procedures, the performance of the current stage relies on the information obtained from its predecessors. Hence, good and stable performance in the early stages plays an important role in a successful active learning process. An easy way to have a good start is to have more labeled samples in the initial stage. Thus, to see the effects of different initial training sizes, we generate a data set with data points, and compare the results of the proposed method with equal to to those of ALSD with . Figure 4 shows the misclassification curves of the proposed method with , respectively, and of ALSD with . As expected, the proposed method with performs better than ALSD, and the larger the initial training size , the better the performance of the proposed method. Figure 4 (d), for example, shows that at around labeled samples, the proposed method achieves the same classification performance as ALSD at . Because the computation at each stage of an active learning process is time consuming, especially for high dimensional data, a method that requires fewer learning stages helps to save computational time. Hence, in order to ensure a stable and efficient learning process, it is recommended to start with a small amount of labeled data, if available. In fact, in some cases we actually require a smaller total number of labeled subjects to achieve the same performance as ALSD. This can be seen in some real data examples and will be discussed later.
4 Real Examples
For illustration and comparison purposes, we apply both the proposed method and ALSD to the Liver Disorders (BUPA) and Wisconsin Diagnostic Breast Cancer (WDBC) data sets, which are available at the UCI Repository of Machine Learning Databases (Bache and Lichman, 2013). Our main interest is the correct classification rate, so we use the same misclassification error formula defined before to evaluate their performances.
4.1 BUPA data set
The original BUPA data set, from California, USA, contains 345 records (145 liver-patient and 200 non-liver-patient records) with 6 attributes, as shown in Table 1. The first 5 variables are from blood tests and are sensitive to liver disorders that might arise from excessive alcohol consumption. All features are positively related to the response in a general sense; that is, the higher the values of the variables, the higher the probability that the corresponding subject has a liver disorder. The performances of the two methods (the proposed one and ALSD) in terms of misclassification error are illustrated in Figure 5. Our method performs similarly to ALSD when . However, in the proposed method, because , we only need to evaluate candidates at each stage, which is fewer than the number of candidates () used in ALSD. That is, we assess a smaller number of candidates, which saves a lot of computational time.
| Mcv | Integer | Mean corpuscular volume |
| Drinks | Real | Number of half-pint equivalents of alcoholic beverages drunk per day |
As in the previous section, we also start with different sizes as the initial data set. The total size of BUPA is (about ), and we set , and . Figure 5 shows that our method performs better than ALSD as gets larger, and the difference between the two curves increases as increases. It is worth noting that even with , at around 130 training samples the proposed method achieves the same classification performance as ALSD at 150 labeled data points. That is, it saves about 10 labeled samples in total in this case. Similar situations can be found in the cases with other ’s. For example, with , the proposed method requires, on average, only 110 labeled subjects to achieve the performance of ALSD with 150 labeled samples. Because it is a sequential process, this implies that the proposed method requires fewer training stages to achieve the same performance level, and is therefore more efficient in terms of training time. In practice, there is some cost for experts to label subjects. Thus, saving labeled samples saves not only learning time, but also the budget of a learning process.
4.2 Application to WDBC data set
The WDBC data set contains breast masses with benign and malignant cases. Ten different features are measured, including radius, perimeter, area, compactness, smoothness, concavity, concave points, symmetry, fractal dimension and texture. All features are numerically modeled such that larger values typically indicate a higher likelihood of malignancy (see Street et al., 1993). The details can also be found in Wolberg et al. (1994) and Mu and Nandi (2008). The mean value, extreme (largest or “worst”) value and standard error of each feature are computed for each image, resulting in a total of 30 features for each of the 569 images and yielding a database of samples.
We apply both the proposed method and ALSD to the WDBC data set, and their misclassification error curves are shown in Figure 6 with different and for the proposed method and for ALSD. When the initial training sample size increases, the proposed method outperforms ALSD, as expected. It is worth noting that the misclassification errors of processes starting with a small amount of labeled data () are smaller than those of ALSD from the very beginning. It also shows that with a small amount of initial training subjects, the proposed method achieves the same classification performance sooner than ALSD, that is, at a stage with fewer labeled samples. For example, in Figure 6 (d), with around 75 to 90 labeled samples, the proposed method achieves a misclassification error similar to that of ALSD with 150 labeled samples. Hence, even using 45 labeled subjects at the initial stage, we still save about 15 to 30 subjects. Thus, starting with a small amount of labeled samples as the initial training set is actually more efficient in both cost and computational time.
4.3 Active learning when group sizes are uneven
When either the ratio of the two group sizes or the odds ratio of the two groups is extreme, the classification rule should take this information into consideration. Deng et al. (2009) suggested using adjusted based on the probability of a case when prior information is available. (Note that in Deng et al. (2009), the same number, denoted as , is used for both the uncertainty measurement and the event probability adjustment.) In this section, we conduct some numerical studies with uneven group sizes. The results in Figure 7 are based on simulated data with a group size ratio of 1 to 4. We first set both the uncertainty probability () and the cutting point () of the proposed method equal to 0.8, and set in ALSD. It can be seen from Figure 7 (a) that the misclassification rate of the proposed method under such a setup is worse than that of ALSD with . However, if we set the uncertainty sampling probability equal to and adjust for the uneven group sizes by shifting the cutting point based on the ratio of the two sample sizes (i.e. ), then the performance of the proposed method is much improved (see Figure 7 (b)), and is better than that of ALSD. (All the learning curves in Figure 7 are based on the average of 100 replications of each method.) We also conducted simulations for other group size ratios; the results are all similar and are therefore omitted here.
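The separation of the two roles can be sketched as follows. The exact shift used above is elided in the text, so for illustration we take the observed case proportion in the labeled sample as the cutting point (an assumption on our part); uncertainty sampling elsewhere still targets probabilities near 0.5:

```python
import numpy as np

def classify_adjusted(p_hat, y_labeled):
    """Classify fitted probabilities with a cutting point shifted to
    the observed case proportion (illustrative stand-in for the
    group-size adjustment), rather than the default 0.5."""
    cut = np.mean(y_labeled)            # proportion of cases (y = 1)
    return (p_hat >= cut).astype(int)

y_lab = np.array([1, 0, 0, 0, 0])       # 1-to-4 group sizes -> cut = 0.2
pred = classify_adjusted(np.array([0.1, 0.3, 0.6]), y_lab)
```

The point of the design is that the uncertainty parameter governs which subjects are queried, while the cutting point governs how subjects are classified; with uneven groups these two jobs call for different values.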
Real Data Examples
Similar results are obtained when we apply both methods to the BUPA and WDBC data sets. Figure 8 shows the results of three different methods: ALSD, and the proposed method with and without sample size adjustment. The last two methods are denoted as “proposed-1” and “proposed”, respectively.
The ratio of the sample sizes of the two groups in the BUPA data set is 0.58, which is close to 0.5. Hence, the effect of uneven group sizes is not obvious, especially when . In the WDBC data set, the ratio is 0.627, which is somewhat farther from 0.5. We can see from Figure 8 (c) that the misclassification curve of the proposed method with an adjusted cutting point (proposed-1) becomes the best one when the number of cumulated labeled subjects is larger than around 40. Figure 8 (d) shows that with , the proposed method with an adjusted cutting point (proposed-1) is the most stable of the three methods from the very beginning.
The phenomenon of using two different parameters for uncertainty measure and cutting threshold in fact can be explained from a statistical decision theory viewpoint. The details are discussed in Appendix A.
Active learning selects its own training samples in a sequential manner, requires fewer labeled instances from domain experts, and still achieves high classification performance. In this paper, we focus on the higher dimensional case and propose a new subject selection scheme that combines a Bayesian D-optimal design with an uncertainty sampling method. The proposed method thus inherits the advantages of stochastic approximation and optimal design, as suggested in Wu (1985). Because a Bayesian D-optimal design is used, the active learning process is more stable in high dimensional cases, even when the information matrix is nearly singular, and is therefore more suitable for modern analysis with large data sets. In addition, we also demonstrate that with a small amount of labeled subjects as an initial training set, the active learning process is more stable and efficient in both training time and the size of the labeled data. For the case of uneven group sizes, we suggest using separate parameters to control uncertainty sampling and to adjust the cutting threshold for better performance. From our numerical studies, we found that the uncertainty measure and the probability of an event may play different roles in an active learning process, especially when the sizes of the two groups are uneven. Using the uncertainty measure at 0.5 and then adjusting the boundary according to the proportion of group sizes, as in classical logistic regression models, produced better results in our studies.
These types of methods are suitable for problems with large amounts of unlabeled data available, and they have great potential for analyzing “big data” problems. From a practical viewpoint, including one new subject at a time is impractical, not only because of computational efficiency but also because of operational complexity. This is similar to the situation in clinical trials, where sampling in batches, as in a group sequential procedure, is usually preferred. Moreover, labeling an unclassified subject is not only time consuming; there are also operational costs such as experts’ fees. Hence, how to conduct an active learning process with batches of updated subjects, and how to construct a classification rule with satisfactory performance under a given budget constraint, are important problems from both practical and theoretical viewpoints.
- Bache and Lichman (2013) Bache, K., Lichman, M., 2013. UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. URL: http://archive.ics.uci.edu/ml.
- Chaloner and Larntz (1989) Chaloner, K., Larntz, K., 1989. Optimal Bayesian design applied to logistic regression experiments. Journal of Statistical Planning and Inference 21, 191 – 208.
- Chang (2001) Chang, Y.c.I., 2001. Sequential confidence regions of generalized linear models with adaptive designs. Journal of Statistical Planning and Inference 93, 277 – 293.
- Cohn et al. (1996) Cohn, D.A., Ghahramani, Z., Jordan, M.I., 1996. Active learning with statistical models. Journal of Artificial Intelligence Research 4, 129 – 145.
- Deng et al. (2009) Deng, X.W., Joseph, V.R., Sudjianto, A., Wu, C.F.J., 2009. Active learning through sequential design, with applications to detection of money laundering. Journal of the American Statistical Association 104, 969 – 981.
- Dror and Steinberg (2008) Dror, H.A., Steinberg, D.M., 2008. Sequential experimental designs for generalized linear models. Journal of the American Statistical Association 103, 288 – 298.
- Firth and Hinde (1997) Firth, D., Hinde, J., 1997. On Bayesian D-optimum design criteria and the equivalence theorem in non-linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59, 793–797.
- Givens and Hoeting (2012) Givens, G.H., Hoeting, J.A., 2012. Computational statistics. volume 708. John Wiley & Sons.
- Joseph (2004) Joseph, V.R., 2004. Efficient Robbins-Monro procedure for binary data. Biometrika 91, 461 – 470.
- Joseph et al. (2007) Joseph, V.R., Tian, Y., Wu, C.F.J., 2007. Adaptive designs for stochastic root-finding. Statistica Sinica 17, 1549.
- Lewis and Catlett (1994) Lewis, D.D., Catlett, J., 1994. Heterogeneous uncertainty sampling for supervised learning, in: ICML, pp. 148–156.
- Lewis and Gale (1994) Lewis, D.D., Gale, W.A., 1994. A sequential algorithm for training text classifiers, in: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3 – 12.
- Mu and Nandi (2008) Mu, T., Nandi, A.K., 2008. Breast cancer diagnosis from fine-needle aspiration using supervised compact hyperspheres and establishment of confidence of malignancy, in: The 16th European Signal Processing Conference, EUSIPCO, Lausanne, Switzerland.
- Niederreiter (1988) Niederreiter, H., 1988. Low-discrepancy and low-dispersion sequences. Journal of Number Theory 30, 51–70.
- Robbins and Monro (1951) Robbins, H., Monro, S., 1951. A stochastic approximation method. Annals of Mathematical Statistics 22, 400 – 407.
- Seeger (2000) Seeger, M., 2000. Learning with labeled and unlabeled data (technical report). Edinburgh University.
- Settles (2010) Settles, B., 2010. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison.
- Silvapulle (1981) Silvapulle, M.J., 1981. On the existence of maximum likelihood estimators of the binomial response model. Journal of the Royal Statistical Society, Series B 43, 310 – 313.
- Street et al. (1993) Street, W.N., Wolberg, W.H., Mangasarian, O.L., 1993. Nuclear feature extraction for breast tumor diagnosis, in: IS&T/SPIE’s Symposium on Electronic Imaging: Science and Technology, International Society for Optics and Photonics. pp. 861–870.
- Webb and Copsey (2011) Webb, A., Copsey, K., 2011. Statistical Pattern Recognition. 3rd ed., Wiley.
- Wolberg et al. (1994) Wolberg, W.H., Street, W.N., Mangasarian, O., 1994. Machine learning techniques to diagnose breast cancer from image-processed nuclear features of fine needle aspirates. Cancer letters 77, 163–171.
- Woods et al. (2006) Woods, D., Lewis, S., Eccleston, J., Russell, K., 2006. Designs for generalized linear models with several variables and model uncertainty. Technometrics 48, 284–292.
- Wu (1985) Wu, C.F.J., 1985. Efficient sequential designs with binary data. Journal of the American Statistical Association 80, 974 – 984.
- Yu et al. (2006) Yu, K., Bi, J., Tresp, V., 2006. Active learning via transductive experimental design, in: Proceedings of the 23rd International Conference on Machine Learning.
- Zacks (2008) Zacks, S., 2008. Adaptive Designs for Generalized Linear Models. John Wiley & Sons, Inc., chapter 3, pp. 55–74. doi:10.1002/9780470466957.ch3.
Appendix A: Statistical Decision Theory Viewpoint
Let $\pi_0$ and $\pi_1$ be the prior probabilities of the two groups, and let $P(0 \mid x)$ and $P(1 \mid x)$ be the corresponding posterior probabilities given $x$. In conventional statistical decision theory, when the prior probabilities $\pi_0$ and $\pi_1$ are known and there is no other information available, the best decision rule $d(\cdot)$, for any given subject, is: $d(x) = 0$ if $\pi_0 \geq \pi_1$; $d(x) = 1$ otherwise. (The decision function $d(x) = 0$ denotes that the subject with explanatory variable $x$ is assigned to Class 0, and vice versa.) When a logistic model is assumed, and the odds ratio satisfies $P(1 \mid x)/P(0 \mid x) = \exp(x^\top \beta)$, the problem becomes how to estimate the unknown $\beta$, and the decision rule is made based on the posterior probabilities given the observed $x$; that is, $d(x) = 1$ if $P(1 \mid x) > 1/2$, and $d(x) = 0$ otherwise. It follows from the Bayes formula that this decision rule is equivalent to

$$d(x) = 1 \quad \text{if} \quad \pi_1 f(x \mid 1) > \pi_0 f(x \mid 0), \tag{17}$$

where $f(x \mid k)$ denotes the density of $x$ in Class $k$. That is, when a logistic model is used in a classification problem, the prior probabilities of the two groups are already taken into account through (17). Thus, it suffices to use $P(1 \mid x)$ to measure the uncertainty.
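As a quick numerical check of this equivalence, the following sketch compares the two forms of the rule under Gaussian class-conditional densities; the densities, parameter values, and function name are chosen purely for illustration.

```python
import math

def posterior_p1(x, pi1, mu0, mu1, sigma=1.0):
    """P(1|x) via the Bayes formula with Gaussian class-conditional densities."""
    f0 = math.exp(-((x - mu0) ** 2) / (2 * sigma ** 2))
    f1 = math.exp(-((x - mu1) ** 2) / (2 * sigma ** 2))
    return pi1 * f1 / ((1 - pi1) * f0 + pi1 * f1)

# P(1|x) > 1/2 holds exactly when pi1 * f(x|1) > pi0 * f(x|0), as in (17).
pi1, mu0, mu1 = 0.3, 0.0, 2.0
for x in [-1.0, 0.5, 1.0, 1.5, 3.0]:
    f0 = math.exp(-((x - mu0) ** 2) / 2)
    f1 = math.exp(-((x - mu1) ** 2) / 2)
    assert (posterior_p1(x, pi1, mu0, mu1) > 0.5) == (pi1 * f1 > (1 - pi1) * f0)
```

The loop verifies that both forms of the decision rule agree at every test point, whichever side of the boundary the point falls on.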
Moreover, let $c_0$ and $c_1$ be the misclassification costs of false positive and false negative errors, respectively. If we introduce these costs of misclassification into the decision rule, then the Bayes decision rule becomes $d(x) = 1$ if $P(1 \mid x) > c_0/(c_0 + c_1)$; $d(x) = 0$ otherwise. Because $c_0$ and $c_1$ can be treated as weights of the two types of misclassification errors, we can assume that $c_0 + c_1 = 1$, and the overall misclassification error becomes $c_0 P_{FP} + c_1 P_{FN}$, where $P_{FP}$ and $P_{FN}$ denote the false positive and false negative probabilities. This weighted misclassification error can usually be estimated by $(c_0 n_{FP} + c_1 n_{FN})/n$, where $n_{FP}$ and $n_{FN}$ are the numbers of false positive and false negative results (see Webb and Copsey, 2011). In fact, Deng et al. (2009, page 975) also measured misclassification using this formula, written in their own notation. That is, in their paper the same parameter is used both to measure the uncertainty and to adjust the weights of the two types of errors. From the discussion above, it is reasonable to treat the uncertainty measure and the weights of the misclassification errors separately. When the prior probabilities are known, we can use them to adjust the cutting point in order to minimize the weighted misclassification error, but not the uncertainty measure.
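Writing the false positive and false negative costs as c0 and c1, the cost-weighted Bayes rule predicts Class 1 when P(1|x) exceeds c0/(c0+c1), and the weighted error is estimated from the error counts. The following sketch illustrates both quantities; the cost values are arbitrary and the helper names are hypothetical.

```python
def bayes_threshold(c0, c1):
    """Cost-weighted Bayes rule: predict Class 1 when P(1|x) > c0/(c0 + c1)."""
    return c0 / (c0 + c1)

def weighted_error(y_true, y_pred, c0, c1):
    """Estimate the weighted misclassification error by
    (c0 * n_FP + c1 * n_FN) / n."""
    n_fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    n_fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return (c0 * n_fp + c1 * n_fn) / len(y_true)

# With c0 + c1 = 1 and a costlier false negative (c1 = 0.8),
# the cutting point drops from 0.5 to 0.2.
cut = bayes_threshold(0.2, 0.8)
err = weighted_error([0, 0, 1, 1], [0, 1, 1, 0], 0.2, 0.8)  # one FP, one FN
```

With equal costs (c0 = c1 = 0.5) the threshold reduces to the familiar 0.5 cut, which is the sense in which the uncertainty measure and the error weights play separate roles.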
In practice, these probabilities are usually unknown, which motivates an interesting future study: can we use the estimated ratio of sample sizes to adjust the cutting point? Moreover, because active learning processes are conducted sequentially, it is natural to ask whether we can apply a stopping rule to a learning process with a pre-fixed performance target. All these issues are related to sequential estimation of the event probability under adaptive sampling, and the results will be reported elsewhere.
Using $\hat{E}[h(\theta) \mid \mathcal{D}]$ as an approximation to $E[h(\theta) \mid \mathcal{D}]$

Here the expectation is taken with respect to the current posterior distribution of $\theta$: $\pi(\theta)$ is the prior distribution on $\theta$, and $p(\theta \mid \mathcal{D})$ is the current posterior distribution of $\theta$ given the labeled data $\mathcal{D}$.

According to the importance sampling approach discussed in Givens and Hoeting (2012), $E[h(\theta) \mid \mathcal{D}]$ can be written in the form

$$E[h(\theta) \mid \mathcal{D}] = \int h(\theta)\, \frac{p(\theta \mid \mathcal{D})}{\pi(\theta)}\, \pi(\theta)\, d\theta,$$

where the prior $\pi(\theta)$ serves as the importance sampling distribution. Draw $\theta_1, \ldots, \theta_m$ from $\pi(\theta)$ and then the estimator is

$$\hat{E}[h(\theta) \mid \mathcal{D}] = \sum_{i=1}^{m} w_i^{*}\, h(\theta_i),$$

where $w_i^{*} = w_i / \sum_{j=1}^{m} w_j$ and $w_i = p(\theta_i \mid \mathcal{D}) / \pi(\theta_i)$. Because $p(\theta \mid \mathcal{D}) = c\, L(\theta \mid \mathcal{D})\, \pi(\theta)$, where $c$ is a normalizing constant and $L(\theta \mid \mathcal{D})$ is the likelihood, we obtain $w_i = c\, L(\theta_i \mid \mathcal{D})$. Therefore,

$$w_i^{*} = \frac{L(\theta_i \mid \mathcal{D})}{\sum_{j=1}^{m} L(\theta_j \mid \mathcal{D})},$$

so the normalized weights can be computed from the likelihood alone, without evaluating the posterior normalizing constant.
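A minimal sketch of this kind of estimator, assuming a generic log-likelihood and a prior we can sample from (all names here are illustrative): it draws from the prior, weights each draw by the likelihood, and normalizes the weights. A Gaussian toy model, for which the exact posterior mean is known, serves as a sanity check.

```python
import math
import random

def posterior_mean_is(h, log_lik, sample_prior, m=5000, seed=1):
    """Self-normalized importance sampling with the prior as the importance
    distribution: draw theta_i ~ prior, weight by the likelihood, normalize."""
    rng = random.Random(seed)
    thetas = [sample_prior(rng) for _ in range(m)]
    log_w = [log_lik(t) for t in thetas]
    mx = max(log_w)                       # subtract max for numerical stability
    w = [math.exp(lw - mx) for lw in log_w]
    total = sum(w)
    return sum(wi * h(ti) for wi, ti in zip(w, thetas)) / total

# Toy check: theta ~ N(0,1) prior, one observation y = 1 from N(theta, 1);
# the exact posterior is N(0.5, 0.5), so the estimate should be near 0.5.
est = posterior_mean_is(
    h=lambda t: t,
    log_lik=lambda t: -0.5 * (1.0 - t) ** 2,
    sample_prior=lambda rng: rng.gauss(0.0, 1.0),
)
```

Subtracting the maximum log-weight before exponentiating leaves the normalized weights unchanged but avoids underflow when the likelihood values are small, which is the practical point of working with the likelihood ratio form above.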