Constraining the Parameters of High-Dimensional Models with Active Learning
Abstract
Constraining the parameters of physical models with many parameters is a widespread problem in fields like particle physics and astronomy. The generation of data to explore this parameter space often requires large amounts of computational resources. Reducing the number of relevant physical parameters to save resources hampers the generality of the results. In this paper we show that this problem can be alleviated by the use of active learning. We illustrate this with examples from high energy physics, a field where computationally expensive simulations and large parameter spaces are common. We show that the active learning techniques query-by-committee and query-by-dropout-committee allow for the identification of model points in interesting regions of high-dimensional parameter spaces (e.g. around decision boundaries). This makes it possible to constrain model parameters more efficiently than is currently done with the most common sampling algorithms. Code implementing active learning can be found on GitHub.
I Introduction
With the rise of computational power seen over the last decades, science has gained the power to evaluate predictions of new theories and models at unprecedented speeds. Determining the output or predictions of a model given a set of input parameters often boils down to running a program and waiting for it to finish. The same is however not true for the inverse problem: determining which (ranges of) input parameters a model can take to produce a certain output (e.g., finding which input parameters of a universe simulation yield a universe that looks like ours) is still a challenging problem. In fields like high energy physics and astronomy, where high-dimensional models are widespread, determining which model parameter sets are still allowed given experimental data is a time-consuming process that is currently often approached by looking only at lower-dimensional simplified models. Not only does this still require large amounts of computational resources, it generally also reduces the range of possible physics the model is able to explain.
In this paper we approach this problem by exploring the use of active learning [22, 23, 9], an iterative method that applies machine learning to guide the sampling of new model points to specific regions of the parameter space. Active learning reduces the time spent on expensive simulations by evaluating only points that are expected to lie in regions of interest. As this is done iteratively, the resolution of the true boundary increases with each iteration. For classification problems this results in the selection (i.e. the sampling) of points around decision boundaries, and thereby in a better resolution of these boundaries, as can be seen in Figure 1. In this paper we investigate a technique called query-by-committee [23], which allows for the use of active learning in high-dimensional parameter spaces.
The paper is structured as follows: in Section II we explain how active learning works. In Section III we show applications of active learning to determine decision boundaries of a model in the context of high energy physics, working in model spaces of a 19-dimensional supersymmetry (SUSY) model [footnote 1]. We conclude the paper in Section IV with a summary and future research directions.
[Footnote 1] Supersymmetry (SUSY) is a theory that extends the current theory of particles and particle interactions by adding another space-time symmetry. It predicts the existence of new particles which could be measured in particle physics experiments, if supersymmetry is realised in nature.
II Active Learning
Simulations are nowadays widespread in science. However, as these can be computationally expensive to run, exploring the output space of these simulations can be a costly endeavour. Approximations of simulations can however be constructed in the form of machine learning estimators, which are typically quick to evaluate. Active learning leverages this speed, exploiting the ability to quickly estimate how much information can be gained by querying a specific point to the true (expensive) labeling procedure.
Active learning works as an iterative sampling technique. In this paper we specifically explore a technique called pool-based sampling [22], of which a diagrammatic representation can be found in Figure 2. In this technique an initial data set is sampled from the parameter space and queried to the labeling procedure (also called the oracle). After the labels have been retrieved, one or more machine learning estimators are trained on the available labeled data. This estimator (or set of estimators) can then provide an approximation of the boundary of the region of interest. We gather a set of candidate (unlabeled) data points, which can for example be sampled randomly or be generated through some simulation, and provide these to the trained estimator. The output of the estimator can then be used to identify which points should be queried to the oracle. For a classification problem this might for example entail finding out which of the candidate points the estimator is most uncertain about. As only these points are queried to the oracle, no time is spent on evaluating points which are not expected to yield significant information about the region of interest. The selected data points and their labels are then added to the total data set. This procedure of creating an estimator, collecting points, finding the most interesting points with respect to the region of interest, labeling them and adding them to the data set can be repeated to get an increasingly better estimate of the region of interest. It can be stopped when, for example, the collected data set reaches a certain size or when the performance increase between iterations becomes smaller than a predetermined threshold.
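The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the `oracle` is a hypothetical toy function (label 1 inside the unit circle), the estimator is a simple nearest-neighbour vote that needs no explicit training step, and the candidate pool is sampled anew in every iteration. The argument names `size_initial`, `size_sample` and `size_select` follow the hyperparameters of Appendix A; all other names are illustrative.

```python
import random


def oracle(x):
    """Toy stand-in for the expensive labeling procedure (an assumption):
    label 1 inside the unit circle, 0 outside."""
    return 1 if x[0] ** 2 + x[1] ** 2 < 1.0 else 0


def sample_points(n, rng):
    """Draw n points uniformly from the square [-2, 2] x [-2, 2]."""
    return [(rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)) for _ in range(n)]


def predict(labeled, x, k=5):
    """Cheap estimator: fraction of the k nearest labeled points with label 1."""
    nearest = sorted(
        labeled,
        key=lambda item: (item[0][0] - x[0]) ** 2 + (item[0][1] - x[1]) ** 2,
    )[:k]
    return sum(label for _, label in nearest) / len(nearest)


def active_learning(size_initial=50, size_sample=500, size_select=25,
                    n_iterations=5, seed=0):
    """Return the labeled data set collected by pool-based active learning."""
    rng = random.Random(seed)
    # Step 1: label an initial random sample with the oracle.
    labeled = [(x, oracle(x)) for x in sample_points(size_initial, rng)]
    for _ in range(n_iterations):
        # Step 2: gather a fresh pool of unlabeled candidate points.
        candidates = sample_points(size_sample, rng)
        # Step 3: rank the candidates by how close their prediction is to 0.5.
        candidates.sort(key=lambda x: abs(predict(labeled, x) - 0.5))
        # Step 4: query only the most uncertain candidates to the oracle.
        selected = candidates[:size_select]
        labeled.extend((x, oracle(x)) for x in selected)
    return labeled


labeled = active_learning()
print(len(labeled))  # 50 initial points + 5 iterations x 25 selected points
```

In this toy setting the queried points cluster around the circle, which is the behaviour the procedure is designed to produce. A stopping criterion based on the performance increase between iterations could replace the fixed number of iterations.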
It should be noted that the active learning procedure as described above has hyperparameters: the size of the initial data set, the size of the pool of candidate data points and the number of candidate data points queried to the oracle in each iteration. Finding the optimal configuration for the active learning procedure requires a dedicated search. As we intend to show the added benefit of active learning and not what the absolute best performance of active learning is, we did not perform an extensive grid search for the optimisation. Instead we performed a small random search on the hyperparameters of the experiments in Section III and selected the best configuration for all experiments. For completeness a discussion on the hyperparameters can be found in Appendix A. We do want to note that in every active learning configuration we experimented with, active learning performed at least as well as random sampling.
In Figure 2 arguably the most important step is to select those points that ought to be queried to the labeling procedure from a large set of candidate data points. As the problems we look at here are classification problems, the closeness to the boundary can be estimated by the uncertainty of the trained estimator on the classification of the model point.
This uncertainty can for example be obtained from an algorithm like Gaussian Processes [20], which has already been successfully applied in high energy physics to steer sampling of new points around 2-dimensional exclusion boundaries [10]. Due to the computational complexity of this algorithm it is however limited to low-dimensional parameter spaces, as it scales at best with the number of data points squared [13]. Because of this, we specifically investigate the query-by-committee and query-by-dropout-committee schemes.
II.1 Query-by-Committee (QBC)
By training multiple machine learning estimators on the same data set, one could use their disagreement on the prediction for a data point as a measure for uncertainty. Points with a high disagreement in their predictions are expected to provide the highest information gain. This method is called query-by-committee (QBC) [23]. To create and enhance the disagreement among the committee members in uncertain regions, one can change the training set for each estimator (e.g. via bagging [6]) or vary the configuration of the estimators (e.g. when using a committee of neural networks, each of these could have a different architecture or different initial conditions), such that the ensemble has a reasonable amount of diversity.
The disagreement among the estimators can for example be quantified by the standard deviation. For binary classification problems it can even be done by taking the mean of the outputs of the set of estimators. If the classes are encoded as 0 and 1, a mean output of 0.5 would mean maximal uncertainty, so an uncertainty measure for $N$ estimators could for example be

$$\mathrm{uncertainty} = 1 - 2\left|\frac{1}{N}\sum_{i=1}^{N}\mathrm{prediction}_i - \frac{1}{2}\right| \qquad (1)$$

An uncertainty of 1.0 would indicate maximum uncertainty, whereas an uncertainty of 0.0 means that all estimators agree.
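As a small worked example, the sketch below implements one uncertainty measure with the two properties stated in the text: it equals 1.0 when the mean committee output is 0.5 and 0.0 when all members agree. The specific form, one minus twice the distance of the committee mean from 0.5, is an assumption made for illustration, and the function names `qbc_uncertainty` and `most_uncertain` are hypothetical.

```python
def qbc_uncertainty(predictions):
    """Committee uncertainty for binary labels encoded as 0 and 1.

    Returns 1.0 when the mean committee output is 0.5 (maximal disagreement)
    and 0.0 when all members agree on 0 or on 1.
    """
    mean = sum(predictions) / len(predictions)
    return 1.0 - 2.0 * abs(mean - 0.5)


def most_uncertain(committee_outputs, n_select):
    """Indices of the n_select candidates with the highest uncertainty.

    committee_outputs[i] holds the outputs of all members for candidate i.
    """
    scores = [qbc_uncertainty(outputs) for outputs in committee_outputs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_select]


outputs = [[0, 0, 0, 0], [0, 1, 0, 1], [1, 1, 1, 0]]
print([qbc_uncertainty(o) for o in outputs])  # [0.0, 1.0, 0.5]
print(most_uncertain(outputs, 2))             # [1, 2]
```

The committee outputs can be hard labels, as here, or class probabilities between 0 and 1. The measure works unchanged in both cases.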
The advantage of the QBC approach is that it is not bound to a specific estimator. If one were to use estimators of which the training scales linearly with the number of data points $N$, the active learning procedure would have a computational complexity of $\mathcal{O}(N)$ for each iteration. This allows for the use of large amounts of data, as is needed in high-dimensional parameter spaces.
II.2 Query-by-Dropout-Committee (QBDC)
The committee can also be built by using a technique called Monte Carlo dropout [12]. This technique uses a neural network with dropout layers [24] as the machine learning estimator. These dropout layers are normally used to prevent overtraining (i.e. increased performance on the training set at the cost of a reduction in performance on general data sets) by disabling a random fraction of the neurons in the preceding layer of the network at each evaluation of input data during training. In this way the network cannot learn to rely entirely on specific features and correlations in the input data, resulting in more robustness during inference. Dropout is then typically disabled when the network is used to create predictions on unseen data, so that the full network is used for inference.
In Monte Carlo dropout, on the other hand, these layers are left enabled during evaluation of input data, even after training, making the output of the network vary from one evaluation to the next. The prediction for a constant input will therefore change with each evaluation, and the number of times a prediction is made can then be interpreted as the number of members in a committee of a QBC approach. The advantage here however is that only a single network has to be trained [26, 11, 19]. Due to the use of Monte Carlo dropout, this method is called Query-by-Dropout-Committee (QBDC).
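The idea can be illustrated without a deep learning framework. The sketch below uses a tiny network with fixed, hypothetical weights (they stand in for a trained network and are not taken from the experiments) and keeps dropout active on the hidden layer during every forward pass. Repeated passes on the same input then play the role of committee members. The default committee size of 25 and dropout rate of 0.2 mirror the values listed in Appendix B. In Keras the same effect is obtained by calling the model with `training=True`.

```python
import math
import random
import statistics

# Hypothetical fixed weights of a tiny, already "trained" network:
# 2 inputs -> 4 hidden units (ReLU) -> 1 sigmoid output.
W_HIDDEN = [[1.5, -2.0], [-1.0, 0.5], [0.8, 1.2], [-0.6, -1.4]]
B_HIDDEN = [0.1, 0.2, -0.3, 0.4]
W_OUT = [1.0, -1.5, 0.7, 0.9]
B_OUT = -0.2


def forward(x, dropout_rate, rng):
    """One forward pass with dropout kept active on the hidden layer."""
    hidden = []
    for weights, bias in zip(W_HIDDEN, B_HIDDEN):
        activation = max(0.0, sum(w * xi for w, xi in zip(weights, x)) + bias)
        # Inverted dropout: drop the unit with probability dropout_rate,
        # otherwise rescale it so the expected activation is unchanged.
        keep = rng.random() >= dropout_rate
        hidden.append(activation / (1.0 - dropout_rate) if keep else 0.0)
    logit = sum(w * h for w, h in zip(W_OUT, hidden)) + B_OUT
    return 1.0 / (1.0 + math.exp(-logit))


def dropout_committee(x, n_members=25, dropout_rate=0.2, seed=0):
    """Outputs of n_members stochastic passes through the same network."""
    rng = random.Random(seed)
    return [forward(x, dropout_rate, rng) for _ in range(n_members)]


outputs = dropout_committee((0.5, 1.0))
# Mean prediction and spread of the dropout committee for this input.
print(statistics.mean(outputs), statistics.pstdev(outputs))
```

The spread of the outputs serves as the disagreement measure. With the dropout rate set to zero all passes coincide and the committee collapses to a single deterministic network.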
III Applications in HEP
In this section active learning is investigated using data sets from high energy physics. The experiments investigated here are all classification problems, as these have a clear region of interest: the decision boundary. It should be noted that the methods explored here also apply to regression problems with a region of interest (e.g. when searching for an optimum). Although active learning can also be used to improve the performance of a regression algorithm over the entire parameter space, whether or not this works is highly problem- and algorithm-dependent, as can for example be seen in Ref. [21].
III.1 Increase resolution of exclusion boundary
As no significant experimental signals indicating the presence of unknown physics have been found in “beyond the standard model” searches, the obtained experimental data are used to find the region in the model parameter space that is excluded (or not yet excluded) by experiment. Sampling the region around this boundary in high-dimensional spaces is highly non-trivial with conventional methods due to the curse of dimensionality.
We test the application of active learning on a 19-dimensional model of new physics (the 19-dimensional pMSSM [15]) as a method to tackle this problem. This test is related to earlier work on the generalisation of high-dimensional results, which resulted in SUSY-AI [7]. In that work the exclusion information on model points as determined by the ATLAS collaboration [25] was used; the same data is used in this study. We investigate three implementations of active learning: two Random Forest setups, one with a finite and the other with an infinite pool, and a setup with a QBDC. The performance of each of these is compared to the performance of random sampling, in order to evaluate the added value of active learning. This comparison is quantified using the following steps:
1. Call max_performance the maximum reached performance for random sampling;
2. Call $N_{\mathrm{random}}$ the number of data points needed for random sampling to reach max_performance;
3. Call $N_{\mathrm{active}}$ the minimum number of data points needed for active learning to reach max_performance;
4. Calculate the performance gain through

$$\mathrm{performance\ gain} = \frac{N_{\mathrm{random}}}{N_{\mathrm{active}}} \qquad (2)$$
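The steps above can be turned into a short calculation on two accuracy curves. The sketch assumes that the performance gain is the ratio of the number of points random sampling needs to the number of points active learning needs to reach max_performance, and that both curves are recorded at the same, increasing data set sizes. The function names and the example numbers are illustrative and do not come from the experiments.

```python
def points_needed(sizes, accuracies, target):
    """Smallest data set size at which the accuracy reaches the target."""
    for size, accuracy in zip(sizes, accuracies):
        if accuracy >= target:
            return size
    return None  # the target accuracy is never reached


def performance_gain(sizes, acc_random, acc_active):
    """Ratio of data set sizes needed to reach the best random-sampling accuracy."""
    max_performance = max(acc_random)
    n_random = points_needed(sizes, acc_random, max_performance)
    n_active = points_needed(sizes, acc_active, max_performance)
    if n_active is None:
        return None  # active learning never matches random sampling
    return n_random / n_active


# Made-up accuracy curves, for illustration only.
sizes = [1000, 2000, 3000, 4000]
acc_random = [0.70, 0.80, 0.85, 0.90]
acc_active = [0.80, 0.90, 0.92, 0.93]
print(performance_gain(sizes, acc_random, acc_active))  # 2.0
```

A gain above 1 means that active learning needs fewer oracle queries than random sampling to reach the same accuracy.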
The configurations of the experiments were explicitly made identical and were not optimized on their own. The results of the experiments are therefore not able to identify which setup works best and only serve to investigate whether, and if so by how much, each of these techniques outperforms random sampling in constraining parameters in highdimensional models.
III.1.1 Random Forest with a finite pool
Just as for SUSY-AI we trained a Random Forest classifier on the public ATLAS exclusion data set [25] (details on the configuration of this experiment can be found in Appendix B). This data set was split into three parts: an initial training set of 1,000 model points, a test set of 100,000 model points and a pool of the remaining model points. As the labeling of the points is 0 for excluded points and 1 for allowed points, after each training iteration the 1,000 new points with their Random Forest prediction closest to 0.5 (following the QBC scheme outlined in Section II.1) are selected from the pool and added to the training set. Using this expanded data set a new estimator is trained from scratch. The performance of this algorithm is determined using the test set.
This experiment is also performed with all points selected from the pool at random, so that a comparison of the performance of active learning and random sampling becomes possible. The results of both experiments are shown in Figure 3. The bands around the curves in this figure indicate the range in which the curves for 7 independent runs of the experiment lie. The figure shows that active learning outperforms random sampling initially, but after a while random sampling catches up in performance. The decrease in accuracy of the active learning method is caused by an overall lack of training data near the decision boundary: after approximately 70,000 points have been selected via active learning, new data points are necessarily selected further away from the decision boundary. This causes a relative decrease of the weight of the points around the decision boundary, degrading the generalisation performance.
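The depletion effect of a finite pool can be reproduced in a toy setting. In the sketch below the pool is a hypothetical one-dimensional grid and the estimator is a fixed function with its decision boundary at x = 0.5. Both are assumptions chosen to keep the example deterministic. Each batch takes the remaining points whose prediction is closest to 0.5, so later batches are forced to lie further from the boundary.

```python
def select_from_pool(pool, predict, n_select):
    """Split a finite pool into the n_select most uncertain points and the rest."""
    ranked = sorted(pool, key=lambda x: abs(predict(x) - 0.5))
    return ranked[:n_select], ranked[n_select:]


def predict(x):
    """Hypothetical fixed estimator with its decision boundary at x = 0.5."""
    return x


# Hypothetical finite pool: 101 evenly spaced points on [0, 1].
pool = [i / 100 for i in range(101)]

mean_distances = []
while pool:
    batch, pool = select_from_pool(pool, predict, n_select=20)
    # Mean distance of this batch to the decision boundary.
    mean_distances.append(sum(abs(x - 0.5) for x in batch) / len(batch))

# The mean distance grows from batch to batch as the pool is depleted.
print(mean_distances)
```

Sampling a fresh set of candidates in every iteration, as in the infinite pool experiments, removes this effect.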
III.1.2 Random Forest with an infinite pool
We replace the finite ATLAS data pool with a sampling procedure in which new points are sampled from a uniform prior over the training volume of SUSY-AI. Although in each iteration only a limited set of candidate points is considered, the fact that this set is sampled anew in each iteration guarantees that the decision boundary is never depleted of new candidate points. Because of this, the pool can be considered infinite. In contrast to the experiment in Section III.1.1, where labeling (i.e., excluded or allowed) was readily available, determining the true labeling of these newly sampled data points would be extremely costly. Because of this, SUSY-AI [7] was used as a stand-in for this labeling process [footnote 2]. Since we are training a Random Forest estimator, we retrained SUSY-AI as a neural network, to make sure the trained Random Forest estimator would not be able to exactly match the SUSY-AI model, as this would compromise the possibility to generalise the result beyond toy examples like this one. The accuracy of this neural network was comparable to the accuracy of the original SUSY-AI. Details on the technical implementation can be found in Appendix B.
[Footnote 2] Since SUSY-AI has an accuracy of 93.2% on the decision boundary described by the ATLAS data [25], active learning will not find the decision boundary described by the true labeling in the ATLAS data. However, as the goal of this example is to show that it is possible to find a decision boundary in a high-dimensional parameter space in the first place, we consider this not to be a problem.
The accuracy development as recorded in this experiment is shown in Figure 4. The bands again correspond to the ranges of the accuracy as measured over 7 independent runs of the experiment. The gain of active learning with respect to random sampling (as described by Equation 2) is a factor of 5 to 6. The overall reached accuracy is however lower than in Figure 3, but note that this experiment stopped at a fixed maximum data set size (100,000 points according to Table 1), whereas the previous experiment continued until its pool was empty.
III.1.3 QBDC with an infinite pool
To test the performance of QBDC, the infinite pool experiment above was repeated, but now with a QBDC setup. The technical details of the setup can be found in Appendix B. The accuracy development plot resulting from the experiment can be seen in Figure 5. The bands around the lines representing the accuracies for active learning and random sampling indicate the minimum and maximum accuracy obtained for the corresponding data after running the experiment 7 times. The performance gain (as defined in Equation 2) for active learning in this experiment lies in the range 3 to 4. QBDC sampling is approximately $N$ times faster than ensemble sampling with $N$ committee members for a fixed number of samples, as only one network has to be trained. However, as active learning outperforms random sampling by a factor of 3 to 4 here, whether it pays off depends on how expensive the training of the estimator is compared to the computational time saved on oracle evaluations.
Compared to Figures 3 and 4 the accuracies obtained in Figure 5 are significantly higher. This may be because the model trained to quantify the performance more strongly resembles the oracle (both of them are neural networks with a similar architecture), or because the neural network is inherently more capable of capturing the exclusion function. In the two earlier experiments the trained models were Random Forests that tried to replicate the true ATLAS exclusion function and the SUSY-AI neural network respectively.
III.2 Identifying uncertain regions and steering new searches
Instead of using active learning to iteratively increase the resolution of, for example, a decision boundary, the identification of uncertain regions of the parameter space on which active learning is built can also be used on its own to identify regions of interest.
For example, in high energy physics one could train an algorithm to identify model points around the exclusion boundary in a high-dimensional model. These model points could then be used as targets for new searches or even new experiments. This is an advantage over the conventional method of trying to optimise a 2-dimensional exclusion region in a plot, as this method works over the full dimensionality of the model and can thereby take a more detailed account of the underlying theory that is being tested. One could even go a step further by reusing the same pool for these search-improvement studies, so that regions of parameter space that no search has been able to exclude can be identified. Analogously, one could also apply this method to find targets for the design of a new experiment.
To test the application of this technique in the context of searches for new physics we trained a neural network on the publicly available ATLAS exclusion data on the pMSSM-19 [25], enhanced with the 13 TeV exclusion information as calculated in Ref. [4]. The technical setup is detailed in Appendix B. We sampled model points in the SUSY-AI parameter space [7] using a spectrum generator (SOFTSUSY 4.1.0 [3]) and selected the 1,000 points with the highest uncertainty following the QBDC technique outlined in Section III.1.3.
Figure 6 shows the sampled model points in the gluino mass versus LSP mass projection. The LSP mass was not directly one of the input parameters. From the fact that the selected points are nevertheless well sampled in the region of the decision boundary, we conclude that the active learning algorithm successfully found the decision boundary in the 19-dimensional model.
We conclude this section by noting that in all the active learning experiments in this section new points were selected exclusively with active learning. In more realistic scenarios the user can of course use a combination of random sampling and active learning, in order not to miss any features in parameter space that were either unexpected or not sampled by the initial dataset.
IV Conclusion
In this paper we illustrated the possibility to improve the resolution of regions of interest in high-dimensional parameter spaces. We specifically investigated query-by-committee and query-by-dropout-committee as tools to constrain parameters, and the possibility to improve the identification of uncertain regions in parameter space to steer the design of new searches. We find that all active learning strategies presented in this paper query the oracle more efficiently than random sampling, by up to a factor of 6.
One of the limiting factors of the techniques as presented in this paper is the fact that a pool of candidate points still needs to be sampled from the parameter space. If sampling candidate points randomly yields too few points of high enough interest, generative models can be used to sample candidate points more specifically.
Code showing the implementation of the three investigated active learning techniques is made public on GitHub: https://github.com/bstienen/activelearning.
Acknowledgements
This research is supported by the Netherlands eScience Center grant for the iDark: The intelligent Dark Matter Survey project.
Appendix A Active learning hyperparameters
The active learning procedure as implemented for this paper has three hyperparameters:
- size_initial: The size of the data set used at the start of the active learning procedure;
- size_sample: The size of the pool of candidate data points to be sampled in each iteration;
- size_select: The number of data points to select from the pool of candidate data points and query to the oracle.
Which settings are optimal depends on the problem at hand, although some general statements can be made about the possible values for these hyperparameters. To illustrate this we performed a hyperparameter optimisation for the experiment in Section III.1.2, although it should be noted that this optimisation was performed only for illustration purposes and was not used to configure the experiments in this paper.
The size_initial for example configures how well the first trained machine learning estimator approximates the oracle. If this approximation is bad, the first few sampling iterations will sample points in what will later turn out to be uninteresting regions. A higher value for size_initial would therefore be preferable over a smaller value, although this could diminish the initial motivation for active learning: avoiding having to run the oracle on points that are not interesting with respect to a specific goal.
The size_sample parameter however will have an optimum: if chosen too small the selected samples will be more spread out and possibly less interesting points will be queried to the oracle. If chosen too high on the other hand the data could be focused in a specific subset of the region of interest because the trained estimator happens to have a local minimum there. The existence of an optimal value for size_sample can be seen in Figure 7.
It should be noted that the location of the optimum does not only depend on size_sample, but also on size_select. If one were to set size_select to 1, the candidate pool should be as large as possible, in order to be sure that the selected point is really the most informative one that can be selected. This would avoid the selection of clustered data points, but it comes at the cost of having to run the procedure for more iterations in order to get the same size for the final data set. This would however be very expensive if the cost of training the ML estimator(s) is very high. The dependence of the accuracy on these two variables is shown in Figure 8, in which the accuracy gained in the last step of the active learning procedure is shown for different configurations of these two parameters. The script to generate this figure can be found on GitHub.
Appendix B Experiment configuration
All networks were trained using Keras [8] with a TensorFlow [2] backend linked to CUDA [17]. For the Random Forest implementation scikit-learn [18] was used.
Increase resolution of exclusion boundary
The configuration of the active learning procedure can be found in Table 1. The experiments are denoted by the section in this paper in which they were covered.
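Table 1: Configuration of the active learning procedure.

Setting          III.1.1           III.1.2    III.1.3
Initial dataset  10,000            10,000     10,000
Step size        2,500             2,500      2,500
#candidates      remaining pool    100,000    100,000
Maximum size     until pool empty  100,000    100,000
Committee size   100               100        25
#iterations      7                 7          7
#test points     1,000,000         1,000,000  1,000,000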
Random Forest with a finite pool
The trained Random Forest classifier followed the defaults of scikit-learn [18]: it consisted of 10 decision trees with Gini impurity as splitting criterion.
Random Forest with an infinite pool
Table 2: Architecture of the neural network used as the oracle.

Layer type     Config.    Output shape  Param. #
Input                     (None, 19)    0
Dense          500 nodes  (None, 500)   10,000
Activation     selu       (None, 500)   0
Dense          100 nodes  (None, 100)   50,100
Activation     selu       (None, 100)   0
Dense          100 nodes  (None, 100)   10,100
Activation     selu       (None, 100)   0
Dense          50 nodes   (None, 50)    5,050
Activation     selu       (None, 50)    0
Dense          2 nodes    (None, 2)     102
Activation     softmax    (None, 2)     0
Total params:                           75,352
For active learning we trained a Random Forest [5] classifier that consisted of 100 decision trees with Gini impurity as splitting criterion. All other settings were left at their default values.
As the oracle we used a neural network with the architecture in Table 2. This network was optimised using Adam [14] on the binary cross-entropy loss. The network was trained using the ATLAS pMSSM-19 dataset [1] for 300 epochs with the EarlyStopping [16] callback using a patience of 50.
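The parameter counts in the architecture tables can be checked by hand: a dense layer with n_in inputs and n_out nodes has n_in x n_out weights plus n_out biases, while activation and dropout layers add no trainable parameters. A short sketch of this arithmetic follows, where the helper name `dense_params` is illustrative.

```python
def dense_params(n_inputs, n_nodes):
    """Trainable parameters of a dense layer: weights plus one bias per node."""
    return n_inputs * n_nodes + n_nodes


# Input width (19 model parameters) followed by the widths of the dense layers.
layer_sizes = [19, 500, 100, 100, 50, 2]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts)       # [10000, 50100, 10100, 5050, 102]
print(sum(counts))  # 75352
```

The total agrees with the 75,352 parameters quoted for both network architectures.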
QBDC with an infinite pool
Table 3: Architecture of the neural network used for active learning.

Layer type     Config.    Output shape  Param. #
Input                     (None, 19)    0
Dense          500 nodes  (None, 500)   10,000
Activation     relu       (None, 500)   0
Dropout        0.2        (None, 500)   0
Dense          100 nodes  (None, 100)   50,100
Activation     relu       (None, 100)   0
Dropout        0.2        (None, 100)   0
Dense          100 nodes  (None, 100)   10,100
Activation     relu       (None, 100)   0
Dropout        0.2        (None, 100)   0
Dense          50 nodes   (None, 50)    5,050
Activation     relu       (None, 50)    0
Dropout        0.2        (None, 50)    0
Dense          2 nodes    (None, 2)     102
Activation     softmax    (None, 2)     0
Total params:                           75,352
The network architecture for the trained neural network used for active learning can be found in Table 3. The active learning network was optimised using Adam [14] on a binary cross-entropy loss. It was fitted on the data for 1000 epochs with a batch size of 1000 and the EarlyStopping [16] callback using a patience of 20. The oracle neural network from the infinite pool experiment described above is also used in this experiment.
Identifying uncertain regions and steering new searches
The network architecture for the trained neural network can be found in Table 3. The network was optimised using Adam [14] on a binary cross-entropy loss. It was fitted on the data for 1000 epochs with a batch size of 1000 and the EarlyStopping [16] callback using a patience of 50.
The network was trained on the z-score normalised ATLAS dataset [1] of 310,324 data points, of which 10% was used for validation.
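For reference, z-score normalisation rescales every input feature to zero mean and unit standard deviation before training. The sketch below shows this together with a 10% validation split. It assumes the population standard deviation and a random shuffle before splitting, neither of which is specified in the text, and the function names are illustrative.

```python
import random
import statistics


def zscore_columns(rows):
    """Rescale every column to zero mean and unit (population) standard deviation.

    Assumes no column is constant, since that would give a zero standard deviation.
    """
    columns = list(zip(*rows))
    means = [statistics.mean(column) for column in columns]
    stds = [statistics.pstdev(column) for column in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def train_validation_split(rows, validation_fraction=0.1, seed=0):
    """Shuffle the rows and hold out a fraction of them for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(round(validation_fraction * len(shuffled)))
    return shuffled[n_validation:], shuffled[:n_validation]


# Ten made-up data points with two features on very different scales.
data = [[float(i), 100.0 * i + 5.0] for i in range(10)]
normalised = zscore_columns(data)
train, validation = train_validation_split(normalised)
print(len(train), len(validation))  # 9 1
```

In practice the means and standard deviations would be stored, so that newly sampled candidate points can be transformed in the same way before they are passed to the network.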
References
 (1) Georges Aad et al. Summary of the ATLAS experiment’s sensitivity to supersymmetry after LHC Run 1 — interpreted in the phenomenological MSSM. JHEP, 10:134, 2015.
 (2) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 (3) B. C. Allanach. SOFTSUSY: a program for calculating supersymmetric spectra. Comput. Phys. Commun., 143:305–331, 2002.
 (4) Alan Barr and Jesse Liu. First interpretation of 13 TeV supersymmetry searches in the pMSSM. 2016.
 (5) Leo Breiman. Random forests. Machine Learning, 45(1):5–32, Oct 2001.
 (6) Breiman, L. Bagging Predictors. 1996.
 (7) Sascha Caron, Jong Soo Kim, Krzysztof Rolbiecki, Roberto Ruiz de Austri, and Bob Stienen. The BSM-AI project: SUSY-AI–generalizing LHC limits on supersymmetry with machine learning. Eur. Phys. J., C77(4):257, 2017.
 (8) François Chollet et al. Keras. https://keras.io, 2015.
 (9) David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, May 1994.
 (10) K. Cranmer, L. Heinrich, and G. Louppe. "Level-set estimation with Bayesian optimisation". https://indico.cern.ch/event/702612/contributions/2958660/. Accessed: 2019-02-05.
 (11) Melanie Ducoffe and Frederic Precioso. QBDC: Query by dropout committee for training deep supervised architecture. arXiv e-prints, page arXiv:1511.06412, Nov 2015.
 (12) Y. Gal and Z. Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. arXiv e-prints, June 2015.
 (13) Jacob R. Gardner, Geoff Pleiss, David Bindel, Kilian Q. Weinberger, and Andrew Gordon Wilson. GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration. arXiv e-prints, page arXiv:1809.11165, September 2018.
 (14) D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv e-prints, December 2014.
 (15) Stephen P. Martin. A Supersymmetry primer. pages 1–98, 1997. [Adv. Ser. Direct. High Energy Phys.18,1(1998)].
 (16) N. Morgan and H. Bourlard. Advances in neural information processing systems 2. chapter Generalization and Parameter Estimation in Feedforward Nets: Some Experiments, pages 630–637. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990.
 (17) John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40–53, March 2008.
 (18) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 (19) Remus Pop and Patric Fulop. Deep Ensemble Bayesian Active Learning: Addressing the Mode Collapse issue in Monte Carlo dropout via Ensembles. arXiv e-prints, page arXiv:1811.03897, Nov 2018.
 (20) Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
 (21) Andrew I. Schein and Lyle H. Ungar. Active learning for logistic regression: an evaluation. Machine Learning, 68(3):235–265, Oct 2007.
 (22) Burr Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.
 (23) H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, pages 287–294, New York, NY, USA, 1992. ACM.
 (24) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 06 2014.
 (25) The ATLAS collaboration. Summary of the ATLAS experiment’s sensitivity to supersymmetry after LHC Run 1 – interpreted in the phenomenological MSSM. Journal of High Energy Physics, 2015(10):134, Oct 2015.
 (26) Evgenii Tsymbalov, Maxim Panov, and Alexander Shapeev. Dropout-based active learning for regression. In Wil M. P. van der Aalst, Vladimir Batagelj, Goran Glavaš, Dmitry I. Ignatov, Michael Khachay, Sergei O. Kuznetsov, Olessia Koltsova, Irina A. Lomazova, Natalia Loukachevitch, Amedeo Napoli, Alexander Panchenko, Panos M. Pardalos, Marcello Pelillo, and Andrey V. Savchenko, editors, Analysis of Images, Social Networks and Texts, pages 247–258, Cham, 2018. Springer International Publishing.