Irreproducibility; Nothing is More Predictable In these matters the only certainty is that nothing is certain Pliny the Elder, AD2379
Abstract
The increasing ease of data capture and storage has led to a corresponding increase in the choice of data, the type of analysis performed on that data, and the complexity of the analysis performed. The main contribution of this paper is to show that the subjective choice of data and analysis methodology substantially impacts the identification of factors and outcomes of observational studies. This subjective variability of inference is at the heart of recent discussions around irreproducibility in scientific research. To demonstrate this subjective variability, data is taken from an existing study, where interest centres on understanding the factors associated with a young adult’s propensity to fall into the category of ‘not in employment, education or training’ (NEET). A fully probabilistic analysis is performed, set in a Bayesian framework and implemented using Reversible Jump Markov chain Monte Carlo (RJMCMC). The results show that different techniques lead to different inference but that models consisting of different factors often have the same predictive performance, whether the analysis is frequentist or Bayesian, making inference problematic. We demonstrate how the use of prior distributions in Bayesian techniques can be used to as a tool for assessing a factor’s importance.
1] David, Kohn
2, 5]Nick Glozier
2, 5]Ian B. Hickie
3, 6]Hugh DurrantWhyte
3,4]Sally Cripps
Introduction
Much has been written about the lack of reproducibility of scientific research in general, [Ioannidis2005, Gelman2014], and in epidemiology and public health in particular, [Peng2006]. There are many explanations put forward for this lack of reproducibility, which, broadly speaking, can be grouped into two categories. The first category is behavioural, such as the tendency to submit, publish and cite positive findings more often than negative ones. Psychology and psychiatry are possibly the worst offending disciplines in this category, [Fanelli2010, Fanelli2012]. The second category relates to the choice, quality and understanding of the statistical analysis. This category includes the confounding prevalent in observational studies, [Goodman2017], the failure to account for model uncertainty and multiplicities, [Scott2010], and the misuse and variability of the pvalue, [Nuzzo2014, Halsey2015].
Of course, the two categories are related: the bias for publishing positive results affects the choice of data used and the type of analysis performed. This is what [Simmons2011] and [Gelman2014] refer to as ‘degrees of researcher freedoms’.
In this paper we aim to show just how difficult the task of identifying factors which affect an outcome can be, when using observational data, and the consequent lack of reproducibility. To do this we perform a Bayesian analysis using data from a previous study [ODea2014], where interest centres around the identification of factors that most influence an individual’s propensity to fall into the category of ‘not in employment, education or training’ (NEET) [OECD2012]. This group of disengaged young people form a key political concern in many high income societies [Mascherini2012]. If a young person misses out on the key transition from education into adult roles they face a lifetime of social, economic and health disadvantage.
One key factor which influences a young adult’s propensity to become, and stay, disengaged appears to be poor mental health [Scott2013]. In Australia, a large programme of open access services for young adults aged 12 25, predominantly aimed at mental illhealth and associated problems, has been established and showed a much higher proportion of those seeking help at such services are NEET, than the youth population in general, [ODea2014]. In this cross sectional study NEET status in young people with mental health problems was influenced by age, sex, depression, substance use and criminal charges, but not education or cultural background as suggested by previous work, [Bynner2002]. Longterm NEET status was associated with chronic depression, or poorer cognitive function, but these inferences depended upon how the data were assessed, [Odea2016].
The study by [ODea2014] used a frequentist approach and provided a single final model, based on stepwise regression, limiting the number of potential explanatory factors according to some criteria. This is common practice in analytical epidemiology, where the goal is to identify a set of explanatory factors, often associated with each other, and derive causal inferences, acknowledging study design limitations and biases. In contrast to this frequentist approach, the Bayesian approach proposed in this paper, produces an ensemble of models, each model weighted by its posterior probability, acknowledging that these posterior probabilities depend on what was used as the initial certainties, or ‘prior’ probabilities. In this fashion a Bayesian approach accounts for model uncertainty referred to by [Scott2010].
The different results we present for this one cohort study, highlight how those wishing to prioritize or target interventions may be misled by the development of single models as opposed to approaches which use an ensemble of models, such as the Bayesian approach presented in this paper.
The dataset in the [ODea2014] study is of a very modest size by today’s standards, both in terms of the number of people and the number of predictive factors; it has observations on approximately 600 individuals and 50 possible explanatory factors. Yet we show that, even with a relatively small dataset, the conclusions depend upon the type of analysis performed. This is not surprising when one considers that the number of possible factor combinations that one can construct from 50 factors is , which is approximately the number of insects on earth.
This paper makes four contributions.

It demonstrates that the choices available to researchers in data processing and analysis, can lead to very different narratives for the factors underpinning the outcome, in this case youth disengagement.

It highlights the difficulty of identifying a ‘final model’ of important factors and combination of factors that best predicts an outcome and shows that the predictive performance, as measured by the receiver operating curve (ROC) area under curve (AUC) metric, can be very similar for models of different factor combinations.

It presents a Bayesian approach to quantifying model uncertainty, and compares the predictive performance of a Bayesian model averaging approach with the model selection approach of [ODea2014].

It investigates how the necessity of explicitly stating a prior probability for a factor’s inclusion, can be used as a tool for assessing a factor’s importance.
Methods
The methods sections is divided into three parts; a model and priors section which describes the Bayesian model and prior assumptions; a prediction and estimation section, where we describe how the parameters of the model are estimated and how predictive distributions are obtained; a data section where we give an overview of the data and provide references to a more detailed description.
Model and Priors
Suppose we have observations on individuals’ outcome (NEET status), , where if individual is classified as NEET and otherwise. We have corresponding measurements on possible explanatory factors (often called covariates in the statistical literature) for each of these individuals, contained in , where, , denotes the transpose and , is the row of and contains the factor measurements for individual . Note that the at the beginning of each row is for the intercept (or ”offset” in the machine learning literature). For example, if we have factors, age and depression score, and if individual is 22 years old with a depression score of 0.5, then .
A common way to model the dependence between and is to use a generalized linear model (GLM),
(1) 
where the notation , can be translated as the probability that individual is classified as NEET, given the factor measurements, , for that individual. In equation \eqrefeq_bin_reg, is the vector of regression coefficients and is some link function which maps a value on the real number line to the interval . A common choice for in this setting is a cumulative distribution function (CDF) such as a standard normal CDF (probit) or logistic (logit) CDF. This article uses the standard normal CDF as the link function, which we denote by , so that the data augmentation method of [Albert1993] can be implemented to facilitate the sampling scheme used to estimate the regression coefficients and quantify uncertainty.
Given the regression coefficients and the factors, the likelihood function is the product of the probability mass functions for independent Bernoulli random variables and is given by,
where .
To fully specify the model we place priors on those parameters needed to evaluate the likelihood, namely the regression coefficients. We wish to place priors on these regression coefficients to allow for the possibility that some factors on which we have measurements, do not have an impact on the future NEET status of an individual.
Specifically we introduce an indicator vector , where if factor is in the model and otherwise. Note that each possible value of the vector, , represents a combination of factors, or model. For example, if there are 4 possible factors and if , then this means that factors one and four are in the model, while factors two and three are not.
We follow [Smith96], [Holmes2006] and [Lamnisos2009], and many others, and allow the prior on the regression coefficients, , to depend upon this indicator vector as well as the factor measurements in . Let be the set of indices corresponding to those factors which are included in the model, and be the set of indices corresponding to those factors which are not included. Then our on prior is
The symbol “” means conditional on while the symbol “” means is distributed as. For example the notation can be translated as Given and , the regression coefficients have a normal distribution with mean zero and covariance matrix . In this paper is prespecifed and depends upon those factors corresponding to the set . The notation can be translated as the regression coefficients which are not included in the model, are identically zero.
To specify a prior on , we define to be the prior probability that the factor is in the model. There are a number of choices for . For example, we could assume that the probability that one factor is included in the model is independent, a priori, of the probability that another factor is included. In this case the prior for is . We could then choose the value of to correspond to some prior belief by eliciting expert opinion and have different values of for different factors and if we believe the probability that factor is in the model depends upon the presence of factor then we could induce dependency between and , by altering the prior on .
If we don’t wish to assume a particular value for , then we could replace the value of with a prior distribution, for example , which means that is equally likely to take any value between 0 and 1.
For this paper we take the subjective prior that is independent of , and to account for multiplicity we follow [Scott2010] by assuming that the average number of factors in the model is fixed, irrespective of the number of factors available for selection, . This prior has the effect of decreasing as the number of factors under consideration increases, and thus accounts for multiplicity.
To gauge the sensitivity of inference to this subjective choice of prior and to assess the importance of a factor to an individual’s NEET status we perform sensitivity analysis on this prior value, see Figure 4 in the Results section for details.
The full model and prior specification is
(2) 
where the notation denotes the Bernoulli distribution, with parameter .
Prediction and Estimation
Our article takes a Bayesian approach and makes inference regarding the importance of factors via the joint posterior distribution and predicts the unobserved NEET status of an individual, denoted by , given specific factor measurements, , via its posterior predictive density, . This density also accounts for the uncertainty surrounding the selection of factors, and the uncertainty surrounding the values of the regression coefficients corresponding to a particular factor selection, by averaging over all possible factor combinations and values of the regression coefficients, where the average is w.r.t the joint posterior distribution of and . This is commonly referred to as marginalization. Specifically we compute
(3)  
where and are samples drawn from the joint posterior . We approximate the LHS of equation \eqrefeq_marg by obtaining draw of and using RJMCMC as in [Lamnisos2009]. The MCMC sampling scheme and diagnostics respectively appear in Supplementary Note 2. MCMC sampling scheme and Supplementary Note 3. Diagnostics. The sampling scheme is coded in Matlab and can be downloaded from the Github repository https://github.com/divadnhok/bayesianvariableselection.
Data
[Purcell2015] describe the study methodology involving a longitudinal cohort design implemented within four HeadSpace youth mental health services in Australia. We use this dataset to compare the results from our approach with that of [ODea2014].
The model selection procedure of [ODea2014] is a two step process; single factor logistic regression models are estimated for those factors thought to be significant either from prior research or from expert opinion. Then a subset of these factors, those with single factor regression coefficients which are significantly different from zero, as judged by pvalue less than some prespecified value, are included in a multiple regression where stepwise regression is run to select the best combination of factors.
Although this is common practice, it is an adhoc procedure which can result in the omission of important factors in the final model, particularly in observational studies where confounds are problematic. A heatmap of correlations between all possible factors in [ODea2014] appears in Figure 1. Figure 1 shows that about half of the pairwise correlations exceed, in absolute terms, the value of 0.08 (the approximate upper limit of a confidence interval for a sample correlation coefficient in this context). For example the pairwise correlations between Anxiety and other factors exceeds 0.08 approximately 60% of the time. The same is true for the Rumination. These values suggest that confounds may be an issue in this study.
Additionally the stepwise procedure ignores model uncertainty, while in a Bayesian context factor and model uncertainty are quantified by estimating the marginal posterior probability (), and joint posterior probability (), respectively of that model. The is the probability that a single factor is included in the model, after accounting for uncertainty in the selection of all other factors, and is denoted by for . The is the probability that a given set of factors are included , and prediction proceeds by weighting all possible model predictions by this probability as in equation \eqrefeq_marg.
There is an extensive statistical and biostatistical literature on Bayesian variable selection and model averaging, which documents the advantages of Bayesian model averaging over model selection procedures, such as stepwise regression, not only in terms of better predictions, but also in terms of more accurate statistical inference, ([Raftery1997], [stodden2006model], [Wang04], [Viallefont01], and [Genell10]).
The factors used by [ODea2014], together with the pvalues for the single factor logistic regression models in the first step, and again for the multiple logistic regression model, in the second step, appear in the last two columns of Table 1. We use shortened, descriptive factor names but the full names of the factors corresponding to factors listed in [Purcell2015] can be found in Supplementary Note 1. Factor names.
Data availability
The datasets analysed during the current study are available at the Github repository https://github.com/divadnhok/bayesianvariableselection. Please note that the data associated with the factor Ethnicity have been omitted from this file but not from the analysis. This was done so that individuals could not be identified.
Results
The results section is split into four parts. First, we report the importance of single factors, as measured by the pvalue for the [ODea2014] analysis, and as measured by the , for the Bayesian analysis. In addition we report the estimation and associated uncertainty of regression coefficients corresponding to these factors. Second, we report the importance of factor combinations, as measured by the . Third, the evaluation of predictive performance of various models is documented. Fourth, we represent results on the sensitivity of to changes in prior specifications.
In each part we discuss three sets of results: the original results using a logistic regression, based on a reduced set of factors, from [ODea2014], which we call the Original Dataset; the results from the same factor set but using a Bayesian model, which we call the Small Dataset; and the results from a model based on all 50 potential predictors in the dataset as possible predictors, which we call the Large Dataset.
Identification of single factors
The single factor results appear in Table 1. Columns 1 and 2, titled and , contain the results reported by [ODea2014]. Column 1 reports the pvalue from running logistic regressions on single factors and column 2 reports the pvalues of the regression coefficients from running a multiple logistic regression, containing the factors for which the pvalue in the single factor models was less than .
Columns 3 and 4, titled , and , contain an estimate of the posterior mean and standard deviation of the regression coefficients respectively, conditional on that coefficient being nonzero. Column 5, titled , is an estimate of the marginal posterior probability of the factors considered in [ODea2014]. These three statistics, , and the , were used to summarize the posterior distributions of the regression coefficients because these posterior distributions all consist of a discrete and continuous component, and the posterior mean and standard deviation, are not sufficient statistics to describe such distributions.
Columns 6, 7 and 8 contain the corresponding values for the Large Dataset. Note that the results corresponding to the Large Dataset do not contain the results for all factors included in the Large Dataset, but instead show the same 20 factors used in the Small Dataset as well as those factors with the next ten highest s, after those factors already included in the Small Dataset. The full results for the Large Dataset can be found in Supplementary Note 4. Results. The s for individual factors are directly interpretable as the probability a factor is useful in predicting an individual’s NEET status, after accounting for the uncertainty in all other factors.
Original  Small  Large  

Variable  MPP %  MPP %  
Criminal  0.00  0.02  0.65  0.24  26  0.70  0.24  25 
Clinical stage  0.00  0.56  0.21  21  0.57  0.21  17  
Economic  0.00  0.08  0.33  0.18  4  0.36  0.20  3 
Sex  0.00  0.01  0.55  0.16  68  0.57  0.17  60 
Functioning  0.00  0.20  0.10  3  0.24  0.11  3  
Depression  0.00  0.00  0.32  0.09  85  0.35  0.10  86 
Age  0.00  0.00  0.28  0.08  59  0.29  0.08  55 
Cannabis  0.00  0.08  0.15  0.07  2  0.15  0.07  2 
Anxiety  0.01  0.04  0.12  1  0.01  0.13  0  
Discrimination  0.02  0.11  0.11  1  0.11  0.13  0  
Tobacco  0.05  0.07  0.09  0  0.09  0.10  0  
Alcohol  0.19  0.12  0.09  1  0.11  0.09  1  
Tertiary edu  0.50  0.44  0.24  5  
Immigrant dichotomous  0.65  
Indigenous  0.75  0.14  0.38  2  
Centre  0.92  
Ethnicity1  6.31  6.31  32  
Education1  0.48  0.29  3  
Education2  0.47  0.22  5  
Dissaving  0.38  0.4  24  
Parent M.I.  0.38  0.18  4  
Dissaving  0.38  0.21  3  
Unpaid expenses  0.31  0.20  2  
Phys neg  0.26  0.10  5  
BAS fun  0.25  0.09  5  
Mat auth  0.22  0.08  12  
BIS  0.21  0.09  3  
OASIS  0.18  0.12  1  
Sleep time  0.17  0.08  2  
Sexual abuse  0.16  0.10  1  
Pat overprot  0.15  0.09  1 
Identification of factor combinations
Table 2 summarizes the results for the top 20 combinations of factors (as measured by the ) for the Small Dataset. The first column lists the factor combinations while the second column contains the joint posterior probability, , of those factors. Column 3 contains the areaunderthecurve (AUC) of the Receiver Operating Characteristic (ROC) curve for predictions obtained by running a crossvalidated logistic regression using those factors in column 1. Table 3 reports analogous results for the Large Dataset.
Figure 2 contains a visualization to summarize how factors contribute to each of the models. Each row corresponds to a model and each column corresponds to a given factor being in that model. A black cell represents a given factor being included in a given model. The inference is that a column which is almost completely black indicates that this variable is present in the majority of the models and a row which is nearly black corresponds to a model which contains all variables.
Factor combinations  %  AUC% 

Age, Sex, Depression  15  70 
Sex, Depression  14  66 
Centre3, Age, Sex, Depression  7  71 
Age, Criminal, Depression  4  68 
Age, Depression  3  67 
Sex, Depression, Clinical stage  3  67 
Age, Sex, Criminal, Depression  2  70 
Centre3, Age, Criminal, Depression  2  71 
Age, Sex, Depression, Clinical stage  2  70 
Depression  2  63 
Sex, Criminal, Depression  2  67 
Centre3, Sex, Depression  1  68 
Centre3, Age, Sex, Criminal, Depression  1  72 
Criminal, Depression, Clinical stage  1  67 
Criminal, Depression  1  65 
Centre3, Age, Depression  1  69 
Age, Criminal, Depression, Clinical stage  1  70 
Depression, Clinical stage  1  65 
Sex, Economic, Depression  1  67 
Clinical stage  1  52 
Factor combinations  JPP %  AUC % 

Sex, Depression  5  66 
Age, Sex, Depression  4  69 
Sex, Ethnicity1, Depression  2  66 
Age, Sex, Ethnicity1, Depression  2  69 
Age, Criminal, Depression  2  69 
Age, Depression  1  68 
Centre3, Age, Sex, Depression  1  71 
Depression  1  63 
Sex, Education4, Depression  1  66 
Sex, Depression, Clinical stage  1  68 
Criminal, Depression  1  64 
Sex, Depression, M.authority  1  68 
Centre3, Age, Sex, Ethnicity1, Depression  1  72 
Age, Ethnicity1, Depression  1  68 
Centre3, Age, Criminal, Depression  1  70 
Clinical stage  1  54 
Age, Sex, Depression, Ruminative  1  70 
Criminal, Depression, Clinical stage  1  67 
Sex, Ethnicity1, Depression, Clinical stage  1  66 
Centre3, Age, Sex, Depression, M.authority  1  72 
Evaluation of predictive performance
In this section we evaluate the predictive performance of models, measured by the areaunderthecurve (AUC) of the (ROC) curve using 5 fold cross validation. The factor combinations evaluated are produced for both the Small Dataset and the Large Dataset, using two methods.
The first method is to order models by their and the predictive results for models formed in this way are reported in column 3 of Tables 2 and 3. The second method is to order models by sequentially adding factors in order of their . For example for the Large Dataset the first model would contain the factor Depression, while the second would contain Depression and Sex, and the third, Depression, Sex and Age, etc. The corresponding models for the Small Dataset are formed similarly.
Figure 3 plots the predictive performance for the models formed in order of the , for both the Small Dataset (blue) and for Large Dataset (black) and shows that those factors with the highest MPP contribute to the greatest marginal increases in the AUC, while adding subsequent factors has little or negative effect when added to the model, as evidenced in the relatively flat curves for the weighted models after approximately the first five factors.
Using priors to assess factor importance
Figure 4 shows the sensitivity of posterior inference to the prior probability of factor inclusion. The prior probability that a factor is included is given by , for , as in equation 2. Inference regarding that factor’s inclusion is via the marginal posterior distribution . Figure 4 shows that Depression, Sex and Age are relatively insensitive to changes in the prior probability and appear important factors regardless of the prior. Similarly, altering the prior probability of inclusion of WHODAS, Cannabis, Sleep Duration and Emo Abuse has little effect on the posterior probability; these factors are unlikely to be important irrespective of the prior. In contrast, inference regarding factors such as Criminal, Clinical Stage and Education influence depends very much on how much weight is given to their prior probability of inclusion.
Discussion
In this section we demonstrate a number of issues that contribute to difficulties in inferences in epidemiological data, and subsequently to the irreproducibility of results, by comparing the results of our Bayesian analysis, with that of [ODea2014].
Impact of data choice on single factor inference
The first issue we address is the impact of the choice of factors to include and data preprocessing on the analysis. In situations where the number of factors is relatively large it is common, and sometimes desirable, to reduce the number of factors available for prediction, using expert knowledge or results of prior studies. [ODea2014] did just this and the number of factors considered for prediction was reduced from approximately 50 to 20. The decision to omit factor from a model is equivalent to stating that a priori, the probability of inclusion is identically zero, i.e. . This is a very strong prior and could be replaced with , and letting vary, as shown in Figure 4.
The last column in Table 1 shows the for the most probable factors when the entire dataset is used and highlights that the decision to exclude variables can result in missing many important factors, such Centre Location, Ethnicity1, Clinical Stage, Dissavings and Mat Auth, to name a few.
Another aspect of data choice is to identify observations that may strongly influence inference drawn from the study. For example the for Ethnicity1, is equal to 0.32. There is only one individual, out of approximately 600, with this ethnicity. The measurements on the factors associated with this individual have values which make this individual an outlier in the covariate space, meaning that the values for this individual have more than average influence over the fitted values. Known as high leverage points, these points pull the fitted values towards these observations. Figure 5 shows a plot of the leverage values for all observations, and shows that the individual with Ethnicity1 has extremely high leverage values. Clearly we would not want to base a policy decision, for example the development of an intervention strategy for this particular ethnic group, based on a single observation.
Furthermore the posterior distribution of the regression coefficient for the factor Ethnicity1, shown in Supplementary Figure S10, is very nonnormal, the majority of iterates of the regression coefficient, (), are positive, a small proportion of iterates are negative, , but large, (a consequence of related factors being in or out of the Bayesian model at each iteration), while are identically zero. Clearly it is not meaningful to discuss the distribution of a regression coefficient which is based on a single observation, but it does highlight the importance of rigorous data analysis.
Impact of analytic technique on single factor inference
The second issue we address is the impact of the choice of analytic technique, Bayesian vs frequentist, has on inference concerning the individual factors that affect an individual’s likelihood of becoming NEET.
The final model of [ODea2014] selected Age (), Sex (), Criminal () and Depression () as significant at the level, while Cannabis () and Economic () were also significant at the level. This is similar to the Bayesian inference for the Small Dataset with notable exceptions. The Bayesian analysis of the Small Dataset, identifies Depression, Age and Sex as the top three important factors; consistent with [ODea2014]. However the factor Clinical Stage, identified as having the fifth highest in the Bayesian analysis, had a pvalue of , in the final model in [ODea2014], and was therefore deemed insignificant. Conversely factors such as Cannabis and Economic, while significant at the 10% levels in the [ODea2014] analysis, had very low ’s in the Bayesian analysis.
The difference in inference between the Bayesian analysis of the Large Dataset and the [ODea2014] analysis, is even more marked. Now factors such as Clinical Stage, Mat Auth, and Dissaving emerge as important factors. These factors were either discarded in the modelling process or not considered at all in [ODea2014].
Impact of analytic technique on inference for combinations of factors
For our discussion we categorize predictive factors into those which are nonmodifiable, such as Age and Sex, and those which are modifiable i.e. have the potential to be changed by some intervention. For the Original Dataset and Small Dataset inference regarding the nonmodifiable factors, Age and Sex, is consistent; for the Small Dataset dataset Age and Sex are in 70% and 60% of the top 10 models respectively while for the Large Dataset the corresponding figures are 50% and 70%. However inference regarding the modifiable factors is mixed. For example, Depression is in 95% of the top 20 factors combinations for the Small Dataset and Large Dataset. Clinical Stage, is in 25% of the top 20 models in the Small Dataset and in 20% of the top 20 models in the Large Dataset but not at all in the frequentist analysis while Cannabis is identified as important only in the frequentist analysis.
Note that the Bayesian analyses of both datasets identify one of the headspace Centre locations, Centre3 as an important factor, whereas this factor, as a dichotomous variable indicating the centre being in Sydney or Melbourne, did not even make it past the initial variable selection procedure in the [ODea2014] analysis. We suspect that this is because of data preprocessing and the arbitrary manner in which the variable Centre was coded, and this is the subject of future research. Our results suggest that Centre3 is a proxy for other factors, not measured, associated with the location of that centre, and perhaps offers an avenue of future research to determine what those others factors might be, an avenue that would not be apparent in the previous analysis of [ODea2014].
With respect to inference, Table 3 shows that, in addition to the core group of nonmodifiable factors, Age and Sex, other nonmodifiable factors such as Ethnicity (Ethnicity1, Ethnicity2, Ethnicity3) become important. And for modifiable factors a whole sluice of additional factors appear to be related to NEET status, including Emo Abuse, Education, Mat Auth, Fatigue, Parental M.I. and Phys Neg. These mixed results for modifiable factors clearly present problems when recommending intervention strategies.
Predictive performance
The results in Tables 2 and 3 show just how difficult selecting the one best model can be; multiple models have similar predictive performance yet very different factor combinations. For example, if we were to introduce interventions based on the most probable model (highest ) then we would target Depression only, while if those interventions were based on the model with the highest predictive ability (largest AUC) then those we would target Mat Auth and Centre3, as well as Depression.
Furthermore many models have comparable predictive performance, as measured by AUC, yet different factor combinations. If prediction of the outcome is the only aim, then this inconsistency in inference can be overlooked, but it presents a problem if the aim is to develop invention strategies to reduce youth disengagement.
In summary, given that public policy is often based on epidemiological associations rather than formal intervention studies, this lack of reproducibility and the impact of different research groups’ prior assumptions, which are often implicit rather than explicit, really do matter. However it is worth noting that Depression, a potentially modifiable factor, is robustly associated with youth disengagement across all models and inference regarding this factor is relatively independent of prior probabilities. Hence the study of how this factor impacts youth disengagement is worthy of support. For example does depression cause disengagement or it is a consequence of it? What is the best treatment for depression and does this vary with age and/or NEET status? More specific intervention studies, which attempt to control for other sources of uncertainties, are needed to answer these questions. In contrast factors such as Clinical Staging and Education need more investigation before being easily translated into larger scale policy changes.
Conclusion
When dealing with observational data there are many sources of variability in the journey to discovery; variability due to data selection and preprocessing, variability due to the type of analysis performed, variability due to the models proposed, and variability due to sampling. Each source of variability builds on the previous one. Given that pvalues were designed to capture only the variability due to sampling, it is not surprising that results published on the basis of ‘statistical significance’, as measured by pvalues, are often irreproducible.
In this paper we have taken a relatively small dataset, and shown that while the predictive capabilities of different models, are similar, inference regarding which factors effect a young person’s propensity to become disengaged, varies with the data used and the type of analysis performed. Even for a fixed set of data and choice of analysis, the Bayesian approach showed that models with very different factor combinations have similar predictive ability. Therefore if models are selected on the basis of predictive performance, it is not surprising that research findings based on model selection techniques have a high level of irreproducibility.
As the volume of data available for analysis increases, so too will the researcher’s degrees of freedom and if we continue to ignore other sources of variability then this will lead to an increase in irreproducibility; nothing is more certain. What is needed is an acknowledgment and acceptance of the various sources of uncertainty, and a rethink of how we measure it, report it, and take it into account when making decisions.
We argue strongly that, in addition to making data and code available, journal editors should require researchers to perform different types of analyses, using different data preprocessing techniques, and different models, such as presented in this paper. It is in the comparison of differing sets of results, that insights into the issue at hand become more apparent.
References
Footnotes
 thanks: Corresponding author.