Partial Information Framework: Model-Based Aggregation of Estimates from Diverse Information Sources
Prediction polling is an increasingly popular form of crowdsourcing in
which multiple participants estimate the probability or magnitude of
some future event. These estimates are then aggregated into a single
forecast. Historically, randomness in scientific estimation has generally been assumed to arise from unmeasured factors that are viewed as measurement noise. However, when combining subjective estimates, heterogeneity
stemming from differences in the participants’ information is often
more important than measurement noise. This paper formalizes
information diversity as an alternative source of such heterogeneity
and introduces a novel modeling framework that is particularly
well-suited for prediction polls. A practical specification of this
framework is proposed and applied to the task of aggregating
probability and point estimates from two real-world prediction
polls. In both cases our model outperforms standard
measurement-error-based aggregators, hence providing evidence in favor of information diversity being the more important source of forecast heterogeneity.
Keywords: Expert belief; Forecast heterogeneity; Judgmental forecasting; Model averaging; Noise reduction
1 Introduction
Past literature has distinguished two types of polling: prediction polling and opinion polling. In broad terms, an opinion poll is a survey of public opinion, whereas a prediction poll involves multiple agents collectively predicting the value of some quantity of interest (goel2010prediction; mellers2014psychological). For instance, consider a presidential election poll. An opinion poll typically asks voters who they will vote for. A prediction poll, on the other hand, could ask which candidate they think will win in their state. A liberal voter in a dominantly conservative state is likely to answer these two questions differently. Even though opinion polls have been the dominant focus historically, prediction polls have become increasingly popular in recent years due to modern social and computer networks that permit the collection of a large number of responses from both human and machine agents. This has given rise to crowdsourcing platforms, such as MTurk and Witkey, and many companies, such as Myriada, Lumenogic, and Inkling, that have managed to successfully capitalize on the benefits of collective wisdom.
This paper introduces statistical methodology designed specifically for the rapidly growing practice of prediction polling. The methods are illustrated on real-world data involving two common types of responses, namely probability and point forecasts. The probability forecasts were collected by the Good Judgment Project (GJP) (ungar2012good; mellers2014psychological) as a means to estimate the likelihoods of international political events deemed important by the Intelligence Advanced Research Projects Activity (IARPA). Since its initiation in 2011, the project has recruited thousands of forecasters to make probability estimates and to update them whenever they felt the likelihoods had changed. To illustrate, Figure 2 shows the forecasts for one of these events. This example involves forecasters making predictions between 30 July 2012 and 30 December 2012, when the event finally resolved as “No” (represented by the red line). In general, the forecasters reported updates very infrequently. Furthermore, not all forecasters made probability estimates for all the events, making the dataset very sparse. The point forecasts for our second application were collected by moore2008use, who recruited undergraduates from Carnegie Mellon University to guess the weights of people based on a series of pictures. This is an experimental setup where each participant was required to respond to all the questions, leading to a fully completed dataset. The responses are illustrated in Figure 2, which shows boxplots of the forecasters’ guesses for each of the people. The red dots represent the corresponding true weights.
Once the predictions have been collected, they are typically combined into a single consensus forecast for the sake of decision-making and improved accuracy. Unfortunately, this can be done in many different ways, and the final combination rule can largely determine the out-of-sample performance. The past literature distinguishes two broad approaches to forecast aggregation: empirical aggregation and model-based aggregation. Empirical aggregation is by far the more widely studied approach; see, e.g., stacking (breiman1996stacked), Bayes model averaging (raftery1997bayesian), linear opinion pools (degroot1991optimal), and extremizing aggregators (Ranjan08; satopaa; satopaa2014probability). All these methods are akin to machine learning in the sense that they first learn the aggregator from a training set of past forecasts of known outcomes and then use that aggregator to combine future forecasts of unknown outcomes. Unfortunately, in a prediction polling setup, constructing such a training set requires a lot of effort and time on the part of the forecasters and the polling agent. Therefore a training set is often not available. Instead, the participants are typically handed a single questionnaire that simultaneously inquires about their predictions of one or more unknown outcomes. This leads to a dataset consisting only of forecasts, which means that empirical aggregation cannot be applied.
Fortunately, model-based aggregation can be performed even when prior knowledge of outcomes is not available. This approach begins by proposing a plausible probability model for the source of heterogeneity among the forecasts, that is, for how and why the forecasts differ from the target outcome. Under this assumed forecast-outcome link, it is then possible to construct an optimal aggregator that can be applied directly to the forecasts without learning the aggregator first from a separate training set. Given this broad applicability, the current paper focuses only on the model-based approach. In particular, outcomes are not assumed available for aggregation at any point in the paper. Instead, aggregation is performed solely based on forecasts, leaving all empirical techniques well outside the scope of the paper.
Historically, potentially due to early forms of data collection, model-based aggregation has considered measurement error as the main source of forecast heterogeneity. This choice motivates aggregators with central tendency, such as the (weighted) average, median, and so on. Intuitively, measurement error may be a reasonable model for repeated estimates from a single instrument. However, it is unlikely to hold in prediction polling, where the estimates arise from multiple, often widely different sources. It is also known that a non-trivial weighted average is not the optimal aggregator (in terms of the expected quadratic and many other loss functions) under any joint distribution of the outcome and its (conditionally unbiased) forecasts (dawid1995coherent; Ranjan08; satopaa2015combining). This calls into question the role of measurement error in model-based aggregation and highlights the need for a different source of forecast heterogeneity.
The main contribution of this paper is a new source of forecast heterogeneity, called information diversity, that explains variation by differences in the information available to the forecasters and in how they decide to use it. For instance, forecasters studying the same (or different) articles about a company may use separate parts of the information and hence report differing predictions of the company’s future revenue. Such diversity forms the basis of a novel modeling framework known as the partial information framework. Theory behind this framework was originally introduced for probability forecasts by satopaamodeling, though their specification is somewhat restrictive for empirical applications. The current paper generalizes the framework beyond probability forecasts and removes all unnecessary assumptions, leading to a new specification that is more appropriate for practical applications. This specification allows the decision-maker to build models for different types of forecast-outcome pairs, such as probability forecasts of binary events or point forecasts of real-valued outcomes. Each such model motivates and describes an explicit joint distribution for the target outcome and its forecasts. The optimal aggregator under this joint distribution is available and serves as a more principled model-based alternative to the usual (weighted) average or median.
The paper is structured as follows. Section 2 first describes the partial information framework at its most general level and then introduces a practical specification of the framework. The section ends with a brief review of previous work on model-based aggregation. Section 3 derives a general procedure that guides efficient estimation of the information structure among the forecasters. Section 4 illustrates on real-world data how specific models within the framework can be constructed and applied. In particular, the models are derived and evaluated on probability and point forecasts from the two prediction polls discussed above. Overall, the resulting partial information aggregators achieve a noticeable performance improvement over the common measurement-error-based aggregators, suggesting that information diversity is the more appropriate model of forecast heterogeneity. Finally, Section 5 concludes with a summary and discussion of future research.
2 Model-Based Aggregation
2.1 Bias and Noise
Consider $N$ forecasters and suppose forecaster $j$ predicts $X_j$ for some quantity of interest $Y$. For instance, in our weight estimation example $Y$ is the true weight of a person and $X_j$ is the guess given by the $j$th undergraduate. In our probability forecasting application, on the other hand, $Y$ is binary, reflecting whether the event happens or not, and $X_j$ is a probability forecast for its occurrence. This section, however, avoids such application-specific choices and treats $X_j$ and $Y$ as generic random variables. In general, prediction $X_j$ is nothing but an estimator of $Y$. Therefore, as is the case with all estimators, its deviation from the truth can be broken down into two components: bias and noise. On the theoretical level, these two components can be separated and hence are often addressed by different mechanisms. This suggests a two-step approach to forecast aggregation: i) eliminate any bias in the forecasts, and ii) combine the unbiased forecasts.
Historically, bias in human judgment has been extensively studied in the psychology literature (for reviews, see lichtenstein1977calibration; yates1990judgment; keren1991calibration). This bias often exhibits well-known patterns (see, e.g., the easy-hard effect in lichtenstein1977those; juslin1993explanation), and many authors have proposed both cognitive and motivational models to explain it (koriat1980reasons; kruglanski1990motivations; soll1996determinants; moore2008trouble). These models and other results in this popular area of research suggest ways for ex-ante bias reduction. Such techniques, however, are beyond the scope of this paper. Instead, the focus here is on noise reduction and hence specifically on developing methodology for the second step in the overall process of forecast aggregation. In particular, Section 2.2 describes our new framework for modeling the noise component. This framework is then compared in Section 2.3 to previous noise models. These models make different assumptions about the way the unbiased forecasts relate to the target outcome and hence motivate very different classes of model-based aggregators.
2.2 Partial Information Framework
2.2.1 General Framework
The partial information framework assumes that $Y$ and the $X_j$ are measurable under some common probability space $(\Omega, \mathcal{F}, \mathbb{P})$. The probability measure $\mathbb{P}$ provides a non-informative yet proper prior on $Y$ and reflects the basic information known to all forecasters. Such a prior has been discussed extensively in the economics and game theory literature, where it is usually known as the common prior. Even though this is a substantive assumption in the framework, specifying a prior distribution cannot be avoided as long as the model depends on a probability space. This includes essentially any probability model for forecast aggregation. How the prior is incorporated depends on the problem context: it can be chosen explicitly by the decision-maker, computed based on past observations of $Y$, or estimated directly from the forecasts.
The principal $\sigma$-field $\mathcal{F}$ can be interpreted as all the possible information that can be known about $Y$. On top of the basic information reflected in the prior, the $j$th forecaster uses some personal partial information set $\mathcal{F}_j \subseteq \mathcal{F}$ and predicts $X_j = \mathbb{E}(Y \mid \mathcal{F}_j)$. Therefore $X_j = X_k$ if $\mathcal{F}_j = \mathcal{F}_k$, and forecast heterogeneity stems purely from information diversity. Note, however, that if forecaster $j$ uses a simple rule, $\mathcal{F}_j$ may not be the full $\sigma$-field of information available to the forecaster but rather a smaller $\sigma$-field corresponding to the information used by the rule. Furthermore, if two forecasters have access to the same $\sigma$-field, they may decide to use different sub-$\sigma$-fields, leading to different predictions. This is particularly salient in our weight estimation example, where each forecaster has access to the exact same information, namely the picture of the person, but can choose to use different subsets of this information. Therefore, information diversity arises not only from differences in the available information but also from how the forecasters decide to use it. This general point of view was motivated in satopaamodeling with simple examples that illustrate how the optimal aggregate is not well-defined without assumptions on the information structure among the forecasters.
satopaamodeling also show that the form $X_j = \mathbb{E}(Y \mid \mathcal{F}_j)$ is precisely the same as having a calibrated (sometimes also known as reliable) forecast, that is, $\mathbb{E}(Y \mid X_j) = X_j$. Therefore this form arises directly from the existence of an underlying probability model and calibration. Overall, calibration has been widely discussed in the statistical and meteorological forecasting literature (see, e.g., dawid1995coherent; Ranjan08; jolliffe2012forecast), with traces at least as far back as murphy1987general. Given that the condition depends on the probability measure $\mathbb{P}$, it should be referred to as $\mathbb{P}$-calibration when the choice of the probability measure needs to be emphasized. This dependency shows the main conceptual difference between $\mathbb{P}$-calibration and the notion of empirical calibration (dawid1982well; foster1998asymptotic; and many others). However, as was pointed out by dawid1995coherent, these two notions can be expressed in formally identical terms by letting $\mathbb{P}$ represent the limiting joint distribution of the forecast-outcome pairs.
In practice researchers have discovered many calibrated subpopulations of experts, such as meteorologists (murphy1977can; murphy1977reliability), experienced tournament bridge players (keren1987facing), and bookmakers (dowie1976efficiency). Generally, calibration can be improved through team collaboration, training, tracking (mellers2014psychological), performance feedback (murphy1984impacts), representative sampling of target events (gigerenzer1991probabilistic; juslin1993explanation), or by evaluating the forecasters’ performance under a loss function that is minimized by the conditional expectation of $Y$ given the forecaster’s information (banerjee2005optimality). If one is nonetheless left with uncalibrated forecasts, they can be calibrated ex ante as follows. First, consider some (possibly uncalibrated) forecasts $\tilde{X}_1, \dots, \tilde{X}_N$ defined on $(\Omega, \mathcal{F})$. Choose some distribution $\mathbb{P}$ for $(Y, \tilde{X}_1, \dots, \tilde{X}_N)$. For instance, dawid1995coherent suggest first choosing a distribution for $(Y, \tilde{X})$ and then setting $X = \mathbb{E}_{\mathbb{P}}(Y \mid \tilde{X})$, where $\tilde{X}$ is an arbitrary aggregator (such as the average of probability forecasts of a binary event) acting as a single combined forecast. Alternatively, one may search for an appropriate $\mathbb{P}$ in the large literature of quantitative psychology. Regardless of how $\mathbb{P}$ is constructed, however, the calibrated version of $\tilde{X}_j$ is $X_j = \mathbb{E}_{\mathbb{P}}(Y \mid \tilde{X}_j)$. This forecast is $\mathbb{P}$-calibrated and can be written as $X_j = \mathbb{E}(Y \mid \mathcal{F}_j)$, where $\mathcal{F}_j = \sigma(\tilde{X}_j)$ is the $\sigma$-field generated by $\tilde{X}_j$. Intuitively, calibrating is equivalent to replacing the forecast $\tilde{X}_j = x$ by $\mathbb{E}_{\mathbb{P}}(Y \mid \tilde{X}_j = x)$ for all possible values $x$. Perhaps, however, one does not want to work under this particular model. To accommodate alternative models (such as the Gaussian model described in Section 2.2.2), the next proposition shows how $\mathbb{P}$-calibrated forecasts can be transformed into forecasts that are calibrated under some other probability measure $\mathbb{Q}$. All the proofs are deferred to Appendix A.
Proposition 2.1. Consider a probability measure $\mathbb{Q}$ such that $\mathbb{Q} \ll \mathbb{P}$. Let $D = d\mathbb{Q}/d\mathbb{P}$ denote the Radon–Nikodym derivative of $\mathbb{Q}$ with respect to $\mathbb{P}$. The forecasts under the new model are then given by the transformation
$$X_j \mapsto \mathbb{E}_{\mathbb{Q}}(Y \mid \mathcal{F}_j) = \frac{\mathbb{E}_{\mathbb{P}}(Y D \mid \mathcal{F}_j)}{\mathbb{E}_{\mathbb{P}}(D \mid \mathcal{F}_j)}.$$
This shows that uncalibrated forecasts from “non-experts” can be calibrated as long as one agrees on some joint distribution for the target outcome and its forecasts. While such constructs certainly deserve further analysis, they are beyond the scope of this paper and hence are left for future work. Therefore, from now on, the forecasts are assumed to be calibrated. Note, however, that in general the forecasts should satisfy some minimal performance criterion; simply aggregating entirely arbitrary forecasts is hardly going to lead to improved forecasting accuracy. To this end, foster1998asymptotic analyze probability forecasts and state that “calibration does seem to be an appealing minimal property that any probability forecast should satisfy.” They also show that a forecaster needs to know almost nothing about the outcomes in order to be calibrated. Thus, in theory, calibration can be achieved very easily, and overall it seems like an appropriate base assumption for developing a general theory of forecast aggregation.
Given that the partial information framework generates all forecast variation from information diversity, it is important to understand the extent to which the forecasters’ partial information sets can be measured in practice. First, note that, for the purposes of aggregation, any available information discarded by a forecaster may as well not exist because information comes to the aggregator only through the forecasts. Therefore it is not in any way restrictive to assume that $\mathcal{F}_j = \sigma(X_j)$. Second, the following proposition describes observable measures for the amount of information in each forecast and for the amount of information overlap between any two forecasts.
Proposition 2.2. If $X_j = \mathbb{E}(Y \mid \mathcal{F}_j)$ with $\mathcal{F}_j \subseteq \mathcal{F}$ for all $j = 1, \dots, N$, then the following holds.
i) Forecasts are marginally consistent: $\mathbb{E}(X_j) = \mathbb{E}(Y)$.
ii) Variance increases in information: $\mathrm{Var}(X_j) \le \mathrm{Var}(X_k)$ if $\mathcal{F}_j \subseteq \mathcal{F}_k$. Given that $\mathcal{F}_j \subseteq \mathcal{F}$, the variances of the forecasts are upper bounded as $\mathrm{Var}(X_j) \le \mathrm{Var}(Y)$ for all $j$.
iii) $\mathrm{Cov}(X_j, X_k) = \mathrm{Var}(X_j)$ if $\mathcal{F}_j \subseteq \mathcal{F}_k$. Again, expressing $Y = \mathbb{E}(Y \mid \mathcal{F})$ implies that $\mathrm{Cov}(X_j, Y) = \mathrm{Var}(X_j)$ for all $j$.
This proposition is important for multiple reasons. First, item i) provides guidance in estimating the prior mean of $Y$ from the observed forecasts. Second, item ii) shows that $\mathrm{Var}(X_j)$ quantifies the amount of information used by forecaster $j$. In particular, $\mathrm{Var}(X_j)$ increases to $\mathrm{Var}(Y)$ as forecaster $j$ learns and becomes more informed. Therefore increased variance reflects more information and is deemed helpful. This is a clear contrast to standard statistical models, which often regard higher variance as increased noise and hence harmful. The covariance $\mathrm{Cov}(X_j, X_k)$, on the other hand, can be interpreted as the amount of information overlap between forecasters $j$ and $k$. Given that non-negative correlation is not generally transitive (langford2001property), these covariances are not necessarily non-negative even though all forecasts are non-negatively correlated with the outcome. Such negatively correlated forecasts can arise in a real-world setting. For instance, consider two forecasters who see the voting preferences of two different sub-populations that are politically opposed to each other. Each individually is a weak predictor of the total vote on any given issue, but the two are negatively correlated because of the likelihood that these two blocks will largely oppose each other.
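The moment identities in the proposition can be checked directly by simulation. The following sketch uses a hypothetical setup (not one of the paper’s applications) in which the outcome is a sum of independent Gaussian information pieces and two forecasters with nested information sets report conditional expectations:

```python
import numpy as np

rng = np.random.default_rng(0)
M, n = 6, 200_000
pieces = rng.standard_normal((n, M))
Y = pieces.sum(axis=1)              # outcome: sum of M independent information pieces

# Nested information sets: forecaster A sees pieces {1,2}; B sees pieces {1,2,3,4}.
# Since the unseen pieces have mean zero, each conditional expectation is a partial sum.
X_A = pieces[:, :2].sum(axis=1)
X_B = pieces[:, :4].sum(axis=1)

# i) marginal consistency, ii) variance increases in information (about 2 < 4 < 6 here),
# iii) Cov(X_A, X_B) = Var(X_A) and Cov(X_A, Y) = Var(X_A)
assert abs(X_A.mean() - Y.mean()) < 0.05
assert X_A.var() < X_B.var() < Y.var()
assert abs(np.cov(X_A, X_B)[0, 1] - X_A.var()) < 0.05
assert abs(np.cov(X_A, Y)[0, 1] - X_A.var()) < 0.05
```

Note that the better-informed forecaster has the larger variance, in direct contrast to measurement-error models where larger variance signals more noise.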
Third and finally, item iii) shows that the covariance matrix $\Sigma$ of the $X_j$s extends to the unknown $Y$ as follows:

$$\mathrm{Cov}(X_1, \dots, X_N, Y) = \begin{pmatrix} \Sigma & \mathrm{diag}(\Sigma) \\ \mathrm{diag}(\Sigma)^\top & \mathrm{Var}(Y) \end{pmatrix}, \qquad (1)$$

where $\mathrm{diag}(\Sigma)$ denotes the diagonal of $\Sigma$. This is the key to regressing $Y$ on the $X_j$s without a separate training set of past forecasts of known outcomes. The resulting estimator, called the revealed aggregator, is

$$X' = \mathbb{E}(Y \mid \mathcal{F}'),$$
where $\mathcal{F}' = \sigma(X_1, \dots, X_N)$ is the $\sigma$-field generated (or information revealed) by the $X_j$s. The revealed aggregator uses all the information that is available in the forecasts and hence is the optimal aggregator under the distribution of $(X_1, \dots, X_N, Y)$. To make this precise, consider a scoring rule $S(x, y)$ that represents the loss of predicting $x$ when the outcome is $y$. A scoring rule is said to be consistent for the mean of $Y$ if $\mathbb{E}[S(\mathbb{E}(Y), Y)] \le \mathbb{E}[S(x, Y)]$ for all $x$. savage1971elicitation showed, subject to weak regularity conditions, that all such scoring rules can be written in the form

$$S(x, y) = \phi(y) - \phi(x) - \phi'(x)(y - x), \qquad (2)$$

where $\phi$ is a convex function with subgradient $\phi'$. An important special case is the quadratic loss $S(x, y) = (x - y)^2$ that arises when $\phi(x) = x^2$. Now, if an aggregator is defined as any random variable measurable with respect to $\mathcal{F}'$, then $X'$ is an aggregator that minimizes the expectation of any scoring rule of the form (2):

$$\mathbb{E}[S(X', Y)] \le \mathbb{E}[S(X, Y)] \quad \text{for any aggregator } X.$$
Ranjan08 showed a similar result for probability forecasts. For these reasons, $X'$ is considered the relevant aggregator under each specific instance of the framework. The next section shows how this aggregator can be captured in practice.
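The Savage representation (2) is easy to verify numerically. The sketch below implements the general form and confirms that the convex choice $\phi(x) = x^2$ (with subgradient $2x$) recovers the quadratic loss exactly:

```python
def savage_score(x, y, phi, dphi):
    """Savage-form scoring rule: S(x, y) = phi(y) - phi(x) - phi'(x) * (y - x)."""
    return phi(y) - phi(x) - dphi(x) * (y - x)

# phi(x) = x**2 yields S(x, y) = y^2 - x^2 - 2x(y - x) = (x - y)^2.
phi = lambda u: u ** 2
dphi = lambda u: 2.0 * u

assert abs(savage_score(0.3, 1.0, phi, dphi) - (0.3 - 1.0) ** 2) < 1e-12
assert savage_score(0.5, 0.5, phi, dphi) == 0.0   # zero loss for a perfect prediction
```

By convexity of $\phi$, every scoring rule of this form is non-negative and vanishes only when the prediction matches the outcome.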
2.2.2 Gaussian Partial Information Model
Even though the general framework is convenient for theoretical analysis, it is clearly too abstract for practical applications. Fortunately, applying the framework in practice only requires one extra assumption, namely the choice of a parametric family for the distribution of $(X_1, \dots, X_N, Y)$. One approach is to refer to Proposition 2.2 and choose a family that is parametrized in terms of the first two joint moments. This points to the multivariate Gaussian distribution, which is a typical starting point in developing statistical methodology and often provides the cleanest entry into the issues at hand.
The Gaussian distribution is also the most common choice for modeling measurement error. This is typically motivated by assuming the error terms to represent sums of a large number of independent sources of error. The central limit theorem then gives a natural motivation for the Gaussian distribution. A similar argument can be made under the partial information framework. First, consider a large number of pieces of information. Each piece has either a positive or a negative impact and hence respectively either increases or decreases $Y$. The total sum (integral) of these pieces determines the value of $Y$. Each forecaster, however, only observes the sum of some subset of them. Based on this sum, the forecaster makes an estimate of $Y$. If the pieces are independent and have light tails, then the joint distribution of the forecasters’ observations will be asymptotically Gaussian. Given that the number of information pieces in a real-world setup is likely to be large, it makes sense to model the forecasters’ observations as jointly Gaussian. Of course, other distributions, such as the multivariate $t$-distribution, are possible. At this point, however, such alternative specifications are best left for future work.
The model variables can be modeled directly with a Gaussian distribution as long as they are all real-valued. In many applications, however, $Y$ and the $X_j$ may not be supported on the whole real line. For instance, the aforementioned Good Judgment Project collected probability forecasts of binary events. In this case, $Y \in \{0, 1\}$ and $X_j \in [0, 1]$. Fortunately, different types of outcome-forecast pairs can be easily addressed by borrowing from the theory of generalized linear models (mccullagh1989generalized) and utilizing a link function. The result is a close yet widely applicable specification called the Gaussian partial information model. This model begins by introducing information variables $(Z_1, \dots, Z_N, Z_0)$ that follow a multivariate Gaussian distribution with the covariance pattern (1):

$$(Z_1, \dots, Z_N, Z_0)^\top \sim \mathcal{N}\left(\mathbf{0}, \begin{pmatrix} \Sigma & \mathrm{diag}(\Sigma) \\ \mathrm{diag}(\Sigma)^\top & 1 \end{pmatrix}\right). \qquad (3)$$
This distribution supports the Gaussian model similarly to the way ordinary linear regression supports the class of generalized linear models. In particular, the information variables transform into the outcome and forecasts via an application-specific link function $g$; that is, $Y = g(Z_0)$ and $X_j = \mathbb{E}(Y \mid Z_j)$. Given that $Z_0$ fully determines $Y$, it is sufficient for all information that can be known about $Y$. The remaining variables $Z_1, \dots, Z_N$, on the other hand, summarize the forecasters’ partial information. To make this more concrete, consider our two real-world applications. For probability forecasts of a binary event a reasonable link function is the indicator function $g(z) = \mathbf{1}\{z > t\}$ for some threshold value $t$. For real-valued $Y$ and $X_j$, on the other hand, a reasonable choice is the reverse standardizing function $g(z) = \mu_0 + \sigma_0 z$, where $\mu_0$ and $\sigma_0$ are the prior mean and standard deviation of $Y$, respectively. In general, it makes sense to have $g$ map from the real numbers to the support of $Y$ such that $Y = g(Z_0)$ has the correct prior.
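As a concrete sketch of these two link functions (with hypothetical prior parameters), one can check by simulation that the outcome produced by each link inherits the intended prior:

```python
import numpy as np

rng = np.random.default_rng(1)
Z0 = rng.standard_normal(100_000)       # standardized information variable for Y

# Binary event: indicator link g(z) = 1{z > t}. The threshold t encodes the prior
# probability of the event; t = 0 gives a 50/50 prior.
t = 0.0
Y_binary = (Z0 > t).astype(float)

# Real-valued outcome: reverse standardizing link g(z) = mu0 + sigma0 * z, with
# hypothetical prior mean and standard deviation (say, a weight in pounds).
mu0, sigma0 = 150.0, 20.0
Y_real = mu0 + sigma0 * Z0

assert abs(Y_binary.mean() - 0.5) < 0.01    # prior probability is recovered
assert abs(Y_real.mean() - mu0) < 0.5       # prior mean is recovered
assert abs(Y_real.std() - sigma0) < 0.5     # prior standard deviation is recovered
```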
Overall, this model can be considered a close yet practical specification of the general framework. After all, it only adds the assumption of Gaussianity. This extra assumption, however, is enough to allow the construction of the revealed aggregator $X' = \mathbb{E}(Y \mid Z_1, \dots, Z_N)$. For both binary and real-valued outcomes the conditional expectations can often be computed via the following conditional distribution:

$$Z_0 \mid (Z_1, \dots, Z_N) = \mathbf{z} \sim \mathcal{N}\left(\mathrm{diag}(\Sigma)^\top \Sigma^{-1} \mathbf{z}, \; 1 - \mathrm{diag}(\Sigma)^\top \Sigma^{-1} \mathrm{diag}(\Sigma)\right),$$

where $\mathbf{Z} = (Z_1, \dots, Z_N)^\top$. For instance, if both $Y$ and the $X_j$ are real-valued, then $Y = \mu_0 + \sigma_0 Z_0$ and $X' = \mu_0 + \sigma_0 \, \mathrm{diag}(\Sigma)^\top \Sigma^{-1} \mathbf{Z}$. These conditional distributions arise directly from the well-known conditional distributions of the multivariate Gaussian distribution (see, e.g., ravishanker2001first).
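Putting the pieces together, the Gaussian-model aggregator can be sketched in a few lines. The information structure below is hypothetical, chosen only so that the extended covariance matrix is information coherent:

```python
import numpy as np
from math import erf, sqrt

def std_normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Hypothetical information structure for N = 2 forecasters (Var(Z_0) = 1):
Sigma = np.array([[0.4, 0.1],
                  [0.1, 0.3]])
d = np.diag(Sigma)                  # Cov(Z_j, Z_0) = Var(Z_j) under the covariance pattern
Z = np.array([0.9, -0.3])           # observed information variables

# Conditional distribution of Z_0 given Z (standard multivariate-Gaussian formulas):
w = np.linalg.solve(Sigma, d)       # solves Sigma w = diag(Sigma)
cond_mean = w @ Z
cond_var = 1.0 - d @ w              # non-negative when the structure is coherent

# Real-valued outcome with g(z) = mu0 + sigma0 * z (hypothetical prior parameters):
mu0, sigma0 = 150.0, 20.0
X_real = mu0 + sigma0 * cond_mean

# Binary outcome with g(z) = 1{z > t}: the aggregate is P(Z_0 > t | Z).
t = 0.0
X_prob = 1.0 - std_normal_cdf((t - cond_mean) / sqrt(cond_var))
```

Note that the weights `w` need not sum to one, so the aggregate can leave the convex hull of the individual forecasts, unlike any weighted average.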
2.3 Previous Work on Model-Based Aggregation
2.3.1 Interpreted Signal Framework
The interpreted signal framework is a behavioral model that assumes different predictions to arise from differing interpretation procedures (hong2009interpreted). For example, consider two forecasters who visit a company and predict its future revenue. One forecaster may carefully examine the company’s technological status while the other pays closer attention to what the managers say. Even though the forecasters receive and possibly even use the exact same information, they may interpret it differently and hence end up reporting different forecasts. Therefore forecast heterogeneity is assumed to stem from “cognitive diversity”.
This is a very reasonable model and hence has been used in various forms to simulate and illustrate theory about expert behavior (see, e.g., broomell2009experts; parunak2013characterizing). Consequently, previous authors have constructed many highly specialized toy models of interpreted forecasts. For instance, dawid1995coherent construct simple models of two forecasts to support their discussion of coherent forecast aggregation; Ranjan08 use one of these models to simulate calibrated forecasts; and Bacco introduce a model for two forecasters whose (interpreted) log-odds predictions follow a joint Gaussian distribution. Unfortunately, their model is very narrow due to its detailed assumptions and extensive computations. Furthermore, it is not clear how the model can be used in practice or extended to more than two forecasters. All in all, it seems that successful previous applications of the interpreted signal framework have used it as a basis for illustrating theory instead of actually aiming to model real-world forecasts. In this respect, the framework has remained relatively abstract.
Our partial information framework, however, formalizes the intuition behind the interpreted signal framework, allows quantitative predictions, and provides a flexible construction for modeling many different forecasting setups. Overall, the framework is very general and, in fact, encompasses all the other authors’ models mentioned above as different sub-cases. Unlike the Gaussian model, however, those models make many restrictive assumptions in addition to just choosing a parametric family. Even though the general partial information framework, as described in Section 2.2, does not allow the forecasters to interpret information differently and hence does not capture all aspects of the interpreted signal framework, personal interpretations can be easily introduced by associating forecaster $j$ with a probability measure $\mathbb{P}_j$ that describes that forecaster’s interpretation of information. If $\mathbb{E}_j$ denotes the expectation under $\mathbb{P}_j$, then it is possible that $\mathbb{E}_j(Y \mid \mathcal{F}_j) \ne \mathbb{E}_k(Y \mid \mathcal{F}_k)$ even if $\mathcal{F}_j = \mathcal{F}_k$. In practice, however, eliciting the details of each $\mathbb{P}_j$ is hardly possible. Therefore, to keep the model tractable, it is convenient to assume a common interpretation $\mathbb{P}_j = \mathbb{P}$ for all $j$.
2.3.2 Measurement Error Framework
In the absence of a quantitative interpreted signal model, prior applications have typically explained forecast heterogeneity with standard statistical models. These models are different formalizations of the measurement error framework, which generates forecast heterogeneity purely from a probability distribution. More specifically, this framework assumes a “true” (possibly transformed) forecast $\theta$, which can be interpreted as the prediction made by an ideal forecaster. The forecasters then somehow measure $\theta$ with mean-zero idiosyncratic error. For instance, in our probability forecasting application one possible measurement error model is

$$\mathrm{logit}(X_j) = \mathrm{logit}(\theta) + \varepsilon_j, \quad \varepsilon_j \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2), \qquad (4)$$

where $\mathrm{logit}(p) = \log\{p/(1-p)\}$ is the log-odds operator. Given that the errors are generally assumed to have mean zero, measurement error forecasts are unbiased estimates of $\theta$, possibly after a transformation as in (4). Observe that this is not the same as assuming calibration, $X_j = \mathbb{E}(Y \mid X_j)$. Therefore an unbiased estimation model is very different from a calibrated model. This distinction is further emphasized by the fact that $\mathbb{E}(Y \mid X_1, \dots, X_N)$ never reduces to a (non-trivial) weighted average of the forecasts (satopaa2015combining). Given that measurement-error aggregators are often different types of weighted averages, measurement error and information diversity are not only philosophically different but also require very different aggregators.
Example (4) illustrates the main advantages of the measurement error framework: simplicity and familiarity. Unfortunately, there are a number of disadvantages. First, measurement-error aggregators estimate $\theta$ instead of the realized value of the random variable $Y$. For this reason, these aggregators often do not satisfy even the minimal performance requirements. For instance, a non-trivial weighted average of calibrated forecasts is necessarily both uncalibrated and under-confident (Ranjan08; satopaa2015combining). Second, the standard assumption of conditionally independent observations forces a specific and highly unrealistic structure on interpreted forecasts (hong2009interpreted). Measurement-error aggregators also cannot leave the convex hull of the individual forecasts, which further contradicts the interpreted signal framework (parunak2013characterizing) and can easily be seen to result in poor empirical performance on many datasets. Third, the underlying model is rather implausible. Relying on a true forecast invites philosophical debate, and even if one assumes the existence of such a value, it is difficult to believe that the forecasters are actually seeing it with independent noise. Therefore, whereas the interpreted signal framework proposes a plausible micro-level explanation, the measurement error model does not; at best, it forces us to imagine a group of forecasters who apply the same procedures to the same data but with numerous small mistakes.
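The under-confidence of the simple average is easy to see by simulation. In the hypothetical sketch below, two calibrated forecasters hold disjoint pieces of information about the outcome: regressing the outcome on an individual forecast gives slope one (calibration), while regressing it on the average gives slope two (under-confidence):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
Z1, Z2, Z3 = rng.standard_normal((3, n))
Y = Z1 + Z2 + Z3                    # outcome: three independent information pieces

X1, X2 = Z1, Z2                     # calibrated forecasts: X_j = E[Y | Z_j]
X_bar = (X1 + X2) / 2.0             # the simple average of the two forecasts

slope_individual = np.cov(Y, X1)[0, 1] / X1.var()       # ~1: E[Y | X1] = X1
slope_average = np.cov(Y, X_bar)[0, 1] / X_bar.var()    # ~2: E[Y | X_bar] = 2 * X_bar

assert abs(slope_individual - 1.0) < 0.05   # each forecast is calibrated
assert slope_average > 1.5                  # the average is under-confident
```

Intuitively, averaging pulls the aggregate toward the prior mean even though the combined information warrants a more extreme forecast.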
3 Model Estimation
This section describes methodology for estimating the information structure $\Sigma$. Even though $\Sigma$ is mostly used for aggregation, it also describes the information among the forecasters (see the end of Section 2.2.1) and hence should be of interest to decision analysts, psychologists, and the broader community studying collective problem solving. Unfortunately, estimating $\Sigma$ in full generality based on a single prediction per forecaster is difficult. Therefore, to facilitate model estimation, the forecasters are assumed to predict multiple related events. For instance, in our second application the undergraduates guessed the weights of people. This yielded a matrix of forecasts that was then used to estimate $\Sigma$.
3.1 General Estimation Problem
Denote the outcome of the $k$th event with $Y_k$ and the $j$th forecaster’s prediction for this outcome with $X_{jk}$. For the sake of generality, this section does not assume any particular link function but instead operates directly with the corresponding information variables, denoted with $Z_{jk}$. In practice, the forecasts can often be transformed into these information variables at least approximately. This is illustrated in Section 4. Recall that aggregation cannot access the outcomes $Y_k$ or their corresponding information variables $Z_{0k}$. Instead, $\Sigma$ is estimated only based on the vectors $\mathbf{Z}_k = (Z_{1k}, \dots, Z_{Nk})^\top$, where $\mathbf{Z}_k$ collects the forecasters’ information about the $k$th event.
This estimation must respect the covariance pattern (3). More specifically, if denotes the set of symmetric positive semidefinite matrices and
for some symmetric matrix , then the final estimate must satisfy the condition . Intuitively, this is satisfied if there exists a random variable for which the forecasts are jointly calibrated. In terms of information, this means that it is physically possible to allocate information about among the forecasters in the manner described by . Therefore the condition is named information coherence.
Unfortunately, simply finding an accurate estimate of does not guarantee precise aggregation. To see this, recall from Section 2.2.2 that . This term is generally found in the revealed aggregator and hence deserves careful treatment. Re-express the term as , where is the solution to . The rate at which the solution changes with respect to a change in depends on the condition number , i.e., the ratio between the maximum and minimum eigenvalues of . If the condition number is very large, a small error in can cause a large error in . If the condition number is small, is called well-conditioned and error in will not be much larger than the error in . Thus, to prevent estimation error from being amplified during aggregation, the estimation procedure should require for a given threshold .
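The amplification effect of a large condition number can be seen in a short numerical sketch (matrices and perturbation chosen by us for illustration): the same small perturbation of the right-hand side produces a relative solution error thousands of times larger for the ill-conditioned covariance.

```python
import numpy as np

def relative_solve_error(cov, v, delta):
    """Relative change in the solution of cov @ x = v when v is perturbed."""
    x = np.linalg.solve(cov, v)
    x_pert = np.linalg.solve(cov, v + delta)
    return np.linalg.norm(x_pert - x) / np.linalg.norm(x)

v = np.array([1.0, 0.0])
delta = np.array([0.0, 1e-3])        # small estimation error in v

cov_ill = np.diag([1.0, 1e-4])       # condition number 1e4
cov_well = np.diag([1.0, 0.5])       # condition number 2

err_ill = relative_solve_error(cov_ill, v, delta)    # = 10.0
err_well = relative_solve_error(cov_well, v, delta)  # = 0.002
```

The relative error is bounded by the condition number times the relative perturbation, which is exactly why the estimation procedure constrains the condition number of the estimate.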
This all gives the following general estimation problem:
where is some objective function. The feasible region defined by the two constraints is convex. Therefore, if is convex in , expression (5) is a convex optimization problem. Typically the global optimum to such a problem can be found very efficiently. Problem (5), however, involves variables. Therefore it can be solved efficiently with standard optimization techniques, such as interior-point methods, as long as the number of variables is not too large, say, not more than 1,000. Unfortunately, this means that the procedure cannot be applied to prediction polls with more than about forecasters. This is very limiting, as many prediction polls involve hundreds of forecasters. For instance, our two real-world applications involve and forecasters. Fortunately, by choosing the loss function carefully one can perform dimension reduction and estimate under a much larger . This is illustrated in the following subsections.
3.2 Maximum Likelihood Estimator
Under the Gaussian model the information structure is a parameter of an explicit likelihood. Therefore estimation naturally begins with the maximum likelihood approach (MLE). Unfortunately, the Gaussian likelihood is not convex in . Consequently, only a locally optimal solution is guaranteed with standard optimization techniques. Furthermore, it is not clear whether the dimension of this form can be reduced. won2006maximum discuss the MLE under a condition number constraint. They are able to transform the original problem with variables to an equivalent problem with only variables, namely the eigenvalues of . This transformation, however, requires an orthogonally invariant problem. Given that the constraint is not orthogonally invariant, the same dimension-reduction technique cannot be applied. Instead, the MLE must be computed with the variables, making estimation slow for small and infeasible even for moderately large . For these reasons the MLE is not discussed further in this paper.
3.3 Least Squares Estimator
Past literature has discussed many simple covariance estimators that can be applied efficiently to large amounts of data. Unfortunately, these estimators are not guaranteed to satisfy the conditions in (5). This section introduces a correctional procedure that inputs any covariance estimator and modifies it minimally such that the end result satisfies the conditions in (5). More specifically, is projected onto the feasible region. This approach, sometimes known as the least squares approach (LSE), motivates a convex loss function that guarantees a globally optimal solution and facilitates dimension reduction. Most importantly, however, it provides a general tool for estimating , regardless of whether one is working with a Gaussian model or possibly some future non-Gaussian model.
From the computational perspective, it is more convenient to project instead of . Even though this could be done under many different norms, for the sake of simplicity, this paper only considers the squared Frobenius norm , where is the trace operator. The LSE is then given by , i.e., without the first row and column, where is the solution to
Both and are constants defined to maintain the covariance pattern (3). More specifically, if denotes the th standard basis vector of length , then
If satisfies the other two conditions, namely and , then also satisfies them. This follows from the fact that is a principal sub-matrix of . Therefore implies . Furthermore, Cauchy’s interlace theorem (see, e.g., hwang2004cauchy) states that and such that . Of course, requiring instead of shrinks the region of feasible s. At this point, however, the exact value of is arbitrary and merely serves to control . Section 3.4 introduces a procedure for choosing from the data. Under such an adaptive procedure, problem (6) can be considered equivalent to directly projecting onto the feasible region.
The first step towards solving (6) is to express the feasible region as an intersection of the following two sets:
Given that both of these sets are convex, the projection onto their intersection can be computed with the Directional Alternating Projection Algorithm (gubin1967method). This method makes progress by repeatedly projecting onto the sets and . Consequently, it is efficient only if projecting onto each of the individual sets is fast. Fortunately, as will be shown next, this turns out to be the case.
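The alternating-projection idea can be sketched in a few lines. The sketch below is a simplification of the method used in the paper: it alternates plain (not directional) projections between the PSD cone and a stand-in affine set of unit-diagonal matrices, rather than the paper's exact constraint sets, so it conveys the mechanics rather than Algorithm 1 itself.

```python
import numpy as np

def proj_psd(a):
    """Project a symmetric matrix onto the PSD cone by clipping eigenvalues."""
    w, v = np.linalg.eigh(a)
    return (v * np.maximum(w, 0.0)) @ v.T

def proj_unit_diag(a):
    """Project onto the affine set of symmetric matrices with unit diagonal
    (a stand-in for the paper's linear covariance-pattern constraints)."""
    b = a.copy()
    np.fill_diagonal(b, 1.0)
    return b

def alternating_projections(a, n_iter=2000):
    """Alternate projections until the iterate (approximately) lies in the
    intersection of the two convex sets."""
    x = a.copy()
    for _ in range(n_iter):
        x = proj_unit_diag(proj_psd(x))
    return x
```

Starting from an indefinite unit-diagonal matrix, the iterates converge to a nearby matrix satisfying both constraints, which is exactly the behavior the estimation procedure relies on.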
First, projecting an symmetric matrix onto is a linear map. To make this more specific, let be a column-wise vectorization of . If is a matrix with the th row equal to , the linear constraints in (6) can be expressed as . Then, the projection of onto is given by . This expression simplifies significantly by close inspection. In fact, it is equivalent to setting and for replacing , , and by their average . Denote this projection with the operator .
Second, tanaka2014positive describe a univariate optimization problem that is almost equivalent to projecting onto . The only difference is that their solution set also includes the zero-matrix . Assuming that such a limiting case can be safely handled in the implementation, their approach offers a fast projection onto even for a moderately large . To describe this approach, consider the spectral decomposition and the univariate function
where is a diagonal matrix with diagonal and is the positive part operator. The function can be minimized very efficiently by solving a series of smaller convex problems, each with a closed form solution. The result is a binary-search-like procedure described by Algorithm LABEL:projSD_algo in Appendix A. If and
for , then is the projection of onto . Call this projection .
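A condensed sketch of this eigenvalue-space projection follows. It keeps the key idea attributed to tanaka2014positive, namely clipping the eigenvalues to an interval whose endpoints have ratio at most the condition-number bound, but replaces the closed-form inner solutions with a simple ternary search over the lower endpoint (our simplification, not the paper's algorithm).

```python
import numpy as np

def proj_cond(a, kappa):
    """Approximate projection of symmetric `a` onto PSD matrices with
    condition number at most `kappa`: clip eigenvalues to [mu, kappa * mu],
    choosing mu by ternary search on the convex univariate loss."""
    w, v = np.linalg.eigh(a)
    def loss(mu):
        return float(np.sum((np.clip(w, mu, kappa * mu) - w) ** 2))
    lo, hi = 1e-12, max(float(w.max()), 1e-12)
    for _ in range(200):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if loss(m1) < loss(m2):
            hi = m2
        else:
            lo = m1
    mu = 0.5 * (lo + hi)
    # Clipping with mu > 0 also enforces positive definiteness.
    return (v * np.clip(w, mu, kappa * mu)) @ v.T
```

A matrix that already satisfies the constraint is left unchanged, while an ill-conditioned one is moved just far enough to meet the bound.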
Algorithm 1 uses these projections to solve (6). Each iteration projects twice on one set and once on the other set. The general form of the algorithm does not specify which projection should be called twice. Therefore, given that takes longer to run than , it is beneficial to choose to call twice. The complexity of each iteration is determined largely by the spectral decomposition which is fairly fast for moderately large . Overall time to convergence, of course, depends on the choice of the stopping criterion. Many intuitive criteria are possible. Given that and , the stopping criterion suggests that the return value is in and close to in every direction. Based on our experience, the algorithm converges quite quickly. For instance, our implementation in C++ generally solves (6) for and in less than a second on a 1.7 GHz Intel Core i5 computer. This code will be made available online upon publication. For the remainder of the paper, projecting onto the feasible region is denoted with the operator .
3.4 Conditional Validation
The estimation procedure described in the previous section has one tuning parameter, namely the condition number threshold . This subsection discusses an in-sample approach, called conditional validation, that can be used for choosing any tuning parameter, such as , under the partial information framework. To motivate, recall that the revealed aggregator uses to regress on the rest of the s. Of course, the accuracy of this prediction cannot be known until the actual outcome is observed. However, apart from being unobserved, the variable is theoretically no different from the other s. This suggests the following algorithm: for some value compute , let each of the s in turn play the role of , predict its value based on for , and choose the value of that yields the best overall accuracy. Even though many accuracy measures could be chosen, this paper uses the conditional log-likelihood. Therefore, if collects the th forecaster’s information about the events, the chosen value of is
where the log-likelihood is now conditional on s for and is computed based on all the forecasts . Plugging this into the projection algorithm gives the final estimate .
Unfortunately, the optimization problem (7) is non-convex in . However, as was mentioned before, Algorithm 1 is fast for moderately sized . Therefore can be chosen efficiently (possibly in parallel on multicore machines) over a grid of candidate values. Overall, the idea in conditional validation is similar to cross-validation but, instead of predicting across rows (observations), the prediction is performed across columns (variables). This not only mimics the actual process of revealed aggregation but is also likely to be more appropriate for prediction polling that typically involves a large number of forecasters (large ) predicting relatively few events (small ). Furthermore, it has no tuning parameters and remains more stable when is small; see Appendix B for an illustration of this result under synthetic data.
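The grid search behind conditional validation can be sketched as follows. Two simplifications are ours: `clip_condition` merely floors eigenvalues to bound the condition number (a stand-in for Algorithm 1), and the columns of the data matrix are assumed mean-centered so the Gaussian conditional mean needs no intercept.

```python
import numpy as np

def clip_condition(s, kappa):
    """Stand-in for the paper's projection: floor the eigenvalues of `s`
    so its condition number is at most `kappa` (not Algorithm 1)."""
    w, v = np.linalg.eigh(s)
    w = np.maximum(w, w.max() / kappa)
    return (v * w) @ v.T

def conditional_loglik(x, sigma):
    """Let each forecaster (column of x, assumed centered) in turn play the
    unobserved target: sum Gaussian log-likelihoods of column j given the
    remaining columns under covariance `sigma`."""
    n, m = x.shape
    total = 0.0
    for j in range(m):
        rest = [i for i in range(m) if i != j]
        s_rr = sigma[np.ix_(rest, rest)]
        s_jr = sigma[j, rest]
        coef = np.linalg.solve(s_rr, s_jr)
        mean = x[:, rest] @ coef
        var = sigma[j, j] - s_jr @ coef
        total += -0.5 * np.sum(np.log(2.0 * np.pi * var)
                               + (x[:, j] - mean) ** 2 / var)
    return total

def choose_kappa(x, kappas):
    """Pick the threshold whose constrained covariance estimate best
    predicts each forecaster from the others."""
    s = np.cov(x, rowvar=False)
    return max(kappas, key=lambda k: conditional_loglik(x, clip_condition(s, k)))
```

Note how the scoring loop predicts across columns (forecasters) rather than across rows (events), mirroring the contrast with cross-validation drawn above.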
4 Applications
This section applies the partial information framework to different types of real-world forecasts. For each type there may be different ways to adopt the Gaussian model. The main point, however, is not to find the optimal way to do this but rather to give illustrative examples of using the framework and also to show how the resulting partial information aggregators outperform the commonly used measurement error aggregators.
4.1 Probability Forecasts of Binary Outcomes
4.1.1 Data
During the second year of the Good Judgment Project (GJP) the forecasters made probability estimates for events, each with two possible outcomes. One of these events was illustrated in Figure 2. Each prediction problem had a timeframe, defined as the number of days between the first day of forecasting and the anticipated resolution day. These timeframes varied widely among problems, ranging from 12 days to 519 days with a mean of 185.4 days. During each timeframe the forecasters were allowed to update their predictions as frequently as they liked. The forecasters knew that their estimates would be assessed for accuracy using the quadratic loss (often known as the Brier score; see brier for more details). This is a proper loss function that incentivized the forecasters to report their true beliefs instead of attempting to game the system. In addition to receiving $150 for meeting minimum participation requirements that did not depend on prediction accuracy, the forecasters received status rewards for their performance via leader-boards displaying the losses for the best forecasters. Depending on the details of the reward structure, such a competition for rank may eliminate the truth-revelation property of proper loss functions (see, e.g., lichtendahl2007probability).
This data collection raises several issues. First, given that the current paper does not focus on modeling dynamic data, only forecasts made within some common time interval should be considered. Second, not all forecasters made predictions for all the events. Furthermore, the forecasters generally updated their forecasts infrequently, resulting in a very sparse dataset. Such high sparsity can cause problems in computing the initial unconstrained estimator . Evaluating different techniques to handle missing values, however, is well outside the scope of this paper. Therefore, to somewhat alleviate the effect of missing values, only the hundred most active forecasters are considered. This makes sufficient overlap highly likely but, unfortunately, still not guaranteed.
All these considerations lead to a parallel analysis of three scenarios: High Uncertainty (HU), Medium Uncertainty (MU), and Low Uncertainty (LU). Important differences are summarized in Table 1. Each scenario considers the forecasters’ most recent prediction within a different time interval. For instance, LU only includes each forecaster’s most recent forecast during days before the anticipated resolution day. The resulting dataset has events of which occurred. In the corresponding table of forecasts, around 42 % of the values are missing. The other two scenarios are defined similarly.
|Scenario|Time Interval|# of Events|Missing (%)|
|High Uncertainty (HU)| | | |
|Medium Uncertainty (MU)| | | |
|Low Uncertainty (LU)| | | |
4.1.2 Model Specification and Aggregation
The first step is to pick a link function and derive a Gaussian model for probability forecasts of binary events. Overall, this construction resembles in many ways the latent variable version of a standard probit model.
Model Instance. Identify the th event with . These outcomes link to the information variables via the following function:
where is some threshold value. Therefore the link function is simply the indicator function of the event . This threshold is defined by the prior probability of the th event , where is the CDF of a standard Gaussian distribution. Given that the thresholds are allowed to vary among the events, each event has its own prior. The corresponding probability forecasts are
In a similar manner, the revealed aggregator for event is
All the parameters of this model can be estimated from the data. The first step is to specify a version of the unconstrained estimate . If the ’s do not change much, a reasonable and simple estimate is obtained by transforming the sample covariance matrix of the probit scores . More specifically, if , where , then an unconstrained estimator of is given by . Recall that the GJP data holds many missing values. This is handled by estimating each pairwise covariance in based on all the events for which both forecasters made predictions. Next, compute , where is chosen over a grid of candidate values between and . Finally, the threshold can be estimated by letting , observing that , and computing the precision-weighted average:
If has missing values, the corresponding rows and columns of are dropped. Intuitively, this estimator gives more weight to the forecasters with very little information. These estimates are then plugged into (8) to get the revealed aggregator .
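The probit transformation and the pairwise-complete covariance estimate can be sketched as follows; the censoring bound `eps` and the helper names are ours for illustration, since the exact censoring values are specified elsewhere in the text.

```python
import numpy as np
from statistics import NormalDist

_inv_cdf = NormalDist().inv_cdf

def to_probit(p, eps=1e-3):
    """Censor a probability to [eps, 1 - eps] and map it to the probit
    scale; NaN (a missing forecast) is passed through unchanged."""
    if np.isnan(p):
        return np.nan
    return _inv_cdf(min(max(float(p), eps), 1.0 - eps))

def pairwise_covariance(probits):
    """Covariance of probit scores estimated from pairwise-complete
    observations; `probits` is an (events x forecasters) array with NaN
    marking missing forecasts."""
    m = probits.shape[1]
    cov = np.full((m, m), np.nan)
    for i in range(m):
        for j in range(m):
            both = ~np.isnan(probits[:, i]) & ~np.isnan(probits[:, j])
            if both.sum() >= 2:
                cov[i, j] = np.cov(probits[both, i], probits[both, j])[0, 1]
    return cov
```

Each pairwise entry uses all events for which both forecasters reported, matching the handling of missing values described above; the resulting matrix is then fed to the projection algorithm.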
This aggregator is benchmarked against the state-of-the-art measurement-error aggregators, namely the average probability, median probability, average probit-score, and average log-odds. Unequally weighted averages were not considered because it is unclear how the weights would be determined based on forecasts alone, and even if this could be done somehow (perhaps based on self-assessment or organizational status), using unequal weights often leads to no or very small performance gains (rowse1974comparison; ashton1985aggregating; flores1989subjective). To avoid infinite log-odds and probit scores, extreme forecasts and were censored to and , respectively. The results remain insensitive to the exact choice of censoring as long as this is done in a reasonable manner to keep the extreme probabilities from becoming highly influential in the logit- or probit-space. The accuracy of the aggregates is measured with the average root-mean-squared-error (RMSE). Note that this is simply the square root of the commonly used Brier score. Instead of considering all the forecasts at once, the aggregators are evaluated under different via repeated subsampling of the most active forecasters; that is, choose forecasters uniformly at random, aggregate their forecasts, and compute the RMSE. This is repeated 1,000 times with forecasters. Due to high computational cost, the simulation was stopped after . On the rare occasion when no pairwise overlap is available between one or more pairs of the selected forecasters, the subsampling is repeated until all pairs have at least one problem in common.
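The subsampling evaluation loop is straightforward to express in code. The sketch below is generic (the aggregators and data are placeholders, not the GJP pipeline): it repeatedly draws a random subset of forecaster columns, aggregates them, and averages the RMSE over the repetitions.

```python
import numpy as np

def rmse(pred, outcome):
    """Root-mean-squared error, i.e., the square root of the Brier score
    when `pred` holds probabilities and `outcome` holds 0/1 resolutions."""
    return float(np.sqrt(np.mean((pred - outcome) ** 2)))

def mean_prob(forecasts):
    return forecasts.mean(axis=1)

def median_prob(forecasts):
    return np.median(forecasts, axis=1)

def subsample_eval(forecasts, outcomes, aggregator, j, n_rep=1000, seed=0):
    """Average RMSE of `aggregator` over repeated random subsamples of j
    forecasters; `forecasts` is an (events x forecasters) array."""
    rng = np.random.default_rng(seed)
    m = forecasts.shape[1]
    scores = []
    for _ in range(n_rep):
        cols = rng.choice(m, size=j, replace=False)
        scores.append(rmse(aggregator(forecasts[:, cols]), outcomes))
    return float(np.mean(scores))
```

Any aggregator with the same signature, including the revealed aggregator, can be dropped into the same loop, which keeps the comparison across methods uniform.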
Figure 3 shows the average RMSEs under the three scenarios described in Table 1. Here a reasonable upper bound is given by as this is the RMSE one would receive by constantly predicting . All presented scores, however, are well below it and improve uniformly from left to right, that is, from HU to LU. This reflects the decreasing level of uncertainty. In all the figures the measurement-error aggregators rank in the typical order (from worst to best): average probability, median probability, average probit, and average log-odds. Regardless of the level of uncertainty, the revealed aggregator outperforms the averaging aggregators as long as . The relative advantage, however, increases from HU to LU. More specifically, the improvement from Log-odds to is about %, %, and % in HU, MU, and LU, respectively. Several factors explain this trend. First, as can be seen in Table 1, the amount of data increases from HU to LU. This yields a better estimate of and hence more accurate revealed aggregation. Second, the forecasters are more likely to be well-calibrated under MU and LU than under HU (see, e.g., braun1992case). Third, under HU the events are still inherently very uncertain. Consequently, the forecasters are unlikely to hold much useful information as a group. Under such low information diversity, measurement-error aggregators generally perform relatively well (satopaamodeling). On the contrary, under MU the events have lost a part of their inherent uncertainty, allowing some forecasters to possess useful private information. These individuals are then prioritized by while the averaging-aggregators continue treating all forecasts equally. Consequently, the performance of the measurement error aggregators plateaus after or so. Therefore having more than about forecasters does not make a difference if one is determined to aggregate their predictions using the measurement error techniques; a similar result was reported by satopaa.
In contrast, however, the RMSE of continues to improve linearly in , suggesting that is able to find some residual information in each additional forecaster and use this to increase its performance advantage.
4.1.3 Information Diversity
The GJP assigned the forecasters to make predictions either in isolation or in teams. Furthermore, after the first year of the tournament, the top 2% of forecasters were promoted to the elite group of “super-forecasters.” These super-forecasters then worked in exclusive teams to make highly accurate predictions on the same events as the rest of the forecasters. Overall, these assignments directly suggest a level of information overlap. In particular, recalling the interpretation of from Section 2.2.1, super-forecasters can be expected to have the highest s and forecasters in the same team should have a relatively high .
For the sake of brevity, only the LU scenario is analyzed as this is where presented the highest relative improvement. The associated 100 forecasters involve 36 individuals predicting in isolation, 33 forecasting team-members (across 24 teams), and 31 super-forecasters (across 5 teams). Figure 3(a) displays for the five most active forecasters. This group involves two forecasters working in isolation (Iso. A and B) and three super-forecasters (Sup. A, B, and C), of whom the super-forecasters A and B are in the same team. Overall, agrees with this classification: the only two team members, namely Sup. A and B, have a relatively high information overlap. In addition, the three super-forecasters are more informed than the non-super-forecasters. Such a high level of information unavoidably leads to higher information overlap with the rest of the forecasters.
By and large, this agreement generalizes to the entire group of forecasters. To illustrate, Figure 3(b) displays for all the 100 forecasters. The information structure has been ordered with respect to the diagonal such that the more informed forecasters appear on the right. Furthermore, a colored rug has been appended on the top. This rug shows whether each forecaster worked in isolation, in a non-super-forecaster team, or in a super-forecaster team. Observe that the super-forecasters are mostly situated on the right among the most informed forecasters. The average estimated among the super-forecasters is . On the other hand, the corresponding averages among the individuals working in isolation or in non-super-forecaster teams are and , respectively. Therefore working in a team makes the forecasters’ predictions, on average, slightly more informed.
In general, a plot such as Figure 3(b) is useful for assessing the level of information diversity among the forecasters: the further away it is from a monochromatic plot, the higher the information diversity. That being said, the colorful Figure 3(b) suggests that the GJP forecasters have high information diversity. This makes sense as these forecasters were asked to make predictions about international political events. Given that on such events the forecasters’ background knowledge, education, how closely they follow the news, and so on matter, one should expect a high level of information diversity. Therefore not only does clearly outperform the common measurement error aggregators in terms of prediction accuracy but the Gaussian model also captures true structure in the data.
4.2 Point Forecasts of Continuous Outcomes
4.2.1 Data
moore2008use hired undergraduates from Carnegie Mellon University to guess the weights of people based on a series of pictures. These forecasts were illustrated in Figure 2. The target people were between 7 and 62 years old and had weights ranging from to pounds, with a mean of pounds. All the students were shown the same pictures and hence given the exact same information. Therefore any information diversity arises purely from the participants’ decisions to use different subsets of the same information. Consequently, information diversity is likely to be low compared to Section 4.1 where diversity also stemmed from differences in the information available to the forecasters.
Unlike in Section 4.1, the Gaussian model can be applied almost directly to the data. Only the effect of extreme values was reduced via a % Winsorization (hastings1947low). This handled some obvious outliers. For instance, the original dataset contained a few estimates above pounds and as low as pounds. Winsorization generally improved the performance of all the competing aggregators.
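Winsorization itself is a one-line operation. Since the percentage used above is not restated here, the sketch below takes it as a parameter; the 25% value in the usage note is purely illustrative, not the value used in the paper.

```python
import numpy as np

def winsorize(x, pct):
    """Clamp values below the pct-th percentile and above the
    (100 - pct)-th percentile to those percentiles."""
    lo, hi = np.percentile(x, [pct, 100 - pct])
    return np.clip(x, lo, hi)
```

For example, `winsorize(np.array([1.0, 2.0, 3.0, 1000.0]), 25)` pulls the outlying 1000 down to the 75th percentile while leaving the central values untouched, which is how the obvious outliers mentioned above would be tamed.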
4.2.2 Model Specification and Aggregation
Model Instance. Suppose and are real-valued. If the proper non-informative prior distribution of is , then . Consequently, for all . Therefore for some . If , then the revealed aggregator for the th event is
Under this model the prior distribution of is specified by and . Given that for all , the sample average provides an initial estimate of . The value of can be estimated by assuming a distribution for the s. More specifically, let be i.i.d. on the interval and use the resulting likelihood to estimate . For instance, a non-informative choice is to assume , which leads to the maximum likelihood estimator . This has a downward bias that can be corrected by a multiplicative factor of . Therefore, replacing with the sample variance gives the final estimate . Using these estimates, the s can be transformed into the s whose sample covariance matrix provides the unconstrained estimator for the projection algorithm. The value of is chosen over a grid of values between and . Once has been computed, the prior means are updated with the precision-weighted averages . In the end, all these estimates are plugged into (9) to get the revealed aggregator .
This aggregator is compared against the average, median, and average of the median and average (AMA). The last competitor, namely AMA, is a heuristic aggregator that lobo2010human showed to work particularly well on many different real-world forecasting datasets. In this section the overall accuracy is measured with the RMSE averaged over sub-samplings of the participants. That is, each iteration chooses participants uniformly at random, aggregates their forecasts, and computes the RMSE. The size of the sub-samples is varied between and with increments of . These scores are presented in Figure 6. The average outperforms the median across all . The performance of AMA falls between that of average and median, reflecting its nature as a compromise of the two. The revealed aggregator is the most accurate once . The relatively poor performance at suggests that observations is not enough to estimate accurately. As approaches , however, collects information efficiently and increases the performance advantage against the other aggregators.
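The AMA benchmark is simple enough to state in full; as described above, it is a 50/50 mix of the sample mean and sample median of the point forecasts.

```python
import numpy as np

def ama(forecasts):
    """Average of the Median and Average (AMA): the midpoint of the
    sample mean and the sample median of the point forecasts."""
    forecasts = np.asarray(forecasts, dtype=float)
    return 0.5 * (forecasts.mean() + np.median(forecasts))
```

Because the median is robust to outliers while the mean is not, AMA behaves as a compromise between the two, consistent with where its RMSE falls in Figure 6.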
Figure 6 shows for all the 416 forecasters. As before, the matrix has been ordered such that the most knowledgeable forecasters are on the right. Overall, this plot is much more monochromatic than the one presented earlier in Figure 3(b), suggesting that information diversity among the 416 students is rather low. This aligns with the expectations laid out earlier in Section 4.2.1. If there were no information diversity, i.e., all the forecasters used the same information, then averaging aggregators, such as the simple average, would perform very well (satopaamodeling). Such a limiting case, however, is rarely encountered in practice. Often at least some information diversity is present. The results in the current section show that the revealed aggregator does not require extremely high information diversity in order to outperform the measurement-error aggregators.
5 Discussion
This paper introduced the partial information framework for modeling forecasts from different types of prediction polls. Even though the framework can be used for theoretical analysis and studying information among groups of experts, the main focus was on model-based aggregation of forecasts. Such aggregators do not require a training set. Instead, they operate under a model of forecast heterogeneity and hence can be applied to forecasts alone. Under the partial information framework, all forecast heterogeneity stems from differences in the way the forecasters use information. Intuitively, this is more plausible at the micro-level than the historical measurement-error assumption. To facilitate practical applications, the partial information framework motivates and describes the forecasters’ information with a patterned covariance matrix (Equation 1). A correctional procedure was proposed (Algorithm 1) as a general tool for estimating these information structures. This procedure inputs any covariance estimator and modifies it minimally such that the final output represents a physically feasible allocation of information. Even though the general partial information framework describes an optimal aggregator, it is generally too abstract to be directly applied in practice. As a solution, this paper discusses a close yet practical specification within the framework, known as the Gaussian model (Section 2.2.2). The Gaussian model permits a closed-form solution for the optimal aggregator and extends to different types of forecast-outcome pairs via a link function. These partial information aggregators were evaluated against the common measurement error aggregators on two different real-world prediction polls (Section 4). In each case the Gaussian model outperformed the typical measurement-error-based aggregators, suggesting that information diversity is more important for modeling forecast heterogeneity.
Generally speaking, partial information aggregation works well because it downweights pairs or sets of forecasters that share more information and upweights ones that have unique information (or choose to attend to unique information as is the case, e.g., in Section 4.2, where forecasters made judgments based on the same pictures). This is very different from measurement-error aggregators that assume all forecasters to have the same information and hence consider them equally important. While simple measurement-error techniques, such as the average or median, can work well when the forecasters truly operate on the same information set, in real-world prediction polls participants are more likely to have unequal skill and information sets. Therefore prioritizing is almost certainly called for. Of course, the more diverse these sets are, the better the partial information aggregators can be expected to perform relative to the measurement error aggregators. To illustrate this result, compare the relative performances in Section 4.1 (high information diversity) against those in Section 4.2 (low information diversity).
Overall, the partial information framework can be applied and extended in many different ways. For instance, in this paper the th forecaster’s prediction was assumed to be the expectation of after observing some partial information . In some applications, however, other constructs, such as the conditional median or other quantiles, may be more appropriate. Such extensions can be handled by considering the distribution of and then equating the th forecaster’s prediction to any desired functional of this distribution. This is particularly easy under the Gaussian model, where conveniently follows a Gaussian distribution.
In terms of future research, the partial information framework offers both theoretical and empirical directions. One theoretical avenue involves estimation of information overlap. In some cases the higher order overlaps have been found to be irrelevant to aggregation. For instance, degroot1991optimal show that the pairwise conditional (on the truth) distributions of the forecasts are sufficient for computing the optimal weights of a weighted average. Theoretical results on the significance or insignificance of higher order overlaps under the partial information framework would be desirable. Given that the Gaussian model can only accommodate pairwise information overlap, such a result would reveal the need for a specification that is more complex than the Gaussian model.
A promising empirical direction is the Bayesian approach. These techniques are very natural for fitting hierarchical models such as the ones discussed in this paper. Furthermore, in many applications with small or moderately sized datasets, Bayesian methods have been found to be more stable than the likelihood-based alternatives. Therefore, given that the number of forecasts in a prediction poll is typically quite small, a Bayesian approach is likely to improve the quality of the final aggregate. This would involve developing a prior distribution for the information structure – a problem that seems interesting in itself. Overall, this avenue should certainly be pursued, and the results tested against other high performing aggregators.