Expert judgement is routinely required to inform risk analyses and critically important decisions. Whilst expert judgement can be remarkably useful when data are absent or uninformative, it is easily influenced by a range of contextual biases that can lead to poor judgements and, subsequently, poor decisions. Ill-informed and inappropriate elicitation methods can exacerbate these biases. Structured elicitation protocols draw on research in psychology, judgement and decision-making, and mathematics to guard against biases and to provide more accurate and better calibrated (aggregated) judgements.
Some of the current structured protocols aim to subject expert judgements to the same level of care and scrutiny as would be expected for empirical data, ensuring that if judgements are to be used as data, they are subject to basic scientific principles of review, critical appraisal, and repeatability. Objectively evaluating the quality of expert data and, more importantly, validating expert judgements are other essential elements.
Considerable research suggests that the performance of experts should be evaluated by scoring their assessments on questions related to the elicitation questions but whose answers are known or can be obtained. Experts who provide accurate, well-calibrated and informative judgements should receive more weight in a final aggregation of judgements; this is referred to as performance weighting in the mathematical aggregation of multiple expert judgements. The weights, however, depend on the chosen measures of performance.
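Performance weighting as described above can be sketched as a weighted linear opinion pool: each expert receives a weight derived from a performance score on questions with known answers, and the aggregated probability is the weighted average of the experts' probabilities. The scores and probabilities below are invented for illustration, and the normalisation of raw scores into weights is an assumption; any proper scoring rule could supply the scores.

```python
import numpy as np

def performance_weights(seed_scores):
    """Turn per-expert performance scores (on questions with known
    answers) into normalised aggregation weights: higher score,
    higher weight."""
    scores = np.asarray(seed_scores, dtype=float)
    return scores / scores.sum()

def linear_pool(expert_probs, weights):
    """Performance-weighted linear opinion pool: the aggregated
    probability of each event is the weighted average of the
    experts' stated probabilities for that event."""
    return np.asarray(expert_probs, dtype=float).T @ np.asarray(weights)

# Three experts judge the probability of two events (illustrative numbers).
probs = [[0.80, 0.10],   # expert 1
         [0.60, 0.30],   # expert 2
         [0.90, 0.05]]   # expert 3
w = performance_weights([0.5, 0.1, 0.4])  # raw performance scores
pooled = linear_pool(probs, w)            # -> array([0.82, 0.10])
```

An equal-weight pool is recovered by passing identical scores; the contrast between the two pools is one way the costs and benefits of performance weighting are typically examined.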
We do not yet know the best methods for aggregating judgements, how well such aggregations perform out of sample (i.e. how well they predict future performance), or the costs (in time and money) and benefits of the various approaches.
In this presentation we define and explore a new measure of performance, specifically a measure of experts’ calibration. We use a sizeable data set of predictions, made by a large number of experts, for the outcomes of geopolitical events over the period 2011–2015 to investigate the properties and advantages of this calibration measure compared with other, well-established measures.
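The new calibration measure itself is not defined in this abstract. As one point of reference among the well-established measures for binary geopolitical forecasts, the mean Brier score can be computed as follows; the forecasts and outcomes are invented for illustration.

```python
import numpy as np

def brier_score(forecasts, outcomes):
    """Mean Brier score for binary-event probability forecasts:
    the average squared difference between the stated probability
    and the realised outcome (1 if the event occurred, else 0).
    Lower is better; a constant 0.5 forecast scores 0.25."""
    f = np.asarray(forecasts, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    return float(np.mean((f - o) ** 2))

# One expert's forecasts against realised outcomes (illustrative).
forecasts = [0.9, 0.2, 0.7, 0.1]
outcomes  = [1,   0,   1,   0]
score = brier_score(forecasts, outcomes)  # -> 0.0375
```

Scores of this kind summarise accuracy and calibration jointly, which is precisely why separate, dedicated calibration measures remain of interest.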