Doug Hubbard, owner of Hubbard Decision Research, explains why subject matter experts must be calibrated like any other measurement instrument.
Do you calibrate your subject matter experts (SMEs)? If not, you’re missing a critical risk management method that could lead to better decision making for your organisation, says Doug Hubbard, the author of ‘How to Measure Anything’ and ‘The Failure of Risk Management’.
Speaking at Risk Awareness Week 2020, Hubbard shared research based on the estimation ‘calibration’ of 434 individuals. Collectively, this data added up to over 52,000 test responses to trivia questions, around 20,000 of which were in a true/false format. For these questions, people were asked not just whether they thought the answer was true or false, but how certain they were.
The purpose of the exercise was to judge people’s ability to estimate uncertain things using probabilities. Hubbard wasn’t just measuring how often people answered the true/false questions correctly, but how well their stated certainty matched that performance. For instance, if somebody says that they are 80 per cent certain, you want them to be correct at least 80 per cent of the time.
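This kind of calibration check is easy to sketch in code: group answers by the confidence the estimator stated, then compare each group’s stated confidence with its observed hit rate. The data below is purely illustrative, not Hubbard’s actual dataset.

```python
# Minimal calibration check: does stated confidence match observed accuracy?
from collections import defaultdict

def calibration_table(responses):
    """Group (stated_confidence, was_correct) pairs by confidence level
    and return the observed accuracy for each level."""
    buckets = defaultdict(list)
    for confidence, correct in responses:
        buckets[confidence].append(correct)
    return {
        conf: sum(hits) / len(hits)   # observed fraction correct
        for conf, hits in sorted(buckets.items())
    }

# Example of an overconfident estimator: says 90%, right only 3 of 4 times.
responses = [(0.9, True), (0.9, True), (0.9, False), (0.9, True),
             (0.6, True), (0.6, False)]
print(calibration_table(responses))  # {0.6: 0.5, 0.9: 0.75}
```

A well-calibrated estimator’s table would show each observed accuracy close to its stated confidence; gaps like the 0.90 → 0.75 above are exactly the overconfidence Hubbard describes.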
What the original research showed was that people are often not good at estimating or forecasting. They tend to be over-confident, and high indications of certainty do not always lead to a high level of success.
Hubbard said: “Whenever somebody said they were 90 per cent confident on their first test using our list of true/false trivia questions, they were right only about 75 per cent of the time on average. And all the times they said they were 100 per cent confident, they were right less than 90 per cent of the time.”
Hubbard used the research to show how an individual’s ability to estimate probabilities can be improved. He began by showing the effect of calibration training, which substantially improved the accuracy of certainty predictions, but still left significant room for improvement.
Next, Hubbard showed the impact of various other steps, including removing those people who had less aptitude for trivia questions, specifically those who did no better than random guessing.
He said: “We took out the bottom quartile of people based on their trivia test performance. 58 per cent was about the cut-off, below which they weren’t doing much better than random chance. Once we took out the poor performers, we got a lot of points that are now within the statistically allowable error margin.”
The lesson here is that removing poor performers also improves the calibration of the remaining estimates.
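The screening step itself is a simple filter. A minimal sketch, using the 58 per cent cut-off mentioned in the talk (the names and scores below are made up for illustration):

```python
# Screen out estimators whose trivia accuracy is not meaningfully above
# chance. The 0.58 cut-off is the figure Hubbard quotes in the talk.
def screen(estimators, cutoff=0.58):
    """Keep only estimators whose accuracy meets or beats the cut-off."""
    return {name: acc for name, acc in estimators.items() if acc >= cutoff}

scores = {"alice": 0.81, "bob": 0.55, "carol": 0.66, "dan": 0.49}
print(screen(scores))  # {'alice': 0.81, 'carol': 0.66}
```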
The second element Hubbard looked at was the idea of using adjustments based on people’s previous performance when estimating uncertainty. In this case, he looked at splitting the questions into two sets: the results of the first set are used to create adjustments, which are then applied to the certainty predictions in the second set.
Unsurprisingly, this improved estimations again, and the combination of calibration training, removing those with poor aptitude and making statistical adjustments led to forecasting that fell within the allowable error range.
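The adjustment idea can be sketched as follows: score a first block of questions, measure how far the estimator’s stated confidence deviates from their actual accuracy, and shift later estimates by that amount. The additive-offset form below is an illustrative choice, not Hubbard’s published formula.

```python
# Learn a confidence adjustment from a scored test set, then apply it
# to later estimates. Offset form is a simplifying assumption.
def fit_adjustment(stated, correct):
    """Observed accuracy minus mean stated confidence on the test set.
    A negative result means the estimator is overconfident."""
    observed = sum(correct) / len(correct)
    mean_stated = sum(stated) / len(stated)
    return observed - mean_stated

def adjust(confidence, offset, low=0.5, high=1.0):
    """Apply the learned offset, clamped to the valid range for a
    true/false question (below 0.5 you would flip the answer)."""
    return min(high, max(low, confidence + offset))

# First set: the estimator says 90% on average but is right 75% of the time.
offset = fit_adjustment([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
print(adjust(0.9, offset))  # 0.75 -- later 90% claims are read as 75%
```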
However, while the improvements were substantial, these measures did not account for the issue of subject matter inconsistency.
Hubbard explained: “If you give subject matter experts a long list of things to estimate and… in fact you make the list so long that they won’t know if you repeat some items in the list… if they’re perfectly consistent they should give the exact same answer the second time as they gave the first time.
“Instead what we usually observe is that there’s inconsistency. If you could just statistically smooth out the inconsistency of an expert, then you would end up with better forecasts.”
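The repeated-item test Hubbard describes can be sketched directly: plant duplicate questions in a long list, measure how far apart the two answers are, and smooth the difference away. Averaging the repeats is one simple smoothing choice, used here only for illustration; the data is made up.

```python
# Measure and smooth an expert's inconsistency on repeated questions.
def inconsistency(first_pass, second_pass):
    """Mean absolute difference between an expert's repeated estimates.
    A perfectly consistent expert scores 0."""
    diffs = [abs(a - b) for a, b in zip(first_pass, second_pass)]
    return sum(diffs) / len(diffs)

def smooth(first_pass, second_pass):
    """One simple way to smooth out inconsistency: average the repeats."""
    return [(a + b) / 2 for a, b in zip(first_pass, second_pass)]

# Same three questions answered twice, weeks apart.
print(inconsistency([0.9, 0.6, 0.7], [0.8, 0.6, 0.5]))  # ~0.1
print(smooth([0.9, 0.6, 0.7], [0.8, 0.6, 0.5]))
```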
To eliminate this inconsistency, Hubbard pointed to a simple solution – using teams of subject matter experts, rather than relying on individuals.
“A team,” Hubbard said, “can be selected, trained and mathematically aggregated in a way that outperforms any single individual.
“If you have a team that you selected well to optimise a particular set of forecasts, you’ve trained them, you’re using the optimal elicitation methods, and you’re aggregating their individual responses mathematically in a way that’s meant to improve forecasts, then what you’ve done is create a new SME. We will call that the FrankenSME.”
To show the power of the FrankenSME, he matched up pairs of answers in his study where two people had both given the same answer (either true or false) and also given the same certainty, e.g. 80 per cent.
What he found was that the pairs significantly outperformed the individuals. So, if a pair both said that they had 80 per cent certainty in an answer being true, the answer was actually true 87 per cent of the time. If they both had 90 per cent confidence, then they were right 95 per cent of the time. Counter-intuitively, the pair’s certainty was greater than the sum of its parts.
Of course, not all teams are created equal, and Hubbard examined how teams can be deliberately built to improve their forecasting ability.
One lesson, for instance, was that putting people in a room and asking them to come to a consensus was not a great way to build forecasting teams. Instead it was better to ask people to make predictions individually and then find ways to mathematically aggregate them.
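One common way to aggregate individual probabilities mathematically is to average them in log-odds space; adding an “extremizing” factor pushes the pooled estimate away from 0.5, which is one way to reproduce the effect above, where two agreeing 80 per cent estimates behaved like an 87 per cent estimate. The talk does not specify which formula Hubbard’s team uses, so the sketch below is an assumption, not his method.

```python
# Pool independent probability estimates via mean log-odds, with an
# optional extremizing factor (> 1 pushes the result away from 0.5).
import math

def log_odds(p):
    return math.log(p / (1 - p))

def aggregate(probs, extremity=1.0):
    """Mean of the estimates in log-odds space, scaled by `extremity`,
    mapped back to a probability."""
    pooled = extremity * sum(log_odds(p) for p in probs) / len(probs)
    return 1 / (1 + math.exp(-pooled))

print(round(aggregate([0.8, 0.8]), 3))                 # 0.8 -- plain mean
print(round(aggregate([0.8, 0.8], extremity=1.4), 3))  # pushed above 0.8
```

Note the contrast with consensus meetings: each person forecasts alone, and the combination happens in the arithmetic, not in the room.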
Another lesson came from decision-making theorist Philip Tetlock. Hubbard explained: “One of his requirements for a good team is that it had to be made of ‘belief updaters’. A belief updater is someone who is willing to change their mind when they’re given new information.
“This does not appear to be a universal trait among people. There are a lot of people who when they collaborate with their peers don’t change their minds – in fact, they double down and they become more confident in their original forecast even when they disagree with everybody.”
Hubbard concluded: “It is not just cyber security; it could be major mergers and acquisitions, big IT investments, big R&D investments and so on. Think of all the ways your organisation relies on the estimates of subject matter experts, and you have to ask what the performance of that method is. Figure out how well your teams estimate things, how well calibrated your teams of SMEs are, and how good your FrankenSME is.”